Full Proceedings - International Conference on Dublin Core and Metadata Applications

Metadata Intersections: Bridging the Archipelago of Cultural Memory

2014 Proceedings of the International Conference on Dublin Core and Metadata Applications

Proceedings Edited by:
William Moen, College of Information, University of North Texas, United States
Amy Rushing, University Libraries, University of Texas at San Antonio, United States

Published by: Dublin Core Metadata Initiative, a project of ASIS&T

Conference Host: Austin, Texas, USA, 8-11 October 2014

ISSN: 1939-1366 (Online)

WORKSHOPS
DC-1, Dublin, Ohio, USA — 1-3 March 1995
DC-2, Warwick, U.K. — 1-3 April 1996
DC-3, Dublin, Ohio, USA — 24-25 September 1996
DC-4, Canberra, Australia — 3-5 March 1997
DC-5, Helsinki, Finland — 6-8 October 1997
DC-6, Washington, D.C., USA — 2-4 November 1998
DC-7, Frankfurt, Germany — 25-27 October 1999
DC-8, Ottawa, Canada — 4-6 October 2000

CONFERENCES
DC-2001, Tokyo, Japan — 22-26 October 2001
DC-2002, Florence, Italy — 14-17 October 2002
DC-2003, Seattle, Washington, USA — 28 September - 2 October 2003
DC-2004, Shanghai, China — 10-14 October 2004
DC-2005, Leganés (Madrid), Spain — 12-15 September 2005
DC-2006, Manzanillo, Colima, Mexico — 3-6 October 2006
DC-2007, Singapore — 27-31 August 2007
DC-2008, Berlin, Germany — 22-26 September 2008
DC-2009, Seoul, Korea — 12-16 October 2009
DC-2010, Pittsburgh, Pennsylvania, USA — 20-22 October 2010
DC-2011, The Hague, The Netherlands — 21-23 September 2011
DC-2012, Kuching, Sarawak, Malaysia — 3-7 September 2012
DC-2013, Lisbon, Portugal — 2-6 September 2013
DC-2014, Austin, Texas — 8-11 October 2014

© DCMI 2014. Copyright for individual articles is retained by the authors with first publication
rights granted to DCMI for publication in print and electronic proceedings. By
virtue of their appearance in this open access publication, articles are free to
be used with proper attribution of the author for educational and other noncommercial
purposes. Other uses may require the permission of the authors.

ISSN: 1939-1366 (Online)
Welcome

Welcome to DC-2014 in Austin, Texas! This gathering of researchers, practitioners and
students of metadata for the annual meeting and conference of the Dublin Core Metadata
Initiative (DCMI) marks the twenty-second formal meeting of our community. It also marks the
end of a year of reinvention and reimagining of the way DCMI works, manifested in large part
through a return to the organizational model found in the early years of the initiative, with
responsibility for direction and management resting in the hands and minds of the
membership.
Much of the groundwork for this re-envisioning came from meetings of the Advisory Board
held at the annual meeting in Lisbon last year, and has been shepherded through early
stages by the newly elected officers of the Advisory Board, the newly named Governing
Board (formerly the Oversight Committee), and the newly formed Technical Board, along with
critical assistance from our current Managing Director, Stuart Sutton. We believe the
outcome will be a stronger, member-driven organization that opens new doors for ideas and
initiatives that will build on the strengths and reputation DCMI has already established in the
international community.
The program this year has a number of innovations designed to foster this goal, ranging
from the Best Practice Posters and Demonstrations, which will showcase concrete examples
of current practice in metadata applications, to the Next Generation Metadata Specialist
Program, which provides an opportunity for emerging professionals to network with veterans
throughout the conference. As always, there are many opportunities to catch up with some of
the boundary-pushing technical work being accomplished by task groups and your colleagues
in their everyday work, as well as pre- and post-conference workshops and tutorials. Many
thanks to the program committee and chairs for creating a stimulating and diverse program
this year.
As you participate in this year’s conference we hope that you will think about how you can
contribute to the growth and strengthening of DCMI in the coming year—through contributions
to the technical work and outreach being accomplished by task groups, by volunteering to
take on a role as a chair or co-chair of one of the Standing Committees or the Technical or
Advisory Board, by helping engage new members or re-engage old participants, or just by
joining as an Individual Member to help support the important work DCMI is doing that we all
benefit from in our own activities during the rest of the year.
I personally hope that many of you will also take advantage of the opportunity to add your
voice and support to the background work critical to keeping DCMI alive and thriving by
attending the Annual Meeting on Saturday. Active members are important in keeping the
initiative moving ahead, and this is a chance to join in that work and meet some of the people
who contribute their time and effort to making it possible for all of us to reap the benefits of
the thought and creativity engendered through DCMI’s activities.
We are sure that you will find the conference and meeting exciting and that you will leave
Austin with an even greater commitment to the work DCMI is doing, and a deeper
engagement with your colleagues in the world of metadata throughout the coming year.
Enjoy!!
Michael Crandall, Chair, DCMI Governing Board
Chair’s Notes on the Program
At the 2012 South by Southwest Interactive Conference in Austin, Texas, Jon Voss led a
panel of speakers to introduce and discuss a “global movement afoot” that encourages
greater public access to metadata in the world’s libraries, archives, and museums. The
movement, led by a network of practitioners and professionals across cultural heritage
institutions, aims to increase adoption and implementation of Linked Open Data within the
cultural heritage community. The panelists discussed use cases and applications of linked
open data and presented a variety of possibilities for cultural data access, remix, and reuse.
In keeping with the Dublin Core’s history of reflecting and engaging the evolution of the
metadata field, this year’s conference builds upon that movement Voss and his panelists
spoke about. The theme of this year’s conference – “Metadata Intersections: Bridging the
Archipelago of Cultural Memory”—acknowledges that while metadata is the essential element
to enabling access to the world’s galleries, libraries, archives, and museums (GLAM), there
are significant differences in domain praxis. The conference program explores how these
differences may be bridged in the context of linked open data.
A pre-conference comprising a full-day workshop and two half-day tutorials launches the
conference. A post-conference workshop on linked data brings the week to an end. These
enable deeper engagement with a variety of topics that touch on this year’s theme, including
emerging practices in archival description; hands-on linked open data training; RDF in the
cultural heritage sector; and a historical overview of the accomplishments of the DCMI
community.
This year’s conference program includes two days of full-length conference papers and
project reports. Special sessions and poster viewings run concurrently with the papers and
reports throughout the two days. Participants from a variety of cultural heritage institutions
and practitioners utilizing linked open data and semantic web technologies will present both
theoretical and project-based papers. In this year’s submissions, we are seeing true
momentum in the exploration and adoption of linked open data across all cultural heritage
sectors.
DC-2014 unveils two new efforts: one that attempts to recruit young professionals and students
to attend, and the other to provide more opportunities for presenting the best practices of
metadata workers. The Next Generation Metadata Specialist Program solicited iSchools,
other library and information science programs, and libraries to sponsor one or more of their
students and early-career metadata professionals to attend the conference. Thirteen
organizations are participating. The participants selected for the Next Generation Metadata
Specialist Program will engage one-on-one and in group interactions with leading
researchers, consultants, and practitioners shaping the metadata ecosystem; in a special
session designed for them, they will gain an understanding of how the discourse and practice
of metadata are evolving.
Also new this year are the non-peer-reviewed Best Practice Posters and Demonstrations tracks.
Intended to encourage practitioners to showcase innovative approaches to metadata best
practices, these tracks garnered a great response: we have a total of 17 posters and two
demonstrations. A conference as special as Dublin Core owes so much to so many. We are
grateful to all of the people who submitted proposals to share their ideas, experiences, and
research. Similarly we are grateful to the many people who volunteered their time as
reviewers of all of those proposals. As program co-chairs we are especially grateful for the
opportunity to serve and contribute to this year’s conference.
William E. Moen, College of Information, University of North Texas, United States
Amy Rushing, University of Texas at San Antonio Libraries, United States
ORGANIZING COMMITTEE
DC-2014 Conference Committee Chair
Stuart A. Sutton, Dublin Core Metadata Initiative (DCMI), United States
Program Committee Chairs
William E. Moen, College of Information, University of North Texas, United States
Amy Rushing, University of Texas at San Antonio Libraries, United States
Outreach Committee Chair
Eric Childress, OCLC Research, United States
Local Organizing Committee Chair
Kristi Park, Texas Digital Library
Program Committee
Leif Andresen, Advisor to the Director, The Royal Library, National Library of Denmark, Denmark
Ana Alice Baptista, Universidade do Minho, Portugal
Uldis Bojars, National Library of Latvia, Latvia
Dan Brickley, Vrije Universiteit Amsterdam
Joseph A Busch, Taxonomy Strategies, United States
Eric Childress, OCLC Research, United States
Marie-Claude Côté, Treasury Board Secretariat of Canada, Canada
Karen Coyle, Consultant, United States
Michael D. Crandall, University of Washington
Makx Dekkers, AMI Consult SARL, Spain
Jacques Ducloy, University of Lorraine, France
Gordon Dunsire, Independent Consultant, United Kingdom
Kai Eckert, University of Mannheim, Germany
Kevin Ford, Library of Congress, United States
Muriel Foulonneau, Public Research Centre Henri Tudor, Luxembourg
Anne Gilliland, Department of Information Studies, UCLA, United States
Carol Jean Godby, OCLC, United States
Jane Greenberg, University of North Carolina, Chapel Hill, United States
Willem Robert van Hage, VU University Amsterdam, Netherlands
Corey A. Harper, New York University
Seth van Hooland, Université Libre de Bruxelles, Belgium
Eero Hyvönen, Aalto University, Finland
Antoine Isaac, Europeana & Vrije Universiteit Amsterdam, Netherlands
Masahide Kanzaki, Keio University Xenon Limited Partners, Japan
Tomi Kauppinen, University of Muenster, Germany
Johannes Keizer, Food and Agriculture Organization of the United Nations (FAO), Italy
Dean Blackmar Krafft, Cornell University Library, United States
Michael Lauruhn, Elsevier, United States
Akira Maeda, Ritsumeikan University, Japan
Filiberto Felipe Martinez-Arellano, National Autonomous University of Mexico, Mexico
Philipp Mayr, GESIS - Leibniz Institute for the Social Sciences, Germany
Eva M. Méndez, University Carlos III of Madrid, Spain
Shawne Miksa, University of North Texas
Steven J. Miller, University of Wisconsin-Milwaukee School of Information Studies, United
States
Akira Miyazawa, National Institute of Informatics, Japan
William E. Moen, College of Information, University of North Texas, United States
Peter E Murray, LYRASIS, United States
Jin-Cheon Na, Nanyang Technological University, Singapore
Liddy Nevile, Independent Consultant, Australia
Annelies van Nispen, Eye Film Institute, Netherlands
Johan Oomen, Netherlands Institute for Sound and Vision, Netherlands
Jung-ran Park, College of Information Science and Technology, Drexel University, United
States
Oknam Park, Sangmyung University, Republic of Korea
Cristina Pattuelli, Pratt Institute, United States
Vivien Petras, Humboldt-Universität zu Berlin, Germany
Magnus Pfeffer, Stuttgart Media University, Germany
Serhiy Polyakov, University of North Texas, United States
Sarah Potvin, Texas A&M University Libraries, United States
Jian Qin, Syracuse University, United States
KS Raghavan, Centre for Knowledge Analytics & Ontological Engineering, PES Institute of
Technology, India
Stefanie Ruehle, SUB Goettingen, Germany
Amy Rushing, University of Texas at San Antonio, United States
Johann Wanja Schaible, GESIS - Leibniz-Institute for the Social Sciences, Germany
Bernhard Schandl, Gnowsis.com, Austria
Jodi Schneider, INRIA Sophia Antipolis, France
Ryan Shaw, University of North Carolina at Chapel Hill, United States
Aida Slavic, UDC Consortium, United Kingdom
Shigeo Sugimoto, University of Tsukuba, Japan
Stuart A. Sutton, Dublin Core Metadata Initiative (DCMI), United States
Lars G. Svensson, Deutsche Nationalbibliothek, Germany
Hannah Tarver, University of North Texas Libraries, United States
Joseph T. Tennis, University of Washington, United States
Douglas Tudhope, University of Glamorgan, United Kingdom
Vassilis Tzouvaras, National Technical University of Athens, Greece
Sherry L. Vellucci, University of New Hampshire, United States
Paul Walk, EDINA, United Kingdom
Mei-Ling Wang, Graduate Institute of Library, Information and Archival Studies, Taiwan
Laura Waugh, University of North Texas, United States
Andrew C Wilson, Queensland State Archives, Australia
Oksana Zavalina, University of North Texas, United States
Marcia Lei Zeng, Kent State University, United States
Local Organizing Committee
Linda Abbey, University of Texas Libraries
Effie Bradley, Texas Digital Library
Debra Hanken Kurtz, Texas Digital Library
Gad Krumholz, Texas Digital Library
Nick Lauland, Texas Digital Library
Jason Sick, University of Texas Libraries
Ryan Steans, Texas Digital Library
Antoinette Yost, Texas Digital Library
TABLE OF CONTENTS
Distributed Metadata Environments & Aggregation—Part A
1-11
Linked Data Mapping Cultures: An Evaluation of Metadata Usage and Distribution
in a Linked Data Environment
Konstantin Baierer, Evelyn Dröge, Vivien Petras & Violeta Trkulja
12-23
The Digital Public Library of America Ingestion Ecosystem: Lessons Learned After
One Year of Large-Scale Collaborative Metadata Aggregation
Mark A. Matienzo & Amy Rudersdorf
24-30
Applying a Linked Data Compliant Model: The Usage of the Europeana Data
Model by the Deutsche Digitale Bibliothek
Stefanie Rühle, Francesca Schulze & Michael Büchner
Distributed Metadata Environments & Aggregation—Part B
31-36
Designing a Multi-level Metadata Standard based on Dublin Core for Museum
Data
Jing Wan, Yubin Zhou, Gang Chen & Junkai Yi
37-42
"Lo-Fi to Hi-Fi": A New Way of Conceptualizing Metadata in Underserved Areas
with the eGranary Digital Library
Deborah Maron, Cliff Missen & Jane Greenberg
43-52
How Descriptive Metadata Changes in the UNT Libraries' Collections: A Case
Study
Hannah Tarver, Oksana Zavalina, Mark Phillips, Daniel Alemneh & Shadi
Shakeri
Metadata in Support of Research
53-63
Metadata Integration for an Archaeology Collection Architecture
Sivakumar Kulasekaran, Jessica Trelogan, Maria Esteva & Michael Johnson
64-73
Dublin Core Metadata for Research Data–Lessons Learned in a Real-World
Scenario with datorium
Andias Wira Alam
74-82
Metadata for Research Data: Current Practices and Trends
Sharon Farnel & Ali Shiri
Infrastructure & Models—Part A
83-94
The ARK Identifier Scheme: Lessons Learnt at the BnF and Questions Yet
Unanswered
Sébastien Peyrard, Jean-Philippe Tramoni & John A. Kunze
95-108
Requirements on RDF Constraint Formulation and Validation
Kai Eckert & Thomas Bosch
109-118
Extracting Description Set Profiles from RDF Datasets using Metadata Instances
and SPARQL Queries
Tsunagu Honma, Kei Tanaka, Mitsuharu Nagamori & Shigeo Sugimoto
119-128
The 1:1 Principle in the Age of Linked Data
Richard J. Urban
Infrastructure & Models—Part B
129-137
Towards Description Set Profiles for RDF using SPARQL as Intermediate
Language
Thomas Bosch & Kai Eckert
138-146
Describing Theses and Dissertations Using Schema.org
Jeff Keith Mixter, Patrick OBrien & Kenning Arlitsch
Metadata Praxis
147-156
Provenance Description of Metadata using PROV with PREMIS for Long-term
Use of Metadata
Chunqiu Li & Shigeo Sugimoto
157-166
Interlinking Cross Language Metadata Using Heterogeneous Graphs and
Wikipedia
Xiaozhong Liu, Miao Chen & Jian Qin
167-172
Automated Enhancement of Controlled Vocabularies: Upgrading Legacy Metadata
in CONTENTdm
Andrew Weidner, Annie Wu & Santi Thompson
173-175
Retaining Metadata in Remixed Cultural Heritage Objects
Jamie Viva Wittenberg
176-178
Embedded Metadata – A Tool for Digital Excavation
Ana Cox
179-180
Dublin Core to Ensure Interoperability between Models Generated by Tools of
Species Distribution Modeling
Cleverton Ferreira Borba & Pedro Luiz P. Correa
181-183
Building Bridges to the Future of a Distributed Network: From DiRT Categories to
TaDiRAH, a Methods Taxonomy for Digital Humanities
Jody Perkins, Quinn Dombrowski, Luise Borek & Christof Schöch
184-186
Metadata Workflows Across Research Domains: Challenges and Opportunities for
Supporting the DFC Cyberinfrastructure
Adrian T. Ogletree
187-190
A Cooperative Project by Libraries and Museums of China: Metadata Standards
for the Digital Preservation of Cultural Heritage
Ying Feng & Long Xiao
191-195
Undressing Fashion Metadata: Ryerson University Fashion Research Collection
Naomi Eichenlaub, Marina Morgan & Ingrid Masak-Mida
Posters (Peer Reviewed)
Best Practice Posters & Demonstrations
196-198
MARC to schema.org: Providing Better Access to UIUC Library Holdings Data
Timothy Cole, Michael Norman, Patricia Lampron, William Weathers, Ayla
Stein, M. Janina Sarol & Myung-Ja Han
199-200
The TR32DB Metadata Schema: A Multi-level Metadata Schema for an
Interdisciplinary Project Database
Constanze Curdt & Dirk Hoffmeister
201-203
Development of the EDDA Study Design Terminology to Enhance Retrieval of
Clinical and Bibliographic Records in Dispersed Repositories
Ashleigh N. Faith, Eugene Tseytlin & Tanja Bekhuis
204-206
Applying Concepts of Linked Data to Local Digital Collections to Enhance Access
and Searchability
Virginia A Dressler
207-209
Normalizing Decentralized Metadata Practices Using Business Process
Improvement Methodology: A Data-informed Approach to Identifying
Institutional Core Metadata
Emily Porter
210-215
The NDL Great East Japan Earthquake Archive: Features of Metadata Schema
Julie Fukuyama & Akiko Hashizume
216-218
Reusing Legacy Metadata for Digital Projects: The Colorado Coal Project
Collection
Michael Dulock
219-221
A Model and Roles of a Common Terminology to Improve Metadata
Interoperability
(Boaz) Sunyoung Jin
222-224
Converting Personal Comic Book Collection Records to Linked Data
Sean Petiya
225-226
Making Vendor-Generated Metadata Work for Archival Collections Using VRA and
Python
Carolyn Hansen & Sean Crowe
227
A Library Catalog REST API Framework
Jason Thomale & William Hicks
228-229
Building the Bridge: Collaboration between Technical Services and Special
Collections
Susan Matveyeva & Lizzy Anne Walker
230-231
Best Practices for Complex Diacritics Handling in CONTENTdm
Jason W. Dean & Deborah E. Kulczak
232-233
Ecco!: A Linked Open Data Service for Collaborative Named Entity Resolution
Matthew Miller & M. Cristina Pattuelli
234-236
Wikipedia-based Extraction of Lightweight Ontologies for Concept Level
Annotation
Michael Lauruhn & Elshaimaa Ali
237-238
How To Build A Local Thesaurus
Robert H. Estep
239
Designing an Archaeology Database: Mapping Field Notes to Archival Metadata
Ann Ellis
240
Utilizing Drupal for the Implementation of a Dublin Core-Based Data Catalog
Lisa Federer
241
PunkCore: Developing an Application Profile for the Culture of Punk
Joelen Pastva & Valerie Harris
Distributed Metadata Environments & Aggregation—Part A
Linked Data Mapping Cultures:
An Evaluation of Metadata Usage and Distribution in a Linked
Data Environment
Konstantin Baierer, Humboldt-Universität zu Berlin, Germany, [email protected]
Evelyn Dröge, Humboldt-Universität zu Berlin, Germany, [email protected]
Vivien Petras, Humboldt-Universität zu Berlin, Germany, [email protected]
Violeta Trkulja, Humboldt-Universität zu Berlin, Germany, [email protected]
Abstract
In this paper, we present an analysis of metadata mappings from different providers to a Linked
Data format and model in the domain of digitized manuscripts. The DM2E model is based on
Linked Open Data principles and was developed for the purpose of integrating metadata records
to Europeana. The paper describes the differences between individual data providers and their
respective metadata mapping cultures. Explanations on how the providers map the metadata from
different institutions, different domains and different metadata formats are provided and
supported by visualizations. The analysis of the mappings serves to evaluate the DM2E model
and provides strategic insight for improving both mapping processes and the model itself.
Keywords: mapping evaluation; ontology evaluation; mapping varieties; DM2E model; Linked
Data; Europeana
1. Introduction
Do mapping preferences of individual institutions influence the resulting data from a mapping
process? In this paper, mapped datasets from eight different data providers (DP) processed by six
different mapping institutions (MI) were analyzed. The primary aim of the analysis was an
evaluation of the model to which the data is mapped. Based on the differences of mappings in the
evaluation, different Linked Data mapping cultures emerged.
The evaluation of a dataset or data model provides insight into over- and underused parts of the
model or misrepresented or misunderstood data mappings. Previous studies have looked at the
distribution and usage of fields or model classes and properties and the mapping data in library
catalogs (e.g. Seiffert, 2001; Smith-Yoshimura, Argus et al., 2010). These studies show that only
a subset of the provided properties in data formats is used in practice. Palavitsinis, Manouselis
& Sanchez-Alonso (2014) observed in their study of metadata quality in cultural collections that
the “perceived usefulness for all elements of an application profile drops when the number of
these elements rises” (p. 9). In Linked Data research, the focus has been on the analysis of certain
vocabularies (e.g. Alexander, Cyganiak et al., 2009) and statistics on individual or aggregations
of RDF datasets including data accessibility and coverage (Auer, Demter et al., 2012). Klimek,
Helmich & Necasky (2014) built a Linked Data Visualization Model (LDVM) which creates an
analytical RDF abstraction and a visual mapping transformation.
This paper first introduces the DM2E model and its application context and then provides
general statistics on the use of different model classes and properties by different providers and
mapping institutions. Different data and model characteristics are discussed to provide an analysis
of different mapping styles (cultures) and their consequences.
2. A Data Model for Cultural Heritage
Europeana1 is the European digital library, which gives access to more than 30 million library,
archive, museum and audio-visual objects from 36 countries. These objects are digitized and
described by content providers in different metadata formats. National or domain aggregators
deliver the object metadata to Europeana in the Europeana data model (EDM) (EDM Primer,
2013). Digitised Manuscripts to Europeana (DM2E)2 is a domain aggregator contributing to the
development of Europeana. Among other goals, DM2E collects, maps and delivers rich metadata
about manuscripts to Europeana.
The metadata mapping and the ingestion of mapped data into Europeana are supported by a
specialization of the EDM for manuscripts that was developed for DM2E. The EDM is very
broad and generic in order to fit the different metadata standards like TEI or METS/MODS in
which cultural heritage objects (also referred to as CHOs) are described by data providers. The
model is RDF-based and can thus easily be extended by others as done in the DM2E project. The
resulting specialization is called the DM2E model.
The DM2E model (Dröge, Iwanowa & Hennicke, 2014a) has been built as a specialization of
the EDM in order to represent rich manuscript metadata in Europeana, which is also published as
Linked Open Data (LOD) (Heath & Bizer, 2011). The development approach of the model was
bottom-up: requirements from data providers as well as from technical partners were collected
and new properties or classes were created or reused from external vocabularies. Properties and
classes were added as subproperties / -classes to EDM resources when possible in order to enable
backwards compatibility. In that way, the main structure of the EDM remains unchanged in the
DM2E model. The core classes of both models are edm:ProvidedCHO for the cultural heritage
object, ore:Aggregation for the provided metadata record and edm:WebResource for Web
resources related to a CHO, e.g. an image of it. The class that is most extensively specialized in
the DM2E model is edm:ProvidedCHO. More than 50 properties were added to this class to
better describe the creator of a CHO, its contributors and concepts, places and time spans related
to it. Similar to the EDM, the DM2E model mainly focuses on properties and not on classes to
describe the provided data. Nevertheless, a small number of classes were also added, e.g. to
differentiate various types of CHOs like dm2e:Page, bibo:Book or fabio:Article. These classes
are important to model hierarchical objects which are not yet fully supported in EDM.
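To make the core classes concrete, the following is a minimal sketch of a DM2E-shaped record built with rdflib; it is not taken from the DM2E specification, and the dm2e: namespace IRI and the example resource IRIs are assumptions for illustration only.

```python
# Minimal sketch of a DM2E-shaped record: one page-level CHO, its aggregation
# and one web resource. The dm2e: namespace IRI and the example resource IRIs
# are assumptions for illustration, not taken from the DM2E specification.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EDM = Namespace("http://www.europeana.eu/schemas/edm/")
ORE = Namespace("http://www.openarchives.org/ore/terms/")
DC = Namespace("http://purl.org/dc/elements/1.1/")
DM2E = Namespace("http://onto.dm2e.eu/schemas/dm2e/")  # assumed IRI

g = Graph()
cho = URIRef("http://example.org/item/page1")          # hypothetical IRIs
agg = URIRef("http://example.org/aggregation/page1")
web = URIRef("http://example.org/view/page1.jpg")

# The CHO carries both the generic EDM class and a DM2E type specialization.
g.add((cho, RDF.type, EDM.ProvidedCHO))
g.add((cho, RDF.type, DM2E.Page))
g.add((cho, DC.title, Literal("Example page")))
g.add((cho, DC.language, Literal("de")))

# Exactly one aggregation per CHO links the record to the object and its views.
g.add((agg, RDF.type, ORE.Aggregation))
g.add((agg, EDM.aggregatedCHO, cho))
g.add((agg, EDM.isShownBy, web))
g.add((web, RDF.type, EDM.WebResource))

print(g.serialize(format="turtle"))
```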
3. Distribution of Classes and Properties
Ten datasets mapped to the RDF-based DM2E model describing manuscripts, books, letters
and journal articles were analyzed. The total number of RDF statements in the analyzed sample is
61,365,146. The data was delivered by eight data providers (DP) and mapped by six different
mapping institutions (MI). The DPs, MIs and datasets were anonymized as the focus of the study
does not lay in specifics of a single dataset but in the differences between the mapping behaviour
of the six MIs. Our assumption is that not only the provided data but also the particular mapping
approach influences the resulting data in the DM2E model. Table 1 shows the providers, datasets,
the metadata format of the data before the ingestion and the responsible mapping institution. All
data was mapped to the DM2E model version 1.1, latest revision (Dröge, Iwanowa et al., 2014b).
1 Europeana website: http://europeana.eu/ (last accessed 22.04.2014).
2 DM2E website: http://dm2e.eu/ (last accessed 22.04.2014).
The first aim of the analysis was to evaluate the DM2E model by identifying properties and
classes that were not mapped. Unmapped resources could potentially be removed from the model
to reduce its complexity. The analysis of the mappings could also be used to evaluate whether the
model can cover different domains. Can a generic model like the EDM and its specializations be
used to represent this data or do the Linked Data mapping cultures vary too much? Does a
mapping reflect the institution that has mapped the data?
TABLE 1: Analyzed datasets.
Data Provider (DP) | Dataset    | Metadata format    | Mapping institution (MI)
DP I               | Dataset 1  | proprietary format | MI A
DP I               | Dataset 2  | proprietary format | MI A
DP II              | Dataset 3  | MAB2               | MI B
DP II              | Dataset 4  | MAB2               | MI B
DP III             | Dataset 5  | METS/MODS          | MI C
DP IV              | Dataset 6  | METS/MODS          | MI C
DP V               | Dataset 7  | TEI P5             | MI D
DP VI              | Dataset 8  | EAD                | MI D
DP VII             | Dataset 9  | TEI P5             | MI E
DP VIII            | Dataset 10 | TEI P5             | MI F
The evaluation reported in this paper is based on an automated analysis and visualizations. The
RDF data in the triple store is organized in Named Graphs (Carroll et al., 2005), each Named
Graph representing a specific ingestion of a specific dataset including full provenance. Using
SPARQL, the latest ingestion of each dataset was determined. Then, a set of SPARQL queries
was run on the data in these ingestions3 to gather the raw counts for various quantifiable aspects
of these datasets, including generic statistics such as number of statements, number of specific
predicates, number of different ontologies, ranges of predicates, RDF types, as well as DM2E-specific statistics such as frequency of certain subclasses of edm:PhysicalThing or occurrences of
predefined statement patterns. A Python script4 then collated the raw tabular data, calculated
means, sums and ratios within and across datasets and produced HTML with embedded SVG
using the Google Chart data visualization API5. Unprocessed visualizations6 and the source code7
are available.
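A simplified stand-in for this counting step is sketched below (the project's actual SPARQL queries and the build_tables.py script are in the linked repository); the dataset file name is hypothetical.

```python
# Illustrative stand-in for the per-dataset counting step: load one dataset
# export (hypothetical file name) and count how often each predicate occurs.
from rdflib import Graph

g = Graph()
g.parse("dataset1.ttl", format="turtle")

PREDICATE_FREQUENCY = """
SELECT ?p (COUNT(*) AS ?n)
WHERE { ?s ?p ?o }
GROUP BY ?p
ORDER BY DESC(?n)
"""

for row in g.query(PREDICATE_FREQUENCY):
    print(row.p, row.n)
```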
The providers or mapping institutions used a large variety of classes and properties of the
DM2E model and produced rich mappings. Still, more than half of all classes (24 out of 43) and
about a third of all properties (47 out of 125) that the model offers were not used by any of the
providers. The counts do not include classes and properties that are used for means beyond
manuscript metadata, e.g. for external annotation tools or for tracking provenance within the
DM2E interoperability infrastructure.
Figure 1 shows the distribution of all properties. The most frequently used properties are
dc:contributor, edm:rights, dc:format and dc:description. Properties which must be used exactly
once occur for each of the ca. 2.1 million CHOs: dm2e:hasAnnotatableObject (strongly
recommended), dc:language (mandatory), edm:dataProvider (mandatory), dc:type (mandatory),
edm:aggregatedCHO (connection between the CHO and the aggregation; this is mandatory and
must occur once per object), edm:type (mandatory), dm2e:displayLevel (mandatory). The
property dc:title is not mandatory and is used “only” 1,722,542 times in 2,134,934 CHOs. The
strongly recommended properties were used almost as often as the mandatory ones. A major part
of the properties is used infrequently compared to the number of CHOs, a logical consequence
because specific properties only fit particular datasets. About one third of the properties was not
mapped; both DM2E-specific and EDM properties were among them. Properties
from contextual classes, e.g. coordinates of places (wgs84_pos:lat, wgs84_pos:long), the date an
institution started (rdaGr2:dateOfEstablishment) or ended (rdaGr2:dateOfTermination), are
possibly simply missing in the data. SKOS properties like skos:broader, skos:narrower or
skos:notation were not mapped. Uncommon properties like dm2e:levelOfGenesis,
dm2e:influencedBy or dm2e:misattributed were not mapped even though they were explicitly
requested by data providers. The distribution of properties mirrors previous findings from Seiffert
(2001), who analyzed MAB fields of title data in libraries and showed that 58.46% of MAB fields
for bibliographic data were unused. Similar results were found in an internal statistical
analysis of EDM data at Europeana conducted in January 2014, which concluded that 40% of the
fields remained unused.

3 https://github.com/DM2E/dm2e-analysis/tree/master/sparql (last accessed 15.05.2014).
4 https://github.com/DM2E/dm2e-analysis/blob/master/build_tables.py (last accessed 15.05.2014).
5 https://developers.google.com/chart/ (last accessed 15.05.2014).
6 http://data.dm2e.eu/visualize/index.html (last accessed 24.07.2014).
7 https://github.com/DM2E/dm2e-analysis (last accessed 15.05.2014).
FIG. 1: Absolute frequency of all predicates. Properties on the right side of the vertical bar were never used in any
dataset.
FIG. 2: Distribution of classes across datasets in DM2E.
The most frequently used classes (as shown in fig. 2) are edm:WebResource (every CHO must
point to at least one Web resource), followed by ore:Aggregation and edm:ProvidedCHO. They
occur equally often, as there is always one aggregation per CHO, and are mandatory. Although
contextual classes are not mandatory and less frequently mapped, they are very useful as they
allow contextual data to become Linked Data representations with dereferenceable IRIs8 as
opposed to mere strings. The class skos:Concept (the fifth most mapped class) is used very
unevenly: DP V-Dataset 7 uses it 138,440 times, while DP I-Dataset 1, DP III-Dataset 5 and DP II-Dataset 4 do not use it at all. Subclasses of foaf:Organization (e.g. vivo:Library, dm2e:Archive) and
edm:Event were never used. Altogether, 24 of 43 classes are unused.
The class dm2e:Page is used most often as the aggregation level of an object (see table 2).
While DM2E prepared for different types and aggregation levels, the data appears to be
aggregated almost exclusively on the page level. However, in the mappings, several levels are
used. Most datasets make use of two different levels of hierarchy within a CHO. This can not
only be explained with the provided metadata. For example, chapters are never mapped but exist
in the provided books. Which and how many levels of a hierarchical object are mapped seems to
be mostly based on the mandatory elements in the model and on the decisions of the MI.
TABLE 2: Different CHO types (subclasses of edm:PhysicalThing or skos:Concept): non-empty per-dataset counts and totals across the ten datasets.
bibo:Series:     4,552 (total 4,552)
bibo:Book:       1,251; 39,873; 2,916; 1,295 (total 45,335)
dm2e:Manuscript: 24; 10; 175; 1,012; 20 (total 1,241)
dm2e:Paragraph:  9,635 (total 9,635)
bibo:Journal:    1 (total 1)
bibo:Issue:      346 (total 346)
fabio:Article:   42,173 (total 42,173)
bibo:Letter:     3,630 (total 3,630)
dm2e:Page:       10,427; 530,314; 46,006; 307,202; 472,994; 416,172; 34,596; 159,277 (total 1,976,988)
Only a few mappers use edm:Agent (DP IV-Dataset 6: 2,919; DP II-Dataset 3: 11,796; DP VIII-Dataset 10: 35). In the same datasets where edm:Agent is used, foaf:Organization and
foaf:Person are mapped as well. foaf:Organization and foaf:Person are mapped by everyone. In
some datasets, they are rarely mapped (DP I-Dataset 1: 2 organizations, 3 persons and 0 agents;
DP II-Dataset 4: 0 agents, 33 organizations, 275 persons), in other datasets they are very often
mapped (DP II-Dataset 3: 11,796 agents, 21,592 persons, 175 organizations). Here, it seems that
these mappings of agents do not depend on the mapper but on the provided data.
4. Linked Data References vs. Literal Statements
Broadly speaking, the object of an RDF statement can be either a literal (a possibly typed string) or a
reference to a resource (an IRI, a blank node or an RDF container type). Since the DM2E model
strongly recommends using literals and IRIs exclusively, the relationship between statements
referring to literals or resources and the total number of statements in a dataset reveals differences
in the datasets as can be seen in figure 3. When the datasets are grouped by the percentage of
literal statements, clusters of similar percentages appear according to the respective MI, independent of the metadata content.
For example, the percentage of literal statements in DP V-Dataset 7 (28.273%) and DP VI-Dataset 8 (28.270%) is almost equal, yet the content is vastly different (a collection of digitized
prints of various genres and ages vs. the personal correspondence of a 19th century scholar), the
metadata originally created by different data providers (research project vs. library) and in
different formats (TEI vs. EAD). The only commonality between the datasets is that the same
organization (MI D) created the mappings to DM2E. Therefore, we put forth that the correlation
between the ratio of literal statements and the mapping institution is much stronger than that between
the ratio of literal statements and the similarity of the original data.

8 Internationalized resource identifier. An extension of URI allowing unencoded Unicode characters in most places of a URI (RFC 3987).
FIG. 3: Ratio of literal statements to resource statements per dataset.
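A minimal sketch of how such a literal-to-resource ratio can be computed for one dataset, assuming a local Turtle export with a hypothetical file name, follows.

```python
# Sketch: share of literal statements in one dataset, as plotted in figure 3.
from rdflib import Graph, Literal

g = Graph()
g.parse("dataset7.ttl", format="turtle")  # hypothetical export of one dataset

literal_statements = sum(1 for _, _, o in g if isinstance(o, Literal))
print("literal statements: %.3f%%" % (100.0 * literal_statements / len(g)))
```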
While the relationship between resource and literal statements gives some insight into how MIs
structure the data, it does not answer questions pertaining to the quality and usefulness of literal
statements. To tackle this problem, the literal statements containing properties with literals
allowed as their range were clustered into three groups (see fig. 4). The “preferred” literal
statements (properties that are either mandatory, recommended or increase the descriptive
content)9 are a sign of data quality since they enhance the descriptiveness of the data, improve the
search and browse experience and granularize textual information. The “neutral” literal
statements are those neither preferred nor unwanted, i.e. properties where it is not important for
contextual information if they refer to resources or literals. Lastly, the “deprecated” literal
statements are statements with those properties that allow both literals and resources in their
range, yet the data providers chose to use literals.10 Even though the label implies it, it is not
necessarily a wrong choice to use literals when they are allowed as an alternative to an IRI.
However, inconsistent usage is detrimental to the homogeneity of the data, requires data
consumers to use more complex queries to capture both types of statements, and is often a sign
of poor structure within the data.

9 Preferred properties in literal statements: skos:prefLabel, rdfs:label, skos:altLabel, dc:description, dm2e:displayLevel, edm:type, dc:title, dm2e:subtitle, dc:language, dc:format, dc:identifier.
10 "Deprecated" properties in literal statements: dc:rights, dcterms:created, dcterms:modified, dcterms:issued, dcterms:temporal, rdaGr2:dateOfBirth, rdaGr2:dateOfDeath, rdaGr2:dateOfEstablishment, rdaGr2:dateOfTermination. For time-related properties the model recommends the use of edm:TimeSpan resources but also allows xsd:dateTime/xsd:gYear or rdf:Literal.

As can be seen in figure 4, there is some evidence that the ratio of preferred to deprecated literal statements correlates with the mapping institution. For
example, the data produced with mappings by MI A (Dataset 1 and 2) and MI C (Dataset 5 and 6)
is very coherent in this regard. However, for the datasets produced by MI B (Dataset 3 and 4) we
see a slight variance, for the datasets produced by MI D (Dataset 7 and 8) even a significant
variance in the ratios. Taking the more specific grouping into account, the preferred-deprecated
ratio is much more influenced by the original metadata than the overall literal-resource ratio.
Considering the data produced by MI D, it is remarkable that the one dataset (Dataset 7) contains
the largest proportion of deprecated literal statements within the set of datasets, whereas the other
dataset (Dataset 8) contains no deprecated statements at all.
FIG. 4: Distribution of “preferred”, “neutral” and “deprecated” literal statements within the datasets.
5. Variance of Statements and Redundancy of Data in Triples
To measure the redundancy of data in triples, we introduce the Predicate-Object-Equality-Ratio (POER-n), which is defined as the percentage of triples that share the same
predicate and object with at least n other statements. In other words, POER-n measures how
many statements state the same facts about different subjects. The smallest possible POER-n of
the datasets in DM2E, POER-1, ranges from 0.08% (Dataset 5) to 2.48% (Dataset 3). While
impressive as a signifier of structural redundancy, using POER-n to assess data-intrinsic
redundancy proves to be much more difficult. First of all, there is a lot of duplication required by
the triple structure of RDF, i.e. rdf:type statements have a limited range of possible values
defined by the DM2E model. Certain literal properties have even smaller ranges. Other areas of
redundancy can be explained by the original metadata, such as manuscripts being published in the
same year or by the same author. Some redundancies, however, can point to problems. For
example, redundancies in dc:subject statements will, when passing a certain frequency threshold,
not be discriminatory for any kind of search (e.g. assigning the keyword “philosophy” to any
CHO). Redundant dc:title statements can show mapping errors or missing content. For example,
if many dc:title statements contain the text “Untitled Page” or just a page number, the content
may have been mapped incorrectly.
Hence, the usefulness of POER-n is very dependent on the value of n. Whereas the bulk of the
statements contained in POER-1 or even POER-100 can be discarded as arbitrary similarities, a
high POER-1000 or POER-10000 cannot be easily explained with random chance. If the same
fact is stated about 10,000 different subjects within a dataset, this is a strong indicator that either
the original metadata is very homogeneous (e.g. by the same author or released in the same year) or
that the data is not properly internally aligned (e.g. hundreds of different auto-generated
skos:Concepts with the same skos:prefLabel). Instead of setting n to an arbitrary number, a lot
can be gained by using the number of instances of certain classes as the threshold, for example, in
the case of DM2E, the number of ore:Aggregation/edm:ProvidedCHO instances. The exact
mechanics of how to fine-tune POER-n, finding proper threshold values and visualizing both the
POER-n and the statements it represents are still subject to further research.
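Following the definition above, POER-n can be approximated by counting (predicate, object) pairs; the sketch below assumes a local dataset export and is not the project's own implementation.

```python
# Sketch of POER-n: the percentage of triples whose (predicate, object) pair
# is shared with at least n other statements, i.e. occurs at least n+1 times.
from collections import Counter
from rdflib import Graph

def poer(graph, n):
    pair_counts = Counter((p, o) for _, p, o in graph)
    shared = sum(c for c in pair_counts.values() if c >= n + 1)
    return 100.0 * shared / len(graph)

g = Graph()
g.parse("dataset3.ttl", format="turtle")  # hypothetical file name
print("POER-1: %.2f%%" % poer(g, 1))
```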
Figure 5 presents the average number of statements per instance of a class within a dataset
(ANOS). We see that the data mapped by MI C is very homogeneous with regard to the ANOS,
for both ore:Aggregation/edm:ProvidedCHO and contextual classes. Obviously, the workflow for
the RDFization of the original data used by MI C is organized in such a way (e.g. by reusing the
same XSLT scripts) that the resulting RDF follows a relatively rigid structure.
For the edm:ProvidedCHO instances, we see a significantly higher ANOS for data mapped by
MI D. Since the data is generated from very different input formats, the deciding factor here is
apparently MI D's thorough mapping process, producing more statements by normalizing
unstructured fields, adding alternative titles, different languages etc.
FIG. 5: Average number of statements per class per dataset.
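The average number of statements per instance (ANOS) shown in figure 5 can be approximated as follows; again a minimal sketch under the assumption of a local export, not the project's own code.

```python
# Sketch of ANOS: statements whose subject is an instance of a class,
# divided by the number of instances of that class.
from rdflib import Graph, Namespace, RDF

EDM = Namespace("http://www.europeana.eu/schemas/edm/")

def anos(graph, cls):
    instances = set(graph.subjects(RDF.type, cls))
    if not instances:
        return 0.0
    statements = sum(1 for s, _, _ in graph if s in instances)
    return statements / float(len(instances))

g = Graph()
g.parse("dataset5.ttl", format="turtle")  # hypothetical file name
print("ANOS for edm:ProvidedCHO: %.2f" % anos(g, EDM.ProvidedCHO))
```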
The three outliers with significantly more-than-average ANOS for ore:Aggregations are all
generated from TEI data. Apparently, TEI's exhaustive mechanics for adding metadata to the
header of a TEI document heavily and positively influences the richness of the metadata on
aggregation level. While still slightly above average, the ANOS for edm:ProvidedCHO from TEI
data is much lower than for ore:Aggregation, leading to the conclusion that TEI is a top-heavy
format, inciting TEI producers to create exhaustive meta-metadata describing the provenance of
the TEI document rather than the manuscript itself.
Looking at the distribution of ANOS for edm:WebResource instances, clusters of very similar
ANOS defined by the respective MI emerge. The explanation for this is that most information
assigned to edm:WebResource instances is boilerplate (format and rights information mostly)
with only the IRI of the edm:WebResource instance itself changing.
In general, the distribution of ANOS across datasets is more homogeneous for contextual
classes (foaf:Person, foaf:Organization, edm:Place, edm:TimeSpan, skos:Concept) than for
manuscript-related classes (ore:Aggregation, edm:ProvidedCHO). The main reason for this is
that ANOS for the former is significantly smaller than for the latter, i.e. relatively few statements
are asserted about instances of contextual classes (the highest ANOS for contextual classes is
3.96 for skos:Concept in Dataset 10). On the other hand, this is also a sign that there is still
potential for improvement, given that, e.g., digitization projects focusing on the
written legacies of individuals tend to have extensive dossiers about the context (like places,
persons and concepts). Apparently, the full richness of this data is not yet fully ported over to the
RDFized data.
6. Being Linked Open Data - Usage of different Ontologies
The Linked Data principles recommend using existing namespaces and ontologies. The DM2E
model included a number of other ontologies and encouraged data providers to map their data
using properties from them. Figure 6 shows the ontologies and their number of properties
referenced by the DM2E model as well as the number of properties used by data providers.
Every ontology is used; however, not all properties are: slightly more than 50% of the
properties offered by DM2E are used, and around 66% of those from EDM. Most of the properties of the DC and
BIBO ontologies are used (75%). Vocabularies like DC and DCTerms have fewer resources in
the model than DM2E but they are more often used. Other ontologies like rdaGr2 provide very
specific properties for very specific contextual classes which are also often not mapped (e.g. the
already mentioned rdaGr2:dateOfEstablishment). Even though the two CIDOC-CRM properties
in the model, crm:P79F.beginning_is_qualified_by and crm:P80F.end_is_qualified_by, are also
very specific, they serve an important case: they are used to indicate how accurate a timespan is.
FIG. 6: Number of properties defined in the DM2E model vs. number of properties actually used in the data, by
referenced ontology.
The fact that only half the properties defined in the DM2E model are actually used (see also
fig. 1) deserves closer scrutiny, however. Because the ontology is being developed by DM2E for
DM2E, this cannot be explained with the specificity of the domain of the ontology, but with the
dynamics of the process of ontology development: In the early stages, the intricate knowledge of
data providers about the details of their data led them to require increasingly semantically narrow
properties from the DM2E ontology engineers (e.g. dm2e:honoree or dm2e:wasStudiedBy).
However, when the MIs (which do not necessarily coincide with the DPs, see table 1) started
implementing the mappings, many of those requirements were dropped due to the specific
properties being hard to map or not being readily discernible from the original metadata. Over the
course of many cycles of mapping, data ingestion and refinement of the data model, new
properties have been added but unused properties were never dropped.
7. Conclusion: Linked Data Mapping Cultures
The analyses have shown that the particular mapping institution plays an important role in the
way that data actually is represented after a mapping process. Datasets mapped by the same MIs
have similar characteristics in the various analyzed aspects, e.g. which resources are used for the
mappings and which are not. The representation of the data before the mapping has a less
significant influence on the structure of the mapped data, as do the domain or CHO types. The
source format is reflected in the number of provided statements, e.g. whenever TEI is used
(where the full text of an object is also annotated and can be used for mappings), many more
statements are produced.
As already identified in previous model evaluations, mapping institutions do not make use of
the full range of possible ontology elements that could be mapped. Models, including the DM2E
model, could be reduced (especially when only a small percentage of specific vocabularies is
used as shown in the last figure). Contextual resources are not mapped as thoroughly as the core
classes for the representation of the object (edm:ProvidedCHO) and its metadata record
(ore:Aggregation).
From a user's perspective, the Linked Data representation should be a function of the source
data and not be strongly influenced by the specifics of the
mapping process. While technical means such as the quantitative analyses presented here help
make the skew more evident, it can eventually only be rectified by a more agile development
process that involves all stakeholders: balancing semantic expressivity with data interoperability,
peer review of mappings, ongoing evaluation of mappings and mapped data, and improved and
extended mapping guidelines with a strong focus on reusability and sustainability of the data and
the data model. From a Linked Data mapping culture perspective, our conclusion is that ontologies
should not just be extended to fit new requirements but also pruned from over-specific bloat
regularly and that this can only be achieved when ontologists, data providers, mapping
institutions, developers and data consumers incessantly communicate, compromising between
semantic accuracy and technical feasibility.
Acknowledgements
This work was supported by a grant from the European Commission “ICT Policy Support
Programmes” provided for the DM2E project (GA no. 297274). The authors would like to thank
all colleagues of DM2E as well as the European Commission for their support.
References
Alexander, Keith, Richard Cyganiak, Michael Hausenblas, and Jun Zhao. (2009). Describing Linked Datasets. On the
Design and Usage of VoID, the “Vocabulary of Interlinked Datasets”. In Bizer et al. (Eds.), Proceedings of the
Linked Data on the Web Workshop (LDOW2009), Madrid, Spain, April 20, 2009, CEUR Workshop Proceedings.
Retrieved, May 14, 2014, from http://ceur-ws.org/Vol-538/.
Auer, Sören, Jan Demter, Michael Martin, and Jens Lehmann. (2012). LODStats – An Extensible Framework for High-Performance Dataset Analytics. In ten Teije et al. (Eds.), Knowledge Engineering and Knowledge Management.
18th International Conference, EKAW 2012, Galway City, Ireland, October 8-12, 2012, Proceedings (pp. 356-362).
Berlin, Heidelberg: Springer. doi: 10.1007/978-3-642-33876-2.
Carroll, Jeremy J., Christian Bizer, Pat Hayes, and Patrick Stickler. (2005). Named Graphs. Journal of Web
Semantics, 3, 247-267.
Dröge, Evelyn, Julia Iwanowa, and Steffen Hennicke. (2014a). A specialisation of the Europeana Data Model for the
representation of manuscripts: The DM2E model. In Libraries in the Digital Age (LIDA) Proceedings, Volume 13,
2014. Retrieved, July, 24, 2014, from http://ozk.unizd.hr/proceedings/index.php/lida/article/view/117.
Dröge, Evelyn, Julia Iwanowa, Steffen Hennicke and Kai Eckert. (2014b, March). DM2E Model V1.1. Retrieved, May
12, 2014, from http://pro.europeana.eu/documents/1044284/0/DM2E+Model+V+1.1+Specification.
Europeana Data Model Primer, v14/07/2013. (2013, July). Europeana Professional website. Retrieved, April 28, 2014, from http://pro.europeana.eu/documents/900548/770bdb58-c60e-4beb-a687-874639312ba5.
Heath, Tom, and Christian Bizer. (2011). Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures
on the Semantic Web: Theory and Technology (Vol. 1). Morgan & Claypool.
Klimek, Jakub, Jirí Helmich, and Martin Necasky. (2014). Application of the Linked Data Visualization Model on Real World Data from the Czech LOD Cloud. Linked Data
on the Web (LDOW 2014) Workshop. Retrieved, May 14, 2014, from
http://events.linkeddata.org/ldow2014/papers/ldow2014_paper_13.pdf.
Palavitsinis, Nikos, Nikos Manouselis, and Salvador Sanchez-Alonso. (2014). Metadata quality in digital repositories:
Empirical results from the cross-domain transfer of a quality assurance process. Journal of the Association for
Information Science and Technology. doi: 10.1002/asi.23045.
Seiffert, Florian. (2001). Eine Analyse der Verbunddaten des HBZ. ABI-technik 21(2): 125-146.
Smith-Yoshimura, Karen, Catherine Argus, Timothy J. Dickey, Chew Chiat Naun, Lisa Rowlison de Ortiz, Hugh
Taylor. (2010, March). Implications of MARC Tag Usage on Library Metadata Practices, OCLC Online Computer
Library Center, Inc. Retrieved, May 14, 2014, from http://www.oclc.org/research/publications/library/2010/2010-06.pdf.
The Digital Public Library of America Ingestion Ecosystem:
Lessons Learned After One Year of Large-Scale Collaborative
Metadata Aggregation
Mark A. Matienzo
Digital Public Library of America, USA
[email protected]
Amy Rudersdorf
Digital Public Library of America, USA
[email protected]
Abstract
The Digital Public Library of America (DPLA) aggregates metadata for cultural heritage
materials from 20 direct partners, or Hubs, across the United States. While the initial build-out of
the DPLA’s infrastructure used a lightweight ingestion system that was ultimately pushed into
production, a year’s experience has allowed DPLA and its partners to identify limitations to that
system, the quality and scalability of metadata remediation and enhancement possible, and areas
for collaboration and leadership across the partnership. Although improved infrastructure is
needed to support aggregation at this scale and complexity, ultimately DPLA needs to balance
responsibilities across the partnership and establish a strong community that shares ownership of
the aggregation process.
Keywords: metadata aggregation; metadata remediation; harvesting; software development;
community development; JSON-LD
1. Introduction
The Digital Public Library of America (DPLA) recently celebrated its first anniversary
aggregating the riches of America’s libraries, archives, and museums and sharing them through a
single portal. With its Hubs (the 20 direct partners from whom DPLA harvests records) and their
partners (approximately 1,300 in all), DPLA has worked to make these resources freely available
to the world. After a year focusing resources on growth, with the DPLA holdings more than
tripling to over seven million records in twelve months, it seems an appropriate time to take stock
of the technologies and processes within which this work occurs, as well as the data models used
to aggregate the Hubs’ various metadata standards and the nature of collaboration between DPLA
and the Hubs. It is important to identify areas both of success and improvement that have become
apparent since the launch in April 2013. This assessment takes into consideration outside
variables, as well, including feedback from Hubs, users of DPLA’s open and freely available
application programming interface (API), and others interested in the DPLA technology stack and
metadata model. A few areas of future work have been identified, which will help to create a
roadmap for ongoing investigation and development. It is hoped, too, that this process will
involve current and future partners, and create a community of practice around these open source
technologies and metadata management systems.
2. Development, implementation, and current status of DPLA infrastructure
DPLA launched its services on April 18, 2013, with 2.4 million records from 16 Hubs (and
their over 900 partners) after a two-year planning phase. The components that make up the
technology stack that supports the infrastructure are lightweight and open source, which allowed
DPLA’s initial technical implementation team to prototype and deploy working iterations
quickly. DPLA also developed a metadata application profile, or MAP (Digital Public Library of
America, 2014a), based on existing data standards and models. In addition to the ingestion system
described below, DPLA’s infrastructure also provides both an application programming interface
(API) and a public user interface that serves as the primary discovery system for the ingested
metadata. The platform, or API layer, is a Ruby on Rails web application that provides an
abstraction mechanism over the primary data store and search index. The portal, or user-facing
front-end application, is built on Ruby on Rails, and is a client of the platform application.
The DPLA technical infrastructure was implemented over a period of 18 months, a relatively
short build-out for a system of this scope. During the initial implementation period (October
2012-April 2013), the DPLA Assistant Director for Content undertook primary responsibility for
developing the metadata mappings, and a team of contractors developed the metadata ingestion
system and other areas of infrastructure and ran the ingestion processes. Since late 2013, the
DPLA staff has steadily grown, including the hiring of a Director of Technology (December
2013), two Technology Specialists (January and May 2014), a Data Services Coordinator (August
2014), and a Metadata and Platform Architect (August 2014). During this time, DPLA has
undertaken most of the responsibility for maintaining the existing infrastructure, overseeing the
ingestion process, and identifying areas for improvement.
2.1 The DPLA Metadata Application Profile
The DPLA Metadata Application Profile (MAP) is an extension of the Europeana Data Model,
or EDM (Europeana, 2014). Version 3, the first public version of the MAP, was developed in
early 2013 by DPLA staff and others, in collaboration with Europeana staff and public data
specialists who provided input during an open review period in late 2012. Like EDM, the MAP
incorporates or references a variety of standards and models, including the Dublin Core Metadata
Element Set, Dublin Core Terms, the DCMI Type Vocabulary, OAI-Object Reuse and Exchange,
and others. While based on EDM, the DPLA MAP nonetheless slightly diverges from it. First,
one of the MAP’s core classes, the Source Resource (dpla:SourceResource), is defined as a
subclass of the corresponding class in EDM (Provided Cultural Heritage Object, or
edm:ProvidedCHO). The primary motivation for this was to make clear that the properties of
dpla:SourceResource in some cases may have different cardinalities or requirements than those
defined for edm:ProvidedCHO. In addition, because of limitations on both the data available from
DPLA’s providers and the geocoding enrichments implemented near launch, DPLA developed its
own spatial location class, dpla:Place.
FIG. 1. Core classes and relationships in the DPLA Metadata Application Profile, versions 3 and 3.1.
DPLA staff reviewed and revised the requirements for the MAP in mid-2014, and released
MAP version 3.1 in July 2014. Many of the differences between MAP versions 3 and 3.1 relate to
cardinality requirements, which were changed based on recognition of the properties DPLA could
not reliably receive, map, or otherwise derive from metadata provided by Hubs. DPLA also added
a new property (Intermediate Provider, or dpla:intermediateProvider) to allow for the declaration
of an entity understood to be distinct from the two provider-related properties within EDM
(edm:Provider and edm:dataProvider). MAP version 3.1 defines an Intermediate Provider as “an
intermediate organization that selects, collates, or curates data from [an edm:dataProvider] that is
then aggregated by [an edm:Provider] from which DPLA harvests” (Digital Public Library of
America, 2014a). Beyond these changes, MAP version 3.1 also contains several changes which
bring it towards further alignment with EDM, such as clearly identifying the super-properties for
a given property when available, aligning internal properties with EDM definitions, adding the
edm:hasType property to express genre statements, and adding the edm:rights property. The
addition of edm:rights allows for the association of rights information available at a given URI
to two core classes within the MAP.
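For illustration, the following sketch approximates how a single record might look under these classes when serialized as JSON-LD; the context and property keys are simplified assumptions rather than the normative MAP serialization.

```python
# An illustrative approximation of a MAP-style record as JSON-LD. The context
# and property keys are simplified assumptions, not the normative DPLA MAP
# serialization.
import json

record = {
    "@context": {
        "dpla": "http://dp.la/about/map/",
        "edm": "http://www.europeana.eu/schemas/edm/",
        "dcterms": "http://purl.org/dc/terms/",
        "ore": "http://www.openarchives.org/ore/terms/",
    },
    "@type": "ore:Aggregation",
    "edm:rights": "http://example.org/rights/public-domain",       # rights URI (placeholder)
    "edm:provider": {"name": "Example Hub"},                        # the Hub DPLA harvests from
    "edm:dataProvider": {"name": "Example Partner Institution"},
    "dpla:intermediateProvider": {"name": "Example Collaborative"},
    "sourceResource": {                                             # dpla:SourceResource
        "@type": "dpla:SourceResource",
        "dcterms:title": "Photograph of the Texas State Capitol",
        "dcterms:spatial": {"@type": "dpla:Place", "name": "Austin (Tex.)"},
    },
}

print(json.dumps(record, indent=2))
```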
2.2 Ingestion system and workflow
The DPLA ingestion system (Digital Public Library of America, 2014b) is an application,
written in Python using the Akara (2010) framework, that provides REST endpoints for web
services to transform or enrich data serialized in JSON. The primary DPLA data store is a
BigCouch/CouchDB document-oriented database, with metadata both stored and serialized using
JSON-LD 1.0 (Sporny, Kellogg, and Lanthaler, 2014). Once stored in BigCouch, all ingested
metadata is indexed using Elasticsearch, a REST-based search server built upon Apache Lucene.
Additional scripts that support or control the ingestion process are also written in Python. The
ingestion workflow for a given ingestion source has a designated ingestion profile. In most cases,
Hubs only provide one ingestion source, but a small number of Hubs are continuing to develop
internal systems to support the single-ingestion-source model that is, technically, a requirement
for DPLA participation. Accordingly, a single Hub that has more than one ingestion source may have
multiple ingestion profiles. Each ingestion profile is a JSON document containing configuration
information such as the type of harvest (e.g., OAI-PMH, site-specific API, static files, etc.), the
location of an HTTP endpoint if applicable (e.g., the OAI-PMH provider URI), the specific
mapping and enrichments to be applied, and other internal settings required by the ingestion
system.
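For illustration, a hypothetical ingestion profile for an OAI-PMH source might look like the following; the key names are assumptions and may differ from the profiles in the DPLA ingestion repository.

```python
# Hypothetical ingestion profile for an OAI-PMH source. The key names are
# illustrative assumptions, not necessarily those used in dpla/ingestion.
import json

profile = {
    "name": "example_hub",
    "type": "oai",                                    # harvest type: OAI-PMH
    "endpoint_url": "http://example.org/oai",         # OAI-PMH provider URI (placeholder)
    "metadata_prefix": "oai_dc",
    "map_to_dpla": "oai_dc-to-map",                   # mapping module to apply
    "enrichments": [                                   # pipeline steps, run in order
        "select-id",
        "map-to-dpla-map",
        "cleanup-value",
        "enrich-type",
        "enrich-language",
        "geocode",
    ],
    "thresholds": {"deleted": 1000, "changed": 5000},  # alerting thresholds
}

with open("example_hub.profile", "w") as f:
    json.dump(profile, f, indent=2)
```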
FIG 2. Overview of the DPLA ingestion workflow.
The ingestion workflow is invoked by a support script that reads the ingestion profile for a
given source and creates an ingestion document in the dashboard database for a given ingestion
process. The ingestion document contains data about the state of a particular ingestion task (e.g.,
whether a specific step has started, completed, or failed). The dashboard database also
temporarily contains a representation of each fetched record to allow staff to identify what parts
of an ingested record have changed. Once the ingestion document is created, the staff running the
ingestion process invokes the fetch task, which obtains the metadata to be ingested from the
source defined in the profile. The metadata is then deserialized from its native format (typically
XML), reserialized as a JSON expression of the original data, and persisted to disk in a temporary
location. Once the fetch process is complete, the ingestion document is updated to contain the
location of the data transformed to JSON.
The ingestion staff then invokes the transformation and enrichment tasks. These tasks map and
transform the JSON-serialized metadata to the DPLA MAP, and normalize, enhance, and
augment the metadata using a “pipeline” that orchestrates requests to the application’s REST
endpoints (see section 2.3 for more information). Once complete, the records are temporarily
persisted to disk as a JSON-LD serialization of the MAP, and the ingestion document is updated
with information about transformation and enrichment processes, including location of the
transformed records and the extent of any failures within the process. The ingestion staff then
runs the save task, which reads the MAP-compliant JSON-LD records and persists them to the
primary data store. After the save process completes, the ingestion staff runs the check ingestion
counts task, which identifies the number of new, updated, or deleted records for each ingestion
process and automatically alerts the identified staff when those values are above a certain
threshold defined in the ingestion profile. Finally, the ingestion staff runs two concluding tasks:
the remove deleted records task and the dashboard database cleanup task. Both tasks remove
objects from the primary data store or dashboard database. These objects correspond to the
metadata from ingested records that were either deleted from the ingestion source by the provider
(e.g., as identifiable using the <deleted> element from an OAI-PMH provider) or otherwise not
present or available during a given ingestion process.
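The overall control flow can be summarized in a simplified sketch such as the following, in which each task records its state in the ingestion document; the task runner and dashboard interface shown here are assumptions, not DPLA's actual scripts.

```python
# A simplified sketch of the ingestion workflow described above: each task runs
# in sequence and records its state in an "ingestion document". The dashboard
# object and its methods are hypothetical stand-ins.
import json

def run_ingestion(profile_path, dashboard):
    with open(profile_path) as f:
        profile = json.load(f)

    # Create the ingestion document that tracks the state of each step.
    doc = dashboard.create_ingestion_document(profile["name"])

    for task in ("fetch", "transform_and_enrich", "save",
                 "check_ingestion_counts", "remove_deleted_records",
                 "cleanup_dashboard"):
        doc[task] = {"status": "started"}
        try:
            dashboard.run_task(task, profile, doc)    # hypothetical task runner
            doc[task]["status"] = "complete"
        except Exception as err:
            doc[task] = {"status": "failed", "error": str(err)}
            break                                      # stop the workflow on failure
        finally:
            dashboard.save(doc)                        # persist state after every step
```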
2.3 The metadata transformation and enrichment pipeline
Most of the work to transform, normalize, and enhance the metadata ingested into DPLA
occurs as part of the transformation and enrichment pipeline, which executes a list of specific
steps defined in an ingestion profile in a specific, linear order. Each of the steps is implemented in
the ingestion system as a module mounted at a defined REST endpoint. Each of the endpoints
receives JSON data over an HTTP POST request, and returns JSON data, either modified if the
step was applicable and successful or unchanged if the step was inapplicable or if it failed. Most
of the ingestion profiles share a number of common steps, and the modular design allows DPLA
to easily reuse them and add extra parameters as needed.
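The contract of a pipeline module can be illustrated with a minimal stand-in such as the following, written with Flask rather than the Akara framework DPLA actually uses; the endpoint name and normalization logic are assumptions.

```python
# A minimal stand-in for one pipeline module: JSON in over POST, JSON out,
# returned unchanged if the step does not apply. Flask is used here for
# brevity in place of Akara; the normalization table is hypothetical.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/enrich-language", methods=["POST"])
def enrich_language():
    record = request.get_json(force=True)
    # Hypothetical normalization: map a few literal language names to ISO 639-3.
    iso_codes = {"english": "eng", "spanish": "spa", "french": "fra"}
    language = record.get("sourceResource", {}).get("language")
    if isinstance(language, str) and language.lower() in iso_codes:
        record["sourceResource"]["language"] = {
            "name": language,
            "iso639_3": iso_codes[language.lower()],
        }
    # If the step was inapplicable, the record is returned unchanged.
    return jsonify(record)

if __name__ == "__main__":
    app.run(port=8881)
```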
FIG 3. Sample transformation and enrichment pipeline for ingestion from the Portal to Texas History.
At a minimum, the pipeline must contain two steps: one that selects the source of the identifier
from the ingested metadata (which is required for persistence), and another that transforms and
maps the metadata to the DPLA MAP. Despite the pressures related to launch, DPLA was also
able to implement some degree of normalization and enrichment. Much of the DPLA staff’s
ongoing work involves revising and ensuring that these normalization and enrichment modules
remain robust and error-free. At a minimum, the enhancements applied to most metadata ingested
into DPLA include what Hillmann, Dushay, and Phipps (2004) term “safe transforms,” through
global cleanup of values to address minor differences in capitalization, punctuation, or
whitespace, or alignment and reconciliation of terms against comparatively small controlled
vocabularies such as the DCMI Type Vocabulary or ISO 639-3 language codes. In addition, the
ingestion system undertakes more complex transformations based on diversity of practice, such as
normalizing dates or date ranges to a common format, and “shredding” a string literal based on a
given delimiter to yield multiple values. In addition, the ingestion system also includes a
geocoding enrichment service, which uses external services to take geographic name values and
geocode them to return latitude and longitude pairs, and then uses those coordinates to build out a
geographic hierarchy. More details about these services are provided below.
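Two of these "safe transforms" can be sketched as plain functions; the delimiters and cleanup rules shown here are illustrative assumptions.

```python
# Sketch of two "safe transforms": whitespace/punctuation cleanup and
# "shredding" a delimited literal into multiple values. Delimiters and
# trailing-punctuation rules are assumptions for illustration.
import re

def cleanup_value(value: str) -> str:
    """Collapse whitespace and strip stray trailing punctuation."""
    value = re.sub(r"\s+", " ", value).strip()
    return value.rstrip(" ;,.")

def shred(value: str, delimiter: str = ";") -> list[str]:
    """Split a single literal into multiple values on a delimiter."""
    return [cleanup_value(part) for part in value.split(delimiter) if part.strip()]

print(shred("Portraits; Texas -- Austin;  Photographs ."))
# ['Portraits', 'Texas -- Austin', 'Photographs']
```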
The quick lead-up to the launch meant that turnaround times were limited, and the need to ingest
metadata created using different schemas under varying practices and assumptions meant that
some areas of work on the transformation and enrichment pipeline had to be reprioritized. Work
during the initial ingest, which took place roughly between February and mid-April 2013, focused
on mapping and the conceptual alignment of fields from the initial 16 Hubs, rather than on the
review and quality control of the actual values. Likewise, a loosening of validation against the
MAP assertions was necessary to ensure that goals and timelines were met. This period focused
on return on investment in the strictest sense: providing the best data in the shortest period of time
with the least remediation. In addition, since MAP version 3 was only finalized approximately
three months before launch (and only days before the first ingests began), additional changes to
the ingestion code and DPLA’s Platform API were necessary to ensure that all of the data was
available through the portal by mid-April 2013.
3. Concerns and challenges
The technology and data model established for the launch has served DPLA well. It has
effectively aggregated over seven million records, enabling hundreds of users to utilize the API
to build apps, and more than a million users to search and enjoy the resources
available through the portal. With sustained use and the ongoing need to continue the ingestion of
metadata from both current and future Hubs, challenges have arisen that signal a need to consider
potential new options for aggregation, storage, and delivery.
3.1. The ingestion process
Ingest remains a very hands-on endeavor. Once a Hub’s data is mapped to the DPLA Metadata
Application Profile (by the Assistant Director for Content, at the time of publication), a new
ingestion profile is written (by DPLA technology staff) that delineates the harvesting,
transformation and enrichment steps. In addition, despite using common metadata standards (e.g.,
DCMES or MODS) or harvesting protocols (e.g., OAI-PMH), differences in local
implementation often require DPLA technology staff to modify or supplement implemented
mappings, employ new transformation services, or resolve other inconsistencies before an ingest
moves to production. For example, several Hubs have found it difficult to reliably provide URIs
for thumbnail images for the items associated with the metadata ingested by DPLA. As this
information is mandatory in MAP version 3.1, DPLA technology staff must often undertake a
degree of reverse engineering to add an enrichment step that identifies or constructs this URI.
Nonetheless, while discussions between Hub and DPLA personnel lead to good results, the
process of getting a new data set into production often lasts between four and eight weeks.
The ingestion process itself is also resource intensive, and as described above, the architectural
paradigm of the current ingestion system expects that a consistent transformation and
enrichment pipeline be used across all ingestion processes from a given ingestion source. A large
number of processes are applied to all incoming ingests regardless of the metadata schema used
or quality of the metadata received. Currently, data from each Hub is reingested in its entirety
monthly, every other month, or quarterly, depending on the frequency of local updates.
Accordingly, each step defined in the transformation and enrichment pipeline runs during each
ingestion process. This ultimately leads to the potential for some enhancements to be lost or
misapplied if a Hub has modified its metadata in the interim. Improved control over the
enrichment workflow, such as enabling or disabling certain processes for a scheduled ingestion
process for a specific Hub, and supplementing those enrichments with provenance information,
could provide better control and reduce complexity of ingestion on an ongoing basis. And while
the process has been internally standardized, it remains somewhat opaque to some Hubs,
especially those who may not be familiar with the languages in which the transformation and
enrichment pipeline modules are written. In the experience of DPLA, this also points to the need
for improved unit tests and documentation that make the intent of the pipeline modules clearer to
domain experts without programming knowledge.
Other challenges to the current model that have come to light over the past year include the
inconsistency of some of the enrichment and normalization processes that are applied to all
collections. For example, DPLA staff recently identified that structured spatial information (i.e., a
place hierarchy) provided by some Hubs was not successfully mapped to the property required
for the literals to appear in the user interface (skos:prefLabel). Diagnosis of issues in the
enrichment process proves to be an ongoing challenge for DPLA given that the ingestion system
does not track the provenance of statements created or modified during transformation and
enrichment. In addition, while the DPLA MAP is a data model based upon RDF, the current
infrastructure has not yet implemented a complete expression of the constraints defined by it.
These limitations originate mostly because the current implementation of validation relies on a
simplified expression of the MAP using JSON Schema (Galiegue, Zyp, and Court, 2013), with
any validation of the statements about a given item against the MAP currently limited to
cardinality checks and simple controlled value verification based on the JSON serialization of the
data.
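A simplified impression of what such JSON Schema-based cardinality checking looks like is given below, using the jsonschema package; the schema fragment is an assumption for illustration and not DPLA's actual schema.

```python
# A simplified illustration of cardinality checking against a JSON Schema
# expression of the MAP. The schema below is an assumption, not DPLA's schema.
from jsonschema import ValidationError, validate

map_subset_schema = {
    "type": "object",
    "required": ["sourceResource", "dataProvider", "isShownAt"],
    "properties": {
        "sourceResource": {
            "type": "object",
            "required": ["title", "rights"],
            "properties": {
                "title": {"type": ["string", "array"], "minItems": 1},
            },
        },
    },
}

def check_record(record):
    """Return a list of validation messages (empty if the record passes)."""
    try:
        validate(instance=record, schema=map_subset_schema)
        return []
    except ValidationError as err:
        return [err.message]

print(check_record({"sourceResource": {"title": "Example"}}))
```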
Another area in which DPLA continues to face challenges is the geocoding enrichment
process, which retrieves a “best guess” set of coordinates for a term from the Bing Maps API, and
uses those coordinates to build out the rest of a geographic hierarchy for that term using the
Geonames API. For the value “Charlotte (NC),” the values “35.226944, -80.843333” are
automatically assigned via the Bing Maps API to indicate the geographic center of the city. Then,
those coordinates are sent to the Geonames API to extract the geographic hierarchy for Charlotte,
i.e., United States -- North Carolina -- Mecklenburg County -- Charlotte. This is rich and valuable
data that allows DPLA to plot “Charlotte (NC)” on the interactive map in the portal. Like any
scaled transformation, this process is not fail-safe, as a careful study of the map exposes. For
example, consider a record with the spatial value of “Wisconsin.” In this model, the coordinates
for the central point of the state identify a hierarchy that contains county-level information
(United States -- Wisconsin -- Portage County), which introduces data that can be misleading, if
not erroneous. In addition, DPLA staff has discovered that external web services like the Bing
Maps API often update the data they provide or their indexing mechanism, which has led to
inconsistencies in the geocoding enrichment processes over time. Considering the lack of
confidence about the geocoding process and the inability to track provenance of statements in
DPLA’s current infrastructure, DPLA has chosen not to implement reconciliation of geographic
names with URIs from sources such as Geonames until these issues can be addressed.
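The two-step flow can be sketched as follows; the helper functions stand in for calls to the Bing Maps and GeoNames APIs and are hypothetical, not the actual client code.

```python
# A sketch of the two-step geocoding enrichment: geocode a place string to
# coordinates, then derive a geographic hierarchy from those coordinates.
# Both helpers are hypothetical stand-ins for external API calls.
def geocode_with_bing(place_name: str) -> tuple[float, float]:
    """Hypothetical wrapper around the Bing Maps geocoder."""
    raise NotImplementedError

def hierarchy_from_geonames(lat: float, lon: float) -> list[str]:
    """Hypothetical wrapper that asks GeoNames for the enclosing places."""
    raise NotImplementedError

def enrich_place(place_name: str) -> dict:
    lat, lon = geocode_with_bing(place_name)       # e.g. 35.226944, -80.843333
    hierarchy = hierarchy_from_geonames(lat, lon)  # e.g. ["United States",
                                                   #       "North Carolina",
                                                   #       "Mecklenburg County",
                                                   #       "Charlotte"]
    return {
        "name": place_name,
        "coordinates": f"{lat}, {lon}",
        "country": hierarchy[0] if hierarchy else None,
        "state": hierarchy[1] if len(hierarchy) > 1 else None,
        "county": hierarchy[2] if len(hierarchy) > 2 else None,
    }
```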
3.2. The metadata
Over the past year, DPLA staff has had the opportunity to work closely with Hubs from across
the United States. Not surprisingly, the Hubs employ various metadata standards, maintain data in
many different repository types, and manage localized workflow models. The process of
aggregation, and especially enrichment and normalization, has been eye-opening for most of the
parties involved. DPLA staff knew even before harvesting began in early 2013 that the process
would be complex and not without challenges, as evidenced by past work on projects such as the
National Science Digital Library (Lagoze et al., 2006), the Digital Library Federation Aquifer
Project (Riley et al., 2008), and Europeana. One immediate revelation was somewhat surprising,
however. The greatest difference between collections—and the source of the most difficulties—is
not the metadata schemas employed or repositories used, but the extent to which simple metadata,
like unqualified Dublin Core exposed over OAI-PMH, must be processed, and, more importantly,
how metadata is input and managed locally.
When data is shared in MODS, MARCXML, or even qualified Dublin Core, the richness and
completeness of the records transfers relatively easily to the DPLA model. Not surprising, of
course, is that the more granular the original record, the better the output at the other end.
However, unqualified Dublin Core—most often exposed over OAI-PMH—requires a great deal
more analysis and a greater number of complex transformations to identify and map discrete
values in a single field to multiple fields in the MAP. For example, specific transformation and
enrichment modules are created to determine when a dc:coverage field contains only spatial
information, spatial information together with temporal information, or only temporal
information. Similar, though no less challenging, issues arise from the varied interpretation of
values in dc:source, dc:contributor, dc:relation, dc:type, and others. In evaluating the importance
or the efficacy of these transforms, DPLA is reminded that “minimally descriptive metadata … is
still minimally descriptive after multiple quality repairs” (Lagoze, et al. 2006). In some ways, this
problem is exacerbated further given that Hubs are often aggregators themselves. The degree to
which values have been “dumbed down” is not always well documented in terms of how or
where this simplification occurred.
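A heuristic of the kind such a module needs might be sketched as follows; the patterns are illustrative assumptions rather than DPLA's actual transformation logic.

```python
# A sketch of a heuristic for deciding whether a dc:coverage value is spatial,
# temporal, or both. The patterns are illustrative assumptions only.
import re

DATE_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})(s)?\b|\b\d{4}-\d{2}-\d{2}\b")

def classify_coverage(value: str) -> dict:
    parts = [p.strip() for p in value.split(";") if p.strip()]
    spatial = [p for p in parts if not DATE_PATTERN.search(p)]
    temporal = [p for p in parts if DATE_PATTERN.search(p)]
    return {"spatial": spatial, "temporal": temporal}

print(classify_coverage("Austin (Tex.); 1936-1940"))
# {'spatial': ['Austin (Tex.)'], 'temporal': ['1936-1940']}
```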
It also became immediately clear whether or not a Hub, or its partners, consistently employed
and applied controlled vocabularies. While most Hubs follow general guidelines for
geographic names (e.g., selecting terms from vocabularies like TGN or LCSH), they are not
always applied consistently. Again, this is in part because many Hubs are themselves aggregators
of content from hundreds of partners. On DPLA’s long-term roadmap for implementation is the
work to implement reconciliation of string literals against large controlled vocabularies.
Interestingly, in many collections, Hubs’ partner names are not taken from controlled
vocabularies, or if they are, either this is not indicated in the data or the authorized form of name
lacks important contextual information. This has led to a surprising number of errors or
unfamiliar values in the data, at least initially. One Hub utilizes the Library of Congress Name
Authority File to create their controlled list of partner names. While on the surface this seems like
a prudent approach, until the terms are associated with URIs and are augmented with more
information, many of the names have very little meaning outside of their local context. For
example, not everyone can readily associate the LC Name Authority “J. Y. Joyner Library” with
East Carolina University (the parent institution).
4. Responses and requests from DPLA Hubs
DPLA personnel have actively worked in partnership with Hubs to identify and openly
communicate quality issues in the data that they are sharing. Hubs have been responsive and
often eager to make updates and changes to data and even the mappings in their local systems to
better align with international practice and the DPLA data model. All agree that this has meant
better data quality at both the local and global level. Through this process, Hubs have shared
thoughts on ways that ingest could be improved. In some cases, they have begun local
development on tools that transform and enrich their data before it reaches DPLA. Some of the
requests DPLA has heard align well with its own internal priorities and needs.
4.1. Greater control over and feedback during the ingestion process
As mentioned earlier, the community feels strongly that they would benefit from an “ingestion
dashboard” that offers a selection of enrichment processes that Hubs could choose to
apply to their data during the ingest process. Because the Hubs know their data best, enabling
access to an ingestion dashboard and involving them as early as possible in the initial mapping
process would give the Hubs more control over the way their data is exposed via DPLA. Also, it
would shed light on what remains a somewhat opaque process for those who are not proficient
with the technologies in use. In the interim, DPLA has developed a basic content quality
assurance dashboard for internal use and review by Hubs before an initial ingestion reaches the
DPLA production data store. The dashboard application is part of the platform API infrastructure,
and provides a stripped-down user interface for search and browse of ingested metadata, and the
generation of reports on metadata output from the transformation and enrichment pipeline. In
addition, integrating tools that provide better visual representations of how metadata is mapped at
ingestion and presented in the DPLA portal interface (e.g., Gregory and Williams, 2014) would
benefit stakeholders across the DPLA network.
4.2. Access to data quality reports
As part of the initial ingestion process for a new Hub, a series of reports are produced that
enable DPLA staff to review the values in each field mapped to the DPLA application profile. For
each property, two reports are produced: an itemized list of all values in the field and the
corresponding DPLA record identifier, and a count of all of the values in that field. The reports
are produced from the enriched data, after geocoding and normalization have been applied. Some
Hubs, especially those with repository systems that cannot easily generate aggregated reports for
a given element or predicate, have requested access to reports on their unprocessed data. This
would allow them to assess their metadata and perform remediation locally, before it is ever
harvested by DPLA. While valuable, this will require significant re-engineering of the ingestion
system before it can be implemented.
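The shape of these two per-property reports can be sketched as follows; the record structure and property path are illustrative.

```python
# Sketch of the two per-property reports described above: an itemized list of
# values with record identifiers, and a count of each distinct value. Field
# names are illustrative.
import csv
from collections import Counter

def property_reports(records, prop="sourceResource.format"):
    itemized, counts = [], Counter()
    for record in records:
        value = record
        for key in prop.split("."):               # walk the nested property path
            value = value.get(key, {}) if isinstance(value, dict) else {}
        if value and not isinstance(value, dict):
            for v in (value if isinstance(value, list) else [value]):
                itemized.append((record.get("id"), v))
                counts[v] += 1
    return itemized, counts

records = [
    {"id": "rec1", "sourceResource": {"format": "Photographs"}},
    {"id": "rec2", "sourceResource": {"format": ["Photographs", "Postcards"]}},
]
itemized, counts = property_reports(records)
with open("format_itemized.csv", "w", newline="") as f:
    csv.writer(f).writerows(itemized)
print(counts)   # Counter({'Photographs': 2, 'Postcards': 1})
```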
4.3. Upstream data flow: receiving DPLA-provided enrichments
The greatest challenge, but one that several Hubs have voiced interest in investigating, is a
method for applying enrichments undertaken by DPLA as part of the ingestion process back to
their local data sets. While DPLA provides data dumps for all Hubs’ metadata both as individual
and collective compressed dump files on the DPLA portal, working with this data can be
challenging due—in part—to the sheer size of the files. For Hubs that have a strong technology
team and a software environment that would allow it, pulling data from the DPLA API and
merging changes with their local data might be a possibility. For others, especially those using
systems like CONTENTdm that do not allow for the expression of relationships between fields,
this will likely remain an impossibility. Nonetheless, to provide this service in a scalable fashion
will require DPLA to better track how and when enrichments are applied, and when they may or
may not be necessary.
4.4. Further tool and infrastructure development
While DPLA provides guidance to Hubs about particular standards, schemas, or protocols used
to standardize, aggregate, and/or provide metadata, DPLA does not usually recommend or require
use of any specific tools or applications to harvest, transform, or enrich metadata. Some Hubs
have expressed an interest in working with other Hubs or with DPLA to develop tools to help
with these processes. Even when formal collaboration has not yet been established, DPLA now
finds itself providing an important service, mediating connections across Hubs to identify when
the community faces common challenges.
5. Planning for needed improvements
Based on this feedback from Hubs, as well as needs identified through the challenges listed
previously, DPLA is now reassessing its priorities and planning to address these issues. In some
cases, resolving these issues may directly impact the infrastructure DPLA has in place, and
addressing others clearly relates the need for DPLA to identify the level to which it should
provide services on behalf of its Hubs. Some of the major areas of focused effort over the next
year include the following.
5.1. Revision of the DPLA Metadata Application Profile
While the Metadata Application Profile is based on the Europeana Data Model (EDM), it has
nonetheless diverged from it due to the pressures of DPLA’s initial launch outlined above.
Accordingly, DPLA is undertaking revision of the MAP to bring it back to closer alignment with
EDM, which will allow the ingestion process to better associate URIs with given predicates in the
MAP. As indicated in section 2.1, DPLA’s immediate needs led to the development and
implementation of MAP version 3.1. As an organization, DPLA has committed to reviewing the
MAP on an ongoing basis, and is already planning for further changes to be included in MAP
version 4. These include shifting to the class defined by EDM for spatial data (edm:Place), better
support for controlled vocabularies for subject and genre statements, and investigating the
addition of a class to provide support for annotation information. Future versions will also allow
DPLA and other consumers of the ingested metadata to better incorporate annotations, either in
the form of user-generated metadata, or automated output based on the results of transforms and
enrichments during each ingest process.
5.2. Reassessment of “data quality” and “validation” in the context of DPLA
To provide better tools that ensure the validity and quality of metadata, there will need to be a
clear understanding of how those terms are defined in the context of the DPLA/Hub
collaboration. Lagoze et al. (2006) suggest that safe transforms are not necessarily scalable, and
as such, DPLA and its Hubs must work together to clearly identify which remediation or
augmentation processes add the most value to partners and other stakeholders. In addition, DPLA
needs to determine whether validation against the MAP is a priority, and to have a clearer
delineation of which party must provide the appropriate source data to fulfill the obligations of
the MAP (i.e., DPLA, the Hub, or the partner). If explicit validation against the MAP becomes a
priority for DPLA and its stakeholders, it will likely require the addition of a means to validate a
set of statements against the constraints of the MAP as an RDF application profile. As a
preliminary investigation, the co-authors have contributed use cases to the DCMI RDF
Application Profiles Task Group.
5.3. Encouraging Hubs to undertake metadata transformation and enrichment
locally and to develop appropriate tools
Since Hubs often know their metadata (and that of their partners) best, DPLA sees promise in
Hubs taking on greater responsibility for metadata remediation, enrichment, and transformation to
the MAP at the local level whenever possible. In many cases, DPLA has seen leadership in this
area from Service Hubs, in particular (organizations or collaborative endeavors that aggregate
metadata and provide services to several cultural heritage organizations, usually at a state or
regional level). Some Service Hubs are already actively developing open source software to
support these processes. Ultimately, software and infrastructure developed by the Hubs may
benefit DPLA and its network further if it can be easily reused.
There are several notable examples of this leadership shown by Service Hubs. Developers at
the Boston Public Library (2014) have developed a Ruby module for improved geocoding and
reconciliation of geographic names against vocabularies, which is used to augment both their own
data as well as data aggregated by Digital Commonwealth, the Service Hub for Massachusetts.
University of Minnesota Libraries (2014a, 2014b, 2014c) are developing a suite of tools to
harvest, transform, and augment metadata for materials aggregated by the Minnesota Digital
Library, with the ultimate goal of providing DPLA with metadata compliant with the MAP. In
addition, the North Carolina Digital Heritage Center (NCDHC) has gained significant expertise in
using REPOX for metadata aggregation as a DPLA Service Hub and has developed additional
quality assurance applications to support this work (Gregory and Williams, 2014). In addition, to
promote reuse, NCDHC released these as open source applications on GitHub. The tools allow
NCDHC staff to review mappings, check for the presence of required properties or elements
(NCDHC 2014a), and to provide a preview simulating the DPLA’s portal user interface for
individual new records that can be reviewed by their partners (NCDHC 2014b).
5.4. Improvement of documentation for metadata model and ingestion process
Despite both metadata mapping documentation and the code for the ingestion system being
publicly available, there is still a significant gap in terms of materials available to understand the
DPLA ingestion process. Accordingly, DPLA has begun to address this need by releasing an
introductory white paper that explains the MAP (Digital Public Library of America 2014c) and
creating a wiki page that collocates existing documentation about metadata, partnerships, and
related activities (2014d). DPLA continues to develop further documentation that describes the
ingestion process. This work will also likely give DPLA staff better insight about the expectations
for these processes. In addition, DPLA staff has also supplemented the MAP version 3.1
documentation with explicit references to how properties within MAP are serialized as JSON-LD.
5.5. Improvement or replacement of the DPLA ingestion system
Many of the issues identified by DPLA demonstrate that the current ingestion system, while
suitable as a prototype platform for the harvesting, remediation, mapping, and enhancement from
many sources, is not entirely suited to the needs of a large-scale aggregator. Internally, DPLA
staff has been working to address some issues while investigating whether a substantial refactor
or a complete replacement would better serve the needs of the organization. A few areas for
immediate focus include increasing efficiency, providing better automation, allowing DPLA
content staff to oversee and understand the ingestion process directly with less mediation by the
DPLA technology staff through the development of the aforementioned ingestion and QA dashboards,
and more clearly defining the shared set of transforms and enrichments for all sources. In
addition, the use of domain specific languages that are purpose-built for metadata mapping,
transformation, and enhancement holds promise (e.g., Phillips, Tarver and Frakes, 2014 and
LibreCat, 2014). These changes, in turn, could allow DPLA to create a system with its Hubs that
is more approachable and transparent for those less comfortable with command-line applications
and the orchestration of web services. DPLA has not committed to specific candidates for a
replacement or undertaken extensive requirements analysis for a new ingestion system.
Nonetheless, DPLA is interested in investigating both the previously described software suite
under development by University of Minnesota, as well as Supplejack, the harvesting and
augmentation framework used by DigitalNZ (2014).
6. Conclusion
Despite ongoing challenges with its existing infrastructure, DPLA has successfully aggregated
over seven million records from 20 Hubs and nearly 1,300 partner institutions. The lightweight
infrastructure used to support ingestion, storage, and indexing allowed the technical
implementation team to quickly develop a system to harvest, remediate, and enrich metadata in
varying formats. While the current ingestion system clearly has limits, the experience has allowed
DPLA and its Hubs to identify shared needs and opportunities for collaboration while adding
value to metadata for digitized cultural heritage materials. As the partnership around DPLA
grows, the organization is uniquely situated to foster a community of practice that develops and
provides documentation, software, and a forum to address ongoing needs in the remediation and
enhancement of metadata at a national scale.
Acknowledgements
The Digital Public Library of America wishes to thank and acknowledge the support of the
following organizations that have funded its efforts: The Alfred P. Sloan Foundation; The
Andrew W. Mellon Foundation; The Arcadia Fund; The Bill & Melinda Gates Foundation; the
Institute of Museum and Library Services; The John S. and James L. Knight Foundation; The
Mrs. Giles Whiting Foundation; and the National Endowment for the Humanities.
References
Akara. (2010). Retrieved August 7, 2014, from http://akara.info/.
DigitalNZ. (2014). Supplejack documentation, version 0.1. Retrieved August 7, 2014, from http://digitalnz.github.io/supplejack/.
Boston Public Library. (2014). Bplgeo. Retrieved August 7, 2014, from https://github.com/boston-library/Bplgeo.
Digital Public Library of America. (2014a). Digital Public Library of America Metadata Application Profile, Version
3.1. Retrieved August 7, 2014, from http://dp.la/about/map.
Digital Public Library of America. (2014b). The DPLA ingestion system, version 31.1. http://dx.doi.org/10.5281/zenodo.11226. Retrieved August 7, 2014, from https://github.com/dpla/ingestion.
Digital Public Library of America. (2014c). An introduction to the DPLA metadata model. Retrieved August 7, 2014,
from http://dp.la/info/2014/03/25/intro-dpla-metadata-model/.
Digital Public Library of America (2014d). Content wiki. Retrieved August 7, 2014, from https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/CT/Content.
DPLA RDF application profile use cases. (2014). Retrieved August 7, 2014, from http://wiki.dublincore.org/index.php/DPLA_RDF_application_profile_use_cases.
Europeana. (2013). Europeana Data Model primer. 14 July 2013. Retrieved August 7, 2014, from
http://pro.europeana.eu/documents/900548/770bdb58-c60e-4beb-a687-874639312ba5.
Europeana. (2014). Definition of the Europeana Data Model v5.2.5. 22 May 2014. Retrieved August 7, 2014, from
http://pro.europeana.eu/documents/900548/0d0f6ec3-1905-4c4f-96c8-1d817c03123c.
Galiegue, Francis, Kris Zyp, and Gary Court. (2013). JSON Schema: interactive and non interactive validation. IETF Internet-Draft, January 30, 2013. Retrieved August 7, 2014 from http://json-schema.org/latest/json-schema-validation.html.
Gregory, Lisa, and Stephanie Williams. (2014). On being a hub: some details behind providing metadata for the Digital
Public Library of America. D-Lib Magazine, 20(7/8). http://dx.doi.org/10.1045/july2014-gregory.
Hillmann, Diane I., Naomi Dushay, and Jon Phipps. (2004). Improving metadata quality: augmentation and
recombination. Proceedings of the International Conference on Dublin Core and Metadata Applications, 2004.
Retrieved May 15, 2014 from http://hdl.handle.net/1813/7897.
Lagoze, Carl, Dean Krafft, Tim Cornwell, Naomi Dushay, Dean Eckstrom, and John Saylor. (2006). Metadata
aggregation and “automated digital libraries”: A retrospective on the NSDL experience. In G. Marchionini, M. L.
Nelson, and C. Marshall (Eds.): JCDL '06: Proceedings of the 6th ACM/IEEE-CS joint conference on digital
libraries (pp. 230-239). New York: Association for Computing Machinery.
LibreCat. (2014). Catmandu: Introduction. Retrieved August 7, 2014, from https://github.com/LibreCat/Catmandu/wiki/Introduction.
NCDHC. (2014a). dpla-aggregation-tools. Retrieved August 7, 2014, from https://github.com/ncdhc/dpla-aggregation-tools.
NCDHC. (2014b). dpla-submission-precheck. Retrieved August 7, 2014, from https://github.com/ncdhc/dpla-submission-precheck.
Phillips, Mark, Hannah Tarver, and Stacy Frakes. (2014). Implementing a collaborative workflow for metadata
analysis, quality improvement, and mapping. Code4lib Journal, 23. Retrieved August 7, 2014, from
http://journal.code4lib.org/articles/9199.
Riley, Jenn, John Chapman, Sarah Shreeves, Laura Akerman, and William Landis. (2008). Promoting shareability:
metadata activities of the DLF Aquifer initiative. Journal of Library Metadata, 8(3).
Sporny, Manu, Gregg Kellogg, and Markus Lanthaler (Eds.). (2014). JSON-LD 1.0: A JSON-Based Serialization of Linked Data. W3C Recommendation 16 January 2014. Retrieved August 7, 2014, from http://www.w3.org/TR/json-ld/.
University of Minnesota Libraries. (2014a). dpla.client. Retrieved August 7, 2014, from https://github.com/UMNLibraries/dpla.client.
University of Minnesota Libraries. (2014b). dpla.docs. Retrieved August 7, 2014, from https://github.com/UMNLibraries/dpla.docs.
University of Minnesota Libraries. (2014c). dpla.services. Retrieved August 7, 2014, from https://github.com/UMNLibraries/dpla.services.
Applying a Linked Data Compliant Model: The Usage of the
Europeana Data Model by the Deutsche Digitale Bibliothek
Stefanie Rühle
Göttingen State and
University Library,
Germany
[email protected]
Francesca Schulze
German National Library,
Germany
[email protected]
Michael Büchner
German National Library,
Germany
[email protected]
Abstract
In 2013/14 the Deutsche Digitale Bibliothek (DDB) changed its data model from the CIDOC
conceptual reference model to the Europeana Data Model (EDM). This decision was taken
against the background of two major mandates the DDB has to fulfill: as a portal and as a
platform the DDB is providing access to digital objects from German cultural heritage and
research institutions. The DDB also aims to become the German aggregator for Europeana. Using
EDM as the internal DDB data model was considered the most reasonable solution to meet these
challenges. The DDB uses the data model for all portal functions that require semantic links
between metadata (search facets, hierarchies, links between authority files and digital objects).
The application of EDM for the DDB portal created some difficulties since not all necessary
classes and properties had been entirely implemented in Europeana-EDM at that time. Therefore,
DDB defined a metadata model which is based on the Europeana Data Model Definition but
contains additional extensions. The DDB publishes metadata under the CC0 Public Domain
Dedication license in EDM-RDF/XML via an OAI-PMH interface to serve Europeana and also
via an Application Programming Interface (API) for external users to develop new applications
on the basis of metadata harmonized by the DDB.
Keywords: Deutsche Digitale Bibliothek; German Digital Library; Europeana Data Model;
CIDOC Conceptual Reference Model; metadata model; metadata mapping; metadata
interoperability; linked data
1. Introduction
The Deutsche Digitale Bibliothek (DDB) operates a portal and a platform that provide access to
digital objects from German cultural heritage and research institutions. It brings together
specialists from archives, museums, libraries as well as research, monument protection and media
institutions in a Competence Network, funded by federal, state and local authorities. The full
version of the portal was launched in March 2014. Besides being the main access point to
digitized cultural and academic objects from Germany the DDB aims to become the German
aggregator for Europeana, the central access point to Europe´s digitized cultural heritage.
Europeana is operated by the Europeana Foundation and provides public services like the
Europeana portal1. It accumulates and distributes metadata on digital collections from data
providers across Europe, for example the DDB. Europeana encouraged the DDB to change the
basis for its internal metadata model from CIDOC-CRM to the Europeana Data Model (EDM).
EDM is a linked data compliant model developed by Europeana. It uses properties and classes of
different namespaces, i. e. terms of the Dublin Core Metadata Element Set, the DCMI Terms and
the OAI-ORE (EDM Definition, 2013). The DDB metadata model also uses properties and
classes defined by Europeana taking into account the event-based modelling of object lifecycles
1 URL to Europeana portal: http://www.europeana.eu/portal/
in CIDOC-CRM, however, these descriptions are less complex than in CRM (CRM Definition,
2014). In 2013/14 the DDB replaced CRM with EDM. As a result, mappings to the internal DDB
format became less complex which reduces costs of metadata transformations. Using EDM also
enables the reusability of Europeana tools. This report presents different applications on the basis
of EDM in the DDB and describes the extensions of the model for DDB purposes. With this
example, we want to illustrate that EDM is suitable as a domain model for the representation of
digital cultural heritage. This model can also be used beyond the purpose of delivering metadata
to Europeana. Other projects which adapted or extended EDM for their purpose are for instance
The European Library2, Digitised Manuscripts to Europeana3 or Europeana Fashion4.
2. Use of EDM in the DDB
The requirements of the DDB concerning the data model are a result of the EDM triples’
functions in the DDB. EDM in the DDB (in the following called DDB-EDM) is used
• for an advanced and facet-based search in the DDB portal,
• to represent the hierarchical organization of the digitized objects,
• to interlink objects and authorities, and
• to publish the data via OAI-PMH and an Application Programming Interface (API).
2.1. Facets
The facet-based search enables users to filter their search results by means of defined
categories.
FIG. 1. Facets in the DDB Portal
The categories are based on the classes edm:TimeSpan for time, edm:Place for location,
edm:Agent and dcterms:ProvenanceStatement for person/organization and data provider,
skos:Concept for keyword, media type and sector, and dcterms:LinguisticSystem for language. In
a next step, some of these categories will be refined using triples and controlled terms specifying
the relation between an object and a place, time, person or organization.5 This will allow users to
distinguish between the “aboutness” of an object and information concerning its lifecycle and
help them to differentiate whether it is the time and place of creation or modification, whether a
person was involved in the finding or the destruction of an object etc.
2 For a project description, see http://dm2e.eu.
3 For a project description, see http://www.theeuropeanlibrary.org/tel4/.
4 For a project description, see http://www.europeanafashion.eu/portal/home.html.
5 For the specification of these relations the DDB uses URIs of the event vocabulary developed by the LIDO Community, see http://terminology.lido-schema.org/eventType.
2.2. Hierarchy
To describe the hierarchical relations between objects – e. g. the hierarchy of resources from
libraries or archives – we use two classes to express the different nodes of a hierarchy in the user
interface: the edm:ProvidedCHO for objects with a proper name or description (e. g. monographs,
journals, articles, illustration) and the edm:PhysicalThing for nodes that are not described with a
proper name or description but are needed to express the hierarchical structure (e. g. an issue).
We use a domain specific property called ddb:hierarchyPosition for the description of the
order of resources inside the hierarchy. Besides this property, DDB-EDM includes
edm:isNextInSequence for compliance with the EDM used in Europeana6.
FIG. 2. Description of an edm:ProvidedCHO in DDB-EDM
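For illustration, the hierarchy modelling described above might be expressed with rdflib as in the following sketch; the ddb namespace URI, the identifiers, and the use of dcterms:isPartOf to connect the nodes are assumptions.

```python
# An rdflib sketch of the hierarchy modelling described above: an article
# (edm:ProvidedCHO) inside an issue node (edm:PhysicalThing), ordered with
# ddb:hierarchyPosition. The ddb namespace, identifiers, and dcterms:isPartOf
# link are assumptions for illustration.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EDM = Namespace("http://www.europeana.eu/schemas/edm/")
DCTERMS = Namespace("http://purl.org/dc/terms/")
DDB = Namespace("http://www.deutsche-digitale-bibliothek.de/edm/")  # assumed namespace

g = Graph()
g.bind("edm", EDM)
g.bind("dcterms", DCTERMS)
g.bind("ddb", DDB)

issue = URIRef("http://example.org/item/issue-3")       # node without a proper title
article = URIRef("http://example.org/item/article-1")

g.add((issue, RDF.type, EDM.PhysicalThing))
g.add((article, RDF.type, EDM.ProvidedCHO))
g.add((article, DCTERMS.isPartOf, issue))
g.add((article, DDB.hierarchyPosition, Literal("1")))
g.add((article, EDM.isNextInSequence, URIRef("http://example.org/item/article-0")))

print(g.serialize(format="turtle"))
```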
2.3. Interlinking with Authorities
We use EDM to interlink DDB objects with resources from external data sources. As a first
step, we connected DDB objects with person authority files from the Integrated Authority File
(Gemeinsame Normdatei, GND). To establish the relations we exploit only GND URIs which are
delivered in the original metadata. For persons who play a role in the lifecycle of an object (e. g.
author), we extended EDM with the CIDOC-CRM property P11_had_participant. For the
inverse relation, i. e. from a GND person to DDB objects, we use the EDM property
edm:wasPresentAt. Furthermore, we use the Dublin Core property dcterms:subject for
persons who are described or depicted by the object. To exploit the information behind the respective
GND URIs and to offer person pages in the DDB portal, we apply the web service Entity Facts7
offered by the German National Library. It allows other applications to integrate and interlink
information from GND entities with their data sources. Entity Facts implements data
enrichment: different data sources (e. g. external links from BEACON files or images of
persons from Wikipedia) are merged into a simple and easy-to-use JSON-LD fact sheet. The first
version delivers information on entities of the GND entity type Person via an API. Subsequent
versions will supply information on places and corporations as well. The GND is widely used in
the library community and less represented in other sectors. Therefore, the DDB is developing an
assessment tool8 that will support users in comparing, matching and mapping their domain-specific
vocabularies to the GND in a semi-automatic way.
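A minimal sketch of querying Entity Facts for a GND entity is shown below, using the Goethe example from the footnote; no assumptions are made about the structure of the returned fact sheet beyond it being JSON.

```python
# A minimal sketch of calling the Entity Facts service for a GND entity. The
# URL follows the Goethe example given in the footnote; the structure of the
# returned JSON-LD fact sheet is not assumed beyond it being JSON.
import requests

GND_ID = "118540238"  # Johann Wolfgang von Goethe
url = f"http://hub.culturegraph.org/entityfacts/v1/{GND_ID}"

response = requests.get(url, headers={"Accept": "application/json"}, timeout=10)
response.raise_for_status()
fact_sheet = response.json()

# Inspect the top-level keys of the fact sheet rather than assuming its schema.
print(sorted(fact_sheet.keys()))
```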
2.4. Publication as Linked Data
We provide metadata of the cultural heritage institutions in the DDB-EDM RDF/XML format
by applying linked data principles. We use URIs to uniquely identify different resources and their
relations in RDF. Therefore, we transfer URIs from the original metadata records during the
mapping to EDM whenever possible. Apart from the GND, we take URIs from vocabularies
6 For information about hierarchies in Europeana see Task Force on hierarchical objects, 2013.
7 For an example see the query for “Johann Wolfgang von Goethe” at http://hub.culturegraph.org/entityfacts/v1/118540238.
8 The assessment tool is developed by digiCULT, a project partner of the DDB, see http://www.digicult-verbund.de/.
which are available as Linked Open Data, like Iconclass9, Dewey Decimal Classification10 or the
Library of Congress vocabularies11. We also create URIs, for instance by adding a namespace to a
code or identifier provided in the original metadata record (e.g. ISO 639-2 code “eng” to
“http://id.loc.gov/vocabulary/iso639-2/eng”). Moreover, we include URIs from the ddb-vocnet
namespace into EDM properties to receive controlled terms for the search in the DDB portal.
This affects mostly properties, which express the type of a resource (e.g. type of a digital
representation of an object). For the identification of some resources, however, it was necessary
to additionally establish DDB-internal URIs. These URIs have a DDB-namespace and are created
on the basis of common rules for respective DDB resources (e. g. resource class
name/ISIL12/local identifier). In order for external users to recognize non-resolvable DDB-internal
URIs, they are encoded by a hash (e.g. EO5NPTOTBJL4V3RXVRLXE7YME7HY6DCW as can be
seen in figure 2).
DDB-EDM RDF/XML records contain the results of our normalization and enrichment
processes. An example is the use of DDB license URIs for both the metadata record and the
digital object. The DDB licenses, which are compliant with the Europeana Licensing Model, give
external users information whether and how they can reuse the metadata and digital objects. The
DDB publishes its metadata records under the CC0 Public Domain Dedication license via its
API13. This allows the development of further applications by using DDB metadata. Even though
the DDB-API supplies the metadata in different XML formats (source format, DDB-EDM, DDB-View),
DDB-EDM is considered the most harmonized, interlinked and enriched representation
of the metadata describing the objects. An application on the basis of the DDB-API is
“Archivportal-D14” – a portal which provides a view on the DDB content and metadata from an
archival perspective. The DDB also delivers EDM metadata sets under the CC0 license via an
OAI-PMH interface to Europeana. The interface is open to the public as well.
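The URI-creation step mentioned above for ISO 639-2 codes can be sketched as follows.

```python
# A sketch of the URI-creation step described above: a code from the source
# record is prefixed with a namespace to obtain a linked-data URI. The example
# is the one given in the text.
ISO639_2_NAMESPACE = "http://id.loc.gov/vocabulary/iso639-2/"

def language_uri(code: str) -> str:
    """Turn an ISO 639-2 code such as 'eng' into a Library of Congress URI."""
    return ISO639_2_NAMESPACE + code.strip().lower()

print(language_uri("eng"))  # http://id.loc.gov/vocabulary/iso639-2/eng
```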
3. Mapping Workflow
The workflow to integrate metadata sets from institutions into the DDB consists of three main
steps: 1) clarification of formal and content-specific aspects, 2) data clearing, and 3) ingest. An
institution willing to participate has to fill out a content questionnaire including information about
the holding/collection and the metadata format (MARC, METS/MODS, ESE, EAD, LIDO et al.).
The data clearing begins with the analysis of test data and the adjustment of mapping rules. The
original metadata is transformed with XSLT scripts to all DDB target formats, comprising EDM.
All metadata representations of an object record are structured in the container format Cortex
defined by the DDB. After the ingestion into the DDB test system, data experts review the quality
of the transformation result in the test portal and in an XML preview. To support quality control,
the DDB is implementing a validation tool. Data clearing is an iterative process with several
cycles of reviews and adjustments. After approval by the data provider the complete data
contribution is ingested into the DDB backend and published via the DDB frontend (portal) and
other public interfaces.
The switch from CRM to EDM had a strong impact on our mapping workflow and back-end
operations. Since the data sets from all providers that were published via the DDB at that time
had to be represented in the new DDB-EDM data format, a substantial one-time effort was needed to adjust
all respective steps in our workflow. These were: a) the definition of new rules to map the elements
and their contents from seven source formats to EDM, b) the indication of provider specific
9 A classification system for art and iconography. For further information see http://www.iconclass.nl/home.
10 See http://dewey.info/.
11 See http://id.loc.gov/.
12 ISIL is an acronym for International Standard Identifier for Libraries and Related Organisations. The registration for German institutions is managed by the German ISIL and Library Codes Agency at the Staatsbibliothek zu Berlin.
13 The API of the DDB is documented in the wiki space “API der Deutschen Digitalen Bibliothek”, available under the URL: https://api.deutsche-digitale-bibliothek.de/doku/display/ADD/API+der+Deutschen+Digitalen+Bibliothek.
14 The development of Archivportal-D is funded by Deutsche Forschungsgemeinschaft (DFG). The portal will be launched publicly in September 2014. For a project description in German language see http://www.landesarchiv-bw.de/web/54267.
information in the mappings, c) the adaptation of the transformation tools including the
programming of new XSLT scripts, d) the adjustment of the SOLR schema, e) the configuration
of the search facets and hierarchies for the frontend, f) the transformation, ingestion and indexing
of the complete DDB holdings which comprised around six million records in 2013.
Even though we installed a process that ensured that CRM and EDM records could be ingested in
parallel, a few concessions had to be made. For instance, we prioritized the change of the
published data sets to EDM. This resulted in a slower increase of content in the DDB since few
resources were left for new ingests.
However, the introduction of DDB-EDM decreased the workload for the conceptual and
technical mappings considerably. The establishment of mappings to CRM required expert
knowledge. Our domain experts, however, were more familiar with EDM because they were
already involved in mapping activities for contributing metadata to Europeana via other projects.
Furthermore, with EDM the mappings became less complex and less error-prone, because in
CRM a statement can be expressed in many ways, which often resulted in a series of triples. For
example, to state that an object is about a person the mapper had to opt for one of the following
paths in CRM:
• E89 Propositional Object (or Subclass) P67F refers to E39 Actor (or Subclass)
• E89 Propositional Object (or Subclass) P129F is about E39 Actor (or Subclass)
• E24 Physical Man-Made Thing (or Subclass) P62F depicts E39 Actor (or Subclass)
We map this statement to DDB-EDM as follows:
• edm:ProvidedCHO dcterms:subject edm:Agent
This example shows that we lost precision in DDB-EDM regarding semantic relations, because
the CRM properties “refers to”, “is about” and “depicts” were merged into the single EDM
property “dcterms:subject”. But this generic property is sufficient to distinguish the “aboutness”
from the lifecycle of an object which is the crucial requirement for our search facets. This
decision was also reasonable regarding the time saved for mappings, the processing of records
and thus the ingestion of data contributions.
4. The DDB-EDM Model
The decision to minimize the transformation costs by using EDM in the DDB raised some
difficulties. Coming from the event based CIDOC-CRM, the DDB needed properties and classes
to describe the events in the lifecycle of the digitized resource. Such properties and classes were
available in EDM, but at that time Europeana had not yet implemented them entirely, especially
not the necessary event class and its associated properties. Therefore we developed a DDB-EDM
model that was an extension of the implemented Europeana EDM described in the Europeana
Mapping Guidelines (EDM Mapping Guidelines, 2013).
FIG. 3. DDB-EDM model
Figure 3 gives an overview of the properties and classes used in the DDB. Properties and
classes used in the DDB, but not implemented in Europeana in 2013, are colored green. This
concerns all statements about edm:Event and edm:PhysicalThing. Properties and classes used in
the DDB, but not compliant with the Europeana Model, are red. We use these terms for domain
specific requirements. These are:
• dcterms:rights with ore:Aggregation as domain and dcterms:RightsStatement as range, used for rights statements about the metadata. Depending on the value of this property the metadata will be provided by the DDB to Europeana or not15,
• ddb:aggregator with ore:Aggregation as domain and edm:Agent as range, used for the aggregator providing data to the DDB16,
• dcterms:rights with edm:WebResource as domain and dcterms:RightsStatement as range, used for DDB specific rights statements,
• dcterms:language with edm:ProvidedCHO as domain and dcterms:LinguisticSystem as range, used to describe the language of the resource with non-literal values17,
• dcterms:subject with edm:ProvidedCHO as domain and a non-literal value as range which may be an instance of one of the EDM conceptual classes edm:Agent, edm:Place, edm:TimeSpan etc.18,
• ddb:hierarchyType with edm:ProvidedCHO as domain and a literal value as range, used to describe the object type of an edm:ProvidedCHO or edm:PhysicalThing as part of a hierarchy (e.g. journal, volume, article, illustration). Values used here are based on a vocabulary that will be published as Linked Open Data in the future, which will result in a revision of the DDB-EDM model,
• ddb:hierarchyPosition with edm:ProvidedCHO as domain and a literal value as range, used to describe the order of an edm:ProvidedCHO or edm:PhysicalThing in a hierarchy,
• ddb:aggregationEntity with edm:ProvidedCHO as domain and a literal value as range, used to distinguish between hierarchical levels with proper descriptions and levels without such descriptions (e.g. an issue that is only identified by the number),
• rdf:type with edm:Agent as domain and skos:Concept as range, used to describe the relation between a corporate body and the type of sector it belongs to, and
• crm:P11_had_participant with edm:Event as domain and edm:Agent as range, used to describe that there is a relation between an event and an agent (e.g. the creation event and the creator).

15 Metadata are only exposed to Europeana or others when the value is “CC0”.
16 Europeana uses edm:provider for Europeana aggregators which in our case is the DDB. Because the property is not repeatable the DDB needs a domain specific property for the description of DDB aggregators.
17 Europeana uses dc:language and allows the use of literal and non-literal values whereas the use of URIs in the DDB is mandatory.
18 Europeana uses dc:subject and allows the use of literal and non-literal values whereas the use of URIs in the DDB is mandatory.
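As an illustration only, the following rdflib sketch shows how a few of these statements could be serialized; the edm and ore namespace URIs are the published ones, but the ddb namespace URI, the resource URIs and the literal values are hypothetical placeholders, not taken from the DDB.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

EDM = Namespace("http://www.europeana.eu/schemas/edm/")
ORE = Namespace("http://www.openarchives.org/ore/terms/")
DDB = Namespace("http://example.org/ddb/")  # hypothetical namespace URI

g = Graph()
for prefix, ns in (("edm", EDM), ("ore", ORE), ("ddb", DDB), ("dcterms", DCTERMS)):
    g.bind(prefix, ns)

aggregation = URIRef("http://example.org/aggregation/1")
cho = URIRef("http://example.org/item/1")

# Aggregation level: rights statement about the metadata and the DDB aggregator.
g.add((aggregation, RDF.type, ORE.Aggregation))
g.add((aggregation, DCTERMS.rights, URIRef("http://example.org/rights/CC0")))
g.add((aggregation, DDB.aggregator, URIRef("http://example.org/agent/aggregator")))

# Object level: hierarchy type and position as literal values.
g.add((cho, RDF.type, EDM.ProvidedCHO))
g.add((cho, DDB.hierarchyType, Literal("article")))
g.add((cho, DDB.hierarchyPosition, Literal("3")))

print(g.serialize(format="turtle"))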
5. Conclusion and Outline
The implementation of EDM has turned out to be the most effective way to serve the
requirements of the DDB portal for functions based on linked data principles and external
applications like Europeana. Prospectively, DDB-EDM will also contain the results of further enrichment and normalization processes that the DDB is currently establishing for authority data and controlled vocabularies, which will subsequently improve the portal as well.
References
CRM Definition (2014). Definition of the CIDOC Conceptual Reference Model, Version 5.1.2. Retrieved April 24, 2014 from http://cidoc-crm.org/docs/cidoc_crm_version_5.1.2.pdf
EDM Definition (2013). Definition of the Europeana Data Model, version 5.2.4. Retrieved April 24, 2014 from http://pro.europeana.eu/edm-documentation
EDM Mapping Guidelines (2013). Europeana Data Model – Mapping Guidelines, Version 2.0. Retrieved April 29, 2014 from http://pro.europeana.eu/edm-documentation
Task Force on hierarchical objects (2013). Recommendations for the representation of hierarchical objects in Europeana. Retrieved April 29, 2014 from http://pro.europeana.eu/web/network/europeana-tech//wiki/Main/Taskforce+on+hierarchical+objects
Distributed Metadata Environments & Aggregation—Part B
Designing a Multi-level Metadata Standard based on Dublin Core
for Museum data
Jing Wan
Beijing University of Chemical Technology, China
[email protected]
Yubin Zhou
Beijing University of Chemical Technology, China
[email protected]
Gang Chen
Beijing Gehua Culture Development Group, China
[email protected]
Junkai Yi
Beijing University of Chemical Technology, China
[email protected]
Abstract
Metadata is a critical aspect of describing, managing and sharing museum data. It is challenging
to develop a general metadata schema that meets the requirements of different museums due to
the large range of data types. The capability of concise description and the simplicity of use need
to be considered. In this paper, we report on a completed project that aims to design a metadata schema for museums in China. An extensible metadata standard based on Dublin Core is presented, which includes core metadata, extension rules and specific metadata. For the core metadata, we introduce terms, definitions, registration rules and detailed examples of description.
The principle of choosing terms and refinements is discussed. A specific metadata schema for
porcelain is discussed as an extension example.
Keywords: metadata; Dublin Core; museum
1. Introduction and Motivation
With the rapid development of information technology since the 1990s, many museums have adopted collection management systems, digitized collection data, and provided public access. Data sharing and integration among museums became important.
Metadata is defined as “structured data about data”. As a key issue of data standardization and
data sharing, metadata for cultural heritage has attracted worldwide attention. A number of organizations and initiatives have made great efforts to address this issue. Some published metadata schemas have been widely used and accepted as international standards, for example, Dublin Core (DCMI, 2012), CDWA (Getty Research Institute, 2008), EDM (Europeana Foundation, 2013), CIDOC CRM (CIDOC CRM Special Interest Group, 2011), VRA Core (Visual Resources Association Data Standards Committee, 2007), EAD (Society of American Archivists and the Library of Congress, 2002), and FGDC/CSDGM (Federal Geographic Data Committee, 1998).
China’s management system of cultural relics is different from that of other countries. Most
cultural relics are owned by the state and under the protection of the state. A state department
takes charge of the work concerning cultural relics throughout the country. Beginning in 1978, a series of regulations was published by China’s State Administration of Cultural Heritage, which aimed to establish a standard process for registering and compiling files for museum collections. Many government-funded projects have promoted the work of museum informatics. The project “Cultural
Relics Census and Collection Management System Construction” started in 2001, with 48,006
pieces of valuable collections and 1,370,000 pieces of general collections recorded in the
database by 2010. In 2012, the project “First National Movable Cultural Relics Census” started,
which aimed to investigate, identify and register movable relics through information technology.
Many museums in China have established collection management systems and progressively digitized their collections, such as the Palace Museum, the Capital Museum, and the Shanghai Museum. Some museums designed their own data specifications, and several specifications were published by the government, for example, “Data Specification for Museum Collections”,
“Standard for Image Archive of Unmovable Cultural Relics”, “Data Specification for the Third
National Heritage Sites Census”, and “Data specification for the First National Movable Cultural
Relics Census”.
However, there is still no national standard for museum data in China. Considering the different management systems, it is difficult to utilize existing metadata schemas without modification. Moreover, museums differ in collection types, collection quantities, data quality and the skill levels of staff, so different requirements for metadata need to be considered. The metadata schema should be capable of concise description, be simple to use, and be compatible with the published specifications.
We describe an effort to develop a metadata architecture that addresses this issue. In the project, we design the core metadata based on Dublin Core, and specific metadata extensions for drawings, porcelain, ancient buildings and inscriptions. For each of these categories, we provide terms, definitions, refinements, registration rules and detailed samples. In this paper, we focus on the core metadata and describe one specific example of metadata extension.
2. Metadata Architecture
Figure 1 shows the metadata architecture, which includes the core metadata, specific metadata
and extension rules.
FIG. 1. Metadata architecture: core metadata (required and optional elements), extension rules (reuse, deletion, horizontal extension, vertical extension), and specific metadata (movable relics such as porcelain and drawings; immovable relics such as ancient buildings and inscriptions).
The core metadata is simple and based on the Dublin Core Metadata Element Set, version 1.1 (DCMI, 2012). This level is used to describe the general core attributes of digital resources. It supports retrieval, integration and data exchange. The elements of this level are easy to use. A museum with simple data can use it directly, and a museum with a large number of collections and complex data structures can use it as the first stage of a plan. For these museums, entering complete data usually takes several years or even decades. First make it work, and then make it better. This rule helps to motivate the staff, gain support from other divisions, and accumulate experience.
The specific metadata is used for data sets of a particular type or domain. It is designed by analyzing existing archives and possible data requirements coming from museum management.
The extension rules are used to extend the metadata to meet the actual requirements of a specific museum. Rules and implementation approaches need to be provided to guide users in customizing metadata.
3. Core Metadata
3.1. Approach
The Core Metadata consists of an element set and qualifiers. It is a vocabulary of nineteen properties for use in describing digital museum collections. “Core” means its elements are generic and usable for describing a wide range of museum data.
Taking into account versatility, scalability, and interoperability, we design the core based on the Dublin Core Metadata Element Set. In addition, data specifications and published standards in China are considered. Existing data, whether stored in databases or on paper, are analyzed. We also consider the data elements adopted by the cultural relics census. Using this approach, we adopt eleven elements from Dublin Core and add eight elements and qualifiers.
Element qualifiers make the meaning of an element narrower or more specific. Following the
practice of Dublin Core Qualifiers, there are two classes of qualifiers: element refinements and encoding schemes. The element refinements include the object qualifier, basic qualifiers, and composite qualifiers.
1. Object qualifier. The metadata should be capable of describing both movable and immovable cultural relics, but these two types of relics differ greatly. This qualifier is used to describe the range of an element.
2. Basic qualifier. It is the basic unit of a qualifier and cannot be extended.
3. Composite qualifier. It consists of basic qualifiers and/or composite qualifiers. For example, the copyright of an image has a composite qualifier including three basic qualifiers: owner, copyright restriction, and copyright description.
We define each element and qualifier by nine properties, which are name, identifier, version,
definition, repeatability, data type, required status, domain, and qualifier.
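Purely as an illustration (the project's actual registration format is not shown in the paper), the nine registration properties could be captured in a structure such as the following; the sample values are hypothetical.

from dataclasses import dataclass, field
from typing import List

@dataclass
class TermDefinition:
    """One core-metadata element or qualifier, registered with nine properties."""
    name: str
    identifier: str
    version: str
    definition: str
    repeatability: bool      # may the term occur more than once in a record?
    data_type: str           # e.g. "string", "date"
    required: bool           # required status
    domain: str              # object qualifier: movable, immovable, or both
    qualifiers: List[str] = field(default_factory=list)

# Hypothetical registration of the Name element and its two refinements.
name_element = TermDefinition(
    name="Name",
    identifier="core.name",
    version="1.0",
    definition="The name given to a collection item.",
    repeatability=False,
    data_type="string",
    required=True,
    domain="movable and immovable cultural relics",
    qualifiers=["Registered Name", "Alternative Name"],
)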
3.2. Element Set
The element set of the core metadata includes nineteen terms. We adopt terms from the Dublin Core Metadata Element Set, version 1.1, with the exception of language, contributor, publisher and source (DCMI, 2012). For collections in China, the language element always has the value “Chinese”, so we do not adopt it at present. The contributor element and the publisher element of a collection are the same as its keeper, which is included in the element rights, so we do not adopt contributor or publisher. We omit the element source because it has no value for a collection. Table 1 shows the correspondence between the core metadata elements and the Dublin Core metadata elements. Many of these terms have basic constraints.
We provide standard vocabularies for some elements. For example, the grade of movable cultural relics includes the values “grade one”, “grade two”, “grade three”, “not determined”, and “normal”. These terms are defined in the standard “Grading Standard For Cultural Relics” published by China’s Ministry of Culture.
TABLE 1: Alignment of the core metadata element set and DC element set.

Term | DC Element | Refinements
Name | DC: Title | Registered Name, Alternative Name
Identifier | DC: Identifier |
Type | DC: Type |
Date | DC: Date |
Subject | DC: Subject |
Description | DC: Description |
Creator | DC: Creator |
Coverage | DC: Coverage | Geographic Coordinate, Scope Coordinates (Measure Point Number, Measure Point Coordinates, Adjacent Measure Point), Geographic Name
Comment | |
Right | DC: Rights | Ownership Type, Affiliation
Relation | DC: Relation | Image, Reference, Component
Material | | Material Type, Specific Material
Acquisition | | Approach, Enter Scope, Enter Date
Grade | |
Measurement | DC: Format | Dimension (Length, Width, Height), Weight, Distribution Area, Protection Scope Area, Building Area, Construction Control Zone Area
Conservation | | Residual Level, Conservation Status, Status Assessment
Quantity | |
Condition | | Use Unit, Subordination Unit
Environment | | Natural Environment, Humanities Environment
DamageCause | | Natural Cause, Man-made Cause
4. Extension Rules
Because of the large range of museum collections, it is hard for the core metadata to meet the description needs of every item, so we design extension rules to generate more specific metadata. We provide the design of four specific metadata schemas, each of which includes terms, definitions, registration rules and detailed examples.
There are four classes of extension approach (a minimal illustration follows this list):
Reuse. Refers to adopting existing elements or refinements of the core metadata. It includes complete reuse and partial reuse: complete reuse indicates adoption without modification, while partial reuse adds some restrictions.
Deletion. Refers to deleting elements or refinements that are not needed at this level.
Horizontal extension. Refers to adding a new element.
Vertical extension. Refers to adding refinements to an existing element.
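The following minimal sketch, which is not part of the standard itself, shows one way the four approaches could be applied to derive a specific schema from the core element set; all function and variable names are hypothetical.

def extend_schema(core, reused, deleted, added_elements, added_refinements):
    """Derive a specific metadata schema from the core element set.

    core: dict mapping element name -> list of refinements
    reused: element names adopted from the core (complete or partial reuse)
    deleted: (element, refinement) pairs to drop (deletion)
    added_elements: new element names (horizontal extension)
    added_refinements: (element, refinement) pairs to add (vertical extension)
    """
    specific = {name: list(refs) for name, refs in core.items() if name in reused}
    for element, refinement in deleted:
        specific[element].remove(refinement)
    for element in added_elements:
        specific.setdefault(element, [])
    for element, refinement in added_refinements:
        specific[element].append(refinement)
    return specific

# Hypothetical fragment of the porcelain extension described in the next section.
core = {"Name": ["Registered Name", "Alternative Name"],
        "Coverage": ["Geographic Coordinate", "Scope Coordinates", "Geographic Name"]}
porcelain = extend_schema(core,
                          reused={"Name", "Coverage"},
                          deleted=[("Coverage", "Scope Coordinates")],
                          added_elements=[],
                          added_refinements=[("Name", "Original Name")])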
5. Metadata for Porcelain
The specific metadata for porcelain is an example of how the extension rules are applied. Table
2 shows how the specific metadata for porcelain is extended from the core metadata. It includes
sixteen elements. The following are examples of the four extension rules applied in the porcelain metadata:
1. Reuse. The element name and its two refinements (registered name and alternative name) from the core metadata are included in the specific metadata; this is complete reuse. The element grade from the core metadata is also included, but the value range of the element grade is changed, so it is partial reuse.
2. Deletion. The element coverage has three refinements in the core metadata. We delete one refinement (scope coordinates) in the specific metadata because it is not applicable to porcelain.
3. Horizontal extension. There is no horizontal extension.
4. Vertical extension. The element name has a new refinement (original name). We add it because many original names of porcelain collections are revised in order to conform to the naming rules published by the authority. The revised name is the registered name of a collection, but sometimes the original name is well known, so we need to record it as well.
TABLE 2: Specific metadata for porcelain.

Index | Term | Refinements | Extension
1 | Name | Registered Name, Alternative Name, Original Name | Complete Reuse + Vertical Extension
2 | Identifier | | Complete Reuse
3 | Type | | Part Reuse
4 | Date | Manufacture Date, Use Date | Vertical Extension
5 | Subject | | Vertical Extension
6 | Description | This term has 17 refinements. | Vertical Extension
7 | Creator | Name, Gender, Native Place, Birth, Death, Creator Description | Vertical Extension
8 | Coverage | Geographic Coordinate, Geographic Name | Deletion + Complete Reuse
9 | Right | Ownership Type, Affiliation | Complete Reuse
10 | Relation | Image, Reference, Component | Complete Reuse + Vertical Extension
11 | Material | Material Type, Specific Material | Complete Reuse
12 | Acquisition | This term has 12 refinements. | Complete Reuse + Vertical Extension
13 | Grade | | Part Reuse
14 | Measurement | Dimension (Length, Width, Height), Weight | Deletion + Complete Reuse
15 | Conservation | Current Condition, Natural Damage, Physical Damage, Remarks, Citations | Complete Reuse + Vertical Extension + Deletion
16 | Quantity | | Complete Reuse
6. Conclusions and Future Work
This paper introduces a project aimed at designing an extensible metadata standard for museum data in China. We consider the capability of concise description and the simplicity of use. We present a standard comprising core metadata, extension rules, and specific metadata. The core metadata is based on Dublin Core and is easy to use; it includes nineteen elements and their refinements. There are four extension approaches: reuse, deletion, horizontal extension and vertical extension.
In the future, we plan to develop a metadata management system, which will help museums to customize the metadata element set for their applications. We also plan to enhance the use of standard vocabularies and make them compatible with international standards.
Acknowledgements
We would like to express our thanks to the Capital Museum for providing data samples. We also gratefully acknowledge the contributions of Hu Chui from the Palace Museum, Xiao Fei from the National Museum of China, and Shen Guihua from the China Cultural Heritage Information and Consulting Center to the discussion of the metadata. This work is financed by the State Cultural Relics Bureau within the project “A Study of a Metadata Standard for Museum Data in China”.
References
China's State Administration of Cultural Heritage. (2001). The Data Specification for Museum Collections. Retrieved April 1, 2014, from http://www.nach.gov.cn/art/2008/7/9/art_343_3636.html.
China's State Administration of Cultural Heritage. (2005). The Standard for Image Archive of Unmovable Cultural Relics. Retrieved April 1, 2014, from http://www.sach.gov.cn/art/2008/7/8/art_343_3633.html.
China's State Administration of Cultural Heritage. (2007). The Data Specification for the Third National Heritage Sites Census. Retrieved April 1, 2014, from http://pucha.sach.gov.cn/tabid/64/Default.aspx.
China's State Administration of Cultural Heritage. (2012). The Data Specification for the First National Movable Cultural Relics Census. Retrieved April 1, 2014, from http://www.wenwu.gov.cn/kydwwpc/.
CIDOC CRM Special Interest Group (SIG). (2011). Definition of the CIDOC CRM. Retrieved April 1, 2014, from http://www.cidoc-crm.org/definition_cidoc.html.
DCMI. (2012). DCMI Metadata Terms. Retrieved April 1, 2014, from http://dublincore.org/documents/dces/.
Europeana Foundation. (2013). EDM Definition. Retrieved April 1, 2014, from http://pro.europeana.eu/documents/900548/770bdb58-c60e-4beb-a687-874639312ba5.
Federal Geographic Data Committee. (1998). Content Standard for Digital Geospatial Metadata. Retrieved April 1, 2014, from http://www.fgdc.gov/standards/projects/FGDC-standards-projects/metadata/base-metadata/index_html.
Getty Research Institute. (2008). Categories for the Description of Works of Art (CDWA). Retrieved April 1, 2014, from http://www.getty.edu/research/publications/electronic_publications/cdwa/.
Metadata Architecture and Application Team and Taipei National Palace Museum. (2007). The Information Requirements Specification for Taipei National Palace Museum. Retrieved April 1, 2014, from http://www2.ndap.org.tw/eBook08/showContent.php?PK=51.
Society of American Archivists and the Library of Congress. (2002). Encoded Archival Description. Retrieved April 1, 2014, from http://www.loc.gov/ead/.
Visual Resources Association Data Standards Committee. (2007). VRA Core 4.0 Introduction. Retrieved April 1, 2014, from http://www.loc.gov/standards/vracore/schemas.html.
"Lo-Fi to Hi-Fi": A New Way of Conceptualizing Metadata in
Underserved Areas with the eGranary Digital Library
Deborah Maron
UNC Chapel Hill, USA
[email protected]
Cliff Missen
UNC Chapel Hill, USA
[email protected]
Jane Greenberg
Metadata Research Center
[email protected]
Abstract
Digital information can bridge age-old gaps in access to information in traditionally underserved
areas of the world. However, for those unfamiliar with abundant e-resources, their early exposure
to the digital world can be like “drinking from a fire hose.” For these audiences, abundant
metadata and findability, along with easy-to-use interfaces, are key to their early success and
adoption. To hasten the creation of metadata and user interfaces, the authors are experimenting
with “crowd cataloging.” This report documents their experimental and intended work and Maron’s Lo-Fi to Hi-Fi metadata pyramid model, which guides a developing metadata initiative being pursued with the eGranary Digital Library, the technology used by WiderNet in a global effort to ameliorate information poverty. The work in development, the Lo-Fi to Hi-Fi model, has principles adapted from technical design processes and tried-and-true methods within metadata creation such as crowdsourcing. It attempts to reconceptualize the metadata modeling paradigm
and aligns with research that has shown that community-based librarians are better poised to
identify culturally congruent resources, but many require significant training in metadata
concepts and skills. The model has amateurs (mostly students) crowdsource “lo-fi” terms, which
domain experts and information professionals can curate and cull in “hi-fi” to enhance findability
of resources within the eGranary while simultaneously honing their own computer, information
and metadata literacies. Though the focus here is on Africa, the findings and practices can be
universalized to off-line collections around the globe.
Keywords: information literacy; computer literacy; WiderNet; eGranary; Africa; metadata
literacy; crowdsource; crowd catalog; folksonomy; LIS education; industrial design; hi-fi
prototype; lo-fi prototype
1. Introduction
Many citizens of first world nations have become accustomed to, and even routinely take for
granted, accessing information immediately and easily through the Web and mobile technologies.
With years of experience under their belts, they blithely operate search engines, barely recalling
how they developed their search skills over hundreds of hours of Internet use. However, not all
individuals have that luxury, with over five billion people lacking access to Internet resources.
Information poverty, begotten of a lack of technology and knowledge of how to use it, is a
pervasive problem that affects quality of life, as well as development of crucial 21st century skills
like information and computer literacy, for children and adults around the globe.
Lack of the aforementioned literacies is an information and library science (ILS) issue that
impacts many librarians and library workers in all aspects of work including metadata creation
and management. More specifically “metadata literacy” is adversely affected by the conditions in
environments in which workers lack sufficient computer or information literacy.
The focus of the work reported here is the rural, indigenous sub-Saharan library, where a
dearth in literacy is exacerbated by a lack of connectivity to the Internet. Driving questions
include the following: How should metadata for essential resources be developed in these and
similarly afflicted regional libraries? Also, how can the knowledge and terminology of a
particular society be leveraged within these regions to create metadata? We propose that concepts
from the informal technical design process employed by WiderNet over the last five years to
develop topical portals for end users (in medicine, nursing, public health, rural agriculture, life
skills, disability rights, etc.) be formally expanded and evaluated to provide a first-rate framework
for conceptualizing and delivering metadata formation for people in underserved communities.
This prototyping was further defined as “hi-fi” and “lo-fi” and extrapolated into a model by
Maron in 2014.
We present this model in the context of WiderNet@UNC, an initiative to bring Internet content to places worldwide that are not connected, via massive hard disks of material that mirror what the Internet has available. WiderNet’s hard disks of material, called eGranary Digital Libraries,
are currently used in over 800 locations and contain 35 million items each; while the majority of
the documents can be searched using a built-in term search engine, only a fraction of the items
are catalogued because the onus falls on the small team of paid and volunteer cataloguers to
create records.
In evaluating logs from dozens of eGranary servers, it has been noted that users, generally
unfamiliar with search engines, are more likely to use the limited catalog to locate resources. In
most cases, 90-95% of the documents retrieved were listed in the catalog. Clearly, new users
prefer well-cataloged resources.
This project report details the expanded concept of the Lo-Fi to Hi-Fi metadata pyramid,
representing a method by which metadata can be crowdsourced and curated for resources by the
very people that use and operate eGranaries within underserved areas of Africa in a tiered system;
students and other general users in the respective communities identify folksonomic terms and
useful resources as “suggestions” (lo-fi), which are then winnowed by domain specialists,
approved, and finally become part of the canon of knowledge (hi-fi) in the hands of more expert
catalogers. Hopefully this scheme will imbue metadata and other types of literacy in general users, scholars, practitioners and library professionals, and foster the creation of metadata in regions with eGranaries that critically need it so that more information can be found. It is also expected to reveal culturally congruent metadata that external agents can adopt and employ. The Lo-Fi to Hi-Fi metadata pyramid can also, if successful, be applied globally to other collections and digital libraries in communities facing similar obstacles, as such obstacles are fairly universal. Before delving into the method and model, it is imperative to go over definitions of terms and provide context for the problem.
1.1. Definitions of the terms
Information literacy is defined as “the set of skills needed to find, retrieve, analyze, and use
information.” Those who are information literate “have learned how to learn” and find
information for virtually any task (“Introduction to Information Literacy,” n.d.). Computer
literacy is a term being continually redefined, but Childers writes that “a person is either
computer literate or not based on how proficient they are at some basic computer tasks”
(“Computer Literacy,” n.d.). Finally, metadata literacy is a term coined by Erik Mitchell and
concerns a person’s ability to cultivate adequate metadata for digital objects (Mitchell, 2010).
1.2. Computer and information literacy in areas of sub-Saharan rural Africa
For members of communities around the world, computers are critical in terms of cultivating
skills necessary to be an active, participatory member of the information age. In fact, the
computer as a beacon of hope, with its ability to revolutionize and improve many facets of an African citizen’s life, was recognized in the early 1990s by Oduaran and others, but a lack of computer literacy persists even today (Oduaran, 1991; du Plessis & Webb, 2012). This paucity of computer literacy begets information illiteracy, a problem pervading not just the general, indigenous rural populace in sub-Saharan Africa but the population of teachers and information workers as well (Jager & Nassimbeni, 2007). This problem manifests itself not only in libraries
but also in the issues for which information professionals are to provide information, such as the
AIDS epidemic and prevention. Compounding a lack of computer and information literacy in
certain African libraries is a lack of metadata literacy, a skill not possessed even by many
American library professionals (Park, Tosaka, Maszaros, & Lu, 2010).
2. WiderNet@UNC: Bringing information and literacies to the masses
2.1. Overview of WiderNet
WiderNet@UNC is a research program at the University of North Carolina at Chapel Hill that
focuses on low-cost, high-impact uses of ICT and training modalities for under-resourced communities worldwide. Its sister organization, the WiderNet Project, is a non-profit service
program founded in 2001 that aims to bring educational digital content to places worldwide that
lack adequate Internet connectivity. Using massive hard disks of material that mirror thousands of
World Wide Web sites, WiderNet’s eGranary Digital Libraries are currently used in over 800
locations and contain 35 million items each.
2.2. Metadata Principles and Challenges
Utilizing Dublin Core and Library of Congress (LoC) standards, WiderNet cataloguers have developed a protocol for adding metadata, highlighting resources, and creating user-centric collections for eGranaries. However, only a fraction of the items are findable through the catalogue because the onus falls on the small team of student and volunteer cataloguers to create records. Many more resources could be found and privileged if more metadata were available, and if users and library workers in Africa were contributing to the process.
WiderNet has worked with partners in developing countries to create custom user-centric
“portals” from catalogued records. For example, in 2008 they launched a collaboration with
medical college at the University of Zambia and the School of Public Health at the University of
Alabama to create a portal for teaching health sciences in Zambia. Over 1.5 million documents
were garnered from the inputs of dozens of educators and practitioners around the world (lo-fi)
and then WiderNet librarians cataloged over 2,000 items that had been highlighted by the expert
advisors. Then, in consultation with their Zambian counterparts, they mapped 600 cataloged
items to the Zambia national medical curriculum. Students and instructors were quick to adopt
this curated collection and eventually insisted on it being installed in dozens of other institutions
where they practiced and taught.
In another example, they worked with the United States International Council on Disabilities
and over 100 advocacy groups around the world to create a dozen portals around disability rights
and resources for persons with disability. Over 2.5 million new resources were added to the
eGranary library, and 4,000 items were catalogued, mostly by librarians in the U.S. and Europe.
3. Hi-Fi/Lo-Fi Prototyping: Can the principle be adapted to metadata?
A prototype is defined by Merriam-Webster as an “original or first model of something from
which other forms are copied or developed”, or a “first or early example that is used as a model
for what comes later” (“prototype”, n.d.). It is proposed that methods involving high and low
fidelity prototyping (hereafter called “hi” and “lo” fi) be used as a model for creating and curating
metadata for resources in eGranaries. Egger describes prototypes thus:
Low-fidelity (lo-fi) prototyping is characterized by a quick and easy translation of high-level design concepts into tangible and testable artifacts. Lo-fi is also known as low-tech,
as the means required for such an implementation consist, most of the time, of a mixture
of paper, cardboard, post-it notes, acetone sheets etc. A clear advantage of lo-fi
prototyping is its extremely low cost and the fact that non-programmers can actively be
part of the idea-crystallization process.
At the other extreme, high-fidelity (hi-fi) prototypes are characterised by a high-tech representation of the design concepts, resulting in partial to complete functionality. High-tech, however, implies higher costs, both temporal and financial, and necessitates good programming skills to implement the prototype. The main advantage of hi-fi, high-tech
prototyping is that users can truly interact with the system, as opposed to the sometimes
awkward facilitator-driven simulations found in lo-fi prototyping. Obviously, there is a
continuum from low to high-fidelity prototyping that usually stretches out from early to
late design. (“Lo-Fi vs. Hi-Fi Prototyping,” n.d.)
Lo-fi prototyping, which Egger explains is “cheap, fast and accessible to non-programmers,” aids participants of all levels of computer and information literacy in idea and product formation, and is therefore proposed as the first step of the pyramid process, outlined in Section 4.
4. The Lo-Fi to Hi-Fi Metadata Model: Crowd-Cataloguing the eGranary
This section introduces the model that is being developed, a study that is a result of meetings and
the exchange of ideas at UNC Chapel Hill.
4.1. Tier 1: Lo-Fi
(lo-fi): Crowdsourcing a folksonomy
Example of Tier 1 participants: Mitchell’s metadata literacy study focused on the ability of college students to create and curate metadata (Mitchell, 2010). It is therefore proposed
here that college students form the majority of the lower tier of the model, the “lo-fi” stage of the
process. Additionally, a machine algorithm will automatically extract metadata (indicating
anything from whether something is, say, a book or web site only, to other technical details) and
will feed it into this tier (or higher tiers, if a resource already contains adequate metadata to go
straight to tier 2 or 3). African and international university students (graduate and undergraduate)
familiar with a particular domain, e.g. hydrology, and possessing some degree of and or aptitude
for metadata literacy, will create metadata. Here, terms and relationships can be drafted, thrown
out and drafted again in iterative, rapid succession, in either an analog or digital environment. An
eGranary resource page might have for instance a pop-up that allows one to easily tag it with
descriptive terms. Alternatively, there could be a paper-only environment in which students, some of whom might be more comfortable with lower-tech approaches, collaboratively brainstorm terms on post-its, which are later added digitally to the system. Creating terms in this manner prevents what Egger calls “tunnel vision,” when people get caught up in the design of the product or resort to the processes most comfortable to them, instead of focusing on what best benefits end users (“Lo-Fi vs. Hi-Fi Prototyping,” n.d.). Further, people at this level are imbued with metadata,
information and computer literacy through their efforts. Items in the “lo-fi” tier are not hardcoded
into the canon, but Tier 1 products are passed to Tier 2 upon completion.
4.2. Tier 2: Middle-Fi
(middle-fi): Refining the terms and their relationships (synonyms, broader, narrower, if
applicable). The participants are regional and international domain experts (e.g. hydrology
professors/researchers, practicing hydrologists); here, the participants are fewer than in Tier 1.
4.3. Tier 3: High-Fi
(hi-fi): The smallest tier. Information specialists (African and international) approve and refine
terms and relationships and add them to the canon of knowledge (hard coded as a nearly final
product) in the form of a vocabulary, ontology or descriptive metadata applied to records.
Domain experts and other information specialists can review this almost-final product though
changes to the canon are harder to make. It is expected that at this level terms are more or less
definitive and reflect what are used in a particular culture and discipline. Such high-level
activities also enhance the indigenous library worker’s multiple literacies, so the benefits of this
process are multitudinous.
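To summarize the three tiers in executable form, the sketch below models the upward promotion of terms; it is an illustration only and does not reflect the eGranary software, so all class, tier and function names are hypothetical.

from collections import defaultdict

class PyramidWorkflow:
    """Terms move upward: lo-fi suggestions -> middle-fi review -> hi-fi canon."""

    def __init__(self):
        self.tiers = defaultdict(list)   # tier name -> list of (resource_id, term)

    def suggest(self, resource_id, term):
        # Tier 1 (lo-fi): students and machine extraction contribute raw tags.
        self.tiers["lo-fi"].append((resource_id, term))

    def review(self, accept):
        # Tier 2 (middle-fi): domain experts keep or discard the suggestions.
        self.tiers["middle-fi"] = [t for t in self.tiers["lo-fi"] if accept(t)]

    def canonize(self, approve):
        # Tier 3 (hi-fi): information specialists hard-code approved terms.
        self.tiers["hi-fi"] = [t for t in self.tiers["middle-fi"] if approve(t)]
        return self.tiers["hi-fi"]

# Hypothetical usage: crowd tags for a hydrology resource are filtered upward.
flow = PyramidWorkflow()
flow.suggest("doc-42", "groundwater")
flow.suggest("doc-42", "water stuff")
flow.review(accept=lambda t: t[1] != "water stuff")
print(flow.canonize(approve=lambda t: True))   # [('doc-42', 'groundwater')]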
FIG. 1. Illustration of the pyramid model
TABLE 1: Summary of the pyramid model

Tier | Type of terms | Fidelity level | Creator population | Metadata creator
1 | Folksonomy, technical metadata | low | large | Student, machine
2 | Folksonomy/ontology/vocabulary/technical metadata | med. | medium | Domain expert, machine
3 | Ontology/vocabulary/other descriptive metadata/technical metadata | high | small | Information specialist, machine

Employing rudimentary examples of Lo-Fi/Hi-Fi metadata creation, WiderNet@UNC has demonstrated promising ideas for scaling up the creation of culturally-congruent metadata and
demonstrated promising ideas for scaling up the creation of culturally-congruent metadata and
user-centric portals through crowd-cataloging and tiered expertise. The authors will continue to
explore these concepts as they expand metadata knowledge and use in target populations.
5. Conclusions
Metadata developments have progressed at a tremendous pace, particularly in technology-rich
first world nations. The attention to metadata has been basic in developing countries, given more
substantial priorities, such as implementing networking capacities. As technologies and
opportunities such as the eGranary Digital Library are implemented, the need to address metadata
issues has become increasingly apparent. This paper reported on steps taken to address metadata
challenges and advance current practices. The Lo-Fi to Hi-Fi metadata pyramid model, taking its cues from other fields like design, is guiding a developing metadata initiative being pursued with the eGranary Digital Library and helping the initiative to understand how to expedite the creation of good-quality metadata, making resources more findable and usable.
Next steps include testing the model in information- and technology-poor areas of Africa by assessing the needs and available manpower to source the effort through a series of methods
including surveys and experiments. We hope to discover through our research how to best
implement the pyramid model, thereby eliminating much of the information, computer and
metadata illiteracy plaguing certain areas while bolstering eGranary resource findability. If
successful, the effort can be duplicated in other countries, such as Bangladesh and India, and
environments such as prisons, with eGranaries.
References
Computer Literacy: Necessity or Buzzword? (n.d.). Retrieved May 16, 2014, from http://www.ala.org/lita/ital/22/3/childers
Du Plessis, A., & Webb, P. (2012). Teachers’ Perceptions about their Own and their Schools’ Readiness for Computer Implementation: A South African Case Study. Turkish Online Journal of Educational Technology - TOJET, 11(3), 312–325.
Introduction to Information Literacy. (n.d.). Retrieved May 16, 2014, from http://www.ala.org/acrl/issues/infolit/overview/intro
Jager, K. de, & Nassimbeni, M. (2007). Information Literacy in Practice: engaging public library workers in rural South Africa. IFLA Journal, 33(4), 313–322. doi:10.1177/0340035207086057
Lo-Fi vs. Hi-Fi Prototyping: how real does the real thing have to be? (n.d.). Telono. Retrieved from http://www.telono.com/en/articles/lo-fi-vs-hi-fi-prototyping-how-real-does-the-real-thing-have-to-be/
Mitchell, E. T. (2010). Metadata literacy: An analysis of metadata awareness in college students (Ph.D.). The University of North Carolina at Chapel Hill, United States – North Carolina. Retrieved from http://search.proquest.com.libproxy.lib.unc.edu/docview/304160575/abstract?accountid=14244
Oduaran, A. (1991). The Computer Revolution and Adult Education. Growth Prospects in Africa.
Park, J., Tosaka, Y., Maszaros, S., & Lu, C. (2010). From Metadata Creation to Metadata Quality Control: Continuing Education Needs Among Cataloging and Metadata Professionals. Journal of Education for Library & Information Science, 51(3), 158–176.
prototype. (n.d.). Merriam-Webster. Retrieved May 17, 2014, from http://www.merriam-webster.com/dictionary/prototype.
Skemp, L. E., Ko, J. W., Missen, C., & Peterson, D. (2011). Global access to aging information and the gerontology healthy ageing portal. Journal of Gerontological Nursing, 37(1), 14–19. The University of Iowa College of Nursing, Iowa City, IA, USA.
Zambia (n.d.). Sparkman Center for Global Health brings eGranaries to Universities in Zambia. Retrieved September 18, 2014, from http://www.widernet.org/node/891 and http://www.sparkmancenter.org/egranary.
How Descriptive Metadata Changes in the UNT Libraries’
Collections: A Case Study
Hannah Tarver
University of North Texas Libraries, USA
[email protected]
Oksana Zavalina
University of North Texas, USA
[email protected]
Daniel Alemneh
University of North Texas Libraries, USA
[email protected]
Mark Phillips
University of North Texas Libraries, USA
[email protected]
Shadi Shakeri
University of North Texas, USA
[email protected]
Abstract
This paper reports results of an exploratory quantitative analysis of metadata versioning in a
large-scale digital library hosted by University of North Texas. The study begins to bridge
the gap in the information science research literature to address metadata change over time.
The authors analyzed the entire population of 691,495 unique item-level metadata records in
the digital library, with metadata records supplied from multiple institutions and by a number
of metadata creators with varying levels of skills. We found that a high proportion of
metadata records undergo changes, and that a substantial number of these changes result in
increased completeness (the degree to which metadata records include at least one instance of
each element required in the Dublin Core-based UNTL metadata scheme). Another
observation of this study is that the access status of a high proportion of metadata records changes from hidden to public; at the same time, the reverse process also occurs, when metadata records previously visible to the public become hidden for further editing and sometimes remain hidden.
made to improve the quality of metadata records -- increase the record length, surprisingly,
some changes decrease record length. Further investigation is needed into reasons for
unexpected findings as well as into more granular dimensions of metadata change at the level
of individual records, metadata elements, and data values. This paper suggests some research
questions for future studies of metadata change in digital libraries that capture metadata
versioning information.
Keywords: metadata quality; distributed digital libraries; metadata change; measurement;
quality assessment; best practices
1. Introduction and Background
Maintaining usable digital libraries requires high-quality metadata; one related piece involves
looking at how metadata records change to determine how frequently records are edited, and,
ultimately, if they have been improved. These measurements can factor into various kinds of
evaluations including aspects of quality, such as “completeness,” one commonly-accepted quality
criterion (Moen, Stewart, & McClure, 1998; Park & Tosaka, 2010; Zavalina, 2011, etc.).
Metadata completeness is evaluated as the extent to which objects are described using all applicable metadata elements to their full access capacity (Park, 2009). Of the three major metadata quality criteria (completeness, accuracy, and consistency), accuracy is the most subjective and therefore difficult to measure, while the consistency and especially the completeness criteria lend themselves to a variety of analyses, including computational ones.
Stvilia and colleagues (Stvilia et al., 2004; Stvilia & Gasser, 2008) concluded that metadata
changes made to improve metadata quality should be quantified and justified based on changes of
value and cost of metadata to assist metadata specialists in optimizing quality assurance processes
and to provide justification for spent resources. However, an analysis of the literature demonstrates that there has been little research into metadata change in information science.
To the best of our knowledge, none of the published metadata quality studies have measured metadata change, with one exception: a small-scale component examining metadata change in the
broader study of collection-level metadata quality in the IMLS DCC aggregation. As part of this
study, researchers conducted longitudinal analysis of the modifications that had been made by
digital collection developers housed at various cultural heritage institutions throughout the United
States to collection-level metadata records created by hosting institutions’ staff in the IMLS DCC
(Zavalina, Palmer, Jackson, & Han, 2008) and found that the data values associated with the
Dublin Core Collections Application Profile’s Subject, Audience, Size, Spatial Coverage and
Temporal Coverage metadata elements are modified the most frequently.
A number of information science studies relied on Wikipedia’s so-called “revision metadata”
that documents who made a particular revision to the Wikipedia article and when, as well as
“rollbacks” -- the process of restoring a database or program to a previously defined state -- to
detect vandalism (e.g., West, Kannan, & Lee 2010; Alfonseca, Garrido, Delort, & Peñas, 2013).
Similarly, Yan and McLane (2012) discussed the metadata management process for “revision
metadata,” including the edits, history, and tracking, made to spatial data and GIS (Geographic
Information System) map figures. While using administrative metadata that documents revisions
as a tool to answer other research questions, none of these studies focused on the changes made to
metadata per se as opposed to information objects (e.g., Wikipedia articles) themselves.
Outside of the information science field in general and the metadata quality area in particular, one can see discussion of change in relation to texts, strings, files, etc.; however, a review of the literature identified a gap in information science research in relation to the analysis of metadata
change. In particular, no studies to date have attempted to measure metadata change in digital
libraries. The authors of this paper believe that metadata change can and should be viewed as one
of the indicators of metadata quality and therefore should be examined as a step toward
improving the quality of metadata in digital libraries. To begin bridging this gap, the study
reported in this paper sought to answer the following research question: What is the amount of
change in metadata?
The authors of this paper selected as the target for their research the centralized digital library
hosted by the University of North Texas Libraries, consisting of multiple collections with varying
subject scope, material types, etc. The UNT digital collections include the UNT Digital Library
(containing items owned by UNT and the output of the University’s research, creative, and
scholarly activities), The Portal to Texas History (containing historical materials owned by more
than 200 partner institutions across the state of Texas), and the Gateway to Oklahoma History
(containing primarily newspapers and photographs through partnership with the Oklahoma
Historical Society). The collections incorporate different types of materials including
photographs, theses and dissertations, newspapers, artwork, performances, musical scores,
journals, government documents, rare books and manuscripts, and posters. All items in the UNT
digital collections are described using a locally-modified Dublin Core metadata schema. The
digital library’s infrastructure has been established according to open-source components and
standards, protocols, and formats. At the time of data collection (April 18, 2014), this large-scale
digital library held 691,495 unique objects, with item-level metadata records written by a number
of metadata creators with varying levels of metadata creation skills.
These records reside in the digital library infrastructure operated by the UNT Libraries, which is a purpose-built system for managing and providing long-term access to digital resources. Aubrey,
the system used for this management, was put into production during June of 2009. The UNT
Libraries placed the current metadata editing component into service in September 2009; as part
of metadata management, this component versions metadata records each time they change in the
system. This provides a unique collection of rich data for analysis into metadata changes.
2. Methods
According to Ochoa and Duval (2009), most of the metadata quality studies involve manual
content analysis on statistically-significant samples of metadata records. Collection-level
metadata records that describe entire collections of information objects as a whole, as opposed to
individual objects, can still often be examined manually due to the reasonable numbers of
metadata records in each sample. However, with the rapid growth of digital libraries and
repositories that aggregate hundreds of thousands and often millions of items and their respective
item-level metadata records, the evaluation of much more numerous item-level metadata will
need to rely -- at least in part -- on computational approaches.
The study reported in this paper adopted the semi-automated quantitative research approach to
analyze the entire population of metadata records in the target centralized digital library with the
purpose to answer the following research question: What is the amount of change in metadata?
The following broad indicators of metadata change were selected:
• frequency distribution of the number of editing events per record (i.e., how many records
were edited only once and how many were edited 2, 3, or more times),
• frequency distribution of the number of editors per record,
• frequency distribution of the record length change in the process of editing,
• frequency distribution of change in record completeness (in terms of the number of
metadata elements, including required elements, used), and
• frequency distribution of change in the record status (i.e., availability for the user)
through the process of editing.
To measure these indicators, metadata records from the UNT Digital Library, The Portal to
Texas History, and the Gateway to Oklahoma History were extracted (Phillips, 2014). The
authors wrote a Python script to extract and aggregate statistics about each metadata record
version into a tab-delimited format that presents a less complex view of the data (see the
Appendix for the full list of data collected for each record). The dataset extraction script
processed each of the 1,193,813 record instances -- including all versions of each unique record -- in the Aubrey system and calculated the number of instances (presented in the dataset as an integer) for each of the elements in the UNTL metadata scheme (UNT Libraries, 2014).
Additionally, the script extracted important creation information for each metadata record
including the timestamp for when it was created and last updated, the metadata creator and the
last metadata modifier, whether the record is hidden to the public or unhidden, and the number of
seconds that elapsed between the metadata record creation date and the metadata record edit date.
There are three fields in the dataset which may need additional description: the completeness
metric, the record_length, and the record_content_length. The completeness metric calculates
how “complete” a metadata record is in terms of the UNTL metadata scheme. This metric is
calculated by examining the record and the existence of values for the seven fields required in our
database: title, description, language, resource type, format, collection, institution, and subject.
The existence or nonexistence of these values is used in a calculation that results in a number
between 0 and 1, where 0 indicates a severely incomplete record with none of the required
elements present, and 1 represents a complete record that has at least one instance of each of the
seven elements that are required in the UNTL metadata scheme. The record_length measurement
is the total number of bytes that the metadata record occupies on disk, and the
record_content_length is the number of bytes of the record excluding metadata elements -- field
names, qualifiers, attributes, and attribute values -- which results in the total length of data values
in these metadata fields. Because the text of metadata elements is removed, administrative changes to the record status -- such as hiding and unhiding the record -- are not included, so a better sense of the records' full size can be seen.
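The paper does not give the exact formula, so the following is only an approximate sketch of such a completeness calculation, assuming equal weight for each required element; the element names mirror the list above.

REQUIRED_ELEMENTS = ["title", "description", "language", "resourcetype",
                     "format", "collection", "institution", "subject"]

def completeness(element_counts):
    """Fraction of required UNTL elements with at least one instance (0 to 1).

    element_counts: dict mapping element name -> number of instances in a record.
    """
    present = sum(1 for name in REQUIRED_ELEMENTS if element_counts.get(name, 0) > 0)
    return present / len(REQUIRED_ELEMENTS)

# A record missing 'subject' scores below 1.0.
record = {"title": 1, "description": 2, "language": 1, "resourcetype": 1,
          "format": 1, "collection": 1, "institution": 1}
print(completeness(record))   # 0.875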
3. Findings
In the dataset used for this study there are a total of 1,193,813 record instances of edited or
unedited metadata record versions (see Table 1). These record instances represent 691,495 unique
objects in the UNT digital collections; in the following analyses, this number is used as the
“total” number of unique records in the system. The data presented in Table 1 demonstrates the
steady growth in both the total number of metadata records in the system and the number of
metadata records edited each year, with the highest proportion of metadata records (24.5%) added
or edited in 2013.
TABLE 1: Valid edited and unedited record instances by year*.

Year | New Record Instances | Percent of Dataset
2004 | 928 | 0.1%
2005 | 43,425 | 3.6%
2006 | 33,899 | 2.8%
2007 | 31,053 | 2.6%
2008 | 25,138 | 2.1%
2009 | 88,580 | 7.4%
2010 | 179,498 | 15.0%
2011 | 188,810 | 15.8%
2012 | 248,439 | 20.8%
2013 | 292,342 | 24.5%
2014 | 61,695 | 5.2%

*Note: 6 records in the dataset are missing a metadata creation date.
As of April 2014, there were 502,675 instances of edited record versions. These versions
represent 271,754 unique metadata records that have undergone changes since September 2009,
when we started versioning metadata (see Table 2), or 39.3% of all metadata records in the
system. Additionally, the data indicates that 9,830 records were edited one or more times before
the migration to the Aubrey system and have not been edited since. That means that a total of
42.5% of all item-level metadata records in the UNT digital collections have been edited at least
once. However, the records last edited before September 2009 are excluded from the edit analysis
since only one -- the most current -- version of each record was retained prior to migration.
TABLE 2: Valid instances of edited records (versions) by year, September 2009-April 2014.

Year of Last Edit | Record Instances | Percentage of Edited Record Instances
2009 | 20,314 | 4.0%
2010 | 39,817 | 7.9%
2011 | 105,465 | 21.0%
2012 | 124,041 | 24.7%
2013 | 188,652 | 37.5%
2014 | 24,386 | 4.9%
The data presented in Table 2 demonstrates the steady growth in the number of metadata
records edited each year, with the sharp spike (from 7.9% to 21%) in 2011 and the highest
proportion of metadata records (37.5%) edited in 2013.
To get a better sense of the scope of editing frequency across the collections, we analyzed the
number of edits per record and the number of editors per record. Of the edited records, nearly all
(99%) have been edited five or fewer times (see Table 3), although some outlying records have
been edited more than 50 times. Additionally, the majority of edited records (93.6%) have only
been changed by one or two different editors (see Table 4).
For the following data analyses, edit events are compared across the entire collection of unique
metadata records (n=691,495), or across the unique metadata records that have been edited at
least once since September 2009 (n=271,754).
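A minimal sketch of how these two counts can be derived from the flat list of record versions is shown below; the field names follow the Appendix dataset, while the helper name and the assumption that a record with n stored versions has been edited n − 1 times are ours.

```python
# Illustrative sketch: edits-per-record and editors-per-record from version rows.
from collections import defaultdict

def edit_statistics(versions):
    """versions: iterable of dicts with 'ark' and 'metadata_editor' keys,
    one dict per stored record version."""
    version_counts = defaultdict(int)
    editors = defaultdict(set)
    for v in versions:
        version_counts[v["ark"]] += 1
        editors[v["ark"]].add(v["metadata_editor"])
    # Assumption for illustration: n stored versions correspond to n - 1 edits.
    edits_per_record = {ark: n - 1 for ark, n in version_counts.items()}
    editors_per_record = {ark: len(users) for ark, users in editors.items()}
    return edits_per_record, editors_per_record
```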
TABLE 3: Number of edits per record (n=691,495).
Number of Edits | Number of Records | Percentage of Records | Cumulative Percentage of Records
0 | 419,741 | 60.7% | 60.7%
1 | 152,900 | 22.1% | 82.8%
2 | 66,236 | 9.6% | 92.4%
3 | 27,983 | 4.0% | 96.4%
4 | 12,004 | 1.7% | 98.1%
5 | 4,944 | 0.7% | 98.8%
6 | 2,925 | 0.4% | 99.2%
7 | 1,963 | 0.3% | 99.5%
8 | 950 | 0.1% | 99.6%
9 | 664 | 0.1% | 99.7%
10 | 373 | 0.1% | 99.8%
11-20 | 772 | 0.1% | 99.9%
21-50 | 33 | 0.0% | 100.0%
51+ | 7 | 0.0% | 100.0%
TABLE 4: Number of metadata editors per record (n=271,754).
Number of Editors | Number of Records | Percentage of Records
1 | 197,358 | 72.6%
2 | 57,068 | 21.0%
3 | 15,397 | 5.7%
4 | 1,731 | 1.0%
5 | 180 | 0.1%
6 | 75 | 0.0%
7 | 3 | 0.0%
8 | 0 | 0.0%
9 | 0 | 0.0%
10 | 1 | 0.0%
In order to understand how records change over time, the authors investigated how the size of
a metadata record changes during its life using the record_content_length field. The instance of
this value from the first stored record (either newly created or migrated from the previous system)
was compared to the most recent version in the dataset. The resulting difference was categorized as
an increase, a decrease, or no change in the size of the record over its life. Records that have not
yet been edited have “no change.” Across the entire collection, more than sixty-six percent of the
records have not changed in length (see Table 5); however, among the subset of records that have
been edited, more than half increased in size (see Table 6).
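The categorization itself is straightforward; the following sketch (with assumed field names taken from the Appendix dataset) compares the first and most recent stored versions of a record, and the same comparison is reused for the completeness analysis below.

```python
# Illustrative sketch: categorize the change in a record's content length
# (or completeness) between its first and most recent stored versions.
def change_category(first_version: dict, latest_version: dict,
                    field: str = "record_content_length") -> str:
    delta = latest_version[field] - first_version[field]
    if delta > 0:
        return "+"   # increase over the record's life
    if delta < 0:
        return "-"   # decrease over the record's life
    return "0"       # no change, including records never edited
```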
TABLE 5: Change in size of metadata records September 2009-April 2014 (n=691,495).
Change Category | Number of Records | Percentage of All Records
No Size Change (0) | 459,350 | 66.4%
Size Increase (+) | 146,046 | 21.1%
Size Decrease (-) | 86,099 | 12.5%
TABLE 6: Change in size of edited metadata records September 2009-April 2014 (n=271,754).
Change Category | Number of Records | Percentage of Edited Records
No Size Change (0) | 39,610 | 14.6%
Size Increase (+) | 146,046 | 53.7%
Size Decrease (-) | 86,099 | 31.7%
To determine how completeness changed over time, the authors took the same approach as for
record content length, using the automatically calculated metric that measures the presence of all
required fields in a metadata record. The earliest completeness value for each record was compared with the most
recently edited values to determine whether the completeness increased, decreased, or stayed the
same. A large majority of the whole collection -- nearly 96% -- had no change in completeness
(see Table 7); and, even among the subset of edited records, roughly 90% had no change in
completeness (see Table 8). Overall, completeness generally stayed the same or increased,
although thirteen records decreased in completeness, likely due to a mistake or misunderstanding
when editing.
TABLE 7: Change in completeness of metadata records September 2009-April 2014 (n=691,495).
Change Category | Number of Records | Percentage of All Records
No Completeness Change (0) | 662,508 | 95.8%
Completeness Increase (+) | 28,974 | 4.2%
Completeness Decrease (-) | 13 | 0.0%
TABLE 8: Change in completeness of edited metadata records September 2009-April 2014 (n=271,754).
Change Category | Number of Records | Percentage of Edited Records
No Completeness Change (0) | 242,767 | 89.3%
Completeness Increase (+) | 28,974 | 10.7%
Completeness Decrease (-) | 13 | 0.0%
Aside from general size and completeness of records, the final research indicator involves an
aspect of particular interest in this analysis, which relates to the accessibility of records to the
public. In UNTL metadata, records contain a field that controls whether or not a record is hidden;
if the value is “true,” the record cannot be viewed in any way without administrative access to the
item. For items that have a hidden value of “false,” the metadata record is visible to the public
and searchable. This value only governs the metadata record and does not affect the accessibility
of the item (i.e., items that have restricted usage or embargoes can still have a hidden value of
“false”).
First, to see how this value changes over time, the authors compiled statistics for the number of
records for which the record access status value has changed -- either hidden to visible, or visible
to hidden. More than eighty percent of unique metadata records in the system have not changed
in access status (see Table 9), while a lesser majority (65%) of the edited records remained
unchanged (see Table 10).
TABLE 9: Change in access status of metadata records September 2009-April 2014 (n=691,495).
Change Category | Number of Records | Percentage of All Records
Access Status Changed | 94,516 | 13.7%
Access Status Unchanged | 596,979 | 86.3%
TABLE 10: Change in access status of edited metadata records September 2009-April 2014 (n=271,754).
Change Category | Number of Records | Percentage of Edited Records
Access Status Changed | 94,516 | 34.8%
Access Status Unchanged | 177,238 | 65.2%
In general, looking at how record access status has changed is important since it affects
accessibility and usage; however, we particularly want to highlight records that have moved from
a visible status to a hidden status. This event represents a situation in which a digital object that
was available to the public -- and may have been viewed, cited, or linked -- is no longer available.
Tables 11 and 12 present a more detailed analysis of this kind of metadata change, breaking down
the number of records that had a value of “false” (visible) that changed to “true” (hidden) at any
point in their edit history. For comparison, Tables 11 and 12 also contain statistics for records that
did not change access status, but an additional column gives the current status of each set of
records, providing detail as to how many records are unchanged but visible, versus unchanged but
hidden.
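A sketch of this check is shown below; it assumes each record's versions are available in chronological order with the boolean hidden value from the dataset, and the helper name is ours.

```python
# Illustrative sketch: flag records that were visible at some point and were
# later hidden at any point in their edit history.
def went_visible_to_hidden(versions) -> bool:
    """versions: a record's stored versions in chronological order,
    each a dict with a boolean 'hidden' value."""
    seen_visible = False
    for v in versions:
        if not v["hidden"]:
            seen_visible = True       # record was publicly visible here
        elif seen_visible:
            return True               # hidden now, but visible earlier
    return False
```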
Overall, more than ninety percent of all metadata records currently have a hidden value of
“false,” making them publicly accessible (see Table 11). More than 60% of the records that have
been edited started as visible and have not changed, while another 33% changed in
access status from hidden to visible during the course of editing (see Table 12).
TABLE 11: Current (April 2014) access status and status changes across all records (n=691,495).
Change Category | Changed from Visible to Hidden | Final Hidden Value | Number of Records | Percentage of All Records
Access Status Changed | No | False (Visible) | 90,295 | 13.1%
Access Status Changed | Yes | False (Visible) | 1,899 | 0.3%
Access Status Changed | Yes | True (Hidden) | 2,322 | 0.3%
Access Status Unchanged | No | False (Visible) | 553,262 | 80.0%
Access Status Unchanged | No | True (Hidden) | 43,717 | 6.3%
TABLE 12: Current (April 2014) access status and status changes across edited records (n=271,754).
Change Category | Changed from Visible to Hidden | Final Hidden Value | Number of Records | Percentage of Edited Records
Access Status Changed | No | False (Visible) | 90,295 | 33.2%
Access Status Changed | Yes | False (Visible) | 1,899 | 0.7%
Access Status Changed | Yes | True (Hidden) | 2,322 | 0.9%
Access Status Unchanged | No | False (Visible) | 167,478 | 61.6%
Access Status Unchanged | No | True (Hidden) | 9,760 | 3.6%
The rows that have particular significance in Tables 11 and 12 show statistics for the records
that have changed in status from visible to hidden at some point in their history. Forty-five
percent of those 4,221 records have ultimately been edited in some way and then made visible
again. However, the other fifty-five percent (2,322 records) have remained hidden and may need
additional review.
4. Discussion and Conclusions
In summary, the data presented in this paper addresses some general questions
about change in metadata records across a body of digital items, as a preliminary step toward
further research. This study revealed that a high proportion of metadata records in the UNT
digital collections (almost 40%) have been edited at least once in the period between September
2009 and April 2014 to change record content and/or access status. In addition, our data provides
evidence that the purposive metadata change activity -- expressed in the sheer number and
proportion of edited records -- has steadily and substantially grown over time. These findings
support the assumption that metadata is a constantly-evolving resource.
Several other points particularly stood out as part of this analysis. First, a considerable number
(nearly 11%) of edited records improved in quality based solely on the “completeness” metric.
Although this does not give a holistic view of the final metadata quality of those records (in
particular, with regards to accuracy, consistency or record completeness beyond the mere
presence of at least one instance of each required metadata element), in general, metadata editors
are adding required information when it is missing, improving the overall value of the metadata.
Next, regarding change in length, a larger than expected number of edited records (31.7%)
decreased in size as a result of changes, suggesting the removal of information. However, since
the record_content_length indicator represents the total number of characters in the record, even
minor changes could have accounted for a net decrease in record length, such as the removal of
an extra space, the correction of typographical errors with extra letters/characters, or the
replacement of longer placeholder values with shorter actual values as editors completed partial
records. Additionally, qualifiers and terms from controlled vocabularies contribute to the length,
so changing those values could decrease the number of characters. Based on this understanding, a
decrease in record length does not necessarily equate to a loss of information, or a decrease in the
quality or accuracy of a particular record.
Finally, as noted in the previous section, a number of metadata records (2,322) were hidden at
the time of data collection, even though they had been visible at some point in their edit history.
Although it is a small subset within the whole system -- only 0.3% of the total records -- any links
to those records have been broken. Since the general goal is to provide as much access as possible
and maintain permanent links to items and their respective metadata records in the UNT digital
collections, those records should be reviewed to see if changes would allow them to become
accessible once again, and to gain details about the circumstances in order to limit or avoid
similar situations in the future.
4.1. Further Study
The research reported in this paper is a case study that sought to explore quantitative
dimensions of metadata change and its general effects within a large digital collection. It helps
identify some areas for future exploration that will be addressed by further, more in-depth,
mixed-methods studies. These future studies will need to examine both quantitative and
qualitative characteristics of metadata change in various digital repositories to answer these and
other research questions:
● What is the frequency of change? What is the distribution of the lengths of time between initial record creation and its first modification; between the first and subsequent modifications?
● How does the number of instances of key metadata elements (such as title, creator, description, subject, etc.) change in the process of editing?
● Which common metadata change categories can be identified? What is the relative frequency of occurrence of these metadata change categories?
● Which elements in metadata records are changed the most often?
  ○ How do they change?
  ○ How do these changes affect the overall quality – completeness, consistency, and accuracy – of metadata records?
To answer these and other more specific research questions, future studies will need to involve
in-depth manual comparative analysis of versions for a manageable sample of metadata records.
The role of the current exploratory study is to serve as the first stepping stone and to spur interest
among metadata practitioners in conducting research into metadata change.
With major digital content management tools (e.g., Fedora, Islandora, and Hydra) now
incorporating metadata versioning, more and more digital repositories will be able to capture
versions of their metadata records and explore the change in their metadata over time. Further
work by other institutions in this same area could allow for important comparative research.
Without similar data, there is no way to evaluate whether the findings in this study are consistent
across most digital libraries, or to determine the significance of any situations in which the
experience at UNT differs from other digital libraries. Results of measuring metadata change will
also help to determine the overall metadata quality, compare metadata quality across different
collections of items, and will inform metadata management decisions such as setting priorities in
metadata quality.
Appendix
Alphabetical list of information captured in the dataset for all versions of metadata records in
the UNT digital collections.
Field | Example | Description
sample_id | ark:/67531/metacrs10000_2009-1220T02:07:08 | Unique identifier for a sample record version.
ark | ark:/67531/metacrs10000 | Unique record identifier.
citation | 0 | Number of citation element entries.
collection | 1 | Number of collection element entries.
completeness | 0.983050847458 | Completeness metric.
contributor | 0 | Number of contributor element entries.
coverage | 1 | Number of coverage element entries.
creator | 4 | Number of creator element entries.
date | 0 | Number of date element entries.
degree | 0 | Number of degree element entries.
description | 2 | Number of description element entries.
format | 1 | Number of format element entries.
hidden | False | Record hidden status (true/false).
identifier | 2 | Number of identifier element entries.
institution | 1 | Number of institution element entries.
language | 1 | Number of language element entries.
meta | 11 | Number of meta element entries.
metadata_creation_date | 2007-06-12, 16:50:25 | Date and time record was created.
metadata_creator | mphillips | Username for the record creator.
metadata_edit_date | 2008-02-18, 15:22:21 | Date and time record was last edited.
metadata_editor | govdocs | Username of the last metadata editor.
note | 0 | Number of note element entries.
primarySource | 0 | Number of primary source element entries.
publisher | 1 | Number of publisher element entries.
record_content_length | 1775 | Record length in bytes, excluding “meta” fields.
record_length | 2445 | Size of the metadata record in bytes.
relation | 0 | Number of relation element entries.
resourceType | 1 | Number of resource type element entries.
rights | 1 | Number of rights element entries.
source | 0 | Number of source element entries.
subject | 12 | Number of subject element entries.
time_since_creation | 2168116 | Time in seconds from record creation to last edit.
title | 1 | Number of title element entries.
Metadata in Support of Research
Metadata Integration for an Archaeology Collection Architecture
Sivakumar Kulasekaran, Texas Advanced Computing Center, The University of Texas at Austin, [email protected]
Jessica Trelogan, Institute of Classical Archaeology, The University of Texas at Austin, [email protected]
Maria Esteva, Texas Advanced Computing Center, The University of Texas at Austin, [email protected]
Michael Johnson, L - P : Archaeology, United Kingdom, [email protected]
Abstract
During the lifecycle of a research project, from the collection of raw data through study to
publication, researchers remain active curators and decide how to present their data for future
access and reuse. Thus, current trends in data collections are moving toward infrastructure
services that are centralized, flexible, and involve diverse technologies across which multiple
researchers work simultaneously and in parallel. In this context, metadata is key to ensuring that
data and results remain organized and that their authenticity and integrity are preserved. Building
and maintaining it can be cumbersome, however, especially in the case of large and complex
datasets. This paper presents our work to develop a collection architecture, with metadata at its
core, for a large and varied archaeological collection. We use metadata, mapped to Dublin Core,
to tie the pieces of this architecture together and to manage data objects as they move through the
research lifecycle over time and across technologies and changing methods. This metadata,
extracted automatically where possible, also fulfills a fundamental preservation role in case any
part of the architecture should fail.
Keywords: archeology; collection architecture; metadata integration; automated metadata
extraction; ARK; iRODS rules; Corral; Rodeo; Ranch
1. Introduction
Data collections are the focal point through which study and publishing are currently
accomplished by large research projects. Increasingly they are developed across what we refer to
as collection architectures, in which data and metadata are curated across multi-component
infrastructures and in which tasks such as data analysis and publication can be accomplished by
multiple users seamlessly and simultaneously across a collection’s lifecycle. It is well known that
metadata is indispensable in furthering a collection’s preservation, interpretation, and potential
for reuse, and that the process of documenting data in transition to an archival collection is
essential to those goals. In the collection architecture we present here, we use metadata in a novel
way: to integrate data across recordkeeping and archival lifecycle phases as well as to manage
relationships between data objects, research stages, and technologies. In this paper, we introduce
and illustrate these concepts through the formation of an archaeological collection spanning many
years. We show how metadata, formatted in Dublin Core (DC), is used to bridge data and
semantics developed as teams and research methods have changed over the decades.
The model we propose differs from traditional data management practices that have been
described as the “long tail of research” (Wallis et al., 2013), in which researchers may store data
in scattered places like home computers, hard-drives and institutional servers, with data integrity
potentially compromised. Without a clear metadata strategy, data provenance becomes blurry and
integration impossible. In the traditional model, archiving in an institutional repository or in a
data publication platform comes at the end of the research lifecycle, when projects are finalized,
often decades after they started, and sometimes too late to retain their original intended meaning
(Eiteljorg, 2011). At that final stage, reassembling datasets into collections that can be archived
and shared becomes arduous and daunting, preventing many from depositing data at all. Instead, a
collection architecture such as the one presented here, which is actively curated by the research
team throughout a project, helps to keep ongoing research organized, aggregates metadata on the
go, facilitates data sharing as research progresses, and enables the curator-researcher to control
how the public interacts with the data. Moreover, data that are already organized and described
can be promptly transferred to a canonical repository.
For research projects midway between the “long tail” and the new data model, the challenge is
to merge old and new practices, to shape legacy data into new systems without losing meaning
and without overwriting the processes through which data were conceived. We present one such
case: a collection created by the Institute of Classical Archaeology (ICA, 2014) representing
several archaeological investigations (excavations, field surveys, conservation, and study
projects) in Italy and Ukraine going back as far as the mid-1970s. As such, it includes data
produced by many generations of research teams, each with their own idiosyncratic recording
methods, research aims, and documentation standards. Integrating it into a collection architecture
that is accessible for ongoing study while thinking ahead about data publishing and long-term
archiving has been the subject of ongoing collaboration between ICA and the Texas Advanced
Computing Center (TACC, 2014) for the last five years (Trelogan et al., 2010; Walling et al.,
2011; Rabinowitz et al., 2013).
In this project metadata is at the center of a transition from a disorganized aggregation of
data—belonging to both the long tail of research, and new data that is being actively created
during study and publication—into a collection architecture. The work has involved reengineering research workflows and the definition of two instances of the collection with
different functions and structures: one is a stable collection which we call the archival instance
and the other, a study and presentation instance. Both are actively evolving as research continues,
but the methods we have developed allow researchers to archive data on the fly, enter metadata
only once, and to move documented data from the archive into the presentation instance and vice
versa, ensuring data integrity and avoiding the duplication of effort. The DC standard integrates
the data objects within the collection and binds the collection instances together.
2. Archaeology as the Conceptual Framework for a Collection Architecture
Archaeology is an especially relevant domain for exploring issues of data curation and
management because of the sheer volume and complexity of documentation produced during the
course of fieldwork and study (Kansa et al., 2011). Likewise, because a typical archaeological
investigation requires teams of specialists from a large number of disciplines (such as physical
anthropology, paleobotany, geophysics, and archaeozoology) a great deal of work is involved in
coordinating the datasets produced (Faniel et al., 2013). Making such coordination even more
challenging is the tendency for large archaeological research projects, like those in the ICA
collection, to carry on for multiple seasons, sometimes lasting for decades. Projects with such
long histories and large teams can contain layer upon layer of documentation that reflect changes
in technologies, standard practices, methodologies, teams, and the varied ways in which they
record the objects of their particular study.
As in an archaeological excavation, understanding these sediments is key to unlocking the
collection’s meaning and to developing strategies for its preservation. Due to the inevitable lack
of consistency in records that span years and specialties, these layers can easily become closed-off information silos that make it impossible to understand their purpose or usefulness. The work
we are doing focuses on revealing and documenting those layers through metadata, without
erasing the semantics of past documentation, and without a huge investment of labor at the end.
To address these challenges within the ICA collection, we needed a highly flexible, lightweight
solution (in terms of cost, time, and skills required to maintain) for file management, ongoing
curation, publishing, and archiving.
3. Functional and Resource Components of the Collection Architecture
Currently the ICA collection is in transition from disorganized data silos to an organized
collection architecture, illustrated in Figure 2. The disorganized data, recently centralized in a
networked server managed by the College of Liberal Arts Instructional Technology Service
(LAITS, 2014), represents an aggregation of legacy data that had been previously dispersed
across servers, hard-drives and personal computers. The data were centralized there to round up
and preserve disconnected portions of the collection so that active users could work
collaboratively within a single, shared collection. Meanwhile, new data are continuously
produced as paper records are digitized and as born-digital data are sent in from specialists
studying abroad. To manage new data and consolidate the legacy collection, we created a
recordkeeping system consisting of a hierarchical file structure implemented within the file share,
with descriptive labels and a set of naming conventions for key data types, allowing users to
promptly classify the general contents and relationships between data objects while performing
routine data management tasks (see Figs. 1 and 5). The recordkeeping system is used as a staging
area where researchers simultaneously quality check files, describe and organize them (by
naming and classifying into labeled directories) and purge redundant copies, all without resorting
to time-consuming data entry. Once organized, data are ingested into the collection’s archival
instance (See Fig. 2) where they are preserved for the long term and can be further studied,
described, and exposed for data sharing.
3.1. Staging and recordkeeping system: gathering basic collection metadata
Basic metadata for the collection is generated from the recordkeeping system mentioned
above. Using the records management big bucket theory (Cisco, 2008) as a framework, we
developed a file structure that would be useful and intuitive for active and future research and
extensible to all of the past, present, and future data that will be part of the ICA collection (Fig.
1). This file structure was implemented within the fileshare and is mirrored in the archival
instance of the collection for a seamless transition to the stable archive. The core organizing
principle for the data is its provenance as the archaeological “site” or “project” for which it was
generated. Within each of these larger “buckets”, we group data according to three basic research
phases appropriate to any investigation encountered in the collection, be it surface survey,
geophysical prospection, or excavation1: 1) field, 2) study, 3) publication. These top two tiers of
the hierarchy allow us to semantically represent, per project, what we consider primary or raw
versus processed, interpreted data, and the final polished data that are tied to specific print or
online publications. The third tier includes classes of data recorded during fieldwork and study
(e.g. field notes, site photos, object drawings) and the subjects of special investigations (e.g.
black-gloss pottery, physical anthropology, or paleobotany). The list was generated inductively
from the materials produced during specific investigations and is applicable to most ICA projects.
As projects continue through the research lifecycle this list may expand to add other materials
that were not initially accounted for. Curators can pick the appropriate classes and file data
accordingly. Files are named according to a convention (Fig. 5), which encodes provenance,
relationships between objects found together, the subject represented (e.g. a bone from a specific
context), as well as the process history of the data object (e.g. a scanned photograph).
1 This is, in fact, an appropriate way to describe the lifecycle of any kind of investigation – archaeological or otherwise – that involves a fieldwork or data-collection stage.
This recordkeeping system is invaluable for the small team at ICA managing large numbers of
documentation objects (>50,000 per each of over two dozen field projects). Because many
projects in ICA’s collection are still in the study phase and do not yet have a fully developed
documentation system, the filenames and directories are often the sole place to record metadata.
As the data are moved to the new collection architecture, the metadata is automatically mapped as
a DC document with specific qualifiers that preserve provenance and contextual relationships
between objects. Metadata is thus entered only once, and is carried along through the archival to
the study and presentation instances where specialists may expand and further describe them as
they study and prepare their publications.
FIG. 1. The highest levels of the file structure, represented here as “big buckets” whose labels embed metadata about the project, stages of research, classes of documentation, and subjects of specialist study.
FIG. 2. Resource components of ICA’s collection architecture: a. LAITS file share (staging area); b. Rodeo, cloud computing resource that hosts Virtual Machines (VMs); c. Corral, storage resource that contains active collections; d. iRODS, data management system; e. Ranch, tape archive for backups and long-term storage.
3.2. Archival instance: Corral/iRODS
Corral is a high performance resource maintained by TACC to service UT System researchers
(TACC, 2014; Corral, 2014). This system includes 6 petabytes of on- and off-site storage for data
replication, as well as data management services through iRODS (integrated Rule-Oriented Data
System) (iRODS, 2014). iRODS is an open-source software system that abstracts data from
storage in order to present a uniform view of data within a distributed storage system. In iRODS a
central metadata database called iCAT holds both user defined and system metadata, and a rule
engine is available to create and enforce data policies. We implemented custom iRODS rules to
automate the metadata extraction process. To access the data on Corral/iRODS, users can use
GUI-based interfaces like iDROP and WebDAV or a command-line utility. Data on
Corral/iRODS are secured through geographical replication to another site at UT Arlington.
3.3. Presentation instance
3.3.1. ARK
To provide a central platform for collaborative study of all material from each project, to
record richer descriptions and interpretations, and to define complex contextual relationships, we
adopted ARK, the Archaeological Recording Kit (ARK, 2014). ARK is a web-based, modular
“toolkit” with GIS support, a highly flexible and customizable database and user interface, and a
prefabricated data schema to which any kind of data structure can be mapped (Eve et al., 2008).
This has allowed ICA staff to create—relatively quickly and easily—a separate ARK for each site
or project, and to pick and choose the main units of observation within that (e.g. the “site” in the
case of a survey project, or the “context” and “finds” for an excavation project). At ARK’s core
are user-configured “modules”, in which the data structure is defined for each project. In terms of
the “big buckets” shown in Fig. 1, each of the top tier (site/project) buckets can have an
implementation of ARK, with custom modules that may correspond to the documentation classes
and/or study subjects represented in the third tier of buckets, depending on the methodological
approach.2 Metadata mappings are defined within the modules in each ARK (e.g., Fig. 6). This
presentation instance allows the user to interact with data objects that reside in the archival
instance on Corral/iRODS, describe them more fully in context of the whole collection (creating
more metadata), and then push that metadata back to the archival instance.
3.3.2. Rodeo
Rodeo is TACC’s cloud and storage platform for open science research (RODEO, 2014). It
provides web services, virtual machine (VM) hosting, science gateways, and storage facilities.
Virtual machines can be defined as a “software based emulation of a computer” (VM, 2014).
Rodeo allows users to create their own VM instance and customize it to perform scientific
activities for their research needs. All of the ARK services, including the front-end web services,
databases, and GIS, are hosted in Rodeo’s cloud environment. We use three VM instances to host
each of these services. To comply with best security practices we separate out the web services
from the GIS and the databases. If the web service is compromised or any security issues arise,
none of the other services are affected and only the VM that hosts the affected web service needs
to be recreated. During the study and publication stages, data on iRODS are called from ARK,
and metadata from ARK is integrated into the iCAT database.
2 We currently have three live implementations of ARK hosted at TACC: one housing legacy data from excavations carried out from the 1970s to the 1990s, recorded with pen and paper and film photography, with finds as the main unit of observation; a contemporary excavation, from 2001 to 2007, which was mostly born digital (digital photos, total station, in-the-field GIS, etc.) and focused on the stratigraphic context; and one survey project, from the 1980s to 2007, consisting of a combination of born digital and digitized data and centered on the “site” and surface scatters of finds.
3.3.3. Ranch
Ranch is TACC’s long-term mass storage solution with a high-performance tape-based system.
We are using it here as a high-reliability backup system for the publication instance of the
collection and its metadata hosted in Rodeo on the VMs. We also routinely back up the ARK
code base and custom configurations. Across Corral and Ranch, the entire collection architecture
is replicated for high data availability and fault tolerance.
4. Workflow and DC Metadata
4.1. Automated metadata extraction from the recordkeeping system
To keep manual data entry to a minimum, we developed a method for automatically extracting
metadata embedded in filenames and folders of our recordkeeping system. We took a
modular approach, using Python (Python, 2014) and customized iRODS rules, so that
individual modules can be easily plugged in or reused for other collections. One module extracts
technical metadata using FITS (FITS, 2014) and maps the extracted information to DC and to
PREMIS (PREMIS, 2014) using an XSLT stylesheet. Another module creates a METS document
(METS, 2014), also using an XSLT stylesheet transformation from the FITS document. The
module focusing on descriptive metadata extracts information from the recordkeeping system and
maps it to DC following the instructions from the data dictionary. Metadata is integrated into a
METS/DC document. Finally, metadata from the METS document is parsed and registered in the
iCAT database (Walling et al., 2011). Some files do not conform to the recordkeeping system
because they could not be properly identified and thus named and classified. For those, the
descriptive metadata will be missing and only a METS document with technical metadata is
created, with the technical information added into iCAT. This metadata extraction happens on
ingest to iRODS, so it occurs only as frequently as the users upload data that are understood and
organized by the researchers. The accuracy of the extracted metadata depends upon the accuracy
of the filenames (e.g., adherence to naming convention or correctness of object identification).
These are then further quality checked within the ARK interface during detailed collaborative
study, and corrections are pushed back to the iRODS database as needed by the user.
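As a rough illustration of the descriptive part of this workflow, the sketch below maps a path from the recordkeeping system to a few DC terms. The parsing rules, helper name, and choice of DC terms are simplified assumptions and not ICA's actual data dictionary; the example path combines the folder and filename shown in Table 1.

```python
# Illustrative sketch only: deriving descriptive DC metadata from the
# folder labels and filename of the recordkeeping system on ingest.
import os

def dc_from_path(path: str) -> dict:
    """e.g. path = 'PZ/field/finds/bw/PZ77_725T_b38_p47_f18_M.tif'"""
    folder, filename = os.path.split(path)
    buckets = folder.split("/")   # [site/project, research phase, documentation class, ...]
    return {
        "identifier": filename,   # file name as the unique identifier
        "format": folder,         # e.g. 'PZ/field/finds/bw' (photo type)
        "isPartOf": buckets[0],   # provenance: the site/project bucket
    }

# The resulting terms would be merged into the METS/DC document and
# registered in iCAT alongside the FITS technical metadata.
```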
4.2. Syncing data between ARK and iRODS
The next phase was to sync metadata between the two databases: ARK and iCAT/iRODS. A
new function was created within ARK to pull in metadata from iRODS and display it alongside
the metadata from ARK for each item in a module (e.g. object photographs).
FIG. 3 Metadata subform from ARK, allowing user to compare the information from the two collection instances.
Fields in ARK are used to define what data are stored where in the back-end ARK database,
the way that they should be displayed on the front-end website, and the way that they should be
added or edited by a researcher. The data classes used in ARK are specific to that environment
and have been customized and defined according to user needs within each implementation. The
mapping between the DC term and the corresponding field within ARK is defined in the module
configuration files.
While research progresses, data and metadata are added and edited via the ARK interface. The
user can update the metadata in iRODS from ARK or vice versa, using arrow buttons showing the
direction that the data will move. The system automatically recognizes if the user is performing
an add or edit operation. PHP is used to read and edit the information from ARK and iRODS, and
JavaScript is used to give the user feedback and confirm the modifications (Fig. 3). The metadata
linked to either the DC term or the ARK field are then presented and updated through the ARK
web interface.
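A rough sketch of the push in the ARK-to-iRODS direction is given below. The project's actual implementation is in PHP; here the standard iRODS imeta icommand is called from Python, and the AVU naming (dc.<term>) as well as the example path and value are assumptions for illustration.

```python
# Illustrative sketch: push a DC term edited in the presentation instance (ARK)
# back to the archival instance as an iRODS attribute-value pair.
import subprocess

def push_to_irods(data_object: str, dc_term: str, value: str) -> None:
    """Set (or overwrite) a metadata attribute on an iRODS data object."""
    subprocess.run(["imeta", "set", "-d", data_object, f"dc.{dc_term}", value],
                   check=True)

# e.g. after an editor updates a description in ARK:
# push_to_irods("/icaZone/PZ/field/finds/bw/PZ77_725T_b38_p47_f18_M.tif",
#               "description", "Terracotta Figurine")
```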
The workflow represented in Fig. 4 allows us to transition data into the collection architecture
and to perform ongoing data curation tasks throughout the research lifecycle. Note that in this
workflow, data are ingested first to the archival instance of the collection. This allows archiving
as soon as data are generated, assuring integrity at the beginning of the research lifecycle.
FIG. 4. Curation workflow.
4.3. Dublin Core metadata: the glue that binds it all together
Metadata schemas are typically used to describe data for ease of access, to provide
normalization, and to establish relationships between objects. They can be highly specialized to
include elements that embed domain-specific constructs. A general schema like DC, on the other
hand, can be used in most disciplines, if fine-grained description is not a priority. In choosing a
schema for this project we considered its ability to relate objects to one another, its
generalizability in representing the wide range of recording systems represented in the collection,
and its ease of use. With this in mind, we chose to use DC, which is widely used for
archaeological applications, including major data repositories like the UK-based Archaeology
Data Service (ADS, 2014) and, in the US, the Digital Archaeological Record (tDAR, 2014).
In this project the DC standard is a bridge over which data are exchanged between collection
instances and across active research workflows, turning non-curated into curated data, while
providing a general, widely understood method for describing the collection and the relationships
between the objects. Given the need for automated metadata extraction and organization
processes, we required higher levels of abstraction to map between the different organizational
and recording systems, data structures, and concepts used over time. Furthermore, DC is the
building block for future mapping to a semantically rich ontology like CIDOC-CRM (CRM,
2014), a growing standard for describing cultural heritage objects that is particularly
relevant for representing archaeology data in online publishing platforms (OpenContext, 2014).
CIDOC-CRM provides the scope to fully expose the richness of exhaustive analysis, and allows
the precise expression of contextual relationships between objects of study, as well as the
research process and events (historical and within an excavation or study), provenance (of
cultural artifacts as well as of data objects), and people. Such semantic richness, however, only
fully emerges at the final stages of a project, and we are here concerned with ongoing work
resulting in a collection that is still in formation and evolving rapidly.
FIG. 5. Metadata extracted from filename and folder labels are mapped to DC terms. Once in ARK further descriptive
metadata can be added and pushed back to iRODS.
4.4. Metadata mapping and its semantics
The mapping to DC for this project was considered in two stages. For the archival instance of
the collection, we focused on expressing relationships between individual data objects
(represented by unique identifiers) through the DC elements “spatial,” “temporal,” and “isPartOf.”
This allows grouping, for example, of all the documentation from a given excavation, or all
artifacts found within the same context. We also categorized documentation types and versions to
help us relate data objects to the physical objects they represent (e.g., a drawing or photo of an
artifact). For the publication instance presented in ARK, mapping focused on verbal descriptions,
interpretations, and the definition of relationships produced during study. These then populate the
“description” and “isPartOf” elements in the DC document. As a data object enters the collection
to be further analyzed and documented in ARK, all the key documentation related to that object is
exchanged over time throughout all pieces of the collection architecture and remains in the
archival instance once complete. For example, when a photo is scanned, named, and stored in the
appropriate folder, this embeds provenance information for the object in the photo (e.g., context
code, site and year of excavation), the provenance of the photo itself (e.g., location of negative in
physical archive), the process history of the data object (e.g., raw scan or an edited version), its
relations to other objects in the collection, and the description created by specialists in ARK (see
Fig. 5). For the data curator, the effort is minimal, and information is extracted automatically and
mapped to terms that are clearly understood. The information is carried along as the data object
moves from the primary data archive to the interpretation platform, and is enhanced through
study and further description every time the metadata is updated. By mapping key metadata
elements to DC (Table 1) we reduce data entry and provide a base for future users of the
collection.
TABLE 1. Extract of a data dictionary that maps the fields in an ARK object photo module to the recordkeeping system
and DC elements.
ARK term | ARK field | Record Keeping Example | DC Term
Short Description | $conf_field_short_desc | Terracotta Figurine | description
File Name | $conf_field_filename | PZ77_725T_b38_p47_f18_M.tif | identifier
Photo Type | $conf_field_phototype | PZ/field/finds/bw | format
Date Excavated | $conf_field_excavyear | 1977 | date
Date Photographed | $conf_field_takenon | 1978 | created
Photographed by | $conf_field_takenby | Chris Williams | creator
Area | $conf_field_area | Pantanello | spatial
Zone | $conf_field_zone | Sanctuary | spatial
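Expressed in code, the module-level mapping in Table 1 amounts to a simple lookup from ARK configuration fields to DC terms; the structure and variable name below are assumptions, while the pairs are taken directly from the table.

```python
# Illustrative sketch: ARK module fields mapped to DC terms (cf. Table 1).
ARK_TO_DC = {
    "$conf_field_short_desc": "description",
    "$conf_field_filename":   "identifier",
    "$conf_field_phototype":  "format",
    "$conf_field_excavyear":  "date",
    "$conf_field_takenon":    "created",
    "$conf_field_takenby":    "creator",
    "$conf_field_area":       "spatial",
    "$conf_field_zone":       "spatial",
}
```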
4.5. Metadata for integrity
In addition to the technical metadata extraction, descriptive metadata added throughout the
research lifecycle assures the collection’s integrity in an archaeological sense by reflecting
relationships between data objects. Moreover, because we have the same metadata stored in both
the archival and presentation instances, if one or more parts of the complex architecture should
fail, the collection can be restored. Once the publication instance is completed and accessible to
the public, users will be able to download selected images and their correspondent DC metadata,
containing all the information related to those images.
5. Conclusion
This work was developed for an evolving archaeological dataset, but can act as a model to
inform any kind of similarly complex academic research collection. The model illustrates that DC
metadata can act as an integrative platform for a non-traditional (but increasingly common)
researcher-curated, distributed repository environment. With DC as a bridge between collection
instances we ensure that the relationships between objects and their metadata are preserved and
that original meaning is not lost. Integration also reduces overhead in entering repetitive
information and provides a means for preservation. In the event that a database fails or becomes
obsolete, or if ICA can no longer support the presentation instance, the archival instance can be
sent to a canonical repository with all its metadata intact.
Finally, we can also attest that the model enables an organized and documented research
process in which curators can conduct a variety of tasks including archiving, study, and
publication, while simultaneously integrating legacy data. Our whole team, including specialists
working remotely, can now access our entire collection as a whole, view everything in context,
and work collaboratively in a single place. Because this work was developed with and by the
users actively testing it during ongoing study, we can also speak to the real benefits that have
been gained. In the course of this work, ICA lost over 2/3 of its research and publication staff due
to budget cuts. While this was a serious blow, the collection architecture we have described here
has allowed us to radically streamline our study and publication process enough that, despite
losing valuable staff, we are actually producing our publications much more efficiently than we
ever have before, and we have helped ensure a future for the data behind them.
References
ADS, Archaeology Data Service. (2014). Retrieved May 9, 2014 from http://archaeologydataservice.ac.uk/.
ARK, the Archaeological Recording Kit. (2014). Retrieved May 9, 2014 from http://ark.lparchaeology.com/.
Cisco, Susan. (2008). Trimming your bucket list. ARMA International’s hot topic. Retrieved May 9, 2014 from
http://www.emmettleahyaward.org/uploads/Big_Bucket_Theory.pdf.
Corral. (2014). Retrieved August 14, 2014 from https://www.tacc.utexas.edu/resources/corral.
CRM. (2014). CIDOC Conceptual Reference Model. Retrieved May 9, 2014 from
http://www.cidoc-crm.org/.
Eiteljorg, Harrison. (2011). What are our critical data-preservation needs? In: Eric C. Kansa, Sarah Whitcher Kansa, &
Ethan Wattrall (eds). Archaeology 2.0: New Approaches to Communication and Collaboration. Cotsen Digital
Archaeology series 1, 251–264. Los Angeles: Cotsen Institute of Archaeology Press.
Eve, Stuart, and Guy Hunt. (2008). ARK: A Developmental Framework for Archaeological Recording. In: A.
Posluschnya, K. Lambers, & I. Herzong. (eds). Layers of Perception: Proceedings of the 35th International
Conference on Computer Applications and Quantitative Methods in Archaeology (CAA), Berlin, Germany, April 2–
6, 2007. Kolloquien zur Vor- und Frühgeschichte 10. Bonn: Rudolf Habelt GmbH. Retrieved from:
http://proceedings.caaconference.org/files/2007/09_Eve_Hunt_CAA2007.pdf.
Faniel, Ixchel, Eric Kansa, Sarah Whitcher Kansa, Julianna Barrera-Gomez, and Elizabeth Yakel. (2013). The
Challenges of Digging Data: A Study of Context in Archaeological Data Reuse. JCDL 2013 Proceedings of the 13th
ACM/IEEE-CS Joint Conference on Digital Libraries, 295–304. New York: Association for Computing Machinery.
doi:10.1145/2467696.2467712
FITS, File Information Tool Set. (2014). Retrieved May 9, 2014 from https://code.google.com/p/fits/.
Harris, Edward C. (1979). Laws of Archaeological Stratigraphy. World Archaeology Vol. 11, No. 1: 111–117.
ICA, Institute of Classical Archaeology. (2014). Retrieved May 9, 2014 from http://www.utexas.edu/research/ica/.
iRODS, A data management software. (2014). Retrieved May 9, 2014 from http://irods.org/.
Kansa, Eric C., Sarah Whitcher Kansa, & Ethan Watrall (eds). (2011). Archaeology 2.0: New Approaches to
Communication and Collaboration. Cotsen Digital Archaeology series 1. Los Angeles: Cotsen Institute of
Archaeology Press.
LAITS, College of Liberal Arts. (2014). Retrieved May 9, 2014 from http://www.utexas.edu/cola/laits/.
METS, Metadata Encoding & Transmission Standard. (2014). Retrieved May 9, 2014 from http://www.loc.gov/standards/mets/.
OpenContext. (2014). Retrieved May 9, 2014 from http://opencontext.org/.
PREMIS, Preservation Metadata Maintenance Activity. (2014). Retrieved May 9, 2014 from http://www.loc.gov/standards/premis/.
Python, a programming Language. (2014). Retrieved May 9, 2014 from https://www.python.org/.
PHP, A hypertext preprocessor. (2014). Retrieved May 9, 2014 from http://www.php.net.
Rabinowitz, Adam, Jessica Trelogan, and Maria Esteva. (2012). Ensuring a future for the past: long term preservation
strategies for digital archaeology data. Presented at Memory of the Worlds in the Digital Age Conference:
Digitization and Preservation, UNESCO, September 26–28, 2012, Vancouver, British Columbia, Canada.
Rodeo. (2014). Retrieved August 14, 2014 from https://www.tacc.utexas.edu/resources/data-storage/#rodeo.
TACC, The Texas Advanced Computing Center. (2014). Retrieved May 9, 2014 from https://www.tacc.utexas.edu/.
tDAR, Digital Archaeological Record. (2014). Retrieved May 9, 2014 from http://www.tdar.org/.
VM, Virtual Machine. (2014) Retrieved May 9, 2014 from http://en.wikipedia.org/wiki/Virtual_machine.
Walling, David, and Maria Esteva. (2011). Automating the Extraction of Metadata from Archaeological Data Using
iRods Rules. International Journal of Digital Curation Vol. 6, No. 2: 253–264.
Wallis, Jillian C., Elizabeth Rolando, and Christine L. Borgman. (2013). If We Share Data, Will Anyone Use Them?
Data Sharing and Reuse in the Long Tail of Science and Technology. PLoS ONE 8(7): e67332.
doi:10.1371/journal.pone.0067332
Dublin Core Metadata for Research Data – Lessons Learned in a
Real-World Scenario with datorium
Andias Wira Alam, GESIS – Leibniz Institute for the Social Sciences, Germany, [email protected]
Abstract
As a continuation of our work in the datorium project, we provide a service for the autonomous
documentation and upload of research data. In this paper we discuss and share our experience of
developing such a service using Dublin Core metadata. Although small and simple, DC metadata
is an appropriate standard to serve as basic metadata, for instance in repository systems.
The elements required for describing research data are mostly complex, in particular the
information acquired about the data, such as survey methods, survey periods, or the number of variables. DC
metadata cannot cover all of the elements needed in a research data repository. However, we show
that, with some extended elements and front-end manipulations, DC metadata can be
applied usefully in this real-world scenario and can support complex descriptions without sacrificing
the “simplicity” of the standard.
Keywords: research data repository; metadata; DSpace; infrastructure; datorium
1. Introduction
GESIS – Leibniz Institute for the Social Sciences provides services for research in multiple
phases of the research process, such as study planning, data collection, data analysis, and
archiving and registration. The main targets are data collected from surveys, data from historical
social research, as well as scientific publications. Figure 1 shows the research data lifecycle used
by the institute to structure its services. Our project datorium belongs to the phase “archiving and
registering”. We provide a data repository service for social science and economic researchers
with a user-friendly tool for the autonomous documentation, upload and publication of their
research data, as illustrated in Figure 2. As stated in Linne (2012), the service focuses particularly
on small research projects by researchers who do not necessarily work for an institution or who are
self-funded. A detailed review carried out by the GESIS Data Archive ensures the quality of the
submitted data.
Describing research data requires comprehensive and rich metadata elements, such as those provided
by the Data Documentation Initiative (DDI)1 or da|ra metadata2. DDI can be used not only to
describe the research data on study level (general overview of the research data), but also on the
variable level - e.g. for detailed information about variables, questionnaires, and results. The da|ra
metadata is now commonly used in assigning persistent identifiers to research data in the context
of the DataCite3 community. Nonetheless, DC elements are the most-used elements for describing
resources, particularly scientific resources (cf. Ell et al. 2011; Qin et al. 2013; Malta et al. 2014).
Príncipe et al. (2013) also stated that OpenAIRE is starting to move from a publication
infrastructure to a more comprehensive infrastructure that covers all types of scientific “output”.
DC metadata as a fundamental metadata infrastructure for scientific publications is therefore
slowly evolving into an infrastructure for research data as well.
1 See http://www.ddialliance.org/
2 See http://www.da-ra.de/en/home/
3 http://www.datacite.org/
The choice of metadata schema or standard is likely not the main focus for researchers looking
to publish the data. Researchers need platforms which allow them to publish their data in an easy
way, making the data visible and citable (Wira-Alam et al. 2012; Dimitrov et al. 2013). During
the requirements analysis for datorium it became apparent that we had to specify the requirements
so as to balance simplicity and usefulness. Similar to the lessons learned reported by Wallis et al.
(2010), though in a different context, the discussions between the computer and social scientists
in the project team started with the question “what should we build for you?” answered by “what
could you build for us?”.
FIG. 1. Research data lifecycle in multiple phases4
FIG. 2. Illustration of the step-by-step processes within datorium5
An ideal vision is that any piece of information in the research data should be well-documented
and described. Castro et al. (2013) proposed e.g. to use domain-specific elements in order to fully
describe scientific experiments. However, our tool is targeted not only at institution-based
researchers but also at any self-funded researchers or even students. Documenting and describing
research data is time-consuming and hence expensive work. Thus, increasing the complexity of
the documentation process would impact the usability of the tool, and consequently potential
users might lose their interest in using it. Accordingly, one of the key challenges is how to make
the tool as simple as possible for users, in particular the data depositors, while at the same time
gathering as much information about the data as possible. Simultaneously, we have to make the
data visible and easy to find, especially for the data consumers. Another important feature of the
tool is that it shall be available in two languages, namely German and English, in order to target
prospective international users: data depositors as well as data consumers.
2. Metadata Design
In the metadata design, we identify not only critical information about metadata in general, but
also more detailed information about the research data, e.g. survey / data collection methods,
survey periods, or number of variables / units. However, in order to keep the metadata simple, we
4 http://www.gesis.org/en/services/
5 https://datorium.gesis.org/
use a Dublin Core subset whose schema is simply flat and has no complex hierarchical structure.
As stated by Rice et al. (2008) and Greenberg et al. (2013), datasets are digital materials that need
to be described for discovery, preservation, and re-use, e.g. for partner repositories. Furthermore,
analogous to the aforementioned work and Greenberg et al. (2009), by using DC elements we
provide understandable information about complex objects and help partner repositories or data
consumers to become acquainted with the research data. Research data also become more useful
when they are interoperable with other data and therefore need a common standard or set of
standards (Ball, 2010). As depicted in Figure 3, datorium’s metadata schema consists of DC
elements and some extended elements. As some elements, e.g. the file description, demand
hierarchical entries, the conceptual schema forms a tree structure, and such a complex tree structure
cannot be described by a flat schema such as Dublin Core. As mentioned in Chen et al. (2013), research datasets
may contain unusual file formats; therefore the uploaded files need additional information e.g. on
the number of variables, number of units, languages used in the files, or even software to read the
files for further processing. Figure 3 depicts the abstraction of the metadata schema.
FIG. 3. Illustration of the metadata schema in datorium
To meet the requirement of gathering as much information about the data as possible, we
require 6 mandatory and about 14 optional entries. As mentioned above, we assume that, for
simplicity reasons, most users are not willing to fill in many entries in the tool. However, we
cannot exclude the possibility that some users want to provide rich information about their
data, e.g. to increase its visibility. This situation contrasts with the identified
requirements, but we discuss in the next section how we alleviate this problem. In Table 1
we describe our metadata schema. In comparison with the first design (Wira-Alam et al., 2012),
we use 10 DC elements and specify all extended elements with the namespace “dbk” taken from
GESIS – Data Catalogue DBK6. We also organized the elements in five groups, e.g. General
Description or Methodology, according to their affinities. Moreover, we decomposed two
elements, Primary Researcher and Contributor, to increase precision. In the element Primary
Researcher, for instance, we distinguish between person and institution. According to the DC
standard, this property can be filled with either a person or an institution; our adaptation
makes it possible for users to search specifically for persons or for institutions.
As mentioned above, we support users by increasing the visibility of the data. For this purpose
we offer controlled vocabularies, e.g. for Subject Area or Data Collection Method. Controlled
vocabularies improve the visibility of the data on the one hand by enhancing the semantics of the
metadata, and on the other hand by making the submission process easier for the users. Moreover,
in order to support internationalization, we provide all controlled vocabularies in two languages:
German and English. This affects both the technical implementation and the search functionality.
We demonstrate in the next section how users can benefit from this feature and what the technical
implementation looks like.
6 https://dbk.gesis.org/
TABLE 1: Metadata schema using DC elements and extended elements
(* = repeatable element; the letters (a)–(l) refer to the notes following the table.)

Group | Labels | DC Elements | Extended Elements | Type
General Description | Title | dc.title | - | text
General Description | DOI | dc.identifier.uri | - | URI
General Description | Other Title * | - | dbk.othertitle; dbk.othertitle.text; dbk.othertitle.type | text; text; (a)
General Description | Primary Researcher * | dc.creator | dbk.primaryresearcher.person; dbk.primaryresearcher.institution | text; text
General Description | Publisher | dc.publisher | - | text
General Description | Publication Year | - | dbk.publicationyear | date: YYYY
General Description | Availability | - | dbk.availability | (b)
General Description | Embargo | - | dbk.embargo.availability | text
General Description | Embargo (until) | - | dbk.embargo.end | date: YYYYMMDD
General Description | Contributor * | dc.contributor | dbk.contributor.person; dbk.contributor.institution; dbk.contributor.type | text; text; (c)
Content | Subject Area * | dc.subject.other | - | (d)
Content | Topic Classification * | dc.subject.classification | - | (e)
Content | Abstract | dc.description | - | text
Methodology | Geographical Area * | dc.coverage.spatial | - | (f)
Methodology | Universe * | - | dbk.universe | text
Methodology | Selection Method | - | dbk.selectionmethod | text
Methodology | Data Collection Method * | - | dbk.datacollectionmethod | (g)
Methodology | Survey Period * | - | dbk.surveyperiod; dbk.surveyperiod.start; dbk.surveyperiod.end | text; date; date
Additional Notes | Rights * | dc.rights | - | text
Additional Notes | Notes * | - | dbk.notes; dbk.notes.text; dbk.notes.type | text; text; (h)
Additional Notes | Source * | - | dbk.source | text
Additional Notes | Publications * | - | dbk.publication; dbk.publication.text; dbk.publication.id | text; text; text
Files | File * | - | dbk.file; dbk.file.filename; dbk.file.filedescription; dbk.file.version; dbk.file.versionNumber; dbk.file.versionDate; dbk.file.resource; dbk.file.resourceType; dbk.file.resourceTypeGeneral; dbk.file.language; dbk.file.numberofvariables; dbk.file.unit; dbk.file.unitNumberOf; dbk.file.unitType; dbk.file.software; dbk.file.alternateId; dbk.file.relatedId; dbk.file.relatedIdIdentifier; dbk.file.relatedIdType | text, except: versionDate date: YYYYMMDD; language (i); numberofvariables int; unitNumberOf int; unitType (j); relatedIdType (k)
Hidden | Date Issued (for sorting purpose) | dc.issued | - | date: YYYYMMDD (l)
Hidden | Checklist (for Curators only) | - | intern.cheklist | text
As explained, the elements consist of DC elements and DBK elements. In the first group,
namely General Description, we place all mandatory entries (written in italics). On publication
each submission automatically receives a persistent identifier, in this case a DOI® (generated by
the system). This increases the visibility of the submitted data by making them citable in the same
way as scientific publications. We set GESIS – Data Archive as the publisher of the data since they
are published through datorium; the value of this element is therefore fixed and not
editable. Elements marked with an asterisk are repeatable. We also introduce Other Title in order
to accommodate research data that have several titles for various reasons, e.g. an original title,
translated titles in several languages, or a project title.
In the second group, namely Content, users can provide a description of the data as a free-text
abstract. In addition, users have two important elements: Subject Area and Topic Classification,
whose values are controlled vocabularies. The controlled vocabularies provide a possibility for
semantic enhancement and thus facilitate connections between the research data and Linked Data
on the Web (cf. Isaac et al. 2013). In the group Files we collect relevant information about the
files as completely as possible. Further explanation for the elements marked alphabetically from
(a) to (l) is as follows:
(a) dbk.othertitle.type – Type of other title can be selected from the controlled vocabularies
provided by DBK (Zenk-Möltgen et al. 2012), such as “project title” or “original title”.
(b) dbk.availability – Consists of three controlled vocabulary terms: “free access”, “restricted access”,
and “embargo”.
(c) dbk.contributor.type – Type of contributor is based on the ContributorType category scheme
from DataCite.
(d) dc.subject.other – Subject Area is chosen from the disciplines in SSOAR – Social Science
Open Access Repository.
(e) dc.subject.classification – Topic Classification is based on DBK7 and consists of 38 terms overall,
such as “Economic Systems” or “Social Policy”.
(f) dc.coverage.spatial – Geographical Area consists of places, such as countries, cities, or
provinces / states, based on ISO-3166.
(g) dbk.datacollectionmethod – Data Collection Method consists of the controlled vocabularies
provided by DDI (unreleased beta version, March 2013), such as “Email interview”,
“recording”, or “Telephone interview: CATI”.
(h) dbk.notes.type – Based on the DescriptionType provided by DataCite, such as “Abstract” or
“TableOfContents”.
(i) dbk.file.language – Languages provided by ISO-639.
(j) dbk.file.unitType – Unit Type is based on “Analysis Unit” provided by DDI, such as “Family”,
“Individual”, or “Organization”.
(k) dbk.file.relatedIdType – Type of the related identifier is based on the RelationType provided by
DataCite, such as “IsCitedBy”, “IsDocumentedBy”, or “IsPartOf”.
(l) dc.issued – Date Issued is generated by the system at the time of publication.
7 https://dbk.gesis.org/dbksearch/Kategorien.htm
As mentioned above, datorium offers multi-language support for the controlled vocabularies.
The tool supports a so-called ad-hoc translation automatically. Users do not have to take any
action in this regard. All controlled vocabularies are stored in a dictionary in two languages. Each
vocabulary item in both languages is unique and therefore the correctness of the translation is
guaranteed. The controlled vocabularies for Subject Area and Data Collection Method have a tree
structure, in opposite of having a long list, to make it easier for users to find and choose the
relevant terms for their data.
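The following is a minimal sketch of the bilingual dictionary lookup described above. The keys and terms are invented for illustration (the actual datorium implementation handles the controlled vocabularies in its JavaScript front-end); the point is that each term is stored once with both labels, so the "ad-hoc translation" is always unambiguous.

```python
# Minimal sketch of a bilingual controlled-vocabulary lookup.
# Keys and terms below are illustrative, not datorium's actual lists.

VOCABULARY = {
    # internal key            German label               English label
    "data_collection.cati":  ("Telefoninterview: CATI",  "Telephone interview: CATI"),
    "data_collection.email": ("E-Mail-Interview",        "Email interview"),
    "subject_area.sociology": ("Soziologie",             "Sociology"),
}

def label_for(key: str, language: str) -> str:
    """Return the display label of a controlled term in the requested language."""
    german, english = VOCABULARY[key]
    return german if language == "de" else english

def key_for(label: str) -> str:
    """Resolve a label in either language back to its unique internal key."""
    for key, labels in VOCABULARY.items():
        if label in labels:
            return key
    raise KeyError(label)

if __name__ == "__main__":
    key = key_for("Telephone interview: CATI")
    print(label_for(key, "de"))  # -> Telefoninterview: CATI
```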
For the types of the elements, we use rudimentary types for reasons of simplicity. Thus, there
are only 4 rudimentary types: text, URI, int, and date. Theoretically, with text we can cover any
type of value. However, we apply a simple validation in order to avoid wrong values. Values
typed with date without a fixed date format, namely Survey Period, can be given in three
variants: year only (format: YYYY), month and year only (format: YYYYMM), or an exact date
(format: YYYYMMDD).
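A minimal sketch of such a validation is shown below; the function name and the extra range checks are ours, and in datorium the corresponding check happens in the front-end rather than in Python.

```python
import re

# Sketch: a Survey Period value may be a year (YYYY), a year and month (YYYYMM),
# or an exact date (YYYYMMDD).
SURVEY_PERIOD_PATTERN = re.compile(r"^\d{4}(\d{2}(\d{2})?)?$")

def is_valid_survey_period(value: str) -> bool:
    """Accept YYYY, YYYYMM or YYYYMMDD; reject anything else."""
    if not SURVEY_PERIOD_PATTERN.match(value):
        return False
    # Basic range checks on month and day, if present.
    if len(value) >= 6 and not 1 <= int(value[4:6]) <= 12:
        return False
    if len(value) == 8 and not 1 <= int(value[6:8]) <= 31:
        return False
    return len(value) in (4, 6, 8)

print([is_valid_survey_period(v) for v in ("2013", "201306", "20130615", "2013-06")])
# -> [True, True, True, False]
```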
3. Technical Implementation
We use a DSpace8 repository as the base platform for the implementation. The metadata model in
DSpace, which is based on Dublin Core, is simply flat and has no complex hierarchical structure.
It consists of schema, element, and qualifier. A schema is equivalent to a namespace, an element
can be considered the content, and a qualifier can be seen as a sub-element if an extra attribute needs
to be added. DSpace is a web-based application that follows the Model-view-controller (MVC)
architectural pattern (Gamma et al. 1994). This pattern ensures the consistency of the model
(data) and the user interface / front-end (view) by employing a controller. DSpace also offers
many features such as user management, review process, and discovery / faceted search. Our
development process is loosely based on agile software development, which is an iterative
process throughout the development cycle.
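To make the flat model concrete, here is a minimal sketch of how datorium-style fields can be represented as (schema, element, qualifier) triples; the record values are invented for illustration, and this is our own simplified rendering rather than DSpace's internal API.

```python
from typing import NamedTuple, Optional

# Sketch of the flat DSpace-style metadata model described above: every field is
# a (schema, element, qualifier) triple with a string value, no hierarchy.
class MetadataValue(NamedTuple):
    schema: str
    element: str
    qualifier: Optional[str]
    value: str

# Example entries expressed in that model (field names follow Table 1; values are made up).
record = [
    MetadataValue("dc", "title", None, "Example Survey on Media Use"),
    MetadataValue("dc", "creator", None, "Doe, Jane"),
    MetadataValue("dbk", "primaryresearcher", "person", "Doe, Jane"),
    MetadataValue("dbk", "primaryresearcher", "institution", "Example University"),
    MetadataValue("dbk", "publicationyear", None, "2014"),
]

def field_name(m: MetadataValue) -> str:
    """Render the dotted field name, e.g. 'dbk.primaryresearcher.person'."""
    return ".".join(part for part in (m.schema, m.element, m.qualifier) if part)

for m in record:
    print(f"{field_name(m)} = {m.value}")
```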
As described in Table 1, we have groups of elements. In the implementation, we display each
group of elements as a tab. This strategy is suitable for data depositors who do not want to spend
time capturing information about the data. However, even though all mandatory elements are
placed in the first tab, each submission needs to go through all tabs. Figure 4 shows the mandatory and
non-mandatory elements in the first tab. For example, the mandatory element Principal
Investigator can be filled with a person, an institution, or both. For data depositors who are
willing to provide more information about their data, this strategy is also convenient, as it
provides more structure and orientation than a single form with many
elements. After the data has been successfully published, the system will assign a persistent
identifier (DOI) automatically via a separate module connected with the da|ra API for DOI
registration9.
FIG. 4. Editor form for General Description
8 As mentioned in previous work (Wira-Alam et al. 2012), we use DSpace (version 1.8.2) because it is an
open source repository application. Furthermore, DSpace supports Dublin Core elements by default and has
a flat metadata schema, which helps us as developers to maintain the data. According to DSpace’s website,
more than 1,000 institutions worldwide have registered to use DSpace for their repository applications (May 2014).
9 http://www.da-ra.de/en/for-data-centers/register-data/
In the Content tab, as captured in Figure 5, one important feature of the tool, namely the
controlled vocabularies, is shown. We display the vocabularies in their original, i.e. hierarchical,
form. The hierarchical selection is convenient since users can, for example, find or
determine an appropriate subject, or several, via its discipline. This feature was implemented
without changing the metadata schema. We performed a pure front-end based manipulation, and
thus the validation occurs in the view as well. A big advantage of this strategy is that the metadata
schema remains flexible, since the view does not depend on the model. A possible disadvantage
is that wrong values may end up in the database, because front-end validation alone does not
guarantee consistency. Nevertheless, a wrong value only affects the corresponding element
and cannot break the other elements. Besides, a review process is carried out before the data is
published.
FIG. 5. Editor form for Content
The next feature regarding the controlled vocabularies is autocomplete. In Figure 6, in the
element Geographical Area the data depositors can select places from a given list. Since there are
thousands of places to be selected, we provide an autocomplete widget in order to make the
selection easier. Users can decide the preferred language (DE/EN); the whole user interface and
the controlled vocabularies are then available in the selected language. Another feature is a
widget to pick a date. This can be an exact date but also year only or month and year only.
FIG. 6. Editor form for Methodology
As repeatedly mentioned, since we cannot apply a complex metadata schema in the model, we
can only modify the view to meet the requirements. For instance, each uploaded file has several
elements and each submission / dataset can have several files, since it is a repeatable element. This
would require a hierarchical structure in the model, which is not implementable there.
As shown in Figure 7, we therefore wrap these elements in an XML fragment that is stored as a
single value of the element File, working around the limitation of the flat metadata model.
FIG. 7. Editor form for Files Upload
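The following is a minimal sketch of the wrapping described above. The element names follow Table 1, but the exact serialization used by datorium may differ, and the helper function is ours.

```python
import xml.etree.ElementTree as ET

# Sketch: serialize the hierarchical File entries into one XML string so that
# they fit into a single flat metadata value.
def wrap_file_entry(filename: str, description: str, language: str,
                    number_of_variables: int) -> str:
    file_el = ET.Element("file")
    ET.SubElement(file_el, "filename").text = filename
    ET.SubElement(file_el, "filedescription").text = description
    ET.SubElement(file_el, "language").text = language
    ET.SubElement(file_el, "numberofvariables").text = str(number_of_variables)
    return ET.tostring(file_el, encoding="unicode")

# The returned string would be stored as one value of the File element.
value = wrap_file_entry("survey2013.sav", "Main data file", "de", 120)
print(value)
```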
For the data consumers, finding data is an intellectual effort. In addition to free-text search,
faceted search is a well-known technique that helps users to browse large data collections, e.g.
images or documents, and delve into more details if required (Yee et al. 2003). By using this
technique, and since the controlled vocabularies are available in German and English, the data
can also be searched with keywords in a language in which the data was not documented
originally. Figure 8 demonstrates the multi-language support for the faceted search. The element
Geographical Area shows the same values in the preferred language.
FIG. 8. Multi-language support for faceted search
All front-end based manipulations make use of JavaScript, in particular jQuery10 and its
plugins, and have been successfully tested in various browsers and versions, including
Internet Explorer, Safari, Opera, Mozilla Firefox, and Google Chrome. The layout and user
interface are based on Manakin’s XMLUI11 with many modifications according to the GESIS
Web-Style-Guide12.
4. Conclusion and Future Work
We use Dublin Core for our purpose since it is a simple and appropriate schema for documenting
research data. However, to meet all requirements, some extensions are needed. We have shown
some approaches to make the application useful and to support complex descriptions without
sacrificing the “simplicity” of DC metadata. The front-end based manipulation, as we
demonstrated in this paper, can remedy the limitations of the schema, e.g. to deal with complex,
repeatable element structures. The documentation of the research data currently refers to the
study level; details about the variables used in the survey are not covered. Nevertheless, it remains
possible at any time to extend the schema to meet new requirements. Since the schema and
front-end are clearly separated, our approach is flexible enough for this situation.
As future work, we want to establish the connection between publications and research data
automatically (Boland et al. 2012; Ritze et al. 2013) in order to link scientific publications to
research data and vice versa. Moreover, an integrated search with other partner
repositories is under way. To this end, we plan to implement export / import and harvesting (e.g.
OAI-PMH) interfaces as well as a schema crosswalk to other standards, e.g. DDI.
Acknowledgements
This project is funded by GESIS – Leibniz Institute for the Social Sciences. We thank our
project members, Monika Linne, Wolfgang Zenk-Möltgen, Oliver Hopt, Natascha Schumann,
Stefan Müller, Reiner Mauer, and Martin Friedrichs, for their useful inputs on requirements
analysis and for their feedback and fruitful discussions during the implementation. Special thanks
go to Sigit Nugraha who helped extensively with the technical implementations and Astrid
Recker for proof reading this paper. Last but not least, we also thank the DSpace developers for
their effort to continually develop this great repository system as an open source project. datorium
can be accessed online at https://datorium.gesis.org/.
References
Ball, A. (2010): Review of the State of the Art of the Digital Curation of Research Data (version 1.1). ERIM Project
Document erim1rep091103ab11. Bath, UK: University of Bath. Retrieved April 28, 2014 from
http://opus.bath.ac.uk/18774/2/erim1rep091103ab11.pdf
Boland, K., Ritze, D., Eckert, K., Mathiak, B. (2012): Identifying References to Datasets in Publications. TPDL, Vol.
7489 of Lecture Notes in Computer Science, pages 150-161. http://dx.doi.org/10.1007/978-3-642-33290-6_17
Castro, J. A., Ribeiro, C., Rocha da Silva, J. (2013): Designing an Application Profile Using Qualified Dublin Core: A
Case Study with Fracture Mechanics Datasets. Proc. International Conference on Dublin Core and Metadata
Applications 2013. Retrieved April 28, 2014 from http://dcpapers.dublincore.org/pubs/article/view/3685
Chen, H., Lin, Y., Chen, C. (2013): Approaches to Building Metadata for Data Curation. Proc. International
Conference on Dublin Core and Metadata Applications 2013. Retrieved April 28, 2014 from
http://dcpapers.dublincore.org/pubs/article/view/3691
Dimitrov, D., Baran, E., Wegener, D. (2013): Making Data Citable - A Web-based System for the Registration of
Social and Economics Science Data. In: Krempels, Karl-Heinz; Stocker, Alexander (Hrsg.): Proceedings of the 9th
10 http://jquery.com/
11 https://wiki.duraspace.org/display/DSDOC18/XMLUI+Configuration+and+Customization
12 http://www.gesis.org/styleguide/
International Conference on Web Information Systems and Technologies: Aachen, Germany, 8 - 10 May 2013:
SciTePress, pages 155-159
Ell, B., Vrandečić, D., Simperl, E. (2011): Labels in the Web of Data. In Proceeding of the 10th International Semantic
Web Conference, 2011. http://dx.doi.org/10.1007/978-3-642-25073-6_11
Gamma, E., Helm, R., Johnson, R., Vlissides, J. (1994): Design Patterns: Elements of Reusable Object-Oriented
Software. Addison-Wesley Professional. ISBN-13: 978-0201633610
Greenberg, J., Swauger, S., Feinstein, E. (2013): Metadata Capital in a Data Repository. Proc. International Conference
on Dublin Core and Metadata Applications 2013. Retrieved April 28, 2014 from
http://dcpapers.dublincore.org/pubs/article/view/3678
Greenberg, J., White, H. C., Carriera, S., Scherleb, R. (2009): A Metadata Best Practice for a Scientific Data
Repository. Journal of Library Metadata. Volume 9, Issue 3-4, pages 194-212. doi:10.1080/19386380903405090
Isaac, A., Charles, V., Fernie, K., Dallas, C., Gavrilis, D., Angelis, S. (2013): Achieving Interoperability between the
CARARE Schema for Monuments and Sites and the Europeana Data Model. Proc. International Conference on
Dublin Core and Metadata Applications 2013. http://dcevents.dublincore.org/IntConf/dc-2013/paper/view/171
Linne, M. (2013): Sustainable data preservation using datorium: facilitating the scientific ideal of data sharing in the
social sciences. In: Borbinha, José; Nelson, Michael; Knight, Steve (Hrsg.): Proceedings of the 10th International
Conference on Preservation of Digital Objects, Lisbon: Biblioteca Nacional de Portugal, S. 150-155. Retrieved May
16, 2014 from http://purl.pt/24107/1/
Malta, M. C., Baptista, A. A. (2014): A panoramic view on metadata application profiles of the last decade. Int. Journal
of Metadata Semantic and Ontologies Vol. 9, Issue 1 (February 2014), pages 58-73.
Príncipe, P., Rodrigues, E., Rettberg, N., Schirrwagen, J., Loesch, M., Karstensen, M., Nielsen, L. H. (2013):
OpenAIRE Guidelines for Data Archive, Literature Repository and CRIS Managers. Proc. International Conference
on Dublin Core and Metadata Applications 2013. Retrieved April 28, 2014 from
http://dcpapers.dublincore.org/pubs/article/viewFile/3695
Qin, J., Li, K. (2013): How Portable Are the Metadata Standards for Scientific Data? A Proposal for a Metadata
Infrastructure. Proc. International Conference on Dublin Core and Metadata Applications 2013. Retrieved April 28,
2014 from http://dcpapers.dublincore.org/pubs/article/viewFile/3670/1893
Rice, R. (2008). Applying DC to Institutional Data Repositories. Proceedings of the International Conference on
Dublin Core and Metadata Applications, 2008. Retrieved May 16, 2014 from
http://dcpapers.dublincore.org/pubs/article/view/945
Ritze, D., Boland, K. (2013): Integration of Research Data and Research Data Links into Library Catalogues. Proc.
International Conference on Dublin Core and Metadata Applications 2013. Retrieved April 28, 2014 from
http://dcpapers.dublincore.org/pubs/article/view/3683
Wallis, J. C., Mayernik, M. S., Borgman, C. L., Pepe, A. (2010): Digital libraries for scientific data discovery and
reuse: from vision to practical reality. Proc. 10th Joint Conference on Digital libraries (JCDL '10). ACM, New York,
NY, USA, pages 333-340. doi:10.1145/1816123.1816173
Wira-Alam, A., Dimitrov, D., Zenk-Möltgen, W. (2012): Extending Basic DublinCore Elements for an Open Research
Data Archive. Proc. International Conference on Dublin Core and Metadata Applications 2012. Retrieved April 28,
2014 from http://dcpapers.dublincore.org/pubs/article/view/3664/1887
Yee, K.-P., Swearingen, K., Li, K., Hearst, M. (2003): Faceted Metadata for Image Search and Browsing. ACM
SIGCHI: Human Factors in Computing Systems, 2003. http://dx.doi.org/10.1145/642611.642681
Zenk-Möltgen, W., Habbel, N. (2012): Der GESIS Datenbestandskatalog und sein Metadatenschema. Version 1.8.
GESIS Technical Reports 2012/1. Retrieved June 21, 2012 from http://nbn-resolving.de/urn:nbn:de:0168-ssoar-292372
Metadata for Research Data: Current Practices and Trends
Sharon Farnel
University of Alberta, Canada
[email protected]
Ali Shiri
University of Alberta, Canada
[email protected]
Abstract
This paper reports a study that examined the metadata standards and formats used by a select
number of research data services, namely Datacite, Dataverse Network, Dryad, and FigShare.
These services make use of a broad range of metadata practices and elements. The specific
objective of the study was to investigate the number and nature of metadata elements, metadata
elements specific to research data, compliance with interoperability and preservation standards,
the use of controlled vocabularies for subject description and access and the extent of support for
unique identifiers as well as the common and different metadata elements across these services.
The study found that there was a variety of metadata elements used by the research data services
and that the use of controlled vocabularies was common across the services. It was found that
preservation and unique identifiers are central components of the studied services. An interesting
observation was the extent of research data specific metadata elements, with Dryad making use of
a wider range of metadata elements specific to research data than other services.
Keywords: metadata; research data; research data services; standards
1. Data Repositories
“And yet, data is the currency of science, even if publications are still the currency of tenure.
To be able to exchange data, communicate it, mine it, reuse it, and review it is essential to
scientific productivity, collaboration, and to discovery itself” (Gold 2007). Although the nature of
research data can vary widely depending on the discipline, its importance to the replication,
refutation or validation of the findings or observations of a research project has never been in
doubt.
Research data has recently been viewed as being part of a larger data landscape, namely big
data. A number of researchers have referred to research data, linked data, the web of data and
open data as constituting elements of the big data landscape (Hodson, 2012; Shiri, 2013). The
Report of the 2011 Canadian Research Data Summit (Research Data Strategy Working Group,
2011) provides a specific categorization of digital data, namely research data, produced by
academia, industry and government.
The sharing of research data has long been a practice among many research communities, often
through informal means made increasingly easy with the advent of the internet and associated
tools such as email, ftp sites, etc. Borgman (2007) provides four rationales for the sharing of
research data, namely “to (a) reproduce or verify research, (b) make results of publicly funded
research available to the public, (c) enable others to ask new questions of extant data, and (d)
advance the state of research and innovation”. She also notes that common metadata formats,
ontologies and data structures will support the integration of multiple data sources and services.
The rise of the open data1 and open science data2 movements, in conjunction with the
increasing implementation of data management and sharing policies by funding bodies3,
1 http://en.wikipedia.org/wiki/Open_data
2 http://en.wikipedia.org/wiki/Open_science_data
3 http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm
governments4 and journals5, has led to an explosion in the number of research data services
created to serve institutions, association members, and research communities. Databib6 and
re3data.org7 maintain listings of research data services and, as of August 2014, together list
nearly one thousand. Many services enable the deposit of research data and associated metadata,
while others focus on metadata describing research data that is housed in other repositories.
This proliferation of services offering a range of functionalities and designed to serve different
communities with different needs poses many challenges to researchers, librarians and others
within the research community working to create an interoperable research environment.
Documenting the range of functionalities as well as defining means of comparing one service to
another have been recognized as important activities and have begun to be addressed by Databib8
and Dryad9 respectively. Key to any overall comparison or evaluation is an understanding of the
metadata practices within services.
2. Metadata in Data Repositories
Metadata is structured information that provides context for information objects of all kinds,
including research data, and in doing so enables the use, preservation, and reuse of those objects.
The importance of quality, standards-based metadata has long been understood by those in the
fields of librarianship and research data management; NISO’s six principles of good metadata
(NISO 2007) are an excellent and oft-cited expression of that understanding. The same,
however, has not always been the case among research communities. A recent study (Tenopir et
al., 2011) found that there is a “lack of awareness about the importance of metadata among the
scientific community - at least in practice” and recommended that institutions and individuals
within them who work with researchers can and should do more to help researchers prepare the
metadata necessary to enable the discovery, preservation, and reuse of their data. In a scoping
study, Ball (2009) explored the feasibility and desirability of a harmonized application profile to
improve resource discovery and reuse of scientific and research data in the repository landscape.
The two key findings of his study were that a) a comparison of data models and metadata
schemes from a variety of disciplines suggested that a carefully generalized metadata profile
could be constructed that is both widely applicable and yet still fulfils the requirements of the use
cases and b) while the comparison of several different data models shows sufficient common
ground for a relatively detailed data model on which to base a Scientific Data Application Profile,
from an implementation perspective a simple model is preferred.
One of the main arguments for the identification and documentation of metadata practices and
formats for research data services is to create a solid basis upon which subject and semantic
interoperability can be ensured. Identifying useful metadata elements and practices will support
various interoperability models reported in the literature (Nicholson and Shiri, 2003; Hafezi, et
al., 2010). The same arguments that were made in the first generation of digital libraries, open
archives and content management systems hold true for research data services - the variety of
disciplines involved and the vastness of research data call for a more systematic and holistic
approach to metadata. In their 2012 study, Willis et al. identified 11 fundamental metadata goals
for metadata documenting research data and highlighted the need for further metadata-related
research. An evidence-based approach to the study of emerging research data management
systems allows us not only to study emerging trends but also to develop a basis for formulating
4 http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
5 http://www.plosone.org/static/policies#sharing
6 http://databib.org/index.php
7 http://www.re3data.org/
8 http://goo.gl/mQvy0F
9 http://www.dcc.ac.uk/webfm_send/750
best practices and policies for research data management. This study aims to take a step towards
that goal.
3. Purpose
Given the confluence of increased requirements around data management and sharing with
greater demand by researchers for services around metadata standards and applications, an
examination and comparison of the metadata standards and practices of research data services
would be both timely and beneficial. Given the emerging nature of research data repositories and
the urgent need for evidence-based practices, it is important to study examples of the repositories
that have been experimenting with how best to organize and manage research data. This is not
only useful for the metadata community in conceptualizing metadata standards in a new and
emerging context, it is particularly important for planners and practitioners who aim to embark on
research data repository projects. The objective of this study is to examine the metadata standards
and formats used by a select number of research data services to address several specific research
questions. These research questions are concerned with both theoretical as well as practical
aspects of organizing, managing and providing access to research data.
1. What is the number and nature of metadata elements available?
2. What research data specific metadata do these services provide in addition to common
metadata elements?
3. To what extent do the research data management services adhere to widely recognized
interoperability and preservation metadata standards?
4. Which research data repositories benefit from and promote controlled vocabularies for
subject description and access?
5. How many of the services provide support for unique identifiers (e.g., DOIs)?
6. What kind of metadata assistance (documentation, etc.) is provided?
7. What metadata elements are common and different across these services?
4. Methodology and Analysis
The nature of this study is exploratory in the sense that it aims to gain an insight into the
current metadata practices and trends in four research data services: Datacite,10 Dataverse
Network,11 Dryad,12 and FigShare.13 The rationale for the selection of these services lies in the fact
that these are widely popular and internationally used research data services that cover multiple
disciplines. A significant number of research-intensive and academic institutions are already
using these services and some are considering them in their research data management planning.
Table 1 provides an overview of the geographic distribution of these research data services,
their subject areas as well as their main services.
TABLE 1: Research data services

Service | Subject area | Main services | Location
Datacite | General | Metadata, DOI | UK
Dataverse Network | General | Cite, analyze, preserve | US
Dryad | General | data underlying scholarly publications discoverable, accessible, understandable, freely reusable, and citable | US
FigShare | General | figures, datasets, media, papers, posters, presentations and filesets, altmetrics | UK

10 http://www.datacite.org
11 http://thedata.org/
12 http://datadryad.org/
13 http://figshare.com/
The seven research questions above, which are informed by the NISO principles for good
metadata (NISO 2007), provide the analytical framework for examination of research data
services focusing on various aspects of metadata elements, formats, and standards. As was stated
earlier, an evidence-based approach for this study was thought particularly useful, partly because
of the emerging nature of research data management systems and partly because of the variety of
disciplines and domains that current research data management services cover. To address the
research questions, existing metadata records, metadata creation interfaces, and associated
documentation will be examined. The following comparative table addresses the key research
questions.
5. Findings
Table 2 provides an overview of our sample set of research data services with respect to
research questions 1 through 6.
TABLE 2: Research data services comparison (research questions 1-6)

Criterion | Datacite | Dataverse Network | Dryad | Figshare
Number of metadata elements | 41 | 100 | 52 | 12
Research specific metadata elements | No | Yes | Yes | No
Compliance with standards | Datacite Metadata Schema, which is an application profile of Dublin Core (DC); OAI | Data Documentation Initiative (DDI) Codebook, compliant with Dublin Core (DC) and Content Standard for Digital Geospatial Metadata (CSDGM); MARC; LOCKSS; OAI | Dublin Core, Darwin Core, Bibliographic Ontology, METS/MODS; OAI/DC; OAI/ORE (Object Reuse and Exchange); RDF/DC; CLOCKSS. For now, OAI/DC is the recommended format. | CLOCKSS
Use of controlled vocabularies | Includes controlled vocabularies for some elements, supports use of controlled vocabularies for other elements; MESH, OBI, NCBI | Supports use of controlled vocabularies such as Open Biomedical Ontologies & Gene Ontology. A trial version of HIVE is provided to support subject description. | Supports use of ontologies and controlled vocabularies; LCSH, TGN, MESH, Integrated Taxonomic Information Systems (ITIS), National Biological Information Infrastructure Biocomplexity Thesaurus, LC Name Authorities file | No formal controlled vocabularies; only 14 high level categories
Support for DOI | Yes | Yes | Yes | Yes
Metadata assistance | full documentation of metadata schema, user guidelines, full api documentation | metadata documentation available via user guide, contextual help available for each element in metadata entry form | Dryad Wiki pages provide detailed documentation including Cataloguing guidelines | Partner with DataCite
In terms of metadata elements, the services range in number from 12 to 100. Of course, the
number of elements is not a measure of success or performance of a system. The number of
metadata elements may be dependent on a wide range of factors, including the simple or
sophisticated approaches that the research data repositories adopt, the disciplines and domains
that they cover as well as the applicability of the elements in terms of metadata creation and
maintenance. The proportion of general metadata elements in comparison to research data
specific elements ranges quite dramatically; Datacite has no research data specific metadata
elements while Dryad has 35 (of 52 total). Dataverse and Dryad provide a more sophisticated set
of metadata elements and standards. Figshare takes a minimalist approach and provides a very
basic set of metadata elements to facilitate quick and easy deposit of research data.
Preservation appears to be one of the central components of research data services to ensure
long term access to data. Most have adopted preservation strategies associated with LOCKSS14
(Lots of Copies Keep Stuff Safe) and CLOCKSS15 (Controlled LOCKSS) as widely used and
common information and data preservation approaches. Given the importance of interoperability
in research data management services, DataCite, Dataverse Network and Dryad support OAIPMH16 (Open Archives Initiative/ Protocol for Metadata Harvesting) to ensure the wider
findability and discoverability of research data
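As a concrete illustration of what such interoperability enables, the following is a minimal harvesting sketch using the standard OAI-PMH ListRecords verb with the oai_dc metadata prefix. The endpoint URL is a placeholder, not one of the studied services, and a production harvester would also have to handle resumption tokens and errors.

```python
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "https://example.org/oai"   # placeholder OAI-PMH endpoint
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def harvest_titles(base_url: str = BASE_URL):
    """Yield (identifier, title) pairs from a single ListRecords response."""
    url = f"{base_url}?verb=ListRecords&metadataPrefix=oai_dc"
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    for record in tree.iterfind(".//oai:record", NS):
        identifier = record.findtext(".//oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        yield identifier, title

# Example usage (commented out, since the endpoint above is a placeholder):
# for identifier, title in harvest_titles():
#     print(identifier, title)
```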
Initial comparison of several of the sample research data services demonstrates that a variety of
metadata standards are in use, although Dublin Core is used or supported across most of the
services. Support for controlled vocabularies is common, although few incorporate them by
default into their schema. For instance, while Dryad and DataCite adopt a more systematic
14 http://www.lockss.org/
15 http://www.clockss.org/clockss/Home
16 http://www.openarchives.org/pmh/
approach to the use of various controlled vocabularies for subject description and access,
recommending various thesauri and knowledge organization systems, Figshare does not provide
any specific provision for this feature; the only subject access mechanism in Figshare is the high
level subject categories that appear when users click on the ‘browse’ option on the homepage.
An encouraging sign is the common support for DOIs which are seen as key to discovery,
preservation and citation of research data. All of the services appear to have metadata
documentation available to aid users.
Table 3 provides a detailed account of the common and unique metadata elements used by the
four research data repository services.
TABLE 3: Research data services comparison (research question 7)17
Datacite
Dataverse Network
Dryad
Figshare
Titles
title
- title
- subtitle
- document title
- article title
- journal title
- data package title
title
Creators,
Contributors
- creator
- contributor
- publisher
- author
- producer
- funding agency
- distributor
- depositor
- contact
- data collector
- author
- creator
- author
- collaborators
Topical
subject(s)
subject
- keyword
- topic classification
- keyword
- scientific name
- categories
- tags
General
description
description
abstract
- article abstract
- description
description
Object type(s)
resource type
kind of data
type
type
Date(s)
- date
- publication year
- production date
- distribution date
- deposit date
- version date
- date of collection-start
- date of collection-end
- date of issuance
- deposit date
- date available
- embargo date
- date created
- date published
Rights, Access,
Use
rights
- data access place
- original archive
- availability status
- confidentiality
declaration
- special permissions
- restrictions
- conditions
- provenance
- document holdings
- disclaimer
- rights statement
- location of related
content outside of
Dryad
license
Object technical
characteristics
- size
- format
- software
- software version
- size of collection
- study completion
- file format
- file size
- provenance
file size
17 Note that Table 3 does not reference attributes or attribute values and is not meant to be an element-by-element mapping.
Spatial
subject(s)
- geo location
- country/nation
- geographic coverage
- geographic unit
- geographic bounding
box
- spatial coverage
Identifiers
- identifier
- alternate
identifier
- related identifier
- study global ID
- other ID
- article identifier
- associated Dryad
data package
identifier
- data package
identifier
- identifier for
related data in
Dryad partner
repository
- associated Dryad
publication record
identifier
- associated Dryad
data file record
identifier
- data file identifier
- issn
- electronic issn
Temporal
subject(s)
- time period coveredstart
- time period coveredend
- temporal coverage
Citation
- citation requirements
- depositor requirements
- journal volume
number
- journal issue
- article start page
- article end page
- article pages
Versioning
Methodology
version
version
- unit of analysis
- universe
- time method
- frequency
- sampling procedure
- major deviations for
sample design
- collection mode
- type of research
instrument
- data sources
- origin of sources
- characteristics of
sources noted
- documentation and
access to sources
- characteristics of data
collection situation
- actions to minimize
losses
- control operations
- weighting
- cleaning operations
- study level error notes
- response rate
- estimates of sampling
errors
- other forms of data
appraisal
Related
resources
Language(s)
- series
- series information
- replication for
- related publications
- related material
- related studies
- other references
language
Status
- status
- article publication
status
Production
- production place
Additional grant
information
- grant number
- grant number agency
Note(s)
notes
Dryad, Dataverse and DataCite make use of Dublin Core as well as other metadata schemes
and standards. It is not surprising to note that there are common metadata elements across these
services. Dryad also utilizes Darwin Core, Bibliographic Ontology and its own repository specific
elements. While Figshare makes limited use of metadata elements, at least seven out of eleven
metadata elements are consistent with Dublin Core. Therefore, one can argue that there is a set of
elements across these four services that allows for basic interoperability if a meta-service were to
be created for cross-searching and cross-browsing.
One of the key questions this study aimed to address was the inclusion or creation of metadata
elements specifically for research data. Our comparative analysis of the above research data
services shows that there are research data specific metadata elements being used. Dataverse
Network and Dryad incorporate metadata elements in this area. For instance, Dataverse makes
use of such metadata elements as date of data collection, data collectors, depositor, deposit date,
data specific file types such as raw data, processed data. Dryad offers a number of metadata
elements related to the data package and data files deposited into Dryad. Examples of these
elements include: Associated Dryad Data Package Identifier, Data Package Title, Data Package
Identifier, Associated Dryad Data File Record Identifier, Data File Identifier, Deposit Date.
6. Conclusions and Future Work
This study compared four different research data services in terms of metadata and research
data management practices. The results of this study will improve understanding among
researchers, librarians and research data managers of the application of metadata in research data
services. These preliminary findings contribute to the development of a set of guidelines and best
practices for developing and implementing metadata for research data services in order to pave
the way for the development of an interoperable research data environment. Furthermore, the
identification of metadata elements and formats in commonly used research data services will
contribute to the creation of an interoperable research data environment. Future work will include
expanding this analysis to additional research data services, both general and domain-focused, as
well as comparing in detail the metadata elements common across and unique among the
services. The development of a framework that takes into account such important components as
preservation infrastructures, unique identifiers, interoperability architecture and the definition of a
set of research data specific metadata should guide further research and development in this area.
References
Alipour-Hafezi, Mehdi, Abbas Horri, Ali Shiri, and Amir Ghaebi. (2010). Interoperability Models in Digital Libraries:
An Overview. The Electronic Library, 28(3), 438-452.
Ball, A. (2009). Scientific data application profile scoping study report. June 3rd. Retrieved August 5, 2014, from
http://alexball.me.uk/docs/ball2009sda/.
Borgman, Christine L. (2012). The conundrum of sharing research data. Journal of the American Society for
Information Science and Technology, 63(6), 1059-1078.
Gold, Anna. (2007, September/October). Cyberinfrastructure, Data, and Libraries, Part 1: A Cyberinfrastructure Primer
for Librarians. D-Lib Magazine. 13/9/10. Retrieved, August 5, 2014 from
http://www.dlib.org/dlib/september07/gold/09gold-pt1.html.
Hodson, Simon. (2012). JISC and Big Data. Eduserv Symposium 2012: Big Data, Big Deal? May 10, 2012, London,
UK.
Nicholson, Dennis and Ali Shiri. (2003). Interoperability in Subject Searching and Browsing. OCLC Systems &
Services, 19(2), 58 - 61.
NISO. (2007). A Framework of Guidance for Building Good Digital Collections: Metadata. Retrieved, August 5, 2014,
from http://www.niso.org/publications/rp/framework3.pdf.
Research Data Strategy Working Group. (2011). Mapping the Data Landscape: Report of the 2011 Canadian Research
Data Summit. Retrieved, August 5, 2014, from https://web.archive.org/web/20140312192321/http://rds-sdr.cisti-icist.nrc-cnrc.gc.ca/obj/doc/2011_data_summit-sommet_donnees/Data_Summit_Report.pdf.
Shiri, Ali. (2013). Linked Data Meets Big Data: A Knowledge Organization Systems Perspective. Advances in
Classification Research Online, North America, 24(1). Retrieved, August 5, 2014, from
http://journals.lib.washington.edu/index.php/acro/article/view/14672.
Tenopir, Carol, Suzie Allard, Kimberly Douglass, Arsev Umur Aydinoglu, Lei Wu, Eleanor Read, Maribeth Manoff,
and Mike Frame. (2011) Data Sharing by Scientists: Practices and Perceptions. PLoS ONE, 6(6). Retrieved, August
5, 2014, from http://dx.plos.org/10.1371/journal.pone.0021101.
Wiley, Christie. (2014), Metadata Use in Research Data Management. Bulletin of the Association for Information
Science and Technology, 40(6). Retrieved, August 5, 2014 from http://www.asis.org/Bulletin/Aug14/AugSep14_Wiley.html.
Willis, Craig, Jane Greenberg, and Hollie White. (2012). Analysis and synthesis of metadata goals for scientific data.
Journal of the American Society for Information Science and Technology, 63(8), 1505 - 1520.
Infrastructure & Models—Part A
The ARK Identifier Scheme: Lessons Learnt at the BnF and
Questions Yet Unanswered
Sébastien Peyrard
BnF, France
[email protected]
Jean-Philippe Tramoni
BnF, France
[email protected]
John Kunze
California Digital Library,
USA
[email protected]
Abstract
The Bibliothèque nationale de France (BnF) looks back at lessons learnt over eight years of
implementing persistent identifiers (ARKs). While persistent identification is still a relatively
young field, this is enough time to gain practical experience, and to conduct a meaningful gap
analysis between what is and what should be, especially in a semantic web context. That analysis
has exposed important issues concerning best practices and compliance with existing standards.
Keywords: Archival Resource Key; persistent identifiers; web of data; linked data.
Introduction
“Eternity is a very long time, especially towards the end.” W. Allen1
When considering persistent identifiers, one tends to focus on two ends of the timeline: the
immediate near term (at the initial implementation stage) and the very long term, the latter often
being too abstract to act on directly. After eight years of implementation experience and almost
20 million ARKs assigned, the BnF now takes the opportunity to look back. This article explores
what issues have to be considered during the lifespan of persistent identifiers, in this case ARKs.
It also touches on the ARK standard: this 13-year-old standard might benefit from clarification or
modification. At a time when institutions are diving into linked data and appear as key
stakeholders in the web of data, we believe persistent identifiers have a key role in supporting
trustworthy and stable bridges across data silos.
1. The ARK identifier scheme: overview
ARK identifiers have been introduced in various articles and web resources (CDL, 2013)
(Kunze, 2003). This section summarizes only enough to make the rest easily understandable.
1.1. Purpose and aim
The ARK standard addresses the same issues as other persistent identification schemes.
Although anyone can use them, and there are about 270 organizations currently registered (CDL,
2014), ARKs have been most popular with heritage institutions. These institutions are usually
tasked with indefinite retention of content, well beyond expected lifetimes of commercial
institutions, and where the perspective is set on the very long term.
ARKs have a very conservative approach to persistent identification. Like URNs and DOIs,
ARKs are designed to be independent of DNS and the HTTP protocol; however, they are also
designed to work directly in today’s web environment as URLs, by specifying that the hosting
arrangement does not affect identity. For example, these ARKs identify the same resource:
• http://gallica.bnf.fr/ark:/12148/bpt6k5834013m
• http://bnf.example.org/ark:/12148/bpt6k5834013m
• ark:/12148/bpt6k5834013m
The last of these (with no hostname) is the core immutable identifier.
1 http://edition.cnn.com/2006/WORLD/asiapcf/07/04/talkasia.hawking.script
1.2. Anatomy
The base ARK name is typically a completely opaque (meaningless) identifier in order to
drastically reduce any pressure to change the identifier string over the long term. For example,
• http://gallica.bnf.fr/ark:/12148/bpt6k5834013m
This sort of base name is often extended with a qualifier that may be less opaque, as in
• http://gallica.bnf.fr/ark:/12148/bpt6k5834013m/f19.highres
An actionable ARK (an ARK that works in today’s web) has three main parts.
• The core immutable identifier itself is mandatory and is designed to be globally
unambiguous, persistent and opaque. To that end, it has a structure proceeding from the
most general to the most specific (left to right):
  - the identifier scheme (“ark:/”), a label that is easy to find by simple text miners;
  - the Name Assigning Authority (NAA), which has a 5-to-9 digit NAA Number
    (NAAN) for opacity. NAAN uniqueness is guaranteed via a registry2 based at the
    California Digital Library (CDL);
  - the ARK name itself, which should be opaque and is assigned by the NAA; if
    independent ARK name assignments are performed within a single NAA, the NAA
    often designates sub-naming authorities corresponding to short prefixes for the ARK
    name, to ensure ARK name uniqueness.
• The Name Mapping Authority (NMA), which enables the identifier to resolve to a
resource. The NMA is implemented with a Name Mapping Authority Hostport (NMAH),
which in today’s web environment is usually an HTTP server. This part can change over
the long term, which is why it is optional. Here for example the NMAH is
“http://gallica.bnf.fr”.
• The optional qualifier part, which enables extra services provided by the NMAH using
the standard ARK reserved characters “.” and “/”. At BnF they are often used as follows.
  - Naming sub-parts of a resource (e.g. a specific page in a digitized book). This is
    achieved by hierarchy qualifiers beginning with “/” (/f19 in the example).
  - Naming variants or services of the resource (e.g. a specific version in the lifecycle of
    a digitized book, or the thumbnail of a given image). This is achieved by variant
    qualifiers beginning with “.” (.highres in the example)
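The following small sketch splits an actionable ARK into the parts just described. The regular expression is illustrative only, not the normative ARK grammar, and the field names in the result are ours.

```python
import re

# Sketch: split an actionable ARK into optional NMAH (hostname), NAAN,
# opaque name, and optional hierarchy/variant qualifiers.
ARK_PATTERN = re.compile(
    r"^(?:(?P<nmah>https?://[^/]+)/)?"   # optional Name Mapping Authority Hostport
    r"ark:/(?P<naan>\d{5,9})"            # Name Assigning Authority Number
    r"/(?P<name>[^./]+)"                 # opaque ARK name
    r"(?P<qualifier>[./].*)?$"           # optional qualifiers
)

def parse_ark(ark: str) -> dict:
    match = ARK_PATTERN.match(ark)
    if not match:
        raise ValueError(f"not an ARK: {ark}")
    return match.groupdict()

print(parse_ark("http://gallica.bnf.fr/ark:/12148/bpt6k5834013m/f19.highres"))
# {'nmah': 'http://gallica.bnf.fr', 'naan': '12148',
#  'name': 'bpt6k5834013m', 'qualifier': '/f19.highres'}
print(parse_ark("ark:/12148/bpt6k5834013m"))
# {'nmah': None, 'naan': '12148', 'name': 'bpt6k5834013m', 'qualifier': None}
```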
1.3. Using ARKs
ARKs raise many of the same issues as other persistent identification schemes.
• Institutional commitment and policy. Persistent identification is not a technical problem.
It will only work if an institution commits to ensure persistence and global uniqueness over
the long term. There needs to be a clearly articulated stewardship policy.
• Assignment procedures. Clearly articulated procedures are also required to ensure that
assignments are unique and consistently applied to defined resource types. Decisions to be
made comprise what ARKs are identifying, which resources are considered to deserve
separate ARKs, and which resources should be considered variants of the same ARK.
• Resolution. One or more NMAHs are needed to resolve ARKs, each NMA defining a
level of service provided with the ARKs. Reliable resolution allows reliable citation.
2 The NAAN registry can be accessed at http://www.cdlib.org/uc3/naan_registry.txt.
ARKs also offer two ways of supporting linked data. Besides using content negotiation, ARK
end-users may instead append suffixes, called inflections, to gain access to services related to a
resource, but without requiring them to remember whole new identifiers. For example,
• http://texashistory.unt.edu/ark:/67531/metapth346793/   (ARK for the resource)
• http://texashistory.unt.edu/ark:/67531/metapth346793/?  (its metadata)
• http://texashistory.unt.edu/ark:/67531/metapth346793/?? (the NMA’s commitment)
By itself an ARK should lead to the resource (object). Appending a single “?” should lead to
the resource’s metadata (Kunze, 2010) and appending “??” should lead to metadata describing the
kind of persistence to expect. In the current archival environment, the latter is critical for
indicating when a resource is truly invariant, or subject to correction, or is a growing resource.
As an alternative to content negotiation, ARK inflections are easier to use and more precise.
Inflections are not as easy to support, however, with the Tomcat-based web services at BnF.
2. A brief history of ARKs at the BnF
2.1. Adoption and initial implementation of the ARK identifier scheme
In 2006, the BnF conducted a risk-driven requirements analysis to adopt the ARK persistent
identification scheme. Two core requirements used as selection criteria were (1) financial independence of the NAA: identifiers subject to a fee, such as DOIs, were discarded; and (2) technical independence of the naming authority (since identifiers had to be directly integrated into our in-house Information Systems): identifiers relying on installing special-purpose software, such as Handles, or on external services, such as PURLs, were discarded. BnF needed stable, location-independent URLs that do not redirect to temporary URLs (avoiding the overhead of managing an endlessly increasing number of redirects).
URNs also fit our criteria fairly well, but the ARK specification addressed some areas more
precisely than URNs, such as the definition of a persistence policy, and additional services on a
particular resource in a web context (through the use of qualifiers). Like the URN scheme, the
ARK scheme does not mandate use of one particular vendor or service for its identifiers. Unlike
URNs, DOIs, and Handles, however, ARKs also do not mandate use of one well-known DNS
resolution starting point, so ARKs can be implemented directly on a local web server. While
some consider this a weakness, citing the “inherent” fragility of DNS names, their argument
usually suggests using dx.doi.org, handle.net, or n2t.net instead; the logical flaw is that these are
DNS names too, and we note that none of them are as long-lived as bnf.fr. The bottom line is that
ARKs are implementable with the simplest of technologies, and they do not require a special-purpose global infrastructure uniquely built for their own scheme.
At this stage, ARKs were defined for two distinct types of resources: digitized documents, available in the digital library Gallica (using http://gallica.bnf.fr as NMAH), and catalogue records, which needed identification for exchange with BnF’s OAI repositories (using http://catalogue.bnf.fr as NMAH).
For both NMAHs, we defined an initial complete set of qualifiers to name subparts and
variants. As an illustration, in gallica.bnf.fr, we defined qualifiers to name the pages of a book
(e.g. http://gallica.bnf.fr/ark:/12148/bpt6k5834013m/f10 to name page number 10 in the digitized
document, /f10n5 to name the set of pages 10 to 14), and qualifiers to invoke variants of a book
or a page (http://gallica.bnf.fr/ark:/12148/bpt6k5834013m/f10.highres, .medres, .lowres and
.thumbnail for the different resolutions of the same page; .text to access the OCR for a particular
page, .vocal to access the sound version of the same page). For the main catalogue, qualifiers
were used to name distinct formats of the same record.
More details about the initial approach and the first implementation choices are available in (Bermès, 2006). Over the following eight years, ARKs became the lingua franca across the institution, and their use expanded to new areas.
2.2. Fostering ARK identifiers: new resources, new clients
Since 2006 BnF has expanded its initial use of ARKs for two different purposes:
•	Identifying descriptive records in order to manage them in our OAI repositories, and more recently, in our data.bnf.fr linked data services. This led to assigning ARKs to EAD finding aids, manuscript illumination records, and museographic descriptions.
•	Preserving digital documents. In 2010 our preservation repository, SPAR (Scalable Preservation and Archiving Repository), went operational. As each Information Package had to have a persistent identifier, SPAR played the role of an ARK assigner whenever there was no pre-existing ARK assigned to the ingested document.
These different resource sets had different scales and creation workflows, which made it very
difficult to have a single ARK assignment procedure. The most central assigner is SPAR, but it is
only for digital documents (not descriptive records) and it was rolled out after the assignment
channels for mass digitization were operational and optimized, which led to path dependence. On
the descriptive records end, some databases had much smaller datasets than the 15 million
records of the catalogue, which made semi-automated assignment procedures more suitable.
In the end, ARKs were assigned using three different means.
•	Automated, based on an existing number: used for our two legacy systems (Gallica and our catalogue records), and for our finding aids database. Our large datasets have pre-existing reliable numeric ids that we can “dress” as ARKs (see the sketch after this list). E.g. the record n°32915216 from the main catalogue had the “c” sub-naming authority for descriptive resources, and the “b” 2nd-level sub-naming authority for records from the main catalogue. Thus, 32915216 became ark:/12148/cb32915216j (with the addition of a final check character).
•	Automated, independent of any number: used for medium to large datasets with no reusable id (because significant or incompatible with the ARK structure). Our preservation repository, SPAR, automatically assigns an ARK upon ingest. E.g. ark:/12148/bc6p01zndd, assigned to a web archiving container file, indicates (to BnF staff) that assignment was routed to sub-naming authority “b” (digital content) and to repository “c6p0”, a 2nd-level sub-naming authority that takes care of uniqueness at repository level.
•	Semi-automated: with a list of ARKs that curators assign to resources (one spreadsheet per sub-naming authority), this is used for very small datasets. It meant defining a sub-naming authority per database to guarantee uniqueness. E.g. ark:/12148/cdt9x5ww identifies a book binding description. As a descriptive record, assignment was routed to sub-naming authority “c”, then to 2nd-level sub-naming authority “dt9x” for book binding.
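As a rough illustration of the first, "dressing"-based channel, the sketch below (our own hypothetical code, not BnF tooling) prefixes a catalogue record number with the sub-naming authorities mentioned above; the final check character is produced by the ARK/NOID tooling in the real workflow and is only stubbed out here.

NAAN = "12148"   # BnF's Name Assigning Authority Number

def dress_as_ark(record_number, subnaming="c", second_level="b", check_char=""):
    """Dress a pre-existing catalogue record number as an ARK name.

    'c' routes to descriptive resources and 'b' to main-catalogue records (see above).
    check_char stands in for the final check character computed by the ARK tooling.
    """
    name = f"{subnaming}{second_level}{record_number}{check_char}"
    return f"ark:/{NAAN}/{name}"

# With the real check character this yields ark:/12148/cb32915216j
print(dress_as_ark("32915216", check_char="j"))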
On the access side, as new services were being built upon new resources, several ARK
NMAHs could be used simultaneously for the same resource3. For instance, the same catalogue
record can be displayed in the main catalogue, which delivers a “full” but isolated record in
traditional formats, and also in data.bnf.fr, which provides the RDF view of this record, but
displays it in an enriched landing page that aggregates related resources. The difference is obvious for authority records, as can be seen, for instance, by comparing these two ARKs:
http://catalogue.bnf.fr/ark:/12148/cb118905823
http://data.bnf.fr/ark:/12148/cb1189058234
3 This might be considered a risky practice: with several NMAHs for the same ARK identifier, you need to know all the NMAHs of a particular resource to have a complete view of it. We addressed this problem by defining a default NMAH for a given resource that is considered the “master” view for such a resource. For instance, http://catalogue.bnf.fr (main catalogue) is the default for bibliographic descriptions. A strength of potentially distinct NMAHs for a single ARK is that it forces one to dissociate the resource from the current application providing access to it, which encourages a long-term perspective.
4 As of July 2014, data.bnf.fr accounts for only 60% of the catalogue data. Therefore, 40% of the ARKs in the main catalogue are not (yet) in data.bnf.fr.
2.3. At international scale: backing up the ARK registry
The NAAN registry maintained by the CDL described in §1.2 is a cornerstone for the viability
of ARKs, because the centralized registration of NAAs ensures the uniqueness of each NAA
Number (NAAN). To this end, it was important to guarantee its persistence over the long term,
which led to registry mirroring arrangements with the US National Library of Medicine (NLM)
and the BnF. From the BnF point of view, it meant formalizing a partnership with the CDL with a
Memorandum of Understanding. As this MoU had to be signed off by the president of the BnF, it
had the beneficial side-effect of securing institutional commitment for ARK identifiers from top-level management.
3. Implementation gap analysis: Consolidating ARK curation at the BnF
The previous section describes how ARKs gained momentum at the BnF and were
progressively applied for different purposes and resource types beyond the originally envisioned
use cases. This led to a wide variety of implementation choices and management rules, and
consequently a call for centralized policy and harmonization. A gap analysis was conducted in
2014 to address this question in a systematic fashion. It consisted of summarizing the lessons
learnt and problems encountered over the past 8 years, and then organizing those lessons around
the following focal areas: functional, organizational or technical issues, qualifier implementation
questions, policy descriptions, and compliance with standards. Those focal areas are described in
separate subparts of §3 and §4.
The next subsection summarizes the issues uncovered by the gap analysis. Most of them are
not complicated technical issues, but rather simple observations that we think would likely be
made by any organization similar to BnF after 8 years of managing persistent identifiers.
3.1. Organizational issues
A persistent identifier and its policy should outlive its initial implementers. Obvious as this
statement sounds, its direct implications are not readily apparent in the early implementation
stages. It requires continuous improvement and refinement of the identifier policy and uses,
which must remain stable while accommodating new and evolving uses and needs. This prevents
identifiers from falling into obsolescence or disgrace, with a decrease in perceived relevance or
visibility. Neither must they become “over-used”; frequent or casual assignment leads to misuse.
A disciplined approach to organization and communication is a key factor for sustainability.
In eight years, there has been a good deal of staff turnover in the ARK BnF expert team. Only
one person from the original seven-member team remains. What’s more, as ARK use expanded to
new areas (as addressed in §2), its audience grew much wider than the original team. This includes library curators who use, or might use, ARKs to cite resources; digital object curators, who handle the lifecycle of the object, including identification and access; web application managers, on both the IT and librarian sides; and linked data experts, especially for the data.bnf.fr project. As a result, the communication and documentation had to be adapted for the larger
audience, which needed to be aware of policy and key curation issues without necessarily
understanding all the details.
Our “ARK consolidation approach” had two organizational phases.
•	Communicate: gather all the users, train them in the main underlying concepts of persistent identifiers, common misconceptions about them, and best practices, and mandate two “reference ARK coordinators” – one on the IT side and one on the librarian side.
•	Set up targeted working groups: led by the “ARK coordinators” and focused on specific resource types or applications, these groups reduce the identified gaps and address new needs.
3.2. Functional gap analysis
The functional gap analysis itself revealed many areas for improvement in our persistent
identifier services, particularly for resolution and associated services5.
•	Some applications do not create resolvable ARKs, but only record them as metadata.
•	Whenever a resource is not available at the ARK-aware URL, there is only a 404 or 403 browser response, which should be replaced by one of the following more explicit statements: 1) Resource not found – this is an incorrect URL and no resource has ever been available at this URL; 2) Resource deleted – the resource was there, but it was deleted; in this case, provide core metadata and if possible the reason for the deletion; 3) Access disallowed in this context – as with deletion, one should provide core metadata and if possible the reason for the withdrawal (e.g. copyright status).
•	Across some applications there are obsolete or inconsistent ARK redirects, e.g. an old test version of the digital library, gallica2.bnf.fr, no longer redirects to gallica.bnf.fr.
In all these cases, our minimum baseline service is clearly not achieved. Our first goal is
therefore simple but attainable: define BnF “ARK core services” that any persistent-id aware
application should comply with, namely,
•	Provide access to the object behind the ARK.
•	In case of object unavailability, provide metadata to understand what was there and why access is no longer possible.
•	Set up a generic process for updating redirects at the level of the BnF “ARK coordinators”.
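A minimal sketch of such "ARK core services" behaviour, in Python and with an invented in-memory registry (this is not BnF's resolver; the statuses, reasons and the deleted ARK name are illustrative assumptions), could look as follows.

# Illustrative registry: the second ARK name is invented for the example.
ARK_REGISTRY = {
    "ark:/12148/bpt6k5834013m": {"status": "available",
                                 "location": "http://gallica.bnf.fr/ark:/12148/bpt6k5834013m"},
    "ark:/12148/bpt6kexample1": {"status": "deleted",
                                 "reason": "withdrawn for copyright reasons",
                                 "metadata": {"dc:title": "Example digitized document"}},
}

def resolve(ark):
    """Return an explicit statement instead of a bare 404/403 response."""
    entry = ARK_REGISTRY.get(ark)
    if entry is None:
        return {"http_status": 404,
                "statement": "Resource not found: no resource has ever been available at this ARK"}
    if entry["status"] == "available":
        return {"http_status": 302, "location": entry["location"]}
    if entry["status"] == "deleted":
        return {"http_status": 410, "statement": "Resource deleted",
                "reason": entry["reason"], "metadata": entry["metadata"]}
    return {"http_status": 403, "statement": "Access disallowed in this context",
            "metadata": entry.get("metadata", {})}

print(resolve("ark:/12148/bpt6kexample1")["statement"])   # -> Resource deleted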
3.3. Refining the identification and persistence policies
When ARKs were first implemented, we had an unclear view of what stewardship promise we could make for our identifiers. Therefore we ended up with a very high-level statement6:
•	No identifier re-assignment;
•	Identifier string policy: opaque strings, no vowels, use of a final check character;
•	Persistence policy: guaranteed, but needs to be refined in the future; the form of the underlying resource can change to ensure its persistence (e.g. format migration).
With almost a decade of experience managing ARK identifiers, digital preservation objects
(PREMIS Maintenance Activity, 2012), and alignments between our catalogue records and other
linked data sources, we can see possibilities for differentiated persistence policies.
•	For a digital document that we preserve, our aim is to keep the information content stable, accessible, and useable to end-users. This means permanent access with stable content.
•	For a catalogue record, the information content can be updated as the catalogue record is corrected, enriched, updated, etc. This means permanent access with somewhat more dynamic content.
•	For an archival records document, the identifier will be maintained but the content may be suppressed for legal reasons. In this case, we provide a “tombstone” with the metadata and the reasons for the object’s unavailability.
The BnF is currently considering formalizing these policies in a systematic way.
5 This analysis is limited to ARK implementation. Time permitting, BnF could have studied additional identifier systems to get ideas for improvement; however, the ARK scheme, being built upon experience with other schemes, was already a leading choice, so a broader study was not considered a priority.
6 http://gallica.bnf.fr/ark:/12148/btv1b8451622d.policy
3.4. Refining the qualifier implementation
One issue we have to deal with is the proliferation of identifier qualifiers (introduced in §1.2), in response to which we decided to create a consistent qualifier policy. From the most generic service to the most specific, we see three tiers of qualifiers.
•	Generic qualifiers. Applicable to any resource, these are qualifiers providing a description of the resource (.description), its persistence policy (.policy), and potentially a qualifier revealing the sub-parts and variants available for the object.
•	Content-type-dependent qualifiers. For digitized documents, you can use generic display resolution variants (thumbnails, low, medium or high resolution). For descriptive records, you can use generic metadata formats (RDF, XML…). The list of possible qualifiers can be maintained independently of any application.
•	Application-specific qualifiers. These are specific to a particular NMAH.
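A consolidated qualifier policy of this kind lends itself to a simple whitelist. The sketch below is purely illustrative (the tier contents are our assumption, not BnF's actual registry), reusing qualifiers mentioned in this paper.

QUALIFIER_TIERS = {
    "generic": (".description", ".policy"),
    "content-type": (".thumbnail", ".lowres", ".medres", ".highres", ".text", ".vocal"),
    "application-specific": (".r=",),   # e.g. Gallica's search/highlight parameter
}

def tier_of(qualifier):
    """Classify a variant qualifier, or return None if it is not whitelisted."""
    for tier, prefixes in QUALIFIER_TIERS.items():
        if any(qualifier.startswith(p) for p in prefixes):
            return tier
    return None   # unknown qualifiers should go to the ARK coordinators for review

print(tier_of(".highres"))   # -> content-type
print(tier_of(".r=food"))    # -> application-specific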
We also consolidated our policy about when it is appropriate to define a new qualifier, based on two considerations revealed by the gap analysis. The first has to do with querying vs. citation. Variant qualifiers are not a query language, but they do allow citation of services that one considers “persistent” and relevant from an end-user point of view. In that light, the “.r=” qualifier in http://gallica.bnf.fr/ark:/12148/bpt6k65581775.r=food raises a red flag. This qualifier can be viewed as a way to search for a word in a digitized document; but ARK qualifiers are intended to refer to the document, not to “look” into documents. It can also be viewed as a way to act upon a document by returning it (from BnF) “with highlights” added (here on the word “food”). This use case could comply with ARK qualifiers, but the side-effects could be distracting if not misleading. Unfortunately, it is easy to create such a reference accidentally: if a user searched for a word in a document before copying and pasting the URL, the copied URL will include the “r=word” qualifier. The result is a reference to a document with highlights, whereas most of the time all the user wants is to refer to the document without them. This means that, in most cases, revealing such parameters in persistent URLs is not recommended.
A second consideration is technical vs. non-technical qualifiers. Any qualifier that concerns a detail of implementation, technology, or a temporary information object should not be expressed in the URI. Unlike the “ARK name” part, qualifiers are not meant to be persistent over the long term. However, their stability and maintenance are important for the perceived trustworthiness of the service, and they come at a cost: supporting the aforementioned .r= qualifier, for example, means that the syntax for searching for several words (“.r=word1+word2+wordn”) has to be maintained across re-implementations.
As a result of this gap analysis, the BnF intends to raise awareness of good practices among
ARK users (developers and web application managers) and to formalize a general best practices
document. A list of qualifiers will be created and maintained for the three aforementioned levels.
3.5. Technical issues: consolidating the technical framework
From its first implementation, the ARK resolver at the BnF had to meet two basic
requirements: complying with the security policy of the IT operations service and managing the
increasing flow of network requests.
Initially, the ARK resolver was a part of a general-purpose document viewer application. For
each domain-specific application, every incoming URI including an /ark:/ pattern had to be
detected by an HTTP reverse proxy and redirected toward this viewer application. The ARK
resolver had to analyze the ARK identifier and the request, change it to a domain-specific format,
and then forward the request for processing to the domain-specific application. These
applications were hosted on multiple servers using virtual IP and load-balancing in order to share
the load between these servers. This architecture had some shortcomings. First, the use of a
reverse proxy conflicted with the IT operations requirements. Second, to detect, change and
redirect the requests, the ARK resolver had to implement some domain-specific rules. This was
dangerous for the security and maintainability of the whole system.
After this first architecture was in operation for two years, it was agreed to define a new
system that would be more generic, parameterized, and scalable. The multi-server load-balancing
system was kept, but three modules were added.
a) A domain-specific module that checks if the incoming request is in the scope of the
domain, and if not, sends it to the ARK redirection module. This filter module is generic
but uses domain-specific patterns to verify incoming requests before they go to module b).
b) Domain-specific sub-modules analyze the request, and if necessary, reformat it according
to the domain’s requirements before transferring it to the domain-specific application.
c) The ARK redirection module is able to analyze the ARK identifier and the incoming
request and then forward the request for processing to the domain-specific application. The
redirection rules are parameters defined in an XML file.
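The actual format of the BnF rules file is not published; the sketch below therefore only illustrates the idea behind module (c), with a hypothetical XML structure and prefix-based routing.

import xml.etree.ElementTree as ET

RULES_XML = """<redirections naan="12148">
  <rule prefix="cb"  target="http://catalogue.bnf.fr/ark:/12148/"/>
  <rule prefix="bpt" target="http://gallica.bnf.fr/ark:/12148/"/>
</redirections>"""

def redirect_target(ark_name, rules_xml=RULES_XML):
    """Route an ARK name to its domain-specific application, or None to fall through."""
    root = ET.fromstring(rules_xml)
    for rule in root.findall("rule"):
        if ark_name.startswith(rule.get("prefix")):
            return rule.get("target") + ark_name
    return None   # handled by the generic ARK redirection module

print(redirect_target("cb32915216j"))   # -> http://catalogue.bnf.fr/ark:/12148/cb32915216j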
The new document viewer application is now leaner because it does not handle the resolution
of ARK requests. This task has been distributed between the generic redirection filter, the specific
reformatting filters, and the centralized ARK redirection module. The workload of this
redirection module is lower since many of the incoming requests are going directly to a domain-specific application that can resolve the ARK identifier.
Three years later, new requirements came out in parallel with new developments of the Gallica
viewer module. Some tools were implemented to manage ARK identifiers and qualifiers, which
are now defined by a configuration file. The processing of ARK qualifiers became more generic, which made them easier to use in the Gallica API. The ARK redirection
module was enhanced by migrating the old redirection rules to mapping tables stored in a
database. That module is also using a copy of the ARK NAAN registry that is mirrored at regular
intervals from the NAAN registry at the CDL. The new architecture is summarized in Figure 1.
FIG. 1. BnF ARK resolver architecture
The ARK minting process, functional aspects of which are outlined in §2.2, has followed a
similar evolution. Initially, ARK creation was completely delegated to domain-specific
applications. This method was easy to implement but problematic in terms of maintenance and
robustness. With implementation of the SPAR repository came the development of a generic
function to mint new ARKs. A growing proportion of new identifier assignment is now
performed by this generic function.
Since its early stages, the ARK system at the BnF has been tuned regularly to become easier to
maintain and configure, although technical issues still remain. To keep a robust system that can
be trusted by end-users, we have to consider an increasing diversity of applications, the number
of ARKs involved, and the flow of incoming requests.
3.6. Main lessons learnt about persistent identifier curation
To allow operational persistent identifier curation at a non-expert level, core questions have to
be answered. With eight years of hindsight, the key questions boil down to this check-list:
•	Who should be contacted in your institution when new kinds of objects are to be given persistent identifiers or when persistent-id aware applications are defined or revised?
•	What are your identifiers identifying?
•	Will your identifiers be re-assigned over the long term or not?
•	How much can the underlying content change over time? Can objects be deleted?
•	Which services and subparts do you want to reveal, if any, so that end-users can cite a specific portion of the resource and/or a particular variant of that object?
4. Standards gap analysis
4.1. Machine-readable commitments
No identifier, regardless of scheme, can tell us if it will prove to be persistent into the future.
The best “it” can do is to tell us (via its NMA) enough about itself, its resource, and its resource provider to help us judge how and when to use it. The story it tells must be able to convey such
things as provider support policies, expected changes to the resource (e.g., none, or corrections
only), and the nature of the provider itself. A persistence promise is not black or white. Instead it
is multi-dimensional, suggesting a breakdown into metadata elements.
Because we assume that people searching for resources at scale will want to filter based on persistence promise attributes, it will be necessary to support machine-readable commitments
expressible via metadata. As was described earlier, the ARK inflection, “??”, is designed to gain
access to metadata statements about providers’ persistence promises. Unfortunately, the ARK
standard does not specify how to create machine-readable persistence promises. This section
explores some of the areas that metadata should cover in such machine-readable commitments.
Support policies
Support policies and commitments vary between institutions, collections, and even between
resources within a collection. For example, users often expect unchanging content behind durable
links to published content, but they expect dynamic content behind durable links (persistent
identifiers) to advertised content, such as a home page, curated database, or per-second updated
stream of sensor data.
Setting expectations about this “content invariance” (or lack thereof) is critical, because
audiences often avoid one kind and seek out the other kind, or vice versa, depending on the
situation. Both are legitimate uses of persistent identifiers. Prior work at NLM (Byrnes, 2000)
suggests at least four kinds of content invariance:
•	correctable: Previously recorded content may be corrected (only) at any time.
•	dynamic: Previously recorded content may be overwritten arbitrarily at any time, provided the resulting new content continues to match its metadata description. For example, the NLM homepage and the local weather page may both advertise very persistent identifiers for content that is completely overwritten from time to time.
•	unchanging: Previously recorded content will not change, but encodings and markup may change during a format migration.
•	bitstream: The bitstream representing previously recorded content will not change.
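As a sketch of what such machine-readable statements might look like, the following Python/rdflib fragment attaches invariance and provider information to an ARK using a made-up vocabulary: the ARK standard defines no such vocabulary, and every property name below is our assumption.

from rdflib import Graph, Literal, Namespace, URIRef

PERSIST = Namespace("http://example.org/persistence#")   # hypothetical vocabulary
ark = URIRef("http://gallica.bnf.fr/ark:/12148/bpt6k5834013m")

g = Graph()
g.bind("persist", PERSIST)
g.add((ark, PERSIST.contentInvariance, Literal("unchanging")))   # one of the four NLM kinds
g.add((ark, PERSIST.identifierReassignment, Literal(False)))
g.add((ark, PERSIST.provider, URIRef("http://www.bnf.fr")))

print(g.serialize(format="turtle"))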
Datasets that grow
There is an important dimension of content invariance describing resources that grow, but whose growth pattern does not alter previously recorded content. We might describe such resources as subject to non-disruptive growth: growth that does not in itself disrupt or displace previously recorded content. This applies to many common information
resources, such as live, sensor-based data feeds, citation databases, and even serial publications.
The nature of the provider
Anyone can promise anything, but we might value a promise from one source more than from
another. Relevant factors include not only what a provider promises in regard to identifier and
resource support, but also how that provider is motivated, supported, and perceived. Thus
mission, profit motive, succession plan, and reputation come to bear. Work to be done includes
expressing these via metadata.
Support level
What are the provider’s naming practices? How often is the collection inspected for broken
identifiers? What action is taken when outages occur, and at what priority? Realistically, not all
resources are equally important to a provider and its audience. To better support some resources
means lowering priority support for other resources. What is a resource’s “track record” and can
one inspect it? These are all questions that can inform user choices of identifier.
4.2. Using ARKs in a semantic web context: investigating best practices
When the ARK specification came out in 2001, the core semantic web concepts and standards
were already out or on their way (RDF was released in 1999). However, as the semantic web
gained wider adoption, new best practices about URIs emerged over the next decade (W3C,
2008) and it is timely to re-evaluate the ARK specification in this new context. The main
observation is that on one hand, ARKs can be embedded in URIs, which allows their use in the
web of data, but on the other hand, the linked data best practices call for “Cool URIs” that,
among other properties, “don’t change” (Berners-Lee, 1998). For institutions that implement
them, ARKs are a natural way to push identified resources onto the web of data. The question
now is how to reconcile these two normative contexts at the BnF while implementing ARKs on
the data.bnf.fr linked data service.
One could first ask how those two contexts address the question of multiple representations of
a resource. On the semantic web, content negotiation using a generic URI yields the relevant
representation of a resource; whether to reveal specific URIs for the variants is up to the content
provider to decide. There is no reason why a provider could not implement an unqualified ARK
name and rely on content negotiation to return linguistic or format variants to the user; or the user
can reveal these variants by using traditional qualifiers7.
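The sketch below (illustrative only; the media types and qualifier names are our assumptions, not data.bnf.fr's configuration) shows how a provider could serve format variants of an unqualified ARK either from the HTTP Accept header or from a traditional variant qualifier.

VARIANTS = {
    "application/rdf+xml": "record.rdf",
    "text/html": "record.html",
}
QUALIFIERS = {".rdf": "record.rdf", ".html": "record.html"}   # hypothetical qualifier names

def representation(accept_header=None, qualifier=None):
    """Pick a representation from an explicit qualifier or from content negotiation."""
    if qualifier:                                   # an explicit variant qualifier wins
        return QUALIFIERS.get(qualifier)
    for media_type in (accept_header or "").split(","):
        chosen = VARIANTS.get(media_type.split(";")[0].strip())
        if chosen:
            return chosen
    return VARIANTS["text/html"]                    # default representation

print(representation(accept_header="application/rdf+xml"))   # -> record.rdf
print(representation(qualifier=".rdf"))                      # -> record.rdf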
7 For the moment, however, data.bnf.fr does not use ARK-URIs for its content negotiation. Early in the project, when such choices were made, non-opaque URIs were considered better for SEO, as visibility on the web was one of the core aims of data.bnf.fr. Therefore, http://data.bnf.fr/ark:/12148/cb118905823 redirects to the temporary URI http://data.bnf.fr/11890582/charles_baudelaire/, which provides access to a particular representation of the object depending on the result of content negotiation (RDF/XML, Notation3, N-Triples, JSON, or HTML, and language variants). We intend to reconsider this question with the evolution of SEO practices.
However, the real question is about the form of the URIs. In the early semantic web, a good
deal of debate was about “real-world resources” that can be described on the web of data (with
URIs), but could only be put on the web via substitutes (e.g. a description and/or a web page). It
was initially considered wiser to use non-dereferenceable URIs. Non-HTTP URI schemes like
“urn:” could be used to that end, and “info:” was explicitly defined for that purpose. By the end
of the 2000s, however, there was global consensus that an HTTP URI could be used for any
resource. As a result, putting resources on the web of data now implies using HTTP URIs, i.e.
URLs. This poses no conflict with ARKs since they are designed to be embedded in URLs using
an NMAH that resolves them.
The main conflict between ARKs and URIs used on the semantic web concerns the qualifier
part. At issue is distinguishing between a descriptive resource (available on a web page) and its
underlying content (which might, or might not, be interpreted as a web page):
“It is important to understand that using URIs, it is possible to identify both a thing
(which may exist outside of the Web) and a Web document describing the thing. For
example the person Alice is described on her homepage. Bob may not like the look of the
homepage, but fancy the person Alice. So two URIs are needed, one for Alice, one for
the homepage or a RDF document describing Alice. The question is where to draw the
line between the case where either is possible and the case where only descriptions are
available.” (W3C, 2008).
With ARKs, the URI to reference the descriptive resource is constructed by adding the “?”
inflection to the URI of the content resource. Unfortunately, supporting the single “?” (what looks
like an empty query string) directly was impossible with the BnF infrastructure. What’s more,
BnF made the implementation choice to create ARKs directly for descriptive resources (e.g. authority records), so the mechanism needed was the opposite: from the identified descriptive
resource (identified with an ARK name) to its underlying content resource, not the other way
round. Therefore, we had to consider the other two mainstream choices:
•	“suffix hash URI”: you have http://example.com/resource for a web resource (e.g. a web page about a person), and http://example.com/resource#classifier for the underlying thing (e.g. the person itself). A browser client automatically strips off the # for consumption, which relies on standard web architecture and best practices.
•	“prefix slash URI”: you have http://example.com/doc/resource for the web document and http://example.com/id/resource for the underlying thing. This requires an HTTP 303 redirect from the resource URI to the URI of the web document.
The semantic web best practices highlight an area currently unaddressed by ARK qualifiers:
how to name the underlying “thing” when the ARK is assigned to a descriptive resource. This is
clearly not a whole-part problem (addressed by “/”). Neither is it really a “service” or “variant”
qualifier (addressed by “.”) because the two identified things are quite distinct.
With ARKs, only the “prefix slash URI” strategy is possible in the current state of the standard, which means using e.g. http://data.bnf.fr/id/ark:/12148/cb118905823 (the French poet Charles Baudelaire) and http://data.bnf.fr/doc/ark:/12148/cb118905823 (the record describing him). This was not implemented because the redirection rules would present too great an extra server burden for our application.
From a technical standpoint, in data.bnf.fr the decision was made to locally extend ARKs and use “hash URIs”. For example, we separate http://data.bnf.fr/ark:/12148/cb118905823 (the web page about Charles Baudelaire) from http://data.bnf.fr/ark:/12148/cb118905823#foaf:Person (Charles Baudelaire himself).
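The following rdflib sketch illustrates the resulting split between the two URIs; the foaf:primaryTopic link between the record page and the person is our illustrative assumption, not data.bnf.fr's published data.

from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
doc = URIRef("http://data.bnf.fr/ark:/12148/cb118905823")                 # the web page
person = URIRef("http://data.bnf.fr/ark:/12148/cb118905823#foaf:Person")  # the person

g = Graph()
g.bind("foaf", FOAF)
g.add((person, RDF.type, FOAF.Person))
g.add((doc, FOAF.primaryTopic, person))

print(g.serialize(format="turtle"))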
Looking back at the standard, would accommodating this change mean defining a new kind of
qualifier, beginning with #, to name the underlying resource? Though technically possible, this
would cause backwards compatibility issues, because the # character is not reserved in ARK
names. In other words, one could legitimately define the following (unqualified) ARK core
identifier: ark:/9999/c5j3r4#hz45, with a # in the ARK name itself. Defining a # qualifier would
break backwards compatibility in such cases. On the other hand, # already has a use in the
standard web architecture (fragment for a URL) which makes it unlikely that implementers will
use this character in their own implementation. A comprehensive survey of ARK implementers
would be useful before any decision. If a # qualifier proved to be possible, we believe this would
be a valid scenario to reconcile semantic web and ARK implementation approaches.
Conclusion
This article has looked back at the history of ARK persistent identifiers in one institution and at possible evolutions of the standard. Standards-wise, the question boils down to whether we should expand the core features to increase cross-resolver interoperability and adapt ARKs to new contexts, or whether we should stick to the current ARK recommendation, which is flexible, simple, easy to use, and in most cases successful. Such
questions will be taken up in follow-on work with the implementer community.
References
Archer, Phil. (2013) Study on persistent URIs: with identification of best practices and recommendations on the topic
for the Member States and the European Commission. Retrieved May 02, 2014, from
http://philarcher.org/diary/2013/uripersistence.
Bermès, Emmanuelle. (2006). Des identifiants pérennes pour les ressources numériques. Retrieved May 02, 2014,
from http://2007.jres.org/planning/pdf/163.pdf.
Berners-Lee, Tim. (1998). Cool URIs don’t change. Retrieved May 02, 2014, from
http://www.w3.org/Provider/Style/URI.
BnF. (2013). URI and URL in data.bnf.fr. Retrieved May 02, 2014, from http://data.bnf.fr/en/semanticweb#Ancre3.
Byrnes, Margaret. (2000). Defining NLM's Commitment to the Permanence of Electronic Information. ARL 212:8-9.
Retrieved May 07, 2014, from http://www.arl.org/newsltr/212/nlm.html
PREMIS Maintenance Activity. (2012). SPAR – Scalable Preservation and Archiving Repository. Retrieved May 02,
2014, from http://www.loc.gov/standards/premis/registry/premis-project_name.php?proj_ID=697.
CDL. (2013). ARK (Archival Resource Key) Identifiers. Retrieved May 02, 2014, from
https://wiki.ucop.edu/display/Curation/ARK.
CDL. (2014). Registered Name Assigning Authority Numbers. Retrieved August 14, 2014, from
http://www.cdlib.org/uc3/naan_table.html.
Hilse, Hans Werner, and Jochen Kothe. (2006). Implementing Persistent Identifiers. Consortium of European Research
Libraries and European Commission on Preservation and Access. Retrieved May 02, 2014, from http://nbn-resolving.de/urn:nbn:de:gbv:7-isbn-90-6984-508-3-8.
IETF. (2013). The ARK Identifier Scheme. Internet-Draft. Retrieved May 02, 2014, from
http://datatracker.ietf.org/doc/draft-kunze-ark.
Kunze, John. (2003). Towards Electronic Persistence Using ARK Identifiers. California Digital Library. Retrieved May
02, 2014, from https://wiki.ucop.edu/download/attachments/16744455/arkcdl.pdf.
Kunze, John and Adrian Turner. (2010). The ARK Identifier Scheme. Retrieved August 14, 2014, from
http://dublincore.org/groups/kernel/spec/.
W3C. (2005). Uniform Resource Identifier (URI): Generic Syntax. Retrieved May 02, 2014, from
http://www.ietf.org/rfc/rfc3986.txt.
W3C. (2008). Cool URIs for the Semantic Web. Retrieved May 02, 2014, from http://www.w3.org/TR/cooluris/.
Requirements on RDF Constraint Formulation and Validation
Thomas Bosch
GESIS – Leibniz Institute for the
Social Sciences, Mannheim, Germany
[email protected]
Kai Eckert
Research Group Data and Web
Science
University of Mannheim, Germany
[email protected]
Abstract
For many RDF applications, the formulation of constraints and the automatic validation of data
according to these constraints is a much sought-after feature. In 2013, the W3C invited experts from industry, government and academia to the RDF Validation Workshop, where initial use cases were presented and discussed. In collaboration with the W3C, a working group on RDF Application Profiles (RDF-AP) is currently being established in the Dublin Core Metadata Initiative; it follows up on this workshop and addresses, among other things, RDF constraint formulation and validation.
In this paper, we present a database of requirements obtained from various sources, including the
use cases presented at the workshop as well as in the RDF-AP WG. The database, which is
openly available and extendible, is used to evaluate and compare several existing approaches for
constraint formulation and validation. We present a classification and analysis of the
requirements, show that none of the approaches satisfies all requirements, and aim to lay the groundwork for future work, as well as to foster discussion of how to close existing gaps.
Keywords: RDF validation; RDF constraint formulation; RDF constraint validation;
requirements; OWL 2; RDF; linked data; semantic web.
1. Introduction
The notion of Linked (Open) Data and its principles clearly increased the acceptance – not to
say the excitement – of data providers for the underlying Semantic Web technology. Early
concerns of the data providers regarding stability and trustworthiness of the data have been addressed and largely solved, not only by technical means regarding versioning and provenance, but also by the providers getting accustomed to the open data world with its peculiarities.
Linked Data and RDF, however, still are not the primary means to create, store, and manage
data on the side of the providers. Linked Data is mostly provided as a view on data, a one-way
road, disconnected from the internal data representation. Among the obstacles to full adoption of RDF, possibly comparable to that of XML, is the lack of accepted ways to formulate (local) constraints on data and to validate data against them. The W3C reports a consensus among the 27 participants from industry, government and academia at the RDF Validation Workshop1 that there are the following needs:
1. Declarative definition of the structure of a graph for validation and description.
2. Extensible to address specialized use cases.
3. A mechanism to associate descriptions with data.
Several use cases with requirements were presented at the workshop; further requirements are described in talks about general approaches and experiences outside of RDF, like Dublin Core Application Profiles or XML Schema Definitions. An important finding is that there are non-functional requirements for data validation in a Linked Data setting, particularly the need to
1 RDF Validation Workshop – Practical Assurances for Quality RDF Data. 10-11 September 2013, Cambridge, MA, USA. http://www.w3.org/2012/12/rdf-val/report
“communicate the constraints against which data is to be validated in a way which is both easy to
understand by human beings and discoverable by programs.”
SPARQL and SPIN are powerful and widely used for constraint formulation and validation
(Fürber and Hepp, 2010), but constraints formulated as SPARQL queries are not as
understandable as one wishes them to be. Consider the following example of the simple
constraint stating that only dogs are allowed as pets:
SELECT ?this ?subOPE ?object WHERE {
?C owl:allValuesFrom :Dog .
?C owl:onProperty :hasPet .
?C a owl:Restriction .
?this rdf:type ?subC . ?subC rdfs:subClassOf* ?C .
?this ?subOPE ?object . ?subOPE rdfs:subPropertyOf* :hasPet .
FILTER NOT EXISTS { ?object rdf:type :Dog . } }
This query checks the constraint and returns violating triples, but the actual constraint could be formulated much more concisely, for instance using the OWL 2 Functional-Style syntax:
SubClassOf( :strictDogOwner ObjectAllValuesFrom( :hasPet :Dog ) )
Similarly, but even shorter, as Shape Expression:
<StrictDogOwnerShape> { :hasPet :Dog+ }
Partly as a follow-up to the W3C workshop and partly due to further requirements expressed at the Semantic Web in Libraries conference 20132, the Dublin Core Metadata Initiative, in collaboration with the W3C, is currently establishing a Working Group for RDF Application Profiles (RDF-AP WG) that will investigate existing approaches and best practices, identify possible gaps, and propose practical solutions for the representation of application profiles, including the formulation of data constraints3. The RDF-AP WG currently bases its work on 8 case studies and use cases provided by internal and external stakeholders, mostly from the library domain. In a
heterogeneous environment like the Web, there is not necessarily a one-size-fits-all solution,
especially as existing solutions should rather be integrated than replaced, not least to avoid long
and fruitless discussions about the “best” approach.
The work presented in this paper is intended to lay the groundwork for subsequent activities in the working group. Our contributions are two-fold: first, we propose to relate existing solutions to
specific case-studies and use-cases by means of requirements extracted from the latter and
fulfilled by the former. We therefore created and present an exhaustive database of all
requirements identified in the validation workshop and the RDF-AP WG. Additionally, we added
requirements from other sources, particularly in the form of constraint types that are supported by
existing approaches, e.g., expressible in OWL 2.
Second, we use this database to provide an overview of different classes of requirements and give examples of the degree to which these classes are supported by different approaches. We want to highlight the strengths and weaknesses of these approaches and identify gaps
and possible solutions for their elimination.
2 SWIB13 – Semantic Web in Libraries, 25-27 November 2013, Hamburg, Germany. http://swib.org/swib13/
3 http://wiki.dublincore.org/index.php/RDF-Application-Profiles
2. From a Case Study to a Solution (and Back)
In the development of standards, as in software, case studies and/or use cases are usually taken
as a starting point. In case studies, the full background of a specific scenario is described, where the
standard or the software is to be applied. Use cases are smaller units where a certain action or a
typical user enquiry is described. They can be extracted from and thus linked to case studies, but
often they are defined directly.
Requirements are extracted from use cases; they form the basis for development and are used
to test the result. We specifically use the requirements to evaluate existing approaches for
constraint formulation and validation. Via the requirements, the approaches are linked to use cases and case studies, and it becomes clear which approaches can be used in a given scenario
and what drawbacks might be faced.
We classify the requirements to provide a high-level view on different approaches and to
facilitate a better understanding of the problem domain. Our database is openly available and can
be extended with new case studies, use cases, requirements and approaches.
Table 1 shows an excerpt from our database. The general structure is a polyhierarchy from
case studies through use cases and requirements down to solutions. All instances contain at least uplinks to
the next level, i.e., solutions are linked to requirements that they fulfill and possibly requirements
that they explicitly do not fulfill. Requirements are linked to use-cases, which are linked to case
studies.
TABLE 1: Database Examples
ID | Title | Links | Description
Case Studies
CS-1 | DPLA | — | The Digital Public Library of America maintains an access portal to digitized cultural heritage objects...4 We harvest data using several different methods...
Use Cases
UC-1 | Recommended Property | CS-1 | Some properties may not be mandatory, but may be recommended to indicate a “value-added” level of compliance with MAPv3...
Requirements
R-1 | Optional Properties | UC-1 | A property can be marked as optional. Valid data MAY contain the property.
R-2 | Recommended Properties | UC-1, R-3 | An optional property can be marked as recommended. A report of missing recommended properties is generated. Fulfilled if R-3 is fulfilled.
R-3 | Classified Properties | UC-1 | A custom class like “recommended” or “deprecated” can be assigned to properties and used for reporting.
Solutions
S-1 | ShEx | R-1/2/3 | Fulfilled: R-1 (minimum cardinality = 0, maximum cardinality = 1). Not fulfilled: R-2, R-3.
S-2 | SPIN | R-1/2/3 | Fulfilled: R-1, R-2, R-3.
The polyhierarchy allows all elements to be linked to more than one parent; requirements in particular are linked to several use cases. Our goal is to maintain a set of distinct requirements. Only in this way is it possible to evaluate the solutions regarding their suitability for the use cases
and case studies in our database. Use cases can be shared between case studies as well, but this is
harder to maintain as use cases are less formal and often more case specific than a requirement.
4 http://wiki.dublincore.org/index.php/DPLA_RDF_application_profile_use_cases
Requirement R-2 is an example where a link between requirements is established. In this case, the link is used to point to a requirement that is “broader” than this requirement, i.e., should that requirement be fulfilled, then this requirement is automatically fulfilled as well. In a similar way, requirements can be linked to duplicates, should they occur. Our goal is relative stability of the requirements, which can then prove useful for mediating between data and solution providers.
The database is made available at http://purl.org/net/rdf-validation. The initial
database was created manually and forms the basis of this paper. The web application to access
the database is currently in a beta state and still under development. Nevertheless, the full
database can already be browsed online and interested participants can register and contribute to
the database.
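To make the structure of the database more tangible, here is a small Python sketch (our own illustration, not the project's actual data model) of the polyhierarchy, using the identifiers from Table 1.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    id: str
    title: str
    description: str = ""
    uplinks: List[str] = field(default_factory=list)   # ids of parent elements

cs1 = Node("CS-1", "DPLA")                              # top-level case study, no uplinks
uc1 = Node("UC-1", "Recommended Property", uplinks=["CS-1"])
r1 = Node("R-1", "Optional Properties", "A property can be marked as optional.", ["UC-1"])
r3 = Node("R-3", "Classified Properties", uplinks=["UC-1"])
s2 = Node("S-2", "SPIN", "Fulfilled: R-1, R-2, R-3.", ["R-1", "R-2", "R-3"])

def suitable_for_use_case(solution, requirements, use_case_id):
    """A solution suits a use case if it fulfils every requirement linked to that use case."""
    needed = {r.id for r in requirements if use_case_id in r.uplinks}
    return needed <= set(solution.uplinks)

print(suitable_for_use_case(s2, [r1, r3], "UC-1"))   # -> True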
3. Related Work
Requirements engineering is recognized as a crucial part of project and software development
processes. Similar to our collaborative effort, Lohmann et al. propose social requirements
engineering, i.e. the use of social software like wikis to support collaborative requirements
engineering (Lohmann et al., 2009). Their approach focuses on simplicity and supports in
particular the early phases of requirements engineering with many distributed participants and
mainly informal collaboration. They emphasize the social experience of developing requirements
for software systems: Stakeholders are enabled to collaboratively collect, discuss, improve, and
structure requirements. Under the supervision of experts, the requirements are formulated in
natural language and are improved by all participants step by step. Later on, experienced
engineers may clean and refine requirements. As basis for their work, they developed a generic
approach (Softwiki) using semantic technologies and the SWORE ontology for capturing
requirements relevant information semantically (Lohmann et al., 2008). The SWORE ontology,
as well as a prototypical implementation of their approach is available online5. We evaluated the
implementation and the ontology regarding a possible reuse, but it turned out that Softwiki
focuses clearly on the requirements within a traditional software development process, while we
need a broader view including case studies, use cases and various implementing approaches.
Nevertheless we will reuse parts of the SWORE ontology and include links wherever possible.
To the best of our knowledge, there is no comparable prior work regarding the collection of a
comprehensive list of requirements for the formulation and validation of constraints, nor do general approaches exist to compare different solutions based on common or differing requirements.
More related work focuses on specific constraint languages and implementations, which we will
introduce in the next section.
4. Approaches for Constraint Formulation and Validation
In this section, we present current approaches for constraint formulation and validation which
have been the most discussed in the aforementioned workshops and WGs. These approaches differ in two dimensions: (1) the constraint language used and (2) whether they offer validation systems.
OWL, Resource Shapes (ReSh), Shape Expressions (ShEx), Description Set Profiles (DSPs),
SPARQL, and SPIN are the most promising and applied constraint languages. Stardog ICV,
Pellet ICV, and SPIN use OWL 2 constructs to formulate constraints. SPIN6 provides a
vocabulary to represent SPARQL queries as RDF triples and uses SPARQL to specify inference
rules and logical constraints (Fürber and Hepp, 2010). The Pellet Integrity Constraint Validator
(ICV)7 is a proof-of-concept extension for the OWL reasoner Pellet. Stardog ICV8 validates RDF
5 http://softwiki.de/netzwerk/en/
6 http://spinrdf.org
7 http://clarkparsia.com/pellet/icv/
8 http://docs.stardog.com/icv/icv-specification.html
data stored in a Stardog RDF database. ReSh9, which originates from Open Services for Lifecycle Collaboration (OSLC), defines its own RDF vocabulary to express constraints (Ryman et al., 2013). ShEx10 also specifies
a new constraint language whose syntax and semantics are similar to regular expressions. DCMI
RDF Application Profile (AP)11 and Bibframe12 are approaches to specify profiles for application-specific purposes. DCMI RDF-AP uses DSP13 as a generic constraint language, which is also
intuitive for non-experts. The Bibframe constraint language has a strong overlap with DSP.
Kontokostas et al. define 17 data quality integrity constraints represented as SPARQL query
templates called Data Quality Test Patterns (DQTP) (Kontokostas et al., 2014). Schemarama14 is
based on the Squish RDF language instead of SPARQL. For XML, Schematron15 is an ISO
standard for validation and quality control of XML documents based on XPath and XSLT. XML
Schema16 is the primary technology for specifying and constraining the structure of XML
documents.
In addition to constraint validation languages, SPIN (open source API), Stardog ICV (as part
of the Stardog RDF database), DQTP (tests), Pellet ICV (extension of Pellet OWL reasoner) and
ShEx offer executable validation systems using SPARQL as implementation language.
In this paper, we evaluate to what extent these approaches cover classes of requirements (1)
to express different types of constraints and (2) to formulate constraints. For the formulation of
constraints, it is important that the constraint language is concise and intuitive and that the
declarative constraint language is translated to an implementation language like SPARQL in
order to execute constraint validation automatically. In the form of concrete examples, we show how
current approaches can be used to express different types of constraints and how they can be used
together to fulfill the majority of the identified requirements classes.
5. Requirements
Use cases discussed within the scope of the mentioned workshops and working groups led to
the definition of requirements on RDF constraint formulation and validation. We classified these
requirements into the two top-level categories ’Constraint Formulation’ and ’Constraint
Expressivity’.
5.1. Formulation of Constraints
Intuitive and concise language. We claim that all constraints can be expressed using the low-level language SPARQL. The majority of constraints can also be written more declaratively, intuitively, and concisely in the form of OWL 2 axioms in the concrete syntax Turtle. Although OWL 2 is a very expressive language, we cannot express every constraint in OWL 2. The following existential quantification contains those individuals that are connected by the
:fatherOf property to individuals that are instances of the class :Man. The ontology, the
constraint, and RDF data are expressed with the same OWL 2 axiom and the same concrete
syntax:
[ rdfs:subClassOf [
a owl:Restriction;
owl:onProperty :fatherOf;
owl:someValuesFrom :Man ] ] .
9 http://www.w3.org/Submission/shapes/
10 http://www.w3.org/2013/ShEx/Definition
11 http://dublincore.org/documents/singapore-framework/
12 http://bibframe.org/
13 http://dublincore.org/documents/dc-dsp/
14 http://swordfish.rdfweb.org/discovery/2001/01/schemarama/
15 http://www.schematron.com/
16 http://www.w3.org/TR/xmlschema-1/
The main purpose of OWL 2 is to infer new knowledge from existing schemata and data rather
than to check data for inconsistencies. Therefore, most constraint validation approaches define
constraints with other high-level declarative languages, even though most people are familiar
with OWL 2 and its concise human-understandable concrete syntax Turtle. OWL 2 can be used to
describe RDF data, to infer new knowledge, and to validate RDF data using the same expressive
OWL 2 axioms. With XML Schemas, we also structure and validate our data according to that
structure.
Shape Expressions contain elements from regular expressions making the language concise
and intuitive. In the following example, an employee has at least 1 given name, 1 family name,
any number of phone numbers, and 1 mail box:
<EmployeeShape> {
foaf:givenName xsd:string+ ,
foaf:familyName xsd:string ,
foaf:phone IRI* ,
foaf:mbox IRI }
As different constraints can be expressed with different languages, we propose to use multiple
languages to define constraints depending on the requirements which have to be satisfied.
Translated to implementation language. High-level declarative languages like OWL 2
cannot be executed directly to validate constraints. Therefore, we take a low-level execution
language like SPARQL. Sirin and Tao (2009) showed how constraints can be translated to
nonrecursive Datalog programs for validation and Angles and Gutierrez (2008) explained that
SPARQL has the same expressive power as nonrecursive Datalog programs. As a consequence,
we can also use SPARQL queries to validate constraints. Thus, constraint validation can be
reduced to SPARQL query answering. The participants of the 2013 W3C RDF Validation
workshop agreed that SPARQL should be the language to execute constraint validation17.
Furthermore, all evaluated constraint validation approaches execute constraint validation with
SPARQL. The next SPARQL query shows how the OWL 2 existential quantification is
implemented in SPIN:
CONSTRUCT {
_:violation
a spin:ConstraintViolation ;
rdfs:label ?violationMessage ;
spin:violationRoot ?this }
WHERE {
?this rdf:type ?subC . ?subC rdfs:subClassOf* ?C .
?C owl:someValuesFrom ?CE .
?C owl:onProperty ?OPE .
?C a owl:Restriction .
FILTER ( sp:not ( spl:hasValueOfType ( ?this, ?OPE, ?CE ) ) ).
FILTER EXISTS { ?this ?OPE ?object . ?object rdf:type owl:Thing . }
BIND ( ( ... ) AS ?violationMessage ) . }
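As a minimal, self-contained illustration of "constraint validation as SPARQL query answering" (using Python and rdflib rather than SPIN, and a simplified query that is not the template above), consider:

from rdflib import Graph

DATA = """
@prefix : <http://example.org/> .
:Peter :fatherOf :Stewie .
"""

VIOLATIONS = """
PREFIX : <http://example.org/>
SELECT ?parent ?child WHERE {
  ?parent :fatherOf ?child .
  FILTER NOT EXISTS { ?child a :Man . }
}
"""

g = Graph()
g.parse(data=DATA, format="turtle")
for parent, child in g.query(VIOLATIONS):
    print(f"Constraint violation: {child} must be an instance of :Man (root: {parent})")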
RDF representation of constraints. One of the main benefits of SPIN is that arbitrary
SPARQL queries and thus constraints are represented as RDF triples. SPIN provides a
vocabulary, the SPIN SPARQL Syntax, to represent SPARQL queries in RDF. The benefits of an
RDF representation of constraints are:
•	constraints can be consistently stored together with ontologies and RDF data
•	constraints can be easily shared on the web of data
•	constraint validation can be executed automatically
•	constraints can be processed by a plethora of already existing RDF tools
•	constraints are linked to RDF data
17 http://www.w3.org/2013/09/10-rdfval-minutes
The subsequent code snippet demonstrates how SPIN represents SPARQL 1.1 NOT EXISTS
filter expressions in RDF:
FILTER NOT EXISTS { ?person foaf:name ?name }

[ a sp:Filter ;
  sp:expression [
    a sp:notExists ;
    sp:elements (
      [ sp:subject [ sp:varName "person" ] ;
        sp:predicate foaf:name ;
        sp:object [ sp:varName "name" ] ] ) ] ] .
Our approach, which is implemented in Java, executes constraint validation with SPIN. SPIN
templates define the validation of both OWL 2 constraints and constraints only expressible with
SPARQL. These constraints are checked for each resource of the type owl:Thing (all resources
are assigned to the common super-class owl:Thing).
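To illustrate this mechanism, the following is a minimal sketch of ours (not the paper's actual template library) showing how a SPARQL-based constraint can be attached to owl:Thing via spin:constraint so that it is checked for every resource:

# Sketch: attach a CONSTRUCT-based constraint to owl:Thing via spin:constraint
owl:Thing spin:constraint [
    a sp:Construct ;
    sp:text """
      # constructs one violation description per resource ?this that
      # matches the (constraint-specific) pattern in the WHERE clause
      CONSTRUCT {
        _:violation a spin:ConstraintViolation ;
          spin:violationRoot ?this ;
          rdfs:label "Constraint-specific violation message" .
      } WHERE {
        # constraint-specific graph pattern over ?this goes here
      }""" ] .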
Constraint validation results. Like ontologies, instance data, and constraints, we should also
represent constraint violations in RDF. SPIN templates construct (SPARQL CONSTRUCT)
constraint violation triples containing information about constraint violations, which cannot be
expressed directly in OWL 2:
CONSTRUCT {
_:icViolation
a spin:ConstraintViolation ;
rdfs:label ?violationMessage ;
spin:violationRoot ?violationRoot ;
spin:violationPath ?violationPath ;
spin:violationSource ?violationSource ;
spin:fix ?violationFix ;
:severityLevel ?severityLevel }
Constraint violations (of the type spin:ConstraintViolation) should provide a useful
message (rdfs:label) explaining why the data did not satisfy the constraints, which
aids data debugging and repair. If we state the triple :Peter :fatherOf :Stewie . but not
:Stewie a :Man ., the SPIN template checking the OWL 2 existential quantification on the
object property :fatherOf constructs a constraint violation triple raising the message
‘ObjectSomeValuesFrom( :fatherOf :Man ) - :Stewie must be an instance of :Man’. The message
states exactly why the data violated this constraint and where the data has to be modified.
Constraint violation triples contain references to the triples causing the constraint violations
(spin:violationRoot) and references to the constraints causing them
(spin:violationSource). In our example, the subject :Peter causes the constraint violation and
the constraint :ObjectSomeValuesFrom constructs the constraint violation triple. To fix
constraint violations, we need to give some guidance on how to make the data valid (spin:fix);
appropriate triples may point to useful messages explaining in detail how to overcome the constraint
violations. Constraint violations can be classified according to different levels of severity
(:severityLevel, having a controlled vocabulary as range with elements like :Error and
:Warning). It is also important to find non-validated triples, i.e. triples which have not been
validated by any constraint, as it may be required that every triple of the data be
validated.
5.2. Constraint Expressivity
Cardinality Restrictions. Class expressions in OWL 2 can be formed by placing restrictions
on the cardinality of object and data property expressions. All cardinality restrictions can be
qualified or unqualified. The class expressions contain those individuals that are connected by a
property expression to at least, at most, and exactly a given number of instances of a specified
class expression. Qualified and unqualified cardinality restrictions can be expressed in OWL 2:
:CE rdfs:subClassOf [
  a owl:Restriction ;
  owl:maxQualifiedCardinality "1"^^xsd:nonNegativeInteger ;
  owl:onProperty :hasSon ;
  owl:onClass :Man ] .
:Peter a :CE ;
  :hasSon :Stewie .
:Stewie a :Man .
:Peter is an instance of the class expression :CE, which contains those individuals having at most 1
son that is a :Man; in the RDF instance data this son is :Stewie. If we state that :Peter has a second son or if we
do not assign :Stewie to the class :Man, the qualified maximum cardinality restriction will be
violated. SPIN, Stardog, and Shape Expressions are the only approaches with which qualified and
unqualified cardinality restrictions on data and object properties can be specified.
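As an illustration of how such a restriction can be reduced to SPARQL query answering (our sketch, not the exact SPIN template), the following query selects every instance of :CE with more than one :hasSon value of type :Man:

# Sketch: find violations of a qualified maximum cardinality of 1 on :hasSon
SELECT ?individual (COUNT(DISTINCT ?son) AS ?sons)
WHERE {
  ?individual a :CE ;
              :hasSon ?son .
  ?son a :Man .
}
GROUP BY ?individual
HAVING ( COUNT(DISTINCT ?son) > 1 )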
Disjointness. Disjointness of classes and union of class expressions, (class-specific) object and
data properties, and individuals is a very important type of constraints which can be completely
covered with SPIN (implementing OWL 2 constructs). An OWL 2 disjoint union axiom
DisjointUnion( C CE1 ... CEn ) states that a class C is a disjoint union of the class expressions
CEi, 1 ≤ i ≤ n, all of which are pairwise disjoint. Each instance of C is an instance of exactly one
CEi, and each instance of CEi is an instance of C18. According to the next disjoint union of 2 class
expressions, each child is either a boy or a girl, each boy is a child, each girl is a child, and
nothing can be both a boy and a girl. As in this example, :Stewie is both a boy and a girl, a
constraint violation is raised:
:Child owl:disjointUnionOf ( :Boy :Girl ) .
:Stewie a :Child ; a :Boy ; a :Girl .
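A generic SPARQL check for this kind of constraint could look like the following sketch of ours (it reports each offending pair of classes in both orders, which is acceptable for illustration):

# Sketch: find resources typed with two distinct members of a disjoint union
SELECT DISTINCT ?resource ?class1 ?class2
WHERE {
  ?c owl:disjointUnionOf ?list .
  ?list rdf:rest*/rdf:first ?class1 .
  ?list rdf:rest*/rdf:first ?class2 .
  FILTER ( ?class1 != ?class2 )
  ?resource a ?class1 , ?class2 .
}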
Disjoint groups of object and data properties can be expressed in OWL 2:
[ rdf:type owl:Class ;
  owl:unionOf (
    [ rdf:type owl:Restriction ;
      owl:qualifiedCardinality 1 ;
      owl:onProperty foaf:name ;
      owl:onClass xsd:string ]
    [ rdf:type owl:Class ;
      owl:intersectionOf (
        [ rdf:type owl:Restriction ;
          owl:minQualifiedCardinality 1 ;
          owl:onProperty foaf:givenName ;
          owl:onClass xsd:string ]
        [ rdf:type owl:Restriction ;
          owl:qualifiedCardinality 1 ;
          owl:onProperty foaf:familyName ;
          owl:onClass xsd:string ] ) ] ) ] .
18 http://www.w3.org/TR/owl2-syntax/
In this example, we define a shape for persons. A person has either a FOAF name or 1 or more
given names and 1 family name. Although this kind of constraint can be realized in OWL 2, the
definition of disjoint groups of properties is not that intuitive and declarative. Exactly the same
constraint can be expressed more concisely with Shape Expressions:
<PersonShape> {
  ( foaf:name xsd:string
    | foaf:givenName xsd:string+ ,
      foaf:familyName xsd:string ) }
Shape Expressions and SPIN are the only approaches to specify disjoint groups of properties
for given classes.
Constraints on RDF Properties. Object as well as data properties may be constrained. The
main component of an OWL 2 ontology is a set of axioms - statements that say what is true in the
domain. OWL 2 provides axioms that can be used to characterize and establish relationships
between object and data property expressions. An object property functionality axiom states that
an object property expression is functional - that is, for each individual x, there can be at most
one distinct individual y such that x is connected by the object property expression to y19. With
Pellet ICV, we can state a couple of object and data property axioms like the following object
property functionality axiom in OWL Turtle syntax (Sirin and Tao, 2009):
:isManufacturedBy a owl:FunctionalProperty .
:Product :isManufacturedBy :Manufacturer1 , :Manufacturer2 .
The object property :isManufacturedBy is defined as functional. The OWL interpretation
would infer that the manufacturers are the same resources, as nothing contradicts the inference
that these two manufacturers are the same and there is no Unique Name Assumption. With
constraint semantics, however, a constraint violation is raised. With Resource Shapes 2.0 and
Shape Expressions it is not possible to declare functionality axioms on object and data properties.
We can define these axioms with SPIN (and OWL 2), Stardog, and Pellet.
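Under constraint semantics, such a functionality axiom can likewise be checked with a SPARQL query; the following is our sketch, not Pellet ICV's actual translation:

# Sketch: find subjects with two distinct values for a functional property
SELECT DISTINCT ?subject ?property
WHERE {
  ?property a owl:FunctionalProperty .
  ?subject ?property ?value1 , ?value2 .
  FILTER ( ?value1 != ?value2 )
}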
Object property paths (supported by Stardog and SPIN) are important constraints within
various domains. Object property chains can be expressed as OWL 2 axioms
SubObjectPropertyOf( ObjectPropertyChain( OPE1 ... OPEn ) OPE ) stating that, if an individual
x is connected by a sequence of object property expressions OPE1 , ..., OPEn with an individual y,
then x is also connected with y by the object property expression OPE20. As the triple :Stewie
:hasAunt :Carol . is not contained in the following data set, a constraint violation results:
:hasAunt owl:propertyChainAxiom ( :hasMother :hasSister ) .
:Stewie :hasMother :Lois . :Lois :hasSister :Carol .
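A corresponding validation query (our sketch) looks for individuals that are connected by the chain :hasMother/:hasSister but lack the entailed :hasAunt triple:

# Sketch: report missing :hasAunt statements implied by the property chain
SELECT ?x ?y
WHERE {
  ?x :hasMother ?mother .
  ?mother :hasSister ?y .
  FILTER NOT EXISTS { ?x :hasAunt ?y }
}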
19 http://www.w3.org/TR/owl2-syntax
20 http://www.w3.org/TR/owl2-syntax
Constraints on RDF objects. For RDF objects, we can state constraints such as allowed
values, default values, and negative object constraints. Resource Shapes 2.0 enables defining
allowed values for RDF objects as well as RDF literals:
:oslc-change-request a oslc:ResourceShape ;
oslc:property :oslc_cm-status .
:oslc_cm-status a oslc:Property ;
oslc:allowedValues :status-allowed-values .
:status-allowed-values a oslc:AllowedValues ;
oslc:allowedValue "Done" , "InProgress" , "Submitted" .
The constraint above specifies the only allowed values of the status data property for change
request resources. If change requests have other status values, constraint violations will be raised.
In addition to Resource Shapes 2.0, the DCMI RDF-APs and SPIN (and OWL 2) allow
specifying allowed values for RDF literals. For RDF objects, we can apply the approaches
Resource Shapes 2.0, Shape Expressions, DCMI RDF-APs, and SPIN (and OWL 2) to define
allowed values.
With DCMI RDF-APs and SPIN, we can declare that RDF objects and literals have to be part
of specific controlled vocabularies. These statements are represented with DCMI RDF-APs using
an RDF triple comprising an RDF subject that is the value RDF node, an RDF predicate
dcam:memberOf, and an RDF object with a corresponding RDF URI Reference being the DCAM
vocabulary encoding scheme URI21. The following excerpt states that a given book is assigned to
the topic ’Ornithology’, which is part of a particular controlled vocabulary:
:Book dcterms:subject [
  rdf:value "Ornithology" ;
  dcam:memberOf :ControlledVocabulary ] .
Constraints on RDF Literals. Constraints on RDF literals are not that significant in the Linked
Data community, but they are very important in communities like the library domain. For RDF
literals, range-specific, constraining-facet-specific, datatype-specific, and language-specific constraints can be defined. We can restrict the datatypes that RDF literals have to correspond to with
XML Schema constraining facets. SPIN allows us to implement all constraining facets. DQTPs
enable constraining literal values to match or not to match a certain regex pattern
(xsd:pattern):
SELECT DISTINCT ?s WHERE { ?s %%P1%% ?value .
FILTER ( %%NOP%% regex(str(?value), %%REGEX%%) ) }
P1 is the property we need to check against REGEX and NOP can be a not operator (!) or
empty. An example binding could be to check if the dbo:isbn format is different (!) from
“^([iIsSbBnN 0-9-])*$” (Kontokostas et al., 2014). DQTPs also enable constraining literal values
(having a certain datatype) to be or not to be within a specific range (xsd:maxInclusive,
xsd:maxExclusive, xsd:minExclusive, xsd:minInclusive):
SELECT DISTINCT ?s WHERE {
  ?s rdf:type %%T1%% . ?s %%P1%% ?value .
  FILTER ( %%NOP%% (?value < %%Vmin%% || ?value > %%Vmax%%) ) }
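One possible binding of this pattern (our illustration; the class geo:SpatialThing is an assumption) restricts geo:lat values to the range [-90,90] discussed below:

# Sketch: T1 = geo:SpatialThing, P1 = geo:lat, NOP = empty, Vmin = -90, Vmax = 90
SELECT DISTINCT ?s WHERE {
  ?s rdf:type geo:SpatialThing . ?s geo:lat ?value .
  FILTER ( ?value < -90 || ?value > 90 ) }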
21 http://dublincore.org/documents/dc-rdf/
For instance, we can restrict geographical longitudes and latitudes (geo:lat, geo:long) of a
spatial feature to be within the range [-90,90] (Kontokostas et al., 2014). Furthermore, we
implemented the constraining facet xsd:whiteSpace in SPIN to avoid leading and trailing white
spaces in literals. Sub-types of language-specific constraints on RDF literals are constraints (1) to
check if a literal for a specific data property within the context of a particular class has a given
language tag, (2) to check whether the literal, within the context of a given property and class, is
missing, or (3) to ensure that resources of a given type must have at most 1 value of a specific
language for a given data property (e.g. a single English (“en”) rdfs:label). Default values can
be defined with Bibframe, Resource Shapes 2.0, and SPIN. For this purpose, SPIN constructors
may contain SPARQL CONSTRUCT queries for specific classes (e.g. USA is the birth country
of each USCitizen):
:USCitizen a rdfs:Class ;
spin:constructor [ a sp:Construct ; sp:text """
CONSTRUCT { ?this :birthCountry "USA" . } WHERE {} """ ] .
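The third sub-type of language-specific constraints could, for example, be checked with a plain SPARQL query like the following sketch of ours (the class :Book is hypothetical):

# Sketch: resources of type :Book with more than one English rdfs:label
SELECT ?resource (COUNT(?label) AS ?labels)
WHERE {
  ?resource a :Book ;
            rdfs:label ?label .
  FILTER ( lang(?label) = "en" )
}
GROUP BY ?resource
HAVING ( COUNT(?label) > 1 )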
6. Evaluation
In this section, we evaluate current approaches according to the top-level classification of
constraint validation requirements. This kind of evaluation is crucial for future improvements
regarding constraint formulation and validation of both existing and new approaches. The
underlying facts result primarily from the individual official specifications. We categorize
requirements classes to see which requirements are satisfied well, poorly, or only partially by which
approaches. The goal of this evaluation is not to completely evaluate all currently available
constraint validation approaches. We want to show in a generic way that none of the current
approaches satisfies all requirements and that different approaches cover different requirements
classes. Case studies and use cases define which requirements classes have to be covered. This
evaluation indicates which approaches to use to cover specific requirements classes and therefore
use cases. There are 2 first-level requirements classes: ’constraint expressivity’ and ’constraint
formulation’. Tables 2 and 3 show for each approach which second-level requirements classes are
covered and to which extent. Numbers in brackets behind requirements classes indicate the number of
requirements contained in that class. Numbers in brackets in table cells indicate that requirements
are only partially satisfied.
TABLE 2: Constraint Expressivity. Coverage of the requirements classes Disjointness (8), Equivalence (4), Constraints on RDF properties (20), Constraints on RDF objects (7), Constraints on RDF literals (14), Identification (5), Uniqueness (2), Provenance Constraints, Constraints on Individuals (6), Set-Oriented Operations (6), Property Restrictions (10), Cardinality Restrictions (12), Class Relationships (4), and Property Occurrences (9) by the approaches BF, DCMI, DQTP, Pellet, RS, SE, SPIN, and Stardog.
Good Coverage. Although equivalence (e.g. equivalent classes) is only considered by 1
approach (SPIN), all 4 associated requirements are satisfied. 1 approach (SPIN) covers all 20
requirements on RDF properties constraints (e.g. object property paths) and 2 approaches (DQTP
and Stardog) fulfill half of these requirements. Just 1 approach (SPIN) covers 4 of 5 identification
requirements (e.g. to check if IRIs correspond to specific patterns). Class expressions represent
sets of individuals by formally specifying conditions on the individuals’ properties; individuals
satisfying these conditions are said to be instances of the respective class expressions. Sub-categories of this requirements class are well satisfied by 3 approaches (DQTP, Shape Expressions, and Stardog) and nearly exhaustively satisfied by 1 approach (SPIN). Class relationships (e.g. subsumption) and set-oriented operations (e.g. negation of classes) are not
supported by many approaches. In contrast, property occurrences (e.g. mandatory or optional),
property restrictions (e.g. existential quantifications), and cardinality restrictions are supported by
the majority of current approaches. Constraints on individuals (e.g. negative object property
assertions) are only considered by 1 approach (SPIN) which fulfills all associated requirements.
Limited Coverage. Approach developers should also consider requirements which are not covered
exhaustively by current approaches. Only 3 approaches (DQTP, Shape Expressions, and SPIN)
consider disjointness constraints (e.g. class-specific disjoint property groups) and 1 approach
(SPIN) covers 5 of 8 disjointness requirements. 5 of 7 requirements on RDF objects constraints
(e.g. allowed values) can be expressed with 2 approaches (Shape Expressions and SPIN). There
are 2 requirements to ensure uniqueness (e.g. unique URIs), but only 1 approach (SPIN) satisfies
1 requirement. Other approaches do not cover uniqueness requirements.
Bad Coverage. For future development of approaches it is crucial to especially consider
requirements which are currently not satisfied at all by any approach. So far, provenance
constraints are not considered by approach developers. Most approaches satisfy just 2 of 14
requirements on RDF literal constraints (e.g. range of literal values). At least 1 approach (SPIN)
covers 50% of these requirements.
Table 3 shows constraint formulation requirements (classes) and their coverage by current
approaches. Even though almost every constraint language is intuitive, only 4 constraint
languages can be seen as both intuitive and concise (Pellet, Shape Expressions, SPIN, and
Stardog). 3 of these 4 approaches use OWL 2 as declarative language - the standard language to
define ontologies. Shape Expressions uses a language similar to regular expressions.
TABLE 3: Constraint Formulation. Coverage of the requirements classes Intuitive Language, Concise Language, Translated to Implementation Language, Implemented Constraint Validation, Implementation Publicly Available, RDF Representation of Constraints, and Constraint Validation Results (10) by the approaches BF, DCMI, DQTP, Pellet, RS, SE, SPIN, and Stardog.
Five of 8 approaches translate declarative constraint formulations to an implementation
language (e.g. SPARQL) to execute constraint validation. It is very important for future
enhancements by the whole community that implementations are not only existent but also
publicly available. 5 of 8 approaches are implemented, but implementations are publicly available
for only 2 approaches (public availability of implementations is limited for 2 further approaches).
Constraints are represented as RDF triples by only 1 approach (SPIN). RDF should be the natural
and standard format to represent constraints within the Linked Data community. 2 approaches
(Shape Expressions and SPIN) cover almost all requirements on validation results (e.g. providing
some guidance on how to make data valid). Unfortunately, 3 of the remaining approaches cover
requirements on validation results very poorly.
7. Conclusion and Future Work
Heterogeneous approaches with different strengths and weaknesses are not a bad thing; we do
not expect there to be a one-size-fits-all solution, nor do we aim at creating one. With this paper,
we rather want to raise the awareness towards the differences and commonalities of existing
approaches as well as to shed some light on the different requirements that data providers
currently have. Therefore, we presented our approach to collect case studies, use cases and
especially requirements collaboratively and in structured form. By linking the requirements to
existing constraint languages and validation systems, we could identify strengths and weaknesses,
commonalities and differences not only intellectually, but based on reliable data.
The main purpose of this work is to support discussions of the different approaches and to help
stakeholders in the choice or in the development of appropriate solutions. In the context of
application profiles, where the publication of constraints together with the data model is crucial,
we want to emphasize the need for concise, easy-to-understand constraint languages. This
requirement is often neglected in discussions of approaches. While consistency is
understandably desired, it has to be questioned whether one constraint language can fulfill all
requirements without becoming overly complicated, or whether different approaches should rather be used
for different classes of requirements. This holds especially for different levels of abstraction, such as
the possibility to define constraints on the format of RDF literals compared to constraints on the
availability or specific properties of provenance information. Both represent examples where all
current approaches lack proper support.
Gaps within a class of requirements, e.g., disjointness, constraints on RDF objects, or
uniqueness, should be easier to close within the existing approaches. This would lead to a
harmonization of the approaches regarding their expressivity and enable translations in-between
or towards a general constraint language, e.g., the translation of well-readable constraints in any
language to executable SPARQL queries. The latter is especially promising considering that
SPARQL is able to fulfil all functional requirements and is already considered by many to be a
practical solution for formulating constraints.
As future work, we plan to provide a complete implementation of OWL 2 constraints in the form
of SPIN templates to demonstrate this approach. We will extend and maintain the requirements
database and hope to establish it as an important tool for the advancement of constraint
formulation and validation in RDF. Within the DCMI RDF Application Profiles Working Group,
we pursue the establishment of application profiles that, among other things, allow constraints to be linked
directly to published datasets and ontologies.
Acknowledgements
Kai Eckert is funded by the European Commission within the DM2E project (http://dm2e.eu).
References
Angles Renzo and Gutierrez Claudio. (2008). The expressive power of SPARQL. In Proceedings of the 7th
International Semantic Web Conference (ISWC2008), pages 114–129, 2008.
Fürber Christian and Hepp Martin. (2010). Using SPARQL and SPIN for Data Quality Management on the Semantic
Web. In Witold Abramowicz and Robert Tolksdorf, editors, Business Information Systems, volume 47 of Lecture
Notes in Business Information Processing, pages 35–46. Springer Berlin Heidelberg, 2010.
Lohmann Steffen, Dietzold Sebastian, Heim Philipp, and Heino Norman. (2009). A web platform for social
requirements engineering. In Jürgen Münch and Peter Liggesmeyer, editors, Software Engineering (Workshops),
volume 150 of LNI, pages 309–315. GI, 2009.
Lohmann Steffen, Heim Philipp, Auer Sören, Dietzold Sebastian, and Riechert Thomas. (2008). Semantifying
requirements engineering – the softwiki approach. In Proceedings of the 4th International Conference on Semantic
Technologies (I-SEMANTICS ’08), J.UCS, pages 182–185, 2008.
Kontokostas Dimitris, Westphal Patrick, Auer Sören, Hellmann Sebastian, Lehmann Jens, Cornelissen Roland, and
Zaveri Amrapali. (2014). Test-driven evaluation of linked data quality. In Proceedings of the 23rd International
Conference on World Wide Web, WWW ’14, pages 747–758, Republic and Canton of Geneva, Switzerland, 2014.
International World Wide Web Conferences Steering Committee.
Ryman Arthur G., Le Hors Arnaud, and Speicher Steve. (2013). OSLC Resource Shape: A language for defining
constraints on linked data. In Christian Bizer, Tom Heath, Tim Berners-Lee, Michael Hausenblas, and Sören Auer,
editors, LDOW, volume 996 of CEUR Workshop Proceedings. CEUR-WS.org, 2013.
Sirin E. and Tao J. (2009). Towards integrity constraints. In Proceedings of the Workshop on OWL: Experiences and
Directions, OWLED 2009, 2009.
Extracting Description Set Profiles from RDF Datasets using
Metadata Instances and SPARQL Queries
Tsunagu Honma
Graduate School of Library, Information and Media Studies, University of Tsukuba, Japan

Kei Tanaka
NTT DATA Corporation, Japan

Mitsuharu Nagamori
Faculty of Library, Information and Media Science, University of Tsukuba, Japan

Shigeo Sugimoto
Faculty of Library, Information and Media Science, University of Tsukuba, Japan
Abstract
A variety of communities create and publish metadata as Linked Open Data (LOD). Users of
those datasets find and use them for their own purposes and may combine the datasets to add
value. Each LOD dataset uses various vocabularies, structures and constraints for describing
resources. In order to improve the usability of LOD datasets, it is very important for metadata
designers to enhance the interoperability of their own metadata with that of other datasets. In
order to create new interoperable metadata, metadata schema designers have to understand the
Application Profiles of the existing LOD datasets.
Dublin Core Description Set Profile (DSP) is a component of Dublin Core Application Profiles.
A DSP describes the structures and constraints of metadata in an application (e.g., resource
classes, property cardinality and value schemes). Metadata schema registries, which collect and
provide metadata schemas, have a large potential for helping metadata schema designers find,
compare, and adopt existing schemas. However, most LOD datasets are not published with their
DSPs. As a result, metadata schema designers have to look at each dataset and guess the DSPs.
This paper proposes a method to extract the structural constraints of metadata records
automatically from metadata instances using existing metadata schema. The goal of this study is
to reduce the cost of metadata schema extraction and to increase the number of metadata schemas
registered in metadata schema registries. We have experimentally extracted constraints from
LOD datasets using SPARQL. To evaluate, we applied our approach to 10 datasets in the
DataHub. By comparing the structural constraints that were extracted using our approach with a
manual approach, we found that our approach was able to extract more constraints.
Keywords: application profiles; metadata schema design; metadata schema extraction
1. Introduction
A considerable number of metadata datasets are published as Linked Open Data (LOD)1 for
sharing on the Web. LOD is widespread across many specific domains such as government,
geography and e-science. Many communities create and publish LOD datasets on the Web and
users are free to combine those datasets. Before designing new LOD datasets, metadata schema
designers design a new application profile, which defines some constraints of metadata that are
important for users of datasets. Particularly, in order to mash-up different datasets, metadata
schema designers should create schema that enhance the interoperability of those metadata.
Application Profiles (Coyle and Baker, 2009) are helpful for users to understand the
constraints of datasets. Dublin Core Description Set Profile (DSP) (Nilsson, 2008) is a
component of an application profile, which explains the structural constraints of metadata
1 http://linkeddata.org/
instances (Nilsson and Baker, 2008). If metadata schema designers are able to find and use DSPs,
they can understand what vocabularies, structures, and constraints are used for describing datasets
in that specific domain.
There are some metadata schema registries for accumulating and publishing metadata
vocabularies and application profiles. Metadata schema designers can use those registries for
finding existing application profiles that are similar to their own application profile. In order to
cover a more specific domain, we have to increase the number of application profiles. However,
most LOD datasets are not published with their profiles (Nishide et al., 2013). Therefore,
metadata schema designers have to look into datasets and try to deduce their structural
constraints. There are a lot of datasets in each specific domain, and those datasets are often too
large to look into to determine structural constraints. It is therefore costly for metadata schema
designers to have to make deductions about structural constraints manually.
We propose a method to extract the structural constraints of LOD datasets automatically.
Creators of LOD datasets describe metadata instances based on their implicit or explicit structural
constraints. Therefore, we use metadata instances, which are included in LOD datasets and
existing metadata schema, for extracting structural constraints. We extract structural constraints
from LOD datasets using SPARQL. We create a Description Template for each class membership
that resources are instances of. After creating Description Templates, we also extract property
URIs, value types, language tags and datatypes for creating Statement Templates.
We apply our approach in practice to 10 datasets in the DataHub for evaluating our approach
and clarifying issues which we need to solve for improving our method.
2. Sharing Application Profiles to Design a New Interoperable Schema
When metadata schema designers design a new application profile, they try to find existing
application profiles in order to 1) reduce the cost of designing application profiles, 2) improve the
interoperability of their metadata and 3) develop requirements for their metadata. Creating
application profiles from scratch comes at a high cost, because metadata schema designers have
to find suitable metadata vocabularies and structures for their purposes. If there are existing
application profiles which have been created for similar purposes, designers can reuse those
schema to reduce the cost of finding metadata vocabularies and deciding on the structure of
metadata. As a result, the new application profile has improved interoperability because schema
designers reuse common vocabularies and structures in the specific domain in which their
metadata is used. Through reusing and customizing existing application profiles, metadata
schema designers develop requirements for their metadata.
In order to accomplish these goals, metadata schema designers should find and reuse existing
application profiles in the same domain. Metadata schema registries are useful for metadata
schema designers to find existing parts of application profiles. Metadata schema registries
support the sharing of metadata schema on the web and promote reuse of metadata schemas and
metadata interoperability (Nagamori et al., 2011). The Open Metadata Registry (Hillmann et al.,
2006) is one such metadata schema registry. This registry can store metadata vocabularies and
metadata schema in the form of element sets. MetaBridge (Nagamori et al., 2011) is also a
metadata schema registry which is compatible with OWL-DSP based on DSP. If metadata
creators share their application profile explicitly in those registries, metadata schema designers
can use those registries as examples of metadata structures and constraints when they design new
application profiles.
The number of application profiles that are registered in those registries is not enough for
metadata schema designers to find and reuse those profiles. Therefore, it is important to create
and register application profiles of various datasets. If metadata creators publish LOD datasets
with their application profiles, schema registries can accumulate and share those application
profiles. However, most LOD datasets are published without explicit application profiles. For that
reason, one has to look into each LOD dataset and create its application profile manually. LOD
datasets are often too large to survey as a whole, and observing those datasets and creating
application profiles is difficult for metadata schema designers. It is necessary to extract
application profiles from existing LOD datasets automatically.
There is related work in the area of schema extraction (Chidlovskii, 2002). Here, the
researchers proposed methods for extracting XML Schema. XML Schema defines the structural
constraints of metadata, which have been serialized in XML, such as the hierarchies of each
XML element and its attribute. However, we would like to extract the structural constraints of
resources, properties and values that are described with the RDF model, not only serialized with
XML. Such constraints are independent of the serialization found in XML element hierarchies.
SchemEX (Konrath et al., 2012) is an existing approach for extracting metadata schema from
LOD datasets. This approach extracts schema that include RDF type clusters and relationships
between resources that are instances of type clusters. Those schema capture structural constraints
about a dataset in terms of typed resources and properties, but do not define metadata value constraints,
especially literal value constraints such as datatypes and language tags.
In this research, we propose a method to extract application profiles for LOD datasets
automatically using metadata instances and existing schema. In the Singapore Framework, an
application profile consists of five components. This research aims to extract Description Set
Profiles, which define the structural constraints of metadata instances. Metadata instances are
described based on implicit or explicit structural constraints. We can extract those constraints
from existing metadata instances.
3. Extracting Structural Constraints from Metadata Instances
Definitions of metadata vocabularies, structural constraints of metadata and description
formats are all components of a metadata schema. In this research, our goal is to extract structural
constraints as a DSP when a user inputs metadata instances. A DSP consists of Description
Templates and Statement Templates. Description Templates define the constraints of resources,
and Statement Templates define the constraints of attributes. In DSP, we are able to describe the
following constraints using Description Templates and Statement Templates.
・ Description Templates
- Resource class membership constraints
- Statement Templates which belong to this Description Template
・ Statement Templates
- Property URI
- Type constraint, “literal” or “non-literal”
- Class membership of non-literal metadata values
- Datatypes and language tags of literal metadata values
In this section, we explain our approach for extracting structural constraints with an example.
Figure 1 shows an example of metadata instances. The example shows that _:group1 is an
instance of foaf:Group ∩ foaf:Organization. This resource has two members using foaf:member,
_:person1 and _:person2 which have their own names and email addresses with foaf:name and
foaf:mbox. Our goal is to extract the structural constraints of these metadata instances as seen in
Tables 1 and 2. Table 1 shows the constraints of resources which are instances of foaf:Group
∩ foaf:Organization. Table 2 shows the constraints of resources which are instances of
foaf:Person.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
_:person1 rdf:type foaf:Person;
foaf:name "Alice"@en;
foaf:mbox <mailto:[email protected]>.
_:person2 rdf:type foaf:Person;
foaf:name "Bob"@en;
foaf:mbox <mailto:[email protected]>.
_:group1 rdf:type foaf:Group, foaf:Organization;
foaf:name "University of Tsukuba"@en;
foaf:homepage <http://www.tsukuba.ac.jp/>;
foaf:member _:person1, _:person2 .
FIG. 1. An example of metadata instances for extracting structural constraints of metadata
TABLE 1: Structural constraints of an instance of foaf:Group ∩ foaf:Organization
Attribute   Property        Value Constraints
name        foaf:name       rdfs:Literal, @en
website     foaf:homepage   foaf:Document
member      foaf:member     foaf:Person
TABLE 2: Structural constraints of an instance of foaf:Person
Attribute   Property        Value Constraints
name        foaf:name       rdfs:Literal, @en
email       foaf:mbox       rdfs:Resource
Metadata instances are described based on the above constraints, and we extract them from
metadata instances using the following steps. In each step, we extract resources, properties and
values using SPARQL because we need to estimate the structural constraints of metadata
instances. Before extracting the structural constraints, we loaded the metadata instances into an RDF
database.
Step 1: Get the class membership which resources are instances of
Step 2: Get the properties for each class membership
Step 3: Get a value type constraint (literal or non-literal)
Step 4: Get other value constraints
Step 4-1: Get literal value constraints (e.g., language tag and datatype)
Step 4-2: Get non-literal value constraints (e.g., resource class membership and base URI)
In the first step, we extract class memberships of resources which are described using rdf:type
because typed resources are useful starting anchors for defining Description Templates. In our
example, there are two class memberships, foaf:Person and (foaf:Group ∩ foaf:Organization).
We extract those memberships using a SPARQL query, which is shown in Figure 2, and create
two Description Templates.
SELECT DISTINCT (GROUP_CONCAT(DISTINCT(?type) ; separator = ", ") as ?types)
WHERE {
  ?s rdf:type ?type.
  ?s ?p ?o.
  FILTER(?p != rdf:type)
}
GROUP BY ?s
ORDER BY ?types
FIG. 2. A SPARQL query for extracting the class membership which resources are instances of
The second step is a process for creating Statement Templates. Statement Templates are
created for defining the constraints of metadata attributes. In this step, we execute queries to find
properties for each class membership, which are defined by Description Templates. When we
execute a SPARQL query, such as that shown in Figure 3, we get minimum Statement Templates
that define only property constraints.
SELECT DISTINCT ?p
WHERE {
?s ?p ?o .
?s rdf:type foaf:Group .
?s rdf:type foaf:Organization .
FILTER NOT EXISTS {
?s rdf:type ?type .
FILTER(?type != foaf:Group)
FILTER(?type != foaf:Organization)
}
}
FIG. 3. A SPARQL query for extracting properties which instances of foaf:Group ∩
foaf:Organization have
We estimate value constraints in the third step. After we get metadata values using classes of
resources and a property, we classify those values into “literal”, “non-literal” and “mix”. To
estimate value constraints, we count the number of the three metadata values below.
A) The number of all metadata values,
B) The number of literal metadata values, and
C) The number of non-literal metadata values.
When A = B, we define the value constraint as “literal”; when A = C, we define it as “non-literal”.
If A > B and A > C, we define the value constraint as “mix”. For extracting B and C, we use
isLiteral, isIRI and isBlank, which are SPARQL functions, as shown in the SPARQL query in Figure 4.
SELECT (COUNT (?o) as ?count)
WHERE {
?s rdf:type foaf:Person .
FILTER NOT EXISTS {
?s rdf:type ?type .
FILTER(?type != foaf:Person)
}
{
?s foaf:mbox ?o .
FILTER isBlank(?o)
}
UNION
{
?s foaf:mbox ?o .
FILTER isIRI(?o)
}
}
FIG. 4. A SPARQL query for extracting the number of non-literal metadata values for
foaf:Person and foaf:mbox
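For comparison (our illustration, not one of the paper's figures), the total number of values A for the same class membership and property can be counted with an analogous query:

# Sketch: count all values (A) of foaf:mbox on resources typed only as foaf:Person
SELECT (COUNT(?o) as ?count)
WHERE {
  ?s rdf:type foaf:Person .
  FILTER NOT EXISTS {
    ?s rdf:type ?type .
    FILTER(?type != foaf:Person)
  }
  ?s foaf:mbox ?o .
}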
In the final step, we extract the constraints of literal and non-literal metadata values such as
class memberships of non-literal resources, base URIs, language tags and datatypes of literal
metadata values. This process is executed based on the result of step 3. If the value type
is “non-literal”, we extract the resource classes and the base URI of the metadata values by analyzing all
object data retrieved from the RDF database. If the value type is
“literal”, we extract the datatype and language tags.
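A query in the style of Figures 2-4 could then collect the language tags and datatypes used for foaf:name on instances of foaf:Person; this sketch is ours and not one of the paper's figures:

# Sketch: language tags and datatypes of literal foaf:name values on foaf:Person
SELECT DISTINCT (lang(?o) AS ?languageTag) (datatype(?o) AS ?datatype)
WHERE {
  ?s rdf:type foaf:Person .
  ?s foaf:name ?o .
  FILTER isLiteral(?o)
}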
4. Evaluation
We implemented a system to extract DSPs using our approach. To evaluate our system and
approach, we extract DSPs from 10 datasets and verify those DSPs. We used 10 LOD datasets
that are published as RDF files on the DataHub2. It is difficult to extract metadata schema
manually, so to evaluate our method for large datasets, we chose datasets that could be accessed
on the Web and were the top 10 largest in file size at the time of access. In this evaluation, we
confirm only precision by comparing constraints which are extracted using our approach and a
manual method, and also comparing extracted constraints and actual datasets.
First, we compared structural constraints defined by DSPs, which were extracted by our
approach and a manual method. Using this comparison, we attempted to confirm if the system we
implemented is running correctly based on our proposed method. A person who executes a
manual method has knowledge and experience of designing metadata schema, but may not have
knowledge about the specific domain of each dataset (e.g., geography, statistics, etc.). In a
manual method, the process of extracting a DSP is based on 5 steps that were shown in section 3.
The difference of our approach and a manual method is the data size of RDF files. For extracting
a DSP from a dataset, our approach used entire RDF files belonging to that dataset, whereas the
manual method used the top 200 lines from each RDF file.
Table 3 shows the number of Description Templates and Statement Templates that were
extracted using our approach and a manual method. We confirmed that all of the structural
constraints extracted manually were included in the structural constraints extracted by our
approach. The constraints that we compared are shown in section 3. We also confirmed that the
constraints which were extracted by our approach are not contradictory to the actual datasets. There
are, however, differences between the numbers of templates that were extracted by our approach and
the manual method. One reason is that the amount of data used for manual extraction
was smaller than for our approach. Another reason is that some resources have multiple RDF types,
2 http://datahub.io/
Table 3: The number of Description Templates and Statement Templates that were extracted by our approach
and a manual method

Dataset ID in the DataHub                        Description Templates          Statement Templates
                                                 our approach  manual method    our approach  manual method
nytimes                                                1             1                13            9
colinda                                                2             1                15            7
mondial                                               19             4               107           31
eurostat-rdf                                           9             2                75            8
linked-open-vocabularies-lov                           9             4                63           15
farmers-markets-geographic-data-united-states         33             4               164           18
msc                                                    6             4                39           15
nuts-geovocab
osm-semantic-network
parole-simple-out
and those class memberships differ from each other. For these cases, we created a Description
Template for each class membership, so that most Description Templates have only a few
resources. For example, we extracted 168 Description Templates from parole-simple-out, but 106
Description Templates have less than 10 resources. We discuss this problem in section 5.
After we confirmed that our system is running correctly, we checked the constraints that were
extracted by our method but were not extracted by the manual method. As a result of comparing
those constraints with the original datasets, those constraints were not contradictory to the datasets.
Finally, we looked into parts of each dataset in order to find constraints which were not extracted
by our method.
In the above procedure, we confirmed that it is possible to extract most of the structural constraints
described in section 3 using our approach. However, there are constraints which we could
not extract using our approach. We discuss whether or not the constraints that we extracted are
useful for understanding existing metadata structures in the next section.
5. Discussion
We could not extract structural constraints of resources which do not have rdf:type using our
approach. For example, nuts-geovocab, for describing geographical metadata, includes RDF
Collections in order to describe the exterior of geospatial objects with multiple coordinates.
Figure 5 shows metadata instances from nuts-geovocab. There are more than two coordinates for
describing the exterior of the resource “http://nuts.geovocab.org/id/AT111_geometry”. Those
coordinates are described using non-typed blank nodes which are connected with rdf:first and
rdf:rest. This meant that we could not extract the Description Templates for resources that
describe coordinates. If we infer the classes of each resource using existing metadata schemas
and vocabulary definitions which include rdfs:domain or rdfs:range, we can
extract more Description Templates.
There are other issues that need to be solved in order to improve our approach. In this
evaluation, we extracted a large number of Description Templates from farmers-markets-geographic-data-united-states and parole-simple-out. We proceeded to check their Description
Templates and Statement Templates. As a result, in some cases, we could merge Description
Templates into other templates. For example, in farmers-markets-geographic-data-united-states,
there are the following two class memberships:
Class membership defined in Description Template A
・ http://logd.tw.rpi.edu/source/data-gov/vocab/Dataset (logd:Dataset)
・ http://purl.org/twc/vocab/conversion/Dataset (conversion:Dataset)
・ http://purl.org/twc/vocab/conversion/MetaDataset (conversion:MetaDataset)
・ http://rdfs.org/ns/void#Dataset (void:Dataset)
Class membership defined in Description Template B
・ http://logd.tw.rpi.edu/source/data-gov/vocab/Dataset (logd:Dataset)
・ http://purl.org/twc/vocab/conversion/Dataset (conversion:Dataset)
・ http://purl.org/twc/vocab/conversion/SameAsDataset (conversion:SameAsDataset)
・ http://rdfs.org/ns/void#Dataset (void:Dataset)
Description Templates A and B differ only in the two classes conversion:MetaDataset and
conversion:SameAsDataset. Both Description Templates have 8 Statement Templates, and those
Statement Templates are similar. If there are a large number of Description Templates, metadata
schema designers cannot easily understand the structural constraints of the dataset. In that case,
we should define one Description Template for resources which are instances of (logd:Dataset ∩
conversion:Dataset ∩ void:Dataset).
We believe that we are unable to extract DSPs correctly if there are resources that have
multiple roles in the datasets. We have created and published Aozora Bunko LOD3 which is a
dataset including bibliographies based on Aozora Bunko4. Aozora Bunko is a Japanese digital
library that publishes digitized books. The bibliographies, which are published on Aozora Bunko,
have some resources about persons, such as “creator”, “translator” and “reviser”. We described
each person as an instance of aozora:Person. However, instances of aozora:Person have different roles
in that dataset, as mentioned above. In that case, we can only extract one Description Template
for aozora:Person, and in that Description Template, the metadata attributes for the persons
with different roles are mixed. There are two approaches to resolve this problem. One is adding
different classes for each type of person in the original datasets. Because this requires changing
the source data, this approach is not practical. The other is extracting a Description Template
for each pair of a class membership and a property that has an instance of that class membership
as a range. For example, if there are metadata instances as Figure 6 shows, we should extract
two Description Templates: one for aozora:Person as dc:creator and one for aozora:Person as dc:translator.
<geometry:Polygon
    xmlns:geometry="http://geovocab.org/geometry#"
    xmlns:wgs84pos="http://www.w3.org/2003/01/geo/wgs84_pos#"
    rdf:about="http://nuts.geovocab.org/id/AT111_geometry">
  <geometry:exterior>
    <geometry:LinearRing>
      <geometry:posList>
        <rdf:Description>
          <rdf:first>
            <rdf:Description>
              <wgs84pos:lat>47.35300025</wgs84pos:lat>
              <wgs84pos:long>16.435400050000055</wgs84pos:long>
            </rdf:Description>
          </rdf:first>
          <rdf:rest>
            <rdf:Description>
              <rdf:first>
                <rdf:Description>
                  <wgs84pos:lat>47.455132750000018</wgs84pos:lat>
                  <wgs84pos:long>16.281081050000068</wgs84pos:long>
                  …
FIG. 5. An example of a resource that is described using non-typed resources
3 http://mdlab.slis.tsukuba.ac.jp/lodc2012/aozoralod/
4 http://www.aozora.gr.jp/
<book_A> dc:creator <person_X> ;
dc:translator <person_Y> .
<person_X> rdf:type aozora:Person .
<person_Y> rdf:type aozora:Person .
FIG. 6. An example of resources which are both instances of aozora:Person but have different roles,
“dc:creator” and “dc:translator”
6. Conclusion
In this paper, we have proposed a method for extracting the structural constraints of LOD
datasets using metadata instances and existing schema. Metadata schema about existing datasets
are important for metadata schema designers to create a new interoperable schema with a low
cost. However, because creating formal metadata schema is costly, there are few schema about
existing LOD datasets on the web. We aim to extract metadata schema automatically, especially
the structural constraints of metadata records, in order to add metadata schema to metadata
schema registries.
To evaluate our approach, we compared the number of structural constraints which were
extracted by our approach and manually with 10 datasets in the DataHub. That evaluation showed
that our approach could extract all the structural constraints which could be extracted manually.
We also compared metadata instances and structural constraints which are extracted using our
approach. As a result, it has become clear that there are three issues to be solved when extracting
structural constraints using our approach. One is the need to improve our method for extracting
Description Templates of resources which have no rdf:type. The second issue is that we need to
merge Description Templates when the extracted templates are similar to other templates. The
last issue is that we separate templates for resources, which have same classes, but have different
roles in a dataset.
References
Chidlovskii, Boris. (2002). Schema extraction from XML collections. Proceedings of the 2nd ACM/IEEE-CS Joint
Conference on Digital Libraries, 2002, 291-292.
Coyle, Karen and Thomas Baker. (2009). Guidelines for Dublin Core Application Profiles. Retrieved May 15, 2014,
from http://dublincore.org/documents/2009/05/18/profile-guidelines/ .
Hillmann, Diane I., Stuart A. Sutton, Jon Phipps and Ryan Laundry. (2006). A Metadata registry from vocabularies
up: The NSDL registry project. Proceedings of the International Conference on Dublin Core and Metadata
Applications, 2006.
Konrath, Mathias, Thomas Gottron, Steffen Staab and Ansgar Scherp. (2012). SchemEX – Efficient Construction of a
Data Catalogue by Stream-based Indexing of Linked Data. Journal of Web Semantics, 2012, vol. 16.
Nagamori, Mitsuharu, Masahide Kanzaki, Naohisa Torigoshi and Shigeo Sugimoto. (2011). Meta-Bridge: A
Development of Metadata Information Infrastructure in Japan. Proceedings of the International Conference on
Dublin Core and Metadata Applications, 2011, 63-68.
Nilsson, Mikael. (2008). Description Set Profiles: A constraint language for Dublin Core Application Profiles. Retrieved
May 15, 2014, from http://dublincore.org/documents/dc-dsp/ .
Nilsson, Mikael, Thomas Baker and Pete Johnston. (2008). The Singapore Framework for Dublin Core Application
Profiles. Retrieved May 15, 2014, from http://dublincore.org/documents/2008/01/14/singapore-framework/ .
Nishide, Yoritsugu, Tsunagu Honma and Mitsuharu Nagamori. (2013). An Investigation of Japanese Open Data
Schema and Links to Improve the Use of Datasets. Digital Library, 2014.
Infrastructure & Models—Part B
The 1:1 Principle in the Age of Linked Data
Richard J. Urban
Florida State University
School of Information
Abstract
This paper explores the origins of the 1:1 Principle within Dublin Core Metadata Initiative
(DCMI). It finds that the need for the 1:1 Principle emerged from prior work among cultural
heritage professionals responsible for describing reproductions and surrogate resources using
traditional cataloging methods. As the solutions to these problems encountered new ways to
model semantic data that emerged outside of libraries, archives, and museums, tensions arose
within DCMI community. This paper aims to fill the gaps in our understanding of the 1:1
Principle by outlining the conceptual foundations that led to its inclusion in DCMI
documentation, how the Principle has been (mis)understood in practice, how violations of the
Principle have been operationalized, and how the fundamental issues raised by the Principle
continue to challenge us today. This discussion situates the 1:1 Principle within larger
discussions about cataloging practice and emerging Linked Data approaches.
Keywords: 1:1 Principle, RDF, Abstract Model
1. Introduction
In general, Dublin Core metadata describes one manifestation or version of a resource,
rather than assuming that manifestations stand in for one another. For instance, a jpeg
image of the Mona Lisa has much in common with the original painting, but it is not the
same as the painting. As such the digital image should be described as itself, most likely
with the creator of the digital image included as a Creator or Contributor, rather than just
the painter of the original Mona Lisa. The relationship between the metadata for the
original and the reproduction is part of the metadata description, and assists the user in
determining whether he or she needs to go to the Louvre for the original, or whether
his/her need can be met by a reproduction (Hillmann, 2003).
The Dublin Core Metadata Initiative (DCMI) 1:1 Principle appears to offer a simple dictum:
“metadata is about one, and only one, resource” (Powell, Nilsson, Naeve, Johnston, & Baker,
2007).1 Yet despite its apparent simplicity, “one to one…is a many headed snake, and it has
bitten us often over the years.” (Weibel, 2010). Metadata creators find the Principle confusing or,
at best, routinely ignore it because it remains unsupported by digital library software and
exchange protocols (Han, Cho, Cole, & Jackson, 2009; Hutt & Riley, 2005; S. J. Miller, 2010;
Park & Childress, 2009; Park, 2009; Shreeves et al., 2005; Stvilia, et al., 2004; Urban, 2012).
Although the specific definition provided in Hillmann’s (2003) Using Dublin Core (and the “one-to-one” label itself) has fallen out of favor, the fundamental questions embodied in the Principle
continue to animate debates and discussions about the DCMI Abstract Model and DCMI’s
relationship to the Resource Description Framework (RDF).
This paper aims to fill the gaps in our understanding of the 1:1 Principle by outlining the
conceptual foundations that led to its inclusion in DCMI documentation, how the Principle has
been (mis)understood in practice, how violations of the Principle have been operationalized, and
how the fundamental issues raised by the Principle continue to challenge us today. This
1 For consistency, I use 1:1 Principle except when variants are used in direct quotes, i.e. “one-to-one,” etc.
discussion situates the 1:1 Principle within larger discussions about cataloging practice and
semantic knowledge representations.
2. Background
While the specifics of the 1:1 Principle are directly tied to the development of Dublin Core
(DC), the general problem that it references — how to model the description of original resources
and their associated reproductions or surrogates in various formats — is one that has plagued
cataloging standards since reproductive technologies (such as photography, microfilm, and
microfiche) became widely available in the mid-20th century. At the heart of these discussions are
ontological distinctions among different kinds of bibliographic entities (e.g. multiple versions,
electronic resources, non-book resources). But it is also an account of how flat bibliographic
records have struggled to represent the complex relationships among these entities. At the time
that DC was being defined in the mid-1990s, many of the key stakeholders in its development
had already been wrestling with these issues for more than a decade.
2.1. Describing Reproductions, Multiple Versions, and Electronic Resources
From the earliest cataloging guidelines, concerns about representing “reproductions” of
bibliographic materials complicated emerging descriptive standards. As libraries began collecting
an increasing number of different reproductive media (microfilms and microfiche), or multiple
versions of the same work (i.e. a musical recording released simultaneously on vinyl, cassette,
and/or compact disc), the problems began to multiply (Graham, 1992; Knowlton, 2009).
Simonton’s report (1962), commissioned by the Association of Research Libraries (ARL),
defined two solutions to the problem that serve as the foundations for current practice:
• The Facsimile Theory privileged the intellectual content of an item by making the
“original” resource the focus of the record representing a reproduction. Following the
long-standing practice of dash entries, a description of the reproduction itself would be
included as a note.
• The Edition Theory required a record to represent the physical features of the
reproduction, using a note to provide a description of the “original” resource.
The first edition of the Anglo-American Cataloging Rules (AACR1) used the facsimile theory
and dashed entries to continue a common practice. However, AACR2’s cardinal principle required
a shift in cataloging rules towards an edition theory (item-at-hand) perspective (Graham, 1992).2
This shift was not welcomed by the cataloging community who “assailed [it] as ‘an obsession
with principle to the exclusion of common sense’” (Graham, 1992). Most vocal in their
opposition to the rule change were libraries and information centers that dealt in large numbers of
“reproduction” records, such as the Library of Congress (LOC), the National Library of Medicine
(NLM), and academic libraries participating in the NEH-funded U.S. Newspaper Program
(USNP). In response, the LOC issued a rule interpretation upholding a facsimile theory approach
(Graham, 1992; Library of Congress, 2010). While some bibliographic services, such as the
Research Libraries Group (RLG) RLIN, adapted to these rule interpretations, many cataloging
services could not take full advantage of them, leaving “a fractured set of approaches” in place
(Jones, 1997). Following the precedent set with microfilm reproductions, the Library of Congress
applied the same rule interpretation to the digitization of its photography collections (Arms,
1999). “The records describe the intellectual expression and the original form of the material and
provide a link to the corresponding digital reproductions” (Library of Congress, 2010).
2 “The starting point for description is the physical form of the item at hand, not the original or any
previous form in which the work has been published” (American Library Association, et al., 1988).
Many of the arguments about which theory should be used center around user needs and the
functions of information retrieval systems. For example, an advantage of the facsimile theory is
that it allowed records about originals and reproductions to co-locate in the catalog, thereby
saving the time of the user. The facsimile theory also had economic advantages. Under an edition
theory approach (AACR2), a cataloger had to “start over” to create a new record for the
reproduction. The facsimile theory (AACR1/LOC 1.11) allowed catalogers to quickly clone an
existing description and append a reproduction note, saving significant costs (Graham, 1992).
2.2. Beyond the Book: The Description of Art, Visual Resources, and Archival
Materials
At the same time that cataloging standards struggled with reproductions, a parallel conversation
was taking place about the representation of surrogates for non-book visual materials, such as
artworks, photography, and archival materials. Members of this community drew careful
distinctions between a reproduction that fully represented an original object and surrogates which
merely stood in for the object; e.g., a photograph of a 3-dimensional sculpture does not reproduce
the sculpture, but does allow us to represent it in an information system. This community
included professionals responsible for managing visual resource collections (art and architectural
slide collections) and museum collections (the Getty’s Art History Information Program, later the
Getty Information Institute – GII) (Fink, 1999; McRae & White, 1998). Until the advent of
centralized online catalogs, the distinction between originals and surrogates was handled by
establishing physically separate card catalogs. However, in a MARC-based catalog what kind of
resource a record represented was less clear. In order to make this more explicit, the MARC
Visual Materials (MARC-VM) and Archival Materials Control (MARC-AMC) formats
introduced new control fields that made the “type of record” explicit (Dooley & Zinham, 1990;
Evans & Will, 1988). In discussing the need for these new features, we see examples that would
later be revisited to illustrate the need for the 1:1 Principle:
The [Art and Architecture Thesaurus] considers reproductions of works of art to be
surrogates for original works and will recommend that they be indexed in a similar
fashion. For example, PAINTING (655) would be used to describe both Leonardo's
Mona Lisa and a slide reproduction; SLIDE (655) would also be used in the latter case.
This holds serious implications for effective retrieval….In an integrated database
containing both of these media, searchers interested only in examples of actual paintings
might have to learn to exclude slides, microfilm, and other reproduction media in their
search queries to retrieve only records for original paintings. . . . One solution might be
the addition of a “reproduction” facet to indexing strings for object surrogates so that
they would be differentiated from “originals” in a browse display (Dooley & Zinham,
1990).
The ability to distinguish between descriptions of originals and surrogates in various analog
and digital formats was a key component of emerging standards for describing information about
artworks and museum objects. Both the Categories for the Description of Works of Art (CDWA)
and the Visual Resource Association’s VRACore included structures that enabled the separation
of information about different kinds of resources (Baca, 2002; Harpring & Baca, 2009; Visual
Resources Association & Whiteside, 1999).
2.3. A Principle is Born
When the DCMI began, it had an explicit goal to describe “document-like objects” (DLO)
found on the World Wide Web (Weibel, 1995). The development of this new standard soon came
to the attention of several organizations interested in developing online representations for their
collections, including RLG, the Getty Information Institute (GII), and the UKOLN Arts and
Humanities Data Service (AHDS) (Erway, 1996; Fink, 1999; P. Miller & Greenstein, 1997).
Advocating for the needs of library, archive, and museum (LAM) collections, RLG argued that
DC could be used to describe offline physical collections and that the definition of DLOs should
extend to images (Erway, 1996). The Guidelines for Extending the Use of Dublin Core Elements
grounded its recommendations for a “record type” indicator or element refinements on earlier
work for reproduction/surrogate descriptions (Research Libraries Group, 1997a, Research
Libraries Group, 1997b).
The RLG proposal became a central point of discussion at the 1997 DC-4 Workshop in
Helsinki, Finland. Rather than adopt the proposed changes in the RLG Guidelines, workshop
participants discussed the relationship between “logical clusters of metadata…that reference one,
and only one, state of the information resource,” which became the nucleus of the 1:1 Principle
(Bearman, 1999; Weibel & Hakala, 1998).
Following the Helsinki meeting, 1:1 Principle issues emerged in several working groups (One-to-One, Relations, and Data Model). The discussions were frequently contentious debates
between members in different camps. Cultural heritage professionals’ concerns with the 1:1
Principle primarily focused on the kinds of resources that could be described using DC. Drawing
on their experiences with previous standardization efforts, this camp felt it necessary to provide
guidance for different types of materials. However, there was a strong resistance to DCMI getting
into the cataloging rules business, especially ones that needed to deal with complexities of
different ontological kinds. The members of this group preferred to let Dublin Core remain a
simple vocabulary for resource discovery. Acknowledging the concerns of cultural heritage
professionals, the latter group argued that the kind of discrimination sought for cultural materials
could be handled by more robust local standards (P. Miller & Greenstein, 1997). Furthermore,
discussions on the dc-one2one listserv:
. . . made absolutely clear that there is no consensus on what 1:1 really means in practice.
In the end, people will describe what *they* want to describe, for their purposes and
the purposes of their user community. That means they may describe a TIFF of an
Ansel Adams photograph as having been created by Ansel Adams. Who's to say they're
wrong? (Wendler, 1999)
By the end of 1999, discussion in the One-to-One group dwindled without having reached a clear
consensus on the Principle. It was formally combined with other task groups into the DC-Architecture working group, which attacked the problem from a different perspective.
Discussions in the Relation working group focused more on developing logical clusters of
metadata that could be linked together. The discussions echoed concerns found in earlier MARC-based solutions to representing originals and reproductions. In particular, there were concerns that
separating descriptions into distinct records could result in a loss of information when shared
outside of an application. The suggestion of separate records also raised concerns about how to
display them to users, with a sense that independent representations of originals and
reproductions would make the task harder. Proponents of “keeping Dublin Core simple”
suggested that atomic statements about resources enabled better discovery of resources without
the additional complexity of resource type-based models. Instead, statements about resources
could be dynamically organized into logical packages for particular uses such as retrieval or
display for a user (Lagoze, 1997, 2001a).
3. From Principle to Abstract Model
Thus far, the story of the 1:1 Principle has been about cataloging practices in a cultural
heritage community concerned with ontological distinctions and relationships among resources.
The introduction of these concerns into the development of DC metadata brought these practices
into contact with fundamentally different theories of description that emerged from formal
knowledge representation (KR) approaches. KR semantics were not merely concerned with fixing
the meaning of individual vocabulary terms, but how descriptions could consistently refer to
described resources (Urban, 2012).
This was of little concern when Dublin Core was created as embedded metadata within a
document-like object, such as an HTML page. In this case the metadata described the resource that
it was embedded within. A desire to describe non-textual resources meant developing a
standalone Standard Generalized Markup Language (SGML) syntax that would provide “explicit
semantics of each Dublin Core element”; however, “discrete packages of metadata cannot be
identified and the semantics of repeated elements are not specified” (Burnard, Miller, Quin, &
Sperberg-McQueen, 1996). These conversations resulted in the emergence of the Warwick
Framework that would allow for the creation and exchange of metadata containers (Dempsey &
Weibel, 1996; Lagoze, 1996). A package might include DC metadata, or metadata in other
formats.
The Warwick Framework became one of several alternative metadata proposals submitted to
the World Wide Web Consortium (W3C) in order to address laws aimed at filtering adult content
on the Web. Among the others were the Platform for Internet Content Selection (PICS),
Microsoft’s XML Web Collections (XMLWC), and Apple’s Meta Content Framework (MCF).
Rather than developing each of these recommendations separately, the W3C rolled them together
into a new initiative known as the Resource Description Framework (RDF) (E. Miller, 1998).
As a model for expressing a formal semantics for metadata, RDF owes a great deal to earlier
artificial intelligence and knowledge representation research that took place before the advent of
the World Wide Web (Halpin, 2004). In addition to fixing the meaning of properties used to
describe resources, researchers in this area quickly realized that referent tracking was essential to
the development of computational reasoning (Lenat & Guha, 1990). Guha would add features
originally developed for the Cyc project to MCF and ultimately to RDF (Halpin, 2004). In the
context of the RDF model, the relationship between a metadata statement and a resource is
established through the consistent assignment of a URI (Berners-Lee, 2002; Hayes, 2004). In
theory, if all the objects of description are supplied with a URI, statements about those resources
will naturally organize themselves around these identifiers, fulfilling the main objectives of the
1:1 Principle.
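For illustration only, the Mona Lisa example from the opening of this paper could be modeled as two separately identified resources; the following Turtle sketch is not drawn from any DCMI recommendation, and the URIs and property choices are hypothetical:
@prefix dcterms: <http://purl.org/dc/terms/> .
# One description per resource, each anchored by its own URI.
<http://example.org/paintings/mona-lisa>
    dcterms:creator "Leonardo da Vinci" ;
    dcterms:format  "oil on poplar panel" .
# The digital reproduction is a second resource with its own statements,
# linked back to the original it reproduces.
<http://example.org/images/mona-lisa.jpg>
    dcterms:creator "Museum digitization staff" ;
    dcterms:format  "image/jpeg" ;
    dcterms:source  <http://example.org/paintings/mona-lisa> .
In such a sketch, statements group themselves around the two identifiers, which is the behavior the Principle was meant to guarantee.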
The development of RDF and eXtensible Markup Language (XML) specifications encouraged
DCMI to begin work on a more formal data model for Dublin Core (Baker, 2012; Weibel &
Hakala, 1998; Weibel, 2010). Initially, this work expressed DC descriptions as a variant of RDF.
However, within the implementer community, there was a great deal of initial resistance to RDF
in favor of simpler “plain” XML representations. This was due in part to a lack of practice and
software tools that could understand RDF, and to fundamental misunderstandings within the
Dublin Core implementer community that saw RDF as an overly complex XML syntax (Baker &
Johnston, 2011; Baker, 2012). Because the XML serialization of RDF represented a graph
structure, it was also less human-readable than a document-like encoding of element/value pairs.
Resistance to RDF also came from the Open Archives Initiative (OAI) community, which was
developing a protocol for exchanging “packages” of metadata along the lines of the Warwick
Framework. “It may be that the vast majority of data providers don't need (or even understand)
RDF and are mainly interested in exposing metadata as simple attribute-value pairs or simple
trees for which XML is perfectly appropriate” (Lagoze, 2001b). In order to conform to the simple
DC and to provide a low barrier to use (i.e., by using well-supported technologies), OAI-PMH
initially required a minimal DC XML schema (later versions of OAI-PMH referenced official
DCMI XML syntax recommendations) (Lagoze, Van de Sompel, Nelson, & Warner, 2008). As a
container architecture, OAI-PMH left the aboutness of a record to the enclosed metadata
specification.
The intersection of XML and RDF models for DC metadata created some inherent tensions.
Although DCMI developed an implicit grammar for statements, it was intentionally scruffy in
order to accommodate the broad diversity emerging on the Web (Baker, 2000, 2012; Johnston,
2006). Addressing calls for more guidance, DCMI released official recommendations for
encoding Dublin Core in XML and RDF that included rudimentary definitions of an abstract
model. This initial model specified a one-to-one relationship between a record and a resource while
at the same time recognizing that “there is no formal linkage between a simple DC record and
the resource being described. Such a linkage may be made by encoding the URI of
the resource as the value of the DC Identifier element, however this is not mandatory” (Powell &
Johnston, 2002). Because of implementation confusions about this early model, a more formal
recommendation was published as the DCMI Abstract Model (DCAM) (Powell, Nilsson, Naeve,
Johnston, & Baker, 2005). Although DCAM borrowed some concepts from RDF, “DCAM was
meant to provide a basis for guidelines that would allow metadata records to be encoded using
XML, HTML, and in principle, any concrete implementation syntax…” (Baker, 2012, p. 121).
Although DCAM enabled syntaxes to include “slots” for URIs to reference a resource, it also
continued to support 1:1 Principle concepts:
The abstract model described above indicates that each DCMI metadata description
describes one, and only one, resource. This is commonly referred to as the one-to-one
principle…However, real-world metadata applications tend to be based on loosely
grouped sets of descriptions (where the described resources are typically related in some
way), known here as description sets. For example, a description set might comprise
descriptions of both a painting and the artist…(Powell et al., 2005)
Unfortunately, DCAM failed to achieve widespread adoption within the Dublin Core
implementer community, especially among LAMs that are the focus of this discussion. Instead of
resolving the tensions between RDF and XML approaches, the DCAM “fell between two stools,”
leaving neither group invested in applying it to their data (Baker & Johnston, 2011).
4. 1:1 Principle Violations and Metadata Quality
Because one of the fundamental objectives of Dublin Core is to enable the exchange of
interoperable metadata, studying metadata quality has been an important activity. Among studies
that examine DC metadata for cultural heritage resources, failure to comply with the 1:1
Principle has been identified as cause for many quality problems (Han et al., 2009; Hutt & Riley,
2005; S. J. Miller, 2010; Park & Childress, 2009; Park, 2005; Shreeves et al., 2005; Stvilia et al.,
2004).
For Shreeves, et al (2005), the 1:1 Principle is related to the internal cohesiveness of a
metadata record and the degree to which it represents related resources. In examining an
aggregation of cultural heritage metadata, they found that “…no collection maintained a
consistent one-to-one mapping between the metadata and a single resource…” Within an
individual collection, “between 57% and 100% of records in their sample included properties for
both physical and digital manifestations of a resource” (Shreeves et al., 2005). These findings
were later confirmed by Hutt & Riley (2005), Han, et al (2009) and again by S. J. Miller (2010).
S.J. Miller (2010) notes that 1:1 Principle problems result from “database and user interface
systems [that] do not have the capacity to adequately link separate records and to display them
together in a clear and meaningful way for end users.” Systems, such as CONTENTdm, base
their primary information models around digital assets, making it difficult to independently
represent non-digital source resources (Han et al., 2009). These systems also enable metadata
creators to add specialized, locally defined metadata elements on a collection-by-collection or
project-by-project basis. The ease with which these systems allow the addition of new properties
encourages ad-hoc modeling optimized for display in one local context, rather than more formal
and rigorous methods of modeling at Web scale.
4.1. Limitations of Violations
In light of the debates that brought the 1:1 Principle into existence, it is necessary to question
many of the assumptions that have gone into quality studies. First, the studies themselves
demonstrate that the 1:1 Principle was not necessarily a concern among metadata creators.
Instead, conforming to cataloging rules for reproductions and/or surrogate resources provided the
context for descriptions. Regardless of whether a record uses facsimile (AACR1) or edition
(AACR2) theory approaches, MARC inherently describes more than one resource. While local
practices for Dublin Core may not alter the definition of DC terms, they implicitly changed the
referent to a different resource (e.g., the prevalence of date.original and date.digital). The adoption of
these rules in association with Dublin Core, particularly within the library community, is often
justified by user convenience and economics (Cronin, 2008; S. J. Miller, 2010).
Secondly, most of these studies use a “record” as the unit of analysis for assessing metadata
quality, especially the set of DC elements provided by an OAI-PMH DC record. As noted above,
oai_dc is based on a 2002 XML schema recommendation that pre-dates DCAM (Lagoze, Van de
Sompel, Nelson, & Warner, 2002; Lagoze et al., 2008). Neither the OAI-PMH container
architecture nor this Dublin Core schema enable DCAM-like description sets that would comply
with the 1:1 Principle. These problems are further compounded by the limitations of data
representations within commonly used digital repository systems like CONTENTdm (Han et al.,
2009; S. J. Miller, 2010).
Furthermore, these studies are only able to detect a limited set of 1:1 Principle violations.
Most operationalize violations of the 1:1 Principle through a conjunction of oai_dc statements
(i.e., the resource hasFormat “image/jpeg” AND hasFormat “oil on board”). Although the
informal definition of the 1:1 Principle licenses such an assumption, it is not supported formally by
the XML semantics or the DCAM. The detection of 1:1 Principle violations has hinged on format
and date elements that supply ontological absurdities. Being aware that metadata represents
cultural heritage resources heightens our awareness of incoherent format statements that describe
the properties of both physical and digital resources. In a heuristic evaluation of metadata records,
qualitative researchers bring a great deal of background knowledge to their assessments. They
may intuitively understand that terms like image/jpeg and glass plate negative are properties that
are unlikely to be shared by the same resource. They also may understand that JPEGs are the
kinds of the resource that “reproduce” something like a glass plate negative, but rarely will glass
plate negatives “reproduce” a JPEG. They understand that JPEGs are the kind of resource that
can be associated with “2008” and are not resources that could have been created in “1901.”
These kinds of inferences are difficult to automate even when using robust taxonomies because
they require integrating and aligning knowledge from across multiple sources (for example, AAT
knows little about specific file formats described in a resource such as the Unified Digital Format
Registry (UDFR)). Even accepting these limitations, these automated approaches fail to identify
violations when DC records appear to be internally coherent, for example, a DC description of a
microfilm that merely uses a URL to link to a digitized version of the resource.
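To illustrate the kind of conjunction-based check these studies rely on, here is a rough SPARQL sketch; it is an illustration rather than a query taken from any of the cited studies, and it assumes the harvested records have been re-expressed as RDF using DCMI terms, with a small hand-picked list of physical-medium strings:
PREFIX dcterms: <http://purl.org/dc/terms/>
# Flag descriptions that carry both a MIME-type format and a physical-medium format,
# the internal incoherence treated above as a 1:1 Principle violation.
SELECT ?record ?digitalFormat ?physicalFormat
WHERE {
  ?record dcterms:format ?digitalFormat , ?physicalFormat .
  FILTER ( STRSTARTS( STR(?digitalFormat), "image/" ) )
  FILTER ( STR(?physicalFormat) IN ("oil on board", "glass plate negative") )
}
As noted above, such a query catches only the ontologically absurd conjunctions; a record that quietly mixes statements about a microfilm and its digitized copy would pass it.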
5. Would RDF save us from 1:1 Principle Violations?
The studies discussed above all took OAI-PMH XML as their focus, leaving an important
question unanswered: Would an RDF-based approach save us from rampant violations of the 1:1
Principle? Debates from within the Semantic Web/Linked Data community suggest that RDF
alone does not solve the problems inherent in the 1:1 Principle but rather shifts the burden onto
URIs. Known as the Semantic Web Identity Crisis or the httpRange-14 problem, the debates on this
issue closely parallel 1:1 Principle problems (Halpin, 2011; Hayes & Halpin, 2008). At the heart
of the problem is the question of whether a URI can refer to both an information object that
describes an entity (i.e., a surrogate representation) and the entity being described. Hayes and
Halpin (2008) provide the example of a URI that may refer to the Eiffel Tower itself (the
structure in Paris designed by Gustave Eiffel) and a photograph of the Eiffel Tower (or equally, a
set of RDF statements about the Eiffel Tower). According to Hayes & Halpin, what a URI refers
to may be specified by the formal interpretation associated with it. In one interpretation, the URI
may refer to the surrogate representation (the photo); in another, it may refer to the entity the
surrogate stands for (the Eiffel Tower itself). In contrast, Berners-Lee (2002) argues that URIs
refer to one, and only one, resource, as determined by the agent responsible for “minting” the
URI (in part through the authority bestowed by the owner of a domain name). To date, World
Wide Web Consortium (W3C) recommendations support Berners-Lee's approach (Sauermann &
Cyganiak, 2008). However in a study of available Linked Data, Halpin, et al (2010) found that
the same Linked Data URI was being used to refer to distinct entities in different contexts (for
example, the city of Paris as a political entity vs. Paris as a geographic location). Within the
present metadata quality literature, the question of whether a URI successfully refers to the
described resource is left unmeasured, especially for the use of URIs that do not provide access to
offline resources, but may successfully refer to them. While identifiers found in OAI-PMH
records had a high degree of uniqueness, this does not entail that any identifier refers uniquely to
one, and only one, resource. This suggests that another kind of 1:1 Principle violation may occur
if a URI is used to refer to more than one resource (Stvilia et al., 2004; Stvilia & Gasser, 2008).
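As a purely illustrative sketch of the ambiguity Hayes and Halpin describe (the URI and property choices here are hypothetical), a single URI made to carry statements about both the structure and a photograph of it yields a description much like the incoherent ones discussed in Section 4:
@prefix dcterms: <http://purl.org/dc/terms/> .
# The same URI is used for the Eiffel Tower itself and for a photograph of it.
<http://example.org/eiffel-tower>
    dcterms:created "1889" ;          # true of the structure in Paris
    dcterms:format  "image/jpeg" .    # true only of the photograph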
6. Conclusion
The developers of Dublin Core intended it to be a simple vocabulary that could be broadly
applied to emerging Internet resources. The inclusion of cultural heritage material introduced
more complex kinds of relationships between online and offline resources or “originals” and
“reproductions.” Faced with this problem, the cultural heritage community proposed solutions
based on many years of practice using document surrogates in information retrieval systems.
However, users of traditional cataloging systems also struggled with defining best practices for
describing reproductions and multiple versions. Conflicting interpretations meant that document
surrogates could appear in two forms based on the object of description (i.e., facsimile/edition
theory approaches). Within the DCMI, these developments in descriptive cataloging encountered
new approaches to representing descriptions as “metadata.” While emerging technologies such
as XML enabled the creation of document-like data models, the development of DC was also
influenced by more formal modeling techniques, such as RDF, that required a one-to-one
relationship between entities and their descriptions. Because this requirement conflicted with the
cultural heritage community's recommendations for handling reproductions, it was necessary to
articulate it in DCMI documentation as the 1:1 Principle. However, these recommendations
failed to overcome the limitations of the cultural heritage community’s pragmatic understanding of
the relationship between descriptions and resources. While the limitations of systems for storing
and exchanging DC metadata are implicated in the prevalence of 1:1 Principle problems, there
also seemed to be little desire from within the community for more formal representation models,
such as RDF. However, it is important to recognize that RDF, in and of itself, is insufficient to
solve fundamental identity issues embodied by the 1:1 Principle. The more recent development
of complex bibliographic models, such as Functional Requirements for Bibliographic Records
(FRBR), and their implementation as Linked Data, suggest opportunities to reformulate how we
detect whether a description is about “one and only one resource.”
References
American Library Association, Australian Committee on Cataloguing, Canadian Committee on Cataloguing, British
Library, & Library of Congress. (1988). Anglo-American Cataloging Rules. (M. Gorman & P. W. Winkler, Eds.)
(2nd Edition revised.). Chicago: American Library Association.
Arms, C. R. (1999). Getting the picture: Observations from the library of congress on providing online access to
pictorial images. Library Trends, 48(2), 379–409.
Baca, M. (Ed.). (2002). Introduction to Art Image Access: Issues, Tools, Standards, Strategies. Los Angeles: Getty
Research Institute. Retrieved from http://www.getty.edu/research/conducting_research/standards/intro_aia/
Baker, T. (2000). A grammar of Dublin Core. D-Lib Magazine, 6(10). Retrieved from
http://www.dlib.org/dlib/october00/baker/10baker.html
Baker, T. (2012). Libraries, languages of description, and linked data: a Dublin Core perspective. Library Hi Tech,
30(1), 116–133.
Baker, T., & Johnston, P. (2011, May 13). Review of DCMI Abstract Model. Dublin Core Metadata Initiative.
Retrieved from http://wiki.dublincore.org/index.php/Review_of_DCMI_Abstract_Model
Bearman, D. (1999, January). A common model to support interoperable metadata: Progress report on reconciling
metadata requirements from the Dublin Core and INDECS/DOI Communities. D-Lib Magazine, 5(1). Retrieved
from http://www.dlib.org/dlib/january99/bearman/01bearman.html
Berners-Lee, T. (2002, July 27). What do URIs identify? W3C. Retrieved from
http://www.w3.org/DesignIssues/HTTP-URI.html
Burnard, L., Miller, E., Quin, L., & Sperberg-McQueen, C. M. (1996, April 1). A syntax for Dublin Core Metadata.
Dublin Core Metadata Initiative. Retrieved from http://dublincore.org/workshops/dc2/report-19960401.shtml
Cronin, C. (2008). Metadata provision and standards development at the Collaborative Digitization Program (CDP): A
history. First Monday, 13(5). Retrieved from
http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2085/1957
Dempsey, L., & Weibel, S. L. (1996). The Warwick Metadata Workshop: A framework for the deployment of resource
description. D-Lib Magazine. Retrieved from http://www.dlib.org/dlib/july96/07weibel.html
Dooley, J. M., & Zinham, H. (1990). The object as “subject”: Providing access to genres, forms of materials, and
physical characteristics. In P. Molholt & T. Petersen (Eds.), Beyond the Book: Extending MARC for Subject Access
(pp. 43–80). Boston, MA: G.K. Hall & Co.
Erway, R. (1996). Digital initiatives of the Research Libraries Group. D-Lib Magazine. Retrieved from
http://www.dlib.org/dlib/december96/rlg/12erway.html
Evans, L. J., & Will, M. O. (1988). MARC for Archival Visual Materials. Chicago: Chicago Historical Society.
Fink, E. (1999). The Getty Information Institute: A retrospective. D-Lib Magazine, 5(3). Retrieved from
http://www.dlib.org/dlib/march99/fink/03fink.html
Graham, C. (1992). Microform reproductions and multiple versions. The Serials Librarian, 22(1), 213–234.
doi:10.1300/J123v22n01_14
Halpin, H. (2004). The Semantic Web: The origins of artificial intelligence redux. Presented at the Third International
Workshop on the History and Philosophy of Logic, Mathematics, and Computation (HPLMC-04 2005), Donostia
San Sebastian, Spain. Retrieved from http://www.ibiblio.org/hhalpin/homepage/publications/airedux.pdf
Halpin, H. (2011). Sense and reference on the Web. Minds and Machines, 21(2), 153–178. doi:10.1007/s11023-011-9230-6
Halpin, H., Hayes, P. J., McCusker, J. P., McGuinness, D. L., & Thompson, H. S. (2010). When owl:sameAs isn’t the
same: An analysis of identity in Linked Data. In P. F. Patel-Schneider, Y. Pan, P. Hitzler, P. Mika, L. Zhang, J. Z.
Pan, … B. Glimm (Eds.), The Semantic Web – ISWC 2010 (Vol. 6496, pp. 305–320). Berlin, Heidelberg: Springer
Berlin Heidelberg. Retrieved from http://www.springerlink.com/content/v24433851k747864/
Han, M.-J., Cho, C., Cole, T., & Jackson, A. (2009). Metadata for special collections in CONTENTdm: How to
improve interoperability of unique fields through OAI-PMH. Journal of Library Metadata, 9(3), 213–238.
doi:10.1080/19386380903405124
Harpring, P., & Baca, M. (Eds.). (2009). Categories for the Description of Works of Art. J. Paul Getty Trust. Retrieved
from http://www.getty.edu/research/conducting_research/standards/cdwa/
Hayes, P. J. (2004). RDF Semantics. W3C. Retrieved from http://www.w3.org/TR/2004/REC-rdf-mt-20040210/
Hayes, P. J., & Halpin, H. (2008). In defense of ambiguity. International Journal on Semantic Web and Information
Systems, 4(2), 1–18.
Hillmann, D. (2003, August 26). Using Dublin Core. Dublin Core Metadata Initiative. Retrieved from
http://dublincore.org/documents/2003/08/26/usageguide/
Hutt, A., & Riley, J. (2005). Semantics and syntax of Dublin Core usage in Open Archives Initiative data providers of
cultural heritage materials. In Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries (p. 270).
Johnston, P. (2006, November 28). Why an abstract model for Dublin Core metadata? eFoundations. Retrieved from
http://efoundations.typepad.com/efoundations/2006/11/why_an_abstract.html
Jones, E. (1997). Multiple Versions Revisited. The Serials Librarian, 32(1), 177–198. doi:10.1300/J123v32n01_14
Knowlton, S. A. (2009). How the current draft of RDA addresses the cataloging of reproductions, facsimiles, and
microforms. Library Resources and Technical Services, 53(3), 159–165.
Lagoze, C. (1996). The Warwick Framework: A container architecture for diverse sets of metadata. D-Lib Magazine.
Retrieved from http://www.dlib.org/dlib/july96/lagoze/07lagoze.html
Lagoze, C. (1997). From static to dynamic surrogates: Resource discovery in the digital age. D-Lib Magazine.
Retrieved from http://www.dlib.org/dlib/june97/06lagoze.html
Lagoze, C. (2001a). Keeping Dublin Core simple: Cross-domain discovery or resource description? D-Lib Magazine,
7(1). Retrieved from http://dlib.anu.edu.au/dlib/january01/lagoze/01lagoze.html
Lagoze, C. (2001b, May 17). RE: RDF, OAI, and application within libraries.
Lagoze, C., & Van de Sompel, H. (2008, October 17). ORE Specification - Abstract Data Model. Open Archives
Initiative. Retrieved from http://www.openarchives.org/ore/1.0/datamodel#Foundations
Lagoze, C., Van de Sompel, H., Nelson, M., & Warner, S. (2002). Implementation guidelines for the Open Archives
Initiative Protocol for Metadata Harvesting. Open Archives Initiative. Retrieved from
http://www.openarchives.org/OAI/2.0/guidelines.htm
Lagoze, C., Van de Sompel, H., Nelson, M., & Warner, S. (2008). Open Archives Initiative Protocol for Metadata
Harvesting. (OAI Executive & OAI Technical Committee, Eds.). Open Archives Initiative. Retrieved from
http://www.openarchives.org/OAI/openarchivesprotocol.html
Lenat, D. B., & Guha, R. V. (1990). Building Large Knowledge-based Systems: Representation and Inference in the
Cyc Project. Reading, MA: Addison-Wesley Publishing Company.
Library of Congress. (2010). 1.11A Facsimiles, Photocopies, and Other Reproductions. Library of Congress. Retrieved
from http://www.loc.gov/cds/PDFdownloads/lcri/LCRI_2010-03.pdf
McRae, L., & White, L. S. (Eds.). (1998). ArtMARC Sourcebook: Cataloging Art, Architecture and their Visual
Images. Chicago, IL: American Library Association.
Miller, E. (1998). An introduction to the Resource Description Framework. D-Lib Magazine. Retrieved from
http://www.dlib.org/dlib/may98/miller/05miller.html
Miller, P., & Greenstein, D. (Eds.). (1997). Discovering Online Resources Across the Humanities: A Practical
Implementation of Dublin Core. London: UKOLN.
Miller, S. J. (2010). The One-to-One Principle: Challenges in Current Practice. International Conference on Dublin
Core and Metadata Applications. Retrieved from http://dcpapers.dublincore.org/ojs/pubs/article/view/1043/992.
Park, J. (2005). Semantic interoperability across digital image collections: a pilot study on metadata mapping. Lecture
Notes in Computer Science, 3237, 621–630.
Park, J. (2009). Metadata quality in digital repositories: A survey of the current state of the art. Cataloging &
Classification Quarterly, 47(3), 213–228. doi:10.1080/01639370902737240
Park, J., & Childress, E. (2009). Dublin Core metadata semantics: An analysis of the perspectives of information
professionals. Journal of Information Science, XX(X), 1–13. doi:10.1177/0165551509337871
Powell, A., & Johnston, P. (2002, January 31). Guidelines for implementing Dublin Core in XML. UKOLN. Retrieved
from http://www.ukoln.ac.uk/metadata/dcmi/dc-xml-guidelines/2002-01-31/#DCARCH
Powell, A., Nilsson, M., Naeve, A., Johnston, P., & Baker, T. (2005, March 7). DCMI Abstract Model. Dublin Core
Metadata Initiative. Retrieved from http://dublincore.org/documents/2005/03/07/abstract-model/
Powell, A., Nilsson, M., Naeve, A., Johnston, P., & Baker, T. (2007). DCMI Abstract Model. Dublin Core Metadata
Initiative. Retrieved from http://dublincore.org/documents/abstract-model/
Research Libraries Group. (1997a). Guidelines for extending the use of Dublin Core Elements. Retrieved October 1,
2010, from http://www.oclc.org/research/activities/past/rlg/dcmetadata/guidelines.htm
Research Libraries Group. (1997b). Metadata Summit summary. Retrieved October 1, 2010, from
http://www.oclc.org/research/activities/past/rlg/dcmetadata/summit.htm
Sauermann, L., & Cyganiak, R. (2008, November 3). Cool URIs for the Semantic Web. W3C. Retrieved from
http://www.w3.org/TR/cooluris/
Shreeves, S. L., Knutson, E. M., Stvilia, B., Palmer, C. L., Twidale, M. B., & Cole, T. W. (2005). Is “quality” metadata
“shareable” Metadata? The implications of local metadata practices for federated collections. In Currents and
convergence: navigating the rivers of change: proceedings of the Twelfth National Conference of the Association of
College and Research Libraries April 7-10, 2005, Minneapolis, Minnesota (p. 223).
Simonton, W. (1962). The bibliographic control of microforms. Library Resources & Technical Services, 6(1), 29–40.
Stvilia, B., & Gasser, L. (2008). Value-based metadata quality assessment. Library and Information Science Research,
30(1), 67–74.
Stvilia, B., Gasser, L., Twidale, M., Shreeves, S. L., & Cole, T. W. (2004). Metadata quality for federated collections.
In Proceedings of ICIQ04-9th International Conference on Information Quality (pp. 111–125).
Urban, R. J. (2012). Principle paradigms: Revisiting the Dublin Core 1:1 Principle (Dissertation). University of
Illinois at Urbana-Champaign, Urbana, IL. Retrieved from http://hdl.handle.net/2142/31109
Visual Resources Association, & Whiteside, A. (1999, December 1). The core categories for visual resources introduction.
Retrieved September 26, 2010, from
http://web.archive.org/web/20010306092716/www.gsd.harvard.edu/~staffaw3/vra/coreintro.htm
Weibel, S. L. (1995). Metadata: the foundations of resource description. D-Lib Magazine. Retrieved from
http://www.dlib.org/dlib/July95/07weibel.html
Weibel, S. L. (2010). Dublin Core Metadata Initiative (DCMI): A personal history. In Encyclopedia of Library and
Information Sciences.
Weibel, S. L., & Hakala, J. (1998, February). DC-5: The Helsinki Metadata Workshop; A Report on the workshop and
subsequent developments. D-Lib Magazine. Retrieved from http://www.dlib.org/dlib/february98/02weibel.html
Wendler, R. (1999, April 14). Re: 1:1 debate: What’s the goal? dc-one2one. Retrieved from
http://dublincore.org/groups/one2one/
Towards Description Set Profiles for RDF using SPARQL as
Intermediate Language
Thomas Bosch
GESIS – Leibniz Institute for the
Social Sciences,
Mannheim, Germany
[email protected]
Kai Eckert
Research Group Data and Web
Science
University of Mannheim, Germany
[email protected]
Abstract
Description Set Profiles (DSP) are used to formulate constraints on valid data within a Dublin
Core Application Profile. For RDF, SPARQL is generally seen as the method of choice to
validate data according to certain constraints, although it is not ideal for their formulation. In
contrast, DSPs are comparatively easy to understand, but lack an implementation to validate RDF
data. In this paper, we use SPIN as the basic validation framework and present a general approach
for how domain specific constraint languages like DSP can be executed on RDF data using SPARQL
as an intermediate language.
Keywords: RDF validation; RDF constraint formulation; RDF constraint validation; Description
Set Profiles; DSP; RDF; linked data; semantic web.
1. Introduction
In 2013, the W3C invited experts from industry, government and academia to the RDF
Validation Workshop1 to discuss use cases and requirements for constraint representation and
RDF data validation. The following needs are reported:
1. Declarative definition of the structure of a graph for validation and description.
2. Extensible to address specialized use cases.
3. A mechanism to associate descriptions with data.
An important finding is that there are non-functional requirements for data validation in a
Linked Data setting, particularly the need to “communicate the constraints against which data is
to be validated in a way which is both easy to understand by human beings and discoverable by
programs.”
Partly as follow-up to the W3C workshop and partly due to further expressed requirements at
the Semantic Web in Libraries conference 20132, the Dublin Core Metadata Initiative in
collaboration with the W3C is currently establishing a Task Group for RDF Application Profiles
(RDF-AP) that will investigate existing approaches and best-practices, identify possible gaps and
propose practical solutions for the representation of application profiles, including the
formulation of data constraints3. In a heterogeneous environment like the Web, there is not
necessarily a one-size-fits-all solution, especially as existing solutions should rather be integrated
than replaced, not least to avoid long and fruitless discussions about the “best” approach.
SPARQL and SPIN are powerful and widely used for constraint formulation and validation
(Fürber & Hepp, 2010), but constraints formulated as SPARQL queries are not as understandable
1 RDF Validation Workshop – Practical Assurances for Quality RDF Data. 10-11 September 2013,
Cambridge, MA, USA. http://www.w3.org/2012/12/rdf-val/report
2 SWIB13 – Semantic Web in Libraries, 25 - 27 November 2013, Hamburg, Germany.
http://swib.org/swib13/
3 http://wiki.dublincore.org/index.php/RDF-Application-Profiles
as one wishes them to be. Consider the following example of the simple constraint stating that
only dogs are allowed as pets:
SELECT ?this ?subOPE ?object WHERE {
?C owl:allValuesFrom :Dog .
?C owl:onProperty :hasPet .
?C a owl:Restriction .
?this rdf:type ?subC . ?subC rdfs:subClassOf* ?C .
?this ?subOPE ?object . ?subOPE rdfs:subPropertyOf* :hasPet .
FILTER NOT EXISTS { ?object rdf:type :Dog . } }
This query checks the constraint and returns violating triples, but the actual constraint could be
formulated more easily using Description Set Profiles4:
[ a dsp:NonLiteralStatementTemplate;
dsp:property :hasPet;
dsp:nonLiteralConstraint [
dsp:valueClass :Dog;
]
]
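For illustration, hypothetical instance data such as the following (the class and individual names are invented, and prefixes are as in the query above) would be reported as a violation by the SPARQL query, because the pet is not known to be a :Dog:
# Invented example data, not part of DSP or SPIN.
:DogOwner rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty :hasPet ;
    owl:allValuesFrom :Dog ] .
:alice a :DogOwner ;
    :hasPet :tweety .
:tweety a :Bird .    # not typed as :Dog, so ?this = :alice is returned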
Of course, it can be argued whether DSPs are the best possible way to represent constraints. They are,
however, familiar to the DCMI community and tailored to the Dublin Core Abstract Model and
the Singapore Framework. As stated above, there will probably be more than one constraint
language that can be used in an application profile, with DSPs being one of them. This leaves the
question of how the validation of data based on different constraint languages can be implemented.
Different implementations using different underlying technologies hamper the interoperability of
application profiles and a full implementation of several constraint languages is hard to maintain
for solution providers. We therefore propose to use SPARQL as an intermediate language:
constraints in arbitrary languages are transformed to executable SPARQL queries used to validate
the data.
This approach obviously requires that all constraint languages can be expressed in SPARQL.
We have no formal proof, as use cases and requirements are still being collected and there is neither a
complete list of possible constraints nor one of supported constraint languages. However, even if
there are constraints that cannot be translated to SPARQL, the subset of supported constraints is
certainly large enough to justify the limitation to SPARQL-expressible constraints at least for one
class of RDF Application Profiles, comparable to the sublanguages of OWL.
This claim is supported by the fact that SPARQL is already widely used for constraint
formulation, as mentioned above. Additionally, Sirin and Tao showed how constraints can be
translated to nonrecursive Datalog programs for validation (Sirin & Tao, 2009), while Angles and
Gutierrez explained that SPARQL has the same expressive power as nonrecursive Datalog
programs (Angles & Gutierrez, 2008).
In this paper, we present our first results regarding the implementation of our approach using
SPIN. We will show that besides SPIN, no further dependencies exist. We create a full validation
environment based on SPIN that can be used to validate domain specific constraint languages
(Section 2). The only limitations are that the constraints have to be expressed in RDF and that the
constraint language is expressible in SPARQL. In Section 3, we introduce Description Set
Profiles as domain specific constraint language and subsequently describe its implementation in
4 In RDF-Turtle syntax, omitting the surrounding description template; for details refer to
http://dublincore.org/documents/dc-dsp
our environment (Section 4). We conclude in Section 5 with a discussion of open questions and
an outlook to the next steps.
2. Validation Environment
We use the SPARQL Inferencing Notation (SPIN)5 to create what we call a validation
environment. The overall idea is that we see constraint languages as domain specific languages
(hence domain specific constraint languages, DSCL) that are translated and executed on RDF
data within our validation environment.
The translation is done once, for instance by the developer of the DSCL, and provided in the form
of a SPIN mapping plus optional preprocessing instructions. From a user’s perspective, all that
is needed is a representation of constraints using the DSCL and some data to be validated
against these constraints. All these resources are purely declarative and provided in RDF or as
SPARQL queries. The actual implementation is trivial using SPIN and illustrated in Figure 1.
FIG. 1. Constraint Validation Process
First, an RDF graph has to be populated as follows:
1. the data is loaded that is to be validated,
2. the constraints in the DSCL are loaded,
3. the SPIN mapping is loaded that contains the SPARQL representation of the DSCL (see
Section 4 for a detailed explanation), and
4. the preprocessing is performed, which can for example be provided in the form of
CONSTRUCT queries.
When the graph is ready, the SPIN engine checks for each resource in the RDF data if the
resource satisfies all defined constraints and generates a result RDF graph containing information
about all constraint violations.
With this implementation, there is one obvious limitation of our approach: the DSCL needs an
RDF serialization. For DSP, this is the case, but in the future, we would like to support non-RDF
languages as well. We will further elaborate on this interesting topic in Section 5.
Connect SPIN to your data. SPIN uses templates for SPARQL queries that are executed on
every instance of a given class – for instance :ToValidate.
Most of the SPIN mapping that has to be created by the DSCL developer consists of such
templates that are linked to a class for which the constraints should be evaluated:
5 http://spinrdf.org/
:ToValidate
spin:constraint
[ a dsp2spin:StatementTemplates_MinimumOccurrenceConstraint ] .
As the mapping is designed to be independent of any actual data, the class :ToValidate is
purely generic. Instead of using such a generic class, it is also possible to link the constraints to
owl:Thing or rdfs:Resource, i.e., to all instances.
Neither of these classes has to be assigned explicitly to instances within the data to be
validated. They are either inferred using reasoning or explicitly assigned during the
preprocessing: a reasonable approach would be to assign :ToValidate to all instances of classes for
which constraints are actually defined – in the case of DSP, classes that are linked via
dsp:resourceClass to a description template; this can be accomplished by a suitable
CONSTRUCT query that is executed before the actual validation.
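One possible form of such a preprocessing query is sketched below (prefixes as in the other listings); it is an illustration, not a normative part of DSP or SPIN:
# Tag every instance of a class that some description template constrains
# (via dsp:resourceClass) with the generic class :ToValidate.
CONSTRUCT { ?instance rdf:type :ToValidate . }
WHERE {
  ?descriptionTemplate rdf:type dsp:DescriptionTemplate ;
                       dsp:resourceClass ?resourceClass .
  ?instance rdf:type ?resourceClass .
}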
After preprocessing, the data might look like this, with the added generic class :ToValidate:
:ArtificialIntelligence
a swrc:Book, :ToValidate ;
dcterms:subject :ComputerScience .
This snippet denotes a book with the assigned subject “Computer Science.”
Mapping from a DSCL to SPARQL. The actual mapping is performed by creating appropriate
SPARQL templates for every constraint that is supported in the DSCL, for example a minimum
occurrence that is required:
dsp2spin:StatementTemplates_MinimumOccurrenceConstraint
a spin:Template ;
spin:body [
a sp:Construct ;
sp:templates (...) ;
sp:where (...) ] .
This is the general structure of a SPIN template representing a SPARQL CONSTRUCT query.
We use CONSTRUCT queries to generate descriptions of each constraint violation, for instance:
CONSTRUCT {
_:violation
a spin:ConstraintViolation ;
rdfs:label ?violationMessage ;
spin:violationRoot ?violationRoot ;
spin:violationPath ?violationPath ;
spin:violationSource ?violationSource ;
spin:fix ?violationFix ;
:severityLevel ?severityLevel }
In SPIN, such a CONSTRUCT query is represented in RDF as follows:
a sp:Construct ;
sp:templates (
[ sp:subject _:violation ; sp:predicate rdf:type ; sp:object spin:ConstraintViolation ]
[ sp:subject _:violation ; sp:predicate rdfs:label ; sp:object [ sp:varName "violationMessage" ] ]
... ) ;
Constraint violation triples (1) provide useful messages explaining the reasons why RDF
instances did not satisfy the constraints (rdfs:label), (2) contain references to RDF triples
causing the constraint violations (spin:violationRoot), and (3) include references to the
constraints causing constraint violations (spin:violationSource). Constraint violation triples
also give some guidance on how to make the data valid (spin:fix) in order to fix constraint
violations. Constraint violations can be classified according to different levels of severity
(:severityLevel).
These constraint violation triples are generated for each RDF instance which matches against
the WHERE clause graph pattern in the SPIN template. The SPARQL variable ?this represents the
current RDF resource for which the constraint is checked.
As the mapping of a DSCL is independent of a concrete constraint specification, all constraints
are generally linked to all instances (of the generic class, if applicable). Therefore, the WHERE
clause of the template always has to restrict on a class for which the constraint was actually
defined, for example in the case of DSP via the resource class:
WHERE { ?this rdf:type ?resourceClass . }
As for the CONSTRUCT part of the query, SPIN represents the WHERE clause in RDF as
well:
[ sp:subject [ sp:varName "this" ] ;
sp:predicate rdf:type ; sp:object [ sp:varName "resourceClass" ] ]
With this framework, we have all we need to implement our own DSCL, Description Set
Profiles, which we will briefly introduce in the next section. Full examples for SPIN mappings
are provided afterwards in Section 4.
3. DSP as Domain Specific Constraint Language
The Singapore Framework6 is a framework for designing metadata and for defining Dublin
Core Application Profiles (DCAP). The framework comprises descriptive components that are
necessary or useful for documenting DCAPs. A DCAP is a means to assemble and to customize
components from different independently created metadata standards within the context of a
specific community, application, and domain7.
The DCMI Abstract Model (DCAM)8 with its Description Set Model (DSM) forms the basis of
Dublin Core metadata. While the DSM is highly related to RDF, it differs in some aspects worth
mentioning. Table 1 shows the mappings from DSM elements to RDF triples, according to DC-RDF, the recommendation for how Dublin Core metadata is represented in RDF9.
TABLE 1: DSM in RDF (DSM element → RDF representation)
Description Set → RDF graph (containing description RDF graphs)
Description → RDF graph
Resource → RDF subject: DSM resource URI (or blank node) (root of description RDF graph)
Statement → RDF subject: DSM resource; RDF predicate: RDF property; RDF object: DSM value (surrogate)
Non-Literal Value Surrogate → DSM value URI (or blank node)
Vocabulary Encoding Scheme → RDF subject: DSM value; RDF predicate: dcam:memberOf; RDF object: DSM vocabulary encoding scheme
Value String → RDF subject: DSM value; RDF predicate: rdf:value; RDF object: RDF Literal (DSM value string) (RDF plain literal or RDF typed literal)
Literal Value Surrogate → DSM value is RDF literal (RDF plain literal or RDF typed literal)
Value String Language → Language tag of RDF literal
Syntax Encoding Scheme → RDF datatype of RDF typed literal
6 http://dublincore.org/documents/singapore-framework/
7 cf. http://dublincore.org/documents/profile-guidelines/
8 http://dublincore.org/documents/2007/06/04/abstract-model/
9 http://dublincore.org/documents/dc-rdf/
A Description Set Profile (DSP)10 contains constraints on the data within a DCAP, i.e., a DSP
restricts valid descriptions of resources in a description set. Consider the following example of a
DSP:
:bookDescriptionTemplate
a dsp:DescriptionTemplate ;
dsp:standalone "true"^^xsd:boolean ;
dsp:minOccur "1"^^xsd:nonNegativeInteger ; dsp:maxOccur "infinity"^^xsd:nonNegativeInteger ;
dsp:resourceClass swrc:Book ;
dsp:statementTemplate [
a dsp:NonLiteralStatementTemplate ;
dsp:minOccur "1"^^xsd:nonNegativeInteger ; dsp:maxOccur "5"^^xsd:nonNegativeInteger ;
dsp:property dcterms:subject ;
dsp:nonLiteralConstraint [
a dsp:NonLiteralConstraint ;
dsp:descriptionTemplate :subjectDescriptionTemplate ;
dsp:valueClass skos:Concept ;
dsp:valueURIOccurrence "mandatory"^^dsp:occurrence ;
dsp:valueURI :ComputerScience, :SocialScience, :Librarianship ;
dsp:vocabularyEncodingSchemeOccurrence "mandatory"^^dsp:occurrence ;
dsp:vocabularyEncodingScheme :BookSubjects ;
dsp:valueStringConstraint [
a dsp:ValueStringConstraint ;
dsp:minOccur "1"^^xsd:nonNegativeInteger ; dsp:maxOccur "1"^^xsd:nonNegativeInteger ;
dsp:literal "Computer Science"@en , "Computer Science"^^xsd:string ;
dsp:literal "Social Science"@en , "Social Science"^^xsd:string ;
dsp:literal "Librarianship"en , "Librarianship"^^xsd:string ;
dsp:languageOccurrence "optional"^^dsp:occurrence ;
dsp:language "en"^^xsd:language ;
dsp:syntaxEncodingSchemeOccurrence "optional"^^dsp:occurrence ;
dsp:syntaxEncodingScheme xsd:string ] ] ] .
A DSP consists of dsp:DescriptionTemplates that put constraints on instances of a certain
class, denoted by dsp:resourceClass. The constraints can apply to the description itself, e.g., a
minimum occurrence of instances of this class; additionally, constraints on single properties can
be defined within a dsp:StatementTemplate. The example above
contains all but one of the 23 constraints defined in DSP (except the sub-property constraint; the
5 literal value constraints can be used for value strings as well).
10 http://dublincore.org/documents/2008/03/31/dc-dsp/
The DSM description template :bookDescriptionTemplate describes DSM resources of the
type swrc:Book (referenced by dsp:resourceClass). swrc:Book resources are allowed to occur
standalone (dsp:standalone), i.e. without being the value of a property. Books must occur at
least once (dsp:minOccur) and may appear multiple times (dsp:maxOccur) in the DSM
description set (the RDF graph). The dsp:NonLiteralStatementTemplate restricts books to
have at least 1 (dsp:minOccur) and at most 5 (dsp:maxOccur) dcterms:subject
(dsp:property) relationships to DSM non-literal value surrogates which are further described by
the dsp:NonLiteralConstraint.
The DSM values have to be of the class skos:Concept (dsp:valueClass) and are further
described in a dedicated DSM description template (referenced by dsp:descriptionTemplate).
A value URI must be given (dsp:valueURIOccurrence) for DSM values and allowed value URIs
(dsp:valueURI) are :ComputerScience, :SocialScience, and :Librarianship. Controlled
vocabularies (like :BookSubjects) are represented as skos:ConceptSchemes in RDF and as
dsp:VocabularyEncodingSchemes in DSM. If DSM vocabulary encoding schemes must be
stated (dsp:vocabularyEncodingSchemeOccurrence), they have to contain the DSM values. In
this case, DSM values are classified as skos:Concepts and are related to skos:ConceptSchemes
via the object properties skos:inScheme and dcam:memberOf (see the RDF data below).
The DSM values must be represented as exactly one (dsp:minOccur and dsp:maxOccur in the
dsp:ValueStringConstraint) of the given three DSM value strings (dsp:literal). The language tag en (dsp:language) as
well as the RDF datatype xsd:string (dsp:syntaxEncodingScheme) may be stated
(dsp:languageOccurrence and dsp:syntaxEncodingSchemeOccurrence) for DSM value
strings.
An example for RDF data satisfying all these constraints for resources of the type swrc:Book
would be:
:ArtificialIntelligence
a swrc:Book , :ToValidate ;
dcterms:subject :ComputerScience .
:ComputerScience
a skos:Concept , :ToValidate ;
dcam:memberOf :BookSubjects ;
skos:inScheme :BookSubjects ;
rdf:value "Computer Science"@en .
:BookSubjects
a skos:ConceptScheme , :ToValidate .
4. Mapping of DSP Constraints to SPIN
After the introduction of the general approach in Section 2, we now present a concrete example
of a SPIN mapping for a DSP constraint: the DSP statement template constraint ’Minimum
Occurrence Constraint’ (6.1) restricts the minimum number of times the given statement must
appear in the enclosing description.
This constraint is implemented by the following SPARQL query which is then represented in
SPIN RDF and linked to our generic class :ToValidate:
CONSTRUCT {
_:violation
a spin:ConstraintViolation ;
rdfs:label ?violationMessage ;
spin:violationRoot ?this ;
spin:violationSource dsp:minOccur }
WHERE {
?this rdf:type ?resourceClass .
?descriptionTemplate rdf:type dsp:DescriptionTemplate .
?descriptionTemplate dsp:resourceClass ?resourceClass .
?descriptionTemplate dsp:statementTemplate ?statementTemplate .
?statementTemplate dsp:minOccur ?minOccurStatement .
?statementTemplate dsp:property ?property .
BIND ( ( spl:objectCount ( ?this, ?property ) ) AS ?cardinalityStatement ) .
FILTER ( ?cardinalityStatement < ?minOccurStatement ) .
BIND ( (
fn:concat('cardinality of ', ?property, ' ( ', ?cardinalityStatement, ' )
< minimum cardinality of ', ?property, ' ( ', ?minOccurStatement, ' )' ) )
AS ?violationMessage ) . }
It can be seen that the WHERE clause is used to “detect” constraint violations. First, a graph is
matched that contains the instance data (using ?this as instance variable) and the applicable
constraint formulation from the DSP (linked to the instance via dsp:resourceClass). The
cardinality of the property in question is added. The actual validation is implemented by the
FILTER that identifies only instances that violate the constraint.
In this example, we create a violation message (?violationMessage) that can be displayed to
the user, together with the URI of the instance (?this as spin:violationRoot) and the violated
constraint (dsp:minOccur as spin:violationSource).
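For reference, such a CONSTRUCT query can be attached to the generic class :ToValidate in SPIN roughly as follows. This is only a sketch: the query body is abbreviated, the sp:text shortcut is used here instead of the full SPIN RDF representation mentioned above, and the base namespace is an assumption.
@prefix spin: <http://spinrdf.org/spin#> .
@prefix sp:   <http://spinrdf.org/sp#> .
@prefix :     <http://example.org/validation#> .  # assumed base namespace for :ToValidate
:ToValidate
  spin:constraint [
    a sp:Construct ;
    # the full CONSTRUCT query shown above would go here
    sp:text """CONSTRUCT { ... } WHERE { ... }"""
  ] .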
According to our DSP, if a resource in the RDF data
1. is assigned to the class swrc:Book (the dsp:resourceClass of the description template), and
2. has no dcterms:subject relationships (violating the dsp:minOccur of the statement template),
then the following constraint violation triple is generated:
_:violation
a spin:ConstraintViolation ;
rdfs:label
'cardinality of dcterms:subject ( 0 ) < minimum cardinality of dcterms:subject ( 1 )' ;
spin:violationRoot :IntroductionToAlgorithms ;
spin:violationSource dsp:minOccur .
This example demonstrates how a DSP constraint is implemented in SPARQL. In the same
manner, most other constraints can be implemented as well, although often the mapping gets
substantially longer and more complex. There are, however, constraints that cannot be
implemented at all, in the case of DSP for example the literal value constraint Syntax Encoding
Scheme Constraint (6.5.4). It determines whether DSP syntax encoding schemes (RDF datatypes)
are allowed for RDF literals; the occurrence can be 'mandatory', 'optional', or 'disallowed'.
This type of constraint cannot be validated as RDF literals always have associated datatype
IRIs. If there is no datatype IRI and no language tag explicitly stated, the datatype of an RDF
literal is implicitly xsd:string. If there is a language tag, the datatype is implicitly
rdf:langString. Fortunately this constraint can be replaced by an equivalent constraint using
Syntax Encoding Scheme List Constraint (6.5.5) which restricts the allowed DSP syntax encoding
schemes and which is fully implemented in the SPIN mapping for DSP.
5. Conclusion and Future Work
With our approach, we were able to fully implement Description Set Profiles, apart from the
exception noted above. The implementation can be tested at http://purl.org/net/rdfvaldemo. In this paper, we described our general approach and demonstrated its applicability to
Description Set Profiles. In particular, we use SPIN as a basis to define a validation environment in
which domain specific constraint languages – like DSP – can be implemented by representing
them in SPARQL. The approach is particularly appealing as its only dependency is SPIN. The
implementation of the DSCL is fully declarative, consisting of a SPIN mapping in
RDF and preprocessing instructions in form of SPARQL CONSTRUCT queries – which can also
be represented in RDF using SPIN. It is therefore possible to link the applicable constraints in a
given DSCL to an application profile, as well as the SPIN mapping and the preprocessing
instructions. This is all that is needed to validate data according to this application profile,
without the need for a DSCL-specific validator. Our approach therefore fulfills an important requirement for
RDF Application Profiles.
A limitation of our approach is posed by constraints that cannot be expressed in SPARQL, such
as the Syntax Encoding Scheme Constraint of DSP. In this case the limitation is an artefact of the
way RDF is specified. There are most certainly other cases, but we argue that our approach is
nevertheless useful for the majority of constraints in the majority of DSCLs. We propose,
however, to document such missing constraints clearly as part of the DSCL so that users
can deal with them.
Our approach is currently limited to DSCLs that are expressible in RDF. This is not necessarily
a problem – the data and the data models are in RDF, so at least it is consistent – but it might be
sub-optimal regarding readability and understandability of the constraints and for now excludes
many existing DSCLs. We therefore plan to investigate this issue further as part of our future
work. Another interesting topic is the testing of the SPIN mappings, for which test data together
with expected outcomes could be provided in a certain form. Our next steps include the
application to further constraint languages, first and foremost OWL2, which is already used by
many to formulate constraints. The DSP mapping is developed and maintained at
https://github.com/dcmi/DSP-SPIN-Mapping.
Acknowledgements
Kai Eckert is funded by the European Commission within the DM2E project (http://dm2e.eu).
Describing Theses and Dissertations Using Schema.org
Jeff Mixter
OCLC, USA
[email protected]
Patrick OBrien
Montana State University, USA
[email protected]
Kenning Arlitsch
Montana State University, USA
[email protected]
Abstract
This report discusses the development of an extension vocabulary for describing theses and
dissertations, using Schema.org as a foundation. Instance data from the Montana State University
ScholarWorks institutional repository was used to help drive and test the creation of the extension
vocabulary. Once the vocabulary was developed, we used it to convert the entire ScholarWorks
data sample into RDF. We then serialized a set of three RDF descriptions as RDFa and posted
them online to gather statistics from Google Webmaster Tools. The study successfully
demonstrated how a data model consisting of primarily Schema.org terms and supplemented with
a list of granular, domain-specific terms can be used to describe theses and dissertations in detail.
Keywords: Schema.org; RDF; linked data; institutional repositories; semantic web; search
engine optimization; data modeling.
1. Introduction
As academic institutions realize the value of their intellectual output, well-organized and
discoverable institutional repositories are increasingly viewed as strategic assets. The intellectual
output of an academic institution is diverse and ranges from student theses and dissertations to
conference proceedings, presentations, books, journal articles, and the datasets that support
research conclusions. It is crucial for purposes of discovery to publish the metadata in a format
that is easily understood, consumed and indexed by search engines and other machine-based data
aggregators.
This project builds on research whose initial aim was to improve visibility of digitized special
collections in commercial search engines, and was partially funded by the Institute of Museum
and Library Services (IMLS). The first phases of research were successful in developing search
engine optimization (SEO) strategies and methods, and led to the publication of a book (Arlitsch
& OBrien, 2013). Beyond digitized special collections the research also revealed that
institutional repositories (IRs) pose unique and complex problems to scholarly search engines,
and as a result many IRs were not being consistently harvested and indexed. The project
described in this report examines a specific subset of IR content, theses and dissertations. The
scope of the project was to create a set of extension terms for Schema.org1 that can be used to
describe theses and dissertations and to create a process model that explains how we converted
the existing Montana State University Dublin Core metadata into Linked Data. Following this
proof of concept, we plan to explore how to integrate the new vocabulary into existing IRs so that
they can provide search engines with more meaning and context, ultimately resulting in more
accurate search results for users.
1.1. Data Sample
We used the Montana State University ScholarWorks IR dataset to drive and validate the
modeling process that expanded and implemented the Schema.org vocabulary. This approach
provided the group with a multitude of rich modeling examples and use cases but it also helped
1 http://schema.org
keep the process of modeling firmly grounded in the requirements presented by the data. The
ScholarWorks dataset that was used for the study was a collection of student theses and
dissertations. There were 1909 records in the sample, which had originally been described using
Dublin Core (DC) and, where necessary, additional DC extensions for granular details. It should
be noted that prior to use in this study, the ScholarWorks metadata was cleaned up to ensure that
all of the fields were populated with information, where appropriate, and that the fields were used
according to their proper definitions. This prior work mitigated the need to perform an initial
review and cleanup in order to use the data, but IR managers who plan to implement structured
metadata should be aware that this cleanup is a crucial first step.
2. Extension Vocabulary Development
In our initial review of the dataset, we tried to use existing vocabularies to describe theses and
dissertations. It became evident when reviewing the sample data extracted from ScholarWorks
that existing vocabularies alone were not robust enough to fully describe the items. Application
Profiles were an attempt by the larger metadata community to develop a set of vocabulary terms
that can be used within a specific context to describe unique items. The idea was that a metadata
schema could be developed from a variety of existing schemas, modified if needed and then used
to describe a unique set of items within the context of a specific application or domain (Heery &
Patel, 2000). Sir Tim Berners-Lee referred to this same type of modeling as “cherry-picking” at
the Gov 2.0 Expo in 2010, suggesting that nearly all of the vocabulary terms that one would need
to describe an item already exist (Berners-Lee, 2010). The work around application profiles was
recently restarted within the context of developing RDF application profiles. A DCMI Task
Group has begun to investigate how RDF application profiles could be created and used to help
with data validation.2 An early example of picking and choosing RDF terms from a variety of
vocabularies can be found in the British Data Model (Hodson, Deliot, Danskin, Rosie & Ashton,
2012). In this model, terms are taken from fifteen different vocabularies and combined to form a
comprehensive model for describing bibliographic items.
We used the same approach to develop the extension vocabulary for the theses and
dissertations sample set. Below is a table showing the vocabularies that we used.
TABLE 1: Vocabularies used in the project
Prefix     Namespace
schema     http://schema.org
dcterms    http://purl.org/dc/terms/
pto        http://www.productontology.org/id/
rdfs       http://www.w3.org/2000/01/rdf-schema#
mont       http://purl.org/montana-state/library/
In addition to Table 1, we created and published a VoID dataset description.3 It includes
information about the sample datasets, including dataset statistics. The extension vocabulary that
we developed was not designed to be prescriptive. Rather, it was meant to be used with the entire
Schema.org vocabulary. In this sense, our extension vocabulary provides a descriptive way for
rationalizing existing descriptions of theses and dissertations as Linked Data without adding any
constraints or validation requirements. As Linked Data graphs continue to grow in size, validation
will obviously become an important topic and requirement for systems/services. Over the next
few years, it will be interesting to observe the path that the RDF Application Profile Task Group
2 http://wiki.dublincore.org/index.php/RDF_Application_Profiles
3 http://purl.org/montana-state/scholarworks/sampledataset
takes in dealing with validation requirements. The full list of extension classes and properties is
available online.4
2.1. Classes
The new classes we developed for the extension vocabulary were divided into two unique
categories. The first category included class extensions that were used to add a more granular
description of the item being described. The labels for these classes were derived from the
‘Appendix III – Types’ controlled vocabularies used in the Citation Style Language.5 Table 2 lists
the first category of classes.
TABLE 2: Citation Style Language terms
Extension Classes derived from Citation Style Language terms
mont:JournalArticle
mont:MagazineArticle
mont:NewspaperArticle
mont:Bill
mont:Chapter
mont:ConferencePaper
mont:Entry
mont:Figure
mont:Graphic
mont:Interview
mont:LegalCase
mont:Legislation
mont:Manuscript
mont:MusicalScore
mont:Pamphlet
mont:Patent
mont:PersonalCommunication
mont:Report
mont:Speech
mont:Thesis
mont:Treaty
The second category of classes that was developed for the extension vocabulary included terms
that were not covered by existing popular vocabularies but were required for the description of
theses and dissertations. Table 3 lists the second category of classes that were created for the
extension vocabulary.
TABLE 3: Extension Classes not covered by other vocabularies
Extension Class
mont:AcademicDepartment
mont:Collection
mont:School
mont:Concept
mont:DigitalCollection
mont:DoctoralThesis
mont:EtdCommittee
mont:InstitutionalRepository
mont:MasterThesis
mont:ScholarlyWork
mont:SpecialCollection
4 http://purl.org/montana-state/library
5 http://citationstyles.org/downloads/specification.html#appendix-iii-types
A diagram of the classes and relationships used in the project can be found in Appendix I.
2.2. Properties
Although Schema.org has a wide variety of properties, the ScholarWorks instance data helped
us identify use cases that required more granular terms to properly describe the item. We were
able to create relationships between entities that were otherwise mashed together in the Dublin
Core records. Figure 1 illustrates how we were able to identify individual people and committees
and also define how they were related to each other.
FIG 1: Relationships derived from DC records
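As a rough illustration of the relationships in FIG 1, a thesis, its author, and its committee might be expressed in Turtle as follows. The resource names and the committee composition are hypothetical; only the vocabulary terms come from Schema.org and the extension vocabulary.
@prefix schema: <http://schema.org/> .
@prefix mont:   <http://purl.org/montana-state/library/> .
@prefix ex:     <http://example.org/> .
ex:SampleThesis a mont:Thesis ;
  schema:author   ex:StudentA ;
  mont:reviewedBy ex:EtdCommittee1 .
ex:EtdCommittee1 a mont:EtdCommittee ;
  mont:committeeChair  ex:ProfessorB ;
  mont:committeeMember ex:ProfessorC .
ex:StudentA   a schema:Person .
ex:ProfessorB a schema:Person .
ex:ProfessorC a schema:Person .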
Table 4 contains all of the properties that were created for the extension vocabulary, as well as the
type of Web Ontology Language (OWL) property each should be interpreted as.
TABLE 4: List of Properties and OWL equivalencies
Extension vocabulary property      Object or Data property
mont:associatedDepartment          Object
mont:associatedSchool              Object
mont:adviser                       Object
mont:campus                        Object
mont:committeeChair                Object
mont:committeeMember               Object
mont:curates                       Object
mont:facultyMember                 Object
mont:hadDepartment                 Object
mont:hasEtdCommittee               Object
mont:hasLibrary                    Object
mont:reviewedBy                    Object
mont:callNumber                    Data
mont:degreeGrantedForCompletion    Data
mont:degreeGranted                 Data
mont:firstPage                     Data
mont:lastPage                      Data
3. Testing And Implementing The Model
After the model was developed, the entire ScholarWorks dataset was converted into Linked
Data using a modified version of OpenRefine6 called LODRefine.7 Once the data were imported
into LODRefine, a variety of data cleanup tasks were conducted and finally the Schema.org and
extension vocabulary were imported and used to generate Linked Data. The first major cleanup
task was to separate cells that contained multiple values into individual cells. After completing
the cleanup we attempted to reconcile named entities to existing Linked Data datasets. We
queried several datasets, including LCSH, VIAF and DBpedia. The most successful matching
came from values that were included in the ‘subjects’, ‘subjects.lcsh’ and ‘coverage.spatial’
fields. The ‘subject.lcsh’ terms had a particularly high match rate (78% match to LCSH URIs)
while the other fields matched at a lower rate (40% matched to DBpedia.org). The one problem
with querying LCSH terms was that there were many pre-coordinated headings. Since the LCSH
Linked Data dataset only includes terms that are part of the LCSH Authority files, there were
quite a few terms that did not match up correctly. A solution to this problem would be to coin
local URIs for the pre-coordinated headings and then include dc:hasPart or rdfs:seeAlso
properties pointing to the individual LCSH URIs that are referenced in the compound
heading.
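For example, a locally coined URI for a pre-coordinated heading could be linked to its component LCSH concepts roughly as follows. The heading and the LCSH identifiers are placeholders, not actual records.
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix mont:    <http://purl.org/montana-state/library/> .
<http://scholarworks.montana.edu/doc/entities.html#concept/Small_towns--Montana>
  a mont:Concept ;
  rdfs:label "Small towns--Montana" ;
  # or rdfs:seeAlso, as suggested above; the LCSH URIs below are placeholders
  dcterms:hasPart <http://id.loc.gov/authorities/subjects/sh-placeholder-1> ,
                  <http://id.loc.gov/authorities/subjects/sh-placeholder-2> .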
For the named entities that did not reconcile to the aforementioned datasets, local URIs were
coined. These URIs followed a set pattern and then used the string value of the field as the
identifier token. Figure 2 is an example of one of the URIs that was created when we could not
match it to an existing Linked Data dataset.
http://scholarworks.montana.edu/doc/entities.html#person/Angie_Keesee
FIG 2: Sample URI coined for string value
More information about how to clean up dirty data and generate Linked Data using OpenRefine
can be found in (Verborgh & De Wilde, 2013). In order to publish the Linked Data in a web-friendly serialization and to begin to test how much structured data search engines can mine, we
converted three of the descriptions into RDFa and published them on ScholarWorks.8 For all of
the entities that did not have existing metadata records, such as people, places, organizations, etc.,
a single HTML page was generated that has a list of entity descriptions. The page is anchored
with the URI tokens that appear after the #, so if one of these ‘extra entity’ URIs is resolved in the
browser it will position the user in the appropriate portion of the page. The list can be found at
Montana Scholar Works.9
3.1. Instance Data Example
In order to give a better understanding of the results of the modeling, this section walks
through one of the sample records that was converted into Linked Data. The full RDFa
description of this record is available online.10 Figure 3 on the following page provides a graphic
representation of the terms used to describe the item. The sample pictured in Figure 3 is also
expressed in Turtle in Appendix II. The diagram does not list all of the properties and classes that
can/should be used to describe theses and dissertations. A complete list of all of the terms used in
the sample collection can be found in Appendix III.
6 http://openrefine.org/
7 http://code.zemanta.com/sparkica/
8 http://scholarworks.montana.edu/doc/index.html
9 http://scholarworks.montana.edu/doc/entities.html
10 http://scholarworks.montana.edu/doc/SampleWork1.html
FIG 3: Graphical representation of sample item
4. Conclusion
We were able to successfully map the theses and dissertations metadata into Schema.org and,
when needed, supplement existing Dublin Core fields with terms we created as part of an
extension vocabulary for Schema.org. The extension terms followed the same standards and
practices as those in Schema.org, and every attempt was made to position extension terms as sub-classes or sub-properties of existing Schema.org terms. The project has thus far successfully
developed an extension vocabulary to describe theses and dissertations and shown how to apply
the vocabulary to existing metadata. Since modeling is an iterative process, the next step in the
project will be to apply the vocabulary to more sets of theses and dissertations and make
additions/changes. We also plan to publish more RDFa and begin to track the amount of
structured data that is harvested by search engines using tools such as Google Webmaster Tools.11
Acknowledgements
This work could not have been completed without the help and assistance of Dr. Jean Godby and
Jeff Young of OCLC Research.
References
Arlitsch, K., & OBrien, P. S. (2013). Improving the visibility and use of digital repositories through SEO. Chicago:
ALA TechSource, an imprint of the American Library Association. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&db=nlabk&AN=578551
Berners-Lee, T. (2010). Open, Linked Data for a Global Community. Washington D.C. Retrieved from
https://www.youtube.com/watch?v=ga1aSJXCFe0&feature=player_embedded
Heery, R., & Patel, M. (2000). Application profiles: mixing and matching metadata schemas. Ariadne, 25, 27-31.
Hodson, T., Deliot, C., Danskin, A., Rosie, H., & Ashton, J. (2012). British Data Model – Book. Retrieved from
http://www.bl.uk/bibliographic/pdfs/bldatamodelbook.pdf
Verborgh, R., & De Wilde, M. (2013). Using OpenRefine: The essential OpenRefine guide that takes you from data
analysis and error fixing to linking your dataset to the Web. Packt Publishing Ltd.
11 https://www.google.com/webmasters/tools/home?hl=en
Appendix I: Visual graph of the vocabulary terms used
Appendix II: Sample data serialized as Turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ns1: <http://purl.org/montana-state/library/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://scholarworks.montana.edu/xmlui/handle/1/861> a schema:CreativeWork,
schema:MediaObject, ns1:Thesis,
<http://www.productontology.org/id/Portable_Document_Format> ;
dcterms:isPartOf <http://scholarworks.montana.edu/doc/entities.html#Collections/1/733> ;
dcterms:rights <http://scholarworks.montana.edu/doc/entities.html#InstitutionalRepository/CopyrightStatements/1> ;
ns1:associatedDepartment <http://scholarworks.montana.edu/doc/entities.html#college/5> ;
ns1:associatedSchool <http://scholarworks.montana.edu/doc/entities.html#college/5> ;
ns1:degreeGrantedForCompletion "M Arch" ;
ns1:firstPage "1" ;
ns1:lastPage "106" ;
ns1:reviewedBy <http://scholarworks.montana.edu/doc/entities.html#EtdCommittee/3593> ;
schema:about <http://dbpedia.org/resource/Four_Corners>,
<http://dbpedia.org/resource/United_States_Of_America>,
<http://id.loc.gov/authorities/sh2008110701#concept>,
<http://id.loc.gov/authorities/sh85026282#concept>,
<http://id.loc.gov/authorities/sh85140507#concept> ;
schema:author <http://scholarworks.montana.edu/doc/entities.html#person/Bailey_Clint_Brantley> ;
schema:dateCreated "2010" ;
schema:description "The American Small Town will forever have a place in the undertones of American culture and
in the American psychy. The small town has become an identifing piece of the fabric that the overall American Society
as a whole uses to project its own image, not only to the world but to its self. This study is an examination of key
elements of the American Small town and an exploration into why these places are disappearing. The study goes on to
utilize this information to derive a plan for a small town that is free of modern day plights, such as sprawl and
redundency. In the end, it proposes a plan for the community of Four Corners, M.T. This case study re-design is an
example of how small communities can be shaped early on to prevent waste, maximize efficiency and quality of life." ;
schema:encodesCreativeWork <http://scholarworks.montana.edu/doc/entities.html#physicalItem/3593> ;
schema:genre "Thesis" ;
schema:inLanguage "eng" ;
schema:name "Small town America [electronic resource] : a re-design / by Clint Brantley Bailey.",
"Small town America redesign" ;
schema:productID "3593" ;
schema:publisher <http://dbpedia.org/resource/Montana_State_University>,
<http://scholarworks.montana.edu/doc/entities.html#college/5> ;
schema:serialNumber "1513761" .
Appendix III: List of classes and properties used in the study
Classes
schema:Intangible
schema:Person
schema:Organization
schema:CreativeWork
schema:CollegeOrUniversity
schema:EducationalOrganization
schema:MediaObject
pto:Portable_Document_Format
dcterms:RightsStatement
dcterms:Collection
mont:Concept
mont:EtdCommittee
mont:School
mont:InstitutionalRepository
mont:DigitalCollection
mont:AcademicDepartment
Object Properties
schema:subOrganization
schema:encoding
schema:author
schema:member
schema:encodesCreativeWork
schema:about
schema:department
schema:publisher
dcterms:isPartOf
dcterms:rights
mont:advisor
mont:associatedDepartment
mont:associatedSchool
mont:reviewedBy
Data Properties
schema:genre
schema:dateCreated
schema:inLanguage
schema:url
schema:serialNumber
schema:name
schema:productID
schema:description
mont:firstPage
mont:lastPage
mont:degreeGrantedForCompletion
mont:degreeGranted
rdfs:label
Metadata Praxis
Provenance Description of Metadata using PROV with PREMIS
for Long-term Use of Metadata
Chunqiu Li
Graduate School of Library, Information and Media Studies, University of Tsukuba, Japan
[email protected]
Shigeo Sugimoto
Faculty of Library, Information and Media Science, University of Tsukuba, Japan
[email protected]
Abstract
Provenance description is necessary for long-term preservation of digital resources. Open
Archival Information System (OAIS) and Preservation Metadata: Implementation Strategies
(PREMIS), which are well-known standards designed for digital preservation, define descriptive
elements for digital preservation. Metadata has to be preserved along with the primary resources
in order to keep those resources alive. However, due to changing technology and information
contexts, not only primary digital resources but also metadata are at risk of damage or even loss.
Thus, metadata preservation is as important as the preservation of primary digital
resources. Metadata preservation is a rather new research topic but critical for keeping metadata
about preserved resources consistently over time. This paper focuses on provenance as an
important issue in digital preservation. It discusses provenance description based on two major
metadata standards—PROV and PREMIS. The goal of this study is to clarify a model for
describing provenance for metadata preservation. This paper first describes some well-known
standards—OAIS, PREMIS, PROV, and so forth, and then discusses a novel model of
provenance description based on the PROV Ontology (PROV-O) and PREMIS OWL Ontology.
The paper gives provenance description examples using PROV-O and PREMIS OWL Ontology
respectively. Based on analysis and mapping among the basic classes of the PROV-O and
PREMIS OWL Ontology, we propose an integrated, merged model. We discuss metadata schema
provenance and some other open issues.
Keywords: digital provenance; metadata provenance; metadata longevity; PROV; PREMIS
1. Introduction
Metadata plays crucial roles in long-term use of digital resources and digital preservation.
Damage or loss of metadata over time may cause serious problems in the long-term use of digital
resources. Metadata schema changes may cause inconsistency in the use of metadata, which is
also a risk for the long-term use of digital resources. Due to the high cost of re-creation of
metadata, longevity of metadata is an important issue for long-term use of digital resources.
Metadata schema, which defines a set of terms, structure of metadata instances and some related
characteristics of metadata instances, has to be maintained as well as the metadata instances over
time.
Provenance information is necessary for long-term use and preservation of digital resources.
Provenance is a fundamental principle of archives (Pearce-Moses, 2005) and keeping provenance
of every archived item is a fundamental archival function. Open Archival Information System
(OAIS) and Preservation Metadata: Implementation Strategies (PREMIS) are widely accepted
standards for digital preservation. They include provenance descriptions as primary information.
Both OAIS and PREMIS state the importance of provenance description for preservation
(Consultative Committee for Space Data System, 2012; PREMIS Editorial Committee, 2012).
As provenance is a general concept, provenance description is not limited to preservation of
digital objects. There are several standards for provenance description such as PROV developed
by the World Wide Web Consortium (W3C). PROV is defined as a general, high-level standard
for provenance, whereas provenance descriptions in OAIS and PREMIS are defined for
preservation of information resources. The primary goal of this paper is to study a model for
describing provenance of metadata by combining PROV and PREMIS.
This study is primarily aimed at understanding the underlying model for the provenance of
metadata for long-term use of metadata—in other words, the interoperability of metadata over
time. Metadata preservation is purposed to assure the persistent availability, understandability,
and usability of metadata. To make metadata interpretable correctly in the future context is a
main goal of metadata preservation. Longevity of digital objects is well known as a crucial issue
for the further progress of the networked information society. The technology standards for
longevity of digital objects are applicable to the metadata instances because the metadata
instances are mostly, but not necessarily, digital objects—e.g., an XML text file and an Excel file.
Longevity of digital objects does mean that the objects can be correctly rendered over time.
However, it does not necessarily mean that future users can properly understand the content of
the object. For example, a table stored in an Excel file may be rendered over time but the
attributes of the table cannot be properly understood without proper description of the meaning of
the attributes and values. This table example shows a typical problem in metadata
preservation—metadata as a digital object may be preserved; but metadata as a semantically
meaningful entity may be lost. Even if a metadata instance is encoded in XML and stored in a
plain-text file, semantics of XML elements may be lost if the meanings of the tags in the XML
text are not properly preserved. Thus, preservation of metadata is not the same as preservation of
digital objects.
Metadata registries, which store the definitions of metadata terms and controlled vocabularies
and provide them over the Internet, have crucial roles in making the metadata terms and
controlled vocabularies usable across communities and over time. Moreover, maintaining
application profiles is a crucial function for long-term use of metadata. However, management
and use of provenance information of the metadata terms and vocabularies has not been discussed
except for versioning and its control. Provenance of application profiles has been neither well
discussed nor well recognized.
Based on this understanding about state-of-the-art of metadata provenance, this paper discusses
a basic model for metadata provenance. The proposed model is defined based on PROV Ontology
(PROV-O) and PREMIS OWL ontology. The rest of this paper is organized as follows. Section 2
describes provenance for the discussion in this paper followed by surveys of some major models
and standards for preservation of digital resources and provenance description. Section 3
discusses the provenance description using PROV-O and PREMIS OWL ontology respectively.
Section 4 shows mapping between PROV-O and PREMIS OWL ontology and proposes a novel
model to combine them for provenance description oriented to digital preservation. Section 5
states metadata schema provenance issues for metadata longevity. Finally, Section 6 concludes
the paper.
2. Survey of Provenance Description Standards and Models
2.1. Digital Provenance and Metadata Provenance
We discuss provenance from the dual viewpoints of digital object provenance and that of
metadata. Digital provenance and metadata provenance in this paper are defined as follows:
Digital provenance is chronology or chronological information related to management of a
digital object. Digital provenance typically describes agents responsible for the custody and
stewardship of digital objects, key events that occur over the course of the digital object’s life
cycle, and other information associated with the digital object’s creation, management, and
preservation (PREMIS Editorial Committee, 2012)—e.g., the organization responsible for an eBook.
Based on the definition above, we can define metadata provenance as chronology or
chronological information about metadata, typically responsible agents, influencing actions,
associated events and other related information about metadata over its lifecycle. Provenance
about metadata schema is also metadata provenance, e.g., actions and events in the revision
process of metadata schema, and so forth.
It is important for memory institutions to record and provide provenance information of their
holdings. W3C Provenance Incubator Group listed provenance-related use cases, which include
provenance in cultural heritage (W3C Provenance Incubator Group, 2010). Europeana provides
access to resources held at cultural heritage institutions throughout Europe. Europeana is a use
case of metadata provenance, in which metadata provenance is represented via Europeana Data
Model using the OAI-ORE model (Eckert, 2012).
The paragraphs below summarize digital provenance and metadata provenance from the
viewpoint of long-term use of digital objects:
(1) Metadata of preserved resources has to be consistently interpretable over time. It has to be
recognized that preservation policy and environment of preserved resources may change over
time and metadata interpretation may be affected by the changes. For example, in the case of
recordkeeping, digital provenance could provide information about the origin, e.g., where, when,
by whom, and how a resource was created and who are the successors of the preserved resource.
This information will contribute to the interpretation of metadata by users in the future.
(2) Metadata provenance describes and keeps track of responsible agents, influencing actions,
associated events that caused a change(s) in metadata. Change history of a metadata schema used
in a service is crucial to keeping track of changes to metadata instances created based on that
schema. Therefore, provenance of a metadata schema is crucial to keeping metadata correctly and
consistently interpretable and may include change history of the schema as well as relationships
to other entities such as base standards and system requirements.
2.2. Digital Preservation Standards—OAIS and PREMIS
The OAIS Reference Model is a widely used model for archiving and preserving digital
resources. Provenance information in OAIS is defined as the history of the Content Information,
which describes the origin of and changes on an archived resource, and agents who hold custody
since its origination (Consultative Committee for Space Data System, 2012). The provenance
description is a part of Preservation Description Information (PDI), and documents evolutionary
processing history associated with the Content Information over its complete life cycle.
PREMIS is a widely used international metadata standard for the preservation of digital
objects. The PREMIS Data Model defines five Entities for digital preservation, which are
Intellectual Entity, (Digital) Object, Event, Agent, and Right. Documentation of actions on a
digital object is critical for the maintenance of the object. The documentation, i.e., metadata about
the actions, is aggregated as an Event. Thus, Event is a crucial component for provenance
description associated with Object. PREMIS Data Dictionary defines a set of descriptive elements
of the five Entities. Those elements are called semantic units. Some of the semantic units
associated with an Event record changes to a preserved digital object (PREMIS Editorial
Committee, 2012). PREMIS OWL ontology defines classes and properties to describe
preservation metadata in RDF.
2.3. Provenance Models—W3C PROV, Open Provenance Model and others
W3C PROV: The Provenance Working Group at W3C has published PROV family of
documents, including the PROV Data Model (PROV-DM), PROV-O and so forth. The working
group aims at the inter-operable interchange of provenance information in heterogeneous
environments such as the Web. PROV-DM is a conceptual data model, which defines a set of
concepts and relations to represent provenance (Moreau et al., 2013). PROV-O defines a set of
classes and properties as an OWL2 ontology allowing mapping PROV-DM to RDF (Lebo et al.,
2013).
Open Provenance Model (OPM): OPM is a research result of the International Provenance
and Annotation Workshop (IPAW). Based on the OPM Core Specification (v1.1), the OPM is
designed to meet six requirements, including: exchange of provenance information between
systems, representation of provenance for any “thing” and so forth (Moreau et al., 2010). OPM
Vocabulary (OPMV), OPM OWL Ontology (OPMO) and OPM for Workflows (OPMW) are
defined pertaining to OPM. OPMV as an OWL-DL ontology designed to assist the
interoperability between provenance information on the Semantic Web and to support provenance
descriptions for datasets beyond those in the Web of Data (Zhao, 2010). OPMO as an OWL
ontology allows full expressivity of OPM concepts and supports inferencing (Moreau et al., 2010).
OPMW is also OWL-DL ontology developed to represent abstract workflows and workflow
execution traces. OPMW extends and reuses OPM's core ontologies. In the latest release, OPMW
also extends PROV to represent scientific processes (Garijo and Gil, 2014).
Others: The W7 model was developed to represent the semantics of data provenance, in which
provenance is conceptualized as a combination of seven interconnected elements including “what
(occurring event)”, “how (action leading to event)”, “who (involved individuals or
organizations)”, “when (time of event)”, “where (location of event)”, “which (software or
instrument that was used)” and “why (reason for why event happened)” (Liu, 2011). A
Vocabulary for Data and Dataset Provenance (Voidp) defines terms to describe provenance
relationships of data in linked datasets (Omitola et al., 2011). Provenance Vocabulary (PRV) as
an OWL-DL ontology defines classes and properties for describing provenance of linked data on
the Web. PRV is a domain specific specialization of PROV-O. It is notable that PRV defines
terms for both data creation and data access (Hartig and Zhao, 2012). Provenance, Authoring and
Versioning Ontology (PAV) is designed for the capture of essential descriptions for tracking the
provenance, authoring and versioning of web resources (Ciccarese et al., 2013). BBC Provenance
Ontology is designed to capture data about the provenance of data in an RDF Triple Store (BBC,
2012). Provenir Ontology (PO) defined in OWL-DL describes the classes and the properties to
represent provenance metadata in eScience (Sahoo and Sheth, 2009).
2.4. Discussion on Provenance Description Standards and Models
Provenance may be about any resource, such as documents, rare books, web pages, datasets,
transaction execution records, etc. This means that we need to use an appropriate vocabulary or
vocabularies for provenance description in accordance with the type of resources and archiving
purposes. Provenance description in OAIS and PREMIS is primarily for digital preservation
whereas those standards shown in section 2.3 are defined for other purposes. Most of the
ontologies are OWL-based; thus, the OWL-based definitions are useful for referencing term
definitions and for reasoning over provenance.
PROV is designed generally and comprehensively for provenance description, referring to
representation, interchange, query, access, and validation of provenance. PREMIS is widely used
for digital preservation where provenance description is an important component. This study is
primarily aimed at definition of a model of metadata provenance description for long-term use of
metadata. We use PROV and PREMIS as a basis for general provenance description and
provenance description for preservation in this study. Hereafter we will refer to PROV and
PREMIS instead of PROV-O and PREMIS OWL Ontology unless we need to explicitly state the
ontology.
3. Provenance Description Scenarios for Preservation
We use PROV-O and the PREMIS OWL Ontology to describe provenance information created
during the lifecycle of digital objects and their metadata. Migration is a widely used method to
keep digital objects accessible and usable over time. This section presents some instances of
provenance description about the format migration shown below, referring to the
generationActivity/creationEvent that occurred to Digital Object A, the responsible Agent, the
related date and time, and also the derivation of Digital Object A in Format X to Digital Object B
in Format Y via the migrationActivity that caused the format change, and so forth.
3.1. Description of Activity and Event
Figure 1 shows a generationActivity leading to the generation of Object A by using PROV. The
generationActivity (started at dateTime1, ended at dateTime2) resource is directed to Object A,
which is linked to a generation Date-Time literal. PREMIS uses preservation-specific value
vocabularies defined by Library of Congress. Those vocabularies provide terms expressed in
SKOS vocabulary, e.g., EventType, AgentType and RelationshipType. Likewise, Figure 2 shows a
creationEvent associated with Object A and the creationEvent happening during a period from
dateTime1 to dateTime2. Meanwhile, the figure also shows that the creationEvent is linked to an
EventOutcomeInformation resource, an EventType resource, and an EventDateTime literal.
[Figure: RDF graph in which generationActivity (a prov:Activity, with prov:startedAtTime dateTime1 and prov:endedAtTime dateTime2) prov:generated Object A (a prov:Entity, with prov:generatedAtTime dateTime2).]
FIG.1. Provenance graph of the generationActivity that happened to Digital Object A, using PROV
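The graph of FIG.1 can be written out in Turtle roughly as follows (a minimal sketch; the resource names under the ex: prefix and the concrete dateTime values stand in for the placeholders used in the figure):
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .
ex:generationActivity a prov:Activity ;
  prov:startedAtTime "2014-01-01T10:00:00"^^xsd:dateTime ;  # dateTime1
  prov:endedAtTime   "2014-01-01T10:05:00"^^xsd:dateTime ;  # dateTime2
  prov:generated     ex:ObjectA .
ex:ObjectA a prov:Entity ;
  prov:generatedAtTime "2014-01-01T10:05:00"^^xsd:dateTime .  # dateTime2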
[Figure: RDF graph in which creationEvent (a premis:Event) has premis:hasEventDateTime dateTime1/dateTime2, premis:hasObject Object A, premis:hasEventType <http://id.loc.gov/vocabulary/preservation/eventType/cre> (skos:inScheme the eventType vocabulary), and premis:hasEventOutcomeInformation an EventOutcomeInformation resource.]
FIG.2. Provenance graph of the creationEvent that occurred to Digital Object A, using PREMIS
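A corresponding Turtle sketch of FIG.2, using the PREMIS terms shown in the figure (the premis namespace binding, the ex: resource names, and the concrete dateTime value are assumptions for illustration):
@prefix premis: <http://www.loc.gov/premis/rdf/v1#> .  # assumed namespace of the PREMIS OWL ontology
@prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:     <http://example.org/> .
ex:creationEvent a premis:Event ;
  premis:hasEventDateTime "2014-01-01T10:05:00"^^xsd:dateTime ;  # dateTime1/dateTime2 in the figure
  premis:hasEventType <http://id.loc.gov/vocabulary/preservation/eventType/cre> ;
  premis:hasEventOutcomeInformation [ a premis:EventOutcomeInformation ] ;
  premis:hasObject ex:ObjectA .
<http://id.loc.gov/vocabulary/preservation/eventType/cre>
  skos:inScheme <http://id.loc.gov/vocabulary/preservation/eventType> .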
3.2. Description of Responsible Agent
As shown in Figure 3, Object A is connected with a Person by property wasAttributedTo
defined in PROV. The generationAcitity is linked to that Person via property wasAssociatedWith,
from which we know the Person holds a responsibility for the generation of Object A. In
PREMIS, Agent influences Object through Event. That is, Agent is not directly connected to
Object as shown in Figure 4. However, PROV allows Agent, Entity and Activity to be related with
each other directly.
[Figure: RDF graph in which Object A (a prov:Entity) prov:wasAttributedTo a Person (a prov:Person, subclass of prov:Agent) and prov:wasGeneratedBy generationActivity (a prov:Activity), which prov:wasAssociatedWith the same Person.]
FIG.3. Provenance graph of the Agent responsible for the generation of Digital Object A, using PROV
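In Turtle, the agent description of FIG.3 reads roughly as follows (the ex: resource names are assumed):
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .
ex:ObjectA a prov:Entity ;
  prov:wasAttributedTo ex:aPerson ;
  prov:wasGeneratedBy  ex:generationActivity .
ex:generationActivity a prov:Activity ;
  prov:wasAssociatedWith ex:aPerson .
ex:aPerson a prov:Person .  # prov:Person is a subclass of prov:Agent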
[Figure: RDF graph in which a Person (a premis:Agent) has premis:hasAgentName "Name1", premis:hasAgentType <http://id.loc.gov/vocabulary/preservation/agentType/per> (skos:inScheme the agentType vocabulary), and premis:hasEvent creationEvent (a premis:Event), which premis:hasObject Object A (a premis:Object).]
FIG.4. Provenance graph of Agent responsible for Event using PREMIS
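The corresponding PREMIS description of FIG.4 in Turtle, where the Agent reaches the Object only through the Event (the premis namespace binding and the ex: resource names are assumptions):
@prefix premis: <http://www.loc.gov/premis/rdf/v1#> .  # assumed namespace of the PREMIS OWL ontology
@prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:     <http://example.org/> .
ex:aPerson a premis:Agent ;
  premis:hasAgentName "Name1"^^xsd:string ;
  premis:hasAgentType <http://id.loc.gov/vocabulary/preservation/agentType/per> ;
  premis:hasEvent ex:creationEvent .
ex:creationEvent a premis:Event ;
  premis:hasObject ex:ObjectA .
ex:ObjectA a premis:Object .
<http://id.loc.gov/vocabulary/preservation/agentType/per>
  skos:inScheme <http://id.loc.gov/vocabulary/preservation/agentType> .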
3.3. Description of Relationships between Entities and Relationships between Objects
PROV describes the relationship between entities with the properties wasDerivedFrom,
alternateOf, specializationOf, wasQuotedFrom, wasRevisionOf, hadPrimarySource, hadMember.
Figure 5 shows that Object A is the primary source of Object B using PROV. PREMIS holds two
types of relationship between Objects, including structural and derivation relationships defined in
a SKOS vocabulary by Library of Congress. Using PREMIS, Figure 6 shows the derivation
relationship between Object A and Object B due to the migrationActivity.
[Figure: RDF graph in which Object B (a prov:Entity) prov:hadPrimarySource Object A (a prov:Entity).]
FIG.5. Derivation Relationship between Digital Object A and Digital Object B using PROV
[Figure: RDF graph in which Object A and Object B (both premis:Object) are related by the derivation relationship <http://id.loc.gov/vocabulary/preservation/relationshipType/der> (skos:inScheme the relationshipType vocabulary).]
FIG.6. Derivation relationship between Digital Object A and Digital Object B using PREMIS
Furthermore, PROV also defines relationships between Activities and relationships between
Agents, whereas PREMIS does not include those relationships. Figure 7 shows the relationship
expressed by property wasInformedBy between the migrationActivity and generationActivity,
which means the migrationActivity used Object A created by the generationActivity.
[Figure: RDF graph in which migrationActivity (a prov:Activity) prov:used Object A (a prov:Entity), Object A prov:wasGeneratedBy generationActivity (a prov:Activity), and migrationActivity prov:wasInformedBy generationActivity.]
FIG.7. Relationship between Activities in PROV
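A Turtle sketch of FIG.7 (the ex: resource names are assumed):
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .
ex:ObjectA a prov:Entity ;
  prov:wasGeneratedBy ex:generationActivity .
ex:migrationActivity a prov:Activity ;
  prov:used          ex:ObjectA ;
  prov:wasInformedBy ex:generationActivity .
ex:generationActivity a prov:Activity .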
4. A Merged Model for Provenance Representation by Integrating PROV-O
with PREMIS OWL Ontology
4.1. Mapping of Basic Classes between PROV-O and PREMIS OWL Ontology
PROV has the three base classes, i.e., prov:Entity, prov:Agent and prov:Activity. PREMIS
defines classes, including premis:IntellectualEntity, premis:Object, premis:Agent, premis:Event,
and so forth. Based on the interpretation in PROV (Lebo et al., 2013) and PREMIS (PREMIS
Editorial Committee, 2012), the paragraphs below discuss mappings between them.
premis:IntellectualEntity is a set of content items as a single intellectual unit, e.g., book, map,
photograph, or database. premis:Object is a discrete unit of information in digital form.
prov:Entity can be a physical, digital, conceptual, or imaginary thing. We can conclude that
prov:Entity has a broader meaning than premis:IntellectualEntity and premis:Object. Hence, we
map premis:IntellectualEntity and premis:Object as subclasses of prov:Entity.
premis:Event indicates a description about an action (or activity) impacting an Object.
prov:Activity means actions or processes performed by Agent(s) or acted on Entity (-ies).
premis:Event is oriented to preservation actions, and only important Events are recorded. On the
other hand, prov:Activity does not have limitation of action domain or types. That is, the meaning
of premis:Event is narrower than that of prov:Activity. Therefore, we map premis:Event as a subclass of
prov:Activity.
premis:Agent can be a person, an organization, or a software program/system associated with
Events in the life of an Object. prov:Agent bears responsibility for an Activity that occurred or for
the existence of an Entity. Their Agent types are, however, almost the same. In this sense,
premis:Agent can be regarded as equal to prov:Agent, and the relation can be described using owl:equivalentClass.
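The mapping described above can be stated directly as OWL axioms, for example in Turtle (the premis prefix is assumed to denote the namespace of the PREMIS OWL ontology):
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix prov:   <http://www.w3.org/ns/prov#> .
@prefix premis: <http://www.loc.gov/premis/rdf/v1#> .  # assumed namespace of the PREMIS OWL ontology
premis:IntellectualEntity rdfs:subClassOf prov:Entity .
premis:Object             rdfs:subClassOf prov:Entity .
premis:Event              rdfs:subClassOf prov:Activity .
premis:Agent              owl:equivalentClass prov:Agent .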
4.2. A Proposed Model Integrating PROV-O with PREMIS OWL Ontology
Both PROV and PREMIS have properties to describe provenance, and they are defined based
on RDF and OWL. PROV is designed for generalized provenance description and interchange
among different systems, whereas PREMIS is primarily for preservation metadata description
used for digital preservation. The specialized PREMIS terms used to describe preservation could
enrich expressive power of PROV. By introducing the controlled vocabulary for event types
suggested in PREMIS, interoperability of Activity descriptions in PROV could be enhanced.
Based on the mapping shown in section 4.1, we propose a provenance description model for
preservation of digital resources and metadata, by integrating the PROV with PREMIS. The
merged model shown in Figure 8 introduces premis:Object and premis:IntellectualEntity as
subclasses of prov:Entity; Collection, Bundle, and Plan are also subclasses of Entity. Meanwhile,
premis:Event is mapped as a subclass of prov:Activity, and premis:Agent is equivalent to
prov:Agent. In the figure, the classes from PROV are written in italics, and the classes from PREMIS
are shown with underline. Moreover, as shown in Figure 8, the relationships between classes, the
generation or invalidation time of Entity, and the start or end time of Activity/Event can also be
described via properties (written with namespace prefix, i.e., prov) from PROV.
[Figure: diagram of the merged model. prov:Entity has subclasses Collection, Bundle, Plan, premis:Object, and premis:IntellectualEntity; premis:Event is a subclass of prov:Activity; premis:Agent is equivalent to prov:Agent. The classes are related by prov:wasDerivedFrom, prov:wasRevisionOf, prov:hadMember, prov:wasAttributedTo, prov:wasGeneratedBy, prov:used, prov:wasInvalidatedBy, prov:wasAssociatedWith, prov:actedOnBehalfOf, and prov:wasInformedBy, with xsd:dateTime values given via prov:generatedAtTime, prov:invalidatedAtTime, prov:startedAtTime, and prov:endedAtTime.]
FIG.8. The merged model for provenance description oriented to digital preservation
4.3. Provenance Description Using the Proposed Model
Eckert presented the concept of Provenance Context. A Provenance Context can be seen as a
Named Graph about an identified resource (Eckert, 2013). Named Graphs may be used for tracking
provenance of RDF data, replication of RDF graphs, and versioning (Dodds and Davis, 2012).
PROV allows grouping of provenance description and defines Bundle as a named set of
descriptions (Lebo et al., 2013).
[Figure: two named bundles. Bundle 1 describes Object A (a premis:Object, subclass of prov:Entity) with premis:hasObjectCharacteristics pointing to an ObjectCharacteristics resource that premis:hasFormat Format X. Bundle 2 describes Object B analogously with Format Y; a prov:wasDerivedFrom relation and a prov:qualifiedDerivation to a blank-node prov:Derivation (prov:entity, prov:hadActivity migrationActivity) express the derivation caused by the migration.]
FIG.9. Provenance graph of the format change from Digital Object A to Digital Object B using Bundle
Through the definition of Bundle, we can describe the provenance of a Bundle. For the assumed
example, Digital Object A in Format X is migrated to Digital Object B in Format Y. Here, we
define two Bundles, i.e., Bundle 1 and Bundle 2. Bundle 1 and Bundle 2 respectively describe the
format features of Digital Object A and Digital Object B, as shown in Figure 9, which shows the
format change caused by the migrationActivity. As a Bundle is an Entity in PROV, we can also
express the derivation between Bundle 1 and Bundle 2. In PROV, by using the property
qualifiedDerivation, we can qualify how Bundle 2 was derived from Bundle 1. In Figure 9,
Bundle 2 is linked to a blank node through the property qualifiedDerivation, and from the blank
node, the migrationActivity that caused the format change is expressed.
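The derivation between the two bundles can be sketched in Turtle as follows (the bundle and activity names are illustrative; only the PROV terms are normative):
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .
ex:Bundle1 a prov:Entity , prov:Bundle .  # describes Object A in Format X
ex:Bundle2 a prov:Entity , prov:Bundle ;  # describes Object B in Format Y
  prov:wasDerivedFrom ex:Bundle1 ;
  prov:qualifiedDerivation [
    a prov:Derivation ;
    prov:entity      ex:Bundle1 ;
    prov:hadActivity ex:migrationActivity  # the activity that caused the format change
  ] .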
5. Provenance Description for Long-term Use of Metadata
Metadata schema longevity is a vital aspect of metadata longevity. Given the necessity of
provenance in preservation, metadata schema provenance should be documented and managed
for the purpose of metadata preservation. On one hand, metadata is a digital object; on the
other hand, it is a logical data entity neutral to any particular physical representation as a
digital object. There are widely accepted standards for the longevity of digital objects, e.g., OAIS
and PREMIS. However, there is no well-established model or standards for the longevity of
metadata as a logical data entity. In this paper, the authors propose a model for provenance
description of metadata from the viewpoint of metadata longevity.
By the nature of metadata, there are meta-metadata and meta-meta-metadata, which mean “data
about metadata” and “data about meta-metadata”. A metadata schema is a typical meta-metadata
because it is a description of metadata from the viewpoint of structural and/or semantic definition.
Because of the nature of metadata, meta-metadata and meta-meta-metadata are metadata.
Metadata instances are created as (1) a digital instance of metadata, e.g., a text file describing a
book, a CSV file of bibliographic records, or (2) a logical data instance expressed as a
self-contained digital object or embedded in a digital object, e.g., a metadata expressed as an
RDF/XML instance and an RDFa expression embedded in an HTML document. In both cases,
provenance is an important issue for the longevity of metadata - they require both digital object
provenance and metadata provenance, i.e., metadata instance as a file and a written instance in the
file.
Provenance of the metadata schema is one of the key issues for the long-term use of metadata instances. Metadata schema provenance can be categorized using the DCMI application profile: (1) Vocabulary Provenance, (2) Structural Provenance (i.e., provenance of description set profiles), and (3) Provenance of the other components: Encoding Syntax Guidelines, User Guidelines, and Functional Requirements. Vocabulary provenance records the semantic change of terms. Structural provenance includes the revision history of terms used in the schema as well as the revision history of structural constraints. The other provenance descriptions are crucial for future readers to understand the contextual information needed to process metadata. From another viewpoint, a vocabulary mapping table created for a metadata schema mapping is itself a metadata instance about that mapping, e.g., a conversion from an old schema to a new schema or a merger of two schemas. Provenance description for the table should be given to record the change history of metadata terms used in the schema(s).
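As a purely illustrative sketch (not part of the paper), such a vocabulary mapping table could carry its own provenance fields so that the change history of terms is recorded alongside the mapping; the column names and example terms below are hypothetical.

```python
import csv

# Hypothetical mapping table for a schema conversion, with minimal provenance
# for each mapping decision (who mapped it, when, and as part of which activity).
rows = [
    {"old_term": "oldschema:author", "new_term": "newschema:creator",
     "mapped_by": "Metadata Task Force", "mapped_on": "2014-03-18",
     "activity": "conversion from old schema to new schema"},
    {"old_term": "oldschema:topic", "new_term": "newschema:subject",
     "mapped_by": "Metadata Task Force", "mapped_on": "2014-03-18",
     "activity": "merger of two schemas"},
]

with open("vocabulary_mapping.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
```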
6. Discussion and Future Work
Although many projects have made great efforts toward digital preservation, no efficient method has been proposed for metadata preservation. Metadata provenance for metadata longevity in the Semantic Web is an important issue. It has become easier to collect and merge open metadata from various sources. Given dynamic factors such as URIs, linkage relations, and RDF vocabularies, representation of the provenance of metadata and metadata schemas is necessary.
There is a challenge in how to keep metadata provenance interoperable and semantic even when the preservation environment changes over a long time period. Interoperability in provenance
description is useful for interchange among various domains or systems. Semantic provenance is required to make the meaning of provenance easily and correctly understandable by both humans and machines. In any event, the preservation context and provenance context for metadata need further research.
References
BBC. (2012). Provenance Ontology. Retrieved March 18, 2014, from http://www.bbc.co.uk/ontologies/provenance.
Consultative Committee for Space Data Systems. (2012, June). CCSDS 650.0-M-2. Reference model for open archival information system (OAIS), Recommended Practice, Issue 2. Retrieved March 18, 2014, from http://public.ccsds.org/publications/archive/650x0m2.pdf.
Ciccarese, Paolo, Stian Soiland-Reyes, Khalid Belhajjame, Alasdair J. G. Gray, Carole Goble, and Tim Clark. (2013). PAV 2.0 - Provenance Authoring and Versioning ontology. Journal of Biomedical Semantics 2013, 4:37. Retrieved March 18, 2014, from http://www.jbiomedsem.com/content/4/1/37.
Dodds, Leigh and Ian Davis. (2012, May 31). Chapter 5. Data Management Patterns. Linked Data Patterns: A pattern
catalogue for modeling, publishing, and consuming Linked Data. Retrieved March 18, 2014, from
http://patterns.dataincubator.org/book/named-graphs.html.
Eckert, Kai. (2012). Metadata Provenance in Europeana and the Semantic Web. Retrieved July 25, 2014, from
http://edoc.hu-berlin.de/series/berliner-handreichungen/2012-332/PDF/332.pdf.
Eckert, Kai. (2013). Provenance and Annotations for Linked Data. Proceedings of the International Conference on Dublin Core and Metadata Applications 2013, 9-18.
Garijo, Daniel and Yolanda Gil. (2014, July 11). The OPMW-PROV Ontology. Retrieved July 29, 2014, from
http://www.opmw.org/model/OPMW/.
Hartig, Olaf and Jun Zhao. (2012, March 14). Provenance Vocabulary Core Ontology Specification. Retrieved March 18, 2014, from http://trdf.sourceforge.net/provenance/ns.html.
Liu, Jun. (2011). W7 Model of Provenance and its Use in the Context of Wikipedia. Ph.D. Dissertation. The University of Arizona.
Lebo, Timothy, Satya Sahoo, Deborah McGuinness, Khalid Belhajjame, James Cheney, David Corsar, Daniel Garijo, Stian Soiland-Reyes, Stephan Zednik, and Jun Zhao. (2013, April 30). PROV-O: The PROV Ontology. Retrieved March 18, 2014, from http://www.w3.org/TR/prov-o/.
Moreau, Luc, Paolo Missier, Khalid Belhajjame, Reza B'Far, James Cheney, Sam Coppens, Stephen Cresswell, Yolanda Gil, Paul Groth, Graham Klyne, Timothy Lebo, Jim McCusker, Simon Miles, James Myers, Satya Sahoo, and Curt Tilmes. (2013, April 30). PROV-DM: The PROV Data Model. Retrieved March 18, 2014, from http://www.w3.org/TR/prov-dm/.
Moreau, Luc, Ben Clifford, Juliana Freire, Joe Futrelle, Yolanda Gil, Paul Groth, Natalia Kwasnikowska, Simon Miles, Paolo Missier, Jim Myers, Beth Plale, Yogesh Simmhan, Eric Stephan and Jan Van den Bussche. (2011). The Open Provenance Model Core Specification (v1.1). Future Generation Computer Systems, 27(6), 743-756.
Moreau, Luc, Li Ding, Joe Futrelle, Daniel Garijo Verdejo, Paul Groth, Mike Jewell, Simon Miles, Paolo Missier, Jeff Pan, and Jun Zhao. (2010, October 12). Open Provenance Model (OPM) OWL Specification. Retrieved March 18, 2014, from http://openprovenance.org/model/opmo.
Omitola, Tope, Christopher Gutteridge, and Nicholas Gibbins. (2011). voidp: A Vocabulary for Data and Dataset Provenance. Retrieved March 18, 2014, from http://www.enakting.org/provenance/voidp/.
Pearce-Moses, Richard. (2005). A Glossary of Archival and Records Terminology (pp. 317). Chicago: The Society of
American Archivists. Retrieved March 18, 2014, from http://files.archivists.org/pubs/free/SAA-Glossary-2005.pdf.
PREMIS Editorial Committee. (2012). PREMIS Data Dictionary for Preservation Metadata, version 2.2. July 2012.
Retrieved March 18, 2014, from http://www.loc.gov/standards/premis/v2/premis-2-2.pdf.
Sahoo, S. Satya, and Amit P. Sheth. (2009). Provenir ontology: Towards a Framework for eScience Provenance
Management. Retrieved March 18, 2014, from http://corescholar.libraries.wright.edu/knoesis/80.
W3C Provenance Incubator Group. (2010). Use Case Report. Retrieved July 25, 2014, from http://www.w3.org/2005/Incubator/prov/wiki/Use_Case_Report.
Zhao, Jun. (2010, October 6). Open Provenance Model Vocabulary Specification. Retrieved March 18, 2014, from http://open-biomed.sourceforge.net/opmv/ns.html.
Interlinking Cross Language Metadata Using Heterogeneous
Graphs and Wikipedia
Xiaozhong Liu, School of Informatics and Computing, Indiana University, USA, [email protected]
Miao Chen, School of Informatics and Computing, Indiana University, USA, [email protected]
Jian Qin, School of Information Studies, Syracuse University, USA, [email protected]
Abstract
Cross-language metadata are essential in helping users overcome language barriers in information
discovery and recommendation. The construction of cross-language vocabulary, however, is
usually costly and intellectually laborious. This paper addresses these problems by proposing a
Cross-Language Metadata Network (CLMN) approach, which uses Wikipedia as the intermediary
for cross-language metadata linking. We conducted a proof-of-concept experiment with key
metadata in two digital libraries and in two different languages without using machine translation.
The experiment results are encouraging and suggest that the CLMN approach has the potential not only to interlink metadata in different languages with a reasonable rate of precision and quality but also to construct cross-language metadata vocabularies. Limitations and further research are also discussed.
Keywords: metadata; linked data; cross language; heterogeneous graph
1. Research Problem
Subject categories and keywords in metadata descriptions are primary subject access points for information discovery, whether for English- or non-English-speaking users. While many non-English-speaking users can read and understand English, the reverse is often not true. To bridge the gap between languages, digital libraries such as Europeana (http://europeana.eu) offer cross-language metadata so that users can search in any language. The cross-language search function is valuable and enables information discovery in languages that users would otherwise have been unable to reach due to the language barrier.
Cross-language subject tools for Asian languages, however, have been lagging behind the
increase in Asian Internet users and research output. Although the Internet has created a global
village, the lack of cross-language metadata prevents information from flowing bi-directionally
between English and Asian languages and creates language silos of information. Take CiNii (http://ci.nii.ac.jp/) as an example: even though both Japanese and English resources are indexed in the CiNii database, cross-language retrieval and recommendation are unavailable. The same problem exists in Google Scholar, a giant scholarly retrieval engine. In addition, current tools are often limited to standardized human or machine translation, which is not suitable for high-quality
information retrieval and recommendation. One contributing factor for the lack of cross-language
information discovery and recommendation is the difficulty in constructing a multi-language
metadata vocabulary.
It is well known that the construction of any vocabulary tool is time consuming and intellectually laborious. The Chinese-language version (AAT-Taiwan) of the Art and Architecture Thesaurus ("AAT", 2014), for example, is translated from and mapped to its English version. It contains 34,961 concepts, 26,813 translated concepts, 12,668 archived records, and 6,564 edited records, and it took multiple years of work by professionals and domain experts to complete. Maintenance and updating have been ongoing since its first release in 2009. Building cross-language counterparts is a huge endeavor and costly in both time and personnel.
The usefulness of, and the current lack of, cross-language subject vocabularies call for new approaches to developing such vocabularies at a large scale while maintaining a reasonable level of quality and a low cost. To address this conundrum, we propose a cross-language metadata network approach that will generate cross-language vocabularies on the fly by leveraging existing vocabulary resources. This paper reports a preliminary experiment, as a proof of concept, that uses metadata from four elements – publication, author, keyword, and venue – to construct cross-language metadata network graphs, which are then linked through their language counterparts in Wikipedia concepts and subject categories. This approach will allow searches in a user's native language to return results in multiple languages without machine translation.
2. Relevant Research
Developing cross-language metadata network graphs is motivated not only by the need for such tools but also by issues in cross-language information retrieval that previous research has ignored or been unable to address (Oard & Diekema 1998; Nie 2010; Ye, Huang, He & Lin 2012). Cross-language retrieval algorithms and methods are well documented in research publications. Most of these algorithms and methods, however, focused on translation rather than linking. They employed statistical models, e.g., latent semantic indexing (Littman, Dumais & Landauer 1998), parallel corpus mining (Nie et al., 1999), and n-grams (AbdulJaleel & Larkey 2003), to construct bilingual translation models. As such, the translations rely on the source text and are limited to matching terms for translating the query from its original language to the target language in order to perform searches, rather than linking relevant concepts across languages. The translations have nothing to do with the metadata describing the source, much less creating both content and language linkages between metadata descriptions.
Machine translation plays an important role in constructing cross-language vocabularies (Dumais et al., 1997; Vossen 1998). Research literature in this field exhibits two paradigms of translation approaches: dictionary/rule based and parallel/comparable corpus based (Potthast, Stein, and Anderka, 2008). The first approach relies heavily on corpora and dictionaries, while the second uses the human-built cross-language links in knowledge bases such as Wikipedia. Cross-language links in Wikipedia explicitly connect concepts in different languages and have proved to be useful sources for text mining across languages by navigating between the links. Studies show that some language pairs have a high ratio of cross-lingual links in Wikipedia; for example, the ratio of English-German links is as high as 95% (Sorg & Cimiano, 2008).
The method used by Sorg and Cimiano (2008) and Potthast et al. (2008) is called CL-ESA (Cross-Language Explicit Semantic Analysis). Documents and queries are projected to a vector space of concepts via Explicit Semantic Analysis (ESA) in one language, and that vector space of concepts is mapped to the vector space of another language via cross-language links in Wikipedia. Potthast et al. (2008) used cross-language links in Wikipedia for cross-language information retrieval and showed reasonably good performance in cross-language ranking and bilingual correlation ranking. Ye, Huang, He and Lin (2012) also employed Wikipedia as a graph-based bilingual resource for constructing a cross-language association dictionary (CLAD). They found that a CLAD can be useful for enhancing cross-language information retrieval performance.
The studies mentioned above provide encouraging results for using Wikipedia as the bridge in developing cross-language metadata vocabularies. Although unforeseen factors may affect the precision and coverage of concepts across languages, experimenting with the cross-language metadata linking approach using Wikipedia is nonetheless worthwhile.
3. A Case Scenario in Cross-Language Vocabulary Linking
To demonstrate how cross-language vocabulary might be interlinked, we present a case
scenario of metadata for scholarly publications. The DBLP Computer Science Bibliography
(http://dblp.uni-trier.de/db/) contains metadata descriptions primarily for computer science
publications written in English. The C-DBLP (“Chinese DBLP”, n.d.) serves same goal for
computer science publications written in Chinese. The metadata schemas for DBLP and C-DBLP are comparable but do not communicate with one another, nor can users conduct searches across both databases. While the different ownership of these two databases is a primary factor in their inability to communicate with one another, it is also true that the metadata in the two databases represent two completely different sets of publications and are in two different languages. Similarly, large search engine players such as Google Scholar and OCLC WorldCat index resources in multiple languages, but the metadata descriptions (e.g., keywords in different languages) in these systems are not related within their own system.
Figure 1. Wikipedia concepts and language links
Over the past decade, Wikipedia has become an increasingly important resource for world knowledge. It provides two unique features that can potentially solve the aforementioned problems for cross-language information discovery. The first feature is that Wikipedia provides concept definitions in multiple languages. An example is the concept definition for "Semantic Web": this entry has been written in 39 languages (see Figure 1). In each language, the concept name is defined by the title of the article (entry). The Chinese counterpart for this concept is defined by the title "语义网", a term used in most publications on this topic in Chinese.
The other important feature is that all concepts in Wikipedia are inter-connected via Wikipedia
hierarchical categories and hyperlinked among Wikipedia pages. For instance, the concepts
“Semantic Web” and “metadata” are connected via the path
[Wikipedia Concept: Semantic Web] →
[Wikipedia Category: Knowledge Engineering] →
[Wikipedia Category: Knowledge Representation] ←
[Wikipedia Concept: Metadata]
In other words, all concepts in Wikipedia are inter-connected through topic links (Wikipedia
categories) and cross-language equivalents.
For the purpose of generating cross-language metadata vocabularies, the interconnectedness
across multiple languages between concepts and knowledge categories in Wikipedia makes it an
ideal source to leverage. If Wikipedia can be used as the intermediary vocabulary, we may be
able to design algorithms to “ask” it to translate metadata between different languages. This
means that digital libraries and repositories in different languages may use the intermediary tool
to construct cross-language metadata vocabularies for information discovery and
recommendation. It will be possible then for cross-language vocabulary tools to automatically
select and recommend most relevant cross-language publications without having to rely on
machine translation. In the cases of DBLP and C-DBLP, it is possible to use Wikipedia as the
intermediary nodes to interlink publications, venues, and authors in these two digital libraries, no
matter which language is used to search, via the [Keyword] →[Wikipedia Concept] link. As each
Wikipedia concept is written in both Chinese and English, this step does not need to involve
machine translation.
We are aware of the limitations of the Wikipedia resource: the sparseness of Wikipedia definitions in certain languages may limit the generalizability of the proposed method. For instance, if only a small number of Wikipedia concepts are defined in a language, keyword projection performance can understandably be low.
4. Methodology
Using Wikipedia to create Cross-Language Metadata Networks (CLMN) involves two steps. In
the first step a Single-Language Metadata Network (SLMN) is built for a monolingual digital
library or repository. In the second step, the SLMN will be mapped to Wikipedia concepts and
subject categories to create Cross-Language Metadata Networks (CLMN). Through this two-step
method, cross-language metadata vocabularies are constructed and then used to connect metadata
and resource objects in digital libraries/repositories across different languages. In the section
below we will first discuss the method for generating metadata networks for an individual
repository and then describe the CLMN through which SLMNs are interconnected via
Wikipedia’s bridge nodes, i.e., Wikipedia pages and subject categories.
4.1 Step 1: Creating Single-Language Metadata Networks (SLMN)
We assume that there are four types of resource objects – publications (papers, reports,
webpages, and books), venues (journals, conference proceedings, and domain names as embodied
by websites), subjects, and authors – in a single-language digital library. Between the four types
of resource objects, there exist various types of linkages: citation linkages between publications,
authorship linkages between authors and publications, and venue linkages between publications
and venues. We also assume that in a single-language digital library (or repository), a list of
subject terms and values (keywords or controlled vocabulary) is available to represent
publications and venues and that metadata and publications share the same language.
Using network theory, each resource object is a network node (vertex) and the links between nodes (vertices) are edges. Metadata in a single-language digital library are considered as a single-language metadata network (SLMN) in which the nodes are connected by edges. This network is heterogeneous by nature in terms of node types, because the same network contains multiple types of nodes: author (A), publication (P), venue (V), and keyword (K), which are the focus of this study.
For each digital library (a commercial database or an institutional repository), there exists a local SLMN. All four types of nodes mentioned above can be connected by any of seven types of edges: 1) P → A, a paper is written by an author; 2) P → V, a paper is published in a venue; 3) P → K, a paper or publication is relevant to a keyword; 4) P → P, a publication cites or links to publications; 5) K → P, a keyword (topic) is assigned to publications; 6) K → A, a keyword (topic) is assigned to authors; and 7) K → V, a keyword (topic) is assigned to venues. Edge types 1, 2, 3 and 4 are implemented by using metadata in a single-language digital library. Keywords
derived from publications, author names, and venues are labeled as topics and represented by edge types 5, 6, and 7, which are calculated by using the PageRank with Priors algorithm (Liu, Zhang, and Guo, 2013) on the homogeneous citation graphs (publication-citation graph, author-citation graph, and venue-citation graph). Note that, as this network can potentially be used for resource recommendation, all edges are associated with an edge weight, P(v|u), which indicates the transition probability (weight) from node u to node v.
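The construction just described can be sketched with a toy example (not from the paper) in Python using networkx; the node labels are hypothetical, edge weights are normalized into transition probabilities P(v|u), and networkx's personalized PageRank is used here as a stand-in for the PageRank with Priors computation the authors cite.

```python
import networkx as nx

# Toy single-language metadata network (SLMN): papers (P), authors (A),
# venues (V), and keywords (K) as typed nodes.
G = nx.DiGraph()
G.add_edge("P:paper1", "A:alice")      # P -> A: written by
G.add_edge("P:paper1", "V:venue1")     # P -> V: published in
G.add_edge("P:paper1", "K:metadata")   # P -> K: relevant to
G.add_edge("P:paper1", "P:paper2")     # P -> P: cites

# Edge weight P(v|u): normalize the out-edges of each node into transition probabilities.
for u in G:
    successors = list(G.successors(u))
    for v in successors:
        G[u][v]["weight"] = 1.0 / len(successors)

# Topic-biased ranking in the spirit of PageRank with Priors: bias the random
# walk toward a keyword node and rank the remaining nodes.
prior = {n: (1.0 if n == "K:metadata" else 0.0) for n in G}
scores = nx.pagerank(G, alpha=0.85, personalization=prior, weight="weight")
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```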
4.2 Step 2: Creating Cross-Language Metadata Networks (CLMN)
The goal of this step is to generate cross-language metadata networks using computational methods. The CLMNs generated using Wikipedia and the PageRank with Priors algorithm will function as a linking mechanism that interconnects single-language metadata silos into a global network capable of cross-language information discovery and recommendation. In the CLMN approach, a collection of digital libraries or repositories is represented by k Metadata Networks (MNs). Figure 2 visualizes the CLMN creation process. There are four layers in a CLMN, and the k SLMNs connect to the Wikipedia concept and Wikipedia category nodes of the CLMN, where the Wikipedia nodes function as the bridge that interconnects different SLMNs. Meanwhile, all Wikipedia nodes (Wikipedia concepts and Wikipedia categories) are also connected through incoming/outgoing links (between Wikipedia concepts), concept-category relations, and the hierarchical relations between categories.
Figure 2. Cross-Language Metadata Networks (CLMN)
In Figure 2, dotted lines indicate calculated or inferred relationships and solid lines indicate relationships that physically exist in the repository or the Wikipedia database. The figure depicts how one SLMN typically connects to Wikipedia nodes, which is also how other SLMNs connect to the Wikipedia nodes. The middle section is where automatic pairing and linking of concepts in different languages takes place. All keywords or controlled vocabulary terms (node K) connect to Wikipedia concepts via two kinds of edges: exact match edges and partial match edges. The former edge type indicates that the string represented by node K is exactly the same as the Wikipedia concept title. Note that K on different SLMNs may be in different languages, while a Wikipedia concept is also indexed in multiple languages. The latter edge type is generated by using information retrieval algorithms, e.g., a language model or vector space model, which means that the target keyword or controlled vocabulary term is part of the Wikipedia concept's
content. Similarly, the content of a Wikipedia concept may also be in different languages. As with the edge types in the SLMN, all edges between Wikipedia nodes and keyword nodes are associated with an edge weight.
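The two edge types can be illustrated with a small sketch (not the authors' implementation): exact match compares the keyword with the concept title, and a naive containment test stands in for the retrieval-model scoring (language model or vector space model) that would generate partial match edges; the sample concepts and weights below are hypothetical.

```python
# Hypothetical keyword-to-Wikipedia linking with exact and partial match edges.
wikipedia_concepts = {
    "machine learning": "Machine learning is the study of algorithms that learn from data ...",
    "information extraction": "Information extraction is the task of automatically extracting ...",
}

def link_keyword(keyword):
    """Return (keyword, concept, edge_type, weight) tuples for a keyword node K."""
    edges = []
    k = keyword.strip().lower()
    for title, text in wikipedia_concepts.items():
        if k == title:
            edges.append((keyword, title, "exact_match", 1.0))
        elif k in text.lower():
            # Placeholder weight; an IR relevance score would be used in practice.
            edges.append((keyword, title, "partial_match", 0.5))
    return edges

print(link_keyword("machine learning"))
```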
5. Preliminary Experiment
As a proof of concept for the proposed method, we constructed a CLMN by using the ACM Digital Library (English computer science publications + metadata, http://www.acm.org/) and the WanFang Digital Library (Chinese computer science publications + metadata, http://www.wanfangdata.com.cn/). All four types of nodes in the publications' metadata across both libraries – authors, venues, papers, and keywords – were connected by using Wikipedia as the intermediary layer, as shown in Figure 2. For this experiment, we used the Chinese and English Wikipedia dumps from April 2014.
Due to the space limit, we present only the metadata layer and the Wikipedia layer in this section. The CLMN constructed in this preliminary experiment contains 1,481 English keywords and 121 Chinese keywords (there are over ten times more English keywords than Chinese keywords because of data limitations). Connected to these keywords were 1,719 Wikipedia page nodes and 1,146 Wikipedia category nodes.
Two exemplar Chinese keywords, "机器学习" (Machine Learning) and "信息抽取" (Information Extraction), were used as query terms to find related English keywords by using two types of paths: 1. [Chinese Keyword] → [Wikipedia Concept] ← [English Keyword], and 2. [Chinese Keyword] → [Wikipedia Concept] → [Wikipedia Category] ← [Wikipedia Concept] ← [English Keyword] (edge direction was ignored). The first path uses only one intermediary Wikipedia node between the query and target keywords in Chinese and English. The second is more complicated because the Chinese query keyword and the English target keyword may link to different Wikipedia concepts that share the same Wikipedia category.
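For illustration only, the two path types can be enumerated over a toy CLMN fragment with networkx; the graph below contains a handful of hypothetical nodes, and the function simply returns English-keyword nodes reachable within four hops, which covers both path patterns since edge direction is ignored.

```python
import networkx as nx

# Toy CLMN fragment (CK = Chinese keyword, WP = Wikipedia page,
# WC = Wikipedia category, EK = English keyword). Edge direction is ignored
# for the path queries, so an undirected graph is used.
G = nx.Graph()
G.add_edge("CK:机器学习", "WP:machine_learning")
G.add_edge("EK:machine_learning", "WP:machine_learning")
G.add_edge("WP:machine_learning", "WC:Machine_learning")
G.add_edge("WC:Machine_learning", "WP:cluster_analysis")
G.add_edge("WP:cluster_analysis", "EK:cluster_analysis")

def related_english_keywords(ck, max_hops=4):
    """Yield (English keyword, path) pairs reachable from a Chinese keyword."""
    for ek in (n for n in G if n.startswith("EK:")):
        for path in nx.all_simple_paths(G, ck, ek, cutoff=max_hops):
            yield ek, path

for ek, path in related_english_keywords("CK:机器学习"):
    print(ek, "via", " - ".join(path))
```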
Given the space limitation, we investigated only the first example in more detail. Figure 3 displays the paths through which the results for "机器学习" were generated. Different types of nodes are represented by different colors on the CLMN graph. This graph example shows that Wikipedia page and category nodes function as intermediary nodes to link together the same concept, machine learning, in English and Chinese.
Figure 3. Related English Keywords for “机器学习” on the CLMN (via Wikipedia nodes)
The specific paths for query “机器学习” on the CLMN are listed below (CK = Chinese keyword,
WP = Wikipedia page, WC = Wikipedia category, and EK = English Keyword):
Result for path [Chinese Keyword] → [Wikipedia Concept] ←[English Keyword] (1 result)
1. CK:机器学习→WP:machine_learning←EK:machine_learning
Results for path [Chinese Keyword] → [Wikipedia Concept]→ [Wikipedia Category] ←
[Wikipedia Concept] ←[English Keyword] (26 results)
1. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:cluster_analysis←EK:cluster_analysis
2. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:expectation_maximization_algorithm←EK:em_algorithm
3. CK:机器学习→WP:machine_learning→WC:Cybernetics←WP:complex_systems←EK:complex_systems
4. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:reinforcement_learning←EK:reinforcement_learning
5. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:pattern_recognition←EK:pattern_recognition
6. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:formal_concept_analysis←EK:concept_analysis
7. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:unsupervised_learning←EK:unsupervised_learning
8. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:hidden_markov_model←EK:hidden_markov_model
9. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:expectation_maximization_algorithm←EK:expectation_maximization
10. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:supervised_learning←EK:supervised_learning
11. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:pattern_recognition←EK:pattern_detection
12. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:artificial_neural_network←EK:neural_networks
13. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:artificial_neural_network←EK:artificial_neural_network
14. CK:机器学习→WP:machine_learning→WC:Cybernetics←WP:genetic_algorithm←EK:genetic_algorithm
15. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:nearest_neighbor_search←EK:nearest_neighbor_search
16. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:principal_component_analysis←EK:principal_component_analysis
17. CK:机器学习→WP:machine_learning→WC:Cybernetics←WP:artificial_intelligence←EK:artificial_intelligence
18. CK:机器学习→WP:machine_learning→WC:Cybernetics←WP:system←EK:systems
19. CK:机器学习→WP:machine_learning→WC:Cybernetics←WP:autonomy←EK:autonomy
20. CK:机器学习→WP:machine_learning→WC:Cybernetics←WP:control_theory←EK:control_theory
21. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:support_vector_machine←EK:support_vector_machine
22. CK:机器学习→WP:machine_learning→WC:Cybernetics←WP:information_theory←EK:information_theory
23. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:discriminative_model←EK:discriminative_model
24. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:perceptron←EK:perceptron
25. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:formal_concept_analysis←EK:formal_concept_analysis
26. CK:机器学习→WP:machine_learning→WC:Machine_learning←WP:conditional_random_field←EK:conditional_random_field
Specific paths for query "信息抽取" are listed below:
Results for path [Chinese Keyword] → [Wikipedia Concept] ←[English Keyword] (1 result):
1. CK:信息抽取→WP:information_extraction←EK:information_extraction
Results for path [Chinese Keyword] → [Wikipedia Concept]→ [Wikipedia Category] ←
[Wikipedia Concept] ←[English Keyword] (13 results):
1. CK:信息抽取→WP:information_extraction→WC:Artificial_intelligence←WP:artificial_intelligence←EK:artificial_intelligence
2. CK:信息抽取→WP:information_extraction→WC:Artificial_intelligence←WP:computer_vision←EK:computer_vision
3. CK:信息抽取→WP:information_extraction→WC:Artificial_intelligence←WP:description_logic←EK:description_logics
4. CK:信息抽取→WP:information_extraction→WC:Artificial_intelligence←WP:fuzzy_logic←EK:fuzzy_logic
5. CK:信息抽取→WP:information_extraction→WC:Artificial_intelligence←WP:game_theory←EK:game_theory
6. CK:信息抽取→WP:information_extraction→WC:Artificial_intelligence←WP:intelligent_agent←EK:intelligent_agent
7. CK:信息抽取→WP:information_extraction→WC:Artificial_intelligence←WP:markov_random_field←EK:markov_random_field
8. CK:信息抽取→WP:information_extraction→WC:Natural_language_processing←WP:cross-language_information_retrieval←EK:cross_language_information_retrieval
9. CK:信息抽取→WP:information_extraction→WC:Natural_language_processing←WP:information_retrieval←EK:information_retrieval
10. CK:信息抽取→WP:information_extraction→WC:Natural_language_processing←WP:latent_semantic_analysis←EK:latent_semantic_analysis
11. CK:信息抽取→WP:information_extraction→WC:Natural_language_processing←WP:natural_language_processing←EK:natural_language_processing
12. CK:信息抽取→WP:information_extraction→WC:Natural_language_processing←WP:natural_language←EK:natural_language
13. CK:信息抽取→WP:information_extraction→WC:Natural_language_processing←WP:question_answering←EK:question_answering
The specific results shown above demonstrate that the path [Chinese Keyword] → [Wikipedia
Concept] ← [English Keyword] can find accurate translations, while the path [Chinese Keyword] → [Wikipedia Concept] → [Wikipedia Category] ← [Wikipedia Concept] ← [English Keyword] can locate a number of high-quality related (linked) keywords in a different language. The experiment results suggest that the CLMN is promising as a means to link metadata across languages and digital libraries. The metadata used in this experiment are relatively specialized, with a reasonable level of quality; whether the method can be applied to other domains and achieve a comparable level of performance will need further study and evaluation.
6. Discussion and Conclusion
The resulting CLMNs have a number of potential applications for metadata representation and resource discovery. The four sets of results presented in the last section are structured data with path and node information attached. They can be parsed programmatically into a format suitable for building cross-language vocabularies. Such cross-language vocabularies can then be encoded in Linked Data formats and shared through vocabulary services. Another application is to recommend resources (i.e., publications, authors, or venues) across repositories and languages. For example, given an author ID (on an SLMN), the system can recommend publications potentially relevant to a user's interests in a different language. Given a keyword (on an SLMN), we can recommend the top related venues (venue recommendation) or experts (author recommendation) in a different language.
Unlike classical machine translation methods that use homogeneous data sources, this study
employed heterogeneous graph mining and text mining methods to connect all the metadata via
Cross-Language Metadata Networks (CLMN), in which Wikipedia is used as the set of intermediary nodes linking local repositories. We took metadata from the ACM and WanFang digital libraries to run our experiment. The results suggest that the CLMN, as a novel approach, was able not only to find accurate translations but also to locate related metadata in different languages. This is especially encouraging for developing a low-cost and effective method for automatic cross-language vocabulary construction.
The reliability and validity of the CLMN method need further study and experiments to verify. We plan to conduct further experiments with other sources of metadata, e.g., those available in open repositories where metadata are crowd-sourced, and in disciplines other than computer science. As our next research step, we are keen on developing a bilingual vocabulary linked data set using this method in a humanities domain by leveraging data from public digital libraries.
References
AAT (Art & Architecture Thesaurus). Retrieved Aug 1, 2014, from http://www.getty.edu/research/tools/vocabularies/lod/
AbdulJaleel, Nasreen, and Leah S. Larkey. (2003). Statistical transliteration for English-Arabic cross language
information retrieval. In Proceedings of the twelfth international conference on Information and knowledge
management (pp. 139-146). ACM.
Chinese DBLP. Retrieved Aug 1, 2014 from http://cdblp.cn/index.php
Dumais, Susan T., Todd A. Letsche, Michael L. Littman, and Thomas K. Landauer. (1997) Automatic cross-language
retrieval using latent semantic indexing. In AAAI spring symposium on cross-language text and speech retrieval
(Vol. 15, p. 21).
Littman, Michael L., Susan T. Dumais, and Thomas K. Landauer.(1998). Automatic cross-language information
retrieval using latent semantic indexing. In Cross-language information retrieval (pp. 51-62). Springer US.
Nie, Jian-Yun, Michel Simard, Pierre Isabelle, and Richard Durand. (1999, August). Cross-language information
retrieval based on parallel texts and automatic mining of parallel texts from the Web. In Proceedings of the 22nd
annual international ACM SIGIR conference on Research and development in information retrieval(pp. 74-81).
ACM.
Nguyen, D., A. Overwijk, C. Hauff, D.R. Trieschnigg, D. Hiemstra, and F. De Jong, (2009). WikiTranslate: query
translation for cross-lingual information retrieval using only Wikipedia. In Evaluating Systems for Multilingual and
Multimodal Information Access (pp. 58-65). Springer Berlin Heidelberg.
Nie, Jian-Yun. (2010). Cross-language information retrieval. Synthesis Lectures on Human Language Technologies,
3(1), 1-125.
Potthast, Martin, Benno Stein, and Maik Anderka. (2008). A Wikipedia-based multilingual retrieval model. In
Advances in Information Retrieval (pp. 522-530). Springer Berlin Heidelberg.
Sorg, Philipp, and Philipp Cimiano. (2008a). Cross-lingual information retrieval with explicit semantic analysis. In Working Notes for the CLEF 2008 Workshop.
Sorg, Philipp, and Philipp Cimiano. (2008b). Enriching the cross-lingual link structure of Wikipedia - a classification-based approach. In Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence (pp. 49-54).
Vossen, Piek. (1998). A multilingual database with lexical semantic networks. Kluwer Academic Publishers, Dordrecht.
Ye, Zheng, Jimmy X. Huang, Ben He, and Hongfei Lin. (2012). Mining a multilingual association dictionary from Wikipedia for cross-language information retrieval. Journal of the American Society for Information Science and Technology, 63(12), 2474-2487.
Automated Enhancement of Controlled Vocabularies:
Upgrading Legacy Metadata in CONTENTdm
Andrew Weidner, University of Houston, USA, [email protected]
Annie Wu, University of Houston, USA, [email protected]
Santi Thompson, University of Houston, USA, [email protected]
Abstract
To ensure robust, reliable, retrievable and sharable metadata, the University of Houston (UH)
Libraries initiated a Metadata Upgrade Project in 2013 to systematically audit and refine the
quality of the metadata in the University of Houston Digital Library (UHDL). Still in progress,
the Metadata Upgrade Project has already produced significant discoverability improvements in
the UHDL’s legacy metadata and laid the foundation for future metadata production according to
recognized standards. The final phase of the project includes aligning controlled vocabulary terms
with appropriate authorities and adding and revising descriptive content in the UHDL. This is a
time intensive process that requires careful evaluation and entry of name and subject authority
terms. To improve efficiency and accuracy during the data entry process, the metadata librarian at
the UH Libraries developed name and subject authority applications that automatically transform
legacy controlled vocabulary terms into authorized forms. This project report provides an
overview of the UH Libraries Metadata Upgrade Project, a discussion of how the UHDL’s
upgraded metadata improves discoverability of our collections, and an in-depth look at the
custom tools that automate the authority alignment process in the CONTENTdm Project Client.
Keywords: metadata; controlled vocabularies; authority control; automation
1. Introduction
The University of Houston (UH) Libraries are committed to the dissemination and
discoverability of our unique, historical collections. In the five years since the launch of the
University of Houston Digital Library (UHDL), the repository has grown steadily and currently
provides online access to more than 50,000 digital objects. While the UHDL serves as a platform
for researchers to access the rare and unique materials in the UH Libraries holdings, the state of
the legacy metadata in the digital library presented barriers to efficient use of the UHDL’s digital
objects. Incomplete and inconsistent legacy metadata restrict both discoverability and
interoperability. To ensure robust, reliable, retrievable and sharable metadata, the UH Libraries
initiated a Metadata Upgrade Project in 2013 to audit and refine the quality of the metadata in the
UHDL.
The Metadata Upgrade Project team developed a three-phase strategy to systematically manage
the metadata audit and upgrade process based on feedback and data analysis from focus group
interviews, data inspection and benchmarking. Still in progress, the Metadata Upgrade Project has
already produced significant discoverability improvements in the UHDL’s legacy metadata. The
third phase requires time intensive work on item level descriptive metadata revision including
aligning controlled vocabulary terms with appropriate authorities. To improve efficiency and
accuracy during the data entry process, the metadata librarian at the UH Libraries developed
name and subject authority applications that automatically transform legacy controlled
vocabulary terms into authorized forms.
2. Metadata Upgrade Methodology and Strategy
The Metadata Upgrade Project utilized several approaches to identify metadata issues and
create strategies to improve the quality of metadata in the repository. To understand metadata
needs and address concerns that developed around legacy metadata, librarians conducted focus
groups with UH Libraries stakeholders—including Special Collections, Web Services, and
Liaison Services. External stakeholders were not included in the focus group interviews because
of the complicated institutional review board (IRB) application requirements and the difficulty in
identifying users. The project team also benchmarked current practices with similar digital
libraries. These two activities demonstrated that controlled vocabularies in the UHDL had been
applied inconsistently and inaccurately over time, most likely as a result of frequent changes in
staff from project to project. Consequently, some items in the UHDL had rich descriptive
connections with items in different digital collections while others had no terms to link them to
similar materials. The Metadata Upgrade Project team concluded that the controlled vocabulary
terms in the UHDL should be revised for accuracy, standardized to specific vocabulary lists, and
mapped to appropriate Dublin Core elements (Thompson and Wu, 2013).
TABLE 1: Three Phases of the Metadata Upgrade Project
Project Phase   Tasks
Phase One       Stakeholder Interviews, Metadata Schema Development
Phase Two       Collection-level Metadata Editing, Metadata Dictionary
Phase Three     Item-level Metadata Editing
After collecting data regarding the issues with the legacy metadata in the UHDL, librarians
developed key recommendations, a three-phase strategy for upgrading UHDL metadata (Table 1),
and a new input standard to ensure that the quality of future metadata remains accurate and
consistent over time. The first phase of the upgrade process focused on adding, revising, and
standardizing descriptive and administrative fields. The second phase edited metadata at the
collection level. Tasks performed in phase two included standardizing collection names for
archival and digital collections as well as editing collection-level fields. The third phase focuses
on adding and revising descriptive content in the digital library at the item level. To ensure that
future UHDL metadata complies with the new standard, the Metadata Upgrade Project also
produced a Metadata Dictionary which provides definitions, examples, and input rules for
descriptive, administrative, technical, and preservation metadata fields (Thompson and Wu,
2013). An abridged version of the UHDL Metadata Dictionary (2014) is available online.
3. Automated Metadata Transformation
Addressing issues with controlled vocabulary terms is a key activity in the third phase, and the
Metadata Upgrade Project staff spends a considerable amount of time reviewing existing terms,
identifying more appropriate terms, and reconciling terms with the source vocabularies. In the
early stages of phase three, the Metadata Upgrade Project staff experimented with exporting data
from CONTENTdm and cleaning the data with OpenRefine. However, getting the cleaned data
back into the system with a batch process proved a difficult task. The staff chose to work in the
CONTENTdm Project Client for all phase three item-level editing and use OpenRefine for
metadata analysis on new collections.
In order to speed up the editing process, the UH Libraries metadata librarian developed two
applications that enable efficient transformation of legacy authority data within the
CONTENTdm Project Client. Both applications are written in AutoHotkey (AHK), an open
source scripting and macro language for the Windows operating system. In addition to a GUI that
provides user feedback and menu functions, the core AHK scripts act as a glue language that
connects the data in the Project Client with locally maintained vocabulary mapping files. Each
AHK authority app gathers data recorded in the CONTENTdm Project Client and parses the tab-delimited authority files for matching terms. As of this writing, the tab-delimited files contain
approximately 900 subject mappings and 3,000 name authority mappings. The apps automatically
enter authorized terms in the Project Client and facilitate the addition of new terms to the local
mapping files with input boxes and automatic Web browser searches. Most importantly, the apps
allow the Metadata Upgrade Project team to focus on the intellectual content of their authority
work and let the computer take care of repetitive data entry tasks.
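The mapping-file lookup at the heart of both apps can be sketched in Python (the authors' tools are written in AutoHotkey, so this is only an illustration); the tab-delimited column order (legacy term, authorized form, optional URI) is an assumption based on the description and Figure 2.

```python
import csv

def load_mapping(path):
    """Load a tab-delimited file of legacy terms, authorized forms, and optional URIs."""
    mapping = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                legacy, authorized = row[0], row[1]
                uri = row[2] if len(row) > 2 else None
                mapping[legacy.strip().lower()] = (authorized, uri)
    return mapping

def authorize(term, mapping):
    """Return (authorized form, URI) for a legacy term, or None if it needs review."""
    return mapping.get(term.strip().lower())
```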
3.1 Subject Authority App
The decision to develop a subject authority app stems from the desire to ensure that the
metadata for every object in the UHDL contains subject terms from a widely used controlled
vocabulary. Legacy subject data in the UHDL includes terms from multiple vocabularies, and the
subject app performs automated mapping from those vocabularies to authorized terms in the
Library of Congress Subject Headings (LCSH). The UH Libraries are exploring opportunities for
applying linked data technologies to the collections in the UHDL, and the subject app also
facilitates harvesting of URIs from the Library of Congress Linked Data Service in preparation
for that work.
FIG. 1. AHK sub-routine for copying data and moving between Project Client fields.
The subject authority app processes one record at a time in the Project Client’s spreadsheet
view. When a metadata specialist triggers the subject app with the specified key combination, the
app traverses one row and copies the data in each alternate subject authority field to the clipboard.
In addition to LCSH, the UHDL uses four other subject vocabularies: Thesaurus for Graphic
Materials (TGM), Art & Architecture Thesaurus (AAT), the Thesaurus for Use in College and
University Archives (SAA), and a local UHDL vocabulary. To move between fields and copy the
data, the app sends key presses to the Project Client, as if a human user were pressing keys on the
keyboard. The sub-routine in Figure 1 sends the F2 key to activate the Project Client field for
editing, Control + A (^a) to select all of the text, Control + C (^c) to copy the text to the
clipboard, and the Tab key to move to the right one field. Brief pauses in between each keystroke
(Sleep, 50) give the Project Client GUI time to process each command.
FIG. 2. Subject mapping entries in the local tab-delimited file.
After copying values in a field, the app parses the clipboard data and attempts to match each
term against a tab-delimited mapping file stored on a local network drive (Figure 2). If no match
is found for a given term, the app opens a Library of Congress Linked Data Service search for
that term in a Web browser. After identifying an appropriate controlled term, a metadata
specialist enters the authorized form and authority record URI in dialog boxes. The app
automatically adds the term and URI to the local mapping file. When all of the alternative subject
authority columns have been queried, the app returns to the LCSH column and inputs the
authorized LCSH terms for that record (Figure 3) (Weidner, UHDL_SubjectTopical_CDM,
2014).
FIG. 3. Subject values in the Project Client after mapping.
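As an illustration of the workflow just described (not the AHK code itself), the sketch below looks each legacy subject term up in a local mapping, falls back to a manual Library of Congress Linked Data Service search for misses, and appends the new mapping to the tab-delimited file; the search URL pattern and file layout are assumptions.

```python
import webbrowser
from urllib.parse import quote_plus

LC_SEARCH = "http://id.loc.gov/search/?q="   # assumed search URL pattern

def map_subjects(terms, mapping, mapping_path):
    """Map legacy subject terms to LCSH forms, prompting for any unmapped term."""
    authorized = []
    for term in terms:
        hit = mapping.get(term.strip().lower())
        if hit:
            authorized.append(hit[0])
            continue
        # No local mapping yet: open an LC Linked Data Service search so a
        # metadata specialist can choose the authorized form and its URI.
        webbrowser.open(LC_SEARCH + quote_plus(term))
        new_form = input(f"Authorized LCSH form for '{term}': ")
        new_uri = input("Authority record URI: ")
        with open(mapping_path, "a", encoding="utf-8") as f:
            f.write(f"{term}\t{new_form}\t{new_uri}\n")
        mapping[term.strip().lower()] = (new_form, new_uri)
        authorized.append(new_form)
    return authorized
```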
3.2 Name Authority App
The UHDL name authority app performs similar matching and mapping functions in a
different direction. Instead of mapping values in multiple columns to a single vocabulary, the
name app maps values in a single column to multiple vocabularies: Library of Congress Name
Authority File (LCNAF), the Handbook of Texas (HOT), and a local UHDL name authority file
(Figure 4). Much of the legacy name authority data in the UHDL is recorded in the LCNAF field,
even though many of those names do not have records in the LCNAF vocabulary. This occurred
as a result of the metadata schema work in phase two of the Metadata Upgrade Project when staff
divided the UHDL’s name fields (Creator, Subject.Name, etc.) into multiple vocabularies instead
of one general field. In an effort to produce high quality, standardized data that is compatible with
linked data principles, the name authority app automates the transfer of name data to the
appropriate authority column in the CONTENTdm Project Client (Weidner,
UHDL_Names_CDM, 2014).
FIG. 4. AHK loop passes each name to the NameMap function which returns an authorized form.
Monitoring accuracy during authority work is very important, and the Metadata Upgrade
Project staff periodically review the name app’s tab-delimited mapping file in OpenRefine to
identify names mistakenly mapped to more than one form. Faceting on the authorized form
column quickly reveals any problems with the data. As a quality control feature, the name
authority app creates a report for each day and a log entry each time the name app is triggered
(Figure 5). Using these reports, staff can backtrack to locate any records that must be corrected.
FIG. 5. Name app report illustrating correct mappings to authorized forms.
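One possible shape for the daily report and log described above, sketched in Python rather than AutoHotkey; the file naming and columns are hypothetical.

```python
import csv
from datetime import date, datetime

def log_name_mapping(original, authorized, vocabulary, report_dir="."):
    """Append one line to today's report each time a name is mapped to an authorized form."""
    report = f"{report_dir}/name_report_{date.today().isoformat()}.tsv"
    with open(report, "a", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerow([
            datetime.now().isoformat(timespec="seconds"),
            original, authorized, vocabulary,
        ])
```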
3.3 Authority App Limitations
During the course of the authority work with the name and subject applications, the Metadata
Upgrade Project team has identified a number of limitations. The apps can handle the bulk of the
work, but there are edge cases that present interesting problems. In the case of the subject
authorities, mappings to LCSH may change between collections because a single term in an
alternate vocabulary can map to multiple LCSH authorized terms. For example, the term
“gutters” in an alternate vocabulary could map to “Roof gutters” or “Street gutters” in LCSH,
depending on the context of the collection. This problem requires careful evaluation of a record
each time the app is triggered and occasional editing of the tab-delimited subject mapping file.
In the case of the name authorities, there are many times when a name is present in both the
LCNAF and HOT vocabularies. An update to the app provided the ability to harvest URIs from
both vocabularies and record those connections in a separate file for future use. The app gives
precedence to LCNAF for data entry purposes. As previously mentioned, the local tab-delimited
name mapping file requires constant monitoring to ensure the accuracy of the authorized forms
entered in the UHDL’s metadata. Both AHK authority apps are short term solutions for the
Metadata Upgrade Project and must eventually be supplanted by more robust controlled
vocabulary management features in the UHDL’s digital asset management system.
4. Benefits of Enhanced Metadata
There are numerous benefits to upgrading the legacy metadata in the UHDL. Integrating
metadata best practices—including the consistent use of established controlled vocabularies—
shaped the strategies and standards developed to address the issues identified during focus group
interviews and benchmarking. These best practices will improve how users connect with UHDL
content. In particular, standardized vocabulary terms consistently applied improve recall during
faceted browsing, reducing the likelihood of orphaned records. Implementing best practices also
ensures that UHDL metadata is fully interoperable with harvesting protocols, such as OAI-PMH,
thereby providing another potential discovery layer to our content and opening up possibilities for
collaboration with larger projects.
Aligning controlled vocabulary terms with recognized authorities and harvesting authority
record URIs also lays the foundation for publishing UHDL collections as linked data with rich
semantic markup. A first step might be to enrich subject terms and names with an owl:sameAs
link, populated by the URI gathered during the Metadata Upgrade Project, that points to the
unambiguous definition in the source vocabulary (W3C, 2004). Finally, with the creation of a
more robust metadata dictionary, UHDL metadata creators now have a standard to guide future
projects (Thompson and Wu, 2013).
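A minimal sketch of that first step, assuming rdflib and a hypothetical local namespace; the LCSH identifier below is a placeholder standing in for a URI harvested during the Metadata Upgrade Project.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDFS

UHDL = Namespace("http://example.org/uhdl/term/")   # hypothetical local namespace

g = Graph()
term = UHDL["roof-gutters"]
g.add((term, RDFS.label, Literal("Roof gutters")))
# owl:sameAs points to the unambiguous definition in the source vocabulary;
# "sh00000000" is a placeholder, not a real LCSH identifier.
g.add((term, OWL.sameAs, URIRef("http://id.loc.gov/authorities/subjects/sh00000000")))

print(g.serialize(format="turtle"))
```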
5. Conclusion
While it is crucial to employ standards and best practices for quality control during the creation
of a repository’s metadata, metadata must be constantly maintained to reflect changes in the data
model, end-user interface configuration, and system transitions. The lack of batch processing and
limited authority control features in our digital asset management system creates barriers in our
metadata editing workflow. The rapidly growing volume and complexity of formats in our digital
library also presents challenges for our data quality management work. The utilization of
scripting and automation in our metadata revision process has assisted us greatly in overcoming
these barriers and challenges. The subject and name authority applications described in this paper
have simplified our workflow and helped to improve consistency and accuracy in our data.
Metadata is at the functional core of our digital system. High quality metadata not only
enhances the user experience in our digital library, but also enables the scalability and
interoperability of our data. To ensure high quality metadata, it is important for metadata
professionals to leverage traditional skills and new technologies to address the complex issues
involved in metadata creation and maintenance. Applying traditional cataloging skills during
descriptive metadata creation and enhancing data with applications for automated analysis and
transformation—such as data mining, name and subject heading mapping, and batch
processing—will improve the quality of the metadata in our repository and the efficiency with
which it is created. The UH Libraries will continue to explore and experiment with new
approaches to describing our digital objects and, with the metadata upgrade work outlined in this
paper, we are laying the groundwork for the migration of our data to a more expansive semantic
environment.
References
Art & Architecture Thesaurus. (2014). http://www.getty.edu/research/tools/vocabularies/aat/index.html/. Accessed July
26, 2014.
AutoHotkey. (2014). http://www.autohotkey.com/. Accessed May 29, 2014.
Handbook of Texas. (2014). http://www.tshaonline.org/handbook/. Accessed July 26, 2014.
Library of Congress Linked Data Service. (2014). http://id.loc.gov/. Accessed July 26, 2014.
Library of Congress Name Authority File. (2014). http://id.loc.gov/authorities/names.html/. Accessed July 26, 2014.
Library of Congress Subject Headings. (2014). http://id.loc.gov/authorities/subjects.html/. Accessed July 26, 2014.
OpenRefine. (2014). http://openrefine.org/. Accessed August 10, 2014.
Thesaurus for Graphic Materials. (2014). http://www.loc.gov/pictures/collection/tgm/. Accessed July 26, 2014.
Thesaurus for Use in College and University Archives. (2014).
http://www.archivists.org/publications/epubs/thesaurus.asp/. Accessed July 26, 2014.
Thompson, Santi and Annie Wu. (2013). Metadata overhaul: upgrading metadata in the University of Houston Digital
Library. Journal of Digital Media Management, 2(2): 137-147.
UHDL Metadata Dictionary. (2014). http://digital.lib.uh.edu/about/metadata/. Accessed August 7, 2014.
W3C. (2004). OWL Web Ontology Language Reference. Retrieved July 26, 2014, from http://www.w3.org/TR/owl-ref/#sameAs-def.
Weidner, Andrew J. (2014). UHDL_Names_CDM. GitHub Repository. Retrieved May 29, 2014, from
https://github.com/metaweidner/UHDL_Names_CDM/.
Weidner, Andrew J. (2014). UHDL_SubjectTopical_CDM. GitHub Repository. Retrieved May 29, 2014, from
https://github.com/metaweidner/UHDL_SubjectTopical_CDM/.
Posters
Retaining Metadata in Remixed Cultural Heritage Objects
Jamie Wittenberg
University of Illinois at
Urbana-Champaign, USA
[email protected]
Keywords: metadata; semantic web; remixing; linked open data
1. Context
Memory institutions have been working to incorporate features into their digital collections
that empower users to take ownership of cultural narratives. The advent of technologies like
annotation tools and crowdsourced tagging have allowed libraries, archives, and museums to
promote user content as part of an institutional narrative, albeit a somewhat tertiary one
(Salomon, 2013). Collecting institutions including the Smithsonian, MoMA, Australian Museum,
and British Library have been developing initiatives that encourage users to remix openly
available digital content. A remix appropriates components of existing resources and incorporates
them into a new work.
This movement towards user-generated remixed content is cost effective for institutions and
engaging for patrons. Increased interactivity is emblematic of the changing role of libraries,
archives and museums (Reiskind 2012, pp. 6). The future of cultural memory institutions will be
one that embraces collection diversity and incorporates user-generated material into institutional
narratives. This is already happening in social media, crowdsourced tagging, API development,
and remixing. Work to ensure that associated metadata is harvested along with media content is still in its infancy. Increased endorsement of remixing as a way of engaging with cultural
heritage material requires a metadata infrastructure that can support description of remixed
content in a way that is comprehensive, interoperable, and scalable.
2. Existing Standards
There are two primary obstacles preventing the development of such a model. The first is that
even when comprehensive metadata is documented and available, current metadata standards do
not describe content with sufficient specificity. Because remixes appropriate segments of items,
rather than the entire item as a collection does, remixes require descriptions that are more
granular. In order to accommodate the clipping and cropping nature of remixing, a more robust
system of detailed object description is necessary.
The second obstacle is that metadata is often unidirectional. It is created for new items that
may express relationships to existing records, but less commonly updated in existing records. To
create metadata for remixes, metadata for original material would first need to be evaluated for its
relevance to the new content. Metadata for each appropriated component part that makes up the
remixed content should at minimum contain provenance, attribution, and descriptive information.
2.1. Descriptive Metadata Standards
In widely used descriptive metadata standards such as MODS and Dublin Core, relationships
between items are FRBR-type hierarchical relationships. Remixes seem to occupy an unspecified
space within the FRBR universe, because they appropriate and reuse items, rather than works or
expressions. Remixes take a single physical instance of a manifestation and modify it. MODS and
Dublin Core provide enough room in their structure that with some manipulation, it would be
possible to approximate a description of a remix. This is especially true if the remix is an
expression of the original work. However, some remixes might only incorporate minutiae of
existing content, drawing it together to create an entirely new work. Neither the MODS
RelatedItem attributes nor the Dublin Core Relation Type attributes express the relationship
between source content and a remixed object that is a new work (LOC 2013; DCMI 2012). There
is no way to include metadata about remixing actions, cardinality, or provenance.
Given that this form of cultural production is not only becoming increasingly popular, but is
being adopted into institutional narratives, there is a need for a metadata infrastructure that
explicitly addresses remixed material (Fisher, 2007).
2.2. Event-Based Metadata Standards
Event-based metadata standards such as CDWA, CIDOC-CRM and LIDO orient representation towards changes in the state of the item. These standards are better equipped than descriptive metadata schemas to manage the lifecycle data associated with cultural heritage material (Coburn, 2010, pp. 3-4). While event-based standards offer the necessary process and provenance support to a remix metadata model, the scope of such standards is steered towards chain-of-custody changes such as the CIDOC-CRM Activity subclasses of Acquisition, Transfer of Custody, and Curation Activity (ICOM/CIDOC, 2013, p. 5). Remixed cultural
heritage objects require a description that targets state changes in content production as well as
lifecycle events after accession.
3. Future Work: Linked Data and Annotation Standards
Metadata for remixed objects must enable consistent description and attribution for all aspects
of the work. Exploring Linked Open Data conceptualizations of aggregation and annotation such
as the Open Annotation Data Model and the OAI-ORE Abstract Data Model offers insight into
possibilities for structuring metadata associated with remixed cultural heritage objects (OAC
2013; OAI 2008). Such a structure must provide a descriptive framework for each component of
a remix and would require an extensible model flexible enough that elements could be included
from across domains. A standard that builds on Semantic Web concepts like the graph data model
has the potential to provide that flexibility. This is an area that requires further research.
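As a minimal illustration of how such a graph-based description might be structured, the sketch below uses the Python rdflib library to express a remix as an OAI-ORE-style aggregation of appropriated segments, each carrying a provenance link back to its source; all URIs and titles are hypothetical, and this is one possible encoding rather than a proposed standard.

    # A minimal, illustrative sketch: a remix described as an aggregation of
    # appropriated segments with provenance links; every URI here is hypothetical.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS, RDF

    ORE = Namespace("http://www.openarchives.org/ore/terms/")

    g = Graph()
    g.bind("ore", ORE)
    g.bind("dcterms", DCTERMS)

    remix = URIRef("http://example.org/remix/42")                  # the new work
    clip_a = URIRef("http://example.org/item/7#t=30,45")           # appropriated video segment
    clip_b = URIRef("http://example.org/item/9#xywh=0,0,200,150")  # cropped image region

    # The remix aggregates sub-item segments, the level of granularity that
    # descriptive standards currently struggle to address.
    g.add((remix, RDF.type, ORE.Aggregation))
    g.add((remix, DCTERMS.title, Literal("Archive Mashup No. 42")))
    for clip in (clip_a, clip_b):
        g.add((remix, ORE.aggregates, clip))
        g.add((remix, DCTERMS.source, clip))  # provenance back to the source content

    print(g.serialize(format="turtle"))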
4. Conclusion
The profile of the heritage institution of the future is beginning to take shape, and it is
characterized by ever-increasing interactivity, user customization, and widespread dissemination.
Libraries, archives, and museums will be participatory, collaborative spaces with room for
alternative narratives of heritage. Metadata structures and standards must adapt with these
institutions. It is essential to the integrity of cultural heritage institutions that as traditional
unilaterally created corpuses transition into inclusive and dynamic collections, descriptive
infrastructures transition as well (Bertacchini & Morando, 2013, p. 60). The movement towards enabling
remixes of cultural heritage materials threatens existing metadata models because it requires
systemic change in the granularity of descriptive metadata and in metadata creation workflows.
The development of a metadata structure that can accommodate remixed content will help to
ensure that libraries, archives and museums continue to fulfill their roles as stewards of cultural
heritage content.
References
Bertacchini, E., & Morando, F. (2013). The Future of Museums in the Digital Age: New Models for Access to and Use
of Digital Collections. International Journal Of Arts Management, 15(2), 60-72.
Coburn, E., Light, R., McKenna, G., Stein, R., Vitzthum, A. (2010). LIDO - Lightweight Information Describing
Objects Version 1. Retrieved from http://www.lido-schema.org/schema/v1.0/lido-v1.0-specification.pdf
DCMI. (2012). Dublin Core Metadata Element Set, version 1.1. Retrieved from http://www.dublincore.org/documents/dces/.
Fisher, M. & Twiss-Garrity, B.A. (2007, March). Remixing Exhibits: Constructing Participatory Narratives With On-Line Tools To Augment Museum Experiences. Museums and the Web 2007: Proceedings. Toronto: Archives & Museum Informatics.
ICOM/CIDOC Documentation Standards Group. (2013). Definition of the CIDOC Conceptual Reference Model.
Retrieved from http://www.cidoc-crm.org/docs/cidoc_crm_version_5.1.2.pdf
Library of Congress. (2013). MODS Elements and Attributes. MODS User Guidelines ver. 3. Retrieved from
http://www.loc.gov/standards/mods/userguide/generalapp.html
Open Annotation Community Group. (2013). Open Annotation Core Data Model. Retrieved from http://www.openannotation.org/spec/core/20130208/index.html
Open Archives Initiative. (2008). ORE Specification - Abstract Data Model. Retrieved from http://www.openarchives.org/ore/1.0/datamodel
Reiskind, A. (2012). extraMUROS and the 21st century image library. VRA Bulletin, 38(2), 1. Retrieved from
http://online.vraweb.org/vrab/vol38/iss2/4
Salomon, D. (2013). Moving on from Facebook. College & Research Libraries News, 74(8), 408-412.
Embedded Metadata—A Tool for Digital Excavation
Ana Cox
Phoenix Art Museum, USA
[email protected]
Keywords: embedded metadata; digital asset management; controlled vocabulary; collection
management; data mapping
1. Introduction
In June of 2012, I commenced the weighty task of searching the far reaches of Phoenix Art
Museum's digital storage spaces to import images into a recently acquired collection management
system, The Museum System (TMS). Before my newly created position as Visual Resource
Coordinator began, each department generated and stored assets with their own organizational
system in digital silos. I excavated long forgotten folders on various servers and desktops,
hunting for visual documentation of the art collection and past installations. Embedded metadata
was used as a tool to identify subject matter of images and indicate which folders had been
searched. These assets were then reorganized with a new file name convention and folder
structure. This poster will discuss my method for using embedded metadata to track information
about digital assets as well as challenges and opportunities for further development. This method
could be implemented by other cultural organizations as a low cost approach to tracking basic
metadata, content creators and copyright restrictions.
2. Implementation
The VRA XMP Info Panel, developed by the Visual Resources Association Embedded Metadata Working Group (VRA EMWG), was installed in Adobe Bridge and provides metadata fields for cataloging art objects that the standard IPTC panel does not offer. The VRA Panel also adheres to metadata standards such as Dublin Core and VRA Core. Thus, rules for cataloging were largely pre-established. The goal was to include
only the most pertinent information for identifying the art object and how the file was created.
The fields in Table 1 were identified to be the most useful.
TABLE 1. Metadata Fields.
Artwork: Creator; Title; Date; Medium; Dimensions; Repository; Description (3)
Image: Creator; Date; Source; Copyright Restrictions; Copyright Notice; Custom Field: Image Type; Custom Field: Document Type; Custom Field: Object Number
Administration: Collection (1); Cataloger; Summary Description (2)
(1) This field is used to assign curatorial area (controlled vocabulary).
(2) Caption information is concatenated from work fields.
(3) This field is used for additional notes about the artwork. For example, if multiple objects are included in the same image.
Only two custom fields were independently developed that were not included in the VRA
Panel: Object Number (institutional object tracking number) and Image Type (controlled
vocabulary: scanned transparency, reference image and professional collection photography).
3. Workflow
As digital silos were reviewed, labels in Adobe Bridge were utilized to mark which files and
folders had been reviewed, which images were copied and cataloged according to the new file
taxonomy and which images required subject identification. Where appropriate, object numbers
were added as embedded metadata to assist with future identification. Once the files were copied
to the new structure, the VRA Panel Export-Import Tool was used to transfer object metadata
from TMS to Adobe Bridge via an Excel spreadsheet. Metadata was also embedded regarding
how the image was created, suitability for publication and any copyright restrictions. As images
were imported into TMS, this additional metadata was ingested into corresponding fields in the
media record. Adobe Bridge provides several tools and features that allow the user to add
embedded metadata to large batches of images as well as automated file renaming tools, which
greatly improved the workflow.
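As a rough sketch of this kind of batch step (not the museum's actual scripts), embedded metadata can be read and written in bulk by driving ExifTool from Python; the folder path, field values, and accession number below are hypothetical, and the VRA panel's custom fields would use their own XMP namespace in practice.

    # A rough, hypothetical sketch of batch-reading and writing embedded metadata
    # with ExifTool from Python; folder, values, and accession number are examples only.
    import json
    import subprocess
    from pathlib import Path

    folder = Path("servers/curatorial/old_silo")  # hypothetical digital silo

    # Read selected embedded fields for every JPEG in the folder as JSON.
    result = subprocess.run(
        ["exiftool", "-json", "-Title", "-Description", "-Creator",
         *map(str, folder.glob("*.jpg"))],
        capture_output=True, text=True, check=True,
    )
    records = json.loads(result.stdout)

    # Flag files that still lack a description and stamp an object number on them.
    for record in records:
        if not record.get("Description"):
            subprocess.run(
                ["exiftool", "-overwrite_original",
                 "-Description=Needs subject identification",
                 "-XMP-dc:Identifier=1957.042.001",  # hypothetical accession number
                 record["SourceFile"]],
                check=True,
            )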
4. Results
The hunt for these digital assets is ongoing; however, after the initial survey spanning five months, I was able to import about 10,000 files into TMS, a 280% increase over the files imported into the previous collection management system, Argus. I also established procedures for cataloging and importing
new assets. Today there are 16,708 media records in TMS. Overall, I have copied and cataloged
approximately 74,000 files with embedded metadata into the digital archive. This number is growing daily.
5. Challenges and Opportunities
Using this method presented a few clear challenges and opportunities for development.
• Open source tools provided by the VRA EMWG make tracking digital assets through
embedded metadata a low-cost, fast solution for digital asset management. For small to
mid-size cultural organizations this method can effectively organize institutional history.
However, in order to read every field in the VRA Panel, a staff member would need to
download the panel and install it in the Adobe Creative Suite. If an organization does not
already use Adobe products, there could be a cost barrier in acquiring this software and
investing in staff training.
• The concatenated artwork caption appears in the description field in the standard IPTC
panel, which can be read by a staff member using any tool that reads embedded metadata,
such as Finder or Windows Explorer. This facilitates the ease of object identification;
however, caption information is not static. For example, if the Registrar completes a vault
inventory, is it worthwhile to correct all the updated measurements? Similarly, if the
work of an artist moves into the public domain, is it worthwhile to update every image by
this artist in the copyright restrictions field? Using this method for variable information
could prove time consuming and requires constant attention and editing.
• The main benefit of using embedded metadata to track digital assets is that it is a tool to
recognize images that have previously been ingested into a digital asset management
system. For example, a staff member may create a copy of an image and rename the file
to store in their digital silo. The embedded metadata is copied as well, thus providing a
provenance for the file.
• Phoenix Art Museum does not currently use digital asset management software. If we
were to move in this direction, the embedded metadata could easily be exported into an
Excel spreadsheet and imported into a DAMS.
Despite these challenges, utilizing embedded metadata to track and describe digital assets is a low-cost digital asset management solution for galleries, libraries, archives, and museums. Embedded
metadata is not only a useful tool for digital excavation, but can also provide opportunities as a
starting point for a more nuanced digital asset management system.
Dublin Core to Ensure Interoperability between Models
Generated by Tools of Species Distribution Modeling
Cleverton Ferreira Borba
University of Sao Paulo, Brazil
[email protected]
Pedro Luiz P. Correa
University of Sao Paulo, Brazil
[email protected]
Keywords: species distribution modeling; connection between tools; biodiversity informatics; ecological informatics.
Abstract
This poster presents the use of Dublin Core in tools that perform species distribution modeling. As a case study, it proposes using Dublin Core to connect the models generated by species distribution modeling tools, contributing to the field of biodiversity informatics.
1. Introduction
Biodiversity informatics is a new area of scientific research that has received much attention in recent years because its results and innovations assist decision making for the conservation and preservation of biodiversity. According to Peterson et al. (2010), this area is “challenged to meet the demand for support to biodiversity conservation technology”.
Species distribution modeling makes it possible to track changes in species distribution, populations, and their diversity over a given period. However, studies show that modeling species distribution has become more complex (Soberón & Nakamura, 2009), and modeling tools therefore require improvements to support the application of new techniques and modeling strategies (Peterson et al., 2011).
One of these requirements is to ensure data interoperability between modeling tools. Interoperability here means the ability to exchange information through a metadata standard. In this context, Dublin Core could help if adopted as a common data standard for the models generated by species distribution modeling tools.
2. Dublin Core and its use in species distribution modeling
Currently, the main link between species distribution modeling and Dublin Core is that modeling tools access different database platforms that use the Dublin Core standard for publishing and standardizing information.
The use of metadata standards such as Dublin Core also assists the biodiversity data collection process, because the data becomes public and, through standardization, can be made available on various platforms.
Beyond this stage of displaying and making collected data available, this poster proposes that the use of Dublin Core be further explored and applied to the models generated by species distribution modeling tools.
3. Application of Dublin Core for connection between models of species
distribution
Species distribution modeling tools currently generate independent models that cannot be used or reused by other tools. As a result, researchers and users often have to work with more than one tool to reach their goals.
The idea is that with the Dublin Core standard, other standards, and an ontology, it is possible
to create a connection between the tools, ensuring interoperability between them, as we can see in
Figure 1.
FIG. 1. Using Dublin Core for connection between modeling species distribution tools.
The proposal is to use the main elements of Dublin Core to achieve interoperability between models generated by species distribution modeling tools. Every generated model must have a title, subject, description, type, source, relation, creator, publisher, contributor, date, format, etc.; this will ensure interoperability of basic information between the generated models. Building on this information, an ontology covering the main elements of the model should be a priority for connecting modeling tools.
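As a minimal sketch of how a modeling tool could attach these elements to a generated model, the code below writes a small Dublin Core XML sidecar file; the tool name, values, and file name are hypothetical.

    # A minimal sketch: a Dublin Core sidecar record for one generated model.
    # All values are hypothetical examples.
    import xml.etree.ElementTree as ET

    DC = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("dc", DC)

    model_metadata = {
        "title": "Potential distribution of Panthera onca, 2014",
        "creator": "Example Modeling Tool",            # hypothetical tool name
        "subject": "species distribution modeling",
        "description": "Model generated with a maximum entropy algorithm.",
        "type": "Dataset",
        "format": "image/tiff",
        "date": "2014-08-01",
        "source": "occurrence records from an online biodiversity database",
        "relation": "http://example.org/models/panthera-onca/previous",
    }

    record = ET.Element("metadata")
    for element, value in model_metadata.items():
        ET.SubElement(record, f"{{{DC}}}{element}").text = value

    ET.ElementTree(record).write("model_metadata.xml",
                                 encoding="utf-8", xml_declaration=True)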
4. Conclusion
This stage of the research shows that the use of Dublin Core can assist in ensuring interoperability between the models generated by species distribution modeling tools.
The Dublin Core standard has been a reference point for standardizing data availability and visualization, and researchers would likely accept its adoption as part of the basis for connecting species distribution modeling tools.
4.1. Future research
As future work, we suggest creating an ontology based on the Dublin Core standard to ensure interoperability between tools, and evaluating the use of Dublin Core in the tools and portals that support biodiversity conservation.
References
DCMI. (2012). Dublin Core Metadata Element Set, version 1.1: Reference description. Retrieved April 12, 2014,
from http://dublincore.org/documents/dces/
Peterson A. T., Knapp S., Guralnick R., Soberón J. & Holder M. (2010) Perspective: The big question for biodiversity
informatics. Systematics and Biodiversity. The Natural History Museum. 8(2) 159-168.
Peterson, A. T., Soberón, J., Pearson, R. G., Anderson, R. P., Martínez-Meyer, E., Nakamura, M. & Araújo, M. B.
(2011) Ecological Niches and Geographic Distributions. United Kingdom: Princeton University Press. 328.
Soberón J. & Nakamura M. (2009) Niches and distributional areas: Concepts, methods, and assumptions. PNAS
106(2).
Project Report: Building Bridges to the Future of a Distributed
Network: From DiRT Categories to TaDiRAH, a Methods
Taxonomy for Digital Humanities
Jody Perkins
Miami University
[email protected]
Quinn Dombrowski
University of California
[email protected]y.edu
Luise Borek
Technical University of Darmstadt
[email protected]
Christof Schöch
University of Würzburg
[email protected]
Keywords: digital humanities; taxonomies; research methods; ontologies; RDF; Resource
Description Framework; LOD; Linked Open Data
1. Project Background
This poster presentation traces the development and application of ‘TaDiRAH’ (Taxonomy of
Digital Research Activities in the Humanities), a shared taxonomy of digital humanities research
goals and methods (e.g. capture, enrichment, analysis), objects (e.g. data, images, manuscripts),
and techniques (e.g. cluster analysis, encoding, topic modeling) created for the purpose of
bridging the divide between related digital humanities hubs.
Earlier efforts to establish centralized hubs of information relevant to digital humanities (DH)
have proven unsustainable over the long term. These comprehensive hubs (such as arts-humanities.net, a European initiative which previously aggregated information about events, jobs,
news, projects and tools) are currently being re-designed with a smaller scope and more focused
curation. However, this smaller scope comes with the risk of decontextualization—a digital
humanities project is best understood through the intersection of its subject matter, methodologies
and applications, not all of which are captured by any single site.
An example of a focused directory is the DiRT (Digital Research Tools) Directory, an
established, well-regarded source of information about tools available to support scholarship in
the humanities. DiRT is currently undergoing a new phase of development, with the goal of
making information about digital tools available outside the DiRT directory itself using RDF and
APIs.1 However, the ad-hoc set of categories that has been used to organize tools on DiRT since its inception is of no utility outside DiRT itself. Adopting a shared taxonomy would provide a
means to connect DiRT’s tool data with related information provided by other sites.
2. Development Process
Early in 2013, as part of an effort to improve usability of the site, members of the DiRT
Steering Committee/Curatorial Board conducted an analysis of DiRT’s categories and free-form
tags. Shortly thereafter we began a series of discussions with the DARIAH-DE (Digital Research
Infrastructure for the Arts and Humanities-Germany) team that was developing a taxonomy for
their ‘Doing Digital Humanities’ Zotero bibliography. Recognizing our common goal, we formed
a transatlantic collaboration around the task of developing a shared taxonomy.
In the process of developing TaDiRAH we drew from three primary sources: 1) the arts-humanities.net taxonomy for DH projects, tools, centers, and other resources, especially as it has
1 http://dirtdirectory.org/development
been expanded by [email protected] in the UK and DRAPIer (Digital Research and
Projects in Ireland); 2) the DiRT categories for digital research tools, re-launched under Project
Bamboo in the US but now continuing on after the end of that project; and 3) the scheme used by
the DARIAH ‘Doing Digital Humanities’ Zotero bibliography to organize literature on all facets
of DH. These resources were mapped, analyzed and distilled into their essential parts, producing
a simplified taxonomy of two levels: eight top-level “goals” that are broadly based on the steps of
the scholarly research process and a number of lower-level “methods” associated with each goal.
In addition, there are two separate open ended lists of digital humanities research “objects” and
“techniques” that can be freely associated with higher level methods.
In September 2013, and again in January 2014, we opened a draft version of the taxonomy for
public comment and received a tremendous amount of feedback from the DH community. The
response shows the ongoing relevance of a task that has been under discussion in digital
humanities circles since John Unsworth introduced his concept of 'scholarly primitives' in 2000.
We hope that one outcome of this presentation will be to extend the conversation beyond the
boundaries of the DH community.
3. Challenges and Future Work
This presentation will also cover some of the challenges encountered during TaDiRAH’s
development, including: selection of terms that facilitate consistent application vs. terms that
represent entities in a more precise manner2, avoiding conflation of concepts, reconciling terms
against existing taxonomies, minimizing redundancy, balancing theoretical “correctness” on one
hand against the necessity of adopting commonly used terms to ensure findability on the other
(e.g. visualization + geospatial coordinates object vs. “mapping”), and responding to thorough
(and sometimes conflicting) feedback from the digital humanities community.
We will also present several use cases based on the shared taxonomy, demonstrating how it
will work to serve both task and user-oriented endeavors. Applying TaDiRAH to actual
directories will provide an opportunity to assess the degree to which it can accommodate real-world data. In the coming months we will conduct a comprehensive review of all DiRT tool
entries, adding terms from the TaDiRAH taxonomy. DHCommons will also add TaDiRAH terms
to project profiles based on existing free-form metadata. Information from DiRT and
DHCommons will be exposed using RDF, making the content available as linked open data, as
well as through APIs that are currently under development.
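A minimal sketch of what such linked open data output could look like is given below, using the Python rdflib library; the tool entry and the TaDiRAH concept URIs are illustrative placeholders, not the identifiers that DiRT or DHCommons will actually publish.

    # An illustrative sketch of a tool entry tagged with a TaDiRAH term and
    # exposed as RDF; all URIs are placeholders, not actual published identifiers.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS, RDF, SKOS

    TADIRAH = Namespace("http://tadirah.dariah.eu/vocab/")  # hypothetical base URI

    g = Graph()
    g.bind("dcterms", DCTERMS)
    g.bind("skos", SKOS)
    g.bind("tadirah", TADIRAH)

    tool = URIRef("http://dirtdirectory.org/resources/example-tool")  # hypothetical entry
    topic_modeling = TADIRAH["topicModeling"]                         # hypothetical concept

    # The tool entry, tagged with a TaDiRAH activity via dcterms:subject.
    g.add((tool, DCTERMS.title, Literal("Example Topic Modeling Toolkit")))
    g.add((tool, DCTERMS.subject, topic_modeling))

    # A minimal SKOS description of the taxonomy term itself.
    g.add((topic_modeling, RDF.type, SKOS.Concept))
    g.add((topic_modeling, SKOS.prefLabel, Literal("Topic Modeling", lang="en")))
    g.add((topic_modeling, SKOS.broader, TADIRAH["analysis"]))  # hypothetical broader goal

    print(g.serialize(format="turtle"))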
The “Doing Digital Humanities” bibliography curated by DARIAH-DE has already
implemented the TaDiRAH taxonomy. The Zotero-based bibliography is using “collections”
(similar to subfolders) for the seven broad goals, and the tags for the research activities, objects
and techniques. Each entry is tagged with at least one activity and one object to enable a faceted
browsing of the bibliography, starting with either research activities or objects. Most recently
TaDiRAH has been adopted by two additional DARIAH initiatives: the Digital Humanities
Course Registry and the Training Materials Collection (Schulungsmaterial-Sammlung).
DARIAH-EU has committed to using this taxonomy as a basis for their development of a more
complex ontology of digital scholarly methods, and we are also engaged in ongoing dialog with
other ontology initiatives, including NeDiMAH’s (Network for Digital Methods in the Arts and
Humanities) work around scholarly methods. Our goal is to share at least high-level categories
with NeDiMAH’s ontology, so that objects (projects, tools, articles, etc.) classified using our
taxonomy can be automatically “mapped” to some level of the NeDiMAH ontology, and vice
versa.
2 While the use of specific terms supports precision, the use of more broadly defined terms tends to provide better support for consistent application, collocation and recall. In the context of search, precision and recall are often inversely related.
The projects and collections that adopt TaDiRAH will also inform its evolution. TaDiRAH can
be found online at GitHub3, where we will be using the issue tracker to collect further feedback to
be incorporated into future revisions. A SKOS version, soon to be available on the GitHub site, and a SPARQL endpoint through a TemaTres instance are currently in development. We expect
that TaDiRAH will continue to evolve as a relatively flexible scheme of associated scholarly
methods, techniques and object types that can be applied to a variety of DH resources.
Acknowledgements
The authors would like to recognize the work of Matt Munson, who was a member of the
TaDiRAH Coordinating Committee during its initial development phase.
References
Digital Humanities Course Registry, DARIAH-DE Cologne / CLARIAH-NL Rotterdam. (2014). Retrieved from
http://dhcoursereg.hki.uni-koeln.de/
DARIAH-DE (Digital Research Infrastructure for the Arts and Humanities-Germany). (2014). Retrieved from
https://de.dariah.eu/
DHCommons. (2014). Retrieved from http://dhcommons.org/
DiRT (Digital Research Tools Directory). (2014). Retrieved from http://dirtdirectory.org/
Doing Digital Humanities - A DARIAH Bibliography. (2014). Retrieved from https://www.zotero.org/groups/doing_digital_humanities_-_a_dariah_bibliography/items/order/creator/sort/asc
NeDiMAH (Network for Digital Methods in the Arts and Humanities). (2014). Retrieved from http://www.nedimah.eu/
Schulungsmaterial-Sammlung, DARIAH-DE Würzburg/Cologne. (2014). Retrieved from https://de.dariah.eu/schulungsmaterial-sammlung
TaDiRAH (Taxonomy for Digital Research Activities in the Humanities). (2014). Retrieved from https://github.com/dhtaxonomy/TaDiRAH and http://tadirah.dariah.eu.
Unsworth, John. (2000). Scholarly Primitives: What Methods Do Humanities Researchers Have in Common, and How
Might Our Tools Reflect This? London: King’s College London. Retrieved May 16, 2014 from
http://people.brandeis.edu/~unsworth/Kings.5-00/primitives.html.
3 http://github.com/dhtaxonomy/TaDiRAH
Metadata Workflows Across Research Domains: Challenges and
Opportunities for Supporting the DFC Cyberinfrastructure
Adrian Ogletree
Drexel University
[email protected]
Keywords: metadata workflows; metadata generation; DataNet Federation Consortium (DFC);
research data; cyberinfrastructure.
1. Introduction
This poster presents research results from a survey studying metadata workflows. In the
context of this study, a ‘metadata workflow’ is defined as a workflow that generates metadata for
a data collection. The following research question guided this investigation: Where are people
(and automated processes) creating metadata in the data life cycle, and what could be done to
improve the quality?
2. Background
Metadata is necessary to find, use, and properly manage scientific data. Sharing metadata
workflows across different communities is thus crucial for promoting data interoperability and
reuse. The DataNet Federation Consortium (DFC) is a project within the NSF Office of CyberInfrastructure DataNet initiative. One widespread problem that the DFC seeks to address is the
unfortunate reality that “many scientific fields lack a common integrated data infrastructure,
which often results in non-standardized, local data management practices” (Akmon, 2011, pp. 330-331). Carole Goble, Robert Stevens, Dave De Roure, and others have made significant
contributions to the study of e-science workflows and reproducibility. In addition, Taverna and
Kepler are two open-source, community-driven, scientific workflow management systems with
large user bases in the eScience community (Taverna; Kepler). However, data management needs
vary substantially across disciplines. Willis, Greenberg, and White (2012) call for future research to
examine in greater detail the “community-specific practices and workflows as well as constraints
caused by the technological environment and trends at the time of scheme creation” (p. 1517).
3. Methodology
A survey was distributed via e-mail to the DFC listserv in order to better understand how
scientific metadata is created. DFC scientists, researchers, and data curators involved in any
aspect of creation or use of scientific metadata were invited to participate in this study.
4. Results and Discussion
Fourteen (14) participants responded to the survey, representing a 34% response rate (the DFC
listserv contains 41 members). They were affiliated with eight different DFC project partners: the
Ocean Observatories Initiative (OOI),1 the iPlant Collaborative,2 the Odum Institute for Research
in Social Science,3 the National Oceanic and Atmospheric Administration (NOAA),4 the
Renaissance Computing Institute (RENCI),5 the University of Virginia, the Data Intensive Cyber
1 http://oceanobservatories.org/
2 http://www.iplantcollaborative.org/
3 http://www.odum.unc.edu/odum/home2.jsp
4 http://www.noaa.gov/
5 http://www.renci.org/
Environments (DICE) Center,6 and the School of Information and Library Science at the
University of North Carolina at Chapel Hill. The participants’ fields of study included hydrology,
biology, climatology, ecology, library sciences, computer science, engineering, social sciences,
and information science. The composition of the participants’ positions were as follows: 2
professors, 1 associate professor, 1 assistant professor, 1 postdoc researcher, 1 doctoral student, 2
master’s students, 2 administrators, 1 software engineer, 1 scientific analyst, and 1 IT project
team lead (one participant did not respond to this question). Five (5) of the participants had 5 to
10 years of research experience.
The following types of data were created or used in the participants’ research: observational
data (7), papers (7), simulation data (4), laboratory experimental data (3), “other” (3), and field
experimental data (1). Participants were asked to select all that apply. Observational data has the
most long-term value for researchers because it is often unique, irreplaceable, or costly to collect
(Anderson, 2004).
Figure 1 below shows metadata creation by a person and metadata creation or capture by a
computer. Participants were asked to select all that apply; for instance, some researchers add
metadata at every point within the data collection process. Eight (8) of the participants who
responded to this question manually create metadata before data is collected, 10 manually create
metadata during data collection, and all 12 manually create metadata afterward. Only 2 of the
participants report that computer-generated metadata is created before data is collected; 9 report
that automated metadata creation occurs during or after data collection, with one respondent
selecting “other,” who had no automated metadata collection. Data management best practices
recommend that data documentation happen at the very beginning of the research project, before
data collection. However, these results indicate that more scientific metadata is created during or
after the data collection process than before, and that few researchers take advantage of
automated metadata generation workflows.
FIG. 1. Metadata creation by humans and automated processes.
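As one generic illustration of the kind of automated metadata generation that the responses suggest is underused, a collection script could write a small sidecar record for each newly captured file; the field names and the example file below are hypothetical rather than drawn from any DFC partner's workflow.

    # A generic, hypothetical sketch of automated metadata capture at collection time.
    import hashlib
    import json
    from datetime import datetime, timezone
    from pathlib import Path

    def capture_metadata(data_file: Path, instrument: str) -> Path:
        """Write a sidecar JSON record alongside a newly collected data file."""
        payload = data_file.read_bytes()
        record = {
            "file": data_file.name,
            "size_bytes": len(payload),
            "sha256": hashlib.sha256(payload).hexdigest(),
            "captured_at": datetime.now(timezone.utc).isoformat(),
            "instrument": instrument,            # e.g. a sensor or simulation code
            "creator_type": "automated process",
        }
        sidecar = data_file.parent / (data_file.name + ".metadata.json")
        sidecar.write_text(json.dumps(record, indent=2))
        return sidecar

    # capture_metadata(Path("cast_042.nc"), instrument="CTD profiler")  # hypothetical file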
Six (6) of the participants reported that their organization has a specified standard in place for
creating metadata. The following metadata schemes were used: Dublin Core (7), “Other” (7),
FGDC (2), NetCDF Climate and Forecast (CF) (2), “Don’t know” (1), EML (1), and “No
standard scheme is used” (1). Participants were asked to select all that apply. Six (6) of the
participants who selected “other” named the following metadata schemes: free tag AVU in irods,
MIxS, DDI (2), WaterML, and GML. Based on the survey results, many different metadata
schemes were used, consistent with Greenberg’s (2005) study of digital repositories that
“hundreds of metadata schemes [are] being used, many of which are in their second, third, or nth
iteration” (p. 18).
6 http://dice.unc.edu/
When asked what information another researcher would need to reproduce their research,
responses include: information about workflows, highly specialized knowledge, software, or
equipment, and/or algorithms and parameters used. Similarly, Borgman (2012) observes that
research reproducibility requires “the precise duplication of observations or experiments, exact
replication of a software workflow, degree of effort necessary, and whether proprietary tools are
required” (p. 17). Without contextual information and high-quality metadata, even “open” data is
unusable.
5. Conclusions
Overall, the results met expectations based on other similar studies of scientists’ data
management practices and perceptions (Akers, 2013; Anderson, 2004; Borgman, 2012; Chavan &
Penev, 2011; Greenberg, 2005). The following list represents the key findings of this survey:
• More than half (58%) of participants create or use observational data
• Metadata is more likely to be created after data collection
• Scientists and researchers suffer from a lack of awareness of metadata standards
• Data sharing is complicated by the need for highly specialized knowledge, software, and/or equipment in order to reproduce research
This study makes a contribution towards methods of survey design for the purposes of
studying metadata workflows. Although the responses to this survey represent multiple scientific
disciplines, positions, and institutions, this study was limited by the small sample size. Future
research should include larger populations, and different research domains can be categorized in
order to study the similarities and differences of data management needs between communities.
Another area of interest for the DFC is the ability of the iRODS data grid to capture the
provenance information associated with execution of a workflow. This research could be useful
for creating a definition of a sufficient context to enable re-use of data.
Acknowledgements
I would like to thank Jane Greenberg and Reagan Moore for their guidance and support. The DFC
DataNet is supported by the National Science Foundation, OCI-0940841.
References
Akers, Katherine G. and Jennifer Doty. (2013). Disciplinary differences in faculty research data management practices and perspectives. The International Journal of Digital Curation, 8(2), 5-26. doi:10.2218/ijdc.v8i2.263
Akmon, Dharma, Ann Zimmerman, Morgan Daniels, and Margaret Hedstrom. (2011). The application of archival
concepts to a data-intensive environment: Working with scientists to understand data management and preservation
needs. Archival Science, 11(3-4), 329-348.
Anderson, William L. (2004). Some challenges and issues in managing, and preserving access to, long-lived collections of digital scientific and technical data. Data Science Journal, 3, 191-201.
Borgman, Christine L. (2012). The conundrum of sharing research data. Journal of the American Society for
Information Science and Technology, 63(6), 1059-1078. doi:10.1002/asi.22634
Greenberg, Jane. (2005). Understanding metadata and metadata schemes. Cataloging & Classification Quarterly, 40(3-4), 17-36. doi:10.1300/J104v40n03_02
Kepler. Retrieved from https://kepler-project.org/
Taverna. (2014). Retrieved from http://www.taverna.org.uk/
Willis, Craig, Jane Greenberg, and Hollie White. (2012). Analysis and synthesis of metadata goals for scientific data.
Journal of the American Society for Information Science and Technology, 63(8), 1505-1520.
A Cooperative Project by Libraries and Museums of China:
Metadata Standards for the Digital Preservation of Cultural
Heritage
Ying Feng
CALIS Administrative Center,
China
[email protected]
Long Xiao
Peking University Library,
China
[email protected]
Keywords: cultural heritage; digital preservation; metadata standards; descriptive metadata;
administrative metadata; preservation metadata.
This poster introduces a project that aims to build metadata standards for digital preservation
of cultural heritage. The research and demonstration will be carried out through a collaborative effort among seven libraries and museums.
1. Background and Objectives
In addition to preserving cultural heritage, the objective of digitizing cultural heritage is to share it and its related knowledge in an effective, rapid, and convenient manner in a networked environment, and to provide information and knowledge services relating to that heritage. At present, a number of museums in China have been digitizing their cultural heritage. However, it is difficult to integrate, share, and apply these digital outcomes due to the lack of uniform standards. At the same time, a large amount of cultural heritage remains to be digitized. To avoid repeating these problems, a standard metadata system for digital cultural heritage is required for comprehensive information organization, description, management, and preservation. Additionally, other standards, such as a classification system for cultural heritage, are also needed for building a knowledge database of digital cultural heritage. Thus, it is urgent to establish uniform metadata standards for the digital preservation of cultural heritage.
Metadata Standards for Digital Preservation of Cultural Heritage is one of the key research areas and a sub-project of the Research and Demonstration Project on Standard Systems and Key Standards for Digital Preservation of Cultural Heritage, funded by the Ministry of Science and Technology of China this year. The objectives focus on the demands of business management, digitization, management of digital content, long-term preservation of digital content, and the establishment of a knowledge database for cultural heritage. The research will be based on existing metadata standards and will take the application logic of the digital preservation of cultural heritage as its starting point. It will then construct the metadata framework, core standards, description standards, administrative and preservation standards, and application technology specifications for the digital preservation of cultural heritage, thereby standardizing metadata generation during the digitization and preservation of cultural heritage, supporting and promoting the construction of digital preservation for cultural heritage, and driving the research, presentation, application, and development of cultural heritage preservation.
Composition of the project team: Seven entities are involved in research as follows: Peking
University, the Palace Museum, Dunhuang Research Academy, National Library of China,
Zhejiang University, Tsinghua University and University of Science and Technology of China,
with Peking University being the team leader.
Project development timeframe: 2014–2017.
FIG. 1. Composition of the Research and Demonstration Project on Standard Systems and Key Standards for Digital
Preservation of Cultural Heritage
2. Key Barriers
The cultural heritage metadata standards developed under this research project must be able to describe the basic information of cultural heritage and meet the needs of business activity, while fulfilling the application requirements for digitizing cultural heritage and constructing the knowledge database. Flexibility, scalability, and applicability also need to be considered. The key barriers and difficulties are as follows:
1. The research and development of a metadata framework for the digital preservation of cultural heritage. This is a fundamental technical issue, and it will directly affect how scientific and rational the framework is. The difficulties include:
• Revealing the properties, digitization, related business activities, and knowledge database construction of movable and immovable cultural heritage, as well as studying and analyzing the corresponding application requirements for cultural heritage metadata standards.
• Abstracting application requirements and building relationships among the concepts, as well as constructing a metadata information model that meets the requirements of the client.
2. The establishment of a cultural heritage classification system. A cultural heritage classification system needs to account for the characteristics of both the digital objects of cultural heritage and the physical entities. How scientific and practical the system is will directly influence the segmentation and design of the descriptive metadata standards. Given the complex nature of cultural heritage, it is relatively difficult to construct a scientific and rational classification system.
FIG. 2. The relationship between the cultural heritage classification systems and metadata standard system.
3. Design of the descriptive metadata system and specific metadata standards. The difficulties include the following:
• In order to meet different application requirements for cultural heritage metadata, a modular, scalable, generic, and customizable descriptive metadata system and specific metadata standards are needed.
• Making use of and integrating the various types of already digitized cultural heritage content to build a foundation for information sharing and the overall presentation of cultural heritage metadata.
4. Research and design of administrative and preservation metadata standards. Difficulties of
abstracting and generalizing the application requirements of administrative and preservation
metadata arise because of different business processes and management approaches among
different cultural organizations. It is also difficult to design a practical and scalable framework for
administrative and preservation metadata.
5. The research and development of metadata application profiles, which are shown in Figure 3.
FIG. 3. Composition of metadata application profiles.
3. Design Principles and Expected Results
The design of cultural heritage metadata focuses on the digital objects of cultural heritage in conjunction with the physical entities. Meanwhile, the following principles will also be considered: simplicity and accuracy, specificity and versatility, scalability and sustainability, interoperability and openness, and user requirements and applicability.
The following are expected results.
• The metadata framework for the digital preservation of cultural heritage: includes general principles of cultural heritage metadata, the metadata system for cultural heritage, the metadata information framework for cultural heritage, the core metadata set and its application guidelines (a minimal sketch of such a record follows this list), the descriptive metadata application specification, and specific metadata design principles for cultural heritage.
• Classification systems for cultural heritage, for both digital and physical objects.
• Specific metadata standards for cultural heritage: includes 12 specific metadata standards with their cataloging rules and application guidelines for movable cultural heritage, and 7 specific metadata standards with their cataloging rules and application guidelines for immovable cultural heritage.
• Administrative and preservation metadata standards for the digital preservation of cultural heritage: includes the metadata framework, element set, and application guidelines for administrative and preservation metadata for cultural heritage.
• Application profiles of metadata standards for the digital preservation of cultural heritage: includes the metadata identification system, encoding rules, metadata packaging and exchange specifications, access protocol, and open mechanisms.
References
CIDOC CRM (2013). Definition of the CIDOC Conceptual Reference Model. Retrieved June 10, 2013, from http://www.cidoc-crm.org.
MIDAS Heritage (2012). MIDAS Heritage: the UK Historic Environment Data Standard. Retrieved July 2, from http://www.english-heritage.org.uk/publications/midas-heritage/
Undressing Fashion Metadata:
Ryerson University Fashion Research Collection
Naomi Eichenlaub
Ryerson University,
Canada
[email protected]
Marina Morgan
Ryerson University,
Canada
[email protected]
Ingrid Masak-Mida
Ryerson University,
Canada
[email protected]
Keywords: fashion; metadata schema; metadata mapping; cataloguing; digital collections;
digitization; Dublin Core; VRA Core.
1. Abstract
The purpose of this poster is to provide insight into the processes involved in making a unique
fashion research and teaching collection discoverable in an online environment at Ryerson
University. The online collection will provide a means for the users to identify what artifacts are
available for research purposes and facilitate teaching in the classroom. The poster will highlight
effective metadata standards and elements, cross-domain metadata uses, metadata mapping and
implementation
2. Introduction
The Ryerson University Fashion Research Collection project, a collaboration between the School of Fashion at Ryerson University and RULA (Ryerson University Library and Archives), consists of creating an online collection of images and metadata representing several thousand garments, accessories, designer clothing, and millinery donated from private collections and dating back to the latter part of the nineteenth century and the early twentieth century.
The key goals of this online collection are to promote research, teaching and learning at
Ryerson University, and to connect with a broader community by building scholarly, online
exhibitions. Once finalized, it will be used as a pedagogical tool and it will inspire fashion
students and scholars to undertake research into fashion history.
3. Background
Ryerson University Library and Archives has partnered with the Ryerson School of Fashion to
increase access to a unique collection of fashion items. The collection was housed in
unfavourable conditions in a locked room in the library for many years and was relatively
unknown to students. It was recently relocated to a series of rooms in the School of Fashion
building. The collection is now in the process of being curated by its collection coordinator, who
received a grant to digitize a portion of the collection in 2012. Currently, only a very limited amount of information is available about the Ryerson Fashion Research Collection through a blog
and a Pinterest site. Initially, a sample of the collection was loaded on to Pinterest as a means of
both engaging students as well as exposing the collection to the world. The social media platform,
however, has limited search functionality and virtually no descriptive metadata beyond an item
description box.
Zeng (2009) asserts that the physical access restrictions common to most collections of
historical fashion result from “delicate artifacts and by the inaccessible nature of many costume
collection storage facilities”. The online collection, however, will move access to item descriptions from what was once multiple sets of Excel spreadsheets onto a searchable platform that will
allow students, faculty and staff to have a more robust discovery, searching and browsing
experience.
4. Research Significance
The research significance of the Ryerson Fashion Research Collection is three-fold. First, it
will provide a venue for a greater expanse of fashion online exhibits, a pedagogical tool that will
allow Ryerson students to learn and research, to foster students’ interaction and participation, and
to explore a rich yet previously inaccessible fashion collection at Ryerson University. Second, it
will allow us to build on and implement future specific collections, to foster a connection between external and internal users, and to promote and improve online access in ways that add value to the existing collection. Third, digital access will preserve the valuable collection, but at the
same time will allow researchers, students, and the public to have “visual access to an entire
collection without needlessly disturbing the garments and their accessories” (Zeng, 1999).
5. Metadata Implementation and Challenges
Very little has been written or published about the digitization of fashion collections and
specifically about appropriate metadata schema for optimizing access and discovery of fashion
object collections. The question of appropriate descriptive elements for use in fashion collection
metadata records was noted by Marcia Lei Zeng in her article Metadata elements for object
description and representation: a case report from a digitized historical fashion collection
project specifically due to the three-dimensional nature of fashion artifacts (Zeng, 1999).
As Lampert and Chung (2011) argued, before developing and designing a digital collection,
there are various technical questions to consider, such as thoroughly assessing various feature sets
of different systems and making informed decisions to seek out appropriate solutions.
Consequently, we evaluated and analyzed several web-publishing platforms, both proprietary and
open source, and metadata standards that would better fit our criteria. Simplicity of installation of
the software and metadata ingest were very important to us, particularly since we were working
within a fairly short timeline. As well, metadata adaptability and interoperability, import and export functionality for standard data formats, and flexible approaches to various plug-ins (specifically the OAI-PMH Harvester) factored into our selection criteria.
Judging against these expectations, we evaluated three possibilities that would best fit the selection criteria mentioned above: ICA-AtoM, Shared Shelf (ARTstor), and Omeka.
Platform | Metadata Standard | Customizable | One-to-Many Relationship | OAI-PMH | Cost
AtoM | Various | No | No | Yes | Free
Shared Shelf (ARTstor) | VRA Core | Yes | No | Yes | Subscription
Omeka | Dublin Core, VRA | Yes | Yes | Yes | Free
Fig. 1 Comparison of possible web publishing platforms and metadata standards
AtoM (Access to Memory) is a web-based, dynamic open source application for standards-based archival description and access, allowing various import and export formats and supporting ICA and non-ICA standards (RAD, Dublin Core, and MODS). Shared Shelf is media management software that enables the management, storage, use, and publishing of institutional and faculty media collections within an institution or publicly on the Web. Though highly customizable, complex, and flexible, the platform did not allow for batch loading of multiple images per individual item described (one-to-many relationships).
Omeka, a free, flexible, and open source web-publishing platform, on the other hand, allows
the expansion of its core functionality with existing plugins to create maps, to allow users to tag
favorites, and to create dynamic and robust online exhibits, thus tying closely with the
pedagogical requirements of this project.
There are several other products on the market that offer features similar to Omeka. However,
when it comes to providing the rich visual context or exhibiting collections, as today’s web users
would expect, these platforms may be less effective, often difficult to adopt, and more expensive
to maintain than Omeka. Motivated to create digital collections due to the educational imperative
to share their collections with the public, the academic world is often facing “restricted budgets
and staffing issues” as Sauro (2009) argues. Faced with the same budget and staffing restrictions,
we decided to go with the most flexible and cost effective solution for our project, Omeka. The
decision was also based on the variety of features that Omeka offers. As highlighted by Kucsma,
Reiss, and Sidman (2010), Omeka allows a strong and flexible approach to metadata representation, straightforward plug-in deployment, custom creation of item types, and the
addition of the full set of Dublin Core properties to the existing Dublin Core element set,
including element refinements and supplemental elements.
Metadata element selection and metadata mapping was the next challenge. Selecting the
metadata elements and refining the specifications is closely tied to the end-user usage patterns
and item description choices. The stakeholders (users, faculty, and digital collection creators)
have different ideas about what is useful in the collection. We learned that faculty use the
collection for their own research as well as to enhance classroom learning; external researchers,
visiting scholars, and curators are interested in a particular designer, period or type of artifact
(e.g. 19th century hair accessories). We also learned that students seek access to the collection in
different ways: to establish the specific material of a garment, or identify the type of stitching or
other manufacture processes. Although the search strategy depends on the research question
being asked, in general the primary search terms would be for a particular type of garment
(corset, dress, coat, hat), period (1920s, 1950s, 1960s), designer (Balenciaga, Dior, Balmain),
construction type (bias cut, inset sleeves), colour (yellow, orange, red), or textile (silk, linen,
cotton). Consequently, we consulted with the curator in order to determine the metadata elements, highlighting the benefits of certain elements. These possible usage patterns directly influenced the item descriptions and metadata elements; what fits the faculty curricula versus what fits the interests of students or researchers directly shaped those choices.
Zeng (1999) references the difficulty of locating appropriate text to use as a title, an issue we faced
as well. Discussion of the requirement for a title in each record was necessary, as it was
understood that students would often be searching by accession number instead of title.
Consequently, the curator of the Ryerson Fashion Collection created titles for each item using her
expert knowledge in the field, adding mostly general terms such as “evening dress”, followed in
most cases by slightly more specific terms including the gender, colour, or shape of the garment,
for example “Green wool men's tailcoat with black satin lapel and black wool vest”.
Equally important when working with the metadata for these fashion items is the information
needs of the students using the collection. In terms of providing subject access to the collection,
after some discussion regarding the merits of additional subject access points we agreed that we
would use the Art & Architecture Thesaurus (Getty Research Institute) (AAT). We had also
considered using the Thesaurus of Graphic Materials (TGM) or Library of Congress (LC)
Subjects, but our examination of fashion headings in the AAT revealed that it had better coverage
in terms of fashion subject specificity. For example, the AAT has a term for dresses (garments)
under which there are more than a dozen narrower terms including chemise dresses, coat dresses,
gowns, jumpers (dresses), maxi dresses, midi dresses, muumuus, overdresses, etc. However, there
are also limitations. For example, in using the AAT as it currently exists, certain commonly used
terms for garments such as the word “pants” or “tunic” cannot be used. Consequently, to make
items more discoverable, we used the tagging option in Omeka.
A number of other metadata fields posed challenges in meeting the information needs of the students while respecting the descriptive standards we were following. Rich description and details ended up mostly being mapped to one DC element: DESCRIPTION. VRA Core elements such as MATERIAL, MEASUREMENTS, CULTURE, or STYLE/PERIOD would seem more appropriate given the specificity of this collection. Omeka does have the option of implementing the VraCoreElementSet plugin developed by the Scholars’ Lab at the University of Virginia Library. Given the time and staffing constraints we have not been able to configure and test it; however, this is something that we could develop in the future.
Batch image loading and one-to-many relationships were another challenge we faced. To be
able to accomplish this task, we had two options: one was to use the OAI data to create a CSV
instead of using it to import directly into Omeka. This would allow us to add a column to the
CSV file with the location of the files for each item. The second option was to loop through the
images and identify the correct object ID, i.e. the description to which the image
will be linked. A script looped through each line of the CSV file, stored the accession number
into a variable, and for each accession number it looped through the filenames, adding only the
matching filenames at the end of the line in the CSV. When opening the resulting CSV, a new
column containing the matching filenames was created, thus allowing a smooth batch loading of
images and item description, including the metadata.
FIG. 2 Example of extra column added after executing the scripts
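The matching logic can be sketched as follows. This is an illustrative reconstruction, not the project's actual script: the CSV column name, the image directory, and the convention that filenames begin with the accession number are all assumptions.

import csv
import os

# Illustrative sketch of the matching step described above. Assumptions:
# the CSV has an "accession_number" column, image filenames begin with the
# accession number, and the matching filenames are written to a new "files"
# column (as in FIG. 2), separated by semicolons.
IMAGE_DIR = "images"
image_files = sorted(os.listdir(IMAGE_DIR))

with open("items.csv", newline="", encoding="utf-8") as src, \
        open("items_with_files.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["files"])
    writer.writeheader()
    for row in reader:
        accession = row["accession_number"].strip()
        matches = [f for f in image_files if f.startswith(accession)]
        row["files"] = ";".join(matches)
        writer.writerow(row)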
The last challenge was the collection’s discoverability. Necessary for a quality user experience,
the Fashion Research Collection should be seamlessly available via the library’s discovery layer.
However, this collection is not yet integrated into the library's discovery environment. In order
for this collection to be discovered, a record in the library catalogue and one in the University’s
repository will be created.
6. Future Research
Looking to the future, a number of international digital fashion collection projects provide
inspiration for the possibility of a similar Canadian initiative. The Europeana Fashion Portal is a
three-year project to aggregate the best of Europe’s fashion collections and has a number of goals
including improving interoperability and developing a specialized Fashion Thesaurus.1 Australia
also has a national initiative called the Australian Dress Register which showcases pre-1975 dress
with Australian provenance and encourages museums and private collectors to “research their
garments and share the stories and photographs while the information is still available and within
living memory”.2
Of particular relevance to future directions for online fashion resources is the possibility of
incorporating interactive functionality and social features into collections to allow for user-generated content (Lampert & Chung, 2011). Allowing users to contribute their knowledge about historical
items of little-known provenance, for example through tagging, can be an effective way to gather
information that collection curators might otherwise miss. Moreover, incorporating some of the
functionality of social curation sites such as Pinterest that allow users to create their own
personal digital collections is another possible future direction. The Omeka software has a
1. http://blog.europeanafashion.eu/about/
2. http://www.australiandressregister.org/about/
number of plugins that offer a level of user interaction, such as Comments, Exhibit Builder, and MyOmeka, the last of which allows item favouriting.
7. Conclusions
The Fashion Research Collection is a study collection consisting of several thousand artifacts, including garments, accessories, and ephemera such as photographs, magazines, and patterns. The collection is intended to support the research activities of students and faculty at Ryerson University as well as to serve as a means of engaging the outside community. This project met its overarching goals of increasing access to and discoverability of a unique collection of mixed-provenance but mostly Canadian fashion items. Furthermore, the collaboration between the School of Fashion and Ryerson University Library and Archives allowed subject matter experts in fashion, cataloguing, and metadata standards to work together on a project that provides community members and the public alike with access to a tool for research, teaching, and learning.
References
Artstor Digital Library. (2014). Shared Shelf Features. Retrieved from
http://www.artstor.org/shared-shelf/s-html/features.shtml
Dublin Core Metadata Initiative. (2012). DCMI Metadata Terms. Retrieved from
http://dublincore.org/documents/dcmi-terms/
ICA-AtoM. Retrieved from https://www.ica-atom.org
Kucsma, J., Reiss, K., & Sidman, A. (2010). Using Omeka to build digital collections: The METRO case study. D-Lib
Magazine, 16(3/4) doi:10.1045/march2010-kucsma
Lampert, C. & Chung, S.K. (2011). Strategic planning for sustaining user-generated content in digital collections.
Journal of Library Innovation, 2(2), 74-93.
Omeka Plugins. (2014). Retrieved from http://omeka.org/add-ons/plugins/
Sauro, C. (2009). Digitized historic costume collections: Inspiring the future while preserving the past. Journal of the
American Society for Information Science and Technology, 60(9), 1939-1941. doi:10.1002/asi.21137
Tzoc, E., & Millard, J. (2011). Technical skills for new digital librarians. Library Hi Tech News, 28(8), 11-15.
doi:10.1108/07419051111187851
Valentino, M. L. (2010). Integrating metadata creation into catalog workflow. Cataloging & Classification Quarterly,
48(6-7), 541-550. doi:10.1080/01639374.2010.496304
VRA Core Schemas and Documentation. (2007). Retrieved from http://www.loc.gov/standards/vracore/VRA_Core4_Element_Description.pdf
Zeng, M. L. (1999). Metadata elements for object description and representation: A case report from a digitized
historical fashion collection project. Journal of the American Society for Information Science, 50(13), 1193.
Best Practice Posters & Demonstrations
Best Practice Poster:
MARC to schema.org: Providing Better Access to UIUC Library
Holdings Data
Timothy Cole, University of Illinois at Urbana-Champaign, United States, [email protected]
Michael Norman, University of Illinois at Urbana-Champaign, United States, [email protected]
Patricia Lampron, University of Illinois at Urbana-Champaign, United States, [email protected]
William Weathers, University of Illinois at Urbana-Champaign, United States, [email protected]
Ayla Stein, University of Illinois at Urbana-Champaign, United States, [email protected]
M. Janina Sarol, University of Illinois at Urbana-Champaign, United States, [email protected]
Myung-Ja Han, University of Illinois at Urbana-Champaign, United States, [email protected]
Keywords: MARC; schema.org; MARCXML; bibliographic description; MODS; holdings data.
1. Introduction
Taking advantage of the Web as a means for disseminating large datasets, libraries have begun
publishing their bibliographic metadata on the Web—e.g., the University of Michigan,1 the
University of Florida,2 and Harvard University.3 Initially, most libraries focused on releasing their catalogs as MARCXML; however, MARC consists primarily of string data with few, if any, URIs linking to ontologies or related resources, and MARCXML was not designed for use with RDF.
Libraries are now experimenting with disseminating catalogs as linked open data in other
serializations, e.g., OCLC,4 and the British Library.5 Semantics compatible with RDF are being
used, but specific schemes vary. Detail about holdings associated with bibliographic descriptions
is still lacking, e.g., the volumes of a described serial title held by the library are not enumerated.
This last seems a significant omission given that libraries are uniquely positioned to provide this
information. The University of Illinois at Urbana-Champaign (UIUC) Library has released 5.5
million bibliographic catalog records that include detailed local holdings information to allow
consumers to know exactly which volumes or parts of the creative work described are available at
UIUC. MARCXML serializations are available for downloading now. MODS serializations
enriched with links to name and subject authorities and RDF serializations (using schema.org
semantics) will soon be available. This poster reports on the development of workflows for this
project, on the multiple formats of catalog metadata being made available through these
workflows, and on the lessons learned to date.
1. http://www.lib.umich.edu/library-information-technology/open-access-bibliographic-records-availabledownload-and-use
2. http://www.uflib.ufl.edu/catmet/creativecommons.html
3. http://openmetadata.lib.harvard.edu/bibdata
4. http://www.worldcat.org/
5. http://bnb.data.bl.uk/
2. MARCXML with physical holdings information
As a first step, we created MARCXML bibliographic descriptions for each physical volume
the library holds with selected volume-specific information (e.g., barcode) recorded in the 955
local data field. With a simple VB.NET program, we collapsed volume-level records associated
with a single bibliographic entity into one bibliographic record that contains all holding- and item-level information for associated volumes and parts in repeated MARC 852 data fields, as shown in
Figure 1.
<marc:datafield tag="852" ind1="0" ind2=" ">
<marc:subfield code="a">IU</marc:subfield>
<marc:subfield code="b">Rare Book &amp; Manuscript Library [noncirculating]</marc:subfield>
<marc:subfield code="h">099</marc:subfield> <!-- classification number -->
<marc:subfield code="i">Ab3</marc:subfield> <!-- cutter -->
<marc:subfield code="p">30112066264109</marc:subfield> <!-- barcode -->
<marc:subfield code="t">1</marc:subfield> <!-- copy number -->
</marc:datafield>
FIG 1: Example of MARC XML 852 data field used to record physical holdings
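The collapsing step was done with a VB.NET program; the following Python sketch shows comparable logic over MARCXML, under the assumptions that records for the same bibliographic entity share the same 001 control number and that each volume-level record already carries its own 852 data field.

import xml.etree.ElementTree as ET
from collections import OrderedDict

# Sketch only (the actual program was written in VB.NET). Assumes records
# for one bibliographic entity share the 001 control number and each
# volume-level record carries its own 852 data field.
MARC_NS = "http://www.loc.gov/MARC21/slim"
NS = {"marc": MARC_NS}
ET.register_namespace("marc", MARC_NS)

collection = ET.parse("volume_level_records.xml").getroot()  # hypothetical file
merged = OrderedDict()

for record in collection.findall("marc:record", NS):
    bib_id = record.find("marc:controlfield[@tag='001']", NS).text
    if bib_id not in merged:
        merged[bib_id] = record                    # first record is the base
    else:
        for field in record.findall("marc:datafield[@tag='852']", NS):
            merged[bib_id].append(field)           # repeat 852 per extra volume

out = ET.Element("{%s}collection" % MARC_NS)
out.extend(merged.values())
ET.ElementTree(out).write("collapsed_records.xml",
                          encoding="utf-8", xml_declaration=True)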
3. MODS Transformation & Adding Links
The transformation of MARCXML with holdings information in 852 data fields into MODS is
based on the Library of Congress (LC) MARC to MODS recommendations.6 (We differ slightly
from the LC mapping recommendations in how we treat enumeration/chronology, copy number,
and barcode.) Each 852 data field is mapped to a MODS <location> element. 852 subfield a is
mapped to the <location> sub-element <physicalLocation>; all other 852 subfields map to sub-elements of a single <copyInformation> element, within the <holdingSimple> sub-element of
<location>. Figure 2 displays the 852 data field of Figure 1 transformed to MODS.
<mods:location>
<physicalLocation displayLabel="Institution Code">IU</physicalLocation>
<holdingSimple>
<copyInformation>
<subLocation> Rare Book &amp; Manuscript Library [non-circulating]</subLocation>
<shelfLocator>099 Ab3</shelfLocator>
<note displayLabel="Copy Number">1</note>
<note displayLabel="Barcode">30112066264109</note>
</copyInformation>
</holdingSimple>
</mods:location>
FIG 2: MARC 852 data field transformed to MODS
After transforming MARCXML records to MODS, a Python script is invoked to search VIAF
for URIs matching values in the MODS <name> element, as transformed from MARCXML data
fields 100, 110, 111, 700, 710, 711, and 720. When found, URIs are added to the MODS <name>
element replacing the string values. When searching VIAF, we use complete name information,
birth date, and death date (as available). Only exact matches in VIAF are recorded. The same
script searches LCSH Linked Data Services7 to find subject heading URIs, which are then also
6. http://www.loc.gov/standards/mods/userguide/location.html
7. http://id.loc.gov/
added to the MODS <subject> element. If no match is found, the text string remains as the value
for the field.
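The lookup pattern (exact matches only, falling back to the original string) can be sketched roughly as follows. The VIAF AutoSuggest endpoint, its JSON field names, and the id.loc.gov label-resolution behaviour are assumptions that should be verified against the services' current documentation; the names queried are purely illustrative.

import requests

def viaf_uri(name, birth=None, death=None):
    """Return a VIAF URI only for an exact heading match (assumed API shape)."""
    heading = name if not (birth or death) else f"{name}, {birth or ''}-{death or ''}"
    resp = requests.get("https://viaf.org/viaf/AutoSuggest",
                        params={"query": heading}, timeout=30)
    for hit in resp.json().get("result") or []:
        if hit.get("term") == heading:                 # exact matches only
            return "http://viaf.org/viaf/" + hit["viafid"]
    return None                                        # keep the text string

def lcsh_uri(label):
    """Resolve an LCSH label via id.loc.gov's label service (assumed behaviour)."""
    resp = requests.head("http://id.loc.gov/authorities/subjects/label/" + label,
                         allow_redirects=False, timeout=30)
    return resp.headers.get("X-URI")                   # None when no exact match

print(viaf_uri("Twain, Mark", "1835", "1910"))
print(lcsh_uri("Coal mines and mining"))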
4. Transformation to RDF and schema.org
The MODS metadata enriched with links to name and subject authorities are transformed into
schema.org semantics. These are disseminated one-by-one as RDFa (within HTML styled for
presentation to end-users), via bulk downloading (as RDF/XML or JSON-LD), and via a
SPARQL endpoint. Transformation of bibliographic metadata from MODS to schema.org is
straightforward (though arguably the distinction between work and manifestation is further
blurred). However, transforming holdings to schema.org is challenging. Based on earlier
experimentation at OCLC and our interpretation of relevant W3C Schema Bib Extend
Community Group guidelines,8 we mapped each holding to a schema.org Offer entity.
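One plausible JSON-LD rendering of this modeling pattern (built here in Python; all identifiers and values are invented, and the exact property choices are not necessarily those used at UIUC) is:

import json

# Hypothetical illustration: a bibliographic resource with one physical
# holding modeled as a schema.org Offer. Values are taken loosely from the
# FIG. 1 example; URIs and property choices are illustrative only.
record = {
    "@context": "http://schema.org/",
    "@id": "http://example.org/catalog/123",
    "@type": "Book",
    "name": "Example title",
    "offers": {
        "@type": "Offer",
        "offeredBy": {"@type": "Library", "name": "UIUC Library"},
        "availableAtOrFrom": {
            "@type": "Place",
            "name": "Rare Book & Manuscript Library [non-circulating]"},
        "sku": "099 Ab3",                  # shelf locator
        "serialNumber": "30112066264109",  # barcode
    },
}
print(json.dumps(record, indent=2))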
Conclusion
The goal of this poster is two-fold:
• sharing with the community practices and workflow implementations developed at UIUC for disseminating traditional library data in multiple formats and serializations; and,
• gaining feedback on the mapping and modeling decisions made in transforming detailed MARC bibliographic and holdings data into linked open data.
8. http://www.w3.org/community/schemabibex/wiki/Holdings_via_Offer
Best Practice Poster:
The TR32DB Metadata Schema: A Multi-level Metadata Schema
for an Interdisciplinary Project Database
Constanze Curdt, University of Cologne, Germany, [email protected]
Dirk Hoffmeister, University of Cologne, Germany, [email protected]
Keywords: metadata; Dublin Core; research data; data repository; interdisciplinary
Abstract
The multi-level TR32DB Metadata Schema (Curdt, 2014) was designed and implemented to describe all of the heterogeneous data created by participants of an interdisciplinary research project with accurate, interoperable metadata. The metadata schema takes interoperability with current metadata standards and schemas into account. It is applied in the
CRC/TR32 project database (TR32DB, www.tr32db.de), a research data management system, to
improve the documentation, searchability and re-use of the data. The TR32DB is established for a
multidisciplinary, long-term research project, the Collaborative Research Centre/Transregio 32
‘Patterns in Soil-Vegetation-Atmosphere Systems: Monitoring, Modelling, and Data
Assimilation’ (CRC/TR32, www.tr32.de), funded by the German Research Foundation.
A key issue of research data management systems is the documentation of all research data
with accurate metadata (Greenberg et al., 2013). This is particularly important for long-term
research projects (Michener, 2006) and should follow recent metadata standards and schemas
(Jensen et al., 2011). Consequently, the TR32DB Metadata Schema is designed in a multi-level
approach combining several metadata schemas and standards, as well as data type and project
specific metadata elements to describe all heterogeneous data. Metadata elements of Dublin Core
are applied as a base schema. To meet the requirements of different TR32DB data types (data,
geodata, report, picture, presentation, publication), the Dublin Core metadata elements are
extended with further elements of metadata standards and schemes like ISO19115 Metadata
Standard1, INSPIRE2, as well as elements of the Bibliographic Ontology3 or the Event Ontology4.
In addition, metadata elements of the DataCite Metadata Schema Version 2.25 are incorporated. Furthermore, the TR32DB schema is expanded with its own metadata properties corresponding to
the TR32DB data types (e.g. measurement instrument and parameter), as well as to the
CRC/TR32 background (e.g. specific keywords, themes). The schema specifies a defined number
of metadata properties, including a core set of mandatory properties, as well as optional and
automatically generated properties (e.g. metadata creator and date). In addition, available and
TR32DB-specific controlled vocabulary lists are supported. A mapping to the applied metadata
standards is provided for interoperability.
In detail, the TR32DB Metadata Schema is arranged in two layers: a general layer and a
specific layer. The general layer enables the description of all data with basic details (e.g. title,
description, creator, subjects), which are required for all data types. The specific layer
complements the documentation of the data with specific metadata properties for each data type.
For example, datasets from the TR32DB data type ‘data’ can be described with specific
1. http://www.iso.org/iso/catalogue_detail.htm?csnumber=26020
2. http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2008:326:0012:0030:EN:PDF, http://inspire.jrc.ec.europa.eu/documents/Metadata/INSPIRE_MD_IR_and_ISO_v1_2_20100616.pdf
3. http://bibliontology.com/
4. http://purl.org/NET/c4dm/event.owl
5. http://schema.datacite.org/meta/kernel-2.2/doc/DataCite-MetadataKernel_v2.2.pdf
properties, such as the temporal extent (e.g. start/end date), the lineage, the used measurement
instrument (e.g. equipment group/method, model, manufacturer) and corresponding measurement
parameter. Furthermore, the datasets from the TR32DB data type ‘geodata’ can be described with
specific attributes, such as a temporal extent (e.g. start/end date), a lineage, the applied reference
system or spatial resolution. In addition, datasets from the TR32DB data type ‘report’ can be
described with additional attributes, like a report date, the report type (e.g. PhD report, master
thesis, fellow report), and the city or institution where the report was created. Moreover, datasets
from the TR32DB data type ‘picture’ can be described with a recorded date (e.g. start/end date),
the name of the recording place, the recording method and details about the recording event (e.g.
event type, name, location, website). Finally, the TR32DB data type ‘publication’ is an exception, because different publication types require various attributes. Consequently, an
‘article’ can be described, for example, with a type of article (e.g. journal, magazine), publication
source, publisher, volume, issue, pages, and page range. In contrast, an ‘event paper’ specifies
information about the event where the paper was presented. This includes the event name, the
location, and period. In addition, details about the proceedings title, the editor, as well as the page
range of the paper can be specified.
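As an illustration, a record of the data type ‘data’ combining general-layer and specific-layer properties might be represented along the following lines; the property names are paraphrased from the examples above and the values are invented, so this is not the schema's actual element set.

# Invented example of a TR32DB-style description for the data type 'data',
# combining a general layer and a type-specific layer. Property names are
# paraphrased from the text; they are not the schema's actual element names.
record = {
    "general": {
        "title": "Soil moisture time series, test site A",
        "description": "Hourly soil moisture measurements.",
        "creator": "Example Researcher",
        "subjects": ["soil moisture", "monitoring"],
    },
    "specific": {  # layer for the data type 'data'
        "temporalExtent": {"start": "2012-05-01", "end": "2012-10-31"},
        "lineage": "Raw sensor output, gap-filled.",
        "measurementInstrument": {"group": "soil moisture sensor",
                                  "model": "Example-100",
                                  "manufacturer": "Example GmbH"},
        "measurementParameter": "volumetric water content",
    },
}
print(record["general"]["title"], record["specific"]["measurementParameter"])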
CRC/TR32 participants provide the metadata for a dataset via the TR32DB web interface. A user-friendly, self-designed metadata input wizard enables the entry of the metadata. Data search through metadata is available to all visitors of the TR32DB website through predefined, advanced, and map search functions. As a result, a detailed overview of all available metadata for a selected dataset is provided, arranged according to the TR32DB Metadata Schema.
Overall, the interoperable TR32DB Metadata Schema allows the accurate description of all
heterogeneous data, generated by the CRC/TR32 participants. The multi-level approach enables a
simple enhancement of the schema according to changing requirements of the project
participants.
Acknowledgements
We would like to thank all colleagues involved in the design and implementation of the TR32DB.
In addition, we gratefully acknowledge financial support by the CRC/TR32 ‘Patterns in Soil-Vegetation-Atmosphere Systems: Monitoring, Modelling, and Data Assimilation’ funded by the
German Research Foundation (DFG).
References
Curdt, Constanze. (2014). TR32DB Metadata Schema for the Description of Research Data in the TR32DB. Cologne,
Germany: Transregional Collaborative Research Centre 32, Project Section Z1/INF, Institute of Geography,
University of Cologne. Retrieved July 1, 2014, from http://dx.doi.org/10.5880/TR32DB.10.
Greenberg, Jane, Swauger, Shea, & Feinstein, Elena. (2013). Metadata Capital in a Data Repository. Paper presented at
the International Conference on Dublin Core and Metadata Applications, DC-2013, September 2-6, 2013. Retrieved
July 1, 2014, from http://dcevents.dublincore.org/IntConf/dc-2013/paper/view/189/86.
Jensen, Uwe, Katsanidou, Alexia, & Zenk-Möltgen, Wolfgang. (2011). Metadaten und Standards. In S. Büttner, H.-C.
Hobohm & L. Müller (Eds.), Handbuch Forschungsdatenmanagement (pp. 83-100). Bad Honnef, Germany: Bock u.
Herchen.
Michener, William K. (2006). Meta-information concepts for ecological data management. Ecological Informatics, 1,
3-7.
Best Practice Poster:
Development of the EDDA Study Design Terminology to
Enhance Retrieval of Clinical and Bibliographic Records in
Dispersed Repositories
Ashleigh Faith, School of Library and Information Sciences, University of Pittsburgh, United States, [email protected]
Eugene Tseytlin, Department of Biomedical Informatics, University of Pittsburgh School of Medicine, United States, [email protected]
Tanja Bekhuis, Department of Biomedical Informatics, University of Pittsburgh School of Medicine, United States, [email protected]
Keywords: clinical records; bibliographic records; dispersed repositories; medical terminology; design process; study designs.
1. Background
Medical terminology varies across disciplines and reflects linguistic differences in
communities of clinicians, researchers, and indexers. Inconsistency of terms for the same
concepts and lack of machine-readable metadata impede discovery of information artifacts, such
as records of clinical reports and scientific articles that reside in various repositories. To facilitate
discovery, retrieval, and data sharing, the medical community maintains an assortment of
terminologies, thesauri, and ontologies. Valuable resources include the US National Library of
Medicine Medical Subject Headings (MeSH), Elsevier Life Science thesaurus (Emtree), and the
National Cancer Institute Thesaurus (NCIT). It is increasingly important to identify medical
investigations by their design features, as these have implications for evidence regarding research
questions.
2. Purpose
Recently, Bekhuis et al. (2013) found that coverage of study designs was poor in MeSH and
Emtree. Based on this work, the EDDA Group at the University of Pittsburgh is developing a
terminology of study designs. In addition to randomized controlled trials, it covers observational
or uncontrolled designs.
3. Methods
Among the resources analyzed thus far, inconsistent entry points, semantic labels, synonyms,
and definitions are common. The EDDA Study Design Terminology is freely available in the
NCBO BioPortal (http://purl.bioontology.org/ontology/EDDA). Some of the preferred terms have several variants and sometimes competing definitions; the terminology also records other concept identifiers useful for researchers. The beta version was developed using the Protégé ontology editor v.4.3
(http://protege.stanford.edu) and distributed as a Web Ontology Language (OWL) file. Dublin
Core Metadata Initiative (DCMI) protocols are in place for recording overall terminology
metadata and OWL annotations.
4. Results
At this preliminary stage, the term matrix comprises 171 class axioms consisting of study design terms, related terms, and publication types. When possible, class axioms were annotated
with definition(s), incompatibility status, legacy term(s), controlled vocabulary resource unique
identification, semantic type, and variant term annotation properties. Revision metadata was also captured in editor annotations recording the team member who modified the class axiom and the date of modification. The following process was used for axiom
enhancement (Figure 1):
FIG. 1. Design term annotation process: define the EDDA term and record its definition and source from MeSH, Emtree, or NCIT [#Definition]; record entry terms, preferred terms, synonyms, acronyms, and abbreviations [#Variant]; record legacy terms and previous indexing [#legacyTerm]; note incompatible terms [#incompatibleWith]; record the semantic type [#semanticType]; record unique identification codes [#MeSHCode] [#NCIThesaurusCode] [#UMLSCUI]; verify the symmetry of annotations; note the editor and date of modification [#Editor]; add DCMI artifact tags [dc/elements/1.1/creator] [dc/elements/1.1/format] [dc/elements/1.1/contributor] [dc/elements/1.1/publisher] [dc/elements/1.1/description]; release the version in the NCBO BioPortal in OWL format.
Through the annotation process, a total of 2,381 axiom annotations were recorded. This
included 51 MeSH, 33 NCIT, and 27 Emtree exact match access points to EDDA Study Design
terms. Both MeSH and NCIT access points enabled information to be recorded. However, 12 of the 27 Emtree access points yielded no annotations because the corresponding entries contained insufficient information.
Because NCIT cross-references other controlled vocabularies, 33 Unified Medical Language
System (UMLS) resources also contributed to axiom annotations. A total of 123 definitions, 14
instances of incompatibility, 33 legacy terms, 95 semantic types, and 1,349 term variations were
recorded.
5. Conclusions & Future Work
Identifying and retrieving reports of medical investigations by design features is increasingly
possible, primarily through linking metadata. Further development entails adding definitions from
other sources, mapping relationships among terms, and integrating terms from existing
vocabularies, particularly the Information Artifact Ontology. A primary goal is to improve
identification and retrieval of electronic records describing studies in dispersed data warehouses
or electronic repositories.
Acknowledgements
This research was partially supported by the US National Library of Medicine (NLM), National
Institutes of Health, grant number 5R00LM010943. Its contents are solely the responsibility of
the authors and do not necessarily represent the official views of the NLM.
Best Practice Poster:
Normalizing Decentralized Metadata Practices Using Business
Process Improvement Methodology: A Data-Informed Approach
to Identifying Institutional Core Metadata
Emily Porter, Emory University, USA, [email protected]
Keywords: metadata standards; element sets; benchmarking; best practices; business process
improvement; thematic analysis; quantitative analysis; qualitative analysis.
Environment, Context, and Techniques
The Emory University Libraries and Emory Center for Digital Scholarship have developed
numerous digital collections over the past decade. Accompanying metadata originates via
multiple business units, authoring tools and schemas, and is delivered to varied destination
platforms. Seeking a more uniform metadata strategy, the Libraries’ Metadata Working Group
initiated a project in 2014 to define a set of core, schema-agnostic metadata elements relevant to
local content types.
Quantitative and qualitative techniques commonly used in the field of Business Process
Improvement were utilized to mitigate complex organizational factors. A key research deliverable
emerged from benchmarking: a structured comparison of over 30 element sets, recording for each
standard its descriptive element names, their required-ness, and general semantic concepts.
FIG. 1. Descriptive Elements by Schema/Standard: Quantity and Requirements (Selected Sources).
Additional structured data collection methodologies included a diagnostic task activity, in which
participants with varying expertise created (simple) Dublin Core records for selected digital
content. A survey of stakeholders provided greater context for local practices. Multiple public-facing discovery system interfaces were inventoried to log search, browse, filter, and sort options,
and available web analytics were reviewed for user activity patterns correlating to these options.
Thematic analysis was performed on all benchmarking, system profile, and web analytics data to map the results to a common set of conceptual themes, facilitating quantification and analysis. A weighted scoring model enabled the ranking of the elements’ themes: the highest-scoring concepts were then explicated as an initial set of core elements, mapped to relevant standards and schemas.
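A weighted scoring step of this kind might look like the following sketch; the data sources, weights, themes, and counts are entirely hypothetical and stand in for the project's actual model.

# Hypothetical weighted scoring of conceptual themes across data sources.
# Source weights, themes, and counts are invented for illustration only.
weights = {"benchmarking": 0.4, "system_profiles": 0.3, "web_analytics": 0.3}

theme_counts = {
    "title":   {"benchmarking": 31, "system_profiles": 9, "web_analytics": 1200},
    "creator": {"benchmarking": 28, "system_profiles": 7, "web_analytics": 640},
    "rights":  {"benchmarking": 18, "system_profiles": 2, "web_analytics": 15},
}

def normalize(counts):
    """Scale a source's counts to 0-1 so the weights are comparable."""
    top = max(counts.values()) or 1
    return {theme: value / top for theme, value in counts.items()}

by_source = {source: normalize({t: c[source] for t, c in theme_counts.items()})
             for source in weights}
scores = {theme: sum(weights[s] * by_source[s][theme] for s in weights)
          for theme in theme_counts}

for theme, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{theme:10s} {score:.2f}")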
Acknowledgements
The Emory University Libraries’ Metadata Working Group lent subject matter expertise to data
collection and analysis (Brian Croxall, Jen Doty, Jason LeMay, Rebecca Koeser, Melanie
Kowalski, Elizabeth Roke). Lars Meyer (Woodruff Library, Emory University) sponsored the
project and has provided ongoing support. Emory’s Office of Business Practice Improvement
(Bill Dracos, Chris Rapalje, Jamie Smith, Ashley Teal) shared valuable techniques and expertise.
References
Atom Syndication Format – Introduction. (2007). Retrieved March 31, 2014, from http://atomenabled.org/developers/syndication/.
Berkman Center for Internet & Society at Harvard Law School. (2003). RSS 2.0 Specification (RSS 2.0 at Harvard
Law). Retrieved March 31, 2014, from http://cyber.law.harvard.edu/rss/rss.html
Data Documentation Initiative. (2014). DDI Lite (Recommended Elements). Retrieved Month DD, YYYY, from
http://www.ddialliance.org/sites/default/files/ddi-lite.html.
DCMI-Libraries Working Group. (2004). DC-Library Application Profile (DC-Lib). Retrieved Feb 13, 2014, from
http://dublincore.org/documents/library-application-profile/.
Digital Library Federation. (2009). Digital Library Federation/Aquifer Guidelines for Shareable MODS Records, Version 1.1. Retrieved April 30, 2014, from https://wiki.dlib.indiana.edu/download/attachments/24288/DLFMODS_ImplementationGuidelines.pdf.
Digital Public Library of America. (2013). Metadata Application Profile, Version 3. Retrieved Month DD, YYYY,
from http://dp.la/info/developers/map/.
Dublin Core Metadata Initiative. (2012). Dublin Core Metadata Element Set, version 1.1. Retrieved March 25, 2014,
from http://www.dublincore.org/documents/dces/.
Embedded Metadata Working Group, Smithsonian Institution. (2010). Basic Guidelines for Minimal Descriptive Embedded Metadata in Digital Images. Retrieved April 7, 2014, from http://www.digitizationguidelines.gov/guidelines/GuidelinesEmbeddedMetadata.pdf.
Federal Geographic Data Committee. (1998). CSDGM Graphical Representation. Retrieved April 23, 2014, from
http://www.fgdc.gov/csdgmgraphical/index.html.
Google Scholar. (n.d.). Inclusion Guidelines for Webmasters. Retrieved April 17, 2014, from http://scholar.google.com/intl/en-US/scholar/inclusion.html#indexing.
International Organization for Standardization (ISO). (2003). Geographic information — Metadata. New York: American
National Standards Institute.
Library of Congress. (2013). MODS User Guidelines (Version 3). Retrieved February 28, 2014, from
http://www.loc.gov/standards/mods/userguide/index.html.
Library of Congress. (2014). LC RDA Core Elements. Retrieved April 8, 2014, from http://www.loc.gov/aba/rda/pdf/core_elements.pdf.
Miller, S. (2011). Metadata for Digital Collections. New York: Neal-Schuman Publishers, Inc.
PBCore. (2011). Elements. Retrieved April 21, 2014, from http://www.pbcore.org/elements/.
Schema.org. (n.d.). CreativeWork. Retrieved April 9, 2014, from http://schema.org/CreativeWork.
Society of American Archivists. (2013). Describing Archives: A Content Standard. Retrieved May 2, 2014, from
http://files.archivists.org/pubs/DACS2E-2013.pdf.
Text Encoding Initiative. (2014). 2 - The TEI Header. Retrieved April 7, 2014, from http://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html.
U.S. National Library of Medicine. (2004). NLM Metadata Schema. Retrieved April 17, 2014, from
http://www.nlm.nih.gov/tsd/cataloging/metafilenew.html.
VRA Core Oversight Committee. (2007). VRA Core 4.0 Element Description. Retrieved March 31, 2014, from
http://www.loc.gov/standards/vracore/VRA_Core4_Element_Description.pdf.
W3C. (2014). 4.2 Document metadata. Retrieved July 1, 2014, from http://www.w3.org/TR/html5/document-metadata.html
Best Practice Poster:
The NDL Great East Japan Earthquake Archive:
Features of Metadata Schema
Akiko Hashizume, National Diet Library, Japan, [email protected]
Julie Fukuyama, National Diet Library, Japan, [email protected]
Keywords: The Great East Japan Earthquake; archive; dcndl; ndlkn
1. Background
The Great East Japan Earthquake, which struck Japan on March 11, 2011, caused extensive
damage in several parts of Japan and has affected Japanese society, culture, and economy. From immediately after the earthquake, the importance of passing on this historical experience to future generations has been emphasized in Japan and overseas. The Japanese government announced its
basic policy towards the recovery from the earthquake. This policy pointed out the need to
develop a system to collect, preserve and provide access to records of and lessons learned from
the earthquake, tsunami and nuclear disaster.
Based on this policy, the National Diet Library (NDL), in conjunction with numerous other
organizations throughout Japan, has developed the Great East Japan Earthquake Archive Project
for the collection, preservation, and provision of information related to the earthquake.
2. The NDL Great East Japan Earthquake Archive
A portal site for this project was developed by the NDL and opened to the public in March 2013. Features available at the portal site include integrated searches of resources and reports on
the earthquake and subsequent disasters produced by public institutions, private organizations,
and mass media companies as well as research publications by universities, academic societies,
and research institutions. The portal site has been named HINAGIKU, which means daisy in
English.1 This name is intended to convey an image of hope for the future and mutual concern in support of recovery from the earthquake.
FIG. 1. Top page of the Great East Japan Earthquake Archive (HINAGIKU) (English version).
1. HINAGIKU is an acronym of Hybrid Infrastructure for National Archive of the Great East Japan Earthquake and Innovative Knowledge Utilization.
HINAGIKU allows you to search the following resources. By the end of April 2014, the
number of searchable records in HINAGIKU had reached 2,642,788.
TABLE 1: Resources collected in HINAGIKU.
Subject:
• Records of the Great East Japan Earthquake and the damage it caused, records of the affected areas before the earthquake, and records of the restoration and reconstruction after the earthquake
• Records of aid activities by the national government, local municipalities, and other public organizations, as well as records of aid activities by volunteer groups, non-profit organizations, and other private initiatives
• Records of disaster prevention planning and academic research before and after the earthquake, as well as records of disaster prevention planning for the future
• Records of nuclear hazards resulting from the earthquake
• Records of earthquakes, tsunami, and other natural disasters from the past
• Records of the impact of past earthquakes on politics, economics, and society in Japan and around the world
Format:
• Books, journals, newspapers, and other publications and digitized data
• Reports, research papers, and news
• Websites of public and private organizations
• Images
• Video
• Audio (interviews, etc.)
• Fact sheets (observed data, geodetic data, etc.)
The user-friendly HINAGIKU interface includes a map display and a timeline display. Users
interested in searching documents, images, video, and other digital material from a particular
region can browse via the map display. Users interested in searching digital material
chronologically search via the timeline. The time base can be changed to facilitate tracking the
passage of time and reviewing the progress of reconstruction initiatives.
FIG. 2. Map page of HINAGIKU (English version).
FIG. 3. Timeline page of HINAGIKU (English version).
To enable integrated searches, HINAGIKU collects three types of metadata:
1. metadata on digital materials stored in HINAGIKU
2. metadata from the NDL’s other databases, including the online catalog (NDL-OPAC)
3. metadata collected from other databases created by other organizations,2 including those of
local municipalities, universities, and mass media
To handle this metadata in HINAGIKU, we developed the Great East Japan Earthquake
Archive Metadata Schema (NDLKN).3 This schema is based on the National Diet Library Dublin
Core Metadata Description (DC-NDL), which is our own metadata schema, based on the DCMES
and DCMI Metadata Terms, for facilitating interoperation of metadata between libraries and
2. Examples of organizations cooperating with HINAGIKU include: CiNii Article by the National Institute of Informatics; JAEA OPAC by the Japan Atomic Energy Agency’s Library; Digital Archive of Japan’s 2011 Disasters by the Edwin O. Reischauer Institute of Japanese Studies at Harvard University; and East Japan Earthquake Picture Project by Yahoo!JAPAN, among others.
3. “NDLKN” is from ‘NDL Knowledge infrastructure system metadata schema’.
related institutions in Japan. DC-NDL comprises NDL Metadata Terms as well as Application
Profile and RDF Schema for NDL Metadata Terms.
Machine searches and harvesting of metadata are supported on HINAGIKU through Web APIs: SRU, OpenSearch, and OAI-PMH. The SRU API returns search results in RDF/XML.
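As an illustration, an SRU searchRetrieve request could be issued as follows; the base URL and the CQL index name are assumptions, while operation, version, query, and maximumRecords are standard SRU 1.2 parameters.

import requests
import xml.etree.ElementTree as ET

BASE_URL = "http://kn.ndl.go.jp/sru"          # hypothetical endpoint path

params = {
    "operation": "searchRetrieve",            # standard SRU parameters
    "version": "1.2",
    "query": 'title = "tsunami"',             # CQL; index name is an assumption
    "maximumRecords": "10",
}
response = requests.get(BASE_URL, params=params, timeout=30)
root = ET.fromstring(response.content)        # records are returned as RDF/XML
print(response.url)
print(root.tag)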
3. The Great East Japan Earthquake Archive Metadata Schema (NDLKN)
NDLKN was created as an extension of DC-NDL, so that HINAGIKU could search the
metadata not only of other institutions but also of NDL search systems, such as a discovery tool
"NDL Search", which implement DC-NDL for metadata schema.
NDLKN comprises
1. 87 terms described in DC-NDL (dcndl:),
2. 33 terms described by W3C and adopted internationally (exif: etc.), and
3. 5 terms described originally in NDLKN (ndlkn:).
There were two major issues to solve in development of NDLKN. The first was coordination
of metadata in various systems over multiple domains. The second was to satisfy requirements for
archiving disaster records. NDLKN was developed to be a solution to these issues.
3.1. Coordinating with the metadata of various systems across domains
It was not possible to create metadata in the new NDLKN schema for existing domestic and foreign disaster record archive systems, because they held metadata in their own original schemas. Therefore, we decided to harvest and keep the metadata in its original schema in one store and to map this data to the NDLKN schema for searching. We ask newly built archives to adopt NDLKN and to extend the terms according to the needs of each institution.
As mentioned above, NDLKN was extended from DC-NDL. The main differences between these schemas are the changes of class from [dcndl:Item] to [ndlkn:Resource] and from [dcndl:BibAdminResource] to [ndlkn:MetaResource]. The NDL Search, which implements DC-NDL based on the FRBR model, holds terms for individual items in the class [dcndl:Item]. However, we felt that it would be difficult for organizations other than libraries to understand the concept of the FRBR item, especially since HINAGIKU was intended to handle digital materials such as images and videos more than the books and journals found in traditional libraries. We therefore set [ndlkn:Resource] and changed the class [dcndl:BibAdminResource] to [ndlkn:MetaResource].
We also decided to store the URI of metadata providers in [dcterms:creator] of
[ndlkn:MetaResource] class and the URI of the NDL in [dcterms:publisher]. We did this because
we consider metadata providers to be primarily responsible for the metadata, which the NDL
accepts and makes available.
We also assumed that the number of cooperating archives would continue to increase, and
therefore it would be preferable to use identifiers for HINAGIKU metadata that would not require
adjustment or reduction and would never be exhausted or overlap. As a result of these
considerations, we adopted UUIDs (Universally Unique Identifiers, RFC 4122) and decided to assign a UUID to each file as the minimum unit.4
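Generating such an identifier is trivial in most environments; a one-line sketch (not NDL code):

import uuid

# One Version 4 UUID (RFC 4122) per file, as described above.
print(uuid.uuid4())   # e.g. 550e8400-e29b-41d4-a716-446655440000 (random)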
It is necessary to specify a license or terms of use for each resource that will be reused.
Therefore, we decided to use [dcterms:license] for the license information and to adopt [cc:attributionURL] from the Creative Commons Rights Expression Language to describe the name of the rights holder. Both of these are used in the form of URIs.
Ex. 1: Creative Commons license
<dcterms:license rdf:resource="http://creativecommons.org/licenses/by/3.0/us/"/>
4. A Version 4 UUID is a string of 32 random hexadecimal digits, so overlapping identifiers are practically impossible.
Ex. 2: Yahoo! JAPAN East Japan Earthquake Picture Project
<dcterms:license rdf:resource="http://archive.shinsai.yahoo.co.jp/contents/guide/"/>
Ex. 3: The NDL
<cc:attributionURL rdf:resource="http://www.ndl.go.jp/"/>
3.2. Meeting the needs of archiving disaster records
HINAGIKU functions not only as a web portal that enables integrated search for both digital and analogue resources but also as an archive that stores and preserves the resources themselves together with their metadata. At the moment, HINAGIKU archives digital materials such as images and videos. We considered the NDLKN terms for each material type of target resource.
NDLKN adopted [premis:formatName] and [premis:formatVersion] from PREMIS as terms
for preservation technology. For images and videos recorded on digital cameras, we selected terms from the W3C Ontology for Media Resources: for example, concerning the recording location, [ma:createdIn] for the URI, [ma:locationLatitude] for the latitude, and [ma:locationLongitude] for the longitude; and, regarding sound and video, [ma:samplingRate] for sound, [ma:frameRate] for video, and [ma:duration] for playing time. For images, we adopted only the minimum necessary terms from the Exif data description vocabulary: [exif:width] for the width and [exif:height] for the height of the image.
It is important for post-disaster surveys that resources such as images and videos have
geospatial information. For this reason, we set terms not only for describing address, longitude, or
latitude but also for distinguishing the objective space from the recording location of the resource.
As for recording location, we adopted [v:street-address] and [v:postal-code] from Ontology for
vCard. To describe the objective space of the resource, we described the value structure using
[dcterms:spatial] and adopted [rdfs:label] for the name of the objective space, [v:region] for the
prefecture, [v:locality] for the city, town, or village, [v:street-address] for the street address, [v:postal-code] for the postal code, and, additionally, [geo:lat] for the latitude and [geo:long] for the longitude from the Basic Geo (WGS84 lat/long) Vocabulary.
Temporal information is also important for disaster records. Therefore, we describe the date an image or video was recorded in [dcterms:created] and the date on which collection from a website started in [dcndl:dateCaptured]. We recommend that values be stored in W3CDTF format, specified by [rdf:datatype]. Furthermore, in HINAGIKU, metadata is mapped uniformly to W3CDTF format where possible, even if the provided metadata is not in W3CDTF format.
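Because W3CDTF is the W3C profile of ISO 8601, such normalization can be sketched as follows; the input formats and the JST timezone assignment are assumptions about what providers might supply.

from datetime import datetime, timezone, timedelta

JST = timezone(timedelta(hours=9))
samples = ["2011/03/11 14:46", "2011-03-11", "20110311"]   # assumed provider formats
patterns = ["%Y/%m/%d %H:%M", "%Y-%m-%d", "%Y%m%d"]

for value in samples:
    for pattern in patterns:
        try:
            parsed = datetime.strptime(value, pattern)
        except ValueError:
            continue
        # Full timestamps get a timezone; date-only values could instead be
        # emitted as plain YYYY-MM-DD, which W3CDTF also allows.
        print(value, "->", parsed.replace(tzinfo=JST).isoformat())
        break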
At the beginning of the development of NDLKN, we assumed that it would be necessary to group the data by region, kind of disaster, or other characteristics useful for searching the data and displaying the search results. For this, we discussed using the terms collection and item to represent parent/child relationships among resources. However, after consideration, it became clear that it is almost impossible to define a collection uniquely. Therefore, we describe both collections and items with [ndlkn:Resource] and represent parent/child relationships among resources by connecting them with [dcterms:isPartOf] or [dcterms:hasPart].
HINAGIKU was initially intended to be an archive of the Great East Japan Earthquake. The
target of the collection, however, includes records of earthquakes, tsunamis, and other past
disasters, too, and other new archives might also be developed for future disasters. Based on these assumptions, we use [dcterms:coverage] to store the name and URI of a disaster in order to identify the disaster to which the resource relates.
4. Characteristic utilization examples of NDLKN in HINAGIKU system
We introduce several examples of how NDLKN is used in the HINAGIKU system.
As HINAGIKU coordinates with domestic and foreign archive systems for disaster records, we assume that it will be necessary to confirm the metadata schema definitions in effect at harvest time if
cooperating organizations change their schemas in the future. Therefore, the HINAGIKU system stores its own URI in [dcterms:conformsTo] of the [ndlkn:Resource] class and information about the provider’s original metadata schema in [ndlkn:sourceConformsTo] as an internal term.
We also utilize the NDLKN terms [geo:lat], [geo:long], [ma:locationLatitude], and [ma:locationLongitude] for the latitude and longitude of resources such as images and videos. When the provided metadata does not include latitude or longitude values, HINAGIKU derives them automatically from either the name of the objective space or the recording location through the Yahoo! Geocoder API.
Web sites of the local governments of stricken areas and of the Japanese government are also important as disaster records. The NDL has long archived web sites with the WARP system, and we have archived disaster-related web sites with higher frequency since the Great East Japan Earthquake. The titles of such web sites (for example, 'Sendai city') do not change even if the content changes. Therefore, to search and distinguish the results, it is necessary to add temporal information, such as the year, month, and date of collection, to the collected web sites. For this reason, HINAGIKU stores not only the title itself but also related information in [dcterms:title] for the web sites collected by the WARP system. More specifically, the date on which collection started is stored in [dcndl:dateCaptured] and is also appended in square brackets after the title.
References
Basic Geo (WGS84 lat/long) Vocabulary. Retrieved April 28, 2014, from http://www.w3.org/2003/01/geo/wgs84_pos#.
Creative Commons. Creative Commons Rights Expression Language. Retrieved April 28, 2014, from
http://creativecommons.org/ns#.
IFLA (2009). Functional Requirements for Bibliographic Records: FRBR. Retrieved April 28, 2014, from
http://www.ifla.org/publications/functional-requirements-for-bibliographic-records.
NDL. (2011). National Diet Library Dublin Core Metadata Description (DC-NDL), version Dec. 2011. Retrieved April
28, 2014, from http://www.ndl.go.jp/en/aboutus/standards/index.html. The DC-NDL is DCMI Metadata Terms
2010-10-11 based metadata schema and compliant with the recommendations of the DCMI.
NDL Search. Retrieved April 28, 2014, from http://iss.ndl.go.jp/?locale=en.
PREMIS (2011). PREMIS Data Dictionary for Preservation Metadata, version 0.99. Retrieved April 28, 2014, from
http://multimedialab.elis.ugent.be/users/samcoppe/ontologies/Premis/premis.owl.
The Great East Japan Earthquake Archive (HINAGIKU). Retrieved April 28, 2014, from http://kn.ndl.go.jp/.
W3C (2003). Exif data description vocabulary. Retrieved April 28, 2014, from http://www.w3.org/2003/12/exif/ns.
W3C. (2006). Ontology for vCard. Retrieved April 28, 2014, from http://www.w3.org/2006/vcard/ns.
W3C (2012). Ontology for Media Resources 1.0. Retrieved April 28, 2014, from http://www.w3.org/ns/ma-ont.
WARP. Retrieved April 28, 2014, from http://warp.da.ndl.go.jp/info/WARP_en.html.
Yahoo! Geocoder API (Japanese only). Retrieved April 28, 2014, from http://developer.yahoo.co.jp/webapi/map/openlocalplatform/v1/geocoder.html.
Best Practice Poster:
Reusing Legacy Metadata for Digital Projects: The Colorado Coal
Project Collection
Michael Dulock, University of Colorado Boulder, USA, [email protected]
Keywords: metadata; legacy metadata; digital libraries; digital collections; archives; Dublin
Core; McBee cards; keyslot cards; edge-notched cards; metadata repurposing; hidden collections
1. Introduction
Libraries and other cultural institutions are increasingly focused on efforts to unearth hidden
and unique collections. Yet the metadata describing these collections, when such exist, may not
be in an immediately useable format. In some cases the metadata records may be as exceptional
as the materials themselves. This poster describes research underway into how libraries can
repurpose metadata in archaic formats using the Colorado Coal Project Collection1 slides as a
case study.
Metadata in outdated formats, whether analog or digital, are a mixed blessing for metadata
practitioners when creating digital collections. On the one hand, practitioners are happy to have
pre-existing descriptive information to accompany their materials, eliminating the need to re-describe a collection or the items it contains. On the other hand, a lot of work may be required to
convert that metadata into a form that can be used by their digital systems. Examples of legacy
metadata include archival finding aids in typescript, catalog cards, handwritten inventories, out-of-date database software, and other more exotic formats. The metadata thus preserved can
provide a wealth of information for users of a digital collection, but first the data must be moved
from its old format into a newer, digital system. Various tools, such as database conversion
software or OCR (optical character recognition) applications, can be used to convert metadata.
But those tools are not fool-proof. A text captured using OCR may still require manual quality
checking, since OCR software may not be able to correctly interpret the inconsistencies of
typescript. Even metadata captured in a spreadsheet may not be immediately useable. Manual
intervention is required to separate different values in cells that contain multiple data points, for
instance.
2. Background
The Colorado Coal Project Collection documents the history of coal mining in the western
United States, primarily focusing on Colorado in the early 20th century. The original project was
conducted between 1974 and 1984 by Eric Margolis and Ron McMahan of the Institute of
Behavioral Science at the University of Colorado Boulder. The two researchers documented the
history, technology, and lives of coal miners in Colorado through photographs and through interviews with miners, community members, and historians, discussing topics ranging from mining camp life and immigration to working conditions, labor unions, and strikes. The physical collection, housed
at the University of Colorado Boulder Archives, comprises over one hundred video and audio
files of interviews, scores of transcripts, and over four thousand slides depicting mining life.
The slides are accompanied by over four thousand McBee cards, a manual computing format
that saw occasional use for recordkeeping in the mid-20th century (McCoy, 1965; Rabinow, 1958;
1. http://libcudl.colorado.edu:8180/luna/servlet/UCBOULDERCB1~76~76
Smith & Schnall, 1980). These cards contain written notes as well as punches around the edge
which indicate various features of the slides such as locations, dates, and technical details.
Transferring this rich metadata from thousands of cards into a workable digital format was a
challenge. The poster examines the process of transferring the metadata recorded on these arcane
cards to a 21st century digital library collection, utilizing a combination of student labor, Metadata
Services staff, MS Excel, and careful quality control.
3. Methodology
The first part of the metadata transfer process was capturing the metadata on the cards in an
electronic format that could then be manipulated. The data was recorded in a consistent manner
according to a classification key included with the cards. Each card was divided into sections:
text in the interior of the card recorded the slide number, title, date, and description information,
image quality, and restriction/rights notes; a series of numerically-coded holes (locations for
punches) were arranged around the edge of the card. These, too, were divided into sections
according to type: decades, structures, historical notes, “general” notes; states and regions;
“general categories”; and technical notes. (See poster for a card and key images.) Each numbered
pinhole was assigned a value on the key. Categories on the cards were mapped to metadata
elements from the Dublin Core Metadata Element Set (DCMI, 2012). The Metadata Librarian
built an Excel spreadsheet to capture the card metadata by category, which could then be
crosswalked to Dublin Core (DC). The spreadsheet had one column for each category (slide
number, decades, etc.), with each row representing a single card. Multiple data points would be
entered in a single column but separated by a delimiter so the Metadata Librarian could later
create one column for each entry. A key was added at the top of the spreadsheet indicating valid
values for each category (text, 1-14, L0-L8, etc.). The spreadsheet would be filled out with data
exactly as it appeared on the card, including numeric codes.
The Metadata Services Department hired three student workers to manually transfer the data.
Each would be expected to record metadata from approximately 1,400 cards. The students were
provided with written procedures as well as a visual job aid to make the transfer of data from card
to spreadsheet as clear as possible (see poster). Having the students enter codes directly from the
cards without translating them with the key served to reduce the labor time per card and eliminate
mistranslation errors. In addition, Excel functionality could be used to isolate invalid data in
individual columns based on the valid value ranges for some columns.
The Metadata Librarian checked the students’ output periodically throughout the project.
Quality issues were minor and mostly typographical errors with number entry. The biggest hurdle
was the handwritten text on the cards: in some cases handwriting was difficult to decipher,
especially for proper names. Students were instructed to note entries that were difficult to
decipher, so that the Metadata Librarian could examine the cards and do additional research as
needed. A portion of the problem cards were completed by a paraprofessional from the Metadata
Services Department after a student recorded the numeric coding.
Once the card metadata was captured, the Metadata Librarian split columns with multiple
entries into individual columns. This resulted in multiple columns for several categories such as
structures and technical notes. Once each column contained a single data point, another round of
quality control was performed. The Metadata Librarian used conditional formatting to highlight
invalid entries in each column. In some cases, a variety of invalid entries were searched for (e.g.,
letters and numbers outside of the valid range) and some spot checking was done against
individual cards.
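The project did this work in Excel; a scripted equivalent of the split, validate, and translate steps might look like the following sketch, in which the column name, delimiter, valid code range, and key values are all hypothetical.

import pandas as pd

# Scripted stand-in for the Excel steps described above (the project itself
# used Excel). Column name, ";" delimiter, valid range, and key are invented.
cards = pd.read_csv("mcbee_cards.csv", dtype=str)

# Split a multi-value column into one column per entry: structures_1, ...
split = cards["structures"].str.split(";", expand=True)
split.columns = [f"structures_{i + 1}" for i in range(split.shape[1])]
cards = pd.concat([cards.drop(columns=["structures"]), split], axis=1)

# Flag entries outside the valid numeric range (here 1-14), mirroring the
# conditional-formatting check.
for col in split.columns:
    codes = pd.to_numeric(cards[col], errors="coerce")
    invalid = cards[col].notna() & ~codes.between(1, 14)
    if invalid.any():
        print(col, "invalid values:", cards.loc[invalid, col].tolist())

# Replace numeric codes with textual terms from the key (mapping invented).
key = {"1": "tipple", "2": "mine entrance", "3": "company housing"}
for col in split.columns:
    cards[col] = cards[col].map(key).fillna(cards[col])

cards.to_csv("mcbee_cards_expanded.csv", index=False)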
Following quality control, numeric codes were replaced by textual terms from the key, column by column. Since each card might represent multiple slides, the Digitization Lab Manager de-duplicated entries on the spreadsheet by comparing it with the actual slides, indicating redundant slide numbers, or those for which we had no corresponding slide. The Metadata Librarian then further divided the document’s rows into one per slide, removing entries for missing or redundant
slide numbers, or those for which we had no corresponding slide. The Metadata Librarian then
further divided the document’s rows into one per slide, removing entries for missing or redundant
slides. The Metadata Librarian then crosswalked the spreadsheet data into the DC form and
loaded it into the digital library software. The entire collection, including non-slide material, was
processed and published in the CU Digital Library in time for the centenary of the Ludlow
Massacre of 20 April, 1914, a watershed event in mining history and labor relations in the United
States.
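The code-to-term replacement and the crosswalk into DC column names could be scripted along these lines; the key excerpt and the element mapping below are illustrative assumptions, not the project's actual key or crosswalk:

import pandas as pd

df = pd.read_csv("slides.csv")  # one row per slide after de-duplication

# Replace numeric codes with textual terms from the key (hypothetical excerpt).
decade_key = {1: "1880s", 2: "1890s", 3: "1900s"}
df["decades"] = df["decades"].map(decade_key)

# Crosswalk spreadsheet categories to Dublin Core fields before loading.
dc_map = {
    "decades": "dc.coverage.temporal",
    "states_regions": "dc.coverage.spatial",
    "general_notes": "dc.description",
}
df = df.rename(columns=dc_map)
df.to_csv("slides_dc.csv", index=False)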
4. Conclusion
The Colorado Coal Project Collection, as it exists in the University of Colorado Boulder
Archives, is a large, complex, and rich resource for researchers in mining and labor in the United
States. Capturing and displaying the robust metadata that accompanied it proved an interesting
and significant challenge, and served as a lesson in dealing with legacy metadata.
Acknowledgements
The author would like to acknowledge the hard work of everyone who contributed to this project,
especially students Sarah Han, Rosalyn Wong, and Grace Zhong; library technicians Cynthia
Hardey and Jane Zumwalt; Digitization Lab Manager Patrick Mulcrone; Digital Initiatives
Librarian Holley Long; and original researcher and project sponsor Ron McMahan.
References
Dublin Core Metadata Initiative. (2012). DCMI Metadata Terms. Retrieved from http://dublincore.org/documents/dcmi-terms/
McCoy, Ralph E. (1965). Computerized circulation work: a case study of the 357 Data Collection System. Library
Resources & Technical Services 9(1), Winter 1965, 59-65.
Rabinow, Jacob. (1958). Presently available tools for information retrieval. Electrical Engineering 77(6), June 1958,
494-498.
Smith, Donald A. & Peter L. Schnall. (1980). Improved hypertension control using a surveillance system in a
neighborhood health center. Medical Care 18(7), July 1980, 766-774.
Best Practice Demonstration:
A Model and Roles of a Common Terminology to Improve
Metadata Interoperability
(Boaz) Sunyoung Jin
University of Illinois at
Urbana-Champaign,
United States
[email protected]
Keywords: metadata; interoperability; common terminology; metadata model; MARC; (Q)DC
1. Introduction
Interoperability issues pose a barrier to sharing and exchanging information among digital libraries and repositories. This is due to the use of diverse metadata standards with different degrees of generality or specificity, which causes loss of information at all metadata model levels (e.g., schema, schema definition language, record, and repository) (Chan & Zeng, 2006; Haslhofer & Klas, 2010, p. 19). As possible solutions to this long-standing problem, standardization on a common communications format (Svenonius, 1983, p. 2) and on a common command language or vocabulary (Lancaster & Smith, 1983, p. 21) have historically been argued for. A Common Terminology (CT) is therefore proposed as a bridge across metadata standards of varying generality and specificity, providing uniformity for searching and achieving metadata interoperability at multiple levels.
2. The Abstract Model and Roles of a Common Terminology (CT)
Based on DCMI abstract model (DCMI, 2013), an abstract model of CT is diagrammed in
Figure 1. The definitions for terms in this extended abstract model are as follows:
• A Common Terminology is a set of Common Terms of element names in widely used metadata schemas such as MARC, MODS, DC and QDC.
• A Common Term is a property (element) or class.
• A property (sub-property) can be one kind of common element (field) or attribute (subfield) in two or more metadata schemas.
[Figure 1 diagram: the Common Terminology, Common Term, CTScheme, and Common Resource entities connected to the DCMI model's property, class, vocabulary, term, and resource elements through relationships such as has domain, has range, sub-property of, sub-class of, instance of, member of, and describes, together with the DCMI Syntax and Vocabulary Encoding Schemes and an Enumerated Set.]
FIG. 1. The CT Abstract Model based on DCMI abstract model (DCMI, 2013)
The core role of CT is to encompass various metadata schemas, allowing communities to use their own standards while providing uniformity for searching. CT acts as a bridge between existing standards, maintaining a balance between their different degrees of generality or specificity and minimizing loss of information at all metadata model levels. CT provides uniformity for search through a CT union catalog and Linked Open Data connecting online accessible metadata records on the Web. Ultimately, CT is intended to provide a common, standard way to achieve interoperability at multiple levels so that resources can be shared readily among many libraries, organizations, and governments.
3. The Developed CT to Improve Metadata Interoperability
Taking commonly used standards (MARC, MODS, DC, and QDC) as its basis, CT has been developed as a bridge across different levels of generality and specificity. CT is designed to improve metadata interoperability at the schema, schema definition language, record, and repository model levels.
3.1. At the Schema Metadata Model Level
The developed CT (Jin, 2014) is a set of 12 Common Terms (properties) and 58 qualifiers (sub-properties) that specify and subdivide the 12 properties in detail, together with CTScheme, which is defined as a controlled set of values specific to CT. The development is based on Library of Congress crosswalks (e.g., MARC from/to (Q)DC) (LC) and is supported by the usage of MARC tags and (Q)DC elements in 5 search interfaces and in actual metadata records from Harvard (MARC, 12 million records), UIUC (MARCXML, 10 million), and MIT (QDC, 20,000), obtained through the cooperation of the three universities in the USA. The CT selected at the schema level consists of generalized common terms that maximize lexical and semantic interoperability: they are used in over 50% of Harvard, WorldCat, and UIUC metadata records and in all 5 search interfaces. The 12 Common Terms are contributor, date, description, format, identifier, language, publisher, relation, rights, subject, title, and typeGenre; the 58 qualifiers are listed on the project website.
3.2. At the Schema Definition Language Level
The 12 generalized Common Terms and 58 qualifiers are represented in an XML schema (ct.xsd) and an RDF schema (ct.rdf), with SKOS concepts (ctskos.rdf), to improve semantic interoperability.
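As a rough illustration of what one such SKOS statement might look like, the rdflib sketch below declares a single Common Term; the namespace URI and the mapping to a DC term are assumptions for the example, not the contents of the published ct.rdf or ctskos.rdf files.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, SKOS

CT = Namespace("http://example.org/ct#")  # hypothetical namespace for the sketch

g = Graph()
g.bind("ct", CT)
g.bind("skos", SKOS)

# One Common Term expressed as a SKOS concept with a mapping to a DCMI term.
g.add((CT.typeGenre, RDF.type, SKOS.Concept))
g.add((CT.typeGenre, SKOS.prefLabel, Literal("typeGenre", lang="en")))
g.add((CT.typeGenre, SKOS.exactMatch, URIRef("http://purl.org/dc/terms/type")))

print(g.serialize(format="turtle"))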
3.3. At the Record Level
The performance of CT in achieving and improving metadata interoperability is demonstrated through empirical evaluations with Harvard (MARC), MIT (QDC), and UIUC (MARCXML) records, obtained through the cooperation of the three universities. A conversion program written in Python converts the (Q)DC elements of MIT records to CT and measures the transfer rate and the lexical and semantic match rates. In the mapping experiments, the total transfer rate from MIT (Q)DC to CT is 99.9%, the lexical and semantic match rates are 98.7% and 100%, and the rate of information loss is extremely low at 0.00463%. CT thus maximizes lexical and semantic interoperability, significantly reducing the gaps between different degrees of generality or specificity, and considerably minimizes loss of information at multiple levels.
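A much-simplified sketch of that kind of conversion and measurement is shown below; it is not the author's Python program, the mapping table is an illustrative excerpt only, and the lexical/semantic scoring is omitted.

# Map (Q)DC element names to CT terms; the table is an illustrative excerpt only.
QDC_TO_CT = {
    "dc.title": "title",
    "dc.creator": "contributor",
    "dc.contributor": "contributor",
    "dc.date.issued": "date",
    "dc.subject": "subject",
    "dc.type": "typeGenre",
}

def convert_record(record):
    """Convert one (Q)DC record (element -> value) to CT and count transfers."""
    ct, transferred = {}, 0
    for element, value in record.items():
        term = QDC_TO_CT.get(element)
        if term:
            ct.setdefault(term, []).append(value)
            transferred += 1
    return ct, transferred, len(record)

sample = {"dc.title": "Example record", "dc.creator": "Doe, Jane", "dc.type": "Text"}
ct, moved, total = convert_record(sample)
print(ct, f"transfer rate: {moved / total:.1%}")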
3.4. At the Repository Level
As a next step, a prototype is planned to achieve and improve metadata interoperability at the repository level. The prototype will build a CT union catalog and Linked Open Data connecting 3 million online accessible records from the Harvard (MARC), MIT (QDC), and UIUC (MARCXML) libraries, providing a portal for them. The prototype will demonstrate a concrete solution for building interoperability globally with CT among libraries and well-designed digital libraries all over the world, which together would constitute an International Open Public Digital Library (Jin, 2014).
4. Conclusion
The Common Terminology (CT) has been developed as a bridge across standards with different levels of generality and specificity, such as MARC, MODS, DC, and QDC. CT considerably minimizes loss of information by reducing the gaps among them, and it significantly increases accuracy in mappings, as shown by high lexical and semantic match rates. The planned prototype will build a CT union catalog and Linked Open Data connecting the records of the three universities on the Web, and will provide a portal for the Harvard, MIT, and UIUC libraries. CT will offer a solution for achieving and improving interoperability among university libraries, and further among libraries and organizations, so that they can work together and share information while reducing loss of information at multiple metadata levels.
Acknowledgements
This research project has been supported by the Graduate School of Library and Information
Science (GSLIS) at the University of Illinois at Urbana-Champaign (UIUC). Thanks to Professors
David Dubin and Linda Smith for their guidance on this project. We also appreciate the
cooperation of Harvard, MIT, and UIUC in providing their metadata for the project.
References
Chan, Lois M., & Marcia L. Zeng. (2006). Metadata Interoperability and Standardization – A Study of Methodology Part I: Achieving Interoperability at the Schema Level. D-Lib Magazine, 12(6).
DCMI. (2013). DCMI Abstract Model. Retrieved from Dublin Core Metadata Initiative: http://dublincore.org/documents/abstract-model/
Haslhofer, Bernhard, & Wolfgang Klas. (2010). A Survey of Techniques for Achieving Metadata Interoperability.
ACM Comput. Surv., 42(2).
Jin, (Boaz) Sunyoung. (2014). A Model and Roles of a Common Terminology to Improve Metadata Interoperability.
Illinois Digital Environment for Access to Learning and Scholarship (IDEALS). Retrieved from
http://hdl.handle.net/2142/50100
Jin, (Boaz) Sunyoung. (2014). International Open Public Digital Library (IOPDL): A Proposal for the Future. Illinois
Digital Environment for Access to Learning and Scholarship (IDEALS). Retrieved from
http://hdl.handle.net/2142/50101
Lancaster, F. Wilfrid, & Linda Smith. (1983). Compatibility Issues Affecting Information Systems and Services.
General Information Programme and UNISIST.
LC. (n.d.). Metadata Object Description Schema (MODS): Conversions. Retrieved from http://www.loc.gov/standards/mods/mods-conversions.html
Svenonius, Elaine. (1983). Compatibility of Retrieval Languages: Introduction to a Forum. Int. Classif., 10(1), 2-4.
Best Practice Poster:
Converting Personal Comic Book Collection Records to Linked
Data
Sean Petiya
Kent State University, USA
[email protected]
Keywords: Linked Data; comic books; graphic novels; ontologies; metadata; usability.
1. Introduction
The Comic Book Ontology (CBO) is a metadata vocabulary currently in development for the
description of comic books and comic book collections. The vocabulary is part of a larger,
ongoing research project exploring the design and exchange of data about comic books and
graphic novels. The goal of the project is to produce a series of usable schemata and tools for the
many participants in the often complex universe of comic books, which includes publishers,
collectors, and libraries, among many others. The long-term objectives of the project include
addressing the needs and overlapping roles of each user group through designated application
profiles. Recognizing that all groups involved will have different needs, goals, and concerns, the
base for each of these user application profiles is a much simpler set of elements required to first
uniquely identify a resource. The intention of this core set of elements is to lower the difficulty in
implementing the vocabulary and enhance the overall understandability of the ontology. The core
application profile has been modeled from common elements found in the data of comic book
collectors, a community of users largely responsible for the preservation of the medium, which
has historically been underrepresented in knowledge institutions.
FIG 1. Core concepts in the Comic Book Ontology (CBO).
This poster describes progress on the Comic Book Ontology (CBO) by presenting a diagram
illustrating current components of the model (FIG. 1), and outlines the methodology and rationale
for producing a core application profile. Additionally, it presents a workflow illustrating how the
core set of elements is used to map user data to the vocabulary and generate RDF/XML records
through an automated process. Community data is commonly contained in spreadsheets, or made
available as CSV, and a workflow is described for both the preparation and conversion of that
data, as well as its connection to existing Linked Open Data (LOD) resources.
2. Background
The recent success of Marvel’s Guardians of the Galaxy at the box-office highlights the
dominance of the superhero movie in popular culture, and interest in the genre is only likely to
continue with future films planned featuring familiar icons like Batman, Superman, and Spider-Man. However, before these characters and stories made it to movie screens, they first appeared
in periodical comic books on newsstands, where they then made it into the homes and collections
of many generations of readers around the world. In addition to appearing in library special
collections and archives, like the Comic Art Collection of the Michigan State University Library
composed of over 200,000 items (comics.lib.msu.edu), the comic book is also collected by the
Library of Congress (LOC) and the institution’s Comic Book Collection contains over 120,000
comic issues (LOC, 2013). While the efforts of these institutions are significant, parallel activities
occur daily in the homes of many comic book collectors (Serchay, 1998). Passion and dedication
to the hobby on the part of both collectors and professionals has produced numerous research
projects and efforts dedicated to the comic book. Notable projects in this area include the Grand
Comics Database (GCD), an international effort to index all comic books published worldwide
(gcd.org), and Comichron: The Comic Book Chronicles, a research project collecting comic book
sales and circulation data (comichron.com), among many other related endeavors that can be
found in the Comics Research Bibliography (Rhode & Bullough, 2009). The Comic Book
Ontology (CBO) represents an effort to bring greater bibliographic control, representation, and
visibility to the endeavors of many writers, artists, researchers, and collectors who have
contributed to the preservation and proliferation of the medium.
3. Application Profile and Workflow
The comic book is a complex object that can be viewed as a bibliographic resource, collection
item, and art object, with its contents telling part of the story in an ongoing narrative that can span
multiple issues, volumes, and series titles, all of which compose a detailed, fictional universe. In
addition to the complexities of the objects themselves, the domain’s many participants, including
libraries and archives, each produce data of various degrees of quality and control, while
following different standards and practices. However, shared entities and elements found in the
data formulate a core model that can represent a simplified view of this complex world.
The methodology for producing the core application profile involved aligning components of
the Comic Book Ontology (CBO) to a WEMI (Work, Expression, Manifestation, Item) model. The WEMI model produces a view of the
core elements at various levels of description, up to a specific, physical copy in a comic book
collection. Extending the exchange of knowledge to the collector using Linked Data enables a
passionate and dedicated segment of the user population to participate in the ecosystem, not just
at the item-level, but at all levels of resource description potentially expanding the “global graph”
of RDF statements describing comic works, creators, and collections. However, in order to
participate successfully, users require a simple, clear process for the preparation and conversion
of their data. This workflow involves: (1) mapping existing data to CBO terms, (2) converting
data to qualified RDF/XML, and (3) automatically replacing values with LOD URIs. The
automated conversion process is achieved through an online tool, or a script that can be run
locally. Experienced users can modify, rewrite, or create their own script, and expand on the
selection of LOD resources linked in the resulting dataset.
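A sketch of step (2) of this workflow appears below; it assumes a hypothetical CBO namespace and property names (the published vocabulary at comicmeta.org may differ) and a CSV layout invented for the example.

import csv
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, DCTERMS

CBO = Namespace("http://example.org/cbo#")  # placeholder, not the real CBO namespace
g = Graph()
g.bind("cbo", CBO)
g.bind("dcterms", DCTERMS)

with open("collection.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):  # assumed columns: id, title, issue_number
        issue = URIRef(f"http://example.org/issues/{row['id']}")
        g.add((issue, RDF.type, CBO.ComicIssue))
        g.add((issue, DCTERMS.title, Literal(row["title"])))
        g.add((issue, CBO.issueNumber, Literal(row["issue_number"])))

g.serialize("collection.rdf", format="xml")  # qualified RDF/XML output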
4. Summary
The Comic Book Ontology (CBO) seeks to provide the tools through which collectors,
researchers, and libraries can share information about their individual collections and better
combine and exchange knowledge in a Linked Data environment. In order to improve the
usability of the ontology, a core application profile has been developed. A basic workflow
describes using this profile to guide the preparation and mapping of existing data to CBO
elements, and the automated conversion of that data to qualified RDF/XML containing Linked
Data URIs for common values. The core application profile will form the base of additional
profiles that will address the needs of other user groups as the ontology expands. The vocabulary
is made available at comicmeta.org, which functions as a repository for the ontology as well as all
related schemata, tools, and utilities.
References
Library of Congress. (2013, April). Comic book collection. Retrieved from http://www.loc.gov/rr/news/coll/049.html
Rhode, Michael, and John Bullough. (2009). Comics research bibliography. Retrieved from
http://homepages.rpi.edu/~bulloj/comxbib.html
Serchay, David. S. (1998). Comic book collectors: The serials librarians of the home. Serials Review, 24(1), 57-70.
doi:10.1016/S0098-7913(99)80103-8
Best Practice Poster:
Making Vendor-Generated Metadata Work for Archival
Collections Using VRA and Python
Carolyn Hansen
University of Cincinnati
United States
[email protected]
Sean Crowe
University of Cincinnati
United States
[email protected]
Keywords: vendor-generated metadata; metadata mapping; archival description; Dublin Core;
VRA; Python
1. Introduction
Although cataloging cultural resources requires a greater level of descriptive granularity than
standard library materials, metadata for digital collections is often generated by non-specialists.
This can lead to significant problems with metadata accuracy and consistency, causing
breakdown of authority control, high incidence of false positives in searching, and impeded
access to materials. The purpose of this poster is to illustrate a successful workflow for improving
vendor-generated metadata for a large digital collection of archival materials by converting the
metadata from the Dublin Core standard to the VRA standard using the scripting language
Python.
2. Background
The University of Cincinnati Libraries (UCL) contracted with a vendor to scan and generate
metadata for the Cincinnati Subway and Street Improvements Collection. Consisting of
photographs and documents related to the construction of the unfinished Cincinnati Subway
system and street improvements throughout the city, the collection is a unique resource
documenting early 20th century transportation, urban planning, and social history. Following the
initial load of approximately 9,000 scanned images and associated Dublin Core metadata records
into the shared OhioLINK Digital Repository Center, librarians Sean Crowe and Carolyn Hansen
were charged with converting the metadata to the VRA standard, improving metadata quality, and
loading the collection into the University's Luna image repository. Carolyn Hansen brought
metadata standard expertise and Sean Crowe provided technical and scripting skills to the project.
3. Implementation
The planning and specifications for the contract scanning project were conducted by UCL’s
Digital Projects Repositories Department, and did not include input from UCL’s Content Services
Division, in which the authors work. As a result, the project workflow began with an assessment
phase, which involved researching the initial scanning project, assessing the vendor-generated
metadata, and gathering domain-specific information about the original physical format of the
materials. A metadata map was created to record decisions about field equivalents between
Dublin Core and VRA, controlled vocabulary usage, improvement of vendor-generated metadata,
and addition of VRA-specific fields to describe original materials and digital surrogates.
These decisions were then encoded into a Python script. The Python script incorporated a
custom class to parse and process the metadata in CSV format. In addition to coding the field
conversions and formatting field contents based on the metadata map, the script ran several
validation processes on the input and output metadata files. Finally, a function was added to the
script to link records to image files by unique identifier. Coding the script comprised a
considerable portion of the project timeline, though the script run-time was negligible.
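The authors' actual script is available in the DC_to_VRA repository on GitHub; the fragment below is only a simplified sketch of the general pattern described above (a custom class, a field map, and light validation), with an invented mapping and file names.

import csv

DC_TO_VRA = {"title": "vra:title", "creator": "vra:agent", "date": "vra:date"}

class MetadataConverter:
    """Parse vendor CSV rows, apply the field map, and link records to images."""

    def __init__(self, mapping):
        self.mapping = mapping

    def convert(self, row):
        out = {vra: row.get(dc, "").strip() for dc, vra in self.mapping.items()}
        out["vra:refid"] = row.get("identifier", "")  # ties the record to its image file
        return out

    def validate(self, converted):
        return [field for field, value in converted.items() if not value]

converter = MetadataConverter(DC_TO_VRA)
fields = list(DC_TO_VRA.values()) + ["vra:refid"]
with open("vendor_dc.csv", newline="") as src, open("vra_out.csv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=fields)
    writer.writeheader()
    for row in csv.DictReader(src):
        converted = converter.convert(row)
        if converter.validate(converted):
            print("empty fields in record", converted["vra:refid"])
        writer.writerow(converted)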
4. Challenges
Project implementation involved a number of challenges. In terms of metadata mapping,
moving from a less robust standard like Dublin Core to a very robust standard like VRA required
strategic decisions. Since VRA provides the opportunity for highly-detailed descriptive metadata,
it is necessary to look at the metadata with a strong editorial eye in order to balance detailed
description with project time constraints and vendor-created metadata of varying quality. In order
to accomplish this, a baseline for acceptable metadata was created, detailing changes to vendor-created metadata as well as who would be responsible for metadata enrichment. For example,
errors in access points from controlled vocabularies such as LCNAF or LCSH headings would be
corrected by Content Services faculty, but additional subject analysis would be provided by
curators at a later stage in the project. The metadata quality baseline was also applied to
controlled vocabulary usage. For example, when working with detailed vocabularies like the
Getty Research Institute’s Art & Architecture Thesaurus, it was important to balance the level of
descriptive granularity with vocabulary that was understandable to users and applicable to a wide
range of materials.
Additionally, local practices regarding archival materials presented unique challenges to the
project. Specifically, university archivists at UCL preferred that the structure of the digital
collection should replicate the physical archive, including record order and collection level titles
for item records. As a result, titles without description of the image content such as “Rapid
Transit Photographs -- Box 17, Folder 22 (September 21, 1922 - October 24, 1922) -- negative,
1922-09-28, 9:42 A.M.” were used. These titles offer little descriptive content and create greater
reliance on subject searching. Further work needs to be done to make the collection searchable
based on the content of the image. Lastly, geographic coordinates, included in some of the
records, enrich the collection and should be added where possible.
5. Conclusions/Results
Since the collection was posted in Fall 2013, it has received over 17,000 unique page-views in
the Luna Repository. This project serves as a template for future shared, interdepartmental
projects. Further collaboration is certain as traditional Library Technical Services operations
evolve to support local and unique digital content, including research data, archival material, and
beyond.
Acknowledgements
Linda Newman, Head, University of Cincinnati Libraries Digital Content & Repositories Dept.
and Elna Saxton, Head, University of Cincinnati Libraries Content Services Division.
References
Crowe, Sean and Carolyn Hansen (2014). DC_to_VRA. In GitHub. Retrieved from https://github.com/crowesn/DC_to_VRA.
University of Cincinnati Libraries (2014). Cincinnati Subway and Street Improvements, 1916-1955. Retrieved from
http://digital.libraries.uc.edu/subway/.
University of Cincinnati Libraries (n.d.). LUNA Digital Repository. Retrieved from http://digproj.libraries.uc.edu:8180/luna/servlet/univcincin~42~42.
Best Practice Poster:
A Library Catalog REST API Framework
Jason Thomale
University of North Texas, United States
[email protected]
William Hicks
University of North Texas, United States
[email protected]
Keywords: library catalog metadata; MARC; Machine Readable Cataloging; REST;
Representational State Transfer; integrated library systems; API; application programming
interface;
Abstract
Within the archipelago of cultural memory data, library catalogs and systems still comprise
some of the most isolated and least penetrable desert islands. Although the library world has
made significant strides over the past decade to open its metadata, many individual libraries
remain at the mercy of their ILS vendors to implement open protocols, standards, and APIs. At
the University of North Texas Libraries, we have been developing a REST API framework for
exposing our catalog and ILS metadata, taking our first steps toward breaking our data off this
particular island.
Catalog resources that we’ve modeled so far include bibliographic records (modified from
MARC), item-level records, branch location records, item type records, and item status records.
We are also working on resources that support a shelf-list browser application, which mix user-supplied data with item and bibliographic metadata and demonstrate a real-world use for the API.
Our framework is not merely an API for our particular ILS. Rather, we are developing a
toolset to allow us to extract and re-model our ILS data—to use data derived from our ILS but not
necessarily to adhere to ILS data models—and expose the data as RESTful, linked resources.
Although our initial efforts have focused on modeling resources that do closely align with ILS
entities, future development will include extended models for work- and identity-related
resources and possibly extending our APIs to expose linked data (using, e.g., JSON-LD).
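To make the idea concrete, a read-only bibliographic resource endpoint might look roughly like the Flask sketch below; the route, field names, and sample record are assumptions for illustration, not the UNT Libraries' actual API.

from flask import Flask, abort, jsonify

app = Flask(__name__)

# Stand-in for records extracted and re-modeled from the ILS.
BIBS = {
    "b1234567": {
        "id": "b1234567",
        "title": "Example title",
        "items": ["/api/v1/items/i7654321"],          # linked item-level resource
        "_links": {"self": "/api/v1/bibs/b1234567"},
    }
}

@app.route("/api/v1/bibs/<bib_id>")
def get_bib(bib_id):
    record = BIBS.get(bib_id)
    if record is None:
        abort(404)
    return jsonify(record)

if __name__ == "__main__":
    app.run()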
Best practices in this area, exposing ILS metadata as RESTful resources, are hard to come by.
Given the mixture of metadata practitioners, systems-oriented individuals, and web-oriented
individuals that the Dublin Core Metadata Initiative (DCMI) conferences tend to attract, we hope
that presenting a poster about the project in the Best Practices track might allow us to connect
with others with whom we might dialog. Ultimately, we believe an exchange of information
about our project so far—our approach and practices—would be valuable to us and to others in
the DCMI community.
Best Practice Poster:
Building the Bridge: Collaboration between
Technical Services and Special Collections
Susan Matveyeva
Wichita State University
Libraries
[email protected]
Lizzy Walker
Wichita State University
Libraries
[email protected]
Keywords: departmental collaboration; standards; best practices; CONTENTdm; metadata
1. Introduction
At Wichita State University Ablah Library, members of Technical Services and Special
Collections began collaborating on a mass digitization project to increase visibility and
accessibility of Special Collections holdings, and to digitally preserve brittle rare materials. Both
departments scan collections, create metadata, and upload materials into CONTENTdm. The
departments overcame challenges regarding the project, such as limited collaboration between the
departments, poor communication, minimal metadata, and differences in quality control
expectations.
2. Challenges
Differences in philosophy between Technical Services and Special Collections presented the
first challenge. Special Collections was concerned with securing their collections and felt
librarians did not share this concern. They also emphasized the collections over users’ needs.
Since their practice was boutique-style treatment of items, less attention was paid to productivity or to keeping current with changing cataloging standards. Special Collections was also concerned that Technical Services was unfamiliar with archival practices. Technical Services' goal was to operate with the end user in mind; to that end, the adoption of RDA, as well as OCLC's Best Practices for CONTENTdm and Other OAI-PMH Compliant Repositories (2013), in metadata creation reflected this focus. The two departments also had a provider/client relationship at first, and Technical Services felt Special Collections was not familiar with the standards and practices the cataloguers used, nor with Technical Services' production environment.
Poor communication presented another challenge. When Technical Services completed a
collection, it was returned to Special Collections with the expectation of rapid feedback. With no information forthcoming, Technical Services operated as though there were no problems. As a result, they completed six collections by the time Special Collections sent feedback. Some required corrections stemmed from incorrect information in old finding aids. Additionally, Technical
Services and Special Collections each had internal control processes, but no shared criteria
existed for gauging quality.
Common metadata standards did not exist between the departments. Technical Services
operated in a production environment based on collaboration and cooperation. Special
Collections operated in an isolated environment with emphasis on unique description, locally
created metadata, and traditional archival standards. They also did not have a dedicated cataloger
on staff. The tension between unique, uncontrolled-vocabulary metadata and controlled vocabularies for interoperability allowed for much constructive debate. The goal with regard to metadata was to get
Special Collections and Technical Services using the same standards.
The departments had different approaches to metadata. Special Collections focused mainly on
the descriptive metadata in a human-readable format. Technical Services was interested in the
addition of administrative and technical metadata, and kept in mind the current Web environment
and machine-readable representation of information.
3. Method and Results
Building trust between the departments involved many facets. Staff from both departments
created a metadata group responsible for creating metadata templates for manuscripts and printed
materials. Investigation of standards and best practices, creation of data dictionaries, and mapping
templates were only a few of the topics focused on by this subcommittee. The group developed
minimal and core level metadata templates for published and unpublished materials based on
common standards for rare books and manuscripts using OCLC's Best Practices. The templates
focused on access to collections, future migration, and preservation. Technical Services
accommodated unique needs of Special Collections while working on the creation of shared
workflows and metadata templates. Special Collections responded positively to the processes and
also recommended changes based on their needs. Multiple revisions to the templates were
required to accommodate both departments.
Quality control quickly became a priority. With administrative support, the departments implemented pre-planning meetings to discuss specific collections. These meetings cover the level of metadata Special Collections and Technical Services select for a collection.
Levels of quality control are also present throughout the process. Multiple people handle the
scans and metadata in terms of viewing and uploading, as well as the final review. There are also
pre-planning meeting forms, scan inventory worksheets, a metadata cheat sheet for the catalogers,
and workflow checklists.
Technical Services introduced DC mapping in CONTENTdm, as well as an enhanced
production environment to Special Collections that previously performed boutique treatment of
materials. Likewise, Special Collections communicated their specific needs so that we gained an
understanding of expectations from Special Collections, which the metadata group kept in mind
when creating the templates. Special Collections’ willingness to work with the OCLC Best
Practices, as well as RDA, was a real leap for the department in terms of opening up their
collections to a worldwide audience.
4. Next Steps
Next steps include appointing a project manager who will lead each project from beginning to
end. A metadata checklist is being created to aid the catalogers in reviewing their peers’ work.
Creation of local controlled vocabularies will also be a future project.
5. Conclusion
This has been a positive collaborative experience for both departments. Bringing together the expertise of catalogers and the uniqueness of Special Collections has helped the collections become less isolated. The
implementation of metadata and cataloging standards creates a layer of interoperability, and
increases the potential of users finding unique materials. Additionally, the departments have a
new working relationship that will hopefully continue in the future.
Acknowledgements
The authors would like to thank Wichita State University Libraries Special Collections and
University Archives for their collaboration and partnership in creating this project.
References
OCLC. (2013). Best practices for CONTENTdm and other OAI-PMH compliant repositories: Creating sharable
metadata. Retrieved June 01, 2014, from
http://www.oclc.org/content/dam/support/wcdigitalcollectiongateway/MetadataBestPractices.pdf
Best Practice Poster:
Best Practices for Complex Diacritics Handling in CONTENTdm
Jason W. Dean
University of Arkansas
Libraries, USA
[email protected]
Deborah E. Kulczak
University of Arkansas
Libraries, USA
[email protected]
Keywords: CONTENTdm, diacritical marks, indexing, UTF, encoding
In order to ensure the best possible access for materials held by libraries and archives, these
institutions must employ special accent and punctuation marks when transcribing or
transliterating languages other than English. These marks are called diacritical marks by the
library community. Their use in MARC cataloging is widespread, as is their use in library
catalogs. However, users of digital content management systems (CMS), such as CONTENTdm
encounter difficulty in ensuring appropriate diacritical marks are read by the CMS when metadata
is imported or migrated into such a system. These problems further compound searching issues
for the user, as noted by Bar-Ilan and Gutman (2005) in the Journal of Information Science. Little
literature and few instructions exist to assist users in working with these diacritical marks.
However, some pertinent literature exists on the subject.
In Hongyan Jing’s essay for an IEEE symposium on speech synthesis (2002), the author, in discussing Italian, highlights the ubiquity of all types of diacritical marks. His work states that of 445,626 entries in a dictionary, 4.9% include a diacritical mark. Though the proportion is not high, for libraries and archives it represents a barrier to access and
description that must be overcome. Tull and Straley’s article in Library Hi Tech (2003) covers the
issues presented in sorting and searching in relation to diacritical marks. Most literature discusses
formatting text in UTF-8 or similar UTF standards; however, some literature discusses the use of
ASCII. This poster focuses on the use of UTF-8, which is required by CONTENTdm to ingest
diacritical marks correctly.
The research and work behind this poster came largely from a recently completed project at the
University of Arkansas Libraries that dealt with metadata and items in a plethora of languages,
from English and French to Quapaw, many of which required the use of unusual diacritical
marks. The authors were responsible for the ingestion of metadata into CONTENTdm and
encountered several issues with complex diacritical marks presented by the disparate languages in
this project. What follows is the procedure arrived at and now codified in a metadata “cookbook.”
The handling of these diacritical marks was primarily in three areas: controlled vocabularies in
CONTENTdm, transcripts, and loading metadata spreadsheets.
Creating and importing a controlled vocabulary list is most easily done in Notepad++ and
encoded as “UTF-8 without BOM.” With this encoding setting, and following the CONTENTdm instructions for loading a controlled vocabulary, diacritical marks ingested into CONTENTdm will be maintained.
Transcripts are best handled using a similar procedure. Transcripts are created in Notepad++,
and saved as “UTF-8 without BOM” as the encoding setting. However, some transcripts might be
loaded at the same time as metadata in a spreadsheet. In this case, if the spreadsheet is created in
Excel, the user must use the “Arial Unicode MS” font for data entry. When data entry is
complete, use the Save As command to save the spreadsheet as a tab-delimited text file. In the
Save As dialog box, select “Unicode Text” from the “Save as type” menu. After selecting
“Unicode Text”, select the “Tools” box to the left of the “Save” button. Select the “Encoding” tab
in the “Web Options” dialog box. In the “Save this document as” box, select “Unicode (UTF-8)”
from the drop-down menu. Select “OK” then “Save” in the Save As menu.
Metadata spreadsheets and tab-delimited files present a similar set of challenges when loading diacritical marks into CONTENTdm. If the spreadsheet is created in Excel, follow the same procedure described above for transcripts: enter the data using the “Arial Unicode MS” font, save the spreadsheet as a tab-delimited text file with “Unicode Text” as the file type, and set the encoding to “Unicode (UTF-8)” in the “Web Options” dialog (reached via the “Tools” button in the Save As dialog) before saving.
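Where a scripted alternative is preferred, the same re-encoding can be done in a few lines of Python; the file names below are hypothetical, and the sketch assumes that Excel's "Unicode Text" output is UTF-16.

# Re-encode an Excel "Unicode Text" (UTF-16) export as UTF-8 without a BOM,
# the encoding CONTENTdm requires to ingest diacritical marks correctly.
with open("metadata_export.txt", encoding="utf-16") as src:
    text = src.read()

with open("metadata_utf8.txt", "w", encoding="utf-8", newline="") as dst:
    dst.write(text)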
References
Bar-Ilan, Judit, and Tatyana Gutman. (2005). How do search engines respond to some non-English queries? Journal of
Information Science. 31(1), 2005, 13-28.
Jing, Hongyan. (2002). Identifying accents in Italian text: a preprocessing step in TTS. Proceedings of 2002 IEEE
Workshop on Speech Synthesis, 2002, 151-154.
Tull, Laura, and Dona Straley. (2003). Unicode: Support for multiple languages at the Ohio State University Libraries.
Library Hi Tech. 21(4), 2003, 440-450.
Best Practice Demonstration:
Ecco!: A Linked Open Data Service for Collaborative Named
Entity Resolution
Matthew Miller
New York Public Library,
NYPL Labs,
United States
[email protected]
M. Cristina Pattuelli
Pratt Institute, School of
Information and Library
Science,
United States
[email protected]
Keywords: named entity resolution; Linked Open Data (LOD); Ecco!;
Abstract
This demo proposal presents Ecco!, a Linked Open Data (LOD) application for entity
resolution. Specifically, Ecco! is designed to disambiguate and reconcile named entities with
URIs from authoritative sources. Technically, Ecco! creates a wrapper around LOD APIs of
suitable datasets such as VIAF and Freebase to retrieve data useful for supporting entity
matching. The system automatically ranks and groups the results into different clusters according
to various confidence levels – from exact matches to one-to-many matches to no match. The quality of the
the data output can be further refined through human disambiguation consisting of validating a
match or identifying the correct URI when multiple matches are possible.
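Ecco! itself is implemented in node.js, as noted later in this abstract; purely to illustrate the clustering idea, the Python fragment below groups candidate (label, URI) pairs into exact, review, and no-match buckets using a simple string-similarity score. The threshold, candidates, and URIs are invented, and this is not Ecco!'s ranking logic.

from difflib import SequenceMatcher

def cluster_matches(name, candidates, threshold=0.85):
    """Group candidate (label, uri) pairs by how confidently they match `name`."""
    scored = [(SequenceMatcher(None, name.lower(), label.lower()).ratio(), label, uri)
              for label, uri in candidates]
    exact = [(label, uri) for score, label, uri in scored if score == 1.0]
    close = [(label, uri) for score, label, uri in scored if threshold <= score < 1.0]
    if exact and not close:
        return "exact match", exact
    if exact or close:
        return "needs human review", exact + close  # human disambiguation step
    return "no match", []

candidates = [("Mary Lou Williams", "http://example.org/auth/1"),
              ("Mary L. Williams", "http://example.org/auth/2")]
print(cluster_matches("Mary Lou Williams", candidates))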
Ecco! is designed to enable users to quickly and easily contribute to this curation process. The
system provides an intuitive user interface that supports a collaborative workflow where a
community can work together in a distributed and incremental way. The combination of
automated matching plus human curation has the potential to produce a superior quality of data,
not currently achievable through traditional methods.
This application works alongside existing legacy systems and data sources through an import
and export workflow. Extracts generated from a legacy system or data source are enriched
through Ecco! and then looped back to update the originating source. Ecco! intends to address the
well-known "bucket names" problem that occurs when legacy data has accumulated and contains
a mix of heterogeneous names derived from different authorities (e.g., LC/NAF, ULAN, etc.) as
well as locally defined terms.
Ecco! is a node.js application that anyone can download and run on their local system. There is
no need for a server installation, but it could be installed on a server to allow for the collaboration
of an unlimited number of participants. Ecco! has the capacity to work with LOD APIs in a
modular way. While the demo version will specifically leverage VIAF and Freebase, a plugin could be written for virtually any API. Also, while in the current release the application will be
centered on persons and organizations, other types of entities including geographic locations,
events, topics, etc. could be also handled by the system.
Even though Ecco! was developed as part of the Linked Jazz project,1 it is domain-agnostic
and thus not tied to any specific context of use. The demonstration includes different scenarios
showing a series of use cases. Results from a first round of testing will also be shared.
Data quality poses a daunting challenge in Linked Open Data development and requires the
creation and adoption of new methods and tools to promote accuracy and consistency of data.
Ecco! includes a series of innovative features that make it uniquely flexible and easy to use.
1 http://linkedjazz.org
Most notably, this system lowers the barrier for non-programmers who want to actively
contribute to the production of high quality linked data through a user-friendly and collaborative
platform.
Best Practice Poster:
Wikipedia-based Extraction of Lightweight Ontologies for
Concept Level Annotation
Elshaimaa Ali
University of Louisiana at
Lafayette, USA
[email protected]
Michael Lauruhn
Elsevier Labs, USA
[email protected]
Keywords: Wikipedia; text mining; annotation; semantic annotation; lightweight ontologies
Abstract
This poster describes a project under development. We propose a framework for automating
the construction of lightweight ontologies for semantic annotations. A lightweight ontology is defined as an ontology that does not have to include all the components expressed with formal languages, such as concept taxonomies, formal axioms, and disjoint and exhaustive decompositions of concepts (Giunchiglia and Zaihrayeu 2009). However, manual enhancement of the ontology through the addition of axioms, rules, disjoint sets, etc., is possible for future reasoning purposes. The purpose behind this research is to evaluate possible means for efficiently annotating domain-specific content using open ontology sources.
When building ontologies for annotation in any domain, we follow the ontology learning process in (Stelios 2006), which comprises: acquisition of the relevant terminology,
identification of synonym terms / linguistic variants, formation of concepts, hierarchical
organization of the concepts (concept hierarchy), learning of relations, properties or attributes,
together with the appropriate domain and range, hierarchical organization of the relations
(relation hierarchy), instantiation of axiom schemata, definition of arbitrary axioms, and ontology
evaluation. Since we are looking for a lightweight ontology, we only consider a subset of these
tasks, which are the acquisition of domain terminologies, generating concept hierarchies, learning
relations and properties, and ontology evaluation.
When developing the framework modules, we base most of our knowledge base on the structure of Wikipedia, which represents the hierarchical links between categories and the links between pages, in addition to specific sections of the content. To ensure machine readability and interoperability, ontologies have to be explicit to make an annotation publicly accessible, formal to make an annotation publicly agreeable, and unambiguous to make an annotation publicly identifiable (Ding 2006). An important aspect in achieving the explicitness, formality, and unambiguity of the developed ontology is to define an annotation schema that allows the ontologies to be reused and to be part of linked data.1 We designed our schema based on annotation elements already defined in the Dublin Core standards2 and we used the DBpedia3 annotation
elements for defining named entities. We are also introducing new elements for annotating
concepts and defining the context (domain knowledge) in which the concept exists.
The main tasks for this framework are: extracting domain concepts and terms, measuring
relatedness between domain terms, defining boundaries of subdomains using concept clustering
and extracting relations, and defining named entities within each subdomain.
The following figure gives an abstract overview of the modules of the proposed framework.
1 http://linkeddata.org/
2 http://dublincore.org/documents/usageguide/
3 http://mappings.dbpedia.org/server/ontology/classes/
FIG. 1. Framework modules.
We start with a pool of domain terms and concepts that need to be modeled for the domain ontology; if such a pool is not available, we build Module A, where we start with relevant domain concepts and treat them as seed concepts for the ontology. We then expand the concept space using the Wikipedia link structure.
In Module B, we generate the domain and subdomain boundaries by computing relatedness
between domain terms extracted in the first phase. Then we build a similarity matrix that models
the relatedness between the extracted concepts. We developed a relatedness measure that relies
on the degree of connectivity between two concepts in the Wikipedia graph. We then use
hierarchical clustering (Diday and Simon 1980) to create subdomain boundaries. In Module C,
we classify the generated named entities and concepts into wiki concepts and named wiki entities
according to the description of the annotation schema. We will use the DBpedia triples for named
entity recognition. In Module D we extract concept hierarchies and concept – concept relations by
analyzing sections of Wikipedia articles. We will use OpenNLP to parse and extract relations defined in sections such as the introductory section of the Wikipedia page that defines the concept, in addition to exploring the Wikipedia category graph. OpenNLP has been successfully
used for extracting relations for ontology enrichment in (Barkschat 2014). For Module E, we will
evaluate the extracted ontology by comparing it to mature existing ground truths, such as predefined domain ontologies or topic maps created by a domain expert, and will also use manual and expert evaluations.
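As a toy stand-in for Module B (not the authors' relatedness measure), the sketch below scores relatedness as the Jaccard overlap of each concept's outgoing Wikipedia links and feeds the resulting distances to SciPy's hierarchical clustering; the link sets are fabricated for illustration.

from itertools import combinations
from scipy.cluster.hierarchy import fcluster, linkage

# Fabricated outgoing-link sets standing in for data mined from Wikipedia.
links = {
    "Ontology": {"Metadata", "RDF", "OWL"},
    "RDF": {"Ontology", "OWL", "Linked data"},
    "Photosynthesis": {"Chlorophyll", "Plant"},
}
concepts = list(links)

def relatedness(a, b):
    """Jaccard similarity of two concepts' link sets."""
    return len(links[a] & links[b]) / len(links[a] | links[b])

# Condensed distance matrix (1 - similarity) in the pair order SciPy expects.
distances = [1 - relatedness(a, b) for a, b in combinations(concepts, 2)]
tree = linkage(distances, method="average")
print(dict(zip(concepts, fcluster(tree, t=0.8, criterion="distance"))))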
Acknowledgements
Research and development of the project described in this poster and abstract is part of the
LinkedUp Project, a European Commission FP7 Support Action.
References
Barkschat, K. (2014). Semantic Information Extraction on Domain Specific Data Sheets. The Semantic Web: Trends
and Challenges, Springer: 864-873.
Diday, E. and J. Simon (1980). Clustering analysis. Digital Pattern Recognition, Springer: 47-94.
Ding, Y., & Embley, D. W. (2006). Using Data-Extraction Ontologies to Foster Automating Semantic Annotation. ICDE.
Giunchiglia, F. and I. Zaihrayeu (2009). Lightweight ontologies. Encyclopedia of Database Systems, Springer: 1613-1619.
Stelios, K., Dimitris, A. (2006). "Consensus Building in Collaborative Ontology Engineering Processes." Journal of
Universal Knowledge Management vol. 1 (no. 3): 199-216.
Best Practice Poster:
How to Build a Local Thesaurus
Robert H. Estep
Fondren Library, Rice University
United States
[email protected]
Keywords: LCNAF; local headings; thesaurus construction.
1. First Steps
One of the first steps taken in preparing for participation in the Rice Historical Images Project
was the assembling of a skeletal structure of terms specific to both Rice University and its earlier
incarnation as the William M. Rice Institute.
The primary research tools used were archival maps and blueprints, newspaper accounts
contemporaneous with the University’s building schedule, campus telephone directories, and
online entries (including Wikipedia). Alphabetical lists of building names, and names of
University departments and schools, were cross-referenced for name changes effective during the
University’s history, and checked against the corporate LCNAF. The more complex internal
inconsistencies were noted for future fiddling and/or resolution.
2. People, Places, and Things
It was clear from the start that an unusually large number of LOCAL headings would need to
be constructed, the bulk of these in the form of corporate headings for departments and
organizations, such as the Rice MOB (or Marching Owl Band), university buildings and
structures, as well as headings related to the city of Houston. In some cases additional research
was required, one example being that of the historic Rice Hotel in downtown Houston, for which
construction and demolition dates were included in the heading.
Once the project commenced a large number of personal name headings were also required, for
faculty, staff, students, members of the Houston business community, city and state politicians,
and others. Additionally, in the case of University faculty with existing LCNAF entries, a second
LOCAL heading was included with the parenthetical “(Faculty)” made explicit. Regional
resources, often in the form of obituaries or published tributes and Festschrifts, were scoured for
relevant dates, middle initials and names, and other information.
The smallest number of LOCAL headings were reserved for ‘things’: in other words, for
events or activities which were unique to the history of Rice as an institution. The most prominent
of these were the “May Fete” and the “Spring Rondolet” (both dance events), as well as the
thematically adventurous “Archi-Arts Ball”, not to mention the yearly “Beer Bike cycling event”.
In addition to events such as these, certain architectural features were given headings because of
the frequency and prominence of their appearances in the images, for instance, the large central
“Sallyport” which can be seen from Main Street and which ushers visitors and members of the
Rice community alike into the large central “Quad”, bounded on one end by Fondren Library.
3. The Desire for Consistency
The use of LOCAL headings gave us the latitude to delve deeply into the metadata description
of each image, but every attempt was made to model these headings upon valid forms already in
LC, whether the LOCAL heading was for a building name, a student organization, or a member
of the faculty.
We quickly learned that bannering consistency as one of the most important qualities of the thesaurus would demand patience and flexibility, as new image types or additional archival information made our earlier entries obsolete or overly unique.
4. Looking Forward, Looking Back
Building a progressive thesaurus which is both a melding of valid LC headings and LOCAL
headings requires the flexibility of being able to return from time to time, sometimes to tweak,
other times to undo earlier work and start from scratch. But the pattern we have found is that each
return is both easier and shorter, as we digest the lessons of embarking on a project which extends
both into the past via the images themselves, and into the future as the life of the University, its
teachers and its students, continues to be documented.
5. Illustrative Aspects
The poster will feature graphics in the form of sample images from the project collection itself,
as well as screen shots of the existing Thesaurus.
Best Practice Poster:
Designing an Archaeology Database: Mapping Field
Notes to Archival Metadata
Ann Ellis
Austin State University Library
Stephen F. Austin State University
United States
[email protected]
Keywords: archaeology; field notes; archaeological artifacts
Abstract
The Stephen F. Austin State University Center for Digital Scholarship and Center for Regional
Heritage Research engaged in a collaborative project to design and implement a database
collection in a digital archive that would accommodate images, data and text related to
archaeological artifacts located in East Texas. There were challenges in creating metadata profiles
that could effectively manage, retrieve and display the disparate data in multiple discovery
platforms.
The poster illustrates the steps that were taken to map field notes into useful archival metadata.
Using original notes and field record information a preliminary data dictionary was created.
After collaborative edits and revisions were made, a comprehensive data dictionary was designed
to represent the materials in the collection. From this, a profile was configured in the digital
archive platform to allow for upload of the metadata and images, and for discovery and display of
the archaeological artifacts and related works.
Best Practice Poster:
Utilizing Drupal for the Implementation of a Dublin Core-Based
Data Catalog
Lisa Federer
National Institutes of Health Library,
United States
[email protected]
Keywords: NIH; National Institutes of Health; data catalog; Drupal; content management
system.
1. Objective
To create a data catalog suitable for use within the context of biomedical and health sciences
research. The ideal catalog would allow researchers to easily describe their data using Dublin
Core Metadata Terms and subject-appropriate controlled vocabularies, as well as provide search
and browse capabilities for end users to enable data discovery and facilitate re-use.
2. Setting
The National Institutes of Health (NIH) Library serves the community of NIH intramural
researchers, which includes over 1,200 principal investigators and 4,000 postdoctoral fellows
conducting basic, translational, and clinical research on its primary campus in Bethesda, MD, and
several satellite campuses.
3. Methods
Drupal, a free and open-source content management system, was utilized as a framework for a
data catalog using the Dublin Core Metadata Terms. Using the Structure function within Drupal,
the research data informationist at the NIH Library constructed a pilot system that utilized Dublin
Core Metadata schema and relevant biomedical taxonomies. This pilot system can be adapted to
the needs of a variety of basic, translational, and clinical research applications.
4. Results
The pilot system is currently undergoing testing with researchers within the NIH intramural
community. Results will be available by the time of the DCMI 2014 conference.
5. Conclusions
A data catalog that utilizes an extensible metadata schema like Dublin Core and an open-source framework like Drupal provides users a powerful yet uncomplicated method for
describing their data.
6. Implications
As funders and publishers increasingly require data sharing, researchers will need simple,
intuitive methods for describing their data. Open-source systems like Drupal and extensible
metadata schema like Dublin Core will likely play a large role in data description, thus making
data more discoverable and facilitating data re-use.
Best Practice Poster:
PunkCore: Developing an Application Profile for the Culture of
Punk
Joelen Pastva
University of Illinois at
Chicago, United States
[email protected]
Valerie Harris
University of Illinois at
Chicago, United States
[email protected]
Keywords: DCAP; Punk culture; Punk music; domain model; functional requirements; genre
vocabulary;
Abstract
PunkCore is a Dublin Core Application Profile (DCAP) for the description of the culture of
Punk, including its music, its places, its fashions, its artistic expression through film and art, and
its artifacts such as fliers, patches, buttons, and other ephemera. The structure of PunkCore is
designed to be simple enough for non-experts yet specific enough to meet the needs of
information professionals and to capture the unique qualities of materials classified as Punk. In
the interest of interoperability and adoptability, PunkCore is drawn from existing metadata
schema, and the development of PunkCore is intended to be open and collaborative to appeal to
the entire Punk community. Our poster illustrates the initial development of the PunkCore
standard and outlines future plans to bring PunkCore to the community.
The PunkCore DCAP is in its first phase of development, which follows Singapore Framework
stages 1 and 2, including the creation of a functional requirements document and domain model.
In order to capture the specificity of Punk culture, a preliminary genre vocabulary has also
been developed. The functional requirements document, domain model, and genre vocabulary
will be published on a wiki for community discussion and feedback. The remaining phases of
development, including the creation of a description set profile and usage guidelines, will be
initiated following our review of community interest and comments.
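To make the shape of such a description concrete, the sketch below (Python with rdflib) describes a hypothetical gig flier using DCMI Metadata Terms plus a made-up genre term; the punkcore namespace, the item URI, and all values are assumptions for illustration and are not part of the published profile.

# Illustrative sketch only: a hypothetical PunkCore-style description of a gig
# flier. The "punk" namespace, genre URI, and item URI are invented for this
# example and are not defined by the actual application profile.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

PUNK = Namespace("http://example.org/punkcore/")         # hypothetical namespace
GENRE = Namespace("http://example.org/punkcore/genre/")  # hypothetical vocabulary

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("punk", PUNK)

flier = URIRef("http://example.org/items/flier-1982-001")  # hypothetical item
g.add((flier, RDF.type, PUNK.Ephemera))
g.add((flier, DCTERMS.title, Literal("Show flier, all-ages matinee, 1982")))
g.add((flier, DCTERMS.created, Literal("1982")))
g.add((flier, DCTERMS.spatial, Literal("Chicago, Illinois")))
g.add((flier, PUNK.genre, GENRE.hardcore))  # drawn from a draft genre list

print(g.serialize(format="turtle"))

A description set profile developed in the later phases would constrain which properties are required and which vocabularies supply their values.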
The ultimate goal of this DCAP is to reach the Punk community and achieve broad adoption.
The outcome of our work would aid in the effective acquisition and dissemination of Punk
materials, or their metadata, in a variety of settings. Our project will also be useful to other niche
communities documenting their cultural contributions because it provides a model that
incorporates community outreach with traditional metadata development to lend more credibility
and visibility to the end result.
241
AUTHOR INDEX
Alam, Andias Wira  64
Alemneh, Daniel  43
Ali, Elshaimaa  234
Arlitsch, Kenning  138
Baierer, Konstantin  1
Bekhuis, Tanja  201
Borba, Cleverton Ferreira  179
Borek, Luise  181
Bosch, Thomas  95, 129
Büchner, Michael  24
Chen, Gang  31
Chen, Miao  157
Cole, Timothy  196
Correa, Pedro Luiz P.  179
Cox, Ana  176
Crowe, Sean  225
Curdt, Constanze  199
Dean, Jason W.  230
Dombrowski, Quinn  181
Dressler, Virginia  204
Dröge, Evelyn  1
Dulock, Michael  216
Ecker, Kai  95, 129
Eichenlaub, Naomi  191
Ellis, Ann  239
Estep, Robert H.  237
Esteva, Maria  53
Faith, Ashleigh N.  201
Farnel, Sharon  74
Federer, Lisa  240
Feng, Ying  187
Fukuyama, Julie  210
Greenberg, Jane  37
Han, Myung-Ja  196
Hansen, Carolyn  225
Harris, Valerie  241
Hashizume, Akiko  210
Hicks, William  227
Hoffmeister, Dirk  199
Honma, Tsunagu  109
Jin, (Boaz) Sunyoung  219
Johnson, Michael  53
Kulasekaran, Sivakumar  53
Kulczak, Deborah E.  230
Kunze, John A.  83
Lauruhn, Michael  234
Lempron, Patricia  196
Li, Chunqiu  147
Liu, Xiaozhong  157
Maron, Deborah  37
Masak-Mida, Ingrid  191
Matienzo, Mark A.  12
Matveyeva, Susan  228
Miller, Matthew  232
Missen, Cliff  37
Mixter, Jeff Keith  138
Morgan, Marina  191
Nagamori, Mitsuharu  109
Norman, Michael  196
Obrien, Patrick  138
Ogletree, Adrian T.  184
Pastva, Joelen  241
Pattuelli, M. Cristina  232
Perkins, Jody  181
Petiya, Sean  222
Petras, Vivien  1
Peyrard, Sébastien  83
Phillips, Mark  43
Porter, Emily  207
Qin, Jian  157
Rudersdorf, Amy  12
Rühle, Stefanie  24
Sarol, M. Janina  196
Schöch, Christof  181
Schulze, Francesca  24
Shakeri, Shadi  43
Shir, Ali  74
Stein, Ayla  196
Sugimoto, Shigeo  109, 147
Tanaka, Kei  109
Tarver, Hannah  43
Thomale, Jason  227
Thompson, Santi  167
Tramoni, Jean-Philippe  83
Trelogan, Jessica  53
Trkulja, Violeta  1
Tseytlin, Eugene  201
Urban, Richard J.  119
Walker, Lizzy Anne  228
Wan, Jing  31
Weathers, William  196
Weidner, Andrew  167
Wittenberg, Jamie Viva  173
Wu, Annie  167
Xiao, Long  187
Yi, Junka  31
Zavalina, Oksana  43
Zhou, Yubin  31