SEMANTIC SPACES OF
CLINICAL TEXT
Leveraging Distributional Semantics
for Natural Language Processing of
Electronic Health Records
Aron Henriksson
Licentiate Thesis
Department of Computer and Systems Sciences
Stockholm University
October, 2013
Stockholm University
DSV Report Series No. 13-009
ISSN 1101-8526
© 2013 Aron Henriksson
Typeset by the author using LaTeX
Printed in Stockholm, Sweden by US-AB
ABSTRACT
The large amounts of clinical data generated by electronic health record systems
are an underutilized resource, which, if tapped, has enormous potential to improve
health care. Since the majority of this data is in the form of unstructured text,
which is challenging to analyze computationally, there is a need for sophisticated
clinical language processing methods. Unsupervised methods that exploit
statistical properties of the data are particularly valuable due to the limited
availability of annotated corpora in the clinical domain.
Information extraction and natural language processing systems need to
incorporate some knowledge of semantics.
One approach exploits the
distributional properties of language – more specifically, term co-occurrence
information – to model the relative meaning of terms in high-dimensional vector
space. Such methods have been used with success in a number of general language
processing tasks; however, their application in the clinical domain has previously
only been explored to a limited extent. By applying models of distributional
semantics to clinical text, semantic spaces can be constructed in a completely
unsupervised fashion. Semantic spaces of clinical text can then be utilized in a
number of medically relevant applications.
The application of distributional semantics in the clinical domain is here
demonstrated in three use cases: (1) synonym extraction of medical terms,
(2) assignment of diagnosis codes and (3) identification of adverse drug reactions.
To apply distributional semantics effectively to a wide range of both general and,
in particular, clinical language processing tasks, certain limitations or challenges
need to be addressed, such as how to model the meaning of multiword terms
and account for the function of negation: a simple means of incorporating
paraphrasing and negation in a distributional semantic framework is here proposed
and evaluated. The notion of ensembles of semantic spaces is also introduced;
these are shown to outperform the use of a single semantic space on the synonym
extraction task. This idea allows different models of distributional semantics,
with different parameter configurations and induced from different corpora, to be
combined. This is not least important in the clinical domain, as it allows potentially
limited amounts of clinical data to be supplemented with data from other, more
readily available sources. The importance of configuring the dimensionality of
semantic spaces, particularly when – as is typically the case in the clinical domain
– the vocabulary grows large, is also demonstrated.
SAMMANFATTNING
De stora mängder kliniska data som genereras i patientjournalsystem är en
underutnyttjad resurs med en enorm potential att förbättra hälso- och sjukvården.
Då merparten av kliniska data är i form av ostrukturerad text, vilken är utmanande
för datorer att analysera, finns det ett behov av sofistikerade metoder som kan
behandla kliniskt språk. Metoder som inte kräver märkta exempel utan istället
utnyttjar statistiska egenskaper i datamängden är särskilt värdefulla, med tanke på
den begränsade tillgången till annoterade korpusar i den kliniska domänen.
System för informationsextraktion och språkbehandling behöver innehålla viss
kunskap om semantik. En metod går ut på att utnyttja de distributionella
egenskaperna hos språk – mer specifikt, statistik över hur termer samförekommer
– för att modellera den relativa betydelsen av termer i ett högdimensionellt
vektorrum. Metoden har använts med framgång i en rad uppgifter för behandling
av allmänna språk; dess tillämpning i den kliniska domänen har dock endast
utforskats i mindre utsträckning. Genom att tillämpa modeller för distributionell
semantik på klinisk text kan semantiska rum konstrueras utan någon tillgång till
märkta exempel. Semantiska rum av klinisk text kan sedan användas i en rad
medicinskt relevanta tillämpningar.
Tillämpningen av distributionell semantik i den kliniska domänen illustreras
här i tre användningsområden: (1) synonymextraktion av medicinska termer,
(2) tilldelning av diagnoskoder och (3) identifiering av läkemedelsbiverkningar.
Det krävs dock att vissa begränsningar eller utmaningar adresseras för att möjliggöra en effektiv tillämpning av distributionell semantik på ett brett spektrum av
uppgifter som behandlar språk – både allmänt och, i synnerhet, kliniskt – såsom
hur man kan modellera betydelsen av flerordstermer och redogöra för funktionen
av negation: ett enkelt sätt att modellera parafrasering och negation i ett distributionellt semantiskt ramverk presenteras och utvärderas. Idén om ensembler av
semantiska rum introduceras också; dessa överträffar användningen av ett enda semantiskt rum för synonymextraktion. Den här metoden möjliggör en kombination
av olika modeller för distributionell semantik, med olika parameterkonfigurationer
samt inducerade från olika korpusar. Detta är inte minst viktigt i den kliniska
domänen, då det gör det möjligt att komplettera potentiellt begränsade mängder
kliniska data med data från andra, mer lättillgängliga källor. Arbetet påvisar också
vikten av att konfigurera dimensionaliteten av semantiska rum, i synnerhet när
vokabulären är omfattande, vilket är vanligt i den kliniska domänen.
LIST OF PAPERS

This thesis is based on the following papers:

I. Aron Henriksson, Mike Conway, Martin Duneld and Wendy W. Chapman (2013). Identifying Synonymy between SNOMED Clinical Terms of Varying Length Using Distributional Analysis of Electronic Health Records. To appear in Proceedings of the Annual Symposium of the American Medical Informatics Association (AMIA), Washington DC, USA.

II. Aron Henriksson, Hans Moen, Maria Skeppstedt, Vidas Daudaravičius and Martin Duneld (2013). Synonym Extraction and Abbreviation Expansion with Ensembles of Semantic Spaces. Submitted.

III. Aron Henriksson and Martin Hassel (2011). Exploiting Structured Data, Negation Detection and SNOMED CT Terms in a Random Indexing Approach to Clinical Coding. In Proceedings of the RANLP Workshop on Biomedical Natural Language Processing, Association for Computational Linguistics (ACL), pages 3–10, Hissar, Bulgaria.

IV. Aron Henriksson and Martin Hassel (2013). Optimizing the Dimensionality of Clinical Term Spaces for Improved Diagnosis Coding Support. In Proceedings of the 4th International Louhi Workshop on Health Document Text Mining and Information Analysis (Louhi 2013), Sydney, Australia.

V. Aron Henriksson, Maria Kvist, Martin Hassel and Hercules Dalianis (2012). Exploration of Adverse Drug Reactions in Semantic Vector Space Models of Clinical Text. In Proceedings of the ICML Workshop on Machine Learning for Clinical Data Analysis, Edinburgh, UK.
ACKNOWLEDGEMENTS
Research may seem a solitary activity, conjuring up images of the lone researcher
cooped up in a gloomy study, lit up by the mere flickering of candlelight – or computer screen – faced with the daunting endeavor of making a contribution, however
minuscule, to the body of scientific knowledge. Occasionally, this can be the
case; for the most part, it is a fundamentally joint effort that is more likely to bear
fruit in collaboration with others, some of whom I would like to acknowledge here.
First and foremost, I am indebted to my supervisors: Professor Hercules Dalianis
and Dr. Martin Duneld (formerly Hassel) – you make up a great, complementary
supervision duo. You have introduced me to the rewarding world of research
and, in particular, the exciting area of natural language processing. Hercules,
you were the one who gave me the opportunity to join this research group. Your
inborn social nature and optimistic outlook are important counterweights to
the sometimes stressful and constrained, navel-gazing world of a PhD student.
Martin, thank you for introducing me to, and sharing my interest in, distributional
semantics. I appreciate and have learned much from our discussions on issues
concerning research, both big and small.
I also have the great fortune of belonging to a wonderful research group and I
would like to thank both former and current members for providing a pleasant
environment to work in and great opportunities for collaboration. I would like to
name a few persons in particular: Dr. Maria Kvist, who, as cheerful officemate
and sensible co-author, has provided me with ready access to important medical
domain expertise and good advice; Dr. Sumithra Velupillai, who, as a recent
graduate in our group, has set an impressive precedent and allowed those of us
who follow to walk down a path that is perhaps a little less thorny; and Maria
Skeppstedt, whom, as co-author and fellow PhD candidate, I always insist on
collaborating with and whose humility cloaks a wealth of strengths. Thank you
all for the fun times – I look forward to continued collaborations.
It is a privilege to have the opportunity of conducting research in a stimulating
environment: the DADEL project (High-Performance Data Mining for Drug
Effect Detection), funded by the Swedish Foundation for Strategic Research
(SSF), has certainly contributed to that, providing an exciting application area
and allowing me to receive quality feedback on my work. I would like to thank
all the members of the project, especially Professor Henrik Boström and fellow
PhD candidate Jing Zhao, who generously agreed to act as opponents at my
pre-licentiate seminar and provided me with valuable comments that have allowed
me to improve this thesis and will, hopefully, strengthen my research in the future.
Thank you also to everyone at the Unit of Information Systems and my fellow
PhD students at DSV for contributing to a pleasant work environment.
The network extends beyond the borders of Sweden: thank you to all the
participants in HEXAnord (HEalth TeXt Analysis network in the Nordic and
Baltic countries), funded by Nordforsk, especially Hans Moen at the Norwegian
University of Science and Technology (NTNU) and Dr. Vidas Daudaravičius at
Vytautas Magnus University, as co-authors. I have also had the opportunity of
visiting the Division of Biomedical Informatics at the University of California,
San Diego (UCSD) for a month, providing a much-needed respite from the
Swedish winter, but also – and more importantly – a great learning experience.
Thank you, Dr. Mike Conway and Dr. Wendy Chapman, for an inspiring visit
and for co-authoring one of the papers in this thesis. Thank you also, Danielle
Mowery, for being such an excellent host – we thoroughly enjoyed our stay.
The opportunity to travel and attend conferences and courses around the world is a wonderful benefit of being a PhD student, especially for an avid traveler and TCK like myself; it would not be possible to the same extent without generously
funded scholarships from Helge Ax:son Johnson Foundation, John Söderberg
Scholarship Foundation and Google. It is also great to be affiliated with a network
of NLP researchers closer to home through the Swedish National Graduate School
of Language Technology (GSLT) – thank you for your support. Ultimately, all of
this is made possible thanks to SSF, who are funding my doctoral studies through
the DADEL project.
Last, but certainly not least, thank you to all my family and friends, both near
and far: I am immensely grateful for your support and companionship. Sorry for
sometimes underperforming in the work-life balancing act – hopefully it is not
an incorrigible flaw. A special mention is due to my parents, who have always
been supportive and have generously hosted me many times in their current home
in Limassol, Cyprus – in recent years a home away from home and where I first
began to outline this thesis.
Aron Henriksson
Stockholm, October 2013
TABLE OF CONTENTS

1 INTRODUCTION
  1.1 PROBLEM SPACE
  1.2 RESEARCH QUESTIONS
  1.3 THESIS DISPOSITION

2 BACKGROUND
  2.1 CLINICAL LANGUAGE PROCESSING
  2.2 DISTRIBUTIONAL SEMANTICS
    2.2.1 THEORETICAL FOUNDATION
    2.2.2 MODELS OF DISTRIBUTIONAL SEMANTICS
    2.2.3 CONSTRUCTING SEMANTIC SPACES
    2.2.4 RANDOM INDEXING WITH(/OUT) PERMUTATIONS
    2.2.5 RANDOM INDEXING MODEL PARAMETERS
    2.2.6 APPLICATIONS OF DISTRIBUTIONAL SEMANTICS
  2.3 LANGUAGE USE VARIABILITY
    2.3.1 SYNONYMS
    2.3.2 ABBREVIATIONS
  2.4 DIAGNOSIS CODING
    2.4.1 COMPUTER-AIDED MEDICAL ENCODING
  2.5 ADVERSE DRUG REACTIONS
    2.5.1 AUTOMATICALLY DETECTING ADVERSE DRUG EVENTS

3 METHOD
  3.1 RESEARCH STRATEGY
  3.2 EVALUATION FRAMEWORK
    3.2.1 PERFORMANCE METRICS
    3.2.2 RELIABILITY AND GENERALIZATION
  3.3 METHODS AND MATERIALS
    3.3.1 CORPORA
    3.3.2 TERMINOLOGIES
    3.3.3 TOOLS AND TECHNIQUES
  3.4 EXPERIMENTAL SETUPS
    3.4.1 PAPER I
    3.4.2 PAPER II
    3.4.3 PAPER III
    3.4.4 PAPER IV
    3.4.5 PAPER V
  3.5 ETHICAL ISSUES

4 RESULTS
  4.1 SYNONYM EXTRACTION OF MEDICAL TERMS
    4.1.1 PAPER I
    4.1.2 PAPER II
  4.2 ASSIGNMENT OF DIAGNOSIS CODES
    4.2.1 PAPER III
    4.2.2 PAPER IV
  4.3 IDENTIFICATION OF ADVERSE DRUG REACTIONS
    4.3.1 PAPER V

5 DISCUSSION
  5.1 SEMANTIC SPACES OF CLINICAL TEXT
  5.2 MAIN CONTRIBUTIONS
  5.3 FUTURE DIRECTIONS

REFERENCES

INDEX

PAPERS
  PAPER I: IDENTIFYING SYNONYMY BETWEEN SNOMED CLINICAL TERMS OF VARYING LENGTH USING DISTRIBUTIONAL ANALYSIS OF ELECTRONIC HEALTH RECORDS
  PAPER II: SYNONYM EXTRACTION AND ABBREVIATION EXPANSION WITH ENSEMBLES OF SEMANTIC SPACES
  PAPER III: EXPLOITING STRUCTURED DATA, NEGATION DETECTION AND SNOMED CT TERMS IN A RANDOM INDEXING APPROACH TO CLINICAL CODING
  PAPER IV: OPTIMIZING THE DIMENSIONALITY OF CLINICAL TERM SPACES FOR IMPROVED DIAGNOSIS CODING SUPPORT
  PAPER V: EXPLORATION OF ADVERSE DRUG REACTIONS IN SEMANTIC VECTOR SPACE MODELS OF CLINICAL TEXT
CHAPTER 1

INTRODUCTION
The increasingly widespread adoption of electronic health record (EHR) systems
has allowed the enormous amounts of clinical data that are generated each day
to be stored in a more readily accessible form. The digitization of clinical documentation comes with huge potential in that it renders clinical data processable by
computers. This, in turn, means that longitudinal, patient-specific data can be integrated with clinical decision support (CDS) systems at the point of care; secondary
usage of clinical data can also be exploited to support epidemiological studies and
medical research more broadly. It may, moreover, be possible to unlock even
greater potential by combining and linking clinical data with other sources, such
as biobanks and genetic data, with the aim of uncovering interesting genotype-phenotype correlations. However, despite the numerous possible uses of clinical data, and their importance in the effort to reduce health care costs and improve patient safety, EHR data remains a largely untapped resource
(Jensen et al., 2012).
There are many possible explanations for this, including the difficulty of obtaining clinical data for research purposes due to a wide range of ethical and legal reasons. There are also technical reasons – some of which will be addressed in this
thesis – that make the full exploitation of clinical data challenging. One such reason is the heterogeneity of the data that EHRs contain, which can be characterized
as structured, such as gender and drug prescriptions, or unstructured, typically in
the form of clinical narratives. Unstructured clinical text1 – or, more commonly,
semi-structured under various headings – constitutes the bulk of EHR data. These
narratives – in the form of, for instance, admission notes or treatment plans – are
written (or dictated) by clinicians and play a key role in allowing clinicians to
communicate observations and reasoning in a flexible manner. On the other hand,
they are the most difficult to process and analyze computationally (Jensen et al.,
2012), also in comparison to biomedical text2 (Meystre et al., 2008). Clinical text
is frequently composed of ungrammatical, incomplete and telegraphic sentences
that are littered with shorthand, misspellings and typing errors. Many of the shorthand abbreviations and acronyms are created more or less ad hoc as a result of
author-, domain- and institution-specific idiosyncrasies that, in turn, lead to the
semantic overloading of lexical units (Jensen et al., 2012; Meystre et al., 2008). In
other words, the level of ambiguity in clinical text is extremely high. Sophisticated
natural language processing (NLP) methods are thus needed to mine the valuable
information that is contained in clinical text.
It is widely recognized that general language processing solutions are typically not
directly applicable to the clinical domain without suffering a drop in performance
(Meystre et al., 2008; Hassel et al., 2011). The tools and techniques that have
been developed specifically for the clinical domain have traditionally been rule-based, as opposed to statistical or machine learning-based. Rule-based systems,
with rules formulated by linguists and/or medical experts, often perform well on
many tasks and are sometimes the only feasible approach when not having access
to large amounts of data for training machine learning models. They are, however,
expensive to create and can be difficult to generalize across domains and systems,
and over time. With access to clinical data, machine learning approaches have
become increasingly popular. The problem with supervised machine learning approaches is that one is dependent on annotated resources, which are difficult and
expensive to create, particularly in the clinical domain where expert annotators –
typically physicians – are needed. Developing and tailoring semi-supervised and
unsupervised methods for the clinical domain is thus potentially very valuable.
In order to create high-quality information extraction and other NLP systems, it is
important to incorporate some knowledge of semantics. One attractive approach
1 In this thesis, the terms clinical narrative and clinical text are used interchangeably and denote any
textual content written in the process of documenting care in EHRs.
2 Biomedical text typically refers to biomedical research articles, books, etc.; see (Cohen and Hersh,
2005) for an overview of biomedical text mining.
to dealing with semantics, when one has access to a large corpus of text, is distributional semantics (see Section 2.2). Models of distributional semantics exploit
large corpora to capture the relative meaning of terms based on their distribution
in different contexts. This approach is attractive in that it, in some sense, renders
semantics computable: an estimate of the semantic relatedness between two terms
can be quantified. Observations of language use are needed; however, no manual
annotations are required, making it an unsupervised learning method. Models of
distributional semantics have been used with great success on many NLP tasks –
such as information retrieval, various semantic knowledge tests, text categorization and word sense disambiguation – that involve general language. There is also
a growing interest in the application of distributional semantics in the biomedical
domain (see Section 2.2.6 or Cohen and Widdows 2009 for an overview). Partly
due to the difficulty of obtaining large amounts of clinical data, however, this particular application (sub-)domain has been less explored.
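As a minimal illustration of how this approach renders semantics computable, the sketch below builds simple co-occurrence vectors from a toy corpus and quantifies relatedness with cosine similarity. The toy sentences, window size and raw counts are illustrative assumptions only; they stand in for the far larger corpora and more refined models (such as Random Indexing) discussed later in the thesis.

```python
from collections import defaultdict
from math import sqrt

def cooccurrence_vectors(sentences, window=2):
    """Map each term to counts of terms observed within +/- `window` positions."""
    vectors = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, term in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Toy "corpus": terms occurring in similar contexts receive similar vectors.
corpus = [
    "patient reports severe headache and nausea".split(),
    "patient reports mild headache and dizziness".split(),
]
vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["severe"], vecs["mild"]))    # identical contexts -> 1.0
print(cosine(vecs["severe"], vecs["nausea"]))  # weaker context overlap -> lower score
```

Note that no annotation is involved at any point: the vectors, and hence the relatedness estimates, are induced purely from observations of language use.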
1.1 PROBLEM SPACE
The problem space wherein this thesis operates is multidimensional. This is largely
due to the interdisciplinary nature of the research (Figure 1.1) – at the intersection
of computer science, linguistics and medicine – where natural language processing
(computer science and linguistics) is leveraged to improve health care (medicine).
The core of the problem space concerns the application – and adaptation – of natural language processing methods to the domain of clinical text. In particular, this
thesis addresses the possible application of distributional semantics to facilitate
natural language processing of electronic health records. In doing so, however, it
also tackles a number of medically motivated problems that the use of this methodology may help to solve.
Transferring methods across domains, in this case from general to clinical natural
language processing3 (Figure 1.2), is not always trivial. The use of distributional
semantics in the clinical domain has, in fact, not been explored to a great extent; it
has even been suggested that the constrained nature of clinical narrative renders it
less amenable to distributional semantics (Cohen and Widdows, 2009). However,
with the increasing availability of clinical data for secondary use, it is important
3 See Section 2.1 for a discussion on the differences between general and clinical language processing.
Figure 1.1: RESEARCH AREA(S). Medical (or clinical) language processing is an
interdisciplinary research area, at the intersection of computer science, linguistics and
medicine.
to find ways of exploiting such data without having to create (manually) annotated
resources for supervised machine learning or, alternatively, having to build handcrafted rule-based systems. The possibility of leveraging distributional semantics
– with its unsupervised methods – in the clinical domain would thus yield great
value.
Figure 1.2: TRANSFERRING AND ADAPTING METHODS. NLP methods and techniques are often transferred from general NLP and adapted for clinical NLP (or other
specialized domains). Here, distributional semantics are transferred and, to some extent,
adapted to the clinical domain.
There are certain properties – in some cases, limitations – of distributional semantics that may also affect its applicability in the clinical domain (Figure 1.3): (1)
current models of distributional semantics typically focus on modeling the meaning of individual words, whereas the majority of medical terms are multiword units
(Paper I); (2) current models of distributional semantics typically ignore negation,
which is particularly prevalent in clinical text (Chapman et al., 2001; Skeppstedt
et al., 2011); and (3) current models of distributional semantics cannot distin-
guish between different types of distributionally similar semantic relations, which
is problematic when trying to extract synonyms, for instance.
Figure 1.3: PROBLEM SPACE. This thesis operates in a multidimensional problem
space, with problems originating from various disciplines and research areas. Here, the
main problems addressed in this thesis are enumerated and categorized according to
source area.
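To make the first two limitations concrete, the sketch below shows one simple preprocessing strategy of the kind the thesis explores: known multiword terms are joined into single tokens before a semantic space is induced, so that they receive their own vectors, and tokens inside a negation scope are marked so that negated and affirmed mentions occupy different regions of the space. The term list, cue list, NEG_ prefix and fixed scope length are all hypothetical choices for illustration, not the exact scheme evaluated in the papers.

```python
# Illustrative preprocessing only: the multiword inventory and the NEG_ prefix
# are assumptions for this sketch, not the method of any specific paper.
MULTIWORD_TERMS = {("myocardial", "infarction"), ("chest", "pain")}
NEGATION_CUES = {"no", "denies", "without"}

def preprocess(tokens, neg_scope=3):
    # 1) Join known multiword terms into single tokens so each gets one vector.
    joined, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in MULTIWORD_TERMS:
            joined.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            joined.append(tokens[i])
            i += 1
    # 2) Mark tokens following a negation cue so that "denies chest pain" and
    #    "chest pain" are kept distributionally distinct.
    out, scope = [], 0
    for tok in joined:
        if tok in NEGATION_CUES:
            scope = neg_scope
        elif scope > 0:
            out.append("NEG_" + tok)
            scope -= 1
        else:
            out.append(tok)
    return out

print(preprocess("patient denies chest pain on exertion".split()))
# -> ['patient', 'NEG_chest_pain', 'NEG_on', 'NEG_exertion']
```

The third limitation – that distributional similarity conflates synonymy with other semantic relations – cannot be addressed by preprocessing alone and is instead tackled through the filtering and ensemble strategies discussed in the included papers.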
1.2 RESEARCH QUESTIONS
The overarching research question that motivates the inclusion of the papers in this
thesis – and the common driving force behind each individual study – is whether,
and how, distributional semantics can be leveraged to facilitate natural language
processing of electronic health records. This question is explored through three
separate applications, or use cases: (1) synonym extraction of medical terms,
(2) assignment of diagnosis codes, and (3) identification of adverse drug reactions.
These use cases provide demonstrations of – or exemplify – how distributional
semantics can be used for clinical language processing and to support the development of tools that, ultimately, aim to improve health care. For each use case, there
is a set of more specific research questions that nevertheless is subsumed under the
one overarching, and for all common, research question.
SYNONYM EXTRACTION OF MEDICAL TERMS (PAPERS I AND II)
− To what extent can distributional semantics be employed to extract candidate
synonyms of medical terms from a large corpus of clinical text? (Papers I
and II)
− How can synonymous terms of varying (word) length be captured in a
distributional semantic framework? (Paper I)
− Can different models of distributional semantics and types of corpora be
combined to obtain enhanced performance on the synonym extraction task,
compared to using a single model and a single corpus? (Paper II)
− To what extent can abbreviation-definition pairs be extracted with distributional semantics? (Paper II)
ASSIGNMENT OF DIAGNOSIS CODES (PAPERS III AND IV)
− Can the distributional semantics approach to automatic diagnosis coding be
improved by also exploiting structured data? (Paper III)
− Can external terminological resources be leveraged to obtain enhanced
performance on the diagnosis coding task? (Paper III)
− How can negation be handled in a distributional semantic framework?
(Paper III)
− How does dimensionality affect the application of distributional semantics to
the assignment of diagnosis codes, particularly as the size of the vocabulary
grows large? (Paper IV)
IDENTIFICATION OF ADVERSE DRUG REACTIONS (PAPER V)
− To what extent can distributional semantics be used to extract potential
adverse drug reactions to a given medication among a list of distributionally
similar drug-symptom pairs? (Paper V)
1.3 THESIS DISPOSITION
The thesis is organized into the following five chapters:
CHAPTER I: INTRODUCTION provides a preamble to the rest of the thesis. A general background is given, intended to be just extensive enough to motivate the research, outline the work and position it in its broader context. The problem space wherein the thesis operates, as well as the research questions it addresses, are described.
CHAPTER II: BACKGROUND presents an extensive background to the subjects covered by the included papers. First, an introduction is given to the specific research and application area to which this thesis mainly contributes: clinical language processing. It then covers the methodology that is the focal point here, namely distributional semantics: a description of the underlying theory and different models is followed by details on the model of choice – Random Indexing – and its model parameters; finally, some applications of distributional semantics are related. A background and a description of related work are then provided for each of the application areas: (1) synonym extraction of medical terms, (2) assignment of diagnosis codes and (3) identification of adverse drug reactions.
CHAPTER III: METHOD
describes both the general research strategy – as well as
the philosophical assumptions that underpin it – and the specific experimental setups of the included papers. The evaluation framework is laid out in
detail, followed by a description of the methods and materials used in the
included studies. Ethical issues are discussed briefly.
CHAPTER IV: RESULTS
gives a brief presentation of the most important results,
which are more fully reported in the included papers.
CHAPTER V: DISCUSSION
reviews the utility of applying models of distributional
semantics in the clinical domain. The research questions posed in the introductory chapter are revisited: answers are provided along with discussions
and comparisons with prior work. Finally, the main contributions of the
thesis are recapitulated and possible ways forward discussed.
CHAPTER 2

BACKGROUND
In this chapter, the thesis is positioned into a wider scientific and societal context:
(1) what has been done prior to this and which challenges remain? and (2)
what are the underlying, societal problems that motivate this effort? First, the
particular area to which the work here hopes primarily to contribute – clinical
language processing – is briefly introduced, including what distinguishes it
from other, related areas, and what makes it especially challenging. Then, the
methodology under investigation, namely distributional semantics, is described
at some length, including a discussion of theoretical underpinnings, various
implementations and applications. Finally, a background is provided to each of
the three application areas in which the overarching research question – whether,
and how, distributional semantics can be leveraged to facilitate natural language
processing of electronic health records – is explored: (1) synonym extraction of
medical terms, (2) assignment of diagnosis codes and (3) identification of adverse
drug reactions.
2.1 CLINICAL LANGUAGE PROCESSING
Clinical language processing concerns the automatic processing of clinical text,
which constitutes the unstructured, or semi-structured, part of EHRs. Clinical text
is written by clinicians in the clinical setting; in fact, this term is used to denote
any narrative that appears in EHRs, ranging from short texts describing a patient’s
chief complaint to long descriptions of a patient’s medical history (Meystre et al.,
2008). The language used in clinical text can also be characterized as a sublanguage with its own particular structure, distinct from both general language and
other sublanguages (Friedman et al., 2002).
Clinical language processing is a subdomain of medical language processing,
which, in turn, is a subdomain of natural language processing. Although
(bio)medical language processing subsumes clinical language processing, biomedical text often refers to non-clinical text, particularly bio(medical) literature
(Meystre et al., 2008). The noisy character of clinical text makes it much more
challenging to analyze computationally than biomedical and general language text,
such as newspaper articles. Compared to more formal genres, clinical language
often does not comply with formal grammar. Moreover, misspellings abound and
shorthand – abbreviations and acronyms – are often composed in an idiosyncratic
and ad hoc manner, with some being difficult to disambiguate even in context (Liu
et al., 2001). This may accidentally lead to an artificially high degree of synonymy
and homonymy1 – problematic phenomena from an information extraction perspective. In one study, it was found that the pharmaceutical product noradrenaline
had approximately sixty spelling variants in a set of intensive care unit nursing
narratives written in Swedish (Allvin et al., 2011). This example may serve to
illustrate some of the challenges involved in processing clinical language.
There are a number of approaches to performing information extraction from clinical text (Meystre et al., 2008). These can usefully be divided into two broad
categories: (1) rule-based approaches (ranging from simple pattern matching to
more complex symbolic approaches) and (2) statistical and machine learning approaches. A disadvantage of the former is their lack of generalizability and transferability to new domains; they are also expensive and time-consuming to build.
Statistical methods and machine learning have become increasingly popular – also in the clinical domain – as access to large amounts of data becomes more readily available. Supervised machine learning, where labeled examples are provided and the goal is to learn a mapping from a known input to a desired output (Alpaydin, 2010), requires annotated data, which are also time-consuming and expensive to create, especially in the medical domain. In unsupervised machine learning, there is only input data and the goal is to find and exploit statistical regularities in the data (Alpaydin, 2010). With access to large amounts of unannotated data, unsupervised learning is obviously preferable, provided that the performance is sufficiently good.

1 Homonyms are two orthographically identical words (same spelling) with different meanings. An example of this phenomenon in Swedish clinical text is pat, which is short for both patient and pathological.
2.2 DISTRIBUTIONAL SEMANTICS
Knowledge of semantics – that is, aspects that concern the meaning of linguistic
units of various granularity: words, phrases and sentences (Yule, 2010) – even if
only tacit, is a prerequisite for claims of understanding (natural) language. For
computers to acquire – or mimic – any capacity for human language understanding, such semantic knowledge needs to be made explicit in some way. There are
numerous approaches to the computational handling of semantics, many of which
rely on mapping text to formal representations, such as ontologies or models based
on first-order logic (Jurafsky et al., 2000), which can then be reasoned over.
There is another, complementary, approach that instead exploits the inherent distributional structure of language to acquire the meaning of linguistic entities, most
typically words. This family of models falls under the umbrella of distributional
semantics. Early models of distributional semantics were introduced to overcome
the inability of the vector space model (Salton et al., 1975) – as it was originally
conceived, based on keyword matching – to account for the variability of language
use and word choice. To overcome the negative impact this had on recall (coverage) in information retrieval systems, various models of distributional semantics
were proposed (Deerwester et al., 1990; Schütze, 1993; Lund and Burgess, 1996).
2.2.1 THEORETICAL FOUNDATION
Distributional approaches to meaning rely in essence on a set of assumptions about
the nature of language in general and semantics in particular (Sahlgren, 2008).
These assumptions are embedded – explicitly and implicitly – in the distributional
hypothesis, which has its foundation in the work of Zellig Harris (1954) and states
that words with similar distributions in language tend to have similar meanings2 .
The assumption is thus that there is a correlation between distributional similarity
and semantic similarity: by exploiting the observables to which we have direct
access in a corpus of texts – the linguistic entities and their relative frequency
distributions over contexts (distributional similarity) – we can infer that which is
hidden, or latent, namely the relative meaning of those entities (semantic similarity). For instance, if two terms, t1 and t2 , share distributional properties – in the
sense that they appear in similar contexts – we can infer that they are semantically
similar along one or more dimensions of meaning.
This rather sweeping approach to meaning presents a point of criticism that is
sometimes raised against distributional approaches: the distributional hypothesis
does not make any claims about the nature of the semantic (similarity) relation,
only to what extent two entities are semantically similar. Since a large number
of distributionally similar entities have potentially very different semantic relations, distributional methods cannot readily distinguish between, for instance, synonymy, hypernymy, hyponymy, co-hyponymy and, in fact, even antonymy3 . As
we will discover, the types of semantic relations that are modeled depend on the
employed context definition and thereby also on what type of distributional information is collected (Sahlgren, 2006). The broad notion of semantic similarity
is, however, still meaningful and represents a psychologically intuitive concept
(Sahlgren, 2008), even if the ability to distinguish between different types of semantic relations is often desirable, as in the synonym extraction task.
If we, for a moment, consider the underlying philosophy of language and linguistic
theory, the distributional hypothesis – and the models that build upon it – can be
characterized as subscribing to a structuralist, functionalist and descriptive view
of language. As Sahlgren (2006, 2008) points out, Harris’ differential view of
meaning can be traced back to the structuralism of Saussure, in which the primary
interest lies in the structure of language – seen as a system (la langue) – rather than its possible uses (la parole). In this language system, signs4 are defined by their functional differences, which resonates with Harris' notion of meaning differences: the meaning of a given linguistic entity can only be determined in relation to other linguistic entities. Along somewhat similar lines, Wittgenstein (1958) propounded the notion that has been aphoristically captured in the famous dictum: meaning is use. The key to uncovering the meaning of a word is to look at its communicative function in the language. The Firthian (Firth, 1957) emphasis on the context-dependent nature of meaning is also perfectly in line with the distributional view of language: you shall know a word by the company it keeps. This view of semantics is wholly descriptive – as opposed to prescriptive – in that very few assumptions are made a priori: meaning is simply determined through observations of actual language use. Approaching semantics in this way comes with many other advantages for natural language processing, as it has spawned a family of data-driven, unsupervised methods that do not require hand-annotated data in the learning process.

2 Harris, however, speaks of differences in distributions and meaning rather than similarities: two terms that have different distributions to a larger degree than another pair of terms are also more likely to differ more in meaning. In the distributional hypothesis this is flipped around by emphasizing similarities rather than differences (Sahlgren, 2008).
3 Since antonyms are perhaps not intuitively seen as semantically similar, some prefer to speak of semantic relatedness rather than semantic similarity (Turney and Pantel, 2010).
2.2.2 MODELS OF DISTRIBUTIONAL SEMANTICS
The numerous approaches to distributional semantics can be broadly divided into
spatial models and probabilistic models5 (Cohen and Widdows, 2009). Spatial
models of distributional semantics represent (term) meaning as locations in high-dimensional vector space, where proximity6 indicates semantic similarity: terms
that are close to each other in semantic space are close in meaning; terms that
are distant are semantically unrelated. This notion is sometimes referred to as the
geometric metaphor of meaning and intuitively captures the idea that terms can
(metaphorically speaking) be near in meaning, as well as the fact that meaning
– when equated with use – shifts over time and across domains. Spatial models
include Wordspace (Schütze, 1993), Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996), Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997) and Random Indexing (RI) (Kanerva et al., 2000). Probabilistic models view documents as a mixture of topics and represent terms according to the probability of their occurrence during the discussion of each topic: two terms that share similar topic distributions are assumed to be semantically related. Probabilistic models include Probabilistic LSA (pLSA) (Hofmann, 1999), Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and the Probabilistic Topic Model (Griffiths and Steyvers, 2002).

4 In structuralism, a linguistic sign is composed of two parts: a signifier (a word) and a signified (the meaning of that word).
5 The distinction between spatial and probabilistic models is not always clear-cut. There are, for instance, probabilistic interpretations of vector space models; unified approaches are a distinct possibility (van Rijsbergen, 2004).
6 Proximity is typically measured by calculating the cosine of the angle between two vectors. There are other measures, too: see Section 3.3.3.
There are pros and cons of each approach: while spatial models are more realistic
and intuitive from a psychological perspective, probabilistic models are more statistically sound and generative, which means that they can be used to estimate the
likelihood of unseen documents. From an empirical, experimental point of view,
one approach does not consistently outperform the other, independent of the task
at hand. Scalable versions of spatial models, such as RI, have, however, proved to
work well for very large corpora (Cohen and Widdows, 2009) and are the focus in
this thesis.
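The geometric metaphor can be made concrete with a small sketch. Proximity in a spatial model is typically measured as the cosine of the angle between two context vectors (see footnote 6); the four-dimensional vectors below are hand-picked toy values standing in for corpus-induced ones:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: close to 1.0 for vectors
    pointing in similar directions, close to 0.0 for near-orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy context vectors, for illustration only.
space = {
    "car":        [4.0, 3.0, 0.0, 1.0],
    "automobile": [3.5, 2.8, 0.1, 0.9],
    "protein":    [0.0, 0.2, 5.0, 4.0],
}

print(cosine(space["car"], space["automobile"]))  # close to 1: near in meaning
print(cosine(space["car"], space["protein"]))     # close to 0: unrelated
```

Note that the cosine measure deliberately ignores vector length (roughly, term frequency) and compares direction only.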
2.2.3 CONSTRUCTING SEMANTIC SPACES
Spatial representations of meaning (see Turney and Pantel 2010 for an overview)
differ mainly in the way context vectors, representing term meaning, are constructed. The context vectors are typically derived from a term-context matrix7
that contains the (weighted, normalized) frequency with which terms occur within
a set of defined contexts. The context definition may vary both across and within
models; the numerous context definitions can essentially be divided into two
groups: a non-overlapping passage of text (a document) or a sliding window of
tokens/characters surrounding the target term8 . There are also linguistically motivated context definitions, achieved with the aid of, for instance, a syntactic parser,
where the context is defined in terms of syntactic dependency relations (Padó and
Lapata, 2007). An advantage of employing a more naïve context definition is,
however, that the language-agnostic property of distributional semantics is thereby
retained: all that is needed is a corpus of (term) segmented text.
7 There are also pair-pattern matrices, where rows correspond to pairs of words and columns correspond to the patterns in which the pairs co-occur. These have been used for identifying semantically
similar patterns (Lin and Pantel, 2001) – by comparing column vectors – and analogy relations (Turney
and Littman, 2005) – by comparing row vectors. Pattern similarity is a useful notion in paraphrase
identification (Lin and Pantel, 2001). It is also possible to employ higher-order tensors (vectors are
first-order tensors and matrices are second-order tensors), see for instance (Turney, 2007).
8 The size of the sliding window is typically static and predefined.
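A term-context matrix with a sliding window context definition can be sketched in a few lines. The toy sentence and the window size of two are illustrative choices only; a real matrix would be built from a large corpus and then weighted and normalized:

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count, for each target term, the terms occurring within a symmetric
    sliding window of +/- `window` tokens: a term-term matrix in sparse form."""
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts

# Toy "corpus" for illustration only.
tokens = "the patient reports pain in the left knee".split()
matrix = cooccurrence_counts(tokens, window=2)
print(matrix["pain"])  # neighbours of "pain" within the window
```

With a document-level context definition, the inner loop would instead increment one cell per (term, document) pair, yielding a term-document matrix.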
Irrespective of which context definition is employed, the number of contexts is
typically huge: with a document-level context definition, it equates to the number
of documents; with a (term-based) sliding window context definition, it amounts to
the size of the vocabulary. Working directly with such high-dimensional (and inherently sparse) data – where the dimensionality is equal to the number of contexts
– would involve unnecessary computational complexity and come with significant
memory overhead, in particular since most terms only occur in a limited number of
contexts, which means that most cells in the matrix would be zero. Data sparsity as
a result of high dimensionality is commonly referred to as the curse of dimensionality and means that large amounts of data are needed to obtain reliable statistical evidence. The solution is to project the
high-dimensional data into a lower-dimensional space, while approximately preserving the relative distances between data points. The benefit of dimensionality
reduction is two-fold: on the one hand, it reduces complexity and data sparseness;
on the other hand, it has also been shown to improve the coverage and accuracy
of term-term associations (Landauer and Dumais, 1997), as, in this reduced (semantic) space, terms that do not necessarily co-occur directly in the same contexts
will nevertheless be clustered about the same subspace, as long as they appear in
similar contexts, i.e. have neighbors in common (co-occur with the same terms).
In this way, the reduced space can be said to capture higher order co-occurrence
relations and latent semantic features.
In one of the earliest spatial models, Latent Semantic Indexing/Analysis
(LSI/LSA) (Deerwester et al., 1990; Landauer and Dumais, 1997), dimensionality reduction is performed with a computationally expensive matrix factorization
technique known as Singular Value Decomposition (SVD). SVD decomposes the
term-document matrix into a reduced-dimensional matrix – of a predefined dimensionality – that best captures the variance between points in the original matrix. Due to its reliance on SVD, LSA –
in its original form – received some criticism, partly for its poor scalability properties, as well as for the fact that a semantic space needs to be completely rebuilt
each time new data is added9 . As a result, alternative methods for constructing
semantic spaces have been proposed.
9 There are now versions of LSA in which an existing SVD can be updated.
2.2.4 RANDOM INDEXING WITH(OUT) PERMUTATIONS
Random Indexing (Kanerva et al., 2000; Karlgren and Sahlgren, 2001; Sahlgren,
2005) is an incremental, scalable and computationally efficient alternative to LSA
in which explicit dimensionality reduction is circumvented. Instead of constructing an initial and huge term-context matrix, which then needs to be reduced, a
lower dimensionality d – where d is much smaller than the number of contexts – is
chosen a priori as a model parameter. The pre-reduced d-dimensional context vectors are then incrementally populated with context information. RI can be viewed
as a two-step operation:
1. Each context is first given a static, unique representation in the vector space
that is approximately uncorrelated to all other contexts. This is achieved by
assigning a sparse, ternary10 and randomly generated d-dimensional index
vector: a small number (usually ∼1-2%) of +1’s and −1’s are randomly
distributed, with the rest of the elements set to zero. By generating sparse
vectors of a sufficiently high dimensionality in this way, the index vectors
will, with a high probability, be nearly orthogonal11 . With a sliding window
context definition, each unique term is assigned an index vector; with a
document-level context definition, each document is assigned an index
vector.
2. Each unique term – or whichever linguistic entity we wish to model – is
assigned an initially empty context vector of the same dimensionality d. The
context vectors are then incrementally populated with context information
by adding the index vectors of the contexts in which the target term appears.
With a sliding window context definition, this means that the index vectors
of the surrounding terms are added to the target term’s context vector; with a
document-level context definition, the index vector of the document is added
to the target term's context vector. The meaning of a term, represented by its context vector, is effectively the sum12 of all the contexts in which it occurs.

10 Ternary vectors allow three possible values: +1's, 0's and −1's. Allowing negative vector elements ensures that the entire vector space is utilized (Karlgren et al., 2008).
11 Orthogonal index vectors would yield completely uncorrelated context representations; in the RI approximation, near-orthogonal index vectors result in almost uncorrelated context representations.
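The two-step operation can be sketched as follows. This is a minimal sliding-window variant; the dimensionality, the number of non-zero elements and the toy token sequence are illustrative values only, far smaller than in a realistic setting:

```python
import random

random.seed(0)
DIM = 512      # toy dimensionality; real spaces are typically larger
NONZERO = 10   # ~2% of DIM: number of +1/-1 elements per index vector

def make_index_vector():
    """Step 1: a sparse, ternary, randomly generated d-dimensional index
    vector with an equal number of +1's and -1's."""
    vec = [0] * DIM
    for k, pos in enumerate(random.sample(range(DIM), NONZERO)):
        vec[pos] = 1 if k % 2 == 0 else -1
    return vec

def build_space(tokens, window=2):
    """Step 2: incrementally add the index vectors of the terms surrounding
    each target term to the target's (initially empty) context vector."""
    index = {t: make_index_vector() for t in set(tokens)}
    context = {t: [0] * DIM for t in set(tokens)}
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                context[target] = [c + x for c, x in
                                   zip(context[target], index[tokens[j]])]
    return context

# Toy token sequence for illustration only.
tokens = "chest pain and chest discomfort".split()
space = build_space(tokens)
```

Since "pain" and "discomfort" share the neighbour "chest", their context vectors both accumulate the index vector of "chest" and therefore end up pointing in broadly similar directions.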
Random Indexing belongs to a group of dimensionality-reduction methods, which
includes Random Projection (Papadimitriou et al., 1998) and Random Mapping
(Kaski, 1998), that are motivated by the Johnson-Lindenstrauss lemma (Johnson
and Lindenstrauss, 1984). This asserts that the relative distances between points
in a high-dimensional Euclidean (vector) space will be approximately preserved if
they are randomly projected into a lower-dimensional subspace of sufficiently high
dimensionality. RI exploits this property by randomly projecting near-orthogonal
index vectors into a reduced-dimensional space: since there are more nearly orthogonal than truly orthogonal vectors in a high-dimensional space (Kaski, 1998),
randomly generating sparse (index) vectors is a good approximation of orthogonality.
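The near-orthogonality claim can be checked empirically with a small sketch. The dimensionality and sparsity below are toy settings in the spirit of the 1-2% rule-of-thumb mentioned above, not the values used elsewhere in the thesis:

```python
import math
import random

random.seed(1)
DIM = 2000
NONZERO = 20  # 1% non-zero elements, half +1 and half -1

def sparse_ternary():
    """A randomly generated, sparse, ternary d-dimensional vector."""
    vec = [0] * DIM
    for k, pos in enumerate(random.sample(range(DIM), NONZERO)):
        vec[pos] = 1 if k % 2 == 0 else -1
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Generate 50 random index vectors and inspect all pairwise cosines.
vectors = [sparse_ternary() for _ in range(50)]
cosines = [abs(cosine(u, v))
           for i, u in enumerate(vectors) for v in vectors[i + 1:]]
print(max(cosines))  # near 0: the index vectors are almost orthogonal
```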
Making the dimensionality a static model parameter gives the method attractive
scalability properties since the dimensionality will not grow with the size of the
data, or, more specifically, the number of contexts13 . Moreover, RI’s incremental
approach allows new data to be added at any given time without having to rebuild
the semantic space: if a hitherto unseen term is encountered, it is simply assigned
an index vector (i.e., in the case of a sliding window context definition) and an
empty context vector. This property – allowing data to be added to an existing
semantic space in an efficient manner – makes RI unique among the aforementioned models of distributional semantics. Its ability to handle streaming data14
makes it possible to study real-time acquisition of semantic knowledge (Cohen
and Widdows, 2009).
Models of distributional semantics, including RI, generally treat each context (sliding window or document) as a bag of words15 . Such models are often criticized for
failing to account for term order. By weighting the index vectors according to their
position relative to the target term in the context window, term order information
is, in some sense, incorporated in the model. There are, however, approaches that
model term order in more explicit ways.

12 Sometimes weighting is applied to the index vectors before they are added to a context vector, for instance according to their distance and/or direction from the target term.
13 The dimensionality may, however, not remain appropriate irrespective of data size. This issue is discussed in more detail in PAPER IV.
14 This property of RI makes it analogous to an online learning method.
15 The bag-of-words model is a simplified representation of a text as an unordered collection of words, where grammar and word order are ignored.

Random Permutation (RP) (Sahlgren et
al., 2008) is a modification of RI that encodes term order information by simply
permuting (i.e., shifting) the elements in the index vectors according to their direction and distance16 from the target term before they are added to the context
vector. For instance, before adding the index vector of a term two positions to the
left of the target term, the elements are shifted two positions to the left; similarly,
before adding the index vector of a term one position to the right of the target term,
the elements are shifted one position to the right. In effect, each term has multiple
unique representations: one index vector for each possible position relative to the
target term in the context window. This naturally affects the near-orthogonality
property of RI-based approaches; in theory, the dimensionality thus needs to be
higher with RP than with RI. RP is inspired by the work of Jones and Mewhort
(2007) and their BEAGLE model that uses vector convolution to incorporate term
order information in semantic space. Incorporating term order information not
only enables order-based retrieval; it also constrains the types of semantic relations that are captured. RP has been shown to outperform RI on the TOEFL17
synonym test (Sahlgren et al., 2008).
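The shifting operation just described can be sketched as follows. The helper names and the five-dimensional vector are illustrative only; real index vectors are high-dimensional and sparse:

```python
from collections import deque

def permute(vec, offset):
    """Shift (rotate) the elements of an index vector by `offset` positions:
    negative offsets shift left, positive offsets shift right."""
    d = deque(vec)
    d.rotate(offset)
    return list(d)

def add_permuted(context, index_vec, offset):
    """Add a neighbour's index vector to the target's context vector, permuted
    according to the neighbour's position relative to the target term."""
    return [c + x for c, x in zip(context, permute(index_vec, offset))]

# A term two positions to the LEFT of the target: shift two steps left.
v = [1, 0, 0, -1, 0]
print(permute(v, -2))  # [0, -1, 0, 1, 0]
```

Because each shift yields a distinct vector, a term effectively has one representation per possible position in the context window, as described above.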
2.2.5 RANDOM INDEXING MODEL PARAMETERS
Whether to employ RI with or without permuting index vectors is not the only
decision that needs to be made before inducing a semantic space. There are, in
fact, a number of model parameters that need to be configured according to the
task that the induced semantic space will be used for. One such parameter is the
context definition, which affects the types of semantic relations that are captured
(Sahlgren, 2006). By employing a document-level context definition, relying on
direct co-occurrences, one models syntagmatic relations. That is, two terms that
frequently co-occur in the same documents are likely to be about the same general
topic, e.g., <car, motor, race>. By employing a sliding window context definition,
one models paradigmatic relations. That is, two terms that frequently co-occur
with similar sets of words – i.e., share neighbors – but do not necessarily co-occur
themselves, are semantically similar, e.g., <car, automobile, vehicle>. The size of
the context window also affects the types of relations that are modeled and typically needs to be tuned for the task at hand.

16 An alternative is to shift the index vectors according to direction only, effectively producing direction vectors. In fact, these are shown to outperform permutations based on both direction and distance on the synonym identification task (Sahlgren et al., 2008).
17 Test Of English as a Foreign Language
Other important model parameters concern the dimensionality of the semantic
space. The dimensionality (vis-à-vis the number of contexts) is a parameter that
needs to be configured in order to maintain the important near-orthogonality property of RI. Each context – with a sliding window context definition, this means
each unique term – needs to be assigned a near-orthogonal index vector; this is not
possible unless the dimensionality is sufficiently large. This is also affected by the
proportion of non-zero elements in the index vectors, which is another parameter
of the model. The index vectors need to be sparse, while ensuring that there are
enough possible permutations to satisfy the near-orthogonality condition of RI – a
common rule-of-thumb is to distribute randomly 1-2% non-zero elements.
Yet another model parameter concerns the weighting schema that is applied to the
index vectors before they are added to the target terms’s context vector. Possibilities include constant weighting (the terms in the context are treated as a bag
of words), linear distance weighting (weight decreases by some constant as the
distance from the target term increases) and aggressive distance weighting (increasingly less weight as the distance from the target term increases) (Sahlgren,
2006).
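These schemes can be sketched as simple functions of the distance, in token positions, between a context term and the target term. The concrete formulas below are illustrative assumptions, not the exact weights used by Sahlgren (2006):

```python
def constant_weight(distance):
    """Constant weighting: every position in the window counts equally
    (the context is treated as a bag of words)."""
    return 1.0

def linear_weight(distance, window=4):
    """Linear distance weighting: weight decreases by a constant amount
    as the distance from the target term increases."""
    return (window - distance + 1) / window

def aggressive_weight(distance):
    """Aggressive distance weighting: increasingly less weight with
    distance, here (as an illustrative choice) 2^(1 - distance)."""
    return 2.0 ** (1 - distance)

for d in (1, 2, 3, 4):
    print(d, constant_weight(d), linear_weight(d), aggressive_weight(d))
```

Whichever scheme is chosen, the weight simply scales an index vector before it is added to the target term's context vector.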
Finally, it seems apt to underscore the fact that an induced semantic space can be
manipulated by preprocessing the data in various ways. Common preprocessing
steps – or, method choices – in distributional semantics research include stop-word
removal and lemmatization (see Section 3.3).
2.2.6 APPLICATIONS OF DISTRIBUTIONAL SEMANTICS
Models of distributional semantics have been applied with considerable success to
a range of (general) NLP tasks. One of the original applications – and a motivating
force behind the development of distributional semantics – appeared in the context
of information retrieval, in particular document retrieval, where the limitations of
string-matching, keyword-based approaches were recognized; attempts were made
to exploit distributional similarities between query terms and document terms to
improve recall (Deerwester et al., 1990).
A widespread application of distributional semantics is to support the development
of terminologies and other lexical resources by, for instance, identifying synonymy
between terms. Synonym tests, where a synonym of a given term is to be selected
among a number of candidate terms, are, in fact, a common means to evaluate
semantic spaces and models of distributional semantics. Landauer and Dumais
(1997) used the synonym part of the TOEFL test to evaluate LSA, and many have
since used the same test for evaluation. With RP, an accuracy of 80% was achieved
on this test (Sahlgren et al., 2008), compared to 25% with random guessing, 64.4%
with LSA18 (36% without SVD) and an average of 64.5% by foreign applicants to
American colleges (Landauer and Dumais, 1997). Other applications of distributional semantics include document classification, essay grading, (bilingual) information extraction, word sense disambiguation/discrimination, sentiment analysis
and visualization (see Cohen and Widdows 2009 and Turney and Pantel 2010 for
an overview of distributional semantics applications).
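Answering a synonym test item of this kind reduces to a nearest-neighbour search in the semantic space: the candidate whose context vector lies closest to the target's is selected. The sketch below uses hypothetical, hand-picked vectors purely for illustration; a real space would be induced from a large corpus:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def answer_synonym_item(space, target, candidates):
    """Pick the candidate term whose context vector is nearest the target's."""
    return max(candidates, key=lambda c: cosine(space[target], space[c]))

# Hypothetical toy vectors, for illustration only.
space = {
    "physician": [0.9, 0.8, 0.1],
    "doctor":    [0.8, 0.9, 0.2],
    "hospital":  [0.5, 0.1, 0.9],
    "illness":   [0.1, 0.5, 0.8],
}
print(answer_synonym_item(space, "physician", ["doctor", "hospital", "illness"]))
# -> doctor
```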
In recent years there has been an increasing interest in the application of distributional semantics in the biomedical domain19 , most of which involves exploiting
the ever-growing amounts of biomedical literature that is now readily available in
digital form (Cohen and Widdows, 2009). Homayouni et al. (2005), for instance,
apply LSI on a corpus comprising titles and abstracts in MEDLINE20 citations to
identify both explicit and implicit gene relationships. Others have exploited the
analogy between linguistic sequences and biological sequences, e.g., sequences of
genes or proteins, and used distributional semantics to derive a measure of similarity between biological sequences, see for instance (Stuart and Berry, 2003) or
(Ganapathiraju et al., 2004). Another application in this context is to use distributional semantics for what is known as literature-based knowledge discovery, i.e.,
the identification of knowledge that is not explicitly stated in the research literature. Gordon and Dumais (1998) demonstrate how two disparate literatures can
be bridged by a third by using LSI to identify common concepts and in this way
support the knowledge discovery process. Cohen et al. (2010) develop and apply
an iterative version of RI – Reflective Random Indexing – to the same problem
and it is shown to perform better than traditional RI implementations on the task
of finding implicit connections between terms.

18 It should be noted that Landauer and Dumais (1997) and Sahlgren et al. (2008) used different corpora.
19 This section is largely based on the extensive review of biomedical applications of distributional semantics by Cohen and Widdows (2009).
20 MEDLINE is a database comprising over 19 million references to biomedical research articles: http://www.nlm.nih.gov/pubs/factsheets/medline.html.

Other biomedical applications of
distributional semantics involve mapping the biomedical literature to ontologies
and controlled vocabularies, as well as the automatic generation of synonyms.
Cohen and Widdows (2009) argue that distributional semantics is perhaps less
suited to the clinical domain due to the constrained nature of clinical narrative. The
application of distributional semantics to this particular genre of text has been less
explored21 , largely due to difficulties involved in obtaining clinical data for privacy and other ethical reasons. Some work has, however, been done and is worth
relating. Pedersen et al. (2007) construct context vectors from a corpus of clinical notes and use them to derive a measure of the semantic relatedness between
SNOMED CT concepts; it was found that the distributional semantics approach
outperformed ontology-based methods. Cohen et al. (2008) use LSA to identify
meaningful associations between clinical concepts in psychiatric narratives to support the formulation of intermediate diagnostic hypotheses. Jonnalagadda et al.
(2012) demonstrate that the incorporation of distributional semantic features in
a machine learning classifier (Conditional Random Fields) can lead to enhanced
performance on the task of identifying named entities in clinical text.
2.3 LANGUAGE USE VARIABILITY
Models of distributional semantics are, in some sense, able to capture the semantics underlying word choice. This quality of natural language is essential and endows it with the flexibility we, as humans, need in order to express and communicate our thoughts. For computers, however, language use variability is problematic
and far from the unambiguous, artificial languages to which they are used. There
is thus a need for computers to be able to handle and account for language use
variability, not least by being aware of the relationship between surface tokens and
their underlying meaning, as well as the relationship between the meaning of tokens. Indeed, morphological variants, abbreviations, acronyms, misspellings and
synonyms – although different in form – may share semantic content to different
degrees. The various lexical instantiations of a concept thus need to be mapped to some standard representation of the concept, either by converting the different expressions to a canonical form or by generating lexical variants of a concept's preferred term. These mappings are typically encoded in semantic resources, such as thesauri or ontologies, which enable the recall of information extraction systems to be improved. Although their value is undisputed, manual construction of such resources is often prohibitively expensive and may also result in limited coverage, particularly in the biomedical and clinical domains where language use variability is exceptionally high (Meystre et al., 2008; Allvin et al., 2011).

21 One could make the claim that distributional methods have been applied rather extensively in the clinical domain; however, in those cases, the object of analysis has been terminological resources such as UMLS (Chute et al., 1991) and ICD (Chute and Yang, 1992). Distributional methods have also been applied to non-clinical corpora for the enrichment of terminological resources in the clinical domain (Fan and Friedman, 2007). The application of distributional semantics to clinical text is, however, much more limited.
2.3.1 SYNONYMS
Language use variability is very often realized through word choice, which is made
possible by the existence of a linguistic phenomenon known as synonymy. Synonymy is a semantic relation between two phonologically distinct words with very
similar meaning22 . It is, however, extremely rare that two words have the exact
same meaning – perfect synonyms or true synonyms – as there is often at least
one parameter that distinguishes the use of one word from another. Hence, we
typically speak of near synonyms instead; that is, two words that are interchangeable in some, but not all, contexts23 . Two near synonyms may also have different
connotations and formality levels, or represent dialectal or idiolectal differences
(Saeed, 1997). Another way of viewing this particular semantic relation is to state
that two expressions are synonymous if and only if they can be substituted without
affecting the truth conditions of the constructions in which they are used24 . In the
medical domain, there are differences in the vocabulary used by medical professionals and patients, where the latter use layman variants of medical terms (Leroy and
Chen, 2001). When performing medical language processing, it is important to
have ready access to terminological resources that cover this variation in the use
of vocabulary. Some examples of applications where this is particularly important
22 As synonymous expressions do not necessarily comprise a pair of single words, we sometimes speak of paraphrasing rather than synonymy. Paraphrases are alternative ways of conveying the same information, e.g., pain in joints and joint pain (Friedman et al., 2002). See Androutsopoulos and Malakasiotis (2010) for an overview of paraphrasing methods.
23 The words big and large are, for instance, synonymous when describing a house, but not when
describing a sibling.
24 This condition of synonymy is known as substitution salva veritate (Cruse, 1986).
include query expansion (Leroy and Chen, 2001), text simplification (Leroy et al.,
2012) and information extraction (Eriksson et al., 2013).
The importance of synonym learning is well recognized in the NLP research
community, especially in the biomedical (Cohen and Hersh, 2005) and clinical
(Meystre et al., 2008) domains. A whole host of techniques have been proposed
for the identification of synonyms and other semantic relations, including the use
of lexico-syntactic patterns, graph-based models and, indeed, distributional semantics.
Hearst (1992) proposes the use of lexico-syntactic patterns for the automatic acquisition of hyponyms from unstructured text. These patterns are hand-crafted
according to observations in a corpus and can also be constructed for other types
of lexical relations, including synonymy. However, a requirement is that these
syntactic patterns are sufficiently prevalent to match a wide array of instances of
the targeted semantic relation. Blondel et al. (2004) present a graph-based method
that draws on the calculation of hub, authority and centrality scores when ranking
hyperlinked web pages. They demonstrate how the central similarity score can be
applied to the task of automatically extracting synonyms from a monolingual dictionary, where the assumption is that synonyms have a large overlap in the words
used in their definitions; synonyms also co-occur in the definition of many words.
Another possible source for extracting synonyms is linked data, such as Wikipedia.
Nakayama et al. (2007) also utilize a graph-based method, but instead of relying
on word co-occurrence information, they exploit the links between Wikipedia articles and treat articles as concepts. This allows both the strength (the number of
paths from one article to another) and the distance (the length of each path) between concepts to be measured: concepts close to each other in the graph and with
common hyperlinks are deemed to have a stronger semantic relation than those
farther away.
Efforts have been made to obtain better performance on the synonym extraction
task by not only relying on a single source and a single method. Many of these
approaches have drawn inspiration from ensemble learning, a machine learning
technique that combines the output of several different classifiers with the aim
of improving classification performance (see Dietterich 2000 for an overview).
Curran (2002) demonstrates that ensemble methods outperform individual classifiers even for very large corpora. Wu and Zhou (2003) use multiple resources –
a monolingual dictionary, a bilingual corpus and a large monolingual corpus – in
a weighted ensemble method that combines the individual extractors, effectively
improving performance on the synonym learning task. van der Plas and Tiedemann (2006) use parallel corpora to calculate distributional similarity based on
(automatic) word alignment, where a translational context definition is employed;
this method outperforms a monolingual approach. Peirsman and Geeraerts (2009)
combine predictors based on collocation measures and distributional semantics
with a so-called compounding approach, where cues are combined with strongly
associated words into compounds and verified against a corpus; this is shown to
outperform the individual predictors of strong term associations.
In the biomedical domain, most efforts have focused on extracting synonyms of
gene and protein names from the biomedical literature (Yu and Agichtein, 2003;
Cohen et al., 2005; McCrae and Collier, 2008). In the clinical domain, significantly less work has been done on the synonym extraction task. Conway and
Chapman (2012) propose a rule-based approach to generate potential synonyms of
terms from the BioPortal ontology25 – using, for instance, term permutations and
abbreviation generation – after which candidate synonyms are verified against a
large clinical corpus. This method, however, fails to capture lexical instantiations
of concepts that do not share morphemes or characters with the seed term, which,
in fact, is the typical case for synonyms according to the aforementioned definition.
From an information extraction perspective, however, all types of lexical instantiations of a concept need to be accounted for. Zeng et al. (2012) evaluate three query
expansion methods for retrieval of clinical documents and conclude that a distributional semantics approach, in the form of an LDA-based topic model, generates the
best synonyms. For the broader estimation of medical concept similarity, corpus-driven approaches have proved more effective than path-based measures (Pedersen
et al., 2007; Koopman et al., 2012).
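The distributional approach that underlies several of the methods discussed above can be sketched in a few lines: term co-occurrence vectors are built from a corpus, and candidate synonyms for a target term are ranked by cosine similarity. The sketch below is an illustrative minimum under toy assumptions, not a reconstruction of any of the cited systems; real semantic spaces are built from far larger corpora and typically employ dimensionality reduction.

```python
import math
from collections import defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Build term co-occurrence vectors from tokenized sentences
    using a symmetric context window (illustrative toy setup)."""
    vectors = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def synonym_candidates(target, vectors, top_n=10):
    """Rank all other terms by distributional similarity to the target."""
    sims = [(other, cosine(vectors[target], vec))
            for other, vec in vectors.items() if other != target]
    return sorted(sims, key=lambda p: p[1], reverse=True)[:top_n]
```

Terms that occur in similar contexts – such as a medical term and its layman variant – thus receive high similarity scores even when they share no characters, which is precisely what rule-based surface-form methods fail to capture.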
2.3.2 ABBREVIATIONS
Abbreviations and their corresponding long form (expansions), although not typically considered synonyms, fulfil many of the conditions of the synonym semantic
relation: they are often interchangeable in some contexts without affecting the
truth condition of the larger construction in which they are used. Even if the extraction of synonyms and abbreviation-expansion pairs are not identical tasks, they
25 BioPortal is a Web portal to a virtual library of over fifty biological and medical ontologies:
http://bioportal.bioontology.org/.
can reasonably be viewed as being closely related, particularly when approached
from a distributional perspective.
Abbreviations and acronyms are prevalent in both the biomedical literature (Liu
et al., 2002) and clinical narrative (Meystre et al., 2008). Not only does this result in decreased readability (Keselman et al., 2007), but it also poses challenges
for information extraction (Uzuner et al., 2011). Semantic resources that link abbreviations to their corresponding concept, or dictionaries of abbreviations and
their corresponding long form, are therefore as important as synonym resources
for many medical language processing tasks. In contrast to synonyms, however,
abbreviations are semantically overloaded to a much larger extent. This means
that an abbreviation may have several possible long forms or have the same surface form as another word. In fact, 81% of UMLS26 abbreviations were found to
have ambiguous occurrences in biomedical text (Liu et al., 2002), many of which
are difficult to disambiguate even in context (Liu et al., 2001).
Many of the existing studies on the automatic creation of biomedical abbreviation dictionaries exploit the fact that abbreviations are sometimes defined in the
text on their first mention. These studies extract candidate abbreviation-expansion
pairs by assuming that either the long form or the abbreviation is written in parentheses (Schwartz and Hearst, 2003). Candidate abbreviation-expansion pairs are
extracted and the process of determining which of these are likely to be correct is
then performed either by rule-based (Ao and Takagi, 2005) or machine learning
(Chang et al., 2002; Movshovitz-Attias and Cohen, 2012) methods. Most of these
studies have been conducted on English corpora; however, there is one study on
Swedish medical text (Dannélls, 2006). Despite its popularity, there are problems
with this approach: Yu et al. (2002) found that around 75% of all abbreviations in
the biomedical literature are never defined.
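The parenthesis heuristic can be sketched as follows. This is a deliberate simplification in the spirit of Schwartz and Hearst (2003) – it only accepts abbreviations whose letters each start a word of the preceding long form, in order – whereas the actual algorithm also handles inner letters, reversed order and other cases.

```python
import re

def extract_pairs(text):
    """Find candidate (long form, abbreviation) pairs where an
    uppercase abbreviation follows its expansion in parentheses.
    A pair is kept only if each abbreviation letter starts a word
    of the long form, in order (simplified heuristic)."""
    pairs = []
    for match in re.finditer(r'((?:\w+[ -]){1,6}\w+)\s*\(([A-Z]{2,6})\)', text):
        long_form, abbrev = match.group(1), match.group(2)
        words = re.split(r'[ -]', long_form.lower())
        # take the last len(abbrev) words as the candidate expansion
        tail = words[-len(abbrev):]
        if len(tail) == len(abbrev) and all(
                w.startswith(c) for w, c in zip(tail, abbrev.lower())):
            pairs.append((' '.join(tail), abbrev))
    return pairs
```

For example, the sentence "The patient has chronic obstructive pulmonary disease (COPD)." yields the pair (chronic obstructive pulmonary disease, COPD), while a parenthetical that does not match the preceding words is discarded.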
The application of this method to clinical text is almost certainly inappropriate, as
it seems highly unlikely that abbreviations would be defined in this way. The telegraphic style of clinical narrative, with its many ad hoc abbreviations, is reasonably
explained by time and efficiency considerations in the clinical setting. There has,
however, been some work on identifying such undefined abbreviations (Isenius et
al., 2012), as well as on finding the intended abbreviation expansion among several
candidates available in an abbreviation dictionary (Gaudan et al., 2005).
26 Unified Medical Language System: http://www.nlm.nih.gov/research/umls/
2.4 DIAGNOSIS CODING
Diagnosis coding – that is, the process of assigning codes that correspond to a
given disease or health condition – is necessary in order to estimate the prevalence
and incidence of diseases and health conditions, as well as differences therein between various geographical areas and over time (Socialstyrelsen, 2013). For such
statistics to be comparable, there needs to be a standardized classification system,
ideally one that is adopted globally. This is the motivation behind the development of the International Classification of Diseases (ICD), which has been regularly revised and released since 1948 by the World Health Organization (WHO).
The main purpose of ICD is to enable classification and statistical description of
diseases and other health problems that cause death or contact with health care.
In addition to traditional diagnoses, the classification must therefore encompass a
broad spectrum of symptoms, abnormal findings and social circumstances (World
Health Organization, 2013).
The International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10) (World Health Organization, 2013) is a hierarchical classification system and contains 21 chapters (indicated by roman numerals),
which are largely divided according to organ system or etiology27 (infectious diseases, tumors, congenital malformations and injuries), in addition to a number of
chapters based on other principles. The chapters are divided into sections that
contain groups of similar diseases. Each section, in turn, contains a number of
categories (indicated by alphanumeric codes: a letter and two digits), which often
represent individual diseases. Categories are typically divided into subcategories
(a letter, two digits and a decimal), which can refer to different types or stages of
a disease.
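The code structure just described can be checked mechanically. The sketch below validates and splits the basic surface format – a letter and two digits for the category, with an optional decimal subcategory – under the assumption that this pattern suffices for illustration; it does not consult the actual ICD-10 code tables, so it accepts well-formed but nonexistent codes.

```python
import re

# Category: a letter and two digits (e.g., J45); a subcategory adds
# a decimal part (e.g., J45.0), per the structure described above.
ICD10_CODE = re.compile(r'^[A-Z][0-9]{2}(\.[0-9]{1,2})?$')

def parse_icd10(code):
    """Split an ICD-10 code into its category and optional subcategory.
    Returns None for strings that do not match the basic format."""
    m = ICD10_CODE.match(code)
    if not m:
        return None
    category = code[:3]
    subcategory = code[4:] if m.group(1) else None
    return {"category": category, "subcategory": subcategory}
```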
The current Swedish version is ICD-10-SE28 (Socialstyrelsen, 2010). This classification system has been widely used in Sweden for a variety of purposes, including
to generate statistics on inpatient and outpatient care, as well as for regional and
national registries. It has also served as the foundation for other classification systems that are used for hospital management and quality control in the health care
27 Etiology is the study of causation or origination of a disease.
28 The Swedish version of ICD-10 used to be called KSH97 (Klassifikation av sjukdomar och hälsoproblem 1997); this second edition is, however, called ICD-10-SE (Internationell statistisk klassifikation av sjukdomar och relaterade hälsoproblem – Systematisk förteckning) and replaced the first edition as of January 1, 2011.
sector. The DRG system, which works with diagnosis-related groups based on
ICD, is used for the allocation of resources in health care and for financial reimbursement to hospitals.
The process of assigning diagnosis codes is generally carried out in one of two
ways:29 (1) by expert coders, who, without additional insight into the patient history, produce codes that are considered to be exhaustive; or (2) by physicians,
who, having some knowledge of the patient history, code only essential aspects of
the care episode and therefore generally produce fewer codes than expert coders
due to practical restrictions, such as limited awareness of coding guidelines and
lack of time. In either case, the process of medical encoding is expensive and
time-consuming, yet essential.30 In the case where diagnosis codes are assigned
by physicians, the latter are kept from their primary responsibility of tending to
patients. Employing expert coders is particularly expensive, as their job requires them both to examine the health records and, typically, to consult hundreds of possible codes in encoding references.
Furthermore, the quality of medical encoding varies substantially according to
the experience and expertise of coders, often resulting in under- or over-encoding (Puentes et al., 2013). In a review of 4,200 health records conducted by the
Swedish National Board of Health and Welfare (Socialstyrelsen, 2006), major errors in the primary diagnosis were found in 20% of the cases and minor errors in
5% of the cases. Another review of the Swedish patient registry in 2008, however,
found that 98.6% of the care episodes were correctly encoded, but that 1.09% of
the care episodes had not been assigned a primary diagnosis code at all (Socialstyrelsen, 2010). Potential sources of erroneous ICD coding are related to the
patient trajectory during the care episode – the amount and quality of information at admission, patient-clinician communication and the clinician’s knowledge,
experience and attention to detail – and the documentation thereof – coder training and experience, language use variability in the EHR, as well as unintentional
and intentional coder errors, including upcoding, i.e., assigning codes of higher
reimbursement value (O’Malley et al., 2005).
29 See O’Malley et al. (2005) for a good description of how the process of assigning diagnosis codes is typically carried out.
30 In fact, encoding diagnoses and medical procedures is obligatory in many countries, including
Sweden (Socialstyrelsen, 2013).
2.4.1 COMPUTER-AIDED MEDICAL ENCODING
Attempts to develop methods and tools that can be used to provide automated support for the coding of diagnoses are primarily motivated by the massive resources
– in terms of both time and money – the task consumes. According to one estimate, the cost of medical encoding and associated errors is approximately $25
billion per year in the US (Lang 2007 as cited in Pestian et al. 2007). Another
reason to support this rather non-intuitive task is to improve the quality of coding,
as the statistics that are generated from it play such a fundamental role in many dimensions of health care, ranging from quality control and financial reimbursement
to epidemiology and public health. As a result, computer-aided diagnostic coding
has long been an active research endeavor31 (see Stanfill et al. 2010 for a review).
A popular approach is to exploit prelabeled corpora – where codes have been assigned to health records – to infer codes for new, unseen records.32 Most methods primarily target clinical narratives, with the assumption that this information
source will be particularly informative about the diagnosis or diagnoses that should
be assigned. An early, and well-cited, effort in that vein was made by Larkey and
Croft (1996), who use three different types of classifiers – k-nearest-neighbor, relevance feedback and Bayesian independence – in isolation and in combination,
to assign ICD-9 codes to discharge summaries. The classifiers produce a ranked
list of codes, for which recall, among other metrics, is calculated at various pre-specified cutoffs (15 and 20). In the conducted experiments, a combination of
all three classifiers yielded better results than any individual classifier or two-way
combination. Their test set contains 187 discharge summaries and 1,068 codes,
where only codes that occurred at least six times in the training set are included;
their best results are 0.73 recall (top 15) and 0.78 recall (top 20).
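Recall at a fixed cutoff, as reported in such ranked-list evaluations, can be expressed as a small helper (an illustrative sketch; the variable names are my own, not Larkey and Croft's):

```python
def recall_at_k(predicted_ranking, gold_codes, k):
    """Recall at cutoff k: the fraction of gold-standard codes that
    appear among the top-k codes of a ranked prediction list
    (e.g., k = 15 or 20 in the evaluation described above)."""
    top_k = set(predicted_ranking[:k])
    hits = sum(1 for code in gold_codes if code in top_k)
    return hits / len(gold_codes) if gold_codes else 0.0
```

A score of 0.73 at the top-15 cutoff thus means that, on average, 73% of the codes assigned by human coders appear among a classifier's fifteen highest-ranked codes.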
Another attempt to assign ICD-9 codes to discharge summaries is made by de Lima
et al. (1998), who interpret the task as an information retrieval problem, treating
diagnosis codes as documents to be retrieved and discharge summaries as queries.
They exploit the hierarchical structure of ICD to create a rule-based classifier that
is shown to outperform the classic vector space model. Their test set contains 82
discharge summaries and 2,424 distinct ICD-9 codes; they only report precision at
31 To illustrate this, the review by Stanfill et al. (2010) encompasses 113 studies, even if these include
coding of other medical entities besides diagnoses.
32 This approach is, of course, affected by potentially inaccurate coding, as discussed in the previous
section: the model will reflect the mistakes in the data.
various recall levels and obtain a precision of 0.24 at 100% recall and 0.63 at 10%
recall.
One of the relatively few systems that have actually been operationalized in the
clinical setting is the Automatic Categorization System (Autocoder) at the Mayo
Clinic, which led to the subsequent reduction in staff from 34 coders to 7 verifiers (Pakhomov et al., 2006). Autocoder is based on a two-step classifier, where
the notion of certainty is used to determine whether subsequent manual review is
necessary. The system tries to assign institution-specific diagnosis codes (based
on ICD-8) to short diagnostic statements in patients’ problem lists – arguably an
easier task than the ones mentioned above, albeit one on which considerably better performance is achieved. The authors do not report the number of unique diagnosis codes in the
test sets, but the classification system contains 35,676 leaf nodes. Approximately
48% of the entries are encoded with almost 0.97 precision and recall – these are
not manually reviewed – by using an example-based approach and leveraging the
highly repetitive nature of diagnostic statements. If the classifier is insufficiently
certain about a classification, the entry is passed on to a machine learning component.
Here, a simple naïve Bayes classifier is trained to associate words with categories.
The 34% of the entries that are classified by this component are encoded with approximately 0.87 precision and 0.94 recall. The remaining 18% of the entries are
classified with a precision of 0.59 and a recall of 0.45. They note that their bag-of-words representation in the machine learning component is probably suboptimal.
An important milestone in the research efforts to provide computer-aided medical
encoding is marked by the 2007 Computational Medicine Challenge33 (Pestian et
al., 2007), a shared task in which 44 different research teams participated. The
task was limited to a set of 45 ICD-9-CM codes (in 94 distinct combinations, with
no unseen combinations in the test set), which were to be assigned to radiology
reports. The systems were evaluated against a gold standard created using the majority annotation from three independent annotation sets. The winning contribution
obtained a micro-averaged F1-score of 0.89. This score is comparable to the inter-annotator agreement, indicating that automated systems are close to human-level
performance on this limited task. Many of the best-performing contributions rely
heavily on hand-crafted rules, sometimes in combination with a machine learning
component. Attempts have been made to eschew – entirely or partly – the need for
hand-crafted rules, sacrificing some performance in return for decreased development time (Suominen et al., 2008; Farkas and Szarvas, 2008; Chen et al., 2010).
33 http://computationalmedicine.org/challenge/previous
The approaches that have been proposed are varied (Goldstein et al., 2007; Crammer et al., 2007; Aronson et al., 2007); however, the best-performing systems handle negation and language use variability (particularly synonyms and hypernyms)
(Pestian et al., 2007).
More recent attempts have utilized structured data in EHRs (Ferrao et al., 2012)
and information fusion34 approaches (Lecornu et al., 2009; Alsun et al., 2010;
Lecornu et al., 2011) for the automatic assignment of diagnosis codes. The numerous approaches to computer-aided medical encoding that have been proposed are,
however, difficult to compare due to the heterogeneity of classification systems
and evaluation methods (Stanfill et al., 2010).
2.5 ADVERSE DRUG REACTIONS
The presence of adverse events (AEs) in health care is inevitable; their prevalence
and severity, however, are a massive concern, both in terms of (reduced) patient
safety and the economic strain they put on the health care system. Reviewing health
records manually – in a process commonly known as manual chart review – has
traditionally been the sole means of detecting AEs. This approach is costly and
imperfect, for obvious reasons. With the increasing adoption of EHRs, automatic
methods have been developed to detect AE signals (see Murff et al. 2003 and Bates
et al. 2003 for an overview).
One of the most common types of AE is the adverse drug event (ADE), which is
caused by the use of a drug at normal dosage or by an overdose. In the first case,
the ADE is said to be caused by an adverse drug reaction (ADR). The prevalence
of ADRs constitutes a major public health issue; in Sweden it has been identified
as the seventh most common cause of death (Wester et al., 2008). To mitigate the
detrimental effects caused by ADEs, it is important to be aware of the prevalence
and incidence of ADRs. As clinical trials are generally not sufficiently extensive
to identify all possible ADRs – especially less common ones – many are unknown
when a new drug enters the market. Pharmacovigilance is therefore carried out
throughout the life cycle of a pharmaceutical product and is typically supported by
spontaneous reports and medical chart reviews (Chazard, 2011). Relying on spontaneous reports, however, results in substantial underreporting of ADEs (Bates et al., 2003). The prospect of being able to employ automatic methods to estimate the prevalence and incidence of ADRs, as well as to identify potentially unknown or undocumented ADRs, thus comes with great economic and health-related benefits.
34 Information fusion involves transforming information from different sources and different points in time into a unified representation (Boström et al., 2007).
2.5.1 AUTOMATICALLY DETECTING ADVERSE DRUG EVENTS
Adverse drug reactions can be identified from a number of potential sources. There
has, for instance, been some previous work on identifying ADRs from online message boards: in one approach, ADRs are mined by calculating co-occurrences with
drug names in a twenty-token window (Benton et al., 2011). With the increasing
adoption of EHRs, several methods have been proposed for the automatic detection
of ADEs and ADRs from clinical sources. Chazard (2011), for instance, extracts
ADE detection rules from French and Danish EHRs. Decision trees and association rules are employed to mine the structured part of the EHRs, such as diagnosis
codes and lab results; clinical narratives, in the form of discharge summaries, are
only targeted when ATC35 codes are missing.
Many other previous attempts also focus primarily on structured patient data,
thereby missing out on potentially relevant information only available in the narrative parts of the EHR. There have, however, been text mining approaches to the
detection of ADRs, too (see Warrer et al. 2012 for a literature review). Wang et al.
(2010) first transform unstructured clinical narratives into a structured form (identified UMLS concepts) and then calculate co-occurrences with drugs. Aramaki et
al. (2010) assume a similar two-step approach – identifying terms first and then
relations – and estimate that approximately eight percent of their Japanese clinical
narratives contain ADE information. LePendu et al. (2012, 2013) propose methods for carrying out pharmacovigilance using clinical notes and demonstrate their
ability to flag adverse events, in some cases before an official alert is made. Eriksson et al. (2013) use a dictionary of Danish ADE terms and propose a rule-based
pipeline system that identifies possible ADEs in clinical text; a manual evaluation
showed that their system is able to identify possible ADEs with a precision of 89%
and a recall of 75%.
35 Anatomical Therapeutic Chemical codes are used for the classification of drugs.
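The window-based co-occurrence idea used in several of the approaches above can be sketched as follows. The twenty-token default mirrors the window size reported by Benton et al. (2011); the drug and symptom term lists are assumed external inputs (e.g., from a dictionary), and real systems would normalize term variants first.

```python
from collections import Counter

def drug_symptom_cooccurrences(tokens, drug_names, symptom_terms, window=20):
    """Count drug-symptom pairs that co-occur within a fixed token
    window in a tokenized text (illustrative sketch)."""
    drugs = [(i, t) for i, t in enumerate(tokens) if t in drug_names]
    symptoms = [(i, t) for i, t in enumerate(tokens) if t in symptom_terms]
    counts = Counter()
    for di, drug in drugs:
        for si, symptom in symptoms:
            if abs(di - si) <= window:
                counts[(drug, symptom)] += 1
    return counts
```

Aggregated over a large corpus, pairs with unexpectedly high co-occurrence counts become candidate ADR signals for manual review.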
CHAPTER 3
METHOD
It is not uncommon in research to employ research methods in a rather implicit
manner, without due deliberation. The prevailing research methods in a given
research field are often taken for granted and tend to conceal the fact that they
are based on certain philosophical assumptions (Myers, 2009). Here, light will be
shed on these aspects to make them more explicit. This will also help to explain
and motivate the general research strategy assumed in the thesis, as well as the
methodological choices made in the included studies.
After a description of the research strategy, including a characterization of the research and a discussion of the underlying assumptions, the evaluation framework – based on, and an integral part of, the former – is laid out in more detail.
This is followed by a description of the methods and materials used in the included
studies, as well as motivations (and alternatives) for their use. The experimental
setup of each included study is then summarized individually. Finally, ethical issues that arise when conducting research on clinical data, including how they
impact the scientific aspects of the work, are discussed briefly.
3.1 RESEARCH STRATEGY
The research strategy (see Figure 3.1) assumed in this thesis is predominantly
quantitative in nature, both in that the employed NLP methods require substantial
amounts of data and – more importantly from a scientific perspective – in that the
evaluation is performed with statistical measures. There are, however, also qualitative components, particularly in the analysis phase, which allow deeper insights into
the studied phenomena to be gained. This triangulation of methods is particularly
important when the data sets used for evaluation are not wholly reliable, as they
seldom are.1
While the semantic spaces are inductively constructed from the data, the research
can generally be characterized as deductive: a hypothesis is formulated and verified empirically and quantitatively. Even if hypothesis testing2 is not formally
carried out – and hypotheses are not always stated explicitly – introduced methods
or techniques are invariably evaluated empirically.
The research can be characterized as exploratory on one level and causal on another. There is not a single problem that motivates the specific use of distributional
semantics; instead, the potential leveraging of distributional semantics to solve
multiple problems in the clinical domain is investigated in an exploratory fashion.
The three targeted application areas are not in themselves critical in answering the
overarching research question. The research is also, to a large extent, causal: in
many of the included studies we are investigating the effect of one variable – the
introduction of some technique or method – on another.
The techniques or methods are evaluated in an experimental setting. In fact, the
experimental approach underlies much of the general research strategy. This is a
well-established means of evaluating NLP systems and adheres to the Cranfield
paradigm, which has its roots in experiments that were carried out in the 1960s
to evaluate indexing systems (Voorhees, 2002). In this evaluation paradigm, two
or more systems are evaluated on identical data sets using statistical measures of
performance. There are two broad categories of evaluation: system evaluation and
1 This is discussed in the context of synonym extraction in Section 5.1.
2 In hypothesis testing, conclusions are drawn using statistical (frequentist) inference. The hypothesis we wish to test is called the null hypothesis and is rejected or accepted according to some level of significance. When comparing two classifiers, statistical significance tests essentially determine, with a given degree of confidence, whether the observed difference in performance (according to some metric) is significant (Alpaydin, 2010).
user-based evaluation. In contrast to user-based evaluations, which involve users
and more directly measure a pre-specified target group’s satisfaction with a system,
system evaluations circumvent the need for direct access to large user groups by
creating reference standards or gold standards. These are pre-created data sets that
are treated as the ground truth, to which the systems are then compared. It could
be argued that NLP systems created for the clinical domain should be evaluated
in a real, clinical setting – and not in a laboratory – since they are likely to have
more specific applications than general NLP systems (Friedman and Hripcsak,
1999). However, when the performance of a system still needs to be improved
substantially, it is much more practical to employ a system evaluation, which is the
approach assumed here. When the performance approaches a clinically acceptable
level, user-based evaluations in the clinical setting would be more meaningful.
In quantitative research, a positivist philosophical standpoint is often assumed,
to which an interpretivist3 position is then juxtaposed. While these positions
are usefully opposed,4 many do not subscribe wholly and unreservedly to either
of the views. In the Cranfield paradigm, with its experimental approach, there
is certainly a predominance of positivist elements. This is naturally also the
case when performing predictive modeling, which all three applications in this
thesis can, in some sense, be viewed as instances of. As predictive models are
necessarily based on a logic of hypothetico-deduction, research that involves
predictive modeling is also based on positivist assumptions. NLP research also
entails working with natural language data, which are recognized to be socially
constructed and highly dependent on domain, context, culture and time. This
represents a vantage point more in line with that of interpretivism; in practice,
however, language data are often treated as the objective starting point for
NLP research. That said, NLP methods need to be adapted – or must be able to
adapt – according to the aforementioned characteristics of language.
3 Interpretivism is sometimes referred to as antipositivism.
4 Some have, however, argued that these can be combined in a single study (Myers, 1997).
Figure 3.1: RESEARCH STRATEGY. A characterization of the research strategy.
3.2 EVALUATION FRAMEWORK
As mentioned earlier, the evaluation is mainly conducted quantitatively with
statistical measures of performance in an experimental setting, where one system
is compared to another, which is sometimes referred to as the baseline (system).
The hypothesis is often in the simple form of:
System A will perform better than System B on Task T given Performance Metric M,
even if it is not always explicitly stated as such. The performance measures are
commonly calculated by comparing the output of a system with a reference standard, which represents the desired output as defined by human experts. The automatic evaluations conducted in the included studies are extrinsic – as opposed to
intrinsic – which means that the semantic spaces, along with their various ways
of representing meaning, are evaluated according to their utility, more specifically
their ability to perform a predefined task (Resnik and Lin, 2013).
3.2.1 PERFORMANCE METRICS
Widely used statistical measures of performance in information retrieval and NLP
are precision and recall (Alpaydin, 2010). To calculate these, the number of true
positives (tp), false positives (fp) and false negatives (fn) are needed. For some
measures, the number of true negatives (tn) is also needed. True positives are
correctly labeled instances, false positives are instances that have been incorrectly
labeled as positive and false negatives are instances that have been incorrectly
labeled as negative. Precision (P), or positive predictive value (PPV), is the proportion of positively labeled instances that are correct (Equation 3.1).
Precision: $P = \frac{tp}{tp + fp}$ (3.1)
Recall (R), or sensitivity, is the proportion of all actually positive instances that are
correctly labeled as such (Equation 3.2).
Recall: $R = \frac{tp}{tp + fn}$ (3.2)
There is generally a tradeoff between precision and recall: by assigning more instances to a given class (i.e., more positives), recall may increase – some of these may be true positives – while precision is likely to decrease, since the number of false positives will typically increase at the same time. Depending
on the application and one’s priorities, one can choose to optimize a system for either precision or recall. F-measure incorporates both notions and can be weighted
to prioritize either of the measures. When one does not have a particular preference, F1 -score is often employed and is defined as the harmonic mean of precision
and recall.
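For reference, these measures translate directly into code; the following is a minimal sketch (the function names are ours) of precision, recall and the weighted F-measure:

```python
def precision(tp, fp):
    """Positive predictive value (Eq. 3.1): correct positives over all positive labels."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Sensitivity (Eq. 3.2): correct positives over all actually positive instances."""
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives the F1-score,
    beta > 1 weights recall more heavily, beta < 1 weights precision."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# A system labels 8 instances as positive, 6 correctly (tp=6, fp=2), and misses 4 (fn=4):
p, r = precision(6, 2), recall(6, 4)  # 0.75, 0.6
f1 = f_beta(p, r)                     # harmonic mean, 0.9 / 1.35
```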
In the included studies, recall is generally prioritized, as many of the tasks are
concerned with extracting terms from a huge sample space – equal to the size of
the vocabulary. In the case of diagnosis code assignment, the sample space is also
large – equal to the number of ICD-10 codes in the data (over 12,000). If seen as
classification tasks, they can be characterized as multi-class, multi-label problems.
In such problems, the goal is to assign one or more labels to each instance in
an instance space, where each label associates an instance with one of k possible
38
C HAPTER 3.
classes. In the cases of synonyms, diagnosis codes and adverse drug reactions,
one does not know a priori how many labels should be assigned. This makes the
problem all the more challenging, which is the motivation behind initially focusing
on recall. Since the number of labels is unknown – and varies with each instance –
the approach assumed in this thesis is to pre-specify the number of labels, usually
ten or twenty. In the typical case, the number of actually positive instances is
much smaller than this pre-specified number, which means that precision will be
low. However, whenever precision is reported, the ranking of the true positives
should reasonably be taken into account. Here, this is implemented as a form of
weighted precision (Equation 3.3).
Weighted Precision: $P_w = \frac{\sum_{i=0}^{j-1} (j-i) \cdot f(i)}{\sum_{i=0}^{j-1} (j-i)}$, where $f(i) = \begin{cases} 1 & \text{if } i \in \{tp\} \\ 0 & \text{otherwise} \end{cases}$ (3.3)
and j is the pre-specified number of labels – here, ten or twenty – and {t p} is the
set of true positives. In words, this assigns a score to true positives according to
their (reverse) ranking in the list, sums their scores and divides the total score by
the maximum possible score (where all j labels are true positives).
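As a sketch, the weighted precision of Equation 3.3 can be computed as follows, where `ranked` is the system's ordered candidate list and `gold` the set of correct labels (both hypothetical arguments):

```python
def weighted_precision(ranked, gold, j=10):
    """Weighted precision (Eq. 3.3): a true positive at (0-based) rank i contributes
    j - i, and the sum is divided by the maximum possible score, obtained when
    all j suggestions are true positives."""
    score = sum(j - i for i, label in enumerate(ranked[:j]) if label in gold)
    max_score = j * (j + 1) // 2  # sum of j + (j-1) + ... + 1
    return score / max_score

# Gold labels 'a' and 'b' at ranks 0 and 2 of a 3-item list: (3 + 1) / (3 + 2 + 1)
weighted_precision(["a", "x", "b"], {"a", "b"}, j=3)  # 4/6
```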
Recall, on the other hand, is not affected by ranking: as long as a true positive
makes it above the j cut-off, its position in the list is ignored. In all of the included
studies where automatic evaluation is conducted – i.e., all save Paper V – recall
top j (where j = 10 or 20) is reported (Equation 3.4).
Recall (top j): $R_{top\,j} = \frac{tp \text{ in a list of } j \text{ labeled instances}}{tp + fn}$ (3.4)
Although j is somewhat arbitrarily chosen, it is motivated from the perspective of
utility: in a decision support system, how many items is it reasonable for a human
to inspect? In the targeted applications, ten or twenty seems reasonable.
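Recall top j (Equation 3.4) ignores ranking within the list; a corresponding sketch, where `ranked` is the system's ordered candidate list and `gold` the collection of expected labels:

```python
def recall_top_j(ranked, gold, j=10):
    """Recall top j (Eq. 3.4): the proportion of gold-standard labels that appear
    anywhere among the first j ranked candidates; position within the top j is ignored."""
    tp = len(set(ranked[:j]) & set(gold))
    return tp / len(gold)

# Two of three expected labels appear among the top 3 suggestions:
recall_top_j(["a", "x", "b", "c"], ["a", "b", "z"], j=3)  # 2/3
```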
3.2.2 RELIABILITY AND GENERALIZATION
Metrics like the ones described above are useful in estimating the expected performance – measured according to one or more criteria – of a system. There are,
however, two closely related aspects of system evaluations that are important to
consider: (1) reliability, i.e., the extent to which we can rely on the observed performance of a system, and (2) generalization, i.e., the extent to which we can
expect the observed performance to be maintained when we apply the system to
unseen data.
The reliability of a performance estimation depends fundamentally on the size of
the data used in the evaluation: the more instances we have in our (presumably
representative) evaluation data, the greater our statistical evidence is and thus the
more confident we can be about our results. This can be estimated by calculating
a confidence interval or, when comparing two systems, by conducting a statistical
significance test.
However, in order to estimate the performance of a system on unseen data – as we
wish to do whenever prediction is the aim – we need to ensure that our evaluation
data is distinct from the data we use for training our model. The goal, then, is
to fit our model to the training data while eschewing both underfitting – where
the model is less complex than the function underlying the data – and overfitting,
where the model is too complex and fitted to peculiarities of the training data
rather than the underlying function. In order to estimate generalization ability and,
ultimately, expected performance on unseen data, it is good practice to partition
the data into three partitions: a training set, a development set and an evaluation
set. The training set, as the name suggests, is used for training the model; the
generalization ability of a model typically improves with the number of training
instances (Alpaydin, 2010). The development set is used for model selection and
for estimating generalization ability: the model with the highest performance (and
the one with the best inductive bias) on the development set is the best one. This is
done in a process known as cross-validation. In order to estimate performance on
unseen data, however, the best model needs to be re-evaluated on the evaluation set,
as the development set has effectively become part of the training set5 by using it
for model selection. When conducting comparative evaluations, all modifications
to the (use of the) learning algorithm, including parameter tuning, must be made
5 Sometimes the training and development sets are jointly referred to as the training set and the
evaluation set as the test set.
prior to seeing the evaluation data. Failing to do so may lead to the reporting of
results that are overly optimistic (Salzberg, 1997).
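The partitioning discipline described above can be sketched in a few lines (the 80/10/10 proportions and function names are illustrative; the included studies use task-specific splits):

```python
import random

def partition(instances, train=0.8, dev=0.1, seed=42):
    """Shuffle and split into training, development and evaluation sets. The
    development set drives model selection; the evaluation set is used only once,
    after all tuning, to estimate performance on unseen data."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * train)
    n_dev = int(len(data) * dev)
    return data[:n_train], data[n_train:n_train + n_dev], data[n_train + n_dev:]

def select_model(candidates, dev_score):
    """Pick the candidate with the highest development-set score; only that model
    is then re-evaluated on the held-out evaluation set."""
    return max(candidates, key=dev_score)
```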
It is also worth underscoring the fact that empirical evaluations of this kind are
in no way domain independent: any conclusions we draw from our analyses –
based on our empirical results – are necessarily conditioned on the data set we
use and the task we apply our models to. The No Free Lunch Theorem (Wolpert,
1995) states that there is no such thing as an optimal learning algorithm – for some
data sets it will be accurate and for others it will be poor. By the same token, it seems unreasonable to believe that there is a single optimal way to model semantics; the
results reported here are dependent on a specific domain (or data set) and particular
applications.
Quantitative results are supplemented with qualitative analyses, adding to the understanding of quantitative performance measurements. In fact, in all of the included studies, a qualitative evaluation is performed in what is commonly known
as an error analysis. This allows deeper insights into the studied phenomena to be
gained.
3.3 METHODS AND MATERIALS
In this section, the methods and materials – corpora, terminologies and knowledge
bases, NLP tools and software packages – that are used in the covered studies
are enumerated and described. Underlying techniques, unless they have been described earlier, are also introduced briefly. Precisely how these methods and materials are used in the included studies is discussed in more detail in the subsequent
section, Section 3.4, which describes the experimental setup of each study.
3.3.1 CORPORA
The two main resources for lexical knowledge acquisition are corpora and
machine-readable dictionaries (Matsumoto, 2003). As models of distributional
semantics rely on large-scale observations of language use, they generally induce
semantic spaces from a corpus – a computer-readable collection of text or speech
(Jurafsky et al., 2000). The corpora used in the included studies are extracted from
three sources:
1. Stockholm EPR Corpus: The Stockholm Electronic Patient Record Corpus
2. MIMIC-II: The Multiparameter Intelligent Monitoring in Intensive Care
database
3. Läkartidningen: The Journal of the Swedish Medical Association
It is important to note that the Stockholm EPR Corpus and Läkartidningen are
written in Swedish, while the MIMIC-II database contains clinical notes written in
English. Two of the corpora (Stockholm EPR Corpus and MIMIC-II) are clinical
– which is the focus of this thesis – while the third (Läkartidningen) is a medical,
non-clinical corpus. The data sets used to induce the semantic spaces are extracted
from these three sources (Table 3.1).
Corpus           (Non-)Clinical   Language   Total Size   Used Size
Stockholm EPR    Clinical         Swedish    110M         42.5M
MIMIC-II         Clinical         English    250M         250M
Läkartidningen   Non-clinical     Swedish    20M          20M
Table 3.1: CORPORA. An overview of the corpora used in the included studies. Corpus
size is given as the number of tokens.
STOCKHOLM EPR CORPUS
The Stockholm EPR Corpus (Dalianis et al., 2009, 2012) comprises Swedish
health records extracted from TakeCare, an EHR system used in the Stockholm
City Council, and encompasses approximately one million records from around
900 clinical units. Papers II–V utilize a subset of the Stockholm EPR Corpus
comprising all health records from the first half of 2008, a statistical description of
which is provided by Dalianis et al. (2009).
MIMIC - II
The MIMIC-II database is a publicly available database encompassing clinical data
for over 40,000 hospital stays of more than 32,000 patients, collected over a seven-year period (2001-2007) from intensive care units (medical, surgical, coronary and
cardiac surgery recovery) at Boston’s Beth Israel Deaconess Medical Center. In
addition to various structured data, such as laboratory results and ICD-9 diagnosis
codes, the database contains text-based records written in English, including discharge summaries, nursing progress notes and radiology interpretations (Saeed et
al., 2011). In Paper I, a corpus comprising all text-based records from the MIMIC-II database is utilized.
LÄKARTIDNINGEN
The Läkartidningen corpus comprises the freely available section of the Journal
of the Swedish Medical Association (1996-2005) (Kokkinakis, 2012). Läkartidningen is a weekly journal written in Swedish and contains articles discussing
new scientific findings in medicine, pharmaceutical studies, health economic evaluations, etc. Although these issues have been made available for research, the
original order of the sentences has not been retained; the sentences are given in a
random order. The entire corpus is used in Paper II.
3.3.2 TERMINOLOGIES
Medical terminologies and ontologies are important tools for natural language
processing of health record narratives. The following resources are used in the
included studies:
1. SNOMED CT: Systematized Nomenclature of Medicine – Clinical Terms
2. ICD-10: International Classification of Diseases, Tenth Revision
3. MeSH: Medical Subject Headings
4. FASS: Pharmaceutical Specialties in Sweden
5. Abbreviations and Acronyms: Medical Abbreviations and Acronyms
In some instances, these resources are used as reference standards for automatic
(Papers I–IV) or manual evaluation (Paper V) and, in others, they are leveraged by
the developed NLP methods (Papers III and V).
SNOMED CT
Systematized Nomenclature of Medicine – Clinical Terms, SNOMED CT
(IHTSDO, 2013a), has emerged as the de facto terminological standard for representing clinical concepts in EHRs and is today used in more than fifty countries.
Currently, it is only available in a handful of languages – US English, UK English, Spanish, Danish and Swedish – but translations – into French, Lithuanian
and several other languages – are underway (IHTSDO, 2013b). SNOMED CT
encompasses a wide range of topics and its scope includes, for instance, clinical
findings, procedures, body structures and social contexts. The concepts are organized in hierarchies with multiple levels of granularity and are linked by relations
such as is a. There are well over 300,000 active concepts in SNOMED CT and
over a million relations. Each concept consists of a Concept ID, a Fully Specified Name, a Preferred Term and Synonyms. A preferred term can have zero or more
synonyms. Both the English and the Swedish versions are used in the included
studies. The English version of SNOMED CT contains synonyms, whereas the
current Swedish version does not.
ICD -10
International Statistical Classification of Diseases and Related Health Problems,
10th Revision, ICD-10 (World Health Organization, 2013), is a standardized statistical classification system for encoding diagnoses. The main purpose of ICD-10
is to enable classification and statistical description of diseases and other health
problems that cause death or contact with health care. See Section 2.4 for more
details.
MESH
Medical Subject Headings, MeSH (U.S. National Library of Medicine, 2013), is a
controlled vocabulary thesaurus that serves the purpose of indexing medical literature. It comprises terms naming descriptors in a hierarchical structure that allows
searching at various levels of specificity. There are 26,853 descriptors in the current (2013) version of MeSH. In the current Swedish version of MeSH, 26,542
of the descriptors have been translated into Swedish (Karolinska Institutet Universitetsbiblioteket, 2013). This resource is one of very few standard terminologies
that contains synonyms for medical terms in Swedish.
FASS
Farmaceutiska Specialiteter i Sverige6 , FASS (Läkemedelsindustriföreningens
Service AB, 2013), contains descriptions, including adverse reactions, of all drugs
that have been approved and marketed in Sweden. There are three versions of
FASS: one for health care personnel, one for the public and one for veterinarians.
The descriptions are provided by the pharmaceutical companies that produce the
drugs, but they must be approved by the authorities.
ABBREVIATIONS AND ACRONYMS
A list of medical abbreviations and their corresponding expansions/definitions
is extracted from a book called Medicinska förkortningar och akronymer
(Cederblom, 2005). These have been collected manually from Swedish health
records, newspapers and scientific articles.
3.3.3 TOOLS AND TECHNIQUES
The following existing NLP tools are used – and, in some cases, modified – in the
included studies:
1. Granska Tagger: Swedish Part-of-Speech Tagger and Lemmatizer
2. TextNSP: Ngram Statistics Package
3. CoSegment: Collocation Segmentation Tool
4. Swedish NegEx: Swedish Negation Detection Tool
5. JavaSDM: Java-based Random Indexing Package

6 In English: Pharmaceutical Specialties in Sweden
These tools allow, or at least facilitate, certain NLP tasks to be carried out. Here,
these tasks will be discussed, including why they are performed and possible alternative means of carrying them out.
BASIC PREPROCESSING
Some amount of basic preprocessing of the data is almost invariably carried out
in NLP, such as tokenization, which is the process whereby words are separated
out from running text (Jurafsky et al., 2000). The type of preprocessing, however,
also depends on the application or the method. When applying models of distributional semantics, it is common to remove tokens which carry little meaning in
themselves, or which contribute very little to determining the meaning of nearby
tokens. In all of the included studies, punctuation marks and digits are removed.
STOP - WORD REMOVAL
A category of words that carry very little meaning or content is usually referred
to as stop words (Jurafsky et al., 2000). Precisely how to determine which words
are stop words is debated; however, a common property of stop words is that they are high-frequency words. As such, they often appear in a large number of
contexts and thus do not contribute to capturing distinctions in meaning. Function
words and words that belong to closed classes, such as the definite article the, are
prime examples of stop words. When applying models of distributional semantics,
it is not uncommon first to remove stop words. However, for some models –
for instance, Random Permutation, where word order is, to a larger degree, taken
into account – it has been shown that employing extensive stop lists may degrade
performance (Sahlgren et al., 2008).
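A minimal sketch of this preprocessing, with an illustrative English stop list (the studies use Swedish and English corpora and, for Random Permutation, variants with stop words retained):

```python
import re

# Illustrative stop list; in practice a list appropriate to the corpus language is used.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}

def preprocess(text, remove_stop_words=True):
    """Lowercase, strip digits and punctuation, tokenize on whitespace and,
    optionally, drop stop words."""
    text = re.sub(r"[^a-zåäö\s]", " ", text.lower())
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

preprocess("The dose of 40 mg is given in the morning.")
# ['dose', 'mg', 'given', 'morning']
```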
TERM NORMALIZATION
When applying models of distributional semantics – although, it depends on the
task – some form of term normalization, such as stemming or lemmatization, is
typically performed. Stemming and lemmatization are two alternative means of
handling morphology: stemming reduces inflected words to a common stem, while lemmatization transforms words into their lemma form (Jurafsky et al., 2000). One
reason for conducting term normalization when applying models of distributional
semantics is that, when modeling semantics at a more coarse-grained level, inflectional information is of less importance. It, moreover, helps to deal with potential
data sparsity issues since, by collapsing different word forms to a single, canonical form, the statistical foundation of that term will be strengthened. The Granska
Tagger (Knutsson et al., 2003), which has been developed for general Swedish7
with Hidden Markov Models, is used to perform lemmatization in Papers II–V.
TERM RECOGNITION
In both Papers I and IV, multiword terms need to be recognized in the text. There
are many ways of doing this, including simple string matching against an existing
terminology. There are also statistical approaches that are fundamentally based on
identifying collocations, i.e., words that, with a high probability, occur with other
words (Mitkov, 2003). In this thesis, two statistical approaches are employed to extract and recognize terms in clinical text. In Paper I, term extraction is performed
by first extracting n-grams. Word n-grams are sequences of consecutive words,
where n is the number of words in the sequence (Jurafsky et al., 2000). Unigrams,
bigrams and trigrams are extracted (i.e., where n is 1, 2 and 3 respectively). For extracting n-grams, TextNSP (Banerjee and Pedersen, 2003) is used. Once n-grams
have been extracted, their C-value (Frantzi and Ananiadou, 1996; Frantzi et al.,
2000) is calculated. This statistic has been used successfully for term recognition
in the biomedical domain, largely due to its ability to handle nested terms (Zhang
et al., 2008). It is based on term frequency and term length (number of words); if
a candidate term is part of a longer candidate term, it also takes into account how
many other terms it is part of and how frequent those longer terms are (Equation
3.5).
7 A small-scale evaluation on clinical text showed that performance on this particular domain was
still very high: 92.4% accuracy, which is more or less in line with state-of-the-art performance of
Swedish part-of-speech taggers (Hassel et al., 2011).
$$\text{C-value}(a) = \begin{cases} \log_2 |a| \cdot f(a) & \text{if } a \text{ is not nested} \\ \log_2 |a| \cdot \left( f(a) - \frac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right) & \text{otherwise} \end{cases} \quad (3.5)$$

where $a$ is a candidate term, $b$ ranges over longer candidate terms, $T_a$ is the set of terms that contain $a$, $|a|$ is the length of $a$ (number of words), $f(a)$ and $f(b)$ are the term frequencies of $a$ and $b$, and $P(T_a)$ is the size of $T_a$.
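A direct, unoptimized implementation of Equation 3.5 may clarify the statistic (the frequency table below is illustrative; candidate terms are represented as word tuples):

```python
import math
from collections import defaultdict

def c_value(freq):
    """C-value (Eq. 3.5) for every candidate term in `freq`, a dict mapping each
    candidate (a tuple of words) to its corpus frequency. Note that log2(1) = 0,
    so unigram candidates score zero under this literal reading of the formula."""
    def contains(b, a):
        # True if a occurs as a contiguous subsequence of the longer candidate b.
        return any(b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1))

    nested_in = defaultdict(list)  # T_a: longer candidates that contain a
    for a in freq:
        for b in freq:
            if len(b) > len(a) and contains(b, a):
                nested_in[a].append(b)

    scores = {}
    for a, f_a in freq.items():
        T_a = nested_in[a]
        if not T_a:  # a is not nested
            scores[a] = math.log2(len(a)) * f_a
        else:
            scores[a] = math.log2(len(a)) * (f_a - sum(freq[b] for b in T_a) / len(T_a))
    return scores

freq = {("heart",): 10, ("heart", "failure"): 8, ("congestive", "heart", "failure"): 5}
c_value(freq)[("heart", "failure")]  # log2(2) * (8 - 5/1) = 3.0
```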
In Paper IV, an existing tool called CoSegment (Daudaravicius, 2010) is used
instead. The tool uses the Dice score (Equation 3.6) to measure the association
strength between two words; this score is sensitive to low-frequency word pairs.
The Dice scores between adjacent words are used to identify boundaries of collocation segments. A threshold can then be set for the collocation segmentation
process: a higher threshold means that stronger statistical associations between
constituents are required and fewer collocations will be identified.
Dice: $D_{x_{i-1},x_i} = \frac{2 \cdot f(x_{i-1}, x_i)}{f(x_{i-1}) + f(x_i)}$ (3.6)

where $f(x_{i-1}, x_i)$ is the frequency with which the adjacent words $x_{i-1}$ and $x_i$ co-occur; $f(x_{i-1})$ and $f(x_i)$ are the respective (and independent) frequencies of the two words.
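A greedy segmentation sketch in the spirit of CoSegment (the frequency tables and threshold are illustrative; the actual tool's boundary criterion is more elaborate):

```python
def dice(f_pair, f_left, f_right):
    """Dice association score between two adjacent words (Eq. 3.6)."""
    return 2 * f_pair / (f_left + f_right)

def segment(tokens, pair_freq, word_freq, threshold=0.3):
    """Place a segment boundary between adjacent words whose Dice score falls
    below the threshold; stronger associations keep words in the same segment."""
    if not tokens:
        return []
    segments, current = [], [tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        d = dice(pair_freq.get((prev, cur), 0), word_freq[prev], word_freq[cur])
        if d >= threshold:
            current.append(cur)
        else:
            segments.append(current)
            current = [cur]
    segments.append(current)
    return segments
```

Raising the threshold demands stronger statistical association between constituents, so fewer and shorter collocations survive, mirroring the behaviour described above.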
NEGATION DETECTION
Recognizing terms in text is not sufficient; knowing whether they are affirmed or
negated can be of critical importance, especially given the prevalence of negations
in the clinical domain (Chapman et al., 2001). Here, a Swedish version of NegEx
(Skeppstedt, 2011) is used. This system attempts to determine if a given clinical
entity has been negated or not by looking for negation cues in the vicinity of the
clinical entity.
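The core idea can be sketched as follows (the cue list is illustrative and English; Swedish NegEx uses curated Swedish triggers, also handles post-negation cues and terminates negation scope at certain words):

```python
NEGATION_CUES = {"no", "not", "without", "denies"}

def is_negated(tokens, entity_index, window=5):
    """NegEx-style check: the entity at `entity_index` counts as negated if a
    negation cue occurs within `window` tokens before it (a simplification of
    the full algorithm)."""
    start = max(0, entity_index - window)
    return any(t in NEGATION_CUES for t in tokens[start:entity_index])

is_negated("patient denies chest pain".split(), 2)  # True: "denies" precedes the entity
```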
SEMANTIC SPACES
For generating the semantic spaces, a Java implementation of Random Indexing –
with and without permutations – is used and extended: JavaSDM (Hassel, 2004).
See Section 2.2.3 for a detailed description of Random Indexing and how semantic
spaces are constructed. In all of the conducted experiments, the number of non-zero elements is set to eight – four +1s and four -1s. Moreover, weighting is applied to
the index vectors in the accumulation of context vectors (Equation 3.7).
$w_{iv} = 2^{1 - d_{it,tt}}$ (3.7)

where $w_{iv}$ is the weight applied to the index vector and $d_{it,tt}$ is the distance from the index term to the target term.
There are many means of calculating the distance between two context vectors and
thereby estimating the semantic similarity between two entities (Jurafsky et al.,
2000). Metrics such as Euclidean or Manhattan distance are rarely used to measure term similarity since they are both particularly sensitive to extreme values,
which high-frequency terms will have. The (raw) dot product is likewise problematic as a similarity metric due to its favouring of long vectors, i.e., vectors with more non-zero elements and/or elements with high values; again, this will result in high-frequency terms obtaining disproportionately high similarity scores with other terms. A solution is to normalize the dot product by, for instance, dividing the
dot product by the lengths of the two vectors. This is in effect the same as taking
the cosine of the angle between the two vectors. The cosine similarity (Equation
3.8) metric is widespread in information retrieval and other fields8 ; it is also the
similarity metric employed in all of the experiments conducted here.
$sim_{cosine}(\vec{v}, \vec{w}) = \frac{\sum_{i=1}^{N} v_i \cdot w_i}{\sqrt{\sum_{i=1}^{N} v_i^2} \sqrt{\sum_{i=1}^{N} w_i^2}}$ (3.8)

where $\vec{v}$ and $\vec{w}$ are the two vectors and $N$ is the dimensionality of the vectors.
8 Other common similarity metrics include the Jaccard similarity coefficient and the Dice coefficient
(Jurafsky et al., 2000).
3.4 EXPERIMENTAL SETUPS
The experimental setup of each study is described below. Consult Table 3.2 for an
overview of the methods and materials used in the various studies.
Paper   Corpus                       NLP Tools                     Terminologies       Evaluation
I       MIMIC-II                     JavaSDM, TextNSP              SNOMED CT           Recall Top 20
II      Stockholm EPR Corpus,        JavaSDM, Granska Tagger       MeSH,               W. Precision,
        Läkartidningen                                             Abbreviations       Recall Top 10
III     Stockholm EPR Corpus         JavaSDM, Granska Tagger,      ICD-10,             Recall Top 10
                                     NegEx                         SNOMED CT
IV      Stockholm EPR Corpus         JavaSDM, Granska Tagger,      ICD-10              Recall Top 10
                                     CoSegment
V       Stockholm EPR Corpus         Granska Tagger                FASS,               Manual
                                                                   SNOMED CT           (counting)
Table 3.2: METHODS AND MATERIALS. An overview of the methods and materials
used in the various studies.
PAPER I: IDENTIFYING SYNONYMY BETWEEN SNOMED CLINICAL TERMS OF VARYING LENGTH USING DISTRIBUTIONAL ANALYSIS OF ELECTRONIC HEALTH RECORDS
In this paper, an attempt is made to identify synonymy between terms of varying
length in a distributional semantic framework. A method is proposed, wherein
terms are identified using statistical techniques and – through simple concatenation – treated as single tokens, after which a semantic (term) space is induced from
the preprocessed corpus. The research question is to what extent such a method
can be used to identify synonymy between terms of varying length. For the sake
of comparison, a traditional word space is also induced. The semantic spaces are
evaluated on the task of identifying synonyms of SNOMED CT preferred terms.
The motivation behind this study is that distributional semantics could potentially
be used to support terminology development and translation efforts of standard
terminologies, like SNOMED CT, into additional languages, for which there are
preferred terms but no synonyms.
The experimental setup can be summarized in the following steps: (1) data preparation, (2) term extraction and identification, (3) model building and parameter
tuning, and (4) evaluation (Figure 3.2). Semantic spaces are constructed with various parameter settings on two data set variants: one with unigram terms and one
with multiword terms. The models – and, in effect, the method – are evaluated for
their ability to identify synonyms of SNOMED CT preferred terms. After optimizing the parameter settings for each group of models on a development set, the
best models are evaluated on unseen data.
Figure 3.2: EXPERIMENTAL SETUP. An overview of the process and the experimental setup: data preparation, term extraction and identification, model building and parameter tuning, and evaluation.
In Step 1, the clinical data from which the term spaces will be induced is extracted
and preprocessed. We create a corpus comprising all text-based records from the
MIMIC-II database. The documents in the corpus are then preprocessed to remove
metadata, incomplete sentence fragments, digits and punctuation marks.
In Step 2, we extract and identify multiword terms in the corpus by ranking n-grams according to their C-value (see Section 3.3.3 for more details). We then
apply a number of filtering rules that remove terms beginning and/or ending with
certain words, e.g. prepositions (in, from, for) and articles (a, the). Another alteration to the term list – now ranked according to C-value – is to give precedence
to SNOMED CT preferred terms by adding/moving them to the top of the list,
regardless of their C-value (or failed identification). The term list is then used to
perform exact string matching on the entire corpus: multiword terms with a higher C-value than their constituents are concatenated. We thus treat multiword terms as
separate tokens with their own particular distributions in the data, to a greater or
lesser extent different from those of their constituents.
In Step 3, semantic spaces are induced from the data set variants: one containing
only unigram terms (UNIGRAM TERMS) and one containing also longer terms:
unigram, bigram and trigram terms (MULTIWORD TERMS). The following model
parameters are experimented with:
− Model Type: Random Indexing (RI), Random Permutation (RP)
− Sliding Window: 1+1, 2+2, 3+3, 4+4, 5+5, 6+6 surrounding terms
− Dimensionality: 500, 1000, 1500 dimensions
Evaluation takes place in both Step 3 and Step 4. The semantic spaces are evaluated for their ability to identify synonyms of SNOMED CT preferred terms, each of which appears at least fifty times in the corpus. A preferred term is provided as
input to a semantic space and the twenty most semantically similar terms are output, provided that they also appear at least fifty times in the data. Recall Top 20
is calculated for each preferred term, i.e. what proportion of the synonyms are
identified in a list of twenty suggestions? The SNOMED CT data is divided into a
development set and an evaluation set in a 50/50 split. The development set is used
in Step 3 to find the optimal parameter settings for the respective data sets and the
task at hand. The best parameter configuration for each type of model (UNIGRAM
TERMS and MULTIWORD TERMS) is then used in the final evaluation in Step 4.
PAPER II: SYNONYM EXTRACTION AND ABBREVIATION EXPANSION WITH ENSEMBLES OF SEMANTIC SPACES
This is a significant extension of the work published in (Henriksson et al., 2012).
In this paper, our hypothesis is that combinations of semantic spaces can perform better than a single semantic pace on the tasks of extracting synonyms and
abbreviation-expansion pairs. Ensembles of semantic spaces allow different distributional semantic models with different parameter configurations and induced
from different types of corpora to be combined.
The experimental setup can be divided into the following steps: (1) corpora preprocessing, (2) construction of semantic spaces from the two corpora, (3) identification of the most profitable single-corpus combinations, (4) identification of the
most profitable multiple-corpora combinations, (5) evaluation of the single-corpus
and multiple-corpora combinations, and (6) post-processing of candidate terms.
Figure 3.3: ENSEMBLES OF SEMANTIC SPACES. An overview of the strategy of combining semantic spaces constructed with different models of distributional semantics –
with potentially different parameter configurations – and induced from different corpora.
Once the corpora have been preprocessed, ten semantic spaces from each corpus
are induced with different context window sizes (Random Permutation spaces are
also induced with and without stop words). Ten pairs of semantic spaces are then
combined using three different combination strategies. These are evaluated on the
three tasks – (1) abbreviations → expansions, (2) expansions → abbreviations and
(3) synonyms – using the development subsets of the reference standards (a list
of medical abbreviation-expansion pairs and MeSH synonyms). Performance is
mainly measured as recall top 10, i.e. the proportion of expected candidate terms
that are among a list of ten suggestions.
The pair of semantic spaces involved in the most profitable combination for each
corpus is then used to identify the most profitable multiple-corpora combinations,
where eight different combination strategies are evaluated. The best single-corpus
combinations are evaluated on the evaluation subsets of the reference standards,
where using Random Indexing and Random Permutation in isolation constitute
the two baselines. The best multiple-corpora combination is likewise evaluated on
the evaluation subsets of the reference standards; here, the clinical and medical
ensembles constitute the two baselines.
Post-processing rules are then constructed using the development subsets of the
reference standards and the outputs of the various semantic space combinations.
These are evaluated on the evaluation subsets of the reference standards using the
most profitable single-corpus and multiple-corpora ensembles.
PAPER III: EXPLOITING STRUCTURED DATA, NEGATION DETECTION AND SNOMED CT TERMS IN A RANDOM INDEXING APPROACH TO CLINICAL CODING
In this paper, we investigate three different strategies to improve the performance
of a Random Indexing approach to clinical coding by: (1) giving extra weight
to clinically significant words, (2) exploiting structured data in patient records to
calculate the likelihood of candidate diagnosis codes, and (3) incorporating the
use of negation detection. We hypothesize that these techniques – in isolation and
in combination – will perform better on this task than our baseline, which is based
on the same approach but without these additional techniques. Important research
questions that are explored in the paper – and covered in this thesis – are how to
combine structured patient data with semantic spaces of clinical text and how to
capture the function of negation in a distributional semantic framework.
In the distributional semantics approach to clinical coding (Henriksson et al., 2011;
Henriksson and Hassel, 2011), assigned codes in the training set are treated as
terms in the notes to which they have been assigned. When applying Random
Indexing, a document-level context definition is employed, as there is no order de-
54
C HAPTER 3.
pendency between individual terms in a note and assigned diagnosis codes. The
semantic spaces are then used to produce a ranked list of recommended diagnosis
codes for a to-be-coded document. This list is created by allowing each of the
words in a document to ‘vote’ for a number of distributionally similar diagnosis
codes, thus necessitating the subsequent merging of the individual lists. This ranking procedure can be carried out in a number of ways, some of which are explored
in this paper. The starting point, however, is to use the semantic similarity of a
word and a diagnosis code – as defined by the cosine similarity score – and the
inverse document frequency (idf) value of the word, which denotes its discriminatory value. This is regarded as our baseline model (Henriksson and Hassel, 2011),
to which negation handling and additional weighting schemes are added.
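The voting and merging procedure described above can be sketched as follows. The data structures (a word-to-code similarity map and document frequencies) are illustrative assumptions, not the actual implementation used in the papers.

```python
import math
from collections import defaultdict

def rank_codes(document_words, similarity, doc_freq, n_docs, top_n=10):
    """Let every word in a to-be-coded document 'vote' for distributionally
    similar diagnosis codes. Each vote is the cosine similarity between the
    word and a code, scaled by the word's idf (its discriminatory value);
    the per-word lists are merged by summing the weighted votes."""
    scores = defaultdict(float)
    for word in document_words:
        df = doc_freq.get(word)
        if not df:
            continue  # words unseen in training carry no vote
        idf = math.log(n_docs / df)
        for code, cos in similarity.get(word, {}).items():
            scores[code] += cos * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A rarer word (higher idf) thus pulls its associated codes further up the merged list than a common word with the same cosine similarity.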
The semantic spaces are induced from and evaluated on a subset of the Stockholm
EPR Corpus. A document contains all free-text entries concerning a single patient
made on consecutive days at a single clinical unit. The documents in the partitions
of the data sets on which the models are trained (90%) also include one or more
associated ICD-10 codes (on average 1.7 and at most 47). In the testing partitions
(10%), the associated codes are retained separately for evaluation. In addition
to the complete data set, two subsets are created, in which there are documents
exclusively from a particular type of clinic: one for ear-nose-throat clinics and
one for rheumatology clinics.
Variants of the three data sets are created, in which negated clinical entities are
automatically annotated using the Swedish version of NegEx. The clinical entities
are detected through exact string matching against a list of 112,847 SNOMED CT
terms belonging to the semantic categories ’finding’ and ’disorder’. A negated
term is marked in such a way that it will be treated as a single word, although with
its proper negated denotation. Multi-word terms are concatenated into unigrams.
The data is finally pre-processed: lemmatization is performed using the Granska
Tagger, while punctuation, digits and stop words are removed.
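The entity marking step might be sketched as below. This is a drastically simplified stand-in for the Swedish NegEx rules and the SNOMED CT matching actually used: the cue list and the one-token negation scope are illustrative assumptions.

```python
import re

def mark_entities(text, snomed_terms, negation_cues=("ingen", "inte", "ej")):
    """Concatenate multi-word SNOMED CT terms into single tokens and, when
    a negation cue precedes a term, prefix the token so that its negated
    sense is modeled as a distinct 'word' in the semantic space."""
    # Longest terms first, so that nested terms do not pre-empt longer ones.
    for term in sorted(snomed_terms, key=len, reverse=True):
        token = term.replace(" ", "_")
        text = re.sub(r"\b" + re.escape(term) + r"\b", token, text)
    tokens, negated = [], False
    for tok in text.split():
        if tok in negation_cues:
            tokens.append(tok)
            negated = True            # open a (naive) one-token scope
        elif negated:
            tokens.append("NEG_" + tok)
            negated = False           # scope closes after one entity
        else:
            tokens.append(tok)
    return tokens
```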
For each of the semantic spaces, we apply two distinct weighting techniques. First,
we adopt a technocratic approach to the election of diagnosis codes. We do so
by giving added weight to words which are ‘clinically significant’. This is here
achieved by utilizing the same list of SNOMED CT findings and disorders that
was used by the negation detection system.
We also perform weighting of the correlated ICD-10 codes by exploiting statistics
generated from some structured data in the EHRs, namely gender, age and clinical
unit. The idea is to use known information about a to-be-coded document in order
to assign weights to candidate diagnosis codes according to plausibility, which in
turn is based on past combinations of a particular code and each of the structured
data entries. In order for an unseen combination not to be ruled out entirely, additive smoothing is performed. Age is discretized into age groups for each and every
year up to the age of ten, after which ten-year intervals are used.
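The fixed-fields weighting and age discretization might be sketched as follows, assuming a per-field conditional probability with additive (Laplace) smoothing; the exact weighting formula used in the paper may differ.

```python
def age_group(age):
    """One age group per year up to the age of ten;
    ten-year intervals above that."""
    return age if age < 10 else 10 + (age - 10) // 10

def fixed_fields_weight(code, field_values, pair_counts, code_counts, alpha=1.0):
    """Weight a candidate code by how plausible it is given the document's
    structured data (gender, age group, clinical unit): the product of
    smoothed conditional probabilities of each value given the code.
    The pseudo-count alpha keeps unseen combinations non-zero."""
    weight = 1.0
    for value in field_values:
        seen = pair_counts.get((code, value), 0)
        total = code_counts.get(code, 0)
        weight *= (seen + alpha) / (total + alpha)
    return weight
```

The resulting weight can then be multiplied into the semantic-space score of each candidate code before the ranked list is produced.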
The evaluation is carried out by comparing the model-generated recommendations
with the clinically assigned codes in the data. This matching is done on all four
possible levels of ICD-10 according to specificity (see Figure 3.4).
Figure 3.4: ICD-10 STRUCTURE. The structure of ICD-10 allows division into four
levels.
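The four-level matching can be approximated by prefix truncation of the code string. The prefix lengths below are illustrative assumptions: the true ICD-10 hierarchy is defined by chapters and blocks rather than raw character prefixes.

```python
def icd10_levels(code):
    """The four comparison levels, from the exact code down to the most
    general: e.g. 'J06.9' yields the exact code, the three-character
    category 'J06', and two progressively coarser prefixes."""
    code = code.replace(".", "")
    return [code, code[:3], code[:2], code[:1]]

def match_level(predicted, gold):
    """Most specific level (0 = exact match) on which two codes agree,
    or -1 if they do not agree on any level."""
    for level, (p, g) in enumerate(zip(icd10_levels(predicted),
                                       icd10_levels(gold))):
        if p == g:
            return level
    return -1
```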
PAPER IV: OPTIMIZING THE DIMENSIONALITY OF CLINICAL TERM SPACES FOR
IMPROVED DIAGNOSIS CODING SUPPORT
In this paper, we investigate the effect of the dimensionality of semantic spaces on
the task of assigning diagnosis codes, particularly as the vocabulary grows large,
which is generally the case in the clinical domain, but especially so when modeling
multiword terms as single tokens. Our hypothesis is that the performance will improve with a higher dimensionality than our baseline of one thousand dimensions.
The data is extracted from the Stockholm EPR Corpus and contains lemmatized
clinical notes in Swedish. Variants of the data set are created with different thresholds used for the collocation segmentation: a higher threshold means that stronger
statistical associations between constituents are required and fewer collocations
will be identified. The collocations are concatenated and treated as compounds,
which results in an increased number of types and a decreased number of tokens
per type. Identifying collocations is done to see if modeling multiword terms,
rather than only unigrams, may boost results; it will also help to provide clues
about the correlation between features of the vocabulary and the optimal dimensionality.
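A hedged sketch of threshold-based collocation segmentation, using pointwise mutual information as an assumed association measure; the segmentation tool used in the paper may score constituents differently.

```python
import math
from collections import Counter

def segment_collocations(tokens, threshold):
    """Greedy left-to-right bigram merging: adjacent words whose pointwise
    mutual information (PMI) exceeds the threshold are concatenated into a
    single compound token. A higher threshold requires stronger statistical
    association and therefore yields fewer collocations."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    def pmi(a, b):
        return math.log((bigrams[(a, b)] * n) / (unigrams[a] * unigrams[b]))

    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and pmi(tokens[i], tokens[i + 1]) > threshold:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Merging collocations in this way increases the number of types and decreases the number of tokens per type, as reported for the COLL data sets.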
Random Indexing is used to construct the semantic spaces with eleven different
dimensionalities between 1,000 and 10,000. The semantic spaces are evaluated
for their ability to assign diagnosis codes to clinical notes; results, measured as
recall top 10, are only based on exact matches.
PAPER V: EXPLORATION OF ADVERSE DRUG REACTIONS IN SEMANTIC
VECTOR SPACE MODELS OF CLINICAL TEXT
In this paper, we explore the potential of distributional semantics to extract interesting semantic relations between drug-symptom pairs. The models are evaluated
for their ability to detect known and potentially unknown adverse drug reactions.
The study is primarily exploratory in nature; however, the underlying hypothesis
is that drugs and adverse drug reactions co-occur in similar contexts in clinical
narratives and that distributional semantics is able to capture this.
The data from which the models are induced is extracted from the Stockholm EPR
Corpus. A document comprises clinical notes from a single patient visit. A patient
visit is difficult to delineate; here it is simply defined according to the continuity
of the documentation process: all free-text entries written on consecutive days are
concatenated into one document. Two data sets are created: one in which patients
from all age groups are included and another where records for patients over fifty
years are discarded. This is done in order to investigate whether it is easier to
identify adverse drug reactions caused in patients less likely to have a multitude of
health issues. Preprocessing is done on the data sets in the form of lemmatization
and stop-word removal.
Random Indexing is applied on the two data sets with two sets of parameters,
yielding a total of four models. The context is the only model parameter we experiment with, using a sliding window of eight (4+4) or 24 (12+12) surrounding
words. The dimensionality is set to 1,000.
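A minimal Random Indexing sketch along these lines. The sparse ternary index vectors are standard for the method, and the parameter values mirror the setup described above (dimensionality 1,000; 4+4 or 12+12 windows), but the implementation details are illustrative.

```python
import numpy as np

def random_indexing(docs, dim=1000, nonzeros=8, window=4, seed=0):
    """Each word is assigned a sparse ternary index vector; a word's
    context vector is the sum of the index vectors of the words that
    occur within the sliding window around it."""
    rng = np.random.default_rng(seed)
    index, context = {}, {}

    def index_vector(word):
        if word not in index:
            v = np.zeros(dim)
            pos = rng.choice(dim, size=nonzeros, replace=False)
            v[pos] = rng.choice([-1.0, 1.0], size=nonzeros)
            index[word] = v
        return index[word]

    for doc in docs:
        for i, word in enumerate(doc):
            ctx = context.setdefault(word, np.zeros(dim))
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if j != i:
                    ctx += index_vector(doc[j])
    return context

def cosine(u, v):
    """Cosine similarity between two context vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```

Words that appear in similar contexts, such as a drug and a frequently co-documented symptom, end up with similar context vectors.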
The models are then used to generate lists of twenty distributionally similar terms
for twenty common medications that have multiple known side-effects. To enable a comparison of the models induced from the two data sets, two groups of
medications are selected: (1) drugs for cardiovascular disorders and (2) drugs that
are not disproportionally prescribed to patients in older age groups, e.g. drugs for
epilepsy, diabetes mellitus, infections and allergies. Moreover, as side-effects are
manifested as either symptoms or disorders, a vocabulary of unigram SNOMED
CT terms belonging to the semantic categories finding and disorder is compiled
and used to filter the lists generated by the models. The drug-symptom/disorder
pairs are here manually evaluated by a physician using lists of indications and
known side-effects.
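The neighbour-list generation and vocabulary filtering might look as follows; `space` is assumed to map terms to context vectors such as those produced by Random Indexing, and the function name is hypothetical.

```python
import numpy as np

def adr_candidates(drug, space, findings_vocabulary, top_n=20):
    """Rank all terms by cosine similarity to the drug's context vector
    and keep only those found in a vocabulary of SNOMED CT findings and
    disorders, mirroring how the model output was filtered before the
    manual review by a physician."""
    dv = space[drug]

    def cos(v):
        return float(dv @ v / (np.linalg.norm(dv) * np.linalg.norm(v) + 1e-12))

    ranked = sorted(
        (t for t in space if t != drug and t in findings_vocabulary),
        key=lambda t: cos(space[t]),
        reverse=True,
    )
    return ranked[:top_n]
```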
3.5
ETHICAL ISSUES
Working with clinical data inevitably raises certain ethical issues. This is due to
the inherently sensitive nature of this type of data, which is protected by law. In
Sweden, there are primarily three laws that, in one way or another, govern and stipulate the lawful handling of patient information. The Act Concerning the Ethical
Review of Research Involving Humans9 , SFS 2003:460 (Sveriges riksdag, 2013a),
is applicable also to research that involves sensitive personal data (3 §). It states
that such research is only permissible after approval from an ethical review board
(6 §) and that processing of sensitive personal data may only be approved if it is
deemed necessary to carry out the research (10 §). More specific to clinical data is
the Patient Data Act10 , SFS 2008:355 (Sveriges riksdag, 2013b). It stipulates that
patient care must be documented (Chapter 3, 1 §) and that health records are an
information source also for research (Chapter 3, 2 §). A related, but more generally applicable, law is the Personal Data Act11 , SFS 1998:204 (Sveriges riksdag,
2013c), which protects individuals against the violation of their personal integrity
by processing of personal data (1 §). This law stipulates that personal data must
not be processed for any purpose that is incompatible with that for which the information was collected; however, it moreover states that processing of personal data
for historical, statistical or scientific purposes shall not be regarded as incompatible (9 §). If the processing of personal data has been approved by a research ethics
committee, the stipulated preconditions shall be considered to have been met (19
§).
The Stockholm EPR Corpus does not contain any structured information that may
be used directly to identify a patient. The health records were in this sense de-identified by the hospital by replacing social security numbers (in Swedish: personnummer) with a random key (one per unique patient) and by removing all
9 In Swedish: Lag om etikprövning av forskning som avser människor
10 In Swedish: Patientdatalagen
11 In Swedish: Personuppgiftslagen
names in the structured part of the records. The data is encrypted and stored securely on machines that are not connected to any networks and in a location to
which only a limited number of people, having first signed a confidentiality agreement, have access. Published results contain no confidential or other information
that can be used to identify individuals. The research conducted in this thesis has
been approved by the Regional Ethical Review Board in Stockholm12 (Etikprövningsnämnden i Stockholm), permission numbers 2009/1742-31/5 and 2012/834-31/5. When applying for these permissions, the aforementioned measures to ensure that personal integrity would be respected were carefully described, along
with clearly stated research goals and descriptions of how, and for what purposes,
the data would be used.
Some of these ethical issues also impact on scientific aspects of the work. The
sensitive nature of the data makes it difficult to make data sets publicly available,
which, in turn, makes the studies less readily repeatable for external researchers.
Developed techniques and methods can, however, be applied and compared on
clinical data sets to which a researcher does have access. Unfortunately, access to
large repositories of clinical data for research purposes is still rather limited, which
also means that many problems have not yet been tackled in this particular domain
and – by extension – that there are fewer methods to which a proposed method
can be compared. This is partly reflected in this work, where an external baseline
system is sometimes lacking. Without a reference (baseline) system, statistical inference – in the form of testing the statistical significance of the difference between
the performances of two systems – cannot be used to draw conclusions. Employing a more naïve baseline (such as a system based on randomized or majority class
classifications) for the mere sake of conducting significance testing is arguably not
very meaningful.
12 http://www.epn.se/en/
CHAPTER 4
RESULTS
In this chapter, the main results of the included studies are summarized task- and
paper-wise. To facilitate the reading of the results, the experimental setup of each
study is described in Section 3.4. Some of the results reported in the papers have
been omitted in this chapter; consult the appended papers for complete results, as
well as detailed analyses thereof.
4.1 SYNONYM EXTRACTION OF MEDICAL TERMS
The synonym extraction task was addressed in two papers: one using English data
and the other using Swedish data. In the first paper, the paraphrasing issue was
addressed, while in the second paper, the notion of ensembles of semantic spaces
was introduced.
PAPER I: IDENTIFYING SYNONYMY BETWEEN SNOMED CLINICAL TERMS
OF VARYING LENGTH USING DISTRIBUTIONAL ANALYSIS OF ELECTRONIC
HEALTH RECORDS
In this study, semantic spaces are induced from two data set variants: one that only
contains single words (UNIGRAM TERMS) and one in which multiword terms
have been identified and are treated as compounds (MULTIWORD TERMS). The
semantic spaces are evaluated in two steps. First, the parameters are tuned on
a development set; then, the optimal parameter configurations for each type of
semantic space are applied on an evaluation set.
In the parameter tuning step, the pattern is fairly clear: for both data set variants,
the best model parameter settings are based on Random Permutation and a dimensionality of 1500 (Table 4.1). A sliding window of 5+5 and 4+4 yields the
best results for UNIGRAM TERMS and MULTIWORD TERMS, respectively. The
general tendency is that results improve as the dimensionality and the size of the
sliding window increase. Increasing the size of the context window beyond 5+5
surrounding terms does not, however, boost results further. Incorporating term
order information (Random Permutation) is profitable.
Once the optimal parameter settings for each data set variant had been configured,
they were evaluated for their ability to identify synonyms on the unseen evaluation
set. Overall, the best UNIGRAM TERMS model (Random Permutation, 5+5 context window, 1,500 dimensions) achieved a recall top 20 of 0.24 (Table 4.2), i.e.
24% of all unigram synonym pairs that occur at least fifty times in the corpus were
successfully identified in a list of twenty suggestions per preferred term. With the
best MULTIWORD TERMS model (Random Permutation, 4+4 context window,
1,500 dimensions), the average recall top 20 is 0.16. For 22 of the correctly identified synonym pairs, at least one of the terms in the synonymous relation was a
multiword term.
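The recall top 20 measure used throughout these experiments amounts to the following computation; the data structures are hypothetical stand-ins for the model output and the reference standard.

```python
def recall_at_k(suggestions, gold_synonyms, k=20):
    """Share of reference synonym pairs for which the expected term
    appears among the first k model suggestions generated for the
    preferred term."""
    hits = sum(
        1 for preferred, expected in gold_synonyms
        if expected in suggestions.get(preferred, [])[:k]
    )
    return hits / len(gold_synonyms)
```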
                  RANDOM INDEXING                       RANDOM PERMUTATION
       1+1   2+2   3+3   4+4   5+5   6+6     1+1   2+2   3+3   4+4   5+5   6+6
UNIGRAM TERMS
 500   0.17  0.18  0.20  0.20  0.20  0.21    0.17  0.20  0.21  0.21  0.22  0.22
1000   0.16  0.19  0.20  0.21  0.21  0.21    0.19  0.22  0.21  0.21  0.22  0.23
1500   0.17  0.21  0.22  0.22  0.22  0.22    0.18  0.22  0.22  0.23  0.24  0.23
MULTIWORD TERMS
 500   0.08  0.12  0.13  0.13  0.13  0.12    0.08  0.13  0.14  0.14  0.13  0.12
1000   0.09  0.12  0.13  0.13  0.13  0.13    0.10  0.13  0.14  0.13  0.12  0.11
1500   0.09  0.13  0.13  0.14  0.14  0.14    0.10  0.14  0.14  0.16  0.14  0.14

Table 4.1: MODEL PARAMETER TUNING. Results, measured as recall top 20, for
UNIGRAM TERMS and MULTIWORD TERMS on the development sets. Results are
reported with different parameter configurations: different models of distributional
semantics (Random Indexing, Random Permutation), different sliding window sizes
(1+1, 2+2, 3+3, 4+4, 5+5, 6+6 surrounding words to the left + right of the target
word) and different dimensionalities (500, 1000, 1500). NB: results separated by a
double horizontal line should not be compared directly; they are reported in the same
table for the sake of efficiency.
                   # Synonym Pairs   Recall (Top 20)
UNIGRAM TERMS      215               0.24
MULTIWORD TERMS    292               0.16

Table 4.2: FINAL EVALUATION. Results, measured as recall top 20, for UNIGRAM
TERMS and MULTIWORD TERMS on the evaluation set. NB: results separated by a
double horizontal line should not be compared directly; they are reported in the same
table for the sake of efficiency.
PAPER II: SYNONYM EXTRACTION AND ABBREVIATION EXPANSION WITH
ENSEMBLES OF SEMANTIC SPACES
In this paper, semantic spaces are combined in two steps: first, two semantic spaces
– one constructed with Random Indexing and the other with Random Permutation
– induced from a single corpus (clinical or medical) are combined; then, the most
profitable single-corpus combinations are combined in multiple-corpora combinations. The most profitable parameter configurations and strategies for performing
the single-corpus and multiple-corpora combinations are determined using a development set; these are then applied on an evaluation set. Simple post-processing
rules are applied in an attempt to improve results.
In both single-corpus combinations, the RI + RP combination strategy, where the
cosine similarity scores are merely summed, is the most successful on all three
tasks (Table 4.3). With the clinical corpus, the following results are obtained: 0.42
recall on the abbr→exp task, 0.32 recall on the exp→abbr task, and 0.40 recall
on the syn task. With the medical corpus, the results are significantly lower: 0.10
recall on the abbr→exp task, 0.08 recall on the exp→abbr task, and 0.30 recall on
the syn task.
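The two families of combination strategies can be sketched as follows. The interpretation of the RI ⊂ RP30 strategy (keep one space's candidates only if they also occur among the other space's top 30) follows the paper's notation, but the implementation is an assumption.

```python
def combine_sum(ri_scores, rp_scores, top_n=10):
    """The RI + RP strategy: candidate lists from the Random Indexing and
    Random Permutation spaces are merged by simply summing their cosine
    similarity scores."""
    merged = {t: ri_scores.get(t, 0.0) + rp_scores.get(t, 0.0)
              for t in set(ri_scores) | set(rp_scores)}
    return sorted(merged, key=merged.get, reverse=True)[:top_n]

def combine_subset(primary, other, pool=30, top_n=10):
    """The RI ⊂ RP30 family of strategies: candidates ranked highly by
    one space are kept only if they also appear among the other space's
    top `pool` candidates."""
    allowed = set(sorted(other, key=other.get, reverse=True)[:pool])
    ranked = [t for t in sorted(primary, key=primary.get, reverse=True)
              if t in allowed]
    return ranked[:top_n]
```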
Strategy      Abbr→Exp   Exp→Abbr   Syn
CLINICAL
RI ⊂ RP30     0.38       0.30       0.39
RP ⊂ RI30     0.35       0.30       0.38
RI + RP       0.42       0.32       0.40
MEDICAL
RI ⊂ RP30     0.08       0.03       0.26
RP ⊂ RI30     0.08       0.03       0.24
RI + RP       0.10       0.08       0.30

Table 4.3: SINGLE-CORPUS RESULTS ON DEVELOPMENT SET. Results, measured as
recall top ten, of the best configurations for each model and model combination on the
three tasks. NB: results separated by a double horizontal line should not be compared
directly; they are reported in the same table for the sake of efficiency.
In the multiple-corpora combinations, the single-step approaches generally performed better than the two-step approaches, with some exceptions (Table 4.4).
The most successful ensemble was a simple single-step approach, where the cosine
similarity scores produced by each semantic space were simply summed (SUM),
yielding 0.32 recall for abbr→exp, 0.17 recall for exp→abbr, and 0.52 recall for
syn. Normalization, whereby ranking was used instead of cosine similarity, invariably affected performance negatively.
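A sketch of the single-step ensemble, with the rank-based normalization variant included; the function and parameter names are illustrative.

```python
def ensemble(spaces_scores, strategy="SUM", normalize=False, top_n=10):
    """Single-step ensemble over candidate lists produced by several
    semantic spaces. With normalize=True the cosine score is replaced
    by an inverse-rank score, a variant that invariably hurt
    performance in the experiments."""
    per_space = []
    for scores in spaces_scores:
        if normalize:
            ranked = sorted(scores, key=scores.get, reverse=True)
            per_space.append({t: 1.0 / (r + 1) for r, t in enumerate(ranked)})
        else:
            per_space.append(dict(scores))
    merged = {}
    for scores in per_space:
        for term, score in scores.items():
            merged.setdefault(term, []).append(score)
    agg = sum if strategy == "SUM" else (lambda xs: sum(xs) / len(xs))
    final = {t: agg(xs) for t, xs in merged.items()}
    return sorted(final, key=final.get, reverse=True)[:top_n]
```

Under SUM, a term suggested by several spaces accumulates evidence, which is one plausible reason the combination outperforms its constituent spaces.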
When applying the single-corpus combinations from the clinical corpus, the following results were obtained: 0.31 recall on abbr→exp, 0.20 recall on exp→abbr,
and 0.44 recall on syn (Table 4.5). The RI baseline was outperformed on all three
tasks; the RP baseline was outperformed on two out of three tasks, with the exception of the exp→abbr task. Finally, it might be interesting to point out that the RP
Strategy    Normalize   Abbr→Exp   Exp→Abbr   Syn
AVG         True        0.13       0.09       0.39
AVG         False       0.24       0.11       0.39
SUM         True        0.13       0.09       0.34
SUM         False       0.32       0.17       0.52
AVG→AVG     -           0.15       0.09       0.41
SUM→SUM     -           0.13       0.07       0.40
AVG→SUM     -           0.15       0.09       0.41
SUM→AVG     -           0.13       0.07       0.40

Table 4.4: MULTIPLE-CORPORA RESULTS ON DEVELOPMENT SET. Results, measured
as recall top ten, of the best configurations for each model and model combination on
the three tasks.
baseline performed better than the RI baseline on the two abbreviation-expansion
tasks, but that the RI baseline did somewhat better on the synonym extraction
task. With the medical corpus, the following results were obtained: 0.17 recall
on abbr→exp, 0.11 recall on exp→abbr, and 0.34 recall on syn (Table 4.5). Both
the RI and RP baselines were outperformed, with a considerable margin, by their
combination. In complete contrast to the clinical corpus, the RI baseline here beat
the RP baseline on the two abbreviation-expansion tasks, but was outperformed by
the RP baseline on the synonym extraction task.
When applying the multiple-corpora ensembles, the following results were obtained on the evaluation sets: 0.30 recall on abbr→exp, 0.19 recall on exp→abbr,
and 0.47 recall on syn (Table 4.6). The two ensemble baselines were clearly outperformed by the larger ensemble of semantic spaces from two types of corpora
on two of the tasks; the clinical ensemble baseline performed equally well on
the exp→abbr task. It might be of some interest to compare the two ensemble
baselines here since, in this setting, they can more fairly be compared: they are
evaluated on identical test sets. The clinical ensemble baseline quite clearly outperformed the medical ensemble baseline on the two abbreviation-expansion tasks;
however, on the synonym extraction task, their performance was comparable.
The post-processing filtering was unsurprisingly more effective on the two
abbreviation-expansion tasks. With the clinical corpus, recall improved substantially with the post-processing filtering: from 0.31 to 0.42 on abbr→exp and from
0.20 to 0.33 on exp→abbr (Table 4.5). With the medical corpus, however, almost
no improvements were observed for these tasks. With the combination of semantic
                             Abbr→Exp      Exp→Abbr      Syn
Evaluation Configuration     P     R       P     R       P     R
CLINICAL
RI Baseline                  0.04  0.22    0.03  0.19    0.07  0.39
RP Baseline                  0.04  0.23    0.04  0.24    0.06  0.36
Standard (Top 10)            0.05  0.31    0.03  0.20    0.07  0.44
Post-Processing (Top 10)     0.08  0.42    0.05  0.33    0.08  0.43
Dynamic Cut-Off (Top ≤ 10)   0.11  0.41    0.12  0.33    0.08  0.42
MEDICAL
RI Baseline                  0.02  0.09    0.01  0.08    0.03  0.18
RP Baseline                  0.01  0.06    0.01  0.05    0.05  0.26
Standard (Top 10)            0.03  0.17    0.01  0.11    0.06  0.34
Post-Processing (Top 10)     0.03  0.17    0.02  0.11    0.06  0.34
Dynamic Cut-Off (Top ≤ 10)   0.17  0.17    0.10  0.11    0.06  0.34

Table 4.5: SINGLE-CORPUS RESULTS ON EVALUATION SET. Results (P = precision, R
= recall, top ten) of the best models with and without post-processing on the three tasks.
The dynamic cut-off allows the model to suggest fewer than ten terms in order to
improve precision. The results are based on the application of the model combinations
to the evaluation data. NB: results separated by a double horizontal line should not be
compared directly; they are reported in the same table for the sake of efficiency.
                              Abbr→Exp      Exp→Abbr      Syn
Evaluation Configuration      P     R       P     R       P     R
Clinical Ensemble Baseline    0.04  0.24    0.03  0.19    0.06  0.34
Medical Ensemble Baseline     0.02  0.11    0.01  0.11    0.05  0.33
Standard (Top 10)             0.05  0.30    0.03  0.19    0.08  0.47
Post-Processing (Top 10)      0.07  0.39    0.06  0.33    0.08  0.47
Dynamic Cut-Off (Top ≤ 10)    0.28  0.39    0.31  0.33    0.08  0.45

Table 4.6: MULTIPLE-CORPORA RESULTS ON EVALUATION SET. Results (P = precision,
R = recall, top ten) of the best models with and without post-processing on the three
tasks. The dynamic cut-off allows the model to suggest fewer than ten terms in order to
improve precision. The results are based on the application of the model combinations
to the evaluation data.
spaces from the two corpora, significant gains were once more made by filtering
the candidate terms on the two abbreviation-expansion tasks: from 0.30 to 0.39
recall and from 0.05 precision to 0.07 precision on abbr→exp; from 0.19 to 0.33
recall and from 0.03 to 0.06 precision on exp→abbr (Table 4.6). With a dynamic
cut-off, only precision could be improved, although at the risk of negatively affecting recall. With the clinical corpus, recall was largely unaffected for the two
abbreviation-expansion tasks, while precision improved by 3-7 percentage points
(Table 4.5). With the medical corpus, the gains were even more substantial: from
0.03 to 0.17 precision on abbr→exp and from 0.02 to 0.10 precision on exp→abbr
– without having any impact on recall. The greatest improvements on these tasks
were, however, observed with the combination of semantic spaces from multiple
corpora: precision increased from 0.07 to 0.28 on abbr→exp and from 0.06 to
0.31 on exp→abbr – again, without affecting recall (Table 4.6). For synonyms,
the post-processing filtering was clearly unsuccessful for both corpora and their
combination, with almost no impact on either precision or recall.
4.2 ASSIGNMENT OF DIAGNOSIS CODES
The diagnosis coding task was also addressed in two papers. In the first of the two,
the focus was on improving the Random Indexing approach to clinical coding
by: (1) exploiting terminological resources, (2) exploiting structured data, and
(3) attempting to handle negation. In the second paper, the focus was instead
on investigating the effect of the dimensionality as the vocabulary and number of
documents grow large. This was studied using diagnosis coding as a use case,
where an attempt was made to improve results.
PAPER III: EXPLOITING STRUCTURED DATA, NEGATION DETECTION AND
SNOMED CT TERMS IN A RANDOM INDEXING APPROACH TO CLINICAL CODING
In this paper, three techniques are tried – both in isolation and in combination –
in an attempt to improve results on the diagnosis coding task. Two data sets are
created, in which one has been preprocessed to detect negated entities and mark
them as such. Interestingly, approximately 14-17 percent of detected entities are
negated. Moreover, two types of models are induced: general models – trained
on all types of clinics – and domain-specific models: an ENT (Ear-Nose-Throat)
model and a rheumatology model.
The baseline for the general models finds 23% of the clinically assigned codes (exact matches) (Table 4.7). The single application of one of the weighting techniques
to the baseline model boosts performance somewhat, the fixed fields-based code
filtering (26% exact matches) slightly more so than the technocratic word weighting (24% exact matches). The negation variant of the general model, General
Negation Model, performs somewhat better – up two percentage points (25% exact matches) – than the baseline model. The technocratic approach applied to this
model does not yield any observable added value. The fixed fields filtering does,
however, result in a further improvement on the three most specific levels (27%
exact matches). A combination of the two weighting schemes does not appear to
bring much benefit to either of the general models, compared to solely performing
fixed fields filtering.
                     General Model            General Negation Model
Weighting            E     L3    L2    L1     E     L3    L2    L1
Baseline             0.23  0.25  0.33  0.60   0.25  0.27  0.35  0.62
Technocratic (T)     0.24  0.26  0.34  0.61   0.25  0.27  0.35  0.62
Fixed Fields (FF)    0.26  0.28  0.36  0.61   0.27  0.29  0.37  0.63
T + FF               0.26  0.28  0.36  0.62   0.27  0.29  0.37  0.63

Table 4.7: GENERAL MODELS. With and without negation handling. Recall (top
10), measured as the presence of the clinically assigned codes in a list of ten
model-generated recommendations. E = exact match, L3→L1 = matches on the other
levels, from specific to general. The baseline is for the model without negation
handling only.
The baseline for the ENT models finds 33% of the clinically assigned codes (exact
matches) (Table 4.8). Technocratic word weighting yields a modest improvement
over the baseline model: one percentage point on each of the levels. Filtering code
candidates based on fixed fields statistics, however, leads to a remarkable boost
in results, from 33% to 43% exact matches. The ENT Negation Model performs
slightly better than the baseline model, although only as little as a single percentage
point (34% exact matches). Performance drops when the technocratic approach
is applied to this model. The fixed fields filtering, on the other hand, similarly
improves results for the negation variant of the ENT model; however, there is no
apparent additional benefit in this case of negation handling. In fact, it somewhat
hampers the improvement yielded by this weighting technique. As with the general
models, a combination of the two weighting techniques does not affect the results
much for either of the ENT models.
                     ENT Model                ENT Negation Model
Weighting            E     L3    L2    L1     E     L3    L2    L1
Baseline             0.33  0.34  0.41  0.62   0.34  0.35  0.42  0.62
Technocratic (T)     0.34  0.35  0.42  0.63   0.33  0.33  0.41  0.61
Fixed Fields (FF)    0.43  0.43  0.48  0.64   0.42  0.43  0.48  0.63
T + FF               0.42  0.42  0.47  0.64   0.42  0.42  0.47  0.62

Table 4.8: EAR-NOSE-THROAT MODELS. With and without negation handling. Recall
(top 10), measured as the presence of the clinically assigned codes in a list of ten
model-generated recommendations. E = exact match, L3→L1 = matches on the other
levels, from specific to general. The baseline is for the model without negation
handling only.
                     Rheuma Model             Rheuma Negation Model
Weighting            E     L3    L2    L1     E     L3    L2    L1
Baseline             0.61  0.61  0.68  0.92   0.61  0.61  0.70  0.92
Technocratic (T)     0.72  0.72  0.77  0.94   0.60  0.60  0.70  0.91
Fixed Fields (FF)    0.82  0.82  0.85  0.95   0.67  0.67  0.75  0.91
T + FF               0.82  0.83  0.86  0.95   0.68  0.68  0.76  0.92

Table 4.9: RHEUMATOLOGY MODELS. With and without negation handling. Recall
(top 10), measured as the presence of the clinically assigned codes in a list of ten
model-generated recommendations. E = exact match, L3→L1 = matches on the other
levels, from specific to general. The baseline is for the model without negation
handling only.
The baseline for the rheumatology models finds 61% of the clinically assigned
codes (exact matches) (Table 4.9). Compared to the above models, the technocratic
approach is here much more successful, resulting in 72% exact matches. Filtering
the code candidates based on fixed fields statistics leads to a further improvement
of ten percentage points for exact matches (82%). The Rheuma Negation Model
achieves only a modest improvement on L2. Moreover, this model does not benefit
at all from the technocratic approach; neither is the fixed fields filtering quite as
successful in this model (67% exact matches). A combination of the two weighting
schemes adds only a little to the two variants of the rheumatology model. It is interesting to note that the negation variant performs the same as, or even much worse than,
the one without any negation handling.
PAPER IV: OPTIMIZING THE DIMENSIONALITY OF CLINICAL TERM SPACES FOR
IMPROVED DIAGNOSIS CODING SUPPORT
In this paper, a number of data set variants are created: a UNIGRAMS data set and
three collocation data sets, created with different thresholds – a lower threshold
leads to the creation of more collocation segments. The purpose of creating these
data set variants is to study the effect of the dimensionality on the task of assigning
diagnosis codes, given certain properties of the data set (Table 4.10).
DATA SET    DOCUMENTS   TYPES       TOKENS / TYPE
UNIGRAMS    219k        371,778     51.54
COLL 100    219k        612,422     33.19
COLL 50     219k        699,913     28.83
COLL 0      219k        1,413,735   13.53

Table 4.10: DATA DESCRIPTION. The COLL (collocation) X sets were created with
different thresholds.
Increasing the dimensionality yields major improvements (Table 4.11), up to
18 percentage points. The biggest improvements are seen when increasing
the dimensionality from the 1000-dimensionality baseline. When increasing
the dimensionality beyond 2000-2500, the boosts in results begin to level off,
although further improvements are achieved with a higher dimensionality: the
best results are achieved with a dimensionality of 10,000. A larger improvement
is seen with all of the COLL models compared to the UNIGRAMS model, even
if the UNIGRAMS model outperforms all three COLL models; however, with a
higher dimensionality, the COLL models appear to close in on the UNIGRAMS
model.
DIM        UNIGRAMS   COLL 100   COLL 50   COLL 0
1000       0.25       0.18       0.19      0.15
1500       0.31       0.26       0.25      0.20
2000       0.34       0.29       0.28      0.24
2500       0.35       0.31       0.30      0.26
3000       0.36       0.33       0.31      0.26
3500       0.37       0.33       0.32      0.28
4000       0.37       0.34       0.34      0.29
4500       0.37       0.33       0.33      0.30
5000       0.38       0.34       0.33      0.30
7500       0.39       0.34       0.33      0.32
10000      0.39       0.36       0.35      0.32
+/- (pp)   +14        +18        +16       +17

Table 4.11: AUTOMATIC DIAGNOSIS CODING RESULTS. Results, measured as recall
top 10 for exact matches, with clinical term spaces constructed from differently
preprocessed data sets (unigrams and three collocation variants) and with different
dimensionalities (DIM).
4.3 IDENTIFICATION OF ADVERSE DRUG REACTIONS
The task of identifying adverse drug reactions for a given medication was only
preliminarily explored in one paper. The idea was to investigate whether useful
semantic relations between a drug and symptoms/disorders – in particular the ADR
relation – could be identified in a large corpus of clinical text.
PAPER V: EXPLORATION OF ADVERSE DRUG REACTIONS IN SEMANTIC
VECTOR SPACE MODELS OF CLINICAL TEXT
In this paper, two data sets are created: one in which patients from all age groups
are included and another where records for patients over fifty years are discarded.
Semantic spaces are induced from these data sets with different sliding window
sizes (8 and 24 surrounding words). These are then used to generate lists of
twenty distributionally similar terms for twenty common medications. The drug-symptom/disorder pairs are manually evaluated by a physician.
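The core operation, inducing a sliding-window word space and listing the terms most distributionally similar to a given query term, can be sketched roughly as follows. Plain co-occurrence counts stand in for the Random Indexing models actually used in the paper, and all tokens and the tiny window are invented for illustration:

```python
import numpy as np

def cooccurrence_vectors(tokens, vocab, window=8):
    """Sliding-window co-occurrence counts: each term is represented by
    how often every vocabulary word occurs within `window` words of it
    (plain counts here; the paper induces the spaces with Random Indexing)."""
    idx = {t: i for i, t in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for i, tok in enumerate(tokens):
        if tok not in idx:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in idx:
                M[idx[tok], idx[tokens[j]]] += 1
    return M, idx

def most_similar(term, M, idx, n=20):
    """The n distributionally most similar terms, by cosine similarity."""
    norms = np.linalg.norm(M, axis=1) + 1e-12
    sims = (M @ M[idx[term]]) / (norms * norms[idx[term]])
    inv = {i: t for t, i in idx.items()}
    return [inv[i] for i in np.argsort(-sims) if inv[i] != term][:n]

# Invented toy corpus: drugA and drugB occur in similar (nausea) contexts.
tokens = "drugA nausea drugA nausea drugB rash drugB rash".split()
M, idx = cooccurrence_vectors(tokens, set(tokens), window=1)
most_similar("drugA", M, idx, n=1)  # → ['drugB']
```

On real clinical narratives, the nearest neighbours of a drug name include symptoms and disorders mentioned in similar contexts, which is what the manual categorization in Table 4.12 evaluates.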
CHAPTER 4.
Approximately half of the model-generated suggestions are disorders/symptoms
that are conceivable side-effects; around 10% are known and documented. A fair
share of the terms are indications for prescribing a certain drug (∼10-15%), while
differential diagnoses (i.e. alternative indications) also appear with some regularity
(∼5-6%). Interestingly, terms that explicitly indicate mention of possible adverse
drug events, such as side-effect, show up fairly often. However, over 20% of the
terms proposed by all of the models are obviously incorrect; in some cases they
are not symptoms/disorders, and sometimes they are simply not conceivable side-effects.
Using the model induced from the larger data set – comprising all age groups –
slightly more potential side-effects are identified, whereas the proportion of known
side-effects remains constant across models. The number of incorrect suggestions
increases when using the smaller data set. The difference in the scope of the context, however, does not seem to have a significant impact on these results. Similarly, when analyzing the results for the two groups of medications separately and
vis-à-vis the two data sets, no major differences are found.
TYPE                 ALL             ≤ 50
                 SW 8   SW 24    SW 8   SW 24
KNOWN S-E          11      11      10      11
POTENTIAL S-E      37      39      35      35
INDICATION         14      14      10      10
ALT. INDICATION     6       6       5       6
S-E TERM           10       9       7       7
D/S, NOT S-E       15      14      23      21
NOT D/S             7       7      10      10
SUM %             100     100     100     100

Table 4.12: TERM CATEGORIZATION RESULTS. Results for models built on the two data sets (all age groups vs. only ≤ 50 years) with a sliding window context of 8 (SW 8) and 24 (SW 24) words respectively. S-E = side-effect; D/S = disorder/symptom.
CHAPTER 5
DISCUSSION
In this chapter, the focus of this thesis, namely the application of distributional
semantics to the clinical domain, will be revisited. The three use cases, or
application areas, that have been explored will then be reviewed individually.
The research questions that were formulated in Chapter 1 will also be answered
individually and discussed, including comparisons with previous, related research.
Finally, the main contributions of the thesis, along with future directions, will be
indicated.
5.1
SEMANTIC SPACES OF CLINICAL TEXT
Semantic spaces of clinical text have many practical applications, some of which
have been demonstrated in this thesis. It is thus clear that distributional semantics
has a key role to play in clinical language processing, despite claims that state otherwise (Cohen and Widdows, 2009). That is not to say that there are no challenges
involved in applying distributional semantics to this very particular – and peculiar
– genre of text. One such issue is the size of the vocabulary, which tends to grow
very large in the clinical domain, partly as a result of frequent misspellings and
ad-hoc abbreviations. In one study, it was found that the pharmaceutical product
noradrenaline had approximately sixty different spellings in a set of intensive care
unit nursing narratives written in Swedish (Allvin et al., 2011). This is problematic for many reasons. First of all, since these lexical variants are semantically
identical, we ideally want to have one single representation of the meaning of this
concept. This type of phenomenon also results in data sparsity and a weaker statistical foundation for the meaning of terms; if the spelling variants had first been
normalized to a single canonical form, this issue would have been mitigated. Moreover,
the prevalence of ad-hoc abbreviations and misspellings increases the likelihood
of semantically overloaded types and results in (coincidental) homonymy – a single form that has multiple, wholly uncorrelated meanings. Ideally, one would like
to have sense-specific vector representations of meaning. The large vocabulary
size also affects the appropriate dimensionality of a semantic space. In order for
the crucial near-orthogonality property of Random Indexing to be fulfilled, the
dimensionality has to be sufficiently large in relation to the number of contexts.
Generating a large number of index vectors in a relatively low-dimensional space
will likely result in cases of identical or largely overlapping index vectors, which,
in turn, means that each context will not have a unique representation, (approximately) uncorrelated to other contexts. In Paper IV, the importance of setting a
sufficiently large dimensionality was demonstrated. This is true in general, but
particularly so in the clinical domain and other noisy domains in which large vocabularies are the norm.
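The dependence of near-orthogonality on dimensionality can be illustrated with a small simulation. The sketch below is not the implementation used in the papers; it merely generates sparse ternary index vectors of the kind Random Indexing employs (the number of non-zero elements and context counts are illustrative) and measures how correlated random pairs of index vectors are:

```python
import numpy as np

def index_vector(dim, num_nonzero=8, rng=None):
    """Sparse ternary index vector as used in Random Indexing:
    a handful of randomly placed +1s and -1s, zeros elsewhere."""
    rng = rng if rng is not None else np.random.default_rng()
    v = np.zeros(dim)
    positions = rng.choice(dim, size=num_nonzero, replace=False)
    v[positions] = rng.choice([-1.0, 1.0], size=num_nonzero)
    return v

def mean_abs_cosine(dim, num_contexts=500, num_pairs=500, seed=0):
    """Average |cosine| over random pairs of index vectors; values close
    to zero mean the near-orthogonality assumption holds for this many
    contexts at this dimensionality."""
    rng = np.random.default_rng(seed)
    vectors = [index_vector(dim, rng=rng) for _ in range(num_contexts)]
    total = 0.0
    for _ in range(num_pairs):
        a, b = rng.choice(num_contexts, size=2, replace=False)
        va, vb = vectors[a], vectors[b]
        total += abs(va @ vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
    return total / num_pairs

# Interference between index vectors shrinks as the dimensionality grows.
assert mean_abs_cosine(10000) < mean_abs_cosine(100)
```

With many contexts and too few dimensions, overlapping index vectors blur the distinction between contexts, which is precisely the effect observed in Paper IV.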
Another important issue that concerns the useful application of distributional semantics to the clinical domain is the ability to model the meaning of multiword
terms – the overwhelming majority of clinical, as well as biomedical, concepts
are instantiated by multiword terms – and to compare the distributional profiles
between terms of varying (word) length. In other words, a means of handling
paraphrasing in a distributional semantic framework is needed. This is an important research area, which has gained increasing attention and typically concerns the
modeling of semantic composition in a distributional semantic framework. Here,
this was approached in a fairly straightforward manner: identify multiword terms
using a statistical measure (C-value), concatenate and treat them as distinct semantic units, with their own distributional profiles, different from those of their
constituents. This approach is simple and intuitive. An important caveat is that
this approach requires large amounts of data in order to avoid data sparsity issues:
multiword terms are almost invariably less frequent than their constituents.
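The preprocessing step described above, concatenating identified multiword terms so that they receive their own distributional profiles, can be sketched as follows. The term list and the greedy longest-match strategy are illustrative assumptions, not a description of the exact pipeline used in the papers:

```python
def concatenate_multiword_terms(tokens, multiword_terms, max_len=3):
    """Greedily replace known multiword terms (longest match first) with
    single concatenated tokens, so that each term acquires its own
    distributional profile, distinct from those of its constituents."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 1, -1):
            candidate = tuple(tokens[i:i + n])
            if len(candidate) == n and candidate in multiword_terms:
                out.append("_".join(candidate))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

# Invented term list and sentence.
terms = {("myocardial", "infarction"), ("congestive", "heart", "failure")}
tokens = "patient with congestive heart failure and myocardial infarction".split()
concatenate_multiword_terms(tokens, terms)
# → ['patient', 'with', 'congestive_heart_failure', 'and', 'myocardial_infarction']
```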
Considering the prevalence of negation in the clinical domain, it also seems important to be able to handle this phenomenon in a distributional semantic framework. In general, the meaning of a term is not affected by the fact that it is sometimes negated; however, given a specific mention of a term, knowing if it has been
negated or not can make all the difference. To our knowledge, the issue of incorporating negation in a distributional semantic framework has not been explored to
a great extent. An exception is the attempt by Widdows (2003) to handle negation in the context of information retrieval by making query vectors orthogonal to
negated terms. Here, this was – again – approached in a simple manner: use an
existing negation detection system and treat negated entities as distinct semantic
units with their own distributional profiles, different from that of their affirmed
counterparts. This approach is shown to lead to enhanced performance on the task
of automatically assigning diagnosis codes. Similarly to the aforementioned multiword solution, this approach can lead to data sparsity issues since the statistical
evidence for the meaning of terms is now divided between affirmed and negated
uses.
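The negation handling described above can be sketched as a token-rewriting step. In the papers, an existing negation detection system (Swedish NegEx) determines which entities are negated; the trigger list and fixed-window scope below are toy stand-ins for that component:

```python
NEGATION_TRIGGERS = {"no", "not", "without", "denies"}  # toy stand-in for NegEx

def mark_negated(tokens, scope=3):
    """Treat negated entities as distinct semantic units by rewriting the
    tokens in a negation scope as NEG_<token>, giving them distributional
    profiles separate from their affirmed counterparts. The fixed-window
    scope is a simplification of a real negation detector's output."""
    out, negate_left = [], 0
    for tok in tokens:
        if tok.lower() in NEGATION_TRIGGERS:
            out.append(tok)
            negate_left = scope
        elif negate_left > 0:
            out.append("NEG_" + tok)
            negate_left -= 1
        else:
            out.append(tok)
    return out

mark_negated("patient denies chest pain".split())
# → ['patient', 'denies', 'NEG_chest', 'NEG_pain']
```

The downstream semantic space then sees `chest pain` and `NEG_chest NEG_pain` as different units, at the cost of splitting the statistical evidence between them.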
Given the oftentimes difficulty of obtaining large amounts of clinical data, it would
be valuable if other data sources could be used in a complementary fashion. This
was one of the ideas investigated in Paper II, where (small-scale) ensembles of
semantic spaces were induced from different types of corpora, including a non-clinical corpus. Even if one has access to sufficiently large amounts of clinical
data, however, this notion is nevertheless interesting, as it allows different models
of distributional semantics to be combined. This may potentially allow more aspects of semantics to be captured, which may not be possible to achieve in a single
model.
SYNONYM EXTRACTION OF MEDICAL TERMS
Using distributional semantics to extract synonyms from large corpora is the most
obvious application and certainly does not constitute a novelty in itself. In the
domain of clinical text, however, this has not been explored to a large degree.
Traditionally, models of distributional semantics have focused on single words;
however, many clinical terms – and other biomedical terms for that matter – are
multiword terms. In fact, in SNOMED CT, fewer than ten percent of the terms are
unigram terms (Paper I). For that reason, in Paper I, we attempted to incorporate
the notion of paraphrasing in a distributional semantic framework, allowing synonymy between terms of varying length to be identified. In that paper, the method
was evaluated on the English version of SNOMED CT, which already contains
synonyms, with the motivation that it can be used to facilitate ongoing translation
efforts into other languages, for which there are no synonyms. Swedish is one such
example and, in fact, the method has also been evaluated on the Swedish version
of SNOMED CT (Henriksson et al., 2013). The study yielded promising results.
A problem with applying distributional semantics to extract synonyms is that terms
that have other semantic relations also have similar distributional profiles and will
therefore be extracted too. It is as yet not possible to isolate synonyms from these
other semantic relations, including antonymy. An attempt to obtain better performance on this task was, however, made in Paper II, where the notion of ensembles
of semantic spaces was explored. In this paper, we demonstrated that by combining
models of distributional semantics – in this case, Random Indexing and Random
Permutation – with different model parameter configurations and induced from
different data sources – in this case, a clinical corpus and a non-clinical, medical corpus comprising medical journal articles – enhanced performance on the synonym extraction task could be obtained. We also applied this methodology to the similar
yet slightly different problem of identifying abbreviation-expansion pairs. Since
abbreviation-expansion pairs are distributionally similar, distributional semantics
seems to be an appropriate solution: the results indicate that this is the case.
RESEARCH QUESTIONS
− To what extent can distributional semantics be employed to extract candidate synonyms of medical terms from a large corpus of clinical text?
A number of relevant performance estimates are presented in this thesis. In
Paper I, results are reported separately for unigram terms and multiword
terms (up to a length of three tokens): 0.24 recall (top 20) for unigrams
and 0.16 recall (top 20) for multiword terms. These results are not directly
comparable since they are evaluated on different reference standards (the
one for unigram terms comprises only unigram-unigram synonyms, whereas
the one for multiword terms comprises unigrams, bigrams and trigrams, and
synonyms can be of varying word length). The experiments are conducted
with a corpus of English clinical text and a reference standard that is based
on the English version of SNOMED CT. In Paper II, only unigram terms are
considered: with a single Swedish clinical corpus, a recall (top 10) of 0.44 is
obtained; with a clinical corpus and a non-clinical, medical corpus, a recall
(top 10) of 0.47 is obtained. In these experiments, the reference standard is
based on the Swedish version of MeSH. This means that the results reported
in Paper II cannot be compared directly with the results reported for unigram
terms in Paper I.
The results provide an early indication of the extent to which distributional
semantics can be employed to extract candidate synonyms of medical terms
from clinical text. The performance of this approach to this task can reasonably be improved and these results are in no way definitive; the results are arguably promising enough to merit further exploration and must be improved
to have undisputed value in providing effective terminology development
support. It is, however, important to note that the employed reference standards do not really contain the ground truth: manual inspection in the error
analyses showed that there are many candidate synonyms that are reasonable, yet not present in the reference standards. Many candidate synonyms
are misspellings, which are frequent in clinical text and thus important to account for in order not to suffer a drop in recall when performing information
extraction.
Synonym extraction is a fairly obvious application of distributional semantics – despite the open research question of how to distinguish between different types of semantic relations – and this approach has clear advantages
over lexical pattern-based methods, such as the one proposed by Conway
and Chapman (2012), since it does not require synonymous terms to share
orthographic features. Moreover, the fact that this approach was investigated on two different corpora written in two different languages (Swedish
and English) is a strength of the method, which is data-driven and language
agnostic. There have been few previous attempts to extract synonyms from
clinical text – most synonym extraction tasks in the medical domain have
targeted biomedical text – and this makes it hard to compare the results reported in this thesis to other approaches. In future research, it is important
to compare a wide range of methods for synonym extraction from clinical
text.
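The recall (top k) figures reported throughout these evaluations can be computed along the following lines. The function and the tiny reference standard are illustrative, not the evaluation code used in the papers:

```python
def recall_top_k(candidates_per_term, reference, k=10):
    """Recall (top k): the fraction of reference synonyms found among the
    k highest-ranked candidates, averaged over all query terms."""
    scores = []
    for term, gold in reference.items():
        top_k = set(candidates_per_term.get(term, [])[:k])
        scores.append(len(top_k & gold) / len(gold))
    return sum(scores) / len(scores)

# Invented reference standard (term -> gold synonyms) and ranked candidates.
reference = {"pain": {"ache"}, "fever": {"pyrexia", "temperature"}}
candidates = {"pain": ["ache", "soreness"], "fever": ["chills", "pyrexia"]}
recall_top_k(candidates, reference, k=10)  # → 0.75
```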
− How can synonymous terms of varying (word) length be captured in a
distributional semantic framework?
There are of course many alternative means of capturing synonymous terms
of varying word length in a distributional semantic framework. One general
approach to account for this is here proposed and investigated: to identify
multiword terms statistically from the corpus, concatenate them and treat
as distinct semantic units. These will then have their own distributional
profiles, potentially very different from those of their constituents. A key
component of this solution is the ability to recognize multiword terms in the
corpus and two approaches are investigated: (1) using an existing collocation identification tool and (2) calculating the C-value of n-grams (where n
is 1, 2 or 3). In the context of synonym extraction, however, only the latter
method is employed.
In Paper I, a recall (top 20) of 0.16 is obtained on the task of identifying
synonymy between SNOMED clinical terms of varying length in a corpus
of English clinical text. This result demonstrates that it is indeed possible
to account for the fact that synonymous relations can exist between terms
of varying length in a distributional semantic framework. A problem with
treating multiword terms as single tokens is that it may lead to data sparsity
issues: the vocabulary increases substantially, resulting in a lower tokens-per-type ratio. In Paper I, the results are not as good as those obtained for
unigram terms (even if these are essentially distinct tasks and cannot readily
be compared); however, the mere fact that the multiword term spaces are at
all able to capture synonymy that involves a multiword term is an advantage
over the unigram term spaces.
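The C-value termhood measure used to recognize multiword candidates can be sketched as below. This follows the standard formulation of Frantzi and Ananiadou, with a log2(|a|+1) length weight so that unigram candidates also receive a non-zero score; the exact variant and the frequencies shown are illustrative assumptions:

```python
import math

def c_value(term, freq, longer_terms):
    """C-value of a candidate term. `freq` maps a candidate (a tuple of
    words) to its corpus frequency; `longer_terms` lists the candidates
    that contain `term` nested inside them. Nested occurrences reduce
    the score, so that fragments of longer terms are demoted."""
    weight = math.log2(len(term) + 1)
    if not longer_terms:
        return weight * freq[term]
    nested = sum(freq[t] for t in longer_terms) / len(longer_terms)
    return weight * (freq[term] - nested)

# Invented candidate frequencies.
freq = {("heart",): 50, ("heart", "failure"): 20,
        ("congestive", "heart", "failure"): 12}
c_value(("heart", "failure"), freq, [("congestive", "heart", "failure")])
# → log2(3) * (20 - 12) ≈ 12.68
```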
Since there have not been any similar studies on clinical text or, indeed, a
general solution to this limitation of distributional methods, it is not possible
to compare these results with any other. At the very least, this method constitutes a useful baseline, which more sophisticated methods – dealing more
elegantly with semantic composition, for instance – can be compared to.
− Can different models of distributional semantics and types of corpora
be combined to obtain enhanced performance on the synonym extraction task, compared to using a single model and a single corpus?
Yes, in Paper II, this was shown to yield enhanced performance on the synonym extraction task. Combining different models of distributional semantics – in this case, Random Indexing and Random Permutation – outperformed using a single model in isolation. With the clinical corpus, the model
combination improved from a recall (top 10) of 0.39 with Random Indexing
to 0.44; with the non-clinical, medical corpus, the performance improved
from 0.28 with Random Permutation to 0.34. Furthermore, combining semantic spaces induced from different corpora – in this case, a clinical corpus
and a non-clinical, medical corpus – further improved results, outperforming
a combination of semantic spaces induced from a single source. This multi-model, multi-corpora ensemble of four semantic spaces obtained a recall
(top 10) of 0.47, in comparison to 0.33 and 0.34 with the two single-corpus
combinations. It is worth pointing out that the latter experiment was conducted with a reference standard in which included synonym pairs had to
occur in both corpora a minimum number of times. Perhaps a better test of
the multiple-corpora combination would be to base the reference standard
on a single corpus (e.g., the clinical, so that included synonym pairs occur a
minimum number of times in that corpus) and then to verify if the additional
use of semantic spaces induced from another corpus would improve results.
The multiple-corpora combination should also be compared with the simpler
approach of using a single semantic space induced from multiple corpora,
which would essentially be treated as a single, combined corpus.
In the general language domain, previous work has showed that combining
methods and/or sources can lead to improved performance on the synonym
extraction task (Curran, 2002; Wu and Zhou, 2003; van der Plas and Tiedemann, 2006; Peirsman and Geeraerts, 2009). Some ensemble approaches do
indeed include the use of distributional semantics; however, to our knowledge, this approach of combining semantic spaces induced with different
distributional models and from different corpora has not been explored before. In the clinical domain, as mentioned earlier, the synonym extraction
task has not been targeted to a great extent, which makes it difficult to compare these results with other approaches. In this case, however, the baseline
was using a single model of distributional semantics and a single corpus; in
comparison to this, the small-scale ensembles of semantic spaces perform
better, yielding a fairly clear answer to the research question.
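One simple way to realize such an ensemble of semantic spaces is to merge the ranked candidate lists produced by the individual spaces. The rank-sum scheme below is only one of several conceivable combination strategies and is not necessarily the one used in Paper II; the candidate terms are invented:

```python
def combine_spaces(rankings, k=10):
    """Merge ranked candidate lists from several semantic spaces by
    summing a rank-based score (the top-ranked term in a list of length
    n contributes n, the next n-1, and so on)."""
    scores = {}
    for ranking in rankings:
        for rank, term in enumerate(ranking):
            scores[term] = scores.get(term, 0) + (len(ranking) - rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Invented top-3 candidates for one query term, from three spaces
# (RI and RP on a clinical corpus, RI on a medical corpus).
ri_clinical = ["värk", "smärta", "ont"]
rp_clinical = ["smärta", "ont", "obehag"]
ri_medical = ["smärta", "värk", "dolor"]
combine_spaces([ri_clinical, rp_clinical, ri_medical], k=3)
# → ['smärta', 'värk', 'ont']
```

Candidates supported by several spaces rise to the top, which is the intuition behind the improved recall of the ensembles.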
− To what extent can abbreviation-definition pairs be extracted with distributional semantics?
In Paper II, this is separated into two distinct tasks: one in which an abbreviation is provided as input and the task is to extract the corresponding
definition or expanded long form (abbr→exp), and one where the definition
or expanded long form is provided as input and the task is to extract the corresponding abbreviation (exp→abbr). The best results for these two tasks
are: 0.28 weighted precision and 0.39 recall (top 10) for abbr→exp; 0.31
precision and 0.33 recall (top 10) for exp→abbr.
As with the synonym extraction task, these results are in no way definitive
and it is reasonable to assume that performance can be improved on these
tasks with distributional methods. The results do, however, provide a rough
indication of the extent to which current models of distributional semantics
can be employed to extract medical abbreviation-definition pairs from clinical text. The results are sufficiently promising to merit further exploration.
In comparison to other approaches, most of which have been developed for
biomedical text, this approach does not require that abbreviations are accompanied by a parenthetical definition – a much easier task. Approaches
that rely on such an assumption – which is questionable also for biomedical
text, as Yu et al. (2002) found that 75% of abbreviations are never defined –
are unlikely to be applicable to clinical text, as abbreviations are generally
not defined in the text due to time constraints. Relying instead on distributional similarity is an interesting notion that, to our knowledge, has not
been explored in the biomedical domain either. In contrast to the synonym
extraction task, this task allows orthographic features to be readily exploited
in post-processing of candidate terms, greatly improving both recall (top 10)
and weighted precision.
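A minimal version of such orthographic post-processing is to require that the abbreviation's characters occur, in order, in the candidate long form (a subsequence check). This single cue is a simplification of the feature set actually used:

```python
def is_plausible_expansion(abbr, candidate):
    """Orthographic cue: every character of the abbreviation must occur
    in the candidate long form, in the same order."""
    chars = iter(candidate.lower())
    return all(ch in chars for ch in abbr.lower())

def filter_candidates(abbr, candidates):
    """Keep only distributionally similar candidates that also pass the
    orthographic filter."""
    return [c for c in candidates if is_plausible_expansion(abbr, c)]

filter_candidates("ekg", ["elektrokardiogram", "ultraljud", "ekokardiografi"])
# → ['elektrokardiogram', 'ekokardiografi']
```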
ASSIGNMENT OF DIAGNOSIS CODES
The idea of leveraging distributional semantics to assign diagnosis codes to free-text clinical notes is less obvious than to extract synonyms and constitutes a novel
approach. Here, diagnosis codes are treated as terms in the notes to which they
were assigned. The approach was shown to hold some promise when first evaluated on a smaller scale (Henriksson et al., 2011) and later confirmed in a large-scale quantitative evaluation (Henriksson and Hassel, 2011). In Paper III, the focus
was on developing this method further by: (1) exploiting terminological resources,
(2) exploiting structured data, and (3) incorporating negation handling. In Paper
IV, the effect of the dimensionality on performance is investigated. Here, the main
focus is on the (appropriate) dimensionality of semantic spaces in domains, like the
clinical, where the vocabulary tends to grow large, and the assignment of diagnosis codes is primarily the task used to enable extrinsic evaluation; a parallel, albeit
secondary, focus is on obtaining enhanced performance on this task by properly
configuring the dimensionality.
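The basic idea, treating diagnosis codes as ordinary terms in a document-level co-occurrence space and ranking candidate codes by their similarity to a to-be-coded document, can be sketched as follows. Plain counts stand in for Random Indexing, and the tokens and codes are invented for illustration:

```python
import numpy as np

def build_term_space(documents, vocab):
    """Document-level co-occurrence space: one vector per term, one
    dimension per document. Diagnosis codes are simply added to the
    documents as ordinary terms (plain counts here; the papers use
    Random Indexing)."""
    idx = {t: i for i, t in enumerate(vocab)}
    M = np.zeros((len(vocab), len(documents)))
    for d, doc in enumerate(documents):
        for tok in doc:
            if tok in idx:
                M[idx[tok], d] += 1
    return M, idx

def rank_codes(doc_tokens, M, idx, codes):
    """Rank candidate codes by cosine similarity between the document
    centroid (mean of its term vectors) and each code's vector."""
    centroid = np.mean([M[idx[t]] for t in doc_tokens if t in idx], axis=0)
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return sorted(codes, key=lambda c: -cos(centroid, M[idx[c]]))

# Invented coded notes: tokens plus their assigned ICD-10 codes.
docs = [["kol", "dyspne", "J44"], ["fraktur", "smärta", "S52"],
        ["kol", "hosta", "J44"]]
M, idx = build_term_space(docs, {t for d in docs for t in d})
rank_codes(["kol", "dyspne"], M, idx, ["J44", "S52"])  # → ['J44', 'S52']
```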
As mentioned earlier, it is difficult to compare the results obtained by the many
approaches to automatic diagnosis coding that have been reported in the literature,
mainly due to the use of different classification systems (e.g., ICD-9 or ICD-10)
and performance metrics (e.g., precision, recall, recall top j or F1 -score). Even
more problematic in terms of evaluation is the fact that the problem is often framed
in very different ways, affecting the scope and difficulty of the task. In the 2007
Computational Medicine Challenge (Pestian et al., 2007), for instance, the task
was limited to the assignment of a set of 45 ICD-9 codes. Most of the results
reported in this thesis are based on all ICD-10 codes that have been assigned in
the data (over 12,000) – a far more daunting and difficult task. In Paper III,
results are also reported for domain-specific models (i.e., specific to a particular
type of clinic). Although the results are higher for these models – and limiting the
classification problem in this way is perhaps the way to go – they are also more
unreliable due to more limited amounts of evaluation data. The conclusions drawn
in that study – and the answers to the related research questions – are therefore
mainly based on the results obtained for the general models (trained and evaluated
on all types of clinics), for which there is a substantial amount of evaluation data.
The following research questions are, however, not directly about the usefulness of
this distributional semantics approach to diagnosis coding, even if they all involve
efforts to obtain better performance on the task.
RESEARCH QUESTIONS
− Can the distributional semantics approach to automatic diagnosis coding be improved by also exploiting structured data?
This research question can be answered in the affirmative without hedging
or caveats: all the relevant results reported in Paper III demonstrate that
performance improves by exploiting structured patient data. For the general
model, there is an increase in performance from 0.23 recall (top 10) to 0.26;
for the two domain-specific models, the performance increases from 0.33
to 0.43 and from 0.61 to 0.82. Using known information about a to-be-coded document was thus shown to be effective in weighting the likelihood
of candidate diagnosis codes. A combination of structured and unstructured
data appears to be the way to go for the (semi-)automatic assignment of
diagnosis codes.
Although most previous approaches rely on the clinical text, it is not a novelty in itself to exploit structured data to improve performance. Pakhomov
et al. (2006), for instance, use gender information in combination with frequency data to filter out unlikely classifications, with the motivation that
gender has a high predictive value, particularly as some diagnosis codes are
gender-specific. Here, only three structured data fields were used: age, gender and clinic. It is reasonable to believe that there is scope for exploiting
other structured EHR data, too, such as various measurement values and
laboratory test results. This will, however, likely involve having to deal with
missing values – not all patients will have taken the same measurements or
laboratory tests – and the curse of dimensionality: the number of times a
laboratory test is taken can vary substantially from patient to patient.
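A minimal sketch of how a structured field such as the clinic can be used to weight the likelihood of candidate codes: estimate per-clinic code frequencies from structured EHR data and blend them with the semantic-space score. The blending scheme, field names, and data are illustrative assumptions, not the method of Paper III:

```python
from collections import Counter, defaultdict

def build_priors(records):
    """Per-clinic relative frequencies of assigned codes, estimated from
    structured EHR data (gender and age could be handled analogously)."""
    by_clinic = defaultdict(Counter)
    for rec in records:
        by_clinic[rec["clinic"]][rec["code"]] += 1
    return {clinic: {c: n / sum(counts.values()) for c, n in counts.items()}
            for clinic, counts in by_clinic.items()}

def reweight(candidates, clinic, priors, alpha=0.5):
    """Blend the semantic-space similarity score with the clinic prior."""
    p = priors.get(clinic, {})
    rescored = {c: (1 - alpha) * s + alpha * p.get(c, 0.0)
                for c, s in candidates.items()}
    return sorted(rescored, key=rescored.get, reverse=True)

# Invented coded records.
records = [{"clinic": "cardiology", "code": "I21"},
           {"clinic": "cardiology", "code": "I21"},
           {"clinic": "cardiology", "code": "S52"}]
priors = build_priors(records)
reweight({"I21": 0.4, "S52": 0.5}, "cardiology", priors)  # → ['I21', 'S52']
```

The structured prior overturns the raw similarity ranking here, which mirrors how known patient information was used to weight candidate codes.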
− Can external terminological resources be leveraged to obtain enhanced
performance on the diagnosis coding task?
This research question cannot be answered quite as clearly as the previous
one based on the results obtained in Paper III. The approach of giving additional weight to clinically significant terms – terms that appear in a medical
vocabulary, here in the form of SNOMED CT findings and disorders – did
yield slightly better results; in combination with negation handling and/or
structured data, however, performance did not improve, and in some cases –
for the domain-specific models – performance actually decreased.
Despite these results, it seems likely that terminological resources can be of
use when assigning diagnosis codes since medical terms have been found to
have a high predictive value for the classification of clinical notes (Jarman
and Berndt, 2010). In fact, Larkey and Croft (1996) showed that performance improved when additional weight was assigned to words, phrases
and structures that provided the most diagnostic evidence. Perhaps the simple weighting approach adopted in Paper III is not the best way to go,
however. Alternative approaches that leverage external terminological resources should be explored in future work.
− How can negation be handled in a distributional semantic framework?
There are naturally many alternative means of accounting for negation in a
distributional semantic framework. Here, the approach is similar to that of
accounting for paraphrasing: treat negated entities as distinct semantic units,
with different distributional profiles from those of their affirmed counterparts. An existing negation detection system (Swedish NegEx) is used. It is,
however, not sufficient to attempt to handle negation; it must also be shown
to be useful in practice. The proposed means of incorporating negation in a
distributional semantic framework is here evaluated on the task of assigning
diagnosis codes. In this context – and with a bag-of-words document representation – it is important since negated terms would otherwise mistakenly
be positively correlated with assigned diagnosis codes. In this approach,
where negated terms are retained but with their proper negated denotation,
negated terms are correctly correlated with diagnosis codes.
In previous approaches to diagnosis coding – not based on distributional
semantics – negation handling has been shown to be important (Pestian et
al., 2007). In Paper III, performance on this task with the general model
improves with negation handling, from 0.23 to 0.25 recall (top 10). With the
smaller, domain-specific models, performance increases by one percentage
point in one case and remains the same in the other. As mentioned earlier,
this approach relies on sufficiently large amounts of data, which is perhaps
an explanation for the reduced impact on the domain-specific models.
− How does dimensionality affect the application of distributional
semantics to the assignment of diagnosis codes, particularly as the size
of the vocabulary grows large?
There is a correlation between the appropriate dimensionality of a semantic space and properties of the data set from which it is induced, both in
theory and in practice, as was demonstrated in Paper IV. In domains where
the vocabulary grows large and where there is a large number of contexts
– as in the clinical domain – it is particularly important to configure the
dimensionality properly. In the distributional semantics approach to diagnosis coding, a document-level context definition is employed; with a large
number of documents, it is crucial that the dimensionality is sufficiently
large to allow near-orthogonal index vectors to be generated for all contexts.
Moreover, with a large number of concepts – which a large vocabulary may
be a crude indicator of, although not necessarily so due to the prevalence
of misspellings and ad-hoc abbreviations in the clinical domain – the dimensionality needs to be sufficiently large to allow concepts to be readily
distinguishable in the semantic space.
The effect of the dimensionality was evaluated on the task of assigning diagnosis codes and the difference in performance, measured as recall (top
10), was as large as eighteen percentage points when modeling multiword
terms: from 0.18 with 1,000 dimensions to 0.36 with 10,000 dimensions.
With unigram terms, the difference in performance was fourteen percentage
points. These results indicate that it may be particularly important to configure the dimensionality when modeling multiword terms as single tokens.
In contrast to the study conducted by Sahlgren and Cöster (2004), where
the performance – albeit on a different task – did not change greatly with a
higher dimensionality, leading the authors to conclude that configuring the
dimensionality for RI spaces is not critical (at least not in comparison to
SVD), we found that careful dimensionality configuration may yield significant boosts in performance. Major differences in the employed data sets –
in terms of the number of documents and the size of the vocabulary – could
explain why different conclusions were drawn.
IDENTIFICATION OF ADVERSE DRUG REACTIONS
Using distributional semantics and a large clinical corpus to identify adverse drug
reactions for a given drug was only preliminarily explored. In Paper V, it was, however, demonstrated that this method can be used to extract distributionally similar
drug-symptom pairs, some of which are adverse drug reactions. It is, however,
difficult to isolate these instances from those where the symptom is an indication
for taking a certain drug, for instance. Combining distributional semantics with
other techniques could be profitable in the task of identifying potentially adverse
drug reactions from health record narratives.
RESEARCH QUESTIONS
− To what extent can distributional semantics be used to extract potential
adverse drug reactions to a given medication among a list of distributionally similar drug-symptom pairs?
In Paper V, almost 50% of the extracted terms were deemed conceivable –
in the liberal sense that they could not readily be ruled out as such – by a
physician. Around 10% of these were known and documented in FASS. The
extent to which the conceivable but undocumented ADRs were actual ADRs
cannot currently be gauged. The fact that some of the distributionally similar drug-symptom pairs constitute an ADR relation is, however, sufficiently
interesting to merit further exploration of ADR identification in a distributional semantic framework. It should be noted – as was expected – that
many other terms were also extracted, including indications and alternative
indications. As in the case of synonyms, it is difficult to isolate ADRs from
other semantic relations that hold between a drug and a symptom.
To our knowledge, distributional semantics has not been exploited in the
context of ADR identification. Even if distributional semantics is unlikely
to be sufficient in isolation to identify ADRs from clinical text, it seems a
promising enough notion to leverage in more sophisticated methods.
5.2
MAIN CONTRIBUTIONS
The single most important contribution has been to demonstrate the general utility
of distributional semantics to the clinical domain. This is important as it provides
an unsupervised method that, by definition, does not rely on annotated data. In
the clinical domain – in particular for a relatively under-resourced language like
Swedish – this is crucial. Creating annotated resources in this domain very often
requires medical experts. Needless to say, this is expensive. Distributional semantics can perform fairly well on a range of medical and clinical language processing
tasks; it can also be used in conjunction with other, possibly supervised, methods
to enhance performance.
Concerning the application of distributional semantics in the clinical domain, there
are four main contributions. First, we have addressed the important issue of being able to handle paraphrasing in a distributional semantic framework. By recognizing multiword terms with the C-value statistic and treating them as distinct
semantic units, synonymous terms of varying length can be identified. Second,
the notion of ensembles of semantic spaces has been introduced. This idea allows
different models of distributional semantics, with different model configurations
and induced from different data sources, to be combined to obtain enhanced performance. Here, this was demonstrated on the synonym extraction task; however,
this idea may be profitable on a range of natural language processing tasks. As
such, it is not only a contribution to the clinical domain, but also to the general
NLP community. Third, the ability to handle negation in a distributional semantic
framework was addressed: a straightforward solution whereby negated entities are
pre-identified and treated distinctly from their affirmed counterparts was proposed.
Fourth, the importance of configuring the dimensionality of semantic spaces, particularly when the vocabulary grows large, as it often does in the clinical domain,
has been demonstrated.
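The second contribution, ensembles of semantic spaces, amounts to combining ranked candidate lists produced by differently configured spaces. A minimal sketch of one plausible combination strategy is given below; reciprocal-rank summation is an assumption here for illustration, not necessarily the scheme used in the papers, and the candidate terms are invented.

```python
from collections import defaultdict

def combine_rankings(rankings, top_k=3):
    """Combine ranked candidate lists from several semantic spaces by
    summing reciprocal ranks: terms suggested by many spaces, and highly
    ranked within each, rise to the top of the combined list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, term in enumerate(ranking, start=1):
            scores[term] += 1.0 / rank
    # Sort by descending combined score, breaking ties alphabetically.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ordered][:top_k]

# Hypothetical synonym candidates from three spaces for the same query term.
rankings = [
    ["mi", "infarct", "angina"],
    ["infarct", "mi"],
    ["infarct", "mi", "stroke"],
]
print(combine_rankings(rankings))
```

The same machinery applies regardless of whether the spaces differ in model type, configuration, or underlying corpus.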
5.3 FUTURE DIRECTIONS
In the studies covered in this thesis, distributional semantics was applied directly
to perform a given task. For some tasks, this approach may be profitable; however, it should be noted that models of distributional semantics are not primarily
intended to be used as classifiers; they are rather a means of representing lexical semantics and quantifying semantic similarity between linguistic units. Perhaps a
more natural application of distributional semantics for tasks that are not directly
concerned with term-term relations is to exploit this semantic representation in an
existing machine learning-based classifier. As properties of words are often used
as features in machine learning-based approaches to NLP tasks, exploiting latent
features such as the semantics of a term may be even more profitable. This is a line
of research that has begun to be explored in the medical domain, for instance in the
context of named entity recognition and domain adaptation (Jonnalagadda et al.,
2012). For the task of assigning diagnosis codes, for instance, this appears to be a
reasonable approach: using (distributional) semantic features of the unstructured
part of the EHR, in conjunction with a range of structured EHR data.
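A sketch of what such feature construction could look like: structured EHR fields concatenated with a distributional representation of the free-text note. Representing the note as the mean of its word vectors is an assumed choice for illustration, not a method prescribed by the cited work.

```python
import numpy as np

def note_features(tokens, word_vectors, dim):
    """Represent a free-text note as the mean of its tokens' semantic
    vectors; out-of-vocabulary tokens are skipped."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def build_features(structured, tokens, word_vectors, dim):
    """Concatenate structured EHR fields (e.g. age, sex) with the
    distributional features of the note, yielding one feature vector
    that any standard classifier can consume."""
    return np.concatenate([np.asarray(structured, dtype=float),
                           note_features(tokens, word_vectors, dim)])

# Toy word vectors (in practice, induced from a clinical corpus).
wv = {"chest": np.array([1.0, 0.0]), "pain": np.array([0.0, 1.0])}
features = build_features([63, 1], ["chest", "pain", "unknown"], wv, dim=2)
print(features)
```

A diagnosis-code classifier would then be trained on such vectors rather than on sparse lexical features alone.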
Some key issues in distributional semantics were addressed here: (1) paraphrasing
and (2) negation. These are arguably especially important issues in the clinical
domain, where multiword terms and negations are prevalent, and need to be further explored. Modeling multiword terms as distinct semantic units ignores the
fact that they are composed of constituent words. Semantic composition is an important research area (Mitchell and Lapata, 2008; Baroni and Zamparelli, 2010;
Guevara, 2010; Grefenstette and Sadrzadeh, 2011); working on this problem in
the context of clinical text would be very interesting. The proposed approach to
handling negation suffers from a similar deficiency: the fact that affirmed and negated terms
are otherwise semantically identical is ignored, which also entails that the statistical evidence is split between the two polarities. Investigating more sophisticated
vector operations that can capture the function of negation would perhaps be more
profitable.
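The pre-identification strategy for negation can be illustrated with a toy scope marker that prefixes negated tokens so they accumulate their own co-occurrence statistics. The trigger and scope-breaker lists below are tiny illustrative stand-ins, not the NegEx resources actually used in the studies.

```python
# Illustrative stand-ins for real negation-detection resources.
NEG_TRIGGERS = {"no", "not", "without", "denies"}
SCOPE_BREAKERS = {",", ".", "but", "however"}

def mark_negations(tokens):
    """Prefix tokens inside a negation scope with 'NEG_', so that negated
    and affirmed occurrences of a term are treated as distinct semantic
    units when the space is induced."""
    out, negated = [], False
    for t in tokens:
        if t in SCOPE_BREAKERS:
            negated = False
            out.append(t)
        elif t in NEG_TRIGGERS:
            negated = True
            out.append(t)
        else:
            out.append("NEG_" + t if negated else t)
    return out

print(mark_negations(
    ["patient", "denies", "chest", "pain", ",", "reports", "nausea"]))
```

Running the space-induction step on the marked tokens then yields separate vectors for, e.g., "pain" and "NEG_pain".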
Finally, the notion of ensembles of semantic spaces seems promising and should
be explored further. Here, only four semantic spaces were combined; however,
in many ensemble methods, it is not uncommon for hundreds or even thousands
of models to be combined. There is definitely scope for exploiting the ensemble
methodology in a distributional semantic framework. This could prove profitable
on a range of general and clinical language processing tasks.
REFERENCES
Helen Allvin, Elin Carlsson, Hercules Dalianis, Riitta Danielsson-Ojala, Vidas
Daudaravičius, Martin Hassel, Dimitrios Kokkinakis, Heljä Lundgrén-Laine,
Gunnar H Nilsson, Øystein Nytrø, et al. 2011. Characteristics of Finnish and
Swedish intensive care nursing narratives: a comparative analysis to support the
development of clinical language technologies. Journal of Biomedical Semantics, 2(Suppl 3):S1.
Ethem Alpaydin. 2010. Introduction to Machine Learning. MIT Press, Cambridge, MA. ISBN 978-0-262-01243-0.
Mohammad Homam Alsun, Laurent Lecornu, Basel Solaiman, Clara Le Guillou, and Jean Michel Cauvin. 2010. Medical diagnosis by possibilistic classification reasoning. In Proceedings of 13th Conference on Information Fusion
(FUSION), pages 1–7. IEEE.
Ion Androutsopoulos and Prodromos Malakasiotis. 2010. A Survey of Paraphrasing and Textual Entailment Methods. Journal of Artificial Intelligence Research, 38:135–187.
Hiroko Ao and Toshihisa Takagi. 2005. ALICE: An Algorithm to Extract Abbreviations from MEDLINE. Journal of the American Medical Informatics Association, 12(5):576–586. ISSN 1067-5027.
Eiji Aramaki, Yasuhide Miura, Masatsugu Tonoike, Tomoko Ohkuma, Hiroshi
Masuichi, Kayo Waki, and Kazuhiko Ohe. 2010. Extraction of Adverse Drug
Effects from Clinical Records. Stud Health Technol Inform, 160(1):739–43.
Alan R Aronson, Olivier Bodenreider, Dina Demner-Fushman, Kin Wah Fung, Vivian K Lee, James G Mork, Aurélie Névéol, Lee Peters, and Willie J Rogers.
2007. From Indexing the Biomedical Literature to Coding Clinical Text: Experience with MTI and Machine Learning Approaches. In Proceedings of
BioNLP: Biological, translational, and clinical language processing, pages
105–112. Association for Computational Linguistics.
Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation, and
Use of the Ngram Statistic Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics
(CICLing), pages 370–381.
Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are
matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1183–1193. Association for Computational Linguistics.
David W Bates, R Scott Evans, Harvey Murff, Peter D Stetson, Lisa Pizziferri, and
George Hripcsak. 2003. Detecting Adverse Events Using Information Technology. Journal of the American Medical Informatics Association, 10(2):115–128.
Adrian Benton, Lyle Ungar, Shawndra Hill, Sean Hennessy, Jun Mao, Annie
Chung, Charles E Leonard, and John H Holmes. 2011. Identifying potential
adverse effects using the web: A new approach to medical hypothesis generation. Journal of Biomedical Informatics, 44(6):989–996.
David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.
Vincent D. Blondel, Anahí Gajardo, Maureen Heymans, Pierre Senellart, and
Paul Van Dooren. 2004. A measure of similarity between graph vertices: Applications to synonym extraction and web searching. SIAM Review, 46(4):647–666. ISSN 0036-1445. URL http://www.jstor.org/stable/20453570.
Henrik Boström, Sten F Andler, Marcus Brohede, Ronnie Johansson, Alexander
Karlsson, Joeri van Laere, Lars Niklasson, Marie Nilsson, Anne Persson, and
Tom Ziemke. 2007. On the Definition of Information Fusion as a Field of Research. Institutionen för kommunikation och information: IKI Technical Reports (HS-IKI-TR-07-006).
Staffan Cederblom. 2005. Medicinska förkortningar och akronymer (In Swedish).
Studentlitteratur, Lund. ISBN 91-44-03393-1.
Jeffrey T Chang, Hinrich Schütze, and Russ B Altman. 2002. Creating an online
dictionary of abbreviations from medline. Journal of the American Medical
Informatics Association, 9:612–620.
Wendy W Chapman, Will Bridewell, Paul Hanbury, Gregory F Cooper, and
Bruce G Buchanan. 2001. Evaluation of negation phrases in narrative clinical
reports. In Proceedings of the AMIA Symposium, page 105. American Medical
Informatics Association.
Emmanuel Chazard. 2011. Automated detection of adverse drug events by data
mining of electronic health records. PhD thesis, Université du Droit et de la
Santé-Lille II.
Ping Chen, Araly Barrera, and Chris Rhodes. 2010. Semantic analysis of free
text and its application on automatically assigning ICD-9-CM codes to patient
records. In Cognitive Informatics (ICCI), 2010 9th IEEE International Conference on, pages 68–74. IEEE.
Christopher G Chute and Yiming Yang. 1992. An evaluation of concept based
latent semantic indexing for clinical information retrieval. In Proceedings of
the Annual Symposium on Computer Application in Medical Care, page 639.
American Medical Informatics Association.
Christopher G Chute, Yiming Yang, and DA Evans. 1991. Latent semantic indexing of medical diagnoses using UMLS semantic structures. In Proceedings of
the Annual Symposium on Computer Application in Medical Care, page 185.
American Medical Informatics Association.
Aaron M Cohen and William R Hersh. 2005. A survey of current work in biomedical text mining. Briefings in Bioinformatics, 6(1):57–71.
Aaron M Cohen, William R Hersh, C Dubay, and K Spackman. 2005. Using co-occurrence network structure to extract synonymous gene and protein names
from medline abstracts. BMC Bioinformatics, 6(1):103. ISSN 1471-2105.
URL http://www.biomedcentral.com/1471-2105/6/103.
Trevor Cohen, Brett Blatter, and Vimla Patel. 2008. Simulating expert clinical
comprehension: Adapting latent semantic analysis to accurately extract clinical
concepts from psychiatric narrative. Journal of Biomedical Informatics, 41(6):
1070–1087.
Trevor Cohen, Roger Schvaneveldt, and Dominic Widdows. 2010. Reflective random indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2):240–256.
Trevor Cohen and Dominic Widdows. 2009. Empirical distributional semantics:
Methods and biomedical applications. Journal of Biomedical Informatics, 42
(2):390–405. ISSN 1532-0464. URL http://www.sciencedirect.com/science/article/pii/S1532046409000227.
Mike Conway and Wendy Chapman. 2012. Discovering lexical instantiations of
clinical concepts using web services, wordnet and corpus resources. In AMIA
Annual Symposium Proceedings, page 1604.
Koby Crammer, Mark Dredze, Kuzman Ganchev, Partha Pratim Talukdar, and
Steven Carroll. 2007. Automatic code assignment to medical text. In Proceedings of BioNLP: Biological, translational, and clinical language processing,
pages 129–136. Association for Computational Linguistics.
D Alan Cruse. 1986. Lexical semantics. Cambridge University Press.
James R Curran. 2002. Ensemble methods for automatic thesaurus extraction. In
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 222–229. Association for Computational Linguistics.
Hercules Dalianis, Martin Hassel, Aron Henriksson, and Maria Skeppstedt. 2012.
Stockholm EPR Corpus: A Clinical Database Used to Improve Health Care. In
Proceedings of the Swedish Language Technology Conference (SLTC).
Hercules Dalianis, Martin Hassel, and Sumithra Velupillai. 2009. The Stockholm
EPR Corpus – Characteristics and Some Initial Findings. In Proceedings of the
14th International Symposium on Health Information Management Research
(ISHIMR), pages 1–7.
Dana Dannélls. 2006. Automatic acronym recognition. In Proceedings of the 11th
conference on European chapter of the Association for Computational Linguistics (EACL).
Vidas Daudaravicius. 2010. The influence of collocation segmentation and top 10
items to keyword assignment performance. In Computational Linguistics and
Intelligent Text Processing, pages 648–660. Springer.
Luciano RS de Lima, Alberto HF Laender, and Berthier A Ribeiro-Neto. 1998. A
hierarchical approach to the automatic categorization of medical documents.
In Proceedings of the Seventh International Conference on Information and
Knowledge Management, pages 132–139. ACM.
Scott C Deerwester, Susan T Dumais, Thomas K Landauer, George W Furnas, and
Richard A Harshman. 1990. Indexing by latent semantic analysis. Journal of
the American Society for Information Science, 41(6):391–407.
Thomas G Dietterich. 2000. Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, pages
1–15. Springer-Verlag.
Robert Eriksson, Peter B Jensen, Sune Frankild, Lars J Jensen, and Søren Brunak.
2013. Dictionary construction and identification of possible adverse drug events
in Danish clinical narrative text. Journal of the American Medical Informatics Association. URL http://jamia.bmj.com/content/early/2013/05/22/amiajnl-2013-001708.abstract.
Jung-Wei Fan and Carol Friedman. 2007. Semantic classification of biomedical
concepts using distributional similarity. Journal of the American Medical Informatics Association, 14(4):467–477.
Richárd Farkas and György Szarvas. 2008. Automatic construction of rule-based
ICD-9-CM coding systems. BMC Bioinformatics, 9(Suppl 3):S10.
Jose C Ferrao, Monica D Oliveira, Filipe Janela, and Henrique MG Martins.
2012. Clinical coding support based on structured data stored in electronic
health records. In Proceedings of Bioinformatics and Biomedicine Workshops
(BIBMW), pages 790–797. IEEE.
John R Firth. 1957. A synopsis of linguistic theory, 1930-1955. In F. Palmer, editor, Selected Papers of J. R. Firth 1952–59, pages 168–205. Longman, London,
UK.
Katerina Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In
Proceedings of the Conference on Computational Linguistics (COLING), pages
41–46.
Katerina Frantzi, Sophia Ananiadou, and Hideki Mima. 2000. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal
on Digital Libraries, 3(2):115–130.
Carol Friedman and George Hripcsak. 1999. Evaluating natural language processors in the clinical domain. Methods Inf. Med., 37:334–344.
Carol Friedman, Pauline Kra, and Andrey Rzhetsky. 2002. Two biomedical sublanguages: a description based on the theories of Zellig Harris. Journal of
Biomedical Informatics, 35(4):222–235.
Madhavi K Ganapathiraju, Judith Klein-Seetharaman, N Balakrishnan, and Raj
Reddy. 2004. Characterization of protein secondary structure. Signal Processing Magazine, IEEE, 21(3):78–87.
Sylvain Gaudan, Harald Kirsch, and Dietrich Rebholz-Schuhmann. 2005. Resolving abbreviations to their senses in medline. Bioinformatics, 21(18):3658–3664.
ISSN 1367-4803.
Ira Goldstein, Anna Arzumtsyan, and Özlem Uzuner. 2007. Three approaches to
automatic assignment of ICD-9-CM codes to radiology reports. In AMIA Annual
Symposium Proceedings, volume 2007, page 279. American Medical Informatics Association.
Michael D Gordon and Susan Dumais. 1998. Using latent semantic indexing for
literature based discovery. Journal of the American Society for Information
Science, 49:674–685.
Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011. Experimental support
for a categorical compositional distributional model of meaning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing
(EMNLP), pages 1394–1404. Association for Computational Linguistics.
Thomas L Griffiths and Mark Steyvers. 2002. A probabilistic approach to semantic
representation. In Proceedings of the 24th annual conference of the cognitive
science society, pages 381–386. Citeseer.
Emiliano Guevara. 2010. A regression model of adjective-noun compositionality in distributional semantics. In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, pages 33–37. Association for
Computational Linguistics.
Zellig S Harris. 1954. Distributional structure. Word, 10:146–162.
Martin Hassel. 2004. JavaSDM package. http://www.nada.kth.se/~xmartin/java/. KTH School of Computer Science and Communication, Stockholm, Sweden. Accessed: September 30, 2013.
Martin Hassel, Aron Henriksson, and Sumithra Velupillai. 2011. Something Old,
Something New: Applying a Pre-trained Parsing Model to Clinical Swedish. In
18th Nordic Conference of Computational Linguistics NODALIDA 2011. Northern European Association for Language Technology (NEALT).
Marti Hearst. 1992. Automatic acquisition of hyponyms from large text corpora.
In Proceedings of COLING 1992, pages 539–545.
Aron Henriksson and Martin Hassel. 2011. Election of Diagnosis Codes: Words
as Responsible Citizens. In Proceedings of the 3rd International Louhi Workshop on Health Document Text Mining and Information Analysis, pages 67–74.
CEUR-WS. org.
Aron Henriksson, Martin Hassel, and Maria Kvist. 2011. Diagnosis Code Assignment Support Using Random Indexing of Patient Records – A Qualitative Feasibility Study. In Artificial Intelligence in Medicine, pages 348–352. Springer.
Aron Henriksson, Hans Moen, Maria Skeppstedt, Ann-Marie Eklund, Vidas Daudaravičius, and Martin Hassel. 2012. Synonym Extraction of Medical Terms
from Clinical Text Using Combinations of Word Space Models. In Proceedings of the 5th International Symposium on Semantic Mining in Biomedicine
(SMBM).
Aron Henriksson, Maria Skeppstedt, Maria Kvist, Martin Duneld, and Mike Conway. 2013. Corpus-Driven Terminology Development: Populating Swedish
SNOMED CT with Synonyms Extracted from Electronic Health Records. In
Proceedings of BioNLP. Association for Computational Linguistics.
Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings
of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57. ACM.
Ramin Homayouni, Kevin Heinrich, Lai Wei, and Michael W Berry. 2005. Gene
clustering by latent semantic indexing of medline abstracts. Bioinformatics, 21
(1):104–115.
IHTSDO. 2013a. International Health Terminology Standards Development Organisation: SNOMED CT. URL http://www.ihtsdo.org/snomed-ct/. Accessed: September 30, 2013.
IHTSDO. 2013b. International Health Terminology Standards Development Organisation: Supporting Different Languages. URL http://www.ihtsdo.org/snomed-ct/snomed-ct0/different-languages/. Accessed: September 30, 2013.
Niklas Isenius, Sumithra Velupillai, and Maria Kvist. 2012. Initial Results in
the Development of SCAN: a Swedish Clinical Abbreviation Normalizer. In
Proceedings of the CLEF 2012 Workshop on Cross-Language Evaluation of
Methods, Applications, and Resources for eHealth Document Analysis (CLEFeHealth2012). CLEF.
Jay Jarman and Donald J Berndt. 2010. Throw the Bath Water Out, Keep the
Baby: Keeping Medically-Relevant Terms for Text Mining. In AMIA Annual
Symposium Proceedings, volume 2010, page 336. American Medical Informatics Association.
Peter B Jensen, Lars J Jensen, and Søren Brunak. 2012. Mining electronic health
records: towards better research applications and clinical care. Nature Reviews
Genetics, 13(6):395–405.
William B Johnson and Joram Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206.
Michael N Jones and Douglas JK Mewhort. 2007. Representing word meaning
and order information in a composite holographic lexicon. Psychological review, 114(1):1.
Siddhartha Jonnalagadda, Trevor Cohen, Stephen Wu, and Graciela Gonzalez.
2012. Enhancing clinical concept extraction with distributional semantics.
Journal of Biomedical Informatics, 45(1):129–140.
Dan Jurafsky, James H Martin, Andrew Kehler, Keith Vander Linden, and Nigel
Ward. 2000. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Prentice Hall.
Pentti Kanerva, Jan Kristofersson, and Anders Holst. 2000. Random indexing
of text samples for latent semantic analysis. In Proceedings of 22nd Annual
Conference of the Cognitive Science Society, page 1036.
Jussi Karlgren, Anders Holst, and Magnus Sahlgren. 2008. Filaments of meaning
in word space. In Advances in Information Retrieval, pages 531–538. Springer.
Jussi Karlgren and Magnus Sahlgren. 2001. From words to understanding. Foundations of Real-World Intelligence, pages 294–308.
Karolinska Institutet Universitetsbiblioteket. 2013. Karolinska Institutet: About
the Karolinska Institute MeSH resource (In Swedish). URL http://mesh.kib.ki.se/swemesh/about_en.cfm. Accessed: September 30, 2013.
Samuel Kaski. 1998. Dimensionality reduction by random mapping: Fast similarity computation for clustering. In Neural Networks Proceedings, 1998. IEEE
World Congress on Computational Intelligence. The 1998 IEEE International
Joint Conference on, volume 1, pages 413–418. IEEE.
Alla Keselman, Laura Slaughter, Catherine Arnott-Smith, Hyeoneui Kim, Guy
Divita, Allen Browne, Christopher Tsai, and Qing Zeng-Treitler. 2007. Towards
Consumer-Friendly PHRs: Patients’ Experience with Reviewing Their Health
Records. In Proc. AMIA Annual Symposium, pages 399–403.
Ola Knutsson, Johnny Bigert, and Viggo Kann. 2003. A Robust Shallow Parser
for Swedish. In Proceedings of the Nordic Conference of Computational Linguistics (NODALIDA).
Dimitrios Kokkinakis. 2012. The Journal of the Swedish Medical Association – a Corpus Resource for Biomedical Text Mining in Swedish. In Proceedings of
the Third Workshop on Building and Evaluating Resources for Biomedical Text
Mining (BioTxtM).
Bevan Koopman, Guido Zuccon, Peter Bruza, Laurianne Sitbon, and Michael
Lawley. 2012. An evaluation of corpus-driven measures of medical concept
similarity for information retrieval. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 2439–
2442. ACM.
Läkemedelsindustriföreningens Service AB. 2013. FASS.se (In Swedish). http://www.fass.se. Accessed: September 30, 2013.
Thomas K Landauer and Susan T Dumais. 1997. A solution to Plato’s problem:
The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104(2):211.
D Lang. 2007. Consultant report – natural language processing in the health care
industry. Cincinnati Children’s Hospital Medical Center.
Leah S Larkey and W Bruce Croft. 1996. Combining classifiers in text categorization. In Proceedings of the 19th annual international ACM SIGIR conference
on Research and development in information retrieval, pages 289–297. ACM.
Laurent Lecornu, Clara Le Guillou, Frédéric Le Saux, Matthieu Hubert, John
Puentes, Julien Montagner, and Jean Michel Cauvin. 2011. Information fusion
for diagnosis coding support. In Engineering in Medicine and Biology Society,
EMBC, 2011 Annual International Conference of the IEEE, pages 3176–3179.
IEEE.
Laurent Lecornu, G Thillay, Clara Le Guillou, PJ Garreau, P Saliou, H Jantzem,
John Puentes, and Jean Michel Cauvin. 2009. Referocod: a probabilistic
method to medical coding support. In Engineering in Medicine and Biology Society, 2009. EMBC 2009. Annual International Conference of the IEEE, pages
3421–3424. IEEE.
Paea LePendu, Srinivasan V Iyer, Anna Bauer-Mehren, Rave Harpaz, Jonathan M
Mortensen, Tanya Podchiyska, Todd A Ferris, and Nigam H Shah. 2013. Pharmacovigilance using clinical notes. Clinical Pharmacology & Therapeutics.
Paea LePendu, Srinivasan V Iyer, Cédrick Fairon, Nigam H Shah, et al. 2012.
Annotation analysis for testing drug safety signals using unstructured clinical
notes. J Biomed Semantics, 3(Suppl 1):S5.
Gondy Leroy and Hsinchun Chen. 2001. Meeting medical terminology needs-the
ontology-enhanced medical concept mapper. IEEE Transactions on Information Technology in Biomedicine, 5(4):261–270.
Gondy Leroy, James E Endicott, Obay Mouradi, David Kauchak, and Melissa L
Just. 2012. Improving perceived and actual text difficulty for health information
consumers using semi-automated methods. AMIA Annu Symp Proc, 2012:522–
31.
Dekang Lin and Patrick Pantel. 2001. DIRT – discovery of inference rules
from text. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 323–328. ACM.
Hongfang Liu, Alan R Aronson, and Carol Friedman. 2002. A study of abbreviations in medline abstracts. Proc AMIA Symp, pages 464–8.
Hongfang Liu, Yves A Lussier, and Carol Friedman. 2001. Disambiguating ambiguous biomedical terms in biomedical narrative text: an unsupervised method.
Journal of Biomedical Informatics, 34(4):249–261.
Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces
from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2):203–208.
Yuji Matsumoto. 2003. Lexical Knowledge Acquisition. In Ruslan Mitkov, editor,
The Oxford Handbook of Computational Linguistics. Oxford University Press,
New York.
John McCrae and Nigel Collier. 2008. Synonym set extraction from the biomedical literature by lexical pattern discovery. BMC Bioinformatics, 9:159.
Stéphane M Meystre, Guergana K Savova, Karin C Kipper-Schuler, and John F
Hurdle. 2008. Extracting information from textual documents in the electronic
health record: a review of recent research. Yearb Med Inform, 35:128–44.
Jeff Mitchell and Mirella Lapata. 2008. Vector-based Models of Semantic Composition. In Proceedings of the Association for Computational Linguistics (ACL), pages 236–244.
Ruslan Mitkov. 2003. The Oxford Handbook of Computational Linguistics. Oxford University Press, New York.
Dana Movshovitz-Attias and William W Cohen. 2012. Alignment-HMM-based
Extraction of Abbreviations from Biomedical Text. In Proceedings of the 2012
Workshop on Biomedical Natural Language Processing (BioNLP 2012), pages
47–55, Montréal, Canada, June 2012. Association for Computational Linguistics.
Harvey J Murff, Vimla L Patel, George Hripcsak, and David W Bates. 2003. Detecting adverse events for patient safety research: a review of current methodologies. Journal of biomedical informatics, 36(1):131–143.
Michael D Myers. 1997. Qualitative research in information systems. Management Information Systems Quarterly, 21:241–242.
Michael D Myers. 2009. Qualitative Research in Business & Management. SAGE
Publications Ltd, London, U.K. ISBN 9781412921664.
Kotaro Nakayama, Takahiro Hara, and Shojiro Nishio. 2007. Wikipedia mining
for an association web thesaurus construction. In Web Information Systems
Engineering–WISE 2007, pages 322–334. Springer.
Kimberly J O’Malley, Karon F Cook, Matt D Price, Kimberly Raiford Wildes,
John F Hurdle, and Carol M Ashton. 2005. Measuring diagnoses: ICD code
accuracy. Health services research, 40(5p2):1620–1639.
Sebastian Padó and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199.
Serguei VS Pakhomov, James D Buntrock, and Christopher G Chute. 2006.
Automating the assignment of diagnosis codes to patient encounters using
example-based and machine learning techniques. Journal of the American Medical Informatics Association, 13(5):516–525.
Christos H Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, and Santosh Vempala. 1998. Latent semantic indexing: A probabilistic analysis. In Proceedings
of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles
of database systems, pages 159–168. ACM.
Ted Pedersen, Serguei VS Pakhomov, Siddharth Patwardhan, and Christopher G
Chute. 2007. Measures of semantic similarity and relatedness in the biomedical
domain. Journal of Biomedical Informatics, 40(3):288–299.
Yves Peirsman and Dirk Geeraerts. 2009. Predicting strong associations on the
basis of corpus data. In Proceedings of the 12th Conference of the European
Chapter of the Association for Computational Linguistics, EACL ’09, pages
648–656, Stroudsburg, PA, USA. Association for Computational Linguistics.
URL http://dl.acm.org/citation.cfm?id=1609067.1609139.
John P Pestian, Christopher Brew, Paweł Matykiewicz, DJ Hovermale, Neil Johnson, K Bretonnel Cohen, and Włodzisław Duch. 2007. A shared task involving
multi-label classification of clinical free text. In Proceedings of the Workshop
on BioNLP 2007: Biological, Translational, and Clinical Language Processing,
pages 97–104. Association for Computational Linguistics.
John Puentes, Julien Montagner, Laurent Lecornu, and Jean-Michel Cauvin. 2013.
Information quality measurement of medical encoding support based on usability. Computer Methods and Programs in Biomedicine, in print. ISSN
0169-2607. URL http://www.sciencedirect.com/science/article/pii/S0169260713002538.
Philip Resnik and Jimmy Lin. 2013. Evaluation of NLP Systems. In Alexander
Clark, Chris Fox, and Shalom Lappin, editors, The Handbook of Computational
Linguistics and Natural Language Processing. Wiley-Blackwell, West Sussex.
John I Saeed. 1997. Semantics. Blackwell Publishers, Oxford. ISBN 0-631-20034-7.
Mohammed Saeed, Mauricio Villarroel, Andrew T Reisner, Gari Clifford, Li-Wei
Lehman, George Moody, Thomas Heldt, Tin H Kyaw, Benjamin Moody, and
Roger G Mark. 2011. Multiparameter intelligent monitoring in intensive care ii
(mimic-ii): A public-access intensive care unit database. Critical care medicine,
39(5):952.
Magnus Sahlgren. 2005. An introduction to random indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference
on Terminology and Knowledge Engineering, TKE, volume 5.
Magnus Sahlgren. 2006. The Word-Space Model: Using distributional analysis
to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD thesis, Stockholm University.
Magnus Sahlgren. 2008. The distributional hypothesis. Italian Journal of Linguistics, 20(1):33–54.
Magnus Sahlgren and Rickard Cöster. 2004. Using bag-of-concepts to improve the
performance of support vector machines in text categorization. In Proceedings
of the 20th international conference on Computational Linguistics, page 487.
Association for Computational Linguistics.
Magnus Sahlgren, Anders Holst, and Pentti Kanerva. 2008. Permutations as a
means to encode order in word space. In Proceedings of the 30th Annual Meeting of the Cognitive Science Society, pages 1300–1305.
Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model
for automatic indexing. Communications of the ACM, 18(11):613–620.
Steven L Salzberg. 1997. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and knowledge discovery, 1(3):317–328.
Hinrich Schütze. 1993. Word space. In Advances in Neural Information Processing Systems 5.
Ariel S Schwartz and Marti A Hearst. 2003. A simple algorithm for identifying abbreviation definitions in biomedical text. Pacific Symposium on Biocomputing.
Pacific Symposium on Biocomputing, pages 451–462. ISSN 1793-5091.
Maria Skeppstedt. 2011. Negation detection in Swedish clinical text: An adaption
of NegEx to Swedish. Journal of Biomedical Semantics, 2(Suppl 3):S3.
Maria Skeppstedt, Hercules Dalianis, and Gunnar H Nilsson. 2011. Retrieving
disorders and findings: Results using SNOMED CT and NegEx adapted for
Swedish. In Proceedings of the 3rd International Louhi Workshop on Health
Document Text Mining and Information Analysis.
Socialstyrelsen. 2006. Diagnosgranskningar utförda i Sverige 1997–2005 samt råd
inför granskning (In Swedish). URL http://www.socialstyrelsen.se/Lists/Artikelkatalog/Attachments/9740/2006-131-30_200613131.pdf. Accessed: September 30, 2013.
Socialstyrelsen. 2010. Internationell statistisk klassifikation av sjukdomar och
hälsoproblem – Systematisk förteckning, Svensk version 2011 (ICD-10-SE)
(In Swedish), volume 2. Edita Västra Aros.
Socialstyrelsen. 2010. Kodningskvalitet i patientregistret, Slutenvård 2008 (In Swedish). URL http://www.socialstyrelsen.se/Lists/Artikelkatalog/Attachments/18082/2010-6-27.pdf. Accessed: September 30, 2013.
Socialstyrelsen. 2013. Diagnoskoder (ICD-10) (In Swedish). URL http://www.socialstyrelsen.se/klassificeringochkoder/diagnoskoder. Accessed: September 30, 2013.
Mary H Stanfill, Margaret Williams, Susan H Fenton, Robert A Jenders, and
William R Hersh. 2010. A systematic literature review of automated clinical
coding and classification systems. Journal of the American Medical Informatics Association, 17(6):646–651.
Gary W Stuart and Michael W Berry. 2003. A comprehensive whole genome bacterial phylogeny using correlated peptide motifs defined in a high dimensional vector space. Journal of Bioinformatics and Computational Biology, 1(3):475–493.
Hanna Suominen, Filip Ginter, Sampo Pyysalo, Antti Airola, Tapio Pahikkala, Sanna Salanterä, and Tapio Salakoski. 2008. Machine learning to automate the assignment of diagnosis codes to free-text radiology reports: a method description. In Proceedings of the ICML/UAI/COLT Workshop on Machine Learning for Health-Care Applications.
Sveriges riksdag. 2013a. Lag (2003:460) om etikprövning av forskning som avser människor (In Swedish). URL http://www.riksdagen.se/sv/Dokument-Lagar/Ovriga-dokument/Ovrigt-dokument/_sfs-2003-460/. Accessed: September 30, 2013.
Sveriges riksdag. 2013b. Patientdatalag (2008:355) (In Swedish). URL http://www.riksdagen.se/sv/Dokument-Lagar/Lagar/Svenskforfattningssamling/Patientdatalag-2008355_sfs-2008-355/. Accessed: September 30, 2013.
Sveriges riksdag. 2013c. Personuppgiftslag (1998:204) (In Swedish). URL http://www.riksdagen.se/sv/Dokument-Lagar/Lagar/Svenskforfattningssamling/Personuppgiftslag-1998204_sfs-1998-204/. Accessed: September 30, 2013.
Peter D Turney. 2007. Empirical evaluation of four tensor decomposition algorithms. Technical Report ERB-1152, Institute for Information Technology, National Research Council of Canada.
Peter D Turney and Michael L Littman. 2005. Corpus-based learning of analogies
and semantic relations. Machine Learning, 60(1-3):251–278.
Peter D Turney and Patrick Pantel. 2010. From Frequency to Meaning: Vector
Space Models of Semantics. Journal of Artificial Intelligence Research, 37(1):
141–188.
U.S. National Library of Medicine. 2013. MeSH (Medical Subject Headings).
http://www.ncbi.nlm.nih.gov/mesh.
Özlem Uzuner, Brett R South, Shuying Shen, and Scott L DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.
Lonneke van der Plas and Jörg Tiedemann. 2006. Finding synonyms using automatic word alignment and measures of distributional similarity. In Proceedings
of the COLING/ACL on Main Conference Poster Sessions, COLING-ACL ’06,
pages 866–873, Stroudsburg, PA, USA. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1273073.1273184.
Cornelis Joost van Rijsbergen. 2004. The geometry of information retrieval. Cambridge University Press.
Ellen M Voorhees. 2002. The philosophy of information retrieval evaluation. In
Evaluation of cross-language information retrieval systems, pages 355–370.
Springer.
Xiaoyan Wang, Herbert Chase, Marianthi Markatou, George Hripcsak, and Carol
Friedman. 2010. Selecting information in electronic health records for knowledge acquisition. Journal of Biomedical Informatics, 43(4):595–601.
Pernille Warrer, Ebba Holme Hansen, Lars J Jensen, and Lise Aagaard. 2012. Using text-mining techniques in electronic patient records to identify ADRs from
medicine use. British Journal of Clinical Pharmacology, 73(5):674–684.
Karin Wester, Anna K Jönsson, Olav Spigset, Henrik Druid, and Staffan Hägg. 2008. Incidence of fatal adverse drug reactions: a population based study. British Journal of Clinical Pharmacology, 65(4):573–579.
Dominic Widdows. 2003. Orthogonal negation in vector spaces for modelling
word-meanings and document retrieval. In Proceedings of the 41st Annual
Meeting on Association for Computational Linguistics-Volume 1, pages 136–
143. Association for Computational Linguistics.
Ludwig Wittgenstein. 1958. Philosophical investigations. Blackwell Oxford.
David H Wolpert. 1995. The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework. In The Mathematics of Generalization.
World Health Organization. 2013. International Classification of Diseases. URL
http://www.who.int/classifications/icd/en/. Accessed: September
30, 2013.
Hua Wu and Ming Zhou. 2003. Optimizing synonym extraction using monolingual and bilingual resources. In Proceedings of the Second International Workshop on Paraphrasing - Volume 16, PARAPHRASE ’03, pages 72–79, Stroudsburg, PA, USA. Association for Computational Linguistics. URL http://dx.doi.org/10.3115/1118984.1118994.
Hong Yu, George Hripcsak, and Carol Friedman. 2002. Mapping abbreviations to full forms in biomedical articles. Journal of the American Medical Informatics Association, 9(3):262–272. ISSN 1067-5027.
Hong Yu and Eugene Agichtein. 2003. Extracting synonymous gene and protein
terms from biological literature. Bioinformatics, 1(19):340–349.
George Yule. 2010. The Study of Language. Cambridge University Press, New
York, USA. ISBN 978-0-521-74922-0.
Qing T Zeng, Doug Redd, Thomas Rindflesch, and Jonathan Nebeker. 2012. Synonym, topic model and predicate-based query expansion for retrieving clinical
documents. In AMIA Annual Symposium Proceedings, pages 1050–1059.
Ziqi Zhang, José Iria, Christopher Brewster, and Fabio Ciravegna. 2008. A comparative evaluation of term recognition algorithms. In Proceedings of Language
Resources and Evaluation (LREC).
INDEX
abbreviation, 24, 25, 44, 78
acronym, 25
adverse drug event (ADE), 30, 31
adverse drug reaction (ADR), 30, 31,
56, 82
adverse event (AE), 30, 31
bag of words, 17, 19
baseline, 36, 58
biomedical text, 2, 10
c-value, 46, 50, 76, 84
clinical language processing, 9, 10, 72,
83, 85
clinical narrative, 2, 21, 25, 28, 31
clinical text, 10, 21, 25, 31
collocation, 46, 55
confidence interval, 39
context definition, 12, 14–19, 24
context vector, 14, 16–19, 21, 48
corpus, 40
CoSegment, 44, 47
cosine similarity, 48
Cranfield paradigm, 34, 35
cross-validation, 39
curse of dimensionality, 15, 80
data sparsity, 15, 46, 72, 73
development set, 39
diagnosis coding, 26, 79–81
Dice score, 47
dimensionality, 15–19, 55, 72, 81, 84
dimensionality reduction, 15–17
distributional hypothesis, 12
distributional semantics, 9, 11–14, 17,
19–21, 23, 24, 40, 71–75, 77–
79, 81–85
ensemble learning, 23
ensembles of semantic spaces, 52, 73,
78, 84
error analysis, 40
evaluation set, 39
F1-score, 37
F-measure, 37
false negative, 37
false positive, 37
FASS, 42, 44, 83
Firth, John, 13
generalization, 39
gold standard, 35
Granska Tagger, 44, 46
Harris, Zellig, 12, 13
hyperspace analogue to language (HAL), 13
hypothesis testing, 34
ICD, 21, 26–29, 42, 43
index vector, 16–19, 48
inductive bias, 39
JavaSDM, 44, 48
Johnson-Lindenstrauss lemma, 17
Läkartidningen, 42
latent Dirichlet allocation (LDA), 14, 24
latent semantic analysis (LSA), 13, 15,
16, 20, 21
latent semantic indexing (LSI), 15, 20
lemmatization, 19, 45
machine learning, 2, 10, 11, 21, 23, 25,
29, 84
MeSH, 42, 43, 75
MIMIC-II, 41
multi-class, multi-label problem, 37
multiword term, 72–76, 82, 85
n-gram, 46
near-orthogonality, 16–19, 72, 82
negation, 30, 47, 73, 81, 84, 85
NegEx, 44, 47, 81
no free lunch theorem, 40
overfitting, 39
parameter tuning, 39
paraphrasing, 22, 73, 84
pharmacovigilance, 30, 31
positive predictive value (PPV), 37
precision, 37, 38
probabilistic LSA (pLSA), 14
probabilistic models, 13, 14
probabilistic topic model, 14
random indexing (RI), 13, 16–20, 48
random mapping, 17
random permutation (RP), 18, 20
random projection, 17
recall, 37, 38
recall top j, 38
reference standard, 35, 36, 42, 75
reflective random indexing, 20
reliability, 39
repeatability, 58
rule-based method, 2, 10, 23–25, 28, 31
Saussure, Ferdinand de, 12
semantic composition, 73, 77, 85
semantics, 2, 11, 13, 21
sensitivity, 37
significance testing, 34, 39, 58
similarity metric, 48
singular value decomposition (SVD),
15, 20, 82
sliding window, 14–19
SNOMED CT, 21, 42, 43, 50, 51, 57,
74–76, 80
spatial models, 13–15
stemming, 45
Stockholm EPR Corpus, 41, 57
stop word, 19, 45
structuralism, 12
synonymy, 9, 10, 12, 18, 20–25, 30, 74,
75, 77
system evaluation, 34, 35, 39
term normalization, 45
TextNSP, 44, 46
tokenization, 45
training set, 39
true negative, 37
true positive, 37, 38
underfitting, 39
user-based evaluation, 35
weighted precision, 38
Wittgenstein, Ludwig, 13
wordspace, 13
PAPERS
PAPER I
IDENTIFYING SYNONYMY BETWEEN SNOMED CLINICAL TERMS OF VARYING LENGTH USING DISTRIBUTIONAL ANALYSIS OF ELECTRONIC HEALTH RECORDS
Aron Henriksson, Mike Conway, Martin Duneld and Wendy W. Chapman (2013).
Identifying Synonymy between SNOMED Clinical Terms of Varying Length
Using Distributional Analysis of Electronic Health Records. To appear in
Proceedings of the Annual Symposium of the American Medical Informatics
Association (AMIA), Washington DC, USA.
Author Contributions Aron Henriksson was responsible for coordinating the
study, its overall design and for carrying out the experiments. The proposed
method was designed and implemented by Aron Henriksson and Mike Conway,
with input from Martin Duneld and Wendy Chapman. The manuscript was written
by Aron Henriksson and Mike Conway. All authors were involved in reviewing
and improving the manuscript.
Identifying Synonymy between SNOMED Clinical Terms of Varying Length
Using Distributional Analysis of Electronic Health Records
Aron Henriksson, MS (1), Mike Conway, PhD (2), Martin Duneld, PhD (1), Wendy W. Chapman, PhD (3)
(1) Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden
(2) Division of Behavioral Medicine, Department of Family & Preventive Medicine, University of California, San Diego, USA
(3) Division of Biomedical Informatics, Department of Medicine, University of California, San Diego, USA
Abstract
Medical terminologies and ontologies are important tools for natural language processing of health record narratives.
To account for the variability of language use, synonyms need to be stored in a semantic resource as textual instantiations of a concept. Developing such resources manually is, however, prohibitively expensive and likely to result
in low coverage. To facilitate and expedite the process of lexical resource development, distributional analysis of
large corpora provides a powerful data-driven means of (semi-)automatically identifying semantic relations, including synonymy, between terms. In this paper, we demonstrate how distributional analysis of a large corpus of electronic
health records – the MIMIC-II database – can be employed to extract synonyms of SNOMED CT preferred terms. A
distinctive feature of our method is its ability to identify synonymous relations between terms of varying length.
Introduction and Motivation
Terminological and ontological standards are an important and integral part of workflow and standards in clinical care.
In particular, SNOMED CT 1 has become the de facto standard for the representation of clinical concepts in Electronic
Health Records. However, SNOMED CT is currently only available in British and American English, Spanish, Danish, and Swedish (with translations into French and Lithuanian in process) 2 . In order to accelerate the adoption of
SNOMED CT (and by extension, Electronic Health Records) internationally, it is clear that the development of new
methods and tools to expedite the language porting process is of vital importance.
This paper presents and evaluates a semi-automatic – and language agnostic – method for the extraction of synonyms of
SNOMED CT preferred terms using distributional similarity techniques in conjunction with a large corpus of clinical
text (the MIMIC-II database 3 ). Key applications of the technique include:
1. Expediting SNOMED CT language porting efforts using semi-automatic identification of synonyms for preferred terms
2. Augmenting the current English versions of SNOMED CT with additional synonyms
In comparison to current rule-based synonym extraction techniques, our proposed method has two major advantages:
1. As the method uses statistical techniques (i.e. distributional similarity methods), it is agnostic with respect to
language. That is, in order to identify new synonyms, all that is required is a clinical corpus of sufficient size in
the target language
2. Unlike most approaches that use distributional similarity, our proposed method addresses the problem of identifying synonymy between terms of varying length – a key limitation in traditional distributional similarity
approaches
In this paper, we begin by presenting some relevant background literature (including describing related work in distributional similarity and synonym extraction); then we describe the materials and methods used in this research (in
particular corpus resources and software tools), before going on to set out the results of our analysis. We conclude the
paper with a discussion of our results and a short conclusion.
Background
In this section, we will first describe the structure of SNOMED CT, before going on to discuss relevant work related
to synonym extraction. Finally, we will present some opportunities and challenges associated with using distributional
similarity methods for synonym extraction.
SNOMED CT
In recent years, SNOMED CT has become the de facto terminological standard for representing clinical concepts in
Electronic Health Records 4 . SNOMED CT’s scope includes clinical findings, procedures, body structures, and social
contexts linked together through relationships (the most important of which is the hierarchical IS A relationship).
There are more than 300,000 active concepts in SNOMED CT and over a million relations 1 . Each concept consists of
a:
1. Concept ID: A unique numerical identifier
2. Fully Specified Name: An unambiguous string used to name a concept
3. Preferred Term: A common phrase or word used by clinicians to name a concept. Each concept has precisely one
preferred term in a given language. In contrast to the fully specified name, the preferred term is not necessarily
unique and can be a synonym or preferred name for a different concept
4. Synonym: A term that can be used as an acceptable alternative to the preferred term. A concept can have zero
or more synonyms
Figure 1: Example SNOMED CT concepts: depression and myocardial infarction.
Figure 1 shows two example SNOMED CT concepts: depression and myocardial infarction. Note that in the depression example, the preferred term “depressive disorder” maps to single-word terms like “depressed” and “depression”.
Furthermore, it can be noted that the synonym “melancholia” does not contain the term “depression” or one of its
morphological variants.
Synonymy
Previous research on synonym extraction in the biomedical informatics literature has utilized diverse methodologies.
In the context of information retrieval from clinical documents, Zeng et al. 5 used three query expansion methods –
reading synonyms and lexical variants directly from the UMLS 6 , generating topic models from clinical documents,
and mining the SemRep 7 predication database – and found that an entirely corpus-based statistical method (i.e. topic
modeling) generated the best synonyms. Conway & Chapman 8 used a rule-based approach to generate potential
synonyms from the BioPortal ontology web service, verifying the acceptability of candidate synonyms by checking
for their presence in a very large corpus of clinical text, with the goal of populating a lexical-oriented knowledge
organization system. In the Natural Language Processing community, there is a rich tradition of using lexico-syntactic
patterns to extract synonym (and other) relations 9 .
Distributional Semantics
Models of distributional semantics exploit large corpora to capture the meaning of terms based on their distribution in
different contexts. The theoretical foundation underlying such models is the distributional hypothesis 10 , which states
that words with similar meanings tend to appear in similar contexts. Distributional methods have become popular with
the increasing availability of large corpora and are attractive due to their ability, in some sense, to render semantics
computable: an estimate of the semantic relatedness between two terms can be quantified. These methods have been
applied successfully to a range of natural language processing tasks, including document retrieval, synonym tests
and word sense disambiguation 11 . An obvious use case of distributional methods is for the extraction of semantic
relations, such as synonyms, hypernyms and co-hyponyms (terms with a common hypernym) 12 . Ideally, one would
want to differentiate between such semantic relations; however, with these methods, the semantic relation between
two distributionally similar terms is unlabeled. As synonyms are interchangeable in most contexts – meaning that they
will have similar distributional profiles – synonymy is certainly a semantic relation that will be captured. However,
since hypernyms and hyponyms – in fact, even antonyms – are also likely to occur in similar contexts, such semantic
relations will likewise be extracted.
Distributional methods can be usefully divided into spatial models and probabilistic models. Spatial models represent
terms as vectors in a high-dimensional space, based on the frequency with which they appear in different contexts, and
where proximity between vectors is assumed to indicate semantic relatedness. Probabilistic models view documents as
a mixture of topics and represent terms according to the probability of their occurrence during the discussion of each
topic: two terms that share similar topic distributions are assumed to be semantically related. There are pros and cons
of each approach; however, scalable versions of spatial models have proved to work well for very large corpora. 11
Spatial models differ mainly in the way context vectors, representing term meaning, are constructed. In many methods,
they are derived from an initial term-context matrix that contains the (weighted, normalized) frequency with which the
terms occur in different contexts. The main problem with using these term-by-context vectors is their dimensionality,
equal to the number of contexts (e.g. # of documents / vocabulary size). The solution is to project the high-dimensional
data into a lower-dimensional space, while approximately preserving the relative distances between data points. In
latent semantic analysis (LSA) 13 , the term-context matrix is reduced by an expensive matrix factorization technique
known as singular value decomposition. Random indexing (RI) 14 is a scalable and computationally efficient alternative
to LSA, in which explicit dimensionality reduction is circumvented: a lower dimensionality d is instead chosen a priori
as a model parameter and the d-dimensional context vectors are then constructed incrementally. RI can be viewed as
a two-step operation:
1. Each context (e.g. each document or unique term) is assigned a sparse, ternary and randomly generated index
vector: a small number (1-2%) of 1s and -1s are randomly distributed; the rest of the elements are set to zero.
By generating sparse vectors of a sufficiently high dimensionality in this way, the context representations will,
with a high probability, be nearly orthogonal.
2. Each unique term is also assigned an initially empty context vector of the same dimensionality. The context
vectors are then incrementally populated with context information by adding the index vectors of the contexts
in which the target term appears.
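The two-step procedure above can be sketched in a few lines of Python. This is a minimal illustration for exposition only, not the implementation used in the experiments; the dimensionality, sparsity and window size are arbitrary example values.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(42)
DIM = 1500       # dimensionality chosen a priori (example value)
NONZERO = 20     # number of +1/-1 elements per sparse index vector
WINDOW = 2       # 2+2 sliding window (example value)

def index_vector():
    """Step 1: a sparse, ternary, randomly generated index vector."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos] = rng.choice([1.0, -1.0], size=NONZERO)
    return v

def build_term_space(sentences):
    """Step 2: incrementally add the index vectors of surrounding terms
    to each target term's (initially empty) context vector."""
    index = defaultdict(index_vector)             # index vector per unique term
    context = defaultdict(lambda: np.zeros(DIM))  # context vector per unique term
    for tokens in sentences:
        for i, term in enumerate(tokens):
            for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
                if j != i:
                    context[term] += index[tokens[j]]
    return context

def cosine(u, v):
    """Semantic relatedness as proximity between context vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0
```

Querying such a term space then amounts to ranking all context vectors by cosine similarity against the context vector of the input term.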
There are a number of model parameters that need to be configured according to the task that the induced term
space will be used for. For instance, the types of semantic relations captured by an RI-based model depends on the
context definition 15 . By employing a document-level context definition, relying on direct co-occurrences, one models
syntagmatic relations. That is, two terms that frequently co-occur in the same documents are likely to be about the
same topic, e.g. <car, motor, race>. By employing a sliding window context definition, where the index vectors of
the surrounding terms within a, usually small, window are added to the context vector of the target term, one models
paradigmatic relations. That is, two terms that frequently occur with the same set of words – i.e. share neighbors –
but do not necessarily co-occur themselves, are semantically similar, e.g. <car, automobile, vehicle>. Synonymy is
an instance of a paradigmatic relation.
RI, in its original conception, does not fully take term order information into account, except by giving increasingly
less weight to index vectors of terms as the distance from the target term increases. Random permutation (RP) 16 is an
elegant modification of RI that attempts to remedy this by simply permuting (i.e. shifting) the index vectors according
to their direction and distance from the target term before they are added to the context vector. RI has performed well
on tasks such as taking the TOEFL (Test of English as a Foreign Language) test. However, by incorporating term
order information, RP was shown to outperform RI on this particular task 16 . Combining RI and RP models has been
demonstrated to yield improved results on the synonym extraction task 17 .
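The permutation idea can be made concrete with a short sketch. This is illustrative only: a circular shift serves here as one convenient choice of permutation, and the terms and dimensionality are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 1000
# One sparse ternary index vector per term, as in plain RI
iv = {t: rng.choice([0.0, 1.0, -1.0], size=DIM, p=[0.98, 0.01, 0.01])
      for t in ["left", "right"]}

def rp_context(neighbours):
    """Sum the neighbours' index vectors, shifting (permuting) each one
    according to its signed distance from the target term."""
    c = np.zeros(DIM)
    for term, offset in neighbours:
        c += np.roll(iv[term], offset)
    return c

# "left TARGET right" vs. "right TARGET left": identical neighbours in a
# different order. Plain RI would sum the same index vectors either way;
# with the permutation step the two contexts become distinguishable.
a = rp_context([("left", -1), ("right", +1)])
b = rp_context([("right", -1), ("left", +1)])
assert not np.allclose(a, b)
```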
The predefined dimensionality is yet another model parameter that has been shown to be potentially very important,
especially when the number of contexts (the size of the vocabulary) is large, as it often is in the clinical domain. Since
the traditional way of using distributional semantics is to model only unigram-unigram relations – a limitation when
wishing to model the semantics of phrases and longer textual sequences – a possible solution is to identify and model
multiword terms as single tokens. This will, however, lead to an explosion in the size of the vocabulary, necessitating
a larger dimensionality. In short, dimensionality and other model parameters need to be tuned for the dataset and task
at hand. 18
Materials and Methods
The method and experimental setup can be summarized in the following steps: (1) data preparation, (2) term extraction
and identification, (3) model building and parameter tuning, and (4) evaluation (Figure 2). Term spaces are constructed
with various parameter settings on two dataset variants: one with unigram terms and one with multiword terms. The
models – and, in effect, the method – are evaluated for their ability to identify synonyms of SNOMED CT preferred
terms. After optimizing the parameter settings for each group of models on a development set, the best models are
evaluated on unseen data.
Figure 2: An overview of the process and the experimental setup.
In Step 1, the clinical data from which the term spaces will be induced is extracted from the MIMIC-II database
and preprocessed. MIMIC-II 3 is a publicly available database encompassing clinical data for over 40,000 hospital
stays of more than 32,000 patients, collected over a seven-year period (2001-2007) from intensive care units (medical,
surgical, coronary and cardiac surgery recovery) at Boston’s Beth Israel Deaconess Medical Center. In addition to
various structured data, such as laboratory results and ICD-9 diagnosis codes, the database contains text-based records,
including nursing progress notes, discharge summaries and radiology interpretations. We create a corpus comprising
all text-based records (∼250 million tokens) from the MIMIC-II database. The documents in the corpus are then
preprocessed to remove metadata, such as headings (e.g. FINAL REPORT), incomplete sentence fragments, such as
enumerations (of e.g. medications), as well as digits and punctuation marks.
In Step 2, we extract and identify multiword terms in the corpus. This will allow us to extract synonymous relations
between terms of varying length. This is done by first extracting all unigrams, bigrams and trigrams from the corpus
with TEXT-NSP 19 and treating them as candidate terms for which the C-value is then calculated. The C-value statistic 20 21 has been used successfully for term recognition in the biomedical domain, largely due to its ability to handle
nested terms 22 . It is based on term frequency and term length (number of words); if a candidate term is part of a longer
candidate term, it also takes into account how many other terms it is part of and how frequent those longer terms are
(Figure 3). By extracting n-grams and then ranking the n-grams according to their C-value, we are incorporating the
notions of both unithood – indicating collocation strength – and termhood – indicating the association strength of a
term to domain concepts 22 . In this still rather simple approach to term extraction, however, we do not take any other
linguistic knowledge into account. As a simple remedy for this, we create a number of filtering rules that remove
terms beginning and/or ending with certain words, e.g. prepositions (in, from, for) and articles (a, the). Another alteration to the term list – now ranked according to C-value – is to give precedence to SNOMED CT preferred terms by
adding/moving them to the top of the list, regardless of their C-value (or failed identification). The reason for this is
that we are aiming to identify SNOMED CT synonyms of preferred terms and, by giving precedence to preferred terms
– but not synonyms, as that would constitute cheating – we are effectively strengthening the statistical foundation on
which the distributional method bases its semantic representation. The term list is then used to perform exact string
matching on the entire corpus: multiword terms with a higher C-value than their constituents are concatenated. We
thus treat multiword terms as separate tokens with their own particular distributions in the data, to a greater or lesser
extent different from those of their constituents.
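The concatenation step can be sketched as a greedy longest-match pass over the token stream. This is a simplification: the actual pipeline compares C-values of multiword terms against their constituents, which is omitted here, and the tokens and term set are invented for the example.

```python
def concatenate_terms(tokens, multiword_terms):
    """Replace occurrences of accepted multiword terms (tuples of words)
    with single underscore-joined tokens, preferring longer matches."""
    out, i = [], 0
    max_len = max((len(t) for t in multiword_terms), default=1)
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in multiword_terms:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])  # no multiword term starts here
            i += 1
    return out
```

After this pass, each multiword term is a single token with its own distribution in the data, as described above.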
C-value(a) = log2 |a| · f(a)                                      if a is not nested
C-value(a) = log2 |a| · ( f(a) − (1 / P(Ta)) · Σ_{b ∈ Ta} f(b) )  otherwise

where
a       = candidate term
b       = longer candidate term
|a|     = length of candidate term (number of words)
f(a)    = term frequency of a
Ta      = set of extracted candidate terms that contain a
P(Ta)   = number of candidate terms in Ta
f(b)    = term frequency of longer candidate term b

Figure 3: The formula for calculating the C-value of candidate terms.
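The formula in Figure 3 translates directly into code. This is an illustrative reimplementation, not the TEXT-NSP pipeline: candidate terms are represented as tuples of words, and the frequencies in the usage note are invented. Note that under the plain formula a unigram term receives a C-value of zero, since log2(1) = 0.

```python
import math

def c_value(a, freq, candidates):
    """C-value of candidate term `a` (a tuple of words). `freq` maps each
    candidate term to its corpus frequency; `candidates` is the set of
    all extracted candidate terms."""
    def contains(b, a):
        # True if the word sequence `a` occurs inside the longer sequence `b`
        return any(b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1))
    # T_a: longer candidate terms in which `a` is nested
    t_a = [b for b in candidates if len(b) > len(a) and contains(b, a)]
    if not t_a:  # a is not nested
        return math.log2(len(a)) * freq[a]
    return math.log2(len(a)) * (freq[a] - sum(freq[b] for b in t_a) / len(t_a))
```

For example, if "myocardial infarction" occurs 10 times and is nested only in "acute myocardial infarction" with frequency 4, its C-value is log2(2) · (10 − 4/1) = 6.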
In Step 3, term spaces are induced from the dataset variants: one containing only unigram terms (UNIGRAM TERMS)
and one containing also longer terms: unigram, bigram and trigram terms (MULTIWORD TERMS). The following
model parameters are experimented with:
• Model Type: random indexing (RI), random permutation (RP). Does the method for synonym identification
benefit from incorporating word order information?
• Sliding Window: 1+1, 2+2, 3+3, 4+4, 5+5, 6+6 surrounding terms. Which paradigmatic context definition is
most beneficial for synonym identification?
• Dimensionality: 500, 1000, 1500 dimensions. How does the dimensionality affect the method’s ability to
identify synonyms, and is the impact greater when the vocabulary size grows exponentially as it does when
treating multiword terms as single tokens?
Evaluation takes place in both Step 3 and Step 4. The term spaces are evaluated for their ability to identify synonyms
of SNOMED CT preferred terms that each appear at least fifty times in the corpus (Table 1). A vast number of
SNOMED CT terms do not appear in the corpus; requiring that they appear a certain number of times arguably makes
the evaluation more realistic. Although fifty is a somewhat arbitrarily chosen number, it is likely to ensure that the
statistical foundation is solid. A preferred term is provided as input to a term space and the twenty most semantically
similar terms are output, provided that they also appear at least fifty times in the data. For each preferred term, recall
is calculated using the twenty most semantically similar terms generated by the model (i.e. for each SNOMED CT
concept, recall is the proportion of SNOMED CT synonyms returned for that concept when the model is queried using
that concept’s preferred term). The SNOMED CT data is divided into a development set and an evaluation set in a
50/50 split. The development set is used in Step 3 to find the optimal parameter settings for the respective datasets
and the task at hand. The best parameter configuration for each type of model (UNIGRAM TERMS and MULTIWORD
TERMS) is then used in the final evaluation in Step 4. Note that the requirement of each synonym pair appearing
at least fifty times in the data means that the development and evaluation sets for the two types of models are not
identical, e.g. the test sets for UNIGRAM TERMS will not contain any multiword terms. This, in turn, means that the
results of the two types of models are not directly comparable.
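The evaluation metric can be sketched as follows. This is an illustrative reconstruction of recall top 20 as described above, with invented example terms; the reported scores presumably average this per-concept recall over all test concepts.

```python
def recall_top_j(retrieved_terms, gold_synonyms, j=20):
    """Recall top j: the proportion of a concept's SNOMED CT synonyms
    found among the j most similar terms returned for its preferred term."""
    retrieved = set(retrieved_terms[:j])
    gold = set(gold_synonyms)
    return len(retrieved & gold) / len(gold) if gold else 0.0

def macro_average(recalls):
    """Average the per-concept recall scores over the test set."""
    return sum(recalls) / len(recalls)
```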
                          UNIGRAM TERMS                  MULTIWORD TERMS
Semantic Type             Preferred Terms  Synonyms      Preferred Terms  Synonyms
attribute                 4                6             6                8
body structure            2                2             12               12
cell                      0                0             1                1
cell structure            1                1             1                1
disorder                  26               32            68               81
environment               3                3             3                3
event                     1                1             1                1
finding                   39               54            69               86
morphologic abnormality   35               45            42               54
observable entity         17               22            22               26
organism                  2                2             3                3
person                    2                2             2                2
physical force            1                1             1                1
physical object           9                10            12               15
procedure                 23               28            49               64
product                   6                6             3                3
qualifier value           133              173           153              190
regime/therapy            0                0             3                3
situation                 0                0             1                1
specimen                  0                0             2                0
substance                 24               24            24               25
Total                     328              412           478              580

Table 1: The frequency of SNOMED CT preferred terms and synonyms that are identified at least fifty times in the MIMIC-II corpus. The UNIGRAM TERMS set contains only unigram terms, while the MULTIWORD TERMS set contains unigram, bigram and trigram terms.
Results
One interesting result to report, although not the focus of this paper, concerns the coverage of SNOMED CT in a large
clinical corpus like MIMIC-II. First of all, it is interesting to note that only 9,267 out of the 105,437 preferred terms
with one or more synonyms are unigram terms (24,866 bigram terms, 21,045 trigram terms and 50,259 terms that
consist of more than three words/tokens). Out of the 158,919 synonyms, 12,407 are unigram terms (43,513 bigram
terms, 32,367 trigram terms and 70,632 terms that consist of more than three words/tokens). 7,265 SNOMED CT
terms (preferred terms and synonyms) are identified in the MIMIC-II corpus (2,632 unigram terms, 3,217 bigram
terms and 1,416 trigram terms); the occurrence of longer terms in the corpus has not been verified in the current work.
For the number of preferred terms and synonyms that appear more than fifty times in the corpus, consult Table 1.
When tuning the parameters of the UNIGRAM TERMS models and the MULTIWORD TERMS models, the pattern
is fairly clear: for both dataset variants, the best model parameter settings are based on random permutation and a
dimensionality of 1,500. For UNIGRAM TERMS, a sliding window of 5+5 yields the best results (Table 2). The
general tendency is that results improve as the dimensionality and the size of the sliding window increase. However,
increasing the size of the context window beyond 5+5 surrounding terms does not boost results further.
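The parameter sweep behind Tables 2 and 3 can be expressed as a simple grid search. The sketch below is illustrative only: `build_space` and `evaluate` are hypothetical placeholders for inducing a semantic space with a given configuration and scoring it (recall top 20) on the development set; the grid values mirror those reported above.

```python
from itertools import product

def grid_search(build_space, evaluate, models=("RI", "RP"),
                dims=(500, 1000, 1500), windows=(1, 2, 3, 4, 5, 6)):
    """Exhaustive sweep over model type, dimensionality and symmetric
    sliding-window size; returns the best configuration and its score."""
    best = (None, -1.0)
    for model, d, w in product(models, dims, windows):
        score = evaluate(build_space(model, d, w))
        if score > best[1]:
            best = ((model, d, w), score)
    return best
```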
                             RANDOM INDEXING                       RANDOM PERMUTATION
Sliding Window →     1+1   2+2   3+3   4+4   5+5   6+6     1+1   2+2   3+3   4+4   5+5   6+6
500 Dimensions      0.17  0.18  0.20  0.20  0.20  0.21    0.17  0.20  0.21  0.21  0.22  0.22
1,000 Dimensions    0.16  0.19  0.20  0.21  0.21  0.21    0.19  0.22  0.21  0.21  0.22  0.23
1,500 Dimensions    0.17  0.21  0.22  0.22  0.22  0.22    0.18  0.22  0.22  0.23  0.24  0.23

Table 2: Model parameter tuning: results, recall top 20, for UNIGRAM TERMS on the development set.
For MULTIWORD TERMS, a sliding window of 4+4 yields the best results (Table 3). The tendency is similar to that
of the UNIGRAM TERMS models: incorporating term order (RP) and employing a larger dimensionality leads to the
best performance. In contrast to the UNIGRAM TERMS models, the optimal context window size is in this case
slightly smaller.
                             RANDOM INDEXING                       RANDOM PERMUTATION
Sliding Window →     1+1   2+2   3+3   4+4   5+5   6+6     1+1   2+2   3+3   4+4   5+5   6+6
500 Dimensions      0.08  0.12  0.13  0.13  0.13  0.12    0.08  0.13  0.14  0.14  0.13  0.12
1,000 Dimensions    0.09  0.12  0.13  0.13  0.13  0.13    0.10  0.13  0.14  0.13  0.12  0.11
1,500 Dimensions    0.09  0.13  0.13  0.14  0.14  0.14    0.10  0.14  0.14  0.16  0.14  0.14

Table 3: Model parameter tuning: results, recall top 20, for MULTIWORD TERMS on the development set.
Once the optimal parameter settings for each dataset variant had been configured, they were evaluated for their ability
to identify synonyms on the unseen evaluation set. Overall, the best UNIGRAM TERMS model (RP, 5+5 context
window, 1,500 dimensions) achieved a recall top 20 of 0.24 (Table 4), i.e. 24% of all unigram synonym pairs that
occur at least fifty times in the corpus were successfully identified in a list of twenty suggestions per preferred term.
For certain semantic types, such as morphologic abnormality and procedure, the results are slightly higher: almost
50%. For finding, the results are slightly lower: 15%.
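The evaluation metric, recall in a list of twenty suggestions per preferred term, can be computed as below. The interface is hypothetical (not the authors' code): `ranked_candidates` is assumed to map each preferred term to a ranked list of candidate synonyms produced by a semantic space.

```python
def recall_top_k(gold_pairs, ranked_candidates, k=20):
    """Fraction of (preferred term, synonym) gold pairs for which the
    synonym appears among the top-k candidates retrieved for the
    preferred term."""
    hits = sum(1 for term, syn in gold_pairs
               if syn in ranked_candidates.get(term, [])[:k])
    return hits / len(gold_pairs)
```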
With the best MULTIWORD TERMS model (RP, 4+4 context window, 1,500 dimensions), the average recall top 20 is
0.16. Again, the results vary depending on the semantic type: higher results are achieved for entities such as disorder
(0.22), morphologic abnormality (0.29) and physical object (0.42), while lower results are obtained for finding (0.12),
observable entity (0.09), qualifier value (0.08) and substance (0.08). For 22 of the correctly identified synonym pairs,
at least one of the terms in the synonymous relation was a multiword term.
Discussion
When modeling unigram terms, as is the traditional approach when employing models of distributional semantics, the
results are fairly good: almost 25% of all synonym pairs that appear with some regularity in the corpus are successfully
identified. However, the problem with the traditional unigram-based approach is that the vast majority of SNOMED
CT terms – and other biomedical terms for that matter – are multiword terms: fewer than 10% are unigram terms.
This highlights the importance of developing methods and techniques that are able to model the meaning of multiword
expressions. In this paper, we attempted to do this in a fairly straightforward manner: identify multiword terms and
Semantic Type              # Synonym Pairs   Recall (Top 20)
attribute                                2              0.50
body structure                           1              0.00
cell structure                           1              1.00
disorder                                17              0.23
environment                              2              0.00
event                                    1              0.00
finding                                 27              0.15
morphologic abnormality                 26              0.43
observable entity                       11              0.22
organism                                 1              0.00
person                                   1              1.00
physical force                           1              0.00
physical object                          6              0.30
procedure                               15              0.46
product                                  2              0.50
qualifier value                         89              0.15
substance                               12              0.33
All                                    215              0.24

Table 4: Final evaluation: results, recall top 20, for UNIGRAM TERMS on the evaluation set.
treat each one as a distinct semantic unit. This approach allowed us to identify more synonym pairs in the corpus
(292 vs. 215). The results were slightly lower compared to the UNIGRAM TERMS model, although the results are
not directly comparable since they were evaluated on different datasets. The UNIGRAM TERMS model was unable to
identify any synonymous relations involving a multiword term, whereas the MULTIWORD TERMS model successfully
identified 22 such relations. This demonstrates that multiword terms can be handled with some amount of success in
distributional semantic models. However, our approach relies to a large degree on the ability to identify high quality
multiword terms, which was not the focus of this paper. The term extraction could be improved substantially by using
a linguistic filter that produces better candidate terms than n-grams. Using a shallow parser to extract phrases is one
such obvious improvement.
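One way to realize the suggested linguistic filter is to keep only candidate n-grams whose part-of-speech sequence looks like a noun phrase, a common heuristic in term recognition (cf. the C-value line of work cited below). This is a crude illustrative sketch, not the improvement actually implemented here; the POS tags are assumed to be supplied by a tagger.

```python
import re

# Keep only candidates whose POS sequence is zero or more adjectives or
# nouns followed by a head noun (a rough noun-phrase pattern).
NP_PATTERN = re.compile(r"^(ADJ |NOUN )*NOUN$")

def filter_candidates(tagged_ngrams):
    """tagged_ngrams: list of (term, pos_tag_sequence) tuples."""
    return [term for term, tags in tagged_ngrams
            if NP_PATTERN.match(" ".join(tags))]
```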
Another issue concerns the evaluation of the method. Relying heavily on a purely quantitative evaluation, as we have
done, can provide only a limited view of the usefulness of the models. Only counting the number of synonyms that
are currently in SNOMED CT – and treating this as our gold standard – does not say anything about the quality of
the “incorrect” suggestions. There may be valid synonyms that are currently not in SNOMED CT. One example of
this is the preferred term itching, which, in SNOMED CT, has two synonyms: itch and itchy. The model was able
to identify the former but not the latter; however, it also identified itchiness. Another phenomenon, which is perhaps
of less interest in the case of SNOMED CT but of huge significance for developing terminologies that are to be used
for information extraction purposes, is the identification of misspellings. When looking up anxiety, for instance, the
synonym anxiousness was successfully identified; other related terms were agitation and aggitation [sic]. Many of the
suggested terms are variants of a limited number of concepts. Future work should thus involve review of candidate
synonyms by human evaluators.
Semantic Type              # Synonym Pairs   Recall (Top 20)
attribute                                3              0.33
body structure                           6              0.17
cell structure                           1              1.00
cell                                     1              0.00
disorder                                40              0.22
environment                              2              0.00
event                                    1              0.00
finding                                 43              0.12
morphologic abnormality                 29              0.29
observable entity                       15              0.09
organism                                 2              0.00
person                                   1              1.00
physical force                           1              0.00
physical object                          7              0.42
procedure                               34              0.17
product                                  2              0.50
qualifier value                         88              0.08
regime/therapy                           2              0.50
situation                                1              0.00
specimen                                 1              0.00
substance                               12              0.08
All                                    292              0.16

Table 5: Final evaluation: results, recall top 20, for MULTIWORD TERMS on the evaluation set.
Conclusions
We have demonstrated how distributional analysis of a large corpus of clinical narratives can be used to identify
synonymy between SNOMED CT terms. In addition to capturing synonymous relations between pairs of unigram
terms, we have shown that we are also able to extract such relations between terms of varying length. This
language-independent method can be used to port SNOMED CT – and other terminologies and ontologies – to other languages.
Acknowledgements
This work was partially supported by NIH grant R01LM009427 (authors WC & MC), the Swedish Foundation for
Strategic Research through the project High-Performance Data Mining for Drug Effect Detection, ref. no. IIS11-0053
(author AH) and the Stockholm University Academic Initiative through the Interlock project.
References
1. International Health Terminology Standards Development Organisation: SNOMED CT. Available from:
http://www.ihtsdo.org/snomed-ct/ [cited 10th March 2013].
2. International Health Terminology Standards Development Organisation: Supporting Different Languages. Available
from: http://www.ihtsdo.org/snomed-ct/snomed-ct0/different-languages/ [cited 10th March 2013].
3. Saeed M, Villarroel M, Reisner AT, Clifford G, Lehman LW, Moody G, et al. Multiparameter intelligent monitoring in intensive care II (MIMIC-II): A public-access intensive care unit database. Crit Care Med. 2011;39(5):952–
960.
4. SNOMED CT User Guide 2013 International Release. Available from:
http://www.webcitation.org/6IlJeb5Uj [cited 10th March 2013].
5. Zeng QT, Redd D, Rindflesch T, Nebeker J. Synonym, topic model and predicate-based query expansion for
retrieving clinical documents. AMIA Annu Symp Proc. 2012;2012:1050–9.
6. Unified Medical Language System. Available from: http://www.nlm.nih.gov/research/umls/
[cited 10th March 2013].
7. Cohen T, Widdows D, Schvaneveldt RW, Davies P, Rindflesch TC. Discovering discovery patterns with
predication-based Semantic Indexing. J Biomed Inform. 2012 Dec;45(6):1049–65.
8. Conway M, Chapman W. Discovering lexical instantiations of clinical concepts using web services, WordNet and
corpus resources. In: AMIA Fall Symposium; 2012. p. 1604.
9. Hearst M. Automatic acquisition of hyponyms from large text corpora. In: Proceedings of COLING 1992; 1992.
p. 539–545.
10. Firth JR, Palmer FR. Selected papers of J. R. Firth, 1952-59. Indiana University studies in the history and theory
of linguistics. Bloomington: Indiana University Press; 1968.
11. Cohen T, Widdows D. Empirical distributional semantics: methods and biomedical applications. J Biomed
Inform. 2009 April;42(2):390–405.
12. Panchenko A. Similarity measures for semantic relation extraction. PhD thesis, Université catholique de Louvain
& Bauman Moscow State Technical University; 2013.
13. Deerwester S, Dumais S, Furnas G. Indexing by latent semantic analysis. Journal of the American Society for
Information Science. 1990;41(6):391–407.
14. Kanerva P, Kristofersson J, Holst A. Random indexing of text samples for latent semantic analysis. In: Proceedings of 22nd Annual Conference of the Cognitive Science Society; 2000. p. 1036.
15. Sahlgren M. The word-space model: Using distributional analysis to represent syntagmatic and paradigmatic
relations between words in high-dimensional vector spaces. PhD thesis, Stockholm University; 2006.
16. Sahlgren M, Holst A, Kanerva P. Permutations as a Means to Encode Order in Word Space. In: Proceedings of
the 30th Annual Meeting of the Cognitive Science Society; 2008. p. 1300–1305.
17. Henriksson A, Moen H, Skeppstedt M, Eklund AM, Daudaravicius V. Synonym extraction of medical terms from
clinical text using combinations of word space models. In: Proceedings of Semantic Mining in Biomedicine
(SMBM); 2012. p. 10–17.
18. Henriksson A, Hassel M. Optimizing the dimensionality of clinical term spaces for improved diagnosis coding
support. In: Proceedings of the LOUHI Workshop on Health Document Text Mining and Information Analysis;
2013. p. 1–6.
19. Banerjee S, Pedersen T. The design, implementation, and use of the Ngram Statistic Package. Proceedings of the
Fourth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing). 2003;p.
370–381.
20. Frantzi K, Ananiadou S. Extracting nested collocations. In: Proceedings of the Conference on Computational
Linguistics (COLING); 1996. p. 41–46.
21. Frantzi K, Ananiadou S, Mima H. Automatic recognition of multi-word terms: the C-value/NC-value method.
International Journal on Digital Libraries. 2000;3(2):115–130.
22. Zhang Z, Iria J, Brewster C, Ciravegna F. A Comparative Evaluation of Term Recognition Algorithms. In:
Proceedings of Language Resources and Evaluation (LREC); 2008.
PAPER II
SYNONYM EXTRACTION AND ABBREVIATION
EXPANSION WITH ENSEMBLES OF SEMANTIC SPACES
Aron Henriksson, Hans Moen, Maria Skeppstedt, Vidas Daudaravičius and Martin
Duneld (2013). Synonym Extraction and Abbreviation Expansion with Ensembles
of Semantic Spaces. Submitted.
Author Contributions Aron Henriksson was responsible for coordinating the
study, its overall design and for carrying out the experiments. Hans Moen and
Maria Skeppstedt contributed equally: Hans Moen was responsible for designing
strategies for combining the output of multiple semantic spaces; Maria Skeppstedt
was responsible for designing the evaluation part of the study. Vidas Daudaravičius
and Maria Skeppstedt were jointly responsible for designing the post-processing
filtering of candidate terms, while Martin Duneld provided feedback on the design
of the study. The manuscript was written by Aron Henriksson, Hans Moen, Maria
Skeppstedt and Martin Duneld.
Synonym Extraction and Abbreviation Expansion
with Ensembles of Semantic Spaces
Aron Henriksson*1, Hans Moen2, Maria Skeppstedt1, Vidas Daudaravičius3 and Martin Duneld1
1 Department of Computer and Systems Sciences (DSV), Stockholm University, Forum 100, SE-164 40 Kista, Sweden
2 Department of Computer and Information Science, Norwegian University of Science and Technology, NO-7491 Trondheim, Norway
3 Faculty of Informatics, Vytautas Magnus University, Vileikos g. 8 - 409, Kaunas, LT-44404, Lithuania
Email: Aron Henriksson* - [email protected]; Hans Moen - [email protected]; Maria Skeppstedt - [email protected]; Vidas
Daudaravičius - [email protected]; Martin Duneld - [email protected]
* Corresponding author
Abstract
Background: Terminologies that account for language use variability by linking synonyms and abbreviations
to their corresponding concept are important enablers of high-quality information extraction from medical texts.
Due to the use of specialized sub-languages in this domain, manual construction of semantic resources that
accurately reflect language use is both challenging and costly, and often results in low coverage. Although models
of distributional semantics applied to large corpora provide a potential means of supporting development of such
resources, their ability to identify synonymy in particular is limited, and their application to the clinical domain
has only recently been explored. Combining distributional models and applying them to different types of corpora
may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion
pairs.
Results: A combination of two distributional models – Random Indexing and Random Permutation – employed in conjunction with a single corpus outperforms using either of the models in isolation. Furthermore,
combining semantic spaces induced from different types of corpora – a clinical corpus and a corpus comprising
medical journal articles – further improves results, outperforming a combination of semantic spaces induced from
a single source. A combination strategy that simply sums the cosine similarity scores of candidate terms is the
most profitable. Finally, applying simple post-processing filtering rules yields significant performance gains on the
tasks of extracting abbreviation-expansion pairs, but not synonyms. The best results, measured as recall in a
list of ten candidate terms, for the three tasks are: 0.39 for abbreviations to long forms, 0.33 for long forms to
abbreviations, and 0.47 for synonyms.
Conclusions: This study demonstrates that ensembles of semantic spaces yield improved performance on
the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits
further exploration, allows different distributional models – with different model parameters – and different types
of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural
language processing tasks.
Background
In order to create high-quality information extraction systems, it is important to incorporate some knowledge
of semantics, such as the fact that a concept can be signified by multiple signifiers. Morphological variants,
abbreviations, acronyms, misspellings and synonyms – although different in form – may share semantic
content to different degrees. The various lexical instantiations of a concept thus need to be mapped to some
standard representation of the concept, either by converting the different expressions to a canonical form or by
generating lexical variants of a concept’s preferred term. These mappings are typically encoded in semantic
resources, such as thesauri or ontologies, which enable the recall (sensitivity) of information extraction
systems to be improved. Although their value is undisputed, manual construction of such resources is often
prohibitively expensive and may also result in limited coverage, particularly in the biomedical and clinical
domains where language use variability is exceptionally high [1].
There is thus a need for (semi-)automatic methods that can aid and accelerate the process of lexical
resource development, especially ones that are able to reflect real language use in a particular domain and
adapt to different genres of text, as well as to changes over time. In the clinical domain, for instance, language
use in general and (ad-hoc) abbreviations in particular can vary significantly across specialities. Statistical,
corpus-driven and language-independent methods are attractive due to their inherent portability: given a
corpus of sufficient size in the target domain, the methods can be applied with no or little adaptation needed.
Models of distributional semantics fulfill these requirements and have been used to extract semantically similar terms from large corpora; with increasing access to data from electronic health records, their application
to the clinical domain has lately begun to be explored. In this paper, we present a method that employs
distributional semantics for the extraction of synonyms and abbreviation-expansion pairs from two large
corpora: a clinical corpus (comprising health record narratives) and a medical corpus (comprising journal
articles). We also demonstrate that performance can be enhanced by creating ensembles of (distributional)
semantic spaces – both with different model parameter configurations and induced from different genres of
text.
The structure of this paper is as follows. First, we present some relevant background literature on synonyms, abbreviations and their extraction/expansion. We also introduce the ideas underlying distributional
semantics in general and, in particular, the models employed in this study: Random Indexing and Random
Permutation. Then, we describe our method of combining semantic spaces induced from single and multiple corpora, including the details of the experimental setup and the materials used. A presentation of the
experimental results is followed by an analysis and discussion of their implications. Finally, we conclude the
paper with a brief recapitulation and conclusion.
Language Use Variability: Synonyms and Abbreviations
Synonymy is a semantic relation between two phonologically distinct words with very similar meaning. It
is, however, extremely rare that two words have the exact same meaning – perfect synonyms – as there is
often at least one parameter that distinguishes the use of one word from another [2]. For that reason, we
typically speak of near-synonyms instead; that is, two words that are interchangeable in some, but not all,
contexts [2]. The words big and large are, for instance, synonymous when describing a house, but certainly
not when describing a sibling. Two near-synonyms may also have different connotations, such as conveying a
positive or a negative attitude. To complicate matters, the same concept can sometimes be referred to with
different words in different dialects; for a speaker who is familiar with both dialects, these can be viewed
as synonyms. A similar phenomenon concerns different formality levels, where one word in a synonym pair
is used only in slang and the other only in a more formal context [2]. In the medical domain, there is one
vocabulary that is more frequently used by medical professionals, whereas patients often use alternative,
layman terms [3]. When developing many natural language processing (NLP) applications, it is important
to have ready access to terminological resources that cover this variation in the use of vocabulary by storing
synonyms. Examples of such applications are query expansion [3], text simplification [4] and, as mentioned
previously, information extraction [5].
The use of abbreviations and acronyms is prevalent in both medical journal text [6] and clinical text
[1]. This leads to decreased readability [7] and poses challenges for information extraction [8]. Semantic
resources that also link abbreviations to their corresponding concept, or, alternatively, simple term lists that
store abbreviations and their corresponding long form, are therefore as important as synonym resources
for many biomedical NLP applications. Like synonyms, abbreviations are often interchangeable with their
corresponding long form in some, if not all, contexts. An important difference between abbreviations and
synonyms is, however, that abbreviations are semantically overloaded to a much larger extent; that is, one
abbreviation often has several possible long forms. In fact, 81% of UMLS1 abbreviations were found to have
ambiguous occurrences in biomedical text [6].
Identifying Synonymous Relations between Terms
The importance of synonym learning is well recognized in the NLP research community, especially in the
biomedical [9] and clinical [1] domains. A wide range of techniques has been proposed for the identification
of synonyms and other semantic relations, including the use of lexico-syntactic patterns, graph-based models
and, indeed, distributional semantics [10] – the approach assumed in this study.
For instance, Hearst [11] proposes the use of lexico-syntactic patterns for the automatic acquisition of
hyponyms from unstructured text. These patterns are hand-crafted according to observations in a corpus.
Patterns can similarly be constructed for other types of lexical relations. However, a requirement is that
these syntactic patterns are common enough to match a wide array of hyponym pairs. Blondel et al. [12]
present a graph-based method that takes its inspiration from the calculation of hub, authority and centrality
scores when ranking hyperlinked web pages. They illustrate that the central similarity score can be applied
to the task of automatically extracting synonyms from a monolingual dictionary, in this case the Webster
dictionary, where the assumption is that synonyms have a large overlap in the words used in their definitions;
they also co-occur in the definition of many words. Another possible source for extracting synonyms is the
use of linked data, such as Wikipedia. Nakayama et al. [13] also utilize a graph-based method, but instead
of relying on word co-occurrence information, they exploit the links between Wikipedia articles (treated as
concepts). This way they can measure both the strength (the number of paths from one article to another)
and the distance (the length of each path) between concepts: articles close to each other in the graph and
with common hyperlinks are deemed to be stronger than those farther away.
There have also been some previous efforts to obtain better performance on the synonym extraction task
by not only using a single source and a single method. Inspiration for some of these approaches has been
drawn from ensemble learning, a machine learning technique that combines the output of several different
classifiers with the aim of improving classification performance (see [14] for an overview). Curran [15] exploits
this notion for synonym extraction and demonstrates that ensemble methods outperform individual classifiers
even for very large corpora. Wu and Zhou [16] use multiple resources – a monolingual dictionary, a bilingual
corpus and a large monolingual corpus – in a weighted ensemble method that combines the individual
extractors, thereby improving both precision and recall on the synonym acquisition task. Along somewhat
similar lines, van der Plas and Tiedemann [17] use parallel corpora to calculate distributional similarity
based on (automatic) word alignment, where a translational context definition is employed; synonyms are
extracted with both greater precision and recall compared to a monolingual approach. This approach is,
however, hardly applicable in the medical domain due to the unavailability of parallel corpora. Peirsman
and Geeraerts [18] combine predictors based on collocation measures and distributional semantics with a
so-called compounding approach, wherein cues are combined with strongly associated words into compounds
and verified against a corpus. This ensemble approach is shown substantially to outperform the individual
predictors of strong term associations in a Dutch newspaper corpus.
In the biomedical domain, most efforts have focused on extracting synonyms of gene and protein names
from the biomedical literature [19–21]. In the clinical domain, Conway and Chapman [22] propose a rule-based approach to generate potential synonyms from the BioPortal ontology – using permutations, abbreviation generation, etc. – after which candidate synonyms are verified against a large clinical corpus. Zeng
et al. [23] evaluate three query expansion methods for retrieval of clinical documents and conclude that an
LDA-based topic model generates the best synonyms. Henriksson et al. [24] use models of distributional
semantics to induce unigram word spaces and multiword term spaces from a large corpus of clinical text in
an attempt to extract synonyms of varying length for Swedish SNOMED CT concepts.
Creating Abbreviation Dictionaries Automatically
There are a number of studies on the automatic creation of biomedical abbreviation dictionaries that exploit
the fact that abbreviations are sometimes defined in the text on their first mention. These studies extract
candidates for abbreviation-expansion pairs by assuming that either the long form or the abbreviation is
written in parentheses [25]; other methods that use rule-based pattern matching have also been proposed [26].
The process of determining which of the extracted candidates are likely to be correct abbreviation-expansion pairs is then performed either by rule-based [26] or machine learning [27, 28] methods. Most of
these studies have been conducted for English; however, there is also a study on Swedish medical text [29],
for instance.
Yu et al. [30] have, however, found that around 75% of all abbreviations found in biomedical articles
are never defined in the text. The application of these methods to clinical text is most likely inappropriate,
as clinical text is often written in a telegraphic style, mainly for documentation purposes [1]; it seems
unlikely that effort would be spent on defining abbreviations in this type of text. There has, however, been
some work on identifying such undefined abbreviations [31], as well as on finding the intended abbreviation
expansion among several possible expansions available in an abbreviation dictionary [32].
In summary, automatic creation of biomedical abbreviation dictionaries from texts where abbreviations
are defined is well studied. This is also the case for abbreviation disambiguation given several possible
long forms in an abbreviation dictionary. The abbreviation part of this study, however, focuses on a task
which has not as yet been adequately explored: to find abbreviation-expansion pairs without requiring the
abbreviations to be defined in the text.
Distributional Semantics: Inducing Semantic Spaces from Corpora
Distributional semantics (see [33] for an overview of methods and their application in the biomedical domain)
were initially motivated by the inability of the vector space model [34] – as it was originally conceived – to
account for the variability of language use and word choice stemming from natural language phenomena such
as synonymy. To overcome the negative impact this had on recall in information retrieval systems, models
of distributional semantics were proposed [35–37]. The theoretical foundation underpinning such semantic
models is the distributional hypothesis [38], which states that words with similar distributions in language –
in the sense that they co-occur with overlapping sets of words – tend to have similar meanings. Distributional
methods have become popular with the increasing availability of large corpora and are attractive due to their
computational approach to semantics, allowing an estimate of the semantic relatedness between two terms
to be quantified.
An obvious application of distributional semantics is the extraction of semantically related terms. As near-synonyms are interchangeable in at least some contexts, their distributional profiles are likely to be similar,
which in turn means that synonymy is a semantic relation that should to a certain degree be captured by
these methods. This seems intuitive, as, next to identity, the highest degree of semantic relatedness between
terms is realized by synonymy. It is, however, well recognized that other semantic relations between terms
that share similar contexts will likewise be captured by these models [39]; synonymy cannot readily be
isolated from such relations.
Spatial models2 of distributional semantics generally differ in how vectors representing term meaning are
constructed. These vectors, often referred to as context vectors, are typically derived from a term-context
matrix that contains the (weighted, normalized) frequency with which terms occur in di↵erent contexts.
Working directly with such high-dimensional (and inherently sparse) data — where the dimensionality is
equal to the number of contexts (e.g. the number of documents or the size of the vocabulary, depending on
which context definition is employed) — would entail unnecessary computational complexity, in particular
since most terms only occur in a limited number of contexts, which means that most cells in the matrix
will be zero. The solution is to project the high-dimensional data into a lower-dimensional space, while
approximately preserving the relative distances between data points. The benefit of dimensionality reduction
is two-fold: on the one hand, it reduces complexity and data sparseness; on the other hand, it has also been
shown to improve the coverage and accuracy of term-term associations, as, in this reduced (semantic) space,
terms that do not necessarily co-occur directly in the same contexts – this is indeed the typical case for
synonyms and abbreviation-expansion pairs – will nevertheless be clustered about the same subspace, as
long as they appear in similar contexts, i.e. have neighbors in common (co-occur with the same terms). In
this way, the reduced space can be said to capture higher order co-occurrence relations.
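The central property used here, that projecting into a lower-dimensional space approximately preserves relative distances, can be illustrated with a plain random projection. This is a toy sketch (dimensionalities, seed and vectors are arbitrary), not any of the specific models discussed below:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two sparse high-dimensional "context profiles" that overlap partially.
n = 2_000
u = np.zeros(n); v = np.zeros(n)
u[:60] = 1.0        # contexts in which term A occurs
v[30:90] = 1.0      # contexts in which term B occurs (half shared with A)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Project both vectors into d dimensions with one shared random matrix.
d = 500
R = rng.standard_normal((n, d)) / np.sqrt(d)
print(cos(u, v), cos(u @ R, v @ R))  # the two values should be close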
In latent semantic analysis (LSA) [35], dimensionality reduction is performed with a computationally
expensive matrix factorization technique known as singular value decomposition. Despite its popularity,
LSA has consequently received some criticism for its poor scalability properties, as well as for the fact that a
semantic space needs to be completely rebuilt each time new data is added. As a result, alternative methods
for constructing semantic spaces based on term co-occurrence information have been proposed.
Random Indexing
Random indexing (RI) [40] is an incremental, scalable and computationally efficient alternative to LSA in
which explicit dimensionality reduction is avoided: a lower dimensionality d is instead chosen a priori as a
model parameter and the d -dimensional context vectors are then constructed incrementally. This approach
allows new data to be added at any given time without having to rebuild the semantic space. RI can be
viewed as a two-step operation.
1. Each context (e.g. each document or unique term) is assigned a sparse, ternary and randomly generated
index vector: a small number (1-2%) of +1s and -1s are randomly distributed; the rest of the elements
are set to zero. By generating sparse vectors of a sufficiently high dimensionality in this way, the
context representations will, with a high probability, be nearly orthogonal.
2. Each unique term is also assigned an initially empty context vector of the same dimensionality. The
context vectors are then incrementally populated with context information by adding the index vectors
of the contexts in which the target term appears.
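To make the two steps concrete, they can be sketched in Python. This is an illustrative sketch only – not the JavaSDM implementation used in the study – and the function names are our own; a sliding-window context definition is assumed.

```python
import random
from collections import defaultdict

def make_index_vector(dim=1000, nonzero=8):
    """Step 1: sparse ternary index vector, stored as {position: +1/-1};
    all other positions are implicitly zero."""
    positions = random.sample(range(dim), nonzero)
    return {p: (1 if i < nonzero // 2 else -1) for i, p in enumerate(positions)}

def random_indexing(tokenized_docs, dim=1000, window=2):
    """Step 2: build each term's context vector by adding the index
    vectors of the terms appearing in its sliding-window context."""
    index_vectors = {}
    context_vectors = defaultdict(lambda: [0.0] * dim)
    for doc in tokenized_docs:
        for i, target in enumerate(doc):
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if j == i:
                    continue
                if doc[j] not in index_vectors:
                    index_vectors[doc[j]] = make_index_vector(dim)
                for pos, val in index_vectors[doc[j]].items():
                    context_vectors[target][pos] += val
    return context_vectors, index_vectors
```

With a sufficiently large `dim`, the randomly drawn index vectors are nearly orthogonal, which is what makes the incremental construction a valid approximation of an explicit co-occurrence matrix.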
Random Permutation
Random permutation (RP) [41] is a modification of RI that attempts to take into account term order
information by simply permuting (i.e. shifting) the index vectors according to their direction and distance
from the target term before they are added to the context vector. In effect, each term has multiple unique
representations: one index vector for each possible position relative to the target term in the context window.
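A minimal sketch of this permutation, assuming circular rotation of the sparse index vector by the neighbor's signed offset from the target term (the function names are ours):

```python
def permute(index_vector, shift, dim):
    """Rotate a sparse index vector circularly; one distinct shift per
    relative position encodes term order."""
    return {(pos + shift) % dim: val for pos, val in index_vector.items()}

def rp_update(context_vector, index_vector, offset, dim):
    """Add a neighbor's index vector to the target's context vector,
    permuted by the neighbor's signed offset from the target."""
    for pos, val in permute(index_vector, offset, dim).items():
        context_vector[pos] += val
```

A neighbor one position to the left (offset -1) and one position to the right (offset +1) thus contribute different, effectively unrelated vectors, which is how order information enters the space.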
Model Parameters
There are a number of model parameters that need to be configured according to the task that the induced
semantic spaces will be used for. For instance, the types of semantic relations captured by an RI space
depend on the context definition [39]. By employing a document-level context definition, relying on direct
co-occurrences, one models syntagmatic relations. That is, two terms that frequently co-occur in the same
documents are likely to be about the same general topic. By employing a sliding window context definition,
one models paradigmatic relations. That is, two terms that frequently co-occur with similar sets of words –
i.e., share neighbors – but do not necessarily co-occur themselves, are semantically similar. Synonymy is a
prime example of a paradigmatic relation. The size of the context window also affects the types of relations
that are modeled and needs to be tuned for the task at hand. This is also true for semantic spaces produced
by RP; however, the precise impact of window size on RP spaces and the internal relations of their context
vectors is yet to be studied in depth.
Method
The main idea behind this study is to enhance the performance on the task of extracting synonyms and
abbreviation-expansion pairs by combining multiple and different semantic spaces – different in terms of
(1) type of model and model parameters used, and (2) type of corpus from which the semantic space is
induced. In addition to combining semantic spaces induced from a single corpus, we also combine semantic
spaces induced from two different types of corpora: in this case, a clinical corpus (comprising health record
notes) and a medical corpus (comprising journal articles). The notion of combining multiple semantic spaces
to improve performance on some task is generalizable and can loosely be described as creating ensembles
of semantic spaces. By combining semantic spaces, it becomes possible to benefit from model types that
capture slightly different aspects of semantics, to exploit various model parameter configurations (which
influence the types of semantic relations that are modeled), as well as to observe language use in potentially
very different contexts (by employing more than one corpus type). We set out exploring this approach by
querying each semantic space separately and then combining their output using a number of combination
strategies (Figure 1).
The experimental setup can be divided into the following steps: (1) corpora preprocessing, (2) construction of semantic spaces from the two corpora, (3) identification of the most profitable single-corpus
combinations, (4) identification of the most profitable multiple-corpora combinations, (5) evaluations of the
single-corpus and multiple-corpora combinations, (6) post-processing of candidate terms, and (7) frequency
threshold experiments. Once the corpora have been preprocessed, ten semantic spaces from each corpus are
induced with different context window sizes (RP spaces are also induced with and without stop words). Ten
pairs of semantic spaces are then combined using three different combination strategies. These are evaluated
on the three tasks – (1) abbreviations → expansions, (2) expansions → abbreviations and (3) synonyms
– using the development subsets of the reference standards (a list of medical abbreviation-expansion pairs
and MeSH synonyms). Performance is mainly measured as recall top 10, i.e. the proportion of expected
candidate terms that are among a list of ten suggestions. The pair of semantic spaces involved in the most
profitable combination for each corpus is then used to identify the most profitable multiple-corpora combinations, where eight different combination strategies are evaluated. The best single-corpus combinations
are evaluated on the evaluation subsets of the reference standards, where using RI and RP in isolation
constitute the two baselines. The best multiple-corpora combination is likewise evaluated on the evaluation
subsets of the reference standards; here, the clinical and medical ensembles constitute the two baselines.
Post-processing rules are then constructed using the development subsets of the reference standards and the
outputs of the various semantic space combinations. These are evaluated on the evaluation subsets of the
reference standards using the most profitable single-corpus and multiple-corpora ensembles. Finally, we explore the impact of frequency thresholds (i.e., how many times each pair of terms in the reference standards
needs to occur to be included) on performance.
Inducing Semantic Spaces from Clinical and Medical Corpora
Each individual semantic space is constructed with one model type, using a predefined context window size
and induced from a single corpus type. The semantic spaces are constructed with random indexing (RI)
and random permutation (RP) using JavaSDM [42]. For all semantic spaces, a dimensionality of 1,000 is
used (with 8 non-zero, randomly distributed elements in the index vectors: four 1s and four -1s). When
the RI model is employed, the index vectors are weighted according to their distance from the target term,
see Equation 1, where dist_it is the distance to the target term. When the RP model is employed, the
elements of the index vectors are instead shifted according to their direction and distance from the target
term; no weighting is performed. For all models, window sizes of two (1+1), four (2+2) and eight (4+4)
surrounding terms are used. In addition, RI spaces with a window size of twenty (10+10) are induced in
order to investigate whether a significantly wider context definition may be profitable. Incorporating order
information (RP) with such a large context window makes little sense; such an approach would also suffer
from data sparseness. Different context definitions are experimented with in order to find one that is best
suited to each task. The RI spaces are induced only from corpora that have been stop-word filtered, as
co-occurrence information involving high-frequency and widely distributed words contributes very little to the
meaning of terms. The RP spaces are, however, also induced from corpora in which stop words have been
retained. The motivation behind this is that all words, including function words – these make up the majority
of the items in the stop-word lists – are important to the syntactic structure of language and may thus be
of value when modeling order information [43]. A stop-word list is created for each corpus by manually
inspecting the most frequent types and removing those that may be of interest, e.g. domain-specific terms.
Each list consists of approximately 150 terms.
weight_i = 2^(1 - dist_it)    (1)
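The distance weighting of Equation 1 can be sketched as follows (illustrative only):

```python
def distance_weight(dist_to_target):
    """Equation 1: index vectors at distance 1 from the target term get
    weight 1; each further step away halves the weight."""
    return 2.0 ** (1 - dist_to_target)
```

An immediate neighbor thus contributes with weight 1, a term two positions away with weight 0.5, and so on.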
The semantic spaces are induced from two types of corpora – essentially belonging to different genres,
but both within the wider domain of medicine: (1) a clinical corpus, comprising notes from health records,
and (2) a medical corpus, comprising medical journal articles. The clinical corpus contains a subset of the
Stockholm EPR Corpus [44], which encompasses health records from the Karolinska University Hospital in
Stockholm, Sweden over a five-year period. The clinical corpus used in this study is created by extracting
the free-text, narrative parts of the health records from a wide range of clinical practices. The clinical
notes are written in Swedish by physicians, nurses and other health care professionals over a six-month
period in 2008. In summary, the corpus comprises documents that each contain clinical notes documenting
a single patient visit at a particular clinical unit. The medical corpus contains the freely available subset
of Läkartidningen (1996-2005), which is the Journal of the Swedish Medical Association [45]. It is a weekly
journal written in Swedish and contains articles discussing new scientific findings in medicine, pharmaceutical
studies, health economic evaluations, etc. Although these issues have been made available for research, the
original order of the sentences has not been retained; the sentences are given in a random order. Both corpora
are lemmatized using the Granska Tagger [46] and thereafter further preprocessed by removing punctuation
marks and digits. Two versions of each corpus are created: one version in which the stop words are retained
and one version in which they are removed. As the sentences in Läkartidningen are given in a random order,
a document break is indicated between each sentence for this corpus. It is thereby ensured that context
information from surrounding sentences will not be incorporated in the induced semantic space. Statistics
for the two corpora are shown in Table 1.
In summary, a total of twenty semantic spaces are induced – ten from each corpus type. Four RI spaces
are induced from each corpus type (8 in total), the di↵erence being the context definition employed (1+1,
2+2, 4+4, 10+10). Six RP spaces are induced from each corpus type (12 in total), the di↵erence being the
context definition employed (1+1, 2+2, 4+4) and whether stop words have been removed or retained (sw).
Combinations of Semantic Spaces from a Single Corpus
Since RI and RP model semantic relations between terms in slightly different ways, it may prove profitable
to combine them in order to increase the likelihood of capturing synonymy and identifying abbreviation-expansion pairs. In one study it was estimated that the overlap in the output produced by RI and RP
spaces is, on average, only around 33% [41]: by combining them, we hope to capture different semantic
properties of terms and, ultimately, boost results. The combinations from a single corpus type involve only
two semantic spaces: one constructed with RI and one constructed with RP. In this study, the combinations
involve semantic spaces with identical window sizes, with the following exception: RI spaces with a wide
context definition (10+10) are combined with RP spaces with a narrow context definition (1+1, 2+2). The
RI spaces are combined with RP spaces both with and without stop words.
Three different strategies of combining an RI-based semantic space with an RP space are designed and
evaluated. Thirty combinations are evaluated for each corpus, i.e. sixty in total (Table 2). The three
combination strategies are:
• RI ∩ RP30
Finds the top ten terms in the RI space that are among the top thirty terms in the RP space.
• RP ∩ RI30
Finds the top ten terms in the RP space that are among the top thirty terms in the RI space.
• RI + RP
Sums the cosine similarity scores from the two spaces for each candidate term.
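Taking each space's output for a query term as a mapping from candidate term to cosine similarity, the three strategies can be sketched as follows (a simplified illustration; the function names are ours):

```python
def filtered_top(primary, secondary, n=10, m=30):
    """The filtering strategies: the top-n terms of `primary` that also
    appear among the top-m terms of `secondary`."""
    top_m = set(sorted(secondary, key=secondary.get, reverse=True)[:m])
    ranked = sorted(primary, key=primary.get, reverse=True)
    return [t for t in ranked if t in top_m][:n]

def summed(space_a, space_b, n=10):
    """RI + RP: sum the cosine similarity scores per candidate term."""
    scores = {t: space_a.get(t, 0.0) + space_b.get(t, 0.0)
              for t in set(space_a) | set(space_b)}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

Summation rewards candidates that are ranked reasonably high in both spaces, whereas the filtering strategies preserve one space's ranking and merely use the other as a plausibility check.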
Combinations of Semantic Spaces from Multiple Corpora
In addition to combining semantic spaces induced from one and the same corpus, a combination of semantic
spaces induced from different corpora could potentially yield even better performance on the task of extracting synonyms and abbreviation-expansion pairs, especially if the terms of interest occur with some minimum
frequency in both corpora. Such ensembles of semantic spaces – in this study consisting of four semantic
spaces – allow not only different model types and model parameter configurations to be employed, but also
to capture language use in different genres or domains, in which terms may be used in slightly different contexts. The pair of semantic spaces from each corpus that are best able to perform each of the aforementioned
tasks – consisting of two semantic spaces – are subsequently combined using various combination strategies.
The combination strategies can usefully be divided into two sets of approaches: in the first, the four
semantic spaces are treated equally – irrespective of source – and combined in a single step; in the other,
a two-step approach is assumed, wherein each pair of semantic spaces – induced from the same source – is
combined separately before the combination of combinations is performed. In both sets of approaches, the
outputs of the semantic spaces are combined in one of two ways: SUM, where the cosine similarity scores are
merely summed, and AVG, where the average cosine similarity score is calculated based on the number of
semantic spaces in which the term under consideration exists. The latter is an attempt to mitigate the effect
of differences in vocabulary between the two corpora. In the two-step approaches, the SUM/AVG option is
configurable for each step. In the single-step approaches, the combinations can be performed either with or
without normalization, which in this case means replacing the exact cosine similarity scores of the candidate
terms in the output of each queried semantic space with their ranking in the list of candidate terms. This
means that the candidate terms are now sorted in ascending order, with zero being the highest score. When
combining two or more lists of candidate terms, the combined list is thus also sorted in ascending order.
The rationale behind this option is that the cosine similarity scores are relative and thus only valid within
a given semantic space: combining similarity scores from semantic spaces constructed with different model
types and parameter configurations, and induced from different corpora, might have adverse effects. In the
two-step approach, normalization is always performed after combining each pair of semantic spaces. In
total, eight combination strategies are evaluated:
Single-Step Approaches
• SUM: RI_clinical + RP_clinical + RI_medical + RP_medical
Each candidate term’s cosine similarity score in each semantic space is summed. The top ten terms
from this list are returned.
• SUM, normalized: norm(RI_clinical) + norm(RP_clinical) + norm(RI_medical) + norm(RP_medical)
The output of each semantic space is first normalized by using the ranking instead of cosine similarity;
each candidate term’s (reverse) ranking in each semantic space is then summed. The top ten terms
from this list are returned.
• AVG: (RI_clinical + RP_clinical + RI_medical + RP_medical) / count_term
Each candidate term's cosine similarity score in each semantic space is summed; this value is then
averaged over the number of semantic spaces in which the term exists. The top ten terms from this
list are returned.
• AVG, normalized: (norm(RI_clinical) + norm(RP_clinical) + norm(RI_medical) + norm(RP_medical)) / count_term
The output of each semantic space is first normalized by using the ranking instead of cosine similarity;
each candidate term's normalized score in each semantic space is then summed; this value is finally
averaged over the number of semantic spaces in which the term exists. The top ten terms from this
list are returned.
Two-Step Approaches
• SUM→SUM: norm(RI_clinical + RP_clinical) + norm(RI_medical + RP_medical)
Each candidate term’s cosine similarity score in each pair of semantic spaces is first summed; these are
then normalized by using the ranking instead of the cosine similarity; finally, each candidate term’s
normalized score is summed. The top ten terms from this list are returned.
• AVG→AVG: (norm((RI_clinical + RP_clinical) / count_term,source_a) + norm((RI_medical + RP_medical) / count_term,source_b)) / (count_term,source_a + count_term,source_b)
Each candidate term’s cosine similarity score for each pair of semantic spaces is first summed; for each
pair of semantic spaces, this value is then averaged over the number of semantic spaces in that pair
in which the term exists; these are subsequently normalized by using the ranking instead of the cosine
similarity; each candidate term’s normalized score in each combined list is then summed and averaged
over the number of semantic spaces in which the term exists (in both pairs of semantic spaces). The
top ten terms from this list are returned.
• SUM→AVG: (norm(RI_clinical + RP_clinical) + norm(RI_medical + RP_medical)) / count_term
Each candidate term's cosine similarity score for each pair of semantic spaces is first summed; these are
then normalized by using the ranking instead of the cosine similarity; each candidate term’s normalized
score in each combined list is then summed and averaged over the number of semantic spaces in which
the term exists. The top ten terms from this list are returned.
• AVG→SUM: norm((RI_clinical + RP_clinical) / count_term) + norm((RI_medical + RP_medical) / count_term)
Each candidate term’s cosine similarity score for each pair of semantic spaces is first summed and
averaged over the number of semantic spaces in that pair in which the term exists; these are then
normalized by using the ranking instead of the cosine similarity; each candidate term’s normalized
score in each combined list is finally summed. The top ten terms from this list are returned.
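The single-step options can be sketched generically as one function, where `normalize` implements the ranking replacement and `average` the AVG division; the two-step strategies then amount to applying the same operation pairwise first. A simplified sketch under those assumptions:

```python
def combine_spaces(outputs, normalize=False, average=False, n=10):
    """Combine candidate lists from several semantic spaces.

    outputs:   one {term: cosine} dict per semantic space.
    normalize: replace cosines with rankings (0 = best) before combining.
    average:   divide each term's total by the number of spaces it occurs in.
    """
    if normalize:
        outputs = [{t: rank for rank, t in
                    enumerate(sorted(out, key=out.get, reverse=True))}
                   for out in outputs]
    totals, counts = {}, {}
    for out in outputs:
        for term, score in out.items():
            totals[term] = totals.get(term, 0.0) + score
            counts[term] = counts.get(term, 0) + 1
    if average:
        totals = {t: s / counts[t] for t, s in totals.items()}
    # with rankings, lower is better; with cosines, higher is better
    return sorted(totals, key=totals.get, reverse=not normalize)[:n]
```

Note the sort direction: after normalization the combined list is in ascending order, since zero is the best possible score.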
Post-Processing of Candidate Terms
In addition to creating ensembles of semantic spaces, simple filtering rules are designed and evaluated for
their ability to further enhance performance on the task of extracting synonyms and abbreviation-expansion
pairs. For obvious reasons, this is easier for abbreviation-expansion pairs than for synonyms.
With regards to abbreviation-expansion pairs, the focus is on increasing precision by discarding poor
suggestions in favor of potentially better ones. This is attempted by exploiting properties of the abbreviations
and their corresponding expansions. The development subset of the reference standard (see Evaluation
Framework) is used to construct rules that determine the validity of candidate terms. For an abbreviation-expansion pair to be considered valid, each letter in the abbreviation has to be present in the expansion and
the letters also have to appear in the same order. Additionally, the length of abbreviations and expansions
is restricted, requiring an expansion to contain more than four letters, whereas an abbreviation is allowed
to contain a maximum of four letters. These rules are shown in (2) and (3).
For synonym extraction, cut-off values for rank and cosine similarity are instead employed. These cut-off
values are tuned to maximize precision for the best semantic space combinations in the development subset
of the reference standard, without negatively affecting recall (see Figures 2-4). The cut-off values used are shown
in (4) for the clinical corpus, in (5) for the medical corpus and in (6) for the combination of the two corpora.
In (6), Cos denotes the combination of the cosine values, which means that it has a maximum value of four
rather than one.
Exp→Abbr = True, if (Len < 5) ∧ (Sub_out = True); False, otherwise    (2)

Abbr→Exp = True, if (Len > 4) ∧ (Sub_in = True); False, otherwise    (3)

Syn_clinical = True, if (Cos ≥ 0.60) ∨ (Cos ≥ 0.40 ∧ Rank < 9); False, otherwise    (4)

Syn_medical = True, if (Cos ≥ 0.50); False, otherwise    (5)

Syn_clinical+medical = True, if (Cos ≥ 1.9) ∨ (Cos ≥ 1.8 ∧ Rank < 6) ∨ (Cos ≥ 1.75 ∧ Rank < 3); False, otherwise    (6)
Cos: Cosine similarity between candidate term and query term.
Rank: The ranking of the candidate term, ordered by cosine similarity.
Sub_out: Whether each letter in the candidate term is present in the query term, in the same order and with identical initial letters.
Sub_in: Whether each letter in the query term is present in the candidate term, in the same order and with identical initial letters.
Len: The length of the candidate term.
The post-processing filtering rules are employed in two different ways. In the first approach, the semantic
spaces are forced to suggest a predefined number of candidate terms (ten), irrespective of how good they are
deemed to be by the semantic space. Candidate terms are retrieved by the semantic space until ten have
been classified as correct according to the post-processing rules, or until one hundred candidate terms have
been classified. If fewer than ten are classified as correct, the highest ranked discarded terms are used to
populate the remaining slots in the final list of candidate terms. In the second approach, the semantic spaces
are allowed to suggest a dynamic number of candidate terms, with a minimum of one and a maximum of
ten. If none of the highest ranked terms are classified as correct, the highest ranked term is suggested.
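Both application modes can be sketched around a validity predicate such as the rules above (a sketch; parameter names are ours):

```python
def apply_filter(ranked, is_valid, n=10, max_checked=100, dynamic=False):
    """Apply post-processing rules to a ranked candidate list.

    Fixed mode: always return n terms, back-filling with the highest
    ranked rejected ones if fewer than n pass. Dynamic mode: return only
    accepted terms, falling back to the single highest ranked term if
    none pass."""
    accepted, rejected = [], []
    for term in ranked[:max_checked]:
        (accepted if is_valid(term) else rejected).append(term)
        if len(accepted) == n:
            break
    if dynamic:
        return accepted if accepted else ranked[:1]
    return (accepted + rejected)[:n]
```

The fixed mode can only reorder the output (helping recall at a fixed list length), whereas the dynamic mode shortens the output and thus primarily affects precision.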
Evaluation Framework
Evaluation of the numerous experiments is carried out with the use of reference standards: one contains
known abbreviation-expansion pairs and the other contains known synonyms. The semantic spaces and their
various combinations are evaluated for their ability to extract known abbreviations/expansions (abbr→exp
and exp→abbr) and synonyms (syn) – according to the employed reference standard – for a given query
term in a list of ten candidate terms (recall top 10). Recall is prioritized in this study and any decisions,
such as deciding which model parameters or which combination strategies are the most profitable, are solely
based on this measure. When precision is reported, it is calculated as weighted precision, where the weights
are assigned according to the ranking of a correctly identified term.
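Recall top 10 is straightforward to sketch. For weighted precision, the thesis states only that the weights follow the ranking of a correctly identified term; the 1/rank weights below are therefore our assumption, purely for illustration.

```python
def recall_top_n(suggestions, expected, n=10):
    """Proportion of expected terms found among the top-n suggestions."""
    if not expected:
        return 0.0
    return sum(t in suggestions[:n] for t in expected) / len(expected)

def weighted_precision(suggestions, expected, n=10):
    """Rank-weighted precision. ASSUMPTION: weight 1/rank; the exact
    weighting scheme is not specified in the text."""
    weights = [1.0 / rank for rank in range(1, n + 1)]
    hit = sum(w for w, t in zip(weights, suggestions[:n]) if t in expected)
    return hit / sum(weights)
```

Under such a scheme, a correct term at rank 1 contributes far more than one at rank 10, rewarding systems that place correct candidates near the top of the list.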
The reference standard for abbreviations is taken from Cederblom [47], which is a book that contains
lists of medical abbreviations and their corresponding expansions. These abbreviations have been manually
collected from Swedish health records, newspapers, scientific articles, etc. For the synonym extraction task,
the reference standard is derived from the freely available part of the Swedish version of MeSH [48] – a part
of UMLS – as well as a Swedish extension that is not included in UMLS [49]. As the semantic spaces are
constructed only to model unigrams, all multiword expressions are removed from the reference standards.
Moreover, hypernym/hyponym and other non-synonym pairs found in the UMLS version of MeSH are
manually removed from the reference standard for the synonym extraction task. Models of distributional
semantics sometimes struggle to model the meaning of rare terms accurately, as the statistical basis for
their representation is insufficiently solid. As a result, we only include term pairs that occur at least fifty
times in each respective corpus. This, together with the fact that term frequencies differ from corpus to
corpus, means that there is one set of reference standards for each corpus. For evaluating combinations of
semantic spaces induced from different corpora, a third set of reference standards is created, in which only
term pairs that occur at least fifty times in both corpora are included. Included terms are not restricted to
form pairs; in the reference standard for the synonym extraction task, some form larger groups of terms with
synonymous relations. There are also abbreviations with several possible expansions, as well as expansions
with several possible abbreviations. The term pairs (or n-tuples) in each reference standard are randomly
split into a development set and an evaluation set of roughly equal size. The development sets are used for
identifying the most profitable ensembles of semantic spaces (with optimized parameter settings, such as
window size and whether to include stop words in the RP spaces) for each of the three tasks, as well as for
creating the post-processing filtering rules. The evaluation sets are used for the final evaluation to assess the
expected performance of the ensembles in a deployment setting. Baselines for the single-corpus ensembles
are created by employing RI and RP in isolation; baselines for the multiple-corpora ensembles are created
by using the most profitable clinical and medical ensembles from the single-corpus experiments. Statistics
for the reference standards are shown in Table 3.
Results
The experimental setup was designed in such a manner that the semantic spaces that performed best in
combination for a single corpus would also be used in the subsequent combinations from multiple corpora.
Identifying the most profitable combination strategy for each of the three tasks was achieved using the development subsets of the reference standards. These combinations were then evaluated on separate evaluation
sets containing unseen data. All further experiments, including the post-processing of candidate terms, were
carried out with these combinations on the evaluation sets. This is therefore also the order in which the
results will be presented.
Combination Strategies: A Single Corpus
The first step involved identifying the most appropriate window sizes for each task, in conjunction with
evaluating the combination strategies. The reason for this is that the optimal window sizes for RI and RP
in isolation are not necessarily identical to the optimal window sizes when RI and RP are combined. In fact,
when RI is used in isolation, a window size of 2+2 performs best on the two abbreviation-expansion tasks,
and a window size of 10+10 performs best on the synonym task. For RP, a semantic space with a window
size of 2+2 yields the best results on two of the tasks – abbr→exp and syn – while a window size of 4+4 is
more successful on the exp→abbr task. These are the model configurations used in the RI and RP baselines,
to which the single-corpus combination strategies are compared in the final evaluation.
Using the semantic spaces induced from the clinical corpus, the RI + RP combination strategy, whereby
the cosine similarity scores are merely summed, is the most successful on all three tasks: 0.42 recall on the
abbr→exp task, 0.32 recall on the exp→abbr task, and 0.40 recall on the syn task (Table 4). For the
abbreviation expansion task, a window size of 2+2 appears to work well for both models, with the RP space
retaining stop words. On the task of identifying the abbreviated form of an expansion, semantic spaces with
window sizes of 2+2 and 4+4 perform equally well; the RP spaces should include stop words. Finally, on
the synonym extraction task, an RI space with a large context window (10+10) in conjunction with an RP
space with stop words and a window size of 2+2 is the most profitable.
Using the semantic spaces induced from the medical corpus, again, the RI + RP combination strategy
outperforms the RI ∩ RP30 and RP ∩ RI30 strategies: 0.10 recall on the abbr→exp task, 0.08 recall on the
exp→abbr task, and 0.30 recall on the syn task are obtained (Table 5). This combination beats the other
two by a large margin on the exp→abbr task: 0.08 recall compared to 0.03 recall. The most appropriate
window sizes for capturing these phenomena in the medical corpus are fairly similar to those that worked
best with the clinical corpus. On the abbr→exp task, the optimal window sizes are indeed identical across
the two corpora: a 2+2 context window with an RP space that incorporates stop words yields the highest
performance. For the exp→abbr task, a slightly larger context window of 4+4 seems to work well – again,
with stop words retained in the RP space. Alternatively, combining a large RI space (10+10) with a smaller
RP space (2+2, with stop words) performs comparably on this task and with this test data. Finally, for
synonyms, a large RI space (10+10) with a very small RP space (1+1) that retains all words best captures
this phenomenon with this type of corpus.
The best-performing combinations from each corpus and for each task were then treated as (ensemble)
baselines in the final evaluation, where combinations of semantic spaces from multiple corpora are evaluated.
Combination Strategies: Multiple Corpora
The pair of semantic spaces from each corpus that performed best on the three tasks were subsequently
employed in combinations that involved four semantic spaces – two from each corpus: one RI space and
one RP space. The single-step approaches generally performed better than the two-step approaches, with
some exceptions (Table 6). The most successful ensemble was a simple single-step approach, where the
cosine similarity scores produced by each semantic space were simply summed (SUM), yielding 0.32 recall
for abbr→exp, 0.17 recall for exp→abbr, and 0.52 recall for syn. The AVG option, although the second-highest performer on the abbreviation-expansion tasks, yielded significantly poorer results. Normalization,
whereby ranking was used instead of cosine similarity, invariably affected performance negatively, especially
when employed in conjunction with SUM. The two-step approaches performed significantly worse than all
non-normalized single-step approaches, with the sole exception taking place on the synonym extraction task.
It should be noted that normalization was always performed in the two-step approaches – this was done after
each pair of semantic spaces from a single corpus had been combined. Of the four two-step combination
strategies, AVG→AVG and AVG→SUM performed best, with identical recall scores on the three tasks.
Final Evaluations
The combination strategies that performed best on the development sets were finally evaluated on completely
unseen data in order to assess their generalizability to new data and to assess their expected performance
in a deployment setting. Each evaluation phase involves comparing the results to one or more baselines: in
the case of single-corpus combinations, the comparisons are made to RI and RP in isolation; in the case
of multiple-corpora combinations, the comparisons are made to a clinical ensemble baseline and a medical
ensemble baseline.
When applying the single-corpus combinations from the clinical corpus, the following results were obtained: 0.31 recall on abbr→exp, 0.20 recall on exp→abbr, and 0.44 recall on syn (Table 7). Compared to the
results on the development sets, the results on the two abbreviation-expansion tasks decreased by approximately ten percentage points; on the synonym extraction task, the performance increased by a couple of
percentage points. The RI baseline was outperformed on all three tasks; the RP baseline was outperformed
on two out of three tasks, with the exception of the exp→abbr task. Finally, it might be interesting to point
out that the RP baseline performed better than the RI baseline on the two abbreviation-expansion tasks,
but that the RI baseline did somewhat better on the synonym extraction task.
With the medical corpus, the following results were obtained: 0.17 recall on abbr→exp, 0.11 recall on
exp→abbr, and 0.34 recall on syn (Table 8). Compared to the results on the development sets, the results
were higher for all three tasks. Both the RI and RP baselines were outperformed, with a considerable
margin, by their combination. In complete contrast to the clinical corpus, the RI baseline here beat the RP
baseline on the two abbreviation-expansion tasks, but was outperformed by the RP baseline on the synonym
extraction task.
When applying the multiple-corpora ensembles, the following results were obtained on the evaluation
sets: 0.30 recall on abbr→exp, 0.19 recall on exp→abbr, and 0.47 recall on syn (Table 9). Compared to the
results on the development sets, the results decreased somewhat on two of the tasks, with exp→abbr the
exception. The two ensemble baselines were clearly outperformed by the larger ensemble of semantic spaces
from two types of corpora on two of the tasks; the clinical ensemble baseline performed equally well on
the exp!abbr task. It might be of some interest to compare the two ensemble baselines here since, in this
setting, they can more fairly be compared: they are evaluated on identical test sets. The clinical ensemble
baseline quite clearly outperformed the medical ensemble baseline on the two abbreviation-expansion tasks;
however, on the synonym extraction task, their performance was comparable.
Post-Processing
In an attempt to further improve results, simple post-processing of the candidate terms was performed. In
one setting, the system was forced to suggest ten candidate terms regardless of their cosine similarity score
or other properties of the terms, such as their length. In another setting, the system had the option of suggesting a dynamic number – ten or fewer – of candidate terms.
This was, unsurprisingly, more effective on the two abbreviation-expansion tasks. With the clinical corpus, recall improved substantially with the post-processing filtering: from 0.31 to 0.42 on abbr→exp and from 0.20 to 0.33 on exp→abbr (Table 7). With the medical corpus, however, almost no improvements were observed for these tasks (Table 8). With the combination of semantic spaces from the two corpora, significant gains were once more made by filtering the candidate terms on the two abbreviation-expansion tasks: from 0.30 to 0.39 recall and from 0.05 to 0.07 precision on abbr→exp; from 0.19 to 0.33 recall and from 0.03 to 0.06 precision on exp→abbr (Table 9). With a dynamic cut-off, only precision could be improved, although at the risk of negatively affecting recall. With the clinical corpus, recall was largely unaffected for the two abbreviation-expansion tasks, while precision improved by 3–7 percentage points (Table 7). With the medical corpus, the gains were even more substantial: from 0.03 to 0.17 precision on abbr→exp and from 0.02 to 0.10 precision on exp→abbr, without any impact on recall (Table 8). The greatest improvements on these tasks were, however, observed with the combination of semantic spaces from multiple corpora: precision increased from 0.07 to 0.28 on abbr→exp and from 0.06 to 0.31 on exp→abbr, again without affecting recall (Table 9).
In the case of synonyms, this form of post-processing is more challenging, as there are no simple properties of the terms, such as their length, that can serve as indications of their quality as candidate synonyms. Instead, one has to rely on their use in different contexts and grammatical properties; as a result, cosine similarity and ranking of the candidate terms were exploited in an attempt to improve the candidate synonyms. This approach was, however, clearly unsuccessful for both corpora and their combination, with almost no impact on either precision or recall. In a single instance – with the clinical corpus – precision increased by one percentage point, albeit at the expense of recall, which suffered a comparable decrease (Table 7). With the combination of semantic spaces from two corpora, the dynamic cut-off option resulted in a lower recall score, without improving precision (Table 9).
Frequency Thresholds
In order to study the impact of different frequency thresholds – i.e., how often each pair of terms had to occur in the corpora to be included in the reference standard – on the synonym extraction task, the best ensemble system was applied to a range of evaluation sets with thresholds from 1 to 100 (Figure 5). With a low frequency threshold, results are clearly lower: for instance, if each synonym pair only needs to occur at least once in both corpora, a recall of 0.17 is obtained. As the threshold is increased, recall increases too, up to a frequency threshold of around 50, after which no further performance gains are observed; already at a threshold of around 30, the results seem to level off. With frequency thresholds over 100, there is not enough data in this case to produce reliable results.
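The thresholding described above can be sketched as follows. This is a hypothetical helper, not the study's actual code; the function name and the toy frequency dictionaries in the example are assumptions.

```python
# Hypothetical sketch of applying a frequency threshold to the reference
# standard: a synonym pair is included only if both of its terms occur at
# least `threshold` times in both corpora.
def reference_pairs(pairs, clin_freq, med_freq, threshold=50):
    """Filter (term_a, term_b) pairs by minimum per-corpus term frequency."""
    def min_freq(term):
        return min(clin_freq.get(term, 0), med_freq.get(term, 0))
    return [(a, b) for a, b in pairs if min(min_freq(a), min_freq(b)) >= threshold]
```

Raising the threshold shrinks the evaluation set but, as reported above, improves recall up to roughly 30–50 occurrences.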
Discussion
The results clearly demonstrate that combinations of semantic spaces lead to improved results on all three tasks. Combining random indexing and random permutation allows slightly different aspects of lexical semantics to be captured; by combining them, stronger semantic relations between terms are extracted, thereby increasing the performance on these tasks. Combining semantic spaces induced from different corpora further improves performance. This demonstrates the potential of distributional ensemble methods, of which this – to the best of our knowledge – is the first implementation of its kind, and it only scratches the surface. In this initial study, only four semantic spaces were used; however, with increasing computational capabilities, there is nothing stopping a much larger number of semantic spaces from being combined. These could capture various aspects of semantics – aspects which may be difficult, if not impossible, to incorporate into a single model – from a large variety of observational data on language use, where the contexts may be very different.
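To make the distinction between the two models concrete, the following NumPy sketch induces toy RI and RP spaces from a tokenized corpus. It is a minimal approximation under assumed parameters (sparse ternary index vectors, a symmetric context window), not the JavaSDM implementation used in the study; RP differs from RI only in rotating each index vector by the neighbour's relative position before adding it.

```python
# Minimal sketch of Random Indexing (RI) and Random Permutation (RP);
# dimensionality, window size and sparsity are illustrative assumptions.
import numpy as np

def index_vector(rng, dim=1000, nonzeros=8):
    """Sparse ternary index vector: a few randomly placed +1/-1 elements."""
    v = np.zeros(dim)
    pos = rng.choice(dim, size=nonzeros, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=nonzeros)
    return v

def induce_spaces(sentences, dim=1000, window=2, seed=0):
    rng = np.random.default_rng(seed)
    vocab = {w for s in sentences for w in s}
    idx = {w: index_vector(rng, dim) for w in vocab}  # static index vectors
    ri = {w: np.zeros(dim) for w in vocab}            # RI context vectors
    rp = {w: np.zeros(dim) for w in vocab}            # RP context vectors
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j == i:
                    continue
                ri[w] += idx[s[j]]                    # RI ignores word order
                rp[w] += np.roll(idx[s[j]], j - i)    # RP encodes relative position
    return ri, rp

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0
```

Because order information is encoded only in the RP space, the two spaces capture different aspects of the same co-occurrence data, which is what makes their combination worthwhile.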
Clinical vs. Medical Corpora
When employing corpus-driven methods to support lexical resource development, one naturally needs to
have access to a corpus in the target domain that reflects the language use one wishes to model. Hence,
one cannot, without due qualification, state that one corpus type is better than another for the extraction
of synonyms or abbreviation-expansion pairs. This is something that needs to be duly considered when
comparing the results for the semantic spaces induced from the clinical and medical corpora, respectively. Another issue concerns corpus size: the medical corpus is, in fact, only half the size of the clinical corpus (Table 1). The reference standards used in the respective experiments are, however, not identical:
each term pair had to occur at least fifty times to be included, which will differ across corpora. To some extent this mitigates the effect of the total corpus size and makes the comparison between the two corpora fairer; however, differences in reference standards also entail that the results are not directly comparable. Another difference between the two corpora is that the clinical corpus contains more unique terms (types) than the medical corpus, which might indicate that it covers a larger number of concepts. It has previously been shown that it can be beneficial, indeed important, to employ a larger dimensionality when using corpora with a large vocabulary, as is typically the case in the clinical domain [50]; in this study, a dimensionality of 1,000 was used to induce all semantic spaces. Despite this, the results seem to indicate that better performance is generally obtained with the semantic spaces induced from the clinical corpus. A fairer comparison can be made between the two ensemble baselines (Table 9): on the synonym extraction task, the results are comparable, while the clinical ensemble performs much better on the abbreviation-expansion tasks. A possible explanation for the latter is that abbreviations are defined in the medical corpus and are thus not used interchangeably with their expansions to the same extent as in clinical text. This implies that such abbreviation-expansion pairs do not have a paradigmatic relation; models of syntagmatic relations are perhaps better suited to capturing abbreviation-expansion pairs in this type of text.
An advantage of using non-sensitive corpora, like the medical corpus employed in this study, is that they are generally more readily obtainable than sensitive clinical data. Such sources could perhaps complement smaller clinical corpora, yielding similar or potentially even better results.
Combining Semantic Spaces
Creating ensembles of semantic spaces has been shown to be profitable, at least on the tasks of extracting synonyms and abbreviation-expansion pairs. In this study, the focus has been on combining the output of the semantic spaces. This is probably the most straightforward approach and it has several advantages. For one, the manner in which the semantic representations are created can largely be ignored, which would potentially allow one to combine models that are very different in nature, as long as one can retrieve a ranked list of semantically related terms with a measure of the strength of the relation. It also means that one can readily combine semantic spaces that have been induced with different parameter settings, for instance with different context definitions and of different dimensionality. An alternative approach would be to combine semantic spaces on a vector level. Such an approach would be interesting to explore; however, it would pose numerous challenges, not least in combining context vectors that have been constructed differently and potentially represent meaning in disparate ways.
Several combination strategies were designed and evaluated. In both the single-corpus and multiple-corpora ensembles, the simplest strategy performed best: the one whereby the cosine similarity scores are summed. There are potential problems with such a strategy, since the similarity scores are not absolute measures of semantic relatedness, but merely relative and only valid within a single semantic space. The cosine similarity scores will, for instance, differ depending on the distributional model used and the size of the context window. An attempt was made to deal with this by replacing the cosine similarity scores with ranking information, as a means to normalize the output of each semantic space before combining them. This approach, however, yielded much poorer results. A possible explanation is that a measure of the semantic relatedness between terms is much more important than their ranking. After all, a list of the highest ranked terms does not necessarily imply that they are semantically similar to the query term; only that they are the most semantically similar in this space.
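The winning strategy, summing cosine similarity scores across spaces, can be sketched as follows; the helper name and the toy outputs in the example are illustrative assumptions, not the study's code.

```python
# Minimal sketch of combining ensemble outputs by summing cosine scores:
# each semantic space contributes its top-ranked candidates with their
# cosine scores, and the ensemble sums the scores per candidate.
from collections import defaultdict

def combine_by_score_sum(outputs, top_n=10):
    """outputs: one {candidate_term: cosine_score} dict per semantic space."""
    summed = defaultdict(float)
    for ranked in outputs:
        for term, score in ranked.items():
            summed[term] += score
    # Candidates suggested by several spaces accumulate evidence and rise.
    return sorted(summed, key=summed.get, reverse=True)[:top_n]
```

The rank-based alternative discussed above would replace each score with a function of the candidate's rank before summing, discarding exactly the relatedness information that this strategy preserves.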
To gain deeper insights into the process of combining the output of multiple semantic spaces, an error analysis was conducted on the synonym extraction task. This was done by comparing the outputs of the most profitable combination of semantic spaces from each corpus with those of the combination of semantic spaces from the two corpora. The error analysis was conducted on the development sets. Of the 68 synonyms that were correctly identified as such by the corpora combination, five were not extracted by either of the single-corpus combinations; nine were extracted by the medical ensemble but not by the clinical ensemble; as many as 51 were extracted by the clinical ensemble but not by its medical counterpart; this means that only three terms were extracted by both the clinical and medical ensembles. These results strengthen the case for multiple-corpora ensembles. There appears to be little overlap in the top-10 outputs of the corpus-specific ensembles; by combining them, 17 additional true synonyms are extracted compared to the clinical ensemble alone. Moreover, the fact that so many synonyms are extracted by the clinical ensemble demonstrates the importance of exploiting clinical corpora and the applicability of distributional semantics to this genre of text. In Table 10, the first two examples, sjukhem (nursing home) and depression, show cases for which the multiple-corpora ensemble was successful but the single-corpus ensembles were not. In the third example, both the multiple-corpora ensemble and the clinical ensemble extract the expected synonym candidate.
There was one query term – the drug name omeprazol – for which both single-corpus ensembles were
able to identify the synonym, but where the multiple-corpora ensemble failed. There were also three query
terms for which synonyms were identified by the clinical ensemble, but not by the multiple-corpora ensemble;
there were five query terms that were identified by the medical ensemble, but not by the multiple-corpora
ensemble. This shows that combining semantic spaces can also, in some cases, introduce noise.
Since synonym pairs were queried both ways, i.e. each term in the pair was queried to see if the other could be identified, we wanted to see if there were cases where the choice of query term mattered. Indeed, among the sixty query terms for which the expected synonym was not extracted, this was the case in fourteen instances. For example, given the query term blindtarmsinflammation ("appendix-inflammation"), the expected synonym appendicit (appendicitis) was given as a candidate, whereas with the query term appendicit, the expected synonym was not successfully identified.
Models of distributional semantics face the problem of modeling ambiguous terms with multiple meanings. This is, for instance, the case with the polysemous term arv (referring to inheritance as well as to heredity). Distant synonyms also seem to be problematic, e.g. the pair rehabilitation/habilitation. For approximately a third of the synonym pairs that are not correctly identified, however, it is not evident that they belong to either of these two categories.
Post-Processing
In an attempt to improve results further, an additional step in the proposed method was introduced: filtering of the candidate terms, with the possibility of extracting new, potentially better ones. For the extraction of abbreviation-expansion pairs, this was fairly straightforward, as there are certain patterns that generally apply to this phenomenon, such as the fact that the letters in an abbreviation are contained – in the same order – in its expansion. Moreover, expansions are longer than abbreviations. This allowed us to construct simple yet effective rules for filtering out unlikely candidate terms for these two tasks. As a result, both precision and recall increased; with a dynamic cut-off, precision improved significantly. Although our focus in this study was primarily on maximizing recall, there is a clear incentive to improve precision as well. If this method were to be used for terminological development support, with humans inspecting the candidate terms, minimizing the number of poor candidate terms has clear value. However, given the seemingly easy task of filtering out unlikely candidates, it is perhaps surprising that the results were not even better. Part of the reason may stem from the problem of semantically overloaded types, which affects abbreviations to a large degree, particularly in the clinical domain, with its telegraphic style and abundance of ad-hoc abbreviations. This was also reflected in the reference standard, as in some cases the most common expansion of an abbreviation was not included.
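Filtering rules of the kind described above can be sketched as a subsequence check; this is an illustrative reconstruction of the described patterns, not the study's exact rule set.

```python
# Sketch of the filtering patterns described above: a candidate counts as a
# plausible expansion only if it is longer than the abbreviation and contains
# the abbreviation's letters in the same order.
def is_plausible_expansion(abbr, candidate):
    if len(abbr) >= len(candidate):
        return False
    remaining = iter(candidate.lower())
    # `ch in remaining` consumes the iterator up to the first match,
    # so this checks for an in-order (subsequence) occurrence.
    return all(ch in remaining for ch in abbr.lower())

def filter_expansions(abbr, candidates):
    """Discard candidate terms that cannot be expansions of `abbr`."""
    return [c for c in candidates if is_plausible_expansion(abbr, c)]
```

With a fixed cut-off, discarded candidates are replaced by the next-ranked terms that pass the filter; with a dynamic cut-off, they are simply dropped, which is why only precision can improve in that setting.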
The post-processing filtering of synonyms clearly failed. Although ranking information and, especially, cosine similarity provide some indication of the quality of synonym candidates, employing cut-off values with these features cannot possibly improve recall: new candidates will always have a lower ranking and a lower cosine similarity score than discarded candidate terms. It can, however – at least in theory – improve precision when using these rules in conjunction with a dynamic cut-off, i.e. allowing fewer than ten candidate terms to be suggested. In this case, however, the rules did not have this effect.
Thresholds
Increasing the frequency threshold further did not improve results; in fact, a threshold of 30 occurrences in both corpora seems to be sufficient. The need for a high frequency threshold is a limitation of distributional methods; the ability to use a lower threshold is thus important, especially in the clinical domain, where data is difficult to obtain.
The choice of evaluating recall among ten candidates was based on an estimation of the number of candidate terms that would be reasonable to present to a lexicographer for manual inspection. Recall might improve if more candidates were presented, but this would likely come at the expense of decreased usability. It might instead be more relevant to further limit the number of candidates presented. As shown in Figure 4, there are only a few correct synonyms among the candidates ranked 6–10. By using more advanced post-processing techniques and/or accepting a slight sacrifice in recall, it would be possible to present fewer candidates for manual inspection, thereby potentially increasing usability.
Reflections on Evaluation
To make it feasible to compare a large number of semantic spaces and their various combinations, fixed reference standards derived from terminological resources were used for evaluation, instead of manual classification of candidate terms. One of the motivations for the current study, however, is that terminological resources are seldom complete; they may also reflect a desired use of language rather than actual use. The results in this study thus reflect the extent to which different semantic spaces – and their combinations – are able to extract synonymous relations that have been considered relevant according to specific terminologies, rather than the extent to which they capture the phenomenon of synonymy. This is, for instance, illustrated by the query term depression in Table 10, for which one potential synonym is extracted by the clinical ensemble – depressivitet ("depressitivity") – and another by the medical ensemble: depressionsjukdom (depressive illness). Although these terms might not be formal or frequent enough to be included in all types of terminologies, they are highly relevant candidates for inclusion in terminologies intended for text mining. Neither of these two terms is, however, counted
as correct synonyms, and only the multiple-corpora ensemble is able to find the synonym included in the
terminology. Future studies would therefore benefit from a manual classification of retrieved candidates,
with the aim of also finding synonyms that are not already included in current terminologies.
The choice of terminological resources to use as reference standards was originally based on their appropriateness for evaluating semantic spaces induced from the clinical corpus. For evaluating the extraction of abbreviation-expansion pairs with semantic spaces induced from the medical corpus, however, the chosen resources – in conjunction with the requirement that terms occur at least fifty times in the corpus – were less appropriate, as this resulted in a very small reference standard. It should therefore be noted that the results for this part of the study are less reliable, especially for the experiments with the multiple-corpora ensembles on the two abbreviation-expansion tasks.
For synonyms, the number of instances in the reference standard is, of course, smaller for the experiments
with multiple-corpora ensembles than for the single-corpus experiments. However, when evaluating the final
results with different frequency thresholds, similar results are obtained when lowering the threshold and, as
a result, including more evaluation instances. With a threshold of twenty occurrences, 306 input terms are
evaluated, which results in a recall of 0.42; with a threshold of thirty occurrences and 222 query terms, a
recall of 0.46 is obtained. For experiments that involve the medical corpus, more emphasis should therefore
be put on the synonym part of the study; when assessing the potential of the ensemble method on the
abbreviation-expansion tasks, the results for the clinical ensemble are more reliable.
As the focus of this study was on synonyms and abbreviation-expansion pairs, the reference standards used naturally fail to reflect the fact that many of the suggested candidates were nevertheless reasonable, e.g. spelling variants, which would be desirable to identify from an information extraction perspective. Many of the suggestions were, however, non-synonymous terms belonging to the same semantic class as the query term, such as drugs, family relations, occupations and diseases. Hyponym/hypernym relations were also frequent, as exemplified by the query term allergy in Table 10, which yields a number of hyponyms referring to specific allergies as candidates.
Future Work
Now that this first step has been taken towards creating ensembles of semantic spaces, this notion should
be explored in greater depth and taken further. It would, for instance, be interesting to combine a larger
number of semantic spaces, possibly including those that have been more explicitly modeled with syntactic
information. To verify the superiority of this approach, it should be compared to the performance of a single
semantic space that has been induced from multiple corpora.
Further experiments should likewise be conducted with combinations involving a larger number of corpora
(types). One could, for instance, combine a professional corpus with a layman corpus – e.g. a corpus of
extracts from health-related fora – in order to identify layman expressions for medical terms. This could
provide a useful resource for automatic text simplification.
Another technique that could potentially be used to identify term pairs with a higher degree of semantic similarity is to require that the two terms have each other as their closest neighbors in the semantic space. As our error analysis showed, this is not always the case. Such a requirement could perhaps improve performance on the tasks of extracting synonyms and abbreviation-expansion pairs.
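The suggested mutual-neighbour criterion might look as follows; the helper and the toy neighbour lists in the example are assumptions for illustration.

```python
# Hedged sketch of a mutual-nearest-neighbour check: a term pair is kept
# only when each term is the other's top-ranked neighbour.
# `neighbours` maps a term to its ranked candidate list, best first.
def mutual_nearest(term_a, term_b, neighbours):
    def nearest(term):
        ranked = neighbours.get(term)
        return ranked[0] if ranked else None
    return nearest(term_a) == term_b and nearest(term_b) == term_a
```

This is a stricter variant of the two-way querying already used in the evaluation: instead of accepting a hit in either direction, both directions must agree on the top rank.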
A limitation of the current study – in the endeavor to create a method that can help to account for the problem of language use variability – is that the semantic spaces were constructed to model only unigrams. Textual instantiations of the same concept can, however, vary in term length. This needs to be accounted for in a distributional framework and concerns paraphrasing in general rather than synonymy in particular. Combining unigram spaces with multiword spaces is one possibility that could be explored. This would also make the method applicable to acronym expansion.
Conclusions
This study demonstrates that combinations of semantic spaces yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. First, combining two distributional models – random indexing and random permutation – on a single corpus enables different aspects of lexical semantics to be captured and effectively increases the quality of the extracted candidate terms, outperforming the use of either model in isolation. Second, combining distributional models and types of corpora – a clinical corpus, comprising health record narratives, and a medical corpus, comprising medical journal articles – improves results further, outperforming ensembles of semantic spaces induced from a single source. We hope that this study opens up avenues of exploration for applying the ensemble methodology to distributional semantics.
Semantic spaces can be combined in numerous ways. In this study, the approach was to combine the outputs of the semantic spaces, i.e. ranked lists of terms semantically related to a given query term. How this should best be done is not wholly intuitive. By exploring a variety of combination strategies, we found that the best results were achieved by simply summing the cosine similarity scores provided by the distributional models.
On the task of extracting abbreviation-expansion pairs, significant performance gains were obtained by
applying a number of simple post-processing rules to the list of candidate terms. By filtering out unlikely
candidates based on simple patterns and retrieving new ones, both recall and precision were improved by a
large margin.
Authors’ Contributions
AH was responsible for coordinating the study and was thus involved in all parts of it. AH was responsible
for the overall design of the study and for carrying out the experiments. AH initiated the idea of combining
semantic spaces induced from different corpora and implemented the evaluation and post-processing modules.
AH also had the main responsibility for the manuscript and drafted parts of the background and results
description.
HM and MS contributed equally to the study. HM initiated the idea of combining Random Indexing
and Random Permutation and was responsible for designing and implementing strategies for combining the
output of multiple semantic spaces. HM also drafted parts of the method description in the manuscript.
MS initiated the idea of applying the method to abbreviation-expansion extraction and to different types
of corpora. MS was responsible for designing the evaluation part of the study, as well as for preparing the
reference standards. MS also drafted parts of the background and method description in the manuscript.
VD, together with MS, was responsible for designing the post-processing filtering of candidate terms.
MD provided feedback on the design of the study and drafted parts of the background section in the
manuscript.
AH, HM and MS analyzed the results and drafted the discussion in the manuscript. All authors
reviewed the manuscript.
Acknowledgements
This work was partly (AH) supported by the Swedish Foundation for Strategic Research through the project
High-Performance Data Mining for Drug Effect Detection (ref. no. IIS11-0053) at Stockholm University,
Sweden. It was also partly (HM) supported by the Research Council of Norway through the project EviCare
- Evidence-based care processes: Integrating knowledge in clinical information systems (NFR project no.
193022). We would like to thank the members of our former research network HEXAnord, within which this
study was initiated. We would especially like to thank Ann-Marie Eklund for her contributions to the initial
stages of this work. We are also grateful to Staffan Cederblom and Studentlitteratur for giving us access to
their database of medical abbreviations.
References
1. Meystre SM, Savova GK, Kipper-Schuler KC, Hurdle JF: Extracting information from textual documents
in the electronic health record: a review of recent research. Yearb Med Inform 2008, :128–144.
2. Saeed JI: Semantics. Oxford: Blackwell Publishers 1997.
3. Leroy G, Chen H: Meeting medical terminology needs – the ontology-enhanced Medical Concept Mapper. IEEE Transactions on Information Technology in Biomedicine 2001, 5(4):261–270.
4. Leroy G, Endicott JE, Mouradi O, Kauchak D, Just ML: Improving perceived and actual text difficulty for
health information consumers using semi-automated methods. AMIA Annu Symp Proc 2012, 2012:522–
31.
5. Eriksson R, Jensen PB, Frankild S, Jensen LJ, Brunak S: Dictionary construction and identification of
possible adverse drug events in Danish clinical narrative text. J Am Med Inform Assoc 2013.
6. Liu H, Aronson AR, Friedman C: A study of abbreviations in MEDLINE abstracts. Proc AMIA Symp
2002, :464–8.
7. Keselman A, Slaughter L, Arnott-Smith C, Kim H, Divita G, Browne A, Tsai C, Zeng-Treitler Q: Towards
Consumer-Friendly PHRs: Patients’ Experience with Reviewing Their Health Records. In Proc.
AMIA Annual Symposium 2007:399–403.
8. Uzuner Ö, South B, Shen S, DuVall S: 2010 i2b2/VA challenge on concepts, assertions, and relations
in clinical text. J Am Med Inform Assoc 2011, 18(5):552–556.
9. Cohen AM, Hersh WR: A survey of current work in biomedical text mining. Briefings in Bioinformatics
2005, 6:57–71, [http://bib.oxfordjournals.org/content/6/1/57.abstract].
10. Dumais S, Landauer T: A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review 1997, 104(2):211–240.
11. Hearst M: Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of COLING
1992 1992:539–545.
12. Blondel VD, Gajardo A, Heymans M, Senellart P, Dooren PV: A Measure of Similarity between Graph
Vertices: Applications to Synonym Extraction and Web Searching. SIAM Review 2004, 46(4):pp.
647–666, [http://www.jstor.org/stable/20453570].
13. Nakayama K, Hara T, Nishio S: Wikipedia mining for an association web thesaurus construction. In
Web Information Systems Engineering–WISE 2007, Springer 2007:322–334.
14. Dietterich TG: Ensemble Methods in Machine Learning. In Proceedings of the First International Workshop
on Multiple Classifier Systems, Springer-Verlag 2000:1–15.
15. Curran JR: Ensemble Methods for Automatic Thesaurus Extraction. In Proceedings of the Conference
on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics
2002:222–229.
16. Wu H, Zhou M: Optimizing synonym extraction using monolingual and bilingual resources. In Proceedings of the Second International Workshop on Paraphrasing - Volume 16, PARAPHRASE ’03, Stroudsburg,
PA, USA: Association for Computational Linguistics 2003:72–79, [http://dx.doi.org/10.3115/1118984.1118994].
17. van der Plas L, Tiedemann J: Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity. In Proceedings of the COLING/ACL on Main Conference Poster Sessions, COLING-ACL ’06, Stroudsburg, PA, USA: Association for Computational Linguistics 2006:866–873,
[http://dl.acm.org/citation.cfm?id=1273073.1273184].
18. Peirsman Y, Geeraerts D: Predicting strong associations on the basis of corpus data. In Proceedings
of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL ’09,
Stroudsburg, PA, USA: Association for Computational Linguistics 2009:648–656, [http://dl.acm.org/citation.
cfm?id=1609067.1609139].
19. Yu H, Agichtein E: Extracting Synonymous Gene and Protein Terms from Biological Literature. Bioinformatics 2003, 19(Suppl 1):i340–i349.
20. Cohen A, Hersh W, Dubay C, Spackman K: Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts. BMC Bioinformatics 2005, 6:103,
[http://www.biomedcentral.com/1471-2105/6/103].
21. McCrae J, Collier N: Synonym Set Extraction from the Biomedical Literature by Lexical Pattern
Discovery. BMC Bioinformatics 2008, 9:159.
22. Conway M, Chapman W: Discovering Lexical Instantiations of Clinical Concepts Using Web Services,
WordNet and Corpus Resources. In AMIA Annual Symposium Proceedings 2012:1604.
23. Zeng QT, Redd D, Rindflesch T, Nebeker J: Synonym, topic model and predicate-based query expansion
for retrieving clinical documents. In AMIA Annual Symposium Proceedings 2012:1050–1059.
24. Henriksson A, Skeppstedt M, Kvist M, Conway M, Duneld M: Corpus-Driven Terminology Development:
Populating Swedish SNOMED CT with Synonyms Extracted from Electronic Health Records. In
Proceedings of BioNLP 2013, forthcoming.
25. Schwartz AS, Hearst MA: A simple algorithm for identifying abbreviation definitions in biomedical text. Pacific Symposium on Biocomputing 2003:451–462.
26. Ao H, Takagi T: ALICE: An Algorithm to Extract Abbreviations from MEDLINE. Journal of the
American Medical Informatics Association 2005, 12(5):576–586.
27. Chang JT, Schütze H, Altman RB: Creating an online dictionary of abbreviations from MEDLINE. Journal of the American Medical Informatics Association 2002, 9:612–620.
28. Movshovitz-Attias D, Cohen WW: Alignment-HMM-based Extraction of Abbreviations from Biomedical Text. In Proceedings of the 2012 Workshop on Biomedical Natural Language Processing (BioNLP 2012),
Montréal, Canada: Association for Computational Linguistics 2012:47–55.
29. Dannélls D: Automatic Acronym Recognition. In Proceedings of the 11th conference on European chapter
of the Association for Computational Linguistics (EACL) 2006.
30. Yu H, Hripcsak G, Friedman C: Mapping abbreviations to full forms in biomedical articles. Journal of
the American Medical Informatics Association : JAMIA 2002, 9(3):262–272.
31. Isenius N, Velupillai S, Kvist M: Initial Results in the Development of SCAN: a Swedish Clinical
Abbreviation Normalizer. In Proceedings of the CLEF 2012 Workshop on Cross-Language Evaluation of
Methods, Applications, and Resources for eHealth Document Analysis - CLEFeHealth2012, Rome, Italy: CLEF
2012.
32. Gaudan S, Kirsch H, Rebholz-Schuhmann D: Resolving abbreviations to their senses in Medline. Bioinformatics 2005, 21(18):3658–3664.
33. Cohen T, Widdows D: Empirical Distributional Semantics: Methods and Biomedical Applications. J
Biomed Inform 2009, 42(2):390–405.
34. Salton G, Wong A, Yang CS: A Vector Space Model for Automatic Indexing. Communications of the
ACM 1975, 18(11):613–620.
35. Deerwester SC, Dumais ST, Landauer TK, Furnas GW, Harshman RA: Indexing by latent semantic analysis.
Journal of the American Society for Information Science 1990, 41(6):391–407.
36. Schütze H: Word Space. In Advances in Neural Information Processing Systems 5, Morgan Kaufmann 1993:895–
902.
37. Lund K, Burgess C: Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior
Research Methods 1996, 28(2):203–208, [http://dx.doi.org/10.3758/bf03204766].
38. Harris ZS: Distributional structure. Word 1954, 10:146–162.
39. Sahlgren M: The Word-Space Model: Using distributional analysis to represent syntagmatic and
paradigmatic relations between words in high-dimensional vector spaces. PhD thesis,
Stockholm University 2006.
40. Kanerva P, Kristofersson J, Holst A: Random indexing of text samples for latent semantic analysis. In
Proceedings of 22nd Annual Conference of the Cognitive Science Society 2000:1036.
41. Sahlgren M, Holst A, Kanerva P: Permutations as a Means to Encode Order in Word Space. In Proceedings of the 30th Annual Meeting of the Cognitive Science Society 2008:1300–1305.
42. Hassel M: JavaSDM package. http://www.nada.kth.se/~xmartin/java/ 2004. [KTH School of Computer Science and Communication; Stockholm, Sweden].
43. Jones MN, Mewhort DJK: Representing Word Meaning and Order Information in a Composite Holographic Lexicon. Psychological Review 2007, 114(1):1–37.
44. Dalianis H, Hassel M, Velupillai S: The Stockholm EPR Corpus - Characteristics and Some Initial
Findings. In Proceedings of ISHIMR 2009, Evaluation and implementation of e-health and health information
initiatives: international perspectives. 14th International Symposium for Health Information Management Research, Kalmar, Sweden 2009:243–249.
45. Kokkinakis D: The Journal of the Swedish Medical Association - a Corpus Resource for Biomedical
Text Mining in Swedish. In The Third Workshop on Building and Evaluating Resources for Biomedical Text
Mining (BioTxtM), an LREC Workshop. Turkey 2012.
46. Knutsson O, Bigert J, Kann V: A Robust Shallow Parser for Swedish. In Proceedings of Nodalida
2003.
47. Cederblom S: Medicinska förkortningar och akronymer (In Swedish). Lund: Studentlitteratur 2005.
48. US National Library of Medicine: MeSH (Medical Subject Headings). http://www.ncbi.nlm.nih.gov/mesh.
49. Karolinska Institutet: Hur man använder den svenska MeSHen (In Swedish, translated as: How to
use the Swedish MeSH). http://mesh.kib.ki.se/swemesh/manual_se.html 2012. [Accessed 2012-03-10].
50. Henriksson A, Hassel M: Optimizing the Dimensionality of Clinical Term Spaces for Improved Diagnosis Coding Support. In Proceedings of Louhi Workshop on Health Document Text Mining and Information
Analysis 2013.
Figures
Figure 1 - Ensembles of Semantic Spaces for Synonym Extraction and Abbreviation Expansion
Semantic spaces built with different model parameters are induced from different corpora. The outputs of the
semantic spaces are combined in order to obtain better results compared to using a single semantic space in
isolation.
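The combination step illustrated in Figure 1 can be sketched as follows. This is a minimal illustration with invented similarity scores, not the authors' implementation; summing cosine similarities across spaces corresponds to one of the combination strategies evaluated in the paper (cf. the SUM strategy in Table 6).

```python
# Sketch: each semantic space returns candidate terms with cosine
# similarities for a query term; candidates are re-ranked by the sum of
# their similarities across all spaces in the ensemble.

def combine_spaces(results_per_space, top_n=10):
    """results_per_space: one {candidate: cosine similarity} dict per space."""
    scores = {}
    for results in results_per_space:
        for candidate, sim in results.items():
            scores[candidate] = scores.get(candidate, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Invented scores for the query term "sjukhem" (nursing-home):
clinical = {"vårdhem": 0.61, "sjukhus": 0.55, "gård": 0.40}
medical = {"vårdcentral": 0.58, "vårdhem": 0.50}
print(combine_spaces([clinical, medical]))  # "vårdhem" ranks first (0.61 + 0.50)
```

A candidate suggested by both spaces thus outranks candidates that score highly in only one of them, which is the intuition behind the ensemble.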
Figure 2 - Distribution of Candidate Terms for the Clinical Corpus
The distribution (cosine similarity and rank) of candidates for synonyms for the best combination of semantic
spaces induced from the clinical corpus. The results show the distribution for query terms in the development
reference standard.
Figure 3 - Distribution of Candidate Terms for the Medical Corpus
The distribution (cosine similarity and rank) of candidates for synonyms for the best combination of semantic
spaces induced from the medical corpus. The results show the distribution for query terms in the development
reference standard.
Figure 4 - Distribution of Candidate Terms for Clinical + Medical Corpora
The distribution (combined cosine similarity and rank) of candidates for synonyms for the ensemble of
semantic spaces induced from medical and clinical corpora. The results show the distribution for query
terms in the development reference standard.
Figure 5 - Frequency Thresholds
The relation between recall and the required minimum frequency of occurrence for the reference standard
terms in both corpora. The number of query terms for each threshold value is also shown.
Tables
Table 1 - Corpora Statistics
The number of tokens and unique terms (types) in the medical and clinical corpus, with and without stop
words.
Corpus      With Stop Words                Without Stop Words
Clinical    ~42.5M tokens (~0.4M types)    ~22.5M tokens (~0.4M types)
Medical     ~20.3M tokens (~0.3M types)    ~12.1M tokens (~0.3M types)
Table 2 - Overview of Experiments Conducted with a Single Corpus
For each of the two corpora, 30 different combinations were evaluated. The configurations are described
according to the following pattern: model windowSize. For RP, sw means that stop words are retained in
the semantic space. For instance, model 20 means a window size of 10+10 was used.

For each of the 2 corpora, 10 semantic spaces were induced:

RI Spaces    RI 2    RI 4       RI 8    RI 20
RP Spaces    RP 2    RP 2 sw    RP 4    RP 4 sw    RP 8    RP 8 sw

The induced semantic spaces were combined in 10 different combinations:

Identical Window Size                RI 2, RP 2       RI 4, RP 4       RI 8, RP 8
Identical Window Size, Stop Words    RI 2, RP 2 sw    RI 4, RP 4 sw    RI 8, RP 8 sw
Large Window Size                    RI 20, RP 2      RI 20, RP 4
Large Window Size, Stop Words        RI 20, RP 2 sw   RI 20, RP 4 sw

For each combination, 3 combination strategies were evaluated:

Combination Strategies    RI → RP 30    RP → RI 30    RI + RP
Table 3 - Reference Standards Statistics
Size shows the number of queries, 2 Cor the proportion of queries with two correct answers and 3 Cor
the proportion of queries with three (or more) correct answers. The remaining queries have one correct
answer.

                      Clinical Corpus        Medical Corpus         Clinical + Medical
Reference Standard    Size   2 Cor  3 Cor    Size  2 Cor  3 Cor     Size  2 Cor  3 Cor
Abbr→Exp (Devel)      117    9.4%   0.0%     55    13%    1.8%      42    14%    0%
Abbr→Exp (Eval)       98     3.1%   0.0%     55    11%    0%        35    2.9%   0%
Exp→Abbr (Devel)      110    8.2%   1.8%     63    4.7%   0%        45    6.7%   0%
Exp→Abbr (Eval)       98     7.1%   0.0%     61    0%     0%        36    0%     0%
Syn (Devel)           334    9.0%   1.2%     266   11%    3.0%      122   4.9%   0%
Syn (Eval)            340    14%    2.4%     263   13%    3.8%      135   11%    0%
Table 4 - Clinical Corpus Results on Development Set
Results (recall, top ten) of the best configurations for each model and model combination on the three tasks.
The configurations are described according to the following pattern: model windowSize. For RP, sw means
that stop words are retained in the model.

Abbr→Exp
Strategy     RI      RP         Result
RI → RP30    RI 8    RP 8 sw    0.38
RP → RI30    RI 20   RP 4 sw    0.35
RI + RP      RI 4    RP 4 sw    0.42

Exp→Abbr
Strategy     Best configurations (RI + RP)                    Result
RI → RP30    RI 8 + RP 8                                      0.30
RP → RI30    RI 4 + RP 4 sw; RI 20 + RP 4 sw                  0.30
RI + RP      RI 4 + RP 4 sw; RI 8 + RP 8 sw                   0.32

Syn
Strategy     Best configurations (RI + RP)                    Result
RI → RP30    RI 8 + RP 8                                      0.39
RP → RI30    RI 8 + RP 8; RI 8 + RP 8 sw; RI 20 + RP 2 sw     0.38
RI + RP      RI 20 + RP 4 sw                                  0.40
Table 5 - Medical Corpus Results on Development Set
Results (recall, top ten) of the best configurations for each model and model combination on the three tasks.
The configurations are described according to the following pattern: model windowSize. For RP, sw means
that stop words are retained in the model.

Abbr→Exp
Strategy     Best configurations (RI + RP)                    Result
RI → RP30    RI 4 + RP 4 sw; RI 20 + RP 2; RI 20 + RP 4 sw    0.08
RP → RI30    multiple tied configurations                     0.08
RI + RP      RI 4 + RP 4 sw                                   0.10

Exp→Abbr
Strategy     Best configurations (RI + RP)                    Result
RI → RP30    multiple tied configurations                     0.03
RP → RI30    multiple tied configurations                     0.03
RI + RP      multiple tied configurations                     0.08

Syn
Strategy     Best configurations (RI + RP)                    Result
RI → RP30    RI 20 + RP 4 sw                                  0.26
RP → RI30    RI 8 + RP 8 sw                                   0.24
RI + RP      RI 20 + RP 2 sw                                  0.30
Table 6 - Clinical + Medical Corpora Results on Development Set
Results (P = precision, R = recall, top ten) of the best models with and without post-processing on the
three tasks. Dynamic # of suggestions allows the model to suggest less than ten terms in order to improve
precision. The results are based on the application of the model combinations to the development data.

Ensembles: Abbr→Exp: clinical RI 4 + RP 4 sw, medical RI 4 + RP 4 sw; Exp→Abbr: clinical RI 4 + RP 4 sw,
medical RI 8 + RP 8 sw; Syn: clinical RI 20 + RP 4 sw, medical RI 20 + RP 2 sw.

Strategy    Normalize    Abbr→Exp    Exp→Abbr    Syn
AVG         True         0.13        0.09        0.39
AVG         False        0.24        0.11        0.39
SUM         True         0.13        0.09        0.34
SUM         False        0.32        0.17        0.52
AVG→AVG     -            0.15        0.09        0.41
SUM→SUM     -            0.13        0.07        0.40
AVG→SUM     -            0.15        0.09        0.41
SUM→AVG     -            0.13        0.07        0.40
Table 7 - Clinical Corpus Results on Evaluation Set
Results (P = precision, R = recall, top ten) of the best models with and without post-processing on the
three tasks. Dynamic # of suggestions allows the model to suggest less than ten terms in order to improve
precision. The results are based on the application of the model combinations to the evaluation data.

                              Abbr→Exp            Exp→Abbr            Syn
                              RI 4 + RP 4 sw      RI 4 + RP 4 sw      RI 20 + RP 4 sw
Evaluation Configuration      P      R            P      R            P      R
RI Baseline                   0.04   0.22         0.03   0.19         0.07   0.39
RP Baseline                   0.04   0.23         0.04   0.24         0.06   0.36
Standard (Top 10)             0.05   0.31         0.03   0.20         0.07   0.44
Post-Processing (Top 10)      0.08   0.42         0.05   0.33         0.08   0.43
Dynamic Cut-Off (Top ≤ 10)    0.11   0.41         0.12   0.33         0.08   0.42
Table 8 - Medical Corpus Results on Evaluation Set
Results (P = precision, R = recall, top ten) of the best models with and without post-processing on the
three tasks. Dynamic # of suggestions allows the model to suggest less than ten terms in order to improve
precision. The results are based on the application of the model combinations to the evaluation data.

                              Abbr→Exp            Exp→Abbr            Syn
                              RI 4 + RP 4 sw      RI 8 + RP 8 sw      RI 20 + RP 2 sw
Evaluation Configuration      P      R            P      R            P      R
RI Baseline                   0.02   0.09         0.01   0.08         0.03   0.18
RP Baseline                   0.01   0.06         0.01   0.05         0.05   0.26
Standard (Top 10)             0.03   0.17         0.01   0.11         0.06   0.34
Post-Processing (Top 10)      0.03   0.17         0.02   0.11         0.06   0.34
Dynamic Cut-Off (Top ≤ 10)    0.17   0.17         0.10   0.11         0.06   0.34
Table 9 - Clinical + Medical Corpora Results on Evaluation Set
Results (P = precision, R = recall, top ten) of the best models with and without post-processing on the
three tasks. Dynamic # of suggestions allows the model to suggest less than ten terms in order to improve
precision. The results are based on the application of the model combinations to the evaluation data.

                                Abbr→Exp            Exp→Abbr            Syn
Clinical ensemble               RI 4 + RP 4 sw      RI 4 + RP 4 sw      RI 20 + RP 4 sw
Medical ensemble                RI 4 + RP 4 sw      RI 8 + RP 8 sw      RI 20 + RP 2 sw
Combination strategy            SUM, False          SUM, False          SUM, False

Evaluation Configuration        P      R            P      R            P      R
Clinical Ensemble Baseline      0.04   0.24         0.03   0.19         0.06   0.34
Medical Ensemble Baseline       0.02   0.11         0.01   0.11         0.05   0.33
Standard (Top 10)               0.05   0.30         0.03   0.19         0.08   0.47
Post-Processing (Top 10)        0.07   0.39         0.06   0.33         0.08   0.47
Dynamic Cut-Off (Top ≤ 10)      0.28   0.39         0.31   0.33         0.08   0.45
Table 10 - Examples of Extracted Candidate Synonyms
The top ten candidate terms for three different query terms, for the clinical ensemble, the medical ensemble
and for the multiple-corpora ensemble. Expected synonym according to reference standard in boldface.

Query Term: sjukhem (nursing-home)
Clinical                             Medical                              Clinical + Medical
heartcenter (heart-center)           vårdcentral (health-center)          vårdcentral (health-center)
bröstklinik (breast-clinic)          akutmottagning (emergency room)      mottagning (reception)
hälsomottagningen (health-clinic)    akuten (ER)                          vårdhem (nursing-home)
hjärtcenter (heart-center)           mottagning (reception)               gotland (a Swedish county)
län (county)                         intensivvårdsavdelning (ICU)         sjukhus (hospital)
eyecenter (eye-center)               arbetsplats (work-place)             gård (yard)
bröstklin (breast-clin.)             vårdavdelning (ward)                 vårdavdelning (ward)
sjukhems (nursing-home's)            gotland (a Swedish county)           arbetsplats (work-place)
hartcenter ("hart-center")           kväll (evening)                      akutmottagning (emergency room)
biobankscentrum (biobank-center)     ks (Karolinska hospital)             akuten (ER)

Query Term: depression (depression)
Clinical                             Medical                                  Clinical + Medical
sömnstörning (insomnia)              depressioner (depressions)               sömnstörning (insomnia)
sömnsvårigheter (insomnia)           osteoporos (osteoporosis)                osteoporos (osteoporosis)
panikångest (panic disorder)         astma (asthma)                           tvångssyndrom (OCD)
tvångssyndrom (OCD)                  fetma (obesity)                          epilepsi (epilepsy)
fibromyalgi (fibromyalgia)           smärta (pain)                            hjärtsvikt (heart failure)
ryggvärk (back-pain)                 depressionssjukdom (depressive-illness)  nedstämdhet (sadness)
självskadebeteende (self-harm)       bensodiazepiner (benzodiazepines)        fibromyalgi (fibromyalgia)
osteoporos (osteoporosis)            hjärtsvikt (heart-failure)               astma (asthma)
depressivitet ("depressitivity")     hypertoni (hypertension)                 alkoholberoende (alcoholism)
pneumoni (pneumonia)                 utbrändhet (burnout)                     migrän (migraine)

Query Term: allergi (allergy)
Clinical                              Medical                             Clinical + Medical
pollenallergi (pollen-allergy)        allergier (allergies)               allergier (allergies)
födoämnesallergi (food-allergy)       sensibilisering (sensitization)     hösnuva (hay-fever)
hösnuva (hay-fever)                   hösnuva (hay-fever)                 födoämnesallergi (food-allergy)
överkänslighet (hypersensitivity)     rehabilitering (rehabilitation)     pollenallergi (pollen-allergy)
kattallergi (cat-allergy)             fetma (obesity)                     överkänslighet (hypersensitivity)
jordnötsallergi (peanut-allergy)      kol (COPD)                          astma (asthma)
pälsdjursallergi (animal-allergy)     osteoporos (osteoporosis)           kol (COPD)
negeras (negated)                     födoämnesallergi (food-allergy)     osteoporos (osteoporosis)
pollen (pollen)                       astma (asthma)                      jordnötsallergi (peanut-allergy)
pollenallergiker ("pollen-allergic")  utbrändhet (burnout)                pälsdjursallergi (animal-allergy)
Endnotes
1 Unified Medical Language System: http://www.nlm.nih.gov/research/umls/
2 There are also probabilistic models, which view documents as a mixture of topics and represent terms according to the
probability of their occurrence during the discussion of each topic: two terms that share similar topic distributions are assumed
to be semantically related.
3 This research has been approved by the Regional Ethical Review Board in Stockholm (Etikprövningsnämnden i Stockholm),
permission number 2012/834-31/5.
PAPER III
EXPLOITING STRUCTURED DATA, NEGATION DETECTION AND SNOMED CT TERMS IN A RANDOM INDEXING APPROACH TO CLINICAL CODING
Aron Henriksson and Martin Hassel (2011). Exploiting Structured Data, Negation
Detection and SNOMED CT Terms in a Random Indexing Approach to Clinical
Coding. In Proceedings of the RANLP Workshop on Biomedical Natural
Language Processing, Association for Computational Linguistics (ACL), pages
3–10, Hissar, Bulgaria.
Author Contributions The study was designed by Aron Henriksson and Martin
Hassel. Aron Henriksson was responsible for carrying out the experiments, while
Martin Hassel generated the co-occurrence statistics from the structured data. The
analysis was conducted by Aron Henriksson and Martin Hassel. The manuscript
was written by Aron Henriksson and reviewed by Martin Hassel.
Exploiting Structured Data, Negation Detection and SNOMED CT Terms
in a Random Indexing Approach to Clinical Coding
Aron Henriksson
DSV, Stockholm University
[email protected]
Martin Hassel
DSV, Stockholm University
[email protected]

Abstract
The problem of providing effective computer support for clinical coding has been
the target of many research efforts. A recently introduced approach, based on statistical data on co-occurrences of words
in clinical notes and assigned diagnosis
codes, is here developed further and improved upon. The ability of the word space
model to detect and appropriately handle
the function of negations is demonstrated
to be important in accurately correlating
words with diagnosis codes, although the
data on which the model is trained needs
to be sufficiently large. Moreover, weighting can be performed in various ways, for
instance by giving additional weight to
‘clinically significant’ words or by filtering code candidates based on structured
patient records data. The results demonstrate the usefulness of both weighting
techniques, particularly the latter, yielding 27% exact matches for a general
model (across clinic types); 43% and 82%
for two domain-specific models (ear-nose-throat and rheumatology clinics).
1 Introduction
Clinicians spend much valuable time and effort
in front of a computer, assigning diagnosis codes
during or after a patient encounter. Tools that facilitate this task would allow costs to be reduced
or clinicians to spend more of their time tending to patients, effectively improving the quality
of healthcare. The idea, then, is that clinicians
should be able simply to verify automatically assigned codes or to select appropriate codes from a
list of recommendations.
1.1 Previous Work
There have been numerous attempts to provide
clinical coding support, even if such tools are yet
to be widely used in clinical practice (Stanfill et
al., 2010). The most common approach has been
to view it essentially as a text classification problem. The assumption is that there is some overlap between clinical notes and the content of assigned diagnosis codes, making it possible to predict possible diagnosis codes for ‘uncoded’ documents. For instance, in the 2007 Computational
Challenge (Pestian et al., 2007), free-text radiology reports were to be assigned one or two labels
from a set of 45 ICD-9-CM codes. Most of the
best-performing systems were rule-based, achieving micro-averaged F1-scores of up to 89.1%.
Some have tried to enhance their NLP-based
systems by exploiting the structured data available
in patient records. Pakhomov et al. (2006) use
gender information—as well as frequency data—
to filter out improbable classifications. The motivation is that gender has a high predictive value,
particularly as some categories make explicit gender distinctions.
Medical terms also have a high predictive value
when it comes to classification of clinical notes
(see e.g. Jarman and Berndt, 2010). In an attempt to assign ICD-9 codes to discharge summaries, the results improved when extra weight
was given to words, phrases and structures that
provided the most diagnostic evidence (Larkey
and Croft, 1995).
Given the inherent practice of ruling out possible diseases, symptoms and findings, it seems important to handle negations in clinical text. In one
study, it was shown that around 9% of automatically detected SNOMED CT findings and disorders were negated (Skeppstedt et al., 2011). In the
attempt of Larkey and Croft (1995), negated medical terms are annotated and handled in various
Proceedings of the Workshop on Biomedical Natural Language Processing, pages 3–10,
Hissar, Bulgaria, 15 September 2011.
ways; however, none yielded improved results.
1.2 Random Indexing of Patient Records
In more recent studies, the word space model,
in its Random Indexing mold (Sahlgren, 2001;
Sahlgren, 2006), has been investigated as a possible alternative solution to clinical coding support (Henriksson et al., 2011; Henriksson and
Hassel, 2011). Statistical data on co-occurrences
of words and ICD-101 codes is used to build predictive models that can generate recommendations
for uncoded documents. In a list of ten recommended codes, general models—trained and evaluated on all clinic types—achieve up to 23% exact
matches and 60% partial matches, while domain-specific models—trained and evaluated on a particular type of clinic—achieve up to 59% exact
matches and 93% partial matches.
A potential limitation of the above models is
that they fail to capture the function of negations,
which means that negated terms in the clinical
notes will be positively correlated with the assigned diagnosis codes. In the context of information retrieval, Widdows (2003) describes a way to
remove unwanted meanings from queries in vector models, using a vector negation operator that
not only removes unwanted strings but also synonyms and neighbors of the negated terms. To our
knowledge, however, the ability of the word space
model to handle negations has not been studied extensively.
1.3 Aim
The aim of this paper, then, is to develop the Random Indexing approach to clinical coding support by exploring three potential improvements:

1. Giving extra weight to words used in a list of SNOMED CT terms.
2. Exploiting structured data in patient records to calculate the likelihood of code candidates.
3. Incorporating the use of negation detection.

2 Method
Random Indexing is applied on patient records to calculate co-occurrences of tokens (words and ICD-10 codes1) on a document level. The resulting models contain information about the 'semantic similarity' of individual words and diagnosis codes2, which is subsequently used to classify uncoded documents.

1 The 10th revision of the International Classification of Diseases and Related Health Problems (World Health Organization, 2011).
2 According to the distributional hypothesis, words that appear in similar contexts tend to have similar properties. If two words repeatedly co-occur, we can assume that they in some way refer to similar concepts (Harris, 1954). Diagnosis codes are here treated as words.

2.1 Stockholm EPR Corpus
The models are trained and evaluated on a Swedish corpus of approximately 270,000 clinically coded patient records, comprising 5.5 million notes from 838 clinical units. This is a subset of the Stockholm EPR corpus (Dalianis et al., 2009). A document contains all free-text entries concerning a single patient made on consecutive days at a single clinical unit. The documents in the partitions of the data sets on which the models are trained (90%) also include one or more associated ICD-10 codes (on average 1.7 and at most 47). In the testing partitions (10%), the associated codes are retained separately for evaluation. In addition to the complete data set, two subsets are created, in which there are documents exclusively from a particular type of clinic: one for ear-nose-throat clinics and one for rheumatology clinics.

Variants of the three data sets are created, in which negated clinical entities are automatically annotated using the Swedish version of NegEx (Skeppstedt, 2011). The clinical entities are detected through exact string matching against a list of 112,847 SNOMED CT terms belonging to the semantic categories 'finding' and 'disorder'. It is important to handle ambiguous terms in order to reduce the number of false positives; therefore, the list does not include findings which are equivalent to a common non-clinical unigram or bigram (see Skeppstedt et al., 2011). A negated term is marked in such a way that it will be treated as a single word, although with its proper negated denotation. Multi-word terms are concatenated into unigrams.

The data is finally pre-processed: lemmatization is performed using the Granska Tagger (Knutsson et al., 2003), while punctuation, digits and stop words are removed.

2.2 Word Space Models
Random Indexing is performed on the training partitions of the described data sets, resulting in a total of six models (Table 1): two variants of the general model and two variants of the two domain-specific models3.
Table 1: The six models.

w/o negations    w/ negations
General Model    General NegEx Model
ENT Model        ENT NegEx Model
Rheuma Model     Rheuma NegEx Model
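The Random Indexing procedure used to induce these models can be sketched as follows. This is a minimal illustration under assumed parameters (the paper does not state dimensionality or other training details here): every token, including ICD-10 codes treated as words, receives a sparse ternary index vector, and a token's context vector accumulates the index vectors of the tokens it co-occurs with in the same document.

```python
import random
from collections import defaultdict

DIM, NONZERO = 1000, 8  # illustrative parameters, not from the paper

def make_index_vector(rng):
    # Sparse ternary vector: a few randomly placed +1/-1 components.
    positions = rng.sample(range(DIM), NONZERO)
    return {pos: rng.choice((1, -1)) for pos in positions}

def train(documents, seed=42):
    rng = random.Random(seed)
    index, context = {}, {}
    for doc in documents:
        for token in doc:
            index.setdefault(token, make_index_vector(rng))
            context.setdefault(token, defaultdict(float))
        for token in doc:
            for other in doc:
                if other != token:  # document-level co-occurrence
                    for pos, val in index[other].items():
                        context[token][pos] += val
    return context

# Toy documents mixing words and an ICD-10 code:
docs = [["feber", "hosta", "J06"], ["hosta", "J06"]]
contexts = train(docs)  # context vectors for the words and the code "J06"
```

Semantic similarity between a word and a code can then be measured as the cosine of the angle between their context vectors, which is the similarity used throughout the paper.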
2.3 Election of Diagnosis Codes
The models are then used to produce a ranked list
of recommended diagnosis codes for each of the
documents in the testing partitions of the corresponding data sets. This list is created by letting each of the words in a document ‘vote’ for a
number of semantically similar codes, thus necessitating the subsequent merging of the individual
lists. This ranking procedure can be carried out in
a number of ways, some of which are explored in
this paper. The starting point, however, is to use
the semantic similarity of a word and a diagnosis
code—as defined by the cosine similarity score—
and the idf4 value of the word. This is regarded
as our baseline model (Henriksson and Hassel,
2011), to which negation handling and additional
weighting schemes are added.
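The election procedure described above can be sketched as follows. This is a hypothetical illustration with invented similarity and idf values; the baseline's exact merging details may differ. Each word in a document votes for the codes nearest to it in the word space, with each vote weighted by cosine similarity times the word's idf value, and the per-word lists are merged into one ranked list.

```python
def elect_codes(document_words, neighbours, idf, top_n=10):
    """neighbours: {word: [(code, cosine similarity), ...]}."""
    votes = {}
    for word in document_words:
        for code, sim in neighbours.get(word, []):
            # Each vote is weighted by similarity x idf of the voting word.
            votes[code] = votes.get(code, 0.0) + sim * idf.get(word, 1.0)
    return sorted(votes, key=votes.get, reverse=True)[:top_n]

# Invented data: two words voting for upper-respiratory codes.
neighbours = {"hosta": [("J06", 0.4), ("J45", 0.2)], "feber": [("J06", 0.5)]}
idf = {"hosta": 1.2, "feber": 0.9}
print(elect_codes(["hosta", "feber"], neighbours, idf))  # ['J06', 'J45']
```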
2.4 Weighting Techniques
For each of the models, we apply two distinct
weighting techniques. First, we assume a technocratic approach to the election of diagnosis codes.
We do so by giving added weight to words which
are ‘clinically significant’. That is here achieved
by utilizing the same list of SNOMED CT findings
and disorders that was used by the negation detection system. However, rather than trying to match
the entire term—which would likely result in a
fairly limited number of hits—we opted simply
to give weight to the individual (non stop) words
used in those terms. These words are first lemmatized, as the data on which the matching is performed has also been lemmatized. It will also allow hits independent of morphological variations.
We also perform weighting of the correlated
ICD-10 codes by exploiting statistics generated
3 ENT = Ear-Nose-Throat, Rheuma = Rheumatology.
4 Inverse document frequency, denoting a word's discriminatory value.
from the fixed fields of the patient records, namely
gender, age and clinical unit. The idea is to use
known information about a to-be-coded document
in order to assign weights to code candidates according to plausibility, which in turn is based on
past combinations of a particular code and each
of the structured data entries. For instance, if the
model generates a code that has very rarely been
assigned to a patient of a particular sex or age
group—and the document is from the record of
such a patient—it seems sensible to give it less
weight, effectively reducing the chances of that
code being recommended. In order for an unseen
combination not to be ruled out entirely, additive
smoothing is performed. Gender and clinical unit
can be used as defined, while age groups are created for each and every year up to the age of 10,
after which ten-year intervals are used. This seems
reasonable since age distinctions are more sensitive in younger years.
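The fixed-fields weighting with additive smoothing can be sketched as follows. The smoothing constant, counts and code inventory size are illustrative assumptions; the sketch only shows how a code candidate is weighted by the smoothed probability of the code given a fixed-field value (gender, age group or clinical unit), so that unseen combinations are dampened but never ruled out entirely.

```python
def age_group(age):
    # One group per year up to the age of 10, then ten-year intervals.
    return age if age < 10 else 10 + (age - 10) // 10

def smoothed_prob(code, value, counts, value_total, n_codes, alpha=1.0):
    """counts[(code, value)]: past assignments of code to patients with value."""
    return (counts.get((code, value), 0) + alpha) / (value_total + alpha * n_codes)

# Invented statistics: past code assignments for female patients.
counts = {("M54", "F"): 30, ("I21", "F"): 5}
p_seen = smoothed_prob("M54", "F", counts, value_total=35, n_codes=100)
p_unseen = smoothed_prob("O80", "F", counts, value_total=35, n_codes=100)
# p_unseen is small but non-zero thanks to additive smoothing
```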
In order to make it possible for code candidates
that are not present in any of the top-ten lists of
the individual words to make it into the final top-ten list of a document, all codes associated with
a word in the document are included in the final
re-ranking phase. This way, codes that are more
likely for a given patient are able to take the place
of more improbable code candidates. For the general models, however, the initial word-based code
lists are restricted to twenty, due to technical efficiency constraints.
2.5 Evaluation
The evaluation is carried out by comparing the
model-generated recommendations with the clinically assigned codes in the data. This matching is
done on all four possible levels of ICD-10 according to specificity (see Figure 1).
Figure 1: The structure of ICD-10 allows division
into four levels.
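The level matching can be sketched as follows, assuming a simple prefix interpretation of the four levels (the paper's precise level definitions in Figure 1 may differ): an exact match on the full code, then progressively shorter prefixes down to the chapter letter.

```python
def match_level(predicted, assigned):
    pred, gold = predicted.replace(".", ""), assigned.replace(".", "")
    if pred == gold:
        return "E"   # exact match on the full code
    if pred[:3] == gold[:3]:
        return "L3"  # same three-character category
    if pred[:2] == gold[:2]:
        return "L2"
    if pred[:1] == gold[:1]:
        return "L1"  # same chapter letter only
    return None

print(match_level("J06.9", "J06.8"))  # L3
print(match_level("J06.9", "J15.0"))  # L1
```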
3 Results
The general data set, on which General Model and
General NegEx Model are trained and evaluated,
comprises approximately 274,000 documents and
12,396 unique labels. The ear-nose-throat data
set, on which ENT Model and ENT NegEx Model
are trained and evaluated, contains around 23,000
documents and 1,713 unique labels. The rheumatology data set, on which Rheuma Model and
Rheuma NegEx Model are trained and evaluated,
contains around 9,000 documents and 630 unique
labels (Table 2).
Data set        Documents    Codes
General         ~274 k       12,396
ENT             ~23 k        1,713
Rheumatology    ~9 k         630

Table 2: Data set statistics.
The proportion of the detected clinical entities that
are negated is 13.98% in the complete, general
data set and slightly higher in the ENT (14.32%)
and rheumatology data sets (16.98%) (Table 3).
3.1 General Models
The baseline for the general models finds 23% of the clinically assigned codes (exact matches), when the number of model-generated recommendations is confined to ten (Table 4). Meanwhile, matches on the less specific levels of ICD-10, i.e. partial matches, amount to 25%, 33% and 60% respectively (from specific to general).

The single application of one of the weighting techniques to the baseline model boosts performance somewhat, the fixed fields-based code filtering (26% exact matches) slightly more so than the technocratic word weighting (24% exact matches). The negation variant of the general model, General NegEx Model, performs somewhat better—up two percentage points (25% exact matches)—than the baseline model. The technocratic approach applied to this model does not yield any observable added value. The fixed fields filtering does, however, result in a further improvement on the three most specific levels (27% exact matches).

A combination of the two weighting schemes does not appear to bring much benefit to either of the general models, compared to solely performing fixed fields filtering.

3.2 Ear-Nose-Throat Models
The baseline for the ENT models finds 33% of the clinically assigned codes (exact matches) and 34% (L3), 41% (L2) and 62% (L1) at the less specific levels (Table 5).

Technocratic word weighting yields a modest improvement over the baseline model: one percentage point on each of the levels. Filtering code candidates based on fixed fields statistics, however, leads to a remarkable boost in results, from 33% to 43% exact matches. ENT NegEx Model performs slightly better than the baseline model, although only as little as a single percentage point (34% exact matches). Performance drops when the technocratic approach is applied to this model. The fixed fields filtering, on the other hand, similarly improves results for the negation variant of the ENT model; however, there is no apparent additional benefit in this case of negation handling. In fact, it somewhat hampers the improvement yielded by this weighting technique.

As with the general models, a combination of the two weighting techniques does not affect the results much for either of the ENT models.

3.3 Rheumatology Models
The baseline for the rheumatology models finds 61% of the clinically assigned codes (exact matches) and 61% (L3), 68% (L2) and 92% (L1) at the less specific levels (Table 6).

Compared to the above models, the technocratic approach is here much more successful, resulting in 72% exact matches. Filtering the code candidates based on fixed fields statistics leads to a further improvement of ten percentage points for exact matches (82%). Rheuma NegEx Model achieves only a modest improvement on L2. Moreover, this model does not benefit at all from the technocratic approach; neither is the fixed fields filtering quite as successful in this model (67% exact matches).

A combination of the two weighting schemes adds only a little to the two variants of the rheumatology model. Interesting to note is that the negation variant performs the same or even much worse than the one without any negation handling.

4 Discussion
The two weighting techniques and the incorporation of negation handling provide varying degrees of benefit—from small to important boosts in performance—depending to some extent on the model to which they are applied.
Model                  Clinical Entities    Negations    Negations/Clinical Entities
General NegEx Model    634,371              88,679       13.98%
ENT NegEx Model        40,362               5,780        14.32%
Rheuma NegEx Model     20,649               3,506        16.98%

Table 3: Negation Statistics. The number of detected clinical entities, the number of negated clinical
entities and the percentage of the detected clinical entities that are negated.
                               General Model            General NegEx Model
Weighting                      E     L3    L2    L1     E     L3    L2    L1
Baseline                       0.23  0.25  0.33  0.60   0.25  0.27  0.35  0.62
Technocratic                   0.24  0.26  0.34  0.61   0.25  0.27  0.35  0.62
Fixed Fields                   0.26  0.28  0.36  0.61   0.27  0.29  0.37  0.63
Technocratic + Fixed Fields    0.26  0.28  0.36  0.62   0.27  0.29  0.37  0.63

Table 4: General Models, with and without negation handling. Recall (top 10), measured as the presence
of the clinically assigned codes in a list of ten model-generated recommendations. E = exact match,
L3→L1 = matches on the other levels, from specific to general. The baseline is for the model without
negation handling only.
Weighting                   |      ENT Model      |   ENT NegEx Model
                            | E    L3   L2   L1   | E    L3   L2   L1
Baseline                    | 0.33 0.34 0.41 0.62 | 0.34 0.35 0.42 0.62
Technocratic                | 0.34 0.35 0.42 0.63 | 0.33 0.33 0.41 0.61
Fixed Fields                | 0.43 0.43 0.48 0.64 | 0.42 0.43 0.48 0.63
Technocratic + Fixed Fields | 0.42 0.42 0.47 0.64 | 0.42 0.42 0.47 0.62

Table 5: Ear-Nose-Throat Models, with and without negation handling. Recall (top 10), measured as the
presence of the clinically assigned codes in a list of ten model-generated recommendations. E = exact
match, L3→L1 = matches on the other levels, from specific to general. The baseline is for the model
without negation handling only.
Weighting                   |     Rheuma Model    |  Rheuma NegEx Model
                            | E    L3   L2   L1   | E    L3   L2   L1
Baseline                    | 0.61 0.61 0.68 0.92 | 0.61 0.61 0.70 0.92
Technocratic                | 0.72 0.72 0.77 0.94 | 0.60 0.60 0.70 0.91
Fixed Fields                | 0.82 0.82 0.85 0.95 | 0.67 0.67 0.75 0.91
Technocratic + Fixed Fields | 0.82 0.83 0.86 0.95 | 0.68 0.68 0.76 0.92

Table 6: Rheumatology Models, with and without negation handling. Recall (top 10), measured as the
presence of the clinically assigned codes in a list of ten model-generated recommendations. E = exact
match, L3→L1 = matches on the other levels, from specific to general. The baseline is for the model
without negation handling only.
4.1 Technocratic Approach
The technocratic approach, whereby clinically significant words are given extra weight, does result in some improvement when applied to all
models that do not incorporate negation handling. The effect this weighting technique has
on the Rheuma Model is, however, markedly different from its effect on the other two corresponding models. This difference could potentially be the result of a more precise, technical language used
in rheumatology documentation, where certain
words are highly predictive of the diagnosis. However, the results produced by this model need to be
examined with some caution, due to the relatively
small size of the data set on which the model is
based and evaluated.
Since this approach appears to have a positive impact on all of the models where negation
handling is not performed, assigning even more
weight to clinical terminology may yield additional benefits. This would, of course, have to be
tested empirically and may differ from domain to
domain.
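As a rough illustration, the technocratic weighting described above could be sketched as follows; the function, the boost factor and the way votes are combined are our own illustrative assumptions, not the implementation used in the paper.

```python
def rank_codes(word_votes, clinical_terms, boost=2.0):
    """Sum the code votes cast by each word in a document, multiplying
    votes from clinically significant words (e.g. terms found in a
    medical terminology) by a boost factor.

    word_votes: {word: {code: similarity}}.
    Returns codes ranked by weighted vote sum."""
    scores = {}
    for word, votes in word_votes.items():
        weight = boost if word in clinical_terms else 1.0
        for code, sim in votes.items():
            scores[code] = scores.get(code, 0.0) + weight * sim
    return sorted(scores, key=scores.get, reverse=True)
```

Whether raising the boost factor for clinical terminology yields additional benefits is precisely the empirical question raised above.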
4.2 Structured Data Filtering
The technique whereby code candidates are given
weight according to their likelihood of being accurately assigned to a particular patient record—
based on historical co-occurrence statistics of diagnosis codes and, respectively, age, gender and
clinical unit—is successful across the board. To a
large extent, this is probably due to a set of ICD-10 codes being frequently assigned in any particular clinical unit. In effect, it can partly be seen as
a weighting scheme according to code frequency.
There are also codes, however, that make gender
and age distinctions. It is likewise well known that
some diagnoses are more prevalent in certain age
groups, while others are exclusive to a particular
gender.
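A minimal sketch of this kind of filtering, assuming co-occurrence statistics have been precomputed as conditional relative frequencies; the field names, data layout and multiplicative combination are illustrative assumptions rather than the exact scheme of the paper.

```python
def fixed_fields_weight(code, patient, stats):
    """Multiply a candidate code's weight by its historical relative
    frequency given each structured field of the patient record.
    A code never assigned for, e.g., this gender gets weight 0 and is
    effectively filtered out (cf. the male-only codes discussed above)."""
    w = 1.0
    for field in ("age_group", "gender", "clinical_unit"):
        value = patient[field]
        w *= stats[field].get((code, value), 0.0)
    return w

def filter_candidates(candidates, patient, stats):
    """candidates: {code: text-based score}. Returns codes re-ranked by
    the product of the text-based score and the fixed-fields weight."""
    rescored = {c: s * fixed_fields_weight(c, patient, stats)
                for c, s in candidates.items()}
    return sorted(rescored, key=rescored.get, reverse=True)
```

In practice, the zero default would likely be replaced by some smoothing constant so that unseen field-code combinations are penalized rather than eliminated.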
It is interesting to note the remarkable improvement observed for the two domain-specific
models. Perhaps the aforementioned factor of
frequently recurring code assignments is even
stronger in these particular types of clinics. By
contrast, there are no obvious gender-specific diagnoses in either of the two domains; however,
in the rheumatology data, there are in fact 23
codes that have frequently been assigned to men
but never to women. In such cases it is especially
beneficial to exploit the structured data in patient
records. It could also be that the restriction to
twenty code candidates for each of the individual
words in the general models was not sufficiently
large a number to allow more likely code candidates to make it into the final list of recommendations. That said, it seems somewhat unlikely that a
code that is not closely associated with any of the
words in a document should make it into the final
list.
Even if the larger improvements observed for
the domain-specific models may, again, in part be
due to the smaller amounts of data compared with
the general model, the results clearly indicate the
general applicability and benefit of such a weighting scheme.
4.3 Negation Detection
The incorporation of automatic detection of
negated clinical entities improves results for all
models, although more so for the general model
than the domain-specific models. This could possibly be ascribed to the problem of data sparsity. That is, in the smaller domain-specific models, there are fewer instances of each type of
negated clinical entity (11.7 on average in ENT
and 9.4 on average in rheumatology) than in the
general model (31.6 on average). This is problematic since infrequent words, just as very frequent words, are commonly assumed to hold little or no information about semantics (Jurafsky
and Martin, 2009). There simply is little statistical evidence for the rare words, which potentially makes the estimation of their similarity with
other words uncertain. For instance, Karlgren and
Sahlgren (2001) report that, in their TOEFL test
experiments, they achieved the best results when
they removed words that appeared in only one or
two documents. While we cannot simply remove infrequent codes, the precision of such suggestions
is likely to be lower.
The prevalence of negated clinical entities—
almost 14% in the entire data set—indicates the
importance of treating them as such in an NLP-based approach to clinical coding. Due to the extremely low recall (0.13) of the simple method
of detecting clinical entities through exact string
matching (Skeppstedt et al., 2011), negation handling could potentially have a more marked impact on the models if more clinical entities were to
be detected, as that would likely also entail more
negated terms.
There are, of course, various ways in which one
may choose to handle negations. An alternative
could have been simply to ignore negated terms in
the construction of the word space models, thereby
not correlating negated terms with affirmed diagnosis codes. Even if doing so may make sense, the
approach assumed here is arguably better since a
negated clinical entity could have a positive correlation with a diagnosis code. That is, ruling out
or disconfirming a particular diagnosis may be indicative of another diagnosis.
4.4 Combinations of Techniques
When the technocratic weighting technique is applied to the variants of the models which include
annotations of negated clinical entities, there is
no positive effect. In fact, results drop somewhat
when applied to the two domain-specific models. A possible explanation could perhaps be that
clinically significant words that are constituents
of negated clinical entities are not detected in
the technocratic approach. The reason for this is
that the application of the Swedish NegEx system, which is done prior to the construction and
evaluation of the models, marks the negated clinical entities in such a way that those words will
no longer be recognized by the technocratic word
detector. Such words may, of course, be of importance even if they are negated. This could be
worked around in various ways; one would be simply to give weight to all negated clinical entities.
Fixed fields filtering applied to the NegEx models has an impact that is more or less comparable
to the same technique applied to the models without negation handling. This weighting technique
is thus not obviously impeded by the annotations
of negated clinical entities, with the exception of
the rheumatology models, where an improvement
is observed, yet not as substantial as when applied
to Rheuma Model.
A combination of the technocratic word weighting and the fixed fields code filtering does not appear to provide any added value over the sole application of the latter weighting technique. Likewise, the same combination applied to the NegEx
version does not improve on the results of the fixed
fields filtering.
In this study, fine-tuning of weights has not been
performed, either internally or externally to each
of the weighting techniques. It may, of course, be
that, for instance, gender distinctions are more informative than age distinctions, or vice versa,
and thus need to be weighted accordingly. By the
same token, the more successful weighting
schemes should probably take precedence over the
less successful variants.
4.5 Classification Problem
It should be pointed out that the model-generated
recommendations are restricted to a set of properly formatted ICD-10 codes. Given the conditions under which real, clinically generated data
is produced, there is bound to be some noise, not
least in the form of inaccurately assigned and ill-formatted diagnosis codes. In fact, only 67.9% of
the codes in the general data set are in this sense
‘valid’ (86.5% in the ENT data set and 66.9% in
the rheumatology data set). As a result, a large
portion of the assigned codes in the testing partition cannot be recommended by the models, possibly having a substantial negative influence on the
evaluation scores. For instance, in the ear-nose-throat data, the five most frequent diagnosis codes
are not present in the restricted result set. Not all
of these are actually ‘invalid’ codes but rather action codes etc. that were not included in the list of
acceptable code recommendations. A fairer evaluation of the models would be either to include such
codes in the restricted result set or to base the restricted result set entirely on the codes in the data.
Furthermore, there is a large number of unseen
codes in the testing partitions, which also cannot
be recommended by the models (358 in the general data set, 79 in the ENT data set and 39 in the
rheumatology data set). This, on the other hand,
reflects the real-life conditions of a classification
system and so should not be eschewed; however, it
is interesting to highlight when evaluating the success of the models and the method at large.
5 Conclusion
The Random Indexing approach to clinical coding benefits from the incorporation of negation
handling and various weighting schemes. While
assigning additional weight to clinically significant words yields a fairly modest improvement,
filtering code candidates based on structured
patient records data leads to important boosts
in performance for general and domain-specific
models alike. Negation handling is also important,
although the way in which it is here performed
seems to require a large amount of training data
for marked benefits. Even if combining a number
of weighting techniques does not necessarily give
rise to additional improvements, tuning of the
weighting factors may help to do so.
Acknowledgments
We would like to thank all members of our research group, IT for Health, for their support and
input. We would especially like to express our
gratitude to Maria Skeppstedt for her important
contribution to the negation handling aspects of
this work, particularly in adapting her Swedish
NegEx system to our specific needs. Thanks also
to the reviewers for their helpful comments.
References
Hercules Dalianis, Martin Hassel and Sumithra
Velupillai. 2009. The Stockholm EPR Corpus: Characteristics and Some Initial Findings. In Proceedings
of ISHIMR 2009, pp. 243–249.
Zellig S. Harris. 1954. Distributional structure. Word,
10, pp. 146–162.
Aron Henriksson, Martin Hassel and Maria Kvist.
2011. Diagnosis Code Assignment Support Using
Random Indexing of Patient Records — A Qualitative Feasibility Study. In Proceedings of AIME, 13th
Conference on Artificial Intelligence in Medicine,
pp. 348–352.
Aron Henriksson and Martin Hassel. 2011. Election of
Diagnosis Codes: Words as Responsible Citizens. In
Proceedings of Louhi, 3rd International Workshop
on Health Document Text Mining and Information
Analysis.
Jay Jarman and Donald J. Berndt. 2010. Throw
the Bath Water Out, Keep the Baby: Keeping
Medically-Relevant Terms for Text Mining. In Proceedings of AMIA, pp. 336–340.
Daniel Jurafsky and James H. Martin. 2009. Speech
and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Education International, NJ, USA, p. 806.
Jussi Karlgren and Magnus Sahlgren. 2001. From
Words to Understanding. Foundations of Real-World
Intelligence, pp. 294–308.
Ola Knutsson, Johnny Bigert and Viggo Kann. 2003. A
Robust Shallow Parser for Swedish. In Proceedings
of Nodalida.
Leah S. Larkey and W. Bruce Croft. 1995. Automatic
Assignment of ICD9 Codes to Discharge Summaries. PhD thesis, University of Massachusetts
at Amherst, Amherst, MA, USA.
Serguei V.S. Pakhomov, James D. Buntrock and
Christopher G. Chute. 2006. Automating the Assignment of Diagnosis Codes to Patient Encounters
Using Example-based and Machine Learning Techniques. J Am Med Inform Assoc, 13, pp. 516–525.
John P. Pestian, Christopher Brew, Pawel Matykiewicz,
DJ Hovermale, Neil Johnson, K. Bretonnel Cohen
and Wlodzislaw Duch. 2007. A Shared Task Involving Multi-label Classification of Clinical Free Text.
In Proceedings of BioNLP 2007: Biological, translational, and clinical language processing, pp. 97–
104.
Magnus Sahlgren. 2001. Vector-Based Semantic Analysis: Representing Word Meanings Based on Random Labels. In Proceedings of Semantic Knowledge
Acquisition and Categorization Workshop at ESSLLI’01.
Magnus Sahlgren. 2006. The Word-Space Model: Using distributional analysis to represent syntagmatic
and paradigmatic relations between words in highdimensional vector spaces. In PhD thesis Stockholm
University, Stockholm, Sweden.
Maria Skeppstedt. 2011. Negation detection in Swedish
clinical text: An adaption of Negex to Swedish.
Journal of Biomedical Semantics 2, S3.
Maria Skeppstedt, Hercules Dalianis and Gunnar H.
Nilsson. 2011. Retrieving disorders and findings:
Results using SNOMED CT and NegEx adapted for
Swedish. In Proceedings of Louhi, 3rd International
Workshop on Health Document Text Mining and Information Analysis.
Mary H. Stanfill, Margaret Williams, Susan H. Fenton, Robert A. Jenders and William R Hersh. 2010.
A systematic literature review of automated clinical
coding and classification systems. J Am Med Inform
Assoc, 17, pp. 646–651.
Dominic Widdows. 2003. Orthogonal Negation in Vector Spaces for Modelling Word-Meanings and Document Retrieval. In Proceedings of ACL, pp. 136–143.
World Health Organization. 2011. International Classification of Diseases (ICD). In World Health
Organization. Retrieved June 19, 2011, from
http://www.who.int/classifications/icd/en/.
PAPER IV
OPTIMIZING THE DIMENSIONALITY OF CLINICAL
TERM SPACES FOR IMPROVED DIAGNOSIS CODING
SUPPORT
Aron Henriksson and Martin Hassel (2013). Optimizing the Dimensionality of
Clinical Term Spaces for Improved Diagnosis Coding Support. In Proceedings
of the 4th International Louhi Workshop on Health Document Text Mining and
Information Analysis (Louhi 2013), Sydney, Australia.
Author Contributions Aron Henriksson was responsible for designing the study
and for carrying out the experiments. Martin Hassel provided feedback on the
design of the study and contributed to the analysis of the results. The manuscript
was written by Aron Henriksson and reviewed by Martin Hassel.
Optimizing the Dimensionality of Clinical Term
Spaces for Improved Diagnosis Coding Support
Aron Henriksson and Martin Hassel
Department of Computer and Systems Sciences (DSV)
Stockholm University, Sweden
{aronhen,xmartin}@dsv.su.se
Abstract. In natural language processing, dimensionality reduction is
a common technique to reduce complexity that simultaneously addresses
the sparseness property of language. It is also used as a means to capture
some latent structure in text, such as the underlying semantics. Dimensionality reduction is an important property of the word space model, not
least in random indexing, where the dimensionality is a predefined model
parameter. In this paper, we demonstrate the importance of dimensionality optimization and discuss correlations between dimensionality and
the size of the vocabulary. This is of particular importance in the clinical
domain, where the level of noise in the text leads to a large vocabulary; it
may also mitigate the effect of exploding vocabulary sizes when modeling
multiword terms as single tokens. A system that automatically assigns
diagnosis codes to patient record entries is shown to improve by up to
18 percentage points by manually optimizing the dimensionality.
1 Introduction
Dimensionality reduction is important in limiting complexity when modeling
rare events, such as co-occurrences of all words in a vocabulary. In the word
space model, reducing the number of dimensions yields the additional benefit of
capturing second-order co-occurrences. There is, however, a trade-off between the
degree of dimensionality reduction and the ability to model semantics usefully.
This trade-off is specific to the dataset—the number of contexts and the size of
the vocabulary—and, to some extent, the task that the induced term space will
be applied to.
When working with noisy clinical text, which typically entails a large vocabulary, it may be especially prohibitive to pursue a dimensionality reduction that is
too aggressive. In random indexing, the dimensionality is a predefined parameter
of the model; however, there is precious little guidance in the literature on how
to optimize—and reason around—the dimensionality in an informed manner. In
the current work, we attempt to optimize the dimensionality toward the task of
assigning diagnosis codes to free-text patient record entries and reason around
the correlation between an optimal dimensionality and dataset-specific features,
such as the number of training documents and the size of the vocabulary.
The 4th International Louhi Workshop on Health Document Text Mining and
Information Analysis (Louhi 2013), edited by Hanna Suominen.
2 Distributional Semantics
With the increasing availability of large collections of electronic text, empirical
distributional semantics has gained in popularity. Such models rely on the observation that words with similar meanings tend to appear in similar contexts [1].
Representing terms as vectors in a high-dimensional vector space that encode
contextual co-occurrence information makes semantics computable: spatial proximity between vectors is assumed to indicate the degree of semantic relatedness
between terms². There are numerous approaches to producing these context
vectors. In many methods they are derived from an initial term-context matrix
that contains the (weighted, normalized) frequency with which the terms occur in different contexts³. The main problem with using these term-by-context
vectors is their dimensionality—equal to the number of contexts (e.g. documents/vocabulary size)—which involves unnecessary computational complexity,
in particular since most term-context occurrences are non-events, i.e. most of the
cells in the matrix will be zero. The solution is to project the high-dimensional
data into a low-dimensional space, while approximately preserving the relative
distances between data points. This not only reduces complexity and data sparseness; it has been shown also to improve the accuracy of term-term associations:
in this lower-dimensional space, terms no longer have to co-occur directly in the
same contexts for their vectors to gravitate towards each other; it is sufficient
for them to appear in similar contexts, i.e. co-occur with the same terms.
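The projection step described above can be illustrated with a toy random mapping; the sparse ternary projection matrix and the parameter choices below are illustrative assumptions, not the construction used in the experiments.

```python
import random

def random_projection(matrix, d, seed=0):
    """Project the rows of a high-dimensional term-context count matrix
    down to d dimensions by multiplying with a random sparse ternary
    matrix; relative distances are approximately preserved."""
    rng = random.Random(seed)
    n_cols = len(matrix[0])
    # one sparse ternary vector per original column (context)
    proj = []
    for _ in range(n_cols):
        row = [0] * d
        for pos in rng.sample(range(d), max(2, d // 50)):
            row[pos] = rng.choice((1, -1))
        proj.append(row)
    reduced = []
    for vec in matrix:
        out = [0] * d
        for j, weight in enumerate(vec):
            if weight:
                for k in range(d):
                    out[k] += weight * proj[j][k]
        reduced.append(out)
    return reduced
```

Because most cells of a term-context matrix are zero, only the non-zero weights contribute, which is what makes this kind of projection cheap in practice.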
This was in fact one of the main motivations behind the development of latent
semantic analysis (LSA) [2], which provided an effective solution to the problem of synonymy negatively affecting recall in information retrieval. In LSA, the
dimensionality of the initial term-document matrix is reduced by an expensive
matrix factorization technique called singular value decomposition. Random indexing (RI) [3] is a scalable and efficient alternative in which there is no explicit
dimensionality reduction step: it is not needed since there is no initial, high-dimensional term-context matrix to reduce. Instead, pre-reduced d-dimensional⁴ context vectors (where d is much smaller than the number of contexts) are constructed incrementally. First, each context (e.g. each document or unique term) is assigned a randomly generated index vector, which is high-dimensional, ternary⁵ and sparse:
a small number (1-2%) of +1s and -1s are randomly distributed; the rest of the
elements are set to zero. Ideally, index vectors should be orthogonal; however, in
the RI approximation they are, or should be, nearly orthogonal⁶. Each unique term is also assigned an initially empty context vector of the same dimensionality. The context vectors are then incrementally populated with context information by adding the index vectors of the contexts in which a target term appears.

² This can be estimated by, e.g., taking the cosine similarity between two term vectors.
³ A context can be defined as a non-overlapping passage of text (a document) or a sliding window of tokens/characters surrounding the target term.
⁴ In RI, the dimensionality is a model parameter. A benefit of employing a static dimensionality is scalability. Whether the dimensionality remains appropriate regardless of data size is, we argue, debatable and is preliminarily investigated here.
⁵ Allowing negative vector elements ensures that the entire vector space is utilized [4].
⁶ There are more nearly orthogonal than truly orthogonal directions in a high-dimensional vector space. Randomly generating sparse vectors within this space will, with a high probability, get us close enough to orthogonality [5].
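The incremental construction just described can be sketched as follows; the dimensionality, the number of non-zero elements and all names are illustrative assumptions, not the configuration used in the experiments.

```python
import math
import random
from collections import defaultdict

def make_index_vector(dim=2000, nonzero=20, rng=random):
    """Sparse ternary index vector: a handful of randomly placed +1s and
    -1s (here in equal numbers), all other elements zero; stored as
    {position: value} for efficiency."""
    positions = rng.sample(range(dim), nonzero)
    return {p: (1 if i < nonzero // 2 else -1) for i, p in enumerate(positions)}

def build_term_space(documents, dim=2000, nonzero=20):
    """Each document (context) gets a random index vector; each term's
    context vector accumulates the index vectors of the documents in
    which it occurs. No explicit reduction step is ever needed."""
    space = defaultdict(lambda: [0.0] * dim)
    for doc in documents:
        index = make_index_vector(dim, nonzero)
        for term in set(doc):
            vec = space[term]
            for pos, val in index.items():
                vec[pos] += val
    return space

def cosine(u, v):
    """Spatial proximity as an estimate of semantic relatedness."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Terms that occur in similar, not necessarily identical, documents end up with overlapping sums of index vectors, which is how second-order co-occurrence is captured.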
Although it is generally acknowledged that the dimensionality is an important design decision when constructing semantic spaces [4, 6], there is little guidance on how to choose one appropriately. Often a single, seemingly magic, number is chosen with little motivation. Generally, the following—rather vague—
guidelines are given: O(100) for LSA and O(1,000) for RI [4, 6]. Is this because
the exact dimensionality is of little empirical significance, or simply because it is
dependent on the dataset and the task? If the latter is the case—and choosing
an appropriate dimensionality is significant—it seems important to optimize the
dimensionality when carrying out experiments that utilize semantic term spaces.
In a few studies the impact of the dimensionality has been investigated empirically. In one study, the effect of the dimensionality of LSA and PLSA (Probabilistic LSA) was studied on a knowledge acquisition task. Dimensionalities ranging from 50 to 300 were tested; however, no regular tendency could be found [7].
In another study, using LSA on a term comparison task, a dimensionality in the 300–500 range was found to be something of an "island of stability", with the best results achieved with 400 [8]. In a study using RI, nine different dimensionalities (500–6,000) were tested in a text categorization task.
Here the performance hardly changed when the dimensionality exceeded 2,500
and the impact was generally low. The authors were led to conclude that the
choice of dimensionality is less important in RI than, for instance, LSA [9].
3 Applying Semantic Spaces in the Clinical Domain
There is a growing interest in the application of distributional semantics to the
biomedical domain (see [10] for an overview). Due to the difficulty of obtaining
large amounts of clinical data, however, this particular application (sub)-domain
has been less explored. There are several complicating factors that need to be
considered when working with this type of data, some of which have potential
effects on the application of semantic spaces. The noisy nature of clinical text,
with frequent misspellings and ad-hoc abbreviations, leads to a large vocabulary,
often with concepts having a great number of lexical instantiations. For instance,
the pharmaceutical product noradrenaline was shown to have approximately
sixty different spellings in a set of ICU nursing narratives written in Swedish [11].
One application of semantic spaces in the clinical domain is for the purpose of
(semi-)automatic diagnosis coding. A term space is constructed from a corpus of
clinical notes and diagnosis codes (ICD-10). Document-level contexts are used, as
there is no order dependency between codes and the words in an associated note.
The term space is then used to suggest ten codes for each document by allowing
the words to vote for distributionally similar codes in an ensemble fashion. In
these experiments, a dimensionality of 1,000 is employed [12].
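A hedged sketch of such an ensemble voting step, assuming word and code context vectors live in the same term space; the function names and the similarity-weighted voting are our illustrative reading of the method in [12], not its exact implementation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def suggest_codes(note_words, term_space, codes, per_word=20, top=10):
    """Each word in a note votes for the diagnosis codes most similar to
    it in the term space; votes are weighted by cosine similarity,
    summed across words, and the top-ranked codes are returned."""
    tally = {}
    for word in note_words:
        vec = term_space.get(word)
        if vec is None:
            continue
        sims = sorted(((cosine(vec, term_space[c]), c)
                       for c in codes if c in term_space), reverse=True)
        # consider only each word's top candidates
        for sim, code in sims[:per_word]:
            tally[code] = tally.get(code, 0.0) + sim
    return sorted(tally, key=tally.get, reverse=True)[:top]
```

With document-level contexts, a code and the words of its note share index vectors, so distributional similarity between a word and a code reflects how often they were assigned together.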
4 Experimental Results
Random indexing is used to construct the semantic spaces with eleven different dimensionalities between 1,000 and 10,000. The data⁷ is from the Stockholm
EPR Corpus [13] and contains lemmatized notes in Swedish. Variants of the
dataset are created with different thresholds used for the collocation segmentation [14]: a higher threshold means that stronger statistical associations between
constituents are required and fewer collocations will be identified. The collocations are concatenated and treated as single tokens, increasing the number
of word types and decreasing the number of tokens per type. Identifying collocations is done to see if modeling multiword terms, rather than only unigrams,
may boost results; it will also help to provide clues about the correlation between
features of the vocabulary and the optimal dimensionality (Table 1).
Table 1. Data description; the COLL X sets were created with different thresholds.

DATASET   | DOCUMENTS | WORD TYPES | TOKENS/TYPE
UNIGRAMS  | 219k      | 371,778    | 51.54
COLL 100  | 219k      | 612,422    | 33.19
COLL 50   | 219k      | 699,913    | 28.83
COLL 0    | 219k      | 1,413,735  | 13.53
Increasing the dimensionality yields major improvements (Table 2), up to 18
percentage points. The biggest improvements are seen when increasing the dimensionality from the 1,000-dimensionality baseline. When increasing the dimensionality beyond 2,000-2,500, the boosts in results begin to level off, although
further improvements are achieved with a higher dimensionality: the best results are achieved with a dimensionality of 10,000. A larger improvement is seen
with all of the COLL models compared to the UNIGRAMS model, even if the
UNIGRAMS model outperforms all three COLL models; however, with a higher
dimensionality, the COLL models appear to close in on the UNIGRAMS model.
⁷ This research has been approved by the Regional Ethical Review Board in Stockholm (Etikprovningsnämnden i Stockholm), permission number 2012/834-31.
⁸ The proportion of non-zero elements is another aspect of this, which is affected by changing the dimensionality while keeping the number of non-zero elements constant.

Table 2. Automatic diagnosis coding results, measured as recall top 10 for exact matches, with clinical term spaces constructed from differently preprocessed datasets (unigrams and three collocation variants) and with different dimensionalities (DIM).

DIM      | UNIGRAMS | COLL 100 | COLL 50 | COLL 0
1000     | 0.25     | 0.18     | 0.19    | 0.15
1500     | 0.31     | 0.26     | 0.25    | 0.20
2000     | 0.34     | 0.29     | 0.28    | 0.24
2500     | 0.35     | 0.31     | 0.30    | 0.26
3000     | 0.36     | 0.33     | 0.31    | 0.26
3500     | 0.37     | 0.33     | 0.32    | 0.28
4000     | 0.37     | 0.34     | 0.34    | 0.29
4500     | 0.37     | 0.33     | 0.33    | 0.30
5000     | 0.38     | 0.34     | 0.33    | 0.30
7500     | 0.39     | 0.34     | 0.33    | 0.32
10000    | 0.39     | 0.36     | 0.35    | 0.32
+/- (pp) | +14      | +18      | +16     | +17

5 Discussion

There are two dataset-specific features that affect the appropriateness of a given
dimensionality: the number of contexts and the size of the vocabulary. In this
case, each document is assigned an index vector. The RI approximation assumes
the near orthogonality of index vectors, which is dependent on the dimensionality: the lower the dimensionality, the higher the risk of two contexts being
assigned similar or identical index vectors⁸. When working with a large number
of contexts it is important to use a sufficiently high dimensionality. With 200k+
documents in the current experiments, a dimensionality of 1,000 is possibly too
low. This is a plausible explanation for the significant boosts in performance
observed when increasing the dimensionality. The size of the vocabulary is another important factor. With a low dimensionality, there is less room for context
vectors to be far apart [6]. A large vocabulary may not necessarily indicate a
large number of concepts—as we saw with the noradrenaline example—but it
can arguably serve as a crude indicator of that. For instance, the collocation
segmentation of the data represents an attempt to identify multiword terms and
thus meaningful concepts to model in addition to the (constituent) unigrams.
For these to be modeled as clearly distinct concepts, it is critical that the dimensionality is sufficiently large. This is of particular concern when using a wide
context definition, as there will be more co-occurrence events, resulting in more
similar context vectors. The COLL models are also affected by the fewer tokens
per type, which means that their semantic representation will be less statistically
well-grounded. The fact that all COLL models are outperformed by the UNIGRAMS model could, however, also be due to poor collocations. Moreover, when
working with clinical text, the vocabulary size is typically larger compared to
many other domains. This may help to explain why increasing the dimensionality
yielded such huge boosts in results also for the UNIGRAMS model.
Compared to prior work [9], where optimizing the dimensionality of RI-based
models yielded only minor changes, it is now evident that dimensionality optimization can be of the utmost importance, particularly when working with large
vocabularies and large document sets. A possible explanation for reaching different conclusions is their much smaller document set (21,578 vs 219k) and significantly smaller vocabulary (8,887 vs 300k+). It should be noted, however, that
their results were already much higher, making it more difficult to increase performance. This can also be viewed as demonstrating the particular importance
of dimensionality optimization for more difficult tasks (90 classes vs 12,396).
6 Conclusion
Optimizing the dimensionality of semantic term spaces is important and may
yield significant boosts in performance, which was demonstrated on the task
of automatically assigning diagnosis codes to clinical notes. It is of particular
importance when applying such models to the clinical domain, where the size of
the vocabulary tends to be large, and when working with large document sets.
References
1. Harris, Z.S.: Distributional structure. Word, 10, pp. 146–162 (1954).
2. Deerwester, S., Dumais, S., Furnas, G., Landauer, T., Harshman, R.: Indexing by
latent semantic analysis. Journal of the American Society for Information Science,
41(6), pp. 391–407 (1990).
3. Kanerva, P., Kristofersson, J., Holst, A.: Random indexing of text samples for latent
semantic analysis. In Proceedings of CogSci, p. 1036 (2000).
4. Karlgren, J., Holst, A., Sahlgren, M.: Filaments of Meaning in Word Space. In
Proceedings of ECIR, LNCS 4956, pp. 531–538 (2008).
5. Kaski, S.: Dimensionality Reduction by Random Mapping: Fast Similarity Computation for Clustering. In Proceedings of IJCNN, pp. 413–418 (1998).
6. Sahlgren, M.: The Word-Space Model: Using distributional analysis to represent
syntagmatic and paradigmatic relations between words in high-dimensional vector
spaces. In PhD thesis Stockholm University, Stockholm, Sweden (2006).
7. Kim, Y-S., Chang, J-H., Zhang, B-T.: An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition. In Proceedings of
PAKDD, LNAI 2637, pp. 111–116 (2003).
8. Bradford, R.B.: An Empirical Study of Required Dimensionality for Large-scale Latent Semantic Indexing Applications. In Proceedings of CIKM, pp. 153–162 (2008).
9. Sahlgren, M., Cöster, R.: Using Bag-of-Concepts to Improve the Performance of
Support Vector Machines in Text Categorization. In Proceedings of COLING (2004).
10. Cohen, T., Widdows, D.: Empirical distributional semantics: Methods and biomedical applications. Journal of Biomedical Informatics, 42, pp. 390–405 (2009).
11. Allvin, H., Carlsson, E., Dalianis, H., Danielsson-Ojala, R., Daudaravicius, V., Hassel, M., Kokkinakis, D., Lundgren-Laine, H., Nilsson, G. H., Nytrø, Ø., Salanterä,
S., Skeppstedt, M., Suominen, H., Velupillai, S.: Characteristics and Analysis of
Finnish and Swedish Clinical Intensive Care Nursing Narratives, In Proceedings of
Louhi, pp. 53–60 (2010).
12. Henriksson, A., Hassel, M.: Exploiting Structured Data, Negation Detection and
SNOMED CT Terms in a Random Indexing Approach to Clinical Coding. In Proceedings of RANLP Workshop on Biomedical NLP, pp. 11–18 (2011).
13. Dalianis, H., Hassel, M., Velupillai, S.: The Stockholm EPR Corpus: Characteristics
and Some Initial Findings. In Proceedings of ISHIMR 2009, pp. 243–249 (2009).
14. Daudaravicius, V.: The Influence of Collocation Segmentation and Top 10 Items to
Keyword Assignment Performance. In Proceedings of CICLing, pp. 648–660 (2010).
PAPER V
EXPLORATION OF ADVERSE DRUG REACTIONS IN
SEMANTIC VECTOR SPACE MODELS OF CLINICAL
TEXT
Aron Henriksson, Maria Kvist, Martin Hassel and Hercules Dalianis (2012).
Exploration of Adverse Drug Reactions in Semantic Vector Space Models of
Clinical Text. In Proceedings of the ICML Workshop on Machine Learning for
Clinical Data Analysis, Edinburgh, UK.
Author Contributions Aron Henriksson was responsible for coordinating the
study, its overall design and for carrying out the experiments. Maria Kvist was
involved in designing the study, evaluating the output and analyzing the results.
Martin Hassel and Hercules Dalianis provided feedback on the design of the study
and contributed to the analysis of the results. The manuscript was written by Aron
Henriksson and reviewed by all authors.
Exploration of Adverse Drug Reactions
in Semantic Vector Space Models of Clinical Text
Aron Henriksson
[email protected]
Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden
Maria Kvist
[email protected]
Department of Clinical Immunology and Transfusion Medicine, Karolinska University Hospital, Sweden
Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden
Martin Hassel
[email protected]
Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden
Hercules Dalianis
[email protected]
Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden
Abstract

A novel method for identifying potential side-effects to medications through large-scale analysis of clinical data is here introduced and evaluated. By calculating distributional similarities for medication-symptom pairs based on co-occurrence information in a large clinical corpus, many known adverse drug reactions are successfully identified. These preliminary results suggest that semantic vector space models of clinical text could also be used to generate hypotheses about potentially unknown adverse drug reactions. In the best model, 50% of the terms in a list of twenty are considered to be conceivable side-effects. Among the medication-symptom pairs, however, diagnostic indications and terms related to the medication in other ways also appear. These relations need to be distinguished in a more refined method for detecting adverse drug reactions.

1. Introduction

The prevalence of adverse drug reactions (ADRs) constitutes a major public health issue. In Sweden it has been identified as the seventh most common cause of death (Wester et al., 2008). The prospect of being able to detect potentially unknown or undocumented side-effects of medicinal substances by automatic means thus comes with great economic and health-related benefits.

As clinical trials are generally not sufficiently extensive to identify all possible side-effects – especially less common ones – many ADRs are unknown when a new drug enters the market. Pharmacovigilance is therefore carried out throughout the life cycle of a pharmaceutical product and is typically supported by reporting systems and medical chart reviews (Chazard, 2011). Recently, with the increasing adoption of Electronic Health Records (EHRs), several attempts have been made to facilitate this process by detecting ADRs from large amounts of clinical data. Many of the previous attempts have focused primarily on structured patient data, thereby missing out on potentially relevant information only available in the narrative parts of the EHRs.

The aim of this preliminary study is to investigate the application of distributional semantics, in the form of Random Indexing, to large amounts of clinical text in order to extract drug-symptom pairs. The models are evaluated for their ability to detect known and potentially unknown ADRs.
Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

2. Background
Extracting ADRs automatically from large amounts of data essentially entails discovering a relationship – ideally a cause-and-effect relationship – between biomedical concepts, typically between a medication and a symptom/disorder. A potentially valuable source for this is EHRs, in which symptoms – sometimes as possible adverse reactions to a drug – are documented at medical consultations of various kinds. In many cases, however, the prescription of a named medication precedes the description of symptoms (diagnostic indications): this obviously does not correspond to the required temporal order of a cause-and-effect relationship for side-effects. In other cases, the intake of a medication precedes the appearance of new symptoms, which could thus be side-effects or merely due to a second disease appearing independently from earlier medical conditions. The different ways in which drugs and symptoms can be documented in EHRs present a serious challenge for automatic detection of ADRs.

2.1. Related Research

Attempts to discover a relationship between drugs and potential side-effects are usually based on co-occurrence statistics, semantic interpretation and machine learning (Doğan et al., 2011). For instance, Benton et al. (2011) try to mine potential ADRs from online message boards by calculating co-occurrences with drugs in a twenty-token window.

The increasing digitalization of health records has enabled the application of data mining – sometimes with the incorporation of natural language processing – to clinical data in order to detect Adverse Drug Events (ADEs) automatically. Chazard (2011) mines French and Danish EHRs in order to extract ADE detection rules. Decision trees and association rules are employed to mine structured patient data, such as diagnosis codes and lab results; free-text discharge letters are only mined when ATC1 codes are missing. Wang et al. (2010) first map clinical entities from discharge summaries to UMLS2 codes and then calculate co-occurrences with drugs. Arakami et al. (2010) assume a similar two-step approach, but use Conditional Random Fields to identify terms; to identify relations, both machine learning and a rule-based method are used. They estimate that approximately eight percent of their Japanese EHRs contain ADE information. In order to facilitate machine learning efforts for automatic ADE detection, MEDLINE3 case reports are annotated for mentions of drugs, adverse effects, dosages, as well as the relationships between them. The ADE corpus is now freely available (Gurulingappa et al., 2012).

2.2. Distributional Semantics

Another way of discovering relationships between concepts is to apply models of distributional semantics to a large text collection. Common to the various models that exist is their attempt to capture the meaning of words based on their distribution in a corpus of unannotated text (Cohen & Widdows, 2009). These models are based on the distributional hypothesis (Harris, 1954), which states that words with similar distribution in language have similar meanings. Traditionally, the semantic representations have been spatial or geometric in nature, modeling terms as vectors in high-dimensional space according to which contexts – and with what frequency – they occur in. The semantic similarity of two terms can then be quantified by comparing their distributional profiles.

Random Indexing (see e.g. Sahlgren 2006) is a computationally efficient and scalable method that utilizes word co-occurrence information. The meaning of a word is represented as a vector: a point in a high-dimensional vector space that, with each observation, moves around so that words that appear in similar contexts (i.e. co-occur with the same words) gravitate towards each other. The distributional similarity between two terms can be calculated by e.g. taking the cosine of the angle between their vectorial representations.
3. Method
The data from which the models are induced is
extracted from the Stockholm EPR Corpus (Dalianis et al., 2009), which contains EHRs written in
Swedish4 . A document comprises clinical notes from
a single patient visit. A patient visit is difficult to
delineate; here it is simply defined according to the
continuity of the documentation process: all free-text
entries written on consecutive days are concatenated
into one document. Two data sets are created: one in
which patients from all age groups are included (20M
tokens) and another where records for patients over
fifty years are discarded (10M tokens). This is done
in order to investigate whether it is easier to identify ADRs in patients who are less likely to have a multitude of health issues. The web of cause and effect may
1. Anatomical Therapeutic Chemical, used for the classification of drugs.
2. Unified Medical Language System: http://www.nlm.nih.gov/research/umls/.
3. Contains journal citations and abstracts for biomedical literature: http://www.nlm.nih.gov/pubs/factsheets/medline.html.
4. This research has been approved by the Regional Ethical Review Board in Stockholm (Etikprövningsnämnden i Stockholm), permission number 2009/1742-31/5.
thereby be disentangled to some extent. Preprocessing
is done on the data sets in the form of lemmatization
and stop-word removal.
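The Random Indexing procedure outlined in Section 2.2 can be sketched as follows. This is a minimal toy illustration with invented example sentences and drug names; the actual models are induced from millions of tokens of clinical text:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(42)
DIM, NONZERO, WINDOW = 1000, 8, 4  # dimensionality, ±1s per index vector, 4+4 window

def index_vector():
    """Sparse ternary vector: mostly zeros with a few random +1/-1 entries,
    which makes the index vectors nearly orthogonal to each other."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], NONZERO)
    return v

index = defaultdict(index_vector)           # static index (context) vectors
word = defaultdict(lambda: np.zeros(DIM))   # incrementally trained word vectors

def train(tokens):
    """Add the index vectors of all words within the sliding window
    to the word vector of the focus word."""
    for i, t in enumerate(tokens):
        for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
            if j != i:
                word[t] += index[tokens[j]]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

corpus = [
    "patient prescribed metoprolol for hypertension".split(),
    "patient prescribed atenolol for hypertension".split(),
    "patient reports nausea after metoprolol".split(),
]
for sent in corpus:
    train(sent)
print(cosine(word["metoprolol"], word["atenolol"]))
```

Words that share contexts (here the two beta blockers) end up with similar vectors, while the sparsity of the index vectors keeps the procedure incremental and memory-bounded.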
Random Indexing is applied on the two data sets with two sets of parameters, yielding a total of four models. A model is induced from the data in the following way. Each unique term in the corpus is assigned a context vector and a word vector with the same predefined dimensionality, which in this case is set to 1,000. The context vectors are static, consisting of zeros and a small number of randomly placed 1s and -1s, which ensures that they are nearly orthogonal. The word vectors – initially empty – are incrementally built up by adding the context vectors of the surrounding words within a sliding window. The context is the only model parameter we experiment with, using a sliding window of eight (4+4) or 24 (12+12) surrounding words.

The models are then used to generate lists of twenty distributionally similar terms for twenty common medications that have multiple known side-effects. To enable a comparison of the models induced from the two data sets, two groups of medications are selected: (1) drugs for cardiovascular disorders and (2) drugs that are not disproportionally prescribed to patients in older age groups, e.g. drugs for epilepsy, diabetes mellitus, infections and allergies. Moreover, as side-effects are manifested as either symptoms or disorders, a vocabulary of unigram SNOMED CT5 terms belonging to the semantic categories finding and disorder is compiled and used to filter the lists generated by the models. The drug-symptom/disorder pairs are here manually evaluated by a physician using lists of indications and known side-effects6.

5. Systematized Nomenclature of Medicine – Clinical Terms: http://www.ihtsdo.org/snomed-ct/.
6. http://www.fass.se, a medicinal database.

4. Results

Approximately half of the model-generated suggestions are disorders/symptoms that are conceivable side-effects; around 10% are known and documented (Table 1). A fair share of the terms are indications for prescribing a certain drug (∼10-15%), while differential diagnoses (i.e. alternative indications) also appear with some regularity (∼5-6%). Interestingly, terms that explicitly indicate mention of possible adverse drug events, such as side-effect, show up fairly often. However, over 20% of the terms proposed by all of the models are obviously incorrect; in some cases they are not symptoms/disorders, and sometimes they are simply not conceivable side-effects.

Table 1. Results for models built on the two data sets (all age groups vs. only ≤ 50 years) with a sliding window context of 8 (SW 8) and 24 words (SW 24) respectively. S-E = side-effect; D/S = disorder/symptom.

Type              All, SW 8   All, SW 24   ≤ 50, SW 8   ≤ 50, SW 24
Known S-E             11          11           10            11
Potential S-E         37          39           35            35
Indication            14          14           10            10
Alt. indication        6           6            5             6
S-E term              10           9            7             7
D/S, not S-E          15          14           23            21
Not D/S                7           7           10            10
SUM %                100         100          100           100

Using the model induced from the larger data set – comprising all age groups – slightly more potential side-effects are identified, whereas the proportion of known side-effects remains constant across models. The number of incorrect suggestions increases when using the smaller data set. The difference in the scope of the context, however, does not seem to have a significant impact on these results. Similarly, when analyzing the results for the two groups of medications separately and vis-à-vis the two data sets, no major differences are found.

5. Discussion

The models produce some interesting results, being able to identify indications for a given drug, as well as known and potentially unknown ADRs. In a more refined method for identifying unknown side-effects, indications and known side-effects could easily be filtered out automatically. A large portion of the suggested terms were, however, clearly neither indications nor drug reactions. One reason is that the list of SNOMED CT findings and disorders contains several terms that are neither symptoms nor disorders. Refining this list is an obvious remedy, which could be done by for instance using a list of known side-effects. We also found that many of the unexpected words were very rare in the data, which means that their semantic representation is not statistically well-grounded: such words should probably be removed. Many common and expected side-effects did, on the other hand, not show up, e.g. headache. This could be due to the prevalence of this term in many different contexts, which diffuses its meaning in such a way that it is not distributionally similar to any particular other term. Moreover, since the vocabulary list was not lemmatized, but the health records were, some terms that were not already in their base form could not be proposed by the models. Some
known side-effects were not present in the vocabulary
list.
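The vocabulary filtering described in the method, together with one possible remedy for the lemmatization mismatch just noted (lemmatizing the vocabulary to match the lemmatized records), might be sketched as follows. The terms and the trivial "lemmatizer" are invented stand-ins for the real SNOMED CT list and lemmatization tools:

```python
def filter_neighbours(ranked_terms, vocabulary, lemmatize=lambda t: t):
    """Keep only the candidate terms found in the symptom/disorder
    vocabulary; optionally lemmatize the vocabulary entries so that
    they match the lemmatized corpus terms."""
    vocab = {lemmatize(term) for term in vocabulary}
    return [t for t in ranked_terms if t in vocab]

# Invented toy data: a model's ranked neighbours for a drug, and a tiny
# stand-in for the unigram SNOMED CT finding/disorder list in which one
# entry ("illamåendet") is not in its base form.
neighbours = ["yrsel", "ordinerades", "illamående", "huvudvärk"]
snomed_terms = ["yrsel", "illamåendet", "huvudvärk"]

print(filter_neighbours(neighbours, snomed_terms))
# → ['yrsel', 'huvudvärk']  ("illamående" is missed, as in the paper)
print(filter_neighbours(neighbours, snomed_terms, lambda t: t.rstrip("t")))
# → ['yrsel', 'illamående', 'huvudvärk']  (toy "lemmatizer" recovers it)
```

In a refined pipeline, the same filter could additionally remove indications and already-documented side-effects, leaving only novel candidate ADRs for review.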
A limitation of these models is that they are presently restricted to unigrams; however, many side-effects are multiword expressions. Moreover, the models in these experiments were induced from data comprising all types of clinical notes; however, Wang et al. (2010) showed that their results improved by using specific sections of the clinical narratives. It would also be interesting to generate drug-symptom pairs for multiple drugs that belong to the same ATC code and extract potential side-effects common to them both. This could be a means to find stronger suspicions of ADRs related to a particular chemical substance.

6. Conclusion

By calculating distributional similarities for medication-symptom pairs based on co-occurrence information in a large clinical corpus, many known ADRs are successfully detected. These preliminary results suggest that semantic vector space models of clinical text could also be used to generate hypotheses about potentially unknown ADRs. Although several limitations must be addressed for this method to demonstrate its true potential, distributional similarity is a useful notion to incorporate in a more sophisticated method for detecting ADRs from clinical data.

Acknowledgments

This work was supported by the Swedish Foundation for Strategic Research through the project High-Performance Data Mining for Drug Effect Detection (ref. no. IIS11-0053) at Stockholm University, Sweden.

References

Arakami, E., Miura, Y., Tonoike, M., Ohkhuma, T., Masuichi, H., Waki, K., and Ohe, K. Extraction of adverse drug effects from clinical records. In MEDINFO 2010, pp. 739–743. IOS Press, 2010.

Benton, A., Ungar, L., Hill, S., Hennessy, S., Mao, J., Chung, A., Leonard, C. E., and Holmes, J. H. Identifying potential adverse effects using the web: A new approach to medical hypothesis generation. Journal of Biomedical Informatics, 44:989–996, 2011.

Chazard, Emmanuel. Automated Detection of Adverse Drug Events by Data Mining of Electronic Health Records. PhD thesis, Université Lille Nord de France, 2011.

Cohen, T. and Widdows, D. Empirical distributional semantics: Methods and biomedical applications. Journal of Biomedical Informatics, 42:390–405, 2009.

Dalianis, H., Hassel, M., and Velupillai, S. The Stockholm EPR Corpus: Characteristics and Some Initial Findings. In Proc. of ISHIMR, pp. 243–249, 2009.

Doğan, R. I., Névéol, A., and Lu, Z. A context-blocks model for identifying clinical relationships in patient records. BMC Bioinformatics, 12(3):1–11, 2011.

Gurulingappa, H., Rajput, A. M., Roberts, A., Fluck, J., Hofmann-Apitius, M., and Toldo, L. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of Biomedical Informatics, http://dx.doi.org/10.1016/j.jbi.2012.04.008, 2012.

Harris, Z. S. Distributional structure. Word, 10:146–162, 1954.

Sahlgren, M. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD thesis, Stockholm University, 2006.

Wang, X., Chase, H., Markatou, M., Hripcsak, G., and Friedman, C. Selecting information in electronic health records for knowledge acquisition. Journal of Biomedical Informatics, 43(4):595–601, 2010.

Wester, K., Jönsson, A. K., Spigset, O., Druid, H., and Hägg, S. Incidence of fatal adverse drug reactions: a population based study. British Journal of Clinical Pharmacology, 65(4):573–579, 2008.