educe Documentation
Release 0.1
Eric Kow
November 20, 2015
Contents

1  User manual
   1.1  STAC tools

2  Tutorial
   2.1  STAC
   2.2  RST-DT
   2.3  PDTB

3  Cookbook
   3.1  [STAC] Turns and resources

4  educe package
   4.1  Layers
   4.2  Departures from the ideal (2013-05-23)
   4.3  Subpackages
   4.4  Submodules
   4.5  educe.annotation module
   4.6  educe.corpus module
   4.7  educe.glozz module
   4.8  educe.graph module
   4.9  educe.internalutil module
   4.10 educe.util module

5  Indices and tables

Bibliography

Python Module Index
CHAPTER 1
User manual
Educe is mainly a library, but it comes with a small number of command line tools that can be useful for poking and prodding at the corpora it supports.
1.1 STAC tools
Educe comes with a number of command line utilities for querying, checking, and modifying the STAC corpus:
• stac-util: queries
• stac-check: sanity checks (development)
• stac-edit: modifications to the corpus (development)
• stac-oneoff: rare one-off modifications (development)
The first tool (stac-util) may be useful to all users of the STAC corpus, whereas the last three (stac-check, stac-edit, and stac-oneoff) are more likely to be of interest for corpus development work.
1.1.1 stac-util
The stac-util toolkit provides some potentially useful queries on the corpus.
stac-util text
Dump the text in documents along with segment annotations:

stac-util text --doc s2-leagueM-game2 \
    --subdoc 02 --anno 'BRONZE|SILVER|GOLD' --stage discourse

This utility can be useful for getting a sense of what a particular document contains, without having to fire up the Glozz platform.
========== s2-leagueM-game2 [02] discourse SILVER ============

72 : gotwood4sheep : [anyone got wood?]
73 : gotwood4sheep : [i can offer sheep]
74 : gotwood4sheep : [phrased in such a way i don't riff on my un]
75 : inca : [i'm up for that]
76 : CheshireCatGrin : [I have no wood]
77 : gotwood4sheep : [1:1?]
78 : inca : [yep,] [only got one]
81 : gotwood4sheep : [matt, do you got clay?] [I can offer many things]
82 : CheshireCatGrin : [No clay either]
83 : gotwood4sheep : [anyone else?]
84 : dmm : [i think clay is in short supply]
85 : inca : [sorry,] [none here either]
86 : gotwood4sheep : [indeed, something to do with a robber on the 5]
87 : gotwood4sheep : [alas]
stac-util count
Display some basic counts on the corpus or a given subset thereof:

stac-util count --doc s1-league3-game4

The output includes the number of instances of EDUs, turns, etc.
Document structure
============================================================
                       per doc
            total      min      max      mean     median
---------   ------     -----    -----    ------   --------
doc         1
subdoc      3          3        3        3        3
dialogue    7          7        7        7        7
turn star   25         25       25       25       25
turn        28         28       28       28       28
edu         58         58       58       58       58
...
along with dialogue-acts and relation instances...
Relation instances
============================================================
BRONZE                 total    ...
--------------------   ------
Comment                3
Elaboration            1
Acknowledgement        4
Continuation           4
Explanation            1
Q-Elab                 3
Result                 3
Background             1
Parallel               2
Question-answer_pair   8
TOTAL                  30
stac-util count-rfc
Count right frontier violations given all the RFC algorithms we have implemented:

stac-util count-rfc --doc pilot21

Output for the above includes both a total count and a per-label count.
Both                     total    basic    mlast
----------------------   ------   ------   ------
TOTAL                    290      33       11
Question-answer_pair     91       4        0
Comment                  32       7        5
Continuation             23       3        1
Elaboration              22       4        0
Q-Elab                   22       3        1
Acknowledgement          20       2        0
...
stac-util count-shapes
Count and draw the number of instances of shapes that we deem to be interesting (for now, this only means "lozenges", but we may come up with other shapes in the future, for example, instances of nodes with in-degree > 1):
stac-util count-shapes --anno 'GOLD|SILVER|BRONZE' \
    --output /tmp/graphs \
    data/socl-season1
Aside from the graphs it draws, this displays a per-document count along with the total:
s1-league2-game1 [14] discourse SILVER   1 (4)
s1-league2-game2 [01] discourse GOLD     3 (23)
s1-league2-game2 [02] discourse GOLD     1 (5)
s1-league2-game2 [03] discourse GOLD     1 (6)
s1-league2-game3 [03] discourse BRONZE   2 (10)
s1-league2-game4 [01] discourse BRONZE   1 (4)
s1-league2-game4 [03] discourse BRONZE   1 (6)
...
TOTAL lozenges: 46
TOTAL edges in lozenges: 234
stac-util graph
Draw the discourse graph for a corpus:

stac-util graph --doc s1-league1-game2 --anno SILVER \
    --output /tmp/graphs \
    data/socl-season1
Tips:
• --strip-cdus shows what the graph would look like with an automated CDU-removing algorithm applied to it
• --rfc <algo> will highlight the right frontier and violations given an RFC algorithm (eg. --rfc basic)
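For example, to draw the same graphs with the basic right frontier highlighted (a hypothetical but plausible invocation combining the flags above with the graph command shown earlier):

stac-util graph --doc s1-league1-game2 --anno SILVER \
    --rfc basic \
    --output /tmp/graphs \
    data/socl-season1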
stac-util filter-graph
View all instances of a relation (or set of relations):

stac-util filter-graph --doc s1-league1-game2 \
    --output /tmp/graphs \
    data/socl-season1 \
    Question-answer_pair Acknowledgement
(Sorry, easy mode not available)
1.1.2 stac-check
The STAC corpus (at the time of this writing 2015-06-12) is a work in progress, and so some of our utilities are geared
at making it easier to clean up the annotations we have. The STAC sanity checker can be used to see what problems
there are with the current crop of annotations.
The sanity checker is best run in easy mode in the STAC development directory (ie. the project SVN at the time of
this writing):
stac-check --doc pilot03
It will output a report directory in a temporary location (something like /tmp/sanity-pilot03/). The report will be in HTML (with links to some styled XML documents and SVG graphs) and so should be viewed in a browser.
1.1.3 stac-edit and stac-oneoff
stac-edit and stac-oneoff are probably best reserved for people interested in refining the annotations in the STAC corpus. See the --help options for these tools, or get in touch with us for our internal documentation.
1.1.4 User interface notes
Command line filters
The stac utilities tend to use the same idiom for filtering the corpus on the command line. For example, the following command will try to display the text for all (sub)documents in the training-2015-05-30 corpus whose document names start with "pilot"; whose subdocument is either '02', '03', or '04'; and which are in the 'discourse' stage and by the annotator 'GOLD':
stac-util text --doc 'pilot' \
    --subdoc '0[2-4]' \
    --stage 'discourse' \
    --anno 'GOLD' \
    data/FROZEN/training-2015-05-30
As we can see above, the filters are Python regular expressions, which can sometimes be useful for expressing range matches. It's also possible to filter as much or as little as you want, for example with this subcommand showing EVERY gold-annotated document in that corpus:
stac-util text --anno 'GOLD' data/FROZEN/training-2015-05-30
Or this command, which displays every single document there is:
stac-util text data/FROZEN/training-2015-05-30
Easy mode
The commands generally come with an "easy mode" where you need only specify a single document via --doc:
stac-util text --doc pilot03
If you do this, the stac utilities will guess that you wanted the development corpus directory and sometimes some
sensible flags to go with it.
Note that "easy mode" does not preclude the use of other flags; you could still have complex filters like the following:
stac-util text --doc pilot03 --subdoc '0[2-4]' --anno GOLD
Easy mode is available for stac-check, stac-edit, stac-oneoff, and stac-util.
CHAPTER 2
Tutorial
Note: if you have downloaded the educe source code, this tutorial is also available as iPython notebooks in the doc directory.
2.1 STAC
Educe is a library for working with a variety of discourse corpora. This tutorial aims to show what using educe would
be like when working with the STAC corpus.
We’ll be working with a tiny fragment of the corpus included with educe. You may find it useful to symlink your
larger copy from the STAC distribution and modify this tutorial accordingly.
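For example (the source path here is an assumption; point the link at wherever your copy of the corpus actually lives):

ln -s $HOME/CORPORA/stac data/stac-sample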
2.1.1 Installation
git clone https://github.com/irit-melodi/educe.git
cd educe
pip install -r requirements.txt
Note: these instructions assume you are running within a virtual environment. If not, and if you have permission
denied errors, replace pip with sudo pip.
2.1.2 Tutorial in browser (optional)
This tutorial can either be followed along with the command line and your favourite text editor, or embedded in an
interactive webpage via iPython:
pip install ipython
cd tutorials
ipython notebook
# some helper functions for the tutorial below

def text_snippet(text):
    "short text fragment"
    if len(text) < 43:
        return text
    else:
        return "{0}...{1}".format(text[:20], text[-20:])

def highlight(astring, color=1):
    "coloured text"
    return "\x1b[3{color}m{str}\x1b[0m".format(color=color, str=astring)
2.1.3 Reading corpus files (STAC)
Typically, the first thing we want to do when working in educe is to read the corpus in. This can be a bit slow, but as
we will see later on, we can speed things up if we know what we’re looking for.
from __future__ import print_function
import educe.stac

# relative to the educe docs directory
data_dir = '../data'
corpus_dir = '{dd}/stac-sample'.format(dd=data_dir)

# read everything from our sample
reader = educe.stac.Reader(corpus_dir)
corpus = reader.slurp(verbose=True)

# print a text fragment from the first ten files we read
for key in corpus.keys()[:10]:
    doc = corpus[key]
    print("[{0}] {1}".format(key, doc.text()[:50]))
Slurping corpus dir [99/100]
[s1-league2-game1 [05] unannotated None] 199 : sabercat : anyone any clay? 200 : IG : nope
[s1-league2-game1 [13] units hjoseph] 521 : sabercat : skinnylinny 522 : sabercat : som
[s1-league2-game1 [10] units hjoseph] 393 : skinnylinny : Shall we extend? 394 : saberc
[s1-league2-game1 [11] discourse hjoseph] 450 : skinnylinny : Argh 451 : skinnylinny : How
[s1-league2-game1 [10] unannotated None] 393 : skinnylinny : Shall we extend? 394 : saberc
[s1-league2-game1 [02] units lpetersen] 75 : sabercat : anyone has any wood? 76 : skinnyl
[s1-league2-game1 [14] units SILVER] 577 : sabercat : skinny 578 : sabercat : I need 2
[s1-league2-game3 [03] discourse lpetersen] 151 : amycharl : got wood anyone? 152 : sabercat
[s1-league2-game1 [10] discourse hjoseph] 393 : skinnylinny : Shall we extend? 394 : saberc
[s1-league2-game1 [12] units SILVER] 496 : sabercat : yes! 497 : sabercat : :D 498 : s
Slurping corpus dir [100/100 done]
Faster reading
If you know that you only want to work with a subset of the corpus files, you can pre-filter the corpus before reading
the files.
It helps to know here that an educe corpus is a mapping from file id keys to Documents. The FileId tells us what
makes a Document distinct from another:
• document (eg. s1-league2-game1): in STAC, the game that was played (here, season 1, league 2, game 1)
• subdocument (eg. 05): a mostly arbitrary subdivision of the documents motivated by technical constraints
(overly large documents would cause our annotation tool to crash)
• stage (eg. units, discourse, parsed): the kinds of annotations available in the document
• annotator (eg. hjoseph): the main annotator for a document (gold standard documents have the distinguished
annotators, BRONZE, SILVER, or GOLD)
NB: unfortunately we have overloaded the word "document" here. When talking about file ids, "document" refers to a whole game. But when talking about actual annotation objects, an educe Document corresponds to a specific combination of document, subdocument, stage, and annotator.
import re

# nb: you can import this function from educe.stac.corpus
def is_metal(fileid):
    "is this a gold standard(ish) annotation file?"
    anno = fileid.annotator or ""
    return anno.lower() in ["bronze", "silver", "gold"]

# pick out gold-standard documents
subset = reader.filter(reader.files(),
                       lambda k: is_metal(k) and int(k.subdoc) < 4)
corpus_subset = reader.slurp(subset, verbose=True)
for key in corpus_subset:
    doc = corpus_subset[key]
    print("{0}: {1}".format(key, doc.text()[:50]))
Slurping corpus dir [11/12]
s1-league2-game1 [01] units SILVER: 1 : sabercat : btw, are we playing without the ot
s1-league2-game1 [01] discourse SILVER: 1 : sabercat : btw, are we playing without the ot
s1-league2-game1 [02] discourse SILVER: 75 : sabercat : anyone has any wood? 76 : skinnyl
s1-league2-game3 [01] discourse BRONZE: 1 : amycharl : i made it! 2 : amycharl : did the
s1-league2-game1 [03] discourse SILVER: 109 : sabercat : well done! 110 : IG : More clay!
s1-league2-game3 [02] units BRONZE: 73 : sabercat : skinny, got some ore? 74 : skinny
s1-league2-game3 [01] units BRONZE: 1 : amycharl : i made it! 2 : amycharl : did the
s1-league2-game1 [02] units SILVER: 75 : sabercat : anyone has any wood? 76 : skinnyl
s1-league2-game3 [02] discourse BRONZE: 73 : sabercat : skinny, got some ore? 74 : skinny
s1-league2-game1 [03] units SILVER: 109 : sabercat : well done! 110 : IG : More clay!
s1-league2-game3 [03] discourse BRONZE: 151 : amycharl : got wood anyone? 152 : sabercat
s1-league2-game3 [03] units BRONZE: 151 : amycharl : got wood anyone? 152 : sabercat
Slurping corpus dir [12/12 done]
from educe.corpus import FileId

# pick out an example document to work with; creating FileIds by hand
# is not something we would typically do (normally we would just iterate
# through a corpus), but it's useful for illustration
ex_key = FileId(doc='s1-league2-game3',
                subdoc='03',
                stage='units',
                annotator='BRONZE')
ex_doc = corpus[ex_key]
print(ex_key)

s1-league2-game3 [03] units BRONZE
2.1.4 Standing off
Most annotations in the STAC corpus are educe standoff annotations. In educe terms, this means that they (perhaps
indirectly) extend the educe.annotation.Standoff class and provide a text_span() function. Much of
our reasoning around annotations essentially consists of checking that their text spans overlap or enclose each other.
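Since we have an example document in hand, here is a small sketch of that idiom (not part of the original tutorial; it uses the same encloses test that the cookbook chapter relies on):

# find every unit annotation that falls inside the first dialogue
dialogues = [x for x in ex_doc.units if x.type == 'Dialogue']
dlg0 = dialogues[0]
enclosed = [x for x in ex_doc.units if dlg0.encloses(x)]
print("units inside the first dialogue:", len(enclosed))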
As for the text spans, these refer to the raw text saved in files with an .ac extension (eg. s1-league1-game3.ac).
In the Glozz annotation tool, these .ac text files form a pair with their .aa xml counterparts. Multiple annotation
files can point to the same text file.
There are also some annotations that come from 3rd party tools, which we will uncover later.
2.1.5 Documents and EDUs
A document is a sort of giant annotation that contains three other kinds of annotation:
• units - annotations that directly cover a span of text (EDUs, Resources, but also turns, dialogues)
• relations - annotations that point from one annotation to another (see the quick sketch after this list)
• schemas - annotations that point to a set of annotations
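Relations and schemas get less attention in this tutorial (see the TODO below), but as a quick taste, here is a hedged sketch of previewing relation instances in a discourse-stage document; it assumes that educe Relation objects expose source and target annotations (check educe.annotation for the authoritative API):

# a sketch, not from the original tutorial: preview a few relation
# instances from the discourse counterpart of our example document
disc_key = FileId(doc='s1-league2-game3', subdoc='03',
                  stage='discourse', annotator='BRONZE')
disc_doc = corpus[disc_key]
for rel in disc_doc.relations[:3]:
    src = disc_doc.text(rel.source.text_span())
    tgt = disc_doc.text(rel.target.text_span())
    print("{0}: {1} -> {2}".format(rel.type,
                                   text_snippet(src),
                                   text_snippet(tgt)))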
To start things off, we'll focus on one type of unit-level annotation, the Elementary Discourse Unit:
def preview_unit(doc, anno):
    "the default str(anno) can be a bit overwhelming"
    preview = "{span: <11} {id: <20} [{type: <12}] {text}"
    text = doc.text(anno.text_span())
    return preview.format(id=anno.local_id(),
                          type=anno.type,
                          span=anno.text_span(),
                          text=text_snippet(text))

print("Example units")
print("-------------")
seen = set()
for anno in ex_doc.units:
    if anno.type not in seen:
        seen.add(anno.type)
        print(preview_unit(ex_doc, anno))

print()
print("First few EDUs")
print("--------------")
for anno in filter(educe.stac.is_edu, ex_doc.units)[:4]:
    print(preview_unit(ex_doc, anno))
Example units
-------------
(1,34)      stac_1368693094      [paragraph   ] 151 : amycharl : got wood anyone?
(52,66)     stac_1368693099      [Accept      ] yep, for what?
(117,123)   stac_1368693105      [Refusal     ] no way
(189,191)   stac_1368693114      [Other       ] :)
(209,210)   stac_1368693117      [Counteroffer] ?
(659,668)   stac_1368693162      [Offer       ] how much?
(22,26)     asoubeille_1374939590843 [Resource    ] wood
(35,66)     stac_1368693098      [Turn        ] 152 : sabercat : yep, for what?
(0,266)     stac_1368693124      [Dialogue    ] 151 : amycharl : go...cat : yep, thank you

First few EDUs
--------------
(52,66)     stac_1368693099      [Accept      ] yep, for what?
(117,123)   stac_1368693105      [Refusal     ] no way
(163,171)   stac_1368693111      [Accept      ] could be
(189,191)   stac_1368693114      [Other       ] :)
2.1.6 TODO
Everything below this point should be considered to be in a scratch/broken state. It needs to be ported over from its RST/DT considerations to STAC.
To do:
• standing off (ac/aa) - shared aa
• layers (units/discourse)
• working with relations and schemas
• grabbing resources etc (example of working with unit level annotation)
• synchronising layers (grabbing the dialogue act and relations at the same time)
• external annotations (postags, parse trees)
• working with hypergraphs (implementing _repr_png_() would be pretty sweet)
Tree searching
The same span enclosure logic can be used to search parse trees for particular constituents, for example verb phrases. Alternatively, you can use the topdown method provided by educe trees. This returns just the largest constituents for which some predicate is true. It optionally accepts an additional argument to cut off the search when it is clearly out of bounds.
2.1.7 Conclusion
In this tutorial, we've explored a couple of basic educe concepts, which we hope will enable you to extract some data from your discourse corpora, namely:
• reading corpus data (and pre-filtering)
• standoff annotations
• searching by span enclosure, overlapping
• working with trees
• combining annotations from different sources
The concepts above should transfer to whatever discourse corpus you are working with (that educe supports, or that
you are prepared to supply a reader for).
Work in progress
This tutorial is very much a work in progress (last update: 2014-09-19). Educe is a bit of a moving target, so let me
know if you run into any trouble!
See also
stac-util
Some of the things you may want to do with the STAC corpus may already exist in the stac-util command line tool.
stac-util is meant to be a sort of Swiss Army Knife, providing tools for editing the corpus. The query tools are more
likely to be of interest:
• text: display text and edu/dialogue segmentation in a friendly way
• graph: draw discourse graphs with graphviz (arrows for relations, boxes for CDUs, etc)
• filter-graph: visualise instances of relations (eg. Question answer pair)
• count: generate statistics about the corpus
See stac-util --help for more details.
External tool support
Educe has some support for reading data from outside the discourse corpus proper. For example, if you run the Stanford CoreNLP parser on the raw text, you can read the results back into educe-style ConstituencyTree and DependencyTree annotations. See educe.external for details.
If you have a part of speech tagger that you would like to use, the educe.external.postag module may be useful for representing the annotations that come out of it.
You can also add support for your own tools by creating annotations that extend Standoff, directly or otherwise.
2.2 RST-DT
Educe is a library for working with a variety of discourse corpora. This tutorial aims to show what using educe would
be like.
2.2.1 Installation
git clone https://github.com/irit-melodi/educe.git
cd educe
pip install -r requirements.txt
Note: these instructions assume you are running within a virtual environment. If not, and if you have permission
denied errors, replace pip with sudo pip.
2.2.2 Tutorial setup
RST-DT portions of this tutorial require that you have a local copy of the RST Discourse Treebank. For purposes of this tutorial, you will need to link this into the data directory, for example:
ln -s $HOME/CORPORA/rst_discourse_treebank data
ln -s $HOME/CORPORA/PTBIII data
Tutorial in browser (optional)
This tutorial can either be followed along with the command line and your favourite text editor, or embedded in an
interactive webpage via iPython:
pip install ipython
cd tutorials
ipython notebook
2.2.3 Reading corpus files (RST-DT)
from __future__ import print_function
import educe.rst_dt

# relative to the educe docs directory
data_dir = '../data'
rst_corpus_dir = '{dd}/rst_discourse_treebank/data/RSTtrees-WSJ-double-1.0/'.format(dd=data_dir)

# read and load the documents from the WSJ which were double-tagged
rst_reader = educe.rst_dt.Reader(rst_corpus_dir)
rst_corpus = rst_reader.slurp(verbose=True)

# print a text fragment from the first ten files we read
for key in rst_corpus.keys()[:10]:
    doc = rst_corpus[key]
    print("{0}: {1}".format(key.doc, doc.text()[:50]))
Slurping corpus dir [51/53]
wsj_1365.out: The Justice Department has revised certain interna
wsj_0633.out: These are the last words Abbie Hoffman ever uttere
wsj_1105.out: CHICAGO - Sears, Roebuck & Co. is struggling as it
wsj_1168.out: Wang Laboratories Inc. has sold $25 million of ass
wsj_1100.out: Westinghouse Electric Corp. said it will buy Shaw
wsj_1924.out: CALIFORNIA STRUGGLED with the aftermath of a Bay a
wsj_0669.out: Nissan Motor Co. expects net income to reach 120 b
wsj_0651.out: Nelson Holdings International Ltd. shareholders ap
wsj_2309.out: Atco Ltd. said its utilities arm is considering bu
wsj_1120.out: Japan has climbed up from the ashes of World War I
Slurping corpus dir [53/53 done]
Faster reading
If you know that you only want to work with a subset of the corpus files, you can pre-filter the corpus before reading
the files.
It helps to know here that an educe corpus is a mapping from file id keys to documents. The FileId contains the
minimally identifying metadata for a document, for example, the document name, or its annotator. For RST-DT, only
the doc attribute is used.
rst_subset = rst_reader.filter(rst_reader.files(),
                               lambda k: k.doc.startswith("wsj_062"))
rst_corpus_subset = rst_reader.slurp(rst_subset, verbose=True)
for key in rst_corpus_subset:
    doc = rst_corpus_subset[key]
    print("{0}: {1}".format(key.doc, doc.text()[:50]))
wsj_0627.out: October employment data -- also could turn out to
wsj_0624.out: Costa Rica reached an agreement with its creditor
Slurping corpus dir [2/2 done]
2.2.4 Trees and annotations
RST DT documents are basically trees:
from educe.corpus import FileId

# an (ex)ample document
ex_key = educe.rst_dt.mk_key("wsj_1924.out")
ex_doc = rst_corpus[ex_key]  # pick a document from the corpus

# display PNG tree
from IPython.display import display
ex_subtree = ex_doc[2][0][0][1]  # navigate down to a small subtree
display(ex_subtree)  # NLTK > 3.0b1 2013-07-11 should display a PNG image of the RST tree
                     # Mac users: see note below
Note for Mac users following along in iPython: if displaying the tree above does not work (particularly if you see a
GS prompt in your iPython terminal window instead of an embedded PNG in your browser), try my NLTK patch from
2014-09-17.
Standing off
RST DT trees function both as NLTK trees, and as educe standoff annotations. Most annotations in educe can be
seen as standoff annotations in some sense; they (perhaps indirectly) extend educe.annotation.Standoff and
provide a text_span() function. Comparing annotations usually consists of comparing their text spans.
Text spans in the RST DT corpus refer to the source document beneath each tree file, eg. for the tree file
wsj_1111.out.dis, educe reads wsj_1111.out as its source text. (The source text is somewhat optional
as the RST trees themselves contain text, but this tends to have subtle differences with its underlying source). Below,
we see an example of one of these source documents.
ex_rst_txt_filename = '{corpus}/{doc}'.format(corpus=rst_corpus_dir,
                                              doc=ex_key.doc)
with open(ex_rst_txt_filename) as ifile:
    ex_txt = ifile.read()

ex_snippet_start = ex_txt.find("At a national")
print(ex_txt[ex_snippet_start:ex_snippet_start + 500])

At a nationally televised legislative session in Budapest, the Parliament overwhelmingly approved cha
The country was renamed the Republic of Hungary.
Like other Soviet bloc nations, it had been known as a "people's republic" since
The voting for new laws followed dissolution of Hungary's Communist Party this month and
Now let’s have a closer look at the annotations themselves.
# it may be useful to have a couple of helper functions to
# display standoff annotations in a generic way

def text_snippet(text):
    "short text fragment"
    if len(text) < 43:
        return text
    else:
        return "{0}...{1}".format(text[:20], text[-20:])

def preview_standoff(tystr, context, anno):
    "simple glimpse at a standoff annotation"
    span = anno.text_span()
    text = context.text(span)
    return "{tystr} at {span}:\t{snippet}".format(tystr=tystr,
                                                  span=span,
                                                  snippet=text_snippet(text))
EDUs and subtrees
# in educe RST/DT all annotations have a shared context object
# that refers to an RST document; you don't always need to use
# it, but it can be handy for writing general code like the
# above
ex_context = ex_doc.label().context

# display some edus
print("Some edus")
edus = ex_subtree.leaves()
for edu in edus:
    print(preview_standoff("EDU", ex_context, edu))

print("\nSome subtrees")
# display some RST subtrees and the edus they enclose
for subtree in ex_subtree.subtrees():
    node = subtree.label()
    stat = "N" if node.is_nucleus() else "S"
    label = "{stat} {rel: <30}".format(stat=stat,
                                       rel=node.rel)
    print(preview_standoff(label, ex_context, subtree))
Some edus
EDU at (1504,1609):	At a nationally tele...gly approved changes
EDU at (1610,1662):	formally ending one-...tion in the country,
EDU at (1663,1703):	regulating free elections by next summer
EDU at (1704,1750):	and establishing the...e of state president
EDU at (1751,1782):	to replace a 21-member council.

Some subtrees
S elaboration-general-specific   at (1504,1782):	At a nationally tele...a 21-member council.
N span                           at (1504,1609):	At a nationally tele...gly approved changes
S elaboration-object-attribute-e at (1610,1782):	formally ending one-...a 21-member council.
N List                           at (1610,1662):	formally ending one-...tion in the country,
N List                           at (1663,1703):	regulating free elections by next summer
N List                           at (1704,1782):	and establishing the...a 21-member council.
N span                           at (1704,1750):	and establishing the...e of state president
S purpose                        at (1751,1782):	to replace a 21-member council.
Paragraphs and sentences
Going back to the source text, we can notice that it seems to be divided into sentences and paragraphs with line
separators. This does not seem to be done very consistently, and in any case, RST constituents seem to traverse these
boundaries freely. But they can still make for useful standoff annotations.
for para in ex_context.paragraphs[4:8]:
    print(preview_standoff("paragraph", ex_context, para))
    for sent in para.sentences:
        print("\t" + preview_standoff("sentence", ex_context, sent))
paragraph at (862,1288):	The 77-year-old offi...o-democracy groups.
	sentence at (862,1029):	The 77-year-old offi...ttee in East Berlin.
	sentence at (1030,1144):	Honecker, who was re... for health reasons.
	sentence at (1145,1288):	He was succeeded by ...o-democracy groups.
paragraph at (1290,1432):	Honecker's departure...nted with his rule.
	sentence at (1290,1432):	Honecker's departure...nted with his rule.
paragraph at (1434,1502):	HUNGARY ADOPTED cons... democratic system.
	sentence at (1434,1502):	HUNGARY ADOPTED cons... democratic system.
paragraph at (1504,1913):	At a nationally tele...e's republic" since
	sentence at (1504,1782):	At a nationally tele...a 21-member council.
	sentence at (1783,1831):	The country was rena...Republic of Hungary.
	sentence at (1832,1913):	Like other Soviet bl...e's republic" since
2.2.5 Penn Treebank integration
RST DT annotations are mostly over Wall Street Journal articles from the Penn Treebank. If you have a copy of the latter at the ready, you can ask educe to read and align the two (ie. PTB annotations treated as standing off the RST source text). This alignment consists of some universal substitutions (eg. -LRB- to '(') and a bit of hardcoding to account for seemingly random differences in whitespace/punctuation.
from educe.rst_dt import ptb
from nltk.tree import Tree

# confusingly, this is not an educe corpus reader, but the NLTK
# bracketed reader. Sorry
ptb_reader = ptb.reader('{dd}/PTBIII/parsed/mrg/wsj/'.format(dd=data_dir))
ptb_trees = {}
for key in rst_corpus:
    ptb_trees[key] = ptb.parse_trees(rst_corpus, key, ptb_reader)

# pick and display an arbitrary ptb tree
ex0_ptb_tree = ptb_trees[rst_corpus.keys()[0]][0]
print(ex0_ptb_tree.pprint()[:400])
(S
  (NP-SBJ
    (DT <educe.external.postag.Token object at 0x10e41ecd0>)
    (NNP <educe.external.postag.Token object at 0x10e41ee10>)
    (NNP <educe.external.postag.Token object at 0x10e41ef50>))
  (VP
    (VBZ <educe.external.postag.Token object at 0x10e41efd0>)
    (VP
      (VP
        (VBN <educe.external.postag.Token object at 0x10e41ef90>)
        (NP
          (JJ <educe.external.postag.
The result of this alignment is an educe ConstituencyTree, the leaves of which are educe Token objects. We’ll
say a little bit more about these below.
# show what's beneath these educe tokens
def str_tree(tree):
    if isinstance(tree, Tree):
        return Tree(str(tree.label()), map(str_tree, tree))
    else:
        return str(tree)

print(str_tree(ex0_ptb_tree).pprint()[:400])
(S
  (NP-SBJ
    (DT The/DT (0,3))
    (NNP Justice/NNP (4,11))
    (NNP Department/NNP (12,22)))
  (VP
    (VBZ has/VBZ (23,26))
    (VP
      (VP
        (VBN revised/VBN (27,34))
        (NP
          (JJ certain/JJ (35,42))
          (JJ internal/JJ (43,51))
          (NNS guidelines/NNS (52,62))))
      (CC and/CC (63,66))
      (VP (VBN clarified/VBN (67,76)) (NP (NNS others/NNS (77,83))))
2.2.6 Combining annotations
We now have several types of annotation at our disposal:
• EDUs and RST trees
• raw text paragraph/sentences (not terribly reliable)
• PTB trees
The next question that arises is how we can use these annotations in conjunction with each other.
Span enclosure and overlapping
The simplest way to reason about annotations is via their text spans (particularly since they tend to be sloppy and to overlap). Suppose, for example, we wanted to find all of the edus in a tree that are in the same sentence as a given edu.
from itertools import chain

# pick an EDU, any edu
ex_edus = ex_subtree.leaves()
ex_edu0 = ex_edus[3]
print(preview_standoff('example EDU', ex_context, ex_edu0))

# all of the sentences in the example document
ex_sents = list(chain.from_iterable(x.sentences for x in ex_context.paragraphs))

# sentences that overlap the edu
# (we use overlaps instead of encloses because edus might
# span sentence boundaries)
ex_edu0_sents = [x for x in ex_sents if x.overlaps(ex_edu0)]

# and now the edus that overlap those sentences
ex_edu0_buddies = []
for sent in ex_edu0_sents:
    print(preview_standoff('overlapping sentence', ex_context, sent))
    buddies = [x for x in ex_edus if x.overlaps(sent)]
    buddies.remove(ex_edu0)
    for edu in buddies:
        print(preview_standoff('\tnearby EDU', ex_context, edu))
    ex_edu0_buddies.extend(buddies)
example EDU at (1704,1750):	and establishing the...e of state president
overlapping sentence at (1504,1782):	At a nationally tele...a 21-member council.
	nearby EDU at (1504,1609):	At a nationally tele...gly approved changes
	nearby EDU at (1610,1662):	formally ending one-...tion in the country,
	nearby EDU at (1663,1703):	regulating free elections by next summer
	nearby EDU at (1751,1782):	to replace a 21-member council.
Span example 2 (exercise)
As an exercise, how about extracting the PTB part of speech tags for every token in our example EDU? How, for example, would you determine if an EDU contains a VBG-tagged word?
ex_postags = list(chain.from_iterable(t.leaves() for t in ptb_trees[ex_key]))
print("some of the POS tags")
for postag in ex_postags[300:310]:
    print(preview_standoff(postag.tag, ex_context, postag))
print()

ex_edu0_postags = []  # EXERCISE <-- fill this in
print("has VBG? ", )  # EXERCISE <-- fill this in
some of the POS tags
VBG at (1663,1673): regulating
JJ at (1674,1678): free
NNS at (1679,1688): elections
IN at (1689,1691): by
JJ at (1692,1696): next
NN at (1697,1703): summer
CC at (1704,1707): and
VBG at (1708,1720): establishing
DT at (1721,1724): the
NN at (1725,1731): office
has VBG?
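One way to fill in the blanks (a sketch, not an official solution) is to reuse the span-overlap idiom from the previous section:

# POS tokens whose spans overlap the example EDU
ex_edu0_postags = [x for x in ex_postags if x.overlaps(ex_edu0)]
print("has VBG? ", any(x.tag == 'VBG' for x in ex_edu0_postags))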
Tree searching
The same span enclosure logic can be used to search parse trees for particular constituents, for example verb phrases. Alternatively, you can use the topdown method provided by educe trees. This returns just the largest constituents for which some predicate is true. It optionally accepts an additional argument to cut off the search when it is clearly out of bounds.
ex_ptb_trees = ptb_trees[ex_key]
ex_edu0_ptb_trees = [x for x in ex_ptb_trees if x.overlaps(ex_edu0)]
ex_edu0_cons = []
for ptree in ex_edu0_ptb_trees:
    print(preview_standoff('ptb tree', ex_context, ptree))
    ex_edu0_cons.extend(ptree.topdown(lambda c: ex_edu0.encloses(c)))

# the largest constituents enclosed by this edu
for cons in ex_edu0_cons:
    print(preview_standoff(cons.label(), ex_context, cons))
display(ex_edu0_cons[3])
ptb tree at (1504,1782):	At a nationally tele...a 21-member council.
CC at (1704,1707):	and
VBG at (1708,1720):	establishing
NP at (1721,1731):	the office
PP at (1732,1750):	of state president
WHNP-1 at (1750,1750):
NP-SBJ at (1750,1750):
2.2.7 Simplified trees
The tree representation used in the RST DT can take some getting used to (relation labels are placed on the satellite rather than the root of a subtree). You may prefer to work with the simplified representation instead. In the simple representation, trees are binarised and relation labels are moved to the root node. Compare, for example, the two versions of the same RST subtree.
# rearrange the tree so that it is easier to work with
ex_simple_subtree = educe.rst_dt.SimpleRSTTree.from_rst_tree(ex_subtree)
print('Corpus representation\n\n')
display(ex_subtree)
print('Simplified (binarised, rotated) representation\n\n')
display(ex_simple_subtree)
Corpus representation

Simplified (binarised, rotated) representation
2.2.8 Dependency trees and back
Educe also provides an experimental conversion between the simplified trees above and dependency trees. See the educe.rst_dt.deptree module for the algorithm used.
Our current example is a little too small to give a sense of what the resulting dependency tree might look like, so we'll back up slightly closer to the root to have a wider view.
from educe.rst_dt import deptree

ex_subtree2 = ex_doc[2]
ex_simple_subtree2 = educe.rst_dt.SimpleRSTTree.from_rst_tree(ex_subtree2)
ex_deptree2 = deptree.relaxed_nuclearity_to_deptree(ex_simple_subtree2)
display(ex_deptree2)
Going back to our original example, we can (lossily) convert back from these dependency tree representations to RST
trees. The dependency trees have some ambiguities in them that we can’t resolve without an oracle, but we can at least
make some guesses. Note that when converting back to RST, we need to supply a list of relation labels that should be
treated as multinuclear.
ex_deptree = deptree.relaxed_nuclearity_to_deptree(ex_simple_subtree)
ex_from_deptree = deptree.relaxed_nuclearity_from_deptree(ex_deptree, ["list"])  # multinuclear, in lowercase
display(ex_from_deptree)
2.2.9 Conclusion
In this tutorial, we've explored a couple of basic educe concepts, which we hope will enable you to extract some data from your discourse corpora, namely:
• reading corpus data (and pre-filtering)
• standoff annotations
• searching by span enclosure, overlapping
• working with trees
• combining annotations from different sources
The concepts above should transfer to whatever discourse corpus you are working with (that educe supports, or that
you are prepared to supply a reader for).
That said, some of the features mentioned in this particular tutorial are specific to the RST DT:
• simplifying RST trees
• converting them to dependency trees
• PTB integration
This tutorial was last updated on 2014-09-18. Educe is a bit of a moving target, so let me know if you run into any
trouble!
See also
rst-dt-util
Some of the things you may want to do with the RST DT may already exist in the rst-dt-util command line tool. See
rst-dt-util --help for more details.
(At the time of this writing the only really useful tool is the rst-dt-util reltypes one, which prints an inventory of relation
labels, but the utility may grow over time)
External tool support
Educe has some support for reading data from outside the discourse corpus proper. For example, if you run the Stanford CoreNLP parser on the raw text, you can read the results back into educe-style ConstituencyTree and DependencyTree annotations. See educe.external for details.
If you have a part of speech tagger that you would like to use, the educe.external.postag module may be useful for representing the annotations that come out of it.
You can also add support for your own tools by creating annotations that extend Standoff, directly or otherwise.
2.3 PDTB
Educe is a library for working with a variety of discourse corpora. This tutorial aims to show what using educe would
be like when working with the Penn Discourse Treebank corpus.
2.3.1 Installation
git clone https://github.com/kowey/educe.git
cd educe
pip install -r requirements.txt
Note: these instructions assume you are running within a virtual environment. If not, and if you have permission
denied errors, replace pip with sudo pip.
2.3.2 Tutorial setup
This tutorial requires that you have a local copy of the PDTB. For purposes of this tutorial, you will need to link this into the data directory, for example:

ln -s $HOME/CORPORA/pdtb_v2 data

Optionally, to match the PDTB text spans to their analyses in the Penn Treebank, you also need to have a local copy of the PTB at the same location:
ln -s $HOME/CORPORA/PTBIII data
Tutorial in browser (optional)
This tutorial can either be followed along with the command line and your favourite text editor, or embedded in an
interactive webpage via iPython:
pip install ipython
cd tutorials
ipython notebook
# some helper functions for the tutorial below

def show_type(rel):
    "short string for a relation type"
    return type(rel).__name__[:-8]  # remove "Relation"

def highlight(astring, color=1):
    "coloured text"
    return "\x1b[3{color}m{str}\x1b[0m".format(color=color, str=astring)
2.3.3 Reading corpus files (PDTB)
NB: unfortunately, at the time of this writing, PDTB support in educe is very much behind and rather inconsistent with
that of the other corpora. Apologies for the mess!
from __future__ import print_function
import educe.pdtb

# relative to the educe docs directory
data_dir = '../data'
corpus_dir = '{dd}/pdtb_v2/data'.format(dd=data_dir)

# read a small sample of the pdtb
reader = educe.pdtb.Reader(corpus_dir)
anno_files = reader.filter(reader.files(),
                           lambda k: k.doc.startswith('wsj_231'))
corpus = reader.slurp(anno_files, verbose=True)

# print the first five rel types we read from each doc
for key in corpus.keys()[:10]:
    doc = corpus[key]
    rtypes = [show_type(r) for r in doc]
    print("[{0}] {1}".format(key.doc, " ".join(rtypes[:5])))
Slurping corpus dir [7/8]
[wsj_2315] Explicit Implicit Entity Explicit Implicit
[wsj_2311] Implicit
[wsj_2316] Explicit Implicit Implicit Implicit Explicit
[wsj_2310] Entity
[wsj_2319] Explicit
[wsj_2317] Implicit Implicit Explicit Implicit Explicit
[wsj_2313] Entity Explicit Explicit Implicit Explicit
[wsj_2314] Explicit Explicit Implicit Explicit Entity
Slurping corpus dir [8/8 done]
2.3.4 What’s a corpus?
A corpus is a dictionary from FileId keys to representations of PDTB documents.
Keys
A key has several fields meant to distinguish different annotated documents from each other. In the case of the PDTB, the only field of interest is doc, a Wall Street Journal article number as you might find in the PTB.
ex_key = educe.pdtb.mk_key('wsj_2314')
ex_doc = corpus[ex_key]
print(ex_key)
print(ex_key.__dict__)
wsj_2314 [None] discourse unknown
{'doc': 'wsj_2314', 'subdoc': None, 'annotator': 'unknown', 'stage': 'discourse'}
Documents
At some point in the future, the representation of a document may change to something a bit higher level and easier
to work with. For now, a “document” in the educe PDTB sense consists of a list of relations, each relation having a
low-level representation that hews fairly closely to the grammar described in the PDTB annotation manual.
TIP: At least until educe grows a more educe-like uniform representation of PDTB annotations, a very useful resource to look at when working with the PDTB may be The Penn Discourse Treebank 2.0 Annotation Manual, sections 6.3.1 to 6.3.5 (Description of PDTB representation format → File format → General outline...).
lr = [r for r in ex_doc]
r0 = lr[0]
type(r0).__name__
'ExplicitRelation'
Relations
There are five types of relation annotation: explicit, implicit, altlex, entity, and no (as in no relation). These are described in further detail in the PDTB annotation manual. Here we'll try to sketch out some of the important properties.
The main thing to notice is that the 5 types of annotation do not have very much in common with each other, but they have many overlapping pieces (see the table in the educe.pdtb docs):
• a relation instance always has two arguments (these can be selected as arg1 and arg2)
def display_rel(r):
    "pretty print a relation instance"
    rtype = show_type(r)
    if rtype == "Explicit":
        conn = highlight(r.connhead)
    elif rtype == "Implicit":
        conn = "{rtype} {conn1}".format(rtype=rtype,
                                        conn1=highlight(str(r.connective1)))
    elif rtype == "AltLex":
        conn = "{rtype} {sem1}".format(rtype=rtype,
                                       sem1=highlight(r.semclass1))
    else:
        conn = rtype
    fmt = "{src}\n \t ---[{label}]---->\n \t\t\t{tgt}"
    return fmt.format(src=highlight(r.arg1.text, 2),
                      label=conn,
                      tgt=highlight(r.arg2.text, 2))

print(display_rel(r0))
Quantum Chemical Corp. went along for the ride
	 ---[Connective(when | Temporal.Synchrony)]---->
			the price of plastics took off in 1987
r0.connhead.text
u'when'
2.3.5 Gorn addresses
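A Gorn address identifies a node in a tree by the path of child indices from the root; the PDTB uses these to anchor argument spans in the corresponding Penn Treebank trees (the gorn values below, eg. 0.1.1.0, are exactly such paths). A toy illustration with a plain NLTK tree (the sentence is invented for illustration):

from nltk.tree import Tree

toy = Tree.fromstring("(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))")
# address 1.0 means: child 1 of the root, then child 0 of that node;
# NLTK trees can be indexed directly by such index paths
print(toy[1, 0])  # (VBZ sleeps)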
# print the first seven gorn addresses for the first argument of the
# first 5 rels we read from each doc
for key in corpus.keys()[:3]:
    doc = corpus[key]
    print(key.doc)
    for r in doc[:5]:
        print("\t{0}".format(r.arg1.gorn[:7]))
wsj_2315
	[0.0, 0.1.0, 0.1.1.0, 0.1.1.1, 0.1.1.2, 0.2]
	[1.1.1]
	[3]
	[5.1.1.1.0]
	[6.0, 6.1.0, 6.1.1.0, 6.1.1.1.0, 6.1.1.1.1, 6.1.1.1.2, 6.1.1.1.3.0]
wsj_2311
	[0]
wsj_2316
	[0.0.0, 0.0.1, 0.0.3, 0.1, 0.2]
	[2.0.0, 2.0.1, 2.0.3, 2.1, 2.2]
	[4]
	[5.3.4.1.1.2.2.2]
	[5.3.4]
2.3.6 Penn Treebank integration
from educe.pdtb import ptb

# confusingly, this is not an educe corpus reader, but the NLTK
# bracketed reader. Sorry
ptb_reader = ptb.reader('{dd}/PTBIII/parsed/mrg/wsj/'.format(dd=data_dir))
ptb_trees = {}
for key in corpus.keys()[:3]:
    ptb_trees[key] = ptb.parse_trees(corpus, key, ptb_reader)
    print("{0}...".format(str(ptb_trees[key])[:100]))
[Tree('S', [Tree('NP-SBJ-1', [Tree('NNP', ['RJR']), Tree('NNP', ['Nabisco']), Tree('NNP', ['Inc.'])]...
[Tree('S', [Tree('NP-SBJ', [Tree('NNP', ['CONCORDE']), Tree('JJ', ['trans-Atlantic']), Tree('NNS', [...
[Tree('S', [Tree('NP-SBJ', [Tree('NP', [Tree('DT', ['The']), Tree('NNP', ['U.S.'])]), Tree(',', [','...
!ls ../data/PTBIII/parsed/mrg/wsj/
00  01  02  03  04  05  06  07  08  ...
def pick_subtree(tree, gparts):
    if gparts:
        return pick_subtree(tree[gparts[0]], gparts[1:])
    else:
        return tree

# print the gorn addresses for the arguments of the first few relations
# in the first doc, along with the corresponding subtrees
ndocs = 1
nrels = 3
for key in corpus.keys()[:ndocs]:
    doc = corpus[key]
    ptb_tree = ptb_trees[key]
    print("=======" + key.doc)
    for i, r in enumerate(doc[:nrels]):
        print("---- relation {0}".format(i+1))
        print(display_rel(r))
        for (j, arg) in enumerate([r.arg1, r.arg2]):
            print(".... arg {0}".format(j+1))
            glist = arg.gorn
            subtrees = [pick_subtree(ptb_tree, g.parts) for g in glist]
            for gorn, subtree in zip(glist, subtrees):
                print("{0}\n{1}".format(gorn, str(subtree)))
=======wsj_2315
---- relation 1
RJR Nabisco Inc. is disbanding its division responsible for buying network advertising
	 ---[Connective(after | Temporal.Asynchronous.Succession)]---->
			moving 11 of the group's 14 employees to New York from Atlanta
.... arg 1
0.0
(NP-SBJ-1 (NNP RJR) (NNP Nabisco) (NNP Inc.))
0.1.0
(VBZ is)
0.1.1.0
(VBG disbanding)
0.1.1.1
(NP
(NP (PRP$ its) (NN division))
(ADJP
(JJ responsible)
(PP
(IN for)
(S-NOM
(NP-SBJ (-NONE- ))
(VP
(VBG buying)
(NP (NN network) (NN advertising) (NN time)))))))
0.1.1.2
(, ,)
0.2
(. .)
.... arg 2
0.1.1.3.2
(S-NOM
(NP-SBJ (-NONE- *-1))
(VP
(VBG moving)
(NP
(NP (CD 11))
(PP
(IN of)
(NP
(NP (DT the) (NN group) (POS `s))
(CD 14)
(NNS employees))))
(PP-DIR (TO to) (NP (NNP New) (NNP York)))
(PP-DIR (IN from) (NP (NNP Atlanta)))))
---- relation 2
that it is shutting down the RJR Nabisco Broadcast unit, and dismissing its 14 employee
	 ---[Implicit Connective(in addition | Expansion.Conjunction)]---->
			RJR is discussing its network-buying plans with its two main advert
.... arg 1
1.1.1
(SBAR
(IN that)
(S
(NP-SBJ (PRP it))
(VP
(VBZ is)
(VP
(VP
(VBG shutting)
(PRT (RP down))
(NP
(DT the)
(NNP RJR)
(NNP Nabisco)
(NNP Broadcast)
(NN unit)))
(, ,)
(CC and)
(VP (VBG dismissing) (NP (PRP$ its) (CD 14) (NNS employees)))
(, ,)
(PP-LOC
(IN in)
(NP
(DT a)
(NN move)
(S
(NP-SBJ (-NONE- *))
(VP (TO to) (VP (VB save) (NP (NN money)))))))))))
.... arg 2
2.1.1
(SBAR
(-NONE- 0)
(S
(NP-SBJ (NNP RJR))
(VP
(VBZ is)
(VP
(VBG discussing)
(NP (PRP$ its) (JJ network-buying) (NNS plans))
(PP
(IN with)
(NP
(NP
(PRP$ its)
(CD two)
(JJ main)
(NN advertising)
(NNS firms))
(, ,)
(NP
(NP (NNP FCB/Leber) (NNP Katz))
(CC and)
(NP (NNP McCann) (NNP Erickson)))))))))
---- relation 3
We found with the size of our media purchases that an ad agency could do just as good a
	 ---[Entity]---->
			An executive close to the company said RJR is spending about $140 m
.... arg 1
3
(SINV
(`` ``)
(S-TPC-3
(NP-SBJ (PRP We))
(VP
(VBD found)
(PP
(IN with)
(NP
(NP (DT the) (NN size))
(PP (IN of) (NP (PRP$ our) (NNS media) (NNS purchases)))))
(SBAR
(IN that)
(S
(NP-SBJ (DT an) (NN ad) (NN agency))
(VP
(MD could)
(VP
(VB do)
(NP (ADJP (RB just) (RB as) (JJ good)) (DT a) (NN job))
(PP
(IN at)
(NP (ADJP (RB significantly) (JJR lower)) (NN cost)))))))))
(, ,)
(`' `')
(VP (VBD said) (S (-NONE- *T-3)))
(NP-SBJ
(NP (DT the) (NN spokesman))
(, ,)
(SBAR
(WHNP-1 (WP who))
(S
(NP-SBJ-4 (-NONE- T-1))
(VP
(VBD declined)
(S
(NP-SBJ (-NONE- -4))
(VP
(TO to)
(VP
(VB specify)
(SBAR
(WHNP-2 (WRB how) (JJ much))
(S
(NP-SBJ (NNP RJR))
(VP
(VBZ spends)
(NP (-NONE- *T-2))
(PP-CLR
(IN on)
(NP (NN network) (NN television) (NN time)))))))))))))
(. .))
.... arg 2
4
(S
(NP-SBJ
(NP (DT An) (NN executive))
(ADJP (RB close) (PP (TO to) (NP (DT the) (NN company)))))
(VP
(VBD said)
(SBAR
(-NONE- 0)
(S
(NP-SBJ (NNP RJR))
(VP
(VBZ is)
(VP
(VBG spending)
(NP
(NP
(QP (RB about) ($ $) (CD 140) (CD million))
(-NONE- U ))
(ADVP (-NONE- ICH-1)))
(PP-CLR
(IN on)
(NP (NN network) (NN television) (NN time)))
(NP-TMP (DT this) (NN year))
(, ,)
(ADVP-1
(RB down)
(PP
(IN from)
(NP
(NP
(QP (RB roughly) ($ $) (CD 200) (CD million))
(-NONE- U ))
(NP-TMP (JJ last) (NN year))))))))))
(. .))
print(subtree.flatten())
print(subtree.leaves())
(S An executive close to the company said 0 RJR is spending about $ 140
  million U ICH-1 on network television time this year , down from
  roughly $ 200 million U last year .)
[u'An', u'executive', u'close', u'to', u'the', u'company', u'said', u'0', u'RJR', u'is', u'
from copy import copy
t = copy(subtree)
print("constituent = " + highlight(t.label()))
for i in range(len(subtree)):
    print(i)
    print(t.pop())
constituent = S
0
(. .)
1
(VP
(VBD said)
(SBAR
(-NONE- 0)
(S
(NP-SBJ (NNP RJR))
(VP
(VBZ is)
(VP
(VBG spending)
(NP
(NP
(QP (RB about) ($ $) (CD 140) (CD million))
(-NONE- U ))
(ADVP (-NONE- ICH-1)))
(PP-CLR
(IN on)
(NP (NN network) (NN television) (NN time)))
(NP-TMP (DT this) (NN year))
(, ,)
(ADVP-1
(RB down)
(PP
(IN from)
(NP
(NP
(QP (RB roughly) ($ $) (CD 200) (CD million))
(-NONE- U ))
(NP-TMP (JJ last) (NN year))))))))))
2
(NP-SBJ
(NP (DT An) (NN executive))
(ADJP (RB close) (PP (TO to) (NP (DT the) (NN company)))))
from copy import copy
t = copy(subtree)

def expand(subtree):
    if type(subtree) is unicode:
        print(subtree)
    else:
        print("constituent = " + highlight(subtree.label()))
        for i, st in enumerate(subtree):
            # print(i)
            expand(st)

expand(t)
constituent = S
constituent = NP-SBJ
constituent = NP
constituent = DT
An
constituent = NN
executive
constituent = ADJP
constituent = RB
close
constituent = PP
constituent = TO
to
constituent = NP
constituent = DT
the
constituent = NN
company
constituent = VP
constituent = VBD
said
constituent = SBAR
constituent = -NONE-
0
constituent = S
constituent = NP-SBJ
constituent = NNP
RJR
constituent = VP
constituent = VBZ
is
constituent = VP
constituent = VBG
spending
constituent = NP
constituent = NP
constituent = QP
constituent = RB
about
constituent = $
$
constituent = CD
140
constituent = CD
million
constituent = -NONE-
U
constituent = ADVP
constituent = -NONE-
ICH-1
constituent = PP-CLR
constituent = IN
on
constituent = NP
constituent = NN
network
constituent = NN
television
constituent = NN
time
constituent = NP-TMP
constituent = DT
this
constituent = NN
year
constituent = ,
,
constituent = ADVP-1
constituent = RB
down
constituent = PP
constituent = IN
from
constituent = NP
constituent = NP
constituent = QP
constituent = RB
roughly
constituent = $
$
constituent = CD
200
constituent = CD
million
constituent = -NONE-
U
constituent = NP-TMP
constituent = JJ
last
constituent = NN
year
constituent = .
.
2.3.7 Work in progress
This tutorial is very much a work in progress. Moreover, support for the PDTB in educe is still very incomplete. So
it’s very much a moving target.
CHAPTER 3
Cookbook
Short how-tos on focused topics.
3.1 [STAC] Turns and resources
Suppose you wanted to find the following (an actual request from the STAC project):
"Player offers to give resource X (possibly for Y) but does not hold resource X."
In this tutorial, we'll walk through such a query, applying it to a single file in the corpus. Before digging into the tutorial proper, let's first read the sample data.
from __future__ import print_function
from educe.corpus import FileId
import educe.stac

# relative to the educe docs directory
data_dir = '../data'
corpus_dir = '{dd}/stac-sample'.format(dd=data_dir)

def text_snippet(text):
    "short text fragment"
    if len(text) < 43:
        return text
    else:
        return "{0}...{1}".format(text[:20], text[-20:])

def preview_unit(doc, anno):
    "the default str(anno) can be a bit overwhelming"
    preview = "{span: <11} {id: <20} [{type: <12}] {text}"
    text = doc.text(anno.text_span())
    return preview.format(id=anno.local_id(),
                          type=anno.type,
                          span=anno.text_span(),
                          text=text_snippet(text))

# pick out an example document to work with; creating FileIds by hand
# is not something we would typically do (normally we would just iterate
# through a corpus), but it's useful for illustration
ex_key = FileId(doc='s1-league2-game3',
                subdoc='03',
                stage='units',
                annotator='BRONZE')
reader = educe.stac.Reader(corpus_dir)
ex_files = reader.filter(reader.files(),
                         lambda k: k == ex_key)
corpus = reader.slurp(ex_files, verbose=True)
ex_doc = corpus[ex_key]
Slurping corpus dir [1/1 done]
3.1.1 1. Turn and resource annotations
How would you go about doing it? One place to start is to look at turns and resources independently. We can filter turns and resources with the helper functions is_turn and is_resource from educe.stac:
import educe.stac

ex_turns = [x for x in ex_doc.units if educe.stac.is_turn(x)]
ex_resources = [x for x in ex_doc.units if educe.stac.is_resource(x)]
ex_offers = [x for x in ex_resources if x.features['Status'] == 'Givable']

print("Example turns")
print("-------------")
for anno in ex_turns[:5]:
    # notice here that unit annotations have a features field
    print(preview_unit(ex_doc, anno))

print()
print("Example resources")
print("-----------------")
for anno in ex_offers[:5]:
    # notice here that unit annotations have a features field
    print(preview_unit(ex_doc, anno))
    print('', anno.features)
Example turns
-------------
(35,66)     stac_1368693098      [Turn        ] 152 : sabercat : yep, for what?
(100,123)   stac_1368693104      [Turn        ] 154 : sabercat : no way
(146,171)   stac_1368693110      [Turn        ] 156 : sabercat : could be
(172,191)   stac_1368693113      [Turn        ] 157 : amycharl : :)
(192,210)   stac_1368693116      [Turn        ] 160 : amycharl : ?

Example resources
-----------------
(84,88)     asoubeille_1374939917916 [Resource    ] clay
 {'Status': 'Givable', 'Kind': 'clay', 'Correctness': 'True', 'Quantity': '?'}
(141,144)   asoubeille_1374940096296 [Resource    ] ore
 {'Status': 'Givable', 'Kind': 'ore', 'Correctness': 'True', 'Quantity': '?'}
(398,403)   asoubeille_1374940373466 [Resource    ] sheep
 {'Status': 'Givable', 'Kind': 'sheep', 'Correctness': 'True', 'Quantity': '?'}
(464,467)   asoubeille_1374940434888 [Resource    ] ore
 {'Status': 'Givable', 'Kind': 'ore', 'Correctness': 'True', 'Quantity': '1'}
(689,692)   asoubeille_1374940671003 [Resource    ] one
 {'Status': 'Givable', 'Kind': 'Anaphoric', 'Correctness': 'True', 'Quantity': '1'}
Oh no, Anaphors
Oh dear, some of our resources won’t tell us their types directly. They are anaphors pointing to other annotations.
We’ll ignore these for the moment, but it’ll be important to deal with them properly later on.
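In the meantime, a cheap way to set them aside is to filter on the Kind feature we just saw in the output (a small sketch):

# ignore anaphoric resource annotations for now; handling them properly
# would mean following the anaphor to the annotation it points at
ex_concrete_offers = [x for x in ex_offers
                      if x.features['Kind'] != 'Anaphoric']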
3.1.2 2. Resources within turns?
It's not enough to be able to spit out resource and turn annotations. What we really want to know is which resources are within which turns.
ex_turns_with_offers = [t for t in ex_turns if any(t.encloses(r) for r in ex_offers)]

print("Turns and resources within")
print("--------------------------")
for turn in ex_turns_with_offers[:5]:
    t_resources = [x for x in ex_resources if turn.encloses(x)]
    print(preview_unit(ex_doc, turn))
    for rsrc in t_resources:
        kind = rsrc.features['Kind']
        print("\t".join(["", str(rsrc.text_span()), kind]))
Turns and resources within
--------------------------
(959,1008)  stac_1368693191      [Turn        ] 201 : sabercat : can...or another sheep? or
	(999,1004)	sheep
(1009,1030) stac_1368693195      [Turn        ] 202 : sabercat : two?
	(1026,1029)	Anaphoric
(67,99)     stac_1368693101      [Turn        ] 153 : amycharl : clay preferably
	(84,88)	clay
(124,145)   stac_1368693107      [Turn        ] 155 : amycharl : ore?
	(141,144)	ore
(363,404)   stac_1368693135      [Turn        ] 171 : sabercat : want to trade for sheep?
	(398,403)	sheep
3.1.3 3. But does the player own these resources?
Now that we can extract the resources within a turn, our next task is to figure out if the player actually has these
resources to give. This information is stored in the turn features.
def parse_turn_resources(turn):
    """Return a dictionary of resource names to counts thereof"""
    def split_eq(attval):
        key, val = attval.split('=')
        return key.strip(), int(val)
    rxs = turn.features['Resources']
    return dict(split_eq(x) for x in rxs.split(';'))
print("Turns and player resources")
print("--------------------------")
for turn in ex_turns[:5]:
t_resources = [x for x in ex_resources if turn.encloses(x)]
3.1. [STAC] Turns and resources
39
educe Documentation, Release 0.1
print(preview_unit(ex_doc, turn))
# not to be confused with the resource annotations within the turn
print('\t', parse_turn_resources(turn))
Turns and player resources
--------------------------
(35,66)     stac_1368693098      [Turn        ] 152 : sabercat : yep, for what?
	 {'sheep': 5, 'wood': 2, 'ore': 2, 'wheat': 1, 'clay': 2}
(100,123)   stac_1368693104      [Turn        ] 154 : sabercat : no way
	 {'sheep': 5, 'wood': 2, 'ore': 2, 'wheat': 1, 'clay': 2}
(146,171)   stac_1368693110      [Turn        ] 156 : sabercat : could be
	 {'sheep': 5, 'wood': 2, 'ore': 2, 'wheat': 1, 'clay': 2}
(172,191)   stac_1368693113      [Turn        ] 157 : amycharl : :)
	 {'sheep': 1, 'wood': 0, 'ore': 3, 'wheat': 1, 'clay': 3}
(192,210)   stac_1368693116      [Turn        ] 160 : amycharl : ?
	 {'sheep': 1, 'wood': 1, 'ore': 2, 'wheat': 1, 'clay': 3}
3.1.4 4. Putting it together: is this an honest offer?
def is_somewhat_honest(turn, offer):
    """True if the player has the offered resource"""
    if offer.features['Status'] != 'Givable':
        raise ValueError('Resource must be givable')
    kind = offer.features['Kind']
    t_rxs = parse_turn_resources(turn)
    return t_rxs.get(kind, 0) > 0

def is_honest(turn, offer):
    """True if the player has the offered resource at the quantity
    offered. Undefined for offers that do not have a defined quantity
    """
    if offer.features['Status'] != 'Givable':
        raise ValueError('Resource must be givable')
    if offer.features['Quantity'] == '?':
        raise ValueError('Resource must have a known quantity')
    promised = int(offer.features['Quantity'])
    kind = offer.features['Kind']
    t_rxs = parse_turn_resources(turn)
    return t_rxs.get(kind, 0) >= promised

def critique_offer(turn, offer):
    """Return some commentary on an offered resource"""
    kind = offer.features['Kind']
    quantity = offer.features['Quantity']
    honest = 'n/a' if quantity == '?' else is_honest(turn, offer)
    player_rxs = parse_turn_resources(turn)  # what the player holds
    msg = ("\t{offered}/{has} {kind} | "
           "has some: {honestish}, "
           "enough: {honest}")
    return msg.format(kind=kind,
                      offered=quantity,
                      has=player_rxs.get(kind),
                      honestish=is_somewhat_honest(turn, offer),
                      honest=honest)
ex_turns_with_offers = [t for t in ex_turns if any(t.encloses(r) for r in ex_offers)]

print("Turns and offers")
print("----------------")
for turn in ex_turns_with_offers[:5]:
    offers = [x for x in ex_offers if turn.encloses(x)]
    print('', preview_unit(ex_doc, turn))
    for offer in offers:
        print(critique_offer(turn, offer))
Turns and offers
----------------
 (959,1008)  stac_1368693191      [Turn        ] 201 : sabercat : can...or another sheep? or
	1/5 sheep | has some: True, enough: True
 (1009,1030) stac_1368693195      [Turn        ] 202 : sabercat : two?
	2/None Anaphoric | has some: False, enough: False
 (67,99)     stac_1368693101      [Turn        ] 153 : amycharl : clay preferably
	?/3 clay | has some: True, enough: n/a
 (124,145)   stac_1368693107      [Turn        ] 155 : amycharl : ore?
	?/3 ore | has some: True, enough: n/a
 (363,404)   stac_1368693135      [Turn        ] 171 : sabercat : want to trade for sheep?
	?/5 sheep | has some: True, enough: n/a
3.1.5 5. What about those anaphors?
Anaphors are represented with ‘Anaphora’ relation instances. Relation instances have a source and target connecting
two unit-level annotations (here, two resources). The idea is that the anaphor is the source of the relation,
and its antecedent is the target. We’ll assume for simplicity that resource anaphora do not form chains.
import copy

resource_types = {}
for anno in ex_doc.relations:
    if anno.type != 'Anaphora':
        continue
    resource_types[anno.source] = anno.target.features['Kind']

print("Turns and offers (anaphors accounted for)")
print("-----------------------------------------")
for turn in ex_turns_with_offers[:5]:
    offers = [x for x in ex_offers if turn.encloses(x)]
    print('', preview_unit(ex_doc, turn))
    for offer in offers:
        if offer in resource_types:
            kind = resource_types[offer]
            offer = copy.copy(offer)
            offer.features = dict(offer.features)  # don't clobber the original
            offer.features['Kind'] = kind
        print(critique_offer(turn, offer))
Turns and offers (anaphors accounted for)
-----------------------------------------
 (959,1008)  stac_1368693191      [Turn        ] 201 : sabercat : can...or another sheep? or
	1/5 sheep | has some: True, enough: True
 (1009,1030) stac_1368693195      [Turn        ] 202 : sabercat : two?
	2/5 sheep | has some: True, enough: True
 (67,99)     stac_1368693101      [Turn        ] 153 : amycharl : clay preferably
	?/3 clay | has some: True, enough: n/a
 (124,145)   stac_1368693107      [Turn        ] 155 : amycharl : ore?
	?/3 ore | has some: True, enough: n/a
 (363,404)   stac_1368693135      [Turn        ] 171 : sabercat : want to trade for sheep?
	?/5 sheep | has some: True, enough: n/a
3.1.6 Conclusion
In this tutorial, we’ve explored a few basic educe concepts, which we hope will enable you to extract some data
from your discourse corpora, namely:
• reading corpus data (and pre-filtering)
• standoff annotations
• searching by span enclosure and overlap
• working with trees
• combining annotations from different sources
The concepts above should transfer to whatever discourse corpus you are working with (provided educe supports it,
or you are prepared to supply a reader for it).
CHAPTER 4
educe package
Note: At the time of this writing, this is a slightly idealised representation of the package. See below for notes on
where things get a bit messier.
The educe library provides utilities for working with annotated discourse corpora. It has a three-layer structure:
• base layer (files, annotations, fusion, graphs)
• tool layer (specific to tools, file formats, etc)
• project layer (specific to particular corpora, currently stac)
4.1 Layers
Working our way up the tower, the base layer provides four sublayers:
• file management (educe.corpus): basic model for corpus traversal, for selecting slices of the corpus
• annotation (educe.annotation): representation of annotated texts, adhering closely to whatever annotation tool
produced them
• fusion (in progress): connections between annotations on different layers (eg. speech acts for text spans,
discourse relations), or from different tools (eg. from a POS tagger, a parser, etc)
• graph (educe.graph): high-level/abstract representation of discourse structure, allowing for queries on the
structures themselves (eg. give me all pairs of discourse units separated by at most 3 nodes in the graph)
Building on the base layer, we have modules that are specific to a particular set of annotation tools; currently this
is only educe.glozz. We aim to add modules sparingly.
Finally, on top of this, we have the project layer (eg. educe.stac), which keeps track of conventions specific to a
particular corpus. The hope would be for most of your script writing to deal with this layer directly, eg. for STAC:
stac                                       [project layer]
 |
 +--------+-------------+--------+
 |        |             |        |
 |        v             |        |
 |      glozz           |        |        [tool layer]
 |        |             |        |
 v        v             v        v
corpus -> annotation <- fusion <- graph   [base layer]
Support for other projects would consist in writing other project-layer modules that map down to the tool layer.
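For example, a typical script might work entirely at the project layer. A minimal sketch (corpus_dir stands in for wherever your copy of the STAC corpus lives):

import educe.stac

# read just the discourse-stage documents from the corpus
reader = educe.stac.Reader(corpus_dir)  # corpus_dir: placeholder path
files = reader.filter(reader.files(),
                      lambda k: k.stage == 'discourse')
corpus = reader.slurp(files, verbose=True)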
4.2 Departures from the ideal (2013-05-23)
Educe is still in its early stages. Some departures you may want to be aware of:
• fusion layer does not really exist yet; educe.annotation currently takes on some of the job (for example, the
text_span function makes annotations of different types more or less comparable)
• layer violations: ideally we want lower layers to be independent of the things above them, but you may find eg.
glozz-specific assumptions in the base layer, which isn’t great.
• inconsistency in encapsulation: educe.stac doesn’t wrap everything below it (it’s also not clear yet if it should).
It currently wraps educe.glozz and educe.corpus (so by rights you shouldn’t really need to import them), but not
the graph stuff for example.
4.3 Subpackages
4.3.1 educe.external package
Interacting with annotations from 3rd party tools
Submodules
educe.external.coref module
Coreference chain output in the form of educe standoff annotations (at least as emitted by Stanford’s CoreNLP
pipeline)
A coreference chain is considered to be a set of mentions. Each mention contains a set of tokens.
class educe.external.coref.Chain(mentions)
Bases: educe.annotation.Standoff
Chain of coreferences
class educe.external.coref.Mention(tokens, head, most_representative=False)
Bases: educe.annotation.Standoff
Mention of an entity
educe.external.corenlp module
Annotations from the CoreNLP pipeline
class educe.external.corenlp.CoreNlpDocument(tokens, trees, deptrees, chains)
Bases: educe.annotation.Standoff
All of the CoreNLP annotations for a particular document as instances of educe.annotation.Standoff or as structures that contain such instances.
class educe.external.corenlp.CoreNlpToken(t, offset, origin=None)
Bases: educe.external.postag.Token
A single token and its POS tag.
features
dict(string, string)
Additional info found by CoreNLP about the token (eg. x.features['lemma'])
class educe.external.corenlp.CoreNlpWrapper(corenlp_dir)
Bases: object
Wrapper for the CoreNLP parsing system
process(txt_files, outdir, properties=[])
Run CoreNLP on text files
Parameters
• txt_files (list of strings) – Input files
• outdir (string) – Output dir
• properties (list of strings, optional) – Properties to control the behaviour of CoreNLP
Returns corenlp_outdir – Directory containing CoreNLP’s output files
Return type string
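A minimal usage sketch (the CoreNLP install directory and input file names below are placeholders):

from educe.external.corenlp import CoreNlpWrapper

wrapper = CoreNlpWrapper('/path/to/corenlp')   # placeholder install dir
outdir = wrapper.process(['doc1.txt'], 'out',  # placeholder inputs
                         properties=[])
print(outdir)  # where CoreNLP wrote its output files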
educe.external.parser module
Syntactic parser output into educe standoff annotations (at least as emitted by Stanford’s CoreNLP pipeline).
This currently builds off the NLTK Tree class, but if the NLTK dependency proves too heavy, we could consider
doing without.
class educe.external.parser.ConstituencyTree(node, children, origin=None)
Bases: educe.external.parser.SearchableTree, educe.annotation.Standoff
A variant of the NLTK Tree data structure which can be treated as an educe Standoff annotation.
This can be useful for representing syntactic parse trees in a way that can be later queried on the basis of Span
enclosure.
Note that all children must have a span member of type Span.
The subtrees() function can be useful here.
classmethod build(tree, tokens)
Build an educe tree by combining an existing NLTK tree with some replacement leaves.
The replacement leaves should correspond 1:1 to the leaves of the original tree (for example, they may
contain features related to those words).
text_span()
Note: doc is ignored here
class educe.external.parser.DependencyTree(node, children, link, origin=None)
Bases: educe.external.parser.SearchableTree, educe.annotation.Standoff
A variant of the NLTK Tree data structure for the representation of dependency trees. The dependency tree is
also considered a Standoff annotation but not quite in the same way that a constituency tree might be. The spans
roughly indicate the range covered by the tokens in the subtree (this glosses over any gaps). They are mostly
useful for determining if the tree (at its root node) pertains to any given sentence based on its offsets.
Fields:
•node is some annotation of type educe.annotation.Standoff
•link is a string representing the link label between this node and its governor; None for the root node
classmethod build(deps, nodes, k, link=None)
Given two dictionaries
•mapping node ids to a list of (link label, child node id) pairs
•mapping node ids to some representation of those nodes
and the id for the root node, build a tree representation of the dependency tree
is_root()
This is a dependency tree root (has a special node)
class educe.external.parser.SearchableTree(node, children)
Bases: nltk.tree.Tree
A tree with helper search functions
depth_first_iterator()
Iterate on the nodes of the tree, depth-first, pre-order.
topdown(pred, prunable=None)
Searching from the top down, return the biggest subtrees for which the predicate is True (or empty list if
none are found).
The optional prunable function can be used to throw out subtrees for more efficient search (note that pred
always overrides prunable though). Note that leaf nodes are ignored.
topdown_smallest(pred, prunable=None)
Searching from the top down, return the smallest subtrees for which the predicate is True (or empty list if
none are found).
This is almost the same as topdown, except that if a subtree matches, we check for smaller matches in its
subtrees.
Note that leaf nodes are ignored.
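For instance, one might combine topdown with span enclosure to pull out the biggest subtrees lying within an annotation of interest. A sketch, assuming tree is a ConstituencyTree and anno any educe standoff annotation:

# biggest subtrees of `tree` that fall entirely within `anno`'s span
matches = tree.topdown(lambda subtree: anno.encloses(subtree))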
educe.external.postag module
CONLL formatted POS tagger output into educe standoff annotations (at least as emitted by CMU’s ark-tweet-nlp).
Files are assumed to be UTF-8 encoded.
Note: NLTK has a CONLL reader too which looks a lot more general than this one
exception educe.external.postag.EducePosTagException(*args, **kw)
Bases: exceptions.Exception
Exceptions that arise during POS tagging or when reading POS tag resources
class educe.external.postag.RawToken(word, tag)
Bases: object
A token with a part of speech tag associated with it
class educe.external.postag.Token(tok, span)
Bases: educe.external.postag.RawToken, educe.annotation.Standoff
A token with a part of speech tag and some character offsets associated with it.
classmethod left_padding()
Return a special Token for left padding
educe.external.postag.generic_token_spans(text, tokens, offset=0, txtfn=None)
Given a string and a sequence of substrings within that string, infer a span for each of the substrings.
We infer these spans by walking the text as we consume substrings, skipping over any whitespace (including
whitespace within the tokens). For this to work, the substring sequence must be identical to the text
modulo whitespace.
Spans are relative to the start of the string itself, but can be shifted by passing an offset (the start of the original
string’s span). Empty tokens are accepted but have a zero-length span.
Note: this function is lazy so you can use it incrementally provided you can generate the tokens lazily too
You probably want token_spans instead; this function is meant to be used for similar tasks outside of pos tagging
Parameters txtfn – function to extract text from a token (default None, treated as identity function)
educe.external.postag.read_token_file(fname)
Return a list of lists of RawToken
The input file format is what I believe to be the CONLL format (at least as emitted by the CMU Twitter POS
tagger)
educe.external.postag.token_spans(text, tokens, offset=0)
Given a string and a sequence of RawToken representing tokens in that string, infer the span for each token.
Return the results as a sequence of Token objects.
We infer these spans by walking the text as we consume tokens, and skipping over any whitespace in between.
For this to work, the raw token text must be identical to the text modulo whitespace.
Spans are relative to the start of the string itself, but can be shifted by passing an offset (the start of the original
string’s span)
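A small sketch of token_spans in action (assuming the resulting Token keeps the word, tag, and span it was built from):

from educe.external.postag import RawToken, token_spans

text = 'A cat sat.'
raw_toks = [RawToken('A', 'DT'), RawToken('cat', 'NN'), RawToken('sat.', 'VBD')]
for tok in token_spans(text, raw_toks):
    print(tok.word, tok.tag, tok.span)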
educe.external.stanford_xml_reader module
Reader for Stanford CoreNLP pipeline outputs
Example of output:
<document>
<sentences>
<sentence id="1">
<tokens>
...
<token id="19">
<word>direction</word>
<lemma>direction</lemma>
<CharacterOffsetBegin>135</CharacterOffsetBegin>
<CharacterOffsetEnd>144</CharacterOffsetEnd>
<POS>NN</POS>
</token>
<token id="20">
<word>.</word>
<lemma>.</lemma>
<CharacterOffsetBegin>144</CharacterOffsetBegin>
<CharacterOffsetEnd>145</CharacterOffsetEnd>
<POS>.</POS>
</token>
...
<parse>(ROOT (S (PP (IN For) (NP (NP (DT a) (NN look)) (PP (IN at) (SBAR (WHNP (WP what)) (S (V
<basic-dependencies>
<dep type="prep">
<governor idx="13">let</governor>
<dependent idx="1">For</dependent>
</dep>
...
</basic-dependencies>
<collapsed-dependencies>
<dep type="det">
<governor idx="3">look</governor>
<dependent idx="2">a</dependent>
</dep>
...
</collapsed-dependencies>
<collapsed-ccprocessed-dependencies>
<dep type="det">
<governor idx="3">look</governor>
<dependent idx="2">a</dependent>
</dep>
...
</collapsed-ccprocessed-dependencies>
</sentence>
</sentences>
</document>
IMPORTANT: Note that the Stanford pipeline uses RHS inclusive offsets.
class educe.external.stanford_xml_reader.PreprocessingSource(encoding='utf-8')
Bases: object
Reads in document annotations produced by CoreNLP pipeline.
This works as a stateful object that stores and provides access to all annotations contained in a CoreNLP output
file, once the read method has been called.
get_coref_chains()
Get all coreference chains
get_document_id()
Get the document id
get_offset2sentence_map()
Get the offset to each sentence
get_offset2token_maps()
Get the offset to each token
get_ordered_sentence_list(sort_attr='extent')
Get the list of sentences, ordered by sort_attr
get_ordered_token_list(sort_attr='extent')
Get the list of tokens, ordered by sort_attr
get_sentence_annotations()
Get the annotations of all sentences
get_token_annotations()
Get the annotations of all tokens
read(base_file, suffix='.raw.stanford')
Read and store the annotations from CoreNLP’s output.
This function does not return anything, it modifies the state of the object to store the annotations.
educe.external.stanford_xml_reader.test_file(base_filename, suffix='.raw.stanford')
Test that a file is effectively readable and print sentences
educe.external.stanford_xml_reader.xml_unescape(_str)
Get a proper string where special XML characters are unescaped.
Notes
You can also use xml.sax.saxutils.escape
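Putting the reader to work might look something like this (a sketch; the base filename is a placeholder):

from educe.external.stanford_xml_reader import PreprocessingSource

src = PreprocessingSource()
src.read('mydoc', suffix='.raw.stanford')  # reads 'mydoc.raw.stanford' (placeholder)
print(src.get_document_id())
for chain in src.get_coref_chains():
    print(chain)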
4.3.2 educe.learning package
Submodules
educe.learning.csv module
CSV helpers for machine learning
We sometimes need tables represented as CSV files, with a few odd conventions here and there to help libraries like
Orange
class educe.learning.csv.SparseDictReader(f, *args, **kwds)
Bases: csv.DictReader
A CSV reader which avoids putting null values in dictionaries (note that this is basically a copy of DictReader)
next()
class educe.learning.csv.Utf8DictReader(f, **kwds)
A CSV reader which assumes strings are encoded in UTF-8.
next()
class educe.learning.csv.Utf8DictWriter(f, headers, dialect=<class csv.excel>, **kwds)
A CSV writer which will write rows to CSV file “f”, which is encoded in UTF-8.
writeheader()
writerow(row)
writerows(rows)
educe.learning.csv.mk_plain_csv_writer(outfile)
Just writes records in stac dialect
educe.learning.csv.tune_for_csv(string)
Given a string or None, return a variant of that string that skirts around possibly buggy CSV implementations
SIGH: some CSV parsers apparently get really confused by empty fields
educe.learning.edu_input_format module
This module implements a dumper for the EDU input format
See https://github.com/kowey/attelo/blob/scikit/doc/input.rst
educe.learning.edu_input_format.dump_all(X_gen, y_gen, f, class_mapping, docs, instance_generator)
Dump a whole dataset: features (in svmlight) and EDU pairs
class_mapping is a mapping from label to int
Parameters
• f – output features file path
• class_mapping – dict(string, int)
• instance_generator – function that returns an iterable of pairs given a document
educe.learning.edu_input_format.dump_edu_input_file(docs, f)
Dump a dataset in the EDU input format.
Each document must have:
•edus: sequence of edu objects
•grouping: string (some sort of document id)
•edu2sent: int -> int or string or None (edu num to sentence num)
The EDUs must provide:
•identifier(): string
•text(): string
educe.learning.edu_input_format.dump_pairings_file(epairs, f)
Dump the EDU pairings
educe.learning.edu_input_format.labels_comment(class_mapping)
Return a string listing class labels in the format that attelo expects
educe.learning.edu_input_format.load_labels(f)
Read a label set (from a features file) into a dictionary mapping labels to indices
educe.learning.keygroup_vectorizer module
This module provides ways to transform lists of PairKeys to sparse vectors.
class educe.learning.keygroup_vectorizer.KeyGroupVectorizer
Bases: object
Transforms lists of KeyGroups to sparse vectors.
fit_transform(vectors)
Learn the vocabulary dictionary and return instances
transform(vectors)
Transform documents to EDU pair feature matrix.
Extract features out of documents using the vocabulary fitted with fit.
educe.learning.keys module
Feature extraction keys.
A key is basically a feature name, its type, some help text.
We also provide a notion of groups that allow us to organise keys into sections
class educe.learning.keys.Key(substance, name, description)
Bases: object
Feature name plus a bit of metadata
classmethod basket(name, description)
A key for fields that represent a multiset of possible values. Baskets should be dictionaries from string to
int (collections.Counter would be a good bet for collecting these)
classmethod continuous(name, description)
A key for fields that have range value (eg. numbers)
classmethod discrete(name, description)
A key for fields that have a finite set of possible values
substance = None
see Substance
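A sketch of defining a few keys and collecting them into a group (the key names here are made up):

from educe.learning.keys import Key, KeyGroup

ks = [Key.discrete('kind', 'resource kind'),
      Key.continuous('quantity', 'amount offered')]
group = KeyGroup('offer features', ks)
group['kind'] = 'sheep'  # KeyGroups can be used as dictionaries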
class educe.learning.keys.KeyGroup(description, keys)
Bases: dict
A set of related features.
Note that a KeyGroup can be used as a dictionary, but instead of using Keys as values, you use the key names
DEBUG = True
NAME_WIDTH = 35
one_hot_values_gen(suffix='')
Get a one-hot encoded version of this KeyGroups as a generator
suffix is added to the feature name
class educe.learning.keys.MagicKey(substance, function)
Bases: educe.learning.keys.Key
Somewhat fancier variant of Key that is built from a function. The goal of the magic key is to reduce the amount
of boilerplate needed to define keys.
classmethod basket_fn(function)
A key for fields that represent a multiset of possible values. Baskets should be dictionaries from string to
int (collections.Counter would be a good bet for collecting these)
classmethod continuous_fn(function)
A key for fields that have range value (eg. numbers)
classmethod discrete_fn(function)
A key for fields that have a finite set of possible values
class educe.learning.keys.MergedKeyGroup(description, groups)
Bases: educe.learning.keys.KeyGroup
A key group that is formed by fusing several key groups into one.
Note that for now all the keys in a merged group are lumped into the same object.
The help text tries to preserve the internal breakdown into the subgroups, however. It comes with a “level 1”
section header, eg.
=======================================================
big block of features
=======================================================
class educe.learning.keys.Substance
Bases: object
The kind of the variable represented by this key.
•continuous
•discrete
•string (for meta vars; you probably want discrete instead)
If we ever reach a point where we’re happy to switch to Python 3 wholesale, we should subclass Enum
BASKET = 4
CONTINUOUS = 1
DISCRETE = 2
STRING = 3
educe.learning.svmlight_format module
This module implements a dumper for the svmlight format
See sklearn.datasets.svmlight_format
educe.learning.svmlight_format.dump_svmlight_file(X_gen, y_gen, f, zero_based=True, comment=None, query_id=None)
Dump the dataset in svmlight file format.
educe.learning.util module
Common helper functions for feature extraction.
educe.learning.util.space_join(str1, str2)
join two strings with a space
educe.learning.util.tuple_feature(combine)
(a -> a -> b) ->
((current, cache, edu) -> a) ->
(current, cache, edu, edu) -> b)
Combine the result of single-edu feature function to make a pair feature
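For instance (a sketch; edu_text is a made-up single-EDU feature):

from educe.learning.util import tuple_feature, underscore

def edu_text(current, cache, edu):
    "sketch of a single-EDU feature"
    return edu.text()

# pair feature gluing the two single-EDU results with an underscore
pair_text = tuple_feature(underscore)(edu_text)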
educe.learning.util.underscore(str1, str2)
join two strings with an underscore
educe.learning.vocabulary_format module
This module implements a loader and dumper for vocabularies.
educe.learning.vocabulary_format.dump_vocabulary(vocabulary, f)
Dump the vocabulary as a tab-separated file.
educe.learning.vocabulary_format.load_vocabulary(f)
Read a vocabulary file into a dictionary mapping feature names to indices
4.3.3 educe.pdtb package
Conventions specific to the Penn Discourse Treebank (PDTB) project
Subpackages
educe.pdtb.util package
Submodules
educe.pdtb.util.args module
Command line options
educe.pdtb.util.args.add_usual_input_args(parser)
Augment a subcommand argparser with typical input arguments. Sometimes your subcommand may require
slightly different input arguments, in which case just don’t call this function.
educe.pdtb.util.args.add_usual_output_args(parser)
Augment a subcommand argparser with typical output arguments. Sometimes your subcommand may require
slightly different output arguments, in which case just don’t call this function.
educe.pdtb.util.args.announce_output_dir(output_dir)
Tell the user where we saved the output
educe.pdtb.util.args.get_output_dir(args)
Return the output directory specified on (or inferred from) the command line arguments, creating it if necessary.
We try the following in order:
1. If --output is given explicitly, we’ll just use/create that
2. OK, just make a temporary directory. Later on, you’ll probably want to call announce_output_dir.
educe.pdtb.util.args.mk_output_path(odir, k)
Path stub (needs extension) given an output directory and a PDTB corpus key
educe.pdtb.util.args.read_corpus(args, verbose=True)
Read the section of the corpus specified in the command line arguments.
educe.pdtb.util.features module
Feature extraction library functions for PDTB corpus
class educe.pdtb.util.features.DocumentPlus(key, doc)
Bases: tuple
__getnewargs__()
Return self as a plain tuple. Used by copy and pickle.
__getstate__()
Exclude the OrderedDict from pickling
__repr__()
Return a nicely formatted representation string
doc
Alias for field number 1
key
Alias for field number 0
class educe.pdtb.util.features.FeatureInput(corpus, debug)
Bases: tuple
__getnewargs__()
Return self as a plain tuple. Used by copy and pickle.
__getstate__()
Exclude the OrderedDict from pickling
__repr__()
Return a nicely formatted representation string
corpus
Alias for field number 0
debug
Alias for field number 1
class educe.pdtb.util.features.RelKeys(inputs)
Bases: educe.learning.keys.MergedKeyGroup
Features for relations
fill(current, rel, target=None)
See RelSubgroup
class educe.pdtb.util.features.RelSubGroup_Core
Bases: educe.pdtb.util.features.RelSubgroup
core features
fill(current, rel, target=None)
class educe.pdtb.util.features.RelSubgroup(description, keys)
Bases: educe.learning.keys.KeyGroup
Abstract keygroup for subgroups of the merged RelKeys. We use these subgroup classes to help provide modularity, to capture the idea that the bits of code that define a set of related feature vector keys should go with the
bits of code that also fill them out
fill(current, rel, target=None)
Fill out a vector’s features (if the vector is None, then we just fill out this group; but in the case of a merged
key group, you may find it desirable to fill out the merged group instead)
class educe.pdtb.util.features.SingleArgKeys(inputs)
Bases: educe.learning.keys.MergedKeyGroup
Features for a single EDU
fill(current, arg, target=None)
See SingleArgSubgroup.fill
class educe.pdtb.util.features.SingleArgSubgroup(description, keys)
Bases: educe.learning.keys.KeyGroup
Abstract keygroup for subgroups of the merged SingleArgKeys. We use these subgroup classes to help provide
modularity, to capture the idea that the bits of code that define a set of related feature vector keys should go with
the bits of code that also fill them out
fill(current, arg, target=None)
Fill out a vector’s features (if the vector is None, then we just fill out this group; but in the case of a merged
key group, you may find it desirable to fill out the merged group instead)
educe.pdtb.util.features.extract_rel_features(inputs)
Return a pair of dictionaries, one for attachments and one for relations
educe.pdtb.util.features.mk_current(inputs, k)
Pre-process and bundle up a representation of the current document
educe.pdtb.util.features.spans_to_str(spans)
string representation of a list of spans, meant to work as an id
Submodules
educe.pdtb.corpus module
PDTB Corpus management (re-exported by educe.pdtb)
class educe.pdtb.corpus.Reader(corpusdir)
Bases: educe.corpus.Reader
See educe.corpus.Reader for details
files()
slurp_subcorpus(cfiles, verbose=False)
See educe.rst_dt.parse for a description of RSTTree
educe.pdtb.corpus.id_to_path(k)
Given a fleshed out FileId (none of the fields are None), return a filepath for it following Penn Discourse
Treebank conventions.
You will likely want to add your own filename extensions to this path
educe.pdtb.corpus.mk_key(doc)
Return a corpus key for a given document name
educe.pdtb.parse module
Standalone parser for PDTB files.
The function parse takes a single .pdtb file and returns a list of Relation, with the following subtypes:
Relation           selection       features            sup?
ExplicitRelation   Selection       attr, 1 connhead    Y
ImplicitRelation   InferenceSite   attr, 2 conn        Y
AltLexRelation     Selection       attr, 2 semclass    Y
EntityRelation     InferenceSite   none                N
NoRelation         InferenceSite   none                N
These relation subtypes are stitched together (and inherit members) from two or three components
• arguments: always arg1 and arg2; but in some cases, the arguments can have supplementary information
• selection: see either Selection or InferenceSite
• some features (see eg. ExplicitRelationFeatures)
The simplest way to get to grips with this may be to try the parse function on some sample relations and print the
resulting objects.
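For example (a sketch; the file path is a placeholder):

import educe.pdtb.parse

relations = educe.pdtb.parse.parse('wsj_2300.pdtb')  # placeholder path
for rel in relations[:3]:
    print(type(rel).__name__)
    print(rel.arg1)
    print(rel.arg2)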
class educe.pdtb.parse.AltLexRelation(selection, features, args)
Bases: educe.pdtb.parse.Selection, educe.pdtb.parse.AltLexRelationFeatures,
educe.pdtb.parse.Relation
class educe.pdtb.parse.AltLexRelationFeatures(attribution, semclass1, semclass2)
Bases: educe.pdtb.parse.PdtbItem
class educe.pdtb.parse.Arg(selection, attribution=None, sup=None)
Bases: educe.pdtb.parse.Selection
class educe.pdtb.parse.Attribution(source, type, polarity, determinacy, selection=None)
Bases: educe.pdtb.parse.PdtbItem
class educe.pdtb.parse.Connective(text, semclass1, semclass2=None)
Bases: educe.pdtb.parse.PdtbItem
class educe.pdtb.parse.EntityRelation(infsite, args)
Bases: educe.pdtb.parse.InferenceSite, educe.pdtb.parse.Relation
class educe.pdtb.parse.ExplicitRelation(selection, features, args)
Bases: educe.pdtb.parse.Selection, educe.pdtb.parse.ExplicitRelationFeatures,
educe.pdtb.parse.Relation
class educe.pdtb.parse.ExplicitRelationFeatures(attribution, connhead)
Bases: educe.pdtb.parse.PdtbItem
class educe.pdtb.parse.GornAddress(parts)
Bases: educe.pdtb.parse.PdtbItem
class educe.pdtb.parse.ImplicitRelation(infsite, features, args)
Bases: educe.pdtb.parse.InferenceSite, educe.pdtb.parse.ImplicitRelationFeatures,
educe.pdtb.parse.Relation
class educe.pdtb.parse.ImplicitRelationFeatures(attribution, connective1, connective2=None)
Bases: educe.pdtb.parse.PdtbItem
class educe.pdtb.parse.InferenceSite(strpos, sentnum)
Bases: educe.pdtb.parse.PdtbItem
class educe.pdtb.parse.NoRelation(infsite, args)
Bases: educe.pdtb.parse.InferenceSite, educe.pdtb.parse.Relation
class educe.pdtb.parse.PdtbItem
Bases: object
class educe.pdtb.parse.Relation(args)
Bases: educe.pdtb.parse.PdtbItem
Fields:
•self.arg1
•self.arg2
class educe.pdtb.parse.Selection(span, gorn, text)
Bases: educe.pdtb.parse.PdtbItem
class educe.pdtb.parse.SemClass(klass)
Bases: educe.pdtb.parse.PdtbItem
class educe.pdtb.parse.Sup(selection)
Bases: educe.pdtb.parse.Selection
educe.pdtb.parse.parse(path)
Parse a single .pdtb file and return the list of relations found within
Return type [Relation]
educe.pdtb.parse.parse_relation(s)
Parse a single relation or throw a ParseException.
educe.pdtb.parse.split_relations(s)
educe.pdtb.pdtbx module
PDTB in an ad hoc (educe-grown) XML format; unfortunately not a standard, but a little homegrown language using
XML syntax. I’ll call it pdtbx. No reason it can’t be used outside of educe.
Informal DTD:
• SpanList is attribute spanList in PDTB string convention
• GornAddressList is attribute gornList in PDTB string convention
• SemClass is attribute semclass1 (and optional attribute semclass2) in PDTB string convention
• text in <text> elements with usual XML escaping conventions
• args in <arg> elements in order (arg1 before arg2)
• implicitRelations can have multiple connectives
educe.pdtb.pdtbx.Relation_xml(itm)
educe.pdtb.pdtbx.Relations_xml(itms)
educe.pdtb.pdtbx.read_Relation(node)
educe.pdtb.pdtbx.read_Relations(node)
educe.pdtb.pdtbx.read_pdtbx_file(filename)
educe.pdtb.pdtbx.write_pdtbx_file(filename, relations)
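A round-trip sketch (file names are placeholders):

from educe.pdtb import pdtbx

relations = pdtbx.read_pdtbx_file('example.pdtbx')       # placeholder
pdtbx.write_pdtbx_file('example-copy.pdtbx', relations)  # placeholder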
educe.pdtb.ptb module
Alignment with the Penn Treebank
educe.pdtb.ptb.parse_trees(corpus, k, ptb)
Given a PDTB document and an NLTK PTB reader, return the PTB trees.
Note that a future version of this function will try to educify the trees as well, but for now things will be fairly
rudimentary.
educe.pdtb.ptb.reader(corpus_dir)
An instantiated NLTK BracketedParseCorpusReader for the PTB section relevant to the PDTB corpus.
Note that the path you give to this will probably end with something like parsed/mrg/wsj
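A sketch of gluing the two together (directory paths are placeholders):

import educe.pdtb
import educe.pdtb.ptb

reader = educe.pdtb.Reader('pdtb-corpus-dir')           # placeholder dir
corpus = reader.slurp(reader.files(), verbose=True)
ptb = educe.pdtb.ptb.reader('ptb-dist/parsed/mrg/wsj')  # placeholder dir
for k in corpus:
    trees = educe.pdtb.ptb.parse_trees(corpus, k, ptb)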
4.3.4 educe.ptb package
Conventions specific to the Penn Treebank.
The PTB isn’t a discourse corpus as such, but a supplementary resource to be combined with the RST DT or the PDTB.
Submodules
educe.ptb.annotation module
Educe representation of Penn Tree Bank annotations.
We actually just use the token and constituency tree representations from educe.external.postag and
educe.external.parser, but included here are tools that can also be used to align the PTB with other corpora based
off the same text (eg. the RST Discourse Treebank).
educe.ptb.annotation.PTB_TO_TEXT = {"''": '"', '``': '"', '-LRB-': '(', '-RRB-': ')', '-LSB-': '[', '-RSB-': ']', '-LCB-': '{', '-RCB-': '}'}
Straight substitutions you can use to replace some PTB-isms with their likely original text
class educe.ptb.annotation.TweakedToken(word, tag, tweaked_word=None, prefix=None)
Bases: educe.external.postag.RawToken
A token with word, part of speech, plus “tweaked word” (what the token should be treated as when aligning
with corpus), and offset (some tokens should skip parts of the text)
This intermediary class should only be used within the educe library itself. The context is that we sometimes
want to align PTB annotations (see educe.external.postag.generic_token_spans) against text which is almost but
not quite identical to the text that PTB annotations seem to represent. For example, the source text might have
sentences that end in abbreviations, like “He moved to the U.S.”, and the PTB might annotate an extra full stop
after this for an end-of-sentence marker. To deal with these, we use wrapped tokens to allow for some manual
substitutions:
•you could “delete” a token by assigning it an empty tweaked word (it would then be assigned a zero-length
span)
•you could skip some part of the text by supplying a prefix (this expands the tweaked word, and introduces
an offset which you can subsequently use to adjust the detected token span)
•or you could just replace the token text outright
These tweaked tokens are only used to obtain a span within the text you are trying to align against; they can be
subsequently discarded.
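A couple of sketches of the substitutions described above:

from educe.ptb.annotation import PTB_TO_TEXT, TweakedToken

# "delete" an empty-category token (it will get a zero-length span)
deleted = TweakedToken('*T*-1', '-NONE-', tweaked_word='')

# replace a PTB-ism with its likely original text
bracket = TweakedToken('-LRB-', '-LRB-', tweaked_word=PTB_TO_TEXT['-LRB-'])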
educe.ptb.annotation.basic_category(label)
Get the basic syntactic category of a label.
This is done by truncating whatever comes after a (non-word-initial) occurrence of one of the label_annotation_introducing_characters().
educe.ptb.annotation.is_empty_category(postag)
True if postag is the empty category, i.e. -NONE- in the PTB.
educe.ptb.annotation.is_non_empty(tree)
Filter (return False for) nodes that cover a totally empty span.
educe.ptb.annotation.is_nonword_token(text)
True if the text appears to correspond to some kind of non-textual token, for example, *T*-1 for some kind of
trace. These seem to only appear with tokens tagged -NONE-.
educe.ptb.annotation.post_basic_category_index(label)
Get the index of the first char after the basic label.
This should never match the first char of the label; if the first char is such a char, then a matched char is also not
used iff there is something in between, e.g. (-LRB- => -LRB-) but (--PU => -).
educe.ptb.annotation.prune_tree(tree, filter_func)
Prune a tree by applying filter_func recursively.
All children of filtered nodes are pruned as well. Nodes whose children have all been pruned are pruned too.
The filter function must be applicable to Tree but also non-Tree, as are leaves in an NLTK Tree.
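For example, prune_tree combines naturally with is_non_empty to drop empty-category material (a sketch; tree stands for any NLTK-style tree with spans):

from educe.ptb.annotation import is_non_empty, prune_tree

clean_tree = prune_tree(tree, is_non_empty)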
educe.ptb.annotation.strip_subcategory(tree, retain_TMP_subcategories=False, retain_NPTMP_subcategories=False)
Transform tree to strip additional label annotation at each node
educe.ptb.annotation.transform_tree(tree, transformer)
Transform a tree by applying a transformer at each level.
The tree is traversed depth-first, left-to-right, and the transformer is applied at each node.
educe.ptb.head_finder module
This submodule provides several functions that find heads in trees.
It uses head rules as described in (Collins 1999), Appendix A. See http://www.cs.columbia.edu/~mcollins/papers/heads,
Bikel’s 2004 CL paper on the intricacies of Collins’ parser, and the classes in (StanfordNLP) CoreNLP that inherit
from AbstractCollinsHeadFinder.java.
educe.ptb.head_finder.find_edu_head(tree, hwords, wanted)
Find the head word of a set of wanted nodes from a tree.
The tree is traversed top-down, breadth first, until we reach a node headed by a word from wanted.
Return a pair of treepositions (head node, head word), or None if no occurrence of any word in wanted was
found.
This function is typically called for each EDU, wanted being the set of tree positions of its tokens, after
find_lexical_heads has been called on the entire tree (providing hwords).
Parameters
• tree (nltk.Tree with educe.external.postag.RawToken leaves) – PTB tree whose lexical
heads we want.
• hwords (dict(tuple(int), tuple(int))) – Map from each node of the constituency tree to its
lexical head. Both nodes are designated by their (NLTK) tree position (a.k.a. Gorn address).
• wanted (iterable of tuple(int)) – The tree positions of the tokens in the span of interest, e.g.
in the EDU we are looking at.
Returns
• cur_treepos (tuple(int)) – Tree position of the head node, i.e. the highest node headed by a
word from wanted.
• cur_hw (tuple(int)) – Tree position of the head word.
educe.ptb.head_finder.find_lexical_heads(tree)
Find the lexical head at each node of a constituency tree.
The logic corresponds to Collins’ head finding rules.
This is typically used to find the lexical head of each node of a (clean) educe.external.parser.ConstituencyTree
whose leaves are educe.external.postag.Token.
Parameters tree (nltk.Tree with educe.external.postag.RawToken leaves) – PTB tree whose lexical
heads we want
Returns head_word – Map each node of the constituency tree to its lexical head. Both nodes are
designated by their (NLTK) tree position (a.k.a. Gorn address).
Return type dict(tuple(int), tuple(int))
educe.ptb.head_finder.load_head_rules(f)
Load the head rules from file f.
Return a dictionary from parent non-terminal to (direction, priority list).
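The typical call sequence described above, as a sketch (tree and wanted stand in for a parsed tree and the tree positions of one EDU's tokens):

from educe.ptb.head_finder import find_edu_head, find_lexical_heads

hwords = find_lexical_heads(tree)
found = find_edu_head(tree, hwords, wanted)
if found is not None:
    head_node, head_word = found
    print(tree[head_word])  # the head token itself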
4.3.5 educe.rst_dt package
Conventions specific to the RST discourse treebank project
Subpackages
educe.rst_dt.learning package
Submodules
educe.rst_dt.learning.args module
Command line options for learning commands
class educe.rst_dt.learning.args.FeatureSetAction(option_strings, dest, nargs=None, **kwargs)
Bases: argparse.Action
Select the desired feature set
educe.rst_dt.learning.args.add_usual_input_args(parser)
Augment a subcommand argparser with typical input arguments. Sometimes your subcommand may require
slightly different input arguments, in which case just don’t call this function.
educe.rst_dt.learning.base module
Basics for feature extraction
class educe.rst_dt.learning.base.DocumentPlusPreprocessor(token_filter=None)
Bases: object
Preprocessor for feature extraction on a DocumentPlus
This pre-processor currently does not explicitly impute missing values, but it probably should eventually. As
the ultimate output is features in a sparse format, the current strategy amounts to imputing missing values as 0,
which is most certainly not optimal.
preprocess(doc, strict=False)
Preprocess a document and output basic features for each EDU.
Return a dict(EDU, (dict(basic_feat_name, basic_feat_val)))
TODO explicitly impute missing values, e.g. for (rev_)idxes_in_*
exception educe.rst_dt.learning.base.FeatureExtractionException(msg)
Bases: exceptions.Exception
Exceptions related to RST trees not looking like we would expect them to
educe.rst_dt.learning.base.edu_feature(wrapped)
Lift a function from edu -> feature to single_function_input -> feature
educe.rst_dt.learning.base.edu_pair_feature(wrapped)
Lifts a function from (edu, edu) -> f to pair_function_input -> f
educe.rst_dt.learning.base.lowest_common_parent(treepositions)
Find tree position of the lowest common parent of a list of nodes.
treepositions is a list of tree positions see nltk.tree.Tree.treepositions()
educe.rst_dt.learning.base.on_first_bigram(wrapped)
Lift a function from a -> string to [a] -> string; the function will be applied to up to the first two elements of the
list and the results concatenated. It returns None if the list is empty.
educe.rst_dt.learning.base.on_first_unigram(wrapped)
Lift a function from a -> b to [a] -> b taking the first item or returning None if empty list
educe.rst_dt.learning.base.on_last_bigram(wrapped)
Lift a function from a -> string to [a] -> string; the function will be applied to up to the last two elements of the
list and the results concatenated. It returns None if the list is empty.
educe.rst_dt.learning.base.on_last_unigram(wrapped)
Lift a function from a -> b to [a] -> b taking the last item or returning None if empty list
educe.rst_dt.learning.doc_vectorizer module
This submodule implements document vectorizers
class educe.rst_dt.learning.doc_vectorizer.DocumentCountVectorizer(instance_generator, feature_set, lecsie_data_dir=None, max_df=1.0, min_df=1, max_features=None, vocabulary=None, separator='=', split_feat_space=None)
Bases: object
Fancy vectorizer for the RST-DT treebank.
See sklearn.feature_extraction.text.CountVectorizer for reference.
build_analyzer()
Return a callable that extracts feature vectors from a doc
decode(doc)
Decode the input into a DocumentPlus
doc is an educe.rst_dt.document_plus.DocumentPlus
fit(raw_documents, y=None)
Learn a vocabulary dictionary of all features from the documents
fit_transform(raw_documents, y=None)
Learn the vocabulary dictionary and generate (row, (tgt, src))
transform(raw_documents)
Transform documents to a feature matrix
Note: generator of (row, (tgt, src))
class educe.rst_dt.learning.doc_vectorizer.DocumentLabelExtractor(instance_generator, unknown_label='__UNK__', labelset=None)
Bases: object
Label extractor for the RST-DT treebank.
build_analyzer()
Return a callable that extracts feature vectors from a doc
decode(doc)
Decode the input into a DocumentPlus
doc is an educe.corpus.FileId
fit(raw_documents)
Learn a labelset from the documents
fit_transform(raw_documents)
Learn the label encoder and return a vector of labels
There is one label per instance extracted from raw_documents.
transform(raw_documents)
Transform documents to a label vector
educe.rst_dt.learning.doc_vectorizer.re_emit(feats, suff)
Re-emit feats with suff appended to each feature name
educe.rst_dt.learning.features module
Feature extraction library functions for RST_DT corpus
educe.rst_dt.learning.features.build_doc_preprocessor()
Build the preprocessor for feature extraction in each EDU of doc
educe.rst_dt.learning.features.build_edu_feature_extractor()
Build the feature extractor for single EDUs
educe.rst_dt.learning.features.build_pair_feature_extractor()
Build the feature extractor for pairs of EDUs
TODO: properly emit features on single EDUs; they are already stored in sf_cache, but under (slightly) different
names
educe.rst_dt.learning.features.combine_features(feats_g, feats_d, feats_gd)
Generate features by taking a (linear) combination of features.
I suspect these do not have a great impact, if any, on results.
Parameters
• feats_g (dict(feat_name, feat_val)) – features of the gov EDU
• feats_d (dict(feat_name, feat_val)) – features of the dep EDU
• feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns cf – combined features
Return type dict(feat_name, feat_val)
educe.rst_dt.learning.features.extract_pair_gap(edu_info1, edu_info2)
Document tuple features
educe.rst_dt.learning.features.extract_pair_pos_tags(edu_info1, edu_info2)
POS tag features on EDU pairs
educe.rst_dt.learning.features.extract_pair_raw_word(edu_info1, edu_info2)
raw word features on EDU pairs
educe.rst_dt.learning.features.extract_single_ptb_token_pos(edu_info)
POS features on PTB tokens for the EDU
educe.rst_dt.learning.features.extract_single_ptb_token_word(edu_info)
word features on PTB tokens for the EDU
educe.rst_dt.learning.features.extract_single_raw_word(edu_info)
raw word features for the EDU
educe.rst_dt.learning.features.product_features(feats_g, feats_d, feats_gd)
Generate features by taking the product of features.
Parameters
• feats_g (dict(feat_name, feat_val)) – features of the gov EDU
• feats_d (dict(feat_name, feat_val)) – features of the dep EDU
• feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns pf – product features
Return type dict(feat_name, feat_val)
educe.rst_dt.learning.features_dev module
Experimental features.
class educe.rst_dt.learning.features_dev.LecsieFeats(lecsie_data_dir)
Bases: object
Extract Lecsie features from each pair of EDUs
fit(edu_pairs, y=None)
transform(edu_pairs)
educe.rst_dt.learning.features_dev.build_doc_preprocessor()
Build the preprocessor for feature extraction in each EDU of doc
educe.rst_dt.learning.features_dev.build_edu_feature_extractor()
Build the feature extractor for single EDUs
educe.rst_dt.learning.features_dev.build_pair_feature_extractor(lecsie_data_dir=None)
Build the feature extractor for pairs of EDUs
TODO: properly emit features on single EDUs; they are already stored in sf_cache, but under (slightly) different
names
educe.rst_dt.learning.features_dev.combine_features(feats_g, feats_d, feats_gd)
Generate features by taking a (linear) combination of features.
I suspect these do not have a great impact, if any, on results.
Parameters
• feats_g (dict(feat_name, feat_val)) – features of the gov EDU
• feats_d (dict(feat_name, feat_val)) – features of the dep EDU
• feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns cf – combined features
Return type dict(feat_name, feat_val)
educe.rst_dt.learning.features_dev.extract_pair_doc(edu_info1, edu_info2)
Document-level tuple features
educe.rst_dt.learning.features_dev.extract_pair_para(edu_info1, edu_info2)
Paragraph tuple features
educe.rst_dt.learning.features_dev.extract_pair_sent(edu_info1, edu_info2)
Sentence tuple features
educe.rst_dt.learning.features_dev.extract_pair_syntax(edu_info1, edu_info2)
syntactic features for the pair of EDUs
educe.rst_dt.learning.features_dev.extract_single_length(edu_info)
Sentence features for the EDU
educe.rst_dt.learning.features_dev.extract_single_para(edu_info)
paragraph features for the EDU
educe.rst_dt.learning.features_dev.extract_single_pdtb_markers(edu_info)
Features on the presence of PDTB discourse markers in the EDU
educe.rst_dt.learning.features_dev.extract_single_pos(edu_info)
POS features for the EDU
educe.rst_dt.learning.features_dev.extract_single_sentence(edu_info)
Sentence features for the EDU
educe.rst_dt.learning.features_dev.extract_single_syntax(edu_info)
syntactic features for the EDU
educe.rst_dt.learning.features_dev.extract_single_word(edu_info)
word features for the EDU
educe.rst_dt.learning.features_dev.product_features(feats_g, feats_d, feats_gd)
Generate features by taking the product of features.
Parameters
• feats_g (dict(feat_name, feat_val)) – features of the gov EDU
• feats_d (dict(feat_name, feat_val)) – features of the dep EDU
• feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns pf – product features
Return type dict(feat_name, feat_val)
educe.rst_dt.learning.features_dev.split_feature_space(feats_g, feats_d, feats_gd, keep_original=False, split_criterion='dir')
Split feature space on a criterion.
Currently supported criteria are:
• ‘dir’: directionality of attachment
• ‘sent’: intra/inter-sentential
• ‘dir_sent’: directionality + intra/inter-sentential
Parameters
• feats_g (dict(feat_name, feat_val)) – features of the gov EDU
• feats_d (dict(feat_name, feat_val)) – features of the dep EDU
• feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
• keep_original (boolean, default=False) – whether to keep or replace the original features with the derived split features
• split_criterion (string) – feature(s) on which to split the feature space, options are
‘dir’ for directionality of attachment, ‘sent’ for intra/inter sentential, ‘dir_sent’ for their
conjunction
Returns feats_g, feats_d, feats_gd – dicts of features with their copies
Return type (dict(feat_name, feat_val))
Notes
This function should probably be generalized and moved to a more relevant place.
educe.rst_dt.learning.features_dev.token_filter_li2014(token)
Token filter defined in Li et al.’s parser.
This filter only applies to tagged tokens.
educe.rst_dt.learning.features_li2014 module
Partial re-implementation of the feature extraction procedure used in [li2014text] for discourse dependency parsing
on the RST-DT corpus.
[li2014text]: Text-level discourse dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association
for Computational Linguistics (Vol. 1, pp. 25-35). http://www.aclweb.org/anthology/P/P14/P14-1003.pdf
educe.rst_dt.learning.features_li2014.build_doc_preprocessor()
Build the preprocessor for feature extraction in each EDU of doc
educe.rst_dt.learning.features_li2014.build_edu_feature_extractor()
Build the feature extractor for single EDUs
educe.rst_dt.learning.features_li2014.build_pair_feature_extractor()
Build the feature extractor for pairs of EDUs
TODO: properly emit features on single EDUs; they are already stored in sf_cache, but under (slightly) different
names
educe.rst_dt.learning.features_li2014.combine_features(feats_g, feats_d, feats_gd)
Generate features by taking a (linear) combination of features.
I suspect these do not have a great impact, if any, on results.
Parameters
• feats_g (dict(feat_name, feat_val)) – features of the gov EDU
• feats_d (dict(feat_name, feat_val)) – features of the dep EDU
• feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns cf – combined features
Return type dict(feat_name, feat_val)
educe.rst_dt.learning.features_li2014.extract_pair_length(edu_info1, edu_info2)
Sentence tuple features
educe.rst_dt.learning.features_li2014.extract_pair_para(edu_info1, edu_info2)
Paragraph tuple features
educe.rst_dt.learning.features_li2014.extract_pair_pos(edu_info1, edu_info2)
POS tuple features
educe.rst_dt.learning.features_li2014.extract_pair_sent(edu_info1, edu_info2)
Sentence tuple features
educe.rst_dt.learning.features_li2014.extract_pair_word(edu_info1, edu_info2)
word tuple features
educe.rst_dt.learning.features_li2014.extract_single_length(edu_info)
Sentence features for the EDU
educe.rst_dt.learning.features_li2014.extract_single_para(edu_info)
paragraph features for the EDU
educe.rst_dt.learning.features_li2014.extract_single_pos(edu_info)
POS features for the EDU
educe.rst_dt.learning.features_li2014.extract_single_sentence(edu_info)
Sentence features for the EDU
educe.rst_dt.learning.features_li2014.extract_single_syntax(edu_info)
syntactic features for the EDU
educe.rst_dt.learning.features_li2014.extract_single_word(edu_info)
word features for the EDU
educe.rst_dt.learning.features_li2014.get_syntactic_labels(edu_info)
Syntactic labels for this EDU
educe.rst_dt.learning.features_li2014.product_features(feats_g, feats_d, feats_gd)
Generate features by taking the product of features.
Parameters
• feats_g (dict(feat_name, feat_val)) – features of the gov EDU
• feats_d (dict(feat_name, feat_val)) – features of the dep EDU
• feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns pf – product features
Return type dict(feat_name, feat_val)
educe.rst_dt.learning.features_li2014.token_filter_li2014(token)
Token filter defined in Li et al.’s parser.
This filter only applies to tagged tokens.
educe.rst_dt.util package
Submodules
educe.rst_dt.util.args module
Command line options
educe.rst_dt.util.args.add_usual_input_args(parser)
Augment a subcommand argparser with typical input arguments. Sometimes your subcommand may require
slightly different input arguments, in which case just don’t call this function.
:param doc_subdoc_required: force user to supply --doc/--subdoc for this subcommand
:type doc_subdoc_required: bool
:param help_suffix: appended to --doc/--subdoc help strings
:type help_suffix: string
educe.rst_dt.util.args.add_usual_output_args(parser)
Augment a subcommand argparser with typical output arguments. Sometimes your subcommand may require
slightly different output arguments, in which case just don’t call this function.
educe.rst_dt.util.args.announce_output_dir(output_dir)
Tell the user where we saved the output
educe.rst_dt.util.args.get_output_dir(args)
Return the output directory specified on (or inferred from) the command line arguments, creating it if necessary.
We try the following in order:
1. If --output is given explicitly, we’ll just use/create that
2. OK, just make a temporary directory. Later on, you’ll probably want to call announce_output_dir.
educe.rst_dt.util.args.read_corpus(args, verbose=True)
Read the section of the corpus specified in the command line arguments.
Submodules
educe.rst_dt.annotation module
Educe-style representation for RST discourse treebank trees
class educe.rst_dt.annotation.EDU(num, span, text, context=None, origin=None)
Bases: educe.annotation.Standoff
An RST leaf node
context = None
See the RSTContext object
identifier()
A global identifier (assuming the origin can be used to uniquely identify an RST tree)
is_left_padding()
Returns True for left padding EDUs
classmethod left_padding(context=None, origin=None)
Return a left padding EDU
num = None
EDU number (as used in tree node edu_span)
raw_text = None
text that was in the EDU annotation itself
This is not the same as the text that was in the annotated document, on which all standoff annotations and
spans are based.
set_context(context)
Update the context of this annotation.
set_origin(origin)
Update the origin of this annotation and any contained within
span = None
text span
text()
Return the text associated with this EDU. We try to return the underlying annotated text if we have the
necessary context; if not, we just fall back to the raw EDU text
class educe.rst_dt.annotation.Node(nuclearity, edu_span, span, rel, context=None)
Bases: object
A node in an RSTTree or SimpleRSTTree.
context = None
See the RSTContext object
edu_span = None
pair of integers denoting edu span by count
is_nucleus()
A node can either be a nucleus, a satellite, or a root node. It may be easier to work with SimpleRSTTree,
in which nodes can only be nucleus/satellite or, much more rarely, root.
is_satellite()
A node can either be a nucleus, a satellite, or a root node.
nuclearity = None
one of Nucleus, Satellite, Root
rel = None
relation label (see SimpleRSTTree for a note on the different interpretation of rel with this and RSTTree)
span = None
span
class educe.rst_dt.annotation.RSTContext(text, sentences, paragraphs)
Bases: object
Additional annotations or contextual information that could accompany an RST tree proper. The idea is to have
each subtree pointing back to the same context object for easy retrieval.
paragraphs = None
Paragraph annotations pointing back to the text
sentences = None
sentence annotations pointing back to the text
text(span=None)
Return the text associated with these annotations (or None), optionally limited to a span
class educe.rst_dt.annotation.RSTTree(node, children, origin=None)
Bases: educe.external.parser.SearchableTree, educe.annotation.Standoff
Representation of RST trees which sticks fairly closely to the raw RST discourse treebank one.
edu_span()
Return the span of the tree in terms of EDU count (self.span, by contrast, refers to character offsets)
set_origin(origin)
Update the origin of this annotation and any contained within
text()
Return the text corresponding to this RST subtree. If the context is set, we return the appropriate segment
from the subset of the text. If not, we just concatenate the raw text of all EDU leaves.
text_span()
exception educe.rst_dt.annotation.RSTTreeException(msg)
Bases: exceptions.Exception
Exceptions related to RST trees not looking like we would expect them to
class educe.rst_dt.annotation.SimpleRSTTree(node, children, origin=None)
Bases: educe.external.parser.SearchableTree, educe.annotation.Standoff
Possibly easier representation of RST trees to work with:
•binary
•relation labels on parent nodes instead of children
Note that RSTTree and SimpleRSTTree share the same Node type, but because of the subtle difference in interpretation, you should be extremely careful not to mix and match.
classmethod from_rst_tree(tree)
Build and return a SimpleRSTTree from an RSTTree
classmethod incorporate_nuclearity_into_label(tree)
Integrate nuclearity of the children into each node’s label.
Nuclearity of the children is incorporated in one of two forms, NN for multi- and NS for mono-nuclear
relations.
Parameters tree (SimpleRSTTree) – The tree of which we want a version with nuclearity
incorporated
Returns mod_tree – The same tree but with the type of nuclearity incorporated
Return type SimpleRSTTree
Note: This is probably not the best way to provide this functionality. In other words, refactoring is much
needed here.
set_origin(origin)
Recursively update the origin for this annotation, ie. a little link to the document metadata for this annotation
text_span()
classmethod to_binary_rst_tree(tree, rel=None)
Build and return a binary RSTTree from a SimpleRSTTree.
This function is recursive; it essentially pushes the relation label from the parent to the satellite child (for
mononuclear relations) or to all nucleus children (for multinuclear relations).
Parameters
• tree (SimpleRSTTree) – SimpleRSTTree to convert
• rel (string, optional) – Relation that must decorate the root node of the output
Returns rtree – The (binary) RSTTree that corresponds to the given SimpleRSTTree
Return type RSTTree
educe.rst_dt.annotation.is_binary(tree)
True if the given RST tree or SimpleRSTTree is indeed binary
educe.rst_dt.corpus module
Corpus management (re-exported by educe.rst_dt)
class educe.rst_dt.corpus.Reader(corpusdir)
Bases: educe.corpus.Reader
See educe.corpus.Reader for details
files(exclude_file_docs=False)
Parameters exclude_file_docs (boolean, optional (default=False)) – If True, fileX documents are ignored. The figures reported by (Li et al., 2014) on the RST-DT corpus indicate
they exclude fileN files, whereas Joty seems to include them. fileN documents are more damaged than wsj_XX documents, e.g. text mismatches with the corresponding document in the
PTB.
slurp_subcorpus(cfiles, verbose=False)
See educe.rst_dt.parse for a description of RSTTree
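By way of illustration, a minimal corpus-reading sketch (the corpus path is an assumption; slurp comes from the base educe.corpus.Reader interface):
from educe.rst_dt import corpus

# point this at a local copy of the RST-DT training section
reader = corpus.Reader('data/RSTtrees-WSJ-main-1.0/TRAINING')
anno_files = reader.files()
rst_corpus = reader.slurp(anno_files, verbose=True)  # FileId -> RSTTree
for key in list(rst_corpus)[:1]:
    print(key, rst_corpus[key].edu_span())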
class educe.rst_dt.corpus.RstDtParser(corpus_dir, args, coarse_rels=False, exclude_file_docs=False)
Bases: object
Fake parser that gets annotation from the RST-DT.
decode(doc_key)
Decode a document from the RST-DT (gold)
parse(doc)
Parse the document using the RST-DT (gold).
segment(doc)
Segment the document into EDUs using the RST-DT (gold).
class educe.rst_dt.corpus.RstRelationConverter(relmap_file)
Bases: object
Converter for RST relations (labels)
Known to work on RstTree, possibly SimpleRstTree (untested).
convert_label(label)
Convert a label following the mapping, lowercased otherwise
convert_tree(rst_tree)
Change relation labels in rst_tree using the mapping
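A hedged sketch of the converter in use (the mapping file name is an assumption; rst_tree stands for an RSTTree read from the corpus):
from educe.rst_dt.corpus import RstRelationConverter

converter = RstRelationConverter('relmap.txt')   # hypothetical mapping file
print(converter.convert_label('Attribution-e'))  # mapped, else lowercased
converter.convert_tree(rst_tree)                 # relabels rst_tree in place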
educe.rst_dt.corpus.id_to_path(k)
Given a fleshed out FileId (none of the fields are None), return a filepath for it following RST Discourse Treebank
conventions.
You will likely want to add your own filename extensions to this path
educe.rst_dt.corpus.mk_key(doc)
Return a corpus key for a given document name
educe.rst_dt.deptree module
Convert RST trees to dependency trees and back.
class educe.rst_dt.deptree.RstDepTree(edus=[], origin=None)
Bases: object
RST dependency tree
add_dependency(gov_num, dep_num, label=None, nuc=’Satellite’, rank=None)
Add a dependency between two EDUs.
Parameters
• gov_num (int) – Number of the head EDU
• dep_num (int) – Number of the modifier EDU
• label (string, optional) – Label of the dependency
• nuc (string, one of [NUC_S, NUC_N]) – Nuclearity of the modifier
• rank (integer, optional) – Rank of the modifier in the order of attachment to the head.
None means it is not given declaratively and it is instead inferred from the rank of modifiers
previously attached to the head.
append_edu(edu)
Append an EDU to the list of EDUs
deps(gov_idx)
Get the ordered list of dependents of an EDU
classmethod from_simple_rst_tree(rtree)
Convert a SimpleRSTTree to an RstDepTree
get_dependencies()
Get the list of dependencies in this dependency tree.
Each dependency is a 3-tuple (gov, dep, label), gov and dep being EDUs.
real_roots_idx()
Get the list of the indices of the real roots
set_origin(origin)
Update the origin of this annotation
set_root(root_num)
Designate an EDU as a real root of the RST tree structure
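Putting the methods above together, a minimal hand-built sketch (edu1..edu3 stand for educe.rst_dt.annotation.EDU objects you already have):
from educe.rst_dt.deptree import RstDepTree

dtree = RstDepTree()
for edu in (edu1, edu2, edu3):
    dtree.append_edu(edu)
dtree.set_root(1)                                # EDU 1 is the real root
dtree.add_dependency(1, 2, label='elaboration')  # EDU 1 governs EDU 2
dtree.add_dependency(1, 3, label='attribution')
print(dtree.get_dependencies())  # [(gov, dep, label), ...]
print(dtree.deps(1))             # ordered dependents of EDU 1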
exception educe.rst_dt.deptree.RstDtException(msg)
Bases: exceptions.Exception
Exceptions related to conversion between RST and DT trees. The general expectation is that we only raise
these on bad input, but in practice, you may see them more in cases of implementation error somewhere in the
conversion process.
educe.rst_dt.document_plus module
This submodule implements a document with additional information.
class educe.rst_dt.document_plus.DocumentPlus(key, grouping, rst_context)
Bases: object
A document and relevant contextual information
align_with_doc_structure()
Align EDUs with the document structure (paragraph and sentence).
Determine which paragraph and sentence (if any) surrounds this EDU. Try to accommodate the occasional
off-by-a-smidgen error by folks marking these EDU boundaries, eg. original text:
Para1: "Magazines are not providing us in-depth information on circulation," said Edgar Bronfman Jr., ..
"How do readers feel about the magazine?... Research doesn't tell us whether people actually do read the
magazines they subscribe to."
Para2: Reuben Mark, chief executive of Colgate-Palmolive, said...
Marked up EDU is wide to the left by three characters: "
Reuben Mark, chief executive of Colgate-Palmolive, said...
align_with_raw_words()
Compute for each EDU the raw tokens it contains
This is a dirty temporary hack to enable backwards compatibility. There should be one clean text per
document, one tokenization and so on, but, well.
align_with_tokens()
Compute for each EDU the overlapping tokens
align_with_trees(strict=False)
Compute for each EDU the overlapping trees
all_edu_pairs()
Generate all EDU pairs of a document
relations(edu_pairs)
Get the relation that holds in each of the edu_pairs
educe.rst_dt.document_plus.align_edus_with_paragraphs(doc_edus, doc_paras, text, strict=False)
Align EDUs with paragraphs, if any.
Parameters
• doc_edus –
• doc_paras –
• strict –
Returns edu2para – Index of the paragraph that contains each EDU, None if the paragraph segmentation is missing.
Return type list(int) or None
educe.rst_dt.document_plus.containing(span)
span -> anno -> bool
if this annotation encloses the given span
educe.rst_dt.graph module
Converter from RST Discourse Treebank trees to educe-style hypergraphs
class educe.rst_dt.graph.DotGraph(anno_graph)
Bases: educe.graph.DotGraph
A dot representation of this graph for visualisation. The to_string() method is most likely to be of interest here
class educe.rst_dt.graph.Graph
Bases: educe.graph.Graph
classmethod from_doc(corpus, doc_key)
educe.rst_dt.parse module
From RST discourse treebank trees to Educe-style objects (reading the format from Di Eugenio’s corpus of instructional texts).
The main classes of interest are RSTTree and EDU. RSTTree can be treated as an NLTK Tree structure. It is also an
educe Standoff object, which means that it points to other RST trees (their children) or to EDU.
educe.rst_dt.parse.parse_lightweight_tree(tstr)
Parse lightweight RST debug syntax into SimpleRSTTree, eg.
(R:attribution
   (N:elaboration (N foo) (S bar))
   (S quux))
This is mostly useful for debugging or for knocking out quick examples
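For example (same tree as above, written on one line):
from educe.rst_dt.parse import parse_lightweight_tree

stree = parse_lightweight_tree(
    '(R:attribution (N:elaboration (N foo) (S bar)) (S quux))')
print(stree)  # a SimpleRSTTree, which can be treated as an NLTK Tree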
educe.rst_dt.parse.parse_rst_dt_tree(tstr, context=None)
Read a single RST tree from its RST DT string representation. If context is set, align the tree with it. You should
really try to pass in a context (see RSTContext) if you can; the None case is really intended for testing, or for cases
where you don't have an original text
educe.rst_dt.parse.read_annotation_file(anno_filename, text_filename)
Read a single RST tree
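A hedged example of reading one tree from disk (file names are assumptions following RST-DT conventions):
from educe.rst_dt.parse import read_annotation_file

rtree = read_annotation_file('wsj_0601.out.dis', 'wsj_0601.out')
print(rtree.edu_span())   # span of the whole tree in EDU counts
print(rtree.text()[:60])  # start of the underlying text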
educe.rst_dt.ptb module
Alignment of the RST-WSJ-corpus with the Penn Treebank
class educe.rst_dt.ptb.PtbParser(corpus_dir)
Bases: object
Gold parser that gets annotations from the PTB.
It uses an instantiated NLTK BracketedParseCorpusReader for the PTB section relevant to the RST DT corpus.
Note that the path you give to this will probably end with something like parsed/mrg/wsj
parse(doc)
Given a document, return a list of educified PTB parse trees (one per sentence).
These are almost the same as the trees that would be returned by the parsed_sents method, except that each
leaf/node is associated with a span within the RST DT text.
Note: does nothing if there is no associated PTB corpus entry.
tokenize(doc)
Tokenize the document text using the PTB gold annotation.
Return a tokenized document.
educe.rst_dt.ptb.align_edus_with_sentences(edus, syn_trees, strict=False)
Map each EDU to its sentence.
If an EDU span overlaps with more than one sentence span, the sentence with maximal overlap is chosen.
Parameters
• edus (list(EDU)) – List of EDUs.
• syn_trees (list(Tree)) – List of syntactic trees, one per sentence.
• strict (boolean, default False) – If True, raise an error if an EDU does not map to exactly
one sentence.
Returns edu2sent – Map from EDU to (0-based) sentence index or None.
Return type list(int or None)
educe.rst_dt.rst_wsj_corpus module
This module provides loaders for file formats found in the RST-WSJ-corpus.
educe.rst_dt.rst_wsj_corpus.load_rst_wsj_corpus_edus_file(f )
Load a file that contains the EDUs of a document.
Return clean text and the list of EDU offsets on the clean text.
educe.rst_dt.rst_wsj_corpus.load_rst_wsj_corpus_text_file(f )
Load a text file from the RST-WSJ-CORPUS.
Return the text plus its sentences and paragraphs.
The corpus contains two types of text files, so this function is mainly an entry point that delegates to the appropriate function.
educe.rst_dt.rst_wsj_corpus.load_rst_wsj_corpus_text_file_file(f )
Load a text file whose name is of the form file##
These files do not mark paragraphs. Each line contains a sentence preceded by two or three leading spaces.
educe.rst_dt.rst_wsj_corpus.load_rst_wsj_corpus_text_file_wsj(f )
Load a text file whose name is of the form wsj_##
By convention:
•paragraphs are separated by double newlines
•sentences by single newlines
Note that this segmentation isn’t particularly reliable, and seems to both over- (e.g. cut at some abbreviations,
like “Prof.”) and under-segment (e.g. not separate contiguous sentences). It shouldn’t be taken too seriously,
but if you need some sort of rough approximation, it may be helpful.
educe.rst_dt.sdrt module
Convert RST trees to SDRT style EDU/CDU annotations.
The core of the conversion is rst_to_sdrt which produces an intermediary pointer based representation (a single CDU
pointing to other CDUs and EDUs).
A fancier variant, rst_to_glozz_sdrt, wraps around this core and further converts the CDU into a Glozz-friendly form
class educe.rst_dt.sdrt.CDU(members, rel_insts)
A CDU contains one or more discourse units, and tracks relation instances between its members. Both CDU
and EDU are discourse units.
class educe.rst_dt.sdrt.RelInst(source, target, type)
Relation instance (educe.annotation calls these 'Relation's, which is really more in keeping with how Glozz
classes them, but properly speaking relation instance is a better name)
educe.rst_dt.sdrt.debug_du_to_tree(m)
Tree representation of CDU, treating the set of relation instances as the parent of each node. Loses information;
should only be used for debugging purposes.
educe.rst_dt.sdrt.rst_to_glozz_sdrt(rst_tree, annotator=’ldc’)
From an RST tree to a STAC-like version using Glozz annotations. Uses rst_to_sdrt
educe.rst_dt.sdrt.rst_to_sdrt(tree)
From RSTTree to CDU or EDU (recursive, top-down transformation). We recognise three patterns walking
down the tree (anything else is considered to be an error):
•Pre-terminal nodes: Return the leaf EDU
•Mono-nuclear, N satellites: Return a CDU with a relation instance from the nucleus to each satellite. As
an informal example, given X(attribution:S1, N, explanation-argumentative:S2), we return a CDU with
sdrt(N) --attribution--> sdrt(S1) and sdrt(N) --explanation-argumentative--> sdrt(S2)
•Multi-nuclear, 0 satellites: Return a CDU with a relation instance across each successive nucleus (assuming
the same relation). As an informal example, given X(List:N1, List:N2, List:N3), we return a CDU
containing sdrt(N1) --List--> sdrt(N2) --List--> sdrt(N3).
educe.rst_dt.text module
Educe-style annotations for RST discourse treebank text objects (paragraphs and sentences)
class educe.rst_dt.text.Paragraph(num, sentences)
Bases: educe.annotation.Standoff
A paragraph is a sequence of Sentences (also standoff annotations).
classmethod left_padding(sentences)
Return a left padding Paragraph
num = None
paragraph ID in document
sentences = None
sentence-level annotations
class educe.rst_dt.text.Sentence(num, span)
Bases: educe.annotation.Standoff
Just a text span really
classmethod left_padding()
Return a left padding Sentence
num = None
sentence ID in document
text_span()
educe.rst_dt.text.clean_edu_text(text)
Strip metadata from EDU text and compress extraneous whitespace
4.3.6 educe.stac package
Conventions specific to the STAC project
This includes things like
• corpus layout (see corpus_files)
• which annotations are of interest
• renaming/deleting/collapsing annotation labels
Subpackages
educe.stac.learning package
Helpers for machine-learning tasks
Submodules
educe.stac.learning.addressee module EDU addressee prediction
educe.stac.learning.addressee.guess_addressees_for_edu(contexts, players, edu)
return a set of possible addressees for the given EDU or None if unclear
At the moment, the basis for our guesses is very crude: we simply guess that we have an addressee if the EDU
ends or starts with their name
educe.stac.learning.addressee.is_emoticon(token)
True if the token is tagged as an emoticon
educe.stac.learning.addressee.is_preposition(token)
True if the token is tagged as a preposition
educe.stac.learning.addressee.is_punct(token)
True if the token is tagged as punctuation
educe.stac.learning.addressee.is_verb(token)
True if the token is tagged as a verb
educe.stac.learning.doc_vectorizer module This submodule implements document vectorizers
class educe.stac.learning.doc_vectorizer.DialogueActVectorizer(instance_generator, labels)
Bases: object
Dialogue act extractor for the STAC corpus.
transform(raw_documents)
Learn the label encoder and return a vector of labels
There is one label per instance extracted from raw_documents.
class educe.stac.learning.doc_vectorizer.LabelVectorizer(instance_generator, labels, zero=False)
Bases: object
Label extractor for the STAC corpus.
transform(raw_documents)
Learn the label encoder and return a vector of labels
There is one label per instance extracted from raw_documents.
educe.stac.learning.features module Feature extraction library functions for STAC corpora. The feature extraction
script (rel-info) is a lightweight frontend to this library
exception educe.stac.learning.features.CorpusConsistencyException(msg)
Bases: exceptions.Exception
Exceptions which arise if one of our expectations about the corpus data is violated, in short, weird things we
don’t know how to handle. We should avoid using this for things which are definitely bugs in the code, and not
just weird things in the corpus we didn’t know how to handle.
class educe.stac.learning.features.DocEnv(inputs, current, sf_cache)
Bases: tuple
__getnewargs__()
Return self as a plain tuple. Used by copy and pickle.
__getstate__()
Exclude the OrderedDict from pickling
__repr__()
Return a nicely formatted representation string
current
Alias for field number 1
inputs
Alias for field number 0
sf_cache
Alias for field number 2
class educe.stac.learning.features.DocumentPlus(key, doc, unitdoc, players, parses)
Bases: tuple
__getnewargs__()
Return self as a plain tuple. Used by copy and pickle.
__getstate__()
Exclude the OrderedDict from pickling
__repr__()
Return a nicely formatted representation string
doc
Alias for field number 1
key
Alias for field number 0
parses
Alias for field number 4
players
Alias for field number 3
unitdoc
Alias for field number 2
class educe.stac.learning.features.EduGap(sf_cache, inner_edus, turns_between)
Bases: tuple
__getnewargs__()
Return self as a plain tuple. Used by copy and pickle.
__getstate__()
Exclude the OrderedDict from pickling
__repr__()
Return a nicely formatted representation string
inner_edus
Alias for field number 1
sf_cache
Alias for field number 0
turns_between
Alias for field number 2
class educe.stac.learning.features.FeatureCache(inputs, current)
Bases: dict
Cache for single edu features. Retrieving an item from the cache lazily computes/memoises the single EDU
features for it.
expire(edu)
Remove an edu from the cache if it’s in there
class educe.stac.learning.features.FeatureInput(corpus, postags, parses, lexicons, pdtb_lex, verbnet_entries, inquirer_lex)
Bases: tuple
__getnewargs__()
Return self as a plain tuple. Used by copy and pickle.
__getstate__()
Exclude the OrderedDict from pickling
__repr__()
Return a nicely formatted representation string
corpus
Alias for field number 0
inquirer_lex
Alias for field number 6
lexicons
Alias for field number 3
parses
Alias for field number 2
pdtb_lex
Alias for field number 4
postags
Alias for field number 1
verbnet_entries
Alias for field number 5
class educe.stac.learning.features.InquirerLexKeyGroup(lexicon)
Bases: educe.learning.keys.KeyGroup
One feature per Inquirer lexicon class
fill(current, edu, target=None)
See SingleEduSubgroup
classmethod key_prefix()
All feature keys in this lexicon should start with this string
mk_field(entry)
From verb class to feature key
mk_fields()
Feature name for each relation in the lexicon
class educe.stac.learning.features.LexKeyGroup(lexicon)
Bases: educe.learning.keys.KeyGroup
The idea here is to provide a feature per lexical class in the lexicon entry
fill(current, edu, target=None)
See SingleEduSubgroup
key_prefix()
Common CSV header name prefix to all columns based on this particular lexicon
mk_field(cname, subclass=None)
For a given lexical class, return the name of its feature in the CSV file
mk_fields()
CSV field names for each entry/class in the lexicon
class educe.stac.learning.features.LexWrapper(key, filename, classes)
Bases: object
Configuration options for a given lexicon: where to find it, what to call it, what sorts of results to return
read(lexdir)
Read and store the lexicon as a mapping from words to their classes
class educe.stac.learning.features.MergedLexKeyGroup(inputs)
Bases: educe.learning.keys.MergedKeyGroup
Single-EDU features based on lexical lookup.
fill(current, edu, target=None)
See SingleEduSubgroup
class educe.stac.learning.features.PairKeys(inputs, sf_cache=None)
Bases: educe.learning.keys.MergedKeyGroup
Features for pairs of EDUs
fill(current, edu1, edu2, target=None)
See PairSubgroup
one_hot_values_gen(suffix='')
class educe.stac.learning.features.PairSubgroup(description, keys)
Bases: educe.learning.keys.KeyGroup
Abstract keygroup for subgroups of the merged PairKeys. We use these subgroup classes to help provide modularity, to capture the idea that the bits of code that define a set of related feature vector keys should go with the
bits of code that also fill them out
fill(current, edu1, edu2, target=None)
Fill out a vector’s features (if the vector is None, then we just fill out this group; but in the case of a merged
key group, you may find it desirable to fill out the merged group instead)
class educe.stac.learning.features.PairSubgroup_Gap(sf_cache)
Bases: educe.stac.learning.features.PairSubgroup
Features related to the combined surrounding context of the two EDUs
fill(current, edu1, edu2, target=None)
class educe.stac.learning.features.PairSubgroup_Tuple(inputs, sf_cache)
Bases: educe.stac.learning.features.PairSubgroup
artificial tuple features
fill(current, edu1, edu2, target=None)
class educe.stac.learning.features.PdtbLexKeyGroup(lexicon)
Bases: educe.learning.keys.KeyGroup
One feature per PDTB marker lexicon class
fill(current, edu, target=None)
See SingleEduSubgroup
classmethod key_prefix()
All feature keys in this lexicon should start with this string
mk_field(rel)
From relation name to feature key
mk_fields()
Feature name for each relation in the lexicon
class educe.stac.learning.features.SingleEduKeys(inputs)
Bases: educe.learning.keys.MergedKeyGroup
Features for a single EDU
fill(current, edu, target=None)
See SingleEduSubgroup.fill
class educe.stac.learning.features.SingleEduSubgroup(description, keys)
Bases: educe.learning.keys.KeyGroup
Abstract keygroup for subgroups of the merged SingleEduKeys. We use these subgroup classes to help provide
modularity, to capture the idea that the bits of code that define a set of related feature vector keys should go with
the bits of code that also fill them out
fill(current, edu, target=None)
Fill out a vector’s features (if the vector is None, then we just fill out this group; but in the case of a merged
key group, you may find it desirable to fill out the merged group instead)
This defaults to _magic_fill if you don’t implement it.
class educe.stac.learning.features.SingleEduSubgroup_Chat
Bases: educe.stac.learning.features.SingleEduSubgroup
Single-EDU features based on the EDU’s relationship with the chat structure (eg turns, dialogues).
class educe.stac.learning.features.SingleEduSubgroup_Parser
Bases: educe.stac.learning.features.SingleEduSubgroup
Single-EDU features that come out of a syntactic parser.
class educe.stac.learning.features.SingleEduSubgroup_Punct
Bases: educe.stac.learning.features.SingleEduSubgroup
punctuation features
class educe.stac.learning.features.SingleEduSubgroup_Token
Bases: educe.stac.learning.features.SingleEduSubgroup
word/token-based features
class educe.stac.learning.features.VerbNetEntry(classname, lemmas)
Bases: tuple
__getnewargs__()
Return self as a plain tuple. Used by copy and pickle.
__getstate__()
Exclude the OrderedDict from pickling
__repr__()
Return a nicely formatted representation string
classname
Alias for field number 0
lemmas
Alias for field number 1
class educe.stac.learning.features.VerbNetLexKeyGroup(ventries)
Bases: educe.learning.keys.KeyGroup
One feature per VerbNet lexicon class
fill(current, edu, target=None)
See SingleEduSubgroup
classmethod key_prefix()
All feature keys in this lexicon should start with this string
mk_field(ventry)
From verb class to feature key
mk_fields()
Feature name for each relation in the lexicon
educe.stac.learning.features.clean_chat_word(token)
Given a word and its postag (educe PosTag representation), return a somewhat tidied up version of the word.
•Sequences of the same letter greater than length 3 are shortened to just length three
•Letters are lower cased
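The two rules amount to something like the following sketch (not the library's code, which works on educe PosTag tokens rather than bare strings):
import re

def clean_chat_word_sketch(word):
    # lowercase, then squash runs of four or more identical characters
    # down to three, eg. 'NOOOOOO' -> 'nooo'
    return re.sub(r'(.)\1{3,}', r'\1\1\1', word.lower())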
educe.stac.learning.features.clean_dialogue_act(act)
Knock out temporary markers used during corpus annotation
educe.stac.learning.features.dialogue_act_pairs(current, cache, edu1, edu2)
tuple of dialogue acts for both EDUs
educe.stac.learning.features.edu_position_in_turn(_, edu)
relative position of the EDU in the turn
educe.stac.learning.features.edu_text_feature(wrapped)
Lift a text based feature into a standard single EDU one
(String -> a) ->
((Current, Edu) -> a)
educe.stac.learning.features.emoticons(tokens)
Given some tokens, return just those which are emoticons
educe.stac.learning.features.enclosed_lemmas(span, parses)
Given a span and a list of parses, return any lemmas that are within that span
educe.stac.learning.features.enclosed_trees(span, trees)
Return the biggest (sub)trees in xs that are enclosed in the span
educe.stac.learning.features.ends_with_bang(current, edu)
if the EDU text ends with ‘!’
educe.stac.learning.features.ends_with_qmark(current, edu)
if the EDU text ends with ‘?’
educe.stac.learning.features.extract_pair_features(inputs, stage)
Extraction for all relevant pairs in a document (generator)
educe.stac.learning.features.extract_single_features(inputs, stage)
Return a dictionary for each EDU
educe.stac.learning.features.feat_annotator(current, edu1, edu2)
annotator for the subdoc
educe.stac.learning.features.feat_end(_, edu)
text span end
educe.stac.learning.features.feat_has_emoticons(_, edu)
if the EDU has emoticon-tagged tokens
educe.stac.learning.features.feat_id(_, edu)
some sort of unique identifier for the EDU
educe.stac.learning.features.feat_is_emoticon_only(_, edu)
if the EDU consists solely of an emoticon
educe.stac.learning.features.feat_start(_, edu)
text span start
educe.stac.learning.features.get_players(inputs)
Return a dictionary mapping each document to the set of players in that document
educe.stac.learning.features.has_FOR_np(current, edu)
if the EDU has the pattern IN(for).. NP
educe.stac.learning.features.has_correction_star(current, edu)
if the EDU begins with a ‘*’ but does not contain others
educe.stac.learning.features.has_inner_question(current, gap, _edu1, _edu2)
if there is an intervening EDU that is a question
educe.stac.learning.features.has_one_of_words(sought, tokens, norm=<function <lambda>>)
Given a set of desired words and a collection of tokens, return True if the tokens contain a word matching one of
the desired words, modulo some minor normalisations like lowercasing.
educe.stac.learning.features.has_pdtb_markers(markers, tokens)
Given a sequence of tagged tokens, return True if any of the given PDTB markers appears within the tokens
educe.stac.learning.features.has_player_name_exact(current, edu)
if the EDU text has a player name in it
educe.stac.learning.features.has_player_name_fuzzy(current, edu)
if the EDU has a word that sounds like a player name
educe.stac.learning.features.is_just_emoticon(tokens)
Return true if a sequence of tokens consists of a single emoticon
educe.stac.learning.features.is_nplike(anno)
is some sort of NP annotation from a parser
educe.stac.learning.features.is_question(current, edu)
if the EDU is (or contains) a question
educe.stac.learning.features.is_question_pairs(current, cache, edu1, edu2)
boolean tuple: if each EDU is a question
educe.stac.learning.features.lemma_subject(*args, **kwargs)
the lemma corresponding to the subject of this EDU
educe.stac.learning.features.lexical_markers(lclass, tokens)
Given a dictionary (words to categories) and a text span, return all the categories of words that appear in that set.
Note that for now we are doing our own white-space based tokenisation, but it could make sense to use a different
source of tokens instead
educe.stac.learning.features.map_topdown(good, prunable, trees)
Do topdown search on all these trees, concatenate results.
educe.stac.learning.features.mk_env(inputs, people, key)
Pre-process and bundle up a representation of the current document
educe.stac.learning.features.mk_envs(inputs, stage)
Generate an environment for each document in the corpus within the given stage.
The environment pools together all the information we have on a single document
educe.stac.learning.features.mk_high_level_dialogues(inputs, stage)
Generate all relevant EDU pairs for a document (generator)
educe.stac.learning.features.mk_is_interesting(args, single)
Return a function that filters corpus keys to pick out the ones we specified on the command line
We have two cases here: for pair extraction, we just want to grab the units and if possible the discourse stage.
In live mode, there won’t be a discourse stage, but that’s fine because we can just fall back on units.
For single extraction (dialogue acts), we’ll also want to grab the units stage and fall back to unannotated when
in live mode. This is made a bit trickier by the fact that unannotated does not have an annotator, so we have to
accommodate that.
Phew.
It’s a bit specific to feature extraction in that here we are trying
educe.stac.learning.features.num_edus_between(_current, gap, _edu1, _edu2)
number of intervening EDUs (0 if adjacent)
educe.stac.learning.features.num_nonling_tstars_between(_current, gap, _edu1, _edu2)
number of non-linguistic turn-stars between EDUs
educe.stac.learning.features.num_speakers_between(_current, gap, _edu1, _edu2)
number of distinct speakers in intervening EDUs
educe.stac.learning.features.num_tokens(_, edu)
length of this EDU in tokens
educe.stac.learning.features.player_addresees(edu)
The set of people spoken to during an edu annotation. This excludes known non-players, like 'All', or '?', or
'Please choose...'.
educe.stac.learning.features.players_for_doc(corpus, kdoc)
Return the set of speakers/addressees associated with a document.
In STAC, documents are semi-arbitrarily cut into sub-documents for technical and possibly ergonomic reasons,
ie. meaningless as far as we are concerned. So to find all speakers, we would have to search all the subdocuments
of a single document.
(Corpus, String) -> Set String
educe.stac.learning.features.position_in_dialogue(_, edu)
relative position of the turn in the dialogue
educe.stac.learning.features.position_in_game(_, edu)
relative position of the turn in the game
educe.stac.learning.features.position_of_speaker_first_turn(edu)
Given an EDU context, determine the position of the first turn by that EDU’s speaker relative to other turns in
that dialogue.
educe.stac.learning.features.read_corpus_inputs(args)
Read and filter the part of the corpus we want features for
educe.stac.learning.features.read_pdtb_lexicon(args)
Read and return the local PDTB discourse marker lexicon.
educe.stac.learning.features.real_dialogue_act(edu)
Given an EDU in the ‘discourse’ stage of the corpus, return its dialogue act from the ‘units’ stage
educe.stac.learning.features.relation_dict(doc, quiet=False)
Return the relations instances from a document in the form of an id pair to label dictionary
If there is more than one relation between a pair of EDUs we pick one of them arbitrarily and ignore the other
educe.stac.learning.features.same_speaker(current, _, edu1, edu2)
if both EDUs have the same speaker
educe.stac.learning.features.same_turn(current, _, edu1, edu2)
if both EDUs are in the same turn
educe.stac.learning.features.speaker_already_spoken_in_dialogue(_, edu)
if the speaker for this EDU is the same as that of a previous turn in the dialogue
educe.stac.learning.features.speaker_id(_, edu)
Get the speaker ID
educe.stac.learning.features.speaker_started_the_dialogue(_, edu)
if the speaker for this EDU is the same as that of the first turn in the dialogue
educe.stac.learning.features.speakers_first_turn_in_dialogue(_, edu)
position in the dialogue of the turn in which the speaker for this EDU first spoke
educe.stac.learning.features.strip_cdus(corpus, mode)
For all documents in a corpus, remove any CDUs and relink the document according to the desired mode. This
mutates the corpus.
educe.stac.learning.features.subject_lemmas(span, trees)
Given a span and a list of dependency trees, return any lemmas which are marked as being some subject in that
span
educe.stac.learning.features.turn_follows_gap(_, edu)
if the EDU turn number is > 1 + previous turn
educe.stac.learning.features.type_text(wrapped)
Given a feature that emits text, clean its output up so to work with a wide variety of csv parsers
(a -> String) ->
(a -> String)
educe.stac.learning.features.word_first(*args, **kwargs)
the first word in this EDU
educe.stac.learning.features.word_last(*args, **kwargs)
the last word in this EDU
educe.stac.lexicon package
Submodules
educe.stac.lexicon.markers module Api on discourse markers (lexicon I/O mostly)
class educe.stac.lexicon.markers.LexConn(infile, version='2', stop=set([u'\xe0', u'ou', u'en', u'pour', u'et']))
get_by_form(form)
get_by_id(id)
get_by_lemma(lemma)
class educe.stac.lexicon.markers.Marker(elmt, version='2', stop=set([u'\xe0', u'ou', u'en', u'pour', u'et']))
wrapper class for a discourse marker read from Lexconn, version 1 or 2.
Should include at least id and cat (grammatical category); version 1 has type (coord/subord); version 2 has
grammatical host and lemma
get_forms()
get_lemma()
get_relations()
educe.stac.lexicon.pdtb_markers module Lexicon of discourse markers.
Cheap and cheerful phrasal lexicon format used in the STAC project. Maps sequences of multiword expressions to
relations they mark
as ; explanation explanation* background
as a result ; result result*
for example ; elaboration
if:then ; conditional
on the one hand:on the other hand
One entry per line. Sometimes you have split expressions, like “on the one hand X, on the other hand Y” (we model
this by saying that we are working with sequences of expressions, rather than single expressions). Phrases can be
associated with 0 to N relations (interpreted as disjunction; if wedge appears (LaTeX for the “logical and” operator),
it is ignored).
class educe.stac.lexicon.pdtb_markers.Marker(exprs)
Bases: object
A marker here is a sort of template consisting of multiword expressions and holes, eg. “on the one hand, XXX,
on the other hand YYY". We represent this as a sequence of Multiword
classmethod any_appears_in(markers, words, sep=’#####’)
Return True if any of the given markers appears in the word sequence.
See appears_in for details.
appears_in(words, sep=’#####’)
Given a sequence of words, return True if this marker appears in that sequence.
We use a very liberal definition here. In particular, if the marker has more than one component (on the one
hand X, on the other hand Y), we merely check that all components appear without caring what order they
appear in.
Note that this abuses the Python string matching functionality, and assumes that the separator substring
never appears in the tokens
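A hedged sketch of marker matching (constructors as documented here; the token sequence is invented):
from educe.stac.lexicon.pdtb_markers import Marker, Multiword

marker = Marker([Multiword('on the one hand'.split()),
                 Multiword('on the other hand'.split())])
words = 'on the one hand yes on the other hand no'.split()
print(marker.appears_in(words))  # True: both components appear somewhere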
class educe.stac.lexicon.pdtb_markers.Multiword(words)
Bases: object
A sequence of tokens representing a multiword expression.
educe.stac.lexicon.pdtb_markers.load_pdtb_markers_lexicon(filename)
Load the lexicon of discourse markers from the PDTB.
Parameters filename (string) – Path to the lexicon
Returns markers – Discourse markers and the relations they signal
Return type dict(Marker, list(string))
educe.stac.lexicon.pdtb_markers.read_lexicon(filename)
Load the lexicon of discourse markers from the PDTB, by relation.
This calls load_pdtb_markers_lexicon but inverts the indexing to map each relation to its possible discourse
markers.
Note that, as an effect of this inversion, discourse markers whose set of relations is left empty in the lexicon
(possibly because they are too ambiguous?) are absent from the inverted index.
Parameters filename (string) – Path to the lexicon
Returns relations – Relations and their signalling discourse markers
Return type dict(string, frozenset(Marker))
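A usage sketch (the lexicon file name is an assumption):
from educe.stac.lexicon.pdtb_markers import (load_pdtb_markers_lexicon,
                                             read_lexicon)

by_marker = load_pdtb_markers_lexicon('pdtb_markers.txt')  # Marker -> relations
by_relation = read_lexicon('pdtb_markers.txt')             # relation -> Markers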
educe.stac.lexicon.wordclass module Cheap and cheerful lexicon format used in the STAC project. One entry per
line, blanks ignored. Each entry associates
• some word with
• some kind of category (we call this a “lexical class”)
• an optional part of speech (?? if unknown)
• an optional subcategory blank if none
Here’s an example with all four fields
purchase:VBEchange:VB:receivable
acquire:VBEchange:VB:receivable
give:VBEchange:VB:givable
and one without the notion of subclass
ought:modal:MD:
except:negation:??:
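A hedged loading sketch (the file name is made up; the file is assumed to follow the four-field format above):
from educe.stac.lexicon.wordclass import Lexicon

lex = Lexicon.read_file('wordclasses.txt')
lex.dump()  # print the lexicon's contents to stdout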
class educe.stac.lexicon.wordclass.LexClass
Bases: educe.stac.lexicon.wordclass.LexClass
Grouping together information for a single lexical class. Our assumption here is that a word belongs to at most
one subclass
classmethod freeze(other)
A frozen copy of a lex class
just_subclasses()
Any subclasses associated with this lexical class
just_words()
Any words associated with this lexical class
classmethod new_writable_instance()
A brand new (empty) lex class
class educe.stac.lexicon.wordclass.LexEntry
Bases: educe.stac.lexicon.wordclass.LexEntry
a single entry in the lexicon
classmethod read_entries(items)
Return a list of LexEntry given an iterable of entry strings, eg. the stream for the lines in a file. Blank
entries are ignored
classmethod read_entry(line)
Return a LexEntry given the string corresponding to an entry, or raise an exception if we can’t parse it
class educe.stac.lexicon.wordclass.Lexicon
Bases: educe.stac.lexicon.wordclass.Lexicon
All entries in a wordclass lexicon along with some helpers for convenient access
Parameters
• word_to_subclass (Dict String (Dict String String)) – class to word to subclass nested dict
• subclasses_to_words (Dict String (Set String)) – class to subclass (to words)
dump()
Print a lexicon’s contents to stdout
classmethod read_file(filename)
Read the lexical entries in the file of the given name and return a Lexicon
:: FilePath -> IO Lexicon
educe.stac.oneoff package
Toolkit for one-off corpus-editing operations, things we don’t expect to come up very frequently, like mass renames
of one annotation type to another
Submodules
educe.stac.oneoff.weave module Combining annotations from an augmented ‘source’ document (with likely extra
text) with those in a ‘target’ document. This involves copying missing annotations over and shifting the text spans of
any matching documents
class educe.stac.oneoff.weave.Updates
Bases: educe.stac.oneoff.weave.Updates
Expected updates to the target document.
We expect to see five types of annotation:
1.target annotations for which there exists a source annotation in the equivalent span
2.target annotations for which there is no equivalent source annotation (eg. Resources, Preferences, but also
annotation moves)
3.source annotations for which there is at least one target annotation at the equivalent span (the mirror to
case 1; note that these are not represented in this structure because we don’t need to say much about them)
4.source annotations for which there is no match in the target side
5.source annotations that lie in between the matching bits of text
Parameters
• shift_if_ge (dict(int, int)) – (case 1 and 2) shift points and offsets for characters in the
target document (see shift_spans)
• abnormal_tgt_only ([Annotation]) – (case 2) annotations that only occur in the target
document (weird, found in matches)
• abnormal_src_only ([Annotation]) – (case 4) annotations that only occur in the source
document (weird, found in matches)
• expected_src_only ([Annotation]) – (case 5) annotations that only occur in the source
doc (ok, found in gaps)
map(fun)
Return an Updates in which a function has been applied to all annotations in this one (eg. useful for
previewing), and to all spans
exception educe.stac.oneoff.weave.WeaveException(*args, **kw)
Bases: exceptions.Exception
Unexpected alignment issues between the source and target document
educe.stac.oneoff.weave.check_matches(tgt_doc, matches)
Check that the target document text is indeed a subsequence of the source document text (the source document
is expected to be “augmented” version of the target with new text interspersed throughout)
educe.stac.oneoff.weave.compute_updates(src_doc, tgt_doc, matches)
Return updates that would need to be made on the target document.
Given matches between the source and target document, return span updates along with any source annotations
that do not have an equivalent in the target document (the latter may indicate that resegmentation has taken
place, or that there is some kind of problem)
Parameters
• src_doc (Document) –
• tgt_doc (Document) –
• matches ([Match]) –
Returns updates
Return type Updates
educe.stac.oneoff.weave.shift_char(position, updates)
Given a character position and an updates tuple, return a shifted-over position which reflects the update.
The basic idea is that we have a set of "shift points" and their corresponding offsets. If a character position 'c'
occurs after one of the points, we take the offset of the largest such point and add it to the character.
Our assumption here is that the update always consists in adding more text, so offsets are always positive.
Parameters
• position (int) – initial position
• updates (Updates) –
Returns shifted position
Return type int
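The shift-point idea in miniature (a sketch, not the library's code; here the shift points are a plain dict rather than an Updates tuple):
def shift_char_sketch(position, shift_if_ge):
    # take the offset of the largest shift point at or before position
    points = [p for p in shift_if_ge if position >= p]
    return position + (shift_if_ge[max(points)] if points else 0)

# with shift points {5: 3, 20: 10}:
assert shift_char_sketch(2, {5: 3, 20: 10}) == 2    # before any point
assert shift_char_sketch(7, {5: 3, 20: 10}) == 10   # shifted by 3
assert shift_char_sketch(25, {5: 3, 20: 10}) == 35  # shifted by 10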
educe.stac.oneoff.weave.shift_span(span, updates)
Given a span and an updates tuple, return a Span that is shifted over to reflect the updates
Parameters
• span (Span) –
• updates (Updates) –
Returns span
Return type Span
See also:
shift_char() for details on how this works
educe.stac.oneoff.weave.src_gaps(matches)
Given matches between the source and target document, return the spaces between these matches as source
offset and size (a bit like the matches). Note that we assume that the target document text is a subsequence of
the source document.
educe.stac.oneoff.weave.tgt_gaps(matches)
Given matches between the source and target document, return the spaces between these matches as target offset
and size (a bit like the matches). By rights this should be empty, but you never know
educe.stac.sanity package
Subpackages
educe.stac.sanity.checks package
Submodules
educe.stac.sanity.checks.annotation module STAC sanity-check: annotation oversights
class educe.stac.sanity.checks.annotation.FeatureItem(doc, contexts, anno, attrs, status=’missing’)
Bases: educe.stac.sanity.common.ContextItem
Annotations that are missing some feature(s)
annotations()
html()
educe.stac.sanity.checks.annotation.is_blank_edu(anno)
True if the annotation looks like it may be an unannotated EDU
educe.stac.sanity.checks.annotation.is_cross_dialogue(contexts)
The units connected by this relation (or cdu) do not inhabit the same dialogue.
educe.stac.sanity.checks.annotation.is_fixme(feature_value)
True if a feature value has a fixme value
educe.stac.sanity.checks.annotation.is_review_edu(anno)
True if the annotation has a FIXME tagged type
educe.stac.sanity.checks.annotation.missing_features(doc, anno)
Return set of attribute names for any expected features that may be missing for this annotation
educe.stac.sanity.checks.annotation.run(inputs, k)
Add any annotation omission errors to the current report
educe.stac.sanity.checks.annotation.search_for_fixme_features(inputs, k)
Return a ReportItem for any annotations in the document whose features have a fixme type
educe.stac.sanity.checks.annotation.search_for_missing_rel_feats(inputs, k)
Return ReportItems for any relations that are missing expected features
educe.stac.sanity.checks.annotation.search_for_missing_unit_feats(inputs, k)
Return ReportItems for any EDUs and CDUs that are missing expected features
educe.stac.sanity.checks.annotation.search_for_unexpected_feats(inputs, k)
Return ReportItems for any annotations that have features we were not expecting them to have
educe.stac.sanity.checks.annotation.unexpected_features(_, anno)
Return set of attribute names for any features that we were not expecting to see in the given annotations
educe.stac.sanity.checks.glozz module Sanity checker: low-level Glozz errors
class educe.stac.sanity.checks.glozz.BadIdItem(doc, contexts, anno, expected_id)
Bases: educe.stac.sanity.common.ContextItem
An annotation whose identifier does not match its metadata
text()
class educe.stac.sanity.checks.glozz.DuplicateItem(doc, contexts, anno, others)
Bases: educe.stac.sanity.common.ContextItem
An annotation which shares an id with another
text()
class educe.stac.sanity.checks.glozz.IdMismatch(doc, contexts, unit1, unit2)
Bases: educe.stac.sanity.common.ContextItem
An annotation which seems to have an equivalent in some twin but with the wrong identifier
annotations()
html()
exception educe.stac.sanity.checks.glozz.MissingDocumentException(k)
Bases: exceptions.Exception
A document we are trying to cross check does not have the expected twin
class educe.stac.sanity.checks.glozz.MissingItem(status, doc1, contexts1, unit, doc2, contexts2, approx)
Bases: educe.stac.sanity.report.ReportItem
An annotation which is missing in some document twin (or which looks like it may have been unexpectedly
added)
excess_status = ‘ADDED’
html()
missing_status = ‘DELETED’
status_len = 7
text_span()
Return the span for the annotation in question
class educe.stac.sanity.checks.glozz.OffByOneItem(doc, contexts, unit)
Bases: educe.stac.sanity.common.UnitItem
An annotation whose boundaries might be off by one
html()
html_turn_info(parent, turn)
Given a turn annotation, append a prettified HTML representation of the turn text (highlighting parts of it,
such as the turn number)
class educe.stac.sanity.checks.glozz.OverlapItem(doc, contexts, anno, overlaps)
Bases: educe.stac.sanity.common.ContextItem
An annotation whose span overlaps with that of another
annotations()
html()
educe.stac.sanity.checks.glozz.bad_ids(inputs, k)
Return annotations whose identifiers do not match their metadata
educe.stac.sanity.checks.glozz.check_unit_ids(inputs, key1, key2)
Return annotations that match in the two documents modulo identifiers. This might arise if somebody creates a
duplicate annotation in place and annotates that
educe.stac.sanity.checks.glozz.cross_check_against(inputs, key1, stage='unannotated')
Compare annotations with their equivalents on a twin document in the corpus
educe.stac.sanity.checks.glozz.cross_check_units(inputs, key1, key2, status)
Return tuples for certain corpus[key1] units not present in corpus[key2]
educe.stac.sanity.checks.glozz.duplicate_annotations(inputs, k)
Multiple annotations with the same local_id()
educe.stac.sanity.checks.glozz.filter_matches(unit, other_units)
Return any unit-level annotations in other_units that look like they may be the same as the given annotation
educe.stac.sanity.checks.glozz.is_maybe_off_by_one(text, anno)
True if an annotation has non-whitespace characters on its immediate left/right
educe.stac.sanity.checks.glozz.overlapping(inputs, k, is_overlap)
Return items for annotations that have overlaps
educe.stac.sanity.checks.glozz.overlapping_structs(inputs, k)
Return items for structural annotations that have overlaps
educe.stac.sanity.checks.glozz.run(inputs, k)
Add any glozz errors to the current report
educe.stac.sanity.checks.glozz.search_glozz_off_by_one(inputs, k)
EDUs which have non-whitespace (or boundary) characters either on their right or left
educe.stac.sanity.checks.graph module Sanity checker: fancy graph-based errors
educe.stac.sanity.checks.graph.BACKWARDS_WHITELIST = ['Conditional']
relations that are allowed to go backwards
class educe.stac.sanity.checks.graph.CduOverlapItem(doc, contexts, anno, cdus)
Bases: educe.stac.sanity.common.ContextItem
EDUs that appear in more than one CDU
annotations()
html()
educe.stac.sanity.checks.graph.dialogue_graphs(k, doc, contexts)
Return a dict from dialogue annotations to subgraphs containing at least everything in that dialogue (and perhaps
some connected items)
educe.stac.sanity.checks.graph.horrible_context_kludge(graph, simplified_graph, contexts)
Given a graph and its copy, and given a context dictionary, return a copy of the context dictionary that
corresponds to the simplified graph. Ugh
educe.stac.sanity.checks.graph.is_arrow_inversion(gra, _, rel)
Relation in a graph that goes from textual right to left (may not be a problem)
educe.stac.sanity.checks.graph.is_disconnected(gra, contexts, node)
An EDU is considered disconnected unless:
•it has an incoming link or
•it has an outgoing Conditional link
•it’s at the beginning of a dialogue
In principle we don't need to look at EDUs that are disconnected on the outgoing end because (1) it can
be legitimate for non-dialogue-ending EDUs to not have outgoing links and (2) such information would be
redundant with the incoming anyway
educe.stac.sanity.checks.graph.is_dupe_rel(gra, _, rel)
Relation instance for which there are relation instances between the same source/target DUs (regardless of
direction)
educe.stac.sanity.checks.graph.is_non2sided_rel(gra, _, rel)
Relation instance which does not have exactly a source and target link in the graph
How this can possibly happen is a mystery
educe.stac.sanity.checks.graph.is_puncture(gra, _, rel)
Relation in a graph that traverse a CDU boundary
educe.stac.sanity.checks.graph.is_weird_ack(gra, contexts, rel)
Relation in a graph that represent a question answer pair which either does not start with a question, or which
ends in a question.
Note the detection process is a lot sloppier when one of the endpoints is a CDU. If all EDUs in the CDU are by
the same speaker, we can check as usual; otherwise, all bets are off, so we ignore the relation.
Note: slightly curried to accept contexts as an argument
educe.stac.sanity.checks.graph.is_weird_qap(gra, _, rel)
Relation in a graph that represent a question answer pair which either does not start with a question, or which
ends in a question
educe.stac.sanity.checks.graph.rel_link_item(doc, contexts, gra, rel)
return ReportItem for a graph relation
educe.stac.sanity.checks.graph.rfc_violations(inputs, k, gra)
Repackage right frontier constraint violations in a somewhat friendlier way
educe.stac.sanity.checks.graph.run(inputs, k)
Add any graph errors to the current report
educe.stac.sanity.checks.graph.search_graph_cdu_overlap(inputs, k, gra)
Return a ReportItem for every EDU that appears in more than one CDU
educe.stac.sanity.checks.graph.search_graph_cdus(inputs, k, gra, pred)
Return a ReportItem for any CDU in the graph for which the given predicate is True
educe.stac.sanity.checks.graph.search_graph_edus(inputs, k, gra, pred)
Return a ReportItem for any EDU within the graph for which some predicate is true
educe.stac.sanity.checks.graph.search_graph_relations(inputs, k, gra, pred)
Return a ReportItem for any relation instance within the graph for which some predicate is true
educe.stac.sanity.checks.type_err module STAC sanity-check: type errors
educe.stac.sanity.checks.type_err.has_non_du_member(anno)
True if anno is a relation that points to another relation, or if it’s a CDU that has relation members
educe.stac.sanity.checks.type_err.is_non_du(anno)
True if the annotation is neither an EDU nor a CDU
educe.stac.sanity.checks.type_err.is_non_preference(anno)
True if the annotation is NOT a preference
educe.stac.sanity.checks.type_err.is_non_resource(anno)
True if the annotation is NOT a resource
educe.stac.sanity.checks.type_err.run(inputs, k)
Add any annotation type errors to the current report
educe.stac.sanity.checks.type_err.search_anaphora(inputs, k, pred)
Return a ReportItem for any anaphora annotation for which the given predicate is true of at least one member
(not the annotation itself)
educe.stac.sanity.checks.type_err.search_preferences(inputs, k, pred)
Return a ReportItem for any Preferences schema which has at least one member for which the predicate is True
educe.stac.sanity.checks.type_err.search_resource_groups(inputs, k, pred)
Return a ReportItem for any Several_resources schema which has at least one member for which the predicate
is True
Submodules
educe.stac.sanity.common module Functionality and report types common to sanity checker
class educe.stac.sanity.common.ContextItem(doc, contexts)
Bases: educe.stac.sanity.report.ReportItem
Report item involving EDU contexts
class educe.stac.sanity.common.RelationItem(doc, contexts, rel, naughty)
Bases: educe.stac.sanity.common.ContextItem
Errors which involve Glozz relation annotations
annotations()
html()
class educe.stac.sanity.common.SchemaItem(doc, contexts, schema, naughty)
Bases: educe.stac.sanity.common.ContextItem
Errors which involve Glozz schema annotations
annotations()
html()
class educe.stac.sanity.common.UnitItem(doc, contexts, unit)
Bases: educe.stac.sanity.common.ContextItem
Errors which involve Glozz unit-level annotations
annotations()
html()
educe.stac.sanity.common.anno_code(anno)
Short code providing a clue what the annotation is
educe.stac.sanity.common.is_default(anno)
True if the annotation has type ‘default’
educe.stac.sanity.common.is_glozz_relation(anno)
True if the annotation is a Glozz relation
educe.stac.sanity.common.is_glozz_schema(anno)
True if the annotation is a Glozz schema
educe.stac.sanity.common.is_glozz_unit(anno)
True if the annotation is a Glozz unit
educe.stac.sanity.common.rough_type(anno)
Return either
•“EDU”
•“relation”
•or the annotation type
educe.stac.sanity.common.search_for_glozz_relations(inputs, k, pred, endpoint_is_naughty=None)
Return a ReportItem for any glozz relation that satisfies the given predicate.
If endpoint_is_naughty is supplied, note which of the endpoints can be considered naughty
educe.stac.sanity.common.search_for_glozz_schema(inputs, k, pred, member_is_naughty=None)
Search for schema that satisfy a condition
educe.stac.sanity.common.search_glozz_units(inputs, k, pred)
Return an item for every unit-level annotation in the given document that satisfies some predicate
Return type ReportItem
educe.stac.sanity.common.search_in_glozz_schema(inputs, k, stype, pred, member_is_naughty=None)
Search for schema whose members satisfy a condition. Not to be confused with search_for_glozz_schema
educe.stac.sanity.common.summarise_anno(doc, light=False)
Return a function that returns a short text summary of an annotation
educe.stac.sanity.common.summarise_anno_html(doc, contexts)
Return a function that creates HTML descriptions of an annotation given document and contexts
educe.stac.sanity.html module Helpers for building HTML. Hint: import ET for the ElementTree package too
educe.stac.sanity.html.br(parent)
Create and return an HTML br tag under the parent node
educe.stac.sanity.html.elem(parent, tag, text=None, attrib=None, **kwargs)
Create an HTML element under the given parent node, with some text inside of it
educe.stac.sanity.html.span(parent, text=None, attrib=None, **kwargs)
Create and return an HTML span under the given parent node
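A minimal sketch of how these helpers nest, assuming the usual ElementTree import (the tag names and text are
illustrative only):

from xml.etree import cElementTree as ET
from educe.stac.sanity import html as h

parent = ET.Element('div')
h.span(parent, text='summary ')      # inline text
h.br(parent)                         # line break
h.elem(parent, 'i', text='details')  # arbitrary tag
print(ET.tostring(parent))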
educe.stac.sanity.main module Check the corpus for any consistency problems
class educe.stac.sanity.main.SanityChecker(args)
Bases: object
Sanity checker settings and state
output_is_temp()
True if we are writing to an output directory
run()
Perform sanity checks and write the output
educe.stac.sanity.main.add_element(settings, k, html, descr, mk_path)
Add a link to a report element for a given document, but only if it actually exists
educe.stac.sanity.main.copy_parses(settings)
Copy relevant stanford parser outputs from corpus to report
educe.stac.sanity.main.create_dirname(path)
Create the directory beneath a path if it does not exist
educe.stac.sanity.main.easy_settings(args)
Modify args to reflect user-friendly defaults. (args.doc must be set, everything else expected to be empty)
educe.stac.sanity.main.first_or_none(itrs)
Return the first element or None if there isn’t one
educe.stac.sanity.main.generate_graphs(settings)
Draw SVG graphs for each of the documents in the corpus
educe.stac.sanity.main.issues_descr(report, k)
Return a string characterising a report as containing either warnings or errors (helps the user scan the index to
figure out what needs clicking on)
educe.stac.sanity.main.main()
Sanity checker CLI entry point
educe.stac.sanity.main.run_checks(inputs, k)
Run sanity checks for a given document
educe.stac.sanity.main.sanity_check_order(k)
We want to sort file ids by order of
1. doc
2. subdoc
3. annotator
4. stage (unannotated < unit < discourse)
The important bit here is the idea that we should maybe group unit and discourse for 1-3 together
educe.stac.sanity.main.write_index(settings)
Write the report index
educe.stac.sanity.report module Reporting component of sanity checker
class educe.stac.sanity.report.HtmlReport(anno_files, output_dir)
Bases: object
Representation of a report that we would like to generate. Output will be dumped to a directory
anchor_name(k, header)
HTML anchor name for a report section
css = ‘\n.annoid { font-family: monospace; font-size: small; }\n.feature { font-family: monospace; }\n.snippet { font-style:
delete(k)
Delete the subreport for a given key. This can be used if you want to iterate through lots of different keys,
generating reports incrementally and then deleting them to avoid building up memory.
No-op if we don’t have a sub-report for the given key
flush_subreport(k)
Write and delete (to save memory)
has_errors(k)
If we have error-level reports for the given key
javascript = ‘\nfunction has(xs, x) {\n for (e in xs) {\n if (xs[e] === x) { return true; }\n }\n return false;\n}\n\n\nfunctio
mk_hidden_with_toggle(parent, anchor)
Attach some javascript and html to the given block-level element that turns it into a hide/show toggle block,
starting out in the hidden state
mk_or_get_subreport(k)
Initialise and cache the subreport for a key, including the subreports for each severity level below it
If already cached, retrieve from cache
classmethod mk_output_path(odir, k, extension='')
Generate a path within a parent directory, given a fileid
report(k, err_type, severity, header, items, noisy=False)
Append bullet points for each item to the appropriate section of the appropriate report in progress
set_has_errors(k)
Note that this report has seen at least one error-level severity message
subreport_path(k, extension=’.report.html’)
Report for a single document
write(k, path)
Write the subreport for a given key to the path. No-op if we don’t have a sub-report for the given key
class educe.stac.sanity.report.ReportItem
Bases: object
An individual reportable entry (usually involves a list of annotations), rendered as a block of text in the report
annotations()
The annotations which this report item is about
html()
Return an HTML element corresponding to the visualisation for this item
text()
If you don’t want to create an HTML visualisation for a report item, you can fall back to just generating
lines of text
Return type [string]
class educe.stac.sanity.report.Severity
Bases: enum.Enum
Severity of a sanity check error block
class educe.stac.sanity.report.SimpleReportItem(lines)
Bases: educe.stac.sanity.report.ReportItem
Report item which just consists of lines of text
text()
educe.stac.sanity.report.html_anno_id(parent, anno, bracket=False)
Create and return an HTML span parent node displaying the local annotation id for an annotation item
educe.stac.sanity.report.mk_microphone(report, k, err_type, severity)
Return a convenience function that generates report entries at a fixed error type and severity level
Return type (string, [ReportItem]) -> string
educe.stac.sanity.report.snippet(txt, stop=50)
Truncate a string if it's longer than stop chars
educe.stac.util package
Submodules
educe.stac.util.annotate module Readable text dumps of educe annotations.
The idea here is to dump the text to screen, and use some informal text markup to show annotations over the text.
There’s a limit to how much we can display, but just breaking things up into paragraphs and [segments] seems to go a
long way.
educe.stac.util.annotate.annotate(txt, annotations, inserts=None)
Decorate a text with arbitrary bracket symbols, as a visual guide to the annotations on that text. For example, in
a chat corpus, you might use newlines to indicate turn boundaries and square brackets for segments.
Parameters
• inserts – inserts a dictionary from annotation type to pair of its opening/closing bracket
• FIXME (this needs to become a standard educe utility,) –
• as part of the educe.annotation layer? (maybe) –
educe.stac.util.annotate.annotate_doc(doc, span=None)
Pretty print an educe document and its annotations.
See the lower-level annotate for more details
educe.stac.util.annotate.reflow(text, width=40)
Wrap some text, at the same time ensuring that all original linebreaks are still in place
educe.stac.util.annotate.rough_type(anno)
Simplify STAC annotation types
educe.stac.util.annotate.schema_text(doc, anno)
(recursive) text preview of a schema and its contents. Members are enclosed in square brackets.
educe.stac.util.annotate.show_diff(doc_before, doc_after, span=None)
Display two educe documents (presumably two versions of the “same”) side by side
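For instance, a sketch of eyeballing a document you have already read in (doc is an assumption here, obtained from
a reader; we treat annotate_doc as returning the decorated text):

from educe.annotation import Span
from educe.stac.util.annotate import annotate_doc

print(annotate_doc(doc))                    # the whole document
print(annotate_doc(doc, span=Span(0, 80)))  # just a window of it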
educe.stac.util.args module Command line options
educe.stac.util.args.add_commit_args(parser)
Augment a subcommand argparser with an option to emit a commit message for your version control tracking
educe.stac.util.args.add_usual_input_args(parser, doc_subdoc_required=False, help_suffix=None)
Augment a subcommand argparser with typical input arguments. Sometimes your subcommand may require
slightly different input arguments, in which case, just don't call this function.
Parameters
• doc_subdoc_required (bool) – force user to supply --doc/--subdoc for this subcommand
(note you'll need to add stage/anno yourself)
• help_suffix (string) – appended to --doc/--subdoc help strings
educe.stac.util.args.add_usual_output_args(parser, default_overwrite=False)
Augment a subcommand argparser with typical output arguments. Sometimes your subcommand may require
slightly different output arguments, in which case, just don't call this function.
educe.stac.util.args.anno_id(string)
Split AUTHOR_DATE string into tuple, complaining if we don’t have such a string. Used for argparse
educe.stac.util.args.announce_output_dir(output_dir)
Tell the user where we saved the output
educe.stac.util.args.check_easy_settings(args)
Modify args to reflect user-friendly defaults. (args.doc must be set, everything else expected to be empty)
educe.stac.util.args.comma_span(string)
Split a comma delimited pair of integers into an educe span
educe.stac.util.args.get_output_dir(args, default_overwrite=False)
Return the output dir specified or inferred from command line args.
We try the following in order:
1. If --output is given explicitly, we'll just use/create that
2. If default_overwrite is True, or the user specifies --overwrite on the command line (provided the command
supports it), the output directory may well be the original corpus dir (gulp! Better use version control!)
3. OK, just make a temporary directory. Later on, you'll probably want to call announce_output_dir.
educe.stac.util.args.read_corpus(args, preselected=None, verbose=True)
Read the section of the corpus specified in the command line arguments.
educe.stac.util.args.read_corpus_with_unannotated(args, verbose=True)
Read the section of the corpus specified in the command line arguments.
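A sketch of how these helpers might be wired into a hypothetical subcommand:

import argparse
from educe.stac.util.args import (add_usual_input_args, add_usual_output_args,
                                  announce_output_dir, get_output_dir,
                                  read_corpus)

parser = argparse.ArgumentParser(description='made-up subcommand')
add_usual_input_args(parser)
add_usual_output_args(parser)
args = parser.parse_args()

corpus = read_corpus(args)   # respects the --doc/--subdoc/etc filters
odir = get_output_dir(args)
# ... write results into odir here ...
announce_output_dir(odir)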
educe.stac.util.csv module STAC project CSV files
STAC uses CSV files for some intermediary steps when initially preparing data for annotation. We don’t expect these
to be useful outside of that particular context
class educe.stac.util.csv.SparseDictReader(f, *args, **kwds)
Bases: csv.DictReader
A CSV reader which avoids putting null values in dictionaries (note that this is basically a copy of DictReader)
next()
class educe.stac.util.csv.Turn
Bases: educe.stac.util.csv.Turn
High-level representation of a turn as used in the STAC internal CSV files during intake
to_dict()
csv representation of this turn
class educe.stac.util.csv.Utf8DictReader(f, **kwds)
A CSV reader which assumes strings are encoded in UTF-8.
next()
class educe.stac.util.csv.Utf8DictWriter(f, headers, dialect=<class csv.excel>, **kwds)
A CSV writer which will write rows to CSV file “f”, which is encoded in UTF-8.
writeheader()
writerow(row)
writerows(rows)
educe.stac.util.csv.mk_csv_reader(infile)
Assumes UTF-8 encoded files. Reads into dictionaries with Unicode strings.
See Utf8DictReader if you just want a generic UTF-8 dict reader, ie. not using the stac dialect
educe.stac.util.csv.mk_csv_writer(ofile)
Writes dictionaries. See CSV_HEADERS for details
educe.stac.util.csv.mk_plain_csv_writer(outfile)
Just writes records in stac dialect
educe.stac.util.doc module Utilities for large-scale changes to educe documents, for example, moving a chunk of
text from one document to another
exception educe.stac.util.doc.StacDocException(msg)
Bases: exceptions.Exception
An exception that arises from trying to manipulate a stac document (typically moving things around, etc)
educe.stac.util.doc.compute_renames(avoid, incoming)
Given two sets of documents (i.e. corpora), return a dictionary which would allow us to rename ids in incoming
so that they do not overlap with those in avoid.
Return type author -> date -> date
educe.stac.util.doc.evil_set_id(anno, author, date)
This is a bit evil as it’s using undocumented functionality from the educe.annotation.Standoff object
educe.stac.util.doc.evil_set_text(doc, text)
This is a bit evil as it’s using undocumented functionality from the educe.annotation.Document object
educe.stac.util.doc.move_portion(renames, src_doc, tgt_doc, src_split, tgt_split=-1)
Return a copy of the documents such that part of the source document has been moved into the target document.
This can capture a couple of patterns:
•reshuffling the boundary between the target and source document (if tgt | src1 src2 ==> tgt src1 | src2)
(tgt_split = -1)
•prepending the source document to the target (src | tgt ==> src tgt; src_split=-1; tgt_split=0)
•inserting the whole source document into the other (tgt1 tgt2 + src ==> tgt1 src tgt2; src_split=-1)
There’s a bit of potential trickiness here:
•we’d like to preserve the property that text has a single starting and ending space (no real reason just seems
safer that way)
•if we’re splicing documents together particularly at their respective ends, there’s a strong off-by-one risk
because some annotations span the whole text (whitespace and all), particularly dialogues
educe.stac.util.doc.narrow_to_span(doc, span)
Return a deep copy of a document with only the text and annotations that are within the given span.
educe.stac.util.doc.rename_ids(renames, doc)
Return a deep copy of a document, with ids reassigned according to the renames dictionary
educe.stac.util.doc.retarget(doc, old_id, new_anno)
Replace all links to the old (unit-level) annotation with links to the new one.
We refer to the old annotation by id, but the new annotation must be passed in as an object. It must also be either
an EDU or a CDU.
Return True if we replaced anything
educe.stac.util.doc.shift_annotations(doc, offset, point=None)
Return a deep copy of a document such that all annotations have been shifted by an offset.
If shifting right, we pad the document with whitespace to act as filler. If shifting left, we cut the text
If a shift point is specified and the offset is positive, we only shift annotations that are to the right of the point.
Likewise if the offset is negative, we only shift those that are to the left of the point.
educe.stac.util.doc.split_doc(doc, middle)
Given a split point, break a document into two pieces. If the split point is None, we take the whole document
(this is slightly different from having -1 as a split point)
Raise an exception if there are any annotations that span the point.
educe.stac.util.doc.strip_fixme(act)
Remove the fixme string from a dialogue act annotation. These were automatically inserted when there is an
annotation to review. We shouldn’t see them for any use cases like feature extraction though.
See educe.stac.dialogue_act which returns the set of dialogue acts for each annotation (by rights should be
singleton set, but there used to be more than one, something we want to phase out?)
educe.stac.util.doc.unannotated_key(key)
Given a corpus key, return a copy of that equivalent key in the unannotated portion of the corpus (the parser
outputs objects that are based in unannotated)
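For instance, a sketch of cutting a document down to a span of interest (doc is assumed to have been read in
already; the offsets are made up):

from educe.annotation import Span
from educe.stac.util.doc import narrow_to_span, shift_annotations

small = narrow_to_span(doc, Span(100, 500))  # keep only that window
moved = shift_annotations(doc, 10)           # pad left, shift everything right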
educe.stac.util.glozz module STAC Glozz conventions
class educe.stac.util.glozz.PseudoTimestamper
Bases: object
Generator for the fake timestamps used as a Glozz IDs
next()
Fresh timestamp
class educe.stac.util.glozz.TimestampCache
Bases: object
Generates and stores a unique timestamp entry for each key. You can use any hashable key, for example, a span,
or a turn id.
get(tid)
Return a timestamp for this turn id, either generating and caching (if unseen) or fetching from the cache
reset()
Empty the cache (but maintain the timestamper state, so that different documents get different timestamps;
the difference in timestamps is not mission-critical but potentially nice)
educe.stac.util.glozz.anno_author(anno)
Annotation author
educe.stac.util.glozz.anno_date(anno)
Annotation creation date as an int
educe.stac.util.glozz.anno_id_from_tuple(author_date)
Glozz string representation of authors and dates (AUTHOR_DATE)
educe.stac.util.glozz.anno_id_to_tuple(string)
Read a Glozz string representation of authors and dates into a pair (date represented as an int, ms since 1970?)
educe.stac.util.glozz.get_turn(tid, doc)
Return the turn annotation with the desired ID
educe.stac.util.glozz.is_dialogue(anno)
If a Glozz annotation is a STAC dialogue.
educe.stac.util.glozz.set_anno_author(anno, author)
Replace the annotation author with the given author
educe.stac.util.glozz.set_anno_date(anno, date)
Replace the annotation creation date with the given integer
educe.stac.util.output module Help writing out corpus files
educe.stac.util.output.mk_parent_dirs(filename)
Given a filepath that we want to write, create its parent directory as needed.
educe.stac.util.output.output_path_stub(odir, k)
Given an output directory and an educe corpus key, return a ‘stub’ output path in that directory. This is dirname
and basename only; you probably want to tack a suffix onto it.
Example: given something like “/tmp/foo” and a key like {author:”bob”, stage:units, doc:”pilot03”, subdoc:”07”} you might get something like /tmp/foo/pilot03/units/pilot03_07
educe.stac.util.output.save_document(output_dir, k, doc)
Save a document as a Glozz .ac/.aa pair
educe.stac.util.output.write_dot_graph(doc_key, odir, dot_graph, part=None, run_graphviz=True)
Write a dot graph and possibly run graphviz on it
educe.stac.util.prettifyxml module Function to “prettify” XML: courtesy of
http://www.doughellmann.com/PyMOTW/xml/etree/ElementTree/create.html
educe.stac.util.prettifyxml.prettify(elem, indent='')
Return a pretty-printed XML string for the Element.
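A small usage sketch, building an ElementTree element on the spot:

from xml.etree import cElementTree as ET
from educe.stac.util.prettifyxml import prettify

root = ET.Element('root')
ET.SubElement(root, 'child', name='example')
print(prettify(root, indent='  '))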
educe.stac.util.showscores module
class educe.stac.util.showscores.Score(reference, test)
Precision/recall type scores for a given data set.
This class is really just about holding on to sets of things. The actual maths is handled by NLTK.
f_measure()
missing()
precision()
recall()
shared()
spurious()
educe.stac.util.showscores.banner(t)
educe.stac.util.showscores.show_multi(k, score)
educe.stac.util.showscores.show_pair(k, score)
Submodules
educe.stac.annotation module
STAC annotation conventions (re-exported in educe.stac)
STAC/Glozz annotations can be a bit confusing, for two reasons: first, Glozz objects are used to annotate
very different things; and second, annotations are done on different stages
Stage 1 (units)

    Glozz        Uses
    units        doc structure, EDUs, resources, preferences
    relations    coreference
    schemas      composite resources

Stage 2 (discourse)

    Glozz        Uses
    units        doc structure, EDUs
    relations    relation instances, coreference
    schemas      CDUs
Units
There is a typology of unit types worth noting:
• doc structure : type eg. Dialogue, Turn, paragraph
• resources : subspans of segments (type Resource)
• preferences : subspans of segments (type Preference)
• EDUs : spans of text associated with a dialogue act (eg. type Offer, Accept) (during discourse stage, these are
just type Segment)
Relations
• coreference : (type Anaphora)
• relation instances : links between EDUs, annotated with relation label (eg. type Elaboration, type Contrast,
etc). These can be further divided in subordinating or coordination relation instances according to their label
Schemas
• composite resources : boolean combinations of resources (eg. “sheep or ore”)
• CDUs: type Complex_discourse_unit (discourse stage)
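To make the typology concrete, here is a sketch that buckets the annotations of a document using the is_*
predicates documented below (doc is assumed to have been read in already):

from educe.stac.annotation import is_cdu, is_edu, is_relation_instance

# units/relations/schemas are the usual educe.annotation.Document collections
edus = [u for u in doc.units if is_edu(u)]
cdus = [s for s in doc.schemas if is_cdu(s)]
rels = [r for r in doc.relations if is_relation_instance(r)]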
class educe.stac.annotation.PartialUnit
Bases: educe.stac.annotation.PartialUnit
Partially instantiated unit, for use when you want to programmatically insert annotations into a document
A partially instantiated unit does not have any metadata (creation date, etc.), as these will be derived automatically
educe.stac.annotation.RENAMES = {‘Strategic_comment’: ‘Other’, ‘Segment’: ‘Other’}
Dialogue acts that should be remapped to a different label
educe.stac.annotation.addressees(anno)
The set of people spoken to during an edu annotation
Annotation -> Set String
Note: this returns None if the value is the default ‘Please choose...’; but otherwise, it preserves values like ‘All’
or ‘?’.
educe.stac.annotation.cleanup_comments(anno)
Strip out default comment text from features. This placeholder text was inserted as a UI aid during editing in
Glozz, but isn’t actually the comment itself
educe.stac.annotation.create_units(_, doc, author, partial_units)
Return a collection of instantiated new unit objects.
Parameters partial_units (iterable of PartialUnit) –
educe.stac.annotation.dialogue_act(anno)
Set of dialogue act (aka speech act) annotations for a Unit, taking into consideration STAC conventions like
collapsing Strategic_comment into Other
By rights should be singleton set, but there used to be more than one, something we want to phase out?
educe.stac.annotation.is_cdu(annotation)
See CDUs typology above
educe.stac.annotation.is_coordinating(annotation)
See Relation typology above
educe.stac.annotation.is_dialogue(annotation)
See Unit typology above
educe.stac.annotation.is_dialogue_act(annotation)
Deprecated in favour of is_edu
educe.stac.annotation.is_edu(annotation)
See Unit typology above
educe.stac.annotation.is_preference(annotation)
See Unit typology above
educe.stac.annotation.is_relation_instance(annotation)
See Relation typology above
educe.stac.annotation.is_resource(annotation)
See Unit typology above
educe.stac.annotation.is_structure(annotation)
Is one of the document-structure annotations, something an annotator is expected not to edit, create, delete
educe.stac.annotation.is_subordinating(annotation)
See Relation typology above
educe.stac.annotation.is_turn(annotation)
See Unit typology above
educe.stac.annotation.is_turn_star(annotation)
See Unit typology above
educe.stac.annotation.relation_labels(anno)
Set of relation labels (eg. Elaboration, Explanation), taking into consideration any applicable STAC-isms
educe.stac.annotation.set_addressees(anno, addr)
Set the addresee list for an annotation. If the value None is provided, the addressee list is deleted (if present)
(Iterable String, Annotation) -> IO ()
educe.stac.annotation.speaker(anno)
Return the speaker associated with a turn annotation. NB: crashes if there is none
educe.stac.annotation.split_turn_text(text)
STAC turn texts are prefixed with a turn number and speaker to help the annotators (eg. “379: Bob: I think it’s
your go, Alice”).
Given the text for a turn, split the string into a prefix containing this turn/speaker information (eg. “379: Bob:
”), and a body containing the turn text itself (eg. “I think it’s your go, Alice”).
Mind your offsets! They’re based on the whole turn string.
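A concrete sketch reusing the docstring's own example (we assume the function returns a (prefix, body) pair, as
the description suggests):

from educe.stac.annotation import split_turn_text

prefix, body = split_turn_text(u"379: Bob: I think it's your go, Alice")
# prefix == u"379: Bob: "
# body   == u"I think it's your go, Alice"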
educe.stac.annotation.split_type(anno)
An object’s type as a (frozen)set of items. You’re probably looking for educe.stac.dialogue_act instead.
educe.stac.annotation.turn_id(anno)
Return as an integer the turn number associated with a turn annotation (or None if this information is missing).
educe.stac.annotation.twin(corpus, anno, stage=’units’)
Given an annotation in a corpus, retrieve the equivalent annotation (by local identifier) from a different stage
of the corpus. Return this “twin” annotation or None if it is not found
Note that the annotation’s origin must be set
The typical use of this would be if you have an EDU in the ‘discourse’ stage and need to get its ‘units’ stage
equivalent to have its dialogue act.
Parameters twin_doc – unit-level document to fish twin from (None if you want educe to search
for it in the corpus; NB: corpus can be None if you supply this)
educe.stac.annotation.twin_from(doc, anno)
Given a document and an annotation, return the first annotation in the document with a matching local identifier.
educe.stac.context module
The dialogue and turn surrounding an EDU along with some convenient information about it
class educe.stac.context.Context(turn, tstar, turn_edus, dialogue, dialogue_turns, doc_turns, tokens=None)
Bases: object
Representation of the surrounding context for an EDU, basically the relevant enclosing annotations: turns,
dialogues. The idea is potentially extend this to a somewhat richer notion of context, including things like a
sentence count, etc.
Parameters
• turn – the turn surrounding this EDU
• tstar – the tstar turn surrounding this EDU (a tstar turn is a sort of virtual turn made by
merging consecutive turns in a dialogue that have the same speaker)
• turn_edus – the EDUs in this turn
• dialogue – the dialogue surrounding this EDU
• dialogue_turns – all the turns in the dialogue surrounding this EDU (non-empty, sorted
by first-widest span)
• doc_turns – all the turns in the document
• tokens – (may not be present): tokens contained within this EDU
classmethod for_edus(doc, postags=None)
Return a dictionary of context objects for each EDU in the document
Returns contexts – a dictionary with a context for each EDU in the document
Return type dict(educe.glozz.Unit, Context)
speaker()
the speaker associated with the turn surrounding an edu
educe.stac.context.containing(span, annos)
Given an iterable of standoff, pick just those that enclose/contain the given span (ie. are bigger and around)
educe.stac.context.edus_in_span(doc, span)
Given an document and a text span return the EDUs the document contains in that span
educe.stac.context.enclosed(span, annos)
Given an iterable of standoff, pick just those that are enclosed by the given span (ie. are smaller and within)
educe.stac.context.merge_turn_stars(doc)
Return a copy of the document in which consecutive turns by the same speaker have been merged.
Merging is done by taking the first turn in grouping of consecutive speaker turns, and stretching its span over all
the subsequent turns.
Additionally turn prefix text (containing turn numbers and speakers) from the removed turns are stripped out.
educe.stac.context.sorted_first_widest(nodes)
Given a list of nodes, return the nodes ordered by their starting point, and in case of a tie their inverse width (ie.
widest first).
educe.stac.context.speakers(contexts, anno)
Return a list of speakers of an EDU or CDU (in the textual order of the EDUs).
educe.stac.context.turns_in_span(doc, span)
Given a document and a text span, return the turns that the document contains in that span
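A sketch of typical use of this module (doc is assumed to be a document you have already read in):

from educe.stac.context import Context

contexts = Context.for_edus(doc)
for edu, ctx in contexts.items():
    print('%s\t%s' % (edu.local_id(), ctx.speaker()))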
educe.stac.corenlp module
STAC conventions for running the Stanford CoreNLP pipeline, saving the results, and reading them.
The most useful functions here are
• run_pipeline
• read_results
educe.stac.corenlp.from_corenlp_output_filename(f )
Return a tuple of FileId and turn id.
This is entirely by the convention we established when calling corenlp, of course
educe.stac.corenlp.parsed_file_name(k, dir_name)
Given an educe.corpus.FileId and directory, return the file path within that directory that corresponds to the
corenlp output
educe.stac.corenlp.read_corenlp_result(doc, corenlp_doc, tid=None)
Read CoreNLP’s output for a document.
Parameters
• doc (educe Document (?)) – The original document (?)
• corenlp_doc (educe.external.stanford_xml_reader.PreprocessingSource) – Object that
contains all annotations for the document
• tid (turn id) – Turn id (?)
Returns corenlp_doc – A CoreNlpDocument containing all information.
Return type CoreNlpDocument
educe.stac.corenlp.read_results(corpus, dir_name)
Read stored parser output from a directory, and convert them to educe.annotation.Standoff objects.
Return a dictionary mapping ‘FileId’s to sets of tokens.
educe.stac.corenlp.run_pipeline(corpus, outdir, corenlp_dir, split=False)
Run the standard corenlp pipeline on all the (unannotated) documents in the corpus and save the results in the
specified directory.
If split=True, we output one file per turn, an experimental mode to account for switching between multiple
speakers. We don’t have all the infrastructure to read these back in (it should just be a matter of some filename manipulation though) and hope to flesh this out later. We also intend to tweak the notion of splitting
by aggregating consecutive turns with the same speaker, which may somewhat mitigate the loss of coreference
information.
educe.stac.corenlp.turn_id_text(doc)
Return a list of (turn ids, text) tuples in span order (no speaker)
educe.stac.corpus module
Corpus layout conventions (re-exported by educe.stac)
class educe.stac.corpus.LiveInputReader(corpusdir)
Bases: educe.stac.corpus.Reader
Reader for unannotated ‘live’ data that we want to parse.
The data is assumed to be in a directory with one aa/ac file pair.
There is no notion of subdocument (subdoc = None) and the stage is ‘unannotated’
files()
class educe.stac.corpus.Reader(corpusdir)
Bases: educe.corpus.Reader
See educe.corpus.Reader for details
files()
slurp_subcorpus(cfiles, verbose=False)
educe.stac.corpus.id_to_path(k)
Given a fleshed out FileId (none of the fields are None), return a filepath for it following STAC conventions.
You will likely want to add your own filename extensions to this path
educe.stac.corpus.is_metal(fileid)
If the annotator is one of the distinguished standard annotators
educe.stac.corpus.twin_key(key, stage)
Given an annotation key, return a copy shifted over to a different stage.
Note that copying from unannotated to another stage, you will need to set the annotator
educe.stac.corpus.write_annotation_file(anno_filename, doc)
Write a GlozzDocument to XML in the given path
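Tying these conventions together, a sketch of reading a slice of the STAC corpus (the corpus path is made up):

import educe.stac

reader = educe.stac.Reader('data/stac-corpus')
anno_files = reader.filter(reader.files(),
                           lambda k: k.stage == 'discourse')
corpus = reader.slurp(anno_files, verbose=True)
for key, doc in corpus.items():
    print('%s has %d unit annotations' % (key, len(doc.units)))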
educe.stac.fake_graph module
Fake graphs for testing STAC algorithms
Specification for mini-language
Source string is parsed line by line; the data type depends on the first character. Uppercase letters are speakers,
lowercase letters are units. EDU names are arranged following alphabetical order (does NOT apply to CDUs).
Please arrange the lines in that order:
• # : speaker line
  # Aabce Bdg Cfh
• any lowercase : CDU line (top-level last)
  y(eg) x(wyz)
• S or C : relation line
  Sabd bf ceCh
• anything else : skip as comment
class educe.stac.fake_graph.LightGraph(src)
Structure holding only relevant information
Unit keys (sortable, hashable) must correspond to reading order. CDUs can be placed in any position wrt their
components
get_doc()
get_edge(source, target)
Return an educe.annotation.Relation for the given LightGraph names for source and target
get_node(name)
Return an educe.annotation.Unit or Schema for the given LightGraph name
educe.stac.fusion module
Somewhat higher level representation of STAC documents than the usual Glozz layer.
Note that this is a relatively recent addition to Educe. Up to the time of this writing (2015-03), we had two options for
dealing with STAC:
• manually manipulating glozz objects via educe.annotation
• dealing with some high-level but not particularly helpful hypergraph objects
We try to provide an intermediary in this layer by merging information from several layers in one place.
A typical example might be to print a listing of
(edu1_id, edu2_id, edu1_dialogue_act, edu2_dialogue_act, relation_label)
This has always been a bit awkward when dealing with Glozz, because there are separate annotations in different
Glozz documents: the dialogue acts in the ‘units’ stage, and the linked units in the ‘discourse’ stage. Combining
these streams has always involved a certain amount of manual lookup, which we hope to avoid with this fusion layer.
At the time of this writing, this will have a bit of emphasis on feature-extraction
class educe.stac.fusion.Dialogue(anno, edus, relations)
Bases: object
STAC Dialogue
Note that input EDUs should be sorted by span
edu_pairs()
Return all EDU pairs within this dialogue.
NB: this is a generator
class educe.stac.fusion.EDU(doc, discourse_anno, unit_anno)
Bases: educe.annotation.Unit
STAC EDU
A STAC EDU merges information from the unit and discourse annotation stages so that you can ignore the
distinction between the two annotation stages.
It also tries to be usable as a drop-in substitute for both annotations and contexts
dialogue_act()
The (normalised) speech act associated with this EDU (None if unknown)
fleshout(context)
second phase of EDU initialisation; fill out contextual info
identifier()
Some kind of identifier string that uniquely identifies the EDU in the corpus. Because these are higher
level annotations than in the Glozz layer we will use the ‘local’ identifier, which should be the same across
stages
is_left_padding()
If this is a virtual EDU used in machine learning tasks
speaker()
the speaker associated with the turn surrounding an edu
subgrouping()
What abstract subgrouping the EDU is in (here: turn stars)
See also:
educe.stac.context.merge_turn_stars()
Returns subgrouping
Return type string
text()
The text for just this EDU
educe.stac.fusion.ROOT = ‘ROOT’
distinguished fake EDU id for machine learning applications
educe.stac.fusion.fuse_edus(discourse_doc, unit_doc, postags)
Return a copy of the discourse level doc, merging info from both the discourse and units stage.
All EDUs will be converted to higher level EDUs.
Notes
•The discourse stage is primary in that we work by going over what EDUs we find in the discourse stage
and trying to enhance them with information we find on their units-level equivalents. Sometimes (rarely
but it happens) annotations can go out of synch. EDUs missing on the units stage will be silently ignored
(we try to make do without them). EDUs that were introduced on the units stage but not percolated to
discourse will also be ignored.
•We rely on annotation ids to match EDUs from both stages; it’s up to you to ensure that the annotations
are really in synch.
•This does not constitute a full merge of the documents. For a full merge, you would have to bring over
other annotations such as Resources, Preference, Anaphor, Several_resources, taking care all the while to
ensure there are no timestamp clashes with pre-existing annotations (it’s unlikely but best be on the safe
side if you ever find yourself with automatically generated annotations, where all bets are off time-stamp
wise).
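A sketch of how fuse_edus might be called, assuming you have already read in the two stages of the same
(sub)document (discourse_doc, unit_doc) and its POS tags (postags, eg. via educe.stac.postag.read_tags):

from educe.stac.fusion import EDU, fuse_edus

doc = fuse_edus(discourse_doc, unit_doc, postags)
for anno in doc.units:
    if isinstance(anno, EDU):  # doc.units may contain other annotations too
        print('%s\t%s' % (anno.identifier(), anno.dialogue_act()))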
educe.stac.graph module
STAC-specific conventions related to graphs.
class educe.stac.graph.DotGraph(anno_graph)
Bases: educe.graph.DotGraph
A dot representation of this graph for visualisation. The to_string() method is most likely to be of interest here
class educe.stac.graph.EnclosureDotGraph(core)
Bases: educe.graph.EnclosureDotGraph
Conventions for visualising STAC enclosure graphs
class educe.stac.graph.EnclosureGraph(doc, postags=None)
Bases: educe.graph.EnclosureGraph
An enclosure graph based on STAC conventions
class educe.stac.graph.Graph
Bases: educe.graph.Graph
cdu_head(cdu, sloppy=False)
Given a CDU, return its head, defined here as the only DU that is not pointed to by any other member of
this CDU.
This is meant to approximate the description in Muller 2012 (/Constrained decoding for text-level discourse
parsing/):
1. the highest DU in its subgraph in terms of subordinate relations
2. in case of a tie in #1, the leftmost in terms of coordinate relations
Corner cases:
•Return None if the CDU has no members (annotation error)
•If the CDU contains more than one head (annotation error) and if sloppy is True, return the textually
leftmost one; otherwise, raise a MultiheadedCduException
first_outermost_dus()
Return discourse units in this graph, ordered by their starting point, and in case of a tie their inverse width
(ie. widest first)
classmethod from_doc(corpus, doc_key, pred=<function <lambda>>)
is_cdu(x)
is_edu(x)
is_relation(x)
recursive_cdu_heads(sloppy=False)
A dictionary mapping each CDU to its recursive CDU head (see cdu_head)
sorted_first_outermost(annos)
Given a list of nodes, return the nodes ordered by their starting point, and in case of a tie their inverse
width (ie. widest first).
strip_cdus(sloppy=False, mode=’head’)
Delete all CDUs in this graph. Links involving a CDU will point to/from the elements of this CDU.
Non-head modes may add new edges to the graph.
Parameters
• sloppy (boolean, default=False) – See cdu_head.
• mode (string, default=’head’) – Strategy for replacing edges involving CDUs. head will
relocate the edge on the recursive head of the CDU (see recursive_cdu_heads). broadcast
will distribute the edge over all EDUs belonging to the CDU. A copy of the edge will be
created for each of them. If the edge’s source and target are both distributed, a new copy
will be created for each combination of EDUs. custom (or any other string) will distribute
or relocate on the head depending on the relation label.
without_cdus(sloppy=False, mode=’head’)
Return a deep copy of this graph with all CDUs removed. Links involving these CDUs will point instead
from/to their deep heads
We’ll probably deprecate this function, since you could just as easily call deepcopy yourself
exception educe.stac.graph.MultiheadedCduException(cdu, *args, **kw)
Bases: exceptions.Exception
class educe.stac.graph.WrappedToken(token)
Bases: educe.annotation.Annotation
Thin wrapper around POS tagged token which adds a local_id field for use by the EnclosureGraph mechanism
educe.stac.postag module
STAC conventions for running a pos tagger, saving the results, and reading them.
educe.stac.postag.extract_turns(doc)
Return a string representation of the document’s turn text for use by a tagger
educe.stac.postag.read_tags(corpus, dir)
Read stored POS tagger output from a directory, and convert them to educe.annotation.Standoff objects.
Return a dictionary mapping ‘FileId’s to sets of tokens.
educe.stac.postag.run_tagger(corpus, outdir, tagger_jar)
Run the ark-tweet-tagger on all the (unannotated) documents in the corpus and save the results in the specified
directory
educe.stac.postag.sorted_by_span(xs)
Annotations sorted by text span
educe.stac.postag.tagger_cmd(tagger_jar, txt_file)
educe.stac.postag.tagger_file_name(k, dir)
Given an educe.corpus.FileId and directory, return the file path within that directory that corresponds to the
tagger output
educe.stac.rfc module
Right frontier constraint and its variants
class educe.stac.rfc.BasicRfc(graph)
Bases: object
The vanilla right frontier constraint
1. X is textually last => RF(X)

2.    Y
      | (sub)
      v
      X

   RF(Y) => RF(X)

3. X: +-----+
      |  Y  |
      +-----+

   RF(Y) => RF(X)
frontier()
Return the list of nodes on the right frontier of the whole graph
violations()
Return a list of relation instance names, corresponding to the RF violations for the given graph.
You’ll need a stac graph object to interpret these names with.
Return type [string]
class educe.stac.rfc.ThreadedRfc(graph)
Bases: educe.stac.rfc.BasicRfc
Same as BasicRfc except for point 1:
1. X is the textually last utterance of any speaker => RF(X)
educe.stac.rfc.powerset(iterable)
powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)
educe.stac.rfc.speakers(contexts, anno)
Return the speakers for a given annotation unit
Takes: contexts (Context dict), Annotation
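A sketch of checking a document graph against the constraint (corpus and doc_key are assumed to come from a
reader, as usual):

import educe.stac.graph as stacgraph
from educe.stac.rfc import BasicRfc, ThreadedRfc

graph = stacgraph.Graph.from_doc(corpus, doc_key)
print(BasicRfc(graph).violations())     # relation instances breaking the RFC
print(ThreadedRfc(graph).violations())  # per-speaker variant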
4.4 Submodules
4.5 educe.annotation module
Low-level representation of corpus annotations, following somewhat faithfully the Glozz model for annotations.
This is low-level in the sense that we make little attempt to interpret the information stored in these annotations. For
example, a relation might claim to link two units of id unit42 and unit43. This being a low-level representation, we
simply note the fact. A higher-level representation might attempt to actually make the corresponding units available
to you, or perhaps provide some sort of graph representation of them
class educe.annotation.Annotation(anno_id, span, atype, features, metadata=None, origin=None)
Bases: educe.annotation.Standoff
Any sort of annotation. Annotations tend to have
•span: some sort of location (what they are annotating)
•type: some key label (we call a type)
•features: an attribute to value dictionary
identifier()
String representation of an identifier that should be unique to this corpus at least.
If the unit has an origin (see “FileId”), we use the
•document
•subdocument
•stage
•(but not the annotator!)
•and the id from the XML file
If we don’t have an origin we fall back to just the id provided by the XML file
See also position as potentially a safer alternative to this (and what we mean by safer)
local_id()
An identifier which is sufficient to pick out this annotation within a single annotation file
class educe.annotation.Document(units, relations, schemas, text)
Bases: educe.annotation.Standoff
A single (sub)-document.
This can be seen as collections of unit, relation, and schema annotations
annotations()
All annotations associated with this document
fleshout(origin)
See set_origin
global_id(local_id)
String representation of an identifier that should be unique to this corpus at least.
set_origin(origin)
If you have more than one document, it's a good idea to set its origin to a file ID so that you can more
reliably tell the annotations apart.
text(span=None)
Return the text associated with these annotations (or None), optionally limited to a span
class educe.annotation.RelSpan(t1, t2)
Bases: object
Which two units a relation connects.
t1 = None
string: id of an annotation
t2 = None
string: id of an annotation
class educe.annotation.Relation(rel_id, span, rtype, features, metadata=None)
Bases: educe.annotation.Annotation
An annotation between two annotations. Relations are directed; see RelSpan for details
Use the source and target fields to grab these respective annotations, but note that they are only instantiated after
fleshout is called (corpus slurping normally fleshes out documents and thus their relations)
fleshout(objects)
Given a dictionary mapping ids to annotation objects, set this relation’s source and target fields.
source = None
source annotation; will be defined by fleshout
target = None
target annotation; will be defined by fleshout
class educe.annotation.Schema(rel_id, units, relations, schemas, stype, features, metadata=None)
Bases: educe.annotation.Annotation
An annotation between a set of annotations
Use the members field to grab the annotations themselves. But note that it is only created when fleshout is called.
fleshout(objects)
Given a dictionary mapping ids to annotation objects, set this schema’s members field to point to the
appropriate objects
terminals()
All unit-level annotations contained in this schema or (recursively) in schemas contained herein
class educe.annotation.Span(start, end)
Bases: object
What portion of text an annotation corresponds to. Assumed to be in terms of character offsets
The way we interpret spans in educe amounts to how Python interprets array slice indices.
One way to understand them is to think of offsets as sitting in between individual characters
 h   o   w   d   y
0   1   2   3   4   5
So (0,5) covers the whole word above, and (1,2) picks out the letter “o”
absolute(other)
Assuming this span is relative to some other span, return a suitably shifted “absolute” copy.
encloses(other)
Return True if this span includes the argument
Note that x.encloses(x) == True
Corner case: x.encloses(None) == False
See also educe.graph.EnclosureGraph if you might be repeating these checks
length()
Return the length of this span
merge(other)
Return a span that stretches from the beginning to the end of the two spans. Whereas overlaps can be
thought of as returning the intersection of two spans, this can be thought of as returning the union.
classmethod merge_all(spans)
Return a span that stretches from the beginning to the end of all the spans in the list
overlaps(other, inclusive=False)
Return the overlapping region if two spans have regions in common, or else None.
Span(5, 10).overlaps(Span(8, 12)) == Span(8, 10)
Span(5, 10).overlaps(Span(11, 12)) == None
If inclusive == True, spans with touching edges are considered to overlap
Span(5, 10).overlaps(Span(10, 12)) == None
Span(5, 10).overlaps(Span(10, 12), inclusive=True) == Span(10, 10)
relative(other)
Assuming this span is absolute, return a suitably shifted copy that is relative to the other span.
shift(offset)
Return a copy of this span, shifted to the right (if offset is positive) or left (if negative).
It may be a bit more convenient to use ‘absolute/relative’ if you’re trying to work with spans that are within
other spans.
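A few worked examples of span arithmetic, following the slice-like interpretation above:

from educe.annotation import Span

s = Span(5, 10)
print(s.length())                # 5
print(s.overlaps(Span(8, 12)))   # the span from 8 to 10
print(s.overlaps(Span(11, 12)))  # None
print(s.merge(Span(8, 12)))      # the span from 5 to 12
print(s.shift(2))                # the span from 7 to 12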
class educe.annotation.Standoff(origin=None)
Bases: object
A standoff object ultimately points to some piece of text. The pointing is not necessarily direct though
encloses(other)
True if this annotations’s span encloses the span of the other.
s1.encloses(s2) is shorthand for s1.text_span().encloses(s2.text_span())
overlaps(other)
True if this annotations’s span encloses the span of the other.
s1.overlaps(s2) is shorthand for s1.text_span().overlaps(s2.text_span())
text_span()
Return the span from the earliest terminal annotation contained here to the latest.
Corner case: if this is an empty non-terminal (which would be a very weird thing indeed), return None
class educe.annotation.Unit(unit_id, span, utype, features, metadata=None, origin=None)
Bases: educe.annotation.Annotation
An annotation over a span of text
position()
The position is the purely “geographical” information used to identify an item. So instead of relying on
some sort of name, we might rely on its text span. We assume that some name-based elements (document
name, subdocument name, stage) can double as being positional.
If the unit has an origin (see “FileId”), we use the
•document
•subdocument
•stage
•(but not the annotator!)
•and its text span
position vs identifier
This is a trade-off. On the one hand, you can see the position as being a safer way to identify a unit, because
it obviates having to worry about your naming mechanism guaranteeing stability across the board (eg. two
annotators stick an annotation in the same place; does it have the same name?). On the other hand, it's a
bit harder to uniquely identify objects that may coincidentally fall in the same span. So how much do you
trust your IDs?
4.6 educe.corpus module
Corpus management
class educe.corpus.FileId(doc, subdoc, stage, annotator)
Information needed to uniquely identify an annotation file.
Note that this includes the annotator, so if you want to do comparisons on the “same” file between annotators
you’ll want to ignore this field.
Parameters
• doc (string) – document name
• subdoc (string) – subdocument (often None); sometimes you may have a need to divide a
document into smaller pieces (for example working with tools that require too much memory
to process large documents). The subdocument identifies which piece of the document you
are working with. If you don’t have a notion of subdocuments, just use None
• stage (string) – annotation stage; for use if you have distinct files that correspond to
different stages of your annotation process (or different processing tools)
• annotator (string) – the annotator (or annotation tool) that generated this annotation file
mk_global_id(local_id)
String representation of an identifier that should be unique to this corpus at least.
If the unit has an origin (see “FileId”), we use the
•document
•subdocument
•(but not the stage!)
•(but not the annotator!)
•and the id from the XML file
If we don’t have an origin we fall back to just the id provided by the XML file
See also position as potentially a safer alternative to this (and what we mean by safer)
class educe.corpus.Reader(dir)
Reader provides little more than dictionaries from FileId to data.
Parameters rootdir (string) – the top directory of the corpus
A potentially useful pattern to apply here is to take a slice of these dictionaries for processing. For example, you
might not want to read the whole corpus, but only the files which are modified by certain annotators.
reader   = Reader(corpus_dir)
files    = reader.files()
subfiles = { k: v for k, v in files.items() if k.annotator in ['Bob', 'Alice'] }
corpus   = reader.slurp(subfiles)
Alternatively, having read in the entire corpus, you might be doing processing on various slices of it at a time
corpus    = reader.slurp()
subcorpus = { k: v for k, v in corpus.items() if k.doc == 'pilot14' }
This is an abstract class; you should use the version from a data-set, eg. educe.stac.Reader instead
files()
Return a dictionary from FileId to (tuples of) filepaths. The tuples correspond to files that are considered
to ‘belong’ together; for example, in the case of standoff annotation, both the text file and its annotations
Derived classes should implement this function
filter(d, pred)
Convenience function equivalent to
{ k:v for k,v in d.items() if pred(k) }
slurp(cfiles=None, verbose=False)
Read the entire corpus if cfiles is None or else the subset specified by cfiles.
Return a dictionary from FileId to educe.annotation.Document
Parameters
• cfiles (dict) – a dictionary like what Corpus.files would return
• verbose (bool) – print what we’re reading to stderr
slurp_subcorpus(cfiles, verbose=False)
Derived classes should implement this function
4.7 educe.glozz module
The Glozz file format in educe.annotation form
You’re likely most interested in slurp_corpus and read_annotation_file
class educe.glozz.GlozzDocument(hashcode, unit, rels, schemas, text)
Bases: educe.annotation.Document
Representation of a glozz document
set_origin(origin)
to_xml(settings=<educe.glozz.GlozzOutputSettings object>)
exception educe.glozz.GlozzException(*args, **kw)
Bases: exceptions.Exception
class educe.glozz.GlozzOutputSettings(feature_order, metadata_order)
Bases: object
Non-essential aspects of Glozz XML output, such as the order that feature structures or metadata are written out.
Controlling these settings could be useful when you want to automatically modify an existing Glozz document,
but produce only minimal textual diffs along the way for revision control, comparability, etc.
educe.glozz.glozz_annotation_to_xml(self, tag=’annotation’, settings=<educe.glozz.GlozzOutputSettings object>)
educe.glozz.glozz_relation_to_span_xml(self )
educe.glozz.glozz_schema_to_span_xml(self )
educe.glozz.glozz_unit_to_span_xml(self )
educe.glozz.hashcode(f )
Hashcode mechanism as documented in the Glozz manual appendix. Hint: use cStringIO to get the hashcode
for a string
educe.glozz.ordered_keys(preferred, d)
Keys from a dictionary starting with ‘preferred’ ones in the order of preference
educe.glozz.read_annotation_file(anno_filename, text_filename=None)
Read a single glozz annotation file and its corresponding text (if any).
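A round-trip sketch (the file names are made up; the .aa file is the annotation XML and the .ac file the raw text):

from educe.glozz import read_annotation_file, write_annotation_file

doc = read_annotation_file('example.aa', text_filename='example.ac')
# ... modify doc here ...
write_annotation_file('example-out.aa', doc)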
educe.glozz.read_node(node, context=None)
educe.glozz.write_annotation_file(anno_filename, doc, settings=<educe.glozz.GlozzOutputSettings object>)
Write a GlozzDocument to XML in the given path
4.8 educe.graph module
Graph representation of discourse structure. Classes of interest:
• Graph: the core structure, use the Graph.from_doc factory method to build one out of an educe.annotation
document.
• DotGraph: visual representation, built from Graph. You probably want a project-specific variant to get more
helpful graphs, see eg. educe.stac.Graph.DotGraph
4.8.1 Educe hypergraphs
Somewhat tricky hypergraph representation of discourse structure.
• a node for every elementary discourse unit
• a hyperedge for every relation instance 1
• a hyperedge for every complex discourse unit
• (the tricky bit) for every (hyper)edge e_x in the graph, introduce a “mirror node” n_x for that edge (this node
also has e_x as its “mirror edge”)
The tricky bit is a response to two issues that arise: (A) how do we point to a CDU? Our hypergraph formalism and
library doesn’t have a notion of pointing to hyperedges (only nodes) and (B) what do we do about misannotations
where we have relation instances pointing to relation instances? A is the most important one to address (in principle,
we could just treat B as an error and raise an exception), but for now we decide to model both scenarios with the
same “mirror” mechanism above.
The mirrors are a bit problematic because they are not part of the formal graph structure (think of them as extra labels).
This could lead to some seriously unintuitive consequences when traversing the graph. For example, if you have two DUs
A and B connected by an Elab instance, and if that instance is itself (bizarrely) connected to some other DU C, you might
intuitively expect A, B, and C to all form one connected component
     A
     |
Elab |
     o---------> C
     |   Comment
     |
     v
     B
Alas, this is not so! The reality is a bit messier, with there being no formal relationship between edge and mirror
     A
     |
Elab |          n_ab
     |            o---------> C
     |                Comment
     v
     B
The same goes for the connectedness of things pointing to CDUs and with their members. Looking at pictures, you
might intuitively think that if a discourse unit (A) were connected to a CDU, it would also be connected to the discourse
units within
     A
     |
Elab |
     |
     v
  +-----+
  | B C |
  +-----+
The reality is messier for the same reasons above
1 just a binary hyperedge, ie. like an edge in a regular graph. As these are undirected, we take the convention that the first link is the tail
(from) and the second link is the head (to).
     A
     |
Elab |
     |
     v
    n_bc
  +-----+  e_bc
  | B C |
  +-----+
4.8.2 Classes
class educe.graph.AttrsMixin
Attributes common to both the hypergraph and directed graph representation of discourse structure
annotation(x)
Return the annotation object corresponding to a node or edge
edge_attributes_dict(x)
edgeform(x)
Return the argument if it is an edge id, or its mirror if it's a node id
(This is possible because every edge in the graph has a node that corresponds to it)
is_cdu(x)
is_edu(x)
is_relation(x)
mirror(x)
For objects (particularly, relations/CDUs) that have a mirror image, ie. an edge representation if it’s a node
or vice-versa, return the identifier for that image
node(x)
DEPRECATED (renamed 2013-11-19): use self.nodeform(x) instead
node_attributes_dict(x)
nodeform(x)
Return the argument if it is a node id, or its mirror if it’s an edge id
(This is possible because every edge in the graph has a node that corresponds to it)
type(x)
Return the type of a node/edge: one of ‘EDU’, ‘rel’, or ‘CDU’
class educe.graph.DotGraph(anno_graph)
Bases: pydot.Dot
A dot representation of this graph for visualisation. The to_string() method is most likely to be of interest here
This is fairly abstract and unhelpful. You probably want the project-layer extension instead, eg. educe.stac.graph
exception educe.graph.DuplicateIdException(duplicate)
Bases: exceptions.Exception
Condition that arises in inconsistent corpora
class educe.graph.EnclosureDotGraph(enc_graph)
Bases: pydot.Dot
class educe.graph.EnclosureGraph(annotations, key=None)
Bases: pygraph.classes.digraph.digraph, educe.graph.AttrsMixin
Caching mechanism for span enclosure. Given an iterable of Annotation, return a directed graph where nodes
point to the largest nodes they enclose (i.e. not to nodes that are enclosed by intermediary nodes they point to).
As a slight twist, we also allow nodes to redundantly point to enclosed nodes of the same type.
This should give you a multipartite graph with each layer representing a different type of annotation, but no
promises! We can’t guarantee that the graph will be nicely layered because the annotations may be buggy
(either nodes wrongly typed, or nodes of the same type that wrongly enclose each other), so you should not rely
on this property aside from treating it as an optimisation.
Note: there is a corner case for nodes that have the same span. Technically a span encloses itself, so the graph
could have a loop. If you supply a sort key that differentiates two nodes, we use it as a tie-breaker (first node
encloses second). Otherwise, we simply exclude both links.
NB: nodes are labelled by their annotation id
Initialisation parameters
•annotations - iterable of Annotation
•key - disambiguation key for nodes with same span (annotation -> sort key)
inside(annotation)
Given an annotation, return all annotations that are directly within it. Results are returned in the order of
their local id
outside(annotation)
Given an annotation, return all annotations it is directly enclosed in. Results are returned in the order of
their local id
class educe.graph.Graph
Bases: pygraph.classes.hypergraph.hypergraph, educe.graph.AttrsMixin
Hypergraph representation of discourse structure. See the section on Educe hypergraphs
You most likely want to use Graph.from_doc instead of instantiating an instance directly
Every node/hyperedge is represented as string unique within the graph. Given one of these identifiers x and a
graph g:
•g.type(x) returns one of the strings “EDU”, “CDU”, “rel”
•g.annotation(x) returns an educe.annotation object
•for relations and CDUs, if e_x is the edge representation of the relation/cdu, g.mirror(x) will return its
mirror node n_x and vice-versa
TODOS:
•TODO: Currently we use educe.annotation objects to represent the EDUs, CDUs and relations, but this is
likely a bit too low-level to be helpful. It may be nice to have higher-level EDU and CDU objects instead
cdu_members(cdu, deep=False)
Return the set of EDUs, CDUs, and relations which can be considered as members of this CDU.
This is shallow by default, in that we only return the immediate members of the CDU. If deep==True, also
return members of CDUs that are members of (members of ..) this CDU.
cdus()
Set of hyperedges representing complex discourse units.
See also cdu_members
connected_components()
Return the set of connected components.
Each connected component set can be passed to self.copy() to be copied as a subgraph.
This builds on python-graph’s version of a function with the same name but also adds awareness of our
conventions about there being both a node/edge for relations/CDUs.
containing_cdu(node)
Given an EDU (or CDU, or relation instance), return the immediate containing CDU (the hyperedge) if there
is one, or None otherwise. If there is more than one containing CDU, return one of them arbitrarily.
containing_cdu_chain(node)
Given an annotation, return a list which represents its containing CDU, the container's container, and so forth.
Return the empty list if no CDU contains this one.
copy(nodeset=None)
Return a copy of the graph, optionally restricted to a subset of EDUs and CDUs.
Note that if you include a CDU, then anything contained by that CDU will also be included.
You don't specify (or otherwise have control over) which relations are copied. The graph will include all
hyperedges whose links are all (a) members of the subset or (b) (recursively) hyperedges included because
of (a) or (b)
Note that any non-EDUs you include in the copy set will be silently ignored.
This is a shallow copy in the sense that the underlying layer of annotations and documents remains the
same.
Parameters nodeset (iterable of strings) – only copy nodes with these names
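For instance, to split a graph into one subgraph per connected component (a sketch, reusing a graph g built as above):

    # one copy per connected component; note that non-EDU/CDU members
    # of the nodeset are silently ignored, as documented above
    components = g.connected_components()
    subgraphs = [g.copy(nodeset=cc) for cc in components]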
edus()
Set of nodes representing elementary discourse units
classmethod from_doc(corpus, doc_key, could_include=<function <lambda>>, pred=<function
<lambda>>)
Return a graph representation of a document
Note: check the project layer for a version of this function which may be more appropriate to your project
Parameters
• corpus (dict from FileId to documents) – educe corpus dictionary
• doc_key (FileId) – key pointing to the document
• could_include (annotation -> boolean) – predicate on unit level annotations that
should be included regardless of whether or not we have links to them
• pred (annotation -> boolean) – predicate on annotations providing some requirement
they must satisfy in order to be taken into account (you might say that could_include
gives, and pred takes away)
rel_links(edge)
Given an edge in the graph, return a tuple of its source and target nodes.
If the edge has only a single link, we assume it's a loop and return the same value for both.
relations()
Set of relation edges representing the relations in the graph. By convention, the first link is considered the
source and the second is considered the target.
4.9 educe.internalutil module
Utility functions which are meant to be used by educe but aren’t expected to be too useful outside of it
exception educe.internalutil.EduceXmlException(*args, **kw)
Bases: exceptions.Exception
educe.internalutil.indent_xml(elem, level=0)
From <http://effbot.org/zone/element-lib.htm>
WARNING: destructive
educe.internalutil.linebreak_xml(elem)
Insert a break after each element tag
You probably want indent_xml instead
educe.internalutil.on_single_element(root, default, f, name)
Returns
• the default if there are no matching elements
• f(the element) if there is exactly one matching element
• raises an exception if there is more than one
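A small sketch of the first two cases (we assume name is matched against child elements the way ElementTree lookups usually work):

    import xml.etree.ElementTree as ET
    from educe.internalutil import on_single_element

    root = ET.fromstring('<doc><title>hi</title></doc>')
    # exactly one matching element: f is applied to it
    print(on_single_element(root, 'n/a', lambda e: e.text, 'title'))  # hi
    # no matching element: the default is returned
    print(on_single_element(root, 'n/a', lambda e: e.text, 'body'))   # n/a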
educe.internalutil.treenode(tree)
API-change padding for NLTK 2 vs NLTK 3 trees
4.10 educe.util module
Miscellaneous utility functions
educe.util.FILEID_FIELDS = ['stage', 'doc', 'subdoc', 'annotator']
String representation of fields recognised in an educe.corpus.FileId
educe.util.add_corpus_filters(parser, fields=None, choice_fields=None)
For help with script-building:
Augment an argparser with options to filter a corpus on the various attributes in an educe.corpus.FileId (e.g.
document, annotator).
Parameters
• fields ([String]) – which flag names to include (defaults to FILEID_FIELDS)
• choice_fields (Dict String [String]) – fields which accept a limited range of answers
Meant to be used in conjunction with mk_is_interesting
educe.util.add_subcommand(subparsers, module)
Add a subcommand to an argparser following some conventions:
•the module can have an optional NAME constant (giving the name of the command); otherwise we assume
it’s the unqualified module name
•the first line of its docstring is its help text
•subsequent lines (if any) form its epilog
Returns the resulting subparser for the module
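A hedged sketch of the conventions above, using a stand-in module built on the fly (a real script would import an actual command module instead):

    import argparse
    import types
    from educe import util

    # stand-in for a real command module, built inline for illustration
    mod = types.ModuleType('frobnicate')
    mod.__doc__ = "do something useful\n\nAny further lines become the epilog."
    mod.NAME = 'frobnicate'  # optional; otherwise the module name is used

    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers()
    sub = util.add_subcommand(subparsers, mod)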
educe.util.concat(items)
:: Iterable (Iterable a) -> Iterable a
educe.util.concat_l(items)
:: [[a]] -> [a]
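In Python terms, the signatures above amount to flattening one level of nesting:

    from educe.util import concat, concat_l

    concat_l([[1, 2], [3]])      # [1, 2, 3]
    list(concat([[1, 2], [3]]))  # [1, 2, 3] (iterable-returning variant)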
educe.util.fields_without(unwanted)
Fields for add_corpus_filters without the unwanted members
educe.util.mk_is_interesting(args, preselected=None)
Return a function that, when given a FileId, returns True if the FileId would be considered interesting according
to the arguments passed in.
Parameters preselected (Dict String [String]) – fields for which we already know what
matches we want
Meant to be used in conjunction with add_corpus_filters
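Putting the two together in a script (a sketch: we assume add_corpus_filters installs one flag per FILEID_FIELDS entry, i.e. --stage, --doc, --subdoc, and --annotator):

    import argparse
    from educe import util

    parser = argparse.ArgumentParser()
    util.add_corpus_filters(parser)
    args = parser.parse_args(['--doc', 's2-leagueM-game2',
                              '--stage', 'discourse'])
    is_interesting = util.mk_is_interesting(args)
    # is_interesting(file_id) is True only for matching FileId keys,
    # e.g. usable as the predicate for Reader.filter when slurping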
educe.util.relative_indices(group_indices, reverse=False, valna=None)
Generate a list of relative indices inside each group. Missing (None) values are handled specially: each
missing value is mapped to valna.
Parameters
• reverse (boolean, optional) – If True, compute indices relative to the end of each group.
• valna (int or None, optional) – Relative index for missing values.
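A sketch of the intended behaviour (the outputs are illustrative; we assume groups are contiguous runs of equal values):

    from educe.util import relative_indices

    groups = ['a', 'a', 'a', 'b', 'b', None]
    list(relative_indices(groups))
    # e.g. [0, 1, 2, 0, 1, None]
    list(relative_indices(groups, reverse=True, valna=-1))
    # e.g. [2, 1, 0, 1, 0, -1]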
CHAPTER 5
Indices and tables
• genindex
• modindex
• search
Bibliography
[li2014text] Li, S., Wang, L., Cao, Z., & Li, W. (2014). Text-level discourse dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014).
Python Module Index

e
educe
educe.annotation
educe.corpus
educe.external
educe.external.coref
educe.external.corenlp
educe.external.parser
educe.external.postag
educe.external.stanford_xml_reader
educe.glozz
educe.graph
educe.internalutil
educe.learning
educe.learning.csv
educe.learning.edu_input_format
educe.learning.keygroup_vectorizer
educe.learning.keys
educe.learning.svmlight_format
educe.learning.util
educe.learning.vocabulary_format
educe.pdtb
educe.pdtb.corpus
educe.pdtb.parse
educe.pdtb.pdtbx
educe.pdtb.ptb
educe.pdtb.util
educe.pdtb.util.args
educe.pdtb.util.features
educe.ptb
educe.ptb.annotation
educe.ptb.head_finder
educe.rst_dt
educe.rst_dt.annotation
educe.rst_dt.corpus
educe.rst_dt.deptree
educe.rst_dt.document_plus
educe.rst_dt.graph
educe.rst_dt.learning
educe.rst_dt.learning.args
educe.rst_dt.learning.base
educe.rst_dt.learning.doc_vectorizer
educe.rst_dt.learning.features
educe.rst_dt.learning.features_dev
educe.rst_dt.learning.features_li2014
educe.rst_dt.parse
educe.rst_dt.ptb
educe.rst_dt.rst_wsj_corpus
educe.rst_dt.sdrt
educe.rst_dt.text
educe.rst_dt.util
educe.rst_dt.util.args
educe.stac
educe.stac.annotation
educe.stac.context
educe.stac.corenlp
educe.stac.corpus
educe.stac.fake_graph
educe.stac.fusion
educe.stac.graph
educe.stac.learning
educe.stac.learning.addressee
educe.stac.learning.doc_vectorizer
educe.stac.learning.features
educe.stac.lexicon
educe.stac.lexicon.markers
educe.stac.lexicon.pdtb_markers
educe.stac.lexicon.wordclass
educe.stac.oneoff
educe.stac.oneoff.weave
educe.stac.postag
educe.stac.rfc
educe.stac.sanity
educe.stac.sanity.checks
educe.stac.sanity.checks.annotation
educe.stac.sanity.checks.glozz
educe.stac.sanity.checks.graph
educe.stac.sanity.checks.type_err
educe.stac.sanity.common
educe.stac.sanity.html
educe.stac.sanity.main
educe.stac.sanity.report
educe.stac.util
educe.stac.util.annotate
educe.stac.util.args
educe.stac.util.csv
educe.stac.util.doc
educe.stac.util.glozz
educe.stac.util.output
educe.stac.util.prettifyxml
educe.stac.util.showscores
educe.util
Index
Symbols
__repr__()
(educe.stac.learning.features.FeatureInput
method),
78
__getnewargs__() (educe.pdtb.util.features.DocumentPlus
__repr__()
(educe.stac.learning.features.VerbNetEntry
method), 53
method), 80
__getnewargs__() (educe.pdtb.util.features.FeatureInput
method), 53
A
__getnewargs__() (educe.stac.learning.features.DocEnv
absolute() (educe.annotation.Span method), 114
method), 76
add_commit_args() (in module educe.stac.util.args), 98
__getnewargs__() (educe.stac.learning.features.DocumentPlus
add_corpus_filters() (in module educe.util), 122
method), 77
(educe.rst_dt.deptree.RstDepTree
__getnewargs__() (educe.stac.learning.features.EduGap add_dependency()
method), 70
method), 77
__getnewargs__() (educe.stac.learning.features.FeatureInputadd_element() (in module educe.stac.sanity.main), 95
add_subcommand() (in module educe.util), 122
method), 78
add_usual_input_args() (in module educe.pdtb.util.args),
__getnewargs__() (educe.stac.learning.features.VerbNetEntry
53
method), 80
(in
module
__getstate__() (educe.pdtb.util.features.DocumentPlus add_usual_input_args()
educe.rst_dt.learning.args), 60
method), 53
(in
module
__getstate__()
(educe.pdtb.util.features.FeatureInput add_usual_input_args()
educe.rst_dt.util.args), 66
method), 53
__getstate__()
(educe.stac.learning.features.DocEnv add_usual_input_args() (in module educe.stac.util.args),
98
method), 76
(in
module
__getstate__() (educe.stac.learning.features.DocumentPlus add_usual_output_args()
educe.pdtb.util.args), 53
method), 77
(in
module
__getstate__()
(educe.stac.learning.features.EduGap add_usual_output_args()
educe.rst_dt.util.args), 66
method), 77
__getstate__() (educe.stac.learning.features.FeatureInput add_usual_output_args() (in module educe.stac.util.args),
98
method), 78
__getstate__() (educe.stac.learning.features.VerbNetEntry addressees() (in module educe.stac.annotation), 103
align_edus_with_paragraphs()
(in
module
method), 80
educe.rst_dt.document_plus), 72
__repr__()
(educe.pdtb.util.features.DocumentPlus
align_edus_with_sentences()
(in
module
method), 53
educe.rst_dt.ptb), 73
__repr__()
(educe.pdtb.util.features.FeatureInput
align_with_doc_structure()
method), 54
(educe.rst_dt.document_plus.DocumentPlus
__repr__()
(educe.stac.learning.features.DocEnv
method), 71
method), 76
__repr__() (educe.stac.learning.features.DocumentPlus align_with_raw_words() (educe.rst_dt.document_plus.DocumentPlus
method), 71
method), 77
__repr__()
(educe.stac.learning.features.EduGap align_with_tokens() (educe.rst_dt.document_plus.DocumentPlus
method), 71
method), 77
align_with_trees() (educe.rst_dt.document_plus.DocumentPlus
method), 71
131
educe Documentation, Release 0.1
all_edu_pairs() (educe.rst_dt.document_plus.DocumentPlus BadIdItem (class in educe.stac.sanity.checks.glozz), 90
method), 72
banner() (in module educe.stac.util.showscores), 102
AltLexRelation (class in educe.pdtb.parse), 55
basic_category() (in module educe.ptb.annotation), 58
AltLexRelationFeatures (class in educe.pdtb.parse), 55
BasicRfc (class in educe.stac.rfc), 111
anchor_name()
(educe.stac.sanity.report.HtmlReport BASKET (educe.learning.keys.Substance attribute), 52
method), 96
basket() (educe.learning.keys.Key class method), 50
anno_author() (in module educe.stac.util.glozz), 101
basket_fn()
(educe.learning.keys.MagicKey
class
anno_code() (in module educe.stac.sanity.common), 94
method), 51
anno_date() (in module educe.stac.util.glozz), 101
br() (in module educe.stac.sanity.html), 94
anno_id() (in module educe.stac.util.args), 98
build() (educe.external.parser.ConstituencyTree class
anno_id_from_tuple() (in module educe.stac.util.glozz),
method), 45
101
build() (educe.external.parser.DependencyTree class
anno_id_to_tuple() (in module educe.stac.util.glozz), 101
method), 46
annotate() (in module educe.stac.util.annotate), 97
build_analyzer() (educe.rst_dt.learning.doc_vectorizer.DocumentCountVect
annotate_doc() (in module educe.stac.util.annotate), 97
method), 61
Annotation (class in educe.annotation), 112
build_analyzer() (educe.rst_dt.learning.doc_vectorizer.DocumentLabelExtra
annotation() (educe.graph.AttrsMixin method), 119
method), 61
annotations() (educe.annotation.Document method), 112 build_doc_preprocessor()
(in
module
annotations() (educe.stac.sanity.checks.annotation.FeatureItem
educe.rst_dt.learning.features), 62
method), 89
build_doc_preprocessor()
(in
module
annotations() (educe.stac.sanity.checks.glozz.IdMismatch
educe.rst_dt.learning.features_dev), 63
method), 90
build_doc_preprocessor()
(in
module
annotations() (educe.stac.sanity.checks.glozz.OverlapItem
educe.rst_dt.learning.features_li2014), 65
method), 91
build_edu_feature_extractor()
(in
module
annotations() (educe.stac.sanity.checks.graph.CduOverlapItem
educe.rst_dt.learning.features), 62
method), 91
build_edu_feature_extractor()
(in
module
annotations() (educe.stac.sanity.common.RelationItem
educe.rst_dt.learning.features_dev), 63
method), 93
build_edu_feature_extractor()
(in
module
annotations()
(educe.stac.sanity.common.SchemaItem
educe.rst_dt.learning.features_li2014), 65
method), 93
build_pair_feature_extractor()
(in
module
annotations()
(educe.stac.sanity.common.UnitItem
educe.rst_dt.learning.features), 62
method), 94
build_pair_feature_extractor()
(in
module
annotations()
(educe.stac.sanity.report.ReportItem
educe.rst_dt.learning.features_dev), 63
method), 96
build_pair_feature_extractor()
(in
module
announce_output_dir() (in module educe.pdtb.util.args),
educe.rst_dt.learning.features_li2014), 65
53
announce_output_dir() (in module educe.rst_dt.util.args), C
66
CDU (class in educe.rst_dt.sdrt), 74
announce_output_dir() (in module educe.stac.util.args), cdu_head() (educe.stac.graph.Graph method), 109
98
cdu_members() (educe.graph.Graph method), 120
any_appears_in() (educe.stac.lexicon.pdtb_markers.Marker CduOverlapItem
(class
in
class method), 85
educe.stac.sanity.checks.graph), 91
appears_in() (educe.stac.lexicon.pdtb_markers.Marker cdus() (educe.graph.Graph method), 120
method), 85
Chain (class in educe.external.coref), 44
append_edu() (educe.rst_dt.deptree.RstDepTree method), check_easy_settings() (in module educe.stac.util.args), 98
70
check_matches() (in module educe.stac.oneoff.weave), 88
Arg (class in educe.pdtb.parse), 55
check_unit_ids()
(in
module
Attribution (class in educe.pdtb.parse), 55
educe.stac.sanity.checks.glozz), 91
AttrsMixin (class in educe.graph), 119
classname (educe.stac.learning.features.VerbNetEntry attribute), 80
B
clean_chat_word()
(in
module
BACKWARDS_WHITELIST
(in
module
educe.stac.learning.features), 81
educe.stac.sanity.checks.graph), 91
clean_dialogue_act()
(in
module
bad_ids() (in module educe.stac.sanity.checks.glozz), 91
educe.stac.learning.features), 81
132
Index
educe Documentation, Release 0.1
clean_edu_text() (in module educe.rst_dt.text), 75
cleanup_comments() (in module educe.stac.annotation),
103
combine_features()
(in
module
educe.rst_dt.learning.features), 62
combine_features()
(in
module
educe.rst_dt.learning.features_dev), 63
combine_features()
(in
module
educe.rst_dt.learning.features_li2014), 65
comma_span() (in module educe.stac.util.args), 98
compute_renames() (in module educe.stac.util.doc), 99
compute_updates() (in module educe.stac.oneoff.weave),
88
concat() (in module educe.util), 122
concat_l() (in module educe.util), 122
connected_components() (educe.graph.Graph method),
120
Connective (class in educe.pdtb.parse), 56
ConstituencyTree (class in educe.external.parser), 45
containing() (in module educe.rst_dt.document_plus), 72
containing() (in module educe.stac.context), 105
containing_cdu() (educe.graph.Graph method), 121
containing_cdu_chain() (educe.graph.Graph method),
121
Context (class in educe.stac.context), 104
context (educe.rst_dt.annotation.EDU attribute), 67
context (educe.rst_dt.annotation.Node attribute), 67
ContextItem (class in educe.stac.sanity.common), 93
CONTINUOUS (educe.learning.keys.Substance attribute), 52
continuous() (educe.learning.keys.Key class method), 51
continuous_fn() (educe.learning.keys.MagicKey class
method), 51
convert_label() (educe.rst_dt.corpus.RstRelationConverter
method), 70
convert_tree() (educe.rst_dt.corpus.RstRelationConverter
method), 70
copy() (educe.graph.Graph method), 121
copy_parses() (in module educe.stac.sanity.main), 95
CoreNlpDocument (class in educe.external.corenlp), 44
CoreNlpToken (class in educe.external.corenlp), 44
CoreNlpWrapper (class in educe.external.corenlp), 45
corpus (educe.pdtb.util.features.FeatureInput attribute),
54
corpus (educe.stac.learning.features.FeatureInput attribute), 78
CorpusConsistencyException, 76
create_dirname() (in module educe.stac.sanity.main), 95
create_units() (in module educe.stac.annotation), 103
cross_check_against()
(in
module
educe.stac.sanity.checks.glozz), 91
cross_check_units()
(in
module
educe.stac.sanity.checks.glozz), 91
css (educe.stac.sanity.report.HtmlReport attribute), 96
Index
current (educe.stac.learning.features.DocEnv attribute),
77
D
DEBUG (educe.learning.keys.KeyGroup attribute), 51
debug (educe.pdtb.util.features.FeatureInput attribute), 54
debug_du_to_tree() (in module educe.rst_dt.sdrt), 74
decode() (educe.rst_dt.corpus.RstDtParser method), 70
decode() (educe.rst_dt.learning.doc_vectorizer.DocumentCountVectorizer
method), 61
decode() (educe.rst_dt.learning.doc_vectorizer.DocumentLabelExtractor
method), 61
delete() (educe.stac.sanity.report.HtmlReport method), 96
DependencyTree (class in educe.external.parser), 45
deps() (educe.rst_dt.deptree.RstDepTree method), 71
depth_first_iterator() (educe.external.parser.SearchableTree
method), 46
Dialogue (class in educe.stac.fusion), 108
dialogue_act() (educe.stac.fusion.EDU method), 108
dialogue_act() (in module educe.stac.annotation), 103
dialogue_act_pairs()
(in
module
educe.stac.learning.features), 81
dialogue_graphs()
(in
module
educe.stac.sanity.checks.graph), 91
DialogueActVectorizer
(class
in
educe.stac.learning.doc_vectorizer), 76
DISCRETE (educe.learning.keys.Substance attribute), 52
discrete() (educe.learning.keys.Key class method), 51
discrete_fn()
(educe.learning.keys.MagicKey
class
method), 51
doc (educe.pdtb.util.features.DocumentPlus attribute), 53
doc (educe.stac.learning.features.DocumentPlus attribute), 77
DocEnv (class in educe.stac.learning.features), 76
Document (class in educe.annotation), 112
DocumentCountVectorizer
(class
in
educe.rst_dt.learning.doc_vectorizer), 61
DocumentLabelExtractor
(class
in
educe.rst_dt.learning.doc_vectorizer), 61
DocumentPlus (class in educe.pdtb.util.features), 53
DocumentPlus (class in educe.rst_dt.document_plus), 71
DocumentPlus (class in educe.stac.learning.features), 77
DocumentPlusPreprocessor
(class
in
educe.rst_dt.learning.base), 60
DotGraph (class in educe.graph), 119
DotGraph (class in educe.rst_dt.graph), 72
DotGraph (class in educe.stac.graph), 109
dump() (educe.stac.lexicon.wordclass.Lexicon method),
87
dump_all() (in module educe.learning.edu_input_format),
49
dump_edu_input_file()
(in
module
educe.learning.edu_input_format), 50
133
educe Documentation, Release 0.1
dump_pairings_file()
(in
module
educe.learning.edu_input_format), 50
dump_svmlight_file()
(in
module
educe.learning.svmlight_format), 52
dump_vocabulary()
(in
module
educe.learning.vocabulary_format), 52
duplicate_annotations()
(in
module
educe.stac.sanity.checks.glozz), 91
DuplicateIdException, 119
DuplicateItem (class in educe.stac.sanity.checks.glozz),
90
E
easy_settings() (in module educe.stac.sanity.main), 95
edge_attributes_dict() (educe.graph.AttrsMixin method),
119
edgeform() (educe.graph.AttrsMixin method), 119
EDU (class in educe.rst_dt.annotation), 67
EDU (class in educe.stac.fusion), 108
edu_feature() (in module educe.rst_dt.learning.base), 60
edu_pair_feature()
(in
module
educe.rst_dt.learning.base), 60
edu_pairs() (educe.stac.fusion.Dialogue method), 108
edu_position_in_turn()
(in
module
educe.stac.learning.features), 81
edu_span (educe.rst_dt.annotation.Node attribute), 67
edu_span() (educe.rst_dt.annotation.RSTTree method),
68
edu_text_feature()
(in
module
educe.stac.learning.features), 81
educe (module), 43
educe.annotation (module), 112
educe.corpus (module), 115
educe.external (module), 44
educe.external.coref (module), 44
educe.external.corenlp (module), 44
educe.external.parser (module), 45
educe.external.postag (module), 46
educe.external.stanford_xml_reader (module), 47
educe.glozz (module), 117
educe.graph (module), 117
educe.internalutil (module), 121
educe.learning (module), 49
educe.learning.csv (module), 49
educe.learning.edu_input_format (module), 49
educe.learning.keygroup_vectorizer (module), 50
educe.learning.keys (module), 50
educe.learning.svmlight_format (module), 52
educe.learning.util (module), 52
educe.learning.vocabulary_format (module), 52
educe.pdtb (module), 52
educe.pdtb.corpus (module), 55
educe.pdtb.parse (module), 55
educe.pdtb.pdtbx (module), 57
134
educe.pdtb.ptb (module), 57
educe.pdtb.util (module), 53
educe.pdtb.util.args (module), 53
educe.pdtb.util.features (module), 53
educe.ptb (module), 57
educe.ptb.annotation (module), 57
educe.ptb.head_finder (module), 59
educe.rst_dt (module), 59
educe.rst_dt.annotation (module), 67
educe.rst_dt.corpus (module), 69
educe.rst_dt.deptree (module), 70
educe.rst_dt.document_plus (module), 71
educe.rst_dt.graph (module), 72
educe.rst_dt.learning (module), 60
educe.rst_dt.learning.args (module), 60
educe.rst_dt.learning.base (module), 60
educe.rst_dt.learning.doc_vectorizer (module), 61
educe.rst_dt.learning.features (module), 62
educe.rst_dt.learning.features_dev (module), 63
educe.rst_dt.learning.features_li2014 (module), 65
educe.rst_dt.parse (module), 72
educe.rst_dt.ptb (module), 73
educe.rst_dt.rst_wsj_corpus (module), 73
educe.rst_dt.sdrt (module), 74
educe.rst_dt.text (module), 75
educe.rst_dt.util (module), 66
educe.rst_dt.util.args (module), 66
educe.stac (module), 75
educe.stac.annotation (module), 102
educe.stac.context (module), 104
educe.stac.corenlp (module), 106
educe.stac.corpus (module), 106
educe.stac.fake_graph (module), 107
educe.stac.fusion (module), 108
educe.stac.graph (module), 109
educe.stac.learning (module), 75
educe.stac.learning.addressee (module), 76
educe.stac.learning.doc_vectorizer (module), 76
educe.stac.learning.features (module), 76
educe.stac.lexicon (module), 84
educe.stac.lexicon.markers (module), 85
educe.stac.lexicon.pdtb_markers (module), 85
educe.stac.lexicon.wordclass (module), 86
educe.stac.oneoff (module), 87
educe.stac.oneoff.weave (module), 87
educe.stac.postag (module), 111
educe.stac.rfc (module), 111
educe.stac.sanity (module), 89
educe.stac.sanity.checks (module), 89
educe.stac.sanity.checks.annotation (module), 89
educe.stac.sanity.checks.glozz (module), 90
educe.stac.sanity.checks.graph (module), 91
educe.stac.sanity.checks.type_err (module), 93
educe.stac.sanity.common (module), 93
Index
educe Documentation, Release 0.1
educe.stac.sanity.html (module), 94
educe.stac.sanity.main (module), 95
educe.stac.sanity.report (module), 96
educe.stac.util (module), 97
educe.stac.util.annotate (module), 97
educe.stac.util.args (module), 98
educe.stac.util.csv (module), 98
educe.stac.util.doc (module), 99
educe.stac.util.glozz (module), 100
educe.stac.util.output (module), 101
educe.stac.util.prettifyxml (module), 101
educe.stac.util.showscores (module), 101
educe.util (module), 122
EducePosTagException, 46
EduceXmlException, 121
EduGap (class in educe.stac.learning.features), 77
edus() (educe.graph.Graph method), 121
edus_in_span() (in module educe.stac.context), 105
elem() (in module educe.stac.sanity.html), 94
emoticons() (in module educe.stac.learning.features), 81
enclosed() (in module educe.stac.context), 105
enclosed_lemmas()
(in
module
educe.stac.learning.features), 81
enclosed_trees() (in module educe.stac.learning.features),
81
encloses() (educe.annotation.Span method), 114
encloses() (educe.annotation.Standoff method), 114
EnclosureDotGraph (class in educe.graph), 119
EnclosureDotGraph (class in educe.stac.graph), 109
EnclosureGraph (class in educe.graph), 119
EnclosureGraph (class in educe.stac.graph), 109
ends_with_bang()
(in
module
educe.stac.learning.features), 81
ends_with_qmark()
(in
module
educe.stac.learning.features), 81
EntityRelation (class in educe.pdtb.parse), 56
evil_set_id() (in module educe.stac.util.doc), 99
evil_set_text() (in module educe.stac.util.doc), 99
excess_status (educe.stac.sanity.checks.glozz.MissingItem
attribute), 90
expire()
(educe.stac.learning.features.FeatureCache
method), 78
ExplicitRelation (class in educe.pdtb.parse), 56
ExplicitRelationFeatures (class in educe.pdtb.parse), 56
extract_pair_doc()
(in
module
educe.rst_dt.learning.features_dev), 63
extract_pair_features()
(in
module
educe.stac.learning.features), 81
extract_pair_gap()
(in
module
educe.rst_dt.learning.features), 62
extract_pair_length()
(in
module
educe.rst_dt.learning.features_li2014), 65
extract_pair_para()
(in
module
educe.rst_dt.learning.features_dev), 63
Index
extract_pair_para()
(in
module
educe.rst_dt.learning.features_li2014), 65
extract_pair_pos()
(in
module
educe.rst_dt.learning.features_li2014), 65
extract_pair_pos_tags()
(in
module
educe.rst_dt.learning.features), 62
extract_pair_raw_word()
(in
module
educe.rst_dt.learning.features), 62
extract_pair_sent()
(in
module
educe.rst_dt.learning.features_dev), 63
extract_pair_sent()
(in
module
educe.rst_dt.learning.features_li2014), 65
extract_pair_syntax()
(in
module
educe.rst_dt.learning.features_dev), 63
extract_pair_word()
(in
module
educe.rst_dt.learning.features_li2014), 65
extract_rel_features()
(in
module
educe.pdtb.util.features), 54
extract_single_features()
(in
module
educe.stac.learning.features), 81
extract_single_length()
(in
module
educe.rst_dt.learning.features_dev), 63
extract_single_length()
(in
module
educe.rst_dt.learning.features_li2014), 65
extract_single_para()
(in
module
educe.rst_dt.learning.features_dev), 63
extract_single_para()
(in
module
educe.rst_dt.learning.features_li2014), 65
extract_single_pdtb_markers()
(in
module
educe.rst_dt.learning.features_dev), 64
extract_single_pos()
(in
module
educe.rst_dt.learning.features_dev), 64
extract_single_pos()
(in
module
educe.rst_dt.learning.features_li2014), 65
extract_single_ptb_token_pos()
(in
module
educe.rst_dt.learning.features), 62
extract_single_ptb_token_word()
(in
module
educe.rst_dt.learning.features), 62
extract_single_raw_word()
(in
module
educe.rst_dt.learning.features), 62
extract_single_sentence()
(in
module
educe.rst_dt.learning.features_dev), 64
extract_single_sentence()
(in
module
educe.rst_dt.learning.features_li2014), 65
extract_single_syntax()
(in
module
educe.rst_dt.learning.features_dev), 64
extract_single_syntax()
(in
module
educe.rst_dt.learning.features_li2014), 66
extract_single_word()
(in
module
educe.rst_dt.learning.features_dev), 64
extract_single_word()
(in
module
educe.rst_dt.learning.features_li2014), 66
extract_turns() (in module educe.stac.postag), 111
135
educe Documentation, Release 0.1
F
fill()
(educe.stac.learning.features.SingleEduSubgroup
method), 80
fill() (educe.stac.learning.features.VerbNetLexKeyGroup
method), 81
filter() (educe.corpus.Reader method), 116
filter_matches()
(in
module
educe.stac.sanity.checks.glozz), 91
find_edu_head() (in module educe.ptb.head_finder), 59
find_lexical_heads() (in module educe.ptb.head_finder),
59
first_or_none() (in module educe.stac.sanity.main), 95
first_outermost_dus() (educe.stac.graph.Graph method),
110
fit() (educe.rst_dt.learning.doc_vectorizer.DocumentCountVectorizer
method), 61
fit() (educe.rst_dt.learning.doc_vectorizer.DocumentLabelExtractor
method), 61
fit()
(educe.rst_dt.learning.features_dev.LecsieFeats
method), 63
fit_transform() (educe.learning.keygroup_vectorizer.KeyGroupVectorizer
method), 50
fit_transform() (educe.rst_dt.learning.doc_vectorizer.DocumentCountVector
method), 61
fit_transform() (educe.rst_dt.learning.doc_vectorizer.DocumentLabelExtrac
method), 62
fleshout() (educe.annotation.Document method), 113
fleshout() (educe.annotation.Relation method), 113
fleshout() (educe.annotation.Schema method), 113
fleshout() (educe.stac.fusion.EDU method), 108
flush_subreport() (educe.stac.sanity.report.HtmlReport
method), 96
for_edus() (educe.stac.context.Context class method),
105
freeze() (educe.stac.lexicon.wordclass.LexClass class
method), 86
from_corenlp_output_filename()
(in
module
educe.stac.corenlp), 106
from_doc() (educe.graph.Graph class method), 121
from_doc() (educe.rst_dt.graph.Graph class method), 72
from_doc() (educe.stac.graph.Graph class method), 110
from_rst_tree() (educe.rst_dt.annotation.SimpleRSTTree
class method), 68
from_simple_rst_tree() (educe.rst_dt.deptree.RstDepTree
class method), 71
frontier() (educe.stac.rfc.BasicRfc method), 111
fuse_edus() (in module educe.stac.fusion), 109
f_measure() (educe.stac.util.showscores.Score method),
102
feat_annotator() (in module educe.stac.learning.features),
82
feat_end() (in module educe.stac.learning.features), 82
feat_has_emoticons()
(in
module
educe.stac.learning.features), 82
feat_id() (in module educe.stac.learning.features), 82
feat_is_emoticon_only()
(in
module
educe.stac.learning.features), 82
feat_start() (in module educe.stac.learning.features), 82
FeatureCache (class in educe.stac.learning.features), 77
FeatureExtractionException, 60
FeatureInput (class in educe.pdtb.util.features), 53
FeatureInput (class in educe.stac.learning.features), 78
FeatureItem (class in educe.stac.sanity.checks.annotation),
89
features (educe.external.corenlp.CoreNlpToken attribute),
44
FeatureSetAction (class in educe.rst_dt.learning.args), 60
fields_without() (in module educe.util), 122
FileId (class in educe.corpus), 115
FILEID_FIELDS (in module educe.util), 122
files() (educe.corpus.Reader method), 116
files() (educe.pdtb.corpus.Reader method), 55
files() (educe.rst_dt.corpus.Reader method), 69
files() (educe.stac.corpus.LiveInputReader method), 107
files() (educe.stac.corpus.Reader method), 107
fill() (educe.pdtb.util.features.RelKeys method), 54
fill() (educe.pdtb.util.features.RelSubgroup method), 54
fill()
(educe.pdtb.util.features.RelSubGroup_Core
method), 54
fill() (educe.pdtb.util.features.SingleArgKeys method), 54
fill()
(educe.pdtb.util.features.SingleArgSubgroup
method), 54
fill() (educe.stac.learning.features.InquirerLexKeyGroup
method), 78
fill()
(educe.stac.learning.features.LexKeyGroup
method), 78
fill() (educe.stac.learning.features.MergedLexKeyGroup
method), 79
fill() (educe.stac.learning.features.PairKeys method), 79
fill() (educe.stac.learning.features.PairSubgroup method),
79
fill()
(educe.stac.learning.features.PairSubgroup_Gap
method), 79
fill() (educe.stac.learning.features.PairSubgroup_Tuple G
generate_graphs() (in module educe.stac.sanity.main), 95
method), 79
fill()
(educe.stac.learning.features.PdtbLexKeyGroup generic_token_spans() (in module educe.external.postag),
46
method), 79
fill()
(educe.stac.learning.features.SingleEduKeys get() (educe.stac.util.glozz.TimestampCache method),
100
method), 80
136
Index
educe Documentation, Release 0.1
get_by_form()
(educe.stac.lexicon.markers.LexConn GlozzDocument (class in educe.glozz), 117
method), 85
GlozzException, 117
get_by_id()
(educe.stac.lexicon.markers.LexConn GlozzOutputSettings (class in educe.glozz), 117
method), 85
GornAddress (class in educe.pdtb.parse), 56
get_by_lemma() (educe.stac.lexicon.markers.LexConn Graph (class in educe.graph), 120
method), 85
Graph (class in educe.rst_dt.graph), 72
get_coref_chains() (educe.external.stanford_xml_reader.PreprocessingSource
Graph (class in educe.stac.graph), 109
method), 48
guess_addressees_for_edu()
(in
module
get_dependencies()
(educe.rst_dt.deptree.RstDepTree
educe.stac.learning.addressee), 76
method), 71
get_doc() (educe.stac.fake_graph.LightGraph method), H
107
has_correction_star()
(in
module
get_document_id() (educe.external.stanford_xml_reader.PreprocessingSource
educe.stac.learning.features), 82
method), 48
has_errors()
(educe.stac.sanity.report.HtmlReport
get_edge() (educe.stac.fake_graph.LightGraph method),
method), 96
107
has_FOR_np() (in module educe.stac.learning.features),
get_forms() (educe.stac.lexicon.markers.Marker method),
82
85
has_inner_question()
(in
module
get_lemma()
(educe.stac.lexicon.markers.Marker
educe.stac.learning.features), 82
method), 85
has_non_du_member()
(in
module
get_node() (educe.stac.fake_graph.LightGraph method),
educe.stac.sanity.checks.type_err), 93
107
has_one_of_words()
(in
module
get_offset2sentence_map()
educe.stac.learning.features), 82
(educe.external.stanford_xml_reader.PreprocessingSource
has_pdtb_markers()
(in
module
method), 48
educe.stac.learning.features), 82
get_offset2token_maps()
has_player_name_exact()
(in
module
(educe.external.stanford_xml_reader.PreprocessingSource educe.stac.learning.features), 82
method), 48
has_player_name_fuzzy()
(in
module
get_ordered_sentence_list()
educe.stac.learning.features), 82
(educe.external.stanford_xml_reader.PreprocessingSource
hashcode() (in module educe.glozz), 117
method), 48
horrible_context_kludge()
(in
module
get_ordered_token_list() (educe.external.stanford_xml_reader.PreprocessingSource
educe.stac.sanity.checks.graph), 92
method), 48
html() (educe.stac.sanity.checks.annotation.FeatureItem
get_output_dir() (in module educe.pdtb.util.args), 53
method), 89
get_output_dir() (in module educe.rst_dt.util.args), 66
html()
(educe.stac.sanity.checks.glozz.IdMismatch
get_output_dir() (in module educe.stac.util.args), 98
method), 90
get_players() (in module educe.stac.learning.features), 82 html()
(educe.stac.sanity.checks.glozz.MissingItem
get_relations()
(educe.stac.lexicon.markers.Marker
method), 90
method), 85
html()
(educe.stac.sanity.checks.glozz.OffByOneItem
get_sentence_annotations()
method), 90
(educe.external.stanford_xml_reader.PreprocessingSource
html()
(educe.stac.sanity.checks.glozz.OverlapItem
method), 48
method), 91
get_syntactic_labels()
(in
module html() (educe.stac.sanity.checks.graph.CduOverlapItem
educe.rst_dt.learning.features_li2014), 66
method), 91
get_token_annotations() (educe.external.stanford_xml_reader.PreprocessingSource
html() (educe.stac.sanity.common.RelationItem method),
method), 48
93
get_turn() (in module educe.stac.util.glozz), 101
html() (educe.stac.sanity.common.SchemaItem method),
global_id() (educe.annotation.Document method), 113
93
glozz_annotation_to_xml() (in module educe.glozz), 117 html() (educe.stac.sanity.common.UnitItem method), 94
glozz_relation_to_span_xml() (in module educe.glozz), html() (educe.stac.sanity.report.ReportItem method), 96
117
html_anno_id() (in module educe.stac.sanity.report), 97
glozz_schema_to_span_xml() (in module educe.glozz), html_turn_info() (educe.stac.sanity.checks.glozz.OffByOneItem
117
method), 90
glozz_unit_to_span_xml() (in module educe.glozz), 117 HtmlReport (class in educe.stac.sanity.report), 96
Index
137
educe Documentation, Release 0.1
I
id_to_path() (in module educe.pdtb.corpus), 55
id_to_path() (in module educe.rst_dt.corpus), 70
id_to_path() (in module educe.stac.corpus), 107
identifier() (educe.annotation.Annotation method), 112
identifier() (educe.rst_dt.annotation.EDU method), 67
identifier() (educe.stac.fusion.EDU method), 108
IdMismatch (class in educe.stac.sanity.checks.glozz), 90
ImplicitRelation (class in educe.pdtb.parse), 56
ImplicitRelationFeatures (class in educe.pdtb.parse), 56
incorporate_nuclearity_into_label()
(educe.rst_dt.annotation.SimpleRSTTree
class method), 68
indent_xml() (in module educe.internalutil), 122
InferenceSite (class in educe.pdtb.parse), 56
inner_edus (educe.stac.learning.features.EduGap attribute), 77
inputs (educe.stac.learning.features.DocEnv attribute), 77
inquirer_lex (educe.stac.learning.features.FeatureInput
attribute), 78
InquirerLexKeyGroup
(class
in
educe.stac.learning.features), 78
inside() (educe.graph.EnclosureGraph method), 120
is_arrow_inversion()
(in
module
educe.stac.sanity.checks.graph), 92
is_binary() (in module educe.rst_dt.annotation), 69
is_blank_edu()
(in
module
educe.stac.sanity.checks.annotation), 89
is_cdu() (educe.graph.AttrsMixin method), 119
is_cdu() (educe.stac.graph.Graph method), 110
is_cdu() (in module educe.stac.annotation), 103
is_coordinating() (in module educe.stac.annotation), 103
is_cross_dialogue()
(in
module
educe.stac.sanity.checks.annotation), 89
is_default() (in module educe.stac.sanity.common), 94
is_dialogue() (in module educe.stac.annotation), 103
is_dialogue() (in module educe.stac.util.glozz), 101
is_dialogue_act() (in module educe.stac.annotation), 103
is_disconnected()
(in
module
educe.stac.sanity.checks.graph), 92
is_dupe_rel() (in module educe.stac.sanity.checks.graph),
92
is_edu() (educe.graph.AttrsMixin method), 119
is_edu() (educe.stac.graph.Graph method), 110
is_edu() (in module educe.stac.annotation), 103
is_emoticon() (in module educe.stac.learning.addressee),
76
is_empty_category() (in module educe.ptb.annotation),
58
is_fixme()
(in
module
educe.stac.sanity.checks.annotation), 89
is_glozz_relation()
(in
module
educe.stac.sanity.common), 94
138
is_glozz_schema()
(in
module
educe.stac.sanity.common), 94
is_glozz_unit() (in module educe.stac.sanity.common), 94
is_just_emoticon()
(in
module
educe.stac.learning.features), 82
is_left_padding() (educe.rst_dt.annotation.EDU method),
67
is_left_padding() (educe.stac.fusion.EDU method), 108
is_maybe_off_by_one()
(in
module
educe.stac.sanity.checks.glozz), 91
is_metal() (in module educe.stac.corpus), 107
is_non2sided_rel()
(in
module
educe.stac.sanity.checks.graph), 92
is_non_du()
(in
module
educe.stac.sanity.checks.type_err), 93
is_non_empty() (in module educe.ptb.annotation), 58
is_non_preference()
(in
module
educe.stac.sanity.checks.type_err), 93
is_non_resource()
(in
module
educe.stac.sanity.checks.type_err), 93
is_nonword_token() (in module educe.ptb.annotation), 58
is_nplike() (in module educe.stac.learning.features), 82
is_nucleus() (educe.rst_dt.annotation.Node method), 67
is_preference() (in module educe.stac.annotation), 103
is_preposition()
(in
module
educe.stac.learning.addressee), 76
is_punct() (in module educe.stac.learning.addressee), 76
is_puncture() (in module educe.stac.sanity.checks.graph),
92
is_question() (in module educe.stac.learning.features), 82
is_question_pairs()
(in
module
educe.stac.learning.features), 82
is_relation() (educe.graph.AttrsMixin method), 119
is_relation() (educe.stac.graph.Graph method), 110
is_relation_instance() (in module educe.stac.annotation),
103
is_resource() (in module educe.stac.annotation), 103
is_review_edu()
(in
module
educe.stac.sanity.checks.annotation), 89
is_root()
(educe.external.parser.DependencyTree
method), 46
is_satellite() (educe.rst_dt.annotation.Node method), 67
is_structure() (in module educe.stac.annotation), 103
is_subordinating() (in module educe.stac.annotation), 104
is_turn() (in module educe.stac.annotation), 104
is_turn_star() (in module educe.stac.annotation), 104
is_verb() (in module educe.stac.learning.addressee), 76
is_weird_ack()
(in
module
educe.stac.sanity.checks.graph), 92
is_weird_qap()
(in
module
educe.stac.sanity.checks.graph), 92
issues_descr() (in module educe.stac.sanity.main), 95
Index
educe Documentation, Release 0.1
J
LexWrapper (class in educe.stac.learning.features), 79
javascript (educe.stac.sanity.report.HtmlReport attribute), LightGraph (class in educe.stac.fake_graph), 107
linebreak_xml() (in module educe.internalutil), 122
96
just_subclasses() (educe.stac.lexicon.wordclass.LexClass LiveInputReader (class in educe.stac.corpus), 106
load_head_rules() (in module educe.ptb.head_finder), 59
method), 86
(in
module
just_words()
(educe.stac.lexicon.wordclass.LexClass load_labels()
educe.learning.edu_input_format),
50
method), 86
load_pdtb_markers_lexicon()
(in
module
educe.stac.lexicon.pdtb_markers), 85
K
load_rst_wsj_corpus_edus_file()
(in
module
Key (class in educe.learning.keys), 50
educe.rst_dt.rst_wsj_corpus),
73
key (educe.pdtb.util.features.DocumentPlus attribute), 53
(in
module
key (educe.stac.learning.features.DocumentPlus at- load_rst_wsj_corpus_text_file()
educe.rst_dt.rst_wsj_corpus),
73
tribute), 77
load_rst_wsj_corpus_text_file_file()
(in
module
key_prefix() (educe.stac.learning.features.InquirerLexKeyGroup
educe.rst_dt.rst_wsj_corpus),
74
class method), 78
(in
module
key_prefix() (educe.stac.learning.features.LexKeyGroup load_rst_wsj_corpus_text_file_wsj()
educe.rst_dt.rst_wsj_corpus),
74
method), 78
(in
module
key_prefix() (educe.stac.learning.features.PdtbLexKeyGroupload_vocabulary()
educe.learning.vocabulary_format),
52
class method), 80
local_id() (educe.annotation.Annotation method), 112
key_prefix() (educe.stac.learning.features.VerbNetLexKeyGroup
lowest_common_parent()
(in
module
class method), 81
educe.rst_dt.learning.base),
60
KeyGroup (class in educe.learning.keys), 51
KeyGroupVectorizer
(class
educe.learning.keygroup_vectorizer), 50
in
L
labels_comment()
(in
module
educe.learning.edu_input_format), 50
LabelVectorizer
(class
in
educe.stac.learning.doc_vectorizer), 76
LecsieFeats (class in educe.rst_dt.learning.features_dev),
63
left_padding()
(educe.external.postag.Token
class
method), 46
left_padding()
(educe.rst_dt.annotation.EDU
class
method), 67
left_padding()
(educe.rst_dt.text.Paragraph
class
method), 75
left_padding() (educe.rst_dt.text.Sentence class method),
75
lemma_subject()
(in
module
educe.stac.learning.features), 82
lemmas (educe.stac.learning.features.VerbNetEntry attribute), 81
length() (educe.annotation.Span method), 114
LexClass (class in educe.stac.lexicon.wordclass), 86
LexConn (class in educe.stac.lexicon.markers), 85
LexEntry (class in educe.stac.lexicon.wordclass), 86
lexical_markers()
(in
module
educe.stac.learning.features), 82
Lexicon (class in educe.stac.lexicon.wordclass), 87
lexicons
(educe.stac.learning.features.FeatureInput
attribute), 78
LexKeyGroup (class in educe.stac.learning.features), 78
Index
M
MagicKey (class in educe.learning.keys), 51
main() (in module educe.stac.sanity.main), 95
map() (educe.stac.oneoff.weave.Updates method), 88
map_topdown() (in module educe.stac.learning.features),
82
Marker (class in educe.stac.lexicon.markers), 85
Marker (class in educe.stac.lexicon.pdtb_markers), 85
Mention (class in educe.external.coref), 44
merge() (educe.annotation.Span method), 114
merge_all() (educe.annotation.Span class method), 114
merge_turn_stars() (in module educe.stac.context), 105
MergedKeyGroup (class in educe.learning.keys), 51
MergedLexKeyGroup
(class
in
educe.stac.learning.features), 79
mirror() (educe.graph.AttrsMixin method), 119
missing() (educe.stac.util.showscores.Score method), 102
missing_features()
(in
module
educe.stac.sanity.checks.annotation), 89
missing_status (educe.stac.sanity.checks.glozz.MissingItem
attribute), 90
MissingDocumentException, 90
MissingItem (class in educe.stac.sanity.checks.glozz), 90
mk_csv_reader() (in module educe.stac.util.csv), 99
mk_csv_writer() (in module educe.stac.util.csv), 99
mk_current() (in module educe.pdtb.util.features), 54
mk_env() (in module educe.stac.learning.features), 83
mk_envs() (in module educe.stac.learning.features), 83
mk_field() (educe.stac.learning.features.InquirerLexKeyGroup
method), 78
139
educe Documentation, Release 0.1
mk_field() (educe.stac.learning.features.LexKeyGroup node_attributes_dict() (educe.graph.AttrsMixin method),
method), 79
119
mk_field() (educe.stac.learning.features.PdtbLexKeyGroup nodeform() (educe.graph.AttrsMixin method), 119
method), 80
NoRelation (class in educe.pdtb.parse), 56
mk_field() (educe.stac.learning.features.VerbNetLexKeyGroup
nuclearity (educe.rst_dt.annotation.Node attribute), 68
method), 81
num (educe.rst_dt.annotation.EDU attribute), 67
mk_fields() (educe.stac.learning.features.InquirerLexKeyGroup
num (educe.rst_dt.text.Paragraph attribute), 75
method), 78
num (educe.rst_dt.text.Sentence attribute), 75
mk_fields() (educe.stac.learning.features.LexKeyGroup num_edus_between()
(in
module
method), 79
educe.stac.learning.features), 83
mk_fields() (educe.stac.learning.features.PdtbLexKeyGroupnum_nonling_tstars_between()
(in
module
method), 80
educe.stac.learning.features), 83
mk_fields() (educe.stac.learning.features.VerbNetLexKeyGroup
num_speakers_between()
(in
module
method), 81
educe.stac.learning.features), 83
mk_global_id() (educe.corpus.FileId method), 115
num_tokens() (in module educe.stac.learning.features),
mk_hidden_with_toggle()
83
(educe.stac.sanity.report.HtmlReport method),
O
96
mk_high_level_dialogues()
(in
module OffByOneItem (class in educe.stac.sanity.checks.glozz),
educe.stac.learning.features), 83
90
mk_is_interesting()
(in
module on_first_bigram() (in module educe.rst_dt.learning.base),
educe.stac.learning.features), 83
60
mk_is_interesting() (in module educe.util), 123
on_first_unigram()
(in
module
mk_key() (in module educe.pdtb.corpus), 55
educe.rst_dt.learning.base), 60
mk_key() (in module educe.rst_dt.corpus), 70
on_last_bigram() (in module educe.rst_dt.learning.base),
mk_microphone() (in module educe.stac.sanity.report),
60
97
on_last_unigram()
(in
module
mk_or_get_subreport() (educe.stac.sanity.report.HtmlReport
educe.rst_dt.learning.base), 61
method), 96
on_single_element() (in module educe.internalutil), 122
mk_output_path() (educe.stac.sanity.report.HtmlReport one_hot_values_gen() (educe.learning.keys.KeyGroup
class method), 96
method), 51
mk_output_path() (in module educe.pdtb.util.args), 53
one_hot_values_gen() (educe.stac.learning.features.PairKeys
mk_parent_dirs() (in module educe.stac.util.output), 101
method), 79
mk_plain_csv_writer() (in module educe.learning.csv), ordered_keys() (in module educe.glozz), 117
49
output_is_temp() (educe.stac.sanity.main.SanityChecker
mk_plain_csv_writer() (in module educe.stac.util.csv), 99
method), 95
move_portion() (in module educe.stac.util.doc), 99
output_path_stub() (in module educe.stac.util.output),
MultiheadedCduException, 110
101
Multiword (class in educe.stac.lexicon.pdtb_markers), 85 outside() (educe.graph.EnclosureGraph method), 120
OverlapItem (class in educe.stac.sanity.checks.glozz), 91
N
overlapping() (in module educe.stac.sanity.checks.glozz),
91
NAME_WIDTH (educe.learning.keys.KeyGroup atoverlapping_structs()
(in
module
tribute), 51
educe.stac.sanity.checks.glozz), 91
narrow_to_span() (in module educe.stac.util.doc), 100
overlaps() (educe.annotation.Span method), 114
new_writable_instance() (educe.stac.lexicon.wordclass.LexClass
overlaps() (educe.annotation.Standoff method), 114
class method), 86
next() (educe.learning.csv.SparseDictReader method), 49
P
next() (educe.learning.csv.Utf8DictReader method), 49
next() (educe.stac.util.csv.SparseDictReader method), 98 PairKeys (class in educe.stac.learning.features), 79
next() (educe.stac.util.csv.Utf8DictReader method), 99
PairSubgroup (class in educe.stac.learning.features), 79
next()
(educe.stac.util.glozz.PseudoTimestamper PairSubgroup_Gap (class in educe.stac.learning.features),
method), 100
79
Node (class in educe.rst_dt.annotation), 67
PairSubgroup_Tuple
(class
in
node() (educe.graph.AttrsMixin method), 119
educe.stac.learning.features), 79
140
Index
educe Documentation, Release 0.1
Paragraph (class in educe.rst_dt.text), 75
product_features()
(in
module
paragraphs (educe.rst_dt.annotation.RSTContext ateduce.rst_dt.learning.features_li2014), 66
tribute), 68
prune_tree() (in module educe.ptb.annotation), 58
parse() (educe.rst_dt.corpus.RstDtParser method), 70
PseudoTimestamper (class in educe.stac.util.glozz), 100
parse() (educe.rst_dt.ptb.PtbParser method), 73
PTB_TO_TEXT (in module educe.ptb.annotation), 57
parse() (in module educe.pdtb.parse), 56
PtbParser (class in educe.rst_dt.ptb), 73
parse_lightweight_tree() (in module educe.rst_dt.parse),
R
72
parse_relation() (in module educe.pdtb.parse), 56
raw_text (educe.rst_dt.annotation.EDU attribute), 67
parse_rst_dt_tree() (in module educe.rst_dt.parse), 72
RawToken (class in educe.external.postag), 46
parse_trees() (in module educe.pdtb.ptb), 57
re_emit() (in module educe.rst_dt.learning.doc_vectorizer),
parsed_file_name() (in module educe.stac.corenlp), 106
62
parses
(educe.stac.learning.features.DocumentPlus read() (educe.external.stanford_xml_reader.PreprocessingSource
attribute), 77
method), 48
parses (educe.stac.learning.features.FeatureInput at- read()
(educe.stac.learning.features.LexWrapper
tribute), 78
method), 79
PartialUnit (class in educe.stac.annotation), 103
read_annotation_file() (in module educe.glozz), 117
pdtb_lex (educe.stac.learning.features.FeatureInput at- read_annotation_file() (in module educe.rst_dt.parse), 73
tribute), 78
read_corenlp_result() (in module educe.stac.corenlp), 106
PdtbItem (class in educe.pdtb.parse), 56
read_corpus() (in module educe.pdtb.util.args), 53
PdtbLexKeyGroup (class in educe.stac.learning.features), read_corpus() (in module educe.rst_dt.util.args), 66
79
read_corpus() (in module educe.stac.util.args), 98
player_addresees()
(in
module read_corpus_inputs()
(in
module
educe.stac.learning.features), 83
educe.stac.learning.features), 84
players (educe.stac.learning.features.DocumentPlus at- read_corpus_with_unannotated()
(in
module
tribute), 77
educe.stac.util.args), 98
players_for_doc()
(in
module read_entries()
(educe.stac.lexicon.wordclass.LexEntry
educe.stac.learning.features), 83
class method), 86
position() (educe.annotation.Unit method), 115
read_entry()
(educe.stac.lexicon.wordclass.LexEntry
position_in_dialogue()
(in
module
class method), 87
educe.stac.learning.features), 83
read_file() (educe.stac.lexicon.wordclass.Lexicon class
position_in_game()
(in
module
method), 87
educe.stac.learning.features), 83
read_lexicon()
(in
module
position_of_speaker_first_turn()
(in
module
educe.stac.lexicon.pdtb_markers), 86
educe.stac.learning.features), 83
read_node() (in module educe.glozz), 117
post_basic_category_index()
(in
module read_pdtb_lexicon()
(in
module
educe.ptb.annotation), 58
educe.stac.learning.features), 84
postags (educe.stac.learning.features.FeatureInput at- read_pdtbx_file() (in module educe.pdtb.pdtbx), 57
tribute), 78
read_Relation() (in module educe.pdtb.pdtbx), 57
powerset() (in module educe.stac.rfc), 112
read_Relations() (in module educe.pdtb.pdtbx), 57
precision() (educe.stac.util.showscores.Score method), read_results() (in module educe.stac.corenlp), 106
102
read_tags() (in module educe.stac.postag), 111
preprocess() (educe.rst_dt.learning.base.DocumentPlusPreprocessor
read_token_file() (in module educe.external.postag), 47
method), 60
Reader (class in educe.corpus), 116
PreprocessingSource
(class
in Reader (class in educe.pdtb.corpus), 55
educe.external.stanford_xml_reader), 48
Reader (class in educe.rst_dt.corpus), 69
prettify() (in module educe.stac.util.prettifyxml), 101
Reader (class in educe.stac.corpus), 107
process()
(educe.external.corenlp.CoreNlpWrapper reader() (in module educe.pdtb.ptb), 57
method), 45
real_dialogue_act()
(in
module
product_features()
(in
module
educe.stac.learning.features), 84
educe.rst_dt.learning.features), 62
real_roots_idx()
(educe.rst_dt.deptree.RstDepTree
product_features()
(in
module
method), 71
educe.rst_dt.learning.features_dev), 64
recall() (educe.stac.util.showscores.Score method), 102
Index
141
educe Documentation, Release 0.1
recursive_cdu_heads() (educe.stac.graph.Graph method),
110
reflow() (in module educe.stac.util.annotate), 97
rel (educe.rst_dt.annotation.Node attribute), 68
rel_link_item()
(in
module
educe.stac.sanity.checks.graph), 92
rel_links() (educe.graph.Graph method), 121
Relation (class in educe.annotation), 113
Relation (class in educe.pdtb.parse), 56
relation_dict() (in module educe.stac.learning.features),
84
relation_labels() (in module educe.stac.annotation), 104
Relation_xml() (in module educe.pdtb.pdtbx), 57
RelationItem (class in educe.stac.sanity.common), 93
relations() (educe.graph.Graph method), 121
relations() (educe.rst_dt.document_plus.DocumentPlus
method), 72
Relations_xml() (in module educe.pdtb.pdtbx), 57
relative() (educe.annotation.Span method), 114
relative_indices() (in module educe.util), 123
RelInst (class in educe.rst_dt.sdrt), 74
RelKeys (class in educe.pdtb.util.features), 54
RelSpan (class in educe.annotation), 113
RelSubgroup (class in educe.pdtb.util.features), 54
RelSubGroup_Core (class in educe.pdtb.util.features), 54
rename_ids() (in module educe.stac.util.doc), 100
RENAMES (in module educe.stac.annotation), 103
report() (educe.stac.sanity.report.HtmlReport method), 96
ReportItem (class in educe.stac.sanity.report), 96
reset() (educe.stac.util.glozz.TimestampCache method),
101
retarget() (in module educe.stac.util.doc), 100
rfc_violations()
(in
module
educe.stac.sanity.checks.graph), 92
ROOT (in module educe.stac.fusion), 109
rough_type() (in module educe.stac.sanity.common), 94
rough_type() (in module educe.stac.util.annotate), 97
rst_to_glozz_sdrt() (in module educe.rst_dt.sdrt), 74
rst_to_sdrt() (in module educe.rst_dt.sdrt), 74
RSTContext (class in educe.rst_dt.annotation), 68
RstDepTree (class in educe.rst_dt.deptree), 70
RstDtException, 71
RstDtParser (class in educe.rst_dt.corpus), 69
RstRelationConverter (class in educe.rst_dt.corpus), 70
RSTTree (class in educe.rst_dt.annotation), 68
RSTTreeException, 68
run() (educe.stac.sanity.main.SanityChecker method), 95
run() (in module educe.stac.sanity.checks.annotation), 89
run() (in module educe.stac.sanity.checks.glozz), 91
run() (in module educe.stac.sanity.checks.graph), 92
run() (in module educe.stac.sanity.checks.type_err), 93
run_checks() (in module educe.stac.sanity.main), 95
run_pipeline() (in module educe.stac.corenlp), 106
run_tagger() (in module educe.stac.postag), 111
142
S
same_speaker() (in module educe.stac.learning.features),
84
same_turn() (in module educe.stac.learning.features), 84
sanity_check_order() (in module educe.stac.sanity.main),
95
SanityChecker (class in educe.stac.sanity.main), 95
save_document() (in module educe.stac.util.output), 101
Schema (class in educe.annotation), 113
schema_text() (in module educe.stac.util.annotate), 97
SchemaItem (class in educe.stac.sanity.common), 93
Score (class in educe.stac.util.showscores), 101
search_anaphora()
(in
module
educe.stac.sanity.checks.type_err), 93
search_for_fixme_features()
(in
module
educe.stac.sanity.checks.annotation), 89
search_for_glozz_relations()
(in
module
educe.stac.sanity.common), 94
search_for_glozz_schema()
(in
module
educe.stac.sanity.common), 94
search_for_missing_rel_feats()
(in
module
educe.stac.sanity.checks.annotation), 89
search_for_missing_unit_feats()
(in
module
educe.stac.sanity.checks.annotation), 90
search_for_unexpected_feats()
(in
module
educe.stac.sanity.checks.annotation), 90
search_glozz_off_by_one()
(in
module
educe.stac.sanity.checks.glozz), 91
search_glozz_units()
(in
module
educe.stac.sanity.common), 94
search_graph_cdu_overlap()
(in
module
educe.stac.sanity.checks.graph), 92
search_graph_cdus()
(in
module
educe.stac.sanity.checks.graph), 92
search_graph_edus()
(in
module
educe.stac.sanity.checks.graph), 92
search_graph_relations()
(in
module
educe.stac.sanity.checks.graph), 93
search_in_glozz_schema()
(in
module
educe.stac.sanity.common), 94
search_preferences()
(in
module
educe.stac.sanity.checks.type_err), 93
search_resource_groups()
(in
module
educe.stac.sanity.checks.type_err), 93
SearchableTree (class in educe.external.parser), 46
segment() (educe.rst_dt.corpus.RstDtParser method), 70
Selection (class in educe.pdtb.parse), 56
SemClass (class in educe.pdtb.parse), 56
Sentence (class in educe.rst_dt.text), 75
sentences
(educe.rst_dt.annotation.RSTContext
attribute), 68
sentences (educe.rst_dt.text.Paragraph attribute), 75
set_addressees() (in module educe.stac.annotation), 104
set_anno_author() (in module educe.stac.util.glozz), 101
Index
educe Documentation, Release 0.1
set_anno_date() (in module educe.stac.util.glozz), 101
set_context() (educe.rst_dt.annotation.EDU method), 67
set_has_errors()
(educe.stac.sanity.report.HtmlReport
method), 96
set_origin() (educe.annotation.Document method), 113
set_origin() (educe.glozz.GlozzDocument method), 117
set_origin() (educe.rst_dt.annotation.EDU method), 67
set_origin() (educe.rst_dt.annotation.RSTTree method), 68
set_origin() (educe.rst_dt.annotation.SimpleRSTTree method), 69
set_origin() (educe.rst_dt.deptree.RstDepTree method), 71
set_root() (educe.rst_dt.deptree.RstDepTree method), 71
Severity (class in educe.stac.sanity.report), 97
sf_cache (educe.stac.learning.features.DocEnv attribute), 77
sf_cache (educe.stac.learning.features.EduGap attribute), 77
shared() (educe.stac.util.showscores.Score method), 102
shift() (educe.annotation.Span method), 114
shift_annotations() (in module educe.stac.util.doc), 100
shift_char() (in module educe.stac.oneoff.weave), 88
shift_span() (in module educe.stac.oneoff.weave), 88
show_diff() (in module educe.stac.util.annotate), 97
show_multi() (in module educe.stac.util.showscores), 102
show_pair() (in module educe.stac.util.showscores), 102
SimpleReportItem (class in educe.stac.sanity.report), 97
SimpleRSTTree (class in educe.rst_dt.annotation), 68
SingleArgKeys (class in educe.pdtb.util.features), 54
SingleArgSubgroup (class in educe.pdtb.util.features), 54
SingleEduKeys (class in educe.stac.learning.features), 80
SingleEduSubgroup (class in educe.stac.learning.features), 80
SingleEduSubgroup_Chat (class in educe.stac.learning.features), 80
SingleEduSubgroup_Parser (class in educe.stac.learning.features), 80
SingleEduSubgroup_Punct (class in educe.stac.learning.features), 80
SingleEduSubgroup_Token (class in educe.stac.learning.features), 80
slurp() (educe.corpus.Reader method), 116
slurp_subcorpus() (educe.corpus.Reader method), 116
slurp_subcorpus() (educe.pdtb.corpus.Reader method), 55
slurp_subcorpus() (educe.rst_dt.corpus.Reader method), 69
slurp_subcorpus() (educe.stac.corpus.Reader method), 107
snippet() (in module educe.stac.sanity.report), 97
sorted_by_span() (in module educe.stac.postag), 111
sorted_first_outermost() (educe.stac.graph.Graph method), 110
sorted_first_widest() (in module educe.stac.context), 105
source (educe.annotation.Relation attribute), 113
space_join() (in module educe.learning.util), 52
Span (class in educe.annotation), 113
span (educe.rst_dt.annotation.EDU attribute), 67
span (educe.rst_dt.annotation.Node attribute), 68
span() (in module educe.stac.sanity.html), 94
spans_to_str() (in module educe.pdtb.util.features), 54
SparseDictReader (class in educe.learning.csv), 49
SparseDictReader (class in educe.stac.util.csv), 98
speaker() (educe.stac.context.Context method), 105
speaker() (educe.stac.fusion.EDU method), 108
speaker() (in module educe.stac.annotation), 104
speaker_already_spoken_in_dialogue() (in module educe.stac.learning.features), 84
speaker_id() (in module educe.stac.learning.features), 84
speaker_started_the_dialogue() (in module educe.stac.learning.features), 84
speakers() (in module educe.stac.context), 105
speakers() (in module educe.stac.rfc), 112
speakers_first_turn_in_dialogue() (in module educe.stac.learning.features), 84
split_doc() (in module educe.stac.util.doc), 100
split_feature_space() (in module educe.rst_dt.learning.features_dev), 64
split_relations() (in module educe.pdtb.parse), 56
split_turn_text() (in module educe.stac.annotation), 104
split_type() (in module educe.stac.annotation), 104
spurious() (educe.stac.util.showscores.Score method), 102
src_gaps() (in module educe.stac.oneoff.weave), 89
StacDocException, 99
Standoff (class in educe.annotation), 114
status_len (educe.stac.sanity.checks.glozz.MissingItem attribute), 90
STRING (educe.learning.keys.Substance attribute), 52
strip_cdus() (educe.stac.graph.Graph method), 110
strip_cdus() (in module educe.stac.learning.features), 84
strip_fixme() (in module educe.stac.util.doc), 100
strip_subcategory() (in module educe.ptb.annotation), 58
subgrouping() (educe.stac.fusion.EDU method), 108
subject_lemmas() (in module educe.stac.learning.features), 84
subreport_path() (educe.stac.sanity.report.HtmlReport method), 96
Substance (class in educe.learning.keys), 51
substance (educe.learning.keys.Key attribute), 51
summarise_anno() (in module educe.stac.sanity.common), 94
summarise_anno_html() (in module educe.stac.sanity.common), 94
Sup (class in educe.pdtb.parse), 56
T
t1 (educe.annotation.RelSpan attribute), 113
t2 (educe.annotation.RelSpan attribute), 113
tagger_cmd() (in module educe.stac.postag), 111
tagger_file_name() (in module educe.stac.postag), 111
target (educe.annotation.Relation attribute), 113
terminals() (educe.annotation.Schema method), 113
test_file() (in module educe.external.stanford_xml_reader), 48
text() (educe.annotation.Document method), 113
text() (educe.rst_dt.annotation.EDU method), 67
text() (educe.rst_dt.annotation.RSTContext method), 68
text() (educe.rst_dt.annotation.RSTTree method), 68
text() (educe.stac.fusion.EDU method), 109
text() (educe.stac.sanity.checks.glozz.BadIdItem method), 90
text() (educe.stac.sanity.checks.glozz.DuplicateItem method), 90
text() (educe.stac.sanity.report.ReportItem method), 96
text() (educe.stac.sanity.report.SimpleReportItem method), 97
text_span() (educe.annotation.Standoff method), 115
text_span() (educe.external.parser.ConstituencyTree method), 45
text_span() (educe.rst_dt.annotation.RSTTree method), 68
text_span() (educe.rst_dt.annotation.SimpleRSTTree method), 69
text_span() (educe.rst_dt.text.Sentence method), 75
text_span() (educe.stac.sanity.checks.glozz.MissingItem method), 90
tgt_gaps() (in module educe.stac.oneoff.weave), 89
ThreadedRfc (class in educe.stac.rfc), 111
TimestampCache (class in educe.stac.util.glozz), 100
to_binary_rst_tree() (educe.rst_dt.annotation.SimpleRSTTree class method), 69
to_dict() (educe.stac.util.csv.Turn method), 99
to_xml() (educe.glozz.GlozzDocument method), 117
Token (class in educe.external.postag), 46
token_filter_li2014() (in module educe.rst_dt.learning.features_dev), 64
token_filter_li2014() (in module educe.rst_dt.learning.features_li2014), 66
token_spans() (in module educe.external.postag), 47
tokenize() (educe.rst_dt.ptb.PtbParser method), 73
topdown() (educe.external.parser.SearchableTree method), 46
topdown_smallest() (educe.external.parser.SearchableTree method), 46
transform() (educe.learning.keygroup_vectorizer.KeyGroupVectorizer method), 50
transform() (educe.rst_dt.learning.doc_vectorizer.DocumentCountVectorizer method), 61
transform() (educe.rst_dt.learning.doc_vectorizer.DocumentLabelExtractor method), 62
transform() (educe.rst_dt.learning.features_dev.LecsieFeats method), 63
transform() (educe.stac.learning.doc_vectorizer.DialogueActVectorizer method), 76
transform() (educe.stac.learning.doc_vectorizer.LabelVectorizer method), 76
transform_tree() (in module educe.ptb.annotation), 58
treenode() (in module educe.internalutil), 122
tune_for_csv() (in module educe.learning.csv), 49
tuple_feature() (in module educe.learning.util), 52
Turn (class in educe.stac.util.csv), 98
turn_follows_gap() (in module educe.stac.learning.features), 84
turn_id() (in module educe.stac.annotation), 104
turn_id_text() (in module educe.stac.corenlp), 106
turns_between (educe.stac.learning.features.EduGap attribute), 77
turns_in_span() (in module educe.stac.context), 105
TweakedToken (class in educe.ptb.annotation), 58
twin() (in module educe.stac.annotation), 104
twin_from() (in module educe.stac.annotation), 104
twin_key() (in module educe.stac.corpus), 107
type() (educe.graph.AttrsMixin method), 119
type_text() (in module educe.stac.learning.features), 84
U
unannotated_key() (in module educe.stac.util.doc), 100
underscore() (in module educe.learning.util), 52
unexpected_features() (in module educe.stac.sanity.checks.annotation), 90
Unit (class in educe.annotation), 115
unitdoc (educe.stac.learning.features.DocumentPlus attribute), 77
UnitItem (class in educe.stac.sanity.common), 93
Updates (class in educe.stac.oneoff.weave), 87
Utf8DictReader (class in educe.learning.csv), 49
Utf8DictReader (class in educe.stac.util.csv), 99
Utf8DictWriter (class in educe.learning.csv), 49
Utf8DictWriter (class in educe.stac.util.csv), 99
V
verbnet_entries (educe.stac.learning.features.FeatureInput attribute), 78
VerbNetEntry (class in educe.stac.learning.features), 80
VerbNetLexKeyGroup (class in educe.stac.learning.features), 81
violations() (educe.stac.rfc.BasicRfc method), 111
W
WeaveException, 88
without_cdus() (educe.stac.graph.Graph method), 110
word_first() (in module educe.stac.learning.features), 84
word_last() (in module educe.stac.learning.features), 84
WrappedToken (class in educe.stac.graph), 110
write() (educe.stac.sanity.report.HtmlReport method), 96
write_annotation_file() (in module educe.glozz), 117
write_annotation_file() (in module educe.stac.corpus), 107
write_dot_graph() (in module educe.stac.util.output), 101
write_index() (in module educe.stac.sanity.main), 95
write_pdtbx_file() (in module educe.pdtb.pdtbx), 57
writeheader() (educe.learning.csv.Utf8DictWriter method), 49
writeheader() (educe.stac.util.csv.Utf8DictWriter method), 99
writerow() (educe.learning.csv.Utf8DictWriter method), 49
writerow() (educe.stac.util.csv.Utf8DictWriter method), 99
writerows() (educe.learning.csv.Utf8DictWriter method), 49
writerows() (educe.stac.util.csv.Utf8DictWriter method), 99
X
xml_unescape() (in module educe.external.stanford_xml_reader), 49