# Novel Methods in Information Retrieval ```Novel Methods in Information Retrieval
Vahed Qazvinian
School of Information and
Department of EECS
University of Michigan
Winter, 2008
Contents

1 Introduction
2 Random Walks
2.1 Preliminaries
2.2 Random Walks and Electrical Circuits
2.3 Random Walks on Graphs
3 Semi-supervised Learning
3.1 Survey
3.1.1 Co-training
3.1.2 Graph-based Methods
3.2 Semi-supervised Classification with Random Walks
3.3 Graph-Based Semi-supervised learning with Harmonic Functions
4 Evaluation in Information Retrieval
4.1 Overview
4.1.1 Binary Relevance, Set-based Evaluation
4.1.2 Evaluation of Ranked Retrieval
4.1.3 Non-binary Relevance Evaluation
4.2 Assessing Agreement, Kappa Statistics
4.2.1 An Example
4.3 Statistical testing of retrieval experiments
5 Blog Analysis
5.1 Introduction
5.2 Implicit Structure of Blogs
5.2.1 iRank
5.3 Information Diffusion in Blogs
5.4 Bursty Evolution
5.4.1 Time Graphs
5.4.2 Method
6 Lexical Networks
6.1 Introduction
6.1.1 Small World of Human Languages
6.2 Language Growth Model
6.3 Complex Networks and Text Quality
6.3.1 Text Quality
6.3.2 Summary Evaluation
6.4 Compositionality in Quantitative Semantics
7 Text Summarization
7.1 LexRank
7.2 Summarization of Medical Documents
7.3 Summarization Evaluation
7.4 MMR
8 Graph Mincuts
8.1 Semi-supervised Learning
8.1.1 Why This Works
8.2 Randomized Mincuts
8.2.1 Graph Construction
8.3 Sentiment Analysis
8.4 Energy Minimization
8.4.1 Moves
9 Graph Learning I
9.1 Summarization
9.2 Cost-effective Outbreak
9.2.1 Web Projections
9.3 Co-clustering
10 Graph Learning II
10.1 Dimensionality Reduction
10.2 Semi-Supervised Learning
10.3 Diffusion Kernels
10.3.1 Reformulation
11 Sentiment Analysis I
11.1 Introduction
11.2 Unstructured Data
11.3 Appraisal Expressions
11.4 Blog Sentiment
11.5 Online Product Review
12 Sentiment Analysis II
12.1 Movie Sale Prediction
12.2 Subjectivity Summarization
12.3 Unsupervised Classification of Reviews
12.4 Opinion Strength
13 Spectral Methods
13.1 Spectral Partitioning
13.2 Community Finding Using Eigenvectors
13.3 Spectral Learning
13.4 Transductive learning
References
1 Introduction

Information retrieval (IR) is the science of searching for information within documents, relational databases, and the World Wide Web (WWW). In this work, we review some novel methods in IR theory. This report covers a number of state-of-the-art methods across a wide range of topics, with a focus on graph-based techniques in IR.
This report is based on the literature review done as a requirement of the Directed Study course, EECS 599, at the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor. The first author would like to thank Prof. Dragomir R. Radev for his discussions, reviews, and useful suggestions in support of this work.
2 Random Walks

2.1 Preliminaries

A random walk is a mathematical formalization of successive movements in random directions. This analysis applies to computer science, physics, probability theory, economics, and other fields of study. In general, the position of a random walker is determined by its previous state (position) and a random variable that determines the subsequent step length and direction. More formally,
$$X(t + \tau) = X(t) + \Phi(\tau)$$
where $X(t)$ is the position of the random walker at time $t$ and $\Phi(\tau)$ is a random variable that determines the rule for taking the next step.
Random walks are categorized according to whether they are discrete or continuous, biased or unbiased, one-dimensional or higher-dimensional. The simplest form of a random walk is a discrete one-dimensional walk in which $\Phi(\tau)$ is a Bernoulli random variable, with $p$ being the probability of a step in the positive direction and $1 - p$ the probability of a step in the negative direction. Polya's theorem states that a random walker on an infinite lattice in $d$-dimensional space is bound to return to the starting point when $d \le 2$, but has a positive probability of escaping to infinity without returning to the starting point when $d \ge 3$. A walk that is certain to return to its starting point is called recurrent, while a walk with a positive probability of escaping is called transient.
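The discrete one-dimensional walk described above can be simulated in a few lines. The following is a minimal sketch, not taken from the report: the function names `simulate_walk` and `returns_to_origin`, the unit step length, and the seed are illustrative assumptions.

```python
import random

def simulate_walk(steps, p=0.5, seed=None):
    """Return the positions X(0), X(1), ..., X(steps) of a 1-D walk with unit steps."""
    rng = random.Random(seed)
    position = 0
    path = [position]
    for _ in range(steps):
        # Phi is +1 with probability p and -1 with probability 1 - p.
        position += 1 if rng.random() < p else -1
        path.append(position)
    return path

def returns_to_origin(path):
    """Count how many times the walk revisits position 0 after the start."""
    return sum(1 for x in path[1:] if x == 0)

path = simulate_walk(steps=1000, p=0.5, seed=0)
print(len(path), returns_to_origin(path))
```

With `p = 0.5` the walk is unbiased; moving `p` away from 0.5 makes it drift, and the number of returns to the origin typically drops.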
An infinite lattice can be approximated by a very large finite graph contained in it. Let $G^{(r)}$ be the subgraph of an infinite lattice in which no node has a lattice distance greater than $r$ from the origin; that is, no shortest path along the edges of the lattice starting at the origin has length greater than $r$. Let $\partial G^{(r)}$ be the sphere of radius $r$ about the origin (the points at distance exactly $r$ from the origin). In a 2-dimensional lattice, $\partial G^{(r)}$ looks like a square. Now consider a random walker that starts its walk from the origin. If $p_{esc}^{(r)}$ is the probability that the random walker reaches $\partial G^{(r)}$ before returning to the origin, then the escape probability of the random walker is $p_{esc} = \lim_{r \to \infty} p_{esc}^{(r)}$, where $p_{esc}^{(r)}$ decreases as $r$ increases. With this definition, if this limit is 0 the walk is recurrent; otherwise it is transient.
In the first part of this chapter we give an overview of the electrical approach to proving Polya's theorem discussed by Doyle and Snell, and in the second part we discuss the work described by Aldous and Fill.
2.2 Random Walks and Electrical Circuits

One approach has been to interpret Polya's theorem as a statement about electric networks, and then to prove the theorem from the point of view of electrical theory.
To determine $p_{esc}^{(r)}$ electrically, Doyle and Snell ground all the points of $\partial G^{(r)}$, maintain the origin at one volt, and measure the current $i^{(r)}$ flowing into the circuit. They show that
$$p_{esc}^{(r)} = \frac{i^{(r)}}{2d}$$
where $d$ is the dimension of the lattice. Since the voltage is 1, $i^{(r)}$ is the effective conductance between the origin and $\partial G^{(r)}$:
$$i^{(r)} = \frac{1}{R_{eff}^{(r)}}$$
so
$$p_{esc}^{(r)} = \frac{1}{2d\,R_{eff}^{(r)}}$$
If the effective resistance from the origin to infinity is $R_{eff}$, then
$$p_{esc} = \frac{1}{2d\,R_{eff}}$$
Thus if the effective resistance of the $d$-dimensional lattice is infinite, then the escape probability is zero and the walk is recurrent. In other words: "simple random walk on a graph is recurrent if and only if a corresponding network of 1-ohm resistors has infinite resistance out to infinity". It is trivial that an infinite line of resistors has infinite resistance. Using this fact, Doyle and Snell conclude that simple random walk on the 1-dimensional lattice is recurrent, which confirms Polya's theorem.
For the two-dimensional case, they use the Shorting Law to calculate the resistance of the lattice network out to infinity as
$$\sum_{n=1}^{\infty} \frac{1}{4n + 8}$$
which tends to $\infty$. So in the two-dimensional case:
$$p_{esc} = \frac{1}{2d\,R_{eff}} = \frac{1}{2d \sum_{n=1}^{\infty} \frac{1}{4n+8}} = 0$$
which again confirms that simple random walk on the 2-dimensional lattice is recurrent, as stated by Polya's theorem.
For the three-dimensional lattice, denote by $p(a, b, c; n)$ the probability that the random walker is at $(a, b, c)$ at time $n$. Then
$$p(0, 0, 0; 0) = 1$$
and
$$\begin{aligned}
p(a, b, c; n) = {} & \tfrac{1}{6}\, p(a - 1, b, c; n - 1) + \tfrac{1}{6}\, p(a + 1, b, c; n - 1) \\
& + \tfrac{1}{6}\, p(a, b - 1, c; n - 1) + \tfrac{1}{6}\, p(a, b + 1, c; n - 1) \\
& + \tfrac{1}{6}\, p(a, b, c - 1; n - 1) + \tfrac{1}{6}\, p(a, b, c + 1; n - 1)
\end{aligned}$$
Using the technique of generating functions they reach
$$p(a, b, c; n) = \frac{1}{(2\pi)^3} \int_{-\pi}^{\pi}\int_{-\pi}^{\pi}\int_{-\pi}^{\pi} \left(\frac{\cos x + \cos y + \cos z}{3}\right)^{n} \cos(xa)\cos(yb)\cos(zc)\,dx\,dy\,dz$$
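Then the expected number of times the walker is at the origin is obtained by setting $a = b = c = 0$ and summing over $n$:
$$m = \frac{3}{(2\pi)^3} \int_{-\pi}^{\pi}\int_{-\pi}^{\pi}\int_{-\pi}^{\pi} \frac{1}{3 - (\cos x + \cos y + \cos z)}\,dx\,dy\,dz$$
This integral has been evaluated in closed form in terms of gamma functions:
$$m = \frac{\sqrt{6}}{32\pi^3}\,\Gamma\!\left(\tfrac{1}{24}\right)\Gamma\!\left(\tfrac{5}{24}\right)\Gamma\!\left(\tfrac{7}{24}\right)\Gamma\!\left(\tfrac{11}{24}\right) = 1.516386\ldots$$
where
$$\Gamma(x) = \int_{0}^{\infty} e^{-t}\, t^{x-1}\,dt$$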
If $u$ is the probability that a random walker, starting at the origin, will return to the origin, then the probability that the walker will be there exactly $k$ times (counting the initial time) is $u^{k-1}(1 - u)$. Thus, if $m$ is the expected number of times at which the walker is at the origin:
$$m = \sum_{k=1}^{\infty} k\,u^{k-1}(1 - u) = \frac{1}{1 - u}$$
so
$$u = \frac{m - 1}{m}$$
This shows that, in this particular 3-dimensional lattice, the probability that a random walker starting at the origin returns to the origin is
$$u = 0.340537\ldots$$
which means that in the 3-dimensional lattice there is a non-zero probability that the random walker never returns to the origin, so the walk is transient.
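The two constants quoted above can be checked numerically from the gamma-function expression for $m$ and the relation $u = (m - 1)/m$. This is a small sketch using only the Python standard library; the function name `expected_visits_3d` is an illustrative assumption.

```python
import math

def expected_visits_3d():
    """m = sqrt(6) / (32 pi^3) * Gamma(1/24) Gamma(5/24) Gamma(7/24) Gamma(11/24)."""
    gammas = [math.gamma(k / 24) for k in (1, 5, 7, 11)]
    return math.sqrt(6) / (32 * math.pi ** 3) * math.prod(gammas)

m = expected_visits_3d()   # expected number of times the walker is at the origin
u = (m - 1) / m            # probability of ever returning to the origin
print(round(m, 6), round(u, 6))
```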
2.3 Random Walks on Graphs

In this section, we mainly focus on the graph application of Markov chains.
A Markov chain is a stochastic process $\{X_n, n = 0, 1, 2, \cdots\}$ that takes on a finite or countable number of possible values. Whenever the process is in state $i$, there is a fixed probability $P_{ij}$ that it will be in state $j$ at the next step:
$$P\{X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \cdots, X_1 = i_1, X_0 = i_0\} = P_{ij}$$
where
$$P_{ij} \ge 0, \quad i, j \ge 0, \quad \sum_{j=0}^{\infty} P_{ij} = 1$$
$P$ is called the one-step transition probability matrix:
$$P = \begin{pmatrix}
P_{00} & P_{01} & P_{02} & \cdots \\
P_{10} & P_{11} & P_{12} & \cdots \\
\vdots & \vdots & \vdots & \\
P_{i0} & P_{i1} & P_{i2} & \cdots \\
\vdots & \vdots & \vdots &
\end{pmatrix}$$
If $\pi$ is the stationary distribution of a finite irreducible discrete-time chain $(X_t)$, the chain is reversible if
$$\pi_i\, p_{ij} = \pi_j\, p_{ji}$$
for all $i, j$. In fact, if $(X_t)$ is the stationary chain ($X_0$ has distribution $\pi$) then, in distribution,
$$(X_0, X_1, \cdots, X_t) \overset{d}{=} (X_t, X_{t-1}, \cdots, X_0)$$
In the fourth section, the authors discuss coupling, a methodology for finding an upper bound on
$$\bar{d}(t) := \max_{i,j} \left\| P_i(X_t \in \cdot) - P_j(X_t \in \cdot) \right\|$$
A coupling is a joint process $((X_t^{(i)}, X_t^{(j)}), t \ge 0)$ such that $X_t^{(i)}$ is distributed as the chain started at $i$, and $X_t^{(j)}$ is distributed as the chain started at $j$. The assumption is then made that there is a random time $T^{ij} < \infty$ such that $X_t^{(i)} = X_t^{(j)}$ for $T^{ij} \le t < \infty$; $T^{ij}$ is called the coupling time. The coupling inequality is
$$\left\| P_i(X_t \in \cdot) - P_j(X_t \in \cdot) \right\| = \left\| P(X_t^{(i)} \in \cdot) - P(X_t^{(j)} \in \cdot) \right\| \le P(X_t^{(i)} \ne X_t^{(j)}) \le P(T^{ij} > t)$$
Explicit calculations for random walks on large graphs can be done in two settings: first, when the graph is essentially 1-dimensional, and second, when the graph is highly symmetric. Aldous and Fill further discuss trees, and show that on an $n$-vertex tree the stationary distribution of a random walk is
$$\pi_v = \frac{r_v}{2(n - 1)}$$
where $r_v$ is the degree of vertex $v$.
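The stationary distribution on a tree can be verified directly: applying one step of the simple random walk to $\pi_v = r_v / (2(n - 1))$ should leave it unchanged. The sketch below does this for a small tree; the tree itself and the helper names `stationary_on_tree` and `one_step` are illustrative assumptions.

```python
def stationary_on_tree(edges):
    """pi_v = r_v / (2 (n - 1)) for a tree given as a list of (u, v) edges."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    n = len(degree)
    return {v: r / (2 * (n - 1)) for v, r in degree.items()}

def one_step(dist, edges):
    """Apply one step of the simple random walk to a distribution over vertices."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)
    new = {v: 0.0 for v in neighbors}
    for v, mass in dist.items():
        for w in neighbors[v]:
            # The walker at v moves to each neighbor with equal probability.
            new[w] += mass / len(neighbors[v])
    return new

# Hypothetical 5-vertex tree: vertex 0 joined to 1, 2, 3, and vertex 4 hanging off 1.
tree = [(0, 1), (0, 2), (0, 3), (1, 4)]
pi = stationary_on_tree(tree)
after = one_step(pi, tree)
print(pi)
print(after)
```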
3 Semi-supervised Learning

3.1 Survey

Many semi-supervised learning methods have been proposed, including EM with generative mixture models, self-training, co-training, transductive support vector machines, and graph-based methods. There is no direct answer as to which one should be used or which is the best. The reason is that labeled data is scarce and semi-supervised learning methods tend to make strong model assumptions, which are highly dependent on the problem structure.
The big picture is that semi-supervised learning methods use unlabeled data, which carries information about $p(x)$, to modify the hypotheses $p(y|x)$ derived from labeled data alone. In this section, unless otherwise noted, semi-supervised learning refers to semi-supervised classification, in which one has additional unlabeled data and the goal is classification.
3.1.1 Co-training

Co-training is a semi-supervised learning technique which makes three preliminary assumptions:
1. The feature space can be split into two separate sets.
2. Each feature set is sufficient to train a good classifier.
3. The two sets are conditionally independent given the class.
Initially, each classifier is trained using the initial labeled data on the corresponding feature set. Then both classifiers classify all unlabeled data. In each iteration, each classifier adds to the labeled data the n negative data points and p positive data points about which it is most confident, and returns the rest to the shared unlabeled data. Both classifiers are then re-trained using the new labeled data.
Input:
• F = (F_1, F_2, ..., F_m), which are m semantically different feature sets (that can be considered as different views of the data)
• C = (C_1, C_2, ..., C_m), which are m supervised learning classifiers (each of which corresponds to a unique feature set)
• L is a set of labeled training data
• U is a set of unlabeled training data
The procedure of co-training is as follows.
• Train each classifier C_i initially on L with respect to F_i
• For each classifier C_i:
– C_i labels the unlabeled training data from U based on F_i
– C_i chooses the top p positive and top n negative labeled examples E from U according to the confidence of the prediction
– Remove E from U
– Add E into L with the corresponding labels predicted by C_i
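The procedure above can be sketched with two toy classifiers, one per view. This is a minimal illustration under stated assumptions rather than the original algorithm's implementation: each view is a single real number, the base learner is a hypothetical nearest-centroid classifier (`fit_centroids`, `predict`), and each classifier is re-trained just before its turn.

```python
def fit_centroids(values, labels):
    """Nearest-centroid 'classifier' for a single real-valued view (assumed base learner)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return sum(neg) / len(neg), sum(pos) / len(pos)

def predict(model, value):
    """Return (label, confidence); confidence is the gap between the centroid distances."""
    neg_mean, pos_mean = model
    d_neg, d_pos = abs(value - neg_mean), abs(value - pos_mean)
    return (1 if d_pos < d_neg else 0), abs(d_neg - d_pos)

def co_train(labeled, unlabeled, p=1, n=1, rounds=5):
    """labeled: list of ((view1, view2), y); unlabeled: list of (view1, view2)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):                      # one classifier C_i per feature set F_i
            if not unlabeled:
                return labeled
            model = fit_centroids([x[view] for x, _ in labeled],
                                  [y for _, y in labeled])
            scored = [(predict(model, x[view]), x) for x in unlabeled]
            # Choose the top-p positive and top-n negative examples by confidence.
            pos = sorted((s for s in scored if s[0][0] == 1), key=lambda s: -s[0][1])[:p]
            neg = sorted((s for s in scored if s[0][0] == 0), key=lambda s: -s[0][1])[:n]
            for (label, _), x in pos + neg:
                unlabeled.remove(x)              # remove E from U
                labeled.append((x, label))       # add E to L with the predicted label
    return labeled

L0 = [((0.0, 0.1), 0), ((1.0, 0.9), 1)]
U0 = [(0.1, 0.0), (0.9, 1.0), (0.2, 0.3), (0.8, 0.7)]
result = co_train(L0, U0)
print(result)
```

On this toy data the first view labels the two most extreme points, and the second view, trained on the enlarged labeled set, labels the remaining two.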
More generally, we can define learning paradigms that use the agreement among different learners, in which multiple hypotheses are trained from the shared labeled data and are required to make predictions on the given unlabeled data. In general, multiview learning models do not require
3.1.2 Graph-based Methods

Graph-based methods operate on graphs whose nodes are the labeled and unlabeled examples and whose edges encode the similarity of nodes. These methods estimate a function f on the graph, where f should satisfy two properties:
1. It should be close to the given labels y_L on the labeled nodes. (loss function)
2. It should be smooth on the whole graph. (regularizer)
Most graph-based methods differ from one another only in their choice of loss function and regularizer. It is more important to construct a good graph than to choose among the different methods. We itemize some of the methods and discuss graph construction after that.
• Mincut
Blum and Chawla represent the problem as a mincut problem in which positive labels act as sources and negative labels act as sinks. The goal is to find a minimum set of edges whose removal blocks all flow from the sources to the sinks. After that, all nodes connected to the sources are labeled as positive, and all nodes connected to the sinks are labeled as negative. The loss function is a quadratic loss with infinite weight
$$\infty \sum_{i \in L} (y_i - y_{i|L})^2$$
This ensures that the values on the labeled data are fixed at their given labels. The regularizer is
$$\frac{1}{2} \sum_{i,j} \omega_{ij}\, |y_i - y_j| = \frac{1}{2} \sum_{i,j} \omega_{ij}\, (y_i - y_j)^2$$
The above equality holds since y takes binary values. Putting it all together, the goal of the mincut method is to minimize
$$\infty \sum_{i \in L} (y_i - y_{i|L})^2 + \frac{1}{2} \sum_{i,j} \omega_{ij}\, (y_i - y_j)^2$$
given the constraint
$$y_i \in \{0, 1\}, \quad \forall i$$
• Gaussian Random Fields
Zhu et al. introduce the Gaussian random fields and harmonic function method, which is a relaxation of discrete Markov random fields. This can be viewed as having a quadratic loss function with infinite weight, so that the labeled data are fixed at the given label values, and a regularizer based on the combinatorial graph Laplacian $\Delta$:
$$\infty \sum_{i \in L} (f_i - y_i)^2 + \frac{1}{2} \sum_{i,j} \omega_{ij}\, (f_i - f_j)^2 = \infty \sum_{i \in L} (f_i - y_i)^2 + f^T \Delta f$$
Allowing $f_i \in \mathbb{R}$ is the key relaxation relative to Mincut.
• Local and Global Consistency
The method of local and global consistency uses the loss function
$$\sum_{i=1}^{n} (f_i - y_i)^2$$
but in the regularizer it uses a normalized Laplacian, that is, $D^{-1/2} \Delta D^{-1/2}$. The regularizer is
$$\frac{1}{2} \sum_{i,j} \omega_{ij} \left( \frac{f_i}{\sqrt{D_{ii}}} - \frac{f_j}{\sqrt{D_{jj}}} \right)^2 = f^T D^{-1/2} \Delta D^{-1/2} f$$
• Local Learning Regularization
The solution of graph-based methods can often be viewed as local averaging. In other words, the solution f(x_i) at an unlabeled point x_i is the weighted average of its neighbors' solutions. Since most of the nodes are unlabeled, we do not require the solution to be exactly equal to the local average, but regularize f so that the two are close. A more general approach is to generalize local averaging to a local linear fit. To do so, one builds a local linear model from the neighbors of x_i to predict the value at x_i. The solution f(x_i) is then regularized to be close to this predicted value.
• Tree-Based Bayes
In this method, a tree T is constructed with the labeled and unlabeled data as the leaf nodes, together with a mutation process. In the mutation process a label at the root propagates down to the leaves. While moving down along the edges, a label mutates at a constant rate. As a result, the tree uniquely defines the probability distribution P(X|Y) over discrete labelings Y. In fact, if two leaf nodes are closer in the tree, they have a higher probability of sharing the same label.
Although the graph is the core of graph-based methods, its construction for this purpose has not been studied widely. Hein and Maier propose a method to denoise point samples from a manifold. This can be considered a preprocessing step that builds a better, less noisy graph on which to do the semi-supervised learning. Such preprocessing results in better semi-supervised classification accuracy.
3.2 Semi-supervised Classification with Random Walks

In this section we review a method of semi-supervised classification using random walks described by Szummer and Jaakkola. They first create a weighted K-nearest neighbor graph with a given metric, and form the one-step transition probability matrix as
$$p_{ik} = \frac{\omega_{ik}}{\sum_j \omega_{ij}}$$
if i and k are neighboring nodes, and $p_{ik} = 0$ otherwise.
Two points are considered close if they have nearly the same distribution over the starting states. When $t \to \infty$, all the points are indistinguishable if the original neighbor graph is connected. If $P$ is the one-step transition probability matrix, then $P^t$ is the t-step transition probability matrix, which is row-stochastic, i.e. its rows sum to one. A point k, whether labeled or unlabeled, is interpreted as a sample from the t-step Markov random walk. The posterior probability of the label for point k is given by
$$P_{post}(y|k) = \sum_i P(y|i)\, P^t_{ik}$$
The class is then the one that maximizes the posterior probability:
$$c_k = \arg\max_c \{P_{post}(y = c|k)\}$$
The authors then use two separate techniques to estimate the unknown parameters $P(y|i)$: maximum likelihood with EM, and maximum margin subject to constraints. They conclude that the Markov random walk representation provides a robust approach to classifying datasets with significant manifold structure and very few labeled examples.
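The classification rule can be illustrated on a toy graph. The sketch below applies the formulas exactly as written in this section; the weight matrix `W`, the label parameters `label_probs` (which the authors estimate with EM or a margin criterion, and which are simply fixed by hand here), and all function names are illustrative assumptions.

```python
def transition_matrix(weights):
    """Row-normalise a weight matrix: p_ik = w_ik / sum_j w_ij."""
    return [[w / sum(row) for w in row] for row in weights]

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def t_step(P, t):
    """P^t by repeated multiplication (t >= 1)."""
    result = P
    for _ in range(t - 1):
        result = matmul(result, P)
    return result

def classify(P_t, label_probs, k):
    """argmax_c sum_i P(y = c | i) [P^t]_{ik}, following the formula in the text."""
    classes = label_probs[0].keys()
    scores = {c: sum(label_probs[i][c] * P_t[i][k] for i in range(len(P_t)))
              for c in classes}
    return max(scores, key=scores.get)

# Hypothetical chain of four points with self-loops; points 0 and 3 are labeled.
W = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 1, 1, 1],
     [0, 0, 1, 1]]
label_probs = [{0: 1.0, 1: 0.0}, {0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}, {0: 0.0, 1: 1.0}]
P2 = t_step(transition_matrix(W), 2)
print(classify(P2, label_probs, 1), classify(P2, label_probs, 2))
```

Each unlabeled point ends up with the label of the labeled point that is closer to it along the chain.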
3.3 Graph-Based Semi-supervised learning with Harmonic Functions

Most semi-supervised learning algorithms are based on the manifold structure of the data and the assumption that similar examples should belong to the same class. The goal in graph classification is to learn a function $f : V \to \mathbb{R}$. For all labeled nodes $v_i$ we have $f(v_i) \in \{0, 1\}$.
The desired outcome is that close and similar nodes are assigned to the same class. Therefore, an energy function is defined as:
$$E(f) = \frac{1}{2} \sum_{i,j} \omega_{ij}\, (f(i) - f(j))^2$$
Solving the classification problem now reduces to minimizing the energy function. A solution f to this minimization problem is a harmonic function with the following property:
$$f(j) = \frac{1}{d_j} \sum_{i \sim j} \omega_{ij}\, f(i)$$
for $j = l + 1, \cdots, l + u$. Note that the values of the labeled nodes are kept fixed (0 or 1), while the other nodes take values computed as the weighted average of their neighbors. The harmonic function is a real-valued relaxation of the mincut function. Unlike the mincut solution, which is not necessarily unique, the minimum-energy harmonic function is unique and efficiently computable. Although the harmonic function does not take discrete values, one can use domain knowledge or hard assignments to map from f to labels.
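The harmonic property suggests a simple iterative solver: clamp the labeled nodes and repeatedly replace each unlabeled value by the weighted average of its neighbors. The sketch below assumes this iterative scheme (a closed-form linear solve is also possible); the function name `harmonic_labels`, the iteration count, and the toy chain graph are illustrative.

```python
def harmonic_labels(weights, labels, iterations=200):
    """weights: symmetric weight matrix; labels: dict node -> 0 or 1 for labeled nodes."""
    n = len(weights)
    f = [float(labels.get(j, 0.5)) for j in range(n)]
    for _ in range(iterations):
        for j in range(n):
            if j in labels:
                continue                       # labeled values stay clamped
            d_j = sum(weights[j])              # weighted degree of node j
            f[j] = sum(weights[j][i] * f[i] for i in range(n)) / d_j
    return f

# Hypothetical chain 0 - 1 - 2 - 3 with unit weights; node 0 is labeled 0, node 3 is labeled 1.
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
f = harmonic_labels(chain, {0: 0, 3: 1})
hard = [1 if value >= 0.5 else 0 for value in f]   # one possible hard assignment
print(f, hard)
```

On the chain, the harmonic values interpolate linearly between the two labeled end points.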
| | Relevant | Not Relevant |
|---|---|---|
| Retrieved | true positive (tp) | false positive (fp) |
| Not Retrieved | false negative (fn) | true negative (tn) |

Table 1: Retrieval contingency table
4 Evaluation in Information Retrieval

4.1 Overview

The first part of this study gives an overview of Information Retrieval (IR) evaluation in general. The concepts in this section are mainly drawn from Manning et al. To evaluate the effectiveness of an IR system, one needs three things:
1. A document collection.
2. A set of queries.
3. A set of relevance judgments for each query (usually a binary judgment: relevant or not relevant).
A document is called relevant if it addresses the user's information need. The effectiveness of some IR systems depends on the values of a number of parameters. Manning et al. add that "it is wrong to report results on a test collection which were obtained by tuning these parameters to maximize performance on that collection."
4.1.1 Binary Relevance, Set-based Evaluation

Two basic measures for an IR system are precision and recall. Precision (P) is the fraction of retrieved documents that are relevant, and recall (R) is the fraction of relevant documents that are retrieved. According to Table 1, the precision and recall values are calculated as:
$$P = \frac{tp}{tp + fp} \tag{1}$$
$$R = \frac{tp}{tp + fn} \tag{2}$$
Another measure for evaluating an IR system is accuracy, which is the fraction of its classifications that are correct:
$$Accuracy = \frac{tp + tn}{tp + tn + fp + fn} \tag{3}$$
In most cases the data is extremely skewed: normally over 99.9% of the documents are not relevant to the query. Thus, trying to find some relevant documents will result in a large number of false positives, and accuracy alone is a poor measure of retrieval quality.
A measure that trades off precision against recall is the F measure:
$$F = \frac{1}{\alpha / P + (1 - \alpha) / R} = \frac{(\beta^2 + 1)\, P R}{\beta^2 P + R}, \qquad \beta^2 = \frac{1 - \alpha}{\alpha} \tag{4}$$
Values of $\beta < 1$ emphasize precision, while values of $\beta > 1$ emphasize recall.
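Equations (1), (2) and (4) translate directly into code. The following is a small sketch; the function name `precision_recall_f` and the example counts are illustrative assumptions, not data from the report.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall and the F measure from contingency-table counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

# Hypothetical counts: 30 relevant retrieved, 10 nonrelevant retrieved, 20 relevant missed.
p, r, f1 = precision_recall_f(tp=30, fp=10, fn=20)
print(p, r, f1)
```

With `beta=1.0` precision and recall are weighted equally, so F reduces to their harmonic mean.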
4.1.2 Evaluation of Ranked Retrieval

The measures mentioned in the previous section are set-based measures, and use an unordered set of retrieved documents to evaluate the IR system. The first approach to evaluating a ranked retrieval is to draw the precision-recall plot. The precision-recall plot has a distinctive shape: if the (k + 1)th document retrieved is nonrelevant, then recall is the same as for the top k documents, but precision has dropped. If it is relevant, then both precision and recall increase and the curve moves up and to the right.
Usually the interpolated precision ($P_{interp}$) is plotted. $P_{interp}$ at a certain recall level is the highest precision the system achieves at that recall level or any higher one. For a recall value r:
$$P_{interp}(r) = \max_{r' \ge r} P(r') \tag{5}$$
The intuition behind this definition is that anyone would be willing to retrieve a few more documents if doing so increased precision. The 11-point interpolated average precision is a traditional approach based on this measure and was used in the first 8 TREC Ad Hoc evaluations. The interpolated precision is measured at the 11 recall levels 0.0, 0.1, ..., 1.0.
Another standard measure in TREC is Mean Average Precision (MAP), which reports a single value for the effectiveness of an IR system. Average precision rewards returning more relevant documents earlier. It is the average of the precision values computed after cutting the list at each of the relevant documents retrieved in turn. The mean average precision is then the mean of the average precisions computed for each of the queries separately. If the set of relevant documents for a query $q_j \in Q$ is $\{d_1, \cdots, d_{m_j}\}$ and $R_{jk}$ is the set of ranked results from the top result down to document $d_k$, then
$$MAP(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} Precision(R_{jk}) \tag{6}$$
MAP is roughly the average area under the precision-recall curve for a set of queries.
Precision at k is a measure that assesses the accuracy of an information retrieval system based on the first few pages of results, and does not require any estimate of the size of the set of relevant documents. An alternative to this measure is R-Precision, which requires a set of known relevant documents of size R. The precision is then calculated once the top R documents are returned.
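Equation (6) can be sketched as follows. The code assumes that relevant documents which never appear in the ranking contribute a precision of zero; the document identifiers and function names are hypothetical.

```python
def average_precision(ranking, relevant):
    """Average of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank          # precision of the list cut at this rank
    return total / len(relevant)          # unretrieved relevant documents count as 0

def mean_average_precision(runs):
    """runs: list of (ranking, relevant_set) pairs, one per query."""
    return sum(average_precision(rk, rel) for rk, rel in runs) / len(runs)

runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),   # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                     # AP = 1/2
]
map_score = mean_average_precision(runs)
print(map_score)
```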
4.1.3 Non-binary Relevance Evaluation

Normalized discounted cumulative gain (NDCG) is an increasingly adopted measure designed for situations of non-binary relevance. Like precision at k, it is evaluated over the top k search results.
If R(j, d) is the relevance score the annotators gave to document d for query j, then
$$NDCG(Q, k) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} Z_k \sum_{m=1}^{k} \frac{2^{R(j,m)} - 1}{\log(1 + m)} \tag{7}$$
where $Z_k$ is a normalization factor calculated to ensure that a perfect ranking's NDCG at k is 1.
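For a single query, equation (7) can be sketched as below. The code assumes that the normalization factor is the reciprocal of the DCG of the ideal ordering of the same judged documents, and uses the natural logarithm (the base cancels in the ratio); the function names are illustrative.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top k graded relevance scores."""
    return sum((2 ** rel - 1) / math.log(1 + m)
               for m, rel in enumerate(relevances[:k], start=1))

def ndcg(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering, i.e. Z_k * DCG."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 1], k=3), ndcg([1, 2, 3], k=3))
```

Averaging `ndcg` over all queries in Q gives the NDCG(Q, k) of equation (7).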
| | Judge 2: Yes | Judge 2: No | Total |
|---|---|---|---|
| Judge 1: Yes | 300 | 20 | 320 |
| Judge 1: No | 10 | 70 | 80 |
| Total | 310 | 90 | 400 |

Table 2: Example on κ statistics
4.2 Assessing Agreement, Kappa Statistics

Once we have the set of documents and the set of queries, we need to collect relevance assessments. This task is quite expensive, and judges should agree on the relevance of a document for each given query. The kappa statistic is
$$\kappa = \frac{\Pr(A) - \Pr(E)}{1 - \Pr(E)} \tag{8}$$
where $\Pr(A)$ is the relative observed agreement among judges, and $\Pr(E)$ is the probability that the judges agree by chance. The above formula is useful for calculating kappa when there are only two judges. When there are more than two relevance judgments for each query, Fleiss' kappa is calculated:
$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e} \tag{9}$$
Here $1 - \bar{P}_e$ is the maximum agreement that is reachable above chance, and $\bar{P} - \bar{P}_e$ is the agreement actually achieved above chance. A κ value of 1 means complete agreement, while a value of 0 means that agreement is no better than chance.
Krippendorff argues that finding associations between two variables that both rely on coding schemes with κ < 0.7 is often impossible, and that content analysis researchers generally think of κ > 0.8 as good reliability, with 0.67 < κ < 0.8 allowing provisional conclusions to be drawn. Kappa is widely accepted in the field of content analysis and is interpretable.
4.2.1 An Example

Table 2 shows an example of the judgments of two assessors.
$$P(A) = (300 + 70)/400 = 370/400 = 0.925$$
$$P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125$$
$$P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875$$
$$P(E) = P(nonrelevant)^2 + P(relevant)^2 = 0.2125^2 + 0.7875^2 = 0.665$$
$$\kappa = (P(A) - P(E))/(1 - P(E)) = (0.925 - 0.665)/(1 - 0.665) = 0.776$$
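The worked example can be reproduced with a short function. This sketch follows the same pooled-marginal calculation as the example (both judges' votes are combined to estimate the chance of a "relevant" vote); the function name `kappa_two_judges` and its argument names are illustrative.

```python
def kappa_two_judges(yes_yes, yes_no, no_yes, no_no):
    """Kappa for two judges; arguments are the four cells of the agreement table."""
    total = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / total
    # Pooled marginals, as in the worked example: combine both judges' "yes" votes.
    p_relevant = ((yes_yes + yes_no) + (yes_yes + no_yes)) / (2 * total)
    p_nonrelevant = 1 - p_relevant
    p_chance = p_relevant ** 2 + p_nonrelevant ** 2
    return (p_agree - p_chance) / (1 - p_chance)

kappa = kappa_two_judges(yes_yes=300, yes_no=20, no_yes=10, no_no=70)
print(round(kappa, 3))
```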
4.3 Statistical testing of retrieval experiments

An evaluation study is not complete without some measurement of the significance of the differences between retrieval methods. Hull focuses on comparing two or more retrieval methods using statistical significance measures. The t-test compares the magnitude of the difference between methods to the variation among the differences between the scores for each query. If the average difference is large compared to its standard error, then the methods are significantly different. The t-test assumes that the error follows a normal distribution. Two non-parametric alternatives to the t-test are the paired Wilcoxon signed-rank test and the sign test. Let $X_i$ and $Y_i$ be the scores of retrieval methods X and Y for query i, and let the difference between them be $D_i = \theta + \epsilon_i$, where the $\epsilon_i$ are independent. The null hypothesis is $\theta = 0$.
• Paired t-test
$$t = \frac{\bar{D}}{s(D_i) / \sqrt{n}}$$
• Paired Wilcoxon test
$$T = \frac{\sum R_i}{\sqrt{\sum R_i^2}}, \qquad R_i = sign(D_i) \times rank|D_i|$$
• Sign Test
$$T = \frac{2 \sum I[D_i > 0] - n}{\sqrt{n}}$$
where $I[D_i > 0] = 1$ if $D_i > 0$ and 0 otherwise.
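The three test statistics can be computed from the per-query differences $D_i$ as in the sketch below. It only computes the statistics, not p-values, and the Wilcoxon helper assumes there are no zero differences and no ties among the $|D_i|$; the example differences and function names are hypothetical.

```python
import math

def paired_t(diffs):
    """t = mean(D) / (s(D) / sqrt(n)), with s the sample standard deviation."""
    n = len(diffs)
    mean = sum(diffs) / n
    s = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    return mean / (s / math.sqrt(n))

def wilcoxon_statistic(diffs):
    """T = sum(R_i) / sqrt(sum(R_i^2)); assumes no zeros and no ties in |D_i|."""
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    signed_ranks = [0.0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        signed_ranks[i] = math.copysign(rank, diffs[i])
    return sum(signed_ranks) / math.sqrt(sum(r * r for r in signed_ranks))

def sign_statistic(diffs):
    """T = (2 * #{D_i > 0} - n) / sqrt(n)."""
    n = len(diffs)
    return (2 * sum(1 for d in diffs if d > 0) - n) / math.sqrt(n)

diffs = [0.05, 0.12, -0.02, 0.08]   # hypothetical per-query score differences
print(paired_t(diffs), wilcoxon_statistic(diffs), sign_statistic(diffs))
```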
5 Blog Analysis

5.1 Introduction

Nowadays weblogs (a.k.a. blogs) play an important role in social interactions. People who maintain and update blogs, so-called bloggers, engage in a series of interactions and interconnections with other people. Each blog usually has a fairly constant number of regular readers. These readers might link to the blog or comment on its posts, which is said to be a motive for future postings.
Although the term blog was not coined until 1997, today many people maintain blogs of their own and update them regularly, writing about their feelings, thoughts, and anything else they desire. The role of blogs as a social network is clear, and so many research projects are devoted to analyzing relations in blogs, from political conclusions to finding mutual awareness and communities. There are two reasons why blogs are systematically studied:
1. Sociological reasons
The cultural difference between the blogspace and regular webpages is that the blogspace focuses heavily on local community interactions between a small number of bloggers. There is also a sequence of responses when hot topics arise. This makes it necessary to see whether these dynamics can be explained and modeled.
2. Technical reasons
The blogspace provides timestamps for blog posts and comments, which makes it easier to analyze this space over time. Analysis of bursty evolution in blogs concentrates on bursts in the evolution of the blogspace.
5.2 Implicit Structure of Blogs
Blogs are well suited to tracking “memes” because they are constantly used and have an underlying structure. Blogs are widely used to record diaries and are easy to update, which makes
online document growth faster. In addition, the hyperlinks between bloggers form a network
structure in the blogspace, which makes the blogspace an excellent test bed for analyzing information
dynamics in networks.
The microscopic patterns of blog epidemics are split into implicit and explicit ones. Studies of blog
epidemics try to address two questions:
• Do different types of information have different epidemic profiles?
• Do different types of epidemic profiles group similar types of information?
The main contributions of  are general categories of information epidemics and a tool to infer and visualize
the paths of specific infections in the blog network. The microscale
dynamics of blogs are also studied in , where the authors specify two major factors to consider when
addressing epidemics in blogs: timing and the graph. A few problems must be considered in this analysis.
First, the root is not always clearly identified. Second, there may be multiple possibilities for
the infection of a single node. As an example, assume that A, B, and C are blogs, B links to A, and C links to
both A and B. Then C might be infected either by B (after B is infected by A) or directly by A. Third,
there is uncrawled space, so the whole blogspace structure is not known to the analyst. Explicit
analysis is easy and can be done using the link structure. It is usually hard to analyze the implicit
structure of blogs, though. This is known as the link inference problem. Possible approaches use
machine learning algorithms such as support vector machines or logistic regression. The full text
of blogs, blogs in common, links in common, and the history of infection are used as the available sources
of information.
Figure 1: The results on the similarity of pairs in 
One experiment that Adar et al. carry out is to find pairs that are infected and linked, and
pairs that are infected but unlinked. Although in the general case the machine learning approach
shows more than 90 percent accuracy on the classification task in both experiments, it does not
work well in the specific case, in which for a given epidemic it connects all blogs.
Figure 1 is selected from  and shows the results on the distribution of similarity of pairs in
5.2.1 iRank
Adar et al. conclude with a description of a new ranking algorithm for blogs, iRank. Unlike
traditional ranking algorithms, iRank uses the implicit link structure to find the blogs that initiate
epidemics. It focuses on finding the true information sources in a network: if n blogs point
to one blog, and that blog points to another single information resource, the latter might be the true
source of information. In the iRank algorithm, a weighted edge is first drawn for every pair of blogs
that cite the same URL u:
$$W_{ij}^{u} = W(\Delta d_{ij})$$
Weights are then normalized so that outgoing weights sum to one, and finally PageRank is run
on the weighted graph.
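The three iRank steps can be sketched in a few lines. This is only an illustration under assumptions of ours, since the text does not fix these details: the weight function W(Δd) = 1/(1 + Δd), the direction of an edge (from the later citer to the earlier one), the damping value, and the uniform redistribution of rank from blogs without outgoing edges are all hypothetical choices.

```python
from collections import defaultdict

def irank(citations, weight=lambda dt: 1.0 / (1.0 + dt), damping=0.85, iters=100):
    """citations maps each URL to a list of (blog, day) citation events."""
    w = defaultdict(float)
    blogs = set()
    for cites in citations.values():
        for bi, di in cites:
            blogs.add(bi)
            for bj, dj in cites:
                # Assumption: the edge points from the later citer to the earlier one.
                if bi != bj and dj < di:
                    w[(bi, bj)] += weight(di - dj)
    blogs = sorted(blogs)
    n = len(blogs)
    # Total outgoing weight per blog, used to normalize outgoing weights to one.
    out = defaultdict(float)
    for (i, _), x in w.items():
        out[i] += x
    rank = {b: 1.0 / n for b in blogs}
    for _ in range(iters):
        # Blogs with no outgoing edge spread their rank uniformly (assumption).
        dangling = sum(rank[b] for b in blogs if out.get(b, 0.0) == 0.0)
        new = {b: (1.0 - damping) / n + damping * dangling / n for b in blogs}
        for (i, j), x in w.items():
            new[j] += damping * rank[i] * x / out[i]
        rank = new
    return rank

# Hypothetical data: blog A cites both URLs first, B and C follow later.
citations = {"u1": [("A", 0), ("B", 1), ("C", 2)], "u2": [("A", 0), ("C", 3)]}
scores = irank(citations)
print(max(scores, key=scores.get))
```

With these choices the blog that cites first accumulates rank from everyone who cites after it, which matches the intuition of locating the initiator of an epidemic.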
5.3 Information Diffusion in Blogs
The dynamics of information in the blogosphere are studied in  and are the main focus of this section.
Over the past two decades there has been increasing interest in observing information flows as
well as in influencing and creating these flows. Information diffusion is characterized along two
dimensions:
• Topics
This includes identifying a set of postings that are about the same topic, and then characterizing the different patterns into which the set of postings may fall. Topics, according to , are unions
of chatter and spikes. The former refers to discussions whose subtopic flow is determined
by the authors’ decisions. The latter refers to short-term, high-intensity discussions of real-world
events.
• Individuals
Individual behavior in publishing online diaries differs dramatically.  characterize
four categories of individuals based on their typical posting behavior within the life cycle of a
topic.
The corpus was collected by crawling 11,804 RSS blog feeds. Gruhl et al. collected
2K-10K blog postings per day, for a total of 401,021 postings in their data set. Based on
manual analysis of 12K individual words ranked highly by tf-idf, they
decompose topics along two axes: chatter (internally driven, sustained discussion) and spikes
(externally induced sharp rises in postings). Depending on the average chatter level, topics can be
placed into one of the following three categories:
1. Just Spike
2. Spiky Chatter (topics that have a significant chatter level but are also highly sensitive to
external events)
3. Mostly Chatter
The model “Topic = Chatter + Spikes” is further refined to see whether spikes themselves are
decomposable. For each proper noun x they compute the support s (the number of times x co-occurred with the target) and the reverse confidence $c_r = P(\text{target} \mid x)$. To generate a reasonable term
set, thresholds on s and $c_r$ are used. For these terms they look at the co-occurrences and define
spikes as areas where the number of posts on a given day exceeds a certain threshold.
The distribution of the non-spike daily average is approximated by
$$\Pr(\text{average number of posts per day} > x) \approx c e^{-x}$$
They also observe that most spikes in the manually labeled chatter topics last about
5 to 10 days.
Table 3 shows the distribution of the number of posts by user. The authors in  try to locate
users whose posts fall into the categories shown in Table 3. Each post on day i has a probability $p_i$
of falling into a given category.
They gather all blog posts that contain a particular topic into a list
$$[(u_1, t_1), (u_2, t_2), \cdots, (u_k, t_k)]$$
sorted by publication date, in which $u_i$ is the ID of blog i and $t_i$ is the first time
at which blog $u_i$ contained a reference to the topic. The aim is then to induce the relevant edges
among a candidate set of $\Theta(n^2)$ edges.
| Predicate | Algorithm | Region | % of Topics |
|---|---|---|---|
| RampUp | All days in the first 20% of post mass below mean, and average day during this period below µ − σ/2 | First 20% of post mass | 3.7% |
| Ramp-Down | All days in the last 20% of post mass below mean, and average day during this period below µ − σ/2 | Last 20% of post mass | 5.1% |
| Mid-High | All days during the middle 25% of post mass above mean, and average day during this period below µ − σ/2 | Middle 25% of post mass | 9.4% |
| Spike | For some day, number of posts exceeds µ − σ/2 | From spike to infection point below µ, both directions | 18.2% |

Table 3: Distribution of the number of posts by user
If a appears in the traversal sequence and b does not appear later in the same sequence, this
provides valuable information about the edge (a, b): if b were a regular reader of a, then memes discussed by
a should sometimes appear in b.
The authors present an EM-like algorithm to induce the parameters of the transmission graph,
in which they first compute soft assignments of each new infection to the edges, and then update the edge
parameters to increase the likelihood of the assigned infections. They estimate the copy probability $\kappa$
and the inverse mean propagation delay $r$ as
$$r = \frac{\sum_{j \in S_1} p_j}{\sum_{j \in S_1} p_j \delta_j}$$
and
$$\kappa = \frac{\sum_{j \in S_1} p_j}{\sum_{j \in S_1 \cup S_2} \Pr(r \le \delta_j)}$$
where $\Pr(a \le b) = (1-a)(1-(a-1)^{b})$ is the probability that a geometric distribution with parameter
a has value $\le b$.
5.4 Bursty Evolution
The study of the evolution of blogs is tightly coupled with the notion of a time graph. There is a
clear difference between the community structure of the web and that of blogs. Blog communities
are usually formed as a result of a debate over time, which leaves a number of hyperlinks in the
blogspace. Therefore, the community structure in blogs should be studied over short intervals,
as heavy linkage within a short period of time is not significant when analyzed over a long time
span.
5.4.1 Time Graphs
A time graph G = (V, E) consists of
1. A set V of nodes, with each node v associated with a time interval D(v) (its duration).
2. A set E of edges, where each edge is a triple e = (u, v, t) in which u and v are nodes
and t is a point in the time interval D(u) ∩ D(v).
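The definition translates almost directly into a data structure. The sketch below is one possible encoding, with assumptions of ours: durations are closed numeric intervals, and the class and method names (`TimeGraph`, `snapshot`) are illustrative only. The `snapshot` helper extracts the edges created within a short window, which is the kind of short-interval view the discussion of bursty evolution calls for.

```python
from dataclasses import dataclass, field

@dataclass
class TimeGraph:
    durations: dict = field(default_factory=dict)  # node -> (start, end) = D(v)
    edges: list = field(default_factory=list)      # triples (u, v, t)

    def add_node(self, v, start, end):
        if start > end:
            raise ValueError("duration must satisfy start <= end")
        self.durations[v] = (start, end)

    def add_edge(self, u, v, t):
        # The edge time must lie in D(u) ∩ D(v); intervals are assumed closed.
        lo = max(self.durations[u][0], self.durations[v][0])
        hi = min(self.durations[u][1], self.durations[v][1])
        if not lo <= t <= hi:
            raise ValueError("edge time outside the intersection of the durations")
        self.edges.append((u, v, t))

    def snapshot(self, t0, t1):
        """Edges (u, v) whose timestamp falls in the window [t0, t1]."""
        return [(u, v) for u, v, t in self.edges if t0 <= t <= t1]

g = TimeGraph()
g.add_node("blog_a", 0, 10)
g.add_node("blog_b", 5, 20)
g.add_edge("blog_a", "blog_b", 7)
print(g.snapshot(6, 8))
```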
5.4.2 Method
The algorithm in  consists of two steps: pruning and expansion. The pruning step
first removes nodes of degree zero and one, and then checks all nodes of degree
two to see whether their neighbors are connected, so that the three nodes form a triangle. In that case, they are passed to the
expansion procedure, which grows the seed into a dense subgraph in order to find a dense community. The
resulting expansion is reported as a community and is removed from the graph.
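The pruning step can be sketched as follows. This is a simplified reading of the description above, not the authors' code: the graph is assumed undirected and given as adjacency sets, low-degree nodes are removed in a single pass, and the expansion procedure is left out because the text does not specify it. The function name `prune` is ours.

```python
def prune(adjacency):
    """Return (pruned adjacency, triangle seeds) for an undirected graph.

    adjacency maps each node to the set of its neighbours.
    """
    adj = {v: set(ns) for v, ns in adjacency.items()}
    # Remove nodes of degree zero and one (single pass; an assumption).
    low = [v for v, ns in adj.items() if len(ns) <= 1]
    for v in low:
        for u in adj.pop(v):
            if u in adj:
                adj[u].discard(v)
    # A degree-two node whose two neighbours are linked forms a triangle seed.
    seeds = set()
    for v, ns in adj.items():
        if len(ns) == 2:
            a, b = ns
            if b in adj[a]:
                seeds.add(frozenset((v, a, b)))
    return adj, seeds

# Hypothetical graph: a triangle 1-2-3, a pendant node 4, an isolated node 5.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}, 5: set()}
pruned, seeds = prune(graph)
print(sorted(pruned), seeds)
```

Each seed would then be handed to the expansion procedure, and the community it grows into would be removed from the graph before continuing.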
The results of studying SCC evolution in blogs are reported for the three largest strongly
connected components. At the beginning of the study, the number of blog pages is significant, but
there is no strongly connected component of more than a few nodes. Around the beginning of the
second year, a few components representing 1-2% of the nodes in the graph appear and maintain
the same relative size for the next year. In the fourth year, however, the size of the component
increases to over 20% by the present day. The giant component still appears to be expanding
rapidly, doubling in size approximately every three months.
Results in  also indicate that the number of nodes in communities in the randomized
blogspace is an order of magnitude smaller than in the blogspace. This shows that community
formation in the blogspace is not simply a property of the growth of the graph. In addition, the SCC in
the randomized blogspace grows much faster than in the original blogspace. This means that the
striking community structure in the blogspace is a result of links that are indeed driven by topicality.
6 Lexical Networks
6.1 Introduction
One of the major differences between humans and other species is that human beings are capable
of using languages, an ability that is not shared by any other species. Human language allows the
construction of a virtually infinite range of combinations from a limited set of basic units. We
are able to produce words rapidly and combine them into sentences in a highly reliable fashion. The generation and evolution of languages lead to the creation of
new language entities.
In this chapter we look at some previous work on lexical networks. A lexical network
is a complete weighted graph in which vertices represent linguistic entities, such as words
or documents, and edge weights express a relationship between two nodes. The three main classes of
lexical networks are word-based, sentence-based, and document-based networks. Word-based lexical
networks can be further divided into co-occurrence networks, syllable networks, semantic networks,
and synthetic networks.
6.1.1 Small World of Human Languages
The small-world property  of language networks has been analyzed before . Ferrer Cancho
et al. showed that the syntactic network of heads and their modifiers exhibits small-world
properties. The small-world property is also shown for co-occurrence networks by . A co-occurrence network is a network with words as nodes in which an edge appears between two nodes if they
appear in the same sentence. Many co-occurrences are due to syntactic relationships between
words or dependency relationships .
In this section we take a deeper look at the work by . They set the maximum distance
at which two words are assumed to co-occur to the minimum distance at which most co-occurrences are
likely to happen:
• Many co-occurrences take place at a distance of one.
• Many co-occurrences take place at a distance of two.
It has also been shown that co-occurrences happen at distances greater than two . However, for
four reasons they decide to consider a maximum distance of two for a co-occurrence:
1. No system is available for handling distances greater than
two, and all achievements in computational linguistics are based on a
maximum distance of two.
2. Methods fail to capture exact relationships at distances greater than two.
3. To make this as automatic as possible they stick with distance two and try to collect as many
4. Long-distance syntactic dependencies imply smaller syntactic links.
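A co-occurrence network with a maximum distance of two can be built in a few lines. The sketch below is illustrative only: it assumes lowercased whitespace tokenization, treats edges as undirected, and ignores pairs of identical words; the function names are ours, and the filtering used to obtain the restricted network is not reproduced.

```python
from collections import Counter

def cooccurrence_edges(sentences, max_distance=2):
    """Count undirected word pairs that occur within max_distance positions."""
    edges = Counter()
    for sentence in sentences:
        words = sentence.lower().split()  # assumption: whitespace tokenization
        for i, w in enumerate(words):
            last = min(i + max_distance, len(words) - 1)
            for j in range(i + 1, last + 1):
                if words[j] != w:
                    edges[frozenset((w, words[j]))] += 1
    return edges

def degrees(edges):
    """Degree of each word: the number of distinct words it is linked to."""
    degree = Counter()
    for pair in edges:
        for w in pair:
            degree[w] += 1
    return degree

edges = cooccurrence_edges(["the cat sat down"])
print(sorted(tuple(sorted(pair)) for pair in edges))
print(degrees(edges))
```

In the example sentence, "the" and "down" are three positions apart and therefore stay unconnected, while every pair at distance one or two is linked.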
The authors consider the graph of human language, $\Omega_L$, defined by $\Omega_L(W_L, E_L)$, where $W_L = \{w_i\}$,
$(i = 1, \cdots, N_L)$ is the set of $N_L$ words and $E_L = \{w_i, w_j\}$ is the set of edges or connections between
words. They call the largest connected components of the networks that result from the basic and
improved methods, respectively, the unrestricted word network (UWN) and the restricted word
network (RWN). They analyze the degree distributions of the unrestricted word network and the
restricted word network built from about three quarters of the $10^7$ words of the British National Corpus
(http://info.ox.ac.uk/bnc/). The authors further show that the degree increases as a function of
frequency, with exponent 0.80 for the first and 0.66 for the second segment.
In summary, they show that the graph connecting words in language exhibits the same statistical
features as other complex networks. They observe short distances between words, arising from the
small-world structure. They conclude that language evolution might have involved the selection of
a particular arrangement of connections between words.
6.2 Language Growth Model
In this section we summarize the work by Dorogovtsev and Mendes , which describes the
evolution of language networks. Human language can be described as a network of linked words.
In this network, neighboring words in sentences are connected by edges. Dorogovtsev and
Mendes  treat language as a self-organizing network in which nodes interact with each
other and the network grows. They describe network growth with the following rules:
• At each time step, a new vertex (word) is added to the network.
• At its birth, a new node connects to several old nodes, whose number is of the order of 1.
• For convenience, it is assumed that a new node attaches to only one old node, chosen with probability
proportional to the degree of that node.
• At each time increment t, ct new edges emerge in the network between old words, where c is a
constant.
• New edges connect nodes with probability proportional to their degree.
In the continuous approximation, the degree of a vertex born at time i and observed at time t is described by
$$\frac{\partial k(i,t)}{\partial t} = (1 + 2ct)\,\frac{k(i,t)}{\int_0^t k(u,t)\,du}$$
The total degree of the network at time t is
$$\int_0^t k(u,t)\,du = 2t + ct^2$$
Solving for k(i, t) with the initial condition k(t, t) = 1 gives
$$k(i,t) = \left(\frac{ct}{ci}\right)^{1/2} \left(\frac{2+ct}{2+ci}\right)^{3/2}$$
The nonstationary degree distribution is then
$$P(k,t) = \frac{1}{ct}\,\frac{ci(2+ci)}{1+2ci}\,\frac{1}{k}$$
This results in a power-law degree distribution with two different regions, intersecting at
$$k_{cross} \approx \sqrt{ct}\,(2+ct)^{3/2}$$
Below this point, the degree distribution is stationary:
$$P(k) \cong \frac{1}{2} k^{-3/2}$$
Above the crossover point they obtain
$$P(k,t) \cong \frac{1}{4}(ct)^3 k^{-3}$$
which is a nonstationary degree distribution in this region.
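The growth rules can be simulated directly to check the bookkeeping behind the total degree 2t + ct². The following sketch makes several simplifying assumptions of ours: the network starts from two linked words, the ct edges per step are rounded to an integer, self-loops are rejected, and multiple edges between the same pair of words are allowed. The function name `grow_word_web` is hypothetical.

```python
import random

def grow_word_web(steps, c, seed=0):
    """Simulate the growth rules and return a dict node -> degree."""
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}  # assumption: start from two linked words
    # Every edge contributes both endpoints to this list, so a uniform draw
    # from it selects a node with probability proportional to its degree.
    endpoints = [0, 1]
    for t in range(2, steps):
        # About ct new edges between old words (rounding is an assumption).
        for _ in range(round(c * t)):
            a = rng.choice(endpoints)
            b = rng.choice(endpoints)
            while b == a:  # reject self-loops
                b = rng.choice(endpoints)
            degree[a] += 1
            degree[b] += 1
            endpoints += [a, b]
        # The new word attaches preferentially to one old word.
        old = rng.choice(endpoints)
        degree[t] = 1
        degree[old] += 1
        endpoints += [t, old]
    return degree

degree = grow_word_web(steps=2000, c=0.01)
print(len(degree), sum(degree.values()), max(degree.values()))
```

For the parameters shown, the total degree is close to 2t + ct² = 4000 + 40000, and a histogram of the degrees can be compared against the two power-law regimes on either side of the crossover.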
6.3 Complex Networks and Text Quality
6.3.1 Text Quality
The first work at which we take a closer look is the text assessment method described by Antiqueira
et al., who approach the text assessment problem using the concept of complex networks. In their
work they model a text as a complex network in which each of the N words is represented
as a node and each connection between words as a weighted edge between the respective nodes. They also make use of a list of stop words to eliminate terms not associated
with concepts.
They define two measures, instrength and outstrength, corresponding to weighted indegree and
outdegree respectively, as
$$K_{in}(i) = \sum_{j=1}^{N} W(j, i)$$
and
$$K_{out}(i) = \sum_{j=1}^{N} W(i, j)$$
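Both strengths are plain sums over the weight matrix and are straightforward to compute. In the sketch below, the way the network is built from a text (a directed edge from each word to the word that follows it, after stop-word removal, weighted by the number of occurrences) is an assumption made for illustration, as are the function names.

```python
from collections import defaultdict

def adjacency_weights(words, stop_words=()):
    """W(i, j): how often word j directly follows word i (assumed construction)."""
    kept = [w for w in words if w not in stop_words]
    weights = defaultdict(int)
    for a, b in zip(kept, kept[1:]):
        weights[(a, b)] += 1
    return dict(weights)

def strengths(weights):
    """Return (K_in, K_out) computed from a dict mapping (i, j) to W(i, j)."""
    k_in = defaultdict(float)
    k_out = defaultdict(float)
    for (i, j), w in weights.items():
        k_out[i] += w  # edge i -> j adds to the outstrength of i
        k_in[j] += w   # and to the instrength of j
    return dict(k_in), dict(k_out)

text = "the cat saw the dog and the cat ran".split()
k_in, k_out = strengths(adjacency_weights(text, stop_words={"the", "and"}))
print(k_in, k_out)
```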
Antiqueira et al. reach several conclusions:
1. The quality of the text tends to decrease with outdegree.
2. Quality is almost independent (increases only slightly) of the number of outdegrees for the
good-quality texts, while for the low-quality texts the quality increases with the number of
outdegrees.
3. The quality of the text decreases with the clustering coefficient.
4. In low-quality texts there is much higher variability in the clustering coefficients, which also
tend to be high.
5. Quality tends to increase with the size of the minimum path for all the three definitions of
path used in their work.
6.3.2 Summary Evaluation
A summary evaluation method based on complex networks is discussed in . The first step in
defining a measure to evaluate summaries using complex network concepts is to represent summaries
as complex networks. In this representation, the terms of the summary are nodes in the network, and
each association is determined by a simple adjacency relation: there is an edge between every pair of
adjacent words in the summary. The weight on an edge shows the number of times the two terms are
adjacent in the summary.
By modeling summaries as complex networks and introducing a network metric, the authors
show that it is possible to evaluate different automatic summaries. Their measure is based on the number
of strongly connected components in the network. More formally, they define a deviation measure:
$$\text{deviation} = \frac{\sum_{M=1}^{A} |f(M) - g(M)| / N}{A}$$
where f(M) is the function that gives the number of components for M word associations and
g(M) is the function that gives the linear variation of components for M word associations.
In this measure N is the number of different words in the text and A is the total number of word
associations found. They test the measure by using it to evaluate three automatic summarizers,
GistSumm , GEI , and SuPor , against a manual summary.
6.4 Compositionality in Quantitative Semantics
The last article that we review in this chapter is on compositionality in quantitative semantics
by . They introduce a principle of latent compositionality into quantitative semantics. They
implement a Hierarchical Constraint Satisfaction Problem (HCSP) as the fundamental text representation model. This implementation uses an extended model of semantic spaces that is
sensitive to text structure and thus leaves behind the bag-of-features approach. A simple example
shows how the different models behave when interpreting the text
All sculptures are restored. Only the lions stay in the garden.
• Vector Space (VS)
The representation of this text in the vector space model is a bag of words in which stop
words are filtered out. Important words (garden, lion, restore, sculpture, stay) are used to
build a vector of weighted terms to represent the sentence in the model.
• Latent Semantic Analysis (LSA)
LSA extends the previous approach. Unlike the VS model, it does not refer to a weighted bag
of words, but uses factor-analytic representations. In this model, the sample is represented
by the strongest factors in locating the representations of the input words garden, lion, restore,
sculpture, stay within the semantic space. This can therefore be thought of as a bag of feature vectors representation, but it still ignores the structure of the text.
• Fuzzy Linguistics (FL)
FL derives a representation of the text based on its integration hierarchy. This model expects the
text as a whole to suggest words like sculptural art, park, collection, but not veldt, elephant, or
zoo. This analysis assumes the present order of sentences. It does not allow reinterpreting
lion if the order of the sentences is reversed. Moreover, the example presupposes that all
coherence relations as well as the integration hierarchy have been identified beforehand.
The major contribution of  is, however, a formalization of the principle of compositionality in
terms of semantic spaces. In particular, they try to fill the gap of the missing linguistic interpretability of statistical meaning representations.
7 Text Summarization
This chapter covers some of the previous salient works on text summarization.
7.1 LexRank
LexRank is a stochastic graph-based method for determining the lexical centrality of documents
within a cluster, and is introduced in . The network on which LexRank is applied is a complete
weighted network of linguistic entities. These entities can be sentences or documents. In its
application to summarization, LexRank is used to find the centrality of sentences within a
text document.
To form the network, LexRank uses tf-idf term vectors and the so-called “bag of words”
representation. This enables LexRank to use the cosine similarity of documents as edge weights to
build the entire network. The cosine similarity of two document vectors is calculated as
$$\mathrm{Sim}(\vec{d_i}, \vec{d_j}) = \frac{\vec{d_i} \cdot \vec{d_j}}{|\vec{d_i}|\,|\vec{d_j}|}$$
In this representation of documents, $\vec{d_i} = (w_{1i}, w_{2i}, \cdots, w_{Vi})$, where V is the size of the vocabulary
and
$$w_{ji} = tf_{ji} \cdot idf_j$$
Here $tf_{ji}$ is the normalized term frequency of term j in document i and $idf_j$ is the inverse document
frequency of term j.
The centrality of a node u in this network is related to the centralities of its neighbors and is calculated
as
$$p(u) = \frac{d}{N} + (1-d) \sum_{v \in adj(u)} \frac{\mathrm{Sim}(u,v)}{\sum_{z \in adj(v)} \mathrm{Sim}(z,v)}\, p(v)$$
Here N is the number of nodes, adj(u) is the set of neighbors of u, and d is the damping factor, which ensures convergence of the values.
LexRank is evaluated on 30 documents of DUC 2003 and 50 documents of DUC 2004 using the ROUGE¹
evaluation system.
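The computation described in this section can be sketched end to end: tf-idf vectors, cosine similarities as edge weights, and a power iteration for the centrality scores. The sketch is a simplified illustration under assumptions of ours (whitespace tokenization, idf computed as log(n/df), d = 0.15, a fixed number of iterations) and is not the evaluated system; all function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Sparse tf-idf vectors; tf is normalized by sentence length."""
    docs = [Counter(s.lower().split()) for s in sentences]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    return [{t: (c / sum(doc.values())) * math.log(n / df[t]) for t, c in doc.items()}
            for doc in docs]

def cosine(a, b):
    """Cosine similarity of two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def lexrank(sentences, d=0.15, iters=100):
    """Centrality scores p(u) obtained by iterating the update rule."""
    vecs = tfidf_vectors(sentences)
    n = len(vecs)
    sim = [[cosine(vecs[u], vecs[v]) if u != v else 0.0 for v in range(n)]
           for u in range(n)]
    # Normalizer for each node v: the sum of Sim(z, v) over all z.
    total = [sum(sim[z][v] for z in range(n)) for v in range(n)]
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [d / n + (1 - d) * sum(sim[u][v] / total[v] * p[v]
                                   for v in range(n) if total[v] > 0)
             for u in range(n)]
    return p

sentences = ["the cat sat on the mat", "the cat lay on the rug",
             "the dog sat on the mat", "stocks fell sharply today"]
for score, sentence in sorted(zip(lexrank(sentences), sentences), reverse=True):
    print(round(score, 3), sentence)
```

The last sentence shares no terms with the others, so it receives only the d/N share, while the three related sentences reinforce one another.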
7.2 Summarization of Medical Documents
In this section we review the work in , whose main aim is to survey recent work on medical
document summarization. One of the major problems of the medical domain is that it
suffers particularly from information overload. This is important to deal with, since
physicians need to be able to access up-to-date information quickly and efficiently according to their
particular needs. The need for automatic summarization systems is even more crucial in this field because
of the number and diversity of medical information sources. Such summarizers help medical
researchers determine the main points of a document as quickly as possible.
A number of factors have to be taken into consideration when developing a summarization
system:
¹ http://www.isi.edu/~cyl/ROUGE
• Input
One major factor is the type of input to the system. The input can be a single document or
multiple documents. Additionally, the input language should be taken into account, which results in
three different types of summarization systems: monolingual, multilingual, or cross-lingual. The
third input factor is the medium of the input, which can be anything from plain text to images and
sound.
• Purpose
These factors concern the possible uses of the summary and its potential readers. A summary can be indicative or informative. The former does not claim to
substitute for the source documents, while the latter substitutes for the original documents. In
another categorization, a summary can be generic or user-oriented. Generic
systems create a summary of a document or a set of documents based on all the information
found in the documents. User-oriented systems create outputs that are relevant to a given
input query. From another perspective, a summary can be either general-purpose or domain-specific.
• Output
The type of evaluation of an automatic summarization system determines its type of output, which can be
either qualitative or quantitative. The last, yet major, factor in creating a summary
is the relation that the summary has to the source document. A summary can
be an abstract or an extract. Extracts include material (i.e., sentences, paragraphs, or even
phrases) from the source documents. An abstract, on the other hand, tries to identify the most
salient concepts in the source documents, then uses natural language processing techniques
to present them neatly.
7.3 Summarization Evaluation
Existing evaluation metrics can be split into two categories: intrinsic and extrinsic. An intrinsic
method evaluates the system summary independently of the purpose that the summary is supposed
to serve. An extrinsic evaluation, on the other hand, evaluates the produced summary in terms of
a specific task. The quality measures used by intrinsic methods include the integrity
of the sentences of the summary, the existence of anaphors, the readability of the summary, and its fidelity
to the source documents. Comparison with gold summaries is another way of doing intrinsic evaluation.
Gold summaries are ideal, human-made summaries against which system
summaries can be compared. However, it is usually hard to make annotators agree on what constitutes a “gold”
summary.
Tables 4 and 5 summarize some of the extractive and abstractive methods of text summarization,
respectively.
7.4 MMR
In this section we review MMR, a diversity-based re-ranking algorithm
that is used to produce summaries. Conventional IR methods maximize the relevance of the
retrieved documents to the query. This is, however, tricky when there are many potentially relevant
documents with partial information overlap. Maximal Marginal Relevance (MMR) is a user-tunable
User-oriented,
General purpose
Generic,
domain-specific
(scientific articles)
Multi-document,
English, text
Single-document,
English, text
Text regions
(sentences,
paragraphs, sections)
Sentences
Paragraphs
Sentences
Sentences
Sentences
Sentences
Output
Sentences
Graph-based,
statistics
(cosine similarity,
vector space model)
Graph-based,
cohesion relations,
language processing
Tree-based,
language processing
(to identify the )
(RST relations markers)
Use of thematic keywords,
no revision,
Statistics
Statistics
no revision
Statistics
use of thesauri,
revision
Language processing
(to identify keywords)
Method
Statistics (Edmundsonian,
Table 4: Summarizing systems with extractive techniques
Generic,
general purpose
User-oriented,
domain-specific
(scientific and
technical texts )
Generic,
domain-specific (news)
Purpose
Generic,
domain-specific
(technical papers)
Generic,
domain-specific
(scientific articles
on specific topics)
Generic,
domain-specific (news)
Multi-document,
multilingual , text
(English, Chinese)
Single-document,
English, text
Single-document,
multilingual, text
Single-document,
multilingual, text
Single-document,
English, text
Input
Single-document,
English, text
Intrinsic,
extrinsic
Intrinsic,
extrinsic
Intrinsic
Extrinsic
Intrinsic
Intrinsic
Evaluation
[50, 51]






ref.

Single-document,
multilingual, text
Single-document,
English, text
Single-document,
English, text
Multi-document,
English, text
Input
Single-document,
English, text
Conceptual
representation in UNL
Ontology-based
representation
Clusters
Templates
Output
Scripts
Syntactic processing of
Representative Sentences,
NLG
Syntactic processing of
Representative Sentences,
ontology-based annotation,
NLG
Statistics
(for scoring each UNL sentence),
removing redundant words,
combining sentences
Information extraction,
NLG
Method
Script activation,
canned generation
Table 5: Summarizing systems with abstractive techniques
Purpose
Informative,
user-oriented,
domain-specific
Informative,
user-oriented,
domain-specific
Generic,
domain-specific
(news articles)
Informative,
user-oriented,
domain-specific
Extrinsic
Intrinsic
Evaluation of
System components
Evaluation




ref.

method and a re-ranking method, with the ability either to drill down on a narrow topic or to retrieve
a wide range of relevance-bearing documents.
The criterion considered in MMR is relevant novelty. A first way to measure it is to measure
novelty and relevance independently, and then to use a linear combination of the two metrics as
a novelty-relevance measure. This linear combination is called the marginal relevance. A
document has a high marginal relevance if it is both relevant and has minimal information
overlap with the previously selected documents.
$$MMR = \arg\max_{D_i \in R \setminus S} \left[ \lambda\, Sim_1(D_i, Q) - (1-\lambda) \max_{D_j \in S} Sim_2(D_i, D_j) \right]$$
with Q being the query, R the set of retrieved documents, and S the set of documents already selected. This can be used specifically in summarization. In that case, the top-ranked
passages of a document can be chosen for inclusion in the summary. This summarization is shown to work better for longer documents (which contain more passage redundancy
across document sections) . MMR is also extremely useful in extracting passages from multiple
documents that are about the same topic.
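The MMR criterion leads to a simple greedy re-ranking loop: repeatedly pick the candidate that maximizes the bracketed expression given what has already been selected. The sketch below assumes that the two similarity functions are supplied by the caller; the function name `mmr_rank`, the default λ, and the toy similarity values are ours.

```python
def mmr_rank(candidates, query_sim, pair_sim, lam=0.7, k=None):
    """Greedy MMR: order candidates by marginal relevance.

    query_sim(d) plays the role of Sim1(d, Q); pair_sim(d, s) of Sim2(d, s).
    """
    remaining = list(candidates)
    selected = []
    k = len(remaining) if k is None else k
    while remaining and len(selected) < k:
        def marginal_relevance(doc):
            # Redundancy is the largest similarity to anything already selected.
            redundancy = max((pair_sim(doc, s) for s in selected), default=0.0)
            return lam * query_sim(doc) - (1 - lam) * redundancy
        best = max(remaining, key=marginal_relevance)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy values: d1 and d2 are near-duplicates, d3 is less relevant but novel.
relevance = {"d1": 0.9, "d2": 0.85, "d3": 0.5}
overlap = {frozenset(("d1", "d2")): 0.95,
           frozenset(("d1", "d3")): 0.1,
           frozenset(("d2", "d3")): 0.1}
ranking = mmr_rank(relevance, relevance.get,
                   lambda a, b: overlap[frozenset((a, b))], lam=0.5)
print(ranking)
```

With λ = 1 the ranking reduces to plain relevance ordering; lowering λ moves the near-duplicate d2 below the novel document d3.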
8 Graph Mincuts
This chapter reviews some of the previous salient works on graph mincut utilization for solving
machine learning related problems.
8.1 Semi-supervised Learning
A major issue in learning algorithms is the lack of sufficient labeled data. Learning methods
are commonly used to classify text, web pages, and images, and they need sufficiently large annotated
corpora. Unlike labeled data, whose creation is quite tedious, unlabeled data is easy to
gather. For example, in a classification problem one can easily access a large number of unlabeled
text documents, while the number of manually labeled documents is usually small. This has led to
great interest in semi-supervised methods.
The first paper that we review here uses graph mincuts  to classify data points using both
labeled and unlabeled data. Given a combination
of labeled and unlabeled data sets, Blum and Chawla  construct a graph of the examples such
that the minimum cut of the graph yields an “optimal” binary labeling of the data according to some
predefined optimization function.
In particular, the goal in  is to
1. Find the global optimum, rather than a local optimum, of the objective function being minimized.
2. Use this method to achieve a significant improvement in prediction accuracy.
Blum and Chawla  also assume that the unlabeled data comes from the same
distribution as the labeled data. The classification algorithm described in  has five steps:
1. Construct a weighted graph G = (V, E), where V is the set of all data points together with a sink
$v_-$ and a source $v_+$. The sink and the source are classification nodes.
2. Connect each classification node to the labeled examples that have the same label, using edges
of infinite weight. That is, for all $v \in +$ add $e(v_+, v) = \infty$, and for all $v \in -$ add
$e(v_-, v) = \infty$.
3. Assign weights to the edges between example vertices using a similarity/distance function.
4. Determine the min-cut of the graph, that is, find a minimum total weight set of edges
whose removal disconnects $v_+$ from $v_-$. Removing the edges on the cut splits the graph
into two parts, $V_+$ and $V_-$.
5. Assign positive labels to all nodes in $V_+$ and negative labels to all nodes in $V_-$.
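The five steps can be sketched with a small max-flow routine, since a minimum cut separating the source from the sink can be read off the residual graph of a maximum flow. The code below is an illustrative sketch rather than the authors' implementation: it uses a breadth-first augmenting-path method (Edmonds-Karp), assumes the similarity weights are given as a dictionary over undirected pairs, assumes no example is labeled both ways, and puts examples that are unreachable from the source on the negative side. The names `mincut_labels`, `SOURCE`, and `SINK` are ours.

```python
from collections import defaultdict, deque

SOURCE, SINK = "v+", "v-"  # classification nodes (names are placeholders)

def mincut_labels(similarity, positives, negatives):
    """Label every example +1 or -1 from a minimum cut of the graph."""
    cap = defaultdict(lambda: defaultdict(float))

    def connect(a, b, w):
        # An undirected edge is modeled as two arcs of equal capacity.
        cap[a][b] += w
        cap[b][a] += w

    for (x, y), w in similarity.items():               # step 3
        connect(x, y, w)
    for v in positives:                                # step 2
        connect(SOURCE, v, float("inf"))
    for v in negatives:
        connect(SINK, v, float("inf"))

    def reachable_from_source():
        parent = {SOURCE: None}
        queue = deque([SOURCE])
        while queue:
            a = queue.popleft()
            for b, c in cap[a].items():
                if c > 0 and b not in parent:
                    parent[b] = a
                    queue.append(b)
        return parent

    while True:                                        # step 4: max flow
        parent = reachable_from_source()
        if SINK not in parent:
            break
        path, b = [], SINK
        while parent[b] is not None:
            path.append((parent[b], b))
            b = parent[b]
        flow = min(cap[a][b] for a, b in path)
        for a, b in path:
            cap[a][b] -= flow
            cap[b][a] += flow

    # Step 5: nodes still reachable from the source form V+, the rest V-.
    examples = set(cap) - {SOURCE, SINK}
    return {v: (1 if v in parent else -1) for v in examples}

# Hypothetical chain a - b - c - d with a weak link between b and c.
similarity = {("a", "b"): 3.0, ("b", "c"): 1.0, ("c", "d"): 3.0}
print(mincut_labels(similarity, positives=["a"], negatives=["d"]))
```

In the example the cheapest way to separate the two labeled points is to cut the weak b-c link, so b inherits the positive label of a and c the negative label of d.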
8.1.1 Why This Works
The correctness of this algorithm depends on the choice of the weight function. For certain
learning algorithms, we can define edge weights so that the mincut algorithm produces a labeling
which minimizes the leave-one-out cross-validation error when applied to the entire set U ∪ L of unlabeled and labeled examples. Additionally,
for certain learning algorithms, we can also define the edge weights so that the mincut algorithm’s
labeling results in zero leave-one-out cross-validation error on L. This follows from the
statements below (see  for proofs):
If we define edge weights between examples such that, for each pair of nodes x, y,
we have $nn_{xy} = 1$ if y is the nearest neighbor of x and $nn_{xy} = 0$ otherwise, and
also define $w(x, y) = nn_{xy} + nn_{yx}$, then for any binary labeling of the examples, the cost of the
associated cut is equal to the number of leave-one-out cross-validation mistakes
made by 1-nearest neighbor on L ∪ U.
Given a locally weighted averaging algorithm, we can define edge weights so that the
minimum cut yields a labeling that minimizes the $L_1$-norm leave-one-out
cross-validation error.
Let w be the weight function used for the symmetric weighted nearest neighbor algorithm. If the same function is used for finding the graph mincut, the resulting classification has zero leave-one-out cross-validation error on
U.
Suppose the data is generated at random in a set of k $(\epsilon, \delta/4)$-round regions, such
that the distance between any two regions is at least $\delta$ and the classification
depends only on the region to which a point belongs. If the weighting function is
$w(x, y) = 1$ if $d(x, y) < \delta$ and $w(x, y) = 0$ otherwise, then $O((k/\epsilon) \log k)$ labeled
examples and $O((1/V_{\delta/4}) \log(1/V_{\delta/8}))$ unlabeled examples are sufficient to classify a
$1 - O(\epsilon)$ fraction of the unlabeled examples correctly.
Blum and Chawla  demonstrate the effectiveness of their method on standard datasets as well as
synthetic corpora. They also show that the method is robust to noise.
8.2 Randomized Mincuts
The method of randomized mincuts extends the mincut approach by adding randomness to the graph
structure, under some preliminary assumptions. Blum et al.  assume that similar examples have
similar labels. A natural way to use unlabeled data is therefore to combine nearest-neighbor
prediction with a self-consistency requirement: similar unlabeled examples should generally be
placed in the same class.
The mincut approach for classification has several attractive properties. First, a minimum cut
can be found in polynomial time using network flow. Second, it can be viewed as giving the most
probable configuration of labels in the associated Markov random field. However, the method also
has some shortcomings. Consider, as an example, a line of n vertices between two labeled points
s and t. This graph has n − 1 cuts of size 1, and the mincut algorithm may return the cut at the
very left, which is quite unbalanced.
The method described in  proposes a simple way of addressing a number of these drawbacks
using randomization. Specifically, the method repeatedly adds random noise to the edge
weights, solves the mincut problem on the perturbed graph, and finally outputs a fractional label
for each example. In this section we take a closer look at this method.
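A natural energy function to consider is
$$E(f) = \frac{1}{2}\sum_{i,j} w_{ij}\,|f(i) - f(j)| = \frac{1}{4}\sum_{i,j} w_{ij}\,(f(i) - f(j))^2$$
where the two forms agree for binary labels f(i) ∈ {−1, +1}. The associated Markov random field
assigns each labeling a probability proportional to $e^{-\beta E(f)}$, where the partition
function Z normalizes over all labelings. Solving for the lowest energy configuration in this
Markov random field produces a partition of the whole dataset and maximizes self-consistency.
The randomized mincut procedure is as follows: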
Given a graph G constructed from the dataset, produce a collection of cuts by repeatedly adding
random noise to the edge weights and solving the mincut problem in the resulting graph. Finally,
remove cuts that are highly unbalanced. Blum et al.  show that only O(k log n) labeled
examples are needed to be confident in a consistent cut of k edges.
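The procedure can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' code: the minimum cut is found by brute-force enumeration of labelings (feasible only for a handful of nodes) instead of network flow, the noise is uniform, and the names `randomized_mincut`, `noise`, and `min_fraction` are hypothetical.

```python
import itertools
import random

def brute_force_min_cut(n, weights, labeled):
    """Return the 0/1 labeling of minimum cut weight that agrees with `labeled`."""
    free = [v for v in range(n) if v not in labeled]
    best, best_cost = None, float("inf")
    for bits in itertools.product((0, 1), repeat=len(free)):
        f = dict(labeled)
        f.update(zip(free, bits))
        cost = sum(w for (u, v), w in weights.items() if f[u] != f[v])
        if cost < best_cost:
            best, best_cost = f, cost
    return best

def randomized_mincut(n, weights, labeled, rounds=50, noise=0.5,
                      min_fraction=0.2, seed=0):
    """Average the labels of balanced minimum cuts over noisy copies of the graph."""
    rng = random.Random(seed)
    votes, kept = [0] * n, 0
    for _ in range(rounds):
        noisy = {e: w + rng.uniform(0, noise) for e, w in weights.items()}
        f = brute_force_min_cut(n, noisy, labeled)
        positives = sum(f.values())
        if min(positives, n - positives) < min_fraction * n:
            continue                      # discard highly unbalanced cuts
        kept += 1
        for v in range(n):
            votes[v] += f[v]
    # Fractional label: share of kept cuts that put the node on the positive side.
    return [votes[v] / kept for v in range(n)] if kept else None

# A line of six vertices with unit weights; the end points are labeled.
path_weights = {(i, i + 1): 1.0 for i in range(5)}
fractions = randomized_mincut(6, path_weights, labeled={0: 1, 5: 0})
```

On the line graph every single-edge cut has the same weight before noise is added, so the plain mincut is arbitrary. After the unbalanced cuts are discarded, the fractional labels fall off smoothly from the positive end to the negative end.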
8.2.1 Graph Construction
For a given distance metric, we can construct a graph in various ways. The graph should either be
connected or have a small number of connected components that cover all examples. If t components
are needed to cover a 1 − ε fraction of the points, the graph-based method will need at least
t labeled examples to perform well. Blum et al.  construct the graph by simply building a minimum
spanning tree on the entire dataset. Their experiments cover three kinds of data: handwritten
digits, newsgroups, and UCI datasets. The results suggest that the randomized mincut algorithm is
applicable in a variety of settings.
8.3 Sentiment Analysis
The analysis of opinions and subjectivity has received a great amount of attention lately
because of its various applications. Pang and Lee  discuss a new approach to sentiment analysis
using graph mincuts. Previous approaches focused on selecting lexical features and classified
sentences based on those features. In contrast, the method in  (1) labels the sentences in
the document as either subjective or objective and then (2) applies a standard machine learning
classifier to the resulting extract of subjective sentences.
Document polarity can be considered as a special case of text classification with sentiment
rather than topic based categories. Therefore, one can apply standard machine learning techniques
to this problem. The sentiment extraction pipeline is as follows:
n-sentence Review → subjectivity tagging → m-sentence extraction → positive/negative
The cut-based classification uses two weight functions: ind_j(x) and assoc(x, y). The definition
of these functions is discussed later in this section. Here, ind_j(x) measures the closeness of
data point x to class j, and the association function measures the similarity of two data points.
The mincut minimizes
$$\sum_{x \in C_1} \mathrm{ind}_2(x) + \sum_{x \in C_2} \mathrm{ind}_1(x) + \sum_{x_i \in C_1,\, x_k \in C_2} \mathrm{assoc}(x_i, x_k)$$
Every cut corresponds to a partition of the items, and the cost of the cut equals the partition
cost above, so the mincut minimizes this cost. They set the individual scores $\mathrm{ind}_1(x)$
to $Pr^{NB}_{sub}(x)$ and $\mathrm{ind}_2(x)$ to $1 - Pr^{NB}_{sub}(x)$, where $Pr^{NB}_{sub}(x)$
is the Naive Bayes estimate of the probability that sentence x is subjective (SVM weights can
also be used). The degree of proximity is used as the association score of two nodes.
$$\mathrm{assoc}(x_i, x_j) = \begin{cases} f(j - i) \cdot c & \text{if } j - i \le T \\ 0 & \text{otherwise} \end{cases}$$
Here f(d) specifies how the influence of proximal sentences decays with respect to distance d,
c is a constant, and T is a distance threshold. Pang and Lee  use f(d) = 1, $e^{1-d}$, and
$1/d^2$. They show that, for a Naive Bayes polarity classifier, the subjectivity extracts are a
more effective input than the original documents. They also conclude that employing the minimum
cut framework may result in an efficient algorithm for sentiment analysis.
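To make the cost function concrete, the sketch below evaluates the partition cost for a three-sentence document. It is an illustration only: the names `DECAY`, `assoc`, and `partition_cost`, the default values of c and T, and the subjectivity probabilities are made up for the example, and distance is taken as |j − i| so that argument order does not matter.

```python
import math

# The three decay functions f(d) mentioned in the text.
DECAY = {
    "constant": lambda d: 1.0,
    "exponential": lambda d: math.exp(1 - d),
    "inverse_square": lambda d: 1.0 / d ** 2,
}

def assoc(i, j, f, c=1.0, T=3):
    """Association score of the sentences at positions i and j."""
    d = abs(j - i)
    return f(d) * c if 0 < d <= T else 0.0

def partition_cost(in_c1, ind1, ind2, assoc_fn):
    """Cost of a partition: items in C1 pay ind2, items in C2 pay ind1,
    and every pair split across the two classes pays its association score."""
    n = len(in_c1)
    individual = sum(ind2[x] if in_c1[x] else ind1[x] for x in range(n))
    crossing = sum(assoc_fn(i, k) for i in range(n) for k in range(n)
                   if in_c1[i] and not in_c1[k])
    return individual + crossing

# Hypothetical subjectivity probabilities for three consecutive sentences.
ind1 = [0.9, 0.2, 0.8]
ind2 = [1 - p for p in ind1]
cost = partition_cost([True, False, True], ind1, ind2,
                      lambda i, k: assoc(i, k, DECAY["constant"], c=0.5, T=1))
```

Separating the middle sentence from its two neighbors costs 0.5 per broken adjacency, which is how the association term discourages isolated label flips.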
8.4 Energy Minimization
The last work we review in this chapter is a fast approximation method for energy minimization
using mincuts, described in .
The motivation for this work comes from computer vision, where many problems can be naturally
formulated in terms of energy minimization. In this formulation, one aims to find a labeling
function f that minimizes the energy
$$E(f) = E_{smooth}(f) + E_{data}(f)$$
in which $E_{smooth}$ is a function that measures the extent to which f is not smooth, and
$E_{data}$ measures the disagreement between f and the observed data. For example, $E_{data}$ can be
$$E_{data}(f) = \sum_{p \in P} (f_p - I_p)^2$$
with $I_p$ being the observed intensity of pixel p. The main difficulty with such minimization
problems is their large computational cost. Normally, these energy functions have several local
minima, and finding the global minimum is difficult because the space of possible labelings has
dimension |P|, which can be many thousands. The energy function can also be written as
$$E(f) = \sum_{\{p,q\} \in N} V_{p,q}(f_p, f_q) + \sum_{p \in P} D_p(f_p)$$
where N is the set of interacting pairs of pixels (typically, adjacent pixels).
The method described in  generates a local minimum with respect to two types of very large
moves: α-expansion and α-β-swap.
8.4.1 Moves
A labeling f can be represented by a partition of image pixels P = {P_l | l ∈ L}, where
P_l = {p ∈ P | f_p = l} is the subset of pixels labeled l. Given a pair of labels α, β, a move
from partition P to a new partition P′ is called an α-β-swap if P_l = P′_l for any label
l ≠ α, β. That is, the only difference between P and P′ is that some pixels labeled α in P are
now labeled β in P′, and some pixels labeled β in P are now labeled α in P′.
A move from partition P to a new partition P′ is called an α-expansion for a label α if
P_α ⊂ P′_α and P′_l ⊂ P_l for any label l ≠ α. This move allows a set of image pixels to change
their labels to α.
The two algorithms built on these moves are as follows:
α-β-swap:
1. Start with an arbitrary labeling f.
2. Set Success = 0.
3. For each pair of labels {α, β} ⊂ L
Find f̂ = arg min E(f′) among f′ within one α-β-swap of f.
If E(f̂) < E(f), set f = f̂ and Success = 1
4. If Success = 1 goto 2.
5. Return f
α-expansion:
1. Start with an arbitrary labeling f.
2. Set Success = 0.
3. For each label α ∈ L
Find f̂ = arg min E(f′) among f′ within one α-expansion of f.
If E(f̂) < E(f), set f = f̂ and Success = 1
4. If Success = 1 goto 2.
5. Return f
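The outer loop of the expansion algorithm can be sketched as below. This is a toy illustration rather than the published method: the best move inside each iteration is found by brute-force enumeration (the paper uses a graph mincut for that step), the smoothness term is a Potts penalty, and all names and the four-pixel "image" are assumptions made for the example.

```python
import itertools

def energy(f, data_cost, neighbors, smooth):
    """E(f) = sum_p D_p(f_p) + sum over neighboring pairs of V(f_p, f_q)."""
    data = sum(data_cost[p][f[p]] for p in range(len(f)))
    return data + sum(smooth(f[p], f[q]) for p, q in neighbors)

def best_expansion(f, alpha, data_cost, neighbors, smooth):
    """Lowest-energy labeling within one alpha-expansion of f (brute force)."""
    movable = [p for p in range(len(f)) if f[p] != alpha]
    best, best_e = list(f), energy(f, data_cost, neighbors, smooth)
    for r in range(1, len(movable) + 1):
        for subset in itertools.combinations(movable, r):
            g = list(f)
            for p in subset:
                g[p] = alpha             # pixels in the subset switch to alpha
            e = energy(g, data_cost, neighbors, smooth)
            if e < best_e:
                best, best_e = g, e
    return best, best_e

def expansion_minimize(f, labels, data_cost, neighbors, smooth):
    """Repeat expansion moves over all labels until no move lowers the energy."""
    f = list(f)
    current = energy(f, data_cost, neighbors, smooth)
    success = True
    while success:
        success = False
        for alpha in labels:
            g, e = best_expansion(f, alpha, data_cost, neighbors, smooth)
            if e < current:
                f, current, success = g, e, True
    return f, current

# One-dimensional "image" with a single noisy pixel; labels are intensities 0 and 1.
observed = [0, 1, 0, 0]
labels = [0, 1]
data_cost = [[(l - i) ** 2 for l in labels] for i in observed]
neighbors = [(0, 1), (1, 2), (2, 3)]
potts = lambda a, b: 0 if a == b else 1
restored, final_energy = expansion_minimize(observed, labels, data_cost,
                                            neighbors, potts)
```

Starting from the observed labeling (energy 2, from two label discontinuities), a single 0-expansion flips the noisy pixel and lowers the energy to 1, after which no further expansion helps.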
Given an input labeling f and a pair of labels α, β, they find a labeling f̂ that minimizes
E over all labelings within one α-β-swap of f. Similarly, given an input labeling f and a
label α, they find a labeling f̂ that minimizes E over all labelings within one α-expansion of f.
Both steps are carried out using graph mincuts. They further prove that a local minimum f̂ with
respect to expansion moves and the global minimum f* satisfy
$$E(\hat{f}) \le 2c\,E(f^*)$$
where c is a constant that depends on the interaction potentials.
To evaluate their method, they present the results of experiments on visual correspondence for
stereo and motion, and on image restoration. In image restoration they try to restore the
original image from a noisy, corrupted image. In this example, the labels are all possible colors.
9 Graph Learning I
In this chapter and the next, we review some prior learning techniques that use graphs as their
framework. We look at an application to text summarization, the problem of outbreak detection in
blog networks and sensor placement, the problem of web projection, and the co-clustering
technique.
9.1 Summarization
The first section of this chapter describes a summarization method based on the approach in .
Zha  describes text summarization as the task of taking a textual document, extracting
appropriate content, and presenting the most important facts of the text to the user in a way
that matches the user's needs.
Zha  adopts an unsupervised approach and explicitly models both keyphrases and the sentences
that contain them, using bipartite graphs. For each document, two sets of objects are generated:
the set of terms T = {t1 , t2 , · · · , tn } and the set of sentences S = {s1 , s2 , · · · , sm }. A
bipartite graph is then constructed between T and S, with an edge between ti and sj if ti
appears in sj . The weight of the edge is a nonnegative value and can be set proportional to the
number of times ti appears in sj .
The major principle in this paper is the following:
A term should have a high salience score if it appears in many sentences with high
salience scores, while a sentence should have a high salience score if it contains many
terms with high salience scores.
More formally,
$$u(t_i) \propto \sum_{v(s_j) \sim u(t_i)} w_{ij}\, v(s_j)$$
$$v(s_j) \propto \sum_{u(t_i) \sim v(s_j)} w_{ij}\, u(t_i)$$
where a ∼ b denotes the existence of an edge between a and b. The matrix form of these equations
is
$$u = \frac{1}{\sigma} W v, \qquad v = \frac{1}{\sigma} W^T u$$
It is then clear that u and v are the left and right singular vectors of W corresponding to the
singular value σ and that, if σ is the largest singular value of W, then both u and v have
nonnegative components. For numerical computation of the largest singular value triple {u, σ, v},
Zha  uses a variation of the power method. Specifically, the author initializes v to the
vector of all ones. The following updates are then repeated until convergence:
1. u = W v, u = u/‖u‖
2. v = W^T u, v = v/‖v‖
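The two alternating updates translate directly into code. The sketch below uses plain Python lists; the function name `salience_scores`, the stopping tolerance, and the small term-by-sentence matrix are assumptions chosen for illustration.

```python
import math

def normalize(x):
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]

def salience_scores(W, iterations=200, tol=1e-12):
    """Power iteration for the leading singular vectors of a term-by-sentence
    matrix W (rows are terms, columns are sentences)."""
    n_terms, n_sents = len(W), len(W[0])
    v = [1.0] * n_sents                   # initial sentence scores: all ones
    u = [0.0] * n_terms
    for _ in range(iterations):
        # Step 1: u = W v, then normalize.
        u = normalize([sum(W[i][j] * v[j] for j in range(n_sents))
                       for i in range(n_terms)])
        # Step 2: v = W^T u, then normalize.
        new_v = normalize([sum(W[i][j] * u[i] for i in range(n_terms))
                           for j in range(n_sents)])
        converged = max(abs(a - b) for a, b in zip(v, new_v)) < tol
        v = new_v
        if converged:
            break
    return u, v

# Three terms, three sentences; the middle sentence contains every term.
W = [[1.0, 1.0, 0.0],
     [0.0, 1.0, 1.0],
     [0.0, 1.0, 0.0]]
term_scores, sentence_scores = salience_scores(W)
```

As expected from the mutual reinforcement principle, the sentence that contains all three terms receives the highest salience score.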
Zha  uses this salience score to identify salient sentences within topics, and uses clustering
to find the topics. For sentence clustering, the author builds an undirected weighted graph with
sentences as nodes and an edge wherever two sentences share a term. The weights wij are taken
from W^T W, whose (i, j) entry is nonzero exactly when si and sj share a term.
Two sentences si and sj are said to be nearby if one follows the other in the linear order of the
document. Then, set
$$\hat{w}_{ij} = \begin{cases} w_{ij} + \alpha & \text{if } s_i \text{ and } s_j \text{ are nearby} \\ w_{ij} & \text{otherwise} \end{cases}$$
For a fixed α they then apply the spectral clustering technique to obtain a set of sentence
clusters Π*(α). Thus, the problem is reduced to minimizing a clustering cost function defined as
$$\mathrm{GCV}(\alpha) = \frac{n - k - J(W, \Pi^*(\alpha))}{\gamma(\Pi^*(\alpha))}$$
where k is the number of desired sentence clusters, W is the weight matrix of the term-sentence
bipartite graph, and Π*(α) is the set of clusters obtained by applying spectral clustering to the
modified weight matrix W(α). Also,
$$J(W, \Pi) = \sum_{i=1}^{k} \sum_{s=1}^{n_i} \| w_s^{(i)} - m_i \|^2$$
and
$$m_i = \sum_{s=1}^{n_i} w_s^{(i)} / n_i$$
They then use the traditional K-means algorithm iteratively and in each iteration do the following
steps:
1. For each sentence vector w, find the center mi that is closest to w, and associate w with this
cluster
2. Compute the new set of centers by taking the center of mass of sentence vectors associated
with that center.
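The two K-means steps can be sketched as follows. The function name `kmeans`, the choice to keep an empty cluster's old center, and the toy two-dimensional "sentence vectors" are assumptions made for the example.

```python
def kmeans(vectors, centers, iterations=100):
    """Alternate the two steps above until the centers stop changing."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Step 1: associate each vector with its closest center.
        clusters = [[] for _ in centers]
        for w in vectors:
            dists = [sum((a - b) ** 2 for a, b in zip(w, m)) for m in centers]
            clusters[dists.index(min(dists))].append(w)
        # Step 2: move each center to the center of mass of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else m
            for members, m in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

vectors = [[0, 0], [0, 1], [10, 10], [10, 11]]
centers, clusters = kmeans(vectors, centers=[[0, 0], [10, 10]])
```

Each iteration can only decrease J(W, Π), which is why the procedure converges, but only to a local minimum that depends on the initial centers.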
This way they find a local minimum of J(W, Π) with respect to Π. They then express the
sum-of-squares cost as a matrix trace maximization with special constraints; relaxing these
constraints leads to a trace maximization problem that has a globally optimal solution. Formally,
let
$$m_i = W_i e / n_i$$
and
$$X = \mathrm{diag}(e/\sqrt{n_1}, \cdots, e/\sqrt{n_k})$$
where W_i collects the sentence vectors of cluster i and e is the all-ones vector of matching
length. The sum-of-squares function can be written as
$$J(W, \Pi) = \mathrm{trace}(W^T W) - \mathrm{trace}(X^T W^T W X)$$
so minimizing it amounts to solving
$$\max_{X^T X = I_k} \mathrm{trace}(X^T W^T W X)$$
Their clustering algorithm can be summarized as follows:
1. Compute the k eigenvectors $V_k = [v_1, \cdots, v_k]$ of $W_s(\alpha)$ corresponding to the k
largest eigenvalues.
2. Compute the pivoted QR decomposition of $V_k^T$ as
$$V_k^T P = QR = Q[R_{11}, R_{12}]$$
where Q is a k-by-k orthogonal matrix, $R_{11}$ is a k-by-k upper triangular matrix, and P is a
permutation matrix.
3. Compute
$$\hat{R} = R_{11}^{-1}[R_{11}, R_{12}]P^T = [I_k, R_{11}^{-1}R_{12}]P^T$$
The cluster assignment of each sentence is then determined by the row index of the
largest element in absolute value of the corresponding column of $\hat{R}$.
9.2 Cost-effective Outbreak
The outbreak detection problem is to find the most effective way to select a set of nodes
that detects a spreading process in a network. The problem is receiving interest because many
real-world problems can be modeled in this setting. A solution is suggested in , where the
authors also apply their method to city water pipe networks and blog networks.
In the former, we have a limited budget for placing sensors at nodes in the network so
that water contaminants can be detected as quickly as possible. The latter concerns information
spread in the blogspace, where a user wants to read a limited number of posts and still get
the most up-to-date information about a story that is propagating through the network.
More formally, in both problems we seek a subset A of nodes in a graph G = (V, E)
that detects outbreaks quickly. An outbreak starts at some node and spreads along edges.
Each edge has an associated time that the contaminant takes to reach the target
node. Each choice of sensor locations achieves a certain placement score, given by a set
function R that maps every placement A to a real number R(A).
There is also a cost function c(A) associated with each placement, and the total cost
of the sensor placement must not exceed our budget B. Sensors placed in the network can be of
different types and quality, resulting in a variety of costs. Let c(s) denote the cost of buying
sensor s. Then the cost of a placement A is $c(A) = \sum_{s \in A} c(s)$. The problem is to
maximize the reward subject to the budget constraint:
$$\max_{A \subseteq V} R(A) \quad \text{subject to} \quad c(A) \le B$$
Depending on the time t = T(i, s) at which sensor s detects the outbreak in scenario i, we incur
a penalty $\pi_i(t)$, a function that depends on the scenario. The goal is then to minimize the
expected penalty over all possible scenarios:
$$\pi(A) \equiv \sum_i P(i)\,\pi_i(T(i, A))$$
where, for a placement A ⊆ V, T(i, A) is the minimum of T(i, s) over all s ∈ A, that is, the time
until scenario i is first detected by a sensor in A. P is the probability distribution over the
scenarios, which we assume is given beforehand.
An alternative formulation of the problem defines a scenario-specific penalty reduction,
$R_i(A) = \pi_i(\infty) - \pi_i(T(i, A))$, and
$$R(A) = \sum_i P(i) R_i(A) = \pi(\emptyset) - \pi(A)$$
The penalty reduction function has several important properties. First, R(∅) = 0. Second, R is
nondecreasing: for A ⊆ B, R(A) ≤ R(B). Third, adding a sensor to a small placement A improves
the score at least as much as adding that sensor to a larger placement B ⊇ A. This property of a
set function is called submodularity. Leskovec et al.  show that, for all placements
A ⊆ B ⊆ V and sensors s ∈ V \ B, the following holds:
$$R(A \cup \{s\}) - R(A) \ge R(B \cup \{s\}) - R(B)$$
Maximizing submodular functions is NP-hard in general. The authors develop the CELF
(cost-effective lazy forward selection) algorithm, which exploits submodularity to find a
near-optimal solution and works well when costs are not constant, a case in which the plain
greedy algorithm can fail badly. Their algorithm is guaranteed to achieve at least a fraction
$\frac{1}{2}(1 - 1/e)$ of the optimal solution even when every node has a different cost. As far
as running time is concerned, the CELF algorithm runs 700 times as fast as the standard greedy
algorithm.
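The lazy evaluation idea can be sketched as follows. This is a simplified illustration in the spirit of CELF, not the authors' code: it runs a lazy greedy pass with unit-cost gains and another with benefit-cost ratios and returns the better placement. The function names, the set-coverage reward, and the costs are assumptions chosen for the example.

```python
import heapq

def lazy_greedy(nodes, reward, cost, budget, use_ratio):
    """Greedy selection with lazy re-evaluation of marginal gains."""
    selected, spent = [], 0.0
    current = reward(selected)
    heap = []
    for s in nodes:
        gain = reward([s]) - current
        score = gain / cost[s] if use_ratio else gain
        # Entries are (-score, node, size of the placement the gain refers to).
        heap.append((-score, s, 0))
    heapq.heapify(heap)
    while heap:
        _, s, stamp = heapq.heappop(heap)
        if spent + cost[s] > budget:
            continue                      # can never become affordable again
        if stamp == len(selected):
            # Gain is up to date; by submodularity no stale entry can beat it.
            selected.append(s)
            spent += cost[s]
            current = reward(selected)
        else:
            gain = reward(selected + [s]) - current
            score = gain / cost[s] if use_ratio else gain
            heapq.heappush(heap, (-score, s, len(selected)))
    return selected, current

def celf(nodes, reward, cost, budget):
    """Return the better of the unit-cost and benefit-cost lazy greedy runs."""
    unit = lazy_greedy(nodes, reward, cost, budget, use_ratio=False)
    ratio = lazy_greedy(nodes, reward, cost, budget, use_ratio=True)
    return max(unit, ratio, key=lambda result: result[1])

# Toy reward: number of scenarios covered by the chosen sensors (submodular).
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}}
coverage = lambda A: len(set().union(*(covers[s] for s in A)))
sensor_cost = {"a": 2.0, "b": 1.0, "c": 1.0}
placement, value = celf(list(covers), coverage, sensor_cost, budget=2.0)
```

Because marginal gains can only shrink as the placement grows, a stale gain is an upper bound on the true gain. Most nodes therefore never need re-evaluation, which is where the speedup over the standard greedy algorithm comes from.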
9.2.1 Web Projections
Information retrieval methods usually assume that documents are independent. However, effective
web search is not achievable unless the hyperlink relations between web pages
are taken into consideration. Web projection concentrates on the creation and use of graphical
properties of a web subgraph . A web projection graph is a subgraph of the larger web graph,
projected using a seed set of documents retrieved as the result of a query. Leskovec et al. 
investigate how query search results project onto the web graph, how search result quality
can be determined using properties of the projection graph, and whether the projection graph can
be used to estimate how users reformulate queries.
They start with a query and collect the initial seed data using a search engine. Leskovec et al.
 then project the retrieved documents onto the web graph. This projection
results in an induced subgraph called the query projection graph. A query connection graph is
then created by adding intermediate nodes to make the query projection graph connected. These
connecting nodes are not part of the search results. Using these two graphs (i.e., the query
projection graph and the query connection graph) they construct features that describe the
topological properties of the graphs. These features are used with machine learning techniques to
build predictive models.
Overall, they extract 55 features that best describe the topological properties of two graphs.
These 55 features are grouped into four categories:
• Query projection graph features, which measure aspects of the connectivity of the query
projection graph.
• Query connection graph features, which measure aspects of the connectivity of the query
connection graph. These are useful in capturing relations between projection nodes and
connection nodes.
• Combination features, which are compositions of features from other groups.
• Query features, which represent the non-graphical properties of the result set. These features
are calculated using the query text and the returned list of documents relevant to the query.
For the projection, their data contains 22 million web pages from a crawl of the web. The
largest connected component is said to have 21 million nodes, while the second largest connected
component contains merely several hundred nodes. The problem addressed in  is to
project the results of a query onto the web graph and extract the attributes described above.
These can then be used to learn a model that predicts a query's class. They also learn models to
predict user behavior when reformulating queries.
9.3 Co-clustering
A common first step in text clustering is to extract the unique content-bearing words from a set
of text documents. If these words are treated as features, then documents can be represented as
vectors of these features. A possible representation for such a setting is a word-by-document
matrix, in which a non-zero entry indicates the existence of a particular word (row) in a
document (column). Words can be clustered on the basis of the documents they appear in. This
clustering assumes that the more often two terms co-occur in documents, the closer they are. This
type of clustering can be useful in building a statistical thesaurus and in query reformulation.
The problem of simultaneously clustering words and documents is discussed in .
The primary assumption in  is that word clustering influences document clustering, and
document clustering in turn influences word clustering. A given word wi belongs to the word
cluster Wm if its association with the document cluster Dm is at least as great as its
association with any other document cluster. A natural measure of the association of a word with
a document cluster is the sum of edge weights to all documents in the cluster:
$$W_m = \{ w_i : \sum_{j \in D_m} A_{ij} \ge \sum_{j \in D_l} A_{ij},\ \forall l = 1, \cdots, k \}$$
This means that each word cluster is determined by the document clustering. Similarly, each
document cluster is determined by the word clustering:
$$D_m = \{ d_j : \sum_{i \in W_m} A_{ij} \ge \sum_{i \in W_l} A_{ij},\ \forall l = 1, \cdots, k \}$$
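The first of these two assignment rules can be illustrated in a few lines. The function name `assign_words`, the tie-breaking rule (lowest cluster index wins), and the toy word-by-document matrix are assumptions for the example; the document assignment is symmetric, with rows and columns exchanged.

```python
def assign_words(A, doc_cluster, k):
    """Assign each word (row of A) to the cluster index m whose document
    cluster D_m carries the largest total edge weight for that word."""
    word_cluster = []
    for row in A:
        association = [0.0] * k
        for j, weight in enumerate(row):
            association[doc_cluster[j]] += weight
        # Ties are broken in favor of the lowest cluster index.
        word_cluster.append(association.index(max(association)))
    return word_cluster

# Three words (rows), four documents (columns), two document clusters.
A = [[2, 1, 0, 0],
     [0, 0, 3, 1],
     [1, 0, 0, 2]]
word_cluster = assign_words(A, doc_cluster=[0, 0, 1, 1], k=2)
```

Here the third word occurs in both document clusters but is assigned to the second, where its total weight (2) exceeds its weight in the first (1).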
According to Dhillon, the best word and document clustering corresponds to a minimum
k-partitioning of the document-word bipartite graph. This clustering is performed using a spectral
algorithm, and experimental results on the Medline (1033 medical abstracts) and Cranfield
(1400 aeronautical systems abstracts) corpora show that it works well.
10 Graph Learning II
In this chapter we continue our review on some of the learning techniques that utilize graphs as
their underlying framework.
10.1 Dimensionality Reduction
In artificial intelligence and information retrieval, researchers often encounter problems where
they have to deal with high-dimensional data that naturally arises from a smaller number of
dimensions. Such data is intrinsically low dimensional but lies in a higher-dimensional
space. The problem of dimensionality reduction is to represent high-dimensional data in
a lower-dimensional space. To do so, Belkin and Niyogi  propose an algorithm that has several
properties. First, the algorithm is quite simple, with few local computations. Second, the authors
use the Laplace-Beltrami operator in their algorithm. Third, the algorithm performs the
dimensionality reduction in a geometric fashion. Fourth, the use of Laplacian eigenmaps preserves
locality, which makes the algorithm insensitive to noise and outliers.
The problem of dimensionality reduction can be formulated as follows:
Given k points $x_1, \cdots, x_k$ in $R^l$, find a set of k points $y_1, \cdots, y_k$ in $R^m$
such that m << l and $y_i$ represents $x_i$. The algorithm described in  is summarized as follows:
1. Construct a graph by putting an edge between nodes i and j if $x_i$ and $x_j$ are close.
Proximity can be decided either by thresholding the Euclidean distance between the points or by
connecting i and j when each is among the other's k nearest neighbors.
2. Weight the graph using a weight function. This can be the heat kernel
$$W_{ij} = e^{-\|x_i - x_j\|^2 / t}$$
or a simpler choice: set $W_{ij} = 1$ if and only if i and j are connected.
3. Compute eigenvalues and eigenvectors for the generalized eigenvector problem
$$L f = \lambda D f$$
Here D is the diagonal weight matrix with $D_{ii} = \sum_j W_{ji}$, and L = D − W is the Laplacian
matrix. Let $f_0, \cdots, f_{k-1}$ be the solutions of the above equation, sorted by eigenvalue:
$$L f_i = \lambda_i D f_i, \qquad 0 = \lambda_0 \le \lambda_1 \le \cdots \le \lambda_{k-1}$$
Then leave out $f_0$, the eigenvector corresponding to eigenvalue 0, and use the next m
eigenvectors to embed the data in an m-dimensional Euclidean space:
$$x_i \to (f_1(i), \cdots, f_m(i))$$
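Steps 1 and 2 and the construction of D and L can be sketched in plain Python as below. The generalized eigenproblem of step 3 is left to a linear algebra library and is not shown. The function names, the distance threshold `eps`, and the three sample points are assumptions made for the example.

```python
import math

def heat_kernel_graph(points, eps, t):
    """Steps 1-2: connect points closer than eps and weight edges by the heat kernel."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if d2 < eps ** 2:
                W[i][j] = W[j][i] = math.exp(-d2 / t)
    return W

def laplacian(W):
    """Return the diagonal of D and the matrix L = D - W used in step 3."""
    n = len(W)
    D = [sum(row) for row in W]
    L = [[(D[i] if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    return D, L

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
W = heat_kernel_graph(points, eps=2.0, t=1.0)
D, L = laplacian(W)
# Every row of L sums to zero, so the constant vector solves L f = 0 * D f.
row_sums = [sum(row) for row in L]
```

The zero row sums show why the eigenvector for eigenvalue 0 is constant and carries no information about the data, which is the reason it is left out of the embedding.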
The authors conduct experiments on a simple synthetic example of “swiss roll” as well as an
example from vision with vertical and horizontal bars in a visual field. In both cases they use the
simpler weight function ( Wij ∈ {0, 1} ) and show it works well for the evaluated datasets.
10.2 Semi-Supervised Learning
In this section we focus on a semi-supervised learning method using harmonic functions described in
. Semi-supervised learning has received great attention since labeled examples are too expensive
and time consuming to create. Building a large labeled dataset requires a large effort of skilled
human annotators.
The work in  adopts Gaussian fields over a continuous state space rather than random fields
over a discrete label set. The authors assume the solution is based solely on the structure of
the data manifold. The framework of the solution is as follows:
Suppose $(x_1, y_1), \cdots, (x_l, y_l)$ are labeled and $(x_{l+1}, y_{l+1}), \cdots, (x_{l+u}, y_{l+u})$
are unlabeled data points, where l << u and n = l + u. Construct a connected graph of the data
points with the weight function
$$w_{ij} = \exp\left(-\sum_{d=1}^{m} \frac{(x_{id} - x_{jd})^2}{\sigma_d^2}\right)$$
where $x_{id}$ is the d-th component of instance $x_i$, represented as a vector $x_i \in R^m$, and
$\sigma_d$ are length scale hyperparameters for each dimension. They first compute a function f
on the graph and then assign labels based on that function. To assign a probability distribution
over functions f, they form the Gaussian field
$$p_\beta(f) = \frac{e^{-\beta E(f)}}{Z_\beta}$$
where β is an inverse temperature parameter, $Z_\beta$ is the partition function, and E(f) is the
quadratic energy function:
$$E(f) = \frac{1}{2}\sum_{i,j} w_{ij}\,(f(i) - f(j))^2$$
Zhu et al.  show that the minimum energy function
$$f = \arg\min_{f|_L = f_l} E(f)$$
is harmonic: its value at each unlabeled point is the weighted average of its values at the
neighboring points. Class labels are then determined by thresholding the harmonic function. That
is, assign 1 if f > 1/2 and assign 0 otherwise.
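A harmonic function takes, at every unlabeled node, the weighted average of its neighbors' values, so it can be approximated by repeatedly applying that averaging. The sketch below does this with Gauss-Seidel sweeps; this is an illustrative substitute for the closed-form matrix solution, and the function name, tolerance, and four-node chain are assumptions made for the example.

```python
def harmonic_labels(W, labels, iterations=10000, tol=1e-12):
    """Approximate the harmonic function on a weighted graph.

    W is a symmetric weight matrix with zero diagonal; `labels` maps the
    indices of labeled nodes to 0.0 or 1.0. Unlabeled nodes start at 0.5.
    """
    n = len(W)
    f = [labels.get(i, 0.5) for i in range(n)]
    for _ in range(iterations):
        delta = 0.0
        for i in range(n):
            if i in labels:
                continue                  # labeled values stay clamped
            degree = sum(W[i])
            new = sum(W[i][j] * f[j] for j in range(n)) / degree
            delta = max(delta, abs(new - f[i]))
            f[i] = new
        if delta < tol:
            break
    # Harmonic threshold: class 1 if f > 1/2, class 0 otherwise.
    return f, [1 if value > 0.5 else 0 for value in f]

# A chain 0 - 1 - 2 - 3 with unit weights; the end points are labeled.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
f, classes = harmonic_labels(W, labels={0: 1.0, 3: 0.0})
```

On the chain the harmonic function interpolates linearly between the two labeled end points, giving values close to 2/3 and 1/3 at the interior nodes.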
Further, they describe how to use external classifiers, that is, classifiers that are trained
separately on the labeled data points and are already at hand. To combine an external classifier
with the harmonic function, for each unlabeled node i they create a dongle node carrying the
label given by the external classifier, set the transition probability from i to its dongle to η,
and discount all other transitions from i by 1 − η. Then they perform the harmonic minimization
on this augmented graph.
10.3 Diffusion Kernels
This section covers a feature extraction method described in . This article uses a graph of
genes in which two genes are linked whenever they catalyze two successive reactions. The question
is whether the knowledge represented by this graph can help improve the performance of gene
function prediction. The formulation of the problem is as follows:
The discrete set X represents the set of genes, with |X| = n. The expression profiles are given
by a mapping e : X → R^p, with p being the number of measurements. In this setting, e(x) is the
expression profile of gene x. The goal is to use the graph to extract features from the expression
profiles.
Let F be the set of features. Each feature is simply a mapping from the set of genes to the real
numbers, f : X → R. Let G denote the set of linear features, those of the form
$f_{e,v}(x) = v' e(x)$ for some direction v ∈ R^p. The normalized variance of a linear feature is
defined by
$$\forall f_{e,v} \in G, \quad V(f_{e,v}) = \frac{\sum_{x \in X} f_{e,v}(x)^2}{\|v\|^2}$$
Linear features with a large normalized variance are called relevant. These can be extracted using
principal component analysis. Additionally, if a feature varies slowly between adjacent
nodes in the graph, it is called smooth, as opposed to rugged. A "good" feature is both smooth and
relevant.
If we define two functions h1 : F → R+ and h2 : G → R+ that penalize the lack of smoothness and
relevance respectively, then the problem can be regularized as the following optimization problem:
$$\max_{(f_1, f_2) \in F_0 \times G} \frac{f_1' f_2}{\sqrt{f_1' f_1 + \delta h_1(f_1)}\,\sqrt{f_2' f_2 + \delta h_2(f_2)}}$$
Here δ is a parameter that controls the trade-off between smoothness and relevance. The method in
 specifies the smoothness function of a feature on a graph through its energy at high frequency,
computed with a Fourier transform. Assume L is the Laplacian of the graph,
$0 = \lambda_1 \le \cdots \le \lambda_n$ are its eigenvalues, and $\{\phi_i\}$ is the orthonormal
set of corresponding eigenvectors. The Fourier decomposition of any feature f ∈ F is
$$f = \sum_{i=1}^{n} \hat{f}_i \phi_i$$
where $\hat{f}_i = \phi_i' f$. The smoothness function for a feature f ∈ F is calculated as
$$\|f\|^2_{K_\zeta} = \sum_{i=1}^{n} \frac{\hat{f}_i^2}{\zeta(\lambda_i)}$$
where ζ is a monotonically decreasing mapping, and $K_\zeta : X^2 \to R$ is defined by
$$K_\zeta(x, y) = \sum_{i=1}^{n} \zeta(\lambda_i)\,\phi_i(x)\,\phi_i(y)$$
The matrix $K_\zeta$ is positive definite because the mapping ζ only takes positive values. As i
increases, $\lambda_i$ increases, so $\zeta(\lambda_i)$ decreases. Consequently, the above norm
takes a higher value on a feature with a lot of energy at high frequency, so it can be used as a
smoothness penalty. A decreasing exponential function is a good example of ζ.
The relevance of a feature is also defined in  as
$$h_2(f_{e,v}) = \|f_{e,v}\|_H$$
where H is the reproducing kernel Hilbert space (RKHS) associated with the linear kernel
K(x, y) = e(x)′e(y).
10.3.1 Reformulation
If $K_1 = \exp(-\tau L)$ is the diffusion kernel and $K_2(x, y) = e(x)'e(y)$ is the linear kernel,
then taking $h_1(f) = \|f\|_{H_1}$ and $h_2(f) = \|f\|_{H_2}$, the problem can be expressed in its
dual form as
$$\max_{(\alpha, \beta) \in F^2} \gamma(\alpha, \beta) = \frac{\alpha' K_1 K_2 \beta}{(\alpha'(K_1^2 + \delta K_1)\alpha)^{1/2}\,(\beta'(K_2^2 + \delta K_2)\beta)^{1/2}}$$
The above formulation is a generalization of canonical correlation analysis known as kernel-CCA,
which is discussed in .
Results reported by the article are encouraging. The performance is shown to be above 80%
for some classes, and the method seems successful on some classes that cannot be learned using
SVM.
11 Sentiment Analysis I
11.1 Introduction
This chapter and the next review some papers on sentiment and opinion analysis. Researchers have
been investigating the problems of automatic text categorization and sentiment analysis for the
past two decades. Sentiment analysis seeks to characterize opinionated natural language text.
11.2 Unstructured Data
The growth in Internet users and in content-generating Internet applications has resulted in a
great increase in the amount of unstructured text on the Internet. Blogs, discussion groups,
forums, and similar pages are examples of such growing content. This creates a need for sentiment
analysis techniques that can handle the lack of text structure.
Most previous work on sentiment analysis assumes pairwise independence between the
features used in classification. In contrast, the work in  proposes a machine learning
technique for learning the predominant sentiments of on-line texts that captures dependencies
among words and finds a minimal and sufficient vocabulary for the categorization task. Xue
Bai, Rema Padman and Edoardo Airoldi  present a two-stage Markov Blanket Classifier (MBC)
that learns conditional dependencies among the words and encodes them into a Markov Blanket
Directed Acyclic Graph (MB DAG) for the sentiment variable. The classifier then uses a
meta-heuristic search named Tabu Search (TS). Detecting dependencies is important, as it allows
finding semantic relations between subjective words.
Before discussing the methodology, we give some background. A Markov Blanket (MB) for
a random variable Y is a subset Q of a set of random variables X such that Y is independent of
X\Q conditional on all variables in Q. Different MB DAGs that entail the same factorization
for the conditional probability of Y, conditional on a set of variables X, are said to belong to
the same Markov equivalence class. The search used in this paper is TS, a
meta-heuristic strategy that helps local search heuristics explore the state space by guiding them
out of local optima . The basic Tabu search is simple. It starts with a solution and iteratively
chooses the best move according to a fitness function, while ensuring that solutions
previously met are not revisited in the short term.
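A basic Tabu search of this kind can be sketched as below. This is a generic illustration, not the search used inside the MBC classifier: the function name `tabu_search`, the fixed-length tabu list, and the toy one-dimensional fitness landscape are assumptions made for the example.

```python
from collections import deque

def tabu_search(start, neighbors, fitness, iterations=100, tabu_size=5):
    """Basic Tabu search: always take the best non-tabu move, even if it is
    worse than the current solution, and remember recently visited solutions."""
    current = best = start
    tabu = deque([start], maxlen=tabu_size)   # short-term memory
    for _ in range(iterations):
        candidates = [s for s in neighbors(current) if s not in tabu]
        if not candidates:
            break
        current = max(candidates, key=fitness)
        tabu.append(current)
        if fitness(current) > fitness(best):
            best = current
    return best

# Toy landscape over positions 0..6 with a local optimum at position 1
# and the global optimum at position 5.
values = [0, 3, 1, 0, 2, 5, 4]
fitness = lambda x: values[x]
neighbors = lambda x: [y for y in (x - 1, x + 1) if 0 <= y < len(values)]
best_position = tabu_search(0, neighbors, fitness)
```

Plain hill climbing started at position 0 would stop at the local optimum at position 1. The tabu list forbids stepping back, so the search is pushed through the valley and reaches position 5.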
The algorithm can be summarized as follows:
It first generates an initial MB for Y from the data. To do so, it collects the variables that are
within two hops of Y in the graphical representation: potential parents and children, plus
their potential parents and children. The graph so far is undirected. To make the graph directed,
the method applies the rules described in , then prunes the remaining undirected and
bi-directed edges to avoid cycles, passes the result to Tabu Search, and returns the MB DAG.
They evaluate their algorithm on data containing approximately 29,000 posts to the
rec.arts.movies.reviews newsgroup archived at the Internet Movie Database (IMDb). To put the
data into the right format, they convert the explicit ratings into one of three categories:
positive, negative, or neutral. They extract all the words that appear in more than 8 documents
and are thus left with a total of 7,716 words as input features. Each document in this
experiment is therefore represented as
$$X = [X_1, \cdots, X_{7716}]$$
11.3 Appraisal Expressions
Appraisal expression extraction can be viewed as a fundamental task in sentiment analysis. An
appraisal expression is a textual unit expressing an evaluative stance toward some target. The
paper that I’m going to discuss in this section tries to characterize the evaluative attributes of
these textual units. The method in  extracts adjectival expressions. An appraisal expression,
by definition, contains a source, an attitude, and a target, each represented by different attributes.
As a simple example consider the following sentence:
I found the movie quite monotonous.
Here the speaker (the Source) has a negative opinion (quite monotonous) towards the movie (the
Target). The appraisal theory is based on the following definitions:
• Attitude type: affect, appreciation, or judgment.
• Orientation: positive, or negative.
• Force: intensity of the appraisal.
• Polarity: marked, or unmarked.
• Target type: semantic type for the target.
They use a chunker to find attitude groups and targets using a pre-built lexicon.
Each sentence is parsed into a dependency representation, and linkages are ranked so that the paths
in the dependency tree connecting words in the source to words in the target can be found. After
these linkages are made, this information is used to disambiguate the multiple senses that an appraisal
expression may present.
Let's denote the linkage type used in a given appraisal expression by ℓ, the set of all possible
linkages by L, and a specific linkage type by l. Also let's denote the target type of a given appraisal
expression by τ, the set of all target types by T, and a specific target type by t.
The goal is to estimate, for each appraisal expression e in the corpus, the probability of its
attitude type being a, given the expression's target type t and its linkage type l:
$$P(A = a \mid e) = P(A = a \mid \tau = t, \ell = l)$$
Let the model of this probability be M. Then
$$P_M(A = a \mid \tau = t, \ell = l) = \frac{P_M(\tau = t, \ell = l \mid A = a)\, P_M(A = a)}{P_M(\tau = t, \ell = l)}$$
Assuming conditional independence yields:
$$P_M(A = a \mid \tau = t, \ell = l) = \frac{P_M(\tau = t \mid A = a)\, P_M(\ell = l \mid A = a)\, P_M(A = a)}{P_M(\tau = t)\, P_M(\ell = l)}$$
Given a set of appraisal expressions E extracted by chunking and linkage detection, the goal is to
find the maximum likelihood model
$$M^* = \arg\max_M \prod_{e \in E} \prod_{a \in A} P_M(A = a \mid e)$$
They perform two separate evaluations: one of the overall accuracy of the
entire system, and one specifically of the accuracy of the probabilistic disambiguator.
11.4 Blog Sentiment
In this section, we discuss the work in  that describes textual and linguistic features extracted
and used in a classifier to analyze the sentiment in blog posts. The aim is to see whether a post is subjective,
and whether it expresses a good opinion or a bad one.
The training dataset used in this paper is created in two steps. First, the authors obtain the
data via RSS feeds. Objective feeds are from news sites such as CNN and NPR, as well as various sites
about health, world, and science. Subjective feeds include content from newspaper columns, letters
to the editor, reviews, and political blogs. In the second step, Chesley et al.  manually verify
each document to confirm whether it is subjective, objective, positive, or negative. The authors
use three classes of features to classify the sentiment: textual features, part-of-speech features, and
lexical semantic features. Each post is then represented as a feature vector for an SVM.
Chesley et al.  use Wiktionary, the online Wikipedia dictionary, to determine the polarity of adjectives
in the text, as adjectives in Wiktionary are often defined by a list of synonyms. The
Wiktionary method assumes that an adjective will most likely have the same
polarity as its synonyms. Each blog post, in this method, is represented as a vector with count
values for each feature. A binary classification is then performed for each post using a Support
Vector Machine (SVM) classifier. Their hold-out experiments show that for objective and positive
posts, positive adjectives acquired from Wikipedia's Wiktionary play a key role in increasing overall
accuracy.
11.5 Online Product Review
A large amount of web content is subjective and reflects people's reviews of different products and
services. Analyzing these reviews gives a user an aggregate view of the entire collection
of opinions and segments the articles into classes that can be explored further. In this section
we review the method of sentiment analysis in , which uses a rather large dataset and n-grams
instead of unigrams. Their dataset consists of over 200k online reviews with an average length of more
than 800 bytes.
Three classifiers are described in . The Passive-Aggressive (PA) classifiers are a family of
margin-based online learning algorithms, and are similar to SVMs. In fact, they can be viewed as
an online SVM. PA tries to find a hyperplane that divides the instances into two groups. The margin is
proportional to an instance's distance to the hyperplane. When the classifier makes an error, the
PA algorithm uses the margin to update the current classifier. Choosing PA instead of SVM
for the classification has two advantages. First, the PA algorithm follows an online learning
pattern. Second, the PA algorithm has a theoretical loss bound, which makes the performance of the
classifier predictable.
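To make the margin-based update concrete, here is a minimal sketch of one Passive-Aggressive step for binary labels. It assumes the basic (hard-margin) PA variant with hinge loss; the text does not say which PA variant Cui et al. use, and the function name `pa_update` is hypothetical.

```python
def pa_update(w, x, y):
    """One Passive-Aggressive step for a binary label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)           # hinge loss on this instance
    norm_sq = sum(xi * xi for xi in x)
    if loss == 0.0 or norm_sq == 0.0:
        return list(w)                      # passive: the margin is already satisfied
    tau = loss / norm_sq                    # aggressive: smallest step that fixes the margin
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
w = pa_update(w, [1.0, 0.0], +1)
w_after_first = list(w)
w = pa_update(w, [1.0, 1.0], -1)
```

Each update changes the weights only as much as needed for the current instance to reach a margin of 1, which is why instances can be processed one at a time.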
The second classifier is the Language Modeling (LM) based classifier, which is a generative
method and classifies a word, sentence, or string by calculating its probability of generation. The
probability of a string $s = w_1 \cdots w_l$ is calculated as
$$P(s) = \prod_{i=1}^{l} P(w_i \mid w_1^{i-1})$$
where $w_i^j = w_i \cdots w_j$.
Cui et al.  use Good-Turing estimation in their method, which states that if an n-gram
occurs r times, it should be treated as if it had occurred r* times:
$$r^* = (r + 1)\, \frac{n_{r+1}}{n_r}$$
where $n_r$ denotes the number of n-grams that occur r times in the training data.
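As a small worked example of the adjusted count r* = (r + 1) n_{r+1} / n_r, the sketch below computes it from a table of n-gram counts. Falling back to the raw count when n_{r+1} = 0 is an assumption made here to keep the example total; the function name and the toy counts are hypothetical.

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Map each observed count r to r* = (r + 1) * n_{r+1} / n_r."""
    n = Counter(ngram_counts.values())      # n[r] = number of n-grams seen exactly r times
    adjusted = {}
    for r in sorted(n):
        if n.get(r + 1, 0) > 0:
            adjusted[r] = (r + 1) * n[r + 1] / n[r]
        else:
            adjusted[r] = float(r)          # assumption: keep r when n_{r+1} is zero
    return adjusted

counts = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 2, "f g": 3}
adjusted = good_turing_counts(counts)
```

Here three n-grams occur once and two occur twice, so an n-gram seen once is treated as if it had been seen 2 × 2 / 3 times.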
Winnow is the third classifier described in . Winnow is an online learning algorithm
that has been used for sentiment analysis before. It learns a linear classifier over the
bag-of-words representation of documents to predict the polarity of a review x. More formally,
$$h(x) = \sum_{w \in V} f_w\, c_w(x)$$
where $f_w$ is the weight of word w, and $c_w(x) = 1$ if word w appears in review x and 0 otherwise. Winnow compares h(x) with a threshold value
V to classify the review as positive or negative.
The main contribution of this work, however, is to explore the role of n-grams as features in
analyzing sentiments when n > 3. In their study they set n = 6 and extract nearly 3 million
n-grams after removing those that appeared in fewer than 20 reviews. They calculate the χ² score for each
n-gram. Assume the following assignments:
• t: a term.
• c: a class.
• A: the number of times t and c co-occur.
• B: the number of times t occurs without c.
• C: the number of times c occurs without t.
• D: the number of times neither t nor c occurs.
• N: the number of documents.
Then
$$\chi^2(t, c) = \frac{N \times (AD - CB)^2}{(A + C) \times (B + D) \times (A + B) \times (C + D)}$$
After calculating this score for each n-gram, the n-grams are sorted in descending order by their
χ² scores, and the top M ranked n-grams are chosen as features for the classification procedure.
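The feature selection step can be sketched as below. The sketch assumes the standard chi-square statistic N (AD − CB)² / ((A + C)(B + D)(A + B)(C + D)) and takes N = A + B + C + D, i.e. the counts are document counts; the function names and the example n-grams with their counts are made up for illustration.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term/class pair from its 2x2 contingency counts."""
    n = a + b + c + d                       # assumption: N is the sum of the four cells
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom

def top_m(features, m):
    """Return the m features with the highest chi-square scores."""
    ranked = sorted(features, key=lambda f: chi_square(*features[f]), reverse=True)
    return ranked[:m]

# Hypothetical (A, B, C, D) counts for three n-grams.
features = {
    "not worth the money": (40, 10, 10, 40),
    "the": (25, 25, 25, 25),
    "highly recommend": (30, 5, 20, 45),
}
selected = top_m(features, 2)
```

An n-gram that is spread evenly over both classes, like "the" above, scores 0 and is dropped first.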
They show that high-order n-grams improve the performance of the classifiers, especially on
negative instances. They also claim that this observation had never been tested on a large-scale dataset
before.
12 Sentiment Analysis II
In this chapter we continue our review of some novel techniques in sentiment analysis. We look at
work on the classification of reviews and on subjectivity strength.
12.1 Movie Sale Prediction
Analyzing weblogs to extract opinions is important since they contain discussions about different
products, and opinionated text segments. Mishne and Glance  have looked at blogs to examine
the correlation between sentiment and references. The first set of data that the authors use is
IMDB, the Internet Movie Database. For each movie they obtained the date of its opening, as
well as the gross income during the weekend. Additionally, they have a set of weblog data, from
which they extract the posts relevant to each movie. The relevance of a post to a movie is determined
by the following criteria:
• The post should fall within the period starting a month before and ending a month after the date of
the movie release.
• The post should contain a link to the movie's IMDB page, or the exact movie name should appear
in the content of the post together with one of the words movie, watch, see, or film.
For each post in the list of relevant ones, Mishne and Glance extract part of the text by selecting
k words around the hyperlink where the movie is referred to. The value of k varies between 6
and 250. For each context, they calculate the sentiment score as described in . Their analysis
was carried out on 49 movies released between February and August 2005. The polarity score of this
sentiment analysis is fitted to a log-linear distribution, with the majority of scores falling within a
range of 4 to 7.
In their experiments, they use Pearson's correlation between several sentiment-derived metrics (number of positive contexts, number of negative contexts, total number of non-neutral contexts, ratio between positive and negative contexts, and the mean and variance of sentiment
values) and both the income per screen and the raw sales of each movie.
The experiments in  show that in the domain of movies, there is a high correlation between
references to the movies in blog entries and the movies' gross income. They also demonstrate that the
number of positive reviews correlates better than raw counts in the period prior to the release of a
movie.
12.2 Subjectivity Summarization
The analysis of different opinions and subjectivity has received a great amount of attention lately
because of its various applications. In this section we review a new approach to sentiment analysis
described by Pang and Lee . Two factors distinguish this work from all previous works that
focused on selecting lexical features and classified sentences based on some such features. First,
in this work, labels that are assigned to sentences are either subjective or objective. Second, a
standard machine learning classifier is applied to make an extract.
Document polarity classification can be considered a special case of text classification in which we classify
sentences by sentiment rather than by topic-based categories. This enables us to apply machine
learning techniques to solve this problem. To extract sentiments within a corpus, a system should
review the sentences and tag their subjectivity. Subjective sentences should further be classified
into positive and negative sentences. Suppose we have n items, $x_1, \cdots, x_n$, and we want to classify
them into $C_1$ and $C_2$. This paper uses a mincut-based classification with two weight
functions: $ind_j(x)$ and $assoc(x, y)$. Here, $ind_j(x)$ measures the closeness of data point x to class j,
and the association function measures the similarity of two data points to each other.
The authors would like to maximize each item's “net happiness” (i.e., its individual score for the class
it is assigned to, minus its individual score for the other class). They also aim to penalize putting
highly associated items into different classes. Thus, the problem reduces to an optimization one,
and the goal is to minimize
$$\sum_{x \in C_1} ind_2(x) + \sum_{x \in C_2} ind_1(x) + \sum_{x_i \in C_1,\, x_k \in C_2} assoc(x_i, x_k)$$
To solve this minimization problem, the authors build an undirected graph G with vertices
$\{v_1, \cdots, v_n, s, t\}$. Edges $(s, v_i)$ with weights $ind_1(x_i)$ and $(v_i, t)$ with weights $ind_2(x_i)$ are added.
After that, $\binom{n}{2}$ edges $(v_j, v_k)$ with weights $assoc(x_j, x_k)$ are added. Now, every cut in the graph
G corresponds to a partition of the items whose cost equals the partition cost, and the mincut
minimizes this cost. The individual scores $ind_1(x)$ and $ind_2(x)$ are set to $Pr^{NB}_{sub}(x)$ and $1 - Pr^{NB}_{sub}(x)$ respectively, where $Pr^{NB}_{sub}(x)$ is the Naive Bayes estimate of the probability that the sentence
x is subjective (weights of an SVM can also be used). The degree of proximity is used as the association
score of two nodes.
$$assoc(x_i, x_j) = \begin{cases} f(j - i) \cdot c & \text{if } (j - i) \le T \\ 0 & \text{otherwise} \end{cases}$$
The function f(d) specifies how the influence of proximal sentences decays with respect to the
distance d. The authors used $f(d) = 1$, $e^{1-d}$, and $1/d^2$.
Pang and Lee  show that, for the Naive Bayes polarity classifier, the subjectivity extracts
are more effective inputs than the original documents. The authors also conclude that
employing the mincut framework can result in an efficient algorithm for sentiment analysis.
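The partition cost that the mincut minimizes can be illustrated on a tiny instance. The sketch below assumes the cost is the sum of ind_2 over items placed in C1, ind_1 over items placed in C2, and assoc over pairs split between the classes. It enumerates all labelings by brute force, which is exponential and stands in for a real max-flow/min-cut solver only for illustration; all names and scores are hypothetical.

```python
from itertools import product

def partition_cost(labels, ind1, ind2, assoc):
    """Cost of a labeling; labels[i] is 1 (class C1) or 2 (class C2)."""
    cost = 0.0
    for i, lab in enumerate(labels):
        # An item placed in C1 pays its score for C2, and vice versa.
        cost += ind2[i] if lab == 1 else ind1[i]
    for (i, k), weight in assoc.items():
        if labels[i] != labels[k]:
            cost += weight                  # penalty for separating associated items
    return cost

def best_partition(ind1, ind2, assoc):
    """Brute-force minimizer (illustration only; a mincut solver scales far better)."""
    n = len(ind1)
    return min(product((1, 2), repeat=n),
               key=lambda labels: partition_cost(labels, ind1, ind2, assoc))

# Toy scores for three items (hypothetical values).
ind1 = [0.8, 0.5, 0.1]
ind2 = [0.2, 0.5, 0.9]
assoc = {(0, 1): 1.0, (1, 2): 0.1, (0, 2): 0.2}
labels = best_partition(ind1, ind2, assoc)
cost = partition_cost(labels, ind1, ind2, assoc)
```

Item 1 has no individual preference, but its strong association with item 0 pulls it into the same class.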
12.3 Unsupervised Classification of Reviews
This section focuses on the unsupervised approach to classifying reviews described in . The
PMI-IR method is used to estimate the orientation of a phrase. This method uses Pointwise
Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of words or
phrases. The sentiment extraction method in  is simple. First, it extracts phrases that contain
adjectives and adverbs. This is done by applying a part-of-speech tagger and then extracting
the phrases whose tags follow specific patterns. In the second step the semantic orientation of the extracted
phrases is estimated using PMI-IR. The PMI score between two words, $w_1$ and $w_2$, is calculated
as follows:
$$PMI(w_1, w_2) = \log_2 \left( \frac{p(w_1 \,\&\, w_2)}{p(w_1)\, p(w_2)} \right)$$
where $p(w_1 \,\&\, w_2)$ is the probability that the two words co-occur. This ratio is a measure of the degree
of statistical dependence between the words, and its logarithm is the amount of information that
we acquire about the presence of one of the words when the other is observed.
Semantic Orientation (SO) is then calculated as
$$SO(phrase) = PMI(phrase, \text{“excellent”}) - PMI(phrase, \text{“poor”})$$
With some minor algebraic manipulations:
$$SO(phrase) = \log_2 \left[ \frac{hits(phrase \text{ NEAR “excellent”})\; hits(\text{“poor”})}{hits(phrase \text{ NEAR “poor”})\; hits(\text{“excellent”})} \right]$$
where hits(query) is the number of hits returned for the query. Then the system classifies the
review based on the average semantic orientation of its phrases.
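A small sketch of the hit-count formula and of the averaging step follows. The smoothing constant `eps`, the function names, the output labels, and the hit counts are all assumptions made for illustration; the text specifies only the log-ratio of hit counts and the use of the average orientation.

```python
import math

def semantic_orientation(near_pos, near_neg, hits_pos, hits_neg, eps=0.01):
    """SO(phrase) from search-engine hit counts.

    near_pos / near_neg: hits for the phrase NEAR the positive / negative word.
    hits_pos / hits_neg: hits for the positive / negative word alone.
    eps avoids division by zero when a NEAR query has no hits (an assumption).
    """
    return math.log2(((near_pos + eps) * hits_neg) /
                     ((near_neg + eps) * hits_pos))

def classify_review(phrase_scores):
    """Classify a review by the sign of its average phrase orientation."""
    average = sum(phrase_scores) / len(phrase_scores)
    return "recommended" if average > 0 else "not recommended"

so = semantic_orientation(80, 10, 1000, 500, eps=0.0)   # log2(80*500 / (10*1000))
verdict = classify_review([so, -0.5])
```

With these made-up counts the phrase co-occurs with the positive word four times more often than chance would suggest relative to the negative word, giving an orientation of 2 bits.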
In experiments with 410 reviews from Epinions, the algorithm reaches an average accuracy of
74%. Movie reviews seem difficult to classify, and the accuracy on movie reviews is about 66%.
On the other hand, for banks and automobiles, the accuracy is 80% to 84%. Travel reviews are an
intermediate case. This might suggest that in movies, the whole is not the sum of the parts, while
in banks and automobiles it is.
12.4 Opinion Strength
Subjective expressions are terms or phrases that express sentiments. According to , the subjective strength of a word or phrase is the strength of the opinion or emotion that is expressed. Their
examination of the annotated data shows that strong subjectivity is expressed in many different
ways. Subjectivity clues (PREV) include words and phrases obtained from manually annotated
sources. Due to the variety of clues and their sources, the set of PREV clues is not limited to a
fixed word list or to words of a particular part of speech.
The syntactic clues are mostly developed by using a supervised learning procedure. The training
data is based on both a human annotated (the MPQA) corpus and a large unannotated corpus
in which sentences are automatically identified as subjective or objective through a bootstrapping
algorithm. The learning procedure consists of three steps. First, the sentences in the training data
are parsed with Collins' parser. Second, five syntactic clues are formed for each word w in
every parse tree. The class of a word in a parse tree can be one of the following:
• root(w, t): word w with POS tag t is the root of the dependency parse tree.
• leaf(w, t): word w with POS tag t is a leaf in the dependency parse tree.
• node(w, t): word w with POS tag t.
• bilex(w, t, r, w_c, t_c): word w with POS tag t is modified by word w_c with POS tag t_c, with
grammatical relationship r.
• allkids(w, t, r_1, w_1, t_1, ..., r_n, w_n, t_n): word w with POS tag t has n children w_i with tags t_i that
modify w with grammatical relationships r_i.
Some rules are set to specify the usefulness of a clue.
A clue is considered to be potentially useful if more than x% of its occurrences in the
training data are in phrases marked as subjective where x is a parameter tuned on the
training data.
The authors have set x = 70 in their experiments. Useful clues are then put into three classes:
highly reliable, not very reliable, and somewhat reliable. These reliability levels are determined by
occurrence. For each clue c they calculate P(strength(c) = s), the probability of c being in an
annotation of strength s. If P(strength(c) = s) > T for some threshold T, c is put in the set with
strength s.
Their experiments show promising results in extracting opinions in nested clauses, and classifying their strength. Syntactic clues are used as features, and the classification is done using the
bootstrapping method with the 10-fold cross-validation experiments. Improvements in accuracy of
the new system compared to a baseline range from 23% to 79%.
13 Spectral Methods
This last chapter covers the spectral methods in graphs. Particularly, we look at transductive
learning, and spectral graph partitioning.
13.1 Spectral Partitioning
Spectral partitioning is a powerful, yet expensive technique for graph clustering. This method
is based on the incidence matrix. The incidence matrix, In(G), of the graph G is
an |N| × |E| matrix, with one row for each node and one column for each edge. The column for an edge e(i, j) is zero
except for the entries in rows i and j, which are +1 and −1 respectively. Additionally, the Laplacian matrix L(G) of G is an |N| × |N| symmetric matrix, with one
row and column for each node, where
$$(L(G))(i, j) = \begin{cases} \deg(i) & \text{if } i = j \\ -1 & \text{if } i \ne j \text{ and there is an edge } (i, j) \\ 0 & \text{otherwise} \end{cases}$$
L(G) has several properties:
1. Symmetry. This causes the eigenvalues of L(G) to be real, and its eigenvectors to be real and
orthogonal.
2. If $e = [1, \ldots, 1]^T$, then $L(G)\, e = 0$.
3. $In(G)\, (In(G))^T = L(G)$.
4. If $L(G)\, v = \lambda v$ for a non-zero v, then
$$\lambda = \frac{\| In(G)^T v \|^2}{\| v \|^2}$$
5. $0 \le \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$, where the $\lambda_i$ are the eigenvalues of L(G).
6. The number of connected components of G is equal to the number of $\lambda_i$'s that are equal to
0. This means that $\lambda_2 \ne 0$ if and only if G is connected. ($\lambda_2(L(G))$ is called the algebraic
connectivity of G.)
Algorithm 1 Spectral partitioning algorithm
Compute the eigenvector v2 corresponding to λ2 of L(G)
for each node n of G do
    if v2(n) < 0 then
        put node n in partition N−
    else
        put node n in partition N+
    end if
end for
It is shown in  that if G is connected, and N− and N+ are defined by Algorithm 1, then
N− is connected. It can also be shown that if $G_1 = (N, E_1)$ is a subgraph of $G = (N, E)$ such
that $G_1$ is “less connected” than G, then $\lambda_2(L(G_1)) \le \lambda_2(L(G))$.
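Algorithm 1 can be sketched in plain Python as follows. To stay dependency-free, the eigenvector v2 is approximated by power iteration on cI − L(G) with the constant eigenvector projected out; this is a swapped-in numerical shortcut chosen for the example, not a method from the text, and a linear algebra library would normally be used instead. All function names are hypothetical.

```python
def laplacian(n, edges):
    """Graph Laplacian: deg(i) on the diagonal, -1 for every edge (i, j)."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L

def fiedler_vector(L, iterations=500):
    """Approximate v2 by power iteration on (c*I - L), deflating e = [1, ..., 1]."""
    n = len(L)
    c = 2.0 * max(L[i][i] for i in range(n))    # upper bound on the largest eigenvalue
    x = [float(i) for i in range(n)]             # arbitrary non-constant start vector
    for _ in range(iterations):
        mean = sum(x) / n
        x = [xi - mean for xi in x]               # remove the component along e
        x = [c * x[i] - sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(xi * xi for xi in x) ** 0.5
        x = [xi / norm for xi in x]
    return x

def spectral_partition(n, edges):
    """Split nodes by the sign of their entry in v2, as in Algorithm 1."""
    v2 = fiedler_vector(laplacian(n, edges))
    negative = {i for i in range(n) if v2[i] < 0}
    return negative, set(range(n)) - negative

# Two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
part_a, part_b = spectral_partition(6, edges)
```

On this graph the sign pattern of v2 separates the two triangles, cutting only the bridging edge.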
13.2 Community Finding Using Eigenvectors
In this section we review the method of finding communities in networks using eigenvectors
suggested by Newman . The problem addressed is the clustering of network data. Networks have received great attention in physics and other fields as a foundation of
the mathematical representation of a variety of complex systems, including biological and social
systems, the Internet, the World Wide Web, and many others. There has been a lot of effort to
devise algorithms to cluster graphs. However, the major problem is to develop algorithms that can
be run on parallel or distributed computing systems.
If A is the adjacency matrix of the graph, in which $A_{ij}$ is 1 if there is an edge between
vertices i and j and 0 otherwise, then the number of edges running from one cluster to the other is
$$R = \frac{1}{2} \sum_{i, j \text{ in different clusters}} A_{ij}$$
The index vector s records, for each vertex, whether it is in the first group or the second. It assigns
a vertex the value $s_i = 1$ in the former case and $s_i = -1$ in the latter. The following is an
immediate result of the index vector's definition.
$$\frac{1}{2}(1 - s_i s_j) = \begin{cases} 1 & \text{if } i, j \text{ are in different groups} \\ 0 & \text{if } i, j \text{ are in the same group} \end{cases}$$
Then
$$R = \frac{1}{4} \sum_{ij} (1 - s_i s_j) A_{ij}$$
Also the degree of node i, $k_i$, is calculated as
$$k_i = \sum_j A_{ij}$$
The above equations can be reformulated in matrix form as
$$R = \frac{1}{4} s^T L s$$
where $L_{ij} = k_i \delta_{ij} - A_{ij}$ and $\delta_{ij}$ is 1 if i = j and 0 otherwise. L is called the Laplacian matrix
of the graph. Assume $\lambda_i$ is the eigenvalue of L corresponding to the eigenvector $v_i$, and that
$\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. Furthermore, if $n_1$ and $n_2$ are the required sizes of the groups, then
$$a_1^2 = (v_1^T s)^2 = \frac{(n_1 - n_2)^2}{n}$$
where $a_i = v_i^T s$.
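As a quick numerical check of the identity between the direct edge count and the matrix form R = (1/4) s^T L s, the sketch below evaluates both on a four-node cycle. The function names and the example graph are chosen only for illustration.

```python
def cut_size_direct(adj, s):
    """R = 1/2 * sum of A_ij over ordered pairs (i, j) in different groups."""
    n = len(adj)
    return sum(adj[i][j] for i in range(n) for j in range(n) if s[i] != s[j]) / 2

def cut_size_laplacian(adj, s):
    """R = 1/4 * s^T L s with L_ij = k_i * delta_ij - A_ij."""
    n = len(adj)
    degree = [sum(row) for row in adj]       # k_i = sum_j A_ij
    total = 0
    for i in range(n):
        for j in range(n):
            l_ij = (degree[i] if i == j else 0) - adj[i][j]
            total += s[i] * l_ij * s[j]
    return total / 4

# Cycle 0-1-2-3-0 split into {0, 1} and {2, 3}: edges (1, 2) and (3, 0) are cut.
adj = [[0, 1, 0, 1],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [1, 0, 1, 0]]
s = [1, 1, -1, -1]
r_direct = cut_size_direct(adj, s)
r_matrix = cut_size_laplacian(adj, s)
```

Both computations give a cut size of 2 for this split.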
Newman  defines the modularity Q of a graph as Q = (number of edges within
communities) − (expected number of such edges). The goal of network clustering is then reduced
to the problem of optimizing the modularity of the network. More formally, Q can be defined as
$$Q = \frac{1}{2m} \sum_{ij} [A_{ij} - P_{ij}]\, \delta(g_i, g_j)$$
where m is the number of edges in the network, $g_i$ is the community of vertex i, and $P_{ij}$ is the expected number of edges falling between a particular pair of vertices i and
j. In this setting, $A_{ij}$ is the actual number of such edges, and $\delta(r, s) = 1$ if r = s and 0 otherwise.
Note that $\delta(g_i, g_j) = \frac{1}{2}(s_i s_j + 1)$; thus,
$$Q = \frac{1}{4m} \sum_{ij} [A_{ij} - P_{ij}](s_i s_j + 1)$$
and the matrix form
$$Q = \frac{1}{4m} s^T B s$$
where $B_{ij} = A_{ij} - P_{ij}$. B is called the modularity matrix. The constant term drops out when the expected total number of edges equals the actual number, so that $\sum_{ij} [A_{ij} - P_{ij}] = 0$. He rewrites this expression as
$$Q = \frac{1}{4m} \sum_i \alpha_i^2 \beta_i$$
where $\beta_i$ is the eigenvalue of B corresponding to eigenvector $u_i$ and $\alpha_i = u_i^T s$. Thus, Newman shows that the
modularity of a network can be expressed using the eigenvalues and eigenvectors of a single matrix,
the modularity matrix. Using this expression Newman  derives algorithms for identifying
communities and detecting bipartite or k-partite structure in networks, as well as a new community
centrality measure.
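The matrix form of the modularity can be evaluated directly for a two-way split. The text leaves P_ij abstract; the sketch below assumes the common null model P_ij = k_i k_j / 2m, which is an assumption of this example, and the function name and graph are hypothetical.

```python
def modularity(adj, s):
    """Q = s^T B s / (4m) with B_ij = A_ij - P_ij for an index vector s of +1/-1."""
    n = len(adj)
    k = [sum(row) for row in adj]           # degrees k_i
    two_m = sum(k)                          # 2m = sum of all degrees
    total = 0.0
    for i in range(n):
        for j in range(n):
            p_ij = k[i] * k[j] / two_m      # ASSUMED null model, not given in the text
            total += s[i] * (adj[i][j] - p_ij) * s[j]
    return total / (2 * two_m)              # dividing by 2 * 2m gives the factor 1/(4m)

# Two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1

q_natural = modularity(adj, [1, 1, 1, -1, -1, -1])   # one triangle per group
q_mixed = modularity(adj, [1, -1, 1, -1, 1, -1])     # groups cut across the triangles
```

The split that follows the two triangles has positive modularity (5/14), while the split that cuts across them scores below zero.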
13.3 Spectral Learning
Spectral learning techniques are algorithms that use information contained in the eigenvectors of a
data affinity matrix to detect structure. The method described in  is summarized as follows:
1. Given data B, compute the affinity matrix $A = f(B)$, with $A \in R^{n \times n}$.
2. Form D, a diagonal matrix with $D_{ii} = \sum_j A_{ij}$.
3. Normalize: $N = (A + d_{max} I - D)/d_{max}$, where $d_{max}$ is the largest entry of D.
4. Let $x_1, \cdots, x_k$ be the eigenvectors of N with the k largest eigenvalues, and form the normalized matrix $X =
[x_1, \cdots, x_k] \in R^{n \times k}$.
5. Cluster the rows of X, and assign data point $x_i$ to cluster j if and only if the row $X_i$ is assigned to j.
Kamvar et al.  specify a data-to-data Markov transition matrix. If A is the similarity
matrix of the documents, then the equality $A = B^T B$ holds for a term-document matrix B. To
map document similarities to transition probabilities, let's define $N = D^{-1} A$, where D is a diagonal
matrix with $D_{ii} = \sum_j A_{ij}$. Here, N corresponds to transitioning with probability proportional
to relative similarity values.
Four different datasets are used in  to compare the spectral learning algorithm to K-means:
20 newsgroups, 3 newsgroups, LYMPHOMA, SOYBEAN.
They further suggest an algorithm for classification.
1. Define A as previously mentioned.
2. For each pair of points (i, j) that are in the same class, set $A_{ij} = A_{ji} = 1$.
3. For each pair of points (i, j) that are in different classes, set $A_{ij} = A_{ji} = 0$.
4. Normalize: $N = \frac{1}{d_{max}} (A + d_{max} I - D)$.
If natural classes occur in the data, the Markov chain described above should have cliques. The
key differences between the spectral classifier and the clustering algorithm are that A incorporates
labeling information, and a classifier is used in the spectral space rather than a clustering method.
This algorithm is able to classify documents by the similarity of their transition probabilities to
known subsets of B, and this makes this method novel.
The resulting classification algorithm can be summarized as follows:
1. Given the data B, compute the affinity matrix $A = f(B)$, with $A \in R^{n \times n}$.
2. Form D, a diagonal matrix with $D_{ii} = \sum_j A_{ij}$.
3. Normalize: $N = (A + d_{max} I - D)/d_{max}$.
4. Let $x_1, \cdots, x_k$ be the eigenvectors of N with the k largest eigenvalues, and form the normalized matrix $X =
[x_1, \cdots, x_k] \in R^{n \times k}$.
5. Represent each data point i by the row $X_i$ of X.
6. Classify the rows as points using a classifier.
7. Assign data point i to the class that $X_i$ is assigned to.
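The label-injection and normalization steps can be sketched as below. The input format for the known labels (a dict from point index to class name), the function names, and the affinity values are assumptions of this example; the eigenvector and classifier steps are omitted.

```python
def inject_labels(A, labels):
    """Overwrite affinities between labeled points: 1 for same class, 0 otherwise."""
    A = [list(row) for row in A]            # work on a copy
    for i in labels:
        for j in labels:
            if i != j:
                A[i][j] = 1.0 if labels[i] == labels[j] else 0.0
    return A

def normalize_affinity(A):
    """N = (A + d_max * I - D) / d_max, where D_ii is the i-th row sum of A."""
    n = len(A)
    d = [sum(row) for row in A]
    d_max = max(d)
    return [[(A[i][j] + (d_max - d[i] if i == j else 0.0)) / d_max
             for j in range(n)] for i in range(n)]

# Hypothetical affinities for three points; points 0 and 1 share a known label.
A = [[0.0, 0.3, 0.6],
     [0.3, 0.0, 0.2],
     [0.6, 0.2, 0.0]]
labeled = inject_labels(A, {0: "x", 1: "x"})
N = normalize_affinity(labeled)
```

Adding d_max I − D pads each diagonal entry so that every row of N sums to 1, which is what lets N be read as a Markov transition matrix.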
13.4 Transductive learning
Transductive learning refers to applications in which the examples to be classified are already known
when a classifier is trained. Relevance feedback is an example of such a task, in which users give
positive and negative labels to examples that are in the training set.
The learning task is defined on a fixed array X of n points $(\vec{x}_1, \vec{x}_2, \cdots, \vec{x}_n)$. The classification
label for each data point is denoted by $y_i$ and can be either +1 or −1. A transductive learner can
analyze the location of all points and so can structure its hypothesis space based on the input.
The method of spectral graph transducer (SGT) is described in . This algorithm takes as
input the training labels $Y_l$ and a weighted undirected graph on X with adjacency matrix A. The
similarity-weighted k-nearest neighbor graph can be built from
$$A'_{ij} = \begin{cases} \dfrac{sim(\vec{x}_i, \vec{x}_j)}{\sum_{\vec{x}_k \in knn(\vec{x}_i)} sim(\vec{x}_i, \vec{x}_k)} & \text{if } \vec{x}_j \in knn(\vec{x}_i) \\ 0 & \text{otherwise} \end{cases}$$
where the prime marks the directed nearest-neighbor weights before they are symmetrized into A.
First, the diagonal degree matrix B should be computed, in which $B_{ii} = \sum_j A_{ij}$. Then the
Laplacian matrix can be computed as $L = B - A$. The normalized Laplacian is computed as
$$L = B^{-1}(B - A)$$
The smallest d + 1 eigenvalues (excluding the first) and their corresponding eigenvectors of L are
chosen and assigned to D and V respectively. For each new training set:
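• Set $\hat{\gamma}_+ = \sqrt{l_- / l_+}$ and $\hat{\gamma}_- = -\sqrt{l_+ / l_-}$, where $l_+$ and $l_-$ are the numbers of positive and negative training examples.
• Set $C_{ii} = \frac{l}{2 l_+}$ for positive examples and $C_{ii} = \frac{l}{2 l_-}$ for negative examples, where $l = l_+ + l_-$ and C is the cost
of training samples. This will guarantee equal costs of positive and negative samples.
• Compute $G = (D + c V^T C V)$ and $\vec{b} = c V^T C \vec{\gamma}$. Find $\lambda^*$, the smallest eigenvalue of
$$\begin{bmatrix} G & -I \\ -\frac{1}{n} \vec{b} \vec{b}^T & G \end{bmatrix}$$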
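• The predictions can be computed as
$$\vec{z}^* = V (G - \lambda^* I)^{-1} \vec{b}$$
Finally, to do the hard class assignment we can threshold $\vec{z}^*$ with respect to $\frac{1}{2}(\hat{\gamma}_+ + \hat{\gamma}_-)$, and set $y_i = \mathrm{sign}(z_i - \frac{1}{2}(\hat{\gamma}_+ + \hat{\gamma}_-))$.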
References
 L. Abderrafih. Multilingual alert agent with automatic text summarization.
 Lada Adamic and Natalie Glance. The political blogosphere and the 2004 u.s. election: Divided they blog. In Proceedings of the WWW2005 Conference’s 2nd Annual Workshop on the
Weblogging Ecosystem: Aggregation, Analysis, and Dynamics, 2005.
dynamics of Blogspace. 2004.
 S. Afantenos, V. Karkaletsis, and P. Stamatopoulos. Summarization from medical documents:
a survey. Artificial Intelligence In Medicine, 33(2):157–177, 2005.
 David J. Aldous and James A. Fill. Reversible Markov Chains and Random Walks on Graphs.
Book in preparation, http://www.stat.berkeley.edu/~aldous/book.html, 200X.
 L. Antiqueira, MGV Nunes, ON Oliveira Jr, and L.F. Costa. Complex networks in the assessment of text quality. Arxiv preprint physics/0504033, 2005.
 Francis R. Bach and Michael I. Jordan. Kernel independent component analysis. J. Mach.
Learn. Res., 3:1–48, 2003.
 X. Bai, C. Glymour, R. Padman, P. Spirtes, and J. Ramsey. MB fan search classifier for large data
sets with few cases. In Technical Report CMU-CALD-04-102. School of Computer Science,
Carnegie Mellon University, 2004.
 Xue Bai, Rema Padman, and Edoardo Airoldi. Sentiment extraction from unstructured text
using tabu search-enhanced markov blanket. In Workshop on Mining the Semantic Web, 10th
ACM SIGKDD Conference, 2004.
Automatic Text Summarization, 1999.
 Mikhail Belkin and Partha Niyogi. Laplacian Eigenmaps for dimensionality reduction and
data representation. Neural Computation, 15(6):1373–1396, 2003.
 R. Blood. How blogging software reshapes the online community. Communications of the
ACM., 47:53–55, 2004.
 Kenneth Bloom, Navendu Garg, and Shlomo Argamon. Extracting appraisal expressions. In
HLT-NAACL 2007, pages 308–315, 2007.
 A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In
Proc. 19th International Conference on Machine Learning (ICML-2001), 2001.
 Avrim Blum and Shuchi Chawla. Learning from labeled and unlabeled data using graph
mincuts. pages 19–26, 2001.
 Avrim Blum, John D. Lafferty, Mugizi Robert Rwebangira, and Rajashekar Reddy. Semi-supervised learning using randomized mincuts. 2004.
 Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via
graph cuts. In Proceedings of the International Conference on Computer Vision (ICCV 1),
pages 377–384, 1999.
 Jaime G. Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for
reordering documents and producing summaries. pages 335–336, 1998.
 Jean Carletta. Assessing agreement on classification tasks: the kappa statistic. Comput.
Linguist., 22(2):249–254, 1996.
 H.H. Chen and C.J. Lin. A multilingual news summarizer. Proceedings of the 18th conference
on Computational linguistics-Volume 1, pages 159–165, 2000.
 Paula Chesley, Bruce Vincent, Li Xu, and Rohini Srihari. Using verbs and adjectives to
automatically classify blog sentiment. In Proceedings of AAAI-CAAW-06, the Spring Symposia
on Computational Approaches to Analyzing Weblogs, 2006.
 N. Chomsky. Syntactic Structures. Walter de Gruyter, 2002.
 Hang Cui, Vibhu O. Mittal, and Mayur Datar. Comparative experiments on sentiment classification for online product reviews. In AAAI, 2006.
 H. Dalianis, M. Hassel, K. de Smedt, A. Liseth, TC Lech, and J. Wedekind. Porting and
evaluation of automatic summarization. Nordisk Sprogteknologi, 1988.
 G. DeJong. An Overview of the FRUMP System. Strategies for Natural Language Processing,
1982.
 Inderjit S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In KDD ’01: Proceedings of the seventh ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 269–274, New York, NY, USA, 2001. ACM.
 Sergey N. Dorogovtsev and José Fernando F. Mendes. Language as an evolving word Web.
Proceedings of the Royal Society of London B, 268(1485):2603–2606, December 22, 2001.
 P. Doyle and J. Snell. Random walks and electric networks. Math. Assoc. America., Washington, 1984.
 HP Edmundson. New Methods in Automatic Extracting. Advances in Automatic Text Summarization, 1999.
 Güneş Erkan and Dragomir R. Radev. Lexrank: Graph-based centrality as salience in text
summarization. Journal of Artificial Intelligence Research (JAIR), 2004.
 Ramon Ferrer i Cancho and Ricard V. Solé. The small-world of human language. Proceedings
of the Royal Society of London B, 268(1482):2261–2265, November 7 2001.
 Ramon Ferrer i Cancho, Ricard V. Solé, and Reinhard Köhler. Patterns in syntactic dependency networks. 69(5), May 26, 2004.
 Notes for Lecture 23 April 9 Berkeley. Graph partitioning,
http://www.cs.berkeley.edu/ demmel/cs267/lecture20/lecture20.html. 1999.
 ML Glasser and IJ Zucker. Extended Watson Integrals for the Cubic Lattices. Proceedings of
the National Academy of Sciences of the United States of America, 74(5):1800–1801, 1977.
 Fred Glover and Fred Laguna. Tabu Search. Kluwer Academic Publishers, Norwell, MA, USA,
1997.
 Daniel Gruhl, R. Guha, David Liben-Nowell, and Andrew Tomkins. Information diffusion
through blogspace. In WWW ’04: Proceedings of the 13th international conference on World
Wide Web, pages 491–501, New York, NY, USA, 2004. ACM.
 David Hull. Using statistical testing in the evaluation of retrieval experiments. In SIGIR
’93: Proceedings of the 16th annual international ACM SIGIR conference on Research and
development in information retrieval, pages 329–338, New York, NY, USA, 1993. ACM.
 Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of ir techniques.
ACM Trans. Inf. Syst., 20(4):422–446, 2002.
 Thorsten Joachims. Transductive learning via spectral graph partitioning. pages 290–297,
2003.
 Sepandar D. Kamvar, Dan Klein, and Christopher D. Manning. Spectral learning. pages
561–566, 2003.
 Klaus Krippendorff. Content Analysis: An Introduction to Its Methodology. Sage Publications,
London, 1980.
 Ravi Kumar, Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins. On the bursty
evolution of blogspace. In WWW ’03: Proceedings of the 12th international conference on
World Wide Web, pages 568–576, New York, NY, USA, 2003. ACM.
 Jure Leskovec, Susan Dumais, and Eric Horvitz. Web projections: learning from contextual
subgraphs of the web. In WWW ’07: Proceedings of the 16th international conference on
World Wide Web, pages 471–480, New York, NY, USA, 2007. ACM.
 Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, and
Natalie Glance. Cost-effective outbreak detection in networks. In KDD ’07: Proceedings of the
13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages
420–429, New York, NY, USA, 2007. ACM.
 Y. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. Tseng. Discovery of blog communities
based on mutual awareness. In Proceedings of the WWW06 Workshop on Web Intelligence,
2006.
 HP Luhn. The Automatic Creation of Literature Abstracts. Advances in Automatic Text
Summarization, 1999.
 M. Maier and M. Hein. Manifold denoising. Advances in Neural Information Processing Systems
(NIPS), 2006.
 I. Mani and E. Bloedorn. Summarizing Similarities and Differences Among Related Documents.
Advances in Automatic Text Summarization, 1999.
 Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
 D. Marcu. The rhetorical parsing of natural language texts. Proceedings of the 35th annual
meeting on Association for Computational Linguistics, pages 96–103, 1997.
 D. Marcu. The Theory and Practice of Discourse Parsing and Summarization. MIT Press,
2000.
 A. Mehler. Compositionality in quantitative semantics. A theoretical perspective on text mining. Aspects of Automatic Text Analysis, Studies in Fuzziness and Soft Computing, Berlin.
Springer, 2006.
 I.A. Mel’čuk. Dependency Syntax: Theory and Practice. State University of New York Press,
1988.
 Gilad Mishne and Natalie Glance. Predicting movie sales from blogger sentiment. In AAAI
2006 Spring Symposium on Computational Approaches to Analysing Weblogs (AAAI-CAAW
2006), 2006.
 T. Mitchell. The role of unlabeled data in supervised learning. In Proc. of 6th International
Symposium on Cognitive Science (invited paper), San Sebastian, Spain, 1999.
 M. Módolo. SuPor: um Ambiente para a Exploração de Métodos Extrativos para a Sumarização
Automática de Textos em Português. Master's thesis (Dissertação de Mestrado), Departamento de
Computação, UFSCar, São Carlos-SP, 2003.
 Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices,
2006. http://arxiv.org/abs/physics/0605087.
 Kamal Nigam and Matthew Hurst. Towards a robust metric of opinion. In Proceedings of the
AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications,
2004.
 Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity
summarization based on minimum cuts. In ACL2004, pages 271–278. Association for Computational Linguistics, 2004.
 T.A.S. Pardo and L.H.M. Rino. Descrição do GEI-Gerador de Extratos Ideais para o Português do Brasil. Série de Relatórios do NILC NILC-TR-04-07, Núcleo Interinstitucional de
Lingüística Computacional (NILC), São Carlos-SP, 8.
 T.A.S. Pardo and L.H.M. Rino. GistSumm: A Summarization Tool Based on a New Extractive
Method. Computational Processing of the Portuguese Language: Proceedings, 2003.
 Thiago Alexandre Salgueiro Pardo, Lucas Antiqueira, Maria das Graças Volpe Nunes, Osvaldo
N. Oliveira Jr., and Luciano da Fontoura Costa. Modeling and evaluating summaries using
complex networks. In Proceedings of Computational Processing of the Portuguese Language,
the Seventh International Workshop (PROPOR ’06), pages 1–10. Springer, 2006.
 G. Pólya. Über eine Aufgabe der Wahrscheinlichkeitsrechnung betreffend der Irrfahrt im
Straßennetz. Math. Annalen, 84:149–160, 1921.
 D.R. Radev and K. McKeown. Generating Natural Language Summaries from Multiple On-Line Sources. Computational Linguistics, 24(3):469–500, 1998.
 H. Saggion and G. Lapalme. Generating indicative-informative summaries with sumUM. Computational Linguistics, 28(4):497–526, 2002.
 G. Salton, A. Singhal, M. Mitra, and C. Buckley. Automatic Text Structuring
and Summarization. Advances in Automatic Text Summarization, 1999.
 Ricard V. Solé, Bernat Corominas Murtra, Sergi Valverde, and Luc Steels. Language networks:
Their structure, function and evolution. Technical Report 05-12-042, Santa Fe Institute Working Paper, 2005.
 Martin Szummer and Tommi Jaakkola. Partially labeled classification with Markov random
walks. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2001. MIT
Press.
 E. M. Trevino. Blogger motivations: Power, pull, and positive feedback. In Internet Research
6.0, 2005.
 Peter Turney. Thumbs up or thumbs down? Semantic orientation applied to unsupervised
classification of reviews. In ACL 2002, pages 417–424, 2002.
 Jean-Philippe Vert and Minoru Kanehisa. Graph-driven features extraction from microarray
data using diffusion kernels and kernel CCA. 2002.
 Duncan J. Watts and Steven Strogatz. Collective dynamics of small-world networks. Nature,
393:440–442, June 1998.
 Theresa Wilson, Janyce Wiebe, and Rebecca Hwa. Just how mad are you? Finding strong and
weak opinion clauses. In AAAI 2004, pages 761–766, 2004.
 X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-Supervised Learning Using Gaussian Fields
and Harmonic Functions. In Proceedings of International Conference on Machine Learning,
2003.
 Hongyuan Zha. Generic summarization and keyphrase extraction using mutual reinforcement
principle and sentence clustering. In SIGIR ’02: Proceedings of the 25th annual international
ACM SIGIR conference on Research and development in information retrieval, pages 113–120,
New York, NY, USA, 2002. ACM.
 Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf.
Learning with local and global consistency. Technical Report MPI no. 112, Max Planck Institute for Biological Cybernetics, Tübingen, Germany, June 2003.
```