Zhenjie Zhang , Beng Chin Ooi , Srinivasan Parthasarathy , Anthony K. H. Tung. Similarity Search on Bregman Divergence: Towards Non-Metric Indexing . Accepted and to appear in the Proceedings of the 35 th International Conference on Very Large Data Bases(VLDB), Lyon, France August 24

THE NATIONAL UNIVERSITY
of SINGAPORE
School of Computing
Computing 1, Singapore 117590
TRA6/08
Edit Distance Evaluation on Graph Structures
Zhiping Zeng, Anthony K.H. Tung, Jianyong Wang,
Jianhua Feng and Lizhu Zhou
June 2008
Technical Report
Foreword
This technical report contains a research paper, development or
tutorial article, which has been submitted for publication in a
journal or for consideration by the commissioning organization.
The report represents the ideas of its author, and should not be
taken as the official views of the School or the University. Any
discussion of the content of the report should be sent to the author,
at the address shown on the cover.
OOI Beng Chin
Dean of School
Edit Distance Evaluation on Graph Structures
Zhiping Zeng§, Anthony K.H. Tung†, Jianyong Wang§, Jianhua Feng§, Lizhu Zhou§
§ Tsinghua University, Beijing, 100084, P.R.China
† National University of Singapore, Singapore
cengzp03@mails.tsinghua.edu.cn, atung@comp.nus.edu.sg
{jianyong, fengjh, dcszlz}@tsinghua.edu.cn
ABSTRACT
Graph data have become ubiquitous, and manipulating them based on similarity is essential for many applications. Graph edit distance is one of the most widely accepted measures for determining similarities between graphs and has extensive applications in fields such as pattern recognition and computer vision. Unfortunately, the problem of graph edit distance computation is NP-Hard in general and is unlikely to be approximable within a polynomial time computable factor f(n) in polynomial time. Accordingly, in this paper we introduce three novel methods to compute upper and lower bounds for the edit distance between two graphs in polynomial time. Applying these methods, three algorithms, AppFull, ExactSub and AppSub, are introduced to perform different kinds of graph search on graph databases. Comprehensive experimental studies are conducted on both real and synthetic datasets to examine various aspects of the methods for bounding graph edit distance. The results show that these methods achieve good scalability in terms of both the number of graphs and the size of graphs. The effectiveness of these algorithms also confirms the usefulness of our bounds in filtering and searching graphs.
1. INTRODUCTION
In modern society, graph data are becoming ubiquitous, and graph data models have been studied in the database community for semantic data modelling, hypertext, multimedia, and chemical and biological information systems. For example, the World Wide Web can be considered as a graph whose vertices correspond to static pages and whose edges correspond to links between pages [6]; in chem-informatics, labelled graphs are suited to expressing the connectivity of chemical compounds [29]; in bioinformatics, collections of DNA segments in a cell which interact with each other and with other substances in the cell can be modelled as gene regulatory networks [4, 5, 15].
Due to the extensive applications of graph models, vast amounts of graph data have been collected and graph databases have attracted significant attention in the academic community. In recent years, various approaches have been proposed to deal with a variety of graph-related research problems. For example, a variety of effective algorithms have been devised to mine graph patterns (e.g., frequent patterns) from graph databases [33], to index graph databases for efficiently processing graph search [13, 37, 8] and to perform keyword search over graph databases [14, 17, 30].
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB ‘08, August 24-30, 2008, Auckland, New Zealand
Copyright 2008 VLDB Endowment, ACM 000-0-00000-000-0/00/00.
With the rapidly increasing amount of graph data (e.g., chemical compounds and social network data), supporting scalable graph search over large graph databases has become an important database research problem. Because of the technical limitations of processing graph search with conventional database technologies, enormous efforts [26, 13, 35, 8, 37] have been put into constructing practical graph searching methods.
Given a graph database consisting of n graphs, D = {g1, g2, ..., gn}, and a query graph q, almost all existing algorithms for processing graph search can be classified into the following three categories:
1. full graph search: find all graphs gi in D s.t. gi is the same as q [3];
2. subgraph search: find all graphs gi in D containing q [4, 26, 27, 37] or contained by q [8];
3. similarity search: find all graphs gi in D s.t. gi is similar to q within a user-specified threshold based on some similarity measures [23, 31].
As can be seen, manipulating graph data based on structural similarity is essential for many applications [36]. A number of graph similarity measures have therefore been proposed [20, 7, 10, 24]. Among these, graph edit distance has been widely accepted as a similarity measure for representing the distances between attributed graphs. Informally speaking, graph edit distance defines the similarity of two graphs by the minimum amount of distortion that is needed to transform one graph into the other. In contrast with other measures, graph edit distance does not suffer from any restrictions and can be applied to any type of graph. Furthermore, graph edit distance is known to be error-tolerant, with the ability to find graphs that are of interest to users even in the presence of noise and errors in the database.
Graph edit distance plays a significant role in the management of graph data and in a variety of other applications such as graph classification, computer vision and pattern recognition. For example, a familiar problem in computer vision is to recognize specific objects within an image [16] (e.g., face identification and symbol recognition). In this case, a representative graph is generated from the image according to its structural characteristics, and vertex labels may be assigned based on characteristics of the region to which each vertex corresponds. After that, this representative graph is compared to a database of prototype or model graphs to identify and classify the object of interest. In this context, graph edit distance provides a good measure for comparing graphs.
Unfortunately, the main drawback of graph edit distance is its exponential computational complexity in terms of the number of graph vertices. As will be shown in Section 2, the problem of computing graph edit distance is NP-hard in general, and it is even unlikely that an algorithm exists that can approximate it within a factor f(n) in polynomial time, where f(n) is any polynomial time computable function. The direct computation of graph edit distance for large graphs is therefore expensive and takes an unacceptable amount of time. Because of this, a few algorithms have been proposed to compute upper and lower bounds for the graph edit distance, each with its own disadvantages. Justice et al. [16] gave a solution providing the lower bound in O(n^7) time and the upper bound in O(n^3) time. The computation of the lower bound is expensive, and their method for obtaining the upper bound considers only the vertex edit term, i.e., it ignores the structural information in graphs. This solution is therefore not applicable in practice. The other methods [22, 32] cannot provide lower bounds, and use heuristic algorithms to find unbounded suboptimal values. Their computational complexities are hard to analyze and are not presented in the related papers.
Accordingly, in this paper we address the problem of obtaining upper and lower bounds of graph edit distance efficiently. In summary, the contributions of this paper are:
• We give a formal proof that the problem of graph edit distance computation is NP-Hard and that it cannot be approximated within a factor of f(n) unless P = SPP, where f(n) is a polynomial time computable function.
• We introduce a notion of star representations for graph
structures and propose three novel methods to obtain
lower and upper bounds of edit distance between two
graphs in polynomial time.
• Based on these efficiently computable bounds, we developed three algorithms AppFull, ExactSub and
AppSub for performing approximate full graph search,
exact subgraph search and approximate subgraph search
respectively.
• Comprehensive experimental studies are conducted to
evaluate the scalability and effectiveness of our algorithms.
The rest of this paper is organized as follows. Section 2
will formalize the problem of graph edit distance computation together with computation complexity analysis on the
problem. Related work will be discussed in Section 3. In
Section 4, three efficiently computable methods are introduced for obtaining lower and upper bounds of graph edit
distance. Section 5 investigates the applications of these bounds in performing graph search over graph databases, followed by comprehensive experimental studies reported in Section 6. Section 7 concludes the paper.
2. PRELIMINARIES
In this section, we will first formalize the problem of graph
edit distance computation and then perform some computational complexity analysis for the problem. Table 1 summarizes the major notations that we will use in this paper.
Symbol          Description
deg(v)          |{u | (u, v) ∈ E}|, the degree of v
δ(g)            max_{v ∈ V(g)} deg(v)
λ(g1, g2)       the edit distance between graphs g1 and g2
Lm(g1, g2)      the lower bound of λ(g1, g2)
τ(g1, g2)       the suboptimal value of λ(g1, g2)
ρ(g1, g2)       the refined suboptimal value of λ(g1, g2)
Table 1: Notations used in this paper
2.1 Problem Formulation
In this paper, we consider simple graphs, which do not contain self-loops, multi-edges or edge labels. An undirected attributed graph, denoted by g, can be represented by a 3-tuple g = (V, E, l), where V is a finite set of vertices, E ⊆ V × V is a set of vertex pairs, and l : V → Σ¹ is a function assigning labels to vertices. In general, we will use Ei, Vi and li to represent the edges, vertices and label assigning function of a graph gi. Unless stated otherwise, the term graph in the rest of this paper means an undirected attributed graph.
A graph g1 is subgraph isomorphic to another graph g2 (denoted by g1 ⊑ g2) iff there exists an injection f : V1 → V2 such that for any vertex v ∈ V1, f(v) ∈ V2 ∧ l1(v) = l2(f(v)), and for any edge (u, v) ∈ E1, (f(u), f(v)) ∈ E2. Moreover, if g1 and g2 are subgraph isomorphic to each other, they are said to be graph isomorphic (denoted by g1 = g2).
Informally speaking, an edit operation on a graph g is an insertion or deletion of a vertex/edge or the relabelling of a vertex. A vertex can be deleted only if no edge is connected to it, and the costs of different edit operations are assumed to be equal in this paper. Essentially, a vertex deletion can be considered as a vertex relabelling that changes its label from σ ∈ Σ to ε, where ε is a special label indicating that the vertex is virtual. Symmetrically, a vertex insertion can be considered as relabelling a vertex from ε to σ. Let pi denote an edit operation; an alignment P is an edit operation sequence ⟨p1, p2, ..., pm⟩ which transforms a graph g1 into another graph g′. If g′ is graph isomorphic to another graph g2, we say g1 can reach g2 through P. Alignments which make g1 reach g2 are not unique, and optimal alignments between g1 and g2 are alignments containing a minimum number of edit operations.
Definition 2.1. (Graph Edit Distance) The edit distance between g1 and g2 , denoted by λ(g1 , g2 ), is the number
of edit operations in the optimal alignments that make g1
reach g2 .
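To make Definition 2.1 concrete, the following is a minimal Python sketch that computes the edit distance of very small graphs by exhaustive search. It is an illustration under stated assumptions, not a method proposed in this paper: it assumes unit costs, pads the smaller graph with virtual ε-vertices, and tries every vertex mapping, so its running time is factorial in the number of vertices. The function name `brute_force_ged`, the `EPS` placeholder and the list/index encoding of graphs are hypothetical choices made for the example.

```python
from itertools import permutations

EPS = None  # stands in for the special label ε of a virtual (padding) vertex


def brute_force_ged(labels1, edges1, labels2, edges2):
    """Exact edit distance for tiny graphs by trying every vertex mapping.

    labels*: list of vertex labels; vertex i carries labels*[i]
    edges*:  iterable of undirected edges given as (i, j) index pairs
    """
    n = max(len(labels1), len(labels2))
    # Pad the smaller graph with virtual ε-vertices so both have n vertices.
    l1 = list(labels1) + [EPS] * (n - len(labels1))
    l2 = list(labels2) + [EPS] * (n - len(labels2))
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}

    best = None
    for perm in permutations(range(n)):  # perm[i] = vertex of g2 matched to i
        # Vertex cost: a relabelling, or an insertion/deletion if one side is ε.
        cost = sum(1 for i in range(n) if l1[i] != l2[perm[i]])
        # Edge cost: edges present in exactly one graph under this mapping.
        mapped = {frozenset(perm[v] for v in e) for e in e1}
        cost += len(mapped ^ e2)
        if best is None or cost < best:
            best = cost
    return best


# A triangle versus a path on the same labels: one edge deletion suffices.
triangle = (["A", "B", "C"], [(0, 1), (1, 2), (0, 2)])
path = (["A", "B", "C"], [(0, 1), (1, 2)])
print(brute_force_ged(*triangle, *path))
```

The exhaustive loop is only usable for a handful of vertices, which is exactly the cost that the bounds developed later in the paper are meant to avoid.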
¹ Σ is an alphabet consisting of all labels.
2.2 Computation Complexity Analysis
In the above, the formulation and properties of the GED problem have been introduced. Next, we investigate its computational complexity. Justice et al. [16] used the adjacency matrix representation to formulate a BLP (binary linear program, i.e., a linear program in which all variables are either 0 or 1) to solve the GED problem. Because solving a BLP is NP-Hard [11] in general, it is likely that the GED problem is NP-Hard. We confirm this possibility here.
Lemma 2.1. Given two graphs g1 and g2, λ(g1, g2) ≥ ||E2| − |E1|| + ||V2| − |V1||.
Proof. Once the edit operations have transformed g1 into g2, the transformed graph must have the same number of vertices and edges as g2. Since a single edit operation changes either the number of vertices or the number of edges by at most one, at least ||E2| − |E1|| + ||V2| − |V1|| edit operations are needed to do so.
Lemma 2.2. Given two graphs g1 and g2 where |V1| ≤ |V2| and |E1| ≤ |E2|, g1 is subgraph isomorphic to g2 iff λ(g1, g2) = (|E2| − |E1|) + (|V2| − |V1|).
Proof. Necessity: if g1 is subgraph isomorphic to g2, then there exists an injection η that satisfies the required conditions. We define V = {u | ∃ v ∈ V1, η(v) = u} and E = {(η(u), η(v)) | (u, v) ∈ E1}. Since η is injective, V ⊆ V2, E ⊆ E2, |E| = |E1| and |V| = |V1| must hold. Suppose P is an alignment which removes all edges in E2 − E and all vertices in V2 − V; then g2 can reach g1 through P. Based on Lemma 2.1, P is an optimal alignment which makes g2 reach g1, thus λ(g1, g2) = (|E2| − |E1|) + (|V2| − |V1|).
Sufficiency: assume P is an alignment that makes g2 reach g1. For the graph g′ obtained by performing P on g2 to possess the same number of edges and vertices as g1, at least (|E2| − |E1|) + (|V2| − |V1|) delete operations must exist in P. Thus, if P is an optimal alignment containing exactly (|E2| − |E1|) + (|V2| − |V1|) edit operations, neither insertions nor vertex relabellings exist in P. The graph isomorphism from g1 to g′ is therefore a subgraph isomorphism from g1 to g2, i.e., g1 ⊑ g2.
Therefore, graph edit distance can also be used to determine the subgraph isomorphism which is NP-Complete [11].
Then we can derive the following lemma.
Lemma 2.3. GED problem is NP-Hard.
Proof. For two graphs g1 and g2 , if |V1 |>|V2 | or |E1 |>|E2 |,
we can quickly state that g1 cannot be subgraph isomorphic
to g2 because it is impossible to find such a subgraph isomorphism. In the case of |V1 |≤|V2 | and |E1 |≤|E2 |, based on
Lemma 2.2, the subgraph isomorphism between g1 and g2
which is NP-Complete [11] can be reduced to GED problem
in polynomial time. GED problem is therefore NP-Hard.
According to Lemma 2.3, it is prohibitively difficult to compute the graph edit distance for large graphs. In this situation, people usually look for approximation algorithms to avoid the expensive computation. Unfortunately, according to Lemma 2.4 introduced below, it is unlikely that the GED problem can be approximated within a polynomial time computable factor f(n) unless P=SPP, where SPP is a complexity class first introduced in [9]. P is known to be contained in SPP, but SPP is not known to be contained in NP. Also, SPP is not known to contain any NP-complete problems, but SPP does contain some problems believed to be hard, e.g., it contains decision versions of discrete logarithm and integer factoring. Therefore, it is unlikely that SPP is contained in P.
Lemma 2.4. For any polynomial time computable function f (n), the graph edit distance between two graphs cannot
be approximated within a factor of f (n), unless P=SPP.
Proof. Assume on the contrary that there is a polynomial time approximation algorithm, A, with factor f(n) for the general GED problem. We will show that A can be used to decide the graph isomorphism (GI) problem, which is in SPP [2], in polynomial time, thus implying P=SPP.
The central idea is a reduction from the GI problem to the GED problem. Given two graphs g0 and g1 whose vertex labels are all identical, g0 and g1 can be treated as general graphs without vertex labels. If g0 is graph isomorphic to g1, then λ(g0, g1) = 0 must hold. Observe that when running on g0 and g1, algorithm A must return a result A(g0, g1) ≤ f(n) × λ(g0, g1) = 0 if g0 is isomorphic to g1. Otherwise, A must return a value greater than 0. Therefore, A can be used to determine whether g0 and g1 are isomorphic.
Based on these analyses, we conclude that heuristic approaches to approximate GED will most probably be the
best way to go towards solving the problem of computing
graph edit distance.
3. RELATED WORK
3.1 Graph Edit Distance
There are a number of existing studies addressing the graph edit distance computation problem [22, 25]. They fall into two categories: exact algorithms and heuristic algorithms.
The most widely used method for computing the exact graph edit distance is based on the well-known A* algorithm [12], and Kaspar Riesen et al. used a bipartite heuristic to speed up the computation procedure [25]. However, as stated in [22], in practice this kind of algorithm is practical only for computing the edit distance of graphs typically possessing 12 vertices or fewer. Exact algorithms therefore cannot be applied in applications involving large graphs, and plenty of heuristic algorithms have been devised to compute lower and upper bounds for GED with unbounded errors.
Exploiting the strategy of the A* algorithm, Michel Neuhaus et al. proposed a heuristic algorithm [22] that maintains only a fixed number of nodes with the lowest cost and introduces an additional weighting factor favoring long partial edit paths over shorter ones. In the pattern recognition community, the GED problem is known as graph matching. From the standpoint of information theory, it seeks the matched configuration of vertices that has maximum a posteriori (MAP) probability w.r.t. the available vertex attribute information. As graph matching algorithms aim to optimize a global MAP criterion [32, 21], some heuristic algorithms have been devised based on this framework [21, 32]. However, it is hard to analyze the computational complexities of the above heuristic algorithms, and the suboptimal solutions provided by them are also unbounded.
Meanwhile, the authors of [1] and [16] formulated the GED problem as a BLP problem². As this formulation will be used in Section 4.3, it is described in detail.
² Although the work in [1] focused on the weighted graph matching problem, it can be considered as a special case of graph edit distance on graphs without vertex labels.
The adjacency matrix A^g for g is given by A^g = {a_{i,j}}, where
\[ a_{i,j} = \begin{cases} 1, & \text{if } (v_i, v_j) \in E(g) \\ 0, & \text{otherwise} \end{cases} \]
For two graphs g and h, assume |V(g)| = n and |V(g)| ≥ |V(h)|; then vertices with the special label ε are inserted into h such that h also contains n vertices. Let P = {P_{i,j}} be an n×n permutation matrix and C = {C_{i,j}} be an n×n label matrix where
\[ P_{i,j} \in \{0, 1\}, \qquad \sum_{i=1}^{n} P_{i,j} = \sum_{j=1}^{n} P_{i,j} = 1 \]
\[ C_{i,j} = \begin{cases} 0, & \text{if } l_g(v_i) = l_h(u_j),\ v_i \in V(g),\ u_j \in V(h) \\ 1, & \text{otherwise} \end{cases} \]
Note that P is an orthogonal matrix, i.e., P P^T = P^T P = I, where I is the identity matrix. For each permutation matrix P, the cost of transforming g to h using P, C(g, h, P), is defined as
\[ C(g, h, P) = \sum_{i=1}^{n} \sum_{j=1}^{n} C_{i,j} P_{i,j} + \frac{1}{2} \left\| A^g - P A^h P^T \right\|_1 \qquad (1) \]
where ||·||_1 denotes the L1 norm, i.e.,
\[ \|A\|_1 = \sum_{i=1}^{n} \sum_{j=1}^{n} |a_{i,j}| \]
The edit distance between g and h can be formulated as
\[ \lambda(g, h) = \min_{P} C(g, h, P) \qquad (2) \]
This shows that the problem of graph edit distance is equivalent to finding an optimal permutation matrix P* that minimizes C(g, h, P). According to Lemma 2.3, it is prohibitively difficult to find P* directly. Hence, some techniques have been proposed to overcome this restriction. Let
\[ R = A^g - P A^h P^T \qquad (3) \]
Since P is a permutation matrix, the L1 norm of (3) is
\[ \|R\|_1 = \|R P\|_1 = \left\| A^g P - P A^h \right\|_1 \qquad (4) \]
Let VEC(A) be the column vector obtained by stacking the columns of A on top of one another. Then, (4) can be written in the form
\[ \mathrm{VEC}(R P) = A^{gh} \cdot \mathrm{VEC}(P) \]
where A^{gh} is an n^2 × n^2 constant matrix derived from g and h. It is clear from the above transformation that ||A^g P − P A^h||_1 = ||A^{gh} VEC(P)||_1. Then, (2) can be written as:
\[ \lambda(g, h) = \min_{P} \left( \frac{1}{2} \left\| A^{gh}\, \mathrm{VEC}(P) \right\|_1 + \sum_{i=1}^{n} \sum_{j=1}^{n} C_{i,j} P_{i,j} \right) \qquad (5) \]
A consequent advantage of (5) is that (2) is transformed from a nonlinear optimization problem into a linear optimization problem, and a lower bound of λ can be obtained in O(n^7) time by extending the domain of P from {0, 1} to [0, 1]. Moreover, an upper bound of λ can also be obtained in O(n^3) time with only the vertex edit term, i.e., connectivity information is not considered when computing the upper bound. Accordingly, this upper bound will not be tight. More details about this solution can be found in [16].
3.2 Graph Search
There is a wealth of literature concerning the problem of graph search, and a large number of algorithms have been devised. Due to the complexity of (sub)graph isomorphism, people usually exploit feature-based indexing approaches in practical graph search systems to improve performance. In content-based image retrieval, Petrakis et al. represented each graph as a vector of features and indexed graphs in a multidimensional space using R-trees [23]. Instead of casting graphs to vectors, a metric indexing scheme was proposed which organizes graphs hierarchically by means of their mutual distances [3]. All the above systems are designed to address the full graph search problem.
For subgraph search systems, almost all of them exploit path-based or graph-based indexing approaches. GraphGrep [26] is a well-known representative of the path-based approach. Srinivasa et al. built multiple abstract graphs [27], gIndex [34] uses discriminative frequent structures, C-Tree [13] adopted graph closures, SAGA [28] employed pathway fragments, cIndex [8] adopted contrast subgraphs, while Tree+∆ [37] exploited frequent tree features (Tree) and a small number of discriminative graphs (∆).
Because of the noise that is usually present in graph databases, a common problem in graph search is: “what if there is no match or very few matches for a given query?” [35]. In this scenario, similarity search becomes an appealing and natural choice. There are also a number of systems supporting similarity search over graph databases. For example, a three-tier algorithm for full graph similarity search was introduced by Raymond et al. [24], whose similarity is based on the maximum common edge subgraph; He et al. [13] exploited an approximate graph edit distance computed through heuristic neighbor biased mapping methods; the similarity measure proposed in [28] consists of three components, StructDist, NodeMismatches and NodeGaps; while Grafil [35] supports approximate subgraph search by allowing edge relaxations.
4. GRAPH EDIT DISTANCE EVALUATION
In this section, we present methods for efficiently obtaining the lower and upper bounds of the edit distance between two graphs.
4.1 Graph Transformation
The key idea of this paper is to transform a graph structure to a multiset of star structures. This transformation
retains certain structural information of the original graph
structure.
4.1.1 Star Structure
Definition 4.1. (Star Structure) A star structure s is an attributed, single-level, rooted tree which can be represented by a 3-tuple s = (r, L, l), where r is the root vertex, L is the set of leaves and l is a labeling function. Edges exist between r and every vertex in L, and no edge exists among vertices in L.
In a star structure, the root vertex is the center and the vertices in L can be considered as satellites surrounding the center. For any vertex vi in a graph g, we can generate a corresponding star structure si in the following way: si = (vi, Li, l) where Li = {u | (u, vi) ∈ E}. Accordingly, we can derive n star structures from a graph containing n vertices. Thus, a graph can be mapped to a multiset³ of star structures. We call this multiset the star representation of the graph g, denoted by S(g).
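The construction of S(g) can be sketched in a few lines of Python. This is a minimal illustration, assuming a graph is given as a dictionary from vertex identifiers to labels plus a list of undirected edges; the function name `star_representation` and this encoding are hypothetical choices for the example. Each star is stored as a pair of its root label and the sorted tuple of its leaf labels.

```python
from collections import Counter


def star_representation(labels, edges):
    """Return S(g) as a list with one star per vertex.

    labels: dict mapping vertex id -> label
    edges:  iterable of undirected edges (u, v)
    Each star is (root_label, sorted tuple of leaf labels).
    """
    neighbours = {v: [] for v in labels}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    return [(labels[v], tuple(sorted(labels[u] for u in neighbours[v])))
            for v in labels]


# A vertex labelled A connected to two vertices labelled B.
stars = star_representation({1: "A", 2: "B", 3: "B"}, [(1, 2), (1, 3)])
print(Counter(stars))
```

In this example the two B-vertices produce the same star, which is why S(g) has to be treated as a multiset rather than a set.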
³ Not a “set” here, since the same star structure may appear multiple times.
4.1.2 Star Edit Distance
Due to the particularity of star structures, the edit distance between two star structures can be computed easily, as shown below.
Lemma 4.1. (Star Edit Distance) Given two star structures s1 and s2,
\[ \lambda(s_1, s_2) = T(r_1, r_2) + d(L_1, L_2) \]
where
\[ T(r_1, r_2) = \begin{cases} 0 & \text{if } l(r_1) = l(r_2), \\ 1 & \text{otherwise,} \end{cases} \]
\[ d(L_1, L_2) = \big|\, |L_1| - |L_2| \,\big| + M(L_1, L_2) \]
\[ M(L_1, L_2) = \max\{ |\Psi_{L_1}|, |\Psi_{L_2}| \} - |\Psi_{L_1} \cap \Psi_{L_2}| \]
and Ψ_L is the multiset of vertex labels in L.
Proof. This proof is simple and we omit it here.
Based on the above lemma, the main cost of computing the edit distance between two star structures is the multiset intersection operation. To speed up this operation, a total order should be defined on Σ. For instance, we can assign distinct integers to distinct vertex labels. After that, a multiset can be sorted in ascending order with computational complexity O(n log n). Note that this sort operation can be accomplished during the preprocessing of input databases. Algorithm 1 illustrates an efficient method to compute the intersection of two multisets whose elements are sorted in ascending order. The analysis of this algorithm is quite simple, and its computational complexity is Θ(n).
Algorithm 1 Mi - Multiset Intersection
Input: two sorted vectors A and B
Output: C - the elements that appear in both A and B
  C ← ∅, i ← 0, j ← 0;
  while i < |A| and j < |B| do
    if A[i] = B[j] then
      C ← C ∪ {A[i]};
      i++, j++;
    else if A[i] < B[j] then
      i++;
    else
      j++;
    end if
  end while
  return C;
4.2 Lower Bound of Edit Distance
Based on the star representation of graphs, a polynomial time computable distance Lm is introduced in this subsection to give a lower bound for graph edit distance.
Figure 1: Computing ζ(S1, S2): A Bipartite Graph Matching Problem
Figure 2: Mapping between Nodes of Two Graphs in an Optimal Alignment
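Since the distances in this subsection are built from star edit distances, it may help to see Lemma 4.1 and Algorithm 1 in executable form first. The following is a minimal Python sketch, assuming a star is encoded as a pair (root label, list of leaf labels) and that labels are comparable (for example, strings or integers); the function names are hypothetical. Leaf lists are sorted inside the function for clarity, whereas the text suggests doing this once during preprocessing.

```python
def multiset_intersection(a, b):
    """Algorithm 1: common elements of two ascending-sorted lists."""
    c, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            c.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return c


def star_edit_distance(s1, s2):
    """Lemma 4.1: lambda(s1, s2) = T(r1, r2) + d(L1, L2)."""
    (r1, leaves1), (r2, leaves2) = s1, s2
    t = 0 if r1 == r2 else 1                       # T(r1, r2)
    common = multiset_intersection(sorted(leaves1), sorted(leaves2))
    m = max(len(leaves1), len(leaves2)) - len(common)   # M(L1, L2)
    d = abs(len(leaves1) - len(leaves2)) + m            # d(L1, L2)
    return t + d


# Same root, one extra leaf D: one vertex deletion plus one edge deletion.
print(star_edit_distance(("A", ["B", "C", "D"]), ("A", ["B", "C"])))
```

Each comparison in the merge loop advances at least one index, which is where the linear running time of the intersection comes from.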
4.2.1 Mapping Distance
We first define the distance between two multisets of star
structures. Subsequently, we will define the mapping distance between two graphs based on their star representations which are multisets.
Definition 4.2. Given two multisets of star structures S1 and S2 with the same cardinality, and assume P : S1 → S2 is a bijection. The distance ζ between S1 and S2 is
\[ \zeta(S_1, S_2) = \min_{P} \sum_{s_i \in S_1} \lambda(s_i, P(s_i)) \]
The computation of ζ(S1, S2) is equivalent to solving the assignment problem, one of the fundamental combinatorial optimization problems, which aims at finding the minimum weight matching in a weighted bipartite graph. Given two sets of vertices V and V′ in a weighted graph, the problem is to find a set of edges E such that each vertex in V is linked to exactly one vertex in V′ and the sum of the weights of the edges in E is minimized. In our case, these two sets of vertices are the two multisets of star structures S1 and S2, and the weight of the edge that connects a star in S1 to a star in S2 is the edit distance between the two stars.
Figure 1 shows an example illustrating the bipartite graph matching problem that must be solved in order to compute ζ(S(g1), S(g2)). Given the two graphs g1 and g2 on the left of the figure, their corresponding multisets of star structures are shown on the right-hand side of the figure. The optimal matching between the stars is shown as solid lines, while the other edges joining the stars are shown as dotted lines.
Note that in Figure 1, the two graphs have different numbers of nodes and hence an additional node labelled ε was added to S(g2). We will now address this issue. Assume |V1| − |V2| = k ≥ 0; then |S(g1)| − |S(g2)| = k must hold. In order to make the two graphs have the same number of vertices, k vertices with the label ε are inserted into g2. We call this process the normalization of g2 w.r.t. g1. Because vertex insertion can be considered as relabelling a vertex from ε to σ, this normalization will not change the edit distance between g1 and g2. After this normalization, S(g1) and S(g2) possess the same cardinality, and the mapping distance between two graphs can be defined upon the distance between their star representations.
In order to solve the bipartite graph matching problem, we first create an n×n matrix in which each element represents the edit distance between the ith star in S(g1) and the jth star in S(g2). The Hungarian algorithm [18] is then applied to this square matrix to obtain the minimum cost in O(n^3) time. We now formally define the mapping distance between two graphs.
Definition 4.3. (Mapping Distance) The mapping distance µ(g1 , g2 ) between g1 and g2 is defined as:
two star structures rooted by vi and vj are affected.
For the star rooted by vi , a new vertex vj and a new
edge (vi , vj ) are newly inserted into this star structure after the edge insertion, as illustrated in Figure 3. Likewise, the same effect is incurred by the
star structure rooted at vj . Thus, we can know that
µ(hm , hm+1 ) ≤ 2 × 2 = 4. In complement, one edge
deletion has the same affection. Therefore, in the case
of performing one edit operation of inserting or deleting a edge on hm , µ(hm , hm+1 ) ≤ 4.
vi
vj
vi
...
...
µ(g1 , g2 ) = ζ(S(g1 ), S(g2 ))
Intuitively, the optimal mapping for computing the mapping distance between S(g1 ) and S(g2 ) is trying to approximate the mapping between the nodes of g1 and g2 in an optimal alignment. Figure 2 shows the mapping between the
nodes of the two graphs that are shown in Figure 1 when
they are optimally aligned. In this case, the optimal mapping computed on the bipartite graph is in fact the same as
the mapping in the optimal alignment. Note that since the
mapping in the bipartite graph take only each nodes and
their neighbors into consideration, there is in fact less constraints on the output when determining the optimal mapping in the bipartite graphs compare to determining the
optimal mapping between the two graphs. Because of this,
it is possible to compute a lower bound of the edit distance
for g1 and g2 by utilizing µ(g1 , g2 ). We will prove this lower
bound formally in the next section.
Furthermore, since the mapping computed on the bipartite graph might not be the same as the mapping computed
for the optimal alignment between g1 and g2 , the number of
edit operation that convert g1 into g2 based on the bipartite
graph mapping4 will either be the same or higher than the
actual edit distance between the two graphs. Later on in
this paper, we will utilize this observation to compute an
upper bound on the edit distance between two graphs.
4.2.2 Lower Bound of Graph Edit Distance
Lemma 4.2. Let g1 and g2 be two graphs, then the mapping distance µ(g1 , g2 ) between g1 and g2 satisfies the following:
µ(g1 , g2 ) ≤ max{4, [max{δ(g1 ), δ(g2 )]} + 1]} · λ(g1 , g2 )
Proof. Let P =(p1 , p2 , . . . , pk ) be an alignment transforming g1 to g2 . Accordingly, there is a sequence of graphs
g1 =h0 →h1 →. . . →hk =g2 , where hi−1 →hi indicates that hi
is the derived graph by performing pi over hi−1 for 1≤i≤k.
Assume there are k1 edge insertion/deletion operations, k2
vertex insertion/deletion operations and k3 vertex relabellings
in P , then k1 +k2 +k3 =k. Next, we will analyze each kind of
edit operations in detail.
1. Edge Insertion/Deletion: When an edge is inserted between two vertices vi and vj in the graph hm, only the stars rooted at vi and vj are affected (Figure 3), which gives µ(hm, hm+1) ≤ 4, the coefficient of k1 in the inequality below. The same bound holds for edge deletion.
2. Vertex Insertion/Deletion: In the case of vertex insertion, since the newly inserted vertex v0 has no edge attached, the operation is equivalent to changing a vertex's label from ε to σ. Thus, in the case of inserting one vertex, µ(hm, hm+1) = 1. By symmetry, the same result holds for deleting one vertex.
3. Vertex Relabelling: Assume the label of a vertex v0 is changed from σ1 to σ2. The star rooted at v0 and the deg(v0) stars that have v0 as a leaf are affected by this relabelling. For each such star, only the label of v0 is changed. Therefore, for one vertex relabelling operation, µ(hm, hm+1) ≤ 1 × (deg(v0) + 1).
Figure 3: Star changes incurred by edge insertion
Footnote 4: This can be computed using Equation 1 in Section 3.1.
Combining the three cases, we obtain the following inequality:
µ(g1, g2) ≤ 4 · k1 + 1 · k2 + (max{δ(g1), δ(g2)} + 1) · k3
≤ max{4, [max{δ(g1), δ(g2)} + 1]} · (k1 + k2 + k3)
≤ max{4, [max{δ(g1), δ(g2)} + 1]} · λ(g1, g2)
This completes the proof. □
Based on Lemma 4.2, µ provides a lower bound Lm of λ, i.e.,
$$\lambda(g_1, g_2) \ge L_m(g_1, g_2) = \frac{\mu(g_1, g_2)}{\max\{4,\ \max\{\delta(g_1), \delta(g_2)\} + 1\}}$$
Before applying the Hungarian algorithm, an n × n cost matrix must be constructed. Since the edit distance between two star structures can be computed in Θ(n) time, the cost of this construction is Θ(n^3). The computational complexity of the Hungarian algorithm is O(n^3), so both µ and Lm can be computed in Θ(n^3) time.
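The computation of µ and Lm described above can be illustrated with a minimal Python sketch. It assumes the n × n matrix of star edit distances has already been built; the matrix values and the function names `mapping_distance` and `lower_bound` are hypothetical. For brevity, a brute-force search over all assignments stands in for the Hungarian algorithm, so the sketch is only practical for very small n.

```python
from itertools import permutations


def mapping_distance(cost):
    """Minimum total cost of a one-to-one assignment of rows to columns.

    Brute force over all permutations; this stands in for the Hungarian
    algorithm and is only meant for tiny illustrative matrices.
    """
    n = len(cost)
    return min(
        sum(cost[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )


def lower_bound(mu, max_deg_g1, max_deg_g2):
    """L_m = mu / max{4, max{delta(g1), delta(g2)} + 1}."""
    return mu / max(4, max(max_deg_g1, max_deg_g2) + 1)


# Hypothetical 3 x 3 matrix of star edit distances between S(g1) and S(g2).
cost = [
    [0, 3, 4],
    [2, 1, 5],
    [4, 2, 0],
]
mu = mapping_distance(cost)
lm = lower_bound(mu, max_deg_g1=2, max_deg_g2=3)
print(mu, lm)
```

In a realistic setting the brute-force search would be replaced by a polynomial-time assignment solver to obtain the Θ(n^3) behavior discussed above.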
4.3 Upper Bound of Edit Distance
In the previous subsection, Lm was devised as a lower bound for the graph edit distance λ. Next, we introduce two algorithms to compute upper bounds for λ.
The first upper bound comes naturally from the computation of µ introduced in Section 4.2.1. As mentioned, since the optimal mapping computed for the bipartite graph in Section 4.2.1 considers only each vertex and its neighbors, there are in fact fewer constraints than when computing the edit distance between the corresponding graphs. Assuming that the output of the Hungarian algorithm in Section 4.2.1 leads to a mapping P̄ from V(g) to V(h), we can simply use Equation 1 from Section 3.1 to compute C(g, h, P̄), denoted as τ(g, h). Since this mapping might not be optimal, τ(g, h) ≥ λ(g, h), where λ(g, h) is the actual edit distance between g and h. Because C(g, h, P) can be evaluated in Θ(n^2) time once the mapping is known, and the mapping itself is obtained in Θ(n^3) time, τ(g, h) is an upper bound of λ(g, h) that can be computed in Θ(n^3) time.
While τ(g, h) gives an initial upper bound on the edit distance, it is possible to iteratively refine the bipartite graph mapping in order to tighten this upper bound. The main idea is that, given any two nodes u1 and u2 in g and their corresponding images f(u1) and f(u2) in h (where f is the mapping function corresponding to P̄), we swap f(u1) and f(u2) if this reduces the edit cost. As such, for any pair u1, u2 ∈ V(g), a new mapping function f′ can be defined as follows:
$$f'(u) = \begin{cases} f(u) & \text{if } u \neq u_1 \text{ and } u \neq u_2, \\ f(u_2) & \text{if } u = u_1, \\ f(u_1) & \text{if } u = u_2. \end{cases}$$
Let P′ denote the permutation matrix corresponding to f′. Because there are C(n, 2) = n(n−1)/2 pairs of vertices in V(g), a mapping function f yields C(n, 2) newly generated permutation matrices. For each P′ we obtain a new value C(g, h, P′). Let P0 be the permutation matrix that results in the minimum value of C(g, h, P′), i.e.,
$$P_0 = \arg\min_{P'} C(g, h, P')$$
If C(g, h, P0) < C(g, h, P̄), we get a tighter upper bound of λ(g, h). We then assign P0 to P̄ and continue the refinement on P̄ until C(g, h, P0) ≥ C(g, h, P̄). Algorithm 2 illustrates this iterative refinement approach in detail.
Algorithm 2 Refine(g, h, P)
Input: two graph structures g and h
Input: a permutation matrix P of g and h
Output: refined suboptimal distance of g and h
1. dist ← C(g, h, P);
2. min ← dist;
3. for any pair (ui, uj) ∈ V(g) do
4.   get P′ based on ui and uj;
5.   if min > C(g, h, P′) then
6.     min ← C(g, h, P′);
7.     Pmin ← P′;
8.   end if
9. end for
10. if min < dist then
11.   min ← Refine(g, h, Pmin);
12. end if
13. return min;
Let ρ(g, h) denote the value returned by Refine. Because the optimization problem shown in Equation (2) is not a linear optimization problem, Refine only finds a locally optimal solution. Like τ, ρ is therefore also an upper bound of λ. The relationships between Lm, τ, ρ and λ can thus be represented by the inequality Lm ≤ λ ≤ ρ ≤ τ. In addition, because ρ is never less than λ and the cost strictly decreases in every refinement step, Refine is guaranteed to terminate. Because Refine is an iterative algorithm, its computational complexity is difficult to analyze. However, the value of τ will not exceed the total number of vertices and edges residing in the two graphs, i.e., τ ≤ 2 × (n + 0.5n^2) = 2n + n^2, where n is the number of vertices in the graphs involved. Refine is therefore guaranteed to terminate within 2n + n^2 steps. Moreover, the cost of each step is C(n, 2) × n^2, so the computational complexity of Refine is at most O(n^6). Note that this run time complexity is a theoretical worst case. In practice, Refine performs better and converges after a small number of iterations.
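The refinement loop of Algorithm 2 can be sketched in Python as follows. This is an illustrative sketch under several assumptions: the mapping is stored as a list `f` in which `f[u]` is the vertex of h matched to vertex `u` of g, both graphs have the same number of vertices (for example after padding), and the hypothetical `edit_cost` function (label mismatches plus edge mismatches) merely stands in for C(g, h, P) from Equation 1, which is not reproduced in this section. The recursion of Algorithm 2 is written as a loop.

```python
from itertools import combinations


def edit_cost(g, h, f):
    """Hypothetical stand-in for C(g, h, P): label mismatches plus edge
    mismatches under the vertex mapping f."""
    labels_g, edges_g = g
    labels_h, edges_h = h
    relabels = sum(labels_g[u] != labels_h[f[u]] for u in range(len(f)))
    mapped = {frozenset((f[u], f[v])) for u, v in edges_g}
    target = {frozenset(e) for e in edges_h}
    # Edges present on only one side must be inserted or deleted.
    return relabels + len(mapped ^ target)


def refine(g, h, f, cost=edit_cost):
    """Repeatedly apply the best pairwise swap until no swap lowers the cost.

    Returns the refined cost and the corresponding mapping.
    """
    best_f, best = list(f), cost(g, h, f)
    while True:
        cand_f, cand = None, best
        for i, j in combinations(range(len(best_f)), 2):
            swapped = list(best_f)
            swapped[i], swapped[j] = swapped[j], swapped[i]
            c = cost(g, h, swapped)
            if c < cand:
                cand_f, cand = swapped, c
        if cand_f is None:  # local optimum reached
            return best, best_f
        best_f, best = cand_f, cand


# Two labeled paths: A-B-C and C-B-A, starting from the identity mapping.
g = (["A", "B", "C"], [(0, 1), (1, 2)])
h = (["C", "B", "A"], [(0, 1), (1, 2)])
print(edit_cost(g, h, [0, 1, 2]))
print(refine(g, h, [0, 1, 2]))
```

Each pass evaluates all C(n, 2) swaps and keeps only the best one, which mirrors lines 3 to 9 of Algorithm 2; a first-improvement strategy would also terminate but is not what the algorithm describes.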
5. APPLICATIONS IN GRAPH SEARCH
Next, we look at how the bounds developed in the previous section can be utilized to perform various types of graph search. As described in Section 2, graph edit distance can be used to measure structural similarity and also to determine subgraph isomorphism. All three kinds of graph search listed in Section 1 can therefore be done using graph edit distance as a similarity measure. However, the problem of GED computation is NP-Hard, and exact computation is very expensive. We therefore exploit the upper and lower bounds of the edit distance to improve search performance by filtering out graphs that definitely will not be in the answer set, thus avoiding the expensive graph edit distance computation.
5.1 Approximate Full Graph Search
Because of the existence of noise in graph databases, approximate graph search is often preferable. To the best of our knowledge, while there are studies dealing with similarity search over graph databases using specific similarity measures, no algorithm has been proposed in which graph edit distance is used as the similarity measure. Here, we use the three bounds of λ developed earlier to build an effective algorithm, AppFull, for performing approximate full graph search over graph databases using graph edit distance as the similarity measure.
As shown in Algorithm 3, a multi-level composition strategy based on Lm, τ and ρ is adopted in AppFull. Given a query graph q and an edit distance threshold l, for each graph g in the graph database D, Lm(g, q) is first used to filter out g if Lm is greater than l (lines 2-4), because λ > l must hold when Lm > l. Subsequently, if τ(g, q) ≤ l, we know that the edit distance between g and q is no greater than l, and g can therefore be reported as a result (lines 5-8). If g passes the above two tests, then ρ is exploited further. As in the case of τ, if ρ(g, q) ≤ l, λ(g, q) must be no greater than l and g can be output as a result (lines 9-12). Finally, if g passes all three tests, λ(g, q) must be computed (lines 13-15). The order of the three tests is significant because their computation costs differ. Among the three, Lm is the cheapest to compute, while ρ is the most expensive. Therefore, if g is resolved by an earlier test, the remaining, more expensive tests are avoided. This arrangement of tests makes AppFull more efficient. In addition, the expensive computation of λ is only conducted for graphs that pass all three tests, so a large number of λ computations are avoided. The performance of AppFull will be evaluated in the experimental study section.
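The filtering cascade described above can be sketched in Python. In this sketch the four distance functions are passed in as callables; the names `lower`, `tau`, `rho` and `exact`, as well as the toy table of precomputed values, are hypothetical and chosen only so that Lm ≤ λ ≤ ρ ≤ τ holds for every graph.

```python
def app_full(database, q, threshold, lower, tau, rho, exact):
    """Return the graphs g with lambda(g, q) <= threshold, plus the number
    of exact edit distance computations that were actually needed."""
    results = []
    exact_calls = 0
    for g in database:
        if lower(g, q) > threshold:
            continue  # lower bound exceeds the threshold: prune g
        # `or` short-circuits, so rho is only computed when tau fails.
        if tau(g, q) <= threshold or rho(g, q) <= threshold:
            results.append(g)  # an upper bound is within the threshold
            continue
        exact_calls += 1
        if exact(g, q) <= threshold:
            results.append(g)
    return results, exact_calls


# Toy data: graph name -> (L_m, tau, rho, lambda) with respect to the query.
bounds = {
    "a": (5, 9, 8, 6),
    "b": (1, 2, 2, 2),
    "c": (1, 6, 3, 2),
    "d": (1, 7, 6, 3),
    "e": (2, 9, 8, 5),
}
results, exact_calls = app_full(
    bounds, "q", 3,
    lower=lambda g, q: bounds[g][0],
    tau=lambda g, q: bounds[g][1],
    rho=lambda g, q: bounds[g][2],
    exact=lambda g, q: bounds[g][3],
)
print(results, exact_calls)
```

In this toy run only two of the five graphs reach the exact computation; the others are resolved by one of the three bounds.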
5.2 Subgraph Search
We next look at subgraph search [13, 37, 35, 8]. We
will first look at exact subgraph search which by itself is
NP-complete. Then, we will look at approximate subgraph
search.
Algorithm 3 AppFull - Approximate Full Graph Search
Input: A query graph q and a graph database D
Input: Distance threshold l
Output: All graphs g in D s.t. λ(g, q) ≤ l
1. for each graph g ∈ D do
2.   if Lm(g, q) > l then
3.     continue;
4.   end if
5.   if τ(g, q) ≤ l then
6.     report g as a result;
7.     continue;
8.   end if
9.   if ρ(g, q) ≤ l then
10.     report g as a result;
11.     continue;
12.   end if
13.   if λ(g, q) ≤ l then
14.     report g as a result;
15.   end if
16. end for
5.2.1 Exact Subgraph Search
According to Lemma 2.2, if g1 is subgraph isomorphic to g2, no vertex relabelling exists in the optimal alignment that transforms g2 into g1. When vertex relabelling is not allowed, the edit distance between two star structures is therefore redefined (recall Lemma 4.1) as follows:
$$\lambda'(s_1, s_2) = T'(s_1, s_2) + d(L_1, L_2)$$
where
$$T'(s_1, s_2) = \begin{cases} 2 + |L_1| + |L_2| & \text{if } l(r_1) \neq l(r_2), \\ 0 & \text{otherwise.} \end{cases}$$
Accordingly, without vertex relabelling in the edit operations, we can derive the following lemma from the analysis in Lemma 4.2:
Lemma 5.1. Let g1 and g2 be two graphs. If no vertex relabelling is allowed in the edit operations, then µ′(g1, g2) ≤ 4 · λ′(g1, g2), where µ′ and λ′ are the mapping and edit distances in the case of no vertex relabelling.
Proof. The proof follows by considering only the first two cases in the proof of Lemma 4.2.
According to Lemma 5.1, a lower bound L′m of λ′ can therefore be introduced: λ′ ≥ L′m = µ′/4. Note that if g1 is a subgraph of g2, then λ(g1, g2) = λ′(g1, g2) and λ(g1, g2) = (|E2|−|E1|) + (|V2|−|V1|) must hold. Therefore, L′m can be easily applied in a filtering algorithm ExactSub for performing exact subgraph search: for a query g1 and a data graph g2, if L′m > (|E2|−|E1|) + (|V2|−|V1|), g2 can be safely filtered.
5.2.2 Approximate Subgraph Search
We next look at approximate subgraph search. In [35], Yan et al. introduced Grafil for performing approximate subgraph search by allowing edge relaxations. Assuming that g3 is the maximal common subgraph between a query graph g1 and a data graph g2, the number of edge relaxations in Grafil is defined as |E1|−|E3|. Note that the definition of edge relaxation implies that no vertex relabelling is allowed. This similarity measure, however, does not take vertex mismatches into account. For example, in Figure 4, according to this measure, the number of edge relaxations between a) and c) is 3, which is the same as that between b) and c). However, a) and b) are quite different. We therefore introduce the following similarity measure based on the graph edit distance to overcome this weakness.
Figure 4: Example Graphs (panels a, b and c)
Definition 5.1. A graph g1 is said to be θ-subgraph isomorphic to g2 if there exists a graph g3 s.t. g3 ⊑ g2 and λ′(g1, g3) ≤ θ.
Thus, in Figure 4, a) is a 6-subgraph of c), while b) is a 4-subgraph of c). Given a query g1 and a graph database D, the problem of θ-subgraph search is to find all graphs g2 in D of which g1 is a θ-subgraph.
Furthermore, assuming l = |E2|−|E1|+|V2|−|V1|, then
λ′(g1, g2) ≤ λ′(g1, g3) + λ′(g3, g2)
= |E2| − |E3| + |V2| − |V3| + λ′(g1, g3)
≤ |E2| − |E1| + |V2| − |V1| + 2λ′(g1, g3)
= l + 2λ′(g1, g3)
The third step holds because, without vertex relabelling, every edit operation changes either the number of edges or the number of vertices by exactly one, so (|E1|−|E3|) + (|V1|−|V3|) ≤ λ′(g1, g3).
Thus, if g1 is a θ-subgraph of g2, λ′(g1, g2) ≤ l + 2θ must hold. Accordingly, we devise a filtering algorithm AppSub to perform θ-subgraph search, in which L′m is used as a filter: if L′m(g1, g2) > l + 2θ, g2 can be safely filtered.
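The two pruning tests used by ExactSub and AppSub reduce to simple comparisons once L′m is known. The following Python sketch illustrates them; the function names and the numeric values are hypothetical, and L′m is assumed to have been computed elsewhere as µ′/4.

```python
def size_gap(n_v1, n_e1, n_v2, n_e2):
    """l = (|E2| - |E1|) + (|V2| - |V1|) for query g1 and data graph g2."""
    return (n_e2 - n_e1) + (n_v2 - n_v1)


def exact_sub_prunes(lm_prime, n_v1, n_e1, n_v2, n_e2):
    """ExactSub test: g2 can be filtered if L'_m exceeds the size gap."""
    return lm_prime > size_gap(n_v1, n_e1, n_v2, n_e2)


def app_sub_prunes(lm_prime, theta, n_v1, n_e1, n_v2, n_e2):
    """AppSub test: g2 can be filtered if L'_m > l + 2 * theta."""
    return lm_prime > size_gap(n_v1, n_e1, n_v2, n_e2) + 2 * theta


# Query with 4 vertices and 3 edges, data graph with 6 vertices and 7 edges.
print(size_gap(4, 3, 6, 7))                      # l
print(exact_sub_prunes(7, 4, 3, 6, 7))           # exact subgraph search
print(app_sub_prunes(7, 1, 4, 3, 6, 7))          # 1-subgraph search
```

With the same value of L′m, a graph pruned for exact subgraph search may survive the approximate test, because the threshold grows by 2θ.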
Lemma 5.2. If g1 is subgraph isomorphic to g2 within n edge relaxations, g1 must be a 2n-subgraph of g2.
Proof. Let g3 be the maximal common subgraph of g1 and g2. We first show by induction that if |V1|−|V3| = k, at least k edges in E1−E3 are needed to ensure that g1 is connected. If |V1|−|V3| = 1, one such edge is clearly needed. Assume the statement is true for k = i−1. In the case of k = i, at least one more edge is needed to connect the newly introduced vertex to the other vertices, so at least i edges are needed. Now suppose |V1|−|V3| ≥ n+1; then at least n+1 edges exist in E1−E3. However, since g1 is a subgraph of g2 within n edge relaxations, |E1|−|E3| ≤ n holds, which is a contradiction. Thus, |V1|−|V3| ≤ n and λ′(g1, g3) = (|E1|−|E3|) + (|V1|−|V3|) ≤ 2n. g1 is therefore 2n-subgraph isomorphic to g2.
According to the above lemma, for the same query q and graph database D, the result set returned by performing 2n-subgraph search is a superset of the result set returned by performing Grafil within n edge relaxations. Later, in the experiments, we will show that 2n-subgraph search in fact returns far fewer candidates for exact edge relaxation computation than the greedy filter approach adopted by Grafil.
To the best of our knowledge, almost all existing subgraph search algorithms adopt the feature-based indexing framework, which requires searching for features and can involve expensive data mining processes. ExactSub and AppSub, introduced here, do not need an index and can filter graphs without performing pairwise subgraph isomorphism tests. In addition, ExactSub and AppSub inherently support both kinds of subgraph search, i.e., traditional subgraph search [37] and containment search [8]. In contrast to traditional subgraph search, which retrieves all the graphs containing the query q, containment search returns all the graphs contained by q. For existing subgraph search systems, however, two distinct index structures must be maintained to support these two kinds of subgraph search [8]. The performance of ExactSub and AppSub is investigated in Section 6.
6. EXPERIMENTAL STUDY
In this section, we present our experimental study on both real and synthetic datasets. We first compared the three methods for obtaining lower and upper bounds of λ with the exact graph edit distance computation algorithm. After that, experiments were conducted to evaluate the scalability of these methods in terms of the number of graphs and the size of graphs. Finally, a variety of experiments were conducted to examine the performance of the three graph search algorithms that apply these bounds, i.e., AppFull, ExactSub and AppSub.
Parameter   Description
D           the number of graphs produced
T           average graph size
V           the number of vertex labels used
Table 2: Parameters of Synthetic Data Generator
In our study, all experiments were performed on a 2.40GHz Intel(R) Pentium(R) PC with 512MB of main memory, running Debian Linux. All programs were implemented in C++ and compiled with GNU g++ using -O2 optimization. Two kinds of datasets were used throughout our experimental study: one real dataset and a series of synthetic datasets.
Real dataset. The real dataset used here is the AIDS antiviral screen compound dataset from the Developmental Therapeutics Program in NCI/NIH, which is publicly available (see footnote 5). It has been widely used in a large number of existing studies [35, 8, 37], and contains 42,687 chemical compounds, among which 422 belong to class CA, 1081 to class CM and the remainder to class CI.
Synthetic datasets. Synthetic datasets were produced by a graph data generator kindly provided by Kuramochi and Karypis. This generator allows the user to specify various parameters, and only the three relevant to our experiments are shown in Table 2. For other parameters, we used the default values provided by the generator. For more details about this generator, please refer to [19].
Footnote 5: http://dtp.nci.nih.gov/docs/aids/aids_data.html
6.1 Comparison with the Exact Algorithm
We first conducted experiments to compare the runtime of the four algorithms for computing Lm, τ, ρ and λ, where λ is provided by an exact graph edit distance computation algorithm, Exact, based on the well-known A* algorithm. As stated in [22], Exact is in practice only able to compute the edit distances of graphs typically containing at most 12 vertices. Accordingly, 1000 graphs were produced by the synthetic generator by setting D=1k, T=10 and V=4. From these 1000 graphs, ten graphs with 10 vertices each were randomly selected to form the graph database D. Meanwhile, six query groups were constructed, each of which contains 10 graphs. All graphs in the same query group have the same number of vertices, and the number of vertices per graph varies across groups from 5 to 10.
Figure 5: Runtime Comparison with Exact (X-axis: Size of Query Graph; Y-axis: Average Runtime (sec); series: Lm, τ, ρ, λ)
Figure 5 depicts the average runtime for calculating the four different distances between a query graph and the graph database D. The X-axis shows the number of vertices contained in the query graph, and the Y-axis shows the corresponding average runtime in log scale. From this figure we can observe that the computation of λ is much more expensive than that of the other three distances, and that the other three algorithms are more than 10,000 times faster than Exact.
Figure 6: Comparison with Exact (panels: a) e1, Y-axis: Average Error (%); b) e2, Y-axis: Average Bound; X-axis: Size of Query Graph; series: Lm, τ, ρ)
Besides the comparison in terms of runtime, we also conducted experiments to show which of Lm, τ and ρ is tighter to λ. In order to measure the tightness, two measures were introduced. The first one is e1, defined as (|d−λ| / λ) × 100%, where d is one of Lm, τ and ρ. However, this measure has the problem that the range of e1 over Lm is [0, 100%] while the range of e1 over τ and ρ is [0, +∞). Thus, we introduced the second measure e2, which is defined as max{λ/d, d/λ}. e2 is commonly used to measure the approximation ratios of approximation algorithms. Figure 6 depicts e1 and e2 respectively. From this figure we can see that ρ is always tighter to λ than τ, which is consistent with the theoretical analysis.
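The two tightness measures can be written down directly. The Python sketch below follows the definitions of e1 and e2 given above; the sample values of d and λ are hypothetical, and both are assumed to be strictly positive.

```python
def e1(d, lam):
    """Relative error of a bound d with respect to lambda, in percent."""
    return abs(d - lam) / lam * 100.0


def e2(d, lam):
    """Approximation ratio max{lambda / d, d / lambda}; 1.0 means exact."""
    return max(lam / d, d / lam)


lam = 4
print(e1(3, lam), e2(3, lam))  # a lower bound below lambda
print(e1(6, lam), e2(6, lam))  # an upper bound above lambda
```

Unlike e1, the ratio e2 treats under- and over-estimation symmetrically, which makes lower and upper bounds directly comparable.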
6.2 Scalability Study
We then conducted experiments to evaluate the scalability of the three bounding algorithms in terms of the number of graphs and the size of graphs.
6.2.1 Scalability over Synthetic Dataset
On the one hand, we conducted experiments to evaluate their scalability over synthetic datasets. First, the scalability in terms of the number of graphs was examined. By setting T=80, V=50 and varying D from 100 to 1000, a series of synthetic datasets were produced as graph databases. Then, 10 query graphs were generated by setting D=10 to compose a query group. Figure 7 a) shows the average runtime for computing Lm, τ and ρ over graph databases with different cardinalities. Based on this figure, we can observe that all three bounding algorithms show good scalability in terms of the number of graphs. Second, experiments were conducted to evaluate their scalability in terms of the size of graphs. By setting D=10, V=50 and varying T from 50 to 100 in steps of 10, six query groups were produced, each of which contains 10 graphs. At the same time, a synthetic dataset containing 1000 graphs was generated as the graph database. Figure 7 b) illustrates the total runtime for calculating the different distances between the query group and the graph database. From Figure 7 b) we can see that these three algorithms also have good scalability in terms of the size of graphs.
Figure 7: Scalability over Synthetic Datasets (panels: a) number of graphs, X-axis: Number of Graphs (100); b) size of graphs, X-axis: Size of Graph (10); Y-axis: Runtime (sec); series: Lm, τ, ρ)
6.2.2 Scalability over AIDS Dataset
On the other hand, we also examined their scalability over the real dataset. First, 10 graphs were randomly selected from the AIDS dataset as query graphs, and a series of graph databases were generated by randomly choosing a specific number of graphs from the AIDS dataset. Figure 8 a) illustrates the average runtime for computing the different distances for a query graph in different graph databases. As shown in Figure 8, all three algorithms scale linearly in terms of the number of graphs over the AIDS dataset. After that, by varying the size of the query graphs, experiments were conducted to evaluate the scalability in terms of the size of graphs. In this case, 1000 graphs were randomly selected from the AIDS dataset as the graph database. Each query group contains 10 graphs with the same number of vertices, which were also randomly selected from the AIDS dataset. Figure 8 b) depicts the average runtime, from which we can observe that these bounding algorithms also scale linearly in terms of the size of graphs over the real dataset.
Consequently, all three bounding algorithms have good scalability in terms of the number of graphs and the size of graphs over both synthetic and real datasets.
Figure 8: Scalability over AIDS Datasets (panels: a) number of graphs, X-axis: Number of Graphs (100); b) size of graphs, X-axis: Size of Graphs; Y-axis: Runtime (sec); series: Lm, τ, ρ)
6.3 Graph Search Performance
Having examined the scalability of the bounding algorithms, we then investigate the performance of the three graph search algorithms applying these bounds.
6.3.1 Approximate Full Graph Search Performance
Applying Lm, τ and ρ, AppFull (Algorithm 3) was introduced to perform approximate full graph search over graph databases. In this subsection, we investigate its performance in terms of the number of expensive λ computations. In order to compare the effectiveness of the different bounds, we implemented two variants of AppFull that exploit different bounds: the legend “Lm” denotes a variant of AppFull using only Lm without τ and ρ, and the legend “Lm+τ” denotes a variant of AppFull using Lm and τ but without ρ. The results are depicted in Figure 9, where the X-axis shows the distance thresholds used in the search and the Y-axis shows the average number of expensive λ computations incurred by the different algorithms. Clearly, it is always preferable to filter as many graphs as possible before performing the expensive λ computation.
For the experiments conducted on the real dataset, 10 and 1000 graphs were randomly selected from the AIDS dataset to form the query graphs and the graph database respectively. For the synthetic datasets, the query graphs and the graph database were produced by the synthetic generator using parameters D=10 and D=1000 respectively. From Figure 9 we can see that AppFull outperformed the other two variants, i.e., the application of upper bounds makes AppFull effective. In particular, AppFull also outperformed “Lm+τ”, i.e., the introduction of ρ makes AppFull more effective. For example, running on the AIDS dataset with distance threshold 50, about 970 and 432 λ computations were needed in “Lm” and “Lm+τ” respectively, while in AppFull only 330 λ computations were needed. In addition, the runtime of AppFull is negligible in comparison with the expensive exact edit distance computation. Without computing λ, AppFull takes less than 1 second per query on both real and synthetic graph databases.
Figure 9: AppFull (panels: a) AIDS dataset, b) Synthetic dataset; X-axis: Distance Threshold; Y-axis: # of λ Computation; series: Lm, Lm+τ, AppFull)
6.3.2 Exact Subgraph Search Performance
One of the important applications of the lower bound of λ is to serve as a filter for performing subgraph search over graph databases. We first evaluated the performance of ExactSub in terms of the number of candidate graphs returned. Ideally, we would compare the performance of ExactSub with other exact subgraph search systems, such as cIndex [8] and Tree+∆ [37]. However, because these approaches involve complicated modules, such as subgraph pattern mining, feature selection and index construction, all of them are still in the process of being released. Meanwhile, given these complicated modules, our own implementations could not be guaranteed to achieve the good performance shown in their papers. Therefore, we did not compare with them. We believe, however, that our experimental results here are adequate to show the performance of ExactSub.
Figure 10: ExactSub (panels: a) AIDS dataset, b) Synthetic dataset; X-axis: Absolute Support Threshold; Y-axis: # of Candidate Graphs; series: Least, ExactSub)
Figure 11: AppSub (panels: a) Q16, b) Q20; X-axis: Number of Edge Relaxation; Y-axis: Number of Candidate Answers; series: Grafil, AppSub)
In the experiments, 1000 graphs were randomly selected from the AIDS dataset as the real graph database Dr, and 1000 graphs were produced by the synthetic generator as the synthetic graph database Ds. In order to get appropriate subgraphs as query graphs, we used gSpan [33] to mine frequent subgraphs from Dr and Ds. By setting min_sup=20, 9,263 frequent subgraphs were discovered from Dr, while 2,252 frequent subgraphs were discovered from Ds with min_sup=10. From these frequent subgraphs, we randomly chose 10 graphs with the same absolute support to form a query group. For example, for the AIDS dataset, by ranging sup from 20 to 50, 7 query groups were produced, while for the synthetic dataset, 5 query groups were constructed by ranging sup from 10 to 30. The reason for choosing such low supports is that at higher supports there may be very few frequent subgraphs, e.g., only 1 or 2. Figure 10 depicts the performance of ExactSub running on both synthetic and real datasets. The Y-axis shows the average number of candidate graphs returned. For a query graph with sup=n, there are n super-graphs of this query graph in the graph database. In addition, because we do not take edge labels into account in our study, the real number of such super-graphs should be greater than n. The legend “Least” in Figure 10 indicates the least number of super-graphs in the graph database, which can never be filtered out by any filtering effort. From Figure 10 we can observe that ExactSub is effective in filtering irrelevant graphs over both synthetic and real datasets. For example, using the query graphs with sup=30 over the AIDS dataset, ExactSub can filter about 91% of the graphs in the graph database. The effectiveness of ExactSub also confirms the effectiveness of L′m in filtering irrelevant graphs.
6.3.3 Approximate Subgraph Search Performance
Finally, the performance of AppSub was compared with Grafil [35] in terms of the number of candidate graphs. Because the software package of Grafil involves many pieces of code from different sources, it is still in the process of being released. We therefore could not conduct a direct comparison. Instead, we adopted the same experimental settings used in [35]. 10,000 graphs were randomly selected from the AIDS dataset to form the graph database, and the query graphs were directly sampled from the database and grouped together according to their number of edges. A query group is denoted by Qm, where m is the number of edges of the graphs in Qm. Figure 11 depicts the performance comparison between AppSub and Grafil w.r.t. two query sets Q16 and Q20 (the data of Grafil were directly copied from Figure 13 and Figure 14 in [35]). The Y-axis shows the average number of candidate graphs returned by each algorithm, and the X-axis shows the number of edge relaxations used in Grafil. To compare the performance of Grafil and AppSub fairly, we doubled the approximation threshold used in AppSub. For example, if 2 edge relaxations are used in Grafil, AppSub performs the 4-subgraph search correspondingly. Based on Figure 11, even though the result set returned by performing 2n-subgraph search is a superset of the result set returned by performing Grafil within n edge relaxations, AppSub outperformed Grafil significantly when n > 2, by a factor of 5 to 10.
7. CONCLUSION AND DISCUSSION
In this paper, three novel distances, Lm, τ and ρ, are introduced to lower and upper bound the graph edit distance λ in polynomial time. Their relationships can be represented by the inequality Lm ≤ λ ≤ ρ ≤ τ, and their computational complexities are shown in Table 3. In comparison to the exact computation of graph edit distance, these three bounding algorithms are more efficient and have good scalability in terms of the number of graphs and the size of graphs. Applying these efficiently computable bounds, three effective algorithms, AppFull, ExactSub and AppSub, are proposed to perform various kinds of graph search. The good performance of these graph search algorithms demonstrated by the experimental results also confirms the effectiveness of these three bounds.
Distance    Complexity
Lm          Θ(n^3)
τ           Θ(n^3)
ρ           O(n^6)
λ           NP-Hard
Table 3: Complexity of Different Distances
So far, our discussion has focused on undirected graphs. Fortunately, our proposed algorithms can be easily extended to directed graphs. For directed graphs, a directed star structure $\vec{s}$ can be represented by a 4-tuple $\vec{s} = (r, I, O, l)$, where I is the set of vertices from which there are edges to r and O is the set of vertices to which there are edges from r. Accordingly, the star edit distance can be redefined as
$$\lambda(\vec{s}_1, \vec{s}_2) = T(r_1, r_2) + d(I_1, I_2) + d(O_1, O_2)$$
Based on this new definition, Lm, τ and ρ can be easily applied to directed graphs.
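The directed star edit distance can be sketched in Python as follows. The definitions of T and d come from Lemma 4.1, which is not reproduced in this section, so the sketch rests on an assumption: T(r1, r2) is taken to be 0 when the root labels agree and 1 otherwise, and d(A, B) is taken to be ||A| − |B|| + M(A, B), with M(A, B) = max{|A|, |B|} − |A ∩ B| computed on label multisets. The tuple layout and function names are hypothetical.

```python
from collections import Counter


def multiset_distance(a, b):
    """Assumed form of d(A, B): size difference plus unmatched labels,
    where a and b are lists of leaf labels (multisets)."""
    common = sum((Counter(a) & Counter(b)).values())
    return abs(len(a) - len(b)) + max(len(a), len(b)) - common


def directed_star_distance(s1, s2):
    """lambda(s1, s2) = T(r1, r2) + d(I1, I2) + d(O1, O2).

    Each star is a tuple (root_label, in_labels, out_labels).
    """
    (r1, in1, out1), (r2, in2, out2) = s1, s2
    t = 0 if r1 == r2 else 1  # assumed root relabelling cost
    return t + multiset_distance(in1, in2) + multiset_distance(out1, out2)


s1 = ("A", ["B", "C"], ["D"])
s2 = ("A", ["B"], ["D", "D"])
s3 = ("B", ["B", "C"], ["D"])
print(directed_star_distance(s1, s2))
print(directed_star_distance(s1, s3))
```

Treating incoming and outgoing neighbors as two separate multisets is what distinguishes this from the undirected case, where a single leaf multiset L is compared.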
8. REFERENCES
[1] H. A. Almohamad and S. O. Duffuaa. A linear programming
approach for the weighted graph matching problem. IEEE
Trans. PAMI, 15(5):522–525, 1993.
[2] V. Arvind and P. P. Kurur. Graph isomorphism is in spp.
Information and Computation, 204(5):835–852, 2006.
[3] S. Berretti, A. D. Bimbo, and E. Vicario. Efficient matching
and indexing of graph models in content-based retrieval. IEEE
Trans. PAMI, 23(10):1089–1105, 2001.
[4] C. Borgelt and M. R. Berthold. Mining molecular fragments:
Finding relevant substructures of molecules. In ICDM ’02.
[5] J. M. Bower and H. Bolouri. Computational Modeling of
Genetic and Biochemical Networks (Computational Molecular
Biology). 2004.
[6] A. Broder, R. Kumar, F. Maghoul, P. Raghavan,
S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph
structure in the web:experiments and models. In WWW ’00.
[7] H. Bunke and K. Shearer. A graph distance metric based on
the maximal common subgraph. Pattern Recognition Letters,
19(3-4):255–259, 1998.
[8] C. Chen, X. Yan, P. S. Yu, J. Han, D.-Q. Zhang, and X. Gu.
Towards graph containment search and indexing. In VLDB ’07.
[9] S. A. Fenner, L. J. Fortnow, and S. A. Kurtz. Gap-definable
counting classes. J. Comput. Syst. Sci., 48(1):116–148, 1994.
[10] M.-L. Fernández and G. Valiente. A graph distance metric
combining maximum common subgraph and minimum common
supergraph. Pattern Recognition Letters, 22(6-7):753–758,
2001.
[11] M. R. Garey and D. S. Johnson. Computers and Intractability;
A Guide to the Theory of NP-Completeness. 1990.
[12] P. Hart, N. Nilsson, and B. Raphael. A formal basis for the
heuristic determination of minimum cost paths. IEEE Trans.
SSC, 4(2):100–107, 1968.
[13] H. He and A. Singh. Closure-tree: An index structure for graph
queries. ICDE’06.
[14] H. He, H. Wang, J. Yang, and P. S. Yu. Blinks: ranked keyword
searches on graphs. In SIGMOD ’07.
[15] H. Hu, X.Y., Y. Hang, J. Han, and X. J. Zhou. Mining coherent
dense subgraphs across massive biological network for
functional discovery. Bioinformatics, 1(1):1–9, 2005.
[16] D. Justice and A. Hero. A binary linear programming
formulation of the graph edit distance. IEEE Trans. PAMI,
28(8):1200–1214, 2006.
[17] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai,
and H. Karambelkar. Bidirectional expansion for keyword
search on graph databases. In VLDB ’05.
[18] H. W. Kuhn. The Hungarian method for the assignment
problem. Naval Research Logistics, 2:83–97, 1955.
[19] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In
ICDM ’01.
[20] B. T. Messmer and H. Bunke. A new algorithm for
error-tolerant subgraph isomorphism detection. IEEE Trans.
PAMI, 20(5):493–504, 1998.
[21] R. Myers, R. C. Wilson, and E. R. Hancock. Bayesian Graph
Edit Distance. IEEE Trans. PAMI, 22(6):628–635, 2000.
[22] M. Neuhaus, K. Riesen, and H. Bunke. Fast suboptimal
algorithms for the computation of graph edit distance. In
SSSPR ’06.
[23] E. G. M. Petrakis and C. Faloutsos. Similarity searching in
medical image databases. IEEE Transactions on Knowledge
and Data Engineering, 9(3):435–447, 1997.
[24] J. Raymond, E. Gardiner, and P. Willett. RASCAL:
Calculation of Graph Similarity using Maximum Common Edge
Subgraphs. The Computer Journal, 45(6):631–644, 2002.
[25] K. Riesen, S. Fankhauser, and H. Bunke. Speeding up graph
edit distance computation with a bipartite heuristic. In MLG
’07.
[26] D. Shasha, J. T. L. Wang, and R. Giugno. Algorithmics and
applications of tree and graph searching. In PODS ’02.
[27] S. Srinivasa and S. Kumar. A platform based on the
multi-dimensional data model for analysis of bio-molecular
structures. In VLDB ’03.
[28] Y. Tian, R. C. McEachin, C. Santos, D. J. States, and J. M.
Patel. SAGA: a subgraph matching tool for biological graphs.
Bioinformatics, 23(2):232–239, 2007.
[29] N. Trinajstic, J. V. Knop, W. R. Muller, K. Szymanski, and
S. Nikolic. Computational Chemical Graph Theory:
Characterization, Enumeration and Generation of Chemical
Structures by Computer Methods. 1991.
[30] S. Trissl and U. Leser. Fast and practical indexing and
querying of very large graphs. In SIGMOD ’07.
[31] P. Willett, J. Barnard, and G. Downs. Chemical similarity
searching. J. Chem. Inf. Comput. Sci., 38(6):983–996, 1998.
[32] R. C. Wilson and E. R. Hancock. Structural matching by
discrete relaxation. IEEE Trans. PAMI, 19(6):634–648, 1997.
[33] X. Yan and J. Han. gSpan: Graph-based substructure pattern
mining. In ICDM ’02.
[34] X. Yan, P. S. Yu, and J. Han. Graph indexing: a frequent
structure-based approach. In SIGMOD ’04.
[35] X. Yan, F. Zhu, P. S. Yu, and J. Han. Feature-based similarity
search in graph structures. ACM TODS, 31(4), 2006.
[36] R. Yang, P. Kalnis, and A. K. H. Tung. Similarity evaluation
on tree-structured data. In SIGMOD ’05.
[37] P. Zhao, J. X. Yu, and P. S. Yu. Graph indexing: Tree + delta
>= graph. In VLDB ’07.