COBBLER: Combining Column and Row Enumeration for
Closed Pattern Discovery
Feng Pan
Gao Cong
Xu Xin
Anthony K. H. Tung
National University of Singapore
email: {panfeng, conggao, xuxin, atung}@comp.nus.edu.sg
Contact Author
Abstract
The problem of mining frequent closed patterns has received considerable attention recently as it promises to have much less redundancy compared to discovering all frequent patterns. Existing algorithms can presently be separated into two groups, feature (column) enumeration and row enumeration. Feature enumeration algorithms like CHARM and CLOSET+ are efficient for datasets with a small number of features and a large number of rows, since the number of feature combinations to be enumerated will be small. Row enumeration algorithms like CARPENTER, on the other hand, are more suitable for datasets (e.g. bioinformatics data) with a large number of features and a small number of rows. Both groups of algorithms, however, will encounter problems on datasets that have a large number of rows and a large number of features.
In this paper, we describe a new algorithm called COBBLER which can efficiently mine such datasets. COBBLER is designed to dynamically switch between feature enumeration and row enumeration depending on the data characteristics encountered during the mining process. As such, each portion of the dataset can be processed using the most suitable method, making the mining more efficient. Several experiments on real-life and synthetic datasets show that COBBLER is orders of magnitude faster than previous closed pattern mining algorithms like CHARM, CLOSET+ and CARPENTER.
1 Introduction
The problem of mining frequent closed patterns has received considerable attention recently as it promises to have much less redundancy compared to discovering all frequent patterns [8]. Existing algorithms can presently be separated into two groups: feature (column) enumeration and row enumeration. (Although column is a more suitable term here, we will use the term feature in this paper to avoid potential confusion in the technical discussion.) In feature enumeration algorithms like CHARM [9] and CLOSET+ [7], combinations of features are tested systematically to look for frequent closed patterns. Such an approach is suitable for datasets with a small number of features and a large number of rows, since the number of feature combinations to be tested will be small.
However, for bioinformatics data with a large number of features and a small number of rows, the performance of these algorithms deteriorates due to the large number of feature combinations. To get around this problem, the algorithm CARPENTER [3] was developed to perform row enumeration on bioinformatics datasets instead. CARPENTER is a row enumeration algorithm which looks for frequent closed patterns by testing various combinations of rows. Since bioinformatics datasets have a small number of rows and a large number of features, the number of row combinations will be much smaller than the number of feature combinations. As such, row enumeration algorithms like CARPENTER will be more efficient than feature enumeration algorithms on these kinds of datasets.
From the above, it is natural to make two observations.
First, different datasets have different characteristics and thus require different enumeration methods in order to make closed pattern mining efficient. Furthermore, since these algorithms typically focus on processing different subsets of the data during the mining, the characteristics of the data subset being handled will change from one subset to another. For example, a dataset that has many more rows than features may be partitioned into sub-datasets with more features than rows. Therefore, a single feature enumeration method or a single row enumeration method may become inefficient in some phases of the enumeration, even if it was the better choice at the start of the algorithm. As such, it makes sense to switch the enumeration method dynamically as different subsets of the data are being processed.
Second, both classes of algorithms will have problems handling datasets with a large number of features and a large number of rows. This can be seen by considering the basic philosophy of these algorithms: in both classes, the aim is to reduce the amount of data being considered by searching in the smaller enumeration space. For example, when performing feature enumeration, the number of rows being considered decreases as the number of features in a feature set grows, making it possible to partition the large number of rows into smaller subsets for efficient mining. However, for datasets with a large number of rows and a large number of features, adopting only one single enumeration method makes it difficult to reduce the data being considered in the other dimension.
Motivated by these observations, we derive in this paper a new algorithm called COBBLER. (COBBLER stands for Combining Row and Column Enumeration; the letter 'b' is counted twice.) COBBLER is designed to automatically switch between feature enumeration and row enumeration during the mining process, based on the characteristics of the data subset being considered. As our experiments will show, such an approach produces good results when handling different kinds of datasets. Experiments show that COBBLER outperforms other closed pattern mining algorithms like CHARM [9], CLOSET+ [7] and CARPENTER [3].
In the next section, we will introduce some preliminaries and give our problem definition. The COBBLER algorithm will be explained in Section 3. To show the advantage
of COBBLER’s dynamic enumeration, experiments will be
conducted on both real-life and synthetic datasets in Section
4. Section 5 introduces some of the related work for this
paper. We will conclude our discussion in Section 6.
2. Preliminary
We will give a problem description and define some notation for further discussion.
We denote our dataset as D. Let the set of binary features/columns be F = {f1, ..., fm} and let the set of rows be R = {r1, ..., rn}. We abuse our notation slightly by saying that a row ri contains a feature fj if fj has a value of 1 in ri; thus we can also write fj ∈ ri. For example, in Figure 1(a), the dataset has 5 features represented by the alphabet set {a, b, c, d, e} and there are 5 rows, r1, ..., r5, in the dataset. The first row r1 contains the feature set {a, c, d}, i.e. these binary features have a value of "1" in r1. To simplify notation, we will use row numbers to represent a set of rows hereafter; for example, "23" will be used to denote the row set {r2, r3}. A feature set like {b, e} will likewise be written as "be".
Here, we give two concepts called the feature support set and the row support set.

Definition 2.1 Feature Support Set, R(F')
Given a set of features F' ⊆ F, we use R(F') to denote the maximal set of rows that contain F'.

Definition 2.2 Row Support Set, F(R')
Given a set of rows R' ⊆ R, we use F(R') to denote the largest set of features that are common among the rows in R'.

Example 1 R(F') and F(R')
Consider the table in Figure 1(a). Let F' be the feature set {a, e}; then R(F') = {r2, r5}, since both r2 and r5 contain F' and no other rows in the table contain F'. Also let R' be the set of rows {r2, r3}; then F(R') = {b, e}, since both feature b and feature e occur in r2 and r3 and no other features occur in both r2 and r3.

Definition 2.3 Support, |R(F')|
Given a set of features F', the number of rows in the dataset that contain F' is called the support of F'. Using the earlier definition, we denote the support of F' as |R(F')|.

Definition 2.4 Closed Pattern
A set of features F' is called a closed pattern if there exists no F'' such that F' ⊂ F'' and |R(F'')| = |R(F')|.

Definition 2.5 Frequent Closed Pattern
A set of features F' is called a frequent closed pattern if (1) |R(F')|, the support of F', is higher than a minimum support threshold, and (2) F' is a closed pattern.

Figure 1. Running Example.
(a) Original example table, T:
  r1: a, c, d
  r2: a, b, d, e
  r3: b, e
  r4: b, c, d, e
  r5: a, b, c, e
(b) Transposed table, TT:
  a: 1, 2, 5
  b: 2, 3, 4, 5
  c: 1, 4, 5
  d: 1, 2, 4
  e: 2, 3, 4, 5

Let us illustrate these notions with another example.

Example 2 Given that minsup = 1, the feature set {b, e} is a frequent closed pattern in the table of Figure 1(a), since it occurs four times in the table. The feature set {a, e}, on the other hand, is not a frequent closed pattern: although it occurs two times in the table, which is more than the minsup threshold, it has a superset {a, b, e} with |R({a, b, e})| = |R({a, e})|.

We will now define our problem as follows.
Problem Definition: Given a dataset D which contains records that are subsets of a feature set F, our problem is to discover all frequent closed patterns with respect to a user-specified support threshold minsup.
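To make the definitions above concrete, here is a minimal Python sketch (an illustration, not code from the paper) that computes R(F'), F(R'), support and closedness on the running example of Figure 1.

```python
# Illustrative sketch of Definitions 2.1-2.5 on the running example (not the paper's code).
T = {1: {'a', 'c', 'd'},
     2: {'a', 'b', 'd', 'e'},
     3: {'b', 'e'},
     4: {'b', 'c', 'd', 'e'},
     5: {'a', 'b', 'c', 'e'}}

def R(Fp):
    """Feature support set R(F'): maximal set of rows containing every feature in F'."""
    return {r for r, feats in T.items() if set(Fp) <= feats}

def F(Rp):
    """Row support set F(R'): largest set of features common to every row in R'."""
    rows = [T[r] for r in Rp]
    return set.intersection(*rows) if rows else set()

def support(Fp):
    return len(R(Fp))

def is_closed(Fp):
    """F' is closed iff it equals its closure F(R(F')) (holds for patterns with nonempty support)."""
    return set(Fp) == F(R(Fp))

print(R({'a', 'e'}))          # {2, 5}
print(F({2, 3}))              # {'b', 'e'}
print(support({'b', 'e'}))    # 4
print(is_closed({'a', 'e'}))  # False: its closure is {'a', 'b', 'e'}
print(is_closed({'b', 'e'}))  # True
```

The closure F(R(F')) is the unique closed pattern with the same supporting rows as F', which is why {a, e} in Example 2 is absorbed into {a, b, e}.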
3. The COBBLER Algorithm
To illustrate our algorithm, we will use the tables in Figure 1 as a running example. Figure 1(a) is the original table T and Figure 1(b) is the transposed version of T, TT. In TT, the tuple ids are the features of T, while the entries of each tuple are the row ids of T. A row number i occurs in the tuple of feature fj in TT if and only if feature fj occurs in row ri in T. For example, since feature "c" occurs in r1, r4 and r5 in the original table, the row ids "1", "4" and "5" occur in tuple "c" of the transposed table. To avoid confusion, we will hereafter use tuples to refer to the rows of the transposed table and rows to refer to the rows of the original table.
3.1 Static Enumeration Tree
Algorithms for discovering closed patterns can be represented as a search in an enumeration tree, which can either be a feature enumeration tree or a row enumeration tree. Figure 2(a) shows a feature enumeration tree in which each possible combination of features is represented as a unique node in the tree. Node "ab" in the tree, for example, represents the feature combination {a, b}, while the bracket below it (i.e. {25}) indicates that rows r2 and r5 contain {a, b}. Algorithms like CHARM and CLOSET+ find closed patterns by performing a depth-first search (DFS) in the feature enumeration tree, starting from the root. By imposing an order <_f on the features, each possible combination of features is visited systematically following a lexicographical order. In Figure 2(a), the order of enumeration will be a, ab, abc, abcd, ..., ae, b, bc, ..., e (in the absence of any optimization and pruning strategies).

The concept of a row enumeration tree is similar to a feature enumeration tree except that in a row enumeration tree, each possible combination of rows (instead of features), R', is represented as a node in the tree. Figure 2(b) shows a row enumeration tree. Node "12" in the figure represents the row combination {r1, r2}, while the bracket "{ad}" below it denotes the fact that "ad" is found in both r1 and r2 (i.e. F({r1, r2}) = {a, d}). Again, by imposing an order <_r on the rows, a row enumeration algorithm like CARPENTER is able to visit each possible combination of rows in a DFS manner on the enumeration tree. The order of nodes visited in Figure 2(b) will be 1, 12, 123, 1234, 12345, ..., 45, 5 when no pruning strategies are adopted.

[Figure 2. Traditional row and feature enumeration trees: (a) the feature enumeration tree (nodes a {125}, ab {25}, abc {5}, ..., e {2345}) and (b) the row enumeration tree (nodes 1 {acd}, 12 {ad}, 123 {}, ..., 5 {abce}) for the running example; each node shows the enumerated set together with its support set.]

Regardless of row or feature enumeration, searches in the enumeration tree are simulated by the successive generation of conditional (original) tables and conditional transposed tables, defined as follows.

Definition 3.1 Conditional Table, T|_X
Let X be a subset of features. Given the original table T, the X-conditional original table, denoted T|_X, is the subset of rows from T such that:
1. Each row is a superset of X.
2. Let fl be the feature of X that comes last in the order <_f. Feature fl, together with every feature that does not come after fl in <_f, is removed from each row in T|_X, so that only features ranked after X remain.

Example 3 Let the original table in Figure 1(a) be T. When the node "b" in the enumeration tree of Figure 2(a) is visited, an X-conditional table T|_X (note: X = {b}) is created, as shown in Figure 3(a). From T|_X, we can infer that there are 4 rows which contain "b".

Definition 3.2 Conditional Transposed Table, TT|_X
Let X be a subset of rows (in the original table). Given the transposed table TT, the X-conditional transposed table, denoted TT|_X, is the subset of tuples from TT such that:
1. Each tuple is a superset of X in TT.
2. Let rl be the row of X that comes last in the order <_r. Row rl, together with every row that does not come after rl in <_r, is removed from each tuple in TT|_X.

Example 4 Let the transposed table in Figure 1(b) be TT. When the node "12" in the row enumeration tree of Figure 2(b) is visited, an X-conditional transposed table TT|_X (note: X = {r1, r2}) is created, as shown in Figure 3(b). The inference we make from TT|_X is slightly different from that of the previous example: here we can infer that {a, d} occurs in two rows of the dataset (i.e. r1 and r2).

Figure 3. Conditional Tables.
(a) {b}-conditional table, T|_{b}:
  r2: d, e
  r3: e
  r4: c, d, e
  r5: c, e
(b) {1,2}-conditional transposed table, TT|_{12}:
  a: 5
  d: 4

In both Examples 3 and 4, it is easy to see that the number of rows (tuples) in the conditional (transposed) table is reduced as the search moves down the enumeration tree. This enhances the efficiency of mining, since the number of rows (tuples) processed at deeper levels of the tree is also reduced. Furthermore, the conditional (transposed) table of a node can be easily obtained from that of its parent. Searching the enumeration tree is thus a successive generation of conditional tables, where the conditional table at each node is obtained by scanning the conditional table of its parent node.
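The following sketch (an illustration, not the paper's implementation) builds X-conditional tables and X-conditional transposed tables as in Definitions 3.1 and 3.2, reproducing Examples 3 and 4; it assumes the orderings <_f = a, b, c, d, e and <_r = 1, ..., 5 of the running example.

```python
# Illustrative construction of conditional tables (Definitions 3.1 and 3.2); not the paper's code.
FEATURE_ORDER = ['a', 'b', 'c', 'd', 'e']   # assumed <_f for the running example
T = {1: {'a', 'c', 'd'}, 2: {'a', 'b', 'd', 'e'}, 3: {'b', 'e'},
     4: {'b', 'c', 'd', 'e'}, 5: {'a', 'b', 'c', 'e'}}
# Transposed table TT: feature -> set of row ids that contain it
TT = {f: {r for r, feats in T.items() if f in feats} for f in FEATURE_ORDER}

def conditional_table(T, X):
    """T|_X: rows containing X, keeping only features ranked after X's last feature."""
    last = max(FEATURE_ORDER.index(f) for f in X)
    keep = set(FEATURE_ORDER[last + 1:])
    return {r: feats & keep for r, feats in T.items() if X <= feats}

def conditional_transposed_table(TT, X):
    """TT|_X: tuples containing every row in X, keeping only rows ranked after max(X)."""
    last = max(X)
    return {f: {r for r in rows if r > last} for f, rows in TT.items() if X <= rows}

print(conditional_table(T, {'b'}))               # Example 3: rows 2-5 with the remaining features
print(conditional_transposed_table(TT, {1, 2}))  # Example 4: {'a': {5}, 'd': {4}}
```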
3.2 Dynamic Enumeration Tree
As we can see, the basic characteristic of a row enumeration tree or a feature enumeration tree is that the tree is static. The current solution is to make a selection between these approaches based on the characteristics of T at the start of the algorithm. For datasets with many rows and few features, algorithms like CHARM [9] and CLOSET+ [7] that search the feature enumeration tree will be more efficient, since the number of possible feature combinations is small. However, when the number of features is much larger than the number of rows, a row enumeration algorithm like CARPENTER [3] was shown to be much more efficient.
There are two motivations for adopting a more dynamic approach.
First, the characteristics of the conditional tables can be different from those of the original table. Since the number of rows (or tuples) is reduced as we move down the enumeration tree, it is possible that a table T which initially has more rows than features could have this characteristic reversed in its conditional tables T|_X (i.e. more features than rows). As such, it makes sense to adopt a different enumeration approach as the data characteristics change.
Second, for datasets with a large number of rows and also a large number of features, a combination of row and feature enumeration could help to reduce both the number of rows and the number of features being considered in the conditional tables, thus enhancing the efficiency of mining.
Next, we illustrate with a simple example what we mean by dynamic switching of the enumeration method.

Example 5 Consider the table T in Figure 1(a). Assume that the order for features, <_f, is a, b, c, d, e and the order for rows, <_r, is r1, r2, r3, r4, r5. Suppose we first perform feature enumeration, generating the {b}-conditional table (shown earlier in Figure 3(a)) followed by the {b,c}-conditional table in Figure 4(a). To switch to row enumeration, T|_{bc} is first transposed to create TT(T|_{bc}) in Figure 4(b). Since only rows 4 and 5 appear in the tuples of TT(T|_{bc}), we next perform row enumeration on row 4, which gives TT(T|_{bc})|_{4} in Figure 4(c). From TT(T|_{bc})|_{4}, we see that features "d" and "e" both occur in row 4. Thus, we can conclude that only 1 row (i.e. row 4) contains the feature set {b,c} ∪ {d,e} = {b,c,d,e} ({b,c} is obtained from feature enumeration while {d,e} is obtained from row enumeration).

Figure 4. Conditional Tables.
(a) {b,c}-conditional table, T|_{bc}:   r4: d, e    r5: e
(b) transposed table TT(T|_{bc}):       d: 4        e: 4, 5
(c) TT(T|_{bc})|_{4}:                   e: 5

Figures 5(a) and 5(b) show examples of possible dynamic enumeration trees that could be generated from table T in our running example. In Figure 5(a), we highlight the path linking the nodes "b", "bc" and "4", as they correspond to the nodes visited in Example 5. Switching from row enumeration to feature enumeration is also possible, as shown in Figure 5(b).

[Figure 5. Dynamic enumeration trees: (a) switching from feature-wise to row-wise enumeration; (b) switching from row-wise to feature-wise enumeration. Each node again lists the enumerated set together with its support set.]

Like previous algorithms, COBBLER performs a depth-first search on the enumeration tree. To ensure a systematic search, enumeration is done based on <_r for row enumeration and on <_f for feature enumeration. To formalize the actual enumeration switching procedure, let us first divide all the nodes in our dynamic enumeration tree into two classes: row enumerated nodes and feature enumerated nodes. As the names imply, a row enumerated node is a node which represents a subset of rows R' being enumerated, while a feature enumerated node is a node which represents a subset of features F' being enumerated. For example, in Figure 5(a), the node "bc" is a feature enumerated node while its child node "4" is a row enumerated node.

Definition 3.3 Feature to Row Enumeration Switch
Let n be a feature enumerated node representing the feature subset F', and let R(F') be the rows containing F' in T. In addition, let fl be the lowest ranking feature in F' based on <_f. A switch from feature to row enumeration follows these steps:
1. Create the transposed table TT(T|_{F'}) (TT stands for transposed) such that we have a tuple for each feature fj having lower rank than fl, i.e. for each feature that can still extend F'; given a tuple in TT(T|_{F'}) representing a feature fj, the tuple contains all rows ri such that ri ∈ R(F') and fj ∈ ri.
2. Perform row enumeration on TT(T|_{F'}) following the order <_r.

Example 6 In Figure 5(a), while node "bc" enumerates a feature set, its descendants switch to enumerating row sets. The sub-tree of node "bc" creates a transposed table with one tuple each for the features d and e, since d and e are of lower rank than c in <_f. Since R({b,c}) = {r4, r5}, the tuples in this enumeration table only contain subsets of {4, 5}. We thus have the enumeration order 4, 45, 5 on the transposed table.

To define the procedure for switching from row to feature enumeration, we first introduce the concept of a direct feature enumerated ancestor.

Definition 3.4 Direct Feature Enumerated Ancestor, DFA(n)
Given a row enumerated node n, its nearest ancestor which enumerates a feature subset is called its direct feature enumerated ancestor, DFA(n). In addition, we use F_DFA(n) to denote the feature set represented by DFA(n). The root node of the enumeration tree is considered to enumerate both a row set and a feature set. For example, in Figure 5(b), the direct feature enumerated ancestor of the row enumerated node "25" is the root node, so F_DFA("25") = {}.

Definition 3.5 Row to Feature Enumeration Switch
Let n be a row enumerated node representing the row subset R', and let F(R') be the maximal set of features that is found in every row of R' in T. In addition, let F_DFA be the feature set represented by DFA(n), and let fl be the lowest ranking feature in F_DFA based on <_f. A switch from row to feature enumeration follows these steps:
1. Create a table T' such that for each row ri in R(F_DFA) there is a corresponding row ri' in T' with ri' = ri ∩ F(R').
2. Remove from T' all features that do not rank after fl (they have already been accounted for along the path to DFA(n)).
3. Perform feature enumeration on T' following the order <_f.

In essence, a row to feature enumeration switch creates a conditional table T' such that all feature combinations that are supersets of F_DFA but subsets of F(R') can be tested systematically based on feature enumeration.

Example 7 In Figure 5(b), while node "25" enumerates a row set, its descendants switch to enumerating feature sets. A table T' is generated for finding all frequent closed patterns that are subsets of {a,b,e} (i.e. F({r2, r5})) but supersets of {} (since the root is the DFA of node "25"). Since R({}) contains the rows r1, r2, r3, r4 and r5, we create 5 corresponding rows r1', ..., r5' such that ri' = ri ∩ {a,b,e}. Based on <_f, the enumeration order on T' will be a, ab, abe, ae, b, be, e.
Having specified the operations for switching the enumeration method, we next show that no frequent closed patterns are missed by our algorithm. Our main argument is that switching the enumeration method at a node n does not affect the set of closed patterns that are tested at the descendants of n. We first prove that this is true for switching from feature to row enumeration.
Lemma 3.1 Given a feature enumerated node n, let T_r be the enumeration subtree rooted at n after switching from feature to row enumeration. Let T_f be the imaginary subtree rooted at node n if there is no switch in enumeration method. Let FCP(T_r) be the set of frequent closed patterns found in the enumeration tree T_r, and FCP(T_f) be the set of frequent closed patterns found in the enumeration tree T_f. We claim that FCP(T_f) = FCP(T_r).

Proof: We first prove that FCP(T_f) ⊆ FCP(T_r) and then that FCP(T_r) ⊆ FCP(T_f).

Suppose node n represents the feature set F'. Assume that in T_f, a depth-first search produces a frequent closed pattern P. In this case P = F' ∪ Fa, with Fa being the additional feature set that is added onto F' when searching in subtree T_f. It can be deduced that R(P) ⊆ R(F'), because F' ⊆ P. Since P is a frequent closed pattern, Fa, being a subset of P, is also a frequent pattern within R(F'). Let R' ⊆ R(F') be the unique maximal set of rows that contain Fa. It is easy to see that R' will also be enumerated in T_r, since all combinations of rows in R(F') are enumerated in T_r. We can now see that both F' (since R' ⊆ R(F')) and Fa are contained in every row of R', which means that P will be enumerated in T_r. Since every closed pattern enumerated in T_f is thus also enumerated in T_r, FCP(T_f) ⊆ FCP(T_r).

On the other hand, assume that P is a frequent closed pattern found under T_r. Let R' be the row combination enumerated in subtree T_r that gives P (i.e. P = F(R')). Since T_r essentially enumerates all row combinations from R(F'), we know R' ⊆ R(F'), and thus F' is in every row of R'. By the definition of F(R'), we know F' ⊆ P, which means that all rows containing P are in R(F'). Since T_f enumerates all combinations of features which occur in R(F'), we know P will be enumerated in T_f. Since every closed pattern enumerated in T_r is also enumerated in T_f, FCP(T_r) ⊆ FCP(T_f).

We can now conclude that FCP(T_f) = FCP(T_r), since FCP(T_f) ⊆ FCP(T_r) and FCP(T_r) ⊆ FCP(T_f).
We next look at the procedure for switching from row to feature enumeration. The argument goes along the same lines as Lemma 3.1.

Lemma 3.2 Given a row enumerated node n, let T_f be the enumeration subtree rooted at n after switching from row to feature enumeration. Let T_r be the imaginary subtree rooted at node n if there is no switch in enumeration method. Let FCP(T_r) be the set of frequent closed patterns found in the enumeration tree T_r, and FCP(T_f) be the set of frequent closed patterns found in T_f. We claim that FCP(T_f) = FCP(T_r).

We omit the proof of Lemma 3.2 due to lack of space; its gist is similar to the proof of Lemma 3.1.

With Lemma 3.1 and Lemma 3.2, we are sure that the set of frequent closed patterns found by our dynamic enumeration tree is equal to the set found by a pure row enumeration or a pure feature enumeration tree. Therefore, by a depth-first search of the dynamic enumeration tree, we can be sure that all the frequent closed patterns in the database will be found. It is obvious, however, that a complete traversal of the dynamic enumeration tree is not efficient, and pruning methods must be introduced to prune off unnecessary searches. Before we explain these methods, we first introduce the framework of our algorithm in the next section.
3.3. Algorithm
Our formal algorithm is shown in Figure 6, and the details of the subroutines are given in Figure 7.
We use both the original table T and the transposed table TT in our algorithm, with infrequent features removed. The algorithm involves the recursive computation of conditional tables and conditional transposed tables to perform a depth-first traversal of the dynamic enumeration tree. Each conditional table represents a feature enumerated node, while each conditional transposed table represents a row enumerated node. For example, the {a,b}-conditional table represents the node "ab" in Figure 5(a), while the {2,5}-conditional transposed table represents the node "25" in Figure 5(b). After initializing FCP, the set of frequent closed patterns, to be empty, the algorithm checks a switching condition to decide whether to perform row enumeration or feature enumeration first.
Algorithm COBBLER
Input: original table T, transposed table TT, feature set F, row set R, and support level minsup.
Output: the complete set of frequent closed patterns, FCP.
Method:
1. Initialization: FCP := {}.
2. Check the switching condition, SwitchingCondition().
3. If row enumeration is chosen first, mine frequent closed patterns by row enumeration: RowMine(TT, R, FCP).
4. If feature enumeration is chosen first, mine frequent closed patterns by feature enumeration: FeatureMine(T, F, FCP).

Figure 6. The Main Algorithm
Depending on the switching condition, either subroutine RowMine or subroutine FeatureMine will be called.

The RowMine subroutine takes in three parameters: TT|_X, R' and FCP. TT|_X is an X-conditional transposed table, while R' contains the set of rows that will be considered for row enumeration according to <_r. FCP contains the frequent closed patterns which have been found so far. Steps 1 to 3 of the subroutine perform counting and pruning; we delay all discussion of pruning to Section 3.5. Step 4 outputs the frequent closed pattern. The switching condition is checked in Step 5 to decide whether a row enumeration or a feature enumeration will be executed next. Based on this condition, the subroutine either continues to Step 6 for row enumeration or goes to Step 7 for feature enumeration. Note that RowMine is essentially no different from the row enumeration algorithm CARPENTER [3], except for Step 7 where we switch to feature enumeration. Since CARPENTER is proven to be correct and Lemma 3.2 has shown that the switch to feature enumeration does not affect the result, we know that RowMine outputs the correct set of frequent closed patterns.

The FeatureMine subroutine takes in three parameters: T|_X, F' and FCP. T|_X is an X-conditional original table. F' contains the set of features that will be considered for feature enumeration according to <_f. FCP contains the frequent closed patterns which have been found so far. Steps 1 to 3 perform counting and pruning, which are explained in a later section. Step 4 outputs the frequent closed pattern, while Step 5 checks the switching condition to decide on the next enumeration method. Based on the switching condition, the subroutine either continues to Step 6 for feature enumeration or goes to Step 7 for row enumeration. We again note that FeatureMine is essentially no different from other feature enumeration algorithms like CHARM [9] and CLOSET+ [7], except for Step 7 where we switch to row enumeration. Since these algorithms are proven to be correct and Lemma 3.1 has shown that the switch to row enumeration does not affect the result, we know that FeatureMine outputs the correct set of frequent closed patterns.

We can observe that the recursive computation stops when, in RowMine, the set R' becomes empty, or when, in FeatureMine, the set F' becomes empty. We delay the discussion of the switching condition to the next section.
Subroutine: RowMine(TT|_X, R', FCP).
Parameters:
  TT|_X : an X-conditional transposed table;
  R' : the subset of rows which have not been considered in the enumeration;
  FCP : the set of frequent closed patterns that have been found so far.
Method:
1. Scan TT|_X and count the frequency of occurrence of each row ri ∈ R'.
2. Pruning 1: Let U ⊆ R' be the set of rows in R' which occur in at least one tuple of TT|_X. If |X ∪ U| < minsup, then return; else set R' := U.
3. Pruning 2: Let Y be the set of rows which are found in every tuple of the X-conditional transposed table; add Y to X and remove all rows of Y from R' and from TT|_X.
4. If F(X ∪ Y) ∉ FCP and |X ∪ Y| ≥ minsup, add F(X ∪ Y) into FCP.
5. Check the switching condition, SwitchingCondition().
6. If row enumeration is to continue: for each ri ∈ R', call RowMine(TT|_{X ∪ {ri}}, {rj ∈ R' : rj is ranked after ri in <_r}, FCP).
7. If switching to feature enumeration: build the table of Definition 3.5 from F(X ∪ Y) and perform feature enumeration on it by calling FeatureMine on the corresponding conditional tables.

Subroutine: FeatureMine(T|_X, F', FCP).
Parameters:
  T|_X : an X-conditional original table;
  F' : the subset of features which have not been considered in the enumeration;
  FCP : the set of frequent closed patterns that have been found so far.
Method:
1. Scan T|_X and count the frequency of occurrence of each feature fj ∈ F'.
2. Pruning 1: Let U ⊆ F' be the set of features in F' which occur in at least minsup rows of T|_X; set F' := U.
3. Pruning 2: Let Y be the set of features which are found in every row of the X-conditional original table; add Y to X and remove all features of Y from F' and from T|_X.
4. If X ∪ Y ∉ FCP and |R(X ∪ Y)| ≥ minsup, add X ∪ Y into FCP.
5. Check the switching condition, SwitchingCondition().
6. If feature enumeration is to continue: for each fj ∈ F', call FeatureMine(T|_{X ∪ {fj}}, {fk ∈ F' : fk is ranked after fj in <_f}, FCP).
7. If switching to row enumeration: transpose the X-conditional table T|_X into a transposed table TT(T|_X) and, for each remaining row ri, call RowMine(TT(T|_X)|_{ri}, {rows ranked after ri}, FCP).

Figure 7. The Subroutines
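Putting the pieces together, here is a compact, illustrative skeleton of the dual enumeration in the spirit of Figures 6 and 7. It is not the paper's optimized implementation: the pruning of Section 3.5 is reduced to a plain support check, duplicates are removed through the FCP set, and SwitchingCondition() is replaced by a simple shape-based heuristic (an assumption; the paper uses the cost estimate of Section 3.4).

```python
# Compact illustrative skeleton of COBBLER-style dual enumeration (not the paper's code):
# closures are used as closed patterns, the FCP set removes duplicates, and the switching
# condition is a simple stand-in heuristic.
T = {1: {'a', 'c', 'd'}, 2: {'a', 'b', 'd', 'e'}, 3: {'b', 'e'},
     4: {'b', 'c', 'd', 'e'}, 5: {'a', 'b', 'c', 'e'}}
MINSUP = 2
FEATURES = sorted(set().union(*T.values()))

def R(Fp):                                      # rows containing all features of F'
    return frozenset(r for r, fs in T.items() if Fp <= fs)

def F(Rp):                                      # features common to all rows of R'
    return frozenset.intersection(*(frozenset(T[r]) for r in Rp)) if Rp else frozenset(FEATURES)

def switch_to_rows(n_rows, n_feats):
    # Stand-in for SwitchingCondition(): prefer row enumeration when the conditional data
    # has fewer rows than remaining candidate features (an assumption, not the cost model).
    return n_rows < n_feats

def feature_mine(X, cand, FCP):
    rows = R(X)
    if len(rows) < MINSUP:                      # support pruning
        return
    closure = F(rows)                           # closed pattern at this node
    if closure:
        FCP.add(closure)
    remaining = [f for f in cand if f not in closure]
    if switch_to_rows(len(rows), len(remaining)):
        row_mine(frozenset(), sorted(rows), FCP)          # switch: enumerate rows of R(X)
    else:
        for i, f in enumerate(remaining):                 # stay with feature enumeration
            feature_mine(closure | {f}, remaining[i + 1:], FCP)

def row_mine(Xr, cand, FCP):
    for i, r in enumerate(cand):
        rows = Xr | {r}
        pattern = F(rows)                       # an intersection of rows, hence closed
        if pattern and len(R(pattern)) >= MINSUP:
            FCP.add(pattern)
        row_mine(rows, cand[i + 1:], FCP)

FCP = set()
feature_mine(frozenset(), FEATURES, FCP)
print(sorted(''.join(sorted(p)) for p in FCP))
```

On the running example with minsup = 2, this prints the ten frequent closed patterns a, abe, ac, ad, bce, bde, be, c, cd and d, matching the closed nodes of Figure 2.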
3.4 Switching Condition

The switching condition is used to decide whether to switch from row enumeration to feature enumeration or vice versa. To determine this, our main idea is to estimate the enumeration cost of the subtree rooted at a node and to select the enumeration method whose subtree has the smaller estimated cost.

The enumeration cost of a tree is estimated from two components: the size of the tree and the computation cost at each node of the tree. The size of a tree is judged based on the estimated number of nodes it contains, while the computation cost at a node is measured using the estimated number of rows (or features) that will be processed at the node. For example, if a feature enumeration tree T_f contains m nodes n_1, ..., n_m and node n_i will process row_i rows, the enumeration cost of T_f is S_r · Σ_{i=1..m} row_i, where S_r is the average processing time per row. To simplify the explanation, we focus on estimating the enumeration cost of a feature enumeration tree; the estimation for a row enumeration tree is similar.

Assume that a feature enumeration tree T_f is rooted at node n_X, which represents the feature set X with row support set R(X), and that n_X has sub-nodes corresponding to the candidate features f_1, ..., f_m. Let n_i denote the sub-node corresponding to the conditional table T|_{X ∪ {f_i}}. We use the following notation:
- freq(f_j, T|_X): the frequency of feature f_j in T|_X;
- n = |R(X)|: the number of rows the conditional table T|_X contains;
- d(f_i): the estimated maximum height of the subtree rooted at node n_i;
- E(f_i): the estimated cost of enumerating through the entire (deepest) path from node n_i downwards.

Given a sub-node n_i representing the feature set X ∪ {f_i}, we first use a simple probability deduction to calculate d(f_i) and, from it, the estimated number of relevant rows processed at each node on the path. Assume that the set of features which have not been considered is {f_i, f_{i+1}, ..., f_m}, sorted in descending order of freq(f_j, T|_X). Treating the features as if they occur independently, the probability that a row of T|_X contains f_j is approximately freq(f_j, T|_X) / n, so the expected number of rows containing the first k of these features is roughly
  n · Π_{j=1..k} ( freq(f_j, T|_X) / n ).
Let k be the largest value such that this expected number of rows is still at least minsup. We then take d(f_i) = k, and estimate E(f_i) as S_r times the sum, over the d(f_i) levels of this path, of the expected number of rows processed at each level. Intuitively, d(f_i) corresponds to the expected maximum number of levels of enumeration that take place before support pruning takes place, and E(f_i) is the estimated cost of processing the longest expected path below n_i.

Figure 8(a) shows the entire feature enumeration tree T_f, and Figure 8(b) shows a simplified enumeration tree T_f' of T_f, in which only the longest path in each sub-tree rooted at a node f_i is retained. The estimated enumeration cost of T_f' is Σ_i E(f_i). We use the estimated enumeration cost of T_f' as a criterion for the estimated enumeration cost of T_f. Therefore, the estimated enumeration cost of the feature enumeration tree is Σ_i E(f_i).

[Figure 8. Entire and simplified enumeration tree: (a) the entire feature enumeration tree T_f; (b) the simplified tree T_f', retaining only the deepest expected path under each child ({f1, f2, ..., fp} under f1, {f2, ..., fq} under f2, {f3, ..., fk} under f3, and so on).]

The estimated enumeration cost of a row enumeration tree is computed in a similar way. Having computed these two estimated values, we select the enumeration method that has the smaller estimated enumeration cost for the next level of enumeration.
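The following sketch (an illustration under the independence approximation described above; S_r and the exact combination rule are placeholders, not the paper's precise formula) estimates the cost of a feature enumeration subtree from feature frequencies.

```python
# Illustrative cost estimate for a feature enumeration subtree (Section 3.4).
# Assumes feature independence; s_r and the combination rule are simplifying assumptions.
def estimate_feature_subtree_cost(freqs, n_rows, minsup, s_r=1.0):
    """freqs: frequencies of the not-yet-considered features in T|_X; n_rows = |R(X)|.

    For each child f_i, walk down the longest expected path (remaining features sorted by
    descending frequency), multiplying in freq/n_rows at each level, and accumulate
    s_r * expected_rows per level until the expected support drops below minsup.
    """
    total = 0.0
    order = sorted(freqs, reverse=True)
    for i, _ in enumerate(order):
        expected_rows = float(n_rows)
        path_cost = 0.0
        for f in order[i:]:                      # deepest expected path under child i
            expected_rows *= f / n_rows
            if expected_rows < minsup:           # support pruning expected to stop the path here
                break
            path_cost += s_r * expected_rows
        total += path_cost
    return total

# Example: 4 candidate features with frequencies 4, 3, 2, 2 in a 5-row conditional table.
print(estimate_feature_subtree_cost([4, 3, 2, 2], n_rows=5, minsup=2))
```

The analogous estimate for a row enumeration subtree would use row frequencies in the conditional transposed table; whichever estimate is smaller decides the next enumeration method.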
3.5. Prune Method

Both subroutines RowMine and FeatureMine apply pruning strategies. We only give a brief discussion here, since they were developed in previous work and are not the emphasis of this paper.

The correctness of pruning strategies 1 and 2 used in subroutine RowMine has been proven in [3]. Here we only prove the correctness of the pruning strategy applied in subroutine FeatureMine.

In Step 3 of subroutine FeatureMine, all the features Y which occur in every row of the X-conditional original table T|_X are removed from T|_X and are considered to be already enumerated. We prove its correctness with the following lemma.

Lemma 3.3 Let T|_X be an X-conditional original table and Y be the set of features which occur in every row of T|_X. Given any feature set F' with X ⊆ F' that is enumerated within T|_X (so that R(F') ⊆ R(X)), we have R(F' ∪ Y) = R(F').

Proof: By definition, R(F') is the set of rows, all of which contain the feature set F'. Since R(F') ⊆ R(X), every row of R(F') is a row of the X-conditional table, and since the features in Y occur in every row of T|_X, they also occur in every row of R(F'). Thus the set of rows containing F' ∪ Y is exactly the set of rows containing F', and we conclude that R(F' ∪ Y) = R(F').

Example 8 To illustrate Lemma 3.3, consider the {b}-conditional table in Figure 3(a). Since feature "e" occurs in every row of T|_{b}, we can conclude that R({b,e}) = R({b}) = {2,3,4,5}. Thus, we need not create T|_{be} in our search, and feature "e" need not be considered for further enumeration down that branch of the enumeration tree.

Lemma 3.3 shows that all the frequent closed patterns found in the X-conditional table T|_X will contain the feature set Y, since for each feature set F' found in T|_X we can take its superset F' ∪ Y and R(F' ∪ Y) = R(F'). Thus it is correct to remove Y from all the rows of T|_X and to consider Y as already enumerated.
3.6. Implementation

To show the feasibility of implementation, we give some details of the implementation of COBBLER. The data structure used for enumeration in COBBLER is similar to that used in CARPENTER. The dataset is organized in a table, and memory pointers pointing to various positions in the table are organized in a conditional pointer list [3]. Since we enumerate both rows and features in COBBLER, we maintain two sets of conditional pointer lists, for the original table T and the transposed table TT respectively. The conditional pointer list for row enumeration is the same as the conditional pointer list used in CARPENTER, while the conditional pointer list for feature enumeration is created simply by replacing the feature ids with row ids and pointing them into the original table T. Figure 9 gives an example of a feature enumeration conditional pointer list and a row enumeration conditional pointer list. Most of the operations we use to maintain the conditional pointer lists are similar to CARPENTER; interested readers are referred to [3] for details.

[Figure 9. Conditional Pointer List: (a) the feature enumeration conditional pointer list at node "a", with a-conditional and b-conditional entries pointing into the rows of T; (b) the row enumeration conditional pointer list at node "1", with 1-conditional and 2-conditional entries pointing into the tuples of TT.]
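The exact pointer layout is described in [3]; purely as an illustration of the idea (the structure and names below are assumptions, not the paper's), a conditional pointer list can be pictured as a list of (row id, position) pointers into the stored table rather than a copied conditional table.

```python
# Rough illustration (assumed structure, not the paper's exact layout): a conditional
# pointer list keeps, for each conditioning item, pointers (row id, position) into the
# table instead of materializing the conditional table.
T_ROWS = {1: ['a', 'c', 'd'], 2: ['a', 'b', 'd', 'e'], 3: ['b', 'e'],
          4: ['b', 'c', 'd', 'e'], 5: ['a', 'b', 'c', 'e']}

def conditional_pointer_list(rows, item):
    """For each row containing `item`, store (row id, index just past `item`),
    so that the suffix rows[rid][pos:] plays the role of the item-conditional row."""
    pointers = []
    for rid, feats in rows.items():
        if item in feats:
            pointers.append((rid, feats.index(item) + 1))
    return pointers

ptrs = conditional_pointer_list(T_ROWS, 'b')
print(ptrs)                                             # [(2, 2), (3, 1), (4, 1), (5, 2)]
print({rid: T_ROWS[rid][pos:] for rid, pos in ptrs})    # the b-conditional rows, by reference
```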
4. Performance

In this section we compare the performance of COBBLER against other algorithms. All our experiments were performed on a PC with a Pentium IV 2.4GHz CPU, 1GB of RAM and a 30GB hard disk. Algorithms were coded in Standard C.

Algorithms: We compare COBBLER against two other closed pattern discovery algorithms, CHARM [9] and CLOSET+ [7]. CHARM and CLOSET+ are both feature enumeration algorithms. We also compared the performance of CARPENTER [3] and COBBLER, but since COBBLER's performance is always better than CARPENTER's, we do not present the results for CARPENTER here. To make a fair comparison, CHARM and CLOSET+ are also run in main memory after one disk scan is done to load the datasets.

Datasets: We choose 1 real-life dataset and 1 synthetic dataset to analyze the performance of COBBLER. The characteristics of the 2 datasets are shown in the table below.

  Dataset         | # items  | # rows | row length
  thrombin        | 139,351  | 1,316  | 29,745
  synthetic data  | 100,000  | 15,000 | 1,700
As we can see, the 2 datasets we used have different characteristics. The thrombin dataset consists of compounds
tested for their ability to bind to a target site on thrombin, a
key receptor in blood clotting. Each compound is described
by a single feature vector comprised of a class value (A for
active, I for inactive) and 139,351 binary features, which
describe three-dimensional properties of the molecule. The
synthetic dataset is generated by IBM data generator. It is a
dense dataset and contains long frequent patterns even with
relatively high support value.
Parameters: Three parameters are varied in our experiments: minimum support (minsup), row ratio (r) and length ratio (l). The parameter minimum support, minsup, is the minimum support threshold explained earlier. The parameters r and l are used to vary the size of the synthetic dataset for scalability tests. The row ratio r has a value above 0; it is used to generate new datasets with a different number of rows using the IBM data generator. All datasets with different row ratios were generated using the same set of parameters, except that each time the number of rows is scaled by r. The length ratio l has a value between 0 and 1; it is used to generate new datasets with a different average row size from the original synthetic dataset listed in the table above. A dataset with a length ratio of l retains, on average, a fraction l of the columns in the original dataset, and the columns to be retained are randomly selected for each row. The default value of r is 1, and l likewise has a fixed default value that is used when it is not being varied. Because the real-life data is very different from the synthetic dataset, we only use r and l for the synthetic dataset. (The thrombin dataset is available at http://www.biostat.wisc.edu/ page/Thrombin.testset.zip.)
4.1. Varying Minimum Support

In this set of experiments, we set r and l to their default values and vary the minimum support. Because of the different characteristics of the 2 datasets, we vary the minimum support over different ranges: the thrombin dataset is relatively sparse, so its minimum support is varied over a range of low values, while the synthetic dataset is relatively dense and the number of frequent items is quite sensitive to the minimum support, so its minimum support is varied over a smaller range of relatively high values.

Figures 10 and 11 show how COBBLER compares against CHARM and CLOSET+ as minsup is varied. We can observe that on the real-life dataset, CLOSET+ performs worst most of the time, while CHARM performs best when minsup is relatively high; when minsup is decreased to low values, COBBLER performs best. This is because when minsup is high, the structure of the dataset after removing all the infrequent items is relatively simple. Because the characteristics of the data subsets seldom change during the enumeration, COBBLER will only use one of the enumeration methods and becomes either a pure feature enumeration algorithm or a pure row enumeration algorithm. The advantage of COBBLER's dynamic enumeration cannot be seen, and therefore COBBLER is outperformed by CHARM, which is a highly optimized feature enumeration algorithm. As minsup decreases, the structure of the dataset after removing infrequent items becomes more complex. COBBLER begins to switch between the feature enumeration method and the row enumeration method according to the varying characteristics of the data subsets, and therefore COBBLER outperforms CHARM at low minsup on the real-life dataset.

On the synthetic dataset, COBBLER performs best most of the time, since the synthetic dataset is dense and complex enough. CHARM performs worst on this dataset, even at very high minsup. This is due to the fact that the synthetic dataset is very dense, which results in a very large feature enumeration space for CHARM.
4.2. Varying Length Ratio

In this set of experiments, we vary the size of the synthetic dataset by changing the length ratio l. We fix minsup and the row ratio r and vary l over a range of values; if l is set below this range, the generated dataset becomes too sparse for any interesting result. Figure 12 shows the performance comparison of COBBLER, CHARM and CLOSET+ on the synthetic dataset when we vary l. For CHARM and CLOSET+, it takes too much time to run on the datasets with the largest length ratio, so those results are not included in Figure 12. As we can see from the graph, COBBLER outperforms CHARM and CLOSET+ in most cases. CHARM is always the worst among the 3 algorithms, and both COBBLER and CLOSET+ are orders of magnitude better than it. CLOSET+ shows a steep increase in run time as the length ratio is increased: its performance is as good as COBBLER's when l is low, but it is soon outperformed by COBBLER when l is increased to higher values.

COBBLER's performance is not significantly better than CLOSET+'s at low l values, because a low value of l destroys many of the frequent patterns in the dataset, making the dataset sparse. This causes COBBLER to perform pure feature enumeration and lose the advantage of dynamic enumeration. As l increases, the dataset becomes more complex and COBBLER shows its advantage over CLOSET+ and also CHARM.
4.3. Varying Row Ratio

In this set of experiments, we vary the size of the synthetic dataset by varying the row ratio r. We fix minsup and the length ratio l at their default values and vary r over a range of values. Figure 13 shows the performance comparison of COBBLER, CHARM and CLOSET+ on the synthetic dataset when we vary r. As we can see, with the increase in the number of rows, the datasets become more complex and COBBLER's dynamic enumeration strategy shows its advantage over the other two algorithms. In all the cases, COBBLER outperforms CHARM and CLOSET+ by an order of magnitude and also has the smoothest increase in run time.

As can be seen, in all the experiments we conducted, COBBLER outperforms CLOSET+ in most cases and outperforms CHARM when the dataset becomes complicated, i.e. for increased r and l or decreased minsup. This result also demonstrates that COBBLER is efficient on datasets with different characteristics, as it uses combined row and feature enumeration and can switch between these two enumeration methods according to the characteristics of the dataset during the search process.
5. Related Work
Frequent pattern mining [1, 2, 6, 10] as a vital topic has
received a significant amount of attention during the past
decade. The number of frequent patterns in a large data set
can be very large and many of these frequent patterns may be
redundant. To reduce the frequent patterns to a compact size, mining frequent closed patterns has been proposed. The following are some recent advances in mining closed frequent patterns.
CLOSET [5] and CLOSET+ [7] are two algorithms which discover closed patterns by depth-first, feature enumeration. CLOSET uses a frequent pattern tree (FP-tree) structure for a compressed representation of the dataset. CLOSET+ is an updated version of CLOSET. In CLOSET+, a hybrid tree-projection method is implemented, and it builds conditional projected tables in two different ways according to the density of the dataset. As shown in our experiments, both CLOSET and CLOSET+ are unable to handle long datasets due to their pure feature enumeration strategy.
Figure 10. Varying minsup (thrombin).
Figure 11. Varying minsup (synthetic data).
CHARM [9] is a feature enumeration algorithm for mining frequent closed patterns. Like CLOSET+, CHARM performs depth-first, feature enumeration. But instead of using an FP-tree structure, CHARM uses a vertical format to store the dataset, in which a list of row ids is stored for each feature. These row id lists are then merged during the feature enumeration to generate new row id lists that represent the corresponding feature sets in the enumeration tree. In addition, a
technique called diffset is used to reduce the size of the row
id lists and the computational complexity for merging them.
Another algorithm for mining frequent closed patterns is CARPENTER [3]. CARPENTER is a pure row enumeration algorithm. It discovers frequent closed patterns by performing depth-first, row enumeration combined with efficient search pruning techniques. CARPENTER is especially designed to mine frequent closed patterns in datasets containing a large number of columns and a small number of rows.
6. Conclusion
In this paper, we proposed an algorithm called COBBLER which can dynamically switch between row and feature enumeration for frequent closed pattern discovery. COBBLER automatically selects an enumeration method according to the characteristics of the dataset, both before and during the enumeration. This dynamic strategy helps COBBLER to deal with different kinds of datasets, including large, dense datasets whose characteristics vary across different data subsets. Experiments show that our approach yields a good payoff, as COBBLER outperforms existing frequent closed pattern discovery algorithms like CLOSET+, CHARM and CARPENTER on several kinds of datasets. In the future, we will look at how COBBLER can be extended to handle datasets that cannot fit into main memory.
References
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases
(VLDB’94), pages 487–499, Sept. 1994.
[2] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. In Proc. AAAI’94
Workshop Knowledge Discovery in Databases (KDD’94).
[3] F. Pan, G. Cong, and A. K. H. Tung. Carpenter: Finding closed patterns in long biological datasets. In Proc.
Of ACM-SIGKDD Int’l Conference on Knowledge Discovery
and Data Mining, 2003.
[4] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-structure mining of frequent patterns in large databases. In Proc. IEEE 2001 Int. Conf. Data Mining (ICDM'01), November 2001.
[5] J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In Proc. 2000 ACM-SIGMOD Int. Workshop Data Mining and Knowledge Discovery (DMKD'00).
Figure 12. Varying length ratio l (synthetic data).
Figure 13. Varying row ratio r (synthetic data).
[6] P. Shenoy, J. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa,
and D. Shah. Turbo-charging vertical mining of large
databases. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’00), pages 22–23, Dallas, TX, May
2000.
[7] J. Wang, J. Han, and J. Pei. Closet+: Searching for the best
strategies for mining frequent closed itemsets. In Proc. 2003
ACM SIGKDD Int. Conf. on Knowledge Discovery and Data
Mining (KDD’03), Washington, D.C., Aug 2003.
[8] M. Zaki. Generating non-redundant association rules. In
Proc. 2000 Int. Conf. Knowledge Discovery and Data Mining (KDD’00), 2000.
[9] M. Zaki and C. Hsiao. Charm: An efficient algorithm for
closed association rule mining. In Proc. of SDM 2002, 2002.
[10] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In Proc. 1997
Int. Conf. Knowledge Discovery and Data Mining (KDD’97),
pages 283–286, Newport Beach, CA, Aug. 1997.
Sample-Wise Enumeration Methods
for Mining Microarray Datasets
Anthony K. H. Tung
Department of Computer Science
National University of Singapore
A Microarray Dataset
• 1,000 - 100,000 columns (genes) and 100-500 rows (samples), each sample labelled with a class (e.g. Cancer / ~Cancer).
[Schematic table: rows Sample1 ... SampleN with class labels Cancer / ~Cancer; columns Gene1, Gene2, Gene3, ...]
• Find closed patterns which occur frequently among genes.
• Find rules which associate certain combinations of the columns that affect the class of the rows, e.g. Gene1, Gene10, Gene1001 -> Cancer
Challenge I
• Large number of patterns/rules: the number of possible column combinations is extremely high.
• Solution: the concept of a closed pattern — patterns that are found in exactly the same set of rows are grouped together and represented by their upper bound.
• Example: on the table below, the following patterns are all found in rows 2, 3 and 4: the upper bound (closed pattern) aeh and its lower bounds ae, ah, e, eh, h. "a", however, is not part of the group (it also occurs in row 1).

  i | r_i              | Class
  1 | a,b,c,l,o,s      | C
  2 | a,d,e,h,p,l,r    | C
  3 | a,c,e,h,o,q,t    | C
  4 | a,e,f,h,p,r      | ~C
  5 | b,d,f,g,l,q,s,t  | ~C
Challenge II
• Most existing frequent pattern discovery algorithms perform searches in the column/item enumeration space, i.e. systematically testing various combinations of columns/items.
• For datasets with 1,000-100,000 columns, this search space is enormous.
• Instead, we adopt a novel row/sample enumeration algorithm for this purpose. CARPENTER (SIGKDD'03) is the FIRST algorithm which adopts this approach.
Column/Item Enumeration Lattice
• Each node in the lattice represents a combination of columns/items.
• An edge exists from node A to B if A is a subset of B and A differs from B by only 1 column/item.
• Search can be done breadth first.
[Lattice over the items a, b, c, e, from {} up to {a,b,c,e}, shown next to the example table.]
Column/Item Enumeration Lattice
• Each node in the lattice represents a combination of columns/items.
• An edge exists from node A to B if A is a subset of B and A differs from B by only 1 column/item.
• Search can be done depth first.
• Keep edges from parent to child only if the parent is a prefix of the child.
[The same lattice over a, b, c, e, restricted to prefix-based edges.]
General Framework for Column/Item Enumeration

  Association Mining           — Read-based: Apriori [AgSr94], DIC;  Write-based: Eclat, MaxClique [Zaki01], FPGrowth [HaPe00];  Point-based: Hmine
  Sequential Pattern Discovery — Read-based: GSP [AgSr96];  Write-based: SPADE [Zaki98, Zaki01], PrefixSpan [PHPC01]
  Iceberg Cube                 — Read-based: Apriori [AgSr94];  Write-based: BUC [BeRa99], H-Cubing [HPDW01]
A Multidimensional View
• Types of data or knowledge: associative pattern, sequential pattern, iceberg cube, others
• Pruning method: constraints, other interestingness measures
• Compression method: closed/max pattern
• Lattice traversal / main operations: read, write, point
Sample/Row Enumeration Algorithms
• To avoid searching the large column/item enumeration space, our mining algorithms search for patterns/rules in the sample/row enumeration space.
• Our algorithms do not fit into the column/item enumeration framework; they are not YAARMA (Yet Another Association Rules Mining Algorithm).
• Column/item enumeration algorithms simply do not scale for microarray datasets.
Existing Row/Sample Enumeration Algorithms
• CARPENTER(SIGKDD'03)
– Find closed patterns using row enumeration
• FARMER(SIGMOD’04)
– Find interesting rule groups and build classifiers based on them
• COBBLER(SSDBM'04)
– Combined row and column enumeration for tables with
large number of rows and columns
• FARMER's demo (VLDB'04)
• Balance the scale: 3 row enumeration algorithms
vs >50 column enumeration algorithms
Concepts of CARPENTER

Example table (i, r_i, Class):
  1: a,b,c,l,o,s      (C)
  2: a,d,e,h,p,l,r    (C)
  3: a,c,e,h,o,q,t    (C)
  4: a,e,f,h,p,r      (~C)
  5: b,d,f,g,l,q,s,t  (~C)

Transposed table, TT (item i_j -> R(i_j), C rows | ~C rows):
  a: 1,2,3 | 4      e: 2,3 | 4      l: 1,2 | 5      r: 2 | 4
  b: 1 | 5          f: | 4,5        o: 1,3 |        s: 1 | 5
  c: 1,3 |          g: | 5          p: 2 | 4        t: 3 | 5
  d: 2 | 5          h: 2,3 | 4      q: 3 | 5

TT|{2,3}, the {2,3}-conditional transposed table:
  a: 1,2,3 | 4
  e: 2,3 | 4
  h: 2,3 | 4
Row Enumeration
[Row enumeration tree for the example table: from the root {}, nodes 1 {abclos}, 2 {adehplr}, 3 {acehoqt}, 4 {aefhpr}, 5 {bdfglqst}, then 12 {al}, 13 {aco}, 14 {a}, 15 {bls}, 23 {aeh}, 24 {aehpr}, 25 {dl}, 34 {aeh}, 35 {q}, 45 {f}, then 123 {a}, 124 {a}, 125 {l}, 234 {aeh}, ..., up to 12345 {}. Each node is a row set labelled with the itemset common to those rows. The conditional transposed tables TT|{1}, TT|{12} and TT|{124} are shown alongside to illustrate how a node's table is derived from its parent's.]
Pruning Method 1
• Removing rows that appear in all tuples of the conditional transposed table will not affect the results.
• Example: in TT|{2,3} (a: 1,2,3 | 4; e: 2,3 | 4; h: 2,3 | 4), row 4 has 100% support in the conditional table of "2 3" (it appears in every tuple), so nodes "2 3" and "2 3 4" both yield {aeh}; the branch "2 3 4" is therefore pruned.
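As a small illustration of this check (a sketch, not CARPENTER's code), the rows that occur in every tuple of a conditional transposed table can be found by intersecting the tuples.

```python
# Illustrative check for pruning method 1 (not the original code): rows occurring in every
# tuple of the conditional transposed table can be absorbed into the current node, so the
# branches that add them individually need not be explored.
def rows_in_all_tuples(tt_cond):
    """tt_cond: item -> set of remaining row ids in the X-conditional transposed table."""
    tuples = list(tt_cond.values())
    return set.intersection(*tuples) if tuples else set()

# TT|{2,3} from the slide, keeping only the rows ranked after 3 in each tuple:
tt_23 = {'a': {4}, 'e': {4}, 'h': {4}}
print(rows_in_all_tuples(tt_23))   # {4}: row 4 is in every tuple, so branch {2,3,4} is pruned
```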
Pruning method 2
• If a rule (closed pattern) has been discovered before, we can prune the enumeration below the current node, because all rules below this node have been discovered before.
• Example: at node "3 4", if we find that {aeh} has already been found (at node "2 3"), we can prune off all branches below "3 4".
[TT|{3,4}: a: 1,2,3 | 4; e: 2,3 | 4; h: 2,3 | 4 — the same common itemset {aeh} as at node "2 3". The row enumeration tree is the same as on the "Row Enumeration" slide.]
Pruning Method 3: Minimum Support
• Example: from TT|{1}, we can see that the support of all possible patterns below node {1} will be at most 5 rows.
  TT|{1}:  a: 1,2,3 | 4    b: 1 | 5    c: 1,3 |    l: 1,2 | 5    o: 1,3 |    s: 1 | 5
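A sketch of this bound (illustrative, not CARPENTER's code): the support of any pattern below node X is at most |X| plus the number of distinct rows still present in TT|_X.

```python
# Illustrative support upper bound for minimum-support pruning at a row enumeration node.
def support_upper_bound(X, tt_cond):
    remaining = set().union(*tt_cond.values()) if tt_cond else set()
    return len(set(X)) + len(remaining - set(X))

# TT|{1} from the slide, keeping only the rows ranked after 1 in each tuple:
tt_1 = {'a': {2, 3, 4}, 'b': {5}, 'c': {3}, 'l': {2, 5}, 'o': {3}, 's': {5}}
print(support_upper_bound({1}, tt_1))   # 5: prune this branch if minsup > 5
```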
From CARPENTER to FARMER
• What if classes exist? What more can we do?
• Pruning with Interestingness Measure
– Minimum confidence
– Minimum chi-square
• Generate lower bounds for classification/
prediction
Interesting Rule Groups
• Concept of a rule group / equivalence class: rules supported by exactly the same set of rows are grouped together.
• Example: on the table below, the following rules are derived from rows 2, 3 and 4 with 66% confidence: the upper bound aeh --> C (66%) and its lower bounds ae --> C, ah --> C, e --> C, eh --> C, h --> C (all 66%). a --> C, however, is not in the group.

  i | r_i              | Class
  1 | a,b,c,l,o,s      | C
  2 | a,d,e,h,p,l,r    | C
  3 | a,c,e,h,o,q,t    | C
  4 | a,e,f,h,p,r      | ~C
  5 | b,d,f,g,l,q,s,t  | ~C
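A small sketch of the rule-group computation on this table (illustrative, not FARMER's code): two antecedents are in the same rule group exactly when they have the same supporting rows, and the confidence follows from the class labels of those rows.

```python
# Illustrative computation of a rule group's supporting rows and confidence.
ROWS = {1: ({'a','b','c','l','o','s'}, 'C'),
        2: ({'a','d','e','h','p','l','r'}, 'C'),
        3: ({'a','c','e','h','o','q','t'}, 'C'),
        4: ({'a','e','f','h','p','r'}, '~C'),
        5: ({'b','d','f','g','l','q','s','t'}, '~C')}

def rule_stats(antecedent, target='C'):
    """Rows supporting the antecedent, and the confidence of antecedent -> target."""
    support_rows = {r for r, (items, _) in ROWS.items() if antecedent <= items}
    hits = sum(1 for r in support_rows if ROWS[r][1] == target)
    return support_rows, (hits / len(support_rows) if support_rows else 0.0)

print(rule_stats({'a', 'e', 'h'}))   # ({2, 3, 4}, 0.666...): same group as e->C, eh->C, ...
print(rule_stats({'a'}))             # ({1, 2, 3, 4}, 0.75): different support set, not in the group
```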
Pruning by Interestingness Measure
• In addition, find only interesting rule groups
(IRGs) based on some measures:
– minconf: the rules in the rule group can predict the
class on the RHS with high confidence
– minchi: there is high correlation between LHS and
RHS of the rules based on chi-square test
• Other measures like lift, entropy gain, conviction etc. can be handled similarly
Ordering of Rows: All Class C before ~C
[The same row enumeration tree and conditional transposed tables (TT|{1}, TT|{12}, TT|{124}) as on the "Row Enumeration" slide, with the rows ordered so that all class-C rows are enumerated before the ~C rows.]
Pruning Method: Minimum Confidence
• Example: in TT|{2,3} on the right, the maximum confidence of all rules below node {2,3} is at most 4/5.

  TT|{2,3}:
  ij   C         ~C
  a    1,2,3,6   4,5
  e    2,3,7     4,9
  h    2,3       4
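One way to read the 4/5 bound, sketched in Python; this is my reconstruction of the example rather than FARMER's exact formula. The C-support of any rule below node X is at most the largest C-support of a single feature in TT|X, while the rule's support always keeps the ~C rows that are common to every feature of TT|X.

    # Sketch of the maximum-confidence bound for a row enumeration node.
    def max_confidence_bound(tt_x):
        """tt_x maps each feature of TT|X to (set of class-C row ids, set of ~C row ids)."""
        best_c = max(len(c) for c, _ in tt_x.values())
        forced_not_c = set.intersection(*(nc for _, nc in tt_x.values()))
        return best_c / (best_c + len(forced_not_c))

    # TT|{2,3} from the slide.
    tt_23 = {"a": ({1, 2, 3, 6}, {4, 5}),
             "e": ({2, 3, 7}, {4, 9}),
             "h": ({2, 3}, {4})}
    print(max_confidence_bound(tt_23))   # 0.8, i.e. at most 4/5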
Pruning Method: Minimum Chi-square
• The bound is obtained in the same way as the maximum confidence: in the 2x2 contingency table of a rule A --> C, the marginal totals are constant, the (A, C) cell is varied between its maximum and minimum possible values (max = 5, min = 1 in the example), and the remaining cells are computed from them.

  TT|{2,3}:
  ij   C         ~C
  a    1,2,3,6   4,5
  e    2,3,7     4,9
  h    2,3       4
Finding Lower Bound, MineLB
• Example: an upper bound rule with antecedent A = abcde, and two rows outside the rule group, r1: abcf and r2: cdeg.
  – Initialize the lower bounds to the single items {a, b, c, d, e}.
  – Add "abcf": the new lower bounds are {d, e}; a, b and c are contained in r1 and are no longer lower bounds.
  – Add "cdeg": the candidate lower bounds are ad, ae, bd, be, cd and ce; ad, ae, bd and be are kept since no remaining lower bound overrides them, while cd and ce are removed (they are contained in r2).
[Figure: the lattice of subsets of abcde with the surviving lower bounds ad, ae, bd and be marked.]
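A simplified Python reconstruction of the MineLB update on this example; the candidate-generation details are my own and may differ from the paper's exact procedure. A lower bound is a minimal subset of the upper bound A that is not contained in any of the rows processed so far.

    # Sketch: update the set of lower bounds when a new row is processed.
    def add_row(lower_bounds, row, upper_bound):
        row = set(row) & set(upper_bound)
        survivors = [lb for lb in lower_bounds if not lb <= row]
        broken = [lb for lb in lower_bounds if lb <= row]
        candidates = [lb | {x} for lb in broken for x in set(upper_bound) - row]
        new_bounds = list(survivors)
        for cand in candidates:
            # keep a candidate only if no smaller bound already covers it
            if not any(other <= cand for other in new_bounds):
                new_bounds.append(cand)
        return new_bounds

    A = set("abcde")
    bounds = [{x} for x in A]                 # start from the single items
    bounds = add_row(bounds, set("abcf"), A)  # -> {d}, {e}
    bounds = add_row(bounds, set("cdeg"), A)  # -> {a,d}, {b,d}, {a,e}, {b,e}
    print(bounds)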
Implementation
• In general, CARPENTER/FARMER can be implemented in many ways:
  – FP-tree
  – vertical format
• In our case, we assume the dataset fits into main memory and use a pointer-based algorithm similar to BUC.
Experimental studies
• Efficiency of FARMER
  – on five real-life datasets: lung cancer (LC), breast cancer (BC), prostate cancer (PC), ALL-AML leukemia (ALL), colon tumor (CT)
  – varying minsup, minconf and minchi
  – benchmarked against CHARM [ZaHs02] (ICDM'02) and Bayardo's algorithm (ColumnE) [BaAg99] (SIGKDD'99)
• Usefulness of IRGs
  – classification
Example results -- Prostate
[Chart: runtime (log scale, 1 to 100,000) of FARMER, ColumnE and CHARM on the prostate cancer dataset as minimum support varies from 3 to 9.]
Example results -- Prostate
[Chart: runtime of FARMER with minsup=1, with and without minchi=10, on the prostate cancer dataset as minimum confidence varies from 0% to 99%.]
Naive Classification Approach
• Generate the upper bounds of the IRGs.
• Rank the upper bounds, thus ranking the IRGs.
• Apply coverage pruning on the IRGs.
• Predict the test data based on the IRGs that it covers (a small sketch of this step follows).
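A hedged Python sketch of the prediction step only; the ranking criteria and coverage pruning are assumed to have already produced the ordered rule list, and the details here are my simplification rather than the exact procedure used in the experiments.

    # Sketch: predict a test row with the highest-ranked IRG upper bound that it covers.
    def classify(test_row, ranked_rules, default="~C"):
        """ranked_rules: (antecedent, predicted_class, confidence) tuples, best rank first."""
        for antecedent, label, _conf in ranked_rules:
            if antecedent <= test_row:          # the test row contains the rule's antecedent
                return label
        return default

    rules = [(set("aeh"), "C", 0.66), (set("f"), "~C", 1.0)]
    print(classify(set("aehpr"), rules))        # 'C'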
Classification results
Summary of Experiments
• FARMER is much more efficient than existing algorithms.
• There is evidence that IRGs are useful for the classification of microarray datasets.
COBBLER: Combining Column and Row Enumeration
• Extend CARPENTER to handle datasets with a large number of both columns and rows.
• Switch dynamically between column and row enumeration based on the estimated cost of processing.
Single Enumeration Tree

  Toy table:
  r1   a b c
  r2   a c d
  r3   b c
  r4   d

[Figure: two single enumeration trees over the toy table.
Feature enumeration: a {r1,r2}, ab {r1}, abc {r1}, abcd {}, abd {}, ac {r1,r2}, acd {r2}, ad {r2}, b {r1,r3}, bc {r1,r3}, bcd {}, bd {}, c {r1,r2,r3}, cd {r2}, d {r2,r4}.
Row enumeration: r1 {abc}, r1r2 {ac}, r1r2r3 {c}, r1r2r3r4 {}, r1r2r4 {}, r1r3 {bc}, r1r3r4 {}, r1r4 {}, r2 {acd}, r2r3 {c}, r2r3r4 {}, r2r4 {d}, r3 {bc}, r3r4 {}, r4 {d}.]
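For comparison with the row enumeration sketch shown earlier, a minimal Python sketch of the complementary feature (column) enumeration on this toy table; names are illustrative.

    # Sketch: feature enumeration on the toy table r1:abc, r2:acd, r3:bc, r4:d.
    TOY = {"r1": set("abc"), "r2": set("acd"), "r3": set("bc"), "r4": set("d")}
    FEATURES = sorted(set().union(*TOY.values()))        # ['a', 'b', 'c', 'd']

    def enumerate_features(itemset, support, remaining, out):
        """Depth-first enumeration of itemsets; support = rows containing `itemset`."""
        if itemset:
            out.append((tuple(itemset), frozenset(support)))
        for i, f in enumerate(remaining):
            new_support = {r for r in support if f in TOY[r]}
            if not new_support:                          # itemset occurs in no row: prune
                continue
            enumerate_features(itemset + [f], new_support, remaining[i + 1:], out)

    nodes = []
    enumerate_features([], set(TOY), FEATURES, nodes)
    # nodes contains, e.g., (('a', 'c'), frozenset({'r1', 'r2'})) and (('b', 'c'), frozenset({'r1', 'r3'}))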
Dynamic Enumeration Tree: Feature Enumeration to Row Enumeration
[Figure: enumeration starts as feature enumeration (a {r1,r2}, b {r1,r3}, c {r1,r2,r3}, d {r2,r4}, ab {r1}, ac {r1,r2}, ad {r2}, ...). Under a node such as a {r1,r2} it switches to row enumeration over the conditional rows (r1 {bc}, r2 {cd}, r1r2 {c}), producing closed patterns such as abc: {r1}, ac: {r1,r2} and acd: {r2}.]
Dynamic Enumeration Tree: Row Enumeration to Feature Enumeration
[Figure: enumeration starts as row enumeration (r1 {abc}, r2 {acd}, r3 {bc}, r4 {d}, r1r2 {ac}, r1r3 {bc}, ...). Under a node such as r1 {abc} it switches to feature enumeration over the conditional features (a {r2}, b {r3}, c {r2,r3}, ab {}, ac {r2}, bc {r3}, ...), producing closed patterns such as ac: {r1,r2}, bc: {r1,r3} and c: {r1,r2,r3}.]
Switching Condition
• The naïve idea of switching based on row number and feature number alone does not work well.
• Instead, we estimate the required computation of an enumeration sub-tree, i.e., a row enumeration sub-tree or a feature enumeration sub-tree:
  – estimate the maximal level of enumeration for each child sub-tree.
• Example of estimating the maximal level of enumeration (see the sketch below):
  – Suppose r = 10, S(f1) = 0.8, S(f2) = 0.5, S(f3) = 0.5, S(f4) = 0.3 and minsup = 2.
  – S(f1) * S(f2) * S(f3) * r = 2 ≥ minsup
  – S(f1) * S(f2) * S(f3) * S(f4) * r = 0.6 < minsup
  – Then the estimated deepest node under f1 is f1 f2 f3.
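A small Python sketch of this estimate (the helper name is illustrative, and features are assumed to be considered in the given order): multiply the selectivities S(f) into the row count r until the expected support drops below minsup.

    # Sketch: estimate the deepest enumeration level under a feature sub-tree.
    def estimated_deepest_level(r, selectivities, minsup):
        expected, level = float(r), 0
        for s in selectivities:
            if expected * s < minsup:
                break
            expected *= s
            level += 1
        return level, expected

    # Numbers from the slide: r=10, S(f1)=0.8, S(f2)=0.5, S(f3)=0.5, S(f4)=0.3, minsup=2.
    print(estimated_deepest_level(10, [0.8, 0.5, 0.5, 0.3], 2))
    # -> (3, 2.0): the estimated deepest node under f1 is f1 f2 f3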
Switching Condition
• Estimate the cost for a single node.
• Estimate the cost for a path.
• Sum up the estimates of all paths as the final estimate for the sub-tree.
Length Ratio and Row Ratio
[Charts: runtime (sec.) of COBBLER, CLOSET+ and CHARM on synthetic data, as the length ratio varies from 0.75 to 1.05 and as the row ratio varies from 0.5 to 2.]
Extension of our work by other groups (with or without citation)
• [1] Using transposition for pattern discovery from microarray data. Francois Rioult (GREYC CNRS), Jean-Francois Boulicaut (INSA Lyon), Bruno Cremilleux (GREYC CNRS), Jeremy Besson (INSA Lyon).
• Sees the presence and absence of genes in the samples as a binary matrix and performs a transposition of the matrix, which is essentially our transposed table. The enumeration method is otherwise the same.
Extension of our work by other groups (with or without citation) II
• [2] Mining Coherent Gene Clusters from Gene-Sample-Time Microarray Data. D. Jiang, J. Pei, M. Ramanathan, C. Tang and A. Zhang. (Industrial full paper, runner-up for the best application paper award.) SIGKDD'2004.
[Figure: a gene-sample matrix with samples Sample1, ..., SampleN as rows and genes Gene1, ..., Gene4 as columns.]
Extension of our work by other groups (with or without citation) III
[Figure: a gene-sample table (samples S1, ..., SN; genes Gene1, ..., Gene4) holding expression values such as 1.23, 1.34 and 1.52.]
• In [2], a gene in two samples is said to be coherent if its time series in the two samples satisfy a certain matching condition.
• In CARPENTER, a gene in two samples is said to be matching if its expression values in the two samples are almost the same.
Extension of our work by other groups (with or without citation) IV
• [2] tries to find a subset of samples S such that a subset of genes G is coherent for each pair of samples in S, with |S| > mins and |G| > ming.
• In CARPENTER, we try to find a subset of samples S in which a subset of genes G is similar in expression level for each pair of samples in S, with |S| > mins and |G| > 0.
[Figure: the same gene-sample table as on the previous slide.]
Extension of our work by other groups (with or without citation) V
• [2] performs sample-wise enumeration and removes genes that are not pairwise coherent across the samples enumerated.
• CARPENTER performs sample-wise enumeration and removes genes that do not have the same expression level across the samples enumerated.
[Figure: the row (sample) enumeration tree from the earlier slides, e.g. 1 {abclos}, 12 {al}, 123 {a}, 23 {aeh}.]
Extension of our work by other groups (with or without citation) VI
• From [2]: Pruning Rule 3.1 (pruning small sample sets). At a node v = {s_i1, ..., s_ik}, the sub-tree of v can be pruned if (k + |Tail|) < mins.
• Pruning Method 3 in CARPENTER: from TT|{1}, we can see that the support of any pattern below node {1} will be at most 5 rows.

  TT|{1}:
  ij   C       ~C
  a    1,2,3   4
  b    1       5
  c    1,3     -
  l    1,2     5
  o    1,3     -
  s    1       5
Extension of our work by other groups (with or without citation) VII
• [2] Pruning Rule 3.2 (pruning subsumed sets). At a node v = {s_i1, ..., s_ik}, if {s_i1, ..., s_ik} ∪ Tail is a subset of some maximal coherent sample set, then the sub-tree of the node can be pruned.
• CARPENTER Pruning Method 2: if a rule has been discovered before, we can prune the enumeration below this node.

  TT|{3,4}:
  ij   C       ~C
  a    1,2,3   4
  e    2,3     4
  h    2,3     4

[Figure: the row enumeration tree, with the sub-tree under node {3,4} pruned because {aeh} was already found at node {2,3}.]
Extension of our work (Conclusion)
• The row/sample enumeration framework has been successfully adopted by other groups in mining microarray datasets.
• We are proud of our contribution as the group that produced the first row/sample enumeration algorithm, CARPENTER, and are happy that other groups also find the method useful.
• However, citations from these groups would have been nice. After all, academic integrity is the most important thing for a researcher.
Future Work: Generalize Framework for Row Enumeration Algorithms?
[Diagram: the dimensions along which a generalized row enumeration framework could vary, including the type of data or knowledge (associative patterns, sequential patterns, iceberg cubes, closed/max patterns, others), other interestingness measures, constraints, pruning methods, compression methods, and the main operations of the lattice traversal (read, write, point).]
• Only if real-life applications require it.
Conclusions
• Many datasets in bioinformatics have very different characteristics compared to those that have been previously studied.
• These characteristics can either work against you or for you.
• In the case of microarray datasets with a large number of columns but a small number of rows/samples, we turn what is against us to our advantage:
  – row/sample enumeration
  – pruning strategies
• We show how our methods have been modified by other groups to produce useful algorithms for mining microarray datasets.
Thank you!!!
[email protected]
www.comp.nus.edu.sg/~atung/sfu_talk.pdf