Extracting Redundancy-Aware Top-K Patterns Dong Xin Hong Cheng Xifeng Yan ∗ Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign {dongxin, hcheng3, xyan, hanj}@uiuc.edu ABSTRACT Observed in many applications, there is a potential need of extracting a small set of frequent patterns having not only high significance but also low redundancy. The significance is usually defined by the context of applications. Previous studies have been concentrating on how to compute top-k significant patterns or how to remove redundancy among patterns separately. There is limited work on finding those top-k patterns which demonstrate high-significance and low-redundancy simultaneously. In this paper, we study the problem of extracting redundancy-aware top-k patterns from a large collection of frequent patterns. We first examine the evaluation functions for measuring the combined significance of a pattern set and propose the MMS (Maximal Marginal Significance) as the problem formulation. The problem is known as NP-hard. We further present a greedy algorithm which approximates the optimal solution with performance bound O(log k) (with conditions on redundancy), where k is the number of reported patterns. The direct usage of redundancy-aware top-k patterns is illustrated through two real applications: disk block prefetch and document theme extraction. Our method can also be applied to processing redundancy-aware topk queries in traditional database. Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications - Data Mining General Terms: Algorithms Keywords: Pattern Extraction, Significance, Redundancy 1. INTRODUCTION Frequent patterns are widely used in sophisticated ∗ The work was supported in part by the U.S. National Science Foundation NSF IIS-03-08215/05-13678 and NSF BDI-05-15813. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD’06, August 20–23, 2006, Philadelphia, Pennsylvania, USA. Copyright 2006 ACM 1-59593-339-5/06/0008 ...$5.00. data mining and database applications, including association rule mining, classification, clustering, and indexing. Recent progress on frequent-pattern mining has seen two trends: (1) measuring significance of various kinds of patterns, such as tf-idf scores [23] for text topics and position-weighted matrix score [17] for biological motifs; and (2) eliminating redundancy among discovered patterns, e.g., lossless compression using closed [18] or non-derivable [4] patterns, and lossy summarization using ordered patterns [16], cover-set [1], clustering [25], or pattern profiles [26]. These studies often emphasize significance and redundancy separately, while many applications need to consider these two measures together. One interesting example is correlation-directed disk block prefetch. A disk access sequence is a sequence of blocks, e.g., b35 , b100 , b9039 , ..., where bi represents the ith block on the disk. Suppose an access to b35 is repeatedly followed by an access to b9039 , it may improve the I/O performance if these two blocks are arranged adjacent to each other and fetched together when block b35 is accessed. Li et al. [14] show that correlation-directed prefetch can improve the average I/O response time by up to 25%. The system uses association rules as a decision system: Whenever the left-hand side of a rule is satisfied, the blocks on the right-hand side are prefetched. However, there are considerable redundancy existing in association rules, for example, one can generate more than 200k rules for one I/O trace collected at the HP Lab [20]. Due to the resource limitation, a system may only want to pick a subset of important yet divergent rules. The significance of each rule can be measured by its additional value to the existing rules. The second example is document theme extraction [3, 15], where each document (or each sentence) is treated as a transaction. The goal is to extract the frequent patterns of term occurrence, called themes, buried in a large set of documents. Given a document set, the topk frequent patterns returned by a mining algorithm are not necessarily the best k themes one can find. Many frequent term sets could overlap significantly with each other. Such overlapping may render top-k important themes very redundant. As shown in the above two applications, a useful compact pattern set should simultaneously demonstrate high significance and low-redundancy. We call this kind of patterns redundancy-aware top-k patterns. Previous studies on pattern compression (summarization) [1, 16, 25, 26] are able to approximate a collection of frequent patterns using a small pattern set, which aims to minimize the frequency restoration error for those patterns that are not selected. A close work to this paper is the pattern ordering problem studied in [16], where the authors rank patterns such that the topk patterns are able to best summarize the whole set of frequent patterns. The major difference between our problem and all of the previous works is that we emphasize both significance and redundancy on the selected top-k patterns, and the pattern significance is defined by the context of the applications; while summarizing the whole collection of the patterns is not our goal. The previous works only consider pattern relevance rather than significance, thus may not provide a solution to redundancy-aware top-k pattern extraction. Previous works on top-k frequent pattern mining [10] assume patterns are independent, which unfortunately is not the case. Figure 1(a) shows a set of frequent patterns where each circle represents one pattern whose significance is colored in gray scale, and the distance between two circles reflects their relevance. The intuition of redundancy-aware top-k patterns is illustrated in Figure 1(b) as opposed to the traditional top-k patterns in Figure 1 (c) and the k summarized patterns in Figure 1(d). Redundancy-aware top-k patterns make a tradeoff between significance and redundancy. The three patterns pointed by arrow in Figure 1(b) have high significance and low redundancy. On the other hand, the traditional top-k approach picks patterns based on significance solely and a pattern summarization approach picks patterns based on relevance solely. significance significance + relevance (a) a set of patterns significance (c) traditional top-k (b) redundancy-aware top-k relevance (d) summarization Figure 1: Redundancy-aware Top-k, Traditional Top-k, and Summarization In this paper, we formulate the redundancy-aware top-k pattern extraction problem through a general ranking model which integrates two measures, significance and redundancy, into one objective function. We first examine the evaluation functions for measuring the combined significance of a pattern set and propose the MMS (Maximal Marginal Significance) as the problem formulation. The problem is known as NP-hard. We further present a greedy algorithm which approximates the optimal solution with performance bound O(log k) , where k is the number of reported patterns. Although our work focuses on pattern extraction, the methodology developed in this paper can also be applied to many top-k query applications [2] to help users explore query results more effectively. More specifically, since similar results are often ranked closely, the top-k query results may not provide enough diversified information to users. Our method can be used to get the redundancy-aware top-k ranking. The rest of the paper is organized as follows. Section 2 introduces the concept of redundancy-aware top-k pattern extraction and its problem formulation. A comparison of the alternative objective functions is made in Section 3. We propose an improved algorithm for the MMS problem in Section 4. Section 5 presents two case studies of document theme extraction and correlationdirected prefecth. The related work is presented in Section 6 and we conclude our study in Section 7. 2. PROBLEM FORMULATION In this section, we first discuss measures for pattern significance and pattern redundancy, and then propose the formal problem formulation. 2.1 Significance and Redundancy Here we define significance and redundancy in the context of this paper. Definition 1. (Pattern Significance) A significance measure S is a function mapping a pattern p ∈ P to a real value such that S(p) is the degree of interestingness (or usefulness) of the pattern p. There are several previous studies on the significance (or interestingness) measure of patterns, which include [11] on rule interestingness, and [22, 24, 12] on interesting measure of frequent item-set or association patterns. According to [22], the significance measure can be divided into objective measures and subjective measures. Commonly used objective measures include support, confidence, lift, coherence, and tf-idf for text patterns and attribute values for database tuples. Subjective measure is usually a relative score compared with some prior knowledge or background model. It measures the unexpectedness of a pattern by computing its divergence from the background model. [11, 12] are examples that use subjective measures. We further extend the expression S to combined significance and relative significance. Let S(p, q) be the combined significance of patterns p and q, and S(p|q) = S(p, q) − S(q) be the relative significance of p given q. Note that the combined significance S(p, q) means the collective significance of two individual patterns p and q, not the significance of a single super pattern p ∪ q. Given significance measures, we can define the redundancy between two patterns. Definition 2. (Pattern Redundancy) Given the significance measure S, the redundancy R between two patterns p and q is defined as R(p, q) = S(p)+S(q)−S(p, q). Subsequently, we have S(p|q) = S(p) − R(p, q). In this paper, we make the assumption that the combined significance of two patterns is no less than the significance of any individual pattern (since it is a collective significance of two patterns) and does not exceed the sum of two individual significance (since there exists redundancy). This simply says that the redundancy between two patterns should satisfy 0 ≤ R(p, q) ≤ min(S(p), S(q)). (1) The ideal redundancy measure R(p, q) is usually hard to obtain. In this paper, we approximate redundancy using distance between patterns. Definition 3. (Pattern Distance) A distance measure D : P × P → [0, 1] is a function mapping two patterns p, q ∈ P to a value in [0, 1], where 0 means p, q are completely relevant and 1 means p, q are totally independent. The distance can be calculated based on the pattern structure, e.g., the edit distance between two DNA sequences; or based on the underlying data used in the discovery process, e.g., the Jaccard distance used in [13]; or based on the distribution of the patterns, e.g., KullbackLeibler Divergence. If a distance is a metric measure, i.e., it has properties of isolation, symmetry, and triangle inequality, it will bring many desirable properties. In the above example, both string edit distance and the Jaccard distance are metrics. More generally, the distance D(p, q) can be weighted to reflect users’ preference on penalizing redundancy. Since distance is the complementary of redundancy, we use the following equation to approximate R: R(p, q) = (1 − D(p, q)) × min(S(p), S(q)). (2) The above function indicates that the value of R(p, q) is bounded by [0, min(S(p), S(q))] (see Eqn. (1)). 2.2 Evaluating k Patterns We extend our formulation to a set of k patterns. Let G be an evaluation function measuring the significance of a set of k patterns P k = {p1 , p2 , . . . , pk }. If we assume patterns in P k are all independent, we have: Gind (P k ) = k X S(pi ), i=1 where S is the significance measure. In general, there are redundancies between patterns. Let L be a function returning redundancies among P k : Ggen (P k ) = k X S(pi ) − L(P k ). i=1 In many cases, L is very hard to formulate. We propose two heuristic evaluation functions Gas (average significance) and Gms (marginal significance), which sacrifice some generality but are more practical for computation and search. We first define our computation model based on a new concept, redundancy graph. Definition 4. (Redundancy Graph) Given a significance measure S and redundancy measure R, a redundancy graph of a set of patterns P is a weighted graph G(P) where each node i corresponds to a pattern pi . The weight on node i is pattern significance S(pi ) and the weight on an edge (i, j) is the redundancy R(pi , pj ). Let the redundancy subgraph induced by the set of k patterns be G(P k ). The natural formulation of L is to consider all pair-wise redundancy by summing the edge weights of G(P k ). Since there are k patterns and k(k−1) 2 edges, we further normalize it by taking average weights on edges. Typically, the average weights associated with a pattern pi are: w(pi ) = 1 k−1 k X R(pi , pj ). j=1,j6=i The evaluation function Gas is defined as below: Gas (P k ) = k X S(pi ) − i=1 k 1X w(pi ), 2 i=1 (3) where 12 is introduced because every redundancy R(pi , pj is counted twice by both pi and pj . Substitute w(pi ) in Eqn. (3): Gas (P k ) = k X S(pi ) − i=1 k i−1 1 XX R(pi , pj ) k − 1 i=1 j=1 (4) We refer this formulation as average significance. An alternative formulation for L is to compute the maximum spanning tree of G(P). Let the sum of edge weights on the maximum spanning tree be w(M STP ). The evaluation function Gms is defined as below: Gms (P k ) = k X S(pi ) − w(M STP ). (5) i=1 Note that the Gms formulation is a generalization of maximal marginal relevance (MMR) heuristic in information retrieval [5], where a document has high marginal relevance if it is both relevant to the query and contains minimal marginal similarity to previously selected documents. The marginal similarity is computed by choosing the most relevant selected document. Different from Gms , this definition gives a procedural way to evaluate a set of documents. If we use this concept to compute the score of a set of patterns P k (by adding patterns p1 , p2 , . . . , pk incrementally), we have M M R(P k ) = S(p1 ) + k X i=2 i−1 (min S(pi |pj )). j=1 Combining the definition of relative significance, one can easily verify that MMR approximates L by computing a spanning tree on G(P k ). However, the score of MMR depends on the order on which patterns are selected. Gms is the minimum score over all possible MMR scores. We refer Gms formulation as marginal significance. Correspondingly, the problems of finding redundancyaware top-k patterns are as follows: Definition 5. (Maximal Average Significance) Given a set of pattern collection P, the problem of Maximal Average Significance (MAS) is to find k-pattern set P k such that Gas (P k ) is maximized. Definition 6. (Maximal Marginal Significance) Given a set of pattern collection P, the problem of Maximal Marginal Significance (MMS) is to find k-pattern set P k such that Gms (P k ) is maximized. 3. COMPARING MAS AND MMS In this section, we examine the two proposed evaluation functions. We show that both MAS and MMS problems are NP-hard, and adopt a well-known greedy algorithm to compare their performance. 3.1 The Greedy Algorithm We consider a special case of the redundancy graph where all patterns have the same significance score, and thus only the weights on edges take effect. The problem of MAS is thus to find a k-pattern set where the sum of edge weights are minimized. This problem is equivalent to k-dense subgraph problem, which is known to be NP-hard [7]. The problem of MMS is to find a kmaximum spanning tree whose overall weights are minimized. Holldorsson et al. [9] show that this problem is NP-hard. Since it is difficult to find the optimal solutions, we adopt a well-known greedy algorithm to examine the performance of MAS and MMS. The algorithm incrementally selects patterns from P with an estimated gain g. A pattern is selected if it has the maximum gain among the remaining patterns. Given a set of selected patterns P k , the gain of a pattern p ∈ P − P k is: P S(p) − |P1k | q∈P k R(p, q), f or M AS, g(p) = S(p) − maxq∈P k R(p, q), f or M M S. At beginning, the result set P k is empty. The algorithm picks the most significant pattern and inserts it to P k . When |P k | < k, we will compute gain g(p) for every remaining pattern p ∈ P − P k , and select the pattern with the maximum gain. After a pattern is inserted into P k , it remains in P k . The naive implementation of the above algorithm takes time O(k2 n). The alternative approach with time complexity O(kn) can be implemented as follows. For each remaining pattern, we can remember the previous gain and compute the new gain by updating the redundancy with the last pattern added to P k . As an example, assume at the ith iteration, the pattern pi is selected, and for each pattern p ∈ P − P k , g i (p) was computed with respect to P k − {pi }. To search for next candidate pattern, we need to update g(p) by incorporating the newly selected pattern pi . One can verify the following update formulas for MAS and MMS: ` ´ S(p) − 1i (i`− 1)(S(p) − g i (p)) + R(p, pi ) , i+1 ´ g (p) = S(p) − max (S(p) − g i (p)), R(p, pi ) . The execution of update functions takes constant time. The algorithm is described in Algorithm 1. Finding the most significant pattern takes time O(n). At each iteration, we need to compute gain g(p) for each pattern p ∈ P −P k , and select the one with the maximum value. Using the update functions, each iteration also takes time O(n). The total time complexity of the greedy algorithm is O(kn). 3.2 Comparing MAS and MMS We examine both formulations using the same greedy algorithm. The experiments are conducted on two real applications: disk block prefetch and document theme Algorithm 1 The Greedy Algorithm Input: A set of n patterns, P Number of output patterns, k Significance Measure, S Divergence Measure, D Output: k-pattern set, P k 1: 2: 3: 4: 5: 6: 7: Let p be the most significant pattern; P k = {p}; while (|P k | < k) Find a pattern p such that the gain g(p) is the maximum among the set of patterns in P − P k ; k P = P k ∪ {p}; return extraction. For clear presentation, the results are organized in Section 5. We observe that MMS performs much better in both experiments. There are two possible reasons that may explain the results. First, the unified greedy algorithm may favor MMS; and second, the formulation of MMS is more reasonable. We discuss these two issues one by one. Since both problems are NP-hard and the greedy algorithm reports approximate solutions. We study the performance bound of the greedy solutions with respect to the optimal solutions. The following theorem shows that Algorithm 1 has performance bound 2 for MAS. Due to limited space, we omit the proof. Theorem 1. Let the k-pattern set returned by Algorithm 1 (with MAS gain) be P k , and the optimal pattern set be Ok . We have: Gas (Ok ) ≤ 2Gas (P k ). To our best knowledge, the algorithm does not have performance bound for MMS. In fact, a counter example in Section 4.2 shows that the worst case performance bound on MMS could be much worse than that of MAS. This analysis indicates that Algorithm 1 does not favor MMS and the worse performance of MAS may be caused by the limitation of its formulation. We further examine the top-k patterns returned by both algorithms in our experiments. The patterns returned by MAS clearly contain more redundancy. This is because the redundancy penalty in MAS formulation is averaged by the number of patterns k, and each pattern usually has redundancy with a few other patterns. The larger the value of k, the smaller the redundancy penalties. One may suggest to remove the denominator (i.e., k − 1) in Eqn. (4). However, this may lead to over penalizing in the objective function since the number of redundancy penalties is the order of square of the number of patterns. On the other hand, the MMS formulation is not sensitive to the value of k. In summary, the MMS formulation is quite reasonable. One possible extension to MMS formulation is to allow weighted combination of the significance and redundancy penalty. This actually is implicitly handled by our definition of distance measure because we can always incorporate the user-defined weights into the distance definitions. In the rest of the paper, we mainly focus on the MMS problem. 4. AN IMPROVED METHOD FOR MMS Here we discuss an improved method to the MMS problem. We assume that the distance measure satisfies triangle inequality. Our method is not restricted to this constraint. However, if this condition holds, our solution has a guaranteed performance bound. equation of S(p|q) here for easy understanding of the example: S(p|q) = S(p) − (1 − D(p, q)) min(S(p), S(q)). s1 d13 = We first introduce a variant computation model based on redundancy graph. As defined in Section 2, the redundancy graph is an edge-weighted and node-weighted undirected graph. We transform it to the directed redundancy graph as follows: for each pair of patterns pi and pj , we create a directed edge from pi to pj , and the associated edge weight is the relative significance S(pj |pi ). The weight on each node pi is still the pattern significance S(pi ). An example of this transformation is shown in Fig. 2 (Not all directed edges are shown in the transformed redundancy graph). S(p1 ) R(p1 , p2 ) R(p1 , p3 ) S(p2 |p1 ) S(p2 ) S(p3 ) R(p3 , p4 ) S(p4 ) S(p2 ) S(p3 |p1 ) S(p3 ) R(p3 , p5 ) S(p5 ) (a) Undirected Redundancy Graph S(p4 |p3 ) S(p4 ) S(p5 |p3 ) S(p5 ) (b) Directed Redundancy Graph Figure 2: Directed redundancy graph In MMS problem, Gms (P k ) is evaluated by computing the maximum spanning tree on the sub redundancy graph G(P k ). There are k node weights and k − 1 edge weights in the tree. We particularly select the most significant pattern as the root of the maximum spanning tree, and combine the other k − 1 node weights and the k − 1 edge weights. Example 1 shows this procedure. Example 1. In Fig. 2, suppose the set of pattern P k = {p1 , p2 , p3 , p4 , p5 } is evaluated by the spanning tree shown in Fig. 2 (a), and p1 isP the most significant pattern. Originally, Gms (P k ) = 5i=1 S(pi ) − R(p1 , p2 ) − R(p1 , p3 ) − R(p3 , p4 ) − R(p3 , p5 ). It is equivalent to Gms (P k ) = S(p1 ) + S(p2 |p1 ) + S(p3 |p1 ) + S(p4 |p3 ) + S(p5 |p3 ), as shown in Fig. 2 (b). Since we transform the negative redundancy penalties to positive relative significance, the original maximum spanning tree on the undirected redundancy graph corresponds to the minimum spanning tree on the directed redundancy graph. The MMS problem is equivalent to searching a constrained rooted minimum spanning tree on the directed redundancy graph such that the overall weights on the root node and on the edges in the tree are maximized. The constraint specifies that the root must be the most significant pattern in the tree. 4.2 Performance Study of Algorithm 1 We study the worst case performance of MMS by Algorithm 1, under the assumption that the distance measure satisfies triangle inequality. The following example shows that this greedy approach may lead to a serious problem in some case. We rewrite the computation p2 c−1 d13 c s2 = p1 4.1 The Computational Model S(p1 ) d12 = c s c−1 3 −δ d23 = 1c d13 1 c s3 p3 Figure 3: A directed redundancy graph with 3 patterns Example 2. Consider a graph with three patterns p1 , p2 , and p3 (Fig. 3). For simplicity, we use si and dij to denote S(pi ) and D(pi , pj ), respectively. Let s1 ≥ c s2 ≥ s3 , and s2 = c−1 s3 − δ (where δ > 0 is a small perturbation). Let d12 = c−1 d13 , d23 = 1c d13 , and d13 = c 1 . One can verify that d12 , d13 , and d23 satisfy triangle c inequality. The greedy algorithm will first select pattern p1 . Since S(p3 |p1 ) = d13 s3 > d12 s2 = S(p2 |p1 ), the algorithm will pick p3 as the next. The estimated gain on the objective function is r = S(p3 |p1 ). The algorithm continues to look for the next pattern p2 . The estimated gain for adding p3 and p2 is: S(p3 |p1 ) + min(S(p2 |p1 ), S(p2 |p3 )) ≈ 2r. However, the real objective function of MMS is evaluated by the spanning tree p1 → p2 → p3 , with the gain S(p2 |p1 ) + S(p3 |p2 ) ≈ r + rc , where c can be chosen arbitrarily large. This over-estimation can be accumulated quite large when the number of patterns increases. The reason that the greedy approach has the overestimation problem is that the relative significance is not symmetric. Given patterns p and q, we have S(q|p) ≥ S(p|q) if S(q) > S(p). If we select the less significant pattern p first, there will be an over-estimation. To avoid this problem, we should try to incrementally add patterns according to significance decreasing order. This motivates our alternative approximation algorithm. 4.3 An Alternative Approach We first outline the main ideas. The algorithm searches for a specific value r, with which, the algorithm first finds the most significant pattern (as p1 ), and removes all patterns p such that S(p|p1 ) ≤ r; then finds the most significant pattern in the remaining patterns (as p2 ), and removes all patterns p such that S(p|p2 ) ≤ r, and so on. We finally get kr patterns. Ideally, we want to find the perfect r value such that kr = k. The first intuition is that when r value is small, we may have kr > k, and when r value is large, kr < k. If the kr value is monotonic to r, then we can run a binary search on the domain of r. Unfortunately, kr is not monotonic to r. Fig. 4 shows a counter example that a larger r value leads to a larger kr . Example 3. Suppose S(p1 ) ≥ S(p2 ) ≥ . . . ≥ S(p5 ). We only display the edges whose weights are less than 1.5. When r = 1.0, we get two patterns p1 and p3 . When r = 1.4, we get three patterns p1 , p4 and p5 . remaining patterns (i.e., we select pattern pt2 ). After k patterns are selected, the remaining pti will find a selected pattern which belongs to the same circle pTj with pti . In our example, pt5 is a remaining pattern, and it belongs to circle pT3 with a selected pattern pt4 . We further merge pt5 as well as all the patterns in circle pt5 to circle pt4 . p1 1.0 1.4 p2 0.5 p3 1.0 1.0 p4 p5 Figure 4: A Counter Example Instead of searching for the perfect r value, we search for a pair of trial values t and T (t < T ), such that T leads to kT ≤ k and t leads to kt ≥ k. If the difference T − t = ǫ is sufficiently small, we can pick k patterns from the kt patterns with some desired property (i.e., Lemma 1). We introduce the ǫ-normalization on edge weights. For each pattern pair pi and pj , the edge weight S(pi |pj ) = S(pi ) − R(pi , pj ) ≤ S(pi ). Suppose p1 is the most significant pattern, we have S(pi |pj ) ≤ S(p1 ). That is, every edge weight is upper bounded by S(p1 ). We partition [0, S(p1 )] into B equi-width intervals, and each 1) interval has width ǫ = S(p . S(pi |pj ) is normalized to B The complete procedure is summarized as Algorithm 2, which is self-explanatory. Each iteration takes time O(kn), and the complexity to find the values of T and t (T −t = ǫ) is O(kn log B). Generally, we use k ≤ B ≤ n. The complexity of selecting k patterns from kt patterns relies on the generation of kt patterns, whose complexity is O(kt n). In most cases, kt is comparable to k. Algorithm 2 Greedy Algorithm for MMS Input: A set of n patterns, P Number of output patterns, k Significance measure, S Divergence measure, D Weight normalization, B Output: k-pattern set, P k B×S(p |p ) S(pi |pj ) = ⌊ S(p1i) j ⌋ × ǫ. With this normalization, we run a binary search on the normalized edge weights whose search space is 0 to S(p1 ) (i.e., B intervals). Initially, kT = 1 ≤ k by T = S(p1 ), and kt = |P| ≥ k by t = 0. If k(T +t)/2 ≥ k, we update t = (T + t)/2. Otherwise, we update T = (T + t)/2. After log B times binary search, we have T − t = ǫ and kT ≤ k ≤ kt . We discuss how to select k patterns from kt patterns when T − t = ǫ. Our goal is to find k patterns such that (1) the directed-edge weight between them is lower bounded by a positive value d, and (2) for any other pattern q, there exists one pattern p in the selected k patterns such that the edge weight S(q|p) is upper bounded by a constant factor of d. pT1 pT2 pT3 pt1 pt2 pt3 pt4 pt1 pt2 pt3 pt4(pt5) kT = 3 pt5 kt = 5 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 1) , t = 0, T = S(P1 ); ǫ = S(p B Run the binary search with (t, T ) in space [0, S(p1 )]; selected[i] = f alse (i = 1, . . . , n); removed[i] = f alse (i = 1, . . . , n); for i = 1 to k if there is no pattern left //k T +t < k, decrease T 2 T = T 2+t , goto line 2; Let ps be the most significant pattern s.t. selected[s] ≡ f alse and removed[s] ≡ f alse; Assign selected[s] = true, removed[s] = true; for j = 1 to n if (!removed[j] and !selected[j])) if (S(pj |ps ) ≤ T 2+t ) removed[j] = true; if there are patterns left //k T +t > k, increase t 2 t = T 2+t , goto line 2; Generate kt patterns; Select k patterns from kt patterns; return; k=4 Figure 5: Find k patterns from (u, l)-pair The selecting strategy is demonstrated by Fig. 5. Let pt1 , pt2 , . . . , ptkt be the selected kt patterns (assume S(pt1 ) ≥ S(pt2 ) ≥ . . . ≥ S(ptkt )), and pT1 , pT2 , . . . , pTkT be the selected kT patterns. Each pattern is around by a circle which indicates a set of patterns removed due to the selection of this pattern. Every pattern pti must belong to one circle in pTj (j = 1, 2, . . . , kT ). We select k patterns from the kt patterns by the following rules: 1. The most significant pattern pti in each pTj circle is first selected. In our example, patterns pt1 , pt3 and pt4 are selected; 2. While the number of selected patterns is less than k, we select the most significant pti patterns in the The desired property as we claimed earlier is summarized in Lemma 1. Lemma 1. Let d = t and the selected k patterns be p1 , p2 ,. . . , pk (significance decreasing order). If the distance satisfies triangle inequality, then for each pi and pj , S(pi |pj ) ≥ d and S(pj |pi ) ≥ d; and for each pattern q within the circle of pattern pi , S(q|pi ) ≤ 3d + 5ǫ. Sketch of Proof. See Appendix. The following theorem shows that Algorithm 2 has a performance guarantee for the MMS problem. Theorem 2. Let the k-pattern set returned by Algorithm 2 be P k , and the optimal pattern set be Ok . If the distance measure satisfies triangle inequality, we have: Gms (Ok ) ≤ (6 + 10k + log k)Gms (P k ). B Sketch of Proof. See Appendix. By setting B ≥ k, the performance bound of algorithm 2 for MMS problem is O(log k), while the additional factor on complexity (i.e., log B) does not introduce heavy computational cost. In fact, as we will show in the experiments, the running time of Algorithm 2 is similar to that of Algorithm 1. 5. EXPERIMENTAL RESULTS To test the performance of the proposed algorithms, we design two sets of experiments. The first examines the quality of extracted top-k patterns, and the second measures the computational performance of the proposed methods. For simplicity, we refer Algorithm 1 for maximal average significance as MAS, Algorithm 1 for maximal marginal significance as MMS, and Algorithm 2 (with bound) for maximal marginal significance as MMSb. We use SIG to refer to the method extracting top-k patterns completely based on significance (without considering redundancy). In all experiments, the number of intervals for the binary search in MMSb is set as B = k. and s is a parameter. Given two patterns, p1 and p2 , we use the Jaccard distance measure [13]: D(p1 , p2 ) =1 − |T S(p1 ) ∩ T S(p2 )| , |T S(p1 ) ∪ T S(p2 )| where T S(p1 ) is the set of transactions containing pattern p1 . Quality Evaluation: We run SIG, MAS, MMS and MMSb on the original collection of 8, 718 themes to extract top-10 results, which are displayed in Table 1. Without considering redundancy, the top-10 results returned by SIG only consist of two valuable themes (themes 1 and 4), and all the others are redundant. MMS and MMSb report the identical results, where all 10 themes have high significance score and are different from each other. There are two redundant themes in MAS. This suggests that the redundancy penalty by MAS formulation is not enough, and some theme patterns whose high significance scores compensate the redundancy penalties can still survive. 5.1 Quality of Top-K Patterns Here we demonstrate two case studies that use our proposed methods: (1) document theme extraction, and (2) correlation-directed disk block prefetch. For each case study, we discuss pattern generation, significance measure, distance measure, and quality evaluation. 5.1.1 Document Theme Extraction Theme discovery uses knowledge about the meaning of words in a text to identify broad topics covered in a document [3, 15]. One way to find themes from text document is to extract the frequent patterns of term occurrence. For example, a frequent pattern of “database management” indicates that the document might be related to a collection of database papers, whereas a frequent pattern like “red cross” might identify the topic of the documents as aid and relief. In this case study, we show how to apply our methods to discovering redundancy-aware top-k term occurrence patterns. Pattern Generation: A document collection is constructed by a mixture of documents of four topics: 386 news articles about Tsunami, 367 research papers about data mining, 350 research papers about bioinformatics, and 347 blog articles about iPod Nano. A document is broken into sentences as transactions. We mine sequential patterns [27] with a minimum support of 0.02%, and 8, 718 patterns are generated. Significance and Distance Measure: A pattern’s significance is modeled by a tf-idf scoring function similar to the Pivoted Normalization weighting based document score [23]. Specifically, given a theme pattern p = w1 ...wt , the significance is defined by S(p) = t X 1 + ln(1 + ln(tfi )) N +1 · ln , dl dfi (1 − s) + s avdl i=1 where tfi equals the support of the pattern p, dfi is the inverse sentence frequency of word wi in the whole transaction set, dl is the average sentence length associated with P , avdl is the overall average sentence length 5.1.2 Correlation Directed Prefetch Block correlations are common semantic patterns in storage systems [14]. Correlated blocks tend to be accessed relatively close to each other in an access stream. Exploring these correlations is very useful for improving the effectiveness of storage caching, pre-fetching, and data layout. Particularly, at each access, a storage system can pre-fetch correlated blocks into its storage cache so that subsequent accesses to these blocks do not need to access disks, which is several orders of magnitude slower than accessing directly from a storage cache. A correlation pattern is a rule in the form of “b35 b100 → b9039 ” implying that if disk block b35 and b100 are accessed sequentially, then disk block b9039 will be pre-fetched (note there is always only one block-id at the right-hand side of a rule). Since the computer resources are limited, our task is to extract top-k important rules for prefetch purposes. Pattern Generation: We use the rules provided by [14]. The experiment uses a set of real system traces, Cello-92, collected at the Hewlett-Packard Laboratories [20]. It captured all low-level disk I/Os performed on Cello, which is a timesharing system used by a group of researchers at the HP Labs to do simulation, compilation, editing, and e-mail. The traces include the accesses to 8 disks. Long trace sequences are broken into fixed-size short sequential transactions (in our experiment, the window size is 50). We mine sequential patterns from the transformed transaction database and 276, 054 rules are generated. Significance and Distance Measure: The significance of a rule should be measured by the performance gain with its existence. The model of cost-benefit of prefetching could be very complicated. Here we adopt a simplified yet effective measure [14]. Given a rule l → r, the significance of this rule is |T S(l, r)|, where T S(l, r) is the set of transactions having l followed by r. Given two rules, “rule1 : l1 → r1 ” and “rule2 : l2 → r2 ”, the Table 1: Top-10 Document Themes 2 3 4 5 6 7 8 9 10 SIG MAS MMS MMSb permission make digital permission make digital permission make digital permission make digital copy personal grant copy personal grant copy personal grant copy personal grant without fee distribute without fee distribute without fee distribute without fee distribute commercial full citation commercial full citation commercial full citation commercial full citation permission make digital database manage database database manage database database manage database copy personal distribute application mine algorithm application mine algorithm application mine algorithm commercial full citation keyword keyword keyword permission make digital pattern recognition design pattern recognition design pattern recognition design copy personal copy fee method classify evaluate method classify evaluate method classify evaluate database manage database information retrieval information retrieval information retrieval application mine algorithm storage information storage information storage information keyword search keyword search keyword search keyword database manage database permission make digital artificial intelligence artificial intelligence mine algorithm keyword copy personal distribute learn general term learn general term commercial full citation algorithm experimentation algorithm experimentation database manage database artificial intelligence international federate international federate application mine algorithm learn general term red cross red crescent red cross red crescent algorithm experimentation database manage database international federate australia prime minister australia prime minister application mine keyword red cross red crescent john howard australia john howard australia database manage database australia prime minister indonesia president indonesia president mine term algorithm john howard australia susilo bambang yudhoyono susilo bambang yudhoyono database manage database database manage database deputy defense secretary deputy defense secretary mine term keyword application mine keyword paul wolfowitz paul wolfowitz database manage database indonesia president thailand prime minister thailand prime minister application mine term susilo bambang yudhoyono thaksin shinawatra thaksin shinawatra distance measure is defined as follows: 8 1 , r1 6= r2 , < D(rule1 , rule2 ) = |T S(l 1 , r1 ) ∩ T S(l2 , r2 )| :1 − , r1 = r2 . |T S(l1 , r1 ) ∪ T S(l2 , r2 )| If two rules have different block-ids at the right-hand side, then they are not related to each other. Otherwise, these two rules trigger the same pre-fetching target. We compare the support sets of these two rules. If the overlap is significant, then the relative significance of one rule with respect to the other is small. 2.2 Response Time(msec) Top-k 1 2.15 2.1 2.05 2 5000 32 SIG MAS MMS MMSb 31.5 Miss Ratio 30.5 30 29.5 10000 15000 20000 10000 15000 top-k 20000 25000 Figure 7: Response Time w.r.t top-k 31 29 5000 SIG MAS MMS MMSb 25000 top-k Figure 6: Miss Ratio w.r.t top-k Quality Evaluation: We run SIG, MAS, MMS, and MMSb on the original collection of 276, 054 rules to ex- tract top-k rules, which are further fed into a simulation system [14]. The performance is evaluated by miss ratio (Fig. 6) and response time (Fig. 7). We observe (1) both MMS and MMSb perform much better than SIG, indicating that the redundancy-aware top-k patterns contain more valuable information; (2) the MMSb method is better than MMS, which is consistent with our claim that MMSb is more robust; and (3) MAS is almost identical to SIG. This is because in this experiment, k is relative large, whereas redundancy only exists among very limited number of patterns (i.e., only the rules that have the same right-hand side are possibly redundant to each other). Averaging by a very large number of k makes the redundancy penalty negligible. 5.2 Computational Performance Here we examine the computational performance of the two proposed greedy algorithms for MMS. We run the experiments on the document theme data set. The computation times w.r.t. different top-k values are shown in Fig. 8. Given a collection of patterns, both algorithms scale well with respect to k. Although MMSb has higher complexity in the worst case, its running time is comparable to MMS. This is because (1) it generally stops early in each trial r where we try to find k patterns, thus the complexity of each iteration is less than O(kn); and (2) a pattern does not participate in further computation as soon as it is removed (while in MMS each pattern will be compared with all the selected k patterns). 200 MMS MMSb 180 Running Time(sec) 160 140 120 100 80 They give approximation algorithms in the metric undirected graph, where only edge weights exist. Our problem is different because patterns form a node-weighted as well as the edge-weighted graph. 7. CONCLUSIONS To extract redundancy-aware top-k patterns, we examined two problem formulations: MAS and MMS. We studied a unified greedy approach to compare these two functions and show that MMS is a reasonable formulation to our problem. We further present an improved algorithm for MMS and show that the performance is bounded by O(log k). We present two case studies to examine the performance of our proposed approaches. Both MMS algorithms are able to find high-significant and low-redundant top-k patterns. Particularly, in block correlation experiments, we observe that our improved algorithm performs better. This study opens a new direction on finding both diverse and significant top-k answers to querying, searching, and mining, which may lead to promising further studies. One further issue is the formal study of the evaluation functions for a pattern set. Direct mining of top-k patterns from data is another promising direction. 60 40 8. APPENDIX 20 10 20 40 60 top-k 80 100 Figure 8: Computation Time w.r.t top-k Sketch of Proof for Lemma 1. The first result is true because all pi patterns are selected from kt patterns. If i > j, we also have S(pi |pj ) ≥ S(pj |pi ) ≥ d. To prove the second result, we first show two related claims. For simplicity, we use d12 and s1 to denote D(p1 , p2 ) and S(p1 ), respectively. 6. RELATED WORK In Section 1, we have discussed the connection of our work with previous pattern compression (summarization) approaches and database top-k query processing. A closely related work is the pattern ordering problem studied in [16], where the authors also compute topk patterns. Their criterion of the top-k pattern set is to provide best frequency estimation of those patterns that are not selected. Thus the objective function to evaluate the k pattern set is well defined. Our problem definition is more general since we do not assume any specific application. The greedy algorithm used in [16] is similar to Algorithm 1. Our work is also related to document retrieving and ranking problem in Information Retrieval [5, 21]. The formulation of MMS is a generalization of maximal marginal relevance heuristic [5]. Different from techniques in IR where results are generally evaluated by user study, we propose explicit objective functions and develop an approximate algorithm with the near optimal solution. The problem of MAS is identical to the maximum dispersion problem in graph algorithm. Ravi et al. [19] show that the bound of performance guarantee of any polynomial approximation is at least 2 and Algorithm 1 achieves this. The problem of MMS is related to finding a minimum spanning tree in a subgraph. Finding subset maximizing the minimum weight of a combinatorial structure was first proposed by Halldorsson et al. [9]. p2 p2 p1 p3 p1 (a) p3 (b) Figure 9: Two directed triangles If the distance measure satisfies triangle inequality, then given a directed triangle as shown in Fig. 9(a), S(p2 |p1 ) + S(p3 |p2 ) ≥ S(p3 |p1 ) (Claim 1 ); and given a directed triangle as shown in Fig. 9(b), S(p1 |p2 ) + S(p3 |p2 ) ≥ S(p3 |p1 ), where s1 ≥ s3 (Claim 2 ). The proof of these two claims are similar. We show one case for claim 1. If s1 ≥ s2 ≥ s3 , we have S(p2 |p1 )+ S(p3 |p2 ) = d12 s2 + d23 s3 ≥ d12 s3 + d23 s3 ≥ d13 s3 = S(p3 |p1 ). For each pattern q in the circle of pi , assume q originally belongs to circle pj , and both pi and pj belong to circle pTv . We have: S(q|pi ) ≤ S(q|pj ) + S(pj |pi ) ≤ S(q|pj ) + S(pj |pTv ) + S(pi |pTv ) ≤ S(q|pj ) + S(pj |pTv ) + S(pi |pTv ) + 3ǫ ≤ t + T + T + 3ǫ ≤ 3t + 5ǫ. Sketch of Proof for Theorem 2. (Claim1) (Claim2) Let us call the patterns in P k greedy patterns and the patterns in Ok optimal patterns. The algorithm partitions all patterns in P into k groups. In each group, the most significant pattern is reported (let the pattern reported from group i be pi ). The edge weight between any pi and pj (i, j ∈ {1, 2, . . . , k}) is at least d. We have Gms (P k ) ≥ S(p1 ) + (k − 1)d, where p1 is the most significant pattern. Assume the k optimal patterns in Ok = {q1 , q2 , . . . , qk } are distributed in k′ ≤ k groups. We create a spanning tree for Ok based on the following two rules. First, if i there are multiple optimal patterns q1i , q2i , . . . , qki within group i, we locate the most significant pattern q1i and include edges S(qji |q1i ) for all other patterns. According to Lemma 1, S(qji |q1i ) ≤ S(qji |pi )+S(q1i |pi ) ≤ 6d+10ǫ. The overall sum of weights inside k′ groups is (k − k′ )(6d + 10ǫ). Second, we further include edges between optimal patterns q1i to make a spanning tree on Ok . This is achieved by an iterative procedure. Let the spanning tree corresponding to Gms (P k ) be M STp . We can de′ compose M STp into ⌈ k2 ⌉ paths such that the two end nodes of each path are patterns pi , whose group contains an optimal pattern q1i . We group k′ optimal pat′ terns into ⌈ k2 ⌉ pairs. In each pair (a, b), we include the edge S(a|b) (or S(b|a)) if S(b) ≥ S(a) (otherwise). ′ There are at most ⌈ k2 ⌉ edges that will be included. The sum of weights of the included edges is: w(k′ ) ≤ w(M STp ) + k′ (6d + 10ǫ), where w(M STp ) is the sum of edge weights on M STp . In each pair (a, b), we remove the pattern whose significance value is smaller, and the larger one stays for the next iteration. Since we remove half number of patterns at each iteration, there will be at most log (k′ ) iterations. When there is only one pattern left, a spanning tree over Ok is constructed. The overall sum of edge weights included ′ in this procedure is: w(k′ ) + w( k2 ) + . . . + w(2) ≤ ′ ′ log (k )w(M STp ) + k (6d + 10ǫ). Since Gms (Ok ) is the minimum score of all spanning trees on Ok , we have Gms (Ok ) ≤ G′ms (Ok ). Because p1 is the globally most significant pattern, maxki=1 S(qi ) ≤ S(p1 ). Furthermore, Gms (P k ) = S(p1 ) + w(M STp ) ≥ 1 S(p1 ) + (k − 1)d, we have d ≤ k−1 (Gms (P k ) − S(p1 )) ≤ k 1 G (P ). Finally, fr k ms k k k Bǫ = S(p1 ), we have kǫ = B Bǫ = B S(p1 ) ≤ B Gms (P k ). Combining all of the above, we have: Gms (Ok ) ≤ G′ms (Ok ) k ≤ max S(q1i ) + k(6d + 10ǫ) + log (k′ )w(M STp ) i=1 ≤ S(p1 ) + 6kd + 10kǫ + log (k′ )(Gms (P k ) − S(p1 )) 10k ≤ S(p1 ) + (6 + + log k)Gms (P k ) − log kS(p1 ) B 10k + log k)Gms (P k ). ≤ (6 + B 9. REFERENCES [1] F. Afrati, A. Gionis, and H. Mannila. Approximating a collection of frequent sets. KDD’04. [2] S. Agrawal, S. Chaudhuri, G. Das, and A. Gionis. Automated Ranking of Database Query Results. CIDR’03. [3] P. Brown, V. Della Pietra, P. deSouza, J. Lai, and R. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992. [4] T. Calders and B. Goethals. Mining all non-derivable frequent itemsets. PKDD’02. [5] J. Carbonell and J. Coldstein. The use of mmr, diversity-based reranking for reordering documents and producing summaries. SIGIR’98. [6] S. Chaudhuri and L. Gravano Evaluating Top-k Selection Queries. VLDB’99. [7] E. Epkut, T. Baptie, and B. Hohenbalken. The discrete p-maxian localtion problem. Comput. & Opns. Res., 17:51–61, 1990. [8] R. Hassin, S. Rubinstein, and A. Tamir. Approximation algorithms for maximum dispersion. Operations Research Let., 21:133–137, 1997. [9] M. Holldorsson, K. Iwano, N. Katoh, and T. Tokuyama. Finding subsets maximizing minimum structures. SODA’95. [10] J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining top-k frequent closed patterns without minimum support. In ICDM’02. [11] S. Jaroszewicz and D.A. Simovici. A general measure of rule interestingness. PKDD’01. [12] S. Jaroszewicz and D.A. Simovici. Interestingness of frequent itemsets using bayesian networks as background knowledge. KDD’04. [13] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988. [14] Z. Li, Z. Chen, S. Srinivasan, and Y. Zhou. Mining block correlations in storage systems. USENIX FAST’04. [15] Q. Mei and C. Zhai. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. KDD’05. [16] T. Mielikäinen and H. Mannila. The pattern ordering problem. PKDD’03. [17] D. Mount. Bioinformatics: Sequence and genome analysis. Cold Spring Harbor Lab., 2001. [18] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT’99. [19] S. Ravi, D. Rosenkrantz, and G. Tayi. Heuristic and special case algorithms for dispersion problems. Operations Research, 42:299–310, 1994. [20] C. Ruemmler and J. Wilkes. Unix disk access patterns. USENIX’93. [21] X. Shen and C. Zhai. Active feedback in ad-hoc information retrieval. SIGIR’05. [22] A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. Knowledge & Data Engineering, 8:970–974, 1996. [23] A. Singhal. Modern information retrieval: A brief overview. Bull. IEEE CS Tech. Comm. Data Eng., 24(4):35–43, 2001. [24] P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the right interestingness measure for association patterns. KDD’02. [25] D. Xin, J. Han, X. Yan, and H. Cheng. Mining compressed frequent-pattern sets. VLDB’05. [26] X. Yan, H. Cheng, J. Han, and D. Xin. Summarizing itemset patterns: A profile-based approach. KDD’05. [27] X. Yan, J. Han and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM’03.

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Download PDF

advertisement