The Model-Summary Problem and a Solution for Trees

Biswanath Panda (1), Mirek Riedewald (2), Daniel Fink (3)
(1) Google Inc., USA; [email protected]
(2) College of Computer and Information Science, Northeastern University, USA; [email protected]
(3) Cornell Lab of Ornithology, USA; [email protected]
Abstract— Modern science is collecting massive amounts of
data from sensors, instruments, and through computer simulation. It is widely believed that analysis of this data will hold the
key for future scientific breakthroughs. Unfortunately, deriving
knowledge from large high-dimensional scientific datasets is
difficult. One emerging answer is exploratory analysis using
data mining; but data mining models that accurately capture
natural processes tend to be very complex and are usually not
intelligible. Scientists therefore generate model summaries to find
the most important patterns learned by the model. We formalize
the model-summary problem and introduce it as a novel problem
to the database community. Generating model summaries creates
serious data management challenges: Scientists usually want to
analyze patterns in different “slices” and “dices” of the data
space, comparing the effects of various input variables on the
output. We propose novel techniques for efficiently generating
such summaries for the popular class of tree-based models. Our
techniques leverage workload structure on multiple levels. We
also propose a scalable implementation of our techniques in
MapReduce. For both sequential and parallel implementation,
we achieve speedups of one or more orders of magnitude over
the naive algorithm, while guaranteeing the exact same results.
I. INTRODUCTION
Across many scientific disciplines, the availability of very
large amounts of data is creating a paradigm shift. This is
usually referred to as data-driven science or eScience. The
National Science Foundation (NSF) in the US has made
data-driven science one of its funding priorities, and there
are similar efforts world-wide. In its 2007 report, the NSF
Cyberinfrastructure Council stated that “. . . U.S. international
leadership in science and engineering will increasingly depend
upon our ability to leverage this reservoir of scientific data
captured in digital form, and to transform these data into information and knowledge aided by sophisticated data mining,
integration, analysis and visualization tools.”
Data management skills are needed to solve many of
the grand challenges in data-driven science. However, these
problems have not received adequate attention in the database
community. In this paper we introduce the model-summary
computation problem, a challenging general problem from
data-driven science. We show solutions for a popular instance
of the problem, and point out various future challenges. To
illustrate the problem, consider the following example from
bird ecology.
Fig. 1. Typical data-intensive science workflow

Ornithologists and conservation biologists try to identify large-scale threats to sensitive bird species, such as climate change or land use change associated with human population
expansion. To do this, they need to explore the complex
and highly dynamic ecology of bird populations across huge
geographic extents. The traditional approach of having experts
come up with a hypothesis and then design an experiment to
collect the data for testing this hypothesis requires a sufficient
understanding of the studied phenomenon. And due to the high
cost of collecting data from a carefully designed experiment,
it is limited to the study of a few variables at small scale.
Instead, the bird ecology community, like many other
domain sciences, is shifting focus to the analysis of nonexperimental, or what we call “observational”, data that can
be more efficiently collected across large spatial and temporal scales [1]. Figure 1 shows the corresponding workflow. For example, the Cornell Lab of Ornithology and
dozens of partner organizations are collecting millions of
bird sighting reports every year from various protocols (see
www.avianknowledge.net). These records are joined
based on time and location with other datasets, adding thousands of attributes describing habitat, climate, census, elevation, and other features (synthesis step).
In the next step of the workflow, exploratory analysis
techniques are used to identify and describe the input attributes
that are most strongly associated with observed distributions of
birds. Understanding such statistical associations is essential
quantitative information, which provides the inspiration for
new hypotheses about the true causal relationships. Then
more traditional hypothesis-driven science can be carried out
through careful collection of additional data or through quasi-
experimental design approaches [2].
Exploratory analysis begins by training prediction models.
These models are then analyzed as an approximation of the
underlying real process that generated the data [3], [4]. Nonparametric data mining models like tree-based ensembles,
SVMs, and artificial neural nets (ANNs) are perfectly suited
for exploratory analysis. They are flexible enough to model
complex interactions between many variables and they can
handle large datasets. Even with little understanding of a
complex natural process, data mining techniques can generate
excellent models that have high predictive accuracy by “letting
the data speak for itself” and avoiding prior assumptions as
much as possible.
Although non-parametric models have great accuracy, they
tend to be very complex for all but the most trivial data
sets. This makes it difficult to directly discover associations
between input attributes and output. Even decision trees become unintelligible once they have thousands of nodes. In
other words, the model behaves like a blackbox. Scientists
therefore compute low-dimensional summaries to extract and
visualize “what the model has learned.” All such summaries
are based on predictive “experiments”, calculated across a series of systematic, structured predictions from the exploratory
model. Most often, these analyses begin by investigating how
each attribute, in isolation, affects the output. More detailed
investigations deal with the joint effect of attribute sets, called
“interactions”. Importantly, the number of model evaluations
required increases dramatically when we investigate high-order interactions. Thus, a thorough investigation of even
a relatively small analysis may require a huge number of
summaries, and hence model predictions, to be computed
to generate useful scientific knowledge. In fact, summary computation is the bottleneck of the workflow described in Figure 1, dwarfing other costs such as model training.
The final step in the workflow, confirmatory analysis, is
used to refine the results from exploratory analyses and
strengthen the basis for inference, e.g., by taking into account
the estimation error. In practice, error and confidence are
estimated with resampling techniques. This exacerbates the
computational bottleneck associated with summary generation.
This model-summary problem is not specific to bird ecology. It is relevant to any scientific use of predictive modeling
or supervised learning. As programs like NSF’s DataNet are
poised to create massive repositories with petabytes of data
from many scientific disciplines, exploratory analysis as discussed here will become an indispensable tool, and achieving
scalability will be crucial.
In this paper we make the following contributions:
• We introduce the general model-summary computation
problem for complex data mining models and identify
common structure in the workloads for generating model
summaries. (Sections II and III)
• We argue that any solution has to be tailored to the
model type and propose algorithms that take advantage
of the workload structure for speeding up summary
computations in tree-based models, including state-of-the-art ensembles. (Sections IV, V and VI)
• We evaluate our algorithms using models built from real-world data and show impressive speedups in computing large sets of summaries. (Section VII)
Section VIII discusses algorithm extensions, Section IX discusses related work, and Section X concludes the paper.
II. EXAMPLES
To illustrate the problem, we discuss a toy example and
then show how summaries are used in a real-world study. Both
examples are from bird ecology, but it is easy to see how they
generalize to other domains, e.g., analysis of medical records.
A. Toy Example
Assume a scientist has trained a model F(Elev, Year, Hpop), which for a given combination of elevation (e ∈ Elev), year (y ∈ Year), and human population density (h ∈ Hpop) can accurately predict the probability of observing some bird species of interest.^1
Now the scientist would like to study how bird occurrence
is associated with Year by generating a plot like the right one
in Figure 2. In general, there is no perfect way of summarizing
a high-dimensional function with a lower-dimensional one.
Some information will inevitably be lost, no matter which
method we choose. However, all accepted methods follow the
same fundamental principle of experimental design: To study
the dependency of the output on a set of variables, one varies
only the values of these variables, while holding all other
variables constant. The two basic approaches for generating
a summary are:
Non-aggregate summaries: The most fine-grained way
of studying the effect of Year on bird occurrence is the
following. We pick a pair (e1, h1) ∈ Elev × Hpop and then compute F_{e1,h1}(Year) = F(e1, Year, h1). In particular, if we want to visualize the effect for the years 1994 to 2004, then we evaluate the model for the points (e1, 1994, h1), (e1, 1995, h1), ..., (e1, 2004, h1). Hence the summary consists of the 11 points (1994, F(e1, 1994, h1)), (1995, F(e1, 1995, h1)), ..., (2004, F(e1, 2004, h1)). We can do the same for many different pairs (ei, hi) ∈ Elev × Hpop.
Aggregate summaries: Looking at Year-summaries for
many different elevation-human population pairs will give a
very detailed picture of the statistical association between
Year and bird occurrence. However, scientists usually prefer
to aggregate many of these summaries. Since the data is
high-dimensional, it tends to be sparse and hence aggregate
summaries are usually trusted more. And aggregating summaries also reduces the amount of information that needs to
be examined.
An aggregate summary is produced by averaging multiple non-aggregate summaries. In our example, for a set of elevation-human population pairs {(ei, hi)}_{i=1}^n, the aggregate summary is computed as (1/n) Σ_{i=1}^n F_{ei,hi}(Year). Stated differently, for the year 1994 the function value in the summary is (1/n) Σ_{i=1}^n F(ei, 1994, hi); similarly for the other years.
Which type of summary (aggregate versus non-aggregate) and for which pairs (ei, hi) to generate the summary on Year is largely a function of the research questions of interest. Visualizations (or plots) are a convenient way to present summaries.

^1 We slightly abuse notation by using the same symbol for both an attribute name and the set of all values of this attribute, e.g., y ∈ Year means that y is a value of attribute Year.

Fig. 2. Summaries of a complex bird occurrence prediction model.
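To make the two kinds of summaries concrete, the following minimal sketch computes a non-aggregate and an aggregate Year summary by brute force, exactly as defined above. The Model interface, the model body, and the (ei, hi) values are hypothetical placeholders, not part of the implementation evaluated later in the paper.

import java.util.List;

public class ToySummary {
    // Hypothetical stand-in for a trained model F(Elev, Year, Hpop).
    interface Model { double predict(double elev, int year, double hpop); }

    public static void main(String[] args) {
        Model f = (e, y, h) -> 1.0 / (1.0 + Math.exp(-(0.01 * e - 0.05 * (y - 1994) - 0.001 * h)));
        int[] years = {1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004};
        // Non-summary attribute combinations (e_i, h_i), i.e., the dataset D projected on {Elev, Hpop}.
        List<double[]> d = List.of(new double[]{120, 50}, new double[]{300, 800}, new double[]{950, 2000});

        // Non-aggregate summary: fix one pair (e_1, h_1) and vary only Year.
        double[] first = d.get(0);
        for (int y : years) System.out.println(y + " -> " + f.predict(first[0], y, first[1]));

        // Aggregate summary: average the non-aggregate summaries over all pairs in D.
        for (int y : years) {
            double sum = 0;
            for (double[] eh : d) sum += f.predict(eh[0], y, eh[1]);
            System.out.println(y + " -> " + sum / d.size());
        }
    }
}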
B. Real Example
Consider again the example from bird ecology where ornithologists would like to analyze bird observation records.
As mentioned earlier, by joining the observations with other
datasets about habitat, climate, etc., each observation record is
described by thousands of attributes. Data mining techniques
can produce highly accurate models, but often these models are
unintelligible and do not reveal statistical associations directly.
To understand what the model has learned, scientists rely on
low-dimensional summaries like those discussed for the toy
example above. Partial dependence plots are one particularly
popular type of aggregate summaries [3], [5], [6], [7].
Figure 2 presents an example of two 1-dimensional partial
dependence plots. They show the estimated probability of occurrence of a bird species at feeders in some Bird Conservation
Region (BCR)^2, for a selected summary attribute. The left plot
is for the Acorn Woodpecker in California, showing a drop in
the probability of Acorn Woodpecker occurrence as human
population density increases above 1,000 people per square
mile. It is hypothesized that habitat competition between the
woodpecker, which needs dead or dying branches to store
acorns, and humans who remove these branches could be
the cause for this decline. The plot on the right shows the
biennial winter irruptive migration of Common Redpoll into
New England, likely caused by biennial cycles of production
of tree seeds in Northern Canada. Interestingly, Purple Finch
shows a similar biennial pattern in BCR 14, but its highs and
lows are exactly the opposite compared to Common Redpoll.
This hints at a biological process driven by availability of
certain food sources and competition for similar habitat.
As the example indicates, interesting patterns could be
observed for various species, attributes, and regions. Scientists
therefore would like to search across many different species,
variables (attributes), and regions to see if there are any interesting patterns like those in Figure 2. We are currently working on a search engine for such model summaries. It will enable scientists to express their preferences, e.g., to find summaries showing a strong effect (measured as the difference between max and min value in the summary), and then return a ranked list of summaries according to these preferences. However, to make such a pattern search engine useful, we first have to create a large collection of these summaries.

^2 BCRs correspond to large geographical regions in North America.
Creating summaries is an expensive process, even for a
small dataset. Assume we have 1000 attributes that are potentially interesting. Hence there are (1000 choose 1) + (1000 choose 2) ≈ 500,000
different 1- and 2-dimensional summaries. To produce a plot
like those in Figure 2, we need to evaluate the model for
sufficiently many values of the summary attribute (the one
on the x axis), at least 10. And each point in an aggregate
summary is obtained by appropriately averaging over many
combinations of data points, typically 1000 or more, to take
the average contribution of other variables into account. To
discover regional trends, not only for geographical regions,
but also for say certain elevation ranges, human population
ranges, or temperature ranges, this analysis is done for many
“slices” and “dices” of the data space, i.e., various selections
of the original data. At the very least, thousands of such
selections are typically explored. This results in a total of at
least 500,000 · 10 · 1000 · 1000 = 5 · 10^12 model evaluations. At
the optimistic estimate of 1 microsecond per evaluation, this
adds up to about 2 months of computation time.
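As a sanity check: 5 · 10^12 evaluations × 10^-6 seconds per evaluation = 5 · 10^6 seconds ≈ 58 days, i.e., roughly two months.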
For larger data, more complex models, and more ambitious
studies, we found that the naive method of creating summaries is computationally infeasible. On top of that, scientists
cannot rely on studying a single model. Correlated attributes
distort the results and noise affects model structure. Hence
during confirmatory analysis, scientists explore how summaries vary when different projections of the data are studied
(eliminating some of the correlated attributes), and different
samples are used for training the models. In short, without
dramatically speeding up summary computation, scientists are
limited to small-scale studies or poor approximations.
In theory summaries like those in Figure 2 could also be
obtained by training a model directly on the low-dimensional
space, i.e., a projection of the dataset on the attributes of
interest. However, this usually results in poor models and
hence low-quality summaries, because variables that do not
appear in the summary can still have a significant influence on
the output. Since these would be projected away, their effects
cannot be learned by the model.
III. THE MODEL-SUMMARY PROBLEM
A. Terminology
We will use the terms attribute and variable interchangeably
throughout the paper. Like in a database, an attribute describes
a property of a data record. At the same time, this attribute
corresponds to a variable in a statistical or data mining model.
We will refer to those attributes that the scientist wants to
visualize with a model summary as the summary attributes.
The remaining attributes in the model are the non-summary
attributes of this summary. In the toy example of a summary on Year, Year is the summary attribute, while Elev
and Hpop are the non-summary attributes. Similarly, if the
scientist wanted to study the combined effect of Year and
Hpop on bird occurrence, she would choose these two as the
summary attributes, while Elev would be the non-summary
attribute. The corresponding summary plot would show a two-dimensional function surface.
We refer to the values of the summary attributes at which
the model is evaluated as the visualization points. In the right
summary in Figure 2 on Year, all year values between 1994
and 2004 were selected as the visualization points.
Let X = {X1, X2, ..., X|X|} be a set of |X| attributes with domains dom1, dom2, ..., dom|X|, respectively. With D = {x(1), x(2), ..., x(|D|)}, where all x(i) ∈ dom1 × dom2 × ... × dom|X|, we denote a dataset from the input domain. Let Y be the output with domain domY and let F : dom1 × dom2 × ... × dom|X| → domY be a data mining model. Model F maps |X|-dimensional vectors x = (x1, x2, ..., x|X|) of input attribute values to the corresponding output value F(x).
B. Problem Definition

We first formalize the notion of a summary and then define the model-summary problem.

Definition 1: Let X, F, and D be as defined above and let S ⊂ X and S̃ = X − S be the sets of summary and non-summary attributes, respectively. Let VS ⊆ ⨉_{Xj ∈ S} domj denote the set of visualization points. The summary of F on S and VS is defined as

{ (xS, F̂S̃(xS)) | xS ∈ VS, F̂S̃(xS) = (1/|D|) Σ_{i=1}^{|D|} F(xS, xS̃(i)) }    (1)

where xS̃(i) = πS̃(x(i)) is the projection of the i-th data record in D on the attributes in S̃.

Notice that for |D| = 1 we obtain a non-aggregate summary, while for |D| > 1 it is an aggregate summary (see Section II-A). Depending on the choice of points in dataset D, aggregate summaries with different properties can be generated. For example, if D is the set of data points that was used for training model F, then the summary is called a partial dependence function [5]. Another popular choice for D is to use points from a regular grid of non-summary attribute value combinations.

The different choices of D affect the summary properties, but these are irrelevant from our point of view. Our techniques support all variations of summary definitions. We can now define the model-summary problem.

Definition 2: Let X, F, and D be as defined above and let P = {p1, p2, ..., p|P|} be a set of summaries (“plots”). Each summary pi, 1 ≤ i ≤ |P|, is defined by its set of summary attributes pi.S ⊂ X and a set of visualization points pi.VS ⊆ ⨉_{Xj ∈ pi.S} domj. The model-summary computation problem is to compute all summaries in P efficiently and to scale to large problems.

Algorithm 1 : Naive Algorithm
Input: model F(X), summary attributes S, non-summary attributes S̃, dataset D, visualization points VS
1: for all vS ∈ VS do
2:   sum = 0
3:   for all x ∈ D do
4:     sum = sum + F(vS, πS̃(x))
5:   output (vS, sum/|D|)

C. Summaries for Blackbox Models

The naive algorithm, outlined in Algorithm 1, computes a single summary. It directly implements the summary definition and is the only algorithm currently available for this problem. (There is also an approximation algorithm for a restricted version of the problem, which we discuss in Section IX.) For each visualization point vS ∈ VS, the naive algorithm iterates through all data points x ∈ D and evaluates the model for each query point obtained by combining vS with the appropriate projection of x on the non-summary attributes. In our toy example for the Year summary with visualization points 1994 to 2004 and data set {(ei, yi, hi)}_{i=1}^n, we query the model with (e1, 1994, h1), ..., (e1, 2004, h1), then (e2, 1994, h2), ..., (e2, 2004, h2), and so on.

In fact, if the model is a true blackbox, then the naive algorithm is the only option: even if two query points are similar, their model predictions can be very different. Hence one cannot obtain the exact model summary without actually evaluating the model for each individual point. Even an approximate algorithm would be problematic for a blackbox model, because there is no way to establish a bound on the similarity of predictions based on the similarity of the input values without actually generating the predictions.

Stated differently, any improvement over the naive algorithm has to take advantage of the internal structure of the data mining model.

D. Workload Properties

Since data mining models, though complex, are not true blackboxes, there are opportunities for designing algorithms with lower cost than the naive one. The following workload properties can be exploited:

Repetitive structure among query points: Algorithm 1 evaluates the model for all query points in VS × πS̃(D), i.e., the cross product of the set of visualization points with the S̃-projected set of data points. Hence for each vS ∈ VS, there are |D| query points that all have the same value vector vS for the summary attributes. Similarly, for each x ∈ D, there are |VS| query points that all have the same value vector πS̃(x) for the non-summary attributes. This creates a potential for sharing computation across query points.

Aggregation: For a given visualization point vS, its summary output is the average of the model predictions for all query points in {vS} × πS̃(D). Rather than first computing each individual prediction and then averaging them, aggregation could be “pushed into the model”.
Fig. 3. Example tree and dataset D

D:   X1  X2  X3
      2   7   4
      4   2   5
      8   1   8
Inter-summary commonality: Multiple summaries for the
same model can have non-summary or summary attributes in
common. This provides additional opportunities for sharing
computation across summaries.
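Before turning to tree-specific techniques, the following compact sketch shows a direct (naive) implementation of Definition 1 and Algorithm 1 for an arbitrary blackbox model; the Model interface and the index-based projection are illustrative assumptions rather than code from our system. It makes the |VS| · |D| model evaluations explicit.

import java.util.List;

public class NaiveSummary {
    // Blackbox model: full attribute vector in, prediction out.
    interface Model { double predict(double[] x); }

    // Computes the aggregate summary value F̂S̃(vS) for every visualization point in vsPoints.
    // summaryIdx[i] is the position of the i-th summary attribute in the attribute vector, so
    // combining vS with πS̃(x) simply overwrites those positions of a copied data record.
    static double[] summary(Model f, List<double[]> d, List<double[]> vsPoints, int[] summaryIdx) {
        double[] out = new double[vsPoints.size()];
        for (int v = 0; v < vsPoints.size(); v++) {
            double sum = 0;
            for (double[] x : d) {
                double[] q = x.clone();                      // keeps the non-summary values of x
                for (int i = 0; i < summaryIdx.length; i++)  // overwrite the summary attributes with vS
                    q[summaryIdx[i]] = vsPoints.get(v)[i];
                sum += f.predict(q);                         // one model evaluation per query point
            }
            out[v] = sum / d.size();
        }
        return out;
    }
}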
IV. SUMMARY COMPUTATION IN TREES
As discussed in Section III-C, we can only improve over
the naive algorithm if we work with the internal structure
of a model. In this paper we focus on tree-based models,
because they are among the most popular models in practice
for several reasons. First, trees can handle all attribute types
and missing values. Second, the split predicates in tree nodes
provide an explanation why the tree made a certain prediction
for a given input. Third, tree models like bagged trees, boosted
trees, Random Forests, and Groves are among the very best
prediction models for both classification and regression problems [8], [9]. Fourth, they are perfectly suited for explanatory
analysis because they work well with fairly little tuning. We
briefly introduce trees and show how to take advantage of their
structure to exploit the observations discussed in Section III-D.
The algorithms will be more formally introduced in Section V.
A. Tree Models
Classification and regression trees are some of the oldest and
most popular predictive models [10]. A tree model partitions
the data space recursively, attempting to achieve partitions
with high purity, low mean squared error, or similar goals.
Each non-leaf node in the tree splits the data space on some
attribute; the leaves contain predictions for points that fall into
the corresponding region of the data space.
Figure 3 shows an example tree for attributes X1 , X2 , and
X3 . The root corresponds to the entire data space. It contains a
split predicate on X1 (X1 < 3). The root has two children. The
left child corresponds to the “half” of the data space containing
all records with X1 < 3, while the right child corresponds
to the other “half” of the data space with records satisfying
X1 ≥ 3. Partitioning continues recursively at the children,
which can divide their respective sub-spaces further by similarly
splitting on any attribute. The leaf predictions in the example
are some constants a, b, c, d, and e. Nodes can have more
than two children. In the rest of the paper, we will refer to a
node in a tree model as nde, and its children as nde.chld1 ,
nde.chld2 and so forth.
When making predictions for a point x, the tree is traversed
from the root. At each node, the split predicate is evaluated for
x. This evaluation, which we will refer to as nde.TestSplit(x)
in the rest of the paper, returns the appropriate child where
the traversal continues recursively until a leaf is reached. For
example, in the tree of Figure 3, if x = (1, 1, 2), then the
predicate evaluation at the root results in the traversal of the
left child (X2 < 5), after which the prediction a is returned.
Since trees are well-known, we omit a detailed discussion and
refer the reader to Breiman et al. [10].
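To make the traversal concrete, here is a minimal sketch of a binary tree with univariate splits and constant leaf predictions, together with a TestSplit-style prediction routine. Class and field names are illustrative assumptions; the example tree of Figure 3 could be encoded with such nodes.

public class SimpleTree {
    static class Node {
        int splitAttr;      // index of the attribute tested at this node
        double threshold;   // univariate split predicate: x[splitAttr] < threshold
        Node left, right;   // children; both null for a leaf
        double prediction;  // constant prediction stored in a leaf

        boolean isLeaf() { return left == null && right == null; }

        // TestSplit: evaluate the split predicate and return the child where traversal continues.
        Node testSplit(double[] x) { return x[splitAttr] < threshold ? left : right; }

        // Traverse from this node to a leaf and return its prediction.
        double predict(double[] x) {
            Node nde = this;
            while (!nde.isLeaf()) nde = nde.testSplit(x);
            return nde.prediction;
        }
    }

    public static void main(String[] args) {
        // Tiny example: the root splits on X1 < 3; the two leaves predict constants.
        Node a = new Node(); a.prediction = 0.2;
        Node e = new Node(); e.prediction = 0.9;
        Node root = new Node(); root.splitAttr = 0; root.threshold = 3; root.left = a; root.right = e;
        System.out.println(root.predict(new double[]{1, 1, 2}));   // follows the left branch
    }
}

An ensemble prediction is then simply the sum or average of predict over all trees in the ensemble.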
Trees work well for all types of prediction problems, but
the predictive performance of single tree models usually is
not competitive with more recent machine learning techniques. This disadvantage has been eliminated by ensemble
methods like bagged trees [11], boosted trees [12], Random
Forests [13], and Groves [8]. These ensembles consist of
many trees and make predictions by adding and/or averaging
predictions of all trees in the ensemble. Our techniques can
be applied to all these tree ensembles.
Many variations of trees have been proposed, including
some with multivariate splits (split predicates on more than one
variable) and with non-trivial prediction functions in the leaves
(rather than a constant value). With the advent of ensemble
methods, these more “exotic” trees are rarely used because
they (1) are much harder to train and (2) do not necessarily produce better models. Furthermore, even the simple ID3 tree
can represent any finite discrete-valued function [14]. For all
our algorithms, we will therefore focus on trees with univariate
splits and constant predictions in the leaves. In Section VIII
we outline how our algorithms can be extended to the more
complex tree types.
B. Sharing Computation
We show how to speed up computation by leveraging
tree structure together with the workload properties discussed
in Section III-D. We will focus on single trees; all ideas
extend to ensembles by applying them to each tree in the
ensemble individually. For ease of presentation, the techniques
are explained for a concrete example. The general algorithms
are discussed in Section V.
Short circuiting: Recall that to compute a summary, we have to evaluate the model for all points in VS × πS̃(D). (Notice that projection here returns a multi-set!) We can rewrite this cross-product as ∪_{x(i)∈D} VS × πS̃(x(i)). In VS × πS̃(x(i)), the same set of non-summary attribute values, πS̃(x(i)), occurs |VS| many times. To avoid duplicate computation, we can “compress” the original tree for a given πS̃(x(i)) as follows.
Consider a tree node nde that splits on a non-summary
attribute X̃ ∈ S̃. Since all query points in VS × πS̃ (x(i)) have
the same value πX̃ (x(i)), the result of nde.TestSplit(x) will be
the same for all of them. Stated differently, whenever we reach
node nde during tree traversal for any point in VS × πS̃ (x(i)),
the traversal will continue with the same child every time. To
avoid repeated split predicate evaluations, we can hard-code
this traversal path with a short-circuit pointer. This pointer
connects the parent of nde directly to the appropriate child of
nde, effectively removing nde from the tree and pruning away
all other sub-trees of nde. Short-circuiting can be applied to
all tree nodes that split on non-summary attributes.
Fig. 4. Retained nodes for summary on X1 and single-point shortcircuit tree for (X2, X3) = (7, 4)
To illustrate the idea, consider the example tree in Figure 3
and assume we want to compute a summary on S = {X1 } for
dataset D, shown in the same figure. Now consider the set of
query points for the first point in D, (X1 , X2 , X3 ) = (2, 7, 4).
This set of query points is VX1 × {(7, 4)}. Predicates X2 < 5
and X3 < 7 will evaluate to false and true, respectively,
for each of these query points. Hence we can compress
the original tree to obtain the one shown on the right in
Figure 4. This compressed tree can now be traversed for every
visualization point x1 ∈ VX1 .
We will refer to such a tree as a single-point shortcircuit
tree, because it was created based on a single point from D.
For the other points in D, we obtain similar trees. For example,
for the last point (8, 1, 8), the single-point shortcircuit tree is a
stump, consisting of the node with predicate X1 < 3, pointing
directly to leaves a (left child) and e (right child).
Instead of short-circuiting a tree on the non-summary attributes, one could alternatively short-circuit on the summary
attributes. However, in practice |S| ≪ |S̃|, because scientists
usually care about summaries for visualization (|S| = 1 or
|S| = 2). This implies that short-circuiting on non-summary
attributes will usually result in better tree compression.
Aggregating shortcircuit trees: Aggregate summaries are computed using terms of the form (1/|D|) Σ_{i=1}^{|D|} F(xS, xS̃(i))
for each visualization point xS . With shortcircuit trees as
described above, we would run point xS through each of the
|D| single-point shortcircuit trees obtained for the points in D,
then compute the sum of the individual predictions. A much
faster algorithm for computing the same value is based on the
following observation.
For the sake of simplicity, we will continue the discussion
for the concrete example of a summary on X1 . It is easy
to see how it generalizes. We can show formally that every
single-point shortcircuit tree for the summary on X1 satisfies
the following properties: (1) All nodes with split predicates on
non-summary attributes (X2 and X3 ) are effectively eliminated
(dotted nodes on the left side in Figure 4). (2) Some leaves
and some non-leaf nodes with split predicates on summary
attributes (X1 ) are retained, each connected directly through
a short-circuit pointer to its closest ancestor that splits on a
summary attribute. Stated differently, each shortcircuit tree for
the summary on X1 consists of a subset of the bold nodes as
marked on the left in Figure 4 and whenever a certain node nde
is retained in a single-point shortcircuit tree, it is connected
to the same ancestor node.
Fig. 5. Multi-point shortcircuit tree for running example
From these observations it follows that all shortcircuit trees
for a given summary can be equivalently represented by a
single tree whose nodes are a subset of the original tree’s nodes
(in particular its leaves and the nodes that split on summary
attributes). Nodes are connected through short-circuit pointers
such that each node is directly connected to its closest ancestor
that splits on a summary attribute. In addition, for each shortcircuit pointer there is a counter that indicates how many of the
individual shortcircuit trees contained this pointer. Traversing
this tree with a visualization point xS ∈ VS, we directly obtain Σ_{i=1}^{|D|} F(xS, xS̃(i)) by returning the sum of the predictions of all leaves reached, weighted by the count value of the short-circuit pointer pointing to the leaf.
We will refer to this single tree that represents all |D| single-point shortcircuit trees for a given summary as a multi-point
shortcircuit tree for this summary. As an optimization, we
replace a set of short-circuit pointers to leaves in the same
sub-tree by the corresponding weighted sum for this sub-tree
and store this sum directly in the tree node. Figure 5 shows
the corresponding multi-point shortcircuit tree for the running
example. For example, the node with predicate X1 < 3 would
have two left (“true” branch) short-circuit pointers, one to leaf
a with weight 2 (from the single-point trees for points (4, 2, 5)
and (8, 1, 8)) and one to leaf b with weight 1 (from the single-point tree for point (2, 7, 4)). These are replaced by value 2a+b
in the node.
For convenience, in the rest of the paper whenever we
refer to a shortcircuit tree, unless mentioned otherwise, it
refers to a multi-point shortcircuit tree.
Inter-summary structure: The techniques discussed so far
eliminate repetitive work in the computation of a single
summary. In the model-summary problem as introduced in
Section III, a scientist might request a large set of summaries,
P , for a given dataset D. In this case, any pair of summaries
{pi , pj } ∈ P can share computation on attributes in X −
{pi .S ∪pj .S}. For example, for X = {X1 , X2 , . . . , X100 }, the
summaries (X1 , X2 ) and (X1 , X3 ) can share computation on
attributes X4 , . . . , X100 . We share computation by generating
the multi-point shortcircuit trees for all summaries in a single
tree traversal.
Algorithm 2 : PointComputeOutput
Input: tree node nde, visualization point vS
1: chld = nde.TestSplit(vS)
2: sum = chld.res_pred
3: for all nde′ ∈ chld.shckts do
4:   sum = sum + PointComputeOutput(nde′, vS)
5: return sum
V. ALGORITHMS
We first introduce shortcircuit trees more formally and then
present algorithms for creating and querying such trees. For
ease of presentation, in the following discussion we will
assume that there are no missing values in the set D. Support
for missing values will be addressed in Section VIII.
A. Shortcircuit Tree Structure
Let T be a given tree for which we want to compute a
summary on S. The corresponding multi-point shortcircuit tree
is TS . Let nde be a node of T that splits on a summary
attribute. In TS this node maintains two types of information
about each subtree rooted at its children. Let chld be a
child of node nde. The first type of information is an array
of shortcircuit pointers, called shckts. Each pointer in this
array points to a node in the subtree rooted at chld with
the following properties: (1) the node splits on a summary
attribute and (2) none of the node’s ancestors that are also
descendants of node nde splits on a summary attribute. In
general, there can be 0 or more pointers in shckts, depending
on the tree structure. The second type of information for a
subtree rooted at chld is a residual prediction, called res pred,
which is the sum of the predictions for all points in πS̃ (D) that
traversed this subtree and reached a leaf without encountering
a node that splits on a summary attribute. (See Figure 5 for
an example.)
If the root node of T splits on a non-summary attribute, then
TS also has a new root node with the trivial split predicate
“true”. It maintains an array of shortcircuit pointers and a
residual prediction computed for the entire tree T, as explained
above for nodes that split on summary attributes.
For each visualization point vS, we compute Σ_{i=1}^{|D|} F(vS, xS̃(i)) by traversing the shortcircuit tree using Algorithm 2, starting the traversal at the root. The algorithm determines the appropriate subtree by evaluating the split predicate (line 1); it then recursively traverses all shortcircuit pointers for this subtree. For all accessed nodes, their residual predictions are added.
For a given set of summaries P , the corresponding shortcircuit tree is equivalent to the set of per-summary shortcircuit
trees, but all of them merged together into a single structure.
More precisely, a node has not just a single (shckts, res pred)
pair. It now has an array of these entries, one array element
for every summary that contains the split attribute of the node
as a summary attribute.
Algorithm 3 : ShortCircuitTree
Input: tree T, summary set P, dataset D
1: Create new root node new_root
2: for all x ∈ D do
3:   new_root.Update( ShortCircuitNode(T, P, x) )
4: return new_root

Algorithm 4 : ShortCircuitNode
Input: tree node nde, set of active summaries Pnde, data record x
1: if nde is a leaf then
2:   for all p ∈ Pnde do
3:     op[p] = ⟨null, nde.prediction⟩
4: else
5:   P^s_nde = {p ∈ Pnde | nde.splitAttribute ∈ p.S}
6:   for all children chld of node nde do
7:     if chld == nde.TestSplit(x) then
8:       /* Pass all summaries to the child for which the node predicate is satisfied. */
9:       op = ShortCircuitNode(chld, Pnde, x)
10:      for all p ∈ P^s_nde do
11:        chld[p].shckts.add( op[p].shckts )
12:        chld[p].res_pred.add( op[p].res_pred )
13:        op[p] = ⟨nde, null⟩
14:    else /* Pass only summaries that have nde's split attribute as a summary attribute to all other children. */
15:      op_tmp = ShortCircuitNode(chld, P^s_nde, x)
16:      for all p ∈ P^s_nde do
17:        chld[p].shckts.add( op_tmp[p].shckts )
18:        chld[p].res_pred.add( op_tmp[p].res_pred )
19: return op
B. Generating Shortcircuit Trees
Algorithms 3 and 4 describe the pseudocode for generating
the multi-point shortcircuit trees for a set P of summaries. For
each record in D, they perform a single traversal of the tree
(instead of |P | traversals). During this traversal the pointer
structure and residual prediction values for all summaries in
P are generated. We first discuss Algorithm 4 which performs
operations at a single node nde in the tree.
To better understand Algorithm 4, let us for now assume
that P contains only a single summary on S. In this case the
output (called op in the pseudocode) is a single pair consisting
of a short-circuit pointer to a node in nde’s subtree (possibly
nde itself) and a prediction value. Exactly one of them is null.
Now let us examine how the output is computed.
If nde is a leaf, then the algorithm returns the node’s
prediction value and null for the pointer (lines 1–3). If nde
is not a leaf, there are two cases. Case 1: If nde splits on a
summary attribute, then we need to recursively traverse all its
children. For each child, this traversal returns either a shortcircuit pointer or a residual prediction value for this subtree
(if no node splitting on a summary attribute was accessed
in the subtree). The returned pointer or prediction value is
stored in nde.chld for the corresponding subtree. Since nde
splits on a summary attribute, its nearest ancestor that splits
on a summary attribute should point to it. Hence the algorithm
returns op as a pair containing a pointer to nde and null for
the residual prediction value. Case 2: If nde splits on a non-summary attribute, then we only traverse that child for which
the split predicate evaluates to true (nde.TestSplit(x)). This
recursive call returns either a pointer or a residual prediction
as described before, but since nde splits on a non-summary
attribute, nde is conceptually deleted from the tree and hence
does not store any short-circuit pointers or residual predictions
itself. Instead, it returns what it received from its subtree to
its ancestor.
Algorithm 4 implements this procedure for an entire set of
summaries together during a single tree traversal. The set of
active summaries encodes the set of all summaries that reach
node nde. Notice also the branches in lines 9–13 and 15–18.
For the child that was selected by the split predicate evaluation,
all summaries that were active at nde remain active. For
the other children, only those summaries for which the split
attribute is a summary attribute will be active. Similarly, as
lines 10 and 16 indicate, we only update nde’s pointers and
residual prediction values for those summaries that contain
nde’s split attribute as a summary attribute. Line 13 ensures
that for these summaries, we return a short-circuit pointer to
nde to nde’s ancestors. For all other summaries, i.e., those
for which nde’s split attribute is a non-summary attribute,
the algorithm simply passes up the call chain the pointer or
residual prediction value it received from traversing the subtree
rooted at the child that was selected by the split predicate
evaluation.
Algorithm 3 calls Algorithm 4 for every point in D on the
root node of tree T with the active set of summaries set to P .
The return array op from the call to Algorithm 4 is used to
update the shortcircuit pointers and residual prediction for the
root of the shortcircuit tree for each summary in P (line 3 in
Algorithm 3).
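To illustrate the construction and querying end to end, the following self-contained sketch handles the special case of a single summary, binary splits, no missing values, and a root that splits on a summary attribute (the paper adds a trivial root node otherwise). The per-child residual predictions and short-circuit pointers are stored directly in the original nodes, and all names are illustrative; this is a sketch of the idea behind Algorithms 2-4, not the implementation evaluated in Section VII.

import java.util.ArrayList;
import java.util.List;

public class ShortCircuitSketch {
    static class Node {
        int attr; double thr;            // univariate split: x[attr] < thr goes left
        Node left, right;                // both null for a leaf
        double pred;                     // constant leaf prediction
        // Shortcircuit annotations for one summary, kept per branch (0 = left, 1 = right):
        double[] resPred = new double[2];
        List<List<Node>> shckts = new ArrayList<>();
        Node() { shckts.add(new ArrayList<>()); shckts.add(new ArrayList<>()); }
        boolean isLeaf() { return left == null; }
        Node child(int c) { return c == 0 ? left : right; }
    }

    // Result of visiting a subtree for one data record: either a pointer to the nearest
    // retained node (one that splits on a summary attribute) or a residual prediction value.
    static class Result { Node ptr; double val; Result(Node p, double v) { ptr = p; val = v; } }

    // Single-summary analog of Algorithm 4: one traversal of the tree per data record x.
    static Result build(Node nde, boolean[] isSummaryAttr, double[] x) {
        if (nde.isLeaf()) return new Result(null, nde.pred);
        int taken = x[nde.attr] < nde.thr ? 0 : 1;
        if (!isSummaryAttr[nde.attr]) {
            // Non-summary split: the node is effectively removed; pass the result through.
            return build(nde.child(taken), isSummaryAttr, x);
        }
        // Summary split: the node is retained and both subtrees must be explored.
        for (int c = 0; c <= 1; c++) {
            Result r = build(nde.child(c), isSummaryAttr, x);
            if (r.ptr != null) {
                if (!nde.shckts.get(c).contains(r.ptr)) nde.shckts.get(c).add(r.ptr);
            } else {
                nde.resPred[c] += r.val;   // accumulate residual predictions over data records
            }
        }
        return new Result(nde, 0);
    }

    // Analog of Algorithm 2: query the annotated tree with a visualization point vS.
    static double pointComputeOutput(Node nde, double[] vS) {
        int c = vS[nde.attr] < nde.thr ? 0 : 1;
        double sum = nde.resPred[c];
        for (Node next : nde.shckts.get(c)) sum += pointComputeOutput(next, vS);
        return sum;
    }

    // Analog of Algorithm 3 plus the final averaging: annotate once per record in D,
    // then evaluate the summary at each visualization point on the compressed structure.
    static double[] summary(Node root, boolean[] isSummaryAttr, List<double[]> d, List<double[]> vs) {
        for (double[] x : d) build(root, isSummaryAttr, x);
        double[] out = new double[vs.size()];
        for (int i = 0; i < vs.size(); i++) out[i] = pointComputeOutput(root, vs.get(i)) / d.size();
        return out;
    }
}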
C. Algorithm Analysis
Single summary computation. Let T be an ensemble of trees
and n(T ) denote the total number of tree nodes in the ensemble. The naive algorithm queries each tree in the ensemble for
each point in VS × πS̃ (D). Its runtime is O(|VS | · |D| · n(T ))
and it needs O(n(T )) space. With single-point shortcircuit
trees, for each point in D, we create a short-circuit tree with
a single traversal, then query the smaller tree. Hence total
computation time is O(|D| · n(T ) + |VS | · |D| · n(T ′ )), where
usually n(T ′ ) ≪ n(T ). If a shortcircuit tree ensemble for
a single point is discarded before the next one is generated,
space cost is O(n(T ) + n(T ′ )) = O(n(T )).
A multi-point shortcircuit tree ensemble is constructed by
traversing T for each point in πS̃ (D), then predictions are
made by running each point in VS through it. As we discussed
earlier, a multi-point shortcircuit tree cannot have more nodes
than the original tree, no matter how big D is. This results in
a total computation cost of O(|D|·n(T )+ |VS |·n(T ′ )), where
usually n(T ′ ) ≪ n(T ). This is a dramatic improvement over
the naive algorithm, essentially reducing cost by a factor in
the order of |VS| or |D|, depending on which term dominates.
Space cost is still low at O(n(T ) + n(T ′ )) = O(n(T )).
Multiple summaries. If |P| summaries are computed one-by-one, the above costs are |P| times higher. When computing all
summaries together in a single tree traversal, the asymptotic
cost is the same. However, this algorithm evaluates node
predicates for each point in D exactly once. And for a point in
D, it makes all updates to short-circuit pointers and residual
predictions in one visit to a node. This results in significant
cost savings in practice.
Desirable properties. Our short-circuiting based algorithms
have several important properties. First, as points are added
to or deleted from data set D, it is easy to incrementally
maintain a multi-point shortcircuit tree. Algorithm 3 already
computes the tree with a single scan of D, hence when
adding a new point to D, we just call line 3 for this point.
Similarly, when deleting a point from D, we traverse the
original tree to determine which short-circuit pointers and
nodes are affected. Then we simply decrement the counter
values for these short-circuit pointers and reduce the residual
predictions in the nodes accordingly. Second, due to their
incremental maintainability and fixed space cost, independent
of |D|, multi-point shortcircuit trees are perfectly suited for
data stream applications, i.e., where data set D is streaming.
However, a change of the original tree model would require a
re-computation of the shortcircuit tree.
Third, by decomposing summary computation into multi-point shortcircuit tree construction (which does not need visualization
points VS ) and faster prediction on that more compact tree
(only needs VS ), our algorithms are ideal for the online
version of the summary computation problem. In that version,
scientists explore a summary interactively, presenting visualization points on-the-fly.
VI. DISTRIBUTED COMPUTATION
To scale model-summary computation to realistic workloads, we have to parallelize both the construction and the
evaluation of shortcircuit trees. In this section we propose
algorithms that allow us to scale in all important input parameters of summary computation: size of the dataset (|D|), number
of summaries (|P |), number of trees in the ensemble (|T |), and
number of visualization points (|VS |) per summary. Following
common practice, we say that our algorithm is (linearly)
scalable in a parameter, if we can achieve the following:
With c times the computing resources, we can process a c
times larger job (i.e., parameter scaled up by a factor of c)
without suffering a significantly higher response time. Based
on a careful evaluation of alternatives, we determined that
MapReduce [15] would be a perfect fit for parallelizing our
approach.
A. MapReduce Overview
The MapReduce framework can be used to implement
a two-phase distributed computation on a very large input
dataset, which we denote as I. The first phase, Map, partitions I into a set of disjoint chunks. A user-specified map function is then applied to each chunk in parallel by a set of machines, called the mappers. The output of map is a set of <key, value> pairs. The second phase, Reduce, works on all the key-value pairs produced by the mappers. Conceptually, these pairs are grouped by their key; then each group of values is processed by a single reducer. This happens in parallel on many reducer machines. The output produced by all the reducers is the final output of the distributed computation.

Algorithm 5 : GeneralizedMap
Input: P′ ⊆ P, T′ ⊆ T, D′ ⊆ D
1: for all T ∈ T′ do
2:   ShortCircuitTree(T, P′, D′)
3: for all p ∈ P′ do
4:   for all vS ∈ p.VS do
5:     sum = Σ_{T ∈ T′} PointComputeOutput(T, vS)
6:     Output((p, vS), (sum, id(T′), |T′|, id(D′), |D′|))
B. Algorithms
There are three major inputs for our algorithm: P (set of
plots to be computed, including their visualization points), T
(set of tree models), and D (set of data points). Assume we
partition P into two subsets P1 ∪P2 = P , P1 ∩P2 = ∅; and we
similarly partition T into T1 , T2 and D into D1 and D2 . We
can run our algorithm on each input combination (Pi , Tj , Dk ),
where i, j, k ∈ {1, 2} and then combine the individual outputs
into the corresponding output for (P, T , D).
In general, summary computation with shortcircuit trees
can be parallelized by partitioning each input set into smaller
subsets (“chunks”), then running our sequential algorithms on
each chunk combination, and finally combining the individual
outputs. However, the Map function of MapReduce is defined
as a one-tuple-at-a-time processor for a single input set. To
work around this limitation, we define a function GeneralizedMap (Algorithm 5), which processes any combination
(P ′ , T ′ , D′ ), where P ′ ⊆ P , T ′ ⊆ T , and D′ ⊆ D.
When implementing GeneralizedMap with a Map function,
we choose the parameter we want the algorithm to scale in as
the Map input and load the other two inputs into each mapper
before executing the map function. For example, to scale in D,
the corresponding Map function would declare D as its input
(and hence the MapReduce runtime would assign a chunk of D
to each mapper node); and it would load the entire sets P and
T onto each mapper node. Through an appropriate Reducer
(Algorithm 6), the output of the MapReduce computation will
be the final summary outputs.
GeneralizedMap computes for each visualization point of
a summary in P ′ the total contribution that D′ and trees in
T ′ make to the summary output at that visualization point.
Each reduce function receives as key a (p, vS ) combination
and as values a set of partial summary outputs computed over
subsets of T and D. The reduce function performs a simple
aggregation of this set to produce the final summary output
for the (p, vS) combination. The logic for this computation is encoded in the ComputeAVG function and depends on the ensemble type. For a bagged tree ensemble, ComputeAVG would first compute the total data set size (|D|) and the number of trees in the ensemble (|T|) by adding up the various |Di′| and |Ti′|. The algorithm uses the data and ensemble chunk ids to avoid double-counting. (Hence these ids have to be included in the Map output.) Finally, ComputeAVG simply adds all the sumi values and divides the total sum by |D| · |T|. For other ensembles the computation is similar, e.g., based on weighted sums for boosted trees and additive models like Groves.

Algorithm 6 : Reduce
Input: Key = (p, vS), Values = {(sum1, id1,1, |T1′|, id1,2, |D1′|), (sum2, id2,1, |T2′|, id2,2, |D2′|), ...}
1: AVG = ComputeAVG((sum1, id1,1, |T1′|, id1,2, |D1′|), (sum2, id2,1, |T2′|, id2,2, |D2′|), ...)
2: Output((p, vS), AVG)
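A possible shape of the reducer's combine step for a bagged-tree ensemble is sketched below; the Partial record and all names are assumptions, but the logic follows the description above: chunk ids prevent double-counting when accumulating |T| and |D|, and the partial sums are finally divided by |D| · |T|.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReduceSketch {
    static class Partial {
        double sum;                          // partial summary sum computed over (T', D')
        String treeChunkId; int numTrees;    // id(T') and |T'|
        String dataChunkId; int numPoints;   // id(D') and |D'|
        Partial(double s, String tid, int nt, String did, int np) {
            sum = s; treeChunkId = tid; numTrees = nt; dataChunkId = did; numPoints = np;
        }
    }

    // ComputeAVG for bagged trees: average of all per-tree predictions over all of D.
    static double computeAvg(List<Partial> values) {
        Set<String> seenTreeChunks = new HashSet<>(), seenDataChunks = new HashSet<>();
        long totalTrees = 0, totalPoints = 0;
        double totalSum = 0;
        for (Partial v : values) {
            if (seenTreeChunks.add(v.treeChunkId)) totalTrees += v.numTrees;    // count each T' once
            if (seenDataChunks.add(v.dataChunkId)) totalPoints += v.numPoints;  // count each D' once
            totalSum += v.sum;
        }
        return totalSum / (double) (totalTrees * totalPoints);
    }
}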
VII. EXPERIMENTS
We compare the performance of our shortcircuit-tree based
algorithm against the only existing solution for the model-summary problem—the naive algorithm. This comparison was
done on a single processor. Then we demonstrate the scalability of the parallel version of our algorithm on a cluster.
We experimented with different datasets. Due to space constraints, we present results for a single real dataset from bird
ecology. These results are representative. In fact, the presented
results are for a comparably small dataset for several reasons.
(1) This demonstrates how expensive summary computation
is in practice, even for small data. (2) The larger the data, the
larger the models tend to be. Since our technique dramatically
reduces model size in the prediction phase, its performance
advantage over the naive algorithm increases with larger data.
(3) For the parallel algorithm, scalability is not affected by
larger input data.
We report results for a real bird ecology dataset obtained from the Avian Knowledge Network (www.avianknowledge.net), Project FeederWatch, that
was joined with datasets containing geographical features
of observation locations. It covers a geographical region
in North America and contains about 90,000 observation
records, described by 155 continuous attributes (e.g., time,
location, habitat features, climate, census features, elevation).
We use 60,000 records to train a model for predicting the
probability of observing the Dark-eyed Junco. The model is a
bagged tree model consisting of 10 trees and is trained using
the IND package [16] using the information gain splitting
criterion. Each tree in the ensemble had on average 10,300
non-leaf nodes. This was the best-performing tree-based
model trained on the data.
For our experiments, we use the entire 60,000 training records as the D set in summary computations. The training data had missing values on some attributes, which we filled in using values randomly selected from the attribute's domain. We verified that all leaves of the tree contained data points to guard against degenerate cases. Not handling missing values in the short-circuited trees is not an inherent limitation of our approach, as discussed in Section VIII, just a limitation in our current implementation. The choice of the actual D set does not matter much, because our experiments are only aimed at evaluating the speedups we obtain in summary computations.

TABLE I
COMPUTATION TIME (SEC): SINGLE SUMMARY, FREQUENT ATTRIBUTES
|VS|    Naive    ShCkt
100     85.0     3.02 (= 2.96 + 0.06)
400     311.5    3.17 (= 2.97 + 0.20)
625     469.8    3.29 (= 2.96 + 0.33)

TABLE II
COMPUTATION TIME (SEC): SINGLE SUMMARY, INFREQUENT ATTRIBUTES
|VS|    Naive    ShCkt
100     84.8     2.1 (= 2.1 + 0.001)
400     324.5    2.1 (= 2.1 + 0.001)
625     462.7    2.1 (= 2.1 + 0.002)
A. Single Machine Experiments
Our algorithms are implemented in Java and the experiments
reported in this section were run on a Linux machine with a
2.66GHz processor and a JVM heap size of 3GB. All reported
times are in seconds. Standard deviations in the reported times
were negligible and hence are not reported.
Naive vs. short-circuiting: We begin our evaluation by
comparing the benefits of the short-circuiting algorithm (Section V) over the naive algorithm (Section III-C). The first
summary, which we refer to as the frequent attributes summary, is on a pair of attributes that are frequently used as
split attributes in the ensemble (Table I), at a total of 23%
of all nodes (12% and 11% for the first and second summary
attribute, respectively). The second summary is on a pair of
attributes used infrequently in the ensemble (Table II), together
accounting for less than 1% of the splits. The tables report
the time taken to compute a single summary for the naive
algorithm (Naive) and the short-circuiting algorithm (ShCkt).
For short-circuiting we break total cost down into time for
generating the shortcircuit trees and time for querying these
trees and generating the summary outputs. In both cases, our
algorithm achieves significant speedup, between 30 times and
more than 200 times. The speedup increases with the number
of visualization points. As the cost breakdown shows, the cost
of generating the shortcircuit trees remains constant, while the
cost for querying the trees grows linearly with the size of
VS . As expected, for the infrequent attribute case, the trees
are compressed more and hence the model evaluation time
on the shortcircuit trees is almost zero (Table II). Notice
that shortcircuit tree construction is slightly more expensive
in the frequent attribute case. This is due to the fact that
the shortcircuiting algorithm only needs to traverse a single subtree at nodes that split on non-summary attributes, while it has to traverse both subtrees for nodes splitting on summary attributes. In the frequent attribute case there are many more nodes that split on summary attributes. Small differences in the runtime of the naive algorithm for the same number of visualization points are caused by the fact that the query points are different and the trees are not balanced.

Fig. 6. Multiple Summaries
Multiple Summaries: The last experiment in this section
measures the benefits of generating the shortcircuit trees for a
set of summaries together (MultipleSummary) rather than one
summary at a time (OneByOne). Summary workloads were
generated by sampling randomly from the set of all possible
one- and two-dimensional summaries over the attributes used
by the trees in the ensemble as split attributes. Figure 6 shows
that generating shortcircuit trees for multiple plots together is
up to 10 times faster than generating them one summary at a
time. Stated differently, on top of the 20-200 times speedup for
single summary computation, our algorithm achieves another
order of magnitude or more improvement over the naive
algorithm for workloads with multiple summaries.
B. Distributing Summary Computation
In this section we evaluate how the MapReduce algorithms
proposed in Section VI scale in the different input parameters
of summary computations. The experiments were run on
a cluster with 20 machines. Each machine in the cluster
had a 2.66GHz processor and 8GB RAM, and the cluster was running Hadoop v0.18, the open-source implementation of MapReduce, configured with all the default settings (see
hadoop.apache.org). All times in this section are job
completion times as reported by the Hadoop framework and
include all costs such as job setup and teardown times.
Deviations in the measured times were small and hence not
reported.
Recall that while GeneralizedMap can work on any subset
P ′ × D′ × T ′ of the P × D × T input parameter space, Map
only allows a single input file. To scale in a certain parameter,
we declare it as the Map input and let the MapReduce system
partition the space along this parameter; we do not partition
along the other parameters. For example, to scale in D, D is
the input to Map and hence each mapper works on chunks
P × D′ × T , where D′ ⊆ D.
Figures 7, 8 and 9 show how the MapReduce algorithm
scales in D, P , and T respectively. (The graph for scalability
in |VS| looks virtually the same, and hence is omitted.)

Fig. 7. Scaling in |D|
Fig. 8. Scaling in |P|
Fig. 9. Scaling in |T|
For Figure 7 we fixed the number of summaries at 1 (the
frequent attributes summary) with 900 visualization points.
The model is the same 10-tree ensemble as before. The number
of reducers was set to 1. We then computed the summary
for D of varying size. The larger datasets were generated by
simply replicating the D set used in the previous section. This
ensures that as the datasets are scaled up, access patterns in
the tree remain the same and any increase in cost is only due
to processing additional data points. The scaling factor is the
size multiplier for D. Line Single shows job completion time
when the entire computation is done on a single mapper; for
the Distributed graph, we increased the number of mappers in
proportion to the scaling factor of the dataset. As we can see,
response time remains approximately constant with increasing
|D|, showing that the framework scales well in D.
For Figure 8, we fixed the dataset D to the usual 60,000
points, used the same 10-tree ensemble, and fixed the number
of visualization points at 900. The number of summaries was varied from 20 to 400 (scaling factor 1 to 20). Notice that when partitioning on P, each mapper works on a subset of P but on the entire D and T. Hence no reduce phase is needed and the number of reducers was set to 0. We used the same method
as described above for generating summary workloads.
Figure 9 reports scalability in the size of the ensemble. We
fixed the number of summaries to 1 (the frequent attributes
summary) with 900 visualization points, and used the usual
dataset of 60,000 points. As in the experiment for scaling in D, the number of reducers was set to 1.
VIII. EXTENSIONS
For the sake of clarity, we made a few simplifying assumptions in the previous sections. Our algorithms generalize naturally and can be extended with additional functionality, as we briefly discuss in this section.
Confidence Intervals: For aggregate summaries, scientists are also interested in obtaining the standard deviation. This can be supported by maintaining not only the sums of residual predictions for each child of a node, but also the sums of squares of these residual predictions, since the standard deviation can then be recovered from these sums and the count.
Missing Values: Trees can handle missing values in a query point gracefully by sending partial weights down each subtree of a node that splits on an attribute whose value is missing, and then computing the weighted average of the corresponding leaves. Our algorithms, which were described for the case that there are no missing values in D, can be extended to support this behavior. We extend Algorithm 4 by associating a weight with each active summary at a node. At the root all weights are 1. When line 7 in Algorithm 4 fails because x is missing the value for the split attribute, all children are recursively visited with P_nde as the active set. However, when visiting a child chld, the weight for a summary p ∈ P_nde − P^s_nde is modified to the current weight of p times the fraction of training cases that went into the subtree of chld. Line 3 returns the prediction in the leaf multiplied with the weight of the summary. Also note that op[p].shckts could now contain multiple node pointers and not just one.
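The sketch below illustrates the weight-propagation idea for a single tree and a single query point; it is our simplification (the per-summary bookkeeping of the extended Algorithm 4 is omitted), and it assumes the same illustrative node layout as in the earlier sketch plus a left_frac field storing the fraction of training cases routed to the left child.

def predict_with_missing(node, x, weight=1.0):
    # Weighted prediction of one tree for query point x, where x may lack
    # values for some split attributes (missing values represented as None).
    if node.attr is None:                     # leaf: contribute weighted prediction
        return weight * node.prediction
    value = x.get(node.attr)
    if value is None:
        # Missing split attribute: send partial weights down both subtrees,
        # proportional to the fraction of training cases in each subtree.
        return (predict_with_missing(node.left,  x, weight * node.left_frac) +
                predict_with_missing(node.right, x, weight * (1.0 - node.left_frac)))
    child = node.left if value <= node.threshold else node.right
    return predict_with_missing(child, x, weight)

In the extended algorithm the analogous weights are carried per active summary rather than per call, which is why op[p].shckts may hold several node pointers.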
Complex Trees: Our approach also generalizes to tree types
with multivariate splits at non-leaf nodes and non-trivial
functions as leaf predictions. For these trees, the prediction
made by a leaf and the predicate evaluation at a non-leaf
node may require the values of both the visualization point
and some non-summary attributes. This means that the multipoint shortcircuit tree is not guaranteed any more to have only
a subset of the nodes of the original tree. In the worst case
it degenerates to the equivalent of all single-point shortcircuit
trees, which still represents a significant improvement over the
naive algorithm. (Time and space complexity of the algorithm
would in the worst case be that of the single-point shortcircuit
algorithm.) The multi-summary optimization can also still be
applied.
IX. RELATED WORK
Several papers discuss partial dependence plots [5], [7] for the summarization of complex models. Friedman [5] proposes a technique for computing approximate partial dependence plots in tree models. This method gives no approximation-quality guarantees (a major limitation for use by scientists), it produces accurate summaries only when strong independence assumptions hold, it is limited to partial dependence summaries, and it does not support summary computation in “slices” and “dices” of the data space without generating a new model. Our focus is on efficiently computing the exact same summaries as the naive algorithm, for all types of summaries and all possible data partitions, no matter what the attribute distribution.
Our work is motivated by OLAP [17], but OLAP techniques
cannot be directly applied because our main bottleneck is the
evaluation cost of a model. Prediction cubes [18] focus on
efficiently (and often approximately) computing predictions of
a model for a large region of the data space from many models
covering smaller partitions of this region. This approach is
problematic for sparse high-dimensional data and it is in some
sense the dual of computing region-based summaries for a
“large” model.
Other model summary types like dependence diagrams [19]
were recently proposed, but data mining research usually
concentrates on improving prediction quality [11], [12], [13] or
on scalable algorithms for training tree models from large data
sets [20]. In general, little work has been done to address the
performance issues that arise when using complex data mining
models for making predictions. Bucila et al. [21] propose
model compression to reduce model size and computational
cost for making predictions. Model prediction time has also
been studied in the context of scientific simulations [22].
However, in both cases the original model is approximated,
and elimination of redundant computation for summaries is not
considered. Our approach is orthogonal in the sense that one
could speed up summary computation for such approximate
models further by eliminating redundant computation.
The database community has started to explore efficient
data management for models [23], [4], but has not considered
summary computation from complex models. Sen et al. [24]
show how to exploit shared correlations to reduce the cost
of inference on probabilistic graphical models. Their work
is superficially related, having the same high-level theme of
exploiting workload structure to improve query performance.
Multi-query optimization has been studied in many different
contexts like relational databases [25] and stream processing [26]. While our algorithms have a similar theme of sharing
computation, the structural properties that we exploit are very
different.
X. CONCLUSIONS AND FUTURE WORK
We introduced a new data management problem that arises
during the analysis of observational data in many domains. We
identified various types of structure in summary computation workloads and showed how to exploit them to speed up model-summary computations in tree-based models. Our algorithms
produce the exact same results as the naive approach, but
are several orders of magnitude faster. Tree-based models are
widely used and our algorithms support all common types of model summaries; hence the algorithms in this paper are
widely applicable in practice.
There are many directions for future work. First, since
algorithms have to be model-specific, new techniques need to
be developed for other complex data mining models. Second,
scientists need automatic techniques for selecting visualization
points that capture all interesting features of a summary. Third,
summary computation cost can be further reduced through
approximation, but it is only useful to scientists if such
techniques provide confidence bounds.
ACKNOWLEDGMENT
This research was supported by the National Science
Foundation (NSF) under award 0920869. Additional support
came from the Leon Levy Foundation and the NSF (awards
0427914, 0542868, 0612031, 0734857, 0832782). Any opinions, findings, and conclusions or recommendations expressed
in this publication are those of the author(s) and do not
necessarily reflect the views of the sponsors.
We would like to thank the reviewers and the Avian Knowledge Network team, in particular Wesley Hochachka, Steve
Kelling, Art Munson, and Giles Hooker for their contributions.
REFERENCES
[1] S. Kelling, W. M. Hochachka, D. Fink, M. Riedewald, R. Caruana,
G. Ballard, and G. Hooker, “Data intensive science: A new paradigm
for biodiversity studies,” BioScience, vol. 57, no. 7, pp. 613–620, 2009.
[2] D. Jensen, A. S. Fast, B. J. Taylor, and M. E. Maier, “Automatic identification of quasi-experimental designs for discovering causal knowledge,”
in KDD, 2008, pp. 372–380.
[3] W. M. Hochachka et al., “Data-mining discovery of pattern and process
in ecological systems,” Journal of Wildlife Management, vol. 71(7), pp.
2427–2437, 2006.
[4] A. Thiagarajan and S. Madden, “Querying continuous functions in a
database system,” in SIGMOD, 2008, pp. 791–804.
[5] J. H. Friedman, “Greedy function approximation: A gradient boosting
machine,” Annals of Statistics, vol. 29, pp. 1189–1232, 2001.
[6] G. Hooker, Diagnostics and extrapolation in machine learning. PhD
thesis, Stanford University, 2004.
[7] O. Linton and J. P. Nielsen, “A kernel method of estimating structured
nonparametric regression based on marginal integration,” Biometrika,
vol. 82(1), pp. 93–100, 1995.
[8] D. Sorokina, R. Caruana, and M. Riedewald, “Additive groves of
regression trees,” in Proc. ECML, 2007, pp. 323–334.
[9] R. Caruana and A. Niculescu-Mizil, “An empirical comparison of
supervised learning algorithms,” in Proc ICML, 2006, pp. 161–168.
[10] L. Breiman, J. Friedman, C. J. Stone, and R. Olshen, Classification and
regression trees. McGraw-Hill, 2000.
[11] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, pp. 123–
140, 1996.
[12] R. Schapire, The boosting approach to machine learning: An overview.
MSRI Workshop on Nonlinear Estimation and Classification, 2001.
[13] L. Breiman, “Random forests,” Machine Learning, vol. 45, pp. 5–32,
2001.
[14] T. Mitchell, Machine Learning. McGraw-Hill, 1997.
[15] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on
large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.
[16] W. Buntine and R. Caruana, Introduction to IND and recursive partitioning. Technical Report FIA-91-28, NASA Ames Research Center,
1991.
[17] S. Chaudhuri and U. Dayal, “An overview of data warehousing and
OLAP technology,” SIGMOD Record, vol. 26, no. 1, pp. 65–74, 1997.
[18] B. Chen, L. Chen, Y. Lin, and R. Ramakrishnan, “Prediction cubes,” in
VLDB, 2005, pp. 982–993.
[19] K. Karimi and H. J. Hamilton, “Using dependence diagrams to summarize decision rule sets,” in Advances in AI, vol. 5032, 2008, pp. 163–172.
[20] J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Loh, “BOAT-optimistic
decision tree construction,” in Proc. SIGMOD, 1999, pp. 169–180.
[21] C. Bucila, R. Caruana, and A. Niculescu-Mizil, “Model compression,”
in Proc. SIGKDD, 2006, pp. 535–541.
[22] B. Panda, M. Riedewald, J. Gehrke, and S. B. Pope, “High-speed
function approximation.” in Proc. ICDM, 2007, pp. 613–618.
[23] A. Deshpande and S. Madden, “MauveDB: Supporting model-based user
views in database systems,” in SIGMOD, 2006, pp. 73–84.
[24] P. Sen, A. Deshpande, and L. Getoor, “Exploiting shared correlations in
probabilistic databases,” PVLDB, vol. 1, no. 1, pp. 809–820, 2008.
[25] T. K. Sellis, “Multiple-query optimization,” ACM TODS, vol. 13(1), pp.
23–52, 1988.
[26] A. J. Demers, J. Gehrke, M. Hong, M. Riedewald, and W. M. White,
“Towards expressive publish/subscribe systems.” in EDBT, 2006, pp.
627–644.