ROBUST LARGE MARGIN APPROACHES FOR MACHINE LEARNING IN
ADVERSARIAL SETTINGS
by
MOHAMADALI TORKAMANI
A DISSERTATION
Presented to the Department of Computer and Information Science
and the Graduate School of the University of Oregon
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
September 2016
DISSERTATION APPROVAL PAGE
Student: MohamadAli Torkamani
Title: Robust Large Margin Approaches for Machine Learning in Adversarial Settings
This dissertation has been accepted and approved in partial fulfillment of the
requirements for the Doctor of Philosophy degree in the Department of Computer
and Information Science by:
Daniel Lowd
Dejing Dou
Christopher Wilson
Hal Sadofsky
Chair
Core Member
Core Member
Institutional Representative
and
Scott L. Pratt
Dean of the Graduate School
Original approval signatures are on file with the University of Oregon Graduate
School.
Degree awarded September 2016
© 2016 MohamadAli Torkamani
DISSERTATION ABSTRACT
MohamadAli Torkamani
Doctor of Philosophy
Department of Computer and Information Science
September 2016
Title: Robust Large Margin Approaches for Machine Learning in Adversarial Settings
Many agencies are now using machine learning algorithms to make high-stakes
decisions. Making the right decision strongly relies on the correctness of the
input data. This fact provides tempting incentives for criminals to try to deceive
machine learning algorithms by manipulating the data that is fed to the algorithms.
And yet, traditional machine learning algorithms are not designed to be safe when
confronting unexpected inputs.
In this dissertation, we address the problem of adversarial machine learning;
i.e., our goal is to build safe machine learning algorithms that are robust in the
presence of noisy or adversarially manipulated data.
Adversarial machine learning is even more challenging when the desired
output has a complex structure. In this dissertation, a significant focus is on
adversarial machine learning for predicting structured outputs. First, we develop a
new algorithm that reliably performs collective classification, which is a structured
prediction problem. Our learning method is efficient and is formulated as a convex
quadratic program. This technique secures the prediction algorithm in both the
presence and the absence of an adversary.
Next, we investigate the problem of parameter learning for robust, structured
prediction models. This method constructs regularization functions based on the
limitations of the adversary. In this dissertation, we prove that robustness to
adversarial manipulation of data is equivalent to some regularization for large-margin structured prediction, and vice versa.
An ordinary adversary typically either lacks the computational power to design
the optimal attack or lacks sufficient information about the learner's model to
do so. Therefore, it often applies many random changes to the input in the hope
of making a breakthrough. This fact implies that if we minimize the expected
loss function under adversarial noise, we will obtain robustness against mediocre
adversaries. Dropout training resembles such a noise injection scenario. We derive
a regularization method for large-margin parameter learning based on the dropout
framework. We extend dropout regularization to non-linear kernels in several
different directions.
Empirical evaluations show that our techniques consistently outperform the
baselines on different datasets.
This dissertation includes previously published and unpublished coauthored
material.
CURRICULUM VITAE
NAME OF AUTHOR: MohamadAli Torkamani
GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
University of Oregon, Eugene, OR, USA
Isfahan University of Technology, Isfahan, Iran
DEGREES AWARDED:
Doctor of Philosophy, Computer and Information Science, 2016, University of
Oregon
Master of Science, Artificial Intelligence, 2006, Isfahan University of
Technology
AREAS OF SPECIAL INTEREST:
Machine learning, Statistics, Convex Optimization, Robust Modeling
PROFESSIONAL EXPERIENCE:
Graduate Research & Teaching Assistant, Department of Computer and
Information Science, University of Oregon, 2011 to present
Research Intern, Clari, Mountain View, California, 2015
Research Intern, Comcast Labs, Washington, D.C., 2012
Research Assistant, Department of Electrical Engineering and Computer
Science, Oregon State University, 2009 to 2011
GRANTS, AWARDS AND HONORS:
Graduate Teaching & Research Fellowship, Computer and Information
Science, 2011 to present
PUBLICATIONS:
Torkamani, M., Lowd, D. (2013). Convex Adversarial Collective
Classification. In Proceedings of the 30th International Conference on
Machine Learning (ICML 2013), pages 642-650.
Torkamani, M., Lowd, D. (2014). On Robustness and Regularization
of Structural Support Vector Machines. In Proceedings of the 31st
International Conference on Machine Learning (ICML 2014), pages 577-585.
Torkamani, M., Lowd, D. (2016). Marginalized and Kernelized Dropout
Training for Support Vector Machines. Under review in Journal of Machine
Learning Research (JMLR).
ACKNOWLEDGEMENTS
First and foremost I want to thank my advisor Daniel Lowd, who has given
me every opportunity to pursue my ideas, and whose mentorship has shaped
my development as a scientist. It has been an honor to be one of his first Ph.D.
students. I appreciate all his contributions of time, ideas, and funding to make my
Ph.D. experience productive and stimulating.
I would like to thank my dissertation committee members Dejing Dou,
Christopher Wilson, and Hal Sadofsky. I also would like to thank Andrzej
Proskurowski and Jun Li for their constructive comments in the past years.
I gratefully acknowledge the funding sources, the ARO and NSF grants, that
made my Ph.D. work possible.
Lastly, I would like to thank my friends and family for all their wholehearted
support and encouragement. I want to thank my parents who raised me with
love and taught me to love science. And most of all, I thank my loving, supportive,
encouraging, and patient wife Fereshteh, whose support during my Ph.D. is so
appreciated.
To my wife, Fereshteh
TABLE OF CONTENTS

Chapter

I. INTRODUCTION
   1.1. Motivation and approach
   1.2. Learning and prediction under uncertainty
   1.3. Predicting complex outputs
   1.4. Contributions
   1.5. Thesis outline

II. BACKGROUND
   2.1. Statistical learning
   2.2. Structured learning
   2.3. Adversarial machine learning
   2.4. Applications of adversarial structured learning

III. CONVEX ADVERSARIAL COLLECTIVE CLASSIFICATION
   3.1. Max-margin relational learning
   3.2. Convex formulation
   3.3. Experiments
   3.4. Conclusion

IV. EQUIVALENCY OF ADVERSARIAL ROBUSTNESS AND REGULARIZATION
   4.1. Preliminaries
   4.2. Robust structural SVMs
   4.3. Mapping the uncertainty sets
   4.4. Robust optimization programs
   4.5. Experiments
   4.6. Related work
   4.7. Conclusion

V. MARGINALIZATION AND KERNELIZATION OF DROPOUT FOR SUPPORT VECTOR MACHINES
   5.1. Related work
   5.2. Dropout in linear SVMs
   5.3. Dropout in non-linear SVMs
   5.4. Empirical results
   5.5. Conclusion

VI. CONCLUSION AND FUTURE DIRECTIONS
   6.1. Summary of contributions
   6.2. Future directions

APPENDICES
   A. INTEGRALITY OF THE ADVERSARIAL SOLUTION IN CONVEX ADVERSARIAL COLLECTIVE CLASSIFICATION
   B. PROOFS FOR EQUIVALENCE OF ROBUSTNESS AND REGULARIZATION IN LARGE MARGIN METHODS
   C. DIRECT PROOF FOR DERIVATION OF MARGINALIZED LINEAR SVM
   D. α-REG PROOF
   E. LINEAR TIME INFERENCE FOR DIMENSION DROPOUT IN RBF KERNEL
   F. NOTATIONS AND SYMBOLS

REFERENCES CITED
LIST OF FIGURES

1.1 The adversary manipulates the unseen data as a response to the learner's strategy. A robust model decreases the harmful effect of the adversarial data alteration.

2.1 The supervised learning procedure.

3.1 The adversary knows the parameters of our classifier and can maliciously modify data to attack. The learner should select the best classifier, assuming the worst adversarial manipulation.

3.2 Accuracy of different classifiers in the presence of a worst-case adversary. The number following the dataset name indicates the adversary's strength at the time of parameter tuning. The x-axis indicates the adversary's strength at test time. Smaller is better.

3.3 Accuracy of different classifiers in the presence of a random adversary. We observe that even strong random attacks are not efficient in disguising the true class of the sample.

3.4 The distribution of the learned weight values for different models. The robust method tends to have a high density on the weights that are saturated.

3.5 The sorted learned weights for each method. The robust method constrains the maximum value of the weights. This suggests that robustness could also be achieved through regularization with the L∞ norm.

4.1 Average prediction error of robust and non-robust models, trained on year 2004 and evaluated on years 2003-2013.

5.1 The results of running a Monte-Carlo simulation of calculating 1 − y(w^T x̃ + b) for randomly drawn x̃'s and drawings from the approximated Gaussian distributions. The dimension of each sample x̃ is 50 in the histogram on the left. Right: simulation of the Berry-Esséen upper bound for different numbers of non-zero weights.

5.2 Losses and differences in losses as a function of a single model weight. Note that the marginalized cost function is always an upper bound on the hinge loss. Although the effective regularization function is non-convex, the marginalized objective function itself is convex.
LIST OF TABLES

5.1 Classification error (%) of linear classifiers on text datasets. The last column is the percentage decrease in prediction error for the best method (mostly SVM-Marg) vs. SVM.

5.2 Classification error (%) of the approximated RBF kernel.

5.3 Classification error (%) for different-size subsets of MNIST, comparing the no-dropout standard RBF to Monte-Carlo dimension dropout and α-Reg. The last row is the percentage decrease in prediction error for no dropout vs. the best dropout method.
CHAPTER I
INTRODUCTION
“Things will go wrong in any given situation if you give them a chance.”
– Edward A. Murphy
Smith’s Law: “Murphy was an optimist.”
Machine learning is widely used for prediction and decision-making, often
taking the place of human agents. Reliability of machine learning algorithms is a
rising concern in many sensitive applications, where the input data can be noisy
and uncertain. Uncertainty and noise in the data used to be mostly random, but
now criminals have incentives to change the data adversarially.
As tasks pertaining to detecting malicious activities are increasingly assigned to
machine learning algorithms, criminals become increasingly motivated to put extra
effort into deceiving these algorithms.
The traditional prediction models are vulnerable when confronting
unexpected or maliciously manipulated data. This vulnerability is a serious
problem for the applicability of this modern technology. Criminals are learning
to tactfully disguise their actions. They strive to design innocent-looking fraudulent samples when attacking machine learning systems. As a result,
since the classical machine learning techniques are not constructed with a safety
mindset in the first place, their susceptibility to data manipulation makes them
untrustworthy in many high-stakes applications.
In this thesis, we present machine learning methods that are resilient and
reliable. Our methods utilize domain knowledge and problem structures to deliver
reliable predictions. This work has advanced the state-of-the-art in adversarial
machine learning by introducing efficient algorithms for learning robust models
when the output space is exponential in the input size. We show that by taking
advantage of the weaknesses of the adversaries, we will be able to learn models that
are particularly reliable when being attacked by those parties.
1.1. Motivation and approach
Conventional statistical methods – including machine learning – suppose that
training and test instances are independently and identically drawn from the same
distribution (the IID assumption)¹, which is frequently not true. Due to this fact,
the traditional machine learning algorithms do not offer a realistic solution to many
of the existing and emerging real-world problems, where there are fundamental
reasons for the data samples to be interdependent or to be drawn from different
distributions.
In fact, there are two common situations, in each of which the IID assumption
does not hold. First, the train and the test data might have been drawn from
two non-identical distributions. The difference in the distributions, at train
time and at test time, can derive from: constant changes in the underlying data
generation sources; random noise; subjective conceptual drift, e.g., change of topic
in a discussion forum; or a scheming agent might have intentionally manipulated
some parts of the data. Malicious manipulation of the data serves some interests
of the adversaries. To satisfy their interests, the adversaries design specific
samples such that some utility functions are maximized.

¹ If the data samples are independently and identically drawn from a distribution, then we say
the samples are IID. The mathematical modeling of the distribution of IID samples is simpler
and cleaner. Although the IID assumption rarely holds in practice, many statistical approaches,
including classical machine learning, still suppose that it is satisfied.

FIGURE 1.1. The adversary manipulates the unseen data as a response to the
learner's strategy. A robust model decreases the harmful effect of the adversarial
data alteration.
The learner usually has little or no knowledge of the details of these utility
functions. However, the adversary has either full information or a partial guess
of the learner’s strategies or the parameters of its decision-making algorithm.
The adversary may increase its knowledge about the learner’s underlying model
by submitting query examples and studying the responses of the machine
learning system. Potentially, the adversary will be able to acquire a near-perfect
approximation of the internal functionality of the learner’s prediction system.
Interdependent data samples are the second cause of violation of the IID
assumption. The dependency of data samples can have different forms. For
example, the sentences in a paragraph of an English text are not statistically
independent. And in graphed data, wherein each vertex has a label, the label
of each node may depend on the labels of the neighboring nodes. A particularly
important type of dependency is when the desired output of the algorithm has
some internal structure; examples of such outputs are the parse tree of a sentence,
labeling of the nodes in a graph, and segments of an image.
To date, most of the modern methods in machine learning are designed to
solve only one of these two challenges; i.e., either they approach the problem of
inter-related data, or they develop robust algorithms against noise and natural
or adversarial changes in the distribution of the data samples. In this thesis, we
introduce novel methods in machine learning where both parts of the IID assumption
are violated: The samples are not independent, and they are not drawn from a
static distribution at train and test time. In particular, we focus on the worst-case
scenario, where the unseen data in the future will be intentionally manipulated by
some adversary to deceive the machine learning algorithm (Figure 1.1).
We introduce a direct but efficient robust modeling approach for solving the
problem of label prediction on graphs, where some opponent changes the properties
of each node to misguide the labeling algorithm as much as possible. We will
consider the conditions under which the efficiency of this algorithm is guaranteed.
Then, we propose a regularization-based approach, which creates customized
optimization programs that exploit the weaknesses of the adversaries and convert
them into points of strength for the machine learning algorithm. This is done
by learning robust models that take full advantage of how the adversary's
budget allows jointly changing a set of values in the input data.
1.2. Learning and prediction under uncertainty
Learning classifiers in the presence of noisy and uncertain instances is a
challenging and important task in modern machine learning. Noise in the data may
refer to the observations that are added or multiplied by unknown random values,
that have missing attributes, or that have inaccurate labels. Many real-world datasets, such as texts, gene expression data, images, and videos, are naturally
noisy. The noise can variously derive from, e.g., human error in data collection,
data processing, and/or data tagging; measurement errors; and/or sub-optimal
sampling resolutions. However, the existence of adversarial uncertainty in the data
is a more severe issue. Given the prospect of cyber-crimes in this century, it is an
important and more challenging task to learn models that are not only robust to
random noise, but are also robust to the worst-case adversarial ones. Therefore,
developing algorithms that are robust to the uncertainty caused by adversaries is of
growing interest (Kloft and Laskov, 2007).
When the instances are noisy, in most of the cases, there exists little or no
knowledge about the level of uncertainty in the data. In adversarial scenarios, the
adversary usually aims for maximizing a utility function, while having some budget
constraints for changing individual sets of features. Therefore, as the learner, we
do not know whether the observed information is what it initially used to be, or if
the adversary has changed it according to some underlying set of constraints and
utilities.
The adversaries actively change their strategies: As the learner blocks them
on one front, they seek to find another vulnerability of the machine learning
system. This problem can be formulated as a game between the learner and the
adversary: Each side will be rewarded when it chooses the right strategies.
One of the earliest works in adversarial machine learning was reverse
engineering classifiers (Lowd and Meek, 2005b,a; Nelson et al., 2010). The idea
is to find optimal attacks as a response to the specific model that the machine
learning algorithm has learned. Then, the machine learning algorithm can adjust
itself to be able to correctly classify the optimal attack. This ends up in a race
between the two players of an antagonistic game: the learner and the adversary.
In general, finding the Nash equilibrium for this game is intractable. Dalvi
et al. (2004) suggest that instead of finding a Nash equilibrium, we can select
a strategy for the next move of the adversary. Brückner and Scheffer derive an
optimization approach for finding the Nash equilibrium in static prediction games
under certain convexity assumptions (Brückner and Scheffer, 2009; Brückner
et al., 2012). They also propose a formulation for approximating the Stackelberg
equilibria (Brückner and Scheffer, 2011; Sawade et al., 2013).
Assuming that an adversarial game is zero-sum leads to a min-max
formulation: The learner tries to minimize a worst-case loss function under the
adversarial manipulation of the input data. Globerson and Roweis (2006) modeled
this data manipulation by feature deletion at test time. A generalized version of
this method was later proposed by Teo et al. (2008).
Xu et al. (2009) show that penalizing the optimization program by the dual
norm of the adversarial constraint is equivalent to optimizing against a worst-case
adversary that can manipulate features within that constraining ball.
Developing secure algorithms that are not mistrained by poisoned data is a
different view of adversarial machine learning. Data poisoning refers to engineering
samples that are adversarially crafted to mislead a specific machine learning
algorithm (Kloft and Laskov, 2007; Laskov and Kloft, 2009; Laskov and Lippmann,
2010; Biggio et al., 2012).
The dropout technique is another method that was originally introduced
for stabilizing the behavior of deep neural networks in the presence of noise in
the unseen data: During the training phase, some attributes of the data are
randomly dropped out while learning the parameters (Srivastava et al., 2014). In
shallow models, such as logistic regression (LR), dropout behaves as a regularizer
that penalizes feature weights based on how much they influence the classifier’s
predictions (Wager et al., 2013). Since, in adversarial machine learning, robustness
is often equivalent to regularization through the right penalty function, we expect
to gain robustness by deriving regularization methods that emulate the effect of
dropout training.
On the other hand, in many real-world scenarios, the machine learning
algorithm does not need to be robust to the worst-case adversary. Instead, it
suffices to learn the model such that it is reliable when encountering an average
opponent that might change the input data frequently, yet randomly, in order to
deceive the algorithm. This fundamental idea suggests that if we minimize the
expected loss function under adversarial noise, we will gain some robustness against
average adversaries. Dropout training simulates such an adversarial behavior. In
this dissertation, we derive a closed-form formulation for the expected hinge loss.
Our formulation is convex, and can be optimized efficiently.
In this thesis, we further expand some of the algorithms mentioned above
to perform robust prediction of complex outputs. We will show how we can gain
robustness by designing the appropriate regularization functions. We induce the
regularization functions from a worst-case uncertainty set, or we derive them from
the implicit marginalization effect of applying the dropout framework.
1.3. Predicting complex outputs
Structured learning is the problem of finding a predictive model for mapping
the input data into complex outputs that have some internal structure. Structured
output prediction is a challenging task by itself, but the problem becomes even
more troublesome when the input data is adversarially manipulated to deceive
the predictive model. The problem of adversarial structured output prediction is
relatively new in the field of machine learning.
We can abstract many real-world applications as an adversarial structured
output prediction problem. A motivating example of adversarial structured
prediction is collective classification of interconnected and potentially dishonest
nodes of a network. In a collective classification problem (Sen et al., 2008), the
goal is to label a set of interconnected objects simultaneously, using both their
attributes and their relationships. For example, linked web pages are likely to have
related topics; friends in a social network are likely to have similar demographics;
and proteins that interact with each other are likely to have similar locations and
related functions. Probabilistic graphical models, such as Markov networks (Taskar
et al., 2004a; Koller et al., 2003), and their relational extensions, such as Markov
logic networks (Domingos and Lowd, 2009b), can handle both uncertainty and
complex relationships in a single model, making them well-suited to collective
classification problems (Torkamani and Lowd, 2013).
Many collective classification models are evaluated on test data that is
drawn from a different distribution than the training data. This can be a matter
of concept drift, such as varying topics in interconnected news web pages at
different times, or the change in the distribution can be attributed to one or more
adversaries who are actively modifying their behavior to avoid detection. For
example, when the search engines began to use incoming links to rank web pages,
spammers began posting comments on unrelated blogs or message boards, with
links back to their websites. Since incoming links are used as an indication of the
quality of the web page, manufacturing of the incoming links makes a spammy
website appear more legitimate. Web spam (Abernethy et al., 2010; Drost and
Scheffer, 2005) is one of many examples with explicitly adversarial domains; some
other examples are counter-terrorism, online auction fraud (Chau et al., 2006), and
spam in online social networks.
One important aspect of adversarial machine learning that is currently
missing in the literature of adversarial structured prediction is a deep analysis of
the vulnerability of structured output prediction methods to exploratory evasion
attacks. In particular, in the existing studies, the assumption is that the adversary
is completely aware of the classifier and the learned parameters of the classifier; but
in general, this assumption does not hold in practice. In real problems, such as a
web spam detector in a search engine, the parameters of the classifier are unknown
to the spammers, and the spammers need to infer them by exploration techniques.
In this thesis, we address the problem of adversarial structured prediction and
propose efficient algorithms for learning and prediction of structured outputs in
adversarial settings.
1.4. Contributions
In this thesis, we propose novel methods for constructing large margin
classifiers, which are robust to uncertainties and have a better generalization on the
future data. Tractability of the robust learning algorithms is a central theme in this
dissertation. We attack the hard problem of adversarial structured prediction. We
prove that robustness can be achieved by penalizing the problem by a customized
regularization function. Then, we show that the dropout framework also results
in a regularization effect in the large margin classifiers, which leads to a better
generalization of the predictive model. The following are the highlights of our
contributions:
1. Convex adversarial collective classification
We present a novel method for robustly performing collective classification in
the presence of a malicious adversary that can modify up to a fixed number of
binary-valued attributes. Our method is formulated as a convex quadratic
program that guarantees optimal weights against a worst-case adversary
in polynomial time. In addition to increased robustness against active
adversaries, this kind of adversarial regularization can also lead to improved
generalization, even when no adversary is present. In experiments on real and
simulated data, our method consistently outperforms both non-adversarial
and non-relational baselines.
2. Equivalency of adversarial robustness and regularization
Previous analysis of binary SVMs has demonstrated a deep connection
between robustness to perturbations over uncertainty sets and regularization
of the weights. We explore the problem of learning robust models for
structured prediction problems. We first formulate the problem of learning
robust structural SVMs when there are perturbations in the feature space.
We consider two different classes of uncertainty sets for the perturbations:
ellipsoidal uncertainty sets and polyhedral uncertainty sets. In both cases, we
show that the robust optimization problem is equivalent to the non-robust
formulation with an additional regularizer. For the ellipsoidal uncertainty
set, the additional regularizer is based on the dual norm of the norm that
constrains the ellipsoidal uncertainty. For the polyhedral uncertainty set, we
show that the robust optimization problem is equivalent to adding a linear
regularizer in a transformed weight space related to the linear constraints
of the polyhedron. We also show that the constraint sets can be combined,
and we demonstrate some interesting special cases. This represents the
first theoretical analysis of robust optimization of structural support vector
machines. Our experimental results show that our method outperforms
the non-robust structural SVMs on real-world data, when the test data
distributions have drifted from the training data distribution.
3. Robustness of large margin methods through dropout
regularization
Dropout training is a regularization technique that consists of setting
randomly selected input features or hidden units to zero for each training
example. Dropout training was originally proposed for deep neural networks,
but even shallow models, such as logistic regression, can benefit from
training with this kind of noise. In this thesis, we analyze dropout training
in support vector machines (SVMs). First, we derive a convex, closed-form
objective for linear SVMs that marginalizes over all possible dropout noise.
Our objective is simple, efficient to optimize, and closely approximates
the exact marginalization. For SVMs with non-linear kernels, we define
dropout over input space, feature space, and input dimensions. We introduce
methods for approximate marginalization over feature space dropout, even
when the feature space is infinite-dimensional, and Monte-Carlo methods
for input space and dimension dropout. We introduce two methods for
approximating dropout on the kernel feature map. The first uses a Fourier
basis to approximate a high-dimensional kernel with a finite feature map
and then applies our linear SVM dropout marginalization technique to the
transformed representation. The second approximately marginalizes over
dropout noise in the dual representation. In experiments on several text
datasets, our marginalized objective in the primal form is more accurate than
standard linear SVM training. On MNIST
and census data, both marginalized kernel dropout methods outperform the
standard RBF kernel. We also introduce a novel dimension dropout method
and show that it is more accurate than the standard RBF kernel on MNIST,
especially when the training sizes are smaller.
1.5. Thesis outline
The following is the summary of the dissertation’s chapters:
Chapter 2. Background: First, we review the basic concepts of statistical
machine learning and structured prediction methods. Then, we focus on a high-level explanation of adversarial machine learning algorithms. We introduce a
general framework that abstracts most of the adversarial scenarios as a generic
multi-agent game. The adversary’s counteractive effects on the learning and
prediction algorithms cause the learned model to perform poorly in the future. To be
robust to unpredictable effects, we should know the capabilities of the adversaries.
We define a theoretical model for the adversary and categorize the properties of the
adversary based on different criteria.
Chapter 3. Convex adversarial collective classification: In this
chapter, we start by formulating the problem of adversarial collective classification
as a bi-level minimax optimization program. We show that under certain
interconnectivity conditions of the data graph, the solution of the lower-level
optimization program is guaranteed to be integral after relaxation. Then, we
introduce an equivalent quadratic optimization program that can be efficiently
solved. We run experiments on various datasets, and we show that our method
always outperforms the baselines. This chapter is co-authored with my advisor
Dr. Daniel Lowd and was published in the Proceedings of the Thirtieth
International Conference on Machine Learning (Torkamani and Lowd, 2013).
Chapter 4. Equivalency of adversarial robustness and
regularization: We focus on learning robust models for generic structured
prediction problems. We discuss the different classes of uncertainty in the feature
space: ellipsoidal and polyhedral. Then, we derive the robust optimization problem
for each of these uncertainty sets. We show how the non-robust formulations
become equivalent to the robust ones by adding a customized regularizer to their
objective functions. We show how the customized regularization function should
be derived from each specific uncertainty set, and we study several special cases of
such sets. Finally, we derive a regularizer for combined ellipsoidal and polyhedral
uncertainty sets. This chapter is co-authored with my advisor Dr. Daniel Lowd and
was published in the Proceedings of the Thirty-First International Conference on
Machine Learning (Torkamani and Lowd, 2014).
Chapter 5. Marginalization and kernelization of dropout for
support vector machines: We study dropout training for support vector
machines. We derive a closed-form objective function for linear SVMs. This
objective is the result of marginalizing over the continuum of possible dropped
out noisy samples. We also discuss the possibility of applying dropout to SVMs
with non-linear kernels. We define the concept of applying dropout in input
space, feature space, and input dimensions, and we introduce several methods
for approximating the marginalization effect of dropout on kernel SVMs. The
experimental results on several datasets, such as text and image classification, show
that our methods are more accurate than the standard support vector machines.
This chapter is co-authored with my advisor Dr. Daniel Lowd and is under review
in the Journal of Machine Learning Research (JMLR).
Chapter 6. Conclusion and future directions: We summarize our
contributions. We also discuss the future research directions and how the proposed
methods in this thesis can be extended.
CHAPTER II
BACKGROUND
In this chapter, we review the basic concepts of adversarial machine learning.
Our focus is on methods that also apply to structured prediction problems. The
chapter concludes with examples of real-world problems that are adversarial and
whose output space is structured.
2.1. Statistical learning
In machine learning, output prediction is the procedure of observing the state
x of some phenomenon (input) and using our understanding of the concept (learned
model) to predict some hidden property y of the observed data (output). In this
section, we briefly address the fundamentals of statistical machine learning.
2.1.1. Supervised learning
In supervised learning, the learner has access to samples that contain both
the attributes’ vectors and their corresponding labels. The training data samples
D = {(x1, y1), . . . , (xN, yN)} ∈ (X × Y)^N are input-output pairs from the
past. We assume that each sample (xi , yi ) is drawn from an underlying joint
distribution over inputs and outputs: P (X , Y). Traditionally in machine learning,
the researchers usually assume that yi is the correct label for the input xi .
The goal is to find a mapping function (also known as a hypothesis function)
h ∈ H : X → Y, where H is the space of relevant hypotheses, and X and Y are the
set of possible inputs and outputs, respectively. Given x ∈ X , the predicted output
is ŷ = h(x) ∈ Y.
FIGURE 2.1. The supervised learning procedure
If Y = R^m (for a constant m), then the problem is called regression; if |Y| = 2 (e.g.,
Y = {0, 1}), then the prediction is called binary classification; if Y is a discrete set
and |Y| > 2, then the problem is called multi-class classification. If |Y| is
extremely large and each member of Y has some internal structure, then the problem
is called "structured prediction".
The mapping function h should produce accurate predictions; i.e., for an
input xi , the predicted output ŷi = h(xi ) should be “close” to the true output yi .
This closeness is usually defined by some non-negative loss function l : Y × Y → R
that determines the distance of ŷ to y. Sometimes the loss function l(y′, y) is not
convex, and therefore the optimization problem in Equation 2.5 is not tractable;
then, a convex surrogate function for l(y′, y) is used instead. We are interested
in the hypothesis h that generalizes well to the unseen samples of the joint
distribution over inputs and outputs. From a statistical point of view, we would
like to find h∗ ∈ H, such that the expected loss is minimized:
h^* = \arg\min_{h \in \mathcal{H}} \; \mathbb{E}_{(x,y) \sim P(\mathcal{X}, \mathcal{Y})} \left[ l(h(x), y) \right]   (Equation 2.1)

In real-world problems, we don't have access to the whole population, or
equivalently, we don't know P(X, Y); therefore, the empirical population (observed
samples from the past) is used instead:

h^* = \arg\min_{h \in \mathcal{H}} \; \mathbb{E}_{(x,y) \sim D} \left[ l(h(x), y) \right] = \arg\min_{h \in \mathcal{H}} \; \frac{1}{N} \sum_{i=1}^{N} l(h(x_i), y_i)   (Equation 2.2)

The term \frac{1}{N} \sum_{i=1}^{N} l(h(x_i), y_i) is called the empirical risk. Figure 2.1 shows
the procedure of supervised learning.
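To make Equation 2.2 concrete, here is a minimal sketch of empirical risk minimization over a small, enumerable hypothesis class (the threshold classifiers, loss, and data below are hypothetical illustrations, not from the text):

```python
import numpy as np

def empirical_risk(h, X, y, loss):
    """Average loss of hypothesis h over the observed sample D (Equation 2.2)."""
    return np.mean([loss(h(x_i), y_i) for x_i, y_i in zip(X, y)])

# Hypothetical hypothesis class: threshold classifiers h_t(x) = 1[x >= t]
# on scalar inputs, evaluated with the 0/1 loss.
zero_one = lambda y_hat, y: float(y_hat != y)
X = np.array([0.10, 0.35, 0.40, 0.80, 0.90])
y = np.array([0, 0, 0, 1, 1])
candidates = [lambda x, t=t: int(x >= t) for t in np.linspace(0.0, 1.0, 11)]

# Empirical risk minimization: keep the candidate with the lowest training risk.
h_star = min(candidates, key=lambda h: empirical_risk(h, X, y, zero_one))
```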
2.1.2. Generalized linear models
Flexibility of h mostly depends on the function space H. We assume that h is
parameterized by a parameter vector w. In a general form, the hypothesis h can be
a search pocedure that finds the best output. We can assume that the best output
y maximizes some score function s(x, y; w), then h can be formally defined as:
h(x; w) = \arg\max_{y \in \mathcal{Y}} \; s(x, y; w)   (Equation 2.3)
In this thesis, we suppose that the scoring function is linear in the parameters
w:
s(x, y; w) = \sum_{j=1}^{m} w_j f_j(x, y) = w^T f(x, y)   (Equation 2.4)

where f_j(x, y) is an arbitrary function of values from the input and the
output space and is called a feature function. We refer to this parameterization
of the hypothesis function as a generalized linear model (GLM).
For some problems, such as when |Y| = 2, \arg\max_{y \in \mathcal{Y}} s(x, y; w) can be
calculated in closed form; then, we will have an explicit form for the hypothesis h.
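As an illustration of Equations 2.3 and 2.4, the following sketch implements the GLM hypothesis by brute-force enumeration of a small output set (the feature function and weights are hypothetical):

```python
import numpy as np

def glm_predict(x, w, feature_fn, outputs):
    """h(x; w) = argmax_{y in Y} w^T f(x, y)  (Equations 2.3-2.4).
    Brute-force enumeration is only viable when |Y| is small; structured
    prediction replaces it with combinatorial search (Section 2.2)."""
    scores = {y: w @ feature_fn(x, y) for y in outputs}
    return max(scores, key=scores.get)

# Hypothetical binary feature map: a copy of x in the block for class y.
def feature_fn(x, y):
    f = np.zeros(2 * len(x))
    f[y * len(x):(y + 1) * len(x)] = x
    return f

w = np.array([1.0, -2.0, -1.0, 2.0])  # illustrative weights
print(glm_predict(np.array([0.5, 1.0]), w, feature_fn, outputs=[0, 1]))  # -> 1
```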
2.1.3. Regularization
If the number of observed samples |D| is small, or if the number of possible
hypotheses |H| is extremely large, then the learned hypothesis h∗ in Equation 2.2
is likely to “overfit” the training data; i.e., we will achieve zero (or very small)
empirical loss, but large errors on output prediction for unseen (test) data. We
usually cannot increase the amount of training data, but we can control the
“flexibility” of the hypothesis h to prevent it from overfitting to the training data.
This task is performed by “regularizing” the hypothesis h. Regularization is done
by minimizing a linear combination of the empirical risk and a penalty function
ΩH (h) that controls the flexibility of h:
h^* = \arg\min_{h \in \mathcal{H}} \; \lambda \Omega_{\mathcal{H}}(h) + \frac{1}{N} \sum_{i=1}^{N} l(h(x_i), y_i)   (Equation 2.5)
This approach is called regularized risk minimization. The coefficient of
the regularization term λ is used to create a balance between the amount of
penalization of the model parameters and the empirical risk minimization.
We can interpret regularized risk minimization as a maximum a posteriori (MAP)
probabilistic parameter learning method. The regularizer can be seen as the log
of the prior distribution over the parameters, while its partition function does
not depend on the parameters and can be removed from the objective of the
optimization program (Bishop, 2006).
Choosing the right regularization function is crucial in gaining the desirable
generalization effect. For example, in GLMs, if we have prior knowledge that
the weights are IID and are drawn from a Gaussian distribution, then we set
\Omega_w(w) = w^T w. This assumption is somewhat common because the squared L2
norm is continuous, its derivative is simple, and it can be very efficiently optimized.
If we expect the weight vector w to be sparse, then we can implicitly assume that
it is drawn from a Laplacian distribution, or equivalently set the regularization
function to the L1 norm: \Omega_w(w) = \sum_{j=1}^{m} |w_j|.
Clearly, such naïve assumptions are not necessarily optimal choices. Some
of the main contributions of this thesis are centered around recipes for deriving
effective regularization functions.
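A minimal sketch of the GLM specialization of Equation 2.5, contrasting the two penalties discussed above (the names and the hinge loss below are illustrative assumptions):

```python
import numpy as np

def regularized_risk(w, X, y, loss, penalty, lam):
    """Equation 2.5 for a GLM with score w^T x:
    lambda * Omega(w) + (1/N) * sum_i loss(w^T x_i, y_i)."""
    risk = np.mean([loss(w @ x_i, y_i) for x_i, y_i in zip(X, y)])
    return lam * penalty(w) + risk

l2_penalty = lambda w: w @ w                # Gaussian prior on the weights
l1_penalty = lambda w: np.sum(np.abs(w))    # Laplacian prior; encourages sparsity
hinge = lambda s, y: max(0.0, 1.0 - y * s)  # a convex surrogate loss (y in {-1, +1})
```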
2.2. Structured learning
The traditional machine learning algorithms are designed to solve prediction
problems whose outputs are a fixed number of binary or real-valued variables.¹

¹ In these prediction algorithms, the desired output must be representable as a K-dimensional
vector, where K is a constant (e.g., K = 1 for scalars). For example, for a desired output
y ∈ {c_1, . . . , c_K}, the common practice is to use a different representation for the output y. In
this case, y will be represented as a K-dimensional binary vector y′, where y′_i = 1 if y = c_i, and
is zero otherwise.
In contrast, there are problems with a strong interdependence among the output
variables, often with sequential, graphical, or combinatorial structures. These
problems involve prediction of complex outputs, where the output has some
structure such as trees and graphs; these kinds of outputs are called structured
outputs. Problems of this kind arise in security, computer vision, natural language
processing, robotics, and computational biology, among many others.
Structured prediction (Bakir et al., 2007) provides a unified treatment
for dealing with structured outputs. The structured prediction algorithms root
back in a few seminal works: McCallum et al. (2000); Lafferty et al. (2001);
Punyakanok and Roth (2001); Collins (2002); Koller et al. (2003); Altun et al.
(2003); McAllester et al. (2004); Tsochantaridis et al. (2006), among others.
In this section, we explain the basics of structured prediction methods. We
start with a brief explanation of the basics of supervised learning for structured
prediction, and then we present some of the most practiced training algorithms for
training structured predictors.
2.2.1. Motivation of using structured prediction
Before the emergence of the structured prediction algorithms, probabilistic
graphical models (PGMs) (Pearl, 1988) were the most successful methods for
solving problems with strongly interdependent outputs. By combining statistical
learning and graph theory, PGMs provide a framework for making an inference
about dependent variables and confounding factors. The basic idea behind PGMs
is that the probability distribution function of the variables in the model can be
factorized based on the graph of the direct dependencies among the variables.
Although PGMs apply to many problems, they are overly general purpose,
which is inevitably costly. Using the probability distribution function of variables in
the model is desirable in theory, but estimating the parameters of the distribution
functions, especially the normalization constants (a.k.a. partition functions), can
be intractable. Structured prediction algorithms do not calculate the probability
distribution of the variables explicitly, and mainly avoid the calculation of
the normalization constants. Therefore, learning the parameters of structured
prediction models is usually tractable, especially when tailored to specific problems.
The principal theme in all structured output prediction problems is the
combinatorial nature of the labels. In particular, the number of possible outputs
in such problems is exponential in the input size. This fact makes these problems
distinct from the classic problems that classical machine learning algorithms
have been trying to solve. Therefore, new algorithms are needed for handling such
problems.
2.2.2. Scoring function
A key concept in the state-of-the-art structured prediction algorithms is the
notion of an extended feature function in a GLM setting. The inputs of the feature
functions are both the original input x ∈ X and a hypothesized output ỹ ∈ Y.
We define f (x, y) as the feature vector. The mathematical details of f (x, y) are
problem-specific. For example, in graphical models (Lauritzen, 1996), the feature
function is the same as the vector of all potential functions (Bilmes et al., 2001;
Torkamani and Lowd, 2013; Taskar et al., 2004a), and in maximum entropy
(MaxEnt) models (Theil and Fiebig, 1984), or equivalently in log-linear models,
the sufficient statistics are used as the feature functions.
In general, the choice of f (x, y) is a model selection problem. A specific
example is collective classification of inter-connected documents (such as web
pages) as “spam” and “non-spam”. Let E be the set of the edges between the
documents, where eik = 1 means that there is an edge from node i to node k
and is zero otherwise. Also, let xij be the indicator variable that represents if
the jth word is present in the ith document; for example if “[email protected]” has index
700 in the dictionary, then x200,700 = 1 means that the word “[email protected]” is present
in the 200th document, and x200,700 = 0 means it is not present. Also let yi ∈
{“spam”, “non-spam”} be encoded as the pair (yi1 , yi2 ), where (yi1 , yi2 ) = (1, 0)
means yi = “spam” , and (yi1 , yi2 ) = (0, 1) means yi = “non-spam”. Now we can
define a simple feature function:
f_{jk}(x, y) = \sum_{i} x_{ij} y_{ik}   (Equation 2.6)

\tilde{f}_{kk'}(x, y) = \sum_{i,j} e_{ij} y_{ik} y_{jk'}   (Equation 2.7)
The feature function f(x, y) will now be built by stacking all the f_{jk}(x, y)'s and
\tilde{f}_{kk'}(x, y)'s in one vector. The feature function f(x, ỹ), with true values of x and
a hypothetical output ỹ is used as the higher level input to the mathematical
model that describes the relevance of output structure ỹ. In particular, a linear
combination of individual elements in f (x, ỹ) is used as the criterion for relevance
of the hypothetical output ỹ to the true y, and is called the scoring function.
Formally, the scoring function is defined in the following form:
score(x, ỹ, w) = w^T f(x, ỹ)   (Equation 2.8)
w is called the model weight vector, and the goal of the machine learning algorithm
is to learn w such that the true labeling y gains the maximum score when plugged
into the score function. Unfortunately, it is possible that in some cases an alternate
labeling ỹ, which is very different from y, also gains a high score. Therefore, the
learning algorithm needs to select a w that penalizes such scenarios. We want
to learn w such that the closer ỹ is to y, the higher the score of ỹ is. Therefore
∆(ỹ, y) is defined as a measure of dissimilarity between ỹ and y. The Hamming
distance between ỹ and y is one of the popular choices. The difference function
∆(ỹ, y) plays an important role in many of the weight learning algorithms for
structured output prediction.
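The features of Equations 2.6 and 2.7 and the score of Equation 2.8 can be computed directly from the indicator matrices; a minimal sketch (the matrix layout is an assumption made for illustration):

```python
import numpy as np

def collective_features(X, E, Y):
    """Stack the features of Equations 2.6 and 2.7 into one vector.
    X: (n_docs, n_words) word indicators x_ij;
    E: (n_docs, n_docs) edge indicators e_ij;
    Y: (n_docs, n_classes) one-hot labels, e.g. (1, 0) = spam."""
    f_node = X.T @ Y      # entry (j, k)  = sum_i  x_ij y_ik        (Equation 2.6)
    f_edge = Y.T @ E @ Y  # entry (k, k') = sum_ij e_ij y_ik y_jk'  (Equation 2.7)
    return np.concatenate([f_node.ravel(), f_edge.ravel()])

def score(w, X, E, Y):
    """score(x, y~, w) = w^T f(x, y~)  (Equation 2.8)."""
    return w @ collective_features(X, E, Y)
```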
In structured output prediction algorithms, a crucial problem is the hardness
of searching over the possible ỹ ∈ Y to maximize the scoring function. In
particular, after learning a weight vector w, one will need to find the best output
for a given input. This is the “argmax problem” defined in Equation 2.9 and
referred to as maximum a posteriori (MAP) inference:
\hat{y}_{prediction} = h_w(x) = \arg\max_{\tilde{y} \in \mathcal{Y}} w^T f(x, \tilde{y})   (Equation 2.9)
This problem is not tractable in the general case. However, for specific Y
and f (x, y), one can use methods such as dynamic programming algorithms or
integer programming algorithms to efficiently find solutions. In particular, if f (x, y)
decomposes over the vector representation of y, such that no feature depends on
other features that have the same elements of y, then the problem is efficiently
solvable.
2.2.3. Structured Prediction Methods
In this part, we briefly explain some of the primary methods for weight
learning in structured prediction methods.
2.2.3.1. Structured Perceptron
The structured perceptron is an extension of the standard perceptron
(Lippmann, 1987) to structured prediction (Collins, 2002; Collins and Duffy, 2002;
McDonald et al., 2010). The algorithm of learning w is shown in Algorithm 1.
Algorithm 1 AveragedStructuredPerceptron((x_1, y_1), . . . , (x_N, y_N), maxIter)
  w ← [0, . . . , 0]^T
  c ← 1
  for l = 1 to maxIter do
    for i = 1 to N do
      ŷ_i ← arg max_{ỹ ∈ Y} w^T f(x_i, ỹ)
      if ŷ_i ≠ y_i then
        w ← (1 − θ_l) w + θ_l α (f(x_i, y_i) − f(x_i, ŷ_i))
      end if
    end for
  end for
  return w
In Algorithm 1, θ_l is a real number between 0 and 1 that determines the
weight of the current update relative to the previous weight in the lth iteration.
In a simple averaging algorithm, we can set θ_l = 1/l, with α as the learning
rate. The algorithm applies an update to the weight whenever the output of
arg max_{ỹ ∈ Y} w^T f(x, ỹ) is not equal to the true y. Note that the algorithm is only
applicable when the resulting output is either exactly equal to the true one, or it is
completely different. In other words, the difference function ∆(ỹ, y) ∈ {0, 1}. As a
consequence, this algorithm does not generalize well to unseen data.
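A minimal Python rendering of Algorithm 1 may be useful; the argmax oracle is passed in as a function, since solving Equation 2.9 is problem-specific (all names are illustrative):

```python
import numpy as np

def averaged_structured_perceptron(data, feature_fn, argmax_fn, dim,
                                   max_iter=10, alpha=1.0):
    """Sketch of Algorithm 1. data is a list of (x, y) pairs;
    feature_fn(x, y) returns f(x, y) in R^dim; argmax_fn(w, x) solves
    Equation 2.9 for the current weights."""
    w = np.zeros(dim)
    for l in range(1, max_iter + 1):
        theta = 1.0 / l  # simple averaging schedule for iteration l
        for x, y in data:
            y_hat = argmax_fn(w, x)
            if y_hat != y:  # mistake-driven update
                w = ((1 - theta) * w
                     + theta * alpha * (feature_fn(x, y) - feature_fn(x, y_hat)))
    return w
```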
2.2.3.2. Maximum entropy and log-linear models
The maximum entropy and log-linear models are duals of each other when
seen as optimization programs. Therefore, both of them are essentially the same
algorithm. In these algorithms, a parameterized distribution is discriminatively
defined over an output ỹ (or sometimes generatively over both the input x and the
hypothetical label ỹ), and the feature function f(x, y) is seen as the sufficient
statistics of this distribution:
p(\tilde{y}; x, w) = \frac{1}{z(x, w)} e^{w^T f(x, \tilde{y})}   (Equation 2.10)
The function z(x, w) is the normalization function, and is called the partition
function. For z(x, w) we have:
z(x, w) = \sum_{\tilde{y} \in \mathcal{Y}} e^{w^T f(x, \tilde{y})}   (Equation 2.11)
The higher the value of p(ỹ; x, w) is for a specific ỹ, the more probable it is that ỹ
is “close” to the true labeling y. Sometimes, L(ỹ; x, w) = − log p(ỹ; x, w) is used as
a measure of the unlikeliness of ỹ; smaller L(ỹ; x, w) means better ỹ:
L(\tilde{y}; x, w) = -\log p(\tilde{y}; x, w) = -w^T f(x, \tilde{y}) + \log \left( \sum_{\tilde{y}' \in \mathcal{Y}} e^{w^T f(x, \tilde{y}')} \right)   (Equation 2.12)
The maximum entropy framework is one of the most successful methods
for structured prediction. For example, McCallum et al. applied this method to
sequence labeling problems (McCallum et al., 2000), and much follow-up work
applied maximum entropy structured prediction in different disciplines (Califf and
Mooney, 2003; McDonald and Pereira, 2005; Begleiter et al., 2004; Punyakanok and
Roth, 2001; Chieu and Ng, 2002; Shen et al., 2007; Domke, 2013).
It is worth mentioning that conditional random fields (CRFs) can be seen as
a more general framework where a probability distribution is fitted to the data, and
the inference could be performed over structured outputs as well.
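When Y is small enough to enumerate, Equations 2.10-2.12 can be evaluated directly; a sketch (the log-sum-exp call is a numerical-stability detail added here, not from the text):

```python
import numpy as np
from scipy.special import logsumexp

def neg_log_likelihood(w, x, y, feature_fn, outputs):
    """L(y; x, w) = -w^T f(x, y) + log z(x, w)   (Equation 2.12), with
    z(x, w) = sum over y~ in Y of exp(w^T f(x, y~))  (Equation 2.11).
    Y is enumerated by brute force; structured models instead exploit the
    factorization of f to compute log z efficiently."""
    scores = np.array([w @ feature_fn(x, y_alt) for y_alt in outputs])
    return -(w @ feature_fn(x, y)) + logsumexp(scores)
```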
2.2.3.3. Re-ranking and search-based methods
Re-ranking is mostly applied to natural language processing problems.
Assume that we have access to an oracle that solves some inference problem, but
instead of generating “the best” output, it generates a list of “n best” outputs.
Then, the learner’s goal is to build a second model for choosing “one output” from
the “n best” outputs. A second model then improves this initial ranking, using
additional features as evidence. This approach allows a tree to be represented as
an arbitrary set of features, without concerns about how these features interact or
overlap, and without the need to define a derivation which takes these features into
account (Collins and Duffy, 2002; Collins and Koo, 2005).
Re-ranking has been applied in a variety of NLP problems including parsing
(Collins and Duffy, 2002; Collins and Koo, 2005; Charniak and Johnson, 2005),
machine translation (Shen et al., 2004; Och et al., 2003), question answering
(Ravichandran et al., 2003), semantic role labeling (Toutanova et al., 2005), and
other tasks. A main feature of re-ranking is that different loss functions can be
easily embedded into the algorithm and immediately tested. There are also some
drawbacks. For example, in a re-ranking algorithm, one should have an oracle for
choosing the n-best initial ranking, which may not be available, or n may be too large
to be useful.
Search-based structured prediction can be seen as an improved and more
advanced version of re-ranking. These algorithms are mostly developed by the
reinforcement learning community and have a flavor of solving the structured
prediction problems from a planning perspective. Daumé III et al. (2009)
introduced search-based structured prediction with the SEARN (SEarch
And leaRN) algorithm. This algorithm integrates searching and learning to solve
structured prediction problems. SEARN is a meta-algorithm that transforms
structured prediction problems into simple classification problems, to which any
binary classifier may be applied. SEARN is able to learn prediction functions for
different loss functions and different feature functions. There are several other
related works that use similar techniques (Daumé III and Marcu, 2005; Daumé III,
2009b,a; Doppa et al., 2012).
2.2.3.4. Maximum-margin Markov networks
The max-margin Markov network (M³N) class of structured prediction
methods is a generalization of max-margin methods in traditional machine
learning (also known as support vector machines (SVM)) to structured output
prediction settings. The early work by Taskar et al. (Koller et al., 2003; Taskar
et al., 2004a, 2005) was followed by much additional progress in the
development of max-margin methods (Tsochantaridis et al., 2006, 2004; Yu and
Joachims, 2009; Sen et al., 2008; McDonald et al., 2007).
To date, the state-of-the-art structural SVM is the 1-slack formulation
(Joachims et al., 2009), which solves the following optimization program:
\min_{w, \zeta} \; f(w) + C\zeta
\text{subject to} \quad \zeta \geq \max_{\tilde{y}} \left[ w^T \left( f(x, \tilde{y}) - f(x, y) \right) + \Delta(y, \tilde{y}) \right]   (Equation 2.13)
f(w) is a regularization function that penalizes "large" weights. Depending on
the application, f(w) can be any convex function in general. Semi-homogeneous
functions, such as norms or positive powers of norms, are among the favorite
choices.² f(w) = ½ w^T w is the most commonly used regularization function. For
simplicity, we have expressed the input data as a single training example, but it
can easily be expanded to a set of N independent examples, each of which makes
an independent contribution to the loss function. The variable ζ is the only slack
variable, which should be minimized along with the regularization function.

² A function f(z) is semi-homogeneous if and only if f(az) = a^α f(z) for some positive α.
The large-margin Markov networks are developed as convex optimization
programs. Therefore, it is mathematically convenient to derive robust formulations
based on them. In this dissertation, we mainly focus on large-margin methods.
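The constraint of Equation 2.13 is a maximization over outputs, often called loss-augmented inference; the structured hinge loss it induces for a single example can be sketched as follows (the inference oracle is assumed to be given):

```python
import numpy as np

def structured_hinge(w, x, y, feature_fn, delta, loss_aug_argmax):
    """Slack required by one example in Equation 2.13:
    max_{y~} [ w^T (f(x, y~) - f(x, y)) + Delta(y, y~) ].
    loss_aug_argmax(w, x, y) solves this inner maximization; since
    y~ = y contributes 0, the value is never negative."""
    y_hat = loss_aug_argmax(w, x, y)
    return (w @ (feature_fn(x, y_hat) - feature_fn(x, y))) + delta(y, y_hat)
```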
2.2.4. Optimization Algorithms
In most of the methods that we described above, the learning algorithm is
embedded into the model, but for the max-margin methods, we usually come up
with a mathematical optimization program. In the following, we briefly explain two
of the state-of-the-art optimization algorithms that are used for structured learning.
– Cutting plane algorithm:
In parameter learning of max-margin structured methods, the goal is to select parameters for which the score of the true labeling is ranked higher than the score of all alternate labelings. In principle, this can be done via a convex optimization program, such as a quadratic program. The issue is that the number of alternate labelings is usually exponential in the input size; therefore, listing all of them is intractable. At each iteration, the cutting plane algorithm finds the most violated constraint, i.e., the alternate labeling with the highest combined score and loss, and adds the corresponding constraint to ensure that the true labeling scores sufficiently higher than this alternate labeling (Tsochantaridis et al., 2004, 2006; Koller et al., 2003; Taskar et al., 2005; Yu and Joachims, 2009; Joachims et al., 2009). (A sketch of this training loop appears after this list.)
– Column generation:
We can solve the convex program generated by the max-margin approach in its dual form. The dual optimization program has a similar difficulty: the number of dual variables is exponential in the input size. Similar to the cutting plane algorithm, the column generation method selects a dual variable at each iteration and adds it to the dual program. Solving the problem in its dual form is useful because we can then use the power of kernel functions. Several works use column generation for parameter learning (Taskar et al., 2005; Teo et al., 2008; Smola et al., 2007; McAuley et al., 2008).
– Exponentiated gradient:
The exponentiated gradient algorithm also solves the optimization program in its dual form and applies a multiplicative update in each iteration. The key point is that the gradient is exponentiated (i.e., $e^{g}$ is used in place of the gradient $g$), and there are convergence theorems as well as experimental evaluations that demonstrate the efficiency of this approach (Kivinen and Warmuth, 1997; Bartlett et al., 2004; Globerson et al., 2007; Collins et al., 2008).
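To make the cutting plane loop described above concrete, the following is a minimal Python sketch of 1-slack cutting plane training. The functions joint_feature, loss, loss_augmented_inference, and solve_qp are hypothetical placeholders, not part of any particular library: the third returns the labeling maximizing score plus loss, and the fourth solves the quadratic program restricted to the constraints collected so far.

    import numpy as np

    def cutting_plane_train(examples, joint_feature, loss,
                            loss_augmented_inference, solve_qp,
                            max_iters=100, tol=1e-3):
        dim = len(joint_feature(*examples[0]))
        w, slack = np.zeros(dim), 0.0
        constraints = []                      # working set: (dphi, loss) pairs
        for _ in range(max_iters):
            dphi, total_loss = np.zeros(dim), 0.0
            for x, y in examples:
                # Most violated labeling: argmax_y' w.phi(x,y') + Delta(y,y')
                y_hat = loss_augmented_inference(w, x, y)
                dphi += joint_feature(x, y) - joint_feature(x, y_hat)
                total_loss += loss(y, y_hat)
            dphi /= len(examples)
            total_loss /= len(examples)
            if total_loss - w @ dphi <= slack + tol:
                break                         # no constraint violated beyond tol
            constraints.append((dphi, total_loss))
            # Re-solve: min_{w, slack} f(w) + C*slack
            #   s.t. w @ dphi >= l - slack  for every (dphi, l) collected so far
            w, slack = solve_qp(constraints)
        return w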
2.3. Adversarial machine learning
In this section, we discuss the theoretical framework of adversarial machine
learning in general, and at the same time address the main branches of the existing
work that apply to structured prediction problems.
Adversarial machine learning studies techniques that remain robust when adversarial components influence the process of input data generation. As security challenges increase, the need for adversarial machine learning algorithms is becoming more apparent (Laskov and Lippmann, 2010). In analogy with security problems, adversarial machine learning can be seen as a game between two players, where one player wants to protect the normal functionality of a system, and the other player wants to pursue its malicious goals. In adversarial machine learning terminology, the first player is called the learner (or the defender), and the second player is called the adversary (or the attacker) (Dalvi et al., 2004). A comprehensive body of work in recent years examines the security of machine learning systems and covers different classes of possible attacks against them (Lowd and Meek, 2005a; Globerson and Roweis, 2006; Teo et al., 2008; Lowd and Meek, 2005b; Blanzieri and Bryl, 2008; Brückner and Scheffer, 2009; Nelson, 2010; Brückner and Scheffer, 2011; Dreves et al., 2011; Brückner et al., 2012; Dritsoula et al., 2012; Sawade et al., 2013).
In the following subsection, we briefly address some of the most important
aspects of the state-of-the-art methods, and we will discuss the common themes in
adversarial machine learning algorithms.
We will also talk about regret minimization algorithms, which are somewhat complementary to adversarial machine learning. In the regret minimization framework, Nature behaves like an adversary and sets the costs and rewards. The goal is to choose a sequence of actions that minimizes the future regret. Regret is defined as the total cost incurred by the chosen actions over all time steps, minus the total cost that the single best fixed action (or policy) would have incurred over the same steps. The best fixed action is the one that would have been selected if all of the costs were known in hindsight.
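In symbols, with $c_t$ the cost function set by Nature at step t, $a_t$ the learner's chosen action, and $\mathcal{A}$ the set of available actions, the standard definition is:

R_T = \sum_{t=1}^{T} c_t(a_t) - \min_{a \in \mathcal{A}} \sum_{t=1}^{T} c_t(a)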
In this section, our perspective is mostly from the learner’s point of view,
and we categorize the adversarial attacks based on higher-level properties of an
adversary. For an extensive collection of possible threats that make most classical machine learning algorithms vulnerable to adversarial attacks, refer to Nelson (2010).
2.3.1. Adversary’s theoretical model
We start this section with some definitions:
Antagonistic adversary and zero-sum games: The adversary's goals are explicitly opposed to the learner's; i.e., the adversary's gain is exactly the learner's loss, and vice versa. Such games are called zero-sum, and such an opponent is known as an antagonistic adversary.
Non-antagonistic adversary and non-zero-sum games: If the opponent's goals are only implicitly against the learner's goals, then the adversary is seeking its own benefit, which may or may not be directly harmful to the learner. Whenever the rewards and losses of the two sides of the game are not necessarily equal in magnitude, the game is non-zero-sum. If increasing the learner's cost is not the primary aim of the adversary, then it is a non-antagonistic adversary.
Modeling the non-zero-sum game is relatively simple. Let $w \in \mathcal{W}$ be the parameters of the learner's model, and $a \in \mathcal{A}$ be the parameters of the adversary's model, through which the adversary directly affects the performance of the machine learning algorithm. $\mathcal{W}$ and $\mathcal{A}$ are respectively the action spaces of the learner and the adversary. Also, let $r_a(w, a)$ be the loss function that the learner wants to minimize by choosing the right $w$. (The function $r_a(w, a)$ is the reward of the adversary; in a zero-sum game, the learner's reward is $r_l(w, a) = -r_a(w, a)$, so $r_a(w, a)$ is the loss function from the learner's perspective.) An antagonistic adversary wants to maximize the loss of the learner by selecting an appropriate action $a$. Therefore, the adversarial game can be formulated as:

\min_{w \in \mathcal{W}} \max_{a \in \mathcal{A}} r_a(w, a) \qquad \text{(Equation 2.14)}
We present a general abstraction of adversarial games in Algorithm 2.
The machine learning algorithm (the learner) chooses an algorithm such
as decision tree classification, Naïve Bayes, support vector machine, etc., and
learns the parameters of the selected model based on its prior belief about the
adversary and the previously observed data. On the other hand, the adversary
also chooses an action from its plausible set of actions; this action is selected
based on the adversary's prior belief about the learner's choice of the model and its parameters. Note that each of the adversary's or learner's moves can be randomized or deterministic. In fact, each of the players may choose a mixed strategy, rather than a fixed move. It is Nature that decides the amount of positive or negative payoff for each combination of strategies chosen by the players. For example, in email spam detection, there are three sides: the spam filter, the spammer, and the user of the email service. Some emails are considered spam by some users but are valuable information for others. Therefore, if the spam filter algorithm wants to use a fixed model for all users, then it should carefully update its belief about the pay-offs that are set by Nature.
Algorithm 2 Adversarial Game
Initialize:
– Learner's prior belief:
∗ The learner chooses a model M as the machine learning algorithm.
∗ The learner initializes its belief Â about the adversary's set of strategies, based on previous observations.
∗ The learner selects parameters w of the model M based on Â and the earlier observations.
– Adversary's prior belief:
∗ The adversary chooses a set of strategies A based on its own prior knowledge and restrictions.
∗ The adversary initializes its belief M̂ about the learner's model, and its belief ŵ about the model parameters.
∗ The adversary chooses an action a ∈ A.
– Nature sets the laws:
∗ Nature chooses a set of incentives R.
while the set of incentives R exists do
Defend:
– The learner updates its approximation Â of the adversary's set of strategies.
– The learner updates parameters w based on Â and the observed adversary's action a.
– The learner gains reward r_l(w, a) ∈ R.
Attack:
– The adversary chooses an attack a ∈ A.
– The adversary gains reward r_a(w, a) ∈ R.
– The adversary updates its beliefs based on the observed reward r_a(w, a) and its new understanding of R.
Nature:
– Nature updates R.
end while
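The abstraction in Algorithm 2 can be read as a simple simulation loop. The following minimal Python sketch makes the turn-taking explicit; the learner, adversary, and nature objects and their methods are illustrative assumptions, not interfaces defined in this dissertation.

    def adversarial_game(learner, adversary, nature, rounds=100):
        action = adversary.initial_action()
        for _ in range(rounds):
            # Defend: refine the estimate of the adversary, then the parameters.
            learner.update_adversary_estimate(action)
            learner.update_parameters()
            learner_reward = nature.reward_learner(learner.w, action)

            # Attack: the adversary responds to the (estimated) learned model.
            action = adversary.choose_action(learner)
            adversary_reward = nature.reward_adversary(learner.w, action)
            adversary.update_beliefs(adversary_reward)

            # Nature: incentives may drift between rounds.
            nature.update()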
The order of the itemized events in Algorithm 2 can be completely arbitrary.
Each of the existing approaches to adversarial machine learning is designed based
on some assumptions about Algorithm 2. In the following, we briefly categorize the
main possibilities.
2.3.1.1. Type of adversarial problems
A key point of difference among algorithmic approaches to adversarial machine learning is the order in which the events of Algorithm 2 occur. In particular, the existing studies are mostly based on several general assumptions regarding the possible order in which events occur:
– Based on the Stackelberg competition scenario: The Stackelberg competition model is a strategic game model in which one of the players (called “the leader”) plays first, and the other player (called “the follower”) plays sequentially. This is the model closest to real-world challenges. The learner (the leader) updates its model parameters after observing the adversary's (the follower's) action, and possibly incurs some losses (Globerson and Roweis, 2006; Teo et al., 2008; Brückner and Scheffer, 2011; Sawade et al., 2013; Torkamani and Lowd, 2013).
– Based on Nash Equilibria: In these models, although the order of events is arbitrary, there hypothetically exist jointly optimal strategies for both players, where no player gains more reward by deviating from its current policy. It is a known fact from game theory that such optima do not necessarily exist among pure strategies (Brückner and Scheffer, 2009; Dreves et al., 2011; Brückner et al., 2012; Dritsoula et al., 2012).
– Based on Poisoning the Training Data: The adversary generates
several specially designed data points and injects them into the training
data. The adversary's goal in these kinds of attacks is to make the machine
learning algorithm learn a wrong model in the first place. Such attacks can be
designed to target individual machine learning algorithms (Biggio et al., 2012;
Dekel et al., 2010; Biggio et al., 2013a,b, 2014).
– Based on Regret Minimization: In these models, the adversary and Nature are the same, and Nature chooses a new cost function for each action of the learner at each iteration of the game. The goal is to minimize the regret the learner suffers relative to the best fixed strategy it could have chosen had it known all of the costs imposed by Nature in hindsight (Shalev-Shwartz, 2011; Ross et al., 2011, 2010).
In general, finding the Nash equilibrium becomes harder when the dimensionality of the players' actions is large and the utility functions are arbitrary. Brückner and Scheffer (2009) show that under certain convexity and separability conditions on the utility function, a Nash equilibrium exists; this equilibrium can be found by simulating the adversarial game. Stackelberg competitions are often more approachable, because the learner simply selects the strategy that restricts the worst-case adversary in a minimax formulation: the learner minimizes a loss function under a worst-case adversarial manipulation. An unrealistic assumption that many papers make to simplify the problem is the continuity of feature functions, which does not hold in many domains (Globerson and Roweis, 2006; Teo et al., 2008; Brückner and Scheffer, 2011; Sawade et al., 2013; Brückner and Scheffer, 2009; Dreves et al., 2011; Brückner et al., 2012; Dritsoula et al., 2012).
Globerson and Roweis (2006) formulate the problem of feature deletion at test time as a Stackelberg game. This method is only applicable to binary and multi-label classification and does not apply to structured output prediction problems. Another weakness of this approach is that it is only robust to feature deletion; other possible adversarial manipulations of the data, such as feature addition, are ignored. Teo et al. (2008) generalized the former method to all invariants of the input data. (In machine learning and computer vision terminology, an “invariant” of a data point x with label y is a variation x̃ of x that the classifier of interest still labels as y.) This method is not practical whenever the number of possible transformations is exponential in the input size (or sometimes infinite).
2.3.1.2. Knowledge about the opponent
From the knowledgeability perspective, there are two types of adversaries:
passive or active. Passive adversaries do not have access to the learner’s model,
so they try to attack the system, and observe the outcomes, in order to infer the
parameters of the algorithm working behind the scenes. Active adversaries have
full access to the learner’s model and the parameters that the learner has selected
for the model (Lowd and Meek, 2005b,a; Blanzieri and Bryl, 2008). A passive
adversary may converge to an active adversary in theory, especially if the learner
does not update its model parameters. In real-world problems, the adversaries are
passive in general, but most of the existing studies focus on the active adversary
assumption.
It is also important for the learner to know the adversary’s limitations
and incentives. If the model is non-antagonistic, then the adversary has its own
incentives; knowing these incentives can be used in modeling the adversary. This
knowledge can be used in generating robust model parameters for the learner. The
effectiveness of our methods depends on how accurately we model the adversary,
but the true costs and constraints of the adversary are rarely known in advance.
There is not much work that models the incentives of the adversary, but a few methods assume that the adversary is rational (Nguyen et al., 2013).
One advantage the learner has is the adversary's limitations; most Stackelberg game approaches exploit this fact to learn robust models by incorporating the restrictions of the adversary into the learning algorithm (Globerson and Roweis, 2006; Teo et al., 2008; Torkamani and Lowd, 2013; Livni and Globerson, 2012).
Some other recent papers have considered the relationship between regularization and robustness to restricted adversaries in SVMs. Xu et al. (2009) demonstrate that using a norm as a regularizer is equivalent to optimizing against a worst-case adversary that can manipulate features within a ball defined by the dual norm. Several related works expand this idea in different directions (Xu et al., 2010; Xu and Mannor, 2012). For example, in follow-up work, Xu et al. extend their approach to the robust regression problem (Xu et al., 2010).
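The flavor of the Xu et al. (2009) result can be stated compactly. For a linear model with loss L and perturbations $\delta_i$ bounded in some norm, a simplified sketch of the equivalence (their actual theorem couples the perturbations across samples and requires additional conditions) is:

\min_{\mathbf{w}} \max_{\|\delta_i\| \le c} \sum_i L(\mathbf{w}; \mathbf{x}_i + \delta_i, y_i) \;=\; \min_{\mathbf{w}} \sum_i L(\mathbf{w}; \mathbf{x}_i, y_i) + c\,\|\mathbf{w}\|_*

where $\|\cdot\|_*$ is the dual of the norm that bounds the adversary's perturbations.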
2.3.1.3. The role of Nature
In Algorithm 2, we have separated the adversary and Nature. In fact, the adversary follows the rules that Nature sets. For example, in stock markets, some traders (adversaries) want to increase their pay-offs by choosing the right portfolio, but the demands of the market are the main criteria that affect the stock indices. Another example is the laws of physics that Nature sets: a robot controller algorithm should be robust to adversarial accidents that threaten autonomous robot agents, but falling from a two-foot rock is clearly different from falling from a 500-foot cliff. As a result, it is important for both the learner and the adversary to learn the laws of Nature as well.
2.3.2. Adversarial learning techniques
In this subsection, we review some of the primary techniques in adversarial
machine learning that are applicable to supervised learning methods, and we
formulate the adversarial game as a set of optimization programs.
2.3.2.1. Utility-based approaches
Utility-based approaches are among the early works in adversarial machine learning that are applicable to structured prediction as well. In these models, both the learner and the adversary have their own specific utility functions, which are arbitrary reward functions. In a game-theoretic framework, each player tries to maximize its reward. Brückner and Scheffer take this approach in several of their papers. They show that for particular non-antagonistic utility functions, the prediction game has a unique Nash equilibrium, and they derive a simulation-based algorithm for finding the converging models (Brückner and Scheffer, 2009; Brückner et al., 2012). In another work, they model the interaction between the learner and the adversary as a Stackelberg game, in which the learner plays the role of the leader and the adversary reacts to the learned model (Brückner and Scheffer, 2011). This framework is, in fact, a minimax scenario in which the learner tries to minimize the maximum possible damage that the adversary can cause. These methods are not designed for structured prediction problems, but their underlying framework is general purpose. Satisfying some of the assumptions may not be possible, especially for finding the Nash equilibrium. The main drawback of these works is that the formulations assume a relaxed action space; this assumption does not hold in many structured (and non-structured) output spaces.
Other works expand the analysis of the conditions for the existence of the Nash equilibrium; for example, Dreves et al. (2011) have analyzed the Karush-Kuhn-Tucker (KKT) conditions under which the generalized Nash equilibrium exists.
2.3.2.2. Max-margin-based Adversarial Structured Learning
The max-margin based algorithms include the large class of SVM classifiers.
Therefore, many adversarial methods are based on max-margin formulations. The
following is a brief review of each of these methods.
– Embedding the simulated adversary:
The key idea for making max-margin learning approaches robust to adversarial data manipulation is to embed the adversarial uncertainty component into the optimization program of the max-margin method. Schölkopf et al. (1997) were among the first to use virtual (e.g., noise-polluted) samples for training the model. This approach was first used as an embedded part of the algorithm for binary SVMs by Globerson and Roweis (2006), where a limited number of the features could be set to zero by the adversary at test time. Later, Teo et al. (2008) expanded this idea to include a wider class of possible adversaries. The main limitation of the latter work is that there must exist an efficient computational procedure for simulating the adversary; this is not always tractable, because the number of possible adversarial manipulations of the input data can be extremely large.
Other approaches have a similar nature. For example, Biggio et al. (2011) formulate the problem in the dual form and model the adversarial noise as the Hadamard product of a noise matrix and the kernel matrix. Some other authors assume that the adversarial noise is drawn from a distribution and try to ensure robustness to that kind of perturbation (Livni and Globerson, 2012; Maaten et al., 2013).
– Robustness by regularization:
In general, robust optimization addresses optimization problems in which some degree of uncertainty governs the known parameters of the model. Ben-Tal and Nemirovski (1998, 1999, 2000, 2001) showed that a range of applications can be formulated in a robust convex optimization framework. Robust linear programming is a central method in most of these formulations. Bertsimas and Sim (2004) show that for box-bounded disturbances, the parameters can take the worst-case value, and there is a trade-off between optimality and robustness. In Bertsimas et al. (2004), the authors focus on the case where the disturbance of the inputs is restricted to an ellipsoid around the actual values, defined by some norm. They show that the robust linear programming problem can be reduced to a convex cone program, where the conic constraint is defined by the dual of the original norm. A number of other authors have explored the application of robust optimization to classification problems (e.g., Lanckriet et al., 2003; El Ghaoui et al., 2003; Bhattacharyya et al., 2004; Shivaswamy et al., 2006). Recently, Xu et al. (2009) showed that regularization of support vector machines can be derived from a robust formulation, and they also argue that robustness in feature space entails robustness in sample space.
– Robustness to poisoning attacks:
A poisoning attack refers to a scenario in which the adversary injects corrupted samples into the training data, so that the classifier learns a wrong model and, as a result, the test error increases. To the best of my knowledge, there is no existing published work that attempts to guarantee robustness against these kinds of attacks. Filling this gap is worthwhile, and it is especially relevant to applications in which the number of training samples is limited.
Biggio et al. (2012) have studied this problem for non-structured prediction. They investigate a family of poisoning attacks against SVMs. Most learning algorithms assume that their training data comes from a natural distribution and are therefore vulnerable to these kinds of attacks. An intelligent adversary can, to some extent, predict the change of the SVM's decision function due to malicious input and use this ability to construct malicious data. Dekel et al. (2010) solve a similar problem for binary SVMs, where they apply several relaxations to the integer program formulation of the problem and use the $L_\infty$ norm as the regularizer. Because of this choice of regularizer, they end up with a linear program. They state that the choice of $L_\infty$ regularization makes their method more efficient, but do not elaborate further. There are some other works in the literature that attempt to train models that are robust to poisoning attacks (Biggio et al., 2013a,b, 2014).
2.3.2.3. Online learning and regret minimization:
Online learning is based on the idea of choosing the best strategy based on
the data that is being received in a stream (Shalev-Shwartz, 2011). The amount of available data is usually huge; therefore, we prefer to look at each data point only a limited number of times, ideally only once. Regret minimization is an adversarial method for learning in online settings.
2.4. Applications of adversarial structured learning
Improving the performance of structured prediction algorithms is one of our main contributions in this thesis. In this section, we review the significance that this improvement has for real-world applications.
2.4.1. Collective Classification
Many real-world relational learning problems can be formulated as collective classification. For example, web spam detection can be formulated as a joint classification problem where each webpage is either spam or non-spam, and the label of each webpage depends not only on its contents but also on the labels of the neighboring webpages that link to it (Sen et al., 2008; Abernethy et al., 2010).
Our paper “Convex Adversarial Collective Classification” (Torkamani and Lowd, 2013) is the first published work in the field of structured output prediction that is designed to be directly robust against adversarial manipulation of data at test time. We assume that the adversary can change up to D attributes across all webpages. (This is the main limitation of the adversary; it cannot manipulate “everything” in the network.) By incorporating this limitation of the adversary in a robust optimization program, we obtain an efficient method for robustly solving the problem of collective classification in associative Markov networks (Taskar et al., 2004a). (For binary labels, such as spam detection, efficiency is guaranteed; when there are more than two possible labels, the results are approximate in theory, but quite accurate in practice.)
Other researchers have addressed this problem with an implicit effort toward robustness. Sen et al. (2008) discuss that the “Iterative Classification Algorithm” (Jensen and Neville, 2002; Lu and Getoor, 2003) is relatively robust to the order in which the nodes are visited, but their method is not robust to the manipulation of test data. Tian et al. (2006) introduce an additional heuristic weight on top of a dependency network (Neville and Jensen, 2007; Lowd and Shamaei, 2011) to model the strength of the dependencies. Although this additional weight makes the approach robust to random noise, the method is not robust to malicious noise. McDowell et al. (2009) introduce the cautious iterative classification algorithm, where each local classification also produces a confidence score; if this score is below some threshold, the predicted label is ignored by the algorithm. This
method is also heuristic and does not rely on the related literature of robust
machine learning.
Abernethy et al. (2010) introduce the “WITCH” algorithm, which uses graph regularization to exploit link information when regularizing the model parameters. Their method gains implicit robustness from regularization, but it is not robust to adversarial attacks targeted at collective classification algorithms.
2.4.2. Anomaly Detection
Anomaly detection is the problem of detecting unusual samples among ordinary ones; detecting network intrusions or instances of credit card fraud are examples. An intrusion detection system is now an important part of any computer network. When a set of agents in the network collaborate in an attack, the network protection system needs to perform structured prediction to determine the role of each agent in the network. A group of papers uses conditional random fields or hidden Markov models to perform this task (Gupta et al., 2007, 2010; Qiao et al., 2002). The main drawback of these methods is their own robustness: they use machine learning algorithms to improve the robustness of the system, but the algorithms themselves are not robust to engineered attacks.
Song et al. (2013) introduce a one-class classification approach for detecting sequential anomalies. Their method is robust to outliers in the training data. Although the method is elegant, what makes it less applicable to adversarial settings is that adversarially manipulated samples are different from outliers. In particular, the adversary manipulates the data as a response to the learned parameters of the classification method.
2.4.3. Practical applications
The following is a list of some of the real-world applications of adversarial
structured prediction.
2.4.3.1. Security applications
Security issues are becoming more serious and critical, and naturally, machine learning tools are being used to solve some of these problems. Security challenges can be formulated as a game between the defender (or learner) and the attacker (or adversary). Not only is the action space in security games large, but the defender's limited resources are also a challenge in most cases. In real-world security problems, there are not enough agents to patrol all the targets that the adversary could attack; therefore, deciding the placement of resources is highly important.
Pita et al. (2008) and Jain et al. (2010b) have developed an algorithm called ARMOR, which is now deployed at the Los Angeles International Airport (LAX) to randomize the checkpoints on the roadways that enter the airport. With randomization, the strategies are drawn from some mixture of strategy distributions, rather than taking a fixed pure strategy every time; as a result, the criminal cannot precisely determine the next action. Some other related works are IRIS (Tsai et al., 2009), fast generation of flight schedules (Jain et al., 2010a), PROTECT (Shieh et al., 2012; Fang et al., 2013), and GUARDS (Pita et al., 2011), among others (Yin et al., 2011, 2012; Jiang et al., 2013b,a; Basilico et al., 2009; An et al., 2012; Korzhyk et al., 2011).
Dickerson et al. (2010) look at security games from a graph-theoretic perspective and propose a greedy algorithm for protecting moving targets from adversaries.
2.4.3.2. Computer vision
Both robustness and structured output prediction are highly needed in computer vision applications.
Fua et al. (2013) propose a working-set-based approximate subgradient descent algorithm to solve the optimization program of the structured SVM. They solve an image segmentation problem, where exact inference is intractable and the most violated constraints can only be approximated. They randomly sample new constraints, instead of computing them using the more expensive approximate inference techniques. This random sampling is not designed to explicitly block adversaries, but it gains some robustness at prediction time. From a theoretical point of view, this method should not work well in general, because the randomly selected constraints may be insignificant, which slows the convergence of the algorithm. Nonetheless, the method has been successful in their application.
Gong et al. (2012) propose a structured prediction method where the output space is a subset of two distinct manifolds; their method tries to be robust to noise and to choose the output from the right manifold. This method is shown to be efficient for human motion capture from videos. Ranjbar et al. (2013) focus on selecting robust features in advance to gain robustness in structured prediction. Exploiting domain knowledge is another way to increase robustness, as in play-type recognition for football games recorded by noisy sensors (Chen et al., 2014b).
2.4.4. Speech recognition
As the applications of structured prediction grow in different subfields of signal processing, the robustness issue becomes more prominent. Speech recognition is an attractive example. Zhang et al. (2010, 2011) parameterize a noise model, embed it into the optimization program, and optimize for the noise control parameter as well. In their problem, the noise in the speech signal is not adversarial; adversarial speech recognition remains a field with major real-world applications.
In the next chapter, we introduce a novel method for efficient collective classification in adversarial settings.
CHAPTER III
CONVEX ADVERSARIAL COLLECTIVE CLASSIFICATION
This work was published in the proceedings of the thirtieth International
Conference on Machine Learning (ICML 2013). I was the primary contributor
to the methodology and writing, and designed and conducted the experiments.
My Ph.D. advisor, Dr. Daniel Lowd, contributed partly to the methodology and writing. Daniel Lowd was the principal investigator for this work.
In collective classification (Sen et al., 2008), we wish to jointly label a set
of interconnected objects using both their attributes and their relationships. For
example, linked web pages are likely to have related topics; friends in a social
network are likely to have similar demographics; and proteins that interact with
each other are likely to have similar locations and related functions. Probabilistic
graphical models, such as Markov networks, and their relational extensions, such as
Markov logic networks (Domingos and Lowd, 2009a), can handle both uncertainty
and complex relationships in a single model, making them well-suited to collective
classification problems.
However, many collective classification models must also cope with test data
that is drawn from a different distribution than the training data. In some cases,
this is simply a matter of concept drift. For example, when classifying blogs,
tweets, or news articles, the topics being discussed will vary over time. In other
cases, the change in distribution can be attributed to one or more adversaries
actively modifying their behavior in order to avoid detection. For example, when
search engines began using incoming links to help rank web pages, spammers began
posting comments on unrelated blogs or message boards with links back to their
websites. Since incoming links are used as an indication of quality, manufacturing
incoming links makes a spammy web site appear more legitimate. In addition
to web spam (Abernethy et al., 2010; Drost and Scheffer, 2005), other explicitly
adversarial domains include counter-terrorism, online auction fraud (Chau et al.,
2006), and spam in online social networks.
Rather than simply reacting to an adversary’s actions, recent work in
adversarial machine learning takes the proactive approach of modeling the learner
and adversary as players in a game. The learner selects a function that assigns
labels to instances, and the adversary selects a function that transforms malicious
instances in order to avoid detection. The strategies chosen determine the outcome
of the game, such as the success rate of the adversary and the error rate of the
chosen classifier. By analyzing the dynamics of this game, we can search for an
effective classifier that will be robust to adversarial manipulation. Even in non-adversarial domains such as blog classification, selecting a classifier that is robust
to a hypothetical adversary may lead to better generalization in the presence of
concept drift or other noise (Figure 3.1).
Early work in adversarial machine learning included methods for blocking the
adversary by anticipating their next move (Dalvi et al., 2004), reverse engineering classifiers (Lowd and Meek, 2005b,a; see also Nelson et al., 2010), and building classifiers robust to feature deletion or other invariants (Globerson and
Roweis, 2006; Teo et al., 2008). More recently, Brückner and Scheffer showed
that, under modest assumptions, Nash equilibria can be found for domains such as
spam (Brückner and Scheffer, 2009). However, current adversarial methods assume
that instances are independent, ignoring the relational nature of many domains.
In this chapter, we present Convex Adversarial Collective Classification
(CACC), which combines the ideas of associative Markov networks (Taskar et al.,
2004a) (AMNs) and convex learning with invariants (Teo et al., 2008). Unlike
previous work in learning graphical models, CACC selects the most effective
weights assuming a worst-case adversary who can modify up to a fixed number
of binary-valued attributes. Unlike previous work in adversarial machine learning,
CACC allows for dependencies among the labels of different objects, as long as
these dependencies are associative. Associativity means that related objects are
more likely to have the same label, which is a reasonable assumption for many
collective classification domains. Surprisingly, all of this can be done in polynomial
time using a convex quadratic program.
In experiments on real and synthetic data, CACC finds much better strategies
than both a naïve AMN that ignores the adversary and a non-relational adversarial
baseline. In some cases, the adversarial regularization employed by CACC helps
it generalize better than AMNs even when the test data is not modified by any
adversary.
FIGURE 3.1. The adversary knows the parameters of our classifier and can
maliciously modify data to attack. The learner should select the best classifier,
assuming the worst adversarial manipulation.
3.1. Max-margin relational learning
We use uppercase bold letters (X) to represent sets of random variables, lowercase bold letters (x) to represent their values, and subscripts and superscripts ($x_{ij}$, $y_i^k$) to indicate individual elements in those sets.
Markov networks (MNs) represent the joint distribution over a set of random variables $\mathbf{X} = \{X_1, \ldots, X_N\}$ as a normalized product of factors:

P(\mathbf{X}) = \frac{1}{Z} \prod_i \phi_i(D_i)

where Z is a normalization constant so that the distribution sums to one, $\phi_i$ is the $i$th factor, and $D_i \subseteq \mathbf{X}$ is the scope of the $i$th factor. Factors are sometimes referred to as potential functions. For positive distributions, a Markov network can also be represented as a log-linear model:

P(\mathbf{X}) = \frac{1}{Z} \exp\left( \sum_i w_i f_i(D_i) \right)

where $w_i$ is a real-valued weight and $f_i$ a real-valued feature function. For the common case of indicator features, each feature equals 1 when some logical expression over the variables is satisfied and 0 otherwise.
A factor or potential function is associative if its value is at least as great when the variables in its scope take on identical values as when they take on different values. For example, consider a factor $\phi$ parameterized by a set of non-negative weights $\{w_k\}$, so that $\phi(y_i, y_j) = \exp(w_k)$ when $y_i = y_j = k$ and 1 otherwise. $\phi$ is clearly associative, since its value is higher when $y_i = y_j$. An associative Markov network (AMN) (Taskar et al., 2004a) is an MN where all factors are associative. Certain learning and inference problems that are intractable in general MNs have exact polynomial-time solutions in AMNs with binary-valued variables, as will be discussed later.
An MN can also represent a conditional distribution, P (Y|X), in which case
the normalization constant becomes a function of the evidence, Z(X).
In this chapter, we focus on collective classification, in which each object in
a set is assigned one of K labels based on its attributes and the labels of related
objects. We now give an example of a simple log-linear model for collective
classification, which we will continue to use for the remainder of the chapter.
Following Taskar et al. (2004a), let $y_i^k = 1$ if the $i$th object is assigned the $k$th label, and 0 otherwise. We use $x_{ij}$ to represent the value of the $j$th attribute of the $i$th object. The relationships among the objects are given by E, a set of undirected edges of the form $(i, j)$.
Our model includes features connecting each attribute $x_{ij}$ to each label $y_i^k$, represented by the product $x_{ij} y_i^k$. To add the prior distribution over the labels, we simply define an additional feature $x_{i,0}$ that is 1 for every object, similar to a bias node in neural networks. For each pair of related objects $(i, j) \in E$, we also include a feature $y_i^k y_j^k$ which is 1 when both the $i$th and $j$th objects are assigned label k. This leads to the following model:
This leads to the following model:
P (y|x) =
1
Z(x)

X
exp 
wjk xij yik +
ijk

X
wek yik yjk 
(Equation 3.1)
(i,j)∈E,k
Note that all objects share the same attribute weights, $w_j^k$, and all links share the same edge weights, $w_e^k$, in order to generalize to unseen objects and relationship graphs. This model can also be easily expressed as a Markov logic network (MLN) (Domingos and Lowd, 2009a), in which formulas relate class labels to other attributes and the labels of linked objects.
MLNs make it easy to compactly describe very complex distributions. For
example, a simple collective classification model can be defined using relatively
simple formulas, as shown in Table ??. The subscript j and superscript k indicate
that different formulas are defined for each attribute j ∈ {1, . . . , M } and object
label k ∈ {1, . . . , K}. The formula from the first line defines features for the
prior distribution over labels in the absence of any attributes or links. The next
line relates each object’s attributes to its label. The third line relates the labels of
neighboring objects. Note that the first formula may be omitted as a special case
of the second if we assume that a special bias attribute Attribute0 (o) is true for
every object o.
A common inference task is to find the most probable explanation (MPE),
the most likely assignment of the non-evidence variables y given the evidence.
This can be done by maximizing the unnormalized log probability, since log is a
monotonic function and the normalization factor Z is constant over y. For the
simple collective classification model, the MPE task is to find the most likely
labeling given the links and attributes:

\arg\max_{\mathbf{y}} \; \sum_{i,j,k} w_j^k x_{ij} y_i^k + \sum_{(i,j) \in E, k} w_e^k y_i^k y_j^k
In general, inference in graphical models is computationally intractable. However, for the special case of AMNs with binary-valued variables, MPE inference can be done in polynomial time by formulating it as a min-cut problem (Kolmogorov and Zabin, 2004). For $w_e^k \ge 0$, our working example of a collective classification model is an AMN over the labels y given the links E and attributes x. In general, associative interactions are very common in collective classification problems, since related objects tend to have similar properties, a phenomenon known as homophily.
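To illustrate the reduction in the binary case, here is a minimal sketch that recasts score maximization as an equivalent energy minimization with non-negative unary costs and non-negative (associative) disagreement penalties; the function name, the dict-based inputs, and the reserved node names 's' and 't' are illustrative assumptions.

    import networkx as nx

    def mpe_min_cut(theta0, theta1, lam):
        # theta0[i], theta1[i]: cost of labeling node i as 0 or 1 (non-negative)
        # lam[(i, j)]: penalty paid when i and j take different labels
        G = nx.DiGraph()
        for i in theta0:
            G.add_edge('s', i, capacity=theta1[i])   # cut iff i is labeled 1
            G.add_edge(i, 't', capacity=theta0[i])   # cut iff i is labeled 0
        for (i, j), w in lam.items():
            G.add_edge(i, j, capacity=w)             # cut iff labels disagree
            G.add_edge(j, i, capacity=w)
        cut_value, (source_side, _) = nx.minimum_cut(G, 's', 't')
        # Nodes on the source side pay theta0 (their edge to 't' is cut).
        return {i: 0 if i in source_side else 1 for i in theta0}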
Markov networks and MLNs are often learned by maximizing the (conditional) log-likelihood of the training data (e.g., Lowd and Domingos (2007)). An alternative is to maximize the margin between the
correct labeling and all alternative labelings, as done by max-margin Markov networks (M^3Ns) (Taskar et al., 2004b) and max-margin Markov logic networks (M^3LNs) (Huynh and Mooney, 2009). Both approaches are intractable in the general case. For the special case of AMNs, however, max-margin weight learning can be formulated as a quadratic program which gives optimal weights in polynomial time, as long as the variables are binary-valued (Taskar et al., 2004a). We now briefly describe the solution of Taskar et al., which will later motivate our adversarial extension of AMNs. (We use slightly different notation from the original presentation in order to make the structure of x and y clearer.)
The goal of the AMN optimization problem is to maximize the margin between the log probability of the true labeling, $h(\mathbf{w}, \mathbf{x}, \hat{\mathbf{y}})$, and any alternative labeling, $h(\mathbf{w}, \mathbf{x}, \mathbf{y})$. For our problem, h follows from Equation 3.1:

h(\mathbf{w}, \mathbf{x}, \mathbf{y}) = \sum_{i,j,k} w_j^k x_{ij} y_i^k + \sum_{(i,j) \in E, k} w_e^k y_i^k y_j^k

We can omit the $\log Z(\mathbf{x})$ term because it cancels in the difference. Margin scaling is used to enforce a wider margin from labelings that are more different. We define this difference as the Hamming distance:

\Delta(\mathbf{y}, \hat{\mathbf{y}}) = N - \sum_{i,k} y_i^k \hat{y}_i^k

where N is the total number of objects. We thus obtain the following minimization problem with an exponential number of constraints (one for each y):
\min_{\mathbf{w}, \xi} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\xi \qquad \text{(Equation 3.2)}
\quad \text{s.t.} \quad h(\mathbf{w}, \mathbf{x}, \hat{\mathbf{y}}) - h(\mathbf{w}, \mathbf{x}, \mathbf{y}) \ge \Delta(\mathbf{y}, \hat{\mathbf{y}}) - \xi \qquad \forall \mathbf{y} \in \mathcal{Y}
Minimizing the norm of the weight vector is equivalent to maximizing the margin. The slack variable $\xi$ represents the magnitude of the margin violation, which is scaled by C and used to penalize the objective function. To transform this into a tractable quadratic program, Taskar et al. modify it in several ways. First, they replace each product $y_i^k y_j^k$ with a new variable $y_{ij}^k$ and add constraints $y_{ij}^k \le y_i^k$ and $y_{ij}^k \le y_j^k$. In other words, $y_{ij}^k \le \min(y_i^k, y_j^k)$, which is equivalent to $y_i^k y_j^k$ for $y_i^k, y_j^k \in \{0, 1\}$. Second, they replace the exponential number of constraints with a continuum of constraints over a relaxed set $\mathbf{y} \in \mathcal{Y}'$, where $\mathcal{Y}' = \{\mathbf{y} : y_i^k \ge 0; \sum_k y_i^k = 1; y_{ij}^k \le y_i^k; y_{ij}^k \le y_j^k\}$. Since all constraints share the same slack variable, $\xi$, we can take the maximum to summarize the entire set by the most violated constraint. After applying these modifications, substituting in h and $\Delta$, and simplifying, we obtain the following optimization problem for our collective classification task:
\min_{\mathbf{w}, \xi} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\xi \qquad \text{s.t.} \quad \mathbf{w} \ge 0; \qquad \text{(Equation 3.3)}
\xi - N \ge \max_{\mathbf{y} \in \mathcal{Y}'} \; \sum_{i,j,k} w_j^k x_{ij} (y_i^k - \hat{y}_i^k) + \sum_{(i,j) \in E, k} w_e^k (y_{ij}^k - \hat{y}_{ij}^k) - \sum_{i,k} y_i^k \hat{y}_i^k
Finally, since the inner maximization is itself a linear program, we can replace it with the minimization of its dual to obtain a single quadratic program (not shown). For the two-class setting, Taskar et al. prove that the inner program always has an integral solution, which guarantees that the weights found by the outer quadratic program are always optimal.
For simplicity and clarity of exposition, we have used a very simple collective
classification model as our working example of an AMN. This model can easily
be extended to allow multiple link types with different weights, link weights that
are a function of the evidence, and higher-order links (hyper-edges), as described
by Taskar et al. (2004a). Our adversarial variant of AMNs, which will be described
in Section 4, supports most of these extensions as well.
3.2. Convex formulation
Collective classification problems are hard because the number of joint label
assignments is exponential in the number of nodes. As discussed in the previous section, if neighboring nodes are more likely to have the same label, then the collective classification problem can be represented as an associative Markov network (AMN), in which max-margin learning and MPE inference are both efficient. To construct an adversarial collective classifier, we start with the AMN formulation (Equation 3.3) and incorporate an adversarial invariant, similar to the approach of Globerson and Roweis (2006). Specifically, we assume that the adversary may change up to
D binary-valued features $x_{ij}$, for some positive integer D that we select in advance. We use $\hat{\mathbf{x}}$ to indicate the true features and $\mathbf{x}$ to indicate the adversarially modified features. The number of changes can be written as:

\Delta(\mathbf{x}, \hat{\mathbf{x}}) = \sum_{i,j} x_{ij} + \hat{x}_{ij} - 2 x_{ij} \hat{x}_{ij}

We define the set of valid $\mathbf{x}$ as $\mathcal{X}' = \{\mathbf{x} : 0 \le x_{ij} \le 1; \Delta(\mathbf{x}, \hat{\mathbf{x}}) \le D\}$. Note that $\mathcal{X}'$ is a relaxation that allows fractional values, much like the set $\mathcal{Y}'$ defined by Taskar et al. We will later show that there is always an integral solution when both the features and labels are binary-valued.
In our adversarial formulation, we want the true labeling $\hat{\mathbf{y}}$ to be separated from any alternate labeling $\mathbf{y} \in \mathcal{Y}'$ by a margin of $\Delta(\mathbf{y}, \hat{\mathbf{y}})$ given any $\mathbf{x} \in \mathcal{X}'$. Rather than including an exponential number of constraints (one for each x and y), we use a maximization over x and y to find the most violated constraint:

\max_{\mathbf{y} \in \mathcal{Y}', \mathbf{x} \in \mathcal{X}'} \; h(\mathbf{w}, \mathbf{x}, \mathbf{y}) - h(\mathbf{w}, \mathbf{x}, \hat{\mathbf{y}}) + \Delta(\mathbf{y}, \hat{\mathbf{y}})
= \max_{\mathbf{y} \in \mathcal{Y}', \mathbf{x} \in \mathcal{X}'} \; \sum_{i,j,k} w_j^k x_{ij} y_i^k + \sum_{(i,j) \in E, k} w_e^k y_{ij}^k - \sum_{i,j,k} w_j^k x_{ij} \hat{y}_i^k - \sum_{(i,j) \in E, k} w_e^k \hat{y}_{ij}^k + N - \sum_{i,k} y_i^k \hat{y}_i^k \qquad \text{(Equation 3.4)}
Next, we convert this to a linear program. Since $x_{ij} y_i^k$ is bilinear in x and y, we replace it with the auxiliary variable $z_{ij}^k$, satisfying the constraints $z_{ij}^k \ge 0$, $z_{ij}^k \le x_{ij}$, and $z_{ij}^k \le y_i^k$. This removes the bilinearity and is exactly equivalent as long as $x_{ij}$ or $y_i^k$ is integral.
Putting it all together and removing terms that are constant with respect to x, y, and z, we obtain the following linear program:

\max_{\mathbf{x}, \mathbf{y}, \mathbf{z}} \; \sum_{i,j,k} w_j^k (z_{ij}^k - \hat{y}_i^k x_{ij}) + \sum_{(i,j) \in E, k} w_e^k y_{ij}^k - \sum_{i,k} y_i^k \hat{y}_i^k \qquad \text{(Equation 3.5)}
\text{s.t.} \quad 0 \le x_{ij} \le 1; \qquad \sum_{i,j} x_{ij} + \hat{x}_{ij} - 2 x_{ij} \hat{x}_{ij} \le D
\qquad\;\; 0 \le y_i^k; \qquad \sum_k y_i^k = 1; \qquad y_{ij}^k \le y_i^k; \qquad y_{ij}^k \le y_j^k
\qquad\;\; z_{ij}^k \le x_{ij}; \qquad z_{ij}^k \le y_i^k \qquad\qquad \forall i, j, k
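As a sanity check of Equation 3.5, the adversary's linear program can be written almost verbatim in a modeling language. The following sketch uses cvxpy with dense numpy inputs; the function and argument names are illustrative, and this is not the CPLEX implementation used in our experiments.

    import cvxpy as cp
    import numpy as np

    def most_violated_attack(w_attr, w_edge, x_hat, y_hat, edges, D):
        # Shapes: w_attr (M, K), w_edge (K,), x_hat (N, M), y_hat (N, K).
        N, M = x_hat.shape
        K = y_hat.shape[1]
        ones_row = np.ones((1, M))
        x = cp.Variable((N, M))
        y = cp.Variable((N, K))
        yij = cp.Variable((len(edges), K))
        z = [cp.Variable((N, M)) for _ in range(K)]     # z[k][i,j] ~ x_ij * y_i^k

        obj = cp.sum(yij @ w_edge) - cp.sum(cp.multiply(y_hat, y))
        cons = [x >= 0, x <= 1,
                cp.sum(x + x_hat - 2 * cp.multiply(x_hat, x)) <= D,
                y >= 0, cp.sum(y, axis=1) == 1, yij >= 0]
        for k in range(K):
            yk = y_hat[:, k:k + 1] @ ones_row           # constant broadcast of y_hat^k
            obj += cp.sum((z[k] - cp.multiply(yk, x)) @ w_attr[:, k])
            cons += [z[k] >= 0, z[k] <= x,
                     z[k] <= y[:, k:k + 1] @ ones_row]  # z_ij^k <= y_i^k
        for e, (i, j) in enumerate(edges):
            cons += [yij[e, :] <= y[i, :], yij[e, :] <= y[j, :]]
        cp.Problem(cp.Maximize(obj), cons).solve()
        return x.value, y.value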
Given the model's weights, this linear program allows the adversary to change up to D binary features. Recall that, in the AMN formulation, the exponential number of constraints separating the true labeling from all alternate labelings is replaced with a single non-linear constraint that separates the true labeling from the best alternate labeling (Equations 3.2 and 3.3). This non-linear constraint contains a nested maximization. We have a similar scenario, but here the margin can also be altered by changing the binary features, affecting the probabilities of both the true and alternate labelings. By substituting this new MPE inference task (Equation 3.5) into the original AMN formulation, the resulting program's optimal solution will be robust to the worst manipulation of the input feature vector:
\min_{\mathbf{w}, \xi} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\xi \qquad \text{s.t.} \quad \mathbf{w} \ge 0; \qquad \text{(Equation 3.6)}
\xi - N \ge \max_{\mathbf{x}, \mathbf{y}, \mathbf{z}} \; \sum_{i,j,k} w_j^k (z_{ij}^k - \hat{y}_i^k x_{ij}) + \sum_{(i,j) \in E, k} w_e^k y_{ij}^k - \sum_{i,k} y_i^k \hat{y}_i^k
\text{s.t.} \quad 0 \le y_i^k; \quad \sum_k y_i^k = 1; \quad y_{ij}^k \le y_i^k; \quad y_{ij}^k \le y_j^k
\qquad\; 0 \le x_{ij} \le 1; \quad \sum_{i,j} x_{ij} + \hat{x}_{ij} - 2 x_{ij} \hat{x}_{ij} \le D
\qquad\; z_{ij}^k \le x_{ij}; \quad z_{ij}^k \le y_i^k
The mathematical program in Equation 3.6 is not convex, because of the bilinear terms and the nested maximization (similar to solving a bilevel Stackelberg game). Fortunately, we can use the strong duality property of linear programs to resolve both of these difficulties. The dual of the maximization linear program is a minimization linear program with the same optimal value as the primal problem. Therefore, we can replace the inner maximization with its dual minimization problem to obtain a single convex quadratic program that minimizes over $\mathbf{w}$, $\xi$, and the dual variables (not shown). A similar approach is used by Globerson and Roweis (2006). As long as this relaxed program has an integral optimum, it is equivalent to maximizing only over integral x and y. Thus, the overall program will find optimal weights. Taskar et al. (2004a) prove that the inner maximization in a 2-class AMN always has an integral solution. We can prove a similar result for the adversarial AMN:
Theorem 1. Equation 3.5 has an integral optimum when $\mathbf{w} \ge 0$ and the number of classes is 2.
Proof Sketch. The structure of our argument is to show that an integral optimum
exists by taking an arbitrary adversarial AMN problem and constructing an
equivalent AMN problem that has an integral solution. Since the two problems
are equivalent, the original adversarial AMN must also have an integral solution.
First, we use a Lagrange multiplier to incorporate the constraint $\Delta(\mathbf{x}, \hat{\mathbf{x}}) \le D$ directly into the maximization. The extra term acts as a “per-change” penalty, which remains linear in x. Minimizing over the Lagrange multiplier effectively adjusts this per-change penalty until there are at most D changes between x and $\hat{\mathbf{x}}$, but does not affect the integrality of the inner maximization. Next, we replace all x variables with equivalent variables v. Assume that either $w_j^1 = 0$ or $w_j^2 = 0$, for all j. (If both are positive, then we can subtract the smaller value from both to obtain a new set of weights with the same optimum as before.) We define v as follows:
follows:
vij1
=



x
ij
if wj1 > 0,


1 − xij
if wj1 = 0.
vij2 = 1 − vij1
By construction:
X
wjk xij (yik − ŷik ) =
i,j,k
X
k
wjk vij
(yik − ŷik )
i,j,k
Thus, we can replace the x variables with v. Since the connections between the $v_{ij}^k$ and corresponding $y_i^k$ variables are all associative, this defines an AMN over variables {y, v}, which is guaranteed to have an integral solution when there are
only two classes.
By translating v back into x, we obtain a solution that is integral in both x
and y.
A complete proof can be found in Appendix A.
Many extensions of our model are possible. One extension is to restrict the
adversary to only changing certain features of certain objects. For example, in
a web spam domain, we might assume that the adversary will only modify spam
pages. We could also have different budgets for different types of changes, such as
a separate budget for each web page, or even separate budgets for changing the
title of a web page and changing its body. These are easily expressed by changing
the definition of $\mathcal{X}'$ and adding the appropriate constraints to the quadratic
program. Our model can also support higher-order cliques, as described by Taskar
et al. (2004a), as long as they are associative. For simplicity, our exposition and
experiments focus on the simpler case described above.
One important limitation of our model is that we do not allow edges to be
added or removed by the adversary. While edges can be encoded as variables in
the model, they result in non-associative potentials, since the presence of an edge
is not associated with either class label. Instead, the presence of an edge increases
the probability that the two linked nodes will have the same label. Handling the
adversarial addition and removal of edges is an important area for future work, but
will almost certainly be a non-convex problem.
3.3. Experiments
In this section, we describe our experimental evaluation of CACC. Since
CACC is both adversarial and relational, we compared it to three baselines: AMNs, which are relational but not adversarial; SVMInvar (Teo et al., 2008), which is adversarial but not relational; and SVMs with a linear kernel, which are neither. AMNs, SVMInvar, and SVMs can be seen as special cases of CACC: fixing the adversary's budget D to zero yields an AMN, fixing the edge weights $w_e^k$ to zero yields SVMInvar, and doing both yields an SVM.
3.3.1. Datasets
We evaluated our method on three collective classification problems.
Synthetic. To evaluate the effectiveness of our method in a controlled setting
where the distribution is known, we constructed a set of 10 random graphs, each
with 100 nodes and 30 Boolean features. Of the 100 nodes, half had a positive
label (‘+’) and half had a negative label (‘−’). Nodes of the same class were more
likely to be linked by an edge than nodes with different classes. The features
were divided evenly into three types: positive, negative, and neutral. Half of the
[Figure 3.2: nine panels of Classification Error (%) versus Strength of adversary (%), comparing SVM, SVMINV, AMN, and CACC. Panels: (a) Synthetic dataset: 0%; (b) Synthetic dataset: 10%; (c) Synthetic dataset: 20%; (d) Political blogs: 0%; (e) Political blogs: 10%; (f) Political blogs: 20%; (g) Reuters: 0%; (h) Reuters: 10%; (i) Reuters: 20%.]
FIGURE 3.2. Accuracy of different classifiers in the presence of a worst-case adversary. The number following the dataset name indicates the adversary's strength at the time of parameter tuning. The x-axis indicates the adversary's strength at test time. Smaller is better.
positive and negative nodes had different feature distributions based on their class; that is, the positive nodes had more positive attributes and the negative nodes had more negative attributes, on average. Such nodes had 6 words on average: one from the opposite class's words, two consistent with the class label, and three neutral. The other half of the nodes had an ambiguous distribution consisting mainly of neutral words (on average, one word consistent with the class label, one inconsistent, and three neutral). Therefore, an effective classifier for these graphs must rely on both the attributes and the relations. On average, each node had 8 neighbors, 7 of which had the same class and 1 of which had a different class.
Political Blogs. Our second domain is based on the Political blogs dataset
collected by Adamic and Glance (2005). The original dataset contains 1490 online
blogs captured during the 2004 election cycle, their political affiliation (liberal or
conservative), and their linking relationships to other blogs. We extended this
dataset with word information from four different crawls at different dates in
2012: early February, late February, early May and late May. We used mutual
information to select the 100 words that best predict the class label (Peng et al.,
2005), only using blogs from February and half of the blogs in early May, in order
to limit the influence of test labels on our training procedure. We found that some
of the blogs in the original dataset were no longer active, and had been replaced by
empty or spam web pages. We manually removed these from consideration. Finally,
we partitioned the blogs into two disjoint subsets and removed all edges between
nodes in the different subsets.
Reuters. As our third dataset, we prepared a Reuters dataset similar to the
one used by Taskar et al. (2004a). We took the ModApte split of the Reuters-21578
corpus and selected articles from four classes: crude, grain, trade, and money-fx.
We used the 200 words with highest mutual information as features. We linked
each document to the two most similar documents based on TF-IDF weighted
cosine distance. We split the data into 7 sets based on time, and performed the
tuning and then the training phases based on this temporal order (as explained in
Section 3.3.3).
3.3.2. Simulating an adversary
In real-world adversarial problems, the adversary does not usually have
complete access to the model parameters. Researchers have widely studied the
different ways that an adversary can acquire access to the model parameters
actively or passively (Lowd and Meek, 2005a,b). In this section, we examine
two extreme cases. In the first, the adversary has complete access to the model
parameters and manipulates the features to maximize the misclassification rate.
Since exactly maximizing the error rate is typically NP-hard, our intelligent
adversary instead maximizes the margin loss by solving the linear program in
(Equation 3.5). In the second scenario, the random adversary randomly toggles
D binary features, representing random noise or perhaps a very naïve adversary.
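A minimal sketch of the random adversary follows (my own illustration, not code from the dissertation; the intelligent adversary instead solves the linear program in Equation 3.5, which is not reproduced here):

```python
# Illustrative sketch of the random adversary: toggle D binary features
# chosen uniformly at random. Names and sizes here are my assumptions.
import numpy as np

def random_adversary(x, D, rng):
    """Return a copy of x with D randomly chosen binary features toggled."""
    x = x.copy()
    idx = rng.choice(x.size, size=D, replace=False)
    x[idx] = 1 - x[idx]
    return x

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100)
x_attacked = random_adversary(x, D=10, rng=rng)
print((x != x_attacked).sum())   # exactly 10 features toggled
```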
3.3.3. Methodology and metrics
In order to evaluate the robustness of these methods to malicious adversaries,
we applied a simulated adversary to both the tuning data and the test data. We
assumed the worst-case scenario, in which the adversary has perfect knowledge of
the model parameters and only wants to maximize the error rate of the classifier.
Since exactly maximizing the error rate is typically NP-hard, our intelligent
adversary instead maximizes the margin loss by solving the linear program in
Equation 3.5 for a fixed budget. Each model was attacked separately. On the
validation data, we used adversarial budgets of 0% (no adversarial manipulation),
10%, and 20% of the total number of features present in the data. This allowed
us to tune our models to “expect” adversaries of different strengths. Of course, we
rarely know the exact strength of the adversary in advance. Thus, on the test data,
we used budgets that ranged from 0% to 25%, in order to see how well different
models did against adversaries that were weaker and stronger than expected.
We used the fraction of misclassified nodes as our primary evaluation
criterion. For all methods, we tuned the regularization parameter C using held-out
validation data. For the adversarial methods (CACC and SVMInvar), we tuned the
adversarial training budget D as well. All parameters were selected to maximize
performance on the tuning set with the given level of adversarial manipulation.
For political blogs, we tuned our parameters using the words from the
February crawls, and then learned models on early May data and evaluated them
on late May data. In this way, our tuning procedure could observe the concept drift
within February and select parameters that would handle the concept drift during
May well. For Synthetic data, we ran 10-fold cross-validation. For Reuters, we split
the data into 7 sets based on time. We tuned parameters using articles from time t
and t + 1 and then learned on articles at time t + 1 and evaluated on articles from
time t + 2.
We used CPLEX to solve all quadratic and linear programming problems.
Most problems were solved in less than 1 minute on a single core.
All of our code and datasets are available upon request.
3.3.4. Results and discussion
Figure 3.2 shows the performance of all four methods on test data
manipulated by rational adversaries of varying strength (0%-25%), after being
tuned against adversaries of different strengths (0%, 10%, and 20%). Lower is
better. On the far left of each graph is performance without an adversary. To the
right of each graph, the strength of the adversary increases.
[Figure 3.3 appears here: nine panels plotting Classification Error (%) against Strength of adversary (%) for SVM, SVMINV, STACKED, AMN, and CACC, with the same panel layout as Figure 3.2: (a)-(c) Synthetic dataset at 0%, 10%, and 20%; (d)-(f) Political blogs; (g)-(i) Reuters.]

FIGURE 3.3. Accuracy of different classifiers in the presence of a random adversary. We observe that even strong random attacks are not efficient in disguising the true class of the sample.
When a rational adversary is present, CACC clearly and consistently
outperforms all other methods. When there is no adversary, its performance is
similar to a regular AMN. On political blogs, it appears to be slightly better, which
may be the result of the large amount of concept drift in that dataset.
As expected, tuning against stronger adversaries (10% and 20%) makes
CACC more effective against stronger adversaries at test time. Surprisingly, tuning
against a stronger adversary does not significantly reduce performance against
weaker adversaries: CACC remains nearly as effective against no adversary when
tuned for a 20% adversary as when tuned for no adversary. Specifically, when there
is no adversary at test time, the increase in error rate from training against a 20%
adversary is less than 1% on Synthetic and Reuters, and on Political blogs the error
rate actually decreases slightly. Thus, this additional robustness comes at a very
small cost.

FIGURE 3.4. The distribution of the learned weight values for different models.
The robust method tends to have a high density on weights that are saturated.
In Figures 3.2d, 3.2e, and 3.2f, the AMN classification error jumps sharply
as the adversary budget increases. This is the point when enough nodes are
misclassified that links are actively misleading in one or two of the eight cross-validation folds, leading to worse performance than the SVM for those folds.
This demonstrates that relational classifiers are potentially more vulnerable to
adversarial attacks than non-relational classifiers. A smoother version of this effect
can also be observed on both the synthetic dataset and Reuters.
Another interesting result was that our solutions on Reuters were always
integral, even though the number of classes is 4 and integrality is not guaranteed.
FIGURE 3.5. The sorted learned weights for each method. The robust method
constrains the maximum value of the weights. This suggests that robustness could
also be achieved through regularization with the L∞ norm.
Another interesting observation concerns the distribution of learned weights in
robust and non-robust models. The robust models restrict the maximum value
that any weight parameter can take (Figure 3.5). Intuitively, this means that if
the learner unconditionally trusts the importance of a certain feature, then that
feature becomes a point of weakness. Since the adversarial budget in this experiment
was an L1 constraint, this result suggests, from a technical point of view, that
we can achieve the same robustness by regularizing the weights with an L∞ norm.
This observation motivated the work we present in the next chapter.
We also performed additional experiments against irrational adversaries that
modify attributes uniformly at random. These random attacks had little effect on
the accuracy of any of the methods; all remained nearly as effective as against no
adversary (Figure 3.3).
3.4. Conclusion
In this chapter, we provide a generalization of SVMInvar (Teo et al., 2008)
and AMNs (Taskar et al., 2004a) that combines the robustness of SVMInvar
with the ability to reason about interrelated objects. In experiments on real and
synthetic data, CACC finds consistently effective and robust models, even when
there are more than two labels.
In the next chapter, we extend robustness to adversarial manipulation of
input data to generic structured prediction models. We show how robustness
is equivalent to regularization for structured models, and we propose methods
for developing customized regularization functions for particular adversarial
uncertainty sets.
CHAPTER IV
EQUIVALENCY OF ADVERSARIAL ROBUSTNESS AND
REGULARIZATION
This work was published in the proceedings of the thirty-first International
Conference on Machine Learning (ICML 2014). I was the primary contributor
to the methodology and writing, and designed and conducted the experiments.
My Ph.D. advisor, Dr. Daniel Lowd contributed partly to the methodology and
writing. Daniel Lowd was the principal investigator for this work.
Traditional machine learning methods assume that training and test data
are drawn from the same distribution. However, in many real-world applications,
the distribution is constantly changing. In some cases, such as spam filtering and
fraud detection, an adversary may be actively manipulating it to defeat the learned
model. In such cases, it is beneficial to optimize the model’s performance on not
just the training data but on the worst-case manipulation of the training data,
where the manipulations are constrained to some domain-specific uncertainty set.
For example, in an image classification problem, the uncertainty set could include
minor translations, rotations, noise, or color shifts of the training data. This type of
robust optimization leads to models that perform well on points that are “close” to
those in the training data.
In general, robust optimization addresses optimization problems in which
some degree of uncertainty governs the known parameters of the model (Ben-Tal
and Nemirovski, 1998, 1999, 2000, 2001; Bertsimas and Sim, 2004). In
many of the existing robust formulations, robust linear programming is a central
method. For example, Bertsimas et al. (2004) show that when the
disturbance of the inputs is restricted to an ellipsoid around the true values defined
by some norm, then the robust linear programming problem can be reduced to a
convex cone program. A number of other authors have explored the application
of robust optimization to classification problems (e.g., Lanckriet et al. (2003);
El Ghaoui et al. (2003); Bhattacharyya et al. (2004); Shivaswamy et al. (2006)).
Recently, Xu et al. (2009) showed that regularization of support vector
machines can be derived from a robust formulation. However, robustness for
structured prediction models has remained largely unexplored.
In this chapter, we develop a general-purpose technique for learning robust
structural support vector machines. Our basic approach is to consider the
worst-case corruption of the input data within some uncertainty set and use
this to define a robust formulation. This optimization problem is often much
harder than standard training of structural SVMs when written directly; we
overcome this obstacle by transforming the robust optimization problem into a
standard structural support vector machine learning problem with an additional
regularizer. This gives us both robustness and computational efficiency in the
structured prediction setting, as well as establishing an elegant relationship between
robustness and regularization for structural SVMs.
We demonstrate our approach on a new dataset consisting of snapshots of
political blogs from 2003 through 2013, based on the political blogs dataset from
Adamic and Glance (2005). Blogs are classified as liberal or conservative using both
their words and link structure. To make this more challenging, we train on blogs
from 2004 but evaluate on every year, from 2003 to 2013. In this domain, we define
an uncertainty set, show how to construct an appropriate regularizer, and show that this
regularization can lead to substantially lower test error than a non-robust model.
4.1. Preliminaries
We begin by describing our notation and then provide a brief overview of
structural support vector machines.
x and y denote the vectorized input and the representation of the structured
output in the training data, respectively. For simplicity of notation, we assume a
single training example, such as a single social network graph, but our results easily
extend to a set of training examples.
The feature vector φ(x, y) is a function of both inputs and labels (and also
manipulated input or alternate labels, when used as the input argument). We use
∆φ(x, y, ỹ) to refer to the difference between two feature vectors with different
labels y and ỹ; in particular: ∆φ(x, y, ỹ) = φ(x, ỹ) − φ(x, y). The value of
wT φ(x, ỹ) is called the score of labeling x as ỹ, for the given model weights w.
∆(y, ỹ) is a scalar distance function, such as Hamming distance, which is a
measure of dissimilarity between the true and alternate labels.
We use $\|\cdot\|$ to refer to a general norm function and $\|\cdot\|_*$ for the dual norm of
$\|\cdot\|$, where $\|y\|_* = \sup\{y^T x \mid \|x\| \le 1\}$.
In this chapter, we focus on the derivation of robust formulations for the 1-slack
structural SVM (Joachims et al., 2009). (With minor changes, the results of this
chapter can be applied to n-slack structural SVMs as well, but we omit the details here.)
The optimization program of a 1-slack structural SVM is:
$$
\begin{aligned}
\underset{w,\,\zeta}{\text{minimize}} \quad & f(w) + C\zeta \\
\text{subject to} \quad & \zeta \ge \max_{\tilde y}\; w^T \Delta\phi(x, y, \tilde y) + \Delta(y, \tilde y)
\end{aligned}
\qquad \text{(Equation 4.1)}
$$
where x is the vector of all input variables, y is the vector of the desired structured
output variables, and w is the vector of the model parameters. The goal is to learn w.
f (w) is a regularization function that penalizes “large” weights. Depending
on the application, f(w) can in general be any convex function. Semi-homogeneous
functions, such as norms or powers of norms with exponent equal to or greater
than 1, are common choices. (A function f(z) is semi-homogeneous if
and only if $f(az) = a^{\alpha} f(z)$ for some positive $\alpha$.) The quadratic function
$f(w) = \frac{1}{2} w^T w$ is the most commonly used regularization function.
4.2. Robust structural SVMs
In this section, we motivate and define a robust formulation of structural
SVMs. We begin by considering how an adversary might modify an input in order
to maximize the prediction error, and use this to derive a definition of a robust
structural SVM in sample space and feature space.
4.2.1. Worst-case/Adversarial data manipulation
Adversaries might have a wide range of goals, but in the worst case they will
antagonistically try to reduce the accuracy of the predictive model. For structural
SVMs, the predicted output is chosen by solving ỹ = arg maxỹ wT φ(x, ỹ), where
wT φ(x, ỹ) is the classification score. Thus, an adversary’s antagonistic goal would
be to replace the true input x with a manipulated version x̃ that maximizes the
classification loss ∆(y, ỹ). If the highest scoring label is not unique, we assume the
adversary tries to maximize the minimum loss in the set:

$$
\begin{aligned}
\underset{\tilde x}{\text{maximize}} \quad & \min_{\tilde y}\; \Delta(y, \tilde y) \\
\text{subject to} \quad & \tilde y \in \arg\max_{\tilde y \ne y}\; w^T \phi(\tilde x, \tilde y) \\
& \tilde x \in S(x, y)
\end{aligned}
\qquad \text{(Equation 4.2)}
$$
S(x, y) is a domain-specific uncertainty set, which constrains the set of possible
corrupt inputs x̃. We always assume that x ∈ S(x, y), which means x can remain
unchanged. The set S(x, y) can contain a wide range of possible variations, such
as the amount of affordable/possible change in an attribute, or the restrictions that
are enforced on combinations of changes among several attributes.
The bi-level optimization program in (Equation 4.2) is not tractable in
general, especially when x and y have integer components. A slightly more
tractable solution is to relax the program and only require that ỹ be scored higher
than the true output y:
$$
\begin{aligned}
\underset{\tilde x,\, \tilde y}{\text{maximize}} \quad & \Delta(y, \tilde y) \\
\text{subject to} \quad & w^T \phi(\tilde x, y) \le w^T \phi(\tilde x, \tilde y) \\
& \tilde x \in S(x, y)
\end{aligned}
\qquad \text{(Equation 4.3)}
$$
The maximization in (Equation 4.3) might be infeasible, but its Lagrangian
relaxation is always feasible:
$$
\begin{aligned}
\underset{\tilde x,\, \tilde y}{\text{maximize}} \quad & \lambda w^T \Delta\phi(\tilde x, y, \tilde y) + \Delta(y, \tilde y) \\
\text{subject to} \quad & \tilde x \in S(x, y)
\end{aligned}
\qquad \text{(Equation 4.4)}
$$
We draw the reader's attention to the similarity between (Equation
4.4) and the nested max operation in the constraint of (Equation 4.1). In fact,
λwT ∆φ(x̃, y, ỹ) + ∆(y, ỹ) is a component of the loss function that the learner
wants to minimize. In the next subsection, we reformulate the standard 1-slack
structural SVM so that the effect of adversarial manipulation of input data will be
minimized.
4.2.2. Robust formulation in sample space
Our goal is to find a set of model parameters that perform well against
the worst-case manipulated input x̃ in the uncertainty set. We formulate this
by replacing the loss-augmented margin in (Equation 4.1) with the worst-case
adversarial loss obtained by (Equation 4.4):
$$
\underset{w}{\text{minimize}} \quad C f(w) + \sup_{\tilde x \in S(x,y),\, \tilde y} L_\lambda(w, \tilde x, \tilde y, y)
\qquad \text{(Equation 4.5)}
$$

where $L_\lambda(w, \tilde x, \tilde y, y) = \lambda w^T \Delta\phi(\tilde x, y, \tilde y) + \Delta(y, \tilde y)$. We replace the maximization
with a sup operator to indicate that the maximum value might not be achieved.
Both λ and C are tunable parameters that can be determined by cross-validation.
In the following lemma we show that it is possible to tune only one of them by
performing a re-parameterization.
Lemma 1. For semi-homogeneous $f(\cdot)$, the problem (Equation 4.5) can be
equivalently rewritten in the following form:

$$
\underset{w}{\text{minimize}} \quad C f(w) + \sup_{\tilde x \in S(x,y),\, \tilde y} L(w, \tilde x, \tilde y, y)
\qquad \text{(Equation 4.6)}
$$

where $L(w, \tilde x, \tilde y, y) = w^T \Delta\phi(\tilde x, y, \tilde y) + \Delta(y, \tilde y)$.

Proof. Let $w' = \lambda w$ and $C' = C/\lambda^{\alpha}$. Then, for a semi-homogeneous $f(\cdot)$, where
$f(aw) = a^{\alpha} f(w)$, we have $C f(w) = \frac{C}{\lambda^{\alpha}} f(\lambda w)$. Therefore, for semi-homogeneous
regularization functions $f(\cdot)$, re-parameterizing $w'$ as $w$ and $C'$ as $C$ rewrites
(Equation 4.5) as (Equation 4.6).
Problem (Equation 4.6) is similar in form to a standard structural SVM,
except that the inner maximization is done over both x̃ and ỹ. This is potentially
much harder than simply maximizing over ỹ, since the input often has a much
higher dimension than the output. For example, when labeling a set of 1000 web
pages, there are only 1000 labels to predict but 1,000,000 possible hyperlinks that
the adversary could add or remove. In the next subsection, we show that we can
avoid the above-mentioned computational complexity by instead restricting the
variations in the feature space.
4.2.3. Robustness in feature space
Let ∆x be the disturbance in the sample space such that: x̃ = x + ∆x. Then,
by finite difference approximation (for more on finite difference approximations, refer to Smith (1985)):
φ(x̃, y) = φ(x + ∆x, y) = φ(x, y) + δ(x̃, y)
φ(x̃, ỹ) = φ(x + ∆x, ỹ) = φ(x, ỹ) + δ(x̃, ỹ)
Note that we are not introducing any error; both functions δ(x̃, y) and δ(x̃, ỹ)
contain as many high-order approximation terms as needed to make the introduced
error infinitesimal, although we never unpack these functions. In
fact, the difference between δ(x̃, y) and δ(x̃, ỹ) is particularly important; let
δỹ (x, y, x̃) = δ(x̃, ỹ) − δ(x̃, y), then:
$$
\begin{aligned}
\phi(\tilde x, \tilde y) - \phi(\tilde x, y)
&= \phi(x + \Delta x, \tilde y) - \phi(x + \Delta x, y) \\
&= \phi(x, \tilde y) - \phi(x, y) + \delta(\tilde x, \tilde y) - \delta(\tilde x, y) \\
&= \phi(x, \tilde y) - \phi(x, y) + \delta_{\tilde y}(x, y, \tilde x)
\end{aligned}
\qquad \text{(Equation 4.7)}
$$
Therefore, the manipulation of the input data affects the margin L(.) in (Equation
4.6) through δỹ (x, y, x̃). In the rest of the chapter, we will use δ i to refer to the ith
element of the vector δỹ (x, y, x̃).
Clearly, δỹ depends on the specific choice of the alternate labeling ỹ, as well
as x̃, x, and y. Let:
$$
\Delta^2\Phi(x, y) = \{\delta = \delta_{\tilde y}(x, y, \tilde x) \mid \forall \tilde x \in S(x, y),\ \tilde y\}
\qquad \text{(Equation 4.8)}
$$
be the set of all possible variations. Note that $\Delta^2\Phi(x, y)$ is independent of
$\tilde y$. In the next section, we introduce some mechanical procedures for calculating
$\Delta^2\Phi(x, y)$ from $S(x, y)$, for certain choices of $S(x, y)$ and $\phi(x, y)$.
Lemma 2. Let $L_1(w, \tilde x, \tilde y, y) = w^T(\phi(\tilde x, \tilde y) - \phi(\tilde x, y)) + \Delta(y, \tilde y)$ and
$L_2(w, \delta, \tilde y) = w^T(\phi(x, \tilde y) - \phi(x, y) + \delta) + \Delta(y, \tilde y)$. Then we have:

$$
\sup_{\delta \in \Delta^2\Phi(x,y),\, \tilde y} L_2(w, \delta, \tilde y) \;\ge\; \sup_{\tilde x \in S(x,y),\, \tilde y} L_1(w, \tilde x, \tilde y, y)
$$
Proof sketch. The left-hand side of the inequality is equal to the right-hand side
except that the supremum is taken over a superset of function values. Thus, the
left-hand side cannot be any less than the right-hand side.
Now, we can rewrite the robust formulation in (Equation 4.6) over variations
in the feature space:
$$
\underset{w}{\text{minimize}} \quad C f(w) + \sup_{\delta \in \Delta^2\Phi(x,y),\, \tilde y} L(w, \delta, \tilde y)
\qquad \text{(Equation 4.9)}
$$

where $L(w, \delta, \tilde y) = w^T (\Delta\phi(x, y, \tilde y) + \delta) + \Delta(y, \tilde y)$.
By Lemma 2, the objective of (Equation 4.9) is an upper bound for the
objective of (Equation 4.6); therefore, the formulation of the problem in (Equation
4.9) is an approximate, but more tractable, solution for (Equation 4.6).
In the next section, we will show that for a wide class of ∆2 Φ(x, y)’s,
problem (Equation 4.9) reduces to an optimization program which can be solved
as efficiently as an ordinary 1-slack structural SVM.
4.3. Mapping the uncertainty sets
In many real world problems, there exists some expert knowledge about the
uncertainty sets in the sample space. For example, for the webpage classification
problem, a spammer can modify web pages by adding and removing words and
links, but is constrained by the cost of compromising legitimate web pages, which
takes time and effort, or obfuscating spam pages, which may make them less
effective at gaining clicks. We can approximate this with a simple budget on the
number of words and links the adversary can change over the entire dataset. Even
when such information is not readily available, it may be possible to infer an
uncertainty set from training data. For example, if our dataset contains outliers,
we can pair each outlier (x̃) with the most similar non-outlier (x) and take the
differences as possible directions of manipulation: ∆x = x̃ − x. The convex hull
of these difference vectors (or an approximation thereof) can be used to define an
uncertainty set for any instance.
Lemma 2 states that the robust formulation in feature space is a reasonable
approximation for the robust formulation in the sample space, but it does not
suggest any mechanical procedure for calculating the uncertainty sets in feature
space from the ones in the sample space. We now derive such procedures for certain
types of uncertainty sets and feature functions.
Many features of interest, including logical conjunctions, can be represented
as products of several variables. We define a multinomial feature function as a sum
of many such products:
$$
\phi_C(x, y) = \sum_{(c_x, c_y) \in C}\; \prod_{i \in c_x} x_i \prod_{i \in c_y} y_i
\qquad \text{(Equation 4.10)}
$$
where C is a set of cliques and (cx , cy ) are the index sets of the attribute and
output variables that contribute to the feature. The summation groups together
many products that share the same pattern into a single, aggregate feature so that
they may be considered collectively. For example, in web page classification, the
multinomial feature $\sum_i x_{i,j}\, y_i$ could represent the number of web pages with label 1
that contain word j. This is equivalent to having many features with tied weights.
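As a toy illustration of Equation 4.10 (my own example, not code from the dissertation), the following snippet computes such a tied-weight word feature with numpy:

```python
# Toy illustration of a multinomial feature: phi_j(x, y) = sum_i x[i, j] * y[i]
# counts the pages with label 1 that contain word j (y is a 0/1 label indicator).
import numpy as np

x = np.array([[1, 0, 1],      # page 0 contains words 0 and 2
              [1, 1, 0],      # page 1 contains words 0 and 1
              [0, 1, 1]])     # page 2 contains words 1 and 2
y = np.array([1, 0, 1])       # pages 0 and 2 have label 1

phi = x.T @ y                 # phi[j] = sum_i x[i, j] * y[i]
print(phi)                    # [1 1 2]: word 2 appears on both label-1 pages
```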
Lemma 3. If the feature function $\phi_C(x, y)$ is multinomial with $0 \le x, y \le 1$,
then its disturbance $\delta^C$ can be upper-bounded by a function of the variations in the
sample space, and the following inequality holds:

$$
\frac{|\delta^C|^p}{\alpha\,|C|^{\frac{p}{q}}} \;\le\; \sum_{c_x \in C}\sum_{i \in c_x} |\tilde x_i - x_i|^p
\qquad \text{(Equation 4.11)}
$$

where $p \ge 1$ is an arbitrary power value and $\frac{1}{p} + \frac{1}{q} = 1$; $\alpha = \max_{c_x \in C} |c_x|^{(p-1)}$; $|c_x|$ is the
number of evidence variables in $c_x$; and $|C|$ is the number of different sets $c_x$ in $C$.
Now, the resulting inequality of Lemma 3 can be used as the core inequality
for upper-bounding the variations in the feature space.
The proofs can be found in Appendix B.
The next theorem is the main result of this section.
Theorem 2. For multinomial feature functions and spherical uncertainty sets in
the sample space, $S(x, y) = \{\tilde x \mid \|\tilde x - x\|_p \le B,\ p \ge 1\}$, one can construct an
ellipsoidal uncertainty set in the feature space:

$$
\Delta^2\Phi(x, y) = \{\delta \mid \|M\delta\|_p \le 1\}
\qquad \text{(Equation 4.12)}
$$

where $M$ is a diagonal matrix with $\frac{1}{B\,(d\alpha_i)^{1/p}\,|C_i|^{1/q}}$ in the $(i, i)$th position; $d$, $\alpha_i$, and
$|C_i|$ are appropriate constants.
Proof. Assume that $P = \{C_1, \ldots, C_L\}$ is a set of cliques that covers all variables $x_i$.
Note that such a set should exist; otherwise, some variables are never used in the
model. For each of the cliques, we form a corresponding difference in the feature
function from Equation 4.7, and apply Lemma 3. By adding all of the resulting
inequalities, we obtain:

$$
\begin{aligned}
\sum_{C_i \in P} \frac{|\delta^{C_i}|^p}{\alpha_i\, |C_i|^{\frac{p}{q}}}
&\;\le\; d \sum_{i=1}^{\dim(x)} |\tilde x_i - x_i|^p = d\,\|\tilde x - x\|_p^p \;\le\; d B^p \\
\Rightarrow\quad
\sum_{C_i \in P} \frac{|\delta^{C_i}|^p}{B^p d\, \alpha_i\, |C_i|^{\frac{p}{q}}} &\;\le\; 1 \\
\Rightarrow\quad
\sum_{C_i \in P} \left( \frac{|\delta^{C_i}|}{B\,(d\alpha_i)^{\frac{1}{p}}\, |C_i|^{\frac{1}{q}}} \right)^p &\;\le\; 1
\end{aligned}
$$

where $\alpha_i = \max_{c_x \in C_i} |c_x|^{(p-1)}$, and $|c_x|$ is the number of variables in $c_x$. Since it is
possible that cliques cover overlapping sets of variables, the coefficient $d \ge 1$ is
used to maintain the inequality.

Now let $\frac{1}{B\,(d\alpha_i)^{1/p}\,|C_i|^{1/q}}$ be the diagonal entry in matrix $M$ that corresponds to
the feature disturbance $\delta^{C_i}$. For this choice of $M$, $\|M\delta\|_p \le 1$.
We have an example of applying Theorem 2 in Section 6.2, which will show
how this construction works in practice.
Corollary 1. If $S(x, y) = \{\tilde x \mid \|\tilde x - x\|_1 \le B\}$, then $M$ can be constructed by
setting $\frac{1}{Bd}$ as its $(i, i)$th element, which results in a tighter upper bound.
The proof can be found in Appendix B.
4.4. Robust optimization programs
Our main contribution in this chapter is achieving robust formulations that
can be efficiently solved. We do this by demonstrating a connection between
robustness to certain perturbations in feature space and certain types of weight
regularization. In this section we derive formulations for achieving robust weight
learning in structural SVMs when $\Delta^2\Phi(x, y)$ is an ellipsoid, a polyhedron, or the
intersection of an ellipsoid and a polyhedron.
4.4.1. Ellipsoidal constrained uncertainty
We first consider the case when the uncertainty set $\Delta^2\Phi(x, y)$ is ellipsoidal.
Recall that any ellipsoid can be represented in the form $\{t \mid \|Mt\| \le 1\}$, where
$\|\cdot\|$ is the relevant norm.
Theorem 3. For $\Delta^2\Phi(x, y) = \{\delta \mid \|M\delta\| \le 1\}$, where $M$ is positive definite, the
optimization program of the robust structural SVM in (Equation 4.9) reduces to the
following regularized formulation of the ordinary 1-slack structural SVM:

$$
\begin{aligned}
\underset{w,\,\zeta}{\text{minimize}} \quad & C f(w) + \|M^{-1} w\|_* + \zeta \\
\text{subject to} \quad & \zeta \ge \sup_{\tilde y}\; w^T \Delta\phi(x, y, \tilde y) + \Delta(y, \tilde y)
\end{aligned}
\qquad \text{(Equation 4.13)}
$$

where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$.
Proof. We begin with the robust formulation of a structural SVM from (Equation
4.9), where the uncertainty set of $\delta$ is defined by the ellipsoid $\|M\delta\| \le 1$:

$$
\underset{w}{\text{minimize}} \quad C f(w) + \sup_{\|M\delta\| \le 1,\, \tilde y} L(w, \delta, \tilde y)
$$

Let $\nu = M\delta$, so that $\delta = M^{-1}\nu$. Then we have:

$$
\begin{aligned}
\sup_{\|M\delta\| \le 1,\, \tilde y} L(w, \delta, \tilde y)
&= \sup_{\|M\delta\| \le 1,\, \tilde y}\; w^T (\Delta\phi(x, y, \tilde y) + \delta) + \Delta(y, \tilde y) \\
&= \sup_{\|M\delta\| \le 1} w^T \delta \;+\; \sup_{\tilde y}\; w^T \Delta\phi(x, y, \tilde y) + \Delta(y, \tilde y) \\
&= \sup_{\|\nu\| \le 1} w^T M^{-1} \nu \;+\; \sup_{\tilde y}\; w^T \Delta\phi(x, y, \tilde y) + \Delta(y, \tilde y)
\end{aligned}
$$

By the definition of the dual norm, $\sup_{\|\nu\| \le 1} (w^T M^{-1})\nu = \|M^{-T} w\|_*$. Since $M$ is
positive definite, it is symmetric, and so is $M^{-1}$; therefore, $\|M^{-T} w\|_* = \|M^{-1} w\|_*$.
Hence the supremum equals

$$
\|M^{-1} w\|_* + \sup_{\tilde y}\; w^T \Delta\phi(x, y, \tilde y) + \Delta(y, \tilde y)
$$

By substitution, the rest of the proof is straightforward.
Note that Theorem 3 can still be applied when M is not positive definite by
using the Moore-Penrose inverse of M instead of the regular inverse. The result in
Theorem 3 uses the technique of robust linear programming with arbitrary norms
that is introduced in Bertsimas et al. (2004). This theorem can also be seen as a
generalization of Theorem 3 in Xu et al. (2009) to structural SVMs. Theorem 3
shows the direct connection between the robust formulation and regularization of
the non-robust formulation for structural SVMs.
Corollary 2. For disturbances of the form $\|\delta\| \le B$ in the feature space, with
$B$ being a maximum budget for the applicable changes and $\|\cdot\|$ being an arbitrary
norm, robustness can be achieved by adding the regularization function $B\|w\|_*$ to
the objective.
Proof. Since $\|\delta\| \le B \Rightarrow \|\frac{1}{B}\delta\| = \|\frac{1}{B} I \delta\| \le 1$, let $M = \frac{1}{B} I$; then $M^{-1} = B I$.
Thus, $\|M^{-1} w\|_* = \|B I w\|_* = B\|w\|_*$. By Theorem 3, $B\|w\|_*$ is the appropriate
regularization function.
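To make the construction concrete, the following sketch (an illustration of mine, not code from the dissertation; the training loop, learning rate, and data are assumptions, and a binary linear SVM is used for simplicity) trains with the regularizer $B\|w\|_\infty$, the dual-norm penalty that Corollary 2 prescribes for an $L_1$-bounded feature-space disturbance $\|\delta\|_1 \le B$:

```python
# Sketch: binary linear SVM with hinge loss plus B * ||w||_inf, optimized by
# plain subgradient descent. Everything here is illustrative.
import numpy as np

def robust_hinge_objective(w, b, X, y, B):
    """Hinge loss plus the dual-norm regularizer B * ||w||_inf."""
    margins = 1.0 - y * (X @ w + b)
    return np.maximum(margins, 0.0).sum() + B * np.abs(w).max()

def train(X, y, B=1.0, lr=0.01, epochs=200):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = 1.0 - y * (X @ w + b)
        active = margins > 0                      # examples with nonzero hinge loss
        grad_w = -(y[active, None] * X[active]).sum(axis=0)
        grad_b = -y[active].sum()
        # Subgradient of B * ||w||_inf: B * sign(w_j) on a coordinate of max |w_j|.
        j = np.argmax(np.abs(w))
        grad_w[j] += B * np.sign(w[j])
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
w, b = train(X, y, B=2.0)
print(robust_hinge_objective(w, b, X, y, B=2.0))
```

Because the $L_\infty$ penalty caps the largest weight as $B$ grows, this mirrors the weight-saturation effect observed for the robust models in Chapter III (Figure 3.5).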
Note that M can also be seen as a tuning parameter. In particular, if there is
a low-dimensional representation of M, then tuning M might be an option.
The commonly used L2 regularization can in fact be interpreted as a
regularization function that enforces robustness to disturbances in the feature space
that are restricted to a hypersphere.
Corollary 3. If $f(w) = 0$, then setting $M = \frac{1}{C} I$ and $\|\cdot\| = \|\cdot\|_2$ will recover the
commonly used $L_2$-regularized structural SVM.

Proof. If $M = \frac{1}{C} I$, then $M^{-1} = C I$. Note that the $L_2$ norm is dual to itself.
Therefore, $f(w) + \|M^{-1} w\|_2^* = 0 + \|C I w\|_2 = C\|w\|_2$.
Corollary 4. Robustness to variations restricted by a Mahalanobis norm $\|\delta\|_S = \sqrt{\delta^T S \delta} \le 1$, where $S$ is positive definite, is equivalent to adding the regularization
function $\|w\|_{S^{-1}} = \sqrt{w^T S^{-1} w}$ to the objective.

Proof. Let $S = U\Lambda U^T$ be the spectral decomposition of $S$. Set $M = U\Lambda^{\frac{1}{2}} U^T$
and the norm $\|\cdot\|$ to $\|\cdot\|_2$. Then $\|M\delta\|_2 = \sqrt{\delta^T M^T M \delta} = \sqrt{\delta^T M^2 \delta} = \sqrt{\delta^T S \delta}$.
Therefore, the resulting regularization function is

$$
\|M^{-1} w\|_2^* = \|M^{-1} w\|_2 = \sqrt{w^T M^{-T} M^{-1} w} = \sqrt{w^T U\Lambda^{-\frac{1}{2}} U^T U\Lambda^{-\frac{1}{2}} U^T w} = \sqrt{w^T U\Lambda^{-1} U^T w} = \sqrt{w^T S^{-1} w} = \|w\|_{S^{-1}}.
$$

Note that $U^T U = I$ because $U$ is a unitary matrix.
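As a quick numerical sanity check of Corollary 4 (my own sketch, not part of the dissertation), one can verify with numpy that the dual-norm regularizer $\|M^{-1}w\|_2$ with $M = S^{1/2}$ equals the Mahalanobis regularizer $\sqrt{w^T S^{-1} w}$:

```python
# Numerical check: ||M^{-1} w||_2 == sqrt(w^T S^{-1} w) when M = S^{1/2}.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
S = A @ A.T + 4 * np.eye(4)          # a random symmetric positive definite matrix
w = rng.normal(size=4)

lam, U = np.linalg.eigh(S)           # spectral decomposition S = U diag(lam) U^T
M = U @ np.diag(np.sqrt(lam)) @ U.T  # M = U Lambda^{1/2} U^T

lhs = np.linalg.norm(np.linalg.solve(M, w))   # ||M^{-1} w||_2
rhs = np.sqrt(w @ np.linalg.solve(S, w))      # sqrt(w^T S^{-1} w)
print(lhs, rhs)                               # the two values agree
```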
4.4.2. Polyhedral constrained uncertainty
For some problems, an ellipsoid may not be a good representation of the
uncertainty set, but almost any convex uncertainty set can be approximated by a
polyhedron. In this subsection we consider the situations in which we are aware of
the shape of the polyhedral constraints on the variations in the feature space; i.e.,
$\Delta^2\Phi(x, y) = \{\delta \mid A\delta \le b\}$. The next theorem shows that polyhedral uncertainty
sets are equivalent to linear regularization in a transformed feature space. We begin
with a supporting lemma.
Lemma 4. If $x \in S(x, y)$, then for the corresponding $\Delta^2\Phi(x, y) = \{\delta \mid A\delta \le b\}$, $b$
is a non-negative vector.

Proof. Since $x \in S(x, y)$ and $\phi(\tilde x, \tilde y) - \phi(\tilde x, y) = \phi(x, \tilde y) - \phi(x, y) + \delta$, when
$\tilde x = x$ we have $\delta = 0$, so $0 \in \Delta^2\Phi(x, y)$. Therefore, for $\delta = 0$,
$A\delta = A0 \le b$; i.e., $b \ge 0$.
Theorem 4. For $\Delta^2\Phi(x, y) = \{\delta \mid A\delta \le b\}$, the optimization program of the
robust structural SVM in (Equation 4.9) reduces to the following ordinary 1-slack
structural SVM:

$$
\begin{aligned}
\underset{\lambda \ge 0,\,\zeta}{\text{minimize}} \quad & C f(A^T\lambda) + \lambda^T b + \zeta \\
\text{subject to} \quad & \zeta \ge \sup_{\tilde y}\; \lambda^T A\, \Delta\phi(x, y, \tilde y) + \Delta(y, \tilde y)
\end{aligned}
\qquad \text{(Equation 4.14)}
$$
Proof. By substituting the uncertainty set $\Delta^2\Phi(x, y) = \{\delta \mid A\delta \le b\}$ into the
optimization program (Equation 4.9), we obtain:

$$
\underset{w}{\text{minimize}} \quad C f(w) + \sup_{A\delta \le b,\, \tilde y} L(w, \delta, \tilde y)
\qquad \text{(Equation 4.15)}
$$

We can rewrite $\sup_{A\delta \le b,\, \tilde y} L(w, \delta, \tilde y)$ as:

$$
\begin{aligned}
\sup_{A\delta \le b,\, \tilde y}\; & w^T(\Delta\phi(x, y, \tilde y) + \delta) + \Delta(y, \tilde y) \\
&= \sup_{A\delta \le b} w^T\delta \;+\; \sup_{\tilde y}\; w^T\Delta\phi(x, y, \tilde y) + \Delta(y, \tilde y)
\end{aligned}
$$

We perform a Lagrangian relaxation on $A\delta \le b$:

$$
\begin{aligned}
&= \inf_{\lambda \ge 0} \sup_{\delta}\, \left(w^T\delta - \lambda^T A\delta + \lambda^T b\right) \;+\; \sup_{\tilde y}\; w^T\Delta\phi(x, y, \tilde y) + \Delta(y, \tilde y) \\
&= \inf_{\lambda \ge 0} \left(\lambda^T b + \sup_{\delta}\, (w^T - \lambda^T A)\delta\right) \;+\; \sup_{\tilde y}\; w^T\Delta\phi(x, y, \tilde y) + \Delta(y, \tilde y)
\end{aligned}
$$

Note that the value of $\sup_{\delta}(w^T - \lambda^T A)\delta$ will be $+\infty$ unless $w = A^T\lambda$; therefore:

$$
= \begin{cases}
\inf_{\lambda \ge 0} \lambda^T b + \sup_{\tilde y}\, \left[w^T\Delta\phi(x, y, \tilde y) + \Delta(y, \tilde y)\right] & \text{if } w = A^T\lambda \\
+\infty & \text{otherwise.}
\end{cases}
$$

Therefore, (Equation 4.15) can be rewritten as:

$$
\begin{aligned}
\underset{w}{\text{minimize}} \quad & C f(w) + \inf_{\lambda \ge 0}\, \lambda^T b + \sup_{\tilde y}\; w^T\Delta\phi(x, y, \tilde y) + \Delta(y, \tilde y) \\
\text{subject to} \quad & w = A^T\lambda
\end{aligned}
\qquad \text{(Equation 4.16)}
$$

By substituting $w$ with $A^T\lambda$, (Equation 4.16) can be equivalently written as
(Equation 4.14). Note that by Lemma 4, the value of $b$ is always non-negative,
so no value of $\lambda$ can drive the objective of the outer minimization to
negative infinity.
It is a known fact that maximization (or minimization) of L1 and L∞ norms
of affine functions can be converted to linear programs (Boyd and Vandenberghe,
2004). In the following proposition, we state that both Theorem 3 and Theorem 4
will lead to equivalent optimization programs in these cases.
Proposition 1. If the disturbances in the feature space are restricted by some
ellipsoid that is defined by L1 or L∞ norms, then the optimization program
generated by Theorem 3 can be equivalently transformed to one generated
by Theorem 4.
The proof can be found in Appendix B.
4.4.3. Ellipsoidal/Polyhedral conjunction
In some cases, the uncertainty set in feature space may resemble an ellipsoid
but with additional linear constraints. We can model this as the intersection of an
ellipsoid and a polyhedron. The following theorem describes how such uncertainty
sets can be transformed into regularizers.
Theorem 5. For $\Delta^2\Phi(x, y) = \{\delta \mid \|M\delta\| \le 1,\ A\delta \le b\}$, the optimization program
of the robust structural SVM in (Equation 4.9) reduces to the following ordinary
1-slack structural SVM:

$$
\begin{aligned}
\underset{w,\, \lambda \ge 0,\, \zeta}{\text{minimize}} \quad & C f(w) + \|M^{-1}(w - A^T\lambda)\|_* + b^T\lambda + \zeta \\
\text{subject to} \quad & \zeta \ge \sup_{\tilde y}\; w^T \Delta\phi(x, y, \tilde y) + \Delta(y, \tilde y)
\end{aligned}
\qquad \text{(Equation 4.17)}
$$
The proof of Theorem 5 is a combination of the proofs of Theorems 3 and 4.
First, we perform the Lagrangian relaxation as in the proof of Theorem 4, and then
we add the dual norm of $M^{-1}(w - A^T\lambda)$ (the coefficient of $\delta$) as the regularization term.
The results in Theorems 3, 4, and 5 apply to binary and multi-class SVMs as
well simply by restricting the space of y to a small set of values. For Theorem 3,
this reduces to results proved by Xu et al. (2009). For the later theorems, we are
not aware of any analogous previous work in binary or multi-class SVMs.
Some limiting cases of Theorem 5 are also interesting. For example, for a
(geometrically) infinitely large polyhedron $A\delta \le b$ (e.g., elements of the vector $b$
are infinitely large), $\lambda$ must be 0, which recovers the regularization term $\|M^{-1}w\|_*$
introduced in Theorem 3.

Let $\lambda_1, \ldots, \lambda_m$ be the eigenvalues of $M$. If $\min(\lambda_i) \to +\infty$ (for example, a
diagonal matrix with very large numbers on the diagonal), then as a result $\delta \to 0$ in the robust formulation. Intuitively, this means that the uncertainty set only
contains the unmodified input $x$. In this case, $M^{-1}$ approaches the zero matrix,
and as a result the regularization term $\|M^{-1}(w - A^T\lambda)\|_*$ fades as expected. On
the other hand, if $\max(\lambda_i) \to 0$, then $\|M^{-1}(w - A^T\lambda)\|_* \approx \|L_M I(w - A^T\lambda)\|_* = L_M \|w - A^T\lambda\|_*$, where $L_M \to +\infty$. Therefore, the constraint $w = A^T\lambda$ must be
satisfied, leading to (Equation 4.14).
4.5. Experiments
We demonstrate the utility of our approach by applying it to a collective
classification problem.
4.5.1. Dataset
We introduce a new dataset based on the political blogs dataset collected
by Adamic and Glance (2005). The original dataset consists of 1490 blogs and their
network structure from the 2004 presidential election period. Each blog is labeled
as liberal or conservative. We expanded this dataset by crawling the actual blog
texts in different years to obtain a vector of 250 word features for each blog in each
yearly snapshot from 2003 to 2013. We used the Internet Archive website
(https://archive.org/web/) to obtain snapshots of each blog in each year. We selected
the snapshot closest to October 10th of each year and removed blogs that were
inactive for an 8 month window (4 months before and after October 10th).
The political affiliation of a blog can thus be inferred from both the words
on the blog and its hyperlink relationships to other blogs, which are likely to have
similar political views. Since political topics evolve quickly over time, we expect a
significant amount of concept drift over the years, especially over the word features.
Since the test distribution is evolving significantly, we might expect a robust model
to outperform a non-robust model when trained and tested on different years. (We
plan to release both the expanded political blogs dataset and our robust SVM
implementation after publication of this work.)
4.5.2. Problem Formulation
In our experiments, we use both word features and link features. We
construct one multinomial feature for each word $i$ and label $k$: $\phi_{ik}(x, y) = \sum_j x^w_{ji}\, y_{jk}$, where $x^w_{ji} = 1$ if the $j$th blog contains the $i$th word, and $y_{jk} = 1$
if the $j$th blog has label $k$. We also construct a link feature for each label $k$:
$\phi_k(x, y) = \sum_{ij} x^e_{ij}\, y_{ik}\, y_{jk}$, where $x^e_{ij} = 1$ if there is a link from the $i$th blog to the
$j$th blog.
For our constraints, we assume that the number of words added or removed
is bounded by some budget, Bw , and the number of edges by another budget, Be .
Thus, letting $x^w$ be the vector of all word-related variables, $\|\tilde x^w - x^w\|_1 \le B_w$. Similarly,
$\|\tilde x^e - x^e\|_1 \le B_e$.
In order to construct the uncertainty set in the feature space, we follow the
construction procedure in Theorem 2 and then apply Corollary 1. For the word
features φik and edge features φk we can construct separate uncertainty sets:
$$
|\delta_{lk}| \le \sum_{i} |\tilde x^w_{il} - x^w_{il}|
\;\;\Rightarrow\;\;
\sum_{l} |\delta_{lk}| \le \sum_{i,l} |\tilde x^w_{il} - x^w_{il}| = \|\tilde x^w - x^w\|_1 \le B_w
$$
$$
|\delta_{ek}| \le \sum_{i,j} |\tilde x^e_{ij} - x^e_{ij}| = \|\tilde x^e - x^e\|_1 \le B_e
$$

In our domain there are two classes, liberal and conservative, so $k \in \{0, 1\}$.
As a result, $\sum_{k=0}^{1} \sum_{l} \frac{|\delta_{lk}|}{2B_w} \le 1$ and $\sum_{k=0}^{1} \frac{|\delta_{ek}|}{2B_e} \le 1$. Summing the inequalities and
dividing by two:

$$
\sum_{lk} \frac{|\delta_{lk}|}{4B_w} + \sum_{k} \frac{|\delta_{ek}|}{4B_e} \;\le\; 1
$$

Finally, let $\delta = [\delta_{11}, \ldots, \delta_{nm}, \delta_{e0}, \delta_{e1}]^T$, where $m = 250$ is the number of word
attributes chosen from the training data, and $n$ is the number of nodes in
the graph. Then $M$ is a diagonal matrix with entries $[\frac{1}{4B_w}, \ldots, \frac{1}{4B_w}, \frac{1}{4B_e}, \frac{1}{4B_e}]$, so we
will have $\|M\delta\|_1 \le 1$. Note that, in this uncertainty translation, the base case of
Lemma 3 holds in the first place, so the inequality is in its tightest form.
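The following sketch (my own toy illustration; the budgets, node count, and the flattened layout of $\delta$ are assumptions, not the dissertation's code) builds the diagonal of $M$ as described above and checks a budget-respecting disturbance against the weighted $L_1$ unit ball:

```python
# Toy construction of the diagonal matrix M for the uncertainty mapping above:
# word-feature entries get 1/(4*Bw), the two edge-feature entries get 1/(4*Be).
import numpy as np

n, m = 10, 250            # nodes and word attributes (m = 250 as in the text)
Bw, Be = 30.0, 8.0        # word and edge modification budgets (illustrative)

n_word_feats = n * m      # hypothetical flattened layout: word features, then 2 edge features
diag_M = np.concatenate([np.full(n_word_feats, 1.0 / (4 * Bw)),
                         np.full(2, 1.0 / (4 * Be))])

delta = np.zeros(n_word_feats + 2)
delta[:5] = Bw / 5.0      # word-feature disturbances summing to Bw
delta[-2:] = Be           # each edge feature disturbed by Be
print(np.sum(diag_M * np.abs(delta)))   # 0.75 here; bounded by 1 in general
```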
4.5.3. Methods and Results
We partitioned the blogs into three separate sub-networks and used three-way
cross-validation, training on one sub-network, using the next as a validation set for
tuning parameters, and evaluating on the third. We used mutual information to
select the 250 most informative words separately for each training set. However,
rather than training, tuning, and testing on the same year, we trained and tuned on
the snapshot from 2004 and evaluated the models on every snapshot from 2003 to
2013.
Standard structural SVMs have one parameter C that needs to be tuned. The
robust method has an additional regularization parameter $C' = 1/B_e = 1/B_w$,
which scales the strength of the robust regularization. (In general, $B_e$ and $B_w$
could be tuned separately, but we did not do this in our experiments.) We chose
these parameters from the semi-logarithmic set {0, .001, .002, .005, .1, . . . , 10, 20, 50}.
We intentionally added 0 to this set to allow the algorithm to remove one of the
regularization terms.
We learned parameters using a cutting plane method, implemented using the
Gurobi optimization engine 5.60 (Gurobi Optimization, 2014) for running all
integer and quadratic programs. We ran for 50 iterations and selected the weights
from the iteration with the best performance on the tuning set.
Figure 4.1 shows the average error rate of the robust and non-robust
formulations in each year. In 2004, both have very similar accuracy. This is not
surprising, since they were tuned for this particular year. In years before and
after 2004, the error rate increases for both models. However, the error rate of
the robust model is often substantially lower than the non-robust model. We
attribute this to the fact that the robust model has additional L∞ regularization
(since the L∞ norm is dual to the L1 norm that defines the uncertainty set). This prevents the model from
relying too much on a small set of features that may change, such as a particular
political buzzword that might go out of fashion. These results demonstrate that
robust methods for learning structural SVMs can lead to large improvements in
accuracy, even when we do not have an explicit adversary or a perfect model of the
perturbations.
4.6. Related work
In this chapter, our formulation of robustness is based on a minimax
formulation, in which the learner minimizes a loss function while, at the same time,
an antagonistic adversary tries to maximize the same quantity. Some related work
has focused on designing classifiers
that are robust to adversarial perturbation of the input data in a minimax
formulation. For example, Globerson and Roweis (2006) introduce a classifier
that is robust to feature deletion. Teo et al. (2008) extend this to any adversarial
manipulation that can be efficiently simulated. Livni and Globerson (2012) show
that a minimax formulation of robustness in the presence of stochastic adversaries
results in L2 (Frobenius for matrix weights) regularization, and for the multi-class
case results in two-infinity regularization of the model weights. Torkamani and
Lowd (2013) show that for associative Markov networks, robust weight learning for
collective classification can be done efficiently with a convex quadratic program.

[Figure 4.1 appears here: Prediction Error (%) of the robust and non-robust models for test years 2003 through 2013.]

FIGURE 4.1. Average prediction error of robust and non-robust models, trained on
year 2004 and evaluated on years 2003-2013.
Xu et al.’s work on robustness and regularization (Xu et al., 2009) is the
most related previous work, which analyzes the connection between robustness and
regularization in binary SVMs. Our work goes well beyond these results (and the
ones mentioned in the introduction) by analyzing arbitrary structural SVMs and
showing how they can be made robust without directly simulating the adversary, by
choosing the appropriate regularization function.
4.7. Conclusion
In this chapter, we showed that the robust formulation of structural SVMs,
which is intractable in general, can be reduced to tractable optimization programs
for special uncertainty sets. We also showed that for multinomial feature functions,
ellipsoidal uncertainty in sample space can be translated to one in feature space.
We also showed that robustness to polyhedral uncertainties can be achieved by
linear regularization of the objective and linear transformation of the feature space.
We introduced a new dataset that can be used for structured output prediction in
the presence of distribution change over time. Experimental results showed that our
method outperforms the standard non-robust approach in the presence of concept
drift in real-world data.
So far, our focus has been on worst-case adversarial changes in the input data.
In the next chapter, we introduce a regularization method that robustifies
machine learning in the presence of average-case adversaries. The proposed method
optimizes a new loss function: the expected hinge loss under dropout noise.
CHAPTER V
MARGINALIZATION AND KERNELIZATION OF DROPOUT FOR
SUPPORT VECTOR MACHINES
This work is under review in the Journal of Machine Learning Research
(JMLR). I was the primary contributor to the methodology and writing, and
designed and conducted the experiments. My Ph.D. advisor, Dr. Daniel Lowd
contributed partly to the methodology and writing. Daniel Lowd was the principal
investigator for this work.
A central problem in machine learning is learning complex models that
generalize to unseen data. One common solution is to use an ensemble of many
models instead of a single model. Another strategy is to expand the dataset, either
implicitly or explicitly, by exploiting invariances in the domain. Both strategies
reduce the variance of the estimator, leading to more robust models. Dropout
training can be viewed as an instance of either of these strategies. In dropout
training, portions of the model or input data are randomly “dropped out” while
learning the parameters (Srivastava et al., 2014). Thus, dropout can be viewed as
optimizing a distribution of models, or optimizing a model on a distribution over
datasets. In deep networks, this reduces co-adaptation of the weights and allows
more complex models to be learned with less overfitting. In shallow models, such as
logistic regression (LR), dropout acts as a regularizer that penalizes feature weights
based on how much they influence the classifier’s predictions (Wager et al., 2013).
Support vector machines (SVMs) are among the most popular and effective
classification methods, obtaining state-of-the-art results in many domains. SVM
training algorithms reduce generalization error by maximizing the (soft) margin
between the classes. For linear classifiers, this amounts to minimizing the hinge
loss plus a quadratic weight regularizer. To learn a non-linear classifier, SVMs can
use a kernel function to compute dot products in a high-dimensional feature space
without constructing the explicit feature representation. While the max-margin
principle is helpful in improving generalization, overfitting remains a risk when
learning complex functions from limited data. Kernelized SVMs are at the greatest
risk, due to their increased expressivity.
Previous work on dropout has mostly focused on deep networks and logistic
regression (Srivastava et al., 2014; Wager et al., 2013; Wang and Manning, 2013;
Maaten et al., 2013). For logistic regression, there are methods to make training
more efficient by approximating or marginalizing over the randomness introduced
by dropout (Wager et al., 2013; Maaten et al., 2013). Other papers analyze the
quantitative and qualitative effect of dropout in logistic regression (Wager et al.,
2013, 2014). The only work on dropout in SVMs is limited to linear SVMs and
consists of a relatively complicated method for optimizing the marginalized dropout
objective (Chen et al., 2014a).
In this chapter, we analyze dropout in both linear and non-linear SVMs. Our
goal is to develop methods that are simple, efficient, and effective at improving the
generalization of SVMs on real-world datasets. For linear SVMs, we show that the
expected hinge loss under dropout noise can be closely approximated as a smooth,
closed-form function. This marginalized dropout objective is easy to optimize and
leads to improved performance on a number of datasets.
For non-linear SVMs, we present two methods for efficiently performing
dropout on the kernel feature map, even when this feature map is high- or infinite-dimensional. Our first method generates a linear representation of the input
data by randomly sampling from the Fourier transformation bases of the kernel
function as introduced by Rahimi and Recht (2007). It then learns a linear SVM
with marginalized dropout noise on this transformed feature representation. The
second method approximates the effect of dropout in feature space by adding a
weighted L2 regularizer to the dual variables in the SVM optimization problem.
In experiments on digit classification and census datasets, both methods lead to
improved performance compared to a standard SVM with a radial basis function
(RBF) kernel, but the transformed feature representation method is more effective
than dual regularization.
5.1. Related work
The connection between different types of noise and regularization has been
explored by many authors. For example, Bishop (1995) shows that adding Gaussian
noise to neural network inputs while training is equivalent to L2 regularization of
the weights. For the case of linear SVMs, Xu et al. (2009) demonstrate that worst-case additive noise with bounded norm is equivalent to regularizing the weights
with the dual norm. Globerson and Roweis (2006) introduce the “nightmare at test
time” scenario in which an adversary removes a certain number of features from the
model, setting them to zero. They propose a modified SVM formulation to optimize
performance against such an adversary.
Wager et al. (2013) analyze the regularization effect of dropout noise in
generalized linear models (GLMs) by computing a second-order approximation to
the expected loss of the dropout-corrupted data. This allows the dropout objective
to be optimized explicitly rather than implicitly. Unfortunately, this second-order
approximation cannot be applied to linear SVMs because the hinge loss is not
differentiable.
Maaten et al. (2013) also introduce methods for learning linear models with
corrupted features, marginalizing over the corruption by introducing a surrogate
upper bound of the logistic loss. For certain loss functions and noise distributions,
they can compute the marginalized objective directly; for logistic loss, they
minimize an upper bound on the expected loss instead. They do not consider
hinge loss. Chen et al. (2014a) extend these methods to analyze linear SVMs with
dropout noise. Since exactly computing the marginalized objective is hard, the
authors introduce a variational approximation. They optimize this approximate
objective using expectation maximization and iterative least squares. The goals
of Chen et al. are similar to ours, but our formulation is simpler and easier to
optimize.
Wang and Manning (2013) introduce a fast way to approximate the expected
dropout gradient. The key idea is to draw the noised activation of each unit from a
normal distribution instead of directly sampling many Bernoulli variables. By using
this approximation several times for each training example, the variance of the
gradients is reduced without a significant increase in computation time. They also
present a closed-form solution which relies on approximating the logistic function as
a Gaussian cumulative distribution function.
In this chapter, we also use a Gaussian approximation to the noisy dot
products. However, we focus on hinge loss rather than logistic loss, and we show
how to compute the gradient analytically without sampling or introducing
any additional approximations.
Dropout is significantly different from additive noise, since the expected
perturbation of a feature depends on its value in the data. For example, features
that are already zero will be perturbed by standard additive noise, but remain
unchanged by dropout. Instead, dropout noise is best viewed as an instance of
multiplicative noise, since each feature is multiplied by 0 with some probability δ
and 1/(1 − δ) with probability (1 − δ).
To date, there has been limited exploration of training with multiplicative
noise other than dropout (Wang et al. (2013) also consider multiplicative Gaussian
noise, and observe that it is equivalent to dropout under the quadratic
approximation), and no study of training SVMs with multiplicative
noise. In this chapter, we address both of these questions, leading to a better
understanding of how noise relates to generalization in different types of models.
5.2. Dropout in linear SVMs
A standard formulation for learning linear SVMs is to minimize the hinge loss
of the training data with a quadratic regularizer on the weights:
$$
\underset{w,\, b}{\text{minimize}} \quad \frac{\lambda}{2}\|w\|_2^2 + \sum_{i=1}^{N} \left[1 - y_i(w^T x_i + b)\right]_+
\qquad \text{(Equation 5.1)}
$$
where w and b are the model parameters (weights and bias); the training data
consists of instance and label pairs, xi ∈ Rn and yi ∈ {+1, −1}; λ is the L2
regularization coefficient; and [z]+ = max(z, 0) is the hinge function. We focus
on binary classification, where labels are +1 and −1; multiclass classification can be
reduced to binary classification.
The idea of dropout training is to optimize performance over a distribution
of model structures or datasets. For linear SVMs, this amounts to minimizing the
expected loss over noisy versions of the training data:
$$
\underset{w,\, b}{\text{minimize}} \quad \frac{\lambda}{2}\|w\|_2^2 + \sum_{i=1}^{N} \mathbb{E}_{\tilde x_i}\left[1 - y_i(w^T \tilde x_i + b)\right]_+
\qquad \text{(Equation 5.2)}
$$
For dropout noise, x̃i is constructed by removing features from the original training
example xi with some dropout probability δ. More formally, x̃i can be represented
as xi with multiplicative noise: x̃ij = ζj xij , where ζj = 0 with probability δ and
ζj = 1/(1 − δ) with probability 1 − δ. Note that E[ζj ] = 1 and E[x̃i ] = xi .
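For concreteness, a minimal sketch of this corruption model (mine, not the authors' code):

```python
# Dropout corruption: each feature is zeroed with probability delta and scaled
# by 1/(1 - delta) otherwise, so that E[x_tilde] = x.
import numpy as np

def dropout_corrupt(x, delta, rng):
    zeta = rng.binomial(1, 1.0 - delta, size=x.shape) / (1.0 - delta)
    return x * zeta

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
x_tilde = dropout_corrupt(x, delta=0.3, rng=rng)
print(x.mean(), x_tilde.mean())   # close: the corruption is unbiased
```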
When the data is low-dimensional, or the data matrix is extremely sparse,
it may be affordable to compute the expected loss or its gradient exactly. More
formally, when there are few non-zeros in a data sample or the weight vector is
expected to be sparse (e.g., because of an $\ell_1$ regularization), then $\mathbb{E}_{\tilde x_i}[1 - y_i(w^T \tilde x_i + b)]_+$ can be expanded to $\sum_{\xi} p(\xi)\left[1 - y_i((w \odot x_i)^T \xi + b)\right]_+$, where $\xi$ is the vector of the
multiplicative noise in all dimensions, $\odot$ is the elementwise (Hadamard) product,
and $p(\xi) = \delta^{(\#\text{zeros in } \xi)} (1-\delta)^{(\#\text{ones in } \xi)}$. Since the number of applicable dropout noise
vectors is exponential in the number of non-zeros in $w \odot x_i$ (i.e., $\|w \odot x_i\|_0$),
for small values of $\|w \odot x_i\|_0$ the computation of the expected value of the loss
function under dropout noise may be tractable. There can be cases where the data
is not sparse, but the weight vector is expected to be sparse, due to a sparsity-inducing penalty. Even in such a scenario, if we start the optimization algorithm
with a sparse initial weight vector, we may be able to calculate the exact dropout
expectation during the optimization.
The difficulty comes when the data is high-dimensional and the expected
weight vector is relatively dense. Then, neither the expected loss nor its gradient
can be efficiently calculated.
The simplest alternative is to approximate the expected loss with sampling or
Monte-Carlo methods. For online learning algorithms (such as Pegasos (Shalev-Shwartz et al., 2011)), noisy instances can be generated in each iteration. For
batch learning algorithms, we can approximate this expectation using K noisy
replications of the dataset:
$$
\underset{w,\, b}{\text{minimize}} \quad \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{K}\sum_{k=1}^{K}\; \sum_{(\tilde x,\, y) \in \tilde D^{(k)}} \left[1 - y(w^T \tilde x + b)\right]_+
$$

where $\tilde D^{(k)}$ is the $k$th noisy replication of $D$, in which each instance $x$ has
been replaced by a noised instance $\tilde x$.
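A short sketch of this Monte-Carlo estimate follows (an illustration under the formulation above, not the dissertation's implementation):

```python
# Average the hinge loss over K dropout-corrupted replications of the dataset.
import numpy as np

def mc_dropout_hinge(w, b, X, y, delta, K=10, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(K):
        zeta = rng.binomial(1, 1.0 - delta, size=X.shape) / (1.0 - delta)
        margins = 1.0 - y * ((X * zeta) @ w + b)
        total += np.maximum(margins, 0.0).sum()
    return total / K

# Example: estimate the dropout-expected hinge loss of a random model.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(50, 8)), np.sign(rng.normal(size=50))
w, b = rng.normal(size=8), 0.0
print(mc_dropout_hinge(w, b, X, y, delta=0.3, K=20))
```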
The Monte-Carlo approach is simple, but it can be computationally
expensive. Obtaining a good approximation of the expectation may require many
iterations for online algorithms or many noisy replications of the data for batch
algorithms. Thus, we propose to approximate the expectation analytically, rather
than stochastically.
The advantages of an analytic approximation are faster training times and
more accurate solutions. This idea has already been applied to dropout in logistic
regression, either optimizing an approximation or an upper bound on the expected
logistic loss (Wager et al., 2013; Maaten et al., 2013). For linear SVMs, the
quadratic approximation cannot be applied, because hinge loss is non-differentiable.
In this section, we derive a smooth approximation of the expected hinge loss.
The objective is easy to compute and can be optimized directly with standard
gradient-based methods.
[Figure 5.1 appears here: (a) a histogram of "The margin value" versus "Number of samples," comparing dropout samples with the Lindeberg-Feller CLT approximation; (b) a log-log plot of the simulated Berry-Esséen upper bound and an empirical upper bound against the number of non-zero weights.]

FIGURE 5.1. The results of running a Monte-Carlo simulation of calculating
1 − y(wT x̃ + b) for randomly drawn x̃'s and draws from the approximated
Gaussian distributions. The dimension of each sample x̃ is 50 in the
histogram on the left. Right: simulation of the Berry-Esséen upper bound for different numbers of non-zero weights.
Let $\tilde x_i = x_i \odot \zeta$ (where $\zeta = [\zeta_1, \ldots, \zeta_m]^T$ and $m$ is the dimension of $x_i$) be the
corrupted version of $x_i$ and $y$ be its label, such that the $\zeta_j$'s are independently and
identically drawn from a Bernoulli distribution with parameter $\delta$. According to
the Lindeberg-Levy central limit theorem, if the features and the weights have
bounded minimum and maximum values, then as the dimension $m$ and the number of
non-zero weights and features per example increase, the margin of the SVM for this sample
converges in distribution as follows:

$$
1 - y(w^T \tilde x_i + b) \;\xrightarrow{\;D\;}\; \mathcal{N}\!\left(1 - y(w^T x_i + b),\; \frac{\delta}{1-\delta}\sum_{j=1}^{m} x_{ij}^2 w_j^2\right)
$$
In practice, in an SVM training process with fixed regularization, the weights
have bounded magnitude. This is similar to the approach of Wang and Manning
(2013), where they propose a similar application of the central limit theorem to
improve the speed of Monte-Carlo dropout in logistic regression. Figure 5.1a shows
an example distribution over margin values according to sampled dropout noise and
the approximated Gaussian distribution. Although the dimension of the sample
vectors in this simulation is small (∼ 50), we observe a close match between the
two histograms.
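A small simulation in the spirit of Figure 5.1a (my own sketch with toy values, not the dissertation's code) compares sampled dropout margins against the Gaussian approximation:

```python
# Compare sampled dropout margins 1 - y*(w^T x_tilde + b) with the Gaussian
# approximation N(1 - y*(w^T x + b), delta/(1-delta) * sum_j x_j^2 w_j^2).
import numpy as np

rng = np.random.default_rng(0)
m, delta = 50, 0.5
x, w, b, y = rng.uniform(0, 1, m), rng.normal(size=m), 0.1, 1.0

zeta = rng.binomial(1, 1 - delta, size=(100000, m)) / (1 - delta)
margins = 1 - y * ((zeta * x) @ w + b)      # sampled dropout margins

mu = 1 - y * (x @ w + b)
var = delta / (1 - delta) * np.sum(x**2 * w**2)
print(margins.mean(), mu)                   # means agree
print(margins.var(), var)                   # variances agree
```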
Lemma 5. The expected value of the hinge function over a normal distribution is:
$$
\mathbb{E}_{\xi \sim \mathcal{N}(\mu,\, \sigma^2)}\,[\xi]_+ = \mu\,\Phi\!\left(\frac{\mu}{\sigma}\right) + \sigma\,\phi\!\left(\frac{\mu}{\sigma}\right)
\qquad \text{(Equation 5.3)}
$$
where Φ and φ are respectively the cumulative and probability density functions of a
normal distribution with zero mean and variance equal to one.
The proof is provided in Appendix C.
Therefore, by Lemma 5, the optimization program of the SVM with L2 regularization in the primal form (Problem Equation 5.2) with dropout noise can be approximated by the following optimization program:

\[
\operatorname*{minimize}_{w,b} \quad \frac{\lambda}{2}\|w\|_2^2 + \sum_{i=1}^{N} \Big[ u_i\,\Phi\Big(\frac{u_i}{\sigma_i}\Big) + \sigma_i\,\phi\Big(\frac{u_i}{\sigma_i}\Big) \Big]
\qquad \text{(Equation 5.4)}
\]

where $u_i = 1 - y_i(w^T x_i + b)$, $\sigma_i = \sqrt{\frac{\delta}{1-\delta}\sum_{j=1}^{m} x_{ij}^2 w_j^2}$ ($m$ is the number of features), and Φ and φ are the cumulative and probability density functions of the standard normal distribution. A direct proof is given in Appendix C.
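To make the optimization concrete, here is a short sketch of Equation 5.4 as a smooth objective (our illustration, with assumed variable names and SciPy); the small constant added to σ_i is a numerical safeguard for the σ_i → 0 case discussed below:

```python
# The marginalized SVM objective of Equation 5.4, optimizable by L-BFGS.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def marginalized_svm_objective(params, X, y, lam, delta):
    w, b = params[:-1], params[-1]
    u = 1.0 - y * (X @ w + b)                                    # u_i
    sigma = np.sqrt(delta / (1 - delta) * ((X**2) @ (w**2))) + 1e-12
    loss = u * norm.cdf(u / sigma) + sigma * norm.pdf(u / sigma)
    return 0.5 * lam * (w @ w) + loss.sum()

# Usage: params0 = np.zeros(X.shape[1] + 1)
# result = minimize(marginalized_svm_objective, params0,
#                   args=(X, y, 1.0, 0.2), method="L-BFGS-B")
```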
5.2.1. Convexity
The marginalized cost function (Equation 5.4) is nonlinear, but it is always
convex. We use the following lemma for proving its convexity:
Lemma 6. Let $f : \mathbb{R}^m \to \mathbb{R}$ be a multivariate function, and let $g(t) = f(x_0 + t\Delta x)$ ($t \in \mathbb{R}$) for some arbitrary $x_0, \Delta x \in \mathbb{R}^m$. If $g(t)$ is convex in $t$ for all $x_0, \Delta x \in \mathbb{R}^m$, then $f(x)$ is convex in $x$.
Proof. By the definition of convexity, it suffices to show (1 − λ)f (A) + λf (B) ≥
f ((1 − λ)A + λB) for any λ ∈ [0, 1] and any A, B ∈ Rm . Let x0 := A and ∆x :=
B − A, then the former inequality is equivalent to (1 − λ)g(0) + λg(1) ≥ g(λ), which
holds by assumed convexity of g.
In the following theorem, we prove that, perhaps surprisingly, the proposed cost function is convex. Therefore, it can be efficiently optimized by off-the-shelf optimization algorithms.
Theorem 6. The marginalized loss $f(w, b; y_i, x_i) = u_i \Phi(u_i/\sigma_i) + \sigma_i \phi(u_i/\sigma_i)$ is jointly convex in $w$ and $b$ for any given sample and label pair $(x_i, y_i)$, where $u_i = 1 - y_i(w^T x_i + b)$ and $\sigma_i = \sigma_\delta \sqrt{\sum_{j=1}^{m} x_{ij}^2 w_j^2}$ (with $\sigma_\delta = \sqrt{\delta/(1-\delta)}$, as in Equation 5.4).
Proof. Consider a one-dimensional slice of the objective function in an arbitrary direction $(\Delta w, \Delta b)$ from an arbitrary point $(w, b)$ in the parameter space. Let:

\[
u_i(t) = 1 - y_i\big((w + t\Delta w)^T x_i + b + t\Delta b\big)
= 1 - y_i(w^T x_i + b) - t\,y_i(\Delta w^T x_i + \Delta b)
= U - \Delta U\, t
\]

\[
\sigma_i(t) = \sigma_\delta \sqrt{\sum_k x_{ik}^2 (w_k + t\Delta w_k)^2}
= \sigma_\delta \sqrt{\sum_k x_{ik}^2 (w_k^2 + 2t w_k \Delta w_k + t^2 \Delta w_k^2)}
= \sigma_\delta \sqrt{S + pt + qt^2}
\qquad \text{(Equation 5.5)}
\]
where $U = 1 - y_i(w^T x_i + b)$, $\Delta U = y_i(\Delta w^T x_i + \Delta b)$, $S = \sum_k x_{ik}^2 w_k^2$, $p = \sum_k 2 x_{ik}^2 w_k \Delta w_k$, and $q = \sum_k x_{ik}^2 \Delta w_k^2$. Also, let $f(t) = u_i(t)\Phi(u_i(t)/\sigma_i(t)) + \sigma_i(t)\phi(u_i(t)/\sigma_i(t))$. Based on Lemma 6, if $f(t)$ is convex in $t$ for any $(x_i, y_i)$, $w$, $b$, $\Delta w$ and $\Delta b$, then $f(w, b; y_i, x_i)$ is jointly convex in its parameters. We have:
\[
\frac{\partial^2 f(t)}{\partial t^2}
= \frac{e^{-\frac{(U - \Delta U t)^2}{2(S + pt + qt^2)\sigma_\delta^2}} \Big( (2\Delta U S + \Delta U p t + pU + 2qtU)^2 + (4qS - p^2)(S + pt + qt^2)\sigma_\delta^2 \Big)}{4\sqrt{2\pi}\,(S + pt + qt^2)^{5/2}\,\sigma_\delta}
= \frac{e^{-\frac{u_i^2}{2\sigma_i^2}} \Big( (2\Delta U S + \Delta U p t + pU + 2qtU)^2 + (4qS - p^2)\sigma_i^2 \Big)}{4\sqrt{2\pi}\,\sigma_i^5/\sigma_\delta^4}
\qquad \text{(Equation 5.6)}
\]
Note that the denominator of the second derivative is non-negative ($4\sqrt{2\pi}\,\sigma_i^5/\sigma_\delta^4 \geq 0$), and in the numerator all terms are always non-negative, except $4qS - p^2$, which can be negative for some values of $S$, $p$ and $q$ (i.e., $e^{-\frac{(U-\Delta U t)^2}{2(S+pt+qt^2)\sigma_\delta^2}} \geq 0$, $(2\Delta U S + \Delta U pt + pU + 2qtU)^2 \geq 0$, and $(S + pt + qt^2)\sigma_\delta^2 \geq 0$).
By definition, $\sigma_i(t)$ is always non-negative. Consider hypothetical values of $S$, $p$ and $q$ for which there exists some $t$ such that $\sigma_i(t) = \sigma_\delta\sqrt{S + pt + qt^2} = 0$. Then the roots of $\sigma_i(t)$ will be $t = \frac{-p \pm \sqrt{p^2 - 4qS}}{2q}$.

As long as $\sigma_i(t)$ has no real roots (i.e., $\sqrt{p^2 - 4qS}$ is imaginary), we will have $4qS - p^2 > 0$, and as a result $\frac{\partial^2 f(t)}{\partial t^2} > 0$.

The marginalized cost function is undefined for $\sigma_i = 0$, which appears in $\frac{u_i}{\sigma_i}$; however, it is continuous and convex in the limit as $\sigma_i(t) \to 0$ (or equivalently $t \to \frac{-p \pm \sqrt{p^2 - 4qS}}{2q}$, when $p^2 - 4qS \geq 0$). Let $\min_t \sigma_i(t) = 0$ (i.e., for some values of $S$, $p$ and $q$, $p^2 - 4qS \geq 0$); then it is easy to show that:
\[
\lim_{\sigma_i(t)\to 0^+} f(t) =
\begin{cases}
u_i(t) & u_i(t) > 0 \\
0 & u_i(t) \leq 0
\end{cases}
= [u_i(t)]_+ = [1 - y_i((w + t\Delta w)^T x_i + b + t\Delta b)]_+
\]
which is the hinge loss incurred by the ith sample as t varies. As a result, if $\sigma_i(t) = 0$ for the ith training sample, then the contribution of that sample to the overall objective function is exactly that of a regular hinge loss. Clearly, the overall objective remains convex: the sum of convex functions is convex. Therefore, for any possible $\sigma_i(t)$ ($\sigma_i(t) \geq 0$), the function $f(t)$ is convex. Correspondingly, $f(w, b; y_i, x_i)$ is convex (by Lemma 6).
5.2.2. Regularization effect
The resulting cost function (Equation 5.4) can be directly optimized, and it is no longer the same as the hinge loss. In order to understand the theoretical reasons why dropout performs well in shallow models such as SVMs, we can compare the resulting cost function with the ordinary hinge loss. From a theoretical point of view, the generalization power of dropout-based methods comes from the regularization penalty R_dropout(w) that dropout imposes on the model weights:
\[
R_{\text{dropout}}(w) = \sum_{i=1}^{N} \Big[ u_i\,\Phi\Big(\frac{u_i}{\sigma_i}\Big) + \sigma_i\,\phi\Big(\frac{u_i}{\sigma_i}\Big) - [1 - y_i(w^T x_i + b)]_+ \Big]
\qquad \text{(Equation 5.7)}
\]

where $u_i$ and $\sigma_i$ are as defined in Equation 5.4. Although the incurred regularization function is highly non-convex, as proved in the previous section the overall cost function remains convex (Figure 5.2).
FIGURE 5.2. Losses and differences in losses as a function of a single model weight. (a) A single sample's contribution to the loss function: the hinge loss, dropout noise applied to the hinge loss, and the closed-form dropout regularization of the hinge loss. (b) The regularization effect of one sample (i.e., the marginalized loss minus the hinge loss) while varying the weight vector in one dimension. (c) The aggregated loss of several samples. (d) The aggregated regularization effect of several samples from a one-dimensional cut of the loss function.
Note that the marginalized cost function is always an upper-bound on the hinge
loss. Although the effective regularization function is non-convex, the marginalized
objective function itself is convex.
5.2.3. Approximation quality
Since the approximation depends on the central limit theorem (assuming that $z_i = 1 - y_i(w^T \tilde{x}_i + b) \sim \mathcal{N}(u_i, \sigma_i^2)$), this method should be used when the data is not extremely sparse (e.g., there are at least 10 non-zero features in the average sample) and the regularization penalty does not favor extremely sparse solutions.
More formally, let $u_i = 1 - y_i(\bar{w}^T \tilde{x}_i + b)$ be a random variable that represents the margin for some fixed weights $\bar{w}$ and some arbitrary dropped-out sample $\tilde{x}_i$ with the desired label $y_i$, and let $m_{\bar{w}}$ be the number of non-zero elements in $\bar{w}$. Also let $F_{u_i}(z) = P(u_i \leq z)$ be the cumulative distribution function (CDF) of $u_i$. By the Berry-Esséen theorem, the supremum of the difference between the CDF of $u_i$ and its Gaussian approximation is upper-bounded by:

\[
\sup_z \Big| F_{u_i}(z) - \Phi\Big(\frac{z - \mu_i}{\sigma_i}\Big) \Big| \leq \frac{C \rho_i}{\sigma_i^3 \sqrt{m_{\bar{w}}}}
\qquad \text{(Equation 5.8)}
\]
By the best estimate to date, C ≤ 0.4748 (Korolev and Shevtsova, 2012). ρ_i is the third absolute moment of u_i, and it can be calculated in closed form. In Figure 5.1b, we simulate this upper bound for different numbers of non-zero weights on a toy dataset. In practice, we observe that the true and the approximated distributions of u_i closely match each other, as in Figure 5.1a.
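A simulation in the spirit of Figure 5.1a can be reproduced in a few lines (our reconstruction, not the original experiment code); the particular values of m, delta, and the random data are illustrative:

```python
# Compare the empirical margin distribution under (scaled) dropout noise
# with its Gaussian approximation, for a fixed (w, b) and one sample x.
import numpy as np

rng = np.random.default_rng(0)
m, delta = 50, 0.2
x, w, b, y = rng.random(m), rng.normal(size=m), 0.1, 1.0

zeta = (rng.random((100_000, m)) >= delta) / (1 - delta)  # E[zeta] = 1
margins = 1 - y * ((zeta * x) @ w + b)

mu = 1 - y * (x @ w + b)
sigma = np.sqrt(delta / (1 - delta) * np.sum(x**2 * w**2))
print(margins.mean(), mu)    # means agree
print(margins.std(), sigma)  # standard deviations agree
```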
It is easy to prove that the optimization program in Equation 5.4 is always an upper bound on the regular SVM's objective. Therefore, the dropout approximation is in fact an optimization transfer that intrinsically applies extra regularization to the learned weights. The objective is a smooth approximation of a convex function (the expected hinge loss), and it is easily differentiated and optimized with gradient descent, L-BFGS, or other standard methods.
We provide visual intuition about our proposed approximation in Figure 5.2. In Figure 5.2a, we consider one single sample and show the hinge loss (red), its closed-form expectation from Equation 5.3 (green), and the Monte-Carlo function obtained by averaging over actual dropout-noised samples (blue). The noised hinge loss provides an upper bound that is tight at the extremes and smooth in between. Figure 5.2c shows how several samples with different margins form the aggregated loss function. As the dimensionality of the model weights increases, the approximation converges tightly to the true expectation, which is convex. For very low-dimensional inputs (∼4-5), the method can still be applied but might perform poorly. This method is appropriate for real-world problems, where we deal with hundreds or thousands of dimensions.
5.3. Dropout in non-linear SVMs
By using the kernel trick, SVMs can learn a linear classifier in a higher
dimensional feature space without explicitly constructing those features. The kernel
trick relies on the dual SVM optimization program:

\[
\begin{aligned}
\operatorname*{maximize}_{\alpha} \quad & \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j k(x_i, x_j) \\
\text{subject to} \quad & \sum_i y_i \alpha_i = 0, \quad 0 \leq \alpha_i \leq 1/\lambda \;\; \forall i
\end{aligned}
\qquad \text{(Equation 5.9)}
\]
where y is the vector of labels, λ is the L2 regularization weight of the primal
optimization program, and k(xi , xj ) = f (xi )T f (xj ) is the dot-product (reproducing
kernel) of a feature function vector f (.) in a Hilbert space. For many feature
functions, the kernel entry k(xi , xj ) can be calculated even if f (x) has no explicit
representation and is infinite dimensional. Instead of maintaining feature weights
w (which could be infinite dimensional), the dual problem uses instance weights α.
The predicted label for a new instance $x'$ is given by $\operatorname{sign}\big(\sum_i y_i \alpha_i k(x_i, x')\big)$. The instances $x_i$ with $\alpha_i > 0$ are commonly referred to as support vectors.
5.3.1. Defining dropout in kernels
In deep networks, dropout can be applied to the input layer, any of the hidden layers, or some combination of them. In SVMs with non-linear kernels, we can analogously apply dropout noise to either the input-space attributes or the implicit features.

Given a kernel function $k(x_i, x_j)$ with corresponding feature function $f$, we define the kernelized dropout function, $\tilde{k}(x_i, x_j, \zeta, \xi)$, as a function of both the instances, $x_i$ and $x_j$, and the dropout noise, ζ and ξ. The specific definition depends on the type of dropout:

– Input space dropout: $\tilde{k}(x_i, x_j, \zeta, \xi) = k(\zeta \circ x_i,\; \xi \circ x_j)$

– Feature space dropout: $\tilde{k}(x_i, x_j, \zeta, \xi) = (\zeta \circ f(x_i))^T (\xi \circ f(x_j))$

We can also drop whole support vectors (i.e., drop out the α's). This turns out to be very similar to some variations of bagging, so we skip it in this chapter.
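In code, the two definitions above read as follows (a sketch of ours; the feature-space case assumes an explicit, finite-dimensional feature map f):

```python
# Input-space and feature-space dropout for a base kernel k with feature
# map f; zeta and xi are 0/1 dropout masks for the two instances.
import numpy as np

def input_space_dropout_kernel(k, x_i, x_j, zeta, xi):
    # k~(x_i, x_j, zeta, xi) = k(zeta o x_i, xi o x_j)
    return k(zeta * x_i, xi * x_j)

def feature_space_dropout_kernel(f, x_i, x_j, zeta, xi):
    # k~(x_i, x_j, zeta, xi) = (zeta o f(x_i))^T (xi o f(x_j))
    return (zeta * f(x_i)) @ (xi * f(x_j))

# Example base kernel: RBF with parameter gamma.
def rbf(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))
```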
For a linear kernel, feature space and input space are identical, so dropout in both spaces is the same. Dimension dropout is also the same, since excluding dimensions from the kernel calculation is equivalent to multiplying those attributes by zero:

\[
\sum_{l:\zeta_l \neq 0,\, \xi_l \neq 0} (\zeta_l x_{i,l})(\xi_l x_{j,l}) = \sum_l (\zeta_l x_{i,l})(\xi_l x_{j,l}) = k(\zeta \circ x_i,\; \xi \circ x_j)
\qquad \text{(Equation 5.10)}
\]
More generally, dimension dropout is equivalent to input space dropout for any kernel function that depends only on the dot-products of the original vectors, and not on the original vectors themselves. That is, if $k(x_i, x_j) = g(x_i^T x_j)$ for some function $g$, then dimension dropout is equivalent to input space dropout. This includes all polynomial kernels, which can be expressed as $k(x_i, x_j) = (x_i^T x_j + c)^d$.

One kernel where they differ is the radial basis function (RBF) kernel: $k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|_2^2)$. The RBF kernel is translation invariant, so that $k(x_i + \Delta, x_j + \Delta) = k(x_i, x_j)$. Standard input space dropout does not maintain this invariance, since the effect of zeroing out an attribute depends on its original magnitude.
Dropout can be applied to both training and testing data. In fact, after learning the model, we can apply dropout noise to the test data and then perform classification on the corrupted input (or make the final classification by the ensemble result of classifying several noisy versions of the same input data). We address this issue later. In Appendix E, we derive the marginalized (expected) prediction function for dimension dropout in RBF kernels.
Ideally, we would like to find the dual solution for the kernelized version of Equation 5.2. Instead of the one-to-one correspondence of α_i's and x_i's, we need to index each α_i by the noise value as well. If we let $\alpha_i(\zeta)$ be the corresponding dual variable for the noisy sample $\tilde{x}_i = \zeta \circ x_i$ (or equivalently, the noisy feature $\tilde{f}(x_i) = \zeta \circ f(x_i)$), then Equation 5.9 turns into the following calculus-of-variations optimization problem:

\[
\begin{aligned}
\operatorname*{maximize}_{\alpha} \quad & \sum_i \mathbb{E}[\alpha_i(\zeta)] - \frac{1}{2}\sum_{i,j} y_i y_j\, \mathbb{E}\big[\alpha_i(\zeta)\alpha_j(\xi)\,\tilde{k}(x_i, x_j, \zeta, \xi)\big] \\
\text{subject to} \quad & \sum_i y_i\, \mathbb{E}_{\zeta}[\alpha_i(\zeta)] = 0, \quad 0 \leq \alpha_i(\zeta) \leq 1/\lambda \;\; \forall i, \zeta
\end{aligned}
\qquad \text{(Equation 5.11)}
\]

where ζ and ξ are drawn from the dropout noise distribution.
Proposition 2. After applying dropout in input or feature space to a valid kernel, the resulting matrix is a valid kernel.

Proof. We first define an augmented instance space $X'$ containing both the original attributes and the dropout noise, e.g., $x'_i = (x_i, \zeta)$ and $x'_j = (x_j, \xi)$. We then define a kernel $k'$ over this space by constructing an appropriate feature function $f'$. For input space dropout, let $f'(x'_i) = f(\zeta \circ x_i)$, and for feature space dropout, let $f'(x'_i) = \zeta \circ f(x_i)$. In both cases, it follows that $\tilde{k}(x_i, x_j, \zeta, \xi) = f'(x'_i)^T f'(x'_j) = k'(x'_i, x'_j)$.
The kernel from input dimension dropout is not guaranteed to be positive semidefinite (PSD). However, in practice we rarely observed non-PSD kernels; even the rare non-PSD kernels had negative eigenvalues of very small magnitude. We implemented on-the-fly kernels in LibSVM and solved the corresponding optimization programs with the SMO algorithm. In our experiments, the optimization always converged.
Example 5.1. We present an example in which dimension dropout can result in a non-PSD kernel. For a dataset with only two samples $\{x_1 = [1, 0]^T, x_2 = [0, 1]^T\}$, suppose dimension dropout generates the following noisy dataset: $\{\tilde{x}_1^1 = [1, \times]^T, \tilde{x}_1^2 = [\times, 0]^T, \tilde{x}_2^1 = [\times, 1]^T, \tilde{x}_2^2 = [0, \times]^T\}$, where '×' means that the corresponding dimension is dropped. Then the RBF dimension-dropout kernel will generate the following kernel matrix, which has negative eigenvalues for some values of γ:

\[
\begin{pmatrix}
1 & 1 & 1 & e^{-\gamma} \\
1 & 1 & e^{-\gamma} & 1 \\
1 & e^{-\gamma} & 1 & 1 \\
e^{-\gamma} & 1 & 1 & 1
\end{pmatrix}
\qquad \text{(Equation 5.12)}
\]
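The negative eigenvalue in Example 5.1 is easy to confirm numerically (our check, not from the dissertation):

```python
# The 4 x 4 kernel matrix of Equation 5.12 has eigenvalue
# exp(-gamma) - 1 < 0 for any gamma > 0, so it is not PSD.
import numpy as np

gamma = 1.0
a = np.exp(-gamma)
K = np.array([[1, 1, 1, a],
              [1, 1, a, 1],
              [1, a, 1, 1],
              [a, 1, 1, 1]], dtype=float)
print(np.linalg.eigvalsh(K))  # the smallest eigenvalue is negative
```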
In the rest of this section, we introduce several approximations and variations
of Equation 5.11, which we will evaluate in the experiments section.
5.3.2. Marginalized dropout in feature space
For many kernels, the explicit feature representation is extremely high-dimensional or even infinite-dimensional. Therefore, direct application of dropout marginalization is not practical. In this subsection, we introduce two methods for taking advantage of both the efficiency of kernel tricks and the accuracy improvement from dropout.
5.3.2.1. Kernel approximation
Although kernel methods have proven to be successful in predictive non-linear models, learning these models requires $O(n^2)$ memory as well as a long training time, and computing the decision function can be costly when the number of support vectors is large. These two issues make kernel methods less practical, especially on large datasets. Randomized algorithms for approximating kernel matrices (Schölkopf, 2002; Blum, 2006) have inspired several methods for efficiently converting the training and evaluation of kernel machines into linear weight learning and score prediction (Rahimi and Recht, 2007; Le et al., 2013). The basic idea behind these methods is to find a relatively low-dimensional feature representation $z(x)$ such that $z(x_i)^T z(x_j)$ approximates the desired kernel function, $k(x_i, x_j)$.
Besides the practical efficiency of such feature representation methods, we can take advantage of more complicated linear methods to improve the prediction. For example, this allows us to naturally apply the marginalized linear SVM method from Section 5.2 on $z(x)$ as the training features.
We have built on Rahimi and Recht's method (Rahimi and Recht, 2007) by focusing on the RBF kernel; however, it can be applied to any other translation-invariant kernel as well. Their method is based on Bochner's theorem (Rudin, 2011): "A continuous translation-invariant kernel $k(x_i, x_j) = k(x_i - x_j)$ on $\mathbb{R}^d$ is positive definite if and only if $k(\Delta)$ is the Fourier transform of a non-negative measure." By randomly sampling from the terms of this Fourier transform, we can approximate the kernel with some convergence guarantees. As a result, for the RBF kernel $k(x_i, x_j) = \exp(-\gamma\|x_i - x_j\|_2^2)$, we can randomly draw random frequencies $\{\beta_1, \ldots, \beta_D\}$ from a normal probability density function with mean zero and covariance $2\gamma d\, I$ ($d$ is the dimension of the input space and $I$ is the identity matrix), and draw random rotation angles $\{\alpha_1, \ldots, \alpha_D\}$ uniformly from $[0, 2\pi]$. Then

\[
z(x) = \sqrt{\tfrac{2}{D}}\,[\cos(\beta_1^T x + \alpha_1), \ldots, \cos(\beta_D^T x + \alpha_D)]^T
\]

will be the linear feature representation, such that $z(x_i)^T z(x_j) \approx \exp(-\gamma\|x_i - x_j\|_2^2)$.
Our experimental results show that applying the marginalized linear SVM on top of this feature representation outperforms an exact kernel SVM on the MNIST and Adult datasets.
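The feature map above can be sketched as follows (our illustration with assumed names; the frequency covariance $2\gamma d\, I$ is taken from the text as stated):

```python
# Random Fourier features approximating the RBF kernel.
import numpy as np

def rff_features(X, D, gamma, seed=0):
    """Map X (n x d) to z(X) (n x D) so that z(x_i) @ z(x_j) ~ k(x_i, x_j)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = rng.normal(scale=np.sqrt(2 * gamma * d), size=(d, D))  # frequencies
    alpha = rng.uniform(0.0, 2 * np.pi, size=D)                   # rotation angles
    return np.sqrt(2.0 / D) * np.cos(X @ beta + alpha)

# Usage: Z = rff_features(X_train, D=4000, gamma=0.01)
# then apply any linear learner (e.g., the marginalized SVM of Section 5.2) to Z.
```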
5.3.2.2. α-Regularization
Next, we consider dropout in the feature space of the original kernel feature mapping. Formally, $k(\tilde{x}_i, \tilde{x}_j) = \tilde{f}(x_i)^T \tilde{f}(x_j)$, where $\tilde{f}(x) = f(x) \circ \zeta$, $\mathbb{E}_\zeta[\tilde{f}(x)] = f(x)$ (i.e., $\mathbb{E}[\zeta] = \mathbf{1}$), and $\operatorname{var}(\zeta) = \sigma_\zeta^2 I$, where $I$ is the identity matrix.
We would prefer to optimize the marginalized dropout objective directly, as
done in the linear case. There are two key challenges. First, the dual formulation
in Equation 5.11 has an exponential number of variables, one for each possible
corruption of each training instance. Second, for infinite-dimensional feature
functions, the dropout noise will also be infinite-dimensional. We can solve both
problems by introducing a simple approximation: we constrain the value of αi (ζ)
to a constant αi for all ζ. In other words, all noisy copies of the same instance will
share the same weight in the SVM.
The following theorem shows how this simplification results in a tractable
approximation of Equation 5.11.
Theorem 7. When each $\alpha_i$ is constant, Equation 5.11 is equivalent to standard SVM learning (Equation 5.9) with a modified kernel $Q = K + \sigma_\zeta^2\, K \circ I$, where $K$ is the original kernel matrix and $\sigma_\zeta^2$ is the variance of each dimension of the dropout noise, $\zeta_l$.
The proof can be found in Appendix D.
This optimization problem can be viewed as adding a weighted L2 regularizer on the $\alpha_i$ weights:

\[
R(\alpha) = \frac{\sigma_\zeta^2}{2} \sum_i k(x_i, x_i)\, \alpha_i^2
\qquad \text{(Equation 5.13)}
\]
Therefore, we refer to this technique as α-regularization. In the dual program
of L2 -SVMs (where the squared value of hinge-loss is minimized) (Chang et al.,
2008), there is a similar regularization effect, but with constant coefficients.
Our proposed α-regularization method puts a different weight on each of the
dual variables in order to approximate the effect of dropout noise in the feature
representation.
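In practice, Theorem 7 makes α-regularization a one-line change to the Gram matrix; the sketch below is ours, and the precomputed-kernel SVM interface is an assumption about the toolkit, not part of the dissertation:

```python
# alpha-regularization: standard SVM learning with Q = K + sigma^2 * (K o I),
# i.e., the diagonal of the kernel matrix is inflated by the noise variance.
import numpy as np
from sklearn.svm import SVC

def alpha_reg_gram(K, sigma2):
    return K + sigma2 * np.diag(np.diag(K))

# Usage (assuming a precomputed n x n training Gram matrix K_train):
# Q = alpha_reg_gram(K_train, sigma2=0.25)
# clf = SVC(kernel="precomputed").fit(Q, y_train)
# Prediction would use the unmodified test-train kernel matrix.
```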
Note that all of the support vectors learned with α-regularization must
be instances in the original training data. For a linear kernel, this means that
the weight vector must lie within the span of the training data. Derezinski
and Warmuth (2014) present hardness results for predictors that use linear
combinations of instances, suggesting that this can be a serious disadvantage on
some problems. In contrast, neither Monte-Carlo dropout, nor the marginalized
linear SVM, nor the marginalized linear SVM on the approximated kernel has this
restriction.
5.3.3. Monte-Carlo dropout in input space and dimension
We can create a Monte-Carlo approximation of Equation 5.11 by replacing
the expectations over all dropout noise with K samples of dropout noise for each
training instance. This is equivalent to learning from several noisy copies of the
training data. For input space dropout, we can create several noisy replications of
the training data and apply standard SVM learning algorithms. This works because
input space dropout applies noise before computing the kernel.
For input dimension dropout, we need to keep track of the dropout noise explicitly and use it to modify the kernel computation. For example, consider the RBF kernel $k(x_i, x_j) = e^{-\gamma\|x_i - x_j\|^2}$. When applying dimension dropout, we need to modify the distance computation so that it only considers non-dropped-out dimensions. Let $d(x_i, x_j; \zeta, \xi)^2 = \sum_{l:\zeta_l \neq 0,\, \xi_l \neq 0} (\zeta_l x_{i,l} - \xi_l x_{j,l})^2$. Then $\tilde{k}_{\text{rbf}}(x_i, x_j, \zeta, \xi) = e^{-\gamma d(x_i, x_j; \zeta, \xi)^2}$. To implement this efficiently, we represent $x_i$ and $x_j$ as sparse vectors where all unspecified dimensions are dropped out and all non-dropped-out zeros are encoded explicitly. In the kernel computation, we iterate only over dimensions where both $x_i$ and $x_j$ have a defined value (which could be zero).
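The sparse-vector computation just described can be sketched as follows (our illustration; the dict encoding is one possible representation):

```python
# Dimension-dropout RBF kernel over sparse encodings: each instance is a
# dict {dimension: value} holding only its surviving (non-dropped-out)
# dimensions, with surviving zeros stored explicitly.
import math

def dropout_rbf(xi, xj, gamma):
    shared = xi.keys() & xj.keys()           # dims defined in both instances
    d2 = sum((xi[l] - xj[l]) ** 2 for l in shared)
    return math.exp(-gamma * d2)

# Example: dimension 1 is dropped from the second instance, so only
# dimension 0 enters the distance computation.
print(dropout_rbf({0: 1.0, 1: 0.0}, {0: 0.5}, gamma=1.0))
```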
The key advantage of input dimension dropout is that it maintains the
translation-invariance property of RBF kernels. The key disadvantage is that
the resulting kernel matrix may be non-PSD. In our experiments, we found that
input dimension dropout outperforms the ordinary RBF kernel. Furthermore, the
negative eigenvalues of this kernel were usually very small in magnitude and did
not cause any practical problems for the sequential minimal optimization (SMO)
algorithm. If necessary, techniques for stabilizing the optimization of non-PSD
kernels could be applied here as well (Lin and Lin, 2003).
For RBF kernels, a model learned with dropout may work poorly on non-noisy instances. We apply two different approaches in our experiments. The first is to ignore this difference and apply the model directly; for small dropout probabilities (5%-10%), the additional bias should be small. The second approach is to compute the expected kernel function over all possible dropout noise. Since each dropout probability is independent, this can be done in linear time (the proof is provided in Appendix E). We refer to this latter approach as the "corrected" prediction. In both cases, dropout noise is applied by removing random features without rescaling the remaining features; the rescaling correction (1/(1−δ)) is designed for linear models and causes problems when support vectors and test instances are scaled differently.
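Using the closed form derived in Appendix E, the corrected prediction can be sketched as follows (our illustration, with assumed names):

```python
# "Corrected" prediction: marginalize the test-time dropout noise using the
# expected kernel from Appendix E,
#   E[k~] = prod over surviving dims l of (delta + (1 - delta) * exp(-gamma*(x_l - x'_l)^2)).
import numpy as np

def expected_dropout_rbf(sv, sv_dims, x_test, gamma, delta):
    """sv, x_test: dense arrays; sv_dims: surviving dims of the support vector."""
    d2 = (sv[sv_dims] - x_test[sv_dims]) ** 2
    return np.prod(delta + (1 - delta) * np.exp(-gamma * d2))

def corrected_predict(svs, dims_list, alphas, ys, x_test, gamma, delta):
    score = sum(y * a * expected_dropout_rbf(sv, dims, x_test, gamma, delta)
                for sv, dims, a, y in zip(svs, dims_list, alphas, ys))
    return np.sign(score)
```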
5.4. Empirical results
Datasets. We ran our experiments on several text classification datasets,
the MNIST digit classification dataset (LeCun et al., 1998), and the Adult dataset
from the UCI repository. The text datasets were two sentiment analysis datasets
(PolarityV2, Subj) introduced by Pang and Lee (2004), three datasets based on
20-newsgroups (AthR, BpCrypt, XGraph) previously used by Wang and Manning
(2012), and four Amazon sentiment datasets (Books, Kitchen, DVD, Electronics).
We also constructed an artificial dataset called M27 from MNIST. In M27, we have
selected all 2 and 7 digits from MNIST. For each digit, we randomly selected two
integer numbers i ∈ [1, 390] and j ∈ [391, 784], then set all pixels that correspond
to the indices from i to j of the vectorized 784-dimensional digit image to zero. We
repeat this for the training, the tuning, and the testing data.
Methods. On the text datasets, our main point is to show the comparative performance of different linear SVM-based methods: we compare the marginalized SVM (SVM-Marg), Monte-Carlo dropout (SVM-MC), and α-regularization (α-Reg), all with linear kernels (Table 5.1). For nonlinear kernels we focus on radial basis functions (RBF); we compare different linear methods that use the random Fourier bases as feature representation with the exact regular SVM and our proposed α-regularization method (Table 5.2).
Experimental Setup. For the text classification experiments, we used five-fold cross-validation. For the Monte-Carlo methods, we generate K copies of the training data and apply dropout noise to each sample independently. Learning with noisy replications of the training data approximates minimizing the expected loss when the noise elements are randomly drawn from their respective distribution. All hyper-parameters are selected by cross-validation. For the approximated kernel experiments in Table 5.2, we set the dimension of the Fourier bases to D = 4000 for the MNIST and M27 datasets, and D = 1500 for the Adult dataset; we tuned all other hyper-parameters using held-out data and then re-trained the final model on both the training and tuning data.

Nonlinear models (without linear approximation) are more sensitive to hyper-parameters. Because of this, we also tuned σ_ζ² for the nonlinear-kernel α-regularization method, as well as the ℓ2 regularization coefficient λ and the RBF kernel parameter γ for all methods. On the other hand, the tuning procedure usually selected larger dropout probabilities for linear models (both linear and linear approximations of RBF).
Results. Table 5.1 shows the error percentage of each linear classifier on each of the text datasets. The best-performing variant is shown in bold. Marginalized dropout outperforms all other methods, except on one dataset, where α-regularization outperforms SVM-Marg. Monte-Carlo dropout training led to improved results on all datasets. α-Reg led to slight improvements on seven of nine datasets but usually worked worse than SVM-Marg, suggesting that marginalization in the primal is more effective when applicable. We also compared our methods with logistic regression (LR), LR with Monte-Carlo dropout, and LR with (marginalized) deterministic dropout (Wang and Manning, 2013). The results show a consistent trend: whenever the SVM itself outperforms LR, the SVM-based dropout methods also outperform the LR-based dropout methods, and vice versa.
TABLE 5.1. Classification error (%) of linear classifiers on text datasets. The last column is the percentage decrease in prediction error for the best method (mostly SVM-Marg) vs. SVM.

Dataset   SVM     SVM-MC   α-Reg   SVM-Marg   Err.Dec.(%)
AthR      7.16    8.98     6.74    6.88       5.87 (SVM-Marg: 3.91)
BpCrypt   2.22    1.52     2.22    1.21       45.49
Polar2    19.70   18.90    18.70   16.10      18.27
Subj      12.96   11.76    13.00   11.12      14.20
XGraph    9.33    8.51     9.02    7.48       19.83
Books     17.43   16.95    17.35   13.34      23.46
Kitchen   12.31   11.61    11.91   10.61      13.74
DVD       17.55   16.84    17.61   15.43      12.08
Elect.    14.01   13.82    13.91   11.88      15.06
In Table 5.2, we compare several linear weight-learning algorithms that use Fourier basis features to approximate the RBF kernel with the ordinary RBF kernel on three datasets. We observe that the simple least-squares method (LS+Fourier) (Rahimi and Recht, 2007) outperforms the exact RBF kernel on two datasets by itself. However, when combined with the marginalized SVM, it outperforms all other methods.

We also observe that α-regularization always outperforms the regular kernel SVM on these datasets. However, it does not work nearly as well as the marginalized SVM on the Fourier basis. We attribute this difference to the fact that α-regularization is constrained to using only support vectors from the original dataset, unlike the marginalized SVM.

The highest gain is achieved on M27, where training and testing are performed on samples with large missing portions. This suggests that dropout might also be useful for learning with missing data in non-linear models. (Dekel et al. (2010) have directly addressed this issue for linear models using a relaxation-based technique.)
TABLE 5.2. Classification error (%) of the approximated RBF kernel.

Dataset   RBF-SVM (Exact)   LS (Fourier)   α-Reg   Lin.SVM (Fourier)   Marg.SVM (Fourier)   Err.Dec.(%)
MNIST     1.43              2.41           1.41    1.48                1.37                 4.38
M27       6.31              5.97           6.05    5.66                4.93                 27.99
Adult     15.1              14.9           14.97   14.93               14.84                1.75
Stacked jittered features are known to improve the prediction accuracy of kernel SVMs on MNIST (Decoste and Schölkopf, 2002). The jittered pixels depend on the geometric location of non-zero pixels. Unlike stacked jittered features, dropout works equally well with any fixed permutation of the pixels (regardless of the geometric shape of the digits); therefore, dropout training is different from learning with additional virtual features. Both methods can be applied simultaneously, but in this work we measure only the improvement achievable by applying dropout noise.
TABLE 5.3. Classification error (%) for different-size subsets of MNIST, comparing the no-dropout standard RBF kernel (No-DO) to Monte-Carlo dimension dropout (DO) and α-Reg. The last row is the percentage decrease in prediction error for no dropout vs. the best dropout method.

Training size   1000   2500   5000   10000   25000   50000   60000
No-DO           6.84   4.70   3.54   2.85    2.10    1.53    1.43
α-Reg           6.85   4.69   3.57   3.05    2.07    1.49    1.41
DO              6.44   4.26   3.34   2.65    1.95    1.50    1.40
Err.Dec.(%)     5.85   9.36   5.65   7.02    7.14    4.58    2.80
For dimension dropout, we can efficiently marginalize the dropout effect on the kernel at prediction time; the marginalized prediction function is derived in Appendix E. Table 5.3 shows results for variants of RBF SVMs on MNIST. We vary the training size from 1000 to 60,000 instances. On average, dropout shows small but consistent improvements over no dropout across training sizes, with a 5.77% average reduction in error. For smaller numbers of samples, prediction with the expected kernel performs better than all other methods. α-Reg was slightly more accurate than no dropout for larger training sets. For the Monte-Carlo methods, we used 100 noisy replications of the training data for the linear methods. For the kernel methods, we used 10 noisy replications for training subsets of sizes 1000, 2500, 5000, and 10000; 6 replications for a sample size of 25000; and 3 replications for sample sizes 50000 and 60000. Larger numbers of replications become increasingly expensive, due to the increased number of support vectors and larger kernel matrices. In experiments with 3 replications of 60,000 examples, the accuracy increased to 98.61%, which suggests that more replications could yield higher prediction accuracy.
5.5. Conclusion
While previous results (Hinton et al., 2012; Maaten et al., 2013; Wager
et al., 2013; Wang et al., 2013; Wang and Manning, 2013) show that learning with
dropout noise can improve the accuracy of neural networks and logistic regression,
our work confirms that dropout training can improve the prediction accuracy of
SVMs as well.
In this chapter, we introduced two new methods that take advantage of
dropout learning without actually drawing samples from a noise distribution. These
methods marginalize the effect of dropout in the primal (Marginalized SVM) and
dual formulations (α-Regularization) of the SVM optimization program. Both of
these methods are simple and easy to implement. The experimental results show
that these methods often outperform ordinary SVMs.
These results are the first to use dropout to improve the performance of SVMs with non-linear kernels. We presented two types of dropout with kernels and experimentally showed their effectiveness. We showed that randomized kernel approximation may be used along with marginalized dropout in the primal to improve both the performance and the efficiency of kernel machines.
CHAPTER VI
CONCLUSION AND FUTURE DIRECTIONS
This thesis presents novel convex optimization algorithms for learning robust large margin models. Our methods rely on formulating machine learning problems as mathematical optimization programs that can be efficiently solved. In all of our contributions, we started from a conceptual formulation of the problem and converted it to a manageable convex problem, which can be solved by off-the-shelf convex optimization methods.
6.1. Summary of contributions
– Convex adversarial collective classification. Our method robustly performs collective classification in the presence of an adversary. The formulation is a convex quadratic program that can be efficiently solved. This solution improved the performance of collective classification even when there was no adversarial component in the test data. Our method consistently outperforms both non-adversarial and non-relational baselines.
– Equivalence of adversarial robustness and regularization. Our method takes advantage of the adversary's weakness and converts that weakness into strength. For each adversary that is capable of altering the feature space, we can derive specific regularization functions that immunize the machine learning algorithm against that type of adversary. Since the method only adds extra convex regularization functions to the objective of the original optimization program, little computational overhead is added; the problem can be optimized in the same order of complexity as the non-robust optimization program.
– Robustness of large margin methods through dropout regularization. Average adversaries do not have enough information about the underlying machine learning system, and they do not have ample computational resources to calculate an optimal attack. As a result, they resort to frequent random attacks, hoping that some of the random changes in the input data finally trick the machine learning algorithm. In order to be robust against such adversaries, we can minimize the expected loss function when data is randomly changing. Dropout training is a great match for such circumstances. We derive the regularization effect of marginalized dropout on linear and non-linear SVMs. Our derivation is simple, and the resulting objective is convex. Experimentally, we show that our method is efficient and that it almost always outperforms regular SVMs.
6.2. Future directions
The ideal goal is to design a global recipe for robustness that applies to most machine learning algorithms; however, the vulnerability of only a few machine learning algorithms has been studied in depth, and many algorithms remain unexplored.
6.2.1. Improving Adversarial Machine Learning
The robustness of many machine learning algorithms has not yet been studied in depth. As we suggest in Algorithm 2, a range of combinations of explicit adversarial and chance-based adverse situations can be studied together. Some other future directions in adversarial machine learning are:

– Scaling up current methods. Scaling up adversarial methods to large datasets remains an open issue. A promising direction is using online algorithms, which have been shown to be successful in other fields of machine learning.
– Learning utility functions. If we can approximate the opponent's utility, then we will have a more realistic model of the adversarial game. In addition, we will be able to use decision-theoretic approaches to model non-zero-sum games. Solving non-zero-sum games in adversarial settings is another important issue that needs to be addressed.
– Efficient use of knowledge about the opponent. We have shown that by taking advantage of the adversary's limitations, we can design more robust algorithms; yet there are still many open questions about how to translate raw knowledge about the adversary into useful parameters in the learning algorithm.

All of these items apply to both structured and non-structured output prediction.
6.2.2. Expansion of Existing Work to Structural Settings
Many methods in adversarial machine learning are designed for specific problems. With the right abstraction, these methods can be generalized to the wider class of structured output prediction. Good examples of such methods are regret minimization algorithms; these methods are based on elegant mathematical foundations, and they are designed to be robust against adversarial noise. There are only a couple of papers that use regret minimization algorithms for structured output prediction. An important feature of regret minimization algorithms is that they are mostly based on scalable online algorithms, which makes them great candidates for scaling up existing structured prediction algorithms.
On the other hand, regret minimization algorithms can also benefit from work already done in the field of adversarial machine learning. Current regret minimization algorithms assume that the adversary is completely arbitrary.¹ A potential improvement to regret minimization algorithms can be gained by restricting the adversary in a more realistic and practical way.

In this thesis, we derived a formulation for robustness through dropout regularization in ordinary SVMs. This method can be expanded to apply to structured prediction problems as well. Due to the hardness of the optimization problems of structured learning, this expansion needs more research and is not trivial. However, our promising results on ordinary SVMs suggest that marginalized dropout should improve structured prediction as well.

¹ Although there are some simple versions of bounded adversaries, mostly from the reinforcement learning community, the possible restrictions of the adversary are not studied as comprehensively as they are in adversarial machine learning.
APPENDIX A
INTEGRALITY OF THE ADVERSARIAL SOLUTION IN CONVEX
ADVERSARIAL COLLECTIVE CLASSIFICATION
Lemma 7. For $K = 2$, any fixed $j$, $0 \leq x_{ij}, y_i^k \leq 1$, $\hat{y}_i^k \in \{0, 1\}$, $\sum_k y_i^k = 1$, and $\sum_k \hat{y}_i^k = 1$, if $A_j^k = \sum_{i=1}^{N} \min(x_{ij}, y_i^k) - x_{ij}\hat{y}_i^k$, then $\sum_{k=1}^{K} A_j^k \geq 0$.

Proof. $A_j^1 + A_j^2 = \sum_{i=1}^{N} \min(x_{ij}, y_i^1) - x_{ij}\hat{y}_i^1 + \min(x_{ij}, y_i^2) - x_{ij}\hat{y}_i^2$. Since $y_i^1 + y_i^2 = 1$ and $\hat{y}_i^1 + \hat{y}_i^2 = 1$, we can rewrite it as $\sum_{i=1}^{N} \min(x_{ij}, y_i^1) - x_{ij}(\hat{y}_i^1 + \hat{y}_i^2) + \min(x_{ij}, 1 - y_i^1) = \sum_{i=1}^{N} \min(x_{ij}, y_i^1) + \min(x_{ij}, 1 - y_i^1) - x_{ij}$. Now three cases can happen:
(a) If $x_{ij} \geq \max(y_i^1, 1 - y_i^1)$, then $\min(x_{ij}, y_i^1) + \min(x_{ij}, 1 - y_i^1) - x_{ij} = y_i^1 + 1 - y_i^1 - x_{ij} = 1 - x_{ij} \geq 0$.

(b) If $\min(y_i^1, 1 - y_i^1) \leq x_{ij} \leq \max(y_i^1, 1 - y_i^1)$, then $\min(x_{ij}, \min(y_i^1, 1 - y_i^1)) + \min(x_{ij}, \max(y_i^1, 1 - y_i^1)) - x_{ij} = \min(x_{ij}, \min(y_i^1, 1 - y_i^1)) + x_{ij} - x_{ij} = \min(x_{ij}, y_i^1, 1 - y_i^1) \geq 0$.

(c) If $x_{ij} \leq \min(y_i^1, 1 - y_i^1)$, then $\min(x_{ij}, y_i^1) + \min(x_{ij}, 1 - y_i^1) - x_{ij} = x_{ij} + x_{ij} - x_{ij} = x_{ij} \geq 0$.
Therefore $\min(x_{ij}, y_i^1) + \min(x_{ij}, y_i^2) - x_{ij}$ is always nonnegative, and consequently $A_j^1 + A_j^2 = \sum_{i=1}^{N} \min(x_{ij}, y_i^1) - x_{ij}\hat{y}_i^1 + \min(x_{ij}, y_i^2) - x_{ij}\hat{y}_i^2$ is always nonnegative.
Lemma 8. For $K = 2$, the optimal solution $W^*$ of the final quadratic program satisfies the following property: $\min(w_j^1, w_j^2) = 0$ for all $j = 1, \ldots, m$.
Proof. Let $\theta_j = \min(w_j^1, w_j^2)$, and define $u_j^1 = w_j^1 - \theta_j$ and $u_j^2 = w_j^2 - \theta_j$. By substitution, the objective of the constraint's linear program becomes:

\[
\sum_{i,j,k} (u_j^k + \theta_j) z_{ij}^k - (u_j^k + \theta_j) x_{ij}\hat{y}_i^k + \underbrace{\sum_{(i,j)\in E,k} w_e^k y_{ij}^k - \sum_{i,k} y_i^k \hat{y}_i^k + \sum_{i,j} \delta_{ij}(1 - 2\hat{x}_{ij})x_{ij}}_{B}
\]
\[
= \sum_j \sum_i \Big[ u_j^1 z_{ij}^1 - u_j^1 x_{ij}\hat{y}_i^1 + u_j^2 z_{ij}^2 - u_j^2 x_{ij}\hat{y}_i^2 + \theta_j \big(z_{ij}^1 - x_{ij}\hat{y}_i^1 + z_{ij}^2 - x_{ij}\hat{y}_i^2\big) \Big] + B
= \sum_j \Big[ \sum_i F_{ij} + \theta_j \underbrace{\sum_i H_{ij}}_{\geq 0} \Big] + B
\]

in which $F_{ij}$ and $H_{ij}$ are:

\[
F_{ij} = u_j^1 z_{ij}^1 - u_j^1 x_{ij}\hat{y}_i^1 + u_j^2 z_{ij}^2 - u_j^2 x_{ij}\hat{y}_i^2, \qquad
H_{ij} = z_{ij}^1 - x_{ij}\hat{y}_i^1 + z_{ij}^2 - x_{ij}\hat{y}_i^2
\]

According to Lemma 7, $\sum_i (z_{ij}^1 - x_{ij}\hat{y}_i^1 + z_{ij}^2 - x_{ij}\hat{y}_i^2) \geq 0$; therefore the coefficient of each $\theta_j$ is non-negative. Since $\theta_j = \min(w_j^1, w_j^2) \geq 0$:

i. If the optimization algorithm chooses a smaller value for $\theta_j$, the relaxed inequality constraint will not be violated, and a smaller $\theta_j$ will not imply a larger $\xi$.

ii. A smaller $\theta_j$ will directly reduce the objective value.

Therefore, the optimization algorithm chooses the smallest possible $\theta_j$, which is $\theta_j = 0$ for all $j$. So $\min(w_j^1, w_j^2) = 0$, or equivalently $w_j^1 w_j^2 = 0$ for all $j = 1, \ldots, m$.
Theorem 8. The adversary's problem in Equation 3.4 has an integral solution for both X and Y.
Proof. According to Lemma 8, we know that $\min(w_j^1, w_j^2) = 0$ for all $j$. So we can rewrite Equation 3.4 as:

\[
\max_{y \in Y',\, 0 \leq x \leq 1} \;\; \sum_{i,j} D_{ij} + \sum_{(i,j)\in E,k} w_e^k y_{ij}^k - \sum_{i,k} y_i^k \hat{y}_i^k + \sum_{i,j} \delta_{ij}(1 - 2\hat{x}_{ij})x_{ij}
\qquad \text{(Equation A.1)}
\]

where $D_{ij} = w_j^1 z_{ij}^1 - w_j^1 x_{ij}\hat{y}_i^1 + w_j^2 z_{ij}^2 - w_j^2 x_{ij}\hat{y}_i^2$. Here we assume that either $w_j^1$ or $w_j^2$ is non-zero, because this is the interesting case; otherwise the proof is trivial. Therefore, since either $w_j^1$ or $w_j^2$ is zero, we have:

\[
\begin{aligned}
D_{ij} ={}& w_j^1 \min(x_{ij}, y_i^1) - w_j^1 x_{ij}\hat{y}_i^1 + w_j^2 \min(x_{ij}, y_i^2) - w_j^2 x_{ij}\hat{y}_i^2 \\
={}& I(w_j^1 = 0)\big(w_j^1 \min(1 - x_{ij}, y_i^1) - w_j^1 (1 - x_{ij})\hat{y}_i^1 + w_j^2 \min(x_{ij}, y_i^2) - w_j^2 x_{ij}\hat{y}_i^2\big) \\
&+ I(w_j^2 = 0)\big(w_j^1 \min(x_{ij}, y_i^1) - w_j^1 x_{ij}\hat{y}_i^1 + w_j^2 \min(1 - x_{ij}, y_i^2) - w_j^2 (1 - x_{ij})\hat{y}_i^2\big)
\end{aligned}
\]

Let $v_{ij}^k = x_{ij}\, I(w_j^k > 0) + (1 - x_{ij})\, I(w_j^k = 0)$, where $I(\cdot)$ is the indicator function; then:

\[
\begin{aligned}
D_{ij} ={}& \big(I(w_j^1 = 0) + I(w_j^2 = 0)\big)\big(w_j^1 \min(v_{ij}^1, y_i^1) - w_j^1 v_{ij}^1 \hat{y}_i^1 + w_j^2 \min(v_{ij}^2, y_i^2) - w_j^2 v_{ij}^2 \hat{y}_i^2\big) \\
={}& w_j^1 \min(v_{ij}^1, y_i^1) - w_j^1 v_{ij}^1 \hat{y}_i^1 + w_j^2 \min(v_{ij}^2, y_i^2) - w_j^2 v_{ij}^2 \hat{y}_i^2
\end{aligned}
\qquad \text{(Equation A.2)}
\]

Clearly $v_{ij}^1 + v_{ij}^2 = 1$, because:

\[
v_{ij}^1 + v_{ij}^2 = x_{ij}\underbrace{\big(I(w_j^1 > 0) + I(w_j^2 > 0)\big)}_{=1} + (1 - x_{ij})\underbrace{\big(I(w_j^1 = 0) + I(w_j^2 = 0)\big)}_{=1} = x_{ij} + 1 - x_{ij} = 1.
\]

As a result, we will have $z_{ij}^k = \min(v_{ij}^k, y_i^k)$, because otherwise increasing $z_{ij}^k$ could increase the objective, so the solver will choose the maximum possible value for $z_{ij}^k$. By Lemma 9 and the reformulation of $D_{ij}$ in Equation A.2, we conclude that Equation A.1 has an integral solution for $y_i^k$ and $v_{ij}^k$ for all $i$, $j$ and $k = 1, 2$. Since integrality of $v_{ij}^k$ implies integrality of $x_{ij}$, the proof is complete.
Lemma 9. If $K = 2$, for any $W = [W^1, W^2]$ with $W^k = [w_1^k, \ldots, w_m^k]^T$, the linear program in Equation A.1 has an integral solution.
Proof. Our argument is similar to the proof of Theorem 1 of Taskar et al. (2004a). We show that for any fractional solution $X$ (and respectively $V$) and $Y$ of Equation A.1, we can construct a new feasible integral assignment $X'$ and $Y'$ that increases the objective or does not change it.

Since all $w_e^k$'s and $w_j^k$'s are positive, $y_{ij}^k = \min(y_i^k, y_j^k)$ and $z_{ij}^k = \min(y_i^k, x_{ij})$; this means that the slack variables corresponding to $z_{ij}^k \leq y_i^k$, $z_{ij}^k \leq x_{ij}$ and $y_{ij}^k \leq y_i^k$, $y_{ij}^k \leq y_j^k$ are zero, because otherwise the objective could be increased by increasing $y_{ij}^k$ or $z_{ij}^k$.

Let $\lambda^k = \min\big(\min_{i:\, y_i^k > 0} y_i^k,\; \min_{ij:\, v_{ij}^k > 0} v_{ij}^k\big)$ and $\lambda = \lambda^1$ or $\lambda = -\lambda^2$. We propose a new construction of the solution that either increases the objective or does not change it, and at the same time reduces the number of fractional values in the solution:

\[
\begin{aligned}
v_{ij}^{1\prime} &= v_{ij}^1 - \lambda I(0 < v_{ij}^1 < 1), & v_{ij}^{2\prime} &= v_{ij}^2 + \lambda I(0 < v_{ij}^2 < 1) \\
z_{ij}^{1\prime} &= z_{ij}^1 - \lambda I(0 < z_{ij}^1 < 1), & z_{ij}^{2\prime} &= z_{ij}^2 + \lambda I(0 < z_{ij}^2 < 1) \\
y_i^{1\prime} &= y_i^1 - \lambda I(0 < y_i^1 < 1), & y_i^{2\prime} &= y_i^2 + \lambda I(0 < y_i^2 < 1) \\
y_{ij}^{1\prime} &= y_{ij}^1 - \lambda I(0 < y_{ij}^1 < 1), & y_{ij}^{2\prime} &= y_{ij}^2 + \lambda I(0 < y_{ij}^2 < 1)
\end{aligned}
\]
It is obvious that with this update at least two of the fractional values become integral. First, we show that with this new construction the values remain feasible. We need to show that $v_{ij}^{1\prime} + v_{ij}^{2\prime} = 1$, $y_i^{1\prime} + y_i^{2\prime} = 1$, $v_{ij}^{k\prime} \geq 0$, $y_i^{k\prime} \geq 0$, $y_{ij}^{k\prime} = \min(y_i^{k\prime}, y_j^{k\prime})$, and $z_{ij}^{k\prime} = \min(v_{ij}^{k\prime}, y_i^{k\prime})$. In the following we show that all of the feasibility requirements are satisfied.

\[
v_{ij}^{1\prime} + v_{ij}^{2\prime} = v_{ij}^1 - \lambda I(0 < v_{ij}^1 < 1) + v_{ij}^2 + \lambda I(0 < v_{ij}^2 < 1) = v_{ij}^1 + v_{ij}^2 = 1.
\]
\[
y_i^{1\prime} + y_i^{2\prime} = y_i^1 - \lambda I(0 < y_i^1 < 1) + y_i^2 + \lambda I(0 < y_i^2 < 1) = y_i^1 + y_i^2 = 1.
\]

Above we used the fact that if $v_{ij}^1$ is fractional, then $v_{ij}^2$ is also fractional, and similarly if $y_i^1$ is fractional then $y_i^2$ is also fractional, since $v_{ij}^1 + v_{ij}^2 = 1$ and $y_i^1 + y_i^2 = 1$. To show $v_{ij}^{k\prime} \geq 0$ and $y_i^{k\prime} \geq 0$, we prove that $\min_{ij} v_{ij}^{k\prime} \geq 0$ and $\min_i y_i^{k\prime} \geq 0$.
\[
\begin{aligned}
\min_{ij} v_{ij}^{k\prime} &= \min_{ij}\Big(v_{ij}^k - \min\big(\min_{i:\, y_i^k > 0} y_i^k,\, \min_{ij:\, v_{ij}^k > 0} v_{ij}^k\big)\, I(0 < v_{ij}^k < 1)\Big) \\
&= \min\Big(\min_{ij} v_{ij}^k,\; \min_{ij}\big[v_{ij}^k - \min\big(\min_{i:\, y_i^k > 0} y_i^k,\, \min_{ij:\, v_{ij}^k > 0} v_{ij}^k\big)\big]\Big) \\
&\geq \min\Big(\min_{ij} v_{ij}^k,\; \min_{ij}\big[v_{ij}^k - \min_{ij:\, v_{ij}^k > 0} v_{ij}^k\big]\Big)
\geq \min_{ij:\, v_{ij}^k > 0} v_{ij}^k - \min_{ij:\, v_{ij}^k > 0} v_{ij}^k = 0.
\end{aligned}
\]
\[
\begin{aligned}
\min_i y_i^{k\prime} &= \min_i\Big(y_i^k - \min\big(\min_{i:\, y_i^k > 0} y_i^k,\, \min_{ij:\, v_{ij}^k > 0} v_{ij}^k\big)\, I(0 < y_i^k < 1)\Big) \\
&= \min\Big(\min_i y_i^k,\; \min_i\big[y_i^k - \min\big(\min_{i:\, y_i^k > 0} y_i^k,\, \min_{ij:\, v_{ij}^k > 0} v_{ij}^k\big)\big]\Big) \\
&\geq \min\Big(\min_i y_i^k,\; \min_i\big[y_i^k - \min_{i:\, y_i^k > 0} y_i^k\big]\Big)
\geq \min_{i:\, y_i^k > 0} y_i^k - \min_{i:\, y_i^k > 0} y_i^k = 0.
\end{aligned}
\]
The last step in showing that the proposed construction is feasible is showing that $y_{ij}^{k\prime} = \min(y_i^{k\prime}, y_j^{k\prime})$ and $z_{ij}^{k\prime} = \min(v_{ij}^{k\prime}, y_i^{k\prime})$:

\[
y_{ij}^{1\prime} = y_{ij}^1 - \lambda I(0 < y_{ij}^1 < 1)
= \min(y_i^1, y_j^1) - \lambda I(0 < \min(y_i^1, y_j^1) < 1)
= \min\big(y_i^1 - \lambda I(0 < y_i^1 < 1),\; y_j^1 - \lambda I(0 < y_j^1 < 1)\big)
= \min(y_i^{1\prime}, y_j^{1\prime}).
\]
\[
y_{ij}^{2\prime} = y_{ij}^2 + \lambda I(0 < y_{ij}^2 < 1)
= \min(y_i^2, y_j^2) + \lambda I(0 < \min(y_i^2, y_j^2) < 1)
= \min\big(y_i^2 + \lambda I(0 < y_i^2 < 1),\; y_j^2 + \lambda I(0 < y_j^2 < 1)\big)
= \min(y_i^{2\prime}, y_j^{2\prime}).
\]
\[
z_{ij}^{1\prime} = z_{ij}^1 - \lambda I(0 < z_{ij}^1 < 1)
= \min(v_{ij}^1, y_i^1) - \lambda I(0 < \min(v_{ij}^1, y_i^1) < 1)
= \min\big(v_{ij}^1 - \lambda I(0 < v_{ij}^1 < 1),\; y_i^1 - \lambda I(0 < y_i^1 < 1)\big)
= \min(v_{ij}^{1\prime}, y_i^{1\prime}).
\]
\[
z_{ij}^{2\prime} = z_{ij}^2 + \lambda I(0 < z_{ij}^2 < 1)
= \min(v_{ij}^2, y_i^2) + \lambda I(0 < \min(v_{ij}^2, y_i^2) < 1)
= \min\big(v_{ij}^2 + \lambda I(0 < v_{ij}^2 < 1),\; y_i^2 + \lambda I(0 < y_i^2 < 1)\big)
= \min(v_{ij}^{2\prime}, y_i^{2\prime}).
\]
So far we have shown that the new variable construction is feasible; it remains to show that we can increase the objective. We substitute the newly constructed feasible values into Equation A.1 and subtract from it the objective with the unchanged values. Then we show that with a proper choice of $\lambda = \lambda^1$ or $\lambda = -\lambda^2$, we can improve the objective.
\[
\begin{aligned}
V_{\text{old}} ={}& \sum_{i,j} D_{ij} + \sum_{(i,j)\in E,k} w_e^k y_{ij}^k - \sum_{i,k} y_i^k \hat{y}_i^k + \sum_{i,j} \delta_{ij}(1 - 2\hat{x}_{ij})\, x_{ij} \\
={}& \sum_{i,j} \big(w_j^1 z_{ij}^1 - w_j^1 v_{ij}^1 \hat{y}_i^1 + w_j^2 z_{ij}^2 - w_j^2 v_{ij}^2 \hat{y}_i^2\big)
+ \sum_{(i,j)\in E,k} w_e^k y_{ij}^k - \sum_{i,k} y_i^k \hat{y}_i^k \\
&+ \sum_{i,j} \delta_{ij}(1 - 2\hat{x}_{ij}) \Big[\big(I(w_j^1 > 0) - I(w_j^1 = 0)\big) v_{ij}^1 + I(w_j^1 = 0)\Big] \\
={}& \sum_{i,j} \big(w_j^1 z_{ij}^1 - w_j^1 v_{ij}^1 \hat{y}_i^1 + w_j^2 z_{ij}^2 - w_j^2 v_{ij}^2 \hat{y}_i^2\big)
+ \sum_{(i,j)\in E,k} w_e^k y_{ij}^k - \sum_{i,k} y_i^k \hat{y}_i^k \\
&+ \sum_{i,j} \delta_{ij}(1 - 2\hat{x}_{ij}) \big(I(w_j^1 > 0) - I(w_j^1 = 0)\big) v_{ij}^1 + C.
\end{aligned}
\]

Above we have used the fact that $x_{ij} = I(w_j^k > 0)\, v_{ij}^k + I(w_j^k = 0)(1 - v_{ij}^k) = I(w_j^1 > 0)\, v_{ij}^1 + I(w_j^1 = 0)(1 - v_{ij}^1) = \big(I(w_j^1 > 0) - I(w_j^1 = 0)\big) v_{ij}^1 + I(w_j^1 = 0)$.
\[
\begin{aligned}
V_{\text{new}} ={}& \sum_{i,j} \big(w_j^1 z_{ij}^{1\prime} - w_j^1 v_{ij}^{1\prime} \hat{y}_i^1 + w_j^2 z_{ij}^{2\prime} - w_j^2 v_{ij}^{2\prime} \hat{y}_i^2\big)
+ \sum_{(i,j)\in E,k} w_e^k y_{ij}^{k\prime} - \sum_{i,k} y_i^{k\prime} \hat{y}_i^k \\
&+ \sum_{i,j} \delta_{ij}(1 - 2\hat{x}_{ij})\big(I(w_j^1 > 0) - I(w_j^1 = 0)\big) v_{ij}^{1\prime} + C \\
={}& \sum_{i,j} \Big[w_j^1\big(z_{ij}^1 - \lambda I(0 < z_{ij}^1 < 1)\big) - w_j^1 \hat{y}_i^1\big(v_{ij}^1 - \lambda I(0 < v_{ij}^1 < 1)\big) \\
&\qquad + w_j^2\big(z_{ij}^2 + \lambda I(0 < z_{ij}^2 < 1)\big) - w_j^2 \hat{y}_i^2\big(v_{ij}^2 + \lambda I(0 < v_{ij}^2 < 1)\big)\Big] \\
&+ \sum_{(i,j)\in E} \Big[w_e^1\big(y_{ij}^1 - \lambda I(0 < y_{ij}^1 < 1)\big) + w_e^2\big(y_{ij}^2 + \lambda I(0 < y_{ij}^2 < 1)\big)\Big] \\
&- \sum_i \Big[\hat{y}_i^1\big(y_i^1 - \lambda I(0 < y_i^1 < 1)\big) + \hat{y}_i^2\big(y_i^2 + \lambda I(0 < y_i^2 < 1)\big)\Big] \\
&+ \sum_{i,j} \delta_{ij}(1 - 2\hat{x}_{ij})\big(I(w_j^1 > 0) - I(w_j^1 = 0)\big)\big(v_{ij}^1 - \lambda I(0 < v_{ij}^1 < 1)\big) + C \\
={}& V_{\text{old}} + \sum_{i,j} \big[w_j^1(-\lambda I(0 < z_{ij}^1 < 1)) - w_j^1 \hat{y}_i^1(-\lambda I(0 < v_{ij}^1 < 1)) \\
&\qquad + w_j^2(\lambda I(0 < z_{ij}^2 < 1)) - w_j^2 \hat{y}_i^2(\lambda I(0 < v_{ij}^2 < 1))\big] \\
&+ \sum_{(i,j)\in E} \big[w_e^1(-\lambda I(0 < y_{ij}^1 < 1)) + w_e^2(\lambda I(0 < y_{ij}^2 < 1))\big] \\
&- \sum_i \big[\hat{y}_i^1(-\lambda I(0 < y_i^1 < 1)) + \hat{y}_i^2(\lambda I(0 < y_i^2 < 1))\big] \\
&+ \sum_{i,j} \delta_{ij}(1 - 2\hat{x}_{ij})\big(I(w_j^1 > 0) - I(w_j^1 = 0)\big)(-\lambda I(0 < v_{ij}^1 < 1)).
\end{aligned}
\]
Therefore, we can write $V_{\text{new}} - V_{\text{old}}$ as:

\[
\begin{aligned}
V_{\text{new}} - V_{\text{old}} = \lambda\Big[& \sum_{i,j}\big(-w_j^1 I(0 < z_{ij}^1 < 1) + w_j^1 \hat{y}_i^1 I(0 < v_{ij}^1 < 1) + w_j^2 I(0 < z_{ij}^2 < 1) - w_j^2 \hat{y}_i^2 I(0 < v_{ij}^2 < 1)\big) \\
&+ \sum_{(i,j)\in E}\big(-w_e^1 I(0 < y_{ij}^1 < 1) + w_e^2 I(0 < y_{ij}^2 < 1)\big) \\
&- \sum_i \big(-\hat{y}_i^1 I(0 < y_i^1 < 1) + \hat{y}_i^2 I(0 < y_i^2 < 1)\big) \\
&- \sum_{i,j} \delta_{ij}(1 - 2\hat{x}_{ij})\big(I(w_j^1 > 0) - I(w_j^1 = 0)\big) I(0 < v_{ij}^1 < 1)\Big]
= \lambda D.
\end{aligned}
\]

The change in the objective is $\lambda D$, and since $D$ is constant with respect to $\lambda$, by choosing $\lambda = -\lambda^2$ for negative $D$, or $\lambda = \lambda^1$ for positive $D$, we can always make $\lambda D$ positive or zero. This means that the integral solution increases the objective or leaves it unchanged, while leaving fewer fractional values.
APPENDIX B
PROOFS FOR EQUIVALENCE OF ROBUSTNESS AND
REGULARIZATION IN LARGE MARGIN METHODS
Proof of Lemma 3:

Proof. We form $\delta_{\tilde{y}}^C(x, y, \tilde{x})$ from Equation 4.7:

\[
\delta^C = \delta_{\tilde{y}}^C(x, y, \tilde{x})
= \phi^C(\tilde{x}, \tilde{y}) - \phi^C(\tilde{x}, y) - \big(\phi^C(x, \tilde{y}) - \phi^C(x, y)\big)
= \sum_{(c_x, c_y)\in C} \Big(\prod_{i\in c_x} \tilde{x}_i - \prod_{i\in c_x} x_i\Big)\Big(\prod_{i\in c_y} \tilde{y}_i - \prod_{i\in c_y} y_i\Big)
\qquad \text{(Equation B.1)}
\]
For an individual element of the vector δ as expanded in Equation B.1, we can apply Hölder's inequality to the right-hand side:

\[
|\delta^C| \leq \Big(\sum_{c_x\in C} \Big|\prod_{i\in c_x} \tilde{x}_i - \prod_{i\in c_x} x_i\Big|^p\Big)^{1/p}
\Big(\sum_{c_y\in C} \Big|\prod_{i\in c_y} \tilde{y}_i - \prod_{i\in c_y} y_i\Big|^q\Big)^{1/q}
\]

where $\frac{1}{p} + \frac{1}{q} = 1$. Since $|\prod_{i\in c_y} \tilde{y}_i - \prod_{i\in c_y} y_i|^q \leq 1$, we will have $\sum_{c_y\in C} |\prod_{i\in c_y} \tilde{y}_i - \prod_{i\in c_y} y_i|^q \leq |C|$; therefore:

\[
|\delta^C| \leq |C|^{1/q} \Big(\sum_{c_x\in C} \Big|\prod_{i\in c_x} \tilde{x}_i - \prod_{i\in c_x} x_i\Big|^p\Big)^{1/p}
\]
i∈cx
After applying Lemma 10 and raising both sides of the inequality to the power of p,
we will have:
p
|δ C |p ≤ |C| q (α
XX
|x̃i − xi |p )
cx ∈C i∈cx
C p
⇒
|δ |
α|C|
p
q
≤
XX
|x̃i − xi |p
(Equation B.2)
cx ∈C i∈cx
where α = max|cx |(p−1) , and |cx | is the number of variables in cx .
cx ∈C
The proof of Lemma 3 depends on the following lemma:
Lemma 10. For any sequences $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$ such that $0 \leq a_i, b_j \leq 1$, we have $\big|\prod_{i=1}^n a_i - \prod_{i=1}^n b_i\big|^p \leq n^{(p-1)} \sum_{i=1}^n |a_i - b_i|^p$.

Proof. For $n = 1$, the inequality is trivial. Let $u_1 = \prod_{i=1}^{\lfloor n/2\rfloor} a_i$, $u_2 = \prod_{i=1}^{\lfloor n/2\rfloor} b_i$, $v_1 = \prod_{i=\lfloor n/2\rfloor+1}^{n} a_i$, and $v_2 = \prod_{i=\lfloor n/2\rfloor+1}^{n} b_i$. It is also a known fact that $|f + g|^p \leq 2^{p-1}(|f|^p + |g|^p)$ for $f, g \in \mathbb{R}$. We have:

\[
\begin{aligned}
\Big|\prod_{i=1}^n a_i - \prod_{i=1}^n b_i\Big|^p &= |u_1 v_1 - u_2 v_2|^p
= |u_1 v_1 - u_1 v_2 + u_1 v_2 - u_2 v_2|^p \\
&\leq 2^{p-1}\big(|u_1 v_1 - u_1 v_2|^p + |u_1 v_2 - u_2 v_2|^p\big)
= 2^{p-1}\big(u_1^p |v_1 - v_2|^p + v_2^p |u_1 - u_2|^p\big) \\
&\leq 2^{p-1}\big(|v_1 - v_2|^p + |u_1 - u_2|^p\big)
\end{aligned}
\]

By recursive application of the above procedure, the products can be decomposed at most $\log_2 n$ times. Therefore,

\[
\Big|\prod_{i=1}^n a_i - \prod_{i=1}^n b_i\Big|^p \leq 2^{(p-1)\log_2 n} \sum_{i=1}^n |a_i - b_i|^p = n^{p-1} \sum_{i=1}^n |a_i - b_i|^p
\]
Proof of Corollary 1:

Proof. We begin with the result of Theorem 4.3, where $\frac{1}{B(d\alpha_i)^{1/p}|C_i|^{1/q}}$ is the coefficient of variations in the feature corresponding to clique $C_i$. Since $p = 1$, we have $q = \infty$ and $\alpha_i = \max_{c_x\in C_i} |c_x|^{(p-1)} = 1$:

\[
\frac{1}{B(d\alpha_i)^{1/p}|C_i|^{1/q}} = \frac{1}{Bd\,|C_i|^{1/\infty}} = \frac{1}{Bd}
\]

Also, setting $p = 1$ and $q = \infty$ in Equation B.2:

\[
|\delta^C| \leq \Big(\sum_{c_x\in C} \Big|\prod_{i\in c_x} \tilde{x}_i - \prod_{i\in c_x} x_i\Big|\Big) \max_{c_y\in C} \Big|\prod_{i\in c_y} \tilde{y}_i - \prod_{i\in c_y} y_i\Big|
\]

Since $\max_{c_y\in C} |\prod_{i\in c_y} \tilde{y}_i - \prod_{i\in c_y} y_i| = 1$, we will be using a tighter upper bound.
Proof of Proposition 1:

Proof. We prove the case where the regularization function is $\|w\| = \|w\|_\infty$ (the proofs for $\|M^{-1}w\|_\infty$ and $\|M^{-1}w\|_1$ are very similar, but for simplicity we chose this case). Recall that the optimization program of the robust structural SVM is:

\[
\begin{aligned}
\operatorname*{minimize}_{w,\xi} \quad & c_1 f(w) + c_2 \|w\|_\infty + \xi \\
\text{subject to} \quad & \xi \geq \max_{\tilde{y}}\; w^T(\phi(x, \tilde{y}) - \phi(x, y)) + \Delta(y, \tilde{y})
\end{aligned}
\qquad \text{(Equation B.3)}
\]

It can be re-written as:

\[
\begin{aligned}
\operatorname*{minimize}_{w,\xi,t} \quad & c_1 f(w) + c_2 t + \xi \\
\text{subject to} \quad & \xi \geq \max_{\tilde{y}}\; w^T(\phi(x, \tilde{y}) - \phi(x, y)) + \Delta(y, \tilde{y}) \\
& w_i \leq t, \;\; -w_i \leq t \;\; \forall w_i
\end{aligned}
\]

In vector form we can write these constraints as $w \leq \mathbf{1}t$ and $-w \leq \mathbf{1}t$. Clearly, there are two non-negative slack vectors $s_1$ and $s_2$ for which:

\[
w + s_1 = \mathbf{1}t \;\Rightarrow\; w = \mathbf{1}t - s_1, \qquad
-w + s_2 = \mathbf{1}t \;\Rightarrow\; w = s_2 - \mathbf{1}t
\]

Let $\gamma = [s_1^T\; s_2^T\; t]^T$, $m = \dim w$, $I_{s_1} = [I_{m\times m}\; 0_{m\times m}\; 0_{m\times 1}]$, $I_{s_2} = [0_{m\times m}\; I_{m\times m}\; 0_{m\times 1}]$, and $I_t = [0_{1\times m}\; 0_{1\times m}\; 1]$ (i.e., $s_1 = I_{s_1}\gamma$, $s_2 = I_{s_2}\gamma$, $t = I_t\gamma$). By substitution:

\[
w = \mathbf{1}I_t\gamma - I_{s_1}\gamma = (\mathbf{1}I_t - I_{s_1})\gamma, \qquad
w = I_{s_2}\gamma - \mathbf{1}I_t\gamma = (I_{s_2} - \mathbf{1}I_t)\gamma
\]

which implies $(\mathbf{1}I_t - I_{s_1})\gamma = (I_{s_2} - \mathbf{1}I_t)\gamma$; therefore $(2\cdot\mathbf{1}I_t - I_{s_1} - I_{s_2})\gamma = 0$, or equivalently $\gamma \in \mathcal{N}(2\cdot\mathbf{1}I_t - I_{s_1} - I_{s_2})$, where $\mathcal{N}(\cdot)$ returns the null space of the input matrix. Let the columns of matrix $B$ span $\mathcal{N}(2\cdot\mathbf{1}I_t - I_{s_1} - I_{s_2})$, and let $\gamma = B\lambda$; then we will have $w = (\mathbf{1}I_t - I_{s_1})B\lambda$. Let $A = B^T(\mathbf{1}I_t - I_{s_1})^T$ and $b = B^T I_t^T$ (so that $t = b^T\lambda$); then we can rewrite Problem Equation B.3 as:

\[
\begin{aligned}
\operatorname*{minimize}_{\lambda\geq 0,\,\xi} \quad & c_1 f(A^T\lambda) + c_2 b^T\lambda + \xi \\
\text{subject to} \quad & \xi \geq \max_{\tilde{y}}\; \lambda^T A(\phi(x, \tilde{y}) - \phi(x, y)) + \Delta(y, \tilde{y})
\end{aligned}
\]

Note that since $(2\cdot\mathbf{1}I_t - I_{s_1} - I_{s_2})B = 0$, we will have $(\mathbf{1}I_t - I_{s_1})B = (I_{s_2} - \mathbf{1}I_t)B$, and $A$ can be the transpose of either.
APPENDIX C
DIRECT PROOF FOR DERIVATION OF MARGINALIZED LINEAR SVM
In the following, we provide a direct proof of the asymptotic result of marginalizing the linear SVM.

Let $f_e(w, b)$ be the exact dropout-marginalization of the regularized hinge loss:

\[
\begin{aligned}
f_e(w, b) &= \lambda\|w\| + \sum_i \mathbb{E}_\xi \max\big(0,\; 1 - y_i(w^T (x_i \circ \xi)/(1-\delta) + b)\big) \\
&= \lambda\|w\| + \sum_i \mathbb{E}_\xi \big(1 - y_i(w^T (x_i \circ \xi)/(1-\delta) + b)\big)\, I\big(1 - y_i(w^T (x_i \circ \xi)/(1-\delta) + b) \geq 0\big)
\end{aligned}
\]

where $I$ is the step function (i.e., $I(\text{True}) = 1$ and $I(\text{False}) = 0$). Let $l(w, b) = \mathbb{E}_\xi \big(1 - y_i(w^T (x_i \circ \xi)/(1-\delta) + b)\big)\, I\big(1 - y_i(w^T (x_i \circ \xi)/(1-\delta) + b) \geq 0\big)$. If $z_i = 1 - y_i(w^T (x_i \circ \xi)/(1-\delta) + b) \sim \mathcal{N}(\mu_i, \sigma_i^2)$, then:

\[
l(w, b) = \mathbb{E}_\xi\, z_i I(z_i \geq 0)
= \int_{-\infty}^{\infty} z_i I(z_i \geq 0)\, \phi_\xi(z_i)\, dz_i
= \int_0^{\infty} z_i\, \phi_\xi(z_i)\, dz_i
= \int_0^{\infty} z_i\, \frac{\phi\big(\frac{z_i - \mu_i}{\sigma_i}\big)}{\sigma_i}\, dz_i
\qquad \text{(Equation C.1)}
\]

where $\phi_\xi(z_i)$ is the PDF of $\mathcal{N}(\mu_i, \sigma_i^2)$, and $\phi(z_i)$ is the PDF of the standard normal distribution (i.e., $\mathcal{N}(0, 1)$).
We can construct the expectation of the truncated normal in the following way:

\[
l(w, b) = \lim_{b' \to +\infty} \int_0^{b'} z_i\, \frac{\phi\big(\frac{z_i - \mu_i}{\sigma_i}\big)}{\sigma_i\Big(\Phi\big(\frac{b' - \mu_i}{\sigma_i}\big) - \Phi\big(\frac{0 - \mu_i}{\sigma_i}\big)\Big)} \Big(\Phi\Big(\frac{b' - \mu_i}{\sigma_i}\Big) - \Phi\Big(\frac{0 - \mu_i}{\sigma_i}\Big)\Big)\, dz_i
= \int_0^{+\infty} z_i\, \frac{\phi\big(\frac{z_i - \mu_i}{\sigma_i}\big)}{\sigma_i\Big(1 - \Phi\big(\frac{-\mu_i}{\sigma_i}\big)\Big)} \Big(1 - \Phi\Big(\frac{-\mu_i}{\sigma_i}\Big)\Big)\, dz_i
\qquad \text{(Equation C.2)}
\]

where $\Phi(x)$ is the CDF of the standard normal distribution. Since $\Phi(-x) = 1 - \Phi(x)$, we have:

\[
l(w, b) = \Big(1 - \Phi\Big(\frac{-\mu_i}{\sigma_i}\Big)\Big) \int_0^{+\infty} z_i\, \frac{\phi\big(\frac{z_i - \mu_i}{\sigma_i}\big)}{\sigma_i\big(1 - \Phi\big(\frac{-\mu_i}{\sigma_i}\big)\big)}\, dz_i
= \Phi\Big(\frac{\mu_i}{\sigma_i}\Big) \int_0^{+\infty} z_i\, \frac{\phi\big(\frac{z_i - \mu_i}{\sigma_i}\big)}{\sigma_i\big(1 - \Phi\big(\frac{-\mu_i}{\sigma_i}\big)\big)}\, dz_i
\qquad \text{(Equation C.3)}
\]

Note that $\frac{\phi((z_i - \mu_i)/\sigma_i)}{\sigma_i(1 - \Phi(-\mu_i/\sigma_i))}$ is the PDF of the truncated normal distribution $\mathcal{N}(\mu_i, \sigma_i^2 \mid 0 \leq z_i \leq +\infty)$; therefore:

\[
l(w, b) = \Phi\Big(\frac{\mu_i}{\sigma_i}\Big)\, \mathbb{E}_{z_i \sim \mathcal{N}(\mu_i, \sigma_i^2 \mid 0 \leq z_i)}[z_i]
= \Phi\Big(\frac{\mu_i}{\sigma_i}\Big)\Big(\mu_i + \sigma_i \frac{\phi(-\mu_i/\sigma_i)}{1 - \Phi(-\mu_i/\sigma_i)}\Big)
\qquad \text{(Equation C.4)}
\]

Because of the symmetry of the standard normal distribution, we have $\phi(-\mu_i/\sigma_i) = \phi(\mu_i/\sigma_i)$ and $1 - \Phi(-\mu_i/\sigma_i) = \Phi(\mu_i/\sigma_i)$; therefore:

\[
l(w, b) = \mu_i\, \Phi\Big(\frac{\mu_i}{\sigma_i}\Big) + \sigma_i\, \phi\Big(\frac{\mu_i}{\sigma_i}\Big)
\qquad \text{(Equation C.5)}
\]
APPENDIX D
α-REG PROOF
In this appendix, we provide a proof for Theorem 7.
Proof. We start with Problem Equation 5.11. By expanding the expectation, we will have:

\[
\begin{aligned}
\operatorname*{maximize}_{\alpha} \quad & \sum_i \int_{\zeta_i\in Z} \alpha_i(\zeta_i)\, dP(\zeta_i) - \frac{1}{2}\sum_{i,j} y_i y_j \int_{\zeta_i,\zeta_j\in Z} \alpha_i(\zeta_i)\alpha_j(\zeta_j)\, \tilde{k}(x_i, x_j, \zeta_i, \zeta_j)\, dP(\zeta_i)\, dP(\zeta_j) \\
\text{subject to} \quad & \sum_i y_i \int_{\zeta_i\in Z} \alpha_i(\zeta_i)\, dP(\zeta_i) = 0, \quad 0 \leq \alpha_i(\zeta_i) \leq C \;\; \forall i, \zeta_i
\end{aligned}
\qquad \text{(Equation D.1)}
\]

The theorem assumes the $\alpha_i(\zeta_i)$'s are independent of the $\zeta_i$'s and equal to scalars $\alpha_i$, and that $\tilde{f}(x) = f(x) \circ \zeta_i$ (i.e., dropout in feature space). So Equation 5.11 can be written as:

\[
\begin{aligned}
\operatorname*{maximize}_{\alpha} \quad & \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} y_i y_j\, \alpha_i\alpha_j \int_{\zeta_i,\zeta_j\in Z} (f(x_i) \circ \zeta_i)^T (f(x_j) \circ \zeta_j)\, dP(\zeta_i)\, dP(\zeta_j) \\
\text{subject to} \quad & \sum_i y_i \alpha_i = 0, \quad 0 \leq \alpha_i \leq C \;\; \forall i
\end{aligned}
\qquad \text{(Equation D.2)}
\]

We have:

\[
\begin{aligned}
\frac{1}{2}\sum_{i,j} y_i y_j\, \alpha_i\alpha_j \int_{\zeta_i,\zeta_j\in Z} (f(x_i) \circ \zeta_i)^T (f(x_j) \circ \zeta_j)\, dP(\zeta_i)\, dP(\zeta_j)
&= \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i\alpha_j\, \mathbb{E}_{\zeta_i,\zeta_j\sim p(\delta)}\Big[\sum_k f_k(x_i) f_k(x_j)\, \zeta_{ik}\zeta_{jk}\Big] \\
&= \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i\alpha_j \sum_k f_k(x_i) f_k(x_j)\, \mathbb{E}_{\zeta_{ik},\zeta_{jk}\sim p(\delta)}[\zeta_{ik}\zeta_{jk}]
\end{aligned}
\qquad \text{(Equation D.3)}
\]

\[
\begin{aligned}
\mathbb{E}_{\zeta_{ik},\zeta_{jk}\sim p(\delta)}[\zeta_{ik}\zeta_{jk}]
&= \mathbb{E}_{\zeta_{ik},\zeta_{jk}\sim p(\delta)}\big[(\zeta_{ik} - 1)(\zeta_{jk} - 1) + \zeta_{ik} + \zeta_{jk} - 1\big] \\
&= \mathbb{E}_{\zeta_{ik},\zeta_{jk}\sim p(\delta)}\big[(\zeta_{ik} - 1)(\zeta_{jk} - 1)\big] + \mathbb{E}_{\zeta_{ik}\sim p(\delta)}[\zeta_{ik}] + \mathbb{E}_{\zeta_{jk}\sim p(\delta)}[\zeta_{jk}] - 1 \\
&= \mathbb{E}_{\zeta_{ik},\zeta_{jk}\sim p(\delta)}\big[(\zeta_{ik} - 1)(\zeta_{jk} - 1)\big] + 1
= I(i = j)\,\sigma_\zeta^2 + 1
\end{aligned}
\]

where $I(i = j) = 1$ if $i = j$ and zero otherwise, and $\mathbb{E}_{\zeta_{ik}\sim p(\delta)}[\zeta_{ik}] = \mathbb{E}_{\zeta_{jk}\sim p(\delta)}[\zeta_{jk}] = 1$. Since $\zeta_{ik}$ and $\zeta_{jk}$ are identically and independently drawn from the noise distribution,

\[
\mathbb{E}_{\zeta_{ik},\zeta_{jk}\sim p(\delta)}\big[(\zeta_{ik} - 1)(\zeta_{jk} - 1)\big] =
\begin{cases}
\operatorname{var}(\zeta) = \sigma_\zeta^2 & \text{if } i = j \\
0 & \text{if } i \neq j
\end{cases}
\]

Therefore,

\[
\frac{1}{2}\sum_{i,j} y_i y_j \alpha_i\alpha_j \sum_k f_k(x_i) f_k(x_j)\, \mathbb{E}_{\zeta_{ik},\zeta_{jk}\sim p(\delta)}[\zeta_{ik}\zeta_{jk}]
= \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i\alpha_j \big(1 + I(i = j)\sigma_\zeta^2\big) \sum_k f_k(x_i) f_k(x_j)
= \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i\alpha_j \big(1 + I(i = j)\sigma_\zeta^2\big)\, k(x_i, x_j)
\]
APPENDIX E
LINEAR TIME INFERENCE FOR DIMENSION DROPOUT IN RBF KERNEL
In Chapter 5, we briefly mention the marginalized prediction for dimension dropout, based on the expected kernel function. We now show how this is obtained.

Theorem 9. For dimension dropout, the expected RBF kernel function between a support vector $x$ with noise $\xi$ and a test instance $x'$ has the following closed-form solution:

\[
\mathbb{E}_\zeta[\tilde{k}(x, x', \xi, \zeta)] = \prod_{l:\xi_l \neq 0} \big(\delta + (1 - \delta)\, e^{-\gamma(x_l - x'_l)^2}\big)
\]

Proof. The dimension-dropout RBF kernel is defined as follows:

\[
\tilde{k}(x, x', \xi, \zeta) = e^{-\gamma d(x, x', \xi, \zeta)^2}
\]

where $d(x, x', \xi, \zeta)^2 = \sum_{l:\xi_l \neq 0,\, \zeta_l \neq 0} (x_l - x'_l)^2$. We can also express this kernel as a product:

\[
\tilde{k}(x, x', \xi, \zeta) = \prod_{l:\xi_l \neq 0,\, \zeta_l \neq 0} e^{-\gamma(x_l - x'_l)^2} = \prod_{l:\xi_l \neq 0} e^{-\gamma\zeta_l(x_l - x'_l)^2}
\]

For dropout noise, $\zeta_l$ equals 0 with probability δ and 1 with probability 1 − δ. Since each $\zeta_l$ is drawn independently, we can convert the expectation of a product into a product of expectations:

\[
\mathbb{E}_\zeta\Big[\prod_{l:\xi_l \neq 0} e^{-\gamma\zeta_l(x_l - x'_l)^2}\Big]
= \prod_{l:\xi_l \neq 0} \mathbb{E}_{\zeta_l}\big[e^{-\gamma\zeta_l(x_l - x'_l)^2}\big]
= \prod_{l:\xi_l \neq 0} \big(\delta + (1 - \delta)\, e^{-\gamma(x_l - x'_l)^2}\big)
\]

Since the SVM prediction is a linear function of kernel values, the expected prediction is a linear function of the expected kernel values:

\[
\mathbb{E}_\zeta\Big[\sum_i y_i \alpha_i\, \tilde{k}(x_i, \xi_i, x', \zeta)\Big] = \sum_i y_i \alpha_i\, \mathbb{E}_\zeta[\tilde{k}(x_i, \xi_i, x', \zeta)]
\]

Each expected kernel value can be computed in linear time by iterating over the non-dropped-out features of the support vector.
APPENDIX F
NOTATIONS AND SYMBOLS
|A| — The determinant of matrix A.

‖x‖₀ — The L0 norm of vector x, used to indicate the number of non-zero elements in x.

‖x‖_p — The Lp norm of vector x: $\|x\|_p = (\sum_{i=1}^{m} |x_i|^p)^{1/p}$.

0 — A vector of zeros.

1 — A vector of ones.

A — Matrix $A \in \mathbb{R}^{n\times m}$ for some positive integers n and m; bold capital letters are used for matrices.

Aᵀ — The transpose of matrix A.

A_ij — The element of A in row i and column j.

A ∘ B — Hadamard (element-wise) product of matrices A and B: $(A \circ B)_{ij} = A_{ij}B_{ij}$. The same notation is used for the Hadamard product of vectors.

diag(x) — The m × m diagonal matrix in which the elements of vector x form the diagonal.

m — The dimension of a random vector or a feature function.

n — The number of samples in a dataset.

x — A scalar $x \in \mathbb{R}$.

x (bold) — Vector $x \in \mathbb{R}^m$ for some positive integer m; bold lowercase letters are used to indicate vectors. Vectors are assumed to be m × 1 dimensional matrices (i.e., column vectors): $x = [x_1, \ldots, x_m]^T$.

x_i — The ith element of x.

x ∼ f_x(θ) — Random variable (vector) x is drawn from the probability distribution $f_x$, which is parameterized by vector θ.

N(µ, Σ) — A Gaussian (normal) distribution with mean µ and covariance matrix Σ.

φ(x; µ, Σ) — The probability density function (PDF) of the normal distribution N(µ, Σ): $\phi(x; \mu, \Sigma) = \frac{1}{\sqrt{|2\pi\Sigma|}}\, e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}$.

φ(x) — The standard normal PDF with mean 0 and variance 1 (i.e., N(0, 1)): $\phi(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}$. For a univariate random variable, $\phi(x; \mu, \sigma^2) = \frac{1}{\sigma}\phi\big(\frac{x-\mu}{\sigma}\big)$.

Φ(t; µ, Σ) — The cumulative distribution function (CDF) of the normal distribution N(µ, Σ): $\Phi(t; \mu, \Sigma) = \int_{x\leq t} \phi(x; \mu, \Sigma)\, dx$.