Introduction to multivariate discrimination

Balázs Kégl

LAL/LRI, CNRS & University Paris-Sud, France

March 2012

Multivariate discrimination or classification is one of the best-studied problems in machine learning, with a plethora of well-tested and well-performing algorithms. There are also several good general textbooks [1, 2, 3, 4, 5, 6, 7, 8, 9] on the subject, written for an average engineering, computer science, or statistics graduate student; most of them are also accessible to an average physics student with some background in computer science and statistics. Hence, instead of writing a generic introduction, we concentrate here on relating the subject to the practicing experimental physicist. After a short introduction to the basic setup (Section 1), we delve into the practical issues of complexity regularization, model selection, and hyperparameter optimization (Section 2), since it is this step that makes high-complexity non-parametric fitting so different from low-dimensional parametric fitting. To emphasize that this issue is not restricted to classification, we illustrate the concept on a low-dimensional but non-parametric regression example (Section 2.1). Section 3 describes the common algorithmic-statistical formal framework that unifies the main families of multivariate classification algorithms. We explain here the large-margin principle that partly explains why these algorithms work. Section 4 is devoted to the description of the three main (families of) classification algorithms: neural networks, the support vector machine, and AdaBoost. We do not go into the algorithmic details; the goal is to give an overview of the form of the functions these methods learn and of the objective functions they optimize. Besides their technical description, we also make an attempt to put these algorithms into a socio-historical context. We then briefly describe some rather heterogeneous applications to illustrate the pattern recognition pipeline and to show how widespread the use of these methods is (Section 5). We conclude the chapter with three essentially open research problems that are either relevant to or even motivated by certain unorthodox applications of multivariate discrimination in experimental physics.

1 The supervised learning setup

The goal of supervised learning is to infer a function $g : \mathcal{X} \to \mathcal{Y}$ from a data set
\[
\mathcal{D} = \big((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\big) \in (\mathcal{X} \times \mathcal{Y})^n
\]
comprised of pairs of observations $\mathbf{x}_i$ and labels $y_i$. The components $x^{(j)}$ of the $d$-dimensional feature or observation vector $\mathbf{x}$ are either real numbers or they come from an unordered set of cardinality $M^{(j)}$. In this latter case, without loss of generality, we will assume that $x^{(j)} \in \mathcal{I}^{(j)} = \{1, \ldots, M^{(j)}\}$. When the label space $\mathcal{Y}$ is real-valued (for example, the label represents the mass of a particle), the problem is known as regression, whereas when the labels come from a finite set of classes (for example, $\mathcal{Y} = \{\text{signal}, \text{background}\}$), we are talking about classification.

The quality of $g$ on an arbitrary pair $(\mathbf{x}, y) \in \mathcal{X} \times \mathcal{Y}$ is measured by an error or loss function $L\big(g, (\mathbf{x}, y)\big)$ that depends on the type of problem. In regression, the goal is to get $g(\mathbf{x})$ as close to $y$ as possible, so the loss grows with the difference between $g(\mathbf{x})$ and $y$. The most widely used loss function in this case is the quadratic or squared loss
\[
L_2\big(g, (\mathbf{x}, y)\big) = \big(g(\mathbf{x}) - y\big)^2.
\]
In classification, typically there is no distance or similarity defined between the classes, so all we can measure is whether or not $g$ predicts the class $y$ correctly.
The usual loss in this case is the one-loss or zero-one loss
\[
L_{\mathrm{I}}\big(g, (\mathbf{x}, y)\big) = \mathbb{I}\{g(\mathbf{x}) \neq y\}, \tag{1}
\]
where the indicator function $\mathbb{I}\{A\}$ is 1 if its argument $A$ is true and 0 otherwise.

The goal of learning algorithms is not to memorize $\mathcal{D}$, rather to generalize from it. Indeed, it is rather trivial to construct a function $g$ that has zero loss on every sample point $(\mathbf{x}_i, y_i)$ by explicitly setting $g(\mathbf{x}_i) \equiv y_i$ on all sample points from $\mathcal{D}$, and, say, $g(\mathbf{x}) \equiv 0$ everywhere else. We obviously have not learned anything from $\mathcal{D}$, we have simply memorized it. It is also clear that this function has very little chance to perform well on points not in the set $\mathcal{D}$ (unless 0 is a good prediction everywhere, in which case the problem itself is trivial). To formalize this intuitive notion of generalization, it is usually assumed that all sample points (including those in $\mathcal{D}$) are independent samples from a fixed but unknown distribution $D$, and the goal is to minimize the expected loss (a.k.a. risk or error)
\[
R(g) = \mathbb{E}_{(\mathbf{x},y) \sim D}\big\{L\big(g, (\mathbf{x}, y)\big)\big\},
\]
where $\mathbb{E}_{(\mathbf{x},y) \sim D}\{L\}$ denotes the expectation of $L$ with respect to the random variable $(\mathbf{x}, y)$ drawn from the distribution $D$. When we know $D$, the optimal function is the one that minimizes the risk, that is,
\[
g^* = \arg\min_g R(g).
\]
With this notation, since the expectation of the indicator of an event is the probability of the event, the misclassification probability $\mathbb{P}\{g(\mathbf{x}) \neq y\}$ is the risk generated by the one-loss,¹
\[
R_{\mathrm{I}}(g) = \mathbb{E}_{(\mathbf{x},y) \sim D}\big\{L_{\mathrm{I}}\big(g, (\mathbf{x}, y)\big)\big\} = \mathbb{P}\{g(\mathbf{x}) \neq y\}, \tag{2}
\]
and so it can be shown that the optimal classifier (the so-called Bayes classifier) is
\[
g_{\mathrm{I}}^*(\mathbf{x}) = \arg\max_y \mathbb{P}\{y \mid \mathbf{x}\}.
\]
In regression, the risk is the expectation of the squared error,
\[
R_2(g) = \mathbb{E}_{(\mathbf{x},y) \sim D}\big\{\big(g(\mathbf{x}) - y\big)^2\big\},
\]
and it can be shown easily that the optimal regressor is the conditional expectation of $y$ given $\mathbf{x}$, that is, $g_2^*(\mathbf{x}) = \mathbb{E}\{y \mid \mathbf{x}\}$.

¹ For denoting the risk generated by a particular loss, we will use the same index: the risk generated by the loss $L_\bullet$ will be denoted by $R_\bullet$.

In practice, of course, the data distribution $D$ is unknown, and the goal is to get close to the performance of $g^*$ by using only the sample $\mathcal{D}$. As our imaginary example demonstrates, finding a function $g$ that minimizes the empirical risk (or training error)
\[
\hat{R}(g) = \frac{1}{n} \sum_{i=1}^n L\big(g, (\mathbf{x}_i, y_i)\big) \tag{3}
\]
is not necessarily a good idea. Nevertheless, it turns out that the estimator
\[
\hat{g}^* = \arg\min_{g \in \mathcal{G}} \hat{R}(g) \tag{4}
\]
can have good properties both in theory and in practice if the function class $\mathcal{G}$ is not too "rich", so the functions in $\mathcal{G}$ are not too complex. In fact, the real error $R(\hat{g}^*)$ of the empirically best classifier $\hat{g}^*$ is usually underestimated by the training error $\hat{R}(\hat{g}^*)$, but the bias can be controlled, and the minimization (4) can lead to a good solution if the complexity of the function class $\mathcal{G}$ is "small" compared to the number of data points $n$.

How to control the complexity of $\mathcal{G}$ and how to find an optimal function in $\mathcal{G}$ in a computationally efficient way are the two main subjects of the design of supervised learning algorithms. Finding $\hat{g}^*$ in $\mathcal{G}$ is the subject of algorithmic design. The process is called training or learning, and the data set $\mathcal{D}$ on which $\hat{R}(g)$ is minimized is the training set. Of course, once $\mathcal{D}$ is used for finding $\hat{g}^*$, it is tainted from a statistical point of view: $\hat{R}(\hat{g}^*)$ is no longer an unbiased estimator of $R(\hat{g}^*)$. Overfitting is the term that describes the situation when the risk $R(\hat{g}^*)$ is significantly larger than the training error $\hat{R}(\hat{g}^*)$. To detect and to assess overfitting, $\hat{g}^*$ has to be evaluated on a test set $\mathcal{D}'$, independent of $\mathcal{D}$.
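To make the train/test distinction concrete, the following short Python sketch (an illustration of ours, not the chapter's code: the toy data, the threshold classifier, and all names are our own assumptions) minimizes the empirical zero-one risk (3)-(4) over a small function class and compares the training error with the error on an independent test set:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Toy binary problem: x ~ N(y, 1) with label y in {-1, +1}."""
    y = rng.choice([-1, 1], size=n)
    x = y + rng.normal(size=n)
    return x, y

def zero_one_risk(g, x, y):
    """Empirical risk (3) under the zero-one loss (1)."""
    return np.mean(g(x) != y)

x_train, y_train = make_data(50)      # training set D
x_test, y_test = make_data(10_000)    # independent test set D'

# Empirical risk minimization (4) over G = {threshold classifiers on a grid}.
thresholds = np.linspace(-3, 3, 601)
errs = [zero_one_risk(lambda x, t=t: np.where(x > t, 1, -1), x_train, y_train)
        for t in thresholds]
t_best = thresholds[np.argmin(errs)]
g_best = lambda x: np.where(x > t_best, 1, -1)

print("training error:", zero_one_risk(g_best, x_train, y_train))
print("test error    :", zero_one_risk(g_best, x_test, y_test))  # usually larger
```

Because the threshold was tuned on the training sample, the training error typically underestimates the test error, which is exactly the bias discussed above.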
1.1 Parametric fitting and the maximum likelihood method

The main subject of the chapter is non-parametric (model-less) multivariate classification, and the setup of Section 1 is designed to formalize the theoretical framework for this approach. Nevertheless, it is worth noting that "classical" low-dimensional fitting, be it maximum likelihood or least squares, also fits into this framework. These methods are treated in other chapters, so here we only sketch them to illuminate their relationship to the setup of Section 1.

When we know a generative probabilistic model, one of the possibilities of fitting a model is to maximize the likelihood $p(\mathbf{z})$. This can also be cast into the framework of this section with the usual loss function
\[
L_P(\mathbf{z}) = -\log p(\mathbf{z}),
\]
and with the risk that $L_P$ implies,
\[
R_P(p) = \mathbb{E}_{\mathbf{z} \sim D}\{-\log p(\mathbf{z})\}.
\]
For example, when the data points $\mathbf{z}_i$ are triplets $(x_i, y_i, \sigma_i)$ with the generative model $y_i \sim \mathcal{N}\big(g(x_i), \sigma_i\big)$,² maximum likelihood reduces to weighted least squares ("chi square"), that is,
\[
\arg\min_{g \in \mathcal{G}} \sum_{i=1}^n -\log p(x_i, y_i, \sigma_i) = \arg\min_{g \in \mathcal{G}} \sum_{i=1}^n \frac{\big(g(x_i) - y_i\big)^2}{\sigma_i^2}.
\]

² $\mathcal{N}(\mu, \sigma)$ denotes the normal distribution with mean $\mu$ and standard deviation $\sigma$.

Although the least-squares method is mostly used in parametric fitting, when $\mathcal{G}$ is a low-complexity function class parametrized with a few parameters, neural networks and other model-less approaches can also be used with a (weighted) quadratic loss, and in this case all considerations about overfitting, complexity regularization, and hyperparameter optimization, usually associated with non-parametric classification, are also valid.
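The chi-square reduction above takes one line with numpy (a minimal sketch under our own assumptions: a linear model class and synthetic heteroscedastic data). Note that `np.polyfit` multiplies each residual by its weight, so $w_i = 1/\sigma_i$ gives exactly the chi-square objective:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y_i ~ N(g(x_i), sigma_i) with linear target g(x) = 2x + 1
x = np.linspace(0, 10, 40)
sigma = rng.uniform(0.2, 2.0, size=x.size)     # known per-point uncertainties
y = 2 * x + 1 + sigma * rng.normal(size=x.size)

# Weighted least squares: minimize sum_i ((g(x_i) - y_i) / sigma_i)^2.
slope, intercept = np.polyfit(x, y, deg=1, w=1 / sigma)

chi2 = np.sum(((slope * x + intercept - y) / sigma) ** 2)
print(f"g(x) = {slope:.3f} x + {intercept:.3f}, chi2/dof = {chi2 / (x.size - 2):.2f}")
```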
2 Complexity regularization (model selection) and hyperparameter optimization

The simplest way to control the complexity of the function class $\mathcal{G}$ is to fix it either explicitly (for example, by taking the set of all linear functions) or implicitly (for example, by taking the set of all functions that a neural network with $N$ neurons and one hidden layer can represent). In parametric fitting, $\mathcal{G}$ is a set of functions parametrized with a low number of parameters, and the dimensionality $d$ of the input space is usually low. Most importantly, the number of parameters is fixed independently of the size $n$ of the training data set, so the overfitting issue only shows up in goodness-of-fit tests (for example, in setting the degrees of freedom of the chi-square distribution).

The situation is different in non-parametric fitting. Contrary to what its name suggests, non-parametric fitting means that we have a lot of parameters. This approach is used when we have little knowledge about the model that generated the observations, so we want to avoid the bias caused by fitting a misspecified rigid model. More importantly, we do not want to decide the complexity or the smoothness of the solution beforehand; we prefer to let the data speak for itself.

Formally, we can define a (possibly infinite) set of function classes $\mathcal{G}_\theta$, parametrized by a vector $\theta$ of so-called hyperparameters. Hyperparameters can take diverse forms. Most of them are related to the complexity of the class; the number of neurons in a neural network, the depth of decision trees, or the number of boosting iterations are typical examples. Other hyperparameters are penalty coefficients in a framework called complexity regularization [4] or structural risk minimization [9]. In this setup, instead of finding the empirical minimizer in $\mathcal{G}$, we penalize the complexity of the functions $g \in \mathcal{G}$ by an appropriate penalty $C(g)$, and find
\[
\hat{g}_\beta^* = \arg\min_{g \in \mathcal{G}} \hat{R}(g) + \beta C(g), \tag{5}
\]
where $\beta$ is a penalty coefficient (a hyperparameter) which has to be tuned. The classical practices of weight decay or early stopping in neural networks fall into the category of structural risk minimization. Another typical example is the penalty coefficient (usually denoted by $C$) in support vector machines. A third family of hyperparameters are simple "technical" parameters, such as the learning rate in neural networks, or the number of features $x^{(j)}$ tested at each cut in a decision tree.

There is a vast literature on how to tune hyperparameters on the training set $\mathcal{D}$ itself. This so-called model selection problem has both Bayesian and frequentist solutions, but most of the time, at least in model-less non-parametric fitting, optimality results are only valid asymptotically (as the training sample size $n \to \infty$), and so they only provide some ballpark hints in practice. The solution adopted in practice is called validation. In the simplest setup, we choose $\hat{g}_\theta^*$ by minimizing the training error in each set $\mathcal{G}_\theta$,
\[
\hat{g}_\theta^* = \arg\min_{g \in \mathcal{G}_\theta} \hat{R}(g; \mathcal{D}), \tag{6}
\]
then we choose the overall best predictor $\hat{g}^*$ by minimizing the error on a held-out validation set $\mathcal{D}''$,
\[
\hat{g}^* = \arg\min_{\hat{g}_\theta^*} \hat{R}(\hat{g}_\theta^*; \mathcal{D}''),
\]
where we explicitly added $\mathcal{D}$ and $\mathcal{D}''$ as arguments of the error to emphasize that the two minimizations use two different sets. Indeed, $\mathcal{D}''$ has to be independent of both the training set $\mathcal{D}$ and the eventual test set $\mathcal{D}'$ (used for estimating the performance of the final classifier). This procedure is known as hyperparameter optimization or hyperparameter tuning in machine learning.

The particularity of hyperparameter optimization is that the evaluation of the function $f(\theta) = \hat{R}(\hat{g}_\theta^*; \mathcal{D}'')$ is computationally expensive. Indeed, the evaluation of $f(\theta)$ requires the optimization of $\hat{g}_\theta \in \mathcal{G}_\theta$, that is, training $\hat{g}_\theta$ on $\mathcal{D}$ and testing it on $\mathcal{D}''$, which can take hours or days, depending on the particular learning algorithm, the data size $n$, and the data dimensionality $d$. This feature relates hyperparameter optimization to experimental design where, say, we want to optimize a handful of parameters of a detector by evaluating its performance using expensive simulations.

When the number of hyperparameters is small (say, not more than 3), the usual procedure is simple grid search: we discretize the components of $\theta$, and evaluate $f(\theta)$ for all members of the discrete set. For example, we train AdaBoost for $T \in \{10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10\,000\}$ iterations using decision trees with a number of leaves $N \in \{2, 3, 5, 10, 20, 50, 100\}$, and then select the best $(T, N)$ pair out of the 70 trained functions.
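A grid search of this kind is a few lines of Python. The sketch below is our own illustration, not the chapter's code: it uses scikit-learn's `AdaBoostClassifier` with leaf-limited decision trees and synthetic data as stand-ins, and a reduced grid to keep the run short:

```python
import itertools
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in practice D and D'' come from the application.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

Ts = [10, 50, 200, 1000]   # number of boosting iterations T
Ns = [2, 5, 20]            # number of leaves N of the base trees

best = (np.inf, None)
for T, N in itertools.product(Ts, Ns):       # one expensive training per grid point
    g = AdaBoostClassifier(DecisionTreeClassifier(max_leaf_nodes=N),
                           n_estimators=T).fit(X_tr, y_tr)
    err = np.mean(g.predict(X_val) != y_val)  # validation error R^(g; D'')
    if err < best[0]:
        best = (err, (T, N))

print("best (T, N):", best[1], "validation error:", round(best[0], 3))
```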
When the number of hyperparameters exceeds three, this procedure rapidly becomes infeasible because the number of grid points blows up exponentially. Simple heuristics, such as Latin hypercube search, simple random search [10], or gradient search [11] can be applied, but the principled solution is surrogate optimization: we replace $f$ by a, usually non-parametric, smooth function which we train iteratively on a sequence of evaluation points $\big(\theta_t, f(\theta_t)\big)$. In each iteration $t$, a surrogate function $\hat{f}_t(\theta)$ is regressed over the set $\big\{\big(\theta_{t'}, f(\theta_{t'})\big)\big\}_{t'=1}^{t}$, then a new evaluation point is selected based on $\hat{f}_t(\theta)$. Gaussian process regression [12] is one of the most popular concrete choices for the regression method since it provides not only a regressor but also a conditional distribution $p\big(f(\theta) \mid \theta\big)$, and so it can be used with probabilistic criteria to select the next evaluation point $\theta_t$. The best-known such criterion is expected improvement. The recent success of deep neural networks [13, 14, 15], with often tens of hyperparameters, generated a surge of papers on the subject [16, 10, 17, 18, 19], showing that the jury is still out on what the best strategy is.

2.1 An example in low-dimensional fitting

To illustrate that overfitting and model selection do not only concern multivariate classification, we start with a simple example using Gaussian process (GP) regression [12] on a one-dimensional fitting problem. The main ingredient of a GP is the kernel function $K(x, x')$ that defines the smoothness of the regressor functions $\hat{g}(x)$. In this example we use the squared exponential kernel
\[
K(x, x') = a \exp\left(-\frac{(x - x')^2}{w^2}\right),
\]
where $a$ and $w$ are hyperparameters to be tuned. Without going into the details, the GP regression function is given in the form
\[
\hat{g}(x) = \sum_{i=1}^n \alpha_i K(x, x_i).
\]
The coefficient vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)$ is given by
\[
\boldsymbol{\alpha} = (\mathbf{K} + \boldsymbol{\Sigma})^{-1} \mathbf{y},
\]
where $\mathbf{K} = [K(x_i, x_j)]_{i,j=1}^n$ is the Gram matrix and $\mathbf{y} = (y_1, \ldots, y_n)$ is the label vector. The diagonal matrix $\boldsymbol{\Sigma}$ contains in its diagonal either the squared uncertainties $\sigma_i^2$, in case they are known, or a constant $\sigma^2$ which becomes a third hyperparameter that has to be tuned along with $a$ and $w$.

There are both Bayesian and frequentist techniques to find the optimal hyperparameters $a$, $w$, and $\sigma$ using a single training set $\mathcal{D}$. These methods work well in practice, so validation is not a common technique in GP regression. Nevertheless, for the sake of illustrating the general technique, we show on an example how validation works in this case. We generate 50 training and 50 validation points $(x_i, y_i)$ by first drawing $x_i$ uniformly from the interval $[3, 100]$, and then drawing $y_i \sim \mathcal{N}\big(g(x_i), 0.4\big)$, where
\[
g(x) = \sin\sqrt{x}
\]
is our target regression function. Figure 1 depicts $g(x)$ in blue, and the training set $\mathcal{D}$ in red (the validation set $\mathcal{D}''$ is not shown).

Figure 1: Over- and underfitting in non-parametric GP regression. The blue curve is the target regression function $g(x) = \sin\sqrt{x}$. The red dots are the training set: $x_i$ is drawn uniformly from $[3, 100]$, and $y_i \sim \mathcal{N}\big(g(x_i), 0.4\big)$. Two of the GP hyperparameters are fixed to $a = 10$ and $\sigma = 0.4$, and the third hyperparameter $w$ is tuned on a validation set, yielding an optimal value of $w = 40$.

For the sake of simplicity, we fix $a = 10$ and $\sigma = 0.4$ and only tune the kernel width $w$. The role of $w$ is to set the smoothness of the regressor function. We plot three cases in Figure 1. When $w = 10$, the estimated regressor $\hat{g}(x)$ clearly overfits the data: it follows the training data too closely, achieving a root mean square error (RMSE) $\sqrt{\hat{R}_2(\hat{g})} = 0.33$ on the training set. At the same time, on the validation set its RMSE is 0.47, giving a qualitative indication of overfitting. When $w = 80$, we are underfitting (oversmoothing) the data. The training and validation RMSEs are close (0.47 and 0.48, respectively), but they are both suboptimal. At the optimum $w = 40$, selected by the validation procedure, we achieve a 0.39 training RMSE and a 0.38 validation RMSE. The validation RMSE slightly underestimates the expected RMSE since $w$ was selected based on this sample, but this "hyper-overfitting" is negligible. In fact it is comparable to overfitting a single-parameter function in a parametric setup.
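The GP regressor of this section takes only a few lines of numpy. The sketch below implements $\boldsymbol{\alpha} = (\mathbf{K} + \boldsymbol{\Sigma})^{-1}\mathbf{y}$ and $\hat{g}(x) = \sum_i \alpha_i K(x, x_i)$ directly, under our own random draws (so the exact RMSE values differ from the figures quoted above):

```python
import numpy as np

rng = np.random.default_rng(2)

def K(x1, x2, a, w):
    """Squared exponential kernel K(x, x') = a exp(-(x - x')^2 / w^2)."""
    return a * np.exp(-np.subtract.outer(x1, x2) ** 2 / w ** 2)

g = lambda x: np.sin(np.sqrt(x))                  # target regression function
x_tr = rng.uniform(3, 100, 50)                    # training set D
y_tr = g(x_tr) + 0.4 * rng.normal(size=50)        # y_i ~ N(g(x_i), 0.4)
x_val = rng.uniform(3, 100, 50)                   # validation set D''
y_val = g(x_val) + 0.4 * rng.normal(size=50)

a, sigma = 10.0, 0.4                              # fixed hyperparameters
for w in (10.0, 40.0, 80.0):                      # tune the kernel width
    # alpha = (K + Sigma)^{-1} y with Sigma = sigma^2 I
    alpha = np.linalg.solve(K(x_tr, x_tr, a, w) + sigma**2 * np.eye(50), y_tr)
    g_hat = lambda x: K(x, x_tr, a, w) @ alpha    # g^(x) = sum_i alpha_i K(x, x_i)
    rmse = lambda x, y: np.sqrt(np.mean((g_hat(x) - y) ** 2))
    print(f"w = {w:5.1f}: train RMSE = {rmse(x_tr, y_tr):.2f}, "
          f"validation RMSE = {rmse(x_val, y_val):.2f}")
```

Running it shows the same qualitative pattern: a small $w$ gives a low training RMSE but a higher validation RMSE (overfitting), a large $w$ oversmooths, and the intermediate $w$ balances the two.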
3 Convex losses in binary classification

Finding the empirical minimizer $\hat{g}^*$ (4), $\hat{g}_\theta^*$ (6), or $\hat{g}_\beta^*$ (5) can be algorithmically challenging even in relatively "simple" function sets $\mathcal{G}$. For example, finding the linear binary classifier that minimizes the training error $\hat{R}_{\mathrm{I}}(g)$ is NP-hard [20].³ One way to make the learning problem tractable is to relax the minimization of $\hat{R}_{\mathrm{I}}(g)$ by defining convex losses that upper bound the one-loss $L_{\mathrm{I}}\big(g, (\mathbf{x}, y)\big)$ (1). To formalize this, we need to introduce some notions used in binary classification.

Most of the learning algorithms do not directly output a $\{\pm 1\}$-valued classifier $g$; rather, they learn a real-valued discriminant function $f$, and obtain $g$ by thresholding the output of $f(\mathbf{x})$ at 0, that is,
\[
g(\mathbf{x}) = \begin{cases} +1 & \text{if } f(\mathbf{x}) > 0, \\ -1 & \text{if } f(\mathbf{x}) \le 0. \end{cases}
\]

³ http://en.wikipedia.org/wiki/NP_(complexity)

Using the real-valued output of the discriminant function, the classifier can be evaluated on a finer scale by defining the (functional) margin
\[
\rho = \rho\big(f, (\mathbf{x}, y)\big) = f(\mathbf{x})\,y.
\]
With this notation, the misclassification indicator (one-loss) of a discriminant function $f$ on a sample point $(\mathbf{x}, y)$ can be re-defined as a function of the margin $\rho$ by
\[
L_{\mathrm{I}}\big(f, (\mathbf{x}, y)\big) = L_{\mathrm{I}}\big(\rho\big(f, (\mathbf{x}, y)\big)\big) = L_{\mathrm{I}}(\rho) = \mathbb{I}\{\rho < 0\}.
\]
Besides the sign of $\rho$ that represents the classification error, the magnitude of $\rho$ is also important: it indicates the confidence or the robustness of the classification. Indeed, the combination of a large $\rho$ with a Lipschitz-type (slope) penalty on $f$ means that the geometric margin $\rho^{(g)}$ is also large, that is, $\mathbf{x}$ is far from the decision border (defined by the set of points for which $f(\mathbf{x}) = 0$, see Figure 2).

Figure 2: Geometric and functional margins. The labels of the red and blue points are $y = +1$ and $y = -1$, respectively. (a) An upper bound $\omega$ on the slope $w$ and a functional margin $\rho$ imply a geometric margin (distance from the decision border $\{\mathbf{x} : f(\mathbf{x}) = 0\}$) of $\rho^{(g)} > \rho/\omega$. (b) The discriminant function $f(x)$ and the induced classifier $g(x)$. The blue point (with label $y = -1$) is misclassified and the two red points (with label $y = +1$) are correctly classified, but the second red point has a larger margin $f(x)y$, indicating that its classification is more robust.

The idea behind large margin classification is to design loss functions that "push" the decision border away from the training points. The goal is not just to minimize the training zero-one error $R_{\mathrm{I}}(g)$ (2) but also to increase the margin of the correctly classified points. The common feature of these loss functions is that 1) they penalize larger errors (more negative margins) more than smaller ones, and 2) they keep penalizing even correctly classified points, especially if they are close to the decision border (their margin is close to zero). Formally, one can define smooth convex upper bounds of the margin-based one-loss $L_{\mathrm{I}}(\rho)$, and minimize the corresponding empirical risks instead of the training error $\hat{R}_{\mathrm{I}}$. The most common losses, depicted in Figure 3, are the exponential loss
\[
L_{\mathrm{EXP}}(\rho) = \exp(-\rho), \tag{7}
\]
the hinge loss
\[
L_{1+}(\rho) = \max(0, 1 - \rho), \tag{8}
\]
and the logistic loss
\[
L_{\mathrm{LOG}}(\rho) = \log_2\big(\exp(-\rho) + 1\big). \tag{9}
\]

Figure 3: Convex margin losses used in binary classification.
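These three losses, and the fact that each of them upper bounds the margin-based one-loss $\mathbb{I}\{\rho < 0\}$, can be checked numerically in a few lines of Python (a minimal sketch of ours; plotting them as in Figure 3 would follow the same pattern):

```python
import numpy as np

def loss_exp(rho):   return np.exp(-rho)                  # exponential loss (7)
def loss_hinge(rho): return np.maximum(0.0, 1.0 - rho)    # hinge loss (8)
def loss_log(rho):   return np.log2(np.exp(-rho) + 1.0)   # logistic loss (9)
def loss_01(rho):    return (rho < 0).astype(float)       # one-loss I{rho < 0}

rho = np.linspace(-2, 2, 401)
for L in (loss_exp, loss_hinge, loss_log):
    # Each convex loss upper bounds the zero-one loss everywhere.
    assert np.all(L(rho) >= loss_01(rho))
```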
Given the margin-based loss $L_\bullet(\rho)$, the corresponding margin-based empirical risk can be defined as
\[
\hat{R}_\bullet(f) = \frac{1}{n} \sum_{i=1}^n L_\bullet\big(f(\mathbf{x}_i)\,y_i\big). \tag{10}
\]
Minimizing the margin-based empirical risks $\hat{R}_{\mathrm{EXP}}(f)$, $\hat{R}_{1+}(f)$, or $\hat{R}_{\mathrm{LOG}}(f)$ has both algorithmic and statistical advantages. First, combining the minimization of these risks with Lipschitz-type penalties⁴ often leads to convex optimization problems that can be solved with standard algorithms (e.g., linear, quadratic, or convex programming). Second, it can also be shown within the complexity regularization framework that the minimization of these penalized convex risks leads to large margin classifiers with good generalization properties, confirming the intuitive explanation of the previous paragraph.

⁴ Often in the form of $L_1$ or $L_2$ penalties on the coefficient vector of a generalized linear model; see Section 4.

4 Binary classification algorithms

The three most popular binary classification algorithms are the multi-layer perceptron or neural network (NN) [21, 1], the support vector machine (SVM) [22, 23, 9], and AdaBoost [24]. They have very different origins, and the algorithmic details also make them stand apart; nevertheless, they share some important concepts, and their practical success is explained, at least qualitatively, by the same theory of large margin classification. An important pragmatic similarity is that they all learn generalized linear discriminant functions of the form
\[
f(\mathbf{x}) = \sum_{t=1}^T \alpha^{(t)} h^{(t)}(\mathbf{x}). \tag{11}
\]

In neural networks $h^{(t)}(\mathbf{x})$ is a perceptron, that is, a linear combination of the input features followed by a sigmoidal nonlinearity $\sigma$ (such as $\arctan$),
\[
h^{(t)}(\mathbf{x}) = \sigma\left( w_0^{(t)} + \sum_{j=1}^d w_j^{(t)} x^{(j)} \right),
\]
where $x^{(j)}$ is the $j$th component of the $d$-dimensional input vector $\mathbf{x}$. The output weight vector $\boldsymbol{\alpha} = \big(\alpha^{(1)}, \ldots, \alpha^{(T)}\big)$ and the $T \times (d+1)$-dimensional input weight matrix
\[
\mathbf{W} = \begin{pmatrix} w_0^{(1)} & \cdots & w_d^{(1)} \\ \vdots & \ddots & \vdots \\ w_0^{(T)} & \cdots & w_d^{(T)} \end{pmatrix}
\]
are learned using gradient descent optimization (called back-propagation in the NN literature). In the case of binary classification the neural network is usually trained to minimize the logistic loss (9). One of the advantages of NNs is their versatility: as long as the loss function is differentiable, it can be plugged into the algorithm. The differentiability condition imposed by the gradient descent optimization, on the other hand, constrains the loss function and the nonlinearities used in the hidden units: one could not use, say, a Heaviside-type nonlinearity or the one-loss.
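The following numpy sketch (our own toy illustration, not the chapter's code) spells this architecture out: a one-hidden-layer discriminant $f(\mathbf{x}) = \sum_t \alpha^{(t)} \tanh\big(w_0^{(t)} + \mathbf{w}^{(t)} \cdot \mathbf{x}\big)$ trained by plain batch gradient descent on the logistic loss (9), with the gradient written by hand (back-propagation reduces to the chain rule for a single hidden layer):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: two 2D Gaussian blobs with labels y in {-1, +1}
n = 200
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * 1.5 + rng.normal(size=(n, 2))
X1 = np.hstack([np.ones((n, 1)), X])            # prepend the bias input

T, d = 5, 2                                     # T hidden units, d inputs
W = 0.1 * rng.normal(size=(T, d + 1))           # input weight matrix, T x (d+1)
alpha = 0.1 * rng.normal(size=T)                # output weights

def forward(X1):
    H = np.tanh(X1 @ W.T)                       # hidden activations h^(t)(x)
    return H, H @ alpha                         # f(x) = sum_t alpha_t h_t(x)

eta = 0.05
for epoch in range(500):                        # batch gradient descent
    H, f = forward(X1)
    rho = f * y                                 # margins
    # derivative of the logistic loss log2(exp(-rho) + 1) w.r.t. rho
    dL = -1.0 / (np.log(2) * (1.0 + np.exp(rho)))
    g = dL * y                                  # dL/df, per sample
    grad_alpha = (g[:, None] * H).mean(axis=0)
    grad_W = ((g[:, None] * (1 - H**2) * alpha).T @ X1) / n
    alpha -= eta * grad_alpha
    W -= eta * grad_W

_, f = forward(X1)
print("training error:", np.mean(np.sign(f) != y))
```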
The invention of the multilayer perceptron in 1986 [21] was a technological breakthrough: complex pattern recognition problems (e.g., handwritten character recognition) could suddenly be solved efficiently using neural networks. At the same time, the theory of machine learning in these early days was not yet developed to the point of being able to explain the principles behind algorithmic design. Most of the engineering techniques (such as weight decay or early stopping) were developed using a trial-and-error approach, and they were theoretically justified only much later within the complexity regularization and large margin framework [25]. Partly because of the "empirical" nature of neural network design, a common belief developed about the "user-unfriendliness" of neural networks. Added to this reputation was the uneasiness of having a non-convex optimization problem: back-propagation cannot be guaranteed to converge to a global minimum of the empirical risk. This is basically a non-issue in practice (especially on today's large problems); still, it is a common criticism from people who usually have never experimented with neural networks on real problems. As a consequence, neural networks were slightly overshadowed in the 90s by the appearance of the support vector machine and AdaBoost. Today neural networks are, again, becoming more popular, partly because of user-friendly program packages (e.g., [26]), partly due to the computational efficiency of the training algorithm (especially its stochastic version), and partly because of the recent success of unsupervised feature-learning techniques that use deep neural architectures to solve hard computer vision and language processing problems [13, 14, 15].

Support vector machines also learn generalized linear discriminant functions of the form (11), with
\[
h^{(t)}(\mathbf{x}) = y_t K(\mathbf{x}_t, \mathbf{x}),
\]
where $K(\mathbf{x}, \mathbf{x}')$ is a positive semidefinite kernel function that expresses the similarity between two observations $\mathbf{x}$ and $\mathbf{x}'$. $K(\mathbf{x}, \mathbf{x}')$ is usually monotonically decreasing with the distance between $\mathbf{x}$ and $\mathbf{x}'$. As in Gaussian process regression (Section 2.1), the most common choice for $K$ is the squared exponential (a.k.a. Gaussian) kernel. The index $t$ ranges over $t = 1, \ldots, n$, which means that each training point $(\mathbf{x}_t, y_t) \in \mathcal{D}$ contributes to $f$ a kernel function centered at $\mathbf{x}_t$ with a sign equal to the label $y_t$, so the final discriminant function can be interpreted as a weighted nearest neighbor classifier where the weight comprises the "similarity term" $K(\mathbf{x}_t, \mathbf{x})$ and the "importance term" $\alpha^{(t)}$. The objective of training the SVM is to find the weight vector $\boldsymbol{\alpha} = \big(\alpha^{(1)}, \ldots, \alpha^{(T)}\big)$ that minimizes the hinge loss (8) with a complexity penalty (the $L_2$ penalty on the weights of the features in the Hilbert space induced by $K(\mathbf{x}, \mathbf{x}')$). The objective function was explicitly designed based on the theory of large margin classification to ensure good generalization properties. A great advantage of the setup is that the objective is a quadratic function of $\boldsymbol{\alpha}$ with linear constraints $0 \le \alpha^{(t)} \le C$ (the so-called box constraints), so the global optimum can be found using quadratic programming. The result is sparse: only a subset of the training points in $\mathcal{D}$ have nonzero coefficients $\alpha^{(t)}$. These points are called support vectors, giving the name to the method.
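The form of the learned discriminant is easy to spell out in code. The sketch below is illustrative only: in a real SVM the coefficients $\alpha^{(t)}$ come from the quadratic programming solver subject to the box constraints, which we replace here by random placeholder values to show how the decision function is evaluated:

```python
import numpy as np

rng = np.random.default_rng(4)

def K(x1, x2, w=1.0):
    """Squared exponential kernel between rows of x1 and x2."""
    d2 = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / w ** 2)

X_train = rng.normal(size=(100, 2))              # training points x_t
y_train = np.sign(X_train[:, 0] + X_train[:, 1]) # labels y_t
alpha = rng.uniform(0, 1.0, size=100)            # placeholder for QP output

def f(X):
    """Discriminant f(x) = sum_t alpha_t y_t K(x_t, x); g(x) = sign(f(x))."""
    return K(X, X_train) @ (alpha * y_train)

X_test = rng.normal(size=(5, 2))
print("f(x):", f(X_test))
print("g(x):", np.sign(f(X_test)))
```

Note that evaluating $f$ costs one kernel evaluation per support vector, which is why the sparsity of $\boldsymbol{\alpha}$ matters at test time.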
The biggest disadvantage of the technique is its training time: naïve quadratic programming solvers run in super-quadratic time, and even with the most sophisticated tricks it is hard to beat the $O(n^2)$ barrier. The second disadvantage of the method is that in high-dimensional problems the number of support vectors is comparable to the size of the training set,⁵ so, when evaluating $f$ in the test phase is a bottleneck (see Section 6.1), SVMs can be prohibitively slow. Third, unlike neural networks, SVMs are designed for binary classification, and the generic extensions to multi-class classification and regression do not reproduce the performance of the binary SVM. Despite these disadvantages, the appearance of the support vector machine in the middle of the 90s revolutionized the technology of pattern recognition. Besides its remarkable generalization performance, the small number of hyperparameters⁶ and the appearance of turn-key software packages⁷ made SVMs the method of choice for a wide range of applications involving small-to-moderate size training sets. With the appearance of extra-large training sets in the last decade, training time and optimization became more important issues than generalization [27], so SVMs have somewhat lost their dominant role.

⁵ This loss of sparsity is one of the manifestations of the phenomenon known as the curse of dimensionality.
⁶ $C$ in the bound of the $\alpha$s and a length scale parameter of the kernel function $K(\mathbf{x}, \mathbf{x}')$.
⁷ http://svmlight.joachims.org, http://www.csie.ntu.edu.tw/~cjlin/libsvm, http://www.torch.ch, http://www.loria.fr/~lauer/MSVMpack

AdaBoost learns a generalized linear discriminant function of the form (11) in an iterative fashion. It can be considered as constructing a neural network by adding one neuron at a time, but since back-propagation is no longer applied, there is no differentiability restriction on the base classifiers $h^{(t)}$. This opens the door to using domain-dependent features, making AdaBoost easy to adapt to a wide range of applications. The basic binary classification algorithm (Figure 4) consists of elementary steps that can be implemented by a first-year computer science student in an hour. The only tricky step is the implementation of the base learner (line 3), whose goal is to return an $h^{(t)}$ with a small weighted error
\[
\hat{R}_{\mathrm{I}}(h, \mathbf{w}) = \frac{1}{n} \sum_{i=1}^n w_i \mathbb{I}\{h(\mathbf{x}_i) \neq y_i\},
\]
but most of the simple learning algorithms (e.g., decision trees, linear classifiers) can be easily adapted to this modified risk. The weights $\mathbf{w} = (w_1, \ldots, w_n)$ over the training points are initialized uniformly (line 1), and then updated in each iteration by a simple and intuitive rule (lines 7-10): if $h^{(t)}$ misclassifies $(\mathbf{x}_i, y_i)$, the weight $w_i$ increases (line 8), whereas if $h^{(t)}$ correctly classifies $(\mathbf{x}_i, y_i)$, the weight $w_i$ decreases (line 10). In this way subsequent base classifiers will concentrate on points that were "missed" by previous base classifiers. The coefficients $\alpha^{(t)}$ are also set analytically, by the formula in line 5, which is monotonically decreasing with respect to the weighted error.

AdaBoost$\big(\mathcal{D}, \text{Base}(\cdot, \cdot), T\big)$
 1   $\mathbf{w}^{(1)} \leftarrow (1/n, \ldots, 1/n)$  ▷ initial weights
 2   for $t \leftarrow 1$ to $T$
 3       $h^{(t)} \leftarrow \text{Base}\big(\mathcal{D}, \mathbf{w}^{(t)}\big)$  ▷ base classifier
 4       $\epsilon^{(t)} \leftarrow \sum_{i=1}^n w_i^{(t)} \mathbb{I}\big\{h^{(t)}(\mathbf{x}_i) \neq y_i\big\}$  ▷ weighted error of the base classifier
 5       $\alpha^{(t)} \leftarrow \frac{1}{2} \ln \frac{1 - \epsilon^{(t)}}{\epsilon^{(t)}}$  ▷ coefficient of the base classifier
 6       for $i \leftarrow 1$ to $n$  ▷ re-weighting the training points
 7           if $h^{(t)}(\mathbf{x}_i) \neq y_i$ then  ▷ error
 8               $w_i^{(t+1)} \leftarrow w_i^{(t)} \big/ \big(2\epsilon^{(t)}\big)$  ▷ weight increases
 9           else  ▷ correct classification
10               $w_i^{(t+1)} \leftarrow w_i^{(t)} \big/ \big(2(1 - \epsilon^{(t)})\big)$  ▷ weight decreases
11   return $f^{(T)}(\cdot) = \sum_{t=1}^T \alpha^{(t)} h^{(t)}(\cdot)$  ▷ weighted "vote" of base classifiers

Figure 4: The pseudocode of binary AdaBoost. $\mathcal{D} = \big((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\big)$ is the training set, Base$(\cdot, \cdot)$ is the base learner, and $T$ is the number of iterations.
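Figure 4 translates almost line by line into Python. The sketch below is our own transcription (with a decision stump over single features standing in for the base learner Base, toy blob data of our choosing, and a clip on the weighted error to guard the degenerate separable case):

```python
import numpy as np

def base_stump(X, y, w):
    """Base learner: the stump h(x) = s * sign(x_j - theta) minimizing the
    weighted error (exhaustive search over features and thresholds)."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            for s in (-1.0, 1.0):
                pred = s * np.where(X[:, j] > theta, 1.0, -1.0)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, (j, theta, s))
    return best[1]

def stump_predict(h, X):
    j, theta, s = h
    return s * np.where(X[:, j] > theta, 1.0, -1.0)

def adaboost(X, y, T):
    n = len(y)
    w = np.full(n, 1.0 / n)                        # line 1: initial weights
    H, alphas = [], []
    for t in range(T):                             # line 2
        h = base_stump(X, y, w)                    # line 3: base classifier
        miss = stump_predict(h, X) != y
        eps = np.clip(np.sum(w[miss]), 1e-12, 1 - 1e-12)  # line 4
        alphas.append(0.5 * np.log((1 - eps) / eps))       # line 5
        w = np.where(miss, w / (2 * eps), w / (2 * (1 - eps)))  # lines 6-10
        H.append(h)
    # line 11: f(x) = sum_t alpha_t h_t(x)
    return lambda X: sum(a * stump_predict(h, X) for a, h in zip(alphas, H))

# Toy usage: two overlapping blobs
rng = np.random.default_rng(5)
y = rng.choice([-1.0, 1.0], size=200)
X = y[:, None] * 0.7 + rng.normal(size=(200, 2))
f = adaboost(X, y, T=50)
print("training error:", np.mean(np.sign(f(X)) != y))
```

Note that the weight update in lines 8 and 10 keeps the weights normalized: the misclassified points end up carrying total weight 1/2, and so do the correctly classified ones.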
This simple algorithm has several beautiful mathematical properties. First, it can be shown [24] that if each base classifier $h^{(t)}$ is slightly better than a random guess (that is, every weighted error $\epsilon^{(t)} \le \frac{1}{2} - \delta$ with $\delta > 0$), then the (unweighted) training error $\hat{R}_{\mathrm{I}}$ of the final discriminant function $f^{(T)}$ becomes zero after at most
\[
T = \frac{\log n}{2\delta^2} + 1
\]
steps. The logarithmic dependence of $T$ on the data size $n$ implies that the technique is formally a boosting algorithm, from which the second part of its name derives.⁸ Second, it can be shown that the algorithm minimizes the exponential risk $\hat{R}_{\mathrm{EXP}}(f)$ (7) using a greedy functional gradient approach [28, 29]. In this alternative but equivalent formulation, in each iteration $h^{(t)}$ is selected to maximize the decrease (derivative) of $\hat{R}_{\mathrm{EXP}}\big(f^{(t-1)} + \alpha h^{(t)}\big)$ at $\alpha = 0$, and then $\alpha^{(t)}$ is set to
\[
\alpha^{(t)} = \arg\min_\alpha \hat{R}_{\mathrm{EXP}}\big(f^{(t-1)} + \alpha h^{(t)}\big).
\]
Furthermore, for inseparable data sets (for which no linear combination of base classifiers achieves 0 training error), the exponential risk $\hat{R}_{\mathrm{EXP}}\big(f^{(T)}\big)$ goes to the minimum achievable exponential risk as $T \to \infty$ [30]. It can also be proven [31] that for the normalized discriminant function
\[
\tilde{f}(\mathbf{x}) = \frac{\sum_{t=1}^T \alpha^{(t)} h^{(t)}(\mathbf{x})}{\sum_{t=1}^T \alpha^{(t)}},
\]
the margin-based training error
\[
\hat{R}_{\mathrm{I}}^{(\theta)}(\tilde{f}) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}\big\{\tilde{f}(\mathbf{x}_i)\,y_i < \theta\big\}
\]
also goes to zero exponentially fast for all $\theta < \tilde{\rho}^*/2$, where
\[
\tilde{\rho}^* = \max_{\tilde{\boldsymbol{\alpha}} : \sum_{t=1}^T |\tilde{\alpha}^{(t)}| = 1} \; \min_i \; \sum_{t=1}^T \tilde{\alpha}^{(t)} h^{(t)}(\mathbf{x}_i)\,y_i
\]
is the maximum achievable (normalized) minimum margin. This result shows that AdaBoost, similarly to the support vector machine, leads to a large margin classifier. It also explains the surprising experimental observation that the generalization error $R_{\mathrm{I}}(f)$ (estimated on a test set) often decreases even after the training error $\hat{R}_{\mathrm{I}}(f)$ reaches 0.

AdaBoost arrived on the pattern recognition scene in the late 90s, when support vector machines were in full bloom. AdaBoost has a lot of advantages over its main rival: it is fast (its time complexity is linear in $n$ and $T$), it is an any-time algorithm,⁹ it has few hyperparameters, it is resistant to overfitting, and it can be directly extended to multi-class classification [32]. Due to these advantages, it rapidly became the method of choice of machine learning experts in certain types of problems where the natural setup is to define a large set of plausible features of which the final solution $f^{(T)}$ uses only a few (a so-called sparse classifier). Arguably the most famous application is the first face detector running in real time on 30 frame-per-second video [33]. At the same time, AdaBoost is much less known among users with no computer science background than the support vector machine. The reason is, paradoxically, the simplicity of AdaBoost: since it is so easy to implement with a little background in programming, no machine learning expert took the effort to provide a turn-key software package easily usable for non-experts.¹⁰ In fact it is not surprising that the only large non-computer-scientist community in which AdaBoost is much more popular than the SVM is experimental physics: physicists, especially in high energy physics, are heavy computer users with considerable programming skills.

⁸ The algorithm is also ADAptive because the coefficients $\alpha^{(t)}$ depend on the errors $\epsilon^{(t)}$ of the base classifiers $h^{(t)}$.
⁹ It outputs a classifier even if it is stopped before convergence.
¹⁰ We are attempting to improve the situation with our free multi-class program package available at http://multiboost.org.
In the last five years, AdaBoost (and other similar ensemble methods) seem to have been taking over the field of large-scale applications. In recent large-scale challenges [34, 35] the top entries are dominated by ensemble-based solutions, and the SVM is almost non-existent in the most efficient approaches. Although binary AdaBoost (Figure 4) is indeed simple to implement, multi-class AdaBoost has some nontrivial tricks. There are several multi-class extensions of the original binary algorithm, and it is not well known, even among machine learning experts, that the original AdaBoost.M1 and AdaBoost.M2 algorithms [24] are largely suboptimal compared to AdaBoost.MH, published a couple of years later [32].¹¹ On the other hand, the AdaBoost.MH paper [32] did not specify the implementation details of multi-class base learning, making the implementation non-trivial. For more information on multi-class AdaBoost.MH we refer the reader to the documentation of MultiBoost, available from the http://multiboost.org website.

¹¹ The commonly used WEKA package (http://www.cs.waikato.ac.nz/ml/weka), for example, only contains a badly written implementation of AdaBoost.M1, effectively harming the reputation of boosting as a whole.

5 Applications

In this section we illustrate the versatility of the abstract supervised learning model described in this chapter through presenting some of the machine learning applications we have worked on. It turns out that real-world applications are never as simple as just taking a turn-key implementation out of the box and running it. To make a system work, one needs both domain expertise and machine learning expertise, and so it often requires an interdisciplinary approach and considerable communication effort from experts of different backgrounds. Nevertheless, once the problem is reduced to the abstract setup described in Section 1, machine learning algorithms can be very efficient.

As the examples will show, classification or regression is only one step in the pattern recognition pipeline (Figure 5). Data collection is arguably the most costly step in most of the applications. In particular, harvesting the labels $y_i$ usually involves human interaction. How to do this in a cost-effective way by, for example, using CAPTCHAs to obtain character labels [36] or making people label images by playing a game [37, 38], is itself a challenging research subject. In experimental physics, on the other hand, simulators can be used to sample from the distribution $p(\mathbf{x} \mid y)$, so in these applications data collection is usually not an issue.

Figure 5: The pattern recognition pipeline: data collection, preprocessing, training (learning), and postprocessing.

The second step of the pipeline is preprocessing the data. There is a wide range of techniques that can be applied here, usually with the objective of simplifying the supervised learning task. First, if the observations are not already in the format of a real vector $\mathbf{x} \in \mathbb{R}^d$, features (column vectors $\big(x_1^{(j)}, \ldots, x_n^{(j)}\big)$ of the data matrix) should be extracted from the raw data. The goal here is to find features that are plausibly correlated with the label $y$, so domain knowledge is almost always necessary to carry out this step. Since learning algorithms usually deteriorate as the dimension of the input $\mathbf{x}$ increases (because of the so-called curse of dimensionality), dimensionality reduction is often part of data preprocessing. Principal component analysis is the basic tool that is usually applied, but nonlinear manifold algorithms [39, 40] may also be useful if the training set is not too large. The output space $\mathcal{Y}$ can also be transformed to massage the original problem into a setup that can be solved by standard machine learning tools. This can be done in an ad hoc way, but there also exist principled reduction techniques between machine learning problems (binary/multi-class classification, regression, even reinforcement learning) [41].

After training, postprocessing the results can also be an important step. If, for example, the original complex problem was reduced to an easy-to-solve setup, re-transforming the obtained labels and even re-calibrating the solution is often a good idea. From the well-documented challenges of the last five years [34, 42, 35] we have also learned that the most competitive results are obtained when diverse models trained using different algorithms are combined, so model aggregation techniques are also becoming part of the generic machine learning toolbox.

Sometimes it happens that an application poses a specific problem that cannot be solved with existing tools. Solving such a problem and generalizing the solution to make it applicable to similar problems is one of the motivational engines behind the development of the machine learning domain.

5.1 Music classification, web page ranking, muon counting

In [43] we use AdaBoost for telling apart speech from music. The system starts by constructing the spectrogram of a set of recorded audio signals, which constitute the observation vectors $\mathbf{x}$ in this case.
AdaBoost is then run in a feature space inspired by image classification on the spectrogram "images". The output $y$ of the system is binary, that is, $y \in \mathcal{Y} = \{\text{MUSIC}, \text{SPEECH}\}$.

In [44] we stay within the music classification domain but we tackle a more difficult problem: finding the performing artist and the genre of a song based on the audio signal of the first 30 s of the song. The first module of the classifier pipeline is an elaborate signal processor that collects a vector of features for each 2 s segment and then aggregates them to constitute an observation vector $\mathbf{x}$ with about 800 components per song. We train two systems, one for finding the artist performing the song and another for predicting its genre. This application also stretches the limits of the classical multi-class (single-label) setup. It is plausible that a song belongs to several genres, leading to so-called multi-label classification. It may also be useful to train a unique system for predicting both the artist and the genre at the same time. This problem is known as multi-task learning.

The multivariate regression problem can be illustrated by our recent work [45] in which we aim to predict the number of muons (an elementary particle) in a signal recorded in a water Cherenkov detector of the Pierre Auger experiment [46] (see Chapter 4 for a detailed description of the context). The pipeline, again, starts by extracting features that are plausibly correlated with the muon content of the signal. The 172 features are quite correlated, so first we apply principal component analysis to reduce the size of the observation vector $\mathbf{x}$ to 19. Then we train a neural network [1, 26] and convert its output into a point estimate and a pointwise error bar (uncertainty) of the number of muons.

In [47, 48] we use AdaBoost.MH for web page ranking. The input $\mathbf{x}$ of the classifier $g$ is a set of features representing a search query and a document, and the label $y$ represents the relevance of the document to the query. The goal is to rank the documents in order of their relevance. Of course, if the relevance of the document is correctly predicted, the implied ranking will also be good. Nevertheless, it is hard to formally derive a meaningful loss for each (document, query, relevance) triplet from the particular loss defined on rankings. The solution to this problem is a post-processing technique called calibration: instead of directly using the output of the classifier $g$, we send it through another regressor or ranker which is now trained to minimize the desired risk. Our concrete system also contains another postprocessing module called model aggregation: instead of training one classifier $g$, we train a large number of classifiers by varying data preprocessing and algorithmic hyperparameters, and combine the results using a simple weighted voting scheme.

6 Three open research problems

In this section we briefly describe three open research problems that are either relevant to or even motivated by certain unorthodox applications of multivariate discrimination in experimental physics.

6.1 Trigger design: classification with test-time constraints

One of the recent applications of multivariate discriminants in high-energy physics is trigger design [49]. The goal of a trigger is to separate signals generated by a phenomenon to be detected or measured from background, that is, from observed events that just happen to look like real signal because of random fluctuations or due to some uninteresting phenomena.
The final goal of the analysis is to collect a large statistics of observations that, with high probability, were generated by the targeted phenomenon. Since the background is often several orders of magnitude more frequent than the signal, part of the background/signal separation cannot be done offline. Due to either limited disk capacity or limited bandwidth, most of the raw signal has to be discarded by online triggers. In the machine learning paradigm, a trigger is just a binary classifier with $\mathcal{Y} = \{\text{SIGNAL}, \text{BACKGROUND}\}$.

There are several attributes that make the trigger a special binary classifier. First, it is extremely unbalanced: the probability of BACKGROUND is practically 1 in a lot of cases. This makes the classical setup of minimizing the error probability $R_{\mathrm{I}}(g)$ (2) inadequate: it is very hard to beat the trivial constant classifier $g(\mathbf{x}) \equiv \text{BACKGROUND}$, which has an error of $R_{\mathrm{I}}(g) = \mathbb{P}(y = \text{SIGNAL})$. Indeed, the natural gain in trigger design is the true positive or hit indicator
\[
G_{\mathrm{TP}}\big(g, (\mathbf{x}, y)\big) = \mathbb{I}\{g(\mathbf{x}) = \text{SIGNAL} \mid y = \text{SIGNAL}\}.
\]
Taking the complement of the true positive indicator,
\[
L_{\mathrm{TP}}\big(g, (\mathbf{x}, y)\big) = 1 - \mathbb{I}\{g(\mathbf{x}) = \text{SIGNAL} \mid y = \text{SIGNAL}\} = \mathbb{I}\{g(\mathbf{x}) = \text{BACKGROUND} \mid y = \text{SIGNAL}\},
\]
as the loss, the implied risk is
\[
R_{\mathrm{TP}}(g) = \mathbb{E}_{(\mathbf{x},y) \sim D}\big\{L_{\mathrm{TP}}\big(g, (\mathbf{x}, y)\big)\big\} = \mathbb{P}\{g(\mathbf{x}) = \text{BACKGROUND} \mid y = \text{SIGNAL}\}.
\]
Again, minimizing $R_{\mathrm{TP}}(g)$ is trivial, by setting $g(\mathbf{x}) \equiv \text{SIGNAL}$ this time, so the goal is to minimize $R_{\mathrm{TP}}(g)$ under the constraint that the false positive rate
\[
R_{\mathrm{FP}}(g) = \mathbb{P}(g(\mathbf{x}) = \text{SIGNAL} \mid y = \text{BACKGROUND})
\]
is kept below a fixed level $p_{\mathrm{FP}}$. As we mentioned above, in triggers we have $\mathbb{P}(y = \text{SIGNAL}) \ll \mathbb{P}(y = \text{BACKGROUND})$, so the false positive rate $R_{\mathrm{FP}}(g)$ is approximately equal to the unconditional positive rate $\mathbb{P}(g(\mathbf{x}) = \text{SIGNAL})$. In experimental physics terminology this means that a constraint is imposed on the trigger rate.
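With a real-valued discriminant $f$ and a threshold classifier, this constrained problem reduces to picking the threshold whose estimated false positive rate stays below $p_{\mathrm{FP}}$. A minimal sketch of ours (the toy Gaussian discriminant outputs stand in for a trained $f$; the names are our own):

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy discriminant outputs f(x): background is abundant, signal rare but shifted.
f_bg = rng.normal(0.0, 1.0, size=1_000_000)   # f(x) on background events
f_sig = rng.normal(2.0, 1.0, size=1_000)      # f(x) on signal events

p_fp = 1e-3                                   # allowed trigger (false positive) rate

# Smallest theta with estimated R_FP <= p_fp: the (1 - p_fp) quantile
# of the background discriminant distribution.
theta = np.quantile(f_bg, 1.0 - p_fp)

r_fp = np.mean(f_bg > theta)                  # ~ p_fp by construction
hit_rate = np.mean(f_sig > theta)             # 1 - R_TP, to be maximized
print(f"theta = {theta:.2f}, R_FP = {r_fp:.1e}, signal efficiency = {hit_rate:.2f}")
```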
The second attribute that makes trigger design special is that we have strict computational constraints imposed on the evaluation of the classifier $g$ on test samples $\mathbf{x}$. Typically, observations $\mathbf{x}$ arrive at a given rate and $g(\mathbf{x})$ has to be run online. The designer can have some flexibility in the parallel handling of the incoming events, but computational resources are often limited. In certain cases¹² the electric consumption may also be limited, and harsh conditions may require the use of robust hardware with a low clock rate.

¹² For example, in JEM-EUSO [50], where the detector will be installed on the International Space Station.

Trigger design shares these two attributes (unbalanced classes and test-time constraints) with object detection in computer vision, where machine learning has been applied widely. For example, when the goal is to detect faces in images, the probability of the signal class (face) is much lower than the probability of background (everything else). We also have computational constraints at test time if the goal is to detect faces online in video recordings with a given frame rate and the detector hardware must fit into a compact camera. What makes trigger design slightly more challenging is the sometimes extremely low signal probability and the fact that the computational cost of each feature (the components of $\mathbf{x}$) can vary in a large range. For example, the LHCb trigger [51, 49] can use "cheap" observables that can be evaluated fast to rapidly obtain a rough classifier. One can also almost fully reconstruct the collision event, which can take up a large portion of the allotted time, but the resulting feature can be used reliably to discard background events.

This example shows why a natural answer to these challenges is to design cascade classifiers, both in experimental physics and in object detection [33, 52, 53, 54, 55, 56]. A cascade classifier $g(\mathbf{x})$ is composed of a list of simpler binary classifiers $h_1(\mathbf{x}), \ldots, h_N(\mathbf{x})$ evaluated sequentially. For $j < N$, if $h_j(\mathbf{x})$ classifies the observation $\mathbf{x}$ negatively, the final classification of $g(\mathbf{x})$ is BACKGROUND, and if the output of $h_j(\mathbf{x})$ is positive, the observation is sent to the next, $(j+1)$th, stage (or level, in the terminology of experimental physics). The classifier $h_N(\mathbf{x})$ at the last stage is the only one that can classify $\mathbf{x}$ as SIGNAL. In computer vision, the stage classifiers $h_j(\mathbf{x})$ are usually learned using classification algorithms (most often AdaBoost). Some of the newest detection algorithms also attempt to learn the cascade structure automatically; nevertheless, manual experimentation is usually required to set hyperparameters (stagewise false positive/false negative rates, computational complexity of the stage classifiers $h_j(\mathbf{x})$). In [57], we present a principled approach that can be used to automatically design test-time constrained classifiers. The automatic design also allows us to go beyond the cascade structure which is, although quite intuitive, an artificial constraint to keep the classifier structure simple and to accommodate manual tuning.

6.2 Learning to discover

The main application of multivariate discrimination in high-energy particle physics is, interestingly, not classification. The goal in these applications is to increase the sensitivity of counting-based statistical tests [58, 59]. Intuitively, the goal is to find regions of the feature space where the signal is present or where it is amplified with respect to its average abundance. Once the subregion is found, we claim the discovery of a novel phenomenon (particle) when the number of events in the region is significantly higher than that predicted by the pure background hypothesis. The simplest formal objective, motivated by a Poisson test, is to maximize
\[
G(f) = \frac{s}{\sqrt{b}}, \tag{12}
\]
where
\[
s = \sum_{i=1}^n \mathbb{I}\{f(\mathbf{x}_i) = \text{SIGNAL} \wedge y_i = \text{SIGNAL}\}, \qquad
b = \sum_{i=1}^n \mathbb{I}\{f(\mathbf{x}_i) = \text{SIGNAL} \wedge y_i = \text{BACKGROUND}\}.
\]
Although $G(f)$ has the right "flavor", it does not take into consideration the statistical fluctuations of the test when $s$ and/or $b$ are small, so other, more sophisticated, approximate criteria have also been derived [60]. Whereas all these criteria (including (12)) are clearly related to the classical classification error $R_{\mathrm{I}}(f)$, the two are not equivalent. A notable difference is that the expectation of $G$ depends on the sample size $n$, whereas increasing the sample size only decreases the variance of $R_{\mathrm{I}}$. Nevertheless, the standard practice is to learn a discriminant function $f : \mathbb{R}^d \to \mathbb{R}$ on $\mathcal{D}$ using standard classification methods that minimize the classification error $R_{\mathrm{I}}$, and then use $G$ only to optimize a threshold $\theta$ which defines the classifier $g_\theta$ by
\[
g_\theta(\mathbf{x}) = \begin{cases} \text{SIGNAL} & \text{if } f(\mathbf{x}) > \theta, \\ \text{BACKGROUND} & \text{otherwise.} \end{cases}
\]
This way of using multivariate discriminants raises several interesting research questions. First, given a concrete test, what should the training criterion $G$ be? Second, what is the relationship between $G$ and $R_{\mathrm{I}}(f)$? Finally, how can classical classification algorithms be adapted to the given criteria?
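The threshold optimization step is a simple scan. The sketch below (a toy illustration of ours with Gaussian discriminant outputs, not the chapter's analysis code) picks the $\theta$ maximizing the significance criterion (12):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy discriminant outputs f(x_i) with known labels y_i
f_sig = rng.normal(1.5, 1.0, size=500)        # signal events
f_bg = rng.normal(0.0, 1.0, size=50_000)      # background events

best_theta, best_G = None, -np.inf
for theta in np.linspace(-2, 5, 701):         # scan candidate thresholds
    s = np.sum(f_sig > theta)                 # selected signal count
    b = np.sum(f_bg > theta)                  # selected background count
    if b > 0:
        G = s / np.sqrt(b)                    # significance G = s / sqrt(b), eq. (12)
        if G > best_G:
            best_theta, best_G = theta, G

print(f"best theta = {best_theta:.2f}, G = {best_G:.2f}")
```

As noted above, for small $s$ or $b$ the simple $s/\sqrt{b}$ figure of merit becomes unreliable, which is where the more sophisticated criteria of [60] come in.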
6.3 Deep learning for automatic feature construction

Finally, let us raise a completely open research question. The current technology of multivariate discrimination allows us to classify objects that are represented by vectors $\mathbf{x} = \big(x^{(1)}, \ldots, x^{(d)}\big)$ of fixed length $d$. The representation has to be "semantically aligned", that is, $x_1^{(j)}$ must be comparable to $x_2^{(j)}$ in two objects $\mathbf{x}_1$ and $\mathbf{x}_2$. At the same time, raw observation, be it pixels of an image or electronic channels in a detector event, seldom comes in this form. Today, the process of translating the raw observation into a semantically aligned vector of features is in the hands of the domain expert. One of the most important movements of the last decade in machine learning is the development of a family of techniques for building deep unsupervised neural architectures [13]. The goal is to construct representations in a mostly non-supervised way that can capture deep invariances in the raw data. Although the field is rather young, it has already had a major impact on natural language processing [14] and computer vision [15]. Interesting and natural questions are whether these techniques can be adapted to scientific data, whether the invariances in, say, raw detector data are sufficient to construct representations that could be useful in the ultimate analysis, and whether automatic or semi-automatic feature extraction can improve on the manually constructed representations.

References

[1] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[2] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[3] N. Cristianini and J. Shawe-Taylor, Kernel Methods for Pattern Recognition, Cambridge University Press, 2004.
[4] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, New York, 1996.
[5] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2009.
[6] K. P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, Cambridge, MA, 2012.
[7] Y. Freund and R. E. Schapire, Boosting: Foundations and Algorithms, MIT Press, Cambridge, MA, 2012.
[8] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[9] V. N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[10] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, 2012.
[11] Y. Bengio, "Gradient-based optimization of hyperparameters," Neural Computation, vol. 12, no. 8, pp. 1889–1900, 2000.
[12] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.
[13] Y. Bengio, A. C. Courville, and P. Vincent, "Unsupervised feature learning and deep learning: A review and new perspectives," CoRR, vol. abs/1206.5538, 2012.
[14] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.
[15] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, vol. 25, MIT Press.
[16] J. Bergstra, R. Bardenet, B. Kégl, and Y. Bengio, "Algorithms for hyper-parameter optimization," in Advances in Neural Information Processing Systems (NIPS), 2011, vol. 24, The MIT Press.
[17] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Advances in Neural Information Processing Systems, 2012, vol. 25.
[18] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, "Auto-WEKA: Automated selection and hyper-parameter optimization of classification algorithms," Tech. Rep., http://arxiv.org/abs/1208.3719, 2012.
[19] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag, "Collaborative hyperparameter tuning," in International Conference on Machine Learning (ICML), 2013.
[20] D. S. Johnson and F. P. Preparata, "The densest hemisphere problem," Theoretical Computer Science, vol. 6, pp. 93–107, 1978.
[21] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.
[22] B. Boser, I. Guyon, and V. Vapnik, "A training algorithm for optimal margin classifiers," in Fifth Annual Workshop on Computational Learning Theory, 1992, pp. 144–152.
[23] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[24] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, pp. 119–139, 1997.
[25] P. Bartlett, "The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network," IEEE Transactions on Information Theory, vol. 44, no. 2, pp. 525–536, 1998.
[26] I. Nabney, Netlab: Algorithms for Pattern Recognition, Springer, 2002.
[27] L. Bottou and O. Bousquet, "The tradeoffs of large scale learning," in Advances in Neural Information Processing Systems, vol. 20, pp. 161–168, 2008.
[28] L. Mason, P. Bartlett, J. Baxter, and M. Frean, "Boosting algorithms as gradient descent," in Advances in Neural Information Processing Systems, vol. 12, pp. 512–518, The MIT Press, 2000.
[29] L. Mason, P. Bartlett, and J. Baxter, "Improved generalization through explicit optimization of margins," Machine Learning, vol. 38, no. 3, pp. 243–255, March 2000.
[30] M. Collins, R. E. Schapire, and Y. Singer, "Logistic regression, AdaBoost and Bregman distances," Machine Learning, vol. 48, pp. 253–285, 2002.
[31] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, "Boosting the margin: a new explanation for the effectiveness of voting methods," Annals of Statistics, vol. 26, no. 5, pp. 1651–1686, 1998.
[32] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Machine Learning, vol. 37, no. 3, pp. 297–336, 1999.
[33] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, pp. 137–154, 2004.
[34] G. Dror, M. Boullé, I. Guyon, V. Lemaire, and D. Vogel, Eds., Proceedings of the KDD-Cup 2009 Competition, vol. 7 of JMLR Workshop and Conference Proceedings, 2009.
[35] O. Chapelle and Y. Chang, "Yahoo! Learning-to-Rank Challenge overview," in Yahoo! Learning-to-Rank Challenge (JMLR W&CP), 2011, vol. 14, pp. 1–24.
[36] L. von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum, "reCAPTCHA: Human-based character recognition via web security measures," Science, vol. 321, no. 5895, pp. 1465–1468, 2008.
[37] L. von Ahn and L. Dabbish, "Labeling images with a computer game," in Conference on Human Factors in Computing Systems (CHI04), 2004, pp. 319–326.
[38] L. von Ahn, "Games with a purpose," Computer, vol. 39, no. 6, 2006.
[39] S. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323–2326, 2000.
[40] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, pp. 2319–2323, 2000.
C., “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, pp. 2319–2323, 2000. [41] A. Beygelzimer, J. Langford, and B. Zadrozny, Performance Modeling and Engineering, chapter Machine Learning Techniques–Reductions Between Prediction Quality Metrics, Springer, 2008. [42] J. Bennett and S. Lanning, “The Netflix prize,” in KDDCup 2007, 2007. [43] N. Casagrande, D. Eck, and B. Kégl, “Geometry in sound: A speech/music audio classifier inspired by an image classifier,” in International Computer Music Conference, Sept. 2005, vol. 17. [44] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, “Aggregate features and AdaBoost for music classification,” Machine Learning Journal, vol. 65, no. 2/3, pp. 473–484, 2006. [45] B. Kégl, R. Busa-Fekete, K. Louedec, R. Bardenet, X. Garrido, I.C. Mariş, D. MonnierRagaigne, S. Dagoret-Campagne, and M. Urban, “Reconstructing Nµ19 (1000),” Tech. Rep. 2011-054, Auger Project Technical Note, 2011. [46] Pierre Auger Collaboration, “Pierre Auger project design report,” Tech. Rep., Pierre Auger Observatory, 1997. [47] R. Busa-Fekete, B. Kégl, T. Éltető, and Gy. Szarvas, “Ranking by calibrated AdaBoost,” in Yahoo! Ranking Challenge 2010 (JMLR workshop and conference proceedings), 2011, vol. 14, pp. 37–48. [48] R. Busa-Fekete, B. Kégl, T. Éltető, and Gy. Szarvas, “A robust ranking methodology based on diverse calibration of AdaBoost,” in European Conference on Machine Learning, 2011, vol. 22, pp. 263–279. 19 [49] V. Gligorov and M. Williams, “Efficient, reliable and fast high-level triggering using a bonsai boosted decision tree,” Tech. Rep., arXiv:1210.6861, 2012. [50] Y. Takizawa, T. Ebisuzaki, Y. Kawasaki, M. Sato, M. E. Bertaina, H. Ohmori, Y. Takahashi, F. Kajino, M. Nagano, N. Sakaki, N. Inoue, H. Ikeda, Y. Arai, Y. Takahashi, T. Murakami, James H. Adams, and the JEM-EUSO Collaboration, “JEM-EUSO: Extreme Universe Space Observatory on JEM/ISS,” Nuclear Physics B - Proceedings Supplements, vol. 166, pp. 72–76, 2007. [51] V. Gligorov, “A single track HLT1 trigger,” Tech. Rep. LHCb-PUB-2011-003, CERN, 2011. [52] L. Bourdev and J. Brandt, “Robust object detection via soft cascade,” in Conference on Computer Vision and Pattern Recognition. 2005, vol. 2, pp. 236–243, IEEE Computer Society. [53] R. Xiao, L. Zhu, and H. J. Zhang, “Boosting chain learning for object detection,” in Ninth IEEE International Conference on Computer Vision, 2003, vol. 9, pp. 709–715. [54] J. Sochman and J. Matas, “WaldBoost – learning for time constrained sequential detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 150–156. [55] B. Póczos, Y. Abbasi-Yadkori, Cs. Szepesvári, R. Greiner, and N. Sturtevant, “Learning when to stop thinking and do something!,” in Proceedings of the 26th International Conference on Machine Learning, 2009, pp. 825–832. [56] M. Saberian and N. Vasconcelos, “Boosting classifier cascades,” in Advances in Neural Information Processing Systems 23. 2010, pp. 2047–2055, MIT Press. [57] D. Benbouzid, R. Busa-Fekete, and B. Kégl, “Fast classification using sparse decision DAGs,” in International Conference on Machine Learning, June 2012, vol. 29. [58] V. M. Abazov et al., “Observation of single top-quark production,” Physical Review Letters, vol. 103, no. 9, 2009. [59] Aaltonen, T. et. al, “Observation of electroweak single top-quark production,” Phys. Rev. Lett., vol. 103, pp. 092002, Aug 2009. [60] G. Cowan, K. Cranmer, E. Gross, and O. 
Vitells, “Asymptotic formulae for likelihood-based tests of new physics,” The European Physical Journal C, vol. 71, pp. 1–19, 2011. 20
