Journal of Machine Learning Research 1 (2000) 1-48 Submitted 4/00; Published 10/00 Bias–variance analysis of Support Vector Machines for the development of SVM-based ensemble methods Giorgio Valentini [email protected] DSI - Dipartimento di Scienze dell’Informazione Università degli Studi di Milano Via Comelico 39, Milano, Italy Thomas G. Dietterich [email protected] Department of Computer Science Oregon State University Corvallis, OR 97331, USA Editor: Abstract Bias–variance analysis provides a tool to study learning algorithms and can be used to properly design ensemble methods well-tuned to the properties of a speciﬁc base learner. Indeed the eﬀectiveness of ensemble methods critically depends on accuracy, diversity and learning characteristics of base learners. We present an extended experimental analysis of bias–variance decomposition of the error in Support Vector Machines (SVMs), considering gaussian, polynomial and dot–product kernels. A characterization of the error decomposition is provided, by means of the analysis of the relationships between bias, variance, kernel type and its parameters, oﬀering insights into the way SVMs learn. The results show that the expected trade-oﬀ between bias and variance is sometimes observed, but more complex relationships can be detected, especially in gaussian and polynomial kernels. We show that the bias–variance decomposition oﬀers a rationale to develop ensemble methods using SVMs as base learners, and we outline two directions for developing SVM ensembles, exploiting the SVM bias characteristics and the bias-variance dependence on the kernel parameters. Keywords: Bias–variance analysis, Support Vector Machines, ensemble methods, multiclassiﬁer systems. 1. Introduction Ensembles of classiﬁers represent one of the main research directions in machine learning (Dietterich, 2000a). Empirical studies showed that both in classiﬁcation and regression problems ensembles are often much more accurate than the individual base learner that make them up (Bauer and Kohavi, 1999, Dietterich, 2000b, Freund and Schapire, 1996), and recently diﬀerent theoretical explanations have been proposed to justify the eﬀectiveness of some commonly used ensemble methods (Kittler et al., 1998, Schapire, 1999, Kleinberg, 2000, Allwein et al., 2000). Two main theories are invoked to explain the success of ensemble methods. The ﬁrst one consider the ensembles in the framework of large margin classiﬁers (Mason et al., 2000), showing that ensembles enlarge the margins, enhancing the generalization capabilities of c 2000 Giorgio Valentini and Thomas Dietterich. Valentini and Dietterich learning algorithms (Schapire et al., 1998, Allwein et al., 2000). The second is based on the the classical bias–variance decomposition of the error (Geman et al., 1992), and it shows that ensembles can reduce variance (Breiman, 1996b) and also bias (Kong and Dietterich, 1995). Recently Domingos proved that Schapire’s notion of margins (Schapire et al., 1998) can be expressed in terms of bias and variance and viceversa (Domingos, 2000c), and hence Schapire’s bounds of ensemble’s generalization error can be equivalently expressed in terms of the distribution of the margins or in terms of the bias–variance decomposition of the error, showing the equivalence of margin-based and bias–variance-based approaches. The eﬀectiveness of ensemble methods depends on the speciﬁc characteristics of the base learners; in particular on the relationship between diversity and accuracy of the base learners (Dietterich, 2000a, Kuncheva et al., 2001b, Kuncheva and Whitaker, 2003), on their stability (Breiman, 1996b, Bousquet and Elisseeﬀ, 2002), and on their general geometrical properties (Cohen and Intrator, 2001). From this standpoint the analysis of the features and properties of the base learners used in ensemble methods is crucial in order to design ensemble methods well-tuned to the characteristics of a speciﬁc base learner. For instance, considering that the agglomeration of many classiﬁers into one classiﬁcation rule reduces variance (Breiman, 1996a), we could apply low-bias base learners to reduce both bias and variance using ensemble methods. To this purpose in this paper we study Support Vector Machines (SVMs), that are “strong” dichotomic classiﬁers, well-founded on Vapnik’s Statistical Learning Theory (Vapnik, 1998), in order to establish if and how we can exploit their speciﬁc features in the context of ensemble methods. We analyze the learning properties of SVMs using the bias–variance decomposition of the error as a tool to understand the relationships between kernels, kernel parameters, and learning processes in SVM. Historically, the bias–variance insight was borrowed from the ﬁeld of regression, using squared–loss as the loss function (Geman et al., 1992). For classiﬁcation problems, where the 0/1 loss is the main criterion, several authors proposed bias–variance decompositions related to 0/1 loss. Kong and Dietterich (1995) proposed a bias–variance decomposition in the context of ECOC ensembles (Dietterich and Bakiri, 1995), but their analysis is extensible to arbitrary classiﬁers, even if they deﬁned variance simply as a diﬀerence between loss and bias. In Breiman’s decomposition (Breiman, 1996b) bias and variance are always non-negative (while Dietterich deﬁnition allows a negative variance), but at any input the reducible error (i.e. the total error rate less noise) is assigned entirely to variance if the classiﬁcation is unbiased, and to bias if biased. Moreover he forced the decomposition to be purely additive, while for the 0/1 loss this is not the case. Kohavi and Wolpert (1996) approach leads to a biased estimation of bias and variance, assigning a non-zero bias to a Bayes classiﬁer, while Tibshirani (1996) did not use directly the notion of variance, decomposing the 0/1 loss in bias and an unrelated quantity he called ”aggregation eﬀect”, which is similar to the James’ notion of variance eﬀect (James, 2003). Friedman (1997) showed that bias and variance are not purely additive: in some cases increasing variance increases the error, but in other cases can also reduce the error, especially when the prediction is biased. 2 Bias–variance analysis of SVMs Heskes (1998) proposed a bias-variance decomposition using as loss function the KullbackLeibler divergence. By this approach the error between the target and the predicted classiﬁer densities is measured; anyway when he tried to extend this approach to the zero-one function interpreted as the limit case of log-likelihood type error, the resulting decomposition produces a deﬁnition of bias that loses his natural interpretation as systematic error committed by the classiﬁer. As brieﬂy outlined, these decompositions suﬀer of signiﬁcant shortcomings: in particular they lose the relationship to the original squared loss decomposition, forcing in most cases bias and variance to be purely additive. We consider classiﬁcation problems and the 0/1 loss function in the Domingos’ uniﬁed framework of bias–variance decomposition of the error (Domingos, 2000c,b). In this approach bias and variance are deﬁned for an arbitrary loss function, showing that the resulting decomposition specializes to the standard one for squared loss, but it holds also for the 0/1 loss (Domingos, 2000c). A similar approach has been proposed by James (2003): he extended the notion of variance and bias for general loss functions, distinguishing also between bias and variance, interpreted respectively as the systematic error and the variability of an estimator, and the the actual eﬀect of bias and variance on the error. Using Domingos theoretical framework, we tried to answer two main questions: 1. Can we characterize bias and variance in SVMs with respect to the kernel and its parameters? 2. Can the bias–variance decomposition oﬀer guidance for developing ensemble methods using SVMs as base learners? In order to answer these two questions, we planned and performed an extensive series of experiments on synthetic and real data sets to evaluate bias variance–decomposition of the error with diﬀerent kernels and diﬀerent kernel parameters. The paper is organized as follows. In Sect. 2, we summarize the main results of Domingos’ uniﬁed bias–variance decomposition of error. Sect. 3 outlines how to measure in practice bias and variance decomposition of the error with artiﬁcial or large benchmark data sets, or when only a small ”real” data set is available. Sect. 4 outlines the main characteristics of the data sets employed in our experiments and the main experimental tasks performed. Then we present the main results of our experiments about bias–variance decomposition of the error in SVMs, considering separately gaussian, polynomial and and dot-product SVMs, and comparing also the results between diﬀerent kernels. Sect. 6 provides a characterization of bias–variance decomposition of the error for gaussian, polynomial and and dot-product SVMs, highlighting the common patterns for each diﬀerent kernel. Sect. 7 exploits the knowledge achieved by the bias–variance decomposition of the error to formulate hypotheses about the eﬀectiveness of SVMs as base learners in ensembles of learning machines, and two directions for developing new ensemble models of SVM are proposed. An outline of ongoing and future developments of this work concludes the paper. 3 Valentini and Dietterich 2. Bias–Variance Decomposition for the 0/1 loss function The analysis of bias–variance decomposition of the error has been originally developed in the standard regression setting, where the squared error is usually used as loss function. Considering a prediction y = f (x) of an unknown target t, provided by a learner f on input x, with x ∈ Rd and y ∈ R, the classical decomposition of the error in bias and variance for the squared error loss is (Geman et al., 1992): Ey,t [(y − t)2 ] = Et [(t − E[t])2 ] + Ey [(y − E[y])2 ] + (E[y] − E[t])2 = N oise(t) + V ar(y) + Bias2 (y) In words, the expected loss of using y to predict t is the sum of the variances of t (noise) and y plus the squared bias. Ey [·] indicates the expected value with respect to the distribution of the random variable y. This decomposition cannot be automatically extended to the standard classiﬁcation setting, as in this context the 0/1 loss function is usually applied, and bias and variance are not purely additive. As we are mainly interested in analyzing bias–variance for classiﬁcation problems, we introduce the bias–variance decomposition for the 0/1 loss function, according to the Domingos uniﬁed bias–variance decomposition of the error (Domingos, 2000b). 2.1 Expected loss depends on the randomness of the training set and the target Consider a (potentially inﬁnite) population U of labeled training data points, where each point is a pair (xj , tj ), tj ∈ C, xj ∈ Rd , d ∈ N, where C is the set of the class labels. Let P (x, t) be the joint distribution of the data points in U . Let D be a set of m points drawn identically and independently from U according to P . We think of D as being the training sample that we are given for training a classiﬁer. We can view D as a random variable, and we will let ED [·] indicate the expected value with respect to the distribution of D. Let L be a learning algorithm, and deﬁne fD = L(D) as the classiﬁer produced by L applied to a training set D. The model produces a prediction fD (x) = y. Let L(t, y) be the 0/1 loss function, that is L(t, y) = 0 if y = t, and L(t, y) = 1 otherwise. Suppose we consider a ﬁxed point x ∈ Rd . This point may appear in many labeled training points in the population. We can view the corresponding labels as being distributed according to the conditional distribution P (t|x). Recall that it is always possible to factor the joint distribution as P (x, t) = P (x)P (t|x). Let Et [·] indicate the expectation with respect to t drawn according to P (t|x). Suppose we consider a ﬁxed predicted class y for a given x. This prediction will have an expected loss of Et [L(t, y)]. In general, however, the prediction y is not ﬁxed. Instead, it is computed from a model fD which is in turn computed from a training sample D. Hence, the expected loss EL of learning algorithm L at point x can be written by considering both the randomness due to the choice of the training set D and the randomness in t due to the choice of a particular test point (x, t): EL(L, x) = ED [Et [L(t, fD (x))]], 4 Bias–variance analysis of SVMs where fD = L(D) is the classiﬁer learned by L on training data D. The purpose of the bias-variance analysis is to decompose this expected loss into terms that separate the bias and the variance. 2.2 Optimal and main prediction. To derive this decomposition, we must deﬁne two things: the optimal prediction and the main prediction: according to Domingos, bias and variance can be deﬁned in terms of these quantities. The optimal prediction y∗ for point x minimizes Et [L(t, y)] : y∗ (x) = arg min Et [L(t, y)] y (1) It is equal to the label t that is observed more often in the universe U of the data points, and corresponds to the prediction provided by the Bayes classiﬁer. The optimal model fˆ(x) = y∗ , ∀x makes the optimal prediction at each point x, and corresponds to the Bayes classiﬁer; its error rate corresponds to the Bayes error rate. The noise N (x), is deﬁned in terms of the optimal prediction, and represents the remaining loss that cannot be eliminated, even by the optimal prediction: N (x) = Et [L(t, y∗ )] Note that in the deterministic case y∗ = t and N (x) = 0. The main prediction ym at point x is deﬁned as ED [L(fD (x), y )]. ym = arg min y (2) This is a value that would give the lowest expected loss if it were the “true label” of x. It expresses the ”central tendency” of a learner, that is its systematic prediction, or, in other words, it is the label for x that the the learning algorithm “wishes” were correct. For 0/1 loss, the main prediction is the class predicted most often by the learning algorithm L when applied to training sets D. 2.3 Bias, unbiased and biased variance. Given these deﬁnitions, the bias B(x) (of a learning algorithm L on training sets of size m) is the loss of the main prediction relative to the optimal prediction: B(x) = L(y∗ , ym ) For 0/1 loss, the bias is always 0 or 1. We will say that L is biased at point x, if B(x) = 1. The variance V (x) is the average loss of the predictions relative to the main prediction: V (x) = ED [L(ym , fD (x))] (3) It captures the extent to which the various predictions fD (x) vary depending on D. In the case of the 0/1 loss we can also distinguish two opposite eﬀects of variance (and noise) on the error: in the unbiased case variance and noise increase the error, while in the biased case variance and noise decrease the error. There are three components that determine whether t = y: 5 Valentini and Dietterich y = ym ? * no [bias] yes ym = y ? ym = y ? no [variance] yes t =y ? t =y ? correct yes no [noise] no [variance] t =y ? * * yes yes t =y ? * yes no [noise] error error correct [noise] [variance] [noise cancels variance] error [bias] no [noise] yes * no [noise] correct correct [noise [variance cancels cancels bias] bias] error [noise cancels variance cancels bias] Figure 1: Case analysis of error. 1. Noise: is t = y∗ ? 2. Bias: is y∗ = ym ? 3. Variance: is ym = y ? Note that bias is either 0 or 1 because neither y∗ nor ym are random variables. From this standpoint we can consider two diﬀerent cases: the unbiased and the biased case. In the unbiased case, B(x) = 0 and hence y∗ = ym . In this case we suﬀer a loss if the prediction y diﬀers from the main prediction ym (variance) and the optimal prediction y∗ is equal to the target t, or y is equal to ym , but y∗ is diﬀerent from t (noise). In the biased case, B(x) = 1 and hence y∗ = ym . In this case we suﬀer a loss if the prediction y is equal to the main prediction ym and the optimal prediction y∗ is equal to the target t, or if both y is diﬀerent from ym (variance), and y∗ is diﬀerent from t (noise). Fig. 1 summarizes the diﬀerent conditions under which an error can arise, considering the combined eﬀect of bias, variance and noise on the learner prediction. Considering the above case analysis of the error, if we let P (t = y∗ ) = N (x) = τ and P (ym = y) = V (x) = σ, in the unbiased case we have: L(t, y) = τ (1 − σ) + σ(1 − τ ) (4) = τ + σ − 2τ σ = N (x) + V (x) − 2N (x)V (x) while, in the the biased case: L(t, y) = τ σ + (1 − τ )(1 − σ) = 1 − (τ + σ − 2τ σ) = B(x) − (N (x) + V (x) − 2N (x)V (x)) 6 (5) Bias–variance analysis of SVMs Error 1.0 biased 0.5 unbiased 0.5 Variance Figure 2: Eﬀects of biased and unbiased variance on the error. The unbiased variance increments, while the biased variance decrements the error. Note that in the unbiased case (eq. 4) the variance is an additive term of the loss function, while in the biased case (eq. 5) the variance is a subtractive term of the loss function. Moreover the interaction terms will usually be small, because, for instance, if both noise and variance term will be both lower than 0.1, the interaction term 2N (x)V (x) will be reduced to less than 0.02. In order to distinguish between these two diﬀerent eﬀects of the variance on the loss function, Domingos deﬁnes the unbiased variance, Vu (x), to be the variance when B(x) = 0 and the biased variance, Vb (x), to be the variance when B(x) = 1. We can also deﬁne the net variance Vn (x) to take into account the combined eﬀect of the unbiased and biased variance: Vn (x) = Vu (x) − Vb (x) Fig. 2 summarizes in graphic form the opposite eﬀects of biased and unbiased variance on the error. If we can disregard the noise, the unbiased variance captures the extents to which the learner deviates from the correct prediction ym (in the unbiased case ym = y∗ ), while the biased variance captures the extents to which the learner deviates from the incorrect prediction ym (in the biased case ym = y∗ ). 7 Valentini and Dietterich 2.4 Domingos bias–variance decomposition. Domingos (2000a) showed that for a quite general loss function the expected loss is: EL(L, x) = c1 N (x) + B(x) + c2 V (x) (6) For the 0/1 loss function c1 is 2PD (fD (x) = y∗ ) − 1 and c2 is +1 if B(x) = 0 and −1 if B(x) = 1. Note that c2 V (x) = Vu (x) − Vb (x) = Vn (x) (eq. 3), and if we disregard the noise, eq. 6 can be simpliﬁed to: (7) EL(L, x) = B(x) + Vn (x) Summarizing, one of the most interesting aspects of Domingos’ decomposition is that variance hurts on unbiased points x, but it helps on biased points. Nonetheless, to obtain low overall expected loss, we want the bias to be small, and hence, we see to reduce both the bias and the unbiased variance. A good classiﬁer will have low bias, in which case the expected loss will approximately equal the variance. This decomposition for a single point x can be generalized to the entire population by deﬁning Ex [·] to be the expectation with respect to P (x). Then we can deﬁne the average bias Ex [B(x)], the average unbiased variance Ex [Vu (x)], and the average biased variance Ex [Vb (x)]. In the noise-free case, the expected loss over the entire population is Ex [EL(L, x)] = Ex [B(x)] + Ex [Vu (x)] − Ex [Vb (x)]. 3. Measuring bias and variance The procedures to measure bias and variance depend on the characteristics and on the cardinality of the data sets used. For synthetic data sets we can generate diﬀerent sets of training data for each learner to be trained. Then a large synthetic test set can be generated in order to estimate the bias–variance decomposition of the error for a speciﬁc learner model. Similarly, if a large data set is available, we can split it in a large learning set and in a large testing set. Then we can randomly draw subsets of data from the large training set in order to train the learners; bias–variance decomposition of the error is measured on the large independent test set. However, in practice, for real data we dispose of only one and often small data set. In this case, we can use cross-validation techniques for estimating bias–variance decomposition, but we propose to use out-of-bag (Breiman, 2001) estimation procedures, as they are computationally less expensive. 3.1 Measuring with artificial or large benchmark data sets Consider a set D = {Di }ni=1 of learning sets Di = {xk , tk }m k=1 . The data sets Di can be generated according to some known probability distribution or can be drawn with replacement from a large data set D according to an uniform probability distribution. Here we consider only a two-class case, i.e. tk ∈ C = {−1, 1}, xk ∈ X, for instance X = Rd , d ∈ N, but the extension to the multiclass case is straightforward. The estimates of the error, bias, unbiased and biased variance are performed on a test set T separated from the training set D. In particular these estimates with respect to a 8 Bias–variance analysis of SVMs single example (x, t) ∈ T are performed using the classiﬁers fDi = L(Di ) produced by a learner L using training sets Di drawn from D. These classiﬁers produce a prediction y ∈ C, that is fDi (x) = y. The estimates are performed for all the (x, t) ∈ T , and the overall loss, bias and variance can be evaluated averaging over the entire test set T . In presence of noise and with the 0/1 loss, the optimal prediction y∗ is equal to the label t that is observed more often in the universe U of data points: y∗ (x) = arg max P (t|x) t∈C The noise N (x) for the 0/1 loss can be estimated if we can evaluate the probability of the targets for a given example x: L(t, y∗ )P (t|x) = ||t = y∗ ||P (t|x) N (x) = t∈C t∈C where ||z|| = 1 if z is true, 0 otherwise, In practice it is diﬃcult to estimate the noise for ”real world” data sets, and to simplify the computation we consider the noise free case. In this situation we have y∗ = t. The main prediction is a function of the y = fDi (x). Considering a 0/1 loss, we have ym = arg max(p1 , p−1 ) where p1 = PD (y = 1|x) and p−1 = PD (y = −1|x), i.e. the main prediction is the mode. To calculate p1 , having a test set T = {xj , tj }rj=1 , it is suﬃcient to count the number of learners that predict class 1 on a given input x: n fDi (xj ) = 1 p1 (xj ) = i=1 n where z = 1 if z is true and z = 0 if z is false The bias can be easily calculated after the evaluation of the main prediction: ym − t 1 if ym = t = B(x) = 0 if ym = t 2 or equivalently: B(x) = (8) 1 if pcorr (x) ≤ 0.5 0 otherwise where pcorr is the probability that a prediction is correct, i.e. pcorr (x) = P (y = t|x) = PD (fD (x) = t). In order to measure the variance V (x), if we deﬁne yDi = fDi (x), we have: V (x) = 1 1 L(ym , yDi ) = ||(ym = yDi )|| n n n n i=1 i=1 The unbiased variance Vu (x) and the biased variance Vb (x) can be calculated evaluating if the the prediction of each learner diﬀers from the main prediction respectively in the unbiased and in the biased case: n 1 ||(ym = t) and (ym = yDi )|| Vu (x) = n i=1 9 Valentini and Dietterich 1 Vb (x) = ||(ym = t) and (ym = yDi )|| n n i=1 In the noise-free case, the average loss on the example x ED (x) is calculated by a simple algebraic sum of bias, unbiased and biased variance: ED (x) = B(x) + Vu (x) − Vb (x) = B(x) + (1 − 2B(x))V (x) We can easily calculate the average bias, variance, unbiased, biased and net variance, averaging over the entire set of the examples of the test set T = {(xj , tj )}rj=1 . In the remaining part of this section the indices j refer to the examples that belong to the test set T , while the indices i refer to the training sets Di , drawn with replacement from the separated training set D, and used to train the classiﬁers fDi . The average quantities are: Average bias: r r 1 1 ym (xj ) − tj B(xj ) = Ex [B(x)] = r r 2 j=1 j=1 Average variance: 1 V (xj ) r r Ex [V (x)] = j=1 r n = 1 nr = r n 1 ||ym (xj ) = fDi (xj )|| nr L(ym (xj ), fDi (xj )) j=1 i=1 j=1 i=1 Average unbiased variance: r r n 1 1 Vu (xj ) = ||(ym (xj ) = tj ) and (ym (xj ) = fDi (xj ))|| Ex [Vu (x)] = r nr j=1 j=1 i=1 Average biased variance: 1 1 Vb (xj ) = ||(ym (xj ) = tj ) and (ym (xj ) = fDi (xj ))|| Ex [Vb (x)] = r nr r r j=1 n j=1 i=1 Average net variance: Ex [Vn (x)] = 1 1 Vn (xj ) = (Vu (xj ) − Vb (xj )) r r r r j=1 j=1 Finally, the average loss on all the examples (with no noise) is the algebraic sum of the average bias, unbiased and biased variance: Ex [L(t, y)] = Ex [B(x)] + Ex [Vu (x)] − Ex [Vb (x)] 10 Bias–variance analysis of SVMs 3.2 Measuring with small data sets In practice (unlike in theory), we have only one and often small data set S. We can simulate multiple training sets by bootstrap replicates Sb = {x|x is drawn at random with replacement from S}. In order to measure bias and variance we can use out-of-bag points, providing in such a way an unbiased estimate of the error. At ﬁrst we need to construct B bootstrap replicates of S (e. g., B = 200): S1 , . . . , SB . Then we apply a learning algorithm L to each replicate Sb to obtain hypotheses fb = L(Sb ). Let Tb = S\Sb be the data points that do not appear in Sb (out of bag points). We can use these data sets Tb to evaluate the bias–variance decomposition of the error; that is we compute the predicted values fb (x), ∀x s.t. x ∈ Tb . For each data point x, we have now the observed corresponding value t and several predictions y1 , . . . , yK , where K = |{Tb |x ∈ Tb , 1 ≤ b ≤ B}|, K ≤ B, and on the average K B/3, because about 1/3 of the predictors is not trained on a speciﬁc input x. Note that the value of K depends on the speciﬁc example x considered. Moreover if x ∈ Tb then x∈ / Sb , hence fb (x) makes a prediction on an unknown example x. In order to compute the main prediction, for a two-class classiﬁcation problem, we can deﬁne: B 1 ||(x ∈ Tb ) and (fb (x) = 1)|| p1 (x) = K b=1 p−1 (x) = B 1 ||(x ∈ Tb ) and (fb (x) = −1)|| K b=1 The main prediction ym (x) corresponds to the mode: ym = arg max(p1 , p−1 ) The bias can be calculated as in eq. 8, and the variance V (x) is: V (x) = B 1 ||(x ∈ Tb ) and (ym = fb (x))|| K b=1 Similarly can be easily computed unbiased, biased and net–variance: Vu (x) = B 1 ||(x ∈ Tb ) and (B(x) = 0) and (ym = fb (x))|| K b=1 Vb (x) = B 1 ||(x ∈ Tb ) and (B(x) = 1) and (ym = fb (x))|| K b=1 Vn (x) = Vu (x) − Vb (x) Average bias, variance, unbiased, biased and net variance, can be easily calculated averaging over all the examples. 11 Valentini and Dietterich 4. Bias–Variance Analysis in SVMs The bias–variance decomposition of the error represents a powerful tool to analyze learning processes in learning machines. According to the procedures described in the previous section, we measured bias and variance in SVMs, in order to study the relationships with diﬀerent kernel types and their parameters. To accomplish this task we computed bias– variance decomposition of the error on diﬀerent synthetic and ”real” data sets. 10 I I 8 II II 6 X2 I I 4 2 0 I II 0 2 4 6 8 10 X1 Figure 3: P2 data set, a bidimensional two class synthetic data set. Roman numbers label the regions belonging to the two classes. 4.1 Experimental setup In the experiments we employed 7 diﬀerent data sets, both synthetic and ”real”. P2 is a synthetic bidimensional two–class data set 1 ; each region is delimited by one or more of four simple polynomial and trigonometric functions (Fig. 3). The synthetic data set Waveform is generated from a combination of 2 of 3 ”base” waves; we reduced the original three classes of Waveform to two, deleting all samples pertaining to class 0. The other data sets are all from the UCI repository (Merz and Murphy, 1998). Tab. 1 summarizes the main features of the data sets used in the experiments. In order to perform a reliable evaluation of bias and variance we used small training set and large test sets. For synthetic data we generated the desired number of samples. 1. The application gensimple, that we developed to generate the data, is freely available on line at ftp://ftp.disi.unige.it/person/ValentiniG/BV/gensimple. 12 Bias–variance analysis of SVMs Table 1: Data sets used in the experiments. Data set P2 Waveform Grey-Landsat Letter Letter w. noise Spam Musk # of attr. 2 21 36 16 16 57 166 # of tr. samples 100 100 100 100 100 100 100 # of tr. sets 400 200 200 200 200 200 200 # base tr. set synthetic synthetic 4425 614 614 2301 3299 # of test samples 10000 10000 2000 613 613 2300 3299 For real data sets we used bootstrapping to replicate the data. In both cases we computed the main prediction, bias, unbiased and biased variance, net-variance according to the procedures explained in Sect. 3.1. In our experiments, the computation of James’ variance and systematic eﬀect (James, 2003) is reduced to the measurements of the net-variance and bias, and hence we did not explicitly compute these quantities (see Appendix A for details). With synthetic data sets, we generated small training sets of about 100 examples and reasonably large test sets using computer programs. In fact small samples show bias and variance more clearly than having larger samples. We produced 400 diﬀerent training sets for P2 and 200 training sets for Waveform. The test sets were chosen reasonably large (10000 examples) to obtain reliable estimates of bias and variance. For real data sets we ﬁrst divided the data into a training D and a test T sets. If the data sets had a predeﬁned training and test sets reasonably large, we used them (as in Grey-Landsat and Spam), otherwise we split them in a training and test set of equal size. Then we drew from D bootstrap samples. We chose bootstrap samples much smaller than |T | (100 examples). More precisely we drew 200 data sets from D, each one consisting of 100 examples uniformly drawn with replacement. Summarizing, both with synthetic and real data sets we generated small training sets for each data set and a much larger test set. Then all the data were normalized in such a way that for each attribute the mean was 0 and the standard deviation 1. In all our experiments we used NEURObjects (Valentini and Masulli, 2002)2 , a C++ library for the development of neural networks and machine learning applications, and SVM-light (Joachims, 1999), a set of C applications for training and testing SVMs. We developed and used the C++ application analyze BV, to perform bias–variance decomposition of the error3 . This application analyzes the output of a generic learning machine model and computes the main prediction, error, bias, net–variance, unbiased and biased variance using the 0/1 loss function. Other C++ applications have been developed to process and analyze the results, using also Cshell scripts to train, test and analyze bias– variance decomposition of all the SVM models for each speciﬁc data set. 2. Download web site: http://www.disi.unige.it/person/ValentiniG/NEURObjects. 3. The source code is available at ftp://ftp.disi.unige.it/person/ValentiniG/BV. Moreover C++ classes for bias–variance analysis have been developed as part of the NEURObjects library 13 Valentini and Dietterich Avg. err. 0.5 0.4 0.3 0.2 0.1 0 0.01 1 5 C 20 100 1000 0.10.02 0.01 0.5 0.2 2 1 5 20 10 sigma 100 50 (a) Bias Net var. 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 0.01 -0.1 0.01 1 1 5 C 5 0.02 0.01 0.2 0.1 20 100 1000 20 C 1 0.5 5 2 20 10 sigma 100 50 100 1000 (b) 0.01 0.10.02 0.5 0.2 2 1 5 20 10 sigma 100 50 (c) Unb. var. Biased var. 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0.01 0 0.01 1 1 5 C 0.02 0.01 0.2 0.1 20 100 1000 1 0.5 5 2 20 10 sigma 100 50 5 C 20 100 1000 (d) 0.2 1 0.5 5 2 20 10 sigma 100 50 0.01 0.10.02 (e) Figure 4: Grey-Landsat data set. Error (a) and its decomposition in bias (b), net variance (c), unbiased variance (d), and biased variance (e) in SVM RBF, varying both C and σ. 14 Bias–variance analysis of SVMs 4.2 Experimental Tasks To evaluate bias and variance in SVMs we conducted experiments with diﬀerent kernels (gaussian, polynomial and dot-product) and diﬀerent kernel parameters. For each kernel we considered the same set of values for the parameter C that controls the trade-oﬀ between training error and margin, ranging from C = 0.01 to C = 1000. 1. Gaussian kernels. We evaluated bias–variance decomposition varying the parameters σ of the kernel and the C parameter. In particular we analyzed: (a) The relationships between average error, bias, net–variance, unbiased and biased variance, the σ parameter of the kernel and the C parameter. (b) The relationships between generalization error, training error, number of support vectors and capacity with respect to σ. We trained RBF-SVM with all the combinations of the parameters σ and C, using the a set of values for σ ranging from σ = 0.01 to σ = 1000. We evaluated about 200 diﬀerent RBF-SVM models for each data set. 2. Polynomial kernels. We evaluated bias–variance decomposition varying the degree of the kernel and the C parameter. In particular we analyzed the relationships between average error, bias, net–variance, unbiased and biased variance, the degree of the kernel and the C parameter. We trained polynomial-SVM with several combinations of the degree parameter of the kernel and C values, using all the polynomial degrees between 1 and 10, evaluating in such a way about 120 diﬀerent polynomial-SVM models for each data set. Following the heuristic of Jakkola, the dot product of polynomial kernel was divided by the dimension of the input data, to ”normalize” the dot–product before to raise to the degree of the polynomial. 3. Dot–product kernels. We evaluated bias–variance decomposition varying the C parameter. We analyzed the relationships between average error, bias, net–variance, unbiased and biased variance and the parameter C (the regularization factor) of the kernel. We trained dot–product-SVM considering diﬀerent values for the C parameter, evaluating in such a way 12 diﬀerent dot–product-SVM models for each data set. Each SVM model required the training of 200 diﬀerent SVMs, one for each synthesized or bootstrapped data set, for a total of (204+120+12)×200 = 67200 trained SVM for each data set (134400 for the data set P2, as for this data set we used 400 data sets for each model). The experiments required the training of more than half million of SVMs, considering all the data sets and of course the testing of all the SVM previously trained in order to evaluate the bias–variance decomposition of the error of the diﬀerent SVM models. For each SVM model we computed the main prediction, bias, net-variance, biased and unbiased variance and the error on each example of the test set, and the corresponding average quantities on the overall test set. 15 Valentini and Dietterich C=10 C=10 0.5 avg. error bias net variance unbiased var. biased var 0.5 0.4 avg. error bias net variance unbiased var. biased var 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 -0.1 0.01 0.02 0.1 0.2 0.5 1 2 sigma 5 10 20 50 -0.1 100 0.01 0.02 0.1 0.2 0.5 (a) 0.5 C=0.1 0.5 avg. error bias net variance unbiased var. biased var 0.3 0.2 0.2 0.1 0.1 0 0 0.1 0.2 0.5 1 50 100 2 5 10 20 50 avg. error bias net variance unbiased var. biased var -0.1 100 0.01 0.02 0.1 0.2 0.5 (c) 1 2 sigma 5 10 20 50 100 (d) C=10 C=1 avg. error bias net variance unbiased var. biased var 0.5 0.4 0.4 0.3 0.2 0.2 0.1 0.1 0 0 0.1 0.2 0.5 1 2 sigma 5 10 20 50 avg. error bias net variance unbiased var. biased var 0.5 0.3 0.01 0.02 20 C=1 sigma -0.1 10 0.4 0.3 0.01 0.02 5 (b) 0.4 -0.1 1 2 sigma -0.1 100 (e) 0.01 0.02 0.1 0.2 0.5 1 2 sigma 5 10 20 50 100 (f) Figure 5: Bias-variance decomposition of error in bias, net variance, unbiased and biased variance in SVM RBF, varying σ and for ﬁxed C values: (a) Waveform, (b) Grey-Landsat, (c) Letter-Two with C = 0.1, (c) Letter-Two with C = 1, (e) Letter-Two with added noise and (f) Spam. 16 Bias–variance analysis of SVMs 5. Results In this section we present the results of the experiments. We analyzed bias–variance decomposition with respect to the kernel parameters considering separately gaussian, polynomial and dot product SVMs, comparing also the results among diﬀerent kernels. Here we present the main results. Full results, data and graphics are available by anonymous ftp at ftp://ftp.disi.unige.it/person/ValentiniG/papers/bv-svm.ps.gz. 5.1 Gaussian kernels Fig. 4 depicts the average loss, bias net–variance, unbiased and biased variance varying the values of σ and the regularization parameter C in RBF-SVM on the Grey-Landsat data set. We note that σ is the most important parameter: although for very low values of C the SVM cannot learn, independently of the values of σ, (Fig. 4 a), the error, the bias, and the net–variance depend mostly on the σ parameter. In particular for low values of σ, bias is very high (Fig. 4 b) and net-variance is 0, as biased and unbiased variance are about equal (Fig. 4d and 4e). Then the bias suddenly drops (Fig. 4b), lowering the average loss (Fig. 4a), and then stabilizes for higher values of σ. Interestingly enough, in this data set (but also in others, data not shown), we note an increment followed by a decrement of the net–variance, resulting in a sort of ”wave shape” of the net variance graph (Fig. 4c). Fig. 5 shows the bias–variance decomposition on diﬀerent data sets, varying σ, and for a ﬁxed value of C, that is a sort of ”slice” along the σ axis of the Fig. 4. The plots show that average loss, bias, and variance depend signiﬁcantly on σ for all the considered data sets, conﬁrming the existence of a “high biased region” for low values of σ. In this region, biased and unbiased variance are about equal (net–variance Vn = Vu − Vb is low). Then unbiased variance increases while biased variance decreases (Fig. 5 a,b,c and d), and ﬁnally both stabilize for relatively high values of σ. Interestingly, the average loss and the bias do not increase for high values of σ, especially if C is high. Bias and average loss increases with σ only for very small C values. Note that netvariance and bias show opposite trends only for small values of C (Fig. 5 c). For larger C values the symmetric trend is limited only to σ ≤ 1 (Fig. 5 d), otherwise bias stabilizes and net-variance slowly decreases. Fig. 6 shows more in detail the eﬀect of the C parameter on bias-variance decomposition. For C ≥ 1 there are no variations of the average error, bias and variance for a ﬁxed value of σ. Note that for very low values of σ (Fig. 6a and b) there is no learning. In the Letter-Two data set, as in other data sets (ﬁgures not shown), only for small C values we have variations in bias and variance values (Fig. 6). 5.1.1 Discriminant function computed by the SVM-RBF classifier In order to get insights into the behaviour of the SVM learning algorithm with gaussian kernels we plotted the real-valued functions computed without considering the discretization step performed through the sign function. The real valued function computed by a gaussian SVM is the following: yi αi exp(−xi − x2 /σ 2 ) + b f (x, α, b) = i∈SV 17 Valentini and Dietterich σ=0.01 σ=0.1 avg. error bias net variance unbiased var. biased var 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 -0.1 0.01 0.1 1 2 5 10 20 50 100 200 avg. error bias net variance unbiased var. biased var 0.5 -0.1 500 1000 0.01 0.1 1 2 5 10 C (a) 100 200 500 1000 σ=5 avg. error bias net variance unbiased var. biased var 0.5 0.4 0.4 0.3 0.2 0.2 0.1 0.1 0 0 0.01 0.1 1 2 5 10 20 50 100 200 -0.1 500 1000 0.01 0.1 1 2 5 10 20 C C (c) (d) σ=20 50 100 200 500 1000 σ=100 avg. error bias net variance unbiased var. biased var 0.5 0.4 0.4 0.3 0.2 0.2 0.1 0.1 0 0 0.1 1 2 5 10 20 50 100 200 avg. error bias net variance unbiased var. biased var 0.5 0.3 0.01 avg. error bias net variance unbiased var. biased var 0.5 0.3 -0.1 50 (b) σ=1 -0.1 20 C -0.1 500 1000 C 0.01 0.1 1 2 5 10 20 50 100 200 500 1000 C (e) (f) Figure 6: Letter-Two data set. Bias-variance decomposition of the error in bias, net variance, unbiased and biased variance in SVM RBF, while varying C and for some ﬁxed values of σ: (a) σ = 0.01, (b) σ = 0.1, (c) σ = 1, (d) σ = 5, (e) σ = 20, (f) σ = 100. 18 Bias–variance analysis of SVMs 0.5 0 −0.5 −1 Z 1 0.5 0 −0.5 −1 −1.50 10 8 2 4 6 6 X 4 8 Y 2 Figure 7: The real valued function computed by the SVM on the P2 data set with σ = 0.01, C = 1. 1 0.5 0 −0.5 −1 Z 1.5 1 0.5 0 −0.5 −1 −1.50 10 2 8 4 X 6 6 4 8 2 Y Figure 8: The real valued function computed by the SVM on the P2 data set, with σ = 1, C = 1. 19 Valentini and Dietterich where the αi are the Lagrange multipliers found by the solution of the dual optimization problem, the xi ∈ SV are the support vectors, that is the points for which αi > 0. We plotted the surface computed by the gaussian SVM with the synthetic data set P2. Indeed it is the only surface that can be easily visualized, as the data are bidimensional and the resulting real valued function can be easily represented through a wireframe threedimensional surface. The SVMs are trained with exactly the same training set composed by 100 examples. The outputs are referred to a test set of 10000 examples, selected in an uniform way through all the data domain. In particular we considered a grid of equi-spaced data at 0.1 interval in a two dimensional 10 × 10 input space. If f (x, α, b) > 0 then the SVM matches up the example x with class 1, otherwise with class 2. With small values of σ we have ”spiky” functions: the response is high around the support vectors, and is close to 0 in all the other regions of the input domain (Fig. 7). In this case we have overﬁtting: a large error on the test set (about 46 % with σ = 0.01 and 42.5 % with σ = 0.02 ), and a training error near to 0. 1 0 −1 10 0 −10 Z Z 1.5 1 0.5 0 −0.5 −1 −1.5 −20 20 15 10 5 0 −5 −10 −15 −200 10 2 10 8 4 6 X 2 8 6 4 8 4 Y 2 6 6 X 4 8 (a) (b) 1 0.5 0 −0.5 −1 Z 1.5 1 0.5 0 −0.5 −1 −1.50 2 1.5 1 0.5 0 −0.5 −1 −1.5 −2 −2.5 −30 8 4 1 0 −1 −2 Z 10 2 X 10 2 8 6 6 4 8 Y 2 2 4 Y X (c) 6 6 4 8 2 Y (d) Figure 9: The real valued function computed by the SVM on the P2 data set. (a) σ = 20 C = 1, (b) σ = 20 C = 1000, (c) σ = 500 C = 1, (d) σ = 500 C = 1000. 20 Bias–variance analysis of SVMs 0.5 C=1 C=0.1 0.5 avg. error bias net variance unbiased var. biased var 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 0.01 0.02 0.1 0.2 0.5 1 2 5 avg. error bias net variance unbiased var. biased var 10 20 50 100 200 300 400 500 1000 sigma 0.01 0.02 0.1 0.2 0.5 1 2 (a) 10 20 50 100 200 300 400 500 1000 sigma (b) C=10 C=100 0.5 0.5 avg. error bias net variance unbiased var. biased var 0.4 0.3 0.2 0.2 0.1 0.1 0 0 1 2 5 avg. error bias net variance unbiased var. biased var 0.4 0.3 0.01 0.02 0.1 0.2 0.5 5 10 20 50 100 200 300 400 500 1000 sigma (c) 0.01 0.02 0.1 0.2 0.5 1 2 5 10 20 50 100 200 300 400 500 1000 sigma (d) Figure 10: Grey-Landsat data set. Bias-variance decomposition of error in bias, net variance, unbiased and biased variance in SVM RBF, while varying σ and for some ﬁxed values of C: (a) C = 0.1, (b) C = 1, (c) C = 10, (d) C = 100. If we enlarge the values of σ we obtain a wider response on the input domain and the error decreases (with σ = 0.1 the error is about 37 %). With σ = 1 we have a smooth function that ﬁts quite well the data (Fig. 8). In this case the error drops down to about 13 %. Enlarging too much σ we have a too smooth function (Fig. 9 (a)), and the error increases to about 37 %: in this case the high bias is due to an excessive smoothing of the function. Increasing the values of the regularization parameter C (in order to better ﬁt the data), we can diminish the error to about 15 %: the shape of the function now is less smooth (Fig. 9 (b)). As noted in Scholkopf and Smola (2002), using very large values of sigma, we have a very smooth discriminant function (in practice a plane), and increasing it even further does not change anything. Indeed, enlarging σ to 500 we obtain a plane (Fig 9 (c)), and a very biased 21 Valentini and Dietterich C=1 C=1000 avg. error bias net variance unbiased var. biased var 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 0.01 0.02 0.1 0.2 0.5 1 2 5 avg. error bias net variance unbiased var. biased var 0.5 10 20 50 100 200 300 400 500 1000 sigma 0.01 0.02 0.1 0.2 0.5 1 2 (a) 10 20 50 100 200 300 400 500 1000 sigma (b) C=1 C=1000 avg. error bias net variance unbiased var. biased var 0.2 0.15 0.1 0.1 0.05 0.05 0 0 1 2 5 avg. error bias net variance unbiased var. biased var 0.2 0.15 0.01 0.02 0.1 0.2 0.5 5 10 20 50 100 200 300 400 500 1000 sigma (c) 0.01 0.02 0.1 0.2 0.5 1 2 5 10 20 50 100 200 300 400 500 1000 sigma (d) Figure 11: Bias-variance decomposition of the error in bias, net variance, unbiased and biased variance in SVM RBF, while varying σ and for some ﬁxed values of C: (a) P2, with C = 1, (b) P2, with C = 1000, (c) Musk, with C = 1, (d) Musk, with C = 1000. function (error about 45 %), and even if we increment C, we can obtain better results, but always with a large error (about 35 %, Fig 9 (d)). 5.1.2 Behavior of SVMs with large σ values Fig 4 and 5 show that σ parameter plays a sort of smoothing eﬀect, as the value of σ increases. In particular with large values of σ we did not observe any increment of bias nor decrement of variance. In order to get insights into this counter-intuitive behaviour we tried to answer these two questions: 1. Does the bias increase while variance decrease with large values of σ, and what is the combined eﬀect of bias-variance on the error? 22 Bias–variance analysis of SVMs 2. In this situation (large values for σ), what is the eﬀect of the C parameter? In Fig. 5 we do not observe an increment of bias with large values of σ, but we limited our experiments to values of σ ≤ 100. Here we investigate the eﬀect for larger values of σ (from 100 to 1000). In most cases, also increasing the values of σ right to 1000 we do not observe an increment of the bias and a substantial decrement of the variance. Only for low values of C, that is C < 1, the bias and the error increase with large values of σ (Fig. 10). With the P2 data set the situation is diﬀerent: in this case we observe an increment of the bias and the error with large values of σ, even if with large values of C the increment rate is lower (Fig. 11 a and b). Also with the musk data set we note an increment of the error with very large values of σ, but surprisingly this is due to an increment of the unbiased variance, while the bias is quite stable, at least for values of C > 1, (Fig. 11 c and d). Larger values of C counter-balance the bias introduced by large values of σ. But with some distributions of the data too large values of σ produce too smooth functions, and also incrementing C it is very diﬃcult to ﬁt the data. Indeed, the discriminant function computed by the RBF-SVM with the P2 data set (that is the function computed without considering the sign function) is too smooth for large values of σ: for σ = 20, the error is about 37%, due almost entirely to the large bias, (Fig. 9 a), and for σ = 500 the error is about 45 % and also incrementing the C value to 1000, we obtain a surface that ﬁts the data better, but with an error that remains large (about 35%). Indeed with very large values of σ the gaussian kernel becomes nearly linear (Scholkopf and Smola, 2002) and if the data set is very far from being linearly separable, as with the P2 data set (Fig. 3), the error increases, especially in the bias component (Fig. 11 (a) and (b)). Summarizing with large σ values bias can increment, while net-variance tends to stabilize, but this eﬀect can be counter-balanced by larger C values. 5.1.3 Relationships between generalization error, training error, number of support vectors and capacity Looking at Fig. 4 and 5, we see that SVMs do not learn for small values of σ. Moreover the low error region is relatively large with respect to σ and C. In this section we evaluate the relationships between the estimated generalization error, the bias, the training error, the number of support vectors and the estimated Vapnik Chervonenkis dimension, in order to answer the following questions: 1. Why SVMs do not learn for small values of σ? 2. Why we have a so large bias for small values of σ? 3. Can we use the variation of the number of support vectors to predict the ”low error” region? 4. Is there any relationship between the bias, variance and VC dimension, and can we use this last one to individuate the ”low error” region? 23 Valentini and Dietterich Error 0.5 VC dim. C=1 900 0.4 675 0.3 Error 0.5 C=10 gen. error bias train error %SV VC dim. 0.4 VC dim. 900 675 0.3 450 0.2 450 0.2 225 0.1 0 0.01 0.02 0.1 0.2 0.5 1 2 5 225 0.1 gen. error bias train error %SV VC dim. 0 0 10 20 50 100 200 300 400 500 1000 sigma 0 0.01 0.02 0.1 0.2 0.5 1 2 5 (a) Error 0.5 (b) C=100 gen. error bias train error %SV VC dim. 0.4 10 20 50 100 200 300 400 500 1000 sigma VC dim. 900 Error 0.5 C=1000 gen. error bias train error %SV VC dim. 0.4 675 VC dim. 900 675 0.3 0.3 450 450 0.2 0.2 225 225 0.1 0.1 0 0 0 0.01 0.02 0.1 0.2 0.5 1 2 5 0 0.01 0.02 0.1 0.2 0.5 10 20 50 100 200 300 400 500 1000 sigma (c) 1 2 5 10 20 50 100 200 300 400 500 1000 sigma (d) Figure 12: Letter-Two data set. Error, bias, training error, support vector rate, and estimated VC dimension in SVM RBF, while varying the σ parameter and for some ﬁxed values of C: (a) C = 1, (b) C = 10, (c) C = 100, and C = 1000. The generalization error, bias, training error, number of support vectors and the Vapnik Chervonenkis dimension are estimated averaging with respect to 400 SVMs (P2 data set) or 200 SVMs (other data sets) trained with diﬀerent bootstrapped training sets composed by 100 examples each one. The test error and the bias are estimated with respect to an independent and suﬃciently large data set. The VC dimension is estimated using the Vapnik’s bound based on the radius R of the sphere that contains all the data (in the feature space), approximated through the sphere centered in the origin, and on the norm of the weights in the feature space (Vapnik, 1998). In this way the VC dimension is overestimated but it is easy to compute and we are interested mainly in the comparison of the VC dim. of diﬀerent SVM models: V C ≤ R2 · w2 + 1 24 Bias–variance analysis of SVMs Error 0.5 VC dim. C=1 400 gen. error bias train error %SV VC dim. 0.4 Error 0.5 VC dim. C=10 400 gen. error bias train error %SV VC dim. 0.4 300 300 0.3 0.3 200 200 0.2 0.2 100 100 0.1 0.1 0 0 0 0.01 0.02 0.1 0.2 0.5 1 2 5 0 0.01 0.02 0.1 0.2 0.5 10 20 50 100 200 300 400 500 1000 sigma 1 2 5 (a) Error 0.5 (b) VC dim. C=100 400 gen. error bias train error %SV VC dim. 0.4 10 20 50 100 200 300 400 500 1000 sigma Error 0.5 C=1000 gen. error bias train error %SV VC dim. 0.4 300 0.3 VC dim. 400 300 0.3 200 200 0.2 0.2 100 100 0.1 0.1 0 0 0.01 0.02 0.1 0.2 0.5 1 2 5 0 10 20 50 100 200 300 400 500 1000 sigma 0 0.01 0.02 0.1 0.2 0.5 (c) 1 2 5 10 20 50 100 200 300 400 500 1000 sigma (d) Figure 13: Grey-Landsat data set. Error, bias, training error, support vector rate, and estimated VC dimension in SVM RBF, while varying the σ parameter and for some ﬁxed values of C: (a) C = 1, (b) C = 10, (c) C = 100, and C = 1000. where w2 = αi αj K(xi , xj )yi yj i∈SV j∈SV and R2 = max K(xi , xi ) i The number of support vectors is expressed as the halved ratio of the number (% SV ) of support vectors with respect to the total number of the training data: %SV = #SV #trainingdata · 2 In the graphs shown in Fig. 12 and Fig. 13, on the left y axis is represented the error, training error and bias, and the halved ratio of support vectors. On the right y axis is reported the estimated Vapnik Chervonenkis dimension. 25 Valentini and Dietterich For very small values of σ the training error is very small (about 0), while the number of support vectors is very high, and high is also the error and the bias (Fig.12 and 13). These facts support the hypothesis of overﬁtting problems with small values of σ. Indeed the realvalued function computed by the SVM (that is the function computed without considering the sign function, see Sect. 5.1.1) is very spiky with small values of σ (Fig. 7). The response of the SVM is high only in small areas around the support vectors, while in all the other areas ”not covered” by the support vectors the response is very low (about 0), that is the SVM is not able to get a decision, with a consequently very high bias. In the same region (small values for σ) the net variance is usually very small, for either one of these reasons: 1) biased and unbiased variance are almost equal because the SVM performs a sort of random guessing for the most part of the unknown data; 2) both biased and unbiased variance are about 0, showing that all the SVMs tend to answer in the same way independently of a particular instance of the training set (Fig. 5 a, b and f). Enlarging σ we obtain a wider response on the input domain: the real-valued function computed by the SVM becomes smoother (Fig. 8), as the ”bumps” around the support vectors become wider and the SVM can decide also on unknown examples. At the same time the number of support vectors decreases (Fig. 12 and 13). Considering the variation of the ratio of the support vectors with σ, in all data sets the trend of the rate of support vectors follows the error, with a sigmoid shape that sometimes becomes an U shape for small values of C (Fig.12 and 13). This is not surprising because it is known that the support vector ratio oﬀers an approximation of the generalization error of the SVMs (Vapnik, 1998). Moreover, on all the data sets the %SV decreases in the ”stabilized” region, while in the transition region remains high. As a consequence the decrement in the number of support vectors shows that we are entering the ”low error” region, and in principle we can use this information to detect this region. In our experiments, an inspection of the support vectors relative to the Grey-Landsat and Waveform data sets found that most of the support vectors are shared in polynomial and gaussian kernels with respectively the best degree and σ parameters. Even if these results conﬁrmed the ones found by other authors (see e.g. Vapnik (1998)), it is worth noting that we did not perform a systematic study on this topic: we considered only two data sets and we compared only few hundreds of diﬀerent SVMs. In order to analyze the role of the VC dimension on the generalization ability of learning machines, we know from Statistical learning Theory that the form of the bounds of the generalization error E of SVMs is the following: E(f (σ, C)kn )) ≤ Eemp (f (σ, C)kn )) + Φ( hk ) n (9) where f (σ, C)kn represents the set of functions computed by an RBF-SVM trained with n examples and with parameters (σk , Ck ) taken from a set of parameters S = {(σi , Ci ), i ∈ N}, Eemp represents the empirical error and Φ the conﬁdence interval that depends on the cardinality n of the data set and on the VC dimension hk of the set of functions identiﬁed by the actual selection of the parameters (σk , Ck ). In order to obtain good generalization capabilities we need to minimize both the empirical risk and the conﬁdence interval. According to Vapnik’s bounds (eq. 9), in Fig. 12 and 13 the lowest generalization error is obtained for a small empirical risk and a small estimated VC dimension. 26 Bias–variance analysis of SVMs But sometimes with relatively small values of V C we may have a very large error, as the training error and the number of support vectors increase with very large values of σ (Fig. 12 a and 13 a). Moreover with a very large estimate of the VC dimension and low empirical error (Fig. 12 and 13) we may have a relatively low generalization error. In conclusion it seems very diﬃcult to use in practice these estimate of the VC dimension to infer the generalization abilities of the SVM. In particular it seems unreliable to use the VC dimension to infer the ”low error” region of the RBF-SVM. 5.2 Polynomial and dot-product kernels In this section we analyze the characteristics of bias–variance decomposition of the error in polynomial SVMs, varying the degree of the kernel and the regularization parameter C. 0.12 C=0.1 0.12 avg. error bias net variance unbiased var. biased var 0.1 0.08 0.06 0.06 0.04 0.04 0.02 0.02 1 2 3 4 5 6 7 polynomial degree 8 9 avg. error bias net variance unbiased var. biased var 0.1 0.08 0 C=50 0 10 1 2 3 4 5 6 7 polynomial degree (a) 0.2 C=0.1 0.2 avg. error bias net variance unbiased var. biased var 0.1 0.05 0.05 2 3 4 5 6 7 polynomial degree 10 8 9 C=50 avg. error bias net variance unbiased var. biased var 0.15 0.1 1 9 (b) 0.15 0 8 0 10 (c) 1 2 3 4 5 6 7 polynomial degree 8 9 10 (d) Figure 14: Bias-variance decomposition of the error in bias, net variance, unbiased and biased variance in polynomial SVM, while varying the degree and for some ﬁxed values of C: (a) Waveform, C = 0.1, (b) Waveform, C = 50, (c) Letter-Two, C = 0.1, (d) Letter-Two, C = 50. 27 Valentini and Dietterich Avg. err. 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 0.1 1 2 C 5 10 20 100 500 10 9 6 7 8 5 4 degree 3 2 (a) Bias 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 0.1 Net var. 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 1 C 2 5 10 20 100 500 10 9 8 7 6 4 5 degree 3 2 0.1 1 C 2 5 10 20 100 500 (b) 10 9 8 7 6 5 4 degree (c) Figure 15: P2 data set. Error (a) and its decomposition in bias (b) and net variance (c), varying both C and the polynomial degree. Error shows a U shape with respect to the degree. This shape depends on unbiased variance (Fig. 14 a and b), or both by bias and unbiased variance (Fig. 14 c and d). The U shape of the error with respect to the degree tends to be more ﬂat for increasing values of C, and net-variance and bias show often opposite trends (Fig. 15). Average error and bias tends to be higher for low C and degree values, but, incrementing the degree, the error is less sensitive to C values (Fig. 16). Bias is ﬂat (Fig. 17 a) or decreasing with respect to the degree (Fig. 15 b), or it can be constant or decreasing, depending on C (Fig. 17 b). Unbiased variance shows an U shape (Fig. 14 a and b) or it increases (Fig. 14 c) with respect to the degree, and the net–variance follows the shape of the unbiased variance. Note that in the P2 data set (Fig. 15) bias and net–variance follow the classical opposite trends with respect to the degree. This is not the case with other data sets (see, e.g. Fig. 14). 28 3 2 Bias–variance analysis of SVMs degree=2 degree=3 avg. error bias net variance unbiased var. biased var 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 -0.1 0.01 0.1 1 2 5 10 20 50 100 200 -0.1 500 1000 0.1 1 2 5 10 20 C (a) (b) 50 100 200 500 1000 degree=10 avg. error bias net variance unbiased var. biased var 0.5 0.4 0.4 0.3 0.2 0.2 0.1 0.1 0 0 0.1 1 2 5 10 20 50 100 200 avg. error bias net variance unbiased var. biased var 0.5 0.3 0.01 0.01 C degree=5 -0.1 avg. error bias net variance unbiased var. biased var 0.5 -0.1 500 1000 0.01 0.1 1 2 5 10 20 C C (c) (d) 50 100 200 500 1000 Figure 16: Letter-Two data set. Bias-variance decomposition of error in bias, net variance, unbiased and biased variance in polynomial SVM, while varying C and for some polynomial degrees: (a) degree = 2, (b) degree = 3, (c) degree = 5, (d) degree = 10 For large values of C bias and net–variance tend to converge, as a result of the bias reduction and net–variance increment (Fig. 18), or because both stabilize at similar values (Fig. 16). In dot–product SVMs bias and net–variance show opposite trends: bias decreases, while net–variance and unbiased variance tend to increase with C (Fig. 19). On the data set P2 this trend is not observed, as in this task the bias is very high and the SVM does not perform better than random guessing (Fig. 19a). The minimum of the average loss for relatively low values of C is the result of the decrement of the bias and the increment of the net–variance: it is achieved usually before the crossover of bias and net–variance curves and before the stabilization of the bias and the net–variance for large values of C. The biased variance remains small independently of C. 29 Valentini and Dietterich Bias Bias 0.5 0.2 0.4 0.15 0.3 0.1 0.2 0.05 0.1 0 0.01 0 0.01 1 1 5 20 C 100 1000 10 9 8 7 6 4 5 degree 3 2 5 1 20 C 100 1000 10 9 (a) 8 7 6 5 4 degree (b) Figure 17: Bias in polynomial SVMs with (a) Waveform and (b) Spam data sets, varying both C and polynomial degree. degree=6 degree=3 0.4 0.4 avg. error bias net variance unbiased var. biased var 0.35 0.3 avg. error bias net variance unbiased var. biased var 0.3 0.25 0.2 0.2 0.1 0.15 0.1 0 0.05 0 0.01 0.1 1 2 5 10 20 50 100 200 -0.1 500 1000 0.01 0.1 1 2 5 10 20 C C (a) (b) 50 100 200 500 1000 Figure 18: Bias-variance decomposition of the error in bias, net variance, unbiased and biased variance in polynomial SVM, varying C: (a) P2 data set with degree = 6, (b) Spam data set with degree = 3. 5.3 Comparing kernels In this section we compare the bias–variance decomposition of the error with respect to the C parameter, considering gaussian, polynomial and dot–product kernels. For each kernel and for each data set the best results are selected. Tab. 2 shows the best results achieved by the SVM, considering each kernel and each data set used in the experiments. Interestingly enough in 3 data sets (Waveform, Letter-Two with added noise and Spam) there are not 30 3 2 1 Bias–variance analysis of SVMs 0.08 avg. error bias net variance unbiased var. biased var 0.6 0.5 avg. error bias net variance unbiased var. biased var 0.07 0.06 0.05 0.4 0.04 0.3 0.03 0.2 0.02 0.1 0 0.01 0.01 0.1 1 2 5 10 20 50 100 200 0 500 1000 0.01 0.1 1 2 5 10 20 50 100 200 500 1000 C C (a) (b) 0.2 0.5 avg. error bias net variance unbiased var. biased var 0.15 avg. error bias net variance unbiased var. biased var 0.4 0.3 0.1 0.2 0.05 0.1 0 0.01 0.1 1 2 5 10 20 50 100 200 0 500 1000 0.01 0.1 1 2 5 10 20 C C (c) (d) 0.25 avg. error bias net variance unbiased var. biased var 0.2 50 100 200 500 1000 avg. error bias net variance unbiased var. biased var 0.2 0.15 0.15 0.1 0.1 0.05 0.05 0 0.01 0.1 1 2 5 10 20 50 100 200 0 500 1000 C 0.01 0.1 1 2 5 10 20 50 100 200 500 1000 C (e) (f) Figure 19: Bias-variance decomposition of error in bias, net variance, unbiased and biased variance in dot-product SVM, varying C: (a) P2, (b) Grey-Landsat, (c) LetterTwo, (d) Letter-Two with added noise, (e) Spam, (f) Musk. signiﬁcant diﬀerences in the error between the kernels, but there are diﬀerences in bias, net– variance, unbiased or biased variance. In the other data sets gaussian kernels outperform 31 Valentini and Dietterich Table 2: Compared best results with diﬀerent kernels and data sets. RBF-SVM stands for SVM with gaussian kernel; Poly-SVM for SVM with polynomial kernel and D-prod SVM for SVM with dot-product kernel. Var unb. and Var. bias. stand for unbiased and biased variance. Parameters Data set P2 RBF-SVM C = 20, σ = 2 Poly-SVM C = 10, degree = 5 D-prod SVM C = 500 Data set Waveform RBF-SVM C = 1, σ = 50 Poly-SVM C = 1, degree = 1 D-prod SVM C = 0.1 Data set Grey-Landsat RBF-SVM C = 2, σ = 20 Poly-SVM C = 0.1, degree = 5 D-prod SVM C = 0.1 Data set Letter-Two RBF-SVM C = 5, σ = 20 Poly-SVM C = 2, degree = 2 D-prod SVM C = 0.1 Data set Letter-Two with added noise RBF-SVM C = 10, σ = 100 Poly-SVM C = 1, degree = 2 D-prod SVM C = 0.1 Data set Spam RBF-SVM C = 5, σ = 100 Poly-SVM C = 2, degree = 2 D-prod SVM C = 0.1 Data set Musk RBF-SVM C = 2, σ = 100 Poly-SVM C = 2, degree = 2 D-prod SVM C = 0.01 Avg. Error Bias Var. unb. Var. bias. Net Var. 0.1516 0.2108 0.4711 0.0500 0.1309 0.4504 0.1221 0.1261 0.1317 0.0205 0.0461 0.1109 0.1016 0.0799 0.0207 0.0706 0.0760 0.0746 0.0508 0.0509 0.0512 0.0356 0.0417 0.0397 0.0157 0.0165 0.0163 0.0198 0.0251 0.0234 0.0382 0.0402 0.0450 0.0315 0.0355 0.0415 0.0137 0.0116 0.0113 0.0069 0.0069 0.0078 0.0068 0.0047 0.0035 0.0743 0.0745 0.0908 0.0359 0.0391 0.0767 0.0483 0.0465 0.0347 0.0098 0.0111 0.0205 0.0384 0.0353 0.0142 0.3362 0.3432 0.3410 0.2799 0.2799 0.3109 0.0988 0.1094 0.0828 0.0425 0.0461 0.0527 0.0563 0.0633 0.0301 0.1263 0.1292 0.1306 0.0987 0.0969 0.0965 0.0488 0.0510 0.0547 0.0213 0.0188 0.0205 0.0275 0.0323 0.0341 0.0884 0.1163 0.1229 0.0800 0.0785 0.1118 0.0217 0.0553 0.0264 0.0133 0.0175 0.0154 0.0084 0.0378 0.0110 polynomial and dot–product kernels, lowering bias or net–variance or both. Considering bias and net–variance, in some cases they are lower for polynomial or dot–product kernel, showing that diﬀerent kernels learn in diﬀerent ways with diﬀerent data. Considering the data set P2 (Fig. 20 a, c, e), RBF-SVMs achieve the best results, as bias is lower. Unbiased variance is comparable between polynomial and gaussian kernel, while net–variance is lower, as biased variance is higher for polynomial-SVM. In this task the bias of dot–product SVM is very high. Also in the data set Musk (Fig. 20 b, d, f) RBF-SVM obtains the best results, but in this case unbiased variance is responsible for this fact, while bias is similar. With the other data sets the bias is similar between RBF32 Bias–variance analysis of SVMs SVM and polynomial-SVM, but for dot–product SVM often the bias is higher (Fig. 21 b, d, f). Interestingly enough RBF-SVM seems to be more sensible to the C value with respect to both polynomial and dot–product SVM: for C < 0.1 in some data sets the bias is much higher (Fig. 21 a, c, e). With respect to C bias and unbiased variance show sometimes opposite trends, independently of the kernel: bias decreases, while unbiased variance increases, but this does not occur in some data sets. We outline also that the shape of the error, bias and variance curves is similar between diﬀerent kernels in all the considered data sets: that is, well-tuned SVMs having diﬀerent kernels tend to show similar trends of the bias and variance curves with respect to the C parameter. 33 Valentini and Dietterich σ=1 0.5 0.2 avg. error bias net variance unbiased var. biased var 0.4 0.1 0.2 0.05 0.1 0 0.01 0.1 1 2 5 10 20 50 100 200 avg. error bias net variance unbiased var. biased var 0.15 0.3 0 σ=100 -0.05 500 1000 0.01 0.1 1 2 5 10 C (a) degree=5 0.2 avg. error bias net variance unbiased var. biased var 0.4 0.1 0.2 0.05 0.1 0 0.1 1 2 5 10 100 200 500 1000 20 50 100 200 degree=2 avg. error bias net variance unbiased var. biased var 0.15 0.3 0.01 50 (b) 0.5 0 20 C -0.05 500 1000 0.01 0.1 1 2 5 10 C 20 50 100 200 500 1000 C (c) (d) 0.2 avg. error bias net variance unbiased var. biased var 0.6 0.5 avg. error bias net variance unbiased var. biased var 0.15 0.1 0.4 0.3 0.05 0.2 0 0.1 0 0.01 0.1 1 2 5 10 20 50 100 200 -0.05 500 1000 C 0.01 0.1 1 2 5 10 20 50 100 200 500 1000 C (e) (f) Figure 20: Bias-variance decomposition of the error in bias, net variance, unbiased and biased variance with respect to C, considering diﬀerent kernels. (a) P2, gaussian; (b) Musk, gaussian (c) P2, polynomial; (d) Musk, polynomial; (e) P2, dot– product; (f) Musk, dot–product. 34 Bias–variance analysis of SVMs 0.2 σ=20 0.3 avg. error bias net variance unbiased var. biased var 0.15 σ=20 avg. error bias net variance unbiased var. biased var 0.25 0.2 0.1 0.15 0.1 0.05 0.05 0 0.01 0.1 1 2 5 10 20 50 100 200 0 0.01 500 1000 0.1 1 2 5 10 20 50 C C (a) (b) 100 200 500 1000 degree=2 degree=3 0.2 0.3 avg. error bias net variance unbiased var. biased var 0.15 avg. error bias net variance unbiased var. biased var 0.25 0.2 0.15 0.1 0.1 0.05 0.05 0 0.01 0.1 1 2 5 10 20 50 100 200 0 500 1000 0.01 0.1 1 2 5 10 20 C C (c) (d) 0.2 50 100 0.3 avg. error bias net variance unbiased var. biased var 200 500 1000 avg. error bias net variance unbiased var. biased var 0.25 0.15 0.2 0.1 0.15 0.1 0.05 0.05 0 0.01 0.1 1 2 5 10 20 50 100 200 0 500 1000 C 0.01 0.1 1 2 5 10 20 50 100 200 500 1000 C (e) (f) Figure 21: Bias-variance decomposition of the error in bias, net variance, unbiased and biased variance with respect to C, considering diﬀerent kernels. (a) Waveform, gaussian; (b) Letter-Two, gaussian (c) Waveform, polynomial; (d) Letter-Two, polynomial; (e) Waveform, dot–product; (f) Letter-Two, dot–product. 35 Valentini and Dietterich Table 3: Evaluation of the variation of the estimated values of bias variance decomposition with the Musk data set. RBF-SVM stands for SVM with gaussian kernel; Poly-SVM for SVM with polynomial kernel and D-prod SVM for SVM with dotproduct kernel. Net Var. Var unb. and Var. bias. stand for net, unbiased and biased variance. For each value is represented the mean value over 100 replicated experiments and the corresponding value of the standard deviation. Kernel type RBF Poly D-prod Avg. Error 0.0901 ± 0.0087 0.1158 ± 0.0069 0.1305 ± 0.0133 Bias 0.0805 ± 0.0126 0.0782 ± 0.0083 0.1179 ± 0.0140 Var. unb. 0.0237 ± 0.0039 0.0585 ± 0.0071 0.0285 ± 0.0084 Var. bias. 0.0141 ± 0.0025 0.0109 ± 0.0018 0.0159 ± 0.0045 Net Var. 0.0096 ± 0.0019 0.0376 ± 0.0047 0.0126 ± 0.0035 In our experiments we used relatively small training sets (100 examples), while the number of input variables ranged from 2 (P2 data set) to 166 (Musk data set). Hence, even if for each SVM model (that is for each combination of SVM parameters) we used 200 training sets Di , 1 ≤ i ≤ 200 in order to train 200 diﬀerent classiﬁers fDi , you could wonder whether the estimated quantities (average error, bias, net-variance, unbiased and biased variance) could be noisy. An extensive evaluation of the sensitivity of the estimated quantities to the sampling procedure would be very expensive. Indeed if we replicate only 10 times our experiments on all the data sets, we should train and test more than 5 millions of diﬀerent SVMs. Anyway, in order to get insights about this problem, we performed 100 replicates of our experiments limited only to the Musk data set (that is the data set with the largest dimensionality in our experiments), for a subset of the parameters near the optimal ones. We found that the standard deviation of the estimated values is not too large. For instance, considering the best model for gaussian, polynomial and dot-product kernels we obtained the values shown in Tab. 3. It seems that the computed quantities are not too noisy, even if we need more experiments to conﬁrm this result. 5.4 Bias–variance decomposition with noisy data While the estimation of the noise is quite straightforward with synthetic data, it is a difﬁcult task with ”real” data James (2003). For these reasons, and in order to simplify the computation and the overall analysis, in our experiments we did not explicitly consider noise. Anyway, noise can play a signiﬁcant role in the bias–variance analysis. Indeed, according to Domingos, with the 0/1 loss the noise is linearly added to the error with a coeﬃcient equal to 2PD (fD (x) = y∗ ) − 1 (eq. 6). Hence, if the classiﬁer is accurate, that is if PD (fD (x) = y∗ ) 0.5, then the noise N (x), if present, inﬂuences the expected loss. In the opposite situation also, with very bad classiﬁers, that is when PD (fD (x) = y∗ ) 0.5, the noise inﬂuences the overall error in the opposite sense: it reduces the expected loss. 36 Bias–variance analysis of SVMs σ=5 σ=5 avg. error bias net variance unbiased var. biased var 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 -0.1 0.01 0.1 1 2 5 10 20 50 100 200 -0.1 500 1000 0.1 1 2 5 10 20 C (a) (b) degree=3 50 100 200 500 1000 degree=3 avg. error bias net variance unbiased var. biased var 0.4 0.4 0.3 0.2 0.2 0.1 0.1 0 0 0.1 1 2 5 10 20 50 100 200 avg. error bias net variance unbiased var. biased var 0.5 0.3 0.01 0.01 C 0.5 -0.1 avg. error bias net variance unbiased var. biased var 0.5 -0.1 500 1000 0.01 0.1 1 2 5 10 20 C C (c) (d) 50 100 200 500 1000 Figure 22: Eﬀect of noise on bias and variance. The bias–variance decomposition of the error is shown while varying the C regularization parameter with polynomial and gaussian kernels. (a) Letter-Two: gaussian kernel, σ = 5, (b) Letter-Two with added noise: gaussian kernel, σ = 5, (c) Letter-Two: polynomial kernel, degree = 3, (d) Letter-Two with added noise: polynomial kernel, degree = 3. If PD (fD (x) = y∗ ) ≈ 0.5, that is if the classiﬁer performs a sort of random guessing, then 2PD (fD (x) = y∗ ) − 1 ≈ 0 and the noise has no substantial impact on the error. Hence if we know that the noise is small we can disregard it, but what about the eﬀect of noise when it is present but not explicitly considered in the bias–variance decomposition of the error? The analysis of the results in the data set Letter-Two without and with 20 % added noise shows that the main eﬀect of noise in this speciﬁc situation consists in incrementing the bias and consequently the average error. Indeed, with gaussian kernels (Fig. 22 (a) and (b)) the bias is raised to about 0.3, with an increment of about 0.25 with respect to the data set without noise, while the net–variance is incremented only by about 0.02, as the increment of the unbiased variance is counter-balanced by the increment of the biased variance. A similar behavior is registered also with polynomial (Fig. 22 (c) and (d)) and dot-product kernels (Fig. 19 (c) and (d)). 37 Valentini and Dietterich 0.5 avg. error bias net variance unbiased var. biased var 0.4 Bias drops down 0.3 Wave-shaped net-variance 0.2 Comparable unbiased biased variance No bias-variance variations 0.1 0 -0.1 0.01 0.02 0.1 0.2 0.5 1 2 5 10 20 50 100 sigma High bias region Transition region Stabilized region Figure 23: The 3 regions of the error in RBF-SVM with respect to σ. 6. Characterization of Bias–Variance Decomposition of the Error Despite the diﬀerences observed in diﬀerent data sets, common patterns of bias and variance can be detected for each of the kernels considered in this study. Each kernel presents a speciﬁc characterization of bias and variance with respect to its speciﬁc parameters, as explained in the following sections. 6.1 Gaussian kernels Error, bias, net–variance, unbiased and biased variance show a common trend in the 7 data sets we used in the experiments. Some diﬀerences, of course, arise in the diﬀerent data sets, but we can distinguish three diﬀerent regions in the error analysis of RBF-SVM, with respect to increasing values of σ (Fig. 23): 1. High bias region. For low values of σ, error is high: it depends on high bias. Net– variance is about 0 as biased and unbiased variance are equivalent. In this region there are no remarkable ﬂuctuations of bias and variance: both remain constant, with high values of bias and comparable values of unbiased and biased variance, leading to net–variance values near to 0. In some cases biased and unbiased variance are about equal, but diﬀerent from 0, in other cases they are equal, but near to 0. 2. Transition region. Suddenly, for a critical value of σ, the bias decreases rapidly. This critical value depends also on C: for very low values of C, we have no learning, 38 Bias–variance analysis of SVMs then for higher values the bias drops. Higher values of C cause the critical value of σ to decrease (Fig. 4 (b) and 5). In this region the increase in net–variance is lower than the decrease in bias: so the average error decreases. The boundary of this region can be determined at the point where the error stops decrementing. This region is characterized also by a particular trend of the net–variance. We can distinguish two main behaviors: (a) Wave-shaped net–variance. Net–variance ﬁrst increases and then decreases, producing a wave-shaped curve with respect to σ. The initial increment of the net–variance is due to the simultaneous increment of the unbiased variance and decrement of the biased variance. In the second part of the transition region, biased variance stabilizes and unbiased variance decreases, producing a parallel decrement of the net–variance. The rapid decrement of the error with σ is due to the rapid decrement of the bias, after which the bias stabilizes and the further decrement of the error with σ is determined by the net–variance reduction (Fig. 4c, 5). (b) Semi-wave-shaped net–variance. In other cases the net–variance curve with σ is not so clearly wave-shaped: the descending part is very reduced (Fig. 5 e, f). In particular in the musk data set we have a continuous increment of the net–variance (due to the continuous growing of the unbiased variance with σ), and no wave-shaped curve is observed (at least for C > 10, Fig. 11 d). In both cases the increment of net–variance is slower than the increment in bias: as a result, the average error decreases. 3. Stabilized region. This region is characterized by small or no variations in bias and net–variance. For high values of σ both bias and net–variance stabilize and the average error is constant (Fig. 4, 5). In other data sets the error increases with σ, because of the increment of the bias (Fig. 11 a,b) or the unbiased variance (Fig. 11 c,d). In the ﬁrst region, bias rules SVM behavior: in most cases the bias is constant and close to 0.5, showing that we have a sort of random guessing, without eﬀective learning. It appears that the area of inﬂuence of each support vector is too small (Fig. 7), and the learning machine overﬁts the data. This is conﬁrmed by the fact that in this region the training error is about 0 and almost all the training points are support vectors. In the transition region, the SVM starts to learn, adapting itself to the data characteristics. Bias rapidly goes down (at the expenses of a growing net–variance), but for higher values of σ (in the second part of the transition region), sometimes net–variance also goes down, working to lower the error (Fig. 5). Even if the third region is characterized by no variations in bias and variance, sometimes for low values of C, the error increases with σ (Fig. 10 a, 12 a), as a result of the bias increment; on the whole RBF-SVMs are sensitive to low values of C: if C is too low, then bias can grow quickly. High values of C lower the bias(Fig. 12 c, d). 39 Valentini and Dietterich 0.2 avg. error bias net variance unbiased var. biased var The error depends both of bias and unb. variance 0.15 U-shape of the error 0.1 Bias and net-var. switch the main contribution to the error 0.05 0 1 2 3 4 5 6 7 polynomial degree 8 9 10 Figure 24: Behaviour of polynomial SVM with respect of the bias–variance decomposition of the error. 6.2 Polynomial and dot-product kernels For polynomial and dot–product SVMs, we have also characterized the behavior of SVMs in terms of average error, bias, net–variance, unbiased and biased variance, even if we are not able to distinguish between diﬀerent regions clearly deﬁned. However, common patterns of the error curves with respect to the polynomial degree, considering bias, net–variance and unbiased and biased variance can be noticed. The average loss curve shows in general a U shape with respect to the polynomial degree, and this shape may depend on both bias and unbiased variance or in some cases mostly on the unbiased variance according to the characteristics of the data set. From these general observations we can schematically distinguish two main global pictures of the behaviour of polynomial SVM with respect to the bias–variance decomposition of the error: 1. Error curve shape bias–variance dependent. In this case the shape of the error curve is dependent both on the unbiased variance and the bias. The trend of bias and net–variance can be symmetric or they can also have non coincident paraboloid shape, depending on C parameter values (Fig. 14 c, d and 15). Note that bias and net variance show often opposite trends (Fig. 15). 2. Error curve shape unbiased variance dependent. In this case the shape of the error curve is mainly dependent on the unbiased variance. The bias (and the biased variance) tend to be degree independent, especially for high values of C (Fig. 14 a, b) . 40 Bias–variance analysis of SVMs 0.25 avg. error bias net variance unbiased var. biased var Minimum of the error due to large decrement of bias 0.2 0.15 Stabilized region Opposite trends of bias and net-var. 0.1 0.05 0 Low biased var. independent of C 0.01 0.1 1 2 5 10 20 50 100 200 500 1000 C Figure 25: Behaviour of the dot–product SVM with respect of the bias–variance decomposition of the error. Fig. 24 schematically summarizes the main characteristics of the bias–variance decomposition of error in polynomial SVM. Note however that the error curve depends for the most part on both variance and bias: the prevalence of the unbiased variance (Fig. 14 a, b) or the bias seems to depend mostly on the distribution of the data. The increment of the values of C tends to ﬂatten the U shape of the error curve: in particular for large C values bias becomes independent with respect to the degree (Fig. 17). Moreover the C parameter plays also a regularization role (Fig. 18) Dot–product SVM are characterized by opposite trends of bias and net–variance: bias decrements, while net–variance grows with respect to C; then, for higher values of C both stabilize. The combined eﬀect of these symmetric curves produces a minimum of the error for low values of C, as the initial decrement of bias with C is larger than the initial increment of net–variance. Then the error slightly increases and stabilizes with C (Fig. 19). The shape of the net–variance curve is determined mainly by the unbiased variance: it increases and then stabilizes with respect to C. On the other hand the biased variance curve is ﬂat, remaining small for all values of C. A schematic picture of this behaviour is given in Fig. 25. 41 Valentini and Dietterich 7. Two directions for developing ensembles of SVMs In addition to providing insights into the behavior of SVMs, the analysis of the bias– variance decomposition of the error can identify the situations in which ensemble methods might improve SVM performance. On several real-world problems, SVM ensembles are reported to give improvements over single SVMs (Kim et al., 2002, Valentini et al., 2003), but few works showed also negative experimental results about ensembles of SVMs (Buciu et al., 2001, Evgeniou et al., 2000). In particular Evgeniou et al. (2000) experimentally found that leave-one-out error bounds for kernel machines ensembles are tighter that the equivalent ones for single machines, but they showed that with accurate parameters tuning single SVMs and ensembles of SVMs perform similarly. In this section we propose to exploit bias–variance analysis in order to develop ensemble methods well tuned to the bias–variance characteristics of the base learners. In particular we present two possible ways of applying bias–variance analysis to develop SVM-based ensemble methods. 7.1 Bagged Ensemble of Selected Low-Bias SVMs From a general standpoint, considering diﬀerent kernels and diﬀerent parameters of the kernel, we can observe that the minimum of the error, bias and net–variance (and in particular unbiased variance) do not match. For instance, considering RBF-SVM we see that we achieve the minimum of the error, bias and net–variance for diﬀerent values of σ (see, for instance, Fig. 5). Similar considerations can also be applied to polynomial and dot–product SVM. Often, modifying parameters of the kernel, if we gain in bias we lose in variance and vice versa, even if this is not a rule. Under the bootstrap assumption, bagging reduces only variance. From bias-variance decomposition we know that unbiased variance reduces the error, while biased variance increases the error. Hence bagging should be applied to low-biased classiﬁers, because the biased variance will be small. Summarizing, we can schematically consider the following observations: • We know that bagging lowers net–variance (in particular unbiased variance) but not bias (Breiman, 1996b). • SVMs are strong, low-biased learners, but this property depends on the proper selection of the kernel and its parameters. • If we can identify low-biased base learners with a relatively high unbiased variance, bagging can lower the error. • Bias–variance analysis can identify SVMs with low bias. Hence a basic high–level algorithm for a general Bagged ensemble of selected low-bias SVMs is the following: 1. Estimate bias-variance decomposition of the error for diﬀerent SVM models 2. Select the SVM model with the lowest bias 42 Bias–variance analysis of SVMs 3. Perform bagging using as base learner the SVM with the estimated lowest bias. This approach combines the low bias properties of SVMs with the low unbiased variance properties of bagging and should produce ensembles with low overall error. We named this approach Lobag, that stands for Low bias bagging. Using SVMs as base learners, depending on the type of kernel and parameters considered, and on the way the bias is estimated for the diﬀerent SVM models, diﬀerent algorithmic variants can be provided: For instance, depending on the type of kernel and parameters considered, diﬀerent implementations can be given: 1. Selecting the RBF-SVM with the lowest bias with respect to the C and σ parameters. 2. Selecting the polynomial-SVM with the lowest bias with respect to the C and degree parameters. 3. Selecting the dot–prod-SVM with the lowest bias with respect to the C parameter. 4. Selecting the SVM with the lowest bias with respect to the kernel. Another issue is how to implement the estimation of the bias–variance decomposition of the error for diﬀerent SVM models. We could use cross-validation in conjunction with bootstrap replicates, or out-of-bag estimates (especially if we have small training sets), or hold-out techniques in conjunction with bootstrap replicates if we have suﬃciently large training sets. A ﬁrst implementation of this approach, using an out-of-bag estimate of the bias– variance decomposition of the error, has been proposed, and quite encouraging results have been achieved (Valentini and Dietterich, 2003). Another problem is the estimate of the noise in real data sets. A straightforward approach simply consists in disregarding it, but in this way we could overestimate the bias (see Sect. 5.4). Some heuristics are proposed in James (2003), but the problem remains substantially unresolved. It is worth noting that this approach can be viewed as an alternative way for tuning SVM parameters, using an ensemble instead of a single SVM. From this standpoint recent works proposed to automatically choose multiple kernel parameters (Chapelle et al., 2002, Grandvalet and Canu, 2003), setting, for instance diﬀerent σ values for each input dimension in gaussian kernels, by applying a minimax procedure to iteratively maximize the margin of the SVM and to minimize an estimate of the generalization error over the set of kernel parameters (Chapelle et al., 2002). This promising approach could be in principle extended to minimize the bias, instead of the overall error. To this purpose we need to solve non trivial problems such as providing an upper bound for the bias and the variance, or at least an easy to compute their estimator having, if possible, an analytical expression. This approach could represent a new interesting research line that could improve the performances and/or reduce the computational burden of the Lobag method. 7.2 Heterogeneous Ensembles of SVM The analysis of bias–variance decomposition of error in SVM shows that the minimum of the overall error, bias, net–variance, unbiased and biased variance occur often in diﬀerent 43 Valentini and Dietterich SVM models. These diﬀerent behaviors of diﬀerent SVM models could be in principle exploited to produce diversity in ensembles of SVMs. Although the diversity of base learner itself does not assure the error of the ensemble will be reduced (Kuncheva et al., 2001b), the combination of accuracy and diversity in most cases does (Dietterich, 2000a). As a consequence, we could select diﬀerent SVM models as base learners by evaluating their accuracy and diversity through the bias-variance decomposition of the error. Our results show that the “optimal region” (low average loss region) is quite large in RBF-SVMs (Fig. 4). This means that C and σ do not need to be tuned extremely carefully. From this point of view, we can avoid time-consuming model selection by combining RBF-SVMs trained with diﬀerent σ values all chosen from within the “optimal region.” For instance, if we know that the error curve looks like the one depicted in Fig. 23, we could try to ﬁt a sigmoid-like curve using only few values to estimate where the stabilized region is located. Then we could train an heterogeneous ensemble of SVMs with diﬀerent σ parameters (located in the low bias region) and average them according to their estimated accuracy. A high-level algorithm for Heterogeneous Ensembles of SVMs could include the following steps: 1. Individuate the ”optimal region” through bias–variance analysis of the error 2. Select the SVMs with parameters chosen from within the optimal region deﬁned by bias-variance analysis. 3. Combine the selected SVMs by majority or weighted voting according to their estimated accuracy. We could use diﬀerent methods or heuristics to ﬁnd the ”optimal region” (see Sect. 5.1.3) and we have to deﬁne also the criterion used to select the SVM models inside the ”optimal region” (for instance, improvement of the diversity). The combination could be performed using also other approaches, such as minimum, maximum, average and OWA aggregating operators (Kittler et al., 1998) or Behavior-Knowledge space method (Huang and C.Y., 1995), Fuzzy aggregation rules (Wang et al., 1998), Decision templates (Kuncheva et al., 2001a) or Meta-learning techniques (Prodromidis et al., 1999). Bagging and boosting (Freund and Schapire, 1996) methods can also be combined with this approach to further improve diversity and accuracy of the base learners. 7.3 Numerical experiments with Low bias bagged SVMs In order to show that these research directions could be fruitful to follow further, we performed numerical experiments on diﬀerent data sets to test the Lobag ensemble method using SVMs as base learners. We compared the results with single SVMs and classical bagged SVM ensembles. We report here some preliminary results.More detailed results are reported in Valentini and Dietterich (2003). We employed the 7 diﬀerent two-class data sets described in Sect. 4.1, using small D training sets and large test T sets in order to obtain a reliable estimate of the generalization error: the number of examples for D was set to 100, while the size of T ranged from a few thousands for the “real” data sets to ten thousands for synthetic data sets. Then we 44 Bias–variance analysis of SVMs Table 4: Results of the experiments using pairs of train D and test T sets. Elobag , Ebag and ESV M stand respectively for estimated error of lobag, bagged and single SVMs on the test set T . The three last columns show the conﬁdence level according to the Mc Nemar test. L/B, L/S and B/S stand respectively for the comparison Lobag/Bagging, Lobag/Single SVM and Bagging/Single SVM. If the conﬁdence level is equal to 1, no signiﬁcant diﬀerence is registered. Kernel type Polyn. Gauss. Linear Polyn. Gauss. Linear Polyn. Gauss. Linear Polyn. Gauss. Linear Polyn. Gauss. Linear Polyn. Gauss. Linear Polyn. Gauss. Elobag Ebag Esingle Confidence level L/B L/S B/S Data set 0.1735 0.1375 Data set 0.0740 0.0693 0.0601 Data set 0.0540 0.0400 0.0435 Data set 0.0881 0.0701 0.0668 Data set 0.3535 0.3404 0.3338 Data set 0.1408 0.0960 0.1130 Data set 0.1291 0.1018 0.0985 P2 0.2008 0.2097 0.001 0.001 0.1530 0.1703 0.001 0.001 Waveform 0.0726 0.0939 1 0.001 0.0707 0.0724 1 0.1 0.0652 0.0692 0.001 0.001 Grey-Landsat 0.0540 0.0650 1 0.001 0.0440 0.0480 1 0.1 0.0470 0.0475 0.1 0.1 Letter-Two 0.0929 0.1011 1 0.025 0.0717 0.0831 1 0.05 0.0717 0.0799 1 1 Letter-Two with added noise 0.3518 0.3747 1 1 0.3715 0.3993 1 0.05 0.3764 0.3829 0.05 0.025 Spam 0.1352 0.1760 0.05 0.001 0.1034 0.1069 0.1 0.025 0.1256 0.1282 0.005 0.001 Musk 0.1291 0.1458 1 0.001 0.1157 0.1154 0.001 0.001 0.1036 0.0936 0.05 1 0.001 0.001 0.001 0.1 0.001 0.001 1 1 0.05 0.1 1 0.1 0.1 1 0.001 1 1 0.001 1 0.05 applied the Lobag algorithm setting the number of samples bootstrapped from D to 100, and performing an out-of-bag estimate of the bias–variance decomposition of the error. The selected lobag, bagged and single SVMs were ﬁnally tested on the separated test set T . Table 4 shows the results of the experiments. We measured 20 outcomes for each method: 7 data sets, and 3 kernels (gaussian, polynomial, and dot-product) applied to each data set except P2 for which we did not apply the dot-product kernel (because it was obviously inappropriate). For each pair of methods, we applied the McNemar test (Dietterich, 1998) 45 Valentini and Dietterich to determine whether there was a signiﬁcant diﬀerence in predictive accuracy on the test set. On nearly all the data sets, both bagging and Lobag outperform the single SVMs independently of the kernel used. The null hypothesis that Lobag has the same error rate as a single SVM is rejected at or below the 0.1 signiﬁcance level in 17 of the 20 cases, while the null hypothesis that bagging has the same error rate as a single SVM is rejected at or below the 0.1 level in 13 of the 20 cases. Most importantly, Lobag generally outperforms standard bagging. Lobag is statistically signiﬁcantly better than bagging in 9 of the 20 cases, and signiﬁcantly inferior only once. These preliminary results show the feasibility of our approach, as shown also by similar experiments presented in Valentini and Dietterich (2003), but we need more experimental studies and applications to real problems in order to better understand when and in which conditions this approach could be fruitful. 8. Conclusion and Future Works We applied bias–variance decomposition of the error as a tool to gain insights into SVM learning algorithm. In particular we performed an analysis of bias and variance of SVMs, considering gaussian, polynomial, and dot–product kernels. The relationships between parameters of the kernel and bias, net–variance, unbiased and biased variance have been studied through an extensive experimentation involving training, testing, and bias–variance analysis of more than half million of SVMs. We discovered regular patterns in the behavior of the bias and variance, and we related those patterns to the parameters and kernel functions of the SVMs. The characterization of bias–variance decomposition of the error showed that in gaussian kernels we can individuate at least three diﬀerent regions with respect to the σ parameter, while in polynomial kernels the U shape of the error can be determined by the combined eﬀects of bias and unbiased variance. The analysis revealed also that the expected trade-oﬀ between bias and variance holds systematically for dot product kernels, while other kernels showed more complex relationships. The information supplied by bias-variance analysis suggests two promising approaches for designing ensembles of SVMs. One approach is to employ low-bias SVMs as base learners in a bagged ensemble. The other approach is to apply bias-variance analysis to construct a heterogeneous, diverse set of accurate and low-bias classiﬁers. We are designing and experimenting with both of these approaches. An outgoing development of this work extends this analysis to bagged and boosted ensemble of SVMs, in order to achieve more insights about the behavior of SVM ensembles based on resampling methods. In our experiments we did not explicitly consider the noise: analyzing the role of the noise in the decomposition of the error (Sect. 5.4) could help to develop ensemble methods speciﬁcally designed for noisy data. Moreover in our experiments we did not explicitly consider the characteristics of the data. Nonetheless, such as we could expect and as our experiments suggested, diﬀerent data characteristics inﬂuence bias–variance patterns in learning machines. To this purpose we plan to explicitly analyze the relationships between bias–variance decomposition of the 46 Bias–variance analysis of SVMs error and data characteristics, using data complexity measures based on geometrical and topological characteristics of the data (Li and Vitanyi, 1993, Ho and Basu, 2002). Acknowledgments We thanks the anonymous reviewers for their comments and suggestions. Appendix A. In this appendix we discuss the notions of systematic and variance eﬀect introduced by James (2003), showing that these quantities are reduced respectively to the bias and the netvariance when the 0/1 loss is used and the noise is disregarded. James (2003) provides deﬁnitions of bias and variance that are similar to those provided by Domingos (2000c). Indeed bias and variance deﬁnitions are based on quantities that he named the systematic part sy of the prediction y and the systematic part st of the target t. These correspond respectively to the Domingos main prediction (eq.2) and optimal prediction (eq.1). Moreover James distinguishes between bias and variance and systematic and variance eﬀects. Bias and variance satisfy respectively the notion of the diﬀerence between the systematic parts of y and t, and the variability of the estimate y. Systematic eﬀect SE represents the change in error of predicting t when using sy instead of st and the variance eﬀect V E the change in prediction error when using y instead of sy in order to predict t. Using Domingos notation (ym for sy, and y∗ for st) the variance eﬀect is: V E(y, t) = Ey,t [L(y, t)] − Et [L(t, ym )] while the systematic eﬀect corresponds to: SE(y, t) = Et [L(t, ym )] − Et [L(t, y∗ )] In other words the systematic eﬀect represents the change in prediction error caused by bias, while the variance eﬀect the change in prediction error caused by variance. While for the squared loss the two sets of bias–variance deﬁnitions match, for general loss functions the identity does not hold. In particular for the 0/1 loss James proposes the following deﬁnitions for noise, variance and bias with 0/1 loss: N (x) = P (t = y∗ ) V (x) = P (y = ym ) B(x) = I(y∗ = ym ) (10) where I(z) is 1 if z is true and 0 otherwise. The variance eﬀect for the 0/1 loss can be expressed in the following way: V E(y, t) = Ey,t [L(y, t) − L(t, ym )] = Py,t (y = t) − Pt (t = ym ) = = 1 − Py,t (y = t) − (1 − Pt (t = ym )) = Pt (t = ym ) − Py,t (y = t) 47 (11) Valentini and Dietterich while the systematic eﬀect is: SE(y, t) = Et [L(t, ym )] − Et [L(t, y∗ )] = Pt (t = ym ) − Pt (t = y∗ ) = = 1 − Pt (t = ym ) − (1 − Pt (t = y∗ )) = Pt (t = y∗ ) − Pt (t = ym ) (12) If we let N (x) = 0, considering eq. 7, 10 and eq. 11 the variance eﬀect becomes: V E(y, t) = Pt (t = ym ) − Py,t (y = t) = P (y∗ = ym ) − Py (y = y∗ ) = = 1 − P (y∗ = ym ) − (1 − Py (y = y∗ )) = 1 − B(x) − (1 − EL(L, x)) = EL(L, x) − B(x) = Vn (x) (13) while from eq. 10 and eq. 12 the systematic eﬀect becomes: SE(y, t) = Pt (t = y∗ ) − Pt (t = ym ) = 1 − Pt (t = y∗ ) − (1 − Pt (t = ym )) = P (y∗ = ym ) = I(y∗ = ym ) = B(x) (14) Hence if N (x) = 0, it follows that the variance eﬀect is equal to the net-variance (eq. 13), and the systematic eﬀect is equal to the bias (eq. 14). References E.L. Allwein, R.E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classiﬁers. Journal of Machine Learning Research, 1:113–141, 2000. E. Bauer and R. Kohavi. An empirical comparison of voting classiﬁcation algorithms: Bagging, boosting and variants. Machine Learning, 36(1/2):525–536, 1999. O. Bousquet and A. Elisseeﬀ. Stability and Generalization. Journal of Machine Learning Research, 2:499–526, 2002. L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996a. L. Breiman. Bias, variance and arcing classiﬁers. Technical Report TR 460, Statistics Department, University of California, Berkeley, CA, 1996b. L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001. I. Buciu, C. Kotropoulos, and I. Pitas. Combining Support Vector Machines for Accurate Face Detection. In Proc. of ICIP’01, volume 1, pages 1054–1057, 2001. Olivier Chapelle, Vladimir Vapnik, Olivier Bousquet, and Sayan Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1):131–159, 2002. S. Cohen and N. Intrator. Automatic Model Selection in a Hybrid Perceptron/Radial Network. In Multiple Classiﬁer Systems. Second International Workshop, MCS 2001, Cambridge, UK, volume 2096 of Lecture Notes in Computer Science, pages 349–358. Springer-Verlag, 2001. T.G. Dietterich. Approximate statistical test for comparing supervised classiﬁcation learning algorithms. Neural Computation, (7):1895–1924, 1998. 48 Bias–variance analysis of SVMs T.G. Dietterich. Ensemble methods in machine learning. In J. Kittler and F. Roli, editors, Multiple Classiﬁer Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 1–15. Springer-Verlag, 2000a. T.G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting and randomization. Machine Learning, 40(2):139– 158, 2000b. T.G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artiﬁcial Intelligence Research, (2):263–286, 1995. P. Domingos. A uniﬁed bias–variance decomposition. Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA, 2000a. P. Domingos. A Uniﬁed Bias-Variance Decomposition and its Applications. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 231–238, Stanford, CA, 2000b. Morgan Kaufmann. P. Domingos. A Uniﬁed Bias-Variance Decomposition for Zero-One and Squared Loss. In Proceedings of the Seventeenth National Conference on Artiﬁcial Intelligence, pages 564–569, Austin, TX, 2000c. AAAI Press. T. Evgeniou, L. Perez-Breva, M. Pontil, and T. Poggio. Bounds on the Generalization Performance of Kernel Machine Ensembles. In P. Langley, editor, Proc. of the Seventeenth International Conference on Machine Learning (ICML 2000), pages 271–278. Morgan Kaufmann, 2000. Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, pages 148–156. Morgan Kauﬀman, 1996. J.H. Friedman. On bias, variance, 0/1 loss and the curse of dimensionality. Data Mining and Knowledge Discovery, 1:55–77, 1997. S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias-variance dilemma. Neural Computation, 4(1):1–58, 1992. Y. Grandvalet and S. Canu. Adaptive Scaling for Feature Selection in SVMs. In S. Becker, S. Thrun, and K. Obermayer, editors, NIPS 2002 Conference Proceedings, Advances in Neural Information Processing Systems, volume 15, Cambridge, MA, 2003. MIT Press. T. Heskes. Bias/Variance Decompostion for Likelihood-Based Estimators. Neural Computation, 10:1425–1433, 1998. T.K. Ho and M. Basu. Complexity measures of supervised classiﬁcation problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):289–300, 2002. Y.S. Huang and Suen. C.Y. Combination of multiple experts for the recognition of unconstrained handwritten numerals. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17:90–94, 1995. 49 Valentini and Dietterich G. James. Variance and bias for general loss function. Machine Learning, (2):115–135, 2003. T. Joachims. Making large scale SVM learning practical. In Smola A. Scholkopf B., Burges C., editor, Advances in Kernel Methods - Support Vector Learning, pages 169–184. MIT Press, Cambridge, MA, 1999. H.C. Kim, S. Pang, H.M. Je, D. Kim, and S.Y. Bang. Pattern Classiﬁcation Using Support Vector Machine Ensemble. In Proc. of ICPR’02, volume 2, pages 20160–20163. IEEE, 2002. J. Kittler, M. Hatef, R.P.W. Duin, and Matas J. On combining classiﬁers. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998. E.M. Kleinberg. A Mathematically Rigorous Foundation for Supervised Learning. In J. Kittler and F. Roli, editors, Multiple Classiﬁer Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 67–76. Springer-Verlag, 2000. R. Kohavi and D.H. Wolpert. Bias plus variance decomposition for zero-one loss functions. In Proc. of the Thirteenth International Conference on Machine Learning, The Seventeenth International Conference on Machine Learning, pages 275–283, Bari, Italy, 1996. Morgan Kaufmann. E. Kong and T.G. Dietterich. Error - correcting output coding correct bias and variance. In The XII International Conference on Machine Learning, pages 313–321, San Francisco, CA, 1995. Morgan Kauﬀman. L.I. Kuncheva, J.C. Bezdek, and R.P.W. Duin. Decision templates for multiple classiﬁer fusion: an experimental comparison. Pattern Recognition, 34(2):299–314, 2001a. L.I. Kuncheva, F. Roli, G.L. Marcialis, and C.A. Shipp. Complexity of Data Subsets Generated by the Random Subspace Method: An Experimental Investigation. In J. Kittler and F. Roli, editors, Multiple Classiﬁer Systems. Second International Workshop, MCS 2001, Cambridge, UK, volume 2096 of Lecture Notes in Computer Science, pages 349–358. Springer-Verlag, 2001b. L.I. Kuncheva and C.J. Whitaker. Measures of diversity in classiﬁer ensembles. Machine Learning, 51:181–207, 2003. M. Li and P Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, Berlin, 1993. L. Mason, P. Bartlett, and J. Baxter. Improved generalization through explicit optimization of margins. Machine Learning, 2000. C.J. Merz and P.M. Murphy. UCI repository of machine learning databases, 1998. www.ics.uci.edu/mlearn/MLRepository.html. A. Prodromidis, P. Chan, and S. Stolfo. Meta-Learning in Distributed Data Mining Systems: Issues and Approaches. In H. Kargupta and P. Chan, editors, Advances in Distributed Data Mining, pages 81–113. AAAI Press, 1999. 50 Bias–variance analysis of SVMs R.E. Schapire. A brief introduction to boosting. In Thomas Dean, editor, 16th International Joint Conference on Artiﬁcial Intelligence, pages 1401–1406. Morgan Kauﬀman, 1999. R.E. Schapire, Y. Freund, P. Bartlett, and W. Lee. Boosting the margin: A new explanation for the eﬀectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998. B. Scholkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002. R. Tibshirani. Bias, variance and prediction error for classiﬁcation rules. Technical report, Department of Preventive Medicine and Biostatistics and Department od Statistics, University of Toronto, Toronto, Canada, 1996. G. Valentini and T.G. Dietterich. Low Bias Bagged Support Vector Machines. In T. Fawcett and N. Mishra, editors, Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), pages 752–759, Washington D.C., USA, 2003. AAAI Press. G. Valentini and F. Masulli. NEURObjects: an object-oriented library for neural network development. Neurocomputing, 48(1–4):623–646, 2002. G. Valentini, M. Muselli, and F. Ruﬃno. Bagged Ensembles of SVMs for Gene Expression Data Analysis. In IJCNN2003, The IEEE-INNS-ENNS International Joint Conference on Neural Networks, pages 1844–49, Portland, USA, 2003. IEEE. V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. D. Wang, J.M. Keller, C.A. Carson, K.K. McAdoo-Edwards, and C.W. Bailey. Use of fuzzy logic inspired features to improve bacterial recognition through classiﬁer fusion. IEEE Transactions on Systems, Man and Cybernetics, 28B(4):583–591, 1998. 51

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Download PDF

advertisement