Learning theory III: Ensemble methods

Setup:
- Probability distribution P over X × {−1, +1}; let (X, Y) ∼ P. We get S := {(x_i, y_i)}_{i=1}^n, an iid sample from P.
- Goal: fix ε, δ ∈ (0, 1). With probability at least 1 − δ (over the random choice of S), learn a classifier fˆ: X → {−1, +1} with low error rate err(fˆ) = P(fˆ(X) ≠ Y) ≤ ε.
- Basic question: when is this possible?
- Suppose I even promise you that there is a perfect classifier from a particular function class F (e.g., F = linear classifiers or F = decision trees).
- Default: Empirical Risk Minimization (i.e., pick the classifier from F with the lowest training error rate), but this might be computationally difficult (e.g., for decision trees).
- Another question: is it easier to learn just non-trivial classifiers in F (i.e., classifiers better than random guessing)?

Boosting

Boosting: history
- Boosting: using a learning algorithm that provides "rough rules-of-thumb" to construct a very accurate predictor.
- Motivation: it is easy to construct classification rules that are correct more often than not (e.g., "If ≥5% of the e-mail characters are dollar signs, then it's spam."), but it seems hard to find a single rule that is almost always correct.
- 1984: Valiant and Kearns ask whether "boosting" is theoretically possible (formalized in the PAC learning model).
- 1989: Schapire creates the first boosting algorithm, solving the open problem of Valiant and Kearns.
- 1990: Freund creates an optimal boosting algorithm (Boost-by-majority).
- 1992: Drucker, Schapire, and Simard empirically observe practical limitations of early boosting algorithms.
- 1995: Freund and Schapire create AdaBoost, a boosting algorithm with practical advantages over the early boosting algorithms.
- Freund and Schapire win the 2004 ACM Paris Kanellakis Award for their "seminal work and distinguished contributions [...] to the development of the theory and practice of boosting, a general and provably effective method of producing arbitrarily accurate prediction rules by combining weak learning rules"; specifically, for AdaBoost, which "can be used to significantly reduce the error of algorithms used in statistical analysis, spam filtering, fraud detection, optical character recognition, and market segmentation, among other applications".

Basic idea
- Input: training data S.
- For t = 1, 2, ..., T:
  1. Choose a subset of examples S_t ⊆ S (or a distribution over S).
  2. Use the "weak learning" algorithm to get a classifier: f_t := WL(S_t).
- Return an "ensemble classifier" based on f_1, f_2, ..., f_T.

AdaBoost
input: Training data {(x_i, y_i)}_{i=1}^n from X × {−1, +1}.
1: initialize D_1(i) := 1/n for each i = 1, 2, ..., n (a probability distribution).
2: for t = 1, 2, ..., T do
3:   give D_t-weighted examples to WL; get back f_t: X → {−1, +1}.
4:   update weights:
       z_t := Σ_{i=1}^n D_t(i) · y_i f_t(x_i) ∈ [−1, +1]
       α_t := (1/2) ln((1 + z_t)/(1 − z_t)) ∈ R   (weight of f_t)
       D_{t+1}(i) := D_t(i) · exp(−α_t · y_i f_t(x_i)) / Z_t for each i = 1, 2, ..., n,
     where Z_t > 0 is the normalizer that makes D_{t+1} a probability distribution.
5: end for
6: return final classifier fˆ(x) := sign(Σ_{t=1}^T α_t · f_t(x)).
(Let sign(z) := 1 if z > 0 and sign(z) := −1 if z ≤ 0.)

Interpretation
Interpreting z_t: suppose (X, Y) ∼ D_t. If P(f(X) = Y) = 1/2 + γ_t, then
    z_t = Σ_{i=1}^n D_t(i) · y_i f(x_i) = 2γ_t ∈ [−1, +1].
- z_t = 0 ⟺ f is random guessing w.r.t. D_t.
- z_t > 0 ⟺ f is better than random guessing w.r.t. D_t.
- z_t < 0 ⟺ better off using the opposite of f's predictions.
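The update rules above translate almost line-for-line into code. Below is a minimal NumPy sketch of the AdaBoost pseudocode (not from the slides); it assumes the weak learner is passed in as a function wl(X, y, D) that returns a classifier mapping a batch of inputs to {−1, +1}. The names and interface are illustrative.

```python
import numpy as np

def adaboost(X, y, wl, T):
    """X: (n, d) array; y: (n,) array of +/-1 labels; wl: weak learner; T: number of rounds."""
    n = len(y)
    D = np.full(n, 1.0 / n)                 # D_1(i) = 1/n
    classifiers, alphas = [], []
    for t in range(T):
        f_t = wl(X, y, D)                   # train f_t on D_t-weighted examples
        z_t = float(np.sum(D * y * f_t(X))) # z_t = sum_i D_t(i) y_i f_t(x_i), in [-1, +1]
        if z_t <= 0:                        # no better than random guessing w.r.t. D_t: stop
            break
        z_t = min(z_t, 1 - 1e-12)           # avoid an infinite weight if f_t is perfect
        alpha_t = 0.5 * np.log((1 + z_t) / (1 - z_t))
        D = D * np.exp(-alpha_t * y * f_t(X))
        D /= D.sum()                        # divide by Z_t so D_{t+1} is a distribution
        classifiers.append(f_t)
        alphas.append(alpha_t)

    def final_classifier(X_new):
        votes = sum(a * f(X_new) for a, f in zip(alphas, classifiers))
        return np.where(votes > 0, 1, -1)   # sign(.), with sign(0) := -1 as in the slides
    return final_classifier, classifiers, alphas
```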
Interpretation: classifier and example weights
- Classifier weights: α_t = (1/2) ln((1 + z_t)/(1 − z_t)) is an increasing function of z_t, equal to 0 at z_t = 0 and diverging as z_t → ±1.
- Example weights: D_{t+1}(i) ∝ D_t(i) · exp(−α_t · y_i f_t(x_i)), so examples that f_t misclassifies get more weight, and examples it classifies correctly get less.

Example: AdaBoost with decision stumps
- Weak learning algorithm WL: ERM with F = "decision stumps" on R², i.e., axis-aligned threshold functions x ↦ sign(v·x_i − t).
- Straightforward to handle importance weights in ERM (a code sketch of such a weighted stump learner follows this group of slides).
- (Example from Figures 1.1 and 1.2 of the Schapire & Freund text.)

Example: execution of AdaBoost
- Round 1 (distribution D_1): stump f_1 with z_1 = 0.40, α_1 = 0.42.
- Round 2 (distribution D_2): stump f_2 with z_2 = 0.58, α_2 = 0.65.
- Round 3 (distribution D_3): stump f_3 with z_3 = 0.72, α_3 = 0.92.

Example: final classifier from AdaBoost
- Final classifier fˆ(x) = sign(0.42 f_1(x) + 0.65 f_2(x) + 0.92 f_3(x)).
- (Zero training error rate!)

Training error rate of final classifier
- Recall γ_t := P(f_t(X) = Y) − 1/2 = z_t/2 when (X, Y) ∼ D_t.
- Training error rate of the final classifier from AdaBoost:
    err(fˆ, {(x_i, y_i)}_{i=1}^n) ≤ exp(−2 Σ_{t=1}^T γ_t²).
- If the average γ̄² := (1/T) Σ_{t=1}^T γ_t² > 0, then the training error rate is ≤ exp(−2 γ̄² T).
- Some γ_t could be small, or even negative; only the overall average γ̄² matters. ("AdaBoost" = "Adaptive Boosting".)
- What about the true error rate?

Empirical results: UCI repository
- Test error rates of C4.5 and AdaBoost on several classification problems; each point represents a single classification problem/dataset from the UCI repository. One scatter plot compares C4.5 against AdaBoost with stumps, the other against AdaBoost with C4.5 as the weak learner.
- C4.5 = popular algorithm for learning decision trees.
- (Figure 1.3 from the Schapire & Freund text.)

Combining classifiers
- Let F be the function class used by the weak learning algorithm WL. The function class used by AdaBoost is
    F_T := { x ↦ sign(Σ_{t=1}^T α_t f_t(x)) : f_1, f_2, ..., f_T ∈ F, α_1, α_2, ..., α_T ∈ R },
  i.e., linear combinations of T functions from F.
- Complexity of F_T grows linearly with T.
- Theoretical guarantee (e.g., when F = decision stumps in R^d): with high probability (over the random choice of the training sample),
    err(fˆ) ≤ exp(−2 γ̄² T)  [training error rate]  +  O(√(T log d / n))  [error due to finite sample].

A typical run of boosting
- (Figure 1.7 from the Schapire & Freund text: training and test percent error rates obtained by boosting on an OCR dataset with C4.5 as the base learner; the top horizontal line is the test error of C4.5 alone, the bottom line the final test error of AdaBoost after 1000 rounds.)
- Training error rate is zero after just five rounds, but the test error rate continues to decrease, even up to 1000 rounds! (The total number of nodes across all decision trees in fˆ is > 2 × 10^6.)
- Theory suggests a danger of overfitting when T is very large. Indeed, this does happen sometimes ... but often not!
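Following up on the decision-stump example above, here is a brute-force sketch (not from the slides) of weighted ERM over axis-aligned stumps, parametrized here as x ↦ sign(v·(x_j − t)), an equivalent form. It can be plugged into the AdaBoost sketch earlier as the weak learner wl; the helper names are illustrative.

```python
import numpy as np

def stump_weak_learner(X, y, D):
    """Return the decision stump x -> sign(v * (x[j] - t)) with smallest D-weighted error."""
    n, d = X.shape
    best_err, best_stump = np.inf, None
    for j in range(d):
        vals = np.unique(X[:, j])
        # candidate thresholds: below, between, and above the observed feature values
        thresholds = np.concatenate(
            ([vals[0] - 1.0], (vals[:-1] + vals[1:]) / 2.0, [vals[-1] + 1.0]))
        for t in thresholds:
            for v in (+1, -1):
                pred = np.where(v * (X[:, j] - t) > 0, 1, -1)
                err = D[pred != y].sum()        # importance-weighted training error
                if err < best_err:
                    best_err, best_stump = err, (j, t, v)
    j, t, v = best_stump
    return lambda Z: np.where(v * (Z[:, j] - t) > 0, 1, -1)
```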
Boosting the margin
- The lack of overfitting in the previous plot looks paradoxical: a combination of 1000 trees is far larger and more complex than a combination of five trees, yet it performs much better on the test set even though both fit the training set perfectly. The resolution is to look not just at whether the training predictions are correct, but at how confident they are.
- Write the final classifier from AdaBoost as fˆ(x) = sign(g(x)), where
    g(x) := (Σ_{t=1}^T α_t f_t(x)) / (Σ_{t=1}^T |α_t|) ∈ [−1, +1].
- Call y · g(x) ∈ [−1, +1] the margin achieved on example (x, y).
- New theory [Schapire, Freund, Bartlett, and Lee, 1998]: larger margins ⇒ better resistance to overfitting, independent of T; and AdaBoost tends to increase margins on training examples. (Similar to, but not the same as, SVM margins.)
- AdaBoost+C4.5 on the "letters" dataset:

                         T = 5    T = 100   T = 1000
  training error rate    0.0%     0.0%      0.0%
  test error rate        8.4%     3.3%      3.1%
  % margins ≤ 0.5        7.7%     0.0%      0.0%
  minimum margin         0.14     0.52      0.55

Linear classifiers
- Regard the function class F used by the weak learning algorithm as "feature functions":
    x ↦ φ(x) := (f(x) : f ∈ F) ∈ {−1, +1}^F   (possibly infinite dimensional!).
- AdaBoost's final classifier is a linear classifier in {−1, +1}^F:
    fˆ(x) = sign(Σ_{t=1}^T α_t f_t(x)) = sign(Σ_{f ∈ F} w_f f(x)) = sign(⟨w, φ(x)⟩),
  where w_f := Σ_{t=1}^T α_t · 1{f_t = f} for all f ∈ F.

Exponential loss
- AdaBoost is a particular "coordinate descent" algorithm (similar to, but not the same as, gradient descent) for
    min_{w ∈ R^F} (1/n) Σ_{i=1}^n exp(−y_i ⟨w, φ(x_i)⟩).
- (Plot: the exponential loss z ↦ exp(−z) compared with the logistic loss z ↦ log(1 + exp(−z)).)
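As a small illustration of the two quantities just introduced, the sketch below (not from the slides) computes the normalized margins y·g(x) and the empirical exponential loss for an ensemble produced by the earlier AdaBoost sketch; the function name is mine.

```python
import numpy as np

def margins_and_exp_loss(X, y, classifiers, alphas):
    """Normalized margins y*g(x) and empirical exponential loss of an ensemble."""
    alphas = np.asarray(alphas)
    F = np.stack([f(X) for f in classifiers], axis=1)   # (n, T) matrix of f_t(x_i) values
    scores = F @ alphas                                  # sum_t alpha_t f_t(x_i)
    margins = y * scores / np.sum(np.abs(alphas))        # y * g(x), in [-1, +1]
    exp_loss = np.mean(np.exp(-y * scores))              # (1/n) sum_i exp(-y_i <w, phi(x_i)>)
    return margins, exp_loss
```

Tracking these across rounds on a real run typically shows the exponential loss continuing to decrease and the minimum margin increasing even after the training error reaches zero, which is the margin-based explanation of the "typical run" behavior above.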
More on boosting
Many variants of boosting:
- AdaBoost with different loss functions.
- Boosted decision trees = boosting + decision trees.
- Boosting algorithms for ranking and multi-class problems.
- Boosting algorithms that are robust to certain kinds of noise.
- Boosting for online learning algorithms (very new!).
- ...
Many connections between boosting and other subjects:
- Game theory, online learning.
- "Geometry" of information (replace ‖·‖₂² with the relative entropy divergence).
- Computational complexity.
- ...

Application: face detection
- Problem: given an image, locate all of the faces.
- An application of AdaBoost.
- As a classification problem:
  - Divide up images into patches (at varying scales, e.g., 24×24, 48×48).
  - Learn a classifier f: X → Y, where Y = {face, not face}.
- Many other things are built on top of face detectors (e.g., face tracking, face recognizers); now in every digital camera and iPhoto/Picasa-like software.
- Main problem: how to make this very fast.

Face detectors via AdaBoost [Viola & Jones, 2001]
- Face detector architecture by Viola & Jones (2001): a major achievement in computer vision; the detector is actually usable in real time.
- Think of each image patch (d × d-pixel gray-scale) as a vector x ∈ [0, 1]^{d²}.
- The weak learning algorithm picks linear classifiers f_{w,t}(x) = sign(⟨w, x⟩ − t), where w has a very particular form: it is supported on two adjacent rectangles, positive on one (the "black box") and negative on the other (the "white box"), scaled so that ⟨w, x⟩ is the difference of the average pixel values in the two boxes.
- (Figure 5 of Viola & Jones shows the first two features selected by AdaBoost: one exploits the observation that the eye region is often darker than the upper cheeks; the other compares the eye regions with the bridge of the nose.)
- AdaBoost combines many "rules-of-thumb" of this form.
- Very simple, and extremely fast to evaluate via pre-computation.

Viola & Jones "integral image" trick
- "Integral image" trick: for every image, pre-compute
    s(r, c) = sum of pixel values in the rectangle from (0, 0) to (r, c)
  in a single pass through the image.
- To compute the inner product ⟨w, x⟩ = (average pixel value in the black box) − (average pixel value in the white box), just add and subtract a few s(r, c) values.
- ⇒ Evaluating the "rules-of-thumb" is extremely fast: a few array references and arithmetic operations per feature.
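Here is a sketch of the integral-image pre-computation and constant-time rectangle sums, assuming the image is a 2-D NumPy array of pixel values; the helper names are mine, not from the slides or the paper.

```python
import numpy as np

def integral_image(img):
    """s[r, c] = sum of img[0:r+1, 0:c+1], computed in a single pass."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(s, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] recovered from the integral image s in O(1)."""
    total = s[r1, c1]
    if r0 > 0:
        total -= s[r0 - 1, c1]
    if c0 > 0:
        total -= s[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += s[r0 - 1, c0 - 1]
    return total
```

A rectangle feature is then just two or three rect_sum calls divided by the box areas, which is why each weak classifier costs only a handful of array references.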
Viola & Jones cascade architecture
- Problem: severe class imbalance (the vast majority of patches do not contain a face).
- Solution: train several classifiers (each using AdaBoost), and arrange them in a special kind of decision list called a cascade:
    x → f^(1)(x) —[+1]→ f^(2)(x) —[+1]→ f^(3)(x) —[+1]→ ... → face,
  where a −1 output at any stage immediately rejects x as "not face".
- Each f^(ℓ) is trained using AdaBoost, and then its threshold (applied before the sign) is adjusted to minimize the false negative rate.
- The f^(ℓ) in later stages can be more complex than in earlier stages, since most patches never make it to the end.
- ⇒ (Cascade) classifier evaluation is extremely fast: simple early stages reject the overwhelming majority of sub-windows with very few operations.

Viola & Jones detector: example results
- (Figure 10 of Viola & Jones: output of the face detector on a number of test images from the MIT + CMU test set.)

Summary
Two key points:
- AdaBoost effectively combines many fast-to-evaluate "weak classifiers".
- The cascade structure optimizes speed for the common case (non-face patches); a sketch of cascade evaluation follows.
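To make the cascade's control flow concrete, here is a minimal sketch of evaluation as a decision list. It assumes stages is a list of (score function, threshold) pairs, where each score function is the real-valued AdaBoost combination Σ_t α_t f_t(x) for that stage and the threshold has been lowered to keep false negatives rare; the names are illustrative.

```python
def cascade_predict(patch, stages):
    """Reject at the first stage whose score falls below its threshold; accept only if all pass."""
    for score_fn, threshold in stages:
        if score_fn(patch) < threshold:
            return -1        # rejected early: "not face" (the common case)
    return +1                # survived every stage: "face"
```

Because the earliest stages use only a couple of rectangle features, the average cost per patch is dominated by these cheap early rejections.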
Bagging

Bagging = Bootstrap aggregating (Leo Breiman, 1994).
- Input: training data {(x_i, y_i)}_{i=1}^n from X × {−1, +1}.
- For t = 1, 2, ..., T:
  1. Randomly pick n examples with replacement from the training data → {(x_i^(t), y_i^(t))}_{i=1}^n (a bootstrap sample).
  2. Run the learning algorithm on {(x_i^(t), y_i^(t))}_{i=1}^n → classifier f_t.
- Return a majority vote classifier over f_1, f_2, ..., f_T.

Aside: sampling with replacement
- Question: if n individuals are picked from a population of size n uniformly at random with replacement, what is the probability that a given individual is not picked?
- Answer: (1 − 1/n)^n. For large n,
    lim_{n→∞} (1 − 1/n)^n = 1/e ≈ 0.3679.
- Implications for bagging:
  - Each bootstrap sample contains about 63% of the data set.
  - The remaining 37% can be used to estimate the error rate of the classifier trained on that bootstrap sample.
  - Averaging across bootstrap samples gives an estimate of the bagged classifier's error rate (sort of).

Random Forests
Random Forests (Leo Breiman, 2001).
- Input: training data {(x_i, y_i)}_{i=1}^n from R^d × {−1, +1}.
- For t = 1, 2, ..., T:
  1. Randomly pick n examples with replacement from the training data → {(x_i^(t), y_i^(t))}_{i=1}^n (a bootstrap sample).
  2. Run a variant of the decision tree learning algorithm on {(x_i^(t), y_i^(t))}_{i=1}^n, where each split is chosen by considering only a random subset of √d features (rather than all d features) → decision tree classifier f_t.
- Return a majority vote classifier over f_1, f_2, ..., f_T.

Breaking the holdout

Holdout method
1. Randomly split the data into three sets: Training, Holdout, and Test data.
2. Train some classifiers on the Training data; evaluate them on the Holdout data.
3. Use the results from the previous step to pick the final classifier.

Adapting to holdout data
1. Randomly split the data into three sets: Training, Holdout, and Test data.
2. Loop:
   - Train some classifiers on the Training data.
   - Evaluate them on the Holdout data.
   - Exit the loop at some point.
3. Use the results from the previous step to pick the final classifier.
Why is this different from the basic holdout method? Steps after the first loop iteration might not be independent of the Holdout data.

Extreme case of overfitting
(A variant of "Freedman's paradox".)
- Construct completely random classifiers f_1, f_2, ..., f_K: X → {−1, +1}.
- Compute the Holdout error rates err(f_i, Holdout), i = 1, 2, ..., K. (Their histogram is centered around 0.5.)
- Select the better-than-random classifiers: G := {i : err(f_i, Holdout) < 1/2}.
- Construct fˆ := the majority vote classifier over {f_i : i ∈ G}.
- Compute the Holdout error rate err(fˆ, Holdout).
- Is err(fˆ, Holdout) a good estimate of err(fˆ)? (A numerical sketch of this experiment appears at the end of these notes.)

Cautionary tale
nytimes.com/2015/06/04/technology/computer-scientists-are-astir-after-baidu-team-is-barred-from-ai-competition.html

What can we do?
- Avoid excessive adaptation to the Holdout data if you can help it. (Think carefully about what to evaluate; use the Training data when possible.)
- Guard against inadvertent "information leakage" from the Holdout data. Example of leakage: using both Training and Holdout data to determine standardization or other preprocessing.
- Use "blurred vision" when looking at the Holdout data: e.g., report err(f, Holdout) + noise, where the noise is on the order of the expected standard deviation.
- Use explicit safeguards for adaptive data analysis: Dwork et al., "The reusable holdout: preserving validity in adaptive data analysis," Science, 349(6248):636–638, 2015.
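Here is the numerical sketch of the "extreme case of overfitting" experiment above, with made-up sizes (m holdout examples, K random classifiers); it is an illustration, not part of the original slides. The selected classifiers' majority vote can look far better than chance on the very Holdout set used to select it, even though its true error rate is exactly 1/2 by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
m, K = 1000, 2000                             # holdout size, number of random classifiers (made up)
y = rng.choice([-1, 1], size=m)               # holdout labels
preds = rng.choice([-1, 1], size=(K, m))      # row i: predictions of random classifier f_i on the holdout

errs = (preds != y).mean(axis=1)              # err(f_i, Holdout) for each i
G = errs < 0.5                                # "better than random" on the holdout
f_hat = np.where(preds[G].sum(axis=0) > 0, 1, -1)   # majority vote over {f_i : i in G}

print("err(f_hat, Holdout):", np.mean(f_hat != y))  # typically well below 0.5
# ...yet err(f_hat) = 1/2 on fresh data, so the Holdout estimate is badly biased.
```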
