{"title": "Potential Boosters?", "book": "Advances in Neural Information Processing Systems", "page_first": 258, "page_last": 264, "abstract": null, "full_text": "Potential Boosters?\n\nNigel Duffy\nDepartment of Computer Science\nUniversity of California\nSanta Cruz, CA 95064\nnigeduff@cse.ucsc.edu\n\nDavid Helmbold\nDepartment of Computer Science\nUniversity of California\nSanta Cruz, CA 95064\ndph@cse.ucsc.edu\n\nAbstract\n\nRecent interpretations of the AdaBoost algorithm view it as performing a gradient descent on a potential function. Simply changing the potential function allows one to create new algorithms related to AdaBoost. However, these new algorithms are generally not known to have the formal boosting property. This paper examines the question of which potential functions lead to new algorithms that are boosters. The two main results are general sets of conditions on the potential; one set implies that the resulting algorithm is a booster, while the other implies that the algorithm is not. These conditions are applied to previously studied potential functions, such as those used by LogitBoost and Doom II.\n\n1 Introduction\n\nThe first boosting algorithm appeared in Rob Schapire's thesis [1]. This algorithm was able to boost the performance of a weak PAC learner [2] so that the resulting algorithm satisfies the strong PAC learning [3] criteria. We will call any method that builds a strong PAC learning algorithm from a weak PAC learning algorithm a PAC boosting algorithm. Freund and Schapire later found an improved PAC boosting algorithm called AdaBoost [4], which also tends to improve the hypotheses generated by practical learning algorithms [5].\n\nThe AdaBoost algorithm takes a labeled training set and produces a master hypothesis by repeatedly calling a given learning method.
The given learning method is used with different distributions on the training set to produce different base hypotheses. The master hypothesis returned by AdaBoost is a weighted vote of these base hypotheses. AdaBoost works iteratively, determining which examples are poorly classified by the current weighted vote and selecting a distribution on the training set to emphasize those examples.\n\nRecently, several researchers [6, 7, 8, 9, 10] have noticed that AdaBoost is performing a constrained gradient descent on an exponential potential function of the margins of the examples. The margin of an example is yF(x), where y is the ±1 valued label of the example x and F(x) ∈ ℝ is the net weighted vote of master hypothesis F. Once AdaBoost is seen this way it is clear that further algorithms may be derived by changing the potential function [6, 7, 9, 10].\n\nThe exponential potential used by AdaBoost has the property that the influence of a data point increases exponentially if it is repeatedly misclassified by the base hypotheses. This concentration on the \"hard\" examples allows AdaBoost to rapidly obtain a consistent hypothesis (assuming that the base hypotheses have certain properties). However, it also means that an incorrectly labeled or noisy example can quickly attract much of the distribution. It appears that this lack of noise-tolerance is one of AdaBoost's few drawbacks [11]. Several researchers [7, 8, 9, 10] have proposed potential functions which do not concentrate as much on these \"hard\" examples. However, they generally do not show that the derived algorithms have the PAC boosting property.\n\nIn this paper we return to the original motivation behind boosting algorithms and ask: \"for which potential functions does gradient descent lead to PAC boosting algorithms\" (i.e. boosters that create strong PAC learning algorithms from arbitrary weak PAC learners).
We give sufficient conditions that are met by some of the proposed potential functions (most notably the LogitBoost potential introduced by Friedman et al. [7]). Furthermore, we show that simple gradient descent on other proposed potential functions (such as the sigmoidal potential used by Mason et al. [10]) cannot convert arbitrary weak PAC learning algorithms into strong PAC learners. The aim of this work is to identify properties of potential functions required for PAC boosting, in order to guide the search for more effective potentials.\n\nSome potential functions have an additional tunable parameter [10] or change over time [12]. Our results do not yet apply to such dynamic potentials.\n\n2 PAC Boosting\n\nHere we define the notions of PAC learning^1 and boosting, and define the notation used throughout the paper.\n\nA concept C is a subset of the learning domain X. A random example of C is a pair (x ∈ X, y ∈ {−1, +1}) where x is drawn from some distribution on X and y = 1 if x ∈ C and −1 otherwise. A concept class is a set of concepts.\n\nDefinition 1 A (strong) PAC learner for concept class C has the property that for every distribution D on X, all concepts C ∈ C, and all 0 < ε, δ < 1/2: with probability at least 1 − δ the algorithm outputs a hypothesis h where P_D[h(x) ≠ C(x)] ≤ ε. The learning algorithm is given C, ε, δ, and the ability to draw random examples of C (w.r.t. distribution D), and must run in time bounded by poly(1/ε, 1/δ).\n\nDefinition 2 A weak PAC learner is similar to a strong PAC learner, except that it need only satisfy the conditions for a particular 0 < ε_0, δ_0 < 1/2 pair, rather than for all ε, δ pairs.\n\nDefinition 3 A PAC boosting algorithm is a generic algorithm which can leverage any weak PAC learner to meet the strong PAC learning criteria.\n\nIn the remainder of the paper we emphasize boosting the accuracy ε, as it is much easier to boost the confidence δ; see Haussler et al.
[13] and Freund [14] for details. Furthermore, we emphasize boosting by re-sampling, where the strong PAC learner draws a large sample, and each iteration the weak learning algorithm is called with some distribution over this sample.\n\n^1 To simplify the presentation we omit the instance space dimension and target representation length parameters.\n\nThroughout the paper we use the following notation.\n\n• m is the cardinality of the fixed sample {(x_1, y_1), ..., (x_m, y_m)}.\n• h_t(x) is the ±1 valued weak hypothesis created at iteration t.\n• α_t is the weight or vote of h_t in the master hypothesis; the α's may or may not be normalized so that Σ_{t'=1}^t α_{t'} = 1.\n• F_t(x) = Σ_{t'=1}^t α_{t'} h_{t'}(x) / Σ_{t'=1}^t α_{t'} ∈ ℝ is the master hypothesis^2 at iteration t.\n• u_{i,t} = y_i Σ_{t'=1}^t α_{t'} h_{t'}(x_i) is the margin of x_i after iteration t; the t subscript is often omitted. Note that the margin is positive when the master hypothesis is correct, and the normalized margin is u_{i,t} / Σ_{t'=1}^t α_{t'}.\n• p(u) is the potential of an instance with margin u, and the total potential is Σ_{i=1}^m p(u_i).\n• P_D[·], P_S[·], and E_S[·] are the probability with respect to the unknown distribution over the domain, and the probability and expectation with respect to the uniform distribution over the sample, respectively.\n\nOur results apply to total potential functions of the form Σ_{i=1}^m p(u_i) where p is positive and strictly decreasing.\n\n3 Leveraging Learners by Gradient Descent\n\nAdaBoost [4] has recently been interpreted as gradient descent independently by several groups [6, 7, 8, 9, 10]. Under this interpretation AdaBoost is seen as minimizing the total potential Σ_{i=1}^m p(u_i) = Σ_{i=1}^m exp(−u_i) via feasible direction gradient descent.
On each iteration t + 1, AdaBoost chooses the direction of steepest descent as the distribution on the sample, and calls the weak learner to obtain a new base hypothesis h_{t+1}. The weight α_{t+1} of this new weak hypothesis is calculated to minimize^3 the resulting potential Σ_{i=1}^m p(u_{i,t+1}) = Σ_{i=1}^m exp(−(u_{i,t} + α_{t+1} y_i h_{t+1}(x_i))).\n\nThis gradient descent idea has been generalized to other potential functions [6, 7, 10]. Duffy et al. [9] prove bounds for a similar gradient descent technique using a non-componentwise, non-monotonic potential function.\n\nNote that if the weak learner returns a good hypothesis h_t (with training error at most ε < 1/2), then Σ_{i=1}^m D_t(x_i) y_i h_t(x_i) ≥ 1 − 2ε > 0. We set τ = 1 − 2ε, and assume that each base hypothesis produced satisfies Σ_{i=1}^m D_t(x_i) y_i h_t(x_i) ≥ τ.\n\nIn this paper we consider this general gradient descent approach applied to various potentials Σ_{i=1}^m p(u_i). Note that each potential function p has two corresponding gradient descent algorithms (see [6]). The un-normalized algorithms (like AdaBoost) continually add in new weak hypotheses while preserving the old α's. The normalized algorithms re-scale the α's so that they always sum to 1. In general, we call such algorithms \"leveraging algorithms\", reserving the term \"boosting\" for those that actually have the PAC boosting property.\n\n^2 The prediction of the master hypothesis on instance x is the sign of F_t(x).\n^3 Our current proofs require that the actual α_t's be no greater than a constant (say 1). Therefore, this minimizing α may need to be reduced.\n\n4 Potentials that Don't Boost\n\nIn this section we describe sufficient conditions on potential functions so that the corresponding leveraging algorithm does not have the PAC boosting property. We apply these conditions to show that two potentials from the literature do not lead to boosting algorithms.
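Before examining these conditions, the generic leveraging loop of Section 3 can be made concrete. The following sketch (in Python) performs feasible-direction gradient descent on a total potential Σ_i p(u_i); the grid line search, the cap of α at 1 (cf. footnote 3), and the weak-learner interface are illustrative assumptions rather than the exact procedure analyzed in the paper:

```python
import numpy as np

def exp_potential(u):
    # AdaBoost's potential p(u) = exp(-u)
    return np.exp(-u)

def exp_potential_deriv(u):
    # p'(u) = -exp(-u)
    return -np.exp(-u)

def leverage(X, y, weak_learner, p, p_deriv, rounds=20):
    """Un-normalized gradient-descent leveraging on sum_i p(u_i).

    Each round, the distribution on the sample is the direction of
    steepest descent, D_t(x_i) proportional to -p'(u_i); the vote
    alpha is chosen by a grid search over (0, 1] to minimize the new
    total potential.  weak_learner(X, y, D) must return a +/-1-valued
    hypothesis callable on arrays.
    """
    m = len(y)
    margins = np.zeros(m)          # u_i = y_i * sum_t alpha_t h_t(x_i)
    ensemble = []                  # list of (alpha_t, h_t) pairs
    for _ in range(rounds):
        w = -p_deriv(margins)      # un-normalized example weights
        D = w / w.sum()            # distribution handed to the weak learner
        h = weak_learner(X, y, D)
        step = y * h(X)            # per-example margin change per unit alpha
        grid = np.linspace(0.01, 1.0, 100)
        totals = [p(margins + a * step).sum() for a in grid]
        alpha = grid[int(np.argmin(totals))]
        margins += alpha * step
        ensemble.append((alpha, h))
    return ensemble
```

Passing exp_potential recovers an AdaBoost-like update; swapping in another positive, strictly decreasing p changes only the example weights −p'(u_i) and the line search.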
\n\nTheorem 1 Let p(u) be a potential function for which:\n\n1) the derivative p'(u) is increasing (−p'(u) decreasing) on ℝ⁺, and\n2) ∃β > 0 such that for all u > 0, −βp'(u) ≥ −p'(−2u).\n\nThen neither the normalized nor the un-normalized leveraging algorithms corresponding to potential p have the PAC boosting property.\n\nThis theorem is proven by an adversary argument. Whenever the concept class is sufficiently rich^4, the adversary can keep a constant fraction of the sample from being correctly labeled by the master hypothesis. Thus as the error tolerance ε goes to zero, the master hypotheses will not be sufficiently accurate.\n\nWe now apply this theorem to two potential functions from the literature. Friedman et al. [7] describe a potential they call \"Squared Error(p)\", where the potential at x_i is ((y_i + 1)/2 − e^{F(x_i)}/(e^{F(x_i)} + e^{−F(x_i)}))². This potential can be re-written as p_SE(u_i) = (1/4)(1 + 2(e^{−u_i} − e^{u_i})/(e^{u_i} + e^{−u_i}) + ((e^{−u_i} − e^{u_i})/(e^{u_i} + e^{−u_i}))²).\n\nCorollary 1 Potential \"Squared Error(p)\" does not lead to a boosting algorithm.\n\nProof: This potential satisfies the conditions of Theorem 1. It is strictly decreasing, and the second condition holds for β = 2.\n\nMason et al. [10] examine a normalized algorithm using the potential p_D(u) = 1 − tanh(λu). Their algorithm optimizes over choices of λ via cross-validation, and uses weak learners with slightly different properties. However, we can plug this potential directly into the gradient descent framework and examine the resulting algorithms.\n\nCorollary 2 The DOOM II potential p_D does not lead to a boosting algorithm for any fixed λ.\n\nProof: The potential is strictly decreasing, and the second condition of Theorem 1 holds for β = 1.\n\nOur techniques show that potentials that are sigmoidal in nature do not lead to algorithms with the PAC boosting property.
Since sigmoidal potentials are generally better over-estimates of the 0-1 loss than the potential used by AdaBoost, our results imply that boosting algorithms must use a potential with more subtle properties than simply upper bounding the 0-1 loss.\n\n^4 The VC-dimension 4 concept class consisting of pairs of intervals on the real line is sufficient for our adversary.\n\n5 Potential Functions That Boost\n\nIn this section we give sufficient conditions on a potential function for its corresponding un-normalized algorithm to have the PAC boosting property. This result implies that AdaBoost [4] and LogitBoost [7] have the PAC boosting property (although this was previously known for AdaBoost [4], we believe this is a new result for LogitBoost).\n\nOne set of conditions on the potential implies that it decreases roughly exponentially when the (un-normalized) margins are large. Once the margins are in this exponential region, ideas similar to those used in AdaBoost's analysis show that the minimum normalized margin quickly becomes bounded away from zero. This allows us to bound the generalization error using a theorem from Bartlett et al. [15].\n\nA second set of conditions governs the behavior of the potential function before the un-normalized margins are large enough. These conditions imply that the total potential decreases by a constant factor each iteration. Therefore, too much time will not be spent before all the margins enter the exponential region.\n\nThe margin value bounding the exponential region is U; once Σ_{i=1}^m p(u_i) ≤ p(U), all of the margins will remain in the exponential region. The following theorem gives conditions on p ensuring that Σ_{i=1}^m p(u_i) quickly becomes less than p(U).\n\nTheorem 2 If the following conditions hold for p(u) and U:\n\n1. −p'(u) is strictly decreasing and 0 < p''(u) ≤ B, and\n2.
∃q > 0 such that p(u) ≤ −qp'(u) for all u > U,\n\nthen Σ_{i=1}^m p(u_i) ≤ p(U) after T_1 ≤ (4Bq²m²p(0) / (p(U)²τ²)) ln(mp(0)/p(U)) iterations.\n\nThe proof of this theorem approximates the new total potential by the old potential minus α times a linear term, plus an error. By bounding the error as a function of α and minimizing, we demonstrate that some values of α give a sufficient decrease in the total potential.\n\nTheorem 3 If the following conditions hold for p(u), U, q, and iteration T_1:\n\n1. ∃β ≥ √3 such that −p'(u + v) ≤ p(u + v) ≤ −qp'(u)β^{−v} whenever −1 ≤ v ≤ 1 and u > U,\n2. Σ_{i=1}^m p(u_{i,T_1}) ≤ p(U),\n3. −p'(u) is strictly decreasing, and\n4. ∃C > 0, γ > 1 such that Cp(u) ≥ γ^{−u} for all u > U,\n\nthen after iteration T_1 the total potential is bounded by a quantity which decreases exponentially with each additional iteration.\n\nThe proof of this theorem is a generalization of the AdaBoost proof.\n\nCombining these two theorems, and the generalization bound from Theorem 2 of Bartlett et al. [15], gives the following result, where d is the VC dimension of the weak hypothesis class.\n\nTheorem 4 If for all edges 0 < τ < 1/2 there exist T_{1,τ} ≤ poly(m, 1/τ), U_τ, and q_τ satisfying the conditions of Theorem 3 such that p(U_τ) ≥ poly(τ) and q_τ√(1 − τ²) = l(τ) < 1 − poly(τ), then in time poly(m, 1/τ) all examples have normalized margin at least θ = ln((l(τ) + 1)/(2l(τ))) / ln(γ) and\n\nP_D[yF_T(x) ≤ 0] ∈ O( (1/√m) ( ln²(γ) d log²(m/d) / (ln(l(τ) + 1) − ln(2l(τ)))² + log(1/δ) )^{1/2} ).\n\nChoosing m appropriately makes the error rate sufficiently small so that the algorithm corresponding to p has the PAC boosting property.\n\nWe now apply Theorem 4 to show that the AdaBoost and LogitBoost potentials lead to boosting algorithms.\n\n6 Some Boosting Potentials\n\nIn this section we show, as a direct consequence of our Theorem 4, that the potential functions for AdaBoost and LogitBoost lead to boosting algorithms.
Note that the LogitBoost algorithm we analyze is not exactly the same as that described by Friedman et al. [7]; their \"weak learner\" optimizes a square loss which appears to better fit the potential. First we re-derive the boosting property for AdaBoost.\n\nCorollary 3 AdaBoost's [16] potential boosts.\n\nProof: To prove this we simply need to show that the potential p(u) = exp(−u) satisfies the conditions of Theorem 4. This is done by setting U_τ = −ln(m), q_τ = 1, γ = β = e, C = 1, and T_1 = 0.\n\nCorollary 4 The log-likelihood potential (as used in LogitBoost [7]) boosts.\n\nProof: In this case p(u) = ln(1 + e^{−u}) and −p'(u) = e^{−u}/(1 + e^{−u}). We set γ = β = e, C = 2, U_τ = −ln(√(1 − τ²/2)/√(1 − τ²) − 1), and q_τ = 1 + exp(−U_τ) = √(1 − τ²/2)/√(1 − τ²). Now Theorem 2 shows that after T_1 ≤ poly(m, 1/τ) iterations the conditions of Theorem 4 are satisfied.\n\n7 Conclusions\n\nIn this paper we have examined leveraging weak learners using a gradient descent approach [9]. This approach is a direct generalization of the AdaBoost [4, 16] algorithm, where AdaBoost's exponential potential function is replaced by alternative potentials. We demonstrated properties of potentials that are sufficient to show that the resulting algorithms are PAC boosters, and other properties that imply that the resulting algorithms are not PAC boosters. We applied these results to several potential functions from the literature [7, 10, 16].\n\nNew insight can be gained from examining our criteria carefully. The conditions that show boosting leave tremendous freedom in the choice of potential function for values less than some U; perhaps this freedom can be used to choose potential functions which do not overly concentrate on noisy examples.\n\nThere is still a significant gap between these two sets of properties; we are still a long way from classifying arbitrary potential functions as to their boosting properties.
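The contrast between these criteria can be checked numerically. The sketch below (Python) compares, for the three potentials discussed above, the weight −p'(u) that gradient descent places on a badly misclassified example, and grid-tests condition 2 of Theorem 1; the finite-difference derivative, the choice λ = 1 for the DOOM II sigmoid, and the finite grid are illustrative assumptions:

```python
import numpy as np

# The three potentials discussed in this paper; lambda = 1 is an
# illustrative choice for the DOOM II sigmoid.
potentials = {
    "exponential (AdaBoost)": lambda u: np.exp(-u),
    "log-likelihood (LogitBoost)": lambda u: np.log(1.0 + np.exp(-u)),
    "sigmoid (DOOM II)": lambda u: 1.0 - np.tanh(u),
}

def neg_deriv(p, u, h=1e-6):
    # Numerical -p'(u): the weight gradient descent puts on an
    # example with margin u.
    return -(p(u + h) - p(u - h)) / (2.0 * h)

def satisfies_condition2(p, beta, us=np.linspace(0.1, 10.0, 200)):
    # Grid check of condition 2 of Theorem 1:
    # -beta * p'(u) >= -p'(-2u) for all u > 0 (here: on a finite grid).
    return all(beta * neg_deriv(p, u) >= neg_deriv(p, -2.0 * u) for u in us)

for name, p in potentials.items():
    # Relative weight on a badly misclassified example (u = -5)
    # versus a borderline one (u = 0).
    ratio = neg_deriv(p, -5.0) / neg_deriv(p, 0.0)
    print(f"{name}: weight(-5)/weight(0) = {ratio:.3g}")

print(satisfies_condition2(potentials["sigmoid (DOOM II)"], beta=1.0))
print(satisfies_condition2(potentials["exponential (AdaBoost)"], beta=1e6))
```

On this grid the sigmoid satisfies the condition with β = 1 (so Theorem 1 applies and it cannot boost), while the exponential fails it even for very large β, consistent with Corollaries 2 and 3.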
\n\nThere are other classes of leveraging algorithms. One class looks at the distances between successive distributions [17, 18]. Another class changes the potential over time [6, 8, 12, 14]. The criteria for boosting may change significantly with these different approaches. For example, Freund recently presented a boosting algorithm [12] that uses a time-varying sigmoidal potential. It would be interesting to adapt our techniques to such dynamic potentials.\n\nReferences\n\n[1] Robert E. Schapire. The Design and Analysis of Efficient Learning Algorithms. MIT Press, 1992.\n[2] Michael Kearns and Leslie Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM, 41(1):67-95, January 1994.\n[3] L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134-1142, November 1984.\n[4] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997.\n[5] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning, 36(1-2):105-139, 1999.\n[6] Leo Breiman. Arcing the edge. Technical Report 486, Department of Statistics, University of California, Berkeley, 1997. Available at www.stat.berkeley.edu.\n[7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Technical report, Stanford University, 1998.\n[8] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 2000. To appear.\n[9] Nigel Duffy and David P. Helmbold. A geometric approach to leveraging weak learners.
In Paul Fischer and Hans Ulrich Simon, editors, Computational Learning Theory: 4th European Conference (EuroCOLT '99), pages 18-33. Springer-Verlag, March 1999.\n[10] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Boosting algorithms as gradient descent. To appear in NIPS 2000.\n[11] Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning. To appear.\n[12] Yoav Freund. An adaptive version of the boost-by-majority algorithm. In Proc. 12th Annu. Conf. on Comput. Learning Theory, pages 102-113. ACM, 1999.\n[13] David Haussler, Michael Kearns, Nick Littlestone, and Manfred K. Warmuth. Equivalence of models for polynomial learnability. Information and Computation, 95(2):129-161, December 1991.\n[14] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, September 1995.\n[15] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.\n[16] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, December 1999.\n[17] Jyrki Kivinen and Manfred K. Warmuth. Boosting as entropy projection. In Proc. 12th Annu. Conf. on Comput. Learning Theory, pages 134-144. ACM, 1999.\n[18] John Lafferty. Additive models, boosting, and inference for generalized divergences. In Proc. 12th Annu. Conf. on Comput. Learning Theory, pages 125-133. ACM, 1999.\n", "award": [], "sourceid": 1737, "authors": [{"given_name": "Nigel", "family_name": "Duffy", "institution": null}, {"given_name": "David", "family_name": "Helmbold", "institution": null}]}