A Generalized Class of Boosting Algorithms based on Recursive Decoding Models

Elizabeth Tapia, José C. González, Julio Villena
Department of Telematics Engineering, Technical University of Madrid, Spain
{etapia, jcg, jvillena}@gsi.dit.upm.es

Abstract. A communication model for the Hypothesis Boosting (HB) problem is proposed. Under this model, the AdaBoost algorithm can be viewed as a threshold decoding approach for a repetition code. Generalizing this decoding view under the theory of Recursive Error Correcting Codes (RECCs) allows the formulation of a generalized class of low-complexity learning algorithms applicable to high-dimensional classification problems. In this paper, an instance of this approach suitable for High Dimensional Feature Spaces (HDFS) is presented.

1 Introduction

Established Machine Learning (ML) boosting theory [1] assumes a low-dimensional feature space setting. The extension of boosting to arbitrary High Dimensional Feature Spaces (HDFS) is an area of potential interest [2] in fields like Information Retrieval. In this paper, we address the extension of the HB concept to HDFS by recalling common-sense teaching and learning strategies and their similarity to the design of RECCs. The remainder of the paper is organized as follows. Section 2 discusses teaching and learning strategies. Section 3 introduces a communication model for the HB problem and the interpretation of the AdaBoost algorithm as an instance of APP threshold decoding. Section 4 introduces a generalized recursive learning approach for coping with complexity when constructing boosting algorithms in high-dimensional spaces, and presents a first-stage implementation suitable for high-dimensional input domains, the Turbo_Learn algorithm. Finally, Section 5 presents a summary and future work.

2 Teaching and Learning Strategies

How should we teach and how can we learn? Both questions are essential in the design of ML algorithms. Consider a teaching-through-examples process for a target concept c belonging to a class C : X → {−1, 1}. Similarly, let WL be a weak learning-from-examples algorithm and let S be the training sample. Trivial repetition of the target concept may be considered a simple teaching strategy for improving the performance of WL.


Such a strategy can be implemented by exposing S as many times as WL requires, reinforcing the presence of harder examples each time. In Machine Learning theory, this teaching strategy is no more than the hypothesis-boosting concept. In the next section, we show that the HB problem can be viewed as the transmission of concepts through a noisy channel. Thus, under suitable (concept) channel encoding, arbitrarily small (learning) error rates can be achieved.

3 A Communication Model for HB

Transmission of information through a noisy channel requires channel-encoding schemes [3]. Let us consider the transmission of concepts c belonging to a target class C : X → {0, 1} embedded in some metric space R^ρ. Assume that transmission is intended with accuracy ε, so that C can be described by a set A ⊂ R^ρ with N_ε^R(C) elements (the set A being a minimal ε-net for C under covering numbers theory [4][5]). Following [6], for each c ∈ C we can define a deterministic mapping E : C → B, so that each concept can be represented by a bitstream b ∈ B with length n_B

$$ n_B = H_\varepsilon^R(C) = \log_2 N_\varepsilon^R(C) \qquad (1) $$

In order to transmit c ∈ C, E simply selects the integer j ∈ {1, ..., N_ε^R(C)} for which the ball Ball(a_j, ε) with center at concept a_j and radius ε contains the target concept c. Similarly, we can define a decoding mapping D : B → R^ρ, so that a received bitstream b ∈ B is mapped into the concept a_j, j being the integer with bit representation b ∈ B. In communication terms, the mapping E : C → B can be modeled as a Discrete Memoryless Source (DMS) with output alphabet X, |X| = N_ε^R(C). Let q be the DMS output distribution and let H(X) be the entropy characterizing such a DMS

$$ p_U(a_k) = q_k, \qquad k = 1, \dots, N_\varepsilon^R(C), \quad a_k \in X \qquad (2) $$

Let us consider the transmission of information symbols from such a source through a Discrete Memoryless Channel (DMC) [7] characterized by a finite capacity C_Π and resembling the behavior of a weak learner. Shannon's Noisy Coding theorem [8] states that reliable transmission of information through a noisy channel can be achieved by suitable channel encoding. Coding proceeds by the transmission of arbitrarily long T-sequences of source symbols at an information rate r = k/T, k being the number of information symbols in each T-sequence. If almost random encoding is performed at the transmitter side, then as T → ∞ the block error probability P_e can be bounded as follows

$$ P_e \approx 2^{-T(C_\Pi - r)} \qquad (3) $$

Whenever r is less than the channel capacity C_Π, arbitrarily small (learning) error rates can be achieved by the suitable introduction of redundant parity concepts. It should be noted that for the learning case, r values are limited to 1/T (k = 1) if learning proceeds in a concept-by-concept fashion. For this case, the unique allowable linear block-coding scheme is a T-repetition code. Thus, in order to cope with learner limitations, a teacher would repeat the target concept T times, resembling the transmission of a codeword t

$$ t = (\underbrace{c(x), c(x), \dots, c(x)}_{T\ \text{times}}) \qquad (4) $$

Under the assumption of a weak learning algorithm with errors resembling a DMC channel, learning becomes a decoding problem on a received sequence r

$$ r = (h_1(x), h_2(x), \dots, h_T(x)) \qquad (5) $$

For binary transmitted and received concepts

$$ r = t + e \bmod 2 \qquad (6) $$
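As an illustration of this channel view, the following minimal sketch (ours, not from the original paper; the function name simulate_repetition_channel is hypothetical) simulates the transmission of a single binary concept value through T independent BSC-like weak-learner channels, producing the received sequence r = t + e mod 2 of equation (6).

```python
import random

def simulate_repetition_channel(c_x, error_probs, rng=random):
    """Transmit the bit c_x T times through independent BSC-like weak learners.

    c_x         : transmitted concept value at a fixed instance x, in {0, 1}
    error_probs : [p_1, ..., p_T], with p_i = P(e_i = 1) for the i-th weak learner
    Returns (t, e, r) with r = t XOR e, as in equations (4)-(6).
    """
    t = [c_x] * len(error_probs)                                # T-repetition codeword, eq. (4)
    e = [1 if rng.random() < p else 0 for p in error_probs]     # channel error vector
    r = [(ti + ei) % 2 for ti, ei in zip(t, e)]                 # received sequence, eq. (6)
    return t, e, r

# Example: one concept bit repeated T = 5 times over weak learners with 30% error rate.
t, e, r = simulate_repetition_channel(1, [0.3] * 5)
print(t, e, r)
```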

The decoding problem is to give a good estimate e* of the error vector e under prior knowledge of the channel behavior by means of the probabilities p_i = P(e_i = 1), 1 ≤ i ≤ T, so that a final estimate t* = r + e* can be assembled. Therefore, a suitable learning algorithm should in some respects correlate with the behavior of decoding schemes. From learning theory, we know that adaptation is a desirable feature for good generalization abilities and, in fact, the same requirement applies to decoding algorithms when dealing with very noisy channels. In decoding terms, adaptation is equivalent to the application of APP (A Posteriori Probability) decoding techniques. In the next subsection, we analyze APP decoding methods for T-repetition codes. For the sake of brevity, we refer the reader to Massey's original doctoral dissertation [9] for background on APP methods.

3.1 Threshold Decoding for T-Repetition Codes

Let us consider a simple T-repetition code and the threshold-decoding estimation of the unique information bit. A T-repetition code naturally induces the following trivial set of parity check equations

$$ A_t = e_1(x) - e_t(x), \qquad 2 \le t \le T \qquad (7) $$

The above set is orthogonal on bit e_1 (in the APP sense for linear block codes). Thus, we can estimate e_1 as follows

$$ e_1^* = \begin{cases} 1 & \text{if } 2 \sum_{i=2}^{T} A_i \log\!\left(\frac{1-p_i}{p_i}\right) > \sum_{i=1}^{T} \log\!\left(\frac{1-p_i}{p_i}\right) \\ 0 & \text{otherwise} \end{cases} \qquad (8) $$
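The following sketch (ours) illustrates rule (8) under the assumption that each check A_i is computed from the received bits as A_i = r_1 XOR r_i, which equals e_1 XOR e_i for a repetition code; the function name estimate_error_bit and the direction of the reconstructed inequality are our reading of the rule.

```python
import math

def estimate_error_bit(r, p):
    """Threshold estimate of e_1 from the parity checks of eq. (7)-(8).

    r : received bits r_1, ..., r_T, each in {0, 1}
    p : channel error probabilities p_1, ..., p_T
    """
    w = [math.log((1.0 - pi) / pi) for pi in p]        # APP weights log((1 - p_i) / p_i)
    A = [r[0] ^ r[i] for i in range(1, len(r))]        # checks orthogonal on e_1
    lhs = 2 * sum(Ai * wi for Ai, wi in zip(A, w[1:]))
    return 1 if lhs > sum(w) else 0

print(estimate_error_bit([1, 0, 0, 0, 0], [0.2] * 5))  # -> 1: r_1 was most likely flipped
```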


The receiver will perform the following estimate t_1^* depending on the received value r_1. It can be shown that

$$ t_1^* = \begin{cases} 0 & \text{if } \sum_{i=1}^{T} (2r_i - 1) \log\!\left(\frac{1-p_i}{p_i}\right) < 0 \\ 1 & \text{otherwise} \end{cases} \qquad (9) $$

Let us introduce a linear mapping φ(x) = 2x − 1 between the binary alphabets A = {0, 1} and Â = {−1, 1}. Then equation (9) can be expressed as

$$ c_1^* = \begin{cases} -1 & \text{if } \sum_{i=1}^{T} h_i \log\!\left(\frac{1-p_i}{p_i}\right) < 0 \\ 1 & \text{otherwise} \end{cases}, \qquad h_i \in \hat{A},\ 1 \le i \le T \qquad (10) $$
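A minimal sketch of the threshold-decoding rule (10), assuming received values h_i ∈ {−1, 1} and known error probabilities p_i; the function name threshold_decode is ours, not from the paper.

```python
import math

def threshold_decode(h, p):
    """APP threshold decoding of a T-repetition code, as in equation (10).

    h : received values h_1, ..., h_T, each in {-1, +1}
    p : channel error probabilities p_1, ..., p_T, each in (0, 0.5)
    Returns the estimate c_1* in {-1, +1}.
    """
    score = sum(h_i * math.log((1.0 - p_i) / p_i) for h_i, p_i in zip(h, p))
    return -1 if score < 0 else 1

# Example: three noisy "weak learner" outputs with different reliabilities.
print(threshold_decode([+1, -1, +1], [0.1, 0.4, 0.3]))   # -> 1
```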

3.2 The Repetition of Concepts and APP-Threshold Decoding

Let us assume a teaching-by-repetition strategy over T units of time on a fixed instance x through an additive DMC channel resembling the performance of a weak learner. Therefore, the learner can now implement APP decoding in its threshold-decoding form in order to arrive at a final decision. Assuming transmitted and received concepts with output domain {−1, 1}, we get

$$ c_1^*(x) = \begin{cases} -1 & \text{if } \sum_{i=1}^{T} h_i(x) \log\!\left(\frac{1-p_i}{p_i}\right) < 0 \\ 1 & \text{otherwise} \end{cases} \qquad (11) $$

where each p_i is the probability that a received concept h_i(x) is different from the transmitted concept c_i(x), 1 ≤ i ≤ T, i.e., the error probability achieved by the i-th WL. Denoting w_i = log((1 − p_i)/p_i), for fixed x equation (11) is almost the AdaBoost decision. However, two differences are observable. First, there is a factor of 1/2 difference between the APP weighting factors w_i and those derived from AdaBoost. Though this fact does not affect the final decision, its presence can be explained [10] by the exponential cost function used in AdaBoost instead of a log-likelihood criterion. The other difference is that the computation of the APP weighting factors w_i requires exact channel error probabilities. However, recall that in HB we always know the target concept at a finite set of sample points S, so that we can provide a sample mean estimate ŵ_t associated with each weak hypothesis h_t(x) under the distribution D_t over S as follows


$$ \hat{w}_t = \frac{1}{2} \cdot \log \frac{E_{S \approx D_t}\left[c(x) = h_t(x)\right]}{E_{S \approx D_t}\left[c(x) \ne h_t(x)\right]} \qquad (12) $$

Thus, if we use (12) to estimate the weighting factors w_t required by the threshold decoding rule, the target learner will issue a final decision h_f(x)

$$ h_f(x) = \mathrm{sign}\!\left( \sum_{t=1}^{T} \hat{w}_t \cdot h_t(x) \right) \qquad (13) $$

which is exactly the AdaBoost decision for discrete weak hypotheses with output domain Y = {−1, 1}. Concept repetition is a special case of general block concept-channel coding schemes. For AdaBoost-like boosting algorithms, there is no way to construct an unbounded set of orthogonal parity-check equations for increasing T values. At some point, dependency between distributions leads to significant correlation between errors, so that no further improvements can be achieved. At this point, the best we can do is to adjust the threshold coefficients, i.e., "the size of the weights is more important than the size of the network" [11][12].
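The correspondence between (12)-(13) and the AdaBoost decision can be made concrete with the following sketch (our own illustration, assuming discrete weak hypotheses with outputs in {−1, 1}; estimate_weight and final_decision are hypothetical helper names).

```python
import math

def estimate_weight(y_true, y_pred, dist, eps=1e-12):
    """Sample-mean estimate of the weight w_t, equation (12), under distribution D_t."""
    agree = sum(d for yt, yp, d in zip(y_true, y_pred, dist) if yt == yp)
    disagree = sum(d for yt, yp, d in zip(y_true, y_pred, dist) if yt != yp)
    return 0.5 * math.log((agree + eps) / (disagree + eps))

def final_decision(weights, hypotheses, x):
    """AdaBoost-style combined decision h_f(x), equation (13)."""
    score = sum(w * h(x) for w, h in zip(weights, hypotheses))
    return -1 if score < 0 else 1

# Example with two trivial weak hypotheses on scalar inputs.
h1 = lambda x: 1 if x > 0 else -1
h2 = lambda x: 1 if x > 2 else -1
xs, ys = [-1.0, 1.0, 3.0], [-1, 1, 1]
D = [1 / 3] * 3
w = [estimate_weight(ys, [h(x) for x in xs], D) for h in (h1, h2)]
print([final_decision(w, (h1, h2), x) for x in xs])
```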

4 Learning by Diversity: Recursive Models

The decoding view of the HB problem gives a simple explanation of the classic teaching-by-concept-repetition strategy. In addition, it also suggests many unexplored teaching schemes. When learning classes that are too complex, it would be useful to think of some kind of target concept expansion, so that any concept can be expressed and reconstructed from a fixed number of simple base concepts, i.e., a learning-by-diversity model. Let c(x) be a target concept admitting some kind of expansion

$$ c(x) = \mathrm{Span}(c_1(x), \dots, c_k(x)) \qquad (14) $$

Then, a teaching strategy for a weak learner may be viewed as the transmission of a frame of base concepts over a noisy channel. Each concept codeword in the frame must be decoded first in order to reconstruct the whole target concept. For each Span definition, a particular learning algorithm is obtained. A good example can be found in the ECOC [13] approach to M-class problems, where an M-valued target function is broken down into k ≥ log₂ M binary base functions through an Error Correcting Output Coding (ECOC) scheme. Thus, the selected encoding scheme implicitly defines the components in the Span expansion, whilst the minimum Hamming distance criterion defines the Span⁻¹ recombination function. An essential limitation of ECOC behavior is the increasing coding length required for better generalization performance and, of course, for growing M values. In fact, this is a well-known problem in coding theory, where the trade-off between block-coding length and error rates has been treated at length. Coding theory has found a promising solution to this problem in the theory of Recursive Error Correcting Codes, so that alternative low-complexity ECOC extensions could be derived from them.
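For concreteness, the following toy sketch (ours, not reproduced from [13]; the code matrix is hand-made and purely illustrative) shows the essence of an ECOC scheme: an M-class label is encoded as a k-bit codeword, k binary base functions are learned, and decoding picks the class whose codeword is closest in Hamming distance to the vector of binary predictions.

```python
def ecoc_decode(binary_preds, code_matrix):
    """Minimum-Hamming-distance ECOC decoding.

    binary_preds : list of k predictions in {0, 1}, one per binary base function
    code_matrix  : M x k binary matrix; row m is the codeword of class m
    Returns the index of the class whose codeword is closest to the predictions.
    """
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(range(len(code_matrix)), key=lambda m: hamming(binary_preds, code_matrix[m]))

# Example: M = 4 classes encoded with k = 5 bits.
code = [[0, 0, 0, 0, 0],
        [0, 1, 1, 0, 1],
        [1, 0, 1, 1, 0],
        [1, 1, 0, 1, 1]]
print(ecoc_decode([1, 1, 1, 1, 1], code))   # -> 3 (closest codeword)
```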


Definition 1: A bipartite graph is one in which the nodes can be partitioned into two different classes. An edge may connect nodes of distinct classes, but there are no edges connecting nodes of the same class.

Definition 2: A Recursive Error Correcting Code (RECC) is a code constructed from a number of similar and simple component subcodes.

A RECC in its simplest form can be described by bipartite graphs known as Tanner graphs [14]. Let us consider a simple example (Fig. 1) of a RECC constructed from two parity-check subcodes S1 and S2. Codewords in this simple RECC are all binary 6-tuples which simultaneously verify the parity restrictions imposed by each component subcode.


Fig. 1. Tanner graphs for a simple RECC built from two component subcodes

The main objective of defining codes in terms of component subcodes is to reduce decoding complexity. A RECC can be decoded by an ensemble of decoding processes, each one at a component subcode (the check nodes in Tanner graph terms), which later exchange information between them on the bits (the local variables in Tanner graph terms) they have in common. It should be noted that this decoding approach requires the implementation of local APP decoding methods, because these are the only methods that give probabilistic estimates of the code bits. For purposes of learning in high output domains, the set of code bits would define a set of binary weak learners, with their corresponding error rates, whose communication points are the component subcodes. The essential fact about Tanner graph representations is that they imply the existence of a message-passing mechanism between check nodes and local variables [15], and this is precisely what we need for the design of low-complexity learning algorithms in high-dimensional spaces.

Now, let us consider the learning problem for target classes defined over HDFS. A common-sense strategy would be to choose a reduced and informative number of features and teach through an associated attribute-filtered version of S. The problem with this strategy is the prior-knowledge requirement. It may happen that we do not have such prior knowledge, or even that no reduced set of informative attributes exists. In such cases, an alternative strategy can still be applied. We can expose different, perhaps random, attribute-filtered versions of S to a set of weak learners and then let them exchange information in order to encourage their common learning performance. It happens that this ML strategy can also be modeled by Tanner graphs under the theory of RECCs.
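As an illustration of the Tanner graph view (our own sketch; the small matrix below is illustrative and not necessarily the exact code of Fig. 1), a parity-check matrix can be turned into the check-node and variable-node adjacency lists over which such messages are exchanged.

```python
def tanner_graph(H):
    """Build Tanner graph adjacency lists from a binary parity-check matrix H.

    Returns (check_to_vars, var_to_checks): for each check node (row of H) the list
    of variable nodes (code bits) it constrains, and vice versa.
    """
    rows, cols = len(H), len(H[0])
    check_to_vars = [[j for j in range(cols) if H[i][j]] for i in range(rows)]
    var_to_checks = [[i for i in range(rows) if H[i][j]] for j in range(cols)]
    return check_to_vars, var_to_checks

# Illustrative two-subcode RECC on 6 bits.
H = [[1, 1, 1, 1, 0, 0],    # subcode S1 checks bits x1..x4
     [0, 0, 1, 1, 1, 1]]    # subcode S2 checks bits x3..x6
print(tanner_graph(H))
```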


4.1 Boosting Algorithms in HDFS

Let C : X → K be a target class; the problem is to reduce the learning complexity arising from the number q of features in X. In the absence of prior knowledge about the relevant features, we may take the sample S and perform d Random Feature Filtering (RFF) steps over the set of feature vectors available in the training set S. We consider a low-density (with respect to q) binary random attribute-filter matrix H_{d×q}, characterized by the presence of k ones per row and j ones per column, similar to the parity-check matrix of a Low-Density Parity-Check (LDPC) code [16][17]. The whole filtering process implemented by H over X can be modeled using a Tanner graph, as shown in Fig. 2.


Fig. 2. RFF Tanner graph with q = 6, d = 4, k = 3, j = 2

The RFF process creates a set of input spaces X_1, ..., X_d, with |X_r| = k ≪ q, 1 ≤ r ≤ d. Therefore, from a sample S we can obtain a vector of samples S_H whose components are RFF versions of S. Because of the underlying random and sparse structure of the diversity matrix H, the sample components in S_H may be assumed to be independent. Then, we can apply a set of d supervised learning processes, so that each weak learning algorithm L_r over a sample S_r issues a weak hypothesis h_r(x), r = 1, ..., d. As each learner sees only a fraction of the feature space X, its decisions suffer from some kind of distortion due to the filtering process.

4.2 Recursive Classifiers in HDFS

Assume that we have a teacher, a target concept, and two different weak learners. Differences between the learners arise because of their distinct criteria about the most important features defining the target concept. The same target concept is taught to each weak learner by concept repetition over d units of time. After the teacher has completed his class, each learner will be asked about the target concept. Both learners are allowed to exchange information before issuing a final decision. The first learner will issue a first decision after d units of time and then will help the second learner to improve its decision. The second learner will repeat the process and will help the first after d units of time, and so on. From the theory of RECCs, the proposed architecture is a

simplified learning version of a turbo coding scheme [18]. In Fig. 3, the proposed learning model is shown by a Tanner graph representation.

Fig. 3. Turbo_Learn exchange of information by means of a Tanner graph

Propagation of messages begins at the first graph, from left to right, until reaching check node L4, and then continues in the second graph structure. This structure can be generalized using T parallel boosting units, thus defining the Turbo_Learn algorithm.

Turbo_Learn Algorithm
Input: LDPC matrix H_{d×q}, sample S with |S| = m, weak learner WL, d, T
Initialization: D_1^0(i) = 1/m, 1 ≤ i ≤ m
For each 1 ≤ t ≤ T, for each 1 ≤ r ≤ d, do
    h_r^t(x) = WL(D_r^{t−1}, S_r), choose α_r^t ∈ R
    D_r^{t+1}(i) = D_r^t(i) · exp(−α_r^t · y_i · h_r^t(x_i)) / Z_r^t, 1 ≤ i ≤ m
    D_p^t(i) = D_r^{t+1}(i), 1 ≤ i ≤ m, with p = (r + 1) mod (d + 1)
Output: h_f(x) = sign( Σ_{t=1}^{T} Σ_{r=1}^{d} α_r^t · h_r^t(x) )
End
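A minimal, simplified Python sketch of the Turbo_Learn loop, assuming decision stumps as the weak learner and the AdaBoost choice α = ½ log((1 − ε)/ε); helper names are ours, and the single shared, handed-on distribution is one possible reading of the pseudocode above.

```python
import math

def train_stump(X, y, D):
    """Weighted decision stump: threshold on one feature, minimizing weighted error."""
    n = len(X[0])
    best = (None, None, 1, float("inf"))               # (feature, threshold, polarity, error)
    for f in range(n):
        for thr in sorted({row[f] for row in X}):
            for pol in (+1, -1):
                preds = [pol if row[f] > thr else -pol for row in X]
                err = sum(d for d, p, yi in zip(D, preds, y) if p != yi)
                if err < best[3]:
                    best = (f, thr, pol, err)
    f, thr, pol, err = best
    return (lambda row: pol if row[f] > thr else -pol), err

def turbo_learn(S_views, y, T):
    """Simplified Turbo_Learn: d boosting units over attribute-filtered views.

    S_views[r] : the m feature vectors of the sample filtered by row r of H
    y          : the m labels in {-1, +1}
    """
    m, d = len(y), len(S_views)
    D = [1.0 / m] * m
    ensemble = []                                      # (alpha, r, hypothesis) triples
    for t in range(T):
        for r in range(d):
            X_r = S_views[r]
            h, err = train_stump(X_r, y, D)
            err = min(max(err, 1e-10), 1 - 1e-10)
            alpha = 0.5 * math.log((1 - err) / err)    # AdaBoost-style choice of alpha_r^t
            D = [Di * math.exp(-alpha * yi * h(xi)) for Di, yi, xi in zip(D, y, X_r)]
            Z = sum(D)
            D = [Di / Z for Di in D]                   # normalized, then handed to the next unit
            ensemble.append((alpha, r, h))
    def h_f(x_views):                                  # x_views[r] = instance filtered by row r of H
        return 1 if sum(a * h(x_views[r]) for a, r, h in ensemble) >= 0 else -1
    return h_f
```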

Theorem 1: The training error ε of Turbo_Learn is at most $\prod_{t=1}^{T} \prod_{r=1}^{d} Z_r^t$.

Proof: The proof is almost the same as that of AdaBoost [1], obtained by unraveling the expression of D_1^t(i) after T boosting steps. To conclude, in Fig. 4 we present a representative Turbo_Learn test error response on the Vote dataset (UCI Repository of ML databases). We used the methods in [19] to generate the H matrices and a Decision Stump algorithm as the base learner.


Fig. 4. Representative Turbo_Learn test error response on the Vote dataset