k-fold cross validation

How to discriminate….. by Shuichi Shinmura 2015/9/5

How to discriminate a small sample by the “k-fold cross validation” method with LINGO

Shuichi Shinmura

There have been some questions about the six LDFs and about the new method and procedure in LINGO, and also a question about the ‘DATA’ and ‘SETS’ sections of LINGO that is not related to my research. Therefore, I upload part of my draft here. The full text will be uploaded next year, after the conference.

I. MODEL FOR IRIS DATA

A. Natural (Mathematical) Notation by LINGO

In this section, we explain the natural (mathematical) notation of Revised IP-OLDF in LINGO. We can describe the model either in the natural notation or in the “SETS and CALC” notation. The natural notation is the same as ordinary mathematical notation. Suppose we wish to obtain the global minimum of the function Z(x, y) = x*SIN(y) + y*SIN(x). We define the objective function and two constraints, and obtain the output in Figure 1. The ‘@’ prefix marks a LINGO function. “@BND(-10, X1, 10)” expresses the constraint (-10 ≤ X1 ≤ 10). Within one second, we obtain -15.8 as the global minimum (the true minimum, not a local minimum) at (x, y) = (7.98, -7.98).

MIN=X1*@SIN(Y1)+Y1*@SIN(X1); @BND(-10,X1,10); @BND(-10,Y1,10);

Figure 1: Global Minimum Solution.

If we wish to draw the contour graph, we can define the model with the “SETS and CALC” sections. The ‘MODEL:’ section consists of sub-sections such as “SETS and CALC” and ends at ‘END’. We insert the ‘SETS: … ENDSETS’ and ‘CALC: … ENDCALC’ sections before and after the optimization model. The ‘SETS’ section defines a one-dimensional set such as ‘POINTS’ with 21 elements by ‘/1..21/’; the element number can instead be defined in the ‘DATA’ section as data. We can define a two-dimensional set by the combination of two one-dimensional sets, such as ‘POINTS2 (POINTS, POINTS):’, which is a two-dimensional set with (21, 21) elements. We can define a three-dimensional set by the combination of three one-dimensional sets. ‘POINTS2’ defines three arrays, “X, Y, Z”, each with (21, 21) elements, on the right side of ‘:’. Set ‘POINTS’ has no one-dimensional array. The ‘CALC’ section is a programming language that can optimize the MP model and control it. In the ‘CALC’ section, we draw the contour of Z(x, y) on the mesh x = (-10, -9, …, 9, 10) and y = (-10, -9, …, 9, 10). Figure 2 shows the contour graph. Because XS = @FLOOR(-(@SIZE(POINTS)/2)+0.5) = @FLOOR(-(21/2)+0.5) = @FLOOR(-10) = -10, we can replace this statement directly by ‘XS = -10’. “@FOR(POINTS2(I,J): …)” is the loop function shown below. Because an MP-based model has the same structure in its constraints, we need not describe those constraints one by one. The “@CHART( …, 'CONTOUR', …);” function draws the contour in Figure 2. If we replace ‘CONTOUR’ by ‘SURFACE’, we can draw the surface plot. The @FOR loop is equivalent to the following pseudo-code:

For i = 1 To 21
  For j = 1 To 21
    X(i,j) = XS + i - 1
    Y(i,j) = YS + j - 1
    Z(i,j) = X(i,j)*SIN(Y(i,j)) + Y(i,j)*SIN(X(i,j))
  Next j
Next i

MODEL:
SETS:
  POINTS /1..21/;
  POINTS2 (POINTS,POINTS): X, Y, Z;
ENDSETS

MIN = X1*@SIN(Y1) + Y1*@SIN(X1);
@BND(-10,X1,10); @BND(-10,Y1,10);

CALC:
  XS = @FLOOR(-(@SIZE(POINTS)/2)+0.5);
  YS = XS;
  @FOR (POINTS2(I,J):
    X(I,J) = XS + I-1;
    Y(I,J) = YS + J-1;
    Z(I,J) = X(I,J)*@SIN(Y(I,J)) + Y(I,J)*@SIN(X(I,J)));
  @CHART( 'X Y Z', 'CONTOUR', 'Z = X*@SIN(Y)+Y*@SIN(X)' );
  @CHART( 'X Y Z', 'SURFACE', 'Z = X*@SIN(Y)+Y*@SIN(X)' );
ENDCALC
END
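To check the result of Figure 1 numerically, the following NumPy fragment (a sketch, independent of LINGO) evaluates Z on the same 21*21 integer mesh that the CALC section builds. The mesh minimum is about -15.83 at (x, y) = (8, -8), consistent with the global minimum -15.8 at (7.98, -7.98) that LINGO reports.

```python
import numpy as np

# Replicate the CALC-section mesh: x and y run over -10, -9, ..., 9, 10.
x = np.arange(-10, 11, dtype=float)
y = np.arange(-10, 11, dtype=float)
X, Y = np.meshgrid(x, y)                # (21, 21) arrays, like POINTS2
Z = X * np.sin(Y) + Y * np.sin(X)       # Z(x, y) = x*sin(y) + y*sin(x)

# The mesh minimum approximates the global minimum that LINGO finds;
# a grid evaluation is only an approximation, not a global optimizer.
zmin = Z.min()
```

The minimum of the integer mesh already lands close to LINGO's global optimum because the optimizer (7.98, -7.98) is near the grid point (8, -8).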


Figure 2: The Contour Graph.

B. The Iris Data on Excel

We discriminate the iris data in the Excel file in Figure 3 by Revised IP-OLDF. The data consist of two species, versicolor (yi = 1) and virginica (yi = -1). Each species has 50 cases with five variables (four independent variables and the indicator yi). We define the Excel range name ‘IS’ as “B2:F101”. LINGO retrieves the values of ‘IS’ by “IS = @OLE();” and stores them in the LINGO array named ‘IS’.

Figure 3: The Iris Data on Excel.

Next, we define the Excel range name ‘CHOICE’ as “H2:L16”. The fifteen rows correspond to the models from the full model (X1, X2, X3, X4) to the one-variable model (X2). After optimization, we output three LINGO arrays: ‘NM’ to the Excel range name “NM (M2:M16)”, the number of cases on the discriminant hyperplane to the Excel range name “ZERO (N2:N16)”, and the discriminant coefficients to “VARK100 (O2:S16)”, by the “@OLE() = NM, ZERO, VARK100;” function.

C. Six LDFs by LINGO

In this paper, we explain the models by LINGO, the solver developed by LINDO Systems Inc. We develop six LDFs in the ‘SETS’ notation. The six LDFs are Revised IP-OLDF (RIP), Revised IPLP-OLDF (IPLP), Revised LP-OLDF (LP), H-SVM, and two S-SVMs (SVM4 and SVM1). Because Revised IPLP-OLDF is a two-stage algorithm combining Revised LP-OLDF and Revised IP-OLDF, its model differs from those of the other five LDFs. Revised IP-OLDF in equation (1) can find the actual MNM by “MIN = Σei” because it can directly find an interior point of the OCP. If case xi is correctly classified, ei = 0. If case xi is misclassified, ei = 1. Because the discriminant score becomes negative for a misclassified case, Revised IP-OLDF selects the alternative discriminant hyperplane “yi*(txi b + b0) = 1 - M*ei” instead of “yi*(txi b + b0) = 0”.

MIN = Σ ei ;
yi * (txi b + b0) >= 1 - M * ei ;   (1)
  b: p discriminant coefficients; b0: the constant term;
  xi: (1*p) case vector of the (n*p) data; (txi b + b0): the discriminant score;
  M: big-M constant, such as 10000;
  yi: yi = 1 for class 1 and yi = -1 for class 2;
  ei: 0/1 integer decision variable corresponding to xi.

We can define this model in a ‘SUBMODEL’ section in LINGO; ‘RIP’ is the sub-model name. We can solve and control this integer programming (IP) model by this name. “@SUM and @FOR” are two important LINGO loop functions. “@SUM(N(i): E(i))” means “Σi=1..n E(i)”. “@FOR(N(i): …)” defines the n constraints “@SUM(P1(j): IS(i,j)*VARK(j)*CHOICE(k,j)) >= 1 - BIGM*E(i)”. “@FOR(P1(j): @FREE(VARK(j)));” defines the coefficients VARK (including the constant) as free decision variables. “@FOR(N(i): @BIN(E(i)));” defines the n variables ei as 0/1 integer decision variables.

SUBMODEL RIP:
MIN = ER;
ER = @SUM(N(i): E(i));
@FOR(N(i): @SUM(P1(j): IS(i,j)*VARK(j)*CHOICE(k,j)) >= 1 - BIGM*E(i));
@FOR(P1(j): @FREE(VARK(j)));
@FOR(N(i): @BIN(E(i)));
ENDSUBMODEL
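The same big-M formulation can be sketched outside LINGO. The following Python fragment (an illustration with a tiny made-up 2-D sample, not the author's code) expresses Revised IP-OLDF as a mixed-integer program with SciPy's `milp`; the sample is constructed so that the minimum number of misclassifications (MNM) is 1.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Tiny made-up 2-D sample: the two classes are not linearly separable
# (three collinear points with the middle one in the other class), so MNM = 1.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-3.0, 0.0],   # class +1
              [-2.0, 0.0], [-3.0, -1.0]])            # class -1
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0])
n, p = X.shape
M = 10000.0                                          # big-M constant

# Decision variables z = (b0, b1..bp, e1..en):
# minimize sum(e_i) subject to y_i*(x_i.b + b0) + M*e_i >= 1.
c = np.r_[np.zeros(p + 1), np.ones(n)]
A = np.hstack([y[:, None],                           # y_i * b0
               y[:, None] * X,                       # y_i * (x_i . b)
               M * np.eye(n)])                       # + M * e_i
cons = LinearConstraint(A, np.ones(n), np.full(n, np.inf))
integrality = np.r_[np.zeros(p + 1), np.ones(n)]     # e_i are integer
bounds = Bounds(np.r_[np.full(p + 1, -np.inf), np.zeros(n)],   # b free,
                np.r_[np.full(p + 1, np.inf), np.ones(n)])     # e_i in {0,1}

res = milp(c, constraints=cons, integrality=integrality, bounds=bounds)
mnm = int(round(res.fun))                            # minimum number of misclassifications
```

The free `b` variables mirror “@FREE(VARK(j))” and the binary `e` variables mirror “@BIN(E(i))”; any MIP solver can play the role LINGO plays here.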

If we insert ‘!’, as in “! @FOR(N(i):@BIN(E(i)));”, the statement becomes a mere comment, and the ei become non-negative real decision variables. This model is Revised LP-OLDF. A ‘SUBMODEL’ section accepts an arbitrary character string as its name; therefore, we define the model of Revised LP-OLDF under the name ‘LP’.

SUBMODEL LP:
MIN = ER;
ER = @SUM(N(i): E(i));
@FOR(N(i): @SUM(P1(j): IS(i,j)*VARK(j)*CHOICE(k,j)) >= 1 - BIGM*E(i));
@FOR(P1(j): @FREE(VARK(j)));
! @FOR(N(i): @BIN(E(i)));
ENDSUBMODEL

Moreover, we define the ‘SUBMODEL’ named ‘RIP’.

SUBMODEL RIP:
@FOR(N(i): @BIN(E(i)));
ENDSUBMODEL

With these two ‘SUBMODEL’s, we can discriminate the data by Revised IP-OLDF through the “@SOLVE(LP, RIP)” function in the ‘CALC’ section. Next, we define Revised IPLP-OLDF. In the first stage, we discriminate the data by Revised LP-OLDF. In the second stage, we discriminate only the cases misclassified by Revised LP-OLDF. Therefore, we must distinguish the two alternatives stored in the array ‘CONSTANT’, and Revised IP-OLDF discriminates only the misclassified cases through the sub-model ‘CONS’.

SUBMODEL CONS:
@FOR(N(i) | CONSTANT(i) #GT# 0: @BIN(E(i)));
@FOR(N(i) | CONSTANT(i) #EQ# 0: E(i) = 0);
ENDSUBMODEL

In the ‘CALC’ section, we insert the statements below for Revised IPLP-OLDF.

@SOLVE(LP);
@FOR(N(i): @IFC(E(i) #EQ# 0: CONSTANT(i) = 0; @ELSE CONSTANT(i) = 1;));
MNM = 0; ER1 = 0; MNM2 = 0; ER2 = 0;
@FOR(P1(J): VARK(J) = 0; @RELEASE(VARK(J)));
@SOLVE(LP, IPLP);
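The two-stage idea can be sketched as follows (an illustration with the same tiny made-up 2-D sample as before, using SciPy rather than LINGO): stage 1 solves the LP relaxation (Revised LP-OLDF), and stage 2 keeps ei = 0 for the cases the LP classifies correctly while making ei binary only for the rest, mirroring the CONSTANT bookkeeping above.

```python
import numpy as np
from scipy.optimize import linprog, milp, LinearConstraint, Bounds

# Tiny made-up non-separable 2-D sample (MNM = 1).
X = np.array([[2.0, 0.0], [3.0, 1.0], [-3.0, 0.0],
              [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0])
n, p = X.shape
M = 10000.0

c = np.r_[np.zeros(p + 1), np.ones(n)]
A = np.hstack([y[:, None], y[:, None] * X, M * np.eye(n)])

# Stage 1: LP relaxation (Revised LP-OLDF), e_i continuous in [0, 1].
lp = linprog(c, A_ub=-A, b_ub=-np.ones(n),
             bounds=[(None, None)] * (p + 1) + [(0.0, 1.0)] * n)
misclassified = lp.x[p + 1:] > 1e-8        # like CONSTANT(i) > 0 above

# Stage 2: binary e_i only for the stage-1 misclassified cases;
# e_i is fixed at 0 for the correctly classified cases.
integrality = np.r_[np.zeros(p + 1), misclassified.astype(float)]
ub_e = np.where(misclassified, 1.0, 0.0)
bounds = Bounds(np.r_[np.full(p + 1, -np.inf), np.zeros(n)],
                np.r_[np.full(p + 1, np.inf), ub_e])
res = milp(c, constraints=LinearConstraint(A, np.ones(n), np.full(n, np.inf)),
           integrality=integrality, bounds=bounds)
mnm_iplp = int(round(res.fun))
```

The restricted second-stage IP is much smaller than a full Revised IP-OLDF run, which is the point of the two-stage algorithm.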

S-SVM is defined in (2) with two objectives, which are combined by the “penalty c.” In this research, two S-SVMs, SVM4 (c = 10^4) and SVM1 (c = 1), are examined to show that the mean error rates of SVM4 are almost always better than those of SVM1. If we delete the second objective ‘c*Σei’ or set ‘c = 0’, the model becomes H-SVM.

MIN = ||b||^2 / 2 + c * Σ ei ;
yi * (txi b + b0) >= 1 - ei ;   (2)
  b, xi, (txi b + b0), yi: the same as in equation (1);
  c: penalty c;
  ei: non-negative real decision variable.

SUBMODEL SSVM:
MIN = ER;
ER = @SUM(P(j1): VARK(j1)^2)/2 + BIGM*@SUM(N(i): E(i));
@FOR(N(i): @SUM(P1(j): IS(i,j)*VARK(j)*CHOICE(k,j)) >= 1 - E(i));
@FOR(P1(j): @FREE(VARK(j)));
ENDSUBMODEL

(In this sub-model, ‘BIGM’ plays the role of the penalty c.)
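Objective (2) can also be minimized directly. The following sketch (an illustration on a tiny made-up 2-D sample; the paper solves (2) exactly as a quadratic program in LINGO) applies subgradient descent to ||b||^2/2 + c*Σei with ei = max(0, 1 - yi*(xi.b + b0)).

```python
import numpy as np

# Tiny made-up separable 2-D sample.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
c = 1.0                                   # penalty c (SVM1 uses c = 1)

b = np.zeros(X.shape[1])                  # coefficients b
b0 = 0.0                                  # constant term
lr = 0.01
for _ in range(2000):
    margins = y * (X @ b + b0)
    viol = margins < 1                    # cases with e_i = max(0, 1 - margin) > 0
    # Subgradient of ||b||^2/2 + c * sum(e_i):
    gb = b - c * (y[viol, None] * X[viol]).sum(axis=0)
    gb0 = -c * y[viol].sum()
    b -= lr * gb
    b0 -= lr * gb0

pred = np.sign(X @ b + b0)                # discriminant rule
```

On a separable sample such as this, the soft-margin solution classifies every case correctly; on overlapping data, some ei stay positive and the penalty c trades margin width against training errors, which is exactly the difference being probed between SVM4 and SVM1.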

If we insert the six LDFs before the ‘CALC’ section, we can easily discriminate the data by all six LDFs.

D. Discrimination of the Iris Data by LINGO

We can discriminate the iris data on Excel by Revised IP-OLDF using the “SETS, DATA, SUBMODEL, CALC and DATA” sections. In the ‘SETS’ section, “P, P1, N and MS” are one-dimensional sets; their element numbers, 4, 5, 100 and 15, are defined in the ‘DATA’ section. Set ‘P1’ has the one-dimensional array ‘VARK’ that stores the discriminant coefficients of one discriminant model. Set ‘N’ has two one-dimensional arrays: ‘E’ stores the 100 binary integer values of ei, and ‘CONSTANT’ stores the 100 discriminant scores. “ERR(MS):” has two one-dimensional arrays: ‘NM’ and ‘ZERO’ store the number of misclassifications (NM) and the number of cases on the discriminant hyperplane, shown in Figure 4. If we discriminate the data, the ‘NM’ column shows the MNMs of the fifteen models. From the ‘ZERO’ column, Revised IP-OLDF is free from Problem 1. Because the other LDFs cannot avoid Problem 1, all LDFs should output these numbers; until they do, we cannot trust the reported NMs. ‘VARK100’ stores the fifteen coefficient vectors ‘VARK’ of Revised IP-OLDF, shown in Figure 5.

MODEL:
SETS:
  P; P1: VARK; P2; N: E, CONSTANT; MS;
  ERR(MS): NM, ZERO;
  D(N,P1): IS;
  MB(MS,P1): CHOICE;
  VP(MS,P1): VARK100;
ENDSETS
DATA:
  P = 1..4; P1 = 1..5; N = 1..100; MS = 1..15;
  CHOICE, IS = @OLE();
ENDDATA
SUBMODEL RIP:
MIN = ER;
ER = @SUM(N(i): E(i));
@FOR(N(i): @SUM(P1(J1): IS(i,J1)*VARK(J1)*CHOICE(k,J1)) >= 1 - BIGM*E(i));
@FOR(P1(J1): @FREE(VARK(J1)));
@FOR(N(i): @BIN(E(i)));
ENDSUBMODEL
CALC:
  @SET('DEFAULT'); @SET('TERSEO',2);
  K = 1; G = 1; LEND = @SIZE(MS);
  @WHILE(K #LE# LEND:
    @FOR(P1(J): VARK(J) = 0; @RELEASE(VARK(J)));
    NM = 0; Z = 0; BIGM = 10000;
    @SOLVE(RIP);
    @FOR(P1(J1): VARK100(@SIZE(MS)*(G-1)+K, J1) = VARK(J1)*CHOICE(k,J1));
    @FOR(N(I): CONSTANT(i) = @SUM(P1(J1): IS(i,J1)*VARK(J1)*CHOICE(k,J1)));
    @FOR(N(I): @IFC(CONSTANT(i) #EQ# 0: Z = Z + 1));
    @FOR(N(I): @IFC(CONSTANT(i) #LT# 0: NM = NM + 1));
    NM(K) = NM; ZERO(K) = Z;
    K = K + 1);
ENDCALC
DATA:
  @OLE() = NM, ZERO, VARK100;
ENDDATA
END
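The NM/ZERO bookkeeping inside the @WHILE loop can be sketched in Python (an illustration; it assumes, as in the models above, that class-2 rows of IS are sign-flipped so that a correct classification always yields a positive discriminant score):

```python
import numpy as np

def count_nm_zero(scores):
    """NM counts negative scores (misclassified cases); ZERO counts cases
    lying exactly on the discriminant hyperplane (Problem 1)."""
    scores = np.asarray(scores, dtype=float)
    nm = int((scores < 0).sum())
    zero = int((scores == 0).sum())
    return nm, zero

# Hypothetical scores for six cases: two negative, one exactly zero.
nm, zero = count_nm_zero([2.5, 1.0, -0.3, 0.0, 4.2, -1.1])
```

Counting the zero scores separately is what lets Revised IP-OLDF report Problem 1 explicitly, as discussed above.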

II. THE 100-FOLD CROSS VALIDATION METHOD

A. How to generate the re-sampling sample and prepare the data on an Excel file

We generate a re-sampling sample from the original data and evaluate the six LDFs by our method. We explain this procedure with the Fisher’s iris data, which consist of two species, virginica (yi = 1) and versicolor (yi = -1). Each species is composed of 50 cases with four variables and the classifier yi. The ‘SETS’ section defines six one-dimensional sets: P, P1, P2, N, MS, and G100. “P, P1, and P2” are the number of independent variables, the number of (independent variables + constant), and the number of (independent variables + constant + subgroup); their element numbers are defined in the ‘DATA’ section as 4, 5 and 6, respectively. Only ‘P1’ defines a one-dimensional array, named ‘VARK’, with 5 elements that store the discriminant coefficients of the training sample. Set ‘N’ with 100 elements defines “E, SCORE, CONSTANT”. Array ‘E’ is a one-dimensional array with 100 elements and corresponds to the ei. “N, N2, MS, MS100 and G100” are one-dimensional sets with 100, 10000, 15, 1500 and 100 elements, respectively. The two-dimensional set ‘D(N,P1):’ of size 100*5 has the same-size array ‘IS’ that stores the 100-case sub-samples as the evaluation samples. “D2(N2,P2):” of size 10000*6 has the same-size array ‘ES’ that stores the re-sampling sample as the validation sample. We copy each species 100 times, add a random number as the sixth variable, sort each species by the random number, and add the sub-group number from 1 to 100 as the seventh variable. The re-sampling sample consists of 5,000 cases per species and seven variables, as shown in Table 1. The ‘R’ column is the random number, sorted in ascending order within each class. Five data sets, including this one, are input by “ES = @OLE();” in the ‘DATA’ section. The ‘@OLE()’ function reads the data from the Excel range name ‘ES’, such as “A2:F10001” if cell ‘X1’ is located at ‘A1’, and defines the LINGO array ES.

TABLE I. RESAMPLING SAMPLE: ES

SG    R    X1         X2         X3         X4         yi
1     …    x(1,1)     x(2,1)     x(3,1)     x(4,1)     1
…     …    …          …          …          …          1
100   …    x(1,k)     x(2,k)     x(3,k)     x(4,k)     1
1     …    -x(1,k+1)  -x(2,k+1)  -x(3,k+1)  -x(4,k+1)  -1
…     …    …          …          …          …          -1
100   …    -x(1,2k)   -x(2,2k)   -x(3,2k)   -x(4,2k)   -1
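The construction of Table 1 can be sketched as follows (an illustration with made-up data; the paper builds the sample on Excel): each species of 50 cases is copied 100 times, a uniform random number is appended as the sixth column, the rows are sorted by it within the species, and the sub-group number 1..100 is appended as the seventh column, so that each sub-group contains 50 cases.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_species(cases, copies=100):
    """Copy the species `copies` times, append a random-number column R,
    sort by R, and append the sub-group column SG (1..copies)."""
    tiled = np.tile(cases, (copies, 1))              # 50 * 100 = 5000 rows
    r = rng.random(len(tiled))
    tiled = np.column_stack([tiled, r])              # sixth column: R
    tiled = tiled[np.argsort(tiled[:, -1])]          # sort by R
    sg = np.repeat(np.arange(1, copies + 1), len(cases))
    return np.column_stack([tiled, sg])              # seventh column: SG

# Made-up stand-in for one species: 50 cases of 4 variables + yi.
species1 = rng.normal(size=(50, 5))
es1 = resample_species(species1)
```

Sorting by the random column shuffles the copies, so each of the 100 sub-groups is an independent random sub-sample of the species; this is what makes the 100 folds exchangeable.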

If you wish to see the full model of the “k-fold cross validation” method in LINGO, see the ICORES2014 paper.
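For reference, the generic k-fold index split behind the method's name can be sketched as follows (a minimal illustration of standard k-fold cross-validation; the 100-fold method above instead trains on each sub-group of the re-sampling sample and validates on the full re-sampling sample):

```python
def kfold_indices(n, k):
    """Partition indices 0..n-1 into k consecutive folds whose sizes
    differ by at most one."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

# Each fold serves once as the validation set while the rest train the LDF.
folds = kfold_indices(100, 10)
```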