Functional Networks Training Algorithm for Statistical Pattern Recognition

Emad A. El-Sebakhy
Assistant Professor of Computer Science
Department of Mathematics, Computer Science & Statistics
State University of New York, College at Oneonta, Oneonta, NY 13820
Tel: (607)-277-9663, email: [email protected]

Abstract: Pattern classification is the study of how machines can observe the environment, learn to distinguish patterns of interest from their background, and make reasonable decisions about the categories of the patterns. It is very important in a variety of engineering and scientific disciplines such as computer vision, artificial intelligence, and medicine. New and emerging applications, such as web searching, multimedia data retrieval, data mining, and machine learning, require robust and efficient pattern classification techniques. Recently, the functional network has been proposed as a generalization of the standard neural network. In this paper we deal with statistical pattern recognition via functional networks and investigate their performance on some real-world applications. We use functional equations to approximate the neuron functions, which allows a wide class of functions to be represented. The steps for working with functional networks and the structural learning are proposed.

Keywords: Pattern Classification; Functional Networks; Neural Networks; Functional Equations; Least Squares; Lagrangian Multipliers.

1 The Pattern Classification Meaning

Generally, statistical pattern recognition is critical to most human decision-making: "the more relevant patterns at your disposal, the better your decision will be" (Ross, 1998). Many problems in science and engineering can be considered pattern classification problems, for instance, intelligent searching, fraud detection, bioinformatics, and intelligent search engines.

In recent years, dealing with the pattern classification problem in high dimensions has not been an easy task. The most common definition of pattern classification can be summarized as follows: pattern classification is a mapping from the matrix of predictor variables $X \in \mathbb{R}^p$ to a vector of class categories, $Y \subset \mathbb{R}$. This means that each pattern is given in terms of the independent predictors $X_1, \ldots, X_p$, and the goal is to establish a rule that separates predictors belonging to different groups $A_1, A_2, \ldots, A_c$. Without loss of generality, we write the classes $\{A_1, A_2, \ldots, A_c\}$ as $\{0, 1, \ldots, c-1\}$ and, for convenience, $X_j = (x_{1j}, \ldots, x_{nj})^{\mathsf T}$, for all $j = 1, 2, \ldots, p$, as the $j$th predictor variable. We use $D = \{(x_1, \ldots, x_p, y)\}$ as the given training set, the lower-case letters $x_{i1}, x_{i2}, \ldots, x_{ip}$, for $i = 1, \ldots, n$, to refer to the values of each observation of the feature variables, and $y = k - 1$ for the response variable $Y$ (class $A_k$), for $k = 1, \ldots, c$. We use the symbol $\pi_{ik}$ for the probability that observation $i$ falls in class $A_k$, that is,

$$ \pi_{ik} = P(y_i = k \mid x_i); \quad k = 0, 1, \ldots, c-1, \qquad (1) $$

such that $\pi_{ik} \ge 0$ and $\sum_{k=0}^{c-1} \pi_{ik} = 1$, for all $i = 1, \ldots, n$.
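To make the notation concrete, the following sketch (our illustration; the data values and variable names are made up, not taken from the paper) encodes a small training set $D$ and a matrix of class-membership probabilities $\pi_{ik}$ satisfying the constraints in (1):

```python
import numpy as np

# Hypothetical training set D with n = 4 observations, p = 2 predictors,
# and c = 3 classes labeled 0, 1, 2 (illustrative values only).
X = np.array([[5.1, 3.5],
              [6.2, 2.9],
              [4.9, 3.0],
              [6.7, 3.1]])
y = np.array([0, 1, 0, 2])

n, c = X.shape[0], 3

# pi[i, k] plays the role of pi_ik = P(y_i = k | x_i); here we start from
# the observed labels, so each row is a one-hot probability vector.
pi = np.zeros((n, c))
pi[np.arange(n), y] = 1.0

# Constraints from equation (1): pi_ik >= 0 and each row sums to 1.
assert np.all(pi >= 0)
assert np.allclose(pi.sum(axis=1), 1.0)
```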

The structure of the paper can be summarized as follows. Section 1 presents the meaning of pattern classification and its importance. Section 2 gives a literature review and briefly describes some of the common techniques that have dealt with the pattern classification problem. Section 3 gives a brief description of functional networks and the steps needed to work with them. We propose the functional network classifier and the learning algorithm in Section 4. Selecting the best network and validating it on real-world applications is presented in Section 5. We draw conclusions and outline future work in Section 6.

2 Literature Review

Researchers have looked for techniques to deal with the pattern classification problem, such as classical statistical techniques (discriminant analysis, logistic regression), artificial neural networks (multilayer feedforward neural networks, radial basis function networks, probabilistic networks), machine learning (support vector machines, decision trees, k-nearest neighbors), and data mining. However, the method of classification depends on the nature of the data and the question of interest. Many developments and new materials are found in Gordon (1981, 1999) and the references therein. Many books have been published in the area of pattern classification and neuro-computing, particularly incorporating developments in statistics, data mining, neural network methods, and machine learning; see, for instance, James (1985); Quinlan (1986, 1992); Devroye et al. (1996); Bishop (1995); Ripley (1996); Burgess (1998); Fine (1999); Hosmer and Lemeshow (2000); Cristianini and Shawe-Taylor (2000); and Duda et al. (2001).

3 Motivating Functional Networks

Castillo et al. (1998, 2001) introduced functional networks as a generalization of the standard neural network. A functional network allows neurons to be multi-argument, multivariate, and different learnable functions, instead of fixed functions. It also allows connecting neuron outputs and forcing them to be coincident; this leads to the concept of equivalence of functional networks. Castillo et al. (1998) defined a functional network as a pair $\langle X, \Gamma \rangle$, where $X$ is a set of nodes and $\Gamma = \{\langle X_j, f_j, Z_j \rangle : j \in I\}$ is a set of neuron functions over $X$, which satisfies: every node $x_i \in X$ must be either an input or an output node of at least one neuron function in $\Gamma$. We call the node $x_i \in X$, for $i \in I$, a multiple node if it is an output of more than one neuron function; otherwise, it is called a simple node. Generally, functional networks have three types of nodes: input, intermediate, and output nodes. These three types of nodes can be summarized as follows:

Input node: an input of at least one neuron function in $\Gamma$ that is not the output of any neuron function of $\Gamma$. It is stored in the layer of input units.

Intermediate node: an input of at least one neuron function in $\Gamma$ and, at the same time, the output of at least one neuron function of $\Gamma$. It is stored in an optional set of layers, called the set of intermediate layers.

Output node: an output of at least one neuron function in $\Gamma$ that is not the input of any neuron function of $\Gamma$. It is stored in the layer of output units, which is the last layer and contains the output information of the network.
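As an illustration of these definitions, the following minimal encoding (our own sketch; the node names and tuple layout are illustrative, not from Castillo et al.) represents a functional network as a set of neuron functions and classifies each node:

```python
# A neuron function is a triple <inputs, f, output>, following the pair
# <X, Gamma> definition above; the names here are illustrative only.
neurons = [
    (("x1", "x2"), "f1", "u"),   # u = f1(x1, x2)
    (("u", "x3"),  "f2", "y"),   # y = f2(u, x3)
]

inputs_of = {node for inp, _, _ in neurons for node in inp}
outputs_of = {out for _, _, out in neurons}

def node_type(node):
    """Input: feeds a neuron but is never produced by one.
    Intermediate: both consumed and produced. Output: produced only."""
    if node in inputs_of and node not in outputs_of:
        return "input"
    if node in inputs_of and node in outputs_of:
        return "intermediate"
    return "output"

for node in sorted(inputs_of | outputs_of):
    print(node, "->", node_type(node))
# x1, x2, x3 are input nodes; u is intermediate; y is an output node.
```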

The most common steps required for working with functional networks can be summarized as follows (a sketch of the overall workflow appears after this list):

1. Select the suitable initial architecture: This is a fundamental step; the selection of the initial topology of a functional network is based on the characteristics of the problem at hand.

2. Simplify the initial functional network: This can be done using functional equations and their applications as the main tool for simplifying functional networks.

3. Check the uniqueness of representation: Sometimes, for a given structure, we can get different outputs for the same set of inputs (the network is not separable). This leads to checking the concept of uniqueness and finding the conditions on the neuron functions. The following theorem provides the information needed to check the uniqueness of the functional network:

Theorem: All solutions of the functional equation $\sum_{k=1}^{n} f_k(x)\, g_k(y) = 0$ can be written as $f(x) = A\Phi$ and $g(y) = B\Psi$, where $u$ is an integer between $0$ and $n$; $\Phi = \{\varphi_1(x), \ldots, \varphi_u(x)\}$ and $\Psi = \{\psi_{u+1}(y), \ldots, \psi_n(y)\}$ are two arbitrary systems of mutually linearly independent functions; and $A = (a_{ki})_{n \times u}$ and $B = (b_{kj})_{n \times (n-u)}$ are constant matrices with $A^{\mathsf T} B = 0$, where $0$ is a $u \times (n-u)$ zero matrix. See Aczél (1966, p. 160) for a complete proof and more details.

4. Learning algorithm: Once the structure of the functional network is known from Step 1, it is necessary to learn the neuron functions using one type of learning, either a linear or a non-linear method, according to the optimization technique used to learn the network parameters; see Castillo et al. (1998) for more details.

5. Quality or validation test of the final functional network model: After the learning algorithm is done, it is necessary to test the quality of the functional network model and see its performance on real data.
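The workflow referenced above can be sketched as a skeleton in which each step is a placeholder callable; none of these function names come from the paper:

```python
def fit_functional_network(X, y, basis, select_architecture, simplify,
                           check_uniqueness, learn, validate):
    """Skeleton of the five-step workflow; each argument is a placeholder
    callable standing in for the corresponding step in the text."""
    net = select_architecture(X, y)        # Step 1: initial topology
    net = simplify(net)                    # Step 2: functional equations
    check_uniqueness(net)                  # Step 3: uniqueness conditions
    params = learn(net, X, y, basis)       # Step 4: linear/non-linear learning
    report = validate(net, params, X, y)   # Step 5: quality test on real data
    return params, report
```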


4 Functional Network Classifier

The initial architecture of the functional network classifier has to satisfy the pattern classification problem conditions shown in (1). Since the goal is to estimate the probability $\pi_{ik}$, we assume that the desired functional network has the following initial form:

$$ \pi_{ik} = h_k(x_i, \theta_k) + \varepsilon_i, \quad h_k(x_i, \theta_k) \ge 0, \quad \sum_{k=0}^{c-1} h_k(x_i, \theta_k) = 1, \qquad (2) $$

for $k = 0, 1, \ldots, c-1$ and $i = 1, \ldots, n$, where $x_i = (x_{i1}, \ldots, x_{ip})$ is the $i$th observation of the predictor variables, $\theta_k = (\theta_{1k}, \ldots)$ is the vector of unknown parameters to be learned, and the $h_k(x_i, \theta_k)$ are unknown functions. The errors $\varepsilon_i$ are random errors independent of the $x$'s, but dependent on $\pi_{ik}$. We summarize the results of a fitted functional network model by constructing the $c \times c$ confusion matrix.

Assume that $D = \{(Y, X_1, \ldots, X_p)\}$ is a given training set with categorical response variable $Y$ and feature variables $X \in \mathbb{R}^p$. According to the functional network (2), we are interested in modeling $\pi_{ik}$ as an unknown function of $x_{i1}, \ldots, x_{ip}$, and discovering its structure. We note that there are some constraints on the functions $h_k(x_i, \theta_k)$; to get around these constraints, we replace $h_k$ with a function $g_k(x_i, \theta_k)$ such that

$$ g_k(x_i, \theta_k) = \log\!\left( \frac{h_k(x_i, \theta_k)}{h_0(x_i, \theta_0)} \right), \quad \text{for } k = 1, \ldots, c-1, \quad \text{and } g_0(x_i, \theta_0) = 0, $$

or, equivalently,

$$ h_k(x_i, \theta_k) = \frac{e^{g_k(x_i, \theta_k)}}{1 + \sum_{k=1}^{c-1} e^{g_k(x_i, \theta_k)}}, \quad \text{for } k = 1, \ldots, c-1, \quad \text{and} \quad h_0(x_i, \theta_0) = \frac{1}{1 + \sum_{k=1}^{c-1} e^{g_k(x_i, \theta_k)}}. $$

Therefore, the initial functional network in (2) can be rewritten as:

$$ \pi_{ik} = g_k(x_i, \theta_k) + \varepsilon_i, \quad k = 0, 1, \ldots, c-1, \qquad (3) $$

from which it follows that $-\infty < \pi_{ik} < \infty$, for all $i = 1, \ldots, n$ and $k = 0, 1, \ldots, c-1$.

The learning technique in functional networks is to approximate the right-hand side of the functional network in (3) using known sets of families of linearly independent elementary functions $\Phi_j = \{\phi_{r_j}(x_{ij}),\; r_j = 1, 2, \ldots, m_{jk}\}$, that is,

$$ g_k(x_i, \theta_k) = \sum_{r_1=1}^{m_{1k}} \cdots \sum_{r_p=1}^{m_{pk}} c_{r_1 r_2 \cdots r_p}\, \phi_{r_1}(x_{i1}) \cdots \phi_{r_p}(x_{ip}), \qquad (4) $$

where the $c_{r_1 r_2 \cdots r_p}$ are the parameters of the functional network that need to be learned. We use one of the discrepancy measures (least squares, minimax, etc.) to minimize the criterion error function

$$ Q = \sum_{i=1}^{n} e_i^2, \quad \text{where } e_i = \pi_{ik} - f_{ik}, \text{ for all } i = 1, \ldots, n \text{ and } k = 0, \ldots, c-1, $$

subject to the uniqueness constraint conditions on the neuron functions. These lead to a non-linear system of equations, and we choose the least-squares technique to solve the corresponding optimization problem. The minimum is obtained by computing $\partial Q / \partial c_{r_1 \cdots r_p}$ and equating it to zero. This leads to a system of equations in the unknowns $c_{r_1 \cdots r_p}$, where $r_1 = 1, 2, \ldots, m_{1k}; \; \ldots; \; r_p = 1, 2, \ldots, m_{pk}$. We note that the number of parameters in the functional network model is $\prod_{j=1}^{p} m_{jk}$, which is a very large number and causes a complexity problem. One way to reduce the parameters in Model (3) is to write the functions $g_k(x_i, \theta_k)$ as:

$$ g_k(x_i, \theta_k) = \sum_{U_r \in \Omega} \prod_{j \in U_r} g_{rjk}(x_{ij}), \qquad (5) $$

where the functions $g_{rjk}(x_{ij})$ are unknown functions that contain the unknown parameters $\theta_k$ of the functional network model, and $\Omega$ is the set of non-empty subsets $U_r \subseteq \{1, 2, \ldots, p\}$. Thus, the functional network (3) can be rewritten as:

$$ \pi_{ik} = \sum_{r=1}^{2^p - 1} \prod_{j \in U_r} g_{rjk}(x_{ij}) + \varepsilon_i. \qquad (6) $$

The learning technique is equivalent to approximating the neuron functions $g_{rjk}(x_{ij})$ using sets of known linearly independent functions $\Phi_{rj} = \{\phi_{rjl}(x_{ij}) : l = 1, 2, \ldots, m_{rjk}\}$, $r = 1, \ldots, 2^p - 1$, that is,

$$ \hat{g}_{rjk}(x_{ij}) = \sum_{l=1}^{m_{rjk}} a_{rjkl}\, \phi_{rjl}(x_{ij}), \qquad (7) $$

so we only need to learn the coefficients (parameters) $\{a_{rjkl}\}$. For simplicity and without loss of generality, we illustrate Model (6) for two cases: training sets with two predictor variables and with three predictor variables. The cases with more than three predictor variables ($p > 3$) follow similarly.
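A quick count makes the complexity argument concrete: under the full tensor-product model (4) the number of coefficients per class is $\prod_j m_{jk}$, while the separable model (6) needs roughly one set of $m$ coefficients per neuron function. An illustrative choice of $p$ and $m$ (ours, ignoring the constant-term bookkeeping):

```python
p, m = 4, 5                          # hypothetical: 4 predictors, 5 basis functions each

full_model = m ** p                  # tensor-product model (4): m^p coefficients per class
neuron_functions = p * 2 ** (p - 1)  # each x_j appears in 2^(p-1) of the subsets U_r
reduced = m * neuron_functions       # separable model (6)

print(full_model, reduced)           # 625 vs 160 coefficients per class
```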

4.1 Two Predictor Variables

When we have two explanatory variables, then $|\Omega| = 2^2 - 1 = 3$, and Model (6) can be written as:

$$ \pi_{ik} = \sum_{r=1}^{3} \prod_{j \in U_r} g_{rjk}(x_{ij}) + \varepsilon_i = g_{11k}(x_{i1}) + g_{22k}(x_{i2}) + g_{31k}(x_{i1})\, g_{32k}(x_{i2}) + \varepsilon_i, \qquad (8) $$

where $g_{21k}(x_{i1}) = g_{12k}(x_{i2}) = 1$, and the constant term is included in the function $g_{11k}(x_{i1})$ only, to prevent collinearity. Figure 1 shows the architecture of the functional network (8).

[Figure 1: The associated functional network model $\pi_{ik} = g_{11k}(x_{i1}) + g_{22k}(x_{i2}) + g_{31k}(x_{i1})\, g_{32k}(x_{i2}) + \varepsilon_i$.]

Since (8) contains the double interaction term $g_{31k}(x_{i1})\, g_{32k}(x_{i2})$, the corresponding model is not separable, and the uniqueness of representation has to be checked. We assume that there exist two distinct sets of functions $g_{rjk}(x_{ij})$ and $g^*_{rjk}(x_{ij})$, for $j = 1, 2$; $k = 0, \ldots, c-1$; $r = 1, 2, 3$, such that

$$ \sum_{r=1}^{3} g_{r1k}(x_{i1})\, g_{r2k}(x_{i2}) = \sum_{r=1}^{3} g^*_{r1k}(x_{i1})\, g^*_{r2k}(x_{i2}). \qquad (9) $$

Since $g_{21k}(x_{i1}) = g_{12k}(x_{i2}) = 1$, then $g^*_{21k}(x_{i1}) = g^*_{12k}(x_{i2}) = 1$, and the constant term is included in $g^*_{11k}(x_{i1})$. Therefore, equation (9) can be written in the zero form $\sum_{s=1}^{4} H_{s1}(x_{i1})\, G_{s2}(x_{i2}) = 0$, where

$$ \begin{aligned} H_{11}(x_{i1}) &= g_{11k}(x_{i1}) - g^*_{11k}(x_{i1}), & G_{12}(x_{i2}) &= 1, \\ H_{21}(x_{i1}) &= 1, & G_{22}(x_{i2}) &= g_{22k}(x_{i2}) - g^*_{22k}(x_{i2}), \\ H_{31}(x_{i1}) &= g_{31k}(x_{i1}), & G_{32}(x_{i2}) &= g_{32k}(x_{i2}), \\ H_{41}(x_{i1}) &= g^*_{31k}(x_{i1}), & G_{42}(x_{i2}) &= -g^*_{32k}(x_{i2}). \end{aligned} \qquad (10) $$

As in Castillo et al. (1992), we choose the two sets of linearly independent functions $\{1, g_{31k}(x_{i1})\}$ and $\{1, g_{32k}(x_{i2})\}$; then the general solution of equation (10) is:

$$ \begin{pmatrix} g_{11k}(x_{i1}) - g^*_{11k}(x_{i1}) \\ g^*_{31k}(x_{i1}) \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{41} & a_{42} \end{pmatrix} \begin{pmatrix} 1 \\ g_{31k}(x_{i1}) \end{pmatrix}, $$

with the analogous expansion of the $G$ functions in terms of $\{1, g_{32k}(x_{i2})\}$, where the unknown parameters $a_{11}, a_{12}, a_{41}, a_{42}, b_{21}, b_{22}, b_{31}$, and $b_{32}$ must satisfy the condition $A^{\mathsf T} B = 0$ of the uniqueness theorem. From this condition we get the relations among these unknown parameters: $a_{11} = -b_{21}$; $a_{12} = -b_{31}$; $a_{41} = b_{22}$; and $a_{42} = b_{32}$. Therefore, we have a unique representation of this functional network if the following conditions are satisfied: $g_{11k}(x_{01}) = C_1$; $g_{31k}(x'_{01}) = C_2$; $g_{22k}(x_{02}) = C_3$; and $g_{32k}(x'_{02}) = C_4$, for arbitrarily chosen constants $x_{01}, x'_{01}, x_{02}, x'_{02}$ and $C_1, C_2, C_3, C_4$.

Now, by approximating these neurons using the sets of linearly independent families, that is, $\hat{g}_{rjk}(x_{ij}) = \sum_{l=1}^{m_{rjk}} a_{rjkl}\, \phi_{rjl}(x_{ij})$, we use the least squares method to minimize the sum of the squared errors. Therefore, the learning algorithm is abbreviated as the following optimization problem: minimize $Q = \sum_{i=1}^{n} e_i^2$ subject to the uniqueness conditions, via the Lagrangian multiplier function $L(\theta_k, \lambda)$. The necessary Kuhn-Tucker (1961) stationary conditions for the existence of a minimum can be determined by taking the partial derivatives of $L$ with respect to the coefficients $a_{rjkl}$ and $\lambda$, respectively.

According to the definition of $\phi_{rjl}(x_{ij})$ and the uniqueness conditions, the approximated functions $\hat{g}_{31k}(x_{i1})$, $\hat{g}_{22k}(x_{i2})$, and $\hat{g}_{32k}(x_{i2})$ do not contain the constant term, so we can choose the linearly independent family $\Phi = \{1, t, t^2, \ldots, t^m\}$ with the constant term for the function $\hat{g}_{11k}(x_{i1})$, and the family $\tilde{\Phi} = \{t, t^2, \ldots, t^m\}$ without the constant term for the rest of the functions $\hat{g}_{rjk}(x_{ij})$. Therefore, we obtain a system of $(m+1)^2$ linear equations in $(m+1)^2$ unknowns, which can be written in matrix form as $A\, \theta_k = B_k$, where $\theta_k$ is the vector of unknown parameters to be estimated. If $A$ is a non-singular square matrix, we can compute the matrix of unknown parameters $\Theta = [\theta_1, \ldots, \theta_c]$ by solving the system $A\, \theta_k = B_k$, for all $k = 0, 1, \ldots, c-1$; the solution $\hat{\theta}_k$ then exists, that is, $\hat{\theta}_k = A^{-1} B_k$.
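Putting Section 4.1 together, here is a hedged numpy sketch (our construction, on made-up data, with the polynomial family $\{1, t, \ldots, t^m\}$) that builds the $(m+1)^2$-column design matrix for model (8) and solves the normal equations $A\,\theta_k = B_k$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up two-predictor training data; tau[i] plays the role of the
# transformed response pi_ik for one fixed class k.
n, m = 60, 2
x1, x2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
tau = 0.5 + 1.2 * x1 - 0.8 * x2 + 0.6 * x1 * x2 + rng.normal(0, 0.05, n)

# Design matrix W with one column per product phi_a(x1) * phi_b(x2),
# phi_l(t) = t^l, l = 0..m  ->  (m+1)^2 columns, as in Section 4.1.
W = np.column_stack([x1**a * x2**b for a in range(m + 1)
                                   for b in range(m + 1)])

# Normal equations A theta_k = B_k with A = W^T W and B_k = W^T tau_k.
A = W.T @ W
B = W.T @ tau
theta_k = np.linalg.solve(A, B)

print(theta_k.round(3))   # estimated coefficients for class k
```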

4.2 p Predictor Variables, p > 2

Assume that the training set is $D = \{(y_i = k, x_{i1}, \ldots, x_{ip})\}$, for all $i = 1, \ldots, n$ and $k = 0, \ldots, c-1$, with $|\Omega| = 2^p - 1$. Thus, the general form of the functional network can be formalized as in Model (6):

$$ \pi_{ik} = \sum_{r=1}^{2^p - 1} \prod_{j \in U_r} g_{rjk}(x_{ij}) + \varepsilon_i, \quad k = 0, 1, \ldots, c-1; \; i = 1, \ldots, n, $$

where $j_1, j_2, \ldots, j_p = 1, 2, \ldots, p$ such that $j_1 \ne j_2 \ne \cdots \ne j_p$ in any term, and the lower and upper limits $LL_s$ and $UL_t$ of each block of interaction terms are partial sums of the binomial coefficients $\binom{p}{j}$, for $t = 2, \ldots, p$. As before, only the neuron function $g_{11k}(x_{i1})$ contains the constant term. Assume that $\Phi_{rj} = \{\phi_{rjl}(x_{ij}) : l = 1, 2, \ldots, m_{rjk};\; r = 1, \ldots, 2^p - 1;\; j = 1, \ldots, p\}$ are sets of known linearly independent functions used to approximate the neuron functions $g_{rjk}(x_{ij})$. Thus, the goal is to determine the coefficients $\{a_{rjkl} : l = 1, 2, \ldots, m_{rjk};\; r = 1, \ldots, 2^p - 1;\; j = 1, \ldots, p;\; k = 0, \ldots, c-1\}$. We use the least squares method and the approximated formula of the neuron functions to determine these parameters $\{a_{rjkl}\}$. We compute the known matrix $W$ of size $n \times (m+1)^p$ from the given data, and let $\Theta = [\theta_1, \ldots, \theta_c]$ of size $(m+1)^p \times c$ be the matrix whose columns are the unknown parameter vectors of size $(m+1)^p \times 1$, with errors $e_i = (\pi_{ik} - w_i^{\mathsf T} \theta_k)$. The Lagrangian multiplier function $L(\theta_k, \lambda)$ is then constructed as in the two-variable case.

The necessary Kuhn-Tucker stationary conditions for the existence of a minimum, for $k = 0, \ldots, c-1$ and $j = 1, \ldots, (m+1)^p$, yield a system of $(m+1)^p$ linear equations in $(m+1)^p$ unknowns. It can be written in the matrix form $A\, \theta_k = B_k$, for all $k = 0, 1, \ldots, c-1$, with the unknown vector $\theta_k$ of size $(m+1)^p \times 1$, where the known square coefficient matrix $A$ and the vector $B_k$ are $A = W^{\mathsf T} W$ and $B_k = W^{\mathsf T} \tau_k$, with $\tau_k = (\pi_{1k}, \ldots, \pi_{nk})^{\mathsf T}$, for $k = 0, \ldots, c-1$. Thus, the unknown parameters $\theta_{jk}$ can be learned by solving the system $A\, \theta_k = B_k$, for all $k = 0, \ldots, c-1$.
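The same solve, written for all $c$ classes at once (a sketch under the assumptions above; the shapes are made up, and $W$ stands for the $n \times (m+1)^p$ matrix computed from the data):

```python
import numpy as np

def learn_parameters(W, Pi):
    """Solve A theta_k = B_k for every class k, where A = W^T W and
    B_k = W^T tau_k, with tau_k the kth column of the n x c matrix Pi."""
    A = W.T @ W                           # known square coefficient matrix
    Theta = np.linalg.solve(A, W.T @ Pi)  # one solve, all c right-hand sides
    return Theta                          # column k holds theta_k

# Example with made-up shapes: n = 50 observations, 9 basis columns, c = 3.
rng = np.random.default_rng(1)
W = rng.normal(size=(50, 9))
Pi = rng.dirichlet(np.ones(3), size=50)   # rows are probability vectors
print(learn_parameters(W, Pi).shape)      # (9, 3)
```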

5 Application to Real-World Data

In this section we illustrate the use of the functional network classifier by examining its application to real-world problems. Most real-world application data sets are available and can be downloaded from the machine learning database site at the University of California, Irvine: "ftp://ftp.ics.uci.edu/pub/machine-learning-databases". We choose the Fisher (1936) iris data due to its importance as a well-known database in the pattern recognition literature. The data set contains four feature variables, all measured in centimeters; the feature vectors are drawn from 3 classes of 50 observations each, where each class refers to a type of iris plant. We use the symbols $x_1, x_2, x_3$, and $x_4$ to refer to the sepal length, sepal width, petal length, and petal width, respectively. By looking at the scatter plot of the data in Figure 2, we observe that one class is linearly separable from the other two; the latter are not linearly separable from each other.

[Figure 2: The three-dimensional plot of the sepal length, sepal width, and petal length.]

To evaluate the performance of the functional network classifier on the iris data set, we use 80% of the data to estimate the classification model (internal validation) and 20% of the data for external validation. As proposed in Turk (1990) and Cook (1986), to guarantee that we get the same proportions from each group as in the original data, we use a stratified random sampling technique: we divide the data set into non-overlapping samples, then randomly hold out data of size $m$, with $m = \mathrm{round}(n/5)$ observations and $m_k = \mathrm{round}(m\, n_k / n)$, where $n$ is the number of instances in the given data and $n_k$ is the number of instances in group $k$. In the iris data, this gives a training set of 120 observations and a validation set of 30 observations, with $m_1 = m_2 = m_3 = 10$ observations from class 1, class 2, and class 3, respectively. We repeat the internal and external validation processes for at least 100 runs, then compute the correct classification rate, its standard deviation, and the coefficient of variation over these 100 runs.

With the four explanatory variables and $c = 3$ categories, we obtain the best model with 10 parameters, and the minimum description length (MDL) is $-1295.85551$. The correct classification rate is 100%. The estimated functional network model is given by:

$$ \pi_{ik} = 0.1083 - 0.1675\,x_1 - 0.0361\,x_2 + 0.0353\,x_3 + 0.0556\,x_4 - 0.001\,x_1^2 + 0.001\,x_2^2 + 0.01\,x_3^2 - 0.0057\,x_4^2 + 0.0011\,x_1 x_2. \qquad (19) $$
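A sketch of the stratified holdout rule described above, implementing $m = \mathrm{round}(n/5)$ and $m_k = \mathrm{round}(m\, n_k/n)$ (our implementation; the classifier itself is left out):

```python
import numpy as np

def stratified_split(y, rng):
    """Hold out m = round(n/5) observations, taking m_k = round(m * n_k / n)
    from each class k, as described in the text."""
    n = len(y)
    m = round(n / 5)
    held = []
    for k in np.unique(y):
        idx = np.flatnonzero(y == k)
        m_k = round(m * len(idx) / n)
        held.extend(rng.choice(idx, size=m_k, replace=False))
    held = np.array(held)
    train = np.setdiff1d(np.arange(n), held)
    return train, held

# For the iris data (n = 150, three classes of 50), this yields a
# 120-observation training set and 30 held-out observations, 10 per class.
rng = np.random.default_rng(42)
y = np.repeat([0, 1, 2], 50)
train, held = stratified_split(y, rng)
print(len(train), len(held))   # 120 30
```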

6 Conclusion

We conclude that functional networks are an efficient framework for the pattern classification problem, and that the initial architecture of the functional network model depends only on the physical meaning of the problem at hand. We can use another technique for the initial choice other than the logit, such as the probit. Moreover, with some specific choices of the initial structure we obtain the sigmoid neural network model. Functional networks incorporate different neuron functions and are not restricted to a linear combination of inputs. The neuron functions can be learned either exactly or approximately, and the neuron outputs can be forced to be coincident. The functional network is easy to implement and achieves a good correct classification rate. This work can be extended and generalized with a simulation study and cross-validation to compare its performance with previously proposed classification methods. Generally, the functional network classifier can be used to obtain better performance in new and emerging applications, such as cryptography, internet security, web searching, multimedia data retrieval, data mining, and machine learning.

References

[1] Bishop, C. M. (1995), Neural Networks for Pattern Recognition, Oxford University Press, Oxford.
[2] Burgess, C. C. (1998), "A tutorial on support vector machines for pattern recognition", Machine Learning journal.
[3] Castillo, E., and Ruiz-Cobo, R. (1992), Functional Equations and Modelling in Science and Engineering, Marcel Dekker, New York.
[4] Castillo, E., Cobo, A., Gutiérrez, J. M., and Pruneda, E. (1998), Introduction to Functional Networks with Applications: A Neural Based Paradigm, Kluwer Academic Publishers, New York.
[5] Cook, R. L. (1986), "Stochastic sampling in computer graphics", ACM Transactions on Graphics, 5(1), 51-72.
[6] Devroye, L., Györfi, L., and Lugosi, G. (1996), A Probabilistic Theory of Pattern Recognition, Springer-Verlag, Berlin.
[7] Duda, R. O., Hart, P. E., and Stork, D. G. (2001), Pattern Classification, Second Edition, Wiley-Interscience, John Wiley & Sons, New York.
[8] Fine, T. L. (1999), Feedforward Neural Network Methodology, Springer-Verlag, New York.
[9] Fisher, R. A. (1936), "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, 179-188.
[10] Gordon, A. D. (1981), Classification: Methods for the Exploratory Analysis of Multivariate Data, Chapman & Hall, London.
[11] Gordon, A. D. (1999), Classification, 2nd Edition, Chapman & Hall/CRC.
[12] Hosmer, D. W., and Lemeshow, S. (2000), Applied Logistic Regression, Second Edition, John Wiley & Sons, New York.
[13] James, M. (1985), Classification Algorithms, John Wiley & Sons, New York.
[14] Kuhn, H. W., and Tucker, A. W. (1961), "Nonlinear programming", in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, J. Neyman (ed.), University of California Press, Berkeley and Los Angeles, California, pp. 481-492.
[15] Quinlan, J. R. (1986), "Induction of decision trees", Machine Learning, 1, 81-106.
[16] Quinlan, J. R. (1992), C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, California.
[17] Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge.
[18] Ross, P. E. (1998), "Flash of genius", Forbes, pp. 98-104.
[19] Turk, G. (1990), "Generating random points in triangles", in Graphics Gems, Academic Press, New York, pp. 24-28.
