
Main Large Data Set Features Detection by a Linear Predictor Model

Carlos Enrique Gutierrez (a), Prof. Mohamad Reza Alsharif (a), Mahdi Khosravy (b), Prof. Katsumi Yamashita (c), Prof. Hayao Miyagi (a), Rafael Villa (d)

(a) Department of Information Engineering, University of the Ryukyus, 1 Senbaru, Okinawa, Japan. [email protected], { asharif; miyagi }@ie.u-ryukyu.ac.jp
(b) University for Information Science and Technology "Saint Paul the Apostle", Ohrid, Macedonia. [email protected]
(c) Graduate School of Engineering, Osaka Prefecture University, Osaka, Japan. [email protected]
(d) Regional Public Goods, Inter-American Development Bank, Washington DC, USA. [email protected]

Abstract. The aim of the present paper is to explore and obtain a simple method capable of detecting the most important variables (features) in a large set of variables. To verify the performance of the approach described in the following sections, we used a set of news articles. Text sources are considered high-dimensional data, where each word is treated as a single variable. In our work, a linear predictor model is used to uncover the most influential variables, strongly reducing the dimension of the data set. The input data are classified into two categories, arranged as a collection of plain text documents, then pre-processed and transformed into a numerical matrix containing around 10,000 different variables. We adjust the linear model's parameters based on its prediction results: the variables with the strongest effect on the output survive, while those with negligible effect are removed. In order to collect a summarized set of features automatically, we sacrifice some detail and accuracy of the prediction model, although we try to balance the squared error against the obtained subset.

Keywords: Big data, Data mining, Patterns discovery, Linear predictor.

INTRODUCTION

The amount of data in our world has been exploding; data sets grow exponentially, in part because they are increasingly gathered by ubiquitous information-sensing devices and social media. Large data sets become complex and difficult to process using on-hand database management tools or traditional data processing applications. Analyzing such data sets is one of the keys for leading companies and one of the most active research fields. Several issues need to be addressed to capture the full potential of big data; one of them is finding correlations among a vast number of variables. In this paper, we describe an algorithm built on a linear model that retains a subset of the most important variables based on their correlation with the desired output values. The linear predictor transforms the original data set, exposing its most important features. With a large number of features, around 10,000 in our experiment, we wish to obtain "automatically" a smaller subset of variables and to explore the balance between the prediction error and the emerging subset. It is important to clarify that our main goal is not to create a prediction model but to reduce the dimensionality of the input data, a reduction that can be used for further purposes. In previous experiments, PCA (principal component analysis) was used to compress a set of documents; the resulting orthogonal transformation generated a new set of values of linearly uncorrelated variables. Because each principal component is a linear combination of the original variables, the real meaning of the features disappeared, eliminating the chance of using the transformation results in applications that require the "semantics" of the data. Our present work aims to be a powerful alternative for compression without losing the essence and context of the data, characteristics that are essential for further analysis.

DATA SET

The HTML data set used in our experiment comes from CNN web news collected during the week between March 11th and March 18th, 2011. March 11th is sadly remembered as the day when earthquakes triggered a tsunami in Japan. We would like to examine the correlation between words and type of news in order to select the most influential words; each word is considered a variable. News items are classified into two classes, depending on whether or not an item is related to a natural disaster. The set is composed of around 500 HTML files converted into text format. At this step, our aim is to represent the text files as a numerical matrix. An implementation in C++ has been developed for this process. It extracts the words from each file, creates a dictionary, and computes word frequencies. Special characters, numbers, symbols, and meaningless words such as conjunctions, prepositions, and adverbs were removed. In addition, our implementation includes the Porter stemming algorithm for reducing inflected (or sometimes derived) words to their stem, base, or root form. The general idea underlying stemming is to identify words that are the same in meaning but different in form by removing suffixes and endings; for instance, words such as "expanded", "expanding", "expand", and "expands" are reduced to the root word "expand". The final result was a clean dictionary (a vector of 9962 elements) and a matrix of 521 x 9962 (news x words), where each element is the frequency of a word in a news item. As expected, the result is a high-dimensional matrix, which will be the training set of the linear predictor model described in this paper.
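To make this preprocessing step concrete, the sketch below builds a word-frequency matrix from a set of plain-text files. It is a minimal illustration only, not the paper's implementation: the file names and stop-word list are made up, and the Porter stemmer is represented by a placeholder function.

```cpp
#include <cctype>
#include <fstream>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical placeholder: a real implementation would apply a Porter stemmer here.
std::string stem(const std::string& word) { return word; }

int main() {
    // Hypothetical inputs; the paper's data set contains around 521 news items.
    std::vector<std::string> files = {"news001.txt", "news002.txt"};
    std::set<std::string> stopWords = {"the", "and", "of"};

    std::map<std::string, int> dictionary;                 // word -> column index
    std::vector<std::map<int, int>> counts(files.size());  // per-document sparse counts

    for (size_t doc = 0; doc < files.size(); ++doc) {
        std::ifstream in(files[doc]);
        std::string token;
        while (in >> token) {
            // Keep alphabetic characters only and lowercase them.
            std::string w;
            for (char c : token)
                if (std::isalpha(static_cast<unsigned char>(c)))
                    w += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            if (w.empty() || stopWords.count(w)) continue;
            w = stem(w);
            auto it = dictionary.emplace(w, static_cast<int>(dictionary.size())).first;
            ++counts[doc][it->second];                      // word frequency per document
        }
    }
    // The resulting dense matrix would be (documents) x (dictionary size),
    // i.e. 521 x 9962 in the experiment described above.
    std::cout << "dictionary size: " << dictionary.size() << "\n";
    return 0;
}
```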

LINEAR REGRESSION MODEL

Linear prediction models (or regression models) have long been, and continue to be, effective and actively used in diverse application areas such as data forecasting, speech recognition, signal restoration, and others. A linear predictor is a linear function (linear combination) of a set of coefficients and independent variables whose value is used to predict the outcome of a dependent variable. In this paper we derive a compression algorithm from a linear predictor modified by shrinkage operators. The model assumes that the regression function $E(Y \mid X)$ is linear in the inputs $X_1, X_2, \ldots, X_d$. We want to predict an output $Y$, so the linear regression model is defined as:

$$ f(X) = \theta_0 + \sum_{j=1}^{d} X_j \theta_j \qquad (1) $$

where the $\theta$'s are the unknown coefficients and $d$ is the dimension of the input vector $X$. Typically there is a set of training data $(X_1, y_1), \ldots, (X_n, y_n)$ from which the parameters $\theta$ are estimated. The most popular estimation method is least squares, which minimizes the quadratic cost between the output training data and the model predictions:

$$ J(\theta) = \sum_{i=1}^{n} \Big( y_i - \theta_0 - \sum_{j=1}^{d} x_{ij}\theta_j \Big)^2 \qquad (2) $$

$$ J(\theta) = (y - X\theta)^T (y - X\theta) \quad \text{(matrix notation)} \qquad (3) $$

Each $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T$ is a vector of feature measurements for the $i$-th case. Let $X$ denote the $n \times (d+1)$ matrix where each row is an input vector with the first column filled with 1, and similarly let $y$ denote the $n$-vector of outputs in the training set. Differentiating (3) with respect to $\theta$ we obtain an estimate of the coefficients:

$$ \hat{\theta} = (X^T X)^{-1} X^T y \qquad (4) $$

Predictions are calculated as follows:

$$ \hat{y} = X\hat{\theta} = X (X^T X)^{-1} X^T y \qquad (5) $$

One of the most famous results in statistics asserts that the least squares estimates of the parameters have the smallest variance among all linear unbiased estimates.
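As a small numerical illustration of equations (4) and (5), the sketch below estimates the coefficients by solving the normal equations and then computes the fitted outputs. It assumes the Eigen linear algebra library is available and uses a tiny made-up data set; on the real 521 x 9962 matrix, $X^T X$ can be singular, which is exactly the situation the ridge penalty below addresses.

```cpp
#include <Eigen/Dense>
#include <iostream>

int main() {
    // Toy data: n = 4 cases, d = 2 features (made up for illustration).
    Eigen::MatrixXd features(4, 2);
    features << 1.0, 2.0,
                2.0, 0.5,
                3.0, 1.0,
                4.0, 3.0;
    Eigen::VectorXd y(4);
    y << 1.0, 0.0, 0.0, 1.0;

    // Build X as n x (d+1): first column of ones for the intercept theta_0.
    Eigen::MatrixXd X(features.rows(), features.cols() + 1);
    X.col(0) = Eigen::VectorXd::Ones(features.rows());
    X.rightCols(features.cols()) = features;

    // Equation (4): theta_hat = (X^T X)^{-1} X^T y, solved without explicit inversion.
    Eigen::VectorXd theta = (X.transpose() * X).ldlt().solve(X.transpose() * y);

    // Equation (5): predictions y_hat = X * theta_hat.
    Eigen::VectorXd y_hat = X * theta;

    std::cout << "theta:\n" << theta << "\ny_hat:\n" << y_hat << "\n";
    return 0;
}
```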

Ridge Regression and Least Absolute Selection and Shrinkage Operator

Ridge regression shrinks the $\theta$'s by imposing a penalty on their size. The ridge coefficients minimize a penalized quadratic cost between the output training data and the model predictions. Starting from equation (3), ridge adds a penalty as follows:

$$ J(\theta) = (y - X\theta)^T (y - X\theta) + \delta^2 \theta^T \theta \qquad (6) $$

$$ \hat{\theta} = (X^T X + \delta^2 I_d)^{-1} X^T y \qquad (7) $$

where $\delta^2 > 0$ is a complexity parameter that controls the amount of shrinkage and $I_d$ is a $d \times d$ identity matrix. For larger values of $\delta$ the amount of shrinkage increases; this method compresses the $\theta$'s toward zero and toward each other. The solution (equation 7) adds a positive constant to the diagonal of $X^T X$ before inversion; this was the main motivation for the ridge method when it was introduced, since it removes the singularity problem in inverting $X^T X$.

The least absolute selection and shrinkage operator, also known as "lasso", is like the ridge method but with an important difference in the penalty term. In this case, the lasso penalty is $\delta^2 \sum_{j=1}^{d} |\theta_j|$, also known as the $L_1$ norm. This new penalty form introduces nonlinear solutions for the estimation of the $\theta$'s. In matrix form, lasso is:

$$ J(\theta) = (y - X\theta)^T (y - X\theta) + \delta^2 \sum_{j=1}^{d} |\theta_j| \qquad (8) $$

Computing the lasso solution is a quadratic programming problem because the penalty contains absolute values, whose derivative is not defined at zero. Lasso is used in the algorithm described below mainly to remove irrelevant features (words).
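The ridge estimate of equation (7) only changes the matrix being inverted, so the least-squares sketch above needs a one-line modification. A minimal version is sketched below, again assuming Eigen; the function name is an illustrative choice, and the identity matrix is sized to match all columns of X, so the intercept column is penalized too, which is a simplifying assumption rather than something stated in the paper.

```cpp
#include <Eigen/Dense>

// Ridge solution of equation (7): theta_hat = (X^T X + delta^2 I)^{-1} X^T y.
// Sketch only: the identity here matches all columns of X, i.e. the intercept
// column is penalized as well, a simplification of the text.
Eigen::VectorXd ridgeSolve(const Eigen::MatrixXd& X,
                           const Eigen::VectorXd& y,
                           double delta) {
    const Eigen::Index p = X.cols();
    Eigen::MatrixXd A = X.transpose() * X
                      + delta * delta * Eigen::MatrixXd::Identity(p, p);
    return A.ldlt().solve(X.transpose() * y);   // positive definite for delta > 0
}
```

Setting delta to zero recovers the least-squares estimate of equation (4); increasing it shrinks all coefficients toward zero.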

ALGORITHM DERIVATION

Equation (8) can be expressed as:

$$ J(\theta) = \sum_{i=1}^{n} \big( y_i - \theta_0 - X_i^T \theta \big)^2 + \delta^2 \sum_{j=1}^{d} |\theta_j| \qquad (9) $$

Differentiating the first term of the sum with respect to one generic feature coefficient $\theta_j$ we obtain:

$$ \frac{\partial J(\theta)}{\partial \theta_j}\Big|_{\text{1st term}} = \sum_{i=1}^{n} 2 X_{ij}^2 \theta_j - \sum_{i=1}^{n} 2 \big( y_i - \theta_0 - X_i^T \theta_{-j} \big) X_{ij} = z_1 \theta_j - z_2 \qquad (10) $$

where $X_i^T \theta_{-j}$ is the same as $X_i^T \theta$ but excluding feature $j$ and its coefficient. The second term of (9) is differentiated as:

$$ \frac{\partial J(\theta)}{\partial \theta_j}\Big|_{\text{2nd term}} = \delta^2 \frac{\partial |\theta_j|}{\partial \theta_j} = \begin{cases} -\delta^2 & \text{if } \theta_j < 0 \\ [-\delta^2, \delta^2] & \text{if } \theta_j = 0 \\ \delta^2 & \text{if } \theta_j > 0 \end{cases} \qquad (11) $$

Writing equations (10) and (11) together and equating to zero, we have $0 = z_1 \theta_j - z_2 + \delta^2 \frac{\partial |\theta_j|}{\partial \theta_j}$; therefore the estimate of the feature coefficient $\hat{\theta}_j$ is:

$$ \hat{\theta}_j = \frac{z_2 - \delta^2 \frac{\partial |\theta_j|}{\partial \theta_j}}{z_1} = \begin{cases} \dfrac{z_2 + \delta^2}{z_1} & \text{if } \theta_j < 0 \\ \Big[ \dfrac{z_2 - \delta^2}{z_1}, \dfrac{z_2 + \delta^2}{z_1} \Big] & \text{if } \theta_j = 0 \\ \dfrac{z_2 - \delta^2}{z_1} & \text{if } \theta_j > 0 \end{cases} \qquad (12) $$

Equation (10) shows that $z_1$ is always positive, therefore $\theta_j < 0$ is equivalent to $z_2 < -\delta^2$, and $\theta_j > 0$ is equivalent to $z_2 > \delta^2$. The equations above are implemented as the following algorithm (a code sketch follows the list):

1. Initialize the coefficients by the ridge method (equation 7), calculate $z_1$ (equation 10), and initialize $\delta$.
2. Repeat until the coefficients $\hat{\theta}$ become stable:
3. For each feature $j$ perform the following:
4. Update $z_2$ (equation 10).
5. If $z_2 < -\delta^2$, update $\hat{\theta}_j = (z_2 + \delta^2)/z_1$.
6. If $z_2 > \delta^2$, update $\hat{\theta}_j = (z_2 - \delta^2)/z_1$.
7. Otherwise, update $\hat{\theta}_j = 0$.
8. Change the value of $\delta$ and iterate from step 2.
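A minimal sketch of the iteration above is given below, assuming Eigen. It follows steps 1-8 directly: $z_1$ is computed once per feature, $z_2$ is refreshed from the current residual, and the soft-threshold cases of equation (12) either update or zero each coefficient. Names such as lassoCoordinateDescent and the convergence tolerance are illustrative choices; as simplifications of the paper's procedure, the coefficients start at zero instead of the ridge estimate and the intercept is assumed to be zero.

```cpp
#include <Eigen/Dense>
#include <algorithm>
#include <cmath>

// Sketch of the coordinate-wise lasso update (equations (10)-(12), steps 1-8).
// Simplifications vs. the paper: coefficients start at zero instead of the
// ridge estimate, and theta_0 is assumed to be zero (e.g. centered data).
Eigen::VectorXd lassoCoordinateDescent(const Eigen::MatrixXd& X,
                                       const Eigen::VectorXd& y,
                                       double delta,
                                       int maxIter = 100,
                                       double tol = 1e-6) {
    const Eigen::Index d = X.cols();
    Eigen::VectorXd theta = Eigen::VectorXd::Zero(d);
    Eigen::VectorXd residual = y - X * theta;           // r_i = y_i - X_i^T theta

    // z1_j = 2 * sum_i X_ij^2 (equation (10)); constant across iterations.
    Eigen::VectorXd z1 = 2.0 * X.colwise().squaredNorm().transpose();

    const double d2 = delta * delta;
    for (int it = 0; it < maxIter; ++it) {
        double maxChange = 0.0;
        for (Eigen::Index j = 0; j < d; ++j) {
            if (z1(j) == 0.0) continue;                  // word never occurs
            // z2_j = 2 * sum_i (y_i - X_i^T theta_{-j}) X_ij (equation (10)).
            double z2 = 2.0 * X.col(j).dot(residual) + z1(j) * theta(j);
            double updated;
            if (z2 < -d2)      updated = (z2 + d2) / z1(j);   // step 5
            else if (z2 > d2)  updated = (z2 - d2) / z1(j);   // step 6
            else               updated = 0.0;                 // step 7
            double change = updated - theta(j);
            if (change != 0.0) {
                residual -= X.col(j) * change;           // keep residual consistent
                theta(j) = updated;
            }
            maxChange = std::max(maxChange, std::abs(change));
        }
        if (maxChange < tol) break;                      // step 2: coefficients stable
    }
    return theta;
}
```

The features whose coefficients remain exactly zero after convergence are the ones the method discards; the surviving nonzero coefficients correspond to the reduced variable subset reported in the next section.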

RESULTS

The algorithm described in the previous section has been run several times for different values of the complexity parameter $\delta$ that controls the amount of shrinkage. Simulation results are shown in Table 1; the compression ratio is defined as the ratio between the number of original variables (9962 in our experiment) and the final subset of strongest variables.

TABLE 1. Simulation results.

Shrinkage parameter $\delta$ | Prediction squared error (equation 3) | Compression ratio (initial variables / final variables)
0.2  |  0.0068  |    6.17
0.5  |  0.1741  |   13.77
1    |  1.7651  |   25.61
4    | 24.3671  |  262.16
10   | 38.6543  | 1660.33

For example, for $\delta = 0.2$ the prediction error is very small, but the compression is low. Although a compression ratio of 6.17 implies roughly a six-fold reduction, from 9962 to 1613 features, for some systems and applications managing 1600 variables could still be complex and costly in resources and time. In the opposite case, for $\delta = 10$, the prediction error is high and the compression results are considerable. For this scenario, Figure 1 shows the final subset of features, which evidences strong compression. The issue in this case is the prediction error; some systems and applications cannot afford higher degradation of the accuracy. The final decision involves balancing the error against the compression, considering the future application of the data. Adding new testing data and re-computing the values presented in Table 1 helps to verify the generalization of the final subset and to support a closing decision.

FIGURE 1. Strongest variables for shrinkage parameter $\delta = 10$.
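To reproduce the two quantities reported in Table 1 for a given coefficient vector, one only needs the squared error of equation (3) and the count of surviving (nonzero) coefficients. A small helper is sketched below, assuming Eigen; the function name is illustrative and not taken from the paper.

```cpp
#include <Eigen/Dense>
#include <iostream>

// Squared error of equation (3) and compression ratio (initial/surviving variables)
// for a fitted coefficient vector. Illustrative helper, not code from the paper.
void reportCompression(const Eigen::MatrixXd& X,
                       const Eigen::VectorXd& y,
                       const Eigen::VectorXd& theta) {
    double squaredError = (y - X * theta).squaredNorm();          // equation (3)
    Eigen::Index surviving = (theta.array().abs() > 0.0).count(); // nonzero coefficients
    double ratio = surviving > 0
                 ? static_cast<double>(theta.size()) / static_cast<double>(surviving)
                 : 0.0;
    std::cout << "squared error: " << squaredError
              << "  compression ratio: " << ratio << "\n";
}
```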

CONCLUSION

In this paper we have obtained a manageable algorithm to automatically select the most important features from a high-dimensional data set. The algorithm is straightforward and is derived from the combined application of a linear prediction model with the ridge and lasso shrinkage methods. It has been demonstrated that different compression ratios can be obtained by modifying a shrinkage parameter: high linear prediction accuracy is associated with low compression ratios and, on the contrary, strong compression comes with a degradation of the prediction accuracy. Our solution can be used as an alternative for systems that struggle with issues related to high-dimensional input data. It produces a reduced input model that is interpretable and may even have lower prediction error than the full model.

REFERENCES

1. C.E. Gutierrez, M. Alsharif, H. Cuiwei, M. Khosravy, R. Villa, K. Yamashita, H. Miyagi, "Uncover news dynamic by principal component analysis," ICIC Express Letters, vol. 7, no. 4, pp. 1245-1250, 2013.
2. C.E. Gutierrez, M.R. Alsharif, H. Cuiwei, R. Villa, K. Yamashita, H. Miyagi, K. Kurata, "Natural disaster online news clustering by self-organizing maps," 27th SIP Symposium, Ishigaki, Japan, 2012.
3. C.E. Gutierrez, M.R. Alsharif, R. Villa, K. Yamashita, H. Miyagi, "Data Pattern Discovery on Natural Disaster News," ITC-CSCC, Sapporo, Japan, ISBN 978-4-88552-273-4/C3055, 2012.
4. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, 2nd ed., Springer Series in Statistics, 2008.

