Int. J. Emerg. Sci., 1(2), 108-119, June 2011 ISSN: 2222-4254 © IJES

Machine Learning and Data Optimization using BPNN and GA in DOC Zeeshan Ahmed and Saman Majeed Department of Bioinformatics, Biocenter, University of Wuerzburg, Germany

Abstract. The purpose of this research was to implement an application that improves the optimization process for efficient result estimation in the shortest possible time, using concepts of adaptive machine learning. The Back Propagation Neural Network algorithm was implemented to develop learning behavior in the system, and a Genetic Algorithm was implemented for data optimization. In this paper we briefly discuss the implemented methodology in the form of a software application, the Optimal Data Classifier (DOC), developed in the Java programming language. Furthermore, we describe two experiments performed with DOC using two different datasets, and present the observed results, which help in improving learning and data optimization behavior.

Keywords: Back Propagation Neural Network, Bioinformatics, Genetic Algorithm, Machine Learning, WEKA.

1. INTRODUCTION

Machine learning is a branch of Artificial Intelligence (itself a branch of computer science) that aims to facilitate the development of complex systems for data analysis, optimization, classification and prediction using mathematical algorithms, e.g. artificial neural networks, genetic algorithms, Bayesian statistics, case-based reasoning, decision trees, inductive logic programming, Gaussian process regression, group method of data handling (GMDH), k-NN, SVMs, RIPPER, C4.5 and rule-based classifiers. Furthermore, machine learning deals with the implementation of learning behavior in intelligent software and hardware systems, drawing on learning techniques such as supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, transduction and learning to learn. During this research we have tried to take advantage of machine learning concepts by applying existing machine learning principles to the development of a special-purpose system for data classification, optimization and learning behavior implementation. Within the limited scope of this research, we focus on the implementation of a software application capable of producing optimal results when dealing with datasets containing large (complex) data.


The criteria for choosing a mathematical algorithm were based on selecting an evolutionary approach able to deal with network weighting, chromosome encoding and terminals. Meeting these criteria, we decided to use a Genetic Algorithm [1] for data optimization and the back propagation neural network algorithm for learning behavior implementation. The defined experimental procedure was to analyze data (datasets) consisting of information based on the number of attributes, classes, instances and relationships; apply a suitable mathematical algorithm to establish the process of estimating the best optimal input parameters; and, on the basis of those selected input parameters, train the network parameters to best fit using the most suitable learning techniques. The results are then collected and the optimal ones extracted. Within the domain of machine learning (artificial intelligence), several approaches have been introduced, implemented, evaluated and published by many researchers, providing great value for the implementation of adaptive machine learning systems and their use for data optimization and classification in different fields of computational research, e.g. Fast Perceptron Decision Tree Learning [8], Massive Online Analysis (MOA) [9], 3D Face Recognition Using Multiview Keypoint Matching [10], Evolving Data Streams [11], Classifier Chains [12], Multi-label Classification [13], Multiple-Instance Learning [14], Adaptive Regression [15], nearest neighbor search [16], Bayesian network classification [17, 19], naive Bayes text classification [18], ML for Information Retrieval [20], probabilistic unification grammars [21], instance weighting [22], KEA [23] and metadata for ML [24, 25]. Apart from these valuable existing approaches, we decided to implement our own software application during this research and development, based on a different methodology.
In section 2 of this paper, we discuss the implemented methodology for the classification and optimization of data with learning behavior built into it. In section 3, we present the implemented console-based software application and its workflow, which realizes the methodology discussed in section 2. In section 4 we describe the experiments performed with the implemented software application using two different complex datasets, the Zoo Database and the Labor Database, and we summarize our observations in section 5. Finally, we conclude the discussion in section 6 and present future recommendations in section 7.

2. METHODOLOGY

First, the necessary information needs to be extracted from the dataset in use. Based on the extracted information, all attributes, classes, instances and relationships need to be categorized. Information about the hidden layers, the learning rate and the momentum must also be obtained. Classification is then performed using a classifier to identify correctly and incorrectly classified instances. As we have already mentioned the reasons for choosing the Genetic Algorithm to obtain optimized results by evaluating attributes for classification on the basis of the provided instances, we present the methodology briefly, without going into further rationale, as shown in Figure 1. The whole methodology is divided into the following seven steps.

Figure 1: Data Classification Methodology; Genetic Algorithm

1. Estimate chromosomes: chromosomes are estimated by calculating the range (upper and lower limits). In a genetic algorithm, a chromosome is essentially a set of parameters defining a solution to the targeted problem(s), represented as a data structure with a wide variety of possible encodings.

2. Set learning rate and momentum: based on the calculated chromosome information, the learning rate and momentum are set for cross validation. The learning rate and momentum are a pair of values that determine the effect of machine learning using the BPNN [28] algorithm, including its speed and optimum efficiency.

3. Cross over using the pair of best results: from all the information obtained as the result of step 2, a pair of the two best results (chromosomes) is chosen for crossover on the basis of accuracy and the number of correctly classified instances. During crossover, bits are exchanged and two new children (offspring chromosomes) are created. Crossover is one of the main operators of the genetic algorithm and is divided into two categories: single-point crossover (based on a binary string split at one point of the chromosome, one part of which goes to the first child and the other part to the second child) and two-point crossover (two parts are selected and exchanged).

4. Mutate the new offspring: the two children resulting from step 3 are then used for mutation (maintaining inherited variation from one generation of a population of chromosomes to the next) to evaluate the accuracy of these two children.

5. Replace offspring: from the actual results, the two values with the worst accuracy are replaced by these new offspring. As with the actual data processing, processing is repeated with the two new offspring to generate a new population, which replaces the previous population.

6. Perform cross validation: the next step is to perform cross validation, to create new children different from their parents.

7. Estimate individual and cumulative weights of all instances: as the last step, the individual and cumulative weight factors of all the instances are calculated. Information based on the number of hidden layers, learning rate, momentum and other available options is then obtained and data classification is performed. The Back Propagation Neural Network learning algorithm is implemented for the Multilayer Perceptron [2].
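The crossover and mutation operators outlined in the steps above can be sketched as follows. This is a minimal illustration, not the DOC implementation; the 6-bit chromosome layout, crossover point and mutation rate are assumptions chosen for the example.

```java
import java.util.Random;

// Minimal sketch of single-point crossover and bit-flip mutation on
// fixed-length binary chromosomes, as outlined in the methodology steps.
public class GeneticOperators {
    static final Random RNG = new Random(42);

    // Single-point crossover: swap the tails of two parents at 'point'.
    static int[][] crossover(int[] p1, int[] p2, int point) {
        int n = p1.length;
        int[] c1 = new int[n], c2 = new int[n];
        for (int i = 0; i < n; i++) {
            c1[i] = (i < point) ? p1[i] : p2[i];
            c2[i] = (i < point) ? p2[i] : p1[i];
        }
        return new int[][] { c1, c2 };
    }

    // Bit-flip mutation: flip each bit independently with probability 'rate'.
    static int[] mutate(int[] chromosome, double rate) {
        int[] mutated = chromosome.clone();
        for (int i = 0; i < mutated.length; i++) {
            if (RNG.nextDouble() < rate) mutated[i] = 1 - mutated[i];
        }
        return mutated;
    }

    public static void main(String[] args) {
        int[] parent1 = { 1, 1, 1, 0, 0, 0 };
        int[] parent2 = { 0, 0, 0, 1, 1, 1 };
        int[][] children = crossover(parent1, parent2, 3);
        System.out.println(java.util.Arrays.toString(children[0])); // [1, 1, 1, 1, 1, 1]
        System.out.println(java.util.Arrays.toString(children[1])); // [0, 0, 0, 0, 0, 0]
        System.out.println(mutate(children[0], 0.1).length); // 6
    }
}
```

In a full genetic algorithm these operators would run inside a loop that evaluates each chromosome's fitness (here, classification accuracy) and selects the two best chromosomes as parents each generation.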

3. SOFTWARE IMPLEMENTATION: DOC

We have implemented the genetic algorithm based methodology discussed earlier in the form of a software application, the Optimal Data Classifier (DOC), for the classification of data and the estimation of optimized results. The implemented software is a desktop console application, developed using the Java programming language (by Sun Microsystems). The software is capable of processing standard Attribute-Relation File Format (ARFF) dataset files. ARFF is a text file format describing a list of instances sharing a set of attributes, developed especially for machine learning projects. As shown in Figure 2, the workflow of the implemented software application starts with a basic analysis of the input dataset file (.arff) and derives information based on the number of attributes, classes, instances and relationships. As discussed in section 2, to obtain the information based on the number of hidden layers, learning rate, momentum and other available options and to perform data classification, we have implemented the Back Propagation Neural Network for the Multilayer Perceptron using WEKA [3, 4, 5], and the implemented genetic algorithm (genetic programming) is used to estimate optimized results. Briefly describing WEKA [26]: it is an open source software library issued under the GNU General Public License containing a collection of machine learning algorithms proposed especially for data mining tasks, e.g. data preprocessing, regression, classification, clustering and association rules. Furthermore, WEKA provides a machine learning workbench (a general-purpose environment) with a graphical user interface for data exploration and the experimental comparison of different machine learning techniques on the same problem, as shown in Figure 3.
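As a rough illustration of the first workflow step, deriving the relation name, attribute count and instance count from an ARFF file, a minimal header scan might look like the following. This is a simplified sketch, not DOC's parser, and it ignores many ARFF features (quoting, sparse instances, relational attributes).

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: scan ARFF text for the relation name, attribute names and
// instance count. Not a full ARFF parser (no quoting, no sparse data, etc.).
public class ArffSummary {
    String relation = "";
    List<String> attributes = new ArrayList<>();
    int instances = 0;

    static ArffSummary scan(String arffText) {
        ArffSummary s = new ArffSummary();
        boolean inData = false;
        for (String raw : arffText.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) continue; // ARFF comments
            String lower = line.toLowerCase();
            if (lower.startsWith("@relation")) {
                s.relation = line.substring("@relation".length()).trim();
            } else if (lower.startsWith("@attribute")) {
                // the attribute name is the first token after @attribute
                s.attributes.add(line.split("\\s+")[1]);
            } else if (lower.startsWith("@data")) {
                inData = true;
            } else if (inData) {
                s.instances++; // each remaining non-comment line is one instance
            }
        }
        return s;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
            "@relation zoo",
            "@attribute hair {false, true}",
            "@attribute legs numeric",
            "@attribute type {mammal, bird}",
            "@data",
            "true, 4, mammal",
            "false, 2, bird");
        ArffSummary s = scan(sample);
        System.out.println(s.relation);          // zoo
        System.out.println(s.attributes.size()); // 3
        System.out.println(s.instances);         // 2
    }
}
```

In practice, WEKA's own `Instances` class performs this parsing (and much more) when loading an `.arff` file, which is what a WEKA-based application would normally rely on.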


Figure 2: Work flow of implemented software application

Data classification is then performed to execute the analysis using the genetic algorithm. During examination with the genetic algorithm, chromosomes are estimated first, then the learning rate and momentum are set to perform crossover using the pair of best results. Furthermore, as discussed in section 2, the mutation of the two offspring is performed next; on the basis of the obtained accuracies, the two old offspring with the lowest values are replaced with the two new offspring, cross validation is performed and, in the end, the individual and cumulative weights of the instances are calculated. The obtained results are validated and the final output is presented to the user.

Figure 3: Weka Explorer; Graphical User Interface


The implemented version of the software application is capable of reading and analyzing the population in a given dataset and, based on the identified population, it estimates the following results: kinds of species in the population (if there is more than one), correctly classified instances, incorrectly classified instances, hidden layers, momentum and accuracy (optimized/weighted results). The software application is capable of repeating the prediction procedure up to the user's satisfaction level, and each time the user can expect differences in the results depending on each transaction's input and output values.

4. EXPERIMENTATIONS

To evaluate the strength of the implemented methodology in the form of a software application, we have performed several experiments using different datasets; in this paper we discuss two experiments performed using two different datasets, the Zoo Database [6] and the Labor Database [7].

4.1 Zoo Database Experiment

Zoo is a database (zoo.arff) containing 101 instances with 18 attributes: two numeric attributes, animal and legs, and the remaining 16 Boolean attributes, i.e. hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, tail, domestic, catsize and type, as shown in Figure 4 using the Weka Explorer.

Figure 4: Zoo Database; using Weka Explorer (Graphical User Interface)


The dataset file of the Zoo Database is input to the developed software application to evaluate the attributes feathers, eggs, milk, airborne, aquatic and fins for the classification of the animal (first attribute) against the type (last attribute) on the basis of all other attributes, considering all the instances of the dataset randomly. The results observed during the experimentation process using our software application with the Zoo Database are shown in Figure 5:
1. 1617 mammals, 539 birds, 0 reptiles, 637 fish, 0 amphibians, 490 insects and 49 invertebrates from the whole population of 3332 species in the dataset.
2. 68 instances correctly classified and the remaining 33 incorrectly classified, out of all 101 instances.
3. No hidden layer.
4. Learning rate 0.3.
5. Momentum 0.1.
6. Accuracy 0.67326732673267326733.
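The reported accuracy is simply the fraction of correctly classified instances, which can be checked directly (a trivial sketch, not DOC code):

```java
// Check: accuracy = correctly classified instances / total instances,
// using the Zoo experiment's figures (68 correct out of 101).
public class AccuracyCheck {
    static double accuracy(int correct, int total) {
        return (double) correct / total;
    }

    public static void main(String[] args) {
        System.out.println(accuracy(68, 101)); // ≈ 0.6732673267326733
    }
}
```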

Figure 5: Experimental results of Zoo Database; at first offspring

4.2 Labor Database Experiment

Labor is a database containing 57 instances with 16 attributes: 13 numeric attributes, i.e. duration, wage increase in the first year, wage increase in the second year, wage increase in the third year, cost of living adjustments, working hours, pension, standby pay, shift differential, statutory holidays, vacations, contribution to dental plan and contribution to health plan, and 3 Boolean attributes, i.e. bereavement assistance, long term disability assistance and education allowance, as shown in Figure 6 using the Weka Explorer.


Figure 6: Labor Database; using Weka Explorer (Graphical User Interface)

The dataset file of the Labor Database is input to the developed software application to evaluate the attributes duration, wages, cost of living adjustment, working hours, pension and vacations for classification on the basis of all other attributes, considering all instances randomly. The results observed during the experimentation process using our software application with the Labor Database are shown in Figure 7:
1. 49 good and 441 bad out of the whole population of 490.
2. 10 instances correctly classified and the remaining 47 incorrectly classified, out of all 57 instances.
3. No hidden layer.
4. Learning rate 0.1.
5. Momentum 0.5.
6. Accuracy 0.17543859649122806.


Figure 7: Experimental results of the Labor Database; at first offspring

5. OBSERVATIONS

We have performed different experimental transactions using these two datasets and observed the results. We applied the experimental behavior to the process of attribute classification on the basis of their respective values, and observed 100 percent results. We experimented in three different ways:
1. By increasing the learning rate while keeping the momentum constant.
2. By increasing both the learning rate and the momentum.
3. By randomly changing the weight.
During the experiments the chromosome size was 6 bits: a 3-bit decimal value (0-10, divided by 10) for the learning rate and a 3-bit decimal value for the momentum. We observed during the experiments that by keeping the default instance weight the results remain stable, and by increasing the instance weight the size of the results increases, provided the Multilayer Perceptron object is not reinitialized; e.g. if we double the weight, the size of the result doubles. Furthermore, we also observed that mutation can affect the accuracy in both directions, either increasing or decreasing it. We also observed, with respect to learning time, that the classifier produces results in the minimum possible time with a value of 1; the more we increase the value of the classifier, the more time it takes.
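The 6-bit chromosome encoding described above can be illustrated as follows. The exact bit layout used by DOC is not specified in the paper, so the mapping below (first 3 bits for the learning rate numerator, last 3 bits for the momentum numerator, each divided by 10) is an assumption for illustration; note that a 3-bit field can only represent 0-7, so this sketch covers values 0.0-0.7.

```java
// Illustrative decoding of a 6-bit chromosome into a learning rate and a
// momentum. Assumed layout: bits 0-2 encode the learning-rate numerator,
// bits 3-5 the momentum numerator; both are divided by 10.
public class ChromosomeDecoder {
    // Interpret bits[from..to) as an unsigned big-endian integer.
    static int bitsToInt(int[] bits, int from, int to) {
        int value = 0;
        for (int i = from; i < to; i++) value = (value << 1) | bits[i];
        return value;
    }

    // Returns {learningRate, momentum} decoded from a 6-bit chromosome.
    static double[] decode(int[] chromosome) {
        double learningRate = bitsToInt(chromosome, 0, 3) / 10.0;
        double momentum = bitsToInt(chromosome, 3, 6) / 10.0;
        return new double[] { learningRate, momentum };
    }

    public static void main(String[] args) {
        int[] chromosome = { 0, 1, 1, 0, 0, 1 }; // 011 -> 3, 001 -> 1
        double[] params = decode(chromosome);
        System.out.println(params[0]); // 0.3 (learning rate)
        System.out.println(params[1]); // 0.1 (momentum)
    }
}
```

Under this assumed layout, the example chromosome decodes to the learning rate 0.3 and momentum 0.1 reported for the Zoo experiment.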

6. CONCLUSIONS

In this paper, we began the discussion by highlighting the field of machine learning and then focused on the targeted problem of data optimization; we first discussed some existing machine learning approaches and then presented our methodology (genetic algorithm) and its implementation in the form of a software application. To evaluate the effectiveness of the implemented approach, we performed experiments using two entirely different, open and freely available datasets with completely different characteristics and population sizes. As a result of the experimentation process, we found 100 percent accurate results for the process of attribute classification on the basis of their respective values. Apart from our own methodology, some data mining methodologies (WEKA-based software applications) exist that target the same goals of data classification and optimization using mathematical algorithms, e.g. the Genetic Programming Classifier, which uses a tree structure to predict and classify data [27]; MULAN, which uses multi-label classification methods for music categorization, protein function classification and semantic scene classification [30]; and Tree-based Density Estimation (TUBE), based on probabilistic principles [29]. Each of these, however, differs from our methodology in data processing and result production; the major reason could be the use of different algorithms (methodologies). Concluding the discussion, based on our own observations, we can only claim that the methodology discussed in this paper, in the form of a software application, could be very useful, especially for data classification and optimization.

7. FUTURE RECOMMENDATIONS

As this is ongoing research in the field of machine learning, in future work, expanding the scope of our research, we will focus on the enhancement and application of our methodology for data classification, optimization and learning behavior implementation, concentrating only on the field of bioinformatics, especially on protein structure, functionality and annotation analysis as well as systems biology (protein network analysis).

8. AUTHORS' CONTRIBUTIONS

Initially, this research project (DOC) was started by the author Zeeshan Ahmed as an academic (unfunded, personal) research project at the Department of Computer Science, Blekinge Institute of Technology, Sweden. Continuing this research, both authors are working on its further enhancements at the Department of Bioinformatics, University of Wuerzburg, Germany.

9. ACKNOWLEDGEMENTS

We would like to thank the academic administration of Blekinge Institute of Technology, Sweden, for giving us the opportunity to initiate this project, and the University of Wuerzburg, Germany, for letting us continue this research. We would like to express our special gratitude to Prof. Dr. Thomas Dandekar for his support and guidance. The author Saman Majeed is also thankful to the DFG (Da 208/11-1) for its support.


REFERENCES

1. Goldberg, D.E., Holland, J.H. (1988) Genetic Algorithms and Machine Learning. Machine Learning 3: 95-99.
2. Amini, J. (2008) Optimum Learning Rate in Back-Propagation Neural Network for Classification of Satellite Images (IRS-1D). Scientia Iranica, Vol. 15, No. 6, pp. 558-567.
3. DeWar, R.E., Neal, D.L. (1994) WEKA machine learning project: Cow culling. Technical report, The University of Waikato, Computer Science Department. http://www.cs.waikato.ac.nz/ml/publications1994.html. Accessed 18 April 2011.
4. Holmes, G., Donkin, A., Witten, I.H. (1994) WEKA: A machine learning workbench. In Proc Second Australia and New Zealand Conference on Intelligent Information Systems, Brisbane, Australia, 1994.
5. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H. (2009) The WEKA data mining software: an update. ACM SIGKDD Explorations, Volume 11, Issue 1. doi:10.1145/1656274.1656278.
6. Forsyth, R. (1994) Zoo database. http://www.hakank.org/weka/zoo.arff. Accessed 18 April 2011.
7. Matwin, S. (1988) Final settlements in labor negotiations in Canadian industry. http://www.hakank.org/weka/labor.arff. Accessed 18 April 2011.
8. Bifet, A., Holmes, G., Pfahringer, B., Frank, E. Fast perceptron decision tree learning from evolving data streams. In Proc 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Hyderabad, India, pp. 299-310. Springer, 2010.
9. Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B. MOA: Massive online analysis. Journal of Machine Learning Research (JMLR), 2010.
10. Mayo, M., Zhang, E. 3D face recognition using multiview keypoint matching. In Proc 6th International Conference on Advanced Video and Signal Based Surveillance, Genova, Italy, pp. 290-295. IEEE Computer Society, 2009.
11. Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., Gavaldà, R. New ensemble methods for evolving data streams. In KDD '09: Proc 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 139-148, New York, NY, USA, 2009. ACM.
12. Read, J., Pfahringer, B., Holmes, G., Frank, E. Classifier chains for multi-label classification. In Proc 13th European Conference on Principles and Practice of Knowledge Discovery in Databases and 20th European Conference on Machine Learning, Bled, Slovenia. Springer, 2009.
13. Read, J., Pfahringer, B., Holmes, G. Multi-label classification using ensembles of pruned sets. In Proc 8th IEEE International Conference on Data Mining, Pisa, Italy, pp. 995-1000. IEEE Computer Society, 2008.
14. Foulds, J., Frank, E. Revisiting multiple-instance learning via embedded instance selection. In Proc 21st Australasian Joint Conference on Artificial Intelligence, Auckland, New Zealand. Springer, 2008.
15. Frank, E., Hall, M. Additive regression applied to a large-scale collaborative filtering problem. In Proc 21st Australasian Joint Conference on Artificial Intelligence, Auckland, New Zealand. Springer, 2008.

16. Kibriya, A.M. (2007) Fast algorithms for nearest neighbour search. Master's thesis, Department of Computer Science, University of Waikato.
17. Bouckaert, R.R. Voting massive collections of Bayesian network classifiers for data streams. In Proc 19th Australian Joint Conference on Artificial Intelligence, pp. 243-252, 2006.
18. Frank, E., Bouckaert, R.R. Naive Bayes for text classification with unbalanced classes. In Proc 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, Germany, pp. 503-510. Springer, 2006.
19. Bouckaert, R. Bayesian network classifiers in WEKA. Technical Report 14/2004, The University of Waikato, Department of Computer Science, Hamilton, New Zealand, 2004.
20. Cunningham, S.J., Littin, J.N., Witten, I.H. Applications of machine learning in information retrieval. Technical Report 97/6, University of Waikato, Department of Computer Science, Hamilton, New Zealand, February 1997.
21. Smith, T.C., Cleary, J.G. Probabilistic unification grammars. In Workshop notes: ACSC'97 Australasian Natural Language Processing Summer Workshop, pp. 25-32, 1997.
22. Ting, K.M. Inducing cost-sensitive trees via instance-weighting. Technical Report 97/22, University of Waikato, Department of Computer Science, Hamilton, New Zealand, September 1997.
23. Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C., Nevill-Manning, C.G. KEA: Practical automatic keyphrase extraction. In Proc 4th ACM Conference on Digital Libraries, Berkeley, CA, pp. 254-255. ACM, August 1999.
24. Cleary, J.G., Holmes, G., Cunningham, S.J., Witten, I.H. Metadata for database mining. In Proc IEEE Metadata Conference, Silver Spring, MD, April 1996.
25. Cunningham, S.J. Dataset cataloguing metadata for machine learning applications and research. Technical Report 96/26, University of Waikato, Computer Science Department, Hamilton, New Zealand, October 1996.
26. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H. (2009) The WEKA data mining software: an update. SIGKDD Explorations Newsletter 11(1): 10-18, November 2009.
27. Genetic Programming Classifier for Weka. Accessed 10 May 2011.
28. Bajwa, I.S., Choudhary, M.A. (2006) A Study for Prediction of Minerals in Rock Images using Back Propagation Neural Networks. In: IEEE 1st International Conference on Advances in Space Technologies (ICAST 2006), pp. 185-189.
29. TUBE: Tree-based Density Estimation Algorithms. Accessed 10 May 2011.
30. Tsoumakas, G., Katakis, I. (2007) Multi Label Classification: An Overview. International Journal of Data Warehousing and Mining, David Taniar (Ed.), Idea Group Publishing, 3(3), pp. 1-13.
