Applied Mechanics and Materials Vol. 371 (2013) pp 769-773 © (2013) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/AMM.371.769
Machine Learning and Creative Methods Used to Classify Customers in a CRM Systems
CIOCA Marius1,a, GHETE Andrada-Iulia2,b, CIOCA Lucian-Ionel1,c and GÎFU Daniela3,d
1 ”Lucian Blaga” University of Sibiu, Bd-ul Victoriei, nr. 10, 550024, Romania
2 Technische Universität München, Arcisstraße 21, D-80333 Munich, Germany
3 ”A.I. Cuza” University of Iasi, Bd-ul Carol I, nr. 11, 700506, Romania
a [email protected], b [email protected], c [email protected], d [email protected]
Keywords: creative, data mining, classify, customers, CRM systems.
Abstract. As business processes become increasingly customer-oriented, marketing and the connection between top management and customers become extremely important and deserve special attention, given globalization and growing market competition. Because data volumes are growing exponentially in every field, new methods and techniques are needed to discover “hidden” information in data, information that is almost impossible to detect by traditional means based on human analysis skills. This paper presents a pre-processing method for data mining that supports decision makers in developing coherent and saleable customer strategies, based on classifying customers into categories specific to the area of concern, with the aim of improving Customer Relationship Management (CRM) systems. To facilitate analysis and decision making by top management, the results are presented in graphic format, using the facilities provided by the Google API integrated into an application built with Open Source technologies.
Introduction
Information processing combined with communication techniques, supported by the tremendous development of the Internet, has led to great amounts of data, information and knowledge represented in the most diverse forms [1]. This amount of data is further increased not only by the permanent development of the Web, but also by the aggressive appearance of emerging technologies such as dedicated systems, mobile systems and ubiquitous systems for information processing. Under these circumstances, the need emerged to extract information and knowledge from this massive, distributed data in order to support decision makers.
In decision making processes the human factor occupies the central position, but methods with a predominantly qualitative character require the analysis of a data volume that is practically impossible for a human to cover [2]. The main concepts and results in assisting decision making processes with electronic means that involve data analysis are: Data Warehouse (DW), On-line Analytic Processing (OLAP), and Data Mining and Knowledge Discovery (DMKD) [3, 4]. Related to these applications, a number of achievements can be listed [5], of which the following are of great interest in the context of this study: a) client categorization, b) pricing and product positioning in the market, c) discovery of potential customers and retention of existing ones, d) prediction and prevention of bankruptcy situations, e) determining the safety level of the order portfolio and the appropriate measures that must be taken, etc. Organizations have accumulated massive amounts of data on computer storage media regarding transactions of various types, carried out over many years. In almost any organization there are thousands or hundreds of thousands of records regarding purchases, sales, receipts and
Innovative Manufacturing Engineering
payments. Stores have hundreds of thousands of records from the cash register, containing not only the items purchased but also the customers, identified by their access pass or loyalty card. For a long time these data were collected simply because of the need for archiving. Continuously increasing competition and growing market requirements have made organizations aware of the potential of this data archive. As early as 1992, H. Simon, winner of the Nobel Prize for economics [1], pointed out that what was missing was not information, but the manager’s time to consider all the available information. The collected data include information and “hidden” knowledge that can serve the good of the organization, but taken as such they are not very useful unless they are accompanied by mechanisms and methods that allow their exploration and the understanding of the phenomena that govern the functioning of the data sources. The culture of an economic organization in a knowledge-based society [6] cannot be built without taking into account its history and without using methods that provide consistent data and “extract” the “hidden” information from the data processed by the company and beyond.
Current Research
Studies show that interest in Data Mining (DM) has considerably increased, with organizations becoming aware of the competitive advantages offered by these computational technologies [7]. Top software products from companies like IBM, SAS, etc. are now more than simple search engines based on complex algorithms. Thus, more and more Enterprise Resource Planning (ERP) and Customer Relationship Management (CRM) systems, such as SAP [8], can no longer be effective without approaches based on DM.
Definitions of DM are numerous; as an example, we quote the one returned by [9], namely “data processing using sophisticated data search capabilities and statistical algorithms to discover patterns and correlations in large preexisting databases; a way to discover new meaning in data”.
Method
The method, used to analyze various data sources, aims to establish customer profiles and to classify customers (to see which products are in demand and, ultimately, customers’ predisposition to certain products) in order to assist management in decision making and in adjusting future production, as well as to support effective product supply for major retailers and, not least, the supply of raw materials for production. To apply this method, we built, in the PHP language (Hypertext Preprocessor), a software application for the extraction, classification and interpretation of information from different sources, which allows the analysis of both text files and data available on the Internet (for example, online trading websites, Twitter, RSS, etc.). As a demonstration, data from ebay.com were analyzed in this case, using the occurrence frequency of different brands of mobile phones (Blackberry, HTC, Motorola, Apple iPhone, Samsung and Sony). Furthermore, the application can analyze data from any field if a vocabulary specific to the analyzed "subject" is defined. As an example, the vocabulary used for this method consists of 6 classes (Table 1), and each class has 5 words used for analysis, classification and information extraction from different sources.
Table 1. 30 Word Dictionary divided into 6 classes
Class          Words
BlackBerry     BlackBerry Storm2, BlackBerry Z10, BlackBerry Tour, BlackBerry Curve, BlackBerry Bold 9000, BlackBerry 8820
HTC            HTC Magic, HTC Desire, HTC EVO, HTC One ST, HTC Vivid, HTC 7 Pro
Motorola       Motorola Fire, Motorola Xoom 2, Motorola Atrix 2, Motorola Pro, Motorola CUPE, Motorola ES400
Samsung        Samsung Star 3, Samsung Galaxy, Samsung C3350, Samsung Galaxy Q, Samsung Trender, Samsung 4G LTE
Sony           Sony T 250, Sony Vivaz, Sony Cedar, Sony TXT, Sony A8i, Sony S003
Apple Iphone   Apple Iphone 3 GS, Apple Iphone 4S, Apple Iphone 4, Apple Iphone 3 G, Apple Iphone 4 CDMA, Apple Iphone 5
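To make the dictionary structure more concrete, the following is a minimal Python sketch of how such a vocabulary could be stored and loaded. It is not the authors' PHP code: the INI layout (one section per class, a single `words` key), the abbreviated two-class sample and the helper names `load_vocabulary` and `build_index` are all assumptions made for illustration.

```python
import configparser

# Hypothetical INI layout (an assumption, not the authors' file format):
# one section per class, one "words" key with comma-separated entries.
SAMPLE_INI = """
[BlackBerry]
words = BlackBerry Storm2, BlackBerry Z10, BlackBerry Tour

[HTC]
words = HTC Magic, HTC Desire, HTC EVO
"""


def load_vocabulary(ini_text):
    """Return {class_name: [entry, ...]} parsed from INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    vocabulary = {}
    for section in parser.sections():
        raw = parser.get(section, "words")
        vocabulary[section] = [w.strip() for w in raw.split(",") if w.strip()]
    return vocabulary


def build_index(vocabulary):
    """Map each lower-cased entry to its class for direct lookup."""
    return {
        entry.lower(): cls
        for cls, entries in vocabulary.items()
        for entry in entries
    }


vocabulary = load_vocabulary(SAMPLE_INI)
index = build_index(vocabulary)
```

Defining a vocabulary for another field would then amount to supplying a different file with the same layout, which matches the claim that the application is domain-independent once a dictionary is defined.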
To achieve these objectives, we used the algorithm described below. Because all classes (categories) defined and used by the application are balanced in terms of quantity (each has five words), the application performs only PAF normalization, namely a quality normalization. Moreover, the application uses a special class of “linking words” that are not counted, in order to reduce "the noise" in the data pre-processing phase.
Requirements:
a. The structure of primary elements, termed word classes, specified in a "source word" format (the word without a prefix or suffix, i.e. the word root);
b. The link pool to be classified.
Note: These sources are stored in standard configuration files (*.ini).
The Application Methodology:
a. All categories are loaded in an accepted format
o wordCount_cat = [total no. of words / category]
b. Every link is downloaded
o The algorithm does not follow links found in a source document; that is, it operates only at level 0;
c. Tags are extracted
d. A word-based structure is created, and all categories connected with a word are linked to it
o It can also be checked whether a word appears in two categories; such a word must be deleted from both categories to prevent conflicts;
e. Each word is searched for in the extracted text and a distribution is created in the following format
o [Word(root)]_cat/link = [number of occurrences in the linked document]
f. The absolute probability, i.e. the ratio between the number of occurrences in a given category and the total number of words, is calculated by:
\[ PA_{cat/link} = \frac{\sum word(root)_{cat/link}}{wordCountTotal_{link}} \tag{1} \]
g. The normalized probability is calculated:
\[ PAFinal_{cat/link} = \frac{PA_{cat/link}}{\sum PA_{cat/link}} \cdot 100 \tag{2} \]
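The steps above can be sketched in a few lines of Python. This is an illustrative reading of the algorithm, not the authors' PHP implementation, and it rests on several assumptions: step c is interpreted as stripping HTML markup, word roots are matched as exact single tokens (multi-word entries and stemming are not handled), wordCountTotal is taken as the number of tokens left after removing linking words, and the sum in Eq. (2) runs over all categories for one link. The two-class dictionary, the stop list and the sample page are invented for the example.

```python
import re
from html.parser import HTMLParser

# Hypothetical dictionary of word roots and linking words (assumptions);
# the application described in the paper reads these from *.ini files.
CATEGORIES = {
    "Samsung": ["samsung", "galaxy"],
    "Sony": ["sony", "vivaz"],
}
LINKING_WORDS = {"the", "and", "a", "of", "for"}


class TextExtractor(HTMLParser):
    """Collects text content and ignores the markup (reading of step c)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def remove_conflicts(categories):
    """Delete any word that appears in more than one category (step d)."""
    seen = {}
    for cat, words in categories.items():
        for word in set(words):
            seen.setdefault(word, set()).add(cat)
    return {
        cat: [w for w in words if len(seen[w]) == 1]
        for cat, words in categories.items()
    }


def classify(html, categories):
    """Return ({cat: PA}, {cat: PAFinal}) for one document, per Eqs. (1)-(2)."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    tokens = re.findall(r"[a-z0-9]+", " ".join(extractor.chunks).lower())
    tokens = [t for t in tokens if t not in LINKING_WORDS]
    total = len(tokens)  # wordCountTotal_link (assumed: after filtering)
    pa = {}
    for cat, words in remove_conflicts(categories).items():
        occurrences = sum(tokens.count(w) for w in words)  # step e
        pa[cat] = occurrences / total if total else 0.0    # Eq. (1)
    pa_sum = sum(pa.values())
    pa_final = {
        cat: (100 * p / pa_sum if pa_sum else 0.0)         # Eq. (2)
        for cat, p in pa.items()
    }
    return pa, pa_final


page = (
    "<ul><li>Samsung Galaxy for sale</li>"
    "<li>Sony Vivaz</li><li>Samsung phone</li></ul>"
)
pa, pa_final = classify(page, CATEGORIES)
```

For the sample page, seven tokens remain after the linking word "for" is dropped; three belong to the Samsung class and two to the Sony class, so the absolute probabilities are 3/7 and 2/7, and Eq. (2) rescales them to 60 and 40.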
Output:
a. Unix-like configuration file - a format compatible with the large majority of programming languages, for which parsers are included at the API level, and which thus allows further cross-platform processing. The drawback is that it does not allow the types of the fields to be specified, so the interpretation of the data stored in this output format is left largely to the source code that reads it;
b. HTML Table - a format which ensures cross-platform compatibility of the user interface (UI), but which is difficult to work with in terms of interoperability between applications that use these output data;
c. XML - using DMG [10], ensures the best interoperability between applications which use this type of data. The advantage is due both to XML, which incorporates specifications about the types of data used in the output document, and to the DMG specifications, which provide a full understanding of the output data;
d. Google API Charts [11] - ensures visual comparison. Google Charts is integrated in the application using JavaScript and HTML, which gives it versatility and enables it to display results clearly.
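As an illustration of the first two output formats, the Python sketch below serializes a set of normalized scores as an INI section and as an HTML table. The scores are made-up placeholder values, the section name `example-link` and the function names `to_ini` and `to_html_table` are hypothetical, and the layout is only one possible choice, not the one used by the authors' application.

```python
import configparser
import io
from html import escape

# Made-up placeholder scores for one link: {category: PAFinal in percent}.
results = {"Samsung": 60.0, "Sony": 40.0}


def to_ini(link, scores):
    """Serialize scores as an INI section named after the link."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep category names case-sensitive
    parser[link] = {cat: f"{value:.2f}" for cat, value in scores.items()}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


def to_html_table(scores):
    """Render scores as a two-column HTML table."""
    rows = "".join(
        f"<tr><td>{escape(cat)}</td><td>{value:.2f}</td></tr>"
        for cat, value in scores.items()
    )
    header = "<tr><th>Category</th><th>PAFinal (%)</th></tr>"
    return f"<table>{header}{rows}</table>"


ini_text = to_ini("example-link", results)
html_text = to_html_table(results)

# Reading the INI text back yields plain strings, not numbers.
reader = configparser.ConfigParser()
reader.optionxform = str
reader.read_string(ini_text)
value_read_back = reader["example-link"]["Samsung"]
```

The value read back from the INI text is the string "60.00" rather than a number, which illustrates the limitation noted for format a: the file carries no field types, so each consumer has to decide how to interpret the values.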
Research Results
The results obtained through the application for the pre-processing phase of a data mining process are presented in Fig. 1, where we notice that the best-selling phone at [12] is the Apple iPhone. The main benefits of this kind of method, as implemented in the application, are:
a) The speed with which a large data volume is analyzed and the extraction of information, characteristics and features specific to the analyzed field, with a view to client orientation, the improvement of CRM systems and support in customer relationship management;
b) From the customers’ perspective, it can be a good reference point when acquiring a new product, since it shows the most sold models, which can be indicative of product reliability over time and which offer extensive features beyond voice calls (web camera, Internet access);
c) From the perspective of mobile phone manufacturers and top management, it can inform decisions on future marketing and production strategies and on product improvements, by making them familiar with customers’ orientation towards certain specifications and requirements related to a particular product.
It is to be noted that this method can be applied to any field, such as the social, political, economic, engineering or medical fields, provided that a dictionary specific to the area of interest is defined.
Fig. 1. Results achieved in the DM pre-processing step
Conclusions and Future Research
Considering that in a “knowledge society” we must provide data consistency by creating “smarter” computers, we plan to extend the method, moving towards Web 3.0, by defining a meta-classifier that encompasses solutions such as Naïve Bayes, Support Vector Machines, Neural Networks, etc. [13,14]. In order to simulate these solutions, we will use the MATLAB development environment, which supports Naïve Bayes analyses [15] and Support Vector Machines [16], the final implementation being achieved in Java.
Acknowledgements
In carrying out part of this research, the fourth author was supported by the POSDRU/89/1.5/S/63663 grant.
References
[1] C. Lepadatu, Solutions for knowledge discovery in database / data mining (in Romanian: Solutii informatice pentru descoperirea cunostintelor din date / mineritul datelor), Institutul de Cercetari pentru Inteligenta Artificiala, Romania, 2011.
[2] F. Gorunescu, Data Mining Concepts, Models and Techniques (in Romanian: Data Mining Concepte, Modele si Tehnici), Editura Albastra, Cluj-Napoca, 2006.
[3] M. Cioca, L.I. Cioca, Decision Support Systems used in Disaster Management, in Chiang S. Jao (Ed.), Decision Support Systems, ISBN: 978-953-7619-64-0, InTech, 2010, pp. 371-390.
[4] S. Buraga, M. Cioca, A. Cioca, Grid-Based Decision Support System Used in Disaster Management, Studies in Informatics and Control, Vol. 6, no. 3, 2007.
[5] F.G. Filip, Decision Support Systems (in Romanian: Sisteme Suport pentru Decizii), Editura Tehnica, Bucuresti, 2004.
[6] F.G. Filip, I. Cojocaru, Economy of Culture in the Information Society Based on Knowledge, in UNESCO Regional Bureau for Science and Culture in Europe (BRESCE), 2010.
[7] M. Cioca, C. Cioranu, D. Gifu, Computational techniques in management of engineering and business institutions, in International Conference on Engineering & Business Education, Innovation and Entrepreneurship, ISBN 978-606-12-0369-7, ISSN 1843-6730, 2012, pp. 469-472.
[8] I. Bondrea, H. Hermann, C. Simion, Using SAP in production planning & control for automotive manufacturing, in International Conference on Systems. Theory and Applications, 2007, pp. 322-325.
[9] WordNet Search - 3.1. (n.d.), available at http://wordnetweb.princeton.edu/perl/webwn?s=data+mining&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=, accessed: 15.11.2012.
[10] DMG (Data Mining Group), available at http://www.dmg.org/, accessed: 03.02.2013.
[11] Google.com. (n.d.). Google Chart Tools - Google Developers, available at https://developers.google.com/chart/, accessed: 03.02.2013.
[12] Information available at www.ebay.com, accessed: 14.02.2013.
[13] R.G. Belu, C. Oancea, A. Belu, L.I. Cioca, A 2-d indoor radio propagation modeling by using MATLAB for classroom instruction, in 33rd Annual Conference Frontiers in Education, FIE 2003.
[14] C. Bigan, A Recursive Time-frequency Processing Method for Neural Networks Recognition of EEG Seizures, in Neural Networks and Expert Systems in Medicine and Healthcare, World Scientific, 1998, pp. 67-73.
[15] MathWorks, Naive Bayes classifier - MATLAB, available at http://www.mathworks.com/help/toolbox/stats/naivebayesclass.html, accessed: 22.11.2012.
[16] MathWorks, Support Vector Machine. Train support vector machine classifier - MATLAB, available at http://www.mathworks.com/help/toolbox/bioinfo/ref/svmtrain.html, accessed: 22.11.2012.