
2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology

User Segmentation Based on Finding Communities with Similar Behavior on the Web Site

Kateřina Slaninová, Radim Dolák, Martin Miškus
Department of Informatics, SBA, Silesian University of Opava, Karviná, Czech Republic
Email: {slaninova, dolak, miskus}@opf.slu.cz

Jan Martinovič, Václav Snášel
Department of Computer Science, FEECS, VŠB - Technical University, Ostrava, Czech Republic
Email: {jan.martinovic, vaclav.snasel}@vsb.cz

978-0-7695-4191-4/10 $26.00 © 2010 IEEE. DOI 10.1109/WI-IAT.2010.288

Abstract—Web log analysis can help in gaining information about the usability of a web site and about web performance, and it supports marketing and the development of business intelligence tools in e-commerce systems. User segmentation is one of the problems addressed in marketing and e-commerce. Various software tools have been developed to support web analysis. However, most of them provide only statistics-based information. User behavior and interaction with the web site are usually presented only through click-through rates, or through the identification and sometimes visualization of popular paths. User segmentation for further analysis (e.g. campaign analysis in marketing, web recommendation, web usage optimization) usually relies on manual selection (often with variable setting). This paper presents automatic user segmentation (clustering) based on similar user behavior on the web site. The users' behavior and behavioral patterns are extracted using process mining techniques; further user segmentation is provided by finding communities with similar behavior through two-step hierarchical clustering.

Keywords-user segmentation; communities; log mining; behavioral patterns; web mining; sequential pattern mining

I. INTRODUCTION

Modern applications (information, enterprise, and e-commerce systems), as well as web applications or simply web servers, generate huge collections of data. This information, often stored in large log files, is used for further analysis for various purposes such as system security, compliance with audits or process regulations, system troubleshooting, social network analysis, web usability analysis, user clustering, etc. This paper is oriented to web log mining in relation to finding communities in a synthetic social network based on users' behavioral patterns while browsing the web site.

Methods of social network analysis (SNA), especially for large-scale social networks, facilitate better understanding of the network structure and provide useful information for addressing the main aspects of SNA: the sources and distribution of power. The power of an individual node is an attribute that depends on its relations with other nodes. The social structure may then be seen as the visualization of the appropriate level of power resulting from variations in the patterns of ties among nodes.

A. Finding Communities in Social Networks

Finding communities is an important aspect of discovering the complex structure of social networks. A community is defined as a group of nodes within the network such that connections among them are denser than connections to nodes in other communities [1]. It can also be defined as a group of vertices that probably share common properties and/or play similar roles within the network. Community structure can be defined using modules (classes, groups, or clusters) and is generally intended for mapping the network using hierarchies, which are often complicated. Communities can be used to study and solve some web problems, such as a new generation of search engines, content filtering, automatic classification, web page optimization, or user segmentation.

This paper presents user segmentation based on finding communities in a synthetic social network created from users' behavioral patterns while browsing the web site. The users' behavioral patterns are obtained using process mining techniques (clustering of user web sessions). Further user segmentation is provided by finding communities with similar behavior through two-step hierarchical agglomerative clustering. The paper is organized as follows: Section II covers aspects of web mining research and log file analysis in the context of web logs, and Section III describes the experiment of finding users' behavioral patterns while browsing the web site and their further segmentation.

II. WEB MINING AND LOG FILE ANALYSIS

Many researchers have proposed algorithms to define the community structure in complex networks using web information, for example [2], [3], [4], [1]. As the internet grew rapidly, these algorithms were redefined for large-scale networks [5], [6]. These methods have been used for the analysis of various types of social groups and weighted networks.

A. Web Log Analysis

Web log analysis includes the analysis of web server log files, which contain records of web server activity. The records provide detailed information about file requests to the web server and the server's response to those requests. The access log (the main log file) has a standardized format and typically includes the following information: IP address of the client accessing the web page, user's name, date and time of the request, resource requested, size in bytes of the data returned to the client, and the URL that referred the client to the resource. Table I shows an example of the event log file from the Apache web server, which was used in the case study.

Data collections in log files generally consist of various types of information. Besides typical information such as the event, the type of event, and the device or time at which the event was performed, we can find information about the person who initiated the event (activity). Using this information, social networks can be derived on the basis of similar attributes of persons and, consequently, we can construct models that explain some aspects of persons' behavior.

Aalst et al. [16] define an event log as follows: Let A be a set of activities (also referred to as tasks) and U a set of performers (resources, persons). E = A × U is the set of (possible) events (combinations of an activity and a performer). C = E* is the set of possible event sequences. L ∈ B(C) is an event log, where B(C) is the set of all bags (multisets) over C.

For web mining analysis, we can consider the typical web log with its records of requested activities as an event log in the sense of the previous definition. A user's paths (sessions) can be defined as sequences of events (user requests). A user's path typically consists of an entering page (usually, but not necessarily, the index or login page), a sequence of pages followed by the user, and a page from which the user left the web site. The user's behavior may include other activities such as data submission as a part of web interaction (e.g. web forms), downloading multimedia and other files from the web site, etc. This information may distort the constructed user paths. In addition, several limitations must be taken into account while analyzing web log files, e.g. user identification by host (absence of login, shared computers), records of incomplete transactions and actions, dynamic naming of web pages, etc.

Various software has been developed for the analysis of web logs (under the GNU GPL license or commercial). However, most of these tools provide only statistics-based information such as web traffic, user demographics, and others.
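The event-log definition above (E = A × U, C = E*, L a bag over C) can be illustrated with a minimal Python sketch. The activity names, host names, and the helper make_event_log are hypothetical and serve only as an illustration; they are not taken from the analyzed data set.

```python
from collections import Counter
from itertools import product

# Hypothetical activities (A) and performers (U); not taken from the paper's data.
ACTIVITIES = {"GET /", "GET /study", "POST /login"}
PERFORMERS = {"host-1", "host-2"}

# E = A x U: every possible (activity, performer) event.
EVENTS = set(product(ACTIVITIES, PERFORMERS))


def make_event_log(sequences):
    """Build an event log L as a bag (multiset) of event sequences from C = E*."""
    log = Counter()
    for sequence in sequences:
        for event in sequence:
            if event not in EVENTS:
                raise ValueError(f"unknown event: {event!r}")
        # Identical sequences are counted rather than merged away.
        log[tuple(sequence)] += 1
    return log


observed = [
    [("GET /", "host-1"), ("GET /study", "host-1")],
    [("GET /", "host-2"), ("POST /login", "host-2")],
    [("GET /", "host-1"), ("GET /study", "host-1")],  # same sequence seen twice
]
event_log = make_event_log(observed)
```

A Counter is used here because a bag must remember how often each sequence occurred, which a plain set would lose.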
User behavior and interaction with the web site are usually presented only through click-through rates and the identification, and sometimes visualization, of popular paths. Software tools that enable clickstream analysis (analysis of user path navigation) include, for example, Pathalizer, Visitors, Apache2GDL, and StatViz. The processed web server access log can be viewed in the form of a network graph, which illustrates the behavior patterns of web users. Available tools usually convert a list of hits (with the referrer field) into a collection of pairs; those pairs are then used to generate the resulting graph. The tools do not search for the common behavior of individual users, nor do they segment users into groups with similar behavior when browsing the web. The result is only a visualization of web browsing by all users, in which the network chart shows movement between different sections of the website for all users together. As we can see, user segmentation for further analysis usually relies on manual selection (often with

Table I
EXAMPLE OF LOG RECORD FROM APACHE WEB SERVER

Record format:
Time stamp | Server name:port | Source of request (Host) | Login Password | Request date and time | "Client method, resource and protocol" | Status code | Object size | Referrer HTTP request header | "User agent HTTP request header"

Example records:
Mar 29 10:24:15 joanes apache: www.opf.slu.cz:80 120.0.0.1 - [29/Mar/2010:10:24:15 +0200] "GET /js/global.js.php HTTP/1.1" 200 4536 http://www.opf.slu.cz/ "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; InfoPath.1; MS-RTC LM 8)"

Mar 29 10:24:15 joanes apache: www.opf.slu.cz:80 120.0.0.1 - [29/Mar/2010:10:24:14 +0200] "GET / HTTP/1.1" 200 7056 "Mozilla/5.0 (X11; U; Linux x86_64; cs-CZ; rv:1.9.1.8) Gecko/20100214 Ubuntu/9.10 (karmic) Firefox/3.5.8"
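As an illustration of how records in the format of Table I might be read, the following Python sketch parses one log line with a regular expression. The pattern, the field names, and the function parse_log_line are assumptions made for this example; the paper does not describe its parser. The referrer is treated as optional because the second example record has none.

```python
import re
from datetime import datetime

# Assumed pattern for the record layout of Table I (not the authors' parser).
LOG_PATTERN = re.compile(
    r'^(?P<syslog_time>\w{3} +\d+ \d{2}:\d{2}:\d{2}) '
    r'(?P<machine>\S+) apache: '
    r'(?P<server>\S+) '
    r'(?P<host>\S+) '
    r'(?P<login>[^\[]*?)\s*\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'(?:(?P<referrer>\S+) )?'
    r'"(?P<agent>[^"]*)"$'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = int(record["size"]) if record["size"].isdigit() else 0
    return record


# The two example records from Table I.
LINE_WITH_REFERRER = (
    'Mar 29 10:24:15 joanes apache: www.opf.slu.cz:80 120.0.0.1 - '
    '[29/Mar/2010:10:24:15 +0200] "GET /js/global.js.php HTTP/1.1" 200 4536 '
    'http://www.opf.slu.cz/ "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; '
    'SV1; .NET CLR 2.0.50727; InfoPath.1; MS-RTC LM 8)"'
)
LINE_WITHOUT_REFERRER = (
    'Mar 29 10:24:15 joanes apache: www.opf.slu.cz:80 120.0.0.1 - '
    '[29/Mar/2010:10:24:14 +0200] "GET / HTTP/1.1" 200 7056 '
    '"Mozilla/5.0 (X11; U; Linux x86_64; cs-CZ; rv:1.9.1.8) Gecko/20100214 '
    'Ubuntu/9.10 (karmic) Firefox/3.5.8"'
)

first = parse_log_line(LINE_WITH_REFERRER)
second = parse_log_line(LINE_WITHOUT_REFERRER)
```

Lines that do not match the assumed layout yield None, so a preprocessing step can count and skip them instead of failing.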

The analysis of web access logs usually consists of data gathering and preprocessing, pattern discovery, and pattern analysis [7].

B. User Segmentation Based on Pattern Mining

User segmentation based on user behavior can be defined as the clustering of users' navigation sessions on the basis of similar behavioral characteristics. A user navigation session is the group of activities performed by a user from the moment of entering the web site to the moment of leaving it. Several methods have been presented for clustering web sessions (similarity based, probabilistic, or model based). Similarity-based session clustering, in which sessions are defined as unordered sets of events (clicks), was proposed in several previous works, for example [8], [9]. A sequence alignment method based on web page similarity, using dynamic programming to define the scoring function, was presented in [10], [7]. The automatic identification of web navigation behavior patterns on the basis of the ADP method was described in [11]. A technique for generating significant usage patterns (SUP) was proposed in [12]. The authors of [13] mine unexpected browsing behavior to improve web design. Web user-session inference and the analysis of user-session characteristics extracted by clustering are studied in [14]. Multiobjective evolutionary clustering of web user sessions for web page recommendation is presented in [15]. The related work is oriented to the clustering of user web behavior or of user navigation paths, but user segmentation for finding communities based on similar behavior has not yet been presented.

C. Process Mining

Exploring user behavior on the web site through click paths from log collections generally falls under process mining, which refers to methods for distilling a structured process description from a set of real executions.


In Figure 1 we can see the histogram of sequences ordered by number of occurrences before and after the reduction.

variable setting). Automatic user segmentation and its visualization using community structure in terms of social networks, based on similar user behavior on the web site, is presented in the next section.

III. CASE STUDY

The case study presents user segmentation (clustering) based on similar user behavior on the web site. The users' behavior and behavioral patterns are extracted using process mining techniques. User segmentation is provided by finding communities with similar behavior through two-step hierarchical agglomerative clustering. A similar approach was used in our previous work oriented to finding similar students' behavior patterns in an e-learning system [17], [18].

For the web mining analysis, we processed a typical web log from an Apache server with its records of requested activities. Standard data preprocessing methods were applied to the obtained data collection: records from search engines and spiders were removed, and only web site browsing was kept (without downloads of pictures and icons, stylesheets, scripts, etc.). We obtained a set of users U, defined by host identification using the IP address (the analyzed website does not identify users by login). Users' paths (sessions) were considered as a set S of sequences of activities (user requests). The set of activities A is defined using the combination of the standard web log information FromMethod, URL address, and StatusCode. An event then represents a user request to the web server. A sequence s ∈ S is then defined by a user session on the web site (the set of web pages requested during one session). A pause of at least 30 minutes is used as the end of a session. The data collection obtained after preprocessing is specified in Table II.
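The session rule described above (requests grouped by host, with a pause of at least 30 minutes ending a session) can be sketched in Python as follows. The function name split_sessions, the tuple layout of the input, and the sample requests are assumptions made for illustration; the paper does not give its implementation.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # pause that ends a session, as stated in the text


def split_sessions(requests, gap=SESSION_GAP):
    """Group (host, timestamp, activity) tuples into per-host sessions.

    A new session starts whenever the pause since the host's previous
    request is at least `gap`.
    """
    by_host = defaultdict(list)
    for host, timestamp, activity in requests:
        by_host[host].append((timestamp, activity))

    sessions = {}
    for host, items in by_host.items():
        items.sort(key=lambda item: item[0])  # chronological order per host
        host_sessions, current, previous = [], [], None
        for timestamp, activity in items:
            if previous is not None and timestamp - previous >= gap:
                host_sessions.append(current)
                current = []
            current.append(activity)
            previous = timestamp
        if current:
            host_sessions.append(current)
        sessions[host] = host_sessions
    return sessions


# Hypothetical requests, not taken from the analyzed log.
sample_requests = [
    ("host-1", datetime(2010, 3, 29, 10, 0), "GET / 200"),
    ("host-2", datetime(2010, 3, 29, 10, 5), "GET / 200"),
    ("host-1", datetime(2010, 3, 29, 10, 10), "GET /study 200"),
    ("host-1", datetime(2010, 3, 29, 10, 45), "GET /contact 200"),
]
sample_sessions = split_sessions(sample_requests)
```

In this sample, host-1 pauses for 35 minutes before its third request, so that request opens a second session.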

Table II
BASIC CHARACTERISTICS OF WEB LOG

ActivityTypes Count | Hosts Count | LogItems Count
5 110 | 6 831 | 258 627

Figure 1. Histogram of the sequences before and after reduction

As we can see, reducing the number of sequences allowed us to obtain the set S' of sequences that are more significant on the basis of their occurrence. From the obtained data set, the matrix U × S' was then constructed and, consequently, the user similarity matrix U × U. The similarity matrix was then visualized on the basis of hierarchical agglomerative clustering using the graph analysis tool GraphViz. Figure 2 shows that the clustering methods used for reducing the number of sequence types, together with the clustering of users based on the similarity between sequences, create clusters of users with similar activities, that is, similar behavior on the web site. We obtained two dominant clusters (subgraphs). The main features of these clusters are the most popular sequences on the web site. Then, we can see the components connected to other subgraphs, which can be interpreted as relations between groups of users (communities). These relations can create a synthetic social network on the basis of similar user behavior.

IV. CONCLUSION

This paper presented user segmentation and its visualization using community structure analysis in terms of social networks. Finding communities and their visualization based on similar user behavior on the web site was carried out using process mining techniques; communities with similar behavioral patterns were obtained through hierarchical agglomerative clustering. The results, described through graph visualization, show that the clustering methods used for reducing the number of sequence types and for clustering users can provide user segmentation on the basis of similar behavior on the web site. In the future, the authors intend to provide a more detailed analysis of the obtained communities in relation to users' behavior on the web site, as well as a comparison with other clustering methods. Although the method was carried out to identify visitor groups of an educational web site, this approach is generic

Each user u is assigned the set of its sequences s. For further analysis, only users with two or more sequences were considered, which reduced the number of users to 2947 and the number of sequences to 5243. Because the number of sequences was too high (and many were very similar), we reduced the sequences using cluster analysis. The input was the vector s = (a1, a2, ..., an), where n is the number of activities and ai ∈ A. To compute the similarity of the sequences, we used the cosine measure [19]. We then obtained clusters using hierarchical agglomerative clustering. Based on this clustering, we reduced the set of sequences to 924. Figure 1 shows the histogram of the sequences, ordered by number of occurrences, before and after the reduction.
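The sequence reduction described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: each sequence is represented by its activity counts, clusters are merged by average linkage, and merging stops at a similarity threshold of 0.8. The paper specifies only the cosine measure and hierarchical agglomerative clustering, so the vector representation, the linkage, the threshold, and the names cosine_similarity and cluster_sequences are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a, b):
    """Cosine measure between two sparse count vectors (Counter objects)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_sequences(sequences, threshold=0.8):
    """Agglomerative clustering of sequences; returns lists of sequence indices.

    Assumptions: activity-count vectors, average linkage, and a fixed
    similarity threshold as the stopping rule.
    """
    vectors = [Counter(sequence) for sequence in sequences]
    clusters = [[index] for index in range(len(vectors))]

    def average_link(first, second):
        total = sum(cosine_similarity(vectors[i], vectors[j])
                    for i in first for j in second)
        return total / (len(first) * len(second))

    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = average_link(clusters[x], clusters[y])
                if score > best_score:
                    best_score, best_pair = score, (x, y)
        if best_score < threshold:
            break  # remaining clusters are too dissimilar to merge
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical sequences of activities, not taken from the analyzed log.
sample_sequences = [
    ["index", "study"],
    ["study", "index"],
    ["contact", "map"],
    ["index", "study", "study"],
]
sample_clusters = cluster_sequences(sample_sequences)
```

Each resulting cluster could then be replaced by one representative sequence, which is one way to obtain a reduced set such as S'. The quadratic search over cluster pairs keeps the sketch short but would be slow for thousands of sequences.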


Figure 2. Graph of user segments with similar behavioral patterns

enough to be applied to any other domain, such as e-commerce or business applications. The method can be used in many spheres, for example in information retrieval for the analysis of users (or potential customers) in e-commerce systems (recommender systems in e-shops, web site personalization) or for further processing through marketing methods.

ACKNOWLEDGMENT

This work was supported by VSB-TU (grant no. SP/2010196 Machine Intelligence) and Silesian University (grant no. SGS/24/2010 The Usage of BI and BPM Systems to Efficiency Management Support).

REFERENCES

[1] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, “Defining and identifying communities in networks,” Feb 2004. [Online]. Available: http://arxiv.org/abs/cond-mat/0309488

[2] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, “Trawling the web for emerging cyber-communities,” in WWW ’99: Proceedings of the Eighth International Conference on World Wide Web, 1999, pp. 1481–1493.

[3] M. Toyoda and M. Kitsuregawa, “Creating a web community chart for navigating related communities,” in HYPERTEXT ’01: Proceedings of the Twelfth ACM Conference on Hypertext and Hypermedia, 2001, pp. 103–112.

[4] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E, vol. 69, no. 2, p. 026113, Feb 2004.

[5] A. Clauset, M. E. J. Newman, and C. Moore, “Finding community structure in very large networks,” Phys. Rev. E, vol. 70, no. 6, p. 066111, Dec 2004.

[6] K. Wakita and T. Tsurumi, “Finding community structure in mega-scale social networks,” in Proceedings of the 18th International Conference on World Wide Web WWW 09. ACM Press, 2007, p. 1275.

[7] W. Wang and O. R. Zaïane, “Clustering web sessions by sequence alignment,” in DEXA ’02: Proceedings of the 13th International Workshop on Database and Expert Systems Applications. Washington, DC, USA: IEEE Computer Society, 2002, pp. 394–398.

[8] C. Shahabi, A. M. Zarkesh, J. Adibi, and V. Shah, “Knowledge discovery from users web-page navigation,” in RIDE ’97: Proceedings of the 7th International Workshop on Research Issues in Data Engineering (RIDE ’97) High Performance Database Management for Large-Scale Applications. Washington, DC, USA: IEEE Computer Society, 1997, p. 20.

[9] B. Mobasher, R. Cooley, and J. Srivastava, “Automatic personalization based on web usage mining,” Commun. ACM, vol. 43, no. 8, pp. 142–151, 2000.

[10] B. Hay, G. Wets, and K. Vanhoof, “Mining navigation patterns using a sequence alignment method,” Knowl. Inf. Syst., vol. 6, no. 2, pp. 150–163, 2004.

[11] I.-H. Ting, L. Clark, and C. Kimble, “Identifying web navigation behaviour and patterns automatically from clickstream data,” International Journal of Web Engineering and Technology, vol. 5, no. 4, pp. 398–426, 2009.

[12] L. Lu, M. Dunham, and Y. Meng, “Mining significant usage patterns from clickstream data,” in Advances in Web Mining and Web Usage Analysis. Springer Berlin / Heidelberg, 2006.

[13] I.-H. Ting, C. Kimble, and D. Kudenko, “Ubb mining: Finding unexpected browsing behaviour in clickstream data to improve a web site’s design,” Web Intelligence, IEEE / WIC / ACM International Conference on, vol. 0, pp. 179–185, 2005.

[14] A. Bianco, G. Mardente, M. Mellia, M. Munafò, and L. Muscariello, “Web user-session inference by means of clustering techniques,” IEEE/ACM Trans. Netw., vol. 17, no. 2, pp. 405–416, 2009.

[15] G. N. Demir, A. Şima Uyar, and Şule Gündüz-Öğüdücü, “Multiobjective evolutionary clustering of web user sessions: a case study in web page recommendation,” Soft Computing - A Fusion of Foundations, Methodologies and Applications, vol. 14, no. 6, pp. 579–597, 2010.

[16] W. M. P. Van Der Aalst, H. A. Reijers, and M. Song, “Discovering social networks from event logs,” Comput. Supported Coop. Work, vol. 14, no. 6, pp. 549–593, 2005.

[17] P. Dráždilová, G. Obadi, K. Slaninová, S. Al-Dubaee, J. Martinovič, and V. Snášel, “Computational intelligence methods for data analysis and mining of elearning activities,” in Computational Intelligence For Technology Enhanced Learning. Springer Berlin / Heidelberg, 2010.

[18] G. Obadi, P. Dráždilová, J. Martinovič, K. Slaninová, and V. Snášel, “Using spectral clustering for finding students patterns of behavior in social networks,” in International Conference DATESO 2010, 2010.

[19] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing and Management, vol. 24, no. 5, pp. 513–523, 1988. [Online]. Available: http://dx.doi.org/10.1016/0306-4573(88)90021-0

