
K-Means Based Clustering on Mobile Usage for Social Network Analysis Purpose

Xu Yang, Yapeng Wang
MPI-QMUL Information Systems Research Centre, Macao Polytechnic Institute, Macao SAR, China

Dan Wu
International School, Beijing University of Posts and Telecommunications, Beijing, China

Athen Ma
School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK

Abstract- The development of mobile network technology offers great potential for social networking services. This paper studies data mining for social network analysis, aiming to find people's social network patterns by analyzing information about their mobile phone usage. The research uses the real dataset of MIT's Reality Mining project. The classification model presented here provides a new approach to finding the proximity between users, based on how frequently they register with specific cellular towers associated with their working places. The k-means algorithm is applied for clustering, and the best result, a mean silhouette value of 0.823, is obtained with k = 6 groups. The clustering result successfully reflects the higher proximity (at work) among intra-class subjects.

Keywords-Clustering; K-Means; Social Network Analysis; Data Mining

I. INTRODUCTION

In recent years, communication technology has developed at an unprecedented speed and has made interaction between people ever easier. Thanks to the convenience of virtual networks (also called social networks), one can freely get in touch with another person thousands of miles away, and the world seems to be shrinking into a small village. With the development of mobile networks, people are no longer satisfied with sitting in front of a laptop to surf the Internet; they have become used to the convenience of mobile access everywhere, not just at home or in the office.

In this study, we develop a new mechanism to find the correlation between people based on their mobile usage. The approach resembles what search engines and B2C sites do: they use data mining to analyze their users' records and discover association rules in users' searching behavior. In this way, B2C websites can discover customers' purchasing patterns and recommend other products to increase sales. Our research instead focuses on data mining of mobile phone usage, which is more interactive and social, since the mobile phone platform is independent of the use of a computer.

Social network analysis (SNA) is the mapping and measuring of relationships and flows between people, groups, organizations, computers, URLs, and other connected information/knowledge entities [1]. As a highly interdisciplinary research method, SNA attracts great attention from sociologists, economists, anthropologists and others. In the past, research on social networks relied mainly on self-reports (e.g. collected by survey), so the accuracy, depth and breadth of the analysis were limited by the size of the subject pool and the time period involved. Today, the wide use of the Internet and mobile networks has greatly changed the way people communicate with each other [2]. For example, blogs, BBS, and newer instant messaging services let people build huge, complicated virtual social networks in which they can play games together, share their feelings or gossip without having known each other before. This also offers great potential for the development of SNA. It is reported that in 2008 there was only one type of mobile phone with social networking functionality, whereas in 2010 there will be 20 models of mobile phone with social networking services as a core function. This sharp increase indicates the bright prospects of social networking services based on mobile phone usage.

II. RELATED WORK AND OUR APPROACH

In recent years, various projects have analyzed mobile users' locations and social activities. Examples are the Context project of the Helsinki Institute for Information Technology (HIIT), the Place Lab research at Intel Research [3], and the Reality Mining project conducted by the Massachusetts Institute of Technology (MIT) [4].

A. HIIT's Context Project

The Context project studies the characterization and analysis of information about user context and its use in proactive adaptivity, aiming to discover users' interpretation of context descriptions and context structures [5]. The project addressed the correlation and interaction between users and context. Mobile communication served as a typical example of a ubiquitous application whose usefulness relies strongly on how environment-sensitively communication decisions are handled. The main achievements are the application of qualitative descriptions of user circumstances in context-sensitive applications, algorithms for context analysis, and models of how users interact with the environment.

B. Place Lab Research

Place Lab develops software for laptops, PDAs and cell phones that helps clients estimate their locations from fixed beacons such as nearby 802.11 access points, GSM cell towers and fixed Bluetooth devices [3]. It provides a way to overcome one serious concern of ubiquitous computing, privacy, by allowing users to identify their own locations without a continuous connection to a central server.

C. Reality Mining

Taking advantage of the increasing number of cell phone users, Reality Mining analyses implicit information about users' mobile phone usage, such as cell tower information, call duration and Bluetooth devices, to find out the dynamics of people's behaviour. Information about the location, communication, proximity and activity of 106 users was collected over 9 months during 2004-2005. Among them, 94 subjects participated in a survey that included dyadic questions about proximity and friendship with other subjects, as well as questions about individual habits in social activities. Reality Mining addressed issues such as the evolution of social networks, the predictability of people's lives, and information flow. It used not only communication records such as email or call logs to study people's activities and relationships, but also non-observable data such as cellular tower information. It showed that people's dynamic behaviour can be evaluated by keeping track of information about a user's location and proximity to others, especially over a relatively long period of time. The data collected by the mobile phone software, together with the self-report data gathered from the survey, were used to study individual social networks, self-satisfaction, and the inner relationships of organizations or communities.

D. Our Approach

In this project we use the Reality Mining dataset, which was obtained through application software installed on Nokia Symbian Series 60 phones [2]. The dataset was collected by the MIT Media Lab; in it, location is represented by the cell tower to which the user is currently connected. In total, 25 cellular towers are associated with the users' working place. Our research aims to find the correlation among people who frequently work in the same place or nearby.

III. ALGORITHM DESIGN AND IMPLEMENTATION

Data mining refers to extracting or "mining" knowledge from large amounts of data [6]. There are two kinds of models in data mining: predictive and descriptive. A predictive model uses results obtained from previous data to predict future data values. Classification, regression, time series analysis and prediction are tasks achieved by predictive models. Clustering, generalization, sequence discovery and association rules are tasks belonging to descriptive models, which are used to find and describe the patterns hidden in the data [7].

Clustering, also called data segmentation, is the process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters [6]. Unlike classification, in which the groups are predefined, clustering is an unsupervised learning method that does not require the groups to be defined and labelled before the clustering algorithm is applied. Clustering is sometimes more effective than classification, since researchers only need to choose features for the subjects being grouped instead of studying the characteristics of each group.

Measuring the similarity (or dissimilarity) between objects is the core of clustering. For numeric data, similarity can be measured by distance. If each object has p attributes, we can view each object as a point in a p-dimensional space, so that n objects are n points in that space. Objects that are similar to each other, and therefore belong to the same cluster, should be close to each other in the space. Hence distance in the space can be used to quantify the similarity between objects, with a short distance meaning a high degree of similarity.

Let $D = \{x_1, x_2, \ldots, x_n\}$ be a set of objects in p-dimensional space, with $x_i, x_j \in D$, and let $d_{ij}$ be the distance between $x_i$ and $x_j$. In general, the definition of distance should satisfy the following conditions [8]:

$$
\begin{aligned}
&1.\ d_{ij} = 0 \iff x_i = x_j \\
&2.\ \forall x_i, x_j:\ d_{ij} \ge 0 \\
&3.\ \forall x_i, x_j:\ d_{ij} = d_{ji} \\
&4.\ \forall x_i, x_j, x_k:\ d_{ij} \le d_{ik} + d_{kj}
\end{aligned}
\qquad (1)
$$

In clustering, distance is required to satisfy the first three conditions. One of the most frequently used distance measures is the Euclidean distance, defined as

$$
d_{ij} = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \qquad (2)
$$

where $x_{i1}, x_{i2}, \ldots, x_{ip}$ and $x_{j1}, x_{j2}, \ldots, x_{jp}$ are the p attributes of objects i and j respectively. In practice, the square of the Euclidean distance is often used.

K-means clustering (k-means) is, simply speaking, an algorithm that groups objects into k groups based on their attributes or features. The grouping is done by minimizing the distances between the data and the corresponding cluster centroids. When applying k-means, the value of k must be decided before the program starts, and different values of k lead to different levels of grouping accuracy. The basic steps of k-means are shown in Figure 1.
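As a small illustration of the distance that the k-means steps rely on, the following Python sketch computes the Euclidean distance of (2) and its square for two attribute vectors. The function names and the example vectors are hypothetical and chosen only for illustration; they are not taken from the original implementation.

```python
import math


def squared_euclidean(x, y):
    """Squared Euclidean distance between two equal-length attribute vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of attributes")
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    """Euclidean distance as in equation (2)."""
    return math.sqrt(squared_euclidean(x, y))


# Two hypothetical objects described by p = 3 attributes.
u = [3, 0, 4]
v = [0, 0, 0]
print(squared_euclidean(u, v))  # 25
print(euclidean(u, v))          # 5.0
```

The squared form keeps conditions 1 to 3 of (1) but not condition 4: for the one-dimensional points 0, 1 and 2, the squared distance from 0 to 2 is 4, which exceeds the sum 1 + 1 of the two intermediate squared distances. This is consistent with requiring only the first three conditions for clustering.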


Figure 1. Flowchart of K-means

Step 1. Decide the value of k, the number of groups.
Step 2. Randomly take k of the N samples as the initial centroids of the clusters.
Step 3. Calculate the distance from each of the remaining (N-k) samples to every centroid, and assign each sample to the cluster with the nearest centroid. After each assignment, recalculate the centroid of the gaining cluster.
Step 4. Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.

The main advantages of k-means are fast convergence and high scalability. K-means also has drawbacks. One is that the clustering result depends on the k centroids selected in the initialization stage, which are always chosen at random; hence repeated experiments with the same parameters sometimes do not lead to the same result. Another is the local minima problem: the algorithm may reach a local minimum, where reassigning any single point would increase the total sum of object-to-centroid distances, without this being the global optimum.

A. Implementation Steps for Processing the Data

In this project, the Reality Mining database is used. The first step is to extract the data related to the actors' locations in order to construct our own database. The data include time-stamped tower transitions, users' affiliations, research groups (for research staff or students), users' answers to survey questions (about their daily life), etc. We obtained 24 cellular towers that are associated with the users' work place (according to MIT's research there are 27 in the original database, but three pairs of them have duplicated tower IDs).

The second step is to calculate how frequently each user accesses each cellular tower. In this process we find that 67 subjects have accessed these 24 cellular towers at some time, which is generally consistent with the real situation in the subject pool: among the 94 subjects who completed the survey, 26 were incoming students.
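The frequency calculation in the second step can be sketched in Python as follows. This is only an illustration under stated assumptions: the record layout (subject, tower ID, stay duration in minutes), the function name `tower_frequencies`, and the optional `min_duration` filter on stay length are hypothetical and are not taken from the Reality Mining scripts.

```python
from collections import defaultdict


def tower_frequencies(records, work_towers, min_duration=0.0):
    """Count, per subject, how often each work-related tower was registered.

    records      -- iterable of (subject, tower_id, duration_minutes) tuples
                    (hypothetical layout)
    work_towers  -- list of tower IDs treated as work-related
    min_duration -- only stays strictly longer than this many minutes count
    """
    tower_index = {tower: j for j, tower in enumerate(work_towers)}
    counts = defaultdict(lambda: [0] * len(work_towers))
    for subject, tower, duration in records:
        j = tower_index.get(tower)
        if j is not None and duration > min_duration:
            counts[subject][j] += 1
    return dict(counts)


# Hypothetical records: tower "Z" is not a work-related tower.
records = [
    ("s1", "A", 150), ("s1", "A", 5), ("s1", "B", 200),
    ("s2", "B", 130), ("s2", "Z", 300),
]
print(tower_frequencies(records, ["A", "B"], min_duration=120))
# {'s1': [1, 1], 's2': [0, 1]}
```

Each row of the result is one subject's feature vector, with one count per work-related tower.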
To eliminate the interference caused by pass-by cases, in which a person merely passes through the area associated with a cellular tower, a threshold on the stay duration is used. For example, in the MIT Media Lab research, the proper threshold for distinguishing end locations from non-end locations when discovering users' mobility patterns is 10 minutes. If the duration of a record is greater than the threshold, we take the record into account when calculating how frequently each subject logs on to the work-related cellular towers. We apply the script calculate_work_frequency to obtain the 104×24 matrix recording how many times each of the 104 subjects logs on to each of the 24 cellular towers related to MIT. Through experiments, we found that for thresholds longer than 120 minutes the frequency matrix does not change significantly.

To apply k-means, we use the statistics tools embedded in Matlab 7.0. We call the function kmeans(Cluster,k,'dist','sqEuclidean','replicates',times), where Cluster is the input data, in which each row is the feature vector of one subject. The value 'sqEuclidean' means that the squared Euclidean distance is used as the distance measure. The Euclidean distance is widely used in low-dimensional spaces, whose dimension is no more than 16 [9].

K-means may reach a local minimum, in which the reassignment of any single point increases the sum of point-to-centroid distances although a better solution exists. In this work the optional parameter 'replicates' is set to overcome local minima. Its value is set to 5, which means that for each value of k the algorithm is run five times, each time starting from a different randomly selected set of initial centroids. The clustering result with the lowest total sum of distances is returned.

The computational complexity of the k-means algorithm is O(knI), where n is the total number of objects, k is the number of groups, and I is the number of iterations before the sum of distances is minimized. In order to find the proper value of k, we take values of k from 2 to 6.
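The idea of running k-means from several random starts and keeping the run with the lowest total sum of squared distances can be sketched in plain Python as below. This is a minimal illustration, not the MATLAB kmeans implementation: it assumes batch centroid updates after each full assignment pass (rather than an update after every single assignment), and all function names, the `seed` argument and the toy data are hypothetical.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans_once(data, k, rng, max_iter=100):
    """One k-means run from k randomly chosen samples (batch updates)."""
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in data
        ]
        if new_labels == labels:  # no reassignment in this pass: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = centroid(members)
    total = sum(sq_dist(p, centroids[lab]) for p, lab in zip(data, labels))
    return labels, centroids, total


def kmeans(data, k, replicates=5, seed=0):
    """Run kmeans_once several times and keep the lowest total distance."""
    rng = random.Random(seed)
    best = None
    for _ in range(replicates):
        result = kmeans_once(data, k, rng)
        if best is None or result[2] < best[2]:
            best = result
    return best


# Toy data: two well-separated groups of three points each.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids, total = kmeans(data, k=2)
print(round(total, 3))  # 2.667
```

Each run costs on the order of k·n distance evaluations per iteration, which matches the O(knI) complexity stated above; the replicates multiply that cost by a constant factor.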
For each k, we apply k-means and use the silhouette value (discussed below) to judge the accuracy of the clustering. We use a matrix, means, to record the mean silhouette value over all objects for each value of k. For threshold = 120 minutes, k = 6 gives the clustering result with the maximum mean silhouette value, 0.823.

Accuracy Measures

The function silhouette(X, Clust) returns the cluster silhouettes for the input matrix X, given the clustering result Clust, which contains a cluster index for each user (i.e. the group into which the user was put). In the input matrix X, rows correspond to users and columns correspond to the attributes considered. The silhouette value S(i) of each point measures how similar that point is to the points in its own cluster compared with points in other clusters, and ranges from -1 to +1. It is defined as

$$
S(i) = \frac{\min_k b(i,k) - a(i)}{\max\left(a(i),\ \min_k b(i,k)\right)} \qquad (3)
$$

where a(i) is the average distance from the ith point to the other points in its cluster, and b(i,k) is the average distance from the ith point to the points in another cluster k. By default, the squared Euclidean distance between points in X is used in calculating silhouette values. A silhouette value close to +1 means that the point is very far from the other clusters; a value around 0 indicates that the point is not distinctly in one cluster or another; and a value of -1 indicates that the point may not be assigned to the right cluster. We can judge the performance of the clustering by its average silhouette width. Kaufman and Rousseeuw considered a silhouette width greater than 0.5 to indicate a 'good enough' clustering result, while a width below 0.2 indicates a lack of substantial clustering structure [10].

IV. RESULTS AND ANALYSIS

A general view of the accuracy of the clustering result can be seen in Figure 2. Taking 120 minutes as the threshold for the time duration, there are 58 subjects whose cell phones hold records for the cellular towers considered. Through the optimization process, we find that the clustering produces the best result at k = 6; the silhouette plot is shown in Figure 2.
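The silhouette values behind such a plot follow definition (3). As a rough illustration, the Python sketch below computes them with the squared Euclidean distance. It is not the MATLAB silhouette function: the function names and toy data are hypothetical, and the value 0 returned for a point that is alone in its cluster, or whose a(i) and minimum b(i,k) are both zero, is a convention assumed here.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def silhouette_values(data, labels):
    """Silhouette value of every point, following equation (3)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    values = []
    for i, (p, own) in enumerate(zip(data, labels)):
        # Average distance from point i to the other points of each cluster.
        avg = {}
        for c in clusters:
            dists = [
                sq_dist(p, q)
                for j, (q, lab) in enumerate(zip(data, labels))
                if lab == c and j != i
            ]
            if dists:
                avg[c] = sum(dists) / len(dists)
        if own not in avg:  # point is alone in its cluster (assumed convention)
            values.append(0.0)
            continue
        a = avg[own]
        b = min(v for c, v in avg.items() if c != own)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy data: two tight pairs of points that are far apart.
data = [[0, 0], [0, 1], [10, 0], [10, 1]]
labels = [0, 0, 1, 1]
s = silhouette_values(data, labels)
print(round(sum(s) / len(s), 3))  # 0.99
```

The mean of these values over all points is the quantity compared across the candidate values of k.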

Figure 2. Silhouette Width (threshold = 2 h, k = 6)

From Figure 2, we can see that there is no negative silhouette value and that most values are greater than 0.5, which means that the clustering result is generally accurate as judged by the silhouette.

Figure 3. Clustering Result

Subjects in the same cluster are represented by asterisks of the same color. In Figure 3, for each point, the x-coordinate is the index of the subject in the database, and the y-coordinate is the cluster index in the result plus the corresponding silhouette value. Hence, within the same cluster level, the higher a point is located, the more probably it belongs to this cluster. For example, the red asterisks stand for the subjects in cluster 1. From the figure, we also notice that 33 subjects are assigned to cluster 6, while there are very few subjects in clusters 1, 3 and 4. In clusters 1 and 3 in particular, the silhouette values are relatively low, meaning that the subjects in these clusters are not very distant from the subjects in other clusters or do not share high similarity with the subjects within their own cluster.

A. Result Analyses

By analyzing the cellular tower records in the users' cell phones, we have obtained the users' patterns regarding working places. We now compare the result with the self-report data in "network", a 94×94 matrix that stores each subject's answers to the dyadic question:

Q: Estimate Your Average Proximity (within 10 feet) with Each Person at work / outside lab.

The multiple choices are [2]:
5 - at least 4-8 hours per day
4 - at least 2-4 hours per day
3 - at least 30 minutes - 2 hours per day
2 - at least 10 - 30 minutes per day
1 - at least 5 minutes
0 - 0-5 minutes (default)

Since the threshold for computing the frequencies is 2 hours, we also use 2 hours as the threshold for analyzing the proximity between users. For the users in cluster 6 (see the yellow asterisks in Figure 4), if the average proximity of two users is more than 2 hours (network.lab(i,j)>3), we draw a blue line connecting these two users in the clustering result, as shown in Figure 4.

Figure 4. Proximity at Lab (threshold = 2 h)

From Figure 4, we can clearly see that the density of the lines is higher within the cluster. The intra-class subjects do share a higher proximity at work than inter-class ones. This is consistent with our expectation: people who have similar logon records at working-place cellular towers are more likely to work together or to communicate more during work. Next, we look at some inter-class lines. For instance, network.outlab(97,44) > 3: s(97), who is in the third group, has a close connection with s(44), who is in the sixth group. On examining their affiliations, we find that both belong to Sloan Business School but work in different research groups, "Wakeborders" and "Surfers" respectively. In another pair, s(5) in the sixth group and s(12) in the first, both are Media Lab graduate students who belong to different research groups. Hence, people in different clusters may also have close contact because they have the same affiliation.
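The comparison of intra-class and inter-class proximity links discussed above can be made quantitative with a sketch such as the following. It is an illustration only: the function name `link_density`, the toy proximity matrix and the rule that a pair is linked when either subject's reported code exceeds the threshold are assumptions, not part of the original analysis.

```python
from itertools import combinations


def link_density(proximity, labels, threshold=3):
    """Fraction of linked pairs among intra-cluster and inter-cluster pairs.

    proximity -- square matrix of survey codes (0 to 5), proximity[i][j] being
                 subject i's report about subject j
    labels    -- cluster index of each subject
    A pair counts as linked if either report exceeds the threshold (assumed).
    """
    links = {"intra": 0, "inter": 0}
    pairs = {"intra": 0, "inter": 0}
    for i, j in combinations(range(len(labels)), 2):
        kind = "intra" if labels[i] == labels[j] else "inter"
        pairs[kind] += 1
        if proximity[i][j] > threshold or proximity[j][i] > threshold:
            links[kind] += 1
    return {
        kind: (links[kind] / pairs[kind] if pairs[kind] else 0.0)
        for kind in pairs
    }


# Hypothetical answers of four subjects placed in two clusters.
proximity = [
    [0, 5, 0, 0],
    [4, 0, 0, 4],
    [0, 0, 0, 2],
    [0, 0, 5, 0],
]
labels = [1, 1, 2, 2]
print(link_density(proximity, labels))  # {'intra': 1.0, 'inter': 0.25}
```

A higher intra value than inter value corresponds to the denser lines within a cluster that are visible in the proximity figure.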

Figure 5. Proximity Outside Lab (threshold = 2 h)

In Figure 5, we use the same threshold (network.lab(i,j) >3) to draw the proximity lines between subjects during the time out of work. We find that few subjects are together for more than 2 hours per day outside the lab unless they are friends with each other. Comparing with Figure 6, which shows the friend declarations of the subjects in cluster 6, we can see the similarity between the proximity outside the lab and the friend network.

Figure 6. Friend Network for Subjects in Cluster 6

For all the 58 subjects, the friend network is shown in Figure 8. Changing the threshold for judging outside-lab proximity, we find that when the threshold equals 30 minutes (Figure 7), the outside-lab proximity figure and the friend network figure show a similar distribution of lines. We can therefore conclude that proximity at work does not imply a close friendship, and that proximity outside the lab better reflects the friendship between subjects.

Figure 7. Proximity outside Lab (threshold = 0.5h)

Figure 8. Friend Network for All Subjects

The model used in this project focuses on classifying users into different groups based on the frequencies with which they register with the cellular towers in their working places. The clustering result may be affected by several kinds of interference; some could be eliminated by adding new features, while others are inherent in the mobile network structure and can hardly be overcome.

B. Time Duration Absence

The classification model used in this project does not include time duration as an independent feature; instead, duration is used as a threshold when counting how many times the cellular towers are registered. The registration frequency reflects how often a user came to a particular area. However, how long the user has been working in a specific office or lab cannot be discovered from the frequency calculation alone. In fact, the length of time one has worked in a particular place is closely tied to one's social circle: the longer one stays in one place, the more likely one is to become close to one's colleagues. Conversely, a newcomer may not yet have built a social circle around the current work place. Therefore, to improve the accuracy of the user classification model with respect to time, the duration that a user spends at a given work place should be considered. One approach is to calculate the total time, over the 9 months, that a user spends in the area of a given cellular tower. If two users spend similar amounts of time within the range of the same cellular tower during working hours, they are highly likely to be in proximity at work.

C. Interference at Cell Borders

A mobile phone may receive signals from different cellular towers at the same time when it is located at, or oscillating around, the border of a cell. When this happens, the phone logs on to the tower with the strongest signal. Most cellular towers have a range of several square kilometers [2]. In this work we do not know the exact locations of the cellular towers; we know only their IDs, and the ID number is the only way to distinguish one tower from another. When two users are physically close to each other but fall within the ranges of different cellular towers, it is hard for us to detect their proximity in location.

V. CONCLUSION AND FURTHER WORK

The classification model presented in this paper provides a new approach to finding the similarities between users, based on how frequently they register with several specific cellular towers associated with their working places. The time threshold used to eliminate pass-by interference is 2 hours. K-means is applied for clustering, and we take the clustering result with the highest accuracy, 0.823 as judged by the silhouette value. The clustering result successfully reflects the higher proximity (at work) among intra-class subjects. For the clusters of small size and lower accuracy, probable reasons for the outcome are presented. Finally, we analyze the drawbacks of the model and propose some mechanisms to improve it, for example taking time duration as an independent feature in addition to the frequencies.

The next step of the research could focus on improving the model to make it applicable to more complex data. For example, the input data for the clustering may include Bluetooth scan records, communication logs between users, users' preferences for mobile phone applications, and so on. By handling the time stamps properly, outside-lab proximity can also be inferred, which will give more information about users' social activities other than work. Combined with the results of the work proximity analysis, this can give a more comprehensive view of the users' whole social circle. Once refined, the model can be applied in social networking services such as "friend recommendation": if two users of the same group are close to each other (e.g. within 10 meters) but have never been in contact, we can alert both parties that they could be friends.

REFERENCES

[1] Krebs, V. Social Network Analysis, A Brief Introduction. Available at http://www.orgnet.com/sna.html, 2008.
[2] Eagle, N., Pentland, A. and Lazer, D. Inferring Social Network Structure using Mobile Phone Data. Proceedings of the National Academy of Sciences (PNAS) 106, 36 (2009), 15274-15278.
[3] LaMarca, A., Chawathe, Y., Consolvo, S. and Hightower, J. Place Lab: Device positioning using radio beacons in the wild. Pervasive, LNCS 3468 (2005), 116-133.
[4] Eagle, N. MIT Medialab: Reality Mining. Available at http://reality.media.mit.edu/. Massachusetts Institute of Technology, 2009.
[5] HIIT Research Group. Context Recognition by User Situation Data Analysis (CONTEXT). Available at http://www.hiit.fi/context/, Dec. 2009.
[6] Han, J. and Kamber, M. Data Mining: Concepts and Techniques, Second Edition. Morgan Kaufmann Publishers, 2006.
[7] Hand, D., Mannila, H. and Smyth, P. Principles of Data Mining. MIT Press, 2001.
[8] Chen, A., Chen, N. and Zhou, L. Data Mining: Technology and Applications. Science Press, 2006.
[9] Chen, J. High-Dimensional Clustering Knowledge Discovery. Publishing House of Electronics Industry, 2009.
[10] Kaufman, L. and Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.
