Intelligent Data Analysis 18 (2014) 137–156 DOI 10.3233/IDA-140634 IOS Press
An alternative approach for clustering web user sessions considering sequential information

Rajhans Mishra^a,*, Pradeep Kumar^b and Bharat Bhasker^b

^a Indian Institute of Management, Indore, India
^b Indian Institute of Management, Lucknow, India
Abstract. Clustering is a prominent technique in data mining applications. It generates groups of data points that are similar to each other in a given aspect; each group has some inherent latent similarity, which is computed using similarity measures. Clustering web users based on their navigational patterns has always been an interesting as well as a challenging task. A web user, based on his or her navigational pattern, may belong to multiple categories. Intrinsically, web user navigation patterns exhibit a sequential property. When dealing with sequence data, a similarity measure should be chosen that captures both the order and the content information while computing similarity among sequences. In this paper, we have utilized the Sequence and Set Similarity Measure (S3M) with a rough set based similarity upper approximation clustering algorithm to group web users based on their navigational patterns. The quality of the clusters formed using the rough set based clustering algorithm with the S3M measure has been compared with that of the well-known clustering algorithm Density Based Spatial Clustering of Applications with Noise (DBSCAN). The experimental results show the viability of our approach.

Keywords: Clustering, similarity upper approximation, web usage data, sequential data
1. Introduction
With the advent of digital technology, the generation and capture of web related data has become possible. Web data can be analyzed along three dimensions, namely structure, content and usage [1,2]. Analysis of web usage data can provide useful insights related to user browsing behaviour. Data mining techniques commonly used for web usage mining include association rule generation, sequential pattern generation, clustering, and classification. Association rule mining techniques [3] are used to discover correlations between items found in a large database of transactions. In the context of web usage mining, a transaction is a group of web page accesses by web users, with an item being a single page access. An example of an association rule found in an IBM analysis of the server log of the official 1996 Olympics web site is "If a visitor accesses a page about Indoor Volleyball then the visitor also accesses a page on Handball in 45% of the cases. This pattern is present in 0.23% of the transactions" [4]. Sequential pattern discovery aims to find patterns such that the presence of a set of items is followed by another set of items in an ordered transaction set. The transaction set may be ordered with respect to time, space, etc. [5].
∗ Corresponding author: Rajhans Mishra, Indian Institute of Management, Indore, India. E-mail: [email protected].
1088-467X/14/$27.50 © 2014 – IOS Press and the authors. All rights reserved
Again, taking an example from IBM's report on the Olympics database, an example of a sequential pattern is "5.77% of the site visitors accessed the Atlanta home page followed by the Sights and Sounds main page" [4]. The association or sequential pattern rules generated from the web usage data of a user have potential applications in areas such as web personalization, recommendation systems and intrusion detection systems. Web personalization has been used to design web recommendation systems [6,7]. The core step in designing an effective recommender system is to identify the groups of products and the interested user sets. Clustering is used to identify these groups. Web pages which come from one domain set lead to the formation of a dense region. The main challenge in clustering web pages which exhibit this dense nature is to figure out a dense region and separate the noise data from it. Various clustering algorithms for dense data have been proposed in the literature, such as Density Based Spatial Clustering of Applications with Noise (DBSCAN) [8], the Ordering Points To Identify the Clustering Structure (OPTICS) algorithm [9], the DENsity-based CLUstering algorithm (DENCLUE) [10], the Improved DBSCAN (IDBSCAN) algorithm [11] and the Density Differentiated Spatial Clustering (DDSC) algorithm [12]. The philosophy of these algorithms in forming clusters is to find the density variance present in the data set. All the above mentioned algorithms separate regions of high density from regions of low density; the high density regions are termed clusters, while the separated low density regions are recognized as noise points. Density based clustering generates clusters in an evolving manner. All these techniques assume that a web user can belong to one and only one group, but in real life applications this is not so. Consider a scenario where a user is searching for a data mining book for his research work, as well as a travel book since he has to attend a conference in the USA. A system analyzing his browsing behaviour should be able to put him into two categories, namely a data mining interest group and a travel interest group. The above mentioned algorithms will put the user in one and only one group, considering the frequency of the most visited pages or a binary representation of the dataset, i.e. visited or not visited web pages. Incorporating soft computing based clustering techniques with sequential pattern discovery measures may be helpful in forming such web user groups. Rough set is a soft clustering technique [13] which deals with the ambiguity and vagueness present in data. A rough set can be approximated by two crisp (definite or non-vague) sets termed the upper approximation and the lower approximation. These sets are derived using the similarity among the objects. Similarity upper approximation can be used to derive incremental similarity among the objects. Incremental clustering can result in arbitrarily shaped clusters; hence it can be utilized in dense data clustering, where dense regions within sparse regions can be found. In this paper we have incorporated the rough set based similarity upper approximation clustering technique [14] with a sequence based similarity measure [15] to form dense clusters. In the proposed clustering framework, initially soft clusters are formed; in subsequent steps, hard clusters are generated based on the strength of the data points. We compared our technique with the well established dense clustering algorithm DBSCAN [8]. The cluster quality obtained using our approach is better than that obtained with the DBSCAN clustering algorithm.
The rest of the paper is organized as follows. In Section 2, we present a review of the literature related to our work. Section 3 discusses modified sequence data clustering using similarity upper approximation. Section 4 presents experimental results and discussion. Conclusion and future work are discussed in Section 5, followed by the references.

2. Related work

In this section, we review the recent literature related to sequence data clustering and rough set based clustering.
2.1. Sequence data clustering
Sequence mining is an area which deals with finding useful patterns in sequence data sets and understanding their behaviour. Analysis of sequential data has become increasingly important, since such data are found in many areas, including biological sequences, text documents and web access logs. Classical clustering algorithms (such as k-means) fail to give good results for sequential data sets, as it is difficult to compute the pairwise similarity between sequences [16]. Moreover, in most cases the pairwise similarity computation is computationally complex, with at least quadratic cost in the number of sequences; hence it can be applied only to small data sets, resulting in a scalability problem for sequential data. Sequences can be represented using various schemes, such as usage based (UB) representation, frequency-based (FB) representation, viewing-time based (VTB) representation and visiting-order based (VOB) representation [17]. Different sets of techniques have been used for sequence mining, including association rules, frequent sequences and frequent generalized sequences [18]. Association rules consider the problem of finding associations among visited web pages, similar to finding associations among item sets in transaction databases. Yang and Wang [16] have developed a clustering technique for sequences based on their sequential features. Clustering techniques face problems in the case of sequential data (in the categorical domain) due to the lack of an efficient similarity measure; they have proposed a new model, CLUSEQ, for sequential clustering using significant statistical properties possessed by the sequences. Guralnik and Karypis [19] have proposed a new technique for sequence clustering that does not require an all-against-all analysis and uses a near linear complexity k-means based clustering algorithm. The proposed approach finds a set of features that capture the sequential nature of the various data sequences, and then projects each data sequence into a new space whose dimensions are the features identified for the data sequences; the traditional k-means algorithm is used for clustering the data sequences after the transformation. Kum et al. [20] have proposed a sequence mining algorithm termed ApproxMAP (APPROXimate Multiple Alignment Pattern mining) using approximate sequential pattern mining, which deals with identifying patterns approximately shared by many sequences. Sayed et al. [18] have suggested a novel algorithm, FS-Miner, for discovering frequent patterns in sequence databases; FS-Miner requires only two scans of the database. Kumar et al. [15] proposed a new similarity measure for sequential data, termed S3M, and utilized it for clustering sequential data using the partition around medoids (PAM) algorithm. The clusters formed were crisp in nature. S3M is a non-vector-based similarity measure for clustering web user sessions which considers both content and sequential information while clustering web pages. A new clustering algorithm, SeqPAM, has been developed by Kumar et al. [21] for sequential data.

2.2. Rough set based clustering
Many researchers have utilized the concept of rough sets in their work. Lingras [22–24] has focused on rough set clustering for web mining, describing unsupervised classification using rough sets along with genetic algorithms to represent clusters as interval sets. Indiscernibility based clustering has been proposed by Hirano and Tsumoto [25,26], which is able to deal with the relative proximity of data points.
Rough approximation has been applied for clustering web transactions through web logs [14,27]. Various attempts have been made to model the uncertainty and vagueness present in data during clustering with the help of the integration of fuzzy and rough sets [28–30]. Rough sets have been used to capture the inherent uncertainty involved in clustering. Asharaf et al. [31] have developed an incremental clustering approach for interval data using rough sets. The technique assumes that it is possible to consider data points one at a time and assign them to existing clusters; any new data item is assigned to a cluster without looking at the previously seen patterns. The proposed incremental algorithm scales well with the size of the dataset. Mohebi and Sap [32] have applied rough sets to self-organizing maps (SOM) and proposed a two-level clustering based on SOM with rough sets, where rough set theory is utilized to capture the inherent uncertainty involved in cluster analysis. In the first stage, SOM is used to develop the prototypes by training the data with a SOM neural network; clustering is performed in the next step using a rough set based incremental clustering technique. Kumar et al. [33] have proposed a novel indiscernibility-based rough agglomerative hierarchical clustering algorithm, in which the indiscernibility relation is extended to a tolerance relation with the transitivity property relaxed. In their approach the initial clusters are formed using similarity upper approximation; subsequent clusters are formed using the concept of constrained-similarity upper approximation, where a relative similarity is used as the merging criterion. Kandwal et al. [34] have proposed rough set based clustering using an active learning approach. They have extended the concept of Hamming distance and proposed a dissimilarity measure which helps in finding the approximations of clusters in a given data set. The approach has been tested on benchmark data sets from the UCI machine learning repository and produced favourable results. Trabelsi et al. [35] have presented two classification approaches using rough sets (RS) which learn decision rules from uncertain data. It is assumed that the uncertainty exists only in the decision attribute values of the decision table (DT), represented by belief functions. The Belief Rough Set Classifier (BRSC) is the first technique, using the basic concepts of rough sets. The second technique is more complex and is based on the Generalization Distribution Table (BRSC-GDT), a hybridization of the Generalization Distribution Table and rough sets (GDT-RS). Both classification techniques try to simplify the uncertain decision table (UDT) to create significant decision rules for the classification process. A heuristic method based on rough sets has been used for attribute selection and reduction of the time complexity. The performance of the proposed classifiers has been evaluated through experiments on benchmark real world datasets where uncertainty in the decision attributes has been introduced artificially. Yanto et al. [36] have used the variable precision rough set model for clustering. They have applied rough set clustering to group students suffering from anxiety, utilizing the mean accuracy of approximation using variable precision of attributes. Data had been collected through a survey to find the anxiety sources among the students.
The paper showed how to use the variable precision rough set model for grouping. All the above mentioned clustering techniques do not consider the order information present in sequence data: they first convert the sequences into the frequency domain or into the absence/presence of items within a sequence. Treating sequences in such a way may lead to a loss of ordering information. In this work we have utilized the rough set based similarity upper approximation algorithm with the S3M similarity measure, which considers sequence information besides content information.
3. Clustering sequence data using similarity upper approximation algorithm
With the advances in technology, there has been a rapid growth in the volume and complexity of electronic data being generated and stored. As a result of this increase, the task of extracting meaningful knowledge in the form of patterns, relationships and groupings, to be used in applications such as decision support, prediction and anomaly detection, has become both machine intensive and essential. Furthermore, the need to discover underlying data structures in mixed attribute data calls for efficient data analysis with minimal human intervention. Cluster analysis is one such task, helping us to figure out hidden relationships and groups. In the clustering task the main objective is to find the natural groups of similar data points. Clustering techniques group data points which are similar to each other, where the similarity is decided on the basis of similarity measures; the value of similarity among data points varies with the type of similarity measure. When dealing with sequence data, researchers often transform the sequences into structured data by using frequency encoding or binary encoding. Sequence mining suffers from the lack of suitable similarity measures for finding the similarity between sequences, and the main challenge with the above mentioned transformations is that we lose the information related to order of occurrence. In this paper we have combined a sequence clustering algorithm with similarity upper approximation to take into account the above challenges associated with web data. The objective of the current paper is to figure out two things: first, whether the sequence information available within the data set can help in forming better clusters; second, how to modify the similarity upper approximation based clustering technique for dense data. Web users' traversals may be within a web site or may span different types of interrelated web sites. Our study is limited to the traversal behaviour of web users within a web site. For a large number of web users and a limited number of web pages, a dense region is generated; for such a dense web domain, compact clusters are the desired output. The generated clusters will be validated on the basis of their compactness. The compactness of a cluster can be defined as the degree of similarity of the grouped users. The intra-cluster distance indicates the density of the objects within the cluster, that is, how densely it is packed. For a good clustering algorithm, the intra-cluster distance should be minimal and the inter-cluster distance should be maximal. The inter-cluster similarity refers to the separation among the various clusters. In our work the separation of the clusters is not an important aspect, as we are looking for dense clusters, which can be ensured by intra-cluster similarity. The compactness will be calculated on the basis of the intra-cluster similarity of the clustering scheme: the greater the intra-cluster similarity, the more compact the clusters. We will validate the clusters on the basis of the compactness of the obtained clusters. Rough set theory is mathematically simple and quite useful in data mining; it is a widely used tool in information retrieval, decision support, machine learning, and knowledge based systems. A wide range of applications utilize the concepts of rough set theory: medical data analysis, aircraft pilot performance evaluation, image processing, and voice recognition are a few examples.
Let U be the collection of web user sessions, a non-empty set containing n user sessions denoted as {x1, x2, x3, ..., xn}. Each user session comprises web page visits. Let D be a similarity matrix, where [D]_{ij} = μ(x_i, x_j) denotes the similarity between web user sessions x_i and x_j. The similarity between two web user sessions is computed using the S3M measure, which considers both the content and the order information. Once the similarity matrix has been computed, the initial set of clusters is formed using similarity upper approximation.
Definition 1 [33]. Given a threshold value δ ∈ (0, 1], for any element x_i ∈ U, where U is the non-empty set consisting of web user sessions, the similarity upper approximation with respect to x_i is

$$\bar{R}(x_i) = \{x_j \mid \mu(x_i, x_j) \geq \delta,\ j = 1, 2, \ldots, k\}$$

where $\bar{R}(x_i)$ is the initial cluster and x_i is called the cluster centre of the corresponding cluster.

Thus, from the above definition it is clear that the elements of the non-empty set U can be partitioned into a family of overlapping sets $\{\bar{R}_i \mid i = 1, 2, 3, \ldots, n\}$ in terms of the threshold value δ. From the cluster family generated by Definition 1, only one set is retained if two sets A and B are equal; also, if set A is a proper subset of set B, then only set B is considered. Thus, keeping only the unique sets and proper supersets from the cluster family formed by Definition 1, a new family of reduced size is generated, denoted $\{\bar{R}_i \mid i = 1, 2, 3, \ldots, l\}$, where l < n. For any x_i, x_j ∈ U,

$$x_i \in \bar{R}_j \Rightarrow x_j \in \bar{R}_i \quad \text{and} \quad \bigcup_{j=1}^{l} \bar{R}_j = U.$$

However, the family of sets $\{\bar{R}_i \mid i = 1, 2, 3, \ldots, l\}$ is likely to be a pseudo-partition due to common elements in different sets. In order to have a natural grouping it is necessary to partition the universe, so that each element lies in only one partition. After forming the clusters by the first similarity upper approximation, a web user session may be a member of more than one group; such objects are referred to as ambiguous objects. These ambiguous objects may be made fully definable, that is, we can find the set to which each element exactly belongs. While forming the second similarity upper approximation we calculate the strength of an ambiguous web user session with respect to all the sets to which it belongs. The strength of the web user session is calculated using Definitions 2 and 3 given below.
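To make Definition 1 concrete, the following minimal sketch (ours, in Python; the paper's experiments used Java, and the function name and matrix layout are our assumptions) forms the first similarity upper approximation from a precomputed similarity matrix and prunes duplicates and proper subsets.

```python
# A minimal sketch of Definition 1: form R_bar(x_i) = {x_j : mu(x_i, x_j) >= delta}
# for every session, then keep only unique sets that are not proper subsets.

def first_upper_approximation(D, delta):
    """D: n x n similarity matrix (list of lists), delta in (0, 1]."""
    n = len(D)
    clusters = [frozenset(j for j in range(n) if D[i][j] >= delta)
                for i in range(n)]
    unique = set(clusters)  # equal sets collapse to one representative
    # discard any set that is a proper subset of another set in the family
    return [c for c in unique if not any(c < other for other in unique)]
```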
Definition 2. For an object x ∈ C_1 ⊆ U, the lower approximation of set C_1 due to object x is given by

$$R_{C_1}^{x} = C_1 - ((C_1 \cap C_2) \cup (C_1 \cap C_3) \cup \ldots \cup (C_1 \cap C_n))$$

where x ∈ C_2, C_3, ..., C_n.
Definition 3. For any element x ∈ C_i, let

$$\mu_{C_i}(x) = \frac{\sum_{j=1,\ldots,k} \mu(R_{C_i}^{x}(j), x)}{|R_{C_i}^{x}|}, \qquad i = 1, \ldots, l \tag{1}$$

where i ranges over the sets intersecting due to the object x, j indexes the elements of the lower approximation of C_i due to element x, $\mu(R_{C_i}^{x}(j), x)$ is the pairwise similarity between the object x and the j-th element of the lower approximation of C_i due to object x, $|R_{C_i}^{x}|$ is the cardinality of the lower approximation of set C_i due to object x, and $\mu_{C_i}(x)$ is called the strength of the relationship between object x and the set C_i. The greater the strength of the relationship of an object with a cluster's lower approximation, the more strongly the object belongs in that cluster. To explain the approach, consider two sets A and B with a common object x_d, and let $\bar{R}_c$ denote the intersecting region between sets A and B. Then x_d will be placed in cluster A or cluster B based on the following conditions:

1) For $x_d \in \bar{R}_c$, if $\mu_A(x_d) > \mu_B(x_d)$, then $\bar{R}_B = \bar{R}_B \setminus \{x_d\}$
2) For $x_d \in \bar{R}_c$, if $\mu_A(x_d) < \mu_B(x_d)$, then $\bar{R}_A = \bar{R}_A \setminus \{x_d\}$

3) For $x_d \in \bar{R}_c$, if $\mu_A(x_d) = \mu_B(x_d)$ and $|\bar{R}_A| \geq |\bar{R}_B|$, then $\bar{R}_B = \bar{R}_B \setminus \{x_d\}$

4) For $x_d \in \bar{R}_c$, if $\mu_A(x_d) = \mu_B(x_d)$ and $|\bar{R}_A| < |\bar{R}_B|$, then $\bar{R}_A = \bar{R}_A \setminus \{x_d\}$
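A small sketch may help fix ideas; it is our own Python illustration of Definitions 2 and 3 together with conditions 1)–4), under the assumption that clusters are held as mutable sets of session indices and D is the pairwise similarity matrix (helper names are ours).

```python
# Illustrative sketch of Definitions 2-3 and conditions 1)-4).

def strength(x, target, clusters, D):
    """mu_target(x): mean similarity of x to the lower approximation of
    `target`, obtained by removing its intersections with the other
    clusters containing x (Definition 2), then averaging (Definition 3)."""
    lower = set(target)
    for c in clusters:
        if x in c and c is not target:
            lower -= (set(target) & set(c))  # drop shared elements
    if not lower:
        return 0.0
    return sum(D[x][j] for j in lower) / len(lower)

def resolve(x, clusters, D):
    """Keep an ambiguous session x only in the cluster of highest strength,
    breaking ties by larger cardinality, as in conditions 1)-4)."""
    mine = [c for c in clusters if x in c]
    best = max(mine, key=lambda c: (strength(x, c, clusters, D), len(c)))
    for c in mine:
        if c is not best:
            c.discard(x)
```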
Applying the above conditions, the web user sessions which belong to more than one cluster are assigned to one and only one cluster; the resulting cluster set is now crisp in nature. The strength of our approach is that initially we assign a web user session to multiple groups, and later, as the algorithm progresses, the web user session is moved to the cluster where its strength is highest. To compute the strength of membership we have used both the concept of the lower and the upper approximation. In our work, we compute the similarity using various similarity measures. The S3M similarity measure is a weighted linear combination of sequence and set similarity measures. To compute sequence similarity we have utilized the length of the longest common subsequence (LLCS) of the longest common subsequence (LCS) [37]; the set similarity is computed using the Jaccard similarity measure [38]. The algorithm for forming clusters from sequential data using similarity upper approximation is outlined below.
Begin
  Step 1: For two web user sessions x1 and x2 ∈ T, call function Sim(x1, x2).
  Step 2: For a given δ ∈ (0, 1], form the first similarity upper approximation.
  Step 3: Remove proper subsets.
  Step 4: Identify the cluster centres, if they exist, and remove them from other clusters.
  Step 5: Apply Definitions 2 and 3 to get the strength of relationship of all the elements with the clusters to which they belong.
End

Function Sim(x1, x2)
  Input: p, q, x1 and x2, where 0 ≤ p, q ≤ 1
  Output: D, a similarity matrix
Begin
  Step 1: Compute SeqSim
    For the i-th element of x1 and the j-th element of x2:
      LCS(i, j) = 0                                  if i = 0 or j = 0
      LCS(i, j) = max(LCS(i−1, j), LCS(i, j−1))      if x_i ≠ y_j and i, j > 0
      LCS(i, j) = LCS(i−1, j−1) + 1                  if x_i = y_j and i, j > 0
    SeqSim(x1, x2) = LLCS / max(|x1|, |x2|)
  Step 2: Compute SetSim
    SetSim(x1, x2) = |x1 ∩ x2| / |x1 ∪ x2|
  Step 3: Compute S3M
    S3M = p × SeqSim(x1, x2) + q × SetSim(x1, x2)
End
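A runnable Python paraphrase of the Sim function is sketched below; it is our rendering of the pseudocode above (the paper's experiments used Java), with the LLCS computed by standard dynamic programming.

```python
# Sketch of the Sim function: LLCS-based sequence similarity combined with
# Jaccard set similarity, with weights p + q = 1.

def llcs(a, b):
    """Length of the longest common subsequence via O(|a|*|b|) DP."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def s3m(x1, x2, p=0.5):
    """S3M = p * SeqSim + q * SetSim, with q = 1 - p."""
    q = 1.0 - p
    seq_sim = llcs(x1, x2) / max(len(x1), len(x2))
    set_sim = len(set(x1) & set(x2)) / len(set(x1) | set(x2))
    return p * seq_sim + q * set_sim

# Two hypothetical length-6 sessions over page categories 1..17:
print(s3m([1, 2, 2, 4, 2, 6], [1, 2, 4, 6, 7, 7], p=0.5))  # ~0.733
```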
The upper approximation gives a group of sequences on the basis of a specified value of the similarity measure. The modified algorithm behaves similarly to the density based clustering algorithm DBSCAN, and the nature and type of clusters formed by the modified algorithm are very similar to those of the DBSCAN clustering algorithm. The terminology used in the DBSCAN clustering algorithm can be mapped to the terminology used in the similarity upper approximation based clustering technique: for example, the similarity threshold value (δ) plays the role of the neighbourhood parameter (ε). The similarity threshold value (δ) is used for finding the upper approximation of groups, which results in clusters. Cluster centres are termed Core points; elements other than cluster centres are termed Border points; an object that is not part of any cluster is termed Noise; and Mins defines the minimum number of points or objects that constitute a cluster. We compared the similarity threshold (δ) of rough set based clustering using similarity upper approximation with the neighbourhood radius (ε) of the DBSCAN clustering algorithm. An increase in the similarity threshold value (δ) increases the expected similarity between the objects present in the same cluster: the larger the value of δ, the more similar the objects grouped in a cluster. In the DBSCAN algorithm the neighbourhood radius is used as a distance measure for finding the dissimilarity between objects. Increased similarity between objects results in less distance between the objects grouped in a cluster; in other terms, it shrinks the neighbourhood region (ε) of a core point. This means that as the similarity threshold value (δ) increases, the neighbourhood parameter (ε) decreases. It is clear from this discussion that there exists a reciprocal relationship between the similarity threshold value (δ) and the neighbourhood parameter (ε), and an exact one-to-one relationship between the two parameters (δ and 1/ε) can be derived. Assuming a linear relationship between the similarity threshold value (δ) and the reciprocal of the neighbourhood parameter (ε), for two constants A and B the relationship can be expressed as

$$\delta = \frac{A}{\varepsilon} + B \tag{2}$$

Assuming two instances of clustering where δ = δ_1, ε = ε_1 and δ = δ_2, ε = ε_2, and solving the above equation for the two instances, the values of A and B can be computed as

$$A = \frac{(\delta_1 - \delta_2)}{\left(\frac{1}{\varepsilon_1} - \frac{1}{\varepsilon_2}\right)} \quad \text{and} \quad B = \delta_2 + \frac{\Delta\delta\,\varepsilon_1}{\Delta\varepsilon} \ \text{ or } \ B = \delta_1 + \frac{\Delta\delta\,\varepsilon_2}{\Delta\varepsilon}$$
Putting the values of A and B into Eq. (2), the linear relationship between δ and ε can be expressed as

$$\delta = \frac{(\delta_1 - \delta_2)}{\left(\frac{1}{\varepsilon_1} - \frac{1}{\varepsilon_2}\right)\varepsilon} + \delta_1 + \frac{\Delta\delta\,\varepsilon_2}{\Delta\varepsilon} \tag{3}$$
These instances have been taken for the purpose of illustrating the possible relationship that may exist between δ and ε.
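For illustration only, the sketch below recovers A and B of Eq. (2) from two such instances; the numeric (δ, ε) pairs are hypothetical values of ours, not taken from the experiments.

```python
# Recovering A and B of Eq. (2) from two observed clustering instances.

def fit_delta_eps(d1, e1, d2, e2):
    A = (d1 - d2) / (1.0 / e1 - 1.0 / e2)
    B = d1 - A / e1  # equivalently, B = d2 - A / e2
    return A, B

A, B = fit_delta_eps(0.2, 0.5, 0.3, 0.4)  # hypothetical (delta, eps) pairs
print(A, B)  # fitted relationship: delta = A / eps + B
```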
Table 1
Number of clusters generated using rough set clustering algorithm for "msnbc" web dataset with Mins = 4, δ = 0.2

Mins = 4, δ = 0.2   Data Size-200   Data Size-500   Data Size-1000   Data Size-2000   Data Size-3000
Jaccard                        19              48              125              194              144
Levenshtein                    12              32              102              126              232
S3M (p = 0.2)                  11              69              149              226               57
S3M (p = 0.5)                  10              54              140              222               51
S3M (p = 0.8)                   8              39              153              146               31
S3M (p = 0.9)                   8              38              146              255               34
4. Experimental results and discussion
This section is divided into three subsections. In the first subsection we describe the datasets used for the experiments. In the second subsection we report the performance of the modified rough set based clustering algorithm using similarity upper approximation with several similarity measures. Finally, in the third subsection we compare the modified rough set based clustering with DBSCAN clustering. Clustering algorithms can be classified into various categories: density-based, partitioning, hierarchical, grid-based and model-based [40]. Density based clustering techniques deal with finding dense regions in the data space. DBSCAN is the most prominent density based clustering algorithm, hence we have chosen DBSCAN as a representative of density based clustering. All the experiments reported in this paper were performed on a PC having an Intel Core 2 Duo processor (1.83 GHz) with 2 GB RAM, using Java as the programming language on the Windows XP platform. Experiments were conducted on two datasets, namely the msnbc web navigation dataset [39] and a simulated sequential dataset.

4.1. Description of the datasets
The msnbc web navigational dataset was collected from the UCI dataset repository. The dataset consists of Internet Information Server (IIS) logs for the msnbc.com web site and news-related portions of msn.com for the entire day of September 28, 1999 (Pacific Standard Time). Each web log is a sequence representation of the page views of a web user during that twenty-four hour period. The length of a web user session varies from 1 to 500. The average length of a web user session is reported to be 5.6; hence for our experiments we have taken only those user sessions whose length is 6. Moreover, comparing a web user session of length one with a user session of length 500 may not lead to any useful information. The data set has seventeen categories: "frontpage", "news", "tech", "local", "opinion", "on-air", "misc", "weather", "health", "living", "business", "sports", "summary", "bbs" (bulletin board service), "travel", "msn-news", and "msn-sports". We have converted the categories into numbers from 1 to 17 for the purpose of the experiments. The second dataset, used for experimentation purposes, has been simulated over twenty five different categories, numbered from "1" to "25". Each sequence of this dataset is of length six.
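As an aside, a hedged sketch (ours) of the session selection just described: we assume the raw log has already been reduced to one whitespace-separated sequence of category numbers (1–17) per line, and the file name is hypothetical.

```python
# Load sessions and keep only those of the target length (6 in the paper).

def load_sessions(path="msnbc_sequences.txt", length=6):
    sessions = []
    with open(path) as f:
        for line in f:
            tokens = line.split()
            if not tokens or not tokens[0].isdigit():
                continue  # skip header or comment lines
            seq = [int(t) for t in tokens]
            if len(seq) == length:  # keep only the length-6 sessions
                sessions.append(seq)
    return sessions
```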
4.2. Performance of the modified algorithm

This subsection reports the performance of the rough set based clustering algorithm using similarity upper approximation. We arbitrarily selected 200, 500, 1000, 2000 and 3000 web user sessions from the datasets (msnbc and simulated) and formed five different datasets of varying sizes. We conducted our experiments on all five datasets with different distance/similarity measures, namely Jaccard, Levenshtein, and S3M with different p (weightage to sequence similarity) values. In Tables 1 and 2 we report the number of clusters generated by the rough set clustering algorithm for different Mins and δ values over different dataset sizes, for each of the distance/similarity measures.
Table 2
Number of clusters generated using rough set clustering algorithm for "msnbc" web dataset with Mins = 3, δ = 0.3

Mins = 3, δ = 0.3   Data Size-200   Data Size-500   Data Size-1000   Data Size-2000   Data Size-3000
Jaccard                        19               4               60               65              200
Levenshtein                    20              26               38               32              209
S3M (p = 0.2)                  20               7               48               69              210
S3M (p = 0.5)                  16               5               61               79              211
S3M (p = 0.7)                  20              12               73              134              147
S3M (p = 0.8)                  11               8               51              110               35
S3M (p = 0.9)                   5               7               54              100               26
Table 3
Total Intra-Cluster Similarity of clusters using rough set based clustering for "msnbc" web dataset with Mins = 4, δ = 0.2

Mins = 4, δ = 0.2   Data Size-200   Data Size-500   Data Size-1000   Data Size-2000   Data Size-3000
Jaccard                    0.1140          0.0627           0.0699           0.0598           0.0961
Levenshtein                0.1158          0.0960           0.0450           0.0482           0.0888
S3M (p = 0.2)              0.1221          0.0412           0.0649           0.0596           0.0786
S3M (p = 0.5)              0.0936          0.0572           0.0700           0.0640           0.0994
S3M (p = 0.8)              0.0897          0.0956           0.0718           0.0638           0.0835
S3M (p = 0.9)              0.0838          0.1009           0.0648           0.0802           0.0838
The Jaccard similarity measure is an example of a content based non-vector similarity measure. It finds the relative commonality present between two sets, measured as the ratio of the number of common attributes of two user sessions, say U1 and U2:

$$S(U_1, U_2) = \frac{|U_1 \cap U_2|}{|U_1 \cup U_2|} \tag{4}$$

The Levenshtein similarity measure is a representative of sequence based non-vector similarity measures [41]. The Levenshtein distance is also called edit distance. It is an approximate sequence matching algorithm, used to solve the problem of finding a sequence p = p1 p2 ... in another sequence s = s1 s2 s3 ... sn with at most e > 0 differences between the two sequences. If p = "battle" and s = "settle", the Levenshtein distance between these two sequences is 2, and the Levenshtein similarity is 0.66 (1 − (2/6)).

The S3M measure is a hybrid non-vector similarity measure, combining the content based and sequence based non-vector similarity measures. The S3M measure is given as follows:

$$S^3M(A, B) = p \times \frac{LLCS(A, B)}{\max(|A|, |B|)} + q \times \frac{|A \cap B|}{|A \cup B|} \tag{5}$$

where p + q = 1 and p, q ≥ 0. Here p and q determine the relative weights given to order of occurrence (sequence similarity) and to content (set similarity), respectively. In practical applications, the user can specify these parameters.

The cluster quality can be measured in terms of the number of clusters formed as well as the intra-cluster distance/similarity. If a clustering algorithm results in too many clusters, the clusters are not densely packed or the data items are sparse in nature; whereas if the number of clusters is too small, all the data points are put in a few clusters, thus losing the significance and importance of the clustering technique.
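Returning to the Levenshtein example above: a brief illustrative sketch (ours, in Python) of the edit distance computation and its normalisation, reproducing the "battle"/"settle" example.

```python
# Levenshtein similarity: edit distance by dynamic programming, normalised
# by the length of the longer sequence.

def levenshtein(a, b):
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n]

def levenshtein_sim(a, b):
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein_sim("battle", "settle"))  # 0.666..., as in the example
```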
Table 4
Total Intra-Cluster Similarity of clusters using rough set based clustering for "msnbc" dataset with Mins = 3, δ = 0.3

Mins = 3, δ = 0.3   Data Size-200   Data Size-500   Data Size-1000   Data Size-2000   Data Size-3000
Jaccard                    0.1462          0.0725           0.0945           0.0866           0.1091
Levenshtein                0.1231          0.0858           0.0717           0.0718           0.1029
S3M (p = 0.2)              0.1634          0.0464           0.1001           0.0790           0.1238
S3M (p = 0.5)              0.1719          0.0579           0.0956           0.0774           0.1378
S3M (p = 0.7)              0.1438          0.0563           0.1073           0.1114           0.1451
S3M (p = 0.8)              0.1406          0.1146           0.1037           0.0757           0.1154
S3M (p = 0.9)              0.1126          0.1332           0.0992           0.0719           0.0835

∗ (Here bold letters in the print version show the best result in terms of total intra-cluster similarity.)
Table 5
Total Inter-Cluster Similarity of clusters using rough set based clustering for "msnbc" dataset with Mins = 3, δ = 0.3

Mins = 3, δ = 0.3   Data Size-200   Data Size-500   Data Size-1000   Data Size-2000   Data Size-3000
Jaccard                    0.2966          0.1499           0.1593           0.2017           0.2998
Levenshtein                0.1526          0.2143           0.1597           0.1236           0.2009
S3M (p = 0.2)              0.3085          0.1269           0.2389           0.2036           0.3666
S3M (p = 0.5)              0.2576          0.0666           0.2346           0.1783           0.3642
S3M (p = 0.7)              0.2381          0.1828           0.2268           0.2452           0.3313
S3M (p = 0.8)              0.1896          0.2565           0.1941           0.2220           0.2364
S3M (p = 0.9)              0.1490          0.1559           0.1981           0.2369           0.3123
Table 6
Total Inter-Cluster Similarity of clusters using rough set based clustering for "msnbc" dataset with Mins = 4, δ = 0.2

Mins = 4, δ = 0.2   Data Size-200   Data Size-500   Data Size-1000   Data Size-2000   Data Size-3000
Jaccard                    0.2109          0.1510           0.1758           0.1793           0.2744
Levenshtein                0.1370          0.2019           0.1385           0.1227           0.1973
S3M (p = 0.2)              0.1818          0.2595           0.2628           0.1843           0.2143
S3M (p = 0.5)              0.1905          0.1613           0.1835           0.1970           0.2153
S3M (p = 0.8)              0.1917          0.1622           0.2029           0.1885           0.2740
S3M (p = 0.9)              0.1970          0.1702           0.2033           0.2033           0.3283
Table 7
Number of clusters generated using rough set clustering algorithm for simulated dataset with Mins = 4, δ = 0.2

Mins = 4, δ = 0.2   Data Size-200   Data Size-500   Data Size-1000   Data Size-2000   Data Size-3000
Jaccard                        16               6               25               46                2
Levenshtein                    18              24               25               18                9
S3M (p = 0.2)                  10               8               40               70                2
S3M (p = 0.5)                  10               8               40               66               25
S3M (p = 0.7)                   8               6               32               55                7
S3M (p = 0.8)                   6               4               37               61                2
S3M (p = 0.9)                   6               3               33               67                4
Both cases are undesirable; hence reaching the required optimal number of clusters is not an easy task, and a domain expert may best identify the optimal number of clusters. More compact clusters imply a grouping of more similar users in a cluster. The compactness of clusters can be measured using the intra-cluster similarity/distance of the clusters. The total intra-cluster similarity of the clusters can be calculated using Eq. (6), where C_i^* represents the cluster centre of the i-th cluster and Sim(C_j, C_i^*) represents the similarity between the cluster centre of the i-th cluster and the j-th element of the i-th cluster.
Table 8
Number of clusters generated using rough set clustering algorithm for simulated dataset with Mins = 3, δ = 0.3

Mins = 3, δ = 0.3   Data Size-200   Data Size-500   Data Size-1000   Data Size-2000   Data Size-3000
Jaccard                        19               4               60               65               64
Levenshtein                    20              26               38               32              203
S3M (p = 0.2)                  20               7               48               62               39
S3M (p = 0.5)                  16               5               61               79               34
S3M (p = 0.7)                  20              12               73              134               23
S3M (p = 0.8)                  11               9               51              126               30
S3M (p = 0.9)                   5               7               54              100               57
Table 9
Total Intra-Cluster Similarity of clusters using rough set based clustering for simulated dataset with Mins = 4, δ = 0.2

Mins = 4, δ = 0.2   Data Size-200   Data Size-500   Data Size-1000   Data Size-2000   Data Size-3000
Jaccard                    0.1347          0.0910           0.0496           0.1064           0.4181
Levenshtein                0.1214          0.0944           0.0496           0.1150           0.0494
S3M (p = 0.2)              0.1430          0.1262           0.1422           0.1104           0.2354
S3M (p = 0.5)              0.1066          0.1402           0.1403           0.1132           0.2473
S3M (p = 0.7)              0.1006          0.0833           0.1159           0.1206           0.2135
S3M (p = 0.8)              0.1569          0.1305           0.1065           0.0874           0.2654
S3M (p = 0.9)              0.1544          0.1680           0.1037           0.1322           0.2524
Table 10
Total Intra-Cluster Similarity of clusters using rough set based clustering for simulated dataset with Mins = 3, δ = 0.3

Mins = 3, δ = 0.3   Data Size-200   Data Size-500   Data Size-1000   Data Size-2000   Data Size-3000
Jaccard                    0.1462          0.0933           0.1164           0.1074           0.2406
Levenshtein                0.1231          0.0936           0.0926           0.0873           0.0564
S3M (p = 0.2)              0.1634          0.0619           0.1281           0.1007           0.2655
S3M (p = 0.5)              0.1719          0.0731           0.1195           0.0978           0.2561
S3M (p = 0.7)              0.1438          0.0690           0.1326           0.1364           0.2218
S3M (p = 0.8)              0.1406          0.0768           0.1269           0.1175           0.2245
S3M (p = 0.9)              0.1126          0.1711           0.1217           0.0821           0.2266
Here |C_i^*| represents the total number of objects (in this case web users) in the i-th cluster, and K represents the total number of clusters generated by the particular clustering scheme. Equation (6) represents the average intra-cluster similarity over all the clusters; a clustering scheme with high intra-cluster similarity provides more compact clusters. Intra-cluster similarity focuses on the compactness of the clusters, while inter-cluster similarity deals with the separation of the existing clusters from each other. In the web applications considered here we are interested in finding the dense domains, hence only the intra-cluster similarity has been considered.

$$\text{Intra-ClusterSim} = \frac{\sum_{i=1}^{K} \frac{\sum_{j \in C_i} Sim(C_j, C_i^{*})}{|C_i^{*}|}}{K} \tag{6}$$

The inter-cluster similarity can be expressed by Eq. (7):

$$\text{Inter-ClusterSim} = \frac{\sum_{i}^{K} \sum_{j \neq i}^{K} Sim(C_i^{*}, C_j^{*})}{(K-1)!} \tag{7}$$

where C_i^* and C_j^* represent the cluster centres of the i-th and j-th clusters respectively, and K is the total number of clusters.
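The two validation measures can be sketched as follows (our Python illustration, following Eqs (6) and (7) exactly as printed, including the (K − 1)! normalisation; the data layout is an assumption of ours).

```python
# `clusters` maps a cluster centre index to its member indices; D is the
# pairwise similarity matrix (e.g. S3M similarities).

import math

def intra_cluster_sim(clusters, D):
    K = len(clusters)
    total = sum(sum(D[j][c] for j in members) / len(members)
                for c, members in clusters.items())
    return total / K  # Eq. (6): average per-cluster mean similarity

def inter_cluster_sim(clusters, D):
    centres = list(clusters)
    K = len(centres)
    pair_sum = sum(D[ci][cj] for ci in centres for cj in centres if ci != cj)
    return pair_sum / math.factorial(K - 1)  # (K - 1)! as in Eq. (7)
```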
Table 11
Total Inter-Cluster Similarity of clusters using rough set based clustering for simulated dataset with Mins = 4, δ = 0.2

Mins = 4, δ = 0.2   Data Size-200   Data Size-500   Data Size-1000   Data Size-2000   Data Size-3000
Jaccard                    0.2401          0.1050           0.1481           0.1977           0.2666
Levenshtein                0.1557          0.1914           0.1472           0.1459           0.0576
S3M (p = 0.2)              0.1961          0.0955           0.2285           0.1834           0.1907
S3M (p = 0.5)              0.1905          0.0954           0.2338           0.1981           0.2514
S3M (p = 0.7)              0.2419          0.1620           0.2056           0.2007           0.2572
S3M (p = 0.8)              0.1526          0.1055           0.1970           0.1955           0.2267
S3M (p = 0.9)              0.1541          0.1611           0.1781           0.2082           0.2490
Table 12
Total Inter-Cluster Similarity of clusters using rough set based clustering for simulated dataset with Mins = 3, δ = 0.3

Mins = 3, δ = 0.3   Data Size-200   Data Size-500   Data Size-1000   Data Size-2000   Data Size-3000
Jaccard                    0.2966          0.1500           0.1593           0.2017           0.2779
Levenshtein                0.1526          0.2143           0.1597           0.1236           0.0534
S3M (p = 0.2)              0.3085          0.1269           0.2389           0.2036           0.2750
S3M (p = 0.5)              0.2576          0.0666           0.2346           0.1783           0.2581
S3M (p = 0.7)              0.2381          0.1828           0.2268           0.2452           0.2356
S3M (p = 0.8)              0.1896          0.1487           0.1941           0.1829           0.2411
S3M (p = 0.9)              0.1490          0.1559           0.1981           0.2369           0.2369
Fig. 1. Comparison of intra-cluster similarity of rough set clustering using similarity upper approximation and DBSCAN (with Mins = 4, Jaccard, Size-1000) for msnbc web dataset. (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/IDA-140634)
The clusters have been validated using total intra-cluster similarity. Tables 3 and 4 present the total intra-cluster similarity for datasets of size 200, 500, 1000, 2000 and 3000 from the "msnbc" dataset, while Tables 9 and 10 report the same for the simulated dataset. The clusters with the highest intra-cluster similarity represent the most compact clusters and are hence considered the best-case clustering. Tables 5 and 6 present the inter-cluster similarity of the generated clusters for the "msnbc" dataset, calculated using Eq. (7). Tables 7–12 present the experimental results for the simulated dataset. It is clear from the experimental results that much better (in this case, more compact) clusters, having the highest total intra-cluster similarity, have been generated with the S3M similarity measure.
Fig. 2. Comparison of intra-cluster similarity of rough set clustering using similarity upper approximation and DBSCAN (with Mins = 4, S3M (0.5), Size-1000) for msnbc web dataset. (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/IDA-140634)
Fig. 3. Comparison of intra-cluster similarity of rough set clustering using similarity upper approximation and DBSCAN (with Mins = 4, S3M (0.5), Size-2000) for msnbc web dataset. (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/IDA-140634)
With various values of p and q, S3M considers both the content and the sequence similarity present among the users. For both the Jaccard and the Levenshtein measures it can be seen that the intra-cluster similarity is low, and hence they do not form good clusters.

4.3. Comparison of DBSCAN with the modified rough set based clustering using similarity upper approximation

The DBSCAN algorithm is used to find dense clusters and noise. Clusters grow gradually to find dense areas in the dataset. The points that are left ungrouped, i.e. which are not part of any cluster, are termed noise points.
Fig. 4. Comparison of intra-cluster similarity of rough set clustering using similarity upper approximation and DBSCAN (with Mins = 10, S3M (0.2), Size-4000) for msnbc web dataset. (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/IDA-140634)
Fig. 5. Comparison of intra-cluster similarity of rough set clustering using similarity upper approximation and DBSCAN (with Mins = 10, S3M (0.9), Size-4000) for msnbc web dataset. (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/IDA-140634)
The dense regions are separated from the regions of low density; the algorithm groups the points on the basis of the density gradient. DBSCAN [8] is the most prominent density based algorithm. It connects the objects (data points) in a gradual manner, can find clusters of non-spherical shape, and defines a cluster as a maximal set of density-connected points. It uses two parameters: the neighbourhood, defined by the radius of the neighbourhood (ε), and the minimum number of points in a cluster, termed Minpts. Besides that, the clustering scheme defines three types of points: core points, border points and noise points. The clusters are formed using the concept of density reachability. A core point is the centre of a cluster; all the points other than core points are termed border points; and points that are not part of any cluster are termed noise points.
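A sketch of how such a comparison run can be set up is given below; the use of scikit-learn's DBSCAN with a precomputed distance matrix derived from S3M similarities is our assumption, not the paper's own implementation.

```python
# DBSCAN over precomputed S3M-based distances.

import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_on_similarity(S, eps=0.6, min_pts=4):
    """S: square matrix of S3M similarities in [0, 1]."""
    distance = 1.0 - np.asarray(S)  # turn similarity into dissimilarity
    model = DBSCAN(eps=eps, min_samples=min_pts, metric="precomputed")
    return model.fit_predict(distance)  # label -1 marks noise points
```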
Fig. 6. Comparison of intra-cluster similarity of rough set clustering using similarity upper approximation and DBSCAN (with Mins = 4, Jaccard, Size-1000) for simulated dataset. (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/IDA-140634)
Fig. 7. Comparison of intra-cluster similarity of rough set clustering using similarity upper approximation and DBSCAN (with Mins = 4, S3M (0.5), Size-1000) for simulated dataset. (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/IDA-140634)
Rough set is a soft clustering technique. It approximates an imprecise set using two crisp sets: the upper approximation set and the lower approximation set. In this paper we have modified the rough set based similarity upper approximation clustering algorithm for sequential data. Initially, soft clusters are generated, which are then converted into hard clusters using the maximum membership assignment approach: if an object is present in more than one cluster, it is assigned to the cluster for which it has the maximum membership value. In our case the similarity value with respect to a cluster is considered the membership value for that cluster. The paper compares the clusters generated by DBSCAN and by the rough set based clustering algorithm using similarity upper approximation. Both types of clustering generate hard clusters as the end product. The comparison has been made considering the compactness of the generated clusters. A compact cluster represents a dense region: the more compact the clusters, the denser the region.
Fig. 8. Comparison of intra-cluster similarity of rough set clustering using similarity upper approximation and DBSCAN (with Mins = 4, S3M (0.5), Size-2000) for simulated dataset. (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/IDA-140634)
Fig. 9. Comparison of intra-cluster similarity of rough set clustering using similarity upper approximation and DBSCAN (with Mins = 10, S3M (0.9), Size-4000) for simulated dataset. (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/IDA-140634)
The generated clusters are compared in various scenarios by varying different parameters, including the type of similarity measure, the threshold of the similarity upper approximation, and the minimum number of points required in a cluster (Mins). The experiments are performed on various sizes of the datasets. In Figs 1–9 the X axis represents the similarity threshold in the case of the similarity upper approximation clustering algorithm, or the value of the neighbourhood radius in the case of the DBSCAN clustering algorithm; the Y axis represents the density of the best cluster. Figures 1–5 present the results for the msnbc dataset, while Figs 6–9 present the results for the simulated dataset. The best clusters of both clustering algorithms have been compared for compactness. In this case we are looking for compact clusters, as they give more similar data points, which is most desired in various e-commerce applications for finding similar users. Figures 1–5 present the output of the experiments performed for the study: they show the density of the best clusters and its variation with respect to the similarity threshold (δ) or neighbourhood radius (ε), the similarity measures and other parameters.
It is clear from the results that, for the given msnbc web dataset, rough set based clustering using similarity upper approximation has produced more compact clusters than the DBSCAN clustering algorithm.
5. Conclusion and future work
The paper investigates the effect of incorporating sequential information during clustering in a dense web domain, and suggests a new way of clustering in such a domain. The paper tries to capture and use both sequence and content information in deriving the clusters. We have utilized content based, sequence based and hybrid similarity measures (combinations of content and sequence based similarity measures) to see the effect of sequential information during clustering. Jaccard, Levenshtein and S3M have been utilized as representatives of content based, sequence based and hybrid similarity measures, respectively. The paper uses the rough set based clustering algorithm using similarity upper approximation with a hybrid similarity measure, and in particular tests it with S3M for the dense web domain. The clusters have been validated using total intra-cluster similarity: the greater the intra-cluster similarity, the more compact the clusters. More compact clusters group more similar users, which is important in e-commerce applications. Rough set based clustering using similarity upper approximation has been used with the S3M similarity measure (a hybrid similarity measure) for clustering, and the results have been compared with the DBSCAN clustering algorithm. The results show that the rough set based clustering algorithm using similarity upper approximation has produced more compact clusters than the DBSCAN algorithm for the "msnbc" web navigational dataset and the simulated dataset. The paper has also presented a mapping between the various parameters of rough set based clustering using similarity upper approximation and the DBSCAN clustering parameters. The paper explores a novel method of clustering in the dense web domain and suggests a logical mapping of the parameters of rough set based clustering using similarity upper approximation and the DBSCAN clustering algorithm. The suggested mapping can be further explored and validated with experiments, and the values of the parameters can be calculated using the experimental results.

References

[1] R. Cooley and B. Mobasher, Web mining: Information and pattern discovery on the World Wide Web, Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence, California, USA, 3–8 Nov 1997.
[2] P. Kolari and A. Joshi, Web mining: Research and practice, Computing in Science & Engineering, IEEE, copublished by the IEEE CS and the AIP, University of Maryland, Baltimore County, 2004, pp. 49–53.
[3] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994, pp. 487–499.
[4] S.E. Dean and M. Viveros, Data mining the IBM official 1996 Olympics Web site, Technical report, IBM T.J. Watson Research Center, 1997.
[5] R. Agrawal and R. Srikant, Mining sequential patterns: Generalizations and performance improvements, Proceedings of the Fifth International Conference on Extending Database Technology, Avignon, France, 1996.
[6] F. Masseglia, P. Poncelet, M. Teisseire and A. Marascu, Web usage mining: Extracting unexpected periods from web logs, Data Mining and Knowledge Discovery 16(1) (2008), 39–65.
[7] G. Castellano, A.M. Fanelli and M.A. Torsello, NEWER: A system for NEuro-fuzzy WEb Recommendation, Applied Soft Computing 11(1) (2011), 793–806.
[8] M. Ester, H.P. Kriegel and J. Sander, A density-based algorithm for discovering clusters in large databases, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD 96), AAAI Press, Portland, Aug. 1996, pp. 226–231.
[9] M. Ankerst, M.M. Breunig and H.P. Kriegel, OPTICS: Ordering points to identify the clustering structure, Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM Press, Philadelphia, 1999, pp. 49–60.
[10] A. Hinneburg and D.A. Keim, An efficient approach to clustering in large multimedia databases with noise, Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD 98), AAAI Press, New York, 1998, pp. 58–65.
[11] B. Borah and D.K. Bhattacharyya, An improved sampling-based DBSCAN for large spatial databases, Proceedings of the International Conference on Intelligent Sensing and Information, IEEE Press, 2004, pp. 92–96.
[12] B. Borah and D.K. Bhattacharyya, DDSC: A density differentiated spatial clustering technique, Journal of Computers 3(2) (2008), 72–79.
[13] Z. Pawlak, Rough sets, International Journal of Computer and Information Sciences 11 (1982), 341–356.
[14] P. Kumar, P.R. Krishna, S.K. De and R.S. Bapi, Web usage mining using rough agglomerative clustering, Proceedings of the Seventh International Conference on Enterprise Information Systems, LNCS Springer-Verlag, London, UK, 2005, pp. 315–320.
[15] P. Kumar, B.S. Raju and P.R. Krishna, A new similarity metric for sequential data, International Journal of Data Warehousing and Mining (IJDWM) 6(4) (2010), 16–32.
[16] J. Yang and W. Wang, CLUSEQ: Efficient and effective sequence clustering, Proceedings of the 19th International Conference on Data Engineering, Bangalore, India, 2003, pp. 101–112.
[17] J. Xiao, Y. Zhang, X. Jia and T. Li, Measuring similarity of interests for clustering web-users, Proceedings of the 12th Australasian Conference on Database Technologies, Australia, 2001, pp. 107–114.
[18] M.E. Sayed, C. Ruiz and E.A. Rundensteiner, FS-Miner: Efficient and incremental mining of frequent sequence patterns in web logs, Proceedings of the 6th Annual ACM International Workshop on Web Information and Data Management (WIDM '04), ACM, New York, USA, 2004, pp. 128–135.
[19] V. Guralnik and G. Karypis, A scalable algorithm for clustering sequential data, Proceedings of the IEEE International Conference on Data Mining, San Jose, CA, 2001, pp. 179–186.
[20] H.C.M. Kum, J. Pei, W. Wang and D. Duncan, ApproxMAP: Approximate mining of consensus sequential patterns, Proceedings of the Third SIAM International Conference on Data Mining (SDM), San Francisco, CA, 2003, pp. 311–315.
[21] P. Kumar, R.S. Bapi and P.R. Krishna, SeqPAM: A sequence clustering algorithm for Web personalization, International Journal of Data Warehousing and Mining 3(1) (2007), 29–53.
[22] P. Lingras and C. West, Interval set clustering of web users with rough k-means, Journal of Intelligent Information Systems 23(1) (2004), 5–16.
[23] P. Lingras and Y.Y. Yao, Time complexity of rough clustering: GAs versus k-means, Proceedings of the Third International Conference on Rough Sets and Current Trends in Computing, LNCS Springer-Verlag, London, UK, 2002, pp. 263–270.
[24] P.J. Lingras, Rough set clustering for web mining, Proceedings of the IEEE International Conference on Fuzzy Systems, Honolulu, 2002.
[25] S. Hirano and S. Tsumoto, An indiscernibility-based clustering method with iterative refinement of equivalence relations – rough clustering, Journal of Advanced Computational Intelligence and Intelligent Informatics 7(2) (2003), 169–177.
[26] S. Hirano and S. Tsumoto, Indiscernibility-based clustering: Rough clustering, Proceedings of the International Fuzzy Systems Association World Congress, LNCS Springer-Verlag, Heidelberg, 2003, pp. 378–386.
[27] S.K. De and P.R. Krishna, Clustering web transactions using rough approximation, Fuzzy Sets and Systems 148(1) (2004), 131–138.
[28] S.K. Pal and P. Mitra, Case generation using rough sets with fuzzy representation, IEEE Transactions on Knowledge and Data Engineering 16(3) (2004), 292–300.
[29] S.K. Pal and A. Skowron, Rough Fuzzy Hybridization: New Trends in Decision Making, LNCS Springer-Verlag, Singapore, 1999.
[30] M. Sarkar, Rough-fuzzy functions in classification, Fuzzy Sets and Systems 132(3) (2002), 353–369.
[31] S. Asharaf, M.N. Murty and S.K. Shevade, Rough set based incremental clustering of interval data, Pattern Recognition Letters 27(6) (2006), 515–519.
[32] E. Mohebi and M.N.N. Sap, Rough set based clustering of the self organizing map, Proceedings of the First Asian Conference on Intelligent Information and Database Systems, Vietnam, 2009, pp. 82–85.
[33] P. Kumar, P.R. Krishna, R.S. Bapi and S.K. De, Clustering using similarity upper approximation, Proceedings of the IEEE International Conference on Fuzzy Systems, Canada, 2006.
[34] R. Kandwal, P. Mahajan and R. Vijay, Rough set based clustering using active learning approach, International Journal of Artificial Life Research 2(4) (2011), 12–23.
[35] S. Trabelsi, Z. Elouedi and P. Lingras, Classification systems based on rough sets under the belief function framework, International Journal of Approximate Reasoning 52 (2011), 1409–1432.
[36] I.T.R. Yanto, P. Vitasari, T. Herawan and M.M. Deris, Applying variable precision rough set model for clustering student suffering study's anxiety, Expert Systems with Applications 39 (2012), 452–459.
[37] L. Bergroth, H. Hakonen and T. Raita, A survey of longest common subsequence algorithms, Proceedings of the Seventh International Symposium on String Processing and Information Retrieval (SPIRE), Atlanta, 2000, pp. 39–48.
[38] P. Giudici, Applied Data Mining: Statistical Methods for Business and Industry, Wiley, West Sussex, England, 2003.
[39] http://archive.ics.uci.edu/ml/datasets/MSNBC.com+Anonymous+Web+Data.
[40] A.K. Jain, M.N. Murty and P.J. Flynn, Data clustering: A review, ACM Computing Surveys 31(3) (1999), 264–323.
[41] V.I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, Soviet Physics-Doklady 10(7) (1966), 707–710.