Article

A Web Page Clustering Method Based on Formal Concept Analysis

Zuping Zhang *, Jing Zhao and Xiping Yan

School of Information Science and Engineering, Central South University, Changsha 410000, China; [email protected] (J.Z.); [email protected] (X.Y.)

* Correspondence: [email protected]; Tel.: +86-731-88879628

Received: 21 August 2018; Accepted: 2 September 2018; Published: 6 September 2018

Information 2018, 9, 228; doi:10.3390/info9090228
Abstract: Web page clustering is an important technology for sorting network resources. By extracting features from Web pages and clustering the pages according to their similarity, a large amount of Web information can be organized effectively. In this paper, after describing the extraction of Web feature words, methods for calculating the weights of feature words are studied in depth. Taking Web pages as objects and Web feature words as attributes, a formal context is constructed for use in formal concept analysis. An algorithm for constructing a concept lattice based on cross data links is proposed and successfully applied. This method clusters Web pages using the concept lattice hierarchy. Experimental results indicate that the proposed algorithm outperforms previous competitors with regard to time consumption and clustering effect.

Keywords: formal concept analysis; feature weight; cross linked list; concept lattice; Web page clustering
1. Introduction

In the era of big data, the massive amount of data on the network greatly increases the difficulty of finding the correct information. More and more people are beginning to realize the importance of standardizing and reorganizing network resources. Web page clustering technology has attracted much attention as an important part of network resource management. It is mainly used in the fields of data extraction, knowledge discovery, data mining, and information fusion [1]. Similar Web pages are grouped into one category by extracting their features and clustering the pages according to similarity. Because of the large capacity and dynamic nature of the network, Web page clustering is more difficult and complicated than ordinary text clustering [2].

At present, most studies on Web page clustering have been based on text, link structure, or hybrid clustering. Oren Etzioni and Oren Zamir [3] explored the clustering of Web pages and found that users are often forced to screen the list of files returned by a search engine, which makes it hard to retrieve information. They then proposed a new algorithm that constructs suffix trees from Web abstracts; clustering was more effective with this algorithm. In addition, Mecca et al. [4] proposed a new Web page clustering algorithm in 2006, which not only directly analyzes the results returned by search engines, but also obtains the entire Web page by analyzing the fragment content of the results.

Formal Concept Analysis (FCA) represents the relationship between objects and attributes, and is widely used in data mining, e-commerce, personalized navigation, text clustering, the semantic Web, search engines, and so on. Sun introduced an information management system for data mining by integrating ontology and FCA in e-commerce [5]. Hitzler applied FCA to the semantic Web [6]. Shen et al. proposed an intelligent search engine whose information retrieval model is based on the formal context of FCA and is integrated with a browsing mechanism based on the concept lattice [7]. Freeman and White suggested that network data could take the form of a square actor-by-actor binary adjacency matrix, where each row and each column of the matrix represents a social actor.
They used a Galois lattice to represent network data [8].

Although there are many algorithms for FCA clustering, the results are not satisfactory for larger feature sets and larger data sets. Clustering based on FCA still has some limitations. For example, there are too many concepts; the scale of the concept lattice is too large; the computation is large and complex; and the process of generating a complete Hasse diagram is difficult. The purpose of this paper is to study a Web page clustering method based on formal concept analysis. The contributions of this paper are as follows:
• Propose a bi-directional maximum matching word segmentation method based on a Chinese word segmentation method, and propose a word weighting scheme based on term frequency.
• Propose a ListFCA algorithm based on the study of different types of concept lattice construction algorithms. ListFCA uses a more efficient method to traverse the header list based on cross data links when constructing concept lattices (a sketch of this structure follows the list), which makes it easier to generate the Hasse diagram corresponding to the concept lattice.
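The cross linked list underlying ListFCA is described in detail in Section 3. As a rough illustration of the data structure itself, the following is a minimal Python sketch of a cross (orthogonal) linked list with header lists; the class and field names are illustrative and do not come from the implementation described later.

```python
class CrossNode:
    """One cell of the cross linked list: a filled entry of the
    object-attribute cross table, linked along both its row and its column."""
    def __init__(self, obj, attr):
        self.obj = obj        # row index: a Web page (object)
        self.attr = attr      # column index: a feature word (attribute)
        self.right = None     # next filled cell in the same row
        self.down = None      # next filled cell in the same column

class CrossList:
    """Sparse formal context stored as a cross (orthogonal) linked list with
    header lists, so all attributes of one object, or all objects sharing one
    attribute, can be traversed without scanning the whole table."""
    def __init__(self, n_objects, n_attributes):
        self.row_head = [None] * n_objects      # header list over objects
        self.col_head = [None] * n_attributes   # header list over attributes

    def add(self, obj, attr):
        # Prepend to both chains; order is irrelevant for traversal.
        node = CrossNode(obj, attr)
        node.right, self.row_head[obj] = self.row_head[obj], node
        node.down, self.col_head[attr] = self.col_head[attr], node

    def attrs_of(self, obj):
        """Walk one row via the header list: all attributes of one object."""
        node, out = self.row_head[obj], []
        while node:
            out.append(node.attr)
            node = node.right
        return out
```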
The rest of the paper is organized as follows. Related work and the classic algorithms for constructing the concept lattice are discussed in Section 2. In Section 3, we preprocess the Web pages and propose a more efficient header list traversal method for the construction of concept lattices. Experiments are presented in Section 4 to verify the efficiency of ListFCA. The conclusion of the paper is in Section 5.

2. Related Work

2.1. The Background of FCA

In the 1980s, Wille proposed FCA and introduced its mathematical definition and related theories [9]. In this method, the binary relationship between objects and attributes is analyzed, the formal context is constructed, and finally a concept lattice is built from the formal context. Due to its effective description of data and data relationships, FCA has become a powerful tool for rule extraction and data analysis. FCA is a formalized description of concepts (i.e., an ontology). The intension of a concept is the set of attributes belonging to the concept, and the extension of a concept is the set of objects that contain the concept [10]. In FCA, all the attributes and objects are used to construct formal contexts, usually in the form of a cross table. In this cross table, objects are represented by rows and attributes by columns, while a mark at the intersection of the mth row and the nth column indicates that the mth object has the nth attribute [11,12].

Clustering based on the FCA method contains four steps. First, the data sets are preprocessed and feature items are extracted, so that each data element is represented by its feature items. Second, a formal context is constructed by taking data elements as objects and feature items as attributes. Third, the concepts of the formal context are extracted, and the concept lattice of the formal context is constructed. Finally, a highly visual Hasse diagram can be generated.
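As a concrete illustration of the second and third steps, the following minimal Python sketch builds a toy formal context (pages as objects, feature words as attributes) and enumerates all of its formal concepts by closing every attribute subset with the two derivation operators. The data and names are illustrative only, and the brute-force enumeration is exponential in the number of attributes, so it is suited only to very small contexts.

```python
from itertools import combinations

# Toy formal context: Web pages (objects) x feature words (attributes).
context = {
    "page1": {"sport", "news"},
    "page2": {"sport", "video"},
    "page3": {"news"},
}
attributes = {"sport", "news", "video"}

def common_objects(attrs):
    """Derivation operator on attributes: objects having every given attribute."""
    return {o for o, a in context.items() if attrs <= a}

def common_attributes(objs):
    """Derivation operator on objects: attributes shared by every given object."""
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

# A formal concept is a pair (extent, intent) where each set derives the other.
# Closing every attribute subset reaches every concept of the context.
concepts = set()
for r in range(len(attributes) + 1):
    for subset in combinations(sorted(attributes), r):
        extent = common_objects(set(subset))
        intent = common_attributes(extent)
        concepts.add((frozenset(extent), frozenset(intent)))

for extent, intent in sorted(concepts, key=lambda c: (-len(c[0]), len(c[1]))):
    print(sorted(extent), "<->", sorted(intent))
```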
2.2. Related Algorithms for Constructing the Concept Lattice

Among FCA-related applications, the core task is constructing a formal context and then generating the corresponding concept lattice. Algorithms for constructing the concept lattice are mainly divided into three types: batch construction algorithms, incremental construction algorithms, and parallel construction algorithms [13].

2.2.1. The Batch Algorithm

Batch algorithms are a kind of early construction algorithm. Based on the way the lattice is built, they can be divided into three categories: top-down, bottom-up, and enumeration [14]. In the top-down method, the top node is constructed first, and construction then proceeds downwards layer by layer; the Bordat algorithm [15] is one of the classic algorithms. In the bottom-up method, the bottom node is constructed first and then expanded upward, as in the Chein algorithm [16]. The enumeration method enumerates all nodes of a given data set in a certain order, as in the Ganter algorithm [10]. The father-child relationships between concepts and all the concepts of the lattice can be obtained with a batch algorithm; however, it is difficult to generate the Hasse diagram.

We take the Bordat algorithm [15] as a representative example. First, the top node is constructed; its intension is empty (or contains only one element) and its extension is the set of all objects in the formal context. Second, all the direct sub nodes of the top node are calculated, and each of their direct sub nodes is calculated in turn. Finally, these steps are repeated until there are no sub nodes.
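The following is a simplified top-down sketch in the spirit of the Bordat procedure just described, not a faithful reproduction of the original algorithm: starting from the top concept, each layer's direct sub nodes are computed by closing one additional attribute and keeping only the maximal resulting extents.

```python
from collections import deque

# Same toy context as above (pages x feature words); names are illustrative.
context = {
    "page1": {"sport", "news"},
    "page2": {"sport", "video"},
    "page3": {"news"},
}
attributes = {"sport", "news", "video"}

def close(attrs):
    """Return the concept (extent, intent) generated by an attribute set."""
    extent = {o for o, a in context.items() if attrs <= a}
    intent = (set.intersection(*(context[o] for o in extent))
              if extent else set(attributes))
    return frozenset(extent), frozenset(intent)

def children(extent, intent):
    """Direct sub nodes: close each missing attribute, keep maximal extents."""
    cands = {close(set(intent) | {m}) for m in attributes - intent}
    cands = {(e, i) for e, i in cands if e != extent}   # strictly smaller
    return {(e, i) for e, i in cands                    # keep maximal extents
            if not any(e < e2 for e2, _ in cands)}

# Top-down expansion: start at the top node and proceed layer by layer
# until concepts with no sub nodes are reached.
top = close(set())
seen, queue = {top}, deque([top])
while queue:
    extent, intent = queue.popleft()
    print(sorted(extent), "<->", sorted(intent))
    for child in children(extent, intent):
        if child not in seen:
            seen.add(child)
            queue.append(child)
```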
2.2.2. The Incremental Algorithm

In an incremental algorithm, the lattice is computed by adding the objects or attributes of a given context one by one. The Godin algorithm is one of the classic algorithms of this type. An incremental algorithm can avoid redundant nodes, work with dynamic datasets, and construct concept lattices for a dynamic formal context. Because of its superior temporal performance, most concept lattice systems are based on this method. We take the Godin algorithm [17] as a representative example: for each insertion, it scans every node in the current concept lattice, determines which category the node belongs to, and then applies the corresponding operation for that category.
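The following is a simplified sketch of this incremental scheme. It maintains only the set of concepts (Hasse edge maintenance is omitted), and the function and variable names are illustrative rather than taken from the Godin algorithm's original presentation: existing nodes whose intent is contained in the new object's attributes are modified, and intersections that yield brand-new intents generate new nodes.

```python
def add_object(concepts, obj, attrs):
    """Fold one new object into a set of (extent, intent) pairs."""
    attrs = frozenset(attrs)
    old = set(concepts)
    new = set()
    for extent, intent in old:
        if intent <= attrs:
            # Modified node: the new object carries this whole intent.
            new.add((extent | {obj}, intent))
        else:
            new.add((extent, intent))
    old_intents = {i for _, i in old}
    for _, intent in old:
        gen = intent & attrs
        if gen not in old_intents:
            # Generator case: the intersection is a brand-new intent; its
            # extent is every old object containing it, plus the new object.
            extent = frozenset(o for e, i in old if gen <= i for o in e)
            new.add((extent | {obj}, gen))
    return new

# Usage: seed with the bottom concept of the empty context, then add pages
# one by one; the lattice grows incrementally.
M = frozenset({"sport", "news", "video"})   # attribute universe
lattice = {(frozenset(), M)}                 # bottom concept only
for page, words in [("page1", {"sport", "news"}),
                    ("page2", {"sport", "video"}),
                    ("page3", {"news"})]:
    lattice = add_object(lattice, page, words)
for extent, intent in sorted(lattice, key=lambda c: -len(c[0])):
    print(sorted(extent), "<->", sorted(intent))
```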
2.2.3. The Parallel Algorithm

Parallel algorithms fall into two categories. The first partitions the formal context so that the sub contexts can be processed in parallel when constructing the concept lattice. The second improves a serial algorithm so that it can be parallelized. With a parallel algorithm, the complex parent-child relationships between concepts can be obtained, and the concepts of a dynamic formal context can be extracted. It has great advantages in dealing with complex and huge data sets. However, most parallel construction algorithms have difficulty producing a complete lattice structure, and the Hasse diagram cannot be generated. Therefore, this type is not suitable for the study in this paper.

3. The ListFCA Algorithm

3.1. Web Data Cleaning

Before constructing the concept lattice, the Web page needs to be preprocessed. The source file of a Web page generally contains a large number of HTML tags. These tags provide a lot of content and structure information about the page, but they also carry much useless information.

3.1.1. Disturbance Cleaning

The organization and format of Web content differ from plain text. The text content of a Web page does not exist alone in the Web source file [18]; instead, it is nested in HTML tags. Different contents can be wrapped in different tags so that they are displayed in different styles. Through analysis of Web page contents, we find that some contents have no relation to the content of the Web page itself [19]. Such contents are useless for clustering; they even increase the complexity of the algorithm and become interference. The interference generally covers most of the content except the text, such as various pictures, advertisements, borders, and so on. More of the interference comes from the source code, including page rule information, page style information, page function information, etc. Removing the interference is necessary. Different contents are usually wrapped in different tags in the source file, which is very helpful for the identification of interference [20].
The interference exists within the interference item tags, and the information in those tags should be filtered out. Through the analysis of several Web pages, we compiled the interference item tags shown in Table 1 [21].

Table 1. Tags of interference items.
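As an illustration of the filtering step, the following minimal Python sketch drops all text nested inside interference item tags, using the standard library's HTML parser. The actual tag names of Table 1 did not survive extraction here, so the tag set below is a stand-in of commonly filtered container tags, not the authors' list.

```python
from html.parser import HTMLParser

# Stand-in interference item tags; only container tags are listed, because
# void tags such as <img> never trigger handle_endtag and would unbalance
# the depth counter (they contribute no text anyway).
INTERFERENCE_TAGS = {"script", "style", "iframe", "nav", "footer", "aside"}

class DisturbanceCleaner(HTMLParser):
    """Collect text that is not nested inside any interference item tag."""
    def __init__(self):
        super().__init__()
        self.depth = 0     # how many interference tags we are currently inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERFERENCE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in INTERFERENCE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def clean(html):
    parser = DisturbanceCleaner()
    parser.feed(html)
    return " ".join(parser.parts)

print(clean("<html><body><p>real text</p>"
            "<script>var ad = 1;</script><footer>site info</footer>"
            "</body></html>"))   # -> "real text"
```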
3.1.2. Tag Weighting

Tags appear in pairs in Web pages. After the interference in a Web page is removed, the remaining unfiltered tags differ in their content. The analysis of several Web instances shows that the contents of different tags have different levels of importance; thus, different weights should be given to the tags. In this way, the content of a tag receives the same weight as the tag itself, and the feature words are contained in the corresponding contents [22]. For example, the tags have higher weight values while the tags such as