Self-Adaptive Semantic Focused Crawler for Mining Services Information Discovery

H. Dong, Member, IEEE, and F. K. Hussain

Abstract—It is well recognized that the Internet has become the largest marketplace in the world, and online advertising is very popular with numerous industries, including the traditional mining service industry, where mining service advertisements are effective carriers of mining service information. However, service users may encounter three major issues – heterogeneity, ubiquity, and ambiguity – when searching for mining service information over the Internet. In this paper, we present the framework of a novel self-adaptive semantic focused crawler – the SASF crawler – with the purpose of precisely and efficiently discovering, formatting, and indexing mining service information over the Internet, taking into account these three major issues. This framework incorporates the technologies of semantic focused crawling and ontology learning, in order to maintain the performance of the crawler regardless of the variability of the Web environment. The innovations of this research lie in the design of an unsupervised framework for vocabulary-based ontology learning, and a hybrid algorithm for matching semantically relevant concepts and metadata. A series of experiments is conducted in order to evaluate the performance of this crawler. The conclusion and directions for future work are given in the final section.

Index Terms—Mining service industry, ontology learning, semantic focused crawler, service advertisement, service information discovery.

This is a preprint version of the paper: Hai Dong and F. K. Hussain, "Self-Adaptive Semantic Focused Crawler for Mining Services Information Discovery," IEEE Transactions on Industrial Informatics, vol. 10, no. 2, pp. 1616-1626, May 2014, doi: 10.1109/TII.2012.2234472. Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

H. Dong is with the School of Information Systems, Curtin Business School, Curtin University of Technology, Perth, WA 6845, Australia (e-mail: [email protected]).

F. K. Hussain is with the School of Software, Faculty of Engineering and Information Technology, University of Technology, Sydney, Ultimo, NSW 2007, Australia (e-mail: [email protected]).
I. INTRODUCTION

It is well recognized that information technology has a profound effect on the way business is conducted, and the Internet has become the largest marketplace in the world. It is estimated that there were over 2 billion Internet users in 2011, with an estimated annual growth of over 16%, compared with 360 million users in 2000 (http://www.internetworldstats.com/). Innovative business professionals have realized the commercial applications of the Internet for both their customers and strategic partners, turning the Internet
into an enormous shopping mall with a huge catalogue. Consumers are able to browse a huge range of product and service advertisements over the Internet, and buy these goods directly through online transaction systems [1]. Service advertisements form a considerable part of the advertising which takes place over the Internet and have the following features:

Heterogeneity. Given the diversity of services in the real world, many schemes have been proposed to classify services from various perspectives, including the ownership of service instruments [2], the effects of services [3], the nature of the service act, delivery, demand and supply [4], and so on. Nevertheless, there is no publicly agreed scheme available for classifying service advertisements over the Internet. Furthermore, whilst many commercial product and service search engines provide classification schemes of services with the purpose of facilitating a search, they do not really distinguish between product and service advertisements; instead, they combine both into one taxonomy.

Ubiquity. Service advertisements can be registered by service providers through various service registries, including 1) global business search engines, such as Business.com (http://www.business.com/) and Kompass (http://www.kompass.com/); 2) local business directories, such as Google™ Local Business Center (http://www.google.com/local/add/businessCenter/) and local Yellowpages® (http://www.yellowpages.com/); 3) domain-specific business search engines, such as healthcare, industry and tourism business search engines; and 4) search engine advertising, such as Google™ (http://www.google.com/) and Yahoo!® (http://www.yahoo.com/) Advertising Home [5]. These service registries are geographically distributed over the Internet.

Ambiguity. Most online service advertising information is embedded in a vast amount of other information on the Web and is described in natural language; therefore, it may be ambiguous. Moreover, online service information does not have a consistent format and standard, and varies from Web page to Web page.

Mining is one of the oldest industries in human history, having emerged with the beginning of human civilization. Mining services refer to a series of services which support mining, quarrying, and oil and gas extraction activities [6]. In Australia, the mining industry contributed about 7.7% of the Australian GDP between 2007 and 2008, to which the field of mining services contributed 7.65% [6]. Since the advent of the information age, mining service companies have realized the power of online advertising, and they have attempted to promote themselves by actively joining the service advertising community. It was found that nearly 50,000 companies worldwide have registered their services on the Kompass website. However, these mining service advertisements are also subject to the issues of heterogeneity, ubiquity and ambiguity, which prevent users from precisely and efficiently searching for mining service information over the Internet.

Service discovery is an emerging research area in the domain of industrial informatics, which aims to automatically or semi-automatically retrieve services or service information in particular environments by means of various IT methods. Many studies have been carried out in the environments of wireless networks [7-9, 26] and distributed industrial systems [10]. However, few studies have addressed industrial service advertisement discovery in the Web environment, taking into account the heterogeneous, ubiquitous and ambiguous features of service advertising information.

In order to address the above problems, in this paper we propose the framework of a novel self-adaptive semantic focused (SASF) crawler, combining the technologies of semantic focused crawling and ontology learning, whereby semantic focused crawling technology is used to solve the issues of heterogeneity, ubiquity and ambiguity of mining service information, and ontology learning technology is used to maintain the high performance of crawling in the uncontrolled Web environment. This crawler is designed to help search engines precisely and efficiently search mining service information by semantically discovering, formatting, and indexing that information.

The rest of this paper is organized as follows: in Section II, we review the related work in the field of ontology-learning-based focused crawling and address the research issues in this field; in Section III, we present the framework of the SASF crawler, including the mining service ontology and metadata schema and the workflow of this crawler; in Section IV, we deliver a hybrid concept-metadata matching algorithm to help this crawler semantically index mining service information; in Section V, we conduct a series of experiments in order to empirically evaluate the framework of the crawler; and in the final section, we discuss the features and the limitations of this work and propose our future work.

II. RELATED WORK

In this section, we briefly introduce the fields of semantic focused crawling and ontology-learning-based focused crawling, and review previous work on ontology-learning-based focused crawling.

A semantic focused crawler is a software agent that is able to traverse the Web, and retrieve as well as download related Web information on specific topics by means of semantic technologies [11, 12]. Since semantic technologies provide shared knowledge for enhancing the interoperability between heterogeneous components, they have been
broadly applied in the field of industrial automation [13-15]. The goal of semantic focused crawlers is to precisely and efficiently retrieve and download relevant Web information by automatically understanding the semantics underlying the Web information and the semantics underlying the predefined topics. A survey conducted by Dong et al. [16] found that most of the crawlers in this domain make use of ontologies to represent the knowledge underlying topics and Web documents. However, the limitation of ontology-based semantic focused crawlers is that their crawling performance depends crucially on the quality of the ontologies, and this quality may be affected by two issues. The first issue is that, since an ontology is the formal representation of specific domain knowledge [17] and ontologies are designed by domain experts, a discrepancy may exist between the domain experts' understanding of the domain knowledge and the domain knowledge that exists in the real world. The second issue is that knowledge is dynamic and constantly evolving, in contrast to relatively static ontologies. These two situations could lead to the problem that ontologies sometimes cannot precisely represent real-world knowledge, considering the issues of differentiation and dynamism. The reflection of this problem in the field of semantic focused crawling is that the ontologies used by semantic focused crawlers cannot precisely represent the knowledge revealed in Web information, since Web information is mostly created or updated by human users with different understandings of knowledge, and human users are efficient learners of new knowledge. The eventual consequence of this problem is a gradual decline in the performance of semantic focused crawlers.

In order to remedy these defects in ontologies and maintain or enhance the performance of semantic focused crawlers, researchers have begun to pay attention to enhancing semantic focused crawling technologies by integrating them with ontology learning technologies. The goal of ontology learning is to semi-automatically extract facts or patterns from a corpus of data and turn these into machine-readable ontologies [18]. Various techniques have been designed for ontology learning, such as statistics-based techniques, linguistics (or natural language processing)-based techniques, logic-based techniques, etc. These techniques can also be classified into supervised, semi-supervised, and unsupervised techniques from the perspective of learning control. Ontology learning techniques can thus be used to address this limitation of semantic focused crawling, by learning new knowledge from crawled documents and integrating the new knowledge with ontologies in order to constantly refine them. In the rest of this section, we review the two existing studies in the field of ontology-learning-based semantic focused crawling.

Zheng et al. [19] proposed a supervised ontology-learning-based focused crawler that aims to maintain the harvest rate of the crawler in the crawling process. The main idea of this crawler is to construct an artificial neural network (ANN) model to determine the relatedness between a Web
document and an ontology. Given a domain-specific ontology and a topic represented by a concept in the ontology, a set of relevant concepts is selected to represent the background knowledge of the topic by counting the distance between the topic concept and the other concepts in the ontology. The crawler then calculates the term frequency of the relevant concepts occurring in the visited Web documents. Next, the authors used the backpropagation algorithm to train a three-layer feedforward ANN model. The first layer is a linear layer with a transfer function $y = x$; the number of its input nodes depends on the number of relevant concepts. The hidden layer is a sigmoid layer with a transfer function $y = 1/(1 + e^{-x})$, and it has four nodes. The output layer is also a sigmoid layer. The output of the ANN is the relevance score between the topic and a Web document. The training process follows a supervised paradigm, whereby the ANN is trained by labeled Web documents, and does not stop until the root mean square error (RMSE) is less than 0.01. The limitations of this approach are: 1) it can only be used to enhance the harvest rate of crawling but does not have the function of classification; 2) it cannot be used to evolve ontologies by enriching their vocabulary; and 3) the supervised learning may not work within an uncontrolled Web environment with unpredicted new terms.

Su et al. [20] proposed an unsupervised ontology-learning-based focused crawler in order to compute the relevance scores between topics and Web documents. Given a specific domain ontology and a topic represented by a concept in this ontology, the relevance score between a Web document and the topic is the weighted sum of the occurrence frequencies of all the concepts of the ontology in the Web document. The original weight of each concept C_k is $W_{C_k}^0 = 1.00 \times n^{d(C_k, t)}$, where n is a predefined discount factor, and d(C_k, t) is the distance between the topic concept t and C_k. Next, this crawler makes use of reinforcement learning, which is a probabilistic framework for learning optimal decision making from rewards or punishments [21], in order to train the weight of each concept. The learning step follows an unsupervised paradigm, which uses the crawler to download a number of Web documents and learn statistics based on these Web documents. The learning step can be repeated many times. The weight of a concept C_k with respect to a topic t in learning step m is mathematically expressed as follows:

$$W_{C_k}^{m} = \frac{P(t \cap C_k)}{P(t)P(C_k)} W_{C_k}^{m-1} = \frac{P(t \mid C_k)}{P(t)} W_{C_k}^{m-1} = \frac{n_k^t N_c}{n_k N_t} W_{C_k}^{m-1} \qquad (1)$$

where n_k is the number of Web documents in which C_k occurs, n_k^t is the number of Web documents in which C_k and t co-occur, N_c is the total number of Web documents crawled, and N_t is the number of Web documents in which t occurs. Compared with Zheng et al. [19]'s approach, this approach is able to classify Web documents by means of the concepts in an
ontology, to learn the weights of relations between concepts, and to work in an uncontrolled Web environment thanks to the unsupervised learning paradigm. The limitations of Su et al.'s approach are: 1) it cannot be used to enrich the vocabulary of ontologies; and 2) although the unsupervised learning paradigm can work in an uncontrolled Web environment, it may not work well when numerous new terms emerge or when ontologies have a limited range of vocabulary.

By means of a comparative analysis of the two ontology-learning-based focused crawlers, we found a common limitation: neither crawler is able to really evolve ontologies by enriching their contents, namely their vocabularies. Both approaches attempt to use learning models to deduce the quantitative relationship between the occurrence frequencies of the concepts in an ontology and the topic, which may not be applicable in the real Web environment. When numerous unpredictable new terms outside the scope of the vocabulary of an ontology emerge in Web documents, these approaches cannot determine the relatedness between the new terms and the topic, and cannot make use of the new terms for the relatedness determination, which could result in a decline in their performance. Consequently, in order to address this research issue, we propose to design an innovative ontology-learning-based focused crawler that precisely discovers, formats and indexes relevant Web documents in the uncontrolled Web environment.
III. SYSTEM COMPONENTS AND WORKFLOW

In this section, we introduce the system architecture and the workflow of the proposed SASF crawler. It needs to be noted that this crawler is built upon the semantic focused crawler designed in our previous research [11, 12]. The differences between this work and the previous work can be summarized as follows:

- Our previous research created a purely semantic focused crawler, which does not have an ontology-learning function to automatically evolve the utilized ontology. This research aims to remedy this shortcoming.
- Our previous work utilized service ontologies and service metadata formats especially designed for the transportation service domain and the health care service domain. In this research, we design a mining service ontology and a mining service metadata schema to solve the problem of self-adaptive service information discovery for the mining service industry.

An overview of the system architecture and the workflow is shown in Fig. 1. As can be seen, the SASF crawler consists of two knowledge bases – a Mining Service Ontology Base and a Mining Service Metadata Base – and a series of processes, as well as a workflow coordinating these processes. In the rest of this section, we introduce the two knowledge bases and each process in this workflow.
Fig. 1. System architecture and workflow of the proposed self-adaptive semantic focused crawler
A. Mining Service Ontology Base and Mining Service Metadata Base

The Mining Service Ontology Base is used to store a mining service ontology, which is the representation of specific mining service domain knowledge. Concepts in the mining service ontology are associated by generalization/specialization relationships and are organized in the form of a four-level hierarchy. Fig. 2 shows the first and second levels of the concepts in the mining service ontology. Each concept in the mining service ontology represents a mining service sub-domain, and is defined by three properties – conceptDescription, learnedConceptDescription, and linkedMetadata – which are expressed as follows:

- The conceptDescription property is a datatype property used to store the textual descriptions of a mining service concept, which consist of several phrases that concisely summarize the discriminating features of the corresponding mining service sub-domain. The contents of each conceptDescription property are manually specified by domain experts, and will be used to calculate the similarity value between a mining service concept and a mining service metadata.
- The learnedConceptDescription property is a datatype property that has a purpose similar to that of the conceptDescription property. The difference between the two properties is that the former is automatically learned from Web documents by the SASF crawler.
- The linkedMetadata property is an object property used to associate a mining service concept with semantically relevant mining service metadata. This property is used to semantically index the generated mining service metadata by means of the concepts in the mining service ontology.

The Mining Service Metadata Base is used to store the automatically generated and indexed mining service metadata. Mining service metadata is the abstraction of an actual mining service advertisement published in a Web document. The mining service metadata schema follows a structure similar to the health service metadata schema defined in our previous work [12], by which mining service metadata comprise two parts – mining service provider metadata and mining service metadata. Mining service provider metadata is the abstraction of a service provider's profile, including the service provider's basic introduction, address, contact information, and so on. Mining service metadata has the properties of serviceDescription and linkedConcept, which are stated as follows:

- The serviceDescription property is a datatype property, which contains the text used to describe the general features of an actual service. The contents of this property are automatically extracted from mining service advertisements by the SASF crawler. This property will be used for the subsequent concept-metadata similarity computation.
- The linkedConcept property is the inverse property of the linkedMetadata property, which stores the URIs of the semantically relevant mining service concepts of mining service metadata.

It needs to be noted that mining service metadata and mining service concepts can have a many-to-many relationship. In addition, mining service metadata and the relevant mining service provider metadata are associated by an object property isProvidedBy, and this association follows a many-to-one relationship, because, in fact, a service provider can provide more than one service.
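For concreteness, the following plain-Java sketch (our own illustration; the actual bases are built in OWL-DL, as described in Section V-A, and all class names here are assumptions) mirrors the schema just described:

```java
import java.util.*;

/** Illustrative plain-Java view of the two knowledge bases' schema.
 *  Field names mirror the ontology properties described above. */
class MiningServiceConcept {
    List<String> conceptDescription = new ArrayList<>();        // expert-authored phrases
    List<String> learnedConceptDescription = new ArrayList<>(); // phrases learned from the Web
    List<MiningServiceMetadata> linkedMetadata = new ArrayList<>(); // semantically indexed metadata
}

class MiningServiceMetadata {
    String serviceDescription;                                    // extracted from the advertisement
    List<MiningServiceConcept> linkedConcept = new ArrayList<>(); // inverse of linkedMetadata (many-to-many)
    MiningServiceProviderMetadata isProvidedBy;                   // many metadata -> one provider
}

class MiningServiceProviderMetadata {
    String introduction, address, contactInformation;             // provider profile fields
}
```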
Fig. 2. The mining service ontology
B. System Workflow

In this section, we introduce the system workflow of the SASF crawler step by step, as shown in Fig. 1. The primary goals of this crawler are: 1) to generate mining service metadata from Web pages; and 2) to precisely associate the semantically relevant mining service concepts and mining service metadata at a relatively low computing cost. The second goal is realized by: 1) measuring the semantic relatedness between the conceptDescription and learnedConceptDescription property values of the concepts and the serviceDescription property values of the metadata; and 2) automatically learning new values, namely descriptive phrases, for the learnedConceptDescription properties of the concepts.

As can be seen in Fig. 1, the first step is preprocessing, which processes the contents of the conceptDescription property of each concept in the ontology before matching the metadata and the concepts. This processing is realized by using the Java WordNet Library (JWNL, http://sourceforge.net/projects/jwordnet/) to implement tokenization, part-of-speech (POS) tagging, nonsense word filtering, stemming, and synonym searching for the conceptDescription property values of the concepts.

The second and third steps are crawling and term extraction. The aim of these two processes is to download k Web pages from the Internet at one time (k will be explained in Section IV-B), and to extract the required information from the downloaded Web pages, according to the mining service metadata schema and the mining service provider metadata schema defined in Section III-A, in order to prepare the property values for generating a new group of metadata. These two processes are realized by the semantic focused crawler designed in our previous work [11, 12].

The next step is term processing, which processes the content of the serviceDescription property of the metadata in order to prepare for the subsequent concept-metadata matching. The implementation of this process is similar to that of the preprocessing process. The major difference is that term processing does not need the function of synonym searching, for two major reasons: 1) the synonyms of the terms in the conceptDescription properties of concepts have already been retrieved in preprocessing; and 2) the computing cost of synonym searching for the terms in the serviceDescription property is relatively high and this may affect the scalability of the SASF crawler, as term processing is a real-time process.

The rest of the workflow can be integrated as a self-adaptive metadata association and ontology learning process. The details of this process are as follows: first of all, the direct string matching process examines whether or not the contents of the serviceDescription property of metadata are included in the conceptDescription and learnedConceptDescription properties of a concept. If the answer is 'yes', then the concept and the metadata are regarded as semantically relevant. By means of the metadata generation and association process, the metadata can then be generated and stored in the Mining Service Metadata Base, as well as being associated with the concept. If the answer is 'no', an algorithm-based string matching process will be invoked to check the semantic relatedness between the metadata and the concept, by means of a concept-metadata semantic similarity algorithm (introduced in Section IV). If the concept and the metadata are semantically relevant, the contents of the serviceDescription property of the metadata can be regarded as a new value for the learnedConceptDescription property of the concept. The metadata is thus allowed to go through the metadata generation and association process; otherwise, the metadata is regarded as semantically non-relevant to the concept. The above process is repeated until all the concepts in the mining service ontology have been compared with the metadata. If none of the concepts is semantically relevant to the metadata, the metadata is regarded as semantically non-relevant to the mining service domain and will be dropped.

It needs to be noted that only the conceptDescription property values of the concepts can be used in the algorithm-based string matching process, due to the fact that the semantic relatedness between the concept and the metadata is determined by comparing their algorithm-based property similarity values with a threshold value. If the maximum similarity value between the serviceDescription property value of a metadata and the conceptDescription property values of a concept is higher than the threshold value, the metadata and the concept are regarded as semantically relevant; otherwise not.
Hence, the threshold value can be viewed as the boundary for determining whether or not the serviceDescription property value of the metadata is semantically relevant to a conceptDescription property value of a concept, and the conceptDescription property values of this concept can be viewed as the foundation for constructing this boundary. On one hand, this boundary is relatively stable, as both the threshold value and the conceptDescription property values are relatively stable; on the other hand, the maximum similarity values between the conceptDescription property values and the learnedConceptDescription property values are relatively dynamic (within the interval [threshold, 1] according to the algorithm introduced in Section IV). Therefore, the learnedConceptDescription property values of the concepts cannot be used in the algorithm-based string matching process and can only be used in the direct string matching process.
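The self-adaptive loop described above can be summarized in the following minimal sketch (our own illustration, reusing the illustrative classes from Section III-A; the helper names are assumptions, the threshold value of 0.6 anticipates the empirical finding of Section V-C, and hybridSimilarity is a placeholder for the algorithm of Section IV):

```java
import java.util.List;

/** Sketch of the self-adaptive metadata association and ontology learning loop of Fig. 1. */
public class SelfAdaptiveMatcher {
    static final double THRESHOLD = 0.6;  // optimal value found empirically in Sec. V-C

    /** Returns true if the metadata was associated with at least one concept. */
    static boolean associate(MiningServiceMetadata md, List<MiningServiceConcept> ontology) {
        boolean associated = false;
        for (MiningServiceConcept c : ontology) {
            // 1) Direct string matching against expert-authored AND learned descriptions.
            if (c.conceptDescription.contains(md.serviceDescription)
                    || c.learnedConceptDescription.contains(md.serviceDescription)) {
                link(md, c);
                associated = true;
                continue;
            }
            // 2) Algorithm-based matching against expert-authored descriptions only.
            double best = c.conceptDescription.stream()
                    .mapToDouble(cd -> hybridSimilarity(md.serviceDescription, cd))
                    .max().orElse(0.0);
            if (best > THRESHOLD) {
                // Ontology learning: the service description becomes a learned phrase.
                c.learnedConceptDescription.add(md.serviceDescription);
                link(md, c);
                associated = true;
            }
        }
        return associated;  // false => metadata is dropped as non-relevant to the domain
    }

    static void link(MiningServiceMetadata md, MiningServiceConcept c) {
        c.linkedMetadata.add(md);
        md.linkedConcept.add(c);
    }

    static double hybridSimilarity(String sd, String cd) {
        return 0.0;  // placeholder for Eq. (7) of Section IV
    }
}
```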
IV. CONCEPT-METADATA SEMANTIC SIMILARITY ALGORITHM

In this section, we introduce a novel concept-metadata semantic similarity algorithm to judge the semantic relatedness between concepts and metadata in the algorithm-based string matching process (Fig. 1). The major goal of this algorithm is to measure the semantic similarity between a concept description and a service description. This algorithm follows a hybrid pattern by aggregating a semantic-based string matching (SeSM) algorithm and a statistics-based string matching (StSM) algorithm. In the rest of this section, we describe these two algorithms in detail.

A. Semantic-based String Matching Algorithm

The key idea of the SeSM algorithm is to measure the text similarity between a concept description and a service description, by means of WordNet (http://wordnet.princeton.edu/) and a semantic similarity model. As the concept description and the service description can be regarded as two groups of terms after the preprocessing and term processing phases, first of all, we need to examine the semantic similarity between any two terms from these two groups.
Fig. 3. Graphical representation of the assignment in the bipartite graph problem
Here we make use of Resnik [22]'s information-theoretic model and WordNet to achieve this goal. Since terms (or concepts) in WordNet are organized in a hierarchical structure, in which concepts have hypernym/hyponym relationships, it is possible to assess the similarity between two concepts by comparing their relative positions in WordNet. Resnik's model can be expressed as follows:

$$sim_{Resnik}(C_1, C_2) = \max_{C \in S(C_1, C_2)} [-\log P(C)] \qquad (2)$$

where C_1 and C_2 are two concepts in WordNet, S(C_1, C_2) is the set of concepts that subsume both C_1 and C_2, and P(C) is the probability of encountering a sub-concept of C. Hence,

$$P(C) = \nu(C) / \nu \qquad (3)$$

where ν(C) is the number of concepts subsumed by C, and ν is the total number of concepts in WordNet. It needs to be noted that a concept sometimes consists of more than one term in WordNet, so concepts do not always equate to terms. Since the result of Resnik's model is within the interval [0, ∞], we make use of a model introduced by Dong et al. [23] to normalize the result into the interval [0, 1], which can be expressed as follows:

$$sim(C_1, C_2) = \begin{cases} \dfrac{sim_{Resnik}(C_1, C_2)}{\max_C [-\log P(C)]} & \text{if } C_1 \neq C_2 \\ 1 & \text{if } C_1 = C_2 \text{ or } C_1 \in \sigma(C_2) \end{cases} \qquad (4)$$

where σ(C_2) is the synset of C_2.

In the above step, we calculated the similarity values between any two terms in a concept description and a service description. We now make use of Plebani et al. [24]'s bipartite graph model to optimally assign the matching between the terms from each group. Given a graph G = (V, E), where V is a set of vertices and E is a set of edges linking the vertices, a matching in G is defined as M ⊆ E such that no two edges in M share a common end vertex. An assignment in G is a matching M such that each vertex in G has an incident edge in M. Let us suppose that the set of vertices is partitioned into two sets P (namely the terms in the service description) and Q (namely the terms in the concept description), and that each edge in this graph has an associated weight w within the interval [0, 1] given by Equation (4). A function maxSim_S: (w, P, Q) → [0, 1] returns the maximum weighted assignment, i.e., an assignment such that the average weight of the edges is maximum. Fig. 3 shows the graphical representation of the assignment in the bipartite graph problem. The assignment in bipartite graphs can be expressed as a linear programming model:

$$maxSim_S(w, P, Q) = \max \frac{1}{|P|} \sum_{i \in I} w(p_i, q_{j_i}), \quad i \in I,\ j_i \in J,\ I = [1..|P|],\ J = [1..|Q|] \qquad (5)$$
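To illustrate how the SeSM score of Equation (5) behaves, the following minimal Java sketch (our own illustration, not the authors' implementation) computes the best one-to-one term assignment by exhaustive search; termSim is a placeholder for the normalized Resnik similarity of Equation (4), which would require a WordNet backend, and a production version would use the Hungarian algorithm instead of exhaustive search:

```java
import java.util.*;

/** Simplified sketch of the SeSM score (Eq. 5): average weight of the best
 *  one-to-one assignment of service-description terms to concept-description terms. */
public class SeSMSketch {

    /** Placeholder for Eq. (4); a real implementation would query WordNet. */
    static double termSim(String a, String b) {
        return a.equalsIgnoreCase(b) ? 1.0 : 0.0;  // exact match only, for illustration
    }

    /** Exhaustive search over assignments; suitable for small term groups only. */
    static double maxSimS(List<String> p, List<String> q) {
        return best(p, 0, q, new boolean[q.size()]) / p.size();
    }

    static double best(List<String> p, int i, List<String> q, boolean[] used) {
        if (i == p.size()) return 0.0;
        double max = best(p, i + 1, q, used);  // option: leave p_i unmatched (weight 0)
        for (int j = 0; j < q.size(); j++) {
            if (!used[j]) {
                used[j] = true;
                max = Math.max(max, termSim(p.get(i), q.get(j)) + best(p, i + 1, q, used));
                used[j] = false;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // The example discussed in Section IV-B, after stemming maps "mining" to "mine":
        List<String> sd = Arrays.asList("old", "mine", "workings", "consolidation", "contractor");
        List<String> cd = Arrays.asList("mine", "contractor");
        System.out.println(maxSimS(sd, cd));  // (1 + 1) / 5 = 0.4
    }
}
```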
B. Statistics-based String Matching Algorithm

The StSM algorithm is a complementary solution for the SeSM algorithm, in case the latter does not work effectively in some circumstances. For example, for a service description "old mine workings consolidation contractor" and a concept description "mining contractor", the similarity value is (1+1)/5 = 0.4 according to the SeSM algorithm, which is lower than the actual extent of their semantic relevance. In this circumstance, we need an alternative way to measure their similarity. Here we make use of a statistics-based model [25] to achieve this goal. In the crawling process and the subsequent processes indicated in Fig. 1, the SASF crawler downloads k Web pages at the beginning, and automatically obtains statistical data from the k Web pages, in order to compute the semantic relevance between a service description (SD_i) and a concept description (CD_{j,h}) of a concept (C_j). The StSM algorithm follows an unsupervised training paradigm aimed at finding the maximum probability that CD_{j,h} and SD_i co-occur in the Web pages. A graphical representation of the StSM algorithm is shown in Fig. 4. The StSM algorithm is expressed as follows:

$$maxSim_P(SD_i, CD_{j,h}) = \max_{CD_{j,\theta} \in C_j} [P(CD_{j,\theta} \mid CD_{j,h}) \cdot P(CD_{j,\theta} \mid SD_i)] = \max_{CD_{j,\theta} \in C_j} \frac{n_{j,h}^{j,\theta}}{n_{j,h}} \cdot \frac{n_i^{j,\theta}}{n_i} \qquad (6)$$

where CD_{j,θ} is a concept description of C_j, n_{j,h}^{j,θ} is the number of Web pages that contain both CD_{j,θ} and CD_{j,h}, n_{j,h} is the number of Web pages that contain CD_{j,h}, n_i^{j,θ} is the number of Web pages that contain both CD_{j,θ} and SD_i, and n_i is the number of Web pages whose metadata contain SD_i.

Fig. 4. Graphical representation of the probabilistic model

C. Hybrid Algorithm

On top of the SeSM and StSM algorithms, a hybrid algorithm is required to seek the maximum of the similarity values produced by the two algorithms:

$$maxSim(SD_i, CD_{j,h}) = \max[maxSim_S(w_{i,j}, SD_i, CD_{j,h}),\ maxSim_P(SD_i, CD_{j,h})] \qquad (7)$$
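The following sketch (again our own illustration; the counting structures are assumptions rather than the authors' data model) shows how Equations (6) and (7) can be estimated from page-occurrence counts gathered over the k downloaded Web pages:

```java
import java.util.*;

/** Sketch of the StSM score (Eq. 6) and the hybrid score (Eq. 7),
 *  assuming occurrence counts have been collected from the k downloaded pages. */
public class StSMSketch {

    /** pagesWith.get(s) = indices of downloaded pages containing string s. */
    static Map<String, Set<Integer>> pagesWith = new HashMap<>();

    static int cooccur(String a, String b) {
        Set<Integer> both = new HashSet<>(pagesWith.getOrDefault(a, Set.of()));
        both.retainAll(pagesWith.getOrDefault(b, Set.of()));
        return both.size();
    }

    /** Eq. (6): max over the concept's descriptions CD_{j,theta} of
     *  P(CD_theta | CD_h) * P(CD_theta | SD_i), estimated by counts. */
    static double maxSimP(String sd, String cdH, List<String> conceptDescriptions) {
        int nH = pagesWith.getOrDefault(cdH, Set.of()).size();
        int nI = pagesWith.getOrDefault(sd, Set.of()).size();
        if (nH == 0 || nI == 0) return 0.0;
        double max = 0.0;
        for (String cdTheta : conceptDescriptions) {
            double p = (cooccur(cdTheta, cdH) / (double) nH)
                     * (cooccur(cdTheta, sd) / (double) nI);
            max = Math.max(max, p);
        }
        return max;
    }

    /** Eq. (7): the hybrid score takes whichever algorithm scores higher. */
    static double maxSim(double seSMScore, double stSMScore) {
        return Math.max(seSMScore, stSMScore);
    }
}
```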
V. SYSTEM IMPLEMENTATION AND EVALUATION

In this section, in order to systematically evaluate the proposed SASF crawler, we implement a prototype of this crawler, and compare its performance with that of the existing work reviewed in Section II, based on several performance indicators adopted from the information retrieval (IR) field.

A. Prototype Implementation

We implement the prototype of this SASF crawler in Java
within the platform of Eclipse 3.7.1 (http://www.eclipse.org/). This prototype is an extension of our previous versions presented in [11, 12]. The Mining Service Ontology Base and the Mining Service Metadata Base are built in OWL-DL within the platform of Protégé 3.4.7 (http://protege.stanford.edu/). The mining service ontology consists of 158 concepts, knowledge of which is mostly referenced from Wikipedia (http://en.wikipedia.org/), the Australian Bureau of Statistics (http://www.abs.gov.au/), and the websites of nearly 200 Australian and international mining service companies.

B. Performance Indicators

We define the parameters for a comparison between our crawler and the existing ontology-learning-based focused crawlers. All the indicators are adopted from the field of IR and need to be redefined in order to be applied in the scenario of ontology-based focused crawling.

Harvest rate is used to measure the harvesting ability of a crawler. The harvest rate for a crawler ε after crawling μ Web pages is determined as follows:

$$HR_\mu(\varepsilon) = \frac{|\Lambda_\mu|}{|\Omega_\mu|} \qquad (8)$$

where |Λ_μ| is the number of associated metadata from the μ Web pages, and |Ω_μ| is the number of generated metadata from the μ Web pages.

Precision is used to measure the preciseness of a crawler. The precision for a concept C_j after crawling μ Web pages is determined as follows:

$$P_\mu(C_j) = \frac{|\{\lambda_i^j \mid \lambda_i^j \in R_j\}|}{|\Lambda_\mu^j|} \qquad (9)$$

where Λ_μ^j is the set of associated metadata from the μ Web pages for C_j (with λ_i^j denoting its members), |Λ_μ^j| is the number of associated metadata from the μ Web pages for C_j, and R_j is the set of relevant metadata from the μ Web pages for C_j. It needs to be noted that the set of relevant metadata needs to be manually identified by operators before the evaluation.

Recall is used to measure the effectiveness of a crawler. The recall for a concept C_j after crawling μ Web pages is determined as follows:

$$R_\mu(C_j) = \frac{|\{\lambda_i^j \mid \lambda_i^j \in R_j\}|}{|R_j|} \qquad (10)$$

where |R_j| is the number of relevant metadata from the μ Web pages for C_j.

Harmonic mean is a measure of the aggregated performance of a crawler. The harmonic mean for a concept C_j after crawling μ Web pages is determined as follows:

$$HM_\mu(C_j) = \frac{2 \cdot P_\mu(C_j) \cdot R_\mu(C_j)}{P_\mu(C_j) + R_\mu(C_j)} \qquad (11)$$

Fallout is used to measure the inaccuracy of a crawler. The fallout for a concept C_j after crawling μ Web pages is determined as follows:

$$F_\mu(C_j) = \frac{|\{\lambda_i^j \mid \lambda_i^j \in \bar{R}_j\}|}{|\bar{R}_j|} \qquad (12)$$

where R̄_j is the set of non-relevant metadata from the μ Web pages for C_j, and |R̄_j| is the number of non-relevant metadata from the μ Web pages for C_j. It needs to be noted that the set of non-relevant metadata needs to be manually identified by operators before the evaluation.

Crawling time is used to measure the efficiency of a crawler. The crawling time of the SASF crawler for a Web page is defined as the time interval from processing the Web page in the Crawling process to the Metadata Generation and Association process or to the Filtering process, as shown in Fig. 1.
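As a concrete reading of these definitions, the sketch below (our own illustration; the field names are assumptions) computes Equations (8)-(12) from raw counts:

```java
/** Illustrative calculator for Eqs. (8)-(12). */
public class CrawlerMetrics {
    long associated;       // |Lambda_mu^j|: metadata associated with concept C_j
    long generated;        // |Omega_mu|: all metadata generated from the mu pages
    long relevantHit;      // associated metadata judged relevant to C_j
    long relevantTotal;    // |R_j|: all relevant metadata among the mu pages
    long nonRelevantHit;   // associated metadata judged non-relevant to C_j
    long nonRelevantTotal; // |R-bar_j|: all non-relevant metadata among the mu pages

    double harvestRate() { return (double) associated / generated; }           // Eq. (8)
    double precision()   { return (double) relevantHit / associated; }         // Eq. (9)
    double recall()      { return (double) relevantHit / relevantTotal; }      // Eq. (10)
    double harmonicMean() {                                                    // Eq. (11)
        double p = precision(), r = recall();
        return p + r == 0 ? 0 : 2 * p * r / (p + r);
    }
    double fallout()     { return (double) nonRelevantHit / nonRelevantTotal; } // Eq. (12)
}
```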
C. System Evaluation

In this section, we evaluate our SASF crawler by comparing its performance with that of the existing ontology-learning-based focused crawlers of Zheng et al. and Su et al., introduced in Section II.

Testing data source. As mentioned in Section II, one common defect of the existing ontology-learning-based focused crawlers is that they are not able to work in an uncontrolled Web environment with unpredicted new terms, due to the limitations of the adopted ontology learning approaches. Hence, our proposed SASF crawler aims to remedy this defect by combining a real-time SeSM algorithm and an unsupervised StSM algorithm. In order to evaluate our model and the existing models in the uncontrolled Web environment, we choose two mainstream business advertising websites – Australian Kompass (http://au.kompass.com/, abbreviated as Kompass below) and Australian Yellowpages® (http://www.yellowpages.com.au/, abbreviated as Yellowpages® below) – as the test data sources. There are around 800 downloadable mining-related service or product advertisements registered in Kompass, and around 3200 similar
Fig. 5. Comparison of the ontology-learning-based focused crawling models on harvest rate
Fig. 7. Comparison of the ontology-learning-based focused crawling models on recall
advertisements registered in Yellowpages®. All of them are published in English.

Test environment and targets. The primary objective of the ontology-learning-based focused crawlers is to precisely and efficiently download and index relevant Web information. We therefore employ the ontology learning and Web page classification models used in each of the focused crawlers – the ANN model used in Zheng et al.'s crawler, the probabilistic model used in Su et al.'s crawler, and our self-adaptive model – together in our SASF crawler framework, for the task of metadata harvesting and indexing (or classification) (i.e., the Metadata Generation and Ontology Learning process in Fig. 1), and compare their performance in this process. All of these models need a training process before harvesting and classification: the ANN model follows a supervised training paradigm, and the other two models follow an unsupervised training paradigm. Therefore, we use the Kompass website as the training data source, and label the Web pages from this website for the ANN training. Following that, we test and compare the performance of these models by using the unlabelled data source from Yellowpages®, with the purpose of evaluating their capability in an uncontrolled environment. In addition, by means of a series of experiments, we find that 0.6 is the optimal threshold value for the self-adaptive model to determine the relatedness between a pair of service description and concept description.

Test results. We compare the performance of the ANN model, the probabilistic model, and the self-adaptive model in terms of the six parameters defined in Section V-B. It needs to be noted that we evaluate the performance of the ANN model based only on the parameters of harvest rate and crawling time, since the major purpose of the ANN model is to enhance the harvest rate, and it does not contain the function of classification. In the rest of this section, we present and discuss the test results.

Harvest rate. The graphic representation of the comparison of the probabilistic, self-adaptive and ANN models on harvest rate, along with the increasing number of visited Web pages, is shown in Fig. 5. It needs to be noted that the harvest rate concerns only the crawling ability, not the accuracy, of a crawler. A high proportion (around 40%) of Web pages are peer-reviewed as non-mining-service-related Web pages in the
unlabeled data source, which is part of the reason that the overall harvest rates of these three models are all below 60%. It can be seen that the self-adaptive model has the optimal performance (more than 50%), compared to the ANN model (around 16%) and the probabilistic model (between 8% and 9%). This proves that the self-adaptive model has a positive impact on improving the crawling ability of the semantic focused crawler, as more service descriptions extracted from the Web pages are matched to the learned concept descriptions.

Precision. The graphic representation of the comparison of the probabilistic and self-adaptive models on precision, along with the increasing number of visited Web pages, is shown in Fig. 6. It can be observed that the overall precision of the self-adaptive model is 32.50%, and the overall precision of the probabilistic model is 13.46%, which is less than half of that of the former. This is because the self-adaptive model is able to filter out more non-relevant mining service Web pages based on its vocabulary-based ontology learning function. This proves that the self-adaptive model significantly enhances the preciseness of semantic focused crawling.

Recall. The graphic representation of the comparison of the probabilistic and self-adaptive models on recall, along with the increasing number of visited Web pages, is shown in Fig. 7. It can be seen that the overall recall of the self-adaptive model is 65.86%, compared to only 9.62% for the probabilistic model. This is because the self-adaptive model is able to generate more relevant mining service metadata based on its vocabulary-based ontology learning function, and thus improves the effectiveness of the semantic focused crawler.

Harmonic mean. The graphic representation of the comparison of the probabilistic and self-adaptive models on harmonic mean, along with the increasing number of visited Web pages, is shown in Fig. 8. As an aggregated parameter, the overall harmonic mean values for both of these models are below 50% (11.22% for the probabilistic model, and 43.51% for the self-adaptive model), due to their low performance on precision. Since the self-adaptive model outperforms the probabilistic model on both precision and recall, it is not surprising that the overall harmonic mean of the former is nearly four times as high as that of the latter.

Fallout rate. The graphic representation of the comparison of the probabilistic and self-adaptive models on fallout rate,
Fig. 9. Comparison of the ontology-learning-based focused crawling models on fallout rate
Fig. 10. Comparison of the ontology-learning-based focused crawling models on crawling time
along with the increasing number of visited Web pages, is shown in Fig. 9. It can be seen that the overall fallout rate of the self-adaptive model is 0.46%, and that of the probabilistic model is around 0.49%. This indicates that the former generates fewer false results than the latter, which proves the low inaccuracy of the self-adaptive model.

Crawling time. The graphic representation of the comparison of the probabilistic, self-adaptive and ANN models on crawling time, along with the increasing number of visited Web pages, is shown in Fig. 10. It can be seen that the ANN model uses less time than the others, since it does not need time for metadata classification. Comparing the probabilistic model and the self-adaptive model, for the first 200 pages the former runs faster than the latter. However, after 200 pages, the latter gradually gathers speed, in contrast to the relatively fixed speed of the former. Eventually, from the 1200th page onwards, the total time of the latter is less than that of the former, and this gap becomes larger and larger as the number of visited Web pages increases. This is because the crawler is able to use the self-adaptive metadata association and ontology learning process in the self-adaptive model to learn new concept descriptions at the beginning. Afterwards, along with the enrichment of the learned concept descriptions, it needs less and less time to run the algorithm-based string matching and ontology learning processes. Instead, it runs only the direct string matching process in order to implement the concept-metadata matching, which greatly saves crawling time. Overall, the average crawling speed of the ANN model is 77.3 ms/page, that of the probabilistic model is 91.9 ms/page, and that of the self-adaptive model is 81.5 ms/page. These results prove the efficiency of the self-adaptive model for crawling large numbers of Web pages.

Summary. From the above comparison of the three models, it can be seen that the self-adaptive model has the strongest performance on nearly all of the six indicators, with the exception of a slightly slower crawling time compared with that of the ANN model. The self-adaptive model outperforms its competitors by several times on the parameters of harvest rate, precision, recall, and harmonic mean, which shows the significant technical advantage of this model. In addition, in contrast to its competitors, the self-adaptive model incurs a relatively lower computing cost while retaining strong performance. Last but not least, from Figs. 5 to 10, it can be observed that all of the performance curves of the self-adaptive model are relatively smooth, regardless of the number of visited Web pages, thereby partly fulfilling our goals. Therefore, these test results indicate that the proposed framework of the SASF crawler is feasible.
Fig. 6. Comparison of the ontology-learning-based focused crawling models on precision
Fig. 8. Comparison of the ontology-learning-based focused crawling models on harmonic mean
VI. CONCLUSION AND FUTURE WORK

In this paper, we presented an innovative ontology-learning-based focused crawler – the SASF crawler – for service information discovery in the mining service industry, taking into account the heterogeneous, ubiquitous and ambiguous nature of mining service information available over the Internet. This approach involved an innovative unsupervised framework for vocabulary-based ontology learning, and a novel concept-metadata matching algorithm, which combines a semantic-similarity-based SeSM algorithm and a probability-based StSM algorithm for associating semantically relevant mining service concepts and mining service metadata. This approach enables the crawler to work in an uncontrolled Web environment where numerous new terms emerge and the ontologies used by the crawler have a limited range of vocabulary. Subsequently, we conducted a series of experiments to empirically evaluate the performance of the SASF crawler, by comparing this approach with the existing approaches on the six parameters adopted from the IR field.

We describe the limitations of this approach and our future work as follows: in the evaluation phase, it can be clearly seen that the performance of the self-adaptive model did not completely meet our expectations regarding the parameters of precision and recall. We deduce two reasons for this issue:

Firstly, in this research, we tried to find a universal threshold value for the concept-metadata semantic similarity algorithm in order to set up a boundary for determining concept-metadata relatedness. However, in order to achieve optimal performance, each concept should have its own particular boundary, namely a particular threshold value, for the judgment of relatedness. Consequently, in future research, we intend to design a semi-supervised approach by aggregating the unsupervised approach and the supervised ontology-learning-based approach, with the purpose of automatically choosing the optimal threshold value for each concept, while keeping optimal performance without being constrained by the limitations of the training data set.

Secondly, the relevant service descriptions for each concept were manually determined through a peer-review process; i.e., many relevant service descriptions and concept descriptions were matched on the basis of common sense, which cannot be judged by string similarity or term co-occurrence. Hence, in our future research, it is necessary to enrich the vocabulary of the mining service ontology by surveying those unmatched but relevant service descriptions, in order to further improve the performance of the SASF crawler.
REFERENCES

[1] H. Wang, M. K. O. Lee, and C. Wang, "Consumer privacy concerns about Internet marketing," Commun. ACM, vol. 41, pp. 63-70, 1998.
[2] R. C. Judd, "The case for redefining services," J. of Marketing, vol. 28, pp. 58-59, 1964.
[3] T. P. Hill, "On goods and services," Rev. of Income and Wealth, vol. 23, pp. 315-338, 1977.
[4] C. H. Lovelock, "Classifying services to gain strategic marketing insights," J. of Marketing, vol. 47, pp. 9-20, 1983.
[5] H. Dong, F. K. Hussain, and E. Chang, "A service search engine for the industrial digital ecosystems," IEEE Trans. Ind. Electron., vol. 58, pp. 2183-2196, 2011.
[6] "Mining services in the US: Market research report," IBISWorld, 2011.
[7] B. Fabian, T. Ermakova, and C. Muller, "SHARDIS – A privacy-enhanced discovery service for RFID-based product information," IEEE Trans. Ind. Informat., vol. 8, pp. 707-718, 2012.
[8] H. L. Goh, K. K. Tan, S. Huang, and C. W. de Silva, "Development of Bluewave: A wireless protocol for industrial automation," IEEE Trans. Ind. Informat., vol. 2, pp. 221-230, 2006.
[9] M. Ruta, F. Scioscia, E. Di Sciascio, and G. Loseto, "Semantic-based enhancement of ISO/IEC 14543-3 EIB/KNX standard for building automation," IEEE Trans. Ind. Informat., vol. 7, pp. 731-739, 2011.
[10] I. M. Delamer and J. L. M. Lastra, "Service-oriented architecture for distributed publish/subscribe middleware in electronics production," IEEE Trans. Ind. Informat., vol. 2, pp. 281-294, 2006.
[11] H. Dong and F. K. Hussain, "Focused crawling for automatic service discovery, annotation, and classification in industrial digital ecosystems," IEEE Trans. Ind. Electron., vol. 58, pp. 2106-2116, 2011.
[12] H. Dong, F. K. Hussain, and E. Chang, "A framework for discovering and classifying ubiquitous services in digital health ecosystems," J. of Comput. and Syst. Sci., vol. 77, pp. 687-704, 2011.
[13] J. L. M. Lastra and M. Delamer, "Semantic web services in factory automation: Fundamental insights and research roadmap," IEEE Trans. Ind. Informat., vol. 2, pp. 1-11, 2006.
[14] S. Runde and A. Fay, "Software support for building automation requirements engineering – An application of semantic web technologies in automation," IEEE Trans. Ind. Informat., vol. 7, pp. 723-730, 2011.
[15] M. Ruta, F. Scioscia, E. Di Sciascio, and G. Loseto, "Semantic-based enhancement of ISO/IEC 14543-3 EIB/KNX standard for building automation," IEEE Trans. Ind. Informat., vol. 7, pp. 731-739, 2011.
[16] H. Dong, F. Hussain, and E. Chang, "State of the art in semantic focused crawlers," in Computational Sci. and Its Applicat. – ICCSA 2009, vol. 5593, O. Gervasi, D. Taniar, B. Murgante, A. Lagana, Y. Mun, and M. Gavrilova, Eds. Berlin, Germany: Springer, 2009, pp. 910-924.
[17] T. R. Gruber, "A translation approach to portable ontology specifications," Knowl. Acquisition, vol. 5, pp. 199-220, 1993.
[18] W. Wong, W. Liu, and M. Bennamoun, "Ontology learning from text: A look back and into the future," ACM Comput. Surveys, vol. 44, pp. 20:1-20:36, 2012.
[19] H.-T. Zheng, B.-Y. Kang, and H.-G. Kim, "An ontology-based approach to learnable focused crawling," Inform. Sciences, vol. 178, pp. 4512-4522, 2008.
[20] C. Su, Y. Gao, J. Yang, and B. Luo, "An efficient adaptive focused crawler based on ontology learning," in Proc. Fifth Int. Conf. on Hybrid Intelligent Syst. (HIS '05), Rio de Janeiro, Brazil, 2005, pp. 73-78.
[21] J. Rennie and A. McCallum, "Using reinforcement learning to spider the Web efficiently," in Proc. Sixteenth Int. Conf. on Mach. Learning (ICML '99), Bled, Slovenia, 1999, pp. 335-343.
[22] P. Resnik, "Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language," J. of Artificial Intell. Research, vol. 11, pp. 95-130, 1999.
[23] H. Dong, F. K. Hussain, and E. Chang, "A context-aware semantic similarity model for ontology environments," Concurrency and Comput.: Practice and Experience, vol. 23, pp. 505-524, 2011.
[24] P. Plebani and B. Pernici, "URBE: Web service retrieval based on similarity evaluation," IEEE Trans. Knowl. Data Eng., vol. 21, pp. 1629-1642, 2009.
[25] H. Dong, F. K. Hussain, and E. Chang, "Ontology-learning-based focused crawling for online service advertising information discovery and classification," in Proc. Tenth Int. Conf. on Service Oriented Comput. (ICSOC 2012), Shanghai, China, 2012, pp. 591-598.
[26] G. Moritz, F. Golatowski, D. Timmermann, and C. Lerche, "Beyond 6LoWPAN: Web services in wireless sensor networks," IEEE Trans. Ind. Informat., accepted for publication, 2012.
Hai Dong (S'08 – M'11) received the Bachelor degree in Information Management from Northeastern University, China, in 2003, and the Master degree and the Ph.D. degree in Information Technology from Curtin University of Technology, Australia, in 2006 and 2010, respectively. He received the Curtin Research Fellowship from Curtin University of Technology in 2012. He is currently a research fellow in the School of Information Systems, Curtin University of Technology. Prior to this position, he was a research associate in the Digital Ecosystems and Business Intelligence Institute, Curtin University of Technology. His research interests include service-oriented computing, semantic search, ontology, digital ecosystems, Web services, machine learning, project monitoring, information retrieval, and cloud computing.

Farookh Khadeer Hussain received the Bachelor of Technology degree in computer science and computer engineering, the M.S. degree in Information Technology from La Trobe University, Melbourne, Australia, and the Ph.D. degree in Information Systems from Curtin University of Technology, Perth, Australia, in 2006. He is currently a Faculty member at the School of Software, Faculty of Engineering and Information Technology, University of Technology, Sydney. His areas of active research are cloud computing, services computing, trust and reputation modeling, semantic web technologies and industrial informatics. He works actively in the domains of cloud computing and business intelligence. In the area of business intelligence, the focus of his research is to develop smart technological measures, such as trust and reputation technologies and the semantic web, for enhanced and accurate decision making.