Expert Systems with Applications 36 (2009) 9358–9370
An ontological Proxy Agent with prediction, CBR, and RBR techniques for fast query processing

Sheng-Yuan Yang a,*, Chun-Liang Hsu b

a Department of Computer and Communication Engineering, Saint John's University, 499, Sec. 4, Tamking Road, Tamshui, Taipei, Taiwan, ROC
b Department of Electrical Engineering, Saint John's University, 499, Sec. 4, Tamking Road, Tamshui, Taipei, Taiwan, ROC

* Corresponding author. E-mail address: [email protected] (S.-Y. Yang).
Keywords: Proxy Agent; Data mining; Query prediction; Case-based reasoning; Rule-based reasoning; Ontology
Abstract

This paper proposes a three-tier Proxy Agent that works as a mediator between the users and the backend process of an FAQ system. The first tier uses an improved sequential pattern mining technique to provide effective query prediction and cache services. The second tier employs an ontology-supported case-based reasoning technique to propose adapted query solutions and tunes itself by retaining and updating highly satisfying cases identified by the user. The third tier applies ontology-supported rule-based reasoning to generate possible solutions for the user. Our experiments show that around 79.1% of the user queries can be answered by Proxy Agent, leaving about 20.9% of the queries for the backend Answerer Agent, which effectively alleviates the overloading problem usually associated with a backend server.

© 2009 Elsevier Ltd. All rights reserved.
1. Introduction

In this information-exploding era, the term 'global village' reflects how rapidly evolving network technology keeps shortening the distance between people. We can obtain virtually any information we are interested in by surfing the Internet, no longer restricted to textbooks; the Internet is like a huge knowledge treasury waiting for exploration. A variety of query systems have thus appeared. The basic operational pattern of general query systems is to pass the user's query to a backend process, which is responsible for producing a proper query result for the user. One major drawback of this approach is that, as the number of queries increases, the backend process becomes overloaded, causing a dramatic degradation of system performance. The user then has to spend more time waiting for query responses, and worse, most of the long-awaited responses are dissatisfactory. Therefore, how to quickly retrieve the information the user really wants over the limited bandwidth of the Internet, and how to improve the retrieval performance of query systems, has become an important research topic. In addition, techniques that gather and integrate data through database techniques are common in the literature, but they usually suffer from the following problems: (1) the database relationships so constructed usually lack physical meanings; (2) responses to a user query are usually independent of the user level or the degree of user satisfaction; (3) automatic maintenance of the database through user feedback is usually not available. Consequently, how to help users find user-oriented solutions, how to obtain, learn, and predict the best solution through user feedback, and how to support incremental maintenance of the solution database become important research topics. In order to provide high-quality FAQ answers on the Web, we have proposed FAQ-master (Yang, 2007c; Yang, Hsu, Lee, & Deng, 2008) as an intelligent Web information aggregation system that provides intelligent information retrieval, filtering, and aggregation services. It contains four agents, namely, Interface Agent (Yang, 2007d, in press), Proxy Agent (Yang, 2007b; Yang et al., 2005a, 2005b), Answerer Agent (Chuang, 2003; Yang, 2008a; Yang, Chuang, & Ho, 2007), and Search Agent (Yang, 2006, 2007a, 2008b), supported by a content base, which in turn contains a user model base, a domain ontology base, a website model base, an ontological database, a solution library, and a rule base. In the context of FAQ service, this architecture can collect the most useful FAQ files, preprocess the files into consistent data, aggregate FAQ answers according to the user query and his/her user model, and provide quick FAQ answers through a proxy mechanism. In this paper, we discuss an ontological Proxy Agent with solution integration and proxy techniques for Web information processing, which not only helps the user find proper, integrated query results in accordance with his/her proficiency level or satisfaction degree, i.e., user-oriented solutions, but also supports
proxy access of query solutions through an intelligent solution-finding process, as shown in Fig. 1. The architecture involves two agents, namely Answerer Agent and Proxy Agent, and shows how they interact with Interface Agent and Search Agent. Answerer Agent uses the wrapper approach for Web information preparation: it parses, cleans, and transforms Q–A pairs obtained from heterogeneous websites into an ontology-directed canonical format, and then stores them in the ontological database (OD) via the ontological database manager (ODM). To speed up query processing, we introduce an ontological Proxy Agent with three proxy-relevant mechanisms, namely CBR (case-based reasoning), RBR (rule-based reasoning), and solution prediction. Solution Finder serves as the central control in finding solutions. To ensure that all the knowledge used in CBR, RBR, and solution prediction can be automatically generated, we have introduced a Rule Miner and a Solution Predictor into the system; they are described in the following sections. Our experiments show that around 79.1% of the user queries can be answered by Proxy Agent, leaving about 20.9% of the queries for Answerer Agent, which effectively alleviates the overloading problem usually associated with a backend server and noticeably improves overall query performance. The personal computer (PC) domain is chosen as the target application of our Proxy Agent and will be used for explanation in the remaining sections. The rest of the paper is organized as follows. Section 2 develops the domain ontology as the fundamental semantics. Section 3 describes the design of the three-tier Proxy Agent and how it works. Section 4 illustrates how the first tier performs query cache and query prediction services. Sections 5 and 6 illustrate how CBR works as the second-tier solution provider and how RBR works as the third-tier solution provider, respectively. Section 7 describes the implementation of our Proxy Agent and reports how much better it performs. Section 8 discusses related works, while Section 9 concludes the work.
2. Domain ontology as fundamental semantics

Ontology is a method of conceptualizing a specific domain (Noy & Hafner, 1997). It plays various roles in developing intelligent systems, including knowledge sharing, knowledge reuse, and semantic analysis of languages. Developing an ontology for a specific domain is not yet an engineering process, but it is clear that a domain ontology must include descriptions of the explicit concepts of the domain and their relationships (Asnicar & Tasso, 1997). We outlined a principled construction procedure in Yang (2007d); following that procedure, we developed an ontology for the PC domain. Fig. 2 shows part of the ontology taxonomy. The taxonomy represents relevant PC concepts as classes and their relationships as a variety of links. The figure illustrates the isa link, which allows inheritance of features from parent classes to child classes. We have carefully selected, from each concept, the properties that are most related to our application and defined them as the detailed ontology for the corresponding class. Fig. 3 exemplifies the detailed ontology of the concept of CPU. In the figure, the uppermost node uses various fields to define the semantics of the CPU class, each field representing an attribute of CPU, e.g., interface, provider, synonym, etc. The nodes at the lower level represent various CPU instances, which capture real-world data. The complete PC ontology can be referenced from the Protégé Ontology Library at the Stanford website (http://protegewiki.stanford.edu/index.php/Protege_Ontology_Library) or our website (http://pcontology.et.ntust.edu.tw). We have also developed a problem ontology to help process user queries. Fig. 4 illustrates part of the problem ontology, which contains query types and operation types. These two concepts constitute the basic semantics of a user query and are therefore used as indices to structure the cases in ODAC (ontological database access cases) and OD, which in turn provide fast case retrieval. Finally, we have used Protégé's APIs (application program interfaces) to develop a set of ontology services, which work as the primitive functions to support information retrieval from the ontology.
[Figure: Proxy Agent architecture. Solution Finder receives the internal query format from Interface Agent; it consults (1) the Query Pool maintained by Solution Predictor from the user query history, (2) the CBR module over the ontological database access cases (ODAC), (3) the RBR module over the Rule Base built by Rule Miner, and (4) Solution Integrator; Answerer Agent (Wrapper, Ontological Database Manager, Ontological Database) prepares preprocessed webpages delivered by Search Agent under the website models, and relevant, cached, or predicted solutions flow back through Solution Finder, supported by Ontology Base and the User Models.]
Fig. 1. Proxy Agent architecture.
[Figure: a taxonomy tree rooted at Hardware, with isa links to Interface Card, Power Equipment, Memory, Storage Media, and Case; Interface Card subsumes Network Chip, Sound Card, Display Card, SCSI Card, and Network Card; Power Equipment subsumes Power Supply and UPS; Memory subsumes Main Memory and ROM; Storage Media subsumes Optical and ZIP, with Optical subsuming CD, DVD, CDR/W, and CDR.]
Fig. 2. Part of PC ontology taxonomy.
[Figure: the uppermost CPU class node defines fields such as Synonym (= Central Processing Unit), D-Frequency, Interface, CPU Slot, L1 Cache, Volume, Spec., and Abbr.; lower-level instance nodes include XEON (Factory = Intel), THUNDERBIRD 1.33G (Synonym = Athlon 1.33G, Interface = Socket A, L1 Cache = 128KB, Abbr. = Athlon, Factory = AMD), DURON 1.2G, PENTIUM 4 2.0AGHZ, PENTIUM 4 1.8AGHZ, CELERON 1.0G, and PENTIUM 4 2.53AGHZ, each carrying concrete attribute values.]
Fig. 3. Ontology of the concept of CPU.
[Figure: a taxonomy rooted at Query, with isa links to Operation Type and Query Type; Operation Type instances include Adjust, Use, Setup, Close, Open, Support, and Provide; Query Type instances include How, What, Why, and Where.]
Fig. 4. Part of problem ontology taxonomy.
[Figure: a user query flows through the ontology services, which are backed by the domain ontology developed using Protege2000 and provide (1) transforming query terms into canonical ontology terms, (2) finding the definition of a specific term in the ontology, (3) finding relationships among terms, and (4) finding compatible and/or conflicting terms against a specific term.]
Fig. 5. Ontology services.
As illustrated in Fig. 5, the ontology services currently available include transforming query terms into canonical ontology terms, finding definitions of specific terms in the ontology, finding relationships among terms, and finding compatible and/or conflicting terms against a specific term.
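To make the first service concrete, the following Python sketch canonicalizes query terms against a toy in-memory ontology fragment. The dictionary layout is a hypothetical stand-in; the actual system obtains this information through Protégé's APIs:

    # Toy ontology fragment (hypothetical layout; the real data lives in Protege).
    PC_ONTOLOGY = {
        "CPU": {"synonyms": {"central processing unit", "processor"}},
        "Main Memory": {"synonyms": {"ram"}},
    }

    def canonicalize(term):
        """Map a raw query term to its canonical ontology class name."""
        t = term.strip().lower()
        for concept, info in PC_ONTOLOGY.items():
            if t == concept.lower() or t in info["synonyms"]:
                return concept
        return None  # unknown term: no canonical concept found

    print(canonicalize("Central Processing Unit"))  # -> "CPU"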
3. Three-tier proxy design

The Proxy Agent performs solution finding with a three-tier architecture. Fig. 1 illustrates the architecture of Proxy Agent and how it interacts with Interface Agent and the backend process, i.e., Answerer Agent. First, Interface Agent collects queries from the user according to his/her user model stored in the user model base, which captures his/her important behavior as well as his/her query history. Answerer Agent manages OD, which stores preprocessed Q–A pairs collected from the Web, and retrieves proper Q–A pairs to respond to user queries. Inside Proxy Agent, Ontology Base contains the PC ontology shown in Fig. 2. ODAC is the case base for CBR; it stores past user query cases, which come from the Q–A pairs produced by Answerer Agent. Solution Predictor is responsible for tracking the user query history, storing the most frequently occurring queries in Cache Pool, and predicting the next possible queries for storage in Prediction Pool. Cache Pool and Prediction Pool are combined into a Query Pool, which provides query cache and query prediction services to reduce query response time. CBR employs the case-based reasoning technique to reason about solutions for a given query from ODAC. If an existing case in ODAC is exactly the same as the user query, CBR directly outputs it as the solution to the user; if only similar cases exist in ODAC, CBR modifies their solutions for the user through a case adaptation process. CBR is also responsible for the maintenance of cases in ODAC: according to the user's feedback on a solution, it evaluates whether a new query–answer pair deserves storage in ODAC as a new case. The job functions of RBR can be divided into two parts: off-line rule mining and on-line reasoning. The former employs mining, validating, and generalizing techniques to derive rules suitable for rule-based reasoning and stores them in Rule Base for on-line reasoning. When Solution Finder turns to RBR for solutions, RBR picks out the most suitable rules from Rule Base to answer the user query, carries out rule-based reasoning with the support of the PCDIY Ontology, and provides the reasoned solutions to Solution Finder.

Solution Finder is the central control in finding solutions. After receiving a user query from Interface Agent, it tries to produce an answer following an intelligent solution-finding process, which includes predicted solution retrieval, CBR, RBR, and solution integration. First, Solution Finder checks whether any predicted queries exist in User Model Base; if so, it directly produces the answer retrieved from the question–answer pair. If not, Solution Finder starts CBR to find a query solution: if the given query already exists in ODAC, Solution Finder directly outputs its answer part; otherwise it performs case adaptation to solve the query. If CBR provides no solutions, Solution Finder invokes RBR, i.e., it uses the rules in Rule Base to find a query solution. Note that, in addition to the exact query answer, Solution Finder also gathers for the user the solutions that are supersets of the user question. The last solution-finding mechanism is to trigger Solution Integrator to integrate solutions from OD; if the integrated solution is credited with a high degree of satisfaction by the user, it is stored in ODAC as a new case. In short, Proxy Agent employs a three-tier proxy mechanism, comprising the Solution Predictor, CBR, and RBR modules, to speed up query processing while reducing the load on Answerer Agent. If Solution Finder gets a solution from one of the three tiers, it immediately passes the solution to Interface Agent, which then produces a proper query result for the user according to his proficiency level and idiosyncrasies (Yang, 2007d; Yang et al., 2007). If none of the three tiers can provide solutions, Solution Finder asks Answerer Agent to derive a solution (Yang, 2007c; Yang et al., 2008).
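The solution-finding cascade just described is a chain of fallbacks. The following Python sketch shows the control flow only; the helper callables are hypothetical stand-ins for the pools and agents named above, each returning None on a miss:

    def find_solution(query, prediction_pool, cache_pool, cbr, rbr, answerer):
        """Try the three proxy tiers in order; fall back to the backend."""
        tiers = (
            prediction_pool,  # tier 1a: predicted queries
            cache_pool,       # tier 1b: cached frequent queries
            cbr,              # tier 2: case-based reasoning
            rbr,              # tier 3: rule-based reasoning
        )
        for tier in tiers:
            solution = tier(query)
            if solution is not None:
                return solution
        return answerer(query)  # backend Answerer Agent as the last resort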
[Figure: first-tier proxy. Query Pattern Miner mines the query histories in User Model Base into frequent query sets and sequential query patterns; Case Retriever combines the frequent query sets with cases from ODAC to build cached solutions in Cache Pool (query cache); Prediction Module turns the sequential query patterns into sequential rules stored in the Prediction Models; Pattern Matching Monitor matches recent queries against these rules and places the next probable query, with its solution from ODAC, into Prediction Pool (query prediction); Solution Finder serves cached or predicted query solutions for the internal query format received from Interface Agent.]
Fig. 6. First-tier proxy.
4. First-tier proxy: query prediction and query cache

The basic idea behind the first-tier proxy is the observation of common sequential patterns in the query behavior of the users. Solution Predictor exploits this observation and uses the sequential pattern mining technique to discover implicit correlations in user query behavior. These correlations then serve as potential rules for predicting future user queries. Fig. 6 shows the architecture of the Predictor working as the first-tier proxy. Query Pattern Miner looks for frequent sequential query patterns, using the Full-Scan-with-PHP algorithm, in the query histories of the users of the same group as recorded in User Model Base (Yang et al., 2005a). Note that we pre-partitioned the users into five user groups according to their proficiency in the domain (Yang, 2007d). Query Pattern Miner then passes the frequent sequential query patterns to Case Retriever, which retrieves the corresponding solutions from ODAC and constructs 'frequent queries' for storage in Cache Pool. The frequent sequential query patterns are also sent to Prediction Module to construct a prediction model for each user group. Pattern Matching Monitor monitors recent query records and uses the prediction model to produce the next possible queries for storage in Prediction Pool. In short, the Predictor produces 'frequent queries' for Cache Pool and 'predicted queries' for Prediction Pool off-line. Given a new query on-line, Solution Finder passes the query to the Predictor, which employs both the query prediction and query cache mechanisms to produce possible solutions for the query.

4.1. Full-Scan-with-PHP algorithm

The Full-Scan algorithm was developed, based on the DHP algorithm, to perform sequential pattern mining (Chen, Park, & Yu, 1998). The PHP algorithm introduced a perfect hash mechanism into the DHP algorithm and was shown to outperform it (Özel & Güvenir, 2001). Fig. 7 integrates PHP with Full-Scan to form our sequential pattern mining algorithm, which returns L (step 10) as the set of frequent sequential patterns. The pruning method (step 7) is the same as in PHP. The major feature of the algorithm is that it obtains Lk by directly scanning the perfect hash table PHk (step 8), without spending extra time scanning the database to calculate supports for the frequent sequential patterns. Fig. 8 illustrates how the algorithm works on the example database D1 with minimum support min_supp set to 2.

    Input: dataset D1, minimum support threshold min_supp
    Output: all frequent patterns
        // map all 1-patterns with perfect hashing
     1: map all 1-patterns of the transactions of D1 to perfect hash table PH1;
     2: L1 = {h | h ∈ PH1, h.bucket_counter >= min_supp};
     3: k = 2;
        do {
     4:   Dk = { };
     5:   forall transactions t ∈ Dk-1 {      // perfect hashing
     6:     map all k-patterns of t to perfect hash table PHk;
     7:     t' = prune(t); Dk = Dk ∪ {t'};
          }
     8:   Lk = {h | h ∈ PHk, h.bucket_counter >= min_supp};
     9:   k = k + 1;
        } while (Lk-1 ≠ ∅)
    10: L = L1 ∪ L2 ∪ ... ∪ Lk;
    11: return L;

Fig. 7. Full-Scan-with-PHP algorithm.

[Figure: a trace of the algorithm on D1 = {100: ACD, 200: BCE, 300: ABCE, 400: BE} with min_supp = 2. Scanning perfect hash table PH1 (bucket counts 2, 3, 3, 1, 3) yields L1; pruning D1 gives D2 = {100: AC, 200: BCE, 300: ABCE, 400: BE}, and scanning PH2 (qualifying bucket counts 2, 2, 3, 2) yields L2; pruning D2 gives D3 = {200: BCE, 300: BCE}, and scanning PH3 yields L3; pruning D3 leaves D4 empty, PH4 has no qualifying buckets, and L4 is null.]
Fig. 8. Illustration on sequential pattern mining of Full-Scan-with-PHP algorithm.
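The counting loop of Fig. 7 can be made concrete with a short Python sketch. A dict stands in for the perfect hash tables PHk, and the pruning step is simplified relative to PHP's, so this illustrates the idea rather than reimplementing the algorithm:

    from itertools import combinations

    def frequent_patterns(transactions, min_supp):
        db = [tuple(t) for t in transactions]
        result, k = [], 1
        while db:
            ph = {}                                  # stands in for PHk
            for t in db:
                for pattern in combinations(t, k):   # all k-patterns of t
                    ph[pattern] = ph.get(pattern, 0) + 1
            lk = [p for p, n in ph.items() if n >= min_supp]  # scan PHk
            if not lk:
                break
            result.extend(lk)
            keep = {item for p in lk for item in p}
            # simplified pruning: drop infrequent items, then drop
            # transactions too short to contain any (k+1)-pattern
            pruned = (tuple(i for i in t if i in keep) for t in db)
            db = [t for t in pruned if len(t) > k]
            k += 1
        return result

    # D1 of Fig. 8 with min_supp = 2 yields L1, L2, and the 3-pattern (B, C, E).
    print(frequent_patterns(["ACD", "BCE", "ABCE", "BE"], 2))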
4.2. Construction of Query Pool

As noted before, Cache Pool contains frequent queries while Prediction Pool contains predicted queries. Both are constructed from the frequent sequential query patterns discovered by the Full-Scan-with-PHP algorithm. Note that we treat each query case stored in ODAC as an item during the operation of the algorithm. The algorithm first constructs L1 as the set of 1-item frequent sequential patterns. Strictly speaking, this set has nothing to do with query sequences yet, but it represents the pool of all frequent queries. Thus, each element in the set can be treated as a frequently occurring query, which can be cached for ready use in the future. We hence pass each element of the set to Case Retriever, which retrieves its corresponding answer part from ODAC to form a complete frequent query for storage in Cache Pool. Fig. 9 illustrates some examples of frequent queries stored in Cache Pool.

Fig. 9. Examples of frequent queries in Cache Pool.

The final output of the mining algorithm is L, the set of frequent sequential patterns, which is submitted to Prediction Module for producing sequential rules. The basic idea follows Zhang (2001), defining the sequential rule confidence as follows.

Definition 1. Given two frequent sequential patterns S1 = ⟨t1, t2, ..., tk⟩ and S2 = ⟨t1, t2, ..., tk-1⟩, where k ≥ 2 and S2 is a sub-sequential pattern of S1, we produce a sequential rule "t1, t2, ..., tk-1 → tk" with the following rule confidence:

Confidence(t1, t2, ..., tk-1 → tk) = Support(S1) / Support(S2)

We define a sequential rule as legal if its confidence satisfies some minimal confidence. Table 1 shows part of the legal sequential rules generated from the database D1 of Fig. 8 with minimal confidence 0.6.

Table 1. Examples of sequential rules.

Sequential rule | Confidence
B, C → E | 1
A → C | 1
B → C | 0.67
B → E | 1
C → E | 0.67

Fig. 10 illustrates some examples of prediction rules about the PC domain stored in the prediction model.

Fig. 10. Examples of rules in the prediction model.

With the prediction model constructed, Pattern Matching Monitor can perform query prediction and place predicted queries into Prediction Pool. Basically, it monitors and traces the query record of the user in a session, and employs the longest-path-first method (Zhang, 2001) to check whether any matched sequential rules exist in the prediction model. A query record is treated as a sequence of queries. If the antecedent of a sequential rule matches the pattern of the traced query record, Pattern Matching Monitor produces the consequent of the rule as the next possible query and retrieves its corresponding solution part from ODAC to form a predicted query for Prediction Pool. If no rules match, the Monitor removes one query from the query record and repeats the rule checking process until the length of the query record reaches one. To illustrate this process using the prediction model of Table 1: suppose Pattern Matching Monitor notices a user query record ⟨A, B, C⟩ in a session. It checks the prediction model and finds that no antecedent of a sequential rule matches ⟨A, B, C⟩. It therefore reduces the traced pattern to ⟨B, C⟩ and repeats the rule checking process. It finally finds that the antecedent of the sequential rule "B, C → E" matches the new pattern; it thus predicts the consequent "E" as the next query and retrieves its answer part from ODAC to form a predicted query.

4.3. First-tier solution service

Given a user query, Solution Finder first checks whether there are any predicted queries in Prediction Pool. If so, it directly generates a solution by retrieving the answer part of the predicted query for the user. If not, it checks whether any cached queries exist in Cache Pool; if one exists, it generates a solution from the corresponding answer part. If none exists, it turns to the second tier for generating solutions.
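A minimal sketch of the longest-path-first lookup follows. It assumes the prediction model is stored as a dict from antecedent tuples to the highest-confidence consequent, a hypothetical layout distilled from Table 1:

    # Legal rules of Table 1, keeping one consequent per antecedent.
    PREDICTION_MODEL = {("B", "C"): "E", ("A",): "C", ("B",): "E", ("C",): "E"}

    def predict_next(query_record):
        """Match the longest suffix of the session's query record."""
        record = tuple(query_record)
        while record:
            if record in PREDICTION_MODEL:
                return PREDICTION_MODEL[record]
            record = record[1:]   # drop the oldest query and retry
        return None

    print(predict_next(["A", "B", "C"]))  # -> "E", via the rule "B, C -> E"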
5. Second-tier proxy: case-based reasoning
The second-tier proxy employs CBR to provide a broader range of search for solutions than the first tier. Fig. 11 illustrates the architecture of the second-tier proxy mechanism. Again, ODAC is the case library, which contains query cases produced by the backend Answerer Agent. Case Retriever is responsible for retrieving a case from ODAC that is the same as, or similar to, the user query. Case Reuser then uses the case to check for any discrepancy against the user query. If the case is exactly the same as the user query, Case Reuser directly outputs it to the user; if the case is only similar to the user query, Case Reuser passes it to Case Reviser for case adaptation. Case Reviser employs the PC ontology along with Adaptation Rule Base, which contains adaptation rules constructed by the domain expert, to adapt the retrieved case for the user. Case Retainer is responsible for the maintenance of ODAC, processing case addition, deletion, and aging. In short, the second-tier proxy involves four operations, namely case retrieval, case reuse, case adaptation, and case maintenance, detailed below.

5.1. Case retrieval

Fig. 12 shows how Case Retriever performs case retrieval. At step 1, if Case Retriever retrieves a case that contains the same question features as, or more question features than, the user query, it directly outputs the case as the solution for the user query. Otherwise, at step 2, it checks for similar cases using the VRelationship in the ontology.
[Figure: second-tier proxy (CBR). Solution Finder passes the internal query format to Case Retriever, which retrieves full-matching or similar cases from ODAC; Case Reuser outputs the original case solution directly or passes a similar case to Case Reviser for adaptation, supported by Adaptation Rule Base and the PC ontology; Case Retainer stores new cases in ODAC according to the user feedback received via Interface Agent, and adapted case solutions flow back through Solution Finder.]
Fig. 11. Second-tier proxy.
Step 1: If the retrieved cases contain the same or more question features, stop.
Step 2: If the retrieved cases contain different features:
  2.1 divide the cases into similar and dissimilar cases according to the ontology VRelationship;
  2.2 delete the dissimilar cases and pass the similar cases to Case Reuser, which determines whether to start Case Reviser for case adaptation or to directly output the solutions for the user.

Fig. 12. Case retrieval.
Table 2 illustrates some examples of VRelationship, explained below.
(1) Mutually exclusive VRelationship: this constraint enforces that an attribute can take on one and only one value from a legal set of values. For example, row 1 of the table illustrates that a motherboard cannot belong to two different producers at the same time. A retrieved case containing attributes having this VRelationship with the user query cannot become a similar case.
(2) Downward-compatible VRelationship: this constraint defines a compatibility between two related attribute values. For example, row 2 of the table says that a motherboard that contains four USB (Universal Serial Bus) ports can be regarded as one with two USB ports. Thus, a retrieved case containing an attribute having this VRelationship with the user query can become a similar case.
(3) Conditionally downward-compatible VRelationship: this is very similar to the case above, except that the compatibility is subject to some conditions. For example, row 3 of the table shows that a 1.8 GHz Pentium 4 CPU can be regarded as one with 1.4 GHz, but cannot be regarded as one with 866 MHz, which is a Pentium III CPU format. In this case, the retrieved case is treated as a similar case with extra conditions.
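Step 2.1 of Fig. 12 can be sketched as the following classification loop; vrelationship_of is a hypothetical stand-in for the PC ontology lookups behind Table 2:

    MUTUALLY_EXCLUSIVE = "mutually exclusive"
    DOWNWARD = "downward-compatible"
    CONDITIONAL = "conditionally downward-compatible"

    def classify_case(query_features, case_features, vrelationship_of):
        """Return 'similar', 'similar-with-conditions', or 'dissimilar'."""
        verdict = "similar"
        for feature, q_value in query_features.items():
            c_value = case_features.get(feature)
            if c_value is None or c_value == q_value:
                continue
            rel = vrelationship_of(feature, q_value, c_value)
            if rel == MUTUALLY_EXCLUSIVE:
                return "dissimilar"                  # e.g., different MB_Provider
            if rel == CONDITIONAL:
                verdict = "similar-with-conditions"  # needs case adaptation
            # downward-compatible differences leave the case directly similar
        return verdict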
5.2. Case reuse

When Case Retriever recognizes similar cases, Case Reuser steps in to check whether a similar case can be directly reused or needs some adaptation before being reused. Specifically, if the VRelationship between the user query and the similar case is:
(1) downward-compatible, the case is directly treated as a legal solution;
(2) conditionally downward-compatible, the case is treated as a reference case and sent to Case Reviser for case adaptation according to the inference rules.

5.3. Case adaptation

A case adaptation process involves three operations. First, Case Reviser finds a reason for adaptation according to the PC ontology. Then, it selects proper adaptation rules according to the reason to perform case adaptation. Finally, it produces an adapted solution for the user. Table 3 illustrates a scenario of case adaptation, which contains a user query and two reference cases retrieved from ODAC. Note that the 'conditionally downward-compatible' VRelationship relates the two reference cases to the user query. The first adaptation step is to use the PC ontology to identify a feature that is constrained by the VRelationship between the user query and the reference cases. This constrained feature explains the semantics of the specified relationship, and thus may suggest a proper adaptation. The example shows that the feature CPU_Clock_Rate is constrained by the 'conditionally downward-compatible' relationship. Thus, we should check whether there are adaptation rules referring to the feature CPU_Clock_Rate in Adaptation Rule Base. Suppose we have an adaptation rule as shown below:

IF concept ∈ 'Motherboard' and constrained-feature = 'CPU_Clock_Rate'
THEN adaptation attribute = 'Support_CPU_Clock_Range'

The rule says that if the concept is an instance of 'Motherboard' and the constrained feature is 'CPU_Clock_Rate', then we can focus on the attribute 'Support_CPU_Clock_Range' for adaptation. Case Reviser then checks whether the adaptation feature exists in the reference cases. If so, it goes one step further and checks whether the feature values meet the requirement of the user query. For example, Table 3 shows three motherboard instances, P4S333, P4S433, and P4S533, which are derived from the two reference cases, together with their respective values of the adaptation feature 'Support_CPU_Clock_Range'. We find that P4S333 supports clock rates only up to 1.5 GHz, which cannot meet the user requirement of 1.8 GHz. Therefore P4S333 has to be removed from the set of solutions, leaving P4S433 and P4S533 as the final solutions.
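The adaptation check just described amounts to a range filter over the candidate instances. Here is a minimal Python sketch on the data of Table 3 (shown below); the record layout and names are illustrative:

    # Supported CPU clock ranges in GHz, taken from Table 3.
    INSTANCES = {
        "P4S333": (0.45, 1.5),
        "P4S433": (0.6, 2.0),
        "P4S533": (0.6, 2.4),
    }

    def adapt(candidates, required_clock):
        """Keep only instances whose Support_CPU_Clock_Range covers the query."""
        return [name for name in candidates
                if INSTANCES[name][0] <= required_clock <= INSTANCES[name][1]]

    # Pentium 4 at 1.8 GHz: P4S333 tops out at 1.5 GHz and is removed.
    print(adapt(["P4S333", "P4S433", "P4S533"], 1.8))  # -> ['P4S433', 'P4S533']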
Table 3. Example of case adaptation.

User query Q: Pentium 4, 1.8 GHz, 2 USB?
Reference case 1 Q: Pentium 4, 1.3 GHz, 2 USB?  A: P4S333, P4S433, P4S533
Reference case 2 Q: Pentium 4, 2.4 GHz, 2 USB?  A: P4S533
Constrained feature: CPU_Clock_Rate
Adaptation feature: Support_CPU_Clock_Range
Adaptation feature values: P4S333: 0.45–1.5 GHz; P4S433: 0.6–2 GHz; P4S533: 0.6–2.4 GHz
Adapted solution A: P4S433, P4S533
Table 2. Examples of VRelationship.

Relationship | Feature | Value | Explanation
Mutually exclusive | MB_Provider | (ASUS) / (MicroStar) | A motherboard cannot belong to two different producers at the same time
Downward-compatible | MB_PCI_Num | PCI4 → PCI2 | A motherboard which contains four USB ports can be regarded as one with two USB ports
Conditionally downward-compatible | CPU_Clock_Rate | Pentium4 1.8G → Pentium4 1.4G | A 1.8 GHz Pentium 4 CPU can be regarded as one with 1.4 GHz, but cannot be regarded as one with 866 MHz, a Pentium III CPU format
Fig. 13. Examples of cases.
5.4. Case maintenance
Fig. 13 shows some examples of cases stored in ODAC. Note that each case is associated with a survival value (the last column), which represents how active the case is in the system and serves as our case maintenance basis. The increment or decrement of the survival value of a case depends upon its satisfaction degree. Table 4 illustrates the system-defined five degrees of satisfaction through which the user provides feedback on the proposed solutions. The maintenance of survival values depends on whether the case is a solution, a reference case, an adapted case, or a new solution case from the backend Answerer Agent, detailed below.

5.4.1. Solution cases

For a solution case, we can directly reflect the user feedback in the survival value of the case. Eq. (1) defines how we refresh the survival value SR(c) of a solution case c:
SR^(new)(c) = SR^(old)(c) + ΔSR(c)    (1)

where

ΔSR(c) = (Sat(c) − 0.45) × α    (2)
where Sat(c) stands for the representative value of the satisfaction degree of case c as defined in Table 4, and α is the learning rate, set to 0.1 in our system to adjust SR(c) slowly.

5.4.2. Reference cases

For a reference case, we have to take into account the similarity of the reference case to the adapted solution. Eq. (3) first defines the solution similarity:
Sim(c, a) = |Sc ∩ Sa| / |Sc ∪ Sa|    (3)
where a is an adapted case, c is a reference case, Sa is the answer part of case a, and Sc is the answer part of case c. Sim(c, a) measures the extent to which the reference case contributes to the case adaptation. Table 5 illustrates the solution similarities between the reference cases and the adapted case of Table 3. We can then use Eq. (1) to refresh the survival value of a reference case, after ΔSR(c) is redefined by Eq. (4):
ΔSR(c) = (Sat(a) − 0.45) × Sim(c, a) × α    (4)
Table 4. Satisfaction degrees and their representation.

Satisfaction degree | Representative value
Highly satisfied | 0.80
Satisfied | 0.65
So-so | 0.45
Unsatisfied | 0.25
Highly unsatisfied | 0.10

Table 5. Example of similarity calculation.

User query Q: Pentium 4, 1.8 GHz, 2 USB?
Reference case 1 Q: Pentium 4, 1.3 GHz, 2 USB?  A: P4S333, P4S433, P4S533
Reference case 2 Q: Pentium 4, 2.4 GHz, 2 USB?  A: P4S533
Adapted solution A: P4S433, P4S533
Solution similarity of reference case 1 = 2/3 = 67%
Solution similarity of reference case 2 = 1/3 = 33%

5.4.3. Adapted cases

We retain an adapted case in ODAC only when its satisfaction degree is over a pre-defined satisfaction threshold and its similarity is less than a pre-defined similarity threshold. Both thresholds are set to 0.5 in our system in order to balance user satisfaction and case similarity in retaining an adapted case; they can be changed according to how the system weighs user feedback against case similarity. We then use Eq. (5) to assign the retained adapted case an initial survival value SR(a):
SR(a) = [Σ_{i=1}^{n} Sim(a, c_i) × SR(c_i)] / n    (5)
where SR(c_i) is the survival value of the ith reference case c_i, n is the number of reference cases contributing to the adaptation of a, and Sim(a, c_i) is the solution similarity between a and c_i.

5.4.4. New solution cases from the backend process

Since a new solution case from the backend process has no 'old' survival value, we cannot use Eq. (1) to refresh its survival value. Eq. (6) is used instead to assign a survival value to C_new, the new solution case, where SR_ave is the average survival value of all the cases in ODAC, as defined in Eq. (7):
SR(C_new) = SR_ave × Sat(C_new)    (6)

SR_ave = [Σ_{i=1}^{n} SR(C_i)] / n    (7)
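The survival-value bookkeeping of Eqs. (1)–(7) fits in a few functions. The Python sketch below assumes the union reading of Eq. (3) and uses the paper's learning rate α = 0.1; the function names are illustrative:

    ALPHA = 0.1  # learning rate of Eq. (2)

    def similarity(sc, sa):                 # Eq. (3), assuming union denominator
        return len(sc & sa) / len(sc | sa)

    def refresh_solution(sr_old, sat):      # Eqs. (1)-(2): solution case
        return sr_old + (sat - 0.45) * ALPHA

    def refresh_reference(sr_old, sat_adapted, sim):  # Eqs. (1) and (4)
        return sr_old + (sat_adapted - 0.45) * sim * ALPHA

    def init_adapted(ref_cases):            # Eq. (5): (sim, sr) per reference
        return sum(sim * sr for sim, sr in ref_cases) / len(ref_cases)

    def init_new_case(all_sr, sat_new):     # Eqs. (6)-(7): backend case
        return (sum(all_sr) / len(all_sr)) * sat_new

    # Reference case 1 of Table 5 after a 'satisfied' (0.65) feedback:
    sim = similarity({"P4S333", "P4S433", "P4S533"}, {"P4S433", "P4S533"})
    print(round(sim, 2), round(refresh_reference(1.0, 0.65, sim), 4))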
6. Third-tier proxy: rule-based reasoning

The third-tier proxy employs RBR to provide the broadest range of search for solutions of the three tiers. Fig. 14 illustrates the solving architecture of RBR. Inside the architecture, Rule Miner is responsible for mining rules from ODAC. Rule Validator tests and verifies the rules mined by Rule Miner and filters out the valid ones. Rule Generalizer then generalizes the valid rules and stores the generalized rules in Rule Base. Finally, Solution Reasoner is responsible for taking the proper rules from Rule Base to answer the user query, validating the solution parts of the rules with the support of the PCDIY Ontology, and producing the valid solutions for the user.
[Figure: architecture of RBR. Rule Miner mines raw rules from the cases in ODAC; Rule Validator keeps the interesting rules; Rule Generalizer stores the generalized rules in Rule Base; Solution Reasoner matches partially matching rules from Rule Base against the internal query format received from Solution Finder, carries out reasoning under the PCDIY Ontology, and returns the solution.]
Fig. 14. Architecture of RBR.
In short, the job functions of RBR can be divided into two parts: off-line rule mining and on-line reasoning. The former employs Rule Miner, Rule Validator, and Rule Generalizer to mine rules suitable for rule-based reasoning and stores them in Rule Base for on-line reasoning. When Solution Finder turns to RBR for solutions, the latter employs Solution Reasoner to pick out the most suitable rules from Rule Base to answer the user query, carries out rule-based reasoning with the support of the PCDIY Ontology, and provides the reasoned solutions to Solution Finder.

6.1. PD with PH algorithm

The mining algorithm combines the Perfect Hashing (PH) technique with the Pattern Decomposition (PD) algorithm (Zou, Chu, David, & Chiu, 2002), as shown in Fig. 15, to find frequent itemsets for off-line rule mining. The algorithm differs from the PD algorithm in its use of a perfect hash table, which not only finds all k-itemsets in a pattern (step 6) but also calculates the support of the itemsets (steps 8–9). The rest of the algorithm is the same as the PD algorithm. The main purpose of step 6 is to find all k-itemsets in a pattern, especially in a composite pattern. Steps 8–9 then
    Input: transaction-set T
    Output: all frequent patterns
    PD(transaction-set T)
     1: D1 = {<t.IS, t.Occ> | t ∈ T}; k = 1;
     2: while (Dk ≠ ∅) do begin
     3:   PHk = an empty perfect hash table;
     4:   forall p in Dk do begin
     5:     ph = an empty perfect hash table;
            // get all k-itemsets of p.IS with perfect hashing
     6:     put all k-itemsets s ⊆ p.IS into ph;
            // counting support with perfect hashing
     7:     forall buckets h ∈ ph do begin
     8:       if h in PHk then PHk.h.Occ += p.Occ;
     9:       else put h into PHk;
    10:     end
    11:   end
    12:   decide Lk and ~Lk by scanning PHk;
          // build Dk+1
    13:   Dk+1 = PD-rebuild(k, Dk, Lk, ~Lk);
    14:   k++;
    15: end
    16: Answer = ∪ Lk
Fig. 15. PD with PH algorithm.
employ another perfect hash table, PHk, to accumulate the occurrence counts of the k-itemsets in the patterns. Finally, step 12 directly scans this perfect hash table to obtain Lk and ~Lk.
[Figure: flowchart of mining association rules. For each frequent itemset i in R, the set of frequent itemsets that satisfy the ontology constraints: divide the attributes of i according to the attribute values of Problem_Subject into Rf (attribute values unconstrained by Problem_Subject) and Rb (attribute values constrained by Problem_Subject); calculate the degree of confidence as the degree of support of i divided by the degree of support of Rf; if the confidence reaches the minimal confidence, add the rule Rf → Rb to the association rule base; repeat until every frequent itemset of R has been processed.]
Fig. 16. Flowchart of mining association rules.
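The confidence test in the flowchart can be sketched as follows, assuming the supports of the frequent itemsets (and of their Rf parts) are available from the PD with PH pass, and that a predicate tells which attribute values are constrained by Problem_Subject; both interfaces are hypothetical:

    def mine_rules(frequent_itemsets, support, is_problem_subject, min_conf):
        """Derive Rf -> Rb association rules as in Fig. 16."""
        rules = []
        for itemset in frequent_itemsets:
            rb = frozenset(a for a in itemset if is_problem_subject(a))
            rf = frozenset(itemset) - rb
            if not rf or not rb:
                continue                        # nothing to split on
            confidence = support[frozenset(itemset)] / support[rf]
            if confidence >= min_conf:
                rules.append((rf, rb, confidence))
        return rules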
6.2. Off-line rule mining

In the RBR solving mechanism, Rule Miner plays the key role in off-line rule mining: it mines association rules for RBR from the cases in ODAC, according to the frequent itemsets mined by the PD with PH algorithm. A suitably modified mix of the Apriori algorithm (Agrawal & Srikant, 1995) and the Eclat algorithm (Zaki, Parthasarathy, Ogihara, & Li, 1997) performs the rule-mining task, as shown in Fig. 16. Rule Miner is invoked whenever the number of new cases in ODAC reaches a threshold value. In addition, it is worth mentioning that the rule mining process does not run over the whole of ODAC but separately for each question type, such as 'What', 'How', 'Why', and 'If'. The mined rules therefore carry comparatively clear semantic meaning and are convenient to manage.

6.3. On-line rule reasoning

If no solution is obtained from Solution Predictor and CBR, Solution Finder triggers RBR, which performs rule-based reasoning to generate possible solutions with the support of the PCDIY ontology and accordingly provides those solutions to the user via Solution Finder. Fig. 17 shows the reasoning steps of RBR.

Step 1: Use feature matching to select the best-matched rule by comparing the antecedent parts of the rules against the features of the user query.
Step 2: Take the antecedent part of the best rule and the features it leaves unmatched against the user query.
Step 3: With the support of the PCDIY Ontology, sequentially check each unmatched feature for conflicts with the consequent part of the rule; delete the conflicting consequent features and keep the compatible ones as the reasoned solutions.

Fig. 17. Reasoning steps of RBR.

7. System evaluation

Our Proxy Agent was developed using Borland JBuilder 5.0 on Microsoft Windows XP, as shown in Fig. 18. The database management system is Microsoft SQL Server 2000, and the ontology development tool is Protégé2000. We collected in total 517 FAQs from the FAQ website of a famous motherboard manufacturer in Taiwan and transformed the question–answer pairs of each FAQ for storage in ODAC. Our first experiment examined how well the Predictor works. We used in total 200 user query scenarios of the same user level as the training data set, setting the minimal support to 3% and the minimal confidence to 60%. The Full-Scan-with-PHP algorithm was used to construct 36 frequent queries for storage in Cache Pool and 43 rules in the Prediction Model. Prediction Pool only kept the three most recent predicted queries, in order to avoid over-fitted prediction, which would amount to rote learning of lots of prefetched queries. We then randomly selected 100 query scenarios from the training data set as the testing data to test the performance of the Predictor. Table 6 illustrates the results of the five test runs.
Fig. 18. Main screen of Proxy Agent.
Table 6. Testing results on Predictor.

Testing order | # Query | # Query prediction invocations | Query prediction # | Query prediction % | Query prediction success rate (%) | Query cache # | Query cache %
1 | 289 | 41 | 27 | 9.3 | 65.8 | 52 | 18.0
2 | 325 | 88 | 44 | 13.5 | 50 | 72 | 22.1
3 | 320 | 59 | 47 | 14.7 | 79 | 58 | 18.1
4 | 302 | 65 | 39 | 12.9 | 66.1 | 59 | 19.5
5 | 314 | 48 | 33 | 10.5 | 68 | 55 | 17.5
Average | 310 | 60.2 | 38 | 12.2 | 65.8 | 59.2 | 19.1
Table 7. Experiment results of the three tiers.

Testing order | # Query | Query prediction # | Query prediction % | Query cache # | Query cache % | CBR # | CBR % | RBR # | RBR %
1 | 289 | 27 | 9.3 | 52 | 18.0 | 103 | 35.6 | 23 | 7.9
2 | 325 | 44 | 13.5 | 72 | 22.1 | 117 | 36.0 | 26 | 8.0
3 | 320 | 47 | 14.7 | 58 | 18.1 | 132 | 41.2 | 27 | 8.4
4 | 302 | 39 | 12.9 | 59 | 19.5 | 118 | 39.1 | 24 | 7.9
5 | 314 | 33 | 10.5 | 55 | 17.5 | 147 | 47.0 | 25 | 7.9
Average | 310 | 38 | 12.2 | 59.2 | 19.1 | 123.4 | 39.8 | 25 | 8.0
Table 6 shows that, on average, Query Prediction is invoked to predict the next query for 60.2 queries, among which it correctly predicts the next query for 38 queries, a 65.8% average success rate. It also shows that, on average, out of 310 queries, 38 can be answered by the query prediction mechanism (around 12.2%), while 59.2 can be answered by the query cache mechanism (around 19.1%). In summary, around 31.3% of the user queries can be answered by the first-tier proxy. Our second experiment examined how well the overall three-tier Proxy Agent works. We used the same Cache Pool and Prediction Model as in the first experiment for the first-tier proxy. We then randomly selected about 500 FAQs from OD, extracted proper query keywords from their question parts, and randomly combined the keywords into a set of 345 query cases in ODAC for the experiment with the second-tier case-based reasoning. We conducted the experiment five times. After each run, some 15 new cases were retained in ODAC by Case Retainer, and the system triggered Rule Miner to re-mine rules from ODAC for the next run, with the help of user feedback from the experiment conductor. Table 7 illustrates the results of the five runs. It shows that, on average, 31.3% (12.2% + 19.1%) of the user queries can be answered by the user-oriented query prediction and cache technique, while 47.8% (39.8% + 8%) of the user queries can be handled by the ontology-supported CBR and RBR. In short, the experiment shows that around 79.1% of the user queries can be answered by Proxy Agent, leaving about 20.9% of the queries for Answerer Agent, which can effectively alleviate the overloading problem usually associated with a backend server. This is a very impressive improvement in query response.
8. Related works

The emerging knowledge discovery techniques, including knowledge discovery in databases (KDD) (Frawley, Picatetsky-Shapiro, & Matheus, 1992; Mendonca & Sunderhaft, 1999; Shortland & Scarfe, 1995), association rules (Agrawal & Srikant, 1994), classification (Ruggieri, 2002), clustering (Tsai, Wu, & Tsai, 2002), and sequential patterns (Agrawal & Srikant, 1995), have inspired a lot of work on user action prediction. For example, WebWatcher (Joachims, Freitag, & Mitchell, 1997) anticipates the next selected hyperlink by using a model built through reinforcement learning. The Transparent Search Engine (Bota, Corno, Farinetti, & Squillero, 2002) evaluates the most suitable documents in a repository using a user model updated in real time. A different approach to Web page prediction is path-based systems. For example, Yang, Zhang, and Li (2001) present an application of Web log mining to obtain Web-document access patterns and use these patterns to extend the well-known GDSF (Greedy-Dual-Size-Frequency) caching and prefetching policies. PPS (Proxy-based Prediction Service) (Lou & Lu, 2002) applies a prediction scheme that employs a two-layer navigation model to capture both inter-site and intra-site access patterns, incorporating a bottom-up prediction mechanism that exploits reference locality in proxy logs. Bonino, Corno, and Squillero (2003) propose a method that exploits user navigational behavior to predict future requests in real time, adopting a predictive user model based on Finite State Machines (FSMs) together with an evolutionary algorithm that evolves a population of FSMs to achieve a good prediction rate. In this paper, we adopt the technique of sequential pattern mining to discover user query behavior from query histories and accordingly offer efficient query prediction and query cache services; this is similar in spirit to Oikonomopoulou, Rigou, Sirmakessis, and Tsakalidis (2004), who instead mine server log files, and to Gopalratnam and Cook (2007), who use a different sequential prediction algorithm, Active LeZi.

Intelligent agents are programs that act on behalf of their human users to perform laborious information-gathering tasks. The last ten years have seen a surge of interest in agent-oriented technology, spanning applications as diverse as information retrieval and intelligent document filtering. We notice that case-based reasoning (CBR) has played an important role in agent development. For example, Liu and Leung (2001) present a Web-based case-based reasoning model to assist investors in determining stock trends. The model conforms to a Web-based agent framework, forming part of an advisory system for financial forecasting; cases are collected based on the theory of wave features and their combinations, and the agent framework supports knowledge generation, wave unit mining and wave pattern recognition, and case revision and learning. Aktas, Pierce, Fox, and Leake (2004) develop a recommender system that uses conversational case-based reasoning, with Semantic Web markup languages providing a standard form of case representation to aid metadata discovery. Lorenzi, dos Santos, and Bazzan (2005) present the use of swarm intelligence for task allocation among cooperative agents, applied to a case-based recommender system that helps in planning a trip. In this paper, the CBR technique is used as a problem-solving mechanism that provides adapted past queries. It is also used as a learning mechanism that retains highly satisfying queries to improve problem-solving performance. We further present a hybrid approach that combines CBR with RBR for providing solutions, as Shi and Barnden (2005) do in the different setting of diagnosing multiple medical disorders.
9. Conclusions

We have proposed a three-tier Proxy Agent, composed of a first-tier Predictor, a second-tier CBR, and a third-tier RBR, to work as a mediator between the users and the backend Answerer Agent of an FAQ service system. The Predictor makes use of an improved sequential pattern mining algorithm, Full-Scan-with-PHP, to discover user query behavior as a basis for constructing Cache Pool and the Prediction Model, which support efficient query prediction and query cache. The CBR tier employs an improved case-based reasoning technique to reason about adapted solutions for a given user query from a case library, with the help of the domain ontology; the case library is then fine-tuned according to the user feedback
information. The RBR tier makes use of a frequent itemset mining algorithm, PD with PH, and utilizes an ontology-supported rule-based reasoning technique to generate possible solutions with the support of the PCDIY ontology, and accordingly provides those solutions to the user. The Agent is interesting in the following respects. Firstly, it performs fast user-oriented mining and prediction: the Predictor uses the user query history stored in User Model Base to discover frequent queries and predicted queries, and the sequential pattern mining algorithm is made more efficient by the techniques of perfect hashing and database decomposition. Secondly, it performs ontology-directed case-based reasoning: the semantics of the PC ontology, in particular the VRelationships, are used in determining similar cases, performing case adaptation, and retaining cases. Thirdly, it self-improves by tuning the survival value of each case in the case library; this tuning process involves user satisfaction and case similarity as two major factors, which are both user-oriented and ontology-supported. Finally, it fulfils ontology-supported rule-based reasoning: the PH technique combined with the PD algorithm mines the frequent itemsets, a modified version of the Apriori and Eclat algorithms performs the rule-mining task, and the reasoning mechanism picks out the most suitable rules to answer the user query under the support of the PCDIY Ontology. Our experiments show that the Agent can take up to 79.1% of the query load from the backend process, leaving about 20.9% of the queries for the backend Answerer Agent, which stands for a big improvement in overall query performance.

For easy demonstration of the techniques in our system, the current implementation runs on the specific PC domain. However, we believe our techniques remain applicable even when the domain is scaled up. The idea is this: rather than directly scaling up our ontology, we can create a complex system by integrating a set of simple systems through a multi-agent architecture supported by a set of simple domain ontologies. By exploiting the capability of Protégé2000, which supports the creation, extension, and cooperation of a set of domain ontologies, we need not make many changes to our system to transform it into a complex system. We need only recollect and reconstruct ODAC and the prediction models to support the first-tier proxy, rebuild the adaptation rule base to support the second-tier proxy, re-mine the frequent itemsets for the rule-mining task to support the third-tier proxy, and devise a mechanism to ensure that a set of ontology-supported systems can cooperate effectively, which is under our investigation.

Acknowledgements

The authors would like to thank Y.H. Chiu, Y.H. Chang, and F.C. Chuang for their assistance in system implementation. This work was supported by the National Science Council, ROC, under Grants NSC-89-2213-E-011-059, NSC-89-2218-E-011-014, and NSC-95-2221-E-129-019.

References

Agrawal, R., & Srikant, R. (1994). Fast algorithm for mining association rules. In Proceedings of the 20th international conference on very large databases (pp. 487–499). Santiago de Chile, Chile.
Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In Proceedings of the 11th international conference on data engineering (pp. 3–14). Taipei, Taiwan, ROC.
Aktas, M. S., Pierce, M., Fox, G. C., & Leake, D. (2004).
A web based conversational case-based recommender system for ontology aided metadata discovery. In Proceedings of the 5th IEEE/ACM international workshop on grid computing (pp. 69–75). Washington, DC, USA.
Asnicar, F. A., & Tasso, C. (1997). If web: A prototype of user model-based intelligent agent for document filtering and navigation in world wide web. In Proceedings of the 6th international conference on user modelling (pp. 3–12). Sardinia, Italy.
Bonino, D., Corno, F., & Squillero, G. (2003). A real-time evolutionary algorithm for web prediction. In Proceedings of the 2003 IEEE/WIC international conference on web intelligence (pp. 139–145). Halifax, Canada.
Bota, F., Corno, F., Farinetti, L., & Squillero, G. (2002). A transparent search agent for closed collections. In Proceedings of international conference on advances in infrastructure for e-business, e-education, e-service, and e-medicine on the internet (pp. 205–210). L'Aquila, Italy.
Chen, M. S., Park, J. S., & Yu, P. S. (1998). Efficient data mining for path traversal patterns. IEEE Transactions on Knowledge and Data Engineering, 10(2), 209–221.
Chuang, F. C. (2003). Ontology-supported and ranking technique-enhanced FAQ systems. Master Thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan.
Frawley, W., Picatetsky-Shapiro, G., & Matheus, C. (1992). Knowledge discovery in database: An overview. AI Magazine, 13(3), 57–70.
Gopalratnam, K., & Cook, D. J. (2007). Online sequential prediction via incremental parsing: The active LeZi algorithm. IEEE Intelligence Systems, 22(1), 52–58.
Joachims, T., Freitag, D., & Mitchell, T. (1997). WebWatcher: A tour guide for the world wide web. In Proceedings of the 1997 international joint conference on artificial intelligence (pp. 770–775). Nagoya, Japan.
Liu, J. N. K., & Leung, T. T. S. (2001). A Web-based CBR agent for financial forecasting. In Workshop 5: Soft computing in case-based reasoning. Vancouver, BC, Canada.
Lorenzi, F., dos Santos, D. S., & Bazzan, A. L. C. (2005). Negotiation for task allocation among agents in case-based recommender systems: A swarm-intelligence approach. In 2005 international workshop on multi-agent information retrieval and recommender systems (pp. 23–27). Edinburgh, Scotland.
Lou, W. W., & Lu, H. J. (2002). Efficient prediction of web accesses on a proxy server. In Proceedings of the 17th international conference on information and knowledge management (pp. 169–176). McLean, Virginia, USA.
Mendonca, M., & Sunderhaft, N. L. (1999). Mining software engineering data: A survey. A DACS (Data & Analysis Center for Software) state-of-the-art report.
Noy, N. J., & Hafner, C. D. (1997). The state of the art in ontology design. AI Magazine, pp. 53–74.
Oikonomopoulou, D., Rigou, M., Sirmakessis, S., & Tsakalidis, A. (2004). Full-coverage web prediction based on web usage mining and site topology. In IEEE/WIC/ACM international conference on web intelligence (pp. 716–719). Beijing, China.
Özel, S. A., & Güvenir, H. A. (2001). An algorithm for mining association rules using perfect hashing and database pruning. In Proceedings of the 10th Turkish symposium on artificial intelligence and neural networks (pp. 257–264). North Cyprus.
Ruggieri, S. (2002). Efficient C4.5. IEEE Transactions on Knowledge and Data Engineering, 14(2), 438–444.
Shi, W., & Barnden, J. A. (2005). How to combine CBR and RBR for diagnosing multiple medical disorder cases. In Proceedings of the 6th international conference on case-based reasoning (pp. 477–491). Chicago, IL, USA.
Shortland, R., & Scarfe, R. (1995). Digging for gold. IEE Review, 41, 213–217.
Tsai, C. F., Wu, H. C., & Tsai, C. W. (2002).
A new data clustering approach for data mining in large databases. In International symposium on parallel architectures, algorithms, and networks (pp. 278–283). Makati City, Metro Manila, Philippines.
Yang, S. Y. (2006). An ontology-directed webpage classifier for web services. In Proceedings of the joint 3rd international conference on soft computing and intelligent systems and 7th international symposium on advanced intelligent systems (pp. 720–724). Tokyo, Japan.
Yang, S. Y. (2007a). An ontological website models-supported search agent for web services. Expert Systems with Applications (on-line, accepted for publication).
Yang, S. Y. (2007b). An ontological proxy agent for web information processing. In Proceedings of the 10th international conference on computer science and informatics (pp. 671–677). Salt Lake City, Utah, USA.
Yang, S. Y. (2007c). An ontological multi-agent system for web FAQ query. In Proceedings of 2007 international conference on machine learning and cybernetics (pp. 2964–2969). HongKong, China.
Yang, S. Y. (2007d). An ontological interface agent for FAQ query processing. WSEAS Transactions on Information Science and Applications, 11(4), 1400–1407.
Yang, S. Y. (in press). Developing of an ontological interface agent with template-based linguistic processing technique for FAQ services. Expert Systems with Applications (on-line).
Yang, Q., Zhang, H. H., & Li, T. (2001). Mining web logs for prediction models in www caching and prefetching. In Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 473–478). San Francisco, California, USA.
Yang, S. Y., Chuang, F. C., & Ho, C. S. (2007). Ontology-supported FAQ processing and ranking techniques. Journal of Intelligent Information Systems, 28(3), 233–251.
Yang, S. Y., Liao, P. C., & Ho, C. S. (2005a). A user-oriented query prediction and cache technique for FAQ proxy service. In Proceedings of the 2005 international workshop on distance education technologies (pp. 411–416). Banff, Canada.
Yang, S. Y., Liao, P. C., & Ho, C. S. (2005b). An ontology-supported case-based reasoning technique for FAQ proxy service. In Proceedings of the 17th international conference on software engineering and knowledge engineering (pp. 639–644). Taipei, Taiwan.
9370
S.-Y. Yang, C.-L. Hsu / Expert Systems with Applications 36 (2009) 9358–9370
Yang, S. Y. (2008a). Developing an ontological FAQ system with FAQ processing and ranking techniques for ubiquitous services. In Proceedings of 1st IEEE international conference on ubi-media computing. Lanzhou, China. Yang, S. Y. (2008b). Developing of an ontological focused-crawler for ubiquitous services. In IEEE 22nd international conference on advanced information networking and applications (pp. 1486–1492). Okinawa, Japan. Yang, S. Y., Hsu, C. L., Lee, D. L., & Deng, L. Y. (2008). FAQ-master: An ontological multi-agent system for web FAQ services. WSEAS Transactions on Information Science and Applications, 5(3), 221–228.
Zaki, M. J., Parthasarathy, S., Ogihara, M., & Li, W. (1997). New algorithms for fast discovery of association rules. In Proceedings of the 3rd international conference on KDD and data mining (pp. 283–286). Newport Beach, California, USA. Zhang, H. (2001). Improving performance on www using path-based predictive caching and prefetching. Available at . Zou, Q., Chu, W., David, J., & Chiu, H. (2002). A pattern decomposition algorithm for data mining of frequent patterns. Knowledge and Information Systems, 4, 466–482.
Expert Systems with Applications 36 (2009) 10148–10157
OntoPortal: An ontology-supported portal architecture with linguistically enhanced and focused crawler technologies Sheng-Yuan Yang * Department of Computer and Communication Engineering, St. John’s University, 499, Sec. 4, TamKing Rd., Tamsui, Taipei County 25135, Taiwan, ROC
a r t i c l e  i n f o
Keywords: Ontology Web portals Automatic annotation Focused crawlers
a b s t r a c t

This paper proposed the techniques of ontology and linguistics to develop a fully-automatic annotation technique which, coupled with an automatic ontology construction method, could play a key role in the development of Semantic Portals. An ontology-supported portal architecture, OntoPortal, was proposed accordingly, in which three internal components, Portal Interface, Semantic Portal, and OntoCrawler, were integrated to rapidly and precisely collect information on the Internet, capture the user's true intention, and accordingly provide high-quality query answers to meet user requests. This paper also demonstrated the OntoPortal prototype, which defines how a semantic portal interacts with the user by providing five different types of interaction patterns: keyword search, synonym search, POS (part-of-speech)-constrained keyword search, natural language query, and semantic index search. The preliminary experimental outcomes showed that the proposed technology can substantially raise the precision and recall rates of webpage searching, and that it can indeed retrieve better semantic-directed information to meet user requests. © 2009 Elsevier Ltd. All rights reserved.
1. Introduction

The Web has been drastically changing the availability of electronically accessible information. An information portal, which can help people effectively use this voluminous repository, is becoming ever more important. Tim Berners-Lee envisions the "Semantic Web" as the next generation of the Web, in which web pages become machine-readable (Lee, Hendler, & Lassila, 2001). Techniques to realize this vision include those centered on domain ontology (Lee et al., 2001), which help conceptualize domain knowledge. Webpage annotation is another important technique, which adds relevant domain knowledge and meta-information to a web page to enhance its semantics (Albanese et al., 2005). Logic inference is yet another important technique, which allows the development of various web services. These techniques can effectively relieve the bottlenecks of the current Web (Fensel, Hendler, Lieberman, & Wahlster, 2000; Stollberg & Thomas, 2005). Automatic webpage annotation can even reduce the high cost of manually pre-processing web pages into a semantic web (Benjamins, Contreras, Corcho, & Gomez-Perez, 2002; Kiyavitskaya, Zeni, Cordy, Mich, & Mylopoulos, 2005). We also notice that the concept of crawler is mostly used in Web systems that work on information gathering or integration to improve their gathering processes or search results in ubiquitous information environments. For instance, Dominos (Hafri &
* Tel.: +886 2 28013131x6394; fax: +886 2 28013131x6391. E-mail address: [email protected] 0957-4174/$ - see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2009.01.004
Djeraba, 2004) can crawl several thousand pages every second and includes a high-performance fault manager; the system is platform independent and is able to adapt transparently to a wide range of configurations without incurring additional hardware expenditure. Ganesh, Jayaraj, Kalyan, and Aghila (2004) developed an ontology-supported Web crawler with an association metric that estimates the semantic content of a URL based on the domain-dependent ontology, which in turn strengthens the metric used for prioritizing the URL queue. UbiCrawler (Boldi, Codenotti, Samtini, & Vigna, 2004), a scalable distributed web crawler, is platform independent and linearly scalable, degrades gracefully in the presence of faults, uses a very effective assignment function for partitioning the domain to crawl, and, more generally, completely decentralizes every task. We conjectured that these techniques can equally help us overcome the problems associated with general portals, and we employed the techniques of ontology and linguistics to develop a fully-automatic annotation technique (Yang, 2007) which, coupled with an automatic ontology construction method, could play a key role in the development of Semantic Portals (Chang, 2003). An ontology-supported portal architecture, OntoPortal, was proposed according to this technique; it integrates three internal components, Portal Interface, Semantic Portal, and OntoCrawler (Yang, Chen, & Wu, 2008), to rapidly and precisely collect information on the Internet, capture the user's true intention, and accordingly provide high-quality query answers to meet user requests. This paper also demonstrates the OntoPortal prototype, which defines how a semantic portal interacts with the user by providing five different types
of interaction patterns: keyword search, synonym search, POS (part-of-speech)-constrained keyword search, natural language query, and semantic index search. The preliminary experimental outcomes showed that the proposed technology can substantially raise the precision and recall rates of webpage searching, and that it can indeed retrieve better semantic-directed information to meet user requests. The rest of the paper is organized as follows: Section 2 develops the domain ontology. Section 3 explains the OntoPortal architecture. Section 4 reports the system demonstrations and evaluations. Section 5 compares the work with related works, while Section 6 concludes it. The Scholar: Einstein domain is chosen as the target application of OntoPortal and will be used for explanation in the remaining sections.
2. Domain ontology as the down-to-the-earth semantics

An ontology provides a complete semantic model of a specified domain: all related entities, their attributes, and the base knowledge among entities are described in a sharable and reusable form, which helps solve the problems of common sharing and communication (Yang & Ho, 1999). Protégé (http://protege.stanford.edu/) (Noy et al., 2001) is a free, open-source ontology editor and knowledge-base framework developed by the Stanford Center for Biomedical Informatics Research at the Stanford University School of Medicine. Protégé is not only one of the most important platforms for constructing ontologies but also the most frequently adopted one (Grosso et al., 1999). Protégé 3.3.1 was adopted in this paper; it supports two main ways of modeling ontologies, via the Protégé-Frames and Protégé-OWL editors (the latter adopted by us, as shown in Fig. 1). Protégé ontologies can be exported into a variety of formats, including RDF(S), OWL, and XML Schema, leading knowledge workers to construct ontology-based knowledge management systems; furthermore, users can transfer an ontology into different formats such as RDF(S), OWL, or XML, or import it directly into a database such as MySQL or MS SQL Server (adopted by us and detailed later), a level of support that other tools do not match (Chien, 2006). Nowadays the research on ontology branches into two fields: one configures a large ontology for a specified field and uses it to assist knowledge analysis in that field; the other studies how to construct and express ontologies precisely (Chien, 2006). In this paper we adopted the former, taking advantage of a pre-built ontology database of scholars to support OntoCrawler in querying the webpages of related scholars. First of all, we conducted a statistical survey of the homepages of related scholars to extract the related concepts and their synonyms appearing in the homepages. The second stage of constructing the scholar ontology transfers the ontology built with Protégé into an MS SQL database. The procedure is as follows: (1) define the scholars' ontology analyzed in the first stage with Protégé, so as to share the ontology with other interested researchers; (2) export an XML file constructed from the Protégé knowledge base and import it into MS Excel for correction, strong evidence of the knowledge reuse and fast embedding Protégé enables; (3) finally, import the MS Excel data into MS SQL Server to finish the ontology construction of this system. Fig. 1 indicates the structure of the domain ontology of scholars in Protégé. Taking the middle frame of the screen for instance, the related concept "Education" is linked to "M.S.", "Ph.D.",
Fig. 1. The ontology structure of scholars in Protégé.
and "B.S.". In application, we defined these as related concepts, meaning that "Education" is simply a combination of them; OntoCrawler can conveniently interpret this combination and compare it with the content of a queried webpage. If the comparison matches any of the items, we infer that the related concept "Education" is satisfied as a matching condition for the webpage query. For simplicity in demonstrating all portal-related techniques, we narrow the ontology scope down to the Scientist ontology (i.e., Einstein), as shown in Fig. 2. The Scientist ontology hierarchy was developed with the method described in Uschold and Gruninger (1996). In the figure, nodes represent ontology concepts; links labeled "isA" denote the parent-child relationship between concepts, which allows inheritance of features from parent classes to child classes; links labeled otherwise represent reference relationships from one concept to another, where "*" denotes a multiple instance reference (Chang, 2003). In addition to the Scientist ontology, we also defined a query ontology to facilitate the description of queries. Fig. 3 shows the query ontology, where the root node represents the class "Query", which contains a set of fields defining the semantics of the class, each field representing one attribute of "Query", e.g., QueryType, Name, etc. The links labeled "io" represent the relationship "instance_of", relating an instance to the query class. The figure shows four query instances, Who, What, Which, and Where, each standing for a specific solution pattern. For example, query instance "Who" has "People" as the value of its QueryType attribute, which means the answer to the query instance must belong to the People class. Finally, we have used Protégé's APIs (http://protege.stanford.edu/doc/pdk/api/) to develop a set of ontology services, which provide primitive functions to support the applications of the ontologies. The ontology services currently available include transforming query terms into canonical ontology terms, finding definitions of specific terms in the ontology, finding relationships among terms, and finding compatible and/or conflicting terms against a specific term.

Fig. 2. Part of Scientist ontology hierarchy.

Fig. 3. Query ontology.

3. OntoPortal architecture

Fig. 4 illustrates the architecture of OntoPortal. It integrates three internal components, Portal Interface, Semantic Portal, and OntoCrawler, to rapidly and precisely collect Web information on the Internet, capture the user's true intention, and accordingly provide high-quality query answers to meet user requests. Firstly, OntoCrawler employs the ontology-supported technique to actively compare and verify the user's keywords so as to raise the precision rate of webpage searching. This technique has been practically installed on top of the Google and Yahoo search engines; it searches for related scholars' personal webpages and stores the results in system databases so that the front-end components can perform advanced processing. Semantic Portal is responsible for producing three supporting knowledge bases that allow Portal Interface to provide a better semantic portal. Finally, Portal Interface applies those knowledge bases during the interaction with the user by providing five different types of search mechanisms for retrieving high-quality query answers.

Fig. 4. OntoPortal architecture.

3.1. OntoCrawler architecture

Fig. 5 shows the operational structure of OntoCrawler; the techniques and functions of each part are described below (a minimal Java sketch of the main steps follows the list):

(1) Action: transfers the internal query into URI-encoded form and embeds it into Google's query URL, for example: http://www.google.com.tw/search?hl=zh-TW&q=%E5%AD%B8%E8%80%85&meta=.

Fig. 5. System structure of OntoCrawler.
(2) LinkToGoogle: declares a URL object for the Google query URL with the URI-encoded keywords appended, uses a BufferedReader inside a while loop to read the page and append it line by line to the String variable "line", and finally outputs "line" as a text file for later analysis. The file content is the HTML source of the webpage.

(3) RetrieveLinks: initializes an int variable "count" to 0 to make the display of search progress convenient. The system then uses regular expressions to search for matching URLs. Because of the indented layout of the Google result page, all the links cannot be retrieved in one pass, so a "for" loop runs twice with slightly different regular expressions so as to completely extract the related hyperlinks that match the conditions. Finally, all hyperlinks are appended to the String variable "RetrieveLink" and output to a txt file for further processing.

(4) RetrieveContent: content retrieval takes so much time that, executed sequentially, it would make the system interface appear to freeze entirely. The system therefore runs it in a separate thread, freeing the Swing event thread so that the interface can be updated properly while webpages are being queried. A Boolean variable "crawling" is set to "true" to indicate that webpage querying is active, the progress bar starts at 0%, the Button (Java class) caption is set to "stop", and the TextArea (Java class) is set to "not editable". A BufferedReader then reads "RetrieveLink" line by line in a "while" loop; each line holds one URL, which is actually connected to. After judging the encoding of the webpage, the HTML source is read in with the correct encoding, appended to the String variable s3, and output as a text file for further processing. After all the above steps, the SearchMatches method (described later) judges whether the webpage lies within the range we hoped to query; if the answer is "yes", RemoveHTMLLabel (described later) deletes the HTML labels from the source file, leaving only the text content for further processing and analysis. Finally, the number of queried webpages divided by the total number of webpages gives the percentage of query progress. "RetrieveLink" must be cleared lest the next query cause errors. The last step sets the TextArea back to "editable" and the Button caption to "start", and sets "crawling" to "false" to end the whole webpage query procedure.

(5) SearchMatches: supports RetrieveContent through an internal call, judging whether a webpage is within the queried range. It links to the ontology database and fetches the concept terms to compare against the content of s3. If the number of matched terms reaches the value we set (usually 6, indicating that about 50% of the concept terms of the queried webpage match the query condition), the system returns the Boolean variable "matches" as "true", meaning the webpage matches the query condition; "false" means it does not.

(6) RemoveHTMLLabel: like SearchMatches, it supports RetrieveContent through an internal call and deletes the HTML labels in the HTML source file. A series of regular expressions removes the HTML labels in s3, after which we obtain the webpage as a pure text file.
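To make the above procedures concrete, the following minimal Java sketch illustrates the flavor of steps (1), (2), (3), (5), and (6): URI-encoding the query, fetching a page, extracting hyperlinks with a regular expression, matching a page against ontology terms, and stripping HTML labels. The class and method names are hypothetical, and the real OntoCrawler details (threading, progress display, encoding detection, database access) are omitted.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CrawlerSketch {                       // hypothetical class name

        // Step (1): embed the URI-encoded keyword into Google's query URL.
        static String buildQueryUrl(String keyword) throws Exception {
            return "http://www.google.com.tw/search?hl=zh-TW&q="
                    + URLEncoder.encode(keyword, "UTF-8") + "&meta=";
        }

        // Step (2): read the HTML source of a page line by line.
        static String fetchPage(String url) throws Exception {
            StringBuilder sb = new StringBuilder();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream(), "UTF-8"));
            for (String line; (line = in.readLine()) != null; ) sb.append(line).append('\n');
            in.close();
            return sb.toString();
        }

        // Step (3): extract hyperlinks with a regular expression.
        static List<String> retrieveLinks(String html) {
            List<String> links = new ArrayList<>();
            Matcher m = Pattern.compile("href=\"(http[^\"]+)\"").matcher(html);
            while (m.find()) links.add(m.group(1));
            return links;
        }

        // Step (5): count ontology terms appearing in the page; the threshold 6
        // stands for roughly 50% of the concept terms, as in the paper.
        static boolean searchMatches(String text, List<String> ontologyTerms) {
            int hits = 0;
            for (String term : ontologyTerms) if (text.contains(term)) hits++;
            return hits >= 6;
        }

        // Step (6): remove HTML labels, keeping only the text content.
        static String removeHtmlLabel(String html) {
            return html.replaceAll("<[^>]*>", " ");
        }
    }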
Fig. 6. Semantic portal architecture.
3.2. Semantic Portal architecture and Portal Interface workflow

Fig. 6 illustrates the architecture of Semantic Portal. OntoCrawler is responsible for collecting Web information, removing HTML tags and punctuation from the original web pages, and storing the results in "Pre-Processed WebPages Database". Semantic Webpage Generator then builds on this database to produce three supporting knowledge bases. The first knowledge base is "Stemmed WebPages Database", created by Stemmer (detailed later), which stems words and removes stop and noisy words from a pre-processed web page. The second is "POS (Parts-of-Speech)-attached WebPages Database", created by POS Tagger (detailed later), which analyzes the POS of the words on a web page and transforms the web page into a list of words with POS. The third is "Annotated WebPages Database", created by Automatic Annotator (detailed later), which automatically identifies the concepts and relationships contained in a web page, derives a content summary for the web page, and constructs a semantic index for it. These knowledge bases, along with Ontology Database, support Portal Interface (Fig. 7) in providing a better Semantic Portal. In other words, Portal Interface applies these knowledge bases during the interaction with the user by providing five different types of search mechanisms, including keyword search, synonym search, POS-constrained keyword search, natural language query, and semantic index search. The user can choose one of the five patterns to input his/her queries. Answer Generator finally produces semantic-directed answers in response to the user query. Finally, Ontology Database provides the most fundamental semantics, as illustrated in Fig. 2.

3.2.1. Semantic Webpage Generator

Semantic Webpage Generator builds on the pre-processed WebPages database to produce the three supporting knowledge bases for webpage annotation through three components: Stemmer, POS Tagger, and Automatic Annotator.
Table 1 Part of POS tags used by Stemmer and POS Tagger.

No. | POS Tag | Meaning
1  | BE   | Be
2  | BEDR | Were
3  | BEDZ | Was
4  | BEG  | Being
5  | BEM  | Am
6  | BEN  | Been
7  | BER  | Are
8  | BEZ  | Is
9  | CC   | Conjunction, coordinating (and)
10 | DO   | Do
11 | DOD  | Did
12 | DON  | Done
13 | DOG  | Doing
14 | DOZ  | Does
15 | DT   | Determiner, general (a, the, this, that)
16 | IN   | Preposition (on, of)
17 | MD   | Modal auxiliary (might, will)
18 | TO   | Infinitive marker (to)
19 | WDT  | Det, wh- (what, which, whatever, whichever)
20 | WP   | Pronoun, wh- (who, that)
21 | WP$  | Pronoun, possessive wh- (whose)
22 | WRB  | Adv, wh- (how, when, where, why)
Fig. 7. Portal Interface workflow.
3.2.1.1. Stemmer. Fig. 8 illustrates the workflow of Stemmer. Qtag (http://www.english.bham.ac.uk/staff/omason/software/qtag.html) is used here to help remove stop words, passing only major terms to WordNet-based Stemmer. Specifically, Qtag employs probability with a sliding window (Tufis & Mason, 1998) to determine the POS of the words in a web page. The output of Qtag contains detailed POS tags (examples are shown in Table 1), which are further categorized into five categories, namely, noun, verb, adjective, adverb, and DC (for Do not Care), for the purpose of removing stop words. Finally, WordNet-based Stemmer employs WordNet, a comprehensive lexical database (WordNet 2.1, 2005), to transform the major terms into their stems and stores them in Stemmed WebPages Database.
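As an illustration of this tag-to-category reduction, the sketch below maps a few of the detailed tags of Table 1 into the five coarse categories and treats DC words as removable stop words. The partial mapping and class names are hypothetical reconstructions for illustration, not the system's actual tables.

    import java.util.Map;

    public class TagCategorizer {                                  // hypothetical
        enum Category { NOUN, VERB, ADJECTIVE, ADVERB, DC }

        // Partial mapping from detailed POS tags (Table 1) to the five categories.
        static final Map<String, Category> CATEGORY = Map.of(
                "BE",  Category.VERB,   "BEDZ", Category.VERB,
                "CC",  Category.DC,     "DT",   Category.DC,
                "IN",  Category.DC,     "MD",   Category.DC,
                "WRB", Category.ADVERB);

        // A word is kept as a major term unless its tag falls into DC.
        static boolean isMajorTerm(String posTag) {
            return CATEGORY.getOrDefault(posTag, Category.NOUN) != Category.DC;
        }
    }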
Fig. 8. Workflow of Stemmer.

Fig. 9. Workflow of POS Tagger.

3.2.1.2. POS Tagger. Fig. 9 illustrates the workflow of POS Tagger, which is very similar to that of Stemmer. Qtag is used here to actually produce a POS for each word of a web page. WordNet-based Stemmer is then used, as in Stemmer, to do word stemming.

Fig. 10. Automatic annotator architecture.

3.2.1.3. Automatic Annotator. Fig. 10 illustrates the architecture of Automatic Annotator. Concept Identifier and Relationship Identifier work together to extract the concepts and relationships contained in a web page, supported by the domain ontology. Text Summarizer employs Text-Miner, developed by IBM Intelligent
Miner for Text (1998), to identify important sentences in order to derive a summary for the web page. Semantic Index Generator identifies the most important sentence in the summary as the semantic index, supported by the domain ontology again. Once the semantic index is identified, it is treated as the center theme of the webpage and we can use it to refine the summary. Basically, we employ the vector space model (Salton, Wong, & Yang, 1975) to calculate the correlations between the semantic index and the sentences in the webpage summary, and accordingly adjust their appearance order or remove some less relevant sentences. This helps us develop a summary which contains sentences that are not only highly-weighted but also highly relevant to the center theme. Finally, XML Translator adds all of these into a web page and transforms it into an annotated web page. Note that Automatic Annotator not only annotates a web page, but also adds relevant synonyms into the annotated web page, with the help of Concept Identifier and the domain ontology (Chang, 2003).
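To make this refinement step concrete, the sketch below computes the vector-space (cosine) correlation of Salton, Wong, and Yang (1975) between the semantic index and a candidate summary sentence over simple term-frequency vectors. The helper names and whitespace tokenization are assumptions for illustration, not the actual Text-Miner implementation.

    import java.util.HashMap;
    import java.util.Map;

    public class SummaryRefiner {                                   // hypothetical
        // Term-frequency vector of a sentence (whitespace tokenization only).
        static Map<String, Integer> tf(String sentence) {
            Map<String, Integer> v = new HashMap<>();
            for (String w : sentence.toLowerCase().split("\\s+")) v.merge(w, 1, Integer::sum);
            return v;
        }

        // Cosine similarity between the semantic index and a summary sentence.
        static double cosine(String semanticIndex, String sentence) {
            Map<String, Integer> a = tf(semanticIndex), b = tf(sentence);
            double dot = 0, na = 0, nb = 0;
            for (Map.Entry<String, Integer> e : a.entrySet()) {
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
                na += e.getValue() * e.getValue();
            }
            for (int f : b.values()) nb += f * f;
            return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }
        // Sentences whose cosine score falls below a threshold can be removed,
        // and the rest re-ordered by decreasing score, as described above.
    }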
4. System evaluations and demonstrations

Evaluating the overall performance of our proposed system involves a lot of manpower and is time-consuming. Here we focus first on the performance evaluation of the most important module of the OntoPortal architecture, OntoCrawler. Our philosophy is that if OntoCrawler can precisely collect domain-related webpages, then we not only can precisely and effectively produce solutions with better semantics, but can also efficiently improve the quality of the retrieved information to meet user requests, as the system demonstrations will make clear.

4.1. Evaluations on OntoCrawler

In this experiment, we compared our query technique with the most popular query engines, Google and Yahoo, using ten different keywords as webpage queries for related information. First of all, we keyed the same set of keywords into Google and Yahoo. The comparison results are shown in Table 2, in which NWT denotes the total number of returned webpages; NWC the number of
correct returned webpages; and NWR the number of related (but not necessarily correct) returned webpages. We used Eqs. (1) and (2) to define the precision rate R_P and the recall rate R_R. After comparing the returned webpages one by one, we found the average R_P and R_R of Google to be 1.90% and 3.70%, and those of Yahoo to be 1.80% and 3.90%, respectively.

R_P = NW_C / NW_T    (1)

R_R = (NW_R + NW_C) / NW_T    (2)

We then keyed the same set of keywords into OntoCrawler, using Google and Yahoo as the base of the webpage query. The comparison results are shown in Table 3. After comparing the returned webpages one by one, we found the average R_P and R_R of OntoCrawler on Google to be 21.93% and 38.51%, and on Yahoo 14.48% and 35.33%, respectively. Comparing Tables 2 and 3, the query precision and recall rates for the Google search engine with the assistance of OntoCrawler rise by around 91% ((21.93% - 1.90%)/21.93%; (38.51% - 3.70%)/38.51%), while those for Yahoo rise by about 89% ((14.48% - 1.80%)/14.48%; (35.33% - 3.90%)/35.33%). This comparison indicates that OntoCrawler offers better precision and recall rates than Google and Yahoo in webpage searching; in addition, it shows that the proposed technique is practicable.

4.2. Demonstrations on OntoPortal prototype

First, suppose the user chooses to query by keywords. Query Manager will directly invoke Keyword Matcher, which uses full-text search to compare the keywords with the words in all of the web pages stored in Pre-Processed WebPages Database. This process is the same as the classical keyword match employed by most general portals. Two major issues are associated with the keyword-based query method, however. First, it only retrieves web pages which contain syntactically the same keywords. Second, keywords of different parts of speech may represent different meanings. The second issue can easily be coped with by allowing the user to use the Word+POS Match method, which requires the user to associate a part of speech with each input keyword. Query Manager then invokes Keyword+POS Matcher to search in POS-attached WebPages
Table 2 Comparison of the front 100 queries on Google and Yahoo.

(a) Google
No. | Affiliation | Keyword | NWC | NWR | NWT | RP (%) | RR (%)
1 | Dept. of Computer Science and Information Engineering, National Taiwan University | (Tei-Wei Kuo) | 1 | 2 | 100 | 1.00 | 3.00
2 | Dept. of Chemical Engineering, National Cheng Kung University | (J.H. Liu) | 3 | 1 | 100 | 3.00 | 4.00
3 | Dept. of Electronics Engineering, National Tsing Hua University | (Sheng-fu Horng) | 2 | 1 | 100 | 2.00 | 3.00
4 | Dept. of Photonics, National Chiao Tung University | (Ken Y. Hsu) | 1 | 4 | 100 | 1.00 | 5.00
5 | Dept. of International Business, National ChengChi University | (Jyh-Shen Chiou) | 2 | 0 | 100 | 2.00 | 2.00
6 | Dept. of Business Administration, National Central University | (Ming-ji James Lin) | 3 | 3 | 100 | 3.00 | 6.00
7 | Graduate Institution of Finance, National Taiwan University of Science and Technology | (Bing-Huei Lin) | 3 | 3 | 100 | 3.00 | 6.00
8 | Dept. of Business Management, National Taipei University of Technology | (Ching-Jui Keng) | 1 | 3 | 100 | 1.00 | 4.00
9 | Dept. of Electrical Engineering, National Chung Cheng University | (Cheng Shong Wu) | 1 | 0 | 100 | 1.00 | 1.00
10 | Dept. of Information Management, National Sun Yat-Sen University | (BingChiang Jeng) | 2 | 1 | 100 | 2.00 | 3.00
Average | | | | | | 1.90 | 3.70

(b) Yahoo
No. | Affiliation | Keyword | NWC | NWR | NWT | RP (%) | RR (%)
1 | Dept. of Computer Science and Information Engineering, National Taiwan University | (Tei-Wei Kuo) | 1 | 3 | 100 | 1.00 | 4.00
2 | Dept. of Chemical Engineering, National Cheng Kung University | (J.H. Liu) | 2 | 2 | 100 | 2.00 | 4.00
3 | Dept. of Electronics Engineering, National Tsing Hua University | (Sheng-fu Horng) | 3 | 1 | 100 | 3.00 | 4.00
4 | Dept. of Photonics, National Chiao Tung University | (Ken Y. Hsu) | 2 | 2 | 100 | 2.00 | 4.00
5 | Dept. of International Business, National ChengChi University | (Jyh-Shen Chiou) | 1 | 1 | 100 | 1.00 | 2.00
6 | Dept. of Business Administration, National Central University | (Ming-ji James Lin) | 0 | 5 | 100 | 0.00 | 5.00
7 | Graduate Institution of Finance, National Taiwan University of Science and Technology | (Bing-Huei Lin) | 5 | 2 | 100 | 5.00 | 7.00
8 | Dept. of Business Management, National Taipei University of Technology | (Ching-Jui Keng) | 0 | 3 | 100 | 0.00 | 3.00
9 | Dept. of Electrical Engineering, National Chung Cheng University | (Cheng Shong Wu) | 1 | 0 | 100 | 1.00 | 1.00
10 | Dept. of Information Management, National Sun Yat-Sen University | (BingChiang Jeng) | 3 | 2 | 100 | 3.00 | 5.00
Average | | | | | | 1.80 | 3.90
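As a quick check of Eqs. (1) and (2) against Table 2, a few lines of Java reproduce the rates of row 1 in Table 2(a); the class name is arbitrary.

    public class QueryRates {
        // Eq. (1): precision rate RP = NWC / NWT, as a percentage.
        static double precisionRate(int nwc, int nwt) { return 100.0 * nwc / nwt; }

        // Eq. (2): recall rate RR = (NWR + NWC) / NWT, as a percentage.
        static double recallRate(int nwc, int nwr, int nwt) { return 100.0 * (nwr + nwc) / nwt; }

        public static void main(String[] args) {
            // Row 1 of Table 2(a): NWC = 1, NWR = 2, NWT = 100.
            System.out.printf("RP = %.2f%%, RR = %.2f%%%n",
                    precisionRate(1, 100), recallRate(1, 2, 100));  // RP = 1.00%, RR = 3.00%
        }
    }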
Table 3 Comparison of the front 100 queries of OntoCrawler based on Google and Yahoo.

(a) OntoCrawler on Google
No. | Affiliation | Keyword | NWC | NWR | NWT | RP (%) | RR (%)
1 | Dept. of Computer Science and Information Engineering, National Taiwan University | (Tei-Wei Kuo) | 1 | 1 | 9 | 11.11 | 22.22
2 | Dept. of Chemical Engineering, National Cheng Kung University | (J.H. Liu) | 2 | 1 | 4 | 50.00 | 75.00
3 | Dept. of Electronics Engineering, National Tsing Hua University | (Sheng-fu Horng) | 2 | 0 | 7 | 28.57 | 28.57
4 | Dept. of Photonics, National Chiao Tung University | (Ken Y. Hsu) | 1 | 1 | 6 | 16.67 | 33.33
5 | Dept. of International Business, National ChengChi University | (Jyh-Shen Chiou) | 1 | 0 | 4 | 25.00 | 25.00
6 | Dept. of Business Administration, National Central University | (Ming-ji James Lin) | 1 | 2 | 4 | 25.00 | 75.00
7 | Graduate Institution of Finance, National Taiwan University of Science and Technology | (Bing-Huei Lin) | 2 | 3 | 11 | 18.18 | 45.45
8 | Dept. of Business Management, National Taipei University of Technology | (Ching-Jui Keng) | 1 | 3 | 14 | 7.14 | 28.57
9 | Dept. of Electrical Engineering, National Chung Cheng University | (Cheng Shong Wu) | 1 | 0 | 11 | 9.09 | 9.09
10 | Dept. of Information Management, National Sun Yat-Sen University | (BingChiang Jeng) | 2 | 1 | 7 | 28.57 | 42.86
Average | | | | | | 21.93 | 38.51

(b) OntoCrawler on Yahoo
No. | Affiliation | Keyword | NWC | NWR | NWT | RP (%) | RR (%)
1 | Dept. of Computer Science and Information Engineering, National Taiwan University | (Tei-Wei Kuo) | 1 | 3 | 8 | 12.50 | 50.00
2 | Dept. of Chemical Engineering, National Cheng Kung University | (J.H. Liu) | 2 | 1 | 7 | 28.57 | 42.86
3 | Dept. of Electronics Engineering, National Tsing Hua University | (Sheng-fu Horng) | 2 | 1 | 9 | 22.22 | 33.33
4 | Dept. of Photonics, National Chiao Tung University | (Ken Y. Hsu) | 2 | 3 | 18 | 11.11 | 27.78
5 | Dept. of International Business, National ChengChi University | (Jyh-Shen Chiou) | 1 | 1 | 5 | 20.00 | 40.00
6 | Dept. of Business Administration, National Central University | (Ming-ji James Lin) | 0 | 3 | 5 | 0.00 | 60.00
7 | Graduate Institution of Finance, National Taiwan University of Science and Technology | (Bing-Huei Lin) | 4 | 2 | 13 | 30.77 | 46.15
8 | Dept. of Business Management, National Taipei University of Technology | (Ching-Jui Keng) | 0 | 4 | 19 | 0.00 | 21.05
9 | Dept. of Electrical Engineering, National Chung Cheng University | (Cheng Shong Wu) | 1 | 0 | 14 | 7.14 | 7.14
10 | Dept. of Information Management, National Sun Yat-Sen University | (BingChiang Jeng) | 2 | 2 | 16 | 12.50 | 25.00
Average | | | | | | 14.48 | 35.33
Database for web pages which contain the keywords appearing in the correct part of speech, as illustrated in Fig. 11. We employ two methods to cope with the first issue in our system. The first method not only invokes Synonyms Matcher to search in Annotated WebPages Database for web pages containing the synonyms of the entered keywords, but also tackles morphological changes, for example by returning a web page containing "writing" given the input keyword "wrote", as shown in Fig. 12. Alternatively, we allow the user to post a natural language query in order to deal with the first issue. The user can place a natural language query by using the NL-Based Query method, whereupon Query Manager invokes Pseudo NL Answerer to propose answers. Fig. 13 shows an example of the NL question "Who are Einstein's sons?"; Query Manager answers "Edward (Albert)" and "Hans Albert". In addition to the above four user interface methods, we define a new keyword query method called semantic index match. The user can choose this method and enter usual keywords; Query Manager will invoke Semantic Index Searcher to search in Annotated WebPages Database for all web pages whose semantic index contains the keywords and accordingly display the semantic indices for the user, as illustrated in Fig. 14. Note that the returned page contains a list of URLs, each followed by its semantic index. This approach differs from general keyword-based search engines, which usually return the title and the first sentence of each web page.
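As an illustration of the Keyword+POS Match method described above, the following minimal sketch checks a POS-attached page for a keyword carrying the requested part of speech. The (word, tag) record layout and names are assumptions for illustration, since the paper does not specify the storage schema of POS-attached WebPages Database.

    import java.util.List;

    public class KeywordPosMatcher {                                 // hypothetical
        record TaggedWord(String word, String pos) {}                // e.g., ("book", "NN")

        // A POS-attached page matches when it contains the keyword with the
        // part of speech the user asked for (e.g., "book" as a noun, not a verb).
        static boolean matches(List<TaggedWord> page, String keyword, String pos) {
            return page.stream().anyMatch(t ->
                    t.word().equalsIgnoreCase(keyword) && t.pos().equalsIgnoreCase(pos));
        }
    }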
Fig. 11. Search result of keyword+POS match.
Fig. 12. Search results in response to stemmed keywords.
Fig. 13. Search results in response to a natural language query.
5. Related works and comparisons

The Semantic Web and semantic portals rely heavily on webpage annotation, which has inevitably become one of the hottest research topics. SHOE (Simple HTML Ontology Extension) is equipped with a manual Knowledge Annotator (SHOE, 2000), a Java program allowing the user to annotate web pages graphically. OntoMat-Annotizer (OntoMat, 2003) is a user-friendly interactive webpage annotation tool. It supports the user in the task of creating and maintaining ontology-based DAML+OIL markups, e.g., the creation of DAML instances, attributes, and relationships. Annotea (2001) in LEAD (Live and Early ADoption) uses a special RDF annotation schema to add comments (a kind of annotation) as metadata. S-CREAM (Semi-automatic CREAtion of Metadata) (Handschuh, Staab, & Ciravegna, 2002) allows the creation of metadata and is trainable for a specific domain. SEAL (SEmantic portAL) (Hartmann & Sure, 2004) is a generic approach for developing semantic portals. PASS (Portal with Access to Semantic Search) (Pinheiro & Moura,
2004) uses components of domain ontologies stored in the portal to expand terms during the search process. ONTODELLA (Viljanen, Kansala, Hyvonen, & Makela, 2006) presents a logic-based approach to creating view projections and semantic links for building semantically enhanced web applications, such as semantic portals. Khan and Malik (2007) propose an approach to leverage the development life cycle of a semantic portal through the use of the Object Management Group's Model Driven Architecture. The above annotation mechanisms are either manual or semi-automatic, which imposes heavy labor on humans facing a voluminous amount of web pages. Some automatic annotation mechanisms have appeared in the literature. For instance, Lerman, Getoor, Minton, and Knoblock (2004) describe two algorithms which use redundancies in the content of table and detail pages to help information extraction. A strategy based on active recommendation was proposed to support the semantic annotation of contested knowledge and promote annotators' interest (Sereno, Shum, & Motta, 2004). Conflict detection patterns were devised based on different data and ontologies at different inference levels, together with corresponding automatic conflict resolution strategies for image annotation (Lee & Soo, 2006). Amilcare (2002) is an information extraction system which helps automatic annotation with learning capability. Its power of linguistic analysis stems from Gate, which performs tokenization, sentence identification, POS tagging, gazetteer lookup, and named entity recognition. Gate functions virtually in the same way as the combination of Qtag, Concept/Relationship Identifier, and Text-Miner in our system during automatic annotation. However, since Gate is easier to use, we plan to employ it in our future semantic-based systems. In addition, Amilcare is salient in employing the (LP)2 algorithm, a supervised algorithm that falls into the class of wrapper induction systems using LazyNLP. (LP)2 can induce rules which help insert annotations in the texts, as well as rules which correct mistakes and imprecision in the annotations generated by the former rules. We reckon that, as an information extraction system, Amilcare works very similarly to part of our system. Its adaptive feature, supported by the underlying learning capability, however, provides yet another level of automation in webpage annotation and deserves more attention, as do growing hierarchical self-organizing maps, an unsupervised learning approach mentioned in Antonio et al. (2008).

Fig. 14. Search results in response to semantic index search.

6. Conclusions and discussions

An ontology-supported portal architecture, OntoPortal, was proposed, which integrates the techniques of ontology, linguistics, and focused crawlers to rapidly and precisely collect information on the Internet, capture the user's true intention, and accordingly provide high-quality query answers to meet user requests. This paper also demonstrated the OntoPortal prototype, which supports the five interaction patterns of the Portal Interface by generating annotated webpages, stemmed webpages, and POS-attached webpages with the help of the domain ontology. Our technique is an interesting contribution in terms of the following features. First, the preliminary experimental outcomes showed that the proposed technology can substantially raise the precision and recall rates of webpage searching, and that it can indeed retrieve better semantic-directed information to meet user requests. Second, Semantic Portal is a fully automatic annotation tool. The tool not only helps develop annotated web pages during the construction of the portal, but also provides the user with an automatic online annotation process, which the user can apply to annotate the web pages specific to his/her interests. Third, it enriches webpage annotation with a content summary and a semantic index as two facets of the representative semantics of a web page. Finally, it improves the traditional keyword-based retrieval mechanism by providing word stemming, parts of speech, natural language processing, and semantic index search. These extensions help capture the true intention of the user and return better information with semantics. For easy demonstration of the techniques in our semantic-portal prototype, the current implementation runs on a very simple "Scholar: Scientist" domain. We believe, however, that even if the domain is scaled up, our techniques are still applicable. The idea is this: we are not directly scaling up our ontology. Instead, we can create a complex system by integrating a set of simple systems through a multi-agent architecture which is supported by a set of simple domain ontologies. By exploiting the capability of Protégé-2000, which supports the creation, extension, and cooperation of a set of domain ontologies, we really need not change much in our ontology-supported semantic portal to transform it into a complex portal. What we really need to focus on is how to make a set of ontology-supported systems cooperate effectively, which is under our investigation. The ontology plays a very important role in our annotation technique. One major difficulty is its construction, which currently still relies on the help of domain experts. We plan to investigate techniques for automatic ontology construction in the future, such as the ontological engineering of Web portals mentioned in Francisco, Rodrigo, Leonardo, Jesualdo, and Dagoberto (2006). We believe our automatic annotation technique, when coupled with an automatic ontology construction technique, can help proliferate the techniques of the Semantic Web and in turn promote the development of better Semantic Portals.

Acknowledgments

The author would like to thank Y.H. Chang, T.A. Chen, and C.F. Wu for their assistance in system implementation. This work
was supported by the National Science Council, ROC, under Grant NSC-95-2221-E-129-019.

References

Albanese, C., Calabrese, A., D'Amico, A., Mele, F., Minei, G., & Kostakopoulos, L. (2005). Web-pages annotation and adaptability: A Semantic Portal on the international space station. In Proceedings of the 2nd Italian semantic web workshop: Semantic web applications and perspectives. Trento, Italy.
Amilcare. (2002). Available from: .
Annotea. (2001). Available from: .
Antonio, S. A., Jose, M. G., Emilio, S. O., Alberto, P., Rafael, M. B., & Antonio, S. L. (2008). Web mining based on growing hierarchical self-organizing maps: Analysis of a real citizen web portal. Expert Systems with Applications, 34(4), 2988–2994.
Benjamins, V. R., Contreras, J., Corcho, O., & Gomez-Perez, A. (2002). Six challenges for the Semantic Web. In Proceedings of the KR2002 workshop on the semantic web. Toulouse, France.
Boldi, P., Codenotti, B., Samtini, M., & Vigna, S. (2004). UbiCrawler: A scalable fully distributed web crawler. Software: Practice and Experience, 34(8), 711–726.
Chang, Y. H. (2003). Development and applications of Semantic Webs. Master thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan.
Chien, H. C. (2006). Study and implementation of a learning content management system search engine for special education based on Semantic Web. Master thesis, MingHsin University of Science and Technology, HsinChu, Taiwan.
Fensel, D., Hendler, J., Lieberman, H., & Wahlster, W. (2000). Creating the Semantic Web. Available from: .
Francisco, G. S., Rodrigo, M. B., Leonardo, C., Jesualdo, F. B., & Dagoberto, C. N. (2006). An ontology-based intelligent system for recruitment. Expert Systems with Applications, 31(2), 248–263.
Ganesh, S., Jayaraj, M., Kalyan, V., & Aghila, G. (2004). Ontology-based web crawler. In Proceedings of the international conference on information technology: Coding and computing (pp. 337–341). Las Vegas, NV, USA.
Grosso, W. E., Eriksson, H., Fergerson, R. W., Gennari, J. H., Tu, S. W., & Musen, M. A. (1999). Knowledge modeling at the millennium: The design and evolution of Protégé-2000. SMI technical report, SMI-1999-0801.
Hafri, Y., & Djeraba, C. (2004). Dominos: A new web crawler's design. In Proceedings of the 4th international web archiving workshop. Bath, UK.
Handschuh, S., Staab, S., & Ciravegna, F. (2002). S-CREAM: Semi-automatic CREAtion of metadata. In Proceedings of the 13th international conference on knowledge engineering and knowledge management (pp. 358–372). Siguenza, Spain.
Hartmann, J., & Sure, Y. (2004). An infrastructure for scalable, reliable Semantic Portals. IEEE Intelligent Systems, 19(3), 58–65.
IBM Intelligent Miner for Text. (1998). Available from: .
Khan, M., & Malik, Z. (2007). An MDA-based approach for specifying Semantic Portals. In Proceedings of the 2007 IEEE international conference on web services (pp. 1218–1219). Salt Lake City, Utah, USA.
Kiyavitskaya, N., Zeni, N., Cordy, J. R., Mich, L., & Mylopoulos, J. (2005). Semi-automatic semantic annotation for web documents. In Proceedings of the 2nd Italian semantic web workshop. Trento, Italy.
Lee, C. Y., & Soo, V. W. (2006). The conflict detection and resolution in knowledge merging for image annotation. Information Processing and Management, 42(4), 1030–1055.
Lee, T. B., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American, 284(5), 35–43.
Lerman, K., Getoor, L., Minton, S., & Knoblock, C. (2004). Using the structure of web sites for automatic segmentation of tables. In Proceedings of the 2004 ACM SIGMOD international conference on management of data (pp. 119–130). Paris, France.
Noy, N. F., Sintek, M., Decker, S., Crubezy, M., Fergerson, R. W., & Musen, M. A. (2001). Creating semantic web contents with Protégé-2000. IEEE Intelligent Systems, 16(2), 60–71.
OntoMat. (2003). Available from: .
Pinheiro, W. A., & Moura, A. M. de C. (2004). An ontology-based approach for semantic search in portals. In Proceedings of the 15th international conference on database and expert system applications (pp. 127–131). Zaragoza, Spain.
Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 97–107.
Sereno, B., Shum, S. B., & Motta, E. (2004). Semi-automatic annotation of contested knowledge on the world wide web. In Proceedings of the 2004 international world wide web conference (pp. 276–277). New York, USA.
SHOE. (2000). Available from: .
Stollberg, M., & Thomas, S. (2005). Integrating agents, ontologies, and semantic web services for collaboration on the semantic web. In Proceedings of the 2005 AAAI fall symposium series on agents and the semantic web. Arlington, Virginia, USA.
Tufis, D., & Mason, O. (1998). Tagging Romanian texts: A case study for QTAG, a language independent probabilistic tagger. In Proceedings of the 1st international conference on language resources and evaluation (pp. 589–596). Granada, Spain.
Uschold, M., & Gruninger, M. (1996). Ontologies: Principles, methods and applications. The Knowledge Engineering Review, 11(2), 93–136.
Viljanen, K., Kansala, T., Hyvonen, E., & Makela, E. (2006). ONTODELLA: A projection and linking service for Semantic Web applications. In Proceedings of the 17th international conference on database and expert system applications (pp. 370–376). Krakow, Poland.
WordNet 2.1. (2005). Available from: .
Yang, S. Y., & Ho, C. S. (1999). Ontology-supported user models for interface agents. In Proceedings of the 4th conference on artificial intelligence and applications (pp. 248–253). Chang-Hua, Taiwan.
Yang, S. Y. (2007). An ontology-supported and fully-automatic annotation technology for Semantic Portals. In Proceedings of the 20th international conference on industrial, engineering and other applications of applied intelligent systems (pp. 1158–1168). Kyoto, Japan.
Yang, S. Y., Chen, A. T., & Wu, C. F. (2008). Ontology-supported web crawler for scholars. In Proceedings of the 2008 international conference on advanced information technologies. TaiChung, Taiwan.
22nd International Conference on Advanced Information Networking and Applications - Workshops
Developing of an Ontological Focused-Crawler for Ubiquitous Services
Sheng-Yuan Yang
Dept. of Computer and Communication Engineering, St. John's University
E-mail: [email protected]

Abstract

This paper proposes the use of several ontological techniques to provide a semantic-level solution for a search agent so that it can provide fast, precise, and stable search results in ubiquitous information environments. An ontological focused-crawler has been developed according to these techniques, which can benefit from both user requests and domain semantics. Equipped with these techniques, the crawler manifests the following interesting features: ontology-supported construction of website models, website models-supported website model expansion, and website models-supported webpage retrieval.

1. Introduction

In ubiquitous information environments, the user expects to spend a short time retrieving really useful information rather than spending plenty of time and ending up with lots of garbage information. Current general search engines, however, produce so many entries of data that they often overwhelm the user. The user usually gets frustrated after a series of visits to the entries, upon discovering that dead entries are everywhere, irrelevant entries are equally abundant, what he gets is not exactly what he wants, and so on. Current domain-specific search engines do help users narrow down the search scope with the techniques of query expansion, automatic classification, and focused crawling; their weakness, however, is that they almost completely ignore user interests [20]. In general, current search engines face two fundamental problems. Firstly, the index structures are usually very different from what the user conjectures about his problems. Secondly, the classification/clustering mechanisms for data hardly reflect the physical meanings of the domain concepts. These problems stem from a more fundamental problem: the lack of semantic understanding of Web documents. New standards for representing website documents, including XML [12], RDF [4], DOM [1], Dublin metatags [21], and WOM [14], can help cross-reference Web documents; alone, however, they cannot help the user at any semantic level during the search of website information. OIL [17], DAML [6], DAML+OIL [7], and the concept of ontology stand as a possible rescue for the attribution of information semantics. This paper proposes the use of several ontological techniques to provide a semantic-level solution for a search agent so that it can provide fast, precise, and stable search results in ubiquitous information environments. We notice that the concept of crawler is mostly used in Web systems that work on information gathering or integration to improve their gathering processes or the search results from ubiquitous information environments. For instance, Dominos [10] can crawl several thousand pages every second, includes a high-performance fault manager, is platform independent, and is able to adapt transparently to a wide range of configurations without incurring additional hardware expenditure. Ganesh et al. [9] developed an ontology-supported Web crawler with an association metric to estimate the semantic content of a URL based on the domain-dependent ontology, which in turn strengthens the metric used for prioritizing the URL queue. Ubi-Crawler [3], a scalable distributed web crawler, is platform independent and linearly scalable, degrades gracefully in the presence of faults, uses a very effective assignment function for partitioning the domain to crawl, and, more generally, completely decentralizes every task. This paper develops an ontological focused-crawler using website models as the core technique, which can help search agents successfully tackle the problems of search scope and user interests. The Personal Computer (PC) domain is chosen as the target application of our focused-crawler and will be used for explanation in the remaining sections.
2. Domain Ontology as the Down-to-the-Earth Semantics

[Fig. 1 Part of PC ontology. (a) Part of the ontology taxonomy: concepts such as Hardware, Power Equipment (Power Supply, UPS), Memory (Main Memory, ROM), Storage Media (CD, Optical, DVD, ZIP, CD R/W, CD R), Case, Network Chip, and Interface Card (Sound Card, Display Card, SCSI Card, Network Card), connected by isa links. (b) Ontology for the concept of "CPU": attributes such as Synonym, D-Frequency, Interface, L1 Cache, Volume, Spec., Abbr., and Factory, with instances such as XEON, THUNDERBIRD 1.33G, DURON 1.2G, PENTIUM4 2.0AGHZ, PENTIUM4 1.8AGHZ, PENTIUM4 2.53AGHZ, and CELERON 1.0G connected by io links.]
Ontology is a method of conceptualization on a specific domain [16]. It plays diverse roles in developing intelligent systems, for example, knowledge sharing and reuse [8,11], semantic analysis of languages [15], etc. Development of an ontology for a specific domain is not yet an engineering process, but it is clear that an ontology must include descriptions of the explicit concepts and their relationships in a specific domain [2]. We had outlined a principled construction procedure in [22]; following the procedure, we developed an ontology for the PC domain. Fig. 1(a) shows part of the PC ontology taxonomy. The taxonomy represents relevant PC concepts as classes and their parent-child relationships as isa links, which allow inheritance of features from parent classes to child classes. We then carefully selected those properties of each concept that are most related to our application and defined them as the detailed ontology of the corresponding class. Fig. 1(b) exemplifies the detailed ontology for the concept of "CPU". In the figure, the uppermost node uses various fields to define the semantics of the CPU class, each field representing an attribute of "CPU", e.g., interface, provider,
synonym, etc. The nodes at the bottom level represent various CPU instances that capture real-world data. Arrow links labeled "io" denote the instance_of relationship. Our ontology construction tool is Protégé 2000, and the complete PC ontology can be referenced from the Protégé Ontology Library at the Stanford website (http://protege.cim3.net/cgi-bin/wiki.pl?ProtegeOntologiesLibrary). Finally, we use Protégé's APIs to develop a set of ontology services, which provide primitive functions to support inference over the ontologies. The ontology services currently available include transforming query terms into canonical ontology terms, finding definitions of specific terms in the ontology, finding relationships among terms, and finding compatible or conflicting terms against a specific term.
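To convey the flavor of these ontology services, the sketch below implements canonical-term transformation over a synonym table in the spirit of Fig. 1(b) (e.g., "Athlon 1.33G" as a synonym of THUNDERBIRD 1.33G). The data structure and class names are illustrative stand-ins, not the actual Protégé-API-based implementation.

    import java.util.HashMap;
    import java.util.Map;

    public class OntologyServices {                                  // illustrative
        // Synonym/abbreviation -> canonical ontology term, as in Fig. 1(b).
        private final Map<String, String> canonical = new HashMap<>();

        public OntologyServices() {
            canonical.put("athlon 1.33g", "THUNDERBIRD 1.33G");
            canonical.put("p4 1.8ghz", "PENTIUM4 1.8AGHZ");
            canonical.put("central processing unit", "CPU");
        }

        // Transform a query term into its canonical ontology term.
        public String toCanonical(String queryTerm) {
            return canonical.getOrDefault(queryTerm.toLowerCase(), queryTerm);
        }
    }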
3. Website Model and Construction

[Fig. 2 Website model format and structure. (a) Format of a website model: Website Profile (WebNo::Integer, Website_Title::String, Start_URL::String, WebType::Integer, Tree_Level_Limit::Integer, Update_Time/Date::Date/Time, ...); Webpage Profile with basic information (DocNo::Integer, Location::String, URL::String, WebType::Integer, WebNo::Integer, Update_Time/Date::Date/Time, ...), statistics information (#Tag, #Frame, Title Text, Anchor Text, Heading Text, Outbound_URLs, ...), and ontology information (Domain_Mark::Boolean; class1: belief1; term11(frequency); ...; class2: belief2; term21(frequency); ...). (b) Conceptual structure of a website model: each website profile (WebNo#1, WebNo#2, ...) links to the webpage profiles (DocNo#11, DocNo#12, ..., DocNo#190; DocNo#21, DocNo#22, ...) of the webpages it contains.]

A website model contains a website profile and a set of webpage profiles. Fig. 2(a) illustrates the format of a website model. The webpage profile contains three sections, namely, basic information, statistics information, and ontology information. The first two sections profile a webpage and the last annotates domain semantics to the webpage. DocNo is automatically generated by the system for identifying a webpage in the structure index. Location remembers the path of the stored version of the web page in the website model; we can use it to answer user queries. URL is the path of the webpage on the Internet, the same as the returned URL index in the user query result; it helps hyperlink analysis. WebType identifies one of the following six Web types: com (1), net (2), edu (3), gov (4), org (5), and other (0), each encoded as the integer in parentheses. WebNo identifies the website that contains this webpage. Update_Time/Date remembers when the webpage was last modified. The statistics information section stores statistics about HTML tag properties. Specifically, we remember the texts associated with Titles, Anchors, and Headings for webpage analysis; we also record Outbound_URLs for user-oriented webpage expansion. Finally, the ontology information section remembers how the webpage is interpreted by the domain ontology. Domain_Mark is used to remember whether the webpage belongs to a specific domain. This section annotates how a webpage is related to the domain and can serve as its semantics, which helps a lot in the correct retrieval of webpages. Let's turn to the website profile. WebNo identifies a website. Through this number, we can access the webpage profiles describing the webpages that belong to this website. Website_Title remembers the text between the title tags of the homepage of the website. Start_URL stores the starting address of the website. WebType identifies one of the six Web types as used in the webpage profile. Tree_Level_Limit keeps the search agent from exploring too deeply. Update_Time/Date remembers when the website was last modified. This model structure helps interpret the semantics of a website through the gathered information; it also helps fast retrieval of webpage information and autonomous Web resource search. The last point will become clearer later. Fig. 2(b) illustrates how website profiles and webpage profiles are structured.
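Read literally, the profile format of Fig. 2(a) maps onto data structures such as the following Java sketch. The field names are taken from the figure; the types and class names are otherwise assumptions.

    import java.util.Date;
    import java.util.List;

    // Website profile of Fig. 2(a); one per crawled website.
    class WebsiteProfile {
        int webNo;                 // website identifier
        String websiteTitle;       // text between the title tags of the homepage
        String startUrl;           // starting address of the website
        int webType;               // com(1), net(2), edu(3), gov(4), org(5), other(0)
        int treeLevelLimit;        // keeps the crawler from exploring too deeply
        Date updateTimeDate;       // when the website was last modified
    }

    // Webpage profile of Fig. 2(a); one per crawled page.
    class WebpageProfile {
        // Basic information
        int docNo;                 // page identifier in the structure index
        String location;           // local path of the stored copy of the page
        String url;                // path of the webpage on the Internet
        int webType;               // same encoding as in WebsiteProfile
        int webNo;                 // owning website
        Date updateTimeDate;       // when the webpage was last modified
        // Statistics information
        int tagCount, frameCount;
        String titleText, anchorText, headingText;
        List<String> outboundUrls; // for user-oriented webpage expansion
        // Ontology information
        boolean domainMark;        // whether the page belongs to the domain
        List<String> classBeliefs; // e.g., "class1: belief1; term11(frequency); ..."
    }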
During the construction and expansion of a website model, we need to extract primitive webpage information as well as compute statistics. We also need to transform the original webpage into a tag-free document for the annotation of ontology information. These activities involve intensive consultation of the domain ontology. To facilitate them, we have re-organized the ontology structure into Fig. 3, which stresses how concept attributes are related to class identification. In the figure, each square node contains a set of representative ontology features for a specific concept, while each oval node contains related ontology features shared between two or more concepts. This design clearly structures the semantics between ontology classes and their relationships and can serve as a fast semantics-decision mechanism for website expansion.

Fig. 3 Re-organized PC ontology (square concept nodes such as CPU, Motherboard, Graphic Card, Sound Card, Hard Drive, Optical Drive, Network Card, SCSI Card, Modem, and Monitor under PC Hardware, connected through Related Concept nodes)

The semantics decision process sometimes involves deciding the class of a webpage or a website. We proposed an ontology-directed classification mechanism, named OntoClassifier [24], to solve this problem. It classifies a webpage in two stages. The first stage uses representative ontology features; we employ the level threshold to limit the number of ontology features involved in this stage. The basic idea of the first-stage classification is defined by Eq. (1). In the equation, OntoMatch(d,C) is defined by Eq. (2), which calculates the number of ontology features of class C that appear in
webpage d, where M(w,C) returns 1 if word w of d is contained in class C. Thus, Eq. (1) returns class C for webpage d if C has the largest number of ontology features appearing in d. Note that not all classes have the same number of ontology features; we therefore normalize by #wC', the number of words in each class C'. Also note that Eq. (1) only compares those classes with more than three ontology features appearing in d, i.e., it filters out unlikely classes. As to why classes with three or fewer features appearing in d are filtered, we follow Joachims' observation that the classification process only has to consider terms with an appearance frequency larger than three [13].

H_{OntoMatch}(d) = \arg\max_{C' \in c} \frac{OntoMatch(d, C')}{\#w_{C'}}, \quad OntoMatch(d, C') > 3    (1)

OntoMatch(d, C) = \sum_{w \in d} M(w, C)    (2)
If for any reason the first stage cannot return a class for a webpage, we move to the second stage of classification. The second stage no longer uses level thresholds but gives each ontology term a weight according to the level it is associated with. That is, we modify traditional classifiers by including a level-related weighting mechanism for the ontology concepts to form our ontology-based classifier. This level-related weighting (L) mechanism gives a higher weight to the representative features than to the related features. The second stage of classification is defined by Eq. (3). Inside the equation, OntoTFIDF(d,C) is defined by Eq. (4), which is basically the calculation of a TFIDF score on the ontology features of class C with respect to webpage d, where TF(x|y) denotes the number of appearances of word x in y. Thus, Eq. (3) returns class C for webpage d if C has the highest TFIDF score with respect to d.

H_{OntoTFIDF}(d) = \arg\max_{C' \in c} OntoTFIDF(d, C')    (3)

OntoTFIDF(d, C) = \sum_{w \in d} \frac{1}{L_w} \times \frac{TF(w|C)}{\sum_{w' \in F_C} TF(w'|C)} \times \frac{TF(w|d)}{\sum_{w' \in F_C} TF(w'|d)}    (4)
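The two-stage decision of OntoClassifier can be summarized in code. The sketch below is our paraphrase of Eqs. (1)-(4), assuming each class carries its feature words with their ontology levels and a term-frequency table for its corpus; the names and data layout are illustrative, not the actual OntoClassifier implementation.

from collections import Counter

def onto_match(words, features):
    # Eq. (2): number of ontology features of a class appearing in webpage d
    return sum(1 for w in set(words) if w in features)

def classify(words, classes):
    # classes: {name: {"features": {word: level}, "tf": {word: frequency}}}
    # Stage 1 (Eq. (1)): normalized feature overlap, skipping classes with <= 3 hits
    best, best_score = None, 0.0
    for name, cls in classes.items():
        hits = onto_match(words, cls["features"])
        if hits > 3:
            score = hits / len(cls["features"])      # normalize by #wC'
            if score > best_score:
                best, best_score = name, score
    if best is not None:
        return best
    # Stage 2 (Eqs. (3)-(4)): level-weighted TFIDF-like score over ontology features
    tf_d = Counter(words)
    for name, cls in classes.items():
        feats, tf_c = cls["features"], cls["tf"]
        sum_c = sum(tf_c.get(w, 0) for w in feats) or 1   # sum of TF(w'|C) over FC
        sum_d = sum(tf_d.get(w, 0) for w in feats) or 1   # sum of TF(w'|d) over FC
        score = sum((1.0 / feats[w]) * (tf_c.get(w, 0) / sum_c) * (tf_d[w] / sum_d)
                    for w in tf_d if w in feats)          # 1/Lw favors representative features
        if score > best_score:
            best, best_score = name, score
    return best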
Website modeling involves three modules. We use DocExtractor to extract basic webpage information and compute statistics. We then use OntoAnnotator to annotate ontology information. Since the ontology information contains webpage classes, OntoAnnotator needs to call OntoClassifier to perform webpage classification; for detailed descriptions and architectures, please refer to [25].

4. Website Models Application

4.1 Focused Web Crawling Supported by Website Models

In order to effectively use the website models to narrow down the search scope, we propose a new focused-crawler, shown in Fig. 4, which features a progressive crawling strategy for obtaining domain-relevant Web information. Inside the architecture, Web Crawler gathers data from the Web. DocPool was mentioned before; it stores all Web pages returned by Web Crawler for DocExtractor during the construction of webpage profiles. It also stores query results from search engines, which usually contain a list of URLs. URLExtractor is responsible for extracting URLs from the query results and dispatching those URLs that are domain-dependent but not yet in the website models to Distiller. User-Oriented Webpage Expander pinpoints interesting URLs in the website models for further webpage expansion according to the user query. Autonomous Website Evolver autonomously discovers URLs in the website models that are domain-dependent for further webpage expansion. Since these two types of URLs are both derived from website models, we call them website model URLs in the figure. User Priority Queue stores the user search strings and the website model URLs from User-Oriented Webpage Expander. Website Priority Queue stores the website model URLs from Autonomous Website Evolver and the URLs extracted by URLExtractor.

Fig. 4 Architecture of the Focused-Crawler

Distiller controls the Web search by associating a priority score with each URL (or search string) using Eq. (5) and placing it in the proper Priority Queue. Eq. (5) defines ULScore(U,F) as the priority score of each URL (or search string):

ULScore(U, F) = W_F \times S_F(U)    (5)

where U represents a URL or search string, and F identifies the way U is obtained, as shown in Table 1, which also assigns to each F a weight WF and a score SF(U). Thus, if F = 1, i.e., U is a search string, then W1 = 3 and S1(U) = 100, which implies that all search strings are treated as top-priority requests. As for F = 3, if U is new to the website models, S3(U) is set to 1 by URLExtractor; otherwise it is set to 0.5. Finally, for F = 2, the URLs may come from User-Oriented Webpage Expander or Autonomous Website Evolver. In the former case, we follow the algorithm in Fig. 5 (to be explained in Section 4.2) to calculate S2(U) for each U. The assignment of S2(U) in the latter case is more complicated; we describe it in Section 4.3.

Table 1 Basic weighting for URLs to be explored
Input Type                        F    WF    SF(U)
Search Strings                    1    3     S1(U) = 100
Website Model URLs                2    2     S2(U)
URLs Extracted by URLExtractor    3    1     S3(U)

Note that Distiller schedules all the URLs (and search strings) in the user priority queue according to their priority scores for Web crawling before it starts to schedule the URLs in the website priority queue. In addition, whenever new URLs come into the user priority queue, Distiller stops scheduling the URLs in the website priority queue and turns to the new URLs in the user priority queue. This design prefers user-oriented Web resource crawling to website maintenance, since user-oriented query or webpage expansion takes into account both user interest and domain constraints, which better meets our design goal than website maintenance.
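A minimal sketch of Distiller's bookkeeping follows, assuming simple heap-based queues; the function names and entry layout are ours, but the weights and scores mirror Eq. (5) and Table 1, and the drain order mirrors the note above.

import heapq

W = {1: 3, 2: 2, 3: 1}   # weights WF from Table 1

def ul_score(u, f, s2=None, known=False):
    # Eq. (5): ULScore(U, F) = WF x SF(U)
    if f == 1:
        s = 100                      # S1(U): search strings are top priority
    elif f == 2:
        s = s2                       # S2(U): from Fig. 5 (Section 4.2) or Section 4.3
    else:
        s = 0.5 if known else 1.0    # S3(U): 1 for URLs new to the models, else 0.5
    return W[f] * s

user_q, site_q = [], []              # max-priority behavior via negated scores

def enqueue(u, f, s2=None, known=False, user_oriented=False):
    # Search strings and user-oriented model URLs -> user priority queue;
    # evolver URLs and extracted URLs -> website priority queue.
    q = user_q if f == 1 or (f == 2 and user_oriented) else site_q
    heapq.heappush(q, (-ul_score(u, f, s2, known), u))

def next_request():
    # The user queue is always drained before the website queue (Section 4.1).
    for q in (user_q, site_q):
        if q:
            return heapq.heappop(q)[1]
    return None

enqueue("CPU overclocking", f=1)
enqueue("http://example.com/mb", f=3, known=False)
print(next_request())   # -> "CPU overclocking"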
4.2 User-Oriented Web Search Supported by Website Models
Conventional Information Retrieval research has mainly been based on (computer-readable) text [19], locating desired text documents using a query consisting of a number of keywords, much like the keyword-based search engines. Retrieved documents are ranked by relevance and presented to the user for further exploration. The main issue with this query model lies in the difficulty of query formulation and the inherent word ambiguity of natural language. To overcome this problem, we propose a direct query expansion mechanism, which helps users implicitly formulate their queries. This mechanism uses the domain ontology to expand the user query. One straightforward expansion is to add synonyms of the terms contained in the user query to the query; synonyms can be easily retrieved from the ontology. Table 2 illustrates a simple example. More complicated expansion adds ontology concepts according to their relationships with the query terms. The most used relationships follow the inheritance structure. For instance, if more than half of the sub-concepts of a concept appear in a user query, we add the concept to the query too.

Table 2 Example of direct query expansion
Original user query: Mainboard CPU Socket KV133 ABIT
Synonyms from the ontology: Mainboard -> Motherboard; CPU -> Central Process Unit, Processor; Socket -> Slot, Connector
Expanded user query: Mainboard CPU Socket KV133 ABIT Motherboard Central Process Unit Processor Slot Connector
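For illustration only, a synonym-based direct expansion in the spirit of Table 2 might look like the following; the small synonym map stands in for lookups against the real domain ontology, and the inheritance-based expansion is omitted.

def expand_query(query, synonyms):
    # Direct query expansion: append ontology synonyms of each query term.
    terms = query.split()
    expanded = list(terms)
    for t in terms:
        for s in synonyms.get(t, []):
            if s not in expanded:
                expanded.append(s)
    return " ".join(expanded)

# Hypothetical fragment of the PC ontology's synonym relation (cf. Table 2).
SYN = {
    "Mainboard": ["Motherboard"],
    "CPU": ["Central Process Unit", "Processor"],
    "Socket": ["Slot", "Connector"],
}
print(expand_query("Mainboard CPU Socket KV133 ABIT", SYN))
# -> Mainboard CPU Socket KV133 ABIT Motherboard Central Process Unit Processor Slot Connector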
We also propose an implicit webpage expansion mechanism oriented to the user interest, to better capture the user intention. This user-oriented webpage expansion mechanism adds webpages oriented to the user interest into the website models for further retrieval. Here we exploit the outbound hyperlinks of the webpages stored in the website models. More precisely, we use the Anchor Texts specified by the webpage designer for the hyperlinks, which contain the terms that the designer believes best describe the hyperlinked webpages. We can compare these anchor texts against a given user query to determine whether the hyperlinked webpages contain the terms in the query. If so, the hyperlinked webpages are likely to interest the user and should be collected in the website models for further query processing. Fig. 5 formalizes this idea into a strategy to select those hyperlinks, or URLs, that the users are strongly interested in. The algorithm returns a URL_List which contains hyperlinks along with their scores. The URL_List will be sent to the Focused-Crawler (to be discussed later), which uses the scores to rank the URLs and accordingly fetches the hyperlinked webpages. Note that the algorithm uses RS,D to modulate the scores, demoting those hyperlinks that are less related to the target domain. RS,D specifies the degree of domain correlation of website S with respect to domain D, as defined by Eq. (6). In the equation, NS,D refers to the number of webpages on website S belonging to domain D, and NS stands for the total number of webpages on website S. Here we need the parameter Domain_Mark in the webpage profile to determine NS,D. In short, RS,D measures how strongly a website is related to a domain. Thus, the autonomous webpage expansion mechanism is also domain-oriented in nature.

R_{S,D} = \frac{N_{S,D}}{N_S}    (6)

Let D be a Domain, Q be a Query, and U be a URL.
Let URL_List be a List of URLs with Scores.
UserTL(Q) = User terms in Q.
AnTerm(U) = Terms in the Anchor Text of U.
S2(U) = Score of U (cf. Table 1).

Webpage_Expand_Strategy(Q, D) {
  For each Website S in the Website Model
    For each outbound link U in S {
      Score = AnE_Score(U, Q)
      If Score is not zero {
        S2(U) = RS,D x Score
        Expanding_URL(U, Score)
      }
    }
  Return URL_List
}

AnE_Score(U, Q) {
  For each Term T in AnTerm(U)
    If UserTL(Q) contains T
      AnE_Score = AnE_Score + 1
  Return AnE_Score
}

Expanding_URL(U, Score) {
  Add U and Score to URL_List
}

Fig. 5 User-oriented webpage expansion strategy supported by the website models

4.3 Domain-Oriented Web Search Supported by Website Models

Now we can discuss how Autonomous Website Evolver performs domain-dependent website expansion. Autonomous Website Evolver employs a 4-phase progressive strategy to autonomously expand the website models. The first phase uses Eq. (7) to calculate S2(U) for each hyperlink U which is referred to by website S and recognized, from its URL address, to be in S, but whose hyperlinked webpage is not yet collected in S.

S_2(U) = \sum_{C \in D} R_{S,D} \times (1 - P_{S,D}(C)), \quad U \in S \text{ and } P_{S,D}(C) \neq 0    (7)

where C is a concept of domain D, RS,D was defined in Eq. (6), and PS,D is defined by Eq. (8). PS,D(C) measures the proportion of concept correlation of website S with respect to concept C of domain D. NS,C refers to the number of webpages talking about domain concept C on website S. Fig. 6 shows the algorithm for calculating NS,C. In short, PS,D(C) measures how strongly a website is related to a specific domain concept.

P_{S,D}(C) = \frac{N_{S,C}}{N_{S,D}}    (8)

Let C be a Concept, S be a Website, and D be a Domain.
Let THW be a threshold number for the website models.
OntoPages(S) = The webpages on website S belonging to D.
OntoCon(THW, P) = The top THW domain concepts of webpage P.

Calculating_NS,C(S, C) {
  For each webpage P in OntoPages(S) {
    If OntoCon(THW, P) contains C
      NS,C = NS,C + 1
  }
  Return NS,C
}

Fig. 6 Algorithm for calculating NS,C

Literally, Eq. (7) assigns a higher score to U if U belongs to a website S which has a higher degree of domain correlation RS,D but lower coverage of domain concepts, i.e., lower PS,D(C). In summary, the first phase prefers expanding the websites that are well profiled in the website models but have less coverage of domain concepts. The first phase is good at collecting more webpages for well-profiled websites; it cannot help with unknown websites, however. Our second phase goes a step further by searching for webpages that can help define a new website profile. In this phase, we exploit URLs that are in the website models but belong to some unknown website profile. We use Eq. (9) to calculate S2(U) for each outbound hyperlink U of a webpage that is stored in an indefinite website profile.
S_2(U) = Anchor(U, D)    (9)
where function Anchor(U,D) gives outbound link U a weight according to how many terms in the anchor text of U belong to domain D. Thus, the second phase prefers to expand those webpages that can help bring in more information to complete the specification of indefinite website profiles. In the third phase, we relax one more constraint: the condition of unknown website profiles. We exploit any URLs as long as they are referred to by some webpages in the website models. We use Eq. (10) to calculate S2(U) for each outbound hyperlink U which is referred to by any webpage in the website models. This equation relies heavily on the anchor texts to determine which URLs should receive higher priority scores. In short, the third phase tends to collect every webpage that is referred to by the webpages in the website models.
S_2(U) = \sum_{C \in D} Anchor(U, C) \times R_{S,D} \times P_{S,D}(C)    (10)
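Taken together, the first three phases can be read as one scoring routine that progressively relaxes its admission test. The sketch below is our paraphrase of Eqs. (7), (9), and (10), assuming the website models supply RS,D, the PS,D(C) table, and the anchor-text terms; in Eq. (7) we read the PS,D(C) != 0 condition as skipping zero-coverage concepts.

def anchor_score(anchor_terms, term_set):
    # Anchor(U, D) / Anchor(U, C): anchor-text terms of U falling in a term set
    return sum(1 for t in anchor_terms if t in term_set)

def s2_phase1(R, P):
    # Eq. (7): prefer well-profiled sites (high RS,D) with low concept coverage
    return sum(R * (1.0 - p) for p in P.values() if p != 0)

def s2_phase2(anchor_terms, domain_terms):
    # Eq. (9): URLs under indefinite website profiles, scored by anchor text alone
    return anchor_score(anchor_terms, domain_terms)

def s2_phase3(anchor_terms, concept_terms, R, P):
    # Eq. (10): any referred URL, anchor text matched concept by concept
    return sum(anchor_score(anchor_terms, concept_terms[c]) * R * p
               for c, p in P.items())

# e.g., phase 1 for a site with RS,D = 0.8 covering half of one of two concepts:
print(s2_phase1(0.8, {"CPU": 0.5, "Modem": 0.0}))   # -> 0.4 (zero-coverage concepts skipped)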
In the last phase, we resort to general website information to refresh and expand website profiles. This phase is periodically invoked according to the Update_Time/Date stored in the website profiles and webpage profiles. Specifically, we refer to the analysis of refresh cycles of different types of websites conducted in [5] and define a weight for each web type as shown in Table 3. This phase then uses Eq. (11) to assign an S2(U) to each U which belongs to a specific website type T.
S_2(U) = R_{S,D} \times W_T, \quad U \in T    (11)

Table 3 Weight table for different types of websites
Web Type, T    com     net/org    edu     gov
WT             0.62    0.32       0.08    0.07

4.4 Webpage Retrieval from Website Models

Webpage retrieval concerns how to provide the most-wanted documents for users. Traditional ranking methods employ an inverted full-text index database along with a ranking algorithm to calculate the ranking sequence of relevant documents. The problems with this method are clear: too many entries in the returned results and too slow a response time. A simplified approach emerged, which employs various ad-hoc mechanisms to reduce the query space [19]. Two major problems lie behind these mechanisms: 1) they need a specific, labor-intensive, and time-consuming pre-process; and 2) they cannot respond to changes in the real environment in time, due to the off-line pre-process. Another method, PageRank [18], is employed by Google to rank webpages by their link information. Google spends lots of offline time pre-analyzing the link relationships among a huge number of webpages and calculating proper ranking scores for them, before storing them in a special database for answering user queries. Google's high speed of response stems from a huge local webpage database along with a time-consuming, offline, detailed link-structure analysis. Instead, our solution ranking method takes advantage of the semantics in the website models. The major index structure uses ontology features to index the webpages in the website models; this ontology index contains terms that are stored in the webpage profiles. The second index structure is a partial full-text inverted index, since it contains no ontology features. Fig. 7 shows this two-layered index structure. Since we require each query to contain at least one ontology feature, we can always use the ontology index to locate a set of webpages. The partial full-text index is then used to further reduce this set to a subset of webpages for users. This design of separating ontology indices from a traditional full-text index is interesting because we then know which ontology features are contained in a user query. Based on this information, we can apply OntoClassifier to analyze which domain concepts the user is really interested in and use that information to quickly locate the webpages of interest. In short, we use the second stage of OntoClassifier along with a threshold, say THU, to limit the best classes (concepts) a query is associated with. For example, if we set THU to three, we select the best three ontology classes from a query and use them as indices to quickly locate user-interested webpages. Finally, we can employ the identified ontology features in a user query to properly rank the webpages for the user using the ranking method of [20].

Fig. 7 Index structures in website models: an ontology index (inverted index of ontology feature terms) maps terms to the document numbers of webpages in the website models, which a partial full-text index (inverted index of the remaining terms) further refines
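A minimal sketch of the two-layered lookup follows; the index shapes (plain mappings from term to sets of DocNos) and the toy data are our assumptions.

def retrieve(query_terms, onto_index, text_index):
    # Two-layer lookup: ontology index first, partial full-text index second.
    onto_terms = [t for t in query_terms if t in onto_index]
    rest = [t for t in query_terms if t not in onto_index]
    # Layer 1: every query must contain at least one ontology feature.
    candidates = set.union(*(onto_index[t] for t in onto_terms))
    # Layer 2: narrow the candidate set with the remaining (non-ontology) terms.
    for t in rest:
        candidates &= text_index.get(t, set())
    return candidates

# Hypothetical toy indices: term -> set of DocNos.
onto_index = {"CPU": {1, 2, 5}, "Motherboard": {2, 3}}
text_index = {"overclock": {2, 5}, "review": {1, 2}}
print(retrieve(["CPU", "overclock"], onto_index, text_index))   # -> {2, 5}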
5. User-Satisfaction Evaluation

Table 4 User satisfaction evaluation (SE / ST per keyword)
METHOD          CPU          MOTHERBOARD   MEMORY       AVERAGE
Yahoo           67% / 61%    77% / 78%     38% / 17%    61% / 52%
Lycos           64% / 67%    77% / 76%     36% / 20%    59% / 54%
InfoSeek        69% / 70%    71% / 70%     49% / 28%    63% / 56%
HotBot          69% / 63%    78% / 76%     62% / 31%    70% / 57%
Google          66% / 64%    81% / 80%     38% / 21%    62% / 55%
Excite          66% / 62%    81% / 81%     50% / 24%    66% / 56%
Alta Vista      63% / 61%    77% / 78%     30% / 21%    57% / 53%
Our prototype   78% / 69%    84% / 78%     45% / 32%    69% / 60%
Table 4 compares the user satisfaction of our system prototype with that of other search engines. In the table, ST (Satisfaction of testers) is the average of the satisfaction responses from 10 ordinary users, while SE (Satisfaction of experts) is the average from 10 experts. Each search engine received 100 queries and returned its first 100 webpages for satisfaction evaluation by both groups. The table shows that our system prototype with the ontological focused-crawler, the last row, enjoys the highest satisfaction in most keyword classes. From the evaluation, we conclude that, unless a competing search engine is specifically tailored to this domain, as HotBot and Excite appear to be, our prototype system in general retrieves more correct webpages in almost all classes.
6. Conclusions
We have described how ontology-supported website models can effectively support Web search; the website model content, construction, and application all differ from our previous works [23]. A website model contains webpage profiles, each recording basic information, statistics information, and ontology information of a webpage. The ontology information is an annotation of how the webpage is interpreted by the domain ontology. The website model also contains a website profile that remembers how a website is related to its webpages and how it is interpreted by the domain ontology. We have developed an ontological focused-crawler, which employs domain ontology-supported website models as the core technology to search for Web resources that are both user-interested and domain-oriented in ubiquitous information environments. It features the following characteristics. 1) Ontology-supported construction of website models: by this, we attribute domain semantics to the Web resources collected and stored in the local database. One important technique used here is OntoClassifier, which performs very accurate and stable classification of webpages to support more correct annotation of domain semantics [24]. 2) Website model-supported website model expansion: by this, we take into account both user interests and domain specificity. The core technique here is to employ progressive strategies for user query-driven webpage expansion, autonomous website expansion, and query-result exploitation to effectively expand the website models [25]. 3) Website model-supported webpage retrieval: by this, we leverage the power of ontology features as a fast index structure to locate the most-wanted webpages for the user. Finally, our ontology construction is based on a set of pre-collected webpages on a specific domain; it is hard to evaluate how critical this collection process is across different domains. We are planning to employ automatic ontology evolution techniques to help study the robustness of our ontology.
Acknowledgements The author would like to thank Yung Ting, Jr-Well Wu, Chung-Min Wang, and Yai-Hui Chang for their assistance in system implementation. This work was supported by the National Science Council, R.O.C., under Grants NSC-89-2213-E-011-059, NSC-89-2218-E-011-014, and NSC-95-2221-E-129-019.
References
[1] L.H. Arnaud, L.H. Philippe, W. Lauren, N. Gavin, R. Jonathan, C. Mike, and B. Steve, "Document Object Model (DOM) Level 3 Core Specification," http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/, 2004.
[2] N. Ashish and C.A. Knoblock, "Wrapper Generation for Semi-structured Internet Sources," ACM SIGMOD Record, 26(4), 1997, pp. 8-15.
[3] P. Boldi, B. Codenotti, M. Samtini, and S. Vigna, "UbiCrawler: A Scalable Fully Distributed Web Crawler," Software: Practice and Experience, 34(8), 2004, pp. 711-726.
[4] D. Brickley and R.V. Guha, "RDF Vocabulary Description Language 1.0: RDF Schema," http://www.w3.org/TR/2004/REC-rdf-schema-20040210/, 2004.
[5] J. Cho and H. Garcia-Molina, "The Evolution of the Web and Implications for an Incremental Crawler," Proc. of VLDB 2000: 26th International Conference on Very Large Databases, Cairo, Egypt, 2000, pp. 200-209.
[6] DAML, http://www.daml.org/about.html, 2003.
[7] DAML+OIL, http://www.daml.org/2001/03/daml+oil-index/, 2001.
[8] S. Decker, S. Melnik, F. van Harmelen, D. Fensel, M. Klein, J. Broekstra, M. Erdmann, and I. Horrocks, "The Semantic Web: The Roles of XML and RDF," IEEE Internet Computing, 4(5), 2000, pp. 63-74.
[9] S. Ganesh, M. Jayaraj, V. Kalyan, and G. Aghila, "Ontology-based Web Crawler," Proc. of the International Conference on Information Technology: Coding and Computing, Las Vegas, NV, USA, 2004, pp. 337-341.
[10] Y. Hafri and C. Djeraba, "Dominos: A New Web Crawler's Design," Proc. of the 4th International Web Archiving Workshop, Bath, UK, 2004.
[11] R. Al-Halami and R. Berwick, "WordNet: An Electronic Lexical Database," ISBN 0-262-06197-X, 1998.
[12] S.T. Henry, B. David, M. Murray, and M. Noah, "XML Base," http://www.w3.org/TR/2001/REC-xmlbase-20010627/, 2001.
[13] T. Joachims, "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization," Proc. of the 14th International Conference on Machine Learning, 1996, pp. 143-151.
[14] F. Manola, "Towards a Web Object Model," http://www.objs.com/OSA/wom.htm, 1998.
[15] D.I. Moldovan and R. Mihalcea, "Using WordNet and Lexical Operators to Improve Internet Searches," IEEE Internet Computing, 4(1), 2000, pp. 34-43.
[16] N.F. Noy and C.D. Hafner, "The State of the Art in Ontology Design," AI Magazine, 18(3), 1997, pp. 53-74.
[17] OIL, http://www.ontoknowledge.org/oil/downl/oil-whitepaper.pdf, 2000.
[18] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web," Stanford Digital Libraries Working Paper SIDL-WP-1999-0120, Department of Computer Science, Stanford University, California, USA, 1999.
[19] G. Salton and C. Buckley, "Term Weighting Approaches in Automatic Text Retrieval," Information Processing and Management, 24(5), 1988, pp. 513-523.
[20] C.M. Wang, Web Search with Ontology-Supported Technology, Master thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, 2003.
[21] S. Weibel, "The State of the Dublin Core Metadata Initiative," D-Lib Magazine, 5(4), 1999.
[22] S.Y. Yang and C.S. Ho, "Ontology-Supported User Models for Interface Agents," Proc. of the 4th Conference on Artificial Intelligence and Applications, Chang-Hua, Taiwan, 1999, pp. 248-253.
[23] S.Y. Yang and C.S. Ho, "A Website-Model-Supported New Search Agent," Proc. of the 2nd International Workshop on Mobile Systems, E-Commerce, and Agent Technology, Miami, FL, USA, 2003, pp. 563-568.
[24] S.Y. Yang, "An Ontology-Directed Webpage Classifier for Web Services," Proc. of the Joint 3rd International Conference on Soft Computing and Intelligent Systems and 7th International Symposium on Advanced Intelligent Systems, Tokyo, Japan, 2006, pp. 720-724.
[25] S.Y. Yang, "An Ontology-Supported Website Model for Web Search Agents," Proc. of the 2006 International Computer Symposium, Taipei, Taiwan, 2006, pp. 874-879.