Communications in Information Science and Management Engineering

CISME

A Domain Specific Indexing Technique for Hidden Web Documents

Ritu Shandilya 1, Sugam Sharma 2, and Shamimul Qamar 3

1 Shobhit University, India
2 Iowa State University, USA
3 Salman bin Abdul Aziz University, KSA
[email protected]

Abstract-The web creates new challenges for information retrieval as the amount of information on the web grows rapidly. One of these challenges is crawling the information hidden behind search forms, since a tremendous amount of high quality content lies there. This information can be retrieved by a hidden web crawler using a Web query front-end to the database with standard HTML form elements. The documents retrieved by a hidden web crawler are more relevant, as they are accessible only through dynamically generated pages delivered in response to a query. To index these documents efficiently, the search engine requires a new indexing technique that optimizes speed and performance in finding relevant documents for a search query. In this paper, a new technique to index hidden web crawled documents is proposed that not only indexes the documents more efficiently but also gives a classification of the documents. In this technique, the attributes of a query interface and their value sets are employed to index the documents.

Keywords-Search Engine; Indexer; Hidden Web Crawler; Domain

© 2011-2012 World Academic Publishing CISME Vol.2 No.2 2012 PP.37-41 www.jcisme.org

I. INTRODUCTION

Traditional crawlers retrieve content only from the publicly indexable Web [1, 2, 3, 4, 5], i.e., the set of Web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. In particular, they ignore the tremendous amount of high quality content “hidden” behind search forms, in large searchable electronic databases. The problem of extracting content from this hidden Web is addressed by HiWE (Hidden Web Exposer), a prototype crawler built at Stanford. It uses a new Layout-based Information Extraction Technique (LITE) to automatically extract semantic information from search forms and response pages [6, 7, 8].

Since search forms are the entry points into the hidden Web, HiWE is designed to automatically process, analyze, and submit forms, using an internal model of forms and form submissions. This model treats forms as a set of (element, domain) pairs. A form element can be any one of the standard input objects such as selection lists, text boxes or radio buttons. Each form element is associated with a finite or infinite domain and a text label that semantically describes the element. The values used to fill out forms are maintained in a special table called the LVS (Label Value Set) table. Each entry in the LVS table consists of a label and an associated fuzzy/graded set of values. The weight associated with a value represents the crawler's estimate of how effective it would be to assign that value to a form element with the corresponding label.

The basic actions of HiWE, shown in Figure 1 (fetching pages, parsing and extracting URLs, and adding the URLs to a URL list), are similar to those of traditional crawlers. However, whereas the latter ignore forms, HiWE performs the following sequence of actions for each form on a page.

[Figure: flowchart with the boxes Choose URL; Retrieve the page; Extract the links and add to queue; Form? (Yes/No); Form Analysis; Value assignment and Submission; Response Analysis; Response Navigation; Done With (Yes/No)]
Fig. 1. Hidden Web Crawler

A. Form Analysis
Parse and process the form to build an internal representation based on the above model.

B. Value Assignment and Ranking
Use approximate string matching between the form labels and the labels in the LVS table to generate a set of candidate value assignments. Use fuzzy aggregation functions to combine individual weights into weights for value assignments and use these weights for ranking the candidate assignments.

C. Form Submission
Use the top N value assignments to repeatedly fill out and submit the form.

D. Response Analysis and Navigation
Analyze the response pages (i.e., the pages received in response to form submissions) to check if the submission yielded valid search results. Use this feedback to tune the value assignments in step B. Crawl the hypertext links in the response page to some pre-specified depth.

The hidden Web is particularly important, as it holds high-quality information. Therefore, there is a need for an indexing technique that indexes this high quality data more efficiently.

II. RELATED WORK

Farhi Marir and Kamel Houam (School of Informatics & Multimedia Technology, North London University, N7 8DB, UK) proposed a technique for indexing and retrieving Web documents using computational and linguistic techniques [9]. This technique captures the semantics of a document and is used for indexing and retrieval of relevant documents from the Internet. It performs conventional keyword based indexing and introduces thematic relationships between parts of the text using natural language understanding (NLU) and a linguistic theory called rhetorical structure theory (RST).

Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, et al. proposed a domain-specific keyphrase extraction technique [10]. Keyphrases are an important means of document summarization, clustering, and topic search. Only a small minority of documents have author-assigned keyphrases, and manually assigning keyphrases to existing documents is very laborious. Therefore it is highly desirable to automate the keyphrase extraction process. Their paper shows that a simple procedure for keyphrase extraction based on the naive Bayes learning scheme performs comparably to the state of the art. It goes on to explain how this procedure's performance can be boosted by automatically tailoring the extraction process to the particular document collection at hand. Results on a large collection of technical reports in computer science show that the quality of the extracted keyphrases improves significantly when domain-specific information is exploited.

Another related work is by Jagannathan Srinivasan, Ravi Murthy, Seema Sundara, Nipun Agarwal, and Samuel DeFazio, who proposed a framework for integrating domain-specific indexing schemes into Oracle8i [11]. Extensible Indexing is a SQL-based framework that allows users to define domain-specific indexing schemes and integrate them into the Oracle8i server. Users register a new indexing scheme, the set of related operators, and additional properties through SQL data definition language extensions. The implementation of an indexing scheme is provided as a set of Oracle Data Cartridge Interface (ODCIIndex) routines for index-definition, index-maintenance, and index-scan operations. An index created using the new indexing scheme, referred to as a domain index, behaves and performs analogously to those built natively by the database system. The Oracle8i server implicitly invokes user-supplied index implementation code when domain index operations are performed, and executes user-supplied index scan routines for efficient evaluation of domain specific operators. Their paper provides an overview of the framework and describes the steps needed to implement an indexing scheme. It also presents a case study of Oracle Cartridges (interMedia Text, Spatial, and Visual Information Retrieval) and the Daylight (chemical compound searching) Cartridge, which have implemented new indexing schemes using this framework, and discusses the benefits and limitations.

A critical look at the available literature indicates that, to date, many indexing techniques have been introduced to index web documents [12, 13]. These techniques are optimized for both features and speed, but none of them distinguishes between pages crawled by a traditional crawler and pages crawled by a hidden web crawler.

III. THE PROPOSED TECHNIQUE

To index hidden web crawled documents, a domain specific indexing technique based on the attributes of a query interface and their value sets is proposed. Previously proposed techniques are based on keywords: the crawler first downloads the documents and then extracts keywords from the downloaded documents to index them. In the proposed technique, by contrast, the attributes of a query interface and the values corresponding to these attributes are employed to index the documents. This indexing technique not only optimizes the speed of finding web documents but also gives more specific results for a search query.

Since hidden web data sources contain high quality data hidden behind a query interface, many efforts have focused on querying and integrating the sources. An important focus of these efforts is to build a uniform query interface (Global Query Interface) [14, 15] to the data sources in a domain, to make access to individual sources transparent to users. This Global Query Interface is designed by interface matching techniques. A separate Global Query Interface is designed for each domain. This Global Query Interface is employed to generate the required index, and the steps followed are given below.

Step 1: Each domain is assigned a Domain ID as shown in Table I. For instance, the Domain ID of the Airline domain is Dom 1.

TABLE I. DOMAINS AND DOMAIN ID

Domain Name     Domain ID
Airline         Dom 1
Book            Dom 2
Movie           Dom 3
Railways        Dom 4
Real estates    Dom 5

Step 2: Extract the attributes of each Global Query Interface. Figure 2 shows a Global Query Interface containing four attributes: Departure City, Destination City, Class and Airline.

[Figure: Global Query Interface form with the fields Departure City, Destination City, Class and Airline, and a Submit button]
Fig. 2. Global Query Interface for Airline

Step 3: Each attribute is assigned an Attribute ID. For instance, the Attribute IDs of the attributes of the Global Query Interface are listed in Table II.

TABLE II. ATTRIBUTES AND ATTRIBUTE ID

Attributes    Attribute ID
From City     Att 1
To City       Att 2
Class         Att 3
Airline       Att 4

Step 4: Each value in the value set of every attribute is assigned a Value ID. For instance, the Value ID for each value in the value sets of the Global Query Interface is listed in Table III.

TABLE III. VALUES AND VALUE ID FOR ATTRIBUTES FROM CITY AND TO CITY

Values             Value ID
Chennai            Val 1
Delhi              Val 2
Mumbai             Val 3
Kolkata            Val 4
Business           Val 5
Economic           Val 6
Air India          Val 7
Indian Airlines    Val 8
Air Sahara         Val 9
Jet Airways        Val 10

Step 5: Assign a value from the value set to each attribute of the Global Query Interface and submit the form.

Step 6: The downloaded document corresponding to the form submission is stored in the repository.

[Figure: hierarchy with a Domain Level, an Attribute Level, a Value Level and Documents]
Fig. 3. Generalized Architecture

Step 7: In addition, for each document, the (attribute, value) pairs for all the attributes of the Global Query Interface are stored. Create a table of association of values and documents for each attribute of the Global Query Interface. The information contained in Table II and Table III is employed to generate the required (attribute, value) pairs for all documents. For instance, the association of documents and (attribute, value) pairs is shown in Table IV.

TABLE IV. DOCUMENTS AND (ATTRIBUTE ID, VALUE ID)

Documents    (Attributes, value)
Doc 1        (Att 1, Val 1), (Att 2, Val 1), (Att 3, Val 5), (Att 4, Val 7)
Doc 2        (Att 1, Val 1), (Att 2, Val 2), (Att 3, Val 5), (Att 4, Val 7)
Doc 3        (Att 1, Val 1), (Att 2, Val 3), (Att 3, Val 5), (Att 4, Val 8)
Doc 4        (Att 1, Val 1), (Att 2, Val 4), (Att 3, Val 5), (Att 4, Val 12)
Doc 5        (Att 1, Val 2), (Att 2, Val 1), (Att 3, Val 6), (Att 4, Val 11)
Doc 6        (Att 1, Val 2), (Att 2, Val 2), (Att 3, Val 6), (Att 4, Val 8)
Doc 7        (Att 1, Val 2), (Att 2, Val 3), (Att 3, Val 5), (Att 4, Val 8)
Doc 8        (Att 1, Val 2), (Att 2, Val 4), (Att 3, Val 6), (Att 4, Val 10)
Doc 9        (Att 1, Val 3), (Att 2, Val 1), (Att 3, Val 6), (Att 4, Val 10)
Doc 10       (Att 1, Val 3), (Att 2, Val 2), (Att 3, Val 5), (Att 4, Val 12)
Doc 11       (Att 1, Val 4), (Att 2, Val 3), (Att 3, Val 6), (Att 4, Val 11)
Doc 12       (Att 1, Val 4), (Att 2, Val 4), (Att 3, Val 5), (Att 4, Val 9)
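Steps 1 to 7 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `TABLE_IV`, `ATTRIBUTE_IDS` and `build_index` are hypothetical, the data is the content of Table IV written in compact form, and the nested layout (domain, then attribute, then value, then documents) is an assumption suggested by the levels named in Fig. 3.

```python
from collections import defaultdict

DOMAIN_ID = "Dom 1"  # Airline domain (Table I)
ATTRIBUTE_IDS = ("Att 1", "Att 2", "Att 3", "Att 4")  # order of Table II

# Table IV in compact form (hypothetical encoding):
# document number -> Value ID numbers, listed in the attribute order above.
TABLE_IV = {
    1: (1, 1, 5, 7),
    2: (1, 2, 5, 7),
    3: (1, 3, 5, 8),
    4: (1, 4, 5, 12),
    5: (2, 1, 6, 11),
    6: (2, 2, 6, 8),
    7: (2, 3, 5, 8),
    8: (2, 4, 6, 10),
    9: (3, 1, 6, 10),
    10: (3, 2, 5, 12),
    11: (4, 3, 6, 11),
    12: (4, 4, 5, 9),
}


def build_index(domain_id, table_iv):
    """Nest the index as domain -> attribute -> value -> list of documents."""
    attribute_tables = {att: defaultdict(list) for att in ATTRIBUTE_IDS}
    for doc_no in sorted(table_iv):
        # Pair each attribute ID with the value the document was retrieved for.
        for att, val_no in zip(ATTRIBUTE_IDS, table_iv[doc_no]):
            attribute_tables[att][f"Val {val_no}"].append(f"Doc {doc_no}")
    # Convert the defaultdicts to plain dicts so missing keys are not created on read.
    return {domain_id: {att: dict(t) for att, t in attribute_tables.items()}}


index = build_index(DOMAIN_ID, TABLE_IV)
for value_id, docs in index["Dom 1"]["Att 1"].items():
    print(value_id, docs)
```

Each inner table maps a Value ID of one attribute to the documents retrieved with that value, so a lookup for a single (attribute, value) pair is a direct dictionary access rather than a scan over document text.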

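The claim that attribute-based indexing gives more specific results than keyword indexing can be illustrated on the Table IV data. The sketch below is an illustration under stated assumptions rather than the paper's implementation: the names `CITY_PAIRS`, `attribute_lookup` and `keyword_lookup` are hypothetical, and the keyword index is modelled as one that forgets which attribute a value was submitted under, which presumes that the text of each document mentions both of its city values.

```python
# (From City value, To City value) of each document, taken from the
# Att 1 and Att 2 entries of Table IV. A number n stands for "Val n".
CITY_PAIRS = {
    1: (1, 1), 2: (1, 2), 3: (1, 3), 4: (1, 4),
    5: (2, 1), 6: (2, 2), 7: (2, 3), 8: (2, 4),
    9: (3, 1), 10: (3, 2), 11: (4, 3), 12: (4, 4),
}


def attribute_lookup(from_val, to_val):
    """Attribute-aware lookup: match From City and To City separately, then intersect."""
    from_docs = {doc for doc, (f, _) in CITY_PAIRS.items() if f == from_val}
    to_docs = {doc for doc, (_, t) in CITY_PAIRS.items() if t == to_val}
    return from_docs & to_docs


def keyword_lookup(*values):
    """Keyword-style lookup (assumed model): a document matches a value
    if it carries that value under either city attribute."""
    result = set(CITY_PAIRS)
    for value in values:
        result &= {doc for doc, pair in CITY_PAIRS.items() if value in pair}
    return result


print(sorted(attribute_lookup(1, 3)))  # [3]
print(sorted(keyword_lookup(1, 3)))    # [3, 9]
```

With Val 1 as the From City value and Val 3 as the To City value, the attribute-aware lookup returns only Doc 3, whereas the keyword-style lookup also returns Doc 9, which carries the same two values in the opposite direction.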

IV. EXAMPLE

The given example is from the Airline domain. All the domains and Domain IDs are listed in Table I, which contains 5 domains. The Global Query Interface of the Airline domain is shown in Figure 2. There are 4 attributes in this Global Query Interface. Step 2 and Step 3 of the proposed method are applied to create the list of these attributes and their IDs given in Table II. For each attribute there exists a value set. By applying Step 4, each value in the value set of every attribute is assigned a Value ID as shown in Table III. There are 10 values corresponding to the 4 attributes.

Step 5 to Step 7 are applied to create the association of Docs and (attribute, value) pairs shown in Table IV. There are 12 Docs in this example. Table IV is employed to create an association of Value IDs and Doc IDs for each attribute, as shown in Table V and Table VI.

TABLE V. VALUE ID AND DOCS FOR ATTRIBUTE FROM CITY

Value ID    Docs
Val 1       Doc 1, Doc 2, Doc 3, Doc 4
Val 2       Doc 5, Doc 6, Doc 7, Doc 8
Val 3       Doc 9, Doc 10
Val 4       Doc 11, Doc 12

TABLE VI. VALUE ID AND DOCS FOR ATTRIBUTE TO CITY

Value ID    Documents
Val 1       Doc 1, Doc 5, Doc 9
Val 2       Doc 2, Doc 6, Doc 10
Val 3       Doc 3, Doc 7, Doc 11
Val 4       Doc 4, Doc 8

V. COMPARISON WITH KEYWORD INDEXING TECHNIQUE

The proposed domain specific indexing technique is compared with the traditional keyword based indexing technique using a query example. The query “flight from delhi to mumbai” is submitted to a search interface as shown in Figure 4. The responses to this query given by the two techniques are compared.

[Figure: search box containing the query “flight from delhi to mumbai” and a Search button]
Fig. 4. Search Interface

A. Response Given by Domain Specific Indexing

The response is generated by applying the following steps:

1) Extract content bearing words from the query string: In the given query the content bearing words are Flight, delhi and mumbai.

2) Identify the domain: To identify the domain of the query, a matching technique using the content bearing words (extracted in the previous step) is applied. For this query the domain is “Airline”.

3) Identify attributes: For instance, the attributes are “From” and “to”. Table II is used to get the Attribute ID corresponding to these attributes.

4) Identify values corresponding to attributes: In the given query, the value for attribute “From” is “delhi” and the value for “to” is “mumbai”. Table III is used to get the Value ID corresponding to these values.

5) Retrieve the Docs: For each attribute there is a specific table holding a list of Docs for each value of the attribute (Table V and Table VI). For “From City delhi”, the Docs are (Doc1, Doc2, Doc3, and Doc4). For “Destination City Mumbai”, the Docs are (Doc3, Doc7, Doc 11).

6) Take the intersection of the two sets retrieved in step 5. The intersection is (Doc3).

B. Response Given by Keyword Based Indexing

1) Extract content bearing words: In the given query the content bearing words are Flight, delhi and mumbai.

2) Retrieve the Docs containing the word “Flight”. There is no Doc containing the word “Flight”.

3) Retrieve the Docs containing the word “delhi”. For “delhi” the Docs are (Doc1, Doc2, Doc3, Doc4, Doc5, and Doc9).

4) Retrieve the Docs containing the word “mumbai”. For “mumbai” the Docs are (Doc3, Doc7, Doc9, Doc10, and Doc11).

5) Take the intersection of the sets retrieved in step 3 and step 4 (step 2 retrieved no Docs). The final Docs are (Doc3, Doc9).

In the given example, the proposed technique retrieved Doc 3, whereas the keyword based technique retrieved both Doc 3 and Doc 9. Doc 9 is from Mumbai to delhi, so it is not a relevant Doc.

VI. CONCLUSIONS AND FUTURE RESEARCH

A Domain Specific Indexing Technique for Hidden Web Docs was proposed. This new technique for hidden web crawled Docs uses attributes and their value sets to index the Docs. The previously proposed techniques are based on keywords: the crawler first downloads the Docs and then extracts keywords from the downloaded Docs to index them. To retrieve a Doc hidden behind a form, the crawler assigns values to the attributes of the form and then submits the form. The proposed technique uses these attributes and their values to index that Doc. The proposed technique gives a classification of Docs, which not only provides relevant results but also speeds up the searching process.

The future scope of this work is to evaluate the frequency at which the database should be refreshed on the basis of its domain. The proposed technique can also be applied to distributed processing, where the search takes place in parallel for more than one attribute at the same time, which will improve the efficiency of the search engine by increasing the speed of searching. The research can also be extended to ranking the hidden web documents.

REFERENCES

[1] Sergey Brin and Lawrence Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, In Proceedings of the 7th World Wide Web Conference (WWW7), 1998.
[2] Sunny Lam, “The Overview of Web Search Engines”, Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, February 9, 2001.
[3] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan, “Searching the Web”, Computer Science Department, Stanford University, ACM Transactions on Internet Technology, Volume 1, Issue 1, pp: 2-43, August 2001.
[4] S. Lawrence and C. L. Giles, “Searching the World Wide Web”, Science, 1998.
[5] Andrei Scherbina, “Clustering of Web Access Session”, 2008.
[6] Sriram Raghavan, Hector Garcia-Molina, “Crawling the Hidden Web Source”, In Proceedings of the 27th International Conference on Very Large Data Bases, pp: 129-138, 2001.
[7] S. Raghavan and H. Garcia-Molina, “Crawling the hidden web”, Technical Report 2000-36, Computer Science Department, Stanford University, Dec. 2000. Available at http://dbpubs.stanford.edu/pub/200036.
[8] Luciano Barbosa, Juliana Freire, “An adaptive crawler for locating hidden-Web entry points”, Proceedings of the 16th International Conference on World Wide Web, pp: 441-450, 2007.
[9] Farhi Marir and Kamel Houam, “Indexing and Retrieving Web Document Using Computational and Linguistic Techniques”. Available at http://www.springerlink.com/index/j449nm7000f4f1f0.pdf
[10] Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, et al., “Domain-specific keyphrase extraction”.
[11] Jagannathan Srinivasan, Ravi Murthy, Seema Sundara, Nipun Agarwal, Samuel DeFazio, “Extensible indexing: a framework for integrating domain-specific indexing schemes into Oracle8i”, Proceedings of the 16th International Conference on Data Engineering, IEEE, 2000.
[12] Maxim Martynov and Boris Novikov, “An Indexing Algorithm for Text Retrieval”, in Proceedings of the Third International Workshop on Advances in Databases and Information Systems, ADBIS 1996, Moscow, Russia, September 10-13, pp: 171-175, 1996.
[13] Ethan Miller, Dan Shen, Junli Liu, and Charles Nicholas, “Performance and Scalability of a Large-Scale N-gram Based Information Retrieval System”, Journal of Digital Information, Vol. 1, No. 5, pp: 1-25, Jan. 2000.
[14] A.K. Sharma, Komal Kumar Bhatia, “Merging query interfaces in domain specific hidden web databases”.
[15] David Eichmann, “Representing Knowledge in Domain Engineering”, Eighth Workshop on Software Reuse, Columbus OH, March 23-26, 1997.
