Proceedings of “Wilkes100 - Second International Conference on Computing Sciences”
Ontology Based Web Crawler to Search Documents in the Semantic Web

Vishal Jain 1 and Dr. Mayank Singh 2
1 Research Scholar, Computer Science and Engineering Department, Lingaya's University, Faridabad
2 Associate Professor, Krishna Engineering College, Ghaziabad
Email: [email protected], [email protected]
Abstract. The Semantic Web (SW), a term introduced by Tim Berners-Lee, is a broad concept in itself. The Semantic Web is a collection of information linked in such a way that it can be easily processed by machines; in effect, it is information in machine-readable form. It consists of Semantic Web Documents (SWDs), written in RDF or OWL, that contain information relevant to a user's query. Crawlers play a vital role in accessing information from SWDs: a crawler is software that systematically browses documents and extracts information from them for the purpose of indexing. There is therefore a need for prototype systems that perform Information Retrieval (IR) using crawlers, such as OWLIR, SWANGLER and SWOOGLE. This paper outlines SWOOGLE, a crawler-based indexing and retrieval system for finding SWDs. It describes the relations derived from RDF and OWL documents and lists ontologies, thus providing a complete description of the given problem.

Keywords: Semantic Web (SW), Ontology, SWOOGLE, Semantic Web Documents (SWDs)
1. Literature Survey

The earliest version of SWOOGLE was version 1.0, which offered an advanced database search query facility. Owing to its ability to retrieve SWDs, SWOOGLE has emerged as a semantic search engine through its later versions, SWOOGLE 2005 (version 2.1) and SWOOGLE 2007 (version 3.1). Finding SWDs from input keywords is a challenging task. Before SWOOGLE existed, documents were retrieved using conventional Information Retrieval (IR) approaches and traditional search engines. These engines are not intelligent enough to retrieve relevant documents: they return ordinary text documents instead of markup documents, so the result is a large set of documents that may or may not be relevant. Some researchers tried to apply Knowledge Management (KM) solutions in complex environments, a major phase in the emergence of the SW and of ontologies. With the SW came SWDs, which combine text documents with structured documents written in ontology languages. At that time there were no crawlers, and users found ontologies by combining retrieved documents with the help of ontology editors, which represent the concepts and relationships between the terms that match a given query. Web crawlers, such as the Google Bot crawler and Yahoo's crawler, came later. They can retrieve relevant documents and satisfy a user's query, but they cannot deliver ontologies or generate metadata.

2. Introduction

The Semantic Web (SW) came into existence because conventional search engines dissatisfy users by retrieving inadequate and inconsistent results; the documents they return are of widely varying relevance. These engines work on predefined standard terms in a centralized environment, accessing only standard ontologies. With the advent of the SW and ontologies, users can state new facts and use their own keywords and terms in different environments. Using an ontology, a user can perform the following tasks:
(a) Users can use Interface Description Languages (IDL) and services for different environments; IDL here means defining new data objects and their relations.
(b) Users can communicate with different agents using a shared ontology such as FOAF (Friend of a Friend).

The Semantic Web (SW) [1] is a combination of SWDs expressed in ontology languages (RDF, OWL). An ontology [2] is a categorization of concepts and of the relationships between terms, arranged in a hierarchical fashion. Although SWDs yield relevant information because they are characterized by semantic methods and ideas, finding the URLs of SWDs is a tedious job. There is therefore a need for crawler-based prototype systems that focus on extracting metadata for each SWD.

The rest of the paper is organized as follows. Section 3 introduces SWOOGLE, its significance and its architecture. Section 4 explains how SWOOGLE compares with other prototype systems and ontology repositories; it also describes the types of SWDs, the use of crawlers, and the Ontology Rank algorithm, which identifies whether a given document is a Semantic Web Ontology (SWO) or a Semantic Web Database (SWDB). Section 5 shows the current state of searching with SWOOGLE through screenshots.

3. Outline of SWOOGLE

SWOOGLE [3] is a crawler-based indexing prototype system that retrieves documents based on a set of classes, properties and methods and produces URIs matching the query.

3.1 Why SWOOGLE? What is its significance?

The SW is a web of documents, but these documents differ from HTML documents: conventional search engines, designed for HTML, are unable to extract the required information in a short and simple way. With this in mind, a prototype SW search engine called SWOOGLE was developed for extracting SWDs, usable by both human users and software agents. With SWOOGLE we can "AEQ" RDF and OWL documents, where A stands for Access, E for Explore and Q for Querying; querying lets us resolve open questions by posing them directly.

3.2 Defining and Analyzing SWOOGLE

SWOOGLE is a crawler-based indexing and retrieval system for the SW. Indexing means generating metadata: SWOOGLE extracts metadata for each SWD and records the relationships between documents. Documents are indexed by an Information Retrieval (IR) system that uses either character N-grams or URIrefs (Uniform Resource Identifiers) as keywords to find relevant documents (a small term-extraction sketch is given at the end of this subsection). SWOOGLE provides a web interface where a user can pose a query, or submit the URL of an SWD or web page directly.

Analysis: SWOOGLE supports three main activities, discussed in turn below.
(a) Searching for appropriate ontologies: conventional search engines often fail to find the required artifacts for a particular task; SWOOGLE helps in finding ontologies because it allows the user to query for documents.
(b) Finding data instances: SWOOGLE allows the user to query SWDs with keywords that use classes and properties.
(c) Characterizing the Semantic Web: the data collected by researchers leads to a characterization of the SW, so users can answer questions about an ontology.
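To make the URIref-based keying concrete, the following is a minimal sketch of extracting index terms from a single SWD. It assumes the Python rdflib library is available; the function names and the exact keying scheme are illustrative assumptions, not SWOOGLE's actual implementation.

```python
# Sketch only: URIref-based index terms for one SWD (assumes rdflib is installed).
from rdflib import Graph, URIRef

def index_terms(swd_url):
    """Collect URIrefs and their local names from one SWD as index keywords."""
    g = Graph()
    g.parse(swd_url)                       # rdflib picks the parser (RDF/XML, N3, ...)
    terms = set()
    for s, p, o in g:                      # iterate over all triples in the document
        for node in (s, p, o):
            if isinstance(node, URIRef):
                uri = str(node)
                terms.add(uri)             # the full URIref as one keyword
                local = uri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]
                if local:
                    terms.add(local.lower())   # local name as a plain-text keyword
    return terms

def char_ngrams(term, n=3):
    """Alternative keying scheme mentioned above: character n-grams of a term."""
    return [term[i:i + n] for i in range(max(len(term) - n + 1, 1))]
```

Either keying scheme (whole URIrefs or character n-grams) can then feed an ordinary IR index.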
3.3 SWOOGLE Architecture

SWOOGLE's architecture comprises four components (Figure 1): (a) SWD discovery, (b) metadata creation, (c) data analysis, and (d) the interface. All four components work independently and interact with each other through a database.
SWD discovery: discovers Semantic Web Documents and keeps up-to-date information about them.
Metadata creation: caches each SWD and generates metadata at both the semantic and the syntactic level.
Data analysis: uses the cached SWDs and their metadata to produce analyses, with the help of an IR analyzer and an SWD analyzer.
Interface: provides data services to the SW community. A schematic sketch of how these components might cooperate is given after Figure 1.

Figure 1: SWOOGLE Architecture [4]
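The sketch below is a schematic Python rendering of how the four components could cooperate through a shared database; the class and field names are illustrative assumptions, not SWOOGLE's actual code.

```python
# Schematic sketch of the four-component architecture (illustrative names only).
from dataclasses import dataclass, field

@dataclass
class Database:
    urls: set = field(default_factory=set)        # candidate SWD URLs
    cache: dict = field(default_factory=dict)     # url -> raw document text
    metadata: dict = field(default_factory=dict)  # url -> extracted metadata

class SWDDiscovery:
    """Finds candidate SWD URLs and keeps them up to date."""
    def run(self, db, seed_urls):
        db.urls.update(seed_urls)

class MetadataCreation:
    """Caches each SWD and records syntactic/semantic metadata."""
    def run(self, db, fetch, extract):
        for url in db.urls:
            db.cache[url] = fetch(url)
            db.metadata[url] = extract(db.cache[url])

class DataAnalysis:
    """Runs the IR analyzer and SWD analyzer over the cache and metadata."""
    def run(self, db, analyzers):
        return {name: analyze(db) for name, analyze in analyzers.items()}

class Interface:
    """Exposes simple query services over the shared database."""
    def search(self, db, keyword):
        return [url for url, meta in db.metadata.items() if keyword in str(meta)]
```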
4. How is SWOOGLE Better than Other Prototype Systems and Ontology Repositories?

Several prototype systems have been designed to solve user queries, among them OWLIR (Ontology Web Language and Information Retrieval), SWANGLER and SWOOGLE.
OWLIR is a prototype system that takes text documents as input; it does not directly accept RDF or OWL documents. It annotates the text documents with SW markup, produces results and then indexes them. To find SWDs with OWLIR, a custom indexing system has to be built; after that, both structured and plain text documents can be passed to it. It is therefore useful, but not an optimal system.
SWANGLER directly accepts RDF documents encoded in XML and produces documents suited to the given query. It could become an optimal system, but it fails for the following reasons:
(a) XML namespaces are not meaningful to search engines such as Google.
(b) Its tokenization rules are designed for natural languages.

SWOOGLE is an optimal crawler-based prototype system that maintains interoperability between SWDs. Since the Semantic Web contains RDF documents, SWOOGLE takes RDF documents directly as input and lists the ontologies that match the query. It can use either character N-grams or URIrefs as keywords to find relevant documents. OWLIR and SWANGLER encode only one triple per term; when there is more than one triple, they are replaced by a single URI. SWOOGLE, in contrast, can analyze a large number of SWDs containing many triples, and it captures more metadata on classes and properties to support a huge collection of documents. SWOOGLE is therefore better than, and optimal among, these prototype systems.

Comparison with ontology systems: SWOOGLE also differs from other SW engines and query systems. Ontology-based annotation systems such as SHOE, CREAM and WebKB focus on creating metadata for online documents without examining the whole document. Their ontology standards differ from the SWD versions, and these systems simply store RDF documents rather than interpreting and querying them. They are therefore not capable of handling millions of documents, because their own ontologies are not suitable for SWDs.

4.1 Types of Semantic Web Documents (SWDs)

A Semantic Web Document (SWD) is a document written in SW languages such as OWL or DAML+OIL that is online and easily accessible to all web users. SWDs are the only means of information exchange on the SW.
(a) Semantic Web Ontologies (SWOs): a document is an SWO when a significant portion of its statements defines new classes and properties or extends the definitions of terms used by other SWDs.
(b) Semantic Web Databases (SWDBs): a document is an SWDB when it does not define new terms; a given query is matched against the terms stored in such a document.

4.2 Use of Crawlers in Finding SWDs

The simplest way to find SWDs would be to use conventional search engines, but they do not return relevant results. A set of crawlers, including a Google-based crawler and focused crawlers, has therefore been developed for finding SWDs.

Google Crawler: it searches for URLs using the Google search engine, restricting results to extensions such as rdf, owl and daml; keywords are added to make the search more expressive. Searching for URLs relies on the Google crawler (GoogleBot), the Google indexer and the Google query processor. The process is as follows: web pages are downloaded by a web-crawling robot named GoogleBot, which retrieves pages on the web and hands them off to the Google indexer. GoogleBot is backed by many machines that request and fetch web pages. Each web page has an associated ID number called a docID, and when a URL is entered it is assigned a docID. A URL server sends the list of URLs to be fetched by the crawler, and fetched web pages are sent to a store server, which compresses the pages and stores them in a repository. The Google indexer uncompresses the documents, removes bad links from every web page, stores the important information, ignores some punctuation marks and converts all letters to lowercase. After the indexer, the Google query processor retrieves the stored documents and returns search results with the help of the doc server (Figure 2). A small sketch of extension-based SWD discovery follows Figure 2.

Figure 2: Illustration of the Google Bot/Crawler [5]
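As a rough illustration of this extension-based discovery, the sketch below builds filetype-restricted queries and filters the returned URLs. Here web_search is a hypothetical placeholder for whatever search API is available; it is not a real Google endpoint, and the query format is only an assumption.

```python
# Sketch of extension-based SWD discovery; `web_search` is a hypothetical
# placeholder callable (query -> list of URLs), not a real Google API.
SWD_EXTENSIONS = (".rdf", ".owl", ".daml", ".n3", ".nt")

def build_queries(keywords):
    """Combine user keywords with filetype restrictions, e.g. 'ontology filetype:rdf'."""
    return [f"{kw} filetype:{ext}" for kw in keywords for ext in ("rdf", "owl", "daml")]

def discover_swd_urls(keywords, web_search):
    """Collect candidate SWD URLs returned for the filetype-restricted queries."""
    found = set()
    for query in build_queries(keywords):
        for url in web_search(query):          # placeholder search call
            if url.lower().endswith(SWD_EXTENSIONS):
                found.add(url)
    return sorted(found)

# Example use (with some user-supplied search function):
#   urls = discover_swd_urls(["ontology", "foaf"], web_search=my_search_api)
```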
Focused Crawler: it finds documents within a given website and uses file extensions such as jpg and html to reduce complexity. SWOOGLE's SWD analysis builds on JENA2: the content of each SWD is analyzed first and its metadata is then produced.

4.3 Finding Ontologies using the Ontology Rank Algorithm

To find ontologies, we need to know the language features and RDF statistics of SWDs, described below.
SWOOGLE basic metadata: it records the symbols and semantic features of SWDs (Figure 3).
Figure 3: Categories of Basic Metadata

(a) Language Features: lists the features of SWDs and their properties. It includes:
Encoding: three encodings are used in SWDs, namely RDF/XML, N-Triples and N3.
Language: the SW language used, i.e. OWL, RDF, RDFS or DAML.
OWL species: the language species of SWDs written in OWL, namely OWL Lite, OWL DL and OWL Full.

(b) RDF Statistics: captures how an SWD defines new classes, properties and individuals. Three kinds of node are counted: classes (C), properties (P) and individuals (I). The RDF statistics refer to the nodes of the RDF graphs of SWDs: a node is a class if and only if it is not a blank node and is an instance of rdfs:Class; a node is a property iff it is not a blank node and is an instance of rdf:Property; an individual is a node that is an instance of some user-defined class.

Ontology Rank algorithm: it ranks all the ontologies returned by SWOOGLE while finding SWDs; the rank indicates to what extent a particular ontology can be used. Let d be an SWD, and let C(d), P(d) and I(d) be the sets of classes, properties and individuals defined in d. The ontology ratio of d is then calculated as

R(d) = (|C(d)| + |P(d)|) / (|C(d)| + |P(d)| + |I(d)|)

If R(d) = 0 the SWD is a pure SWDB, and if R(d) = 1 it is a pure SWO.
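Under the counting rules above, the ontology ratio can be computed mechanically. Below is a minimal sketch using the Python rdflib library; the exact counting of user-defined classes, properties and individuals is an assumption for illustration and simplifies SWOOGLE's actual analysis.

```python
# Sketch only: ontology ratio R(d) of one SWD, assuming rdflib is installed.
from rdflib import Graph, BNode, RDF, RDFS, OWL

def ontology_ratio(swd_url):
    g = Graph()
    g.parse(swd_url)
    # classes: non-blank instances of rdfs:Class (or owl:Class)
    classes = {s for s in g.subjects(RDF.type, RDFS.Class) if not isinstance(s, BNode)}
    classes |= {s for s in g.subjects(RDF.type, OWL.Class) if not isinstance(s, BNode)}
    # properties: non-blank instances of rdf:Property
    props = {s for s in g.subjects(RDF.type, RDF.Property) if not isinstance(s, BNode)}
    # individuals: non-blank nodes typed with a class defined in this document
    individuals = {s for s, _, c in g.triples((None, RDF.type, None))
                   if c in classes and not isinstance(s, BNode)}
    total = len(classes) + len(props) + len(individuals)
    return (len(classes) + len(props)) / total if total else 0.0

# R(d) = 0 suggests a pure SWDB; R(d) = 1 suggests a pure SWO.
```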
(c) Ontology Annotations: the properties that describe an SWD as an ontology, namely label, comment and version info.

5. Illustration of SWOOGLE

This section describes the layout of SWOOGLE version 3.1, used in 2007. It allows users to specify an arbitrary string and finds the SWDs relevant to that string. SWOOGLE analyses the whole document and returns only the relevant parts of each document, in ranked order: URLs, terms, descriptions and namespaces.
Figure 4: SWOOGLE Start-Up Page

As an example, the string "Economic Crisis" was searched. SWOOGLE returns the SWDs matching these keywords in ranked order, with separate documents for the keyword "economic" and for the keyword "crisis", as shown below.
Figure 5: SWOOGLE query result

In the above screenshot, the first SWD is encoded in N3 and its ontology ratio is 0.61; the second document is encoded in RDF/XML with an ontology ratio of 0.97. The namespaces related to the second SWD are shown below.
Figure 6: Namespaces of the given SWD

The current version of SWOOGLE also reports statistical information on the number of SWDs retrieved, the number of triples generated and other parameters, which shows that SWOOGLE can handle a huge collection of documents.
Figure 7: SWOOGLE statistical information

6. Conclusions

This paper has presented a way of extracting Semantic Web Documents (SWDs) using SWOOGLE, a crawler-based prototype indexing and retrieval system. SWOOGLE generates metadata for the retrieved SWDs and lists the ontologies related to the given keywords. It is better than other prototype systems such as OWLIR and SWANGLER, which require a custom indexing module to be built and use their own ontology standards that are not suitable for SWDs.
OWLIR and SWANGLER treat markup as structured information and compute results over it. SWOOGLE stores metadata about RDF documents in its database so that it can retrieve SWDs based on classes (C), properties (P) and individuals (I). SWOOGLE is designed to work with all SWDBs and is better than current web search engines such as Google, because Google works with natural language only.

Acknowledgement

I, Vishal Jain, give my sincere thanks to Prof. M. N. Hoda, Director, BVICAM, New Delhi, for giving me the opportunity to pursue a Ph.D. at Lingaya's University, Faridabad.

References
[1]. T. Berners-Lee, "The Semantic Web", Scientific American, May 2007.
[2]. Berners-Lee, J. Lassila, "Ontologies in Semantic Web", Scientific American, May 2001, pp. 34-43.
[3]. Tim Finin, Anupam Joshi, Vishal Doshi, "Swoogle: A Semantic Web Search and Metadata Engine", in Proceedings of the 13th International Conference on Information and Knowledge Management, pp. 461-468, 2004.
[4]. Gagandeep Singh, Vishal Jain, "Information Retrieval (IR) through Semantic Web (SW): An Overview", in Proceedings of CONFLUENCE 2012 - The Next Generation Information Technology Summit, Amity School of Engineering and Technology, September 2012, pp. 23-27.
[5]. M. Preethi, Dr. J. Akilandeswari, "Combining Retrieval with Ontology Browsing", International Journal of Internet Computing, Vol. 1, Issue 1, 2011.
[6]. T. Finin, J. Mayfield, A. Joshi, "Information Retrieval and the Semantic Web", IEEE/WIC International Conference on Web Intelligence, October 2003.
[7]. U. Shah, T. Finin and A. Joshi, "Information Retrieval on the Semantic Web", Scientific American, pp. 34-43, 2003.
[8]. N. Stojanovic, R. Studer, L. Stojanovic, "An Approach for the Ranking of Query Results in the Semantic Web", The Semantic Web - ISWC, 2003, pp. 500-516.
[9]. Swati Ringe, Nevin Francis, Palanawala, "Ontology Based Web Crawler", International Journal of Computer Applications in Engineering Sciences, ISSN 2231-4946, Vol. II, Issue III, September 2012.
[10]. Goetz Graze, "Query Evaluation Techniques for Large Databases", ACM Computing Surveys, 2003.

About the Authors

Vishal Jain completed his M.Tech (CSE) at USIT, Guru Gobind Singh Indraprastha University, Delhi, and is pursuing a Ph.D. in the Computer Science and Engineering Department, Lingaya's University, Faridabad. He is presently working as Assistant Professor at Bharati Vidyapeeth's Institute of Computer Applications and Management (BVICAM), New Delhi. His research areas include Web Technology, the Semantic Web and Information Retrieval. He is also associated with CSI and ISTE.
Dr. Mayank Singh completed his M.E. in Software Engineering at Thapar University and his Ph.D. at Uttarakhand Technical University. His research areas include Software Engineering, Software Testing, Wireless Sensor Networks and Data Mining. He is presently working as Associate Professor at Krishna Engineering College, Ghaziabad. He is associated with CSI, IE(I), IEEE Computer Society India and ACM.