Artif Intell Rev (2012) 38:9–24 DOI 10.1007/s10462-011-9238-6

Ontology-assisted automatic precise information extractor for visually impaired inhabitants

Ahmad C. Bukhari · Yong-Gi Kim

Published online: 26 May 2011 © Springer Science+Business Media B.V. 2011

Abstract As the internet grows rapidly, millions of web pages are being added on a daily basis. The extraction of precise information is becoming more and more difficult as the volume of data on the internet increases. Several search engines and information fetching tools are available on the internet, all of which claim to provide the best crawling facilities. For the most part, these search engines are keyword based. This poses a problem for visually impaired people who want to get the full use from online resources available to other users. Visually impaired users require special aid to get along with any given computer system. Interface and content management are no exception, and special tools are required to facilitate the extraction of relevant information from the internet for visually impaired users. The HOIEV (Heavyweight Ontology Based Information Extraction for Visually impaired User) architecture provides a mechanism for highly precise information extraction using heavyweight ontology and a built-in vocal command system for visually impaired internet users. Our prototype intelligent system not only integrates and communicates among different tools, such as voice command parsers, domain ontology extractors and short message engines, but also introduces an autonomous mechanism of information extraction (IE) using heavyweight ontology. In this research we designed a domain specific heavyweight ontology using OWL 2 (Web Ontology Language 2), and for axiom writing we used PAL (Protégé Axiom Language). We introduced a novel autonomous mechanism for IE by developing prototype software. A series of experiments were designed for the testing and analysis of the performance of heavyweight ontology in general, and of our information extraction prototype specifically.

Keywords Heavyweight ontology for IE · Intelligent information extraction · IE for blind user · E-Store ontology · Hybrid technique for ontology modeling

A. C. Bukhari · Y.-G. Kim (B) Department of Computer Science, Research Institute of Computer Science and Information Communication, Gyeongsang National University Jinju, Kyungnam, 660-701, Republic of Korea e-mail: [email protected] A. C. Bukhari e-mail: [email protected]


1 Introduction

The heterogeneity of the current internet is growing rapidly. Thousands of web pages are added to the internet every day, and millions of transactions take place using e-commerce applications (Crescenzi and Mecca 2004). The dynamicity of web applications is increasing day by day, and extracting relevant information from this large pool of data with conventional technologies is extremely difficult. As a result, there has been a strong impetus in the field of information engineering to develop and improve techniques to extract precise information from huge volumes of extraneous data.

Currently the web is little more than an accretion of millions of web pages, and current IE tools use keyword based search mechanisms. Conventional search engines are not able to find exact information according to user requirements. Search engines fall into two categories: crawler based search engines and human powered directories. Crawler based search engines, such as Google and Yahoo, automatically index the web pages which are uploaded to the internet. They collect and index the web pages after applying optimization criteria, which vary from search engine to search engine (Gatial et al. 2005). Whenever a user makes an inquiry of a crawler based search engine, the spider of the search engine compares the keywords against the indexed pages and displays the results which meet the criteria. In the case of human powered directories, users manually add their website details to web directories instead of relying on a crawler. As a consequence, the indexing repositories of human powered directories must be updated by hand whenever websites change, while crawler based search engines update their indexes automatically. Keyword based search engines extract results on the basis of keywords. As a result, a high volume of data is recalled, but the precision ratio is very low.

In Shadbolt et al. (2006) the researchers presented the idea of an intelligent web. The layered architecture of this intelligent web concept would give machines the ability to understand a request and infer possible solutions. This ability to understand is currently known as the 'semantic web.' The basic theme behind the semantic web is to make the web intelligent so that it can understand and conceive the meaning of data (Chang et al. 2006; Gerber et al. 2007; Nasrolahi et al. 2009). This would be a powerful technology for assisting visually impaired users, who require special consideration in terms of interface and how this impacts their ability to extract information from the web efficiently.

In this paper we discuss an extended version of the OBIEBU architecture, whose philosophical anatomy was presented by our team in Chan et al. (2010). The architecture is a novel approach and a ray of hope for visually impaired people. Our proposed heavyweight ontology based highly precise information extraction architecture for visually impaired users describes a mechanism to extract precise information based on a vocal command system, as discussed in Michail and Christos (2007). The rest of the paper is organized as follows: Sect. 2 describes related work and the tools and technologies previously used for information extraction; Sect. 3 covers the semantic web and new information extraction methodologies; the proposed architecture is discussed in Sect. 4; and Sects. 5 and 6 present our experiments and conclusions, respectively.

2 Related work

Several techniques have recently been developed to extract relevant information from the internet. Natural language processing, part-of-speech tagging, named entity tagging and ontology based information extraction are some of these techniques. In this section, we analyze several of them and also review older approaches that visually impaired users have relied on to find information online.


2.1 Web content mining techniques for information extraction

There are several useful information repositories available on the internet, such as product catalogues, current news, weather forecasts, event listings and so on. So initially, web content mining techniques were introduced to fetch relevant information from large, multidimensional networks. Current web content mining techniques extract and integrate useful data, information and knowledge from web pages. The two methods most widely used in web content mining are wrapper induction information extraction and automatic information extraction. In the wrapper induction technique (Crescenzi and Mecca 2004; Kushmerick 1999), machine learning techniques are used to make rules for information extraction. The user searches for an item on the web page, and the machine learning techniques derive rules from the extracted items. These rules are later used to find information about contents and products. We manually tag the contents of the web pages, and the machine learning techniques then build the extraction rules. But writing wrapper classes and keeping them updated is a very tough job when you consider the dynamic growth of the internet: whenever website contents and structure change, we have to relearn our wrapper and make significant changes according to the new structures. The probability of errors occurring in wrappers is also very high (Kushmerick 1999). Automatic information extraction from large websites is another technique for extracting information from large websites. In this technique a sample webpage is given for wrapper learning instead of writing a new wrapper. The issues of disjunction are very hard to handle in automatic information extraction, and it is very difficult to generate attribute names for any extended data (Crescenzi and Mecca 2004).

2.2 Information extraction for visually impaired users

Several software packages have been developed to help visually impaired internet users retrieve precise information quickly, but most of them use screen reading techniques, such as "BrookesTalk" (Brown et al. 2007). BrookesTalk is a tool which gives a vocal summary of the page to the user. It uses different voices for different parts of the webpage so that the visually impaired user can understand the architecture of the webpage. With the help of this tool, the user can easily navigate between web pages and extract information at some level. Its habit of producing different sounds for different parts of the webpage sometimes irritates users. Its design is based on the standard tool for the visually impaired, "pwWebSpeak". Another approach, introduced in Zajicek et al. (1998), aims to help visually impaired users search and navigate websites. The functionality most often included in these software packages is as follows: providing guidance for visually impaired users, empowering visually impaired users and reducing cognitive load. The result of this research was a prototype called "NavAccess". It provides what was regarded as a pleasant, effective and efficient environment for visually impaired users. This prototype used a server based searching mechanism: the server searches all the web pages of the website and stores the related pages in a cache. NavAccess is structured according to the User Agent Accessibility Guidelines (UAAG) 1.0 and the Web Content Accessibility Guidelines (WCAG), standards issued by the World Wide Web Consortium (W3C).
2.3 Voice browsing

Nowadays several web companies are developing voice portals to assist their customers. These voice portals provide an interactive voice interface and are typically developed in VoiceXML, a language designed specifically to hold voice content. VoiceXML is


the standard language recommended by the W3C. Normally in voice browsing the voice service is speaker independent: the voice service receives the request and then opens the hyperlink. Two techniques are used in voice browsing: phone browsing and transcoding. In transcoding, existing HTML based pages are converted into voice using a text to speech (TTS) mechanism, as defined in Shadbolt et al. (2006); Meng et al. (2004).

3 Semantic web and ontology for precise information extraction

The invention of the internet has profoundly changed our means of communication and information sharing. Easy access to the internet has played a major role in both its own growth and its utility in day to day life. In general, anyone and everyone can design a web page and share it without any approval. As a result, we can find information on almost every topic. Trillions of bytes of data propagate every day on the internet, and more than a million companies have shifted their businesses to the internet (Chang et al. 2006). When the internet was invented, its vision was to create a virtual world in which humans and computers work together while sharing information. Although not explicitly stated, this vision of the internet implied the development of computers with the same ability to understand content as a human. Its continuous, unchecked growth derailed its structure from this vision. Because of its gigantic scale and the human ability to process information and make connections, there was not sufficient impetus to correct the deficiencies on the hardware and software side of the equation regarding relevant and correct information retrieval. Currently one has to work hard to find correct and relevant information in a short span of time, because the web is designed for human consumption and does not provide any help for machine interpretation. Computers do not yet have the ability to understand the actual meaning of sentences as humans do; a computer cannot understand the difference between "man eats chicken" and "chicken eats man". Several statistical and natural language processing techniques were used in the past to make computers operate intelligently. With natural language processing, nouns and verbs can be identified and classified easily and appropriately, but machines still cannot infer information exactly from the available data. Web mining techniques also failed because the volume of data on the internet is huge and just as widely scattered.

In Gerber et al. (2007); Chang et al. (2006) and Shadbolt et al. (2006) researchers presented a vision of transforming the current link directory repository of the web into large but efficiently distributed repositories of knowledge. This can be done by adding semantic annotation to web pages, which gives machines the ability to understand the underlying meaning of concepts and their relationships. The layered architecture of the semantic web divides the whole process into steps from Unicode/URI to Trust; the layered architecture in Gerber et al. (2007) clearly defines all of these steps. To add semantic annotation, different techniques were developed, starting with XML, which added metadata to facilitate the understanding of web contents. Later, the Resource Description Framework (RDF) was used to describe web contents. RDF arranges contents and concepts into triples; subject-predicate-object logic is the underlying process of RDF. RDF is currently used in many industrial and professional software tools.
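As a brief illustration of the triple model described above, the following sketch builds one subject-predicate-object statement with the Apache Jena RDF API; the resource and property URIs are made up for the example, and the paper itself does not use Jena.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class TripleSketch {
    public static void main(String[] args) {
        // an empty in-memory RDF model
        Model model = ModelFactory.createDefaultModel();

        // subject: a product; predicate: a (hypothetical) price property; object: a literal value
        Resource iphone = model.createResource("http://example.org/e-store#IPhone");
        Property hasPrice = model.createProperty("http://example.org/e-store#hasPrice");
        iphone.addProperty(hasPrice, "450");

        // print the resulting triple in Turtle syntax
        model.write(System.out, "TURTLE");
    }
}
```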
4 Heavyweight ontology based proposed architecture

Our proposed HOIEV (Heavyweight Ontology Based Information Extraction for Visually impaired User) is a novel architecture which facilitates the extraction of highly precise


Fig. 1 The proposed HOIEV architecture

information from heterogeneous web sources using a vocal command system, enabling exact and timely decision making for the visually impaired user. The HOIEV architecture is based on already developed technologies, so little effort is needed to deploy it publicly. According to the World Health Organization (WHO) (World Health Organization 2011), some key facts regarding the visually impaired are:

• There are 314 million people in the world with visual impairment; 45 million of these are completely sightless.
• Most visually impaired people are elderly, in every part of the world. The overall share of visual impairment in developing countries is 87%.
• The proportion of age related visual impairment is increasing relative to visual impairment due to infection.
• Correction of refractive errors could give normal vision to more than 12 million children (ages 5–15).
• About 85% of all visual impairment is avoidable globally.

To enable visually impaired people globally to have the same quality of life as normally sighted people is a daunting task. The proposed HOIEV architecture, as described in Fig. 1, gives hope to visually impaired users with its novel design. The next section divides and explores the stages of the HOIEV architecture.

4.1 The remote user interaction layer (layer 1)

The remote user interaction layer was specially designed to facilitate remote visually impaired users. We assumed in our experiment that the visually impaired user had a smart phone with Pocket CMU SPHINX-2 installed as an application on it. The user records his/her voice message in Pocket CMU SPHINX 2 (Carnegie Mellon University 2011) speech to text conversion mode. The user's voice is automatically converted into text and sent to the central server for further processing.
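The handset application in the paper uses Pocket CMU SPHINX 2, whose C API is not shown here. Purely as an illustration of the kind of continuous speech-to-text step that layer 1 relies on, the sketch below uses CMU's related Sphinx-4 Java API (Configuration and LiveSpeechRecognizer from the sphinx4 library); the model paths are the stock Sphinx-4 English models and are not taken from the paper.

```java
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;

public class SpeechToTextSketch {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // stock English models shipped with Sphinx-4 (illustrative, not from the paper)
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
        recognizer.startRecognition(true);            // true = discard previously cached audio

        SpeechResult result = recognizer.getResult(); // blocks until an utterance is decoded
        System.out.println("Recognized text: " + result.getHypothesis());

        recognizer.stopRecognition();
        // in the HOIEV pipeline, the recognized text would then be sent to the central
        // server as an SMS/GPRS message for further processing
    }
}
```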


Fig. 2 The STT conversion using Pocket CMU SPHINX 2

This process requires a real-time continuous speech recognition system (RTCSRS) which must be effective, fast and lightweight, so that mobile devices can use it with low power consumption. For this purpose we chose Pocket CMU SPHINX 2, as it has already been tested and analysed by researchers several times in their experiments. Pocket CMU SPHINX 2 is an open source large vocabulary continuous speech recognition system. There are many portable text to speech (TTS) and speech to text (STT) recognition systems available on the internet under closed licenses, but most of them are expensive and distributed without source code. Pocket CMU SPHINX 2 (Daines et al. 2006) is the first known open source continuous speech recognition system of its kind to date. Pocket CMU SPHINX 2 is lightweight, and specially developed for mobile platforms and low processing handheld devices. At the layer 1 level, the recorded voice of the user is converted into text using Pocket CMU SPHINX 2. Figure 2 describes the process graphically: the visually impaired user records a voice message, and the built-in Pocket CMU SPHINX 2 encodes it into data packets. These data packets are sent to the main server on which the prototype tool is installed for subsequent processing.

4.2 Information extraction kernel (layer 2)

4.2.1 The AJAX based SMS inbound panel

Three serial activities are performed at the second layer, the information extraction kernel. The visually impaired user's text is received at the SMS server's inbound panel, as shown in Fig. 3. The user can subscribe to an SMS service or GPRS for sending text messages to a server; almost all mobile phone operators provide such text messaging services for a nominal charge. The text message is received at a specially designed Asynchronous JavaScript and XML (AJAX) based dynamic panel. Figure 3 presents the graphical interface of the AJAX enabled SMS inbound panel. AJAX is not a single technology, but a combination of JavaScript and XML. With AJAX, web applications can retrieve data from the server asynchronously in the background without interfering with the display and behaviour of the existing page. In AJAX, the server and client remain connected with the help of the "XMLHttpRequest" method, which is specially developed for seamless communication purposes (Meng et al. 2004). As an SMS reaches the inbound panel, it is immediately forwarded as a query to the ontology based focused crawler for further processing. Our prototype tool is developed in Java (JSP) and uses XML for information exchange. Figure 3 shows that the prototype SMS server module is developed with JSP and AJAX for handling incoming and outgoing data traffic.
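The paper does not publish code for the inbound panel. As a hypothetical sketch only, a JSP/servlet-style endpoint that accepts the forwarded message text and hands it on for processing might look like the following; the class name, parameter names and the QueryDispatcher helper are all invented for illustration.

```java
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;

// Hypothetical inbound endpoint: the SMS gateway (or the AJAX panel) posts the
// user's text here, and the servlet forwards it toward the ontology crawler.
public class SmsInboundServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        String messageText = request.getParameter("message"); // illustrative parameter name
        String sender = request.getParameter("sender");

        // in the real prototype this step converts the text into a DL query and
        // passes it to the ontology based focused crawler (see Sects. 4.2.2 and 4.3)
        QueryDispatcher.dispatch(sender, messageText);

        response.setContentType("text/plain");
        response.getWriter().write("OK");
    }
}

// Placeholder for the downstream processing; not part of the published prototype.
class QueryDispatcher {
    static void dispatch(String sender, String messageText) {
        System.out.println("Query from " + sender + ": " + messageText);
    }
}
```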


Fig. 3 The SMS server module of prototype tool

4.2.2 Modeling heavyweight ontology for the ontology crawler

Ontology is considered a branch of metaphysics which focuses on the study of existence; in knowledge engineering, an ontology is an explicit specification of a shared conceptualization (John et al. 2003). In an ontology we basically focus on the concepts (classes) of the domain, their relationships and their properties. An ontology arranges the classes in a hierarchical structure, in the form of a subclass and superclass hierarchy, and also defines which property has constraints on its values. Basically, an ontology is developed to share a common understanding of domain knowledge among people and software, and to help us reuse the classes of a domain instead of remodelling them. Ontologies are written in a specific language called OWL (Web Ontology Language), which was developed by the W3C and is specially designed for ontology modelling. OWL 2 is the extended version of OWL, which adds some new features and rationale: to increase expressivity, some new constructs were introduced in OWL 2, and its metamodeling features were enhanced through extended annotation capabilities. Figure 4 shows a code view of an OWL 2 class with constraints. There are many methodologies used in the development of ontologies (Jones et al. 1998). In Natalya and Deborah (2001) a seven-step ontology development process was discussed which covers almost every aspect of ontology modelling (a programmatic sketch of some of these steps follows the list below). The steps are:

1. Determine the domain and scope of the ontology
2. Consider reusing existing ontologies
3. Enumerate important terms in the ontology
4. Define the classes and the class hierarchy
5. Define the properties of classes—slots
6. Define the facets of the slots
7. Create instances
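The E-Store ontology in this paper was authored in Protégé. Purely as an illustration of steps 4, 5 and 7 above, the following sketch shows how similar class, property and instance axioms could be created programmatically with the OWL API; the namespace and the Product/SmartPhone/hasPrice names are invented for the example and are not taken from the actual E-Store ontology.

```java
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

public class EStoreOntologySketch {
    public static void main(String[] args) throws OWLOntologyCreationException {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = manager.getOWLDataFactory();
        IRI base = IRI.create("http://example.org/e-store.owl"); // hypothetical namespace
        OWLOntology ontology = manager.createOntology(base);

        // Step 4: define classes and the class hierarchy
        OWLClass product = df.getOWLClass(IRI.create(base + "#Product"));
        OWLClass smartPhone = df.getOWLClass(IRI.create(base + "#SmartPhone"));
        manager.addAxiom(ontology, df.getOWLSubClassOfAxiom(smartPhone, product));

        // Step 5: define a property of the class (a data property "slot") with a constraint
        OWLDataProperty hasPrice = df.getOWLDataProperty(IRI.create(base + "#hasPrice"));
        OWLDataSomeValuesFrom priceRestriction =
                df.getOWLDataSomeValuesFrom(hasPrice, df.getIntegerOWLDatatype());
        manager.addAxiom(ontology, df.getOWLSubClassOfAxiom(smartPhone, priceRestriction));

        // Step 7: create an instance and assert its price
        OWLNamedIndividual iPhone = df.getOWLNamedIndividual(IRI.create(base + "#IPhone"));
        manager.addAxiom(ontology, df.getOWLClassAssertionAxiom(smartPhone, iPhone));
        manager.addAxiom(ontology, df.getOWLDataPropertyAssertionAxiom(hasPrice, iPhone, 450));
    }
}
```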

For the construction of the domain model, a number of methodologies precisely address ontology development issues in different domains (Shadbolt et al. 2006), and the


Fig. 4 A snippet of an E-Store ontology OWL 2 class


Fig. 5 The PAL axioms code file

construction of an ontology has become both an art and an understood engineering process (Lovrencic and Tomac 2006; Wei 2010). We developed a heavyweight ontology for an E-Store using OWL 2 and PAL. The basic difference between a lightweight and a heavyweight ontology is that a heavyweight ontology is enriched by axioms, which produce a deeper level of expressivity. Past research experiments were based on lightweight ontologies, which produced results with limited inferences. PAL is used to write axioms in Protégé (a well known ontology development tool from Stanford University). PAL is basically a constraint language that is used to enforce the semantic properties of knowledge bases encoded in Protégé (Stanford University 2001; Lovrencic and Tomac 2006). These axioms are based on mathematical notation; they add constraints and link information with other sets of information by applying certain criteria. The addition of axioms to a lightweight ontology increases the expressivity level of the ontology and helps machine interpretability. In PAL, constraints include a set of variables, range definitions and logical statements. Predefined predicates and functions create a statement, and logical connectives such as => (if), and, and or are used to link the ontology and other predefined predicates. The PAL constraint for the E-Store ontology stating that an iPhone's price must be greater than that of a Nokia3200 can be coded as shown in Fig. 5. IPhone is the name of the variable; variable names start either with '?' or '%', which define local and global variables, respectively. FRAME is one type of variable; other types are SET, SYMBOL, STRING, INTEGER, and NUMBER.

Figure 6 shows the inferred model of the E-Store ontology. The inferred model is the graphical structure of interconnected domain elements. We used the Pellet reasoner tool in Protégé to generate inferences. This domain ontology is used as input to our ontology crawler, and the ontology crawler populates the corpus from the internet: every website which falls under the ontology defined rules is considered part of the corpus. Our designed prototype automatically receives short messages and converts them into description logic queries (DL-Queries). The resulting queries fetch the information from the corpus. This information is not in natural language, requiring a natural language processing API to interpret the query results in order to make them understandable to the user (Stanford University 2001).
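The paper does not give the conversion code for this SMS-to-DL-query step. As a hypothetical illustration only, a very simple keyword mapping from an incoming SMS text to a Manchester-syntax class expression of the kind used in Sect. 5 could look like this; all rules and names here are invented and do not reflect the prototype's actual parsing logic.

```java
// Hypothetical sketch: map an incoming SMS text to a Manchester-syntax
// class expression such as those shown as DL queries in Sect. 5.
public class SmsToDlQuery {

    public static String toClassExpression(String smsText) {
        String text = smsText.toLowerCase();
        if (text.contains("travel") && text.contains("airline")) {
            return "Travel_Agency and provideTravelScheduling value RyanAirLine";
        }
        if (text.contains("hotel")) {
            return "Travel_Agency and bookHotelReservation value HotelReservation1";
        }
        // fall back to a plain keyword-to-class lookup
        return capitalize(text.trim().split("\\s+")[0]);
    }

    private static String capitalize(String s) {
        return s.isEmpty() ? s : Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(toClassExpression("Travel agency with airline RyanAirLine"));
    }
}
```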


Fig. 6 E-Store heavyweight ontology inferred model generated by the Protégé OWLViz extension

4.3 The information extraction API layer (layer 3)

4.3.1 Precise information extraction

Relevant information extraction from the web has always been a difficult process in the information engineering domain. A web crawler is a program that browses the web automatically. The difference between a simple web crawler and a focused web crawler is that a simple crawler just searches the web and indexes the results, whether relevant or not, whereas a focused crawler only indexes those pages which belong to a particular domain and which converge with certain predefined criteria (Etzioni et al. 2008). Many web crawlers developed in the past used different search criteria based on page relevance, link relevance and so on. Usually server side dynamic web pages are backed by databases: the page contents are populated in real time from the database, and the populated values change according to the defined criteria. Therefore, a crawler which indexes web pages on the basis of page and link relevance fails at this stage. To overcome this problem, ontology based crawlers are used to extract precise information from the web. The proposed HOIEV architecture at the third layer utilizes the ontology based


focused crawler to index the relevant web pages from the scattered data on the internet. The internal ontology based web crawler first fetches web pages from the web with its fetcher component, and the information is then parsed. The data is stored in the local cache, after which the ontology techniques are applied to the fetched web pages; only the most relevant web pages are indexed (Etzioni et al. 2008; Gatial et al. 2005). We used the open source ontology authoring tool Protégé 4.0.2 to develop the E-Store ontology, which defines the classes, subclasses and most of the properties of the E-Store domain. As shown in Fig. 2, suppose a user sends a query from the voice user interface; after the Pocket CMU Sphinx mechanism converts the speech into text, the query arrives at the AJAX based query panel. The AJAX based panel automatically routes the query to the ontology based focused crawler, which is loaded with our modified E-Store ontology. The ontology based crawler starts its operation and indexes only those results which meet the ontology criteria.

The named entity utility of ANNIE (a Nearly-New Information Extraction system) is widely used and is an extremely reliable component in information engineering. It extracts information about people, places, prices, items and so on and presents it in natural language format. In the HOIEV architecture, at the third layer, the results which are extracted on the basis of the ontology are stored in a corpus, and this corpus is used as input for an information parser based on the GATE (General Architecture for Text Engineering) API (Cunningham et al. 1995; Etzioni et al. 2008). In information extraction, we take unstructured and scattered data as input and produce fixed length output for further storage, display or analysis. Information extraction is not a simple task, as most of the input is in some form of human language; information extraction tools therefore provide facilities for tokenization, part-of-speech tagging and named entity recognition. GATE is an architecture, development environment and framework for building systems that process human language (Cunningham et al. 1995). ANNIE is the information extraction module under the GATE framework, and it is heavily used and effective in the information engineering domain. We used the GATE API to develop our prototype extractor engine. Our prototype tool autonomously receives an SMS from a user, converts it into a DL-Query, passes it to the ontology based crawler for information extraction from the internet, refines the information and sends the precise information back to the user. The extraction rules in GATE are written in JAPE (Java Annotation Patterns Engine), and these rules guide the extraction process. GATE is developed in Java and is freely available with source code on the internet. We added JAPE based domain specific ontology rules to our prototype to get the desired results. The information extracted at this stage is sent to the outbound panel of the software, which routes it to the user's smart phone. At the last stage, the text is converted into voice, and the user can hear the answer to the query in vocal format. Figure 7 shows the precise information extraction module.
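As a minimal sketch of the ANNIE/GATE step, assuming a classic GATE Embedded installation (with GATE's home and plugin directories configured) and a made-up page URL, the standard ANNIE pipeline can be run over a one-document corpus roughly as follows; this only illustrates the published GATE API pattern, not the prototype's own code.

```java
import gate.*;
import gate.creole.SerialAnalyserController;
import java.io.File;
import java.net.URL;

public class AnnieExtractionSketch {
    public static void main(String[] args) throws Exception {
        Gate.init();  // assumes gate.home has been configured
        // load the ANNIE plugin shipped with GATE
        Gate.getCreoleRegister().registerDirectories(
                new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());

        // assemble the standard ANNIE processing resources into a pipeline
        SerialAnalyserController annie = (SerialAnalyserController)
                Factory.createResource("gate.creole.SerialAnalyserController");
        for (String prName : gate.creole.ANNIEConstants.PR_NAMES) {
            annie.add((ProcessingResource) Factory.createResource(prName));
        }

        // a one-document corpus standing in for the crawler's indexed pages
        Corpus corpus = Factory.newCorpus("e-store corpus");
        Document doc = Factory.newDocument(new URL("http://example.org/product-page"));
        corpus.add(doc);

        annie.setCorpus(corpus);
        annie.execute();

        // named entities (Person, Location, Organization, ...) found by ANNIE
        AnnotationSet entities = doc.getAnnotations();
        System.out.println(entities.get("Organization"));
    }
}
```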

5 Experiments and results

5.1 Research design evaluation and validity

To ensure the effectiveness of our research design, ontology based information systems offer some well known means of evaluation, such as description logic queries, user level testing and manual evaluation by domain experts. DL-Query is a powerful and user friendly tool used for searching a classified ontology. We tested and analysed the performance of our modelled ontology using Protégé 4.0. Manchester OWL syntax is used for writing description logic


Fig. 7 Precise information extraction module of prototype

queries. A reasoner must be executed before applying any query; a reasoner is basically a utility which can infer information on the basis of the described rules. The Protégé package normally ships with 'FaCT++' and 'Pellet' as reasoner utilities by default, and we used the 'Pellet' reasoner in our experiments. Pellet is one of the best tools providing cutting edge reasoning for sound and complete information extraction. Pellet is available in the Protégé package by default, or it can be added as a third party utility. Figure 8 shows Pellet working on the E-Store ontology. We designed some queries (with input from users and a domain expert) to judge the overall performance of the system and found the results to be in line with the requirements. Some of these queries are listed here.

Example DL Query 1
Syntax of query: Travel_Agency and provideTravelScheduling value RyanAirLine
In this query the domain user wants information from a travel agency, whose schedule is selected according to the domain user's schedule. Information selection depends on whether or not the travel agency has any travel packages for trains, cruises and airlines. Figure 9 shows the result of the query.
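The queries in this section were run in Protégé's DL Query tab. As a sketch of how the same class expression could be evaluated programmatically, the following uses the OWL API together with Pellet's OWL API bindings (assumed to be on the classpath); the ontology file path and namespace are illustrative, while the class, property and individual names are those of DL Query 1.

```java
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;
import com.clarkparsia.pellet.owlapiv3.PelletReasonerFactory;
import java.io.File;

public class DlQuerySketch {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology ontology = manager.loadOntologyFromOntologyDocument(
                new File("e-store.owl"));               // illustrative path
        OWLDataFactory df = manager.getOWLDataFactory();
        String ns = "http://example.org/e-store.owl#";  // assumed namespace

        // Travel_Agency and provideTravelScheduling value RyanAirLine
        OWLClass travelAgency = df.getOWLClass(IRI.create(ns + "Travel_Agency"));
        OWLObjectProperty provides =
                df.getOWLObjectProperty(IRI.create(ns + "provideTravelScheduling"));
        OWLNamedIndividual ryanAir = df.getOWLNamedIndividual(IRI.create(ns + "RyanAirLine"));
        OWLClassExpression query = df.getOWLObjectIntersectionOf(
                travelAgency, df.getOWLObjectHasValue(provides, ryanAir));

        // classify with Pellet and fetch the individuals satisfying the expression
        OWLReasoner reasoner = PelletReasonerFactory.getInstance().createReasoner(ontology);
        for (OWLNamedIndividual ind : reasoner.getInstances(query, false).getFlattened()) {
            System.out.println(ind.getIRI());
        }
        reasoner.dispose();
    }
}
```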

Example DL Query 2
Syntax of query: Travel_Agency and bookHotelReservation value HotelReservation1
This query displays travel agencies that can also make hotel reservations. The results of this query can be viewed in Fig. 10.


Fig. 8 Pellet reasoner utility in working mode

Fig. 9 DL-Query result of travel agency with schedule according to domain user

Fig. 10 DL-Query result of travel agency with hotel reservation at its instance

Example DL Query 3
Syntax of query: Domain_User and accessJewelry value BraceletJewelry1
This query describes the user who, for the purposes of our test, is the valid domain user and likes jewelry, specifically bracelets (Fig. 11).


Fig. 11 DL-Query result of domain user, who likes bracelet as jewelry

5.2 System precision measurements

A series of experiments were designed to test and evaluate the efficiency and performance of the ontology, and the performance and information extraction capabilities of the prototype tool. Normally, every information extraction system has three main performance measures: extraction rate, precision and recall. These parameters are considered while evaluating the performance of the system. Mathematically, precision and recall can be expressed as:

RC = ce / (ce + te) × 100%    (1)
PR = ce / (ce + fe) × 100%    (2)

where RC is the recall, PR is the precision, 'ce' is the correctly extracted information, and 'te' and 'fe' represent the true and false elements returned by the system, respectively. Our experimental environment was an Intel(R) Core(TM)2 Quad CPU with 2.0 GB RAM running Windows XP; we used Tomcat as the web server and a CDMA based external modem for sending and receiving user SMS. First, we designed the lightweight ontology, loaded it into our ontology crawler and fetched the results. Subsequently, we added the axioms to our ontology, refined it to obtain a deeper expressivity level, fed it into the crawler and searched the internet again. We obtained different precision and recall statistics, shown in Tables 1 and 2, which depict the statistics of the ontology based information extraction crawler. With the conventional ontology the precision ratios are 69.9, 77.3 and 77.9%, and the corresponding recall ratios are 63.7, 58.1 and 52.3%. It is pretty obvious that the precision rate increases from 69.9 to 94.5%, from 77.3 to 87.8% and from 77.9 to 87.2% when using the heavyweight ontology.
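As a minimal sketch, the precision and recall percentages reported in Tables 1 and 2 can be recomputed from the raw counts using formulas (1) and (2) above; the row used in main() is the Class (Concept) row of Table 2.

```java
public class PrecisionRecall {

    static double recall(int ce, int te)    { return 100.0 * ce / (ce + te); } // Eq. (1)
    static double precision(int ce, int fe) { return 100.0 * ce / (ce + fe); } // Eq. (2)

    public static void main(String[] args) {
        // Class (Concept) row of Table 2: ce = 413, te = 389, fe = 24
        int ce = 413, te = 389, fe = 24;
        System.out.printf("precision = %.1f%%, recall = %.1f%%%n",
                precision(ce, fe), recall(ce, te));   // prints 94.5%, 51.5%
    }
}
```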

Table 1 Precision and recall of the system using the conventional E-Store ontology

Ontology elements | Resource elements (ce) | True elements (te) | False elements (fe) | Precision (PR) (%) | Recall (RC) (%)
Class (Concept)   | 413                    | 235                | 178                 | 69.9               | 63.7
Property          | 687                    | 496                | 191                 | 77.3               | 58.1
Individual        | 109                    | 78                 | 31                  | 77.9               | 52.3


Table 2 Precision and recall of the system using the E-Store heavyweight ontology

Ontology elements | Resource elements (ce) | True elements (te) | False elements (fe) | Precision (PR) (%) | Recall (RC) (%)
Class (Concept)   | 413                    | 389                | 24                  | 94.5               | 51.5
Property          | 687                    | 591                | 96                  | 87.8               | 53.8
Individual        | 109                    | 93                 | 16                  | 87.2               | 54.0

Fig. 12 Graphical comparison between lightweight and heavyweight ontology

So according to our experiment, more precise information can be achieved by using the heavyweight ontology. In Fig. 12, the bar graph shows a comparison between the heavyweight and lightweight ontologies on the basis of true and false elements. On the vertical axis, the scale from 0 to 700 refers to the number of elements retrieved by the ontology crawler's spider; Class, Property and Individual on the horizontal axis refer to the classification being measured.

6 Conclusions

In this research paper we propose a highly precise information extraction architecture for visually impaired computer users based on heavyweight ontology, for several compelling reasons. The HOIEV architecture is novel and simple, and it is based on existing technologies. The automatic execution of this novel architecture makes it faster and more adaptable than existing solutions. The interactivity of the prototype enables the user to retrieve and filter the desired information while roaming. Another advantage of this system is that anyone can use it without understanding the underlying structure. The classical wrapper writing exercise is time consuming, as wrappers need continuous upgrading; once the domain ontology is developed, it does not need any continuous updating. We presented the results and performance measures of our tool and ontology. The results were markedly better than those of our previous experiments. The enrichment with axioms produces a higher quality of expressivity and makes the information more understandable for computers; as a result, the precision of the extracted information increases. Currently, we are developing new capabilities for automatic axiom


rule generation, and improving the portability of this system. More research is required, but we are confident that this will make the system more efficient overall.

References

Brown MK, Glinski SC, Schmult BC (2007) Web page analysis for voice browsing. Interaction design: beyond human-computer interaction. Sharp, Rogers and Preece
Carnegie Mellon University (2011) Pocket CMU Sphinx-2. Accessed 11 Mar 2011. http://cmusphinx.sourceforge.net
Chan A, Mehtab A, Siddique S (2010) The anatomy of ontology based IE from E-Store for remote visually impaired user (the philosophical OBIEBU architecture). In: IEEE international conference on intelligence and information technology (ICIIT 2010), vol 2, pp 75–80
Chang C, Kayed M, Girgis MR, Khaled F (2006) A survey of web information extraction systems. IEEE Trans Knowl Data Eng, pp 1411–1428
Crescenzi V, Mecca G (2004) Automatic information extraction from large websites. J ACM (JACM) 51:731–779
Cunningham H, Gaizauskas R, Wilks Y (1995) A general architecture for text engineering (GATE): a new approach to language R&D. Research memo CS-95-21, Department of Computer Science, University of Sheffield, UK
Daines H, Kumar M, Chan A (2006) PocketSphinx: a free, real-time continuous speech recognition system for hand-held devices. In: Proceedings of ICASSP 2006, pp 41–46
Etzioni O, Banko M, Soderland S, Daniel S (2008) Open information extraction from the web. Commun ACM 51
Gatial E, Balogh Z, Ciglan M, Hluchy L (2005) Focused web crawling mechanism based on page relevance. In: Proceedings of ITAT 2005 information technologies applications and theory, pp 41–46
Gerber AJ, Barnard A, van der Merwe AJ (2007) Towards a semantic web layered architecture. In: Proceedings of the 25th conference on IASTED international multi-conference software engineering, pp 353–362
John HG, Mark AM, Ray WF, William EG (2003) The evolution of Protégé: an environment for knowledge-based systems development. Int J Hum Comput Stud 58:89–123
Jones DM, Bench-Capon TJM, Visser PRS (1998) Methodologies for ontology development. In: Proceedings of XV IFIP world computer congress, IT and knows, pp 62–75
Kushmerick N (1999) Wrapper induction: efficiency and expressiveness. J Artif Intell, pp 15–68
Lovrencic S, Tomac IJ (2006) Managing understatements in legislation acts when developing legal ontologies. In: Proceedings of international conference on intelligent engineering systems, pp 69–73
Meng H, Li YC, Fung TY, Low KF (2004) Bilingual Chinese/English voice browsing based on a VoiceXML platform. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing, vol 3, pp 769–772
Michail S, Christos K (2007) Adaptive browsing shortcuts: personalizing the user interface of a specialised voice web browser for visually impaired people. In: IEEE 23rd international conference on data engineering workshop, pp 17–20
Nasrolahi S, Nikdast M, Boroujerdi MM (2009) The semantic web: a new approach for future world wide web. World Acad Sci Eng Technol
Natalya FN, Deborah LM (2001) Ontology development 101: a guide to creating your first ontology. http://protege.stanford.edu/publications/ontology_development/ontology101-noy-mcguinness.html
Shadbolt N, Hall W, Berners-Lee T (2006) The semantic web revisited. J Intell Syst 21:96–101
Stanford University (2001) A quick guide to the Protégé axiom language and toolset. Accessed 6 Mar 2001. http://protege.stanford.edu/plugins/paltabs/pal-quickguide/
Wei Q (2010) Development and application of knowledge engineering based on ontology. In: Proceedings of third international conference on knowledge discovery and data mining, pp 518–521
World Health Organization (2011) Visual impairment and blindness. Accessed 11 Mar 2011. http://www.who.int/mediacentre/factsheets/fs282/en/
Zajicek M, Powell C, Reeves C (1998) A web navigation tool for the visually impaired. In: Proceedings of the third international ACM conference on assistive technologies, California, United States, April 15–17, 1998
