A New Approach for Visual Interpretation ofDeep Web

0 downloads 0 Views 374KB Size Report
When the number of fields is big query becomes very complex and novice users find difficulty to formulate a correct query. We have seen that line segment.
The Fourth International Conference on Intelligent Systems and Applications. INTELLI 2015. October 11 - 16, 2015 - St. Julians, Malta

G-Form: A New Approach for Visual Interpretation of Deep Web Form as Galaxy of Concepts Radhouane Boughammoura1, Lobna Hlaoua2 and Mohamed Nazih Omri3 Research Unit MARS, Faculty of Sciences of Monastir University of Monastir [email protected], [email protected], [email protected] Abstract — Despite the richness of deep Web forms, their rendering methodology is very poor in terms of capacity of expression. Hence, user has no indication about the richness of query and query capability when regarding this interface. This paper presents a new approach for visual interpretation of deep Web forms. This approach considers these Deep Web Forms as a galaxy of concepts, which are easy to interpret by user and reflects the exact meaning of query. This approach was evaluated on standard dataset and compared with other methods known in state of the art. It has proved good performances with respect to standard measures such as precision, recall and F1-measure. Keywords- Query Interpretation; Galaxy of Concepts; Pertinence of a Concept

1. INTRODUCTION Deep web is hidden behind Web forms which give access to deep Web databases. It represents the part of Web which is not reachable via hyperlinks [2, 4, 7, 12, 14, 15]. Information on databases is really a big treasury and more than 90% of information in Web comes from deep Web. In addition, this information is very rich in terms of quality of services offered to internet users [12, 13, 14]. We aim by our job to reveal deep Web to novice users of internet via new, simple and easy-touse web forms. Web form is information retrieval interface which give access to deep Web data. It is a graphical representation of query using a set of fields. Users form their query by interpreting visually the meaning of query interface. The design method of web form [1, 3, 5, 6, 10, 11] is very important since it is the only source of inspiration for novice users in order to understand the meaning of query. Several studies in the literature have addressed this problem [22, 23, 24, 25, 26, 27 and 28]. A bad interpretation leads to incorrect query and hence restricts access to deep web services. We focus in this paper on design aspect of Web forms in order to offer to novice user easy- to-use forms. If we look carefully at Web form of figure 1.(a), we remark presence of White lines crossing horizontally Web form. These lines are very important; they indicate presence of semantic

entities (or semantic concepts). Novice user may don’t pay attention to these lines. Here explanation of relevance of white lines. Let suppose user is not interested by non-stop flights but searching for a flight having one stop city. Suppose he is searching for all flights having as destination ‘’Tunis’’ offered by airline company with one stop city. User may give by mistake (ignoring white lines) destination city and number of passengers and leave departure city empty. This request will be wrong if and only if user know all stop cities for destination ‘’Tunis’’. This is not evident and will be a burden for novice user as he must formulate as much query as there is stop cities. In this paper we present two design methodologies which simplify the design of universal deep Web form. While our methodology preserves the query capability of universal Web form, it removes the complexity of universal query with easyto-use forms containing necessary and pertinent fields. The resulting universal form becomes very simple and above all semantically very rich. 2. Z-FORM DESIGN METHODOLOGY Z-Form [9] is a flat query where all fields are listed at the same level of granularity. The noun Z-Form comes from the fact that user reads Z-Form like reading lines in a paragraph, he begins by the first line, then the second, etc. This reading strategy resembles to the letter Z (see figure 2.b). Let’s explain with detail. In Z-Form we noticed the Z geometric pattern. Drawing straight line segments between any two consecutive fields in the semantic order, then all these segments form one connected curve. The curve consists of horizontal, vertical and diagonal line segments. A horizontal (vertical) line segment corresponds to a set of fields laid out row-wise (column) on the visual rendering of query interface. Figure 1.(b) shows the curve for a fragment of the interface in figure 1.(a). The curve also has a number of inflection points. An inflection point is a point in the curve where the curve changes from being concave upwards (positive curvature) to concave downwards (negative curvature), or vice versa.

The Fourth International Conference on Intelligent Systems and Applications. INTELLI 2015. October 11 - 16, 2015 - St. Julians, Malta

Table 1. Analogy Deep Web form vs Galaxy

A C

B

Deep Web Form

Cosmic Universe

field

star

Group of fields

galaxy

Super-group of fields

Super-galaxy

degree of pertinence of field

Distance separating stars

Mean average pertinence of group of fields

Center of mass of galaxy

D (b)

Deep Web Form

Hierarchical Schema

root

(a)

Figure 1. Z geometric pattern in a Z-Form Z-Forms are the most used information retrieval query interface in Web. Despite their reputation, Z-Forms have drawbacks. The first drawback is that all fields are rendered in the same interface. When the number of fields is big query becomes very complex and novice users find difficulty to formulate a correct query. We have seen that line segment covers an entire line and forms hence one semantic group of fields. However in many cases, more than one semantic group may appears in one line. The Z-geometric pattern does not detect all semantic groups in the line but considers the entire line as one semantic group. Z-Form is well designed for flat query. 3. AFFINITIES BETWEEN DEEP WEB FORM AND GALAXY G-Form is deep Web form which is inspired from Galaxy. First, we will present affinities between galaxy and deep Web form, and then we explain the way galaxy form is made. Novice users regard deep web form like they regard cosmic universe. Stars are fields and galaxies are groups of fields. When novice users regard deep Web form, they move by regard from one semantic entity to another entity as astronaut move from one planet to another or from one galaxy to another in cosmic universe. If distance is a measure of this cosmic travel, pertinence is the measure of relevance between fields. Section A and B details the similarity between deep Web forms and galaxy.

2 Leaving From

2

Going 3 4 To

3

1

Departing On 9

5

Returning On 12

Flight Class15

Adults Children Infants depDay depMonth retDay retMonth

6

7

8

10

11

13

14

Figure 2. Deep Web form and its hierarchical schema A. Web forms Deep Web forms organize fields respecting a hierarchical schema (see figure 2). This schema gives query its meaning. In our previous work [18] we have presented an algorithm, called VIQI (Visual Interpretation of Deep Web Query Interfaces), which is able to extract hierarchical schema from Z-Form. Figure 2 gives the resulting output of our algorithm when applied to query interface (left). The hierarchical schemas detect presence of 4 groups: Departure={Leaving From, Going To}, Number of Passengers={Adults, Children, Infants}, Departure Date ={Day of Departure, Month of Departure}, and Returning Date={Returning Day, Returning Month}, and 10 fields: Leaving From, Goining To, Adults, Children, Infants, Day of Departure, Month of Departure, Returning Day, Returning Month, and Flight Class) and one super group :root. As we have mentioned before, the hierarchical aspect of deep Web form indicates presence of semantic relations between entities which are “is-a” and “part-of”.

The Fourth International Conference on Intelligent Systems and Applications. INTELLI 2015. October 11 - 16, 2015 - St. Julians, Malta

Another important aspect in deep Web forms is degree of pertinence of fields. In G-Form, fields are organized according to their pertinence. Fields with relevant pertinence are rendered before fields with less pertinence: relevant fields are left most on the top of the Web form, while fields with less pertinence are placed right most on the bottom of the Web form. Hence field “Adults” is more pertinent than field “Infants” (pertinence decreases from left to right) as there is also “Children” field which may replace “Infants”. And field “Going From” is more pertinent than “Flight Class” (pertinence decreases from top to down) as by default users choose economic class while they must mention where are they going in order to formulate a correct query. Degree of pertinence in G-Form corresponds exactly to depth first search (DFS) traversal of schema elements. In figure 3, we have indicated under each schema element its degree of pertinence. We remark that I < J if element I is rendered before element J in Web form. Degree of pertinence of I is 1/I, while degree of pertinence of J is 1/J. Order is established from left to right and from top to down. With this way we can measure degree of pertinence of field relative to a group of fields. Let suppose D mean average degree of pertinence of a group of fields and r is distance between degree of pertinence of field and degree of pertinence of group of fields: if D/r   then field is pertinent in the group  if D / r   then field is not pertinent

B. Galaxy Planets in universe are structured according to hierarchical schema like G-Forms. Planets belong to galaxies and there is also super-galaxies grouping a set of galaxy. When we regard sky by night we observe thousands of brilliant stars (see figure 3). In reality each brilliant point are not only stars but may be other galaxies. As distance between observer and galaxy is very large light coming from the galaxy appears like a single point. Let’s take example. It is correct to consider Andromeda galaxy (see figure 3) as one point wherever in space, situated in center of mass of the galaxy, and have as mass the total mass of the entire galaxy.

Viewing the Andromedagalaxy fromEarth

Earth

D

r

x

D

Andromeda

r = distance of center of mass x= location of center of mass Figure 3. Regard Andomeda galaxy from Earth In mathematics, quotient D/r is given by the next equation: D/r 

Size of square contining Andromeda Distance of Center of mass relative to Earth

(1)

If quotient D/r is very small, we can replace the sum of all stars of galaxy Andromeda by only one term situated in center of mass.

if D / r   then star1 is observed from star2  if D / r   then star1 is observed as single point Figure 4 shows earth planet (on the left) and Andromeda galaxy (on the right). Square on the bottom shows a zoom on Andromeda galaxy. First, it is clear that according to an observer in Andromeda galaxy, our galaxy Milky Way may be approximated by a mass point situated at center of mass. In galaxy Andromeda (or Milky Way) itself, this geometric picture repeats like indicated in figure 4. While the quotient D1/r1 is very small, stars situated at the smallest box can be replaced by their center of mass. D

Earth

r

x

D

Andromeda

Vulcan

r1

D1

x

D1

Milky way

Andromeda

Figure 4. Replacing clusters by their centers of mass recursively

The Fourth International Conference on Intelligent Systems and Applications. INTELLI 2015. October 11 - 16, 2015 - St. Julians, Malta

4. PROPOSED APPROACH: G-FORM We propose a new approach for visual interpretation of deep Web forms. The interpretation of the form is based on hierarchical schema (see section 3.A) inspired from galaxy. We choose to use a hierarchical representation of query instead of flat representation as this representation is richer according to semantics of query. First Web form is rendered as Z-Form of stars where all fields are stars. When user clicks on a star, we consider that user moves to the galaxy containing the field. Our algorithm determines the immediate pertinent fields in the galaxy. A field is considered as pertinent if it is not far from the center of mass (clicked field) of the galaxy: its degree of pertinence is under distancefixed by user, from the clicked field. In table 2 we show which pertinent fields are rendered when user clicks on a star in figure 6.(a). For example, let’s take  fixed to 1: -

If user clicks field with degree of pertinence equal 3 then only fields “Leaving From” (with pertinence degree 3) and “Going To” (with pertinence degree 4) are rendered as they are situated under distance inferior or equal to 1 from field “Leaving From” (Clicked field) having degree of pertinence 3 (see figure 5.b).

-

If user clicks field with pertinence 7 then only fields “Adults”, “Children”, and “Infants” are rendered as they are situated under distance 1 from field “Children” (Clicked field) having degree of pertinence 7 (see figure 5.c).

In table 2, lines correspond to clicked star and columns correspond to the rendered field. According to table 2, we remark that rendered fields depend on the clicked field because when user clicks a star we consider that the observer moves to the galaxy of this star. Hence only stars at quotient d/r >  (see section 4.B) are observed from this galaxy and all the others are observed as brilliant stars.. 1) 2) 3) 4) 5) 6) 7) 8) 9) 10) 11) 12) 13) 14) 15) 16) 17) 18)

Procedure Render_G-Form( {f1, f2, …, fn}, pertinence set, ) Begin For I from 1 to n fi  Not pertinent End For if clickOn( fj) fj  pertinent else For I from 1 to n D  distanceOfPertinence( fi, fj) if (D < then fi  pertinent else fi  Not pertinent End if End For End if End. Algorithm 1: Render G-Form

Table 2. Observed fields relative to clicked one. Pertinent fields user clicks

3

4

3

*

*

4

*

*

6

7

10

11

13

14

6

*

*

7

*

*

*

*

*

10

*

*

11

*

*

13

*

*

14

*

*

*

*

*

8

15

8

15

The Fourth International Conference on Intelligent Systems and Applications. INTELLI 2015. October 11 - 16, 2015 - St. Julians, Malta

Leaving From

Leaving From

Going To

Going To

3

Tunis

4

6

Lisbon

7

10

11

13

14

Adults 1

8

Children 0

Infants 0

15 Send

(a)

Send

(b)

Send

(c)

G-Form

Figure 5. Rendering strategy of G-Form 5. EXPERIMENTAL EVALUATION AND RESULTS Our approach renders query according to its semantic representation (schema of query). We have tested performance of G-Form on standard dataset ICQ [19]. ICQ is a collection of query interfaces collected from deep Web services. For each query interface, its manually extracted query is available on dataset. Interfaces are collected into five classes of subjects: Airfare, Automobile, Books, Real estate, and Jobs. Our evaluation methodology is as follows. We start from the schema of Web form available on dataset. We build G-Form of deep Web Form. Then we build Table 2 which simulates user clicks on different stars in the Web form. Then we count number of correct entities (group, super-group of fields). An entity is considered as correct if it is semantically coherent. For example entity {Adults, Children, Infants} is a correct entity as it describes number of passengers. While entity {Time, Adults, Children} is not correct because Date of flight and Number of passengers overlap. We count for each G-Form number of extracted entities, number of extracted entities which are correct, and then we measure precision, recall and F1-measure of the algorithm. Recall 

Precision  F1 

NEE TNEW

NEPE NEE

2 * Recall * Precision Recall  Precision

(2)

Experimental results are summarized in table 3: Table 3. Experimental results Domain

Airfare

Auto

Books

#extracted

214

102

108

#extracted &pertinent

146

78

70

#total_ entities

200

105

110

Precision

0,68

0,76

0,64

Recall

0,73

0,74

0,63

Figure 8, 9, and 10 show that our approach perform their better results on domain of interest ‘’Auto’’. ‘’Auto’’ domain of interest contain flat queries, i.e fields are organized in the same level. Choosing a good coefficient make the visual representation of query very easy. Table 4. Comparison between (G-Form) and Z-Form approaches

(3) (4)

Where NEPE is the number of extracted pertinent entities, NEE is the number of extracted entities and TNEW is the total number of entities in the web form.

G-Form

Z-Form

Domain

Precision

Recall

F1

Airfare

0,68

0,73

0,70

Auto

0,76

0,74

0,75

Book

0,64

0,63

0,64

Airfare

0,66

0,70

0,67

Auto

0,72

0,71

0,71

Book

0,62

0,60

0,60

The Fourth International Conference on Intelligent Systems and Applications. INTELLI 2015. October 11 - 16, 2015 - St. Julians, Malta

Our approach have 73% of recall for ‘’Airfare’’ subject of interest. Queries in this domain are hierarchical with many levels. Choosing a small coefficient renders each group of fields separately. Hence, interpretation of query is one concept at one and the same time.

algorithm. However Z-Form precision is superior to our approach on “Auto” domain because the structure of the queries are flat and can be rendered in most cases as one single group of fields while our approach is well adapted for hierarchical queries.

The coefficient  indicatesthe complexity of the query. For small values of only small group of fields are rendered, the other groups are rendered as stars. This is case of ‘’Airfare’’ domain of interest. For large coefficient big group of fields are rendered and interpretation is close to Z-Form.

0,9 0,8 0,7 0,6 0,5 0,4 0,3 0,2 0,1 0

0,8 0,6 0,4

F1 of our approach

0,2

F1 of Z-Form

0 Precision of GForm Precision of ZForm

Airfare

Auto

Book

Figure 6. Comparison of precision of our approach and ZForm 0,8 0,7 0,6 0,5 0,4 0,3 0,2 0,1 0

Recall of our approach Recall of Z-Form

Airfare

Auto

Book

Figure 8. Comparison of F1 of our approach and Z-Form 6. CONCLUSION AND FUTURE WORK In this paper we have proposed an essay to resolve the complexity of query in deep Web forms. We propose two design approaches H-Form and G-Form. They offer to novice users an easy- to-use deep Web forms and reveals clearly the exact meaning of query even if its schema is complex. We have relaxed the complexity of query with an approach inspired from cosmic galaxies. There is a strong analogy between galaxy and deep web form with respect to structure and granularity of entities in each concept (galaxy). G-Form is better than Z-Form which is well known state of art algorithm. Z-Form lacks design expressivity because there is no background concerning the semantic value of query, while our approach is based on hierarchical schema which reveals clearly the semantic of query. Our approach is clearly outside performance of Z-Form with respect precision, recall, and F- measures of performance. REFERENCES

Airfare

Auto

Book

Figure 7. Comparison of recall of our approach and Z-Form

Diagrams of figures 6, 7 and 8 summarize results shown in table 4. We notice that, with regard to all measures, the two curves corresponding to the two approaches follow the same pace. This phenomenon can be explained by the fact that process of rendering of fields is based on query schema which is common to the two approaches. However figures 8 and 10 show that performance of our approach is always superior to performance of Z-Form. Our approach attends its maximum of precision for ‘’Airfare’’ domain as queries in this domain are hierarchical and well adapted to rendering strategy of our

G. Agarwal, G. kabra, K. C-C Chang, “Towards rich query interpretation: walking back and forth for mining query templates”, In proceedings of the international conference on world wide web, 2010. [2] M. N. Omri, C. Tijus, S. Poitrenaud, and B. Bouchon-Meunier, “Fuzzy sets and semantic nets for on line assistance,” In Artificial Intelligence for Applications, IEEE. Proceedings., 11th Conference, 1995, pp. 374-379. [3] M. N. Omri, and N. Chouigui. "Measure of similarity between fuzzy concepts for identification of fuzzy user's requests in fuzzy semantic networks". International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 9(6), pp. 743-748. 2001. [4] Z. Zhang, B. He, and K. C.-C. Chang , ‘’On-the-fly Constraint Mapping across Web Query Interfaces’’. In Proceedings of VLDB-IIWeb'04, 2004. [5] J. Jansen and Dick C.A. Bulterman, ‘’Enabling adaptive timebased web applications with SMIL state’’, In Proceedings of DocEng '08, 2008. [1]

The Fourth International Conference on Intelligent Systems and Applications. INTELLI 2015. October 11 - 16, 2015 - St. Julians, Malta

[6] [7] [8] [9] [10] [11] [12] [13] [14]

[15]

[16] [17] [18]

[19]

[20] [21] [22]

[23]

[24]

[25]

M. Jayapandian, H. V. Jagadish. 2008, ‘’Expressive query specification through form customization’’, In Proceedings of EDBT '08, 2008. J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, A.Y. Halevy, ‘’Google’s Deep Web crawl’’, In Proceedings of VLDB, 2008. E-C. Dragut, T. Kabisch, C. Yu, U. Leser, ‘’A hierarchical approach to model web query interfaces for web source integration’’, In Proceeding of. VLDB 2009, 2009. W. Wensheng, A-H Doan, C. Yu, W. Meng , ‘’Modeling and Extracting Deep-Web Query Interfaces’’, In Proceedings of AIIS 2009, 2009. Z. Zhang, B. He, and K. Chang, ‘’Understanding Web query interfaces: Best-effort parsing with hidden syntax’’, In Proceedings of SIGMOD’04, 2004. Heidinger, C., K. B Ihm, et al. "Efficient and secure exactmatch queries in outsourced databases." World Wide Web: 139, 2013. L.T.H.Vo, J. Cao, W. Rahayu, ‘Structured content-based query answers for improving information quality’’, In World Wide Web: 1-24, 2014. J. Losada, J. Raposo, A. Pan, P. Montoto. "Efficient execution of web navigation sequences." World Wide Web: 1-27, 2013. R. Boughammoura, M.N. Omri, and H. Youssef, ‘’Fuzzy Approach for Pertinent Information Extraction from Web Resources’’, Journal of Computing and e-Systems, Vol. 1 No. 1, 2008. R. Boughammoura, M.N. Omri, ‘’Statistical Approach for Information Extraction from Web Pages’’, In proceedings of International Symposium on Distance Education (EAD’2009), 2009. R. Boughammoura, M.N. Omri, ‘’SeMQI: A New Model for Semantic Interpretation of Query Interfaces’’, In Proceedings of NGNS’11, 2011. R. Boughammoura, M.N. Omri, Hlaoua, L. ‘’ VIQI: A New Approach for Visual Interpretation of Deep Web Query Interfaces’’, ICITeS 2012. R. Boughammoura, M.N. Omri, Hlaoua, L. ‘’Information Retrieval from Deep Web based on Visuel Query Interpretation’’, International Journal of Information Retrieval Research, 2(4), 45-59, 2013. R. Boughammoura, M.N. Omri, "G-Form: A Collaborative Design Approach to Regard Deep Web Form as Galaxy of Concepts ". The 12th International Conference on Cooperative Design, Visualization and Engineering. (CDVE2015), 2015. The UIUC Web Integration Repository, Computer Science Department, University of Illinois at Urbana-Champaign, http://metaquerier.cs.uiuc.edu/repository, 2003. Ko-Chiu Wu, Affective surfing in the visualized interface of a digital library for children, Information Processing & Management, Volume 51, Issue 4, Pages 373-390, July 2015. F. Fkih and M. N. Omri. “Complex terminology extraction model from unstructured web text based linguistic and statistical knowledge,” International Journal of Information Retrieval Research (IJIRR), 2(3), pp. 1-18. 2012. M. N. Omri, “Fuzzy knowledge representation, learning and optimization with bayesian analysis in fuzzy semantic networks,” In 6th International Conference on Neural Information Processing, 1999. Proceedings. (ICONIP'99). pp. 412-417. 1999. F., Naouar, L., Hlaoua, and M.N. Omri, Possibilistic Model for Relevance Feedback in Collaborative Information Re-trieval. In International Journal of Web Application IJWA 4(2): 78-86. 2012. K. Garrouch, M. N. Omri, and B. ElAyeb, “Pertinent Information retrieval based on Possibilistic Bayesian network: origin and possibilistic perspective. An International Forum for

Exploring e-Business and e-Government Research, Applications, and Technologies (TIGERA), 2008. [26] M.N. Omri, Pertinent Knowledge Extraction from a Semantic Network: Application of Fuzzy Sets Theory. In Interna-tional Journal on Artificial Intelligence Tools (IJAIT). Vol. 13, No. 3, p. 705-719. 2004. [27] M.N. Omri, Effects of Terms Recognition Mistakes on Requests Processing for Interactive Information Retrieval. In International Journal of Information Retrieval Research (IJIRR) 2 (3), 19-35. 2012. [28] M.N Omri, I. Urdapilleta, J. Barthelemy, B. Bouchon-Meunier, C.A. Tijus. Semantic scales and fuzzy processing for sensorial evaluation studies. Information Processing And Management of Uncertainty In Knowledge-Based Systems (IPMU). pp. 715719. 1996.

Suggest Documents