The Predicate Tree – A Metaphor for Visually Describing Complex Boolean Queries
Luca Paolino, Monica Sebillo, Genoveffa Tortora, and Giuliana Vitiello
Dipartimento di Matematica e Informatica, Università di Salerno, via Ponte don Melillo, 84084, Fisciano (SA), Italy
Tel.: +39 089 963324
{lpaolino, msebillo, tortora, gvitiello}@unisa.it
Abstract. In this paper, we describe a visual language, based on the so-called Predicate Tree Metaphor, which allows users to visually build complex sentences for querying commonly used search engines. By using this visual language, no parentheses have to be applied, and no precedence rules have to be known. Promising results about the usability of the proposed interface are reported, on the basis of an experimental between-group study performed on a Yahoo-based prototype of the proposed graphical environment.
Keywords: Visual metaphors, sum of product expression, search engines, usability evaluation.
G. Qiu et al. (Eds.): VISUAL 2007, LNCS 4781, pp. 524–536, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction
Looking for information, whether to find an article, discover a route to a restaurant, book a trip by airplane, or for any other purpose, is something that many of us do in our daily lives. To find the information we need, we have to formulate the query appropriately. For example, in searching for a paper, the parameters we might involve in the query are title, author, keywords, references to other papers and the file type. The search then proceeds by checking the results and adding more filters for refinement until we discover the target information. However, in many cases the process of discovering data is so difficult and so long that users do not succeed. Often, this is because users are not able to compose queries according to the language which underlies the search engine. In other words, they are not able to properly combine logical conditions while also taking precedence rules into account. Additionally, textual languages are ill-suited to inexpert users because natural language and Boolean logic differ in the meaning associated with AND/OR combined conditions. For example, in English a user may pose the query: “Find all the documents that have pdf format and those that have doc as format”. If this query were translated directly from English into a query language, the constraint clause would take the form: format = "pdf" and format = "doc"
The results of this query would always be an empty set because a document cannot be in pdf and doc format at the same time: the constraints of the query should have been ORed together. To overcome these limits and allow users to compose complex queries in a simple and intuitive way, visual languages and interfaces seem to be interesting solutions.
In this paper, we propose a visual language, based on the so-called Predicate Tree Metaphor, which allows users to visually build complex sentences for querying commonly used search engines. By using this visual language, no parentheses have to be applied, and no precedence rules have to be known. A prototype of the proposed graphical environment has been implemented on top of the Yahoo search engine, in order to overcome present limits of Yahoo's access method. As a matter of fact, we show that the adopted Predicate Tree Metaphor allows users to exploit the engine more fully.
The paper is organized as follows. In Section 2, the Predicate Tree language is presented together with its environment. Here, the main concepts concerning the construction of the visual representations are given. The algorithm for translating a tree instance into a Yahoo query is also described. In Section 3, we report on an experimental between-group study, performed on a Yahoo-based prototype of the proposed graphical environment, involving two groups of ten subjects each. Section 4 briefly discusses related work. Finally, Section 5 concludes the paper with some final remarks and a discussion of future work.
2 The Predicate Tree Language
In the search engine field, sentences composed of predicates connected by the AND, OR and NOT operators are very common. However, in most cases these operators are underutilized and often used incorrectly, because their use requires a deep knowledge of Boolean logic and a specific knowledge of the search engine in use. As an example, Yahoo™ and Google™ interpret the query P1 AND P2 OR P3 AND P4 differently. Yahoo™ selects documents which satisfy (P1 AND P2) OR (P3 AND P4), while Google™ looks for documents satisfying (P1 AND P2 AND P4) OR (P1 AND P3 AND P4). In practice, the interpretation depends on the precedence rules the search engine applies. Moreover, not all search engines support the complete Boolean algebra, which limits their expressive power. This is the case for Google, which only uses simple Boolean expressions with no parentheses. Figure 1 shows some well-known search engines and the corresponding sets of Boolean operators [8].
In this intricate world of search engines and Boolean operators, the inexpert user who needs to perform a query of medium or high difficulty may get lost and may need several rounds of trial and error before obtaining the required information. In this context, visual languages can improve the interaction between the user and the machine by alleviating problems coming from the inability to manage logical connectors, parentheses and other specific query language structures.
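The effect of the two precedence conventions on the query P1 AND P2 OR P3 AND P4 can be checked by brute force. The following is only a sketch in Python: the function names are hypothetical, and the two functions model the readings described in the text (AND before OR, and OR before AND), not the actual behaviour of any search engine.

```python
from itertools import product


def and_first_reading(p1, p2, p3, p4):
    # Sum-of-products reading: (P1 AND P2) OR (P3 AND P4)
    return (p1 and p2) or (p3 and p4)


def or_first_reading(p1, p2, p3, p4):
    # OR binds tighter: P1 AND (P2 OR P3) AND P4,
    # i.e. (P1 AND P2 AND P4) OR (P1 AND P3 AND P4)
    return p1 and (p2 or p3) and p4


# Collect every truth assignment on which the two readings disagree.
disagreements = [
    values
    for values in product([False, True], repeat=4)
    if and_first_reading(*values) != or_first_reading(*values)
]

for values in disagreements:
    print(values, and_first_reading(*values), or_first_reading(*values))
```

Each printed assignment corresponds to a kind of document that one reading would return and the other would not, which is why the same textual query may yield different result sets on different engines.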
Fig. 1. The principal Boolean operators working on some search engines
In this section, we propose a novel interface, based on the Predicate Tree (PT) visual language, able to hide the complexity of Yahoo™ query composition by means of a simple and intuitive structure. It allows users to define complex queries without knowing how Yahoo's logical operators work, thanks to a tree structure where:
1. the root represents a True predicate,
2. nodes represent simple conditions which can be visually created,
3. edges connect predicates which should be valid at the same time in the resulting documents (logical AND),
4. nodes on different paths are logically OR-connected conditions.
The Predicate Tree is used to select all the documents which satisfy at least one complete path of the tree instance. For example, the tree depicted in Figure 2 selects the documents which satisfy at least one among (P1 AND P2), (P1 AND P3), P4.
Fig. 2. An example of query according to the Predicate Tree language
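The selection rule above can be illustrated with a small Python sketch. The `Node` class, `complete_paths` and `matches` are hypothetical names introduced here for illustration only, and the tree is built under the assumption that Figure 2 has P2 and P3 as children of P1 and P4 as a child of the root, as the listed paths suggest.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def complete_paths(root: Node) -> List[List[str]]:
    """Return every root-to-leaf path as a list of labels, root excluded."""
    paths: List[List[str]] = []

    def walk(node: Node, prefix: List[str]) -> None:
        if not node.children:
            paths.append(prefix)
            return
        for child in node.children:
            walk(child, prefix + [child.label])

    walk(root, [])
    return paths


def matches(root: Node, holds: Dict[str, bool]) -> bool:
    """A document is selected if all predicates on at least one path hold."""
    return any(all(holds[label] for label in path)
               for path in complete_paths(root))


# Assumed shape of the tree in Figure 2.
figure2 = Node("True", [
    Node("P1", [Node("P2"), Node("P3")]),
    Node("P4"),
])

print(complete_paths(figure2))
print(matches(figure2, {"P1": True, "P2": False, "P3": True, "P4": False}))
```

Edges inside a path act as AND and separate paths act as OR, so no parentheses or precedence rules are needed to read the tree.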
To illustrate the effectiveness of our interface, we show a typical example of a complex query involving the most important Yahoo fields.
Before starting the example, we need to specify which predicates we implement in the structure. For the sake of simplicity and brevity, we decided to manage just a subset of those possible. In any case, the model can be easily extended in order to cover the whole language. Table 1 shows a summary of the symbols presently managed.

Table 1. Advanced query symbols of the Yahoo Search Engine

Symbol           Meaning
+                If a common word is essential to get the results you want, you can include it by putting a "+" sign in front of it.
-                You can exclude a word from your search by putting a minus sign ("-") immediately in front of the term you want to exclude from the search results.
OR               Yahoo search supports the Boolean "OR" operator. To retrieve pages that include either word A or word B, you can use an uppercase OR between terms.
AND              Yahoo search supports the Boolean "AND" operator. To retrieve pages that include both word A and word B, you can use an uppercase AND between terms.
Date             The "date:" query prefix will filter the results to include only documents that were inserted in the search engine database within the specified Julian date range.
File Extension   The "file type:" query prefix will filter the results to only include documents with the extension specified.
Let us suppose we want to find documents describing either JavaScript syntax or HTML syntax. The first kind of document must be exclusively in MSWORD format, while the second one must be either in PDF format if it was updated during 2006 or in any format if it was updated in 2007. According to the rules specified above, we need to build a Predicate Tree where each complete path selects a part of the total result we want. First of all, the sentence requires that we select two kinds of documents, either those containing the JavaScript word or those containing the HTML word: “We want to find documents describing either JavaScript syntax or HTML syntax”. This corresponds to the Predicate Tree shown in Figure 3.
Fig. 3. The first query composition step
The second part of the sentence asks to restrict the JavaScript hits to those in DOC format, namely: “The first kind of document must be exclusively in MSWORD format …”. The resulting tree is shown in Figure 4.
Fig. 4. The second query composition step
Finally, the rest of the sentence, “… while the second one must be either in PDF format if it was updated during 2006 or in any format if it was updated in 2007”, may be translated as shown in Figure 5.
Fig. 5. The third query composition step
Figure 6 shows the Tree Metaphor as implemented in a web page by means of HTML and JavaScript. The root node represents the search engine to which the request will be submitted. In the present paper we chose Yahoo™, but other engines may be selected through the menu located in the upper-left corner. The nodes located below the root indicate either the filters applied to the current request or filters which are not yet specified and are not currently parsed for the final request. The difference is highlighted by assigning a different color to each set, namely green for the former and blue for the latter. Each time the user makes a choice through a node's menu, the node is automatically colored green and, at the same time, two empty blue nodes are created in order to suggest to users the positions where new filters can be set. At start-up, just the root and a blue node connected to the root appear on the interface. When the user applies a new filter to this node, two new nodes are automatically created, one connected to the root and the other to the filtered node, which becomes the parent node. That is to say, nodes are added where new filters may be applied: each time a new filter is set on a blue node, that node becomes green and two new empty nodes are created as its sibling and child, respectively.
In order to set nodes with the required filters, each node is provided with a two-level menu. In the first level, the user may select the filter category s/he needs. Currently, the Yahoo search engine provides selection on the 'last update date', the 'extension', the 'web domain', the 'language' and specific strings. Once the category is selected, the set of sub-choices appears in a second-level menu. As an example, if we select the Extension menu, the list of accepted file formats appears on the screen in the bottom-right corner, as shown in Figure 6. In case the user needs to specify a textual pattern, the interface provides a specialized text input field. To reflect the user's choice, the final selection appears on the node as either an icon or a string.
A question mark is displayed next to every empty node, to highlight the current path selection. Each question mark displays a natural language description of the path from the leaf next to the question mark to the tree root. This functionality helps users understand the query and make specific improvements or corrections whenever required.
Fig. 6. The visual composition of the example query into the Predicate Tree
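The way the tree grows as filters are set can be sketched as follows. This is an illustrative Python model rather than the prototype's HTML/JavaScript code: the class `Node` and the functions `add_placeholder`, `new_tree`, `set_filter` and `filters_above` are hypothetical names, and a label of `None` is assumed to stand for an empty (blue) node.

```python
class Node:
    """A tree node; a label of None marks an empty (blue) placeholder."""

    def __init__(self, label=None, parent=None):
        self.label = label
        self.parent = parent
        self.children = []

    @property
    def colour(self):
        return "blue" if self.label is None else "green"


def add_placeholder(parent):
    """Attach a new empty node to the given parent."""
    node = Node(parent=parent)
    parent.children.append(node)
    return node


def new_tree(engine):
    """Initial state: the root plus one empty node connected to it."""
    root = Node(label=engine)
    add_placeholder(root)
    return root


def set_filter(node, label):
    """Fill an empty node, then create its new empty sibling and child."""
    if node.label is not None:
        raise ValueError("filter already set on this node")
    node.label = label
    add_placeholder(node.parent)  # new empty sibling
    add_placeholder(node)         # new empty child


def filters_above(node):
    """Filters on the path from the root down to this node's parent."""
    labels = []
    current = node.parent
    while current is not None and current.parent is not None:
        labels.append(current.label)
        current = current.parent
    return labels[::-1]


root = new_tree("Yahoo")
javascript = root.children[0]
set_filter(javascript, "JavaScript")
set_filter(javascript.children[0], "filetype:doc")
set_filter(root.children[1], "HTML")

print([child.label for child in root.children])
open_slot = javascript.children[0].children[0]
print(filters_above(open_slot))
```

In this model, `filters_above` plays the role of the question mark next to an empty node: it lists the filters already fixed on the path leading to that position.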
Once the tree is built, the second step is to translate the tree instance into the language understood by the search engine, in this case Yahoo. As described in [8], Yahoo queries are in Sum of Products (SOP) form, namely the precedence rules follow the sequence NOT, AND, OR. Parentheses may be used when this precedence needs to be overridden. The algorithm is shown in Figure 7.

1 String query
2 for each path in tree
3   for each node in path
4     if node is not root then
5       query concat node
6       if node is not leaf then query concat “AND”
7   if path is not the last then query concat “OR”

Fig. 7. The algorithm performing the translation from the Predicate Tree language to the Yahoo query language
In Row 1 a query string is defined in order to hold the textual representation of the query. Next, a for loop starting at Row 2 and ending at Row 7 returns a reference to each path in turn. For the previous example, it returns the paths (Root, JavaScript, filetype:doc), (Root, HTML, filetype:pdf, date:2006) and (Root, HTML, date:2007), in this order. Within this loop, a nested for loop starting at Row 3 and ending at Row 6 returns a reference to each node belonging to the considered path. Once the node reference is obtained, the system checks whether it is different from the root and, in that case, the algorithm appends its textual representation to the query string at Row 5. Moreover, if the node is not a leaf, the AND string is concatenated to query at Row 6. At the end of the first complete execution of the Row 3 loop, query will contain the (JavaScript AND filetype:doc) string. Finally, the last statement of the Row 2 loop checks whether the referenced path is the last one. If it is not, the OR string is concatenated and the Row 3 loop is executed for the next path; otherwise the algorithm stops. For the example, at the end of the execution the final query string will be (JavaScript AND filetype:doc) OR (HTML AND filetype:pdf AND date:2006) OR (HTML AND date:2007).
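The translation just described can be sketched in Python. This is a minimal illustration, not the prototype's code: `Node` and `to_query` are hypothetical names, the tree is assumed to contain only filled nodes, and the parentheses around each multi-term path are added here as an assumption so that the output matches the final query string given in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def to_query(root: Node) -> str:
    """Join the nodes of each path with AND and the paths with OR."""
    clauses: List[str] = []

    def walk(node: Node, terms: List[str]) -> None:
        if not node.children:
            if terms:  # skip the degenerate tree made of the root alone
                clause = " AND ".join(terms)
                # Parenthesise multi-term paths (assumption, see lead-in).
                clauses.append(f"({clause})" if len(terms) > 1 else clause)
            return
        for child in node.children:
            walk(child, terms + [child.label])

    walk(root, [])
    return " OR ".join(clauses)


# The example tree built in Figures 3 to 5.
tree = Node("Root", [
    Node("JavaScript", [Node("filetype:doc")]),
    Node("HTML", [
        Node("filetype:pdf", [Node("date:2006")]),
        Node("date:2007"),
    ]),
])

query = to_query(tree)
print(query)
```

Collecting the clauses in a list and joining them at the end avoids the explicit "is not leaf" and "is not the last" checks of Figure 7, while producing the same sequence of AND and OR connectors.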
3 Usability Evaluation
In the present section, we report on the evaluation of the visual search language based on the Predicate Tree Metaphor described in the previous section. The evaluation strategy we adopted was an experimental study comparing the Yahoo Query Language (YQL) and the Predicate Tree language.
The study was set up by carrying out the following preliminary activities:
− Task analysis, meant to identify the relevant tasks which the users involved in the experiment should perform.
− Definition of the parameters we should be able to measure and of how to measure them. We had to choose the properties which we considered significantly influential for the major usability factors, namely efficiency, efficacy and user's satisfaction.
The learnability of the considered language was soon identified as one of the most important criteria against which the two languages should be compared. This usability factor refers to the ease with which new users can learn how to use the search language effectively, achieving a high degree of performance [1]. Thus, the type of task we considered for our experiment was 'query composition'. Given this type of task, we decided that learnability should be measured by collecting a score on the subjects' performance, based on:
- the error rate of each subject during query composition,
- the kind of errors made by the subjects.
Another crucial criterion, user's satisfaction, was measured by combining a 'think-aloud' evaluation approach with questionnaires. The think-aloud technique consisted in encouraging subjects' comments during their query composition tasks and making appropriate annotations. Upon task completion, the subjects were invited to complete questionnaires meant to estimate users' satisfaction with respect to each of the considered languages.
The subjects were divided into two groups of 10 people each. Some of the subjects had programming experience, while others were non-programmers who were familiar with common web search tasks. They were equally distributed between the two groups, in order to improve the dependability of the results. A training session on each of the two languages preceded the experiment. During this session the subjects were instructed on how to compose a complex query with either language. Each group was assigned a set of 5 search tasks expressed in natural language. The two sets were pairwise comparable, involving similar attributes and logical expressions. By proposing similar search tasks to both groups, we were able to perform a between-groups evaluation. The subjects in one group were asked to express the searches in the form of YQL queries, while the others were asked to use the Predicate Tree language, so that a comparison between the languages was possible. Biases due to knowledge transfer within each group were reduced by changing the order in which each subject performed his/her tasks. The list of submitted search tasks with the corresponding language syntax is reported in the Appendix.
The subjects' query composition tasks were scored using the error classification initially proposed by Reisner in [7], namely (C) for completely correct, (D) for minor data error, (M) for minor language error, (S) for error of substance, (F) for error of form, and (N) for query not attempted.
The results are shown in Table 2. They show that:
• The group performing Predicate Tree queries produced essentially correct answers for 64% of the queries, more than 20 percentage points above the group performing YQL queries (42%). However, the percentages of completely correct queries are quite similar, namely 31 for YQL and 39 for PT. A significant growth has instead been recorded in the number of minor language errors, e.g. misspellings and punctuation, as well as in the number of queries with minor data errors, namely queries where data is not supplied completely as required.
• People were encouraged to attempt the composition of the query. As a matter of fact, all the subjects in the PT group tried to compose the queries, whereas 9 percent of the queries were not attempted by the YQL group.
• As we expected, the percentage of errors of form decreased notably, from 48 for the YQL group to 27 for the PT group.
In order to monitor the overall user satisfaction, subjects were asked to answer some questions after having performed the tasks. The questions mainly concerned three topics, namely general reactions to the language used, specific comments on the performed search tasks, and the difficulties encountered and support received during query composition.

Table 2. Percentage of query responses in each category

Response Category            Group YQL   Group Predicate Tree
C (Correct)                  31          39
D (Minor data error)         4           8
M (Minor language error)     7           17
Total essentially correct    42          64
S (Error of substance)       2           9
F (Error of form)            48          27
N (Not attempted)            9           0
Total incorrect              58          36
As for general reactions to the language, answers may be divided into four parts according to the external subdivision (YQL and PTL) and the internal subdivision (programmers and non-programmers) of the groups. Programmers of both groups found no difficulty in composing the queries, but those belonging to the PTL group observed that the tree metaphor gave them notable support in carrying out the tasks. As a matter of fact, although textual languages are more concise than visual languages, people generally prefer to compose queries in a visual way rather than in a textual way. According to the answers we received, programmer subjects particularly appreciated the fact that they did not have to concentrate on writing tags correctly or on using parentheses. On the other hand, non-programmers considered YQL very hard to use and remember, and reported that they felt uncomfortable in performing the most complex search tasks. By contrast, those belonging to the PTL group observed that the visual environment encouraged them to carry out the assigned complex tasks, thanks to adequate feedback and to the ability to recover from wrong actions.
4 Related Work
Information can be extracted from data sources either by browsing, i.e. going from a high-level view through progressive refinement until the desired data is found, or by querying, i.e. specifying some criteria that describe properties of the desired data. The first method is better when the user does not know exactly what s/he is looking for, or does not know the data source schema, so s/he navigates or explores the data until s/he sees some relevant information. In addition, browsing is the only means of extracting data when there is no query language or schema associated with the data, as is the case for the web. The second method addresses the needs of users who are looking for specific information, which means they have to use expressions to describe data properties.
In our work, we applied the querying methodology to unstructured data, visually describing expressions specified according to search engine languages. For this reason, the work most closely related to our proposal concerns visual search systems for databases, such as Filter/flow [3], Kaleidoquery [4] and FindFlow [9].
In Filter/flow, users adopt the pipe metaphor to describe Boolean logic. Each condition acts as a filter on the water flow. If two conditions should be satisfied at the same time (AND), they are placed as a sequence of taps, while if at least one of them should be satisfied (OR), the flow is divided into two minor flows, each of which may be interrupted by a tap, namely a condition. Downstream of the taps, the flows are joined again. Kaleidoquery specifies AND and OR by using a representation similar to that used for AND and OR gates in electronics. The inputs of the gate are represented as switches: if the constraint holds true, the instance passes through the switch. More recently, the FindFlow interface has been presented in [9].
FindFlow creates a tree representation for searching data where each node contains the partial result of the query specified by the filters on the arcs leading from the root to that node.
Our approach differs from the above in three respects, namely the goal it was conceived for, the presentation of results, and the support given to the user in defining new conditions. As a matter of fact, our work was conceived for web search activities. Users may compose queries involving the most important search options, such as exact phrases, the last update time, the document format, the document's language and so on. Moreover, the Predicate Tree language uses nodes to define filters and edges to represent logical operators. According to our definition, cyclic graphs are not allowed, in order to avoid misinterpretation by users. If a filter is common to two or more paths, it must be duplicated or placed along the paths so that no cycles are formed. Finally, in the Predicate Tree environment, when the user defines new conditions (filters) for search refinement, s/he is provided with a mechanism which suggests where new filters might be added. This feature makes the visual composition even easier to use, supporting the predictability of the interface.
Another tree-based approach to query composition was presented in [2]. Here, queries are represented by trees within an environment named Geographical Visual Query Composer (GVQC). In this environment, logical operators are represented by specific nodes labelled AND or OR. They are used to indicate how to combine the nodes representing conditions. However, the use of logical operators in the visual representation may cause problems for people who interpret the AND operator in the natural language way, that is to say as an OR.
A completely different approach to querying the Internet is browsing. Two popular examples are Quintura [6] and PageBull [5]. The Quintura search engine offers a visual search using a map of tags or hints contextually related to a search query. Every time a tag is selected, the map is updated to show the most closely related values. The selection is repeated until the user reaches his/her goal. Another commercial visual query tool is PageBull. It allows users to specify generic patterns in a text input form and then shows the most important hits by presenting their thumbnails.
5 Conclusion and Future Work
In this paper, we presented the Predicate Tree Metaphor, namely a visual language able to support users in the construction of complex Boolean expressions. In the future, we plan to extend the prototype in order to make the search parametric with respect to the most common web search engines. Further controlled experiments will then be carried out in order to verify the validity of the proposed approach, as well as to compare different search engines with respect to specific search properties.
References
1. Dix, A., Finlay, J., Abowd, G., Beale, R.: Human-Computer Interaction, 3rd edn. Prentice Hall, Englewood Cliffs (2004)
2. Guo, D.: A Geographic Visual Query Composer (GVQC) for Accessing Federal Databases. In: National Conference for Digital Government Research, Boston, MA, pp. 1–4 (2003)
3. Young, D.: A graphical filter/flow representation of Boolean queries: a prototype implementation and evaluation. Journal of the American Society for Information Science 44(6), 327–339 (1993)
4. Murray, N., Paton, N.W., Goble, C.A.: Kaleidoquery: A Visual Query Language for Object Databases. In: ACM Working Conference on Advanced Visual Interfaces, L’Aquila, Italy, pp. 247–257. ACM Press, New York
5. PageBull, available at http://www.pagebull.com
6. Quintura, available at http://www.quintura.com/
7. Reisner, P., Boyce, R.F., Chamberlain, D.D.: Human Factors Evaluation of Two Database Query Languages–Square and Sequel. In: AFIPS Proceedings, pp. 447–452. AFIPS Press, NJ (1975)
8. Search Engine, available at http://www.searchengineshowdown.com/features
9. Shizuki, B., Hansaki, T., Misue, K., Tanaka, J.: FindFlow: A visual interface for information search based on intermediate results. In: Asia-Pacific Symposium on Information Visualization, Tokyo, Japan, vol. 60 (2006)
Appendix
In this appendix we show the queries we submitted to the subjects during the experiment. We propose ten queries, five to be written in the Predicate Tree language and five in YQL, labelled Query 1, 2, etc. in order of increasing difficulty. For the sake of simplicity and brevity, we decided to manage just a subset of those possible. In any case, the model can be easily extended in order to cover the whole language.

Table 3. Query writing. Users are given a question stated in natural language and required to write a query in the Predicate Tree query language.

Query 1. Find all the documents describing “UML” in the ppt format.
  Root
    UML
      File Type: PPT

Query 2. Find all the documents in the TXT format and in DOC format that contain the word “PEACE”.
  Root
    Peace
      File Type: DOC
      File Type: TXT
Query 3. Find all the documents describing either “GIS” or “WEBGIS”. The former must be in PDF format and the latter must be in .edu sites.
  Root
    GIS
      File Type: PDF
    WebGIS
      Site:edu

Query 4. Find all the documents located in a .com site describing either “UML diagram” or “ER diagram”. The former must be in PDF format.
  Root
    Site:com
      UML Diagram
        File Type: PDF
      ER Diagram
Query 5. Find all the documents describing either PASCAL syntax or VISUAL BASIC syntax. The former must be in PDF format or located in a .edu site. The latter must be in DOC format or in a .com site.
  Root
    PASCAL
      File Type: PDF
      Site:edu
    VISUAL BASIC
      File Type: DOC
      Site:com
Table 4. Query writing. Users are given a question stated in natural language and required to write a query in the Yahoo query language.

Query 1. Find all the documents describing the Pascal language in the TXT format located on a .edu site.
  “Pascal” AND originurlextension:txt AND site:edu

Query 2. Find all the documents in the PDF format and in PPT format that contain the word “Environment”.
  “Environment” AND originurlextension:pdf OR originurlextension:ppt

Query 3. Find all the documents describing either “JAVA” or “JSP”. The former must be in PDF.
  JSP OR JAVA AND originurlextension:pdf

Query 4. Find all the documents located in .com site describing either “GIS” or “WEBGIS”.
  (gis OR webgis) AND site:com

Query 5. Find all the documents describing either tangent or cotangent function. The former must be in PDF format. The latter must be in XLS format and located in .edu site.
  tangent AND originurlextension:pdf OR cotangent AND originurlextension:xls AND site:edu