Some Research Results in Dynamic Taxonomy and Faceted Search Systems

Giovanni Maria Sacco
Dipartimento di Informatica, Università di Torino
Corso Svizzera, 185
10145 Torino, Italy
+39-011-6706727
[email protected]

ABSTRACT
Metadata access to complex information bases through multidimensional, dynamic taxonomies (aka faceted search) is rapidly becoming a hot topic both in research and in industry, where e-commerce applications based on this access paradigm are becoming pervasive. The dynamic taxonomy model was the first model to fully exploit multidimensional classification schemes and to integrate query and exploration into a single visual framework. We review a number of research results in data modeling, implementation, user interaction and emerging application areas of dynamic taxonomies and faceted search, originating from the Department of Informatics, University of Torino.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Storage and Retrieval – Search process, Selection process
H.5 [Information Systems]: Information Interfaces and Presentation

General Terms
Design, Human Factors.

Keywords
Dynamic taxonomies, faceted search.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Faceted Search Workshop at ACM SIGIR 2006, August 10, 2006, Seattle, WA. Copyright 2006 ACM 1-58113-000-0/00/0000…$5.00.

1. INTRODUCTION
Past research has focused on retrieval of information on the basis of precise specifications: examples include database queries and information retrieval. However, the vast majority of search tasks are exploratory and imprecise in essence: the user needs to explore the information base, find relationships among concepts and thin alternatives out in a guided way. Examples of this type of access include the selection of the “right” product to buy or of a candidate for a job, but also finding the likely cause of a malfunction, etc. Indeed, exploratory access applies to an extremely wide range of practical situations. Traditional access methods are not helpful in this context, so new access paradigms are required.

The dynamic taxonomy model [10, 11, 12] was the first model to address this problem and propose both a new visual paradigm for accessing complex information bases and an effective and complete model. Subsequent works by Hearst et al. [4, 6, 31] consider the HCI perspective only and report positive results from usability experiments: despite an inefficient and slow implementation, faceted search produced a faster overall interaction and a significantly better recall than access through text retrieval. Most importantly, users feel that they have actually considered all the alternatives in reaching a result. By comparison, access through traditional methods (information retrieval or database queries) is usually a “trial-and-error” interaction. Although few usability studies exist, the recent widespread and continuing adoption of systems based on dynamic taxonomies by well-established e-commerce portals, such as Yahoo, Lycos, Bizrate, etc., empirically supports this initial evidence.

Intelligent user-centric access to complex information must be considered holistically, both from the data modeling and from the user interaction side. By contrast, most researchers in semantic data models work from a system-centric rather than user-centric perspective: current research often concentrates on adding constructs, expressivity and functionalities rather than on making them available and easily understandable to users, which is one of the central points in faceted search. It is sometimes contended that some semantic models offer more sophisticated reasoning capabilities than dynamic taxonomies. It is quite obvious that complex and extremely rich semantic data models can be and, in fact, have been designed. However, over 20 years of research have shown that a) they are extremely costly to implement and maintain, and b) most importantly, users are unable to understand and use them. In fact, interaction with systems based on rich semantic models is usually mediated by specialized agents, which considerably add to the overall system complexity and make user interactions less “transparent”. In contrast with the diffusion of systems based on dynamic taxonomies and faceted search, semantically rich systems have few implementations, and these are mainly confined to research.

The goal of the present paper is to provide a very concise review of a number of research directions and results in data modeling, implementation, user interaction and emerging application areas of dynamic taxonomies and faceted search, originating over the last few years from the Department of Informatics, University of Torino, and to give comprehensive if not exhaustive bibliographic references to such research, including technical reports describing work in progress. For this reason and because of space limitations, references to other works are sparse.

2. DYNAMIC TAXONOMIES REVIEWED
Dynamic taxonomies are a general knowledge management model for complex, heterogeneous information bases. Although their main industrial application areas are currently e-commerce, e-auctions and e-catalogs, their application range is extremely wide and covers most “search” tasks: some application areas are reviewed in the following.

The model underlying dynamic taxonomies is minimal and yet quite powerful. The intension is a plain taxonomic organization, i.e. a concept hierarchy (taxonomies with multiple inheritance are supported but rarely required) going from the most general to the most specific concepts. It does not require any other relationships in addition to subsumptions (i.e. IS-A and PART-OF relationships). Unlike traditional taxonomies, however, dynamic taxonomies support and require a multidimensional classification: items in the extension are classified under several concepts. This models common real-life situations because items are very often about different concepts and items to be classified usually have different features (e.g. Time, Location, etc.) or “facets”, each of which can be described by an independent taxonomy.

Concepts are defined by instances rather than by properties. Thus, a concept C is just a label that identifies all the items classified under C. Because of subsumption constraints, the items classified under C (items(C)) are all those items in the deep extension of C, i.e. the set of items identified by C includes all the items directly classified under C (C’s shallow extension) union all the items directly classified under any of C’s descendants.

There are two important consequences of our approach. First, since concepts identify sets of items, logical operations on concepts can be performed by the corresponding set operations on their extension. This means that the user is able to restrict the information base by combining concepts through the normal logical operations (and, or, not).

Second, dynamic taxonomies can find all the concepts related to a given concept C, which represent the conceptual summary of C. Concept relationships other than subsumptions are inferred through the extension only, according to the following extensional inference rule: two concepts A and B are related iff there is at least one item D in the infobase which is classified at the same time under A (or under one of A’s descendants) and under B (or under one of B’s descendants). For example, we can infer an (unnamed) relationship between Michelangelo and Rome if an item that is classified under both Michelangelo and Rome exists in the infobase. At the same time, since Rome is a descendant of Italy, a relationship between Michelangelo and Italy can also be inferred. The extensional inference rule can be easily extended to cover the relationship between a given concept C and a concept expressed by an arbitrary subset S of the universe: C is related to S iff there is at least one item D in S which is also in items(C). Hence, the extensional inference rule can produce conceptual summaries not only for base concepts, but also for any logical combination of concepts. In addition, dynamic taxonomies can produce summaries for sets of items produced by other retrieval methods (database queries, text retrieval queries, etc.) and therefore access through dynamic taxonomies can be easily combined with other retrieval methods.
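The definitions above (shallow and deep extension, logical operations as set operations, and the extensional inference rule) can be illustrated with a minimal Python sketch. The toy infobase, the item names and the function names (items, are_related, conceptual_summary) are assumptions made for illustration; they are not taken from any system described in this paper.

```python
# Hypothetical toy infobase; concept and item names are illustrative only.
PARENT = {  # concept -> parent concept (None marks the root of a facet)
    "Location": None, "Italy": "Location", "Rome": "Italy", "Florence": "Italy",
    "Artist": None, "Michelangelo": "Artist", "Raphael": "Artist",
}
SHALLOW = {  # concept -> items classified directly under it (shallow extension)
    "Italy": {"heritage_law"},
    "Rome": {"pieta", "school_of_athens"},
    "Florence": {"david"},
    "Michelangelo": {"pieta", "david"},
    "Raphael": {"school_of_athens"},
}
UNIVERSE = set().union(*SHALLOW.values())


def items(concept):
    """Deep extension: the shallow extension of a concept plus that of every descendant."""
    result = set(SHALLOW.get(concept, ()))
    for child, parent in PARENT.items():
        if parent == concept:
            result |= items(child)
    return result


def are_related(a, b):
    """Extensional inference rule: A and B are related iff they share an item."""
    return bool(items(a) & items(b))


def conceptual_summary(subset):
    """Concepts related to an arbitrary set of items, with item counts."""
    summary = {}
    for concept in PARENT:
        count = len(items(concept) & subset)
        if count:
            summary[concept] = count
    return summary


# Logical operations on concepts are set operations on their extensions.
michelangelo_and_rome = items("Michelangelo") & items("Rome")
rome_or_florence = items("Rome") | items("Florence")
not_michelangelo = UNIVERSE - items("Michelangelo")
```

In this sketch, Michelangelo is related to Rome through a shared item and therefore also to Italy, while Raphael and Florence share no item and are unrelated. Because conceptual_summary accepts any set of items, the same function would also summarize a result set produced by an external retrieval method.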

Dynamic taxonomies can be used to browse and explore the infobase in several ways. The user is initially presented with a tree representation of the initial taxonomy for the entire infobase. Each concept label also carries a count of all the items classified under it (i.e. the cardinality of items(C) for each concept C). The initial user focus F is the universe (i.e. all the items in the infobase). In the simplest case, the user can then select a concept C in the taxonomy and zoom over it. The zoom operation changes the current state in two ways. First, concept C is used to refine the current focus F, which becomes F∩items(C). Items not in the focus are discarded. Second, the tree representation of the taxonomy is modified in order to summarize the new focus. All and only the concepts related to F are retained and the count for each retained concept C’ is updated to reflect the number of items in the focus F that are classified under C’. The reduced taxonomy is a conceptual summary of the set of documents identified by F, in exactly the same way as the original taxonomy was a conceptual summary of the universe.

In fact, the term dynamic taxonomy indicates a taxonomy that can dynamically adapt to the subset of the universe on which the user is focusing, whereas traditional, static taxonomies can only describe the entire universe. We believe this definition to be more accurate than the term faceted search, which focuses on faceted classification, which is neither a necessary nor a sufficient condition for this method.

The retrieval process can then be seen as an iterative thinning of the information base: the user selects a focus, which restricts (thins out) the information base by discarding all the items not in the current focus. Only the concepts used to classify the items in the focus (and their ancestors) are retained. These concepts, which summarize the current focus, are those and only those concepts that can be used for further refinements. From the human-computer interaction point of view, the user is effectively guided toward his goal by a clear and consistent listing of all possible alternatives.
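The zoom operation and the resulting reduced taxonomy can be sketched as follows. This is a minimal illustration, assuming that the deep extensions items(C) have already been computed; the data and the names ITEMS, reduced_taxonomy and zoom are hypothetical.

```python
# Hypothetical deep extensions items(C), precomputed for a toy infobase.
ITEMS = {
    "Location": {"pieta", "david", "school_of_athens", "heritage_law"},
    "Italy": {"pieta", "david", "school_of_athens", "heritage_law"},
    "Rome": {"pieta", "school_of_athens"},
    "Florence": {"david"},
    "Artist": {"pieta", "david", "school_of_athens"},
    "Michelangelo": {"pieta", "david"},
    "Raphael": {"school_of_athens"},
}
UNIVERSE = set().union(*ITEMS.values())


def reduced_taxonomy(focus):
    """Keep all and only the concepts related to the focus, with updated counts."""
    return {c: len(focus & ext) for c, ext in ITEMS.items() if focus & ext}


def zoom(focus, concept):
    """Refine the focus to F ∩ items(C) and summarize the new focus."""
    new_focus = focus & ITEMS[concept]
    return new_focus, reduced_taxonomy(new_focus)


focus = UNIVERSE                      # the initial focus is the whole infobase
initial = reduced_taxonomy(focus)     # counts shown next to each concept label
focus, after_first = zoom(focus, "Michelangelo")
focus, after_second = zoom(focus, "Rome")
```

After the first zoom, Raphael disappears from the taxonomy because no item in the focus is classified under it; after the second, Florence disappears as well. Every concept still listed has a non-zero count, so any further zoom is guaranteed to return at least one item.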

3. DATA MODELLING ISSUES
Although the data model underlying dynamic taxonomies is minimal, it stresses two important points:

1. the extensional inference rule clearly defines the mechanism used to adapt the taxonomy to varying user foci. From a cognitive point of view, unnamed relationships among concepts are inferred on the basis of empirical evidence; and

2. the model is based on abstract metadata only. Most importantly, the items to be managed are not assumed to be textual.

The first point is extremely important and applies to a number of issues. First, it provides a ground to perform an analysis of convergence for exploratory patterns. The faster interaction found in the usability studies by Yee et al. [31] is not just a psychological feeling: it can be proved that three zoom operations are sufficient to reduce a 10 million item database to 10 items, given a compact taxonomy with 1,000 terminal concepts where each item is classified under 10 concepts [28]. An analysis of convergence is useful to prove effectiveness, but also to give guidance as to the minimum number of facets the designer has to define in order to provide effective access to information. Incidentally, [28] also shows that static hypertext implementations, in which all the possible concept combinations are explicitly stored, are not scalable: not surprisingly, they require an exponential amount of storage.
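The figures quoted above can be checked with a back-of-the-envelope calculation. The sketch below assumes that items are classified uniformly and independently over the terminal concepts, so that zooming on one terminal concept retains on average a fraction 10/1,000 of the current focus. This simplifying assumption is ours and the derivation in [28] may differ; the function name is hypothetical.

```python
from fractions import Fraction


def expected_focus_size(n_items, n_terminal, concepts_per_item, zooms):
    """Expected focus size after a number of zooms on terminal concepts.

    Assumes uniform, independent classification: each zoom keeps a fraction
    concepts_per_item / n_terminal of the current focus.
    """
    selectivity = Fraction(concepts_per_item, n_terminal)
    return n_items * selectivity ** zooms


# 10 million items, 1,000 terminal concepts, 10 concepts per item.
sizes = [expected_focus_size(10_000_000, 1_000, 10, k) for k in range(4)]
# sizes: 10,000,000 -> 100,000 -> 1,000 -> 10
```

Under this assumption each zoom divides the focus by 100, so three zooms take 10 million items down to 10, in agreement with the result stated above.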

Second, it allows comparing the present approach with other proposals, such as

1. OLAP techniques [3]. There are obvious similarities between dynamic taxonomies based on faceted schemata and the hypercubes used in OLAP [12, 28]: this indicates that dynamic taxonomy techniques can be applied to OLAP environments;

2. decision trees [8]. Dynamic taxonomies have the same descriptive power as decision trees [19]. However, the user of a decision tree is presented with a fixed, predefined decision structure that he has to explore in the order defined by the designer. On the contrary, no predefined ordering is imposed by dynamic taxonomies, which are therefore more flexible than the static hypertext decision trees proposed by some researchers in medical diagnosis, such as Séroussi et al. [29]; and

3. formal concept analysis (FCA) [5]. There are a number of points of contact [25] with FCA, a mathematical knowledge representation model developed by Wille in the ’80s. FCA has an extensive body of mathematical results, derived in the last twenty years, that we believe can be applied to dynamic taxonomies with little effort. On the other hand, although FCA has recently evolved towards dynamic taxonomies by adding fundamental constructs such as taxonomies, it still falls short in several respects such as expressivity, manipulation, integration with other retrieval methods and practical large-scale systems.

Third, the data model provides a basis to derive strategies to map semantically rich schemata, such as the ones used in the Semantic Web, into dynamic taxonomies [23]. In this way, dynamic taxonomies can be used as an alternative or as a complement to complex semantic schemata. Browsing and exploration, which are the primary user tasks in most applications, are not supported per se by current semantic web research, which focuses on searching rather than browsing. From this point of view, dynamic taxonomies are an ideal complement to complex ontologies and allow users to browse and explore them in a guided way. In all the situations in which data is intended for user consumption rather than to be used by agents and programs, dynamic taxonomies provide a lower-cost, higher-return alternative to ontologies. The research here is different from [30], which identifies guidelines for unstructured data, and from most of the work on the design of monodimensional and faceted taxonomies.

The independence from the textual contents, stressed by the model, is fundamental because it makes the model immediately applicable to non-textual information bases (most importantly, multimedia databases) and allows an easy multilingual access, an extremely important feature in our global society. In addition, abstract identification of items plus the straightforward extension of the extensional inference rule allows integrating any type of external search method (e.g. text retrieval or shape retrieval) in the overall framework.

Whereas most current systems manage shallow taxonomies and only allow classifying an item under a terminal concept, dynamic taxonomies allow items to be classified at any level of abstraction. As an example, a specific law can be classified under Italy because it applies to the entire country, while another one could be classified under a specific region. The separation between the shallow and deep extension of a concept is used primarily to model situations like this one. This freedom in classification requires that the user be able to discriminate between the shallow and deep extension of concepts. In addition, when the taxonomy is used to express user interests in push systems [18], additional information is needed.

Especially in multimedia infobases, items are not atomic, as assumed in the standard dynamic taxonomy model, but may have a hierarchical structure. As an example, a news video can be considered as an entire video, split into stories, each of which is split into scenes, composed of frames. Relationships among concepts depend on the level of granularity at which the user is working: two persons are closely related if they appear in the same frame, much less so if they only appear in the same video. In [17], we propose an extension to deal with item structure and exploit it both in retrieval and in classification.

In the same context, although it has a much more general application scope, we give a preliminary discussion of fuzzy dynamic taxonomies [17], in which membership in a concept is not boolean, as in the standard model, but fuzzy. Fuzziness has an impact on how items in a concept are listed, and this is easy to represent in the standard way, i.e. by listing retrieved items by decreasing membership probabilities. However, fuzziness also impacts the extensional inference rule, in which relationships are no longer boolean but fuzzy. Although a fuzzy extensional inference rule can be derived, the problem of conveying inferred fuzzy relationships to users has to be further investigated.

4. IMPLEMENTATION ISSUES
The main problem in implementing effective systems based on dynamic taxonomies is that the computation of related concepts and the subsequent reduction of the corpus taxonomy have to be performed in real time. A slower execution would severely impair the sense of free exploration that the user of dynamic taxonomy systems experiences. Surprisingly, efficient architectures and implementations for dynamic taxonomies are not really discussed in the literature. Sacco [12] identifies this as a challenge for medium to large information bases and claims that proprietary techniques sped up execution by a factor of 30. Other works [31] generically report the use of a relational database management system without providing further details.

We have investigated an architectural framework which strives for maximum flexibility and performance and have derived efficient bitmapped implementations [11, 16]. Extensive experiments are reported in [16] on databases ranging from 2,000 to 800,000 items. An optimized evaluation strategy achieved a throughput of 327 reduced taxonomies per second (rtps) with 800,000 documents on a Dell Dimension 8250, Intel Pentium 4 2.8 GHz, 512 MB RAM, Microsoft Windows 2000 Professional, an average machine with current technology (web server and interprocess communication overheads were ignored). The optimized strategy was compared to a memory-resident relational implementation managed by MySQL. We found that the relational implementation required 5.5 times more memory and was 40 (2,000 items) to 90 (512,000 items) times slower. This indicates that the performance of relational implementations degrades as the database size grows, and that relational database systems are not adequate to manage large databases. The throughput for the optimized evaluation on 512,000 items is equivalent to the throughput of the relational implementation on 8,000 items. Although the experiments depend on the specific DBMS used, so that the actual throughput for other systems might be different, we are confident that bitmapped evaluation will significantly outperform relational implementations in any case. Parallel architectures are also investigated in [16].

In [11, 16], we also address two important extensions to dynamic taxonomies. The first one, virtual concepts, addresses the efficient representation of high-cardinality/continuous facets (such as prices, weights, etc.). The second one, time-varying concepts, efficiently represents data whose classification varies with time. Examples include age concepts, which track the age of an item, and time-to-completion concepts, which are critical in applications such as online auctions. Finally, in [11], efficient push strategies for dynamic taxonomies are described. These strategies were used in the DBWorld Xtended demo [23], discussed below.
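To give an idea of why bitmapped evaluation is attractive, the following minimal sketch computes a reduced taxonomy with one bitwise AND and one bit count per concept. It is not the implementation described in [11, 16], whose details are not given here: Python integers stand in for bitmaps, and the data and function names are assumptions made for illustration.

```python
def build_bitmaps(deep_extension, item_ids):
    """Map each concept to an int whose bit i is set iff item i is in items(C)."""
    position = {item: i for i, item in enumerate(item_ids)}
    bitmaps = {}
    for concept, members in deep_extension.items():
        bits = 0
        for item in members:
            bits |= 1 << position[item]
        bitmaps[concept] = bits
    return bitmaps


def reduced_taxonomy(bitmaps, focus_bits):
    """Count, for each concept, the focus items classified under it.

    One AND plus one population count per concept; concepts with a zero
    count are unrelated to the focus and are dropped.
    """
    counts = {}
    for concept, bits in bitmaps.items():
        n = bin(bits & focus_bits).count("1")
        if n:
            counts[concept] = n
    return counts


# Hypothetical deep extensions over four items.
item_ids = ["a", "b", "c", "d"]
deep_extension = {
    "Italy": {"a", "b", "c"}, "Rome": {"a", "b"}, "Florence": {"c"},
    "France": {"d"}, "Michelangelo": {"a", "c"}, "Raphael": {"b"},
}
bitmaps = build_bitmaps(deep_extension, item_ids)
summary = reduced_taxonomy(bitmaps, bitmaps["Michelangelo"])
```

The cost of each reduced taxonomy in this sketch is proportional to the number of concepts times the bitmap length in machine words, with no joins involved, which is one plausible reason for the gap with relational evaluation reported above.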

5. USER INTERACTION ISSUES
The way systems based on dynamic taxonomies work is clearly specified. On the other hand, faceted search systems (taken at their face value, i.e. as search systems based on a faceted classification [9]) can work according to very different approaches. As an example, many systems, especially those for tourist infobases, require users to preselect all the features they want: this can result in empty result sets that cannot occur in dynamic taxonomy systems, by construction. As another example, some systems do not allow concept composition, so that access through a facet is independent of all the other facets, and the resulting reducing power is significantly lower than for traditional, monodimensional taxonomies [28]. Variations are discussed in [13], where a number of user-interaction issues are also reviewed. Among these,

1. the way the taxonomy is represented, e.g. by an expandable/collapsible tree widget [7] or by a column-wise expanded display [4]. The first type of representation is more flexible and can easily accommodate a high number of facets; the second one is more immediate for the user and reduces potential information-hiding problems, but can lead to an insufficient number of facets/concepts;

2. the way concept composition is performed. Usually only the AND of different concepts is supported. Very few systems [7] support all the boolean operations on concepts. In particular, composition in OR is especially important because it allows the user to define custom groups (generalizations) of concepts. As an example, consider a Location taxonomy, organized by continents and countries. The only generalization (grouping) available here is by continents. What if the user is not interested in all of Europe, but just in Balkan countries? If unions are not supported, he will have to explore each Balkan country (Bulgaria, Serbia, etc.) independently which, in addition to being cumbersome, will make the discovery of features common to all Balkan countries very hard.
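The Balkan example can be made concrete with a small sketch in which a custom group is built as the union (OR) of two countries and then summarized as a whole. The data, the concept names in the second facet and the function name summarize are hypothetical and serve only to illustrate the point.

```python
# Hypothetical deep extensions for a Location facet and a Topic facet.
ITEMS = {
    "Europe": {"d1", "d2", "d3", "d4", "d5"},
    "Bulgaria": {"d1", "d2"},
    "Serbia": {"d3"},
    "Italy": {"d4", "d5"},
    "Energy": {"d1", "d3", "d4"},
    "Tourism": {"d2", "d5"},
    "Fishing": {"d5"},
}


def summarize(focus):
    """Concepts related to the focus, with the number of focus items under each."""
    return {c: len(focus & ext) for c, ext in ITEMS.items() if focus & ext}


# OR composition: a custom group that the taxonomy itself does not provide.
balkans = ITEMS["Bulgaria"] | ITEMS["Serbia"]
balkan_summary = summarize(balkans)

# AND and NOT are likewise plain set operations on the extensions.
balkan_energy = balkans & ITEMS["Energy"]
rest_of_europe = ITEMS["Europe"] - balkans

# Preselecting Serbia and Tourism blindly would give an empty result set;
# after zooming on Serbia, Tourism is simply not offered.
serbia_summary = summarize(ITEMS["Serbia"])
```

Summarizing the union shows at a glance which topics occur in the group as a whole, which would otherwise require exploring each country separately. The last lines also illustrate why empty result sets cannot occur by construction: only concepts with a non-zero count in the current focus remain selectable.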

6. EMERGING APPLICATION AREAS
A number of application areas have been investigated in detail over the last few years. In addition to textual infobases [12], these include:

1. E-commerce [15, 21, 27], where dynamic taxonomies can be used for the “thinning game”, in which a large number of items must be narrowed to a small set of candidates. The “end-game”, in which candidate items are compared in order to select a single item, is solved by color-coding. An online demo for digital cameras is available at [7];

2. Multimedia databases [14, 17, 24], where dynamic taxonomies can be used to integrate access by conceptual metadata and access by primitive multimedia features (color, texture, etc.) into a single, coherent framework. Rosso Tiziano, an online demo on an image collection of paintings of the Italian Renaissance, is available at [26];

3. News systems with push/pull technology [20]. An online demo based on the DBWorld announcement list is available at [23];

4. Human resources and job placement services, where dynamic taxonomies are used to select candidates on the basis of their “features” [1, 22];

5. Diagnostic systems [31]. In this area, knowledge-based systems have dramatically failed for a number of reasons [2]. The most important is the master-slave relationship that KB systems (and, by extension, semantic web systems) tend to impose, and that is obviously not quite accepted by skilled personnel such as physicians. We are actively working in this area because we feel that the user-centric approach of dynamic taxonomies should make them readily accepted;

6. Discovery of services [26], especially in e-government;

7. E-government portals [18, 22, 26]. It is still not widely recognized that the major problem in e-gov is really how to make information available to users, because, without complete information, no real democracy exists. In a number of papers, we have discussed the application of dynamic taxonomies to several aspects of e-gov information, from normative information (laws and regulations) to information and services designed to improve and assist citizens and companies (job brokering, services, etc.). We believe that our investigation shows that dynamic taxonomies can effectively manage all these types of information in a uniform way.

All the demos referenced above are based on Knowledge Processors’ Universal Knowledge Processor [7], the first web-based engine implementing dynamic taxonomies.

7. CONCLUSIONS
Originally proposed as a way to overcome the ineffectiveness of information retrieval, dynamic taxonomies have shown a much wider application range and indeed seem a universal access mechanism for all those situations in which the information base must be explored and/or different, rankable search criteria must be reconciled and used to find the “right” items. We believe that the vast majority of search tasks fall into this category.

8. REFERENCES
[1] Berio, G., Harzallah, M., Sacco, G. M., Portals for integrated competence management, in: Encyclopedia of Portal Technology and Applications, A. Tatnall ed., Idea Group Inc., Hershey, Pennsylvania, 2006

[2] Brézillon, P., Context in problem solving: A survey, The Knowledge Engineering Review, 14(1), 1999, pp. 1-34

[3] Chaudhuri, S., Dayal, U., An overview of data warehousing and OLAP technology, ACM SIGMOD Record, 26:1, 1997

[4] The Flamenco project, Univ. of Berkeley, http://bailando.sims.berkeley.edu/flamenco.html

[5] Ganter, B., Wille, R., Formal concept analysis: Mathematical foundations, Springer Verlag, Berlin, 1999

[6] Hearst, M., et al., Finding the Flow in Web Site Search, Comm. of the ACM, 45, 9, 2002

[7] Knowledge Processors, Universal Knowledge Processor, 1999, www.knowledgeprocessors.com

[8] Moret, B. M. E., Decision Trees and Diagrams, ACM Computing Surveys, 14:4, 1982

[9] Ranganathan, S. R., The Colon Classification, Rutgers Series on Systems for the Intellectual Organization of Information (ed. Susan Artandi), New Jersey: Rutgers University Press, Volume 4, 1965

[10] Sacco, G. M., Navigating the CD-ROM, Proc. Int. Conf. Business of CD-ROM, 1987

[11] Sacco, G. M., Dynamic taxonomy process for browsing and retrieving information in large heterogeneous data bases, 1998, US Patent 6,763,349, 2004

[12] Sacco, G. M., Dynamic Taxonomies: A Model for Large Information Bases, IEEE Transactions on Knowledge and Data Engineering 12, 2000, 468-479

[13] Sacco, G. M., Modeling, Interface and Interaction Issues in Dynamic Taxonomy and Faceted Classification Systems, Tech Rep Univ. di Torino, Dip. di Informatica, 2002

[14] Sacco, G. M., Systematic Browsing for Multimedia Infobases, Int. Conf. on Imaging Science, Systems and Technology, CISST '03, Las Vegas, pp. 77-83, 2003

[15] Sacco, G. M., The Intelligent E-Sales Clerk: the Basic Ideas, 9th IFIP TC 13 Int. Conf. on Human-Computer Interaction, INTERACT'03, Zürich, Switzerland, pp. 876-879, 2003

[16] Sacco, G. M., Efficient Implementation of Dynamic Taxonomies, Tech Rep Univ. di Torino, Dip. di Informatica, 2003

[17] Sacco, G. M., Uniform access to multimedia information bases through dynamic taxonomies, IEEE Sixth Int. Symp. on Multimedia Software Engineering (ISMSE'04), 2004, 320-328

[18] Sacco, G. M., No (e-)Democracy Without (e-)Knowledge, in Int. Conf. IFIP TCGOV 2005, Bolzano, Springer Lecture Notes in Computer Science 3416, 2005, 147-156

[19] Sacco, G. M., Guided interactive diagnostic systems, 18th IEEE International Symposium on Computer-Based Medical Systems (CBMS'05), 2005, 117-122

[20] Sacco, G. M., DBWorld Xtended: semantic dissemination of information through dynamic taxonomies, 5th Int. Conf. on Knowledge Management, I-KNOW05, Graz, June 2005 (in J. of Universal Computer Science, Springer), online demo: http://dbworldx.di.unito.it/

[21] Sacco, G. M., The intelligent e-store: easy interactive product selection and comparison, 7th IEEE Conference on E-Commerce Technology, IEEE CEC'05, 2005

[22] Sacco, G. M., Guided interactive information access for e-citizens, in: EGOV05 – Int. Conf. on E-Government, Springer Lecture Notes in Computer Science 3591, 2005, 261-268

[23] Sacco, G. M., Discount semantics: modeling complex data with dynamic taxonomies, Tech Rep Univ. di Torino, Dip. di Informatica, 2006

[24] Sacco, G. M., Rosso Tiziano: user-centered systematic exploration of large image information bases, Tech Rep Univ. di Torino, Dip. di Informatica, 2006, online demo: http://tiziano.di.unito.it

[25] Sacco, G. M., Intelligent access to information: dynamic taxonomies vs. formal concept analysis, Tech Rep Univ. di Torino, Dip. di Informatica, 2006

[26] Sacco, G. M., User-centric access to e-government information: e-citizen discovery of e-services, 2006 AAAI Spring Symposium Series, Stanford University, California, USA, March 27-29, 2006

[27] Sacco, G. M., Dynamic taxonomies and guided searches, J. of the American Society for Information Science and Technology, 57:6, pp. 792-797, April 2006

[28] Sacco, G. M., Analysis and Validation of Information Access through Mono, Multidimensional and Dynamic Taxonomies, FQAS 2006, 7th International Conference on Flexible Query Answering Systems, Milano, Springer Lecture Notes in Artificial Intelligence 4027, 2006

[29] Séroussi, B., et al., OncoDoc, a successful experiment of computer-supported guideline development and implementation in the treatment of breast cancer, Artif Intell Med, 22(1), 2001, pp. 43–64

[30] Tzitzikas, Y., Analyti, A., Spyratos, N., Compound Term Composition Algebra: The Semantics, Journal on Data Semantics II, Springer, 58-84, 2005

[31] Yee, K-P., et al., Faceted Metadata for Image Search and Browsing, Proc. CHI 2003, 2003
