OLAP-Enabled Web Search of Complex Objects Alfredo Cuzzocrea University of Trieste and ICAR-CNR, Italy
[email protected]
Guandong Xu University of Technology Australia
[email protected]
ABSTRACT Inspired by the actual trend of empowering traditional Web search methodologies by means of novel computational paradigms, in this paper we propose and experimentally assess WebClustCube, a novel system that allows OLAP-enabled Web search of complex objects, thus adding new value to the potentialities of current Web search paradigms. In particular, WebClustCube supports the building and the interactive manipulation of OLAP-enabled Web views over complex objects extracted from distributed databases. The data management, OLAP-like support of WebClustCube is provided by ClustCube, a state-of-the-art framework for coupling OLAP methodologies and clustering algorithms with the goal of analyzing and mining of complex database objects. A case study that clearly shows the potentialities of WebClustCube in the context of next-generation Web search environments is provided. We complement of analytical contribution by means of an experimental assessment and analysis of WebClustCube according to several metric perspectives.
Categories and Subject Descriptors H.2 [DATABASE MANAGEMENT]: H.2.7 Administration – Data Warehouse and Repository
Database
General Terms Algorithms, Design, Management, Performance, Theory
Keywords Web Search; OLAP-Enabled Web Search; Complex Object Searching; Advanced Mining Applications; Intelligent Systems
1. INTRODUCTION The problem of empowering traditional Web search methodologies by means of novel computational paradigms, such as multigranularity [10], correlation analysis [17] and federated approaches [9], is gaining momentum, mainly driven by emerging technologies like tag-based search (e.g., [15]) and social-networks-based search (e.g., [16]), and so forth. Inspired by this actual trend, we argue that well-understood multidimensional data analysis and mining
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
[email protected]. iiWAS '15, December 11-13, 2015, Brussels, Belgium © 2015 ACM. ISBN 978-1-4503-3491-4/15/12…$15.00 DOI: http://dx.doi.org/10.1145/2837185.2837243
Giorgio Mario Grasso University of Messina Italy
[email protected]
paradigms, in particular OLAP [12] analysis, may represent a real “add-in value” for Web search methodologies. In fact, it is easy to understand how many classical search problems are intrinsically multidimensional in nature, i.e. driven by multiple variables (i.e. dimensional attributes [12]) that need to be managed simultaneously. This is not actually supported by standard keyword-based search methods, as the latter mainly use iterative approaches that incrementally refine the actual search results. The target of advanced Web search methodologies are the so-called complex objects that are built on top of tuple-shaped data stored in distributed databases that aliment data-intensive sites over the Web. In order to better appreciate the potentialities of multidimensionalparadigm-based Web search methodologies over the classical monodimensional approaches supported by conventional keyword-based search methodologies, consider the running example depicted in Figure 1 and Figure 2. Here, Alice and Bob are executing Web search with topic car, being interested in analyzing car sales in a specific geographic area across time. The underlying data model is of multidimensional kind, and it is characterized by the following dimensional attributes: CarBrand, CarType, CarDisplacement, CarDoorNum, VendorState, VendorCity, VendorLocation, CustomerState, CustomerCity, CustomerLocation, SaleYear, SalePrice, LoanType, AssuranceType. The measure is Sale. In particular, Alice (see Figure 1 – here, dimensional attributes are shown by symbols, e.g. Att0 ≡ CarBrand, Att1 ≡ CarType, and so forth) is searching the Web according to the multidimensional search paradigm we introduce in our research, while Bob (see Figure 2 – here, dimensional attributes are shown by symbols, e.g. Att0 ≡ CarBrand, Att1 ≡ CarType, and so forth) is doing the same task via the classical mono-dimensional search paradigm. As shown in Figure 1, Alice is allowed to retrieve results from the target search space by aggregating complex objects according to the 14 dimensions simultaneously, whereas, as illustrated in Figure 2, Bob can access results by progressively refining the search space via iteratively applying keyword-based search task (one for each dimension, ideally), hence resulting in a much-less effective and efficient process. In addition to this, results retrieved by Bob can also be incomplete (e.g., ) due to the fact that each single space restriction step, which is not performed per multiple dimensions simultaneously, may incur in the risk of discarding relevant objects that do not match the membership to all the dimensions of the underlying multidimensional data model (e.g., [35]). Our research statement is supported by the WebClustCube, a novel system that allows OLAP-enabled Web search of complex objects, thus adding new value to the potentialities of current Web search paradigms. In particular, WebClustCube supports the building and the interactive manipulation of OLAP-enabled Web views over complex objects extracted from distributed databases. The data management, OLAP-like support of WebClustCube is provided by ClustCube [8], a state-of-the-art framework for coupling OLAP methodologies and
clustering algorithms with the goal of analyzing and mining of complex database objects.
where we basically introduced the general idea of supporting OLAPenabled Web search of complex objects. In this paper, we provide significantly novel contribution on the architecture and functionalities of WebClustCube and the experimental assessment and analysis of the system.
2. THE CLUSTCUBE APPROACH Figure 3 shows the “big picture” of the research we propose in [8], i.e. the ClustCube overview. Basically, ClustCube defines a multiplelayer reference architecture that encompasses the following wellseparated layers: (i) Distributed DataBase Layer (DDBL), where the target distributed database from which complex objects are extracted is located; (ii) Complex Object Definition Layer (CODL), which supports primitives and functionalities for building and managing complex objects extracted from the DDBL layer; (iii) the Object Layer (OL), where complex objects are located, along with a suitable object schema; (iv) the ClusterCube Definition and Management Layer (CCDML), which supports primitives and functionalities for defining and managing ClustCube cubes; (v) the ClusterCube Layer (CCL), which stores the final ClustCube cube. Figure 1. Multidimensional Web Search Paradigm.
In order to assess WebClustCube, we provide an experimental assessment and analysis of its performance on top of a corporate distributed database, according to several metric perspectives. Results clearly and undoubtedly confirm the benefits deriving from out proposed system in supporting OLAP-enabled Web search of complex objects.
Figure 3. ClustCube Functional Architecture.
2.1 Complex Object Definition and Management
From the DDBL layer of the ClustCube framework (see Figure 3), the CODL layer meaningfully extracts complex database objects used to populate the OL layer via a set of Complex Object Definition Queries (CODQ), denoted by 𝑄𝑆𝐶𝑂𝐷𝐿 = {𝑄0 , 𝑄1 , … , 𝑄|𝑄𝑆𝐶𝑂𝐷𝐿|−1 }, which are conceptually located at the CODL layer and defined by the administrator editing the whole analysis/mining process, on the basis of her/his analysis/mining tasks. CODQ queries in QSCODL are combined by the CODL layer into a singleton global CODQ query, denoted by QCODL, which involves in a singleton class/object-schema, denoted by OSCODL, to which the set of object instances, denoted by OICODL, adhere. Both OSCODL and OICODL are located at the OL layer. QCODL can be simply obtained from CODQ queries in QSCODL according to several straightforward alternatives, such as union, i.e. 𝑄𝐶𝑂𝐷𝐿 = |𝑄𝑆 |−1 |𝑄𝑆 |−1 𝑄𝑘 , but, without any ⋃𝑘=0𝐶𝑂𝐷𝐿 𝑄𝑘 , or join, i.e. 𝑄𝐶𝑂𝐷𝐿 =⋈𝑘=0𝐶𝑂𝐷𝐿 loss of generality, in the ClustCube framework the administrator is allowed to edit global CODQ queries QCODL of arbitrary nature, even based on complex user-defined SQL expressions. Figure 2. Mono-Dimensional Web Search Paradigm.
The paper is organized as follows. In Section 2, we provide an overview of ClustCube, by highlighting benefits deriving from such framework. Section 3 contains an overview of state-of-the-art proposal that constitute the knowledge background of our research. In Section 4, we provide principles, models and implementation of the WebClustCube system, along with a case study that clearly shows its potentialities in the context of next-generation Web search environments. Section 5 contains an experimental assessment and analysis of the system, on top of a corporate distributed database. Finally, in Section 6 we derive conclusions and outline future work of our research. This paper is an extended version of the short paper [31],
2.2 The Fundamental Multidimensional Data Model As shown in [8], given a ClustCube cube C characterized by the set of dimensions D = {d0, d1, …, dN-1}, such that N = |OSCODL|, being dimensions in D corresponding to features in the set of fields of the class OSCODL, denoted by F(OSCODL), each ClustCube cube cell C[i0][i1]…[iN-1] C[𝕀] in C, such that 0 ≤ i0 ≤ |d0|, 0 ≤ i1 ≤ |d1|, …, 0 ≤ iN-1 ≤ |dN-1|, denoting 𝕀 = 〈𝑖0 , 𝑖1 , … , 𝑖𝑁−1 〉 an N-dimensional entry in the N-dimensional space of C, C[𝕀] stores a set of clustered objects, denoted by OICODL(C[𝕀]), that are obtained by simultaneously clustering objects in OICODL with respect to the dimensions/features d0,
d1, …, dN-1 in D/F(OSCODL). Hence, it is trivial to observe that, for each ClustCube cube cell C[𝕀] in C multidimensional boundaries of 𝑢𝑝 C[𝕀] along the dimension di, denoted by 𝐵𝑑𝑙𝑜𝑤 and 𝐵𝑑𝑖 , respectively, 𝑖 𝑢𝑝 with 𝐵𝑑𝑙𝑜𝑤 < 𝐵𝑑𝑖 , are determined by the A-based clustering of 𝑖 objects in OICODL with respect to feature di in D. As deeply discussed in [8], in the ClustCube framework, A models any arbitrary core clustering algorithm to be used to obtain the clustered complex objects at the leaf level cuboid. This means that the algorithm itself is orthogonal to the framework. In particular, in the research experience [8], the following well-known clustering algorithms have been exploited: BIRCH [32], CLARANS [33] and DBSCAN [34]. Briefly, recall that these algorithms adhere to different-in-nature clustering approaches: BIRCH is a hierarchical clustering algorithm, CLARANS a partition-based clustering algorithm, and DBSCAN a density-based clustering algorithm. In other words, while in traditional OLAP data cubes [12] the multidimensional boundaries of data cube cells along dimensions are determined by the input OLAP aggregation scheme, in ClustCube cubes multidimensional boundaries of ClustCube cube cells along dimensions are the result of the clustering process itself.
One of the most relevant contribution of the research we propose in [8] consists in equipping the final ClustCube cube generated by the CCDML layer (see Figure 3) with the canonical cuboid lattice [12]. In traditional OLAP, given an N-dimensional data cube C having D = {d0, d1, …, dN-1} as dimension set, the cuboid lattice associated to C, denoted by L, is a hierarchical structure composed by 2N – 1 cuboids, denoted by Ci, i.e. data (sub-)cubes that aggregate original relational data according to arbitrary combinations of dimensions in D, each one at different cardinality. In other words, an n-dimensional cuboid Ci of a data cube C represents a particular n-dimensional view of C, such that 0 ≤ n ≤ N. For n = N, the base cuboid is defined, which corresponds to the original data cube C. Cuboids of a certain cuboid lattice L are naturally ordered by means of the precedence relation ≺, such that, for each pair of cuboids Ci and Cj in L, Ci ≺ Cj holds iff Di ⊂ Dj, such that Di denotes the set of dimensions of Ci and Dj denotes the set of dimensions of Cj, respectively. This finally determines a cuboid hierarchy, denoted by H(L).
2.3 How to Compute a ClustCube Data Cube? Similarly to conventional OLAP data cube computation axioms, computing the cuboid lattice for a ClustCube cube can be extremely resource-consuming [8]. In order to face-off this drawback, in [8] we propose a collection of algorithms that meaningfully exploit the structured nature of complex database objects within cuboids and the distributive nature of clustering across hierarchical domains, like those defined by conventional OLAP schemas. In the ClustCube framework [8], we fully take advantages from this nice amenity that becomes the key property that allows us to significantly reduce computational needs due to compute ClustCube cube cuboid lattices. In fact, thanks to this amenity, we do not need to compute cuboids from the scratch but any cuboid Cj at level l of L, such that 2 ≤ l ≤ N, can be obtained from the cuboids at level l = 1, denoted by {C0, C1, …, CN-1}, simply by simultaneously distributing objects in Ci with respect to the features {D0, D1, …, DN-1} of {C0, C1, …, CN-1}, respectively [8]. As a major result of the research proposed in [8], ClustCube framework comprises several kinds of techniques for efficiently building the ClustCube cube C plus its cuboid lattice L (each one
codified by a respective ClustCube cube building algorithm), which differ with respect to two orthogonal strategies, namely: (i) materialization strategy and (ii) building strategy. Materialization strategies specify which cuboids, among the 2N – 1 cuboids of L, must be materialized, i.e. computed and stored (in secondary memory). On the other hand, building strategies specify how cuboids are computed actually. With respect to the materialization strategy, the following two alternatives are introduced in the ClustCube framework: (i) full, denoted by FUL, according to which all cuboids of L are materialized; (ii) partial, denoted by PAR, according to which a sub-set of the 2N – 1 cuboids of L is materialized. In particular, for what regards partial materialization approaches, in [8] we introduce so-called complete traversal paths over cuboid lattices, denoted by T, as yet-another powerful mining capability of the ClustCube framework. Specifically, this interesting mining capability is inspired from [8], which focuses on the problem of computing iceberg cubes efficiently. Traversal path model is consistent for both traditional OLAP and actual ClustCube data cubes. Given an N-dimensional data cube C with cuboid lattice L having N + 1 levels, a complete traversal path T over L is defined in terms of a list of cuboids Ci in L of kind: T = C0, C1, …, C|T|-1, such that: (i) |T| ≤ 2N – 1; (ii) Ci C for each i in {0, 1, …, |T| – 1}; (iii) for each pair i, j in {0, 1, …, |T| – 1} such that i j, Ci Cj; (iv) for each pair i, j in {0, 1, …, |T| – 1} such that j = i + 1, Ci ≺ Cj in H(L); (v) C0 in T corresponds to the apex cuboid [12] in the OLAP schema of C (i.e., C0 ); (vi) C|T|-1 in T corresponds to the base cuboid (i.e., the cube) C (i.e., C|T|-1 C). ClustCube framework supports both regular, i.e. the ones defined before, and irregular traversal paths, i.e. traversal path such that there exists at least one cuboid Ci that precedes more than one cudoid in the cuboid hierarchy H(L), i.e. Ci : Cj, Ck in T, Ci Cj Ci Ck : Ci ≺ Cj in H(L) Ci ≺ Ck in H(L). As regards the building strategy, ClustCube framework exposes the following two different approaches: (i) baseline, denoted by BAS, according to which, for each cuboid Ci in L, clusters are re-computed from the scratch (i.e., directly from input objects in OICODL); (ii) drilldown, denoted by DRI, according to which cuboids at level l of L are computed from cuboids at level l = 1 of L by means of a meaningfully distributive method. Since materialization strategies and building strategies are orthogonal, four techniques for computing the target ClustCube cube are finally obtained in terms of combinations of both materialization and building strategies, namely: ⟨FUL,BAS⟩, ⟨FUL,DRI⟩, ⟨PAR,BAS⟩, ⟨PAR,DRI⟩ [8].
2.4 Some ClustCube Application Scenarios
Application scenarios where the ClustCube framework may turn to be extremely useful identify a wide family, ranging from complex recommendation systems (e.g., [5]) to complex mobile frameworks (e.g., [6]).
3. RELATED WORK The idea we propose in this paper keeps a high level of research innovation. Indeed, to the best of our knowledge, in literature there not exist other approaches that, like us, pursue the idea of integrating multidimensional and OLAP analysis paradigms into Web search engines. This gives a further light to the quality of our proposal. On the other hand, some research efforts mention the idea of combining search with OLAP paradigms, perhaps in different but
related scientific area such as social networks, social media, and so forth. We provide a brief discussion on such proposals. EventCube [19] is a composite framework that supports multidimensional search and analysis on text and structured data arising in large-scale organizations, by means of the definition of two state-of-the-art OLAP-based structures over non-conventional (i.e., relational) data sources, namely Text Cube [20] and Topic Cube [21]. Text Cube [20] focuses the attention on text data that are associated to multidimensional information, and aims at proposing a data cube model that integrates the power of traditional OLAP and Information Retrieval (IR) techniques targeted to text (e.g., [43]). This allows us to apply OLAP analysis over text data. In particular, authors propose two kinds of (OLAP) hierarchies: dimensional hierarchy, which is the classical OLAP kind, and term hierarchy, which is extracted from text directly, via classical term definition and processing. Effective and efficient implementations of Text Cube are provided as well. Topic Cube [21] applies the well-known probabilistic topic model, which is exploited to support latent topic analysis and mining on textual information (e.g., [44]), and proposes the definition of an innovative data cube model that combines OLAP with probabilistic topic modeling, thus enabling OLAP over multidimensional text databases. The proposed model extends the traditional data cube model as to deal with a given topic hierarchy, and it stores probabilistic content measures over text documents that have been learned via a proper probabilistic topic model. Cuboids are computed via a heuristic method that speeds-up the iterative EM algorithm [45]. [22] engrafts faceted search within classical Business Intelligence (BI) environments over Clouds. Here the idea is that of using OLAP paradigms to improve the semantics and the expressiveness of search operations over decision maps. [23] proposes a framework that makes use of well-known OLAP constructs, such as dimensions and levels, with the goal of supporting multidimensional analysis of Web logs, by also proposing some fortunate visual metaphors that loosely resemble OLAP-enabled interfaces supported by WebClustCube. Semantic OLAP is another area of research that is related to our work, as semantics-based techniques can be exploited for improving the quality of the OLAP-enabled Web search tasks. In this scientific area, [24] is a first study that focuses the attention on so-called quotient cubes, which summarize the semantics of OLAP data cubes. The same approach has been then exploited to achieve an efficient exploration of OLAP data cubes [25]. [26] proposes a semantic cube model which allows us to apply object-oriented technology to (Semantic) Data Warehouses and enable users to design the generalization relationship between different data cubes. [7,27] exploits compression techniques to support semantics-aware advanced OLAP visualization of multidimensional data cubes. [28] studies how to extract the semantics of OLAP data cubes via advanced knowledge discovery paradigms. Finally, the area of OLAP over Social Media is also related to our research, yet in a minor extent. In this context, [36] focuses the attention on the issue of retrieving resources from social media on the Web via mapping social searches that explore multiple social entities over OLAP queries retrieving views of aggregated social data. The final goal is that of achieving the marriage between OLAP and the Social Web. [37] and [38] investigate the relevant problem of computing OLAP cubes over Twitter data, by introducing innovative models and algorithms. In particular, [37] proposes a composite methodology where a complete OLAP analysis platform over Tweets
is developed. The platform founds on three main components, namely: the data warehousing layer suitable to store data produced by a social network of Twitter, the semantic layer devoted to semantically-enriching the underlying (aggregate) data set, and the analysis layer where OLAP over Twitter data aggregated and enriched in the previous layers is performed. [38] is instead targeted to the specific problem of computing OLAP cubes over Twitter data, particularly modeled in terms of so-called multidimensional Tweet streams, a novel data model for streaming Twitter data inspired by [35], and an innovative technique based on Fuzzy Formal Concept Analysis (FCCA) [11] is proposed to this end.
4. WEBCLUSTCUBE: PRINCIPLES, MODELS, IMPLEMENTATION In this Section, we focus the attention on WebClustCube, the Web search system which represents the main application scenario of ClustCube. As highlighted in Section 1, WebClustCube consists of the integration of OLAP methodologies and clustering algorithms for supporting emerging Web search environments [2], and it embeds ClustCube, which thus plays the role of enabling technology. In fact, it should be noted that OLAP methodologies are particularly suitable for representing and mining (clustered) complex objects implemented on top of relational data sources (see Section 2.1) in Web search environments, according to given aggregation schemes. These schemes have been firstly proposed in the context of relational database settings (e.g., [14]) and oriented to progressively aggregate relational data from low-detail tuples towards coarser aggregates. This approach can be meaningfully adapted to progressively aggregate objects from (object) groups which are, in turn, aggregated on the basis of low-level (object) fields, as described in Section 2.2, towards (object) groups aggregated on the basis of coarser ranges of low-level fields, in a hierarchical fashion, being “aggregations” set of complex objects computed via clustering. Now we focus on the potentialities of WebClustCube over Web search environments in greater detail. Consider the case of a typical two-dimensional Web view interface over Web-enable keyword-based search results retrieved by means of Web search engines like the one described in [39], called Visual Cube. Indeed, Visual Cube acts as an “augmented” Web search interface over the results of Web search engines. Results retrieved by Visual Cube can naturally be modeled in terms of complex objects extracted from the target sources (e.g., relational databases available on the Web) and delivered via popular Web browsers. Two-dimensional Web views of Visual Cube support several functionalities, such as manipulating complex objects, zooming complex objects, removing/adding a column of the twodimensional Web view, and so forth, which are made all available to the user (as we describe next, these operations are corresponding to typical OLAP tasks). Thus, the user not only can extract desired Web results in terms of collections of complex objects, but she/he can further process and refine retrieved results, in an interactive way. Both characteristics fully adhere to the well-known OLAP paradigm (e.g., [4]), and have been recently gain new attention from the research community (e.g., [13]). In particular, focus the attention on the interaction functionalities. Web users not only want to access results but also they want to interact with the retrieved results (e.g., focusing on particular complex object sub-groups). This feature is fully supported by ClustCube, because not only the base cuboid over the object domain is computed, but also all the other cuboids of the lattice (see Section 2.2). This way, Web users can browse the overall multidimensional domain of clustered complex objects, and interact with it. As highlighted in Section 2.3, the alternative solution of computing arbitrary cuboids on the fly would be practically
impossible in real-life settings like the Web search environments we address in our research.
4.1 WebClustCube’s Main Functionalities
of the hierarchy to be rendered in the view; (4) accessing and browsing the OLAP view, in order to retrieve clusters of complex objects; (5) “natural'” roll-up (i.e., removing a column) and drill-down (i.e., adding a column) [4]; (6) pivoting.
The described Web search interaction paradigm supported by Visual Cube is likely to be associated with a typical OLAP-enabled Web interface, where ClustCube plays the role of supporting framework. This finally realizes the WebClustCube system. In such systems, the following functionalities are supported: (1) selection of the complex object types to be included in the OLAP analysis; (2) selection of the OLAP analysis to be performed, i.e. definition of the target OLAP view; (3) for each dimension defined at point (2), selection of the level
Functionalities (1), (2) and (3) are necessary in order to define the OLAP view that is relevant for the current analysis (the cuboid lattice for this view is computed by ClustCube – see Section 2.2). Functionalities (4), (5) and (6) support typical OLAP operations on top of the Web OLAP view over clustered complex objects. Similarly, the same two-dimensional Web views can be parameterized in order to represent higher-dimensional OLAP views (e.g., [7,27]).
Figure 4. Integration of Clustering Techniques and OLAP Methodologies within an OLAP-Enabled Web View of Clustered Digital Cameras in WebClustCube.
4.2 WebClustCube in Action
4.3 WebClustCube’s Logical Architecture
Figure 4 shows a typical OLAP-enabled Web view within the WebClustCube architecture. Here, clustered objects representing digital cameras retrieved via the Web view are delivered via OLAP methodologies, with associated browsing and support for typical operations (i.e., roll-up, drill-down, and so forth). In particular, the OLAP-enabled Web view of Figure 4 represents clustered objects/digital-cameras, for which the clustering phase has been performed over the clustering attributes Resolution and DigitalZoom, which also naturally model OLAP dimensions of the view. Furthermore, Figure 5 shows the same view over which a roll-up operation over the dimension Resolution has been applied. Moreover, Figure 6 shows the same view over which a drill-down operation over the dimension Price has been applied.
Figure 7 shows the logical architecture of WebClustCube. We borrowed this architectural experience from another similar research experience whose results have been presented in [51]. Here, a DW Administrator determines the most appropriate OLAP analysis of interest, which, in turn, determines the setting of ClustCube’s parameters (see Section 2.3). To this end, ClustCube interfaces the Target Relational Database, where the dataset of interest is stored, and from which complex objects are extracted (see Section 2.1). Based on the keyword-search interaction of the End-User interfaced to Visual Cube, ClustCube computes from the target dataset a suitable OLAP-like Hierarchical Cuboid Lattice (see Section 2.2) which stores multidimensional clusters organized in a hierarchical fashion. This cuboid lattice is mapped onto an ad-hoc Snowflake Multidimensional Schema implemented on top of Mondrian ROLAP Server. The so-determined OLAP object repository is accessed and queried via the JPivot Application Interface, which finally provides
the OLAP-enabled Web functionalities encoded in Visual Cube, as described. This visualization solution can be further improved if advanced OLAP visualization techniques, like [7,27], are integrated within its internal layers.
As regards proper implementation aspects, it should be noted that the described OLAP-enabled Web system can be further improved to gain efficiency by deploying it over a composite platform including emerging NoSQL (e.g., [3]) and Cloud Computing (e.g., [1]) paradigms.
Figure 5. The View of Fig. 2 after a Roll-Up Operation over the Dimension Resolution.
4.4 Remarks The amenities deriving from the integration of clustering techniques and OLAP methodologies achieved by WebClustCube should be noted. First, complex objects are characterized by multiple attributes that naturally combine with the multidimensionality of OLAP [12], i.e. clustering attributes also play the role of OLAP dimensions of Web views. However, such views can also embed additional OLAP dimensions that are not considered in the clustering phase, so that given more powerful functionalities to users. Second, retrieved Web views can be manipulated via well-understood OLAP paradigms, such as multidimensionality and multi-resolution, and operators, such as roll-up, drill-down and pivoting [4]. This clearly represents a critical add-in value provided by WebClustCube for actual Web search models and methodologies. Moreover, most importantly, it opens the door to novel research challenges that we conceptually located under the term multidimensional OLAP-like Web search, which can be reasonably intended as the integration of multidimensional models and methodologies with Web search paradigms. For instance, a very interesting scenario is represented by the case of applying the OLAP-like Web search paradigm meant by WebClustCube over uncertain data (e.g., [29,30]) meaning that we have to deal with uncertain database objects as well, with novel and exciting research challenges.
5. EXPERIMENTAL ASSESSMENT AND ANALYSIS In order to assess the performance of WebClustCube, we performed an experimental analysis where we tested the performance of our proposed system in executing the most-costing operation, i.e. aggregation of complex objects. It should be noted that, according to
guidelines given in Section 1, this operation introduces a higher complexity rather than Web search tasks that finally retrieve the desired complex objects. In fact, Web search tasks are performed on top of already-aggregate complex objects (see Section 4), based on a multidimensional and multi-resolution vision of them as materialized into the target ClustCube cube (see Section 2.2 and Section 2.3). As a consequence, when Web search tasks are running, the overall complexity of the current search is not affected by the cost of computing the underlying ClustCube cube. This gives us the motivation of estimating the complexity introduced by WebClustCube in terms of the complexity due to its most-costing operation, according to common theoretical complexity paradigms. From [8], it follows that the most suitable ClustCube cube building algorithm to be adopted as the core aggregation procedure of WebClustCube is ⟨FUL,DRI⟩ (see Section 2.3), due to the fact that, in a Web search environment, we aim at computing all the cuboids of the target lattice (as to richly support overwhelming Web search tasks), hence the full materialization of cuboids is exploited to this end. At the same, due to typical Web search requirements (e.g., [2]), we aim at taming the total data-plus-algorithmic complexity introduced by data cube computation (e.g., [18]), hence the drill-down building strategy is exploited as to avoid the re-computation of the whole cuboid hierarchy from the scratch. Indeed, it has been already recognized that data-plus-algorithmic complexity is critical in data cube computation methodologies, and it greatly affects the whole performance of any data-intensive system that is based on OLAP cubes as data and knowledge representation and mining layer. As a consequence, taming this complexity is mandatory for the effectiveness and the efficiency of the whole dataintensive system and the applications based on it.
Figure 6. The View of Fig. 3 after a Drill-Down Operation over the Dimension Price.
In our experimental assessment, we considered a data layer represented by the distributed database of a corporate company (with several branches that are geographically-distributed), and we randomly extracted complex objects based on (corporate) business processes “encoded” in the target organization. In order to assess the performance of WebClustCube, we engineered two kinds of experiments. In the first one, we measured the whole time needed to compute the target underlying ClustCube cube with respect to the
number of complex objects on top of which the ClustCube cube itself is computed. After that, we made use of the same experimental pattern but ranging the number of dimensions of the target ClustCube cube (in particular, the latter experimental scenario turns to be useful in order to assess the scalability of WebClustCube, due to the fact that, as widely known, the number of dimensions heavily impacts on the performance of data cube computation algorithms – e.g., [18]).
Figure 7. Logical Architecture of WebClustCube.
Figure 8. Temporal Performance of WebClustCube when Ranging the Number of Complex Objects (N = 10).
Figure 10. Spatial Performance of WebClustCube when Ranging the Number of Complex Objects (N = 10).
In the second kind of experiments, we measured the total occupancy in secondary memory required by the target underlying ClustCube cube as corresponding to the settings of the first set of experiments. This because storage space of aggregate data is considered another relevant parameter targeted to evaluate the efficiency of any arbitrary OLAP data cube aggregation technique (e.g., [46]). We run these experiments for the same experimental patterns introduced in the first kind of experiments, i.e. by ranging the number of complex objects, and by ranging the number of dimensions.
It should be noted, in fact, that while interacting with the OLAPenabled Web view, users may repeatedly request to compute aggregations not only on the leaf-level the complex object repository but also on already-aggregate complex objects (see Section 2.2). Both these operations can be very resource-consuming, hence resulting in poor query response time. This is the reason way, in order to deal with with such computational issues, WebClustCube stores the target cuboid lattice in the CCL layer of the ClustCube storage (see Section 2.3). In fact, making available already-aggregate cuboids improves query performance.
Figure 9. Temporal Performance of WebClustCube when Ranging the Number of Dimensions (|OICODL| = 350K).
Figure 8 shows the results retrieved by the first class of experiments, when fixing the number of dimensions N to 10. Figure 9, instead, shows the results retrieved when fixing the number of complex objects OICODL to 350K. Results deriving from Figure 8 and Figure 9, focused to the temporal complexity of the WebClustCube system, clearly confirm the benefits deriving from WebClustCube in supporting OLAP-enabled Web search of complex objects. Figure 10 shows the results retrieved by the second class of experiments, by using the same experimental setting of experiments shown in Figure 8, while Figure 11 represents the case for the same experimental setting of experiments shown in Figure 9. Even with respect to the spatial complexity analysis, Figure 10 and Figure 11 clearly show the reliability of the WebClustCube system.
5.1 Remarks Our experiments have, indeed, revealed the effectiveness and efficiency of WebClustCube as, in typical OLAP aggregation frameworks (e.g., [47]), taming the temporal and spatial complexity represents a critical challenge to be faced-off, as the size of aggregate data can become really explosive for a high number of dimensions (e.g., [48]). This problem gets hard when Web-enabled interfaces are considered, as the typical interaction of Web users is of interactive nature, and this would require to aggregate data iteratively as to “follow” the user’s dynamics. WebClustCube’s experimental analysis demonstrates the suitability of the system to this particular scenario.
Figure 11. Spatial Performance of WebClustCube when Ranging the Number of Dimensions (|OICODL| = 350K).
6. CONCLUSIONS AND FUTURE WORK WebClustCube, a novel system that allows OLAP-enabled Web search of complex objects has been presented in this paper. WebClustCube relies, in turn, on ClustCube an OLAP-based clustering framework over complex objects extracted from distributed databases. WebClustCube is innovative since, as demonstrated through the paper, it clearly enhance the potentialities of current Web search methodologies by embedding the well-understood multidimensional data analysis and mining paradigm, in particular OLAP analysis. We also propose the logical architecture of the system, which mainly relies on open source technologies, along with a case study that has confirmed the potentialities offered by WebClustCube. Future work is mainly oriented towards the evolution of our proposed system as to make it capable of supporting innovative features such as more performance, thanks to emerging Cloud infrastructures, and more flexibility, thanks to incorporating, beyond clustering, other Data Mining techniques such as classification and frequent itemset mining. On the other hand, we are interesting in studying innovative aspects of our framework, such as: (i) fragmentation (e.g., [40]);(ij) uncertain data management (e.g., [41]);(iii) big data management (e.g., [42]); (iv) adaptive predicates (e.g., [49,50]).
7. REFERENCES [1]
[2]
M. Armbrust, A. Fox, R. Griffith, A.-D. Joseph, R.-H. Katz, A. Konwinski, G. Lee, D.-A. Patterson, A. Rabkin, I. Stoica, M. Zaharia, “A View of Cloud Computing”, Communications of the ACM, vol. 53, no. 4, pp. 50–58, 2010. A.Z. Broder, “A Taxonomy of Web Search”, SIGIR Forum, vol. 36, no. 2, pp. 3–10, 2002.
[3]
R. Cattell, “Scalable SQL and NoSQL Data Stores”, SIGMOD Record, vol. 39, no. 4, pp. 12–27, 2010.
[4]
S. Chaudhuri, U. Dayal, “An Overview of Data Warehousing and OLAP Technology”, SIGMOD Record, vol. 26, no. 1, pp. 65–74, 1997.
[5]
A. Cuzzocrea, “Combining Multidimensional User Models and Knowledge Representation and Management Techniques for Making Web Services Knowledge-Aware”, Web Intelligence and Agent Systems, vol. 4, no. 3, pp. 289–312, 2006.
[6]
[7]
[8]
[9]
A. Cuzzocrea, F. Furfaro and D. Saccà, “Enabling OLAP in Mobile Environments via Intelligent Data Cube Compression Techniques”, Journal of Intelligent Information Systems, vol. 33, no. 2, pp. 95–143, 2009. A. Cuzzocrea, D. Saccà, P. Serafino, “Semantics-Aware Advanced OLAP Visualization of Multidimensional Data Cubes”, International Journal of Data Warehousing and Mining, vol. 3, no. 4, pp. 1–30, 2007. A. Cuzzocrea, P. Serafino, “ClustCube: An OLAP-based Framework for Clustering and Mining Complex Database Objects”, in Proceedings of ACM SAC, pp. 976–982, 2011. T. Demeester, D. Nguyen, D. Trieschnigg, C. Develder, D. Hiemstra, “Snippet-Based Relevance Predictions for Federated Web Search”, in Proceedings of ECIR, pp. 697–700, 2013.
[10] G. Ganu, A. Marian, “One Size Does Not Fit All: Multi-
Granularity Search of Web Forums”, in Proceedings of ACM CIKM, pp. 9–18, 2013. [11] B. Ganter, R. Wille, “Formal Concept Analysis: Mathematical
Foundations”, 1st edition, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1997. [12] J. Gray, D. Chaudhuri, A. Bosworth, A. Layman, D. Reichart,
M. Venkatrao, F. Pellow, H. Pirahesh, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, Data Mining and Knowledge Discovery, vol. 1, no. 1, pp. 29–53, 1997. [13] T. Hsiao, W.-S. Luk, S. Petchulat, “Data Visualization on Web-
Based OLAP”, in Proceedings of ACM DOLAP, pp. 75–82, 2011. [14] Y. Kotidis, N. Roussopoulos, “DynaMat: A Dynamic View
Management System for Data Warehouses”, in Proceedings of ACM SIGMOD, pp. 371–382, 1999. [15] B.-Y.-L. Kuo, T. Hentrich, M. Benjamin, B.M. Good, M.D.
Wilkinson, “Tag Clouds for Summarizing Web Search Results”, in Proceedings of ACM WWW, pp. 1203–1204, 2007. [16] G.-W. Park, S.-J. Lee, S.-H. Lee, “Enhance Web Search based
on Topic Sensitive-Social Relationship Ranking Algorithm in Social Networks”, in Proceedings of WI/IAT Workshops, pp. 445–448, 2009. [17] J. Tong, G. Wang, D.-S. Stones, S. Sun, X. Liu, F. Zhang,
“Exploiting Query Term Correlation for List Caching in Web
Search Engines”, in Proceedings of ACM CIKM, pp. 1817– 1820, 2013. [18] D. Xin, J. Han, X. Li, Z. Shao, B.-W. Wah, “Computing
Iceberg Cubes by Top-Down and Bottom-Up Integration: The StarCubing Approach”, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 111–126, 2007. [19] F. Tao, K.-H. Lei, J. Han, C. Zhai, X. Cheng, M. Danilevsky,
N. Desai, B. Ding, J. Ge, H. Ji, R. Kanade, A. Kao, Q. Li, Y. Li, C.-X. Lin, J. Liu, N.-C. Oza, A.-N. Srivastava, R. Tjoelker, C. Wang, D. Zhang, B. Zhao, “EventCube: Multi-Dimensional Search and Mining of Structured and Text Data”, in Proceedings of ACM KDD, pp. 1494–1497, 2013. [20] C.-X. Lin, B. Ding, J. Han, F. Zhu, B. Zhao, “Text Cube:
Computing IR Measures for Multidimensional Text Database Analysis”, in Proceedings of IEEE ICDM, pp. 905–910, 2008. [21] D. Zhang, C.X. Zhai, J. Han, “Topic Cube: Topic Modeling for
OLAP on Multidimensional Text Databases”, in Proceedings of SDM, pp. 1124–1135, 2009. [22] H. Al-Aqrabi, L. Liu, R. Hill, L. Cui, J. Li, “Faceted Search in
Business Intelligence on the Cloud”, in Proceedings of GREENCOM-ITHINGS-CPSCOM, pp. 842–849, 2013. [23] M. Sifer, “Exploring Website Log Data with Coordinated
Dimension Hierarchies and Network Pivots”, International Journal of Computational Science and Engineering, vol. 2, no. 5/6, pp. 279–286, 2006. [24] L.-V.-S. Lakshmanan, J. Pei, J. Han, “Quotient Cube: How to
Summarize The Semantics of a Data Cube”, in Proceedings of VLDB, pp. 778–789, 2002. [25] L.-V.-S. Lakshmanan, J. Pei, Y. Zhao, “Efficacious Data Cube
Exploration by Semantic Summarization and Compression”, in Proceedings of VLDB, pp. 1125–1128, 2003. [26] S.-M. Huang, T.-H. Chou, J.-L. Seng, “Data Warehouse
Enhancement: A Semantic Cube Model Approach”, Information Sciences, vol. 177, no. 11, pp. 2238–2254, 2007. [27] A. Cuzzocrea, D. Saccà, P. Serafino, “A Hierarchy-Driven
Compression Technique for Advanced OLAP Visualization of Multidimensional Data Cubes”, in Proceedings of DaWaK, pp. 106–119, 2006. [28] S. Nedjar, R. Cicchetti, L. Lakhal, “Extracting Semantics In
OLAP Databases Using Emerging Cubes”, Information Sciences, vol. 181, no. 10, pp. 2036–2059, 2011. [29] A. Cuzzocrea, “Retrieving Accurate Estimates to OLAP
Queries over Uncertain and Imprecise Multidimensional Data Streams”, in Proceedings of SSDBM, pp. 575–576, 2007. [30] A. Cuzzocrea, “Approximate OLAP Query Processing over
Uncertain and Imprecise Multidimensional Data Streams”, in Proceedings of DEXA, Vol. 2, pp. 156–173, 2013. [31] A. Cuzzocrea, G. Xu, “Towards a Framework for Supporting
Web Search of Complex Objects via Multidimensional Paradigms”, in Proceedings of ICCSA (Workshops/Short Papers/Posters), pp. 217–220, 2014. [32] T. Zhang, R. Ramakrishnan, M. Livny, “BIRCH: A New Data
Clustering Algorithm and Its Applications”, Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 141–182, 1997. [33] R.-T. Ng, J. Han, “CLARANS: A Method for Clustering
Objects for Spatial Data Mining”, IEEE Transactions on
Knowledge and Data Engineering, vol. 14, no. 5, pp. 1003– 1016, 2002. [34] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, “A Density-Based
Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, in Proceedings of ACM KDD, pp. 226–231, 1996. [35] A. Cuzzocrea, “CAMS: OLAPing Multidimensional Data
Streams Efficiently”, in Proceedings of DaWaK, pp. 48–62, 2009. [36] K. Morfonios, G. Koutrika, “OLAP Cubes for Social Searches:
Standing on the Shoulders of Giants?”, in Proceedings of ACM WebDB, available at http://webdb2008.como.polimi.it/images/stories/WebDB20 08/paper10.pdf, 2008. [37] N. Ur Rehman, A. Weiler, M.H. Scholl, “OLAPing Social
Media: The Case of Twitter”, in Proceedings of ASONAM, pp. 1139–1146, 2013. [38] A. Cuzzocrea, C. De Maio, G. Fenza, V. Loia, M. Parente,
“Towards OLAP Analysis of Multidimensional Tweet Streams”, in Proceedings of ACM DOLAP, pp. 849–858, 2015. [39] X. Jin, J. Han, L. Cao, J. Luo, B. Ding, C.-X. Lin, “Visual Cube
and On-Line Analytical Processing of Images”, in Proceedings of ACM CIKM, pp. 849–858, 2010. [40] A. Cuzzocrea, J. Darmont, H. Mahboubi, “Fragmenting Very
Large XML Data Warehouses via K-Means Clustering Algorithm”, International Journal of Business Intelligence and Data Mining, vol. 4, no. 3/4, pp. 301–328, 2009. [41] C.-K.-S. Leung, A. Cuzzocrea, F. Jiang, “Discovering Frequent
Patterns from Uncertain Data Streams with Time-Fading and Landmark Models”, Transactions on Large-Scale Data- and Knowledge-Centered Systems, vol. 8, pp. 174–196, 2013. [42] A. Cuzzocrea, D. Saccà, J.D. Ullman, “Big Data: A Research
Agenda”, in Proceedings of ACM IDEAS, pp. 198–203, 2013.
[43] R.A. Baeza-Yates, B. Ribeiro-Neto, “Modern Information
Retrieval”, Addison-Wesley Longman Publishing Co., Inc. Boston, MA, USA, 1999. [44] M.W. Berry, “Survey of Text Mining”, Springer-Verlag, New
York, Inc., Secaucus, NJ, USA, 2003 [45] A.P. Dempster, N.M. Laird, D.B. Rubin, “Maximum
Likelihood from Incomplete Data via the EM Algorithm”, Journal of the Royal Statistical Society, Series B, pp.1–38, 1977. [46] A. Cuzzocrea, U. Matrangolo, “Analytical Synopses for
Approximate Query Answering in OLAP Environments”, in Proceedings of DEXA, pp. 359–370, 2004. [47] V. Harinarayan, A. Rajaraman, J.D. Ullman, “Implementing
Data Cubes Efficiently”, in Proceedings of ACM SIGMOD, pp. 205–216, 1996. [48] S. Berchtold, C. Böhm, H.-P. Kriegel, “The Pyramid-
Technique: Towards Breaking the Curse of Dimensionality”, in Proceedings of ACM SIGMOD, pp. 142–153, 1998. [49] M. Cannataro, A. Cuzzocrea, C. Mastroianni, R. Ortale, A.
Pugliese, “Modeling Adaptive Hypermedia with an ObjectOriented Approach and XML”, in Proceedings of WebDyn 2002, CEUR-WS Vol. 702, pp. 35–44, 2002. [50] V. Kantere, M. Filatov, “A Workflow Model for Adaptive
Analytics on Big Data”, in Proceedings of IEEE BigData Congress, pp. 673–676, 2015. [51] M. Ceci, A. Cuzzocrea, D. Malerba, “Effectively and
Efficiently Supporting Roll-Up and Drill-Down OLAP Operations over Continuous Dimensions via Hierarchical Clustering”, Journal of Intelligent Information Systems, vol. 44, no.3, pp. 309–333, 2015.