International Conference on Computer Systems and Technologies - CompSysTech’12

Querying Big Data

Boris Novikov, Natalia Vassilieva, Anna Yarygina

Abstract: The term “Big Data” has become a buzzword and is widely used in both research and industry. Typically the concept of big data assumes a variety of different sources of information and velocity of complex analytical processing, rather than just a huge and growing volume of data. Variety, velocity, and volume together create new research challenges, as nearly all techniques and tools commonly used in data processing have to be reconsidered. Variety and uncertainty of big data require a mixture of exact and similarity search and grouping of complex objects based on different attributes. High-level declarative query languages are important in this context due to their expressiveness and potential for optimization. In this talk we are mostly interested in an algebraic layer for complex query processing which resides between the user interface (most likely graphical) and the execution engine in a layered system architecture. We analyze the applicability of existing models and query languages. We describe a systematic approach to similarity handling of complex objects, simultaneous application of different similarity measures and querying paradigms, complex searching and querying, and combined semi-structured and unstructured search. We introduce adaptive abstract operations based on the concept of a fuzzy set, which are needed to support uniform handling of different kinds of similarity processing. To ensure an efficient implementation, approximate algorithms with controlled quality are required to enable a quality versus performance trade-off for timeliness of similarity processing. Uniform and adaptive operations enable high-level declarative definition of complex queries and provide options for optimization.

Key words: Computer Systems and Technologies, Query Languages, Query Processing, Big Data.

INTRODUCTION

The ability to compare, confront, and combine knowledge obtained from different sources has always been considered extremely valuable. In the context of computer-based systems, the problem of integration of heterogeneous distributed information resources has remained a hot topic for decades, yet several hard issues are still unresolved, especially in the context of “Big Data”.

Typically the concept of big data assumes a variety of different sources of information and velocity of complex analytical processing, rather than just a huge and growing volume of data. Variety, velocity, and volume together create new research challenges, as nearly all techniques and tools commonly used in data processing have to be reconsidered. Big data is not just running analytics on a cloud. The systems might be heterogeneous in terms of data model, dynamics, trustfulness, and content type, as well as querying or retrieval paradigms. The diversity of primary data sources spans databases, semi-structured data, uncertain data, streams (news streams and sensor streams), text, multimedia, etc.

The need to combine data extracted from heterogeneous resources, potentially based on diverse querying paradigms, appears in several application areas, including advanced search, personalization, relevance feedback, ETL, and analytical processing. Variety and uncertainty of big data require a mixture of exact and similarity search and grouping of complex objects based on different attributes. High-level declarative query languages are important in this context due to their expressiveness and potential for optimization. The need for complex queries may arise, in particular, from the following patterns:

- Merge information obtained from different sources on the same object(s);
- Combine multiple scores (such as opinions) on a single object into a single integrated score;
- Derive new knowledge based on a combination of knowledge related to different or multiple objects (different kinds of joins and aggregation).

Our objective is to provide a uniform approach to the specification, formal representation, optimization, and execution of complex queries and workflows. We introduce an open set of generalized algebraic operations to be used as an intermediate query language. Obviously, the power and expressiveness needed to capture the diversity outlined above exceeds the potential capabilities of any specific closed model. Hence, extendibility is a must. In addition to common features of query languages, such as filtering, set-theoretic operations, and joins, we have to incorporate complex processing techniques such as NLP, mining, and analytics.

Any integration facility trying to combine a diversity of models and data structures faces the enormous complexity of the needed model. In order to keep the complexity at a manageable level, the querying engine should be built on top of existing querying facilities rather than substitute them. Further, to avoid the complexity of a global schema definition, no assumption is made on the existence of any kind of global schema or any complete schema for primary data sources. Instead, the assumption is that any attributes needed for each specific query evaluation are accessible (or computable) for any of the retrieved objects.

We consider similarity-based querying as a common denominator for different types of data and queries, with a potential for preserving exactness of certain query results. Complex query processing based on similarity or another uncertain querying model often tends to be computationally heavy. This implies a need for approximate query evaluation and support for trade-offs between computational efficiency and accuracy of the output. Appropriate query optimization techniques, based on quality-efficiency trade-offs and adaptive optimization, are also essential.

In the following sections we discuss models and languages suitable for querying big data and outline the design of the querying engine.

SIMILARITY AND Q-SETS

To build a complex query evaluation environment we need to define how objects will be compared with queries and with each other. The model presented here exploits a very general notion of similarity as the primary matching tool. The concept of similarity is a proven tool and a cornerstone for numerous information management techniques. To mention just a few examples, similarity to a query is interpreted as relevance in advanced information retrieval; similar objects are clustered together in data mining, etc. An extreme form of similarity may also represent exact queries in traditional databases.

Typically, similarity is defined as proximity in a certain space, i.e. objects are considered similar if the distance between them is small in a certain measure. Hence, the central problem of similarity-based techniques is to define a distance function or a similarity measure which approximates the appropriate semantics with reasonable quality. The effectiveness of a similarity measure can be estimated with metrics commonly used for several information management tasks, that is, precision, recall, and their combinations. The relevance of an object to a query is expressed with a score defined as the similarity between the object and the query. The scores must be calibrated before they can be used in the same query.
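As a toy illustration (ours, not part of the paper), two common measures produce raw values on very different scales, which is exactly why calibration to a common range is needed; a minimal Python sketch:

    import math

    def euclidean(a, b):
        """Distance in a feature space: smaller means more similar."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def distance_to_score(d, alpha=1.0):
        """One possible calibration: map an unbounded distance into (0, 1]."""
        return math.exp(-alpha * d)

    def jaccard(s, t):
        """Set overlap: already lands in [0, 1], where 1 means identical."""
        s, t = set(s), set(t)
        return len(s & t) / len(s | t) if s | t else 1.0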
We require all scores to be forced into the interval [0,1] in our model, implying the same range for all scores used internally. External scores are pushed into this range as a part of the calibration procedure. The restriction on the range admits several different interpretations of scores. Specifically, they might be considered as probabilities of relevance, values of a membership function, or confidence levels. Obviously, the concept of similarity captures the representation of imprecise queries and results, typical for IR-style information resources or probabilistic databases. It also captures exact traditional database queries with a discrete similarity function which behaves like a predicate (that is, returns 0 or 1).

Query evaluation, and even the definition of a query, requires certain knowledge of meta-information related to the objects being queried. Typically, this information is available in some kind of a schema. However, it is unrealistic to assume the existence of any kind of global data definition. We also leave out of scope any discussion of representation or mapping of local schemas into a global one or vice versa. Instead, we assume that an external information resource can provide information on attributes and properties, and that the objects needed for query evaluation have all necessary attributes.

The central concept of our model is a q-set defined as a triple (q, B, S) where

- q is a query (no matter how represented),
- B is a base set of objects,
- S is a scoring function for objects in B.

The notion of a q-set encapsulates both a query and the result of its evaluation represented as a scoring function. The concept of a q-set can be viewed as a generalization of the result sets commonly found in many database APIs, such as JDBC or ODBC. A q-set may also be viewed as a fuzzy set with the scoring function interpreted as a membership function.

We emphasize that there are no assumptions on the representation of a query; specifically, we do not assume that a query is expressed in any query language. A query might be expressed in terms of algebraic expressions composed of the operations defined below. However, these expressions are relative in the sense that their arguments must be q-sets obtained from primary information sources (probably with different query languages). Similarly, the scores might be obtained from primary sources or calculated from the object and the query by a certain function. There are two ways to obtain a q-set: as a q-set produced from a primary information resource, or as the output of an operation (or expression) processed in the query engine. The former may be considered a generalization of stored tables in traditional database systems.

Although the requirements on object structure are kept minimal, certain object properties and attributes are ultimately needed for any query. We require all objects to have an identity. The assumption is that an object obtained from a primary source can be reached based on this identity. The nature of identity may vary from the object itself, making it immutable, to just a URL or a surrogate value. The expectation is that the identity is cross-source consistent. That is, if an object can be obtained from different primary sources, it is expected to have the same identity, and objects with different identities are different.
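A minimal rendering of the q-set triple in Python (ours, purely illustrative; the paper prescribes no concrete representation), keying the base set by the cross-source identity:

    from dataclasses import dataclass
    from typing import Any, Callable, Dict

    @dataclass
    class QSet:
        query: Any                       # q: opaque; no query language is assumed
        base: Dict[str, dict]            # B: objects keyed by cross-source identity
        score: Callable[[dict], float]   # S: membership-style scoring into [0, 1]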
AN ALGEBRA ON Q-SETS

The operations defined on q-sets may constitute algebras of different expressive power. The lowest meaningful level is keyword search, as defined in [1]. This level of algebra includes filters and generic fusion operations. All filtering operations return the base set of the argument q-set and preserve the identities of all objects in the base set.

Simple filters re-calculate the scores of individual objects, where the new score of an object may only depend on the attribute values of this object and a limited number of additional parameters. Consequently, filters can be executed on streams of objects without any need for materialization or temporary storage of (any part of) the q-set.
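Because the new score depends only on the object at hand, a simple filter can run over a stream; a sketch under a hypothetical (identity, object, score) tuple representation of our own:

    from typing import Callable, Iterable, Iterator, Tuple

    Scored = Tuple[str, dict, float]   # (identity, object, score)

    def rescore(stream: Iterable[Scored], fn: Callable[[dict], float]) -> Iterator[Scored]:
        """Simple filter: one object in, one object out, no temporary storage."""
        for ident, obj, _old in stream:
            yield ident, obj, fn(obj)

    def cut(stream: Iterable[Scored], threshold: float) -> Iterator[Scored]:
        """A thresholding filter, equally streamable."""
        for ident, obj, score in stream:
            if score > threshold:
                yield ident, obj, score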


A somewhat different kind of filter can be used to instantiate additional attributes. Typically, the values of additional attributes will be provided by external library functions. Although these filters might be computationally heavy, they still do not necessarily require materialization and can be chained in a pipe. Of course, materialization in a cache might be important for performance reasons, but not conceptually.

Normalization and strengthening/weakening operations are essential for calibration procedures and provide for making q-sets from different sources comparable. The primary purpose of normalization is to adjust the range of scores or distances to the desired interval. Typically, normalization should change scores or distances in some sense evenly (or proportionally). Strengthening is intended to improve the differential power of scores. That is, the differences between high scores are made larger, while the differences between less important (lower) scores are made smaller. Weakening does the opposite. Discretization converts a fuzzy q-set into an exact one by replacing high scores with 1 and low scores with 0. Effectively it truncates presumably unimportant objects with low scores from a q-set.
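One possible reading of these calibration operations in Python (hypothetical helpers of ours; the paper leaves the exact transforms open):

    def normalize(scores):
        """Linearly rescale raw scores onto [0, 1]."""
        lo, hi = min(scores), max(scores)
        return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

    def strengthen(score, gamma=2.0):
        """For scores in [0, 1], gamma > 1 widens the gaps between high scores
        and compresses the low ones (strengthening); gamma < 1 weakens."""
        return score ** gamma

    def discretize(score, cut=0.5):
        """Turn a fuzzy score into an exact one, truncating low-scored objects."""
        return 1.0 if score >= cut else 0.0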
The generic fusion operation is intended to extend set-theoretic operations, both exact and fuzzy. Traditional set-theoretic operations, e.g. in relational databases, require the arguments to have the same type. There is no strict concept of type in our model; consequently, instead of requiring the arguments to be of the same type, we state that attributes from all arguments are included in the output q-set. Most specializations of the fusion are symmetric and commutative, and might be either binary or multi-argument. Many important specializations are not associative, and for many of them the multi-argument version cannot be implemented as an expression of binary versions. An exception is the difference operation, which can be defined in both exact and fuzzy versions. Obviously, the difference is asymmetric and can only be a binary operation. The base set of the result is the set-theoretic union of the base sets of all arguments. Any specialization of the generic fusion operation is defined by a function calculating the scores of the objects in the base set. Duplicates are not included; instead, the scores of all duplicates may contribute to the output score.

More powerful algebras are built with the inclusion of joins and aggregations. The generic join/product operation is constructed as a generalization of the theta-join operation of the relational model: the base set of the output is the direct product of the argument base sets. The definition of a join operation assumes a (fuzzy) predicate function on attributes of the arguments. The value of this function is used as an additional component of the resulting score. Thus, the score for a join always depends on three factors: the two incoming scores and the predicate score. Depending on the predicate, the generic join can express traditional exact (natural) database joins, spatial joins, similarity joins, or fuzzy joins. The identities of output objects are constructed as surrogates. As in relational theory, the generic join is redundant and can be expressed as a product with subsequent filtering. However, knowledge of the join predicate provides for more efficient implementations of join algorithms than calculation of the complete product.
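Sketches of one possible specialization of each operation, continuing the hypothetical {identity: (object, score)} representation used above; note that the pairwise combine parameter cannot capture the non-associative multi-argument specializations mentioned in the text:

    def fuse(qsets, combine=max):
        """Generic fusion: the output base set is the union of the argument
        base sets; duplicate identities collapse, their scores combined."""
        out = {}
        for base in qsets:
            for ident, (obj, score) in base.items():
                if ident in out:
                    prev_obj, prev_score = out[ident]
                    out[ident] = ({**prev_obj, **obj}, combine(prev_score, score))
                else:
                    out[ident] = (obj, score)
        return out

    def generic_join(left, right, predicate, combine=lambda ls, rs, p: ls * rs * p):
        """Generic join: the base set is the direct product; the output score
        depends on both incoming scores and the fuzzy predicate score."""
        out = {}
        for lid, (lobj, ls) in left.items():
            for rid, (robj, rs) in right.items():
                p = predicate(lobj, robj)          # fuzzy predicate in [0, 1]
                if p > 0.0:
                    out[f"{lid}|{rid}"] = ({**lobj, **robj}, combine(ls, rs, p))
        return out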

The aggregation operation constructs an object of the result base set from several objects. It can be considered a fuzzy replacement and generalization of exact queries with a GROUP BY clause, which defines a set of incoming objects to be grouped into one outgoing object. In addition to grouping based on exact matches of the values of grouping attributes, the objects can be grouped based on classification or clustering of the incoming objects.
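For the exact-match case, a minimal GROUP BY-style sketch (ours; attribute-based grouping is only one of the grouping modes just mentioned):

    from statistics import mean

    def aggregate(qset, group_attr, combine=mean):
        """Group objects on an attribute and fuse each group into a single
        output object under a surrogate identity."""
        groups = {}
        for ident, (obj, score) in qset.items():
            groups.setdefault(obj[group_attr], []).append((obj, score))
        out = {}
        for key, members in groups.items():
            merged = {}
            for obj, _score in members:
                merged.update(obj)
            out[f"agg:{key}"] = (merged, combine(s for _obj, s in members))
        return out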


A special case of aggregation is duplicate or near-duplicate removal; the former is based on an exact match of all object attributes (probably except the identity), while the latter might be considered a special kind of clustering. The identities of aggregates are constructed as surrogates. Aggregation in its general form might be extremely powerful. For example, topic detection and clustering of an incoming news stream might be viewed as an aggregation.

The nest operation is a special case of aggregation which creates an attribute of an aggregated object consisting of all grouped objects. The unnest operation is a kind of inverse of the nest and is constructed as follows. The objects in the base set of the argument must have a q-set-valued attribute. The output objects are constructed from each object of the nested q-set augmented with the identity of the incoming base object (and probably its other attributes). The score of an outgoing object should be calculated from the score of the incoming object and the score of the nested object. Thus, the unnest operation is also generic.

The usefulness of the unnest might be illustrated with the following scenario. Consider a partially annotated collection of images and a textual search query. The first step is a text search of relevant annotations, resulting in a q-set containing several annotated images. These images are then used as queries in content-based image retrieval, producing nested q-sets as attribute values of the annotated images. Finally, the unnest yields a q-set containing images from both the annotated and unannotated parts of the collection.

The disadvantage of joins and aggregations is the need to generate surrogate identities. As a result, the connection between the objects extracted from primary sources and derived facts may be lost. It might be reasonable to restrict the language to the level of semi-joins, which preserve the identities of one of the arguments but still provide for linking of other objects. The semi-join algebra is based on a group join operation, which is defined as a join followed by nesting with aggregation on the identity of the first argument. Effectively this operation combines the power of joins and aggregations, and both joins and aggregations can be considered special cases of the group join. Thus, in contrast with relational algebra, the restriction to semi-joins does not reduce the expressiveness of the language, as the result of a join can be obtained from the group join followed by an unnest.
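A sketch of unnest under the same hypothetical representation, with the nested q-set stored in an attribute and a pluggable score combination:

    def unnest(qset, attr, combine=lambda outer, nested: outer * nested):
        """Unnest: each input object carries a q-set-valued attribute; nested
        objects are emitted with the outer identity attached, and the output
        score is derived from both the outer and the nested score."""
        out = {}
        for oid, (obj, oscore) in qset.items():
            for nid, (nobj, nscore) in obj[attr].items():
                out[f"{oid}/{nid}"] = ({**nobj, "outer_id": oid}, combine(oscore, nscore))
        return out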
The usefulness of the operations outlined above can be illustrated with examples. Our first example shows the enrichment filters. The aim is to find the best hotels according to their formal scores and guest reviews. Say the incoming q-set consists of hotel descriptions in XML. The first filter operation extracts only those hotels with high input scores. An unnest extracts all guest reviews into a new q-set in order to process them. The next filter analyses the text of each review and calculates a new score reflecting the sentiment of the review. After that the output is materialized, and an aggregation operation groups the reviews for each hotel and calculates an accumulated hotel score based on the review scores. At the last step a filter operation extracts only those hotels with high scores based on guest reviews.

This data processing workflow can be specified with the following query:

  filter(score > threshold,
    aggregation(hotels,
      filter(score_construction(SentimentAnalysis),
        unnest(reviews,
          filter(score > threshold, hotels)))))

The function SentimentAnalysis is implemented as an external library function.



The second example demonstrates the group join operation. The informal query is to find inexpensive hotels located close to congress centres capable of accommodating a large conference and reachable in at most one hour from an international airport. The query can be specified with the following expression:

  Group_join(
    Group_join(
      filter(hotels, 'inexpensive'),
      filter(congress_centers, 'capacity > 500'),
      'walking distance'),
    airports,
    'travel_time < 1 h')

Note that the calculation of distance (or time) in a road network is very expensive. An equivalent alternative expression is the following:

  Group_join(
    filter(hotels, 'inexpensive'),
    Group_join(
      filter(congress_centers, 'capacity > 500'),
      airports,
      'travel_time < 1 h'),
    'walking distance')

Most likely the latter will be processed more efficiently, as the cardinality of the q-set with hotels is much higher than the cardinality of the one with conference centres. Consequently, this example demonstrates the potential for optimization.
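A back-of-the-envelope comparison of the two plans (our numbers, purely hypothetical), counting only predicate evaluations:

    hotels, centers, airports = 10_000, 50, 5   # assumed cardinalities
    reachable = 20                              # assumed centers within 1 h of an airport

    # Plan 1: hotels x centers first, then the (nested) result x airports.
    plan1 = hotels * centers + hotels * airports

    # Plan 2: centers x airports first, then hotels x surviving centers.
    plan2 = centers * airports + hotels * reachable

    print(plan1, plan2)   # 550000 vs 200250; plan 2 also makes far fewer of
                          # the expensive road-network travel_time calls
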

IMPLEMENTATION CONSIDERATIONS

The main agents in our prototype are the Primary Sources, the Source of Queries, the Source of External Operations, and the Querying System. The Query Processing engine parses a query, performs the query optimization, and interprets the query execution plan. The prototype contains several alternative implementations of the algebraic operations based on different algorithms (such as hash or nested loop) and different platforms (currently a centralized DBMS and Hadoop). The engine is also able to use external operations and functions.

The success of any query processing engine depends mostly on the quality of the optimization. Indeed, the difference in performance between optimal and naive query execution plans may reach several orders of magnitude. The power of query optimizers is based on:

- nice properties of the query algebra, providing a huge search space;
- availability of alternative algorithms (implementations) of the algebraic operations and high-quality cost models;
- optimization strategies, algorithms, and heuristics.

Unfortunately, some of these features are hardly available for an extensible language. Specifically, user-defined extensions might lack algebraic properties (thus reducing the search space) and have unpredictable complexity (thus undermining cost models). Both of these issues are inherent in our model. However, the concept of similarity is, by its nature, approximate; hence, exact query evaluation would be meaningless in many cases. Moreover, exact evaluation of certain approximate queries is not feasible for performance reasons. It is hardly possible to process all relevant objects, say, found on the Web.

An approximate query evaluation opens new options for query optimization, as the optimization is performed at the algebraic level. The space for optimization is provided by trade-offs between accurate and fast execution of operations. Traditional database query optimizers minimize a cost function representing a certain computational resource (such as CPU time or the number of I/O operations). For approximate query evaluation an additional restriction is needed: the quality of the result should not be less than specified. An inverse optimization problem also makes sense: given a restriction on the resources available for query evaluation, yield a query execution plan returning results of the best possible quality.

Yet another dimension of optimization is the selection of physical operations (or algorithms) for the algebraic operations. Of course, this is already accounted for in the cost models for each operation. However, the cost depends on the execution engine chosen for temporary storage. To enable query evaluation on mixed platforms, additional copying operations (e.g. from a centralized DB to Hadoop) are inserted into a plan. The cost of copying is added but does not require other changes in the optimizer. The primary tool for approximate execution is the discretization operation, which effectively truncates from a q-set all objects with low scores, which are unlikely to contribute to the final result.

High-quality cost-based optimization is only possible with sufficiently precise statistical information, which is unlikely to be available for autonomous primary data sources. This observation suggests that adaptive query optimization and execution might be extremely valuable in our context. Specifically, statistics collected during the loading from primary sources may be used for re-optimization of the remaining part of the query.
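The direct and inverse optimization problems above can be phrased compactly; a sketch of ours over hypothetical candidate plans annotated with estimated cost and quality:

    def cheapest_with_quality(plans, min_quality):
        """Direct problem: meet the quality bound, minimize cost."""
        ok = [p for p in plans if p["quality"] >= min_quality]
        return min(ok, key=lambda p: p["cost"]) if ok else None

    def best_within_budget(plans, max_cost):
        """Inverse problem: respect the resource budget, maximize quality."""
        ok = [p for p in plans if p["cost"] <= max_cost]
        return max(ok, key=lambda p: p["quality"]) if ok else None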
RELATED WORK

The aim of this research is to combine information and knowledge from heterogeneous data sources. First of all, we address the issues concerned with the integration of information retrieval approaches and paradigms with database theory and applications.

The ideas and approaches for integration of IR and DB technologies were introduced in [2]. The authors derived a set of requirements for a core DB&IR system, namely flexible scoring and ranking, and optimizability, and discuss several architectural issues. An architecture based on an algebraic DB&IR framework is known to be preferable, and the challenges with query optimization and algebraic limitations are introduced in [2] in the context of this integration approach.

Historically, the first extensions of the relational data model and query languages were introduced in the context of object-oriented database management systems, which appeared in the mid-1980s to address the needs of data-intensive applications beyond the scope of the common database models of that time [3]. The experience gained from several experimental prototype systems is accumulated in the ODMG model and language [4]. The authors of [5] analyze the features of OQL and the ODMG query language. Several variations of object-oriented data models and query languages were proposed and analyzed [6]. The foundations of object-oriented data models and query languages are investigated in [7]. A query language based on object identity is developed in [8]. The authors compare identity-based and value-based languages, introduce a union of types as a replacement for inheritance, and prove that their query language can be mapped to the relational one.

Query languages supporting imprecise IR queries were proposed in [12, 13].



The authors of [2] proposed early termination of the query evaluation process in the context of similarity-based queries, where every object in the query result has a rank or score which identifies its relevance.

A similarity algebra that brings together relational operations and lists of objects with scores in a uniform language is introduced in [14]. The proposed algebra can be used to specify complex queries that combine different interpretations of similarity values and multiple algorithms for computing these values. The algebra supports union, intersection, and difference; join; merge; subtract; select; and map operations. This research is a nice example of an algebraic framework arising from the integration of several similarity-based querying facilities, with further analysis of its optimizability.

A fuzzy algebra which extends the relational algebra over fuzzy relations with new operators is introduced in [15] to provide a formal framework for formulating queries over several multimedia sources. Fuzzy implementations of selection, projection, join, union, difference, and top operations are formally defined there. The justification is done in the context of the traditional meta-search task. The authors of [15] discuss the application of the top and cut operators in order to improve query evaluation performance.

In order to support a declarative way of formulating queries, a generalization of the classical relational domain calculus incorporating fuzzy operations and user weights is introduced in [19]. The authors formally define a declarative query language which combines the handling of imprecise truth values with the traditional relational domain calculus. The algebraic framework presented in [19] contains projection, selection, union, intersection, difference, product, and join operations. The language provides operations to weight similarity predicates. Furthermore, fuzzified quantifiers are introduced. The authors discuss how to map the similarity relational calculus language to the similarity algebra, and distinguish between domain-dependent and domain-independent queries. The proposed algebra supports weighted conjunctions and disjunctions based on the extended formula introduced in [20, 21].

Optimization rules based on algebraic framework properties and equivalence laws are discussed in [14, 15, 2]. It is important to mention that the lack of algebraic equivalences pushes the research towards the development of optimization techniques based on performance/quality trade-offs and approximate algorithms. The authors derived equivalence and containment relationships between similarity algebra expressions and developed query rewriting methods based on the selected set of operations. Cost models based on the estimation of operation selectivity and cardinality are introduced in [14]. Optimization rules based on query tree reconstruction are derived from the preceding analysis. Cost models for object-oriented databases are discussed in [9]. The authors of [10, 11] demonstrate the power of the nest and unnest operators and their usefulness for query optimization.

A comprehensive survey of adaptive query processing techniques [18] contains a classification of different approaches to adaptive query processing and provides an in-depth analysis of some of them. A discussion of fundamental ideas in the fusion of retrieval results, such as the “Chorus Effect”, the “Skimming Effect”, and the “Dark Horse Effect”, is presented in [22, 23].
An analysis of several calibration strategies and a comparison of fuzzy set-theoretic operations can be found in [24].

CONCLUSIONS

We outlined a model and algebraic framework for query processing in a distributed heterogeneous environment and discussed key implementation issues. The central concept of a q-set is suitable for mixed-model queries and can support both exact and fuzzy querying paradigms.


REFERENCES

[1] Fletcher, G. H. L., J. V. D. Bussche, D. V. Gucht, S. Vansummeren. Towards a theory of search queries, ACM Trans. Database Syst., vol. 35, pp. 28:1-28:33, October 2010.
[2] Chaudhuri, S., R. Ramakrishnan, G. Weikum. Integrating DB and IR technologies: What is the sound of one hand clapping?, CIDR, pp. 1-12, 2005.
[3] Cattell, R. G. G. Next-generation database systems, Commun. ACM, vol. 34, pp. 30-33, October 1991.
[4] Cattell, R. G. G., D. K. Barry. The Object Data Standard: ODMG 3.0, Morgan Kaufmann, 2000.
[5] Alashqur, A. M., S. Y. W. Su, H. Lam. OQL: a query language for manipulating object-oriented databases, Proceedings of the 15th International Conference on Very Large Data Bases, San Francisco, CA, USA, 1989, VLDB '89, pp. 433-442, Morgan Kaufmann Publishers Inc.
[6] Bancilhon, F., S. Cluet, C. Delobel. A query language for the O2 object-oriented database, Proceedings of the Second International Workshop on Database Programming Languages, San Francisco, CA, USA, 1989, pp. 122-138, Morgan Kaufmann Publishers Inc.
[7] Bierman, G. M. Formal semantics and analysis of object queries, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 2003, SIGMOD '03, pp. 407-418, ACM.
[8] Abiteboul, S., P. C. Kanellakis. Object identity as a query language primitive, J. ACM, vol. 45, pp. 798-842, September 1998.
[9] Schmitt, I. QQL: A DB&IR query language, The VLDB Journal, vol. 17, pp. 39-56, January 2008.
[10] Fletcher, G. H. L., J. V. D. Bussche, D. V. Gucht, S. Vansummeren. Towards a theory of search queries, Proceedings of the 12th International Conference on Database Theory, New York, NY, USA, 2009, ICDT '09, pp. 201-211, ACM.
[11] Adali, S., P. Bonatti, M. L. Sapino, V. S. Subrahmanian. A multi-similarity algebra, Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 1998, SIGMOD '98, pp. 402-413, ACM.
[12] Montesi, D., A. Trombetta, P. A. Dearnley. A similarity based relational algebra for web and multimedia data, Inf. Process. Manage., vol. 39, no. 2, pp. 307-322, 2003.
[13] Schmitt, I., N. Schulz. Similarity relational calculus and its reduction to a similarity algebra, FoIKS, Dietmar Seipel and Jose Maria Turull Torres, Eds., 2004, vol. 2942 of Lecture Notes in Computer Science, pp. 252-272, Springer.
[14] Fagin, R. Fuzzy queries in multimedia database systems, Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, New York, NY, USA, 1998, PODS '98, pp. 1-10, ACM.
[15] Fagin, R., E. L. Wimmers. A formula for incorporating weights into scoring rules, Theor. Comput. Sci., vol. 239, no. 2, pp. 309-338, 2000.
[16] Bertino, E., P. Foscoli. On modeling cost functions for object-oriented databases, IEEE Trans. on Knowl. and Data Eng., vol. 9, pp. 500-508, May 1997.
[17] Fegaras, L., D. Maier. Optimizing object queries using an effective calculus, ACM Trans. Database Syst., vol. 25, pp. 457-516, December 2000.
[18] Fegaras, L. Optimizing queries with object updates, J. Intell. Inf. Syst., vol. 12, pp. 219-242, April 1999.
[19] Deshpande, A., Z. G. Ives, V. Raman. Adaptive query processing, Foundations and Trends in Databases, vol. 1, no. 1, pp. 1-140, 2007.


[20] Lillis, D., F. Toolan, R. W. Collier, J. Dunnion. ProbFuse: a probabilistic approach to data fusion, SIGIR, Efthimis N. Efthimiadis, Susan T. Dumais, David Hawking, and Kalervo Jarvelin, Eds., 2006, pp. 139-146, ACM.
[21] Vogt, C. C., G. W. Cottrell. Fusion via a linear combination of scores, Inf. Retr., vol. 1, pp. 151-173, October 1999.
[22] Yarygina, A., B. Novikov, N. Vassilieva. Processing complex similarity queries: A systematic approach, ADBIS 2011 Research Communications: Proceedings II of the 5th East-European Conference on Advances in Databases and Information Systems, 20-23 September 2011, Vienna, Maria Bielikova, Johann Eder, and A Min Tjoa, Eds., September 2011, pp. 212-221, Austrian Computer Society.

ABOUT THE AUTHORS

Boris Novikov, Prof., Department of Computer Science, Saint Petersburg University, Phone: +7 921 914 8534, E-mail: [email protected].
Natalia Vassilieva, Senior Researcher, HP Labs, E-mail: [email protected].
Anna Yarygina, PhD Student, Department of Computer Science, Saint Petersburg University, E-mail: [email protected].
