Federation and Sharing in the Context Marketplace

Carsten Pils¹, Ioanna Roussaki², Tom Pfeifer¹, Nicolas Liampotis², and Nikos Kalatzis²

¹ Telecommunications Software and Systems Group, Waterford Institute of Technology, Carriganore, Waterford, Ireland
[email protected], [email protected]
² School of Electrical and Computer Engineering, National Technical University of Athens, 9 Heroon Polytechneiou Str, 157-73 Zographou, Athens, Greece
{nanario,nliam,nikosk}@telecom.ntua.gr
Abstract. The emerging pervasive computing services will eventually lead to the establishment of a context marketplace, where context consumers will be able to obtain the information they require from a plethora of context providers. In this marketplace, several aspects need to be addressed: support for flexible federation among context stakeholders, enabling them to share data when required; efficient query handling based on navigational, spatial or semantic criteria; performance optimization, especially when mobile physical objects must be managed; and enforcement of privacy and security protection for the sensitive context information maintained or traded. This paper presents mechanisms that address these requirements. Together they establish a robust, spatially-enhanced distributed context management framework, and they have already been designed and implemented.

Keywords: Context Distributed Database Management System, context databases & query languages, federation, spatially-enhanced distributed context management, mobile physical object handling, context privacy & security.
J. Hightower, B. Schiele, and T. Strang (Eds.): LoCA 2007, LNCS 4718, pp. 121–138, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

Network operators capture valuable information such as device location and status, user profiles and movement patterns, network performance, etc., in order to provide telecommunication services to their clients. This data is considered to be vital context information [1] that can be exploited to customize services, to anticipate user intentions and ultimately to reduce human-to-machine interactions [2]. When this information is mashed up with context data collected by other sources, such as sensor networks and web resources, it becomes a commodity that can be traded with other operators and third-party service providers. However, even though context information holds out the prospect of enhancing user experience and increasing revenues, trading it is not straightforward. First, the data is highly sensitive, and a security breach could easily prove catastrophic for both users and operators. Second, the various infrastructures that store and manage context data are heterogeneous, and there is no standardized interface that supports context information exchange. Third, information that cannot be retrieved when necessary is valueless. In an open context marketplace, where a wide variety of information types is traded, discovering the required data is a challenge for context consumers. Finally, timely delivery of context information is crucial, because most data sources provide real-time information.

On a business level, information exchange is regulated by federation agreements. They specify the information that can be accessed by consumers, its quality, and the penalties for breach of contract. A context marketplace must implement and enforce these agreements based on efficient monitoring and access control mechanisms. Moreover, such marketplaces must provide mechanisms to store, advertise and discover information in a secure, distributed, scalable and standardized manner. Thus, technically speaking, the core of a context marketplace is a distributed heterogeneous spatial database management system. Yet, none of the available database management systems meets all of these requirements. This paper presents the Context Distributed Database Management System (CDDBMS), an efficient distributed heterogeneous spatial database management system that caters for the needs of a context marketplace.

The rest of this paper is structured as follows. Section 2 outlines the basic approach and details the requirements imposed by federation agreements. The CDDBMS architecture, database schema and query languages are presented in Section 3, while the implemented approach to handling mobile physical objects is described in Section 4. Section 5 presents the security and privacy approach implemented by the CDDBMS. Finally, in Section 6, conclusions are drawn and plans for future work are outlined.
2 Federation and Basic Context Management Approach

The CDDBMS is a peer-to-peer database comprising node servers, each of which stores information about at least one predefined information domain. Such domains include, for instance, user profiles, accounting information of a telecommunication operator, service advertisements, or information related to a specific geographical area. Context producers store their data with respect to domains, and consumers query information accordingly. Two kinds of domains are distinguished: logical domains (Logic-D) and geographic domains (Geo-D). Logic-Ds contain non-spatial information, such as user profiles, and are largely independent of other domains. Geo-Ds, however, carry strong spatial inclusion relationships. For example, a suburb domain is a geographic specialization of the city domain, and thus there is an inclusion relation between the City-Node and the Suburb-Node. Fig. 1 shows example Geo-Ds and their inclusion relations.

Inclusion relations span a directed tree on the Geo-D space, where each CDDBMS node is a vertex and each inclusion relation is an edge. This graph structure matches an R-Tree [3], an indexing structure of spatial databases. As a result, the geographic inclusion dependencies imply a structure that caters for spatial queries [4]. However, this structure requires a tight coupling of nodes and thus has a strong impact on federation agreements. For example, each of the two operators depicted in Fig. 1 maintains a CDDBMS infrastructure including sensor networks, which is only accessible to that operator’s own customers. In densely populated areas such as capitals or major cities, the two operators compete heavily, and thus there is no federation agreement that allows any information exchange. Yet, in less populated areas, infrastructure investments are less profitable and therefore the two operators share information, i.e. Operator B customers have access to the “Waterford” Node maintained by Operator A.

Fig. 1. Federation scenario example

In order to support seamless information roaming, operators can make federation agreements with other stakeholders, such as SMEs like TSSG (see Fig. 1) and users that act as sources of context information. Employees and users can thus access the context information of their premises and intelligent houses wherever they are. Provisioning of context information is not only a business of telecommunication companies. Small and medium enterprises could also offer and share context information and complement the operators’ data. A thorough analysis of secure large-scale federation solutions for mobile service platforms is provided in [5].
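The interplay between the Geo-D inclusion tree and federation agreements described in this section can be illustrated with a minimal sketch. It is not the CDDBMS implementation: the class names `GeoNode` and `FederationAgreements`, the grant model (one grant per consumer and node) and the node names "Country" and "Capital" are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class GeoNode:
    """A CDDBMS node responsible for one Geo-D (hypothetical structure)."""
    name: str
    owner: str  # stakeholder that operates the node
    parent: Optional["GeoNode"] = field(default=None, repr=False, compare=False)
    children: List["GeoNode"] = field(default_factory=list, repr=False, compare=False)

    def add_child(self, child: "GeoNode") -> "GeoNode":
        # An inclusion relation: the child's Geo-D lies inside this Geo-D.
        child.parent = self
        self.children.append(child)
        return child

    def path_to_root(self) -> List[str]:
        # Follow inclusion relations upwards; in a tree there is exactly one path.
        path, node = [], self
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path


@dataclass
class FederationAgreements:
    """Assumed model: a grant lets one stakeholder's customers read one node."""
    grants: Set[Tuple[str, str]] = field(default_factory=set)

    def allow(self, consumer: str, node: GeoNode) -> None:
        self.grants.add((consumer, node.name))

    def may_access(self, consumer: str, node: GeoNode) -> bool:
        # Owners always reach their own nodes; others need an explicit grant.
        return consumer == node.owner or (consumer, node.name) in self.grants


country = GeoNode("Country", owner="Operator A")
capital = country.add_child(GeoNode("Capital", owner="Operator A"))
waterford = country.add_child(GeoNode("Waterford", owner="Operator A"))

agreements = FederationAgreements()
agreements.allow("Operator B", waterford)  # sharing in a less populated area
```

With these assumed grants, Operator B customers can read the "Waterford" node but not the "Capital" node, which mirrors the competition scenario of Fig. 1.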
3 CDDBMS Architecture and Query Languages

The infrastructure exploited by the stakeholders of the context marketplace is heterogeneous: different database management systems are deployed to store context information. Although all of these databases implement the SQL standard, they provide different extensions or do not fully comply with it. Moreover, these systems do not address federation, and the lack of a standardized database schema makes it almost impossible to find entries. To solve this problem, CDDBMS nodes have both a standardized database schema and a standardized query interface.

3.1 Architecture

Fig. 2 illustrates the CDDBMS architecture. Each CDDBMS node comprises a Node Manager and an off-the-shelf database management system (DBMS). The Node Manager implements the management logic and the interfaces required for exchanging information with other nodes. Synchronous and asynchronous communication of context data is supported by the Query and the Event interfaces, respectively. The DBMS is the actual repository of the node’s context data. Basically, any off-the-shelf relational database engine meets the requirements of the CDDBMS. However, as spatial queries are a major feature of the CDDBMS, all CDDBMS nodes that cover a Geo-D must implement the Simple Feature Standard SQL (SFS) of the Open Geospatial Consortium (OGC) [6]. This standard specifies SQL-like spatial query statements and 3-dimensional shapes that are used to describe geographic areas. All major database management systems, such as Oracle (www.oracle.com), DB2 (www-306.ibm.com/software/data/db2), the PostGIS extension of PostgreSQL (www.postgresql.org) or MySQL (www.mysql.com), are shipped with an SFS extension.

Fig. 2. Architecture of the CDDBMS (components shown: Event Interface, Query Interface, CDDBMS Node Manager, DBMS, remote communication)

Access to the DBMS is managed by the Node Manager. Applications can either provide context query statements to the Node Manager via the Query interface, or they can subscribe to context data update events via the Event interface. When the Node Manager receives a context query statement or an event subscription, it first examines whether the relevant data item is stored in the local repository. If this is the case, query execution and subscription are straightforward. If the required context information is maintained by a remote node, the local Node Manager communicates the request to a so-called Home-Node. Each context data item is associated with a Home-Node that stores its latest state (i.e. the master copy) and synchronizes its updates. To facilitate fast updates and easy look-up of the latest state, the address of the Home-Node is encapsulated in the data item’s identifier.

3.2 Database Schema and Query Languages

As the information traded at the context marketplace is heterogeneous, only a very simple database schema that is nevertheless aligned with the major query use cases suits the CDDBMS. Key-value pairs, subsequently called Attributes, are the simplest form of storing information and therefore constitute the core of the CDDBMS schema. Attributes are grouped by Entities, which are interconnected by Associations [7]. The CDDBMS is thus quite similar to the World Wide Web: Attribute sets correspond to the content of a web page, Entities to web pages and Associations to sets of hyperlinks.
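The Entity/Attribute/Association schema can be sketched as a small in-memory data structure. This is only an illustration under assumptions: the class layout, the helper names `link` and `follow`, and the sample identifiers and attribute values are hypothetical and do not reflect the actual CDDBMS tables.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Association:
    """A typed set of links to other Entities (like a set of hyperlinks)."""
    assoc_type: str
    targets: List["Entity"] = field(default_factory=list)


@dataclass
class Entity:
    """Groups Attributes (key-value pairs), like a web page groups content."""
    entity_id: str
    entity_type: str
    attributes: Dict[str, str] = field(default_factory=dict)
    associations: Dict[str, Association] = field(default_factory=dict)


def link(source: Entity, assoc_type: str, target: Entity) -> None:
    # Create the Association on first use, then add the target to its set.
    assoc = source.associations.setdefault(assoc_type, Association(assoc_type))
    assoc.targets.append(target)


def follow(source: Entity, assoc_type: str,
           entity_type: Optional[str] = None) -> List[Entity]:
    # Browsing step: return the linked Entities, optionally filtered by type.
    assoc = source.associations.get(assoc_type)
    if assoc is None:
        return []
    return [e for e in assoc.targets
            if entity_type is None or e.entity_type == entity_type]


user = Entity("user-1", "User", {"name": "alice"})
prefs = Entity("prefs-1", "Preferences", {"language": "en"})
link(user, "myPreferences", prefs)
```

Following an Association from a known Entity then corresponds to following a hyperlink from a known web page, e.g. `follow(user, "myPreferences")` returns the linked preferences Entity.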
This design caters for two major query use cases:
• Navigation queries/Browsing: Starting from a known context data item, applications (or context consumers in general) can follow Associations to discover new information, i.e. Entities and Attributes. For example, when a user’s entity is known, an application can follow the “myPreferences” Association to discover the user’s service preferences.
• Spatial queries: As spatial inclusion relations can be directly mapped to Associations, navigational queries can also be used to look up spatial data.
However, this puts the burden of evaluating spatial information on the query client. Spatial queries are a simpler way of discovering spatial information: the query client just provides a spatial description, and the database analyses this description and returns the matching data. To this end, the distributed spatial query processing algorithm utilizes inclusion relations.

Navigational queries require previous knowledge about the content of the database (the context data item entry) and its structure (Associations). When only the semantics of the target context data item are known, both navigational queries and spatial queries are useless. Semantic queries are therefore considered to be the third major context query use case:
• Semantic queries: The context consumer is only aware of the semantics of a context data item. For example, an application that needs to discover a printer in a building creates a query that couples the printer’s OWL description with the spatial description of the building.

Processing of semantic queries is challenging, as it involves first-order logic reasoning or graph matching algorithms. Neither mechanism is well supported by off-the-shelf databases. Section 4.4 describes the approaches for semantic query processing that are supported by the CDDBMS.

Like the database schema, the query language is also inspired by web technology. Each context Model Object is identified by a URL structured as follows:

NDQL://hostname:port/database/ModelType/type/number
The scheme, NDQL, stands for Navigational Database Query Language. Hostname and port identify the Home-Node, while database is the name of the DBMS database that stores the context object. ModelType is one of the following: “Entity”, “Association” or “Attribute”. Type is an arbitrary type descriptor and number is a unique index. Queries can be applied to an object by simply appending a question mark and an NDQL query statement to its identifier. NDQL query statements are basically a subset of SQL and can be directly translated into SQL statements. The language is designed to provide a view on the database that preserves the integrity of the schema. NDQL projection/selection queries are expressed as follows:

GET Entity FROM [assoc-id] WHERE Association type=[entity-type]
GET Entity FROM LOCATION=[coordinates] WHERE sem=[OWL-desc]
The first query looks up all Entities of type [entity-type] that belong to the association set of Association [assoc-id]. The second query looks up all Entities that match the specified OWL description [OWL-desc] and are located in the spatial area [coordinates]. Note that for the first query the reference to the Home-Node is encapsulated in the Association identifier, while for the second it is described by the Location constraint (which may actually refer to multiple Nodes). Modification queries, i.e. inserts, updates and deletes, have the following format:

(UPDATE | ADD | DELETE) :[reference]

where [reference] is a reference to an instantiated context Model Object.
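Because the Home-Node address is encapsulated in the identifier, a Node Manager can determine where to forward a request by parsing the NDQL URL. The following sketch shows one possible way to do this with plain string operations. The function name `parse_ndql`, the `NdqlId` record, and the sample host, port and database name are hypothetical; the component order follows the URL structure given in Section 3.2.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MODEL_TYPES = {"Entity", "Association", "Attribute"}


@dataclass(frozen=True)
class NdqlId:
    host: str
    port: int
    database: str
    model_type: str
    type_name: str
    number: int
    query: Optional[str] = None  # NDQL statement appended after '?', if any

    @property
    def home_node(self) -> Tuple[str, int]:
        # Hostname and port identify the Home-Node of the object.
        return (self.host, self.port)


def parse_ndql(url: str) -> NdqlId:
    scheme, sep, rest = url.partition("://")
    if not sep or scheme.upper() != "NDQL":
        raise ValueError("not an NDQL identifier: " + url)
    # Everything after the first '?' is the query statement.
    rest, _, query = rest.partition("?")
    authority, _, path = rest.partition("/")
    host, _, port = authority.partition(":")
    parts = path.split("/")
    if len(parts) != 4:
        raise ValueError("expected database/ModelType/type/number: " + path)
    database, model_type, type_name, number = parts
    if model_type not in MODEL_TYPES:
        raise ValueError("unknown ModelType: " + model_type)
    return NdqlId(host, int(port), database, model_type, type_name,
                  int(number), query or None)


obj = parse_ndql("NDQL://node.example.org:9000/ctx/Entity/person/42")
```

A missing or non-numeric port or index raises `ValueError` from `int()`, so malformed identifiers are rejected rather than silently routed.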
4 Dealing with Mobile Physical Objects

The CDDBMS is designed to manage spatial models of the physical world that cover large areas or even the entire physical world. To this end, the nodes, and thus their Geo-Ds, are ordered in an R-Tree structure. Fig. 3 illustrates such a structure. Each Geo-D, depicted on the right side, is associated with a node (the corresponding domain node) of the distributed spatial database. A domain context data object stored at each node specifies its geographical coverage. When a node receives a query, it first checks whether one of the queried context objects is within its coverage domain, i.e. its Geo-D. If the query is relevant to the domain, either the corresponding context data items are retrieved immediately, or the query is passed on to the appropriate child Geo-D. If the query is not relevant to the Geo-D, it is passed on to the parent node (provided it did not originate there). Finally, it traverses the spatial CDDBMS node hierarchy down to the responsible Geo-D Nodes. As more than one Node might be responsible for the target Geo-D, the query might be duplicated and forwarded to all responsible Node Managers. Result sets are forwarded directly to the requesting Node Server, which is also responsible for orchestrating join operations.

Fig. 3. Distributed spatial database: Spatial search for a printer in Waterford, Ireland

The outlined routing of spatial queries is not without consequences for the security and storage approach. The security protection measures are detailed in Section 5. To facilitate efficient spatial query processing, spatial data objects must be maintained by the node that manages the corresponding Geo-D. Fig. 4 illustrates how the CDDBMS manages context information of mobile physical objects. A TSSG employee leaves his/her office and drives home, making a stop at a local supermarket on the way. In total, he/she crosses Geo-D boundaries five times. Consequently, a context data object representing the employee within the distributed spatial database should be moved across the four nodes five times.

Fig. 4. Data manipulation in the CDDBMS. An example: Driving home from office and stopping at the supermarket.

Table 1. Parameters of the scalability equation

Parameter   Description
N           Number of user profiles stored in the user profile database.
λ           Movement rate per user, i.e. the rate (per second) at which a user changes Geo-Ds.
p           Probability that a Geo-D change affects only the parent Geo-D. With respect to graph G, p is the probability that a node change affects only its parent node.
L           Height of graph G=(V,E).
m           Maximum degree of a vertex of G, i.e. maximum number of sub Geo-Ds per Node Server.
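The query routing rule described at the beginning of this section (pass the query up while it is not relevant to the local Geo-D, then down to all responsible nodes) can be sketched as follows. This is a simplified model under assumptions, not the CDDBMS algorithm itself: coverage areas are axis-aligned boxes, a query climbs until a node fully covers the queried area, and the most specific intersecting nodes answer. All names and coordinates are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def contains(outer: Box, inner: Box) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def intersects(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class DomainNode:
    name: str
    coverage: Box  # geographical coverage of the node's Geo-D
    parent: Optional["DomainNode"] = field(default=None, repr=False, compare=False)
    children: List["DomainNode"] = field(default_factory=list, repr=False, compare=False)

    def add_child(self, child: "DomainNode") -> "DomainNode":
        child.parent = self
        self.children.append(child)
        return child


def descend(node: DomainNode, area: Box) -> List[str]:
    # Forward the query to every child whose Geo-D intersects the area;
    # a node answers itself only if no child is responsible.
    if not intersects(node.coverage, area):
        return []
    hits = [name for child in node.children for name in descend(child, area)]
    return hits or [node.name]


def route(start: DomainNode, area: Box) -> List[str]:
    # Phase 1: pass the query to the parent until the area is covered.
    node = start
    while node.parent is not None and not contains(node.coverage, area):
        node = node.parent
    # Phase 2: traverse the hierarchy down to the responsible nodes.
    return descend(node, area)


ireland = DomainNode("Ireland", (0, 0, 100, 100))
waterford = ireland.add_child(DomainNode("Waterford", (10, 10, 30, 30)))
other_city = ireland.add_child(DomainNode("OtherCity", (60, 60, 80, 80)))
tssg = waterford.add_child(DomainNode("TSSG", (12, 12, 15, 15)))
```

A query area that spans several Geo-Ds is duplicated in `descend`, which corresponds to forwarding the query to all responsible Node Managers.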
Basically, moving data items between domains scales well, as most changes are performed within local network domains. Yet, any changes to mobile objects must be synchronised with the master copy at the Home-Node. Frequent synchronisations increase the network traffic considerably and thus constitute a critical problem that compromises the scalability of the system, as indicated by the following analysis. The location hierarchy of Node Servers is a directed acyclic graph G=(V,E), where each vertex v ∈ V corresponds to a Node Server and each edge e ∈ E is an inclusion relation with respect to the Geo-Ds. In fact, G is considered to be a tree with a vertex degree of m. Given the parameters described in Table 1, the number of updates per second U_C(N) to the Home-Node (induced by location updates) can be expressed as follows:

$$U_C(N) = \lambda \cdot N$$
Most updates are propagated across the entire network and impose considerable load on the backbone network. It would be much more efficient if most updates were kept local, i.e. were not propagated across the backbone. In such a distributed storage and synchronisation approach, changes are only propagated to the parent Node Server. A distributed approach would have U_D(N) updates per second:

$$U_D(N) = \lambda \cdot N \cdot (1 - p)^{L-1}$$
G is an R-Tree. To simplify the calculations, it has been assumed that the users are randomly distributed among the $m^L$ leaves; thus each leaf node maintains $N/m^L$ users. The performance gain of the distributed approach with respect to the centralised one is given by the following expression:

$$\mathrm{Gain}(N) = \frac{U_C(N)}{U_D(N)} = \frac{m^L}{m^L \cdot (1 - p)^{L-1}}$$
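The gain expression can be evaluated numerically with a few lines of code. The sketch below assumes the exponent L − 1 that appears in the gain expression, so that the distributed update rate is λ · N · (1 − p)^(L−1). The function names and the sample values for N, λ and p are illustrative assumptions; only L = 4 is taken from the discussion of Fig. 5.

```python
import math


def centralised_updates(n_users: int, rate: float) -> float:
    # U_C(N): every Geo-D change is synchronised with the Home-Node.
    return rate * n_users


def distributed_updates(n_users: int, rate: float, p: float, height: int) -> float:
    # U_D(N), assuming an update climbs each of the remaining height - 1
    # levels with probability (1 - p).
    return rate * n_users * (1.0 - p) ** (height - 1)


def gain(p: float, height: int) -> float:
    # Gain = U_C / U_D; lambda, N and the m**L leaf count cancel out.
    return 1.0 / (1.0 - p) ** (height - 1)


# Assumed sample values: 1000 users, 0.01 Geo-D changes per second, p = 0.5.
u_c = centralised_updates(1000, 0.01)
u_d = distributed_updates(1000, 0.01, p=0.5, height=4)
g = gain(p=0.5, height=4)
```

For these assumed values the ratio `u_c / u_d` equals `gain(0.5, 4)`; note that under this formula the result is independent of N, λ and m.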
Fig. 5. Performance gain of the distributed over the centralised synchronisation approach. For parameter settings refer to Table 1.
The gain depends only on the graph structure (height and vertex degree) and the probability p. Fig. 5 illustrates the gain for a rather flat hierarchy with 4 levels, i.e. L=4, and rather moderate p