Implementing Query Operations over Encrypted Sensitive Data

Tinu baby and Ch. Aswani Kumar∗
School of Information Technology and Engineering, VIT University, Vellore 632 014, India.
e-mail: [email protected]
Abstract. With the emergence of cloud environments, data are increasingly provided as a service in the cloud. To maintain the confidentiality of sensitive attributes, the data are encrypted before being stored at the service provider. When the encrypted data are queried, the complete record has to be decrypted to fetch the actual results. This paper concentrates on improving the performance of three types of query operations over encrypted data stored at the server, namely non-aggregate, aggregate and user-defined query operations. The performance of the proposed approach is compared with that of earlier methods in terms of execution time.

Keywords: cloud computing, encryption, query translation.
1. Introduction

Enterprise data keep growing day by day, so an efficient data management system has become a major requirement. Installing a database management system on every computer increases storage and maintenance costs. Hence the data are stored, accessed and maintained in a database at the server [1] rather than in a separate database at each client. The client can then store, retrieve and modify the data at the server through the internet, which reduces the overall cost on the client side. This scenario can be realised in a cloud as a Database-as-a-Service model [4]. But what if the service provider itself tries to access the sensitive data? To guard against this, the data should be stored in encrypted form at the untrusted server on the service provider's site. If the client then queries the encrypted data, three steps are required: 1) transfer the entire encrypted database from the service provider's site to the client, 2) decrypt the entire database, and 3) apply the client query over the decrypted database. This degrades the performance of the system, since transferring the entire database raises the communication cost and decrypting it raises the computational cost. Hence, balancing security and performance is a major challenge. This paper aims to improve the performance of different types of query operations over encrypted data.

∗ Corresponding author.
K. R. Venugopal and L. M. Patnaik (Eds.) ICCN 2013, pp. 106–113. © Elsevier Publications 2013.
2. Background Study

Hakan et al. [1] introduced the concept of a third-party service provider, which can store, modify and retrieve data from the host site through the internet. The paper also investigates the challenges of the database-as-a-service model, including data access from a third-party service provider and its security. Reference [3] addresses the question of what happens if the client does not trust the service provider: to tighten security, all the data are stored in encrypted form at the service provider's site. The data may include both sensitive and non-sensitive attributes, so the computational cost of decrypting even the non-sensitive attributes is an issue. Wang et al. [7] propose a method that stores, for each sensitive character string, a characteristic index value computed by a characteristic function. Many tuples can then be filtered out at the server side simply by checking the characteristic index values of their sensitive attributes. The remaining tuples are sent back to the client, which decrypts them and applies the actual query. This reduces the computational cost of decryption at the client side. For numerical data, a B+ tree index is created before encryption in order to preserve the ordering of the values in the records. To further improve the efficiency and effectiveness of querying encrypted character strings, Wu et al. [4] use an n-phase reachability matrix to generate the characteristic index value of a sensitive character string. Hakan et al. [2] investigate aggregation queries over encrypted data in a cloud. Various other techniques for searching over encrypted data in a cloud are discussed in [5] and [6].

3. Query Operations Over Encrypted Data

Motivated by the analysis in [4], this paper aims to improve the performance of executing queries over encrypted data.
The existing methods of executing queries over encrypted data each consider only one particular query operation. This paper concentrates on improving the performance of three types of query operations over the encrypted data stored at the server: non-aggregate, aggregate and user-defined query operations. Figure 1 shows the system architecture for executing query operations over encrypted data. When the user wants to store sensitive data, the data are first classified as belonging to an aggregate column, a non-aggregate column or a user-defined function (UDF) column. For a non-aggregate column, the index generator uses the metadata to generate an index for each sensitive attribute value before it is encrypted. Aggregate and UDF columns are passed directly to the encryption process. When the user queries the encrypted data, the query operation is classified as a non-aggregation, aggregation or user-defined function operation. A non-aggregation or UDF query undergoes query translation to form the server-side query. An aggregation query first checks the aggregate database, i.e. the local cache, to see whether the result is already stored. If it is, the aggregate is fetched directly from the cache; otherwise the query undergoes query translation to generate the server-side query. The translated query is executed over the encrypted database at the server. For a non-aggregation query, the filtered results are sent to the temporary database. For a UDF operation, the frequency of the queried tuple is checked. If the frequency is above a threshold, the decrypted results are already present in the database at the client. If the frequency is below the threshold, the encrypted data are sent to the temporary database, from where the decrypted results are stored in the new table database.
For an aggregation query executed at the server, the results are sent to the aggregate database at the client side. Finally, the query executor executes the actual client-side query over the data from the temporary database, the aggregate database and the new table database for the non-aggregate, aggregate and user-defined columns, respectively. The architecture in [4] covers only query operations over non-aggregate data, whereas the proposed architecture applies to data in non-aggregate, aggregate and UDF columns.

Figure 1. Architecture of executing different query operations over encrypted data.

4. Different Query Operations Over Sensitive Data

In this paper, query operations are classified into three types: non-aggregate, aggregate and user-defined query operations. The sensitive attributes can be encrypted with any traditional data encryption technique such as AES, RSA [8], Blowfish or DES. Let $R(A_1, A_2, \ldots, A_t, \ldots, A_n)$ be a relation consisting of sensitive as well as non-sensitive attributes, where $n$ is the number of attributes in the relation. The sensitive attribute $A_t$ is stored as $A_t^e$, the encrypted form of $A_t$; for a non-aggregate column, the characteristic index value $A_{ind}$ generated for that sensitive value is stored as well. Hence the relation is stored as $R_d(A_1, A_2, \ldots, A_t^e, \ldots, A_n, A_{ind})$. A tuple $t_i$ at the database service provider is stored as $(a_1, a_2, a_3, \ldots, \mathrm{encrypt}(a_t), \ldots, a_n, \mathrm{index}(a_t))$, where $\mathrm{encrypt}(a_t)$ is the encryption function applied to the sensitive value $a_t$ and $\mathrm{index}(a_t)$ denotes the index generated for $a_t$.
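The stored tuple layout described above can be illustrated with a minimal sketch. The helper name `to_server_tuple` and the stand-in functions `toy_encrypt` and `toy_index` are hypothetical and do not come from the paper; a real deployment would plug in a cipher such as AES and the characteristic index of Section 4.1.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def to_server_tuple(
    row: Sequence[Any],
    t: int,
    encrypt: Callable[[Any], Any],
    index: Optional[Callable[[Any], Any]] = None,
) -> Tuple[Any, ...]:
    """Build (a_1, ..., encrypt(a_t), ..., a_n[, index(a_t)]) from a plain row.

    t is the 0-based position of the sensitive attribute. The index is
    appended only when an index function is given, which corresponds to a
    non-aggregate column.
    """
    stored = list(row)
    plain = stored[t]
    stored[t] = encrypt(plain)
    if index is not None:
        stored.append(index(plain))
    return tuple(stored)


# Stand-ins for illustration only: these are NOT encryption or a real index.
def toy_encrypt(value: str) -> str:
    return value[::-1]


def toy_index(value: str) -> int:
    return len(value)


# Hypothetical record (ID, Name, Age, Area) with Name as the sensitive attribute.
row = (7, "Asha", 34, "Vellore")
stored_row = to_server_tuple(row, 1, toy_encrypt, toy_index)
```

Passing the cipher and the index generator as parameters keeps the layout independent of the particular encryption technique, in line with the statement that any traditional cipher may be used.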
4.1 Non-aggregate query operation

A non-aggregate query operation can be performed over the sensitive attributes using the index value. Let $S = s_1 s_2 \ldots s_t$ be the sensitive string over which the non-aggregation query operation has to be performed, where $t$ is the number of characters in the string ($t \ge 1$). Let $P = (p_1, p_2, \ldots, p_m)$ be the predefined sequence of subsets over which the sensitive characters are defined, where $m$ is the number of partitions of the complete character set ($m \ge 1$). To find the n-phase reachability matrix [4], the connected pairs of elements at each phase are determined first:

$$CP_n = \begin{cases} \{(s_i, s_{n+i}) \mid 1 \le i \le t-n\}, & \text{if } t \ge n+1, \\ \emptyset, & \text{otherwise.} \end{cases}$$

Three bijective functions $g_1$, $g_2$ and $g_3$ are then defined, each having $m$ elements. The n-phase reachability matrix is obtained as $M^n = (r_{ij})_{(m+1) \times m}$, where, for $1 \le i \le m$ and $1 \le j \le m$,

$$r_{ij} = \begin{cases} 1, & \text{if } (g_1^{-1}(i), g_2^{-1}(j)) \in (CP_1 \cup CP_2 \cup \cdots \cup CP_n), \\ 0, & \text{otherwise,} \end{cases} \qquad (1)$$

and for the last row, i.e. $i = m+1$ and $1 \le j \le m$,

$$r_{ij} = \begin{cases} 1, & \text{if } g_3^{-1}(j) = s_1 \text{ or } g_3^{-1}(j) = s_t, \\ 0, & \text{otherwise.} \end{cases}$$

The matrix $M^n$ formed by these conditions is a Boolean matrix. Its size depends on the number of subsets over which the sensitive character string is defined, so all strings defined over the same subsets have matrices of the same size. The first $m$ rows record the connected pairs of elements in the sequence, and the last row records the front element and the tail element of the character sequence. The matrix is written out linearly, one row after another, and the result is denoted $R_{bit}(M^n)$. Using this matrix, similarity among strings and substrings of an encrypted string can be detected.

To find whether a string $S_a$ is present in another string $S_b$, three further matrices are generated for $S_a$:

(a) $M^n_{left} = (r_{ij})_{(m+1) \times m}$, (b) $M^n_{right} = (t_{ij})_{(m+1) \times m}$, (c) $M^n_{mid} = (u_{ij})_{(m+1) \times m}$,

where, for $1 \le i \le m$ and $1 \le j \le m$,

$$r_{ij} = t_{ij} = u_{ij} = \begin{cases} 1, & \text{if } (g_1^{-1}(i), g_2^{-1}(j)) \in (CP_1 \cup CP_2 \cup \cdots \cup CP_n), \\ 0, & \text{otherwise.} \end{cases}$$

For the last row, i.e. $i = m+1$ and $1 \le j \le m$, the entries are

(a) $r_{ij} = 1$, $t_{ij} = 0$, $u_{ij} = 0$ if $g_3^{-1}(j) = s_1$,
(b) $r_{ij} = 0$, $t_{ij} = 1$, $u_{ij} = 0$ if $g_3^{-1}(j) = s_t$,
(c) $r_{ij} = 0$, $t_{ij} = 0$, $u_{ij} = 0$ otherwise.

From the three matrices $M^n_{left}$, $M^n_{right}$ and $M^n_{mid}$ generated for the string $S_a$, it can be inferred whether $S_a$ is a substring at the front, tail or middle of a string $S_b$. If $S_a$ is present at the front of $S_b$, then

$$R_{bit}(M^n_{left}(S_a)) = R_{bit}(M^n_{left}(S_a)) \wedge R_{bit}(M^n(S_b)), \qquad (2)$$

where $\wedge$ denotes the bitwise AND of the two bit strings. If $S_a$ is a substring at the end of $S_b$, then

$$R_{bit}(M^n_{right}(S_a)) = R_{bit}(M^n_{right}(S_a)) \wedge R_{bit}(M^n(S_b)). \qquad (3)$$

And if $S_a$ is a substring in the middle of $S_b$, then

$$R_{bit}(M^n_{mid}(S_a)) = R_{bit}(M^n_{mid}(S_a)) \wedge R_{bit}(M^n(S_b)). \qquad (4)$$

These matrices can be used at the server to filter the results of user queries over the encrypted characters. The two-phase filtering algorithm [4] is used to filter the tuples communicated between the user and the server.
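The construction of Section 4.1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of [4]: the partition `PARTS` is hypothetical, $g_1$, $g_2$ and $g_3$ are all assumed to map the subset $p_k$ to position $k$, a pair of characters is recorded through the subsets that contain them, and the function names `partition_of`, `rbit` and `may_contain` are invented for the example.

```python
from typing import Sequence


def partition_of(ch: str, partitions: Sequence[str]) -> int:
    """Return the 0-based index k of the subset p_k that contains ch."""
    for k, subset in enumerate(partitions):
        if ch in subset:
            return k
    raise ValueError(f"character {ch!r} is not covered by the partition")


def rbit(s: str, partitions: Sequence[str], n: int, mode: str = "full") -> int:
    """Flatten the (m+1) x m n-phase reachability matrix of s into an integer.

    mode selects the last row: "full" marks front and tail (M^n), "left"
    marks only the front, "right" only the tail, and "mid" marks neither.
    """
    if not s:
        raise ValueError("the string must contain at least one character")
    if mode not in ("full", "left", "right", "mid"):
        raise ValueError(f"unknown mode {mode!r}")
    m = len(partitions)
    idx = [partition_of(ch, partitions) for ch in s]
    matrix = [[0] * m for _ in range(m + 1)]
    # First m rows: connected pairs from CP_1, ..., CP_n.
    for phase in range(1, n + 1):
        for i in range(len(idx) - phase):
            matrix[idx[i]][idx[i + phase]] = 1
    # Last row: front and/or tail element, depending on the mode.
    if mode in ("full", "left"):
        matrix[m][idx[0]] = 1
    if mode in ("full", "right"):
        matrix[m][idx[-1]] = 1
    # Write the matrix out row by row as one bit string.
    bits = 0
    for row in matrix:
        for cell in row:
            bits = (bits << 1) | cell
    return bits


def may_contain(sa_bits: int, sb_bits: int) -> bool:
    """Server-side test of Eqs. (2)-(4): every bit of S_a is also set in S_b."""
    return (sa_bits & sb_bits) == sa_bits


# Hypothetical partition of the lowercase alphabet into m = 4 subsets.
PARTS = ("abcdefg", "hijklmn", "opqrstu", "vwxyz")
stored_index = rbit("health", PARTS, n=2)
prefix_hit = may_contain(rbit("heal", PARTS, 2, "left"), stored_index)
```

The test is a filter only: a tuple that fails it cannot contain the queried substring, while a tuple that passes is a candidate that still has to be decrypted and checked at the client.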
Table 1. Steps for user-defined query operation.

Step 1. Set the threshold, say t.
Step 2. Initialise the count value of all tuples to 0.
Step 3. When a tuple is queried, check its count.
Step 4. If (count < t): increment count; send the tuple to the temporary database; decrypt and return the tuple to the user.
Step 5. If (count = t): increment count; send the tuple back to the client for decryption; store the result of the user-defined function in a database.
Step 6. If (count > t): directly fetch the result of the user-defined function from the client's database.
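The steps in Table 1 can be sketched as a small client-side cache. The class name `UdfResultCache`, the `decrypt` and `udf` callables and the in-memory dictionaries are assumptions made for illustration; the paper keeps the count in a database attribute and the results in a client database, and it counts queries per tuple and operand, whereas this sketch keys on a tuple identifier only.

```python
from typing import Any, Callable, Dict, Hashable


class UdfResultCache:
    """Count how often a tuple is queried and cache its UDF result (Table 1)."""

    def __init__(
        self,
        threshold: int,
        decrypt: Callable[[Any], Any],
        udf: Callable[[Any], Any],
    ) -> None:
        self.threshold = threshold               # Step 1
        self.decrypt = decrypt
        self.udf = udf
        self.count: Dict[Hashable, int] = {}     # Step 2: missing key means 0
        self.results: Dict[Hashable, Any] = {}   # stands in for the client database

    def query(self, tuple_id: Hashable, ciphertext: Any) -> Any:
        count = self.count.get(tuple_id, 0)      # Step 3
        if count > self.threshold:               # Step 6: no decryption needed
            return self.results[tuple_id]
        self.count[tuple_id] = count + 1         # Steps 4 and 5: increment
        value = self.udf(self.decrypt(ciphertext))
        if count == self.threshold:              # Step 5: remember the result
            self.results[tuple_id] = value
        return value


# Illustration with a stand-in "decryption" that records how often it runs.
decrypt_calls = []


def fake_decrypt(ciphertext: str) -> str:
    decrypt_calls.append(ciphertext)
    return ciphertext[::-1]   # NOT real decryption


cache = UdfResultCache(threshold=2, decrypt=fake_decrypt, udf=str.upper)
answers = [cache.query("row1", "ahsa") for _ in range(5)]
```

With a threshold of 2, the first three queries decrypt the tuple and the remaining ones are served from the stored result, which is the saving the scheme is after.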
4.2 User-defined query operation

To perform a user-defined query operation over the encrypted database, the frequency with which the queried tuple is used is determined. This is the number of times the same tuple of the database has previously been queried with the same operand, and it is stored as a count value in a separate attribute of the database. A threshold is set for this count value, above which the tuple is considered frequently used. Frequently used encrypted tuples are brought back to the client side, where they are decrypted, the necessary user-defined functions are applied to them, and the results are stored in the client's database. The next time these tuples are queried, the result is obtained directly from the client's database instead of decrypting the attributes again to apply the user-defined function. This reduces both the cost of decrypting the attributes and the communication cost of retrieving them each time they are queried (see Table 1).

4.3 Aggregate query operation

When an aggregate query operation is performed on the database, the results are fetched directly from the encrypted database without decrypting it. For the aggregate COUNT, the result is obtained directly by counting over the encrypted data. Other aggregation operations such as SUM and AVG can be performed using homomorphic encryption [9]. In this type of encryption, arithmetic operations are performed over the encrypted data and the resulting ciphertext is sent back to the client, where it is decrypted to give the actual result. Such schemes rely on number-theoretic results such as Fermat's little theorem. Only the encrypted result retrieved from the server has to be decrypted, instead of decrypting all the values and then computing the result.
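To make the idea of a SUM over ciphertexts concrete, the following sketch uses a toy Paillier cryptosystem. Paillier is swapped in here because it is a compact, well-known additively homomorphic scheme; it is not the privacy homomorphism of [9], and the paper does not say which scheme was used. The primes are far too small for real use, the function names are invented for the example, and `pow(x, -1, n)` requires Python 3.8 or later.

```python
import math
import random
from typing import Iterable, Tuple


def keygen(p: int, q: int) -> Tuple[int, Tuple[int, int]]:
    """Return the public modulus n and the private pair (lambda, mu)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    mu = pow(lam, -1, n)   # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """Client side: c = (n+1)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n: int, private_key: Tuple[int, int], c: int) -> int:
    """Client side: m = L(c^lambda mod n^2) * mu mod n, with L(x) = (x-1)/n."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def encrypted_sum(n: int, ciphertexts: Iterable[int]) -> int:
    """Server side: multiplying ciphertexts adds the hidden plaintexts."""
    n2 = n * n
    total = 1   # 1 is a valid encryption of 0
    for c in ciphertexts:
        total = (total * c) % n2
    return total


# Hypothetical Weight column; the server only ever sees the ciphertexts.
n, private_key = keygen(293, 433)   # toy primes, insecure
weights = [62, 75, 81]
stored = [encrypt(n, w) for w in weights]
total_weight = decrypt(n, private_key, encrypted_sum(n, stored))
```

Only one ciphertext travels back to the client and only one decryption is performed, regardless of how many rows were aggregated. An AVG can then be obtained at the client by dividing the decrypted sum by the COUNT, and the sum must stay below the modulus n for the result to be exact.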
Because only the aggregate result is decrypted, the computational cost of decryption at the client side is greatly reduced. The query results thus obtained are stored, after decryption, in the local cache at the client for future reference. When data are subsequently stored, the system checks whether the attribute value exists in the local cache. If so, it updates the local cache; otherwise it obtains the aggregation result from the encrypted database at the server and sends it to the local cache of the client.

5. Experimental Analysis

To validate the proposed approach, experiments were conducted to measure the execution time with varying string size. Sensitive data from a health centre dataset are used for the experiments. The attributes of each record are ID, Name, Age, Area, Height and Weight. To encrypt these data, the AES algorithm is implemented in Java. The performance of query operations over non-aggregate, aggregate and user-defined function columns is compared with that of the traditional system with respect to execution time. In the traditional model, querying any data at the service provider's site transmits the entire encrypted database back to the client, which then decrypts the whole database and applies the actual query over it. The execution time includes the time required for query processing, encryption and decryption. Figures 2, 3 and 4 plot the execution time for performing the non-aggregation, aggregation and user-defined query operations over sensitive strings against increasing string size
Figure 2. Execution time for performing non-aggregation query operation.
Figure 3. Execution time for performing aggregation query operation.
Figure 4. Execution time for performing user defined query operation.
respectively. With the same encrypted database used in both the traditional and the proposed system, the graphs show that the execution time required for querying the sensitive character strings is lower in the proposed method than in the traditional approach for all three query operations.

6. Conclusion

This paper aims to execute different types of query operations over encrypted data in the Database-as-a-Service model. The data are classified into aggregation, non-aggregation and UDF columns. Non-aggregated sensitive data are stored along with an index and are queried based on this index. For data in aggregation and UDF columns, frequently used data are sent to the local cache of the client for future queries. The approach is found to be more efficient than the traditional method in terms of execution time.

Acknowledgement

The authors acknowledge the grant from NBHM, Govt. of India, under grant number 2/48(11)/2010R&D II/10806.

References

[1] Hakan, H., Bala, L. and Sharad, M.: Providing Database as a Service. In Proceedings of ICDE'02, San Jose, USA, 29–38 (2002).
[2] Hakan, H., Bala, L. and Sharad, M.: Efficient Execution of Aggregation Queries Over Encrypted Relational Databases. In Proceedings of DASFAA'04, South Korea, Springer-Verlag, Berlin, 125–136 (2004).
[3] Hakan, H., Bala, L. and Chen, L.: SQL Over Encrypted Data in the Database Service Provider Model. In Proceedings of SIGMOD'02, Madison, ACM Press, New York, 216–227 (2002).
[4] Wu, Z., Xu, G., Yu, Z., Yi, X., Chen, E. and Zhang, Y.: Executing SQL Queries Over Encrypted Character Strings in a Database-as-Service Model. Knowledge-Based Systems, 35, 332–348 (2012).
[5] Hakan, H. and Bala, L.: Query Optimization in Encrypted Database Systems. In Proceedings of DASFAA'05, Beijing, Springer-Verlag, Berlin, 43–55 (2005).
[6] Cui, B. G., Liu, D. X. and Wang, T.: Practical Techniques for Fast Searches on Encrypted String Data in Databases. Computer Science, 33(6), 115–120 (2006).
[7] Wang, Z. F., Wang, W. and Shi, B.-L.: Storage and Query Over Encrypted Character and Numerical Data in Database. In Proceedings of CIT'05, Shanghai, China, 77–81 (2005).
[8] Cormen, T. H., Leiserson, C. E., Rivest, R. L. and Stein, C.: Introduction to Algorithms. Second ed., MIT Press (2001).
[9] Domingo-Ferrer, J.: New Privacy Homomorphism and Applications. Information Processing Letters, 60, 277–282 (1996).