International Journal of Digital Content Technology and its Applications, Volume 4, Number 3, June 2010

Block-Ranking: Content Similarity Retrieval Based on Data Partition in Network Storage Environment

Jingli Zhou (1, First Author), Ke Liu (2, Corresponding Author), Leihua Qin (3), Xuejun Nie (4)
Huazhong Univ. of Sci. and Tech.
[email protected], [email protected], [email protected], [email protected]

doi: 10.4156/jdcta.vol4.issue3.8
Abstract

Nowadays, data partition plays an important role in eliminating duplicate data in green storage and cloud storage systems. Fixed-sized chunking and content-based chunking are two commonly used partition methods that break a file into a sequence of blocks. Meanwhile, the inverted index has become the standard indexing method in modern information retrieval. For convenience of analysis, the inverted index is represented as an inverted matrix, and a MapReduce strategy is used to decompose a complex matrix computation into a set of smaller, parallelizable sub-matrix computations. To efficiently exploit block information when calculating file correlation, we propose a novel Block-Ranking method that uses the number of duplicate blocks to weight the correlation among files. Experiments indicate that the Block-Ranking algorithm significantly improves retrieval efficiency in a network storage environment.
Keywords: Data Partition, Block Ranking, Content Aware, Cloud, Similarity Retrieval, Matrix Partition, MapReduce
1. Introduction

Cloud storage is a new type of network storage environment that has rapidly grown in popularity over the past few years. Massively scalable computing and storage resources are delivered as services using Internet technologies. As an efficient management method, the reemergence of virtualization allows the physical resources in a cloud system to be shared among a vast number of users, so thousands of independent service consumers may share the same storage service in the future. As storage space is accessed by multiple users in the cloud [1], more and more files are shared by different applications. How to eliminate duplicate data in cloud storage systems, and how to use the information carried by shared data to calculate file correlation, have become hot research topics in network storage [2]. In addition, when one user modifies a shared file in his data space, how to propagate that change to other users is another important problem we must confront.

Meanwhile, energy awareness is one of the most significant issues in green storage [3]. Since duplicate data consume considerable storage capacity and energy, we must find effective means of lowering energy consumption while accommodating data growth. Current duplicate-data detection and elimination methods commonly focus on saving storage space and reducing storage costs, but seldom take the relationships between files into consideration. In the information retrieval field, however, methods such as Google's PageRank [4, 5] show that the abundant relationships among web resources can help generate useful rankings of search results, and it is therefore reasonable to exploit the links between web documents to calculate their correlation [6]. Since high correlation also exists between duplicate files, we can similarly use duplication information to calculate the coefficient between files.

The new opportunities introduced by cloud and green storage motivate us to propose a more powerful correlation algorithm, named Block-Ranking, in this paper. MapReduce [7] is a programming model and framework introduced by Google for data-intensive cloud computing on commodity clusters; we use it to reduce computing overhead and improve processing efficiency by parallelizing the Block-Ranking application in the cloud environment.
The remainder of the paper is organized as follows. Section 2 describes the architecture of the cloud system. Section 3 introduces three common data partition methods that separate a file into individual blocks. In section 4, we use hash functions to generate block identifiers and data identifiers; meanwhile, an inverted index is used to improve retrieval efficiency, and two cases of the inverted matrix are shown. In section 5, the inverted matrix and the Block-Ranking algorithm are used to calculate file correlation. In section 6, we take advantage of a MapReduce strategy to increase the parallelizability of the Block-Ranking algorithm. Section 7 presents the experimental results, and conclusions are drawn in section 8.
2. Architecture of Cloud System

Cloud offers a powerful abstraction that provides virtualized infrastructure, platform or software as a service, hiding the complexity of fine-grained resource management from the user. In a traditional network storage environment, users often store their data on local storage devices, whereas service consumers in a public cloud store all of their data in the cloud's common storage pool; the architecture of the cloud therefore differs from that of traditional storage systems [8]. In this paper, we regard the cloud as a new type of network storage environment, and the Block-Ranking algorithm that follows is implemented in this environment. Meanwhile, with the fast-paced growth of digital data and exploding information retrieval costs, researchers are looking for new ways to effectively query related data in multi-user environments [9, 10]. In this section, we describe a cloud architecture that integrates storage service, computing service and retrieval mechanisms into a unified framework to support content similarity searching.
[Figure 1. The architecture of cloud system: users (User1, User2, ..., Usern) access the Service Provider & Interface Layer (SPIL); the Virtualization of Storage & Computing Layer (VSCL) exposes storage, computing and retrieval services over a virtual resource pool; the Nodes of Storage & Computing Layer (NSCL) contains the storage and computing nodes (CPU & GPU).]
As shown in Figure 1, the cloud architecture is composed of three layers: the Nodes of Storage & Computing Layer (NSCL), the Virtualization of Storage & Computing Layer (VSCL), and the Service Provider & Interface Layer (SPIL). NSCL collects all kinds of storage and computing resources (CPU & GPU), which can dynamically join or leave as intelligent nodes. It also provides three access modes, block, file and object interfaces, to extend storage scalability. The underlying resources are virtualized by the middle VSCL, which integrates heterogeneous storage and computing power into a virtual resource pool. When users need to process information stored on the cloud platform, a virtual machine is generated to handle the processing, and the applications encapsulated within virtual machines are dynamically mapped onto a pool of physical servers by VSCL. To interact conveniently with users, virtualized storage, computing and retrieval services are provided to the upper level, and all kinds of applications can easily use the cloud system through standard service interfaces such as SOAP or REST in SPIL. As storage space is accessed by multiple users in the cloud environment, duplication among files becomes a common phenomenon. Figure 1 also indicates a typical multi-user usage scenario.
User1 and User2 are two individual tenants who deliver Data1 and Data2 separately to the cloud system through the storage service interface in SPIL. Storage capacity is then allocated to them according to their storage requests. After the content-based partition described in section 3 is applied, Data1 is divided into 3 blocks and Data2 is divided into 2 blocks. They share one duplicate block in the storage node, which can be regarded as an inter-relationship between the two files. If User2 allows Data2 to be shared by other users, the duplicate block's reference count becomes two. We assume that files containing at least one duplicate block are related files, and that files shared between different users are regarded as related even if they have no duplicate blocks. All of this block-related information can be used to calculate the correlation between files. When users deliver query requests through the retrieval service interface, content similarity retrieval is executed, and both the files matching the query and the related files that share duplicate blocks are returned to users for browsing. The inverted indexing structure and the Block-Ranking method are described in the following sections.
3. Content-based Data Partition

To explore file redundancy, methods are needed to separate a file into several small parts. There are three commonly used partition strategies. The first analyzes the file structure to extract information; for example, we can partition an email into its attachment and main body, and by comparing the hash values of attachments, the relationship between emails can be derived. This method is easy to implement but does not scale well, since it can only handle specific file types. Fixed-sized partitioning [11] is a second strategy: it employs a fixed block size, independent of the stored content, and objects are partitioned into blocks of the same size. This method is highly sensitive to file modification, because any change near the beginning of a file may shift the content of all subsequent blocks, and it is hard to find a single partition length suitable for all files. The third approach, introduced in [12], is content-based chunking: a file is broken into a sequence of chunks whose boundaries are determined by the local contents of the file. The algorithm divides objects into variable-sized chunks, instead of fixed-sized blocks, to make duplicate data easier to identify. The prototypical content-based chunking algorithm is the sliding window algorithm; by partitioning objects into variable-sized chunks, it reduces sensitivity to modifications and makes file correlation more convenient to calculate. To efficiently use the partitioned blocks for computing file coefficients, a hash function is needed to convert the physical chunks into logical values. The advantage of using hash values, rather than the chunks themselves, to implement Block-Ranking lies in the shorter comparison time, and their fixed length is convenient for lookup.
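To make the sliding-window idea concrete, the following is a minimal Python sketch of content-based chunking. It uses a simple polynomial rolling hash in place of the Rabin fingerprints typically used in practice, and the window size, modulus and boundary mask are illustrative assumptions rather than values from this paper:

```python
# Minimal sketch of sliding-window content-based chunking. The rolling
# hash, window size, and boundary mask are illustrative assumptions;
# production systems typically use Rabin fingerprints.

WINDOW = 48            # bytes covered by the sliding window
MASK = (1 << 13) - 1   # boundary test mask: expected chunk size ~8 KB
PRIME = 31
MOD = (1 << 61) - 1

def chunk_boundaries(data: bytes):
    """Return (start, end) byte ranges of content-defined chunks."""
    chunks, start, h = [], 0, 0
    drop = pow(PRIME, WINDOW - 1, MOD)   # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * drop) % MOD
        h = (h * PRIME + byte) % MOD
        # A boundary is declared where the low bits of the hash vanish,
        # so boundaries depend only on local content and survive edits
        # elsewhere in the file.
        if i >= WINDOW and (h & MASK) == 0:
            chunks.append((start, i + 1))
            start = i + 1
    if start < len(data):
        chunks.append((start, len(data)))
    return chunks
```

Because each boundary depends only on the WINDOW bytes around it, inserting data into a file changes chunk boundaries only near the edit, which is exactly the insensitivity to modification described above.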
[Figure 2. Structure of Block_ID and Data_ID: the sequence of block identifiers (Block_ID1, Block_ID2, ..., Block_IDn) is fed to the MD5 hash function to produce the file's Data_ID.]
4. Content Hashing and Indexing

4.1. Structure of Partitioned Blocks
The workflow of content hashing is as follows. First, the file is partitioned by one of the data partition methods defined above. We then use the SHA-1 hash function to calculate the hash value of each block; this value represents the block address and is named the block identifier (Block_ID) [13]. As shown in Figure 2, the record is represented as a series of binary values, and the Block_ID's suffix corresponds to the block's offset in the file. Second, although the list of Block_IDs can be regarded directly as the fingerprint of a file, the length of a file is variable and so the number of blocks is not fixed. To identify a file uniformly, we therefore apply MD5 to the binary sequence above to calculate a single file content identifier. The result is also a hash value, denoted Data_ID.
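A small sketch of this two-level hashing, assuming the chunker above; hashlib provides the SHA-1 and MD5 functions the paper names, but joining the hex digests with '|' before hashing is our own encoding assumption:

```python
import hashlib

def block_ids(chunks):
    """SHA-1 of each chunk's content is its Block_ID; the list order
    corresponds to the block offsets within the file."""
    return [hashlib.sha1(chunk).hexdigest() for chunk in chunks]

def data_id(ids):
    """MD5 over the ordered Block_ID sequence gives the file's Data_ID.
    The '|' separator is an illustrative encoding choice."""
    return hashlib.md5("|".join(ids).encode("ascii")).hexdigest()
```

Any change to a block changes its Block_ID and, through the second hash, the file's Data_ID, matching the synchronous update of both identifiers noted in the next subsection.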
4.2. Matrix of Inverted File

In modern information retrieval, the inverted file [14, 15] has become the standard indexing method in many content-based retrieval applications. An inverted file is composed of a term table and a record file: the term table contains the words that formulate files, and the record file lists all files containing a given term. Since a Data_ID consists of a set of Block_IDs, and a Block_ID plays the same role in a Data_ID as a term (keyword) does in a record file, we can use the inverted file method to construct an inverted matrix of the storage system and deliver efficient content similarity retrieval. By analogy with the inverted file, the hash value of a block is regarded as a word in a file: a Block_ID is the representation of a word, and a Data_ID is the identifier of a file. As shown in Figure 3, the structure of a Data_ID is represented as a list of (Block_ID, offset) entries so as to conveniently record the position of blocks. With this additional position information, we can efficiently calculate the correlation between files and improve retrieval precision. The core of indexing lies in constructing a document-level index for Block-Ranking query evaluation. Note that the values of Block_ID and Data_ID change synchronously with any change of file content. For convenience of analysis, the fingerprints are represented in matrix form. As shown in Figure 3, each row of the matrix represents a Block_ID, where the index i indicates the sorted sequence of blocks; each column represents a file's Data_ID, and element Mi,j indicates the offset of Block_IDi in Data_IDj. When a new file is added to the storage system, a column is inserted into the matrix and the block offset information is recorded in the corresponding positions. For example, suppose a new file m, which the data partition method divides into two blocks, is inserted. By comparing the new Block_IDs with the stored ones, we find that the first block is the same as Block_ID3, so '1' is inserted at row 3, column m of the matrix. No stored Block_ID matches the hash value of the second block of file m, so '2' is inserted at row n, column m to indicate that it is the second block of file m.
[Figure 3. Inverted File and Inverted Matrix: on the inverted-file side, each Block_ID (Block_ID1, ..., Block_IDi, ..., Block_IDn) points to the list of files (DATA1, DATA2, ...) containing that block; on the matrix side, rows are Block_IDs, columns are files (Data1, Data2, ..., Datam), and element Mi,j holds the offset of Block_IDi in Data_IDj (0 if absent).]
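The worked example above can be sketched with a sparse dict-of-dicts standing in for the inverted matrix; this layout is our assumption, not an implementation from the paper:

```python
from collections import defaultdict

# Sparse stand-in for the inverted matrix of Figure 3:
# matrix[block_id][data_id] = 1-based offset of the block in the file.
matrix = defaultdict(dict)

def insert_file(data_id, ids):
    """Insert one file as a new column. Blocks whose Block_ID already
    exists reuse that row (as when file m's first block matched
    Block_ID3); unseen Block_IDs implicitly create a new row."""
    for offset, block_id in enumerate(ids, start=1):
        matrix[block_id][data_id] = offset
```

Looking up matrix[block_id] then returns every file containing that block together with its offset, which is all the later Block-Ranking steps need.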
4.3. Notification of File Renewal

In addition to the inverted index matrix above, which uses content partition and duplication information to represent the relationship between blocks and files, a similar case is the sharing of files among different users mentioned in section 2. In a cloud environment, files are shared by many users, so we can use an inverted matrix to represent the relationship between users and the files they own. If the content of a shared file is modified by its owner, the other co-owners should be notified for consistency. As shown in Figure 4, the procedure is as follows: first, the changed file's Data_ID is used to locate the modified blocks in the Block-Data matrix; then the files containing the corresponding duplicate blocks are identified, and their Data_IDs are used to find the users who share those files. For example, when User2 modifies Data2 in his own storage space, the Block-Data matrix shows that Block2 is the block of Data2 that has changed. Because Data1 shares this block with Data2, we conclude that the owner of Data1 must be notified of the change. From the Data-User matrix, it is easy to find that the unique owner of Data1 is User1, so we deliver the block-change information to User1 for consistency. The relationship between files is thus based on the partitioned blocks rather than on the files themselves, so this block information can be used to calculate content correlation, as discussed further in the next section.
[Figure 4. Block-Data and Data-User Inverted Indexing Matrices: the Block-Data matrix maps blocks (Block1, Block2, Block3, ..., Blockn) to the files containing them (Data1, Data2, ..., Datam) with their offsets; the Data-User matrix maps files (Data1, Data2, Data3, ..., Datan) to the users who own or share them (User1, User2, User3, ..., Userm).]
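A sketch of the notification lookup, assuming the dict-of-dicts matrices from the previous sketch (block_data for the Block-Data matrix, and data_user mapping each Data_ID to the set of users who own or share it):

```python
def users_to_notify(changed_blocks, modified_data_id, block_data, data_user):
    """Follow the two matrices of Figure 4: changed blocks -> other files
    sharing them (Block-Data), then those files -> their users (Data-User).
    Both structures are assumed layouts, not APIs from the paper."""
    notify = set()
    for block_id in changed_blocks:
        for data_id in block_data.get(block_id, {}):
            if data_id != modified_data_id:   # skip the modified file itself
                notify |= data_user.get(data_id, set())
    return notify
```

In the example above, changed_blocks containing only Block2 leads through Data1 to the single user User1.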
5. Block Ranking

Previous studies [16] of file correlation entail extracting semantic information from documents, but seldom concentrate on block correlations. Block_IDs can reflect the relationship between blocks by identifying identical hash values across files. Since a document is represented as a vector composed of a series of hash values, we use Block_IDs to calculate correlation, so the correlation between files is converted into a relationship between the hash values of blocks. When a user submits a query request through the service interface, the cloud system returns not only the exactly matching results but also highly related ones. The related files here are those that share certain Block_IDs, so the coefficient being calculated is a block-based content similarity rather than the commonly used semantic similarity; we call this algorithm Block-Ranking. The steps are as follows:
[Figure 5. Correlation among logic groups based on Block-Ranking: file A1 (group A) shares duplicate blocks with files B1, B2 and B3 (group B).]

First, according to the Data-User matrix, we cluster all files belonging to the same user into one logical group, defined as a Group, so the files in a storage system are clustered into several logical groups {GroupA, GroupB, ..., GroupN}; each file in a logical Group is represented as a Data. Second, the Block_IDs are extracted from the Data in each Group, and
the similar files containing the same Block_IDs can be found from the Block-Data matrix by comparing the hash values of the Block_IDs. These files are sometimes distributed across different logical groups, but since they physically share duplicate blocks, they can reshape a physical group consisting of several related logical groups. The relationship among Groups can then be calculated with the following formula:

SIM_Inter(Group_i, Group_j) = Num_DUP_i,j / Num_MAX,

in which Num_DUP_i,j represents the number of duplicate blocks between the two logical groups, and Num_MAX represents the total block count of Group_i or Group_j, whichever is bigger. All files containing at least one duplicate block are regarded as similar files, and we cluster the closest logical groups into one big group using the formula above.

The formula above represents the inter-relationship between two groups; the inner-relationship among files within one group can be calculated in a similar way. The ratio between the number of duplicate blocks and the total block count of a file can be taken as the correlation of two files. We define the similarity between Data_m and Data_n as:

SIM_Inner(Data_m, Data_n) = Num_DUP_m,n / NUM_m,

in which NUM_m represents the total block count of Data_m. These two formulas consider not only the logical information between users but also the physical information of duplicate blocks. Finally, when a user delivers a query request, the matching results in the logical group and the related files containing duplicate blocks in other groups are returned to the user for browsing. As shown in Figure 5, group A and group B are two logical groups in one storage system. File A1 is the only element of group A, and group B consists of three files. Suppose A1 is the unique result that completely matches the query request, while B1, B2 and B3 are files that share some blocks with file A1. According to the Block-Ranking method, group A and group B are clustered into
one big group, and the file correlations within the new group are SIM(A1,B1) = 2/3, SIM(A1,B2) = 1/3, and SIM(A1,B3) = 1/3. We therefore regard B1, B2 and B3 as the files most similar to A1 and return all four files to the user for browsing in sorted order.
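The two formulas translate directly into code. In this sketch, groups and files are represented as sets of Block_IDs, so duplicates are counted over distinct blocks, which is our reading of Num_DUP:

```python
def sim_inter(group_i, group_j):
    """SIM_Inter(Group_i, Group_j) = Num_DUP_i,j / Num_MAX.
    Groups are sets of Block_IDs; Num_MAX is the larger block count."""
    num_max = max(len(group_i), len(group_j))
    return len(group_i & group_j) / num_max if num_max else 0.0

def sim_inner(data_m, data_n):
    """SIM_Inner(Data_m, Data_n) = Num_DUP_m,n / NUM_m. Note the
    asymmetry: the denominator is Data_m's own block count."""
    return len(data_m & data_n) / len(data_m) if data_m else 0.0

# Figure 5's example: A1 has 3 blocks, B1 shares 2 of them,
# reproducing SIM(A1,B1) = 2/3.
a1 = {"b1", "b2", "b3"}
assert sim_inner(a1, {"b1", "b2", "x"}) == 2 / 3
```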
[Figure 6. Vertical Partition and Block Partition: the inverted matrix is first split column-wise by logical group (vertical partition); the resulting sparse sub-matrices are then rearranged so that they mainly contain non-zero elements (block partition).]
6. Matrix Partition

According to the analysis above, the scale of the inverted matrix explodes as the matrix dimension increases. For example, in a storage system with 100 files, each divided into 10 blocks, a 1000*100 matrix may be needed to describe the relationship between blocks and files in the worst case. To improve retrieval efficiency, we must divide a high-dimensional matrix into several sub-matrices, reducing the matrix dimension while using parallelism to speed up computation. As shown in Figure 6, vertical partition and block partition are the two methods implemented in our cloud environment. Because files can be classified into groups by the clustering method [17] described in section 5, we can use those groups to partition the matrix: the columns belonging to the same group are partitioned into one sub-matrix. This principle minimizes the correlation between the partitioned sub-matrices, and this type of partition is called vertical partition. Since the dimension of the inverted matrix is very high, the sub-matrices' dimensions may remain high even after vertical partitioning. Because each sub-matrix is sparse,
we can use another method, called block partition, to modify the distribution of a sub-matrix so that it mainly contains non-zero elements. In large-scale systems, taking the cloud environment as an example, data processing is often performed in a decentralized manner, and most applications are characterized by parallel processing of massive data sets stored in the underlying shared storage system. MapReduce is a popular programming model that enables such data analytics, improving processing efficiency by processing data in parallel. The model is easy to use since it hides the details of parallelization, fault tolerance, locality optimization and load balancing. A MapReduce job is composed of two steps: map and reduce. As shown in Figure 7, the partitioned inverted matrix is divided into several sub-matrices denoted P1, P2, ..., Pn, which are distributed to many machines for processing. After all of the computing processes finish, a Reduce operation integrates the mapped information into the final result.
[Figure 7. Matrix partition through MapReduce: the sub-matrices P1, P2, P3, ..., Pn are dispatched by the Map step to different machines, and the partial results are merged by the Reduce step into the final output.]
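A sketch of the map and reduce steps, using Python's multiprocessing as a stand-in for Hadoop. The per-sub-matrix scoring is a simplified placeholder for the full Block-Ranking evaluation, and the sub-matrix layout (data_id mapped to a set of Block_IDs) is an assumption:

```python
from multiprocessing import Pool

def score_submatrix(args):
    """Map step: score every file in one sub-matrix P_i by the fraction
    of the (non-empty) query's blocks it contains -- a simplified
    stand-in for the full Block-Ranking evaluation of that column group."""
    submatrix, query_blocks = args
    return [(data_id, len(query_blocks & blocks) / len(query_blocks))
            for data_id, blocks in submatrix.items()
            if query_blocks & blocks]

def block_ranking_query(submatrices, query_blocks):
    """Distribute P1..Pn to worker processes, then reduce by merging the
    partial hit lists into one globally ranked result list."""
    with Pool() as pool:
        partials = pool.map(score_submatrix,
                            [(p, query_blocks) for p in submatrices])
    merged = [hit for part in partials for hit in part]   # reduce/merge
    return sorted(merged, key=lambda hit: hit[1], reverse=True)
```

Because the sub-matrices were formed by vertical partition, each worker touches only the files of one logical group, and the merge step needs no cross-worker reconciliation beyond the final sort.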
7. Experimental Results

In this paper, the experimental corpus consists of ten years of TREC [18] reference documents. Fixed-sized partitioning is used as the partition strategy, and the duplicate rate is calculated by comparing the number of duplicate blocks in the corpus. As described in section 5, the Block-Ranking formulas only reflect the relationship among files; the duplicate rate of the whole storage system is instead calculated by the formula:

SIM_mix = Num_DUP / Num_MAX,

in which Num_DUP represents the number of duplicate blocks across all logical groups, and Num_MAX represents the total block count of all Groups.

Table 1. Duplicate rate with different block size in single-user environment
Block Size      4k     16k    64k    128k   256k
Duplicate Rate  4.6%   4.2%   3.9%   3.8%   3.7%
According to the formula above, there is only one logical group in the single-user environment, and all files in the test corpus are owned by one user. Hence, only files containing duplicate blocks are regarded as related. As shown in Table 1, the duplicate rate is around 4%, so duplication is not a very common phenomenon in a single-user environment. Because many files are shared by multiple users, duplication is more common in cloud storage systems than in traditional storage systems. The multi-user experiment is based on the following assumption: files containing at least one duplicate block are regarded as related files, and files shared between different users are regarded as related even if they have no duplicate blocks. This assumption reflects both physical and logical sharing among files, and the two characteristics are compounded to calculate the mixed duplicate rate in the cloud environment. As shown in Table 2, ten users randomly choose files from the TREC corpus, and 10% of each user's files are preselected shared files. According to the formulas described in sections 5 and 7, the inter-duplicate rate is around 10% and the number of logical groups is 10. Because all files in the test corpus are distributed across different logical groups, the inner-duplicate rate of each group is lower than the duplicate rate in Table 1. But since each file is composed of a certain number of blocks and some files are shared by multiple users, neither the inner-duplicate rate nor the inter-duplicate rate exceeds the mixed duplicate rate in the cloud environment. The experimental results indicate that as the file number per user ranges from 10 to 180, the mixed duplicate rate rises accordingly.
Table 2. Duplicate rate in multi-user environment
File Number per User   Inner Duplicate Rate   Inter Duplicate Rate   Mixed Duplicate Rate
10                     3.52%                  10.05%                 14.62%
50                     3.59%                  10.13%                 15.17%
100                    3.63%                  10.18%                 15.36%
180                    3.67%                  10.21%                 15.73%
To measure the impact of matrix partition on retrieval performance, we ran further experiments on our test platform. Eucalyptus [19] is an open-source cloud platform that allows users to store persistent data organized as eventually consistent buckets and objects; its interface is compatible with Amazon's S3 (Simple Storage Service) and allows users to set access policies, so we can easily use Eucalyptus to construct a basic cloud environment for testing. The basic configuration is as follows: on top of the Eucalyptus cloud platform, a Hadoop-based [20] distributed file system is used to divide the inverted matrix and process the sub-matrices. The platform consists of 15 nodes plus one physical machine functioning as cloud controller and cluster controller. Each node in the cluster has four 2.8 GHz CPU cores and four 1 TB SATA disks, on which 4 CentOS 5.3 virtual machines can be started under Xen 3.2.0. Meanwhile, Lucene [21] is used as the retrieval facility, providing the search interface and constructing the file index. Let Ts_i denote the separate processing time of each sub-matrix, and let T_MapReduce denote the time consumed to partition the matrix and combine the sub-processing results into the final output; the total retrieval time the cloud platform takes to find the matching results is then:

T_retrieval = max(Ts_1, Ts_2, ..., Ts_n) + T_MapReduce

To measure the impact of MapReduce and matrix partition on retrieval performance, we ran two kinds of experiments covering different application scenarios. The first regards all files in the TREC reference documents as the test corpus, with the total file number fixed at 1800, while the sub-matrix number ranges from 1 to 10; the sub-matrices are distributed to different nodes by MapReduce. As shown in Figure 8, when the sub-matrix number is small, the speedup from MapReduce is not very pronounced because non-parallel processing dominates retrieval. As the sub-matrix number increases, more parallelism can be exploited to improve retrieval efficiency, and the query time decreases greatly. The retrieval times are averages over 20 query runs; we confirmed that this number of runs is sufficient to reflect the retrieval performance in all of the experimental environments.
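The timing model is easy to exercise with assumed numbers; the sketch below uses an ideal even split of a fixed workload and a constant MapReduce overhead (both values invented for illustration) to show why the speedup in Figure 8 flattens as the sub-matrix count grows:

```python
def t_retrieval(sub_times, t_mapreduce):
    """T_retrieval = max(Ts_1, ..., Ts_n) + T_MapReduce: the slowest
    sub-matrix dominates, plus the partition-and-merge overhead."""
    return max(sub_times) + t_mapreduce

# Assumed toy numbers: 1000 ms of total matrix work split evenly over n
# nodes, with a flat 50 ms of MapReduce overhead.
for n in (1, 2, 4, 8, 10):
    print(n, t_retrieval([1000.0 / n] * n, 50.0))
```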
[Figure 8. Retrieval time with different sub-matrix number: retrieval time in ms (y-axis, 0-1200) against sub-matrix number (x-axis: 1, 2, 4, 8, 10).]
In the second experimental scenario, the user number is fixed at 10 and the file number per user ranges from 10 to 180, so the file number in the test corpus ranges from 100 to 1800. Under these circumstances, three retrieval methods are compared. Lucene-based retrieval uses the traditional ranking method (TF-IDF) to sort query results. Block-Ranking retrieval applies the Block-Ranking algorithm described above to calculate the correlation between files. Because
Block-Ranking retrieval is composed of content retrieval plus similarity retrieval, it consumes more retrieval time than Lucene-based retrieval does. As for MapReduce-enhanced retrieval, by partitioning the big indexing matrix into several sub-matrices, the MapReduce facility can be used to parallelize the retrieval procedure. As shown in Figure 9, as the file number increases, the scale of the inverted matrix explodes rapidly and it takes much more time to search the big index file, so the performance of Lucene-based retrieval and of Block-Ranking retrieval degrades considerably. The retrieval time of the MapReduce-enhanced Block-Ranking strategy, however, remains at a low level even when the file number reaches 1800 or more. As the experiments above show, matrix partition based on MapReduce simplifies retrieval over massive data, and the content-aware design in this paper further improves the performance of the cloud system by exploiting the inherent and abundant correlations among files. The drawback of the Block-Ranking method is that it does not consider the relative block positions within files: the algorithm loses sequence information and cannot distinguish two files consisting of the same blocks in different orders. We intend to overcome this shortcoming in future work.
[Figure 9. Time comparison among Lucene-Based, Block-Ranking and MapReduce-Enhanced Block-Ranking retrieval: retrieval time in ms (y-axis, 0-1200) against file number (x-axis: 100, 500, 1000, 1800) for the three methods.]
8. Conclusion

This paper presents a content similarity retrieval method named Block-Ranking, which uses the hash values of duplicate data blocks to calculate the correlation among files in a cloud environment. The novelty of the Block-Ranking algorithm is that it leverages not only the logical characteristics of shared files but also the physical duplication information to capture the relationships among files. Specifically, this paper makes three main contributions. First, chunk partition methods are used to break each file into a sequence of chunks, and the hash value of each chunk is regarded as the block's identifier. Second, an inverted index matrix is constructed to conveniently represent the structure among duplicated blocks, and matrix partition is used to reduce the matrix's high dimensionality. Finally, we take advantage of a MapReduce strategy to increase the parallelizability of the Block-Ranking algorithm. The experiments indicate that this MapReduce-based Block-Ranking method significantly improves retrieval efficiency in a network storage environment.
9. Acknowledgement

This work is supported by the National Natural Science Foundation of China under Grant No. 60673001.
10. References

[1] Roxana Geambasu, Steven D. Gribble, and Henry M. Levy, "CloudViews: Communal Data Sharing in Public Clouds", HotCloud, 2009 USENIX Annual Technical Conference, 2009.
[2] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, et al., "Above the clouds: A Berkeley view of cloud computing", Technical Report No. UCB/EECS-2009-28, University of California at Berkeley, USA, Feb. 10, 2009.
[3] Kazuo Goda and Masaru Kitsuregawa, "Power-aware Remote Replication for Enterprise-level Disaster Recovery Systems", USENIX Annual Technical Conference, pp. 255-260, 2008.
[4] Ziv Bar-Yossef, Li-Tal Mashiach, "Local approximation of PageRank and reverse PageRank", Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 279-288, 2008.
[5] Reid Andersen, Christian Borgs, Jennifer Chayes, et al., "Robust PageRank and locally computable spam detection features", Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, pp. 69-76, 2008.
[6] Ruma Dutta, Indranil Ghosh, Anirban Kundu, Debajyoti Mukhopadhyay, "An Advanced Partitioning Approach of Web Page Clustering utilizing Content & Link Structure", JCIT: Journal of Convergence Information Technology, Vol. 4, No. 3, pp. 65-71, 2009.
[7] Dean, J., Ghemawat, S., "MapReduce: Simplified Data Processing on Large Clusters", Communications of the ACM, Vol. 51, No. 1, pp. 107-113, 2008.
[8] L. Wang, J. Zhan, W. Shi, Y. Liang and L. Yuan, "In cloud, do MTC or HTC service providers benefit from the economies of scale?", Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, pp. 1-10, 2009.
[9] Anirban Kundu, Sutirtha Kr. Guha, Tanmoy Chakraborty, et al., "Multi-Agent System for Search Engine based Web Server: A Conceptual Framework", JDCTA: International Journal of Digital Content Technology and its Applications, Vol. 3, No. 4, pp. 44-59, 2009.
[10] Aameek Singh, Mudhakar Srivatsa, Ling Liu, "Search-as-a-Service: Outsourced Search over Outsourced Storage", ACM Transactions on the Web (TWEB), Vol. 3, No. 4, 2009.
[11] Deepak R. Bobbarjung, Suresh Jagannathan, Cezary Dubnicki, "Improving duplicate elimination in storage systems", ACM Transactions on Storage (TOS), Vol. 2, pp. 424-448, 2006.
[12] K. Eshghi and H. K. Tang, "A framework for analyzing and improving content-based chunking algorithms", Technical Report HPL-2005-30, Hewlett Packard Laboratories, Palo Alto, 2005.
[13] Liu Ke, Qin Leihua, Zhou Jingli, Nie Xuejun, "Content-aware Network Storage System Supporting Metadata Retrieval", 8th International Symposium on Optical Storage and 2008 International Workshop on Information Data Storage, 2008.
[14] Vuk Ercegovac, Vanja Josifovski, Ning Li, et al., "Supporting sub-document updates and queries in an inverted index", Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 659-668, 2008.
[15] Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, et al., "Sparse indexing: large scale inline deduplication using sampling and locality", Proceedings of the 7th Conference on File and Storage Technologies (FAST), San Francisco, California, 2009.
[16] Peng Xia, Dan Feng, Hong Jiang, et al., "FARMER: a novel approach to file access correlation mining and evaluation reference model for optimizing peta-scale file system performance", Proceedings of the 17th International Symposium on High Performance Distributed Computing, pp. 185-196, 2008.
[17] Yu Hua, Yifeng Zhu, Hong Jiang, Dan Feng, Lei Tian, "Scalable and Adaptive Metadata Management in Ultra Large-scale File Systems", Proceedings of the 28th International Conference on Distributed Computing Systems, pp. 403-410, 2008.
[18] TREC corpus, http://trec.nist.gov/.
[19] Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, et al., "The Eucalyptus Open-Source Cloud-Computing System", 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, NW Washington, pp. 124-131, 2009.
[20] Richard A. Brown, "Hadoop at home: large-scale computing at a small college", Proceedings of the 40th ACM Technical Symposium on Computer Science Education, Chattanooga, pp. 106-110, 2009.
[21] Lucene, http://lucene.apache.org.