HCube: A Server-centric Data Center Structure for Similarity Search
Rodolfo da Silva Villaça∗, Rafael Pasquini†, Luciano Bernardes de Paula‡ and Maurício Ferreira Magalhães∗
∗ School of Electrical and Computer Engineering (FEEC/UNICAMP), Campinas – SP – Brazil
† Faculty of Computing (FACOM/UFU), Uberlândia – MG – Brazil
‡ Federal Institute of Education, Science and Technology of São Paulo (IFSP), Bragança Paulista – SP – Brazil
Email: {rodolfo, mauricio}@dca.fee.unicamp.br, [email protected], [email protected]
Abstract—The information society is facing a sharp increase in the amount of information, driven by the plethora of new applications that sprout all the time. The amount of data now circulating on the Internet is over zettabytes (ZB), resulting in a scenario defined in the literature as Big Data. In order to handle such a challenging scenario, the deployed solutions rely not only on the massive storage, memory and processing capacity installed in Data Centers (DC) maintained by big players all over the globe, but also on shrewd computational techniques, such as BigTable, MapReduce and Dynamo. In this context, this work presents a DC structure designed to support similarity search. The proposed solution aims at concentrating similar data on servers that are physically close within a DC. It accelerates the recovery of all data related to queries performed using a primitive get(k, sim), in which k represents the query identifier, i.e., the data used as reference, and sim a similarity level.
Index Terms—Similarity Search, Big Data, Data Center, Hamming similarity
I. INTRODUCTION

In the current Big Data scenario of the Internet, users have become data sources; companies store uncountable information about clients; millions of sensors monitor the real world, creating and exchanging data in the Internet of Things. According to a study from the International Data Corporation (IDC) published in May 2010 [1], the amount of data available in the Internet surpassed 2 ZB in 2010, doubling every two years, and might surpass 8 ZB in 2015. The study also revealed that approximately 90% of this data is unstructured, heterogeneous and variable in nature, such as texts, images and videos. Emerging technologies, such as Hadoop [2] and MapReduce [3], are examples of solutions designed to address the challenges imposed by Big Data in the so-called three Vs: Volume, Variety and Velocity. Through parallel computing techniques, in conjunction with grid computing and/or, more recently, taking advantage of the DC infrastructure offered by the cloud computing concept, IT organizations offer means for handling large-scale, distributed and data-intensive jobs across commodity servers. Usually, such technologies offer a distributed file system and automated tools for adjusting, on the fly, the number of servers involved in the processing tasks. In such cases, large volumes of data are pushed over the networking facility connecting the servers, transferring <key, value> pairs from mappers to reducers in order
to obtain the desired results. Extensions like [4] avoid reprocessing all the information stored in the distributed file systems, also minimizing the need to move data across the network in order to speed up the overall processing task. While the current solutions are unquestionably efficient for handling traditional applications, such as batch processing of large volumes of data, they do not offer adequate support for similarity search [5], whose objective is the retrieval of sets of similar data given a similarity level. In a previous work [6], an overlay solution for similarity search was developed on top of a Distributed Hash Table (DHT) structure. Essentially, it was confirmed that it is possible to assign identifiers to data such that the Hamming distance between the identifiers represents, with a high level of accuracy, the similarity between the data (defined in this paper as their Hamming similarity). Afterwards, it was shown that it is possible to store similar data in servers that are close in the logical space of the overlay network by using a put(k, v) primitive, and that it is also possible to efficiently recover a set of similar data by using a single get(k, sim) primitive, requiring a reduced number of logical hops between the peers in a DHT. In the current work, the main objective is to present a DC structure, named Hamming Cube (HCube), as a proof of concept demonstrating the feasibility of storing similar data in servers that are physically near, as opposed to the logical neighborhood of the previous overlay solution, which does not necessarily represent physical distance. Although it is at an early stage, the results indicate its feasibility, and it serves as a basis for the development of solutions for several applications, such as recommender systems for social networks and/or medical image repositories, in which queries for similar past diagnoses may help in a new treatment. To achieve this goal, the HCube includes a data representation model, a Locality Sensitive Hashing (LSH) function [5], a data storage infrastructure and a routing solution. For the data representation, the HCube uses a vector representation, in which each dimension is related to a characteristic of the data being stored, such as keywords in a text, the color histogram of a picture or profile attributes in a social network. In order to index the data, HCube adopts the Random Hyperplane Hashing function (RHH) [7], a family of LSH functions whose similarity corresponds to the cosine of the angle between vectors. The data storage infrastructure of HCube uses servers as
data repositories, and is inspired by the 3-cube server-centric solution of BCube [8]. Also, the servers are responsible for routing queries to their neighbors in order to retrieve all the desired data regarding the get(k, sim) call. In the HCube, servers are organized according to the Gray code sequence [9], and each server is connected to six other servers whose identifiers present Hamming distance equal to 1. The routing solution used in the HCube adopts an XOR-based flat routing approach [10]. To evaluate the HCube proposal, the Adult Data Set of the UCI Repository [11] was used as a set of adult profiles in a social network. For these profiles, hash identifiers were created using the RHH function and stored in the servers composing the HCube. Four cubes ranging from 1024 to 8192 servers were analyzed. Afterwards, queries for similar profiles were performed and the distance between the servers that store similar profiles was measured. Also, the recall versus the number of hops is presented to show the efficiency of the proposed solution. To complement the evaluations, the Hamming similarity approach was validated by showing the correlation between the similarity extracted from the cosine of the adult vectors and the Hamming similarity between adult and server identifiers.

The remainder of this paper is organized as follows: Section II presents the concepts of cosine similarity (simcos), RHH functions and Hamming similarity (simh). Section III details the HCube, describing its architecture and operation using Gray codes and the XOR routing. Section IV presents the evaluation results, Section V provides some final remarks about our proposal, and Section VI concludes this paper.

II. DATA MODEL AND THE RANDOM HYPERPLANE HASHING

Consider an application scenario in which the profiles of users are represented by 2-dimensional vectors, each dimension representing the interest of the user in one of two subjects, for example A and B. These interests are represented by values from 0 up to 10 (0 means that the user is not interested in the subject at all and 10 means the user is totally interested in it). As an example, consider Figure 1, in which three profiles are represented by the tuples PROFILE1(3,7) - rank 3 for subject A and 7 for subject B, PROFILE2(9,1) - rank 9 for subject A and 1 for subject B, and PROFILE3(5,5) - rank 5 for both subjects A and B.
Based on these profiles, consider a friend recommender system and a new user profile represented by the tuple NEW_PROFILE(6,2), as seen in Figure 1. Under such circumstances, the question is: what are the most similar profiles to recommend to the user represented by NEW_PROFILE? A common option found in the literature is the cosine similarity. Given two vectors, the value of the cosine of the angle between them, which ranges from 0 to 1, may be used as a similarity index; 0 means that the vectors have no similarity at all and 1 means that they are totally similar. The choice of the similarity function highly depends on the application domain. In [12] it is stated that the cosine similarity produces high quality results for different types of data and application scenarios. This paper focuses on the cosine similarity.

Locality Sensitive Hashing (LSH) functions reduce the dimensions of the vectors representing data while ensuring that the more similar two objects are, the more similar the hash values of their vectors will be [5]. Each family of LSH functions is related to some similarity function. The Random Hyperplane Hashing (RHH) is an example of a family of LSH functions related to the cosine similarity. In this context, Charikar [7] presents a hashing technique summarized in the sequence. Given a set r1, r2, ..., rm of m vectors r ∈ ℜ^d, each of their coordinates randomly drawn from a standard normal distribution, and a vector u ∈ ℜ^d, a hash function h_r is defined as follows:

h_r(u) = 1, if r · u ≥ 0
h_r(u) = 0, if r · u < 0

For each h_r(u) one bit is generated, and the results of the m h_r(u) are concatenated to compose an m-bit hash key for the vector u. For two data vectors u, v ∈ ℜ^d, the probability that each bit of their keys coincides grows with the cosine of the angle between u and v (in Charikar's scheme this probability is 1 − θ/π, where θ is the angle between the vectors). Consequently, the greater the cosine similarity, the more likely the generated keys will share common bits, leading to two identifiers close in Hamming distance (i.e., the number of differing bits between two binary strings). Inversely, the Hamming similarity can be calculated as simh = 1 − Dh/m, in which simh is the Hamming similarity, Dh is the Hamming distance and m is the length of the identifiers in bits.

As an example, assuming that an application uses an identifier of 8 bits, a sequence of m = 8 random vectors r must be generated, and the bits returned by the m h_r functions are concatenated in order to generate an 8-bit identifier. Based on the scenario previously presented in Figure 1, Table I compares the cosine similarity (simcos) between PROFILE1, PROFILE2, PROFILE3 and NEW_PROFILE with the Hamming distance (Dh) and Hamming similarity (simh) between their identifiers. As can be seen, PROFILE1, which is the most similar profile to NEW_PROFILE based on the cosine similarity, also has the smallest Hamming distance (Dh), corresponding to a Hamming similarity (simh) of 0.875.

TABLE I — HAMMING DISTANCE (Dh), HAMMING SIMILARITY (simh) AND COSINE SIMILARITY (simcos) OF PROFILES.
Comparison between profiles | Dh | simh | simcos
PROFILE1, NEW_PROFILE | 1 | 0.875 | 0.99
PROFILE3, NEW_PROFILE | 2 | 0.75 | 0.95
PROFILE2, NEW_PROFILE | 3 | 0.625 | 0.85

The next section presents the architecture and operation of HCube.
Fig. 1. Interest space for PROFILE1, PROFILE2, PROFILE3 and NEW_PROFILE.
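Before moving on, the RHH construction of Section II can be made concrete with a minimal Python sketch (not part of the paper; the vectors, dimensions and random seed below are purely illustrative). It builds m random hyperplanes, produces one bit per hyperplane, and compares the cosine similarity of two vectors with the Hamming similarity of their keys.

```python
import numpy as np

def rhh_signature(u: np.ndarray, planes: np.ndarray) -> np.ndarray:
    """Random Hyperplane Hashing: one bit per random vector r,
    h_r(u) = 1 if r . u >= 0, else 0 (Section II)."""
    return (planes @ u >= 0).astype(np.uint8)

def hamming_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """simh = 1 - Dh/m, where Dh is the Hamming distance and m the key length."""
    return 1.0 - np.count_nonzero(a != b) / len(a)

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(42)
d, m = 14, 128                          # illustrative: 14 attributes, 128-bit keys
planes = rng.standard_normal((m, d))    # m random hyperplanes, coordinates ~ N(0, 1)

u = rng.random(d)                       # two hypothetical profile vectors
v = u + 0.1 * rng.standard_normal(d)
print(cosine_similarity(u, v),
      hamming_similarity(rhh_signature(u, planes), rhh_signature(v, planes)))
```

Running the sketch with more or less perturbed pairs of vectors illustrates the trend exploited by HCube: the higher the cosine similarity, the higher the Hamming similarity of the generated keys.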
III. THE HCUBE ARCHITECTURE

This section provides an overall description of the HCube architecture, exhibiting the organization of servers inside the
DC structure, defining how the identifiers of the servers are assigned according to the Gray code sequence and presenting the XOR routing responsible for traffic forwarding. As seen in Figure 2, the logical organization of HCube is composed of two layers, the Admission Layer and the Storage Layer.
Fig. 2. HCube operation in a query.

The Admission Layer provides the interface between the external world and the HCube structure. This layer is composed of a set of Admission Servers, which operate on top of the HCube, receiving the queries from the users/applications and preparing such queries to be injected into the HCube in order to perform the similarity search. For simplicity, Figure 2 shows only one Admission Server at the Admission Layer. On the top left corner of Figure 2, a query vector q composed of d dimensions is admitted by the HCube, in conjunction with the desired similarity level sim. The designation of the Admission Server in charge of handling such a query may be set according to, for example, the geographic region from where the query originates. Once the query vector q is received, the Admission Server obtains the data identifier di according to the process described in Section II, which is 128 bits long in Figure 2. The di is the reference data identifier for the similarity search process. In the sequence, the Admission Server reduces di, also using the RHH function, in order to obtain the identifier si, which indicates the Hosting Server responsible for the reference data di. In this example, si is 6 bits long, resulting in an HCube composed of 64 servers (4x4x4 servers).

After the admission of the query q, a get message composed of the reference data identifier di, the hosting server si and the desired sim level is issued in the Storage Layer. Such a message is routed towards si by using the XOR Routing presented later in Section III-C and, from the hosting server si, a series of get(di, sim) messages is triggered, being forwarded to the neighboring servers also using the XOR mechanism. The generation of such get(di, sim) messages is driven by some factors, including the reference data di and the desired sim level. Specifically for the purposes of this paper, the get(di, sim) messages are gradually sent to all servers, following an increasing order of Hamming distances between server identifiers, since the objective is to provide a full analysis of the distance between servers storing similar data and the recall with which similar data is recovered. As the get messages are routed inside the HCube, the servers containing data within the desired sim level forward it to the Admission Server responsible for handling such request at the Admission Layer. The Admission Server summarizes the answers and delivers the set of similar data to the requesting user/application, concluding the similarity search process. As an alternative implementation, the Admission Server may return a list of references to the similar data, instead of returning the entire data set. Such an option avoids unnecessary movement of huge volumes of data and allows the users to choose which pieces of data they really want to retrieve; for example, a doctor may open only a few past diagnoses related to the current treatment.
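As an illustration of the two RHH reductions in this admission flow, the sketch below (hypothetical, not from the paper) first hashes a query vector q into a 128-bit data identifier di and then reduces di to a 6-bit hosting-server identifier si. Treating the 128-bit di as a 0/1 vector for the second reduction is an assumption; the paper does not detail how di is fed back into RHH.

```python
import numpy as np

def rhh_bits(vec: np.ndarray, planes: np.ndarray) -> np.ndarray:
    """One bit per random hyperplane: 1 if r . vec >= 0, else 0."""
    return (planes @ vec >= 0).astype(np.uint8)

rng = np.random.default_rng(7)

d = 14                                          # dimensions of the query vector q (illustrative)
data_planes = rng.standard_normal((128, d))     # yields 128-bit data identifiers (di)
server_planes = rng.standard_normal((6, 128))   # yields 6-bit server identifiers (si), as in Fig. 2

q = rng.random(d)                               # hypothetical query vector admitted by the HCube
di = rhh_bits(q, data_planes)                   # 128-bit reference data identifier
# Assumption: di is treated as a 128-dimensional 0/1 vector when reduced to si.
si = rhh_bits(di.astype(float), server_planes)

print("di:", "".join(map(str, di[:16])), "...")
print("si:", "".join(map(str, si)))             # hosting server in a 64-server HCube
```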
A. HCube structure

The HCube is composed of a set of servers organized in a 3-dimensional cube topology, in which servers are the fundamental elements in the DC (server-centric paradigm). Under such a server-centric scheme, simple COTS (commodity off-the-shelf) switches can be used to simplify the wiring process of the DC, or direct links can be established among the servers. The most important characteristic of the server-centric scheme is that network elements, like switches, are not involved in the traffic forwarding decisions; they simply provide connections between servers, which are responsible for all the traffic forwarding decisions. Each HCube server comprises a general purpose processor, with memory, persistent storage and six NICs (Network Interface Cards), named eth0 to eth5 and distributed along three axes (x, y and z). In order to be placed in the HCube topology, each server uses the NICs eth0 and eth1 on the x axis, the NICs eth2 and eth3 on the y axis, and the NICs eth4 and eth5 on the z axis. In order to provide a greater number of alternative paths in the HCube, increase the fault tolerance and reduce the maximum distance between any pair of servers, the adopted topology wraps around the links between edge servers on all three axes. Figure 3 presents an HCube topology composed of twenty-seven servers. Note in this figure the existence of six NICs in all servers, the links along the three axes, and the wrapping connections between the edge servers.
Fig. 3. HCube with 27 servers (x = 3, y = 3 and z = 3).
B. Allocation of server identifiers

In order to facilitate the similarity search, the HCube adopts Gray codes for assigning identifiers to all servers composing its structure. By using the Gray code [9], two successive identifiers differ in only one bit, i.e., neighboring servers present Hamming distance 1 between their identifiers, tending to store similar data. The proposed organization follows the Gray Space Filling Curve (SFC), which offers a good ratio between the clustering of data and the computational complexity of the curve [13], and is specialized in clustering data according to their Hamming distances. Figure 4 shows, as an example, an HCube with 64 storage servers (6-bit identifiers) distributed in a 4x4x4 topology. The figure presents the four layers (L1, L2, L3 and L4) of the HCube, which are connected according to the details presented in Section III-A. As an example, consider server 29 (011101) located in L2 of Figure 4. The six neighbors of this server are 28 (011100, L2) and 31 (011111, L2) on the x axis; 25 (011001, L2) and 21 (010101, L2) on the y axis; 13 (001101, L1) and 61 (111101, L3) on the z axis. Note that all these neighbors present identifiers whose Hamming distance to server 29 is equal to 1. As another example, in which links between edge servers are used, consider server 60 (111100) located in L3 of Figure 4. The six neighbors of this server are 62 (111110, L3) and 61 (111101, L3) on the x axis; 52 (110100, L3) and 56 (111000, L3) on the y axis; 44 (101100, L4) and 28 (011100, L2) on the z axis, all of them also presenting Hamming distance 1 to the identifier of server 60.
Fig. 4. Visualization of an HCube with 64 servers (L1, L2, L3 and L4).
The HCube presented in Figure 4 is a special case, since the number of NICs of the servers matches the length of the identifiers (6 bits). In this particular case, all the neighbors of all servers present Hamming distance equal to 1. However, for HCubes with more than 64 servers the identity space used for the server identifiers must be larger than 6 bits, resulting in more than 6 servers whose Hamming distance to a given server is equal to 1. In short, this means that 6 servers presenting Hamming distance 1 will always be directly connected to a given server, while the remaining servers whose Hamming distance is 1 will be placed a higher number of hops away. Figure 5 exemplifies an HCube bigger than 6 bits, showing the layers of an 8x4x4 (7-bit) HCube with 128 servers. Note the Gray SFC represented by the arrows, and consider server 9 (0001001), located in L1, as an example. There are seven servers at Hamming distance 1, organized as follows: servers 8 (0001000, L1) and 11 (0001011, L1) on the x axis; servers 25 (0011001, L1) and 1 (0000001, L1) on the y axis; servers 41 (0101001, L2) and 73 (1001001, L4) on the z axis; and, finally, server 13 (0001101, L1), which is located a higher number of hops away from server 9. It is important to highlight that the distance in hops to server 13 is reduced given the wrapped links established between edge servers of the HCube.
Fig. 5. Gray SFC in an 8x4x4 HCube (128 servers).
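The identifier assignment above can be reproduced with a short sketch. Assuming that a server identifier is the concatenation of the per-axis reflected Gray codes (z, y and x fields), which is consistent with the worked examples for servers 29 and 9, the following Python code (illustrative, not taken from the paper) derives a server identifier from its (x, y, z) position and lists its six physical neighbors, including the wrap-around links:

```python
def gray(i: int) -> int:
    """Reflected binary Gray code of integer i."""
    return i ^ (i >> 1)

def server_id(x: int, y: int, z: int, bits=(2, 2, 2)) -> int:
    """Concatenate the per-axis Gray codes as z|y|x (assumption consistent
    with the worked examples for servers 29 and 60 in the text)."""
    bx, by, bz = bits
    return (gray(z) << (by + bx)) | (gray(y) << bx) | gray(x)

def neighbors(x, y, z, dims=(4, 4, 4), bits=(2, 2, 2)):
    """The six physically connected neighbors, with wrap-around on each axis."""
    out = []
    for axis in range(3):
        for step in (-1, 1):
            c = [x, y, z]
            c[axis] = (c[axis] + step) % dims[axis]
            out.append(server_id(*c, bits=bits))
    return out

# Server 29 (011101) sits at Gray coordinates x=1, y=2, z=1 in a 4x4x4 HCube:
# gray(1)=01, gray(2)=11, gray(1)=01 -> 01 11 01 = 29.
sid = server_id(1, 2, 1)
assert sid == 0b011101
for n in neighbors(1, 2, 1):
    # Every directly connected neighbor differs from 29 in exactly one bit.
    assert bin(sid ^ n).count("1") == 1
print(sorted(neighbors(1, 2, 1)))   # [13, 21, 25, 28, 31, 61], as in Section III-B
```

Under the same assumption, calling server_id and neighbors with dims=(8, 4, 4) and bits=(3, 2, 2) reproduces the six directly connected neighbors of server 9 listed above (8, 11, 25, 1, 41 and 73); the seventh Hamming-distance-1 identifier, server 13, is not physically adjacent.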
A complete Hamming cube must have n dimensions, in which n is the number of bits of the identity space used for the servers. Although such a structure would yield the best results in a similarity search using the Hamming similarity, it is very difficult to deploy and far from real DC wiring structures. In this way, the 3-dimensional structure of the HCube offers a good trade-off between the complexity of the cube and the results (recall) of a similarity search.

C. XOR Routing

The XOR metric uses n-bit flat identifiers to organize the routing tables in n columns and route packets through the network. Its routing principle uses the bit-wise exclusive or (XOR) operation between two server identifiers a and b as their distance, which is represented by d(a, b) = a ⊕ b, with d(a, a) = 0 and d(a, b) > 0 for all a ≠ b. Given a packet originated by server x and destined to server z, and denoting Y as the set of identifiers contained in x's routing table, the XOR-based routing mechanism applied at server x selects a server y ∈ Y
that minimizes the distance towards z, which is expressed by the following routing policy:

R = argmin_{y ∈ Y} d(y, z).    (1)
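A minimal sketch of this routing policy follows (illustrative only; the routing table entries are the six neighbors of server 29 from Section III-B, while the destination was chosen arbitrarily):

```python
def xor_distance(a: int, b: int) -> int:
    """XOR metric: d(a, b) = a XOR b (Section III-C)."""
    return a ^ b

def next_hop(dest: int, routing_table: list[int]) -> int:
    """Routing policy (1): pick the known server y that minimizes d(y, dest)."""
    return min(routing_table, key=lambda y: xor_distance(y, dest))

# Hypothetical routing step: server 29 (011101) forwarding towards server 44 (101100),
# knowing only its six physical neighbors.
table_29 = [0b111101, 0b001101, 0b010101, 0b011001, 0b011111, 0b011100]
# Forwards to 61 (111101), the entry sharing the longest prefix with the destination.
print(bin(next_hop(0b101100, table_29)))
```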
In order to support traffic forwarding, the XOR-based routing tables maintained per server are formed by O(n) entries, where the knowledge about neighbor servers is spread into n columns called buckets, represented by βi, 0 ≤ i ≤ n − 1. Each time a server a learns about a new neighbor b, it stores the information regarding server b in the bucket β_{n−1−i}, given the highest i that satisfies the following condition (where div denotes the integer division):

d(a, b) div 2^i = 1,  a ≠ b,  0 ≤ i ≤ n − 1.    (2)
In order to exemplify the creation of the routing tables, consider the first example given in Section III-B, assuming server 29 as a = 011101 and its neighbor 13 as b = 001101. The distance is d(a, b) = 010000 and the highest i that satisfies condition (2) is i = 4, concluding that the identifier b = 001101 must be stored in the bucket β_{n−1−i} = β1. Basically, condition (2) denotes that server a stores b in the bucket β_{n−1−i}, in which n − 1 − i is the length of the longest common prefix (lcp) between the identifiers a and b. This can be observed in Table II, in which the buckets β0, β1, β2, β3, β4, β5 store the identifiers having an lcp of length 0, 1, 2, 3, 4, 5 with server 29 (011101).

TABLE II — HYPOTHETICAL ROUTING TABLE FOR SERVER 29 (011101) IN AN HCUBE WITH A SERVER IDENTITY SPACE IN WHICH n = 6.
β0: 111101, 100000, 100010, 111111
β1: 001101, 000000, 000100, 001111
β2: 010101, 010000, 010100, 010111
β3: 011001, 011000, 011010, 011011
β4: 011111, 011110
β5: 011100
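Condition (2) and its longest-common-prefix interpretation can be captured in a few lines (a sketch, not the authors' implementation):

```python
def bucket_index(a: int, b: int, n: int) -> int:
    """Bucket for neighbor b in server a's routing table, per condition (2):
    take the highest i with (a XOR b) div 2**i == 1 and store b in bucket n-1-i,
    i.e. the bucket index equals the length of the longest common prefix of a and b."""
    d = a ^ b
    i = d.bit_length() - 1          # highest i such that d // 2**i == 1
    return n - 1 - i

# Example from the text: server 29 (011101) learns about neighbor 13 (001101).
assert bucket_index(0b011101, 0b001101, n=6) == 1   # stored in bucket beta_1

# The six physical neighbors of server 29 fall one per bucket (first entries of Table II):
for b in (0b111101, 0b001101, 0b010101, 0b011001, 0b011111, 0b011100):
    print(bucket_index(0b011101, b, n=6))           # prints 0, 1, 2, 3, 4, 5
```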
Such a routing table approach is one of the main advantages of the XOR-based mechanism, since a server only needs to know one neighbor per bucket, out of the possible 2^n servers available in the network, to successfully route packets. If a server has more than one entry per bucket, such additional entries might optimize the routing process, reducing the number of hops in the path from source to destination. Another important characteristic of this routing table is related to the number of servers that fit in each one of the buckets. There is only one server that fits in the last bucket (β5), two in β4, four in β3, doubling until reaching the first bucket (β0), in which fifty percent of all servers fit (32 servers for n = 6). For simplicity of presentation, Table II lists at most four example neighbor servers per bucket. Furthermore, given the Gray code distribution of server identifiers adopted in the HCube, filling the buckets is easier, since each one of the Hamming distance 1 neighbors fits in exactly one bucket of the routing table, assuring all the traffic forwarding inside the HCube. Note in Table II the intentional inclusion, as the first entry of each bucket, of all servers whose identifiers
present Hamming distance 1 to server 29 (011101): server 111101 in β0, 001101 in β1, 010101 in β2, 011001 in β3, 011111 in β4 and 011100 in β5. As mentioned before, for HCubes bigger than sixty-four servers (6 bits), only six servers with Hamming distance 1 will be physically connected to a given server. In this case, a signaling process [10] is used to discover the other servers located at distances bigger than 1 hop.

IV. EVALUATION

This section describes the experiments performed in the evaluation of the HCube proposal. The main idea is to validate our proposal as a valuable approach to support the search for similar data. In our experiments, data is viewed as profiles of a social network, represented by adult vectors extracted from the Adult Data Set of the UCI Repository [11]. This set contains 99762 queries and 199523 adults. In essence, it contains information about adult citizens living in the US, including: age, work-class, education level, years in school, marital status, occupation, relationship, race, sex, capital gain and loss over the last year, hours worked per week, native country and annual salary. Samples of these profiles are shown in the sequence:
• ADULT1 - 43; Self-emp-not-inc; 5th-6th; 3; Married-civ-spouse; Craft-repair; Husband; White; Male; 0; 4700; 20; United-States; ≤50K
• ADULT2 - 56; Private; 10th; 6; Married-civ-spouse; Craft-repair; Husband; White; Male; 0; 0; 0.45; France; ≤50K
• ADULT3 - 50; Self-emp-inc; Prof-school; 15; Married-civ-spouse; Prof-specialty; Husband; White; Male; 0; 0; 36; United-States; ≥50K
• ADULT4 - 30; Private; Prof-school; 15; Married-civ-spouse; Prof-specialty; Husband; White; Male; 0; 0; 30; United-States; ≥50K
The similarity between each pair of adult vectors is determined by the cosine of the angle between them. For example, the vectors ADULT1 and ADULT2 have cosine similarity equal to 0.71, while ADULT3 and ADULT4 have cosine similarity equal to 0.90. Table III lists the Hamming distances (Dh) obtained by applying the RHH function to the vectors representing ADULT1, ADULT2, ADULT3 and ADULT4. Table III also presents the cosine similarity (simcos) between adult vectors, providing values for a comparison with the Hamming similarity (simh) of their 128-bit adult identifiers.

TABLE III — HAMMING DISTANCE (Dh), HAMMING SIMILARITY (simh) AND COSINE SIMILARITY (simcos) FOR SELECTED PROFILES IN THE ADULT DATA SET.
Adults | Dh | simh | simcos
ADULT1, ADULT2 | 28 | 0.68 | 0.71
ADULT3, ADULT4 | 10 | 0.92 | 0.90
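The paper does not specify how an Adult record is turned into the vector on which the cosine and RHH operate. One plausible encoding, shown purely as an illustration (the attribute lists below are hypothetical and truncated), uses the numeric fields directly and one-hot encodes the categorical fields:

```python
import numpy as np

# Hypothetical vectorization of an Adult record; not necessarily the authors' encoding.
NUMERIC = ["age", "education_years", "capital_gain", "capital_loss", "hours_per_week"]
CATEGORICAL = {
    "workclass": ["Private", "Self-emp-not-inc", "Self-emp-inc"],
    "marital_status": ["Married-civ-spouse", "Never-married"],
    "sex": ["Male", "Female"],
}

def to_vector(record: dict) -> np.ndarray:
    """Numeric fields kept as-is; categorical fields one-hot encoded."""
    parts = [float(record[f]) for f in NUMERIC]
    for field, values in CATEGORICAL.items():
        parts += [1.0 if record[field] == v else 0.0 for v in values]
    return np.array(parts)

adult3 = {"age": 50, "education_years": 15, "capital_gain": 0, "capital_loss": 0,
          "hours_per_week": 36, "workclass": "Self-emp-inc",
          "marital_status": "Married-civ-spouse", "sex": "Male"}
print(to_vector(adult3))
```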
To evaluate the HCube, the following experiments were performed:
• Adult profile identifiers with 128 bits were generated using RHH, indexed and distributed in HCubes with 1024, 2048, 4096 and 8192 servers;
• Profiles in the query set were used as query vectors over the Adult set. Pairs of queries and adults were randomly selected to be evaluated considering similarity levels greater than or equal to 0.7, 0.8, 0.9 and 0.95;
• A 128-bit identifier was generated for each query. Lookups using these identifiers and a Hamming similarity level were performed. To map the hosting server for each query, each 128-bit identifier was reduced to a 10-, 11-, 12- and 13-bit identifier using RHH. These reduced identifiers correspond to the hosting servers of the queries for each evaluated HCube size;
• From the hosting server, all profiles that fit within the Hamming similarity level are retrieved.
These tests allow us to highlight the following aspects:
• The correlation between the cosine and Hamming similarity of adult identifiers (128 bits), and between the cosine similarity and the Hamming similarity of hosting server identifiers;
• The frequency distribution of the number of hops to retrieve all similar profiles in the set according to their similarity level;
• The query recall, corresponding to the fraction of all relevant profiles that was successfully retrieved.
To perform the tests, adult profiles were indexed and distributed in the HCube using the put(k, v) primitive. The get(k, sim) primitive was used to retrieve all similar profiles in a hosting server and its neighbors, given an identifier (k) and a similarity level (sim). Only to allow us to evaluate the recall and the number of hops, this search was extended to all neighbors of the hosting server up to a configurable depth.
A. Correlation

Table IV shows the average correlation between the cosine similarity (simcos) of pairs of queries and adult profiles and the Hamming similarity (simh) of their 128-bit identifiers. It also shows the correlation between the cosine similarity and the Hamming similarity of the 10-, 11-, 12- and 13-bit hosting server identifiers. To calculate these correlations we randomly selected 10 samples of queries and adult profiles, each sample containing about 1% of the total number of pairs in the Adult set. These samples were sorted according to the similarity levels 0.7, 0.8, 0.9 and 0.95.

TABLE IV — AVERAGE OF THE CORRELATION BETWEEN COSINE AND HAMMING SIMILARITY.
simcos Level | 128-bit simh | 10-bit simh | 11-bit simh | 12-bit simh | 13-bit simh
≥ 0.7 | 0.84 | 0.44 | 0.48 | 0.50 | 0.52
≥ 0.8 | 0.90 | 0.48 | 0.52 | 0.63 | 0.68
≥ 0.9 | 0.98 | 0.79 | 0.84 | 0.86 | 0.90
≥ 0.95 | ≈1.00 | 0.98 | 0.99 | 0.98 | ≈1.00
The results show a strong correlation between the cosine and Hamming similarity for 128-bit pairs of adult profiles, especially for the highest similarity levels. When the similarity level falls below 0.9, it is possible to notice a moderate correlation (< 0.8) between the cosine similarity of pairs of adults and the Hamming similarity of their hosting server identifiers. Still examining this table, the greater the similarity level, the greater the correlation between the cosine similarity of pairs of adults and the Hamming similarity of their hosting server identifiers. The standard deviations of these averages for the lowest similarity levels are high due to the probabilistic characteristic of the RHH function, but the confidence intervals of the samples were calculated and are low and acceptable. For the highest similarity levels the standard deviations are negligible, as are the confidence intervals of the samples.

B. Distribution of the number of hops between similar data

The results in Figure 6 show that the higher the similarity level of a query, the lower the effort necessary to recover similar data. As an example, in Figure 6(a), using 10 bits for a server identifier and a Hamming similarity level equal to 0.7, most of the similar profiles of a given query are located between 2 and 6 hops away from the hosting server. It is important to notice that, using 10 bits in a server identifier, a Hamming similarity of 0.7 is equivalent to a Hamming distance of 3, on average. In an HCube with 6 bits (a complete HCube), a Hamming distance of 3 implies 3 hops from source to target. As another example, in Figure 6(c), also using 10 bits for the server identifier and a similarity level of 0.9, most of the similar data associated to a given query is between 0 and 3 hops away. A Hamming similarity of 0.9 in this case means that the Hamming distance between the server identifiers storing those similar data is equal to 1, on average. In a complete Hamming cube with n dimensions, in which n is the number of bits of the server identifier, and with a similarity level sim, the maximum number of hops necessary to recover similar data is n − (sim × n). As an example, with sim = 0.7 and n = 10, the maximum number of hops necessary to recover all 0.7-similar data is 10 − (0.7 × 10) = 3. The higher values observed in the HCube results are due to the reduced number of dimensions used to build an incomplete Hamming cube.

Fig. 6. Frequency distribution. Similarity levels 0.7 (a), 0.8 (b), 0.9 (c) and 0.95 (d).

The average number of hops to retrieve all similar profiles in a given query is presented in Table V:

TABLE V — AVERAGE OF THE NUMBER OF HOPS.
Similarity Level | 1024 servers (10-bit) | 2048 servers (11-bit) | 4096 servers (12-bit) | 8192 servers (13-bit)
0.7 | 3.88 | 4.35 | 4.71 | 5.75
0.8 | 2.56 | 3.47 | 3.72 | 5.07
0.9 | 1.76 | 2.32 | 3.65 | 3.48
0.95 | 0.28 | 0.43 | 0.57 | 1.10
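For reference, the bound n − (sim × n) stated above can be evaluated for the identifier sizes used in this evaluation (the bound assumes a complete Hamming cube, so the averages in Table V for the incomplete HCube may exceed it):

```python
def max_hops_complete_cube(n: int, sim: float) -> float:
    """Maximum number of hops to recover all sim-similar data in a
    complete n-dimensional Hamming cube: n - (sim * n) (Section IV-B)."""
    return n - sim * n

for n in (10, 11, 12, 13):
    bounds = {s: round(max_hops_complete_cube(n, s), 2) for s in (0.7, 0.8, 0.9, 0.95)}
    print(n, bounds)
```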
C. Recall

An important metric to indicate the efficiency of any information retrieval system is the recall, which represents the fraction of the profiles relevant to the query that is successfully retrieved in a similarity search. This evaluation shows that it is possible to build an efficient search engine on top of the HCube. Figure 7 presents the results of the recall evaluation. They represent the average over all queries and results in the Adult set. From these results, it is possible to see that the best results occur for the highest similarity levels. As an example, taking an HCube with 4096 servers (12 bits) and a similarity level equal to 0.8, 20% of the similar adult profiles are retrieved within 2 hops and 50% of them are retrieved within 4 hops. In the same scenario, using a similarity level of 0.95, 60% of all similar profiles are stored in the hosting server, and expanding the search to its immediate neighbors (1 hop) retrieves 90% of all similar profiles.

Fig. 7. Recall. Similarity levels 0.7 (a), 0.8 (b), 0.9 (c) and 0.95 (d).
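The recall used here is the standard definition; a short sketch (illustrative only) of how it can be computed per query from the sets of retrieved and relevant profile identifiers:

```python
def recall(retrieved: set, relevant: set) -> float:
    """Fraction of the relevant profiles successfully retrieved for a query."""
    return len(retrieved & relevant) / len(relevant) if relevant else 1.0

# Hypothetical example: 9 of 10 relevant profiles recovered within the explored hops.
print(recall(retrieved={1, 2, 3, 4, 5, 6, 7, 8, 9}, relevant=set(range(1, 11))))  # 0.9
```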
D. Hamming distances

The evaluations in this subsection show the relationship between the Hamming distances of the servers and the distance between them in the HCube, measured in number of hops. The results are presented in Figure 8. In this figure, the results for a 6-bit HCube are shown as a reference, because in this case the HCube is a complete Hamming cube. As said earlier, for more than 6 bits in the server identifier the HCube is not complete, characterizing a compromise between lower hop distances and the practical implementation of the DC. As an example, in a DC with 2048 servers (11 bits), when two servers present a Hamming distance equal to 1 between their identifiers, the average distance, measured in number of hops, is about 2. For higher Hamming distances, such as 4 (Hamming similarity level (12 − 4)/12 = 0.66), the average distance is about 8.

Fig. 8. Hamming distances X Number of Hops.

From Figure 8 it is possible to see that the average distance between servers in the HCube, measured in number of hops, is directly proportional to the Hamming distance of their identifiers. Combining these results with those presented in Table IV, it is possible to confirm that the HCube is a valuable approach to enable similarity search in a Data Center scenario.
V. FINAL REMARKS

Among the related works about similarity search available in the literature, most are based on centralized indexing schemes [14], [15], hashing functions [5] and SFCs [13]. Others are related to overlay solutions in distributed P2P environments [16], [17], [18], [6], usually focused on large data collections and multimedia data. Both the centralized and the distributed scenarios are motivated by the nearest neighbors problem [5], i.e., how to retrieve the most similar data in an indexing space. The centralized solutions offer efficient results by using specialized data structures. To complement such a scenario, the use of LSH functions was proposed to mitigate the curse of dimensionality problem [7], which appears when the number of dimensions in a data vector increases, making the use of SFCs inefficient. But, even using LSH to reduce the curse of dimensionality, centralized solutions are not suitable for Big Data due to the huge amount of data to be considered. The overlay approaches appear as solutions to help manage large volumes of data. In general, these solutions are based on Distributed Hash Tables (DHT), sharing data among peers in an overlay network. However, although peers storing similar data can be located near each other in the overlay network, there is no guarantee that they are physically near, usually resulting in high latencies during data transfer/recovery. When compared to these works, HCube is a distributed solution using techniques such as LSH functions and SFCs to support similarity searches, but it is not based on an overlay. HCube presents a server-centric data center structure specialized for similarity search, where similar data are stored in servers that are physically near, recovering such data in a reduced number of hops and with reduced processing requirements. On the other hand, players such as Google [19] and Amazon [20] have developed data centers specialized in storing and processing large volumes of data through the MapReduce solution, but none of them considers the similarity search. In this way, HCube opens up a new research field, where applications can benefit from its similarity search structure, avoiding processing-intensive tasks as occur in MapReduce, since similar data are organized in nearby servers during the storing phase.
VI. CONCLUSIONS AND FUTURE WORK
HCube is a data center solution designed to support similarity searches in Big Data scenarios, aiming to reduce the distance needed to recover similar contents in a similarity search. The data model used in HCube is a vector space representation. Through the use of the RHH function to generate data identifiers, the cosine similarity between data is represented by the Hamming similarity between their identifiers. By correlating data identifiers and server identifiers with the cosine similarity, we were able to reproduce a scenario in which similar data is stored in the same hosting server or in servers with small Hamming distances in the DC. The use of the Gray SFC in the organization of the server identifiers benefits our solution, since it ensures that all directly connected neighbors of a storage server are at Hamming distance 1 from it. The 3-cube is flat, server-centric and presents multipath routing capabilities, increasing resilience and fault tolerance. The results show that the recall and the number of hops needed to recover a high amount of similar data qualify HCube as a DC solution for similarity search scenarios, an interesting kind of search given the high number of related works on this subject. The best results were obtained with high similarity levels. An extra layer in the HCube architecture can be implemented to evaluate the results of a similarity search, filtering and ranking the best answers to improve the precision, which corresponds to the ratio between the relevant data and the total amount of data returned in a query. As future work we plan to evaluate the fault tolerance and load balance of the HCube. Also, strategies such as content replication and caching can be used to improve recall and load balance, reducing the number of hops to recover similar data.

REFERENCES
[1] J. Gantz and D. Reinsel. (2010) The digital universe decade – are you ready? [Online]. Available: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf
[2] T. White, Hadoop: The Definitive Guide, 1st ed., M. Loukides, Ed. O'Reilly, June 2009. [Online]. Available: http://oreilly.com/catalog/9780596521981
[3] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in OSDI, 2004, pp. 137–150.
[4] P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, and R. Pasquini, "Incoop: MapReduce for incremental computations," in Proceedings of the 2nd ACM Symposium on Cloud Computing, ser. SOCC '11. New York, NY, USA: ACM, 2011, pp. 7:1–7:14. [Online]. Available: http://doi.acm.org/10.1145/2038916.2038923
[5] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in STOC '98: Proceedings of the 30th Annual ACM Symposium on Theory of Computing. New York, NY, USA: ACM, 1998, pp. 604–613.
[6] R. Villaça, L. B. de Paula, R. Pasquini, and M. F. Magalhães, "Hamming DHT: An indexing system for similarity search," in Proceedings of the 30th Brazilian Symposium on Computer Networks and Distributed Systems, ser. SBRC '12. Ouro Preto, MG, Brazil: SBC, 2012.
[7] M. S. Charikar, "Similarity estimation techniques from rounding algorithms," in STOC '02: Proceedings of the 34th Annual ACM Symposium on Theory of Computing, New York, NY, USA, 2002, pp. 380–388.
[8] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu, "BCube: a high performance, server-centric network architecture for modular data centers," SIGCOMM Comput. Commun. Rev., vol. 39, no. 4, pp. 63–74, Aug. 2009.
[9] C. Faloutsos, "Gray codes for partial match and range queries," IEEE Trans. Software Eng., vol. 14, no. 10, pp. 1381–1393, 1988.
[10] R. Pasquini, F. L. Verdi, and M. F. Magalhães, "Integrating servers and networking using an XOR-based flat routing mechanism in 3-cube server-centric data centers," in Proceedings of the 29th Brazilian Symposium on Computer Networks and Distributed Systems, ser. SBRC '11. Campo Grande, MS, Brazil: SBC, 2011.
[11] A. Frank and A. Asuncion, "UCI machine learning repository," 2010. [Online]. Available: http://archive.ics.uci.edu/ml
[12] D. Lee, J. Park, J. Shim, and S. Lee, "Efficient filtering techniques for cosine similarity joins," INFORMATION – An International Interdisciplinary Journal, vol. 14, p. 1265, 2011.
[13] J. Lawder, "The application of space-filling curves to the storage and retrieval of multi-dimensional data," Ph.D. dissertation, University of London, London, December 1999.
[14] D. Zhang, D. Agrawal, G. Chen, and A. K. H. Tung, "HashFile: An efficient index structure for multimedia data," in Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ser. ICDE '11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 1103–1114.
[15] J. Jeong and J. Nang, "An efficient bitmap indexing method for similarity search in high dimensional multimedia databases," vol. 2, 2004, pp. 815–818.
[16] I. Bhattacharya, S. Kashyap, and S. Parthasarathy, "Similarity searching in peer-to-peer databases," in Distributed Computing Systems, 2005. ICDCS 2005. Proceedings. 25th IEEE International Conference on, June 2005, pp. 329–338.
[17] C. Tang, Z. Xu, and M. Mahalingam, "pSearch: information retrieval in structured overlays," SIGCOMM Comput. Commun. Rev., vol. 33, pp. 89–94, January 2003.
[18] P. Haghani, S. Michel, and K. Aberer, "Distributed similarity search in high dimensions using locality sensitive hashing," in Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, ser. EDBT '09. New York, NY, USA: ACM, 2009, pp. 744–755.
[19] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: a distributed storage system for structured data," in Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7, ser. OSDI '06. Berkeley, CA, USA: USENIX Association, 2006.
[20] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," SIGOPS Oper. Syst. Rev., vol. 41, no. 6, pp. 205–220, Oct. 2007.