QUERY-CENTRIC STORAGE PARTITIONING FOR DISTRIBUTED SYSTEMS
A Dissertation Presented by TING ZHANG
Submitted to the Office of Graduate Studies, University of Massachusetts Boston in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY May 2017 Computer Science Program
© Copyright by Ting Zhang 2017
All Rights Reserved
QUERY-CENTRIC STORAGE PARTITIONING FOR DISTRIBUTED SYSTEMS
A Dissertation Presented by TING ZHANG
Approved as to style and content by:
Duc A. Tran, Associate Professor Chairperson of Committee
Dan Simovici, Professor Member
Gabriel Ghinita, Assistant Professor Member
Stratis Ioannidis, Assistant Professor Member
Dan Simovici, Program Director Computer Science Program
Peter Fejer, Chairperson Computer Science Department
ABSTRACT
QUERY-CENTRIC STORAGE PARTITIONING FOR DISTRIBUTED SYSTEMS MAY 2017 TING ZHANG B.E., QINGDAO TECHNOLOGICAL UNIVERSITY, QINGDAO, CHINA M.E., BEIHANG UNIVERSITY, BEIJING, CHINA M.S., UNIVERSITY OF MASSACHUSETTS BOSTON Ph.D., UNIVERSITY OF MASSACHUSETTS BOSTON Directed by: Professor Duc A. Tran
For a storage system to keep pace with increasing amounts of data, a natural solution is to deploy more servers to expand storage capacity and mitigate server bottlenecks. Due to the large quantity, these servers need to be placed at geographically distributed locations, causing inevitable communication costs. Subsequently, an important design problem is how to best partition the data across the servers. To minimize cross-server traffic, the mainstream approach is data-centric, where data with similar content are assigned to the same server. It is, however, difficult to effectively quantify content similarity in cases where the content has many attributes or belongs to incomparable categories. In contrast, this dissertation advocates a query-centric storage approach where the only input information is queries and the data partitioner aims to assign data that are often queried together to the same server. This approach
avoids assuming the existence of a content similarity measure and is thus applicable to both similarity search and non-similarity search. Following this approach, if all queries are given in advance, an optimal partitioner can be found by solving a classic hypergraph partitioning problem. The focus of this dissertation is the online setting: as queries arrive in a streaming fashion, how to revise the current partition incrementally to obtain the best partition for future queries. Contributions are (1) a formal formulation of this unexplored problem as a multi-objective optimization problem, (2) an evolutionary algorithm framework to explore Pareto-optimal partitioning solutions, and (3) an investigation of greedy online algorithms. Two case studies are considered: query-centric partitioning of an online social network and query-centric partitioning of a general distributed network. The findings are substantiated with evaluations using real-world datasets.
ACKNOWLEDGMENTS
First and foremost, I would like to express my sincere gratitude to my advisor, Professor Duc A. Tran, for his excellent guidance, understanding, patience, and, most importantly, his continuous support during my Ph.D. study. His enthusiasm for research and immense knowledge were contagious and motivational for me over the past five years. Without his guidance and persistent help this dissertation would not have been possible. I would also like to thank the rest of my committee members, Professor Dan Simovici, Professor Gabriel Ghinita, and Professor Stratis Ioannidis, for their encouragement, invaluable comments, and inspirational advice. My gratitude extends to Professor Bo Sheng, Professor Marc Pomplun, Professor Xiaohui Liang, Professor Ming Ouyang, Professor Robert Wilson, and many other faculty members for their time, energy, and willingness to help me during my study at UMass Boston. My thanks also go to my fellow lab mates in the Network Information System Laboratory, Siyuan Gong, Coung Pham, Thuy Do, and Quynh Vo, for the collaborations and all the fun we have had in the last few years. My time at UMass Boston was made enjoyable in large part by the many friends I made. I am grateful for time spent with Jiayin Wang, Yi Ren, Kaixun Hua, Dong Luo, Dawei Wang, and Yahui Di. Last but not least, I would like to thank my family for all their love and encouragement. For my parents Zhiyong Zhang and Yingkun Zhao and my brother Shiqi Zhang, who always supported me and encouraged me to pursue my dreams with
their best wishes. For my supportive, encouraging, and patient husband Qingsong Sun, whose faithful support during the final stages of this Ph.D. is deeply appreciated. And most of all for my lovely son, Alec Weigeng Sun. He has brought me so much joy through my tough times, giving me endless strength.
TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGMENTS
LIST OF FIGURES
LIST OF TABLES

CHAPTER

1. INTRODUCTION
   1.1 Motivation
   1.2 Goals
       1.2.1 Query-Centric Partitioning for Online Social Networks
       1.2.2 Query-Centric Partitioning for General Distributed Systems
   1.3 Contributions
   1.4 Organization

2. QUERY-CENTRIC PARTITIONING FOR OSNS
   2.1 Related Work
   2.2 Problem Formulation
       2.2.1 Server Load
             2.2.1.1 Read load
             2.2.1.2 Maintenance load
       2.2.2 Load Balancing
       2.2.3 Multi-Objective Optimization
   2.3 The Graph-Theoretic Approach
   2.4 Evolutionary Algorithm Approach
   2.5 S-PUT
       2.5.1 Initial Partitioning
       2.5.2 Final Partitioning
             2.5.2.1 Representation of an individual
             2.5.2.2 Evolution process
             2.5.2.3 Crossover mechanism
             2.5.2.4 Mutation mechanism
             2.5.2.5 Selection Mechanism
   2.6 Evaluation Study
       2.6.1 Effectiveness of SPEA2
       2.6.2 Effectiveness of S-PUT
       2.6.3 Effect of Input Social Graph and Social Bond Strength
       2.6.4 S-PUT vs. NSGA-II
       2.6.5 Run Time
   2.7 Other Considerations
       2.7.1 Geographic Locality
             2.7.1.1 User-Server Locality
             2.7.1.2 Server-Server Locality
       2.7.2 Optimization for Replication
   2.8 Summary

3. QUERY-CENTRIC PARTITIONING FOR GENERAL DISTRIBUTED SYSTEMS
   3.1 Related Work
   3.2 Problem Formulation
       3.2.1 Correspondence to Hypergraph Partitioning
       3.2.2 Correspondence to Metrical Task Systems
   3.3 Evolutionary Algorithm Framework
       3.3.1 Representation of An Individual
       3.3.2 Initial Population
       3.3.3 Crossover Mechanism
       3.3.4 Mutation Mechanism
       3.3.5 Selection Mechanism
       3.3.6 Migration Heuristics
       3.3.7 Evaluation Study
             3.3.7.1 EA Convergence
             3.3.7.2 Migration Heuristics Analysis
       3.3.8 Remarks
   3.4 Online Algorithms
       3.4.1 Greedy Heuristics
       3.4.2 Implementation Efficiency
       3.4.3 Evaluation Study
             3.4.3.1 Greedy Heuristics Analysis
       3.4.4 Remarks
   3.5 Summary

4. FUTURE WORK

5. CONCLUSIONS

REFERENCE LIST
LIST OF FIGURES

2.1 Parameters in a sample social graph
2.2 Example of Pareto-optimal solutions
2.3 General evolution process in an evolutionary algorithm
2.4 Architecture of S-PUT
2.5 An individual is represented as an N-element vector
2.6 Crossover mechanisms
2.7 Mutation mechanism
2.8 Results of SPEA2 after 100, 300, and 500 generations
2.9 S-PUT vs. METIS vs. SPEA2
2.10 S-PUT results: Facebook graph with randomly generated social bond strengths
2.11 S-PUT results: Gowalla graph with identical social bond strengths
2.12 S-PUT results: Gowalla graph with random social bond strengths
2.13 S-PUT results: DBLP graph with influence-based random social bond strengths
2.14 S-PUT vs. METIS vs. NSGA-II as the EA method
3.1 A gene in the chromosome of Q-PUT
3.2 Example individual in Q-PUT
3.3 Two-point crossover mechanism is used in Q-PUT
3.4 Mutation mechanism in Q-PUT
3.5 The average quality of the Pareto-optimal solutions for arXiv dataset
3.6 The average quality of the Pareto-optimal solutions for Github dataset
3.7 The average quality of the Pareto-optimal solutions for Retail dataset
3.8 The average quality of the Pareto-optimal solutions for Actor-movie dataset
3.9 Moving random items (arXiv and Github datasets)
3.10 Moving random items (Retail and Actor-movie datasets)
3.11 Moving requested items (arXiv and Github datasets)
3.12 Moving requested items (Retail and Actor-movie datasets)
3.13 Choosing requested items for migration (arXiv and Github datasets)
3.14 Choosing requested items for migration (Retail and Actor-movie datasets)
3.15 Results for the arXiv dataset with 8 servers
3.16 Results for the arXiv dataset with 16 servers
3.17 Results for the Github dataset with 8 servers
3.18 Results for the Github dataset with 16 servers
3.19 Results for the Retail dataset with 8 servers
3.20 Results for the Retail dataset with 16 servers
3.21 Results for the Actor-Movie dataset with 8 servers
3.22 Results for the Actor-Movie dataset with 16 servers
LIST OF TABLES

2.1 Summary of key notations
2.2 Parameters of the EA process
2.3 Run time for the EA process in S-PUT for Facebook and Gowalla graphs
2.4 Run time for the EA process in S-PUT for DBLP graph
3.1 Summary of datasets used in the evaluation
3.2 Number of items for each range of request frequencies
3.3 Summary of datasets used in evaluation
3.4 Summary of results
CHAPTER 1 INTRODUCTION
1.1 Motivation
The rapid development of computer and Internet technologies brings great convenience and change to our daily lives. Consequently, more and more people prefer to work and communicate using computers and mobile devices. With a few touches on his or her smartphone, one can create content and share information anytime, anywhere [1–3]. Demand for information on the go is also higher than ever. The notion of information is no longer merely about content; in a fast-moving world, its value depends critically on timing. We face a challenge: how to build a good data storage system that brings information to users timely and efficiently? In the era of big data, for a storage system to keep pace with increasing amounts of data, a natural solution is to deploy more servers to expand storage capacity and mitigate server bottlenecks. Due to the large quantity, these servers need to be placed at geographically distributed locations, causing inevitable communication costs. Subsequently, an important design problem is how to best partition the data across the servers. To minimize cross-server data traffic, the mainstream approach is data-centric, where "similar" data are stored on the same server. Here, "similar" refers to some content-related attribute space [4], such as similar content, similar location, similar time, and same ownership. However, it is hard to quantify content similarity effectively. Especially, in cases where content has many
attributes, if the data partition is optimized to preserve similarity based on certain attributes, this is unfair to requests looking for similar data in other attribute spaces. On the other hand, if we include all the data attributes to define similarity, its quality diminishes. Data may also belong to incomparable categories. For example, in an online social network, we need to store both user data (name, education, gender, etc.) and advertisement data (cars, goods, travel, etc.), and it is not easy to provide a content-related similarity measure to compare these two types of data. This is not to mention the existence of time-varying attributes, e.g., location data of mobile users, that result in expensive, frequent updating of the similarity scores. In contrast, this dissertation advocates a query-centric storage approach where the only input information is queries and the data partitioner aims to assign data that are often queried together to the same server. Despite the intuition that the more similar two objects are, the more likely they satisfy the same user query, this applies only to similarity search, such as range search and k-nearest-neighbors (kNN) search, but not to non-similarity search (e.g., seeking sets of frequently-purchased items of different types in an e-commerce application). The query-centric approach avoids assuming the existence of a content similarity measure and is thus applicable to both similarity search and non-similarity search.
1.2 Goals
The central goal of this dissertation is to find an effective query-centric partitioning approach for distributed storage systems that offers high performance and good balancing. To achieve this goal, this dissertation starts with a study on Online Social Networks, which is among the most popular data-heavy
distributed systems, and then generalizes the research to general distributed storage systems.

1.2.1 Query-Centric Partitioning for Online Social Networks
Online Social Networks (OSNs) have become a norm for communication on the Internet. According to recent reports, Facebook enlists about 1.79 billion monthly active users [5] and Youtube about 1.3 billion users [6], and the numbers keep rising. Different from the traditional Web, which is organized based on content, OSNs are managed based on users. When new users join an OSN, they create links to other users to establish social friendships. The sheer amounts of user data and interesting patterns about the user-to-user social links in OSNs bring new challenges to data storage. In an OSN, most often we have to process a read query that retrieves not only the data of a user but also that of its neighbors in the social graph, e.g., friends' status posts in Facebook or connection updates in LinkedIn. Queries of this kind are the input for our OSN partitioning problem. Consequently, the objective becomes placing the data of socially-connected users on the same servers as much as possible so that the number of servers required to process a query is kept small. By contacting only a few servers instead of many, we can substantially reduce the response time [7–10]. I refer to this criterion as social locality. State-of-the-art storage systems of today's popular OSNs, such as Gizzard [11] of Twitter and Cassandra [12] of Facebook, rely on a hash-based implementation which is blind to the desired social locality. From the graph-theoretic perspective, the OSN partitioning problem can be modeled as a graph partitioning problem. However, the effectiveness of classic graph partitioning algorithms when applied to scale-free sparse graphs, which is the case for OSNs,
has been questioned [13]. A goal of my work is to explore ways to achieve better partitioning quality and at the same time gather insights that can be useful to query-centric partitioning for a general distributed system.

1.2.2 Query-Centric Partitioning for General Distributed Systems
Distributed systems are widely adopted in today's information industry. A general distributed system consists of a collection of autonomous computers, connected through a network and distributed middleware, which enables the computers to coordinate their activities and to share the resources of the system, so that users perceive the system as a single, integrated computing facility. Examples of popular distributed systems include distributed computing systems used for high-performance computing tasks, e.g., cloud clusters and grids, distributed information systems for management and integration of business functions, e.g., e-commerce systems, and distributed pervasive systems, e.g., mobile and embedded systems and sensor networks, to name a few. In such a system, data can belong to different categories, but very often we need to pull them together for a joint processing task. As such, I set no restriction on the relationships of data that may be retrieved together in a query. In contrast, in the aforementioned OSN partitioning problem, the data to be queried must be associated with a user's neighborhood in the social graph. My goal is to investigate the query-centric partitioning problem for the general case where a query can retrieve any set of arbitrary data items. In the literature, the goal is often to compute a single partition optimized for a "batch" workload of queries regardless of the order they are issued. In this dissertation, I generalize this challenge and set the goal to optimize for a "sequential" workload: taking as input a stream of queries, seek a sequence of
partitions each optimized for queries that will arrive next; hence the order of query arrival matters. This is an original research problem.
1.3 Contributions
I have made the following contributions to achieve the goals set above.

Online Social Networks:

– Formulation of the query-centric partitioning problem for OSNs as a multi-objective optimization, with the goal to find a partition for best server read cost and best server load balancing. This formulation incorporates the heterogeneity in user read and write request rates.

– Investigation of the applicability of evolutionary algorithms (EA) to find Pareto-optimal partition assignments that offer the best tradeoffs between these two optimization objectives. This investigation shows that a conventional application of EA is not effective.

– Design of S-PUT, an EA framework that is more effective by leveraging the results from graph partitioning. The solutions obtained by S-PUT are also superior to those obtained by METIS, a classic graph partitioning solution.

The findings have been published in the Proceedings of the 2013 IFIP International Conference on Networking [14] and the Elsevier Journal on Computer Networks [15].

General Distributed Systems:

– Formulation of the query-centric partitioning problem as a multi-objective optimization, with the goal to find a sequence of partitions for best server read cost, best server migration cost, and best server load balancing. The originality of this formulation is due to its sequential nature.

– Design of Q-PUT, an EA framework to explore Pareto-optimal solutions. This is the first EA framework in the literature for finding sequential partitions optimized for sequential queries. Q-PUT enables a number of migration strategies to be evaluated, which provides useful insights for practical online algorithms.

– Investigation of several online algorithms based on greedy heuristics recommended by Q-PUT. This investigation shows that there is a real benefit in migrating items between the servers during the query sequence to best serve future queries, even though the only information we know is about the past.

The findings have been published in the Proceedings of the 2016 IEEE International Performance Computing and Communications Conference [16] and the Proceedings of the 2017 IEEE International Conference on Advanced Information and Networking Applications [17].
1.4 Organization
This dissertation is organized as follows. The query-centric partitioning problem is discussed in detail for OSNs in Chapter 2 and for general distributed systems in Chapter 3, respectively. Several extensions and other pointers to future research are proposed in Chapter 4. The dissertation is concluded in Chapter 5.
CHAPTER 2 QUERY-CENTRIC PARTITIONING FOR OSNS
Online social networks (OSNs) have become an indispensable platform for people to communicate. Many would not call it a day without logging into social media to check the news and friends' updates. From the perspective of a service provider, scalability is one of the most crucial priorities and, to achieve it, we need to understand what kind of information the users often ask for. No one would argue that most of the information users seek is localized in their social neighborhood. When users log in, they are interested in their friends' posts as well as their own. Imagine that the data of users are randomly placed on different servers; given a fast stream of user requests, there will be many requests sent to the storage servers in the OSN back-end system. The system may suffer from the multi-get hole problem, a phenomenon originally experienced in Facebook's memcached network [18]. This problem happens when there is an overwhelming CPU bottleneck on the server side due to an extremely large number of read requests, and it cannot be resolved simply by adding more servers. Now, if somehow the data of a user's neighbors is located on the same server as the user, the number of read requests should be reduced substantially. In the extreme case, we could put the data of all the users on one server, resulting in optimal read cost, but, of course, one would ask: why the need for distributed servers? An optimized data storage scheme, therefore, should preserve the social locality (for good response time) and at the same time provide a good
balancing with respect to storage loads (for bottleneck avoidance). In the literature, a system with this property is said to be "socially aware". Research on socially aware storage and replication did not take off until [10]. As introduced in Chapter 1, the focus of this dissertation is the query-centric approach: the data partitioning should be determined based on the co-queriedness of data items. For OSNs, the queries in consideration each ask for data involving a social neighborhood, and so the query-centric partitioning problem is essentially the socially aware partitioning problem. Most existing techniques rely either on heuristics [10, 19] whose effectiveness is demonstrated via experiments, or on classic graph partitioning tools known to be effective for many graphs, such as METIS [20]. In contrast, because social graphs exhibit non-trivial topological features (scale-free, highly clustered) [21], my interest is to know if there is room for better partitioning solutions. To answer this question, I investigate the socially aware data partitioning problem by modeling it as a multi-objective optimization problem and exploring the applicability of evolutionary algorithms in order to achieve highly efficient and well-balanced data partitions.
2.1 Related Work
Horizontal scaling has been a de facto standard when it comes to managing data at massive scale for most OSNs. Instead of vertical scaling, i.e., adding more hardware resources to the existing servers, the system, with horizontal scaling, is scaled “out” by adding commodity servers and partitioning the workload across these servers. On top of a distributed infrastructure of commodity storage servers, popular OSNs today adopt a hashing-based mechanism for data partitioning, e.g., range-based hashing used in Gizzard [11] of Twitter or consistent hashing in
Cassandra [12] of Facebook and Dynamo [22] of Amazon. Due to the randomness of hashing, as explained earlier, social locality is not preserved. As shown in [10], network I/O can be substantially improved at the server side by keeping all of the relevant data of each query local to the same server. Even on a disk, these data should be stored closely together to improve disk response time [23]. Aimed at improving system performance and scalability, socially aware data partitioning and replication schemes have been proposed. SPAR [10] aims to preserve social locality perfectly, i.e., every two neighbor users must have their data colocated on the same servers. This is impossible if each user has only one copy of its data, and so replicas are introduced and placed appropriately. While SPAR sets no limit on the maximal number of replicas for each user, SCLONE [9], a scheme designed solely for replication, aims to preserve social locality under a fixed space budget for replication. SCHISM [7] is a workload-driven scheme that partitions the data based on transaction patterns such as how often different data are retrieved together. While the queries targeted by SCHISM should be static and frequently repeated, OSNs often exhibit time-dependent queries (e.g., status messages of Facebook are frequently refreshed with more recent ones). This observation motivates the partitioning technique in [8], which takes as input an activity graph that changes over time to better represent social locality. Two of the aforementioned techniques, [7, 8], take into account the heterogeneity in how often a user reads/writes its own data and how often socially connected users want to see the data of one another. A recent partitioning and replication scheme, COSPLAY [24], considers the heterogeneous traffic costs to cross servers in different clouds and the quality-of-service requirements of individual users, such as preferences over which clouds/servers best store their data. In contrast, my work is focused on data partitioning without replication.
Data partitioning over multiple servers can be formulated as a graph partitioning problem, and many existing techniques rely on a classic graph partitioner to partition the social data graph. I adopt an evolutionary algorithm (EA) based approach, which is not to reinvent the wheel, though, because EA has been applied to graph partitioning (e.g., [25, 26]). My work, however, is different due to the unique setup of the multi-objective optimization as a formulation for socially aware partitioning. It is noted that the principle of network-aware locality has been utilized in the storage of non-social graph data. As an example, subgraph search in a large RDF (Resource Description Framework) graph can be made efficient if it is distributed across a number of storage nodes, e.g., pages on a disk [27] or compute servers in a cluster [28], such that cross-reference between different storage nodes is minimized. My work differs in that I seek a partition where each storage node holds a disjoint part of the data and the queries of interest are "neighborhood" queries, which are a special case of subgraph queries. This simplification allows for a partitioning solution that is more efficient than a solution universally aimed at arbitrary subgraph queries (due to many constraints). Indeed, according to [28], hash-based partitioning remains a popular choice for distributed RDF storage.
2.2 Problem Formulation
Consider a storage system with M servers that stores data for a social graph of N users. A three-tier User-Manager-Server architecture is assumed, as in most existing storage systems, in which the users do not communicate with the servers directly. Instead, the Manager, providing API and directory services, serves as the interface between the users (front-end) and the servers (back-end). The API is used by the users to query the system.
Notation                 Meaning
[z]                      The set {1, 2, ..., z}
M                        Number of servers
N                        Number of users
P = [p_is] (N x M)       Assignment of user i to server s
E = [e_ij] (N x N)       Social bond user i has towards user j
R = [r_i] (length N)     Read rate of user i
W = [w_i] (length N)     Maintenance rate of user i
lambda_s^read            Read load of server s
Lambda^read              Total read load of all servers
lambda_s^maintain        Maintenance load of server s
Lambda^maintain          Total maintenance load of all servers
Gamma^maintain           Load balancing coefficient

Table 2.1: Summary of key notations
Let N(i) denote the set of neighbor users of i. Queries are submitted to the system, each in the following form: q = (i, U), which retrieves the data of user i and of the users in the subset U ⊆ N(i). If U = ∅, only user i's data is returned. If U = N(i), every neighbor's data is returned together with user i's data. In
theory, U can be any subset of N(i). Assisted by the directory service, a user's query is always directed to its primary server (because that is where this user's data is located). The directory service can be implemented either as a global map or as a DHT. The connectivity information of the social graph is assumed to be available at the Manager. The system is characterized by the following parameters (see Table 2.1 and Figure 2.1):
Figure 2.1: Parameters in a sample social graph
Partition assignment P: an N × M binary matrix representing the assignment of users to servers. Each entry $p_{is}$ has value 1 if and only if user i is assigned to server s. Because each user must be assigned to one and only one server, P must satisfy the constraint
$$\sum_{s=1}^{M} p_{is} = 1 \quad \forall i \in [N].$$

Social relationship E: an N × N real-valued matrix representing the social relationships in the social graph. Each entry $e_{ij}$ is a value in the range [0, 1] quantifying the social bond between user i and user j. A stronger social bond indicates a stronger probability (tendency) to read the other's data. The value 1 means the strongest bond and 0 means no relationship. It is noted that although i and j are socially connected, the values of $e_{ij}$ and $e_{ji}$ are not necessarily identical: in practice, the likelihood that user i reads its neighbor j's data may differ from the likelihood that user j reads user i's data.

Read rate R: an N-dimensional real-valued vector representing user read activity. Each element $r_i$ is a positive number quantifying the rate at which a read query is issued for user i. A read query for a user is always sent to its assigned server, requesting to retrieve its data and possibly the data of its neighbors. Whether a neighbor's data is also retrieved is determined by the social bond strength given in matrix E.

Maintenance rate W: an N-dimensional real-valued vector representing the server cost for hosting, updating, and maintaining user data. Each element $w_i$ is a positive number quantifying this cost for user i. For example, in a cloud storage service, users may require different levels of quality of service and thus be charged at different rates. The update activity may also vary among the users, which incurs different costs on the server side. The maintenance rate W represents this heterogeneity in the server cost to service each user.

The values for parameters E, R, and W are given, for example, obtained from monitoring and analysis of actual workload. In practice, these values may vary over time and so I can divide the time into periods and update the values periodically. I focus on one such period: given E, R, and W, find the unknown P.
2.2.1 Server Load

Minimizing the server load and maximizing load balancing are among the most important objectives of any distributed storage system. The server load is categorized into two types: read load and maintenance load.

2.2.1.1 Read load

To understand the server read load, consider a read query q = (i, U). This query needs to be directed to user i's server, say server s, which will provide the data for i. To retrieve the data of each user j ∈ U, there are two cases:
User j's data is not located on server s: the data of j must be retrieved from the server of j, requiring one additional read request sent to this server.

User j's data is located on server s: the data of j can be provided by server s, requiring no additional server request.

The amount of data returned to user i is the same in both cases, but the number of read requests processed on the server side is different, worse if i and j are not colocated. More read requests result in more traffic and CPU processing on the server side. Therefore, an important objective is to minimize the server load due to read requests, which I refer to as read load, given the social relationships between the users and their read rates. It is noted that a user may request to access data other than its own and that of its friends. However, the latter is the more frequent activity in social networks and for this reason I focus on optimizing it. This is also the optimization goal of most previous work in socially aware storage such as [8, 10, 24, 29]. Regarding the case where a second request is needed to get neighbor j's data, this request can be sent by the server of i or by the Manager; we leave this decision to the application developer. In either case, the result is only one cross-network request being sent. In other words, we do not dictate that the second request must be sent from the server of i. For an application setup where the Manager should initiate all requests, the second request shall be issued by the Manager.

Given a server s, it has to process:

read requests initiated by users assigned to s, resulting in a cost
$$\sum_{i=1}^{N} r_i p_{is}$$

read requests initiated by users not assigned to s seeking the data of their neighbors in s, resulting in a cost
$$\sum_{i=1}^{N} r_i (1 - p_{is}) \sum_{j=1}^{N} p_{js} e_{ij}$$

The read load of s is therefore
$$\lambda_s^{read} = \sum_{i=1}^{N} r_i p_{is} + \sum_{i=1}^{N} r_i (1 - p_{is}) \sum_{j=1}^{N} p_{js} e_{ij} = \sum_{i=1}^{N} r_i \Big( p_{is} + (1 - p_{is}) \sum_{j=1}^{N} p_{js} e_{ij} \Big) \qquad (2.1)$$

The total read load summing over all the servers is
$$\Lambda^{read} = \sum_{s=1}^{M} \lambda_s^{read} = \sum_{s=1}^{M} \sum_{i=1}^{N} r_i \Big( p_{is} + (1 - p_{is}) \sum_{j=1}^{N} p_{js} e_{ij} \Big) = \sum_{i=1}^{N} r_i \Big( \sum_{s=1}^{M} p_{is} + \sum_{j=1}^{N} e_{ij} \sum_{s=1}^{M} (1 - p_{is}) p_{js} \Big) = \sum_{i=1}^{N} r_i \Big( 1 + \sum_{j=1}^{N} e_{ij} \sum_{s=1}^{M} (1 - p_{is}) p_{js} \Big) \qquad (2.2)$$

(the last derivation uses the equality $\sum_{s=1}^{M} p_{is} = 1 \ \forall i \in [N]$).
As visible in the above formulas, to minimize the total read load and/or balance the read load across the servers, we have to take into account both how the data is partitioned (P) and the social relationships among the users (E).
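The following is a minimal NumPy sketch of this accounting (illustrative only, not part of the dissertation): it evaluates Eq. (2.1) and Eq. (2.2) for a given assignment, assuming P is an N×M 0/1 matrix, E an N×N bond matrix with zero diagonal, and R a length-N vector of read rates.

```python
import numpy as np

def read_loads(P: np.ndarray, E: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Per-server read load, Eq. (2.1):
    lambda_s = sum_i r_i * ( p_is + (1 - p_is) * sum_j p_js * e_ij )."""
    neighbor_mass = E @ P                      # [i, s] = sum_j e_ij * p_js
    per_server_terms = P + (1 - P) * neighbor_mass
    return R @ per_server_terms                # length-M vector of lambda_s

def total_read_load(P: np.ndarray, E: np.ndarray, R: np.ndarray) -> float:
    """Total read load over all servers, Eq. (2.2)."""
    return float(read_loads(P, E, R).sum())

# Tiny example: 3 users, 2 servers; users 0 and 1 colocated, user 2 alone.
P = np.array([[1, 0], [1, 0], [0, 1]])
E = np.array([[0.0, 0.5, 0.9], [0.5, 0.0, 0.0], [0.9, 0.0, 0.0]])
R = np.array([1.0, 2.0, 1.0])
print(read_loads(P, E, R), total_read_load(P, E, R))
```

In this example the cross-server bond between users 0 and 2 contributes the extra read requests beyond the unavoidable per-user reads.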
2.2.1.2 Maintenance load
Besides allocating storage space to store the data for its users, a server may also provide upkeep services such as updating the data upon write requests from the users and backing up the data when needed. I refer to this overall load as maintenance load. The maintenance load depends on the number of users for whom the server stores data and their corresponding maintenance rates. This load for a server s is quantified as
$$\lambda_s^{maintain} = \sum_{i=1}^{N} w_i p_{is} \qquad (2.3)$$
For example, if $w_i = 1 \ \forall i \in [N]$, the maintenance load of a server is simply its storage load, quantified as the number of users assigned to this server. The total maintenance load of all the servers is
$$\Lambda^{maintain} = \sum_{s=1}^{M} \lambda_s^{maintain} = \sum_{s=1}^{M} \sum_{i=1}^{N} w_i p_{is} = \sum_{i=1}^{N} w_i \sum_{s=1}^{M} p_{is} = \sum_{i=1}^{N} w_i. \qquad (2.4)$$
While each individual maintenance load depends on how the data are partitioned, the total maintenance load is fixed regardless of the partition.

2.2.2 Load Balancing
To represent the degree of load balancing across the servers, a variety of measures of statistical dispersion can be used, such as the coefficient of variation, standard deviation, mean difference, and Gini coefficient. The proposed framework can use any such measure. For the purpose of illustration, I formulate the problem based on the Gini coefficient. For a population of M persons, each person i with income $x_i$, the Gini coefficient, which provides an indicator of the population's income inequality, can be defined as half of the relative mean absolute difference of the incomes [30]:
$$Gini = \frac{\sum_{i=1}^{M} \sum_{j=1}^{M} |x_i - x_j|}{2M \sum_{i=1}^{M} x_i}$$
Suppose that we rank the incomes in increasing order $x_1 \le x_2 \le \ldots \le x_M$. Then the Gini coefficient can be derived as follows:
$$Gini = \frac{\sum_{i=1}^{M} \sum_{j=1}^{M} |x_i - x_j|}{2M \sum_{i=1}^{M} x_i}
= \frac{\sum_{i=1}^{M} \big( |x_i - x_1| + \cdots + |x_i - x_M| \big)}{2M \sum_{i=1}^{M} x_i}
= \frac{\sum_{i=1}^{M} \big( i x_i - (x_1 + \cdots + x_i) + (x_{i+1} + \cdots + x_M) - (M - i) x_i \big)}{2M \sum_{i=1}^{M} x_i}$$
$$= \frac{\sum_{i=1}^{M} 2 i x_i - \sum_{i=1}^{M} \big( M x_i + (x_1 + \cdots + x_i) - (x_{i+1} + \cdots + x_M) \big)}{2M \sum_{i=1}^{M} x_i}
= \frac{2 \sum_{i=1}^{M} i x_i - \sum_{i=1}^{M} 2(M - i + 1) x_i}{2M \sum_{i=1}^{M} x_i}
= \frac{4 \sum_{i=1}^{M} i x_i - 2 \sum_{i=1}^{M} (M + 1) x_i}{2M \sum_{i=1}^{M} x_i}
= \frac{2 \sum_{i=1}^{M} i x_i}{M \sum_{i=1}^{M} x_i} - \frac{M + 1}{M}$$
For a random sample S, we need to multiply Gini by Bessel's correction factor [31, 32] $\frac{M}{M-1}$ to get an unbiased estimator:
$$G(S) = \frac{M}{M-1} \times Gini = \frac{M}{M-1} \times \Big( \frac{2 \sum_{i=1}^{M} i x_i}{M \sum_{i=1}^{M} x_i} - \frac{M + 1}{M} \Big) = \frac{2 \sum_{i=1}^{M} i x_i}{(M - 1) \sum_{i=1}^{M} x_i} - \frac{M + 1}{M - 1}$$
The Gini coefficient can be used to compare the load balancing of different distributions independent of their size, scale, and absolute values. This measure naturally captures the fairness of the load distribution, with a value of 0 expressing total equality and a value of 1 maximal inequality. Assuming that the servers are ranked in increasing order of maintenance load,
$$\lambda_1^{maintain} \le \lambda_2^{maintain} \le \ldots \le \lambda_M^{maintain}$$
or, equivalently,
$$\sum_{i=1}^{N} w_i p_{i1} \le \sum_{i=1}^{N} w_i p_{i2} \le \ldots \le \sum_{i=1}^{N} w_i p_{iM},$$
the formula for the Gini coefficient is
$$\Gamma^{maintain} = \frac{2}{(M-1)\Lambda^{maintain}} \sum_{s=1}^{M} s \times \lambda_s^{maintain} - \frac{M+1}{M-1}.$$
Replacing $\lambda_s^{maintain}$ and $\Lambda^{maintain}$ using Eq. (2.3) and Eq. (2.4), respectively, we have
$$\Gamma^{maintain} = \frac{2}{(M-1)\sum_{i=1}^{N} w_i} \sum_{s=1}^{M} s \sum_{i=1}^{N} w_i p_{is} - \frac{M+1}{M-1}.$$
The range of $\Gamma^{maintain}$ is $0 \le \Gamma^{maintain} \le 1$. To balance the server load, I will minimize this Gini coefficient.
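A minimal sketch (illustrative, not from the dissertation) of computing this balancing coefficient from W and P, following Eq. (2.3)-(2.4) and the sorted-load Gini formula above:

```python
import numpy as np

def maintenance_gini(P: np.ndarray, W: np.ndarray) -> float:
    """Load-balancing coefficient Gamma^maintain: sample Gini of the
    per-server maintenance loads."""
    M = P.shape[1]
    loads = np.sort(W @ P)             # lambda_s^maintain, servers in increasing order
    total = loads.sum()                # Lambda^maintain, fixed by W
    ranks = np.arange(1, M + 1)
    return 2.0 * (ranks * loads).sum() / ((M - 1) * total) - (M + 1) / (M - 1)

# Example: 4 users with unit maintenance rates over 2 servers (3-vs-1 split).
P = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])
W = np.ones(4)
print(maintenance_gini(P, W))          # 0.5; a perfect 2-vs-2 split would give 0.0
```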
2.2.3 Multi-Objective Optimization
Ideally, we want to simultaneously minimize the total server load $\Lambda^{read}$ and balance the individual loads $\lambda_s^{read}$ across the servers. The maintenance load need not be minimized because the total $\Lambda^{maintain}$ is fixed regardless of how the data is partitioned; we only need to balance the individual maintenance load $\lambda_s^{maintain}$ for every server s. Regarding the read load, our priority is to minimize its total. Our rationale is that (1) OSNs often exhibit frequent read activities and it is important to minimize the overall response time, and (2) the objectives of minimizing and balancing the read load at the same time conflict with each other; indeed, we could place all the data on just one server to minimize the read load, resulting in a total read load of $\sum_{i=1}^{N} r_i$, but this placement would incur the worst imbalance (the other M − 1 servers are idle). The optimization problem is expressed as follows.

Problem 2.2.1. Given E, R, and W, find a binary matrix P such that
$$\min_{P} \ \big( \Lambda^{read}, \ \Gamma^{maintain} \big)$$
subject to
$$1) \quad \sum_{s=1}^{M} p_{is} = 1 \quad \forall i \in [N]$$
$$2) \quad \sum_{i=1}^{N} w_i p_{is} \le \sum_{i=1}^{N} w_i p_{it} \quad \forall s, t \in [M] \wedge (s < t)$$
where
$$\Lambda^{read} = \sum_{i=1}^{N} r_i \Big( 1 + \sum_{j=1}^{N} e_{ij} \sum_{s=1}^{M} (1 - p_{is}) p_{js} \Big) \qquad (2.5)$$
$$\Gamma^{maintain} = \frac{2}{(M-1)\sum_{i=1}^{N} w_i} \sum_{s=1}^{M} s \sum_{i=1}^{N} w_i p_{is} - \frac{M+1}{M-1}. \qquad (2.6)$$

2.3 The Graph-Theoretic Approach
Problem 2.2.1 can be converted to a graph partitioning problem as follows. We have
$$\Lambda^{read} = \sum_{i=1}^{N} r_i \Big( 1 + \sum_{j=1}^{N} e_{ij} \sum_{s=1}^{M} (1 - p_{is}) p_{js} \Big)
= \sum_{i=1}^{N} r_i + \sum_{i=1}^{N} \sum_{j=1}^{N} r_i e_{ij} \sum_{s=1}^{M} (1 - p_{is}) p_{js}
= \sum_{i=1}^{N} r_i + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (r_i e_{ij} + r_j e_{ji}) \sum_{s=1}^{M} (1 - p_{is}) p_{js}.$$
Denote $f_{ij} = r_i e_{ij} + r_j e_{ji}$, and so $f_{ij} = f_{ji}$. Then,
$$\Lambda^{read} = \sum_{i=1}^{N} r_i + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} f_{ij} \sum_{s=1}^{M} (1 - p_{is}) p_{js}.$$
Since $\sum_{i=1}^{N} r_i$ is constant, minimizing $\Lambda^{read}$ is equivalent to minimizing
$$\sum_{i=1}^{N} \sum_{j=1}^{N} f_{ij} \sum_{s=1}^{M} (1 - p_{is}) p_{js},$$
which is the sum of the $f_{ij}$ values over pairs of users i and j who are neighbors but assigned to different servers. Let $G_{new}$ be the following undirected weighted graph:

The vertices of $G_{new}$ are the vertices of the original social graph. Each vertex i of $G_{new}$ is associated with a weight $w_i$.

An undirected link exists between vertex i and vertex j in $G_{new}$ if a link exists between i and j (directed or undirected) in the original graph. Each link (i, j) of $G_{new}$ is associated with a weight $f_{ij} = r_i e_{ij} + r_j e_{ji}$.

Then, if we apply partition P on graph $G_{new}$ to obtain M components, the edge cut due to this partition is exactly
$$\sum_{i=1}^{N} \sum_{j=1}^{N} f_{ij} \sum_{s=1}^{M} (1 - p_{is}) p_{js}.$$
In addition, the total weight of the vertices assigned to component $s \in [M]$ is
$$\sum_{i=1}^{N} w_i p_{is},$$
which is exactly server s's maintenance load $\lambda_s^{maintain}$.

Consequently, to solve Problem 2.2.1, I partition graph $G_{new}$ into M components such that:

The edge cut, i.e., the sum of the weights of inter-component links, is minimum (so that $\Lambda^{read}$ is minimized);

The total vertex weight of each component is balanced (so that $\Gamma^{maintain}$ is minimized).

This is a classic weighted multi-way graph partitioning problem known to be NP-hard [33–36], but approximation algorithms have been proposed. Among them is METIS [34], arguably one of the best approximation algorithms for partitioning a large graph into equally-weighted components with minimum edge cut. According to its website, METIS can partition a 1-million-node graph into 256 parts in just a few seconds on today's PCs. It is noted that there are graph partitioning tools other than METIS that are also effective, such as Jostle [37] and KaFFPaE [26], but METIS has been widely used in many existing systems and as a benchmark for comparison [38, 39].
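As a concrete illustration of this conversion, the sketch below (all names are illustrative, not from the dissertation) builds the vertex and edge weights of $G_{new}$ from R, E, and W and evaluates the edge cut of a given assignment; in practice, the resulting weighted graph would be handed to a partitioner such as METIS.

```python
import numpy as np

def build_gnew(E: np.ndarray, R: np.ndarray, W: np.ndarray):
    """Vertex weights w_i and symmetric edge weights f_ij = r_i*e_ij + r_j*e_ji."""
    scaled = R[:, None] * E
    F = scaled + scaled.T                      # f_ij, symmetric by construction
    return W, F

def edge_cut(F: np.ndarray, assign: np.ndarray) -> float:
    """Sum of f_ij over neighbor pairs placed on different servers
    (each unordered pair counted once)."""
    cut_mask = assign[:, None] != assign[None, :]
    return float(F[cut_mask].sum() / 2.0)

# Example: 3 users, 2 servers; user 2 is separated from its strong neighbor 0.
E = np.array([[0.0, 0.5, 0.9], [0.5, 0.0, 0.0], [0.9, 0.0, 0.0]])
R = np.array([1.0, 2.0, 1.0])
W = np.ones(3)
w, F = build_gnew(E, R, W)
print(edge_cut(F, np.array([0, 0, 1])))        # weight of the cut link (0, 2)
```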
2.4 Evolutionary Algorithm Approach
From the above discussion, we know that finding the partition with minimal total read load and optimal maintenance load balancing is NP-hard. In this section, I investigate this problem via the framework of evolutionary algorithms (EA). EA is often used to explore Pareto-optimal solutions (also called the Pareto front) for multi-objective optimization problems [40]. Problem 2.2.1 belongs to this class of problems because of its two competing objectives. Indeed, the read load is smallest when all users are assigned to one server, but this partition incurs the worst maintenance balancing.
Figure 2.2: Example of Pareto-optimal solutions
A Pareto-optimal solution is one that is not "dominated" by any other solution. A solution S1 is dominated by a solution S2 if S1 is no better than S2 in any objective and strictly worse in at least one objective. Consider the example with two objective functions o1 and o2 shown in Figure 2.2. In this figure, each solution is represented by a box point. Solution C is dominated by both solution A and solution B, but solution A and solution B do not dominate one another. The Pareto front consists of the solutions (in blue) on the red curve. Each of the other solutions (in gray) is not Pareto-optimal because it is dominated by some other solution. EA is a generic population-based optimization algorithm inspired by the biological evolution process. Candidate solutions to the optimization problem play the role of individuals in a population. The main process of an EA is shown in Figure 2.3.
Figure 2.3: General evolution process in an evolutionary algorithm
EA is an iterative process of generations, starting with an initial population of candidate solutions in the first generation and iteratively improving the population from one generation to the next, eventually reaching the final solutions in the last generation's population. The population's evolution is inspired by biological evolution, which is driven by three main "genetic" mechanisms: crossover, mutation, and selection. Crossover and mutation create the necessary diversity, thus facilitating novelty in the population. After crossover and mutation, in the selection step, the best-quality individuals are chosen to form the next generation's population. A fitness function is used to determine the quality of each individual.

Initialization: A set of individuals is generated to form the initial population. Conventionally, this generation is random to create a large diversity in the population. An individual is represented as a vector, i.e., a string of "genes".

Crossover: To change the programming of individuals from one generation to the next while retaining ancestral genes, two random parent individuals are chosen and their genes are exchanged to produce two offspring individuals. These offspring will replace the parents in the current population.

Mutation: To create genetic novelty and further maintain genetic diversity for the next generation, a random individual is chosen and one or more of its genes are altered to produce a new individual. In mutation, the new individual may be entirely different from its parent individual.

Selection: After the population has been augmented with crossover and mutation, an evaluation procedure takes place to select the individuals best fit to form the population for the next generation. The evaluation is based on a fitness function and those individuals with low fitness will be removed from the population.

When an EA algorithm stops after sufficiently many generations of crossover, mutation, and selection, the individuals in the final population represent the solutions to the optimization problem. The more generations take place, the better the quality of the final solutions. However, the EA process is expected to eventually converge to a stable state when no further improvement is observed; this is when we should stop the algorithm.
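A minimal, generic skeleton of this loop is sketched below (illustrative only): the fitness, crossover, and mutate arguments are placeholders for the concrete mechanisms described later in this chapter, and higher fitness is assumed to be better.

```python
import random

def evolve(population, fitness, crossover, mutate, generations, pop_size,
           p_crossover=0.8, p_mutate=0.2):
    """Generic EA loop: diversify the population with crossover and mutation,
    then select the fittest individuals to form the next generation."""
    for _ in range(generations):
        augmented = list(population)
        # Crossover: random parent pairs contribute two offspring each.
        shuffled = random.sample(population, len(population))
        for a, b in zip(shuffled[0::2], shuffled[1::2]):
            if random.random() < p_crossover:
                augmented.extend(crossover(a, b))
        # Mutation: random individuals contribute mutated offspring.
        for individual in population:
            if random.random() < p_mutate:
                augmented.append(mutate(individual))
        # Selection: keep the pop_size fittest individuals.
        population = sorted(augmented, key=fitness, reverse=True)[:pop_size]
    return population
```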
2.5 S-PUT
While EA is routinely used for multi-objective optimization, a typical application of EA starting with a random population may be too slow to converge to good partitioning solutions. In our case, the size of the solution space is $M^N$, with N unknown variables (the server assignment for each of the N users), the value of each variable anywhere between 1 and M. A social graph can contain millions of users, making the search space too large for a typical EA process to be effective. My proposed solution framework, called S-PUT, relies on EA for its eventual guarantee toward Pareto-optimality, but applies it in a more effective way.
Figure 2.4: Architecture of S-PUT
Specifically, S-PUT consists of two phases. In the Initial Partitioning phase, a graph partitioning algorithm is used to obtain the initial population for the EA process. In the Final Partitioning phase, the EA process takes place, resulting in the final set of optimized partition assignments. Figure 2.4 describes the architecture of S-PUT.

2.5.1 Initial Partitioning
As discussed in Section 2.3, I can convert Problem 2.2.1 to the problem of partitioning the graph $G_{new}$ into M components with minimal edge cut and balanced vertex weight carried in each component. The first step of S-PUT is to apply METIS [34] on graph $G_{new}$ to obtain a set of partitioning solutions, which I refer to as METIS solutions. METIS, a classic graph partitioner, has a parameter called a "seed" that allows for some random heuristics within the partitioning process. By varying the value of the seed, we can obtain different partitioning solutions. Each of these METIS solutions serves as an individual for the first generation of EA.
Figure 2.5: An individual is represented as an N -element vector
The number of METIS solutions needed is equal to the "population size", an input parameter for the EA process, explained in the next section.

2.5.2 Final Partitioning
This step is where I apply EA, starting with the initial population obtained in the earlier step.

2.5.2.1 Representation of an individual
I represent an individual as a vector of N elements ("genes"), $[s_1, s_2, \ldots, s_N] \in [M]^N$, which corresponds to a possible partition assignment: user i is assigned to server $s_i$. For example, for a network of 1000 nodes to be partitioned across 16 servers, an individual is a vector of 1000 integers, each having a value between 1 and 16. An example individual is shown in Figure 2.5.

2.5.2.2 Evolution process
There are many EA algorithms, among which the Strength Pareto Evolutionary Algorithm 2 (SPEA2) [41] and the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [42] are widely used. Here I describe the EA process using SPEA2; an adaptation to NSGA-II is easily derived. SPEA2 maintains two populations, $P_h$ (called the "regular population", of size |P|) and $A_h$ (called the "archive", of size |A|), for each generation h. These populations, $P_h$ and $A_h$, are updated as follows.
1. First generation (h = 0): $P_0$ is the initial population of |P| individuals (the result of applying METIS with |P| different seeds) and $A_0$ is set to empty.

2. Generation (h + 1): given $P_h$ and $A_h$,

(a) $A_{h+1}$ = the set of non-dominated individuals of $P_h \cup A_h$. This set is truncated to |A| individuals if it is larger than |A|, or padded with the lowest-fitness individuals among the dominated if it is smaller than |A|.

(b) $P_{h+1}$ = the set of |P| individuals obtained after applying crossover and mutation on $A_{h+1}$.

When the maximum number of generations, $h^*$ (given as input), is reached, the final partition assignments will be the non-dominated individuals in the final archive $A_{h^*}$.
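A simplified sketch of this archive update follows (illustrative only; `dominates` and `fitness` stand for the dominance test and fitness function defined in Section 2.5.2.5, and the truncation rule is simplified to dropping the weakest non-dominated individuals):

```python
def update_archive(population, archive, dominates, fitness, archive_size):
    """One SPEA2-style archive step: keep non-dominated individuals, then
    truncate or pad to the fixed archive size."""
    pool = population + archive
    nondominated = [p for p in pool if not any(dominates(q, p) for q in pool)]
    if len(nondominated) > archive_size:
        nondominated.sort(key=fitness, reverse=True)   # drop the weakest
        return nondominated[:archive_size]
    if len(nondominated) < archive_size:
        dominated = [p for p in pool if p not in nondominated]
        dominated.sort(key=fitness)                    # pad with lowest fitness
        return nondominated + dominated[:archive_size - len(nondominated)]
    return nondominated
```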
2.5.2.3 Crossover mechanism
The number of parent couples to be replaced by their offspring in the crossover mechanism is determined by a probability $p_{crossover}$ (given as input). Therefore, we should expect $|A| \times p_{crossover}$ parents to be replaced by their offspring. The gene crossover can be done in three ways: one-point crossover, two-point crossover, or uniform crossover; see Figure 2.6. My experiment uses the two-point mode, which works as follows. Choose $|A| \times p_{crossover}$ pairs of individuals randomly (each individual chosen only once).
Figure 2.6: Crossover mechanisms
For each chosen pair of individuals $[s_1, s_2, \ldots, s_N]$ and $[s'_1, s'_2, \ldots, s'_N]$, choose two random positions in the individual vector, i and j (1 < i < j < N). Then, the offspring to replace the original pair are
$$[s_1, s_2, \ldots, s_{i-1}, s'_i, \ldots, s'_j, s_{j+1}, \ldots, s_N]$$
and
$$[s'_1, s'_2, \ldots, s'_{i-1}, s_i, \ldots, s_j, s'_{j+1}, \ldots, s'_N].$$
The offspring individuals will be part of the regular population $P_{h+1}$.
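A minimal sketch of this two-point crossover on assignment vectors (illustrative only; 0-indexed lists stand in for the gene vectors):

```python
import random

def two_point_crossover(parent_a, parent_b):
    """Two-point crossover: swap the gene segment between two random cut
    positions i..j of the two parent assignment vectors."""
    n = len(parent_a)
    i, j = sorted(random.sample(range(1, n), 2))   # two interior positions, i < j
    child_a = parent_a[:i] + parent_b[i:j + 1] + parent_a[j + 1:]
    child_b = parent_b[:i] + parent_a[i:j + 1] + parent_b[j + 1:]
    return child_a, child_b

# Example with N = 8 users and M = 4 servers.
a = [1, 1, 2, 2, 3, 3, 4, 4]
b = [4, 4, 3, 3, 2, 2, 1, 1]
print(two_point_crossover(a, b))
```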
2.5.2.4 Mutation mechanism
In the mutation mechanism, a number of single parents are randomly selected from the current population, each to be replaced by an offspring. Every individual has a probability $p_{mutate}$ of being selected.
Figure 2.7: Mutation mechanism
Once the parent is selected, the offspring is created by setting random values at a number of random genes in the parent vector. Every position has a probability $p_{gene\,mutate}$ of being mutated. The expected number of single parents to be replaced is $|A| \times p_{mutate}$ and the expected number of genes to be randomized in each parent is $N \times p_{gene\,mutate}$. Consider a parent individual $[s_1, s_2, \ldots, s_N]$ whose positions to be mutated are $1 \le i_1 < i_2 < \ldots < i_k \le N$. Let $t_1, t_2, \ldots, t_k$ be k random values in [M]. Then the parent will be replaced with the offspring individual $[o_1, o_2, \ldots, o_N]$ where
$$o_i = \begin{cases} t_i & \text{if } i \in \{i_1, i_2, \ldots, i_k\} \\ s_i & \text{otherwise.} \end{cases}$$
This individual will be part of the regular population $P_{h+1}$. Figure 2.7 illustrates an example where k = 1.
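A minimal sketch of this mutation step (illustrative only; each gene is independently re-drawn with probability p_gene_mutate):

```python
import random

def mutate(parent, num_servers, p_gene_mutate):
    """Gene-wise mutation: each position is re-assigned to a random server
    with probability p_gene_mutate."""
    return [random.randint(1, num_servers) if random.random() < p_gene_mutate else s
            for s in parent]

# Example: mutate roughly 1% of the genes of a 1000-user individual, M = 16.
parent = [random.randint(1, 16) for _ in range(1000)]
offspring = mutate(parent, num_servers=16, p_gene_mutate=0.01)
print(sum(a != b for a, b in zip(parent, offspring)), "genes changed")
```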
Selection Mechanism
After crossover and mutation are applied on the current archive population, the selection mechanism is applied to obtain the next archive population. The selection is based comparing fitness values of individuals. An individual p dominates an individual q, denoted by p q, if the corresponding partition assignment of p is no worse than that of q in terms of both Λread (total read load) and Γmaintain (balancing of maintenance load), with at
29
least one objective strictly better. The fitness of an individual p in generation h is defined as
X
f itness(p) = 1 −
q∈Ph ∪Ah
|
1 strength(q) + σp,k + 2 :qp {z
raw f itness
}
| {z }
density f actor
where strength(q) = cardinality of set {q0 ∈ Ph ∪ Ah | q q0 }
is the number of individuals dominated by q in the set Ph ∪ Ah and
σp,k =
q
(Λp − Λp0 )2 + (Γp − Γp0 )2
measures the similarity between the objectives [Λread = Λp , Γmaintain = Γp ] of individual p and the objectives [Λread = Λp0 , Γmaintain = Γp0 ] of its k -nearest neighbor p0 . As common setting, k is set to
p |P| + |A|. Intuitively, the first
term, raw f itness, in the fitness formulation represents that an individual has low fitness if it is dominated by individuals that are weak, and the second term, density f actor, incorporates the density information in order to discriminate between individuals having identical raw f itness, favoring individuals with denser neighborhood. In the denominator of density f actor, number 2 is added to ensure that density f actor < 1.
2.6
Evaluation Study
Presented in this section are the results of my simulation study to evaluate S-PUT in comparison with the conventional EA and graph-based partitioning. Two EA algorithms, SPEA2 and NSGA-II, were used for the evolution process and METIS was used for graph partitioning. Three real-world social graph 30
samples were considered: a Facebook graph obtained from a dataset made available by Max-Planck Software Institute for Software Systems, a Gowalla graph made available by the University of Cambridge, and a DBLP co-author graph made available at http://www.sommer.jp/graphs/. The Facebook graph contains N = 63, 392 users in the New Orleans region and 816,886 links, resulting in an average degree of 25.7. The Gowalla graph has N = 196, 591 users and 950,327 links, resulting in an average degree of 9.7. The DBLP graph has N = 713, 291 users and 2,148,708 links, hence an average degree of 6.
In the evaluation, read and maintenance rates of a user are assumed to be linearly proportional to its social degree, thus these rates are given values in the range (0, 1) proportional to the degree. The justification for this choice is that users with more connections tend to access the data more frequently and also the system has to carry more maintenance load for keeping up with these users’ activities [43, 44]. Three models are considered for simulating the social strength between neighboring nodes: Constant model: every relationship has identical strength 1. Random model: strength is uniformly generated in the range (0, 1). Influence model: strength eij is computed based on how influential user j
is to user i. The last model is used in the analysis of the DBLP graph where the influence information can be computed; definition and more details about this are given in Section 2.5.3. The first two models are used in the analysis of the Facebook and Gowalla graphs because we do not have the influence information. The number of servers is set to M = 16. The parameters for the EA process (SPEA2 and NSGA-II) are summarized in Table 2.2. The probability parameters’ values chosen are often used in EA evaluation. 31
Parameter
Setting
population size
|P| ∈ {100, 500}
archive size
|A| = |P|
crossover probability
pcrossover = 0.8
mutation probability
pmutate = 0.5
gene mutation probability
pgene mutate = 0.001
number of generations
h∗ ∈ {100, 300, 500}
Table 2.2: Parameters of the EA process
Three schemes are compared: METIS: as earlier discussed in Section 2.3, I first convert the original
directed social graph to an undirected weighted graph Gnew and then apply the METIS k -way partitioning algorithm (version 5.0.2, with overweight balance ratio set to 1.03) on this graph. To generate a METIS partition, I apply METIS using a random seed. We can obtain a set of METIS partitions by repeating this application, each time using a new random seed. (Original) SPEA2 or NSGA-II: the conventional way to apply EA, starting
with a random population. This scheme uses SPEA2 or NSGA-II for the evolution process. S-PUT: Instead of starting with a random population, S-PUT starts with
a population of METIS solutions and then applies the evolution mechanisms of SPEA2 or NSGA-II.
32
2.6.1
Effectiveness of SPEA2
(a) Population: 100 individuals
(b) Population: 500 individuals
Figure 2.8: Results of SPEA2 after 100, 300, and 500 generations
First, I evaluate the effectiveness of SPEA2, representing the evolutionary approach to solving the proposed partitioning problem. In this simulation, SPEA2 starts in the first generation consisting of random partition assignments (referred to as RANDOM assignments). Figure 2.8(a) shows the initial population of 100 individuals (RANDOM) and the final population of non-dominated individuals after 100 generations, 300 generations, and 500 generations. The random assignments have a total
33
read load consistently around 124,500 and a load balancing coefficient around 0.08. Gini coefficient below 0.1 implies excellent load balancing. After 100 generations of applying SPEA2, we observe some improvement. If we look at the final solution of SPEA2 with best load balancing, the reduction compared to the initial population is 19% (0.065/0.08 ≈ 81%). Although 19% is a good percentage number, this is an improvement over a load distribution that is already excellent. What I wish to see is a more significant improvement on the total read load. However, if we look at the final solution with best read load, the reduction compared to the initial population is only a tiny 2% (122000/124500 ≈ 98%). Similar observations are observed in the case the population size is 500 individuals, which are shown in Figure 2.8(b). In either case, 100 individuals or 500 individuals, the improvement does not seem to get much better after 300 generations and 500 generations, suggesting that we have seen the practical best of SPEA2 (unless we have to run many more generations, which is prohibitively long). I have shown that SPEA2 is limited in its effectiveness. On the other hand, evolutionary algorithms are typically known as an effective way to result in Pareto-optimal solutions for multi-objective optimization problems. Therefore, the next question in our study is to see if indeed we cannot do better than SPEA2. To answer this question, I compare the result of SPEA2 after 500 generations to the METIS method that finds partitioning assignments as described in Section 2.5.1. This comparison is illustrated in Figure 2.9, which shows a clear contrast between these two methods. Not only does METIS result in excellent load balancing comparable to SPEA2 and RANDOM, but METIS is superior in terms of total read load. In the case the population size is 100 individuals, an average METIS partitioning assignment offers a total
34
(a) Population: 100 individuals
(b) Population: 500 individuals
Figure 2.9: S-PUT vs. METIS vs. SPEA2
read load that is an 40% improvement over RANDOM (compared to just 2% improved by SPEA2). A similarly significant improvement is also observed if the population size is 500 individuals. It is noted that METIS is faster than SPEA2 is. What stood out in this study is that (1) a typical use of an evolutionary algorithm, SPEA2 in our experiment, is not effective in finding a good partition assignment, and (2) it is possible to have a partitioning technique that runs faster and offers substantially better optimization quality than SPEA2 does.
35
2.6.2
Effectiveness of S-PUT
Here, I discuss the results of running our proposed technique, S-PUT. These results are illustrated in Figure 2.9. There are two noteworthy observations. Firstly, the EA process in S-PUT is highly effective. After 100 generations, the improvement of S-PUT over its initial population (METIS) in terms of total read load is as high as 20% (60, 000/75, 000 ≈ 80%) for the case with 100 individuals, and 26% (55, 000/75, 000 ≈ 74%) for the case with 500 individuals. The typical SPEA2 process improves only about 2% over its initial population. The improvement in terms of load balancing in S-PUT is no worse than that in SPEA2. Secondly, the quality of S-PUT partition assignments is remarkable compared to METIS and SPEA2. In the case with 100 individuals, if we use the quality of RANDOM as benchmark, on average, S-PUT offers a total read load that is 65, 000/124, 500 ≈ 52% of RANDOM, while the total read loads for METIS and SPEA2 are 62% and 98%, respectively. These percentage numbers are similarly observed in the case with 500 individuals. Note that in this study, S-PUT stops after 100 generations whereas SPEA2 stops after 500 generations. 2.6.3
Effect of Input Social Graph and Social Bond Strength
The results I have discussed above are with the Facebook graph in which the social bond strength between adjacent nodes is identical (E = 1). In this section, we will see if S-PUT remains superior when the social bond strength is randomly generated or influence-based or when the social graph is different (Gowalla and DBLP). Figure 2.10 shows the S-PUT results for the Facebook graph in which the social bond strength is random. There is still a clear contrast between the group of S-PUT/METIS results and the group of SPEA2/RANDOM results, the former obviously offering better partition assignments than the latter. All
36
(a) Random strengths, 100 individuals
(b) Random strengths, 500 individuals
Figure 2.10: S-PUT results: Facebook graph with randomly generated social bond strengths
of these methods offer excellent load balancing (Gini coefficient is less than 0.1), but S-PUT is no question the clear winner in terms of total read load. As an EA process, S-PUT is faster than SPEA2 to reach a good Pareto “front” (i.e., set of non-dominated solutions). After 100 generation, S-PUT can improve the total read load over its initial population by as much as 25%, whereas SPEA2 after 500 generations can improve over its initial population by only 2%, which is insignificant.
37
(a) Identical strengths, 100 individuals
(b) Identical strengths, 500 individuals
Figure 2.11: S-PUT results: Gowalla graph with identical social bond strengths
The case with Gowalla graph offers a slightly different picture. As seen in Figure 2.11 and Figure 2.12, the EA process in both SPEA2 and S-PUT is more effective than the case with Facebook. In other words, with the Gowalla graph, it is quicker for both SPEA2 and S-PUT to improve over their initial population. This helps SPEA2 get closer to METIS, albeit still inferior. Nevertheless, S-PUT clearly remains the best, standing out from the other methods. Consider the case with 100 individuals as population size. For the same load balancing (Gini coefficient narrowly around 0.08), looking at the solution with
38
(a) Random strengths, 100 individuals
(b) Random strengths, 500 individuals
Figure 2.12: S-PUT results: Gowalla graph with random social bond strengths
the best total read load for the case of identical social strength bonds, S-PUT’s load is 76% of METIS, 65% of SPEA2, and 58% of RANDOM (Figure 2.11(a)). For the case of random social strength bonds, the total read load of S-PUT is 80% of METIS, 66% of SPEA2, and 60% of RANDOM (Figure 2.12(a)). These percentage numbers only change slightly for the case with 500 individuals as population size. Figure 2.13 shows the comparison in the partitioning of the DBLP co-author graph. This graph is substantially larger than the other two graphs and con-
39
(a) 100 individuals
(b) 500 individuals
Figure 2.13: S-PUT results: DBLP graph with influence-based random social bond strengths
tains information about the number of papers co-authored by each pair of users. I set the social bond eij to inf luence(j, i) – the social influence user j has on user i. According to Hangal et al. [45], the social influence user j has on user i is computed as the proportion of their shared papers relative to the total
number of papers of i: inf luence(j, i) =
Pnum papers(i,j) k num papers(i,k)
. Applying this concept
in our problem, if user j has higher influence on user i than does user k on user i, i.e., inf luence(j, i) > inf luence(k, i), user i is more likely to access user j ’s data than to access user k ’s data. With the social bond defined this way,
40
which I think is reasonable, the superiority of S-PUT and slow evolution of the conventional use of SPEA2 are similarly observed in our results, after just 500 generations. Throughout all the simulation runs with various configurations, S-PUT consistently outperforms the competing methods by significant margins. Not only that it is faster than the typical EA process to improve over the first generation but it incurs significantly less total read load while maintaining comparable if not better load balancing. 2.6.4
S-PUT vs. NSGA-II
In the foregoing sections, I have discussed the evaluation in the case SPEA2 is used the the EA method in comparison. I also run the simulation in which the EA method used is NSGA-II, another popular EA in the literature. The results are summarized in Figure 2.14. S-PUT exhibits a strong effectiveness similar to that observed in the case SPEA2 is used as the EA method. A typical application of NSGA-II starting with a random population is not effective whereas a quick application of METIS can achieve better results. The best partition assignments are resulted by S-PUT. The comparison is done for two cases of population size, 100 and 500 individuals. This study, once more, confirms the limitation of today’s partitioning techniques in getting a good tradeoff between response time (reflected in the total read load) and load balancing and also shows a significant improvement by using a simple yet effective EA-based framework (S-PUT). 2.6.5
Run Time
Since S-PUT is an EA-based framework, the computation time is a trade-off for the quality of the final partition assignments. Table 2.3 presents the time for the EA process in S-PUT to complete. Simulation runs on a Linux workstation 41
(a) Facebook: Identical strengths, 100-ind.
(b) Facebook: Identical strengths, 500-ind.
(c) Facebook: Random strengths, 100-ind.
(d) Facebook: Random strengths, 500-ind.
(e) Gowalla: Identical strengths, 100-ind.
(f) Gowalla: Identical strengths, 500-ind.
(g) Gowalla: Random strengths, 100-ind.
(h) Gowalla: Random strengths, 500-ind.
Figure 2.14: S-PUT vs. METIS vs. NSGA-II as the EA method
42
S-PUT Run Time
Facebook
EA Method Pop. 100gen. 300 SPEA2 8 servers SPEA2 16 servers NSGA-II 8 servers NSGA-II 16 servers
Gowalla 500
100
300
500
1h16m
25m
1h16m
2h8m
100
15m
47m
500
1h12m
3h41m 6h3m
2h
6h
10h15m
100
30m
1h27m 2h30m
50m
2h25m
4h
500
2h23m
7h22m 12h10m 4h
12h
20h10m
100
15m
50m
26m
1h17m
2h9m
500
1h13m
3h34m 6h2m
2h2m
6h
10h7m
100
29m
1h29m 2h27m
47m
2h22
3h5m
500
2h24m
7h12m 11h55m 3h52m 11h32m 20h24m
1h16m
Table 2.3: Run time for the EA process in S-PUT for Facebook and Gowalla graphs
with 8GB memory and 2.66GHz dual-core Intel Xeon CPU 3070, with different configurations in terms of EA population size, number of generations, the EA method used, and the number of servers for partitioning. Here, I show the time information for the case E = 1, but for the case E is random the completion time should be similar because changes in the values of E do not affect the run time. Whether the EA used is SPEA2 or NSGA-II, the completion of time is similar. For both Facebook and Gowalla graphs, as expected, the EA completion time is linearly proportional to the population size and the number of generations. For example, in the case that SPEA2 is used with S-PUT, starting with 100 individuals, it took 30 minutes to run 100 generations on the Facebook graph and 50 minutes on the Gowalla graph. This time is roughly tripled with 300 generations and quintupled if we continue to 500 generations or have a population size of 500 individuals. In my discussion of the simulation results in prior sections, S-PUT with population size of 100 individuals already outperforms the competing methods if it runs EA for only 100 generations,
43
S-PUT using SPEA2 No. Servers Pop. 8 16
DBLP 100 gen. 300 gen.
500 gen. METIS
100
1h52
6h
10h
35m
500
9h8m
28h50m
48h6m
2h55m
100
3h46m
11h24m
19h8m
42m
500
18h24m
55h24mm 92h
3h31m
Table 2.4: Run time for the EA process in S-PUT for DBLP graph
which corresponds to 30 minutes of running on Facebook and 50 minutes on Gowalla. Table 2.4 presents the time information for the partitioning evaluation on the DBLP graph. The time scalability in terms of population size and number of generation is similarly observed. Due to the significantly larger size, the time for partitioning DBLP is much longer than that for Facebook and Gowalla. For example, using 8 servers and 100-individual population size, it takes 15 minutes to run 100 generations on Facebook, 25 minutes on Gowalla (about 3x more vertices), and almost two hours to run on DBLP (about 10x more vertices and 3x more edges). For DBLP, the longest time is 90 hours for the case of 16 servers, 500 generations of EA, and population size 500. This time is long, but manageable given the size of the graph. This is encouraging in terms of time complexity, noting that our simulation runs on a moderate Linux workstation with 8GB memory and 2.66GHz dual-core Intel Xeon CPU 3070. In practice, where we have more resourceful and parallelizable computing nodes, the time should be much shorter.
44
2.7
Other Considerations
2.7.1
Geographic Locality
The partitioning problem I have addressed seeks a query-centric solution for queries that preserve social locality. Once a partition has been determined to assign groups of highly-connected users to their corresponding servers, theoretically, any permutation of the servers can also work. In other words, a group can be assigned to an arbitrary server as long as no other group maps to the same one. I leave it open for the application developers, depending on their specific needs, to decide which server is assigned to each user group. Here, I discuss two common cases in practice where geographic locality is an important factor for choosing the servers for the users. To take into account both social locality and geographic locality, first, I run S-PUT to obtain a socially aware partition of the users into groups and then assign each group to a server in a way that is geographically aware. Specifically, suppose that as a result of the partitioning solution we obtain a partition P which partitions the users into M groups of users, U [1], U [2], ..., U [M ], where
U [s] = {i ∈ [N ] | pis = 1}.
Let T [1], T [2], ..., T [M ] denote the actual physical servers. We need to find a binary matrix X = [xgs ]M ×M that represents the assignment of each group U [g ] to an actual server T [s]; i.e., xgs = 1 if and only group U [g ] is assigned to
physical server T [s]. 2.7.1.1
User-Server Locality
In this case, I assume a geographically-aware communication cost associated with the assignment of a user to a server. Since it is not desirable to assign a social user permanently located in Boston to a server located far away in South 45
Africa, we should take into account this cost when assigning users to servers. The group-to-server assignment can be formulated as a linear assignment problem.
I represent the user-server geographic locality by a distance matrix, D = [dis ]N ×M , assumed given by the application, where dis quantifies the geographic distance between a user i and a physical server T [s]. Given a group-to-server assignment X , the geographically-aware communication cost incurred by processing queries involving users in group U [g ] and a server T [s] is
cgs =
X
dis
i∈U [g]
and so the total over all servers and all groups is M X M X
cgs xgs .
g=1 s=1
Ideally, we want X to minimize this total cost. The only constraint is that X is an one-to-one correspondence between U [.] and T [.]. Consequently, I solve the following Linear Assignment Problem (LAP).
minimize X
M X M X
cgs xgs
g=1 s=1
subject to 1)
M X
xgs = 1 ∀s ∈ [M ]
g=1
2)
M X
xgs = 1 ∀g ∈ [M ]
s=1
3) xgs ∈ {0, 1} ∀g, s ∈ [M ] LAP has an optimal solution which can be found in O(M 3 ) time [46].
46
A nice property of social locality is that it is to some extent reflective of geographical locality. Empirical studies have shown that the number of neighbors of a given user decreases quickly with geographic distance and, consequently, most neighbors should stay in the local geographical region of this user. For example, the study [47] for LiveJournal.com suggests that the proportion of links with geographic distance δ is 1/δ 1.2 + 5 × 10−6 . Another study [48] observes that 58% of the links in FourSquare.com, 36% in BrightKite.com, and 32% in LiveJournal.com are shorter than 100km; these OSNs have average link distances of 1296km, 2041km, and 2727km, respectively. Therefore, by grouping socially connected users together and assigning the whole group to its corresponding server according to the optimal solution to the linear assignment problem above, I do not break their geographical locality. 2.7.1.2
Server-Server Locality
In practice, there are cases where the servers are placed in different “clouds” of a geo-distributed cloud storage network. Consequently, the cost (e.g., communication cost, delay, etc.) to cross different servers may vary widely; it should cost more to communicate with a server that is more distant. It is desirable that for those socially-connected users who are assigned to different groups, these groups should be assigned to servers close to each other. To represent this server-to-server locality, I use a distance matrix D = [hst ]M ×M where dst quantifies the geographically-aware cost to send a unit of traffic from a server T [s] to a server T [t]. Given a group-to-server assignment X , the cross-server cost between a pair of servers, T [s] and T [t], can be quantified as cst = dst
M X M X g1 =1 g2 =1
47
x g 1 s x g 2 t ag 1 g 2
where tg1 g2 is the amount of cross-server traffic between two groups U [g ] and U [g 0 ] quantified as X
ag1 g2 =
X
(ri eij + rj eji ).
i∈U [g1 ] j∈U [g2 ]
Ideally, we want X to minimize the total cross-server cost M X M X
cst =
M X M X
s=1 t=1
dst
x g 1 s x g 2 t ag 1 g 2
(2.7)
g1 =1 g2 =1
s=1 t=1 M X M X
=
M X M X
xg 1 s
g1 =1 s=1
M X
dst
t=1
M X
x g 2 t ag 1 g 2
(2.8)
g2 =1
I thus solve the following quadratic assignment problem (QAP) [49]:
minimize X
M X M X
dst
subject to 1)
x g 1 s x g 2 t ag 1 g 2
g1 =1 g2 =1
s=1 t=1 M X
M X M X
xgs = 1 ∀s ∈ [M ]
g=1
2)
M X
xgs = 1 ∀g ∈ [M ]
s=1
3) xgs ∈ {0, 1} ∀g, s ∈ [M ] QAP in general is NP-complete (because the Traveling Salesman Problem, which is NP-complete, can be polynomially reduced to QAP [50]). However, we 1
can use the QAPLIB software library
to get a feasible approximate solution
to this problem. 2.7.2
Optimization for Replication
I have herein before focused exclusively on the partitioning problem, which allows the resulted partition to work with any arbitrary replication scheme 1
http://www.seas.upenn.edu/qaplib
48
atop. I present below how to formulate a similar optimization framework to design an optimal replication scheme on top of any arbitrary partition. First, in addition to the partitioning parameter P , read rate R, maintenance rate W , and social bond E , I introduce a new parameter to represent the replication assignment: Replication assignment X : an N × M binary matrix representing the
replica assignment of user data across the servers. In this matrix, each entry xis has value 1 if and only if user i is replicated at server s. Since a replica cannot reside on the same server with its primary copy, we have
xis + pis ≤ 1 ∀ i ∈ [N ], s ∈ [M ]
(2.9)
The replication problem is to find the unknown X given P , R, and W . With the availability of the replicas, I process a read query q = (i, U ) for a user i as follows. A read request is sent user i’s primary server, say server s, which will provide the data for i. To retrieve the data for each neighbor user j ∈ U , there are two cases: User j ’s data (primary or replica) is located on server s: The data of j
can be provided by server s, requiring no additional server read request. User j ’s data (primary or replica) is not located on server s: The data
of j needs to be retrieved from the primary server of j , requiring one additional read request sent to this server. Therefore, I compute the read load at a server s as
λread = s
N X
ri
pis + (1 − pis )
i=1
N X j=1
49
pjs eij
M X t=1
! pit (1 − xjt )
which is due to read requests that are initiated by (1) each user i primarily assigned to s, incurring a cost ri pis ; and (2) each user i not primarily assigned to s, that have neighbors j primarily assigned to s but these neighbors are not collocated with i, incurring a cost N X
ri (1 − pis )
pjs eij
M X
pit (1 − xjt ).
t=1
j=1
The number of read requests belonging to the latter group depends on the social strength matrix E which determines whether a neighbor’s data needs also to be retrieved. The total read load of all the servers is
Λ
read
=
M X
λread s
s=1
=
M X N X
ri
pis + (1 − pis )
s=1 i=1
=
N X i=1
ri
N X
pjs eij
M X
1+
j=1
eij
pit (1 − xjt )
t=1
j=1 N X
!
!
M X
(1 − pis )pjs −
N X M X i=1 s=1
s=1
xis
N X
rj pjs eji .
(2.10)
j=1
Because the first term in the above formula (Eq. 2.10) is constant, to minimize Λread is equivalent to minimizing
A=−
M N X X
xis
i=1 s=1
N X
rj pjs eji .
j=1
The maintenance load at a server s is quantified as the number of users whose data (primary or replica) server s stores multiplied by the maintenance rate, λmaintain s
=
N X
wi (pis + xis ) .
i=1
50
Thus, the total maintenance load is
Λ
maintain
= =
M X s=1 N X
λmaintain s wi
M X
=
M X N X
pis +
s=1
i=1
wi (pis + xis )
s=1 i=1 M X
!
xis
=
s=1
N X
wi
1+
M X
! xis
.
s=1
i=1
The constraints for the replication problem should be application-specific. Here, I discuss the case for systems with limited storage budget for replication and equal degree of data availability so that everyone has an equal chance to successfully access data under any failure condition. Therefore, a useful constraint is to have the same number, K , of replicas for each user, in addition to their primary copy; i.e., M X
xis = K for ∀i ∈ [N ].
s=1
With this constraint, the total maintenance load becomes
Λ
maintain
=
N X
wi
1+
M X
! xis
= (K + 1)
s=1
i=1
N X
wi
i=1
which is constant regardless of the replication scheme. Therefore, we want to balance the maintenance load as similarly done for the case of partitioning. If λmaintain is ranked in increasing order, the Gini coefficient for the maintenance s
load is M
Γ
maintain
N
X X 2 M +1 = s wi (pis + xis ) − PN M −1 (M − 1) i=1 wi s=1 i=1
and so to minimize this coefficient is equivalent to minimizing
B=
M N X X s wi xis s=1
i=1
51
The replication problem, therefore, is formulated as follows: Problem 2.7.1. Find binary matrix X such that ( A=−
minimize X
N X M X i=1 s=1
xis
N X
M N X X rj pjs eji , B = s wi xis s=1
j=1
)
i=1
subject to 1) xis + pis ≤ 1 ∀i ∈ [N ], s ∈ [M ]
2) 3)
M X s=1 N X i=1
xis = K ∀i ∈ [N ] wi (pis + xis ) ≤
N X
wi (pit + xit ) ∀1 ≤ s < t ≤ M
i=1
Solving this problem using EA is a focus of my future work. It will be interesting because the search space for X is substantially larger than that for P in the partitioning problem and also it is not obvious how we can generate
a good population for the initial population in the EA process.
2.8
Summary
Queries in OSNs often demonstrate social locality and therefore this property should be preserved in the data storage of any OSNs in order to improve the server efficiency. I have shown in this chapter how the socially aware partitioning problem can be formulated as a multi-objective optimization problem with two competing objectives, minimizing the total read load and balancing the maintenance load, taking into account the user activity and social relationship. I have then explored Pareto-optimal solutions to this problem using EA. I have shown that a typical application of EA does not work acceptably. I have also found METIS, which is the basis for today’s partitioning techniques for OSNs, better but far from being optimal. The proposed framework, S-PUT, by feeding a set of METIS solutions to the EA process, can be much more 52
effective. S-PUT is superior to METIS in both load minimization and load balancing. Although the run time is a trade-off for any EA process, S-PUT has been shown to converge to a set of excellent partition assignments within a reasonable amount of time. In practice, S-PUT can be used to produce a good benchmark for comparing techniques. It can also provide a good initial partitioning solution for further improvement, for example, where the geographical locality between users and servers and that among the servers themselves are taken into account. I have also suggested how to integrate replication in a similar optimization framework. The query-centric partitioning problem discussed in this chapter is simplified because I assume prior knowledge about the input queries. Indeed, we know that data queried together must be associated with a social neighborhood and that the tendency of having certain data queried together is given in the social strength matrix E . Also, I have not addressed the case given a stream of queries where the partition needs to be computed efficiently on the fly, instead of offline. The next chapter will cover a broader case.
53
CHAPTER 3 QUERY-CENTRIC PARTITIONING FOR GENERAL DISTRIBUTED SYSTEMS
This chapter extends the previous chapter in the following aspects. First, a query can, theoretically, ask for arbitrary data items whose relationship is not known in advance. The data can be that belonging to the same social community in an online social network, that being near a given location in location-based services, or that about the items frequently purchased together in e-commerce transactions. I do not assume any known patterns such as those regarding how popular certain items are or how often certain items are requested in the same query. Second, queries are received in a sequential manner and we have the option to revise the partition to best benefit the queries that will arrive next. Third, the partition revision has to be computed on the fly, irrevocably, unknown of future queries, yet aiming to minimize the total sequential costs which consist of the number of servers read and number of items migrated during the entire process. The unique difference in this broader case is that I do not seek a single partition optimized for a “batch” workload of queries; instead, the goal is to determine a sequence of partitions each optimized for a query, the next query, in a “sequential” workload. A partitioning solution optimal for a batch of queries may not offer the best sequential costs incurred when processing the queries sequentially in a specific order. Vice versa, a partitioning solution optimal in these sequential costs for a given order may not be optimal in the batch cost.
54
I refer to the above problem as Sequential Query Centric Partitioning (SQCP). This problem is novel, for which I am aware of no earlier research. Unfortunately, it has no optimal solution due to its online nature; there is simply no way to compute a partition that is optimal for the next queries since we do not know what group of data items will be requested next. Even in the offline setup where the entire query sequence is known in advance the problem is already NP-hard. In this chapter, I will show that SQCP can fit partially into some existing frameworks but there are fundamental challenges that are not even explored in the literature of those frameworks. The following questions are of interest: Since the future queries are unknown and so is any data association pat-
tern, is there really a benefit of revising the current partition in hopes of better serving future queries? Should we keep the same partition as it is at the beginning? (At least this incurs zero moving cost.) If it is worth revising the current partition upon each query, which items
should be migrated and whereto among the existing servers? Should the moved items be among those items requested in the current query or can they be arbitrary? Do the answers to the above questions apply to different query sequences?
Should we disregard any heuristic that is consistently ineffective or is there one that always works better than the others? I propose to investigate SQCP using the following methodology. First, I formulate SQCP as a multi-objective optimization and proposing an Evolutionary Algorithms (EA) framework incorporating several online heuristics to explore Pareto-optimal solutions. These heuristics are chosen to mimic online partition revision heuristics we may apply when processing a query sequence in 55
streaming mode. We conjecture that if a heuristic helps EA converge faster to good partitioning solutions then in practice the corresponding online heuristic should be preferred for adjusting the partition during the sequence. Second, I investigate several online algorithms for SQCP that are based on the heuristics recommended from the EA study and provide a comparison on their effectiveness with respect to the objectives of minimizing migration cost and minimizing sequential read cost.
3.1
Related Work
The existing research most relevant to SQCP, perhaps, is about the associated data placement problem (ADP) addressed in the work of Yu and Pan [51]. In their language, associated data are those that are requested together in the same query and ADP can be considered an batch-workload version of SQCP that assumes known patterns about the queries. Consequently, ADP does not address the more challenging case: unknown pattern and sequential workload. For partitioning of OSNs as discussed in Chapter 2, the partitioning problem can be modeled as a graph partitioning problem where the graph consists of data items each as a vertex and queries each as a set of vertices in the direct neighborhood of a vertex. Similarly, for SQCP, we can think of a query as a set of vertices that can be anywhere in the graph. Consequently, SQCP can be modeled as a hypergraph partitioning problem where each query is represented by a hyperedge. There are many software tools for partitioning a hypergraph, e.g., [52, 53]. In [54], a re-partitioning hypergraph model is introduced for cases where the hypergraph needs to be re-partitioned over the time to adapt to workload changes. Although incorporating both read cost and migration cost in each partition adjustment, this model is not truly “online” and “query-adaptive”.
56
Each repartitioning requires a wait window long enough to form a quality hypergraph to represent the query workload during this epoch. In contrast, I focus on minimizing the sequential cost - the immediate cost - to process each query, not a window of queries. Thus, the arrival order of the queries matters. The need for live reconfiguration of partitions has increasingly been emphasized in order to minimize performance impact in database management systems [55]. An evolution algorithm has been proposed for hypergraph partitioning [56], serving applications in circuit design. While this algorithm seeks a single partition for a given hypergraph in an offline setup, the objective of our EA framework is to seek a sequence of partitions for a continuously growing hypergraph where one hyperedge is added at a time. I am aware of no other effort to offer such an EA framework. Partitioning algorithms exist for streaming hypergraphs [57] and streaming standard graphs [58]. In contrast, ours is the first not only to process a stream of hyperedges as input, but also to cope with constraints (w.r.t cost to change each partition) that make the problem even more challenging.
3.2
Problem Formulation
Suppose that we have N data items, O = [N ] that need to be distributed among M servers, S = [M ]; here, [z ] denotes the set {1, 2, ..., z}. Each item is placed on a server according to an initial partition. Over the time, queries are submitted to the system one by one, each asking for an itemset (a subset of items). Due to changing query workload, the initial partition may no longer be efficient. After each query is processed, we can keep the same partition, or revise it in hopes of reducing the future read costs. On the other hand, this
57
should be done without having to move too many items from one server to another. Consequently, we propose the following data partitioning problem. Problem 3.2.1 (Sequential Query Centric Partitioning (SQCP)). Denote the query sequence (unknown in advance) by q1 q2 ...qT , where qt ⊂ [N ] is the query at time t to retrieve a subset of items from the servers; for example, query {3, 5, 10} is for retrieving items 3, 5, and 10 from their respective servers. Start with a given initial partition at time t = 0, f0 : [N ] → [M ], assigning each item i to some server j = f0 (i). At each subsequent time t ≥ 1 once query qt is received, knowing only queries received thus far, q1 q2 ...qt , we need to compute a new partition, ft : [N ] → [M ], to assign each item i to some server j = ft (i). S Let r(t) = i∈qt ft−1 (i) be the read cost (number of different servers to read) of P 6 ft−1 (i)] the move cost (number of items migrated query qt and m(t) = N i=1 [ft (i) = to a different server due to the adjustment) to obtain partition ft from partition ft−1 ; notation [.] is the Iverson bracket. Over the entire query sequence, the objective is to minimize the total sequential read cost and the total sequential move cost while keeping the partition balanced:
T X [ ft−1 (i) min Λ = {ft }T t=1 t=1 i∈q t | {z } r(t) T N XX min Γ = ft (i) = 6 ft−1 (i)] [ {ft }T t=1 t=1 i=1 | {z }
(3.1)
(3.2)
m(t)
s. t.
N X
[ft (i) = j ] ≤ C ∀j ∈ [M ], t ∈ [T ].
(3.3)
i=1
Here, C is the maximum storage capacity allowed for each server. I assume that the set of data items and the set of servers do not change during the query sequence.
58
It is easy to see that Objective (3.1) and Objective (3.2) cannot concurrently be achieved. The move cost is minimum (zero) if the initial partition is never changed during the entire query sequence. The read cost of the initial partition, however, is not optimal. I show below how this problem can be cast as extended versions of the Online Hypergraph Partitioning problem and the Metrical Task Systems problem. 3.2.1
Correspondence to Hypergraph Partitioning
Consider a simplified case where we know the entire query sequence in advance and stick with the initial partition f0 for the entire query sequence (ft = f0 ∀t); hence, no migration allowed (Γ = 0). The total read cost (see Eq. (3.1)) becomes: T [ X Λ0 = f0 (i) .
(3.4)
t=1 i∈qt
I will show that the best f0 minimizing Λ0 is one that is a solution to a min-cut hypergraph partitioning problem. A hypergraph is a generalized graph where an edge, called a hyperedge, can consist of any arbitrary non-empty subset of vertices, not necessarily a pair of vertices as in standard graphs. Our hypergraph to be partitioned is G = (V, E ) where V = [N ] represents the set of items and E = {q1 , q2 , ..., qT } represents the set of queries. In other words, each item is a vertex and each query is an hyperedge consisting of all the items of this query. Given a partition, a hyperedge of connectivity k (i.e., spanning k parts) is said to be cut if k ≥ 2 and the weight of this cut is (k − 1). A min-cut partition is one that minimizes the total cut weight. Consider a partition f : [N ] → [M ]. Using this partition, each query qt S
requires reading
i∈qt
f (i) servers and, accordingly, the cut weight of hy-
59
S peredge qt is ( i∈qt f (i) − 1). Therefore, minimizing the total read cost is
equivalent to minimizing the total cut weight in the hypergraph. To find a partition f0 that is balanced with minimal total read cost is equivalent to finding a balanced min-cut partition for hypergraph G. The latter problem in general is known to be NP-hard [59], but effective heuristic algorithms have been developed; e.g., hMetis [52] and PaToH [53]. We can use one such algorithm to obtain f0 . Now, consider the original scenario where queries arrive sequentially, unknown in advance, and migration is allowed so that we can adapt the partition upon receipt of each query. The corresponding hypergraph G is therefore a streaming hypergraph where the set of vertices is known but the set of hyperedges not; instead, the hyperedges are inserted to the hypergraph one at a time. Not only that we need to compute an online balanced min-cut partitioner for this streaming hypergraph, but also this partitioner should incur the least move cost. In the literature of online hypergraph partitioning, this problem is not yet explored. 3.2.2
Correspondence to Metrical Task Systems
As an online problem seeking an optimal solution for sequential input, our proposed problem can be viewed in the framework of metrical task systems (MTS). An MTS is a multi-state system for processing sequential tasks, in which the cost to process a task depends on the state of the system and the system can change its state anytime subject to a transition cost metric. The MTS problem, introduced in [60], is to compute an efficient schedule ψ = ψ1 ψ2 ...ψT for a task sequence σ = σ1 σ2 ...σT , where ψt is the system state in
which σt will be processed, such that the total processing and transition cost is minimized,
60
min Ctransition (ψ ) + Cprocessing (ψ, σ ) , where
Ctransition (ψ ) = Cprocessing (ψ, σ ) =
T X i=1 T X
costtransition (ψt−1 , ψt ) costprocessing (ψt , σt ).
i=1
Here, ψ0 is the initial state (given). An online scheduling algorithm must compute ψt knowing only σ1 σ2 ...σt−1 . In competitive analysis, an online algorithm is said to be k -competitive iff, given any task sequence input, its cost is at most k times that of an optimal offline algorithm (plus a constant depending only on k ). It is known [60] for any MTS with n states that a deterministic online algorithm can be constructed with (2n-1) competitive ratio, which is optimal among all deterministic algorithms. My partitioning problem can be cast into a MTS. I combine Objective (3.1) and Objective (3.2) into a single objective (αΛ + (1 − α)Γ), where weight α ∈ [0, 1] indicates the priority between read cost versus move cost. Then, the
MTS to be optimized is as follows: States: The set of states is the set of all partitions f : [N ] → [M ] that
satisfy Ineq. (3.3). The initial state is ψ0 = f0 (the initial partition in our problem). State transition cost: Cost function costtransition (f, f 0 ) is defined to be (1 −
α) times the move cost to change from partition f to partition f 0 . Task processing cost: Task σt at time t is the query qt . Cost function
costprocessing (f, qt ) is defined to be α times the read cost for query qt using
partition f . 61
By solving this MTS, I can derive an optimally-competitive online algorithm for our partitioning problem. Unfortunately, the number of states is roughly n≈
N! (N/M )!M M !
(number of partitions when the capacity C is precisely N/M ),
too large to be computationally practical. In the literature of metrical task systems, I am aware of no research aimed to substantially downsize the state set to obtain a more efficient algorithm that remains as competitive.
3.3
Evolutionary Algorithm Framework
As I discussed in Chapter 2, EA is a popular approach to be used to explore Pareto-optimal solutions for multi-objective optimization problems and often perform well approximating solutions. Here, I propose an EA framework for SQCP, which hereafter is referred to as Q-PUT. This framework consists in how I encode a partition sequence as an individual and how to apply crossover, mutation, and selection to evolve a population of individuals from one generation to the next. In the SQCP problem, with the query sequence length T , the number of items N , and the number of servers M , the number of possible partition sequences is M N T . The solution search space is smaller with the balance constraint, but still extremely large, making EA very slow to converge to a stable good solution state. Q-PUT is incorporated with several heuristics to make it faster to converge to good Pareto-optimal solutions. These heuristics are about how we should migrate data items between consecutive partitions. If such a heuristic helps EA converge better, it should be used as a greedy heuristic for the online solution to SQCP. 3.3.1
Representation of An Individual
In my framework, an individual is a solution candidate for the SQCP problem. For such a candidate, which is a sequence of partitions, f = f1 f2 ...fT
62
Figure 3.1: A gene in the chromosome of Q-PUT
where ft : [N ] → [M ] ∀t = 1, 2, ...T , let {ot1 , ot2 , ..., otkt } denote the items that are migrated to obtain partition ft from partition ft−1 , and {st1 , st2 , ..., stkt } the corresponding destination servers. The individual to represent f is
then encoded as a “chromosome” string, ω = ω1 ω2 . . . ωT , of T “genes” where ωt = {hot1 , st1 ihot2 , st2 i . . . hotkt , stkt i}. Figure 3.1 illustrates how a gene is encoded in
our EA framework. Conversely, given an arbitrary string ω of the above template, we can reconstruct precisely the corresponding partition sequence. Indeed, partition f1 can be reconstructed using f0 and ω1 , partition f2 using f1 and ω2 , etc.
In the encoding of an individual, it is possible that the destination server of an item in ωt is the same as its destination server in ωt−1 ; in this case, the item is considered “not-migrated” even though it is included in ωt . Figure 3.2 illustrates how an individual is encoded in Q-PUT. In the encoding of this individual, f1 is constructed using f0 and ω1 = {h1, 2ih4, 7i}; f2 is constructed using f1 and ω2 = {h2, 4ih5, 8i}. Hence, the individual is encoded as ω = ω1 ω2 = {h1, 2ih4, 7i}{h2, 4ih5, 8i}.
Figure 3.2: Example individual in Q-PUT
63
3.3.2
Initial Population
A population is a set of individuals. In the beginning (the first generation of EA), each individual ω = ω1 ω2 ...ωT of the initial population is a string of random genes. In practice, we need to limit the number of items that may be migrated during a partition adjustment because moving too many items at a time may block ongoing regular transactions. Let kmax be this limit; this parameter should be pre-defined. To generate each gene ωt : (1) choose a random kt ∈ [0, kmax ]; (2) choose a random subset of kt items, {ot1 , ot2 , ..., otkt }; and (3) for each item oti , choose a random server sti as the destination for this item. 3.3.3
Crossover Mechanism
In the crossover mechanism, a number of pairs of individuals, called parents, are randomly selected from the current population and each of these pairs will be replaced by two new individuals, called offsprings. I use the Two-Point Crossover mode as shown in Figure 3.3.
Figure 3.3: Two-point crossover mechanism is used in Q-PUT
64
Suppose that the parents are ω = ω1 ω2 . . . ωT and ω 0 = ω10 ω20 . . . ωT0 . First, two random positions i and j (1 < i < j < T ) are chosen. Then, the offsprings 0 0 . . . ωT0 . The number of ωi . . . ωj ωj+1 are ω1 . . . ωi−1 ωi0 . . . ωj0 ωj+1 . . . ωT and ω10 . . . ωi−1
parent couples to be replaced in the crossover mechanism is determined by a probability pcrossover (given as input). We should expect pcrossover × |A| parents to be replaced by their offsprings. 3.3.4
Mutation Mechanism
In the mutation mechanism, a number of individuals are randomly selected from the current population, each to be replaced by a new individual. This offspring is obtained by altering some genes, chosen at random, in the original individual. Let the original individual be ω = ω1 ω2 . . . ωi . . . ωT . For each gene ωi selected to be altered, it is freshly reset to a random gene ωi0 (how to generate a random gene is already described above). The corresponding offspring is the same as ω with the exception that each ωi is replaced with ωi0 , as shown in Figure 3.4.
Figure 3.4: Mutation mechanism in Q-PUT
There are two probability parameters given as input: the probability that an individual is selected for mutation, pmutate , and the probability that a gene inside an individual is selected to be altered pgene mutate . Thus, expectedly, the number of individuals to be replaced is pmutate × |A| and the number of genes to be mutated in each selected individual is pgene mutate × T .
65
3.3.5
Selection Mechanism
In each generation h, once we obtain the new regular population Ph by applying crossover and mutation on the archive Ah , we need to select the best-fit individuals from Ph ∪ Ah to form the new archive Ah+1 for the next generation. The fitness is defined as follows. Let q p denote that individual q dominates individual p. The fitness of an individual p in generation h is "
X 1 + f itness(p) = β (p) σ (q) δ (p, k ) + 2 q∈P ∪A :qp h
#
h
where β (p) is 0 if individual p corresponds a sequence of partitions where some
partition violates the balance constraint. Else, β (p) is 1. σ (q) is the number of individuals dominated by q in the set Ph ∪ Ah . δ (p, k ) denotes the distance in the objective space from individual p to its p k -nearest individual. As common setting, k is set to |P| + |A|.
I retain only the |A| individuals with the highest fitness to form the new archive. Note that, all the individuals violating the balance constraint are always removed (fitness is zero because β (p) = 0). 3.3.6
Migration Heuristics
In the basic EA framework, I limit the solution space to only individuals allowing at most kmax items migrated in each partition adjustment. In addition to choosing a reasonable value for kmax we can further reduce the convergence time by placing restriction on which items can move and which destination servers to receive them. Two questions are raised: (1) can a candidate item for 66
the migration be an arbitrary item or should it be among the items that are just requested in the most recent query? and (2) can the destination server for each moved item be an arbitrary server or should the destination server be the same for all the moved items? To investigate these questions, I incorporate the following migration heuristics in our EA framework, which are applied after the latest query qt has been received and processed. RAND-RAND: Move a random subset of up to |qt | arbitrary items, each
to a random server. This is essentially the default heuristic for the basic framework in which we cannot move more than the number of items in the last query. RAND-RAND*: Move a random subset of |qt | arbitrary items, each to a
random server. RAND-SAME: Move a random subset of up to |qt | arbitrary items, all to
the same server. This server is chosen at random. RAND-SAME*: Move a random subset of |qt | arbitrary items, all to the
same server. This server is chosen at random. Q-SUB-RAND: Move a random subset of items in qt , each to a random
server. Q-SUB-SAME: Move a random subset of items in qt , all to the same
server. This server is a random server among those that hold at least one of these items. Q-ALL-SAME: Move all the items in qt , all to the same server. This
server is a random server among those currently holding at least one of these items.
67
It is noted that RAND-RAND and RAND-SAME result in the same degree of randomness in the server locations of the items, because random items are chosen for migration. However, since the solution search space of RAND-SAME is much smaller than that of RAND-RAND, they may behave differently when incorporated in the EA framework. The same is said for RAND-RAND* versus RAND-SAME*. To apply each heuristic above, we need to revise the EA framework accordingly in two places: generation of the initial population and the mutation mechanism. When we need to generate a random gene ωt = {hot1 , st1 ihot2 , st2 i . . . hotkt , stkt i}, for a random individual in the initial population or for mutating an existing individual to create an offspring, the following conditions must apply: RAND-RAND, RAND-SAME, Q-SUB-SAME: kt ≤ |qt | RAND-RAND*, RAND-SAME*: kt = |qt | RAND-SAME, Q-SUB-SAME, Q-ALL-SAME: st1 = st2 = ... = stkt Q-SUB-RAND, Q-SUB-SAME: {ot1 , ot2 , ..., otkt } ⊂ qt Q-ALL-SAME: {ot1 , ot2 , ..., otkt } = qt
In investigation of these heuristics, I conjecture that if an EA heuristic is found to provide the best quality in the final population then in practice we should apply the corresponding migration heuristic to reconfigure the partition adaptively to the query. For example, if “move-to-same-server” is better than “move-to-random-server” for the evolution process, the former should be used for the online heuristic algorithm. 3.3.7
Evaluation Study
I evaluate the heuristic algorithms using four real-world datasets, arXiv, github, retail, and actor-movie, obtained from the dataset repositories at http://konect.uni68
Dataset
Min. Query Max. No. of Items Size Query Size
Avg. Query Size
arXiv
16,264
2
10
3.06
Github
42,444
2
238
7.71
Retail
16,407
2
52
10.76
2
89
12.35
Actor-movie 382,218
Table 3.1: Summary of datasets used in the evaluation
koblenz.de and http://fimi.ua.ac.be/data. I transform these datasets into a diverse set of “transaction” hypergraphs so that we can use each vertex to represent a data item and each hyperedge a query. A query sequence in our simulation is a sequence of 1000 random hyperedges in the hyperedge set. Some statistics of these datasets are summarized in Table 3.1 and Table 3.2. arXiv: A collaboration graph of authors of publications from the arXiv
condensed matter section (cond-mat) from 1995 to 1999. For use in our evaluation, each author’s data corresponds to an item and each publication’s data a query that retrieves the information of all the authors of this publication. Github: A bipartite graph of users and projects in Github, with edges
connecting a project to each of its members. For use in our evaluation, each member’s data corresponds to an item and each project’s data a query that retrieves the information of all of its members. Retail: A retail market basket dataset supplied by an anonymous Bel-
gian retailer. This dataset consists of transactions for approximately five months during 1999-2000, each transaction including a set of items pur-
69
Dataset
Total
% of Total for Each Request Frequency 1
2
3
4
5+
arXiv
2521 (16%) 83.5% 11.9% 3.1% 0.9%
0.2%
Github
4995 (12%) 72.9% 15.9% 5.6% 2.4%
3.2%
Retail
4130 (25%) 57.5% 19.5% 9.3% 4.2%
9.5%
Actor-movie 11130 (3%) 91.0%
7.4%
1.2% 0.2%
0.1%
Table 3.2: Number of items for each range of request frequencies
chased. In our evaluation, each transaction serves as a query retrieving all the items purchased in this transaction. Actor-movie: A bipartite network of movies and the actors that have
played in them. Using this data, I consider each actor as a data item and a movie a query retrieving data about the actors who act in this movie. The initial partition f0 , same in all comparisons, is obtained by a random partitioning of the items among the servers. The benchmark for comparing the proposed heuristics is the partition sequence f1 f2 ...fT where ft = f0 , hereafter referred to as RAND-NoMove; there is no migration during the query sequence. RAND-NoMove thus represents a hash-based partition assignment that does not adapt to the queries. The number of servers is M = 16 and number of queries T = 1000. The maximum server capacity C is set to allow a deviation up to 1% of the total number of items. SPEA2 [41] is used as the evolution engine with the following parameters: regular population size |P| ∈ {100, 500}, archive size |A| = |P|, number of generations h∗ ∈ {200, 400, 600, 800, 1000}, crossover prob-
ability pcrossover = 0.8, mutation probability pmutation = 0.5, and gene mutation
70
probability pgene mutate = 0.001. The probabilities are so chosen to ensure a sufficient diversity in the population. 3.3.7.1
EA Convergence
(a) Population: 100 individuals
(b) Population: 500 individuals
Figure 3.5: The average quality of the Pareto-optimal solutions for arXiv dataset
I am interested in the convergence behavior of each EA heuristic and how much improvement is achieved when converged in comparison to this heuristic’s respective initial population. Figure 3.5, 3.6, 3.7, and 3.8 shows the average quality of the Pareto-optimal population as a result of each EA heuristic after 200, 400, 600, 800, and 1000 generations, with two population sizes considered, 100 and 500 individuals. The x-axis is the population’s average read cost as 71
Figure 3.6: The average quality of the Pareto-optimal solutions for the Github dataset. (a) Population: 100 individuals; (b) Population: 500 individuals.
For all the heuristics, it is observed that the improvement in both read cost and move cost increases quickly in early generations and, in most cases, slows down towards convergence after 1000 generations. This is consistent for both population sizes, 100 and 500 individuals, with the results being slightly better for the larger population size. Starting from the initial population, EA can cut the move cost to as little as 35% of the cost of the initial population; e.g., see Q-SUB-RAND for arXiv in Figure 3.5(b).
Figure 3.7: The average quality of the Pareto-optimal solutions for the Retail dataset. (a) Population: 100 individuals; (b) Population: 500 individuals.
The read cost can be cut to as little as 63%; see Q-ALL-SAME for Retail in Figure 3.7(b). The results also show a clear clustering separation between the group of RAND-RAND*, RAND-SAME*, and Q-ALL-SAME and the group of RAND-RAND, RAND-SAME, Q-SUB-RAND, and Q-SUB-SAME. This is understandable because in each group the heuristics share the same constraint on the move cost. On the other hand, different convergence rates are observed for the heuristics within the same group. This is because they have different population densities due to their different selections of migrated items and destination servers.
Figure 3.8: The average quality of the Pareto-optimal solutions for the Actor-movie dataset. (a) Population: 100 individuals; (b) Population: 500 individuals.
This study shows that our EA framework can be effective in terms of convergence to a stable solution state and improvement over the initial population. Also, the heuristics represent a range of options for EA when it comes to how much priority should be placed on reducing the read cost versus the move cost and how to select migrated items and destination servers. In the discussions below comparing to the benchmark, I use the results for the case of 500-individual population size with 1000 generations as this evolution length is sufficient for all the heuristics to converge.
Figure 3.9: Moving random items. (a) arXiv dataset; (b) Github dataset.
3.3.7.2 Migration Heuristics Analysis
Where to Migrate: Same Server or Not?

We want to compare the EA effectiveness of two tactics: moving the items selected for migration to the same server versus to random servers. Allowing arbitrary servers for migration increases the population diversity. On the other hand, moving to the same server reduces the solution search space (faster convergence) and potentially reduces the read cost faster, at the cost of a less diverse population. Ideally, we want minimal population diversity for the same quality of the Pareto population.
Figure 3.10: Moving random items. (a) Retail dataset; (b) Actor-movie dataset.
The comparative results are shown in Figures 3.9, 3.10, 3.11, and 3.12, which plot the Pareto front of final individuals for each heuristic; the x-axis is the average read cost as a percentage of that incurred by the RAND-NoMove benchmark and the y-axis is the average move cost as a percentage of the average query size in the dataset. First we focus on the case where random items are selected for migration and the same move cost constraint is enforced. We compare RAND-SAME vs. RAND-RAND and RAND-SAME* vs. RAND-RAND*. As illustrated
in Figures 3.9 and 3.10, for all the datasets, there is no significant difference between RAND-SAME and RAND-RAND. However, when more items are allowed to be moved, as shown in the comparison between RAND-SAME* and RAND-RAND*, RAND-SAME* tends to offer a better read cost, especially for Retail, the dataset with the strongest data associated-ness (Figure 3.10(a)).

Next, we focus on the case where the migrated items are among the most recently requested; hence, we compare Q-SUB-SAME to Q-SUB-RAND. The results are shown in Figures 3.11 and 3.12. The "move-to-same-server" tactic (Q-SUB-SAME) results in a much better Pareto front than the "move-to-random-server" tactic (Q-SUB-RAND) for Github and Retail. For arXiv and Actor-movie, no clear winner emerges.

The above observations are explainable. The data associated-ness in arXiv and Actor-movie is weak; as seen in Table 3.2, 83.5% of the queried items in arXiv are requested only once, and 91% in Actor-movie. As such, it is unlikely for two co-requested items to appear again in the same query, explaining why there is no significant difference between the "move-to-same-server" and the "move-to-random-server" tactics. On the other hand, the data associated-ness is stronger in Github and strongest in Retail, which benefit substantially from moving to the same server.

Which Items to Move: Among the Requested or Arbitrary?

Now that we have found "move-to-same-server" preferable to "move-to-random-server", we focus on comparing the heuristics that adopt "move-to-same-server" with one another: RAND-SAME, RAND-SAME*, Q-SUB-SAME, and Q-ALL-SAME. The results are plotted in Figures 3.13 and 3.14. As expected, with less diversity due to their stronger constraints on the maximum number of items that can be moved, RAND-SAME incurs a higher read cost than RAND-SAME* and Q-SUB-SAME incurs a higher read cost than Q-ALL-SAME.
Figure 3.11: Moving requested items. (a) arXiv dataset; (b) Github dataset.
On the other hand, and for the same reason, the move cost is smaller in RAND-SAME and Q-SUB-SAME than in RAND-SAME* and Q-ALL-SAME, respectively. We are therefore more interested in comparing the move-to-same-server heuristics that assume the same move cost constraint. Specifically, we compare Q-SUB-SAME to RAND-SAME and Q-ALL-SAME to RAND-SAME*. Between Q-SUB-SAME and RAND-SAME, there is no significant difference in performance. They have similar read costs and move costs in all cases, even when the data associated-ness is strong.
Figure 3.12: Moving requested items. (a) Retail dataset; (b) Actor-movie dataset.
This could be because both heuristics allow only a small number of items to be moved in each partition adjustment, and such a small migration, whether it moves requested items or random items, may not result in any significant difference in the read cost in our study. Between Q-ALL-SAME and RAND-SAME*, which allow more items to be moved, the heuristics do not dominate each other in the weak associated-ness cases (arXiv, Actor-movie), but Q-ALL-SAME clearly becomes superior in cases
with stronger data associated-ness; see Figure 3.13(b) for Github and Figure 3.14(a) for Retail.

This study suggests that if the data associated-ness is strong, the best migration heuristic is to (1) move all the items most recently requested and (2) move them all to the same server. For weak data associated-ness, there is no clear read-cost benefit in choosing the migrated items among the requested items versus among arbitrary items. We still recommend the "migrate-requested-items" tactic because, thanks to its stricter constraints on creating offspring individuals, the population is less diverse, resulting in a smaller solution search space and, consequently, faster convergence towards the final solution, without sacrificing quality.

We also observe that all these heuristics work most effectively for Retail, the dataset with the strongest data associated-ness. For example, the read cost resulting from EA can be as small as 60% of the benchmark cost, which is achieved using Q-ALL-SAME for Retail (Figure 3.14(a)). For weak data associated-ness, the improvement is minor. The best read cost EA can offer for arXiv and Actor-movie is 87+% of the benchmark (achieved by RAND-SAME* for arXiv in Figure 3.13(a)). These observations suggest that EA is more effective with stronger data associated-ness.

3.3.8 Remarks
The results of our evaluation with real-world datasets show that Q-PUT can be effective in obtaining good solutions for SQCP. Compared to the benchmark no-move partitioning scheme, Q-PUT can offer a read cost as small as 60%, hence a 40% improvement. The evaluation study suggests that an effective migration heuristic for adjusting the current partition on the fly to benefit future queries is to choose the items for migration among the most recently requested items and to move them all to the same destination server (how to choose this server is an interesting question for the next step in our research).
Figure 3.13: Choosing requested items for migration. (a) arXiv dataset; (b) Github dataset.
This heuristic works best when the data associated-ness in the queries is strongest. Even in the case of weaker associated-ness, the suggested heuristic performs no worse than the random heuristic. Q-PUT as an EA framework is the first of its kind for SQCP and can provide good solutions serving as a benchmark for future studies on this problem.
Figure 3.14: Choosing requested items for migration. (a) Retail dataset; (b) Actor-movie dataset.
3.4 Online Algorithms
This section presents approximate greedy algorithms for SQCP. A greedy heuristic is applied upon the arrival of each query to decide which data items to move and where to move them. I investigate several greedy algorithms that are either based on the recommendations of Q-PUT in the previous section or borrow ideas from greedy algorithms for standard streaming graphs.
3.4.1 Greedy Heuristics
I define the following quantities:

Association Score a_t(i, j): the total number of times item i and item j have been co-requested up to time t (after receipt of q_t). This count is incremented each time these two items appear together in a new query.

Demand Score b_t(i, s): the sum of the association scores of item i with the items hosted on server s at time t (after receipt of q_t), representing how much demand server s has for item i. Hence, it always holds that

    b_t(i, s) = \sum_{j : f_{t-1}(j) = s} a_t(i, j).
Four heuristics are considered in my study:

All-To-Same-Least-Move (ALL-LM): After receipt of query q_t, move all the requested items to the same server with the least migration cost. This server is the one already hosting the most of the requested items:

    s^* = \arg\max_{s \in [M]} \left| q_t \cap \{ i \in [N] : f_{t-1}(i) = s \} \right|

All-To-Same-Highest-Demand (ALL-HD): After receipt of query q_t, move all the requested items to the same server with the highest total demand for the requested items. This server is:

    s^* = \arg\max_{s \in [M]} \sum_{i \in q_t} b_t(i, s)

Individual-Highest-Demand (IND-HD): After receipt of query q_t, move each individual requested item i to the server having the highest demand for i:

    s^* = \arg\max_{s \in [M]} b_t(i, s)

Individual-Most-Associated (IND-MA): After receipt of query q_t, move each individual requested item i to the server hosting i's most associated item:

    j^* = \arg\max_{j \in [N]} a_t(i, j), \quad s^* = f_{t-1}(j^*)
The last two heuristics borrow ideas from the partitioning of standard streaming graphs, where we try to place each vertex in the same part as most of its neighbors (IND-HD) or as its nearest neighbor (IND-MA). When applying the above heuristics, to satisfy the balancing constraint at all times, candidate servers are restricted to those whose capacity would not be exceeded by the migration.
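As an illustration only (not the dissertation's implementation), the sketch below shows how the four selection rules could be coded, given the association scores a_t, the demand scores b_t, and the current assignment f_{t-1}. The names (assoc, demand, assign, loads, capacity), the simplified capacity check, and the assumption that at least one feasible server always exists are all illustrative.

    def feasible(servers, loads, capacity, extra):
        # Keep only servers that could absorb 'extra' more items without
        # violating the (simplified) capacity constraint.
        return [s for s in servers if loads[s] + extra <= capacity]

    def all_lm(query, assign, loads, capacity, M):
        # ALL-LM: the server already hosting the most of the requested items.
        cand = feasible(range(M), loads, capacity, len(query))
        return max(cand, key=lambda s: sum(1 for i in query if assign[i] == s))

    def all_hd(query, demand, loads, capacity, M):
        # ALL-HD: the server with the highest total demand for the requested items.
        cand = feasible(range(M), loads, capacity, len(query))
        return max(cand, key=lambda s: sum(demand[i][s] for i in query))

    def ind_hd(item, demand, loads, capacity, M):
        # IND-HD: the server with the highest demand for this single item.
        cand = feasible(range(M), loads, capacity, 1)
        return max(cand, key=lambda s: demand[item][s])

    def ind_ma(item, assoc, assign, loads, capacity):
        # IND-MA: the server hosting this item's most associated item.
        j_star = max(assoc[item], key=assoc[item].get)
        s = assign[j_star]
        return s if loads[s] + 1 <= capacity else assign[item]  # else stay put

In practice the capacity check would discount items already hosted on the chosen server; the simplification above only keeps the sketch short.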
3.4.2 Implementation Efficiency
In terms of computation, the above heuristic algorithms can be implemented efficiently. The association and demand scores are incrementally updated over time after each query q_t is received. This is done as follows, in O(|q_t|^2) time (one update per ordered pair of requested items):

    \forall i, j \in q_t, i \neq j : \quad a_t(i, j) = a_{t-1}(i, j) + 1, \qquad b_t(i, f_{t-1}(j)) = b_{t-1}(i, f_{t-1}(j)) + 1
With these scores already computed, finding the desired server(s) takes no more than O(|q_t| × M) time for all the heuristic algorithms.
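A minimal sketch of this incremental bookkeeping is shown below; the dictionaries assoc and demand and the assignment map assign are illustrative names, not taken from the dissertation's code.

    from collections import defaultdict
    from itertools import permutations

    assoc = defaultdict(lambda: defaultdict(int))    # assoc[i][j]  = a_t(i, j)
    demand = defaultdict(lambda: defaultdict(int))   # demand[i][s] = b_t(i, s)

    def update_scores(query, assign):
        # One increment per ordered pair (i, j) of requested items, i != j.
        for i, j in permutations(query, 2):
            assoc[i][j] += 1
            demand[i][assign[j]] += 1                # j currently lives on f_{t-1}(j)

Calling update_scores(q_t, assign) after each query keeps both score tables current for the selection rules above.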
3.4.3 Evaluation Study
I evaluate the heuristic algorithms using the same real-world datasets as in Section 3.3.7, namely arXiv, Github, Retail, and Actor-movie. A query
sequence in our simulation is a random permutation of the hyperedge set (note that the study in Section 3.3.7 limits the length of the sequence to only 1000 queries). The statistics of these datasets are summarized in Table 3.3.

Dataset               arXiv      Github     Retail     Actor-movie
No. Items (N)         16,264     42,444     16,407     382,218
Query Seq. Len (T)    17,837     37,837     85,146     118,476
Min. Query Size       2          2          2          2
Max. Query Size       18         3675       76         294
Avg. Query Size       3          9          10         12
Notes                 small N,   medium N,  small N,   large N,
                      N ∼ T      N ∼ T      N ≪ T      N ≫ T

Table 3.3: Summary of datasets used in evaluation
I compare ALL-LM, ALL-HD, IND-HD, and IND-MA against the benchmark in which the initial partition f_0 is obtained by a random partitioning of the items among the servers and is used unchanged for the entire query sequence. This benchmark, referred to as RAND-NoMove as before, represents a hash-based partition assignment that does not adapt to the queries. We are interested in how ALL-LM, ALL-HD, IND-HD, and IND-MA compare to each other and how much of an improvement they each offer compared to RAND-NoMove. The comparison is in terms of read cost and move cost. In the plots discussed below (Figures 3.15, 3.16, 3.17, 3.18, 3.19, 3.20, 3.21, and 3.22), the x-axis is the percentage improvement (reduction) in read cost compared to the RAND-NoMove benchmark, i.e., 100(1 − r/r_0), where r is the read cost of the heuristic under consideration and r_0 is the read cost of the RAND-NoMove benchmark. The y-axis is the average migration cost as a percentage of
the average query size in the dataset. For each dataset, five permutations of the set of queries are randomly generated to serve as five query sequences; the results are averaged over these sequences. Given the sizes of the datasets, two reasonable values for the number of servers are considered, M ∈ {8, 16}. The maximum server capacity C is set to be a factor of M((b+50)/100)^{log_2 M} times the average capacity per server (N/M), where b ∈ (0, 50) is a tunable balance parameter; this formula is used in hMetis [52], a popular tool for balanced hypergraph partitioning. We consider two cases: b = 1 and b = 2. With M = 8 servers and b = 1, we allow capacity C to be 6% higher than the average capacity per server (N/M); when b = 2, this percentage is 12%. With M = 16, these figures are 8% and 16% for b = 1 and b = 2, respectively.
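As a quick check of these percentages, the short snippet below evaluates the reconstructed capacity factor M((b+50)/100)^{log_2 M} for the four configurations; the printed values are close to the figures quoted above. The item count N is only an example value.

    from math import log2

    N = 16264                                          # e.g., the arXiv item count
    for M in (8, 16):
        for b in (1, 2):
            factor = M * ((b + 50) / 100) ** log2(M)   # C as a multiple of N/M
            C = factor * (N / M)
            print(f"M={M}, b={b}: C is {(factor - 1) * 100:.1f}% above N/M "
                  f"({C:.0f} items)")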
3.4.3.1 Greedy Heuristics Analysis
Results for the arXiv dataset (Figures 3.15 and 3.16)

For this dataset, the following is observed. First, the read cost reduction of each online heuristic seems greater as more servers are deployed (M is increased from 8 to 16) or the balancing constraint is less strict (b is increased from 1 to 2). Using an online heuristic, the read cost ranges from 76% to 83% of that of RAND-NoMove, while the average move cost varies between 30% and 50% of the average query size. Second, there is a clear performance gap: the better performing heuristics are ALL-HD and IND-MA and the clearly worse heuristics are IND-HD and ALL-LM. In all four configurations, the best heuristic is IND-MA, which offers a read cost that is about 77% of RAND-NoMove and a move cost of only 30% of the average query size. Note that the average query size is 3.1 items per request for this dataset, meaning that by moving roughly one item at each query on average, IND-MA offers a read cost that is 23% better than the benchmark.
Figure 3.15: Results for the arXiv dataset with 8 servers. (a) Max capacity 6% above the per-server average; (b) max capacity 12% above the per-server average.
Results for the Github dataset (Figures 3.17 and 3.18)

Like the previous case, we also observe in Github that the read cost reduction of each online heuristic seems greater as more servers are deployed or the balancing constraint is less strict. Another similar observation is that ALL-HD and IND-MA seem comparable to each other. However, the read cost improvement in Github is not as high as in the arXiv case; it is 83.5% of the RAND-NoMove cost at best (when M = 16 and b = 2; see Figure 3.18(b)).
Figure 3.16: Results for the arXiv dataset with 16 servers. (a) Max capacity 8% above the per-server average; (b) max capacity 16% above the per-server average.
The move cost is also higher; the best move cost is more than 60% of the average query size (9.3). Furthermore, while IND-MA is the best for arXiv, ALL-LM is the best for the Github dataset. ALL-LM is the only heuristic that is Pareto-optimal in all configurations; moreover, it is clearly better than the other heuristics in three out of four configurations. Note that the Github dataset is larger than the arXiv dataset in every category (number of items, number of queries, maximum query size, and average
query size), but they share a common property that the number of hyperedges is similar to the number of vertices.
Figure 3.17: Results for the Github dataset with 8 servers. (a) Max capacity 6% above the per-server average; (b) max capacity 12% above the per-server average.
Results for the Retail dataset (Figures 3.19 and 3.20)

The benefit of applying online heuristics to the Retail dataset is less significant than for the previous two datasets. At best, the read cost is 89% (of the RAND-NoMove cost) and the move cost is 70% (of the average query size). This could be because the transaction hypergraph is highly dense, with the number of hyperedges much higher than the number of vertices.
Figure 3.18: Results for the Github dataset with 16 servers. (a) Max capacity 8% above the per-server average; (b) max capacity 16% above the per-server average.
Still, the performance gap between the heuristics is fairly clear. The best heuristics, which are Pareto-optimal in all configurations, are ALL-LM (best move cost) and IND-MA (best read cost). The worst is always IND-HD.

Results for the Actor-movie dataset (Figures 3.21 and 3.22)

This dataset is the largest in terms of the number of items and the number of queries. It represents a large hypergraph that is highly sparse, in which the number of hyperedges is much smaller than the number of vertices.
Figure 3.19: Results for the Retail dataset with 8 servers. (a) Max capacity 6% above the per-server average; (b) max capacity 12% above the per-server average.
Interestingly, compared to the other, smaller datasets, the read cost improvement is much more substantial for this dataset (read cost as low as 62% of the benchmark) while the move cost remains moderate (less than 60%). However, there is no clear overall winner. If minimizing read cost is the priority, the best heuristic is ALL-HD. On the other hand, if the move cost is of importance, the best heuristic is IND-MA. We would not recommend ALL-LM because it is always worse than IND-HD.
Figure 3.20: Results for the Retail dataset with 16 servers. (a) Max capacity 8% above the per-server average; (b) max capacity 16% above the per-server average.
3.4.4 Remarks
Table 3.4 summarizes the results, including the read cost range, the move cost range, and the recommended and not-recommended heuristic(s) for each dataset. The online heuristics are effective for the arXiv and Actor-movie datasets, but less so for Github and Retail. This summary suggests that IND-MA (the "store where the nearest neighbor is" heuristic) is a safe choice for the online heuristic whereas IND-HD (the "store where the most neighbors are" heuristic) is a risky choice. However, I consider this suggestion weak due to the limited set of datasets evaluated.
Figure 3.21: Results for the Actor-movie dataset with 8 servers. (a) Max capacity 6% above the per-server average; (b) max capacity 12% above the per-server average.
A stronger conclusion would be that if we are allowed to move some items, fewer than the average query size, during the processing of each query, it is a good idea to do so with one of the presented online heuristics; in our study this is always better than RAND-NoMove. Furthermore, we should not disregard any of these heuristics because, according to my evaluation, there is no consensus winner.
Figure 3.22: Results for the Actor-movie dataset with 16 servers. (a) Max capacity 8% above the per-server average; (b) max capacity 16% above the per-server average.
3.5 Summary
This chapter provides an investigation of SQCP, the novel problem of query-centric partitioning that minimizes the sequential read and move costs of processing a query sequence. The originality of the problem lies in its sequential nature: optimization must take into account the sequential order of query arrival and must not assume knowledge about the incoming queries. Although the online decision must be made to best serve a next query that has yet to arrive, my study, conducted via an EA framework and a set of greedy algorithms, has shown consistently that there is a real benefit in migrating items between the servers during the query sequence.
Dataset       arXiv      Github     Retail     Actor-movie
Read Cost     76%-83%    83%-91%    89%-97%    62%-86%
Move Cost     30%-50%    60%-80%    70%-90%    40%-60%
ALL-LM                   x          x          o
ALL-HD                                         x
IND-HD        o                     o
IND-MA        x                     x          x
(x: recommended; o: not recommended)

Table 3.4: Summary of results
Several heuristics have been presented, all offering a better read cost than the "move-nothing" hash-based partitioning approach, while incurring moderate move costs. Although I do not recommend any specific heuristic as always best, and better heuristics may exist beyond the presented studies, this work validates the worthiness of the SQCP problem and opens room for future research. Specifically, the following research would be useful: (1) evaluate with more datasets at larger scale, (2) explore online algorithms assuming some known pattern about the query sequence, such as item popularity or item-item association weight, (3) incorporate other constraints such as geo-location related constraints, and (4) albeit the most challenging, provide a competitive analysis of the theoretical bounds of online algorithms for the SQCP problem.
CHAPTER 4 FUTURE WORK
My developments on the query-centric partitioning problem can be extended in several incremental ways, as pointed out at the end of Chapter 2 and of Chapter 3. In what follows, I discuss an orthogonally different approach to solving this problem.

First, note that the traditional method for data management, for example in relational databases, is based on modeling data as points or feature vectors in a fixed-dimensional space where each dimension represents an attribute of the data. This feature-based model provides a lot of mathematical convenience for the optimization of data storage, indexing, and search. In the query-centric approach of this dissertation, the information about the data is modeled merely based on relationships among the data items; data are related (or associated) because they are requested together in the same query. I am interested in the following question: is it possible to transform our data into a feature-based model so that we can devise solutions similar to those developed for traditional feature-based data systems?

To address this question, my idea is based on graph embedding in a feature space. Specifically, if somehow we can represent the data items as a graph G, we will run a geometric graph embedding technique to project the vertices of G into a k-dimensional metric space X^k ⊂ R^k. The data items, 1, 2, ..., N, will then be represented respectively as points x_1, x_2, ..., x_N ∈ X^k and we can utilize
a partitioning technique designed for feature-based data systems; for example, one that preserves similarity.

Geometric graph embedding works as follows. Suppose that we need to embed a graph G in a k-dimensional metric space X^k ⊂ R^k associated with a metric d_X : X^k × X^k → R^+ ∪ {0}. Assuming (X^k, d_X) is given, the best embedding is one in which the distance between two items in the embedding space X^k is most representative of the relationship between them in the original graph G. Denote by d_G(i, j) the graph distance between vertex i and vertex j. This distance equals 1 if (i, j) ∈ E; else, it is the hop length of the shortest path connecting i and j in graph G. Finding the best embedding can then be formulated as

    \min_{x_1, x_2, \ldots, x_N} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( d_X(x_i, x_j) - d_G(i, j) \right)^2    (4.1)
This is essentially a widely studied Multi-Dimensional Scaling (MDS) problem. If the space X^k is Euclidean, one can use ISOMAP [61] to find the best embedding. Several methods have been proposed for the case where X^k is a curved manifold (spherical or hyperbolic embedding). For example, as in the work of Elad et al. [62], one can adopt a stress minimization approach [63] in which the stress to be minimized is computed based on a distortion measure representative of the quality of the embedding. Alternatively, Wilson et al. [64] take a non-iterative approach, similar in spirit to the PCA/SVD used in ISOMAP, to obtain a closed form for the best embedding.

Having outlined the basis for my proposed direction above, I now discuss the research tasks involved.

The first task is how to obtain the graph G. As discussed in Chapter 3, the data and queries can be modeled as a hypergraph G = (V, E) where V = [N] represents the set of items and E = {q_1, q_2, ..., q_T} represents the set of queries.
In other words, each item is a vertex and each query is a hyperedge consisting of all the items of this query. How do we convert this hypergraph to a standard graph in order to apply graph embedding? This conversion should preserve the relationship between the data, which suggests that there should be a weight measure to represent this relationship. The following is one example. We construct a standard weighted graph G' = (V', E') where V' = V and each edge in E' corresponds to a pair of items (i, j) such that i and j appear together in at least one query. Furthermore, we define for each edge (i, j) a weight w(i, j) as follows:

    w(i, j) = \begin{cases} \sum_{q \in E : i, j \in q} 1 / \binom{|q|}{2} & \text{if } (i, j) \in E' \\ 0 & \text{otherwise.} \end{cases}
Essentially, the weight w(i, j) provides a pairwise measure of the relationship between item i and item j. Here, an occurrence of i and j in a small query contributes a larger weight than one in a large query. If we do not consider the query size, the following is another example for modeling w(i, j):
    w(i, j) = \begin{cases} |\{ q \in E : i, j \in q \}| & \text{if } (i, j) \in E' \\ 0 & \text{otherwise.} \end{cases}
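For illustration, the following is a minimal sketch (not part of the dissertation's toolchain) of building the weighted edge set of G' from a list of queries under either weighting scheme; the function and variable names are hypothetical.

    from collections import defaultdict
    from itertools import combinations
    from math import comb

    def build_weighted_edges(queries, size_aware=True):
        w = defaultdict(float)                   # w[(i, j)] with i < j; keys form E'
        for q in queries:
            items = sorted(set(q))
            pairs = comb(len(items), 2)          # C(|q|, 2)
            for i, j in combinations(items, 2):
                w[(i, j)] += 1.0 / pairs if size_aware else 1.0
        return w

    # Example: build_weighted_edges([{1, 2, 3}, {2, 3}], size_aware=True)
    # gives w[(2, 3)] = 1/3 + 1 and w[(1, 2)] = w[(1, 3)] = 1/3.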
In future research, I will explore different definitions of the measure w and their effectiveness.

The second task concerns the formulation of the embedding optimization. Unlike the standard graph embedding formulation presented earlier, we have a graph G' associated with a weight measure w(·, ·). Because of this weight, we need to revise the formulation of the embedding. One possible way is to re-define the graph distance d_G(i, j) for every pair of vertices i and j; a larger weight w(i, j) should make i and j closer in this distance. One example
for the definition of d_G is as follows. Let i_1 i_2 ... i_l denote the shortest (hop) path between i = i_1 and j = i_l. Then,

    d_G(i, j) = \sum_{h=1}^{l-1} \frac{1}{w(i_h, i_{h+1})}.
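A small sketch of this distance, assuming the networkx library and a graph Gp holding G' with the weight stored as an edge attribute "w" (illustrative names), is given below; note that when several hop-shortest paths exist it simply uses the one returned by the library.

    import networkx as nx

    def graph_distance(Gp, i, j):
        # Hop-shortest path first (unweighted BFS), then sum reciprocal weights along it.
        path = nx.shortest_path(Gp, source=i, target=j)
        return sum(1.0 / Gp[u][v]["w"] for u, v in zip(path, path[1:]))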
Another way is to keep the graph distance d_G as the hop-based distance, ignoring the edge weights, but formulate the embedding problem as follows:
    \min_{x_1, x_2, \ldots, x_N} \sum_{i=1}^{N} \sum_{j=1}^{N} (1 + w(i, j)) \left( d_X(x_i, x_j) - d_G(i, j) \right)^2.    (4.2)
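As a concrete illustration, the following is a minimal sketch, assuming a Euclidean embedding space R^k and precomputed N × N matrices D of graph distances and W of weights, of minimizing the weighted stress in (4.2) by plain gradient descent; the function name, step size, and iteration count are arbitrary choices, not the dissertation's implementation.

    import numpy as np

    def embed(D, W, k=2, steps=500, lr=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        N = D.shape[0]
        X = rng.normal(size=(N, k))                  # initial positions x_1, ..., x_N
        for _ in range(steps):
            diff = X[:, None, :] - X[None, :, :]     # pairwise differences x_i - x_j
            dist = np.linalg.norm(diff, axis=2)      # d_X(x_i, x_j)
            np.fill_diagonal(dist, 1.0)              # avoid division by zero on i = j
            coeff = (1.0 + W) * (dist - D) / dist    # per-pair residual term
            np.fill_diagonal(coeff, 0.0)
            grad = 4.0 * (coeff[:, :, None] * diff).sum(axis=1)
            X -= lr * grad
        return X

Each iteration costs O(N^2 k), which is consistent with the remark below that plain gradient descent is not efficient for large inputs.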
This problem can be solved, albeit not very efficiently for a large input size, by a gradient descent algorithm. In the future, I will investigate the choice of a graph distance as well as more efficient optimization algorithms.

The third task is how to find the best space X^k. The above discussions assume that X^k is known, but in fact we need to find it. The best embedding space should be one that results in the least distance distortion error. The best candidate for X^k is a Riemannian manifold, which can represent either a flat space (Euclidean) or a curved space (non-Euclidean). Given the data graph G, we can analyze the values of the graph distance d_G to first find out whether they most likely correspond to a positive, negative, or zero curvature (averaged over the graph) and then, respectively, try to find a spherical, hyperbolic, or Euclidean space in which to embed G. The dimension k can be found by starting with k = 2 and increasing k until the embedding error stops improving; the stopping k is used as the dimensionality of the embedding space.

The last but not least task is how to address the sequential online setting. The optimization to find the points {x_1, x_2, ..., x_N} assumes knowing the entire set of queries. As queries arrive in a streaming manner, the relationships between
the items may change dynamically and we need an embedding algorithm that can efficiently adjust the embedding positions accordingly. This should be done in an incremental manner, without having to solve the embedding optimization problem from scratch. Combining the objectives of the above research tasks with these sequential and online requirements poses fresh challenges for which I am not aware of any efforts in the literature. My future research will be dedicated to addressing these challenges and will leverage results from different areas of relevance, including geometric embedding, optimization, and graph algorithms. I will evaluate the partitioning techniques resulting from this direction against the EA and online greedy solutions presented in the previous chapters.
CHAPTER 5 CONCLUSIONS
In the era of big data, data storage has increasingly been distributed. Considering the huge amount of data stored, a good data partitioning scheme is necessary to help users retrieve data efficiently [65–67]. Most current data partitioning approaches adopt data-centric schemes, where data with similar content are assigned to the same server, to minimize cross-server communication cost. However, today's complex distributed systems usually store a variety of data together. Those data may have many attributes or belong to incomparable categories, making it difficult to quantify content similarity. To address this problem, I proposed a novel query-centric data partitioning approach, where the only relationship among data items is their co-appearance in queries. This approach is independent of any similarity measure. I formulated the unexplored query-centric data partitioning problem as a multi-objective optimization problem and adopted an evolutionary algorithm (EA) framework to explore Pareto-optimal partitioning solutions. In particular, I have conducted two case studies: query-centric partitioning of an online social network and query-centric partitioning of a general distributed network. The key contributions of the dissertation are summarized below.

Online Social Networks (OSNs) are undoubtedly among the most popular and important media, serving as a daily communication platform for people. In an OSN, user queries usually request the data of a set of neighbors. In this case, data from socially connected users should be stored on the same server
as much as possible. This dissertation analyzed the properties of user activities and data associated-ness, and formulated the data partitioning problem of OSNs as a multi-objective optimization problem whose goals are to minimize the total read load and to balance the maintenance load. I have developed S-PUT, an effective EA-based query-centric data partitioning framework to explore Pareto-optimal partition solutions. To improve the convergence of the EA process, S-PUT leverages the partition results of the classic graph partitioning algorithm METIS. Through the evaluation study on three real-world social graph samples and the comparison with the conventional EA and METIS, I have shown that S-PUT can offer both a minimized total read load and excellent load balancing. S-PUT is suitable for implementation in a storage system with identical communication cost from any server to any other. In practice, S-PUT can be used to produce a good benchmark for comparing techniques. It can also provide good initial partitioning solutions for further improvement, for example, where the geographical locality between users and servers and that among the servers themselves are taken into account.

After conducting the study of query-centric data partitioning for OSNs, I extended the research to the broader case where a query can be a set of arbitrary data items. Moreover, in this case, I am interested in seeking sequentially optimal partitions instead of a single partition optimized for a "batch" workload of queries. As queries arrive one at a time, the system is given the option to reconfigure the partition after each query so that it can best serve the next query. Formulating this as a multi-objective optimization problem with the goals of minimizing query read cost and minimizing data migration cost, I developed Q-PUT, an EA framework to explore Pareto-optimal solutions. I designed and incorporated several online heuristics, which mimic the online
partition revision heuristics, in Q-PUT to reduce the solution space and make it converge faster to good solutions. Furthermore, I investigated several online algorithms that are based on the heuristics recommended by Q-PUT. Through the evaluation study on four real-world datasets, I have shown that there is a real benefit in migrating items between the servers during the query sequence. Q-PUT is the first attempt in the literature to apply EA to finding sequential partitions optimized for sequential queries. It can serve as a benchmark for future studies. The study of Q-PUT also suggests that an effective migration heuristic is to choose the items for migration among the most recently requested items and to move them all to the same destination server (how to choose this server is an interesting question for the next step in our research).

In summary, the query-centric storage partitioning approach has the following features: (1) it is independent of any similarity measure, (2) it adopts an EA framework to explore the best trade-offs between conflicting objectives, and (3) it adjusts data placements dynamically as queries arrive in a streaming manner. Thus it is applicable to both similarity search and non-similarity search. Similarity search typically includes:

Range Search: a query is a range of values based on some attribute(s).
Examples of this kind of distributed system include online media systems, where a query is usually a range of media data with similar content.

K-Nearest Neighbor (KNN) Search: a query is to find the K closest points to a specific point, such as in navigation systems, where a search returns the K nearest locations.

Non-similarity search includes:
Skyline Search: a query retrieves a set of data items that offer the best trade-offs among several aspects. Examples of this kind of system include car dealer systems, where users may be interested in buying a car with a good trade-off between minimum age and minimum price, and real-estate systems, where a user may want to find houses with minimum price and maximum quality of neighborhood schools.

Itemset Search: a query is a set of arbitrary items, such as in e-commerce systems, where a transaction usually contains a variety of products, and online search engines, where a query could combine user data and the advertisement data applied to this user.

The proposed approach is limited to distributed systems where the cross-server communication cost and the data migration cost from one server to another are identical regardless of the geographic distance between the servers and the bandwidth of the networks. In future work, I would like to explore the following research directions: (1) incorporate other constraints such as geo-location related constraints, (2) optimize for replication, (3) explore online algorithms assuming some known pattern about the query sequence, and (4) conduct a theoretical analysis of transforming the data defined in the query-centric approach into a feature-based model based on graph embedding in a feature space.
REFERENCE LIST
[1] Y. Mao, J. Wang, J. P. Cohen, and B. Sheng. Pasa: Passive broadcast for smartphone ad-hoc networks. In 2014 23rd International Conference on Computer Communication and Networks (ICCCN), pages 1–8, Aug 2014.
[2] Ying Mao, Jiayin Wang, and Bo Sheng. Dab: Dynamic and agile buffercontrol for streaming videos on mobile devices. Procedia Computer Science, 34:384 – 391, 2014. The 9th International Conference on Future Networks and Communications (FNC’14)/The 11th International Conference on Mobile Systems and Pervasive Computing (MobiSPC’14)/Affiliated Workshops. [3] Y. Mao, J. Wang, and B. Sheng. Mobile message board: Location-based message dissemination in wireless ad-hoc networks. In 2016 International Conference on Computing, Networking and Communications (ICNC), pages 1–
5, Feb 2016. [4] Y. Mao, B. Sheng, and M. C. Chuah. Scalable keyword-based data retrievals in future content-centric networks. In 2012 8th International Conference on Mobile Ad-hoc and Sensor Networks (MSN), pages 116–123, Dec
2012. [5] Number of monthly active facebook users worldwide as of 4th quarter 2016 (in millions). https://www.statista.com/statistics/264810/numberof-monthly-active-facebook-users-worldwide.
[6] 36 mind blowing youtube facts, figures and statistics 2017. https://fortunelords.com/youtube-statistics.
[7] Carlo Curino, Evan Jones, Yang Zhang, and Sam Madden. Schism: a workload-driven approach to database replication and partitioning. Proceedings of the VLDB Endowment, 3(1-2):48–57, 2010.
[8] Berenice Carrasco, Yi Lu, and Joana M. F. da Trindade. Partitioning social networks for time-dependent queries. In Proceedings of the 4th Workshop on Social Network Systems, SNS ’11, New York, NY, USA, 2011.
[9] Duc A. Tran, Khanh Nguyen, and Cuong Pham. S-clone: Socially-aware data replication for social networks. Computer Networks, 56(7):2001–2013, 2012. [10] Josep M. Pujol, Vijay Erramilli, Georgos Siganos, Xiaoyuan Yang, Nikos Laoutaris, Parminder Chhabra, and Pablo Rodriguez. The little engine(s) that could: scaling online social networks. In Proceedings of the ACM SIGCOMM 2010 conference, SIGCOMM ’10, pages 375–386, New York, NY,
USA, 2010. [11] N. Kallen, R. Pointer, E. Ceaser, and J. Kalucki. Introducing gizzard, a framework for creating distributed datastores. Technical report, Twitter Engineering Website, April 2006. [12] Avinash Lakshman and Prashant Malik. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35–40, 2010. [13] Tieyun Qian, Yang Yang, and Shuo Wang. Refining Graph Partitioning for Social Network Clustering, pages 77–90. Berlin, Heidelberg, 2010.
[14] Duc A. Tran and Ting Zhang. Socially aware data partitioning for distributed storage of social data. In 2013 IFIP Networking Conference, pages 1–9, May 2013. [15] Duc A. Tran and Ting Zhang. S-put: An ea-based framework for socially aware data partitioning. Computer Networks, 75, Part B:504 – 518, 2014. [16] Ting Zhang and Duc A. Tran. On query-adaptive online partitioning: A study of evolutionary algorithms. In 2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC), pages 1–8, Dec
2016. [17] Ting Zhang and Duc A. Tran. Query-adaptive online partitioning of associated data for efficient retrieval. In The 31st IEEE International Conference on Advanced Information Networking and Applications (AINA), Mar 2017.
[18] Facebook's memcached multiget hole: More machines != more capacity. http://highscalability.com/blog/2009/10/26/facebooks-memcached-multiget-hole-more-machines-more-capacit.html.
[19] Muhammad Anis Uddin Nasir, Fatemeh Rahimian, and Sarunas Girdzijauskas. Gossip-based partitioning and replication for online social networks. In 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), pages 33–42, Aug 2014.
[20] George Karypis and Vipin Kumar. Metis – unstructured graph partitioning and sparse matrix ordering system, version 2.0. Technical report, 1995. [21] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Powergraph: Distributed graph-parallel computation on natural
graphs. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI’12, pages 17–30, Berkeley, CA, USA,
2012. [22] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: amazon’s highly available key-value store. ACM SIGOPS Operating Systems Review, 41(6):205– 220, 2007. [23] Imranul Hoque and Indranil Gupta. Disk layout techniques for online social network data. IEEE Internet Computing, 16(3):24–36, 2012. [24] Lei Jiao, Jun Li, Tianyin Xu, Wei Du, and Xiaoming Fu. Optimizing cost for online social networks on geo-distributed clouds. IEEE/ACM Transactions on Networking, 24(1):99–112, 2014.
[25] Thang Nguyen Bui and Byung Ro Moon. Genetic algorithm and graph partitioning. IEEE Transactions on Computers, 45(7):841–855, 2002. [26] Peter Sanders and Christian Schulz. Think Locally, Act Globally: Highly Balanced Graph Partitioning, pages 164–175. Berlin, Heidelberg, 2013.
[27] Matthias Bröcheler, Andrea Pugliese, and V. S. Subrahmanian. DOGMA: A Disk-Oriented Graph Matching Algorithm for RDF Databases, pages 97–113.
Berlin, Heidelberg, 2009. [28] Kisung Lee and Ling Liu. Scaling queries over big rdf graphs with semantic hash partitioning. Proceedings of the VLDB Endowment, 6(14):1894–1905, 2013.
[29] Quang Duong, Sharad Goel, Jake Hofman, and Sergei Vassilvitskii. Sharding social networks. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 223–232, New York, NY, 2013.
[30] Amartya Sen. On Economic Inequality. Clarendon Press, Oxford, 1973. [31] W. J. Reichmann. Use and abuse of statistics. Methuen London, 1961. [32] Graham Upton and Ian Cook. A Dictionary of Statistics. Oxford University Press, 2008. [33] Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander flows, geometric embeddings and graph partitioning. In Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, STOC ’04, pages 222–
231, New York, NY, 2004. [34] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.
[35] M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.
[36] Jure Leskovec, Kevin J. Lang, and Michael Mahoney. Empirical comparison of algorithms for network community detection. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, pages 631–
640, New York, NY, USA, 2010. [37] C. Walshaw, M. Cross, and M.G. Everett. Parallel dynamic graph partitioning for adaptive unstructured meshes. Journal of Parallel and Distributed Computing, 47(2):102 – 108, 1997.
[38] Thomas Karagiannis, Christos Gkantsidis, Dushyanth Narayanan, and Antony Rowstron. Hermes: Clustering users in large-scale e-mail services. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 89–100, New York, NY, USA, 2010. [39] Isabelle Stanton and Gabriel Kliot. Streaming graph partitioning for large distributed graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pages 1222–
1230, New York, NY, USA, 2012. [40] Eckart Zitzler and Lothar Thiele. Multiobjective optimization using evolutionary algorithms - a comparative case study. In Proceedings of the 5th International Conference on Parallel Problem Solving from Nature, PPSN V,
pages 292–304, London, UK, 1998. [41] E. Zitzler, M. Laumanns, and L. Thiele. SPEA2: Improving the strength pareto evolutionary algorithm for multiobjective optimization. In Evolutionary Methods for Design Optimization and Control with Applications to Industrial Problems, pages 95–100, Athens, Greece, 2001.
[42] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: Nsga-ii. IEEE Transactions on Evolutionary Computation, 6(2):182–197, Apr 2002.
[43] Khanh Nguyen and Duc A. Tran. An analysis of facebook activities. In Proceedings of IEEE Consumer Communications and Networking Conference (CCNC 2011), Las Vegas, NV, January 2011.
[44] Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow. The anatomy of the facebook social graph. CoRR, abs/1111.4503, 2011.
[45] Sudheendra Hangal, Diana Maclean, Monica S. Lam, and Jeffrey Heer. All friends are not equal: Using weights in social graphs to improve search. In Workshop on Social Network Mining & Analysis, 2010. [46] Rainer Burkard, Mauro Dell’Amico, and Silvano Martello. Assignment Problems. Philadelphia, PA, USA, 2009.
[47] David Liben-Nowell, Jasmine Novak, Ravi Kumar, Prabhakar Raghavan, and Andrew Tomkins. Geographic routing in social networks. Proceedings of the National Academy of Sciences of the United States of America,
102(33):11623–11628, 2005. [48] Salvatore Scellato, Cecilia Mascolo, Mirco Musolesi, and Vito Latora. Distance matters: geo-social metrics for online social networks, pages 1–8.
USENIX, 2010. [49] Rainer E. Burkard, Stefan E. Karisch, and Franz Rendl.
Qaplib –
a quadratic assignment problem library. Journal of Global Optimization, 10(4):391–403, 1997. [50] Rainer E. Burkard, Eranda C ¸ ela, Panos M. Pardalos, and Leonidas S. Pitsoulis. The Quadratic Assignment Problem, pages 1713–1809. Springer US, Boston, MA, 1999. [51] Boyang Yu and Jianping Pan. Location-aware associated data placement for geo-distributed data-intensive applications. In 2015 IEEE Conference on Computer Communications (INFOCOM), pages 603–611, April 2015.
[52] George Karypis and Vipin Kumar. Multilevel k-way hypergraph partitioning. In Proceedings of the 36th Annual ACM/IEEE Design Automation Conference, DAC ’99, pages 343–348, New York, NY, 1999.
[53] Ümit Çatalyürek and Cevdet Aykanat. PaToH (Partitioning Tool for Hypergraphs), pages 1479–1487. Boston, MA, 2011.
[54] Ümit V. Çatalyürek, Erik G. Boman, Karen D. Devine, Doruk Bozdağ, Robert T. Heaphy, and Lee Ann Riesen. A repartitioning hypergraph model for dynamic load balancing. Journal of Parallel and Distributed Computing, 69(8):711–724, 2009.
[55] Aaron J. Elmore, Vaibhav Arora, Rebecca Taft, Andrew Pavlo, Divyakant Agrawal, and Amr El Abbadi. Squall: Fine-grained live reconfiguration for partitioned main memory databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15,
pages 299–313, New York, NY, 2015. [56] Konstantinos Maragos, Kostas Siozios, and Dimitrios Soudris. An evolutionary algorithm for netlist partitioning targeting 3-d fpgas. IEEE Embedded Systems Letters, 7(4):117–120, Dec 2015.
[57] Dan Alistarh, Jennifer Iglesias, and Milan Vojnovic. Streaming min-max hypergraph partitioning. In Advances in Neural Information Processing Systems 28, pages 1891–1899. Curran Associates, Inc., 2015.
[58] Charalampos Tsourakakis, Christos Gkantsidis, Bozidar Radunovic, and Milan Vojnovic. Fennel: Streaming graph partitioning for massive scale graphs. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM ’14, pages 333–342, New York, NY, 2014.
[59] Laurent Lyaudet. Np-hard and linear variants of hypergraph partitioning. Theoretical Computer Science, 411(1):10 – 21, 2010.
[60] Allan Borodin, Nathan Linial, and Michael E. Saks. An optimal on-line algorithm for metrical task system. J. ACM, 39(4):745–763, October 1992.
[61] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
Heidelberg, 2005. [63] Jan de Leeuw and Patrick Mair. Multidimensional scaling using majorization: Smacof in r. Journal of Statistical Software, 31(1):1–30, 2009. [64] R. C. Wilson, E. R. Hancock, E. Pekalska, and R. P. W. Duin. Spherical and hyperbolic embeddings of data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2255–2269, Nov 2014.
[65] J. Wang, T. Wang, Z. Yang, N. Mi, and B. Sheng. esplash: Efficient speculation in large scale heterogeneous computing systems. In 2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC), pages 1–8, Dec 2016.
[66] Y. Yao, J. Lin, J. Wang, N. Mi, and B. Sheng. Admission control in yarn clusters based on dynamic resource reservation. In 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM), pages 838–
841, May 2015. [67] Y. Mao, J. Wang, and B. Sheng. Skyfiles: Efficient and secure cloudassisted file management for mobile devices. In 2014 IEEE International Conference on Communications (ICC), pages 4202–4207, June 2014.