
Deliverable D3.1: Intermediate RECOGNITION solutions for content management

Version: 1.0
Delivery date: 30th June 2012
Keywords: self-awareness, content management, cognitive models
Workpackage: WP3 Algorithms and protocols for content management
Editor: EURECOM (EUR)
Contributing partners: Cardiff University (CU), Consiglio Nazionale delle Ricerche (CNR), University of Cambridge (UCAM), National and Kapodistrian University of Athens (NKUA)

Abstract: This deliverable will present Intermediate RECOGNITION solutions for trusted and secure content management, content dissemination and retrieval.


Table of Contents

Executive summary
1 Introduction
  Memory and learning
  Personality and disposition
2 Content consumption
  2.1 Profiling human behavior
    2.1.1 Human mobility patterns
    2.1.2 Personal profiling of users in on-line social networks
    2.1.3 User's 'influence' in social networks for information diffusion
  2.2 Cognitive approaches for the acquisition and consumption of content
    2.2.1 Tag-cloud representation of places for location based services
  2.3 Decision making for resource selection
3 Content dissemination
  3.1 Content dissemination using the "recognition" heuristic
4 Content management
  4.1 Usage control mechanisms
  4.2 Distributed placement of autonomic internet services
  4.3 Assessing the effectiveness of centrality cues coming from bounded (social) connectivity views
5 Conclusion
References
Appendix A – Measuring Individual Regularity in Human Visiting Patterns
Appendix B – Personal profiling in on-line social networks
Appendix C – Influential Neighbours Selection for Information Diffusion
Appendix D – Parking search in smart urban environments
Appendix E – Content dissemination using the "recognition" heuristic
Appendix F – Usage Control Mechanisms
Appendix G – Distributed Placement of Autonomic Internet Services


Executive summary

This deliverable concerns the application of the generic principles and concepts for self-awareness described in D1.3 to algorithms, procedures, and system processes for the acquisition, dissemination, and management of content. These algorithms, procedures and processes will provide a link between the cognitive functionalities identified in the first work-package and the definition of more specific ICT components and systems to support self-awareness. The rest of the document is organised under the three main scenarios identified in D1.2 for the consumption, dissemination and management of content. For each of these main sections we present a detailed description of the research undertaken with reference to a number of scientific publications produced during the second year of activity.


1 Introduction

This deliverable concerns the intermediate results obtained from all five tasks of the work-package. These relate to a number of suitable scenarios that have been categorised into three main groups related to the consumption, dissemination and management of content introduced in the first year of activity (see Deliverable D2.1) and represented in Figure 1.

Figure 1 - Venn Diagram Representation of the three main scenarios for content

We can summarise these different scenarios as concerning the following (see D2.1 for a more detailed description of each individual scenario):

• consumption of content, the knowledge or utility that it provides, and the decisions taken by users either explicitly or implicitly through interaction with it. In essence, we are concerned with users consuming content and interacting with content – moving from one item to the next;

• dissemination of content as the provision and acquisition of content in a spatial and mobile context. In essence we are concerned with mobile users wandering around a city being pushed content that is spatially and temporally relevant;

• management of content as novel approaches for content to become "self-aware" and self-managing. Within this scenario, the overall goal is to design new techniques that allow content to become responsive to the needs of the content provider through management of accessibility and availability. This includes the re-definition of a number of existing content-related operations such as content search/discovery, placement, and replication.

The principal tasks covered by this work-package are: T3.1 to explore storage management policies; T3.2 that applies a user-node-centric approach to data generation and distribution algorithms; T3.3 about making data self-aware, i.e. how awareness about the environment can be embedded in the content itself. Also covered are T3.4 for the definition of policies for intelligent acquisition and geo-spatial information retrieval in a content-centric Internet, extending to a physical pervasive environment and human mobility, in which physical places and spaces support autonomic acquisition of relevant content; and finally, T3.5 that will study the trust and security management issues motivated by solutions designed in the previous tasks. In summary:

• This work-package primarily concerns the application of the generic principles and concepts described in D1.3 to the definition of algorithms, procedures and system processes based on the scenarios defined in D2.1.

• It will also indicate how to evolve from WP1 to the definition of the concrete ICT systems and protocols to be validated by the large-scale simulations and experiments in WP4.

The linking of the generic principles and concepts towards the three scenarios defined in Figure 1 is given in Figure 2, showing how they inform one another and combine to enable functionality within the three scenarios of content consumption, content management and content dissemination.


Figure 2 - Linking of generic principles and components within the three scenarios (the figure groups components such as user influence, profiling, mobility, social connectivity, sense of place, content acquisition, content filtering, resource selection, decision making, content placement, centrality, the recognition heuristic and usage control under the three scenarios of content consumption, content dissemination and content management)

In the remaining part of this document we highlight the specific contributions in terms of algorithms and procedures that implement specific cognitive functionalities for self-awareness. These are primarily related to each of the five principles described in D1.3, thus supporting different components of the tri-partite model. These high-level principles will then allow the development of each specific cognitive function in the more specific ICT scenarios highlighted above. Table 1 reflects how each scenario (content consumption, content dissemination and content management) relates to the five principles.


Table 1 – Relationship between self-awareness and the three scenarios. The table maps the principles for self-awareness of the tri-partite model for cognition (memory and learning, personality and disposition, social connectivity, sense of place, decision making) onto the three scenarios of content consumption, content dissemination and content management.


2 Content consumption

Methodologies for personalised acquisition and consumption of content are at the basis of several ICT scenarios. They directly relate to one or more of the general cognitive areas identified in work-package one. In particular the resulting methodologies for content acquisition and consumption would benefit from the following self-aware concepts (as listed in D1.3):

• Personality and disposition. User characteristics such as personal interests and mobility behaviour are related to, and reflect, the inherent personality of individuals.

• Social connectivity. A further component for personalised content provision is user profiling based on their role within on-line social networking services.

• Sense of place. Spatial and temporal awareness, intended as the individual perception of a specific user in a particular location at a particular time, is a fundamental principle that influences our choices in terms of acquisition and consumption of content.

• Decision-making. The efficiency of state-of-the-art systems in the context of "smart cities" is attributed mostly to people's behaviour and especially decisions on the use and management of content/information made available by these systems.

In the remaining part of this section we present the individual contributions resulting from the second year of activity. These are structured into two main sub-sections: one concerning more specific components/concepts of the self-awareness bubble and one on more complete approaches to service provision.

2.1 Profiling human behavior

In this section we focus on the increasingly popular applications for mobile platforms (e.g. smartphones, tablets, etc.), either in the form of social networking or location based services (e.g. maps, recommendation systems within a local area). We then investigate primarily the behaviour of individuals in relation to the characteristics of the locations visited, their patterns of regularity, and their role in social networking applications.

2.1.1 Human mobility patterns

We show in this section two applicative examples that focus on the monitoring of individuals purely based on their mobility behaviour. This can reveal more inherent user characteristics, such as interest preferences and personality, based on the frequency and regularity of the places visited. These works primarily fulfil task T3.4, investigating self-* policies for intelligent acquisition and geo-spatial information retrieval in a content-centric scenario, extending to a physical pervasive environment and human mobility, in which physical places and spaces support autonomic acquisition of relevant content. In particular, this task explores the potential for interaction between devices, artefacts and systems to build intelligent personal profiles, based on the behaviour of individuals as they move through their environment over time. Enabling personal profiling allows individuals to understand their own embedded behaviour and increases self-awareness. This will be captured in the final RECOGNITION node architecture. This ability can be used for predictions that enable advanced services and content to be provided, taking into account awareness of the physical situation in a mobile setting. Profiles can potentially be shared between users, from which they can make deductions and inferences in terms of communication. The cognitive principles lying behind this research concern personality and disposition and the notion of sense of place.

We explore different ways of understanding and exploiting mobility, which is vital for location based services. To address this we propose a method for monitoring the behaviour of individuals focusing exclusively on mobility data. In Peng et al. (2012), and as reported in Appendix A, we analysed the mobility patterns of taxi trips in an urban area by looking at the passenger traffic pattern for 1.58 million taxi trips in Shanghai, China. By employing non-negative matrix factorization and optimization methods, the authors find that people travel on workdays mainly for three purposes: commuting between home and workplace, travelling from workplace to workplace, and others such as leisure activities. Therefore, traffic flow in one area or between any pair of locations can be approximated by a linear combination of three basis flows, corresponding to the three purposes respectively, which can then define three corresponding profiles for human mobility. The coefficients in the linear combination are referred to as traffic powers, each of which indicates the strength of each basis flow.
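As a rough illustration of this decomposition (a sketch with synthetic data, not the implementation used in Peng et al. (2012)), a non-negative matrix factorization of a flow matrix into three components yields basis flows and per-day traffic powers:

```python
# Illustrative sketch only: decomposing a hypothetical origin-destination flow
# matrix into three basis flows with non-negative matrix factorization, in the
# spirit of the taxi-trip analysis summarised above.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Rows: 50 hypothetical origin-destination pairs; columns: 30 workdays.
# Each entry is the observed number of trips for that pair on that day.
flows = rng.poisson(lam=20, size=(50, 30)).astype(float)

# Three components, matching the three travel purposes identified in the study
# (home-work commuting, work-to-work trips, and other/leisure trips).
model = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
basis_flows = model.fit_transform(flows)   # shape (50, 3): the basis flows
traffic_powers = model.components_         # shape (3, 30): strength per day

# The flow for any pair/day is approximated by a non-negative linear
# combination of the three basis flows, weighted by the traffic powers.
reconstruction = basis_flows @ traffic_powers
print("reconstruction error:", np.linalg.norm(flows - reconstruction))
```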

The traffic powers on different days are typically different even for the same location, due to the uncertainty of human motion. Therefore, we provide a probability distribution function for the relative deviation of the traffic power. This distribution function is expressed in terms of a series of normalized binomial distributions. It can be well explained by statistical theories and is verified by empirical data. These findings are applicable in predicting road traffic, tracing traffic patterns and diagnosing traffic-related abnormal events. These results can also be used to infer land uses of urban areas quite parsimoniously. More details about this work can be found in Appendix A.

The mobility pattern of an individual is highly personal and contains information that can be modelled and leveraged for personalised content provision. One particularly interesting characteristic is the level of regularity with which an individual visits locations. To address this we propose a novel metric for quantifying the regularity with which an individual visits a particular location; see Williams et al. (2012). Much of the mobility data we deal with is only available as zero-duration events, and therefore our proposed measure, called IVI-irregularity (i.e., inter-visit interval irregularity), is designed for these data. The measure is adapted from a synchrony measure (Kreuz et al., 2009) used in neural coding, the branch of neurophysiology concerned with the coding of information among the neurons in the brain. This has been developed as a consequence of the RECOGNITION project's initial workshop and interaction with psychologists.

In the context of neural coding, neurophysiologists deal with ensembles of spike trains, where each train represents the instantaneous electrical pulses (or spikes) of a particular neuron. An ensemble of spike trains is said to exhibit high synchrony if the spikes in the trains occur at similar times. Spikes can be regarded as abstract, zero-duration events; in our case, spikes correspond to visits to a particular location. An example ensemble of trains for a user's visits to a particular location over the period of four weeks is shown in Figure 3.


Figure 3 - Example visit trains for a Dartmouth student's visits to a particular location (one train per week, weeks 1-4, plotted from Monday 00:00 to Sunday 24:00)

The spike train synchrony measure quantifies the dissimilarity in visits in different weeks; if visits in each week occur at very similar times, then dissimilarity is very low, and thus regularity is high. The IVI-irregularity measure takes a value of zero when visits are at identical times each week; larger IVI-irregularity values indicate more irregularity in the visiting times.

As proof of concept, we used IVI-irregularity to study visit regularity in three datasets, each representing a different scenario. We considered users' check-ins on Foursquare (a popular location-sharing service) in three urban areas. We also considered a dataset of student and staff movements on a university campus (Dartmouth College), inferred from their devices' WLAN access locations. Finally, we studied the regularity of visits made by London Underground Oyster card users to metro stations. The measure allows us to compare the overall regularity of the three populations, and we observe that campus visits are the most irregular, likely due to the flexible and spontaneous nature of student behaviour, and transport visits are the most regular, likely due to the significant commuter population. We have also found that there is a great deal of diversity in the regularity within a population.

For the purposes of user profiling, we can narrow our focus to the individual. The measure can tell us how many of an individual's locations he or she has a regular visit pattern with. By setting a threshold IVI-irregularity value we can differentiate between user-location pairs that have almost-perfect regularity and pairs that are irregular. By doing this we found that 8% of Foursquare users and Dartmouth WLAN users had at least one location that they visited with high regularity. In the case of the London Underground, we have found that 21% of passengers have at least one regular location, which is likely due to the more routine nature of travel.
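The following sketch gives a rough, simplified proxy for such a week-on-week dissimilarity score; it follows the spirit of the ISI-distance of Kreuz et al. (2009) but is not the published IVI-irregularity measure, whose exact definition is given in Appendix A:

```python
# Rough proxy (not the published measure): compare the current inter-visit
# interval of two weekly visit trains at sampled times and average over all
# pairs of weeks. 0 means identical weekly timing; larger values mean more
# irregular visiting.
from itertools import combinations
import numpy as np

WEEK = 7 * 24 * 3600.0  # seconds in a week


def current_interval(train, t):
    """Length of the inter-visit interval containing time t (train is sorted)."""
    before = [v for v in train if v <= t]
    after = [v for v in train if v > t]
    lo = before[-1] if before else 0.0
    hi = after[0] if after else WEEK
    return hi - lo


def weekly_irregularity(weekly_trains, samples=200):
    """weekly_trains: list of sorted visit-time lists, one per week
    (seconds from the start of that week)."""
    ts = np.linspace(0.0, WEEK, samples, endpoint=False)
    scores = []
    for a, b in combinations(weekly_trains, 2):
        if not a or not b:
            continue
        diffs = [abs(current_interval(a, t) - current_interval(b, t))
                 / max(current_interval(a, t), current_interval(b, t))
                 for t in ts]
        scores.append(float(np.mean(diffs)))
    return float(np.mean(scores)) if scores else float("nan")


# Perfectly regular visitor: the same two visit times every week -> score ~0.
regular = [[9 * 3600.0, 18 * 3600.0] for _ in range(4)]
print(weekly_irregularity(regular))
```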


Our findings have implications for content provision and user profiling. With our measure we are able to extract the places that an individual visits with a regular (i.e., week-on-week consistent) visit pattern, and we have found that individuals are diverse in their levels of irregularity. Some users have many places they visit with regularity, whereas others have none. It is important, therefore, that users be treated differently depending on their overall level of regularity, and that we identify which (if any) of their locations they visit with consistent patterns. Doing so allows content to be personalised both spatially and temporally. Furthermore, the extent of an individual's regularity may be in part linked to personality traits. More neurotic individuals are likely to favour routine in the timing of their activities, such as when to eat lunch or to go grocery shopping. Other individuals may be more spontaneous in when they choose to visit locations, and this will be reflected in the IVI-irregularity values. By comparing personality traits of users with their visit patterns we can attempt to identify any correlations between traits and visit regularity. A complete description of the algorithms and an extended set of results are available in Appendix A.

2.1.2 Personal profiling of users in on-line social networks

Profiling user behaviour in terms of preferences and needs is linked to many influential factors of human cognition, such as the processes of filtering and discovery, recognition of cognitive cues (knowledge); establishing priorities and personal affinities, e.g. in social relationships (knowledge and reasoning); and information processing and knowledge exchange (reasoning). Profiles of different personalities can also indirectly affect the adaptation phase (module three), since the learning mechanisms can vary in relation to different preferences, personalities and behaviours. Profiling users' behaviour in social networks has been widely used by various applications, such as commercial, financial or political applications, to enhance performance by offering personalised services to their users. Analysis has been conducted on profiling users' behaviour in the political domain. Published work on this topic is available in Appendix B; see A. Boutet, H. Kim and E. Yoneki (2012a, 2012b).

Social media such as Facebook and Twitter have revolutionized the way the world of politics is covered. As we approach an electoral year, social media will be overwhelmed by political data, e.g. political news, thoughts, chatter, and stories. Thus social media can drastically impact the political opinion of its users and eventually may even revolutionize the way users cast their votes. In addition, data provided in social media could potentially help political parties assess the influence of their campaigns, or even predict the results of an upcoming election and refine their strategies to increase their chance of winning.

We studied the impact of social media in the UK general election in 2010. This particularly investigated how social media (1) helps identify the various characteristics of a political party, and (2) impacts the political leaning of users. This analysis was run by collecting Twitter data related to the 2010 UK election. The data contained tweets (messages) comprising up to 140 characters, including some URLs or hashtags as indicators of the related topics. About 1,150,000 tweets were collected from 220,000 Twitter users between the 5th and the 12th of May. We examined the characteristics of the three main parties in the UK, namely Labour, Conservative, and Liberal Democrat, and discussed the main differences between the parties in terms of activity, influence, structure, interaction, content, mood and sentiment. The results demonstrated that Labour members were the most active and influential on Twitter during the election, while Conservative members were the most organized in promoting their activities. In addition, party members were more likely to retweet messages from their own party than from other parties. Also, the websites and blogs referenced by users following a political party were considerably different from those referenced by people following the other political parties. Finally, the sentiment analysis conducted revealed that members were more likely to express positive opinions when they referenced their own party. All of these suggested segregated structural patterns in the tweets produced by the members of different parties.

Classification methods can potentially be used for profiling users' behaviour. Existing classification methods are generally based on the assumption that the data conforms to a stationary distribution. Since the statistical characteristics of micro-blogging data, e.g. Twitter data, continuously change, this assumption may degrade the performance of a classification when used for profiling behaviour.


To address this weakness, we developed a practical user classification algorithm using a Bayesian framework to identify the political leaning of users in micro-blogging services like Twitter (A. Boutet, H. Kim and E. Yoneki, 2012a, 2012b). The classification algorithm estimated the political leaning using messages expressing the users' political view (i.e. tweets referring to a particular political party). As opposed to conventional classification methods, which demand an expensive training phase to optimally tune a set of parameters, the proposed classification only exploits the number of tweets referring to a particular political party, and so can be simply implemented without losing efficiency. This classification algorithm dramatically reduces the computational cost of the training phase and performs efficiently in online classification. In addition, the proposed classification does not require knowledge about the network topology, unlike classification methods formulated based on community structure, e.g. (Golbeck and Hansen 2011).

The effectiveness of the proposed classification algorithm was compared against three common classification methods. The comparative analysis was run using a ground truth dataset composed of users who explicitly reported their political affiliation through their profile on Twitter. The experimental results showed that the proposed classification achieves about 86% accuracy in identifying the users' political leaning. This accuracy was superior to, or at least as good as, the accuracy achieved by the three baseline classification methods. Experimental results on political data demonstrated the effectiveness of the proposed classification algorithm for profiling users' behaviour. As part of an ongoing project we will also apply the classification model to analyse user behaviour in other domains such as sporting events (e.g. the 2012 Olympics), social events, and scientific networks. A full description of this research is provided in Appendix B.
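To make the idea concrete, the sketch below shows one minimal way a count-based Bayesian classifier of this kind could be written; the per-class mention profiles, the smoothing constant and the equal class priors are illustrative assumptions, not the exact model of Boutet, Kim and Yoneki (2012a, 2012b):

```python
# Minimal count-based Bayesian sketch: a user's political leaning is inferred
# only from how many of their tweets reference each party, using a multinomial
# likelihood with simple additive smoothing and equal class priors.
import math

PARTIES = ["labour", "conservative", "libdem"]

# Hypothetical per-class mention profiles: probability that a party-referencing
# tweet from a supporter of the class mentions each party (these would be
# estimated from labelled users in practice).
MENTION_PROFILES = {
    "labour":       {"labour": 0.70, "conservative": 0.20, "libdem": 0.10},
    "conservative": {"labour": 0.20, "conservative": 0.70, "libdem": 0.10},
    "libdem":       {"labour": 0.15, "conservative": 0.15, "libdem": 0.70},
}
ALPHA = 1.0  # additive smoothing


def classify(mention_counts):
    """mention_counts: dict party -> number of the user's tweets referencing it."""
    log_post = {}
    for leaning, profile in MENTION_PROFILES.items():
        total = sum(profile.values()) + ALPHA * len(PARTIES)
        log_post[leaning] = sum(
            mention_counts.get(p, 0) * math.log((profile[p] + ALPHA) / total)
            for p in PARTIES
        )
    return max(log_post, key=log_post.get)


print(classify({"labour": 12, "conservative": 3, "libdem": 1}))  # -> 'labour'
```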

2.1.3 User's 'influence' in social networks for information diffusion

In Kim and Yoneki (2012) we have introduced the concept of a user's 'influence' in a social network, as a parameter of a multi-dimensional profile of users' behaviour. In particular, the focus is on the notion of influential neighbour selection for information diffusion in on-line social networks.


This work primarily concerns task T3.4, examining the building of intelligent personal profiles based on the behaviour of individuals, thus exhibiting cognitive abilities such as belief and similarity. Profiles can potentially be shared between users, from which they can make deductions and inferences in terms of communication. This will extend to exploring how similarity (e.g., based on profiles) can be used between individuals as a basis for self-awareness and learning. In addition, these specific contributions relate to task T3.1, investigating policies for data dissemination and management based on social structures (such as on-line social networks). For example, collaborative policies can be enforced to establish data-management coordination policies among the nodes willing to cooperate. Node cooperation can then be driven by social structures (as in this particular case) and collaboration can be introduced. The cognitive principles behind this work concern personality and disposition and social connectivity.

The users of a social network publish information, ideas, or news through the network's topology by sharing them with their neighbours. The published information can either be quickly ignored or widely propagated in the network by neighbours. Many social network users would like to spread their information to the widest extent possible. Maximizing the extent to which information is propagated in a social network is motivated when, for example, diffusing medical and technological innovations, or advertising new products.

The problem of maximizing information diffusion in online social networks has been studied in Kim and Yoneki (2012). The authors considered a scenario of a social network in which members would like to propagate their information as much as possible. However, they can only share the information with a subset of their neighbours. As opposed to previous studies, which mainly focused on maximizing information diffusion by selecting a set of arbitrary members, they assumed each member can only communicate with her immediate neighbours and has no knowledge about the network's global topology. This assumption is, indeed, true when considering a social network from the users' perspective. Therefore, the question is: what subset of neighbours should be selected so that information diffusion is maximized in the network? This problem is called influential neighbour selection (INS). Further details about this research are available in Appendix C.


We have empirically studied the INS problem by evaluating the performance of four different selection schemes, namely Random, Degree, Volume and Weighted Volume. Random is the straightforward selection method that randomly picks an arbitrary set of neighbours to propagate information. Using the Random method, users do not need any knowledge about the network topology, and the communication cost is independent of the network's topology and hence very low. However, Random neglects the fact that different neighbours may have different influences in the network. Therefore, some information about neighbours and their influences in the network may lead to a better strategy for maximizing information diffusion.

The second method studied is called Degree, which considers the network as an undirected graph where the network's members are the nodes and the communication links between the members are the edges. The Degree method defines the degree of each node as the number of nodes that are directly connected to it. The Degree method then selects the set of neighbours with the maximum degrees. This is perhaps a better strategy, compared to Random, because sharing information with high-degree neighbours would probably increase the chance that information is seen and spread by many members of the network. This method requires knowledge of the neighbours' degrees; therefore, the expected communication cost is larger than for Random.

The third method studied is called Volume, proposed by Wehmuth and Ziviani (2012). In this selection method the volume centrality of each member is defined as the sum of the degrees of all the member's neighbours. Members with high volume centrality scores are those whose neighbours have large degrees. As such, the Volume method selects a subset of neighbours with the highest volume centrality scores. The Volume method demands the degree knowledge of the neighbours within a predefined distance h > 1 from a particular member; therefore, its communication cost is bigger than both the Random and Degree methods.

The fourth method, called Weighted Volume, is an extension of the Volume method that defines a volume centrality score as a weighted sum of neighbours' degrees, varying the contribution of each neighbour to the volume centrality. The weights are adjusted based on the distance between two nodes and the neighbour's clustering coefficient.


The clustering coefficient of a node is defined as the probability that the node's neighbours are also neighbours of each other. The communication cost of Weighted Volume is the same as the communication cost of the Volume method.

We used the Independent Cascade (IC) model (Goldenberg et al. 2001), which is widely used for the analysis of information diffusion, to simulate and assess the performance of the four neighbour selection methods. Intensive simulations were run using four datasets of real-world network topologies, namely: (1) PGP with 10,680 nodes and 24,316 edges, (2) Email with 1,134 nodes and 5,453 edges, (3) Blog with 1,224 nodes and 16,718 edges and (4) Facebook with 26,701 nodes and 25,249 edges. The simulations we conducted began by assessing the correlation between the closeness centrality and each of the four selection methods. The IC model was then used to evaluate the performance of the four selection methods in terms of maximizing information diffusion in short-term and long-term propagations. Finally, the effects of changing parameters, such as the number of activated neighbours and the transition probabilities, on performance were analysed.

Overall, the simulation results suggest that the performance of the three non-random methods, which use local connectivity information, is superior to the Random method for short-term propagation. However, when considering long-term propagations, no significant difference is observed between Random and the three heuristic methods. Indeed, even with the Random method more than half of a network is covered in a long-term propagation, which seems satisfactory when considering the high communication costs of the other methods.

One of the assumptions of the aforementioned experiments was that the diffusion probability was constant across all the network's members. However, this assumption is not necessarily true in practice. As part of the ongoing study, we will run case studies on models that relax this restricting assumption, thus investigating other aspects of users' influence in information diffusion.
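For illustration, the sketch below reproduces the flavour of these experiments on a synthetic topology: a user activates her highest-degree neighbours (the Degree scheme) and the spread is then simulated under the Independent Cascade model with a uniform, assumed transition probability; the graph, k and p are placeholders rather than the actual experimental settings:

```python
# Illustrative sketch of degree-based neighbour selection followed by an
# Independent Cascade simulation (assumed parameters, synthetic topology).
import random
import networkx as nx


def independent_cascade(graph, seeds, p, rng):
    """Return the set of nodes eventually activated from `seeds`."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        new_frontier = []
        for u in frontier:
            for v in graph.neighbors(u):
                # Each newly active node gets one chance to activate each
                # inactive neighbour, with probability p.
                if v not in active and rng.random() < p:
                    active.add(v)
                    new_frontier.append(v)
        frontier = new_frontier
    return active


def degree_neighbour_selection(graph, user, k):
    """Pick the user's k neighbours with the largest degree (the 'Degree' scheme)."""
    return sorted(graph.neighbors(user), key=graph.degree, reverse=True)[:k]


rng = random.Random(0)
g = nx.barabasi_albert_graph(1000, 3, seed=0)  # stand-in for a real topology
user = 0
seeds = degree_neighbour_selection(g, user, k=3)
spread = independent_cascade(g, set(seeds) | {user}, p=0.1, rng=rng)
print(f"{len(spread)} of {g.number_of_nodes()} nodes reached")
```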


2.2 Cognitive approaches for the acquisition and consumption of content

This section introduces cognitively-inspired techniques and methodologies for the provision and consumption of content from WP1. In particular, we consider two scenarios: one concerning the definition and acquisition of content about locations and places (focusing on the cognitive spatial and temporal representation of content), and a study that investigates the impact of cognitive heuristics/biases in human behaviour and the actual decision-making process (looking at cooperative and selfish techniques in competitive environments as example scenarios).

This work primarily concerns the consumption and acquisition of content, and does so in terms of how valuable information about locations is distributed within a network, for example as alternatives among users of location based services in addition to tips, comments, etc. Within this scenario we also address the content dissemination scenario in terms of how knowing or inferring the personality and characteristics of users visiting particular places can be further used to disseminate relevant content to these individuals.

2.2.1 Tag-cloud representation of places for location based services

As introduced in the first work-package (particularly in D1.3), the concept of spatial and temporal awareness can be developed to express the individual perception of a specific user in a particular location at a particular time. It is a fundamental principle that influences and leads user choices in terms of acquisition and consumption of content, and thus it can be engaged for location based services and the selective provision of content. Although the description of a particular place can be accurately given based on its general and objective characteristics, its importance and the interest level for a particular individual are certainly directly connected to the user's personal characteristics (such as personal preferences and personality profile). We refer to the different 'views' that distinct users have of the same location/place with the notion of 'sense of place'.


This work primarily contributes to task T3.4 about defining novel policies for intelligent acquisition and geo-spatial information retrieval in a content-centric scenario, extending to a physical pervasive environment and human mobility. In particular this task will explore the building of "personal profiles" based on the behaviour of individuals in terms of their mobility and visited places, and the use of local ubiquitous information sources associated with the elements of the service/activity ontology that characterise the type of geographic place. It also links to notions such as the 'level of belief' in terms of the thoughts, ideas, and representations that we have for locations and places and what our role and interaction with them is. As such this work relates to all the cognitive principles listed in this section, namely personality and disposition, sense of place, and social connectivity.

Location based social networking services are a good source of data and test beds for this kind of research, since they allow monitoring of the behaviour of individuals based on the places visited and so can reveal more inherent user characteristics such as personal interests and the frequency and regularity of their mobility pattern. We note that location based online and mobile services generate a very long tail of content that requires significant filtering and aggregation to be of personal use. We seek to achieve this by creating tag clouds, initially in a generic framework that can subsequently be personalised.

Location based services such as Foursquare provide content by allowing users to leave comments in the form of tips or short reviews about visited places. This can be aggregated and personalised in many different ways, and the current state of the art only adopts a range of limited mechanisms to solve this problem. We challenge this by examining the long tail of content in a different manner. We seek to consider the content from the perspective of its local distinctiveness and also potentially from the perspective of the individual, to bring out the 'personality of a venue' (i.e., a location or place). Here the underlying assumption is that the resulting characteristics of a venue could, in a second analysis, have a direct connection or relevance to the personality characteristics of the users visiting them. Thus in effect we are providing a new mechanism for filtering and highlighting the relevance (or otherwise) of content.


Research has commenced that seeks to achieve this by analysing a number of sources that provide venue reviews, from which tags (or keywords) are specially extracted to reveal a 'personality profile of a venue'. Tags are related to some basic features of a venue, such as its type and category (e.g. pub, coffee shop, restaurant); more importantly they may reveal other distinctive characteristics (some coffee shops may be preferred because they serve particular kinds of dietary food, some others for the services provided (e.g. Wi-Fi), and some venues may be liked because of their particular atmosphere and the social environment surrounding them). The main steps of the proposed procedure are shown in Figure 4 and are described in more detail in the following sections.

Figure 4. The main steps of the tag-clouds based place representation procedure

Generation of content from on-line reviews (Place Data)

The first step simply consists in producing one text document that aggregates the content of online reviews for each given venue. Content concerning a specific venue is identified by its geo-coordinates and by its name. Since these can differ across different sources, 'entity resolution' techniques are applied. These involve the use of simple metrics for detecting string similarity, such as the Levenshtein distance, but more complex measures can be used for future enhancements. Sources currently used in prototype form are: Foursquare, Google Places, Yelp, Qype and also Yell.com and the venue website when these are available and found by the search procedure. Finally, this procedure can be integrated into existing geographical software such as the MetaGazetter application (MG) described in D1.3 Section 2.4. Although not strictly necessary, this could improve efficiency in the retrieval of the place data that is the object of this initial phase.
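A minimal sketch of this matching step is shown below, assuming records of the form (name, latitude, longitude); the Levenshtein implementation is standard, while the name-similarity and coordinate thresholds are illustrative rather than the values used in the prototype:

```python
# Sketch of venue entity resolution: two records from different sources are
# treated as the same venue when their names are close under normalised
# Levenshtein distance and their geo-coordinates are near each other.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]


def same_venue(rec_a, rec_b, name_threshold=0.8, max_offset_deg=0.001):
    """rec = (name, lat, lon); returns True if the records likely match."""
    name_a, lat_a, lon_a = rec_a
    name_b, lat_b, lon_b = rec_b
    a, b = name_a.lower().strip(), name_b.lower().strip()
    similarity = 1 - levenshtein(a, b) / max(len(a), len(b), 1)
    close = (abs(lat_a - lat_b) < max_offset_deg
             and abs(lon_a - lon_b) < max_offset_deg)
    return similarity >= name_threshold and close


# Illustrative call with made-up coordinates.
print(same_venue(("A Shot in the Dark", 51.4929, -3.1690),
                 ("Shot in the Dark", 51.4930, -3.1691)))
```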

Keyword extraction and tag-cloud formation (Keyword Extraction)

The keyword extraction process is based on TF-IDF (see Jones (1972)) and receives two inputs for each given venue considered:

• A text document aggregating a number of on-line reviews, as described in the previous step.

• A 'corpus', or set of documents, used for comparison, which can be created either by globally combining all the available reviews or by filtering by type, location, or the social and personal characteristics of the end user. The underlying idea is that the use of different corpora emphasises those keywords that are more characteristic of a particular venue. A specific tag that represents a venue but is not included with high frequency in the corpus is interpreted as revealing more about that specific venue (and what makes it different from others of a similar type), and vice versa.

In order to proceed with the keyword extraction process, the text related to a single venue needs to be compared to a corpus of text documents that is constructed, in the most general form, by simply aggregating the text documents related to each individual venue available (global corpus). Alternatives can be obtained by filtering the global corpus by:

o Category: type/category of venues (e.g. coffee shops, pubs, eatery clubs).

o Locality: venues included in a given local area (e.g. included in a circle of a fixed radius centred on a specific location). These can be further combined with the filtering by venue type/category.

o Trace: only including venues contained in the history of venues visited by a given user (e.g. as retrieved from his/her Foursquare check-in history).

o Social: including venues somehow connected to the social community of the given user (e.g. venues visited by the user and his/her friends).

o Reading lists: a different corpus can be created by considering the reading lists of the authenticated user (the one currently logged in using the app) by combining text extracted from the bookmarked articles. We have currently used the web services Instapaper and Readability. This is believed to embed personality characteristics of the user. To be effective, the corpus may need to be pre-filtered by the type of the specific venue for which the tag-cloud is produced.


A following task is related to the 'cleaning' of the corpus text by removing stop-words such as prepositions and articles, non-English words, special characters, words used only once, etc. The current implementation for the generation of a final tag-cloud is completed using string manipulation and NLP techniques implemented in the Natural Language Toolkit (NLTK, http://nltk.org/). Keywords can then be ranked for every place, according to the selected subset of the corpus, using 'term frequency–inverse document frequency' (TF-IDF), with a possible split between adjectives and nouns provided by the NLTK part-of-speech tagger. This provides every tag with a score according to its relevance to the content in the corpus.

Figure 5 gives a screenshot of the tag cloud building process for 'A shot in the dark' (a coffee shop in Cardiff), based on the 'global' corpus described above, and an overview of the architectural components. We can observe that, as well as tags representing food and beverage (as expected, e.g. coffee, latte), other tags represent more the general atmosphere and 'mood' of the place (e.g. chilled, comfy, relaxed). TF-IDF weights are visualised in the tag cloud by font size variation.

Figure 5: building the tag cloud for the Cardiff venue ‘A shot in the dark’
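The core of the ranking step can be sketched as follows; the review texts are placeholders, and the NLTK cleaning and part-of-speech tagging used in the prototype are omitted for brevity:

```python
# Simplified sketch of the keyword-extraction step: each venue's aggregated
# review text is scored against a corpus of other venues with TF-IDF, and the
# top-weighted terms become that venue's tag cloud.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = {
    "a_shot_in_the_dark": "chilled comfy cafe great coffee relaxed latte friendly wifi",
    "generic_pub": "beer pub friendly football pint crowded",
    "generic_coffee_shop": "coffee espresso quick takeaway busy",
}

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus.values())
terms = vectorizer.get_feature_names_out()

venue_index = list(corpus).index("a_shot_in_the_dark")
row = tfidf[venue_index].toarray().ravel()
tag_cloud = sorted(zip(terms, row), key=lambda kv: kv[1], reverse=True)[:5]
print(tag_cloud)  # top terms with their TF-IDF weights (drives font size)
```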

Work concerning the validation of the keyword representation of a place is currently ongoing and organised into two main areas that investigate the objective and subjective similarity between venues and an analysis of all the characteristics that define the perception of a place/location for a specific user (in D1.3 we referred to the possible different 'views' that distinct users may have of a specific location/place with the notion of 'sense of place'). Using the tag cloud representation of places we can compute the similarity between them in order to provide:

• a deeper knowledge of urban aggregates that can be used to cluster geographical areas and sub-areas into neighbourhoods sharing common aspects and characteristics.

• a personalised representation of the most valuable tags for a given user in a given local area. This can be integrated with information that profiles the users based both on their objective characteristics (e.g. demographic details) and personal preferences, as well as the application of techniques for the prediction of the next place to be visited by the given user (based, for example, on the regularity metrics introduced in Section 2.1.1).

Place similarity can then be computed using simple metrics such as the Jaccard index and cosine similarity (in which the weighting of the individual keywords can be that provided by TF-IDF) and the application of methodologies for comparing ranked items, such as the discounted cumulative gain and the Spearman rank-order correlation. This provides us with a suite of functionality to link content embedded in the Internet with relevant spatial locations.

Adaptation to Personality (Place Personality)

The construction and provision of a personalised representation for a specific location/venue also relates to the notion that the personality and disposition of users may govern the generic locations and content that are of most importance. To assess this we consider the relationship between place, location and their inferred personality. From a venue point of view, the links between words related to personality traits and venue tag clouds can be used to define and link content to the novel concept of the 'personality of venues'. The cognitive basis for this was discussed in D1.3, and in this deliverable we are interested in the preliminary mapping of these concepts into ICT.

One simple approach to achieve this is by semantically navigating the original sets of terms, as the basis of the so-called lexical approach (see Goldberg, 1982; De Raad, 2000), from which we can attempt to make inferences about a location from the types of words that occur in content produced by people to give recommendations or otherwise. This is in contrast with the questionnaire-centred trait approach (McCrae & Costa, 1992), from which personality dispositions of individuals are used to make inferences about the characteristics of a place for location-based services. The lexical approach will result, for each personality dimension, in a 'bag of words' representing personality characteristics which can be linked, with a certain degree of similarity, to the keywords contained in the tag clouds representing a specific venue. As a result we can compute and visualise, for a specific venue/location, the equivalent of a human personality score for each of the dimensions of the Big5 personality traits theory. An example of this is given in Figure 6 for the Cardiff venue 'A Shot in the Dark'. Terms representing personality dimensions are matched with the keywords in the tag cloud, thus producing a Big5 personality 'profile' for the given venue by aggregating the scores for each single dimension. These dimensions are often represented with the acronym OCEAN: Openness (inventive/curious vs. consistent/cautious), Conscientiousness (efficient/organized vs. easy-going/careless), Extraversion (outgoing/energetic vs. solitary/reserved), Agreeableness (friendly/compassionate vs. cold/unkind) and Neuroticism (sensitive/nervous vs. secure/confident). For a complete overview of the use of personality questionnaires see John et al. (2008).

Figure 6: Big5 representation of the Cardiff venue ‘A shot in the dark’
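The following sketch illustrates the lexical matching idea with abbreviated, hypothetical trait word lists and a simple normalisation; as noted later in this section, the appropriate scaling is still under investigation, so the scoring rule here is purely illustrative:

```python
# Illustrative mapping from a venue tag cloud to a Big5 'place personality'
# profile: each trait has a (hypothetical, abbreviated) bag of words, and a
# venue scores on a trait according to the total TF-IDF weight of matching
# tags. Real word lists would come from the lexical personality literature.
TRAIT_WORDS = {
    "openness": {"inventive", "curious", "quirky", "artistic"},
    "conscientiousness": {"organized", "efficient", "tidy", "prompt"},
    "extraversion": {"lively", "outgoing", "energetic", "busy"},
    "agreeableness": {"friendly", "warm", "welcoming", "relaxed"},
    "neuroticism": {"stressful", "nervous", "cramped", "noisy"},
}


def place_personality(tag_cloud):
    """tag_cloud: dict of tag -> TF-IDF weight; returns a score per trait."""
    total = sum(tag_cloud.values()) or 1.0
    return {
        trait: sum(w for tag, w in tag_cloud.items() if tag in words) / total
        for trait, words in TRAIT_WORDS.items()
    }


cloud = {"coffee": 0.8, "relaxed": 0.6, "friendly": 0.5, "busy": 0.3, "comfy": 0.4}
print(place_personality(cloud))  # agreeableness scores highest here
```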


We note that text analysis software (e.g. Linguistic Inquiry and Word Count, LIWC, http://www.liwc.net/) can also potentially be used to indirectly link tag clouds to personality dimensions, as can results obtained from dedicated applications (see deliverable D4.1). Figure 7 describes the complete prototype architectural components that can lead to a software application that will visualise, for a user, a number of recommended venues according to his or her characteristics and personality (a profiling of the user's personal characteristics and the use of procedures such as the prediction of the next place based on their mobility behaviour can be integrated for this purpose).

Besides the 'place-based' components described above, the correlation between the personality characteristics of people visiting specific places can also be considered from a user's point of view. Such an estimation of personality can potentially be derived from a range of sources including:

o text analysis using existing tools/frameworks (e.g. LIWC).

o self assessment (quick).

o personality tests.

o mobile user behaviour analysis (e.g. through mobile applications such as funf for Android platforms, http://funf.org/).

In this work we favour the use of personality tests as the original and most direct means of establishing personality profiling. This work is currently ongoing and will be described under the next activities related to this work-package. A synthetic representation of the potential system is shown in Figure 8, in which a given user can select from a number of venues with specific characteristics (visualised in the tag-cloud) associated with different 'OCEAN' personality scores.


Figure 7: From keyword extraction to place recommendation: main architectural components

Figure 8: Visualising venues for personal recommendations


We conclude this section with a brief summary of the work undertaken and the progress in this work-package (WP3) at this stage, and towards the next (WP4), which validates the methodologies proposed in WP3.

Content generation, keyword extraction and tag cloud formation:

o Reviews have been retrieved (using the APIs provided by different on-line sources) to constitute a text document for every venue.

o Corpora of venues have been constituted from individual venue review documents, organised by venue type or proximity.

o Keyword extraction has been done for venues, using either word occurrence counts or TF-IDF weighting.

o Visualisation has been achieved as tag clouds, either in a semi-automated way, using Wordle (http://www.wordle.net/), or automatically by customising the Python package networkx.

o Tag cloud comparison has been carried out by applying simple similarity metrics such as the Jaccard and cosine similarity indexes.

Future plans for this work package and its upcoming validation in WP4 include:

o Extensive simulations that compare tag clouds either representing different places or the same location but built using different corpora, for example taking into account all venues in the proximity of a given one or only those belonging to the same type (e.g. pubs, coffee shops). More complex measures of similarity can also be used for this purpose.

o Conducting a local participation experiment that validates the findings above.

o Integration with the MG application for faster retrieval of on-line content (possible).

Development of place personality

Once the keyword extraction has been completed, the definition of the 'personality' of a place requires the following:


o A list of terms representing Big5 traits has been extracted from the relevant literature.

o The list of terms representing Big5 traits extracted from the literature has been matched to words extracted from venues. The scaling required, given the total number of words and matches, to accurately reflect the personality of the venue is still being investigated.

Ongoing and future activities include:

o Visualisation of the above concepts through a software application (see Figure 8 for a prototype representation aiming to provide personal recommendations based on 'place personality').

o Preliminary experiments using machine learning have been done to predict the most likely next visited place according to the time of day for a user. This can be integrated in the final recommendation system.

o Validation of the 'place personality' concept through a local participation survey. This can consist of a dedicated survey in which users are asked to describe their own perception of the 'personal characteristics' of a venue (such as mood, atmosphere, etc.) by choosing from a given list of adjectives showing different correlations with the five personality trait dimensions.

2.3 Decision making for resource selection

In D1.3 Section 2.5.1 we consider the impact of cognitive heuristics/biases in human behaviour, and especially the actual decision-making process, on the performance of certain ICT systems/applications. The prime assumption, as described in D1.3, is that users run a service resource selection task acting as selfish agents within autonomic networking environments. In practice, they make their decisions drawing on (e.g., consuming) content, which may be released centrally or opportunistically collected by and distributed among the interested parties. Hence, such information constitutes a particular type of content that assists the discovery and use of the resources over which the users compete.

In this section we apply this general concept to a particular case study: parking search assistance, whereby the drivers are the selfish agents and the public and private parking spots are the resources. Here the local content relating to parking directly informs individual behaviour. The aim of the study is the investigation of heuristic/bounded-rational decision-making within this particular competitive environment and its comparison with the ideal normative/fully rational reasoning. Rather than proposing a new system or simplifying an existing search algorithm drawing on insights from cognitive science, we assess how cognitive heuristics/biases affect the efficiency of this real-life application, yet without assessing the exact relevance of the heuristics/biases in the particular context. Overall, this work primarily fulfils task T3.2 about the conditions on decision-making under which particular content should be replicated or directed towards particular network regions and, in a subsidiary way, task T3.4 about the provision of location-based services. It essentially relates to the decision-making and sense of place principles. Although these are essentially abstract concepts, their application relates to different content-centric scenarios and applications (linking primarily to the consumption but also the dissemination of content).

In this study, we consider an urban environment with low-cost but scarce on-street parking spots and more expensive private parking lots with capacity that suffices to serve all parking requests. Drivers choose either to compete for the cheaper but risky parking solution or head for the more secure, yet expensive, option, drawing (or not) on information available about the competition level (i.e., demand), the parking capacity (i.e., supply) and the employed pricing policy. This content might be available through inter-vehicle interaction or broadcast from the parking service operators (through some parking assistance system). In both cases, the information might vary in terms of quantity and accuracy.

Within this setting, where the availability of perfect and real-time information about dynamic characteristics of the environment is a clearly unrealistic assumption, we iterate on several expressions of bounded rationality in decision-making. Note that this is an umbrella term for several deviations from the fully rational decision-making paradigm: incomplete information; time, computational and processing constraints; and cognitive biases in assessing/comparing alternatives. Experimental work shows that, in practice, people exhibit such bounded rationality symptoms and rely on simple rules of thumb (heuristic cues) to reach their decisions in various occasions and tasks.


Overall, we have identified the following instances of bounded rationality as worth exploring and assessing in the context of our assisted parking search service:

• Incomplete information about the demand – The most apparent deviation from the perfect-information norm relates to the amount of information driver nodes have at their disposal. As two distinct variations here, we consider probabilistic (stochastic) information and full uncertainty.



• The four-fold pattern of risk attitudes – Experimental data show that human decisions exhibit biases of different kinds when comparing alternatives. For instance, a huge volume of experimental evidence confirms the fourfold pattern of risk attitudes, namely, people's tendency to be risk-averse for alternatives that bring gains and risk-prone for alternatives that bring losses when these alternatives occur with high probability, and the opposite risk attitudes for alternatives of low probability (Tversky and Kahneman, 1992).



• Own-payoff effects – This is another type of bias that was spotted in the context of experimentation with even simple two-person games, such as the generalized matching pennies game. Theoretically, in these matching pennies games, a change in a player's own payoff that comes with a particular strategy/choice should not affect that player's choice probability. However, people's interest in a particular strategy/choice is shown to increase as the corresponding payoff gets higher. This behaviour makes choice probabilities range continuously between 0 and 1 rather than jump from 0 to 1 as soon as the corresponding choice gives the highest payoff. This bias lends further support to Simon's early arguments (Simon 1955, 1956) that humans are satisficers rather than maximizers, i.e., that they are more likely to select better choices than worse choices, in terms of the utility that comes with them, but do not necessarily succeed in selecting the very best choice.



• Fixed-distance heuristic – Hutchinson et al. (2011) list and discuss a number of simple heuristic approaches for parking search, albeit in the simple context of a long dead-end street, with two one-directional lanes leading to and away from a destination and a parking strip between the two lanes. One of the simplest examples is the "fixed-distance" heuristic, which ignores all spaces until the car reaches D places from the destination and then takes the first vacancy. Overall,


all these heuristics rely on related rules for search that have been suggested in other domains (e.g., psychology, economics) and on criteria that have been identified as important for drivers, such as the parking fee, parking time limits, distance from the driver's travel destination, accessibility and security level (Van der Goot, 1982; Golias et al., 2002).

Opposite to these bounded-rationality instances stands the ideal reference model of the perfectly or fully rational decision-maker. In this case, the main assumption is that the decision-maker can possess all relevant information, analyse all possible combinations of actions she and the other users can take, assess the costs/gains of each possible outcome, and strategically make the choice that minimizes her own cost. It is notable that the provision of sufficient local content for fully rational decision-making is likely not to be cost-effective in terms of resources and control mechanisms. In (Kokolaki et al., 2012, see Appendix D) we have analysed the scenario of fully rational decision-making by formulating the assisted parking search task as an instance of resource selection games, namely the parking spot selection game, drawing on classical Game Theory. In particular, this study addresses the strategic parking game variant whereby a parking assistance service announces information of perfect accuracy about parking demand (number of drivers interested in parking), supply (number of parking spots) and pricing policy. The drivers draw on this information to choose whether to compete for the cheaper but scarce on-street parking spots, running the risk of failing to get a spot and having to take the more expensive alternative a posteriori, this time suffering the additional cruising cost of the failed attempt in terms of time, fuel consumption (and stress), or to head directly for the private parking lot. We derive the equilibrium behaviours of the drivers and compare the costs paid at the equilibrium against those induced by the ideal centralized system that optimally assigns parking spots and minimizes the social cost. We quantify the efficiency of the service using the Price of Anarchy (PoA) metric, computed as the ratio of the two costs (i.e., equilibrium cost over optimal cost). In general, the PoA deviates from one, implying that, at the equilibrium, the number of driver nodes choosing to compete for the on-street parking spots exceeds their supply. The PoA can be reduced by properly manipulating the price differentials between on-street and private parking and the location of the private parking facilities. Notably,


our results are in line with earlier findings about congestion pricing, obtained in a work with different scope and modelling approach (Larson and Sasanuma, 2010). The results of this study will serve as a benchmark for assessing the impact of different cognitive biases on the efficiency of the parking search process (preliminary results on this will be shown in deliverable D1.4 of WP4).
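To make the PoA computation above concrete, the sketch below evaluates a deliberately simplified version of the parking spot selection game; it is not the exact model of (Kokolaki et al., 2012). All parameter names (c_street, c_private, delta) and the simple equilibrium rule (drivers keep joining the competition while its expected cost does not exceed the private-lot fee) are assumptions made for illustration only.

```python
# Illustrative sketch (not the exact model of Kokolaki et al., 2012): N drivers
# choose between competing for R cheap on-street spots (fee c_street) or going
# directly to a private lot (fee c_private). A driver who competes and fails
# pays c_private plus an extra cruising penalty delta.

def social_cost(k, N, R, c_street, c_private, delta):
    """Total cost when k of the N drivers choose to compete for on-street spots."""
    if k <= R:
        return k * c_street + (N - k) * c_private
    return R * c_street + (k - R) * (c_private + delta) + (N - k) * c_private

def expected_cost_of_competing(k, R, c_street, c_private, delta):
    """Expected cost of one competing driver when k drivers compete in total."""
    p_win = min(1.0, R / k)
    return p_win * c_street + (1 - p_win) * (c_private + delta)

def price_of_anarchy(N, R, c_street, c_private, delta):
    # Optimal assignment: fill the cheap spots exactly, nobody cruises in vain.
    opt = social_cost(min(N, R), N, R, c_street, c_private, delta)
    # Simple equilibrium notion: drivers keep joining the competition as long as
    # the expected cost of competing does not exceed the private-lot fee.
    k_eq = 0
    for k in range(1, N + 1):
        if expected_cost_of_competing(k, R, c_street, c_private, delta) <= c_private:
            k_eq = k
    eq = social_cost(k_eq, N, R, c_street, c_private, delta)
    return eq / opt

if __name__ == "__main__":
    # Hypothetical numbers: 100 drivers, 40 on-street spots, fees 2 vs 10, penalty 3.
    print(price_of_anarchy(N=100, R=40, c_street=2.0, c_private=10.0, delta=3.0))
```

With these hypothetical numbers the ratio exceeds one, reflecting the over-competition effect described above; raising c_street or lowering c_private shrinks the gap, mirroring the price-differential argument in the text.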

3 Content dissemination

Cognitive functionalities relevant to the acquisition and dissemination of knowledge relate to a number of principles for self-awareness. In fact, dissemination protocols assume the exchange of knowledge about topological information and personal preferences, as well as social links, which can be exploited to better disseminate content. However, with reference to the classification proposed in D1.3, this section primarily focuses on the following principles:

• Memory and Learning. Memory is a fundamental component of any dissemination mechanism, needed to store information about node characteristics (e.g. role and position in the network) and about the content itself. Learning mechanisms are involved in updating and adapting to network and content modifications.



• Decision making. Reasoning and decision making is the main principle governing the processes involved in any dissemination scenario (see also Section 2.3). In particular we will focus on heuristic solutions that use and exploit the often partial and incomplete knowledge available.

Focusing particularly on the reasoning phase, we propose novel heuristic-inspired protocols for the dissemination of content in mobile opportunistic environments.

3.1 Content dissemination using the "recognition" heuristic

In Section 7.1 of D2.1 we described a fully-fledged node architecture supporting content dissemination in mobile opportunistic networks, which exploits information exchanges between the mobile devices carried by users to build context awareness and to optimize the content-dissemination decision process. Typical information that can be exchanged includes: which data channels (i.e. high-level topics) a user is interested in, which data items are stored in the local caches of meeting devices, which data


items have been most recently shared, etc. Note that the acquisition of information about content characteristics in relation to those of the users (e.g., preferences, personality and behaviour) is a common objective shared with other elements of our research (see, for example, Section 2.2.1). It is intuitive to recognize that, in a large-scale content-centric network, a massive amount of information will be produced and exchanged, and the content-dissemination system will be faced with the very challenging task of determining the relevance of discovered content and selecting the most interesting data for the different users.

Our work in this deliverable specifically fulfils task T3.2 about the definition of data dissemination and replication mechanisms. These are set in a user-node-centric approach ensuring that content generated by users and collected in the environment is disseminated and replicated in the network efficiently and appropriately. The primary cognitive principles related to this task include all the basic functionalities of memory, learning and decision making, whereas personal properties such as channel preferences are also assumed to be exchanged in the network. The approach proposed here is to mimic the "recognition" cognitive heuristic (Goldstein and Gigerenzer, 2002) outlined in Deliverable D1.2, for providing mobile devices with autonomic decision-making capabilities (a comparison with alternative approaches will be the object of the 'validation' stage reported in the following work-package four, see D1.4). Specifically, the recognition heuristic is based on a very simple rule: when comparing the value of a pair of objects with respect to a given evaluation criterion, if one is recognized (i.e. the brain is able to recall that it already "heard" about that object) while the other is not, this heuristic infers that the recognized object has the higher relative value. In other words, the fact that an object is more easily recognizable than another object is used as a surrogate of its value. Thus, the brain does not need complete information on the objects and the environment (which would be needed for computing their value with the original evaluation criterion); it is sufficient to define a criterion to establish whether one object is more recognizable than another. Intuitively, the recognition heuristic is more effective when the recognition of objects is highly correlated with the original evaluation criterion. In the following we will overview how these concepts and rules can be embedded in a content dissemination system for mobile opportunistic networks.
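As a minimal illustration of the rule just described, the sketch below infers the higher-valued of two objects from recognition alone and defers to a fallback when recognition does not discriminate. The recognized set and the fallback function are assumptions of this example, not part of the architecture described in D2.1.

```python
import random

# Minimal sketch of the recognition heuristic: when exactly one of two objects
# is recognized, infer that it has the higher value; otherwise defer to a
# fallback rule. The `recognized` set and the fallback are illustrative only.

def recognition_choice(a, b, recognized, fallback):
    """Return the object inferred to have the higher value."""
    if a in recognized and b not in recognized:
        return a
    if b in recognized and a not in recognized:
        return b
    # Rule does not discriminate (both or neither recognized): use the fallback.
    return fallback(a, b)

# Example: pick by recognition, break ties at random.
recognized = {"channel-news", "channel-sport"}
print(recognition_choice("channel-news", "channel-gardening", recognized,
                         fallback=lambda a, b: random.choice((a, b))))
```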


First of all we assume that the mobile devices carried by the users have limited memory storage, which is organized into caches. One cache is dedicated to storing data objects that are generated locally or that belong to the channels the user is interested in and are obtained through encounters with other peer devices. Another cache, called the opportunistic cache, is dedicated to storing objects also obtained through encounters with other peer devices but belonging to channels the node is not interested in. In a sense, the aggregation of the opportunistic caches deployed at devices close to each other forms a distributed storage used to improve the efficiency of the content dissemination process. Since opportunistic caches have limited size and can hold only a small fraction of the total number of data objects in the network, it is of paramount importance to decide which objects are most "useful" for a collaborative information dissemination process, and which should be replaced upon an encounter. To this end, the proposed algorithm introduces two recognition thresholds to assess the popularity of channels and data objects. The more nodes subscribed to a channel are encountered, the more recognized that channel is; therefore, the circulation of data objects belonging to that channel should be favoured. Similarly, the fewer times a data object is encountered in the opportunistic caches of other nodes, the less recognized that object is; thus, that object should be replicated more broadly to increase its diffusion. On the basis of these considerations, a node decides to fetch a data item from the encountered node if: a) it recognizes the channel the data item belongs to, and b) it does not recognize the data item itself. Finally, among data items with the same recognition value, new ones are considered more relevant than old ones. Note that we assume that the opportunistic cache has a limited size but that each node has sufficient memory capacity to store the IDs of all encountered data objects, along with a counter recording how many times each data object has been found in the opportunistic caches of encountered nodes.

In D2.1 we presented the main features of the dissemination algorithm, together with preliminary simulation results, aimed more at broadly validating the general idea than at providing a precise assessment of the performance of the data dissemination scheme. Such a simulation-based performance evaluation will be carried out in the framework of WP4, and a first set of significant results is already presented in D4.1. In this deliverable, on the other hand, we refresh the description of the data dissemination algorithm, and then mainly focus on an analytical model that describes the transient


and stationary behaviour of the data dissemination algorithm based on the recognition heuristic. This model allows us to derive analytical expressions that describe the properties of the data dissemination scheme as a function of key system and environmental parameters. For example, as described in the following, we are able to describe analytically how the diffusion of data to interested users evolves over time. The analysis presented here allows us to model the data dissemination scheme, and is thus a natural complement to its design and definition carried out in WP3. It is also complementary to the simulation-based evaluation presented in D4.1. The analytical expressions presented in this deliverable allow us to describe with formulas the dependence between the key system/environmental parameters and the evolution of the data dissemination process. As is often the case in analysis, this comes at the cost of abstracting some details of the behaviour of the algorithm, which is needed in order to make the analysis tractable. The simulations presented in D4.1 are obtained by avoiding such abstractions, as the developed simulator reproduces the exact behaviour of the algorithm. As of now, simulation results are used to (i) validate the analytical results presented in this deliverable and (ii) assess the sensitivity of the algorithm to a number of parameters over a wide range of possible values.

In the following we briefly sketch the proposed modelling approach. The key point of this model is to describe the status of the various caches deployed at each node of the network with distinct Markov chains. Specifically, for each data object a Markov chain describes (i) the evolution of the recognition level of that data object and (ii) whether the data object is stored in the opportunistic cache and, if so, in which position. In other words, the opportunistic cache can be seen as a queuing network, where each sub-queue models the number of data objects in the opportunistic cache with a given recognition level. Note that the Markov property of the system is due to the use of a random waypoint mobility model, which generates exponentially distributed inter-contact times. Finally, the status of the system is described at a sequence of time instants that correspond to encounters between two devices. At each such instant, the status of the Markov chain is recomputed based on the replication level of each data object (i.e., the fraction of nodes storing a copy of that data item) and its average recognition level. After updating the system status, the new estimate of the replication level is obtained. Note that different initial conditions (e.g., different

initial distributions of data objects in the local caches of the nodes) may lead to quite different system evolutions.

The accuracy of the proposed model has been validated using the same simulation environment adopted for the preliminary assessment of the performance of the proposed content dissemination mechanism. To demonstrate that the proposed model is sufficiently accurate to capture both the transient and the steady-state behaviour of key performance indexes, in Figure 4 we show the data replication level for a network of 45 nodes, with 3 different channels of 99 data items each, and a channel popularity following a Zipf distribution with parameter 1. Regarding the mobility model, the node speed is uniformly sampled in the range [1, 1.86] m/s and nodes move in a square area of side 1 km according to the Random Waypoint model. Finally, the shown results refer to a scenario where the recognition thresholds for all channels and data items are set to 5, the opportunistic cache of each node can store 3 data objects, and the data items are distributed uniformly over the nodes.

[Figure 4 – Replication level (hit ratio over time) for recognition thresholds equal to 5 and an opportunistic cache size of 3 data objects. The sub-plots compare the analytical model (MOD.) and the simulation results (SIM.) for the most popular, mid popular and least popular channels.]

For the sake of clarity, the subplots in the bottom row directly compare the analytical and simulation results for the same channel, with the most popular channel in the rightmost sub-plot and the least popular one in the leftmost sub-plot. Important observations can be derived from the shown results. First of all, the replication levels

show clear peaks. In addition, the peak is reached earlier for channels with high popularity than for channels with low popularity. This is due to the fact that the most popular channel and its data items get recognized before all the other channels; thus, the data dissemination process for the most popular channel can start earlier than for the other channels. However, after this peak the replication levels of all the channels decrease. This is a typical behaviour of the recognition heuristic and of the replacement policy used for the opportunistic cache. Indeed, the more diffused a data object is (i.e., the more recognized), the less relevant it is for the dissemination process. This implies that less diffused objects can increase their replication level more easily than data objects of popular channels. This also explains why the hit ratio of the least popular channel is comparable to the hit ratio of the most popular channel. The second interesting observation is that the replication levels converge to steady values that are almost the same for all the channels. Typically, this happens when all the data objects have reached the maximum recognition level. In this case, all data objects are considered equivalent for the dissemination process and they are equally distributed in the opportunistic caches. Finally, it is important to point out that our model is remarkably accurate in predicting this stationary behaviour of the data-object replication levels. Furthermore, our model is able to predict the times at which the replication levels reach their maximum and minimum values. A complete description of the analytical model and an extended set of results are available in Appendix E.
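As a minimal sketch of the fetch and replacement rule described earlier in this section, the code below keeps per-channel and per-object recognition counters and applies the two thresholds on encounters. The class and parameter names, the default threshold values, the tie-breaking by creation time and the use of a single opportunistic cache (ignoring the separate cache for subscribed channels) are assumptions of this sketch, not the exact implementation evaluated in D4.1.

```python
from collections import defaultdict

class RecognitionCache:
    """Sketch of the recognition-heuristic fetch/replacement rule (illustrative)."""

    def __init__(self, capacity, channel_threshold=5, object_threshold=5):
        self.capacity = capacity
        self.channel_threshold = channel_threshold
        self.object_threshold = object_threshold
        self.channel_seen = defaultdict(int)   # encounters with subscribers of a channel
        self.object_seen = defaultdict(int)    # sightings of an object in peers' caches
        self.store = {}                        # object_id -> (channel, creation_time)

    def on_encounter(self, peer_channels, peer_cache):
        """Update recognition counters and opportunistically fetch items."""
        for channel in peer_channels:
            self.channel_seen[channel] += 1
        for obj_id, (channel, created) in peer_cache.items():
            self.object_seen[obj_id] += 1
            if self._should_fetch(obj_id, channel):
                self._insert(obj_id, channel, created)

    def _should_fetch(self, obj_id, channel):
        # Fetch if the channel is recognized but the object itself is not.
        channel_recognized = self.channel_seen[channel] >= self.channel_threshold
        object_recognized = self.object_seen[obj_id] >= self.object_threshold
        return channel_recognized and not object_recognized and obj_id not in self.store

    def _insert(self, obj_id, channel, created):
        if len(self.store) >= self.capacity:
            # Evict the most recognized object; among equally recognized ones, the oldest.
            victim = max(self.store,
                         key=lambda o: (self.object_seen[o], -self.store[o][1]))
            del self.store[victim]
        self.store[obj_id] = (channel, created)
```

In this sketch the eviction rule directly mirrors the text: heavily recognized (i.e., widely diffused) objects are the first to leave the opportunistic cache, which is what produces the peak-and-decay behaviour of the replication levels discussed above.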



4 Content management

This section presents a number of new methodologies for the effective placement and management of content, as well as novel privacy and security schemas. In particular we propose novel protocols for usage control, which refers to the control of data after its publication; a new algorithm for the placement and replication of active data; and heuristic-based methodologies to test decision-making scenarios in competitive environments. These are supported by the following self-aware principles:

• Memory and Learning. As fundamental principles for the storage and update of the content carried in the network.



• Decision making. As the main principle governing the reasoning phase involved in any dissemination scenario.



• Social connectivity. As the exploitation of social structures to reinforce trust and security schemas.



• Sense of place. As a component of spatial awareness providing relevant topological information for determining the centrality of each network node for the effective placement of content.

4.1 Usage control mechanisms

We consider a content management scenario whereby users would like to efficiently store and share their content while still defining rules or obligations on the access to or the usage of that content. Beyond basic exposures that are partially covered by classical security mechanisms such as data confidentiality, authentication and access control, new security and privacy requirements arise due to the sheer volume of data exchange and the span of dissemination enabled by content management services. Existing privacy controls based on access control techniques do not prevent massive dissemination of private data by malevolent acquaintances in the social network, unauthorized duplication of files, or the persistence of some files in third-party operated storage beyond their deletion by their owners. As a result of such exposures, users lose control over their data. Giving users control over their data and over the way they are disseminated, unfortunately, cannot be achieved by means of classical access control mechanisms. Access control can achieve perfect control over the identity of the parties authorized to access the data and the circumstances of the

access operation pertaining to time and content, but it does not allow for any control over the way these parties make further use of the data. Such a comprehensive control spanning the entire lifetime of each data segment can actually be assured through a security service called usage control. Even though a generic usage control solution fitting all possible settings seems infeasible, in a confined environment with a well-defined set of subjects, resources and operations, usage control can be achieved: the impact of parties that leave the system and violate some of the rules would be negligible. Following this specific approach, we propose two usage control enforcement mechanisms whereby the confined environment is defined as a decentralized social network on the one hand and a Peer-to-Peer network on the other hand.

This research contributes to task T3.5 in its entirety, concerning trust and security management mechanisms and schemas through self-awareness. In particular, it will investigate the usage of and interaction with self-aware and 'active metadata', thus linking to task T3.3 about self-aware data (i.e. how awareness about the environment can be embedded in the content itself), and to task T3.1 about collaborative strategies in distributed structures (such as p2p platforms). This is supported by the cognitive principle of decision making as well as those concerning memory and learning and social connectivity.

The first mechanism that we propose considers a decentralized online social network, such as Safebook7, as the previously defined confined environment and targets a specific content management application, namely picture sharing. Current picture sharing tools in online social networks allow users to upload any picture. Access rules to these pictures are defined by the owner of the picture, that is, the one who uploads it. This user also has the ability to tag some users, which basically informs other users about their presence in the picture. We assume that each person whose face appears in any picture should decide whether her face in that picture should be disclosed or not. The proposed usage control mechanism is enforced thanks to the cooperation among multiple social network users: the idea is to exploit the distributed nature of Safebook and to leverage real-life

7    http://www.safebook.us/home.html  


social links to control the access to pictures. The enforcement of the control over the pictures is assured thanks to a dedicated multi-hop data forwarding protocol. Before reaching its final destination, content has to follow a dedicated path through a sufficient number of intermediate nodes, which automatically obfuscate the picture and follow the rules defined in the corresponding usage control policy. The length of this path depends on the ratio of malicious nodes in the system: we assume that the cooperation of at least t nodes would guarantee the correct execution of any operation. Thanks to the underlying privacy-preserving multi-hop forwarding protocol, cleartext pictures will be accessible to authorized users only. The proposed solution is further evaluated with respect to different security attacks, such as the unauthorized picture broadcast or the forwarding of cleartext pictures. The impact of such attacks is evaluated based on existing social graphs. More details about this research are provided in Appendix F.

Additionally, we propose a preliminary secure content management solution that is based on a peer-to-peer architecture whereby a randomly selected set of peers assures usage control enforcement for each data segment. Usage control is achieved based on the assumption that at least t out of any set of n randomly chosen peers will not behave maliciously. Such a system would still suffer from re-injection attacks, whereby attackers can gain ownership of data, and of the usage policy thereof, by simply re-storing the data after a slight modification of the content. In order to cope with re-injection attacks, the scheme relies on a similarity detection mechanism based on special hash functions. The robustness of the scheme is being evaluated in an experimental setting using a variety of re-injection attacks.
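The following back-of-the-envelope sketch illustrates how the forwarding-path length could be dimensioned from the assumed ratio of malicious nodes so that at least t cooperating (honest) nodes are present with high probability. The independence assumption, the target failure probability and all parameter names are illustrative assumptions, not part of the protocol specification in Appendix F.

```python
from math import comb

# If each intermediate node on the forwarding path is malicious independently
# with probability p, choose the shortest path length L such that the
# probability of having fewer than t honest (cooperating) nodes stays below a
# target failure probability eps. This is a rough dimensioning exercise only.

def prob_fewer_than_t_honest(length, t, p_malicious):
    """P[#honest < t] when each of `length` nodes is honest with prob. 1 - p_malicious."""
    q = 1.0 - p_malicious
    return sum(comb(length, k) * q**k * (1 - q)**(length - k) for k in range(t))

def required_path_length(t, p_malicious, eps=1e-3, max_len=200):
    for length in range(t, max_len + 1):
        if prob_fewer_than_t_honest(length, t, p_malicious) <= eps:
            return length
    raise ValueError("no path length up to max_len meets the target")

if __name__ == "__main__":
    # Hypothetical setting: require t = 3 honest hops with 20% malicious nodes.
    print(required_path_length(t=3, p_malicious=0.2))
```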

4.2 Distributed placement of autonomic internet services

We also consider the exploitation of local knowledge in content placement and forwarding strategies through the definition and use of centrality metrics and decentralized strategies for content migration, as well as mechanisms that assess the local utility of each encountered user-node under incomplete knowledge.

This research primarily addresses task T3.2 about content placement strategies, assessing the relevance of storing content in particular places and of replicating it at a certain level. In addition, it also relates to task T3.3 about self-aware content at node level (content carrying itself relevant topological information) and task T3.1, which specifically targets the definition of storage management policies in order to both


optimize the storage usage at individual nodes and jointly manage the storage resources available on different nodes. This work is supported by the cognitive principles of decision-making, memory and learning, and sense of place.

The service/content placement problem as described in D2.1 calls for distributed and low-complexity schemes that efficiently accommodate the ever-expanding (user-generated) content and services across the Internet. Approximations are necessary, even at the expense of non-guaranteed optimality, as a fully rational (i.e., exhaustive) approach to determining the optimal solution is clearly not feasible. We are concerned with the so-far dominant strategy of service migration that is practically employed to meet the above requirements. Our heuristic method constitutes a user-centric approach to data distribution, whereby nodes iteratively push data to a better-placed host. In each iteration, incomplete (i.e., local) information is utilized to decide on the selection of a small subset of nodes with respect to the situated context; some highly central (significant) nodes are singled out on the criterion of the amount of content/service demand they serve. A small-scale optimization problem is solved over these nodes to steer the service migration towards prominent locations. As a first step, we devise a theoretical algorithm, called cDSMA, which eventually decomposes the global 1-median problem into a series of significantly smaller local optimization problems and yet maintains close-to-optimal performance. However, these results are obtained under the assumption that the relevant topological (sense of place) and demand information (personal preferences) for determining the node centrality values is fully available to the network nodes.

To build a practical protocol around cDSMA, we then need to relax the assumption of ideal information availability; sense of place then involves node centrality approximations. We have elaborated on the design and evaluation of a practical real-world cDSMA implementation that needs to cope with two main challenges. Firstly, information residing with individual network nodes, such as the personal preferences of a user-node, has to be collected at the node currently hosting the service; this periodical communication constitutes the learning module of the placement task and is expected to help establish some critical body of knowledge at each node. Secondly, even if this information is compiled by each node in a distributed manner, it appears to require global information about the network topology. We address these


challenges by exploiting environmental information that can be locally inferred at each node. By directly measuring the transit traffic load that is destined for the current service host, we obtain a set of values that approximate our original centrality cues. How accurately the measured values match the theoretical conditional centrality values depends on the network topology and the routing protocol. We study two scenarios: (a) the topology gives rise to a single shortest path between all network node pairs and the routing protocol routes traffic over this single shortest path; (b) the topology induces multiple shortest paths between all network node pairs and the routing protocol splits traffic demand equally among all of them. In our implementation, the criterion for selecting nodes is the measurement values that are communicated to the current host via dedicated messages recording the IDs of the intermediate visited nodes. The nodes reporting high measurement values are selected by the host to induce a subgraph over which the small-scale optimization takes place. We have designed two implementations, one for each of the routing strategies, that help the service host parse the incoming messages, i.e., conduct filtering of the inputs based on knowledge about the relevance of the information. Thus the host extracts a partial yet accurate topological view of the optimization subgraph and furthermore determines the weight of each selected node in the current solution. Our results show that the proposed implementation achieves very good accuracy compared to the theoretical values, robustness against different demand distributions and excellent scalability properties. A few tens of nodes suffice to obtain effective placement even when the size of the considered physical topologies scales up to five or seven hundred nodes. Our results can serve as the basis for the evaluation of heuristic schemes related to the placement task. Relevant data dissemination protocols that rely on spatial awareness and seek to provide a real-world solution can use the proposed scheme as a performance reference. Details about the proposed methodology are available in Appendix G.
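The sketch below illustrates one migration step in the spirit of cDSMA: the current host ranks nodes by the demand it observes for them, keeps a small candidate subset, evaluates a 1-median objective restricted to that subset, and moves the service to the winner. For simplicity it uses full shortest-path distances (whereas the actual protocol works with locally inferred approximations), and the subset size, uniform demand and networkx-based toy topology are assumptions made for illustration only.

```python
import networkx as nx

def migration_step(graph, host, demand, subset_size=10):
    """Return the next host chosen by a small-scale 1-median over high-demand nodes."""
    # Keep the current host plus the nodes with the highest observed demand.
    ranked = sorted(demand, key=demand.get, reverse=True)
    candidates = set(ranked[:subset_size]) | {host}
    best_host, best_cost = host, float("inf")
    for cand in candidates:
        # Shortest-path distances from the candidate host to every demand node.
        dist = nx.single_source_shortest_path_length(graph, cand)
        cost = sum(w * dist.get(node, len(graph)) for node, w in demand.items())
        if cost < best_cost:
            best_host, best_cost = cand, cost
    return best_host

if __name__ == "__main__":
    g = nx.erdos_renyi_graph(50, 0.1, seed=1)   # toy topology (illustrative)
    demand = {n: 1.0 for n in g.nodes}          # uniform demand, for illustration
    print(migration_step(g, host=0, demand=demand))
```

Iterating this step moves the service towards nodes that sit close (in hop count) to the bulk of the demand, which is the intuition behind steering migration through a few highly central nodes.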

4.3 Assessing the effectiveness of centrality cues coming from bounded (social) connectivity views

Regarding the effectiveness of employing locally determined centrality cues in network-wide content-centric operations, we have extended our previous work on the corresponding correlation study, i.e., between the local and global counterparts of the well-known betweenness index. As obtaining the precise global view is not feasible, the notion

of sense of place is again captured by the individual estimations of one's social standing within the network. We have recognized different levels at which an individual may perceive her social connectivity and thus build up a knowledge base of corresponding centrality cues. These can be practically reflected by a broad set of locally determined indices, ranging from the immediately available node degree to a number of ego-centric betweenness centrality variants of increasing complexity and information availability. We have conducted an experimental study (see Appendix G) to shed some light on the effectiveness of using those local indices against the ideal scenario of deriving centrality values under full knowledge availability. Positive rank correlation between the studied counterparts is observed, in particular, in the real-world network topologies of hundreds of nodes we have experimented with. Moreover, the corresponding cost to compute each of those indices has been shown to be negligible. Practical protocol implementations for content management can therefore benefit from these easily available, rank-preserving local metrics. A concrete example involves informed decisions on which network nodes are appropriate to (cache and) efficiently provide content in information-centric networking schemes (Chai, 2012).
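A minimal sketch of such a correlation check is given below: it compares the ranking induced by a 1-hop ego-network betweenness (one of several possible local variants) with the ranking induced by the full betweenness index, using Spearman's rank correlation. The graph model, sample size and choice of local variant are assumptions for illustration and may differ from those studied in Appendix G.

```python
import networkx as nx
from scipy.stats import spearmanr

def ego_betweenness(graph, node):
    """Betweenness of `node` computed only inside its 1-hop ego network."""
    ego = nx.ego_graph(graph, node)
    return nx.betweenness_centrality(ego)[node]

def rank_correlation(graph):
    nodes = list(graph.nodes)
    global_bc = nx.betweenness_centrality(graph)            # full-knowledge index
    local_bc = {n: ego_betweenness(graph, n) for n in nodes} # bounded-view index
    rho, _ = spearmanr([global_bc[n] for n in nodes],
                       [local_bc[n] for n in nodes])
    return rho

if __name__ == "__main__":
    g = nx.barabasi_albert_graph(300, 3, seed=42)   # toy scale-free topology
    print(rank_correlation(g))                      # typically strongly positive
```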



5 Conclusion

This document addresses the main objectives of the third work-package of the project (WP3) by presenting a number of intermediate algorithms and methods related to the consumption, dissemination, and management of content. These three content-centric scenarios have been previously identified (see Deliverable D2.1) as representing ways in which users, the network, and content can become more self-aware in real-world settings by implementing the functionalities derived from the study and analysis of the relevant cognitive areas of psychology introduced in the first work-package. The methodologies presented here (to be used in the subsequent development) achieve this by applying the generic principles and concepts for self-awareness described in D1.3; for each of the content scenarios, and for the particular contributions within them, we have identified the specific principles that most relate to them.

In particular, we have considered new algorithms and metrics to detect regularity in human mobility patterns; an implementation of a methodology to produce a tag-cloud representation of places and venues (both related to the consumption of content); a study that profiles users based on their influence and personal characteristics in on-line social networks; heuristic- and economics-based methods to test decision-making scenarios in competitive environments; heuristic-based methods for the dissemination and placement of Internet content; and two novel mechanisms for usage control for data safety and security. The resulting procedures and methodologies will be applied and evaluated in large-scale simulations, which constitute the objective of the following and final technical work-package of the project (WP4).



References

A. Boutet, H. Kim and E. Yoneki, "What's in Your Tweets? I Know Who You Supported in the UK 2010 General Election", International AAAI Conference on Weblogs and Social Media (ICWSM), Dublin, Ireland, 2012a.

A. Boutet, H. Kim and E. Yoneki, "What's in Your Tweets? I Know What Parties are Popular and Who You are Supporting Now!", IEEE/ACM International Conference on Social Networks Analysis and Mining (ASONAM), Istanbul, Turkey, 2012b.

R. Bruno, M. Conti, M. Mordacchini and A. Passarella, "An Analytical Model for Content Dissemination in Opportunistic Networks using Cognitive Heuristics", ACM MSWiM, 2012.

W.K. Chai, D. He, I. Psaras and G. Pavlou, "Cache Less for More in Information-Centric Networks", Proc. of the 11th IFIP Networking Conference, Prague, Czech Republic, May 2012.

P.T. Costa, Jr. and R.R. McCrae, "Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) manual", Odessa, FL: Psychological Assessment Resources, 1992.

L.A. Cutillo, R. Molva and M. Önen, "Privacy Preserving Picture Sharing: Enforcing Usage Control in Distributed On-Line Social Networks", Proc. of EUROSYS 2012, 5th ACM Workshop on Social Network Systems, Bern, Switzerland, April 2012.

B. De Raad, "The big five personality factors: The psycholexical approach to personality", Göttingen: Hogrefe, 2000.

L.R. Goldberg, "From Ace to Zombie: Some explorations in the language of personality", in C.D. Spielberger and J.N. Butcher (Eds.), Advances in Personality Assessment, Vol. 1, Hillsdale, NJ: Erlbaum, 1982.

D.G. Goldstein and G. Gigerenzer, "Models of Ecological Rationality: The Recognition Heuristic", Psychological Review, Vol. 109, No. 1, pp. 75–90, 2002.

J. Golias, G. Yannis and M. Harvatis, "Off-street parking choice sensitivity", Transportation Planning and Technology, Vol. 25, No. 4, pp. 333–348, 2002.

J.M.C. Hutchinson, C. Fanselow and P.M. Todd, "Car Parking as a Game Between Simple Heuristics", in P.M. Todd, G. Gigerenzer and the ABC Research Group (Eds.), Ecological Rationality: Intelligence in the World, New York: Oxford University Press, 2011.

O.P. John, L.P. Naumann and C.J. Soto, "Paradigm Shift to the Integrative Big-Five Trait Taxonomy: History, Measurement, and Conceptual Issues", in O.P. John, R.W. Robins and L.A. Pervin (Eds.), Handbook of Personality: Theory and Research, pp. 114–158, New York, NY: Guilford Press, 2008.

K.S. Jones, "A statistical interpretation of term specificity and its application in retrieval", Journal of Documentation, Vol. 28, No. 1, pp. 11–21, 1972.

H. Kim and E. Yoneki, "Influential Neighbours Selection for Information Diffusion in Online Social Networks", IEEE International Conference on Computer Communication Networks (ICCCN), Munich, Germany, June 2012.

E. Kokolaki, M. Karaliopoulos and I. Stavrakakis, "On the efficiency of the parking assistance service: A game-theoretic analysis", submitted, 2012.

T. Kreuz, D. Chicharro, R.G. Andrzejak, J.S. Haas and H.D.I. Abarbanel, "Measuring multiple spike train synchrony", Journal of Neuroscience Methods, Vol. 183, No. 2, pp. 287–299, 2009.

R.C. Larson and K. Sasanuma, "Congestion pricing: A parking queue model", Journal of Industrial and Systems Engineering, Vol. 4, No. 1, pp. 1–17, 2010.

R. McKelvey and T. Palfrey, "Quantal Response Equilibria for Normal Form Games", Games and Economic Behavior, Vol. 10, pp. 6–38, 1995.

D.L. Nelson, C.L. McEvoy and S. Dennis, "What is free association and what does it measure?", Memory & Cognition, Vol. 28, pp. 887–899, 2000.

D.L. Nelson, C.L. McEvoy and T.A. Schreiber, "The University of South Florida free association, rhyme, and word fragment norms", Behavior Research Methods, Instruments, & Computers, Vol. 36, pp. 402–407, 2004.

C. Peng, X. Jin, K.C. Wong, M. Shi and P. Liò, "Collective Human Mobility Pattern from Taxi Trips in Urban Area", PLoS ONE, Vol. 7, No. 4, 2012.

R. Haldar and D. Mukhopadhyay, "Levenshtein Distance Technique in Dictionary Lookup Methods: An Improved Approach", arXiv preprint arXiv:1101.1232, 2011.

R. Rosenthal, "A Bounded-Rationality Approach to the Study of Noncooperative Games", International Journal of Game Theory, Vol. 18, pp. 273–292, 1989.

H.A. Simon, "A behavioral model of rational choice", Quarterly Journal of Economics, Vol. 69, No. 1, pp. 99–118, 1955.

H.A. Simon, "Rational choice and the structure of the environment", Psychological Review, Vol. 63, No. 2, pp. 129–138, 1956.

A. Tversky and D. Kahneman, "Advances in prospect theory: Cumulative representation of uncertainty", Journal of Risk and Uncertainty, Vol. 5, No. 4, pp. 297–323, 1992.

D. Van der Goot, "A model to describe the choice of parking places", Transportation Research Part A: General, Vol. 16, No. 2, pp. 109–115, 1982.

M.J. Williams, R.M. Whitaker and S.M. Allen, "Measuring Individual Regularity in Human Visiting Patterns", SOCIALCOM 2012.



Appendix A – [Measuring Individual Regularity in Human Visiting Patterns]

This appendix contains the reprints of the following papers:

M. J. Williams, R. M. Whitaker and S. M. Allen, "Measuring Individual Regularity in Human Visiting Patterns", SOCIALCOM 2012.

C. Peng, X. Jin, K.C. Wong, M. Shi and P. Liò, "Collective Human Mobility Pattern from Taxi Trips in Urban Area", PLoS ONE, Vol. 7, No. 4, 2012.


Measuring Individual Regularity in Human Visiting Patterns

Matthew J. Williams, Roger M. Whitaker, and Stuart M. Allen
Cardiff School of Computer Science & Informatics, Cardiff University
Queen's Buildings, 5 The Parade, Cardiff CF24 3AA, UK
Email: {m.j.williams, r.m.whitaker, stuart.m.allen}@cs.cardiff.ac.uk

Abstract—The ability to quantify the level of regularity in an individual’s patterns of visiting a particular location provides valuable context in many areas, such as urban planning, reality mining, and opportunistic networks. However, in many cases, visit data is only available as zero-duration events, precluding the application of methods that require continuous, denselysampled data. To address this, our approach in this paper takes inspiration from an established body of research in the neural coding community that deals with the similar problem of finding patterns in event-based data. We adapt a neural synchrony measure to develop a method of quantifying the regularity of an individual’s visits to a location, where regularity is defined as the level of similarity in weekly visiting patterns. We apply this method to study regularity in three real-world datasets; specifically, a metropolitan transport system, a university campus, and an online location-sharing service. Among our findings we identify a core group of individuals in each dataset that visited at least one location with near-perfect regularity.

I. I NTRODUCTION The popularity of devices capable of tracking where individuals have visited (such as GPS-enabled mobile phones) offers both opportunities in providing location-aware commercial services to users and research opportunities in measuring and understanding human mobility behaviour. Furthering our understanding of human visiting patterns is important in diverse areas such as urban planning [1], recommender systems [2], opportunistic networks [3], and limiting the spread of biological and computer viruses [4]. It is difficult to study human mobility without considering its temporal nature. It has been shown that both the ordering of visits and the timing of visits [5] contains information that can be used to build powerful predictors of future behaviour. Furthermore, human behaviour is driven by daily and weekly routine [6], [7]. Although this form of temporal structure is a rich source of information about individual behaviour, there has been little work to examine regularity in individual visiting patterns. Factors such as wealth, profession, lifestyle, and health affect an individual’s routine, and therefore his or her mobility patterns. This is likely to give rise to diversity in the population’s visiting patterns and regularity. Indeed, diversity has been found to be fundamental to human behaviour, both within the same population and among different populations, even having an evolutionary component [8]. Diversity in visiting regularity may also exist among locations, with some places, such as workplaces, having a natural predisposition for routine.

While collective analysis of behaviour (i.e., focusing on aggregate statistics of large populations of individuals) reveals periodic temporal behaviour [7], [9], it is important to also consider the individual scale (e.g., [10]), focusing on the patterns of individuals from which the collective properties emerge. It is at the individual scale that context-aware computing, user profiling, and personalised recommendations are performed. However, analysis at this scale is more challenging as the data are more sparse and the effects of unpredictable changes in behaviour are more prominent. These effects are smoothed at the collective scale due to the aggregation of many different, but weakly correlated, patterns. In many real-world systems the visits of users are reduced to instantaneous events, with information about the duration of a stay either unrecorded or ignored. Despite this loss of information, it is still valuable to analyse patterns of visits in these systems. Examples of systems that capture event-based visits include ‘checkins’ to venues in social networks and location sharing services (for example, Facebook, Foursquare, and Google Latitude), geo-tagged user-contributed content (such as Twitter and Flickr), and electronic ticket payments in metropolitan transport systems (such as the London transport network). With these data there is no clear way to infer the staying time, but nevertheless we are still able to extract interesting patterns from arrival times alone. In this paper we present a simple and efficient method for measuring regularity in an individual’s visits to a location and use it to explore the presence of regularity and routine in real-world data. We define regularity as a visiting pattern that is repeated with a reoccurring time frame (for example, on a week-by-week or day-by-day basis). User visit data such as this is very sparse and consequently challenging to effectively model. This sparsity makes it difficult to apply many established approaches for measuring regularity and periodicity, such as nonlinear time series analysis, harmonic analysis, and recurrence quantification analysis, as these are most effective for time series that are continuous and densely sampled. Although these approaches are unsuitable, in this paper we draw on the large body of relevant work in the neurophysiology community dealing with the problem of finding regularity in event-based data. The measure we present, named IVI-irregularity (intervisit interval irregularity), is adapted from a synchrony measure used in neural coding [11] (the branch of neurophysiology

concerned with the coding of information among the neurons in the brain). In the context of neural coding, neurophysiologists deal with ensembles of spike trains, where each train represents the instantaneous electrical pulses (or spikes) of a particular neuron. An ensemble of spike trains is said to exhibit high synchrony if the spikes in the trains occur at similar times. Spikes can be regarded as abstract, zero-duration events; in our case, spikes correspond to visits to a particular location. We use a spike train synchrony measure to quantify the dissimilarity in visits in different weeks; if visits in each week occur at very similar times, then dissimilarity is very low, and thus regularity is high. Throughout this paper we use the terms visit and inter-visit interval (IVI) rather than spike and inter-spike interval, as we are applying these techniques outside the context of neurophysiology. Using IVI-irregularity we seek to determine the prevalence of regular relationships between individuals and locations and the factors that influence the level of regularity. We study these questions using three empirical traces of human mobility, and find that a core subgroup of individuals in each dataset have a number of locations they visit with high regularity. For many applications it is useful to treat regular visits differently to erratic visits. Being aware of these characteristics of human mobility, and being able to effectively measure them, is valuable in many of the aforementioned scenarios.

The rest of this paper is structured as follows. The irregularity measure is formulated in Section II. In Section III the datasets used in the analysis are discussed. The analysis of regularity in these datasets is presented in Section IV. We discuss related work in Section V. Finally, in Section VI we conclude the paper with a summary of the contributions and opportunities for future work.

[Fig. 1. Example visit trains for a particular user and access point in the DARTMOUTH dataset. Window width ω = 7 days.]

II. MEASURING REGULARITY

We define regularity as repeated routine over time. For example, an individual visiting a location at very similar times each week is considered to have a highly regular pattern for that location. On the other hand, if the individual visits the location at very different times each week it is considered to be a very irregular pattern. Throughout this paper we use week-by-week comparison to determine regularity; however, in the following formulation we generalise this to any window size, denoted by ω.

The measure we introduce quantifies the level of irregularity in an individual's visits to a particular location in a given period of time. Let the chronology of an individual v's visits to a particular location l be denoted by the ordered sequence of times C_v,l = {t_i | i = 1, ..., L}, where L is the number of v's visits to l. These times are assumed to be offsets from some arbitrary origin, giving values t_i ∈ (0, T_max] for all i = 1, ..., L. The chronology is segmented into disjoint windows of duration ω to build N visit trains. The absolute times of visits are translated to offsets from the start time of their corresponding window; thus, each train has visit times in the interval (0, ω]. We assume T_max and ω are chosen such that ωN = T_max. We denote the number of visits in the nth train with L_n and the

sequence of visit times with {u_i^n | i = 1, ..., L_n}. An example of the visit trains for a chronology in the DARTMOUTH dataset (discussed in Section III) is shown in Figure 1.

Irregularity is quantified by applying the ISI-diversity [12] measure to the ensemble of N visit trains. The measure is computationally efficient, scaling linearly in both the number of visit trains N and the number of visits L. We begin by defining the inter-visit interval (IVI) as the time between two consecutive visits. The instantaneous inter-visit interval function I^n(u) gives the IVI for the nth train at time offset u; formally, we consider three cases:

\[
I^n(u) =
\begin{cases}
u_1^n, & \text{if } 0 < u \le u_1^n,\\
\min\{u_i^n \mid u_i^n \ge u\} - \max\{u_i^n \mid u_i^n < u\}, & \text{if } u_1^n < u \le u_{L_n}^n,\\
\omega - u_{L_n}^n, & \text{if } u_{L_n}^n < u \le \omega .
\end{cases}
\]

We define two further instantaneous measures. For time offset u, the instantaneous mean μ(u) and the instantaneous standard deviation σ(u) are given by

\[
\mu(u) = \frac{1}{N}\sum_{n=1}^{N} I^n(u),
\qquad
\sigma(u) = \left(\frac{1}{N-1}\sum_{n=1}^{N}\bigl(I^n(u) - \mu(u)\bigr)^2\right)^{1/2}.
\]

The coefficient of variation c_var(u) provides a measure of dispersion in the IVI values at time offset u,

\[
c_{\mathrm{var}}(u) = \frac{\sigma(u)}{\mu(u)}.
\]

The coefficient of variation is a unitless measure and normalised against the mean, which enables comparison between the dispersion in collections of large IVI values and collections of small IVI values. By integrating over time offset u we obtain a measure of overall dissimilarity D(C_v,l) in the ensemble of visit trains for chronology C_v,l; formally,

\[
D(C_{v,l}) = \frac{1}{\omega}\int_{0}^{\omega} c_{\mathrm{var}}(u)\, \mathrm{d}u .
\]

The resulting D(C_v,l) is a non-negative value, with D(C_v,l) = 0 indicating identical trains (or perfect regularity), and higher values indicating more irregularity in the visiting patterns. We refer to D(·) as the IVI-irregularity measure.
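The sketch below computes the measure defined above on a discrete grid of time offsets rather than by exact integration; the grid step (one hour) and the convention used for empty trains are assumptions of this sketch, not of the paper.

```python
# Sketch of the IVI-irregularity measure: the instantaneous IVI I^n(u) is
# evaluated on a grid of offsets and the coefficient of variation is averaged,
# approximating (1/omega) * integral of c_var(u) du.

def ivi_irregularity(visit_times, omega, num_windows, step=3600.0):
    """D(C) for visit times (seconds from the origin) split into windows of length omega."""
    trains = [[] for _ in range(num_windows)]
    for t in sorted(visit_times):
        n = min(int(t // omega), num_windows - 1)
        trains[n].append(t - n * omega)           # offset within the window

    def inst_ivi(train, u):
        if not train:
            return omega                          # convention: empty train -> maximal IVI
        if u <= train[0]:
            return train[0]
        if u > train[-1]:
            return omega - train[-1]
        nxt = min(x for x in train if x >= u)
        prev = max(x for x in train if x < u)
        return nxt - prev

    total, samples = 0.0, 0
    u = step
    while u <= omega:
        ivis = [inst_ivi(tr, u) for tr in trains]
        mean = sum(ivis) / len(ivis)
        var = sum((x - mean) ** 2 for x in ivis) / (len(ivis) - 1)
        if mean > 0:
            total += (var ** 0.5) / mean          # coefficient of variation at offset u
        samples += 1
        u += step
    return total / samples

# Example: visiting at 09:00 every day for four weeks gives near-perfect regularity (score ~0).
week = 7 * 24 * 3600.0
visits = [d * 24 * 3600.0 + 9 * 3600.0 for d in range(28)]
print(ivi_irregularity(visits, omega=week, num_windows=4))
```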

III. DATASETS

We use the IVI-irregularity measure to study regularity in visiting patterns in the following datasets.

Foursquare Checkins (FOURSQUARE): Foursquare is a popular location-based mobile social network. Foursquare users voluntarily 'check in' to venues using the Foursquare mobile application. In this way each user compiles a record of his or her visits. We collected a dataset of all checkins in three urban areas in the United Kingdom: Bristol, Cardiff, and Cambridge. These checkins were collected in 2011 [13].

Dartmouth Wireless LAN Access Point Logs (DARTMOUTH): Visits in this dataset are drawn from the use of wireless access points (APs) by staff and students at the Dartmouth College campus in the United States [14]. Over 450 APs placed across the 800 km2 of campus provide wireless coverage for most of the area, serving roughly 5,000 undergraduates and 1,200 faculty. The campus includes a variety of facilities, including residences, auditoriums, and social spaces. We use the AP movements trace of April 2003.

London Underground Journeys (UNDERGROUND): The London Underground is a metropolitan rapid-transit rail system serving most of Greater London in the United Kingdom. The Oyster automated fare collection system is used by many passengers, requiring each user to swipe his or her personal Oyster RFID card at the station of entry and the station of exit. This provides a record of the passengers' Underground station visits. We obtained an anonymised dataset of all Oyster card journeys over 28 days in March 2010 from Transport for London (TfL), the government body responsible for the service, to use in this paper.

While all three datasets capture visits of individuals to locations, they are drawn from different domains and circumstances, and represent different geographic scales (relevant differences are summarised in Table I). Of the three datasets, FOURSQUARE is unique in that its visits are self-reported by users and so visits may be liable to misreporting and underreporting; nevertheless, it is an interesting dataset as it is at urban scale and covers many venue types.

In the case of the DARTMOUTH dataset we carried out additional processing to prepare it for analysis. In particular, we discarded repeated visits by the same user to the same AP separated by a short interval (less than 15 minutes), as these are artefacts of the WLAN AP protocol. In addition, to filter out stationary devices (e.g., wireless-enabled desktop computers), we only included devices that visited at least five different APs.

Each dataset spans a period of four consecutive weeks. We note that the original data contained many individuals that visited certain locations very rarely or exclusively in a few of the four weeks. Chronologies such as these are not suitable for studying regularity, as their activity is too rare and too transient. We restrict the datasets to chronologies with at least two visits in each of the four weeks. The resulting datasets are summarised in Table I. This filtering culled 93% of the original DARTMOUTH and UNDERGROUND person-location pairs, indicating that, although the set of places a person has

TABLE I – Summary of datasets used in the analysis of regularity. Each dataset corresponds to a four-week period. M denotes the number of chronologies and ⟨L⟩ denotes the mean number of visits per chronology. A chronology C_v,l is only included in a dataset if v visited l at least twice in each of the four weeks.

                 FOURSQUARE                      DARTMOUTH       UNDERGROUND
Area(s)          Bristol, Cardiff, and Cambridge Dartmouth       London
Scale            Urban                           Campus          Metropolitan
Month            June                            April           March
Location type    Venue                           Access point    Metro station
Visit type       Checkin                         Association     Card swipe
Individuals      293                             1,681           1,167,363
Locations        336                             391             270
Visits           4,640                           229,300         58,945,475
M                401                             3,656           2,260,354
⟨L⟩              11.6                            62.7            26.1

visited at least once may be large, many of these places are only visited very occasionally. The number of chronologies for FOURSQUARE was reduced to 3% of the original, leaving a small sample of 401. The remaining chronologies in FOURSQUARE involve 4% of the users, a small proportion compared to 67% in DARTMOUTH and 23% in UNDERGROUND.

IV. VISITS AND REGULARITY IN REAL-WORLD MOBILITY TRACES

We divide our analysis into three areas of interest. We first consider the influence of the time of week on the inter-visit intervals of chronologies (Section IV-A). In Section IV-B we compare the datasets in terms of their irregularity. Finally, we consider how prevalent regular visiting patterns are among the individuals in each dataset (Section IV-C).

A. Inter-visit intervals and the time of week

As discussed in Section II, our approach focuses on the weekly patterns of inter-visit intervals (IVIs) for an individual's visits to a particular location. The IVIs themselves, along with their level of dispersion at a particular time-of-week, are an interesting property of human mobility and thus we consider them specifically in Figure 2. The figure shows how IVI dispersion (as quantified by the coefficient of variation c_var of a chronology at a given time-of-week) varies throughout the week. The small standard deviations in visit rates indicate that the volume of visits is very similar in each week. This contrasts with the ⟨c_var⟩ values, which have very high standard deviation. This highlights the person-specific nature of an individual's visiting patterns with a location; in other words, the visiting patterns (and therefore IVIs) of two different individuals visiting the same location can be very different.

In the UNDERGROUND dataset we observe that, on average, chronologies' IVIs are most dispersed between 10:00 and 16:00 on weekdays, and least dispersed during nighttime. This is because the relative effect of a discrepancy in visit times that are close together is greater than when the visit times are further apart. For example, the morning and afternoon

100

% of scores x

Foursquare

80 Foursquare Dartmouth Underground

60 40 20

Underground

Dartmouth

0

0.0

0.2

0.4 0.6 Irregularity score

0.8

Fig. 3. Cumulative distributions of IVI-irregularity scores (i.e., D(·) values) in each dataset. High D(·) indicates high irregularity. The mean IVIirregularity value hDi is 0.381 (± 0.131) for F OURSQUARE, 0.510 (± 0.185) for DARTMOUTH, and 0.373 (± 0.173) for U NDERGROUND.

DARTMOUTH dataset includes many types of visit (including social, residential, and academic), whereas U NDERGROUND is restricted to transportation. This late-evening visit activity is also the reason for the delayed dip in IVI dispersion, which does not decrease until 22:00 (compared to 16:00 in U N DERGROUND ). It is also worth noting that the DARTMOUTH decline in visit rate on the weekend is small. This is explained by a large number of students living on-campus, compared to a small proportion of students and staff who either live offcampus or spend the weekend elsewhere. B. Comparison of regularity between datasets

Fig. 2. Time-of-week means of visit rates and coefficients of variation (hcvar i) for each dataset. hcvar i is obtained by averaging over the cvar values in the corresponding two-hour time slot of all chronologies. A high hcvar i indicates that the instantaneous IVI values were, on average, more dispersed during that time of week.

commute on the same day are separated by roughly nine hours, whereas the time between the afternoon commute and the following day’s morning commute is roughly 15 hours. Therefore, minor discrepancies in the visits to a commuter’s stations will have a greater influence on the dispersion of daytime IVIs than nighttime IVIs. The same behaviour is responsible for the dip in IVI dispersion during the weekend. Many chronologies consist of predominantly weekday visits. The weekends for these chronologies will correspond to large IVIs spanning from Friday to Monday, and so the dispersion (cvar ) will be less during this period. When comparing DARTMOUTH and U NDERGROUND we note that DARTMOUTH’s weekday visit activity is sustained throughout the day and lasts longer into the evening, rarely declining before midnight. This reflects the fact that the
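The per-slot dispersion discussed above can be computed directly from a chronology's visit times. The sketch below is a minimal illustration under our own assumption that each IVI is attributed to the two-hour time-of-week slot of the visit that opens it; the exact instantaneous-IVI definition is given in Section II and is not reproduced here.

```python
# Illustrative computation of the per-slot coefficient of variation of
# inter-visit intervals (IVIs) for one chronology.
import numpy as np

SLOTS_PER_WEEK = 7 * 24 // 2  # 84 two-hour time-of-week slots

def slot_of(t):
    """Two-hour time-of-week slot (0..83) of a datetime t."""
    return (t.weekday() * 24 + t.hour) // 2

def cvar_by_slot(visit_times):
    """visit_times: chronologically sorted list of datetimes.
    Returns an array of c_var (std/mean of IVIs, in hours) per slot,
    NaN where a slot has fewer than two IVIs."""
    ivis = [[] for _ in range(SLOTS_PER_WEEK)]
    for a, b in zip(visit_times, visit_times[1:]):
        ivis[slot_of(a)].append((b - a).total_seconds() / 3600.0)
    out = np.full(SLOTS_PER_WEEK, np.nan)
    for s, vals in enumerate(ivis):
        if len(vals) >= 2 and np.mean(vals) > 0:
            out[s] = np.std(vals) / np.mean(vals)
    return out

# The dataset-level <c_var> curve is then the slot-wise mean over chronologies,
# e.g. np.nanmean(np.vstack([cvar_by_slot(c) for c in chronologies]), axis=0)
```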

B. Comparison of regularity between datasets

Given that the three datasets differ in scale, context, and time of year, we would expect differing visiting behaviours in each. Indeed, we have already discussed how the three datasets' time-of-week visit rates exhibit different patterns. The same is also true of the level of regularity present in each dataset, as shown in Figure 3. DARTMOUTH is distinct from the other two datasets, with the weight of its distribution shifted towards higher irregularity. This is reflected in the mean irregularity ⟨D⟩ (which we take over the available user-location chronologies), which is higher for DARTMOUTH (0.510) than for FOURSQUARE and UNDERGROUND (0.381 and 0.373, respectively). This suggests that the patterns of individuals visiting locations on Dartmouth campus tend to be more irregular. This is unlikely to be due to a sudden change in routine, as the duration of the dataset (April 2003) is a continuous period of term-time teaching, uninterrupted by holidays or exams. The small deviations in visit rates (see visit rate plots in Figure 2) also indicate that there was no overall change in visiting patterns between the weeks. An alternative reason for the increased irregularity may be the highly dynamic and spontaneous nature of student behaviour. This contrasts with Underground passengers and Foursquare users, whose student proportion is likely to be much smaller, consisting instead of a large population of individuals following less-flexible routines (for example, commuters).

Fig. 3. Cumulative distributions of IVI-irregularity scores (i.e., D(·) values) in each dataset. High D(·) indicates high irregularity. The mean IVI-irregularity value ⟨D⟩ is 0.381 (± 0.131) for FOURSQUARE, 0.510 (± 0.185) for DARTMOUTH, and 0.373 (± 0.173) for UNDERGROUND.

The finer-grained scale of the DARTMOUTH dataset may also contribute to the increased irregularity. The Dartmouth APs had an indoor range of around 40 to 100 metres, so most buildings required multiple APs to achieve good WLAN coverage. This means that users moving as little as a few tens of metres can register as having visited a new location. These short-distance movements are likely to be more unpredictable and driven less by routine than larger-distance movements, and thus result in higher irregularity in AP visits. We also note the similar mean irregularities of FOURSQUARE and UNDERGROUND chronologies, which may be attributed to both datasets being at a city-wide scale and consisting of a broad cross-section of people, as opposed to Dartmouth campus's predominantly student population.

C. Prevalence of regularity among individuals

We now study the extent to which an individual has regular relationships with the locations he or she visits. We begin by considering the overall number of locations individuals tend to visit, as shown in Figure 4a. In FOURSQUARE and DARTMOUTH the percentage of individuals decreases with the number of different locations, with DARTMOUTH users typically visiting a wider variety of locations. UNDERGROUND follows a similar pattern, except that its peak is at two locations rather than one, which is explained by the nature of Underground journeys. Individuals with only one location are due to the rare instances of a passenger either bypassing the exit turnstile or exiting from the entry station, and the minimum-visits filtering we discussed in Section III.

Using the IVI-irregularity D(C_{v,l}) of an individual v's visits to location l we can evaluate whether v's visits to l are regular or irregular. We set a threshold for irregularity, below which we regard v's visits to l as regular. In Figure 4b we plot the distribution of individuals by how many of the locations they visited were deemed regular in this way. We set a strict threshold of 0.2, as we wish to find the chronologies with near-perfect regularity. As shown in Figure 3, a minority of chronologies in each dataset are within this threshold (8.2% in FOURSQUARE, 4.4% in DARTMOUTH, and 17.4% in UNDERGROUND). Figure 4b shows how the set of highly regular chronologies is distributed among the individuals. 8% of Foursquare users and Dartmouth WLAN users had at least one location that they visited with high regularity. The percentage increases in the case of Underground passengers, with 21% of individuals having at least one regular location, likely due to the more-routine nature of travel. At stricter thresholds (i.e., thresholds closer to 0), the size of the core group of users with at least one regular venue decreases. The threshold at which the size of this group dropped to 1% of individuals was 0.009 for FOURSQUARE, 0.050 for DARTMOUTH, and 0.007 for UNDERGROUND.

We also consider whether there is any relationship between an Underground passenger's most-visited station and his or her most-irregular station. Most-visited stations are likely to be 'home' stations, which we expect to have irregular visiting patterns, since they represent a convolution of many different routines throughout the week. We consider the probability p(m) that, given an individual v who has visited m stations, the individual's most irregular station l (i.e., l such that D(C_{v,l}) is maximised) is also the station that v visited the most. We find that p(2) = 0.55, p(3) = 0.37, p(4) = 0.29, p(5) = 0.28, and p(6) = 0.28, indicating that the probability of these stations matching is slightly higher than chance. The deviation from chance becomes greater when individuals have four or more frequently visited stations. This deviation is more significant in DARTMOUTH, which has probabilities p(2) = 0.57, p(3) = 0.47, and p(4) = 0.43.
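Counting how many of an individual's locations fall under a given irregularity threshold is mechanical once the D(C_{v,l}) scores are available; the following small sketch uses hypothetical names and the 0.2 threshold from the text.

```python
# Given an irregularity score D for each (user, location) chronology,
# count each individual's "regular" locations under a threshold.
# `scores` maps (user, location) -> D value.
from collections import Counter

def regular_location_counts(scores, threshold=0.2):
    counts = Counter()
    for (user, _loc), d in scores.items():
        counts[user] += int(d <= threshold)
    return counts  # user -> number of regular locations

# Example: share of users with at least one regular location
# sum(1 for c in regular_location_counts(scores).values() if c >= 1) / n_users
```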

V. RELATED WORK

Relevant related work includes other approaches to quantifying patterns in human mobility. Information entropy has been used in [15] to quantify the predictability of mobile phone users' patterns of transition between home and work. The work we have presented attempts to go beyond only home and work, considering the many other locations a person visits. An interesting observation in [15] is that university students, especially those in their first year of study, have the highest entropy, and therefore are the least predictable. This agrees with our finding that DARTMOUTH individuals have higher irregularity. Song et al. [5] have made two key contributions relevant to the work in this paper. First, the authors investigate a different but related concept of regularity, which is defined by them as the probability that an individual is found at his or her most-visited location. They find that this property is tied to the time-of-week, as we also observed with the mean coefficient of variation (Section IV-A). As previously mentioned, we go beyond the individual's most-visited location and consider their relationships with other places. Second, the authors find that a significant amount of predictive information is encoded in the sequence and ordering of visits. In this paper we have focused on IVIs and their variation by time of week; patterns in the sequences of IVIs are an interesting direction for future work. We also note that the datasets in this paper are geographically fine-grained compared to the mobile phone records used in [5], which are at the granularity of cell towers. There has been research (such as [16]) in the opportunistic networking community that views human behaviour as events in a point process and leverages the corresponding literature; however, we have not found any work in the fields of human mobility or human encounters that utilises the methods of neural coding (in which neuronal spikes are often treated as point processes) as we have in this paper.

VI. CONCLUSIONS AND FUTURE WORK

In this paper we introduce a novel method for measuring regularity in an individual's visits to a particular location, adapted from the neural coding concept of synchrony. The method is computationally efficient, does not require binning, and is applicable even for low visit rates.

Fig. 4. Number of regular locations per individual compared to the overall number of locations per individual. (a) Distributions showing the number of locations for individuals used in our analyses; individuals exceeding nine locations are not plotted. (b) Distributions showing the number of regular locations (i.e., where D(C_{v,l}) ≤ 0.2) for individuals in each dataset.

Using this method we investigate the visiting patterns of individuals in three diverse datasets; specifically, a metropolitan transport system, a university campus, and an online location-sharing service. We find that campus visits are the most irregular, likely due to the flexible nature of student behaviour, and transport visits are the most regular, likely due to the significant commuter population. In all three datasets we find a core group of individuals that visit at least one location with near-perfect regularity. We also note a correlation between an individual's most-visited location (likely to be associated with their home) and irregularity.

This paper has focused on regularity from the perspective of the individual, but we can use the same approach to consider the location perspective. Future work will investigate how the type of a location (e.g., the Foursquare venue category) influences the regularity of users visiting it, and how this contributes to the overall mean irregularity. We can also consider the prevalence of regularity among locations. This has implications for retailers and shop owners, as it would allow them to distinguish regular visitors from irregular visitors. We also intend to extend this work from person-at-location regularity to person-to-person regularity.

ACKNOWLEDGEMENTS

We are grateful to Transport for London (TfL), and employees Mark Roberts, Andrew Gaitskell, and Duncan Horne, for providing the London Oyster data. We thank Dafydd Evans for insightful discussions. This research has been funded by RECOGNITION grant 257756, a European Commission FP7 FET project. Data processing was performed in part using the computational facilities of the Cardiff ARCCA division.

REFERENCES

[1] J. Cranshaw, R. Schwartz, J. Hong, and N. Sadeh, "The livehoods project: Utilizing social media to understand the dynamics of a city," in Proc. 6th Int. AAAI Conf. on Weblogs and Social Media, 2012.

[2] D. Quercia, N. Lathia, F. Calabrese, G. Di Lorenzo, and J. Crowcroft, "Recommending social events from mobile phone location data," in Proc. 10th IEEE Int. Conf. on Data Mining (ICDM), 2010, pp. 971–976.
[3] B. Han, P. Hui, V. Kumar, M. Marathe, J. Shao, and A. Srinivasan, "Mobile data offloading through opportunistic communications and social participation," IEEE Trans. on Mobile Computing, vol. 11, no. 5, pp. 821–834, 2012.
[4] P. Wang, M. C. González, C. A. Hidalgo, and A. Barabási, "Understanding the spreading patterns of mobile phone viruses," Science, vol. 324, no. 5930, pp. 1071–1076, 2009.
[5] C. Song, Z. Qu, N. Blumm, and A. Barabási, "Limits of predictability in human mobility," Science, vol. 327, no. 5968, pp. 1018–1021, 2010.
[6] M. Williams, R. Whitaker, and S. Allen, "Decentralised detection of periodic encounter communities in opportunistic networks," Ad Hoc Networks, 2011, in press.
[7] S. Scellato, M. Musolesi, C. Mascolo, and V. Latora, "On nonstationarity of human contact networks," in Proc. 2nd Workshop on Simplifying Complex Networks for Practitioners, Genoa, Italy, 2010.
[8] G. R. Brown, T. E. Dickins, R. Sear, and K. N. Laland, "Evolutionary accounts of human behavioural diversity," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 366, no. 1563, pp. 313–324, 2011.
[9] J. Candia, M. C. González, P. Wang, T. Schoenharl, G. Madey, and A. Barabási, "Uncovering individual and collective human dynamics from mobile phone records," Journal of Physics A: Mathematical and Theoretical, vol. 41, no. 22, p. 224015, 2008.
[10] S. Jiang, J. Ferreira, and M. González, "Clustering daily patterns of human activities in the city," Data Mining and Knowledge Discovery, pp. 1–33, 2012.
[11] E. N. Brown, R. E. Kass, and P. P. Mitra, "Multiple neural spike train data analysis: state-of-the-art and future challenges," Nature Neuroscience, vol. 7, no. 5, pp. 456–461, 2004.
[12] T. Kreuz, D. Chicharro, R. G. Andrzejak, J. S. Haas, and H. D. I. Abarbanel, "Measuring multiple spike train synchrony," Journal of Neuroscience Methods, vol. 183, no. 2, pp. 287–299, 2009.
[13] G. B. Colombo, M. J. Chorley, M. J. Williams, S. M. Allen, and R. M. Whitaker, "You are where you eat: Foursquare checkins as indicators of human mobility and behaviour," in Proc. IEEE PERCOM Workshops, 2012.
[14] T. Henderson, D. Kotz, and I. Abyzov, "The changing usage of a mature campus-wide wireless network," Computer Networks, vol. 52, no. 14, pp. 2690–2712, 2008.
[15] N. Eagle and A. Pentland, "Reality mining: sensing complex social systems," Personal and Ubiquitous Computing, vol. 10, no. 4, pp. 255–268, 2006.
[16] A. Chaintreau, P. Hui, J. Crowcroft, C. Diot, R. Gass, and J. Scott, "Impact of human mobility on opportunistic forwarding algorithms," IEEE Trans. on Mobile Computing, vol. 6, no. 6, pp. 606–620, 2007.

Collective Human Mobility Pattern from Taxi Trips in Urban Area

Chengbin Peng1,2, Xiaogang Jin2, Ka-Chun Wong1, Meixia Shi3, Pietro Liò4*

1 Mathematical and Computer Sciences and Engineering Division, King Abdullah University of Science and Technology, Jeddah, Kingdom of Saudi Arabia; 2 Institute of Artificial Intelligence, College of Computer Science, Zhejiang University, Hangzhou, China; 3 College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, China; 4 Computer Laboratory, Cambridge University, Cambridge, United Kingdom

Abstract

We analyze the passengers' traffic pattern for 1.58 million taxi trips in Shanghai, China. By employing non-negative matrix factorization and optimization methods, we find that people travel on workdays mainly for three purposes: commuting between home and workplace, traveling from workplace to workplace, and others such as leisure activities. Therefore, traffic flow in one area or between any pair of locations can be approximated by a linear combination of three basis flows, corresponding to the three purposes respectively. We name the coefficients in the linear combination traffic powers, each of which indicates the strength of one basis flow. The traffic powers on different days are typically different even for the same location, due to the uncertainty of human motion. Therefore, we provide a probability distribution function for the relative deviation of the traffic power. This distribution function is expressed in terms of a series of normalized binomial distributions. It can be well explained by statistical theories and is verified by empirical data. These findings are applicable in predicting road traffic, tracing the traffic pattern and diagnosing traffic-related abnormal events. These results can also be used to infer land uses of urban areas quite parsimoniously.

Citation: Peng C, Jin X, Wong K-C, Shi M, Liò P (2012) Collective Human Mobility Pattern from Taxi Trips in Urban Area. PLoS ONE 7(4): e34487. doi:10.1371/journal.pone.0034487

Editor: Matjaz Perc, University of Maribor, Slovenia

Received January 17, 2012; Accepted February 28, 2012; Published April 18, 2012

Copyright: © 2012 Peng et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: CP was supported by a Graduate Fellowship from King Abdullah University of Science and Technology. CP and XJ were supported by the National Science Foundation of China under Grant No. 61070069. PL was supported by the following project: RECOGNITION: Relevance and Cognition for Self-Awareness in a Content-Centric Internet (257756), which is funded by the European Commission within the 7th Framework Programme (FP7). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

Introduction

Urban traffic has drawn the attention of physicists for more than a decade. Generally, there have been two kinds of approaches to traffic analysis. In microscopic models, some researchers represent vehicles as particles interacting with each other [1,2], while others use the cellular automata framework [1,3,4]. Based on game theory, the impact of individuals' irregular behaviors on the traffic system has also been emphasized [5]. On the other hand, from the macroscopic perspective, the idea of fluid dynamics has been introduced [1,6]. In recent years, a new and more fundamental approach for traffic analysis is emerging: human mobility, which draws statistical inferences from enormous amounts of empirical data [7–9].

Several reasons boost the research in this area. Firstly, knowledge of the mobility pattern is essential in traffic modeling [10,11] for simulation, forecasting [12,13] and control [11]. In addition, by measuring the traffic flow during some time interval to see whether or not it agrees with the verified estimation, collective mobility analysis can serve as a tool for abnormality definition and detection [14,15]. Compared to computer-vision-based detection [16,17], collective-mobility-model-based abnormality detection can be applied at a much larger scale, for example, the whole city. Secondly, the mobility pattern and the consequential traffic flow can also interact with land use. The characteristics of traveling strongly influence urban formation, evolution, and future planning [18–21], whereas land use can also affect urban traffic [22–24] and human mobility [25]. Thirdly, a better understanding of human mobility can help to control the spreading of contagious diseases more easily by limiting the contact among individuals [26], since the movement of infected people from one place to another is an important way to infect the susceptible ones, either in a small-scale area [27,28] or from a worldwide viewpoint [29–31]. Similar theories hold for the contamination of wireless communication devices with malicious code [32,33].

Due to the high importance of human mobility research, and the availability of large amounts of empirical data as a consequence of the prevalence of wireless communication devices, researchers have become more and more interested in the statistical features of human mobility patterns extracted from real-world data [34]. Ref. [7] and Ref. [9] suggest that human travels are reminiscent of Lévy flights [35] according to the trajectories of bank notes and taxis respectively, while Ref. [36] reports some variances based on GPS information from volunteers. These differences were later recognized as a result of the periodic pattern of individuals' traveling [8], and recently Ref. [37] discovered that individual locations are predictable for up to 93% of the total time in their data set, which contains trajectories of mobile phone users. For taxi trips, Ref. [38] studies the distribution of travel distances and times.

Nevertheless, previous statistical inferences of human mobility mostly focus on the individual level, while this article analyzes citizens' collective dynamics in the urban area. In our research, based on traveling purposes, we discovered three distinct basis patterns for collective traffic flow regardless of the location. In addition, a distribution is revealed that can characterize the fluctuation of the traffic flow at any time in each location. As mentioned above, these findings can be useful for urban planning, traffic estimation and anomaly detection. Further studies on the interaction between different areas will provide a more detailed collective mobility model, and would additionally benefit research on epidemic spreading in the urban area.

Analysis

Data Description and Background Assumptions

In this research, the data [39] are collected from about two thousand taxis operating within the urban area of Shanghai, China. These data mainly focus on the central part of the city, and the population in this part is about seven million according to the fifth national population census [40]. The information about when and where passengers were picked up and dropped off can be retrieved from the raw data, and every pair of pick-up and drop-off records is defined as a taxi trip. The data set includes about 1.58 million taxi trips. The longitude and latitude location information obtained by GPS is converted to positions in a planar coordinate system, with the city landmark Oriental Pearl Tower as the origin. For ease of analysis and representation, the urban area is divided into squares, similar to a chessboard. The side length of each square is identically 200 meters. In our context, each location corresponds to one of these squares. More details can be found in Appendix S1.
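The gridding and per-square counting described above can be sketched in a few lines. The following is only an illustration under our own assumptions (trips given as records with planar pickup/dropoff coordinates in metres from the origin and datetime stamps); field and function names are ours, not the paper's.

```python
# Map planar coordinates to 200 m x 200 m grid squares and accumulate
# departures and arrivals per hour of day for each square.
import math
from collections import defaultdict

CELL = 200.0  # square side length in metres

def square_of(x_m, y_m):
    """Grid square index (i, j) of a point given in metres from the origin."""
    return (math.floor(y_m / CELL), math.floor(x_m / CELL))

def hourly_counts(trips):
    """trips: iterable of dicts with keys 'pick_x', 'pick_y', 'pick_t',
    'drop_x', 'drop_y', 'drop_t' (times as datetimes).
    Returns {(i, j): 24-element list of departure+arrival counts per hour}."""
    counts = defaultdict(lambda: [0] * 24)
    for tr in trips:
        counts[square_of(tr['pick_x'], tr['pick_y'])][tr['pick_t'].hour] += 1
        counts[square_of(tr['drop_x'], tr['drop_y'])][tr['drop_t'].hour] += 1
    return counts
```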

Basis Traffic Flows: the Constancy

As we know, even a 200 m × 200 m area in a city can contain land of several different types, for example schools, shops and apartments at the same time. In this section, we will discuss how to categorize the taxi trips according to traveling purposes, and then use these categories to infer the land-use composition of each square.

First of all, we consider the taxi trip categorization. People setting out from the same location may have different purposes: some may go to workplaces while others may go for entertainment. Meanwhile, for trips belonging to the same category but in different locations, the collective pattern should be similar with regard to the departure and arrival times, given a large amount of data. For example, if the number of trips between residential areas and workplaces (for commuting purposes) reaches its highest at 8:00 am (going to work) and 5:00 pm (getting off work), then the number of trips in this category in any place would peak at almost the same times, although the scale may differ. In short, we can define a set of basis collective patterns, each of which corresponds to a trip category. Linear combinations of these patterns can then describe the macro traveling pattern of each location. Finally, the coefficients in a linear combination can reflect the land uses of the location.

Directly from the taxi data, we can only calculate the macro patterns. Therefore, we should adopt appropriate inference methods to find the basis patterns and the coefficients for each location. To represent our method more formally, we define (i, j) to index the square in the i-th row and j-th column among all the squares into which the city is divided. If m is the number of rows and n is the number of columns of squares in the map, then i ∈ [1, m] ∩ Z and j ∈ [1, n] ∩ Z. Let h be the number of time slots, normally 24 for one day. Therefore, for location (i, j), the numbers of departure and arrival trips (the macro pattern) along time each day can be represented by a 1 × h vector S_{i,j}, which is easy to calculate. We can also define a set of 1 × h vectors containing normalized numbers of trips along time: B_1, B_2, B_3, ..., B_K, one for each basis pattern that we seek. The macro pattern is a linear combination of basis patterns, so we have

\[
S_{i,j} = P_{i,j} \begin{bmatrix} B_1 \\ B_2 \\ B_3 \\ \vdots \\ B_K \end{bmatrix} \tag{1}
\]

where P_{i,j} is a row vector containing the K coefficients of the linear combination on the right-hand side. By taking all the locations into account, it can also be written as

\[
\begin{bmatrix} S_{1,1} \\ S_{1,2} \\ \vdots \\ S_{1,n} \\ S_{2,1} \\ S_{2,2} \\ \vdots \\ S_{2,n} \\ \vdots \\ S_{m-1,n} \\ S_{m,1} \\ \vdots \\ S_{m,n} \end{bmatrix}
=
\begin{bmatrix} P_{1,1} \\ P_{1,2} \\ \vdots \\ P_{1,n} \\ P_{2,1} \\ P_{2,2} \\ \vdots \\ P_{2,n} \\ \vdots \\ P_{m-1,n} \\ P_{m,1} \\ \vdots \\ P_{m,n} \end{bmatrix}
\begin{bmatrix} B_1 \\ B_2 \\ B_3 \\ \vdots \\ B_K \end{bmatrix} \tag{2}
\]

and abbreviated as

\[
S = PB \tag{3}
\]

Because the two matrices on the right-hand side of Eq. (3) are unknown, there are many matrix decomposition methods that may apply. However, according to the physical meaning of P and B, all the entries of these two matrices should be nonnegative. Therefore, we choose nonnegative matrix factorization (NMF) [41,42] for the decomposition. In our context, it is a method to factorize a matrix S ∈ R_{+}^{mn×h} approximately into two nonnegative factors P ∈ R_{+}^{mn×K} and B ∈ R_{+}^{K×h}. By this approach, we can find the basis patterns (the row vectors of B) and the parameter vectors (the row vectors of P) simultaneously. As the vector P_{i,j} (the ((i−1)m+j)-th row of matrix P) is only responsible for the vector S_{i,j} (the ((i−1)m+j)-th row of matrix S), each element of P_{i,j} in fact denotes the scale of the traffic flow of the corresponding category in location (i, j). Hence, we also call these elements the traffic power, because they reflect how strong the traffic flows of the different categories are.
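Given the per-location hourly counts, the factorization S ≈ PB can be carried out with an off-the-shelf NMF implementation. The sketch below uses scikit-learn with our own choices (random initialisation, basis normalisation); the authors' actual implementation details are in their Appendix S2.

```python
# Minimal NMF sketch for S ~= P B; S is the (m*n) x h matrix of
# per-location hourly trip counts (rows: locations, columns: hours).
import numpy as np
from sklearn.decomposition import NMF

def factorize_traffic(S: np.ndarray, K: int = 3, seed: int = 0):
    model = NMF(n_components=K, init='random', random_state=seed, max_iter=500)
    P = model.fit_transform(S)   # (m*n) x K traffic powers
    B = model.components_        # K x h basis patterns
    # Normalise each basis pattern to unit sum and rescale P accordingly,
    # so that traffic powers are comparable across locations.
    scale = B.sum(axis=1, keepdims=True)
    return P * scale.T, B / scale
```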


Now the only thing left is to determine K, the number of basis patterns. From the algorithmic perspective, we noticed that NMF starts with random initial conditions [41]. By experiments on the taxi data with many different random initial conditions, we find that only when K equals 3 are the factorization results stable. This fact indicates that with parameter K = 3, NMF can find statistically significant characteristics of the data, and Fig. 1 demonstrates the resulting basis patterns B_1, B_2 and B_3.

On the other hand, from the land-use and trip-category perspective, K = 3 is a reasonable choice for categorizing trip purposes. There are several land-use definitions related to the topic of mobility. For example, each place may be classified as a residential (home), working, shopping, or recreational location [27]. It may also be regarded as one of the following types: a residential area, a workplace, a commercial zone, a recreation area or educational facilities [43]. In Ref. [44], these types are simplified into workplace, home and shop. Specifically for the city of Shanghai, based on GIS information, Ref. [45] refers to land types including residence, industry, agriculture, roads, water, land for construction and other urban land. In our context, we can simplify the land-use definition to: residences, workplaces and others. Here workplaces include any industrial and office workplaces as well as schools, and other places include shopping and recreational facilities, hospitals, etc. For trips, some scientists categorize individual activities into several orientations: family, work, leisure and service-based movement [46]. Similarly, according to our land-use definition, we use three purpose-based categories for the trips: commuting between home and workplace (B_1), business traveling between two workplaces (B_2), and trips from or to other places (B_3). This representation is in accordance with the algorithmic result in Fig. 1.

Taking a typical workday as an example, based on our three categories, the major traffic flows in the city are expected to be as follows: those from home to workplaces in the early morning (green line), from one workplace to another in the daytime (red line), from workplace to home or to other places at dusk (green line again), and those between other places and home at night (blue line).
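The stability argument used to choose K can be probed with a small experiment. The sketch below is our own illustration of that idea (comparing bases recovered under different random initialisations by a greedy best-match cosine similarity), not the authors' exact procedure.

```python
# Probe how stable the recovered basis patterns are across random restarts.
import numpy as np
from sklearn.decomposition import NMF

def basis_stability(S, K, seeds=range(5)):
    bases = []
    for s in seeds:
        H = NMF(n_components=K, init='random', random_state=s,
                max_iter=500).fit(S).components_
        H = H / np.linalg.norm(H, axis=1, keepdims=True)
        bases.append(H)
    ref, sims = bases[0], []
    for H in bases[1:]:
        C = ref @ H.T                       # K x K cosine similarities
        sims.append(np.mean([C[i].max() for i in range(K)]))
    return float(np.mean(sims))             # close to 1.0 => stable basis

# e.g. compare basis_stability(S, K) for K = 2..5 and keep the most stable K.
```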

Therefore, K = 3 is an effective and reasonable choice. In the following sections with K = 3, for clarity, we will use B_c, B_w and B_o in place of B_1, B_2 and B_3 respectively:

\[
B = \begin{bmatrix} B_1 \\ B_2 \\ B_3 \end{bmatrix} = \begin{bmatrix} B_c \\ B_w \\ B_o \end{bmatrix} \tag{4}
\]

We also use Pc_{i,j}, Pw_{i,j} and Po_{i,j} to represent the three entries in the vector P_{i,j}:

\[
P_{i,j} = \left[\, Pc_{i,j},\; Pw_{i,j},\; Po_{i,j} \,\right] \tag{5}
\]

Figure 1. Basis Pattern B: Green is B1, Red is B2, and Blue is B3. Solid Lines Represent the Mean ⟨B⟩, while Dashed Lines Represent the Positive and Negative Deviations Averaged on Different Days. doi:10.1371/journal.pone.0034487.g001

Appendix S2 describes the detailed implementation of applying NMF to this problem. The basis patterns on different days are averaged to ⟨B⟩. Then P_{i,j}, the traffic power, can be recalculated based on ⟨B⟩ for each day. If it varies within an acceptable interval from day to day, the daily average of P_{i,j}, represented by ⟨P_{i,j}⟩, can indicate the land use of location (i, j). For example, if ⟨Pc_{i,j}⟩ is large, then the traffic flow corresponding to the basis pattern ⟨B_c⟩ is large, suggesting that location (i, j) serves mainly residences or workplaces, while if ⟨Pw_{i,j}⟩ is the largest, we can be quite sure that this location is mainly workplaces. In addition, if the variation of P_{i,j} on some day goes outside the acceptable interval, it indicates that something abnormal happened on that day. This feature can be helpful for anomaly detection on human activities over a large area. In the next section, we will analyze the variance of P_{i,j} to determine what an acceptable interval is.

Daily Traffic Power: the Variation

Typically in a city, the volume of the traffic flow is quite regular every day [8]. However, even for the same time in the same location but on different days, the volume is liable to change within a certain range. This section is devoted to analyzing how P_{i,j} fluctuates from day to day. In this case, P is calculated from the average basis pattern ⟨B⟩ according to Appendix S2. We define a random variable α to represent the relative variance of the traffic power. The empirical distribution function of α can simply be extracted from a collection of the following expressions in different locations on different days:

\[
\frac{Pc_{i,j}}{\langle Pc_{i,j}\rangle},\qquad \frac{Pw_{i,j}}{\langle Pw_{i,j}\rangle},\qquad \frac{Po_{i,j}}{\langle Po_{i,j}\rangle} \tag{6}
\]

where ⟨·⟩ denotes the daily average, as we have used. We also find the theoretical distribution function of α, which is more complex. First, we try to find α only for the first category of trips in location (i, j). We define pn as the potential population that may affect the first-category traffic in this location, and r as the probability (ratio) that an individual in the population finally becomes part of that traffic flow. Then the number of such trips follows a binomial distribution:

\[
P_{TN}(tn) = \binom{pn}{tn}\, r^{tn}\, (1-r)^{pn-tn} \tag{7}
\]

where tn can be any non-negative integer less than pn. Because it is a binomial distribution, the corresponding CDF can be written in terms of the beta functions:

\[
D_{TN}(tn) = P_{TN}(TN \le tn) = 1 - P_{TN}(TN \ge tn+1) = 1 - I_r(tn+1,\, pn-tn) \tag{8}
\]

where \( I_r(tn+1,\, pn-tn) = \dfrac{B(r : tn+1,\, pn-tn)}{B(tn+1,\, pn-tn)} \). \( B(x : c_1, c_2) \) is the incomplete beta function, \( B(x : c_1, c_2) = \int_0^x u^{c_1-1}(1-u)^{c_2-1}\,du \), and \( B(c_1, c_2) \) is the beta function, \( B(c_1, c_2) = \int_0^1 u^{c_1-1}(1-u)^{c_2-1}\,du \). Eq. (8) is strictly equal when tn is a positive integer, while for a real positive number tn we may use this approximation:

\[
D_{TN}(tn) \approx \tfrac{1}{2}\Big\{ \big[1 - I_r(tn+1,\, pn-tn)\big] + \big[1 - I_r\big((tn-1)+1,\, pn-(tn-1)\big)\big] \Big\} \tag{9}
\]

According to the definition, \( \alpha = \dfrac{Pc_{i,j}}{\langle Pc_{i,j}\rangle} = \dfrac{TN}{\langle TN\rangle} \), where ⟨TN⟩ is equivalent to pn × r by the property of the expectation of the binomial distribution, and can be treated as a constant for a given location. Therefore, the probability density function (PDF) of α is:

\[
P_{\alpha}(\alpha) = P_{\alpha\langle TN\rangle}(\alpha\langle TN\rangle) = P_{TN}(\alpha\langle TN\rangle) = \binom{pn}{\alpha\langle TN\rangle}\, r^{\alpha\langle TN\rangle}\, (1-r)^{pn-\alpha\langle TN\rangle} \tag{10}
\]

where α should satisfy the condition that α⟨TN⟩ is a non-negative integer. The cumulative distribution function (CDF) is

\[
D_{\alpha}(\alpha) = P_{\alpha}(a \le \alpha) = P_{TN}\big(TN \le \lfloor \alpha\langle TN\rangle \rfloor\big) = D_{TN}\big(\lfloor \alpha\langle TN\rangle \rfloor\big) = \sum_{k=0}^{\lfloor \alpha\langle TN\rangle \rfloor} \binom{pn}{k}\, r^{k}\, (1-r)^{pn-k} \tag{11}
\]

where ⌊·⌋ represents the floor function. We call this distribution the normalized binomial distribution of α. As listed in Appendix S3, the moment generation functions of α indicate that ⟨TN⟩ plays an essential role in the distribution. Numerical simulations also provide evidence that the distribution of α is strongly affected by ⟨TN⟩ (the product of pn and r), but is almost irrelevant to pn or r alone. Therefore, we can assign a constant integer N to pn. Let v be a vector containing all the possible values of ⟨TN⟩. Then the PDF of α with ⟨TN⟩ = v_k can be written in this form

\[
P_{\alpha, v_k}(\alpha) = \binom{N}{\alpha v_k}\, r^{\alpha v_k}\, (1-r)^{N - \alpha v_k} \tag{12}
\]

and the CDF is

\[
D_{\alpha, v_k}(\alpha) = \sum_{k=0}^{\lfloor \alpha v_k \rfloor} \binom{N}{k}\, r^{k}\, (1-r)^{N-k} \tag{13}
\]

Finally, we discuss how to obtain a representative expression for the variations of any traffic category in any location. We define a vector s, in which each entry s_k denotes the proportion of traffic flow corresponding to ⟨TN⟩ = v_k. Then, for a randomly selected traffic flow, when the average number of trips ⟨TN⟩ is not given, a general expression for the CDF of α is

\[
D_{\alpha}(\alpha) = \sum_{k} s_k \, D_{\alpha, v_k}(\alpha) \tag{14}
\]

By the beta approximation as in Eq. (9), it can be written into a continuous version

\[
D_{\alpha}(\alpha) = \sum_{k} s_k \, D_{\alpha, v_k}(\alpha) \approx \sum_{k} s_k \, \tfrac{1}{2}\Big\{ \big[1 - I_r(\alpha v_k + 1,\, N - \alpha v_k)\big] + \big[1 - I_r\big((\alpha v_k - 1)+1,\, N - (\alpha v_k - 1)\big)\big] \Big\} \tag{15}
\]
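For reference, the continuous mixture CDF of Eq. (15) can be evaluated directly with the regularised incomplete beta function. The sketch below is our own numerical illustration; the parameters s, v, N and r are assumed to come from a fit to the data and are not provided here.

```python
# Direct evaluation of Eq. (15) using the regularised incomplete beta function.
import numpy as np
from scipy.special import betainc   # betainc(a, b, x) = I_x(a, b)

def mixture_cdf(alpha, s, v, N, r):
    """Approximate CDF of the relative traffic-power deviation alpha.
    s, v: mixture weights and the corresponding <TN> values; N, r as in the text."""
    alpha = np.asarray(alpha, dtype=float)
    total = np.zeros_like(alpha)
    for s_k, v_k in zip(s, v):
        # Keep the beta-function arguments strictly positive (valid for
        # 0 < alpha < N / v_k); clipping handles the boundary numerically.
        t = np.clip(alpha * v_k, 1e-9, N - 1e-9)
        term1 = 1.0 - betainc(t + 1.0, N - t, r)
        term2 = 1.0 - betainc(t, N - t + 1.0, r)   # (t-1)+1 and N-(t-1)
        total += s_k * 0.5 * (term1 + term2)
    return total
```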

Results

In this section, we demonstrate how our theoretical results are supported by the empirical investigation. The general characteristics of our data set, such as the displacement distribution in Fig. 2 and the visiting frequency distribution in Fig. 3, are similar to others' [8,38].

Figure 2. Traveling Distance Distribution. doi:10.1371/journal.pone.0034487.g002



Figure 3. Visiting Frequency Distribution of Different Locations. doi:10.1371/journal.pone.0034487.g003

Figure 4. The Average Traffic Flow of Each Location, and the Tags Corresponding to the Following Locations: (1) Shanghai Railway Station; (2) Nanjing Road & People's Square; (3) Lujiazui Finance & Trade Zone; (4) Shanghai South Railway Station; (5) Pudong International Airport. doi:10.1371/journal.pone.0034487.g004

The plot of daily traffic flow in Fig. 4 exhibits some hot areas in red, including the most flourishing commercial street Nanjing Road as the largest red block, Shanghai Railway Station, Shanghai South Railway Station, the Lujiazui Finance & Trade Zone, etc. The largest isolated area in blue is Pudong International Airport.

Without any intentional intervention, by NMF with random initial values, we find that the normalized basis patterns on workdays are generally quite similar (Fig. 1). Therefore, we can use the traffic power P to analyze the mean and the deviation of the daily traffic. In Fig. 5, the three components of P_{i,j} in every location are normalized and represented by yellow, red and blue respectively. For example, a location in yellow means that the traffic flow of the first category (B_c: commuting between home and workplace) is dominant there. Mixed colors in some places indicate a mixture of traffic flows of different categories.

It is noticeable that in areas where the traffic flow is large, the positive (Fig. 6(a)) and negative (Fig. 6(b)) deviations of the traffic power P are quite small.

Figure 6. The Relative Deviation for Components of Pi,j in Each Location: (a) the Average Positive Deviation; (b) the Average Negative Deviation. doi:10.1371/journal.pone.0034487.g006

The distribution of this deviation can be represented accordingly by Fig. 7(a) and Fig. 7(b), which is fitted well by Eq. (15). This fitting result is quite different from the best-fitted normal distribution suggested by the central limit theorem, which verifies Eq. (14) and Eq. (15) in that α should be a collection of random variables following a set of distributions with different parameters. The proportion of traffic flow with ⟨TN⟩ = v_k is s_k, as plotted in Fig. 8. Here we limit each s_k to be no larger than twice its empirical value. According to the result in Fig. 7, for the whole city, 80% of the deviations are within the range 0.5–1.5. Although the lengths of the vectors s and v are identically 50 in our estimation, the number of active pairs of s_k and v_k (those with s_k not ≈ 0) is only about 10, and this number can be reduced further if we calculate only for a small area, given a sufficient amount of data. In short, we can see that Eq. (15) is a reasonable approximation for the relative deviation of the daily traffic flow.

Figure 7. The Distribution of the Relative Deviation for Components of Pi,j: (a) CDF; (b) PDF.

Fig. 9(b) presents the components of P for the central part of the city in comparison with the urban planning map for Years 2004–2020 in Fig. 9(a). Generally, it can be seen that the residential areas have a large volume of traffic with respect to B_c and B_o, corresponding to trips between home and workplaces and trips for other purposes, while in the workplace areas, especially those for business, there are many flows corresponding to the second category B_w, and in the remaining areas the third one, B_o, is quite significant.

Figure 5. The Average Component Proportions of Pi,j in Each Location, Equivalent to the Categorical Proportion of the Traffic. doi:10.1371/journal.pone.0034487.g005

We should note that the urban planning map (2004–2020) is not an exact description of the land uses of Year 2007, and consequently the patterns of the two figures may not agree well in some small areas.


For example, the red patch around point (−5, −5) in Fig. 9(a) is planned as industrial land, namely a workplace in our context, while in fact it was a construction site for Expo 2010 Shanghai China, with very little taxi traffic in Year 2007. Yet it is still reasonable for a construction site to have its major taxi flows of type B_o, as shown in Fig. 9(b), because in the evening workers are very likely to go out for recreation, entertainment, etc. In addition, we can see how the government planning [47] is affected by what exists now. For example, Nanjing Road and its surroundings form the largest block with high traffic throughput, and the traffic flows there are constituted mainly of the workplace-related (B_w) and other-facilities-related (B_o) categories. In the planning, it is designed to be a public activity center for administrative, business and shopping purposes. Lujiazui is another similar but smaller zone, which is planned mainly for business and shopping centers.

Figure 8. The Parameters for the Distribution. doi:10.1371/journal.pone.0034487.g008

Discussion

In this research, we find that the traffic on workdays can be divided into three categories according to the different purposes: commuting between home and workplaces, traveling from workplace to workplace, and others such as leisure activities. Each of these categories has a highly distinguishable basis pattern: B_c, B_w or B_o. The relative daily deviation of the traffic flow in each category can be modeled as Eq. (14), which is a mixture of normalized binomial distributions, with a continuous approximation as Eq. (15). This basis pattern theory is applicable to data sets containing the beginning and ending information of trips, such as bicycle departure and arrival data [48], cell-phone-based mobility information [8], GPS-based data, etc.

The first contribution of this research is that it provides a very economical approach to understand how the urban traffic at different locations is composed from the three categories. For instance, a large Pc_{i,j} means there is a large portion of traffic between home and workplaces at location (i, j). This theory can also help to infer the land use composition in a quite easy, real-time, and automated way. For example, the evidence of a large Pc_{i,j} every day indicates that location (i, j) is mainly for residential or working purposes, while a large Pw_{i,j} can imply that it has lots of workplaces. A mixture of different land uses in a single location can be found by this method as well.

Second, based on the NMF approach, the time series of the total traffic at any location can be expressed as a linear combination of the basis patterns. Therefore, we can compress the traffic data of a large area into a very small data size, but still with quite high resolution. Namely, we only need to store the global basis patterns, and for each location we use a small vector of traffic powers to represent how strong each basis pattern is.

Third, we find that the distribution of the relative deviation is not a normal distribution, indicating that the random variable α is not identical from one place to another, or from time to time. The significance of Eq. (14) and Eq. (15) is that they provide an expression of how traffic fluctuates for various unknown positions and time intervals. This description of the relative deviation can also be helpful for estimating changes in the traffic flow, which is important in traffic prediction, control and urban planning.

Finally, with the deviation distribution, we can not only predict the change of traffic, but also diagnose the abnormality of the traffic: where, when, why, and how. The first two functions are obvious, while 'why' abnormal can be disclosed by the traffic power, and 'how' abnormal can be revealed by the probability of the deviation. For example, if some traffic flow is very abnormal one day, the probability density of the deviation on that day should be very small.

Our analysis focuses on the traffic flows in different locations on different workdays. Our results can also be extended to the traffic on a road. The road traffic is a summation of the traffic passing the road from several sources and to several destinations. Therefore, the volume and the deviation of the road traffic flow can also be explained in our framework.
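The storage and diagnosis ideas above can be sketched concretely. In the illustration below, only the averaged basis patterns ⟨B⟩ and per-location power vectors are kept; the clipped least-squares projection and the plausibility band are our own illustrative choices (the paper reports that roughly 80% of city-wide deviations fall within 0.5–1.5, and in practice a band would come from Eq. (15)).

```python
# Per-day traffic-power recomputation and a simple unusual-day flag for one location.
import numpy as np

def daily_powers(S_days, B_avg):
    """S_days: (n_days, h) hourly counts for one location; B_avg: (K, h).
    Returns nonnegative per-day power vectors, shape (n_days, K)."""
    P, *_ = np.linalg.lstsq(B_avg.T, S_days.T, rcond=None)
    return np.clip(P.T, 0.0, None)

def flag_unusual_days(S_days, B_avg, band=(0.5, 1.5)):
    """Flag (day, category) pairs whose relative deviation alpha = P / <P>
    falls outside the plausibility band."""
    P = daily_powers(S_days, B_avg)
    alpha = P / np.maximum(P.mean(axis=0, keepdims=True), 1e-12)
    lo, hi = band
    days, cats = np.where((alpha < lo) | (alpha > hi))
    return list(zip(days.tolist(), cats.tolist()))
```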

Figure 9. Comparing the Empirical Data to the Urban Planning Map: (a) the Area Type from Urban Planning [47] for the Central Part of the City; (b) the Average Categorical Proportion of Traffic for the Central Part of the City. doi:10.1371/journal.pone.0034487.g009

Supporting Information

Appendix S1 More on Data Description and Background Assumptions. (PDF)

Appendix S2 Implementation Details about the Factorization. (PDF)

Appendix S3 Moment Generation Function of α. (PDF)

Acknowledgments

We would like to thank the Wireless and Sensor Networks Lab (WnSN, Shanghai Jiao Tong University, China) for providing the data source. We thank Dr. Min-You Wu and Yang Yang (Shanghai Jiao Tong University, China) for support with the data. We also thank Xianchuang Su, Dr. Yixiao Li, Dr. Yong Min and Chuanzi Chen (Zhejiang University, China), and Dr. David Keyes and Dr. Xiangliang Zhang (King Abdullah University of Science and Technology, Saudi Arabia) for precious suggestions. For computer time, this research used the resources of the Supercomputing Laboratory at King Abdullah University of Science & Technology (KAUST) in Thuwal, Saudi Arabia.


Author Contributions

Conceived and designed the experiments: CP XJ PL. Performed the experiments: CP KW. Analyzed the data: CP XJ KW MS PL. Contributed reagents/materials/analysis tools: CP PL. Wrote the paper: CP XJ PL.


References

1. Chowdhury D, Santen L, Schadschneider A (2000) Statistical physics of vehicular traffic and some related systems. Physics Reports 329: 199–329.
2. Nagel K (1996) Particle hopping models and traffic flow theory. Physical Review E 53: 4655.
3. Esser J, Schreckenberg M (1997) Microscopic simulation of urban traffic based on cellular automata. International Journal of Modern Physics C-Physics and Computer 8: 1025–1036.
4. Simon P, Nagel K (1998) A simplified cellular automaton model for city traffic. Arxiv preprint cond-mat/9801022.
5. Perc M (2007) Premature seizure of traffic flow due to the introduction of evolutionary games. New Journal of Physics 9: 3.
6. Helbing D (1995) Improved fluid-dynamic model for vehicular traffic. Physical Review E 51: 3164.
7. Brockmann D, Hufnagel L, Geisel T (2006) The scaling laws of human travel. Nature 439: 462–465.
8. González M, Hidalgo C, Barabási A (2008) Understanding individual human mobility patterns. Nature 453: 779–782.
9. Jiang B, Yin J, Zhao S (2009) Characterizing the human mobility pattern in a large street network. Physical Review E 80: 021136.
10. Leutzbach W (1987) Introduction to the theory of traffic flow. Springer Verlag.
11. Kerner B (2009) Introduction to modern traffic flow theory and control: the long road to three-phase traffic theory. Springer Verlag.
12. Kitamura R, Chen C, Pendyala R, Narayanan R (2000) Micro-simulation of daily activity-travel patterns for travel demand forecasting. Transportation 27: 25–51.
13. Kuppam A, Pendyala R (2001) A structural equations analysis of commuters' activity and travel patterns. Transportation 28: 33–54.
14. Liao Z, Yang S, Liang J (2010) Detection of abnormal crowd distribution. In: IEEE/ACM International Conference on Green Computing and Communications & IEEE/ACM International Conference on Cyber, Physical and Social Computing. IEEE. pp 600–604.
15. Candia J, González M, Wang P, Schoenharl T, Madey G, et al. (2008) Uncovering individual and collective human dynamics from mobile phone records. Journal of Physics A: Mathematical and Theoretical 41: 224015.
16. Andrade E, Blunsden S, Fisher R (2006) Modelling crowd scenes for event detection. In: Proceedings of the 18th International Conference on Pattern Recognition. IEEE, volume 1. pp 175–178.
17. Mehran R, Oyama A, Shah M (2009) Abnormal crowd behavior detection using social force model. In: IEEE Conference on Computer Vision and Pattern Recognition. IEEE. pp 935–942.
18. Handy S (1996) Methodologies for exploring the link between urban form and travel behavior. Transportation Research Part D: Transport and Environment 1: 151–165.
19. Horner M, O'Kelly M (2001) Embedding economies of scale concepts for hub network design. Journal of Transport Geography 9: 255–265.
20. Dieleman F, Dijst M, Burghouwt G (2002) Urban form and travel behaviour: micro-level household attributes and residential context. Urban Studies 39: 507.
21. Waddell P (2002) Modeling urban development for land use, transportation, and environmental planning. Journal of the American Planning Association 68: 297–314.
22. Boarnet M, Crane R (2001) The influence of land use on travel behavior: specification and estimation strategies. Transportation Research Part A: Policy and Practice 35: 823–845.
23. Wegener M (2004) Overview of land use transport models. Handbook of transport geography and spatial systems 5: 127–146.
24. Handy S (2005) Smart growth and the transportation-land use connection: what does the research tell us? International Regional Science Review 28: 146.
25. Han X, Hao Q, Wang B, Zhou T (2011) Origin of the scaling law in human mobility: Hierarchy of traffic systems. Physical Review E 83: 036117.
26. Longini I Jr., Nizam A, Xu S, Ungchusak K, Hanshaoworakul W, et al. (2005) Containing pandemic influenza at the source. Science 309: 1083.
27. Eubank S, Guclu H, Kumar V, Marathe M, Srinivasan A, et al. (2004) Modelling disease outbreaks in realistic urban social networks. Nature 429: 180–184.
28. Easley D, Kleinberg J (2010) Networks, crowds, and markets: Reasoning about a highly connected world. Cambridge University Press.
29. Anderson R, Fraser C, Ghani A, Donnelly C, Riley S, et al. (2004) Epidemiology, transmission dynamics and control of SARS: the 2002–2003 epidemic. Philosophical Transactions of the Royal Society of London Series B: Biological Sciences 359: 1091.
30. Hufnagel L, Brockmann D, Geisel T (2004) Forecast and control of epidemics in a globalized world. Proceedings of the National Academy of Sciences of the United States of America 101: 15124.
31. Riley S (2007) Large-scale spatial-transmission models of infectious disease. Science 316: 1298.
32. Kleinberg J (2007) The wireless epidemic. Nature 449: 287–288.
33. Hu H, Myers S, Colizza V, Vespignani A (2009) WiFi networks and malware epidemiology. Proceedings of the National Academy of Sciences 106: 1318.
34. Castellano C, Fortunato S, Loreto V (2009) Statistical physics of social dynamics. Reviews of Modern Physics 81: 591–646.
35. Shlesinger M, Zaslavsky G, Frisch U (1995) Lévy flights and related topics in physics. In: Lévy Flights and Related Topics in Physics: Proceedings of the International Workshop Held at Nice, France. volume 450.
36. Rhee I, Shin M, Hong S, Lee K, Chong S (2008) On the Levy-walk nature of human mobility. In: INFOCOM 2008, The 27th Conference on Computer Communications. IEEE. pp 924–932.
37. Song C, Qu Z, Blumm N, Barabasi A (2010) Limits of predictability in human mobility. Science 327: 1018.
38. Liang X, Zheng X, Lv W, Zhu T, Xu K (2012) The scaling of human mobility by taxis is exponential. Physica A: Statistical Mechanics and its Applications 391: 2135–2144.
39. Shanghai Jiao Tong University, China (2007) SUVnet-Trace data. Available: http://wirelesslab.sjtu.edu.cn. Accessed 2012 Mar 9.
40. Shanghai Population and Family Planning Commission, China (2001) From the fifth population census to evaluate the population condition for the sustainable development of Shanghai. Available: http://www.popinfo.gov.cn/yearbook/2001nj/zhuanwen/7-4.htm. Accessed 2012 Mar 9.
41. Lee D, Seung H (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401: 788–791.
42. Lin C (2007) Projected gradient methods for nonnegative matrix factorization. Neural Computation 19: 2756–2779.
43. Hollick M, Krop T, Schmitt J, Huth H, Steinmetz R (2004) Modeling mobility and workload for wireless metropolitan area networks. Computer Communications 27: 751–761.
44. Ben-Akiva M, Bowman J, Ramming S, Walker J (1998) Behavioral realism in urban transportation planning models. Transportation Models in the Policy-Making Process: Uses, Misuses and Lessons for the Future. pp 4–6.
45. Zhang L, Wu J, Zhen Y, Shu J (2004) A GIS-based gradient analysis of urban landscape pattern of Shanghai metropolitan area, China. Landscape and Urban Planning 69: 1–16.
46. Onnela J, Saramäki J, Hyvönen J, Szabó G, Lazer D, et al. (2007) Structure and tie strengths in mobile communication networks. Proceedings of the National Academy of Sciences 104: 7332.
47. Shanghai Municipal Bureau of Planning and Land Resources, China (2009) Shanghai urban planning: land-use planning. Available: http://www.china.com.cn/aboutchina/zhuanti/09dfgl/2009-09/08/content184882372.htm. Accessed 2012 Mar 9.
48. Kaltenbrunner A, Meza R, Grivolla J, Codina J, Banchs R (2008) Bicycle cycles and mobility patterns - Exploring and characterizing data from a community bicycle program. Arxiv preprint arXiv: 08104187.



Appendix B – [Personal profiling in on-line social networks] This appendix contains the reprint of the following papers: A. Boutet, H. Kim and E. Yoneki "What’s in Your Tweets? I Know Who You Supported in the UK 2010 General Election". International AAAI Conference on Weblogs and Social Media (ICWSM), Dublin, Ireland, 2012a. A. Boutet, H. Kim and E. Yoneki "What’s in Your Tweets? I Know What Parties are Popular and Who You are Supporting Now! ". IEEE/ACM International Conference on Social Networks Analysis and Mining (ASONAM) (Full Paper), Istanbul, Turkey, 2012b


What's in Your Tweets? I Know Who You Supported in the UK 2010 General Election

Antoine Boutet (INRIA Rennes Bretagne Atlantique, France)
Hyoungshick Kim and Eiko Yoneki (Computer Laboratory, University of Cambridge, UK)

Abstract

Nowadays, social media such as Twitter can be used to monitor trends in people's opinions on political issues. As a case study, we collected the main stream of Twitter messages related to the 2010 UK general election during the associated period. We analyse the characteristics of the three main parties in the election. We also propose a simple and practical algorithm to identify the political leaning of users using the number of Twitter messages that appear related to political parties. The experimental results showed that the best-performing classification method – which uses the number of Twitter messages referring to a particular political party – achieved about 86% classification accuracy without any training phase.

Introduction

We are interested in how to measure the authority of political parties and the political leaning of users from social media. To illustrate the practicality of our analysis, we used a dataset of messages collected from Twitter related to the 2010 UK general election, which took place on May 6th, 2010. We examined the characteristics of the three main parties (Labour, Conservative, Liberal Democrat) and discussed the main differences between the parties in terms of structure, interaction, and content. Through this intensive analysis of the users with political interests, we develop a simple and practical algorithm to identify the political leaning of users in Twitter – the messages expressing a user's political views (e.g. tweets referring to a particular political party and retweets from users with known political preferences) are used to estimate the overall political leaning of users. To demonstrate the effectiveness of the proposed heuristic model, we evaluated the performance of the proposed classification method. The experimental results showed that the proposed classification method – which uses the number of tweets referring to a particular political party – achieved about 86% classification accuracy over all trials without any training phase, outperforming existing heuristics (Pennacchiotti and Popescu 2011a; Zhou, Resnick, and Mei 2011) that require expensive parameter tuning to construct a classifier.

Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Our approach has two key advantages: (1) as we only process the messages relevant to a particular event rather than the whole dataset at one time, it drastically reduces the computation costs of constructing a classifier compared with existing approaches which may indeed be unacceptable for online classification; (2) it also has potential: we can discover the temporal trends of a user’s political views by analysing her political leaning over time.

Party Characteristics
To analyse the characteristics of the Labour, Conservative and Liberal Democrat (LibDem) parties and identify the relevant features for a user's party affiliation, we collected all tweets published on the top trending topics related to the UK election between the 5th and 12th of May, and kept only the 419 topics which have over 10,000 tweets. The resulting dataset gathers more than 220,000 users for almost 1,150,000 tweets. For these users, we also collected their profiles and about 79,000,000 following/follower relationships. Some user profiles can be used to identify their political party affiliation. We manually identified the 356 Labour, 159 Conservative and 169 LibDem self-identified members as a ground truth dataset. With this ground truth dataset, we detected the communities associated with each political party using a well-known technique called the label propagation method (Raghavan, Albert, and Kumara 2007). Here, the label propagation method spreads affiliations from ground truth users called seeds throughout the retweet graph. We label a user with the party affiliation according to the seeds that have reached it. We need to set the maximum propagation distance k to avoid tie-breaking cases (i.e. multiple nearest nodes with different party memberships exist at the same time). We performed the label propagation until the propagation distance is greater than k. When k = 2, we detected 5,878 Labour, 3,214 LibDem and 2,356 Conservative candidates with a high accuracy of 0.77, 0.78 and 0.90 respectively, for an average of 0.82. With these candidates, we analysed the following characteristics of each party: (i) structure/interaction and (ii) content features.
Structure and Interaction
We studied the differences between the political parties in network structure and inter-

action patterns. The interaction patterns between members within a party reflects a level of party cohesion while the interaction patterns between different communities reflect the exchanges (i.e. conflict or collaboration) between them. We particularly observed the amount of interactions between the political parties by counting the number of exchanged retweets (forward messages to its followers) and mentions (direct messages to another user) between them during the election period.

[Figure 1: Exchanged messages between parties.]
According to the detected communities described above, we can see that there was no retweet exchanged between different political parties. In contrast, mentions between different parties were used more frequently. We can also see that few interactions were observed between the Labour and LibDem members, in contrast to the high rate of interactions between Conservative and both Labour and LibDem. We surmise that the suggested coalition between Conservative and LibDem generated more discussions among members of both parties than between Labour and LibDem.
Content
We analysed the contents of tweets by counting the number of hashtags (tags used to define topics) and URLs used in tweets for each party. Political parties showed a similar behaviour for the number of used URLs, while Labour members used more varied hashtags in their tweets compared to the other parties. The usage rates of neutral hashtags indicating the UK election remained at a similar level between all parties, while non-neutral hashtags were more or less used depending on their underlying meaning. We also analysed the hashtag similarity between users to evaluate the content homogeneity of each party. For a user, we define a vector containing the frequencies of hashtags used in the user's tweets and then compute the cosine similarity between each pair of users. The average similarity is overall low regardless of political party affiliation. That is, these results imply that Twitter users have heterogeneous behaviour in their use of hashtags. By analysing the URLs mentioned in tweets, we can identify the preferred websites of each party. LibDem members more frequently referred to the Financial Times, The Independent and the BBC compared with the other party members. We also observed the blogs, which are usually more politically oriented. We observed very few overlaps of the referenced blogs between the parties. This result may confirm the highly segregated structure of the blogosphere according to political parties reported in (Adamic and Glance 2005). Finally, we measured the volume of references to a specific party included in tweets. We considered only the tweets referring to the name of one party or its leader, as such tweets are more likely to reflect the allegiance or interest of the users. The analysis clearly shows that users were more likely to frequently refer to their own preferred party or leader.
User Classification
For user classification, our goal is to identify the party to which a user belongs. We particularly focused on developing a classification method to process the dynamically updated statistics on users; a user's tweet activities are sequentially observed over time.
Bayesian Classification

Without loss of generality, we assume that a sequence of tweet activities (e.g. retweets or references to a specific party/leader in tweets) by a user is divided into $n$ subsequences, where the $k$th subsequence corresponds to the tweet activities during the $k$th time interval. For a user $u$, we use $A_k(u)$ and $M_k^i(u)$ to denote the $k$th subsequence (i.e., the tweet activities performed by the user $u$ during the $k$th time interval) and the 0-1 binary variable indicating user $u$'s membership of the party $i$ after the $k$th time interval (i.e., $M_k^i(u) = 1$ when $u$ is a member of the party $i$), respectively, where $1 \le k \le n$ and $i \in \{labour, libdem, conservative\}$. We also use $P(M_k^i(u))$ to denote the probability of user $u$ being a member of the party $i$ after the $k$th time interval. We assume that every user belongs to exactly one of the parties: $\sum_i P(M_k^i(u)) = 1$. After the $n$th time interval, we classify the user $u$ as a member of the party $j$ where $P(M_n^j(u)) = \max_i \{P(M_n^i(u))\}$. For example, when the affiliation probability distribution for the user $u$ after the $n$th time interval is given as $[0.7, 0.2, 0.1]$, we classify the user $u$ as a member of the Labour party. We randomly choose the user $u$'s party in case of an equiprobable distribution. We now focus on how to compute $P(M_k^i(u))$. At each time interval, for each $i \in \{labour, libdem, conservative\}$, $P(M_k^i(u))$ is updated stochastically according to its probability distribution, relying on the user's tweet activities during the time interval. Before the first inference step, the initial prior affiliation probability of the user $u$ is set uniformly: $P(M_0^i(u)) = \frac{1}{3}, \forall i$. After the $k$th time interval, $P(M_k^i(u) \mid A_k(u))$ can be calculated by using Bayes' theorem as follows:

$$P(M_k^i(u) \mid A_k(u)) = \frac{P(A_k(u) \mid M_k^i(u))\, P(M_k^i(u))}{\sum_j P(A_k(u) \mid M_k^j(u))\, P(M_k^j(u))}$$

where $P(M_k^i(u) \mid A_k(u))$ is the posterior of user $u$, the uncertainty of $M_k^i(u)$ after $A_k(u)$ is observed; $P(M_k^i(u))$ is the prior, the uncertainty of $M_k^i(u)$ before $A_k(u)$ is observed; and $\frac{P(A_k(u) \mid M_k^i(u))}{P(A_k(u))}$ is a factor representing the impact of $A_k(u)$ on the uncertainty of $M_k^i(u)$.

To calculate $P(A_k(u) \mid M_k^i(u))$, we consider two tweet activities for $A_k(u)$, based on the observations in the previous section: (1) retweeting the messages from the members of each political party and (2) referring to political parties in tweets. We can see that tweets generated by users supporting the same political party were more frequently retweeted. For this activity, we assume $P(A_k(u) \mid M_k^i(u))$ can be calculated as follows:

$$P(A_k(u) \mid M_k^i(u)) = \frac{\sum_{v \in RT} P(M_{k-1}^i(v))}{|RT|} \qquad (1)$$

where $RT$ is the set of the users included in retweets as the source or the destination of information. We use Bayesian-Retweet to denote the Bayesian classification where $P(A_k(u) \mid M_k^i(u))$ is defined in (1). The other important tweet activity is to generate a tweet referring to the political party (or party leader) that the user $u$ will support after the $k$th time interval, since party members are more likely to make reference to their own party than to another. For this activity, we assume $P(A_k(u) \mid M_k^i(u))$ can be calculated as follows:

$$P(A_k(u) \mid M_k^i(u)) = \frac{\sum_{t \in T} V_i(t)}{|T|} \qquad (2)$$

where $T$ is the set of tweets of the current user during the period and $V_i(t)$ is equal to 1 if the tweet $t$ makes a reference to the political party $i$, and 0 otherwise. We use Bayesian-Volume to denote the Bayesian classification where $P(A_k(u) \mid M_k^i(u))$ is defined in (2).
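A minimal sketch of the incremental Bayesian-Volume update defined by equation (2) and the Bayes rule above, assuming tweets are already grouped per time interval; the names `update_affiliation` and `classify_user` are illustrative, not the authors' implementation.

```python
PARTIES = ["labour", "libdem", "conservative"]

def update_affiliation(prior, interval_tweets):
    """One Bayesian update step for a single user (Bayesian-Volume).

    prior: dict party -> P(M_{k-1}^i(u)).
    interval_tweets: list of sets, each set holding the parties referenced
    by one tweet published during the k-th interval.
    """
    if not interval_tweets:
        return dict(prior)  # no evidence this interval: keep the prior
    # Likelihood P(A_k(u) | M_k^i(u)) from equation (2): fraction of the
    # interval's tweets that refer to party i.
    likelihood = {
        p: sum(1 for refs in interval_tweets if p in refs) / len(interval_tweets)
        for p in PARTIES
    }
    unnormalised = {p: likelihood[p] * prior[p] for p in PARTIES}
    z = sum(unnormalised.values())
    if z == 0:
        return dict(prior)  # tweets referenced none of the tracked parties
    return {p: v / z for p, v in unnormalised.items()}

def classify_user(intervals):
    """intervals: list with one interval_tweets entry per time interval."""
    posterior = {p: 1.0 / len(PARTIES) for p in PARTIES}  # uniform prior
    for interval_tweets in intervals:
        posterior = update_affiliation(posterior, interval_tweets)
    return max(posterior, key=posterior.get), posterior

# Toy usage: a user who mostly references Labour across two intervals.
history = [[{"labour"}, {"labour", "conservative"}], [{"labour"}]]
print(classify_user(history))
```

A Bayesian-Retweet variant would only change the likelihood term, averaging the previous-interval affiliation probabilities of the retweet partners as in equation (1).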

Evaluation
The aim of our experiment was to demonstrate the feasibility and effectiveness of the proposed classification approach compared with other popularly used classification methods. For comparison, we also tested the performance of the following classification methods:
• Volume classifier: We counted the frequencies of references to parties (or party leaders) in a user's tweets and then assigned the most frequently referenced party as the user's political party.
• Retweet classifier: This approach detects the communities of users using a label propagation method (Raghavan, Albert, and Kumara 2007) on the retweet graph. In the label propagation process, each user's party is classified with the majority party among the user's neighbours. Ties can be broken according to the volume of references to a party. From the initial seed users (self-identified members), we iteratively apply this process until all users' parties are classified.
• SVM classifier: Support Vector Machine (SVM) is known as one of the best supervised learning techniques for solving classification problems. We constructed an SVM classifier using the following six features of a user proposed in (Pennacchiotti and Popescu 2011a; 2011b): (i) the number of followers, (ii) the number of replied

users, (iii) the number of retweeted users, (iv) the number of used words in the user's tweets, (v) the number of used hashtags in the user's tweets, and (vi) the average emotion over the user's tweets.
To show the performance of a classifier, we measured the accuracy of the classifier for the self-identified users. The classification accuracy is defined as the ratio between the number of correctly predicted samples and the total number of testing samples. For the classifiers requiring training samples (Retweet, SVM, and Bayesian-Retweet), one-tenth of the ground truth users was used to construct the classifiers and the rest was reserved for out-of-sample testing. We have seen that the accuracy of these classifiers can change with the set of training samples. We used the most influential users with the highest number of followers, since these training samples provide the best accuracy compared to the most active users with the highest number of generated tweets or to random users.

Classifier          Accuracy
Volume              0.60
Retweet             0.73
SVM                 0.63
Bayesian-Retweet    0.64
Bayesian-Volume     0.86

Table 1: Performance according to approach.

The Bayesian-Volume classifier produced the best results: the measured accuracy (0.86) is significantly higher than that of the other classification methods. In addition, this classification benefits from two advantages. Firstly, it requires maintaining only the affiliation probability of each user, without massive training overheads; secondly, as only the information about references to a party or a leader in tweets is needed, incremental computation is significantly faster. These important advantages make it possible to use this solution in real time. Contrary to our expectations, SVM, which involves an expensive tuning phase, did not outperform the other algorithms. We also analysed how the number of partisans of each party and the accuracy of the proposed Bayesian classifiers, respectively, change with time. The results are shown in Figure 2. Conservative members outnumber the Labour and LibDem members at the end of the election. Inherently, the accuracy of Bayesian-Volume and Bayesian-Retweet starts at 1/2 (equiprobability), continuously increases with time, and reaches 0.86 and 0.64, respectively. These results imply that the proposed Bayesian approach is appropriate for understanding users' political leaning over time. However, the accuracy for Bayesian-Retweet increases more slowly than for Bayesian-Volume. Even if a retweet graph generally presents a highly segregated structure, latency might be expected to sufficiently propagate retweets over the graph. In contrast, Bayesian-Volume, which uses only the tweets published by users during the current time interval, achieves accurate prediction without the latency caused by message propagation between users.

[Figure 2: Dynamic changes of the Bayesian classifiers. (a) Numbers of members; (b) Classification accuracy.]

Related Work
The exponential growth of social media has attracted much attention. Different approaches have been proposed for classifying users in many directions. (Lin and Cohen 2008) presented a semi-supervised algorithm for classifying political blogs. (Zhou, Resnick, and Mei 2011) also applied three semi-supervised algorithms for classifying political news articles and users, respectively. On the other hand, (Adamic and Glance 2005) studied the linkage patterns between political blogs and found that the blogosphere exhibits a politically segregated community structure with more limited connectivity between different communities. Recently, (Conover et al. 2011) observed a similar structure in a retweet graph of Twitter in a political context. Other classifications used machine learning methods to infer information about users. (Pennacchiotti and Popescu 2011a) demonstrated the possibility of user classification in Twitter with three different classifications: political affiliation detection, ethnicity identification, and detecting affinity for a particular business. (Pennacchiotti and Popescu 2011b) used Gradient Boosted Decision Trees, a machine learning technique for regression problems which produces a prediction model in the form of an ensemble of decision trees. Several studies have sought to characterise user behaviour or personality in social networks (Benevenuto et al. 2009). However, few works have tried to study the characteristics of political parties and the interaction structure between parties. (Livne et al. 2011) studied the usage patterns of tweets about the candidates in the 2010 U.S. midterm elections and showed stronger cohesiveness among the Conservative and Tea Party groups. Other studies have addressed the predictive power of social media. (Livne et al. 2011) investigated the relation between the network structure and tweets and presented a forecast of the 2010 midterm elections in the US, and (Tumasjan et al. 2010; Gayo-Avello, Metaxas, and Mustafaraj 2011) discussed the relevance of Twitter as a valid indicator of political opinion. (O'Connor et al. 2010) used sentiment analysis to compare Twitter streams with polls in different areas and showed a correlation on some points. (Diakopoulos and Shamma 2010) showed that tweets can be used to track real-time sentiment about candidates' performance during a debate.

Conclusion
We analysed the characteristics of the political parties in Twitter during the 2010 UK General Election and identified two main ways to differentiate political parties: (i) the retweet graph presented a highly segregated partisan structure, and (ii) party members were more likely to make reference to their own party than to another. Based on these party characteristics, we built two classification algorithms based on a Bayesian framework. The experimental results showed that the proposed classification method is capable of achieving an accuracy of 86% without any training, which makes it a suitable solution for real-time classification.

References Adamic, L., and Glance, N. 2005. The political blogosphere and the 2004 u.s. election: Divided they blog. In LinkKDD’05. Benevenuto, F.; Rodrigues, T.; Cha, M.; and Almeida, V. 2009. Characterizing user behavior in online social networks. In IMC’09. Conover, M.; Ratkiewicz, J.; Francisco, M.; Gonc¸alves, B.; Flammini, A.; and Menczer, F. 2011. Political polarization on twitter. In ICWSM’11. Diakopoulos, N. A., and Shamma, D. A. 2010. Characterizing debate performance via aggregated twitter sentiment. In CHI’10. Gayo-Avello, D.; Metaxas, P. T.; and Mustafaraj, E. 2011. Limits of electoral predictions using twitter. In ICWSM’11. Lin, F., and Cohen, W. W. 2008. The multirank bootstrap algorithm: Self-supervised political blog classification and ranking using semi-supervised link classification. In ICWSM’08. Livne, A.; Simmons, M. P.; Adar, E.; and Adamic, L. A. 2011. The party is over here: Structure and content in the 2010 election. In ICWSM’11. O’Connor, B.; Balasubramanyan, R.; Routledge, B. R.; and Smith, N. A. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In ICWSM’10. Pennacchiotti, M., and Popescu, A.-M. 2011a. Democrats, republicans and starbucks afficionados: User classification in twitter. In KDD’11. Pennacchiotti, M., and Popescu, A.-M. 2011b. A machine learning approach to twitter user classification. In ICWSM’11. Raghavan, U. N.; Albert, R.; and Kumara, S. 2007. Near linear time algorithm to detect community structures in largescale networks. Physical Review E. Tumasjan, A.; Sprenger, T. O.; Sandner, P. G.; and Welpe, I. M. 2010. Predicting elections with twitter : What 140 characters reveal aboutpolitical sentiment. In ICWSM’10. Zhou, D. X.; Resnick, P.; and Mei, Q. 2011. Classifying the political leaning of news articles and users from user votes. In ICWSM’11.

What’s in Twitter: I Know What Parties are Popular and Who You are Supporting Now! Antoine Boutet INRIA Rennes Bretagne Atlantique Rennes, France [email protected]

Hyoungshick Kim (University of British Columbia, Vancouver, Canada), [email protected]
Eiko Yoneki (University of Cambridge, Cambridge, United Kingdom), [email protected]

Abstract—In modern politics, parties and individual candidates must have an online presence and usually have dedicated social media coordinators. In this context, we study the usefulness of analysing Twitter messages to identify both the characteristics of political parties and the political leaning of users. As a case study, we collected the main stream of Twitter related to the 2010 UK General Election during the associated period – gathering around 1,150,000 messages from about 220,000 users. We examined the characteristics of the three main parties in the election and highlighted the main differences between parties. First, Labour members were the most active and influential during the election while Conservative members were the most organized to promote their activities. Second, the websites and blogs that each political party’s members supported are clearly different from those that all the other political parties’ members supported. From these observations, we develop a simple and practical classification method which uses the number of Twitter messages referring to a particular political party. The experimental results showed that the proposed classification method achieved about 86% classification accuracy and outperforms other classification methods that require expensive costs for tuning classifier parameters and/or knowledge about network topology.

I. I NTRODUCTION Social media such as Facebook and Twitter have revolutionised the way people communicate with each other. Users generate a constant stream of online messages through social media to share and discuss their activities, status, opinions, ideas and interesting news stories; social media might be an effective means to examine trends and popularity in topics ranging from economic, social, environmental to political issues [1], [2]. In modern politics, political parties must have an online presence. In this context, monitoring social media can help parties and individual candidates to measure the success of their political campaigns and then refine their strategies. We are particularly interested in this paper in how to identify the characteristics of political parties and the political leaning of users in social media. To illustrate the practicality of our analysis, we used a dataset formed of collected messages from Twitter, which is a popular social network and microblogging service that enables its users to broadcast and share information within posts of up to 140 characters,


called tweets. We gathered around 1,150,000 messages from the main stream of Twitter related to the 2010 UK General Election between the 5th and the 12th of May, from about 220,000 users in Twitter. We first examined the characteristics of the three main parties (Labour, Conservative, Liberal Democrat) in the election and discussed the main differences between parties in terms of activity, influence, structure, interaction, contents, mood and sentiment. Our results demonstrated that Labour members were the most active and influential in Twitter during the election, while Conservative members were the most organized in promoting their activities. Also, the websites and blogs that each political party's members frequently referred to are clearly different from those that the other political parties' members referred to. Through this intensive analysis of the users with political interests, we develop a simple and practical algorithm to identify the political leaning of users in Twitter – the messages expressing the user's political views (i.e. tweets referring to a particular political party) are used to estimate the overall political leaning of users. To demonstrate the effectiveness of the proposed heuristic model, we evaluated the performance of the proposed classification method based on a ground truth dataset composed of users who reported their political affiliation in their profile. The experimental results showed that our method – which uses the number of tweets referring to a particular political party – achieved about 86% classification accuracy using all trials, which outperforms the best known classification methods (see [3], [4], [5]), which require expensive parameter tuning to construct a classifier and/or knowledge about the network topology. Although some classification algorithms based on network topology performed well, these may indeed be unacceptable or very expensive: crawling topology information is strictly limited in practice. Our approach has three key advantages: (1) as we only process the messages relevant to a particular event rather than the whole dataset at one time, it dramatically reduces the computation costs of constructing a classifier compared with existing approaches – the huge computational overhead they impose for large training sets is likely to be nontrivial, and

they may indeed be unacceptable for online classification; (2) the proposed method does not require knowledge about the network topology, unlike some classification methods based on community structure [6], [5]; (3) it also has potential: we can discover the temporal trends of a user's political views by analysing her political leaning over time.
II. DATASET FOR THE UK GENERAL ELECTION
The UK General Election took place on May 6th, 2010, and was contested by the three major parties: the Labour party led by Gordon Brown, the Conservative Party led by David Cameron, and the Liberal Democrat (LibDem) party led by Nick Clegg. Although exit polls and initial results were released on the night of the 6th, the final outcome of the election, due to the UK parliamentary system, was not clear until the 11th of May, when Gordon Brown resigned and David Cameron became prime minister, announcing that he would attempt to form a coalition with the Liberal Democrats. We collected all tweets published on the top trending topics related to the UK election between the 5th and 12th of May, and kept only the 419 topics which have over 10,000 tweets. The resulting dataset gathers more than 220,000 users for almost 1,150,000 tweets. Figure 1 shows how the volume of tweets referring to each party changed in response to the major events that occurred over the election period.
[Figure 1: Tweets volume and references to party after the exit polls.]
The collected messages include about 168,000 mentions (direct messages to another user), 290,000 retweets (forward messages to its followers), 515,000 hashtags (tags used to define topics) and 25,000 distinct URLs. For these users, we also collected their profiles and about 79,000,000 following/follower relationships. For some users, their profiles can be used to identify their political party affiliation (with manual checks). We called them self-identified members. We used the associated 633 Labour, 231 Conservative and 297 LibDem self-identified members as a ground truth dataset to evaluate the performance of classification methods. Furthermore, we also

collected location information for about 42,000 users, including 27,000 users in the UK, from their profiles.
III. PARTY CHARACTERISTICS
In this section we analysed the characteristics of the Labour, Conservative and LibDem parties to find the relevant features for a user's party affiliation. To have a larger set of users to observe than the collected ground truth information, we first detected the communities associated with each political party. To achieve that, we used a well-known technique called the label propagation method [6] on the retweet structure. This technique is very reasonable – people usually retweet tweets they like (i.e. tweets expressing a similar political opinion in our context), and thus form a highly clustered structure according to parties in a retweet graph. [7] recently verified this idea in politics on Twitter. Here, the label propagation method spreads affiliations from ground truth users called seeds throughout the retweet graph – we label a user with the party affiliation according to the seeds that have reached it. We performed the label propagation until the greatest propagation distance k which avoids tie-breaking cases (i.e. multiple nearest nodes with different party memberships exist at the same time). This is achieved for k = 2, which permitted us to detect 5,878 Labour, 3,214 LibDem and 2,356 Conservative candidates. We tested the performance of this heuristic: one-tenth of the ground truth users (115) was used as the seed users and the rest (1,046) was reserved for testing. This heuristic produced a high accuracy of 0.77, 0.78 and 0.90 respectively, for an average of 0.82. With these candidates, we analyzed the following characteristics of each party: (i) activity, (ii) influence, (iii) structure/interaction, (iv) content and (v) sentiment features.
A. Activity
The amount of messages about the political issues in Twitter can be used for measuring the activities of political parties. The activity level of parties can be measured in

different functions: the content generation is measured by the number of tweets; the content relay is quantified by the number of retweets; and the participation in political debates is evaluated by the number of replies and mentions. Figure 2 shows the Complementary Cumulative Distribution Function (CCDF), defined as $\bar{F}(x) = P(X > x) = 1 - F(x)$ where $F(x)$ is the cumulative distribution, for these metrics. Interestingly, the Labour members generated more tweets and replies than those of the other parties while the Conservative members sent much more mentions than other parties. The LibDem party exhibited a smaller activity for retweets.
[Figure 2: CCDF for the activity metrics. (a) Tweets; (b) Retweets; (c) Replies; (d) Mentions.]
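The CCDF defined above can be estimated from raw per-user counts as in the following sketch; the helper name `ccdf` is illustrative and not from the paper.

```python
import numpy as np

def ccdf(samples):
    """Empirical CCDF: for each distinct observed value v, return P(X > v)."""
    s = np.asarray(samples, dtype=float)
    xs = np.sort(np.unique(s))
    ps = np.array([(s > v).mean() for v in xs])
    return xs, ps

# Toy usage: number of tweets per user.
tweets_per_user = [1, 1, 2, 5, 5, 9, 40]
for x, p in zip(*ccdf(tweets_per_user)):
    print(int(x), round(p, 3))
```

Plotting these pairs on log-log axes gives curves of the kind summarised in Figures 2 and 3.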

[Figure 3: CCDF for the influence metrics. (a) Followers; (b) Following; (c) Star; (d) Listed; (e) Retweeted; (f) Mentioned.]

B. Influence
The potential impact in terms of visibility and information spread can be leveraged to evaluate the influence of each party. The numbers of following/followers are used to measure the size of the audience of members; the star metric, defined as the ratio followers/following, is used to evaluate the behaviour and the visibility of members in a party – information providers or stars tend to follow few while being followed by many (high star ratio), whereas consumers tend to follow many while being followed by few people (low star ratio); the number of Lists (a Twitter feature that provides a feed gathering the activities of a group of people) is used to measure the level of organization and promotion of the political parties; and the numbers of times users of each party have been retweeted and mentioned are useful to evaluate the effective influence of parties. Our analysis demonstrates that all metric values of the Labour members are significantly higher than those of the other two political parties except for the Lists (see Figure 3). Probably, the Labour party benefited from more content providers than Conservative and LibDem, generating a large number of tweets (correlated with Figure 2a) which were widely followed, retweeted and mentioned. On the other hand, Conservative members were those who most frequently used the Twitter Lists feature and were probably the most organized in promoting their activities during the election.
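The star ratio used above is a simple quotient; a minimal sketch, assuming follower and following counts are available per user (the helper name is illustrative):

```python
def star_ratio(followers_count, following_count):
    """Star ratio = followers / following; high values suggest information
    providers ('stars'), low values suggest mostly-consuming accounts."""
    if following_count == 0:
        return float("inf") if followers_count > 0 else 0.0
    return followers_count / following_count

# Toy usage: a widely followed account vs. a mostly consuming one.
print(star_ratio(12000, 150))  # ~80.0 -> information provider
print(star_ratio(80, 900))     # ~0.09 -> consumer
```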

C. Structure and Interaction
We also studied the differences between the political parties in network structure and interaction patterns. The structure and the interaction patterns between members within a party reflect a level of party cohesion, while the interaction patterns between different communities reflect the exchanges (i.e. conflict or collaboration) between them. Table I shows some properties (the average degree, the average clustering coefficient and the size of the largest strongly connected component) of the following/followers graph for each party. The Labour members formed a larger network structure and also had a high average degree compared with the other two parties. Interestingly, however, the structure of the LibDem (0.3890) and Conservative (0.3549) members was much more clustered than that of the Labour members (0.2562).

Dataset statistics    Labour    LibDem    Conservative
Nodes                 5,878     3,214     2,356
Edges                 92,581    32,586    24,949
Size of LSCC          5,157     2,418     2,183
Average degree        31.5      20.3      21.3
Average CC            0.2562    0.3890    0.3549

Table I: Graph properties for each party.
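Statistics of the kind shown in Table I can be recomputed from a party's following/followers edge list with a sketch like the following, assuming networkx is available; the helper name `party_graph_stats` and the exact degree convention are assumptions, not the authors' code.

```python
import networkx as nx

def party_graph_stats(edges):
    """Table I style statistics for one party's following/followers subgraph.

    edges: directed (follower, followee) pairs restricted to the party's
    detected members.
    """
    g = nx.DiGraph()
    g.add_edges_from(edges)
    lscc = max(nx.strongly_connected_components(g), key=len)
    return {
        "nodes": g.number_of_nodes(),
        "edges": g.number_of_edges(),
        "size_lscc": len(lscc),
        # in-degree + out-degree averaged over nodes
        "avg_degree": sum(d for _, d in g.degree()) / g.number_of_nodes(),
        # clustering coefficient computed on the undirected version
        "avg_cc": nx.average_clustering(g.to_undirected()),
    }

# Toy usage with a five-edge graph.
print(party_graph_stats([(1, 2), (2, 1), (2, 3), (3, 1), (4, 1)]))
```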


In addition to the following/followers graph, we also observed the amount of interactions between political parties by counting the number of exchanged retweets and mentions between them during the election period (Figure 4). According to the detected communities described above, we can see that there was no retweet exchanged between different political parties. In contrast, mentions between different parties were used more frequently. We can also see that few interactions were observed between the Labour and LibDem members, in contrast to the high rate of interactions between Conservative and both Labour and LibDem. We surmise that the suggested coalition between Conservative and LibDem generated more discussions among members of both parties than between Labour and LibDem.

[Figure 4: Exchanged messages between parties.]
Finally, we analysed the correlation between social interaction (i.e. retweets and mentions) and geographical distance in each party. Not reported here for space reasons, we found that all political parties showed similar behaviour and mainly interacted with close users (around 50% of the interactions were performed with users located less than 50 kilometres away).
D. Content
We analysed the contents of tweets by counting the number of hashtags and URLs used in tweets for each party (see Figure 5). We can see that the political parties showed a similar behaviour for the number of used URLs, while Labour members used more varied hashtags in their tweets compared to the other parties.
[Figure 5: CCDF for the content metrics. (a) URLs; (b) Hashtags.]
Table II shows the ten most commonly used hashtags and their associated usage rates per party. The usage rates of neutral hashtags indicating the UK election remained at a similar level between all parties, while non-neutral hashtags were more or less used depending on their underlying meaning. For instance, about 80% of the hashtag #imvotinglabour and about 7% of the hashtag #imnotvotingconservative were used by the Labour and Conservative members, respectively.

Hashtags                     times     Labour    LibDem    Conserv.
#ge2010                      39,742    0.34      0.36      0.28
#ukelection                  13,506    0.31      0.27      0.40
#ukvote                      6,332     0.35      0.34      0.29
#ge10                        4,936     0.40      0.27      0.32
#GE2010                      4,642     0.34      0.27      0.38
#imnotvotingconservative     1,903     0.50      0.41      0.07
#electionday                 1,586     0.36      0.27      0.36
#dontdoitnick                1,097     0.63      0.25      0.10
#imvotinglabour              904       0.80      0.05      0.14
#ukelection2010              795       0.40      0.26      0.32

Table II: Ten most commonly used hashtags.

We also analysed the hashtag similarity between users to evaluate the content homogeneity of each party. For a user, we defined a vector containing the frequencies of hashtags used in the user's tweets and then computed the cosine similarity between each pair of users. Not reported here for space reasons, we found that the average similarity is overall low regardless of political party affiliation. That is, these results imply that Twitter users have heterogeneous behaviour in their use of hashtags. By analysing the URLs mentioned in tweets, we can identify the preferred websites of each party. Table III shows the ten most commonly used websites and their associated usage rates per party. Interestingly, LibDem members more frequently referred to the Financial Times, The Independent and the BBC compared with the other party members.

Websites                  times    Labour    LibDem    Conserv.
www.guardian.co.uk        532      0.37      0.34      0.28
www.youtube.com           484      0.30      0.31      0.37
twitpic.com               467      0.40      0.33      0.25
news.bbc.co.uk            314      0.26      0.43      0.25
yfrog.com                 261      0.45      0.38      0.16
www.voterpower.org.uk     241      0.42      0.35      0.21
www.independent.co.uk     173      0.37      0.51      0.11
blogs.ft.com              137      0.24      0.69      0.05
sphotos.ak.fbcdn.net      115      0.27      0.47      0.24
www.telegraph.co.uk       83       0.38      0.32      0.28

Table III: Ten most commonly used URLs.
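A small sketch of the hashtag-similarity computation described above (per-user hashtag frequency vectors compared with cosine similarity); the input format and the helper name are assumptions, not the paper's code.

```python
import math
from collections import Counter

def hashtag_cosine(tweets_a, tweets_b):
    """Cosine similarity between two users' hashtag-frequency vectors.

    tweets_a / tweets_b: lists of hashtag lists, one list per tweet.
    """
    va = Counter(h for tweet in tweets_a for h in tweet)
    vb = Counter(h for tweet in tweets_b for h in tweet)
    dot = sum(va[h] * vb[h] for h in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Toy usage with two users.
u1 = [["#ge2010", "#imvotinglabour"], ["#ge2010"]]
u2 = [["#ge2010"], ["#ukelection"]]
print(round(hashtag_cosine(u1, u2), 3))  # ~0.632
```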

We also particularly observed the blogs, which are usually more politically oriented. Only blogs using the most popular frameworks (blogspot.com, livejournal.com, wordpress.com, typepad.com) have been taken into account. We compared the usage rates of these blogs between parties. Not reported due to space limitations, we observed very few overlaps of the referenced blogs between the parties. This result may confirm the highly segregated structure of the blogosphere according to political parties reported in [8]. Finally, we measured the volume of references to a specific party included in tweets. We considered only the tweets referring to the name of one party or its leader, as such tweets are more likely to reflect the allegiance or interest of the users. Figure 6 illustrates the relative volumes of references to parties according to each party. These results clearly show that users were more likely to frequently refer to their own preferred party or leader.


[Figure 6: CCDF for the volume of references. (a) Labour; (b) LibDem; (c) Conservative.]
E. Sentiment
We evaluated the sentiment of words used in tweets. To extract this information we used the Linguistic Inquiry Word Count (LIWC). LIWC is a dictionary of words used in everyday conversations which assesses the emotional, cognitive and structural components of a text sample. After removing the URLs and hashtags from the collected tweets, LIWC matches the words expressing positive (i.e. happy, good) and negative emotions (i.e. out, hate). Then, the sentiment for a given tweet was given by the sentiment score proposed by Kramer [9]:

$$\text{Sentiment} = \frac{p_i - \mu_p}{\sigma_p} - \frac{n_i - \mu_n}{\sigma_n}$$

where $p_i$ ($n_i$) is the fraction of positive (negative) words for user $i$; $\mu_p$ ($\mu_n$) is the average fraction of positive (negative) words across all users; and $\sigma_p$ ($\sigma_n$) is the corresponding standard deviation. Table IV shows the average sentiment score over tweets referring to a party. It is clearly shown that better sentiment was expressed in tweets when users referred to their own preferred party or leader in the tweets.

Party           Reference to     Average emotion score
Labour          Labour           1.09
Labour          LibDem           0.03
Labour          Conservative     0.32
LibDem          Labour           -0.21
LibDem          LibDem           0.34
LibDem          Conservative     0.00
Conservative    Labour           -0.08
Conservative    LibDem           -0.14
Conservative    Conservative     1.36

Table IV: Sentiment on the references to party.
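The Kramer sentiment score used above can be computed as in the following sketch, assuming the positive and negative word fractions have already been extracted with LIWC; the helper name and toy values are illustrative.

```python
import statistics

def kramer_sentiment(pos_frac, neg_frac, all_pos, all_neg):
    """Kramer-style sentiment score for one user's text.

    pos_frac / neg_frac: fractions of positive and negative LIWC words in
    this user's text; all_pos / all_neg: the same fractions over all users,
    used to standardise the score.
    """
    mu_p, sd_p = statistics.mean(all_pos), statistics.pstdev(all_pos)
    mu_n, sd_n = statistics.mean(all_neg), statistics.pstdev(all_neg)
    return (pos_frac - mu_p) / sd_p - (neg_frac - mu_n) / sd_n

# Toy usage: a user slightly more positive and less negative than average.
all_pos = [0.02, 0.04, 0.05, 0.03]
all_neg = [0.01, 0.03, 0.02, 0.02]
print(round(kramer_sentiment(0.06, 0.01, all_pos, all_neg), 2))
```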

IV. USER CLASSIFICATION
In this section we present a new user classification approach based on the observations in the previous section. Our goal is to identify the party to which a user belongs. We particularly focus on developing a classification method that does not require knowledge about the network topology. For this purpose, we propose an incremental Bayesian approach which requires only a user's tweet messages over time. We will show that this approach performs well by evaluating the performance of the classification method.
A. Bayesian Classification
Without loss of generality, we assume that a sequence of tweet activities (e.g. retweets or references to a specific party/leader in tweets) by a user is divided into $n$ subsequences, where the $k$th subsequence corresponds to the tweet activities during the $k$th time interval. For a user $u$, we use $A_k(u)$ and $M_k^i(u)$ to denote the $k$th subsequence (i.e., the tweet activities performed by the user $u$ during the $k$th time interval) and the 0-1 binary variable indicating user $u$'s membership of the party $i$ after the $k$th time interval (i.e., $M_k^i(u) = 1$ when $u$ is a member of the party $i$), respectively, where $1 \le k \le n$ and $i \in \{labour, libdem, conservative\}$. We also use $P(M_k^i(u))$ to denote the probability of user $u$ being a member of the party $i$ after the $k$th time interval. We assume that every user belongs to exactly one of the parties: $\sum_i P(M_k^i(u)) = 1$. After the $n$th time interval, we classify the user $u$ as a member of the party $j$ where $P(M_n^j(u)) = \max_i \{P(M_n^i(u))\}$. For example, when the affiliation probability distribution for the user $u$ after the $n$th time interval is given as $[0.7, 0.2, 0.1]$, we classify the user $u$ as a member of the Labour party. We randomly choose the user $u$'s party in case of an equiprobable distribution. We now focus on how to compute $P(M_k^i(u))$. At each time interval, for each $i \in \{labour, libdem, conservative\}$, $P(M_k^i(u))$ is updated stochastically according to its probability distribution, relying on the user's tweet activities during the time interval. Before the first inference step, the initial prior affiliation probability of the user $u$ is set uniformly: $P(M_0^i(u)) = \frac{1}{3}, \forall i$. After the $k$th time interval, $P(M_k^i(u) \mid A_k(u))$ can be calculated by using Bayes' theorem as follows:

$$P(M_k^i(u) \mid A_k(u)) = \frac{P(A_k(u) \mid M_k^i(u))\, P(M_k^i(u))}{\sum_j P(A_k(u) \mid M_k^j(u))\, P(M_k^j(u))}$$

where $P(M_k^i(u) \mid A_k(u))$ is the posterior of user $u$, the uncertainty of $M_k^i(u)$ after $A_k(u)$ is observed; $P(M_k^i(u))$ is the prior, the uncertainty of $M_k^i(u)$ before $A_k(u)$ is observed; and $\frac{P(A_k(u) \mid M_k^i(u))}{P(A_k(u))}$ is a factor representing the impact of $A_k(u)$ on the uncertainty of $M_k^i(u)$. To calculate $P(A_k(u) \mid M_k^i(u))$, we consider the frequency of references to political parties in tweets for $A_k(u)$, based on the observation in the previous section. (We have tested other potential alternatives; given the space limitations, we describe the one that led to the best classification performance.) We can see that a user $u$ more frequently generates tweet messages referring to the political party (or party leader) that the user $u$ is supporting. For this activity, we assume $P(A_k(u) \mid M_k^i(u))$ can be calculated as follows:

$$P(A_k(u) \mid M_k^i(u)) = \frac{\sum_{t \in T} V_i(t)}{|T|}$$

where $T$ is the set of tweets of the current user during the period and $V_i(t)$ is equal to 1 if the tweet $t$ makes a reference to the political party $i$, and 0 otherwise. We use Bayesian to denote this Bayesian classification.
B. Evaluation
The aim of our experiment was to demonstrate the feasibility and effectiveness of the proposed classification approach compared with other popularly used classification methods. For comparison, we also tested the performance of the following classification methods:
• Volume classifier: As we observed, the volume of references to a specific party can reflect the political leaning of the user. We simply counted the frequencies of references to parties (or party leaders) in a user's tweets and then assigned the most frequently referenced party as the user's political party.
• Sentiment classifier: As we observed, a user is more likely to express a good emotion in the user's tweets for a party when the user prefers that party. We compute a user's sentiment scores for the parties through sentiment analysis of the user's tweets and then assign the party with the best average emotion score as the user's political party.
• Retweet classifier: As the retweet structure is highly segregated according to the party, the retweet graph can be used to predict users' affiliation. This approach detects the communities of users using a label propagation method [6] on the retweet graph. In the label propagation process, each user's party is classified with the majority party among the user's neighbours. Ties can be broken according to the volume of references to a party.



From the initial seed users (self-identified members), we iteratively apply this process until all users' parties are classified.
• Follower classifier: The relationship of following and being followed in Twitter can reflect the political leanings of users as well [5]. Compared to the previous classifier, this one uses the followers graph to propagate the probability of being a member of a certain political party from the selected ground truth users. The inferred probabilities are computed as the average probabilities over all the people a user follows.
• SVM classifier: Support Vector Machine (SVM) is known as one of the best supervised learning techniques for solving classification problems with a high dimensional feature space and a small training set size. We constructed an SVM classifier using the following six features of a user proposed in [3], [10]: (i) the list of followers, (ii) the list of friends, (iii) the list of retweeted users, (iv) the list of used words in the user's tweets, (v) the list of used hashtags in the user's tweets, and (vi) the emotion over the user's tweets.
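A simplified sketch of the label-propagation step used by the Retweet classifier described above, assuming an undirected retweet graph built with networkx; it uses plain iterative majority voting rather than the paper's exact distance-bounded procedure with volume-based tie-breaking, and all names are illustrative.

```python
import networkx as nx

def label_propagation(retweet_graph, seeds, max_rounds=10):
    """Propagate party labels from seed users over a retweet graph.

    At every round, an unlabelled user takes the majority label among its
    already-labelled neighbours; seeds keep their labels.
    """
    labels = dict(seeds)  # user -> party for the self-identified members
    for _ in range(max_rounds):
        updates = {}
        for node in retweet_graph.nodes():
            if node in labels:
                continue
            neighbour_parties = [labels[n] for n in retweet_graph.neighbors(node)
                                 if n in labels]
            if neighbour_parties:
                # majority vote; a real tie-break would use reference volume
                updates[node] = max(set(neighbour_parties),
                                    key=neighbour_parties.count)
        if not updates:
            break
        labels.update(updates)
    return labels

# Toy usage with two seeds and three unlabelled users.
g = nx.Graph([("seed1", "u1"), ("u1", "u2"), ("seed2", "u3")])
print(label_propagation(g, {"seed1": "labour", "seed2": "libdem"}))
```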

To show the performance of the classifiers, we measured their accuracy for the self-identified users (1,161). The classification accuracy is defined as the ratio between the number of correctly predicted samples and the total number of testing samples; the results are shown in Table V. Classifiers used tweets and relationships related to these self-identified users. These users published 27,696 tweets, formed a followers graph of 135,786 users with 7,113,860 edges, and a retweet structure composed of 89,942 users with 286,614 retweets. Some classifiers (Follower, Retweet, and SVM) require a training step used to learn the features determining political party membership and/or knowledge about the network topology. Training samples are composed of one-tenth of the ground truth users (115) to construct the classifiers, and the rest (1,046) was reserved for out-of-sample testing. (We have tested different strategies to select the training sample: random users, the most active users, and the most influential users; the reported experiments used a training sample composed of the most influential users, which gave the best accuracy.)

Classifier    Accuracy
Volume        0.62
Sentiment     0.67
Follower      0.83
Retweet       0.81
SVM           0.77
Bayesian      0.86

Table V: Performance according to approach.

Although the performance of the Bayesian method computed only once at the end of the period is not as strong as some other candidates (accuracy of 0.64 in this case), it outperforms all classification methods when it leverages its incremental approach over time with 10 updates of

the users' affiliation probabilities during the period (accuracy of 0.86). We used a fixed time interval of 15 hours to periodically update the users' affiliation probabilities according to their tweets in the associated interval. We note that this classification benefits from two advantages. Firstly, it requires maintaining only the affiliation probability of each user, without massive training overheads; secondly, as only the information about references to a party or a leader in tweets is needed, incremental computation is significantly faster. These important advantages make it possible to use this solution in real time. Therefore, we recommend that Bayesian be used as an alternative when the conditions do not allow the use of Follower, which requires knowledge about the network topology to achieve good results and may indeed be unacceptable or very expensive: crawling topology information is strictly limited in practice. Contrary to our expectations, SVM, which involves an expensive tuning phase, did not outperform the other algorithms. We also analysed how the number of partisans of each party and the accuracy of the proposed Bayesian classifier change with time. The results are shown in Figure 7. We can see that the Conservative members outnumber the Labour and LibDem members at the end of the election. Inherently, the accuracy of Bayesian starts at 1/2 (equiprobability), continuously increases with time, and reaches 0.86. These results imply that the proposed Bayesian approach is appropriate for understanding users' political leaning over time.
[Figure 7: Dynamic changes of the Bayesian classifier. (a) Numbers of members; (b) Classification accuracy.]
V. RELATED WORK
The exponential growth and the ubiquitous trend of social media have attracted much attention. Classification: Different approaches have been proposed for classifying users in many directions. [11] presented a semi-supervised algorithm for classifying political blogs. [4] also applied three semi-supervised algorithms for classifying political news articles and users, respectively. Their propagation algorithm notably achieved an accuracy of 99%, which is higher than the accuracy results of this paper. This is because we used only 10% of the dataset as initial seeds while they used 90% of the dataset as initial seeds. [5] presented a method that uses the follower connections in Twitter to compute political preferences. This method achieved results similar to the label propagation method on the retweet graph in this paper.

[8] studied the linkage patterns between political blogs and confirmed the hypothesis – the limited degree of contacts which may take place between the members of different social groups – which was suggested in [12]. They found that the blogosphere exhibits a politically segregated community structure with more limited connectivity between different communities. Recently, [7] observed a similar structure in a retweet graph of Twitter in politic context. Other classifications used machine learning methods to infer information on users. [3] demonstrated the possibility of user classification in Twitter with the three different classifications: political affiliation detection, ethnicity identification and detecting, affinity for a particular business. Their best algorithm achieved the accuracy of about 88.9% for political affiliation. We note that their results might be overestimated compared with ours because the results were for binaryclass classification. [10] used Gradient Boosted Decision Trees which is a machine learning technique for regression problems, which produces a prediction model in the form of an ensemble of decision trees. In this paper, we tested several classification methods in order to demonstrate that our proposed method has a comparable performance to the best known classification methods [3], [4], [5] that require expensive costs for tuning of parameters to construct classifier and/or the knowledge about network topology. This is an extended paper of our preliminary work [13]. Characterization: Characterization aims to identify the main characteristics of population. Several studies have addressed to characterise user behaviour or personality in social networks [14], [15]. However few works have tried to study the characteristics of politic parties and the interaction structure between parties. [16], [17] showed that interactions between dislike-minded groups in social media expose people to multiple points of views and promote diversity and thus tend to reduce extreme behaviours. [18] studied the usage patterns of tweets about the candidates in the 2010 U.S. midterm elections and showed stronger cohesiveness among Conservative and Tea party. Prediction: Other studies have addressed the predictive power of the social media. [19] demonstrated how social media contents can be used to predict real-world outcomes and outperformed market-based predictor variables. In Politics, [18] has investigated the relation between the network structure and tweets and presented a forecast of the 2010 midterm elections in the US. [1] claimed that Twitter can be considered as a valid indicator of political opinion and found that the mere number of messages mentioning a party reflects the election result through a case study of the German federal election. However [20] demonstrated that this result was not repeatable with the 2010 US congressional elections. Sentiment analysis: [21] used sentiment analysis to compare Twitter streams with polls in different areas and

showed the correlation on some points. [22] studied the links between the degree of expressed sentiment and influence of users in Twitter and suggested that Twitter users are influenced by those who express negative emotions. [23] showed that tweets can be used to track real-time sentiment about candidates’ performance during a televised debate. [24] also analysed the correlation between the sentiment of tweets in a community and the community’s socio-economic well-being. In addition, they proposed a machine learning technique to learn new positive and negative words for their dictionary of words reflecting people’s emotional and cognitive perceptions. VI. C ONCLUSION Existing classification methods are generally based on the assumption that the data conforms to a stationary distribution. Since the statistical characteristics of the real-world data continuously changes over time, this assumption may lead to degrade the predictive performance of a classification model when the characteristics of dataset are dynamically changed. To address this weakness, we proposed a new user classification approach using Bayesian framework which can incrementally update the classification results with time. As a case study, we first analysed the characteristics of the political parties in Twitter during the 2010 UK General Election and identified three main ways to differentiate political parties: (i) the retweet graph presented a highly segregated partisan structure (ii) party members were more likely to make reference to their own party than another, and (iii) members were more likely to express more positive opinions when they referenced their own party. Through these party characteristics, we built a classification algorithm based on Bayesian framework to compute political preferences of users. The experimental results showed that the proposed classification method is capable of achieving an accuracy of 86% without any training and network topology information which make it a proper solution for real time classification. R EFERENCES [1] A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe, “Predicting elections with twitter : What 140 characters reveal aboutpolitical sentiment,” in ICWSM’10, 2010. [2] M. Cha, H. Haddadi, F. Benevenuto, and P. Gummadi, “Measuring User Influence in Twitter: The Million Follower Fallacy,” in ICWSM’10, 2010.

[3] M. Pennacchiotti and A.-M. Popescu, "Democrats, republicans and starbucks afficionados: User classification in twitter," in KDD'11, 2011. [4] D. X. Zhou, P. Resnick, and Q. Mei, "Classifying the political leaning of news articles and users from user votes," in ICWSM'11, 2011. [5] J. Golbeck and D. Hansen, "Computing political preference among twitter followers," in CHI'11, 2011. [6] U. N. Raghavan, R. Albert, and S. Kumara, "Near linear time algorithm to detect community structures in large-scale networks," Physical Review E, 2007. [7] M. Conover, J. Ratkiewicz, M. Francisco, B. Gonçalves, A. Flammini, and F. Menczer, "Political polarization on twitter," in ICWSM'11, 2011. [8] L. Adamic and N. Glance, "The political blogosphere and the 2004 U.S. election: Divided they blog," in LinkKDD'05, 2005. [9] A. D. Kramer, "An unobtrusive behavioral model of 'gross national happiness'," in CHI'10, 2010. [10] M. Pennacchiotti and A.-M. Popescu, "A machine learning approach to twitter user classification," in ICWSM'11, 2011. [11] F. Lin and W. W. Cohen, "The multirank bootstrap algorithm: Self-supervised political blog classification and ranking using semi-supervised link classification," in ICWSM'08, 2008. [12] M. Hewstone and R. Brown, Contact is not Enough: An Intergroup Perspective on the "Contact Hypothesis", 1986. [13] A. Boutet, H. Kim, and E. Yoneki, "What's in your tweets? I know who you supported in the UK 2010 general election (poster paper)," in ICWSM'12. [14] F. Benevenuto, T. Rodrigues, M. Cha, and V. Almeida, "Characterizing user behavior in online social networks," in IMC'09, 2009. [15] D. Quercia, R. Lambiotte, D. Stillwell, M. Kosinski, and J. Crowcroft, "The personality of popular facebook users," in CSCW'12, 2012. [16] S. Yardi and D. Boyd, "Dynamic debates: An analysis of group polarization over time on twitter," Bulletin of Science, Technology and Society, 2010. [17] J. An, M. Cha, K. Gummadi, and J. Crowcroft, "Media landscape in Twitter: A world of new conventions and political diversity," in ICWSM'11, 2011. [18] A. Livne, M. P. Simmons, E. Adar, and L. A. Adamic, "The party is over here: Structure and content in the 2010 election," in ICWSM'11, 2011. [19] S. Asur and B. A. Huberman, "Predicting the future with social media," in WI-IAT'10, 2010. [20] D. Gayo-Avello, P. T. Metaxas, and E. Mustafaraj, "Limits of electoral predictions using twitter," in ICWSM'11, 2011. [21] B. O'Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith, "From tweets to polls: Linking text sentiment to public opinion time series," in ICWSM'10, 2010. [22] D. Quercia, J. Ellis, L. Capra, and J. Crowcroft, "In the mood for being influential on Twitter," in SocialCom'11, 2011. [23] N. A. Diakopoulos and D. A. Shamma, "Characterizing debate performance via aggregated twitter sentiment," in CHI'10, 2010. [24] D. Quercia, J. Ellis, L. Capra, and J. Crowcroft, "Tracking gross community happiness from tweets," in CSCW'12, 2012.


Appendix C – [Influential Neighbours Selection for Information Diffusion] This appendix contains the reprint of the following paper: H. Kim and E. Yoneki, Influential Neighbours Selection for Information Diffusion in Online Social Networks. IEEE International Conference on Computer Communication Networks (ICCCN), Munich, Germany, June 2012.


Influential Neighbours Selection for Information Diffusion in Online Social Networks

Hyoungshick Kim (University of British Columbia, Email: [email protected]) and Eiko Yoneki (University of Cambridge, Email: [email protected])

Abstract—The problem of maximizing information diffusion through a network is a topic of considerable recent interest. A conventional formulation is to select a set of k arbitrary nodes as the initially influenced nodes so that they can effectively disseminate the information to the rest of the network. However, this model is usually unrealistic in online social networks, since we typically cannot choose arbitrary nodes in the network as the initially influenced nodes. From the point of view of an individual user who wants to spread information as widely as possible, a more reasonable model is to initially share the information with only some of the user's neighbours rather than with a set of arbitrary nodes; but how can these neighbours be chosen effectively? We empirically study how to design more effective neighbours selection strategies to maximize information diffusion. Our experimental results, obtained through intensive simulation on several real-world network topologies, show that an effective neighbours selection strategy is to use node degree information for short-term propagation, while a naive random selection is also adequate for long-term propagation that aims to cover more than half of a network. We also discuss the effects of the number of initially activated neighbours. If we select the highest-degree nodes as the initially activated neighbours, the number of initially activated neighbours is not an important factor, at least for long-term propagation of information.

I. INTRODUCTION In the field of social network analysis, a fundamental problem is to develop an epidemiological model and then to find an efficient way to spread (or prevent) information, ideas, and infectious disease through the model. It seems natural that many people are often influenced by the opinions of their friends. This is called the "word of mouth" effect and has long been recognised as a powerful force affecting product recommendation. Recent advances in the theory of networks have provided us with the mathematical and computational tools to understand such processes better. For example, in the Independent Cascade (IC) model proposed by Goldenberg et al. [1], some non-empty set of nodes is initially activated (or influenced). At each successive step, the influence is propagated by activated nodes independently activating their inactive neighbours according to the propagation probabilities of the adjacent edges. Here, activated nodes are the nodes which have adopted the information or have been infected. Thus far, however, the models and analytic tools used to analyse epidemics have been somewhat limited. Most previous studies aimed to analyse the characteristics of information diffusion by choosing a set of k arbitrary nodes in a network as the initially activated nodes. However, this model has

assumed full control of nodes in the network and/or complete knowledge of the network topology, which may be unrealistic in many real-life networks: there is no central node that can communicate with arbitrary nodes in the network and/or maintain global knowledge of the network topology. From the point of view of an individual user who wants to efficiently spread information through a network, a more reasonable model is to choose the user's k neighbours as the initially activated nodes instead of a set of k arbitrary nodes. This model is motivated by a practical scenario in online social networking services (e.g. Facebook or Twitter): when a user u wants to advertise new information (or events), what is the best way to propagate the information through the network? The user u can ask those of u's neighbours who seem to be highly influential in the network (e.g. users with many neighbours) to post this information, so that it propagates to their neighbours in turn. That is, in this paper, we seek to answer a simple question: "How can we select k neighbours to maximize information diffusion in a decentralized fashion?" Here, we assume that each user can only communicate with the user's immediate neighbours and has no knowledge about the global network topology except for its own connections. We empirically study this problem through intensive simulation experiments on several real-world network topologies. In particular, we evaluated the performance of four reasonable selection schemes, from a simple random selection strategy to more complex selection strategies which take advantage of knowledge of local connectivity such as node degree. To measure the performance of neighbours selection schemes, we use the Independent Cascade (IC) model [1], which is widely used for the analysis of information diffusion [1], [2], [3]. Our experimental results show that the strategies using local connectivity of nodes produce similar results for a given budget. Thus the more obvious recommendation is to select high-degree neighbours for short-term propagation, since the other strategies based on local connectivity may incur a significant communication overhead without improving performance. Also, even a straightforward (naive) selection method (e.g. sharing information with neighbours chosen at random) can be effective enough to spread information for long-term propagation that covers more than half of a network, or in large networks. In such environments, there may simply be no single good neighbour that maximizes information diffusion. The rest of this chapter is organised as follows. In Section II

we formally define the Influential Neighbours Selection (INS) problem and notation. Then, we present the four reasonable neighbours selection strategies in Section III. In Section IV, we evaluate the performance of the proposed strategies using real-world network topologies, and recommend how they should be used depending on the conditions. Some related work is discussed in Section V. Finally, we conclude in Section VI. II. MODEL AND PROBLEM FORMULATION In this section, we begin with the definition of the Independent Cascade (IC) model [1], and then introduce the Influential Neighbours Selection (INS) problem, which will be used in the rest of the paper. We model an influence network as an undirected graph G = (V, E), where V denotes the node set and E the edge set representing the communication links between node pairs. Each edge (u, v) of the graph G is associated with a propagation probability λ(u, v), which is formalized by a function λ : E → [0, 1]. For simplicity, in this paper, we use a constant propagation probability λ for all edges. For a pair of nodes u and v ∈ V, δ(u, v) denotes the number of hops on the shortest path between u and v; if u is not connected to v, δ(u, v) = ∞. For a node u ∈ V, we use Nh(u) to denote the set of nodes within distance h from u. More precisely, Nh(u) = {v ∈ V \ u : δ(u, v) ≤ h}. When h = 1, we use N(u) instead of N1(u) to denote u's neighbour set. The degree and the clustering coefficient of node u are denoted by d(u) = |N(u)| and c(u), respectively. The clustering coefficient of node u measures the probability that neighbours of node u are also neighbours of each other. It is calculated for u as the fraction of existing edges between the neighbours of u over the number of edges that could possibly exist between these neighbours: c(u) = 2·Δ/(d(u)·(d(u)−1)), where Δ is the number of edges between the neighbours of node u. These two metrics d(u) and c(u) can often be used to analyse u's local connectivity pattern. We assume that the time during which a network is observed is finite, from 1 until t; without loss of generality, the time period is divided into fixed discrete steps {1, . . . , t}. Let Si ⊆ V be the set of nodes that are activated at time step i. We consider the dynamic process of information diffusion starting from the set of nodes S0 ⊆ V that are initially activated until time step t as follows: in the IC model [1], at each time step i, where 1 ≤ i ≤ t, every node u ∈ Si−1 may activate its inactive neighbours v ∈ V \ Si−1 with an independent probability of λ(u, v). The process ends after time step t with St. A conventional Influential Maximization (IM) problem is to find a set S0 of k nodes with the maximum number of activated nodes after time step t, for a budget constraint k. The Influential Neighbours Selection (INS) problem is a variant of the IM problem; for a node u ∈ V and a budget constraint k, we aim to maximize the number of activated nodes in the network after time step t by selecting u's k

neighbours rather than any subset of k nodes as the set of nodes S0 ⊆ V that are initially activated. In this paper, we particularly consider a decentralized version of the INS problem to simulate users in online social networks such as Facebook or Twitter. That is, (1) each node only communicates with its immediate neighbours; formally, a node u ∈ V can only communicate with v ∈ N(u); (2) each node has no knowledge about the global network topology except for its own connections; and (3) each message size is bounded by O(log |V|) bits. III. NEIGHBOURS SELECTION CRITERIA We present the general framework of the INS problem for an online social network G = (V, E) as follows. Assume that a node u ∈ V has some piece of information and wants to efficiently spread this information through the network G by sharing it with its min(k, d(u)) neighbours only. Node u first tries to assess the influence on information diffusion of each neighbour v ∈ N(u) by collecting information about v. We note that v's influence must be estimated from each node's local information only, rather than from the whole network, since u cannot build up the global network topology. As online social networks such as Facebook typically provide APIs to obtain neighbourhood information about a user, u might automatically collect the information about its own neighbours. Although users' personal information can be hidden from outsiders through privacy preference settings, most users typically expose their degree and/or neighbourhood information at least to their neighbours, and therefore u can easily collect this information about its neighbours. After collecting the information about its neighbours, node u estimates their influence and then selects the top min(k, d(u)) nodes with the highest estimated values from N(u) as the most influential neighbours for information diffusion; that is, for the IC model in Section II, they are chosen as the set of initially activated nodes S0 ⊆ V. For the purpose of influence estimation, we test the following four selection strategies based on local connectivity patterns such as node degree and/or clustering coefficient: 1) Random selection: Pick min(k, d(u)) nodes randomly from N(u). • This strategy is very simple and efficient: the user u does not need any knowledge of the network topology. The expected communication cost is O(1). 2) Degree selection: Pick the min(k, d(u)) highest-degree nodes from N(u). • This strategy requires the degree knowledge of neighbours. The expected communication cost is O(κ), where κ is the average degree in the graph. 3) Volume selection: Pick the min(k, d(u)) highest volume-centrality nodes from N(u). Here, the volume centrality of v is defined as the sum of the degrees of all w ∈ Nh(v). This metric was recently proposed by Wehmuth and Ziviani [4]. They experimentally showed that this metric is

highly correlated with the traditional closeness centrality, which measures how quickly a node can communicate with all other nodes in a network. Closeness centrality is calculated for a node u as the average shortest-path length from u to all other nodes in the network. • This strategy requires the degree knowledge of the nodes within distance h from v, and the expected communication cost is O(κ^(h+1)). To calculate the volume centrality for each v ∈ N(u), Σ_{i=0}^{h} κ^i messages are required, where κ is the average degree in the graph. 4) Weighted-volume selection: Pick the min(k, d(u)) highest weighted-volume-centrality nodes from N(u). We extend the volume centrality metric to improve centrality estimation accuracy by additionally considering the relative weights of both the distance δ(v, w) between v and w ∈ Nh(v) and w's individual clustering tendency c(w). As we might expect, the connectivity of closer nodes contributes more than that of far-away nodes, so it is preferable that the relative importance of local connectivity decreases with the distance from the node whose centrality we want to estimate (see Figure 1). All other things being equal, central nodes with low clustering coefficients may also be characterized as 'hubs', since they link neighbouring network parts that would otherwise be disconnected (see Figure 2). • This strategy requires the degree and clustering coefficient knowledge of the nodes within distance h from v, and the expected communication cost is O(κ^(h+1)), which can be derived in the same way as above. (A short code sketch of these four criteria follows.)
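To make the four criteria concrete, the following is a minimal Python sketch (not part of the original paper) of how they could be implemented on top of the networkx library; the function names, the use of networkx, and the ego-graph based neighbourhood computation are our own assumptions.

```python
import random
import networkx as nx

def volume(G, v, h):
    """Volume centrality [4]: sum of the degrees of all nodes within h hops of v."""
    ego = nx.ego_graph(G, v, radius=h, center=False)
    return sum(G.degree(w) for w in ego.nodes())

def weighted_volume(G, v, h):
    """Distance- and clustering-weighted variant (the 'Wei.' estimator in Table I)."""
    dist = nx.single_source_shortest_path_length(G, v, cutoff=h)
    clus = nx.clustering(G)  # clustering over the whole graph, for simplicity
    return sum(G.degree(w) * (1.0 - clus[w]) * (0.5 ** d)
               for w, d in dist.items() if d > 0)

def select_neighbours(G, u, k, strategy="degree", h=3):
    """Pick min(k, d(u)) of u's neighbours according to one of the four criteria."""
    nbrs = list(G.neighbors(u))
    k = min(k, len(nbrs))
    if strategy == "random":
        return random.sample(nbrs, k)
    score = {
        "degree": lambda v: G.degree(v),
        "volume": lambda v: volume(G, v, h),
        "weighted": lambda v: weighted_volume(G, v, h),
    }[strategy]
    return sorted(nbrs, key=score, reverse=True)[:k]
```

For example, select_neighbours(G, u, k=3, strategy='degree') would return u's three highest-degree neighbours, which in the decentralized setting above only requires querying u's immediate neighbours.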

These functions are summarised in Table I. We will evaluate the performance and usefulness of these functions in Section IV.

TABLE I. SUMMARY OF ESTIMATION FUNCTIONS.
Strategy | Influence of v | Cost
Ran. | 1 | O(1)
Deg. | d(v) | O(κ)
Vol. | Σ_{w∈Nh(v)} d(w) | O(κ^(h+1))
Wei. | Σ_{w∈Nh(v)} d(w) · (1 − c(w)) · (1/2^δ(v,w)) | O(κ^(h+1))

Fig. 1. An example to explain the relative weights of the distance δ(v, w) between v and w ∈ Nh(v). In this example, we believe that x's connectivity contributes more to v's centrality than y's connectivity.

Fig. 2. An example to explain our centrality design philosophy for the relative weights of nodes' clustering tendencies. We compare a node with a high clustering coefficient, c(x) = 1 (left), and a node with a low clustering coefficient, c(y) = 0 (right). In this example, we believe that y's role in information diffusion is more important than x's role.

IV. EXPERIMENTAL RESULTS
In this section, we analyse the performance of the selection strategies presented in Section III on several real-world networks. We summarize the properties of the networks used in the experiments in Table II. For Facebook, we particularly used a dataset crawled in early 2008 of 26,701 nodes and 251,249 edges, representing a regional sub-network of Facebook. The three notations κ, D, and C represent the "average degree", "network diameter", and "number of connected components", respectively. The diameter of a network (D) is the maximum distance between nodes in the network [5]; the diameter of a disconnected network is taken as infinite (inf).

TABLE II. SUMMARY OF DATASETS USED.
Network | |V| | |E| | C | κ | D
PGP [6] | 10,680 | 24,316 | 1 | 4.55 | 24
Email [7] | 1,134 | 5,453 | 1 | 9.62 | 8
Blog [8] | 1,224 | 16,718 | 2 | 27.32 | inf
Facebook | 26,701 | 251,249 | 1 | 18.82 | 15

In order to show the usefulness of the node selection criteria proposed in Section III, we first calculate the Pearson correlation coefficients between closeness centrality and each of them: the degree, volume, and weighted influences (see Figure 3). Closeness centrality can often be applied to identify key nodes that are central in information dissemination processes [9].

Fig. 3. The Pearson correlation coefficients between the node influences in Section III and closeness centrality, for depth h = 1, ..., 4, on (a) PGP, (b) Email, (c) Blog, and (d) Facebook.
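As an illustration of how the correlations of Figure 3 could be reproduced, here is a small sketch using networkx and scipy; it reuses the weighted_volume helper sketched earlier, and the authors' exact experimental procedure may of course differ.

```python
import networkx as nx
from scipy.stats import pearsonr

def influence_closeness_correlation(G, h, estimator):
    """Pearson correlation between a local influence estimate and closeness centrality."""
    closeness = nx.closeness_centrality(G)
    nodes = list(G.nodes())
    estimates = [estimator(G, v, h) for v in nodes]
    targets = [closeness[v] for v in nodes]
    r, _ = pearsonr(estimates, targets)
    return r

# e.g. one curve of Figure 3: the weighted-volume estimate at depths h = 1..4
# for h in range(1, 5):
#     print(h, influence_closeness_correlation(G, h, weighted_volume))
```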

The weighted-volume and volume centrality values are more highly correlated with closeness centrality than the degree values if h is properly chosen. Moreover, as h increases,

the correlation coefficients for the weighted-volume centrality are significantly higher than those for the volume centrality [4], except for PGP. In particular, when h = 4, the weighted-volume centrality is almost perfectly correlated with closeness centrality, although this trend appears to be rather weak in PGP. These results imply that the volume (if h = 2) or weighted-volume centrality (if h ≥ 2) provides a good approximation of closeness centrality. In this paper, our research interest is finding the best selection strategy to maximize information diffusion. We use the IC model of Section II to evaluate the performance of the strategies presented in Section III, varying the number of initially activated neighbours k and using a constant propagation probability λ on the edges. We here set h = 3 for the weighted and volume strategies, to give a good balance between the accuracy of influence estimation and the communication cost. For each simulation of INS, we randomly pick an information source node u in each of the networks in Table II and then select its k neighbours according to a selection criterion from Section III. With fixed k and λ, we repeated this 500 times to minimize the bias of the test samples (the randomly selected information source nodes); we measure the ratio of the average number of activated nodes per test sample to the total number of nodes in the network. For example, with k = 1 and λ = 0.01, Figure 4 shows how these values change over time t under the IC model. Here, we use different ranges for the time duration on the x-axis, since the sizes of the networks are very different (see the number of nodes in each of the networks in Table II).
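The simulation protocol just described can be sketched as follows. This is an illustrative reimplementation under our own naming (independent_cascade, average_activation_ratio), not the authors' code, and it assumes the initially activated seed set consists only of the selected neighbours.

```python
import random

def independent_cascade(G, seeds, lam, steps):
    """IC model [1]: each newly activated node gets one chance to activate
    each inactive neighbour, with probability lam per edge."""
    active = set(seeds)
    frontier = set(seeds)
    for _ in range(steps):
        newly = set()
        for u in frontier:
            for v in G.neighbors(u):
                if v not in active and random.random() < lam:
                    newly.add(v)
        active |= newly
        frontier = newly
        if not frontier:
            break
    return active

def average_activation_ratio(G, k, lam, steps, strategy, runs=500, h=3):
    """Average fraction of activated nodes over 'runs' randomly chosen source nodes."""
    nodes = list(G.nodes())
    total = 0.0
    for _ in range(runs):
        u = random.choice(nodes)
        seeds = select_neighbours(G, u, k, strategy, h)  # from the earlier sketch
        total += len(independent_cascade(G, seeds, lam, steps)) / G.number_of_nodes()
    return total / runs
```

With a network loaded as a networkx graph, average_activation_ratio(G, k=1, lam=0.01, steps=240, strategy='degree') should roughly correspond to the final point of the degree curve for PGP in Figure 4.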

Fig. 4. Changes in the ratio of the average number of activated nodes to the total number of nodes in the network over time t, for k = 1 and λ = 0.01, on (a) PGP, (b) Email, (c) Blog, and (d) Facebook.

From this figure, unlike the correlation coefficients with closeness centrality, we can see that all strategies except random produce similar results over time t. The use of closeness centrality seems reliable in theory, but in practice it may be less effective than expected: the degree selection strategy is at least as effective as volume and weighted. When we consider how expensive their costs (O(κ^(h+1))) are, we would not recommend using the volume and weighted selection methods. Interestingly, there is a significant gap between random and the other strategies in Email and Blog, while the random selection strategy is comparable to the other strategies in PGP and Facebook. We surmise that the differences in the underlying network topologies may explain this. The numbers of nodes of Email and Blog are relatively small (1,134 and 1,224, respectively), while those of PGP and Facebook are quite large (10,680 and 26,701, respectively). In a large network, the effects of the initially activated nodes may be averaged out over time to cover the many remaining nodes in the network. Since the effect of the strategy choice remains rather limited, our research interest naturally shifts from choosing the most influential neighbours for information diffusion to finding the optimal parameter values (e.g., k) for each strategy. To accelerate the speed of information diffusion, a straightforward approach is to increase the number of initially activated neighbours k. We can imagine that the naive random selection strategy could also be used to efficiently disseminate the user's information, even in a small network such as Email or Blog, if k increases sufficiently. In this context, our goal should be interpreted as finding the minimum k for each strategy that achieves a reasonable level of information diffusion over time. With the number of initially activated neighbours k ranging from 1 to 7, we discuss the effects of k. We divide the analysis into two parts, 'long-term' and 'short-term' effects, since they may be different in nature.

Fig. 5. Changes in the ratio of the average number of activated nodes to the total number of nodes in the network for the long term, as a function of the number of initially activated neighbours k, on (a) PGP, (b) Email, (c) Blog, and (d) Facebook.

To demonstrate the long-term effects of k, we first analyse the ratio of the average number of activated nodes in PGP, Email, Blog, and Facebook, respectively, after the 240th, 60th, 20th, and 100th time steps, which suffices to cover more than half of each network. The experimental results are shown in Figure 5. From this figure, we can see that the long-term effects of k may not be linear: the average ratios of activated nodes in all networks are still below 0.6 even for k = 7. When we use the degree, volume, and weighted strategies, k is not an important factor in long-term propagation of information. This is natural enough; the relative importance of the number of initially activated nodes is reduced over time. However, the random selection strategy is affected by k, although the long-term effects of k are inherently limited. The ratios of activated nodes in all networks except Facebook show almost the same pattern: the curves commonly have a gentle slope from k = 2 or 3. As a selective strategy is at least as effective as random selection, we can expect that it is enough to have two or three neighbours who can share the information, regardless of the selection method used. To discuss the short-term effects of k, we analyse the ratios of the average numbers of activated nodes in PGP, Email, Blog, and Facebook, respectively, after the 60th, 15th, 5th, and 20th time steps, i.e., the first quarter of the long-term duration. The experimental results are shown in Figure 6. For improved visualisation, we use a different range on the y-axis of this figure, since the levels of the ratios of the average numbers of activated nodes in the networks are quite different from those in Figure 5.

Fig. 6. Changes in the ratio of the average number of activated nodes to the total number of nodes in the network for the short term, as a function of the number of initially activated neighbours k, on (a) PGP, (b) Email, (c) Blog, and (d) Facebook.

From this figure, we can see that for short-term propagation the degree, volume, and weighted strategies are more useful than random selection for all networks except the large Facebook network: the gap between them is clearly visible across k. That is, if one wishes to efficiently spread information in the short term, one of the degree, volume, and weighted strategies should be carefully selected. Moreover, the choice of k might also be important for spreading information quickly. For example, in Blog, the ratio of the average number of activated nodes for each selection strategy increases linearly with k.

Fig. 7. Changes in the ratio of the average number of activated nodes to the total number of nodes in the network for the long term, as a function of the propagation probability λ, on (a) PGP, (b) Email, (c) Blog, and (d) Facebook.

Fig. 8. Changes in the ratio of the average number of activated nodes to the total number of nodes in the network for the short term, as a function of the propagation probability λ, on (a) PGP, (b) Email, (c) Blog, and (d) Facebook.

Finally, we discuss the effect of varying the propagation probability λ. In general, the speed of information diffusion improves dramatically with λ. To demonstrate this, we fix k = 1 and analyse the ratio of the average number of activated nodes for both long-term and short-term propagation in the same manner as above. We select k = 1 to

minimise the effects of k. The results are shown in Figures 7 and 8. In these figures, as λ increases, so does the number of activated nodes in the networks; this is unsurprising. The more interesting observation is that the gaps between random and the other strategies increase with λ for short-term propagation, while they decrease with λ for long-term propagation. These results imply that the choice of selection strategy should depend on the target duration (long-term vs. short-term) of information dissemination. In summary, our suggestion is to use the degree strategy for short-term propagation and the random strategy for long-term propagation. Although the other centrality-based strategies, volume and weighted, produce similar results to those obtained by degree, they are not recommended due to their relatively high communication costs. V. RELATED WORK The Influential Maximization (IM) problem has received increasing attention given the growing popularity of online social networks, such as Facebook and Twitter, which have provided great opportunities for the diffusion of information and opinions and the adoption of new products. The IM problem was originally introduced for marketing purposes by Domingos and Richardson [10]: the goal is to find a set of k initially activated nodes with the maximum number of activated nodes after time step t. Kempe et al. [11] formulated this problem under two basic stochastic influence cascade models: the Independent Cascade (IC) model [1] and the Linear Threshold (LT) model [11]. In the IC model, each edge has a propagation probability, and influence is propagated by activated nodes independently activating their inactive neighbours based on the edge propagation probabilities. In the LT model, each edge has a weight, each node has a threshold chosen uniformly at random, and a node becomes activated if the weighted sum of its active neighbours exceeds its threshold. Kempe et al. [11] showed that the optimization problem of selecting the most influential nodes is NP-hard for both models, and also proposed a greedy algorithm that provides a good approximation ratio of 63% of the optimal solution. However, their greedy algorithm relies on Monte-Carlo simulations of the influence cascade to estimate the influence spread, which makes the algorithm slow and not scalable. A number of papers in recent years have tried to overcome the inefficiency of this greedy algorithm, either by improving the original greedy algorithm [12], [13] or by proposing new algorithms [14], [13], [15]. For example, Leskovec et al. [12] proposed the Cost-Effective Lazy Forward (CELF) scheme for selecting new seeds to significantly reduce the number of influence spread evaluations, but it is still slow and not scalable to large graphs, as demonstrated in [15]. Kimura and Saito [14] proposed shortest-path-based heuristic algorithms to evaluate the influence spread. Chen et al. [13] proposed two faster greedy algorithms called MixedGreedy and DegreeDiscount

algorithms for the IC model where the propagation probabilities on all edges are the same; MixedGreedy removes the edges that make no contribution to propagating influence, which reduces the computation spent on unnecessary edges, while DegreeDiscount assumes that the influence spread increases with node degree. Chen et al. [15] proposed the Maximum Influence Arborescence (MIA) heuristic based on local tree structures to reduce computation costs. Wang et al. [16] proposed a community-based greedy algorithm for identifying the most influential nodes; the main idea is to divide a social network into communities and estimate the influence spread in each community instead of over the whole network topology. Several studies design machine learning algorithms to generate reasonable influence graphs by learning practical influence cascade model parameters from real datasets [17], [18], [19], [20]. In this paper, we use the IC model for the Influential Neighbours Selection (INS) problem, a variant of the IM problem in which the most influential neighbours of a node are selected rather than the most influential arbitrary nodes in a network. To estimate node influence on information diffusion in networks, we test the use of the local connectivity patterns of nodes rather than simulations of the influence cascade. Wehmuth and Ziviani [4] recently proposed a method to compute approximate closeness centrality which uses only local information available at each node. Their study showed the possibility of using local connectivity to approximate closeness centrality. However, we showed the limitations of their centrality measure when applying this metric to the INS problem on several real-world network topologies. The INS problem might be applied to a wide range of social-based forwarding schemes [21], [22], [23]. Such schemes have mainly been proposed for Delay Tolerant Networks (DTNs), where the connections between nodes in the network change frequently over time: the basic idea is to use node centrality for relay selection, and the forwarding strategy is to forward messages to nodes which are more central than the current node. Kim et al. [24] suggested some approximation methods to predict network centrality values for DTNs. Kim and Anderson [25] also proposed a model to measure the importance of a node by considering the time dimension. VI. CONCLUSIONS We introduced a new problem called the Influential Neighbours Selection (INS) problem: selecting a node's neighbours so as to efficiently disseminate its information. Previous studies had mainly aimed to develop solutions that select the most influential arbitrary nodes in a network for information diffusion. However, this model is not applicable in many practical situations. For example, from the point of view of a user who wants to disseminate information through a network, it is desirable to consider sharing the information with the user's neighbours only, instead of with any k nodes in the network; we empirically studied this through intensive simulation based on four real-world network topologies.

We presented four selection criteria, ranging from a simple random selection strategy to more complicated selection strategies which take advantage of knowledge of local connectivity such as node degree, and explored their feasibility. We compared these selection methods by computing the ratio of the average number of activated nodes to the total number of nodes in the network, and discussed which selection methods are recommended under which conditions. We recommend using the degree selection strategy for short-term propagation and the random selection strategy for long-term propagation that covers more than half of a network. These strategies are thus amenable to large-scale, online and real-time computation. We also discussed the effects of the number of initially activated neighbours for each strategy. Interestingly, these effects may be rather limited: when we used the degree selection strategy, the number of initially activated nodes was not an important factor, at least in the INS problem for long-term propagation. As part of this ongoing study, we plan to test community-based selection methods; if a user's neighbours are divided into several disjoint communities, we may improve the performance of information diffusion by selecting initially activated neighbours from the different groups. Another interesting problem is to develop a more general model for information diffusion. We may consider not only a user's neighbours but also neighbours of neighbours as the candidate space of the initially activated nodes. In other words, we can extend the concept of the INS problem by expanding the set of the initially activated nodes with the distance from an information source node. ACKNOWLEDGEMENT This research is part-funded by the EU grants for the RECOGNITION project (FP7-ICT 257756) and the EPSRC DDEPI Project, EP/H003959. We thank Ben Y. Zhao for his Facebook dataset. REFERENCES [1] J. Goldenberg, B. Libai, and E. Muller, "Talk of the Network: A Complex Systems Look at the Underlying Process of Word-of-Mouth," Marketing Letters, pp. 211–223, 2001. [2] D. Kempe, J. Kleinberg, and E. Tardos, "Maximizing the spread of influence through a social network," in KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2003, pp. 137–146. [3] D. Gruhl, R. Guha, D. Liben-Nowell, and A. Tomkins, "Information diffusion through blogspace," in WWW '04: Proceedings of the 13th international conference on World Wide Web. New York, NY, USA: ACM, 2004, pp. 491–501. [4] K. Wehmuth and A. Ziviani, "Distributed assessment of the closeness centrality ranking in complex networks," in Proceedings of the Fourth Annual Workshop on Simplifying Complex Networks for Practitioners, ser. SIMPLEX '12. New York, NY, USA: ACM, 2012, pp. 43–48. [5] P. Hage and F. Harary, "Eccentricity and Centrality in Networks," Social Networks, vol. 17, no. 1, pp. 57–63, 1995. [6] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, "Models of social networks based on social distance attachment," Physical Review E, vol. 70, p. 056122, Nov 2004. [7] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, "Self-similar community structure in a network of human interactions," Physical Review E, vol. 68, no. 6, Dec 2003.

[8] L. A. Adamic and N. Glance, “The political blogosphere and the 2004 u.s. election: divided they blog,” in Proceedings of the 3rd international workshop on Link discovery, ser. LinkKDD ’05. New York, NY, USA: ACM, 2005, pp. 36–43. [9] J. Tang, M. Musolesi, C. Mascolo, V. Latora, and V. Nicosia, “Analysing information flows and key mediators through temporal centrality metrics,” in Proceedings of the 3rd Workshop on Social Network Systems, ser. SNS ’10. New York, NY, USA: ACM, 2010, pp. 3:1–3:6. [10] P. Domingos and M. Richardson, “Mining the network value of customers,” in Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’01. New York, NY, USA: ACM, 2001, pp. 57–66. [11] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proc. of ACM SIGKDD ’03, 2003, pp. 137–146. [12] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance, “Cost-effective outbreak detection in networks,” in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’07. New York, NY, USA: ACM, 2007, pp. 420–429. [13] W. Chen, Y. Wang, and S. Yang, “Efficient influence maximization in social networks,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’09. New York, NY, USA: ACM, 2009, pp. 199–208. [14] M. Kimura and K. Saito, “Tractable models for information diffusion in social networks,” in Proceedings of the 10th European conference on Principle and Practice of Knowledge Discovery in Databases, ser. PKDD’06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 259–271. [15] W. Chen, C. Wang, and Y. Wang, “Scalable influence maximization for prevalent viral marketing in large-scale social networks,” in Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’10. New York, NY, USA: ACM, 2010, pp. 1029–1038. [16] Y. Wang, G. Cong, G. Song, and K. Xie, “Community-based greedy algorithm for mining top-k influential nodes in mobile social networks,” in Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’10. New York, NY, USA: ACM, 2010, pp. 1039–1048. [17] A. Anagnostopoulos, R. Kumar, and M. Mahdian, “Influence and correlation in social networks,” in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’08. New York, NY, USA: ACM, 2008, pp. 7–15. [18] J. Tang, J. Sun, C. Wang, and Z. Yang, “Social influence analysis in large-scale networks,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’09. New York, NY, USA: ACM, 2009, pp. 807–816. [19] K. Saito, M. Kimura, K. Ohara, and H. Motoda, “Selecting information diffusion models over social networks for behavioral analysis,” in Proceedings of the 2010 European conference on Machine learning and knowledge discovery in databases: Part III, ser. ECML PKDD’10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 180–195. [20] A. Goyal, F. Bonchi, and L. V. Lakshmanan, “Learning influence probabilities in social networks,” in Proceedings of the third ACM international conference on Web search and data mining, ser. WSDM ’10. New York, NY, USA: ACM, 2010, pp. 241–250. [21] P. Hui, A. Chaintreau, J. Scott, R. Gass, J. Crowcroft, and C. 
Diot, “Pocket switched networks and human mobility in conference environments,” in Proceedings of the 2005 ACM SIGCOMM workshop on Delay-tolerant networking, ser. WDTN ’05. New York, NY, USA: ACM, 2005, pp. 244–251. [22] E. M. Daly and M. Haahr, “Social network analysis for routing in disconnected delay-tolerant MANETs,” in Proceedings of the 8th ACM international symposium on Mobile ad hoc networking and computing, ser. MobiHoc ’07. New York, NY, USA: ACM, 2007, pp. 32–40. [23] P. Hui, J. Crowcroft, and E. Yoneki, “Bubble rap: social-based forwarding in delay tolerant networks,” in Proceedings of the 9th ACM international symposium on Mobile ad hoc networking and computing, ser. MobiHoc ’08. New York, NY, USA: ACM, 2008, pp. 241–250. [24] H. Kim, J. Tang, R. Anderson, and C. Mascolo, “Centrality prediction in dynamic human contact networks,” Computer Networks, vol. 56, no. 3, pp. 983–996, Feb. 2012. [25] H. Kim and R. Anderson, “Temporal node centrality in complex networks,” Physical Review E, vol. 85, p. 026107, Feb 2012.


Appendix D – [Parking search in smart urban environments]

This appendix contains the reprint of the following technical report: E. Kokolaki, M. Karaliopoulos, I. Stavrakakis, On the efficiency of the parking assistance service: A game-theoretic analysis, Submitted, 2012.


On the efficiency of the parking assistance service: A game-theoretic analysis

Evangelia Kokolaki, Merkourios Karaliopoulos, and Ioannis Stavrakakis

Department of Informatics and Telecommunications, National & Kapodistrian University of Athens, Ilissia, 157 84 Athens, Greece. Email: {evako, mkaralio, ioannis}@di.uoa.gr Abstract—Our paper seeks to systematically explore the efficiency of the parking assistance service in urban environments. We consider a theoretical service model whereby: (a) information of perfect accuracy is broadcast about the parking demand and supply, and demand-responsive pricing schemes are applied to steer drivers' parking choices; (b) drivers are strategic decision-makers that respond to the service announcements and pricing policies on public (on-street) and private parking facilities in ways that minimize the cost of the acquired parking spot. We formulate the resulting games as resource selection games and derive their equilibria under various assumptions for the price differential between the public and private parking fees. The efficiency of the equilibrium states is compared against the optimal assignment that could be determined by an ideal centralized parking spot reservation system, and conditions are derived for minimizing the related Price of Anarchy value. Our results outline bounds for the efficiency of parking assistance systems under perfect information, but also provide useful practical hints for the way pricing strategies can make the service more attractive.

I. INTRODUCTION The tremendous increase in urbanization, along with modern living standards and requirements, motivates city planners to recruit state-of-the-art systems for the efficient and environmentally sustainable management of urban processes and operations. Indeed, the (pilot) implementation of solutions and pioneering ideas from the area of information and communication technologies in various aspects of metropolitan life paves the way to so-called "smart cities". The search for available parking space is among the daily routine processes that can benefit from this new kind of city environment. In particular, transportation engineers have developed parking assistance systems, realized through information dissemination mechanisms and demand-responsive pricing policies (e.g., [1]), to alleviate not only the traffic congestion problems that stem from blind parking search but also the resulting environmental burden. Common to these systems is the exploitation of wireless communications and information sensing technologies to collect and share information about the availability of parking space and the demand level within the search area. This information is then used to steer the parking choices of drivers in order to reduce the effective competition over the parking space and make the overall search process efficient. Additionally, the implementation of smart demand-responsive charging schemes aims to improve parking availability in overused parking zones and to reduce double-parking and cruising. Our paper seeks to systematically explore the upper bound of these systems' efficiency. Thus, the particular questions that become relevant for this exploration are: How do different types of parking assistance systems modulate drivers'

behavior, under the ideal scenario whereby the drivers become perfectly aware of the parking demand and supply? How do such systems affect the cost that drivers incur and the revenue accruing to the parking service operator? We formulate the parking spot selection problem as an instance of resource selection games in Section II. We view the drivers as rational selfish agents that pursue to minimize the cost they pay for the acquired parking space. The drivers choose to either compete for the cheaper but scarce public (on-street) parking spots or head for the more expensive private parking lot. In the first case, they run the risk of failing to get a spot and having to a posteriori take the more expensive alternative, this time suffering the additional cruising cost in terms of time, fuel consumption (and stress) of the failed attempt. Drivers make their decisions drawing on perfect information about the number of drivers, the availability of parking spots and the pricing policy, which is broadcast by the parking service operator. In Section III, we derive the equilibrium behaviors of the drivers and compare the induced social cost against the optimal one via the Price of Anarchy metric. Most importantly, in Section IV, we show that the optimization of the equilibrium social cost is feasible by properly choosing the charging cost and the location of the private parking facilities. We close the discussion in Section V, iterating on the model assumptions. Related work and contribution: Various aspects of the broader parking space search problem have been addressed in the literature. A queueing model for drivers who circulate in search of on-street parking is introduced in [2] by Larson et al., in order to analyze the economic effects of congestion pricing. In another view of the same problem, particular game-theoretical dimensions of general parking applications are explicitly acknowledged and treated in [3], [4] and [5]. In [3], the games are played among parking facility providers and concern the location and capacity of their parking facilities as well as which pricing structure to adopt. In the other two works, by contrast, the strategic players are the drivers. In [4], which seeks to provide cues for optimal parking lot size dimensioning, the drivers decide on the arrival time at the lot, accounting for their preferred time as well as their desire to secure a space. In a work more relevant to ours, Ayala et al. in [5] define a game setting where drivers exploit (or not) information on the location of others to occupy an available parking spot at the minimum possible travelled distance, irrespective of the distance between the spot and the driver's actual travel destination. The authors present distributed parking spot assignment algorithms to realize or approximate the Nash equilibrium states. In our work, the game-theoretic analysis highlights the cost effects of the




Fig. 1. Parking map of the centre area of Athens, Greece. Dashed lines show metered controlled (public) parking spots, whereas “P” denotes private parking facilities. The map illustrates, as well, the capacity of both parking options.

parking operators' charging policies on drivers' decisions, drawing on closed-form expressions of the stable operational points in the different game settings. II. THE PARKING SPOT SELECTION GAME In the parking spot selection game, the set of players consists of drivers who circulate within the center area of a big city in search of parking space. Typically, in these regions, parking is completely forbidden or constrained in whole areas of road blocks, so that the real effective curbside is significantly limited (see Fig. 1). The drivers have to decide whether to drive towards the scarce low-cost (controlled) public parking spots or the over-dimensioned and more expensive private parking lot (we see all local lots collectively as one). All parking spots that lie in the same public or private area are assumed to be of the same value for the players. In particular, the on-street parking spots are considered to be quite close to each other, resulting in practically similar driving times to them and walking times from them to the drivers' ultimate destinations. Thus, the decisions are made over the two sets of parking spots rather than over individual spots. We observe drivers' behavior within a particular time window over which they reach this parking area. In general, such synchronization phenomena in drivers' flow occur at specific time zones during the day [6]. Herein, we account for morning hours or driving in the area for business purposes, coupled with long parking duration. Thus, the collective decision making on parking space selection can be formulated as an instance of the strategic resource selection games, whereby N players (i.e., drivers) compete against each other for a finite number of common resources (i.e., public parking spots) [7]. More formally, the one-shot parking spot selection game (the study of its dynamic variant suggests a natural direction for future work) is defined as follows:
Definition II.1. A Parking Spot Selection Game is a tuple Γ(N) = (N, R, (wj)j∈{pub,priv}), where:
• N = {1, ..., N}, N > 1, is the set of drivers who seek parking space,
• R = Rpub ∪ Rpriv is the set of parking spots; Rpub is the set of public spots, with R = |Rpub| ≥ 1, and Rpriv is the set of private spots, with |Rpriv| ≥ N,
• Ai = {public, private} is the action set for each driver i ∈ N,
• wpub(·) and wpriv(·) are the cost functions of the two actions, respectively.

The parking spot selection game comes under the broader family of congestion games. The players' payoffs (here: costs) are non-decreasing functions of the number of players competing for the parking capacity rather than of their identities, and they are common to all players. More specifically, drivers who decide to compete for the public parking space run the risk of not being among the R winner-drivers who get a spot. In this case, they eventually have to resort to private parking space, only after wasting extra time and fuel (plus patience) on the failed attempt. The expected cost of the action public, wpub : A1 × ... × AN → R, is therefore a function of the number of drivers k taking it, and is given by wpub(k) = min(1, R/k) · cpub,s + (1 − min(1, R/k)) · cpub,f

(1)

where cpub,s is the cost of successfully competing for public parking space, whereas cpub,f = γ ·cpub,s , γ > 1, is the cost of competing, failing, and eventually paying for private parking space. On the other hand, the cost of private parking space is fixed wpriv (k) = cpriv = β · cpub,s

(2)

where 1 < β < γ, so that the excess cost δ · cpub,s, with δ = γ − β > 0, reflects the actual cost of cruising and the "virtual" cost of the time wasted until eventually heading to the private parking space. We denote every action profile by the vector a = (ai, a−i) ∈ ×_{k=1}^{N} Ak, where a−i denotes the actions of all drivers other than player i in the profile a. Besides the two pure strategies, coinciding with the pursuit of public and private parking space, the drivers may also randomize over them. In particular, if ∆(Ai) is the set of probability distributions over the action set of player i, a player's mixed action corresponds to a vector p = (ppub, ppriv) ∈ ∆(Ai), where ppub and ppriv are the probabilities of the pure actions, with ppub + ppriv = 1, while its cost is a weighted sum of the cost functions wpub(·) and wpriv(·) of the pure actions. In the following section, we present the game-theoretic analysis of this game formulation, looking into both the stable and the optimal operational conditions as well as the respective costs incurred by the players. III. GAME ANALYSIS Ideally, the players determine their strategy under complete knowledge of those parameters that shape their cost. Given the symmetry of the game, the additional piece of information that is considered available to the players, besides the number of vacant parking spots and the employed pricing policy, is the level of parking demand, i.e., the number of drivers searching for parking space. We draw on concepts from [8] and theoretical results from [7], [9] to derive the equilibrium strategies for the game Γ(N) and assess their (in)efficiency. A. Pure Equilibria strategies Existence: The parking spot selection game constitutes a symmetric game, where the action set is common to all players

and consists of two possible actions, public and private. Cheng et al. have shown ([9], Theorem 1) that every symmetric game with two strategies has an equilibrium in pure strategies. Computation: Thanks to the game's symmetry, the full set of 2^N different action profiles maps into N + 1 different action meta-profiles. Each meta-profile a(m), m ∈ [0, N], encompasses all (N choose m) different action profiles that result in the same number m of drivers competing for on-street parking space. The expected costs for these m drivers and for the N − m ones directly choosing the private parking lot alternative are functions of a(m) rather than of the exact action profile. In general, the cost c_i^N(ai, a−i) for driver i under the action profile a = (ai, a−i) is c_i^N(ai, a−i) =



wpub(σpub(a)),   for ai = public
wpriv(N − σpub(a)),   for ai = private

where σpub (a) is the number of competing drivers for onstreet parking under action profile a. Equilibria action profiles combine the players’ best-responses to their opponents’ actions. Formally, the action profile a = (ai , a−i ) is a pure Nash equilibrium if for all i ∈ N : 0 ai ∈ arg min (cN i (ai , a−i )) 0 ai ∈Ai

(4)

so that no player has anything to gain by changing her decision unilaterally. Therefore, to derive the equilibria states, we locate the conditions on σpub that break the equilibrium definition and reverse them. More specifically, given an action profile a with σpub (a) competing drivers, a player gains by changing her decision to play action ai in two circumstances: when ai = private and wpub (σpub (a) + 1) < cpriv when ai = public and wpub (σpub (a)) > cpriv

(5) (6)

Taking into account the relation between the number of drivers and the available on-street parking spots, R, we can postulate the following Lemma: Lemma III.1. In the parking spot selection game Γ(N ), a driver is motivated to change his action ai in the following circumstances: • ai = private and σpub (a) < R ≤ N or R ≤ σpub (a) < σ0 − 1 ≤ N or

(8)

σpub (a) < N ≤ R

(9)

• ai = public and R < σ0 < σpub (a) ≤ N

where σ0 =

R(γ−1) δ

(7)

(10)

∈ R.

Proof: Conditions (7) and (9) are trivial. Since the current number of competing vehicles is less than the on-street parking capacity, every driver having originally chosen the private parking option has the incentive to change her decision due to the price differential between cpub,s and cpriv . When σpub (a) exceeds the public parking supply, as in (8), a driver who has decided to avoid competition, profits from switching her action when the expected cost of playing public becomes less than the fixed cost of playing private. From (3) and (5), it must hold that: R R · cpub,s + (1 − ) · cpub,f < cpriv ⇒ σpub (a) + 1 σpub (a) + 1 R(γ − 1) σpub (a) < −1 δ

which yields (8). On the contrary, a driver that first decides to compete for public parking space, switches to private if the competing drivers outnumber the public parking resources. Namely, from (3) and (6), when R R · cpub,s + (1 − ) · cpub,f > cpriv ⇒ σpub (a) σpub (a) R(γ − 1) σpub (a) > δ

inline with (10). It is now possible to state the following Theorem for the pure Nash equilibria of the parking spot selection game. Theorem III.1. A parking spot selection game has: • •



N E,1 one Nash equilibrium a∗ with σpub (a∗ ) = σpub = N, if N ≤ σ0 and σ0 ∈ R N 0 0 bσ0 c Nash equilibrium profiles a with σpub (a ) = N E,2 ∗ σpub = bσ0 c, if N > σ0 and σ0 ∈ (R, N )\N  N Nash equilibrium profiles a0 with σpub (a0 ) = σ0  N E,2 σpub = σ0 and σ0N−1 Nash equilibrium profiles a? N E,3 with σpub (a? ) = σpub = σ0 − 1, if N > σ0 and ∗ σ0 ∈ [R + 1, N ] ∩ N .

Proof: Theorem III.1 follows directly from (4) and Lemma III.1. The game has two equilibrium conditions on σpub for N > σ0 with integer σ0 , or a unique equilibrium condition, otherwise. Efficiency: The efficiency of the equilibria action profiles resulting from the strategically selfish decisions of the drivers is assessed through the broadly used metric of the Price of Anarchy [8]. It expresses the ratio of the social cost in the worst-case equilibria over the optimal social cost under ideal coordination of the drivers’ strategies. Proposition III.1. In the parking spot selection game, the pure Price of Anarchy equals:  γN −(γ−1) min(N,R)  if σ0 ≥ N  min(N,R)+β max(0,N −R) , PoA =   bσ0 cδ−R(γ−1)+βN , if σ0 < N R+β(N −R) Proof: The social cost under action profile a equals: C(σpub (a)) =

N X

cN i (a) =

i=1

cpub,s (N β − σpub (a)(β − 1)), if σpub (a) ≤ R and cpub,s (σpub (a)δ − R(γ − 1) + βN ), if R < σpub (a) ≤ N

(11)

The numerators of the two ratios are obtained directly by NE replacing the first two σpub values (worst-cases) computed in Theorem III.1. On the other hand, under the ideal action profile aopt , exactly R drivers pursue on-street parking, whereas the remaining N − R are served by the private parking resources. Therefore, under aopt , no drivers find themselves in the unfortunate situation to have to pay the additional cost of cruising in terms of time and fuel after having unsuccessfully competed for an on-street parking spot. The optimal social cost, Copt is given by: Copt =

N X i=1

cN i (aopt ) = cpub,s [min(N, R) + β · max(0, N − R)]
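The equilibrium characterization and the Price of Anarchy above lend themselves to direct numerical evaluation. The following Python sketch (ours, not part of the paper; variable names are illustrative) computes σ0, the equilibrium number of competing drivers per Theorem III.1, and the pure PoA per Proposition III.1, assuming c_pub,s is normalized to 1 and γ = β + δ, consistently with the parameter pairs used later in Fig. 2.

```python
import math

def equilibrium_competitors(N, R, beta, delta):
    """Equilibrium number of competing drivers (Theorem III.1):
    all N drivers compete if N <= sigma_0, otherwise floor(sigma_0) do."""
    gamma = beta + delta                      # assumption: gamma = beta + delta
    sigma0 = R * (gamma - 1) / delta
    n_eq = N if N <= sigma0 else math.floor(sigma0)
    return n_eq, sigma0

def price_of_anarchy(N, R, beta, delta):
    """Pure Price of Anarchy of Proposition III.1 (c_pub,s = 1)."""
    gamma = beta + delta
    sigma0 = R * (gamma - 1) / delta
    opt = min(N, R) + beta * max(0, N - R)    # optimal social cost C_opt
    if sigma0 >= N:
        worst = gamma * N - (gamma - 1) * min(N, R)
    else:
        worst = math.floor(sigma0) * delta - R * (gamma - 1) + beta * N
    return worst / opt

if __name__ == "__main__":
    N, R, beta, delta = 500, 50, 10.0, 3.0    # illustrative values
    n_eq, sigma0 = equilibrium_competitors(N, R, beta, delta)
    print(f"sigma_0 = {sigma0:.1f}, equilibrium competitors = {n_eq}")
    print(f"PoA = {price_of_anarchy(N, R, beta, delta):.3f}")
```

For these values the PoA stays below the bound 1/(1 − R/N) of Proposition III.2 below, which offers a quick sanity check of the sketch.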

Proposition III.2. In the parking spot selection game, the pure Price of Anarchy is upper-bounded by 1/(1 − R/N), with N > R.

Proof: The proof is given in [10].

B. Mixed-action equilibria strategies

We mainly draw our attention to symmetric mixed-action equilibria since these can be more helpful in dictating practical strategies in real systems. Asymmetric mixed-action equilibria are discussed in [10].

Existence: Ashlagi, Monderer, and Tennenholtz proved in ([7], Theorem 1) that a unique symmetric mixed equilibrium exists for the broader family of resource selection games with more than two players and increasing cost functions. It is trivial to repeat their proof and confirm this result for our parking spot selection game Γ(N), with N > R and cost functions w_pub(·) and w_priv(·) that are non-decreasing functions of the number of players (increasing and constant, respectively).

Computation: If we denote by

$$B(\sigma_{pub}; N, p_{pub}) = \binom{N}{\sigma_{pub}} p_{pub}^{\sigma_{pub}} (1 - p_{pub})^{N - \sigma_{pub}} \quad (12)$$

the probability distribution of the number of drivers that decide to compete for on-street parking spots, where p = (p_pub, p_priv) denotes a mixed-action, then

$$c_i^N(public, p) = \sum_{\sigma_{pub}=0}^{N-1} w_{pub}(\sigma_{pub}+1)\, B(\sigma_{pub}; N-1, p_{pub}), \qquad c_i^N(private, p) = c_{priv}$$

denote the expected costs of choosing the on-street (resp. private) parking space option when all other drivers play the mixed-action p, while

$$c_i^N(p, p) = p_{pub} \cdot c_i^N(public, p) + p_{priv} \cdot c_i^N(private, p) \quad (13)$$

is the cost of the symmetric profile where everyone plays the mixed-action p. With these at hand, we can now postulate the following Theorem.

Theorem III.2. The parking spot selection game Γ(N) has a unique symmetric mixed-action Nash equilibrium p^NE = (p_pub^NE, p_priv^NE), where:
• p_pub^NE = 1, if N ≤ σ0, and
• p_pub^NE = σ0/N, if N > σ0,
where p_pub^NE = 1 − p_priv^NE and σ0 ∈ ℝ.

Proof: The proof is given in [10].

IV. NUMERICAL RESULTS

The analysis in Section III suggests that the charging policy for on-street and private parking space and their relative location, which determines the overhead parameter δ of failed attempts for on-street parking space, affect to a large extent the (in)efficiency of the game equilibrium profiles. In the following, we illustrate their impact on the game outcome and discuss their implications for real systems. For the numerical results we adopt per-time-unit normalized values used in typical municipal parking systems in big European cities [6]. The parking fee for public space is set to c_pub,s = 1 unit, whereas the cost of private parking space β ranges in (1, 16] units and the excess cost δ in [1, 5] units. We consider various parking demand levels, assuming that private parking facilities in the area suffice to fulfil all parking requests.

Figure 2 plots the social costs C(σ_pub) under pure (Eq. 11) and C(p_pub) under mixed-action strategies, as a function of the number of competing drivers σ_pub and the competition probability p_pub, respectively, where

$$C(p) = c_{pub,s} \sum_{\sigma=0}^{N} \binom{N}{\sigma} p^{\sigma} (1-p)^{N-\sigma} \left[\min(\sigma, R) + \max(0, \sigma - R)\,\gamma + (N-\sigma)\,\beta\right] \quad (14)$$

Fig. 2. Social cost for N = 500 drivers when exactly σ_pub drivers compete (a. pure-action profiles) or when all drivers decide to compete with probability p_pub (b. symmetric mixed-action profiles), for R = 50 public parking spots, under different charging policies (β = 2, γ = 3; β = 2, γ = 7; β = 10, γ = 11; β = 10, γ = 15).
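As a cross-check of Theorem III.2 and Eq. (14), the sketch below (ours, not from the paper) evaluates the symmetric mixed-equilibrium competition probability and the expected social cost C(p) for one of the Fig. 2 configurations; as before, it assumes γ = β + δ.

```python
from math import comb

def social_cost_mixed(p, N, R, beta, gamma, c_pub_s=1.0):
    """Expected social cost C(p) of Eq. (14) when every driver competes
    for an on-street spot with probability p."""
    total = 0.0
    for sigma in range(N + 1):
        prob = comb(N, sigma) * p**sigma * (1 - p)**(N - sigma)
        cost = min(sigma, R) + max(0, sigma - R) * gamma + (N - sigma) * beta
        total += prob * cost
    return c_pub_s * total

def mixed_equilibrium_p(N, R, beta, delta):
    """Symmetric mixed-action equilibrium p_pub^NE of Theorem III.2."""
    gamma = beta + delta                      # assumption: gamma = beta + delta
    sigma0 = R * (gamma - 1) / delta
    return 1.0 if N <= sigma0 else sigma0 / N

if __name__ == "__main__":
    N, R, beta, delta = 500, 50, 2.0, 1.0     # Fig. 2 case beta = 2, gamma = 3
    gamma = beta + delta
    p_eq = mixed_equilibrium_p(N, R, beta, delta)
    print(f"p_pub^NE = {p_eq:.3f}")
    print(f"C(p_eq)  = {social_cost_mixed(p_eq, N, R, beta, gamma):.1f}")
    print(f"C(R/N)   = {social_cost_mixed(R / N, N, R, beta, gamma):.1f}  # near-coordinated benchmark")
```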

Figure 2 motivates two remarks. Firstly, the social cost curves for pure and mixed-action profiles have the same shape. This comes as no surprise since, for given N, any value of the expected number of competing players 0 ≤ σ_pub ≤ N can be realized through an appropriate choice of the symmetric mixed-action profile p. Secondly, the cost is minimized when the number of competing drivers equals the number of on-street parking spots. The cost rises when either competition exceeds the available on-street parking capacity or drivers are overconservative in competing for on-street parking. In both cases, the drivers pay the penalty of the lack of coordination in their decisions. The deviation from optimal grows faster with increasing price differential between the on-street and private parking space.

Whereas an optimal centralized mechanism would assign exactly min(N, R) public parking spots to min(N, R) drivers, if N > R, in the worst-case equilibrium the size of the drivers' population that actually competes for on-street parking spots exceeds the real parking capacity by a factor σ0, which is a function of R, β and γ (equivalently, δ) (see Lemma III.1). This inefficiency is captured in the PoA plots in Figure 3 for β and δ ranging in [1.1, 16] and [1, 5], respectively.

Fig. 3. Price of Anarchy for N = 500 and varying R, under different charging policies: (a) 2D PoA(β), R = 160; (b) 3D PoA(β, δ), R = 160; (c) 3D PoA(β, δ), R = 50.

The plots illustrate the following trends:

Fixed δ - varying β: For N ≤ σ0 or, equivalently, for β ≥ (δ(N − R) + R)/R, it holds that ∂PoA/∂β < 0 and, therefore, the PoA is strictly decreasing in β. On the contrary, for β < (δ(N − R) + R)/R, the PoA is strictly increasing in β, since ∂PoA/∂β > 0.

Fixed β - varying δ: For N ≤ σ0 or, equivalently, for δ ≤ R(β − 1)/(N − R), we get ∂PoA/∂δ > 0. Therefore, the PoA is strictly increasing in δ. For δ > R(β − 1)/(N − R), we get ∂PoA/∂δ = 0. Hence, if δ exceeds R(β − 1)/(N − R), the PoA is insensitive to changes of the excess cost δ.

Practically, the equilibrium strategy emerging from this kind of assisted parking search behavior approximates the optimal coordinated mechanism when the operation of private parking facilities accounts for drivers' preferences as well as estimates of the typical parking demand and supply. More specifically,


if, as part of the pricing policy, the cost of private parking is less than (δ(N − R) + R)/R times the cost of on-street parking, then the social cost in the equilibrium profile approximates the optimal social cost as the price differential between public and private parking decreases. This result is in line with the statement in [2], arguing that "price differentials between on-street and off-street parking should be reduced in order to reduce traffic congestion". Note that the PoA metric also decreases monotonically for high values of the private parking cost, when the private parking operator desires to gain more than (δ(N − R) + R)/R times the cost of on-street parking, towards a bound that depends on the excess cost δ. Nevertheless, these operating points correspond to high absolute social cost, i.e., the minimum achievable social cost is already unfavorable due to the high fee paid by the N − R drivers that use the private parking space (see Fig. 2). On the other hand, there are instances, as in the case of R = 50 (see Fig. 3), where the value (δ(N − R) + R)/R constitutes an unrealistic option for the cost of private parking space already for δ > 1. Thus, contrary to the previous case, the PoA only improves as the cost for private parking decreases. Finally, for a given cost of the private parking space, the social cost can be optimized by locating the private facility in the proximity of the on-street parking spots, so that the additional travel distance is reduced and the excess cost remains below R(β − 1)/(N − R).

V. CONCLUSIONS - DISCUSSION

In this paper, we draw our attention to fundamental determinants of the efficiency of parking assistance systems rather than to particular realizations of them. We have, thus, formulated the assisted parking search process as an instance of resource selection games to assess the ultimate impact an ideal perfect-information mechanism can have on drivers' decisions. From a methodological point of view, we expect that our modeling approach may be reused in a broader class of competitive service provision scenarios (it is tempting to draw parallels with other network resource allocation problems, e.g., the Access Point Association problem). On a more practical note, the model dictates plausible conditions under which different pricing policies steer the equilibrium strategies, reduce the inefficiency of the information mechanism, and favor the social welfare. We conclude by iterating on the strong and long-debated assumption that drivers do behave as fully rational utility-maximizing decision-makers; namely, that they can exhaustively analyze the possible strategies available to themselves and the other drivers, identify the equilibrium profile(s), and act

accordingly to realize it. Simon [11] challenged both the normative and descriptive capacity of the fully rational decision-maker, arguing that human decisions are most often made under knowledge, time and computational constraints. One way to accommodate the first of these constraints is through (pre-)Bayesian games of incomplete information. In [10], we formulate (pre-)Bayesian variants of the parking spot selection game to assess the impact of information accuracy on drivers' behavior and, ultimately, the service cost. However, models that depart from the utility-maximization norm and draw on fairly simple cognitive heuristics, e.g., [12], better reflect Simon's argument that humans are satisficers rather than maximizers. For example, the authors in [13] explore the impact of the fixed-distance heuristic on a simpler version of the unassisted parking search problem. The comparison of normative and more descriptive decision-making modeling approaches, both in the context of the parking spot selection problem and in more general decision-making contexts, is an interesting area worthy of further exploration.

REFERENCES

[1] http://sfpark.org/.
[2] R. C. Larson and K. Sasanuma, "Congestion pricing: A parking queue model," Journal of Industrial and Systems Engineering, vol. 4, no. 1, pp. 1–17, 2010.
[3] R. Arnott, "Spatial competition between parking garages and downtown parking policy," Transport Policy (Elsevier), pp. 458–469, 2006.
[4] M. Arbatskaya, K. Mukhopadhaya, and E. Rasmusen, "The parking lot problem," Department of Economics, Emory University, Atlanta, Tech. Rep., 2007.
[5] D. Ayala, O. Wolfson, B. Xu, B. Dasgupta, and J. Lin, "Parking slot assignment games," in Proc. 19th ACM SIGSPATIAL GIS, 2011.
[6] http://www.city-parking-in-europe.eu/.
[7] I. Ashlagi, D. Monderer, and M. Tennenholtz, "Resource selection games with unknown number of players," in Proc. AAMAS '06, Hakodate, Japan, 2006.
[8] E. Koutsoupias and C. H. Papadimitriou, "Worst-case equilibria," Computer Science Review, vol. 3, no. 2, pp. 65–69, 2009.
[9] S.-G. Cheng, D. M. Reeves, Y. Vorobeychik, and M. P. Wellman, "Notes on the equilibria in symmetric games," in Proc. 6th Workshop on Game Theoretic and Decision Theoretic Agents (co-located with IEEE AAMAS), New York, USA, August 2004.
[10] E. Kokolaki, M. Karaliopoulos, and I. Stavrakakis, "Leveraging information in vehicular parking games," Dept. of Informatics and Telecommunications, Univ. of Athens, http://cgi.di.uoa.gr/~grad0947/tr1.pdf, Tech. Rep., 2012.
[11] H. A. Simon, "A behavioral model of rational choice," The Quarterly Journal of Economics, vol. 69, no. 1, pp. 99–118, February 1955.
[12] D. G. Goldstein and G. Gigerenzer, "Models of ecological rationality: The recognition heuristic," Psychological Review, vol. 109, no. 1, pp. 75–90, 2002.
[13] J. M. C. Hutchinson, C. Fanselow, and P. M. Todd, Ecological rationality: intelligence in the world. New York: Oxford University Press, 2012.


Appendix E – [Content dissemination using the "recognition" heuristic]

This appendix contains the reprint of the following paper: R. Bruno, M. Conti, M. Mordacchini, A. Passarella, "An Analytical Model for Content Dissemination in Opportunistic Networks using Cognitive Heuristics", ACM MSWiM 2012.


An Analytical Model for Content Dissemination in Opportunistic Networks using Cognitive Heuristics Raffaele Bruno, Marco Conti, Matteo Mordacchini and Andrea Passarella IIT–CNR, Pisa, Italy

{r.bruno,m.conti,m.mordacchini,a.passarella}@iit.cnr.it

ABSTRACT

When faced with large amounts of data, human brains are able to swiftly react to stimuli and assert the relevance of discovered information, even under uncertainty and partial knowledge. These efficient decision-making abilities rely on so-called cognitive heuristics, which are rapid, adaptive, lightweight yet very effective schemes used by the brain to solve complex problems. In a content-centric future Internet where users generate and disseminate large amounts of content through opportunistic networking techniques, individual nodes should exhibit those properties to support a scalable content dissemination system. We therefore study whether such cognitive heuristics can also be used in such a networking environment. To this end, in this paper we develop an analytical model that describes a content dissemination mechanism for opportunistic networks based on one such heuristic, known as the recognition heuristic. Our model takes into account the different popularities of content types, and highlights the impact of the shared memory contributed by individual nodes to make the dissemination process more efficient. Furthermore, our model allows us to investigate the performance of the dissemination process for very large numbers of nodes, which might be very difficult to carry out through a simulation-based study.

Categories and Subject Descriptors C.2.1 [Network Architecture and Design]: Wireless communication

General Terms Algorithms, Design, Performance

Keywords Opportunistic Networks, Cognitive heuristics, Data Dissemination, Analytical Model

1. INTRODUCTION

In the Future Internet scenario, the active participation of users in the production and diffusion of content, using mobile devices in connection with more traditional CDNs and P2P networks, will create a very large and dynamic information


environment [14]. A considerable part of these data will also be very contextualized, i.e. relevant only at specific times and/or geographic areas, and of interest only for specific groups of users. Opportunistic networking techniques will thus become a very important complement to infrastructure-based networks supporting mobile users (such as cellular and WiMAX) in order to efficiently disseminate content to interested users [15]. As mobile devices carried by users will have an active role in the data dissemination process, and considering the large volume of dynamic information that will be produced, devices will be faced with the very challenging task of determining the relevance of discovered content and selecting the most interesting data for the user. In order to be effective, this task should be performed rapidly. Furthermore, devices will have to make their decisions using only a partial knowledge of the environment and of the available content.

One approach to address the aforementioned problems is to embed autonomic decision-making abilities into mobile devices. In this paper, we focus on a new (to the best of our knowledge) direction in the autonomic networking field, and we exploit results coming from the cognitive psychology area, by using models of how the human brain assesses the relevance of information under partial knowledge. In fact, the human brain is able to achieve effective decision-making results in the face of partial information and noisy data by exploiting cognitive processes known as cognitive heuristics. These heuristics are the key psychological tools that allow the brain to come up with a highly effective decision under conditions of uncertainty, limited time, knowledge, and computational capabilities. Moreover, they are adaptive tools that are able to refine their effectiveness through a continuous interaction with the environment. In the past, cognitive sciences were able to derive accurate computational models for those cognitive processes.

One of the simplest and most effective heuristics is known as the recognition heuristic. The role of this heuristic is to assert the relevance of items when choosing among a pair (or a set) of objects. More specifically, it works by assuming that, given a pair of objects, if one is recognized (i.e. the brain is able to recall that it already "heard" about that object) and the other is not, the recognized object has a higher value with respect to a given (possibly unknown) evaluation criterion. The simplicity of this heuristic, its effectiveness in decision-making processes, and its adaptivity make it a suitable candidate for embedding a human cognitive process into autonomic, self-aware ICT devices dealing with a content-centric Internet. We have designed an algorithm for data dissemination in opportunistic networking environments that is inspired by the recognition heuristic. It allows us to simplify and limit the complexity of the data selection task in an opportunistic network, while maintaining a high effectiveness in the

data dissemination process. The full specification of this algorithm and preliminary results based on simulations are shown in [2]. In this paper, we first briefly overview the main features of this algorithm. Then, in Section 5 we describe our analytical model of the content dissemination mechanism. This model takes into account the different popularities of content types, and highlights the impact of the shared memory contributed by individual nodes to make the dissemination process more efficient. In Section 6 we present a comparison between the model predictions and related simulation results, showing that the model is able to capture the transient and steady-state behaviors of the data dissemination process. In particular, the model is able to estimate the level of replication of data items in the network, and thus to predict the probability that users will receive the content they are interested in within a given time from the instant it was generated in the network. Furthermore, our model allows us to explore network scenarios with large numbers of users and data items, which are generally difficult to study using simulation tools. Finally, Sections 2, 3, and 7 complete the paper by summarizing related work, describing the main features of the recognition heuristic, and drawing the main conclusions of this work, respectively.

2. RELATED WORK

Data dissemination algorithms have been proposed for diverse families of mobile networks. The techniques proposed in [18] are representative of a body of work focusing on caching strategies for well-connected MANETs. On the contrary, the focus of this paper is on networking environments where content dissemination is more challenging and such policies cannot be applied. The first work that investigated the problem of content dissemination in opportunistic networks was developed in the PodNet Project [9]. Each node in the PodNet system is subscribed to a channel and devotes part of its memory space to storing data items belonging to the channel it is subscribed to, and part to supporting a collaborative exchange of information. More precisely, when two nodes meet they exchange their cached data items and use heuristics based on the popularity of data channels to decide which fetched data items to store. A main limitation of PodNet is that it does not exploit any social information about nodes. More advanced approaches exploit information about users' social relationships to drive the content dissemination process [19, 3, 1]. Specifically, the work in [19] defines a pub/sub overlay over an opportunistic network, where some nodes act as brokers, dispatching relevant content toward the most interested other peers. These brokers are the most "socially-connected" nodes, i.e., those nodes that are expected to be most available and easily reachable in the network. SocialCast [3] proposes a first attempt to exploit social information in dissemination processes. This is also the goal of the work in [1], where, however, a more refined and complete approach is used, based on social-aware dissemination strategies. Content dissemination is driven by the social structure of the network of users. Nodes evaluate which data items to store taking into account the social utility of the items, i.e. how likely they are to be of interest to users the node has social relationships with (and with whom, therefore, it is expected to be in touch in the near future). With respect to these approaches, in this paper we take a completely new direction, by borrowing models of human cognitive processes coming from the cognitive psychology domain. As this approach is still totally unexplored, in this paper we limit the set of contextual information that we use to the very minimum, and, for example, we do not exploit information about users' social structures. This allows us to

obtain initial exploratory results about the feasibility of this novel approach.

3. THE RECOGNITION HEURISTIC (RH)

Cognitive heuristics can be defined as simple rules used by the brain for facing situations in which people have to act quickly, relying on a partial knowledge of all the problem variables, the evaluation criterion of the different possible choices is not known, and the problem itself may be ill-defined in such a way that traditional logic and probability theory are prevented from finding the optimal solution. Cognitive heuristics are then able to deal with difficult problems by answering simpler ones. Early works on cognitive heuristics put the accent on how these cognitive processes could lead to systematic errors and cognitive biases. In contrast with this vision, Gerd Gigerenzer and most of the recent psychological literature focus the research attention on the ability of these "brain tools" to produce very accurate judgements, and on the environmental conditions under which they are more efficient. Although the general concept of heuristic is similar to that widely used in computer science, heuristics used by human cognitive processes are formalized in a different way by cognitive sciences. One of the simplest cognitive heuristics, which has attracted broad attention from cognitive science in the last decade, is the recognition heuristic [6, 4]. The recognition heuristic is based on a very simple rule. When evaluating a pair of objects, if one is recognized (i.e. the brain is able to recall that it already "heard" about that object) and the other is not, this heuristic infers that the recognized object has a higher value with respect to a given evaluation criterion. The recognition heuristic is effective when the recognition of objects is highly correlated with the ideal evaluation criterion that should be used to select among the possible choices if complete information were available. In this case, the heuristic is said to be ecologically rational. In fact, the correlation is adaptively derived from the environment by exploiting the presence in it of mediators that carry information (coded in variables) used by the heuristic itself to approximate the value of the objects with respect to the criterion. To better understand how the recognition heuristic works, the foundational study on the recognition heuristic [4] uses, as an example, the estimation of university endowments. To illustrate this example we also refer to Fig. 1, which depicts the general elements involved in the recognition heuristic and the relationships between them. The evaluation criterion to be used in the example is the value of the endowment. This information is generally not publicly available. Nevertheless, it is argued that newspapers could act as mediators, since they periodically publish news related to the most important universities. Thus, the number of times a university appears in the newspapers could be a strong indicator that it has larger endowments than universities that do not or only rarely appear in the media. In other words, in this case the mediator variable related to the evaluation criterion is the number of citations. As said before, the correlation between mediators and the evaluation criterion is called ecological correlation. When a person has to choose which university has the biggest endowment between a pair of institution names, he uses the recognition heuristic and chooses a recognized name against an unknown one.
Clearly, in this case, newspapers influence the recognition, since the more they cite an institution, the more likely that institution's name will be remembered and, thus, recognized. Since the brain evaluates options exploiting the citations in newspapers instead of the real, unknown criterion, the relation between the recognition and the mediators is called surrogate correlation. From this example it is straightforward to notice that the effectiveness of the recognition heuristic, i.e. the recognition validity, is continuously reinforced by the stimuli received from the environment. The simplicity of the recognition heuristic has made it a powerful tool in support of decision-making processes. As such, it has been successfully used in various fields [11], like financial decision-making processes [12], the forecast of future purchase activities [7], the results of sport events [17], or political election outcomes [10].

Figure 1: Ecological rationality of the recognition heuristic
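To make the decision rule concrete, the following minimal Python sketch (ours, not part of the paper; all names are illustrative) implements the pairwise recognition heuristic: pick the recognized object when exactly one of the two is recognized, and fall back to another cue (here, simply a guess) otherwise.

```python
import random

def recognition_heuristic(obj_a, obj_b, recognized):
    """Return the object inferred to have the higher criterion value.

    `recognized` is the set of objects the decision maker has heard of,
    i.e. the surrogate built up from mediators (e.g. newspaper citations)."""
    a_known, b_known = obj_a in recognized, obj_b in recognized
    if a_known and not b_known:
        return obj_a
    if b_known and not a_known:
        return obj_b
    # Both or neither recognized: the heuristic cannot discriminate,
    # so defer to some other cue (here, a random guess).
    return random.choice([obj_a, obj_b])

if __name__ == "__main__":
    known = {"University A", "University B"}          # hypothetical recognition memory
    print(recognition_heuristic("University A", "University X", known))
```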

4. AN RH-BASED DATA DISSEMINATION SCHEME FOR OPPORTUNISTIC NETWORKS

In this study we consider a network scenario where mobile nodes generate data items and other nodes can be interested in those items. More precisely, users are interested in data channels, i.e. high-level topics to which the data items belong. The goal is to bring all the data items of a given channel to all the nodes that are interested in it. To this end, each node contributes a limited amount of storage space to help the dissemination process, since contacts between users are the only way to disseminate data items. Ideally, upon encountering another node, a node should select which data items to fetch and keep in its storage space, such that the total utility of its local storage space for the overall dissemination process is maximized. Solving this problem exactly would clearly require global knowledge and central coordination, which is unfeasible in opportunistic networks. Therefore, the algorithm we have originally proposed in [2] (and summarized in this section) exploits the recognition heuristic to efficiently approximate the utility of storing the data items that can be exchanged during a pair-wise contact between two nodes. Based on these approximated utility values, each node stores only the most useful data items, selected among both those already available locally and those available on the encountered node, until its storage space is full. With reference to both the description given in Sec. 3 and the elements reported in Fig. 1, the unknown criterion to be estimated is then the utility of fetching an item, with respect to the global information dissemination process. Since the only way to obtain information is through contacts between nodes, each peer considers the other nodes as the environmental mediators depicted in Fig. 1. Indeed, they are the only means by which the information needed by the recognition heuristic can be collected. Concerning the information exchange process, assuming that each individual node is interested in only one channel, nodes exchange only minimal information when meeting: the channel they are interested in and a summary of the data items they currently store. At each encounter, this information is

used to evaluate the utility of those data items during future meetings with other nodes. Essentially, a node estimates how many users are interested in the channels of discovered items and how those items are spread in the network. At a very high level, nodes decide which items to store based on two simple assumptions: i) the more other nodes request a given channel, the more relevant its data items are; ii) the less a data item is replicated in the network, the more useful it could be to store a copy of that item. These assumptions bind the observed information carried by the mediators with the estimation of the inaccessible evaluation criterion, and thus constitute the surrogate correlation of the recognition heuristic. Since in our algorithm there are two types of decisions to be taken to estimate the utility of the data items to fetch (i.e., how relevant a channel is, and how relevant a data item is), we use two distinct recognition heuristics to separately recognize channels and data items. Intuitively, a node recognizes a channel as soon as it becomes "popular enough", i.e., as soon as that node encounters enough other nodes that are interested in the same channel. Hence, if a channel is recognized, it means that several users are interested in it and, thus, it is worth (for the overall community) circulating its data items. Furthermore, a node recognizes a data item if it is "spread enough", i.e., as soon as it is encountered on at least a given number of other nodes. If a data item is not recognized, it means that only a few users have a copy of it in their memory, so it should be replicated more broadly to increase its diffusion. On the basis of these considerations, a node decides to fetch a data item from the encountered node if: a) it recognizes the channel the data item belongs to, and b) it does not recognize the data item itself. Finally, to recognize a channel (or a data item) each node maintains a counter that counts how many times it encountered a node that is interested in that channel (or that stores that data item). A channel (data item) is then recognized if this counter reaches a given recognition threshold. Furthermore, the recognition counter is decremented if the channel (data item) is not seen for a while. The exploitation of such "recognition thresholds" is consistent with cognitive psychology research on how recognition memory works in the brain (e.g. [16]). As we have assumed that the shared storage space contributed by the nodes is limited, we have to define an algorithm that exploits the recognition level of channels and data items to select which items to keep (among the ones currently already available at the node or available on the encountered node) in case the storage space is insufficient to take all available data items. This algorithm thus works as a replacement algorithm for data items in the nodes' storage space. To this end, we use a modified Take-the-Best algorithm [5], taken from the cognitive psychology literature. This algorithm ranks two (or more) options using several steps. On each step a heuristic is applied (e.g. the recognition heuristic), until it reaches the first (best) step where the heuristic can discriminate among the options. In our case, nodes decide which data items to store using a three-step version of the Take-the-Best algorithm.
The first step applies the recognition heuristic to the channels, the second one uses the recognition of data items, while the last step ranks the items according to their recognition value, as explained below. The steps of the algorithm are shown in Fig. 2, where B denotes the size of the memory space dedicated to storing data items belonging to channels the node is not interested in. More precisely, the data items available on an encountered node (hereafter called "new" items) are evaluated together with the ones already stored in the node (hereafter called "old" items). To this end, we first apply the

recognition heuristic to the channels, i.e. all the items whose channel is not recognized are discarded. If there is enough space to store all the remaining items, nothing else happens. Otherwise, we apply the recognition heuristic to the data items, i.e. recognized data items are dropped, as they are already widely spread. In case the number of remaining items still exceeds the size of the node cache, items are ranked according to their recognition level using the following two criteria. First of all, data items with lower recognition values (i.e., whose counter is lower) are more relevant than the ones with higher recognition levels, because the former are considered less widespread and, thus, more useful to fetch. Secondly, among the data items with the same recognition value, new ones are more relevant than old ones. Note that the heuristics applied in the subsequent steps are of increasing complexity. This is also a typical cognitive scheme, which tries to use the lowest possible amount of resources to discriminate between possible choices. A more detailed description of the recognition algorithms and of the modified Take-the-Best algorithm is given in [2].

Figure 2: Modified Take-the-Best algorithm
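A compact sketch of this three-step replacement policy is given below (our reconstruction of the scheme summarized above and fully specified in [2]; the data structures, predicates and the handling of the under-filled case are illustrative assumptions, not the paper's exact algorithm).

```python
def take_the_best(old_items, new_items, B, chan_recognized, item_recognized, item_counter):
    """Select at most B items for the Opportunistic Cache (OC).

    old_items / new_items: lists of (item_id, channel_id) pairs already in OC,
    or offered by the encountered node.  The predicates and counter accessor
    are assumed to be backed by the CC and IC recognition caches of Sec. 4.1."""
    # Tag candidates so that ties are broken in favour of new items.
    candidates = [(it, ch, True) for it, ch in new_items] + \
                 [(it, ch, False) for it, ch in old_items]

    # Step 1: recognition heuristic on channels -- items of unrecognized
    # channels are discarded.
    candidates = [c for c in candidates if chan_recognized(c[1])]
    if len(candidates) <= B:
        return [c[0] for c in candidates]

    # Step 2: recognition heuristic on items -- recognized (widely spread)
    # items are dropped first (for simplicity, even if some slots remain).
    filtered = [c for c in candidates if not item_recognized(c[0])]
    if len(filtered) <= B:
        return [c[0] for c in filtered]

    # Step 3: rank by recognition counter (lower = less spread = more useful);
    # among equal counters, new items precede old ones.
    filtered.sort(key=lambda c: (item_counter(c[0]), not c[2]))
    return [c[0] for c in filtered[:B]]

# Illustrative use with ad-hoc predicates (item threshold R = 5).
counters = {"x": 1, "y": 4, "z": 5}
pick = take_the_best(old_items=[("y", "ch1")], new_items=[("x", "ch1"), ("z", "ch2")],
                     B=1, chan_recognized=lambda ch: ch == "ch1",
                     item_recognized=lambda it: counters[it] >= 5,
                     item_counter=lambda it: counters[it])
print(pick)   # ['x']: z's channel is not recognized, and x beats y on the counter
```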

4.1 Node Caches

To implement the above algorithm, each node makes use of several internal caches. Fig. 3 shows the internal architecture of the caches of a single node. With respect to this figure, we have two classes of caches: data caches and recognition caches. More precisely,

Data caches:
• LI is the cache containing the Local Items, i.e. the items generated by the node itself.
• SC is the Subscribed Channel cache, i.e. the cache containing the items belonging to the channel the node is interested in, obtained through encounters with other peers. Furthermore, we assume that nodes are able to reserve enough space for the channel they are interested in. Thus, the SC cache is assumed unlimited.
• OC is the Opportunistic Cache, i.e. the cache containing the objects obtained through exchanges with other nodes and belonging to channels the node is not interested in. This cache is the part of the node's storage space contributed for the overall efficiency of the data dissemination process, beyond the particular interest of the individual node. Therefore, we assume that it has a limited size. Its content consists of the items the node believes to be the most "useful" for a collaborative information dissemination process. They are selected using the values contained in the Recognition caches.

Recognition caches:
• CC is the Channel Cache: whenever the node meets another peer subscribed to a given channel, the channel ID is put in this cache, along with a counter. The counter is incremented every time a node subscribed to the same channel is found. Note that the storage space required to maintain this information is very limited compared with the size of data items. Thus, the size of this cache is assumed unlimited.
• IC is the Item Cache: similarly to the previous cache, when a new data item is seen in an encountered node, its ID is put in this cache, along with a counter. The counter is incremented every time the object is seen again. We assume that this cache, too, has no space limits.

Finally, we recall that the CC and IC caches make use of two different recognition thresholds, Rc and R, respectively. Items and channels whose counters exceed the value of the threshold are marked as recognized. This classification is used by the OC cache in order to handle its data, using the modified Take-the-Best algorithm described before.

Figure 3: Node Architecture
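For concreteness, the recognition-cache bookkeeping can be sketched as follows (our illustration, not part of the paper; the ageing rule used here is a simplified stand-in for the probabilistic discount adopted in the analytical model of Section 5).

```python
from collections import defaultdict

class RecognitionCache:
    """Counter-based recognition memory for channels (CC) or items (IC):
    a counter per ID, incremented on every sighting, decremented when the ID
    is not seen for a while, and compared against a recognition threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counters = defaultdict(int)

    def saw(self, obj_id):
        # Cap the counter at the threshold, as in the analytical model.
        self.counters[obj_id] = min(self.counters[obj_id] + 1, self.threshold)

    def not_seen(self, obj_id):
        # Simplified ageing: decrement (never below zero) when not observed.
        self.counters[obj_id] = max(self.counters[obj_id] - 1, 0)

    def recognized(self, obj_id):
        return self.counters[obj_id] >= self.threshold

# Example: channel cache with Rc = 5 and item cache with R = 5.
cc, ic = RecognitionCache(5), RecognitionCache(5)
for _ in range(5):
    cc.saw("channel-1")
print(cc.recognized("channel-1"), ic.recognized("item-42"))   # True False
```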

5. ANALYTICAL MODEL

This section develops the analytical model describing the temporal evolution of the replication levels for data items belonging to a tagged channel c. More precisely, in our model we use a set of embedded Markov chains to describe the status of the various caches deployed at the mobile nodes of the opportunistic network. Since the status of the caches of a node can change only when it exchanges data items upon encountering another node, the embedding points for our analysis are the time instants at which pairs of nodes meet. Concerning the number of Markov chains needed in our model and their state space size, we use a distinct Markov chain for each channel to describe the evolution of the channel recognition level in the CC cache of a generic node. Similarly, we use a separate Markov chain for each channel to describe whether a generic data item of that channel is stored in the SC cache. For the IC and OC caches, we consider a generic data item for each channel, and a Markov chain describes (i) the evolution of the recognition level of that data item in IC; and (ii) whether the data item is stored in OC and in which position, respectively. It is important to observe that in this study we consider a homogeneous network scenario, where each node meets the other nodes in the network with equal probability at an encounter event (this condition holds for the stationary regime of several popular mobility models, such as the Random Waypoint model [13]). Therefore, it is sufficient to derive the steady-state probabilities for the aforementioned Markov chains by considering a single tagged node, since the same average behavior is observed in all the other nodes. In conclusion, our model requires tracking the evolution of four Markov chains per channel. As channels (data types) can reasonably be assumed to be much fewer than data items, the complexity of the model is reasonable. The goal pursued by our analysis is to derive the fraction of nodes in the network that, at time t, have a copy of a data item a belonging to channel c either in SC or OC, hereafter denoted as rt(a). In fact, the knowledge of rt(a) is enough to fully characterize the dissemination process of the data items in the opportunistic network. In the following sections, we will describe how the state probabilities of the Markov chains describing the cache status are used to derive rt(a) for each channel. Although the formal definition of the state spaces of these Markov chains is reported later, for the sake of presentation clarity in Tab. 1 we list the main notations used in our analysis.

Table 1: List of mathematical notations
Symbol | Definition
N | Number of nodes of the network
M | Number of items in the network
C | Number of channels in the network
a.c | Channel to which data item a belongs
rt(a) | Fraction of nodes storing a copy of a
Pop(c) | Number of nodes subscribed to channel c
ψ_a^t[1] (ψ_a^t[0]) | Prob. that data item a is (is not) in SC at time t
ν_c^t[i] | Prob. that channel c has recognition level i in CC at time t
υ_a^t[i] | Prob. that data item a has recognition level i in IC at time t
φ_a^t[i] | Prob. that data item a is stored in OC at time t with recognition level i
Rc | Threshold for channel recognition
R | Threshold for data item recognition

5.1 Subscribed Channel Cache

Let us consider a generic data item a belonging to the channel c to which the tagged node is subscribed. Then, in the Markov chain describing the status of the SC cache, state 1 is used to indicate that a copy of data item a is stored in SC, while state 0 indicates that SC does not hold a copy of a. Let us assume that at time t+1 the tagged node encounters another node, while the previous encounter event occurred at time t. Intuitively, the probability p_{0,1} to move from state 0 to state 1 at time t+1, i.e., the probability that a copy of data item a is fetched by the tagged node from the node encountered at time t+1, is given by p_{0,1} = rt(a). On the other hand, the probability p_{0,0} that a data item not present in SC at time t is also not present at time t+1 can be computed as p_{0,0} = 1 − p_{0,1}. Since we have assumed that the SC cache is large enough to contain all the data items of the channel the tagged node is subscribed to, a data item cannot be dropped by the SC cache. Formally, this implies that p_{1,1} = 1. Now, let ψ_a^t = {ψ_a^t[0], ψ_a^t[1]} be the state vector of the Markov chain at time t, and let us denote with P = {p_{i,j}}, i, j = 0, 1, its transition matrix evaluated at time t+1 (the p_{i,j} values depend on t, but for the sake of notational simplicity we omit it). Then, we can compute the state vector at time t+1 as follows:

ψ_a^{t+1} = ψ_a^t P .   (1)

The initial distribution of the state vector is ψ_a^0 = {1, 0}, since SC is empty at the beginning of the system.
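A minimal numerical sketch of this two-state chain (ours; it uses an illustrative constant r_t(a), whereas in the full model the replication level is updated at every embedding point):

```python
import numpy as np

def sc_step(psi, r_a):
    """One embedded step of the SC-cache chain of Section 5.1.

    psi = [P(item not in SC), P(item in SC)]; r_a = current replication r_t(a).
    A data item enters SC with probability r_t(a) and is never dropped."""
    P = np.array([[1.0 - r_a, r_a],
                  [0.0,       1.0]])
    return psi @ P

psi = np.array([1.0, 0.0])        # SC empty at start-up (psi_a^0 = {1, 0})
for _ in range(10):               # ten encounters with r_t(a) = 0.05 (illustrative)
    psi = sc_step(psi, 0.05)
print(psi)                        # probability of holding the item after 10 encounters
```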

5.2 Channel Recognition

As described in Sec. 4.1, CC stores the number of times each channel has been seen by the tagged node. Thus, state i of the Markov chain used to describe the status of the CC for the tagged channel c represents its recognition level. State 0 represents the condition of channel c not yet observed. It is important to note that the probability for a channel to be recognized (i.e. to reach the recognition threshold) depends only on its popularity Pop(c). Then, the transition probabilities of this time-homogeneous Markov chain are given by

$$p_{i,j} = \begin{cases} Pop(c)/N & \text{if } j = i+1,\; i \in [0, R_c-1] \\ \alpha^i\,(1 - Pop(c)/N) & \text{if } j = i-1,\; i \in [1, R_c] \\ 1 - \big(\frac{Pop(c)}{N}(1-\alpha^i) + \alpha^i\big) & \text{if } j = i,\; i \in [1, R_c-1] \\ 1 - Pop(c)/N & \text{if } i = j = 0 \\ 1 - (1 - Pop(c)/N)\,\alpha^i & \text{if } i = j = R_c \\ 0 & \text{otherwise} \end{cases} \quad (2)$$

where Pop(c)/N is the probability of encountering a node subscribed to channel c, and α (0 < α < 1) is a discount parameter. More precisely, if a channel is not observed in the CC cache of the encountered node, then the tagged node decreases its recognition level i by one with a probability that decreases exponentially as i increases. This approach avoids that a tagged node considers relevant a channel that is no longer popular in the network. On the other hand, the probabilities of remaining in the same recognition state after the encounter event are simply the complements of the total probabilities of moving to other states. Special cases are state i = 0, where the recognition level cannot decrease, and state i = Rc, where the recognition level cannot increase. Now let ν_c^t = {ν_c^t[0], ν_c^t[1], . . . , ν_c^t[Rc]} be the state vector at time t and P = {p_{i,j}} the transition probability matrix. It holds that ν_c^{t+1} = ν_c^t P. Since at time t = 0 a node has no information on the channels, the initial distribution is ν_c^0 = {1, 0, . . . , 0}.

5.3 Item Recognition

Conceptually, the recognition process of data items is similar to the channel recognition. In fact, IC maintains a list of the observed data items and their recognition levels, defined as the number of times they have been observed in encountered nodes. Thus, state i of the Markov chain used to describe the status of the IC for the generic data item belonging to channel c represents its recognition level, while state 0 represents the condition of data item a not yet observed. Considering that the probability that a tagged

node finds a copy of data item a in the caches of another node encountered at time t+1 depends only on its replication level rt(a), the transition probability matrix of this Markov chain can be calculated as

$$p_{i,j} = \begin{cases} r_t(a) & \text{if } j = i+1,\; i \in [0, R-1] \\ \gamma^i\,(1 - r_t(a)) & \text{if } j = i-1,\; i \in [1, R] \\ 1 - [r_t(a)(1-\gamma^i) + \gamma^i] & \text{if } j = i,\; i \in [1, R-1] \\ 1 - r_t(a) & \text{if } i = j = 0 \\ 1 - (1 - r_t(a))\,\gamma^i & \text{if } i = j = R \\ 0 & \text{otherwise} \end{cases} \quad (3)$$

where γ (0 < γ < 1) is the discount factor for the item recognition level. Now let υ_a^t = {υ_a^t[0], υ_a^t[1], . . . , υ_a^t[R]} be the state vector at time t and P = {p_{i,j}} the transition probability matrix. It holds that υ_a^{t+1} = υ_a^t P. Since at time t = 0 a node has no information on the data items, the initial distribution is υ_a^0 = {1, 0, . . . , 0}.
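Both recognition chains share the same birth-death structure, so a single helper can build either transition matrix: Eq. (2) with (Pop(c)/N, α, Rc), or Eq. (3) with (r_t(a), γ, R). The sketch below is ours, with illustrative parameter values.

```python
import numpy as np

def recognition_matrix(p_up, discount, threshold):
    """Transition matrix of the recognition-level chain (Eqs. (2)/(3)).

    p_up      : probability of observing the channel/item at an encounter
                (Pop(c)/N for channels, r_t(a) for items)
    discount  : alpha (channels) or gamma (items), with 0 < discount < 1
    threshold : Rc or R; states are recognition levels 0..threshold."""
    K = threshold
    P = np.zeros((K + 1, K + 1))
    P[0, 0], P[0, 1] = 1.0 - p_up, p_up
    for i in range(1, K):
        P[i, i + 1] = p_up                              # level up on a sighting
        P[i, i - 1] = discount**i * (1.0 - p_up)        # discounted decay otherwise
        P[i, i] = 1.0 - P[i, i + 1] - P[i, i - 1]
    P[K, K - 1] = discount**K * (1.0 - p_up)
    P[K, K] = 1.0 - P[K, K - 1]
    return P

# Channel-recognition chain for a channel with Pop(c)/N = 0.3, alpha = 0.8, Rc = 5.
P = recognition_matrix(0.3, 0.8, 5)
nu = np.zeros(6); nu[0] = 1.0                           # nu_c^0 = {1, 0, ..., 0}
for _ in range(50):
    nu = nu @ P
print(f"P(channel recognized) after 50 encounters: {nu[-1]:.3f}")
```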

5.4 Opportunistic Cache

Differently from the other caches, OC has a limited size B. As described in Sec. 4.1, only copies of data items belonging to recognized channels can be stored in the OC cache. However, given that OC can contain only a subset of the available data items, a replacement policy is needed to drop less useful data items. We recall that our modified Take-the-Best replacement policy gives a higher priority to data items with lower recognition levels. In case of ties, the data items found in the caches of the encountered node are preferred to the ones already in the OC cache of the tagged node.

To tackle the problem of modeling the evolution of OC, we represent this cache as a queuing network, as shown in Fig. 4. In this queuing network, a sub-queue i stores all data items that are in the OC and have the same recognition level i. The sub-queue 0 is a virtual queue that contains all the data items that are outside of the OC, independently of their recognition levels. Then, the i-th element (i > 0) in the state vector φ_a^t = {φ_a^t[0], φ_a^t[1], . . . , φ_a^t[R]} represents the probability that the generic data item a is in the OC cache with recognition level i at time t. It is also important to note that each individual sub-queue does not have a fixed size, but we must ensure that the sum of the numbers of data items stored in these sub-queues (excluding sub-queue 0) is lower than or equal to B.

Figure 4: Queuing network modeling the OC cache.

It is intuitive to note that modeling the replacement policy used to manage the OC cache is a hard task. To facilitate the analysis we split the problem into two simpler sub-problems. First of all, we model the reordering of the data items stored in the OC cache due to changes in their recognition levels after an encounter event. For instance, a data item that was initially stored in sub-queue i should be moved to sub-queue i+1 if its recognition level is increased upon an encounter event. Since this process involves only an internal reordering of stored data items, no data items are dropped. The second step in the analysis takes into account that new data items fetched from the caches of the encountered node may enter the OC of the tagged node at a sub-queue that depends on their recognition level. Due to cache size constraints, some of these data items may not be allowed to enter the OC cache, or some data items already stored in OC may be removed to let new data items enter the OC cache. In addition, in both steps we consider the possibility that data items in the OC are moved to sub-queue 0 (i.e. are dropped) if their channel is not recognized anymore after the encounter occurring at time t + 1. In the following we separately describe these two modeling phases.

Step 1. Let us introduce an auxiliary Markov chain, whose state vector φ′_a = {φ′_a[0], φ′_a[1], . . . , φ′_a[R]} represents the probability that the generic data item a is in the OC cache with recognition level i after the encounter event, but before new data items are inserted in the OC. Then, we have that φ′_a = φ_a^t P′, where P′ is the transition probability matrix of the auxiliary Markov chain modeling the data item reordering. Since no data items enter OC in this phase, we have that p′_{0,0} = 1. However, a data item could be removed from OC if the channel it belongs to loses its recognition, which happens with probability (1 − ν_a^{t+1}[Rc]), or if the recognition level of the data item becomes null. In other words,

p′_{i,0} = 1 − ν_a^{t+1}[Rc],  i ∈ [2, R],   (4)

while

p′_{1,0} = (1 − ν_a^{t+1}[Rc]) + ν_a^{t+1}[Rc] γ (1 − rt(a)).   (5)

Formula (5) can be explained by noting that a data item of a recognized channel can change its recognition level from one to zero only if it is not in the caches of the encountered node, and the tagged node applies the discount factor γ to the data-item recognition level stored in IC (see formula (3)). Following a similar line of reasoning as in (5), we can compute the probability that a data item already in the OC moves to either the backwards or the forwards sub-queue. More formally, we have that

p′_{i,i−1} = ν_a^{t+1}[Rc] γ^i (1 − rt(a)),  i ∈ [2, R],   (6)

and

p′_{i,i+1} = ν_a^{t+1}[Rc] rt(a),  i ∈ [1, R − 1].   (7)

On the other hand, the probability of remaining in the same sub-queue after the encounter event at time t+1 is simply given by the complement of the sum of the probabilities of moving backwards or forwards and of leaving the OC cache. More formally, it holds that

p′_{1,1} = 1 − (p′_{1,0} + p′_{1,2}),   (8)
p′_{i,i} = 1 − (p′_{i,0} + p′_{i,i−1} + p′_{i,i+1}),  i ∈ [2, R − 1],   (9)
p′_{R,R} = 1 − (p′_{R,0} + p′_{R,R−1}).   (10)

We now exploit the knowledge of the state vector φ′_a to compute the new average number of data items in each sub-queue i, say B′_i, after the internal reordering, which is simply given by

$$B'_i = \sum_{a=1}^{M} \varphi'_a[i] . \quad (11)$$
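The Step-1 computation can be sketched as follows (ours; parameter values are illustrative). The function builds P′ from Eqs. (4)-(10) for a single tagged item; summing the resulting φ′_a[i] over all M items then yields the B′_i of Eq. (11).

```python
import numpy as np

def reordering_matrix(R, nu_rec, r_a, gamma):
    """Auxiliary matrix P' of Step 1 (Eqs. (4)-(10)).

    R      : item recognition threshold (sub-queues 1..R, 0 = outside OC)
    nu_rec : nu_a^{t+1}[Rc], prob. that the item's channel is recognized
    r_a    : current replication level r_t(a)
    gamma  : item-recognition discount factor."""
    P = np.zeros((R + 1, R + 1))
    P[0, 0] = 1.0                                           # nothing enters OC in this phase
    P[1, 0] = (1.0 - nu_rec) + nu_rec * gamma * (1.0 - r_a)     # Eq. (5)
    for i in range(2, R + 1):
        P[i, 0] = 1.0 - nu_rec                              # Eq. (4)
        P[i, i - 1] = nu_rec * gamma**i * (1.0 - r_a)       # Eq. (6)
    for i in range(1, R):
        P[i, i + 1] = nu_rec * r_a                          # Eq. (7)
    for i in range(1, R + 1):
        P[i, i] = 1.0 - P[i].sum()                          # Eqs. (8)-(10): residual probability
    return P

# phi_a^t for one item over sub-queues 0..R (R = 5), illustrative values.
phi = np.array([0.6, 0.25, 0.1, 0.05, 0.0, 0.0])
phi_prime = phi @ reordering_matrix(5, nu_rec=0.9, r_a=0.05, gamma=0.8)
print(phi_prime)        # B'_i would be the sum of phi_prime[i] over all M items
```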

Step 2. To derive the final status of the OC cache, i.e. φ_a^{t+1}[i], in this step we first compute the average number N_{0,i} of new data items that are eligible to enter OC at sub-queue i. Then, we compute the number F_i of available free slots at each sub-queue i of the OC. The F_i value is the key parameter we need to compute the probability that a new data item is either discarded or cached, and the probability that an old data item stored in OC is removed. To compute the N_{0,i} quantity we should observe that a new data item not already stored in the OC of the tagged node is eligible for entering the OC cache in sub-queue i if and only if it is stored in the caches of the encountered node, the channel it belongs to is recognized, and its recognition level before the encounter event was i−1. Formally, this can be written as follows:

$$N_{0,i} = \sum_{a=1}^{M} r_t(a)\, \nu_a^{t+1}[R_c]\, \upsilon_a^t[i-1]\, \varphi_a^t[0] . \quad (12)$$

It is important to note that N_{0,i} expresses the number of new data items that can potentially be copied into the OC cache. However, the actual number of new data items that are copied into OC will depend on the number of free slots. More precisely, let us denote with F_i the maximum number of free slots that new data items can occupy in the sub-queue i of the OC cache. It holds that

$$F_i = B - \sum_{j=1}^{i-1} B_j^{t+1} , \quad (13)$$

with F_1 = B. Indeed, data items (both new and old) with recognition level equal to one have the highest precedence and they can use the entire OC. On the contrary, data items with recognition level equal to i (i > 1) can use only the part of OC not used by data items with lower recognition levels. It is also important to point out that in formula (13) we must use the B_j^{t+1} values because they provide the sizes of the sub-queues after completing the internal reordering, the insertion of new items and the removal of old items. However, it is quite straightforward to observe that B_i^{t+1} is simply given by

$$B_i^{t+1} = \min(N_{0,i} + B'_i, F_i) . \quad (14)$$

Formula (14) can be explained by noting that if N_{0,i} + B′_i ≤ F_i, then there are enough free slots in OC for all the new items that should enter at level i and for the old items that are already at level i after the reordering. In the other case, some new items will be discarded and/or some old items will be dropped until only F_i slots are occupied (following the same line of reasoning, it is easy to compute the number N̂_{0,i} of new items that will effectively enter the OC cache at time t+1 as min(N_{0,i}, F_i)). Now, by using formulas (13), (14) and the initial condition F_1 = B, we can iteratively compute all the remaining B_i^{t+1} and F_i values. Finally, to compute the state vector φ_a^{t+1} we introduce an auxiliary transition probability matrix P such that φ_a^{t+1} = φ′_a P. As observed in formula (12), a new data item is eligible for entering OC with probability r_t(a) ν_a^{t+1}[Rc] υ_a^t[i−1]. However, that new data item will certainly be fetched by the tagged node if N_{0,i} ≤ F_i, otherwise only a fraction F_i/N_{0,i} will be fetched. Thus, the probability p_{0,i} that a new item effectively enters the OC cache can be expressed as

$$p_{0,i} = \begin{cases} r_t(a)\, \nu_a^{t+1}[R_c]\, \upsilon_a^t[i-1] & \text{if } N_{0,i} \le F_i \\ r_t(a)\, \nu_a^{t+1}[R_c]\, \upsilon_a^t[i-1]\, \dfrac{F_i}{N_{0,i}} & \text{otherwise.} \end{cases} \quad (15)$$

On the other hand, an old data item that was already stored in the OC cache should be removed if the new data items have consumed all (or most of) the free slots. More precisely, we have that

$$p_{i,0} = \begin{cases} 0 & \text{if } N_{0,i} + B'_i \le F_i \\ 1 & \text{if } N_{0,i} > F_i \\ 1 - \dfrac{F_i - N_{0,i}}{B'_i} & \text{otherwise.} \end{cases} \quad (16)$$

Formula (16) indicates that in case N_{0,i} + B′_i ≤ F_i there is no need to replace data items, because the free slots can accommodate both new and old data items. On the other hand, if N_{0,i} ≤ F_i < N_{0,i} + B′_i, a fraction 1 − (F_i − N_{0,i})/B′_i of the old data items has to be removed from the OC cache.
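The Step-2 quantities can be computed iteratively as described above; the following sketch (ours, with illustrative input values) reproduces Eqs. (13)-(16).

```python
def step2_quantities(N0, B_prime, B):
    """Free slots F_i and resulting sub-queue sizes B_i^{t+1} (Eqs. (13)-(14)).

    N0[i], B_prime[i]: average number of new items eligible for sub-queue i
    and of old items sitting there after the reordering, for i = 1..R.
    B: OC capacity. Index 0 is unused, to keep the paper's numbering."""
    R = len(N0) - 1
    F = [0.0] * (R + 1)
    B_next = [0.0] * (R + 1)
    for i in range(1, R + 1):
        F[i] = B - sum(B_next[1:i])                   # Eq. (13), with F_1 = B
        B_next[i] = min(N0[i] + B_prime[i], F[i])     # Eq. (14)
    return F, B_next

def admission_eviction(i, N0, B_prime, F, p_eligible):
    """Per-item admission prob. p_{0,i} and eviction prob. p_{i,0} (Eqs. (15)-(16)).
    p_eligible = r_t(a) * nu_a^{t+1}[Rc] * upsilon_a^t[i-1]."""
    p_in = p_eligible if N0[i] <= F[i] else p_eligible * F[i] / N0[i]
    if N0[i] + B_prime[i] <= F[i]:
        p_out = 0.0
    elif N0[i] > F[i]:
        p_out = 1.0
    else:
        p_out = 1.0 - (F[i] - N0[i]) / B_prime[i]
    return p_in, p_out

# Illustrative numbers: OC of B = 10 slots, R = 3 recognition levels.
N0 = [0, 4.0, 3.0, 1.0]; B_prime = [0, 5.0, 4.0, 2.0]
F, B_next = step2_quantities(N0, B_prime, 10)
print(F, B_next)
print(admission_eviction(1, N0, B_prime, F, p_eligible=0.02))
```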

5.5 Item Diffusion

At the beginning of network operations data items are not replicated, but they are stored in the LI caches of the nodes that have generated them. Since data items are assumed to be uniformly distributed over the network nodes, we have that at the initialization time r_0(a) = 1/N. After an encounter event occurring at time t, the replication of each data item is given by

$$r_t(a) = r_0(a) + (1 - r_0(a)) \left[ \frac{Pop(a.c)}{N}\, \psi_a^t[1] + \left(1 - \frac{Pop(a.c)}{N}\right) \sum_{i=1}^{R} \varphi_a^t[i] \right] \quad (17)$$

where either the node has generated the data item, with probability r_0(a), or it is subscribed to the channel the data item belongs to and it has a copy of a in its SC cache, or it is not subscribed to the channel the data item belongs to and it has a copy of a in one of the sub-queues of its OC cache.
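Eq. (17) combines the outputs of the previous chains into the replication level; a direct transcription in Python (ours, with illustrative values) is:

```python
def replication_level(r0, pop_c, N, psi1, phi):
    """Replication level r_t(a) of Eq. (17).

    r0    : initial replication 1/N (the generating node)
    pop_c : Pop(a.c), number of subscribers of the item's channel
    psi1  : psi_a^t[1], prob. the item is in the SC cache of a subscriber
    phi   : list [phi_a^t[1], ..., phi_a^t[R]] for the OC sub-queues."""
    frac_sub = pop_c / N
    return r0 + (1.0 - r0) * (frac_sub * psi1 + (1.0 - frac_sub) * sum(phi))

# Illustrative values: N = 45 nodes, 15 subscribers of the item's channel.
print(replication_level(r0=1/45, pop_c=15, N=45, psi1=0.4, phi=[0.05, 0.03, 0.01]))
```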

6. PERFORMANCE EVALUATION

6.1 Model Validation

In this section we validate the model accuracy by comparing the analytical results with the results obtained using the simulation environment presented in [2]. More precisely, we consider a network composed of 45 nodes and 3 different channels with 99 data items each (297 data items in total). The channel popularity follows a zipfian distribution with parameter 1. In other words, given a channel j, the probability of a node being subscribed to that channel is given by

$$Pop(j) = N \cdot j^{-\beta} \Big/ \sum_{i=1}^{M} i^{-\beta} ,$$

with β = 1. Thus, channel 1 is the most popular, while channel 3 is the least popular. Finally, nodes are assumed to move according to a Random Waypoint model in a square area of side 1 km. The node speed is uniformly sampled in the range [1, 1.86] m/s, which are typical pedestrian speeds [8], and we applied the techniques described in [13] to make sure that the model runs in stationary conditions. To better evaluate the impact of the OC cache on the data dissemination process, in the following simulations we assume that the content of SC is not shared during an encounter event. To take into account this assumption in the model, only formula (17) has to be changed, to remove the ψ_a^t[1] term. In the following graphs, average values and 90% confidence intervals are computed by conducting 100 simulations of each scenario with different random seeds. In Fig. 6, 8 and 10 we plot the temporal evolution of the hit ratio, defined as the fraction of data items of the channel a node is subscribed to which are stored in SC, using Rc = 5 and R = 5 for all the channels. As the curves indicate, the larger the OC, the shorter the convergence time to hit ratios equal to one. For instance, all the channels reach a hit ratio equal to one around 800 s when the OC size is 3 slots, while this time is reduced to 200 s if the size is 50 slots. Finally, the comparison between analytical and simulation results confirms that our model is sufficiently accurate to predict the temporal evolution of the hit ratios, independently of channel popularity or OC sizes. In Fig. 5, 7 and 9 we show the temporal evolution of the data-item replication level rt(a)
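A small sketch of the channel-popularity assignment used in this validation scenario (ours; we read the normalization in the formula above as running over the set of channels):

```python
def channel_popularity(j, n_nodes, n_channels, beta=1.0):
    """Expected number of subscribers of channel j under the zipfian law
    Pop(j) = N * j^(-beta) / sum_i i^(-beta)."""
    norm = sum(i**(-beta) for i in range(1, n_channels + 1))
    return n_nodes * j**(-beta) / norm

# Validation scenario: 45 nodes, 3 channels, beta = 1.
pops = [channel_popularity(j, 45, 3) for j in (1, 2, 3)]
print([round(p, 1) for p in pops])   # channel 1 is the most popular
```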

for the OC cache, using Rc = 5 and R = 5 for all the channels. For the sake of clarity, the sub-plots in the bottom rows directly compare the analytical and simulation results for the same channel, with the most popular channel in the rightmost sub-plot and the least popular one in the leftmost sub-plot. Important observations can be derived from the shown results. First of all, the replication levels show clear peaks. In addition, the peak value is reached earlier for channels with high popularity than for channels with low popularity. This is due to the fact that the most popular channel and its data items get recognized before all the other channels. Thus, the data dissemination process for the most popular channel can start earlier than for the other channels. However, after this peak the replication levels of all the channels decrease. This is a typical behavior of the recognition heuristic and of the replacement policy used for the OC cache. Indeed, the more diffused a data item is (i.e., the more recognized), the less relevant it is for the dissemination process. This implies that less diffused items can increment their replication level more easily than data items of popular channels. This also explains why the hit ratio of the least popular channel is comparable to the hit ratio of the most popular channel. The second interesting observation is that the replication levels converge to steady values that are almost the same for all the channels. Typically, this happens when all the data items have reached the maximum recognition level, i.e. R. In this case, all data items are considered equivalent for the dissemination process and they are equally distributed in the OC caches. Finally, it is important to point out that our model is remarkably accurate in predicting this stationary behavior of the data-item replication levels. Furthermore, our model is able to predict the times at which the replication levels reach their maximum and minimum values.

Figure 5: rt(a) for Rc = 5, R = 5 and B = 3.

Figure 6: Hit ratios for Rc = 5, R = 5 and B = 3.

6.2 Scalability Analysis

In this section we aim at demonstrating the scalability of our model. More precisely, an accurate simulation of node mobility, communication protocols and cache management policies could not be executed in a feasible time for large-scale opportunistic networks involving thousands of nodes and hundreds of channels. In these cases, we argue that the analytical model can help in exploring the behavior of the system even with large numbers of involved peers and objects. To validate this statement, we have solved our model considering a network with N = 1000 nodes, M = 100 channels, and ten objects per channel (1000 data items in total). These values have been chosen to better represent a "long-tail" scenario, where a relatively large number of small groups of users are subscribed to specific channels. Furthermore, channel popularity still follows a zipfian distribution with parameter 1.


[Figure 7: r_t(a) for Rc = 5, R = 5 and B = 10.]

Fig. 11 and 12 show the hit ratios for the most and least popular channels, respectively, and for various sizes of the OC cache. As a benchmark scheme we consider an epidemic-based data dissemination protocol, where meeting nodes only share the data items of the channels they are both subscribed to. The first important observation is that the OC cache ensures a substantial performance gain over the epidemic scheme. In particular, for the least popular channel the time to reach a hit ratio equal to one is one order of magnitude shorter when using the OC cache. Nonetheless, it should be noticed that, for the most popular channel, a larger OC does not necessarily imply a performance improvement over the epidemic scheme. This is due to the fact that, with a larger OC, the data items of the most popular channel quickly reach the maximum recognition threshold R, even before they have been copied to all the nodes subscribed to their channel. However, when a data item is highly recognized it is less relevant for the data dissemination process, and less diffused items have higher priority during item replacements in the OC. The combination of these behaviors may negatively affect the effectiveness of the diffusion process for the highly popular channels. To better understand the impact of the channel recognition threshold on these behaviors, Fig. 13 shows the hit ratios of the most and least popular channels for an OC size equal to ten slots and two different channel recognition thresholds. As expected, the larger the channel recognition threshold, the slower the data dissemination process.


[Figure 8: Hit ratios for Rc = 5, R = 5 and B = 10.]


[Figure 10: Hit ratios for Rc = 5, R = 5 and B = 50.]


[Figure 9: r_t(a) for Rc = 5, R = 5 and B = 50.]

However, while for the most popular channel this delay seems to affect all hit ratio values in the same manner, in the case of the least popular channel high hit ratio values are more disadvantaged than lower ones. We have also investigated a network scenario where the channel recognition threshold is set to zero for all the channels. In this case, all the channels are considered recognized from t = 0 and there are no differences between the time instants at which the data dissemination process starts for channels with different popularities. In Fig. 14 we show the hit ratios of the most and least popular channels for two different OC sizes. On the other hand, the size of the OC cache still has a remarkable impact on the convergence times of the hit ratios. In the last set of results, shown in Fig. 15, we use the same parameter setting used for obtaining the results in Fig. 14, but we increase the number of nodes tenfold (10000 nodes in total). We can observe that, when increasing the number of nodes, the most popular channel does not experience significant improvements in the convergence time of the hit ratios. On the other hand, a quite significant improvement can be observed for the least popular channel, because a larger pool of OC caches ensures that more copies of the rarest data items are circulating in the network.

[Figure 11: Hit ratio of the most popular channel for varying B values, Rc = 5 and R = 5.]

[Figure 12: Hit ratio of the least popular channel for varying B values, Rc = 5 and R = 5.]

7. CONCLUSIONS

In this paper we have developed an analytical model to describe the performance of a content dissemination mechanism for opportunistic networks that relies on a cognitive scheme, known as the recognition heuristic, to decide whether to store a copy of the data items fetched from the caches of encountered nodes. Our model takes into account the different popularities of content types, and highlights the impact of the shared memory contributed by individual nodes to make the dissemination process more efficient. In particular, we have shown that our model is sufficiently accurate to capture the transient and steady-state behaviors of key performance indexes of the data dissemination system, such as data replication levels and hit ratios. Furthermore, the low computational complexity of our model has allowed us to explore network scenarios with large numbers of involved peers and objects, which are problematic to study using only a simulation-based approach. Future work involves the extension of our model to consider heterogeneous environments where there may be nodes with different mobility patterns, or where data items may have different priorities. Furthermore, we plan to study how to extend the recognition heuristic to include mediators that take into account social-based information about the users.

[Figure 13: Hit ratios for the most and the least popular channels and varying Rc values, R = 5 and B = 10.]

[Figure 14: Hit ratios for the most and the least popular channels and varying B values, Rc = 0 and R = 5.]

[Figure 15: Hit ratios for the most and the least popular channels and varying B values, Rc = 0, R = 5 and N = 10000.]

8. ACKNOWLEDGEMENTS

This work is funded by the EC under the RECOGNITION (FP7-IST 257756) and EINS (FP7-FIRE 288021) projects.

9. REFERENCES

[1] C. Boldrini, M. Conti, and A. Passarella. Design and performance evaluation of ContentPlace, a social-aware data dissemination system for opportunistic networks. Comput. Netw., 54:589–604, March 2010.


[2] M. Conti, M. Mordacchini, and A. Passarella. Data dissemination in opportunistic networks using cognitive heuristics. In Proc. of IEEE WOWMOM 2011, Lucca, Italy, 20-24 June, 2011, pages 1–6. IEEE, 2011.
[3] P. Costa, C. Mascolo, M. Musolesi, and G.P. Picco. Socially-aware routing for publish-subscribe in delay-tolerant mobile ad hoc networks. IEEE JSAC, 26(5):748–760, 2008.
[4] G. Gigerenzer and D.G. Goldstein. Models of ecological rationality: The recognition heuristic. Psychological Review, 109(1):75–90, 2002.
[5] D.G. Goldstein and G. Gigerenzer. Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review, 103(4):650–669, 1996.
[6] D.G. Goldstein and G. Gigerenzer. The recognition heuristic: How ignorance makes us smart, pages 37–58. Oxford University Press, 1999.
[7] D.G. Goldstein and G. Gigerenzer. Fast and frugal forecasting. Int. Journal of Forecasting, 25:760–772, 2009.
[8] D. Karamshuk, C. Boldrini, M. Conti, and A. Passarella. Human mobility models for opportunistic networks. IEEE Communications Magazine, 46(12):157–165, December 2011.
[9] V. Lenders, M. May, G. Karlsson, and C. Wacha. Wireless ad hoc podcasting. SIGMOBILE Mob. Comput. Commun. Rev., 12:65–67, January 2008.
[10] J.N. Marewski, W. Gaissmaier, L.J. Schooler, D. Goldstein, and G. Gigerenzer. From recognition to decisions: Extending and testing recognition-based models for multialternative inference. Psychonomic Bulletin & Review, 17(3):287–309, 2010.
[11] J.N. Marewski, W. Gaissmaier, and G. Gigerenzer. Good judgments do not require complex cognition. Cogn. Process, 11:103–121, 2010.
[12] M. Monti, L. Martignon, G. Gigerenzer, and N. Berg. The impact of simplicity on financial decision-making. In Proc. of CogSci 2009, July 29 - August 1 2009, Amsterdam, the Netherlands, pages 1846–1851. The Cognitive Science Society, Inc., 2009.
[13] W. Navidi and T. Camp. Stationary distributions for the random waypoint mobility model. IEEE Transactions on Mobile Computing, 3(1):99–108, 2004.
[14] A. Passarella. A survey on content-centric technologies for the current Internet: CDN and p2p solutions. Comput. Comm., 35(1):1–32, January 2012.
[15] L. Pelusi, A. Passarella, and M. Conti. Opportunistic networking: data forwarding in disconnected mobile ad hoc networks. IEEE Communications Magazine, 44(11):134–141, November 2006.
[16] L.J. Schooler and R. Hertwig. How forgetting aids heuristic inference. Psychological Review, 112(3):610, 2005.
[17] S. Serwe and C. Frings. Who will win Wimbledon? The recognition heuristic in predicting sports events. J. Behav. Dec. Making, 19(4):321–332, 2006.
[18] L. Yin and G. Cao. Supporting cooperative caching in ad hoc networks. IEEE Trans. Mob. Comput., 5(1):77–89, 2006.
[19] E. Yoneki, P. Hui, S.Y. Chan, and J. Crowcroft. A socio-aware overlay for publish/subscribe communication in delay tolerant networks. In Proc. of ACM MSWIM'07, pages 225–234, 2007.


Appendix F – [Usage Control Mechanisms] This appendix contains the reprint of the following paper: L.A. Cutillo, R. Molva, M. Önen, Privacy Preserving Picture Sharing: Enforcing Usage Control in Distributed On-Line Social Networks. In Proceedings of EUROSYS 2012, 5th ACM Workshop on Social Network Systems, Bern, Switzerland, April 10, 2012.


Privacy Preserving Picture Sharing: Enforcing Usage Control in Distributed On-Line Social Networks

Leucio Antonio Cutillo, Refik Molva, Melek Önen

EURECOM, Sophia-Antipolis, France {cutillo, molva, onen}@eurecom.fr

Abstract The problem of usage control, which refers to the control of the data after its publication, is becoming a very challenging problem due to the exponential growth of the number of users involved in content sharing. While the best solution and unfortunately the most expensive one to cope with this particular issue would be to provide a trusted hardware environment for each user, in this paper we address this problem in a confined environment, namely online social networks (OSN), and for the particular picture sharing application. In current OSNs, the owner of an uploaded picture is the only one who can control the access to this particular content and, unfortunately, other users whose faces appear in the same picture cannot set any rule. We propose a preliminary usage control mechanism targeting decentralized peer-to-peer online social networks where control is enforced thanks to the collaboration of a sufficient number of legitimate peers. In this solution, all faces in pictures are automatically obfuscated during their upload to the system and the enforcement of the obfuscation operation is guaranteed thanks to the underlying privacy preserving multi-hop routing protocol. The disclosure of each face depends on the rules the owner of the face sets when she is informed and malicious users can never publish this content in clear even if they have access to it. Categories and Subject Descriptors K.4.1 [Public Policy Issues]: Privacy Keywords Usage control; picture sharing; Distributed OnLine Social Networks

1. Introduction

The problem of usage control, which refers to the control of the data after its publication, is becoming a very


challenging problem due to the exponential growth of the number of users involved in content sharing applications. Online social networks such as Facebook, Twitter or LinkedIn are becoming the main way people communicate on the Internet, and unfortunately this problem can have a severe impact on privacy in an environment with several hundreds of millions of registered users. Indeed, even with current privacy protection and access control solutions, users lose control over their data after its very first publication in the network. For example, although a user who uploads or "posts" some data can indeed prevent unauthorized access to it, she cannot control its further use after publication. Recently, several peer-to-peer (P2P) based Distributed Online Social Networks (DOSN) have been proposed to preserve users' privacy [4]. In all these solutions, users' data is no longer stored by a centralized OSN provider. Such a DOSN can be considered a good candidate for usage control mechanisms, since it leverages the collaboration of users for every operation, including data management and privacy protection. In this paper, we propose to exploit the underlying peer-to-peer architecture of DOSNs in order to enforce usage control over a specific type of content, namely pictures. The proposed mechanism relies on the collaboration of nodes: control is enforced by forwarding every packet through several hops before it reaches its final destination. The proposed usage control mechanism is designed over a recently proposed DOSN named Safebook [1], which overcomes the problem of selfishness by leveraging the real-life social trust relationships among users. The underlying multi-hop forwarding solution can directly be used as a basis for the usage control mechanism. Section 2 introduces the problem of usage control, illustrated by picture sharing applications in online social networks. Section 3 describes the proposed mechanism, based on a multi-hop enforcement technique originating from the Safebook DOSN. Finally, the security and efficiency of the proposed protocol are evaluated in Section 4.

2. Problem Statement

Usage control for picture sharing in OSN. Usage control [9] becomes a mandatory requirement given the very large number of users sharing different types of content. The ideal goal of guaranteeing control over any type of data on any type of platform seems very difficult to achieve. We address this problem in a confined environment, namely OSNs, and in the context of picture sharing since, as previously mentioned, this application is one of the most popular applications for OSN users. Current picture sharing tools in online social networks allow users to upload any picture. Access rules to these pictures are defined by the owner of the picture, that is, the one who uploads it. This user also has the ability to associate an area of the picture with a label: such a function, namely the tag, can be used to inform other users about their presence in the picture. Unfortunately, if users are not "tagged" in the picture, they will never be aware of these pictures. We assume that each person whose face appears in any picture should decide whether her face in that picture should be disclosed or not, and therefore she should define the usage control policy regarding her own face.

Decentralized online social networks. As previously mentioned, online social networks severely suffer from the centralized control over users' data and its potential misuse. As an answer, several DOSNs [4] propose to design new applications based on a distributed peer-to-peer architecture. Some of them leverage real-life social links to construct a network of trusted peers where the correct execution of any network/application operation depends on users' behavior. In order to achieve a good security degree, these solutions define a threshold for the number of misbehaving users and analyze the trade-off between security and performance based on this threshold: for example, in some solutions a packet must pass through a threshold number of nodes before reaching its destination in order to guarantee a certain security degree. Such peer-to-peer online social networks can be considered a good candidate for usage control mechanisms in picture sharing. In such an environment, a well-behaving node would automatically obfuscate all faces in any picture it receives from other nodes. Therefore, given a threshold number of misbehaving or malicious nodes, in order to guarantee the correct execution of the usage control mechanism, the application can define a minimum number of nodes a legitimate message has to pass through before reaching its final destination. Among these nodes, at least one node should behave legitimately and apply the required protection operations. In addition to the owner of the picture, only the owner of a face included in that picture should be able to have initial access to that face. The further usage control rules for the dedicated face have to be defined by the corresponding user, and the correct application of these

rules should be verified at each node in the path towards the destination.
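To make the threshold argument concrete, the sketch below (our illustration, not taken from the paper; the function name is hypothetical) computes the probability that a forwarding path of h intermediate nodes contains at least one legitimate node, assuming each node misbehaves independently with probability f.

```python
def prob_at_least_one_legit(f, h):
    """P(at least one legitimate node on a path of h independent hops),
    where f is the fraction (probability) of misbehaving nodes."""
    return 1.0 - f ** h

for h in (1, 2, 3, 4):
    print(h, round(prob_at_least_one_legit(0.25, h), 4))
# With 25% misbehaving nodes, 4 hops already give ~0.996.
```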

3. The proposed usage control mechanism

In this section, we describe a usage control mechanism enforced thanks to the cooperation among multiple users that perform multi-hop forwarding. As previously mentioned, the idea of the proposed mechanism is to exploit the distributed nature of peer-to-peer online social networks and to leverage real-life social links to control the access to pictures: as opposed to centralized solutions, all operations are performed with the collaboration of multiple nodes. Thanks to this multi-hop enforcement in a distributed setting, cleartext pictures will only be accessible based on the rules defined by the users whose faces figure in the corresponding pictures. Intermediate nodes will detect faces in the pictures they receive and verify whether the users appearing in a picture have defined any rule on the usage of their face. A social network that answers the previously described requirements is proposed in [1] as a distributed privacy-preserving online social network named Safebook. We briefly summarize its characteristics before describing the newly proposed usage control mechanism.

3.1 Safebook: a P2P DOSN leveraging real-life trust

The main aim of Safebook is to avoid any centralized control over user data by service providers. The correct execution of the different services depends on the trust relationships among nodes, which are by definition deduced from real-life social links. Safebook defines for each user a particular structure, named a Matryoshka, ensuring end-to-end confidentiality and providing distributed storage while preserving privacy. As illustrated in Figure 1, the Matryoshka of a user V, namely the core, is composed of several nodes organized in concentric shells. Nodes in the first shell, also called mirrors, are the real-life friends of V, and store her profile data to guarantee its availability. If a requester U directly contacted one of V's mirrors, say A, U would be able to infer the friendship relation between V and A. To protect this information, several multi-hop paths, chains of trusted friends, are built where every user's node selects among her own friends one or more next hops that are not yet part of the core's Matryoshka. A can then be seen as the root of a subtree with a given branching span (for the sake of clarity, we consider span = 1 in the rest of the paper) whose leaves, namely the entrypoints, lie in the outermost shell. When a user U looks for V's data, her request is served by the entrypoints of V's Matryoshka and forwarded to the mirrors along these predefined paths. The answer follows the same path in the opposite direction. To prevent malicious users from creating multiple identities, identifiers are granted and certified by the last component of Safebook, the Trusted Identification System (TIS). The TIS is contacted only once, during the user registration phase, and does not impact the decentralized nature of Safebook's architecture since it is not involved in any data communication or data management operation. We now present a new usage control mechanism taking Safebook as a basis.

Figure 1. The Matryoshka graph of a user V, from [1].

3.2 Overview of the solution

In the particular environment of Safebook, a user can mainly play three different roles:
• She can publish a picture: in this case, she is represented as the core of her own Matryoshka and her friends will store this picture. She also has the option to tag some of her friends who might appear in the picture.
• She can act as a forwarder for some pictures: in this case, she belongs to the Matryoshka of either the owner of the picture or the owner of a face tagged in that picture.
• She can also request to retrieve some pictures which belong to one of her friends: in this case, she first needs to contact one of the entrypoints of the corresponding core in order to reach that particular user.
Dedicated tasks have been defined for each of these three roles. For example, before publishing a picture, the client has to perform a picture obfuscation operation. Similarly, as a forwarder, a node has to perform some verification operations in order to check whether the picture it is forwarding is correctly "protected" or not. All these tasks will be described in detail in the following section. In the sequel of the paper, we use the following notation:
• L^P denotes the regions in a picture P where a face appears;
• F^P denotes the set of faces that appear in P;
• f_V^P denotes user V's face in P;
• f_V denotes V's face features, which are used for face detection algorithms;
• (φ_V^+, φ_V^-) denotes user V's public and private keys, respectively;
• a message M signed using user V's private key φ_V^- is denoted by {M}_SφV;
• E_K{M} denotes the encryption of a message M with the symmetric key K.

3.3 Solution description

In this section, the solution is presented phase by phase. A user first registers to the social network in order to be able to publish a picture. The user can also act as an intermediate node, namely a forwarder node, and check whether the content it receives follows the corresponding usage control rules. Finally, the picture retrieval phase is described.

User registration. Whenever a user V registers to Safebook and joins the network, she first generates a pair of public and private keys (φ_V^+, φ_V^-) and sends the public key φ_V^+ and some samples of her face f_V to an off-line trusted third party, namely the Feature Certification Service (FCS). The FCS generates a certificate for V, denoted by Cert(f_V, φ_V^+), which proves that the user with face features f_V owns the public key φ_V^+. This face feature certification phase is performed only once and the user does not need to contact the third party anymore.

Picture publication. To make her picture P available in the network, the publisher V has to perform the following main tasks:
• Picture insertion and face detection: To publish her picture P, the user V provides P to her Safebook client. One of the main components of the system is the face detection mechanism. The face detector aims at finding the presence of faces in the input picture P and, if this is the case, it returns their locations L^P (we assume that face detection algorithms are secure enough; their design is out of the scope of this paper).
• Picture tagging: When the face detector derives L^P, the client uses the second component, namely the face extractor, which is in charge of copying every face f_i^P ∈ F^P detected in l_i^P ∈ L^P into a separate file. After this extraction task, the publisher V is asked to tag each face, i.e., to associate every f_i^P with a profile in V's contact list. If a face f_N^P is tagged with the profile of its owner N, N receives a copy of the original picture P (this does not violate V's privacy, as V aims at making P publicly available) and can decide to publish it again on her own profile. Furthermore, the publisher also decides whether her own face can be made available to the network or not.
• Picture publication: Once all known faces f^P ∈ F^P are tagged, the client can execute its last component, the face obfuscator. The face obfuscator transforms the face location areas L^P into uninterpretable areas using any human or computer vision algorithm, thus generating an obfuscated picture O^P. In our solution, the face obfuscator simply replaces every pixel in L^P with a black one. The obfuscated picture can thus be seen as the original one with black shapes hiding every detected face. From the resulting obfuscated picture O^P, an unambiguous picture identifier I^P is computed as I^P = h(O^P), where h(·) denotes a cryptographic hash function. V's face f_V^P is then signed together with the certificate Cert(f_V, φ_V^+), the identifier I^P, and an expiration time expTime. Finally, {Cert(f_V, φ_V^+), I^P, f_V^P, expTime}_SφV and O^P are published (this phase corresponds to the storage of the picture at V's friend nodes).
• Picture advertisement: Once informed by V of the presence of P, a user N can control the disclosure of her face f_N^P in that picture. N may decide to make f_N^P publicly available, and publish O^P together with the following signed message: {Cert(f_N, φ_N^+), I^P, f_N^P, expTime}_SφN. If N wishes to disclose this picture to a subset of her contact list only, she can encrypt the corresponding message with a symmetric key K previously distributed to the dedicated users. In this case, N will publish O^P together with E_K{f_N^P}, I^P.

Forwarding pictures. Every intermediate node T storing or forwarding an obfuscated picture O^P runs by default the face detector and obfuscator components on O^P. These tasks ensure the required privacy property in case some clients are manipulated by malicious nodes. When storing or forwarding a user V's publicly available face, a legitimate node T first checks the validity of the signature SφV, the expiration time, and the relation between the face features f_V in the certificate and the ones extracted from f_V^P. In case of verification failure, V's face is obfuscated.

Picture retrieval. To retrieve V's pictures, a user U who is not included in V's contact list sends a picture request (pctReq) message to V and receives a set of identifiers I^Pj related to V's publicly available pictures Pj. U then asks for the identifiers she is interested in. For every identifier I^Pj, U retrieves an obfuscated picture O^Pj and the message {Cert(f_V, φ_V^+), I^Pj, f_V^Pj, expTime}_SφV containing V's publicly available face in that picture. When interacting with her friend N, U sends her a picture request containing some secret s, and receives a list of picture identifiers I^Pj associated with pictures of N which are either publicly available, or available only to those contacts knowing s. U then detects a match in I^P between the identifiers retrieved from V and N, and, as she previously received O^P from V, she now asks for the missing information f_N^P. At the reception of piRes = {E_K{f_N^P}, I^P}, U can retrieve f_N^P since she already owns the appropriate decryption key K shared at the friendship establishment with N.
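For illustration, here is a minimal Python sketch of the publication pipeline described above (our own simplification; the helper names, the use of SHA-256 to stand in for h(·), and the plain-dict "certificate" are assumptions, and real face detection and signing are abstracted away).

```python
import hashlib

def obfuscate(picture, face_regions):
    """Black out every pixel inside the detected face regions L^P."""
    obfuscated = [row[:] for row in picture]           # copy the pixel grid
    for (r0, r1, c0, c1) in face_regions:
        for r in range(r0, r1):
            for c in range(c0, c1):
                obfuscated[r][c] = 0                    # black pixel
    return obfuscated

def picture_identifier(obfuscated):
    """Unambiguous identifier I^P = h(O^P); SHA-256 stands in for h()."""
    raw = bytes(p for row in obfuscated for p in row)
    return hashlib.sha256(raw).hexdigest()

def publish(picture, face_regions, cert, face_crop, exp_time, sign):
    o_p = obfuscate(picture, face_regions)
    i_p = picture_identifier(o_p)
    message = {"cert": cert, "I_P": i_p, "face": face_crop, "expTime": exp_time}
    return o_p, message, sign(message)                  # {.}_SphiV abstracted as sign()

# Tiny 4x4 grayscale "picture" with one detected face region.
pic = [[255] * 4 for _ in range(4)]
o_p, msg, sig = publish(pic, [(1, 3, 1, 3)], cert="Cert(f_V, phi_V+)",
                        face_crop="f_V^P", exp_time="2012-12-31",
                        sign=lambda m: "sig(" + str(sorted(m)) + ")")
print(msg["I_P"][:16], o_p[1])
```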

4. Evaluation

In this section, we evaluate the proposed mechanism with respect to different security issues such as eavesdropping, unauthorized access or software manipulation attacks. The impact of these attacks is evaluated based on existing social graphs: in September 2005, Facebook published anonymized social graphs of 5 universities in the United States (available at http://people.maths.ox.ac.uk/porterm/data/facebook5.zip): California Institute of Technology (Calt.), Princeton University (Princ.), Georgetown University (Georg.), University of North Carolina (UNC), Oklahoma University (Okl.). Each graph is represented by an adjacency matrix A whose non-diagonal elements a_ij are set to one if user ν_i ∈ V is a friend of user ν_j ∈ V, or zero otherwise. As each adjacency matrix is symmetric, the represented social graph is undirected. Before presenting the evaluation results, we briefly discuss the feasibility of the proposed usage control mechanism.

Feasibility. The feasibility of the proposed usage control mechanism depends on the robustness and speed of the face detection and verification procedures and on the feasibility of the DOSN at its basis, in our case Safebook. Face detection [12] and face recognition [13] procedures nowadays run in real time on common personal computers. Most of them [5, 7] make intensive use of the Scale Invariant Feature Transform (SIFT) [6], a well-known technique used to extract view-invariant representations of 2D objects. Recognition rates of these solutions reach up to 95% on well-known databases such as the Olivetti Research Lab (ORL) one [5, 7]. Other techniques can also be used to improve the recognition rate [11], at the expense of a bigger face feature descriptor. The proposed usage control scheme does not put any constraints on the face detection and recognition architecture: when the adopted face descriptor is bigger than the face image itself, a reference face image r_V rather than the feature descriptor f_V itself can be certified by the FCS. This change does not have a concrete impact on the time necessary to compare the reference face in the certificate with the one a user wants to make publicly available. The feasibility of Safebook has been presented in [2]. The study discusses an inherent trade-off between privacy and performance: on the one hand, the number of shells in a user's Matryoshka should be defined as large as possible to enforce privacy in terms of communication obfuscation and protection of the friendship links, but small enough to offer better performance in terms of delays and reachability; on the other hand, increasing the number of shells beyond a certain threshold does not increase privacy anymore. Such a threshold depends on the social network graph itself [3], more precisely on the number of hops after which a random walk on the social network graph approximates its steady-state distribution with a pre-defined error [8]. In this case, the endpoint and the startpoint of the random walk are uncorrelated. Based on the results of the study in [2], we conducted our experiments by setting the number of shells to 4. We assume that the number of online nodes corresponds to 30% of the total number of users.

Unauthorized picture broadcast. Even if malicious users manipulate the underlying software, broadcasting cleartext faces is prevented thanks to the collaborative multi-hop enforcement scheme. Indeed, it is assumed that there is at least one legitimate node which will execute the correct verification operations and the corresponding transformations in order to protect forwarded packets. Nevertheless, Safebook allows the forwarding of encrypted information to a subset of friends; a malicious member V may exploit this possibility to further send packets to all of her friends. However, V would need to set up a virtual server and establish friendship relationships with all users in order to provide all of them with a picture P. This kind of attack can be prevented by setting a proper maximum rate on friendship requests. The malicious node would also need to design some server advertisement mechanism using additional out-of-band information exchange (outside the Safebook network). Furthermore, a malicious node V may also ask one of her contacts C1 to manipulate her Safebook client in order to encrypt and republish an unauthorized picture P on her own profile. C1, in turn, may ask the same of a malicious contact C2. If recursively repeated through the social network graph, this attack may disclose P to all the contacts of every malicious Cn. This attack again requires the manipulation of the OSN client itself, and its impact would only be important if the number of malicious users is very large. Fortunately, such a massive-scale attack may end in the creation of an environment where the adversary may in turn become a victim with respect to her own private data.

Unobfuscated picture forwarding. As previously mentioned, the enforcement of the control on the usage of a given picture is based on the collaboration of users and the correct execution of the previously described operations. However, some users can still be malicious and avoid obfuscating some public pictures. To evaluate the impact of misbehavior, we have simulated the process of Matryoshka creation in which the chains leading from the mirrors to the corresponding entrypoints are built. We assume that 30% of the nodes are online, and that a fraction of them misbehave. We also assume that misbehaving nodes are always online. Two strategies are adopted to select misbehaving nodes: in the first one, R, they are randomly selected; in the second one, C, they are selected among and within the friend lists of the nodes with the highest weight w_V, defined as w_V = d_V · cc_V, where d_V represents the degree of node V, i.e., the number of V's contacts, and cc_V its clustering coefficient, i.e., the ratio

between the number of existing links among V's contacts and the number of possible links that could exist. We define a compromised chain as a chain that is entirely composed of misbehaving nodes. Table 1 reports the number of compromised chains m_x^y when the fraction of misbehaving nodes is as high as x% and strategy y is applied. The average degree d, the average number of chains p a user can build and the average number of Matryoshkas q are also computed for each social graph. With a limited number of misbehaving nodes, e.g., 10%, these nodes would have to take part in the same chains to challenge the security of the system. Nevertheless, with an increased number of misbehaving nodes, such as 25%, a significant ratio of chains m_x^y/p gets compromised (up to 83% in the case of Okl.) and the system does not protect the user's privacy anymore.

         d      p      q      m_10^R  m_10^C  m_25^R  m_25^C
Calt.    43.3   14.6   58.3   0.25    0.83    6.38    10.16
Princ.   88.9   30.3   126.5  0.33    5.76    17.72   23.23
Georg.   90.4   32.6   130.4  0.43    6.35    16.12   24.91
UNC      84.4   30.8   123.4  0.38    8.40    14.05   24.15
Okl.     102.4  39.9   159.7  0.48    11.06   18.58   33.24

Table 1. Characteristics summary of examined SN graphs.
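To illustrate the node weight used by strategy C and the notion of a compromised chain, here is a small Python sketch (our illustration; the chain-sampling procedure is a crude simplification of the Matryoshka-creation simulation, not the authors' code), assuming an undirected friendship graph given as adjacency sets.

```python
import random

def clustering_coefficient(adj, v):
    """Ratio of existing links among v's contacts to the possible ones."""
    friends = list(adj[v])
    k = len(friends)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if friends[j] in adj[friends[i]])
    return links / (k * (k - 1) / 2)

def weight(adj, v):
    """w_V = d_V * cc_V, used to pick misbehaving nodes in strategy C."""
    return len(adj[v]) * clustering_coefficient(adj, v)

def compromised_ratio(adj, misbehaving, chain_len=4, trials=10000, seed=1):
    """Fraction of random friend chains entirely made of misbehaving nodes."""
    rng = random.Random(seed)
    nodes = list(adj)
    bad = 0
    for _ in range(trials):
        chain, v = [], rng.choice(nodes)
        for _ in range(chain_len):
            v = rng.choice(list(adj[v]))
            chain.append(v)
        bad += all(u in misbehaving for u in chain)
    return bad / trials

# Toy graph: a triangle plus one pendant node.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(weight(adj, 3))                        # degree 3 * cc 1/3 = 1.0
print(compromised_ratio(adj, misbehaving={1, 2, 3}))
```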

Data confidentiality and Anonymity. Given an obfuscated picture O^P, it should be impossible to retrieve any information about users whose depicted faces are not made publicly available. Since, by the very design of the Safebook client, there is no way to query the OSN for the identity of the users whose faces are missing, it is indeed not possible for an adversary to extract any useful information from an obfuscated picture. Only friends of a user N can discover this information and retrieve f_N^P. Whenever N's friend U receives the list of identifiers of the pictures she is allowed to access, she checks the list of picture identifiers in her cache and may find a match for I^P. In this case, U can ask for and obtain f_N^P, encrypted with a key K she received from N previously.

Invalid tagging and picture republishing. A malicious user V may associate N's face f_N^P with her own profile while tagging a picture P. Nevertheless, V will not manage to make f_N^P publicly available, unless the features of f_N^P are similar enough to the ones in Cert(f_V, φ_V^+). However, according to the Face Recognition Vendor Test (FRVT) of 2006 [10], the false rejection rate for a false acceptance rate of 0.001 is 0.01 for state-of-the-art face recognition algorithms. A picture P can be accessed and republished by a third node Y that does not appear in it. Y can in fact store in her profile the obfuscated O^P and any publicly available face {Cert(f_X, φ_X^+), I^P, f_X^P, expTime}_SφX for that picture.

Limitations. In order for the intermediate nodes to verify whether the rules are followed or not, the picture itself cannot be encrypted (even if it is obfuscated): this security mechanism cannot be enforced over encrypted messages. However, if a malicious user encrypts a picture in order to circumvent the usage control mechanism, only nodes holding the corresponding decryption keys can access it. Such an attack would require a significant communication overhead and its impact on security would be limited. Figure 2 summarizes the characteristics of the proposed solution.

Figure 2. Spread of information vs usage control.

5. Conclusion and Future Work

As it is not feasible to design a perfect usage control mechanism governing any type of data in any environment, we proposed a preliminary solution dedicated to the picture sharing tools widely used in the context of online social networks. Although it might be feasible to design such a mechanism in a centralized environment, current OSN providers are not yet interested in such protection mechanisms. On the contrary, decentralized, P2P-based online social networks rely on the collaboration of users for every operation, including data management and security. The proposed usage control mechanism takes advantage of this inherent cooperation between users and ensures the enforcement of the control over pictures thanks to a dedicated multi-hop picture forwarding protocol. A message has to follow a dedicated path through a sufficient number of intermediate nodes, which perform the tasks defined in the usage control policy, before reaching its final destination. Thanks to this multi-hop enforcement mechanism, users whose face appears in a given picture are able to control its usage from the very beginning of its publication. Nevertheless, the protection of the picture and the enforcement of this control are only effective in the confined environment of the DOSN and when pictures are not encrypted; however, the impact of attacks launched outside this environment or aiming at encrypting the message is very limited within the DOSN. In future work, we plan to evaluate the scalability and the performance impact of the proposed solution, and to integrate its features into the current Safebook prototype (http://www.safebook.eu/home.php?content=prototype).

Acknowledgments This work has been supported by the RECOGNITION project, grant agreement number 257756, funded by the EC Seventh Framework Programme Theme FP7-ICT-2009 8.5 for Self-Awareness in Autonomic Systems.

References
[1] L. A. Cutillo, R. Molva, and T. Strufe. Safebook: a privacy preserving online social network leveraging on real-life trust. IEEE Comm. Mag., Consumer Comm. and Networking, 2009.
[2] L. A. Cutillo, R. Molva, and M. Önen. Performance and Privacy Trade-off in Peer-to-Peer On-line Social Networks. 2010. Technical Report RR10244.
[3] L. A. Cutillo, R. Molva, and M. Önen. Analysis of privacy in online social networks from the graph theory perspective. In IEEE Globecom 2011, Selected Areas in Communications Symposium, Social Networks Track, 2011.
[4] A. Datta, S. Buchegger, L.-H. Vu, T. Strufe, and K. Rzadca. Decentralized online social networks. In B. Furht, editor, Handbook of Social Network Technologies and Applications. Springer US, 2010.
[5] C. Geng and X. Jiang. Face recognition using SIFT features. In Proceedings of the 16th IEEE International Conference on Image Processing, pages 3277–3280. IEEE Press, 2009.
[6] D. G. Lowe. Distinctive image features from scale-invariant keypoints, 2003.
[7] A. Majumdar and R. K. Ward. Discriminative SIFT Features for Face Recognition. In Proc. of Canadian Conference on Electrical and Computer Engineering, 2009.
[8] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, January 2005.
[9] J. Park and R. Sandhu. Towards usage control models: beyond traditional access control. In SACMAT '02. ACM, 2002.
[10] P. J. Phillips, W. T. Scruggs, A. J. O'Toole, P. J. Flynn, K. W. Bowyer, C. L. Schott, and M. Sharpe. FRVT 2006 and ICE 2006 Large-Scale Results. Technical Report NISTIR 7408.
[11] C. Velardo and J.-L. Dugelay. Face recognition with DAISY descriptors. In MM'10 and Sec'10, ACM SIGMM Multimedia and Security Workshop, Rome, Italy, September 2010.
[12] P. Viola and M. J. Jones. Robust real-time face detection. Int. J. Comput. Vision, May 2004.
[13] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey, 2000.


Appendix G – [Distributed Placement of Autonomic Internet Services] This appendix contains the reprint of the following technical report: P. Pantazopoulos, M. Karaliopoulos, I. Stavrakakis, Distributed Placement of Autonomous Internet Services, 2012 (in preparation)



Distributed Placement of Autonomous Internet Services

Panagiotis Pantazopoulos, Merkourios Karaliopoulos, Member, IEEE, and Ioannis Stavrakakis, Fellow, IEEE

Abstract—With the accelerated growth of social networking platforms, the largest source of data on the Internet today is user-generated. Following a path similar to user-generated content, end-users are increasingly prone to create a rich ecosystem of service instances, which necessitates a distributed and scalable treatment of the fundamental problem of their optimal placement. To cope with the lack of global network topology and service demand information, typically required by a centralized approach, we propose an iterative, scalable service migration mechanism that moves each service instance towards prominent locations, resembling a local-search approach. Each neighborhood of service host candidates is selected with respect to a node centrality metric (wCBC) that captures the capacity of a node to act as a service demand concentrator under shortest-path routing. The same metric, suitably tuned, helps us project the demand generated by the rest of the network nodes onto the selected neighborhood, over which a small-scale 1-median problem determines the next best solution. First, we provide theoretical insights into our migration heuristic and conduct a proof-of-concept study under the hypothesis of feasible centrality computations. Its advantages over typical local-search schemes and its effectiveness already with a minimal size of the selected neighborhood are demonstrated over both synthetic and real-world topologies. Then, we exploit the straightforward interpretation of wCBC and propose a practical distributed implementation that uses passive measurements of transit traffic to determine each migration hop. Our results suggest that services reach close-to-optimal locations in a few hops, regardless of the employed routing practice.

Index Terms—Distributed Facility Location, User-Generated Services, Service Migration, Centrality



The authors are affiliated with the Department of Informatics and Telecommunications, National & Kapodistrian University of Athens, Ilissia, 157 84 Athens, Greece. E-mail: {ppantaz, mkaralio, ioannis}@di.uoa.gr. This work has been partially supported by the EC-FP7-FET RECOGNITION project (FP7-IST-257756) and the Marie Curie grant RETUNE (FP7-PEOPLE-2009-IEF-255409).

1 INTRODUCTION

One of the most significant changes in networked communications over the last few years concerns the role of the end-user. Traditionally, the end-user has been almost exclusively a consumer of content and services generated by explicit entities called content and service providers, respectively. Nowadays, Web 2.0 technologies have enabled a paradigm shift towards more user-centric approaches to content generation and provision. This shift is strongly evidenced in the abundance of User-Generated Content (UGC) in social networking sites, blogs, wikis, or video distribution sites such as YouTube, which have motivated even the rethinking of the Internet architecture fundamentals [1], [2]. The generalization of the UGC concept towards services is increasingly viewed as one of the major trends in user-oriented networking [3]. The user-oriented service creation concept aims at engaging end-users in the generation and distribution of service components, more generally service facilities [4]. There already exist online applications that enable end-users to compose their own customized combination of heterogeneous web sources through easy-to-use graphical interfaces. Google App Engine [5] and Yahoo! Pipes [6] are typical examples of what is often referred to as web-based mashup tools.

At the same time, efforts are under way to develop platforms that will engage end-users in the creation of services with telecom-based features (e.g., messaging, voice calls, etc.) integrated over Next Generation Networks [7]. In parallel with the proliferation of the so-called User-Generated Service (UGS) paradigm, significant research effort is being carried out on the design and deployment of energy-efficient data storage architectures. The aim is to offload part of the data management operations from the few large power-hungry data centers onto many lighter peer devices at the network edge [8]. Numerous peer devices such as home gateways can be instrumented through virtualization technologies to create a distributed Internet service platform that leverages end-user proximity. The "in-network storage" argument has also been a key concept of the emerging Information-Centric Networking (ICN) paradigm [9]. At the intersection of these trends, we anticipate a rich ecosystem of service facilities that will be generated in pretty much every network location, many of them having strongly local scope, and that will have, in principle, access to storage resources in various network locations. The technical challenge then remains how to optimally place these services so as to minimize their access cost. However, more than ever before, the search is for scalable distributed service placement approaches that can be carried out by typical network devices, without specialized processing capacity. Such approaches will need to determine, among others, what kind of information is needed to push a service


facility towards cost-effective locations and how this information can be efficiently collected and processed. Motivated by these challenges, our work proposes a scalable decentralized heuristic algorithm that iteratively moves services from their generation location to the network location that minimizes their access cost. We follow earlier work on service placement in viewing the problem as an instance of the facility location problem [10]; more precisely, we formulate it as the 1-median problem since service facilities are not replicated. Contrary to centralized approaches, where a single super-entity with global information about network topology and service demand solves the problem in a single iteration, we let the service migrate towards its optimal location over a few hops. In each hop, a small-scale and simpler 1-median problem is solved so that the computational load is spread amongst the nodes on the migration path. The service migration path is selected with the help of a node centrality metric we have devised earlier in [11] and call weighted Conditional Betweenness Centrality (wCBC). In each migration step, the metric assesses the capacity of each network node to route service demand load from the rest of the network towards the current service location. Therefore, the metric can effectively identify and even elevate directions of high demand attraction, i.e., network areas presenting high demand for the service. It does this by completely determining the properties of the subgraph wherein the small-scale 1-median problem will be solved (1-median subgraph): first, by selecting the nodes of the subgraph and, then, by modulating the demand weights with which each one participates in the 1-median problem formulation. As a result, the service reaches its optimal location in the network in a few hops. We detail the metric and the way it is used by our algorithm, hereafter called cDSMA, in Sections 3 and 4, respectively. Our contributions lie on both the theoretical and the practical front. From a theoretical point of view, we propose a novel heuristic algorithm for the well-studied 1-median problem, which comes under the broader family of local-search techniques. The algorithm's convergence and approximation properties are discussed in Section 4.1. The negative result in this respect is that cDSMA is not a constant-ratio approximation algorithm, since synthetic examples can be constructed where its deviation from the optimal cannot be bounded. On a practical note, we provide a systematic specification and evaluation of our algorithm, from the initial concept and properties down to practical implementation concerns. Our analysis proceeds in two phases. First, in Section 6, we carry out a proof-of-concept analysis over synthetic network topologies and under the ideal assumption that nodes can obtain accurate topological and demand information for the whole network. Essentially, this analysis tests the capacity of our metric as a guide of the service migration and exposes the main properties and advantages of cDSMA. Under these ideal conditions, the algorithm achieves remarkably high accuracy and fast convergence over real-world ISP topologies of hundreds of nodes, even when the

1-median problem iterations are solved with no more than 6% of the total network nodes. Hence, in realistic settings, and contrary to the theoretical worst-case prescriptions, cDSMA shows excellent capacity to approximate the optimal solution. Moreover, cDSMA shows excellent scalability and robustness to service demand estimation inaccuracies across the network. Compared with pure local-search policies, where the next service migration hop is sought within the local neighborhood of the current location (Fig. 1.a), cDSMA needs much fewer migration hops to yield placements of the same accuracy, and these placements do not depend on the service generation location. Later, in Section 7, we relax the assumption of ideal global information and propose a real-world distributed implementation of cDSMA, catering for all challenges related to distributed operation: how the node currently hosting the service collects topological and demand information from other nodes, how it uses it to reconstruct the inputs needed by the algorithm, what has to be measured and what can be inferred, and what changes when the underlying routing protocol is single- or multipath. The implementation leverages the straightforward interpretation of the wCBC metric so that each node can locally obtain estimates of its own wCBC and communicate them via dedicated messages to the current service host. This information can then be processed by the service host to extract partial topological information about the 1-median subgraph and determine the next service host on the migration path. The implementations can exercise further flexibility regarding how many nodes will measure and report their local estimates to the service host node. As shown in Section 8, in this way they effectively trade off the algorithm's accuracy against the message overhead in the network.

2 SERVICE PLACEMENT: A FACILITY LOCATION PROBLEM

The optimal placement of service facilities within network structures has been typically tackled as an instance of the facility location problem [10]. Input to the problem is the topology of the network nodes that may host services and the distribution of service demand across the users of the network. The objective is to place services in a way that minimizes the aggregate cost of accessing them over all network users. We focus on the 1-median formulation, which seeks to minimize the access cost of a single service and appears to be a better match for the expected features of the User-Generated Service paradigm. Recent evidence confirms that there are few highly popular service/content objects and many others of interest to significantly fewer users [12]. UGS will enable the generation of service facilities in various network locations from a highly versatile set of amateur user-service providers. The huge majority of these services is expected to address users in the "proximity" of the user-service provider, either geographical or social (friends, colleagues, etc.), so that their replication across the network is not justified.


Fig. 1. (a, b) 1-median subgraph nodes under local-search heuristics (a) and cDSMA (b). (c) With node 7 as current service host, two nodes (8 and 11) in the subgraph G_7 map demand from the rest of the network (terms w_map(8; 7), w_map(11; 7)). (d) By crediting the demand of the G \ G_Host nodes only to the entry nodes C and L, cDSMA moves the service to the optimal location C.

Assume that the network topology is represented by an undirected connected graph G = (V, E) of |V| nodes and |E| links. Each network node n serves users that access the service with different intensity, generating demand w(n) for the service. Each service is placed on a single service facility k ∈ V minimizing the access cost of the service over all network users:

Cost(k) = Σ_{n∈V} w(n) · d(k, n)    (1)
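As a point of reference, here is a brute-force evaluation of (1) over all candidate hosts (our illustrative sketch, not the cDSMA algorithm; hop-count distances are computed with BFS, in line with the assumption stated just below).

```python
from collections import deque

def hop_distances(adj, src):
    """BFS hop-count distances from src in an unweighted graph."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def access_cost(adj, w, k):
    """Cost(k) = sum over n of w(n) * d(k, n), as in Eq. (1)."""
    dist = hop_distances(adj, k)
    return sum(w[n] * dist[n] for n in adj)

def optimal_1median(adj, w):
    """Exhaustive search over all |V| candidate hosts (one BFS per host)."""
    return min(adj, key=lambda k: access_cost(adj, w, k))

# Toy example: a path 1-2-3-4-5 with all demand weights equal to 1.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
w = {n: 1 for n in adj}
print(optimal_1median(adj, w), access_cost(adj, w, 3))  # node 3, cost 6
```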

In principle, the distance may have a different context, depending on routing policies and the network dimensioning process; namely, the edge set E is weighted. The exposition of the algorithm hereafter assumes that minimum-cost paths coincide with minimum-hopcount paths, but its adaptation to more general shortest-path concepts is straightforward.

2.1 Why a distributed heuristic algorithm for service migration

Centralized solutions are inefficient, if at all feasible, for our problem. From a pure algorithmic point of view, the complexity of the 1-median problem is bounded by O(|V|^3) (in comparison, the k-median problem is NP-hard in general topologies, so that much of the research effort around it has been devoted to the design of efficient approximation algorithms [13]). Hence, while not prohibitive, it does not scale well and any improvement on it would be welcome, especially for larger networks. More significantly, centralized approaches assume the existence of a super-entity with global topological and service demand information that has the resources and the mandate to determine and realize the placements. This implies an implicit logical hierarchy in the role of network nodes, which in many cases is not present. Moreover, given that even minor user demand shifts or network topology changes may alter the optimal service location, it is neither practical nor affordable to centrally compute a new problem solution each time. Our approach is to replace the one-shot placement of the service with its few-step migration towards the optimal location. This way, we end up solving a few 1-median problems of dramatically smaller scale and complexity

instead of coping with the global 1-median optimization problem. Central to the algorithm is a metric inspired by Complex Network Analysis that we call Weighted Conditional Betweenness Centrality (wCBC) [11]. For every transit location of the service in the network, wCBC is a measure of the demand each node routes towards the current service host node and is used for two tasks. First, it identifies the nodes with the highest wCBC values as candidates for hosting the service in the next iteration. These nodes induce a small subgraph on the original network graph, wherein the 1-median problem is solved for the next-best service location. In the rest of this paper, we refer to this service-host-node-dependent graph as the 1-median subgraph. Second, the metric simplifies the mapping of the service demand from the rest of the network nodes onto this subgraph. This task is necessary in order to appropriately weigh the service demand gradients across the network. We detail the metric, its motivation and its practical interpretation in Section 3.

3 WEIGHTED CONDITIONAL BETWEENNESS CENTRALITY

Central to our distributed approach is the Weighted Conditional Betweenness Centrality (wCBC) metric. It originates from the well-known betweenness centrality metric and captures both topological and service demand information for each node.

3.1 Capturing network topology: from BC to CBC

Betweenness centrality, one of the most frequently used metrics in CNA [14], reflects to what extent a node lies on the shortest paths linking other nodes. Let σst denote the number of shortest paths between any two nodes s and t in a connected graph G = (V, E). If σst(u) is the number of shortest paths passing through the node u ∈ V, then the betweenness centrality index of node u is given by (2).

    BC(u) = \sum_{s,t \in V, s \neq t \neq u} \frac{\sigma_{st}(u)}{\sigma_{st}}                (2)

BC(u) captures the ability of a node u to control or assist the establishment of paths between pairs of nodes. It is an average value estimated over all network pairs. In [15] we proposed Conditional BC (CBC) as a way to capture the topological centrality of a network node with respect to a specific node t. It is defined as

    CBC(u; t) = \sum_{s \in V, u \neq t} \frac{\sigma_{st}(u)}{\sigma_{st}}                (3)

with σst(s) = 0. Note that the summation is over the |V| − 1 node pairs involving node t rather than all possible |V|(|V| − 1) node pairs, as in (2). Effectively, CBC assesses to what extent a node u acts as a shortest-path aggregator towards the current service location t by enumerating the shortest paths to t involving u from all other network nodes. In the Appendix, we compute the node CBC values and their distributions over simple network topologies.

3.2 Capturing service demand: from CBC to wCBC

A high number of shortest paths through the node u does not necessarily mean that equally high demand load stems from the sources of those paths. Weighted conditional betweenness centrality (wCBC) enhances the pure topology-aware CBC metric in a way that takes into account the service demand that can be routed through the shortest paths towards the service location [11]. In the metric definition, the shortest path ratios of σst(u) to σst in Eq. (3) are weighted by the demand loads generated by each node s as follows:

    wCBC(u; t) = \sum_{s \in V, u \neq t} w(s) \cdot \frac{\sigma_{st}(u)}{\sigma_{st}}                (4)

Note that σut(u) = σut, so that the wCBC(u; t) value of each network node u is lower bounded by its own demand w(u). Therefore, wCBC assesses to what extent a node can serve as a demand load concentrator towards a given service location. It is straightforward to see that when the demand for a service is uniformly distributed across the network nodes, the wCBC metric degenerates to the CBC one, within a scale constant.

3.2.1 Approximating the metric with measurements

The wCBC(u; t) metric practically represents the service demand that node u routes toward node t, including its own demand w(u) and the transit demand wtrans(u; t) flowing from other network nodes through u towards t. Therefore, individual nodes may, in principle, estimate their own metric values ŵCBC through passive measurements [16] of the service demand they route towards the current service host node. In other words, what is computed theoretically in (4) for node u, demanding global information about the network topology and service demand, can be locally approximated by u, providing the basis for the practical implementation of a distributed solution. The approximation lies in the fact that what is measured, even with perfect accuracy, is not always equal to the nominal wCBC value, as specified in (4). For example, when there are, say, m shortest paths between a given node pair (s, t) but the routing protocol uses only one of them, a node will measure the full demand of s, whereas, theoretically, the contribution to the nominal wCBC value is (1/m)-th of the measured one. We will see later, in Section 7, that what matters is the estimate of the actual demand routed through the node rather than a wCBC approximation.
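For illustration, the following Python sketch computes wCBC(u; t) exactly as in (4) by enumerating shortest paths with networkx. The exhaustive path enumeration and the toy inputs are only meant to clarify the definition; they are not suggested as an efficient or prescribed implementation.

import networkx as nx
from collections import defaultdict

def wcbc(G, w, t):
    # wCBC(u; t) of Eq. (4): demand-weighted fraction of shortest paths towards t
    # crossing each node u (a node's own demand counts fully, since sigma_ut(u) = sigma_ut)
    score = defaultdict(float)
    for s in G.nodes:
        if s == t:
            continue
        paths = list(nx.all_shortest_paths(G, s, t))
        for path in paths:
            for u in path[:-1]:              # every node on the path except the host t
                score[u] += w[s] / len(paths)
    return dict(score)

# toy usage on a 5-node path graph with unit demand towards host node 4
G = nx.path_graph(5)
print(wcbc(G, {n: 1.0 for n in G}, t=4))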

4 THE cDSMA ALGORITHM DESCRIPTION

Our centrality-driven Distributed Service Migration Algorithm (cDSMA) progressively steers the service towards its optimal location via a finite number of steps.

Step 1: Initialization. The first algorithm iteration is executed at the node s that initially generates the service facility (pseudocode line 2). In subsequent iterations, the new reference node is the node hosting the service at that time.

Step 2: Metric computation and 1-median subgraph derivation. Next, the wCBC(u; s) metric is computed² for every node u in the network graph G = (V, E). Nodes featuring the top α% wCBC values, together with the node currently hosting the service (Host), form the 1-median subgraph GHost over which the 1-median problem will be solved (lines 3-4 and 15-16). Clearly, the size of the 1-median subgraph and the overall complexity of the algorithm are directly affected by the choice of the parameter α.

Step 3: Mapping the demand of the remaining nodes on the subgraph. To account for the contribution of the "outside world" to the service provisioning cost, the demand for service from nodes in G \ GHost (i.e., the non-shaded nodes in Fig. 1c) is mapped on the GHost nodes. To do this correctly and with no redundancy, the algorithm credits the demand of an outside node z only to the first "entry" GHost node encountered on each shortest path from z towards the service host. Thus, the weights w(n) used for calculating the service access cost at node n in the GHost subgraph (see Section 2) are replaced by effective demands weff(n; Host) = w(n) + wmap(n; Host), where (assuming that Host is node t):

    wmap(n; t) = \sum_{z \in G \setminus G_t} w(z) \frac{\sigma'_{zt}(n)}{\sigma_{zt}}                (5)

    \sigma'_{zt}(n) = \sum_{j=1}^{\sigma_{zt}} 1\{ n \in SP_{zt}(j) \wedge n = \arg\min_{u \in SP_{zt}(j)} d(z, u) \}

with SPzt(j) standing for the j-th element of the shortest path set from node z to node t. For example, in Fig. 1c the original service demand of node 16 is not mapped on all the G7 nodes lying on the shortest paths from 16 to the Host 7 (i.e., 11, 12 and 8), but only on 11. The mapping step and its rationale can be better understood in the following example. In the network of Fig. 1d the service migrates towards the lowest-cost location, which

2. For the actual wCBC computation, which involves solving the all-pairs shortest path problem, we properly modified the scalable algorithm in [17] for betweenness centrality computation, with runtime O(|V||E|).


under uniform demand is node C. Assume that the size of the GHost subgraph is 4 and that at some migration step the service resides at node F. The top wCBC nodes around F are C, K and L. cDSMA assigns weff values only to the entry nodes C and L, weff(C; F) = 6 and weff(L; F) = 3, while setting the wmap values of nodes F and K to zero, and effectively identifies the gradient direction towards node C. Thus, it better projects the demand attraction forces on the selected nodes, C being the strongest. On the contrary, if we map the demand of the nodes in G \ GHost on all GHost nodes, we end up with weff(F; F) = 8 and weff(K; F) = 3. The service then cannot identify node C as the next-best location and locks at node F.

Step 4: 1-median problem solution and service migration to the new host node. Any centralized technique (e.g., [13]) may be used to solve this small-scale optimization problem and determine the next best location of the service among the GHost nodes. The current service location is denoted by s (line 21), while the optimal one in GHost is assigned to Host (line 22). As long as node s (a) yields higher cost than the candidate Host node (line 10) and (b) the candidate Host has not been used as a service host before (lines 11-13), the service is moved to the Host node and the algorithm iterates through Steps 2-4. Progressively, cDSMA steers the service to the (globally) lowest-cost location.

Algorithm 1 cDSMA in G = (V, E)
1.  choose randomly node s
2.  place SERVICE @ s
3.  for all u ∈ G do compute wCBC(u; s), set flag(u) = 0
4.  Gs ← {α% of G with top wCBC values} ∪ {s}
5.  for all u ∈ Gs do
6.      compute wmap(u; s)
7.      weff(u; s) ← wmap(u; s) + w(u)
8.      compute cost C(u) in Gs
9.  Host ← 1-median solution in Gs
10. while C_Host < C_s do
11.     if flag(Host) == 1 then
12.         abort
13.     else
14.         move SERVICE to Host, flag(s) = 1
15.         for all u ∈ G do compute wCBC(u; Host)
16.         GHost ← {α% of G with top wCBC values} ∪ {Host}
17.         for all u ∈ GHost do
18.             compute wmap(u; Host)
19.             weff(u; Host) ← wmap(u; Host) + w(u)
20.             compute cost C(u) in GHost
21.         s ← Host
22.         Host ← 1-median solution in GHost
23.     end if
24. end while
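The following Python sketch puts the four steps together in simplified form: it recomputes wCBC at each hop, keeps the top-α% nodes plus the host, credits each outside node's demand to the first subgraph node on one (rather than every) shortest path towards the host, and stops when no unvisited better host is found. The helper names, the single-path simplification of the mapping step and the networkx calls are assumptions made for brevity, not a faithful implementation of Eq. (5).

import networkx as nx

def wcbc(G, w, t):
    # demand-weighted shortest-path concentration towards host t, cf. Eq. (4)
    score = {u: 0.0 for u in G if u != t}
    for s in G:
        if s == t:
            continue
        paths = list(nx.all_shortest_paths(G, s, t))
        for p in paths:
            for u in p[:-1]:
                score[u] += w[s] / len(paths)
    return score

def cdsma(G, w, start, alpha=0.10):
    # assumes a connected graph and hop-count routing
    host, visited = start, {start}
    dist = dict(nx.all_pairs_shortest_path_length(G))
    while True:
        scores = wcbc(G, w, host)
        k = max(1, int(alpha * G.number_of_nodes()))
        sub = set(sorted(scores, key=scores.get, reverse=True)[:k]) | {host}
        # Step 3 (simplified): credit each outside node's demand to the first
        # subgraph node met on one shortest path towards the current host
        w_eff = {n: w[n] for n in sub}
        for z in set(G) - sub:
            path = nx.shortest_path(G, z, host)
            entry = next(n for n in path if n in sub)
            w_eff[entry] += w[z]
        # Step 4: solve the small 1-median instance inside the subgraph
        cost = {c: sum(w_eff[n] * dist[c][n] for n in sub) for c in sub}
        best = min(cost, key=cost.get)
        if best == host or best in visited:
            return host
        visited.add(best)
        host = best

# e.g. cdsma(nx.barabasi_albert_graph(50, 2), {n: 1.0 for n in range(50)}, start=0)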

4.1 Convergence and approximation properties of cDSMA

We complete the description of cDSMA by elaborating on its convergence properties and its theoretical capability to approximate the optimal solution. Clearly, a service facility following the migration process of Algorithm 1 will visit any network node at most once (see the condition in line

Fig. 2. A ring topology of N = 2k nodes under a non-uniform demand pattern results in symmetric (with respect to B) demand mapping on the GB entry nodes K and L. This blocks the service migration process and yields a highly suboptimal solution.

12). Thus, our heuristic will take O(|V|) steps to terminate. Its theoretical performance bounds are studied next.

Proposition 4.1: cDSMA provides no constant-factor approximation guarantee.

Proof: We sketch a counterexample that leads to arbitrarily bad solution quality. Assume a ring topology of N = 2k nodes (k ∈ Z+), where every node but one aggregates a unit of demand load from the users it serves. The single heavy-hitter node A generates W units of demand load and the service facility is generated at the anti-diametric ring node B (see Fig. 2). Under cDSMA the current service host B will select αN nodes, with α < 1, that will form its GB subgraph. Interestingly enough, the demand that will be mapped on the GB-chain entry nodes (i.e., K and L) is such that the initial location becomes a local minimum for every value of α < 1. Therefore, the service remains at node B without initiating the migration process at all. The resulting global access cost CcDSMA(B) at node B is

    C_{cDSMA}(B) = 2 \sum_{i=1}^{k-1} i + Wk = \frac{N^2 + 2N(W - 1)}{4}                (6)

On the other hand, the optimal service location is at node A, where the cost is:

    C_{OPT} = 2 \sum_{i=1}^{k-1} i + k = \frac{N^2}{4}                (7)

Therefore, their ratio equals:

    \frac{C_{cDSMA}(B)}{C_{OPT}} = 1 + 2 \frac{W - 1}{N}                (8)

It follows from Eq. (8) that the resulting placement may become arbitrarily bad as the demand of the heavy-hitter node rises. Proposition 4.1 suggests that there are combinations of network topology and demand that may generate such symmetric 1-median subgraph mappings that trap the service in a (local) minimum and prematurely terminate its migration towards the optimum. On the positive side, when looking at this particular unfavorable example, the approximation ratio improves fast as the network size N grows, i.e., when the migration becomes more relevant; whereas for small


networks (of one or two tens of nodes), the placement task is computationally feasible in a centralized manner. More generally, our experimentation results in Section 6 and, more notably, in Section 8 demonstrate that cDSMA actually provides excellent accuracy for all realistic scenarios of network topologies and demand distributions, even for very small sizes of the 1-median subgraph. Similar behavior is not uncommon in approximation algorithms, especially those that tackle computationally difficult problems; parallels can be drawn with the well-known k-means algorithm, a hill-climbing approach to partitioning data points into k disjoint clusters. It is widely used in practice although it occasionally converges to local minima that can be arbitrarily bad, in terms of accuracy, when compared to the optimal clustering [18].
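The closed forms (6)-(8) can be checked numerically. The short sketch below brute-forces the two access costs on a ring with arbitrarily chosen N and W and compares them against the formulas; the concrete values are illustrative only.

import networkx as nx

N, W = 20, 50                      # even ring size and heavy-hitter demand (arbitrary)
G = nx.cycle_graph(N)
A, B = 0, N // 2                   # heavy hitter and its anti-diametric node
w = {n: 1.0 for n in G}
w[A] = W

def cost(k):
    d = nx.single_source_shortest_path_length(G, k)
    return sum(w[n] * d[n] for n in G)

print(cost(B), (N**2 + 2 * N * (W - 1)) / 4)      # Eq. (6): 590.0 in both cases
print(cost(A), N**2 / 4)                          # Eq. (7): 100.0 in both cases
print(cost(B) / cost(A), 1 + 2 * (W - 1) / N)     # Eq. (8): 5.9 in both cases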

5 EVALUATION METHODOLOGY

Our evaluation of the algorithm proceeds in two steps. First, in Section 6, we study its behavior under the ideal assumption that nodes have perfect global information about the network topology and the service demand. This proof-of-concept validation is carried out over both synthetic and real-world network topologies and under various assumptions for the service demand distribution. Later, in Section 8, we assess a realistic protocol implementation proposal, presented earlier in Section 7.

Network topology: The synthetic topologies we experiment with are Barabási-Albert graphs [19] and two-dimensional rectangular grids. These graph models were chosen deliberately since they bear very different and distinct structural properties. The B-A graphs are formed purely probabilistically and can reproduce a highly skewed node degree distribution that approximates the power-law shape reported in the literature [20]. Grids, on the other hand, exhibit a strictly regular structure with constant node degree and a diameter that grows as the square root of the number of network nodes. The synthetic network topologies let us highlight the behavior of cDSMA in the presence of general network structure properties. Nevertheless, the ultimate assessment of our algorithm is carried out over real-world ISP network topologies. The recently published dataset [21] we consider includes numerous snapshots of 14 different AS topologies, corresponding to Tier-1, Transit and Stub ISPs [22]. The data were collected daily during the period 2004-08 with the help of a multicast discovery tool called mrinfo, which circumvents the complexity and inaccuracy of more conventional measurement tools such as traceroute. We focus on the larger Transit and Tier-1 ISP datafiles, sizing up to approximately 1000 nodes, and show results for a representative subset featuring adequate variance in size, diameter and connectivity degree statistics.

Service demand distribution: At a first level, our assessment distinguishes between uniform and non-uniform demand scenarios. Though far from realistic, uniform demand scenarios let us study the exclusive impact of network topology upon the behavior of the algorithm. On the contrary,

under non-uniform demand distributions, the algorithm is exposed to the simultaneous influence of network topology and service demand dynamics. Mathematically speaking, a Zipf distribution models the preference w(n; s, N) of node n, n ∈ {1, ..., N}, for a given service:

    w(n; s, N) = \frac{1/n^s}{\sum_{l=1}^{N} 1/l^s}                (9)

Practically, the distribution could correspond to the normalized service request rate. Increasing the parameter s from 0 to ∞, the distribution asymmetry grows from zero (uniform demand) towards higher values. At a second level, we consider two options as to how the non-uniform service demand emerges spatially within the network. In the default option, each node randomly generates demand according to the Zipf law. The alternative is to introduce geographical correlation by concentrating nodes with high demand in the same network area. This second scenario lends itself to modeling services with a strongly local scope, whereas the first one better matches services that attract geographically broader interest.

Algorithm performance metrics: We are concerned with two metrics when assessing the performance of cDSMA. The first one relates to its accuracy, i.e., how well it approximates the optimal solution. It is defined as the average normalized excess cost, βalg, and equals the ratio of the service access cost our algorithm achieves, Calg(G, w), over the cost achieved by the optimal solution (derived by a brute-force centralized algorithm assuming availability of global topology and demand information), Copt(G, w), for a given network topology G and demand distribution w:

    \beta_{alg}(\alpha; G, w) = E\left[ \frac{C_{alg}(\alpha; G, w)}{C_{opt}(G, w)} \right]                (10)

βalg clearly depends on the percentage α of the network nodes participating in the solution. Less intuitively, yet in line with Proposition 4.1, greater α values may not result in a βalg improvement. Closely related to βalg are the αε indices, αε = argmin{α | βalg(α) ≤ 1 + ε}, corresponding to the minimum values of α for which the access cost achieved with our heuristic falls within 100·ε% of the optimal. The second metric is the migration hop count, hm, which is generally a function of α and reflects how fast the algorithm converges to its (sub)optimal solution. Smaller hm values imply faster service deployment and less overhead for the transport and service set-up/shut-down tasks. When the involved parameters vary along the network instances (e.g., B-A graphs) or the service demand distribution, we present results that are averages over 10 different topologies and/or 10 different vectors of Zipf-distributed demand values. We repeat a number of simulation runs (i.e., 20, 40 and 50) to obtain a sample of at least 7% of the {network topology, demand vectors} space. Typically, the average values are presented together with the 95% confidence intervals estimated over the runs.
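As an illustration of the two spatial options described above, the sketch below draws a demand vector from the Zipf law of (9) and assigns it either to random nodes or, for the correlated case, to the nodes closest to a randomly picked cluster head. Function and parameter names are hypothetical and the correlated assignment is a simplification of the R-hop cluster construction used in the experiments.

import random
import networkx as nx

def zipf_demand(N, s):
    # w(n; s, N) for n = 1..N, normalized to sum to 1, cf. Eq. (9)
    raw = [1.0 / (n ** s) for n in range(1, N + 1)]
    total = sum(raw)
    return [x / total for x in raw]

def assign_random(G, s):
    # spatially uncorrelated option: Zipf values handed to nodes in random order
    vals = zipf_demand(G.number_of_nodes(), s)
    nodes = list(G.nodes)
    random.shuffle(nodes)
    return dict(zip(nodes, vals))

def assign_correlated(G, s):
    # spatially correlated option (simplified): the nodes nearest to a random
    # cluster head receive the largest Zipf values
    vals = zipf_demand(G.number_of_nodes(), s)
    head = random.choice(list(G.nodes))
    by_distance = sorted(G.nodes, key=lambda n: nx.shortest_path_length(G, head, n))
    return dict(zip(by_distance, vals))

# e.g. assign_correlated(nx.grid_2d_graph(10, 10), s=1)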

Fig. 3. Synthetic topologies of 100 nodes: cDSMA accuracy βalg(α) vs. 1-median subgraph (GHost) size under uniform (a, b) and Zipf (c, d) demand distribution. Panels: a) B-A (s=0), b) 10x10 Grid (s=0), c) B-A (s=1), d) 10x10 Grid (s=1).

6 THEORETICAL ASSESSMENT OF cDSMA

6.1 Experiments with synthetic topologies

Figure 3 plots the average normalized excess cost βalg for B-A-like graphs³ and grids of 100 nodes against the GHost size under different demand patterns. As expected, the error induced by cDSMA tends to decrease with the GHost size, despite the non-monotonicity of the corresponding curves.

Grid topology: Under uniform demand the regular structure of the grid renders the most central points of the grid, i.e., its barycenter, the optimal service locations. Therefore, the demand gradient is directed towards the grid center. The migrating service moves along it; nevertheless, the closer it gets to the grid center, the more intense are the attraction forces from the nodes that lie behind it. As a result, there are nodes around the optimal grid location(s) that impede the local search and become traps for the migrating service (see Fig. 4a). This is due to a combination of the subgraph size (i.e., α) and the topology symmetry. More specifically, the demand mapping step correctly assigns higher weff values to those nodes in the GHost subgraph that are closer to the optimal location. However, the current host exhibits the lowest access cost among all the GHost nodes and terminates the service migration. As the α percentage grows, the optimal GHost location is shifted away from the current service host and, thus, the number of trap nodes decreases (see Fig. 4b, c).

The shift from uniform to a more skewed, spatially uncorrelated demand distribution results in the trap nodes appearing anywhere within the network. In particular, the traps will appear at nodes that lie somewhere in-between heavy hitters and stand under the influence of approximately equal attraction forces (Fig. 4d). The service gets locked there, much as happened under the uniform demand case. The difference now is that, as we let α grow, the GHost subgraph will stretch in many directions, namely the ones that heavy hitters use to reach the current service host. This means that we will need on average more nodes than before in order to overcome the traps. Consequently, the normalized cost ratio converges to one more slowly (Fig. 3d).

When the demand distribution is spatially correlated (i.e., the interest in the service is concentrated in a particular neighborhood, as when the service has strongly

3. To obtain proper B-A networks, we would need network sizes in the order of thousands of nodes. Hence, strictly speaking, the scale-free networks of a few hundred nodes' size we experiment with are too small to be called B-A networks. We retain the name for ease of reference.

TABLE 1
Impact of spatially correlated service demands

skewness s | Csp(1, s) | βalg(0.1)    | βalg(0.1, Csp)
1          | 0.786     | 1.026±0.016  | 1.013±0.010
2          | 8.540     | 1.002±0.003  | 1.0±0.0

local scope), a cluster of nodes with high service demand appears in a random area within the grid. Let the K cluster nodes collectively represent some percentage z% of the total demand for the service, whereas the other N − K nodes share the remaining (100 − z)% of the demand. We call the ratio z/(100 − z) the demand spatial contrast Csp. In 2D grids, clusters are formed by a cluster head node together with its R-hop neighbors. The contrast can then be written as:

    C_{sp}(R, s) = \frac{\sum_{n=1}^{K} w(n; s, N)}{\sum_{n=K+1}^{N} w(n; s, N)} = \frac{\sum_{n=1}^{K} 1/n^s}{\sum_{n=K+1}^{N} 1/n^s}

and the average normalized excess cost becomes a function of both α and the contrast value. The values of βalg(α, Csp) under spatially random and correlated (R = 1) distributions of demand are reported in Table 1 for a 10x10 grid topology. Having the top demand values stemming from a certain network neighborhood, we in effect "produce" a single pole of strong attraction for the migrating service. cDSMA now follows the demand gradient more effectively than before. As the percentage of the total demand held by the cluster nodes grows larger (i.e., higher Csp), the pole gets even stronger, driving the service firmly to the optimal location.

B-A graphs: The B-A graph characteristics seem to amplify the service trap phenomena. Regardless of the generation location, the high-degree hub nodes of B-A graphs [19] are correctly identified as low-cost solutions and therefore cDSMA moves the service there already with its first hop. As the hub node communicates directly with almost all GHost nodes, it easily becomes the minimum-cost node. In a B-A graph of 100 nodes, where we iterate generating a service at each node under uniform demand, the service locks 62 times on 3 different hub nodes other than the optimal one, and this happens almost consistently in the first hop. A closer look reveals that such local minima appear robust to the GHost size, in line with Fig. 3a. Fig. 5 depicts an example of cDSMA behavior when the service is generated at some node and subsequently moves to a sub-optimal hub. By increasing the number of selected nodes we essentially tend to include in the GHost subgraph a direct neighbor of the current high-degree service host. Thus we get the same final cost across a wide α range. Moreover, when


Fig. 4. The trap nodes (dark dots) as a function of the GHost subgraph size for a 7x7 grid under uniform demand (a, b, c: |GHost| = 5, 7, 12) and non-uniform demand (d, e, f: |GHost| = 5, 7, 16) that emerges with nodes 3 and 37 being equally heavy hitters. For the former case (optimal is 25) the traps vanish when |GHost| = 14, while for the latter (optimal is 24) when |GHost| = 18.

Fig. 5. Scaling the GHost size can hardly help the service overcome a high-degree B-A trap node (normalized cost and hopcount vs. percentage α).

the GHost is big enough to find the optimal host, it does so by moving there directly from the generation location; accordingly, the hopcount remains on average slightly over one (see table ??). When a non-uniform demand pattern emerges, the combination of the high-degree nodes with the presence of heavy hitters on average works in favor of cDSMA. In the above B-A graph of 100 nodes, the service locks on the 3 different sub-optimal hub nodes 39.5 times on average over 4 different demand vectors. The above results suggest that the cDSMA performance, while closely approximating the optimal, exhibits sensitivity to certain connectivity properties of the network topology. In the presence of high-degree hub nodes, assisted by low average path lengths, the algorithm requires a relatively large GHost size to correctly determine the next best solution. Moreover, the randomization introduced by skewed demand distributions does not necessarily benefit the algorithm (e.g., grids). In the sequel, we validate these general rules about cDSMA over real-world network topologies.

6.2 Experiments with real-world network topologies

The ultimate assessment of cDSMA is carried out over real-world ISP network topologies, which do not typically have the predictable structural properties of B-A graphs and grids. Still, we show below that insightful analogies regarding the behavior of cDSMA can be drawn between real-world and synthetic topologies. Table 2 summarizes the performance of cDSMA over the data that represent the real-world topologies⁴. It reports the minimum number of nodes |GHost| required (across the demand vectors, in the case of non-uniform demand patterns) to achieve a solution that lies within 2.5% of the optimal; the corresponding average migration hop count hm is also shown.

The |GHost| values show a notable insensitivity to both topological structure and service demand dynamics. Although the considered ISP topologies differ significantly in size and diameter, the required 1-median subgraph size does not change substantially. Employing 4.5% of the total number of nodes, or 6% for the least favorable case, suffices to obtain very good accuracy across all ISP topologies. Likewise, the required 1-median subgraph size remains, in almost all experiments, practically invariable with the demand distribution skewness. Although for larger values of s a few nodes become stronger attractors for the algorithm, the added value for its accuracy is negligible in most considered datasets. Even for the larger topologies, which appear more sensitive to service demand variations, the |GHost| differences across the skewness values are no more than 4% of the total network size. This two-way insensitivity of cDSMA is of major importance as: a) the computational complexity of the local 1-median problem can be kept negligible and scales well with the size and diameter of the network; b) the algorithm performance is robust to possibly inaccurate estimates of the service demand each node poses. Later, in Section 8, we will see how our practical cDSMA implementation maintains similarly welcome characteristics.

4. Several files miss some edges, resulting in more than one connected component [22]. Thus, a pre-processing task using a linear-time algorithm [23] is needed to retrieve the maximal connected component mCC.

5. A direct side-by-side comparison of the two service migration approaches is not applicable since the r-ball heuristic is combined with service replication, thus coping with the k-median problem.

6.3 cDSMA vs. locality-oriented service migration

It is instructive to compare cDSMA against stricter "local-search" approaches to distributed service migration. One example is the R-ball heuristic used in [24], where the search for a better service host is a priori bounded within the r-hop neighborhood of the node hosting the service at each step⁵. On the contrary, cDSMA invests more effort and intelligence in selecting the nodes of the subgraph wherein the reduced 1-median problem will be solved. The resulting 1-median subgraph is spatially stretched along paths consisting of highly "central" nodes and oversteps the local neighborhood "barriers". This is clearly illustrated in Fig. 6, showing the hopcount distribution between the service host node and the selected 1-median subgraph

TABLE 2
Mean value of αε for various datasets under different demand distributions
ISP (type: Tier-1): Global Crossing, -//-, NTTC-Gin, Sprint, -//-, Level-3, -//-, Sprint; (type: Transit): TDC, DFN-IPX-Win, JanetUK, Iunet



< hm >

36/3549 35/3549 33/2914 23/1239 21/1239 27/3356 13/3356 20/1239

76 100 180 184 216 339 378 528

10 9 11 13 12 24 25 16

3.71 3.78 3.53 3.06 3.07 3.98 4.49 3.13

1.00±0.23 1.30±0.34 1.0±0.0 1.40±0.36 1.40±0.41 2.23±0.58 2.27±0.59 1.40±0.62

6 7 18 8 7 4 4 11

1.63±0.35 1.26±0.16 1.11±0.13 1.22±0.15 1.42±0.19 3.15±0.24 2.48±0.37 1.27±0.21

3 6 10 7 6 3 4 21

3.32±0.78 1.45±0.22 1.08±0.09 1.95±0.31 1.50±0.12 2.41±0.40 ± 1.09±0.11

24

52/3292 41/680 40/786 39/1267

72 253 336 711

9 14 14 13

3.28 2.62 2.69 3.45

1.20±0.29 1.40±0.36 1.23±0.31 1.0±0.0

5 7 11 11

1.09±0.12 1.35±0.23 1.13±0.08 1.03±0.06

5 6 9 43

1.46±0.28 1.49±0.19 1.30±0.12 0.99±0.01

4 6 6 9

nodes, as extracted from five executions of cDSMA under Zipf demand with s=2.

Fig. 6. Hopcount distribution of 1-median subgraph nodes from the Host under cDSMA, for DataSet 35 (diameter=9) and DataSet 40 (diameter=14), with α = 3%, 5% and 10% (fraction of nodes vs. distance from the service host).

To highlight what cDSMA gains by the more informed derivation of the 1-median subgraph, we compare it against a variant of it that implements the R-ball heuristic, hereafter called Locality-Oriented Migration (LOM). Both variants use the same demand mapping mechanism (Section 4) to capture the demand from nodes lying outside the induced 1-median subgraphs. The comparison between cDSMA and LOM, illustrated in Table 3, proceeds as follows. We first generate asymmetric service demand (Zipf distribution with s = 1) across the network. We compute the globally optimal service host node and select a fixed set of service generation nodes, Dgen hops away from the optimal location. We then calculate the values of the βalg and hm metrics for the two approaches, along with the mean number of nodes included in the 1-median subgraph GLOM for each execution. For cDSMA, we have set the parameter α = 2%, yielding 1-median subgraphs of size 4 for Datasets 23 and 33, and of size 7 for Dataset 27 (Table 3).

< hm >

s=2 ⌈|GHost |⌉

Diameter

0.5

< hm >

s=1 ⌈|GHost |⌉

mCC nodes

In most of our experiments, LOM demonstrates comparable accuracy to our heuristic. It suffices to set R=1 to obtain near-optimal service placements, while taking more nodes into account seems to offer no extra gain. Still, the LOM approach exhibits two disadvantages when compared against cDSMA. First, LOM needs far more migration hops than cDSMA, since it hard-bounds the length of a single migration hop. For R=1, in particular, LOM performs as many hops as Dgen to reach the optimal service location. On the other hand, cDSMA chooses the most "appropriate" candidate host nodes, allowing them to stretch across the demand gradient direction; consequently,

0

s=0 ⌈|GHost |⌉

Dataset id/AS#

2 4 8 4 5 5

it leads the service fast to prominent locations. As a positive side-effect, it almost decouples the convergence speed of the algorithm from the service generation location. Thus, it does not differentiate user nodes according to their proximity to prominent nodes, in terms of topology and/or service demand, inducing a notion of fairness in the performance they get. Secondly, the LOM heuristic allows much less flexibility in determining the size of the 1-median subgraph. For R=2, LOM may end up seeking the next best service host among an order of magnitude more candidate hosts than cDSMA. On the other hand, taking R=1 does not always suffice; in the Dataset 27 experiment (Dgen = 14) the migration process stops prematurely one hop away from the service generation location, yielding a prohibitively high cost.

7 A PRACTICAL cDSMA IMPLEMENTATION

The evaluation of cDSMA in Sections 5 and 6 has shown that the algorithm eventually decomposes the original global 1-median problem into a series of significantly smaller local optimization problems and yet maintains close-to-optimal performance. However, these results are obtained under the assumption that the relevant topological and demand information is fully available to the network nodes. Thus, they should only be viewed as a concept validation study. A practical real-world implementation of cDSMA needs to cope with two main challenges. Firstly, information residing with individual network nodes has to be collected at the node hosting the service at each step. This information includes the wCBC and weff values that guide the 1-median subgraph derivation and the demand mapping steps, respectively (Section 4). Secondly, even if this information is compiled by each node in a distributed manner, Equations 4 and 5 imply that global information about the network topology and demand is required.

The second concern is partly addressed by the net interpretation of the wCBC metric, as already discussed in Section 3.2.1. The wCBC(u; t) value represents the aggregate traffic demand flowing through the node u towards the service host node t. Therefore, the {wCBC} values can be locally inferred at each node by directly measuring the transit traffic load that is destined for node t. How accurately the measured values {ŵCBC} match the

theoretical values {wCBC}, as defined in Eq. (4), depends on the network topology and routing protocol. The network topology may present each single node pair with one or more shortest (i.e., minimum-hopcount) paths, while the routing protocol may use one, more than one, or none of them. Therefore, leaving measurement inaccuracies aside, the two sets of values coincide in two scenarios: (a) when the topology gives rise to a single shortest path between all network node pairs (e.g., tree topologies) and the routing protocol routes traffic over this single shortest path; (b) when the topology induces multiple shortest paths between all network node pairs (e.g., lattice topologies) and the routing protocol splits the traffic demand equally among all of them. In general, the routing protocol can be viewed as a transformation F: {wCBC} → {ŵCBC}, which becomes an identity one in the two particular scenarios mentioned above. In the remainder of this section, we present the proposed cDSMA implementation in a step-by-step fashion. We emphasize how this implementation addresses the aforementioned practical challenges, drawing, where appropriate, a distinction between single-path and multi-path routing operation.

TABLE 3
Convergence speed and accuracy of LOM and cDSMA on real-world topologies
(columns: Dgen | LOM R=1: βalg, hm, mean|GLOM| | LOM R=2: βalg, hm, mean|GLOM| | cDSMA (α=2%): βalg, hm)

Dataset 23
3  | 1,      3, 7.2 | 1,      2, 28.3 | 1,      1
4  | 1.0299, 2, 9.5 | 1.0299, 1, 29.5 | 1.0299, 1
5  | 1.0434, 2, 8   | 1.0434, 1, 28   | 1.0434, 1
7  | 1.0299, 4, 7.8 | 1.0299, 2, 22.3 | 1.0299, 2

Dataset 33
3  | 1, 3,  5   | 1, 2, 14.3 | 1, 1
4  | 1, 4,  5   | 1, 2, 14.7 | 1, 2
5  | 1, 5,  5.3 | 1, 3, 14.3 | 1, 2
7  | 1, 7,  5.9 | 1, 4, 18.2 | 1, 3
10 | 1, 10, 5.8 | 1, 5, 18.8 | 1, 4

Dataset 27
3  | 1,    3,  5   | 1, 2, 17.7 | 1, 1
4  | 1,    4,  4.8 | 1, 2, 14.7 | 1, 1
5  | 1,    5,  4.3 | 1, 3, 11   | 1, 1
7  | 1,    7,  5.2 | 1, 4, 15.2 | 1, 2
10 | 1,    10, 5   | 1, 5, 15   | 1, 2
14 | 2.67, 1,  5.5 | 1, 8, 14.7 | 1, 2

7.1 Service host advertisement

Every time the service carries out a cDSMA-driven migration hop, the new host initiates a service advertisement phase (see Fig. 7b) to inform all network nodes about the current service location. This task may be carried out by any efficient flooding scheme requiring O(|E|) messages and O(D) time, where D is the network diameter. Note that this step is a sine qua non for all protocol instances realizing distributed service placement under dynamic demand patterns (e.g., [24], [25]).

7.2 Reporting of local wCBC estimates and inference of the 1-median subgraph

Drawing on passive measurements [16], each node u ∈ G = (V, E) derives an estimate ŵCBC(u; t) of the traffic demand flowing through u towards the current service host node t. These estimates are then reported to node t via dedicated measurement-reporting messages in O(D) time. Nodes can separately report the portion of traffic demand w(u) originating from themselves and the transit traffic wtrans(u; t) originating from other nodes towards t, with

    ŵCBC(u; t) = wtrans(u; t) + w(u)                (11)

The service host node then ranks the reported values and selects the nodes with the top α|V| values to form the 1-median subgraph, wherein the 1-median problem will eventually be solved (see 3.2.1). Besides bearing the two traffic demand values, these O(|V|) dedicated messages are further exploited to provide the current service host node with a (partial) view of the 1-median subgraph topology. As each measurement-reporting message travels on its shortest path towards the Host, it records all nodes lying on it. The inferred GHost topology exhibits attributes that depend on whether the employed routing protocol makes use of one or more, when available, shortest paths among given node pairs.

Single-path (SP) routing: Single-path routing is the standard practice; a single path is used for routing traffic between two nodes at any point in time, and resilience to failures is achieved by the use of (hot) stand-by paths (links) that are activated upon failure of the operational one. Under the SP routing policy the following proposition is relevant:

Proposition 7.1: In each iteration of cDSMA, the 1-median subgraph under a single-path routing policy is a tree rooted at the current service host node.

Proof: Since each node communicates with the current service host Host via a single shortest-path route, the selection of nodes for the 1-median subgraph is carried out on the spanning tree rooted at Host, as induced by the routing protocol operation. The only case in which the resulting subgraph GHost might not be a tree is when the node selection criterion results in a non-connected subgraph. But this is not possible with ŵCBC as the node selection criterion. To see this, consider any node, say A, which is part of the 1-median subgraph. Then for every node k lying on the A-Host shortest path, it holds that

    ŵCBC(k; Host) = wtrans(k; Host) + w(k)
                  ≥ wtrans(A; Host) + w(A) + w(k)
                  = ŵCBC(A; Host) + w(k)
                  ≥ ŵCBC(A; Host)                (12)

that is, all nodes lying on the shortest path A-Host report ŵCBC values at least as high as that of node A. Hence, if A is selected as a member of the subgraph, all nodes between A and Host on the tree branch A-Host are selected as subgraph nodes as well, implying that the subgraph is connected and, thus, a tree⁶.
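The monotonicity argument behind Proposition 7.1 can be illustrated with a few lines of Python: on a toy routing tree, the per-node estimates never decrease towards the root, so any top-k selection (plus the root) induces a connected subtree. The tree, the demand values and the brute-force estimate() helper are made up for the example.

import networkx as nx

T = nx.Graph([("Host", "A"), ("Host", "B"), ("A", "C"), ("A", "D"), ("B", "E")])
w = {n: 1.0 for n in T}

def estimate(u):
    # own demand plus the demand of every node whose path to the Host crosses u
    return sum(w[z] for z in T if u in nx.shortest_path(T, z, "Host"))

est = {u: estimate(u) for u in T if u != "Host"}
k = 3
selected = set(sorted(est, key=est.get, reverse=True)[:k]) | {"Host"}
print(selected, nx.is_connected(T.subgraph(selected)))   # prints True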


Fig. 7. Example of the cDSMA protocol implementation under single-path routing and uniform demand: a) the graph G(V, E); b) advertisement phase (service at A); c) Host induces GA; d) Host solves the 1-median problem.

Corollary 7.1: Under SP routing, the distance of any GHost node from the Host is upper-bounded by α|V| − 1.

Proof: Since GHost is a tree rooted at Host, the greatest distance from the root to a node equals the maximum height of an α|V|-node tree, which is α·|V| − 1.

We denote the set of node records appearing on the message of node x with msgx, their cardinality with |msgx|, and the m-th node entry in msgx with msgx(m). Fig. 8 (left) presents the topological information that would become available to node A in Fig. 7c by the end of this step. Regardless of the routing protocol, we explain in 7.3 that the topological information communicated by these dedicated messages suffices for carrying out the demand mapping task on the 1-median subgraph without the need for any additional feedback from the network nodes.

7.3 Global demand mapping on the 1-median subgraph

After the derivation of the 1-median subgraph, the current service host needs to further process the α|V| measurement-reporting messages that correspond to the selected subgraph nodes. The way this is done depends, again, on the deployed routing protocol.

Single-path (SP) routing: Proposition 7.1 has two favorable implications that greatly simplify the realization of the demand mapping task. Firstly, the weff values of all nodes can be computed through direct additions and subtractions of the reported ŵCBC values. Therefore, for leaf nodes x of the resulting tree T, weff(x; t) = ŵCBC(x; t); whereas, for all inner nodes u,

    weff(u; t) = ŵCBC(u; t) − \sum_{z \in Ch(u;T)} ŵCBC(z; t)                (13)

where Ch(u; T) is the set of child nodes of u in T. Secondly, the parsing of the measurement-reporting messages does not have to be exhaustive. Instead, the α|V| messages of the 1-median subgraph (tree) nodes can be sorted and parsed in decreasing length order. Since messages originating from internal tree nodes with a single child are subsets of the longer messages originating from the external tree nodes, they can be safely discarded without any information loss. For example, in Fig. 8 (right), the messages msgF and msgB are discarded upon parsing their first node entry.

6. Even when there is a tie in the wCBC values of two or more nodes, the current service host can use the topological information in the measurement-reporting messages to choose the node(s) that preserve the tree property.

Fig. 8. Left: Header format of msg_node1, the message sent from node 1 of GHost to the Host. Right: Message headers in decreasing length order for the GA of Fig. 7c.

Algorithm 2, hereafter called DeMaSP, summarizes the process. DeMaSP sequentially parses the measurement-reporting messages of the nodes selected for the 1-median subgraph (selected nodes) in decreasing length order of their msg part. The output variable weff, one per selected node, is initialized to the reported measured traffic values. While parsing the messages, DeMaSP subtracts the portion of traffic demand that has already been credited to nodes of higher depth in the 1-median tree. For instance, it computes the weff(B) of the internal node B (Fig. 7c) by subtracting the sum of the reported ŵCBC values of its child nodes over the emerging tree, i.e., D and F, from the node's own ŵCBC(B; A); the outcome equals w(C) plus the native w(B). Finally, the Host assigns to itself the amount of demand that has not been credited to the selected nodes; it equals its own demand w(Host) plus the difference between the measured incoming traffic demand, denoted by wtrans(Host; Host), and the sum of the ŵCBC values of its first neighbours that belong to the tree T of selected nodes, i.e., Ch(Host; T).
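A small numeric sketch of this mapping, with a made-up four-node routing tree and made-up measured estimates, shows how Eq. (13) and the final Host assignment play out; none of the values below come from the paper's experiments.

# routing-tree structure restricted to the selected (subgraph) nodes
children = {"Host": ["A", "B"], "A": ["C"], "B": [], "C": []}
wcbc_hat = {"A": 5.0, "B": 2.0, "C": 3.0}     # measured estimates reported to the Host
w_host, w_trans_host = 1.0, 7.0               # Host's own and measured transit demand

w_eff = {}
for node, kids in children.items():
    if node == "Host":
        continue
    # Eq. (13): subtract the estimates already credited to the node's children
    w_eff[node] = wcbc_hat[node] - sum(wcbc_hat[k] for k in kids)

# demand not credited to any selected node stays with the Host itself
w_eff["Host"] = w_host + w_trans_host - sum(wcbc_hat[k] for k in children["Host"])
print(w_eff)   # {'A': 2.0, 'B': 2.0, 'C': 3.0, 'Host': 1.0}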

7.4 1-median solution within the GHost subgraph

The second input needed for the reduced 1-median solution is the set of pairwise distances between all GHost nodes. This can be acquired as follows: the current Host notifies each of the top ŵCBC nodes with unicast messages of which other nodes (co-players) are included in GHost and queries them for their pairwise distances. Each node determines its distance to the other co-players via a mechanism such as the ping utility (O(α²|V|²) steps, O(α²|V|²) messages), and communicates them (O(α|V|) messages) to the host.

7.5 Additional remarks

Two remarks are worth making regarding the generality and scalability of cDSMA, respectively. First, cDSMA can be applied under any deployed routing strategy, even when


Algorithm 2 Message header parsing and demand mapping under SP (DeMaSP)
1.  input: set of selected nodes in GHost, {msg_u} ∀u ∈ GHost
2.  output: vector weff(u) ∀u ∈ GHost
3.  Initialization:
4.  for all x ∈ GHost do weff(x) = ŵCBC(x)
5.  vector B ← sort all msg_x in decreasing order of |msg_x|
6.  for i = 1 up to Len(B) do
7.      parse B(i) = msg_x
8.      for m = 1 up to |msg_x| − 1 do
9.          if msg_x(m) is marked then
10.             if m > 1 then
11.                 k = msg_x(m − 1), l = msg_x(m)
12.                 weff(l) = weff(l) − ŵCBC(k)
13.             end
14.             drop msg_x
15.         else
16.             if m > 1 then
17.                 k = msg_x(m − 1), l = msg_x(m)
18.                 weff(l) = ŵCBC(l) − ŵCBC(k)
19.             end
20.             mark msg_x as read
21.         end if
22.     end for
23. end for
24. weff(Host) = w(Host) + ( wtrans(Host; Host) − \sum_{z ∈ Ch(Host;T)} ŵCBC(z) )

the paths actually used are not the minimum-hopcount ones (or shortest in some other sense). In that case, the theoretical {wCBC} and the measured {ŵCBC} values will deviate, even under single-path routing and tree topologies. cDSMA will then derive a probably different subgraph and the weights of the nodes in this subgraph will be correspondingly affected. However, the algorithm will carry out its task as usual; it cannot know, and does not need to learn, how exactly the routing protocol in operation transforms the field of theoretical {wCBC} values. Secondly, the transit traffic measurement task and the nodes' communication with the Host via path-recording messages do not have to involve the whole network. The service Host may normally carry out the advertisement phase and then request traffic measurements only from a limited number of nodes, thereby reducing the messages it needs to process. The induced tradeoff between the convergence speed and the computational complexity of the algorithm is studied in the next section.

8 PERFORMANCE OF cDSMA PRACTICAL IMPLEMENTATIONS

We now compare the two practical implementations of cDSMA, i.e., over single-path (cDSMA_SP) and multi-path (cDSMA_MP) routing, against their theoretical primitive of Section 4. Effectively, we repeat the experiments of Section 6.2 over the ISP network topologies: the service facilities are generated at the same initial locations, the demand vectors coincide with those employed in 6.2, and the 1-median subgraph sizes are set to those ⌈|GHost|⌉ values that yield solutions of cost within 2.5% of the optimal for the theoretical cDSMA (cf. Table 2). For the multi-path routing

case, in particular, we set each weight factor wf_j(u; Host) of node u equal to the inverse of the number of its outgoing links towards the Host; in other words, we assume that traffic destined for the Host is equally split over each of those of u's outgoing links that lie on shortest paths to the Host. We first evaluate our practical implementation assuming that all network nodes are engaged in performing and reporting traffic demand measurements (see Section 7.2). Later we relax this requirement, delegating the measurement task only to nodes within the r-hop neighborhood of the current service host, letting r modulate the accuracy vs. message overhead tradeoff.

Network-wide traffic measurements. Table 4 reports the performance of our practical implementation in terms of mean normalized cost and hopcount values when the current service host collects traffic measurements from every network node. Both practical instances turn out to be on average as accurate as their theoretical primitive; indeed, in most cases they provide solutions with cost within 2.5% of the optimal. This close match holds across different ISP network topologies and both demand patterns (uniform and skewed), suggesting that the algorithm is adequately robust to variations of network topology and traffic demand. More importantly, our practical schemes respond successfully to the different ways the underlying routing protocols transform the actual spatial demand distribution, in line with the remark of 7.5. As a result, a service user is expected to experience consistently close-to-optimal performance irrespective of the routing protocol selection in the network. On the other hand, some increased confidence intervals suggest that a few nodes in the corresponding samples trap the migrating service and therefore increase the placement costs. Since the experiments of Table 4 are carried out with 1-median subgraphs of at most 6% of the respective network nodes, a slight increase of the subgraph size is tolerable and could improve the quality of the solution.

cDSMA_SP typically needs slightly more hops to reach the final host node than the theoretical algorithm. Indeed, the latter, as well as cDSMA_MP, is characterized by the capability of its selected nodes to spatially stretch across prominent directions, as opposed to the upper-bounded GHost diameter of cDSMA_SP (see 7.2). It seems, though, that this bound results only in a marginal hopcount increase; cDSMA_SP can still move through these ISP networks taking few hops. This is in compliance with Fig. 6, where even for higher asymmetry (s=2) a negligible number of the GHost nodes determined by the theoretical algorithm lie further than the respective DeMaSP upper bound. Accordingly, the hops that cDSMA_MP takes are pretty much the same as those of the theoretical algorithm. Overall, and with respect to the network diameter and the ⌈|GHost|⌉ size, both cDSMA practical instances take no more than three hops, on average, to reach their final location. Effectively, this minimizes the installation costs on intermediate hosts along the migration path as well as the overhead of host advertisements and measurement reports.

Traffic measurements within the Rmsr-hop neighborhood. We repeat the same experiments, but now the service host node


prompts only those nodes lying within Rmsr hops to communicate their traffic measurement values. Table 5 captures the tradeoff between the achieved algorithm accuracy (mean access cost and hopcount) and the overhead, in terms of measurement-reporting messages, that our cDSMA implementation needs to operate. The latter is quantified by ∆(av.msg), which computes the difference ∆ between the average total number of messages sent when the traffic measurements are performed by all nodes and by those within Rmsr, respectively. Table 5 reveals that both practical instances of cDSMA preserve high performance standards even when the utilized traffic measurement values come from a restricted vicinity around the current Host. Although we keep the α percentages constant and equal to the corresponding ones employed in the experiments of Tables 2 and 4, the GHost size varies across iterations, being upper-bounded by ⌈|GHost|⌉; those top ŵCBC nodes of a certain iteration that lie further than Rmsr from the current Host are not considered. This affects the accuracy of our implementation for small Rmsr values. Nevertheless, Rmsr values of 4 or 5 hops suffice to achieve performance that lies on average within 3.5% of the optimal for both ISP topologies and demand dynamics. Imposing the Rmsr bound is expected to somewhat weaken the cDSMA advantage of long hops towards the final location. Still, the hopcount of the migrating service increases by no more than 1 hop, on average, compared to the network-wide measurement case. In terms of message overhead this means that we need to bear at most one more service advertisement task. On the other hand, as Rmsr scales, the hopcount to destination decreases while the number of measurement-reporting nodes (i.e., |GHost − {Host}|) slightly increases. Thus, the number of messages the bounded-measurement implementation needs to operate decreases with Rmsr, while the messages of the network-wide measurement implementation remain constant. This is clearly demonstrated in the ∆(av.msg) values, which for the former case indicate a dramatically lower number of messages, by up to several hundred for the preferable Rmsr values of 4 or 5.

TABLE 4
Performance of the practical implementation under the theoretical ⌈|GHost|⌉ values
(per dataset: β(⌈|GHost|⌉), hm)

Dataset | cDSMA_SP s=0               | cDSMA_SP s=1               | cDSMA_MP s=0               | cDSMA_MP s=1
36      | 1.0039±0.0152, 1.50±0.36   | 1.0316±0.0145, 1.80±0.31   | 1.0135±0.0219, 1.13±0.31   | 1.0170±0.0131, 1.37±0.06
35      | 1.0122±0.0122, 1.30±0.40   | 1.0229±0.0210, 1.30±0.17   | 1.0087±0.0111, 1.10±0.22   | 1.0145±0.0123, 1.41±.006
33      | 1.0378±0.0441, 0.97±0.13   | 1.0461±0.0278, 1.12±0.14   | 1.0244±0.0408, 1.0±0.0     | 1.0185±0.0152, 1.02±0.03
23      | 1.0132±0.0356, 1.53±0.48   | 1.0255±0.0164, 1.25±0.18   | 1.0±0.0,       1.43±0.36   | 1.0123±0.0084, 1.17±0.05
21      | 1.0391±0.0529, 1.26±0.32   | 1.0339±0.0206, 1.34±0.18   | 1.0±0.0,       1.53±0.36   | 1.0122±0.0132, 1.48±0.07
27      | 1.0±0.0,       2.30±0.62   | 1.0016±0.0036, 3.39±0.33   | 1.0±0.0,       2.23±0.58   | 1.0018±0.0040, 3.23±0.06
13      | 1.0165±0.0481, 3.07±1.01   | 1.0160±0.0093, 2.59±0.39   | 1.0±0.0,       2.87±1.09   | 1.0105±0.0069, 2.36±0.06
20      | 1.0144±0.0124, 1.33±0.44   | 1.0311±0.0225, 1.26±0.12   | 1.0279±0.0400, 1.13±0.29   | 1.0055±0.0051, 1.29±0.04
52      | 1.0091±0.0132, 0.97±0.13   | 1.0103±0.0059, 1.13±0.21   | 1.0045±0.0099, 1.07±0.18   | 1.0076±0.0062, 1.10±0.02
41      | 1.0154±0.0137, 1.07±0.18   | 1.0153±0.0103, 1.40±0.26   | 1.0151±0.0135, 1.07±0.32   | 1.0092±0.0078, 1.50±0.14
40      | 1.0119±0.0144, 1.0±0.0     | 1.0194±0.0096, 1.16±0.19   | 1.0149±0.0154, 1.27±0.32   | 1.0127±0.0093, 1.09±0.04
39      | 1.0144±0.0080, 1.0±0.0     | 1.0195±0.0118, 0.99±0.01   | 1.0125±0.0080, 0.98±0.11   | 1.0096±0.0069, 1.09±0.06

9 RELATED WORK

Out of the vast literature that studies placement problems under the lens of discrete optimization, we focus on the distributed solutions and identify two main approaches: those adopting a facility location approach [10] and those drawing on a knapsack problem formulation [26]. Placement problems can be cast in the knapsack framework when constrained storage is considered. Each data item is associated with a utility value and the goal is then to determine the set of items that results in maximum total utility when stored in a resource-constrained buffer. Relevant works span from middleware solutions that dynamically place application instances on limited-capacity servers [27] to data dissemination systems that drive the content replication in mobile devices upon opportunistic encounters [28].

More relevant to our service deployment scenario, though, is the former approach, which attracts both theoretical and practical interest. The theoretical thread relates to the approximability of the facility location problem by distributed approaches. Algorithms are typically executed over a complete bipartite graph where the m facilities and n client nodes communicate with each other in synchronous send-receive rounds. Moscibroda and Wattenhofer [29] draw on a primal-dual approach earlier devised by Jain and Vazirani [30] to derive a distributed algorithm that trades off the approximation ratio with the communication overhead, under the assumption of O(log n)-bit message sizes. More recently, Pandit and Pemmaraju [31] have derived an alternative distributed algorithm for metric facility location that runs in k rounds, achieving an O(m^{2/√k} · n^{3/√k})-approximation. Our work, on the other hand, comes under the broader family of heuristic algorithms that most often cope better with the practical limitations of real-world applications; at the same time they can be extremely effective. In most of the proposed solutions in this context, the available content/service facilities migrate towards their optimal location. In [32] the authors propose a joint optimization of content replication and placement, suited for wireless networks. Their methodology nevertheless resembles the one we follow. They formulate a capacitated multi-commodity optimization problem and break it down into a multitude


TABLE 5
Performance of the practical implementations when Rmsr bounds the number of measurement-reporting nodes
(per dataset and Rmsr: βRmsr, hm, ∆(av.msg))

cDSMA_SP, uniform demand (s=0)
Rmsr | Dataset 35                            | Dataset 20
2    | 1.0275±0.0541, 1.65±0.53, 89±1        | 1.0133±0.0126, 2.20±0.61, 278±1
3    | 1.0122±0.0122, 1.45±0.43, 108±1       | 1.0149±0.0124, 1.65±0.50, 567±1
4    | 1.0122±0.0122, 1.30±0.40, 122±1       | 1.0144±0.0124, 1.55±0.48, 618±1
5    | 1.0122±0.0122, 1.30±0.40, 122±1       | 1.0144±0.0124, 1.35±0.44, 723±1

cDSMA_SP, non-uniform demand (s=1)
2    | 1.0306±0.0217, 1.83±0.36, 65±1        | 1.0357±0.0228, 2.29±0.60, 115±1
3    | 1.0299±0.0210, 1.48±0.22, 99±1        | 1.0338±0.0225, 1.77±0.38, 385±1
4    | 1.0299±0.0210, 1.35±0.17, 111±1       | 1.0322±0.0228, 1.55±0.27, 498±1
5    | 1.0299±0.0210, 1.30±0.17, 116±1       | 1.0322±0.0229, 1.43±0.21, 559±1

cDSMA_MP, uniform demand (s=0)
2    | 1.0429±0.0739, 1.25±0.39, 89±1        | 1.0420±0.0501, 1.88±0.65, 195±1
3    | 1.0087±0.0111, 1.25±0.32, 88±1        | 1.0340±0.0451, 1.50±0.54, 393±1
4    | 1.0087±0.0111, 1.10±0.22, 103±1       | 1.0314±0.0427, 1.42±0.51, 434±1
5    | 1.0087±0.0111, 1.10±0.22, 103±1       | 1.0349±0.0452, 1.25±0.39, 523±1

cDSMA_MP, non-uniform demand (s=1)
2    | 1.2655±0.5016, 2.02±0.19, 75±1        | 1.0539±0.0138, 2.21±0.18, 188±1
3    | 1.0141±0.0123, 1.75±0.10, 101±1       | 1.0254±0.0170, 1.91±0.12, 342±1
4    | 1.0145±0.0123, 1.56±0.08, 120±1       | 1.0131±0.0099, 1.65±0.08, 476±1
5    | 1.0145±0.0123, 1.51±0.08, 125±1       | 1.0079±0.0066, 1.56±0.07, 521±1

They formulate a capacitated multi-commodity optimization problem and break it down into a multitude of single-commodity problems, each seeking to minimize the cost of serving one information item available at the corresponding facility. Each problem is solved by mimicking a local-search technique that relies on measurements of the demand queries each node serves; these measurements drive the content replication and its hand-over to the immediate neighbors of the current facility. Even closer to our work is the forthcoming paper by Smaragdakis et al. [24], which proposes the R-ball heuristic. They reduce the original k-median problem to multiple smaller-scale 1-median problems, each solved within an area of R hops around the current location of a service facility. This concept has been realized in a functional system by the authors of [33], who consider practical improvements to cope with the dynamic join-and-leave events of p2p overlays. Contrary to cDSMA, the area over which the R-ball heuristic searches for candidate next service hosts, and onto which it maps the demand of the “outer” nodes, is the immediate neighborhood around the current service location. In Section 6.3 we have discussed in detail how cDSMA compares with a similar, local-search-oriented approach. A slightly different approach to the iterative migration of facilities through locally determined hops has been adopted by Oikonomou and Stavrakakis in [25]. They exploit the shortest-path tree structures induced on the network graph by the routing protocol to estimate upper bounds on the aggregate cost should the service migrate to one of its 1-hop neighbors. Migration hops are therefore one physical hop long, which slows down the migration process, especially in larger networks.
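As a rough illustration of the local-search flavour these migration heuristics share, the sketch below moves a single service facility, one step at a time, to whichever node within an R-hop ball of its current host minimises a demand-weighted distance cost (a 1-median objective). The toy topology, the demand values, the hop_distances/weighted_cost/migrate helpers, and the stopping rule are assumptions made for the example; they do not reproduce the algorithms of [24], [25], or cDSMA.

```python
# Illustrative sketch (assumptions only, not the algorithms of [24] or [25]):
# iterative 1-median-style migration of one service facility. At every step the
# facility inspects the nodes within R hops of its current host and moves to
# the one with the smallest demand-weighted distance cost.
from collections import deque

def hop_distances(adj, src):
    """BFS hop counts from src over an adjacency dict {node: set(neighbours)}."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def weighted_cost(adj, host, demand):
    """1-median objective: sum of demand weights times hop distance to host."""
    d = hop_distances(adj, host)
    return sum(w * d[n] for n, w in demand.items())

def migrate(adj, demand, host, R=1, max_steps=20):
    for _ in range(max_steps):
        ball = [n for n, h in hop_distances(adj, host).items() if h <= R]
        best = min(ball, key=lambda n: weighted_cost(adj, n, demand))
        if best == host:   # no better host inside the R-ball: stop
            return host
        host = best        # migrate the facility and iterate
    return host

if __name__ == "__main__":
    # Toy line topology 0-1-2-3-4 with demand concentrated at node 4.
    adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
    demand = {0: 1, 1: 1, 2: 1, 3: 1, 4: 5}
    print(migrate(adj, demand, host=0, R=1))  # converges to node 4
```

In such schemes the radius R trades off the quality of each migration step against the amount of demand information that has to be gathered around the current host, which is precisely the kind of cost/accuracy trade-off the comparison in Section 6.3 examines.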

10 CONCLUSION

In this paper we have developed a scalable and effective heuristic approach to dealing with the complexity and limitations of distributed service-placement optimization. Our key idea amounts to solving the 1-median problem in a scalable manner, replacing the centralized and costly approach with an agile service migration mechanism that places each service instance at cost-effective host nodes. To do so, we have turned to the Complex Network Analysis toolbox, which allows us, through an iterative local-search-like process, to correctly identify the nodes that exert the strongest attraction forces on each service. Through systematic simulation we have explored the relevant {topology, demand} parameter space and validated the effectiveness of the proposed solution, focusing on experiments with real-world topologies. Since the retrieval of the relevant centrality values is subject to practical limitations, we have introduced a distributed implementation of the heuristic, shown to drive the service towards near-optimal locations using only locally available information. The proposed implementation copes with variations in the information-flow trajectories and at the same time effectively bounds the number of necessary messages, at a negligible performance penalty. The problem of how the network can effectively push facilities to prominent locations gains renewed interest when considered from a decision-theoretic viewpoint: nodes that host a service instance contribute to the social welfare at the expense of their individual resources. Exploring this networking dilemma is the focus of our future work.
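To give a flavour of how centrality-like cues can rank candidate hosts, the minimal sketch below scores every node with a demand-weighted "attraction" value (heavy demand nearby pulls harder) and sorts the nodes by it. The attraction score, the rank_hosts helper, and the toy input are assumptions for illustration only; they are not the centrality metric used by cDSMA.

```python
# Illustrative only: a simple demand-weighted "attraction" score for ranking
# candidate hosts, echoing the centrality cues discussed above. The scoring
# function is an assumption of this sketch, not the metric used by cDSMA.
from collections import deque

def hop_distances(adj, src):
    """BFS hop counts from src over an adjacency dict {node: set(neighbours)}."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def attraction(adj, node, demand):
    """Demand-weighted closeness-style score: nearby heavy demand pulls harder."""
    d = hop_distances(adj, node)
    return sum(w / (1 + d[n]) for n, w in demand.items())

def rank_hosts(adj, demand):
    """Rank all nodes from most to least attractive as a service host."""
    return sorted(adj, key=lambda n: attraction(adj, n, demand), reverse=True)

if __name__ == "__main__":
    # Toy line topology 0-1-2-3-4 with demand concentrated at node 4.
    adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
    demand = {0: 1, 1: 1, 2: 1, 3: 1, 4: 5}
    print(rank_hosts(adj, demand))  # nodes near the heavy demand rank first
```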

REFERENCES

[1] T. Plagemann et al., “From content distribution networks to content networks - issues and challenges,” Comput. Commun., vol. 29, no. 5, pp. 551–562, 2006.
[2] V. Jacobson et al., “Networking named content,” in Proc. 5th ACM CoNEXT, Rome, Italy, December 2009, pp. 1–12.
[3] A. Galis et al., “Management architecture and systems for future internet networks,” in FIA Book: "Towards the Future Internet - A European Research Perspective", Prague, May 2009, pp. 112–122.
[4] E. Silva, L. F. Pires, and M. van Sinderen, “Supporting dynamic service composition at runtime based on end-user requirements,” in Proc. 1st International Workshop on User-generated Services (co-located with ICSOC 2009), Stockholm, Sweden, November 2009.
[5] [Online]. Available: http://code.google.com/appengine/
[6] [Online]. Available: http://pipes.yahoo.com/pipes/
[7] J. C. Yelmo et al., “A user-centric approach to service creation and delivery over next generation networks,” Computer Communications, vol. 34, no. 2, 2011.
[8] V. Valancius et al., “Greening the internet with nano data centers,” in Proc. 5th ACM CoNEXT, Rome, Italy, 2009, pp. 37–48.
[9] D. Trossen, M. Sarela, and K. Sollins, “Arguments for an information-centric internetworking architecture,” SIGCOMM Comput. Commun. Rev., vol. 40, no. 2, pp. 26–33, Apr. 2010.
[10] P. Mirchandani and R. Francis, Discrete Location Theory. John Wiley and Sons, 1990.
[11] P. Pantazopoulos, M. Karaliopoulos, and I. Stavrakakis, “Centrality-driven scalable service migration,” in Proc. 23rd International Teletraffic Congress (ITC’11), San Francisco, California, USA, 2011.
[12] M. Hefeeda and O. Saleh, “Traffic modeling and proportional partial caching for peer-to-peer systems,” IEEE/ACM Trans. Netw., vol. 16, no. 6, pp. 1447–1460, Dec. 2008.
[13] R. Solis-Oba, “Approximation algorithms for the k-median problem,” ser. Lecture Notes in Computer Science, vol. 3484. Springer, 2006.
[14] M. E. J. Newman, “The structure and function of complex networks,” SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.
[15] P. Pantazopoulos, I. Stavrakakis, A. Passarella, and M. Conti, “Efficient social-aware content placement for opportunistic networks,” in Proc. IFIP/IEEE WONS, Kranjska Gora, Slovenia, February 3-5, 2010.
[16] S. Jaiswal, G. Iannaccone, J. Kurose, and D. Towsley, “Formal analysis of passive measurement inference techniques,” in Proc. 25th IEEE INFOCOM, Barcelona, Spain, 2006, pp. 1–12.
[17] U. Brandes, “A faster algorithm for betweenness centrality,” Journal of Mathematical Sociology, vol. 25, pp. 163–177, 2001.
[18] T. Kanungo et al., “A local search approximation algorithm for k-means clustering,” Comput. Geom. Theory Appl., vol. 28, no. 2-3, pp. 89–112, Jun. 2004.
[19] A.-L. Barabasi and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, Oct. 1999.
[20] G. Siganos, M. Faloutsos, P. Faloutsos, and C. Faloutsos, “Power laws and the AS-level internet topology,” IEEE/ACM Trans. Netw., vol. 11, no. 4, pp. 514–524, 2003.
[21] J.-J. Pansiot, “mrinfo dataset.” [Online]. Available: http://svnet.u-strasbg.fr/mrinfo/
[22] J.-J. Pansiot, P. Mérindol, B. Donnet, and O. Bonaventure, “Extracting intra-domain topology from mrinfo probing,” in Proc. Passive and Active Measurement Conference (PAM), April 2010.
[23] R. M. Karp and R. E. Tarjan, “Linear expected-time algorithms for connectivity problems (extended abstract),” in Proc. ACM STOC ’80, Los Angeles, California, 1980, pp. 368–377.
[24] G. Smaragdakis, N. Laoutaris, K. Oikonomou, I. Stavrakakis, and A. Bestavros, “Distributed server migration for scalable internet service deployment,” IEEE/ACM Trans. Netw., 2012, to appear.
[25] K. Oikonomou and I. Stavrakakis, “Scalable service migration in autonomic network environments,” IEEE JSAC, vol. 28, no. 1, pp. 84–94, 2010.
[26] S. Martello and P. Toth, Knapsack Problems: Algorithms and Computer Implementations. New York: Wiley, 1990.
[27] C. Adam and R. Stadler, “Service middleware for self-managing large-scale systems,” IEEE Trans. on Network and Service Management, vol. 4, no. 3, pp. 50–64, Apr. 2008.
[28] C. Boldrini, M. Conti, and A. Passarella, “Design and performance evaluation of ContentPlace, a social-aware data dissemination system for opportunistic networks,” Computer Networks, vol. 54, no. 4, pp. 589–604, 2010.
[29] T. Moscibroda and R. Wattenhofer, “Facility location: distributed approximation,” in Proc. ACM PODC ’05, 2005, pp. 108–117.
[30] K. Jain and V. V. Vazirani, “Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation,” Journal of the ACM, vol. 48, no. 2, 2001.
[31] S. Pandit and S. Pemmaraju, “Return of the primal-dual: distributed metric facility location,” in Proc. ACM PODC ’09, 2009, pp. 180–189.
[32] C.-A. La, P. Michiardi, C.-F. Chiasserini, and M. Fiore, “Content replication and placement in mobile networks,” IEEE JSAC, 2012, to appear.
[33] T. Sproull and R. Chamberlain, “Distributed algorithms for the placement of network services,” in Proc. International Conference on Internet Computing, Las Vegas, Nevada, USA, July 2010.