Ref. Ares(2017)5435490 - 08/11/2017
SoBigData – 654024
www.sobigdata.eu
Project Acronym
SoBigData
Project Title
SoBigData Research Infrastructure Social Mining & Big Data Ecosystem
Project Number
654024
Deliverable Title
Social mining method and service integration 2
Deliverable No.
D9.2
Delivery Date
31 August 2017
Authors
Roberto Trasarti (CNR), Valerio Grossi (CNR)
SoBigData receives funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 654024
DOCUMENT INFORMATION
PROJECT
Project Acronym
SoBigData
Project Title
SoBigData Research Infrastructure Social Mining & Big Data Ecosystem
Project Start
1st September 2015
Project Duration
48 months
Funding
H2020-INFRAIA-2014-2015
Grant Agreement No.
654024
DOCUMENT
Deliverable No.
D9.2
Deliverable Title
Social mining method and service integration 2
Contractual Delivery Date
31 August 2017
Actual Delivery Date
08 November 2017
Author(s)
Roberto Trasarti (CNR), Valerio Grossi (CNR)
Editor(s)
Dino Pedreschi (UNIPI)
Reviewer(s)
Valerio Grossi (CNR)
Contributor(s)
Tiziano Squartini (IMT), Cristina Muntean (CNR), Marco Cornolti (UNIPI), Stefano Cresci (CNR), Andrea Passarella (CNR), Gerhard Gossen (LUH), Thorsten May (FRH), Gennady Andrienko (FRH), Kalina Bontcheva (USFD)
Work Package No.
WP9
Work Package Title
JRA2_Integrating Big Data Analytics Methods and Techniques
Work Package Leader
CNR
Work Package Participants
CNR, USFD, UNIPI, FRH, UT, IMT, LUH, KCL, SNS, AALTO, ETHZ
Dissemination
Public
Nature
Report
Version / Revision
V1.0
Draft / Final
Final
Total No. Pages (including cover)
27
Keywords
Method and Services, Integration, Data Analytics
DISCLAIMER
SoBigData (654024) is a Research and Innovation Action (RIA) funded by the European Commission under the Horizon 2020 research and innovation programme. SoBigData proposes to create the Social Mining & Big Data Ecosystem: a research infrastructure (RI) providing an integrated ecosystem for ethic-sensitive scientific discoveries and advanced applications of social data mining on the various dimensions of social life, as recorded by “big data”. Building on several established national infrastructures, SoBigData will open up new research avenues in multiple research fields, including mathematics, ICT, and human, social and economic sciences, by enabling easy comparison, re-use and integration of state-of-the-art big social data, methods, and services, into new research.
This document contains information on SoBigData core activities, findings and outcomes and it may also contain contributions from distinguished experts who contribute as SoBigData Board members. Any reference to content in this document should clearly indicate the authors, source, organisation and publication date.
The document has been produced with the funding of the European Commission. The content of this publication is the sole responsibility of the SoBigData Consortium and its experts, and it cannot be considered to reflect the views of the European Commission. The authors of this document have taken any available measure in order for its content to be accurate, consistent and lawful. However, neither the project consortium as a whole nor the individual partners that implicitly or explicitly participated in the creation and publication of this document hold any sort of responsibility that might occur as a result of using its content.
The European Union (EU) was established in accordance with the Treaty on the European Union (Maastricht). There are currently 27 member states of the European Union. It is based on the European Communities and the member states’ cooperation in the fields of Common Foreign and Security Policy and Justice and Home Affairs. The five main institutions of the European Union are the European Parliament, the Council of Ministers, the European Commission, the Court of Justice, and the Court of Auditors (http://europa.eu.int/).
Copyright © The SoBigData Consortium 2015. See http://project.sobigdata.eu/ for details on the copyright holders. For more information on the project, its partners and contributors please see http://project.sobigdata.eu/. You are permitted to copy and distribute verbatim copies of this document containing this copyright notice, but modifying this document is not allowed. You are permitted to copy this document in whole or in part into other documents if you attach the following reference to the copied elements: “Copyright © The SoBigData Consortium 2015.”
The information contained in this document represents the views of the SoBigData Consortium as of the date they are published. The SoBigData Consortium does not guarantee that any information contained herein is error-free, or up to date. THE SoBigData CONSORTIUM MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, BY PUBLISHING THIS DOCUMENT.
TABLE OF CONTENT
DOCUMENT INFORMATION
DISCLAIMER
TABLE OF CONTENT
DELIVERABLE SUMMARY
EXECUTIVE SUMMARY
1 RELEVANCE TO SOBIGDATA
1.1 PURPOSE OF THIS DOCUMENT AND RELATION TO OTHER WORKPACKAGES
1.2 STRUCTURE OF THE DOCUMENT
2 RESEARCH INFRASTRUCTURE INTEGRATIONS AND UPDATES
2.1 NEW METHODS AND SERVICES
2.1.1 DICTIONARY CREATOR
2.1.2 M-ATLAS WEB APPLICATION (NEW APPLICATION VRE)
2.1.3 TCLUSTERING
2.2 UPDATE OF EXISTING METHODS AND SERVICES
2.2.1 "MAX & SAM" NETWORKS RECONSTRUCTION
2.2.2 POLARIZED DEBATES TRACKER
2.2.3 QUICKRANK
2.2.4 VISUAL TOPIC EXPLORATION (IGD)
2.2.5 STATISTICAL VALIDATION
2.2.6 MAXIMUM-ENTROPY NETWORK RECONSTRUCTION
2.2.7 TAIL-GRANGER CAUSALITY NETWORK RECONSTRUCTION
2.2.8 THE SEGREGATION DISCOVERY METHOD (SCUBE)
2.2.9 ANALYSING VOTING INTENT IN TWEETS
3 CONCLUSION: STATE OF THE ART
REFERENCES
DELIVERABLE SUMMARY
This deliverable presents the state of the art of the methods/services in SoBigData, including all the algorithmic resources available through Virtual Access, via the on-line platform, and through Transnational Access in the various local research infrastructures. Due to its nature, this deliverable is incremental with respect to deliverable D9.1: “Social mining method and service integration 1”, and describes in detail only the new methods/services or the changes to existing ones. A complete list reflecting the current state is nevertheless reported at the end of the document.
EXECUTIVE SUMMARY
All the partners involved in WP9 put effort into integrating new methods and services as well as improving the existing ones. The result of this work is presented here. The main objective is to make the SoBigData platform rich in resources that are available to the public and usable through the various channels: in the local infrastructures through Transnational Access, or using the on-line platform through Virtual Access. In the latter case, the resources are listed in the Catalogue of the platform (see deliverable D9.1: “Social mining method and service integration 1”) and can be accessed in different forms: as downloadable items, executed on the cloud using the Lab (see D9.1), or as external web services.
1 RELEVANCE TO SOBIGDATA
The methods and services, together with the datasets, are the basic building blocks of SoBigData: they are the tools researchers can use to build their own analytical workflows. Accessibility is one of the key factors for the success of SoBigData, and all the partners are working on adding and improving the methods they share.
1.1 PURPOSE OF THIS DOCUMENT AND RELATION TO OTHER WORKPACKAGES
This deliverable describes the state of the art of the algorithmic resources available through both Transnational and Virtual Access. The concepts reported in this deliverable are related to:
• WP6 and WP7: as exploratories provide the scientific context of the calls for projects of the Transnational Access activities (WP6) and promote Virtual Access (WP7);
• WP8: for the definition of dataset metadata and data management plan;
• WP10: for the definition of E-infra;
• WP11: for the definition of exploratories.
1.2 STRUCTURE OF THE DOCUMENT
The first section describes the updates and the new methods integrated in the platform since those described in deliverable D9.1: “Social mining method and service integration 1”. A complete list of all methods and services, reflecting the current state of the art, is then reported.
2 RESEARCH INFRASTRUCTURE INTEGRATIONS AND UPDATES
Starting from the description of the methods and services in D9.1, this section provides an overview of all the new methods and services and of the updates to existing ones.
2.1 NEW METHODS AND SERVICES
2.1.1 DICTIONARY CREATOR
Creates a dictionary with inverse document frequency (idf) values from the Google NGrams dataset that can be used for relevance computation when no collection-wide statistics are available (e.g. during crawling). The code for this method has been publicly released. The method has been successfully used as part of the method for extracting event-centric collections from Web archives.
2.1.2 M-ATLAS WEB APPLICATION (NEW APPLICATION VRE)
The M-Atlas system is now an on-line analytical system integrated as a VRE Application (see Fig. 1). The system allows users to build and execute analytical processes using the available spatio-temporal tools for handling trajectory data and positional information (e.g. call data records). Users can store data and work in a completely private personal space directly on the cloud, and can share results or data with other users.
Figure 1. The On-line M-Atlas system accessible as Application VRE in SoBigData.
In detail, the M-Atlas on-line system provides the following data mining tools (most of them are also integrated in the Lab, or available as downloadable packages in the SoBigData platform, so that they can be executed separately from the system or locally):
• Trajectory Reconstruction Algorithm: reconstructs the trajectories (i.e. trips) of a user starting from the sequence of points in the user's history (e.g. GPS points coming from a device).
• T-Statistical: extracts several statistics from trajectories.
• T-Clustering: a density-based clustering algorithm equipped with several spatio-temporal distance functions.
• O/D Matrix: a tool to extract origin/destination matrices from a set of trajectories.
• Density Map: a tool to build a presence heatmap from a set of trajectories.
• Flows: a tool to build a map representing the predominant direction of trajectories in specific areas.
• T-Pattern: extracts spatio-temporal sequential patterns from a set of trajectories, representing the typical regions visited and the typical time intervals of the visits.
• Flocks: detects groups of people moving together in “formation” according to a set of constraints.
• Individual Call Profile Builder: for each user and each area under analysis, creates a profile of the user's call habits (ICP).
• Sociometer: uses the ICP and some expert knowledge to classify users into categories (e.g. residents, commuters, visitors, etc.).
• Mobility Network: extracts for each user a representation of the user's network of places and trips, starting from the user's trajectories.
Moreover, M-Atlas also provides a palette of spatio-temporal primitives for data exploration and visualization. The system is now in a beta testing phase and is open to the public (registration as a beta tester is required).
2.1.3 TCLUSTERING
A density-based clustering algorithm (i.e. OPTICS) equipped with spatio-temporal distance functions. It is integrated in the SoBigData platform to be executed in the cloud in the Lab. The algorithm has been extracted from the M-Atlas tools so that it can be executed separately on a remote database.
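The idea of pairing a density-based clusterer with a pluggable spatio-temporal distance can be sketched as follows. This is only an illustration: it uses plain DBSCAN as a simpler stand-in for OPTICS, and the distance function, the parameter values and the toy trajectories are all assumptions made for the example, not part of T-Clustering itself.

```python
from math import hypot


def st_distance(a, b):
    """Average point-wise Euclidean distance between two trajectories
    sampled at the same timestamps (one possible spatio-temporal distance)."""
    return sum(hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(a, b)) / len(a)


def dbscan(items, distance, eps, min_pts):
    """Plain DBSCAN with a pluggable distance. Returns one label per item (-1 = noise)."""
    labels = [None] * len(items)

    def neighbours(i):
        return [j for j in range(len(items)) if distance(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(len(items)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)
    return labels


def line(x0, y0):
    """Toy trajectory: five (x, y, t) samples moving along the x axis."""
    return [(x0 + t, y0, t) for t in range(5)]


# Hypothetical data: two bundles of nearby trips and one isolated trip.
trajectories = [line(0, 0), line(0, 0.5), line(0, 1.0),
                line(0, 50), line(0, 50.4), line(30, 200)]
labels = dbscan(trajectories, st_distance, eps=1.0, min_pts=2)
print(labels)
```

Swapping `st_distance` for another function (for example one that only compares start and end points) changes what "similar trips" means without touching the clustering code.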
2.2 UPDATE OF EXISTING METHODS AND SERVICES
2.2.1 "MAX & SAM" NETWORKS RECONSTRUCTION
This method [1][2] aims at reconstructing economic and financial networks, taking as input the node fluxes (e.g. assets and liabilities, exports and imports) as well as the total number of observed links. These inputs define the probability for any two banks to have a transaction, as well as the expected magnitude of the transaction itself. The method has been integrated into the VA (as Python code). The code can also be made available upon request as Matlab code. The reconstruction provided by our method has been compared with the performance of other similar algorithms. Remarkably, this “horserace” has highlighted that our method is “the clear winner” among the ensemble algorithms [3]. The method has recently been extended to the reconstruction of bipartite networks as well [4][5].
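To make the roles of the two inputs concrete, the sketch below uses a fitness-style connection probability that is common in the network-reconstruction literature. The functional form, the bisection calibration, the function names and the toy numbers are assumptions of this sketch and are not taken from this deliverable or from the released code. It also assumes strictly positive fluxes and a link count between 1 and n(n-1)-1.

```python
def calibrate_z(s_out, s_in, n_links, iterations=200):
    """Find z such that the expected number of links equals n_links (bisection)."""
    def expected_links(z):
        return sum(
            z * a * b / (1 + z * a * b)
            for i, a in enumerate(s_out)
            for j, b in enumerate(s_in)
            if i != j
        )

    low, high = 0.0, 1.0
    while expected_links(high) < n_links:
        high *= 2
    for _ in range(iterations):
        mid = (low + high) / 2
        if expected_links(mid) < n_links:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def reconstruct(s_out, s_in, n_links):
    """Return link probabilities p[i][j] and weights w[i][j] conditional on the link."""
    z = calibrate_z(s_out, s_in, n_links)
    total = sum(s_out)
    n = len(s_out)
    p = [[0.0] * n for _ in range(n)]
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-loops
            x = z * s_out[i] * s_in[j]
            p[i][j] = x / (1 + x)
            # Assumed weight rule: the expected weight p*w equals s_out*s_in/total.
            w[i][j] = s_out[i] * s_in[j] / (total * p[i][j])
    return p, w


# Hypothetical out-fluxes and in-fluxes of four nodes, and six observed links.
p, w = reconstruct([10, 5, 3, 2], [4, 6, 7, 3], n_links=6)
print(round(sum(map(sum, p)), 6))
```

In this formulation the node fluxes act as "fitness" values, while the single free parameter `z` is fixed by the observed link density.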
Figure 2. Performance analysis of the Max & Sam algorithm
The method has been evaluated and its performance characteristics have been analysed. All code and evaluation data have been publicly released.
2.2.2 POLARIZED DEBATES TRACKER
The method is presented in D9.1, in the section Polarized Societal Debates. The Polarized Debates Tracker method has been updated and is now fully integrated as downloadable software. A more detailed description according to the SoBigData.eu metadata is available in the Catalogue.
2.2.3 QUICKRANK
QuickRank is a tool for creating Learning to Rank methods in a scalable and efficient way. In D9.1 the method was partially integrated (up to 80%). The integration has since been finalized: the method is now fully integrated into the VRE as a tool hosted by the D4Science Infrastructure.
2.2.4 VISUAL TOPIC EXPLORATION (IGD)
Figure 3 shows the temporal topic evolution for the Sarrazin dataset (left) and for the dataset of Vis publications between 1990 and 1997. The Vis publications were used as a plausibility check for the algorithms, since we are far more familiar with the evolution of our own scientific field (the division into the subdisciplines of Scientific Visualization and Information Visualization is manifest in the corresponding topics). D9.1 introduced the first draft of the visual topic exploration service. The service has been updated by improving the visual design and by including controls for the temporal topic analysis pipeline. In particular, the input of the service is any document collection, which can be uploaded to the service as a bundle of text files via a web interface. A management console provides access to all collections previously uploaded by users.
Figure 3. An example of the Visual Topic Exploration presented in the SoBigData platform.
Internally the collections are stored on a file server. While the first draft (D9.1) represented a proof of concept, the controls of the processing steps now include:
• The ability to exchange algorithms or implementations for the actual topic modelling algorithm. Currently the service includes two implementations of Latent Dirichlet Allocation, provided by the Mallet Toolkit and Apache Spark.
• A more convenient parameter selection for choosing the time steps. The temporal topic analysis is based on a binning of documents along the time axis. It is now easier to define binnings that match the actual distribution of the publication dates.
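One simple way to obtain bins that follow the distribution of publication dates is equal-frequency binning, sketched below. This is an assumption about what "matching the distribution" could look like in practice; the service's actual binning controls are not described here, and the function name and dates are made up for the example.

```python
from datetime import date


def equal_frequency_bins(dates, n_bins):
    """Split the sorted dates into n_bins consecutive groups of (almost) equal size."""
    ordered = sorted(dates)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        end = start + size + (1 if b < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


# Hypothetical publication dates, sparse at first and dense towards the end.
publication_dates = [
    date(1990, 5, 1), date(1992, 3, 10), date(1995, 7, 4), date(1996, 1, 20),
    date(1996, 9, 9), date(1997, 2, 14), date(1997, 6, 30),
]
bins = equal_frequency_bins(publication_dates, 3)
for group in bins:
    print(group[0].isoformat(), "to", group[-1].isoformat(), "-", len(group), "documents")
```

Compared with fixed-width time steps, this keeps a similar number of documents in every bin, so sparse early periods get wider bins and dense periods get narrower ones.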
Apart from changes in the backend, we revised the visuals to improve readability, understanding and aesthetics. Individual topics, for lack of actual topic names, are now represented by a ranked list of the most important terms (a common approach), the overall distribution of documents, and the topic trend by curved edges.
2.2.5 STATISTICAL VALIDATION
The methodology filters a complex network on the basis of multiple hypothesis testing with respect to a configuration null model. The method was originally proposed in [6] and several variations (for example to unipartite networks) have recently been proposed. The method is now fully integrated in the platform and is also available as downloadable R software. Some theoretical refinements have been implemented. The present version handles both bipartite and unipartite (weighted) networks. Figure 4 shows an example application of the method to a weighted directed graph representing mobility fluxes between cities in Tuscany.
• A web page with detailed explanations and a tutorial can be found here: http://mathfinance.sns.it/statistical_validation/
• SoBigData Catalogue link: http://data.d4science.org/ctlg/ResourceCatalogue/statistical_validation
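The multiple-testing step of such a filter can be illustrated in isolation. The sketch below assumes that a p-value has already been computed for every link against the null model (that computation is not shown), and it applies two standard corrections, Bonferroni and Benjamini-Hochberg. Which correction the released R software uses is not stated here, and the link names, p-values and significance level are invented for the example.

```python
def bonferroni(pvalues, alpha=0.01):
    """Keep links whose p-value is below alpha divided by the number of tests."""
    threshold = alpha / len(pvalues)
    return {link for link, p in pvalues.items() if p < threshold}


def benjamini_hochberg(pvalues, alpha=0.01):
    """Keep the k smallest p-values, where k is the largest rank with p_(k) <= alpha*k/m."""
    ranked = sorted(pvalues.items(), key=lambda item: item[1])
    m = len(ranked)
    cutoff = 0
    for k, (_, p) in enumerate(ranked, start=1):
        if p <= alpha * k / m:
            cutoff = k
    return {link for link, _ in ranked[:cutoff]}


# Hypothetical p-values of four links under the null model.
pvalues = {
    ("Pisa", "Lucca"): 1e-6,
    ("Pisa", "Firenze"): 0.004,
    ("Siena", "Arezzo"): 0.0049,
    ("Lucca", "Siena"): 0.2,
}
strict = bonferroni(pvalues)
relaxed = benjamini_hochberg(pvalues)
print(sorted(strict))
print(sorted(relaxed))
```

Bonferroni is the more conservative choice and keeps fewer links; the false-discovery-rate correction retains more links at the cost of a controlled share of false positives.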
2.2.6 MAXIMUM-ENTROPY NETWORK RECONSTRUCTION
The methodology reconstructs bipartite networks from the knowledge of the nodes’ strengths only, via maximization of the entropy function. An application to fire sales spillover is also performed. The reference for this algorithm is D. Di Gangi, F. Lillo, D. Pirino, Assessing Systemic Risk Due to Fire Sales Spillover Through Maximum Entropy Network Reconstruction (2015), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2639178. The method now has a dedicated web page, and MATLAB software is available for download.
• A web page with detailed explanations and a tutorial can be found here: http://mathfinance.sns.it/network_reconstruction
• SoBigData Catalogue link: http://data.d4science.org/ctlg/ResourceCatalogue/maximum_entropy_network_reconstruction
2.2.7 TAIL-GRANGER CAUSALITY NETWORK RECONSTRUCTION
Given a set of time series, the methodology builds a network by inferring causality between rare events. The adopted method is Granger causality in tails, i.e. it tests whether an extreme event in one time series helps predict the occurrence of a future extreme event in another time series. An application to contagion risk is performed as well. The method now has a dedicated web page, and MATLAB software is available for download.
• A web page with detailed explanations and a tutorial can be found here: http://mathfinance.sns.it/network_reconstruction/
• SoBigData Catalogue link: http://data.d4science.org/ctlg/ResourceCatalogue/tail_granger_causality_network_construction
2.2.8 THE SEGREGATION DISCOVERY METHOD (SCUBE)
The SCube algorithm has been ported from a downloadable integration to a full integration as-a-service within the SoBigData architecture. The functionality of the method has been extended to cover the temporal analysis of segregation, and the method has been tested on data from both Italian and Estonian companies. A full description of the methodology and a case study are given in [7], and the implementation is described in [8]. A user manual is provided in the method description page of the SoBigData platform, and is also available at https://goo.gl/7tizyM.
Figure 4. Network of mobility fluxes between cities in Tuscany. On the left the original network, on the right the statistically filtered network, obtained removing the links which are compatible with a random graph with fixed strength distribution.
2.2.9 ANALYSING VOTING INTENT IN TWEETS
Social media sites continue to gain prominence in the public sphere, with increasing evidence of the pivotal role they have come to play in determining the direction of society. In this story we explore polarized societal debates to determine who is participating in them, what are the beliefs they hold, how do they influence each other, what arguments do they make and how their overall response to events can be helpfully characterized. In doing so, we provide insights into how interventions can help social media to support good decision-making, by better informing policy-makers and citizens, at a time when concerns have been raised that they are leading to bad decision-making (see for example http://www.nesta.org.uk/blog/fake-news-what-it-and-how-can-we-tackle-it).
The British in/out EU referendum provided a case study for the work. We developed a GATE application that classified Twitter users according to their Brexit vote intention. For this analysis we required a reliable method for finding samples of voters for the leave or remain option on the referendum. When developing our automated system, there were several key criteria that were considered:
• Having an accurate sample is more important than having a complete one;
• It should be possible to explain how individual results were reached;
• Systems should be based on explicitly stated intent by the user.
It may be possible to train a machine learning or statistical model for this task, but this is unsuitable in this case. Firstly, learned models, especially very sophisticated models, are difficult to interrogate in terms of explaining a particular decision. Secondly, it would be difficult or impossible to completely eliminate text features from these models which represent the general politics of voters holding a particular stance. Since this is one of the relationships we intend to study directly, we risk biasing our sample in a circular manner. To wit; if we inadvertently found leave voters on the basis of immigration stance, we would be in a poor position to determine how much leave voters care about immigration.
Instead, we determine vote intent using popular campaign hashtags. The hashtags themselves are usually fairly ambiguous, since most campaigns can be discussed referentially, particularly by critics. We observed, however, that where hashtags appeared at the very end of a tweet, they were more often used in a promotional manner. For a simple pair of examples, compare the (hypothetical) tweets “Can't believe #voteleave are campaigning in such an underhand way” and “Can't believe people really want to vote to stay #voteleave”. We consider a hashtag to be an indicator of intent where it appears at the end of a tweet, not followed by any contradictory hashtags but optionally followed by a URL. URLs must be allowed because tweets which quote-retweet another usually end in the URL of the source tweet. The hashtags used for this classification are shown in the table below.
Leave: #leave, #VoteLeave, #LeaveEU, #BritainOut, #no2eu, #notoeu, #beleave, #BetterOffOut, #voteleaveeu, #lexit, #Brexit supporter, #Brexitbrexit, #TakeControl, #NotAfraid, #leaveeu, #ukip, #takecontrol, #go, #betteroffout, #out, #voteukip, #voteout, #gogogo, #voteukip2016, #votetoleave, #euxit, #23rdjunevoteleaveeu
Remain: #remain, #DontWalkAway, #BetterOffIn, #bremain, #StrongerIn, #VoteRemain, #Votein, #INtogether, #StrongerIn, #labourin, #greenerin, #leadnotleave, #remainineu, #intogether, #ukineu, #brexitrisks, #labourinforbritain, #incampaign, #remaineu, #uktostay, #stayineu, #strongertogether, #saferin, #betterin
We used this method to determine the vote intent of users, not just tweets, allowing us to analyse other tweets from that user. We made three lists. In the first, we included users who had been found to support a particular stance at least three times. A total of 54,989 leave supporters and 41,118 remain supporters were discovered in this manner. The remaining lists included more users at the expense of precision: in the second, we included those users who supported a stance twice, and in the final list we included any user who had ever been found to support that stance. By providing data with varying levels of precision, we enable the end user to decide which is more appropriate to their needs.
In order to evaluate our success in classifying Twitter users according to their Brexit vote intent, we made use of the fact that a company called Brndstr published a poll in the days leading up to the referendum. The poll asked users how they intended to vote, in return for an icon modification for their Twitter account allowing them to display their allegiance (see Figure 5). The vote intent they communicated in this way can safely be regarded as ground truth. Tweets containing the formulaic Brndstr wording were gathered ("I #VoteIn [or #VoteOut] for the #Brexit #EURef vote with @Brndstr & unlocked my own Flag Profile pic! What will you vote? #ivoted"). This resulted in a corpus of 14216 vote declarations, after removing 588 duplicates (some users tweeted twice). Among this corpus, a certain number were found to overlap with those users we had already found allegiances for using our hashtag method. This sample proved large enough to enable a high quality evaluation to be performed.
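The end-of-tweet hashtag rule and the three user lists can be sketched as follows. This is an illustrative reading of the rule, not the GATE application itself: the hashtag sets are a small subset of the table, the handling of users who express both stances (taking the most frequent one) and the reading of "twice" as "at least twice" are assumptions, and the sample tweets are made up.

```python
from collections import Counter, defaultdict

# Small illustrative subset of the hashtag table, lower-cased.
LEAVE = {"#voteleave", "#leaveeu", "#betteroffout"}
REMAIN = {"#strongerin", "#voteremain", "#betteroffin"}


def tweet_intent(text):
    """Return "leave", "remain" or None for a single tweet."""
    tokens = text.split()
    # A single trailing URL is allowed (quote-retweets end with one).
    if tokens and tokens[-1].lower().startswith(("http://", "https://")):
        tokens.pop()
    # Inspect the run of hashtags that closes the tweet.
    sides = set()
    while tokens and tokens[-1].startswith("#"):
        tag = tokens.pop().lower().rstrip(".,!?")
        if tag in LEAVE:
            sides.add("leave")
        elif tag in REMAIN:
            sides.add("remain")
    # Contradictory trailing hashtags give no decision.
    return sides.pop() if len(sides) == 1 else None


def user_lists(tweets):
    """tweets: iterable of (user, text). Returns {min_count: {user: stance}}."""
    counts = defaultdict(Counter)
    for user, text in tweets:
        stance = tweet_intent(text)
        if stance:
            counts[user][stance] += 1
    lists = {3: {}, 2: {}, 1: {}}
    for user, per_stance in counts.items():
        stance, n = per_stance.most_common(1)[0]  # assumption: majority stance
        for min_count in lists:
            if n >= min_count:
                lists[min_count][user] = stance
    return lists


tweets = [
    ("alice", "Can't believe people really want to vote to stay #voteleave"),
    ("alice", "Time to go #LeaveEU https://example.org/t/1"),
    ("alice", "Out is out #BetterOffOut"),
    ("bob", "Can't believe #voteleave are campaigning in such an underhand way"),
    ("bob", "We are #StrongerIn"),
    ("carol", "Undecided #voteleave #strongerin"),
]
lists = user_lists(tweets)
print(lists)
```

Because every decision traces back to specific trailing hashtags, each classification can be explained by pointing at the tweets that triggered it, which is the interpretability criterion stated above.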
Figure 5. Brndstr Icon Modification
The table below gives the results. It shows that for our high precision list, in which three partisan tweets were needed to include that person on our list, an accuracy of 0.99 was obtained. Where two tweets are considered sufficient, an accuracy of 0.98 was obtained. Where one partisan tweet only has been used to classify a tweeter, accuracy drops to 0.94. This shows that even the lowest precision list has a high accuracy, and that the hashtag method produces an excellent result. In the table, "(E)" indicates counts relating to those voting to exit the EU, and "(R)" indicates data regarding remain voters. There was no notable difference in the accuracies for each class.
List   Found (E)   Correct (E)   Found (R)   Correct (R)   Accuracy   Cohen's Kappa
3      1142        1129          603         594           0.987      0.972
2      1368        1350          901         882           0.984      0.966
all    1935        1801          1744        1667          0.943      0.885
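The accuracy and Cohen's kappa columns can be recomputed from the four count columns. The sketch below assumes a two-class setting in which every incorrectly classified user actually belongs to the other class; under that assumption the computed values agree with the table to three decimals. The function and variable names are illustrative.

```python
def accuracy_and_kappa(found_e, correct_e, found_r, correct_r):
    """Accuracy and Cohen's kappa for a two-class (exit / remain) evaluation."""
    total = found_e + found_r
    observed = (correct_e + correct_r) / total
    # True class sizes, assuming each error belongs to the opposite class.
    true_e = correct_e + (found_r - correct_r)
    true_r = correct_r + (found_e - correct_e)
    # Agreement expected by chance from the marginal totals.
    expected = (found_e * true_e + found_r * true_r) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


rows = {
    "3": (1142, 1129, 603, 594),
    "2": (1368, 1350, 901, 882),
    "all": (1935, 1801, 1744, 1667),
}
results = {name: accuracy_and_kappa(*counts) for name, counts in rows.items()}
for name, (acc, kappa) in results.items():
    print(name, round(acc, 3), round(kappa, 3))
```

Kappa discounts the agreement that would be expected by chance given the class proportions, which is why it is lower than the raw accuracy in every row.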
Of course, for the final system, the Brndstr data was also added to the user lists, providing a high quality supplement for the more extensive data arrived at using the hashtag method, and meaning that the final system has an even higher accuracy than that reported here. As part of our work on analysing societal debates, we also implemented real-time visualisations over aggregated text analytics results. For instance, the figure below shows aggregate statistics of the topics discussed by leave and remain supporters respectively, in tweets about the British EU membership referendum. We used the vote intent classifier described above, in order to classify the tweet authors as leave or remain supporters. Afterwards, where possible, the topics of the tweets were detected automatically.
Figure 6. Brexit Analyser for the topic frequencies and geographical distribution
Twitter users were also geolocated automatically, where possible. This was based on either the latitude/longitude fields in their Twitter profiles (which are rarely present) or on the Location string in their profiles. The latter is often a place name that can then be disambiguated to a latitude/longitude coordinate. These were then mapped to NUTS regions, aggregated and visualised on a map. The figure on the left, in this case, shows the “swing” for each NUTS region, where blue means there are more Twitter users supporting remain in the Brexit referendum than leave ones. The darker the blue, the stronger the support. Conversely, red indicates stronger support for leaving the EU.
The aggregated data from the automated analysis was made available to BuzzFeed News, who published the findings in an article titled: “3 Million Brexit Tweets Reveal Leave Voters Talked About Immigration More Than Anything Else” https://www.buzzfeed.com/jamesball/3-million-brexit-tweets-reveal-leavevoters-talked-about-imm
The following two visualisations were created by BuzzFeed, based on our analysis. The first theme river shows five topics and the volume of debate around them, separated into Leave and Remain supporters. The second focuses specifically on immigration and allows for finer grained comparisons.
Figure 7. Topics distribution in time
Over 24 different text analytics services have now been integrated and made available to users of the SoBigData infrastructure. These span six different EU languages, and eleven services are specifically customised for social media mining. A subset of these is shown below. Further details can be found in [9].
Figure 8. Distribution of Tweets (Leave and Remain) in time
3 CONCLUSION: STATE OF THE ART
In this section, we concisely list all the methods and services that are provided by the project partners and/or integrated into the project’s RI. We took the list in D9.1, added the new methods described above, and updated links and descriptions where needed. In other words, the following list should be considered the state of the art of methods and services in SoBigData (concerning WP9).

Method Name: Urban Mobility Atlas (UMA)
Partner(s): SoBigData.it - KDD Lab - CNR
Summary: An interface able to visualize the results of the analyses computed by different tools.
Integration into RI/VRE: Integrated as External web service.
Usage in Stories: City of Citizens. A user front-end consisting of a dashboard showing the data and several statistics about the traffic of a city, broken down into incoming/outgoing and systematic/occasional flows.
URL: http://kdd.isti.cnr.it/uma2/?city=Pisa
Method Name: GeoTopics
Partner(s): AALTO
Summary: A system to explore geographical patterns of urban activity.
Integration into RI/VRE: In progress - code and demo shared publicly.
Usage in Stories: City of Citizens. Used to associate city regions with types of activity.
URL: http://mmathioudakis.github.io/geotopics/
Method Name: RWC Score
Partner(s): AALTO
Summary: A technique to measure the degree of polarization in a network.
Integration into RI/VRE: Code and demo shared publicly.
Usage in Stories: Polarized Societal Debates. Used to quantify controversy of #brexit discussions.
URL: https://github.com/gvrkiran/controversy-detection
Method Name: Digital DNA Fingerprinting
Partner(s): SoBigData.it - IIT - CNR
Summary: A technique that models the online behavior of Twitter users as strings of characters (digital DNA sequences). Such strings are then analyzed and compared.
Integration into RI/VRE: In progress. Will be integrated as External web service.
Usage in Stories: Polarized Societal Debates. A dashboard showing a broad set of results and visualizations resulting from the application of the digital DNA modeling technique.
URL: http://wafi.iit.cnr.it/fake/fake/evolution/dna/sequencer/
Method Name: Borders
Partner(s): SoBigData.it - KDD Lab - ISTI - CNR
Summary: A process to detect borders as a composition of a space partitioning. The tool generates a network between areas and finds communities among them.
Integration into RI/VRE: In progress.
Usage in Stories: City of Citizens. Used to find the real borders in Tuscany, showing how some areas are so interconnected that they could be considered a single municipality even if they are divided in the administrative records.
URL: http://www-kdd.isti.cnr.it/~trasarti/sobigdata.eu/borders/index.html
Method Name: Trip Builder
Partner(s): SoBigData.it - HPC - ISTI - CNR
Summary: TripBuilder is a mobile service helping tourists to build their own personalized sightseeing tour of a city. Given a targeted touristic area, the time available for the visit, and the tourist’s profile, TripBuilder provides its users with a time-budgeted tour that maximizes a tourist’s interests and takes into account both the time needed to enjoy the attractions and the time required to move from one Point of Interest (PoI) to the next one.
Integration into RI/VRE: Integrated as external web service.
Usage in Stories: City of Citizens. For this story, the tool gives city administrative staff a way to design tourist visits in a city in order to stimulate specific types of touristic visits, taking into account possible preferences and time spent. The preconfigured paths can give an idea of where the highest tourist affluence can be found.
URL: http://tripbuilder.isti.cnr.it/
Method Name: Sociometer
Partner(s): SoBigData.it - KDD Lab - ISTI - CNR
Summary: This tool is able to classify the users w.r.t. their call profiles built on top of call data records.
Integration into RI/VRE: Downloadable software (implemented in python using spark-context).
Usage in Stories: City of Citizens. It is used to study the impact of different categories of user in a city and to compute the commuter and tourist flows in Tuscany.
URL: http://www-kdd.isti.cnr.it/~trasarti/sobigdata.eu/sociometer/index.html
Method Name: Never Drive Alone
Partner(s): SoBigData.it - KDD Lab - ISTI - CNR
Summary: Constructs the network of potential carpooling for a community of systematic travelers.
Integration into RI/VRE: Downloadable software
Usage in Stories: City of Citizens.
Method Name: TagME
Partner(s): SoBigData.it - UNIPI
Summary: Entity discovery and linking in text.
Integration into RI/VRE: Full integration into the VRE as a web service. Hosted at D4Science infrastructure. Users migrated from former deployment.
Usage in Stories: Societal Debates and Monitoring Topics across Time and Space. Used to extract Wikipedia entities from the “Thilo Sarrazin” collection of documents. Entities can be used to retrieve geo-spatial and time information. It will later also be used to tag content related to polarized debates.
URL: https://tagme.d4science.org/tagme/
Method Name: SMAPH
Partner(s): SoBigData.it - UNIPI
Summary: Entity discovery and linking in queries, with state of the art quality.
Integration into RI/VRE: Full integration into the VRE as a web service. Hosted at D4Science infrastructure.
Usage in Stories: None
URL: https://sobigdata.d4science.org/group/smaph
Method Name: WAT
Partner(s): SoBigData.it - UNIPI
Summary: Entity discovery and linking in text, with state of the art quality.
Integration into RI/VRE: Full integration into the VRE as a web service. Hosted at D4Science infrastructure.
Usage in Stories: None
URL: https://tagme.d4science.org/tagme/
Method Name: SWAT
Partner(s): SoBigData.it - UNIPI
Summary: Entity salience in text.
Integration into RI/VRE: Full integration into the VRE as a web service. Hosted at D4Science infrastructure.
Usage in Stories: None
URL: https://tagme.d4science.org/tagme/
Method Name: MyWay
Partner(s): SoBigData.it - KDD Lab - ISTI - CNR
Summary: A trajectory prediction algorithm based on mobility profiles of users.
Integration into RI/VRE: Downloadable software (implemented in python).
Usage in Stories: City of Citizens. It is used to study the predictability of the movements in the area.
URL: http://www-kdd.isti.cnr.it/~trasarti/sobigdata.eu/myway/index.html
Method Name: Trajectory Reconstruction
Partner(s): SoBigData.it - KDD Lab - ISTI - CNR
Summary: A data pre-processing method to build trajectories from raw GPS observations according to spatiotemporal constraints.
Integration into RI/VRE: Full integration into the VRE as tools hosted at D4Science infrastructure.
Usage in Stories: City of Citizens. This is used as a basic step to transform the raw GPS data into the trajectories used by all the advanced tools.
URL: http://www-kdd.isti.cnr.it/~trasarti/sobigdata.eu/trajectorybuilder/index.html
Method Name: Mobility Profiles
Partner(s): SoBigData.it - KDD Lab - ISTI - CNR
Summary: Extracts the systematic movements of users from a set of their trajectories.
Integration into RI/VRE: Downloadable software (implemented in python).
Usage in Stories: City of Citizens. The mobility profiles are extracted to study the systematic and occasional parts of the traffic and as a base for prediction and car-pooling.
URL: http://www-kdd.isti.cnr.it/~trasarti/sobigdata.eu/mobilityprofile/index.html
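The Trajectory Reconstruction entry above describes building trajectories from raw GPS observations under spatiotemporal constraints. A minimal Python sketch of that idea follows. It is not the code of the catalogued tool: the `GpsPoint` record, the function names and the two thresholds (`max_gap_s`, `max_jump_m`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt
from typing import Iterable, List


@dataclass(frozen=True)
class GpsPoint:
    """Hypothetical raw observation: who, when (seconds), where (degrees)."""
    user_id: str
    timestamp: float
    lat: float
    lon: float


def haversine_m(a: GpsPoint, b: GpsPoint) -> float:
    """Great-circle distance between two points, in metres."""
    earth_radius_m = 6371000.0
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(h))


def build_trajectories(
    points: Iterable[GpsPoint],
    max_gap_s: float = 1200.0,   # assumed temporal constraint
    max_jump_m: float = 5000.0,  # assumed spatial constraint
    min_points: int = 2,
) -> List[List[GpsPoint]]:
    """Cut each user's time-ordered points whenever a constraint is violated."""
    trajectories: List[List[GpsPoint]] = []
    current: List[GpsPoint] = []
    for p in sorted(points, key=lambda q: (q.user_id, q.timestamp)):
        if current:
            prev = current[-1]
            new_user = p.user_id != prev.user_id
            too_late = p.timestamp - prev.timestamp > max_gap_s
            too_far = haversine_m(prev, p) > max_jump_m
            if new_user or too_late or too_far:
                # Close the running trajectory; drop it if it is too short.
                if len(current) >= min_points:
                    trajectories.append(current)
                current = []
        current.append(p)
    if len(current) >= min_points:
        trajectories.append(current)
    return trajectories
```

With these assumed defaults, a user whose observations pause for more than 20 minutes is split into two trajectories at the pause.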
Method Name: O/D Matrix
Partner(s): SoBigData.it - KDD Lab - ISTI - CNR
Summary: Builds an origin/destination matrix from the trajectories and a partitioning of the area.
Integration into RI/VRE: Integrated as External web service (included in UMA).
Usage in Stories: City of Citizens. As user front-end in the UMA dashboard showing the incoming/outgoing, systematic/occasional flows in Pisa w.r.t. Tuscany.
URL: http://kdd.isti.cnr.it/uma2/?city=Pisa
Method Name: Privacy Risk Evaluation
Partner(s): SoBigData.it - KDD Lab - ISTI - CNR
Summary: A methodology to evaluate the risk of re-identification of the user in a dataset.
Integration into RI/VRE: Downloadable software (implemented in python using spark-context).
Usage in Stories: City of Citizens. The methodology is instantiated to be applied in the Sociometer over the call profiles.
URL: http://www-kdd.isti.cnr.it/~trasarti/sobigdata.eu/privacyrisk/index.html
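The O/D Matrix entry above builds an origin/destination matrix from trajectories and a partitioning of the area. The following Python sketch shows one simple way to do this, counting one trip per trajectory from the zone of its first point to the zone of its last point. The function name `od_matrix` and the `zone_of` callback (which stands in for the area partitioning) are hypothetical and do not come from the UMA service.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, Sequence, Tuple

Point = Tuple[float, float]  # (lat, lon)


def od_matrix(
    trajectories: Iterable[Sequence[Point]],
    zone_of: Callable[[Point], Hashable],
) -> Counter:
    """Count trips per (origin zone, destination zone) pair.

    `zone_of` is an assumed stand-in for the area partitioning: it maps
    a point to the identifier of the zone that contains it.
    """
    counts: Counter = Counter()
    for traj in trajectories:
        if len(traj) < 2:
            continue  # a single observation has no origin/destination pair
        counts[(zone_of(traj[0]), zone_of(traj[-1]))] += 1
    return counts


# Toy partition with two zones split along a meridian (illustrative only).
def toy_zone(point: Point) -> str:
    return "west" if point[1] < 10.5 else "east"


example = od_matrix(
    [[(43.7, 10.4), (43.7, 10.6)], [(43.7, 10.6), (43.7, 10.4)]],
    toy_zone,
)
```

A sparse counter keyed by zone pairs is used here instead of a dense array because most zone pairs in a fine partition are typically never connected by a trip.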
Method Name: Web Archive Collection Extraction
Partner(s): LUH
Summary: Creation of document collections about past events from Web Archives.
Integration into RI/VRE: Code shared publicly
Usage in Stories: Monitoring Topics across Time and Space. Creation of document collections from Web Archives, e.g. on Thilo Sarrazin. Later for any event document collection.
URL: http://data.d4science.org/ctlg/ResourceCatalogue/web-archive-collections
Method Name: GATECloud
Partner(s): USFD
Summary: Run any GATE application remotely on the cloud, in parallel. Includes existing GATE pipelines such as YODIE for named entity disambiguation, TwitIE for information extraction within Tweets, and many others.
Integration into RI/VRE: Integrated as an external web service.
Usage in Stories: Polarized Societal Debates. Provides the text processing functionality for the story, by hosting the political data mining GATE pipeline.
URL: https://gatecloud.net/
Method Name: GATE Brexit Pipeline
Partner(s): USFD
Summary: Text processing pipeline designed for tweets related to the UK European Union membership referendum, but adaptable to other political events. Includes information extraction, topic and sentiment detection, tweet geolocation, user classification and vote intent classification.
Integration into RI/VRE: Integrated through the GATECloud integration.
Usage in Stories: The computation used for the text-driven part of the analysis for the polarised societal debates story.
URL: http://demos.gate.ac.uk/sobigdata/brexit/
Method Name: Polarized Debates Tracker
Partner(s): SoBigData.it - IIT-CNR
Summary: Detects polarised topics. It provides an iterative classification of users and keywords: first, polarized users are identified, then polarized keywords are discovered by monitoring the activities of previously classified users. This method thus makes it possible to track users and topics over time. The algorithm is written in Python.
Integration into RI/VRE: Integrated as downloadable software.
Usage in Stories: Polarized Societal Debates.
Method Name: Egonetworks
Partner(s): SoBigData.it - IIT-CNR
Summary: Ego-network analysis technology. The method extracts ego networks of a given set of Twitter users (provided by the method user). For each user, which is an ego, it divides the set of social relationships (alters) according to groups of decreasing intimacy with the ego. The method is based on the analysis of the patterns of interactions, and implements the approach described in https://doi.org/10.1016/j.osnem.2017.04.001
Integration into RI/VRE: External Services hosted at IIT-CNR.
Usage in Stories: Polarized Societal Debates.
Method Name: WikiData Geo Mapper
Partner(s): LUH
Summary: Mapping of extracted named entities to geolocations based on Wikipedia.
Integration into RI/VRE: Code shared publicly
Usage in Stories: Mapping of extracted named entities of the test dataset to geolocations for geovisualizations.
URL: http://data.d4science.org/ctlg/ResourceCatalogue/wikidata_geo_mapper
Method Name: GeoVIS / GeoVA
Partner(s): Fraunhofer
Summary: 1. Detection of personal and public places, see details in http://dx.doi.org/10.1177/1473871615581216 . 2. Visual exploration of events in space and time, see details in http://geoanalytics.net/vam/index.html (chapter 6).
Integration into RI/VRE: Downloadable software (implemented in Java).
Usage in Stories: 1. Detection of personal and public places in Pisa and Florence for the “City of Citizens” story; 2. Spatiotemporal exploration of documents for the Thilo Sarrazin collection of documents.
URL: http://geoanalytics.net/vam/index.html
Method Name: "Max & Sam" networks reconstruction
Partner(s): SoBigData.it - IMT
Summary: Reconstruction of the adjacency matrix of a network from partial information.
Integration into RI/VRE: Downloadable software (implemented in Matlab) on a webpage (light integration).
Usage in Stories: Societal Well-Being and Economic Performance.
URL: https://ckansobigdata2.d4science.org/dataset/maxandsam_network_recontruction_method
Method Name: Systemic risk estimation via DebtRank indicator
Partner(s): SoBigData.it - IMT
Summary: Estimation of systemic risk in financial networks through the DebtRank indicator.
Integration into RI/VRE: Downloadable software (implemented in Matlab) on a webpage (light integration).
Usage in Stories: Societal Well-Being and Economic Performance.
URL: https://ckansobigdata2.d4science.org/dataset/debtrank_systemic_risk_estimation_method
Method Name: Entropy measures for poverty
Partner(s): SoBigData.it - CNR and UNIPI
Summary: Computes the correlation between mobility and the poverty measures.
Integration into RI/VRE: Light integration as a Web page
Usage in Stories: Societal Well-Being and Economic Performance. Second hypothesis of the story.
URL: http://sobigdata.ee/tutorials/tutorial-hypothesis-2.html
Method Name: Nowcasting GDP
Partner(s): SoBigData.it - CNR and UNIPI
Summary: Provides estimates of GDP and well-being from human shopping behavior.
Integration into RI/VRE: Downloadable software.
Usage in Stories: Societal Well-Being and Economic Performance. Second hypothesis of the story.
URL: http://sobigdata.ee/tutorials/tutorial-hypothesis-2.html
Method Name: Systemic risk via Maximum-Entropy network reconstruction
Partner(s): SoBigData.it - SNS
Summary: Maximum-entropy network reconstruction using different ensembles, and estimation of the systemic risk due to fire-sale spillovers in a set of financial institutions and asset classes.
Integration into RI/VRE: Dedicated web page and downloadable software.
Usage in Stories: Societal Well-Being and Economic Performance.
URL: http://data.d4science.org/ctlg/ResourceCatalogue/maximum_entropy_network_reconstruction
Method Name: Network construction via tail Granger-causality
Partner(s): SoBigData.it - SNS
Summary: Starting from a set of time series, the method builds a network of causality relations among rare events. An application to systemic risk is also provided.
Integration into RI/VRE: Dedicated web page and downloadable software.
Usage in Stories: Societal Well-Being and Economic Performance.
URL: http://data.d4science.org/ctlg/ResourceCatalogue/tail_granger_causality_network_construction
Method Name: Statistical Validation
Partner(s): SoBigData.it - SNS
Summary: Given a unipartite or bipartite network, the method filters it to its backbone structure via multiple hypothesis testing.
Integration into RI/VRE: Full integration as a method in the platform.
Usage in Stories: City of Citizens. Cross-story.
URL: http://data.d4science.org/ctlg/ResourceCatalogue/statistical_validation
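The Statistical Validation entry above filters a network to its backbone via multiple hypothesis testing. The sketch below illustrates one common form of that idea for a bipartite network, in the spirit of [6]: each pair of nodes is tested against a hypergeometric null for the number of neighbours they share, and the significance level is Bonferroni-corrected for the number of pairs tested. Whether the catalogued method uses exactly this null model and correction is an assumption, and the function names are hypothetical. The sketch also ignores the degree stratification of the opposite node set.

```python
from itertools import combinations
from math import comb
from typing import Dict, Hashable, List, Set, Tuple


def hypergeom_sf(k: int, n_items: int, deg_a: int, deg_b: int) -> float:
    """P(X >= k) when deg_b items are drawn from n_items, deg_a of them marked."""
    upper = min(deg_a, deg_b)
    total = comb(n_items, deg_b)
    tail = sum(
        comb(deg_a, x) * comb(n_items - deg_a, deg_b - x)
        for x in range(k, upper + 1)
    )
    return tail / total


def validated_links(
    neighbours: Dict[Hashable, Set[Hashable]],
    n_items: int,
    alpha: float = 0.01,
) -> List[Tuple[Hashable, Hashable, float]]:
    """Keep node pairs whose shared-neighbour count is statistically significant.

    `neighbours` maps each node of one side of the bipartite network to
    the set of nodes it is linked to on the other side (size `n_items`).
    """
    pairs = list(combinations(sorted(neighbours), 2))
    # Bonferroni correction: one test per candidate pair (assumed choice).
    threshold = alpha / len(pairs) if pairs else alpha
    kept = []
    for a, b in pairs:
        shared = len(neighbours[a] & neighbours[b])
        if shared == 0:
            continue  # no co-occurrence, nothing to validate
        p = hypergeom_sf(shared, n_items, len(neighbours[a]), len(neighbours[b]))
        if p < threshold:
            kept.append((a, b, p))
    return kept
```

The exact tail is computed with integer binomials, which is fine for small examples; a large network would more likely rely on a statistics library and possibly a less conservative correction.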
Method Name: Credit risk scoring
Partner(s): UT
Summary: Evaluates the credit risks of companies as a probability of default.
Integration into RI/VRE: Light integration as a Web page.
Usage in Stories: Societal Well-Being and Economic Performance. First hypothesis of the story.
URL: http://sobigdata.ee/tutorials/tutorial-hypothesis-1.html
Method Name: SCube - Segregation discovery
Partner(s): SoBigData.it - UNIPI
Summary: Provides an OLAP cube of segregation indexes of social groups in company boards.
Integration into RI/VRE: SCube is integrated as-a-Service in the SoBigData Infrastructure.
Usage in Stories: Societal Well-Being and Economic Performance.
URL: http://data.d4science.org/ctlg/ResourceCatalogue/scube
Method Name: QuickRank
Partner(s): SoBigData.it - HPC Lab - ISTI - CNR
Summary: QuickRank is an efficient Learning to Rank toolkit providing multithreaded C++ implementation of several algorithms: GBRT, LambdaMART, Oblivious GBRT / LambdaMART, CoordinateAscent, LineSearch, RankBoost.
Integration into RI/VRE: QuickRank is integrated into the VRE as a tool hosted at D4Science infrastructure.
URL: http://quickrank.isti.cnr.it/
Method Name: Dictionary Creator
Partner(s): LUH
Summary: Creates a dictionary with inverse document frequency (idf) values from the Google NGrams dataset that can be used for relevance computation when no collection-wide statistics are available (e.g. during crawling).
Integration into RI/VRE: Code is shared publicly
URL: http://data.d4science.org/ctlg/ResourceCatalogue/dictionary_creator
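The Dictionary Creator entry above produces a dictionary of inverse document frequency (idf) values for later relevance computation. As a minimal sketch of the underlying formula, assuming the input has already been reduced to a mapping from each term to the number of documents containing it plus the total number of documents, the dictionary can be built as follows. The function names, the input format and the plain `log(N / df)` variant of idf are assumptions; the tool's actual handling of the Google NGrams files is not shown.

```python
import math
from typing import Dict, Iterable, Mapping


def build_idf_dictionary(doc_freq: Mapping[str, int], total_docs: int) -> Dict[str, float]:
    """Return idf(term) = log(total_docs / doc_freq[term]) for each term."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    return {
        term: math.log(total_docs / df)
        for term, df in doc_freq.items()
        if 0 < df <= total_docs  # skip unusable or inconsistent counts
    }


def relevance(terms: Iterable[str], idf: Mapping[str, float], default_idf: float = 0.0) -> float:
    """Score a bag of terms by summing their idf values (illustrative only)."""
    return sum(idf.get(t, default_idf) for t in terms)
```

Because the dictionary is computed once from a background corpus, a crawler can score a page it has just fetched without knowing anything about the rest of the collection.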
Method Name: M-Atlas Web - On-line Analytical tool (Beta)
Partner(s): SoBigData.it - KDD Lab - ISTI - CNR
Summary: M-Atlas is centered on the concept of a trajectory but is able to handle other kinds of data such as positional data (e.g. Call Data Records). The mobility knowledge discovery process can be specified by M-Atlas queries that realize data transformations, data-driven estimation of the parameters of the mining methods, the quality assessment of the obtained results, the quantitative and visual exploration of the discovered behavioral patterns and models, the composition of mined patterns, models and data with further analyses and mining, and the incremental mining strategies to address scalability. M-Atlas has mechanisms for mining trajectory patterns and models that, in turn, can be stored and queried. The system is also equipped with a querying and mining language that makes this analytical process possible and provides the mechanisms to master the complexity of transforming raw GPS tracks into mobility knowledge.
Integration into RI/VRE: On-Line access as Application
URL: https://sobigdata.d4science.org/group/m-atlas/m-atlas
Method Name: TClustering
Partner(s): SoBigData.it - KDD Lab - ISTI - CNR
Summary: A Trajectory Clustering algorithm equipped with spatio-temporal distance functions. The algorithm is based on OPTICS and can be executed in the cloud, connecting it directly to a remote database where the trajectories are stored (as PostGIS LineString).
Integration into RI/VRE: Integrated in the Lab
URL: http://data.d4science.org/ctlg/ResourceCatalogue/matlas_-_optics_algorithm
REFERENCES
[1] Systemic risk analysis in reconstructed economic and financial networks, G. Cimini, T. Squartini, D. Garlaschelli, A. Gabrielli, Sci. Rep. 5 (15758) (2015).
[2] Network reconstruction via density sampling, T. Squartini, G. Cimini, A. Gabrielli, D. Garlaschelli, App. Netw. Sci. 2 (3) (2017).
[3] The missing links: A global study on uncovering financial network structures from partial data, K. Anan et al. Working Paper Series, No 51/July 2017. https://www.esrb.europa.eu/pub/pdf/wp/esrbwp51.en.pdf?a39208bc8178b388fccee8012a4470d3
[4] Enhanced capital-asset pricing model for the reconstruction of bipartite financial networks, T. Squartini, A. Almog, G. Caldarelli, I. van Lelyveld, D. Garlaschelli, G. Cimini, arXiv:1606.07684 (accepted for publication on Physical Review E) (2017).
[5] Extracting Event-Centric Document Collections from Large Scale Web Archives, G. Gossen, E. Demidova, T. Risse. Conference on Theory and Practice of Digital Libraries (TPDL), 2017.
[6] Statistically validated networks in bipartite complex systems, M. Tumminello, S. Miccichè, F. Lillo, J. Piilo, R. N. Mantegna. PLoS ONE 6(3): e17994. doi:10.1371/journal.pone.0017994 (2011).
[7] Segregation Discovery in a Social Network of Companies, A. Baroni, S. Ruggieri. Journal of Intelligent Information Systems, 2017. Online First: 5 September 2017. DOI: 10.1007/s10844-017-0485-0.
[8] SCube: A Tool for Segregation Discovery, A. Baroni, S. Ruggieri. arXiv:1709.08348. Accessible from https://arxiv.org/abs/1709.08348.
[9] A framework for real-time semantic social media analysis, D. Maynard, I. Roberts, M. Greenwood, D. Rout, K. Bontcheva. Journal of Web Semantics. In press. https://doi.org/10.1016/j.websem.2017.05.002