Big Data Integration for eGovernment Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland
ICEDEG 2017 Quito, April 20, 2017
Instant Quiz
• 3 Vs of Big Data?
• Wrapper-Mediator?
• Hadoop?
• Spark?
eXascale Infolab (XI) • New lab @ U. of Fribourg–Switzerland • Big Data infrastructures for social / semantic / scientific data (… mostly) ©photopulse.ch
http://exascale.info/
Exascale Data Deluge
➡ New data formats
➡ New machines
➡ Peta & exa-scale datasets
• Web companies – Google – eBay – Yahoo
• Science – Biology – Astronomy – Remote Sensing
• Financial services, retail companies, governments, etc.
➡ Obsolescence of traditional information infrastructures
© Wired 2009
How can Big Data help eGovernment?
Context: The n-Vs of Big Data • Volume • amount of data
• Velocity • speed of data in and out
• Variety • range of data types and sources [Gartner 2012] "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization"
A Quick Look at
(Too) many different sources & formats! Difficult to find the right piece of data Difficult to merge different datasets
Fundamentally, a Big Data Integration problem
On the Menu Today
1. Entity-Centric Data Integration
2. The XI Pipeline to integrate textual data
3. Use-case 1: ArmaTweet
4. Use-case 2: Dependency-Driven Analytics
Information Management
• The story so far:
– Strict separation between unstructured and structured data management infrastructures
[Figure labels: Keywords, HTTP, Inverted Index (unstructured side); SQL, JDBC, DBMS (structured side)]
The 3rd V: Data Integration
• Data integration is still one of the biggest CS problems out there (according to many, e.g., Gartner)
• Integration typically requires some sort of mediation
1. Unstructured data: keywords, synsets
2. Structured data: global schema, transitive closure of schemas
⇒ Nightmarish if 1 and 2 are taken separately, a horror marathon if considered together
Entities as Mediation • Rising paradigm – Store information at the (semantic) entity granularity – Integrate information by inter-linking entities
• Advantages? – Coarser granularity compared to keywords • More natural, e.g., brain functions similarly (or is it the other way around?)
– Denormalized information compared to RDBMSs • Schema-later, heterogeneity, sparsity • Pre-computed joins, “Semantic” linking
• Drawbacks?
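As a minimal sketch of what "integrating information by inter-linking entities" can look like, the snippet below merges records from different sources once they are connected by sameAs-style links. The record identifiers, attributes, and the `integrate` helper are hypothetical and chosen only for illustration; a union-find structure stands in for whatever linking machinery a real system would use.

```python
from collections import defaultdict


def integrate(records, same_as):
    """Merge per-source records into entities by following sameAs links."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_as:
        parent[find(a)] = find(b)

    # Schema-later: every entity simply accumulates whatever attributes exist.
    entities = defaultdict(lambda: defaultdict(set))
    for rec_id, attrs in records.items():
        root = find(rec_id)
        for key, value in attrs.items():
            entities[root][key].add(value)
    return {
        root: {key: sorted(values) for key, values in attrs.items()}
        for root, attrs in entities.items()
    }


# Toy records from two hypothetical sources.
records = {
    "dbpedia:Bern": {"label": "Bern", "country": "Switzerland"},
    "stats:city_351": {"label": "Berne", "language": "fr"},
    "dbpedia:Quito": {"label": "Quito"},
}
same_as = [("dbpedia:Bern", "stats:city_351")]
merged = integrate(records, same_as)
```

Here the two Bern records collapse into one sparse, denormalized entity, while Quito stays on its own; no global schema has to be agreed on beforehand.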
Entity-Centric Data Integration Higher-level apps
Knowledge Graph
Integrating Textual Data
• The XI Pipeline
– From text to (contextualized) entities
– Stages: Mention Extraction: NER (2.1), Entity Linking (2.2), Entity Typing (2.3), Co-Ref Resolution (2.4)
• Runs on massive amounts of data (Spark)
2.1 Named Entity Recognition (NER)
[Figure: NER pipeline. Box labels: Text extraction (Apache Tika); List of extracted n-grams; n-gram Indexing; foreach; Candidate Selection; POS Tagging; Lemmatization; frequency reweighting; n+1 grams merging; List of selected n-grams; Feature extraction; Features; Supervised Classifier; Ranked list of n-grams]
Roman Prokofyev, Gianluca Demartini, Philippe Cudré-Mauroux: Effective named entity recognition for idiosyncratic web collections. WWW 2014: 397-408
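The candidate-selection idea behind the NER step can be sketched roughly as follows: extract n-grams, count them across the collection, and keep the frequent ones that do not start or end with a stopword. This is a deliberately simplified stand-in, not the published method: the stopword list, thresholds, and function names are assumptions, and the real pipeline additionally uses POS tagging, lemmatization, and a supervised classifier.

```python
from collections import Counter

# Hypothetical, tiny stopword list for the example.
STOPWORDS = {"the", "with", "at", "of", "a"}


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_ngrams(documents, max_n=2, min_count=2):
    """Count n-grams over all documents and keep frequent, well-formed ones."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                # Drop n-grams that begin or end with a stopword.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[gram] += 1
    return {" ".join(g): c for g, c in counts.items() if c >= min_count}


docs = [
    "entity linking with the crowd",
    "entity linking at web scale",
    "the crowd helps",
]
candidates = candidate_ngrams(docs)
```

On this toy collection the bigram "entity linking" survives as a candidate, whereas "the crowd" is discarded because it starts with a stopword.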
2.2 Entity Linking
• Linking entities to text is an old problem… – … and is extremely hard, esp. for machines
• Dozens of approaches have been suggested • What if – We want to combine approaches / frameworks? – We want to leverage both human computations & algorithms?
ZenCrowd • Three-stage blocking approach for integrating textual data & entities 1. Cheap inverted index selection of candidates 2. Expensive similarity measures 3. Crowdsource low confidence matches
• Uses dynamic templating to create micro-matching-tasks and publish them on MTurk • Combines both algorithmic and human matchers using probabilistic networks
Gianluca Demartini, Djellel Eddine Difallah, Philippe Cudré-Mauroux: ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. WWW 2012: 469-478
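The three-stage blocking idea can be illustrated with a small sketch: a cheap inverted-index lookup proposes candidates, a more expensive similarity measure scores them, and anything below a confidence threshold is routed to the crowd. The entity URIs, the 0.9 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions made for the example; they are not taken from ZenCrowd itself.

```python
from difflib import SequenceMatcher


def build_index(entities):
    """Stage 0: token -> set of entity URIs (a tiny inverted index)."""
    index = {}
    for uri, label in entities.items():
        for token in label.lower().split():
            index.setdefault(token, set()).add(uri)
    return index


def link_mention(mention, entities, index, accept=0.9):
    """Return (uri, route) where route is 'auto', 'crowd', or 'no_candidate'."""
    # Stage 1: cheap candidate selection through the inverted index.
    candidates = set()
    for token in mention.lower().split():
        candidates |= index.get(token, set())
    if not candidates:
        return None, "no_candidate"

    # Stage 2: expensive similarity, computed only on the candidates.
    scored = sorted(
        (
            (SequenceMatcher(None, mention.lower(), entities[uri].lower()).ratio(), uri)
            for uri in candidates
        ),
        reverse=True,
    )
    best_score, best_uri = scored[0]

    # Stage 3: low-confidence matches become crowdsourcing micro-tasks.
    route = "auto" if best_score >= accept else "crowd"
    return best_uri, route


# Hypothetical entity labels.
entities = {"ex:Bern": "Bern", "ex:Canton_of_Bern": "Canton of Bern", "ex:Quito": "Quito"}
index = build_index(entities)
exact = link_mention("Bern", entities, index)
fuzzy = link_mention("city of Bern", entities, index)
missing = link_mention("Lausanne", entities, index)
```

The exact mention is linked automatically, the ambiguous one is handed to the crowd, and the unknown one never reaches the expensive stage.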
ZenCrowd Architecture
[Figure: Input: HTML Pages. ZenCrowd components: Entity Extractors, Algorithmic Matchers, LOD Index (Get Entity), Decision Engine (Probabilistic Network), Micro-Task Manager (Micro Matching Tasks). External: Crowdsourcing Platform (Workers Decisions), LOD Open Data Cloud. Output: HTML + RDFa Pages]
Probabilistic Inference • Probabilistic network to integrate a priori & a posteriori information – Agreement of good turkers & algorithms • Learning process
– Constraints • Unicity • Equality (SameAs)
– Giant probabilistic graph • Instantiated selectively
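[Figure: factor graph with worker nodes w1, w2 and their priors pw1( ), pw2( ); click variables c11, c12, c13, c21, c22, c23; link variables l1, l2, l3 with priors pl1( ), pl2( ), pl3( ); link factors lf1( ), lf2( ), lf3( ); a SameAs factor sa1-2( ); and a unicity factor u2-3( )]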
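To give a feel for how a priori and a posteriori information can be combined, here is a much simplified sketch: a single candidate link, a prior from the algorithmic matchers, and independent worker votes weighted by each worker's estimated reliability. It is a naive-Bayes style approximation under an independence assumption, not the factor graph of the figure; it ignores the unicity and SameAs constraints, and the function name and numbers are made up for illustration.

```python
def link_posterior(prior, votes):
    """Posterior probability that a candidate link is correct.

    prior: a priori probability from the algorithmic matchers.
    votes: list of (says_correct, reliability) pairs, one per worker, where
           reliability is the probability that this worker answers correctly.
    """
    p_true, p_false = prior, 1.0 - prior
    for says_correct, reliability in votes:
        if says_correct:
            p_true *= reliability
            p_false *= 1.0 - reliability
        else:
            p_true *= 1.0 - reliability
            p_false *= reliability
    # Normalize over the two hypotheses (link correct / link incorrect).
    return p_true / (p_true + p_false)


# One reliable worker agrees; two mediocre workers disagree with each other.
posterior = link_posterior(0.5, [(True, 0.9), (True, 0.6), (False, 0.6)])
```

The two mediocre workers cancel out, so the posterior equals the reliable worker's reliability; this is the sense in which the agreement of good workers dominates the decision.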
Does it Work?
• Improves avg. precision by 0.14 on average!
– Minimal crowd involvement
– Embarrassingly parallel problem
[Figures: (a) Worker Precision vs. Number of Tasks; (b) Precision vs. Top K workers. Legend entries: Top US Worker, US Workers, IN Workers]
2.3 Entity Typing
• Entities can have many types (facets)
• Which fine-grained types are most relevant given the context?
[Figure: example types for one entity: Thing; Agent; Person; Living People; American Billionaires; American Philanthropist; American Computer Programmers; American People of Scottish Descent; Harvard University People; Windows People; People from King County; People from Seattle]
TRank
• Fine-grained Typing
• Tree of 447’260 types
• Rooted on
• Depth of 19
• Ranks relevant types by analyzing the context
– Textual context
– Graph context
– Decision trees
– Linear regression
Alberto Tonon, Michele Catasta, Roman Prokofyev, Gianluca Demartini, Karl Aberer, Philippe Cudré-Mauroux. Contextualized Ranking of Entity Types Based on Knowledge Graphs. JWS 2016.
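A rough sketch of context-aware type ranking is shown below: each candidate type is scored by how many co-occurring entities share it (a stand-in for graph context), with ties broken in favor of deeper, more specific types. The hand-written scoring function, the toy hierarchy, and the context sets are all assumptions for illustration; the published approach learns its ranking (decision trees, linear regression) rather than using a fixed rule like this one.

```python
def depth(type_name, parent):
    """Number of edges from a type up to the root of the type tree."""
    d = 0
    while type_name in parent:
        type_name = parent[type_name]
        d += 1
    return d


def rank_types(entity_types, context_types, parent):
    """Order an entity's types by context support, then by specificity."""
    def score(t):
        support = sum(1 for types in context_types if t in types)
        return (support, depth(t, parent))

    return sorted(entity_types, key=score, reverse=True)


# Toy fragment of a type tree (child -> parent); the edges are illustrative.
parent = {
    "Agent": "Thing",
    "Person": "Agent",
    "American Billionaires": "Person",
    "American Computer Programmers": "Person",
}
entity_types = [
    "Thing",
    "Person",
    "American Billionaires",
    "American Computer Programmers",
]
# Types of two other entities mentioned in the same (hypothetical) text.
context_types = [
    {"Person", "American Computer Programmers"},
    {"Person", "American Computer Programmers"},
]
ranked = rank_types(entity_types, context_types, parent)
```

In a text about programmers, the programmer type is ranked first and the generic root type last, even though all four types are valid for the entity.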
2.4 Co-Reference Resolution
• Better co-reference resolution through the knowledge graph, linking & typing Barack Obama called Angela Merkel last week; the president asked the chancellor whether…
Roman Prokofyev, Alberto Tonon, Michael Luggen, Loic Vouilloz, Djellel Eddine Difallah, and Philippe Cudre-Mauroux: SANAPHOR: Ontology-Based Coreference Resolution. ISWC 2015.
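The example sentence suggests how linking and typing can help: once "Barack Obama" and "Angela Merkel" are linked and typed, a definite noun phrase such as "the president" can be resolved to the only entity carrying that type. The sketch below assumes a hypothetical `entity_types` lookup with hand-picked types; in practice these would come from the knowledge graph.

```python
def resolve(noun_phrase, linked_entities, entity_types):
    """Resolve a noun phrase to the unique linked entity whose types contain its head noun."""
    head = noun_phrase.lower().split()[-1]
    matches = [e for e in linked_entities if head in entity_types.get(e, set())]
    # Only commit when exactly one entity fits; otherwise stay undecided.
    return matches[0] if len(matches) == 1 else None


# Hypothetical types for the two entities linked in the example sentence.
entity_types = {
    "Barack Obama": {"president", "politician", "person"},
    "Angela Merkel": {"chancellor", "politician", "person"},
}
linked = ["Barack Obama", "Angela Merkel"]

president = resolve("the president", linked, entity_types)
chancellor = resolve("the chancellor", linked, entity_types)
ambiguous = resolve("the politician", linked, entity_types)
```

"The politician" stays unresolved because both entities carry that type, which is the kind of case where additional context would be needed.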
Event Detection on Social Streams
• Social streams are frequently mined for specific events
– Examples: election in a given country, plane hijacking, etc.
• Keyword search yields catastrophic results
– No understanding of entities and their relationships ⇒ too many false positives
• An entity-centric approach can help
– Lifts ambiguity relating to entities
– Handles relationships
ArmaTweet
[Figure: ArmaTweet architecture, with stages labeled 3.1, 3.2, and 3.3]
Alberto Tonon, Philippe Cudré-Mauroux, Vincent Lenders, Albert Blarer and Boris Motik. ArmaTweet: Detecting Events by Semantic Tweet Analysis. ESWC 2017.
ArmaTweet Event Detection Process (1/2)
3.1 • Automatically annotate tweets with quads (subject, predicate, object, location)
– Any component can be absent
3.2 • Embed quads into a knowledge graph
– DBpedia provides anchors for subject, object, and location
• Semistructured extract from Wikipedia ⇒ useful background knowledge!
• Classification: ‘Edward Brooke is a politician’
• Containment hierarchy: ‘Bern is in Switzerland’
– WordNet provides anchors for predicate
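A quad with optional components, plus the two kinds of background knowledge named above, could be modeled along these lines. The `Quad` class, the lookup tables, and the example quad are hypothetical and hold only the two facts quoted on the slide; a real system would query DBpedia instead of in-memory dictionaries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quad:
    """A tweet annotation; any component may be absent (None)."""
    subject: Optional[str] = None
    predicate: Optional[str] = None
    obj: Optional[str] = None
    location: Optional[str] = None


# Toy background knowledge mirroring the two examples on the slide.
CLASSIFICATION = {"Edward_Brooke": {"politician"}}
CONTAINED_IN = {"Bern": "Switzerland"}


def is_a(entity, cls):
    """Classification lookup: is the entity an instance of the class?"""
    return entity is not None and cls in CLASSIFICATION.get(entity, set())


def located_in(place, region):
    """Walk the containment hierarchy upwards from place."""
    while place is not None:
        if place == region:
            return True
        place = CONTAINED_IN.get(place)
    return False


# A made-up quad with no object component.
quad = Quad(subject="Edward_Brooke", predicate="visit", location="Bern")
```

With these anchors, a tweet mentioning Bern can be matched by a query about Switzerland, which plain keyword search would miss.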
ArmaTweet Event Detection Process (2/2) 3.2 • Describe events of interest declaratively over the knowledge graph – E.g., ‘quads where subject is classified as politician, and predicate refers to dying’
3.3 • Use time series analysis to identify events E.g., by threshold or other statistics
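Putting the last two steps together, a declarative event description can be read as a filter over quads, and detection as a simple statistic over the daily counts of matching quads. The sketch below assumes quads are plain 4-tuples, uses hypothetical stand-ins for the DBpedia and WordNet lookups, and flags a day when its count exceeds the trailing mean by k standard deviations; the window, k, and the floor of 1.0 on the deviation are arbitrary choices, not values from ArmaTweet.

```python
from collections import Counter
from statistics import mean, stdev

POLITICIANS = {"Edward_Brooke"}          # hypothetical stand-in for a DBpedia class lookup
DYING_PREDICATES = {"die", "pass_away"}  # hypothetical stand-in for a WordNet lookup


def matches_event(quad):
    """'Subject is classified as politician, and predicate refers to dying.'"""
    subject, predicate, _obj, _location = quad
    return subject in POLITICIANS and predicate in DYING_PREDICATES


def daily_counts(dated_quads, num_days):
    """Number of matching quads per day, for days 0 .. num_days - 1."""
    counts = Counter(day for day, quad in dated_quads if matches_event(quad))
    return [counts.get(day, 0) for day in range(num_days)]


def detect_events(counts, window=7, k=3.0):
    """Flag days whose count exceeds the trailing mean by k standard deviations."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Floor sigma so that a perfectly flat history does not flag tiny bumps.
        if counts[i] > mu + k * max(sigma, 1.0):
            flagged.append(i)
    return flagged


# Synthetic stream: background chatter every day, a burst of matching quads on day 8.
stream = [(d, ("Edward_Brooke", "visit", None, "Bern")) for d in range(10)]
stream += [(8, ("Edward_Brooke", "die", None, None))] * 12
counts = daily_counts(stream, 10)
events = detect_events(counts)
```

The daily "visit" quads never match the event description, so only the burst on day 8 stands out.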
ArmaTweet Results Evaluation
• Order of magnitude better performance compared to keyword search + thresholding

PRECISION RESULTS (positive instances by relevance: R3, R3+R2, R3–R1)

Event Category                  | Type | Total Events | R3        | R3+R2     | R3–R1
Aviation accident               | SP   | 84           | 44 (52%)  | 51 (61%)  | 64 (76%)
Cyber attack on a company       | PO   | 129          | 20 (16%)  | 42 (33%)  | 57 (44%)
Capital punishment in a country | PC   | 153          | 47 (31%)  | 67 (44%)  | 92 (60%)
Militia terror act              | SP   | 220          | 92 (42%)  | 125 (57%) | 141 (64%)
Politician dying                | SP   | 111          | 76 (68%)  | 80 (72%)  | 85 (77%)
Politician visits a country     | SPC  | 44           | 29 (66%)  | 36 (82%)  | 44 (100%)
Unrest in a country             | PC   | 200          | 125 (63%) | 133 (67%) | 148 (74%)
Total:                          |      | 941          | 433 (46%) | 534 (57%) | 631 (67%)
Making Sense of a Large Infrastructure…
Microsoft’s own Metadata Lake…
Example 1: Job pipeline analysis (state-of-the-art)
• User: “I need help with my ML experiment processing Clicklogs”
• Ops / Engineer:
+ Dig through many tables (from cooked logs)
+ Write ad-hoc analytics
+ Manually inspect job exec plan (XML blobs)
= Wait 30 min (drink coffee)
Our Solution: Guider
(1) Petabytes of daily logs
(2) Entity graph that represents a lightweight “skeleton” of the logs, used for navigation
(3) User-level queries return bytes of aggregated data
Guider Architecture
[Figure labels: Raw Data (several sources); Dependency Definition (Schema + extr. rules); Extraction; dependency graph; Querying; Storage: Big Data System (Scope/Cosmos), Graph System (Neo4J)]
Dependency-Driven Analytics: a Compass for Uncharted Data Oceans. Ruslan Mavlyutov, Carlo Curino, Boris Asipov, and Phil Cudre-Mauroux. CIDR 2017
Guider Use-Cases
1. Auditing and Compliance [in production]
2. Job Scheduling [Morpheus]
3. Global Job Ranking
4. Datacenter migration
[Figure: dependency graph of daily job and file instances (JobA_Day1, FileA_Day1, JobB_Day1, FileB_Day1, JobA_Day2, FileA_Day2, JobB_Day2, FileB_Day2, JobC_Day2, ..., JobA_DayN, FileA_DayN, JobB_DayN, FileB_DayN, JobD_DayN). One job is marked as failed; annotations: “JobA_* impact”, “Day1 job impact”, “Day2 job impact”, “DayN job impact”, and “JobA_DayK no impact”]
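The kind of impact question a dependency graph answers ("what is affected if this job fails?") can be sketched as a breadth-first traversal over job-to-file and file-to-job edges. The edge list below is invented for illustration (including the FileC_Day2 node), and the edge direction "producer → consumer" is an assumption; the system described here stores its graph in Neo4J rather than in Python dictionaries.

```python
from collections import deque


def downstream_impact(edges, failed):
    """All nodes reachable from a failed node along dependency edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    seen, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical edges: a job writes a file, a file is read by a later job.
edges = [
    ("JobA_Day1", "FileA_Day1"),
    ("FileA_Day1", "JobB_Day1"),
    ("JobB_Day1", "FileB_Day1"),
    ("FileB_Day1", "JobA_Day2"),
    ("JobA_Day2", "FileA_Day2"),
    ("JobC_Day2", "FileC_Day2"),
]
impacted = downstream_impact(edges, "JobA_Day1")
```

The traversal touches only the lightweight graph, never the raw logs, which is what keeps such queries small compared with digging through log tables.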
Thanks for your Attention!
http://exascale.info
… and heartfelt thanks to our sponsors