Big Data Integration for eGovernment

Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland

ICEDEG 2017 Quito, April 20, 2017

Instant Quiz
• 3 Vs of Big Data?
• Wrapper-Mediator?
• Hadoop?
• Spark?

eXascale Infolab (XI)
• New lab @ U. of Fribourg, Switzerland
• Big Data infrastructures for social / semantic / scientific data (… mostly)

http://exascale.info/

Exascale Data Deluge
➡ New data formats
➡ New machines
➡ Peta & exa-scale datasets
• Web companies – Google – eBay – Yahoo
• Science – Biology – Astronomy – Remote Sensing
• Financial services, retail companies, governments, etc.
➡ Obsolescence of traditional information infrastructures

How can Big Data help eGovernment?

Context: The n-Vs of Big Data
• Volume: amount of data
• Velocity: speed of data in and out
• Variety: range of data types and sources

[Gartner 2012] "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization"

A Quick Look at

(Too) many different sources & formats!
• Difficult to find the right piece of data
• Difficult to merge different datasets

Fundamentally, a Big Data Integration problem

On the Menu Today
1. Entity-Centric Data Integration
2. The XI Pipeline to integrate textual data
3. Use-case 1: ArmaTweet
4. Use-case 2: Dependency-Driven Analytics

Information Management
• The story so far:
  – Strict separation between unstructured and structured data management infrastructures
  – Unstructured data: keywords, HTTP, inverted index
  – Structured data: SQL, JDBC, DBMS

The 3rd V: Data Integration
• Data integration is still one of the biggest CS problems out there (according to many, e.g., Gartner)
• Integration typically requires some sort of mediation
  1. Unstructured data: keywords, synsets
  2. Structured data: global schema, transitive closure of schemas
⇒ Nightmarish if 1 and 2 are taken separately, a horror marathon if considered together

Entities as Mediation
• Rising paradigm
  – Store information at the (semantic) entity granularity
  – Integrate information by inter-linking entities
• Advantages?
  – Coarser granularity compared to keywords
    • More natural, e.g., brain functions similarly (or is it the other way around?)
  – Denormalized information compared to RDBMSs
    • Schema-later, heterogeneity, sparsity
    • Pre-computed joins, “Semantic” linking
• Drawbacks?

Entity-Centric Data Integration
[Figure: higher-level apps built on top of a knowledge graph]


Integrating Textual Data
• The XI Pipeline: from text to (contextualized) entities
  – Mention Extraction / NER (2.1)
  – Entity Linking (2.2)
  – Entity Typing (2.3)
  – Co-Ref Resolution (2.4)
• Runs on massive amounts of data (Spark)

2.1 Named Entity Recognition (NER)

[Figure: NER pipeline. Components shown: text extraction (Apache Tika), n-gram indexing, list of extracted n-grams, frequency reweighting, candidate selection, lemmatization, POS tagging, n+1 grams merging, list of selected n-grams, feature extraction (foreach n-gram), supervised classifier, ranked list of n-grams]

Roman Prokofyev, Gianluca Demartini, Philippe Cudré-Mauroux: Effective named entity recognition for idiosyncratic web collections. WWW 2014: 397-408

2.2 Entity Linking

• Linking entities to text is an old problem… – … and is extremely hard, esp. for machines

• Dozens of approaches have been suggested • What if – We want to combine approaches / frameworks? – We want to leverage both human computations & algorithms?

ZenCrowd
• Three-stage blocking approach for integrating textual data & entities
  1. Cheap inverted index selection of candidates
  2. Expensive similarity measures
  3. Crowdsource low-confidence matches
• Uses dynamic templating to create micro-matching-tasks and publish them on MTurk
• Combines both algorithmic and human matchers using probabilistic networks

Gianluca Demartini, Djellel Eddine Difallah, Philippe Cudré-Mauroux: ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. WWW 2012: 469-478
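
The three-stage blocking idea above can be illustrated with a minimal Python sketch. It is not the ZenCrowd implementation: the toy catalogue `ENTITIES`, the helper names, the use of `difflib.SequenceMatcher` as the "expensive" similarity measure, and the `accept` / `reject` thresholds are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical toy entity catalogue (URI -> label); not taken from the talk.
ENTITIES = {
    "ex:Bern": "Bern",
    "ex:Berne_Indiana": "Berne, Indiana",
    "ex:Fribourg": "Fribourg",
}

def build_inverted_index(entities):
    """Map each lowercase label token to the URIs whose label contains it."""
    index = defaultdict(set)
    for uri, label in entities.items():
        for token in label.lower().replace(",", " ").split():
            index[token].add(uri)
    return index

def link_mention(mention, entities, index, accept=0.9, reject=0.3):
    """Return a list of (decision, uri, score) for the candidates of a mention."""
    # Stage 1: cheap candidate selection through the inverted index.
    candidates = set()
    for token in mention.lower().split():
        candidates |= index.get(token, set())
    results = []
    for uri in sorted(candidates):
        # Stage 2: costlier string similarity, computed on candidates only.
        score = SequenceMatcher(None, mention.lower(), entities[uri].lower()).ratio()
        # Stage 3: matches that are neither clearly good nor clearly bad
        # are routed to the crowd (assumed thresholds).
        if score >= accept:
            decision = "link"
        elif score <= reject:
            decision = "discard"
        else:
            decision = "crowdsource"
        results.append((decision, uri, score))
    return results
```

With this toy catalogue, the mention "Bern" is linked directly, while "Berne" only partially matches "Berne, Indiana" and is therefore sent to the crowd. The point of the blocking design is that the expensive steps (similarity, then human work) only ever see the small candidate set produced by the previous stage.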

ZenCrowd Architecture

[Figure: ZenCrowd architecture. Input: HTML pages. Components: entity extractors, algorithmic matchers, LOD index (get entity) over the LOD Open Data Cloud, micro-task manager publishing micro matching tasks to a crowdsourcing platform, workers' decisions, decision engine with probabilistic network. Output: HTML+RDFa pages]

Probabilistic Inference
• Probabilistic network to integrate a priori & a posteriori information
  – Agreement of good turkers & algorithms
    • Learning process
  – Constraints
    • Unicity
    • Equality (SameAs)
  – Giant probabilistic graph
    • Instantiated selectively

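[Figure: factor graph with worker nodes w1, w2 and their priors pw1( ), pw2( ); variables c11 to c23; link nodes l1, l2, l3 with priors pl1( ), pl2( ), pl3( ) and link factors lf1( ), lf2( ), lf3( ); a SameAs factor sa1-2( ) and a unicity factor u2-3( )]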
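
As a rough illustration of how a priori information (a matcher's confidence) and a posteriori information (worker answers) can be combined, here is a deliberately simplified sketch. It is a naive-Bayes style update that assumes independent voters, not the factor graph used by ZenCrowd, and it ignores the unicity and SameAs constraints; the function name and the reliability values are hypothetical.

```python
def posterior_link(prior, votes):
    """Posterior probability that a candidate link is correct.

    prior: a priori probability of the link, e.g. from an algorithmic matcher.
    votes: list of (reliability, says_correct) pairs, where reliability is the
           estimated probability (strictly between 0 and 1) that the voter,
           human or algorithm, answers correctly.
    Assumes voters are independent given the true answer.
    """
    p_true, p_false = prior, 1.0 - prior
    for reliability, says_correct in votes:
        if says_correct:
            p_true *= reliability
            p_false *= 1.0 - reliability
        else:
            p_true *= 1.0 - reliability
            p_false *= reliability
    return p_true / (p_true + p_false)
```

For example, with a neutral prior of 0.5, a reliable worker (0.9) voting "correct" outweighs a weak worker (0.6) voting "incorrect", which mirrors the idea of trusting the agreement of good workers and algorithms.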

Does it Work?
• Improves avg. prec. by 0.14 on average!
  – Minimal crowd involvement
  – Embarrassingly parallel problem

[Figure: worker precision vs. number of tasks, for US workers and IN workers]
[Figure: precision vs. top-K workers (label: Top US Worker)]

2.3 Entity Typing

• Entities can have many types (facets)
• Which fine-grained types are most relevant given the context?

[Figure: example types of a single entity: Thing, Agent, Person, Living People, American Billionaires, American Philanthropist, American Computer Programmers, American People of Scottish Descent, Harvard University People, People from King County, People from Seattle, Windows People]

TRank
• Fine-grained typing
  – Tree of 447’260 types
  – Rooted on
  – Depth of 19
• Ranks relevant types by analyzing the context
  – Textual context
  – Graph context
  – Decision trees
  – Linear regression

Alberto Tonon, Michele Catasta, Roman Prokofyev, Gianluca Demartini, Karl Aberer, Philippe Cudré-Mauroux. Contextualized Ranking of Entity Types Based on Knowledge Graphs. JWS 2016.
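
To make the idea of context-based type ranking concrete, the following is a small sketch under strong assumptions. The hierarchy fragment `PARENT`, the scoring rule (context support first, depth in the type tree as tie-breaker) and the function names are hypothetical; TRank itself combines several textual and graph features with learned models (decision trees, linear regression), which this sketch does not attempt.

```python
# Hypothetical fragment of a type hierarchy (type -> parent type).
PARENT = {
    "Agent": "Thing",
    "Person": "Agent",
    "American Billionaires": "Person",
    "American Computer Programmers": "Person",
}

def depth(type_name, parent=PARENT):
    """Number of edges between a type and the root of the hierarchy."""
    d = 0
    while type_name in parent:
        type_name = parent[type_name]
        d += 1
    return d

def rank_types(entity_types, context_types):
    """Rank an entity's types, most relevant first.

    entity_types: all types of the entity to rank.
    context_types: one set of types per other entity found in the same context.
    Score = (number of context entities sharing the type, depth of the type).
    """
    def score(t):
        support = sum(t in types for types in context_types)
        return (support, depth(t))
    return sorted(entity_types, key=score, reverse=True)
```

In a text that also mentions other programmers, `rank_types` would place "American Computer Programmers" ahead of "American Billionaires", and the generic root type last.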

2.4 Co-Reference Resolution

• Better co-reference resolution through the knowledge graph, linking & typing
• Example: “Barack Obama called Angela Merkel last week; the president asked the chancellor whether…”

Roman Prokofyev, Alberto Tonon, Michael Luggen, Loic Vouilloz, Djellel Eddine Difallah, and Philippe Cudre-Mauroux: SANAPHOR: Ontology-Based Coreference Resolution. ISWC 2015.
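
The example sentence above suggests how entity types can disambiguate nominal mentions. The sketch below is a toy illustration, not the SANAPHOR algorithm: the `ENTITY_TYPES` excerpt, the "last token is the head noun" heuristic and the function name are assumptions.

```python
# Hypothetical knowledge-graph excerpt: linked entity -> set of (lowercase) types.
ENTITY_TYPES = {
    "Barack Obama": {"person", "president"},
    "Angela Merkel": {"person", "chancellor"},
}

def resolve_nominal_mentions(linked_entities, nominal_mentions, entity_types=ENTITY_TYPES):
    """Map each nominal mention (e.g. 'the president') to a linked entity.

    A mention is resolved only if exactly one linked entity has a type equal
    to the mention's head noun (crudely taken to be its last token);
    otherwise it maps to None.
    """
    resolved = {}
    for mention in nominal_mentions:
        head = mention.lower().split()[-1]
        matches = [e for e in linked_entities if head in entity_types.get(e, set())]
        resolved[mention] = matches[0] if len(matches) == 1 else None
    return resolved
```

Applied to the sentence above, "the president" resolves to Barack Obama and "the chancellor" to Angela Merkel, whereas an ambiguous mention such as "the person" is left unresolved.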


Event Detection on Social Streams
• Social streams are frequently mined for specific events
  – Examples: election in a given country, plane hijacking, etc.
• Keyword search yields catastrophic results
  – No understanding of entities and their relationships ⇒ too many false positives
• An entity-centric approach can help
  – Resolves ambiguity relating to entities
  – Handles relationships

ArmaTweet

[Figure: ArmaTweet pipeline overview, with steps labeled 3.1, 3.2 and 3.3]

Alberto Tonon, Philippe Cudré-Mauroux, Vincent Lenders, Albert Blarer and Boris Motik. ArmaTweet: Detecting Events by Semantic Tweet Analysis. ESWC 2017.

ArmaTweet Event Detection Process (1/2)
3.1 • Automatically annotate tweets with quads (subject, predicate, object, location)
  – Any component can be absent
3.2 • Embed quads into a knowledge graph
  – DBpedia provides anchors for subject, object, and location
    • Semistructured extract from Wikipedia ⇒ useful background knowledge!
    • Classification: ‘Edward Brooke is a politician’
    • Containment hierarchy: ‘Bern is in Switzerland’
  – WordNet provides anchors for predicate

ArmaTweet Event Detection Process (2/2)
3.2 • Describe events of interest declaratively over the knowledge graph
  – E.g., ‘quads where subject is classified as politician, and predicate refers to dying’
3.3 • Use time series analysis to identify events
  – E.g., by threshold or other statistics
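
The process above (quads, background knowledge, declarative event description, thresholding) can be sketched in a few lines of Python. This is an illustrative toy, not ArmaTweet's implementation: the `Quad` type, the `CLASSIFICATION` and `PREDICATE_SENSES` dictionaries (standing in for DBpedia and WordNet anchors) and the function names are assumptions.

```python
from collections import Counter
from typing import NamedTuple, Optional

class Quad(NamedTuple):
    """A tweet annotation; any component other than the day can be absent."""
    day: str
    subject: Optional[str]
    predicate: Optional[str]
    obj: Optional[str] = None
    location: Optional[str] = None

# Hypothetical background knowledge standing in for DBpedia and WordNet.
CLASSIFICATION = {"Edward Brooke": {"politician"}}
PREDICATE_SENSES = {"died": "die", "passed away": "die", "visited": "visit"}

def matches_politician_dying(quad):
    """Event description: subject is a politician and predicate refers to dying."""
    is_politician = "politician" in CLASSIFICATION.get(quad.subject, set())
    return is_politician and PREDICATE_SENSES.get(quad.predicate) == "die"

def detect_events(quads, event_filter, threshold):
    """Return the days on which at least `threshold` quads match the filter."""
    counts = Counter(q.day for q in quads if event_filter(q))
    return sorted(day for day, n in counts.items() if n >= threshold)
```

Because the filter is expressed over classifications and predicate senses rather than raw keywords, a tweet saying "passed away" matches the same event as one saying "died", while a tweet about an unclassified subject does not.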

ArmaTweet Results Evaluation

Precision Results
• Order of magnitude better performance compared to keyword search + thresholding

Positive instances by relevance are given in the columns R3, R3+R2 and R3–R1.

Event Category | Type | Total Events | R3 | R3+R2 | R3–R1
Aviation accident | SP | 84 | 44 (52%) | 51 (61%) | 64 (76%)
Cyber attack on a company | PO | 129 | 20 (16%) | 42 (33%) | 57 (44%)
Capital punishment in a country | PC | 153 | 47 (31%) | 67 (44%) | 92 (60%)
Militia terror act | SP | 220 | 92 (42%) | 125 (57%) | 141 (64%)
Politician dying | SP | 111 | 76 (68%) | 80 (72%) | 85 (77%)
Politician visits a country | SPC | 44 | 29 (66%) | 36 (82%) | 44 (100%)
Unrest in a country | PC | 200 | 125 (63%) | 133 (67%) | 148 (74%)
Total: | | 941 | 433 (46%) | 534 (57%) | 631 (67%)


Making Sense of a Large Infrastructure…

Microsoft’s own Metadata Lake…

Example 1: Job pipeline analysis (state-of-the-art)
• User: “I need help with my ML experiment processing Clicklogs”
• Ops / Engineer:
  – Dig through many tables (from cooked logs)
  – Write ad-hoc analytics
  – Wait 30 min (drink coffee)
  – Manually inspect the job execution plan (XML blobs)

Our Solution: Guider
(1) Petabytes of daily logs
(2) Entity graph that represents a lightweight “skeleton” of the logs, used for navigation
(3) User-level queries return bytes of aggregated data

Guider Architecture

[Figure: Guider architecture. Stages: dependency definition (schema + extraction rules), extraction from raw data, storage of the dependency graph, querying. Systems: Big Data system (Scope/Cosmos) and graph system (Neo4J)]

Ruslan Mavlyutov, Carlo Curino, Boris Asipov, and Philippe Cudré-Mauroux: Dependency-Driven Analytics: a Compass for Uncharted Data Oceans. CIDR 2017

Guider Use-Cases
1. Auditing and Compliance [in production]
2. Job Scheduling [Morpheus]
3. Global Job Ranking
4. Datacenter migration

[Figure: dependency graph of jobs and files across days (JobA_Day1, FileA_Day1, JobB_Day1, FileB_Day1; JobA_Day2, FileA_Day2, JobB_Day2, FileB_Day2, JobC_Day2; …; JobA_DayN, FileA_DayN, JobB_DayN, FileB_DayN, JobD_DayN), annotated with the impact of failed JobA_* runs: Day1 job impact, Day2 job impact, …, DayN job impact; JobA_DayK: no impact]
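
The kind of question such a dependency graph answers, namely what is affected downstream when a job fails, can be sketched as a plain graph traversal. The edges in `EDGES` are hypothetical (the figure's actual edges are not recoverable here), and this in-memory breadth-first search only stands in for a query against a graph system such as Neo4J.

```python
from collections import deque

# Hypothetical dependency edges: a job writes a file, a file is read by a job.
EDGES = {
    "JobA_Day1": ["FileA_Day1"],
    "FileA_Day1": ["JobB_Day1"],
    "JobB_Day1": ["FileB_Day1"],
    "FileB_Day1": ["JobA_Day2"],
    "JobA_Day2": ["FileA_Day2"],
}

def downstream_impact(failed_node, edges=EDGES):
    """Return every job and file reachable from failed_node, in BFS order."""
    seen = {failed_node}
    order = []
    queue = deque([failed_node])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order
```

Since the graph is only a lightweight skeleton of the logs, a traversal like this touches a handful of nodes instead of scanning the raw data, which is the motivation for navigating through the entity graph first.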

Thanks for your Attention!

http://exascale.info

… and heartfelt thanks to our sponsors