Link Discovery Tutorial - Part I: Efficiency

Link Discovery Tutorial Part I: Efficiency

Axel-Cyrille Ngonga Ngomo(1) , Irini Fundulaki(2) , Mohamed Sherif(1) (1) Institute for Applied Informatics, Germany (2) FORTH, Greece

October 18th, 2016 Kobe, Japan Ngonga Ngomo et al. (InfAI & FORTH)

LD Tutorial: Efficiency

October 28, 2016

1 / 106

Table of Contents 1

Introduction

2

LIMES

3

MultiBlock

4

Reduction-Ratio-Optimal Link Discovery

5

Link Discovery for Temporal Data

6

GNOME

7

GPUs and Hadoop

8

Summary and Conclusion

Ngonga Ngomo et al. (InfAI & FORTH)


October 28, 2016

2 / 106

Table of Contents 1

Introduction

2

LIMES

3

MultiBlock

4


5


6

GNOME

7

GPUs and Hadoop

8




October 28, 2016

3 / 106

Introduction Link Discovery

Definition (Declarative Link Discovery) Given sets S and T of resources and relation R Find M = {(s, t) ∈ S × T : R(s, t)} Here, find M 0 = {(s, t) ∈ S × T : σ(s, t) ≥ θ}



October 28, 2016

4 / 106


Definition (Declarative Link Discovery) Given sets S and T of resources and relation R Find M = {(s, t) ∈ S × T : R(s, t)} Here, find M 0 = {(s, t) ∈ S × T : σ(s, t) ≥ θ} Problem Naïve complexity ∈ O(S × T ), i.e., O(n2 ) Example: ≈ 70 days for cities in DBpedia and LinkedGeoData



October 28, 2016

4 / 106


Definition (Declarative Link Discovery) Given sets S and T of resources and relation R Find M = {(s, t) ∈ S × T : R(s, t)} Here, find M 0 = {(s, t) ∈ S × T : σ(s, t) ≥ θ} Problem Naïve complexity ∈ O(S × T ), i.e., O(n2 ) Example: ≈ 70 days for cities in DBpedia and LinkedGeoData Solutions 1

Reduced number of comparisons C (A) ≥ |M 0 | [NA11; IJB11; Ngo12; Ngo13]

2

Use planning [Ngo14]

3

Reduce the complexity of linking to < n2 [GSN16]

4

Use features of hardware environment [Ngo+13; NH16]



October 28, 2016

4 / 106

Table of Contents 1

Introduction

2

LIMES

3

MultiBlock

4


5


6

GNOME

7

GPUs and Hadoop

8




October 28, 2016

5 / 106

LIMES Intuition

Definition (Declarative Link Discovery) Given sets S and T of resources and relation R Find M = {(s, t) ∈ S × T : R(s, t)} Here, find M 0 = {(s, t) ∈ S × T : δ(s, t) ≤ τ } Some δ are distances (in the mathematical sense) [NA11] Examples Levenshtein distance Minkowski distance



October 28, 2016

6 / 106

LIMES Intuition

Definition (Declarative Link Discovery) Given sets S and T of resources and relation R Find M = {(s, t) ∈ S × T : R(s, t)} Here, find M 0 = {(s, t) ∈ S × T : δ(s, t) ≤ τ } Some δ are distances (in the mathematical sense) [NA11] Examples Levenshtein distance Minkowski distance

Intuition Distances abide by triangle inequality δ(x , z) − δ(z, y ) ≤ δ(x , y ) ≤ δ(x , z) + δ(z, y ) Use exemplars for pessimistic approximation Reduce number of computations Ngonga Ngomo et al. (InfAI & FORTH)


October 28, 2016

6 / 106

LIMES Approach 1

2

Start with random e1 ∈ T n P en+1 = arg max δ(x , ei ) x ∈T

3

i=1

Assign each t to e(t) with e(t) = arg min δ(t, ei ) ei

4

Approximate δ(s, t) by using δ(s, t) ≥ δ(s, e(t)) − δ(e(t), t)



October 28, 2016

7 / 106

LIMES Approach

1

Start with random e1 ∈ T



October 28, 2016

8 / 106

LIMES Approach

1

Start with random e1 ∈ T



October 28, 2016

9 / 106

LIMES Approach

2

en+1 = arg max x ∈T

n P

δ(x , ei )

i=1



October 28, 2016

10 / 106

LIMES Approach

3

Assign each t to e(t) with e(t) = arg min δ(t, ei ) ei



October 28, 2016

11 / 106

LIMES Approach

4

Approximate δ(s, t) by using δ(s, t) ≥ δ(s, e(t)) − δ(e(t), t)



October 28, 2016

12 / 106

LIMES Evaluation

Measure = Levenshtein Threshold = 0.9, KB base size = 103 x-axis is number of exemplars, y-axis is number of comparisons 10 8

0.75 0.8

6

0.85 0.9

4

0.95 2

Brute force

0

0

50


100

150

200

250


300

October 28, 2016

13 / 106

LIMES Evaluation

Measure = Levenshtein Threshold = 0.9, KB base size = 104 x-axis is number of exemplars, y-axis is number of comparisons p Optimal number of examplars ≈ |T | 1000 900 800

700

0.75

600

0.8

500

0.85

400

0.9

300

0.95

200

Brute force

100 0

0

50


100

150

200

250


300 October 28, 2016

14 / 106

LIMES Conclusion

Time-efficent approach Reduces number of comparisons through approximation No guarantee pertaining to reduction ratio



October 28, 2016

15 / 106

Table of Contents 1

Introduction

2

LIMES

3

MultiBlock

4


5


6

GNOME

7

GPUs and Hadoop

8




October 28, 2016

16 / 106

MultiBlock Intuition

Idea Create multidimensional index of data Partition the space so that σ(x , y ) < θ ⇒ index (x ) 6= index (y ) Decrease number of computations by only comparing pairs (x , y ) with index (x ) = index (y )



October 28, 2016

17 / 106

MultiBlock Intuition

Idea Create multidimensional index of data Partition the space so that σ(x , y ) < θ ⇒ index (x ) 6= index (y ) Decrease number of computations by only comparing pairs (x , y ) with index (x ) = index (y )



October 28, 2016

17 / 106

MultiBlock Approach

Similarity functions are aggregations of atomic measures sim, e.g., σ = MIN(levenshtein, jaccard) Each atomic measure must define a blocking function block : (S ∪ T ) × [0, 1] → P(Nn ) Each aggregation must define a similarity, a block and a threshold aggregation Ngonga Ngomo et al. (InfAI & FORTH)


October 28, 2016

18 / 106

MultiBlock Example

sim =

levenshtein(s,t) max(|s|,|t|) q

(normalized levenshtein)

block : Σ → N

Maps each q-gram to exactly one block (e.g., Goedelisation) Only c = max(|s|, |t|)(1 − θ) · q + 1 q-grams to be indexed per string

index (s, θ) = {block(qgrams(s[0...c] )} String mapped to several blocks Comparison only within blocks



October 28, 2016

19 / 106

MultiBlock Example

sim =

levenshtein(s,t) max(|s|,|t|) q

(normalized levenshtein)

block : Σ → N

Maps each q-gram to exactly one block (e.g., Goedelisation) Only c = max(|s|, |t|)(1 − θ) · q + 1 q-grams to be indexed per string

index (s, θ) = {block(qgrams(s[0...c] )} String mapped to several blocks Comparison only within blocks Comparison of drugs in DBpedia and Drugbank Setting Full comparison Blocking (100 blocks) Blocking (1,000 blocks) MultiBlock Ngonga Ngomo et al. (InfAI & FORTH)

Comparisons

Runtime

Links

22,242,292 906,314 322,573 122,630

430s 44s 14s 6s

1,403 1,349 1,287 1,403


October 28, 2016

19 / 106

Table of Contents 1

Introduction

2

LIMES

3

MultiBlock

4


5


6

GNOME

7

GPUs and Hadoop

8




October 28, 2016

20 / 106

HR3 Link Discovery

Definition (Declarative Link Discovery) Given sets S and T of resources and relation R Find M = {(s, t) ∈ S × T : R(s, t)} Here, find M 0 = {(s, t) ∈ S × T : δ(s, t) ≤ θ}



October 28, 2016

21 / 106

HR3 Link Discovery

Definition (Declarative Link Discovery) Given sets S and T of resources and relation R Find M = {(s, t) ∈ S × T : R(s, t)} Here, find M 0 = {(s, t) ∈ S × T : δ(s, t) ≤ θ} Intuition: Reduce the number of comparisons C (A) ≥ |M 0 | Maximize reduction ratio: C (A) RR(A) = 1 − |S||T |



October 28, 2016

21 / 106

HR3 Link Discovery

Definition (Declarative Link Discovery) Given sets S and T of resources and relation R Find M = {(s, t) ∈ S × T : R(s, t)} Here, find M 0 = {(s, t) ∈ S × T : δ(s, t) ≤ θ} Intuition: Reduce the number of comparisons C (A) ≥ |M 0 | Maximize reduction ratio: C (A) RR(A) = 1 − |S||T | Question Can we devise lossless approaches with guaranteed RR? Advantages Space management Runtime prediction Resource scheduling Ngonga Ngomo et al. (InfAI & FORTH)


October 28, 2016

21 / 106

HR3 RR Guarantee

Best achievable reduction ratio: RRmax = 1 −



|M 0 | |S||T |

October 28, 2016

22 / 106

HR3 RR Guarantee


|M 0 | |S||T |

Approach H(α) fulfills RR guarantee criterion, iff: ∀r < RRmax , ∃α : RR(H(α)) ≥ r



October 28, 2016

22 / 106

HR3 RR Guarantee


|M 0 | |S||T |

Approach H(α) fulfills RR guarantee criterion, iff: ∀r < RRmax , ∃α : RR(H(α)) ≥ r

Here, we use relative reduction ratio (RRR): RRR(A) =


RRmax RR(A)


October 28, 2016

22 / 106

Goal

Formal Goal Devise H(α) : ∀r > 1, ∃α : RRR(H(α)) ≤ r



October 28, 2016

23 / 106

HR3 Restrictions

Minkowski Distance v u n uX p δ(s, t) = t |si − ti |p , p ≥ 2 i=1



October 28, 2016

24 / 106

HR3 Space Tiling

HYPPO δ(s, t) ≤ θ describes a hypersphere Approximate hypersphere by using a hypercube Easy to compute No loss of recall (blocking)



October 28, 2016

25 / 106

HR3 Space Tiling

Set width of single hypercube to ∆ = θ/α



October 28, 2016

26 / 106

HR3 Space Tiling

Set width of single hypercube to ∆ = θ/α Tile Ω = S ∪ T into the adjacent cubes C Coordinates: (c1 , . . . , cn ) ∈ Nn Contains points ω ∈ Ω : ∀i ∈ {1 . . . n}, ci ∆ ≤ ωi < (ci + 1)∆



October 28, 2016

26 / 106

HR3 Space Tiling

Set width of single hypercube to ∆ = θ/α Tile Ω = S ∪ T into the adjacent cubes C Coordinates: (c1 , . . . , cn ) ∈ Nn Contains points ω ∈ Ω : ∀i ∈ {1 . . . n}, ci ∆ ≤ ωi < (ci + 1)∆



October 28, 2016

26 / 106

HYPPO Idea

Combine (2α + 1)n hypercubes around C (ω) to approximate hypersphere

RRR(HYPPO(α)) =

(2α+1)n αn S(n)

lim RRR(HYPPO(α)) =

α→∞


2n S(n)


October 28, 2016

27 / 106

HYPPO Reduction Ratio

RRR(HYPPO) for p = 2, n = 2, 3, 4 and 2 ≤ α ≤ 50



October 28, 2016

28 / 106

HYPPO Reduction Ratio

RRR(HYPPO) for p = 2, n = 2, 3, 4 and 2 ≤ α ≤ 50


α→∞


α→∞


α→∞


4 π ≈ 1.27 (n = 2) 6 π ≈ 1.91 (n = 3) 32 π 2 ≈ 3.24 (n = 4) LD Tutorial: Efficiency

October 28, 2016

28 / 106

HR3 Idea

 0 if ∃i : |ci − c(ω)i | ≤ 1, 1 ≤ i ≤ n, n index (C , ω) = P  (|ci − c(ω)i | − 1)p else, i=1



October 28, 2016

29 / 106

HR3 Idea

Compare C (ω) with C iff index (C , ω) ≤ αp α = 4, p = 2



October 28, 2016

30 / 106

HR3 Idea

Claims No loss of recall lim RRR(HR3 (α)) = 1

α→∞



October 28, 2016

31 / 106

HR3 Lemma 1

Lemma index (C , s) = x ⇒ ∀t ∈ C δ p (s, t) > x ∆p



October 28, 2016

32 / 106

HR3 Lemma 1

Lemma index (C , s) = x ⇒ ∀t ∈ C δ p (s, t) > x ∆p p = 2, n = 2, index (C , s) = 5



October 28, 2016

32 / 106

HR3 Lemma 1

Lemma index (C , s) = x ⇒ ∀t ∈ C δ p (s, t) > x ∆p Proof. index (C , s) = x ⇒

n P

(|ci − ci (s)| − 1)p = x

i=1

Out of definition of cube index follows |si − ti | > (|ci − ci (s)| − 1)∆ n n P P Thus, |si − ti |p > (|ci − ci (s)| − 1)p ∆p i=1

i=1

Therewith, δ p (s, t) > x ∆p



October 28, 2016

33 / 106

HR3 Lemma 2

Lemma ∀s ∈ S : index (C , s) > αp implies that all t ∈ C are non-matches Proof. Follows directly from Lemma 1: index (C , s) > αp ⇒ ∀t ∈ C , δ p (s, t) > ∆p αp = θp



October 28, 2016

34 / 106

HR3 Idea

Claims

X

No loss of recall

3

lim RRR(HR (α)) = 1

α→∞



October 28, 2016

35 / 106

HR3 Lemma 3

Lemma ∀α > 1 RRR(HR3 (2α)) < RRR(HR3 (α)) p = 2, α = 4



October 28, 2016

36 / 106

HR3 Proof (idea)

Lemma ∀α > 1 RRR(HR 3 (2α)) < RRR(HR 3 (α)) p = 2, α = 8



October 28, 2016

37 / 106

HR3 Proof




October 28, 2016

38 / 106

HR3 Proof




October 28, 2016

39 / 106

HR3 Theorem

Theorem lim RRR(HR3 (α)) = 1

α→∞

Proof. α→∞⇒∆→0 Thus, C (s) = {s}, C = {t} n P Index function : ∆p (|ci (s) − ci | − 1)p ≤ ∆p αp i=1

∆→0⇒

n P

|si − ti |p ≤ θp

i=1



October 28, 2016

40 / 106

HR3 Idea

Claims No loss of recall

X

3

lim RRR(HR (α)) = 1

α→∞


X


October 28, 2016

41 / 106

HR3 Evaluation

Compare HR3 with LIMES 0.5’s HYPPO and SILK 2.5.1 Experimental Setup: Deduplicating DBpedia places by minimum elevation, elevation and maximum elevation (θ = 49m, 99m). Geonames and LinkedGeoData by longitude and latitude (θ = 1◦ , 9◦ )

Windows 7 Enterprise machine 64-bit computer with a 2.8GHz i7 processor with 8GB RAM.



October 28, 2016

42 / 106

HR3 Evaluation (Comparisons)

Experiment 2: Deduplicating DBpedia places, θ = 99m 0.64 × 106 less comparisons



October 28, 2016

43 / 106

HR3 Experiments (Comparisons)

Experiment 4: Linking Geonames and LinkedGedData, θ = 9◦ 4.3 × 106 less comparisons



October 28, 2016

44 / 106

HR3 Experiments (RRR)

Experiment 3,4: Geonames and LGD, θ = 1, 9◦



October 28, 2016

45 / 106

HR3 Experiments (Runtime)

Experiment 3,4: Geonames and LGD, θ = 1, 9◦ 180 160

Runtime (s)

140 120 100

HYPPO (Exp 3) HR3 (Exp 3) HYPPO (Exp. 4) HR3 (Exp. 4)

80 60 40 20 0

2

4

8

16

32

Granularity Ngonga Ngomo et al. (InfAI & FORTH)


October 28, 2016

46 / 106

HR3 Experiments (Runtime)

Experiment 1, 2: DBpedia, θ = 49, 99m Experiment 3, 4: Geonames and LGD, θ = 1, 9◦ 10

Runtime (s)

10

10

10

10

4

3

HR3 HYPPO SILK

2

1

0

Exp. 1


Exp. 2

Exp. 3


Exp. 4 October 28, 2016

47 / 106

HR3 Conclusion

Main result: new category of algorithms for link discovery Outperforms the state of the art (runtime, comparisons) Future Work Combine HR3 with multi-indexing approach Devise resource management approach Develop other algorithms (esp. for strings) with the same/similar theoretical guarantees



October 28, 2016

48 / 106

Table of Contents 1

Introduction

2

LIMES

3

MultiBlock

4


5


6

GNOME

7

GPUs and Hadoop

8




October 28, 2016

49 / 106

AEGLE Why?

:E1 :E1 :E1 :E1

rdfs:label "Engine failure"@en rdf:type :Error :beginDate :"2015-04-22T11:39:35" :endDate :"2015-04-22T11:39:37"

:E2 :E2 :E2 :E2

rdfs:label "Car accident"@en rdf:type :Accident :beginDate :"2015-06-28T11:45:22" :endDate :"2015-06-28T11:45:24"

Need to create links between events, e.g., :startsEvent Need to deal with volume and velocity



October 28, 2016

50 / 106

AEGLE Event Definition

Definition (Event) Events can be modeled as time intervals: v = (b(v ), e(v )) b(v ) is the beginning time (:beginDate) e(v ) is the end time (:endDate) b(v ) < e(v )



October 28, 2016

51 / 106

AEGLE Allen’s Interval Algebra Relation

Notation

Inverse

X before Y

bf (X , Y )

bfi(X , Y )

Illustration X Y X

X meets Y

m(X , Y )

Y

mi(X , Y )

X X finishes Y

f (X , Y )

Y

fi(X , Y ) X

X starts Y

st(X , Y )

Y

sti(X , Y )

X X during Y

d(X , Y )

Y

di(X , Y )

X X equal Y

eq(X , Y )

Y

eq(X , Y ) X

X overlaps with Y Ngonga Ngomo et al. (InfAI & FORTH)

ov (X , Y )

ovi(X , Y )


Y October 28, 2016

52 / 106

AEGLE Solution

Aegle: Allen’s intErval alGebra for Link discovEry Efficient computation of temporal relations between events Allen’s Interval Algebra: distinct, exhaustive, and qualitative relations between time intervals Intuition Expressing 13 Allen relations using 8 atomic relations Time is ordered: Find matching entities using two sorted lists



October 28, 2016

53 / 106

AEGLE Express st(s, t) using atomic relations

s t b(s) = b(t)



October 28, 2016

54 / 106


s

s

t b(s) = b(t)


t b(s) < e(t)


October 28, 2016

54 / 106


s

s

t b(s) = b(t)

t b(s) < e(t)

s t e(s) > b(t) Ngonga Ngomo et al. (InfAI & FORTH)


October 28, 2016

54 / 106


s

s

t b(s) = b(t)

t b(s) < e(t)

s

s

t e(s) > b(t) Ngonga Ngomo et al. (InfAI & FORTH)

t e(s) < e(t) LD Tutorial: Efficiency

October 28, 2016

54 / 106

AEGLE Atomic relations

Compute 8 atomic Boolean relations between begin and end points BeginBegin (BB) for b(s), b(t): BB 1 (s, t) ⇔ (b(s) < b(t)) BB 0 (s, t) ⇔ (b(s) = b(t)) BB −1 (s, t) ⇔ (b(s) > b(t)) ⇔ ¬(BB 1 (s, t) ∨ BB 0 (s, t))

BeginEnd(BE ) for b(s), e(t) EndBegin(EB) for e(s), b(t) EndEnd(EE ) for e(s), e(t)



October 28, 2016

55 / 106

AEGLE Combination of relations

s t st(s, t) ⇔ BB 0 (s, t) ∧ BE 1 (s, t) ∧ EB −1 (s, t) ∧ EE 1 (s, t) ⇔ {BB 0 (s, t) ∧ EE 1 (s, t) }



October 28, 2016

56 / 106



t s sti(s, t) ⇔ BB 0 (s, t) ∧ BE 1 (s, t) ∧ EB −1 (s, t) ∧ EE −1 (s, t) ⇔ {BB 0 (s, t) ∧ EE −1 (s, t)} = 0 { BB (s, t) ∧¬(EE 0 (s, t)∨ EE 1 (s, t))}



October 28, 2016

56 / 106



t s sti(s, t) ⇔ BB 0 (s, t) ∧ BE 1 (s, t) ∧ EB −1 (s, t) ∧ EE −1 (s, t) ⇔ {BB 0 (s, t) ∧ EE −1 (s, t)} = 0 { BB (s, t) ∧¬(EE 0 (s, t)∨ EE 1 (s, t))}



October 28, 2016

56 / 106

AEGLE Algorithm for st, sti

Source

s1 s2 t1

Target

t2



October 28, 2016

57 / 106


For st:

Source

s1 s2 t1

Target

t2



October 28, 2016

57 / 106


For st: Compute BB 0 :

Source

s1 s2 t1

Target

t2



October 28, 2016

57 / 106


For st: Compute BB 0 : s1 s2

t1

t2

{(s1 , t1 ), (s2 , t1 )}

Source

s1 s2 t1

Target

t2



October 28, 2016

57 / 106



t1

t2

{(s1 , t1 ), (s2 , t1 )}

Source

s1 s2

Compute EE 1 : s1

t1 Target

s2

t1

t2

t2 {(s1 , t1 ), (s2 , t1 ), (s1 , t2 )}



October 28, 2016

57 / 106



t1

t2

{(s1 , t1 ), (s2 , t1 )}

Source

s1 s2

Compute EE 1 : s1

t1 Target

s2

t1

t2

t2 {(s1 , t1 ), (s2 , t1 ), (s1 , t2 )}

Intersection between BB 0 and EE 1 : {(s1 , t1 ), (s2 , t1 )} Ngonga Ngomo et al. (InfAI & FORTH)


October 28, 2016

57 / 106


For sti:

Source

s1 s2 t1

Target

t2



October 28, 2016

58 / 106


For sti: Retrieve BB 0 and EE 1 :

Source

s1 s2 t1

Target

t2



October 28, 2016

58 / 106


For sti: Retrieve BB 0 and EE 1 : Compute EE 0 : s1

Source

s2

t1

t2

s1 s2 {(s2 , t2 )}

t1 Target

t2



October 28, 2016

58 / 106



Source

s2

t1

t2

s1 s2 {(s2 , t2 )}

t1 Target

t2


Union between EE 0 and EE 1 : {(s1 , t1 ), (s2 , t1 ), (s2 , t1 ), (s2 , t2 )}


October 28, 2016

58 / 106



Source

s2

t1

t2

s1 s2 {(s2 , t2 )}

t1 Target

t2

Union between EE 0 and EE 1 : {(s1 , t1 ), (s2 , t1 ), (s2 , t1 ), (s2 , t2 )} Difference between BB 0 and EE 0 , EE 1 : {(s1 , t2 ), (s2 , t2 )}



October 28, 2016

58 / 106

AEGLE Experimental Set-Up

Datasets: S = T Log Type Dataset name 3KMachines Machinery 30KMachines 300KMachines 3KQueries Query 30KQueries 300KQueries


Size 3,154 28,869 288,690 3,888 30,635 303,991


Unique b(s) 960 960 960 3,636 3,070 184

Unique e(s) 960 960 960 3,638 3,070 184

October 28, 2016

59 / 106



Size 3,154 28,869 288,690 3,888 30,635 303,991

Unique b(s) 960 960 960 3,636 3,070 184

Unique e(s) 960 960 960 3,638 3,070 184

State-of-the-art: Silk extended to deal with spatio-temporal data Baseline for eq using brute-force



October 28, 2016

59 / 106



Size 3,154 28,869 288,690 3,888 30,635 303,991

Unique b(s) 960 960 960 3,636 3,070 184

Unique e(s) 960 960 960 3,638 3,070 184

State-of-the-art: Silk extended to deal with spatio-temporal data Baseline for eq using brute-force

Evaluation measures: atomic runtime of each of the atomic relations relation runtime required to compute each Allen’s relation total runtime required to compute all 13 relations Ngonga Ngomo et al. (InfAI & FORTH)


October 28, 2016

59 / 106

AEGLE Results

Q1 : Does the reduction of Allen relations to 8 atomic relations influence the overall runtime of the approach?



October 28, 2016

60 / 106

AEGLE Results

Q1 : Does the reduction of Allen relations to 8 atomic relations influence the overall runtime of the approach?



October 28, 2016

61 / 106

AEGLE Results

Q2 : How does Aegle perform when compared with the state of the art in terms of time efficiency? Log Type Machine

Query

Dataset Name 3KMachines 30KMachines 300KMachines 3KQueries 30KQueries 300KQueries


Total Runtime Aegle Aegle * Silk 11.26 5.51 294.00 1,016.21 437.79 29,846.00 189,442.16 78,416.61 NA 26.94 17.91 541.00 988.78 463.27 33,502.00 211,996.88 86,884.98 NA


October 28, 2016

62 / 106

AEGLE Results

Relation m eq ovi

Approach Aegle Silk Aegle Silk baseline Aegle Silk

3KMachines 0.02 23.00 0.05 23.00 2.05 3.16 22.00

Machine 30KMachines 0.19 2,219.00 0.79 2,250.00 171.10 222.27 2,189.00

300KMachines 3.42 NA 49.84 NA 23,436.30 38,226.32 NA

3KQueries 0.02 41.00 0.05 41.00 3.15 11.97 42.00

Query 30KQueries 0.21 2,466.00 0.45 2,473.00 196.09 257.59 2,503.00

300KQueries 3.89 NA 348.51 NA 31,452.54 42,121.68 NA

Conclusion Aegle = efficient temporal linking Reduction of 13 Allen Interval relations to 8 atomic relations Efficiency: simple sorting with complexity O(n log n) Scalable LD



October 28, 2016

63 / 106

Table of Contents 1

Introduction

2

LIMES

3

MultiBlock

4


5


6

GNOME

7

GPUs and Hadoop

8




October 28, 2016

64 / 106

Gnome Problem

What if ... Data does not fit memory C , i.e., |S| + |T | > |C |

1 2

Common problem Growing size and number of datasets ≈ 150 × 109 triples ≈ 10 × 103 datasets Largest dataset with > 20 × 109 triples

3

Mostly in-memory solutions



October 28, 2016

65 / 106

Gnome Idea

Insight Most approaches rely on divide-and-merge paradigm Example: HR3 σ(s, t) ≥ θ ⇔ δ(s, t) ≤ ∆



October 28, 2016

66 / 106

Gnome Idea

Insight Most approaches rely on divide-and-merge paradigm Example: HR3 σ(s, t) ≥ θ ⇔ δ(s, t) ≤ ∆



October 28, 2016

66 / 106

Gnome Formal Model 1

Define S = {S1 , . . . , Sn } with Si ⊆ S ∧

S

Si = S

i



October 28, 2016

67 / 106



S

Si = S S Define T = {T1 , . . . , Sm } with Tj ⊆ T ∧ Tj = T i

2

j



October 28, 2016

67 / 106



S


2

j 3

Find mapping function µ : S → 2T with elements of Si only compared with elements of sets in µ(Si ) union of results over all Si ∈ S is exactly M 0 .



October 28, 2016

67 / 106



S


2

j 3

Find mapping function µ : S → 2

T

with

elements of Si only compared with elements of sets in µ(Si ) union of results over all Si ∈ S is exactly M 0 .



October 28, 2016

67 / 106

GNOME Task Graph

Definition A task Eij stands for comparing Si with Tj ∈ µ(Si )



October 28, 2016

68 / 106

GNOME Task Graph

Definition A task Eij stands for comparing Si with Tj ∈ µ(Si ) Task Graph G = (V , E , wv , we ), with V =S ∪T wv (v ) = |V | we (eij ) = |Si ||Tj |

T1 3

2

S1

2

3

S2

T3

1 T2 4 S3 Ngonga Ngomo et al. (InfAI & FORTH)


October 28, 2016

68 / 106

GNOME Task Graph


T1 3

6

2

S1 3

2

3

S2

T3

1 T2 4 S3



October 28, 2016

68 / 106

GNOME Task Graph


T1 3

6

2

4

S1

S2 3

1

3

2 6

T3

2 12

T2 4

4 S3



October 28, 2016

68 / 106

GNOME Problem Reformulation

Locality maximization Two-step approach: 1 2

Clustering: Find groups of nodes that fit in memory and Scheduling: Compute sequence of groups that minimizes hard drive access

T1 3

6

2

4

S1

3

2 S2

3

1

6

T3

2 12

T2 4 S3 Ngonga Ngomo et al. (InfAI & FORTH)

4


October 28, 2016

69 / 106

GNOME Step1: Clustering

Naïve Approach Cluster by Si Example: |C | = 7 T1 3

6

2

4

S1

3

2 S2

3

1

6

T3

2 12


4


October 28, 2016

70 / 106



6

2

4

S1

3

2 S2

3

1

6

T3

2 12


4


October 28, 2016

70 / 106



6

2

4

S1

3

2 S2

3

1

6

T3

2 12


4


October 28, 2016

70 / 106



6

2

4

S1

3

2 S2

3

1

6

T3

2 12


4


October 28, 2016

70 / 106



6

2

4

S1

3

2 S2

3

1

6

T3

2 12


4


October 28, 2016

70 / 106



6

2

4

S1

3

2 S2

3

1

6

T3

2 12


4


October 28, 2016

70 / 106


Greedy Approach Start by largest task Add connected largest tasks until none fits in C Example: |C | = 7 T1 3

6

2

4

S1

3

2 S2

3

1

6

T3

2 12


4


October 28, 2016

71 / 106



6

2

4

S1

3

2 S2

3

1

6

T3

2 12


4


October 28, 2016

71 / 106



6

2

4

S1

3

2 S2

3

1

6

T3

2 12


4


October 28, 2016

71 / 106



6

2

4

S1

3

2 S2

3

1

6

T3

2 12


4


October 28, 2016

71 / 106



6

2

4

S1

3

2 S2

3

1

6

T3

2 12


4


October 28, 2016

71 / 106

GNOME Step2: Scheduling

Insights Output of clustering: Sequence G1 , . . . , GN of clusters Intuition: Consecutive clusters should share data Goal: Maximize overlap of generated sequence



October 28, 2016

72 / 106


Insights Output of clustering: Sequence G1 , . . . , GN of clusters Intuition: Consecutive clusters should share data Goal: Maximize overlap of generated sequence T1 P

Overlap o(Gi , Gj ) =

|v |

v ∈V (Gi )∩V (Gj )

o(G4 , G1 ) = 4

3

6

2

4

S1

3

2 S2

3

1

6

T3

2 12

T2 4 S3



4

October 28, 2016

72 / 106


Insights Output of clustering: Sequence G1 , . . . , GN of clusters Intuition: Consecutive clusters should share data Goal: Maximize overlap of generated sequence T1 P

Overlap o(Gi , Gj ) =

|v |

v ∈V (Gi )∩V (Gj )

o(G4 , G1 ) = 4 Overlap o(G1 , . . . , GN ) =

N−1 P

3

6

o(Gi , Gi+1 )

4

S1 1

6

T3

2 12

T2 4

o(G4 , G1 , G2 , G1 ) = 9

S3


3

2 S2

3

i=1

2


4

October 28, 2016

72 / 106


Best-Effort Select random pair of clusters If permutation improves overlap, then permute Relies on local knowledge for scalability



October 28, 2016

73 / 106


Best-Effort Select random pair of clusters If permutation improves overlap, then permute Relies on local knowledge for scalability Trick: ∆(Gi , Gj ) = (o(Gi−1 , Gj ) + o(Gj , Gi+1 ) + o(Gj−1 , Gi ) + o(Gi , Gj+1 ))− (o(Gi−1 , Gi ) + o(Gi , Gi+1 ) + o(Gj−1 , Gj ) + o(Gj , Gj+1 )) G3 G4


0 4

G1 G1

3 3

G2 G2


2 2

(1)

G4 G3

October 28, 2016

73 / 106


Greedy Start with random cluster Choose next cluster with largest overlap Global knowledge needed



October 28, 2016

74 / 106


Greedy Start with random cluster Choose next cluster with largest overlap Global knowledge needed G3 G3


0 2

G1 G2

3 3

G2 G1


2 4

G4 G4

October 28, 2016

74 / 106

GNOME Experimental Setup

Datasets 1

2

DBP: 1 million labels from DBpedia version 04-2015 LGD: 0.8 million places from LinkedGeoData

Hardware 1

2 3

Intel Xeon E5-2650 v3 processors (2.30GHz) Ubuntu 14.04.3 LTS 10GB RAM

Measures 1 2

Total runtime Hit ratio



October 28, 2016

75 / 106

GNOME Evaluation of Clustering

Only show results of LGD Results on DBP lead to similar insights |C |

Runtimes Naive Greedy

100 200 400

568.0 518.3 532.0

646.3 594.0 593.3

0.57 0.66 0.67

0.77 0.80 0.80

1,000 2,000 4,000

5,974.0 6,168.0 7,118.3

118,454.7 115,450.0 121,901.7

0.51 0.51 0.50

0.64 0.63 0.63



Hit Ratio Naive Greedy

October 28, 2016

76 / 106

GNOME Evaluation of Clustering


Runtimes Naive Greedy

Hit Ratio Naive Greedy

100 200 400

568.0 518.3 532.0

646.3 594.0 593.3

0.57 0.66 0.67

0.77 0.80 0.80

1,000 2,000 4,000

5,974.0 6,168.0 7,118.3

118,454.7 115,450.0 121,901.7

0.51 0.51 0.50

0.64 0.63 0.63

Conclusion 1

Naïve approach is more efficient

2

Greedy approach is more effective

3

Select naïve approach for clustering



October 28, 2016

76 / 106

GNOME Evaluation of Scheduling


Runtimes (ms) Best-Effort Greedy

Hit ratio Best-Effort Greedy

100 200 400

571.3 565.7 581.0

1,599.3 1,448.3 1,379.3

0.56 0.66 0.67

0.68 0.85 0.88

1,000 2,000 4,000

5,666.0 6,268.0 6,675.7

814,271.7 810,855.0 814,041.7

0.51 0.51 0.50

0.86 0.86 0.86



October 28, 2016

77 / 106

GNOME Evaluation of Scheduling


Runtimes (ms) Best-Effort Greedy

Hit ratio Best-Effort Greedy

100 200 400

571.3 565.7 581.0

1,599.3 1,448.3 1,379.3

0.56 0.66 0.67

0.68 0.85 0.88

1,000 2,000 4,000

5,666.0 6,268.0 6,675.7

814,271.7 810,855.0 814,041.7

0.51 0.51 0.50

0.86 0.86 0.86

Conclusion 1

Best-effort approach more time-efficient

2

Greedy more effective

3

Best-effort approach is to be used for scheduling



October 28, 2016

77 / 106

GNOME Comparison with Caching Approaches

|C | 1,000 2,000 4,000

Gnome 5,974.0 6,168.0 7,118.3

FIFO 37,161.0 31,977.0 21,337.0

1,000 2,000 4,000

0.51 0.51 0.51

0.17 0.29 0.54


Runtimes (ms) F2 LFU 42,090.3 45,906.7 39,071.3 39,872.0 40,860.0 28,028.3 Hit ratio 0.16 0.19 0.30 0.32 0.55 0.59


LRU 54,194.3 45,473.0 26,816.7

SLRU 56,904.3 46,795.0 27,200.0

0.17 0.30 0.55

0.17 0.30 0.56

October 28, 2016

78 / 106

GNOME Comparison with Caching Approaches

|C | 1,000 2,000 4,000

Gnome 5,974.0 6,168.0 7,118.3

FIFO 37,161.0 31,977.0 21,337.0

1,000 2,000 4,000

0.51 0.51 0.51

0.17 0.29 0.54

Runtimes (ms) F2 LFU 42,090.3 45,906.7 39,071.3 39,872.0 40,860.0 28,028.3 Hit ratio 0.16 0.19 0.30 0.32 0.55 0.59

LRU 54,194.3 45,473.0 26,816.7

SLRU 56,904.3 46,795.0 27,200.0

0.17 0.30 0.55

0.17 0.30 0.56

Conclusion 1 2

Gnome is more time-efficient Leads to higher hit rates in most cases



October 28, 2016

78 / 106

GNOME Scalability

LGD DBP

100k

200k

400k

800k

362,141.3 434,630.7

1,452,922.0 1,790,350.7

5,934,038.7 6,677,923.0

20,001,965.7 12,653,403.3

Conclusion Sub-quadratic growth of runtime Runtime grows linearly with number of mappings For LGD, 360 – 370 mappings/s



October 28, 2016

79 / 106

Table of Contents 1

Introduction

2

LIMES

3

MultiBlock

4


5


6

GNOME

7

GPUs and Hadoop

8




October 28, 2016

80 / 106

Reach for the Cloud? Dealing with Time Complexity

1

Devise better algorithms

Blocking Algorithms for given metrics PPJoin+, EDJoin HR3



October 28, 2016

81 / 106

Reach for the Cloud? Dealing with Time Complexity

1

Devise better algorithms

Blocking Algorithms for given metrics PPJoin+, EDJoin HR3


2

Use parallel hardware

Threads pools MapReduce Massively Parallel Hardware LD Tutorial: Efficiency

October 28, 2016

81 / 106

Reach for the Cloud? Parallel Implementations

Different architectures Memory (shared, hybrid, distributed) Execution paths (different, same) Location (remote, local)



October 28, 2016

82 / 106

Reach for the Cloud? Parallel Implementations

Different architectures Memory (shared, hybrid, distributed) Execution paths (different, same) Location (remote, local)

Question When should which use which hardware?



October 28, 2016

82 / 106

Reach for the Cloud? Premises and Goals

Premises Given an algorithm that runs on all three architectures ...



October 28, 2016

83 / 106


Premises Given an algorithm that runs on all three architectures ... Note to self: Implement one Picked HR3 Reduction-ratio-optimal



October 28, 2016

83 / 106


Premises Given an algorithm that runs on all three architectures ... Note to self: Implement one Picked HR3 Reduction-ratio-optimal

Reach for the Cloud? Goals Compare runtimes on all three parallel architectures Find break-even points



October 28, 2016

83 / 106

Reach for the Cloud? HR3 1

Assume spaces with Minkowski metric and p ≥ 2 v u n uX p δ(s, t) = t |si − ti |p i=1



October 28, 2016

84 / 106


Create grid of width ∆ = θ/α



October 28, 2016

85 / 106


Link discovery condition describes a hypersphere



October 28, 2016

86 / 106


Approximate hypersphere with hypercube



October 28, 2016

87 / 106


Use index to discard portions of hypercube



October 28, 2016

88 / 106

Reach for the Cloud? HR3 in GPUs

Large number of simple compute cores Same instruction, multiple data Bottleneck: PCI Express Bus 1 2 3

Run discretization on CPU Run indexing on GPU Run comparisons on CPU Global/constant memory

Global/constant cache

Work group Work item

Local memory


Local memory

Local memory


Local memory

October 28, 2016

89 / 106

Reach for the Cloud? HR3 on the Cloud

Naive Approach



October 28, 2016

90 / 106


Naive Approach 1 2 3

Rely on Map Reduce paradigm Run discretization and assignment to cubes in map step Run distance computation in reduce step



October 28, 2016

90 / 106


Load Balancing 1 2 3

Run two jobs Job1: Compute cube population matrix Job2: Distribute balanced linking tasks across mappers and reducers



October 28, 2016

91 / 106

Reach for the Cloud? Evaluation Hardware

1

CPU (Java) 32-core server running Linux 10.0.4 AMD Opteron 6128 clocket at 2.0GHz

2

GPU (C++) AMD Radeon 7870 with 20 compute units, 64 parallel threads Host program ran on Intel Core i7 3770 CPU with 8GB RAM and Linux 12.10 Ran the Java code on the same machine for scaling

3

Cloud (Java) 10 c1.medium nodes (2 cores, 1.7GB) for small experiments 30 c1.large nodes (8 cores, 7GB) for large experiments



October 28, 2016

92 / 106

Reach for the Cloud? Experimental Setup

Run deduplication task Evaluate behavior on different number of dimensions Important: Scale results Different hardware (2-7 times faster C++ workstation) Programming language

Evaluate scalability (DS4 )



October 28, 2016

93 / 106

Reach for the Cloud? Results – DS1

DBpedia, 3 dimensions, 26K GPU scales better



October 28, 2016

94 / 106


DBpedia, 2 dimensions, 475K GPUs scale better across different θ Break-even point ≈ 108 results



October 28, 2016

95 / 106


LinkedGeoData, 2 dimensions, 500K Similar picture



October 28, 2016

96 / 106

Reach for the Cloud? Results – Scalability

LinkedGeoData, 2 dimensions, 6M Cloud (with load balancing) better for ≈ 1010 + results



October 28, 2016

97 / 106

Reach for the Cloud? Summary

Question: When should we use which hardware for link discovery? Results Implemented HR3 on different hardware Provided the first implementation of link discovery on GPUs Devised a load balancing approach for linking on the cloud Discovered 108 and 1010 results as break-even points



October 28, 2016

98 / 106

Reach for the Cloud? Summary

Question: When should we use which hardware for link discovery? Insights Load balancing important for using the cloud GPUs: Need faster buses that PCIe (e.g., Firewire speed) Accurate use of local resources sufficient for most of the current applications



October 28, 2016

99 / 106

Table of Contents 1

Introduction

2

LIMES

3

MultiBlock

4


5


6

GNOME

7

GPUs and Hadoop

8




October 28, 2016

100 / 106

Summary and Conclusion Four approaches to scalability 1

2

3 4

Improve reduction ratio (LIMES, MultiBlock, HYPPO, HR3 , . . . ) Reduce runtime complexity (AEGLE) Better use of hardware (GNOME) Improve execution of specifications



October 28, 2016

101 / 106

Summary and Conclusion Four approaches to scalability 1

2

3 4

Improve reduction ratio (LIMES, MultiBlock, HYPPO, HR3 , . . . ) Reduce runtime complexity (AEGLE) Better use of hardware (GNOME) Improve execution of specifications

Challenges include 1

2 3 4

5

Increase number of reduction-ratio-optimal approaches (HR3 , ORCHID) Adaptive resource scheduling Self-regulating approaches Distribution in modern in-memory architecture (SPARK) ...



October 28, 2016

101 / 106

Acknowledgment

This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).



October 28, 2016

102 / 106

References I Kleanthi Georgala, Mohamed Ahmed Sherif, and Axel-Cyrille Ngonga Ngomo. “An Efficient Approach for the Generation of Allen Relations”. In: ECAI 2016 - 22nd European Conference on Artificial Intelligence, 29 August-2 September 2016, The Hague, The Netherlands - Including Prestigious Applications of Artificial Intelligence (PAIS 2016). Ed. by Gal A. Kaminka et al. Vol. 285. Frontiers in Artificial Intelligence and Applications. IOS Press, 2016, pp. 948–956. isbn: 978-1-61499-671-2. doi: 10.3233/978-1-61499-672-9-948. url: http://dx.doi.org/10.3233/978-1-61499-672-9-948. Robert Isele, Anja Jentzsch, and Christian Bizer. “Efficient Multidimensional Blocking for Link Discovery without losing Recall”. In: Proceedings of the 14th International Workshop on the Web and Databases 2011, WebDB 2011, Athens, Greece, June 12, 2011. Ed. by Amélie Marian and Vasilis Vassalos. 2011. url: http://webdb2011.rutgers.edu/papers/Paper%2039/silk.pdf. Ngonga Ngomo et al. (InfAI & FORTH)


October 28, 2016

103 / 106

References II Axel-Cyrille Ngonga Ngomo and Sören Auer. “LIMES - A Time-Efficient Approach for Large-Scale Link Discovery on the Web of Data”. In: IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011. Ed. by Toby Walsh. IJCAI/AAAI, 2011, pp. 2312–2317. isbn: 978-1-57735-516-8. doi: 10.5591/978-1-57735-516-8/IJCAI11-385. url: http: //dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-385. Axel-Cyrille Ngonga Ngomo et al. “When to Reach for the Cloud: Using Parallel Hardware for Link Discovery”. In: The Semantic Web: Semantics and Big Data, 10th International Conference, ESWC 2013, Montpellier, France, May 26-30, 2013. Proceedings. Ed. by Philipp Cimiano et al. Vol. 7882. Lecture Notes in Computer Science. Springer, 2013, pp. 275–289. isbn: 978-3-642-38287-1. doi: 10.1007/978-3-642-38288-8_19. url: http://dx.doi.org/10.1007/978-3-642-38288-8_19.



October 28, 2016

104 / 106

References III Axel-Cyrille Ngonga Ngomo. “Link Discovery with Guaranteed Reduction Ratio in Affine Spaces with Minkowski Measures”. In: The Semantic Web - ISWC 2012 - 11th International Semantic Web Conference, Boston, MA, USA, November 11-15, 2012, Proceedings, Part I. Ed. by Philippe Cudré-Mauroux et al. Vol. 7649. Lecture Notes in Computer Science. Springer, 2012, pp. 378–393. isbn: 978-3-642-35175-4. doi: 10.1007/978-3-642-35176-1_24. url: http://dx.doi.org/10.1007/978-3-642-35176-1_24. Axel-Cyrille Ngonga Ngomo. “ORCHID - Reduction-Ratio-Optimal Computation of Geo-spatial Distances for Link Discovery”. In: The Semantic Web - ISWC 2013 - 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21-25, 2013, Proceedings, Part I. Ed. by Harith Alani et al. Vol. 8218. Lecture Notes in Computer Science. Springer, 2013, pp. 395–410. isbn: 978-3-642-41334-6. doi: 10.1007/978-3-642-41335-3_25. url: http://dx.doi.org/10.1007/978-3-642-41335-3_25.



October 28, 2016

105 / 106

References IV Axel-Cyrille Ngonga Ngomo. “HELIOS - Execution Optimization for Link Discovery”. In: The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part I. Ed. by Peter Mika et al. Vol. 8796. Lecture Notes in Computer Science. Springer, 2014, pp. 17–32. isbn: 978-3-319-11963-2. doi: 10.1007/978-3-319-11964-9_2. url: http://dx.doi.org/10.1007/978-3-319-11964-9_2. Axel-Cyrille Ngonga Ngomo and Mofeed M. Hassan. “The Lazy Traveling Salesman - Memory Management for Large-Scale Link Discovery”. In: The Semantic Web. Latest Advances and New Domains - 13th International Conference, ESWC 2016, Heraklion, Crete, Greece, May 29 - June 2, 2016, Proceedings. Ed. by Harald Sack et al. Vol. 9678. Lecture Notes in Computer Science. Springer, 2016, pp. 423–438. isbn: 978-3-319-34128-6. doi: 10.1007/978-3-319-34129-3_26. url: http://dx.doi.org/10.1007/978-3-319-34129-3_26. Ngonga Ngomo et al. (InfAI & FORTH)


October 28, 2016

106 / 106

Link Discovery Tutorial - Part I: Efficiency

Link Discovery Tutorial - Part I: Efficiency

Suggest Documents

IPv6 Technology Overview Tutorial- Part I Tutorial- Part I - Nanog

Excel Tutorial - Part I

ADA.org: NBDE Part I Tutorial - Ning

Prolog Tutorial Part I - University of Toronto

SAGU Discovery Tutorial

Kusudama Tutorial part 1

C++ Tutorial Part I : Procedural Programming - Sherrill Group

Small tutorial for installing Hadoop MapReduce on Windows Part I ...

Video Editing Part I of II - Tutorial #4

BackTrack 5 tutorial Part I: Information gathering and ... - TechTarget

Tutorial on Convex Optimization for Engineers Part I

PART I PART II

EE Times - Tutorial on the Link Layer Discovery Protocol

Discovery I

The Problem â Part I - Springer Link

PRIVATE LAW EXCEPTIONALISM? PART I - Springer Link

The Solution â Part I - Springer Link

The Rationality of Scientific Discovery Part I - Semantic Scholar

Advances in antiviral drug discovery and development: Part I ...

The Rationality of Scientific Discovery Part I - Semantic Scholar

VBScript Database Tutorial Part 1

PICAXE VSM Tutorial Part 1

Digital Instruments and Players: Part I â Efficiency ... - Semantic Scholar

Supplementary cementing materials in concrete Part I: efficiency and

Link Discovery Tutorial - Part I: Efficiency