Starting discussions….

Multimedia (and Web) Databases

K. Selçuk Candan
Associate Professor
Arizona State University

What is media?

– Text/document
– Images
– Video
– Audio
– …..

What is multimedia?????
What is hypermedia????

Sample application (boring…)…

• Police investigation…
  – Video data (surveillance cameras)
  – Audio data (telephone wiretaps)
  – Image data (surveillance, mugshots)
  – Document data (police reports)
  – Conventional data (bank records, employment records, police records)
  – Geographic data (maps)

Sample multimedia query

• “Find the records of every criminal who looks like the person seen in “surv_im.gif” and who had a bank transfer of more than $500,000 within the last 5 months. Return all police reports which mention such persons and their past accomplices.”

ARIA (Architecture for Interactive Arts):

Quality-Adaptive Media-Flow Architectures to Support Sensor Data Management

• Media-stream management for intelligentStage
• intelligentStage (@ fine arts college)
  – ..to incorporate real-time and archived media into live performances, on-demand
  – ..to enable artists/performers to have real-time control of the stage
  – …equipped with video cameras, pressure sensors, microphone arrays, and real-time 3D-motion tracking

1

Logical view

Scenario
• Two performers, an adult and a child, are tracked by a 3D-motion tracking device.
• Their locations on the stage are tracked using pressure sensors.
• ARIA continuously monitors the positions of the performers' body markers in 3D space, the positions of the performers, and the details of their gestures.
• The output of the 3D motion tracking is filtered through ARIA to obtain the shape of the pose and its degree of confidence.

8

K. Selcuk Candan

10

K. Selcuk Candan

Semantic Heterogeneity

• Spatio-temporal dimensions
  – modeling
  – specification
  – indexing, retrieval, and visualization methods
• User- and context-dependence, subjectivity
• Availability at various quality levels

Subjectivity

Physical Heterogeneity

• Volume
  – storage, delivery, and processing
• Quality/cost trade-off
  – increases robustness, graceful degradation

Interactivity

• 100ms interaction deadline
  – resource allocation
  – prefetching/caching
• Subjectivity and personalization of content
• Interaction structure

Example: X3D

• X3D: Extensible 3D Graphics
  – Based on VRML 97 (Virtual Reality Modeling Language)

[Figure: X3D scene graph with nodes Root, Transform (Translation -2 3 4), Color (10 100 10), Shine (0.75), Box (2 2 2)]

13

K. Selcuk Candan

14

What is an X3D world?

• Example: car-road world (a police car and a road)

[Figure: scene graph of the car-road world with nodes Transform (-2 5 10), Color (20 10 10), Shine (0.8), Circle (2), Transform (-1 1 1)]

What does an X3D document contain?

– Media (image files)
– 3D mesh geometry (human objects)
– Shape primitives (boxes that form cars)
– Node structure/hierarchy
– Spatial structure (position of the car, transformation hierarchy)
– Event/interaction structure (sensors/scripts)
– Temporal structure (motion/timeline)
– Metadata (comments, variable names)

What is a database?

• A system which allows access to a collection of data
  – User specifies what to see
  – System retrieves the corresponding data from the collection
  – The system presents the retrieved info to the user
  – ..different from browsing
• How does it differ from a file system?
• How does it differ from an operating system?

K. Selcuk Candan

3

Queries z z

Metadata queries Example queries – –

z

What is an image? z z

Exact Partial match



Object queries – – –

z

K. Selcuk Candan

z

z

local or web

z

Size of data Properties of data –

A query processor (indices etc.) which –

z

K. Selcuk Candan

Why “image” database?

A collection of images





maps user query into data model retrieves the relevant images

z

– –

z

How to let users specify what they want?

23

K. Selcuk Candan

What are the features that interest us?

What kind of images? Mug shots, cat-scans, finger prints z News, advertisement, family photos z Surveillance z Video frames z

z

Colors, color histograms

z

Edges

z z

Texture Image segments

z

Objects

z

Metadata, captions









24

Similarity-based query processing new index structures relevance ordering

Query language –

K. Selcuk Candan

Visual: image processing Semantical: pattern recognition

Similarity-based retrieval –

An information visualization system which shows results to the user

22

visual semantical

21

What is an image database?



An object is an entity within an image z

visual similarity semantic similarity spatial similarity

20

z

2D matrix of values Collection of objects and their spatial relationships

K. Selcuk Candan

25

“sunny day”, “sea” “maps”, “aerial surveillance”

shape,location, color visual features, semantics K. Selcuk Candan

4

What kind of queries? z z

Find me all images created by “John Smith” Find all images which look like “im_ex.gif” –

z z

26

K. Selcuk Candan

27

What kind of queries? z z z

z

28

z

advertisement

Find all images which contain a car
Find all images which contain a car and a man who looks like “mugshot.bmp” –

z

– –

surveillance

data mining

K. Selcuk Candan

first object looks like “im.gif” second object is a car first obj. to the right of second obj.

and return the semantics of these two objects. 29

Example

30

Find all objects contained in images of sunny days Find all images which contain two objects –

Find all image pairs which contain similar objects –

K. Selcuk Candan

What kind of queries?

Find all images of sunny days –

Find me top-5 images which look like “im_ex.gif”

Find all images which look like “sketch.bmp”
Find all images which contain a part which looks like ….

K. Selcuk Candan

QBE (visual representation)

K. Selcuk Candan

31

K. Selcuk Candan

5

Query…and results…

Relational databases z

Data is – –

z

This is the main assumption for – – –

32

K. Selcuk Candan

textual numerical storage query processing optimization

33

K. Selcuk Candan

Attribute

z

Information is in tabular form –

z z z

NAME       SSN        OFFICE     DESC
..         ..         ..         ..
J. Doe     555-5555   GWC 999    Asst. prof
J. Smith   333-3333   GWC 989    Prof
..         ..         ..         ..

Example: Information about an employee

Schema describes the content A key uniquely identifies a given tuple Each attribute has a domain

34

Index structures

Schema

K. Selcuk Candan

36

Algebra z z

– – – – –

z

37

PAGE

K. Selcuk Candan

Calculus

A set of data manipulation operators Relational algebra (operates on relations) –

tuple

Relational databases

z

A query language should be declarative: –

Say what we want z

– Select (σ)
– Project (π)
– Cartesian product (×), join
– Union (∪)
– Intersection (∩)
– Difference (-)



query optimization

Don’t say how we get it z

no optimization possible

{t.name | (t ∈ Employee) and (t.salary < 1000) and (exists t2 ((t2 ∈ Student) and (t2.gpa > 3.7) and (t.ssn = t2.ssn)))}

Procedural (non-declarative) K. Selcuk Candan

38

K. Selcuk Candan

6

SQL z

Relational databases

Based on relational calculus

z

select from where

z z z

select t.name from employee t, student t2 where (t.salary < 1000) and (t2.gpa > 3.7) and (t.ssn = t2.ssn)
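The query above can be tried out on a small in-memory database. The following is a minimal sketch using Python's standard `sqlite3` module; the table layouts and the sample rows are assumptions made for illustration and do not come from the slides.

```python
import sqlite3

def low_paid_top_students(rows_employee, rows_student):
    """Run the slide's query on an in-memory SQLite database."""
    con = sqlite3.connect(":memory:")
    # Assumed schemas: only the attributes used by the query are modeled.
    con.execute("CREATE TABLE employee (name TEXT, ssn TEXT, salary REAL)")
    con.execute("CREATE TABLE student (ssn TEXT, gpa REAL)")
    con.executemany("INSERT INTO employee VALUES (?, ?, ?)", rows_employee)
    con.executemany("INSERT INTO student VALUES (?, ?)", rows_student)
    query = """
        SELECT t.name
        FROM employee t, student t2
        WHERE t.salary < 1000 AND t2.gpa > 3.7 AND t.ssn = t2.ssn
    """
    names = [row[0] for row in con.execute(query)]
    con.close()
    return names

# Hypothetical sample data (not from the slides).
EMPLOYEES = [("J. Doe", "555-5555", 900.0), ("J. Smith", "333-3333", 5000.0)]
STUDENTS = [("555-5555", 3.9), ("333-3333", 3.8)]
print(low_paid_top_students(EMPLOYEES, STUDENTS))
```

The query only states what is wanted; the join order and access paths are left to the database engine, which is what makes optimization possible.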

z

40

K. Selcuk Candan

Business applications Data model is relational Queries are exact/declarative Updates are important Concurrency is important

41

K. Selcuk Candan

43

K. Selcuk Candan

What does a database look like? SQL

Application interfaces Query parser

Transaction processing

Query Optimizer (cost based) Query Processor

Replication manager

Indices Data

42

K. Selcuk Candan

Example: X3D/VRML Archive

44

K. Selcuk Candan

45

K. Selcuk Candan

7

Relational databases (??) z z z z z

Shortcomings…

Business applications Data model is relational Queries are exact/declarative Updates are important Concurrency is important

z

Image data doesn’t fit into tuples

z

No image comparison No partial match processing No ranking Not computationally complete



z z z

Media data needs to be kept separately

z

46

K. Selcuk Candan

47

Solutions

K. Selcuk Candan

Other problems? z

z

z

Media processing requires more computational power.

Use a host language and embed database queries in it (relational approach)

z



Provide more computational power in the data model itself (object- oriented approach)

48

K. Selcuk Candan

It does not capture the semantical structure of the data well Hierarchies: –

Aggregation hierarchy
Inheritance hierarchy

49

Aggregation hierarchy

Inheritance hierarchy

Digital library Multimedia object

video

video frame Live video pixels

text

xml

objects

pixels

50

Recorded video

document

semantics

K. Selcuk Candan

51

K. Selcuk Candan

8

OODB z

Object oriented databases provide – – –

z

z

– Higher computational power
– Aggregation hierarchies
– Inheritance hierarchies

z z z

They model the real world better! –

z

OODB

z

Everything is an object

You can define your own external methods

z

Business applications Multimedia (??) Data model is object oriented Queries are exact Queries are procedural (some declarative languages) Concurrency/updates are important

E.image_similar_to (c.image)

52

K. Selcuk Candan

53

Shortcomings… z z z z

Object Relational Databases

Too much overhead –

K. Selcuk Candan

z

Optimization is hard



No partial match processing No ranking Query processing is cost driven –

z



z

K. Selcuk Candan

What else? z

No partial match processing No ranking Query processing is cost driven –

Deductive databases – –

z



K. Selcuk Candan

57

Logic based Boolean queries

Fuzzy databases –

not “similarity” driven



56

– User defined functions
– User defined abstract data types (ADTs)

55

Shortcomings…

z

tuples SQL

Object technology z

K. Selcuk Candan

z

Relational technology z

not “similarity” driven

54

z

Benefits from both

– usually logic-based, but not boolean
– nothing is true or false
– results are not exact (like multimedia queries)

K. Selcuk Candan

9

What else? z

What else?

Spatial/Temporal Databases – – –

z z



z

Scientific, geographic applications Data model is vector or interval based Queries

Data mining – – –

Range queries Nearest neighbor queries

Business, scientific applications Relational data model Queries: find z z

Queries are declarative or visual

z z

58

K. Selcuk Candan

59

What else? z



z

Most data has a well-defined structure (schema) In SSD, there is no common schema z

z

K. Selcuk Candan

…and

Semi- structured data management –

Image databases –

Data model is feature-vector based z



z

Structure-based

– –

z K. Selcuk Candan

61

Research Issues z



z



62

z

z

Content, features of interest Information extraction/integration

Query-by-example Ranking Feedback (user-to-system, system-to-user) K. Selcuk Candan

Query processing – –

Query Language –

Each feature represented as a vector space

Research Issues

Data model –

Color Texture

Structure may or may not be available Queries z

60

Multiple features –

each object describe itself

Queries –

patterns, rules, classes, or outliers

– –

matches the data model captures user’s interest



K. Selcuk Candan

63

Online vs. off-line information extraction Indices for different media Optimization of queries with different media Similarity-based retrieval, ranking Relevance feedback

K. Selcuk Candan

10

Research issues z

Storage/delivery – –

z



– speed
– precision and recall

5

0

How to transmit large, continuous data (video) How to visualize/present results of a query which may contain multiple types of data (images, video, audio) K. Selcuk Candan

65

Vectors…what are they??? z

Image with 1 pixel

Visualization –

64

z

How to store data in different formats How to retrieve data efficiently z

z

Vectors…what are they???

3

K. Selcuk Candan

Vectors…what are they???

Image with 2 pixels

z

7

Image with 3 pixels

7

0

0

5

5

3

66

K. Selcuk Candan

67

Distance between two images??? z

K. Selcuk Candan

Euclidean distance

Given A and B, how different are they? A

A ∆(A,B)

B

68

B

K. Selcuk Candan

69

K. Selcuk Candan

11

Which image is more similar to A?

C

Which image is more similar to A?

∆(A,C)

C

∆(A,C)

A

A

∆(A,B)

∆(A,B)

B

B

Closer to A Similar to A

70

K. Selcuk Candan

71

“Find 2 most similar images to A”

K. Selcuk Candan

“Find 2 most similar images to A”

A

72

A

K. Selcuk Candan

73

“Find images at most δ different from A”

Nearest-neighbor search

K. Selcuk Candan

“Find images at most δ different from A”

A

A δ

74

K. Selcuk Candan

75

range search
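The two query types above, nearest-neighbor search ("find the 2 most similar images to A") and range search ("find images at most δ different from A"), can be sketched as a brute-force scan. The image vectors below are hypothetical two-pixel images chosen for illustration, and Euclidean distance is assumed as the difference measure.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, images, k):
    """Return the names of the k images closest to the query."""
    ranked = sorted(images, key=lambda name: euclidean(query, images[name]))
    return ranked[:k]

def range_search(query, images, delta):
    """Return the names of all images at most delta away from the query."""
    return sorted(n for n, v in images.items() if euclidean(query, v) <= delta)

# Hypothetical two-pixel images (not from the slides).
IMAGES = {"B": (5, 3), "C": (0, 7), "D": (6, 6), "E": (1, 1)}
A = (5, 5)
print(nearest_neighbors(A, IMAGES, 2))
print(range_search(A, IMAGES, 2.5))
```

A scan like this touches every image; index structures exist to avoid that cost.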

K. Selcuk Candan

12

Are there other similarity measures?

Let’s try angles...

A

Let’s try angles...

E

F

3

2

5

A

5

5

5

E

2

2

2

Similar composition

E

K. Selcuk Candan

$\vec{x} = \langle x_1, x_2, \ldots, x_n \rangle$



80

2

2

2

F

3

2

5

A

5

5

5

E

2

2

2

If we use angles as a similarity measure, then A is more similar to E than F

K. Selcuk Candan

What is a good measure then??

$\vec{y} = \langle y_1, y_2, \ldots, y_n \rangle$

F

cos(AE) > cos(AF)

79

Angle-based measures

Dot product: $\vec{x} \cdot \vec{y} = \sum_{i=1}^{n} x_i y_i$

5

K. Selcuk Candan

A



5

5

Let’s try angles... Similar composition

Given

2

5

F

77

K. Selcuk Candan

z

3

A E

76

F A

i

z

Application dependent...

z

...but, distances in a metric space help indexing!

Cosine similarity

$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\| \, \|\vec{y}\|}$
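To see how the angle-based measure can disagree with Euclidean distance, the sketch below evaluates both on the three-pixel images used in the slides, A = (5, 5, 5), E = (2, 2, 2) and F = (3, 2, 5). The function names are chosen here for illustration.

```python
import math

def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine of the angle between x and y."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def euclidean(x, y):
    """Euclidean distance between x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

A, E, F = (5, 5, 5), (2, 2, 2), (3, 2, 5)
print("cos(A,E) =", cosine(A, E), " cos(A,F) =", cosine(A, F))
print("d(A,E)   =", euclidean(A, E), " d(A,F)   =", euclidean(A, F))
```

E has the same composition as A (all pixels equal), so the angle between them is zero even though F is closer to A in Euclidean terms.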

K. Selcuk Candan

81

K. Selcuk Candan

13

Metric distances (Minkowski metrics)

Metric model: axioms z

Any function d expressing a distance must satisfy the following axioms:
– self-minimality: d(s,s) = 0
– minimality: d(s1,s2) >= d(s1,s1)
– symmetry: d(s1,s2) = d(s2,s1)
– triangular inequality: d(s1,s2) + d(s2,s3) >= d(s1,s3)

L1-metric: d = dX + dY

dY dX

z

Example: Euclidean distance

X

82

K. Selcuk Candan

Metric distances (Minkowski metrics) z

Also called Manhattan Distance

83

K. Selcuk Candan

Metric distances (Minkowski metrics)

L2-metric: $d = (dX^2 + dY^2)^{1/2}$

z z

Y

z z

L3-metric: $d = (dX^3 + dY^3)^{1/3}$
.....
L(infinity): $d = \max\{dX, dY\}$

dY dX X

Also called Euclidean Distance
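The Minkowski family can be written as one function of the order p, with L(infinity) as the limiting case. This is a small sketch; the two points are an arbitrary example, not taken from the slides.

```python
def minkowski(a, b, p):
    """L_p distance: (sum |a_i - b_i|^p)^(1/p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def chebyshev(a, b):
    """L_infinity distance: the largest per-dimension difference."""
    return max(abs(x - y) for x, y in zip(a, b))

# Example points with dX = 3 and dY = 4.
P, Q = (0, 0), (3, 4)
for order in (1, 2, 3):
    print("L%d:" % order, minkowski(P, Q, order))
print("L(infinity):", chebyshev(P, Q))
```

For a fixed pair of points the distance does not grow as p grows, which is why L1 is the largest and L(infinity) the smallest of the four values.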

84

K. Selcuk Candan

85

Feature…

…metric model z

Well suited for certain kinds of similarity evaluation, such as color based comparisons

z

Consistent with widely used approaches from computer vision and pattern recognition communities –

z

86

K. Selcuk Candan

z z



student_ID

can be a feature z What are the features for an image?

results suggest that the L1 metric may better capture human notions of image similarity.

Makes it relatively easy to index data, modeled as vectors of properties, in terms of classical multidimensional indexing techniques. K. Selcuk Candan

…a property of interest that can help us index an object For a “student record”

87

K. Selcuk Candan

14

Image features z

There are many possible features – – – – – –

z

88

Good feature..

Color histogram Texture Edges Shapes Objects Object or scene semantics

z

A good feature corresponds to users’ perception as much as possible Relevance feedback!!!!

89

What does “significant” mean

K. Selcuk Candan

What does “significant” mean

Information-theoretic sense: –

A good feature is significant and enables us to differentiate objects from others as much as possible



Feature selection: which one to use for indexing? K. Selcuk Candan

z

z

z

An event is more significant if it carries more information

Information-theoretic sense:
– An event is more significant if it carries more information
– An event that has a high occurrence rate carries less information
  • A solar eclipse is more interesting than a sunset
  • High frequency ----- less information
  • Low frequency ----- high information

90

K. Selcuk Candan

91

Entropy z

92

K. Selcuk Candan

Entropy (example)

Total information content (uncertainty)

z

K. Selcuk Candan

94

Total information content (uncertainty)

P(a) = 0.5, P(b) = 0.5

H=1

P(a) = 1.0, P(b) = 0.0

H=0

more uncertain less uncertain

K. Selcuk Candan

15

Entropy (example) z

Which feature is better?

Total information content (uncertainty) F2

P(a) = 0.5, P(b) = 0.5

H=1

P(a) = 1.0, P(b) = 0.0

H=0

more uncertain more information less uncertain less information
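The two H values above follow from the standard Shannon entropy, H = -Σ p log2 p, which the slides appear to use (the formula itself is not spelled out in the extracted text, so treat it as an assumption). A minimal sketch:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # the more uncertain case
print(entropy([1.0, 0.0]))  # the less uncertain case
```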

95

K. Selcuk Candan

F1

96

Which feature is better?

F3 K. Selcuk Candan

Which feature is better?

F2

F2

F1

97

Better separation!

F1

F3 K. Selcuk Candan

98

Which feature is better?

F3 K. Selcuk Candan

Principal component

F2

F2

Better separation!

Better separation!

Less frequent! F1

99

F1

F3 K. Selcuk Candan

10 0

F3 K. Selcuk Candan

16

Principal component analysis

Principal component analysis

F2

The eigenvector of the covariance matrix with the largest eigenvalue

F2

Principal component is a combination of features!

Principal component is a combination of features!
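As an illustration of "the eigenvector of the covariance matrix with the largest eigenvalue", the sketch below estimates that eigenvector with power iteration in plain Python. Power iteration is a technique chosen here for brevity, not something the slides prescribe, and the sample points are hypothetical.

```python
import math

def covariance_matrix(points):
    """Sample covariance matrix of a list of equal-length feature vectors."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]

def principal_component(points, iterations=200):
    """Approximate the dominant eigenvector of the covariance matrix."""
    cov = covariance_matrix(points)
    dims = len(cov)
    v = [1.0] * dims  # starting guess; assumed not orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical (F1, F2) values lying roughly along the diagonal F1 = F2.
POINTS = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1)]
print(principal_component(POINTS))
```

The result mixes F1 and F2 in nearly equal parts, so the principal component is a combination of features rather than one of the original axes.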

F1

10 1

F1

F3 K. Selcuk Candan

10 2

Compactness of a database

F3 K. Selcuk Candan

Compactness of a database

$\mathrm{comp}(D) = \sum_{i \neq j} \mathrm{similarity}(o_i, o_j)$

more compact

10 3

K. Selcuk Candan

10 4

Compactness of a database

K. Selcuk Candan

Feature quality

$\mathrm{comp}(D) = \sum_{i \neq j} \mathrm{similarity}(o_i, o_j)$

A feature is
• good if, when we remove it, the overall compactness increases
• bad if, when we remove it, the overall compactness decreases

A compact database is not desirable!!!
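The compactness test for feature quality can be sketched as follows. The similarity function (one over one plus Euclidean distance) and the three sample objects are assumptions for illustration; the slides do not fix a particular similarity.

```python
import math
from itertools import permutations

def similarity(a, b):
    """Assumed similarity: 1.0 for identical objects, lower as they move apart."""
    return 1.0 / (1.0 + math.dist(a, b))

def compactness(objects):
    """comp(D): sum of similarities over all ordered pairs with i != j."""
    return sum(similarity(a, b) for a, b in permutations(objects, 2))

def without_feature(objects, k):
    """Drop the k-th feature from every object."""
    return [tuple(v for i, v in enumerate(o) if i != k) for o in objects]

def compactness_change(objects, k):
    """Positive if removing feature k makes the database more compact."""
    return compactness(without_feature(objects, k)) - compactness(objects)

# Hypothetical objects: feature 0 separates them, feature 1 is constant.
OBJECTS = [(0, 5), (10, 5), (20, 5)]
print(compactness_change(OBJECTS, 0))
print(compactness_change(OBJECTS, 1))
```

Removing feature 0 collapses all objects onto one point, so compactness rises and the feature counts as good; removing the constant feature 1 changes nothing.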

10 5

K. Selcuk Candan

10 6

good

bad

K. Selcuk Candan

17

Signal...

Media and Features

z

Is a function (generally of time) f(t)

Used in representing analog and digital information
– analog signal → continuous
– digital signal → discrete

Audio: f: time → volume (analog)
Image: f: coordinate × coordinate → color (digital)

K. Selcuk Candan

Example: Images...

• Black and white images
  – f : coordinate × coordinate → {0, 1}
• Greyscale images
  – f : coordinate × coordinate → [0, 255]
• Color (Depth = 24)
  – f : coordinate × coordinate → [0, 255] × [0, 255] × [0, 255]
• Color image (with color table)
  – f : coordinate × coordinate → color index [0, 255]
  – Ct: color index → [0, 255] × [0, 255] × [0, 255]

Filter....

• ....is a function that transforms an input signal f(v) into an output signal g(v)
  – Filter: f(v) → g(v)
• Linear filter:
  – f(av) → a*g(v)
  – f1(v) + f2(v) → g1(v) + g2(v)
  – A + B = C → A’ + B’ = C’
• Space invariant filter
  – filter(f(x+a)) = filter(f(x))

Text vs. images

• Text
  – Symbolic
  – Artificial
  – Single meaning (reader independent)
  – Small storage requirements
• Images
  – Visual
  – Natural, artificial
  – Multiple meanings (viewer dependent)
  – Large storage requirements
• Convenient ways to store visual information
  – Bitmap:
    • 2D array of pixels.
    • each pixel contains color+illumination information
• They have to be
  – processed and
  – analyzed
  to extract the information content

11 1

K. Selcuk Candan

11 2

K. Selcuk Candan

18

Image operations z z z z z z z z z z

11 3

z z

Processing vs. analysis..

Acquire Store Browse Query/QBE Process/Analyse Index Retrieve/request Display Compress Watermark Transmit Enhancement

z

I

z

– – – – – – – –

z

z z z

Cat, man, umbrella…

N w w M

Many of these operators require a filter to be convoluted over the original image

# of pixels: 628 × 1024
# of bits per pixel = 24
# of bits = 628 × 1024 × 24
# of bytes = 628 × 1024 × 3 ≈ 1,884 Kbytes (about 1.9 Mbytes)
How to reduce storage cost?

11 6

1024

z

K. Selcuk Candan



reduce the visual quality of the image

Feature vector size: 628 × 1024 –

628

change the coding reduce the dimensions of the image (scale) z

Cost of the operation: M × N × w × w
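The M × N × w × w cost can be checked by counting the multiply-add steps of a direct filter pass. The sketch below assumes zero padding at the image borders and does not flip the kernel (so it is strictly a correlation, which coincides with convolution for symmetric kernels); both choices are assumptions made here.

```python
def convolve(image, kernel):
    """Slide a w x w kernel over an M x N image (zero padding at the borders).

    Returns the filtered image and the number of multiply-add steps performed.
    """
    rows, cols, w = len(image), len(image[0]), len(kernel)
    half = w // 2
    output = [[0.0] * cols for _ in range(rows)]
    steps = 0
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for i in range(w):
                for j in range(w):
                    rr, cc = r + i - half, c + j - half
                    inside = 0 <= rr < rows and 0 <= cc < cols
                    total += kernel[i][j] * (image[rr][cc] if inside else 0)
                    steps += 1
            output[r][c] = total
    return output, steps

# Hypothetical 4 x 5 image of ones and a 3 x 3 averaging (blur) kernel.
IMAGE = [[1] * 5 for _ in range(4)]
KERNEL = [[1 / 9] * 3 for _ in range(3)]
blurred, steps = convolve(IMAGE, KERNEL)
print(steps)  # M * N * w * w
```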

Problem…





11 7

K. Selcuk Candan

– Sharpening
– Blurring
– Rotating
– Translating
– Brightening
– Cut/paste/resize
– Warping
– Edge detection
– Segmentation

Storage requirements

z

IAE

Processing requirements...

K. Selcuk Candan

z

I

Operators –

I’

Image analyzing

11 4

Image processing

11 5

IPE

(Histograms)

K. Selcuk Candan

z

Image processing

Dimensionality curse: high dimensions make indices unusable (10-15 dimensions max!!!)

loss of granularity

K. Selcuk Candan

11 8

K. Selcuk Candan

19

Problem… z

Feature vector size: 628 × 1024 –

z

Dimensionality curse: high dimensions make indices unusable (10-15 dimensions max!!!)

Solution: Reduce # dimensions of the vector – –

11 9

Transforms

A

– use distance-preserving transforms
– Ex: Fourier transform, DCT, wavelet transform

628 × 1024 DCT

12 0

4 K. Selcuk Candan

Transforms

K. Selcuk Candan

Transforms A

A

A

12 1

Distances and angles are preserved

12 2

K. Selcuk Candan

Transforms

K. Selcuk Candan

Transforms A

12 3

A

Distances and angles are preserved

Distances and angles are preserved

Some dimensions are more important (differentiating) than the other

Some dimensions are more important (differentiating) than the other

K. Selcuk Candan

12 4

Eliminate unimportant dimensions

K. Selcuk Candan

20

Transform + Projection (Compression or Feature selection) A

What happens to distances???

A

A

A

∆’(A,B)

∆(A,B)

Projection

B

B

12 5

12 6

K. Selcuk Candan

What happens to distances???

∆(A,B) ∆’(A,B)

K. Selcuk Candan

What happens to distances??? δ

δ A

δ δ A

A

A ∆2 ∆1

C B

∆1’

B

C

12 7

∆2’

B

12 8

K. Selcuk Candan

False hit (∆1> ∆1’)

C

B

C

Miss (∆2< ∆2’)

K. Selcuk Candan

Misses are not desirable! They cannot be eliminated with postprocessing
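Why a distance-preserving transform followed by a projection yields false hits but no misses can be sketched in two dimensions. A rotation stands in for the transform (an assumption; the slides name Fourier, DCT and wavelet transforms), and dropping a coordinate is the projection. Since dropping a coordinate can only shrink a distance, every true match survives the filter step, and the refine step removes the false hits.

```python
import math

def rotate(point, angle):
    """Distance-preserving transform: rotate a 2D point around the origin."""
    x, y = point
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def project(point, keep=1):
    """Keep only the first `keep` dimensions."""
    return point[:keep]

def filter_and_refine(query, data, delta, angle):
    """Range search in the reduced space, then verification in the full space."""
    tq = project(rotate(query, angle))
    candidates = [p for p in data
                  if math.dist(tq, project(rotate(p, angle))) <= delta]
    hits = [p for p in candidates if math.dist(query, p) <= delta]
    false_hits = [p for p in candidates if p not in hits]
    return hits, false_hits

# Hypothetical data points, query, range and rotation angle.
DATA = [(1.0, 1.0), (0.0, 5.0), (4.0, 0.0)]
hits, false_hits = filter_and_refine((0.0, 0.0), DATA, 2.0, 0.3)
print("hits:", hits, "false hits:", false_hits)
```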

Image analysis

What happens to distances??? δ δ A

z

A ∆1

∆1’

B

∆2’

Image analysis gives the features –

∆2



C

– –

B

C

– – –

12 9

False hit (∆1> ∆1’)

Miss (∆2< ∆2’)

K. Selcuk Candan

13 0

z

Color histogram Texture Edges Shapes Objects Semantics Depth (stereoscopic images)

Image analysis is usually an expensive operation K. Selcuk Candan

21

..so... z

Example feature: color z

In the design of a MIS, you want to minimize the number of image analysis operations to perform on the fly – – –

– z



– –

(op1 op2) = (op2 op1) and Cost (op2 op1) < Cost(op1 op2)



then do op2 op1

13 1

K. Selcuk Candan

z

RGB: describes colors in terms of the combinations of the intensities of Red, Green and Blue colors

z

HSV – Hue: main color – Saturation: Amount of white – Value: Amount of energy

z

YUV, a linear transformation from RGB – Y: luminance (amount of light) – grey scale – U: red - cyan – V: magenta-green

S G

Y W

C

Cyan

B

White Black

K. Selcuk Candan

Color models

B

Magenta

– reduce the number of colors to 256 (1 byte per pixel)
– cluster similar colors into a single bucket and assign a single color to the bucket
– the set of buckets is called the color table

13 2

Color spaces...

Blue

2^24 colors ≈ 16 million

Color table:

If –

24 bits (3 bytes) of red, green, and blue z

pre-processing/ pre-analysis indexing/clustering semantic optimization z

Example: color

R M

Green G

Red

Yellow V

R

13 3

RGB (Red, Green, Blue)

HSV (Hue, Saturation, Value) K. Selcuk Candan

13 4

YUV (ex. PAL television system)

color

13 5

YUV

Y = 0.299 R + 0.587 G + 0.114 B
U = 0.492 (B - Y)
V = 0.877 (R - Y)
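The conversion above is a direct linear map and can be written out in a few lines. The function name and the sample colors are chosen here for illustration.

```python
def rgb_to_yuv(r, g, b):
    """Convert an RGB triple to YUV using the coefficients given above."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v

print(rgb_to_yuv(255, 255, 255))  # white: full luminance, no color difference
print(rgb_to_yuv(255, 0, 0))      # pure red
```

Any grey value (R = G = B) maps to U = V = 0, which is why the Y channel alone gives the grey scale image.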

Y

Grey scale image

U

Red-Cyan

V

Magenta-green

K. Selcuk Candan

K. Selcuk Candan

Color histograms

13 6

K. Selcuk Candan

Courtesy of Misha Pavel, OGI

22

Problems with histograms

Problems with histograms

Histogram: {green:4, purple:2, red:3}

Histogram: {green:4, purple:2, red:3}

• Are these similar???

• Are these similar??? • Color associations????: • blue is similar to purple • yellow is similar to orange

13 7

K. Selcuk Candan

13 8

Color locality??

K. Selcuk Candan

Comparison of color histograms z

Hist1

Euclidean distance

Hist2

$(b_1-b_2)^2 + (g_1-g_2)^2 + (p_1-p_2)^2 + (r_1-r_2)^2 + \ldots$

Hist3

Hist4 z

Intersection similarity min(b1,b2) +min (g1,g2)+min (p1,p2)+ min (r1,r2)+...
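Both histogram comparisons can be sketched with histograms stored as color-to-count dictionaries. The first histogram is the one from the slides ({green:4, purple:2, red:3}); the second is a hypothetical one added for comparison, and the Euclidean form is assumed to take the square root of the sum of squared differences.

```python
import math

def euclidean_distance(h1, h2):
    """Square root of the summed squared bin differences (missing bins count as 0)."""
    colors = set(h1) | set(h2)
    return math.sqrt(sum((h1.get(c, 0) - h2.get(c, 0)) ** 2 for c in colors))

def intersection_similarity(h1, h2):
    """Sum over bins of the smaller of the two counts."""
    colors = set(h1) | set(h2)
    return sum(min(h1.get(c, 0), h2.get(c, 0)) for c in colors)

HIST1 = {"green": 4, "purple": 2, "red": 3}
HIST2 = {"green": 3, "purple": 2, "red": 3, "blue": 1}  # hypothetical
print(euclidean_distance(HIST1, HIST2))
print(intersection_similarity(HIST1, HIST2))
```

Note that neither measure knows that blue is close to purple or where the colors sit in the image, which are exactly the problems listed for histograms.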

13 9

K. Selcuk Candan

14 0

z

Complete Euclidean Distance

• Let x and y be two histogram vectors, each of length n

$d^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} (x_i - y_i)(x_j - y_j)$

  – $a_{ij}$ = cross-talk factor between the i-th and j-th colors
  – No cross-talk → $a_{ij} = 1$ if $i = j$, $a_{ij} = 0$ otherwise; $d^2$ then reduces to the Euclidean form $(b_1-b_2)^2 + (g_1-g_2)^2 + (p_1-p_2)^2 + (r_1-r_2)^2 + \ldots$

Quadratic distance bounding

• Use the average color of an image

$R_{avg} = \frac{1}{N} \sum_{i=1}^{N} R(p_i)$

$G_{avg} = \frac{1}{N} \sum_{i=1}^{N} G(p_i)$

$B_{avg} = \frac{1}{N} \sum_{i=1}^{N} B(p_i)$

$x = (R_{avg}, G_{avg}, B_{avg})^T$

$d_{avg}^2(x,y) = (x - y)^T (x - y) = (R_{avg}^x - R_{avg}^y)^2 + (G_{avg}^x - G_{avg}^y)^2 + (B_{avg}^x - B_{avg}^y)^2$
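The cross-talk distance and the average-color distance can be sketched in plain Python. The histograms, the cross-talk matrix and the pixel lists below are hypothetical values for illustration.

```python
def quadratic_distance_squared(x, y, a):
    """d^2 = sum_i sum_j a[i][j] * (x_i - y_i) * (x_j - y_j)."""
    n = len(x)
    return sum(a[i][j] * (x[i] - y[i]) * (x[j] - y[j])
               for i in range(n) for j in range(n))

def average_color(pixels):
    """(R_avg, G_avg, B_avg) over a list of (R, G, B) pixels."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def average_color_distance_squared(pixels_x, pixels_y):
    """Squared Euclidean distance between the two average colors."""
    ax, ay = average_color(pixels_x), average_color(pixels_y)
    return sum((u - v) ** 2 for u, v in zip(ax, ay))

# Hypothetical 3-bin histograms and cross-talk matrices.
X, Y = (4, 2, 3), (3, 2, 4)
NO_CROSS_TALK = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
CROSS_TALK = [[1, 0, 0.5], [0, 1, 0], [0.5, 0, 1]]  # bins 0 and 2 look alike
print(quadratic_distance_squared(X, Y, NO_CROSS_TALK))
print(quadratic_distance_squared(X, Y, CROSS_TALK))
```

With cross-talk between bins 0 and 2, moving mass from one of those bins to the other costs less, so the two histograms come out closer than under the plain Euclidean form.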

23

Quadratic distance bounding

Quadratic distance bounding

davg2 (x,y) T

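$\mathrm{sim}(o, q) = \sum_{i=1}^{n} w_i f_i > T$

or

$\mathrm{sim}(o, q) = w_j f_j + \sum_{i \in \{1..n\} - \{j\}} w_i f_i > T$

$\mathrm{sim}(o, q) = w_j f_j + \mathrm{sim}^{(-j)}(o, q) > T$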

K. Selcuk Candan

51 6

Feature (term) readjustment

Feature (term) readjustment

sim(o, q) = w j f j + sim( − j ) (o, q ) > T

51 7

                               fj=1    fj=0
sim(-j)(o,q) ≤ T                a       0
sim(-j)(o,q) > T                b       c

| relevant & retrieved | = a + b + c

K. Selcuk Candan

$p(f_k \mid R) = \frac{p((f_k = 1) \wedge (\mathrm{sim}^{(-k)}(o, q) > T))}{p(\mathrm{sim}^{(-k)}(o, q) > T)}$

If $f_k$ and $\mathrm{sim}^{(-k)}(o, q)$ are independent:

$p(f_k \mid R) = p(f_k = 1) = p(f_k = 1 \mid \mathrm{sim}^{(-k)}(o, q) > T)$

52 0

K. Selcuk Candan

79

Feature (term) readjustment

$p(f_k \mid R) = \frac{p((f_k = 1) \wedge (\mathrm{sim}^{(-k)}(o, q) > T))}{p(\mathrm{sim}^{(-k)}(o, q) > T)} = \frac{b}{b + c}$

52 2

Ranking

• Q: query
• R: event that a document is relevant to the user
• $p(R \mid O_i)$: given features and weights, the probability that object $o_i$ is relevant to the user
• …then, retrieval is effective iff

$p(R \mid O_i) > p(R \mid O_j) \Leftrightarrow \mathrm{sim}(Q, O_i) > \mathrm{sim}(Q, O_j)$

Bayes Theorem

$p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$

Let’s try to rewrite the first half of the equation using Bayes theorem

52 3

K. Selcuk Candan

52 4

Bayes Theorem

$p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)} = \frac{p(B \mid A)\, p(A)}{p(B \mid A)\, p(A) + p(B \mid \neg A)\, p(\neg A)}$

Relevance of two objects

$p(R \mid O_i) = \frac{p(O_i \mid R)\, p(R)}{p(O_i \mid R)\, p(R) + p(O_i \mid I)\, p(I)} \; > \; p(R \mid O_j) = \frac{p(O_j \mid R)\, p(R)}{p(O_j \mid R)\, p(R) + p(O_j \mid I)\, p(I)}$

80

Relevance of two objects

$\frac{p(O_i \mid R)}{p(O_i \mid I)} > \frac{p(O_j \mid R)}{p(O_j \mid I)}$

…so we have

$\frac{p(O_i \mid R)}{p(O_i \mid I)} > \frac{p(O_j \mid R)}{p(O_j \mid I)} \Leftrightarrow \mathrm{sim}(Q, O_i) > \mathrm{sim}(Q, O_j)$

How do we compute these probabilities???

52 8

K. Selcuk Candan

52 9

…so we have

• Let us assume features are independent
  – o=
  – q=

$p(O_i \mid R) = \prod_{k=1}^{n} p(f_{i,k} \mid R)$

$p(O_i \mid I) = \prod_{k=1}^{n} p(f_{i,k} \mid I)$

$\frac{p(O_i \mid R)}{p(O_i \mid I)} > \frac{p(O_j \mid R)}{p(O_j \mid I)} \Leftrightarrow \frac{\prod_{k=1}^{n} p(f_{i,k} \mid R)}{\prod_{k=1}^{n} p(f_{i,k} \mid I)} > \frac{\prod_{k=1}^{n} p(f_{j,k} \mid R)}{\prod_{k=1}^{n} p(f_{j,k} \mid I)}$

$\frac{p(O_i \mid R)}{p(O_i \mid I)} > \frac{p(O_j \mid R)}{p(O_j \mid I)} \Leftrightarrow \sum_{k=1}^{n} \log \frac{p(f_{i,k} \mid R)}{p(f_{i,k} \mid I)} > \sum_{k=1}^{n} \log \frac{p(f_{j,k} \mid R)}{p(f_{j,k} \mid I)}$

  – use dot product as the similarity measure
  – use $\log \frac{p(f_{i,k} \mid R)}{p(f_{i,k} \mid I)}$ as the weight of the kth feature
  – use as the query!!!
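The step from the product of likelihood ratios to the sum of log ratios can be checked numerically. The sketch below assumes binary features and uses made-up values for p(f_k = 1 | R) and p(f_k = 1 | I); all names and numbers are hypothetical.

```python
import math

def feature_likelihood(value, p_one):
    """p(f = value) for a binary feature with p(f = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def odds_ratio(obj, p_rel, p_irr):
    """p(O | R) / p(O | I) under the feature-independence assumption."""
    ratio = 1.0
    for f, pr, pi in zip(obj, p_rel, p_irr):
        ratio *= feature_likelihood(f, pr) / feature_likelihood(f, pi)
    return ratio

def log_odds_score(obj, p_rel, p_irr):
    """Sum over features of log(p(f_k | R) / p(f_k | I))."""
    return sum(math.log(feature_likelihood(f, pr) / feature_likelihood(f, pi))
               for f, pr, pi in zip(obj, p_rel, p_irr))

# Hypothetical per-feature probabilities of f_k = 1 in R and in I.
P_REL = [0.8, 0.6, 0.3]
P_IRR = [0.2, 0.5, 0.4]
O1, O2 = (1, 1, 0), (0, 1, 1)
print(odds_ratio(O1, P_REL, P_IRR), odds_ratio(O2, P_REL, P_IRR))
print(log_odds_score(O1, P_REL, P_IRR), log_odds_score(O2, P_REL, P_IRR))
```

Because the logarithm is monotone, ranking objects by the summed log weights gives the same order as ranking them by the odds ratio.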

53 3

K. Selcuk Candan

81

…so we have

• Let us assume features are not independent
  – o=
  – q=

$p(O_i \mid R) \neq \prod_{k=1}^{n} p(f_{i,k} \mid R)$

$p(O_i \mid I) \neq \prod_{k=1}^{n} p(f_{i,k} \mid I)$

• How can we incorporate term dependence???

Degree of approximation…..

$I(p_1, p_2) = \sum_{x} p_1(x) \log \frac{p_1(x)}{p_2(x)}$

  – $p_1 = p_2$ implies that $I = 0$
  – $p_1 \neq p_2$ implies that $I > 0$

Degree of dependence between $f_i$ and $f_j$

$D_{ij} = I(p(f_i \wedge f_j),\; p(f_i)\, p(f_j))$

If the two terms are independent, then $D_{ij}$ will be 0!!!

Dependence graph

[Figure: dependence graph over features f1, f2, f3, f4 with weighted edges, and its maximum spanning tree]

Maximum spanning tree

$p(f_1 \wedge f_2 \wedge f_3 \wedge f_4) = p(f_1)\, p(f_2 \mid f_1)\, p(f_3 \mid f_1)\, p(f_4 \mid f_2)$

  – $p(O_i \mid R)$ can be computed using the distribution of the features in R!!!
  – $p(O_i \mid I)$ can be computed using the distribution of the features in I!!!
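The dependence measure and the spanning-tree step can be sketched as follows. Kruskal's algorithm is used here as one standard way to build a maximum spanning tree; the slides do not say which algorithm is intended. The edge weights reuse the numbers visible in the dependence-graph figure, but their assignment to particular edges is an assumption, and features are indexed from 0 (so index 0 is f1).

```python
import math

def divergence(p1, p2):
    """I(p1, p2) = sum_x p1(x) * log(p1(x) / p2(x)) over aligned outcomes."""
    return sum(a * math.log(a / b) for a, b in zip(p1, p2) if a > 0)

def dependence(samples, i, j):
    """D_ij for binary features: joint distribution vs. product of marginals."""
    n = len(samples)
    joint, product = [], []
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for s in samples if s[i] == a and s[j] == b) / n
            p_a = sum(1 for s in samples if s[i] == a) / n
            p_b = sum(1 for s in samples if s[j] == b) / n
            joint.append(p_ab)
            product.append(p_a * p_b)
    return divergence(joint, product)

def maximum_spanning_tree(n, weights):
    """Kruskal's algorithm, taking the heaviest edges first."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for (i, j), _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so it cannot close a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Assumed assignment of the figure's weights to edges (0 = f1, ..., 3 = f4).
WEIGHTS = {(0, 1): 10, (0, 2): 7, (1, 2): 2, (1, 3): 9, (2, 3): 1}
print(maximum_spanning_tree(4, WEIGHTS))
```

With this assignment the tree keeps the edges f1-f2, f2-f4 and f1-f3, which matches the factorization p(f1) p(f2|f1) p(f3|f1) p(f4|f2).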

Feedback without user’s help..

[Figure: query Q is sent to the Retrieval Engine, which returns ranked results res1 … resN; the first k results (res1 … resk) go to the Feedback Engine, which produces a revised query Q’; the figure also carries the label δ]

– System returns N ranked results
– First k ranked results are chosen
– System returns new N ranked results

Multimedia query processing

• First major issue
  – imperfections (fuzziness)
• Second issue
  – ranking
• Third issue
  – expensive predicates (user defined functions)

K. Selcuk Candan

83

A multimedia query

54 6

K. Selcuk Candan

54 7

K. Selcuk Candan

Query…and results…

A multimedia query Crisp Fuzzy (imperfect)

54 8

K. Selcuk Candan

54 9

Reasons for imperfection z z z z z

55 0

K. Selcuk Candan

Fuzzy set..

• Similarity between features (yellow/orange)
• Imperfections in the feature extraction algorithms
• Imperfections in the query formulation methods
• Partial match requirements
• Imperfections in the index structures and clustering algorithms

55 1

z

• Fuzzy set F with domain D is defined using a membership function µF : D → [0,1]
• A crisp (conventional) set C with domain D is defined using a membership function µC : D → {0,1}
• A fuzzy set corresponds to a fuzzy predicate

84

Example

[Figure: membership function of the fuzzy predicate “hot” over heat values from -5 to 45, with membership between 0.0 and 1.0]

Hot(29°) = 0.85
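A membership function for "hot" can be sketched as a piecewise-linear ramp. The exact shape of the curve in the figure is not recoverable from the extracted text, so the ramp endpoints (12 and 32 degrees) are assumptions picked so that Hot(29) comes out as 0.85; the crisp cutoff is likewise hypothetical.

```python
def hot(temperature, low=12.0, high=32.0):
    """Assumed fuzzy membership: 0 below `low`, 1 above `high`, linear in between."""
    if temperature <= low:
        return 0.0
    if temperature >= high:
        return 1.0
    return (temperature - low) / (high - low)

def hot_crisp(temperature, cutoff=26.0):
    """Crisp counterpart: membership is either 0 or 1 (cutoff is an assumption)."""
    return 1 if temperature >= cutoff else 0

for t in (5, 20, 29, 40):
    print(t, hot(t), hot_crisp(t))
```

The fuzzy version grades temperatures between the two endpoints, while the crisp version treats 26 and 40 degrees as equally hot.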

55 2

K. Selcuk Candan

55 3

Example query

K. Selcuk Candan

Example query

Fuzzy (imperfect)

Fuzzy (imperfect) 0.84

55 4

K. Selcuk Candan

55 5

Example query

K. Selcuk Candan

Example query

Fuzzy (imperfect) 0. 76

0.84

Fuzzy (imperfect) 0. 76

0.68 Fuzzy logical operator

z z

55 6

0.68

K. Selcuk Candan

55 7

z

0.84

0.68 Fuzzy logical operator

– Fuzzy and crisp predicates
– Fuzzy logical expression and a merge function
– Result is a ranked list (with the associated fuzzy values!)

85

How to process a fuzzy query?

• First approach…make predicates crisp!!!
– Use thresholds….
– Good for processing in traditional databases
– Not good for multimedia applications

[Figure: membership function “hot” plotted against heat (x-axis from -5 to 45, membership from 0.0 to 1.0), cut at the threshold 0.7]

• Second approach…use suitable fuzzy logic!!!
– Merge (or scoring) functions….
– 0.76 = µ∧(0.84, 0.68)

Example merge functions…

• Arithmetic average (N-ary)
(mostly used in information retrieval)

Triangular norms (and co-norms)

• How to emulate the properties of a crisp predicate
• Bellman and Giertz: “The unique aggregation functions for evaluating AND and OR that preserve logical equivalence of queries involving only conjunction and disjunction and that are monotonic in their arguments are min and max.”

Triangular norms (and co-norms)

• Emulating the properties of a crisp predicate may not be good for multimedia applications!!!
• Boundary conditions prevent partial matches
– 0.00 = µ∧min(0.99, 0.00)
• Monotone condition is weak!!
– 0.70 = µ∧min(0.71, 0.70)
– 0.70 = µ∧min(0.99, 0.70)
• N-ary semantics may be enough!!!
– consider all relevant features at the same time, instead of in pairs!
– Arithmetic average (N-ary)
– Geometric average (N-ary)

Visualisation of minimum

• monotone: not enough
• No partial match: not good
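The behaviour of the three merge functions named above can be checked on the scores used in the slides. The following is a minimal sketch; the function names are made up for this illustration.

```python
import math


def merge_min(scores):
    """Triangular-norm style AND: the weakest subquery decides the score."""
    return min(scores)


def merge_arithmetic(scores):
    """N-ary arithmetic average: every subquery contributes equally."""
    return sum(scores) / len(scores)


def merge_geometric(scores):
    """N-ary geometric average: a zero in any subquery zeroes the result."""
    return math.prod(scores) ** (1.0 / len(scores))


# min ignores any improvement in the larger argument (weak monotonicity).
same_under_min = merge_min((0.71, 0.70)) == merge_min((0.99, 0.70))

# The arithmetic average rewards the improvement and allows partial matches.
avg_low = merge_arithmetic((0.71, 0.70))
avg_high = merge_arithmetic((0.99, 0.70))
avg_partial = merge_arithmetic((0.99, 0.00))
```

With min, an object that matches one feature almost perfectly and misses the other gets score 0; with the arithmetic average it still gets a nonzero score.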

Visualisation of arithmetic average

• constant slope: not enough
• strictly monotone: better!!
• partial matches: better!!

Interestingness…..

• An increase in the score of a subquery with a lower score is more interesting than a similar increase in the score of a subquery with a large score

[Figure: subquery scores (0.00, 0.80) merge to 0.40; the merged scores for (0.10, 0.80) and (0.00, 0.90) are left open (0.??)]

Visualisation of geometric average

• adaptive slope: better
• No partial match: not good

….parametric geometric average

[Figure: plots with a truth cutoff and a falsehood value, for rtrue=0.4 with β=0.4, β=0.2, and β=0.0]

How to put weights?????

• How do I state that image properties are more important than semantic properties??
• What do we mean?:
– A change in the value of the image property should have a larger impact than a similar change in the value of the semantic property.

[Figure: worked example with subquery scores 0.60 and 0.70 and merged scores 0.60, 0.65, and 0.68]

Fagin’s proposal

• Desiderata
– If all weights are equal, the result should be equal to the unweighted case
– If one of the weights is zero, the subquery can be dropped without affecting the rest
– ..the result should be a continuous function of the weights

Fagin’s proposal

• Let $\theta_1 + \theta_2 + \dots + \theta_m = 1$, $\theta_1, \theta_2, \dots, \theta_m \ge 0$, and $\theta_1 \ge \theta_2 \ge \dots \ge \theta_m$
• then

$$
f_{(\theta_1,\theta_2,\dots,\theta_m)}(x_1, x_2, \dots, x_m) = (\theta_1 - \theta_2)\, f(x_1) + 2(\theta_2 - \theta_3)\, f(x_1, x_2) + 3(\theta_3 - \theta_4)\, f(x_1, x_2, x_3) + \dots + (m-1)(\theta_{m-1} - \theta_m)\, f(x_1, \dots, x_{m-1}) + m\,\theta_m\, f(x_1, \dots, x_m)
$$
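The weighting rule above can be written as a short loop over prefixes of the score list. This is a sketch under the stated conditions (weights are non-negative and sum to 1); the function name `fagin_weighted` and the sorting step that puts the weights in decreasing order are assumptions of this example.

```python
from typing import Callable, Sequence


def fagin_weighted(f: Callable[[Sequence[float]], float],
                   weights: Sequence[float],
                   scores: Sequence[float]) -> float:
    """Weighted version of an unweighted merge function f.

    Pairs are first sorted by decreasing weight so that
    theta_1 >= theta_2 >= ... >= theta_m holds, as the formula requires.
    """
    pairs = sorted(zip(weights, scores), key=lambda p: p[0], reverse=True)
    thetas = [w for w, _ in pairs]
    xs = [s for _, s in pairs]
    m = len(xs)
    total = 0.0
    for i in range(1, m + 1):
        # theta_{m+1} is taken as 0, which yields the last term m * theta_m * f(...)
        next_theta = thetas[i] if i < m else 0.0
        total += i * (thetas[i - 1] - next_theta) * f(xs[:i])
    return total


def average(xs):
    return sum(xs) / len(xs)


weighted_avg = fagin_weighted(average, [0.7, 0.3], [0.6, 0.9])
equal_weights = fagin_weighted(min, [0.5, 0.5], [0.8, 0.4])
zero_weight = fagin_weighted(min, [1.0, 0.0], [0.8, 0.4])
```

For the arithmetic average the result reduces to the familiar weighted sum, 0.7 × 0.6 + 0.3 × 0.9. With equal weights the unweighted min is recovered, and with a zero weight the second subquery has no effect.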

Fagin’s proposal

• If f is continuous, then the weighted function is also continuous
• If the lowest weight is 0 ($\theta_m = 0$), then the corresponding subquery can be omitted:

$$
f_{(\theta_1,\dots,\theta_m)}(x_1, \dots, x_m) = (\theta_1 - \theta_2)\, f(x_1) + 2(\theta_2 - \theta_3)\, f(x_1, x_2) + 3(\theta_3 - \theta_4)\, f(x_1, x_2, x_3) + \dots + (m-1)\,\theta_{m-1}\, f(x_1, \dots, x_{m-1})
$$

• If all weights are equal ($\theta_i = 1/m$), then the result is equal to the unweighted case:

$$
f_{\left(\frac{1}{m},\dots,\frac{1}{m}\right)}(x_1, \dots, x_m) = \left(\tfrac{1}{m} - \tfrac{1}{m}\right) f(x_1) + 2\left(\tfrac{1}{m} - \tfrac{1}{m}\right) f(x_1, x_2) + \dots + (m-1)\left(\tfrac{1}{m} - \tfrac{1}{m}\right) f(x_1, \dots, x_{m-1}) + m\,\tfrac{1}{m}\, f(x_1, \dots, x_m) = f(x_1, \dots, x_m)
$$

Example (arithmetic average)

$$
score(a \wedge b) = \frac{score(a) + score(b)}{2}
$$

With weights $\theta_a \ge \theta_b$:

$$
score(a \wedge b) = (\theta_a - \theta_b)\, score(a) + 2\theta_b\, \frac{score(a) + score(b)}{2} = \theta_a\, score(a) + \theta_b\, score(b)
$$

Example (product)

$$
score(a \wedge b) = score(a) \times score(b)
$$

With weights $\theta_a \ge \theta_b$:

$$
score(a \wedge b) = (\theta_a - \theta_b)\, score(a) + 2\theta_b\, score(a) \times score(b)
$$

Are Fagin’s desiderata enough?

• It does not compare partial derivatives!!!
– adaptive slope: adaptive importance

$$
E = 1 - \frac{1 + b^2}{\frac{b^2}{R} + \frac{1}{P}}, \qquad \left(\frac{\partial E}{\partial R} = \frac{\partial E}{\partial P}\right) \longleftrightarrow \left(\frac{P}{R} = b\right)
$$

• Importance: Given a function f(x,y), x has a higher contribution than y iff

$$
\forall a, b: \quad \left.\frac{\partial f}{\partial x}\right|_{a,b} > \left.\frac{\partial f}{\partial y}\right|_{a,b}
$$

– The relative importance is the ratio of the two partial derivatives:

$$
relimp(x,y)\big|_{a,b} = \left.\frac{\partial f}{\partial x}\right|_{a,b} \Big/ \left.\frac{\partial f}{\partial y}\right|_{a,b}
$$

• Example: $score(x \wedge y) = \theta_x\, score(x) + \theta_y\, score(y)$

$$
\left.\frac{\partial score(x \wedge y)}{\partial score(x)}\right|_{a,b} = \theta_x, \qquad \left.\frac{\partial score(x \wedge y)}{\partial score(y)}\right|_{a,b} = \theta_y, \qquad relimp(x,y)\big|_{a,b} = \frac{\theta_x}{\theta_y}
$$

OK! (the ratio does not depend on a and b, and it is at least 1 because $\theta_x \ge \theta_y$)

• Example: $score(x \wedge y) = (\theta_x - \theta_y)\, score(x) + 2\theta_y\, score(x) \times score(y)$

$$
\left.\frac{\partial score(x \wedge y)}{\partial score(x)}\right|_{a,b} = (\theta_x - \theta_y) + 2\theta_y b, \qquad \left.\frac{\partial score(x \wedge y)}{\partial score(y)}\right|_{a,b} = 2\theta_y a
$$

$$
relimp(x,y)\big|_{a,b} = \frac{(\theta_x - \theta_y) + 2\theta_y b}{2\theta_y a} \quad \text{NOT OK!}
$$

$$
relimp(x,y)\big|_{a,a} = \frac{(\theta_x - \theta_y) + 2\theta_y a}{2\theta_y a} = 1 + \frac{\theta_x - \theta_y}{2\theta_y a} > 1
$$

How about?

• Example: $score(x \wedge y) = score(x)^{\theta_x} \times score(y)^{\theta_y}$

$$
\left.\frac{\partial score(x \wedge y)}{\partial score(x)}\right|_{a,b} = \theta_x\, a^{\theta_x - 1}\, b^{\theta_y}, \qquad \left.\frac{\partial score(x \wedge y)}{\partial score(y)}\right|_{a,b} = \theta_y\, a^{\theta_x}\, b^{\theta_y - 1}
$$

$$
relimp(x,y)\big|_{a,b} = \frac{\theta_x b}{\theta_y a} \quad \text{NOT OK!}
$$

$$
relimp(x,y)\big|_{a,a} = \frac{\theta_x a}{\theta_y a} = \frac{\theta_x}{\theta_y} > 1
$$

Ranking (and first-k retrieval)

Two ranked lists over the same objects; how should they be merged (??) into one ranking?

List 1: 0.90 X2, 0.80 X5, 0.70 X6, 0.60 X4, 0.50 X1, 0.40 X3
List 2: 0.85 X3, 0.80 X5, 0.75 X2, 0.74 X6, 0.74 X1, 0.70 X4

First solution…join based on X

• Join the two information sources based on X (X=X)
• Sort all results based on the merged score
• Select the first k
• Need to access the entire database at least once!!!

Ranked join for top-k retrieval (Fagin)

Assumptions:
• Q is monotonic
• Predicates provide sorted_access
• Predicates provide random_access

Sorted Access Phase (k=3)

[Figure: the two ranked lists are read in sorted order; the merged scores shown for the objects found in both lists are X2: 0.825, X5: 0.8, X6: 0.72]

Random Access Phase (k=3)

[Figure: the missing score of X3, which was seen in only one list during sorted access, is fetched by random access; its merged score is 0.625]

Result… (k=3)

• X2 (0.825), X5 (0.8), X6 (0.72)
• X1 and X4 have never been accessed!

Properties of the algorithm

• Given n subqueries, assuming independence of the n subqueries …
• Advantage!! .....in order to find the next k best answers, we can continue where we left off.
• If the merge function is min, there is a more efficient implementation: use only one pred. for sorted access
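The two phases can be sketched in Python on the two ranked lists from the slides, with the arithmetic average as the (monotonic) merge function. This is a simplified illustration, not a tuned implementation: it reads both lists in lockstep and assumes every object appears in every list. Which objects are never touched depends on how the sorted accesses are interleaved, so this sketch may read slightly more than the slide example does.

```python
def fagin_top_k(lists, merge, k):
    """lists: ranked lists of (score, obj), each sorted by decreasing score."""
    # Random access is simulated with one dictionary per list.
    lookup = [{obj: score for score, obj in lst} for lst in lists]
    seen = [set() for _ in lists]
    longest = max(len(lst) for lst in lists)

    # Sorted access phase: stop once k objects were seen in every list.
    for depth in range(longest):
        for i, lst in enumerate(lists):
            if depth < len(lst):
                seen[i].add(lst[depth][1])
        if len(set.intersection(*seen)) >= k:
            break

    # Random access phase: fetch the missing scores of every object seen.
    candidates = set.union(*seen)
    scored = [(merge([lk[obj] for lk in lookup]), obj) for obj in candidates]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:k]


list1 = [(0.90, "X2"), (0.80, "X5"), (0.70, "X6"),
         (0.60, "X4"), (0.50, "X1"), (0.40, "X3")]
list2 = [(0.85, "X3"), (0.80, "X5"), (0.75, "X2"),
         (0.74, "X6"), (0.74, "X1"), (0.70, "X4")]

top3 = fagin_top_k([list1, list2], lambda s: sum(s) / len(s), k=3)
top3_objects = [obj for _, obj in top3]
```

Because the merge function is monotonic, an object that was not seen in any list during sorted access cannot beat the k objects that were seen in all of them, which is why the scan can stop early.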

Sorted+Random Access (k=3)

[Figure: with min as the merge function, the first list is read in sorted order and the other score of each object is fetched by random access; the merged scores are X2: 0.75, X5: 0.8, X6: 0.7]

• Stop when the next value is smaller than the third candidate
• X1, X3, and X4 have never been accessed!
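The min-specific variant can be sketched as follows: only one list is read in sorted order, and each object's other score is obtained by random access (simulated here with a dictionary). The function name and the returned list of probed objects are additions made for this illustration.

```python
import heapq


def top_k_min(sorted_list, other_scores, k):
    """Top-k under min, using sorted access on one list only.

    sorted_list: (score, obj) pairs sorted by decreasing score.
    other_scores: obj -> score in the second list (random access).
    Returns (results, probed), where probed lists the objects looked up.
    """
    best = []    # min-heap holding the k best (merged score, obj) so far
    probed = []
    for score, obj in sorted_list:
        # min(score, anything) <= score, so once the next sorted value is
        # below the k-th candidate no later object can enter the result.
        if len(best) == k and score < best[0][0]:
            break
        probed.append(obj)
        merged = min(score, other_scores[obj])
        if len(best) < k:
            heapq.heappush(best, (merged, obj))
        elif merged > best[0][0]:
            heapq.heapreplace(best, (merged, obj))
    return sorted(best, reverse=True), probed


list1 = [(0.90, "X2"), (0.80, "X5"), (0.70, "X6"),
         (0.60, "X4"), (0.50, "X1"), (0.40, "X3")]
list2_scores = {"X3": 0.85, "X5": 0.80, "X2": 0.75,
                "X6": 0.74, "X1": 0.74, "X4": 0.70}

results, probed = top_k_min(list1, list2_scores, k=3)
```

On the slide data the scan stops when it reaches 0.60, which is smaller than the third candidate (0.7), so no score is ever fetched for X1, X3, or X4.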

Problem..

• These require special ranking engines..
• Can we process top-k queries in regular databases (Chaudhuri and Gravano)?

    select oid
    from Repository
    where Filter_condition
    order[k] by Ranking_expression

– Filter_condition: thresholds on the grade of match of the acceptable objects
– Ranking_expression: description of how the result should be ranked

For example:

    select oid
    from Repository
    where (v >= 0.5 and p >= 0.9) or f >= 0.9
    order[10] by max(f, v)

Available access methods…

• GradeSearch(attribute, value, min_grade)
• TopSearch(attribute, value, count)
• Probe(attribute, value, {oid})

Approach

1. Use statistics to find a suitable search score, Sq
2. Build a selection query Cq to return all tuples whose score is greater than Sq
3. Evaluate Cq
4. If there are k tuples with score greater than Sq, return the first k….otherwise, choose a smaller Sq and repeat
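The four steps of the approach can be sketched as a retry loop. This is an illustration only: the repository is an in-memory dictionary, the initial Sq is a fixed guess rather than one derived from statistics, and Sq is lowered by a fixed step; all names and values below are hypothetical.

```python
def top_k_by_threshold(rows, score, k, initial_sq=0.9, step=0.2):
    """rows: oid -> attribute grades; score: ranking expression over grades."""
    sq = initial_sq                      # step 1 (here: a fixed guess)
    while True:
        # Steps 2 and 3: build and evaluate the selection query Cq.
        selected = [oid for oid, grades in rows.items() if score(grades) > sq]
        # Step 4: enough tuples, or nothing left to relax (grades are >= 0).
        if len(selected) >= k or sq < 0.0:
            selected.sort(key=lambda oid: score(rows[oid]), reverse=True)
            return selected[:k]
        sq -= step                       # choose a smaller Sq and repeat


# Hypothetical repository: oid -> (f, v) grades, ranked by max(f, v).
repository = {
    "o1": (0.95, 0.20),
    "o2": (0.30, 0.60),
    "o3": (0.10, 0.85),
    "o4": (0.40, 0.45),
}
top2 = top_k_by_threshold(repository, score=max, k=2)
```

A good initial Sq matters: too high a value forces restarts, while too low a value makes the selection query return far more than k tuples.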

Handling filter conditions

• Use one GRADESEARCH for each atomic condition in the filter condition, and merge the returned sets of object ids through a sequence of unions/intersections
• Use GRADESEARCH only for SOME atomic conditions, and for the rest of the conditions use PROBE
• How do we choose which filter conditions are evaluated using GRADESEARCH???
• How do we choose the proper grade???

Handling ranking..

• Use Fagin’s approach to find the estimated number of objects required to be visited to generate top-k results
– Assuming independence of n subqueries
• Find a score, Sq, that returns at least that many tuples

Query Optimization

• Query languages are declarative
• We need to
– convert declarative statements into executable statements
– estimate the cost of the executable statements
– choose the cheapest executable statement among all alternatives

Query processing stages

Visual Query → Query in QL → Parser → Query Optimizer → Query Execution Plan → Storage Engine

Query Optimization

• In traditional databases, we are optimizing for cost of query processing
– Cost is described in terms of disk access
– Database keeps statistics about relations, tuples, page sizes.
– Database also keeps index and sorting information
• Query optimizer
– uses a “cost model” to guess query execution cost for a possible query execution plan
– chooses a plan which is cheap according to the cost model.

Query tree

• Selectivity describes what portion of the database satisfies this predicate
• Joins are expensive!!!
• Push selections and projections down to reduce the amount of data that goes into joins

What about joins???

• The goal of query optimization is to eliminate costliest executions…. finding cheapest executions is very very expensive.

Assumptions of cost-based query optimization

• Independent of how a query is executed we get the same number of results:

$$
\sigma_{\theta_1 \wedge \theta_2}(R) = \sigma_{\theta_1}(\sigma_{\theta_2}(R)) = \sigma_{\theta_2}(\sigma_{\theta_1}(R))
$$

• Principle of suboptimality: An optimal plan for a query includes optimal subplans for the subqueries

$$
cost_1\big(\sigma_{\theta_1}(\sigma^1_{\theta_2}(R))\big) = cost\big(\sigma^1_{\theta_2}(R)\big) + f\big(size(\sigma^1_{\theta_2}(R)), \theta_1\big)
$$

$$
cost_2\big(\sigma_{\theta_1}(\sigma^2_{\theta_2}(R))\big) = cost\big(\sigma^2_{\theta_2}(R)\big) + f\big(size(\sigma^2_{\theta_2}(R)), \theta_1\big)
$$

– Both subplans return the same result, so $size(\sigma^1_{\theta_2}(R)) = size(\sigma^2_{\theta_2}(R))$ and the two f terms are equal; the plans differ only in the cost of the subplan
– Pick the cheapest!!!
• …we can use recursion!!!!!!!!!

Recursion

• Given Q = R1 × R2 × R3 × … × Rn
– for all Ri,
    find the cheapest plan for Q(-Ri)
    compute cost_i = cost(Q(-Ri) × Ri)
– Pick the plan with the smallest cost
• Recursion!!!
• There is a dynamic programming based algorithm that uses this recursion.
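The recursion can be turned into a small memoized search. The sketch below is illustrative only: the cost model (joining an intermediate result with Ri costs |intermediate| × |Ri|, and every join keeps a fixed fraction of the cross product) is a hypothetical stand-in for a real optimizer's statistics.

```python
from functools import lru_cache


def cheapest_plan(sizes, selectivity=0.01):
    """sizes: relation name -> number of tuples. Returns (cost, plan)."""

    def out_size(rels):
        # Assumed size estimate: product of sizes, reduced once per join.
        size = 1.0
        for r in rels:
            size *= sizes[r]
        return size * selectivity ** (len(rels) - 1)

    @lru_cache(maxsize=None)   # memoization: each subset is solved once
    def best(rels):
        if len(rels) == 1:
            (only,) = rels
            return 0.0, only
        options = []
        for r in sorted(rels):
            rest = rels - {r}
            sub_cost, sub_plan = best(rest)           # cheapest plan for Q(-Ri)
            cost = sub_cost + out_size(rest) * sizes[r]  # cost(Q(-Ri) x Ri)
            options.append((cost, f"({sub_plan} x {r})"))
        return min(options)                           # smallest cost wins

    return best(frozenset(sizes))


cost, plan = cheapest_plan({"R1": 1000, "R2": 10, "R3": 100})
```

The memoization is what makes this dynamic programming: each subset of relations is optimized once and reused by every larger subset that contains it, which is valid only because of the principle stated above.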

Multimedia Query Optimization

• Query languages are declarative
• We need to
– convert declarative statements into executable statements
– estimate the cost of the executable statements
– choose the cheapest executable statement among all alternatives
– consider the quality of the alternatives!!

Multimedia query

• Overloaded implementation of predicates
– image: extract fixed number of parameters
– verify if the pattern is in the image
– pattern: find all the images using an index structure
• …different number of matches
• …different quality implementations

Binding patterns

• For a predicate … we can define four binding patterns:

[Table: cost, fanout, and quality of each binding pattern]

How to compute combined qualities?

• For example, using product semantics, for the query … we can define the overall quality as …
• Given a query … we need to define the best retrieval strategy based on cost, fanout, and quality

Merging cost and quality…

• This may not satisfy the principle of suboptimality
• Minimize: …

Dealing with expensive predicates

• Standard query optimizers assume that selection (σ) is cheap…
• ….however, multimedia predicates may be expensive (Hellerstein, Stonebraker)
– selectivity: what portion of the database satisfies this predicate
– cost: how costly the predicate is

Predicate ordering in a single table query

Predicate migration…

[Figure: a plan produced by standard optimization next to a better plan]

• Challenge: to find a plan that satisfies both rank orders
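For a single table, ordering predicates by selectivity and cost can be sketched as below. The rank metric used here, (selectivity − 1) / cost, is an assumption of this example (it is the metric usually associated with predicate migration, but it does not appear in the text above), and the predicates are assumed to be independent. The numbers are made up.

```python
from itertools import permutations


def expected_cost(preds):
    """Expected per-tuple cost of applying (selectivity, cost) pairs in order."""
    total, surviving = 0.0, 1.0
    for selectivity, cost in preds:
        total += surviving * cost      # only surviving tuples pay this cost
        surviving *= selectivity
    return total


def order_by_rank(preds):
    """Ascending rank = (selectivity - 1) / cost (assumed metric)."""
    return sorted(preds, key=lambda p: (p[0] - 1.0) / p[1])


# Hypothetical predicates: (selectivity, cost per tuple).
predicates = [(0.5, 100.0), (0.1, 1.0), (0.9, 10.0)]
ranked = order_by_rank(predicates)
ranked_cost = expected_cost(ranked)
best_cost = min(expected_cost(p) for p in permutations(predicates))
```

Cheap and selective predicates go first; an expensive predicate is postponed even when it is fairly selective, because fewer tuples reach it.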

Web Databases….

• Approach 1: use standard IR techniques to find pages that satisfy a query
• Approach 2: integrate IR techniques with structure/link analysis

Hubs and authorities

Topic distillation by iterative mutual reinforcement

• Good hubs should point to good authorities and vice versa.
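The mutual reinforcement between hubs and authorities can be sketched as a simple iteration. This is a minimal illustration on a made-up four-page graph; the normalization (Euclidean length 1) and the fixed number of iterations are choices made for this example.

```python
import math


def hits(links, iterations=50):
    """links: page -> list of pages it points to. Returns (hub, authority)."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {n: sum(hub[u] for u in nodes if n in links.get(u, ()))
                for n in nodes}
        # A good hub points to good authorities.
        hub = {n: sum(auth[v] for v in links.get(n, ())) for n in nodes}
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical graph: h1 and h2 are link pages, a1 and a2 are content pages.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub_scores, auth_scores = hits(graph)
```

In this graph a1 ends up as the strongest authority because both hubs point to it, and h1 as the strongest hub because it points to both authorities.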

PageRank

$$
R(u) = \frac{1}{c} \sum_{v \in B_u} \frac{R(v)}{N_v}
$$

where $B_u$ is the set of pages that link to u, $N_v$ is the number of links going out of v, and c is a normalization constant.
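The simplified rank equation can be iterated until the values settle. The sketch below is only an illustration of that equation: it has no damping factor, it assumes every page appears as a key and has at least one outgoing link, and the three-page graph is made up.

```python
def simple_pagerank(links, iterations=100):
    """links: page -> list of pages it points to (every page must be a key)."""
    nodes = sorted(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: 0.0 for n in nodes}
        for v, outs in links.items():
            for u in outs:
                # Page v splits its rank evenly over its N_v outgoing links.
                new_rank[u] += rank[v] / len(outs)
        total = sum(new_rank.values())   # plays the role of the constant c
        rank = {n: r / total for n, r in new_rank.items()}
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links back to A.
ranks = simple_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Page B receives only half of A's rank, while C receives the other half plus all of B's, so C ends up ranked as high as A.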

Information units (multi-page)

Information units (single-page)

Information units (single page/site)

Topic segmentation

Subjects are not informative

Specialization / Generalization

Keyword inheritance

Messages too short for indexing

Web databases

• Challenges:
– Integrate content and structure
– Graph and tree (XML/semi-structured) data indexing
– Proximity search
– Progressive query processing

[Figure: database architecture with SQL and application interfaces on top, a query parser, a cost-based query optimizer, a query processor, transaction processing, a replication manager, indices, and data]