Spatio- temporal dimensions. â modeling ..... d(s1,s2) + d (s2,s3) >= d (s1,s3) ...... an aggregation of stars or galaxies that appear close together in the sky and ...
Starting discussions….
Multimedia (and Web) Databases
z
What is media?
K. Selçuk Candan Associate Professor
Arizona State University
2
What is media? – – – – –
K. Selcuk Candan
Sample application (boring…)…
Text/document Images Video Audio …..
z
Police investigation… – – – – –
z z
What is multimedia????? What is hypermedia????
3
–
K. Selcuk Candan
4
K. Selcuk Candan
ARIA (Architecture for Interactive Arts):
Sample multimedia query z
Quality-Adaptive Media-Flow Architectures to Support Sensor Data Management Media-stream management for intelligentStage z intelligentStage (@ fine arts college)
“Find the records of every criminal who look like the person seen in “surv_im.gif” and who had a bank transfer of more than $500,000 within the last 5 months. Return all police reports which mentions such persons and their past accomplices.”
– –
z
– – –
K. Selcuk Candan
6
..to incorporate real-time and archived media into live performances, on-demand ..to enable artists/performers to have real-time control of the stage
…equipped with –
5
Video data (surveillance cameras) Audio data (telephone wiretabs) Image data (surveillance, mugshots) Document data (police reports) Conventional data (bank records, employment records, police records) Geographic data (maps)
video cameras pressure sensors microphone arrays real-time 3D-motion tracking K. Selcuk Candan
1
Logical view
Scenario • Two performers, an adult and a child performers, are tracked by a 3Dmotion tracking device. • Their locations on the stage are tracked using pressure sensors. • ARIA continuously monitors the position of the body markers of performers in 3D space, the positions of performers and details of their gestures.
7
• The output of the 3D
motion tracking is filtered through ARIA to obtain the shape and degree of confidences of the pose. K. Selcuk Candan
8
K. Selcuk Candan
10
K. Selcuk Candan
Semantic Heterogeneity z
Spatio - temporal dimensions – – – – –
z z
modeling specification indexing, retrieval, and visualization methods
User- and context- dependence, subjectivity Availability at various quality levels
9
K. Selcuk Candan
Subjectivity
Physical Heterogeneity z
Volume – – –
z
Quality/cost trade - off –
11
K. Selcuk Candan
12
storage, delivery, and processing
increases robustness, graceful degradation
K. Selcuk Candan
2
Interactivity z
Example: X3D
100ms interaction deadline – –
z
resource allocation prefetching/caching
X3D: Extensible 3D Graphics –
Based on VRML 97(Virtual reality modeling language) Root
z
Subjectivity and personalization of content
Translation Transform -2 3 4
z
Color – 10 100 10
Interaction structure
Shine – 0.75 Box – 2 2 2
13
K. Selcuk Candan
14
What is a X3D world?
– – – – – –
K. Selcuk Candan
Media (image files) 3D mesh geometry (human objects) Shape primitives (boxes that form cars) Node structure/hierarchy Spatial structure (position of the car, transformation hierarchy) Event/interaction structure (sensors/scripts) Temporal structure (motion/timeline) Metadata (comments, variable names)
16
What is a database?
K. Selcuk Candan
What is a database?
How does it differ from a file system?
z
A system which allows access to a collection of data – –
How does it differ from an operating system?
–
z
17
K. Selcuk Candan
Example: car-road world (a police car and a road) –
z
Shine – 0.8
What does an X3D document contain?
–
z
Transform -2 5 10 Color – 20 10 10
Circle – 2
z
15
Transform -1 1 1
K. Selcuk Candan
19
User specifies what to see System retrieves the corresponding data from the collection The system presents the retrieved info to the user
..different from browsing
K. Selcuk Candan
3
Queries z z
Metadata queries Example queries – –
z
What is an image? z z
Exact Partial match
–
Object queries – – –
z
K. Selcuk Candan
z
z
local or web
z
Size of data Properties of data –
A query processor (indices etc.) which –
z
K. Selcuk Candan
Why “image” database?
A collection of images
–
–
maps user query into data model retrieves the relevant images
z
– –
z
How to let users specify what they want?
23
K. Selcuk Candan
What are the features that interests us?
What kind of images? Mug shots, cat-scans, finger prints z News, advertisement, family photos z Surveillance z Video frames z
z
Colors, color histograms
z
Edges
z z
Texture Image segments
z
Objects
z
Metadata, captions
–
–
–
–
24
Similarity-based query processing new index structures relevance ordering
Query language –
K. Selcuk Candan
Visual: image processing Semantical: pattern recognition
Similarity-based retrieval –
An information visualization system which shows results to the user
22
visual semantical
21
What is an image database?
–
An object is an entity within an image z
visual similarity semantic similarity spatial similarity
20
z
2D matrix of values Collection of objects and their spatial relationships
K. Selcuk Candan
25
“sunny day”, “sea” “maps”, “aerial surveillance”
shape,location, color visual features, semantics K. Selcuk Candan
4
What kind of queries? z z
Find me all images created by “John Smith” Find all images which look like “im_ex.gif” –
z z
26
K. Selcuk Candan
27
What kind of queries? z z z
z
28
z
advertisement
Find all images which contain a car Find all images which contain a car and a man who looks like”mugshot.bmp” –
z
– –
surveillance
data mining
K. Selcuk Candan
first object looks like “im.gif” second object is a car first obj. to the right of second obj.
and return the semantics of these two objects. 29
Example
30
Find all objects contained in images of sunny days Find all images which contain two objects –
Find all image pairs which contain similar objects –
K. Selcuk Candan
What kind of queries?
Find all images of sunny days –
Find me top-5 images which look like “im_ex.gif”
Find all images which look like “sketch.bmp” Fins all images which contain a part which look like ….
K. Selcuk Candan
QBE (visual representation)
K. Selcuk Candan
31
K. Selcuk Candan
5
Query…and results…
Relational databases z
Data is – –
z
This is the main assumption for – – –
32
K. Selcuk Candan
textual numerical storage query processing optimization
33
K. Selcuk Candan
Attribute
z
Information is in tabular form –
z z z
NAME
SSN
OFFICE
DESC
..
..
..
..
..
..
..
J. Doe
555-5555
GWC 999
Asst. prof
J. Smith
333-3333
GWC 989
Prof
..
..
..
..
Example: Information about an employee
Schema describes the content A key uniquely identifies a given tuple Each attribute has a domain
34
Index structures
Schema
K. Selcuk Candan
36
Algebra z z
– – – – –
z
37
PAGE
K. Selcuk Candan
Calculus
A set of data manipulation operators Relational algebra (operates on relations) –
tuple
Relational databases
z
A query language should be declarative: –
Say what we want z
Select (σ) Project (π) Cartesian product (×), join Union (U) Intersection (∩) Difference (-)
–
query optimization
Don’t say how we get it z
no optimization possible
{t.name | (t є Employe) and (t.salary < 1000) and (exists t2 (t2 є Students) and (t2.gpa > 3.7) and (t.ssn = t2.ssn) )}
Procedural (non-declarative) K. Selcuk Candan
38
K. Selcuk Candan
6
SQL z
Relational databases
Based on relational calculus
z
select from where
z z z
select t.name from employee t, student t2 where (t.salary < 1000) and (t2.gpa > 3.7) and (t.ssn = t2.ssn)
z
40
K. Selcuk Candan
Business applications Data model is relational Queries are exact/declarative Updates are important Concurrency is important
41
K. Selcuk Candan
43
K. Selcuk Candan
How does a database look like? SQL
Application interfaces Query parser
Transaction processing
Query Optimizer (cost based) Query Processor
Replication manager
Indices Data
42
K. Selcuk Candan
Example: X3D/VRML Archive
44
K. Selcuk Candan
45
K. Selcuk Candan
7
Relational databases (??) z z z z z
Shortcomings…
Business applications Data model is relational Queries are exact/declarative Updates are important Concurrency is important
z
Image data doesn’t fit into tuples
z
No image comparison No partial match processing No ranking Not computationally complete
–
z z z
Media data needs to be kept separately
z
46
K. Selcuk Candan
47
Solutions
K. Selcuk Candan
Other problems? z
z
z
Media processing requires more computational power.
Use a host language and embed database queries in it (relational approach)
z
–
Provide more computational power in the data model itself (object- oriented approach)
48
K. Selcuk Candan
It does not capture the semantical structure of the data well Hierarchies: –
Aggregation hierarchy Inheritence hierarchy
49
Aggregation hierachy
K. Selcuk Candan
Inheritence hierachy
Digital library Multimedia object
video
video frame Live video pixels
text
xml
objects
pixels
50
Recorded video
document
semantics
K. Selcuk Candan
51
K. Selcuk Candan
8
OODB z
Object oriented databases provide – – –
z
z
Higher computational power Aggregation hierarchies Inheritence hierarchies
z z z
They model the real world better! –
z
OODB
z
Everything is an object
You can define your own external methods
z
Business applications Multimedia (??) Data model is object oriented Queries are exact Queries are procedural (some declarative languages) Concurrency/updates are important
E.image_similar_to (c.image)
52
K. Selcuk Candan
53
Shortcomings… z z z z
Object Relational Databases
Too much overhead –
K. Selcuk Candan
z
Optimization is hard
–
No partial match processing No ranking Query processing is cost driven –
z
–
z
K. Selcuk Candan
What else? z
No partial match processing No ranking Query processing is cost driven –
Deductive databases – –
z
–
K. Selcuk Candan
57
Logic based Boolean queries
Fuzzy databases –
not “similarity” driven
–
56
User definied functions User defined abstract data types (ADTs)
55
Shortcomings…
z
tuples SQL
Object technology z
K. Selcuk Candan
z
Relational technology z
not “similarity” driven
54
z
Benefits from both
usually logic-based, but not boolean nothing is true or false results are not-exact (like multimedia queries)
K. Selcuk Candan
9
What else? z
What else?
Spatial/Temporal Databases – – –
z z
–
z
Scientific, geographic applications Data model is vector or interval based Queries
Data mining – – –
Range queries Nearest neighbor queries
Business, scientific applications Relational data model Queries: find z z
Queries are declarative or visual
z z
58
K. Selcuk Candan
59
What else? z
–
z
Most data has a well-defined structure (schema) In SSD, there is no common schema z
z
K. Selcuk Candan
…and
Semi- structured data management –
Image databases –
Data model is feature-vector based z
–
z
Structure-based
– –
z K. Selcuk Candan
61
Research Issues z
–
z
–
62
z
z
Content, features of interest Information extraction/integration
Query-by-example Ranking Feedback (user-to-system, system-to-user) K. Selcuk Candan
Query processing – –
Query Language –
Each feature represented as a vector space
Research Issues
Data model –
Color Texture
Structure may or may not be available Queries z
60
Multiple features –
each object describe itself
Queries –
patterns, rules, classes, or outliers
– –
matches the data model captures user’s interest
–
K. Selcuk Candan
63
Online vs. off-line information extraction Indices for different media Optimization of queries with different media Similarity-based retrieval, ranking Relevance feedback
K. Selcuk Candan
10
Research issues z
Storage/delivery – –
z
–
speed precison and recall
5
0
How to transmit large, continuous data (video) How to visualize/present results of a query which may contain multiple types of data (images, video, audio) K. Selcuk Candan
65
Vectors…what are they??? z
Image with 1 pixel
Visualization –
64
z
How to store data in different formats How to retrieve data efficiently z
z
Vectors…what are they???
3
K. Selcuk Candan
Vectors…what are they???
Image with 2 pixels
z
7
Image with 3 pixels
7
0
0
5
5
3
66
K. Selcuk Candan
67
Distance between two image??? z
K. Selcuk Candan
Euclidean distance
Given A and B, how different are they? A
A ∆(A,B)
B
68
B
K. Selcuk Candan
69
K. Selcuk Candan
11
Which image is more similar to A?
C
Which image is more similar to A?
∆(A,C)
C
∆(A,C)
A
A
∆(A,B)
∆(A,B)
B
B
Closer to A Similar to A
70
K. Selcuk Candan
71
“Find 2 most similar images to A”
K. Selcuk Candan
“Find 2 most similar images to A”
A
72
A
K. Selcuk Candan
73
“Find images at most δ different from A”
Nearest-neighbor search
K. Selcuk Candan
“Find images at most δ different from A”
A
A δ
74
K. Selcuk Candan
75
range search
K. Selcuk Candan
12
Are there other similarity measures?
Let’s try angles...
A
Let’s try angles...
E
F
3
2
5
A
5
5
5
E
2
2
2
Similar composition
E
K. Selcuk Candan
ρ x = x1 , x2 ,..., xn
–
80
2
2
2
F
3
2
5
A
5
5
5
E
2
2
2
If we use angles as a similarity measure, then A is more similar to E than F
K. Selcuk Candan
What is a good measure then??
ρ y = y1 , y2 ,..., yn
y
F
cos(AE) > cos(AF)
79
Angle-based measures
i =1
E
A
F
78
Dot product ρρ n x . y = ∑ xi
5
K. Selcuk Candan
A
–
5
5
Let’s try angles... Similar composition
Given
2
5
F
77
K. Selcuk Candan
z
3
A E
76
F A
i
z
Application dependent...
z
...but, distances in a metric space help indexing!
Cosine similarity
ρρ ρρ x. y cos( x , y ) = ρ ρ x y
K. Selcuk Candan
81
K. Selcuk Candan
13
Metric distances (Minkowski metrics)
Metric model: axioms z
Any function d expressing a distance must satisfy the following axioms: self-minimality: minimality simmetry triangular inequality
– – – –
z
L1-metric: d = (dX+dY) Y
d(s,s) = 0 d(s1,s2)>= d(s1,s1) d(s1, s2) = d (s2, s1) d(s1,s2) + d (s2,s3) >= d (s1,s3)
dY dX
z
Example: Euclidean distance
X
82
K. Selcuk Candan
Metric distances (Minkowski metrics) z
Also called Manhattan Distance
83
K. Selcuk Candan
Metric distances (Minkowski metrics)
L2-metric: d = (dX2+dY2)1/2
z z
Y
z z
L3-metric; d = (dX3+dY3)1/3 ..... ..... L(infinity): d = max{X,Y}
dY dX X
Also called Euclidean ManhattanDistance Distance
84
K. Selcuk Candan
85
Feature…
…metric model z
Well suited for certain kinds of similarity evaluation, such as color based comparisons
z
Consistent with widly used approaces from computer vision and pattern recognition communities –
z
86
K. Selcuk Candan
z z
–
student_ID
can be a feature z What are the features for an image?
results suggest that the L1 metric may better capture human notions of image similarity.
Makes it relatively easy to index data, modeled as vectors of properties, in terms of classical multidimensional indexing techniques. K. Selcuk Candan
…a property of interest that can help us index an object For a “student record”
87
K. Selcuk Candan
14
Image features z
There are many possible features – – – – – –
z
88
Good feature..
Color histogram Texture Edges Shapes Objects Object or scene semantics
z
A good feature corresponds to users’ perception as much as possible Relevance feedback!!!!
89
What does “significant” mean
K. Selcuk Candan
What does “significant” mean
Information theoric sense: –
A good feature is significant and enables us to differentiate objects from others as much as possible
–
Feature selection: which one to use for indexing? K. Selcuk Candan
z
z
z
An event is more significant if it carries more information
Information theoric sense: – –
An event is more significant if it carries more information An event that has high occurance rate caries less information z
Solar eclipse is more interesting then sunset High frequency ----- less information Low frequency ----- high information
90
K. Selcuk Candan
91
Entropy z
92
K. Selcuk Candan
Entropy (example)
Total information content (uncertainty)
z
K. Selcuk Candan
94
Total information content (uncertainty)
P(a) = 0.5, P(b) = 0.5
H=1
P(a) = 1.0, P(b) = 0.0
H=0
more uncertain less uncertain
K. Selcuk Candan
15
Entropy (example) z
Which feature is better?
Total information content (uncertainty) F2
P(a) = 0.5, P(b) = 0.5
H=1
P(a) = 1.0, P(b) = 0.0
H=0
more uncertain more information less uncertain less information
95
K. Selcuk Candan
F1
96
Which feature is better?
F3 K. Selcuk Candan
Which feature is better?
F2
F2
F1
97
Better separation!
F1
F3 K. Selcuk Candan
98
Which feature is better?
F3 K. Selcuk Candan
Principal component
F2
F2
Better separation!
Better separation!
Less frequent! F1
99
F1
F3 K. Selcuk Candan
10 0
F3 K. Selcuk Candan
16
Principal component analysis
Principal component analysis
F2
The eigenvector of the covariance matrix with the largest eigenvalue
F2
Principal component is a combination of features!
Principal component is a combination of features!
F1
10 1
F1
F3 K. Selcuk Candan
10 2
Compactness of a database
F3 K. Selcuk Candan
Compactness of a database
comp( D) = ∑ similarity (oi , o j )
comp( D) = ∑ similarity (oi , o j )
i≠ j
i≠ j
more compact
10 3
K. Selcuk Candan
10 4
Compactness of a database
K. Selcuk Candan
Feature quality
comp( D) = ∑ similarity (oi , o j )
A feature is • good if we remove it, the overall compactness increases • bad if we remove it, the overall compactness decreases
i≠ j
A compact database is not desirable!!!
10 5
K. Selcuk Candan
10 6
good
bad
K. Selcuk Candan
17
Signal...
Media and Features
z
Is a function (generally of time) f(t)
Used in representing analog and digital information -analog signal Æ continuous -digital signal Æ discrete
10 8
Audio: Image:
f: time-Ævolume (analog) f: coordinate × coordinate Æ color (digital)
K. Selcuk Candan
Filter.... z
z
z
z
10 9
Black and white images – f : coordinate × coordinate Æ {0, 1} Greyscale images – f : coordinate × coordinate Æ [0, 255]
z
z
–
Color (Depth = 24) – f : coordinate × coordinate Æ [0, 255] × [0, 255] × [0, 255]
–
Ct: color index Æ [0, 255] × [0, 255] × [0, 255]
–
z
Color image (with color table) – f : coordinate × coordinate Æ color index [0, 255]
–
K. Selcuk Candan
Text – – – –
Symbolic Artificial Single meaning (reader independent) Small storage requirements
f(av) Æ a*g(v) f1(v) + f2(v) Æ g1(v) + g2(v)
A’ + B’ = C’
Space invariant filter
11 0
Text vs. images z
....is a function that transforms an input signal f(v) into an output signal g(v) – Filter: f(v) Æg(v) A + B = C Linear filter:
filter(f(x+a)) = filter (f(x)) K. Selcuk Candan
Example: Images... z
Images – – – –
z
Visual Natural, artificial Multiple meanings (viewer dependent) Large storage requirements
Convenient ways to store visual information –
Bitmap: z 2D array of pixels. z each pixel contains color+illumination information
–
They have to be z z
processed and analyzed
to extract the information content
11 1
K. Selcuk Candan
11 2
K. Selcuk Candan
18
Image operations z z z z z z z z z z
11 3
z z
Processing vs. analysis..
Acquire Store Browse Query/QBE Process/Analyse Index Retrieve/request Display Compress Watermark Transmit Enhancement
z
I
z
– – – – – – – –
z
z z z
Cat, man, umbrella…
N w w M
Many of these operators require a filter to be convoluted over the original image
# of pixel: 628 × 1024 # of bits per pixel = 24 # of bits = 628 × 1024 × 24 # of bytes = 628 × 1024 × 3 ≈ 190 Kbytes How to reduce storage cost?
11 6
1024
z
K. Selcuk Candan
–
reduce the visual quality of the image
Feature vector size: 628 × 1024 –
628
change the coding reduce the dimensions of the image (scale) z
Cost of the operation: M × N × w × w
Problem…
–
–
11 7
K. Selcuk Candan
Sharpening Bluring Rotating Translating Brightening Cut/paste/resize Warping Edge detection Segmentation
Storage requirements
z
IAE
Processing requirements...
K. Selcuk Candan
z
I
Operators –
I’
Image analyzing
11 4
Image processing
11 5
IPE
(Histograms)
K. Selcuk Candan
z
Image processing
Dimensionality curse: high dimensions make indices unusable (10-15 dimensions max!!!)
loss of granularity
K. Selcuk Candan
11 8
K. Selcuk Candan
19
Problem… z
Feature vector size: 628 × 1024 –
z
Dimensionality curse: high dimensions make indices unusable (10-15 dimensions max!!!)
Solution: Reduce # dimensions of the vector – –
11 9
Transforms
A
use distance-preserving transforms Ex: fourier transform, DCT, wavelet transform
628 × 1024 DCT
12 0
4 K. Selcuk Candan
Transforms
K. Selcuk Candan
Transforms A
A
A
12 1
Distances and angles are preserved
12 2
K. Selcuk Candan
Transforms
K. Selcuk Candan
Transforms A
12 3
A
Distances and angles are preserved
Distances and angles are preserved
Some dimensions are more important (differentiating) than the other
Some dimensions are more important (differentiating) than the other
K. Selcuk Candan
12 4
Eliminate unimportant dimensions
K. Selcuk Candan
20
Transform + Projection (Compression or Feature selection) A
What happens to distances???
A
A
A
∆’(A,B)
∆(A,B)
Projection
B
B
12 5
12 6
K. Selcuk Candan
What happens to distances???
∆(A,B) ∆’(A,B)
K. Selcuk Candan
What happens to distances??? δ
δ A
δ δ A
A
A ∆2 ∆1
C B
∆1’
B
C
12 7
∆2’
B
12 8
K. Selcuk Candan
False hit (∆1> ∆1’)
C
B
C
Miss (∆2< ∆2’)
K. Selcuk Candan
Misses are not desirable! Can not be eliminated with postprocessing
Image analysis
What happens to distances??? δ δ A
z
A ∆1
∆1’
B
∆2’
Image analysis gives the features –
∆2
–
C
– –
B
C
– – –
12 9
False hit (∆1> ∆1’)
Miss (∆2< ∆2’)
K. Selcuk Candan
13 0
z
Color histogram Texture Edges Shapes Objects Semantics Depth (stereoscopic images)
Image analysis is usually an expensive operation K. Selcuk Candan
21
..so... z
Example feature: color z
In the design of a MIS, you want to minimize the number of image analysis operations to perform on the fly – – –
– z
–
– –
(op1 op2) = (op2 op1) and Cost (op2 op1) < Cost(op1 op2)
–
then do op2 op1
13 1
K. Selcuk Candan
z
RGB: describes colors in terms of the combinations of the intensities of Red, Green and Blue colors
z
HSV – Hue: main color – Saturation: Amount of white – Value: Amount of energy
z
YUV, a linear transformation from RGB – Y: luminance (amount of light) – grey scale – U: red - cyan – V: magenta-green
S G
Y W
C
Cyan
B
White Black
K. Selcuk Candan
Color models
B
Magenta
reduce the number of colors to 256 (1 byte per pixel) cluster similar colors into a single bucket and assign a single color to the bucket the set of buckets is called color table
13 2
Color spaces...
Blue
224 colors ≈ 16 million
Color table:
If –
24 bits (3 bytes) of red, green, and blue z
pre-processing/ pre-analysis indexing/clustering semantic optimization z
Example: color
R M
Green G
Red
Yellow V
R
13 3
RGB (Red, Green, Blue)
HSV (Hue, Saturation, Value) K. Selcuk Candan
13 4
YUV (ex. PAL television system)
color
13 5
YUV
Y=0.299R + 0. 587 G + 0.114 B U= 0.492 (B-Y) V =0.877 (R-Y)
Y
Grey scale image
U
Red-Cyan
V
Magentagreen
K. Selcuk Candan
K. Selcuk Candan
Color histograms
13 6
K. Selcuk Candan
Courtesy of Misha Pavel, OGI
22
Problems with histograms
Problems with histograms
Histogram: {green:4, purple:2, red:3}
Histogram: {green:4, purple:2, red:3}
• Are these similar???
• Are these similar??? • Color associations????: • blue is similar to purple • yellow is similar to orange
13 7
K. Selcuk Candan
13 8
Color locality??
K. Selcuk Candan
Comparison of color histograms z
Hist1
Euclidean distance
Hist2 (b1-b2)2 + (g1-g2)2 + (p1-p2)2 + (r1-r2)2 +…
Hist3
Hist4 z
Intersection similarity min(b1,b2) +min (g1,g2)+min (p1,p2)+ min (r1,r2)+...
13 9
K. Selcuk Candan
14 0
z
Let x and y be two histogram vectors, each of length n
14 1
Gavg= (1/N) ∑i =1..N G(pi)
Bavg= (1/N) ∑i =1..N B(pi)
aij = cross talk factor between i-th and j-th color No cross-talkÆ
Use average color of an image Ravg= (1/N) ∑i =1..N R(pi)
d2 = ∑i =1..n ∑j=1..n aij (xi-yi) (xj-yj) –
K. Selcuk Candan
Quadratic distance bounding
Complete Euclidean Distance z
b2 + g2 + p2 + r2+ .....
x = (Ravg, Gavg , Bavg) T
aij = 1 if i = j aij = 0 otherwise K. Selcuk Candan
14 2
davg2 (x,y)=(x - y)T (x- y) =(Ravgx – Ravgy)2 + (Gavgx – Gavgy)2 + (Bavgx – Bavgy)2 K. Selcuk Candan
23
Quadratic distance bounding
Quadratic distance bounding
davg2 (x,y) T
∑w f
i i i∈{1..n}−{ j }
>T
or
i =1
sim(o, q) = w j f j + sim( − j ) (o, q ) > T 51 5
K. Selcuk Candan
51 6
Feature (term) readjustment
Feature (term) readjustment
sim(o, q) = w j f j + sim( − j ) (o, q ) > T
51 7
fj=1
fj=0
sim( − j ) (o, q ) ≤ T
a
0
sim( − j ) (o, q ) > T
b
c
| relevant & retrieved |= a + b + c K. Selcuk Candan
K. Selcuk Candan
p( f k | R) =
K. Selcuk Candan
Feature (term) readjustment
independent
independent
p (( f k = 1) ∧ (sim( − k ) (o, q ) > T ))
p( f k | R) =
p (sim( − k ) (o, q ) > T )
p ( f k | R) = p( f k = 1) 51 9
p (sim( − k ) (o, q ) > T )
51 8
Feature (term) readjustment
p( f k | R) =
p (( f k = 1) ∧ (sim( − k ) (o, q ) > T ))
p (( f k = 1) ∧ (sim( − k ) (o, q ) > T )) p (sim( − k ) (o, q ) > T )
p ( f k | R) = p( f k = 1) = p ( f k = 1 | sim( − k ) (o, q ) > T ) K. Selcuk Candan
52 0
K. Selcuk Candan
79
Feature (term) readjustment
p( f k | R) =
Ranking z
p (( f k = 1) ∧ (sim( − k ) (o, q ) > T ))
z
p (sim( − k ) (o, q ) > T )
z z
p( f k | R) = 52 1
p ( R | Oi ) > p( R | O j ) ⇔ sim(Q, Oi ) > sim(Q, O j )
b b+c K. Selcuk Candan
52 2
Ranking
K. Selcuk Candan
Bayes Theorem p( A | B) =
p ( R | Oi ) > p ( R | O j ) ⇔ sim(Q, Oi ) > sim(Q, O j ) z
Q: query R: event that a document is relevant to the user P(R|Oi): given features and weights, the probability that object oi is relevant to the user …then, retrieval is effective iff
p( B | A) p( A) p( B)
Let’s try to rewrite the first half of the equation using Bayes theorem
52 3
K. Selcuk Candan
52 4
Bayes Theorem p( A | B) =
K. Selcuk Candan
Relevance of two objects
p( B | A) p( A) p( B)
p ( R | Oi ) =
p(Oi | R ) p( R) p (Oi | R) p( R) + p(Oi | I ) p( I )
> p( A | B) = 52 5
p( B | A) p( A) p( B | A) p( A) + p( B | ¬A) p (¬A) K. Selcuk Candan
p( R | O j ) = 52 6
p (O j | R ) p ( R ) p(O j | R ) p ( R ) + p(O j | I ) p( I ) K. Selcuk Candan
80
Relevance of two objects
…so we have p (Oi | R ) p(O j | R) > ⇔ sim(Q, Oi ) > sim(Q, O j ) p (Oi | I ) p (O j | I )
p (Oi | R) p (O j | R ) > p (Oi | I ) p(O j | I )
How do we compute these probabilities???
52 8
K. Selcuk Candan
52 9
…so we have z
…so we have
Let us assume features are independent – –
K. Selcuk Candan
z
o= q=
Let us assume features are independent n
n
p(Oi | I ) = ∏ p( f i ,k | I )
p (Oi | R) = ∏ p ( f i ,k | R)
k =1
k =1
n
p (Oi | R) = ∏ p ( f i ,k | R) k =1
53 0
n
n
p(Oi | I ) = ∏ p( f i ,k | I )
p(Oi | R) p(O j | R ) > ⇔ p(Oi | I ) p(O j | I )
k =1
K. Selcuk Candan
53 1
…so we have z
p (Oi | R) = ∏ p ( f i ,k | R) k =1
| R) >
k =1 n
∏ p( f
i ,k
| I)
k =1
∏ p( f
j ,k
| R)
∏ p( f
j ,k
| I)
k =1 n
k =1
K. Selcuk Candan
z
n
p(Oi | I ) = ∏ p( f i ,k | I ) k =1
n p( f j ,k | R) p ( f i ,k | R ) n p(Oi | R) p(O j | R) > ⇔ ∑ log > ∑ log p(Oi | I ) p(O j | I ) p( f i ,k | I ) k =1 p( f j ,k | I ) k =1
53 2
n
i ,k
…so we have
Let us assume features are independent n
∏ p( f
K. Selcuk Candan
Let us assume features are independent –
use dot product as the similarity measure
–
use
–
use as the query!!!
log
p( f i ,k | R ) p ( f i ,k | I )
as the weight of the kth feature
n p( f j ,k | R) p ( f i ,k | R ) n p(Oi | R) p(O j | R) > ⇔ ∑ log > ∑ log p(Oi | I ) p(O j | I ) p( f i ,k | I ) k =1 p( f j ,k | I ) k =1
53 3
K. Selcuk Candan
81
…so we have z
…so we have
Let us assume features are not independent – –
z
o= q=
– –
z
k =1
53 4
K. Selcuk Candan
53 5
…so we have
z
z
I ( p1, p 2) = ∑ p1( x) log
How can we incorporate term dependence??? Degree of approximation…..
– –
x
–
p1=p2 p1p2
implies that I=0 implies that I>0 K. Selcuk Candan
z
53 7
Dependence graph f1
p1( x) p 2( x )
implies that I=0; p1p2
implies that I>0
Degree of dependence between fi and fj
Dij = I ( p ( f i ∧ f j ), p ( f i ) p ( f j ) ) If the two terms are independent, then Dij will be 0!!!
K. Selcuk Candan
Dependence graph f1
10
7
f3
2
f2
p1=p2
10
7
f3
2
f2
1 9
Maximum spanning tree
1 9
f4
53 8
Degree of approximation…..
o= q=
p1( x) I ( p1, p 2) = ∑ p1( x) log p 2( x) x 53 6
K. Selcuk Candan
…so we have
Let us assume features are not independent –
z
How can we incorporate term dependence???
p(Oi | I ) ≠ ∏ p( f i ,k | I )
k =1
–
o= q=
n
n
p (Oi | R) ≠ ∏ p ( f i ,k | R)
z
Let us assume features are not independent
f4
If the two terms are independent, then Dij will be 0!!!
K. Selcuk Candan
53 9
If the two terms are independent, then Dij will be 0!!!
K. Selcuk Candan
82
Dependence graph f1
10
7
f3
2
f2
Dependence graph f1
Maximum spanning tree
10
7
f2
2
f2
1 9
Maximum spanning tree
1 9
f4
f4
p ( f1 ∧ f 2 ∧ f 3 ∧ f r ) = p ( f1 ) p ( f 2 | f1 ) p ( f 3 | f1 ) p ( f 4 | f 2 )
p ( f1 ∧ f 2 ∧ f 3 ∧ f r ) = p ( f1 ) p ( f 2 | f1 ) p ( f 3 | f1 ) p ( f 4 | f 2 ) p(Oi | R ) can be computed using the distribution of the features in R!!!
54 0
If the two terms are independent, then Dij will be 0!!!
K. Selcuk Candan
54 1
Feedback without user’s help.. Q
Retrieval Engine
p(Oi | I ) can be computed using the distribution of the features in I!!! K. Selcuk Candan
Feedback without user’s help..
res1 res2 res3 res4 res5
res1 res2 res3 res4 res5
Feedback Engine δ
resk
resN
System returns N ranked results
54 2
K. Selcuk Candan
First k ranked results are chosen
54 3
Feedback without user’s help.. Q’
Retrieval Engine
K. Selcuk Candan
Multimedia query processing
res1 res2 res3 res4 res5
z
First major issue
z
Second issue
z
Third issue
–
δ
–
–
resN
54 4
System returns new N ranked results
K. Selcuk Candan
resN
54 5
imperfections (fuzziness) ranking expensive predicates (user defined functions)
K. Selcuk Candan
83
A multimedia query
54 6
K. Selcuk Candan
54 7
K. Selcuk Candan
Query…and results…
A multimedia query Crisp Fuzzy (imperfect)
54 8
K. Selcuk Candan
54 9
Reasons for imperfection z z z z z
55 0
K. Selcuk Candan
Fuzzy set..
Similarity between features (yellow/orange) Imperfections in the feature extraction algorithms Imperfections in the query formulation methods Partial match requirements Imperfections in the index structures and clustering algorithms K. Selcuk Candan
55 1
z
Fuzzy set F with domain D is defined using a membership function
z
A crisp (conventional) set C with domain D is defined using a membership function
z
A fuzzy set corresponds to a fuzzy predicate K. Selcuk Candan
84
Example
Example
hot
hot
1.0
1.0 0.85
heat
0.0 -5
0
5
10
15
20
25
30
35
40
heat
0.0
45
-5
0
5
10
15
20
25
30
35
40
45
Hot(29o) = 0.85
55 2
K. Selcuk Candan
55 3
Example query
K. Selcuk Candan
Example query
Fuzzy (imperfect)
Fuzzy (imperfect) 0.84
55 4
K. Selcuk Candan
55 5
Example query
K. Selcuk Candan
Example query
Fuzzy (imperfect) 0. 76
0.84
Fuzzy (imperfect) 0. 76
0.68 Fuzzy logical operator
z z
55 6
0.68
K. Selcuk Candan
55 7
z
0.84
0.68 Fuzzy logical operator
Fuzzy and crisp predicates Fuzzy logical expression and a merge function Results is a ranked list (with the associated fuzzy values!) K. Selcuk Candan
85
How to process a fuzzy query?
z
How to process a fuzzy query?
First approach…make predicates crisp!!! –
z
Use thresholds….
First approach…make predicates crisp!!! –
hot
Use thresholds…. hot
1.0
1.0 0.7
0.7 heat
0.0
55 8
-5
0
5
10
15
20
25
30
35
40
45
K. Selcuk Candan
55 9
How to process a fuzzy query?
z
0. 76
5
Good for processing in traditional databases
10
15
20
25
30
35
40
45
Not good for multimedia K. Selcuk Candan applications
Example merge functions…
0.84
Arithmetic average (N-ary)
0.68
K. Selcuk Candan
K. Selcuk Candan
Triangular norms (and co-norms)
How to emulate the properties of a crisp predicate
K. Selcuk Candan
(mostly used in information retrieval)
56 1
Triangular norms (and co-norms)
56 2
0
Merge (or scoring) functions….
0.76 = µ∧(0.84,0.68)
z
-5
Second approach…use suitable fuzzy logic!!! –
56 0
heat
0.0
56 3
z
How to emulate the properties of a crisp predicate
z
Bellman and Giertz: “The unique aggregation functions for evaluating AND and OR that preserve logical equivalence of queries involving only conjunction and disjunction and that are monotonic in their arguments are min and max.” K. Selcuk Candan
86
Triangular norms (and co-norms) z z
Triangular norms (and co-norms)
Emulating the properties of a crisp predicate may not be good for multimedia applications!!! Boundary conditions prevent partial matches
0. 00
0.99
z
K. Selcuk Candan
56 5
Triangular norms (and co-norms) z z
Emulating the properties of a crisp predicate may not be good for multimedia applications!!! Monotone condition is weak!!
0. 70 0. 70
0.00 0.00 = µ∧min(0.99,0.00)
56 4
z
0.71 0.99
0.70 0.70 0.70 = µ∧min(0.71,0.70) 0.70 = µ∧min(0.99,0.70)
K. Selcuk Candan
Visualisation of minimum
Emulating the properties of a crisp predicate may not be good for multimedia applications!!! N-ary semantics may be enough !!! –
consider all relevant features at the same time, instead of in pairs! Arithmetic average (N-ary)
Geometric average (N-ary)
56 6
K. Selcuk Candan
57 0
Visualisation of minimum
Visualisation of arithmetic average
monotone: not enough
57 1
K. Selcuk Candan
No partial match: not good K. Selcuk Candan
strictly monotone:better!!
57 2
partial matches: better!! K. Selcuk Candan
87
Visualisation of arithmetic average
Interestingness….. z
constant slope: not enough
An increase in the score of a subquery –
with a lower score is more interesting than
a similar increase in the score of a subquery
strictly monotone:better!!
–
with a large score
0. 40 0. ?? 0. ?? partial matches: better!!
57 3
K. Selcuk Candan
0.00 0.10 0.00
0.80 0.80 0.90
57 4
Visualisation of geometric average
K. Selcuk Candan
….parametric geometric average
adaptive slope: better Truth cutoff
No Notpartial goodmatch: not good
57 5
K. Selcuk Candan
57 6
….parametric geometric average
K. Selcuk Candan
How to put weigths?????
z
57 7
rtrue=0.4; β=0.4
rtrue=0.4; β=0.2
rtrue=0.4; β=0.0 K. Selcuk Candan
Falsehood value
57 8
How do I state that image properties are more important than semantic properties??
K. Selcuk Candan
88
How to put weigths????? z z
Fagin’s proposal
How do I state that image properties are more important than semantic properties?? What do we mean?: –
0. 60 0. 65 0. 68
z
–
A change in the value of image property should have a larger impact than a similar change in the value of the semantic property.
0.60 0.70 0.60
57 9
58 0
Fagin’s proposal
–
K. Selcuk Candan
Fagin’s proposal
Desirata –
z
If all weigths are equal the result should be equal to no weight case If one of the weigths is zero, the subquery can be dropped without effecting the rest
Desirata – – –
58 1
K. Selcuk Candan
If all weigths are equal the result should be equal to no weight case If one of the weigths is zero, the subquery can be dropped without effecting the rest ..the result should be a continuous function of the weigths
58 2
Fagin’s proposal z
If all weigths are equal the result should be equal to no weight case
0.60 0.60 0.70
K. Selcuk Candan
z
Desirata
K. Selcuk Candan
Fagin’s proposal
Let θ1 + θ 2 + ... + θ m = 1 θ1 ,θ 2 ,...,θ m ≥ 0 θ1 ≥ θ 2 ≥ ... ≥ θ m
z
Let θ1 + θ 2 + ... + θ m = 1 θ1 ,θ 2 ,...,θ m ≥ 0 θ1 ≥ θ 2 ≥ ... ≥ θ m
z
then f (θ ,θ 1
2 ,...,θ m
)
(x1 , x2 ,..., xm ) = (θ1 − θ 2 ) f (x1 ) + 2(θ 2 − θ 3 ) f ( x1 , x2 ) + 3(θ 3 − θ 4 ) f (x1 , x2 , x3 ) + .........
58 3
K. Selcuk Candan
58 4
(m − 1)(θ (m−1) − θ m ) f (x1 , x2 , x3 ,..., x(m−1) ) + mθ m f ( x1 , x2 , x3 ,..., xm ) K. Selcuk Candan
89
Fagin’s proposal z
z
Fagin’s proposal
Let θ1 + θ 2 + ... + θ m = 1 θ1 ,θ 2 ,...,θ m ≥ 0 θ1 ≥ θ 2 ≥ ... ≥ θ m
z
(x1 , x2 ,..., xm ) = (θ1 − θ 2 ) f (x1 ) + 2(θ 2 − θ 3 ) f ( x1 , x2 ) + If f is continuous, then the weighted function is also 3(θ 3 − θ 4 ) f (x1 , x2 , x3 ) + then f (θ ,θ 1
2 ,...,θ m
z
)
continuous
......... (m − 1)(θ (m−1) − θ m ) f (x1 , x2 , x3 ,..., x(m−1) ) + mθ m fK.(Selcuk x1 , xCandan 2 , x3 ,..., xm )
58 5
Let θ1 + θ 2 + ... + θ m = 1 θ1 ,θ 2 ,...,θ m ≥ 0 θ1 ≥ θ 2 ≥ ... ≥ θ m
z
then
f⎛ 1
1 1⎞ ⎜ , ,..., ⎟ ⎝m m m⎠
58 7
2 ,...,θ m
)
.........
(m − 1)θ (m−1) f (x1 , x2 , x3 ,..., x(m−1) ) K. Selcuk Candan
Fagin’s proposal
⎛1 1⎞ 2⎜ − ⎟ f ( x1 , x2 ) + ⎝m m⎠ ⎝ m m⎠ ⎛1 1⎞ 3⎜ − ⎟ f ( x1 , x2 , x3 ) + ......... ⎝m m⎠ (m − 1)⎛⎜ 1 − 1 ⎞⎟ f (x1 , x2 , x3 ,..., x(m−1) ) + ⎝m m⎠ 1 m f ( x1 , x2 K. , xSelcuk 3 ,..., xCandan m) m
z
Let θ1 + θ 2 + ... + θ m = 1 θ1 ,θ 2 ,...,θ m ≥ 0 θ1 ≥ θ 2 ≥ ... ≥ θ m
z
then
f⎛ 1
1 1⎞ ⎜ , ,..., ⎟ ⎝m m m⎠
(x1 , x2 ,..., xm ) = f (x1 , x2 , x3 ,..., xm )
If all weigths are equal then the result is equal to the no-weighted case
58 8
Example (arithmetic average) score(a ∧ b) =
1
58 6
(x1 , x2 ,..., xm ) = ⎛⎜ 1 − 1 ⎞⎟ f (x1 ) +
If all weigths are equal…
(x1 , x2 ,..., xm ) = (θ1 − θ 2 ) f (x1 ) + 2(θ 2 − θ 3 ) f ( x1 , x2 ) + If lowest weigth is 0, then the corresponding sub query can be 3(θ 3 − θ 4 ) f (x1 , x2 , x3 ) + then f (θ ,θ
omitted
Fagin’s proposal z
Let θ1 + θ 2 + ... + θ m = 1 θ1 ,θ 2 ,...,θ m ≥ 0 θ1 ≥ θ 2 ≥ ... ≥ θ m
K. Selcuk Candan
Example (arithmetic average)
score(a) + score(b) 2
score(a ∧ b) =
score(a) + score(b) 2
score(a ∧ b) = (θ a − θ b ) score(a) + 2θ b
58 9
K. Selcuk Candan
59 0
score(a) + score(b) 2
K. Selcuk Candan
90
Example (arithmetic average) score(a ∧ b) =
Example (product)
score(a) + score(b) 2
score(a ∧ b) = (θ a − θ b ) score(a) + 2θ b
score(a ∧ b) = score(a) × score(b)
score(a) + score(b) 2
score(a ∧ b) = (θ a − θ b ) score(a) + 2θ b score(a ) × score(b)
score(a ∧ b) = θ a score(a ) + θ b score(b) 59 1
K. Selcuk Candan
59 2
Are Fagin’s desiderata enough? z
K. Selcuk Candan
Are Fagin’s desiderata enough?
It does not compare partial derivatives!!!
z
It does not compare partial derivatives!!! adaptive slope: adaptive importance
59 3
K. Selcuk Candan
59 4
Are Fagin’s desiderata enough? z
K. Selcuk Candan
Are Fagin’s desiderata enough?
It does not compare partial derivatives!!!
z z
E
59 5
= 1 −
1 + b 2 b 2 1 + R P
∀a, b
⎛ ∂E ∂E ⎞ ⎛P ⎞ ⎯→⎜ = ⎟ ⎜ = b⎟ ← ⎝ ∂R ∂P ⎠ ⎝R ⎠
K. Selcuk Candan
It does not compare partial derivatives!!! Importance: Given a function f(x,y), x has a higher contribution than y iff
59 6
∂f ∂x
> a ,b
∂f ∂y
a ,b
K. Selcuk Candan
91
Are Fagin’s desiderata enough? z z
Are Fagin’s desiderata enough?
It does not compare partial derivatives!!! Importance: Given a function f(x,y), x has a higher contribution than y iff ∀a, b
∂f ∂x
> a ,b
∂f ∂y
z z
It does not compare partial derivatives!!! Importance: Given a function f(x,y), x has a higher contribution than y iff ∀a, b
a ,b
∂f ∂x
> a ,b
∂f ∂y
a ,b
relimp (x,y)
59 7
K. Selcuk Candan
59 8
Are Fagin’s desiderata enough? z z
a ,b
=θy
a ,b K. Selcuk Candan
relimp (x,y)
Importance: Given a function f(x,y), x has a higher ∂f contribution than y iff ∀a, b ∂f >
z
Example:
a ,b
=
(θ
x
60 0
∂x
∂score( x ∧ y ) ∂score( x)
a ,b
∂score( x ∧ y ) ∂score( y )
a ,b
− θ y ) + 2θ y b
a ,b
∂y
a ,b
= (θ x − θ y ) + 2θ y b NOT OK!
= 2θ y a
relimp (x,y)
2θ y a
Are Fagin’s desiderata enough?
K. Selcuk Candan
a,a
=
(θ
x
− θ y ) + 2θ y a 2θ y a
= 1+
(θ
x
z
Importance: Given a function f(x,y), x has a higher ∂f contribution than y iff ∀a, b ∂f >
z
Importance: Given a function f(x,y), x has a higher ∂f contribution than y iff ∀a, b ∂f >
z
Example:
z
Example:
∂x
a ,b
∂y
∂score( x ∧ y ) ∂score( x)
a ,b
∂score( x ∧ y ) ∂score( y )
a ,b
a ,b
NOT OK!
K. Selcuk Candan
>1
∂x
a ,b
∂y
a ,b
score( x ∧ y ) = (θ x − θ y ) score( x) + 2θ y score( x) × score( y )
= (θ x − θ y ) + 2θ y b = 2θ y a
−θ y )
2θ y a
Are Fagin’s desiderata enough?
score( x ∧ y ) = (θ x − θ y ) score( x) + 2θ y score( x) × score( y )
60 1
z
OK!
=θy
a ,b
K. Selcuk Candan
=θx
a ,b
∂score( x ∧ y ) ∂score( y )
59 9
a ,b
a ,b
∂f ∂y
score( x ∧ y ) = (θ x − θ y ) score( x) + 2θ y score( x) × score( y )
score( x ∧ y ) = θ x score( x) + θ y score( y ) =θx
=
Are Fagin’s desiderata enough?
Importance: Given a function f(x,y), x has a higher ∂f contribution than y iff ∀a, b ∂f > ∂x a ,b ∂y a ,b Example:
∂score( x ∧ y ) ∂score( x)
a ,b
∂f ∂x
60 2
∂score( x ∧ y ) ∂score( x)
a ,b
∂score( x ∧ y ) ∂score( y )
a ,b
= (θ x − θ y ) + 2θ y b NOT OK!
= 2θ y a K. Selcuk Candan
92
relimp (x,y)
a ,b
=
θ xb θ ya
relimp (x,y)
How about?
θ xa θ x = >1 θ ya θ y
=
How about?
z
Importance: Given a function f(x,y), x has a higher ∂f contribution than y iff ∀a, b ∂f >
z
Importance: Given a function f(x,y), x has a higher ∂f contribution than y iff ∀a, b ∂f >
z
Example:
z
Example:
∂x
∂y
a ,b
a ,b
∂score( x ∧ y ) ∂score( x) ∂score( x ∧ y ) ∂score( y )
60 3
θy
NOT OK! θ y −1
= θ y aθ x b a ,b
K. Selcuk Candan
60 4
Ranking
∂score( x ∧ y ) ∂score( x)
a ,b
∂score( x ∧ y ) ∂score( y )
a ,b
K. Selcuk Candan
61 5
Ranking
??
0.90 X2 0.80 X5 0.70 X6 0.60 X4 0.50 X1 0.40 X3
a ,b
θy
= θ x aθ x −1b
NOT OK! θ y −1
= θ y aθ x b
K. Selcuk Candan
Ranking
0.90 X2 0.80 X5 0.70 X6 0.60 X4 0.50 X1 0.40 X3
61 4
∂y
a ,b
score( x ∧ y ) = score( x)θ x × score( y )
= θ x aθ x −1b a ,b
∂x
θy
θy
score( x ∧ y ) = score( x)θ x × score( y )
61 6
a,a
0.85 X3 0.80 X5 0.75 X2 0.74 X6 0.74 X1 0.70 X4
K. Selcuk Candan
Ranking (and first-k retrieval)
0.85 X3 0.80 X5 0.75 X2
??
0.74 X6 0.74 X1 0.70 X4
K. Selcuk Candan
61 7
0.90 X2 0.80 X5 0.70 X6 0.60 X4 0.50 X1 0.40 X3
0.85 X3 0.80 X5 0.75 X2 0.74 X6 0.74 X1 0.70 X4
K. Selcuk Candan
93
First solution…join based on X
??
0.90 X2 0.80 X5 0.70 X6 0.60 X4 0.50 X1 0.40 X3
0.85 X3 0.80 X5 0.75 X2 X=X
??
0.74 X6 0.74 X1 0.70 X4
• Join the two information sources based on X • Sort all results based on the merged score • Select the first k
61 8
First solution…join based on X
K. Selcuk Candan
0.90 X2 0.80 X5 0.70 X6 0.60 X4 0.50 X1 0.40 X3
62 2
0.90 X2 0.80 X5 0.70 X6 0.60 X4 0.50 X1 0.40 X3
K. Selcuk Candan
??
0.825 0.8
0.72
0.90 X2 0.80 X5 0.70 X6 0.60 X4 0.50 X1 0.40 X3
0.85 X3 0.80 X5 0.75 X2
0.825 0.8
0.72
0.74 X6 0.74 X1 0.70 X4
Assumptions: • Q is monotonic • Predicates provide sorted_access • Predicates provide random_access
62 1
Sorted Access Phase (k=3)
??
0.74 X6 0.74 X1 0.70 X4
Need to access the entire database at least once!!!
0.74 X6 0.74 X1 0.70 X4
K. Selcuk Candan
X=X
Sorted Access Phase (k=3)
0.85 X3 0.80 X5 0.75 X2
Assumptions: • Q is monotonic • Predicates provide sorted_access • Predicates provide random_access
62 0
0.85 X3 0.80 X5 0.75 X2
61 9
Ranked join for top-k retieval (Fagin)
??
0.90 X2 0.80 X5 0.70 X6 0.60 X4 0.50 X1 0.40 X3
K. Selcuk Candan
Sorted Access Phase (k=3)
0.85 X3 0.80 X5 0.75 X2
??
0.74 X6 0.74 X1 0.70 X4
Assumptions: • Q is monotonic • Predicates provide sorted_access • Predicates provide random_access
K. Selcuk Candan
62 3
0.90 X2 0.80 X5 0.70 X6 0.60 X4 0.50 X1 0.40 X3
0.825 0.8
0.72
?
0.85 X3 0.80 X5 0.75 X2 0.74 X6 0.74 X1 0.70 X4
Assumptions: • Q is monotonic • Predicates provide sorted_access • Predicates provide random_access
K. Selcuk Candan
94
Random Access Phase (k=3)
??
0.90 X2 0.80 X5 0.70 X6 0.60 X4 0.50 X1 0.40 X3
0.825 0.8
0.72
X2 X5 X6
0.85 X3 0.80 X5 0.75 X2 0.74 X6 0.74 X1 0.70 X4
0.625
Assumptions: • Q is monotonic • Predicates provide sorted_access • Predicates provide random_access
62 4
Result… (k=3)
K. Selcuk Candan
0.90 X2 0.80 X5 0.70 X6 0.60 X4 0.50 X1 0.40 X3
0.825 0.8
0.72
0.74 X6 0.74 X1 0.70 X4
0.625
Given n subqueries Assuming independence of n subqueries
62 7
Use only one pred. for sorted access (k=3)
0.90 X2 0.80 X5 0.70 X6 0.60 X4 0.50 X1 0.40 X3
0.74 X6 0.74 X1 0.70 X4
0.625
0.85 X3 0.80 X5 0.75 X2
K. Selcuk Candan
??
0.72
Properties of the algorithm
X1 and X4 have never been accessed!
62 6
0.8
K. Selcuk Candan
z X2 X5 X6
0.85 X3 0.80 X5 0.75 X2
0.825
Assumptions: • Q is monotonic • Predicates provide sorted_access • Predicates provide random_access
62 5
Advantage!!
0.90 X2 0.80 X5 0.70 X6 0.60 X4 0.50 X1 0.40 X3
z
.....in order to find the next k best answers, we can continue where we left off.
z
If the merge function is min, there is a more efficient implementation K. Selcuk Candan
Sorted+Random Access (k=3)
0.85 X3 0.80 X5 0.75 X2
??
0.74 X6 0.74 X1 0.70 X4
0.90 X2 0.80 X5 0.70 X6 0.60 X4 0.50 X1 0.40 X3
0.75 0.8
0.7
0.85 X3 0.80 X5 0.75 X2 0.74 X6 0.74 X1 0.70 X4
• Stop when the next value is smaller than the third candidate
62 8
K. Selcuk Candan
62 9
K. Selcuk Candan
95
Sorted+Random Access (k=3)
Problem.. z
X2 X5 X6
0.90 X2 0.80 X5 0.70 X6 0.60 X4 0.50 X1 0.40 X3
0.75 0.8
0.7
K. Selcuk Candan
63 1
Problem.. z z
These require special ranking engines.. Can we process top - kqueries in regular databases?
0.74 X6 0.74 X1 0.70 X4
X1, X3, and X4 have never been accessed!
63 0
z
0.85 X3 0.80 X5 0.75 X2
K. Selcuk Candan
Problem..
These require special ranking engines.. Can we process top - kqueries in regular databases (Chauduri and Gravano)? thresholds on the grade
z z
These require special ranking engines.. Can we process top - kqueries in regular databases (Chauduri and Gravano)? thresholds on the grade
of match of the
of match of the acceptable objects
acceptable objects select oid from Repository where Filter_condition order[k] by Ranking_expression
63 2
Description of how the result should be ranked K. Selcuk Candan
select oid from Repository where (v >= 0.5 and p >=0.9) or f >=0.9 order[10] by max (f,v)
63 3
Available access methods… z z z
Description of how the result should be ranked K. Selcuk Candan
Approach
GradeSearch(attribute, value, min_grade) TopSearch (attribute, value, count) Probe(attribute, value, {oid})
1. 2. 3. 4.
63 4
K. Selcuk Candan
63 5
Use statistics to find a suitable search score, Sq Build a selection query Cq to return all tuples whose score is greater than Sq Evaluate Cq If there are k tuples with score greater than Sq, return the first k….otherwise, choose a smaller Sq and repeat
K. Selcuk Candan
96
Handling filter conditions z z
z
Handling filter conditions
Use one GRADESEARCH for eachatomic condition in the filter condition, and merge the returned sets of object ids through a sequence of union/intersections
z z
Use GRADESEARCH only for SOME atomic conditions, and for the rest of the conditions use PROBE
z
Use GRADESEARCH only for SOME atomic conditions, and for the rest of the conditions use PROBE
z
How do we choose which filter conditions are evaluated using GRADESEARCH??? How do we choose the proper grade???
z
63 6
K. Selcuk Candan
Use one GRADESEARCH for eachatomic condition in the filter condition, and merge the returned sets of object ids through a sequence of union/intersections
63 7
K. Selcuk Candan
63 9
K. Selcuk Candan
Handling ranking.. z
Use Fagin’s approach to find the estimated number of objects required to be visited to generate top-k results Assuming independence of n subqueries
z
Find a score, Sq, that returns at least that many tuples
63 8
K. Selcuk Candan
Query Optimization z z
Visual Query
Query languages are declarative We need to – – –
64 0
Query processing stages
Query in QL
convert declarative statements into executable statements estimate the cost of the executable statements choose the cheapest executable statement among all alternatives
K. Selcuk Candan
Parser Query Optimizer Query Execution Plan
64 1
Storage Engine K. Selcuk Candan
97
Query Optimization z
In Traditional Databases, we are optimizing for cost of query processing – – –
z
Cost is described in terms of disk access Database keeps statistics about relations, tuples, page sizes. Database also keeps index and sorting information
Query optimizer – –
uses “cost model’ to guess query execution cost for a possible query execution plan chooses a plan which is cheap according to the cost model.
64 2
K. Selcuk Candan
64 3
K. Selcuk Candan
Query tree Selectivity describes what portion of the database satisfies this predicate
64 4
K. Selcuk Candan
64 5
Query tree
K. Selcuk Candan
Query tree
Joins are expensive!!!
Joins are expensive!!!
Push selections and projections down to reduce the amount of data that goes into joins
64 6
K. Selcuk Candan
64 7
K. Selcuk Candan
98
What about joins???
64 8
What about joins???
64 9
K. Selcuk Candan
Assumptions of cost-based query optimization z
The goal of query optimization is to eliminate costliest executions…. …………………………………….finding cheapest executions is very very expensive. K. Selcuk Candan
Assumptions of cost-based query optimization
Independent of how a query is executed we get the same number of results:
z
σ θ 1∧θ 2 (R ) = σ θ 1 (σ θ 2 (R )) = σ θ 2 (σ θ 1 (R ))
65 0
65 1
K. Selcuk Candan
Assumptions of cost-based query optimization z
K. Selcuk Candan
Assumptions of cost-based query optimization
Principle of suboptimality: An optimal plan for a query includes optimal subplans for the subqueries
z
Principle of suboptimality: An optimal plan for a query includes optimal subplans for the subqueries
( (
))
(
) ( (
) )
cos t 1 σ θ 1 σ 1θ 2 (R ) = cos t σ 1θ 2 (R ) + f size σ 1θ 2 (R ) ,θ 1
( (
))
(
) ( (
) )
cos t 2 σ θ 1 σ 2θ 2 (R ) = cos t σ 2θ 2 (R ) + f size σ 2θ 2 (R ) ,θ 1
cos t 1 σ θ 1 σ 1θ 2 (R ) = cos t σ 1θ 2 (R ) + f size σ 1θ 2 (R ) ,θ 1
cos t 2 σ θ 1 σ 2θ 2 (R ) = cos t σ 2θ 2 (R ) + f size σ 2θ 2 (R ) ,θ 1 65 2
Principle of suboptimality: An optimal plan for a query includes optimal subplans for the subqueries
K. Selcuk Candan
65 3
( (
))
(
) ( (
( (
))
(
) ( (
=
) ) ) )
K. Selcuk Candan
99
Assumptions of cost-based query optimization z
Assumptions of cost-based query optimization
Principle of suboptimality: An optimal plan for a query includes optimal subplans for the subqueries
( (
))
(
) ( (
( (
))
(
) ( (
z
) )
cos t 1 σ θ 1 σ 1θ 2 (R ) = cos t σ 1θ 2 (R ) + f size σ 1θ 2 (R ) ,θ 1
) )
cos t 2 σ θ 1 σ 2θ 2 (R ) = cos t σ 2θ 2 (R ) + f size σ 2θ 2 (R ) ,θ 1
cos t 1 σ θ 1 σ 1θ 2 (R ) = cos t σ 1θ 2 (R ) + f size σ 1θ 2 (R ) ,θ 1
=
cos t 2 σ θ 1 σ 2θ 2 (R ) = cos t σ 2θ 2 (R ) + f size σ 2θ 2 (R ) ,θ 1 65 4
K. Selcuk Candan
( (
))
(
) ( (
( (
))
(
) ( (
65 5
Assumptions of cost-based query optimization z
Principle of suboptimality: An optimal plan for a query includes optimal subplans for the subqueries
Pick the cheapest!!!
Principle of suboptimality: An optimal plan for a query includes optimal subplans for the subqueries
z
Given Q = R1xR2xR3x….Rn –
for all Ri, z z
–
Pick the plan with the smallest cost
K. Selcuk Candan
Recursion z Recursion!!!
Given Q = R1xR2xR3x….Rn –
find the cheapest plan for Q(-Ri) compute costi = cost(Q(-Ri) x Ri)
for all Ri, z z
Pick the plan with the smallest cost
–
z
65 8
find the cheapest plan for Q(-Ri) compute costi = cost(Q(-Ri) x Ri)
65 7
Recursion z
K. Selcuk Candan
for all Ri, z
–
K. Selcuk Candan
) )
Given Q = R1xR2xR3x….Rn –
…we can use recursion!!!!!!!!!
65 6
) )
Recursion
z
z
=
K. Selcuk Candan
65 9
Recursion!!!
find the cheapest plan for Q(-Ri) compute costi = cost(Q(-Ri) x Ri)
Pick the plan with the smallest cost
There is a dynamic programming based algorithm that uses this recursion. K. Selcuk Candan
100
Multimedia Query Optimization z z
Multimedia query
Query languages are declarative We need to – – – –
convert declarative statements into executable statements estimate the cost of the executable statements choose the cheapest executable statement among all alternatives consider the quality of the alternatives!!
66 0
K. Selcuk Candan
z
Overloaded implementation of predicates
– – –
66 1
image: extract fixed number of parameters :verify if the pattern is in the image pattern: find all the images using an index structure K. Selcuk Candan
Multimedia query
z
…different number of matches
– – –
66 2
image: extract fixed number of parameters :verify if the pattern is in the image pattern: find all the images using an index structure K. Selcuk Candan
66 3
Multimedia query
K. Selcuk Candan
Binding patterns z
z
we can define four binding
…different quality implementations
– –
66 4
For a predicate patterns:
–
image: extract fixed number of parameters :verify if the pattern is in the image pattern: find all the images using an index structure K. Selcuk Candan
66 5
K. Selcuk Candan
101
Binding patterns z
For a predicate patterns:
Cost, fanout
we can define four binding
66 6
K. Selcuk Candan
66 7
How to compute combined qualities? z
Cost, fanout, quality
For example, using product semantics, for the query
z
we can define the overall quality as
66 8
K. Selcuk Candan
Given a query
we need to define the best retrieval strategy based on cost, fanout, and quality:
K. Selcuk Candan
66 9
K. Selcuk Candan
Merging cost and quality…
67 0
K. Selcuk Candan
67 1
K. Selcuk Candan
102
Merging cost and quality…
Merging cost and quality…
This may not satisfy the principle of suboptimality
67 2
67 3
Minimize : K. Selcuk Candan
Dealing with expensive predicates z z
67 4
Minimize : K. Selcuk Candan
Dealing with expensive predicates
Standard query optimizers assume that selection (σ) is cheap… ….however, multimedia predicates may be expensive
z z
Standard query optimizers assume that selection (σ) is cheap… ….however, multimedia predicates may be What portion of the expensive (Hellerstein, Stonebraker)
database satisfies this predicate
67 5
K. Selcuk Candan
Predicate ordering in a single table query
How costly the predicate is K. Selcuk Candan
Predicate migration…
Better plan!!
67 6
K. Selcuk Candan
67 7
Standard optimization
Better plan!! K. Selcuk Candan
103
Challenge: to find a plan that satisfies both rank orders
Predicate migration…
67 8
Standard optimization
Web Databases….
Better plan!! K. Selcuk Candan
67 9
K. Selcuk Candan
Web Databases…. z
68 0
K. Selcuk Candan
68 1
Web Databases…. z
68 2
Approach 1: use standard IR techniques to find pages that satisfy a query
K. Selcuk Candan
Web Databases….
Approach 1: use standard IR techniques to find pages that satisfy a query
K. Selcuk Candan
z
68 3
Approach 1: use standard IR techniques to find pages that satisfy a query
K. Selcuk Candan
104
Web Databases…. z
Web Databases….
Approach 1: use standard IR techniques to find pages that satisfy a query
68 4
K. Selcuk Candan
z
68 5
Web Databases…. z
Approach 2: integrate IR techniques with structure/link analysis
K. Selcuk Candan
Hups and authorities
Approach 2: integrate IR techniques with structure/link analysis
68 6
K. Selcuk Candan
68 7
K. Selcuk Candan
Topic distilation by iterative mutual reinforcement
Hups and authorities
Good hubs should point to good authorities and vice versa.
68 8
K. Selcuk Candan
68 9
K. Selcuk Candan
105
Topic distilation by iterative mutual reinforcement
69 0
K. Selcuk Candan
PageRank
69 1
Information units (multi-page)
69 2
K. Selcuk Candan
R(u ) =
1 R (v ) ∑ c v∈Bu N v
K. Selcuk Candan
Information units (single-page)
69 3
Information units (single page/site)
K. Selcuk Candan
Topic segmentation
Subjects are not informative
Specialization Generalization
69 4
K. Selcuk Candan
69 5
K. Selcuk Candan
106
Keyword inheritence
Messages too short for indexing
Web databases z
Challenges: – – – –
69 6
K. Selcuk Candan
SQL
69 7
Integrate content and structure Graph and tree (XML/semi-structured) data indexing Proximity search Progressive query processing
K. Selcuk Candan
Application interfaces
Query parser
Trans actio n proce ssing
Query Optimizer (cost based) Query Processor
Repli catio n mana ger
Indices
Data
69 8
K. Selcuk Candan
107