Fifth International Workshop on Multimedia Data Mining

Workshop Chairs: Latifur Khan Valery A. Petrushin

August 22, 2004 Seattle, USA

Proceedings Fifth International Workshop on

Multimedia Data Mining MDM/KDD 2004 August 22nd, 2004 Seattle, WA, USA

Workshop Co-Chairs: Latifur Khan Valery A. Petrushin In conjunction with ACM SIGKDD Tenth International Conference On Knowledge Discovery and Data Mining

i

© The copyright of these papers belongs to the papers’ authors. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage.

Proceedings of the Fifth International Workshop on Multimedia Data Mining (MDM/KDD 2004) in conjunction with ACM SIGKDD Tenth International Conference on Knowledge Discovery and Data Mining, August 22nd 2004, Seattle, WA, USA (Latifur Khan, Valery A. Petrushin, eds.)

Cover art production by Myra Spiliopoulou and MDM/KDD 2004 logo by Valery A. Petrushin (Accenture).

ii

Table of Contents

Workshop Co-Chairs, Steering and Program Committees _______ iv
MDM/KDD 2004 Workshop Program _______ v
Foreword _______ vi
Multimedia is Correlated Media (Invited Talk – Abstract)
Forouzan Golshani _______ vii
Real-time Classification of Multivariate Motion Data Using Support Vector Machines
Chuanjun Li, Punit R. Kulkarni, Li Liu, B. Prabhakaran, Latifur Khan _______ 1
Structure-based Mining of Hierarchical Media Data, Meta-Data, and Ontologies
K. Selçuk Candan, Jong Wook Kim, Huan Liu, Reshma Suvarna _______ 8
CoMMA: A Framework for Integrated Multimedia Mining using Multi-relational Associations
Ankur Teredesai, Juveria Kanodia, Muhammad Ahmad, Roger Gaborski _______ 19
Efficient Indexing of Complex Objects for Density-based Clustering
Karin Kailing, Hans-Peter Kriegel, Martin Pfeifle, Stefan Schönauer _______ 28
Automating the Construction of Scene Classifiers for Content-Based Video Retrieval
Menno Israël, Egon L. van den Broek, Peter van der Putten _______ 38
A Data Management Layer for Visual Information Retrieval
Horst Eidenberger and Roman Divotkey _______ 48
Multiple-Camera People Localization in a Cluttered Environment
Gang Wei, Valery A. Petrushin, Anatole V. Gershman _______ 52
A Framework for Bayesian Classification on Banner Images
Charles Daniel, Qin Ding _______ 61
Iris Recognition Using the SVD-Free Latent Semantic Indexing
Pavel Praks, Libor Machala, Václav Snášel _______ 67
Web User's Profile Detection
Younes Hafri, Chabane Djeraba _______ 72
Mining Dynamic Spatio-Temporal Association Rules for Local-scale Weather Prediction
Zhongnan Zhang, Weili Wu, Ping Deng _______ 80
A Machine Learning Approach to Expressive Performance in Jazz Standards
Rafael Ramirez, Amaury Hazan, Emilia Gomez, Esteban Maestre _______ 88
Using Linguistic Structures in Textual Information Extraction
Alexandre S. Saidi _______ 94
Author Index _______ 103

iii

Workshop Co-Chairs, Steering and Program Committees

Workshop Co-Chairs
Latifur Khan, University of Texas at Dallas, USA
Valery A. Petrushin, Accenture Technology Labs, USA

Workshop Steering Committee
Chabane Djeraba - University of Sciences and Technologies of Lille, France
Anne Kao - The Boeing Company, USA
Simeon J. Simoff - University of Technology Sydney, Australia

Workshop Program Committee
Vijay Atluri - Rutgers University
Walid G. Aref - Purdue University, West Lafayette, Indiana
Omar Boussaid - ERIC, France
Henri Briand - LINA, France
K. Selçuk Candan - Arizona State University, USA
Marcin Detyniecki - University of Paris, France
Chitra Dorai - IBM Thomas J. Watson Research Center, USA
Alex Duffy - University of Strathclyde, UK
Jianping Fan - University of North Carolina, USA
Borko Furht - Florida Atlantic University
Pierre Gantarski - LSIIT, France
William Grosky - University of Michigan - Dearborn, USA
Larry Hall - University of South Florida, USA
Howard J. Hamilton - University of Regina, Canada
Wynne Hsu - National University of Singapore, Singapore
Dimitrios Gunopulos - University of California, Riverside
Oktay Ibrahimov - Institute of Cybernetics, Azerbaijan
Odej Kao - University of Paderborn, Germany
Paul Kennedy - University of Technology-Sydney, Australia
Brian Lovell - University of Queensland, Australia
B. S. Manjunath - University of California, Santa Barbara
Michael Martin - National Defense University, USA
Mark Maybury - MITRE Corporation
Dennis McLeod - University of Southern California, USA
Dunja Mladenic - J. Stefan Institute, Slovenia
Milind R. Naphade - IBM T. J. Watson Research Center
Monique Noirhomme-Fraiture - Institut d'Informatique, FUNDP, Belgium
Vincent Oria - New Jersey Institute of Technology, USA
Tom Osborn - Nuix Pty Ltd, Australia
Jian Pei - State University of New York at Buffalo, USA
Dulce Ponceleon - IBM Almaden, USA
John Risch - Pacific Northwest National Lab, USA
Simone Santini - University of California, San Diego, USA
John R. Smith - IBM T. J. Watson Research Center, USA
ZhaoHui Tang - Microsoft, USA
Brigitte Trousse - INRIA, France
Sundar Venkataraman - Rockwell Scientific Corporation, USA
Duminda Wijesekera - George Mason University, USA
Graham Williams - CSIRO Data Mining, Australia
Clement Yu - University of Illinois, Chicago
Aidong Zhang - State University of New York at Buffalo, USA
Debbie Zhang - University of Technology Sydney, Australia
Wensheng Zhou - HRL Laboratories, LLC, USA
Djamel Zighed - ERIC, France

iv

MDM/KDD 2004 Workshop Program
Sunday, August 22, 2004, Seattle, WA, USA

8:45 – 9:00   Opening and Welcome
9:00 – 9:40   Invited Talk: Multimedia is Correlated Media
              Prof. Forouzan Golshani, Wright State University
9:40 – 12:00  Session 1. Theory, Methodology and Algorithms for Multimedia Data Mining
9:40 – 10:00  Real-time Classification of Multivariate Motion Data Using Support Vector Machines
              Chuanjun Li, Punit R. Kulkarni, Li Liu, B. Prabhakaran, Latifur Khan
10:00 – 10:20 Break
10:20 – 10:40 Structure-based Mining of Hierarchical Media Data, Meta-Data, and Ontologies
              K. Selçuk Candan, Jong Wook Kim, Huan Liu, Reshma Suvarna
10:40 – 11:00 CoMMA: A Framework for Integrated Multimedia Mining using Multi-relational Associations
              Ankur Teredesai, Juveria Kanodia, Muhammad Ahmad, Roger Gaborski
11:00 – 11:20 Efficient Indexing of Complex Objects for Density-based Clustering
              Karin Kailing, Hans-Peter Kriegel, Martin Pfeifle, Stefan Schönauer
11:20 – 11:40 Automating the Construction of Scene Classifiers for Content-Based Video Retrieval
              Menno Israël, Egon L. van den Broek, Peter van der Putten
11:40 – 12:00 A Data Management Layer for Visual Information Retrieval
              Horst Eidenberger, Roman Divotkey
12:00 – 13:20 Lunch
13:20 – 16:00 Session 2. Multimedia Data Mining: Techniques and Applications
13:20 – 13:40 Multiple-Camera People Localization in a Cluttered Environment
              Gang Wei, Valery A. Petrushin, Anatole V. Gershman
13:40 – 14:00 A Framework for Bayesian Classification on Banner Images
              Charles Daniel and Qin Ding
14:00 – 14:20 Iris Recognition Using the SVD-Free Latent Semantic Indexing
              Pavel Praks, Libor Machala, Václav Snášel
14:20 – 14:40 Web User's Profile Detection
              Younes Hafri, Chabane Djeraba
14:40 – 15:00 Mining Dynamic Spatio-Temporal Association Rules for Local-scale Weather Prediction
              Zhongnan Zhang, Weili Wu, Ping Deng
15:00 – 15:20 Break
15:20 – 15:40 A Machine Learning Approach to Expressive Performance in Jazz Standards
              Rafael Ramirez, Amaury Hazan, Emilia Gomez, Esteban Maestre
15:40 – 16:00 Using Linguistic Structures in Textual Information Extraction
              Alexandre S. Saidi
16:00 – 16:30 Discussion and Closure

v

Foreword

Data mining efforts are moving beyond the area of databases. They have been focusing on data collected in fields like art, design, hypermedia and digital media production, case-based reasoning and computational modeling of creativity, including evolutionary computation, and medical multimedia data. These exotic fields use a variety of data sources and structures, interrelated by the nature of the phenomenon that these structures describe. As a result, there is an increasing interest in new techniques and tools that can detect and discover patterns that can lead to new knowledge in the problem domain where the data has been collected. There is also an increasing interest in the analysis of multimedia data generated by different distributed applications, like collaborative virtual environments, virtual communities, and multi-agent systems. These heterogeneous multimedia data records require sophisticated preprocessing, synchronization, ontology mappings and other transformation procedures before even getting to the analysis stage.

On the other hand, researchers in multimedia information systems, in the search for techniques for improving the indexing and retrieval of multimedia information, are looking into new methods for discovering indexing information. A variety of techniques from machine learning, statistics, databases, knowledge acquisition, data visualization, image analysis, high performance computing, and knowledge-based systems have been used mainly as a research handcraft activity. The development of multimedia databases and their query interfaces recalls again the idea of incorporating multimedia data mining methods for dynamic indexing. The emerging international standard for multimedia content description (MPEG-7) promises to foster collaboration in the field by giving a uniform data representation.

The 5th Workshop on Multimedia Data Mining continues the series of successful workshops that have been held in conjunction with KDD conferences in Boston (2000), San Francisco (2001), Edmonton (2002), and Washington, D.C. (2003). These workshops brought together numerous experts in spatial data analysis, digital media, multimedia information retrieval, state-of-the-art data mining and knowledge discovery in multimedia database systems, and data analysis in collaborative virtual environments. For more information about the workshops, see the reports in SIGKDD Explorations (2(2), 3(2), 4(2) and 6(1)).

The topic of the 5th Workshop is Mining Integrated Media and Complex Data. This means that the Workshop focuses on issues related to mining information from multi-modality, multi-source, multi-format data in an integrated way. The selected papers represent a wide range of approaches, techniques, tools and applications that cover different aspects of image, audio and video indexing and annotation, and their integration into intelligent decision making systems.

As a part of the SIGKDD conference series, the Workshop follows a rigorous peer-review and paper selection process. Each paper has been reviewed by at least three reviewers. We would like to thank all reviewers and all members of the Steering and Program Committees who helped to organize this year's workshop. We also would like to thank all those who submitted their papers to the Workshop. This year we received submissions from seven countries: Austria, France, Germany, India, Spain, the Netherlands, and the United States of America.

Latifur Khan and Valery A. Petrushin
July 2004

vi

Invited Talk:

Multimedia is Correlated Media
Prof. Forouzan Golshani, Wright State University

Abstract. While we have had considerable success in developing tools and technologies for each of the common media types, most of these tools are specific to each media type and do not allow us to go across different media in the process of search and retrieval. It is as if artificial barriers separate information objects of each type. As a result, there is little hope for integrated processing, composition and search. Keyword search (on annotations and meta-tags) is the primary way to do cross-media search and retrieval. Unfortunately, the forty-year-old tradition of keyword search is no longer adequate, particularly when it produces such dismal results. In order to enable the enterprise in seamless management of data and multimedia information, we need a variety of tools and supporting technologies. Clearly, we must be able to maintain or discover the semantics of data and information objects. Such a process begins with low level (machine oriented) feature extraction routines and leads to human understandable characteristics with which human communication is affected. An important intermediary step is the extraction of content features of all multimedia objects. With these features at hand, the need arises for uniform conceptual modeling of object semantics. While many toy with XML-like markup languages, it is evident that we need far more sophisticated information representation frameworks. Of equal importance is an ontological representation of domain knowledge, which maps features to topic-specific themes and maps themes to user-specific concepts. This ontology must be user extensible, allowing each user to easily represent the concepts unique to their respective domains. This presentation outlines a series of toolsets that collectively allow us to remove the artificial barriers that currently separate different media types. Cross-media fusion and the capabilities for search and retrieval are illustrated through live demonstrations.

Short bio. Forouzan Golshani (PhD, IEEE Fellow) is the NCR Distinguished Professor and the Chairman of the Computer Science and Engineering Department at Wright State University. Until 2003, he was Co-Director of Arts Media Engineering and Professor of Computer Science and Engineering at Arizona State University. Prior to this, he was a Research Fellow at Imperial College of Science, Technology and Medicine in London, England. He combines extensive basic research with industrial R&D and entrepreneurship: during 25 years in academia, he has worked with numerous hi-tech firms, and co-founded Corporate Enhancement Group in 1995, Roz Software Systems, Inc. in 1997, and Trallus Software in 2002. Dr. Golshani has registered ten inventions, and is the author or co-author of over 200 articles. He has been a program chair or general chair for 12 international conferences, including the IEEE International Conference on Distributed Computing Systems (2001), the ACM International Conference on Information and Knowledge Management (1999), the IEEE International Conference on Data Engineering (1993), and the IEEE International Conference on Computers and Communications (1989). He is the Editor-in-Chief of IEEE Multimedia, and an Executive Editor for Frontiers of BioSciences. He is a member of the Steering Committee for IEEE Transactions on Multimedia, and serves as a member of the editorial boards of several other journals, including the International Journal of Artificial Intelligence Tools and the Journal of Information Management Systems. His senior management and technical consulting experience includes Motorola, Intel Corp., Honeywell, McDonnell Douglas Helicopter Company, Bull Worldwide Information Systems, and Sperry. Dr. Golshani is a past chair of the IEEE International Technical Committee on Computer Languages (1989-91), and an IEEE Distinguished Speaker (1990-93). He received a PhD in Computer Science from the University of Warwick in England.

vii

viii

Real-time Classification of Multivariate Motion Data Using Support Vector Machines (SVM)∗

Chuanjun Li, Punit R. Kulkarni, Li Liu, B. Prabhakaran, Latifur Khan
Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083
{chuanjun, prk032000, praba, lkhan}@utdallas.edu

ABSTRACT


Multivariate motion data can be generated in many applications/devices, such as gesture capture devices and animations. It can have dozens of attributes and thousands of rows, and even similar motions can have different durations and different speeds at corresponding parts. There are no row-to-row correspondences between the data matrices of two motions. To be classified and recognized, multivariate motion data of different lengths are reduced to characteristic vectors by using the properties of the Singular Value Decomposition (SVD) of motion data. The reduced characteristic vectors of similar motions are close to each other, while reduced characteristic vectors are different from each other if their motions are different. By applying Support Vector Machines (SVM) to the characteristic vectors, we efficiently classify and recognize real-world multivariate motion data. With our dataset of more than 300 motions by different people, SVM significantly outperforms the weighted-sum SVD [11], the most effective multivariate motion data similarity measure published so far, in terms of accuracy and CPU time. The performance of our approach shows its feasibility for real-time applications on real-world data.

1. INTRODUCTION

Multivariate motion data is encountered in several applications/devices. For instance, a gesture sensing device such as CyberGlove has multiple sensors that transmit values to indicate motion of a hand. Devices track the movements of sensors in different axes to capture the motion of a model in virtual reality type of applications. If we consider 3D animations, motion of a model typically involves translation and rotation for different nodes of the model. In many of these cases, more than one value is generated at each time, rather than only one value at each time as in a time series data sequence. As a result, a multivariate motion yields a matrix, rather than a multidimensional vector for a time series sequence. For example, multivariate motion data used in this paper are matrices, and each of the matrices is for one motion and has 22 columns and hundreds or thousands of rows. Classifying and recognizing similar motions poses several challenges:

• Matrices of data sequences of multiple variables or attributes are involved, because data of multiple attributes are aggregate data, and should be considered together to make the motions meaningful.

Categories and Subject Descriptors

• Motions can be carried out with different speeds at different times, they can have different durations, and motion sampling rates may also differ. Hence the matrices of motion data can be of variable lengths, and there are no continuous row-to-row correspondences between data of similar motions, as shown in Figure 1. Although two motions may be similar to each other, corresponding attributes can have large differences in values between them, and different corresponding attribute pairs can have different variations at different times.

H.2.8 [Database Management]: Database Applications— data mining

General Terms Algorithm

Keywords
Classification, pattern recognition, support vector machines, singular value decomposition, multivariate motion.

∗ Work supported partially by the National Science Foundation under Grant No. 0237954 for the project CAREER: Animation Databases. Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the National Science Foundation.

• There might be hundreds or thousands of classes of motions used to classify and recognize a new motion; the classification should be not only accurate, but it should also be real time.

To address these challenges, we obtain a characteristic vector for each motion matrix using the Singular Value Decomposition (SVD) properties of the motion matrix. SVD optimally exposes the geometric structure of a matrix, and this geometric structure can be exploited to obtain a characteristic vector for a motion matrix. We then classify the vectors for all the motions. Classification is part of supervised learning, in which the learning machine is given a training set of vectors with associated labels for each class. After training, the machine can classify the label of a new vector. We show by experiments that motions of different durations with different variations can be recognized effectively and quickly. For classification, we use the Support Vector Machines technique, one of the most powerful classification techniques that have been successfully applied to many real world problems [5, 13, 1]. SVM has proved to be computationally efficient especially when dealing with relatively large datasets [3]. SVMs are relatively insensitive to the number of data points, and the classification complexity does not depend on the dimensionality of the feature space. Hence they can potentially learn a larger set of patterns and thus be able to scale better than other techniques such as neural networks [3]. In comparison with classification by using SVM, we also computed the similarities of motion data in the testing dataset with motion data in the training datasets. The similarity measure we used is the weighted-sum SVD as defined in [11]. To the best of our knowledge, weighted-sum SVD is the most effective multivariate motion data similarity measure published so far which attempts to address the above challenges. The contributions of this paper include:

The copyright of these papers belongs to the paper’s authors. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. MDM/KDD ’04, August 22, 2004, Seattle, WA, USA.

1


• Obtaining a characteristic vector for each motion matrix by using SVD.

• Applying SVM to the characteristic vectors of the multivariate motion data and classifying the multivariate motion data.


• Using the weighted-sum SVD similarity measure to compute the similarities of motion data in the training data and the motion data in the testing data, and comparing the results with the SVM classification results. Our approach outperforms the weighted-sum SVD similarity measure and shows the effectiveness of the characteristic vectors and the efficiency of SVM classification.


• Demonstrating that testing data classification can be done in real time with data training done offline beforehand.


The rest of the paper is organized as follows. Section 2 contains the background knowledge of SVD and SVM. Section 3 proposes a new approach to classifying multivariate motion data using SVD and SVM. Section 4 experimentally evaluates the accuracy and CPU time of our proposed approach. Section 5 gives a brief review of related work, and finally section 6 concludes the paper.


2. BACKGROUND

In this section, we give some background knowledge of singular value decomposition and support vector machines as the base of our proposed classification approach.

Figure 1: Data for Two Similar Motions. Similar motion data can have different lengths, and different corresponding attribute pairs can have different variations at different times, hence there are no row-to-row correspondences between data of similar motions.

2.1 Singular Value Decomposition

As proved in [4], for any real m × n matrix A, there exist orthogonal matrices U = [u1, u2, . . . , um] ∈ R^{m×m} and V = [v1, v2, . . . , vn] ∈ R^{n×n} such that A = U Σ V^T, where Σ = diag(σ1, σ2, . . . , σ_{min(m,n)}) ∈ R^{m×n} and σ1 ≥ σ2 ≥ . . . ≥ σ_{min(m,n)} ≥ 0. Here σi is the ith singular value of A in non-increasing order, and ui and vi are the ith left and right singular vectors of A for i = 1, . . . , min(m, n), respectively. The singular values of matrix A are unique, and the singular vectors corresponding to distinct singular values are uniquely determined up to the sign.

The SVD exposes the geometric structure of a matrix A. The first right singular vector gives the direction along which the multi-dimensional row vectors or points contained in A have the largest variations, the second right singular vector is the direction with the second largest variations, and so on. The singular values reflect the variations along the corresponding singular vectors. Figure 2 shows the data in an 18 × 2 matrix and its first singular vector v1 and its second singular vector v2. Along the first singular vector v1, the data have the largest variation, as shown in Figure 2.

Figure 2: Geometric Structure of Matrix Exposed by its SVD

Matrix AV = U Σ gives the coordinates of the multidimensional row vectors in A in the new coordinate system spanned by the column vectors of V [8]. Let x be a non-zero row vector; the multiplication of x and V, i.e., xV, gives the corresponding vector of x in the system spanned by the column vectors of V. The right singular vectors of A are actually the corresponding eigenvectors of A^T A, and the singular values of A are the square roots of the corresponding eigenvalues of A^T A. Hence the computations of the right singular vectors and singular values can be done by solving for the eigenvectors and eigenvalues of A^T A.
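These properties are easy to check numerically. The following sketch (illustrative only; the random matrix stands in for a motion matrix with 22 attribute columns) verifies the decomposition, the ordering of the singular values, and the relation to the eigenvalues of A^T A using NumPy:

```python
import numpy as np

# A random stand-in for a motion matrix: many rows, 22 attribute columns.
A = np.random.randn(300, 22)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A = U diag(s) V^T, with singular values in non-increasing order.
assert np.allclose(A, U @ np.diag(s) @ Vt)
assert np.all(np.diff(s) <= 0)

# Right singular vectors are eigenvectors of A^T A; singular values are the
# square roots of its eigenvalues.
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]   # descending order
assert np.allclose(np.sqrt(eigvals), s)

# A V = U Sigma: the rows of A expressed in the coordinate system spanned
# by the columns of V.
assert np.allclose(A @ Vt.T, U @ np.diag(s))
```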

2

2.2 Support Vector Machines

Support Vector Machines are a class of learning machines that aim at finding optimal hyperplanes among different classes of input data or training data in a high-dimensional feature space F, and new test data can be classified using the separating hyperplanes. The optimal hyperplane, found during a training phase, makes the smallest number of training errors. Figure 3 illustrates an optimal hyperplane for two classes of training data.

Figure 3: Optimal Hyperplane

Let {xi, yi}, i = 1, 2, . . . , L be L training data vectors xi with class labels yi. Given an input vector x, an SVM constructs a classifier of the form

f(x) = sign( Σ_{i=1}^{L} αi yi K(xi, x) + b )

where {αi} are non-negative Lagrange multipliers each of which corresponds to a training data vector, b is a bias constant, and K(·, ·) is a kernel satisfying the conditions of Mercer's theorem [13]. Frequently used kernel functions are the polynomial kernel K(xi, xj) = (xi · xj + 1)^d and the Gaussian Radial Basis Function (RBF) K(xi, xj) = exp(−|xi − xj|^2 / 2σ^2).

The problem of finding the optimal hyperplane is specified by the following quadratic programming problem:

minimize:   W(α) = − Σ_{i=1}^{L} αi + (1/2) Σ_{i=1}^{L} Σ_{j=1}^{L} αi αj yi yj K(xi, xj)

subject to:   Σ_{i=1}^{L} αi yi = 0,   0 ≤ αi ≤ C,   i = 1, 2, . . . , L.

The above quadratic programming problem can be solved with traditional optimization techniques. The vectors for which αi > 0 after optimization are called support vectors. Support vectors lie closest to the optimal hyperplane. After training, only the support vectors of the training data are used to represent the classifiers, and other training vectors have no influence.
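The paper performs training and classification with an SVM software package [2] (LIBSVM). The sketch below uses scikit-learn's SVC, which wraps LIBSVM, purely to illustrate the workflow; the characteristic vectors and labels are random placeholders standing in for the vectors constructed in Section 3.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: one 44-dimensional characteristic vector per motion and
# an integer class label per vector (values are made up for illustration).
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 44))
y_train = rng.integers(0, 100, size=200)

# RBF kernel and a small termination tolerance, mirroring Section 3.2.
clf = SVC(kernel="rbf", C=1.0, tol=1e-5)
clf.fit(X_train, y_train)

# Classify a new motion's characteristic vector.
new_vector = rng.standard_normal((1, 44))
print(clf.predict(new_vector))
```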

3

3. MULTIVARIATE MOTION DATA CLASSIFICATION

In this section, we propose an efficient classification approach using both SVD and SVM, as shown in Figure 4. We use CyberGlove, a fully instrumented glove that provides up to 22 high-accuracy joint-angle measurements, to generate motion data, and compute the SVD of the motion matrices. A characteristic vector is reduced from each motion matrix SVD, and all the characteristic vectors are used as inputs to SVM for training. This training phase can be done offline (see the top portion of Figure 4). Similarly, we generate testing datasets, and these testing data can be classified by the SVM that has already been trained offline with the training datasets. We would like to classify the testing data in real time (see the bottom portion of Figure 4). We first obtain a characteristic vector rk for each matrix Pk using the SVD properties, and then classify the characteristic vectors rk using SVM as shown in Figure 4.

Figure 4: Multivariate Motion Data Classification Flowchart

3.1 Characteristic Vector Generation

When two motions are similar, the directions with the highest variations should be close to each other, that is, their first right singular vectors u1 and v1 are close to each other, or are close to each other when one of them is negated due to the sign non-uniqueness of SVD, as shown in Figure 5. The singular values can be very small in the other directions, making the directions of the other singular vectors vary even if two motions are similar to each other, as shown in Figure 6. For this reason, only the essential information shown in the first singular vectors is considered for the characteristic vectors. The variations at all directions of the right singular vectors of one motion should also be proportional to the corresponding variations of the other similar motion. Let us assume ~σ = σ/|σ| and ~λ = λ/|λ|; then ~σ should be close to ~λ if the two motions are similar to each other. The corresponding values of u1 and v1 can be close to each other if two motions are similar, but their signs can be different, as shown in Figure 7. It is difficult, if not impossible, to tell if two vectors are different just by their signs by comparing just a few elements of the two vectors, as Figure 7 shows.

Figure 5: First singular vectors u1 and v1 for two similar motions

Figure 6: Second singular vectors u2 and v2 for two similar motions. The large variations in the second singular vectors show that second singular vectors might not be close to each other due to the variations in the similar motions.

As shown in Figure 2, the vector xV gives the corresponding transformed vector of vector x in the system spanned by the column vectors of V. The first few components of the transformed vector xV can vary in the range [-1, 1], and the last few components approach zero, as the 15th to 22nd components in Figure 8 show. If the first component of the transformed vector xV is negative, we negate all components of the transformed vector xV; then similar motions will have similar transformed vectors after negation. If the first component of the transformed vector xV is close to zero, negation according to the first component might not solve the sign non-uniqueness issue. In this case, each transformed vector and its negation should be considered for classification. The final characteristic vector reflects the effects of both the first singular vectors and the singular values by combining the transformed vector (negated if necessary) and the normalized singular value vector of a motion matrix. Based on the above discussions, we generate a characteristic vector rk for a matrix Pk as follows.

• Compute the SVD for each matrix Pk. Let its first right singular vector be vk, and the vector of singular values be λk.

• Let S be the matrix with vk being its kth row. Compute the SVD of S: S = U Σ V^T.

• Compute SV, negate its rows in which the first components are negative, and let the resulting matrix be T.

• Generate rk as the concatenation of tk and λk, where tk is the kth row of T.

If two motions are similar, their characteristic vectors should be close to each other; otherwise their characteristic vectors would be different. This is true when one part of a motion is slower than the corresponding part in a similar motion, and the rest of the two motions are of similar speeds, as shown in Figure 9. The reason is that the part with slower speed will have more points than the corresponding part with fast speed, as illustrated in Figure 2, yet the directions of the first singular vectors of similar motions should be close to each other, and the directions of the normalized singular value vectors should also be close to each other. As a matter of fact, although all the experimental data of similar motions used in Section 4 are of different speeds at corresponding parts, the classification results show that their characteristic vectors are similar if the motions are similar.
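A minimal NumPy sketch of these steps is given below. It is illustrative only: the input is a list of motion matrices with a common number of attribute columns, and the normalization of the singular value vector follows the discussion of ~λ = λ/|λ| above.

```python
import numpy as np

def characteristic_vectors(motions):
    """One characteristic vector per motion matrix (each matrix: rows x attributes)."""
    first_vecs, sv_norms = [], []
    for P in motions:
        # SVD of the motion matrix; rows of Vt are its right singular vectors.
        _, s, Vt = np.linalg.svd(P, full_matrices=False)
        first_vecs.append(Vt[0])                 # first right singular vector v_k
        sv_norms.append(s / np.linalg.norm(s))   # normalized singular values
    S = np.vstack(first_vecs)                    # v_k as the kth row of S
    _, _, Vt_S = np.linalg.svd(S)                # full SVD of S: S = U Sigma V^T
    T = S @ Vt_S.T                               # T = S V
    # Negate the rows of T whose first component is negative (sign ambiguity).
    T = T * np.where(T[:, 0] < 0, -1.0, 1.0)[:, None]
    # r_k: concatenation of t_k (kth row of T) and the normalized singular values.
    return [np.concatenate([T[k], sv_norms[k]]) for k in range(len(motions))]

# Example: three random "motions" of different lengths over 22 attributes.
motions = [np.random.randn(n, 22) for n in (250, 400, 1200)]
vectors = characteristic_vectors(motions)
print(vectors[0].shape)   # (44,): 22 transformed components + 22 singular values
```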

4

Figure 7: Combined Vector Before Transforming First Singular Vector

Figure 8: Combined Vector After Transforming and Negating First Singular Vector

Figure 9: Combined Vectors Before Transforming the First Singular Vectors for Two Similar Motions. The speed of motion 1 during the first 1/4 to 1/2 rows is twice the speed at the corresponding part of motion 2, and the speeds at all the rest of the corresponding parts are the same.

3.2 Classification of Characteristic Vectors Using SVM

Classification of characteristic vectors includes two phases: training and testing. Support vectors and optimal hyperplanes are obtained for the characteristic vectors during the offline training phase, and the subsequent testing phase classifies new test data in real time based on the offline training results. An SVM software package [2] is used for this work. The RBF kernel function is used for training. The type of kernel utilized by the SVM is inconsequential as long as the capacity is appropriate for the amount of training data and the complexity of the classification boundary [13]. Both the training data and the testing data have the following format:

label 1:rk,1 2:rk,2 . . . N:rk,N

where label is an integer identifying the class of the training vector rk, and rk,i is the ith component of rk. Characteristic vectors for similar motions are given the same label, and multi-class classification is supported. The labels in the testing data are used only to calculate classification accuracy. If rk,1 is close to zero, tk in rk is negated and a second rk with negated tk and λk is also used during testing to classify rk. The tolerance of the termination criterion is set to a very small value (10^-5) so as to provide a strong classification.

4. PERFORMANCE EVALUATION

In this section, we evaluate the proposed classification approach by classifying real world data and compare with similarities computed by using the weighted-sum SVD similarity measure as proposed in [11]. The weighted-sum SVD similarity measure is explained in Section 4.2 below and in Section 5.

4.1 Motion Data Generation

The motion data was generated by using CyberGlove. Motions of a hand are captured by the 22 sensors located at different positions of the glove, and the sensors generate data at about 120 times/second. A total of 100 different motions have been carried out, each motion has a different duration, and all the resultant motion data matrices have different lengths ranging from about 200 to about 1500 rows. Each different motion is repeated three times by one person, and some motions are performed by three people, so that at least two characteristic vectors per class are available for training. The data matrix of one motion can have more than two times the length of the data matrix of a similar motion. Each motion data matrix is given a unique pattern identifier Pk.

4.2 Performance Evaluation

We divide the data into three datasets. Each dataset has data for 100 different motions, and includes data of one of the three similar motions by one person. Sequentially, we use two datasets for training, and the third dataset for testing. Three test cases have been run, with a different dataset used for testing in each test case. All the experiments were performed on a Microsoft Windows XP Professional machine with an Intel Pentium processor running at 2.5 GHz. The performance has been validated using K-fold validation (with K=3). Figure 10 illustrates the partitioning of the dataset for the K-fold validation (with K=3) and (a),

5

(b) and (c) correspond to the cases with datasets 1, 2 and 3 being the respective testing datasets.

Figure 10: K-fold Cross Validation Datasets

In comparison with the weighted-sum SVD similarity measure, we also computed the similarities of motions in the training data with the motions in the testing data, and defined the accuracy as the percentage of motions recognized correctly. Figure 11 shows that SVM can achieve 100% accuracy by correctly classifying all 100 different motions, while weighted-sum SVD can recognize at most 82% of the motions in the testing dataset. Figure 12 shows that SVM is about 50 times faster than weighted-sum SVD on average. The weighted-sum SVD similarity measure computes the inner products of all the corresponding singular vectors of the corresponding matrices weighted by their singular values, and takes the minimum of the two inner-product sums as the similarity measure of two matrices. This definition includes some noise components, since we observed that not all corresponding singular vectors of similar motions are close to each other, as shown in Figure 6; hence not all the inner products of singular vectors should be considered in the definition of the distance measure. Besides, since the inner products of singular vectors can be both positive and negative, and the weights can be the singular values of either matrix, it is very likely that the weighted sum can drop or jump sharply even if a query approximately matches some pattern. Moreover, for the weighted-sum SVD measure to recognize one motion, all the motions used as patterns have to be scanned. All these reasons explain the unsatisfactory accuracy and average computational CPU time of weighted-sum SVD as compared to the high accuracy and low CPU time achieved by SVM classification.

Figure 11: Accuracy Comparison of SVM and Weighted-sum SVD (SVM: 100% on all three datasets; weighted-sum SVD: 78.5%, 82.0% and 81.1% on datasets 1, 2 and 3)

Figure 12: CPU Time Comparison of SVM and Weighted-sum SVD (SVM: 2.55, 2.45 and 2.40 ms; weighted-sum SVD: 125.3, 124.9 and 125.3 ms on datasets 1, 2 and 3)

5. RELATED WORK

Recognition of multivariate sequences has attracted increasing attention in recent years. Mostly, distance measures are defined for multivariate data to reflect the similarities of multivariate data. In [7], multivariate sequences of equal lengths are considered. Scaling and shifting transformations are considered when defining sequence distances, and an index structure is proposed for shift and scale transformations. Similarity search of multivariate sequences with different lengths cannot be solved by the distance definitions and the index as proposed in [7]. Multivariate sequences are partitioned into subsequences in [10]. Each of the partitioned subsequences is contained in a Minimum Bounding Rectangle (MBR). Every MBR is indexed and stored into a database by using an R-tree or any of its variants. Estimated MBR distances are used to speed up the searching of similar motions. If two sequences are of different lengths, the shorter sequence is compared with the other by sliding from the beginning to the end of the longer sequence. This makes it impossible to recognize two similar sequences with different durations or with local accelerations and decelerations. Dynamic time warping (DTW) and longest common subsequence (LCS) are extended for similarity measures of multivariate data in [14]. Before the exact LCS or DTW is performed, sequences are segmented into MBRs to be stored in an R-tree. Based on the MBR intersections, similarity estimates are computed to prune irrelevant sequences. Both DTW and LCS have a computational complexity of O(wd(m + n)), where w is a matching window size, d is the number of attributes, and m, n are the lengths of the two data sequences. When w is a significant portion of m or n, the computation can be even quadratic in the length of the sequences, making it non-scalable to large databases with long multivariate sequences. It has been shown in [14] that the index performance significantly degrades when the warping length increases. Even for a small number of 20 MBRs per

6

long sequence, the index space requirements can be about a quarter of the dataset size. In [11], a weighted-sum SVD is defined as below for measuring the similarity of two multivariate motion sequences:

Ψ(Q, Pk) = min(Ψ1(M1, M2), Ψ2(M1, M2)), where M1 = Q^T Q and M2 = Pk^T Pk,

Ψ1(M1, M2) = (Σ_{i=1}^{n} σi (ui · vi)) / (Σ_{i=1}^{n} |σi|),   Ψ2(M1, M2) = (Σ_{i=1}^{n} λi (ui · vi)) / (Σ_{i=1}^{n} |λi|),

and ui and vi are the respective ith right singular vectors of M1 and M2, and σi and λi are the respective ith largest singular values of M1 and M2. The similarity definition takes the minimum of two weighted sums of the inner products of right singular vectors, and this definition may fail to give high similarity to two similar motions, as shown in Section 4. Linear scanning is used in [11] to search for a similar sequence, which is computationally hard when the number of sequences in the database is very large.

Speech and handwriting recognition [6, 9] as well as American Sign Language (ASL) recognition [12] have also been addressed by using hidden Markov models (HMMs). Speech recognition and handwritten word recognition are basically for univariate sequences, and the ASL recognition [12] is for multivariate motions. Different states should be specified for each sign or motion unit when HMMs are involved, and grammar constraints should be known beforehand for using HMMs. When the specified states are not followed, or motion variations are relatively large, recognition accuracy decreases dramatically. This is true even when legitimate or meaningful motions are generated for HMM-based recognition.
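For comparison purposes, the weighted-sum SVD measure defined above can be sketched in a few lines of NumPy. This is an illustration of the published definition only, assuming Q and Pk share the same number of attribute columns:

```python
import numpy as np

def weighted_sum_svd(Q, Pk):
    """Sketch of the weighted-sum SVD similarity of [11] between two motion matrices."""
    M1, M2 = Q.T @ Q, Pk.T @ Pk
    _, sig, Vt1 = np.linalg.svd(M1)    # singular values / right singular vectors of M1
    _, lam, Vt2 = np.linalg.svd(M2)    # ... and of M2
    dots = np.sum(Vt1 * Vt2, axis=1)   # inner products u_i . v_i
    psi1 = np.sum(sig * dots) / np.sum(np.abs(sig))
    psi2 = np.sum(lam * dots) / np.sum(np.abs(lam))
    return min(psi1, psi2)
```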

6. CONCLUSION AND DISCUSSION

We have shown that by reducing multivariate motion data into characteristic vectors, SVM can be used efficiently to classify multivariate data. Characteristic vectors are close to each other for similar motions, and are different for different motions, as the high accuracy of the SVM classification we have achieved demonstrates. SVMs provide an efficient approach to matching the surface of the optimal hyperplane to the training data through the use of a kernel function. The RBF function has been used as the kernel function in this paper, although different kernel functions (e.g. linear, polynomial and RBF) can be provided to the SVMs during the training process, which selects a small number of support vectors for the hyperplanes. The high accuracy and low CPU testing time make SVMs a feasible technique to classify and recognize multivariate data in real time. Using only a single motion pattern in the database to recognize similar motions allows for less variation in similar motions, as shown by the low recognition accuracy of the weighted-sum SVD. By reducing multivariate motion data into characteristic vectors, and using a group of characteristic vectors for a class, a new motion has a higher expected probability of being recognized by SVMs, as optimal hyperplanes are obtained during the training phase. By representing a large matrix by a single vector of small length, the storage requirement can also be significantly reduced. We have addressed the problem of real-time recognition of individual motions accurately and efficiently. We will investigate the feasibility of our approach in partitioning and recognizing individual motions in continuous streaming data. Motions are classified irrespective of their directions in this work, and taking into consideration the first left singular vectors should further distinguish similar motions with different directions. We are also working to reduce the time required for computing SVD and to explore the characteristics of SVM further to reduce the offline training time.

7. REFERENCES

[1] A. Burian and J. Takala. On signal detection using support vector machines. In International Symposium on Signals, Circuits and Systems, volume 2, pages 609-612, July 2003.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[3] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, UK, 2000.
[4] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, Maryland, 1996.
[5] M. Gordan, C. Kotropoulos, and I. Pitas. Application of support vector machines classifiers to visual speech recognition. In Proceedings of the International Conference on Image Processing, volume 3, pages 24-28, June 2002.
[6] X. D. Huang, Y. Ariki, and M. A. Jack. Hidden Markov Models for Speech Recognition. Edinburgh University Press, 1990.
[7] T. Kahveci, A. Singh, and A. Gurel. Similarity searching for multi-attribute sequences. In Proceedings of the 14th Int'l Conference on Scientific and Statistical Database Management, pages 175-184, July 2002.
[8] F. Korn, H. V. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. In SIGMOD, pages 289-300, May 1997.
[9] A. Kundu, Y. He, and P. Bahl. Recognition of handwritten words: First and second order hidden Markov model based approach. Pattern Recognition, 22(3):283-297, 1989.
[10] S.-L. Lee, S.-J. Chun, D.-H. Kim, J.-H. Lee, and C.-W. Chung. Similarity search for multidimensional data sequences. In Proceedings of the 16th Int'l Conference on Data Engineering, pages 599-608, Feb./Mar. 2000.
[11] C. Shahabi and D. Yan. Real-time pattern isolation and recognition over immersive sensor data streams. In Proceedings of the 9th Int'l Conference on Multi-Media Modeling, pages 93-113, January 2003.
[12] T. Starner, J. Weaver, and A. Pentland. Real-time American Sign Language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1371-1375, 1998.
[13] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[14] M. Vlachos, M. Hadjieleftheriou, D. Gunopulos, and E. Keogh. Indexing multi-dimensional time-series with support for multiple distance measures. In SIGMOD, pages 216-225, August 2003.

7

Structure-based Mining of Hierarchical Media Data, Meta-Data, and Ontologies ♣

K. Selçuk Candan

Jong Wook Kim

Huan Liu

Reshma Suvarna

Computer Science and Eng. Dept, Arizona State University, Tempe, AZ 85287

{candan, jong.wook.kim, huan.liu, reshma.suvarna}@asu.edu

ABSTRACT

Users now have access to unprecedented amounts of media data, but it is increasingly difficult to integrate relevant media from multiple and diverse sources. The functioning of a multimodal integration system requires metadata, such as ontologies, that describe media resources and media components. Such metadata are generally application dependent, and this can cause difficulties when media needs to be shared across application domains. There is a need for a mechanism that can relate the common and uncommon terms and media components. In this paper, we develop an algorithm to mine and automatically discover mappings in hierarchical media data, meta-data, and ontologies, using the structural information inherent in these types of data. We evaluate the performance of this algorithm for various parameters using both synthetic and real data collections and show that the structure-based mining of relationships provides high degrees of precision.

1. INTRODUCTION

Universality, i.e., the need for accessing media collected and indexed independently by various applications and organizations, necessitates uniform organization schemes that would allow easy access and integration of media. Instead, media is mostly available to users and applications in diverse structures and formats. Furthermore, considering the multitude and diversity of these applications, it is not viable to expect a global unifying scheme. A semantic network of media, wherein different applications can exchange information and integrate multimodal data, requires the information about each media resource to be represented in a detailed and structured manner. To enable such information exchanges, various hierarchical metadata frameworks have been proposed. For instance, the Resource Description Framework (RDF) [1] is such an effort supported by the World Wide Web Consortium. RDF aims at providing a means for the description of metadata in an organized, informative, searchable, and accessible manner. RDF schemas are implemented for representing multimedia data like image, audio, and video files. For instance, an RDF-based system is presented in [2], where RDF schemas provide a non-visual description of photos. This description is split into three different parts:

• Dublin Core [3] schema is used for identifying the photograph and describing properties like creator, editor and title.

Categories and Subject Descriptors H.2.8 [Database Management]: Database Applications - Data Mining; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms and Topics Algorithms; experimentation

Keywords Retrieval and mining of semantics; extracting and mining semantics from multimedia databases; knowledge discovery in XML, hierarchical multimedia, and ontologies; multi-modal integration; structure-based mining

The copyright of these papers belongs to the paper’s authors. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.

• Technical schema is used for capturing technical data about the photo and the camera, such as the type of camera, type of film, scanner and software used for digitization.

• Content schema is used for categorizing the subject of the photo by means of a controlled vocabulary. This schema allows photos to be retrieved based on such characteristics as portrait, group portrait, landscape, architecture, sport, etc.

RDF and similar metadata description frameworks provide a common language through which meta-data, such as media and application ontologies, are exchanged. This enables software systems to create a uniform structure to represent and organize

MDM/KDD'04, August 22, 2004, Seattle, WA, USA.

♣ This work has been supported by an NSF Grant #ITR0326544 and a CEINT04 grant.

8

data, which renders the integrated data manageable and retrievable. Consequently, many multimedia standards define objects as a structured collection of media objects. The metadata (such as content descriptors and feature names) used to describe resources in RDF and other metadata description languages are defined by various communities. The metadata creators, depending on the specific application and culture, use different terms for similar concepts. For the functioning of an automated multimodal media integration system, the semantics behind the metadata terms used by various authors and communities should be mined and automatically related. Hence, a mechanism to mine and relate semantically similar but syntactically different metadata is required.

In this paper, we develop algorithms to automatically mine concept mappings in hierarchical media data, metadata, and ontologies. We propose a solution which mines such relationships using the inherently available structural information. We use Multidimensional Scaling [3,12,13,14] to map the nodes between the two different but similar structures (multimedia hierarchies, ontologies, or namespaces) such that the syntactically different but semantically similar components map to each other. In Section 2, we present our motivation. In Section 3, we define the problem formally, and in Section 4, we propose a solution and develop an efficient algorithm. Then, in Section 5, we evaluate the performance of the algorithm for various parameters and show that it is very effective in addressing this challenge.

2. MOTIVATIONS

In this section, we present two related motivations for this work. The first one, integration of Resource Description Framework (RDF) specified multimedia resources, requires mining of mappings between hierarchical namespaces (or ontologies). The second application, XML multimedia document matching and integration, requires mining and mapping of components (elements or attributes) in various multimedia objects.

2.1 Integration of Resource Description Framework (RDF) described Media Resources

If application and media content experts could easily associate metadata with each resource they create, then this metadata could be used by access and integration engines to increase their efficiency and precision. In order to enable this, the metadata format used by different applications must be compatible. Within the context of web information integration, many proposals were made to the World Wide Web Consortium (W3C) for representation of Web-related metadata. Initial solutions were based on the tag of the HTML. Currently, many companies, such as Microsoft, IBM, Motorola, Netscape, Nokia, and OCLC, are actively participating in the field of metadata framework developments. In 1997, Netscape submitted a proposal, titled "Meta Content Framework", to W3C [5]. The two principles on which the Meta Content Framework (MCF) is based are (1) there is no distinction between the representation needs of data and metadata, and (2) for interoperability and efficiency, schemas for different applications should share as much as possible in the form of data structure, syntax, and vocabulary. The culmination of various frameworks was the Resource Description Framework or RDF [6]. RDF provides application developers with a foundation for the description of metadata for the next generation of interoperable applications [1].

Ontologies, formalisms that define the relationships among terms in a given application, describe the context in which metadata is applied. They are used to link, compare, and differentiate information provided by various application resources. RDF provides a rich data model where entities and relationships can be described. The relationships in the RDF framework are first-class objects, which means that relationships between objects may be arbitrarily created and be stored separately from the two objects. This nature of RDF is very suitable for the dynamically changing, distributed, shared nature of the Web. It is designed to provide a framework that ensures interoperability between metadata frameworks.

The metadata (property names) used to describe resources are generally application dependent and must be associated with an RDF schema. This, however, can cause difficulties when RDF descriptions need to be shared across application domains.

For example, Figure 1 represents an RDF statement for resource www.asu.edu. The properties (metadata) used to define the resource University_1 are Name and Location. However, the property Location can be defined in some other application domain as Address. Although the semantics of both property names are the same, syntactically they are different. In general, a property name may have a broader or narrower meaning depending upon the needs of particular application domains. In order to prevent such conflicts and ambiguities, the terminology used by each application domain must be clearly identified.

Figure 1. Example RDF statement

RDF uniquely identifies property names by using the Namespace mechanism [7]. A namespace can be thought of as a context or an ontology that gives a specific meaning to what might otherwise be a general term. It provides a method for unambiguously identifying the semantics and conventions governing the particular use of property names by uniquely identifying the governing authority of the vocabulary. Thus, using namespaces, RDF provides the ability to define and differentiate semantics among communities.

Although with the help of namespaces, we can uniquely identify and relate the metadata to a particular governing authority or a community, there is no straightforward way to map and relate terms or properties among different communities. Computer

Figure 2. Similar hierarchical namespaces

Consider the two hierarchical namespaces provided in Figure 2 (the hierarchy usually corresponds to the concept/class hierarchy of the underlying domains). As implied by the similar structures of these namespace hierarchies, the terms Processor and CPU are semantically related. Therefore, if a user integrates two data domains, each using one of these two namespaces, then whenever a query is issued using the property name CPU, content having the property name Processor should also be retrieved.


Automatic mapping of semantically similar but syntactically different terms from different namespaces is thus one of the necessities for integrating content from independently created data sources. An automated mechanism needs to be devised that relates the common and uncommon terms of various metadata communities.

Figure 4. An example X3D document and the corresponding 3D world

2.2 Matching of Extensible Markup Language (XML) specified Media Objects

XML became the de facto standard for multimedia data representation. For example, X3D [11], a file format and related access services for describing interactive 3D objects and worlds, is based on XML. X3D nodes are expressed as XML elements, i.e., tagged names (see Figure 4 for an example X3D document).

Many multimedia standards define objects as structured collections of media data, and XML descriptions of such multimedia objects and structures are very common. Examples include virtual reality modeling languages (X3D), media content description frameworks (MPEG-7), e-commerce web documents, and geographic information systems. Extensible Markup Language (XML) [8] defines a generic syntax used to mark up data with simple, human-readable tags, and it provides a standard format for computer documents. As shown in Figure 3, an XML document is a tree-like structure, whose structure may be defined through a Document Type Definition (DTD) or through an XML Schema. XML is very flexible; its attributes and sub-elements can be either missing or repeated. DOM [9] and LORE [10] are two well-known tree-based data models for XML documents. To store an XML document in its tree form, each node of the tree corresponds to an element or an attribute of an element in the XML document. The root node contains the document's root. A child node corresponds to a subelement or an attribute of the parent node. For each child of a node, besides the pointer to the child, there is a tag in the node that indicates the name of the child node. If the child is a subelement, the name is its element tag. If the child is an attribute, the name is the attribute name.

XML tags can be used for describing data in the form of a hierarchical structure. This provides flexibility and expressive power to the data description framework, but it also complicates the integration task, as different domains could have different sets of rules, tags, and properties to represent the same data. Although DTDs and namespaces, whenever they are available, provide information about the structure of XML files, not all XML documents have associated DTDs. In fact, one of the main reasons why XML is becoming a de facto information exchange standard is that each XML document (even when it does not have an associated DTD) is self-describing: the hierarchical structure and the tags in this hierarchy give information about the relationships between the tags, i.e., the data elements and their attributes. Consequently, given two similar multimedia objects, where different tag names are used to denote similar concepts, it should be possible to make the association between these tags using an analysis of the hierarchical XML document structures. For example, in Figure 5, nodes labeled F and X are likely to correspond to each other.
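The tree view of an XML document that this discussion relies on can be made concrete with a few lines of Python. The following sketch (the two fragments and their tag names are invented for illustration) parses two documents whose structures match except for one leaf tag, which is exactly the situation depicted in Figure 5.

import xml.etree.ElementTree as ET

# Two hypothetical documents with the same shape but different tag names,
# analogous to the trees in Figure 5.
doc_a = "<A><B><D/><E/></B><C><F/></C></A>"
doc_b = "<A><B><D/><E/></B><C><X/></C></A>"

def structure(elem, depth=0):
    """Yield (depth, tag) pairs in document order."""
    yield depth, elem.tag
    for child in elem:
        yield from structure(child, depth + 1)

shape_a = list(structure(ET.fromstring(doc_a)))
shape_b = list(structure(ET.fromstring(doc_b)))

# The depth profiles are identical, which is the structural clue the paper
# exploits; only one leaf tag differs (F vs. X).
print([d for d, _ in shape_a] == [d for d, _ in shape_b])            # True
print({t for _, t in shape_a} ^ {t for _, t in shape_b})             # {'F', 'X'}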




Figure 3. An example XML document and the corresponding DTD


Figure 5. Two similar XML documents; the node labels F and X are likely to denote semantically similar elements


3. PROBLEM STATEMENT

The problem we address in this paper is to mine mappings between the nodes of hierarchical media data, metadata, and ontologies. The main observation is that the structural similarity between two given trees (such as hierarchical media objects, XML documents, or namespaces) can provide clues about the semantic relationships between their nodes.

In general, the nodes in the two trees can be divided into common and uncommon nodes. The common nodes are those shared by the two trees; they can either have the same labels or (in the case of multimedia data) they may have application-dependent features that provide high degrees of similarity [25]. In this paper, we do not focus on how common nodes are discovered. Our aim is to relate the uncommon nodes of two given hierarchical structures. Therefore, formally, we can state the problem as follows:

Given
• two trees, T1(V1,E1) and T2(V2,E2), where V denotes the nodes in the tree and E denotes the edges between the nodes,
• a partial mapping, M, between the nodes in V1 and V2 (we call those nodes in V1 and V2 that have a mapping the common nodes), and
• two unmapped nodes vi in V1 and vj in V2,

compute the similarity between vi and vj. For example, given the two trees in Figure 5, the user or the content integrator might want to find which node in the first tree corresponds to the node labeled X in the second tree (in this example, purely based on the structures of the two trees, we can conclude that X corresponds to F in the first tree). The use of structural information for mining semantic relationships is not new. We used structural information available on the web for mining web document associations, summarizing web sites, and answering web queries [22,23,24]. [17,18,19,20] have used language taxonomies and IS-A hierarchies to define the similarity and distance between terms in natural language. These mainly rely on the observation that, given a tree T(V,E) and two nodes a and b on the tree, we can compute a distance, d(a,b), between the nodes by considering the structure of the tree, for instance by counting the number of edges between them. The main challenge we face in this paper for finding the similarity between two nodes in two different trees, on the other hand, is that there is no common structure to compare these two nodes; since the two trees may have arbitrarily different structures, finding a mapping between the nodes is not trivial.

4. PROPOSED APPROACH

In order to match two nodes in two different trees, we need to find a mapping such that the distance values between the common nodes in the two trees are preserved as much as possible. In this paper, we address this challenge by mapping the two trees into a common space using the matching nodes and comparing the unmapped nodes in this common space. The proposed solution can be broken down into four steps:

1. Map the nodes of T1 and T2 into two multidimensional spaces, S1 and S2, both with the same number of dimensions.
2. Identify transformations required to align the space S1 with the space S2 such that the common nodes of the two trees are as close to each other as possible in the resulting aligned space.
3. Use the same transformations to map the uncommon nodes in S1 onto S2.
4. Now that the nodes of the two trees are mapped into the same space, use clustering and nearest-neighbor algorithms to find the related uncommon nodes in the two trees.

4.1 Step I: Map both Trees into Multi-Dimensional Spaces

We map the trees based on the common nodes. For example, the common nodes of two given namespaces might include shared terms like University and College that are known to denote similar concepts.

Multi-Dimensional Scaling (MDS) is a family of data analysis methods, all of which portray the structure of the data in a spatial fashion [3,12,13,14]. MDS is used to discover the underlying spatial structure of a set of data items from the distance information among them.

MDS works as follows: it takes as inputs (1) a set of N objects, (2) an N x N matrix containing pairwise distance values, and (3) the desired dimensionality k. Given these inputs, MDS tries to map each object into a point in the k-dimensional space. The mapping process of documents, given their distances with respect to one another, is illustrated in Figure 6.

Figure 6. MDS mapping of four data points onto a two-dimensional space

The criterion for the mapping is to minimize a stress value, which is calculated as

stress = sqrt( Σij (dij - d'ij)^2 / Σij dij^2 ),

where dij is the actual distance between two nodes vi and vj, and d'ij is the distance between the corresponding points pi and pj in the k-dimensional space. Thus, if we can maintain the distance between pi and pj the same as the distance between vi and vj, so that dij = d'ij for all pairs of nodes, then the stress is 0, which is the ideal mapping.

MDS initially starts with a configuration of points, called the initial configuration. In this work, we assume MDS uses a random configuration; thus, MDS initially starts with a random configuration of points. It then iteratively applies some form of steepest descent to minimize the stress. Once the inputs are mapped into a k-dimensional space, one can use a multidimensional index structure to do nearest-neighbor and range searches.

In tree-structured data, for example in a namespace, similar or related nodes are closer to each other and have fewer edges between them than dissimilar nodes. The closer the nodes, the shorter the distance between them in the distance matrix [17,18,19,20]; hence, similar or related nodes are mapped closer to each other in the multidimensional space. In other words, MDS preserves the similarity between points: similar nodes are mapped close to each other and dissimilar ones are mapped far from each other.
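A minimal sketch of Step I, assuming scikit-learn's metric MDS and a simple edge-count distance between tree nodes (the tree encoding and function names below are illustrative, not the authors' implementation):

from collections import deque
import numpy as np
from sklearn.manifold import MDS  # assumes scikit-learn is available

def tree_distances(children):
    """All-pairs edge-count distances in a tree given as {parent: [children]}."""
    adj = {}
    for parent, kids in children.items():
        adj.setdefault(parent, [])
        for kid in kids:
            adj[parent].append(kid)
            adj.setdefault(kid, []).append(parent)
    nodes = sorted(adj)
    index = {n: i for i, n in enumerate(nodes)}
    D = np.zeros((len(nodes), len(nodes)))
    for start in nodes:
        # Breadth-first search from each node gives edge counts to all others.
        dist, queue = {start: 0}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for target, d in dist.items():
            D[index[start], index[target]] = d
    return nodes, D

# Example namespace tree with the labels of Figure 2.
tree = {"Computer": ["CPU", "Memory"], "CPU": ["Pentium", "Celeron"],
        "Memory": ["DRAM", "SRAM"]}
nodes, D = tree_distances(tree)

# Embed the nodes into a k-dimensional space from the precomputed distances.
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(D)
print(dict(zip(nodes, np.round(embedding, 2))))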

4.2 Step II: Find the Transformations Required to Map the Common Nodes of the Two Trees Close to Each Other in the Space

Once both trees are mapped onto two separate k-dimensional spaces, we need to relate the common nodes of the two trees. To achieve this, we identify the transformations required to map the common nodes of both trees as close to each other as possible in a shared space. In order to match the common nodes, we use the Procrustes alignment algorithm [15,26]. Given two sets of points, the Procrustes algorithm uses linear transformations to map one set of points onto the other set of points. The general Procrustes algorithm seeks the isotropic dilation and the rigid translation, reflection, and rotation needed to best match one configuration to another [15]. In our case, the inputs to the algorithm are the nodes (terms) common to T1 and T2.

Figure 7. Tree distortion process: (a) original tree; (b) distorted tree

4.3 Step III: Use the Same Transformations to Map the Uncommon Nodes

The previous step returns the transformations required to modify the given spaces such that the common nodes of both trees conform to each other as much as possible. This matching between the common nodes of both trees can then be used to define the similarity between the uncommon nodes. Using the transformations identified in the previous step, the uncommon nodes in the two trees are mapped into the shared space in terms of their distances from the common nodes in their respective trees. The uncommon nodes of both trees that are at approximately the same distance, or within the same distance range, from the common nodes in their respective trees are likely to be similar and will be mapped close to each other in the shared k-dimensional space.
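A sketch of Steps II and III under these definitions: the scaling, rotation, and translation that best align the common nodes are estimated with a standard least-squares Procrustes fit (plain NumPy below; the variable names and the sample coordinates are illustrative), and the same transformation is then applied to the uncommon nodes of the first tree.

import numpy as np

def fit_procrustes(X_common, Y_common):
    """Least-squares similarity transform mapping points X onto points Y.

    X_common, Y_common: (n, k) coordinates of the common nodes in the two
    MDS spaces, listed in the same order. Returns (scale, rotation, mu_x, mu_y).
    """
    mu_x, mu_y = X_common.mean(axis=0), Y_common.mean(axis=0)
    Xc, Yc = X_common - mu_x, Y_common - mu_y
    U, S, Vt = np.linalg.svd(Xc.T @ Yc)      # cross-covariance factorization
    R = U @ Vt                                # rotation (possibly a reflection)
    scale = S.sum() / (Xc ** 2).sum()         # isotropic dilation
    return scale, R, mu_x, mu_y

def apply_procrustes(points, scale, R, mu_x, mu_y):
    """Map points from the first space into the second, aligned space."""
    return scale * (points - mu_x) @ R + mu_y

# Hypothetical 2-D MDS coordinates; rows correspond to the same common nodes.
X_common = np.array([[0.0, 0.0], [1.0, 0.2], [0.4, 1.1]])
Y_common = np.array([[2.0, 1.0], [3.1, 1.1], [2.3, 2.2]])
params = fit_procrustes(X_common, Y_common)

# An uncommon node of T1 carried into T2's space by the same transformation.
print(apply_procrustes(np.array([[0.7, 0.6]]), *params))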

4.4 Step IV: Use Clustering to Find the Related Uncommon Nodes from the Two Trees

At this point, we have two trees whose nodes are mapped onto a shared k-dimensional space such that the common nodes are close to each other in the space. Hence, we can use clustering and nearest-neighbor approaches to identify related uncommon nodes.

To retrieve the related points from the multi-dimensional space, we use a k-means [16] based clustering algorithm, which requires centroids to be given to form clusters around. We use the nodes of one tree as the centroids for the clustering, and we use the Euclidean distance in the shared space to form the clusters.

As a result, the returned clusters contain the node of one tree specified as the centroid and one or more nodes from the other tree that are closest to it in the shared space. Thus, in the form of a cluster, we obtain pairs of nodes from the two trees that are similar to each other.
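A minimal sketch of Step IV, assuming the aligned coordinates from the previous step: each uncommon node of one tree acts as a centroid, and the nearest uncommon nodes of the other tree (by Euclidean distance) are grouped with it. The node names and coordinates are invented for illustration.

import numpy as np

def pair_uncommon_nodes(centroids, candidates, max_pairs=1):
    """Assign to every centroid node the closest candidate nodes.

    centroids, candidates: dicts mapping node label -> aligned k-D coordinates.
    Returns {centroid_label: [closest candidate labels]}.
    """
    cand_labels = list(candidates)
    cand_points = np.array([candidates[c] for c in cand_labels])
    pairs = {}
    for label, point in centroids.items():
        dists = np.linalg.norm(cand_points - np.asarray(point), axis=1)
        order = np.argsort(dists)[:max_pairs]
        pairs[label] = [cand_labels[i] for i in order]
    return pairs

# Hypothetical aligned coordinates of uncommon nodes from the two trees.
tree1_uncommon = {"Processor": [2.9, 1.2]}
tree2_uncommon = {"CPU": [3.0, 1.1], "Address": [0.2, 4.0]}
print(pair_uncommon_nodes(tree1_uncommon, tree2_uncommon))  # {'Processor': ['CPU']}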

5. EXPERIMENTAL EVALUATION

In this section, we provide an experimental evaluation of the proposed approach for mining mappings between the nodes of hierarchical media data, metadata, and ontologies. In order to evaluate the proposed approach and to observe the effects of various data parameters (such as the number of nodes in the two trees and their degrees or fanouts), we needed a large number of trees. Furthermore, we needed to be able to vary these parameters in a controlled manner to observe the performance under different conditions. Therefore, we systematically generated a large number of tree-structured data sets (i.e., the ground truth) with varying parameters and used these trees in our initial experimental evaluation. After observing the effectiveness of our algorithm using this ground truth, we also used a real collection of data and verified our results.

5.1 Generating Ground Truth

The challenge addressed in this paper is to relate the nodes of two trees that can differ from each other in terms of the number of nodes, the density of nodes, and the node labels. Therefore, to generate two related but different trees, we (1) pick an original tree, and then (2) distort the original tree by relabeling existing nodes, deleting existing nodes, and adding new nodes, to generate a distorted tree (Figure 7).

Thus, the original and the distorted trees act as two similar trees, which differ in terms of the number of nodes and the labels of some of the nodes. The two trees have some nodes that are common (undistorted) and some nodes that are uncommon.

5.1.1 Synthetic Tree Generation for Controlled Experiments

For the first set of experiments, where the input data trees are generated in a controlled manner, we developed a tree generation program which creates a tree randomly based on two parameters: the number of nodes and the maximum fanout (degree) of the tree.

For our experiments, we have generated original trees with the configurations shown in Table 1.

Table 1. Parameters for tree generation
Number of nodes in the tree: 25, 50, 100, and 200
Fanout (degree): 1, 2, 4, 8, and 16

We generated (4*5 =) 20 sets of trees, one for each distinct combination of the number of nodes and the fanout value. In our experiments, we used 5 different seed values for every combination; therefore, we have a total of 100 original trees for the experiments. We report these results in Sections 5.3, 5.4, and 5.5.
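A sketch of how such ground truth can be produced, assuming the distortion operations described above (random relabeling is shown here; deletions and additions are analogous). The generator below is illustrative and is not the authors' program.

import random

def random_tree(num_nodes, max_fanout, seed=0):
    """Random tree as {parent_label: [child_labels]}; labels are n0..n(k-1)."""
    rng = random.Random(seed)
    labels = [f"n{i}" for i in range(num_nodes)]
    children = {label: [] for label in labels}
    open_slots = [labels[0]]                 # nodes that can still take children
    for label in labels[1:]:
        parent = rng.choice(open_slots)
        children[parent].append(label)
        if len(children[parent]) >= max_fanout:
            open_slots.remove(parent)
        open_slots.append(label)
    return children

def rename_distort(children, fraction, seed=0):
    """Relabel a fraction of the nodes; returns (distorted_tree, renamed_labels)."""
    rng = random.Random(seed)
    labels = list(children)
    renamed = set(rng.sample(labels, int(fraction * len(labels))))
    new_name = {l: (l + "_x" if l in renamed else l) for l in labels}
    distorted = {new_name[p]: [new_name[c] for c in kids]
                 for p, kids in children.items()}
    return distorted, renamed

original = random_tree(num_nodes=25, max_fanout=4, seed=1)
distorted, renamed = rename_distort(original, fraction=0.25, seed=1)
print(len(renamed), "of", len(original), "nodes relabeled")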

5.1.2 Experiments with Real Trees

In addition to the synthetic trees, we also ran our experiments with the TreeBank data set [21], which has a deep recursive structure (whereas our synthetic trees were mostly balanced). We report these results in Section 5.6.

5.2 Terminology

The following terms are used in discussing and explaining the experiment results:
• Number of nodes: the total number of nodes in the original tree.
• Number of nodes that are mapped: the number of nodes in the original tree plus the number of nodes in the distorted tree.
• Correct mapping: when a given query node of a given tree maps to the corresponding node of the other tree, the mapping is said to be a correct mapping.
• Erroneous mapping: when a given query node of a given tree does not map to the corresponding node of the other tree, the mapping is said to be an erroneous mapping, and each erroneous mapping has a degree of error, err:
  o mapping to a sibling of the correct node [err=1],
  o mapping to the parent of the correct node when the correct node does not have a sibling [err=1],
  o mapping to the parent of the correct node when the correct node has at least one sibling [err=2],
  o mapping to a sibling of the parent [err=3],
  o mapping to a distant node [err=4], and
  o no mapping.
• Error percentage: the ratio of the erroneous mappings to the number of mappings returned, i.e., (# erroneous mappings / # of nodes mapped)*100. Note that at most one of the mapped nodes can be a correct map.
• Precision: the precision is measured as a weighted sum over the k nodes returned, where the weight of node ni reflects its degree of matching; i.e., a node with a lower error contributes more to the precision. If the algorithm does not return any matches for a given node, the corresponding precision is defined as 0.
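The exact weighting formula is not reproduced here. As an illustration only, the sketch below computes one plausible instantiation in which each returned node contributes a score that decays with its error degree; the 1/(1 + err) decay is an assumption, not the authors' definition.

def weighted_precision(error_degrees):
    """Illustrative weighted precision for one query node.

    error_degrees: error degree of each returned node (0 = correct match,
    1-4 as defined in Section 5.2). Returns 0 when nothing is returned.
    The 1/(1 + err) weighting is an assumed stand-in for the paper's formula.
    """
    if not error_degrees:
        return 0.0
    return sum(1.0 / (1 + err) for err in error_degrees) / len(error_degrees)

print(weighted_precision([0]))        # perfect match -> 1.0
print(weighted_precision([1, 4]))     # sibling + distant node -> 0.35
print(weighted_precision([]))         # no mapping returned -> 0.0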

5.3 Experiment I – Label Differences

In the first set of experiments, we aimed to see the performance of the proposed algorithm when the structures of the trees are identical, but some of the nodes are labeled differently. For every original synthetic tree, we generated 4 distorted trees using the node rename operation. The levels of distortion experimented with are 5%, 25%, 45%, and 65%. Therefore, we have a total of (100*4 =) 400 test cases. For every query node, the implementation returns the matching nodes from the other tree, and for every original tree we run four tests. Figures 8, 9, and 10 support the following observations:
• As the distortion increases, the error also increases. The most common errors are due to nodes that are mapped to a sibling of the correct node.
• The error pattern observed is similar for trees with a total number of nodes of 25, 50, 100, and 200.
• As the fanout increases, the error also increases. For trees with fanout 1 or 2, no errors are observed.
• The precision is close to perfect (1.0) for low renaming distortions. For heavy renaming (65%), the weighted precision can drop slightly for trees with large fanouts.


Figure 8. Percentage mapping error in Experiment I

Figure 9. Types of the errors for Experiment I

Figure 10. Weighted precision in Experiment I

5.3.1 The effect of distortion

The similarity mapping between the nodes of the original tree and the distorted tree is based on two things: the distances between the nodes in each tree and the distances between the common nodes in both trees. The higher the number of common nodes between the two trees, the more similar the two trees are, and hence the better the resulting mappings between the distorted and the original nodes. An increase in the distortion lessens the number of common nodes between the two trees; as a result, the error rate gets higher.

5.3.2 The effect of fanout

For trees with a maximum fanout of 2, there is a high probability of correct mapping when only one of the two siblings is mislabeled. However, when the fanout is higher, the probability that siblings (especially leaf siblings, which are structurally identical to each other) will be erroneously mapped to each other increases. Hence, the rate of correct mapping decreases as the fanout increases.

5.3.3 Types of errors

Figure 9 presents the different types of errors for trees with 100 nodes (the results are similar for other tree sizes as well). In Experiment I, the only operation that causes distortion is renaming. Hence, although the names of some of the nodes are modified, the structure of the distorted tree is maintained. Consequently, in most erroneous cases, the distorted nodes are simply mapped to a sibling of the correct node.

5.3.4 Precision

Figure 10, on the other hand, takes into account the degree of match even for those nodes that do not perfectly match the requested node. From this figure, it is again clear that the result precision is close to perfect (1.0) for low renaming distortions. In the case of heavy renaming (65%), the weighted precision drops as the fanout increases. However, the degree of the drop is not significant, which means that even when the algorithm cannot find a perfect match, it returns a node close to what was expected.

More importantly, the results show that as the number of nodes in the tree increases, the weighted precision significantly improves. This shows that, as the number of available nodes increases, the distance-based mapping of nodes into the search space becomes more and more robust, and this leads to better matches.

5.4 Experiment II – Structural Differences

In Experiment II, we used combinations of addition, deletion, and rename operations to generate distortions in the original synthetic trees. This enabled us to observe the performance when the structures of the trees are also variable. For each of the 100 original trees, we apply 3 levels of distortion:
• 15% (5% addition + 5% deletion + 5% rename),
• 30% (10% addition + 10% deletion + 10% rename),
• 45% (15% addition + 15% deletion + 15% rename).
Figures 11, 12, and 13 provide the following observations:
• As the distortion increases, the error also increases. The most significant error is empty matches.
• Unlike Experiment I, when the fanout is 1 or 2, the error percentage is the highest. The error percentage drops sharply for a fanout value of 4 and remains more or less constant as the fanout increases.
We examine the different effects in detail next.

5.4.1 The effect of distortion

In Experiment II, we used a combination of addition, deletion, and rename operations to generate distortion. The general observation in this experiment is that almost 75% of the test runs return results that do not exhibit an entirely correct mapping. These errors are due to the increase in the overall distortion: the higher the distortion, the lower the number of common nodes between the two trees and, hence, the greater the probability of a wrong mapping. The type of distortion is also a key factor that influences the error proportion. As expected, changes in the tree structure (due to additions and deletions) have a negative effect on the error percentage.

5.4.2 The effect of fanout

In trees with lower fanouts, each node is closely related (short distanced) to only a few nodes. Each of these close nodes is highly important in achieving a correct mapping. If any of these close nodes is deleted, then the given node loses important distance information. Hence, it becomes difficult to map the node exactly. As a result, either the node does not get mapped to any node of the corresponding tree or it maps to the parent of the correct node (Figure 11). On the contrary, if the tree has a high fanout, each node has a large number of siblings with which it is closely related. Even if one of these nodes is deleted, there are many other nodes for the given node to relate to. Although there is an increased probability that the given node wrongly maps to a sibling, there is a relatively high probability of correct mapping.

5.4.3 Precision

Figure 13 shows the weighted precision obtained by the proposed algorithm in the case of a combination of distortions. From this figure, it is clear that the result precision is large for large fanouts. An increase in the number of nodes in the tree, on the other hand, has different effects depending on the fanout of the nodes. If the fanout is low, a larger tree actually means a significant drop in the precision. If the fanout is large, however, a higher number of nodes in the tree actually improves the precision. This is in accord with the drop in error for combined distortions (Figure 11).

Figure 11. Percentage mapping error in Experiment II

Figure 12. Types of the errors for Experiment II

Figure 13. Weighted precision in Experiment II

5.5 Experiment I vs. Experiment II

As expected, better results were observed in Experiment I than in Experiment II. In the case of Experiment I, the only distortions were caused by renaming; thus, the structures of the original and distorted trees were the same, resulting in better matches.

In these two experiments, we observed significantly different behaviors when it comes to the effects of fanouts:
• In Experiment I, a smaller fanout means a smaller chance of mapping a node to a sibling of the correct node. Hence, a smaller fanout translates into a smaller error rate.
• In the case of Experiment II, on the other hand, when the fanout is very low, the overall tree structure can be drastically changed by a small number of node deletions and additions. Since the proposed algorithm is based on the structure of the tree, the resulting error rate was considerably high in cases with low fanout. For large fanouts, however, heavy renaming is more detrimental than structural change, since without enough matching nodes the two spaces cannot be aligned reliably.

5.6 Experiment III: TreeBank Collection

In addition to the synthetic trees we used in Experiments I and II, we also ran additional experiments with the TreeBank data set available at [21]. The deep recursive structure of this data set (maximum depth 36, average depth 7.87), in contrast to the mostly balanced structures we used in the experiments with synthetic trees, also provides opportunities for additional observations. For the experiments with real data, in order to observe the effects of distortion, we clustered the trees in the collection based on their numbers of nodes. For instance, if we wanted to observe the precision of our algorithm for trees with 100 nodes, we selected trees from the collection that have around 100 nodes. Then, we applied the various types of distortions on these trees as explained earlier.

5.6.1 Experiment I: Label Differences on TreeBank Data

Figure 14 shows the weighted precisions obtained by the proposed algorithm in experiments with TreeBank data (with only node relabelings). The results show that the proposed algorithm is very robust with respect to relabeling errors in real data. Even when 65% of the nodes are relabeled, the approach is able to identify the correct node with up to 90% precision. When we compare the results presented in this figure with the results obtained using synthetic trees (Figures 10 and 11), we see that for large fanouts the precision the algorithm provides on real data is significantly larger (up to 90% precision even with 65% relabelings) than the precision obtained on the synthetic tree sets (60% precision with 65% relabelings).

Figure 14. Weighted precision in experiments with TreeBank data (only rename distortions)

We observed that for trees with 200 nodes, around 70% of the errors were due to nodes that did not match any other node, 14% were due to nodes that matched their siblings, and another 14% were due to nodes that matched siblings of their parents. This is in contrast with the results on synthetic data (Figure 9), where the no-mapping errors were close to 0. Nevertheless, the overall precision was higher than in the case of synthetic data; i.e., when there were nodes returned in the result set, the errors of these nodes were closer to 0.

5.6.2 Experiment II: Structural Differences on TreeBank Data

Figure 15 shows the weighted precisions obtained by the proposed algorithm in experiments with TreeBank data in which all types of distortions, including additions and deletions of nodes, are allowed. From this figure, it is clear that the precision behavior of the proposed algorithm on real data matches the precision behavior obtained using the synthetic tree sets of the previous experiments (Figures 14 and 15). As expected, due to the structural distortions, the weighted precisions are lower than in the case of relabelings only, but they remain above 80% even with 45% combined distortion.

Figure 15. Weighted precisions in experiments with TreeBank data (with structural distortions)

5.7 CONCLUSIONS

The functioning of an automated multimodal media integration system requires access to metadata that describe the individual media resources. This metadata is generally application dependent. Therefore, an automated media integration mechanism needs to mine and relate the common and uncommon components. In this paper, we developed algorithms to automatically discover mappings in hierarchical media data, metadata, and ontologies. The proposed algorithm uses structural information to mine and map semantically similar but syntactically different terms and components. We extensively evaluated the performance of the algorithm for various parameters and showed that it is very effective in achieving a high degree of correct matches.

6. REFERENCES

[1] D. Brickley and R. Guha. Resource Description Framework (RDF) Schema Specification, 2000. http://www.w3.org/TR/RDF-schema
[2] Y. Lafon and B. Bos. Describing and Retrieving Photos Using RDF, 2000. http://www.w3.org/TR/photo-rdf
[3] Dublin Core Initiative and Metadata Element Set. http://dublincore.org
[4] J.B. Kruskal. Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis. Psychometrika, 29(1):1-27, March 1964.
[5] R.V. Guha and T. Bray. Meta Content Framework Using XML, 1997. http://www.w3.org/TR/NOTE-MCF-XML-970624
[6] O. Lassila. Introduction to RDF Metadata, 1997. http://www.w3.org/TR/NOTE-rdf-simple-intro
[7] Namespaces in XML. http://www.w3.org/TR/REC-xml-names
[8] Extensible Markup Language (XML). http://www.w3.org/TR/REC-xml
[9] Document Object Model (DOM). http://www.w3.org/DOM/
[10] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A Database Management System for Semistructured Data. SIGMOD Record, 26(3):54-66, September 1997.
[11] Extensible 3D (X3D) Graphics. http://www.web3d.org/x3d.html
[12] J.B. Kruskal. Nonmetric Multidimensional Scaling: A Numerical Method. Psychometrika, 29(2):115-129, June 1964.
[13] W.S. Torgerson. Multidimensional Scaling: I. Theory and Method. Psychometrika, 17:401-419, 1952.
[14] J.B. Kruskal and M. Wish. Multidimensional Scaling. SAGE Publications, Beverly Hills, 1978.
[15] J. Gower. Generalized Procrustes Analysis. Psychometrika, 40:33-51, 1975.
[16] J. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. Proc. 5th Berkeley Symp. on Math. Statist. and Prob., Vol. 1, pp. 281-297, 1967.
[17] P. Resnik. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 11:95-130, 1999.
[18] P. Resnik. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI, pp. 448-453, 1995.
[19] R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and Application of a Metric on Semantic Nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1):17-30, 1989.
[20] J. Lee, M. Kim, and Y. Lee. Information Retrieval Based on Conceptual Distance in IS-A Hierarchies. Journal of Documentation, 49(2):188-207, 1993.
[21] University of Pennsylvania Treebank Project collection. http://www.cs.washington.edu/research/xmldatasets/www/repository.html
[22] W.-S. Li, K.S. Candan, Q. Vu, and D. Agrawal. Query Relaxation by Structure and Semantics for Retrieval of Logical Web Documents. TKDE, 14(4):768-791, 2002.
[23] K.S. Candan and W.-S. Li. Discovering Web Document Associations for Web Site Summarization. DaWaK 2001, pp. 152-161.
[24] K.S. Candan and W.-S. Li. Using Random Walks for Mining Web Document Associations. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 294-305, 2000.
[25] K.S. Candan and W.-S. Li. On Similarity Measures for Multimedia Database Applications. Knowledge and Information Systems, 3(1):30-51, 2001.
[26] D.G. Kendall. Shape Manifolds: Procrustean Metrics and Complex Projective Spaces. Bulletin of the London Mathematical Society, 16:81-121, 1984.


CoMMA: A Framework for Integrated Multimedia Mining using Multi-relational Associations

Ankur Teredesai [email protected]
Juveria Kanodia [email protected]
Muhammad Ahmad [email protected]
Roger Gaborski [email protected]

Department of Computer Science, 102 Lomb Memorial Drive, Rochester Institute of Technology, Rochester, NY 14623

ABSTRACT

Generating captions or annotations automatically for still images is a challenging task. Traditionally, techniques involving higher-level (semantic) object detection and complex feature extraction have been employed for scene understanding, and corresponding text descriptions are generated for a given image based on this understanding. In this paper, we pose the auto-annotation problem as one of multi-relational association rule mining, where the relations exist between image-based features and textual annotations. The central idea is to combine low-level image features such as color, orientation, and intensity with the corresponding text annotations to generate association rules across multiple tables using multi-relational association mining. Subsequently, we use these association rules to auto-annotate test images.

In this paper we also present a multi-relational extension to the FP-Tree algorithm to accomplish the association rule mining task more effectively than the currently used decentralized version of the Apriori algorithm. The motivation for using multi-relational association rule mining for multimedia data mining is to exploit the potential afforded by multiple descriptions for the same image (such as multiple people labeling the same image differently). Moreover, multi-relational association rule mining can also benefit the auto-annotation process by pruning the number of trivial associations that are generated if text and image features are combined in a single table through a join. In this paper, we discuss these issues and the results of our auto-annotation experiments on different test sets.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data mining, Image databases

General Terms
Algorithms, Design, Experimentation.

Keywords
Image Captioning, Multimedia Data Mining, Auto-Annotation, Multi-relational Association Rule Mining, FP-Growth, Multi-Relational FP-Growth, Text-Based Image Retrieval.

1. INTRODUCTION

It is said that a picture is worth a thousand words, but determining the likely words that constitute the correct description of a picture is considered a challenging problem by the computer vision, text mining, and multimedia data mining communities. Knowledge derived from these two domains, i.e., image and text data together, is more descriptive than when each domain is considered in isolation. Based on this fact, it is our conjecture that multi-relational associations should capture more information from the combined metadata. Conventional approaches use metadata from individual image features or text-domain annotations via a relational join and develop feature-based clusters. This paper formulates this conjecture as a multi-relational hypothesis and tests the validity of integrated mining of combined multimedia data using multi-relational association rules.


Recent years have witnessed a phenomenal growth in image databases and retrieval systems, such as Viper [2] and MultiMediaMiner [21], to name a few. The World Wide Web has emerged as the largest repository of image data in the world. Image retrieval based on keyword search from such large databases poses a significant challenge. The search results can be greatly improved if the images are already annotated. However, owing to the large number of images in these databases, the only viable means of annotating them is to automate the process, since manual annotation can be a tedious and expensive job. The problem of auto-annotation is usually treated as a supervised learning problem, where higher-level features are extracted from images and complex object detection algorithms are employed to generate keywords. Image segmentation and labeling of objects is not easy. Several clustering and classification techniques have been employed for the auto-annotation of images [1, 3, 4, 9]. The "blob" approach requires the images to be in a state where object recognition is possible [3, 9]. Hsu et al. employed the idea of viewpoints, which refer to the notion of invariant relationships between objects in an image [5]. Object identification in images is usually expensive and thus increases the cost of auto-annotation [3, 18, 28].

Association rule mining for images is a fairly nascent subfield of image mining. There are two main approaches to association rule mining in images: the first one involves mining just the images, while the second one involves mining images along with some textual data associated with them. We apply the latter approach in this paper. We extract basic features like color, orientation, and intensity from the images. More complex features were specifically not preferred, so as to study and analyze the performance of the multi-relational approach with a minimum consideration of the semantics of the image. The low-level features considered are color (the number of pixels that are red, green, blue, and yellow), orientation (edge orientations of 0, 45, 90, and 135 degrees), and intensity. The image features are extracted based on the focus of attention theory initially proposed by Itti and Koch [7]. The selective attention model allows the system to concentrate on processing salient objects in the scene without the need to process the unimportant aspects. The attention model processes the input image in three parallel feature channels: intensity contrast, color, and orientation. The feature saliency maps topographically represent the saliency of objects in the scene based on the respective features. For a detailed description of still-feature extraction from images, refer to our previous work [6].

We restrict the images under consideration to certain categories, such as "flowers" and "lakes and mountains", or their combinations. This allows us to exploit the correlation between low-level features in an image and high-level semantic content without object identification in the image. It is our hypothesis that within categories of images, enough similarities exist to allow the discovery of multi-relational associations, which can be used for the auto-annotation of images. Another motivation for using low-level features is the need to maximize system throughput while minimizing the overhead cost of storing high-level features. However, the application can be scaled to include high-level features such as shapes and objects by designing appropriate tables to hold them. Notice that the application of this framework to specific domains implies that it should not be used for an open-ended domain such as "nature" (which may comprise landscapes, flora and fauna, fruits and vegetables, underwater images, etc.).

Consider the scenario where images are stored in a database and associated with these images are annotations or captions derived from multiple sources. Although the annotations can be stored within the same database in different tables or combined into a single table, the drawback of the latter approach is that it does not take into account that there can be several different annotations for the same image, depending upon the users (or the intelligence of the auto-annotating system annotating the image). Hence, there is a one-to-many relationship from the image domain to the text domain. Doing a simple join can be expensive if M is a big number in the 1:M relation between the tables, and such a join is unnecessary if the tuples in a table do not qualify as frequent patterns. The relations between annotations in multiple tables and image features can, however, be captured by multi-relational association rules, thereby formulating the task as a multi-relational mining problem. Since this is the motivation behind our project, we termed the framework CoMMA: Combined Multi-relational Multimedia mining using Associations.

The paper is further organized as follows. The problem of auto-annotation is described in the next section. We pose this problem as a combination of multi-relational and multimedia data mining in Section 3. We propose an extension to the FP-Growth algorithm, called MRFP-Growth, for multi-relational mining in Section 4. Section 5 discusses the experiments conducted and their corresponding results. Section 6 outlines some of the issues that we would like to encourage the multimedia data mining community to discuss.

2. IMAGE ANNOTATION & RETRIEVAL

Image annotation can greatly enhance image retrieval, and many annotation schemes have been proposed for faster and better image retrieval. The subfield of Content-Based Image Retrieval (CBIR) employs global features of images, such as color histograms; this was used in IBM's QBIC (Query by Image Content) [11] and also in region-based approaches involving "blobs" [3][9]. FAST (Fast and Semantics-Tailored Image Retrieval Methodology) [22] uses fuzzy logic to create a new indexing method, HEAR (Hierarchical Elimination-based A* Retrieval), to handle region-based image information consisting of colors, texture, and shape. Relevance Feedback (RF) analysis is another effective solution for CBIR. Semi-Automatic Image Annotation [20] depends on user feedback and adds successful image search keywords as annotations to images. Zhong et al. introduce PCA [19] to reduce the noise in the original images and the dimensionality of the feature spaces. The authors of MultiMediaMiner [21] mention the use of association rule mining to obtain rules based on the colors of CT scans as an interesting application. Monay et al. compare [12] simple latent space models for the annotation task.

Image annotation has mostly been studied from a statistical as well as a supervised learning perspective [10]. The multi-relational association rule approach is somewhat similar to the statistical approach in that the support and confidence of rules across the image and text domains are related to the statistical distribution of features, while relations capture more information than simple statistical links. Two related subtasks are involved in the recognition of images: auto-annotation, which involves recognizing whole images, and object recognition, which involves recognizing objects in the images. In this paper, we address the former task using minimal features from the image domain. We would also like to note that the problem that we are

addressing is not only that of finding a particular suitable annotation for an image but also that of finding a set of keywords that can be used as a query set by a human for retrieval using a search engine.

3. CoMMA FEATURE EXTRACTION

CoMMA is developed as a general framework for employing multi-relational association rules for the auto-annotation of images in specialized domains. In Section 4, we describe a multi-relational version of the FP-Growth algorithm, which forms the core of the current application. In CoMMA, a user may optionally upload an annotated image. Such an image is used for generating possibly new rules, indicating that the system has learnt something new (training mode). Otherwise, rules are not generated but are used to annotate the given image (test mode). Figure 1 gives an overview of how CoMMA works. Starting from the user interface, images are uploaded into the image database along with the corresponding annotations. Low-level features are extracted from the image, while the text features mainly consist of terms from the annotation or caption. Multi-relational association rules are generated by applying the MRFP-Growth algorithm. Finally, these rules are used to annotate test images, and the performance is compared with the original ground truth. When an annotation is not provided for an image, the image features are extracted and annotations are obtained from the previously generated association rules, as shown by the dotted line in Figure 1.

Figure 1: The CoMMA Framework

As described in [6] and mentioned previously, the feature extraction is based on Itti and Koch's focus of attention theory [7]. Given a scene, humans selectively attend to important salient regions of the scene; for example, while driving on a road, the red and yellow street signs stand out in the scene. The focus of attention algorithm used in this system processes an image to extract the color, orientation, and intensity features. The four color saliency maps (red, green, blue, and yellow), the orientation maps (0, 45, 90, and 135 degrees), and the intensity contrast map are feature saliency maps. Gaborski et al. experimented [6] with feature saliency to infer the importance of each feature for different scene types. Their approach collected human eye-tracks for images of natural landscape scenes, indoor scenes, building/city scenes, and fractal images. The eye-tracks collected for the four scene types were used to find the feature that dominated the subject's attention. Based on the correlation studies on multiple images, intensity contrast gave the highest correlation for natural scenes and building/city scenes, while color gave the highest correlation for indoor and fractal scenes. Based on these observations, a test image can be classified as being one of the four scene types. These studies demonstrate the effectiveness of low-level features in classifying scenes and motivated us to use low-level features for generating rules for image annotations.

We treat each image ik in the database as a pseudo-vector that consists of the nine features described in Section 1, and the vector space consists of all image features. The image features were further discretized, so that the final image feature vocabulary consisted of more than 2700 feature terms. In the image mining domain, the modeling of image annotations has usually been done by concatenating image feature vectors with a feature vector of words [5]. If there are one-to-many relationships between the image feature vector and the term vectors, the same technique (concatenation) can still be used, but cases where terms in a description are further related to other terms, as in the current problem, cannot be handled without loss of information. This was the main motivation for keeping the images and the annotations/captions/descriptions in separate tables. If the task is just to come up with a set of words to be used by a human expert for annotating images, then clustering can also work quite well. However, the problem with employing clustering for this task is that individual clusters usually have a much larger data spread compared to association rules, which are comparatively straightforward. Clustering was thus not employed in this application and is being explored for a baseline comparison with our approach.
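The nine-feature pseudo-vector can be illustrated with a much simpler stand-in than the saliency model used in CoMMA. The sketch below (Pillow and NumPy assumed; the color tests, bin edges, and the example file name are invented) just counts dominant-color pixels, averages intensity, and discretizes the result into feature terms.

import numpy as np
from PIL import Image  # assumes Pillow is installed

def crude_image_features(path, bins=(0.0, 0.1, 0.25, 0.5, 1.0)):
    """Toy stand-in for CoMMA's low-level features.

    Returns discretized 'feature terms' such as 'red_2' or 'intensity_3'.
    The color tests and bin edges here are illustrative only.
    """
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=float) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    fractions = {
        "red": np.mean((r > 0.5) & (r > g + 0.1) & (r > b + 0.1)),
        "green": np.mean((g > 0.5) & (g > r + 0.1) & (g > b + 0.1)),
        "blue": np.mean((b > 0.5) & (b > r + 0.1) & (b > g + 0.1)),
        "yellow": np.mean((r > 0.5) & (g > 0.5) & (b < 0.3)),
        "intensity": rgb.mean(),
    }
    terms = []
    for name, value in fractions.items():
        bucket = int(np.digitize(value, bins))   # discretize into a feature term
        terms.append(f"{name}_{bucket}")
    return terms

# print(crude_image_features("flower.jpg"))  # hypothetical file; e.g. ['red_3', ...]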

4. MULTI-RELATIONAL FP-GROWTH

4.1 Association Rule Mining

Consider a database D that contains a set of transactions T. If X and Y are itemsets in T, then an association rule [16] is an implication of the form X => Y; it has a support s in the database D if s% of the transactions in D contain both X and Y. Similarly, the rule X => Y has a confidence c if c% of the transactions in D that contain X also contain Y. The task of association rule mining is thus to generate a set of rules that have support and confidence above certain user-specified minimum thresholds. Association rules over multiple tables can be described similarly: X and Y are atom sets of the form p(t1, t2, ..., tn), where ti is a variable or a function of the form f(t1, t2, ..., tn). The confidence is defined as follows: the rule has confidence c if c% of the transactions in D that are covered by X are also covered by X U Y. Similarly, the rule has support s if s% of all transactions in D are covered by X U Y.
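A small single-table illustration of these two measures (the transactions below are invented):

def support_confidence(transactions, X, Y):
    """Support and confidence of the rule X => Y over a list of item sets."""
    X, Y = set(X), set(Y)
    n = len(transactions)
    covers_x = sum(1 for t in transactions if X <= set(t))
    covers_xy = sum(1 for t in transactions if (X | Y) <= set(t))
    support = covers_xy / n
    confidence = covers_xy / covers_x if covers_x else 0.0
    return support, confidence

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "B"}, {"B", "C"}]
print(support_confidence(transactions, {"A"}, {"C"}))  # (0.5, 0.666...)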

One of the most widely used algorithms for association rule mining is the Apriori algorithm [15]. The algorithm exploits the anti-monotone property, which states that for a k-itemset to be frequent, all (k-1)-subsets of this itemset also have to be frequent. Though the algorithm reduces the computational cost of generating the itemsets, the cost is still high when the number of 1-frequent itemsets is sufficiently large, which in turn translates into a high cost for generating 2-frequent itemsets. The FP-Growth algorithm [29] was proposed to overcome this problem. The algorithm creates a compact tree structure called the FP-Tree that represents the frequent patterns, and it mines the FP-Tree to obtain the frequent patterns. It solves the multi-scan problem and improves itemset generation.

4.2 Multi-Relational FP-Trees

As described in the previous sections, we are posing auto-annotation as a multi-relational association rule mining problem. WARMR [14] is the most frequently used algorithm for multi-relational association rule mining and is based on the Apriori algorithm. There is also a distributed version of the Apriori algorithm called Decentralized Apriori [8].

Building an FP-Tree [29] requires scanning the database twice. The first scan generates all the 1-frequent itemsets. FP-Tree generation can be split into two phases. In the first phase, the items appearing in the dataset are enumerated, and all the items that have a support less than the threshold are weeded out. The remaining items are organized in a table called the header table and are sorted by frequency. Pointers to the first occurrence of each item in the dataset are also stored, in order to maintain references to all other occurrences of the item. The second phase starts with another I/O scan of the database: each transaction is read again, and only those items that occur in the header table are retained. These are then sorted in descending order based on their frequencies.

Table 1: A sample relation
TID | Item
1 | A, B, C, D, E
2 | B, P, D, F, G
3 | A, B, C, K, E
4 | A, B, C, H, K
5 | A, C, D, K, L

Figure 2: FP-Tree for Table 1

The FP-Tree is constructed as follows. For the first item in the sorted transaction, check if it exists as one of the children of the root; if it does, increment its support, and if not, add the current item as a child of the root node and set its support to 1. Repeat the same process, considering the current node as the root. Whenever an item is added to the FP-Tree, a link is maintained between the node and its occurrence in the header table. The process continues until the whole FP-Tree is generated. The complete FP-Tree for the sample relation in Table 1 is given in Figure 2.
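A compact sketch of this construction: a generic FP-Tree builder in Python over the transactions of Table 1 (the class, the support threshold, and the printed summary are illustrative).

from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> FPNode

def build_fp_tree(transactions, min_support=3):
    """Build an FP-Tree; returns (root, header) where header maps item -> nodes."""
    freq = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in freq.items() if c >= min_support}
    root, header = FPNode(None), {}
    for t in transactions:
        # Keep only frequent items, sorted by global frequency (ties by name).
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, parent=node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

# Transactions of Table 1.
table1 = [list("ABCDE"), list("BPDFG"), list("ABCKE"), list("ABCHK"), list("ACDKL")]
root, header = build_fp_tree(table1, min_support=3)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
# {'A': 4, 'B': 4, 'C': 4, 'D': 3, 'K': 3}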

The MRFP-Growth algorithm that we use builds upon the FP-Growth algorithm and can be divided into two phases. Consider a database D that consists of k tables, with one primary table and k-1 secondary tables. The problem can then be stated as that of finding relations from the primary table to each of the secondary tables and then between the secondary tables. Definitions of the terms used in the MRFP-Growth algorithm follow.

4.2.1 Definitions

1. MR FP-Tree: a frequent pattern tree which holds the support count and the Id (see Definition 4) for each frequent item. The Id helps in determining the link information between the different relations. Figures 3, 4, and 5 depict MR FP-Trees.
2. Primary table: a table that has a primary key. Example: Table 1, with TID as the primary key (Pk).
3. Secondary table: a table that has field(s) which refer to the primary key of the primary table. Example: Tables 2 and 3 have TID as foreign keys (Fk).
4. Id: the set of primary key values whose records contain an item. Example: the Id for item H in Table 1 is IdH = {2, 6}.
5. Id set: the set of Ids (see Definition 4) that contain a frequent pattern. Let A be the item being considered for generation of frequent pattern P. Then the Id set of PA = IdA1 ∩ IdA2 ∩ ... ∩ IdAk, where Ai is the ith follower of item A. See Figure 3 and Table 2 for an illustration.

2. 3.

For each secondary table do: i. Generate MR FP-Tree for the items in the table. Keep track of the Id for each frequent item. ii. Mine the MR FP-tree for the frequent patterns. Also, note the Id set of the frequent patterns. Make an MR FP-tree for the Id in the Id set of all the frequent patterns generated for all the secondary tables. Mine the MR FP-tree for frequent patterns of Id from which get the patterns associated with these Ids and effectively get the frequent patterns across the tables.

The first phase of the algorithm involves running the MR FPTree algorithm separately on all the tables but with one major difference. Each node in the tree not only keeps track of its support but also keeps track of the indices (Id) in the dataset where it occurs. This results in k-1 FP-trees for the secondary tables. If the Id for each item, i.e. the primary key value is a

Figure 2: FP-Tree for Table 1. The FP-Tree is constructed as follows: For the first itemset in the sorted dataset, check if it exists as one of the children of the root. If it exists then increment its support. If not, then add the current item as a child of the root node and set the support to 1. Repeat the same process by considering the

22

long string or a bigger value, then there are chances of running out of memory while making the MR FP-Trees. To avoid such a situation a bitmap index can be used for the Id. When the trees are mined for rules, relevant information such as the rules, Id sets (See Definition), and support counts are put in a table such as the one given by Table 4.

Please note that the number next to an item in the two trees is the support for that item and the numbers inside the curly braces represent the Id for that item. Rules generated from each of these trees go into a table Frequent_Patterns (Table 4) that is then used to make the final MR FP-Tree (Figure 5) across the different tables and to get rules (Table 5) from it.

In phase two, the table that was filled with the rules mined from the MR FP-Trees of the secondary tables in phase 1 is used to make an MR FP-tree on the Id sets. Each Id in an Id set is considered to be an item. Once the tree is made, it is mined for rules as specified earlier. However, this time the rules that are generated are not saved. Instead, the Id sets corresponding to the rules are used to retrieve the patterns that were placed in the table made in phase 1. Thus the rules across the different tables are generated.

Table 4: Table containing rules obtained from the MR FP-Trees in Figures 3 and 4. Each row stores a Record ID, the ID set of the rule, the patterns from the secondary tables, and the support. The Record IDs 1-7 have the ID sets {1,2,3}, {1,5}, {1,3}, {1,3,5}, {1,2}, {1,2,3}, and {1,2,3}.

Table 2: Secondary Table and Table 3: Secondary Table. Both are keyed on TID (the foreign key into Table 1) with an Items column; the example transactions include itemsets such as {a, b, e, p}, {a, p, e, t}, {a, b, e, d}, {p, e, t}, and {c, d, b}.

To illustrate how the MR FP-Growth algorithm works, consider Table 1 in conjunction with Tables 2 and 3. Itemsets in Table 1 are in a one-to-many relationship with itemsets in Tables 2 and 3. Running the FP-Tree algorithm on these tables yields the additional FP-trees given in Figures 3 and 4.

Figure 3: MR FP-Tree for Table 2.

Figure 4: FP-Tree for Table 3.

Figure 5: MR FP-Tree for Table 4 with the ids in the ID sets as items. Please note that in Figure 5 the number in bold written in the tree node is the frequent item and the number next to it (not bold) is its support count. The rules obtained from this tree (Figure 5) are listed in Table 5.

Table 5: Rules from Figure 5 with support count > 2

Rule    Support   Id set
3       3         {1,5,6,7}
1       4         {1,5,6,7}
1,3     3         {1,5,6,7}
1       6         {1,2,3,4,6,7}

From these rules (given in Table 5) the distinct Id sets (the sets of Record IDs given in Table 4) are taken to get the patterns that make up the final rules, as given in Table 6. A possible performance glitch is the time taken to make combinations for each of the frequent patterns in the last phase. In this phase, the frequent patterns tend to get long, depending upon the number of relations. As generating rules depends upon the Id set, making combinations of the items in a frequent pattern can be skipped. Therefore, using the Id set of the frequent pattern, rules can be generated faster. So one can have 1, 2, 3 as a frequent pattern with {1,5,6,7} as the Id set and still get the rules in Table 6, instead of making combinations of this pattern as given in the rules 1, 2, 3 in Table 5. Please note that the Id sets in this particular example are the same for the rules 1, 2, and 3 in Table 5; however, they can be different.

Table 7: Image table (Primary table)

Image-id   Image Path
1          C:\34.jpg
2          C:\people.jpg
3          C:\sky.jpg

Table 6: Final rules generated from the MR FP-Tree in Figure 5, together with their support counts. The rules combine patterns from the secondary tables (for example patterns over the items a, p and e) and have supports of 3 and 6 in this example.

Table 8: Annotation_English table (Foreign table)

Image-id   Annotation
1          flowers leaves
1          Sky flowers
2          people sky flowers
3          Sky clouds flowers
3          people flowers clouds
...

5. RESULTS AND DISCUSSION

5.1 DATA ORGANIZATION

Although research in auto-annotation of images has been going on for several decades, standardized datasets have not come into existence. The dataset that was employed in this project consisted of images from different sources: Corel Professional Photo CDs; University of California Berkeley floral images [23]; the University of Washington ground truth dataset [24]; snap-shot.com [25]; Freefoto.com [27]; and the United States Fish and Wildlife Service National Image Library [26]. Multiple sources were used instead of concentrating on a single source to ensure that the results are not biased by the dataset. These summed up to 2036 images. Several human experts were asked to annotate the images, such that there was some overlap between the annotations given by these experts.

Figure 6 gives the empirical distribution of keywords for the top 65 keywords with the highest frequency. For this particular dataset, 20 keywords accounted for about a third of the total probability mass.
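Such a distribution can be computed directly from the annotation table; a small illustrative Python sketch (ours, not part of the original system):

from collections import Counter

def keyword_distribution(annotations):
    # annotations: list of strings, e.g. ["flowers leaves", "sky flowers", ...]
    counts = Counter(kw.lower() for ann in annotations for kw in ann.split())
    total = sum(counts.values())
    # Relative frequencies, sorted by descending frequency (cf. Figure 6).
    return [(kw, c / total) for kw, c in counts.most_common()]

dist = keyword_distribution(["flowers leaves", "sky flowers", "people sky flowers"])
top20_mass = sum(freq for _, freq in dist[:20])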

Table 9: Feature table (Foreign table)

Image-id   Features
1          258R 92G 44B 57Y
2          768R 92G 33B 18Y
3          457R 92G 77B 57Y

The features have been normalized to account for different image sizes. Each image feature has been appended with a distinguishing tag, which helps in generating rules. Table 10 gives a list of the tags that were used and their meaning.

Figure 6: Empirical keyword distribution in the sample data set (relative keyword frequency plotted against the keywords in order of descending frequency).

The relations (tables) that were used are given in Tables 7, 8 and 9. The Image table (Table 7) is the primary table (see the definitions in Section 4.2.1) with image-id (primary key) and the absolute path to the location of each image on disk. Table 8, Annotation_English, is a foreign table (see Section 4.2.1) and has the attributes image-id (foreign key) and annotation. The table was named so to allow the creation of tables for annotations in other languages, which would make CoMMA multilingual. Finally, the Feature table (Table 9) is another foreign table that holds image features and image-ids (foreign key).

Table 10: Tag Listing

Tag    Stands for
R      Red
G      Green
B      Blue
Y      Yellow
D0     Edge Degree 0
D45    Edge Degree 45
D90    Edge Degree 90
D135   Edge Degree 135
I      Intensity
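As an illustration of the tagging scheme of Table 10, a small sketch with assumed feature names (the actual feature extraction of CoMMA is not shown here):

def tag_features(features):
    # Turn a dict of normalized feature values into tagged items such as
    # '258R' or '57Y' (cf. Tables 9 and 10). The feature names used here
    # (red, green, ..., intensity) are illustrative.
    tags = {"red": "R", "green": "G", "blue": "B", "yellow": "Y",
            "edge_0": "D0", "edge_45": "D45", "edge_90": "D90",
            "edge_135": "D135", "intensity": "I"}
    return ["%d%s" % (round(value), tags[name]) for name, value in features.items()]

items = tag_features({"red": 258, "green": 92, "blue": 44, "yellow": 57})
# items == ['258R', '92G', '44B', '57Y']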

Some of the rules generated by running MR FP-Growth associate tagged feature values (such as 48D45, 4Y, 4B, 29D0, 36R, 108Y, 22D135, 100B, 108D45, 218B, 55D90, 339G and 370R) with annotation keywords (such as WEEDS, BOAT, BUSH, DARK, GRASS, EARTH, GROUND, POPPY, SKY, SEA, NIGHT, DIRT, SHRUBS, LEAVES, FLOWERS, VINES and MOUNDS). Using such rules, new images are auto-annotated and the results are discussed in the next section.

Table 11: Sample results describing images and their corresponding annotations

Original Keywords: BUSH SMALL FLOWERS GROUND SMALL
CoMMA Generated Annotations: RADIANT BUSH SMALL GRASS LEAVES FLOWERS BERRIES GROUND WEEDS EARTH DRY DESERT

Original Keywords: BUSH FLOWERS GROUND
CoMMA Generated Annotations: BUSH SMALL OVER FOREST RIVER GROUND TREE LEAVES BUSHES GRASS SKY FLOWERS TREES ROCKS

Original Keywords: LAKE SUMMIT IN ALASKA
CoMMA Generated Annotations: STILL LAKE SUMMIT ALASKA NEAR ROCK ROCKS DESERT MELTING WALLS ICE

5.2 Performance Evaluation

Table 11 gives some of the results obtained on auto-annotating test images. To evaluate our results we used the metric given below. No universally adopted benchmark evaluation metric exists for image annotation; hence, different people assess the quality of annotations differently. However, the vocabulary statistics must be taken into account while evaluating performance, since a poor system can simply 'guess' the correct answer in at least some instances. To distinguish between systems that simply reproduce the empirical word distribution of the keywords in the training set and systems that employ a more systematic approach, we modified the normalized score measure provided by Barnard et al. [1].

It is defined as:

    ENS = (1/k) * sum_{i=1..k} [ r_i / n_i  -  w_i / (N - n_i) ]

where n_i is the actual number of keywords of the i-th test image, r_i is the number of correctly predicted keywords, w_i is the number of incorrectly predicted words, N is the vocabulary size and k is the total number of test images. For a system that predicts all keywords correctly the value of ENS is positive one (+1.0), for a system that predicts all keywords incorrectly the value of ENS is negative one (-1.0), and for a system that simply predicts all the keywords in the dataset for every image the value is zero. The performance of our system on a pool of randomly selected test images is given in Figure 7. The results are negative only in a few cases, while the score is higher than 0.5 for more than half of the data set.

In many specialized domains certain groups of keywords are more frequent than others. A sufficiently high frequency of such a group can significantly skew the results, making a quantitative evaluation rather problematic. In order to make sure that this is not the case in the dataset that we are using, we gave the top 10 words as annotations to the test dataset. The performance was evaluated using the same metric; the results of these rounds of tests are given in Figure 8.
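A direct transcription of the ENS measure defined above into Python (our own sketch, not the authors' evaluation code):

def ens(predictions, ground_truth, vocabulary_size):
    # Modified normalized score, averaged over the k test images.
    # predictions / ground_truth: lists of keyword sets, one per test image.
    k, total = len(ground_truth), 0.0
    for pred, truth in zip(predictions, ground_truth):
        n = len(truth)                      # actual keywords of the image
        r = len(pred & truth)               # correctly predicted keywords
        w = len(pred - truth)               # incorrectly predicted keywords
        total += r / n - w / (vocabulary_size - n)
    return total / k

score = ens([{"sky", "flowers"}], [{"sky", "clouds", "flowers"}], vocabulary_size=100)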

Figure 7: Evaluation score for the test dataset from highest to lowest for the maximum coverage of the dataset.

7. ACKNOWLEDGEMENTS


We would like to thank Hau San Si Tou and Bhavika Patel for their help in this project. We would also like to thank all those people who helped us in getting the data set for this project. We would like to thank Vani Mandava for her editing help.


8. REFERENCES


[1] Barnard, Kobus et. al., Matching Words and Pictures. Journal of Machine Learning Research, 3(2003), 1107-1135


[2] Beng Chin Ooi, Viper Image Database System, National University of Singapore.


[3] Carson, C. et al., “Blobworld: a system for regionbased image indexing and retrieval”, Proc. Of the 3rd Int’l Conf. On Vis. Info. Sys., Amsterdam, Netherlands, June 1999, pp. 509-516


Figure 8: Evaluation score for the test dataset from highest to lowest for the maximum coverage of the dataset, when every image is annotated with the top 10 most frequent keywords.

[4] Cheng, Pu-Jen and Chien, Lee-Feng, Auto-Generation of Topic Hierarchies for Web Images from Users' Perspectives, CIKM'03, November 3-8, 2003, 544-547

Figure 8 clearly shows that annotating all the images with the top 10 most frequently occurring words does not affect the result in any appreciable way, and such a system ranks rather low performance-wise. Random annotations seem to perform well for a few images; the performance for most other images is fairly poor, as is evident when comparing Figure 8 to Figure 7.

[5] Jing Dai, Mong Li Lee. Wynne Hsu, Mining Viewpoint Patterns in Image Databases, in 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 2003.

6. CONCLUSION AND FUTURE WORK

In this paper, we provide a framework for a multi-relational approach to integrated mining of multimedia databases. The problem domain we describe a solution for is auto-annotation of images. Auto-annotation via multi-relational mining proved quite effective, as witnessed by the test results: in a constrained setting the system was able to annotate the test images successfully in most cases. A dataset that contains multiple descriptions of the same image would be an ideal candidate for further exploration of multi-relational mining for image annotation. We are currently extending the framework using the techniques described above, not only for image annotation but also for retrieval of images from an image database based on keyword queries. Future research will concentrate on query expansion given these varied parameters and on finding the optimal value for annotation within a limited image category. The annotations returned for the images can be used for query expansion and for disambiguation of terms related to an image.

[6] Gaborski, R., Vaingankar, V.S., and Canosa, R.L. Goal Directed Visual Search Based on Color Cues: Cooperative Effects of Top-down & Bottom-up Visual Attention. In Proceedings of the Artificial Neural Networks in Engineering, Rolla, Missouri, Vol 13, pp: 613-618, 2003

[7] Itti, L., and Koch, C. Computational modeling of visual attention. Nature Neuroscience Review, 2(3):194-203, 2001

[8] Jensen, Viviane C. and Soparkar, Nandit, Frequent Itemset Counting Across Multiple Tables, Proceedings of PAKDD, pages 49-61, 2000

[9] Jeon, J. and Manmatha, R., Automatic Image Annotation and Retrieval using Cross-Media Relevance Models, SIGIR'03, July 28-August 1, 2003

Related to the disambiguation problem is the problem of coming up with a description of the given scene. Hence, given a set of images, another dataset that contains multiple annotations for an image from varied sources and a dataset that consists of facts related to individual terms, the challenge will be to generate a description for a particular image. The performance of this method on a larger dataset and a larger vocabulary is an open question that also needs to be addressed in the future. We believe that future advances in computer vision and data management domains would benefit from adopting some of the suggestions provided to solve problems such as query expansion for image data, content based image retrieval, auto categorization of images, etc that are perched at their intersection.

[10] Jia Li and James Z. Wang, Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 25, No. 9, September 2003.

[11] M. Flickner et al., Query by Image and Video Content: The QBIC System, IEEE Computer, 28(9): 23-32, Sep. 1995



[12] Monay, Florent and Gatica-Perez, Daniel, On Image Auto-Annotation with Latent Space Models, MM’03, November 2-8, 2003, Pages: 275-278

[20] Wenyin, Liu et al., Semi-Automatic Image Annotation, Microsoft Research Technical Report

[13] Parkhurst, D., Law, K., and Niebur, E. (2002). Modeling the role of salience in the allocation of overt visual attention. Vision Research, 42(1), 107-123.

[21] Osmar R. Zaiane, Jiawei Han, Ze-Nian Li, Sonny H. Chee, and Jenny Chiang. MultiMediaMiner: a system prototype for multimedia data mining. Proc. ACM SIGMOD Intl. Conf. on Management of Data, 581-583, 1998.

[14] Peter Brockhausen, Marc de Haas, Jörg-Uwe Kietz, Arno Knobbe, Olaf Rem, Regina Zucker and Nico Brandt (2001). Mining Multi-Relational Data. Technical report, IST Project MiningMart, IST-11993

[22] Zhang, Ruofei and Zhang, Zhongfei (Mark), Addressing CBIR Efficiency, Effectiveness and Retrieval Subjectivity Simultaneously, MIR’03, Nov. 7, 2003, Pages: 71-78

[15] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo, Fast Discovery of Association Rules, in U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 12, pages 307-328. AAAI/MIT Press, 1996.

[23] http://elib.cs.berkeley.edu/photos/tarlist.txt

[24] http://www.cs.washington.edu/research/imagedatabase/groundtruth/_tars.for.download

[25] http://www.snap-shot.com/pages/land/

[16] R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules between Sets of Items in Large Databases, in Proc. of the ACM SIGMOD Conference on Management of Data, pages 207-216, Washington, D.C., May 1993.

[26] http://images.fws.gov

[27] http://www.freefoto.com

[17] Sašo Džeroski, Multi-Relational Data Mining: An Introduction. ACM SIGKDD Explorations Newsletter, 5(1):1-16, 2003.

[28] Wynne Hsu, Mong Li Lee, Ji Zhang. Image Mining: Trends and Developments, in Journal of Intelligent Information System (JISS): Special Issue on Multimedia Data Mining, Kluwer Academic, 2002.

[18] Suetens, Paul, Fua, Pascal, and Hanson, Andrew J., Computational Strategies for Object Recognition, ACM Computing Surveys, 24(1):5-62, March 1992

[29] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACMSIGMOD, Dallas, 2000.

[19] Su, Zhong, Zhang, Hongjiang, and Li, Stan, Extraction of Feature Subspaces for Content-Based Retrieval Using Relevance Feedback, MM'01, Sep. 30 - Oct. 5, 2001, 98-106


Efficient Indexing of Complex Objects for Density-based Clustering

Karin Kailing, Hans-Peter Kriegel, Martin Pfeifle, Stefan Schönauer
Institute for Computer Science, University of Munich
Oettingenstr. 67, 80538 Munich, Germany
{kailing,kriegel,pfeifle,schoenauer}@dbs.informatik.uni-muenchen.de

ABSTRACT

Databases are getting more and more important for storing complex objects from scientific, engineering or multimedia applications. Examples of such data are chemical compounds, CAD drawings or XML data. The efficient search for similar objects in such databases is a key feature. However, the general problem of many similarity measures for complex objects is their computational complexity, which makes them unusable for large databases. An area where this complexity problem is a strong handicap is that of density-based clustering, where many similarity range queries have to be performed. In this paper, we combine and extend the two techniques of metric index structures and multi-step query processing to improve the performance of range query processing. The efficiency of our methods is demonstrated in extensive experiments on real world data including graphs, trees and vector sets.

Categories and Subject Descriptors H.2.8 [Database Management]: Database Applications - Data Mining

General Terms Performance

Keywords Density-based Clustering, Metric Indexing, Multistep Query Processing

1. INTRODUCTION

Databases are getting more and more important for storing complex objects from scientific, engineering or multimedia applications. Examples of such data are chemical compounds, CAD drawings, XML data, web sites or color images. The efficient search for similar objects in such databases, for example to classify new objects or to cluster database objects, is a key feature in those application domains. Often a feature transformation is not possible, therefore a simple distance function like the Euclidean distance cannot be used. In this case, the use of more complex distance functions, like the edit distance for graphs or trees, is necessary. However, a general problem of all such measures is their computational complexity, which disqualifies their use for large databases. An area where this complexity problem is a strong handicap is that of clustering, one of the primary data mining tasks. Density-based clustering has proved to be successful for clustering complex objects [10, 8]. Density-based clustering algorithms like DBSCAN [7] or OPTICS [2] are based on range queries for each database object. Each range query requires a lot of exact distance calculations. Thus, these algorithms are only applicable to large collections of complex objects if those range queries are supported efficiently. When working with complex objects, the necessary distance calculations are the time-limiting factor. For complex objects, distance calculations are significantly more expensive than disk accesses. So the ultimate goal is to save as many distance calculations as possible. One approach to improve the performance of range queries is to use a filter-refinement architecture. The core idea is to apply a filter criterion to the database objects in order to obtain a small set of candidate answers to a query. The final result is then retrieved from this candidate set through the use of the complex similarity measure. This reduces the number of expensive object distance calculations and speeds up the search process. Another possibility is the use of a metric index structure. In [5] several efficient access methods for similarity search in metric spaces are presented. In most real world applications a static index structure is not acceptable, so dynamic index structures like the M-tree [6] are applied. So far both above mentioned concepts, multi-step query processing and metric index structures, have only been used separately. We claim that those concepts can beneficially be combined and that through the combination a significant speed-up compared to both separate approaches can be achieved. In this paper, we discuss how the two approaches can be combined and present some other techniques to improve the efficiency of range query processing. Filters can easily be used to speed up the creation and the traversal of a metric index structure like the M-tree. Additionally, caching can be used to prevent the same distance calculations from being performed more than once. As DBSCAN [7], for example, is only interested in getting all objects in the ε-neighborhood of a given query object, but does not need to know the actual distances, we introduce the concept of "positive pruning" to save further distance calculations.

Figure 1: Examples of Complex Objects (complex objects, complex models, complex distance measures).

The remainder of the paper is organized as follows. In section 2, we present some recent work in the field of indexing and clustering complex objects. Section 3 presents our techniques used to save costly distance calculations while performing range queries. The performance gain of our new techniques is presented in section 4, while section 5 concludes the paper and gives some hints at future work.

2. MOTIVATION AND RELATED WORK

In the next subsection, we present three promising and approved modelling approaches and distance measures for complex objects (see figure 1 for an illustration). The evaluation part will show that in all those cases we achieve a performance gain using our new techniques. Afterwards we present some recent approaches for clustering and query processing on complex objects. As this is an extremely broad field we do not make any claim on completeness, neither for the data types nor for the techniques presented. The main purpose of this section is to motivate the necessity of new techniques which allow efficient similarity range queries on complex objects.

2.1 Data Types of Complex Objects

2.1.1 Sets of Feature Vectors

For CAD applications, suitable similarity models can help to reduce the cost of developing and producing new parts by maximizing the reuse of existing parts. In [10] an effective and flexible similarity model for complex 3D CAD data is introduced, which helps to find and group similar parts. It is not based on the traditional approach of describing one object by a single feature vector; instead, an object is mapped onto a set of feature vectors, i.e. an object is described by a vector set. An algorithm and a method for accelerating the processing of similarity queries on vector set data is presented. In the evaluation part we will show that we can significantly improve this approach for range queries.

2.1.2 Tree-Structured Data

In addition to a variety of content-based attributes, complex objects typically carry some kind of internal structure which often forms a hierarchy. Examples of such tree-structured data include chemical compounds, CAD drawings, XML documents or web sites. For similarity search it is important to take into account both the structure and the content features of such objects. A successful approach is to use the edit distance for tree-structured data. However, as the computation of this measure is NP-complete, constrained edit distances like the degree-2 edit distance [19] have been introduced. They were successfully applied to trees for web site analysis [18], structural similarity of XML documents [14], shape recognition [15] or chemical substructure search [18]. While yielding good results, they are still computationally complex and, therefore, of limited benefit for searching or clustering in large databases. In [9] a filter and refinement architecture for the degree-2 edit distance is presented to overcome this problem. A set of new filter methods for structural and for content-based information, as well as ways to flexibly combine different filter criteria, are presented. With experiments on real world data the authors show that this approach is superior to metric index structures. But the experiments on k-nearest-neighbor queries also show that even the most complex filter, which combines structural and content features, has to compute the object distances for 10 percent of the database in order to find the nearest neighbor. Again, we will show in the evaluation part that our new techniques outperform the presented approach for range queries.

2.1.3 Graphs

Attributed graphs are another natural way to model structured data. Most known similarity measures for attributed graphs are either limited to a special type of graph or are computationally extremely complex, i.e. NP-complete. Therefore they are unsuitable for searching or clustering large collections. In [11], the authors present a new similarity measure for attributed graphs, called the edge matching distance. They demonstrate how the edge matching distance can be used for efficient similarity search in attributed graphs. Furthermore, they propose a filter-refinement architecture and an accompanying set of filter methods to reduce the number of necessary distance calculations during similarity search. Their experiments show that the matching distance is a meaningful similarity measure for attributed graphs and that it enables efficient clustering of structured data.

2.2 Clustering Complex Objects

In recent years, the research community has spent a lot of attention on clustering, resulting in a large variety of different clustering algorithms. However, most of those algorithms were designed for vector data, so there is still a need for research on clustering complex objects.

Figure 2: Browsing through cluster hierarchies.

Figure 3: A multistep query processing architecture (a filter step producing candidates, followed by a refinement step producing the result).

2.2.1 Density-Based Clustering of Complex Objects

In this paper, we focus on the acceleration of density-based clustering algorithms like DBSCAN [7] and OPTICS [2], which are based on ε-range queries. Density-based clustering algorithms provide the following advantages:

1. They can be used for all kinds of metric data spaces and are not confined to vector spaces.
2. They are robust concerning outliers.
3. They have proved to be very efficient and effective in clustering all sorts of data.
4. OPTICS is – in contrast to most other algorithms – relatively insensitive to its two input parameters, ε and MinPts. The authors in [2] state that the input parameters just have to be large enough to produce good results.

2.2.2 Clustering Multi-Represented Objects

Traditional clustering algorithms are based on one representation space, usually a vector space. However, for complex objects often multiple representations exist for each object. Proteins for example are characterized by an amino acid sequence, a secondary structure and a 3D representation. In [8] an efficient density-based approach to cluster such multi-represented data, taking all available representations into account is presented. The authors propose two different techniques to combine the information of all available representations dependent on the application. The evaluation part shows that this approach is superior to existing techniques. The experiments were done for protein data that is represented by amino-acid sequences and text descriptions as well as for image data, where two different representations based on color histograms and segmentation trees were used.

2.2.3 Visually Mining through Cluster Hierarchies

In [4] the authors show how visualizing the hierarchical clustering structure of a database of objects can aid the user in his time consuming task to find similar objects (cf. figure 2). Based on reachability plots produced by the density-based clustering algorithm OPTICS [2], approaches which automatically extract the significant clusters in a hierarchical cluster representation along with suitable cluster representatives are proposed. These techniques can be used as a basis for visual data mining. The effectiveness and efficiency of this approach was shown for CAD objects from a German car manufacturer, and a sample of the Protein databank [3] containing approximately 5000 protein structures.

2.3 Query Processing on Complex Objects

2.3.1 Multi-step Query Processing

The main goal of a filter-refinement architecture, as depicted in figure 3, is to reduce the number of complex and, therefore, time-consuming object distance calculations in the query process. To achieve this goal, query processing is performed in two or more steps. The first step is a filter step which returns a number of candidate objects from the database. For those candidate objects the exact object distance is then determined in the refinement step, and the objects fulfilling the query predicate are reported. To reduce the overall search time, the filter step has to fulfill certain constraints. First of all, it is essential that the filter predicate is considerably easier to evaluate than the exact similarity measure. Second, a substantial part of the database objects must be filtered out. Only if both conditions are satisfied is the performance gain through filtering greater than the cost of the extra processing step. Additionally, the completeness of the filter step is essential. Completeness in this context means that all database objects satisfying the query condition are included in the candidate set. Available similarity search algorithms guarantee completeness if the distance function in the filter step fulfills the lower-bounding property. For any two objects Op and Oq, a lower-bounding distance function df in the filter step has to return a value that is not greater than the exact object distance do of Op and Oq, i.e. df(Op, Oq) ≤ do(Op, Oq). With a lower-bounding distance function it is possible to safely filter out all database objects which have a filter distance greater than the current query range, because the exact object distance of those objects cannot be less than the query range. Using a multi-step query architecture requires efficient algorithms which actually make use of the filter step. Agrawal, Faloutsos and Swami proposed such an algorithm for range search [1].
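A minimal Python sketch of such a filter-refinement range query, assuming a lower-bounding filter distance d_filter and an exact distance d_exact (our illustration, not the algorithm of [1]):

def range_query_multistep(database, q, eps, d_filter, d_exact):
    # Filter step: discard objects whose lower-bounding filter distance
    # already exceeds eps; refinement step: verify the remaining candidates
    # with the expensive exact distance.
    candidates = [o for o in database if d_filter(o, q) <= eps]
    return [o for o in candidates if d_exact(o, q) <= eps]

Completeness follows from the lower-bounding property: any object with d_exact(o, q) ≤ ε also has d_filter(o, q) ≤ ε and therefore survives the filter step.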

2.3.2 Metric Index Structures

In some applications, objects cannot be mapped into feature vectors. However, there still exists some notion of similarity between objects, which can be expressed as a metric distance between the objects, i.e. the objects are embedded in a metric space. Several index structures for pure metric spaces have been proposed in the literature (see [5] for an overview). A well-known dynamic index structure for metric spaces is the M-tree [6]. The M-tree, which is explained in detail in section 3.1, aims at providing good I/Operformance as well as reducing the number of distance computations.

3. EFFICIENT RANGE-QUERIES ON COMPLEX OBJECTS

So far, the concepts of multi-step query processing and metric index structures have only been used separately. We claim that these concepts can beneficially be combined and that, through the combination, a significant speed-up compared to both separate approaches can be achieved. In the following, we will demonstrate the ideas for range queries with the M-tree as index structure and arbitrary filters fulfilling the lower-bounding criterion. It has to be noted that the techniques can also be applied to similar metric index structures like the Slim-tree [17]. This section is organized as follows. After introducing the necessary concepts for similarity range queries using the M-tree, we present the concept of "positive pruning" in section 3.2. In section 3.3, we combine the two worlds of direct metric index structures and multi-step query processing based on filtering. Furthermore, we show in this section that filters can not only be used for improving the query response time of an M-tree, but also for efficiently creating an instance of an M-tree. In section 3.4, we show how caching can be applied to accelerate the processing of similarity range queries.

3.1 Similarity Range Queries using the M-tree

The M-tree (metric tree) [6] is a balanced, paged and dynamic index structure that partitions data objects not by means of their absolute positions in the multi-dimensional feature space, but on the basis of their relative distances in this feature space. The only prerequisite is that the distance function between the indexed objects is metric. Thus, the M-tree's domain of applicability is quite general, and all sorts of complex data objects can be organized with this index structure. The maximum size of all nodes of the M-tree is fixed. All database objects Od, or references to them, are stored in the leaf nodes of an M-tree, along with their feature values and the distance d(Od, P(Od)) to their parent object P(Od). Inner nodes contain so-called routing objects, which correspond to database objects to which a routing role was assigned by a promoting algorithm that is executed whenever a node has to be split. In addition to the object description and the distance to the parent object, routing objects Or also store their covering radius r(Or) and a pointer ptr(T(Or)) to the root node of their sub-tree, the so-called covering tree of Or. For all objects Od in this covering tree, the condition holds that the distance d(Or, Od) is smaller than or equal to the covering radius r(Or). This property induces a hierarchical structure of an M-tree, with the covering radius of a parent object always being greater than or equal to all covering radii of its children, and the root object of an M-tree storing the maximum of all covering radii. Range queries are specified by a query object Oq and a range value ε, by which the answer set is defined to contain all the objects Od from the database that have a distance to the query object Oq of less than or equal to ε:

DEFINITION 1 (Similarity Range Query). Let O be a domain of objects and DB ⊆ O be a database. For a query object Oq ∈ O and a query range ε ∈ IR0+, the similarity range query simRange : O × IR0+ → 2^DB returns the set simRange(Oq, ε) = {Od ∈ DB | dist(Od, Oq) ≤ ε}.

|d(Op, Oq) − d(Op, Or)| > r(Or) + ε

1  simRange(queryObject Oq, range ε) → ResultSet
2    result = NIL;
3    rangeSearch(root, Oq, ε);
4    return result;

1  rangeSearch(Node N, queryObject Oq, range ε)
2    Op := parent object of node N;
3    IF N is not a leaf THEN
4      FOR EACH Or in N DO
5        IF |d(Op, Oq) − d(Or, Op)| ≤ r(Or) + ε THEN
6          compute d(Or, Oq);
7          IF d(Or, Oq) ≤ r(Or) + ε THEN ...
           ...

Figure 4: Pseudocode of simRange and the recursive procedure rangeSearch for the M-tree.

A subtree T(Or) is thus only traversed further if d(Or, Oq) ≤ r(Or) + ε (see line 7 of the algorithm in figure 4).

3.2 Positive Pruning

In this section we introduce the concept of "positive pruning". If a directory node is completely covered by the query range, we can report all objects on the leaf level of the M-tree without performing any cost-intensive distance computations (cf. figure 5).

Figure 5: Positive Pruning for the M-tree.


LEMMA 1. Let Oq ∈ O be a query object and ε ∈ IR0+ a query range. Furthermore, let Or be a routing object in an M-tree with covering radius r(Or) and subtree T(Or). Then the following statement holds:

    d(Or, Oq) + r(Or) ≤ ε  ⇒  ∀O ∈ T(Or): d(O, Oq) ≤ ε

PROOF. The following inequalities hold for all O ∈ T(Or) due to the triangle inequality and due to d(Or, Oq) + r(Or) ≤ ε:

    d(O, Oq) ≤ d(O, Or) + d(Or, Oq) ≤ r(Or) + d(Or, Oq) ≤ ε

Figure 7: Similarity range query based on the filtering M-tree.

In the case of negative pruning, we skip the recursive tree traversal of a subtree T(Or) if the query range does not intersect the covering radius r(Or). In the case of positive pruning, we skip all the distance calculations involved in the recursive tree traversal if the query range completely covers the covering radius r(Or). In this case we can report all objects stored in the corresponding leaf nodes of this subtree without performing any further distance computations. Figure 6 shows how this concept can be integrated into the original method rangeSearch depicted in figure 4.

1   rangeSearch(Node N, queryObject Oq, range ε)
      ...
7     compute d(Or, Oq);
7a    IF d(Or, Oq) + r(Or) ≤ ε THEN
7b      report all objects in T(Or);
8     ELSE IF d(Or, Oq) ≤ r(Or) + ε THEN ...
      ...

Figure 6: Adaptation of similarity range search on M-trees for positive pruning.

This approach is very beneficial for accelerating density-based clustering on complex objects. DBSCAN, for instance, only needs the information whether an object is contained in simRange(Oq, ε) = {O ∈ DB | d(O, Oq) ≤ ε}, but does not need the actual distance of this object to the query object Oq.
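The following Python sketch (our own simplified illustration, not the original M-tree code of [6]) shows how negative and positive pruning fit into a recursive range search; the node layout and parameter names are assumptions:

class MNode:
    # Simplified M-tree node: routing entries are (object, covering_radius,
    # child, dist_to_parent); leaf entries are (object, dist_to_parent).
    def __init__(self, entries, is_leaf):
        self.entries, self.is_leaf = entries, is_leaf

def all_objects(node):
    # Collect every database object stored in the leaves below this node.
    if node.is_leaf:
        return [o for o, _ in node.entries]
    return [o for _, _, child, _ in node.entries for o in all_objects(child)]

def range_search(node, q, eps, dist, d_parent_q, result):
    for entry in node.entries:
        if node.is_leaf:
            o, d_op = entry
            # Cheap pre-test via the stored distance to the parent object.
            if abs(d_parent_q - d_op) <= eps and dist(o, q) <= eps:
                result.append(o)
        else:
            o_r, r_or, child, d_op = entry
            if abs(d_parent_q - d_op) > r_or + eps:
                continue                            # negative pruning, no distance computed
            d_rq = dist(o_r, q)
            if d_rq + r_or <= eps:
                result.extend(all_objects(child))   # positive pruning: whole subtree qualifies
            elif d_rq <= r_or + eps:
                range_search(child, q, eps, dist, d_rq, result)
    return result

A typical call would be range_search(root, q, eps, dist, d_parent_q=0.0, result=[]), with the stored distance to the parent set to 0 for the root entries.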

3.3 Combination of Filtering and Indexing

The M-tree reduces the number of distance calculations by partitioning the data space even if no filters are available. Unfortunately, the M-tree may suffer from the navigational cost related to the distance computations during the recursive tree traversal. On the other hand, the filtering approach heavily depends on the quality of the filters. When combining both approaches, these two drawbacks are reduced. We use the filter distances to optimize the required number of exact object distance calculations needed to traverse the M-tree. Thereby, we do not save any I/O cost compared to the original M-tree, as the same nodes are traversed, but we save a lot of costly distance calculations necessary for the traversal. The filtering M-tree stores the objects along with their corresponding filter values within the M-tree. A similarity query based on the filtering M-tree always computes the filter distance values prior to the exact distance computations. If a filter distance value is already a sufficient criterion to prune branches of the M-tree, we can avoid the exact distance computation. If we have several filters, the filter distance computation always returns the maximum value of all filters. The pruning quality of the filtering M-tree benefits from both the quality of the filters and the clustering properties of the index structure.

3.3.1 Range Queries

Similarity range queries are used to retrieve all objects from a database which are within a certain similarity range from the query object (cf. definition 1). By computing the filter distance prior to the exact distance we can save many distance computations. Based on the following lemma, we can prune many subtrees without computing the exact distances between a query object Oq and a routing object Or (cf. figure 7). Let Or be a routing object with covering radius r(Or) and subtree T(Or), and let df be a lower-bounding filter distance for the exact object distance do. Then:

    df(Oq, Or) > r(Or) + ε  ⇒  ∀O ∈ T(Or): do(Oq, O) > ε

PROOF. As ∀O1, O2 ∈ O: df(O1, O2) ≤ do(O1, O2) holds, the following statement is true:

    df(Oq, Or) > r(Or) + ε  ⇒  do(Oq, Or) > r(Or) + ε

Based on the triangle inequality and our assumption that do(O, Or) ≤ r(Or), we can prove the above lemma as follows:

    df(Oq, Or) > r(Or) + ε
    ⇒ do(Oq, Or) > r(Or) + ε
    ⇒ do(Oq, Or) − r(Or) > ε
    ⇒ do(Oq, Or) − do(O, Or) > ε
    ⇒ do(Oq, O) > ε

Let us note that a similar optimization can be applied to the objects stored on the leaf level with the assumption that their ’covering radius’ is 0. Figure 8 shows how this concept can be integrated into the original method rangeSearch of figure 4.
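In code, the corresponding change to the routing-entry test of the earlier sketch is simply to consult the cheap filter distance first (again our own illustration):

def should_descend(o_r, r_or, q, eps, d_filter, d_exact):
    # Filter distance first: if it already exceeds r(Or) + eps, the subtree
    # cannot contain any result object (lower-bounding property), so prune.
    if d_filter(o_r, q) > r_or + eps:
        return False
    # Only now pay for the exact distance computation.
    return d_exact(o_r, q) <= r_or + eps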

3.3.2 Construction of an M-tree

Filters can also be used for accelerating the creation of an M-tree. Insert: Figure 9 depicts the function findSubTree which decides which tree to follow during the recursive tree-traversal of the insert operation. The main idea is that we sort all objects according to the filter distance and then walk through this sorted list. Thereby, we first test those candidates which might not lead to an increase in the covering radius. If we detect a routing object for which no increase of the covering radius is necessary, the corresponding subtree is chosen.

1   rangeSearch(Node N, queryObject Oq, range ε)
      ...
5     IF |d(Op, Oq) − d(Or, Op)| ≤ r(Or) + ε THEN
6       compute df(Or, Oq);
6a      IF df(Or, Oq) ≤ r(Or) + ε THEN ...
        ...

Figure 8: Adaptation of similarity range search on M-trees for filtering.

nodeSplit(Node N)
    ...
    mMRadi = 0;
    Sort all objects Ok of node N descending according to min(df(Ok, Oai), df(Ok, Obi));
    FOR EACH object Ok of this sorted list DO
      IF df(Ok, Oai) < df(Ok, Obi) THEN
        compute do(Ok, Oai);
        IF do(Ok, Oai) < df(Ok, Obi) THEN
          mMRadi = max(mMRadi, do(Ok, Oai));
        ELSE
          compute do(Ok, Obi);
          mMRadi = max(mMRadi, min(do(Ok, Oai), do(Ok, Obi)));
        END IF
      ELSE
        compute do(Ok, Obi);
        IF do(Ok, Obi) < df(Ok, Oai) THEN
          mMRadi = max(mMRadi, do(Ok, Obi));
        ELSE
          compute do(Ok, Oai);
          mMRadi = max(mMRadi, min(do(Ok, Oai), do(Ok, Obi)));
        END IF
      END IF
      IF mMRadi > ActResult.mMRad THEN break; END IF
    END FOR
    IF mMRadi < ActResult.mMRad THEN ActResult = (Oai, Obi, mMRadi); END IF
    return ActResult;

Figure 10: Pseudocode description of function nodeSplit for the M-tree.

3.4 Caching

distCache(Node N, Object O1, Object O2) → float
    result = hashtable.lookup(O1, O2);
    IF result = null THEN
      result = compute d(O1, O2);
      hashtable.add(N, O1, O2, result);
    END IF
    return result;

deleteCache(Node N)
    hashtable.delete(N);

3.4.2 Cache Based Range Queries

Efficient query processing of range queries also benefits from the idea of caching distance calculations. During the navigation through the M-tree directory, the same distance computations may have to be carried out several times. Although each object O is stored only once on the leaf level of the M-tree, it might be used several times as routing object. Furthermore, we often have the situation that distance calculations carried out on the directory level have to be repeated at the leaf level. As shown in figure 4, a natural way to implement range queries is by means of recursion, resulting in a depth-first search. We suggest keeping all distance computations in main memory which have been carried out on the way from the root to the actual node. After leaving the node, i.e. when exiting the recursive function, we delete all distance computations carried out at this node. This limits the actual main memory footprint to O(h · b), where h denotes the maximum height of the tree and b denotes the maximum number of stored elements in a node. Even in multi-user environments this rather small worst-case main memory footprint is tolerable. The necessary adaptations of the rangeSearch algorithm are drafted in figure 11.

6    distCache(N, Or, Oq);
      ...
16   distCache(N, Od, Oq);
      ...
22   END IF
22a  deleteCache(N);

Figure 11: Adaptation of similarity range search on M-trees for Caching.
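A small Python sketch of the per-node distance cache described above (our own illustration; the hash-table layout is an assumption):

class DistanceCache:
    # Caches exact distance computations per node on the current root-to-node
    # path; a node's entries are dropped again when its recursive call returns,
    # bounding the memory footprint by O(h * b).
    def __init__(self, dist):
        self.dist, self.table, self.owner = dist, {}, {}

    def distance(self, node_id, o1, o2):
        key = frozenset((id(o1), id(o2)))
        if key not in self.table:
            self.table[key] = self.dist(o1, o2)     # expensive exact distance
            self.owner.setdefault(node_id, []).append(key)
        return self.table[key]

    def leave(self, node_id):
        # Called when the recursion exits node_id: forget its cached distances.
        for key in self.owner.pop(node_id, []):
            self.table.pop(key, None)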

4. EVALUATION

To show the efficiency of our approach, we chose the applications and data types described in section 2 and performed extensive experiments. All algorithms were implemented in Java 1.4 and the experiments were run on a workstation with a Xeon 1.7 GHz processor and 2 GB main memory under Linux. We implemented the M-tree as described in [6]. As in all cases the time for distance calculations was dominating the runtime of a range query, we only show the number of distance calculations and not the runtime.

4.1 CAD Vector Set Data

For the experiments with this data type, we used the similarity model presented in [10], where CAD objects are represented by a vector set consisting of 7 vectors in 6D. All experiments were carried out on a data set containing 5,000 CAD objects from an American aircraft producer. As distance measure between sets of feature vectors we used the minimal matching distance, which can be computed in O(k^3), where k denotes the cardinality of the point set, by means of the Kuhn-Munkres [12, 13] algorithm. As filter, we used the centroid filter introduced in [10].
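For intuition, the minimal matching distance between two vector sets of equal cardinality can be computed with a standard assignment solver; a SciPy-based sketch (our example, ignoring the normalization details of [10]):

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def minimal_matching_distance(X, Y):
    # X, Y: (k, d) arrays of feature vectors (here k = 7, d = 6 as in [10]).
    # Finds the one-to-one matching between the two sets that minimizes the
    # summed Euclidean distances (Kuhn-Munkres / Hungarian algorithm).
    cost = cdist(X, Y)                       # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()

d = minimal_matching_distance(np.random.rand(7, 6), np.random.rand(7, 6))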

4.1.1 Creation of the M-tree

The generation of the optimized M-tree was carried out without caching (cf. figure 12) and with caching (cf. figure 13). Without caching, the number of necessary distance calculations is very high, due to the repeated splitting of nodes. Note that the number of distance calculations for one node split is quadratic w.r.t. the number of elements of this node. In this case, our nodeSplit algorithm needs only 1/4 of the distance calculations while still producing the same M-tree. If we apply caching, the overall number of required distance computations is much smaller, as many distance computations necessary for splitting a node can be fetched from the cache. In this case our findSubTree function allows us to reduce the number of required distance calculations even further, i.e. the number of distance computations is cut in half. To sum up, both optimizations, which are based on the exploitation of available filter information, allow us to build up an M-tree much more efficiently.

Figure 12: Creation without Caching Distance Calculations (number of distance calculations vs. fanout of the M-tree, for the M-tree, M-tree with optimized Split, M-tree with optimized FindSubtree, and M-tree with optimized FindSubtree+Split).

Figure 14: Comparison of our best technique to the M-tree and filtering for vector set data (number of distance calculations vs. epsilon, for Hits, Filter, M-tree, and M-tree+Cache+Filter+PosPruning).

Figure 13: Creation with Caching Distance Calculations (number of distance calculations vs. fanout of the M-tree, for the M-tree, M-tree with optimized Split, M-tree with optimized FindSubtree, and M-tree with optimized FindSubtree+Split).

Figure 15: Comparison of our techniques for vector set data (number of distance calculations vs. epsilon, for M-tree+Filter, M-tree+Cache, M-tree+Cache+Filter, and M-tree+Cache+Filter+PosPruning).

4.1.2 Range Queries

Figure 14 and 15 show in what way the different approaches for range query processing depend on the chosen ε-value. Figure 14 shows that for the investigated data set, the original M-tree is the worst access method for all ε-values. On the other hand, the pure filter performs very well. For this data set, reasonable ε-values for density-based clustering would be about 1 for DBSCAN and about 2 for OPTICS. In this parameter range, our approach clearly outperforms both the filter and especially the original M-tree. In figure 15 one can see that for small ε-values, we benefit from the filtering M-tree, whereas for higher values we benefit from caching and positive pruning. Furthermore, we clustered the data set using OPTICS [2] which forms the basis for the visual data mining tool presented in subsection 2.2. With a suitable parameter setting for OPTICS we achieved a speed-up of 16% compared to the centroid filter, 33% compared to the original M-tree and 104% compared to the sequential scan. Let us note, that the average cardinality of the result set of each range query was almost 2,000 which limits the best achievable speed-up to 150%.

4.2 Image Data

Image data are a good example of multi-represented complex objects. A lot of different similarity models exist for image data, each having its own advantages and disadvantages. Using, for example, text descriptions of images, one is able to cluster all images related to a certain topic, but these images need not look alike. Using color histograms instead, the images are clustered according to the distribution of color in the image. The approach for clustering multi-represented objects presented in [8] is able to get the best out of all these different types of representations. We present some experiments for image data represented as trees or graphs, where the efficiency of range query processing is especially important.

4.2.1 Tree-Structured Image Data

Images can be described as segmentation trees. Thereby, an image is first divided into segments of similar color, then a tree is created from those segments by iteratively applying a region growing algorithm which merges neighboring segments if their colors are sufficiently alike. As similarity measure for the resulting trees, we used the degree-2 edit distance and implemented the filter refinement architecture as described in [9]. We used a sample set of 10,000 color TV-Images. For the experiments we chose reasonable epsilon values for the multi-represented clustering algorithm. Figure 16 shows that we achieve a significant speed-up compared to the original M-tree. As can be seen we also outperform the pure filtering approach.

4.2.2 Graph-Structured Image Data

To extract graphs from the images, they were segmented with a region growing technique, and neighboring segments were connected by edges to represent the neighboring relationship. We used the edge matching distance and the image data set as described in [11]. The filter presented in that paper is almost optimal, i.e. the number of unnecessary distance calculations during query processing is very low. Even in this case our technique is as good as the filter. To show the robustness of our approach against the filter selectivity, we reduced it in a stepwise process. We weighted the original filter distances with constant factors to decrease the filter selectivity. Figure 17 shows that, independently of the filter selectivity, our approach outperforms the original M-tree by a factor of almost 2 and is at least as good as the pure filtering approach.

Figure 16: Comparison of our best technique to the M-tree and filtering for tree-structured data (number of distance calculations vs. epsilon, for Filter, M-tree, and M-tree+Filtering+Caching+PositivePruning).

5. CONCLUSIONS

5.1 Summary

In this paper, we showed that there exist a lot of interesting application areas for density-based clustering of complex objects. Density-based clustering is based on similarity range queries, where the similarity measures used for complex objects are often computationally very complex, which makes them unusable for large databases. To overcome the efficiency problems, metric index structures or multi-step query processing are applied. We combined and extended these approaches to achieve the best from both worlds. More precisely, we presented three improvements for metric index structures, i.e. positive pruning, the combination of filtering and indexing, and caching. In a broad experimental evaluation based on real world data sets, we showed that a significant speed-up for similarity range queries is achieved with our approach. By means of our new techniques, application areas like visually mining through cluster hierarchies of complex objects or clustering of complex multi-represented objects can be extended to larger databases.

5.2 Potentials for Future Work

In this paragraph, we shortly describe how the introduced optimized M-tree can be used for effectively and efficiently navigating through massive data sets. In Section 2.2.3, a data mining tool called BOSS [4] was sketched. BOSS is based on the density-based hierarchical clustering algorithm OPTICS and on suitable cluster recognition and representation algorithms. Density-based clustering algorithms are able to detect arbitrarily shaped clusters, which is advantageous in application areas such as trend detection in spatial databases. On the other hand, in the area of similarity search, clusters of spherical shape are often more desirable. The optimized M-tree cannot only be used for computing a hierarchical density-based clustering efficiently, but it can also be utilized as a new data mining tool helping the user in his time-consuming task to find similar objects. Each directory node of an M-tree consists of objects representing all elements stored in the corresponding spherical subtrees. Thus, the tree itself can be regarded as a hierarchical clustering which, additionally, efficiently supports all kinds of similarity queries, e.g. ε-range queries. Furthermore, the optimizations introduced in this paper allow us to build up an optimized M-tree much more efficiently than carrying out a complete hierarchical density-based clustering. In order to increase the quality, i.e. to minimize the overlap between subtrees of the optimized M-tree, we carry out update operations similar to update operations on Slim-trees [17], i.e. we propose to use a variant of the slim-down algorithm trying to keep the tree tight. The quality of the resulting dynamic browsing tool could be measured by means of numerical values reflecting the degree of overlapping nodes (cf. the fat-factor and the bloat-factor presented in [17]). In our future work, we want to elaborate the trade-off between quality and efficiency of a new dynamic data-mining browsing tool which is based on the optimized M-tree as introduced in this paper.

Figure 17: Comparison of our techniques for graph data (number of distance calculations vs. selectivity of the filter, for Filter, M-tree, and M-tree+Filtering+Caching+PositivePruning).


6. REFERENCES

[1] R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in sequence databases. In Proceedings of the 4th International Conference of Foundations of Data Organization and Algorithms (FODO), pages 69-84, 1993. [2] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. "OPTICS: Ordering Points to Identify the Clustering Structure". In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'99), Philadelphia, PA, pages 49-60, 1999. [3] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. "The Protein Data Bank". Nucleic Acids Research, 28:235-242, 2000. [4] S. Brecheisen, H.-P. Kriegel, P. Kröger, and M. Pfeifle. Visually mining through cluster hierarchies. In Proc. SIAM Int. Conf. on Data Mining (SDM'04), Orlando, FL, 2004. [5] E. Chavez, G. Navarro, R. Baeza-Yates, and J. Marroquin. Searching in metric spaces. ACM Computing Surveys, 33(3):273-321, 2001. [6] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In VLDB'97, Proc. of 23rd International Conference on Very Large Databases, August 25-29, 1997, Athens, Greece, pages 426-435, 1997.

[7] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise". In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD'96), Portland, OR, pages 291-316. AAAI Press, 1996. [8] K. Kailing, H.-P. Kriegel, A. Pryakhin, and M. Schubert. Clustering multi-represented objects with noise. In Proc. 8th Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD'04), Sydney, Australia, 2004. [9] K. Kailing, H.-P. Kriegel, S. Schönauer, and T. Seidl. Efficient similarity search for hierarchical data in large databases. In Proc. 9th Int. Conf. on Extending Database Technology (EDBT 2004), 2004. [10] H.-P. Kriegel, S. Brecheisen, P. Kröger, M. Pfeifle, and M. Schubert. Using sets of feature vectors for similarity search on voxelized CAD objects. In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'03), San Diego, CA, pages 587-598, 2003. [11] H.-P. Kriegel and S. Schönauer. Similarity search in structured data. In Proc. 5th International Conference, DaWaK 2003, Prague, Czech Republic, pages 309-319, 2003. [12] H. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83-97, 1955.


[13] J. Munkres. Algorithms for the assignment and transportation problems. Journal of the SIAM, 6:32–38, 1957. [14] A. Nierman and H. V. Jagadish. Evaluating structural similarity in XML documents. In Proc. 5th Int. Workshop on the Web and Databases (WebDB 2002), Madison, Wisconsin, USA, pages 61–66, 2002. [15] T. B. Sebastian, P. N. Klein, and B. B. Kimia. Recognition of shapes by editing shock graphs. In Proc. 8th Int. Conf. on Computer Vision (ICCV’01), Vancouver, BC, Canada, volume 1, pages 755–762, 2001. [16] T. Seidl and H.-P. Kriegel. Optimal multi-step k-nearest neighbor search. In Proc. ACM SIGMOD Int. Conf. on Managment of Data, pages 154–165, 1998. [17] C. J. Traina, A. Traina, B. Seeger, and C. Faloutsos. Slim-trees: High performance metric trees minimizing overlap between nodes. In Proc. 7th International Conference on Extending Database Technology, Konstanz, Germany, March 27-31, 2000, pages 51–65, 2000. [18] J. T. L. Wang, K. Zhang, G. Chang, and D. Shasha. Finding approximate patterns in undirected acyclic graphs. Pattern Recognition, 35(2):473–483, 2002. [19] K. Zhang, J. Wang, and D. Shasha. On the editing distance between undirected acyclic graphs. International Journal of Foundations of Computer Science, 7(1):43–57, 1996.

Automating the Construction of Scene Classifiers for Content-Based Video Retrieval

Menno Israël
ParaBot Services b.v., Singel 160, 1054 TA Amsterdam, The Netherlands
[email protected]

Egon L. van den Broek
NICI, University of Nijmegen, P.O. Box 9104, 6500 HE Nijmegen, The Netherlands
[email protected]

Peter van der Putten
LIACS, University of Leiden, P.O. Box 9512, 2300 RA Leiden, The Netherlands
[email protected]

ABSTRACT
This paper introduces a real time automatic scene classifier within content-based video retrieval. In our envisioned approach end users like documentalists, not image processing experts, build classifiers interactively, by simply indicating positive examples of a scene. Classification consists of a two stage procedure. First, small image fragments called patches are classified. Second, frequency vectors of these patch classifications are fed into a second classifier for global scene classification (e.g., city, portraits, or countryside). The first stage classifiers can be seen as a set of highly specialized, learned feature detectors, as an alternative to letting an image processing expert determine features a priori. We present results for experiments on a variety of patch and image classes. The scene classifier has been used successfully within television archives and for Internet porn filtering.

Categories and Subject Descriptors
H.3.1 [Information storage and retrieval]: Content Analysis and Indexing - Indexing methods. H.3.3 [Information storage and retrieval]: Information Search and Retrieval - Retrieval models. I.2.10 [Artificial Intelligence]: Vision and Scene Understanding - Intensity, color, photometry, and thresholding, Texture, Video analysis. I.4.7 [Image processing and computer vision]: Feature Measurement - Feature representation, Moments, Texture. I.4.8 [Image processing and computer vision]: Scene Analysis - Color. I.4.9 [Image processing and computer vision]: Applications. I.5.1 [Pattern Recognition]: Models - Neural Nets. I.5.4 [Pattern Recognition]: Applications - Computer vision.

General Terms
Algorithms, Performance, Experimentation, Human Factors.

Keywords
Content-Based Video Retrieval, CBVR, Content-Based Image Retrieval, CBIR, Image Mining, Vicar, Scene Classification, Patch Classification, Color, Texture, real time, automatic annotation, television archives, porn filtering.

1. INTRODUCTION
This work has been done as part of the EU Vicar project (IST). The aim of this project was to develop a real time automated video indexing, classification, annotation, and retrieval system. Vicar was developed in close cooperation with leading German, Austrian, Swedish, and Dutch broadcasting companies. These companies generally store millions of hours of video material in their archives. To increase sales and reuse of this material, efficient and effective video search with optimal hit rates is essential. Outside the archive, large amounts of video material are managed as well, such as news feeds and raw footage [35].

Generally, only a fraction of the content is annotated manually and these descriptions are typically rather compact. Any system to support video search must be able to index, classify, and annotate the material extensively, so that efficient mining and search may be conducted using the index rather than the video itself. Furthermore, these indices, classifications, and annotations must abstract from the pure syntactical appearance of the video pixels to capture the semantics of what the video is about (e.g., a shot of Madonna jogging in a park). Within Vicar a variety of visual events is recognized, including shots, camera motion, person motion, persons and faces, specific objects, etc. In this paper we will focus on the automated classification of visual scenes. For searching and browsing video scenes, classifiers that extract the background setting in which events take place are a key component. Examples of scenes are indoor, outdoor, day, night, countryside, city, demonstration, and so on. The number of classes to be learned is generally quite large - tens to hundreds - and not known beforehand. So, it is generally not feasible to let an image processing expert build a special purpose classifier for each class.


Using our envisioned approach, an end user like an archive documentalist or a video editor can build classifiers by simply showing positive examples of a specific scene category.



In addition, an end user may also construct classifiers for small image fragments to simplify the detection of high level global scenes, again just by showing examples (e.g., trees, buildings, and road).

We call these image fragments patches. The patch classifiers actually provide the input for the classification of the scene as a whole. The patch classifiers can be seen as automatically trained data preprocessors generating semantically rich features, highly relevant to the global scenes to be classified, as an alternative to an image processing expert selecting the right set of abstract features (e.g., wavelets, Fourier transforms). Additionally, the interactive procedure is a way to exploit a priori knowledge the documentalist may have about the real world, rather than relying on a purely data driven approach.

Note that the scene is classified without relying on explicit object recognition. This is important because a usable indexing system should run at least an order of magnitude faster than real time, whereas object recognition is computationally intensive. More fundamentally, we believe that certain classes of semantically rich information can be perceived directly from the video stream rather than indirectly by building on a large number of lower levels of slowly increasing complexity. This position is inspired by Gibson's ideas on direct perception [14]. Gibson claims that even simple animals may be able to pick up niche specific and complex observations (e.g., prey or predator) directly from the input without going through several indirect stages of abstract processing.

This paper is expository and meant to give a non-technical introduction into our methodology. A high level overview of our approach is given in Section 2. Section 3 provides more detail on the low level color and texture features used and Section 4 specifies the classifying algorithms used. Experimental results for patch and scene classification are given in Sections 4.1 and 4.2. Next, we highlight two applications in which scene classification technology has been embedded (Section 6). We finish with a discussion and conclusion (Sections 5 and 7).

2. OVERALL APPROACH
In Vicar a separate module is responsible for detecting the breaks between shots. Then for each shot a small number of representative key frames is extracted, thus generating a storyboard of the video. These frames (or a small section of video around these key frames) are input to the scene classifier.

2.1 Scene Classification Procedure
The scene classifier essentially follows a two stage procedure: (i) Small image segments are classified into patch categories (e.g., trees, buildings, and road) and (ii) these classifications are used to classify the scene of the picture as a whole (e.g., interior, street and forest). The patch classes that are recognized can be seen as an alphabet of basic perceptual elements to describe the picture as a whole.

In more detail, first a high level segmentation of the image takes place. This could be some intelligent procedure recognizing arbitrarily shaped segments, but for our purposes we simply divide images up into a regular n by m grid, say 3 by 2 grid segments for instance. Next, from each segment patches (i.e., groups of adjacent pixels within an image, described by a specific local pixel distribution, brightness, and color) are sampled. Again, some intelligent sampling mechanism could be used to recognize arbitrarily sized patches. However, we divided each grid segment by a second grid into regular size image fragments, ignoring any partial patches sampled from the boundary. These patches are then classified into several patch categories, using color and texture features (see Section 3). See Figure 1 for a visualization of this approach.

For each segment, a frequency vector of patch classifications is calculated. Then, these patch classification vectors are concatenated to preserve some of the global location information (e.g., sky above and grass below) and fed into the final scene classifier. Various classifiers have been used to classify the patches and the entire picture, including kNN, naive Bayes, and back-propagation neural networks.

2.2 Related Work
Literature on scene classification is relatively limited. Early retrieval systems like QBIC [10], VisualSEEk [30], PicHunter [7], PicToSeek [13], and SIMPLIcity [36] use color, shape, and texture representations for picture search. Picard extended Photobook with capabilities for classifying patches into so-called 'stuff' categories (e.g., grass, sky, sand, and stone), using a set of competing classification models (society of models approach) [22, 25, 26]. In Blobworld, Belongie et al. [1, 5] segment pictures into regions with coherent texture and color of arbitrary shape ('blobs') and offer the user the possibility to search on specific blobs rather than the low level characteristics of the full picture. However, these blobs are classified into neither stuff nor scene categories [1, 5]. Campbell et al. [3] also segment pictures into arbitrarily shaped regions and then use a neural network to classify the patches into stuff-like categories like building, road and vegetation.

Some papers are available on classification of the scene of the picture as a whole. Lipson et al. [21] recognize a limited set of scenes (mountains, mountain lakes, waterfalls, and fields) by deriving the global scene configuration of a picture and matching it to a handcrafted model template. For example, the template for a snowy mountain states that the bottom range of a picture is dark, the middle range very light and the top range has medium luminance. Ratan and Grimson [27] extend this work by learning the templates automatically. The templates are built using the dominant color-luminance combinations and their spatial relations in images of a specific scene category. They present results for fields and mountains only. Both papers only report results for retrieval tasks, not for classification.

Oliva et al. [23] defined global characteristics (or semantic axes) of a scene (e.g., vertical - horizontal, open - closed, and natural - artificial), for discriminating between, for example, city scenes and nature scenes. These characteristics are used to organize and sort pictures rather than classify them.


Figure 1: A screenshot of the automatic scene classifier, using a 4x3 grid. The right image shows the classified patches.

Gorkani and Picard [16] classified city versus nature scenes. The algorithms used to extract the relevant features were specific for these scenes (i.e., global texture orientation). In addition, Szummer and Picard [31] classified indoor and outdoor scenes. They first classified local segments as indoor or outdoor, and then classified the whole image as such. Both classifiers performed well, but it is not known whether these approaches generalize to other scene categories.

2.3 Positioning the Vicar method
Our method uses the local patch classification as input for the classification of the scene as a whole. To our knowledge only Fung et al. reported a similar approach [11, 12]. Note that the final scene classifier has only access to patch class labels. From the point of view of the final classifier, the patch classifiers are feature extractors that supply semantically rich and relevant input rather than generic syntactic color and texture information. Moreover, the patch classifiers are trained rather than being feature extractors a priori selected by an image processing expert.

So, our method differs and improves on the general applicability for a variety of scene categories, without the need to select different and task specific feature extraction algorithms for each classification task. Moreover, we used computationally cheap algorithms, enabling real time scene classification. A more fundamental difference is that we allow end users to add knowledge of the real world to the classification and retrieval engines, which means that it should be possible to outperform any purely data driven approach, even if it is based on optimal classifiers. This is important given the fact that image processing expertise is scarce and not available to end users, but knowledge of the world is abundant.

3. PATCH FEATURES
In this section, we discuss the patch features as used for patch classification. They provide the foundation for the scene classifier. In order of appearance, we discuss: (i) color quantization using a new distributed histogram technique, (ii) color spaces, the segmentation of the HSI color space, and human color categories, and (iii) the algorithm used to determine the textural features.

3.1 Distributed color histograms
At the core of many color matching algorithms lies a technique based on histogram matching. This is no different for the current scene classification system. Let us, therefore, define a color histogram of size n. Then, each pixel j present in an image has to be assigned to a bin (or bucket) b.

The bin bi, with i ∈ {0, n − 1}, for a pixel j with value xj, is determined using:

    βi = xj / s    (1)

where xj is the value of pixel j and s is the size of the intervals, with s determined as follows:

    s = (max(x) − min(x)) / n    (2)

where max(x) and min(x) are respectively the maximum and minimum value xj can take.

For convenience, Equation 2 is substituted into Equation 1, which yields:

    βi = n · xj / (max(x) − min(x))    (3)

Now, bi is defined as the integer part of the decimal number βi. As for each conversion from an originally analog to a digital (discrete) representation, one has to determine the precision of the discretization and with that the position of the boundaries between different elements of the discrete representation. In order to cope with this problem, we distributed each pixel over three bins, instead of assigning it to one bin.

Let us consider an image with p pixels that has to be distributed over n bins. Further, we define min(bi) and max(bi) as the borders of bin i (bi). Then, when considering an image pixel by pixel, the update of the histogram for each of these pixels is done as follows:


    bi += 1    (4)

    bi−1 += 1 − |xj − min(bi)| / (max(bi) − min(bi))    (5)

    bi+1 += 1 − |xj − max(bi)| / (max(bi) − min(bi))    (6)

where min(bi) ≤ xj ≤ max(bi), with i ∈ {0, n − 1} and j ∈ {0, p − 1}.

Please note that this approach can be applied on all histograms, but its effect becomes stronger with the decline in the number of bins a histogram consists of.
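To make this update rule concrete, the following is a minimal sketch of Equations 3-6. It assumes pixel values already normalized to [0, 1] and simply clamps the two border bins instead of wrapping them; the class and method names are our own and are not part of the Vicar system.

    /** Minimal sketch of the distributed histogram update of Equations 3-6.
     *  Assumes pixel values have already been normalized to [0, 1]. */
    public final class DistributedHistogram {

        /** Returns a histogram with n bins over the values in x (all in [0, 1]). */
        public static double[] build(double[] x, int n) {
            double[] bins = new double[n];
            for (double value : x) {
                // Equation 3: fractional bin position; its integer part is the bin index.
                double beta = value * n;
                int i = Math.min((int) beta, n - 1);
                double min = (double) i / n;        // lower border of bin i
                double max = (double) (i + 1) / n;  // upper border of bin i
                double width = max - min;
                bins[i] += 1.0;                                                     // Equation 4
                if (i > 0)     bins[i - 1] += 1.0 - Math.abs(value - min) / width;  // Equation 5
                if (i < n - 1) bins[i + 1] += 1.0 - Math.abs(value - max) / width;  // Equation 6
            }
            return bins;
        }

        public static void main(String[] args) {
            double[] pixels = {0.05, 0.31, 0.32, 0.90};
            System.out.println(java.util.Arrays.toString(build(pixels, 16)));
        }
    }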

3.2 Color
No color quantization can be done without a color representation. The RGB color space is the most used color space for computer graphics. However, the HSI / HSV (Hue, Saturation, and Intensity / Value) color spaces are more closely related to human color perception than the RGB color space [20, 34]. Therefore, we have chosen to use the HSI color space.

Since image and video material is defined in the RGB color space, we needed to convert this color space to the HSI color space. This was done as follows:

    H = arctan( (√3/2) (G − B) / (R − ½ (G + B)) )    (7)

    S = √( ((√3/2) (G − B))² + (R − ½ (G + B))² )    (8)

    I = (R + G + B) / 3    (9)

Here, we took into account human perceptual limitations. If Saturation was below 0.2, Intensity was below 0.12, or Intensity was above 0.94, then the Hue value has not been taken into account, since for these Saturation and Intensity values the Hue is not visible as a color.

Note that all H, S, and I values were normalized to values between 0 and 1. But how to quantize this HSI color space? From the literature [2, 8, 9, 15, 19, 28, 32] it is known that people use a limited set of color categories. Color categories can be defined as a fuzzy notion of some set of colors. People use these categories when thinking of or speaking about colors or when they recall colors from memory.

No exact definition of the number nor the exact content of the color categories is present. However, all research mentions a limited number of color categories: ranging between 11 [2, 32, 33] and 30 [8], where most evidence is found for 11 color categories. We conducted some limited experiments with subjective categories (i.e., categories indicated by humans) but these did not give better results than 16 evenly distributed categories, so for simplicity we used this categorization. Our 16 color categories are defined by an equal division of the Hue axis of the HSI color space, since the Hue represents color. Luminance is represented by the Intensity axis of the HSI color space. Again we have chosen for a coarse quantization: the Intensity-axis is divided into six equal segments. The Saturation-axis was not segmented.

The original RGB color coordinates were converted to Hue and Intensity coordinates by Equations 7 and 9. Next, for both the Hue and the Intensity histogram, each pixel is assigned to a bin using Equation 3. Last, Equations 4, 5, and 6 are applied on both histograms to update them. Note that, due to the circular character of the Hue, the last bin and the first bin of our Hue histogram are neighbors. Our algorithm takes this implication of Hue's circularity into account. Since both histograms are a coarse quantization, this method (i) is computationally cheap (making real time classification possible) and (ii) facilitates generalization by classifiers.
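For illustration, the sketch below combines the conversion of Equations 7-9 with the coarse quantization just described (16 Hue categories, 6 Intensity segments, and the 0.2 / 0.12 / 0.94 thresholds). It is our own reconstruction: atan2 is used instead of a plain arctangent to keep the quadrant of the Hue angle, and all class and method names are assumptions.

    /** Sketch of the RGB-to-HSI conversion (Equations 7-9) and the coarse
     *  Hue/Intensity quantization described above. Our own reconstruction. */
    public final class HsiQuantizer {

        /** Returns {H, S, I}; R, G, B are expected in [0, 1], H is scaled to [0, 1]. */
        public static double[] rgbToHsi(double r, double g, double b) {
            double c1 = r - 0.5 * (g + b);                // first chroma term
            double c2 = Math.sqrt(3.0) / 2.0 * (g - b);   // second chroma term
            double h = Math.atan2(c2, c1);                // Equation 7 (atan2 keeps the quadrant)
            if (h < 0) h += 2 * Math.PI;
            double s = Math.sqrt(c1 * c1 + c2 * c2);      // Equation 8
            double i = (r + g + b) / 3.0;                 // Equation 9
            return new double[] { h / (2 * Math.PI), s, i };
        }

        /** 16 equal Hue categories; Hue is ignored for dark, bright, or desaturated pixels. */
        public static int hueBin(double h, double s, double i) {
            if (s < 0.2 || i < 0.12 || i > 0.94) return -1;   // Hue not visible as a color
            return Math.min((int) (h * 16), 15);
        }

        /** Six equal Intensity segments. */
        public static int intensityBin(double i) {
            return Math.min((int) (i * 6), 5);
        }

        public static void main(String[] args) {
            double[] hsi = rgbToHsi(0.8, 0.3, 0.2);
            System.out.printf("H=%.2f S=%.2f I=%.2f hueBin=%d intBin=%d%n",
                    hsi[0], hsi[1], hsi[2],
                    hueBin(hsi[0], hsi[1], hsi[2]), intensityBin(hsi[2]));
        }
    }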

3.3 Texture
Next to color, texture can be analyzed. Jain and Karu [18] state: “Texture [eludes] a formal definition”. Let us define texture as follows: a repetitive arrangement of pixel values that either is perceived or can be described as such.

For texture analysis, in most cases the Intensity of the pixels is used, hereby ignoring their color [24, 34]. Several techniques are used to determine the patterns that may be perceived from the image [29]. With most texture analyses, textural features are derived from the image, instead of describing arrangements of the individual pixels. This reduces the computational costs significantly, which is essential for applications working in real time.

Therefore, we used a texture algorithm that extracts three textural features for each position of a mask that is run over the image. Here, the size of the mask determines the ratio between local and global texture analysis. The position of the mask is defined by its central pixel. Note that the mask is a square of n × n pixels, with n being an odd integer.

For each pixel of the mask, the difference between both its horizontal neighbors as well as the difference between its vertical neighbors is determined. (p, q) denotes the elements (i.e., pixels) of the image, with (i, j) being the coordinates of the pixels located in a mask surrounding an image pixel (p, q). Function f determines the normalized value of pixel (i, j) for a chosen color channel (i.e., H, S, or I), using Equations 7, 8, and 9.

    foreach (p, q) ∈ Image
        foreach (i, j) ∈ Mask(p, q)
            Sum   += f(i, j)
            SqSum += f(i, j)²
            M11   += (f(i+1, j) − f(i−1, j))²
            M22   += (f(i, j+1) − f(i, j−1))²
            M12   += (f(i+1, j) − f(i−1, j)) · (f(i, j+1) − f(i, j−1))


So, for each mask M11, M12, and M22 are determined, defining the symmetric covariance matrix M. Let ev1 and ev2 be the eigenvalues of M (for more details, see for example Jähne [17] on the structure tensor).

Given this algorithm, three textural features can be determined:

    F1 = SqSum − Sum²    (10)

    F2 = min{ev1, ev2} / max{ev1, ev2}    (11)

    F3 = max{ev1, ev2}    (12)

F1 (see Equation 10) can be identified as the variance (σ²), indicating the global amount of texture present in the image. The other two features, F2 and F3 (see Equations 11 and 12), indicate the structure of the texture available. If ev1 and ev2 differ significantly, stretched structures are present (e.g., lines). When ev1 and ev2 have a similar value (i.e., F2 approximates 1; see Equation 11), texture is isotropic. In the case both ev1 and ev2 are large (i.e., both F2 and F3 are large; see Equations 11 and 12), clear structure is present, without a clear direction. In the case ev1 and ev2 are both small (i.e., F2 is large and F3 is small; see Equations 11 and 12), smooth texture is present. Moreover, F2 and F3 are rotation-invariant.

Hence, this triplet of textural features provides a good indication for the textural properties of images, both locally and globally. In addition, it is computationally cheap and, therefore, very useful for real time content-based video retrieval.
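A runnable version of the accumulation loop and Equations 10-12 is sketched below for a single mask position. We read the accumulators as the entries of the symmetric matrix M (squared differences on the diagonal, their product off the diagonal); the per-mask normalization of Sum and SqSum and the use of a plain 2-D array for the channel values are our own assumptions.

    /** Sketch of the three textural features F1-F3 for one mask position.
     *  img holds normalized channel values (e.g., Intensity); the mask is n x n, n odd. */
    public final class TextureFeatures {

        /** Returns {F1, F2, F3} for the mask centered at (p, q). */
        public static double[] compute(double[][] img, int p, int q, int n) {
            int half = n / 2;
            double sum = 0, sqSum = 0, m11 = 0, m22 = 0, m12 = 0;
            int count = 0;
            for (int i = p - half; i <= p + half; i++) {
                for (int j = q - half; j <= q + half; j++) {
                    double v = img[i][j];
                    double dx = img[i + 1][j] - img[i - 1][j];   // horizontal difference
                    double dy = img[i][j + 1] - img[i][j - 1];   // vertical difference
                    sum += v;  sqSum += v * v;
                    m11 += dx * dx;  m22 += dy * dy;  m12 += dx * dy;
                    count++;
                }
            }
            sum /= count; sqSum /= count;                 // assumption: mean-normalized sums
            double f1 = sqSum - sum * sum;                // Equation 10: variance
            // Eigenvalues of the symmetric 2x2 matrix [[m11, m12], [m12, m22]].
            double trace = m11 + m22;
            double det = m11 * m22 - m12 * m12;
            double disc = Math.sqrt(Math.max(0, trace * trace / 4 - det));
            double ev1 = trace / 2 + disc, ev2 = trace / 2 - disc;
            double evMax = Math.max(ev1, ev2), evMin = Math.min(ev1, ev2);
            double f2 = evMax > 0 ? evMin / evMax : 0;    // Equation 11
            double f3 = evMax;                            // Equation 12
            return new double[] { f1, f2, f3 };
        }

        public static void main(String[] args) {
            double[][] img = new double[32][32];
            for (int i = 0; i < 32; i++)
                for (int j = 0; j < 32; j++)
                    img[i][j] = (j % 4 < 2) ? 1.0 : 0.0;   // vertical stripes: strongly oriented
            System.out.println(java.util.Arrays.toString(compute(img, 16, 16, 5)));
        }
    }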

4. EXPERIMENTS AND RESULTS
In the previous section (Section 3) the features used were introduced. These features were used for the first phase of classification: the classification of patches, resulting in a frequency vector of patch classes for each grid cell.

In the second phase of classification, a classifier is used to classify the whole image. The input for the classifier is the concatenation of all frequency vectors of patch classes for each grid cell. So, two phases exist, each using their own classifier. We have experimented with two types of classifiers: a k-nearest neighbors classifier (kNN) and a neural network. We will now discuss both the patch classification (Section 4.1) and the scene classification (Section 4.2).

The advantage of kNN is that it is a lazy method, i.e., the models need no retraining. This is an important advantage given that we envisage an interactively learning application. However, given that kNN does not abstract a model from the data, it suffers more from the curse of dimensionality and will need more data to provide accurate and robust results. The neural network needs training, parameter optimization and performance tuning; however, it can provide good results on smaller data sets, provided that the degrees of freedom in the model are properly controlled.

The experiments discussed in the next two subsections all used the Corel image database as test bed.

4.1 Patch classification
In this section we will discuss the patch classification. In the next section, the classification of the image as a whole is discussed.

Each of the patches had to be classified to one of the nine patch categories defined (i.e., building, crowd, grass, road, sand, skin, sky, tree, and water). First, a kNN classifier was used for classification. This is because it is a generic classification method. In addition, it could indicate whether a more complex classification method would be needed. However, the classification performance was poor. Therefore, we have chosen to use a neural network for the classification of the grid cells, with nine output nodes (as many as there were patch classes).

For the neural network, for each of the nine patch classes both a train and a test set were randomly defined, with a size ranging from 950 to 2500 patches per category. The neural network architecture was as follows: 25 input, 30 hidden, and 9 output nodes. The network ran 5000 training cycles with a learning rate of 0.007. With a patch size of 16x16, the patch classifier had an overall precision of 87.5%. The patch class crowd was confused with the patch class building in 5.19% of the cases. Sand and skin were also confused. Sand was classified as skin in 8.80% of the cases and skin was classified as sand in 7.16% of the cases. However, with a precision of 76.13% the patch class road appeared the hardest to classify. In the remaining 23.87% of the cases road was confused with one of the other eight patch classes, with percentages ranging from 1.55% to 5.81%. The complete results can be found in Table 1.

Table 2 shows the results for an 8x8 patch classifier in one of our experiments. The 16x16 patch classifier clearly outperforms the 8x8 patch classifier with an overall precision of 87.5% versus 74.1%. So, the overall precision for the 8x8 patch classifier decreases by 13.4% compared to the precision of the 16x16 classifier. The decline in precision for each category is as follows: sand 22.16%, water 21.26%, building 17.81%, skin 17.48%, crowd 17.44%, tree 16.8% and road 7.16%. Only for the categories grass and sky the classification was similar for both patch sizes.

Note that Figure 1 presents a screenshot of the system, illustrating the division of an image into grid cells. The classified patches are represented by little squares in different colors.

So far, we have only discussed patch classification in general. However, it was applied on each grid cell separately: For each grid cell, each patch was classified to a patch category. Next, the frequency of occurrence of each patch class, for each grid cell, was determined. Hence, each grid cell could be represented as a frequency vector of the nine patch classes. This served as input for the next phase of processing: scene classification, as is discussed in the next subsection.
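The sketch below illustrates how such a concatenated input vector for the second stage can be assembled from per-patch class labels. The 3x2 grid and the nine patch classes follow the text; the normalization, method names, and exact vector layout are assumptions and may differ from the actual Vicar implementation.

    import java.util.Arrays;

    /** Sketch: per-grid-cell frequency vectors of patch class labels,
     *  concatenated into one scene descriptor (3x2 grid, 9 patch classes assumed). */
    public final class SceneVectorBuilder {

        static final int GRID_COLS = 3, GRID_ROWS = 2, NUM_PATCH_CLASSES = 9;

        /** patchLabels[r][c] holds the predicted class (0..8) of every patch in grid cell (r, c). */
        public static double[] build(int[][][] patchLabels) {
            double[] scene = new double[GRID_ROWS * GRID_COLS * NUM_PATCH_CLASSES];
            int offset = 0;
            for (int r = 0; r < GRID_ROWS; r++) {
                for (int c = 0; c < GRID_COLS; c++) {
                    int[] labels = patchLabels[r][c];
                    for (int label : labels) scene[offset + label]++;
                    for (int k = 0; k < NUM_PATCH_CLASSES; k++)
                        scene[offset + k] /= Math.max(1, labels.length);   // normalized frequencies
                    offset += NUM_PATCH_CLASSES;    // concatenation preserves cell position
                }
            }
            return scene;    // fed into the second-stage (scene) classifier
        }

        public static void main(String[] args) {
            int[][][] labels = new int[GRID_ROWS][GRID_COLS][];
            for (int[][] row : labels)
                Arrays.fill(row, new int[] {6, 6, 8, 2});   // sky, sky, water, grass
            System.out.println(Arrays.toString(build(labels)));
        }
    }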


Table 1: Confusion matrix of the patch (size: 16x16) classification for the test set. Rows show the actual category; columns show the predicted category.

              building   crowd   grass    road    sand    skin     sky    tree   water  unknown
    building     89.23    3.02    0.09    1.11    1.02    0.60    0.38    3.70    0.85     0.00
    crowd         5.19   87.25    0.19    1.81    0.44    0.50    0.38    2.94    0.06     1.25
    grass         0.00    0.00   94.73    0.73    0.60    0.00    0.00    3.00    0.93     0.00
    road          1.55    5.48    2.84   76.13    1.55    1.74    1.81    5.81    3.10     0.00
    sand          1.84    0.88    2.24    1.44   83.68    8.80    0.24    0.00    0.64     0.24
    skin          0.32    2.53    0.00    0.63    7.16   89.37    0.00    0.00    0.00     0.00
    sky           0.21    0.00    0.00    2.57    0.93    0.00   91.71    0.36    3.86     0.36
    tree          1.12    3.44    2.60    0.32    0.16    0.24    0.56   88.44    0.84     2.28
    water         0.00    0.00    4.00    4.44    0.52    0.00    3.04    0.44   87.26     0.30

Table 2: Confusion matrix of the patch (size: 8x8) classification for the test set. Rows show the actual category; columns show the predicted category.

              building   crowd   grass    road    sand    skin     sky    tree   water  unknown
    building     71.42    9.00    0.85    2.69    2.43    2.86    0.26    6.53    0.77     3.20
    crowd        10.38   69.81    1.13    1.56    2.13    5.56    0.69    6.44    0.19     2.13
    grass         0.80    0.07   93.87    0.73    0.07    0.73    1.20    1.20    0.87     0.47
    road          2.65    5.81    2.45   68.97    2.97    1.87    5.48    3.10    4.52     2.19
    sand          3.44    3.12    2.88    1.84   61.52   15.20    8.80    0.16    2.80     0.24
    skin          1.16    7.79    0.42    0.11   13.47   71.89    4.42    0.11    0.11     0.53
    sky           0.00    0.00    0.00    0.29    1.36    2.57   91.43    0.07    4.07     0.21
    tree          4.56   11.08    8.20    1.88    0.52    0.76    0.24   71.64    0.56     0.56
    water         0.37    0.52    3.26    9.78    3.85    3.85   11.41    0.52   66.00     0.44

Table 3: Confusion matrix of the scene classification for the test set. Rows show the actual category; columns show the predicted category.

                 Interior  City/street  Forest  Country  Desert   Sea  Portraits  Crowds
    Interior         82.0          8.0     2.0      0.0     0.0   0.0        2.0     6.0
    City/street      10.0         70.0     4.0      8.0     0.0   0.0        2.0     6.0
    Forest            2.0          4.0    80.0      2.0     2.0   8.0        0.0     2.0
    Country           0.0          6.0    28.0     54.0    10.0   0.0        0.0     2.0
    Desert            8.0          6.0     2.0     10.0    64.0   4.0        4.0     2.0
    Sea               4.0         14.0     0.0      2.0     0.0  80.0        0.0     0.0
    Portraits         8.0          0.0     0.0      4.0     4.0   2.0       80.0     2.0
    Crowds            4.0         14.0     0.0      0.0     2.0   0.0        0.0    80.0


4.2 Scene classification
The system had to be able to distinguish between eight categories of scenes, relevant for the Vicar project: interiors, city/street, forest, agriculture/countryside, desert, sea, portrait, and crowds. In pilot experiments several grid sizes were tested: a 3x2 grid gave the best results. The input of the classifiers were the normalized and concatenated grid vectors. The elements of each of these vectors represented the frequency of occurrence of each of the reference patches, as they were determined in the patch classification (see Section 4.1).

Again, first a kNN classifier was used for classification. Similarly to the patch classification, the kNN had a low precision. Therefore, we have chosen to use a neural network for the classification of the complete images, with eight output nodes (as many as there were scene classes). For each of the eight scene classes both a train and a test set were randomly defined. The train sets consisted of 199, 198, or 197 images. For all scene classes, the test sets consisted of 50 images. The neural network architecture was as follows: 63 input, 50 hidden, and 8 output nodes. The network ran 2000 training cycles with a learning rate of 0.01.

The image classifier was able to classify 73.8% of the images correctly. Interior (82% precision) was confused with city/street in 8.0% and with crowds in 6.0% of the cases. City/street was correctly classified in 70.0% of the cases and confused with interior (10%), with country (8.0%), and with crowds (6.0%). Forest (80% precision) was confused with sea (8.0%). Country was very often (28.0%) confused with forest and was sometimes confused with city/street (6.0%) and with desert (10%), which resulted in a low precision: 54.0%. In addition, also desert had a low precision of classification (64%); it was confused with interior (8.0%), city/street (6.0%), and country (10%). Sea, portraits, and crowds had a classification precision of 80.0%. Sea was confused with city/street in 14%, portraits were confused with interior in 8.0% of the cases, and crowds were confused with city/street in 14.0% of the cases. In Table 3 the complete results for each category separately are presented.

5. DISCUSSION
Let us discuss the results of patch and scene classification separately, before turning to overall issues. For patch classification, two patch sizes have been applied. The 16x16 patch classifier gave clearly a much higher precision than the 8x8 patch classifier. Our explanation is that a 16x16 patch can contain more information of a (visual) category than an 8x8 patch. Therefore, some textures cannot be described in an 8x8 patch (e.g., patches of buildings). A category such as grass, on the other hand, performed well with 8x8 patches. This is due to its high frequency of horizontal lines that fit in an 8x8 patch.

Therefore, the final system tests were done with the 16x16 patch size, resulting in an average result of 87.5% correct. Campbell and Picard [4, 25, 26] reported similar results. However, our method has major advantages in terms of a much lower computational complexity. Moreover, the classified patches themselves are intermediate image representations and can be used for image classification, image segmentation as well as for image matching.

Hitherto, the patches with which the classifiers were trained had to be manually classified. So, the development of a general purpose automatic scene classifying system would require an enormous effort: in principle, sets of reference patches should be manually classified for all possible patch types. To solve this problem, we are currently developing algorithms for automatic extraction of relevant patch types, enabling automatic training of our system. The second phase of the system consists of the classification of the image representation, using the concatenated frequency patch vectors of the grid cells. An average performance of 73.8% was achieved. The least performing class is Country (which includes the categories countryside and agriculture) with 54% correct. What strikes immediately, when looking at the detailed results in Table 3, is that this category is confused with the category forest in 28% of the cases and with the category desert in 10% of the cases.

The latter confusions can be explained by the strong visual resemblance between the three categories, which is reflected in the corresponding image representations from these different categories. To solve such confusions, the number of patch categories could be increased. This would increase the discriminating power of the representations. Note that if a user searches on the index rather than on the class label, the search engine may very well be able to search on images that are a mix of multiple patches and scenes.

To make the system truly interactive, classifiers are needed that offer the flexibility of kNN (no or very simple training) but the accuracy of more complex techniques. We have experimented with learning algorithms such as naive Bayes, but the results have not been promising yet. Furthermore, one could exploit the interactivity of the system more, for instance by adding any misclassifications identified by the user to the training data. Finally, the semantic indices are not only useful for search or classification but may very well be used as input for other mining tasks. An example would be to use index clustering to support navigation through clusters of similar video material.

6. APPLICATIONS
The scene classifier has been embedded into several applications. In this section we will describe two of them.

6.1 Vicar
The scene classifier has been integrated into the Vicar Video Navigator [35]. This system utilizes text-based search, either through manual annotations or through automatically generated classifications like the global scene labels. As a result, Vicar returns the best matching key frames along with information about the associated video. In addition, a user can refine the search by combining a query by image with text-based search. The query by image can either be carried out on local characteristics (appearance) or may include content based query by image. In the first case, the index consisting of the concatenated patch classification vectors is included in the search. In the latter case, the resulting index of scores on the global scene classifiers is used (content). In Figures 2 and 3 an example search is shown from a custom made web application based on the Vicar technology.


Figure 2: A query for video material.

Figure 3: The result of a query for video material.

The first screenshot shows one of the key frames that has been retrieved from the archive using the (automatically annotated) keyword countryside. An extra keyword person (also automatically annotated) is added to the search, as well as the content index of the image. In the second screenshot the results of the combined queries are shown: persons with a similar background scene as the query image.

6.2 Porn filtering
To test the general applicability of our approach we built a new classifier to distinguish pornographic from non pornographic pictures. Within half a day a classifier was constructed with a precision of over 80%. As a follow up, a project for porn filtering was started within the EU Safer Internet Action Plan (IAP) program. Within this project, SCOFI, a real time classification system was built, which is currently running at several schools in Greece, England, Germany and Iceland. The porn image classifier is combined with a text classifier and integrated with a proxy server to enable safe web surfing. The text classifier and the proxy server have been developed by Demokritos, Greece, and are part of the Filterix system [6].

For this application of the system, we first created image representations using the patch classification network as mentioned in Section 4.1. With these image representations we trained the second phase classifier, using 8,000 positive (pornographic) and 8,000 negative (non pornographic) examples. The results: the system was able to detect 92% of the pornographic images in a diverse image collection of 2,000 positive examples and 2,000 negative examples (which includes non pornographic pictures of people). There were 8% false positives (non pornographic images identified as pornographic) and 8% false negatives. Examples of false positives were close ups of faces and pictures like deserts and fires. To improve results, within the SCOFI project a Vicar module was used that detects close ups of faces.

The integrated SCOFI system that combines text and image classification has a performance of 0% overblocking (i.e., 100% correct on non pornographic web pages) and 1% underblocking (i.e., 99% correct on pornographic web pages). As such it is used as a real time filter for filtering pornography on the Internet, in several schools throughout Europe.

7. CONCLUSION
In this paper a general scene classifier is introduced that does not rely on computationally expensive object recognition. The features that provide the input for the final scene classification are generated by a set of patch classifiers that are learned rather than predefined, and specific for the scenes to be recognized rather than general. Though the results on different scene categories can still be improved, the current system can successfully be used as a tool for generating scene indexes and classifications for content-based image and video retrieval and filtering. This is demonstrated by its success in various applications such as the Vicar Video Navigator video search engine and the SCOFI real time filter for pornographic image material on the Internet.

8. ACKNOWLEDGMENTS
This work was partially supported by the EU projects VICAR (IST-24916) and SCOFI (IAP-2110; http://www.scofi.net/). Further, we gratefully acknowledge the reviewers for their comments on the manuscript. We thank Robert Maas for his work on the texture algorithm.

9. ADDITIONAL AUTHORS
Additional authors: Marten J. den Uyl, Vicar Vision b.v., The Netherlands, email: [email protected]

10. REFERENCES
[1] S. Belongie, C. Carson, H. Greenspan, and J. Malik. Recognition of images in large databases using a learning framework. Technical Report CSD-97-939, University of California at Berkeley, 1997.
[2] B. Berlin and P. Kay. Basic color terms: Their universals and evolution. Berkeley: University of California Press, 1969.
[3] N. W. Campbell, W. P. J. Mackeown, B. T. Thomas, and T. Troscianko. The automatic classification of outdoor images. In Proceedings of the International Conference on Engineering Applications of Neural Networks, pages 339–342. Systems Engineering Association, 1996.
[4] N. W. Campbell, W. P. J. Mackeown, B. T. Thomas, and T. Troscianko. Interpreting image databases by region classification. Pattern Recognition, 30(4):555–563, 1997.
[5] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8):1026–1038, 2002.
[6] K. V. Chandrinos, I. Androutsopoulos, G. Paliouras, and C. D. Spyropoulos. Automatic web rating: Filtering obscene content on the web. In J. Borbinha and T. Baker, editors, Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries, pages 403–406, 2000.
[7] I. J. Cox, M. L. Miller, T. P. Minka, and T. V. Papathomas. The Bayesian image retrieval system, PicHunter: Theory, implementation, and psychophysical experiments. IEEE Transactions on Image Processing, 9(1):20–37, 2000.
[8] G. Derefeldt and T. Swartling. Colour concept retrieval by free colour naming: Identification of up to 30 colours without training. Displays, 16(2):69–77, 1995.
[9] G. Derefeldt, T. Swartling, U. Berggrund, and P. Bodrogi. Cognitive color. Color Research & Application, 29(1):7–19, 2004.
[10] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee,


D. Petkovic, D. Steele, and P. Yanker. Query by Image and Video Content: The QBIC system. IEEE Computer, 28(9):23–32, 1995.

[24] C. Palm. Color texture classification by integrative co-occurrence matrices. Pattern Recognition, 37(5):965–976, 2004. [25] R. W. Picard. Light-years from lena: video and image libraries of the future. In Proceedings of the 1995 International Conference on Image Processing, volume 1, pages 310–313, 1995.

[11] C. Y. Fung and K.-F. Loe. Learning primitive and scene semantics of images for classification and retrieval. In Proceedings of the 7th ACM International Conference on Multimedia '99, volume 2, pages 9–12, Orlando, Florida, USA, 1999. ACM.

[26] R. W. Picard and T. P. Minka. Vision texture for annotation. Multimedia Systems, 3(1):3–14, 1995.

[12] C. Y. Fung and K.-F. Loe. A new approach for image classification and retrieval. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 301–302. ACM, 1999.

[27] A. L. Ratan and W. E. L. Grimson. Training templates for scene classification using a few examples. In Proceedings of the IEEE Workshop on Content-Based Analysis of Images and Video Libraries, pages 90–97, 1997.

[13] T. Gevers and A. W. M. Smeulders. Pictoseek: combining color and shape invariant features for image retrieval. IEEE Transactions on Image Processing, 9(1):102–119, 2000.

[28] D. Roberson, I. Davies, and J. Davidoff. Colour categories are not universal: Replications and new evidence from a stone-age culture. Lanham, Maryland: University Press of America Inc., 2002.

[14] J. Gibson. The Ecological Approach to Visual Perception. Houghton Mifflin, Boston, 1979.

[29] A. Rosenfeld. From image analysis to computer vision: An annotated bibliography, 1955-1979. Computer Vision and Image Understanding, 84(2):298–324, 2001.

[15] R. Goldstone. Effects of categorization on color perception. Psychological Science, 5(6):298–304, 1995.

[30] J. R. Smith and S. F. Chang. Querying by color regions using the VisualSEEk content-based visual query system, chapter 2, pages 23–42. The AAAI Press, 1997.

[16] M. M. Gorkani and R. W. Picard. Texture orientation for sorting photos at a glance. In Proceedings of the International Conference on Pattern Recognition, volume 1, pages 459–464, 1994.

[31] M. Szummer and R. W. Picard. Indoor-outdoor image classification. In IEEE International Workshop on Content-Based Access of Image and Video Databases (CAIVD), pages 42–51, Bombay, India, 1998. IEEE Computer Society.

[17] B. J¨ ahne. Practical Handbook on Image Processing for Scientific Applications. CRC Press, 1997. [18] A. K. Jain and K. Karu. Learning texture discrimination masks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(2):195–205, 1996.

[32] E. L. van den Broek, M. A. Hendriks, M. J. H. Puts, and L. G. Vuurpijl. Modeling human color categorization: Color discrimination and color memory. In T. Heskes, P. Lucas, L. Vuurpijl, and W. Wiegerinck, editors, Proceedings of the 15th Belgian-Netherlands Conference on Artificial Intelligence, pages 59–68. Nijmegen: SNN, University of Nijmegen, 2003.

[19] P. Kay. Color. Journal of Linguistic Anthropology, 1:29–32, 1999. [20] T. Lin and H. Zhang. Automatic video scene extraction by shot grouping. In Proceedings of the 15th International Conference on Pattern Recognition, volume 4, pages 39–42, Barcelona, Spain, 2000. IEEE Computer Society.

[33] E. L. van den Broek, P. M. F. Kisters, and L. G. Vuurpijl. The utilization of human color categorization for content-based image retrieval. In B. E. Rogowitz and T. N. Pappas, editors, Proceedings of Human Vision and Electronic Imaging IX, volume 5292, pages 351–362, 2004.

[21] P. Lipson, E. Grimson, and P. Sinha. Configuration based scene classification and image indexing. In Proceedings of 16th IEEE Conference on Computer Vision and Pattern Recognition, pages 1007–1013. IEEE Computer Society, 1997.

[34] E. L. van den Broek and E. M. van Rikxoort. Colorful texture analysis. Pattern Recognition Letters, [submitted].

[22] T. P. Minka and R. W. Picard. Interactive learning using a “society of models”. Technical Report 349, MIT Media Laboratory Perceptual COmputing Section, 1996.

[35] P. van der Putten. Vicar video navigator: Content based video search engines become a reality. Broadcast Hardware International, IBC edition, 1999.

[23] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.

[36] J. Z. Wang. Integrated region-based image retrieval. Boston: Kluwer Academic Publishers, 2001.


A Data Management Layer for Visual Information Retrieval

Horst Eidenberger and Roman Divotkey
Vienna University of Technology, Institute of Software Technology and Interactive Systems, Favoritenstrasse 9-11, A-1040 Vienna, Austria, +43-1-58801-18853

{eidenberger, divotkey}@ims.tuwien.ac.at

ABSTRACT

1. INTRODUCTION

This case study describes the data management layer of the VizIR visual information retrieval project. VizIR is an open source framework of software tools for visual retrieval research. In content-based multimedia retrieval media objects are described by high-dimensional feature vectors. These feature vectors have to be stored in an efficient way in order to accelerate the retrieval process. VizIR database management is based on object-oriented persistence management. The database interface has a three tier architecture: a pattern-based persistence system hides the underlying database, an object-relational mapping system maps classes to entities and a relational database provides state-of-the-art database features (transactions, integrity, recovery, etc.). The described database management prototype can be downloaded from the VizIR project website.

Content-based visual information retrieval (VIR) is a field of multimedia research that aims at extracting meaningful (semantic) media information directly from the pixel level. Sophisticated algorithms (e.g. the MPEG-7 visual features [2, 7]) are used to locate relevant information (features, descriptors) in media objects. Usually, features are represented as highdimensional data vectors. For example, if all visual MPEG-7 features are used to describe a media object, the data vector has more than 320 dimensions. Dis-similarity of media objects is measured as distance between feature vectors. See [3, 6, 8] for more information on content-based visual information retrieval. The fundamental database problem of VIR is to establish the efficient storage of feature vectors in order to enable fast (but still flexible) content-based multimedia data mining. This case study describes the approach we implemented to solve this problem in the VizIR project [4]. VizIR aims at developing a software workbench of free tools for content-based image and video retrieval (see Section 2 for more information on VizIR). Below, we discuss general approaches for VIR database design, describe and argue for our design decision and give details on the concrete implementation in the VizIR framework (freely available from [10]).

Categories and Subject Descriptors H.2.4 [Database Management] Systems – Multimedia databases, object-oriented databases, relational databases. H.2.8 [Database Management] Database Applications – Data mining, image databases.

General Terms Management, Experimentation.

Performance,

Design,

The paper is organised as follows. Section 2 sketches the VizIR project. Section 3 points out principal data models for feature data. Section 4 describes the VizIR data management model. Finally, Section 5 describes selected implementation issues.

Reliability,

Keywords

2. BACKGROUND: THE VIZIR PROJECT

Content-based Visual Information Retrieval, Video Retrieval, Image Retrieval, Object-oriented Database Design, Database Management, Persistence Management, High-dimensional Indexing, Multimedia Databases.

Even though significant amounts of research on VIR have been conducted in recent years and a considerable number of research prototypes has been developed (see [8] for a quick overview), there is still no VIR software framework available that would satisfy the researchers' needs. Firstly, as similar methods are used for image and video retrieval, it would be desirable to support both media types in one environment. Furthermore, it would accelerate research work, if state-of-the-art VIR components (e.g. space to frequency transformations, kernelbased learning algorithms, user interfaces) would be readily available in an homogeneous environment.


With the VizIR project we are intending to satisfy these demands. VizIR is a framework of resources (mainly software



components implemented in Java) that are needed to build VIR prototypes. The software components include classes for media access, transportation and visualisation in user interfaces, for feature extraction (including the content-based MPEG-7 descriptors), for querying and refinement based on a novel 3D retrieval and browsing panel, for user interface design, and for visualisation of media metadata, evaluation and benchmarking. As the framework itself and all elements have to be extendible, it is imperative that the underlying database system does not make any assumptions on the elements' structure in order to keep them persistent. This constraint drives the database design considerations presented in Section 4.

art in content-based retrieval). Thirdly, as for classic RDBMS it is mostly impossible to define a mapping from feature vector elements to entity attributes. Finally, multimedia indexing structures are hardly implemented in classic RDBMS and more specialised products are often not operating system-independent or do not provide traditional RDBMS features. As it is very difficult to press polymorphic feature data in relational databases in fine-granular manner, we searched for alternative approaches of data representation. XML databases seem to provide ideal structures and properties for VIR data. Features can easily be mapped to XML documents (e.g. MPEG7 defines an XML representation of its visual features). Media objects are per se separated from metadata and stored externally. All data points can easily be accessed by using document models and (simple) querying languages (e.g. W3C DOM and XPath).

VizIR is an open project and all components are free under GNU General Public License. See [4] for a more detailed description on the VizIR project. All finished components (including the database layer presented in this paper) can be downloaded as source code from the project website [10].

One VIR-specific example for this group of systems is the PTDOM database [11]. PTDOM defines a document object model specific for the MPEG-7 features. All features (including those based on MPEG-7 types: vector, matrix) can be accessed on a fine-granular level and retrieved using XPath and databaseinternal user-defined functions (similar to stored procedures). Data elements can be indexed by B-trees. Of course, additionally, more sophisticated multimedia indexing structures could be implemented as well. The main drawback of PTDOM, in terms of practical application, is that the currently available implementation is strongly bound to commercial, operating system-dependent helper libraries.

3. RELATED WORK: DATABASE MANAGEMENT FOR FEATURE DATA One scientific challenge of VIR is the high dimensionality of feature vectors. For example, if all content-based MPEG-7 descriptors are used to describe an image, the description has more than 300 dimensions. Solving the dimensionality problem adequately must be one of the first issues in designing a VIR system. Still, it is mandatory for the success of VIR in general and the VizIR project in particular that the database layer meets a number of software engineering requirements: Database access has to be simple, efficient, domain-independent and operating system-independent. Additionally, the database management system has to provide traditional features (integrity, recovery, etc.). Before we designed the VizIR database layer we surveyed approaches that were used in existing VIR systems or suggested for the future.

The last VIR-specific approach that may become relevant in the future is the media mediator concept [9]. Media mediators are functions that are used to access media data live during a query. Conceptually, media mediators are defined on a semantic level and mapped to low-level features that extract information from the media samples. Theoretically, media mediators can be used to define arbitrary operations on media data but, as well, to implement distributed querying environments. The advantages of the media mediator concept are that everything is done on the fly and media objects are accessed in a fine-granular way. On the other hand, obviously, the comprehensive operations needed to implement mediators would be extremely resourceconsuming. Additionally, it would be almost impossible to accelerate the querying process using indexing structures. These drawbacks make it unlikely that the media mediator concept can ever be implemented in its original form. Still, if particular operations could be identified as basic building blocks for media mediators, these operations could be computed prior to query execution. Hereby, the querying process could be dramatically accelerated while the flexibility of the concept would be largely preserved.

Classic RDBMS (e.g. DB2 in QBIC [8]) fulfil all software engineering requirements easily. If used, media objects are usually stored externally, feature vectors are stored as BLOBs (often in one table per feature) and indexed by context-free structures (e.g. B-trees). Therefore, the data can only be accessed sequentially (by ID). More sophisticated access methods (such as dis-similarity measurement by distance functions; for example, implemented as stored procedures) cannot be used. Finegranular access would only be possible, if feature vector elements could be assigned to table attributes. This is usually impossible as many features have varying length. In recent years, sophisticated indexing structures have been developed for multimedia RDBMS (see [1] for an overview). Various R-trees, SS-trees, etc. have been proposed to allow for efficient organisation and access to high-dimensional media data. Ideally, raw media data would be stored outside the database. Feature metadata should be stored in fine granulation in the database to enable context-specific indexing. If multimedia indexing structures do exist, feature data can be selected using distance functions. Unfortunately, a number of drawbacks are connected to this approach. Firstly, most indexing structures have the tendency to become inefficient for really high-dimensional data (in the MPEG-7 case: 320+ dimensions). Secondly, most indexing structures are unable to deal with multiple distance measures in one index (state-of-the-

4. VIZIR DATA MANAGEMENT MODEL Below, we describe the data management model we designed for the VizIR project from the described palette of approaches. Subsection 4.1 describes the design decision. Subsequent subsections describe all relevant aspects of the VizIR data management model.

4.1 Use case-driven design decision
Surveying principal VIR approaches showed that we could basically choose between a classic RDBMS (with self-implemented multimedia indexing structures) and an XML database.



Figure 1. Layer structure of VizIR persistence system.
Figure 2. Descriptor-related entities (simplified).

be used behind the persistence system API, (2) any serialisable object can easily be made persistent and (3) database management is fully transparent to the rest of the VizIR framework.

VizIR is intended for general purpose VIR. For practical applicability it should provide reliable state-of-the-art persistence management. These requirements are best satisfied by classic RDBMS. An XML database would be a good choice, because the (implemented) visual MPEG-7 features are available as XML documents. Additionally, most feature structures can easily be represented in XML form. On the other hand, even professional XML databases have serious problems with handling large XML documents. Generally, implementing multimedia indexing structures would hardly make sense, since most features require variable distance measures. In this situation, an index would have to be defined for every distance measure used in the retrieval process. Obviously, following this approach would result in significant overload of indexing metadata. Furthermore, some distance measures used in VIR are not based on metrics and, in particular, do not meet the triangle inequality requirement. For these measures it would be even more difficult to define an index. Moreover, feature structures can be organised arbitrarily (e.g. as matrices). Additionally, in many retrieval situations, the query engine has to browse through the feature vectors sequentially anyway.

4.2 VizIR entities
Generally, the VizIR persistence management system needs to store media-related and descriptor-related data. For media objects, just the visual data and some textual metadata are stored. The structure needed for descriptor-related data is shown in Figure 2 (in UML syntax). It is required both on the database level (as entities) and on the application level (as classes). The main class is DescriptorInfo, which holds the management methods for the other components. DescriptorLogic contains the extraction algorithms; it may have an arbitrary structure and, as it is stateless, it is not made persistent. The actual (XML) descriptor data are held in Descriptor. Since descriptors may have widely varying appearances, each Descriptor is encapsulated by a DescriptorContainer. As this class has a pre-defined, fixed structure, it can easily be made persistent (see Section 5). Additionally, every Descriptor may belong to a group (e.g. an MPEG-7 descriptor scheme); this relationship is implemented in DescriptorInfo and DescriptorInfoCollection. Even though we did not have this generality in mind when we designed the VizIR persistence manager, the presented model is flexible enough to hold any type of feature data for any type of media. It could, for example, be employed to manage content-based features of audio streams or text features of arbitrary media objects.
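To make the entity relationships more concrete, the following minimal sketch models them as plain data classes. It is purely illustrative: the actual VizIR classes are Java, and any field names beyond the class names mentioned above (such as media_id) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DescriptorContainer:
    """Fixed-structure wrapper that is made persistent; the raw
    (XML) Descriptor data is kept as an opaque string property."""
    descriptor_xml: str          # the actual Descriptor data
    media_id: str = ""           # hypothetical link to the media object

@dataclass
class DescriptorInfo:
    """Management entry point: knows the extraction logic (DescriptorLogic)
    by class name and owns the persistent containers."""
    logic_classname: str
    containers: List[DescriptorContainer] = field(default_factory=list)

@dataclass
class DescriptorInfoCollection:
    """Groups descriptors, e.g. into an MPEG-7 descriptor scheme."""
    name: str
    members: List[DescriptorInfo] = field(default_factory=list)
```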


4.3 Persistence management layer
The persistence management layer is responsible for offering all database-relevant methods to the VizIR framework while hiding the concrete implementation of the object-relational mapping and the database. Figure 3 illustrates the implemented model. The chosen design follows state-of-the-art software design patterns.

In order to guarantee application independence and framework extendibility we decided to employ object-oriented persistence management and to map serialised software objects to tables of a relational database. Figure 1 depicts the resulting three-layer structure: the persistence system layer provides the methods needed to access the database (storage and retrieval), the mapping layer maps objects to entities, and the database layer provides transactions, integrity and recovery. The advantages of this solution are that (1) any mapping tool and any database can be used behind the persistence system API, (2) any serialisable object can easily be made persistent and (3) database management is fully transparent to the rest of the VizIR framework.

The main class PersistenceSystem is responsible for initialisation and for the creation of all database-related entities (media objects and descriptors). Additionally, it contains a factory class for the creation of PersistenceManager classes (PersistenceFactory). PersistenceManager encapsulates all methods needed for database access and transaction management. This class is used to put VizIR objects under persistence control, reload objects from earlier instances and retrieve collections of objects by name. Currently, the persistence manager supports only direct queries by ID (e.g. descriptor class name). Joins can be used to retrieve, for example, all feature vectors for one media object or all media objects of a particular media collection. Generally, the level of sophistication of the querying components depends on the object-relational mapping tool.
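The following sketch only illustrates the interface/implementation split described here and in the next paragraph; it is not the actual VizIR API (which is Java, see Section 5), and the method names and the stand-in implementation are hypothetical.

```python
from abc import ABC, abstractmethod

class PersistenceManager(ABC):
    """API seen by the framework; implementations depend on the mapping layer."""
    @abstractmethod
    def store(self, obj) -> None: ...
    @abstractmethod
    def load_by_id(self, cls, object_id): ...

class PersistenceFactory(ABC):
    @abstractmethod
    def create_manager(self) -> PersistenceManager: ...

class MappingLayerManager(PersistenceManager):
    """Mapping-layer specific implementation (stand-in for the real thing)."""
    def store(self, obj) -> None:
        print(f"persisting {obj!r}")   # would delegate to the O/R mapper
    def load_by_id(self, cls, object_id):
        return None                    # would run a query by ID

class MappingLayerFactory(PersistenceFactory):
    def create_manager(self) -> PersistenceManager:
        return MappingLayerManager()

# Framework code only ever sees the abstract interfaces:
manager = MappingLayerFactory().create_manager()
manager.store({"descriptor": "<ColorLayout>...</ColorLayout>"})
```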


In order to guarantee the exchangeability of the underlying mapping system, the persistence management classes implement the Bridge pattern: PersistenceFactory and PersistenceManager are just interfaces that define an API. The classes implementing these interfaces depend on the mapping layer. The factories PersistenceSystem and PersistenceFactory are responsible for instantiating the right implementing classes for a particular configuration of mapping layer and database.

Figure 3. Persistence management classes.

5. IMPLEMENTATION
The Java implementation of the VizIR persistence management system makes use of the Hibernate system on the mapping layer [5]. Hibernate was selected because it supports a wide range of commercial and open source database systems (including Oracle, DB2 and MySQL), provides powerful querying mechanisms and employs the Java Reflection API to analyse the structure of software classes. Furthermore, like VizIR, it is an open source project published under the GNU LGPL.

Classes that are made persistent using Hibernate have to meet a few requirements: a default constructor (without parameters, e.g. newInstance()) has to exist for each class, and accessor methods (get/set) have to be available for every resource. These methods are used through the Reflection API. Optionally, every class should have an ID tag. Only two pieces of information have to be provided externally: the mapping of resources to database data types and the primary/foreign key references in 1:n and n:m relationships. This information is provided in simple XML documents. Even though it is possible to inform Hibernate about relationships between entities, the system leaves the maintenance of referential integrity (at least for n:m relationships) to the user. Integrity can be achieved by implementing the Lifecycle interface and callback methods for data manipulation events (e.g. onDelete()).

We make use of the properties mechanism of the Hibernate system to store arbitrarily shaped feature data in the database without the need to define mappings for every new Descriptor class: the mapping is defined once for the resources of DescriptorContainer, and feature vectors (Descriptor objects) are properties of this class.

6. CONCLUSIONS AND FUTURE WORK
We tried to identify the most practicable database solution for a content-based visual information retrieval system that makes no assumptions about the features used or the application domains. The VizIR framework is intended to be a modern, usable workbench for visual information retrieval research; hence, grounding the system on a flexible and robust database layer was mandatory. It is interesting to note that the best solution turned out to be a classic relational database in combination with an object-oriented persistence manager. Using the described design, VizIR can deal with arbitrary feature data and database systems, and the programming effort for the VizIR user is reduced to a minimum. In fact, the VizIR persistence layer can be used to manage media objects and metadata (text or binary) of any kind. It is free software and can be downloaded from [10].

Future work will include performance tests with large MPEG-7 test datasets as well as architecture tests with mapping tools and database systems not considered so far.

7. ACKNOWLEDGMENTS
The authors would like to thank Christian Breiteneder for his valuable comments and suggestions for improvement. The VizIR project is supported by the Austrian Scientific Research Fund (FWF) under grant no. P16111-N05.

8. REFERENCES
[1] Böhm, C., Berchtold, S., and Keim, D.A. Searching in High-Dimensional Spaces: Index Structures for Improving the Performance of Multimedia Databases. ACM Computing Surveys 33, 3 (2001), 322-373.
[2] Chang, S.F., Sikora, T., and Puri, A. Overview of the MPEG-7 Standard. IEEE Transactions on Circuits and Systems for Video Technology 11, 6 (2001), 688-695.
[3] Del Bimbo, A. Visual Information Retrieval. Morgan Kaufmann, San Francisco CA, 1999.
[4] Eidenberger, H., and Breiteneder, C. VizIR – A Framework for Visual Information Retrieval. Journal of Visual Languages and Computing 14, 5 (2003), 443-469.
[5] Hibernate project website. http://www.hibernate.org/.
[6] Lew, M.S. (ed.) Principles of Visual Information Retrieval. Springer, Heidelberg, Germany, 2002.
[7] Manjunath, B.S., Salembier, P., and Sikora, T. Introduction to MPEG-7. Wiley, San Francisco CA, 2002.
[8] Marques, O., and Furht, B. Content-Based Image and Video Retrieval. Kluwer, Boston MA, 2002.
[9] Santini, S., and Gupta, A. Mediating Imaging Data in a Distributed System. In Proceedings of the SPIE Electronic Imaging Symposium, Storage and Retrieval Methods and Applications for Multimedia (San Jose CA, January 2004), SPIE, 365-376.
[10] VizIR project website. http://vizir.ims.tuwien.ac.at/.
[11] Westermann, G.U., and Klas, W. A Typed DOM for the Management of MPEG-7 Media Descriptions. Multimedia Tools and Applications, to appear.


Multiple-Camera People Localization in a Cluttered Environment
Gang Wei, Valery A. Petrushin, Anatole V. Gershman
Accenture Technology Labs, 161 N. Clark St., Chicago, IL 60601
+1.312.693-6852 / +1.312.693-6767 / +1.312.693-6613
[email protected], [email protected], [email protected]

ABSTRACT
With the rapid proliferation of video cameras in public places, the ability to identify and track people and other objects creates tremendous opportunities for important business and security applications. The paper presents the Multiple Camera Indoor Surveillance project, which is devoted to using multiple cameras, agent-based technology and knowledge-based techniques to identify and track people and summarize their activities. The paper describes a people localization system, which identifies and roughly localizes people in an indoor environment. The system uses low-level color features – a color histogram and average vertical color – for building person models, and the Bayesian decision making approach for people localization. The results of a pilot experiment that used 32 hours of data (4 days x 8 hours) showed average recall and precision values of 68% and 59%, respectively. Augmenting the system with domain knowledge, such as knowing the areas where working places in cubicles, doors, and passages are located and assigning more specific probabilities to them, allowed increasing the average recall to 87% and precision to 73%.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning; I.5.2 [Pattern Recognition]: Design Methodology – classifier design and evaluation.

General Terms
Algorithms, Experimentation.

Keywords
Multi-camera surveillance, indoor surveillance, Bayesian inference, people localization.

1. INTRODUCTION
The proliferation of video cameras in public places such as airports, streets, parking lots and shopping malls can create many opportunities for business and public safety applications, including surveillance for threat detection, monitoring parking lots and streets, customer tracking in a store, assets movement control, detecting unusual events in a hospital, monitoring elderly people at home, job analysis and working place redesign, etc. These applications require the ability to automatically detect, recognize and track people and other objects by analyzing video or image data. Although video surveillance has been in use for decades, the development of systems that can automatically detect and track people is still an active research area.

Many approaches have been proposed in recent years. They differ in various aspects such as the number of cameras used, the type of cameras (grayscale or color, mono or stereo, CCD or webcams, etc.) and their speed and resolution, the type of environment (indoors or outdoors), the area covered (a room or a hall, a hallway, several connected rooms, a parking lot, a highway, etc.), and the location of cameras (with or without overlapping fields of view). A vast number of papers are devoted to this research area, and we overview some of them in the related work section below. However, the performance of most systems is still far from the requirements of real applications, due to three major reasons. First, the capability of today's video and image analysis techniques is dwarfed by the complexity of the problem. For example, today's face recognition techniques, which are the most common approach for person identification in an image, are very susceptible to many factors, such as changes in luminosity, view angle and even very simple disguises. Second, most such systems are developed and tested in controlled "simulated" environments, e.g., a person deliberately walking back and forth in a room, instead of real public areas where things take place naturally. Finally, the efforts have mostly been focused on analyzing the video data without taking into consideration the domain knowledge about the environment, which can essentially improve the accuracy of both tracking and recognition.

To bridge the gap between the practical applications and the performance of tracking algorithms, we seek solutions in the following directions:
• Build simple and efficient agents that collaboratively analyze and aggregate the video data from multiple cameras, instead of developing sophisticated algorithms that rely on a single video stream.
• Take advantage of the data redundancy to improve the accuracy of tracking and recognition by fusing data from the cameras with overlapping fields of view.
• Take into account the knowledge about the environment, i.e. build "environment specific" systems.
• Develop systems that work in a real environment.

Figure 1. Location of surveillance cameras

To validate our vision, we started the Multiple Camera Indoor Surveillance (MCIS) project. The backbone of the project consists of 30 AXIS-2100 webcams that watch the whole Accenture Technology Labs' floor and a meeting room that is located on another floor (see Figure 1). The webcams cover two entrances, seven laboratories and demo rooms, two meeting rooms, four major hallways, four open-space cube areas and two discussion areas. Some areas with a high density of traffic and activities have more than one camera, and some spots are overlapped by up to four cameras. The total area covered is about 18,000 sq. ft. (1,670 sq. m). The backbone architecture also includes several computers. Each computer receives signals from 3-4 webcams, detects "events" and records the event's pictures in JPEG format. An event is any essential movement in the camera's field of view. The signal sampling frequency is about 3 frames per second. Another computer collects the events' pictures, converts them into an MPEG-1 movie and creates an event record in an SQL database. The event record has fields that keep the event's beginning time stamp, camera id, and a link to the event's movie. It also has other fields that are reserved for the results of more detailed analysis. The event database serves as a common repository both for people who do manual search of events and for automatic computer analysis.

The objectives of the MCIS project are the following:
• Create a realistic multi-sensor indoor surveillance environment.
• Create an around-the-clock working surveillance system that accumulates data in a database for three consecutive days and has a GUI for search and browsing.
• Use this surveillance system as a base for developing more advanced event analysis algorithms, such as people recognition and tracking, using collaborating agents and domain knowledge.

The following analyses or demonstrations have been developed or are currently under development:

• Search and browsing of the Event Repository database using a Web browser.
• Counting how many people are on the floor.
• Creating a people localization system that is based on evidence from multiple cameras and domain knowledge.
• Creating an awareness map that shows where a person is located and what he or she is doing at a given moment.
• Creating a real-time people tracking system that gives an optimal view of a person based on multiple cameras and prediction of the person's behavior.
• Creating a system that recognizes a person at a particular location and interacts with him/her; for example, the system can send a voice message to the person and get his/her response.
• Creating an event classification and clustering system.

This paper describes our approach to creating the people localization system, which uses the output of multiple cameras and knowledge about the environment and a person's behavior. The remainder of this paper is organized as follows. Section 2 contains an overview of some previous work in this research area. Section 3 gives an overview of the localization system. In Section 4 we describe visual feature extraction and our people localization algorithm using single or multiple cameras and domain knowledge. The experimental results are presented in Section 5. Section 6 summarizes the paper and describes some future work.

2. RELATED WORKS
There are many papers devoted to multiple camera surveillance. They differ in many aspects such as indoor/outdoor surveillance, people and/or vehicle tracking, using overlapping or non-overlapping cameras, using mono or stereo, color or grayscale cameras, etc. Below we focus on some research that deals with indoor people identification and tracking.

The system described in [1] is a one-camera system created for tracking people in subway stations; it used the luminance contrast in YUV color space to separate people blobs from the background. The coordinates and geometric features of the blobs are estimated, and a two-way matching matrices algorithm is used to track (overlapping) blobs.

In [2] a single static color camera has been used to track people in an indoor environment. It uses several interacting modules to increase tracking robustness. The modules are a motion tracker that detects moving regions in each frame, a region tracker that tracks selected regions over time, a head detector that detects heads in the tracked regions, and an active shape tracker that uses models of people's shape to detect and track them over time. The interaction among modules allows them to dynamically incorporate static objects into and remove them from the background model, to make predictions about a person's position and moving direction, and to recover after occlusions.

In Microsoft's EasyLiving project [3] two color stereo cameras have been used for real-time identification and tracking of up to three people in a rather small room (5 m by 5 m). The system evaluates 3D models of blobs and clusters them to fit people-shaped blob models. Then the centroids of the blobs are projected onto the room ground plan. A quantized RGB color histogram and histogram intersection are used for maintaining a person's identity. A histogram is estimated for each person viewed by each camera in each visited cell of a 10x10 grid of the floor plan. The person tracker module keeps the history of the person's past locations and uses it to predict the current location. If the predicted location contains several candidates, color histograms are used to disambiguate them. If no candidates are found, the system keeps unsupported person tracks active until new data arrive; for supported tracks the histories are updated and new predictions are calculated. In spite of the low image processing rate (about 3.5 Hz) the system works well with up to three people who are not moving too fast and are not wearing similarly colored outfits.

The M2Tracker system [4] uses from 4 to 16 synchronized cameras to track up to six people walking in a restricted area (3.5 m by 3.5 m). The system models people using the following models for segmenting images in each camera view: color models at different heights, presence probabilities along the horizontal direction at different heights, and ground plane positions tracked using a Kalman filter. Then the results of one camera's segmentation are matched for pairs of cameras to estimate 3D models for each person and to estimate the object location on the ground plane, using Gaussian kernels to create a location likelihood map. The system merges results from several pairs of cameras until the ground plane positions are stable. Then the current positions of people are updated and new predictions are calculated. Due to its high computational complexity the system cannot work in real time, but the authors hope that code optimization efforts and advances in computing will make it possible in the future.

The system presented in [5] uses several non-overlapping cameras and knowledge about the topology of paths between cameras. It probabilistically models the chain of observation intervals for each tracked person using a Bayesian formalization of the problem. To estimate the optimal chain of observations the authors transform the maximum a posteriori estimation problem into a linear program optimization.

The approach proposed in [6, 7] uses multiple synchronized grayscale overlapping cameras for tracking people and selecting the camera that gives the best view. The system consists of three modules: single view tracking, multiple view transition tracking and automatic camera switching. The system uses the following features for each person: locations of selected feature points, intensity of the selected feature points and geometric information related to a coarse 2D human body model. Multivariate Gaussian models and Mahalanobis distances are used for people modeling and tracking. The class-conditional distribution for spatial and spatial-temporal matching is used in the multiple view transition tracking for matching predicted location and body model size. Automatic camera switching is necessary if the person is moving out of the current camera's field of view, the person moves too far away, or the person is occluded by another person. The system selects the camera that will contain the person for the largest time (number of frames) according to the current prediction of the person's movement. Experiments with three cameras in various indoor environments showed high robustness of people tracking (96-98%).

The KNIGHTM system [8, 9] is a surveillance system that uses several overlapping and/or non-overlapping uncalibrated color cameras for people tracking. The system uses spatial and color Gaussian probability distributions for each person to identify and track people in one camera view. Person identification is based on voting of foreground pixels. If two or more people receive an essential percentage of votes from the same region, the system assumes that partial occlusion has happened. In case of complete occlusion a linear velocity predictor is used for disambiguation. In order to track people across multiple cameras the system, during a training period, learns the field-of-view lines of each camera as viewed in the other cameras. This information and knowledge of the cameras' locations are used for the identification of moving people. Experiments with three cameras and three different camera setups gave promising results.

The authors of the paper [10] suggest system architectures and scenarios for multiple camera systems that take advantage of recent achievements in video camera technology, such as omnidirectional and pan-tilt-zoom (PTZ) cameras. Using a combination of such cameras allows creating an intelligent surveillance system that can automatically select an optimal camera view to track and recognize people and their behavior.

In [11] an algorithm for unsupervised detection and categorization of human activity is proposed. The approach uses several days of video from one camera to detect co-occurrence between video clips and a large number of visual features. The preprocessing of the video data consists of deleting all frames without motion and downsampling the remaining data to 3 Hz. Then for each frame the following features are extracted: 5 by 5 bin 2D histograms of the number of foreground pixels, their color and texture. The feature vectors are quantized into N=500 bins (prototype vectors). Each video clip is represented by a set of features of downsampled frames. The co-occurrence matrix allows seeing which video clips have common features, but it does not take into account the similarity among features. To solve this problem the authors suggest finding a unified embedding space, in which the projections of features and clips are close to each other if they co-occur, and similar features are also mapped closer to each other. Clustering clips in the embedding space allows identifying similar usual and unusual events. A new clip can be classified by the k-nearest-neighbor approach in the unified embedding space using its prototype vector features. Experiments with several days of hospital video gave promising results.

3. OVERVIEW OF PEOPLE LOCALIZATION SYSTEM
The system consists of two major subsystems, which are connected through the result database: the event processing subsystem and the visualization subsystem (Figure 2). The event processing subsystem gets event data in chronological order from the Event database. The events related to a particular cluster of cameras are processed by the same agent, and synchronized findings go to a decision making agent that makes the final decision about people's locations and saves the results in the Result database. The agents use person models to identify people and domain knowledge to put the findings into context.

Figure 2. Architecture of the people localization system

The Visualization Tool shows the summary of the people's activities of the day, and allows the user to browse into greater detail. Figure 3 shows a screenshot of the tool. The office floor plan is located in the upper left corner. The different areas on the plan are painted in different colors. The video display area is located on the upper right side. The lower part shows the localization results. Each person's activity summary of the day is represented by two colored bars aligned with the timeline above. The narrow bar is the ground truth of the person's activities (if it is provided), which is manually annotated data, while the wider one is the results generated by the agents. When the person is detected and identified at a certain time, the corresponding segment on the bar is painted in the color of the area where the person is localized at that time. If the user clicks on the timeline, the thumbnail pictures of the people are displayed on the map. If the user clicks on the color bar of any person, the video clip that corresponds to the event is played in the video display area.

Figure 3. Snapshot of Visualization Tool window

4. PERSON DETECTION AND LOCALIZATION USING VISUAL FEATURES
4.1 Visual features
A person's most distinguishable visual identity is his or her face. However, as shown in Figure 4, in many practical applications the size and image quality of the face do not allow traditional face recognition algorithms to work reliably, and sometimes the face is not visible at all. Therefore, our people localization system uses face recognition only as an auxiliary means that is applied to some areas of some cameras. The other salient characteristics of a person are the sizes of the body, the color of the hair and the color of the clothes the person is wearing. On any given day, a person usually wears the same clothes, and thus the color of a person's clothes is a consistent and good discriminator (unless everybody wears a uniform).

Figure 4. Example of pictures when face recognition fails

For mainstream processing we used two lower-level color features that are sufficient to distinguish different people based on their clothes: a color histogram and the average vertical color. We use the color histogram in the RGB color space. For the detected body region R the extracted feature is a vector of length 48, consisting of three 16-bin histograms, one for each of the three color components. For region R, the histogram value of component c and bin m is defined as

h_c(m) = (1/|R|) Σ_{(x,y)∈R} { 1 if c(x,y) ∈ m; 0 otherwise }        (1)

where |R| is the number of pixels of the region R and c(x, y) is the quantized value of the color component at pixel (x, y).

The color of clothes usually varies in the vertical direction. To account for this variation we used the average vertical color, which is calculated as follows. The top 15% (the head) and bottom 20% (the feet, which also often include shadow) of the detected body blob R are discarded, and the rest of the region R' is divided into M horizontal stripes of equal width. The average color is calculated for each stripe and for each color component, giving a 3-by-M feature vector. For color component c and stripe region R(m) the average value is calculated using the following formula:

v_c(m) = (1/|R(m)|) Σ_{(x,y)∈R(m)} c(x,y)        (2)
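To make the two features concrete, the following sketch computes them for a segmented body region given as pixel data. It is an illustrative reconstruction of formulas (1) and (2), not the authors' code; the function names and the simple row-based stripe split are assumptions.

```python
import numpy as np

def color_histogram(pixels, bins=16):
    """Formula (1): three 16-bin histograms (R, G, B), 48 values in total,
    normalized by the number of pixels in the region."""
    pixels = np.asarray(pixels, dtype=float)     # shape (|R|, 3), values 0..255
    feature = []
    for c in range(3):                           # one histogram per color component
        hist, _ = np.histogram(pixels[:, c], bins=bins, range=(0, 256))
        feature.extend(hist / len(pixels))
    return np.array(feature)                     # length 48

def average_vertical_color(region_rows, m_stripes=8):
    """Formula (2): drop the top 15% and bottom 20% of the blob rows, split
    the rest into M horizontal stripes and average each channel per stripe."""
    rows = region_rows[int(0.15 * len(region_rows)): int(0.80 * len(region_rows))]
    stripes = np.array_split(np.arange(len(rows)), m_stripes)
    feature = []
    for stripe in stripes:
        if len(stripe):
            stripe_pixels = np.concatenate([np.asarray(rows[i], dtype=float) for i in stripe])
        else:
            stripe_pixels = np.zeros((1, 3))
        feature.extend(stripe_pixels.mean(axis=0))
    return np.array(feature)                     # length 3*M
```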

We varied the number of stripes M from 2 to 16 and found that the optimal number was 8. The above visual features are used for both people modeling and people identification. When a person is detected, his or her visual features are extracted and compared with the person models. The similarity between the current evidence E and a person model M(Hi) is based on a linear combination of Euclidean distances between corresponding features and a threshold T. The threshold is estimated from two empirical distributions: the distances to the modeled person and the distances to all other people. The same distributions serve for converting the distance D into a probabilistic measure P(E|Hi) using linear interpolation (3) with the estimated threshold T and distances r1 and r2. Figure 5 illustrates this conversion.

P(E|Hi) = { 1                       if D ≤ T − r1
            (T + r1 − D) / (2 r1)   if T − r1 < D ≤ T
            (T + r2 − D) / (2 r2)   if T < D ≤ T + r2
            0                       if D > T + r2 }        (3)
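A direct transcription of the piecewise mapping (3) might look as follows; the parameter names mirror the text (threshold T and interpolation ranges r1, r2) and the example values are made up.

```python
def distance_to_probability(d, t, r1, r2):
    """Convert a feature distance d into P(E|Hi) by linear interpolation
    around the threshold t, as in formula (3)."""
    if d <= t - r1:
        return 1.0
    if d <= t:
        return (t + r1 - d) / (2.0 * r1)
    if d <= t + r2:
        return (t + r2 - d) / (2.0 * r2)
    return 0.0

# Example with made-up values: at the threshold itself the measure is 0.5.
print(distance_to_probability(0.30, t=0.30, r1=0.10, r2=0.15))  # -> 0.5
```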

Figure 5. Converting distance into a probabilistic measure

Since people may wear different clothes from day to day, the person models must be updated every day. At the entrance of the floor we installed a fingerprint recognition system and a camera viewing it. Anyone who participates in the experiment must register at the entrance, and the system creates this person's model based on several pictures taken by this camera. Unfortunately, due to varying illumination conditions at different locations, these initial models are not enough for reliable performance; they need to be adapted for each camera. Currently we do not have an automatic solution to this problem and we adapt the models by selecting pictures manually.

4.2 People localization
The term "localization" in our research means finding a person in a particular area (location). If only one camera watches a location, then "to localize a person" means to find this person in the pictures that the camera takes. If several cameras watch a location, then the person should be identified in the pictures of one or more cameras, depending on how their fields of view overlap. We use this term in opposition to the term "tracking", which assumes that a more precise location of an object (3D coordinates in a room space or a 2D projection on a floor plan) and the trajectory of movement are required. For many applications this kind of rough localization (up to the room, hallway or cubicle) is sufficient.

In the experiments described below we used only eight of the thirty cameras. Figure 1 shows the cameras and locations that were used. We call a "camera cluster" a set of cameras that view the same location. The camera C0, which is located at the entrance to the floor, is used only for initial person model acquisition, as described above. The camera cluster K1 = {C1} contains only one camera, which views a meeting room (L1) on another floor. The cluster K2 = {C2, C3, C4, C5} consists of cameras that view a hallway (L2), a discussion area (L4) and their intersection (L3). The cluster K3 = {C6} has a camera that watches a laboratory. And the cluster K4 = {C7, C8} has two cameras that view an open-space cube farm.

Let us first describe the people localization process for one camera. Then we shall extend our description to a cluster of cameras and, finally, to all clusters.

4.2.1 Single camera localization
Suppose we have only one camera that watches a location. It captures events and saves them in the event database. The processing agent gets the events, which are sequences of frames, in chronological order and processes them frame by frame in the following way. First, it detects moving objects by subtracting the background image and applying post-processing steps, such as thresholding and morphological operations, to remove noise and shadows. Figure 6 shows the results of this step. As the background may change for various reasons, the background image for each camera is updated by the event capturing agent when no motion has been detected for a certain time interval. This background image is provided with the event as the first image of the sequence.

Figure 6. Person detection and segmentation

The next step is extracting features from the segmented regions using formulas (1) and (2), calculating the distances to the person models and converting them into probabilistic measures (3).

We use the Bayesian inference approach for decision making. Suppose we are interested in the localization of N people. Each person has a model, and additionally we have a model for an "unknown person". This is necessary because there are more than N people on the floor, and we have to have an additional hypothesis to fit the requirements of the Bayesian approach. This model is trained on the images of all people who enter the floor.

For each location we have to estimate a probability matrix that gives us the prior probability P(Hi|Tj) of seeing person Hi at this location during time interval Tj. We used one-hour intervals from 9 a.m. to 5 p.m. for Tj, j=1,..,8. The prior probabilities can be estimated from data or assumed to be equal if no data are available. For each detected region we use the probabilistic measures (3) and the prior probabilities for the N+1 hypotheses to estimate the posterior probabilities using the Bayes formula. To obtain a more reliable estimate we can apply this procedure to a sequence of frames. Usually we use sequences of length 8-10 frames (~3 seconds) or until the region is available. To track a region a simple heuristic rule is used: search for a region of similar color in the vicinity of the previous region. In case of occlusion the frame is skipped. Thus, for each region Rk that was found in L consecutive frames we have the probabilities that this region represents a person Hi, estimated using (4):

P(Hi | Rk) = P(Hi | Tj) · Π_{l=1..L} Pl(Rk | Hi) / P(Rk)        (4)

where P(Rk) = Σ_{i=1..N+1} P(Hi | Tj) · Π_{l=1..L} Pl(Rk | Hi) is the complete probability for the region Rk.
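The following sketch shows how the posterior in (4) can be computed for one region observed in L consecutive frames. It is only an illustration of the formula under the stated assumptions, not the authors' implementation; the per-frame likelihoods are taken to be the measures produced by (3), and the example numbers are invented.

```python
import numpy as np

def region_posteriors(priors, frame_likelihoods):
    """priors: shape (N+1,), prior P(Hi|Tj) for the current time interval.
    frame_likelihoods: shape (L, N+1), per-frame measures Pl(Rk|Hi) from (3).
    Returns P(Hi|Rk) for every hypothesis, as in formula (4)."""
    priors = np.asarray(priors, dtype=float)
    likelihood_product = np.prod(np.asarray(frame_likelihoods, dtype=float), axis=0)
    joint = priors * likelihood_product            # numerator of (4)
    total = joint.sum()                            # complete probability P(Rk)
    return joint / total if total > 0 else np.full_like(priors, 1.0 / len(priors))

# Made-up example: 2 known people plus the "unknown person" hypothesis,
# observed over 3 frames. A person is accepted if the posterior exceeds 0.6.
posteriors = region_posteriors(
    priors=[0.5, 0.3, 0.2],
    frame_likelihoods=[[0.9, 0.2, 0.3], [0.8, 0.1, 0.4], [0.7, 0.3, 0.3]],
)
print(posteriors, posteriors.argmax())
```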

Since all regions are processed independently, it may turn out that two regions have a high probability of representing the same person (who is not "unknown"). In this case the region with the higher probability wins, and the other region's probability is halved and renormalized. A person is claimed to be found in a location if his/her posterior probability exceeds a threshold (=0.6).

4.2.2 Multiple cameras localization
Let us first consider the case when we have a cluster of cameras with overlapping fields of view. Then we have events from different cameras that overlap in time. The agent that processes the cluster first gets the events from all cameras of the cluster, sorts them chronologically and finds the intersections. The segments of events without intersections are processed as described above, but for the overlapping segments we have to merge the probabilities obtained for each camera. This could be done in different ways, based on how reliable the different cameras are and using different merging techniques, but currently we use the maximum as the merging function and consider all cameras to be equally reliable. For example, Table 1 shows the results of processing an overlapping segment of events for the cluster of four cameras (C2, C3, C4, and C5), when a person is walking along the hallway in the location L3 and another person is standing in the location L4 (see Figure 1). The cameras C2 and C3 view one region and, due to bad illumination, did not come to a definite decision but gave the highest estimate to the "unknown person" (the hypothesis H0). The cameras C4 and C5 view two regions and gave the highest estimates to the hypotheses "person #1" (H1) and "unknown person" (H0). The probability values of these two hypotheses exceed the threshold and are accepted for the current event.

Table 1. Merging probabilities of multiple cameras

Camera    H1        H2        H3        H4        H0
C2        0.0745    0.2138    0.2757    0.0965    0.3395
C3        0.0028    0.0375    0.1948    0.2818    0.4831
C4        0.8745    0.1538    0.2003    0.2456    0.5360
C5        0.7653    0.2264    0.1935    0.1372    0.6165
Merge     0.8745    0.2264    0.2757    0.2818    0.6165

Finally, the probabilities for each interval go to the decision-making module, which merges them with the event probabilities from other camera clusters using the above approach. The only difference is that the final probability for each person is normalized over all locations.
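As a small illustration of this merging step (maximum over equally reliable cameras, followed by per-person normalization over locations), consider the following sketch. The data layout is hypothetical; the values reproduce the camera rows of Table 1.

```python
import numpy as np

def merge_cluster(per_camera_probs):
    """Merge per-camera probabilities for one location using the maximum,
    treating all cameras as equally reliable."""
    return np.max(np.asarray(per_camera_probs, dtype=float), axis=0)

def normalize_over_locations(location_probs):
    """Final step in the decision-making module: for each person, normalize
    the merged probabilities over all locations."""
    probs = np.asarray(location_probs, dtype=float)   # shape (locations, hypotheses)
    totals = probs.sum(axis=0)
    return np.divide(probs, totals, out=np.zeros_like(probs), where=totals > 0)

# Hypotheses H1..H4 and H0 for cameras C2..C5 (values from Table 1):
cluster_k2 = [
    [0.0745, 0.2138, 0.2757, 0.0965, 0.3395],
    [0.0028, 0.0375, 0.1948, 0.2818, 0.4831],
    [0.8745, 0.1538, 0.2003, 0.2456, 0.5360],
    [0.7653, 0.2264, 0.1935, 0.1372, 0.6165],
]
print(merge_cluster(cluster_k2))  # -> [0.8745 0.2264 0.2757 0.2818 0.6165]
```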

4.2.3 Localization using domain knowledge
Using static cameras in a slowly changing office environment has the advantage of knowing "where to look". Some areas in a picture taken by a static camera are more important for localizing people than others. Such areas are working places in cubicles, armchairs in halls, doors, passages, etc. Other areas are important to exclude from consideration: blinking screens of computer or TV monitors, lamps, reflective surfaces, etc. Using a layout for each camera that marks all important/unimportant areas and assigns specific meanings to them is very useful knowledge that allows increasing the accuracy of people localization. We used areas such as working places in cubicles, doors, and passages to locate people in these places. This extension requires assigning more prior probabilities, but did not change much in the probabilistic approach described above. At some locations a face recognition approach that is based on skin color detection and eigenfaces [12, 13] has been applied.

5. EXPERIMENTAL RESULTS
For our pilot experiment we used eight cameras that are integrated into four clusters (see Figure 1 and the description above). Four people served as volunteers to be localized in the experiment. Every day up to fifty people work on the floor; the color features of most of them have been used for building the "unknown person" model. To evaluate the accuracy, we recorded four days of video data, with eight hours per day. One day of data was used for prior probability estimation, and the rest for testing. We used precision (P) and recall (R) as measures for evaluating the performance of the system (5):

P = C / T,    R = C / A        (5)

where C is the number of events in which people were correctly localized by the system, A is the number of events in which people are actually visible (ground truth), and T is the number of events for which the system claimed that a person is located in that location.

Using only visual features the system obtained an average recall value of 68.23% and a precision of 59.16%. Using the domain knowledge, the system improved its performance: the average recall and precision increased to 87.21% and 73.55%, respectively. Figure 7 shows the accuracy comparison with and without the use of domain knowledge. In the figure, the small dots and triangles show the results for the four volunteers individually, while the larger ones show the average values using visual analysis only and with domain knowledge, respectively.

Figure 7. Recall and precision of people localization system
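The evaluation in (5) can be computed directly from per-event decisions; the sketch below assumes that ground-truth and system outputs are given as sets of event identifiers for one person and one location, which is an assumption about the bookkeeping rather than the authors' setup.

```python
def precision_recall(claimed_events, ground_truth_events):
    """Formula (5): P = C/T and R = C/A, where C is the number of correctly
    localized events, T the number of claimed events, A the ground truth."""
    correct = len(claimed_events & ground_truth_events)   # C
    precision = correct / len(claimed_events) if claimed_events else 0.0
    recall = correct / len(ground_truth_events) if ground_truth_events else 0.0
    return precision, recall

# Made-up example: the system claims 10 events, 7 of which are correct,
# out of 9 events in which the person was actually visible.
claimed = set(range(10))
truth = set(range(3, 12))
print(precision_recall(claimed, truth))  # -> (0.7, 0.777...)
```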

6. CONCLUSION AND FUTURE WORK
This paper presents the Multiple Camera Indoor Surveillance project and reports on the preliminary performance results of the people localization system. The system uses low-level color features for people identification and localization. Merging results from multiple cameras and augmenting the system with domain knowledge proved to increase the system's accuracy. For future research we plan to use roughly calibrated cameras for more reliable tracking of moving objects and for estimating the sizes of a person's body. We also plan to experiment with different visual features and with other tracking and recognition techniques.

7. REFERENCES
[1] Fuentes, L.M. and Velastin, S.A. People tracking in surveillance applications. Proc. 2nd IEEE International Workshop on PETS, Kauai, Hawaii, USA, December 2001.
[2] Siebel, N.T. and Maybank, S. Fusion of Multiple Tracking Algorithms for Robust People Tracking. Proc. 7th European Conference on Computer Vision (ECCV 2002), Copenhagen, Denmark, May 2002, vol. IV, pp. 373-387.
[3] Krumm, J., Harris, S., Meyers, B., Brumitt, B., Hale, M., and Shafer, S. Multi-camera Multi-person Tracking for EasyLiving. Proc. 3rd IEEE International Workshop on Visual Surveillance, July 1, 2000, Dublin, Ireland.
[4] Mittal, A. and Davis, L.S. M2Tracker: A Multi-view Approach to Segmenting and Tracking People in a Cluttered Scene. International Journal of Computer Vision, vol. 51 (3), 2003, pp. 189-203.
[5] Kettnaker, V. and Zabih, R. Bayesian Multi-camera Surveillance. Proc. IEEE Conference on Computer Vision and Pattern Recognition, June 23-25, 1999, Fort Collins, Colorado, pp. 2253-2259.
[6] Cai, Q. and Aggarwal, J.K. Tracking Human Motion using Multiple Cameras. Proc. International Conference on Pattern Recognition, August 1996, Vienna, Austria, pp. 68-72.
[7] Cai, Q. and Aggarwal, J.K. Tracking Human Motion in Structured Environments using a Distributed-camera System. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 11, November 1999, pp. 1241-1247.
[8] Khan, S., Javed, O., Rasheed, Z., and Shah, M. Human Tracking in Multiple Cameras. Proc. 8th IEEE International Conference on Computer Vision, July 9-12, 2001, Vancouver, Canada, Vol. 1, pp. 331-336.
[9] Javed, O., Rasheed, Z., Atalas, O., and Shah, M. KnightM: A Real-Time Surveillance System for Multiple Overlapping and Non-overlapping Cameras. The Fourth IEEE International Conference on Multimedia and Expo (ICME 2003), July 6-9, 2003, Baltimore, MD.
[10] Huang, K.S. and Trivedi, M.M. Distributed Video Arrays for Tracking, Human Identification, and Activity Analysis. The Fourth IEEE International Conference on Multimedia and Expo (ICME 2003), July 6-9, 2003, Baltimore, MD, Vol. 2, pp. 9-12.
[11] Zhong, H. and Shi, J. Finding (Un)Usual Events in Video. CMU-RI-TR-03-05, Carnegie Mellon University, 2003.
[12] Wei, G. and Sethi, I.K. Face Detection for Image Annotation. Pattern Recognition Letters, Vol. 20, 1999.
[13] Turk, M. and Pentland, A. Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1), 1991.

A Framework for Bayesian Classification on Banner Images
Charles Daniel and Qin Ding
Pennsylvania State University – Harrisburg
Middletown, PA 17057, USA
[email protected]

ABSTRACT
Due to the increasing volume of image data available, it is important to develop new applications and techniques to perform data mining on images. This paper presents a data mining application on a special type of images, i.e., banner images. A banner image is an image file that is displayed on a website and used as an advertisement for some product. A banner image is designed in such a way that it will attract web users into clicking the image and possibly further completing the sale of the advertised product. Analyzing the relationship between the clicks (measured by the ratio of clicks to views, called the click-thru rate) and the attributes of banner images can help improve the effectiveness of advertisement. In this paper we propose a framework called Bayesian Banner Profiler, in which we apply Bayesian Classification to predict the category of click-thru rates based on the attributes extracted from the banner images. We also address the issue of how to select important image attributes. The system is implemented in Perl and currently supports classification on GIF banner images.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications

General Terms
Management, Design

Keywords
Bayesian classification, Banner images, Multimedia data mining

1. INTRODUCTION
In the fast-paced world of Internet Web Advertising, the look and feel of an advertisement can make or break the run of an advertising campaign. Web advertisements can take on many forms, one of which is the banner image. A banner image in its simplest form is an image file displayed on a website and used as an advertisement for some product to attract web users into clicking the image. Once a web user clicks the image, they are transported to the advertiser's web page (also called "the Landing Page") in order to collect the user's information and to complete the sale of the advertised product. The point of entry (or initial advertisement placement) is usually on a website owned by an individual or company (called "a Web Publisher") with no particular relationship to the advertiser of the product they are helping to advertise. There can be many different points of entry (websites) for the placement of the banner image, but they all eventually lead to an advertiser's landing page. The advertiser's landing page usually consists of a web page containing a form with fields asking for particular information about the user (typically name, address, phone, etc.) that is needed in order to complete the sale of the product.

Advertisers, contrary to the name given to them by the industry, don't generally handle the advertisement distribution aspects themselves. Rather, they leave the complications of distribution and tracking of advertising media to a third party called a Network. A Network is basically the middle-man between the Advertiser and the Web Publisher (known as an "Affiliate" to the Network). An Affiliate simply places a small piece of HTML on their website which causes a user's web browser to contact the Network's servers to download a banner image. This download request triggers a "view" hit for the downloaded banner image on the Network's servers. In addition to the HTML banner image reference, the Affiliate also places a hyper-link around the banner image which causes browsers to send a request for a web page to the Network's server whenever the web user clicks on the banner image. When the user does click on the banner image and their web browser sends a page request to the Network's server, it triggers a "click" hit for the banner image and a subsequent HTTP redirect to the Advertiser's landing page. The Network charges the Advertiser for distributing banner images depicting their products to the web surfing public on Affiliate web sites. The Network then, after taking a percentage for its commission, pays the Affiliate for the placement of the advertisement on their web site. In order to fairly distribute funds to the appropriate Affiliates and charge the Advertiser the appropriate amount, the Network has to keep track of every single view and click of every single banner image on every one of their Affiliates' web sites. Needless to say, there are millions upon millions of views and clicks tracked every day, and all the tracking data are kept in a database table by the Network.

The goal of this paper is to analyze the statistical information of the ratio of clicks to views, i.e., the click-thru rate, and map it onto a classification over banner image attributes, thereby giving us a tool for predicting (with a certain probability) the statistical (click-thru) outcome of a new banner image. The classification method currently used is Naive Bayesian Classification. The end result is a web-based program capable of taking in a GIF image as input, calculating the probable classifications (click-thru) based on the attributes chosen for consideration by the user, and outputting the results in a sorted manner.

This banner image profiling tool will be useful for many parties in the web advertising industry. Advertisers can test new banner images to determine their probable profit (via the click-thru statistics of banner images that have already run) and choose or design banners so as to maximize their profit. Affiliates can do the same within the scope of their site, so that they can determine the types of banner images that do well within their site; they may even go so far as to determine the optimum position for the banner image on their website based on the results of the classification. The Network, being the middle-man, is capable of using this probability knowledge in classifying new banner images so as to dynamically send statistically probable profit-maximizing banner images to the users' web browsers as they visit an Affiliate's website.

Although much work has been done on mining image data [2, 3, 4, 5], to the best of our knowledge, our Bayesian Banner Profiler system is a novel application of data mining on banner images. Our system is implemented in Perl and currently it supports classification on GIF banner images.

The rest of the paper is organized as follows. Section 2 briefly reviews naïve Bayesian Classification. Section 3 details the Bayesian Banner Profiler framework. Section 4 presents some implementation details and performance analysis. Finally we conclude the paper and discuss some future work.

2. NAIVE BAYESIAN CLASSIFICATION
A Bayesian Classifier is a statistical method that can predict the class membership of an otherwise unclassified object. The classifier is based on Bayes' theorem for calculating "posterior probabilities". A posterior probability P(C | A) is the probability that the unclassified object belongs to class C given the known background information A. Bayes' theorem provides a way of calculating P(C | A) using the knowledge of P(C), P(A), and P(A | C), where P(C) is the probability of class C occurring in the entire data set, P(A) is the probability of the attributes occurring in the entire dataset (which is a constant across the classes), and P(A | C) is the probability of the attributes A occurring in the class C. Bayes' theorem can be formalized as follows:

P(C | A) = P(A | C) * P(C) / P(A)

Using this equation, it is possible to compute the probability that an unclassified object having attributes A belongs to each of the classes C. To find the actual class that the unclassified object belongs to, we simply maximize the probability (the highest probability wins). That is:

max(P(Ci | A)) for all classes i
or, equivalently, max(P(A | Ci) * P(Ci)) for all classes i

Since P(A) is constant across the classes, it can be dropped as it is not worth maximizing. Naive Bayesian Classification is named so because of its naive assumption of "class conditional independence", that is, it assumes there are no dependence relationships between the attributes A. However, Naive Bayesian Classification has been shown to produce good results even when this assumption proves false [1].

For the purposes of this paper, we defined our classes C as click-thru rates (clicks/views) and our attributes A as the various banner image attributes. Doing so gives us the ability to predict the probability that an unknown object (banner image) having the attributes (image attributes) A belongs to a class C (a click-thru value). This means we can extract the image attributes from a new unclassified banner image and figure out the most probable click-thru rate that the banner image will achieve once we actually put it into the real web world.
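The following sketch shows the maximization described above on quantized attributes, using per-class attribute counts with simple add-one smoothing. It is a generic illustration of Naive Bayes rather than the profiler's Perl code; the data layout and the smoothing constant are assumptions.

```python
from collections import defaultdict

def train_counts(examples):
    """examples: list of (quantized_attribute_dict, class_label).
    Returns per-class counts and per-(class, attribute) value counts."""
    class_counts = defaultdict(int)
    value_counts = defaultdict(lambda: defaultdict(int))   # (class, attr) -> value -> count
    for attrs, label in examples:
        class_counts[label] += 1
        for attr, value in attrs.items():
            value_counts[(label, attr)][value] += 1
    return class_counts, value_counts

def classify(attrs, class_counts, value_counts, smoothing=1.0):
    """Return the class maximizing P(A|Ci)*P(Ci) over the chosen attributes."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total                               # P(Ci)
        for attr, value in attrs.items():
            seen = value_counts[(label, attr)]
            score *= (seen[value] + smoothing) / (count + smoothing * (len(seen) + 1))
        scores[label] = score
    return max(scores, key=scores.get), scores

# Made-up example with two click-thru categories ("low", "high"):
data = [({"intensity": 3, "num_colors": 2}, "low"),
        ({"intensity": 7, "num_colors": 5}, "high"),
        ({"intensity": 6, "num_colors": 5}, "high")]
counts = train_counts(data)
print(classify({"intensity": 6, "num_colors": 5}, *counts))
```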

62



¾









num_frames - The total number of frames in the entire GIF file ¾ total_opaque_pixels - Calculated by counting all nontransparent pixels in all Histogram Color Data - The number of occurrences of each color in the entire image ¾ histofreq_1 ... histofreq_5 - The top 5 colors in the image (in hex notation) ¾ histogram_num_colors - The total number of unique colors in all frames Frame Data ¾ frame_time:min - The minimum delay from one frame to another between all frames in the entire GIF ¾ frame_time:max - The maximum delay from one frame to another between all frames in the entire GIF ¾ frame_time:avg - The average delay from one frame to another between all frames in the entire GIF Intensity Data - (from Hue/Saturation/Intensity color model) The grayscale intensity value obtained by averaging the red, green and blue channel values together ¾ intensity:min - The minimum intensity in the entire GIF ¾ intensity:max - The maximum intensity in the entire GIF ¾ intensity:avg - The average intensity in the entire GIF (using total_opaque_pixels) Primaries Channel Data - Separates a pixel into its primaries parts: red, green and blue ¾ primaries:red_avg - The average red contribution to the entire GIF ¾ primaries:green_avg - The average green contribution to the entire GIF ¾ primaries:blue_avg - The average blue contribution to the entire GIF ¾ primaries:red_percent - Percentage of the GIF contributed to by red ¾ primaries:green_percent - Percentage of the GIF contributed to by green ¾ primaries:blue_percent - Percentage of the GIF contributed to by blue

• •



If the value is a scalar (i.e., string), it will use the scalar as the new quantized value. This is useful for essentially canceling the effect of an attribute by mapping all values into one value. If the value is a subhash, it tries to map the value from the attribute hash as a key in this subhash and uses the value it gets as the new quantized value. If the value is a subarray (containing arrays of value ranges), it iterates through the subarray looking for a range that the attribute value falls within and uses that index in the subarray as the new quantized value. If the value is a reference to a Perl code (anonymous subroutine), it will call the subroutine passing it the reference to the parent attribute hash and the actual attribute value. It is that subroutine's responsibility to modify the attribute value into a new quantized value.

Currently the system makes use of the subarray of value ranges to map several of the attributes that deal with pixel values. The anonymous Perl code method is used when we map the (sorted) histogram subarray into histofreq_1, histofreq_2, ..., histofreq_5, and also as a quick way to mutate and round up the click-thru values (considered the 'class' attribute). We believe this design of the quantizationSchema provides maximum flexibility in translating any arbitrary attribute value into a quantized value; a small example schema is sketched below. Since the quantizationSchema is associated with a profiler, we can change the entire inner working of how image attributes are dealt with simply by choosing a different profiler from the user interface.
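As a rough sketch (the attribute names follow Section 3.1, but the concrete ranges, keys and mappings are invented rather than the ones shipped with the system), a quantizationSchema exercising all four value types could look like this:

use strict;
use warnings;

my $quantizationSchema = {
    # scalar: collapse every value of this attribute into a single bucket
    width            => 'ignored',

    # subhash: direct value-to-value mapping
    has_transparency => { 0 => 'opaque', 1 => 'transparent' },

    # subarray of value ranges: the index of the matching range becomes the value
    'intensity:avg'  => [ [0,31], [32,63], [64,95], [96,127],
                          [128,159], [160,191], [192,223], [224,255] ],

    # code reference: arbitrary translation of the 'class' attribute
    clickthru        => sub {
        my ($attr_hash, $value) = @_;
        return int($value * 1000 + 0.5);   # scale and round the click-thru
    },
};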

3.3 Bayesian Probability Computation Algorithm

The Bayesian probability computation algorithm uses the Naive Bayesian classification method to compute the probability that the input image falls within a certain class of images, based on the image attributes. The algorithm takes as input the quantized image attributes hash, the bayesianCounts hash, and an array of attributes (chosen by the user) that it should use in the probability computation. It then computes, for each class (i.e., classification category), the probability that the image belongs to it, using the quantized image attributes hash.
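A minimal sketch of this computation is given below; the internal layout of the bayesianCounts hash (per-class totals and per-class/attribute/value counts) is an assumption made for the example, since the paper does not spell it out.

use strict;
use warnings;

# Naive Bayesian scoring: P(C) times the product of P(a = v | C) over the
# chosen attributes.  The %$counts layout (class_total and attr sub-hashes)
# is assumed, not taken from the actual implementation.
sub classify {
    my ($attrs, $counts, @use_attrs) = @_;
    my %prob;
    my $total = 0;
    $total += $counts->{class_total}{$_} for keys %{ $counts->{class_total} };
    for my $class (keys %{ $counts->{class_total} }) {
        my $p = $counts->{class_total}{$class} / $total;      # prior P(C)
        for my $a (@use_attrs) {
            my $v     = $attrs->{$a};
            my $match = $counts->{attr}{$class}{$a}{$v} || 0;
            $p *= $match / $counts->{class_total}{$class};    # P(a = v | C)
        }
        $prob{$class} = $p;    # proportional to P(C | A); P(A) is dropped
    }
    my ($best) = sort { $prob{$b} <=> $prob{$a} } keys %prob; # highest wins
    return ($best, \%prob);
}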

Through the web interface the user is able to choose any combination of attributes to consider when running the Naive Bayesian Classification algorithm on the unclassified object.

3.2 Attribute Quantization Algorithm

The attribute quantization algorithm, i.e., the quantizer, runs through the image attributes that were extracted and quantizes each one to a discrete value. For example, the intensity average attribute, which usually has a value ranging from 0 to 255, is split and mapped into 8 discrete value ranges. This quantization step helps to define the closeness of an attribute between two images.
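For instance, under the simplifying assumption of eight equal-width ranges (the actual ranges used by the system may differ), the mapping amounts to:

my $intensity_avg = 142;                 # raw value in 0..255
my $bucket = int($intensity_avg / 32);   # quantized value in 0..7 (here: 4)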

The classes in the current implementation are set up to represent the click-thru rate of the banner images. The click-thru rate, as previously mentioned, is a way of normalizing the statistical data (i.e., views and clicks) of all the banner images, and it is an immensely popular way of rating banners and advertising campaigns. It is calculated by dividing the sum of the unique clicks of the banner image by the sum of the unique views of the banner image. If the number of unique views is less than or equal to 0, we assume that the unique clicks and the click-thru are also 0, since a user can't possibly click on a banner image that isn't displayed to them. Formally, the click-thru can be defined as:

The quantizer engine takes as input the raw image attributes hash and a quantizationSchema. It then iterates through the attributes, applying the translations that the schema defines for each; it modifies the attributes hash in place with the new quantized values and returns. The quantizationSchema is in reality a Perl hash that mirrors the structure of the attribute hash. For each attribute, the schema specifies one of the four translation types described above.

For a given banner image i:

    if (sum(UNIQUE_VIEWS(i)) > 0)
        click-thru(i) = sum(UNIQUE_CLICKS(i)) / sum(UNIQUE_VIEWS(i))
    else
        click-thru(i) = 0



4. IMPLEMENTATION DETAILS AND PERFORMANCE ANALYSIS


The Bayesian Banner Profiler system is implemented in Perl. Figure 1 shows the initial user interface of the Bayesian Banner Profiler tool. This user interface consists of five important panels:

• The Image Source Panel - Contains user interface elements concerned with specifying the sample source image (the test case).
• The Image Upload Panel - When the user chooses "New..." from the Source pulldown, they are prompted either to upload an image file through the browser or to specify the URL of an image on the web.
• The Image Preview Panel - When an image is chosen from the pulldown, a preview (dimension-restricted) version of the image is displayed in this panel.
• The Attributes Panel - The user can choose any combination of image attributes to consider when doing the Bayesian probability computation. At least one attribute is required for the calculation to be useful.
• The Menu Panel - Contains the submit button and the hidden results panel. In the future, as more general query-specific elements are added, they will be placed here.


Click-thru rates, by nature of the division, can range from 0.0 to 1.0 (although in the industry they usually top out around 0.20). To deal with these minuscule values, the current implementation multiplies the value by a factor of 1000 to bring more distinction between the click-thrus, thereby giving us more classes to classify into. Rather than tediously recalculating the click-thrus and Bayesian counts for the entire data set each time the user queries the system, the current implementation uses a precomputed hash of counts and classes called the bayesianCounts hash. This hash file is combined with the quantizationSchema to represent a Bayesian Profiler configuration. Tweaking the quantizationSchema and the building of the bayesianCounts file gives the system considerable flexibility. Since the bayesianCounts file is also governed by a profiler, the user can easily switch to a different set of data (bayesianCounts) simply by selecting a different profiler in the user interface.
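Putting the click-thru definition and the ×1000 scaling together, a hedged sketch (the function and variable names are illustrative, not the actual implementation) is:

use strict;
use warnings;

# Compute the click-thru rate and derive the class label used for classification.
sub clickthru_class {
    my ($unique_clicks, $unique_views) = @_;
    return 0 if $unique_views <= 0;             # never shown => click-thru is 0
    my $ctr = $unique_clicks / $unique_views;   # 0.0 .. 1.0, typically below 0.20
    return int($ctr * 1000 + 0.5);              # scale by 1000 and round into a class
}

print clickthru_class(37, 5200), "\n";          # e.g. prints 7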

Figure 1. Main page and user interface


Figure 2. After a successful query

Figure 2 shows the results of a successful query to the profiler. Note how the preview image pane has changed to display the currently selected test case source image and the attributes the user has chosen to consider for the Bayesian classification algorithm.

The interface between the HTTP server and the Bayesian Banner Profiler program is entirely CGI. This means that the Bayesian Banner Profiler program can be taken out and used within Apache (via mod_cgi) or any other web server that conforms to the CGI standard, with no change.

Finally, note the resulting table of probabilities and classes displayed below the SUBMIT button, sorted in descending order of probability. The table is currently implemented to list the top 10 results, but it can easily be changed to any arbitrary splice of the results array. Also note that the click-thru ranges from 0 to 1, while in the result panel the click-thru (shown in the class column) is multiplied by 100 in order to make the differences visible.

Since the web advertising industry does most of its banner advertising in GIF format to support the majority of older web browsers, we support the GIF format in our system. Currently we do not support the PNG image format, even though GIF can easily be translated into PNG. The reason is that the attributes of the translated PNG images do not stay true to the original GIF version. Additionally, since PNGs do not support animation, we would lose all the inter-frame information, such as frame delay.

Figure 3 shows a comparison of the predicted click-thru rates and the actual ones. As it shows, the results are not ideal, but the predictions are still close, given the small range of click-thru rates.

5. CONCLUSION

In this paper we have presented an application of classification to banner images: the Bayesian Banner Profiler framework.

This system was primarily developed on GNU/Linux. A custom-built HTTP server supporting a limited protocol (with Perl-CGI capabilities) was created.


Classifying the click-thru rates based on the attributes extracted from the banner images can help improve the effectiveness of advertisements.

REFERENCES

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

[2] O. R. Zaiane, J. Han and H. Zhu, "Mining Recurrent Items in Multimedia with Progressive Resolution Refinement", in Proceedings of the IEEE International Conference on Data Engineering, San Diego, CA, March 2000, pp. 461-470.

Currently we support some low-level image attributes. Our test results on banner images vary in accuracy, and we feel that adding high-level attributes can aid in improving the accuracy and stability of the system. After all, human beings usually don't click on banner images because of such low-level details as the number of colors; rather, they click on banners because of their aesthetic qualities and content. Fortunately, the framework of our system is flexible enough to accommodate more attributes, including high-level image attributes.

[3] O. R. Zaiane, J. Han, Z.-N. Li, S. H. Chee, J. Y. Chiang, "MultiMediaMiner: A System Prototype for MultiMedia Data Mining", SIGMOD 1998, pp. 581-583.

One of the major improvements to this system would come from extracting better image attributes from the GIFs on which to classify. The current implementation has a limited number of low-level attributes to choose from. There is, however, a largely untapped realm of higher-abstraction image analysis awaiting the future of this system: for example, using motion and blink detection, using OCR to detect text in the image, and using object detection to parse individual objects out of the image (banners usually use clip-art images from image CDs). By doing this, many other attributes could be added to the classification. Additionally, we would like to support more banner formats than just GIF images; in particular, we would like to add support for JPEG and PNG images as well as Macromedia Shockwave Flash SWF files. We are also interested in extending the work by using other classification methods.

[4] J. Zhang, W. Hsu, M. L. Lee, "Image Mining: Issues, Frameworks and Techniques", in Proceedings of the Second International Workshop on Multimedia Data Mining (MDM/KDD'2001), San Francisco, CA, USA, August 2001.

[5] Q. Ding, W. Perrizo, Q. Ding, and A. Roy, “On Mining Satellite and Other Remotely Sensed Images,” in Proceedings of the SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Santa Barbara, CA, May 2001, pp. 33-40. [6] E. Kohler, Gifsicle: Animated GIFs for UNIX, http://www.lcdf.org/~eddietwo/gifsicle/

Figure 3. Actual Click-thru rate (Actual CTR) vs. Predicted Click-thru rate (Calculated CTR)


Iris Recognition Using the SVD-Free Latent Semantic Indexing

Pavel Praks

Libor Machala

Václav Snášel

Dept. of Applied Mathematics VŠB - TU Ostrava, 17, Listopadu 15, 708 33 Ostrava, Czech Republic Tel. (+420) 59 732 4355

Dept. of Experimental Physics Palacký University, Svobody 26, 779 00 Olomouc, Czech Republic Tel. (+420) 58 563 4285

Dept. of Computer Science VŠB - TU Ostrava, 17, Listopadu 15, 708 33 Ostrava, Czech Republic Tel. (+420) 596 993 263


Iris biometrics systems for public and personal use have been designed and deployed commercially by British Telecom, US Sandia Labs, UK National Physical Laboratory, NCR, Oki, IriScan, and others. Applications of these systems are expected in personal identification, access control, computer and Internet security, etc. Studies on iris recognition have been published in [9, 10, 11, 12, 15, 16, 17].

ABSTRACT

The article presents a method for the automatic verification of persons by iris recognition. A raster image of the human iris is represented as a sequence of pixels. Information retrieval is conducted by the Latent Semantic Indexing (LSI) method. The partial symmetric eigenproblem is computed efficiently using a Lanczos-based iterative method. Numerical experiments on a real data collection indicate the feasibility of the presented approach as a tool for automated image recognition without any image preprocessing.

The method proposed by Daugman [9, 10] is based on the transformation of elementary regions of the iris image into polar coordinates. Then, using two-dimensional optimal Gabor functions, a binary iris code is generated. Iris identification consists of comparing the generated codes using the Hamming distance. In [16] the field of interest is transformed into standardized polar coordinates similarly to [9]. The characteristic iris vector is computed from the mean brightness levels of elementary ring sectors of the iris image.

Categories and Subject Descriptors G.1.3 [Numerical Analysis]: Numerical Linear Algebra – Eigenvalues and eigenvectors (direct and iterative methods); Singular value decomposition; Sparse, structured, and very large systems (direct and iterative methods.)

Liam et al. [15] use a trained Self-Organizing Map Neural Network to recognize iris patterns. The iris of doughnut shape is converted into a rectangular form and fed to the neural network. Roche et al. [17] propose an iris recognition method where the features of the iris are represented by fine-to-coarse approximations at different resolution levels. In this technique the discrete dyadic wavelet transformation was used.

General Terms Algorithms, Measurement, Security, Human Factors.

Keywords Biometric identification, information retrieval, multidimensional data, dimension reduction, numerical linear algebra, Singular Value Decomposition (SVD), partial symmetric eigenproblem, very large systems of linear equations, iterative solvers.

In this article, we present an alternative approach to the recognition of human iris images. The aim is to show that Latent Semantic Indexing (LSI) is as good a way of recognizing images as the other methods mentioned above. Moreover, the pattern recognition can be made much more efficient when the time-consuming Singular Value Decomposition (SVD) of LSI is replaced by a partial symmetric eigenproblem, which can be solved using fast iterative solvers.

1. INTRODUCTION

Methods of human identification using biometric features like fingerprint, hand geometry, face, voice and iris are widely studied. A human eye iris has a unique structure given by pigmentation spots, furrows and other tiny features that are stable throughout life, see [1, 8]. It is possible to scan an iris without physical contact, even if the subject wears eyeglasses or contact lenses. The iris can hardly be forged, e.g. replaced or copied. This makes the iris a suitable object for the identification of persons. Iris recognition seems to be more reliable than other biometric techniques such as face recognition [11].

2. THE SVD-FREE LATENT SEMANTIC INDEXING

Numerical linear algebra is used as the basis for information retrieval in the retrieval strategy called Latent Semantic Indexing [14]. The premise is that more conventional retrieval strategies (such as vector space, probabilistic and extended Boolean) are not very efficient for real data, because they retrieve information solely on the basis of keywords. In this paper, Latent Semantic Indexing is used as the tool for solving the iris recognition problem. In our approach, a raster image is coded as a sequence of pixels [19, 20], see Fig. 1.



Fig. 1 Image coding of a 3×2 pixel image example (left) and the corresponding six-dimensional vector (right). The example image

    Fig2 = [ 0 40 ; 15 0 ; 0 20 ]

is coded column-wise as the six-dimensional vector

    v2 = (0, 15, 0, 40, 0, 20)^T

Fig. 2 Document matrix coding. The document matrix A is represented as a sequence of coded images; the vector v2 appears as one of the columns of A.

1. Compute the k largest eigenvalues and eigenvectors of the symmetric matrix ATA

[V, S2] = eigs(A'*A, k);
S = sqrt(S2);

Then the coded image can be understood as a vector of an m-dimensional space, where m denotes the number of pixels (attributes). Let the symbol A denote an m×n term-document matrix related to m keywords (pixels) in n documents (images). Recall that the (i, j)-element of the term-document matrix A represents the color of the i-th position in the j-th image document, see Fig. 2.

2. Compute the co-ordinate of the query vector q

qc = (((q' * A) * V) * inv(S)) * inv(S);

3. Compute the similarity coefficients between the query vector and documents

The Latent Semantic Indexing method involves the Singular Value Decomposition of A. The SVD of any realistic document matrix is still a very memory- and time-consuming operation, especially for large data collections.

for i = 1:n   % Loop over all documents
    s = qc*V(i,:)' / (norm(qc)*norm(V(i,:)));
    sim(i) = 1 - acos(s);
end;

Analyzing the LSI [14] and using observations of linear algebra, a new SVD-free LSI procedure was derived [20]. The derived LSI algorithm replaces the expensive SVD of the non-square matrix A by the partial eigenproblem of ATA, where T denotes the transpose superscript. Of course, the solution of this partial symmetric (!) eigenproblem using a Lanczos-based iterative method can be obtained very effectively. In addition, the size of the eigenproblem does not depend on the number of attributes (pixels). Moreover, our numerical experiments proved that the derived SVD-free LSI is suitable for image retrieval and text retrieval [20].
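In outline, the identity behind this replacement is the standard relation between the SVD of A and the symmetric eigendecomposition of ATA (this is textbook linear algebra, restated here rather than quoted from [20]):

\[
A = U \Sigma V^{T}
\;\Rightarrow\;
A^{T}A = V\,\Sigma^{2}\,V^{T},
\qquad
\Sigma = \sqrt{\Sigma^{2}},
\qquad
U = A\,V\,\Sigma^{-1}.
\]

Thus V and Σ can be taken from the partial eigendecomposition of the n×n matrix ATA, and U never has to be formed explicitly; it is applied implicitly through the product A V inv(S), which is what step 2 of the code above does for the query vector.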

The function lsi returns the vector of the similarity coefficients sim to the user. The i-th element of sim contains the value, which may be understood as a “measure” of the semantic similarity between the i-th document and the query document.

2.1 Latent Semantic Indexing and Document Matrix Scaling

Moreover, our numerical results indicated that the ability of the LSI method to extract details from images can be increased by scaling the document matrix. This feature of the method was also exploited for iris recognition.

Following [14, 19, 20], the SVD-free Latent Semantic Indexing procedure can be written in Matlab by MathWorks as follows.

Let the symbol A(:,i) denote the i-th column of the document matrix A. Since in Matlab the colors of images are coded as non-negative integers, we used the following scaling

A(:,i) = A(:,i) / sum(A(:,i)),   i = 1, …, n.

function sim = lsi(A,q,k)
% Input:
%   A ... the m × n document matrix
%   q ... the query vector
%   k ... Compute k largest eigenvalues and eigenvectors; k

The implementation of Markov models in a web server makes possible four operations directly linked to predictive analysis. In the first one, the server supports Markov models in a predictive mode: when the user sends an exploration request to the web server, the latter predicts the probabilities of the user's next exploration requests.

α corresponds to the state space, β is a matrix representing transition probabilities from one state to another, and λ is the initial probability distribution of the states in α.


Predictive analysis

This prediction depends on the history of user requests. The server can also support Markov models in an adaptive mode, in which it updates the transition matrix using the sequence of requests that arrive at the web server. In the second operation, prediction of relationships, aided by Markov models and statistics of previous visits, suggests to the user a list of possible scenes, of the same or different web bases, that would be of interest to him and that he can go to next. The prediction probability influences the order of the scenes. In the current framework, the predicted relationship does not strictly have to be a scene present in the current web base, because the predicted relationships represent user traversal scenes that can include explicit user jumps between disjoint web bases.
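As a rough illustration of the predictive mode written in Perl (this sketch is not from the paper; the state names and probabilities are invented), candidate next requests can simply be ranked by the row of the transition matrix β that corresponds to the user's current state:

use strict;
use warnings;

# beta{current}{next} = estimated transition probability between requests.
my %beta = (
    home    => { catalog => 0.6, search => 0.3, home   => 0.1 },
    catalog => { item    => 0.7, home   => 0.2, search => 0.1 },
);

# Rank candidate next requests for the current state, most probable first.
sub predict_next {
    my ($current) = @_;
    my $row = $beta{$current} or return ();
    return sort { $row->{$b} <=> $row->{$a} } keys %$row;
}

my @suggested = predict_next('home');   # ('catalog', 'search', 'home')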

The algorithm is given below.

Clustering() {
    Initialize num_tries and num_pairs;
    min_cost = infinity;
    for k = 1 to num_tries do
        current = k randomly selected items in the entire data set;
        j = 1;
        repeat
            xi = a randomly selected item in current;
            xh = a randomly selected item in the entire data set - current;
            if (TCih < 0) then
                current = current - xi + xh;
            else
                j = j + 1;
            endif
        until (j > num_pairs);
        ...
)"'!
)"-¹P  )" - 23 8 -c @ = )"'!!"  )"'- 4 T )"Q )" @)"O   #+ 39 ÐZeC-ˆ€ £qŒP$C  'EB' # =3*)"% = )"!   (@).-+).-0N=+'2+!" 5TzL:? '  $  '#+!"b23$0'  $ -±'  µ# '!1  60)"* -±‘O’C“+–Ûqœ39 ¤ 9USU - d+- T    H # !"}2fT0 U -'  µ# '!1  60)"* -±‘O’C“ —à“ Ù9 ¥9US'2 '!.# *'!.#U @!" '!UK'-wZ   6 N   9`B4!" = =f(Ì- x!"0 )" $5AhHikjl1YA:s51).- =3' ).-0% !.7 0 )" õ?µ# '- )"  )"* N)" )"-g#+!"$@).- ).- 0±-%  =   k+ 2( -  )"-X9g|T ).-0_  )"!"0 )"  PH-0 L S=f - )"'!
- ).  @= #  )"-!ONk  +!€"F ºŒ89±ÊH-± +)" !"53 '@'P¥G'#+!"(= = !"& !"0!"&  !" T'7  53-(}'Uv'G}P# !"Ì).-l)" $=3  U=+ )" )"'-=f= #+!"'}).-±   =f ).d '-  9b¢ ' ¶ @=+!" 5s# )" - $' -N= =3-±). 4jl'# ('YA)" ).-( #k-$2+!.)"J J @ '- -   ).- K$)"   '3YT)" T).-).- P-} = ).-0@'-  !"&

4. DYNAMIC SPATIO-TEMPORAL ASSOCIATION RULES FOR LOCAL-SCALE WEATHER PREDICTION (DYSTAL) ALGORITHM

4.4 Phase IV: Generating the Dynamic Spatio-Temporal Association Rules