Final Year Project Intermediate Report 1. Project Introduction

11 downloads 106 Views 342KB Size Report
Jan 27, 2014 ... a cost matrix C. We define a number Ci,j for each i,j between 1 and n. ... on constructing a complete system and do practical testing for ...
Final Year Project Intermediate Report Project Title: EMD Based Image Similarity Search Project Member: Yao Haobin Supervisor: Prof. Nikos. Mamoulis

1. Project Introduction 1.1. Project Objective The objective of the project is to construct a complete system to do EMD-based image similarity search, and to measure and improve the quality of similarity search. In previous research, Prof. Nikos and Yu Tang have developed fast algorithms for searching similar images according to EMD calculation. This project will integrate their algorithms into a full searching system and conduct quality testing and argument configuration, in order to produce an high-quality image similarity search software. 1.2. Preliminary Knowledge EMD (Earth Mover’s Distance) is the cost of transforming one image to another, a measurement of similarity between images. Suppose we have image P and image Q. Extract P, Q into histograms (p1,p2,p3,...pn) and (q1,q2,q3,...,qn). Histogram of an image is a vector that summarize the characteristics of the image. Now make a flow from (q1,q2,q3,...,qn) to (p1,p2,p3,...,pn), and let:



f (i, j)  q i

,

j



f (i, j)  p

j

i

By the above flow construction, we flow out all of (q1,q2,....,qn) and flow into (p1,p2,....pn), and thus transform histogram of Q to histogram of P. And then we have a cost matrix C. We define a number Ci,j for each i,j between 1 and n. Ci,j stands for the unit cost of the flow from qi to pj. Then the EMD value between P,Q is the cost of the min-cost flow.

We have:

emd(P,Q)  min f (i, j)C(i, j) i, j

1.3. Project Focus In previous research of Prof. Nikos and Mr. Yu Tang, fast EMD calculation algorithms have been developed and implemented. Therefore this project will focus on constructing a complete system and do practical testing for improvement. Here we emphasize on the quality of search result rather than the execution speed since the algorithms have been proved indeed fast. Given a query image, the system will return certain number of “similar” images from the image database according to EMD calculation, and we want to know how similar will the result images be compared with the query image, and we would try to improve the quality by configuring some arguments of the algorithms. Proposed arguments for configuring are the histogram extraction method and the cost matrix for EMD calculation. Initially, the histogram is extracted by cutting image into 10*10 bins and extracting a number for each bin center. And the Ci,j (the cost matrix entry) depends on the Euclidean distance between the ith bin and the jth bin (see the following graph as illustration).

We will test the quality on the initial setting at first and configure the bin numbers, histogram extraction algorithms, or the cost matrix settings to improve the quality of image similarity search.

2. Project Methodology The project is to be accomplished by incremental development. We will have a simple working prototype at very early stage. The prototype can accept input and display results but contains no EMD calculation logic. Then we will add in full functionality, integrate in EMD calculation, evaluate the quality and then try to improve the quality incrementally. The project is supposed to be completed in the following six stages:

1). Prototype construction Build a prototype system, which contains all the components, including extracting histograms from source images, filtering for possible candidates, and refinement with similarity search. The only difference with real system is that the prototype will just use primitive Euclidean distance for computing the similarity between images. The objective of the prototype is to have a rather complete system before plug-in of the critical part at the beginning of the project. 2). EMD similarity plug-in Integrate EMD calculation into the prototype to construct a complete system. The program for EMD calculation has been developed. However, the original code is not easy to call. So modification of the existing EMD calculation program is necessary for the integration. After this stage, a working system is supposed to be ready for testing. 3). Testing and quality evaluation An essential goal of the project is to evaluate the quality of EMD image similarity search. And we will use benchmark images from the database to test the quality of searching result. 4). Improvement of histograms extraction method and cost matrix Histogram extraction and cost-matrix configuration are factors that have great influence on the quality of searching, besides similarity calculation algorithm. Testing different ways of extracting and cost-matrix setting is a promising approach to improve the overall searching quality. 5). Real world test for user feedback Build user-friendly interface and invite clients to experience and comment on the system. After this stage, we can get a comprehensive evaluation on the quality of the system. 6). System Fulfillment and Final Report Fulfill the system for final presentation. We propose to integrate alternative search method as comparison. And we will summarize the quality of our search system.

3. Current Progress - Stage 1 and 2 completed 3.1. Stage 1 Prototype Architecture 3.1.1. Components of the prototype: 1) Web Interface: webpage for client to upload query image and display results 2) Histogram extractor: extract histograms of image database 3) Prototype similarity searcher: conduct similarity search, but with no EMD calculation logic, just used for having a working system at the beginning

4) Web server: php server to coordinate the components 5) Image database: collection of images 6) Histogram database: precomputed histograms of images in the database 3.1.2. Architecture of the Prototype The histogram extractor precomputes the histograms of all images in the database and store them in the histogram database. The client side uploads an image for query, then the php server calls the Prototype similarity searcher. The searcher extracts the histogram of the query image and do similarity search by simple Euclidean distance between histograms (the results are not important for the prototype), and return the urls of the most K similar images to the server. The data flow is also showed in the following graph.

3.2. Stage 2 Complete System Showcase Currently, we have constructed a image similarity search system, integrated with emd calculation. The system can now receive image from client and search for “K” most “similar” images in the server-end image database, according to emd calculation (“K” is a configurable argument). The website for testing the system is now available: fyp13016s1.cs.hku.hk For the current website, you upload an image for query:

Then system will display the 10 most similar images according to EMD calculation from the back-end image database. Since the number of images is low

currently, the searching effect is not obvious. However, a working system is now ready for testing.

3.3. Complete System Architecture 3.3.1. Components of the system: 7) Web Interface: webpage for client to upload query image and display results 8) Single histogram extractor: extract histogram of the query image 9) Mass histogram extractor: extract histograms of image database 10) EMD searcher: conduct EMD similarity search 11) Web server: php server to coordinate the components 12) Image database: collection of images 13) Histogram database: precomputed histograms of images in the database 3.3.2. Architecture of the components The mass histogram extractor precomputes the histograms of all images in the database and store them in the histogram database. When client side uploads an image for query, the php server saves the image temporarily and calls the single histogram extractor to extract the histogram of the query image. After the extraction, the server passes the histogram to the EMD searcher. The EMD search do similarity search the query histogram in the histogram database for similar ones, by EMD calculation, and return the urls of the most K similar images to the server. At last the server returns the result urls to the client webpage to display. The following graph illustrates the data flow.

3.4. Proposed Architecture Improvement Currently the image and the histogram database are stored as data files, and the EMD searcher will read in all the histograms at the beginning. If the database contains a great number of images in future testing, there might be performance problems. Then we will amend the EMD search to read in only part of the histograms at the beginning and read in other parts when needed, or even store the histograms in Mysql database instead of current data files. However, the prior consideration is the quality of the result image. We will only try to improve the execution speed if necessary.

4. Future Plan 4.1. Future Challenge Having constructed a complete system for image similarity search, we will start to do search quality measurement and argument configuration by testing the system on large image databases. However, how to measure the quality of the searching result is currently the biggest challenge for the project. When the client uploads an image and some “similar” images are found by the system, we do not have quantitative method to judge how the results are good. A possible method is to invite large number of users to evaluate the system and collect the feedback. We will adopt this approach but just user-feedback is obviously not an objective and convincing way to measure the quality and then adjust argument settings. Therefore besides user evaluation, we will conduct comparative case-studies on benchmark image databases. The test query image will contain certain object (a car, for example), and then how many images among the results returned by the system contain objects of the same category (how many “cars” are found) could be a standard to measure the quality of current image similarity search. We are going to compare

our method to other image similarity search tools on those test cases and expect to have a more convincing evaluation on the search quality than just user evaluation. 4.2. Future Schedule 1). Case studies on the benchmark image database to evaluate the quality Schedule: 2014.01.27 - 2014.02.19 Test the system on certain categorized image databases with chosen query image. The query images contain certain object or have obvious features to provide evaluation standard for the similarity of the result images. Compare the result with other image similarity search tools, and with textual search by image names, and then give evaluation to our image similarity search system. We have found some candidate image databases: http://vision.caltech.edu/malaa/research/image-search-bench/ http://vision.caltech.edu/malaa/datasets/ http://medialab.liacs.nl/mirflickr/mirflickr1m

This stage is a crucial part of this project. We must find standard and method to measure the search quality of current EMD calculations, which is the basis of argument configuration and further improvement. 2). Configuration of histograms extraction method and cost matrix Schedule: 2014.02.20 - 2014.03.04 Repeat the above tests with different histogram extraction methods (more bin numbers or even different algorithms) and different cost-matrix configurations. We expect to achieve better quality. If the performance is affected after changing the configuration, we will try to ensure the quality and keep performance acceptable. 3). Invite users to test and evaluate the system Schedule: 2014.03.05-2014.03.11 Let users test the system and gather the evaluation and feedback. We will invite the users to evaluate similarity under different argument configuration and compare with other image search tools. 4). Analyze the feedback, and give overall evaluation for quality of the system Schedule: 2014.03.12 - 2014.03.21 Summarize the case study and user feedback to produce the overall evaluation of the image similarity search system. 5). Fulfill the system and summarize the project Schedule: 2014.03.22 - 2014.04.15 Adjust the system for the final presentation and report (if needed). Summarize the project and give final report.

Reference: Earth Mover’s Distance based Similarity Search at Scale Yu Tang, Leong Hou U, Yilun Cai, Nikos Mamoulis, Reynold Cheng