A Schematic Approach To Data Retrieval Using No Random Access Algorithms

by

Vanshika Sinha

A Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science

Supervised by
Dr. Rajendra K. Raj
Department of Computer Science
B. Thomas Golisano College of Computing and Information Sciences
Rochester Institute of Technology
Rochester, New York
February 2010

Project Report Release Permission Form Rochester Institute of Technology B. Thomas Golisano College of Computing and Information Sciences

Title: A Schematic Approach To Data Retrieval Using No Random Access Algorithms

I, Vanshika Sinha, hereby grant permission to the Wallace Memorial Library to reproduce my project in whole or in part.

Vanshika Sinha

Date


© Copyright 2010 by Vanshika Sinha All Rights Reserved


The project “A Schematic Approach To Data Retrieval Using No Random Access Algorithms” by Vanshika Sinha has been examined and approved by the following Examination Committee:

Dr. Rajendra K. Raj Professor Project Committee Chair

Prof. Carol Romanowski Associate Professor

Dr. Minseok Kwon Assistant Professor


Acknowledgments

I am grateful for all the guidance, valuable inputs and support I have received from Dr. Rajendra K. Raj for this project. I would also like to take this opportunity to thank Professor Carol Romanowski for being my reader and Dr. Minseok Kwon for being my observer.


Abstract A Schematic Approach To Data Retrieval Using No Random Access Algorithms Vanshika Sinha Supervising Professor: Dr. Rajendra K. Raj

Data retrieval forms an integral part of any database application. Current data retrieval depends greatly on the search techniques incorporated by database engines, which in general support only traditional techniques for data retrieval. A common example of retrieving data of interest is the use of Boolean operators. While a combination of these operators reduces the search space and produces an accurate answer, forming a query with such precise details is a difficult task and in most cases inefficient as well. To make things worse, this task becomes impracticable if there are inconsistencies in the database. Such inconsistencies make the idea of similarity search more consequential. Also, with large databases we need techniques that filter out irrelevant data using some form of ranking. This project focuses on proposed data retrieval techniques that deal with ambiguity and support similarity search operations. I have used an approach that combines three components: data structures, algorithms and similarity measures, forming a generic model for data retrieval. The first part of the project deals with exploiting syntactic properties for reducing the search space and retrieving data; scores of a match are calculated and compared to a threshold to generate results. The second part focuses on exploiting properties that incorporate semantic correspondence in data; using this as a ranking function, I have implemented specialized indexes and access algorithms to prune and reduce the search space effectively. The project includes test cases that analyze the pros and cons of the different methods, evaluated using criteria like precision, recall, advantages, disadvantages, processing cost and pruning power.



Contents

Acknowledgments
Abstract
1 Introduction
  1.1 Background
  1.2 Traditional Database Techniques
  1.3 Related Work
  1.4 Hypothesis
  1.5 Roadmap
2 Design
  2.1 Architectural Overview
  2.2 Flow of NRA Algorithm
3 Implementation
4 Analysis
  4.1 Experimental Analysis
  4.2 Evaluation Criteria
  4.3 Test Cases
5 User Interaction
6 Conclusions
  6.1 Current Status
  6.2 Future Work
  6.3 Lessons Learned
Bibliography
A UML Diagrams
B Code Listing

List of Tables

4.1 First Iteration
4.2 Scan of Candidate Set - I
4.3 Second Iteration
4.4 Scan of Candidate Set - II

List of Figures

1.1 Inconsistencies in database
1.2 Edit Distance between two strings
2.1 Edit Distance between two strings
2.2 A Hamming Distance example
2.3 Longest Common Subsequence Example
2.4 Jaccard Similarity Example
2.5 IDF Example
2.6 TF Example
2.7 Summary of the components that contribute to modern data retrieval techniques
2.8 Architectural Overview
2.9 Flow Diagram of NRA
2.10 Hierarchy for data retrieval process
2.11 Stemming
2.12 Example of q-grams
2.13 Query Table and Base Table
2.14 Other approach using additional data structures
3.1 Algorithms
3.2 Interaction Diagram for NRA Algorithm
3.3 NRA Algorithm Class Diagram
4.1 String Sets
4.2 IDF of Tokens
4.3 Reduced Inverted Index
4.4 Test Case 1: Threshold Vs Precision
4.5 Test Case 1: Threshold Vs Recall
4.6 Test Case 2: Threshold Vs Precision
4.7 Test Case 2: Threshold Vs Recall
4.8 Test Case 3: Threshold Vs Precision
4.9 Test Case 3: Threshold Vs Recall
4.10 Test Case 4: Threshold Vs Precision
4.11 Test Case 4: Threshold Vs Recall
4.12 Threshold Vs Pruning Power
4.13 Dataset Size Vs Processing Cost
5.1 Input Parameters
5.2 UDF
5.3 Output Parameters
5.4 Results
A.1 NRA Algorithm Class Diagram
A.2 Algorithms
A.3 User Interface Class Diagram
A.4 DataStore and Table class diagram


Chapter 1
Introduction

1.1 Background

Incorporating similarity search in databases is not straightforward. Spelling mistakes, inconsistencies, and different formats for the same data play a major role in finding similarity. A user defined function entered by a database user for look-up [8] becomes hard to execute with such data discrepancies. This project deals with approaches that support similarity search operations and tolerate inconsistencies in data.

In business applications, accuracy of results is crucial; a single missing instance could result in tremendous loss. Such applications cannot afford to lose information due to inconsistencies in the data. These applications use novel database technologies, some of which are listed below:

• Record Matching: Applications may use record matching for retrieving information pertaining to a particular customer.
• Schema Matching: Schema matching is used for integrating two or more database schemas.
• Data Cleaning: Data cleaning is used for spell checking and generating suggestions [2].
• Keyword Search: Keyword search is used for look-up purposes.


Data retrieval techniques that incorporate efficient similarity search form the crux of all the above techniques. Hence, by improving approximate query search we enhance these methods as well.

Information retrieval aims at finding the best possible or most relevant results for the subject or topic the user requires. Data retrieval involves fetching results that exactly match the predicate described by the user query. Information retrieval uses probabilistic models, as opposed to the deterministic models used in data retrieval. TF/IDF [4] helps extend the idea of relevance ranking from information retrieval to data retrieval systems; it can also be used for topic-based search. In this project I have tried to extend ideas like relevance ranking from information retrieval to data retrieval. I have used evaluation criteria like precision, recall, processing cost, pruning power and storage efficiency to compare and contrast the different techniques.

1.2 Traditional Database Techniques

Data retrieval is one of the basic database operations. Existing data retrieval techniques have limited support for similarity search. The traditional techniques incorporated by database engines reduce the search space and produce an accurate answer, but forming a query with such precise details is a difficult task. In this project I have adopted techniques for similarity search that help retrieve relevant data. This retrieval is done using a combination of similarity measures, data structures and algorithms.

If a user wants to look up all the information about "Jim Brown", this information might be in different records with the name in different forms. In the current scenario, database engines will return only the results containing this exact keyword. If the database doesn't contain any inconsistencies, this approach by itself is sufficient; but with large amounts of data, inconsistencies are bound to be present. Assume the database contains the strings mentioned in Figure 1.1. All these strings refer to "Jim Brown" in different forms, some with spelling errors. With the current setting we will get only the results that exactly match the query, despite having several other records that refer to "Jim Brown", thus leading to information loss.


Figure 1.1: Inconsistencies in database

More results can be generated by using Boolean operators, among other things. However, forming such precise queries is not only difficult; it also returns data without any notion of ranking. This shows that traditional databases don't have functionality for efficiently incorporating approximate search [1], and thus we need techniques for satisfying user defined functions so that a naive user can perform approximate similarity searches easily.

1.3 Related Work

A technique that allows querying of multiple data sources and uses algorithms to rank these sources according to the relevance of the keywords has been implemented by Suryanarayanan [7]. This project focuses on ranking the data itself, not the data sources. Some approaches break a keyword containing multiple words into its individual strings and then do exact keyword matching to find relevant data. Using these techniques for data retrieval would return records that contain the keywords "Jim" and "Brown" and partially serve our purpose, but it would also return many false positives along with them. Also, since the results are not ranked, a large number of such results causes ambiguity. I have therefore incorporated similarity search on strings rather than exact keyword search. The q-grams approach enables similarity search operations on strings [8] and helps produce more relevant answers.


Many approaches for finding syntactic similarity exist. Some metrics include Levenshtein distance, Hamming distance, LCS (Longest Common Subsequence) and Jaccard similarity. Each has its own advantages and disadvantages (discussed in more detail in the design analysis, Chapter 2), and no single technique can be used in all scenarios [1]. Consider, for instance, using edit distance to find the similarity between the two strings in Figure 1.2.

Figure 1.2: Edit Distance between two strings

This approach is beneficial for finding syntactic similarity in strings; typing and spelling mistakes can be resolved using this technique, as the example above shows. However, the identity of the character inserted, removed or substituted is not taken into consideration [8]. For example, "Sell" and "Cell" have an edit distance of 1, yet the two words mean completely different things; this is not taken into account.

1.4 Hypothesis

In the previous sections we saw traditional methods as well as some newer related work for data retrieval systems. After analyzing these techniques, it can be concluded that no single similarity metric (Levenshtein distance, Jaccard similarity, Hamming distance, LCS) supports similarity search on string sets in all scenarios; each has its own advantages and disadvantages. For the search to be more efficient, we need to focus not only on the similarity metric but also incorporate other components. Moreover, the techniques mentioned above did not use relevance ranking, which is done using similarity measures that exploit semantic aspects of the data.


My hypothesis is that a combined application that takes into consideration data structures, weighting schemes, similarity metrics and access algorithms can facilitate efficient similarity searches in database systems. Along with existing syntactic metrics, I also use special metrics that exploit semantic properties of the data. My work can be divided into two parts. In the first, I have focused on similarity metrics like Jaccard similarity, Hamming distance, edit distance and LCS to develop a scoring function that generates results filtered by a threshold. In the second, I have used the No Random Access (NRA) algorithm, which uses special properties for semantic similarity search like IDF [2] along with the q-grams technique. It also uses inverted lists for storage and efficient retrieval of relevant data, as suggested by Hadjieleftheriou et al. [4]. This algorithm incrementally computes scores and reports the results.

My plan is to test my hypothesis by generating test cases that contain versatile search strings and evaluating them on criteria like recall, precision and pruning power at varying threshold values, to see how efficiently the different techniques reduce their search space to produce results. I have also evaluated the processing costs of the techniques on datasets of different sizes.
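For concreteness, recall and precision as used in this evaluation can be computed as follows. This is a minimal illustrative sketch; the record ids below are hypothetical and not taken from the project's datasets.

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved results that are relevant.
    Recall: fraction of relevant results that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical record ids: four retrieved, three actually relevant.
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 3, 5})
# p = 2/4 = 0.5, r = 2/3
```

Raising the score threshold typically shrinks the retrieved set, trading recall for precision, which is the trade-off the test cases in Chapter 4 measure.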

1.5 Roadmap

The following chapter (Chapter 2) outlines the design of the proposed system using the solution described in the hypothesis. There I discuss the overall architecture of the system and the rationale behind each of its subcomponents, analyze the requirements of the NRA algorithm, and compare it to the other techniques. Next, the implementation section (Chapter 3) explains the software details of the entire system and of the NRA algorithm in particular. I then analyze the hypothesis (Chapter 4) using experiments that compare and contrast the different techniques with evaluation criteria like recall and precision, using graphs and diagrams to illustrate the analysis. I also describe how the user can interact with the system (Chapter 5). Lastly, I discuss how well my proposed model matches the actual implementation; the current status, future work and lessons learned are included in the conclusions (Chapter 6).


Chapter 2
Design

Due to database inconsistencies, exact keyword matching is not always sufficient. Additional methods such as Jaccard similarity, edit distance, Hamming distance and LCS are commonly used to determine the similarity between strings. Such techniques deal with syntactic errors that may be present in the database. I have analyzed the working, advantages and disadvantages of some such techniques below; this analysis illustrates how no single technique is applicable to all scenarios [4]. Following these techniques, I have analyzed metrics like TF/IDF that exploit semantic properties and extend the idea of relevance ranking from information retrieval to data retrieval systems. The rarer the term, the more information it carries, so such terms are given a greater ranking.

• Levenshtein distance (Edit Distance)
Given two strings, the minimum number of insert, delete or substitute operations [8] required to transform one into the other is the edit distance. I have used a dynamic programming approach for implementing the edit distance similarity measure for comparing strings. The algorithm can be described as follows:

– Algorithm using dynamic programming: Let input and transformed be the two strings. Heart of the solution: S[i][j] = minimum cost of transforming the first i characters of input into the first j characters of transformed.
S[i][0] = i
S[0][j] = j
if input(i) = transformed(j): S[i][j] = S[i-1][j-1]
if input(i) != transformed(j): S[i][j] = Minimum(S[i-1][j-1] + replaceCost, S[i][j-1] + addCost, S[i-1][j] + removeCost)
– Example: Edit distance in this case has a replace-operation cost of 1, as shown in Figure 2.1. This approach is beneficial for finding syntactic similarity in strings; typing and spelling mistakes can be resolved using this technique, as exemplified by the case shown below.

Figure 2.1: Edit Distance between two strings

The drawback of this metric is that the identity of the character inserted, removed or substituted is not taken into consideration; thus this approach does not exploit semantic similarity. For example, "Sell" and "Cell" have an edit distance of 1, yet the two words mean completely different things, and this is not taken into account.
– Scoring Function: String1 and String2 are the two strings being compared. Let maxlen be the length of the longer string.
editDistance = editDistance(String1, String2)
Score = 1 - (editDistance / maxlen)
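The recurrence and scoring function above can be sketched as runnable code. This is an illustrative Python version with unit costs for replace, add and remove; the project's own implementation is not reproduced here, so the function names are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Dynamic-programming edit distance with unit operation costs."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        S[i][0] = i          # delete all i characters
    for j in range(n + 1):
        S[0][j] = j          # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                S[i][j] = S[i - 1][j - 1]
            else:
                S[i][j] = min(S[i - 1][j - 1] + 1,   # replace
                              S[i][j - 1] + 1,       # add
                              S[i - 1][j] + 1)       # remove
    return S[m][n]

def edit_score(s1: str, s2: str) -> float:
    """Score = 1 - editDistance/maxlen, as in the scoring function above."""
    maxlen = max(len(s1), len(s2))
    return 1 - edit_distance(s1, s2) / maxlen if maxlen else 1.0
```

For the "Sell"/"Cell" example in the text, `edit_distance` returns 1 (one replace), giving a score of 1 - 1/4 = 0.75.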

• Hamming Distance
A commonly used approach in information theory is Hamming distance. This approach can be extended to string similarity in cases where the data mainly consists of fixed-length numeric or alphanumeric values (e.g., binary signal data).

– Example: Figure 2.2 shows an example of two strings containing zip code data.

Figure 2.2: A Hamming Distance example

Hamming distance can be considered a more restricted form of edit distance. Two conditions need to be satisfied: 1. The strings being compared have to be of the same length. 2. Only substitutions are permitted.

– Algorithm: A simple linear-time algorithm; dynamic programming is not required.
1. Initialize distance to 0.
2. Iterate through the strings and compare corresponding positions.
3. If the characters are not equal, increment distance.
4. Return distance.
For equal-length strings, execution is faster than edit distance because of the restrictions placed on it. The drawbacks are the same as for edit distance; an additional drawback is that it can be applied only to equal-length strings.
– Scoring Function: String1 and String2 are the two strings being compared. Let len be the length of String1.
if String1 and String2 have equal lengths then
  hammingDistance = hammingDistance(String1, String2)
  Score = (len - hammingDistance) / len
else
  Score = 0
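The linear-time algorithm and scoring function above can be sketched as follows (illustrative Python, not the project's code; the zip codes are made-up values in the spirit of the Figure 2.2 example):

```python
def hamming_distance(s1: str, s2: str) -> int:
    """Count positions at which equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def hamming_score(s1: str, s2: str) -> float:
    """Score = (len - distance)/len for equal lengths, else 0."""
    if len(s1) != len(s2):
        return 0.0
    n = len(s1)
    return (n - hamming_distance(s1, s2)) / n

# Two hypothetical zip codes differing in one digit:
# hamming_distance("14623", "14625") is 1, so the score is 4/5 = 0.8.
```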

• Longest Common Subsequence
Given two strings, longest common subsequence returns the longest subsequence common to both strings (not necessarily contiguous in either).

– Algorithm using dynamic programming: Let input and transformed be the two strings. Heart of the solution: S[i][j] = length of the longest common subsequence of the first i characters of input and the first j characters of transformed.
S[i][0] = 0
S[0][j] = 0
if input(i) = transformed(j): S[i][j] = S[i-1][j-1] + 1
if input(i) != transformed(j): S[i][j] = Maximum(S[i-1][j], S[i][j-1])
– Example:

Figure 2.3: Longest Common Subsequence Example

This technique is widely used in bioinformatics for finding consensus among DNA sequences, protein sequences and genes. However, it is not suited to all types of strings, since it does not require the common subsequence to be contiguous in either string.


– Scoring Function: String1 and String2 are the two strings being compared. Let minlen be the length of the shorter string.
lcsValue = LCS(String1, String2)
Score = lcsValue / minlen
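The LCS recurrence and scoring function can be sketched as follows (an illustrative Python version, not the project's own code):

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                S[i][j] = S[i - 1][j - 1] + 1    # extend the match
            else:
                S[i][j] = max(S[i - 1][j], S[i][j - 1])
    return S[m][n]

def lcs_score(s1: str, s2: str) -> float:
    """Score = lcsValue/minlen, per the scoring function above."""
    minlen = min(len(s1), len(s2))
    return lcs_length(s1, s2) / minlen if minlen else 0.0
```

For example, the classic pair "ABCBDAB" and "BDCABA" has an LCS of length 4 (e.g. "BCAB"), giving a score of 4/6.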

• Jaccard Similarity
Jaccard similarity is used for finding similarity between sets. Strings are tokenized into sets, and the similarity measure is calculated as the ratio of their intersection to their union.

– Algorithm:
1. Create two sets and populate them with tokens of the respective strings.
2. Calculate the elements common to both sets (the intersection).
3. Calculate the union of the sets.
4. Take the ratio of intersection to union.
– Example: The strings "153 West Squire Dr" and "147 West Squire Dr" are tokenized into two sets h1 and h2. The resulting Jaccard similarity coefficient between the two strings is 0.6, as shown in Figure 2.4.

Figure 2.4: Jaccard Similarity Example

Jaccard similarity is useful where word order is irrelevant, as in the address case above.


However, this technique takes neither word order nor word importance into account: all words are weighted equally. It is also sensitive to misspelled words, especially in short strings.
– Scoring Function: String1 and String2 are the two string sets being compared.
Score = jaccardSimilarity(String1, String2)
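The computation is short enough to sketch directly. This illustrative Python version tokenizes on whitespace (the project's actual tokenizer is not shown, so that choice is an assumption) and reproduces the 0.6 coefficient from the address example above:

```python
def jaccard(s1: str, s2: str) -> float:
    """Jaccard similarity of the whitespace-token sets of two strings."""
    t1, t2 = set(s1.split()), set(s2.split())
    union = t1 | t2
    if not union:
        return 0.0
    return len(t1 & t2) / len(union)

# {"153","West","Squire","Dr"} vs {"147","West","Squire","Dr"}:
# intersection 3, union 5, so the coefficient is 0.6.
score = jaccard("153 West Squire Dr", "147 West Squire Dr")
```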

• TF/IDF
The techniques above find syntactic similarity in strings, whereas modern techniques exhibit special properties for semantic similarity search. TF/IDF [6] helps extend the idea of relevance ranking from information retrieval to data retrieval systems. IDF (Inverse Document Frequency) is a weighting scheme in which a token is ranked inversely to its frequency across documents [2]: the rarer the term, the more information it carries, so such terms are given a greater ranking. Consider the example illustrated in Figure 2.5: in "The HeathWorks", the token "The" (and similar tokens like "a") appears many times in the document but doesn't carry much information, so it is given a lower ranking under the IDF weighting scheme.

Figure 2.5: IDF Example

TF (Term Frequency) is a weighting scheme in which a token is ranked according to its frequency within the multi-set [5].


Consider the example shown in Figure 2.6: in "Green Street Greenwich" the token "Green" appears twice, so a search on the topic "Green" should rank this string higher. TF is useful for large strings with a high probability of repetition.

Figure 2.6: TF Example

However, for small strings where chances of repetition within a multi-set are low, the cost of calculating TF could be an overhead.
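As a sketch of the IDF weighting scheme described above, the following Python snippet assigns each token a weight inversely related to how many strings contain it. The log-based formula is one common IDF variant, not necessarily the project's exact formula, and the sample strings reuse the document's "The HeathWorks" and "Green Street Greenwich" examples plus one made-up string.

```python
import math
from collections import Counter

def idf_weights(strings):
    """IDF weight per token: rarer tokens across the collection
    carry more information and get a higher weight."""
    n = len(strings)
    df = Counter()                       # document frequency per token
    for s in strings:
        df.update(set(s.lower().split()))
    return {tok: math.log(n / df[tok]) for tok in df}

strings = ["The HeathWorks", "The Green Street", "Green Street Greenwich"]
w = idf_weights(strings)
# "the" occurs in two of the three strings, "heathworks" in only one,
# so w["heathworks"] > w["the"], matching the intuition in the text.
```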

2.1 Architectural Overview

We need a technique that combines all the requirements for similarity search in data retrieval. Figure 2.7 summarizes the components that contribute to modern data retrieval methods.

Figure 2.7: Summary of the components that contribute to modern data retrieval techniques

My work can be divided into two parts. In the first part I have used existing similarity measures for finding syntactic similarity in strings and developed a scoring function using the properties of each, as mentioned in the previous section. The attribute of interest is fed into data structures, and scoring functions are applied to them to get the desired results. By doing so I have combined these measures with other components of modern data retrieval techniques and used them for data retrieval. Figure 2.8 shows the architectural overview of the system more clearly. The layers are described below:

Figure 2.8: Architectural Overview

1. View
(a) Graphical User Interface: The view allows the naive user to enter the UDF (user defined function) as a search string. The user interface is intuitive and simple to use. This layer provides an abstraction over the underlying layers so an end user can execute search operations without knowing the implementation details.


2. Similarity Metrics
Depending on the algorithm used, we calculate the scores used for filtering out irrelevant data and ranking the results.
(a) Scoring Functions: The algorithm decides how the score of a particular match is calculated. How some of these scores are generated was explained earlier in this chapter.
(b) Threshold Filtering: The precision of the results depends on the threshold, whose value lies between 0 and 1 inclusive. The higher the threshold, the more precise the solution set will be. The impact of the threshold on different evaluation parameters is presented in Chapter 4 (Analysis).
3. Frameworks

(a) Algorithm APIs: To make the model generic, algorithm APIs are exposed. These APIs can be used with any of the implementations in the lower layers, thus reducing redundancy and making the choice of algorithm transparent. More details about this model are presented in Chapter 3 (Implementation).
4. Algorithms
The following similarity techniques and algorithms are implemented, as discussed previously in this chapter.

(a) Jaccard Similarity
(b) Edit Distance
(c) Hamming Distance
(d) NRA Algorithm
(e) LCS
5. Data Structures
In a database system, most of the time for data access is spent on disk accesses. To reduce the number of these accesses, we use indices that act like routers to the data. With indices we don't need to scan the entire data set, only look up records that meet specific criteria. Indices also require much less space than the actual data and hence can be cached partially or entirely, resulting in faster access. Without indices, searching becomes like looking for a needle in a haystack. Either the user knows exactly what he is looking for, which is a case of direct retrieval, or the user specifies his requirement ambiguously, which is called browsing [5].

This layer acts as a middleware between the user and the database. By using this middleware, exploiting semantic properties, and thereby immediately pruning a huge portion of the search space, efficient data retrieval is achieved.

(a) Inverted Index: The NRA algorithm uses a specialized data structure, the inverted index, for efficient data retrieval.
(b) Hash Map: Jaccard Similarity, Edit Distance, Hamming Distance and LCS use a hash map as the data structure for retrieval.
6. Databases
This layer contains APIs to access database contents that can be used by the upper layers: APIs for accessing, populating and querying different databases on the server. This layer comprises:
(a) Databases: Contains APIs to create a database connection, get table(s) from a database, add attribute(s) and tuple(s), and set up table(s).
(b) Relations: Contains APIs related to a relation, for example to get the table name, get attribute(s), etc.
(c) Attributes: Attributes are part of the relation; these form the string sets for query search.


2.2 Flow of NRA Algorithm

In the second part of the project, I have implemented an NRA-based algorithm using TF/IDF [4] techniques that exhibit special properties for semantic similarity search. Forming a query with exact requirements is not only tough; if not formed correctly it also consumes a lot of memory, and it is not possible for naive users to form such queries. By exploiting semantic properties and using the inverted index data structure as middleware, a huge portion of the search space can be pruned immediately. The overall flow of the NRA algorithm using the IDF and q-grams approach is shown in Figure 2.9.

Figure 2.9: Flow Diagram of NRA


A brief flow: given a UDF query, the result is generated in the following manner:

1. Token Generation
For simple data retrieval in database systems, the primary and secondary indices that are set up are adequate for searching data records. However, the task is not as simple when we are looking up data with many inconsistencies. Every token in a string can be a potential search key, as shown in Figure 2.10.

Figure 2.10: Hierarchy for data retrieval process

Figure 2.11 shows syntactically similar words: driver, drive, driving and driven. Each of these words could be treated as an index. An approach based on stemming would use n-gram indices; in this example, drive- or driv- derives the rest of the tokens. The speed, effectiveness and granularity of retrieval depend on how we choose the indices.

Figure 2.11: Stemming


Retrieval should also be robust enough to resolve errors due to data inconsistencies. To deal with inconsistencies and find similarity, this project uses the q-grams approach [8]: strings are broken into substrings of length q, and these substrings are used in the similarity operation. The value of q decides the length of the indices. As shown in Figure 2.12, with q = 4 a string is broken into substrings of length 4; these form the indices of our inverted index. The first stage is the token generator, which generates q-gram tokens of the string sets. It is applied to the query string as well as to the base strings contained in the database, and forms the basic building block of the algorithm.

TokenGenerator(String set, int noGrams)

The token generator takes a string and the q-gram length and generates tokens. This pre-processing is required before moving to the next step, as depicted in Figure 2.12.

Input: string
Variables: start, end, length
1. Initialize the variables; end determines the q-gram length of the tokens
2. While the length of the string is greater than or equal to end:
3.   Generate the substring from start to end
4.   Increment start and end
Output: tokens of the string

Figure 2.12: Example of q-grams
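As a concrete illustration, the token-generation step above might be implemented as follows; the class and method names are illustrative, not taken from the report.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of q-gram token generation: slide a window of
// length q over the string and emit each substring as one token.
public class QGramTokenizer {

    // Returns all q-grams of s, in order of their start position.
    public static List<String> tokens(String s, int q) {
        List<String> result = new ArrayList<>();
        // start..end is the current window; stop once end passes the string
        for (int start = 0, end = q; end <= s.length(); start++, end++) {
            result.add(s.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        // "driver" with q = 4 yields: driv, rive, iver
        System.out.println(tokens("driver", 4));
    }
}
```

Strings shorter than q simply produce no tokens, which is one reasonable convention for this pre-processing step.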

2. Inverted index


If we want to use purely relational techniques for similarity search, then after token generation we can use the tokens to generate a base table and a query table. These tables are further processed using database operations, as suggested by Chaudhuri et al. [3], to produce results. The query table and base table are depicted in Figure 2.13.

Figure 2.13: Query Table and Base Table

In this project we instead use the tokens as the indices of an inverted index. This requires additional data structures for storage and retrieval, as shown in Figure 2.14.

Figure 2.14: Other approach using additional data structures

The similarity measures are then applied to the tokens the query has in common with the base strings in order to fetch the relevant strings.


An inverted list maps a token to a list of (setId, length) pairs for the string sets that contain the token. These pairs make it possible to retrieve the actual strings without storing the data in the inverted-list data structure itself; they act like pointers to the data, so the list takes far less space than it would if the actual data were stored. The inverted list should be created as soon as the data is read in, so that we do not have to scan the data on every lookup. The list can be kept in memory, or on disk if the lists become very large; for on-disk storage it is kept in a file or in the database itself. We can also cache the indices with high information gain in order to reduce disk operations. From the IDF formula for a token it follows that, since the length of the query remains constant, arranging the entries in increasing order of string length implicitly arranges them by weight. Once that is done, the data is ranked by relevance: we define a threshold value and return only the data scoring above it, retrieved using the retrieval algorithms.
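A minimal sketch of the inverted list just described, assuming an in-memory map from tokens to (setId, length) postings; all names are illustrative, not the report's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each q-gram token maps to the (setId, length) pairs of the string
// sets containing it; the pairs act as pointers, so no actual string
// data is stored in the index itself.
public class InvertedIndex {

    // One posting: which set contains the token, and how long that set is.
    public record Posting(int setId, int length) {}

    private final Map<String, List<Posting>> lists = new HashMap<>();

    // Index one base string under all of its q-gram tokens.
    public void add(int setId, String s, int q) {
        for (int i = 0; i + q <= s.length(); i++) {
            lists.computeIfAbsent(s.substring(i, i + q), t -> new ArrayList<>())
                 .add(new Posting(setId, s.length()));
        }
    }

    // Call once after loading: shorter (higher-weight) sets come first.
    public void sortByLength() {
        for (List<Posting> l : lists.values()) {
            l.sort(Comparator.comparingInt(Posting::length));
        }
    }

    public List<Posting> postings(String token) {
        return lists.getOrDefault(token, List.of());
    }
}
```

Keeping each posting list in increasing order of length is what lets the sequential-access algorithm later in this chapter read the highest-weight entries first.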

3. Apply Similarity Measure: Using this information, the IDF of each query token is calculated. These IDF values are later used to calculate the similarity scores of string sets against the query.

4. Algorithms for Retrieval: Data can be accessed either in sorted mode or randomly. If the elements in each inverted list are kept in increasing order, this benefits algorithms that access the data sequentially when producing the relevant strings.

The "No Random Access" (NRA) algorithm incrementally computes the ranks without knowing the exact values beforehand. It returns all strings with IDF(query, string) greater than or equal to a threshold T. The threshold value lies between 0 and 1 inclusive; when the threshold is 1, the query string is identical to the resulting base string.


5. Return relevant results to the user: The generated results are sent back to the user.

The implementation details and working of the NRA algorithm are discussed in detail in the subsequent chapters.


Chapter 3

Implementation

To make the model generic, the following inheritance pattern has been used. The Algorithm class is abstract, and its abstract methods are overridden by its children depending on the algorithm. Figure 3.1 shows this hierarchy: Algorithm is the base class, and Jaccard Similarity, Edit Distance, Hamming Distance, LCS and NRA Algorithm are the derived classes.
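This inheritance pattern might be sketched as follows; only the class names come from the text, and the constructor and method bodies are illustrative placeholders.

```java
import java.util.List;

// Abstract base class: holds the shared setup, while each similarity
// measure overrides queryResult with its own scoring function.
public abstract class Algorithm {

    protected Algorithm(String database, String relation, String attribute) {
        // shared setup: connect to the database and populate the data structures
    }

    // Ranks and filters the data for the given query and threshold.
    public abstract List<String> queryResult(String query, double threshold);
}

// One of the derived classes from Figure 3.1, shown as a stub.
class JaccardSimilarity extends Algorithm {
    JaccardSimilarity(String database, String relation, String attribute) {
        super(database, relation, attribute);
    }

    @Override
    public List<String> queryResult(String query, double threshold) {
        return List.of(); // scoring logic omitted in this sketch
    }
}
```

The other derived classes (Edit Distance, Hamming Distance, LCS, NRA Algorithm) would follow the same shape, differing only in their queryResult implementations.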

• queryResult(query, threshold): This method is overridden by each derived class according to that algorithm's scoring function. It takes the user-defined query and the threshold value as input and returns the query result.
– Input: query - the search string entered by the user; threshold - a value between 0 and 1 inclusive
– Output: queryResult - the result after ranking and filtering the data based on the threshold

• Algorithm(String database, String relation, String attribute): This constructor holds the implementation used by all derived classes to initialize the database connection and populate the data structures accordingly.


Figure 3.1: Algorithms


• getPrecision(answers, generatedResults): This method is defined in the base class and uses the vector of correct answers and the results generated by the algorithm to calculate precision.
– Input: answers - correct answers present in the data collection; generatedResults - results generated by the algorithm
– Output: precision of the result

• getRecall(answers, generatedResults): This method is defined in the base class and uses the vector of correct answers and the results generated by the algorithm to calculate recall.
– Input: answers - correct answers present in the data collection; generatedResults - results generated by the algorithm
– Output: recall of the result

• pruningPower(generatedResults): This method is defined in the base class and uses the generated results together with the total size of the database to return the pruning power.
– Input: generatedResults - results generated by the algorithm


– Output: pruning power

Figure 3.2 shows the interaction diagram for the NRA algorithm. The end user selects the database, relation and attribute on which to perform the search, and then inputs the search keyword along with the threshold value.
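The three base-class metrics above could be sketched as follows, under the usual definitions of precision and recall; the pruning-power formula (fraction of the database that was not returned) is an assumption, as the report does not spell it out, and all names are illustrative.

```java
import java.util.List;

// Evaluation metrics shared by all algorithms in the hierarchy.
public class Metrics {

    // Number of generated results that appear in the correct answer set.
    static long overlap(List<String> answers, List<String> generated) {
        return generated.stream().filter(answers::contains).count();
    }

    // Precision: correct results returned / total results returned.
    public static double getPrecision(List<String> answers, List<String> generated) {
        return generated.isEmpty() ? 0.0
                : (double) overlap(answers, generated) / generated.size();
    }

    // Recall: correct results returned / total correct answers.
    public static double getRecall(List<String> answers, List<String> generated) {
        return answers.isEmpty() ? 0.0
                : (double) overlap(answers, generated) / answers.size();
    }

    // Pruning power (assumed definition): fraction of the database
    // the algorithm avoided returning.
    public static double pruningPower(List<String> generated, int databaseSize) {
        return 1.0 - (double) generated.size() / databaseSize;
    }
}
```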

During this time, Data Store and Table are in action, populating values from the database. The attributes of the selected relation are used to populate the inverted list data structure; depending on the query, the reduced inverted list is then used for further calculations. The Token Generator breaks the string sets into tokens of the specified length, and the inverted list maps each token to a list of (setId, length) pairs for the string sets that contain it. Each list is arranged in increasing order of string length. The IDF of each query token si is calculated as:

idf(si) = log2(1 + noOfSets / noOfSiSets) [4]

where noOfSets is the total number of sets and noOfSiSets is the number of sets containing the token si. These values will be used by the NRA algorithm (not shown in the diagram). Once the lists are in place, we can start the NRA algorithm:

Input: lists for the query tokens q1, ..., qn, threshold T
Output: sets s with IDF(q, s) >= T [4]
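The IDF weighting can be sketched in a few lines; the method name is illustrative.

```java
// idf(si) = log2(1 + noOfSets / noOfSiSets), where noOfSets is the total
// number of string sets and noOfSiSets the number of sets containing si.
public class Idf {

    public static double idf(int noOfSets, int noOfSiSets) {
        // Java has no log2, so use the change-of-base identity.
        return Math.log(1.0 + (double) noOfSets / noOfSiSets) / Math.log(2.0);
    }

    public static void main(String[] args) {
        // a rare token (in 1 of 1000 sets) weighs far more than a common one
        System.out.println(idf(1000, 1));   // ~9.97
        System.out.println(idf(1000, 900)); // ~1.08
    }
}
```

This is why rare tokens dominate the score: a token appearing in nearly every set contributes almost nothing.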

As the name suggests, ranking is done by sequentially accessing the lists from the top. The algorithm uses a hash table to maintain an aggregated score for each set id. After every iteration, NRA examines the candidate set, filters out the sets that can no longer qualify, and reports the ones that do; in this way it incrementally computes the ranks without knowing the exact values beforehand. The data structures required for the NRA algorithm are:

Inverted List: Generated as described previously.

Candidate Set: An in-memory hash table that maps the set ids of the string sets that are candidates to be relevant to their aggregated scores [4]. Initially the candidate set is empty: C = 0.

Bit Vector: Each entry is also associated with a bit vector that records whether the set id has been encountered in each list [4]. For each set s, the vector holds one bit per list, indicating the presence of that set's tokens in lists 1 to n. Initially b[1,n](s) = 0, since none of the tokens have yet been encountered in any of the lists [4].

Figure 3.2: Interaction Diagram for NRA Algorithm

The similarity score of a string set with respect to the query is computed as:

IDF(s, q) = ( Σ_{si ∈ query ∩ stringset} idf(si)^2 ) / ( normalizedLen * normalizedQue ) [4]

where normalizedLen is the normalized length of the string set and normalizedQue is the normalized length of the query set. Since the normalized length of the query remains constant, the score is inversely proportional to the normalized length of the string set; this is why the entries in each inverted list are arranged in increasing order of length. The normalized length is calculated as:

normalizedLen(stringset) = sqrt( Σ_{si ∈ stringset} idf(si)^2 ) [4]

Upper bound I(us): The upper bound is computed as the sum of the lower bound and the contributions wi(fi) for each list i on which s has not been encountered yet [4].

Lower bound I(ls): The lower bound of the score of s is computed as the sum of wi(s) over all lists i on which s has been encountered so far [4].

for all new s ∈ C [4]
  let upper bound I(s) = 0 and lower bound I(s) = 0 [4]
let fi = first element on list i [4]
do for all 0
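The bound maintenance described above can be sketched as follows; the weight arrays and method names are illustrative assumptions, not the report's implementation. For each candidate s, the lower bound sums only the weights actually observed, while the upper bound optimistically adds each unseen list's frontier weight wi(fi).

```java
// Score bounds for one NRA candidate over n inverted lists.
// weights[i]  = w_i(s), the true contribution of list i to s's score
// seen[i]     = whether s has been encountered on list i so far
// frontier[i] = w_i(f_i), the weight of list i's current first element
public class NraBounds {

    public static double lowerBound(double[] weights, boolean[] seen) {
        double lb = 0;
        for (int i = 0; i < weights.length; i++) {
            if (seen[i]) lb += weights[i]; // contributions actually observed
        }
        return lb;
    }

    public static double upperBound(double[] weights, boolean[] seen,
                                    double[] frontier) {
        double ub = lowerBound(weights, seen);
        for (int i = 0; i < weights.length; i++) {
            if (!seen[i]) ub += frontier[i]; // best case for unseen lists
        }
        return ub;
    }
}
```

A candidate can be reported once its lower bound reaches the threshold T, and safely discarded once its upper bound falls below T, which is exactly how NRA prunes without random access.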
