2012 International Conference on Communication Systems and Network Technologies

SoDesktop: A Desktop Search Engine

Zhiwang Cen, Jungang Xu, Jian Sun
School of Information Science and Engineering
Graduate University of Chinese Academy of Sciences (GUCAS)
Beijing 100190, China
{cenzhiwang, xujungang, jiansun6000}@gmail.com

Abstract—The number of files stored on a personal computer is increasing very quickly, so it is becoming difficult for users to find the information they want. This paper proposes a desktop search engine named SoDesktop, which is composed of four modules: Data Crawler, Task Scheduler, Data Indexer and Data Searcher. The implementation of these four modules is described in detail, and the implementation of the user interface of SoDesktop is also introduced.

Keywords- Desktop search; crawl; schedule; index

I. INTRODUCTION

With the development of computer technology, computers can complete many kinds of complicated tasks, and the number of files stored on a personal computer is increasing very quickly. At the same time, because the price of large-capacity storage equipment keeps falling, the number of documents stored on a personal computer, such as digital photos, text files, video and audio files, is growing at an amazing rate. However, a new problem arises: users have to spend much time searching for useful information in this ocean of data, and sometimes even files they have previously seen or used cannot be found. Therefore, the problem users now face is not how to save a file, but how to find and locate it as quickly as possible. In other words, traditional desktop information retrieval technology cannot meet the current needs of users, and this inevitably leads to the development of new desktop search engine technologies.

A desktop search engine is designed to help users find and locate the required information or documents on a personal computer effectively. Today, desktop search technologies are becoming more popular in the field of information retrieval. Desktop search software such as Google Desktop Search can build an index over files, pictures and other directories, providing users with information and document retrieval services [1]. A desktop search engine is different from a web search engine. When searching the desktop, users typically look for specific documents they have seen before, such as an e-mail from a colleague, photographs or videos taken on some occasion, or a friend's contact information. It is important to help users recall documents or information they have already seen. A desktop search engine therefore emphasizes mining all available information on the personal computer, including web browser history, e-mail, documents, multimedia files and so on.

This paper describes the framework and modules of a desktop search engine named SoDesktop, and describes its design and implementation in detail. Using the open source platform FirteX as the full-text index engine [2], SoDesktop can effectively find the files a user needs according to their content. SoDesktop also has good scalability: it supports full-text search of multiple file formats, and the set of indexed file types can be customized. The rest of this paper is organized as follows: Section II discusses related work; Section III describes the architecture and each component of SoDesktop; Section IV describes the user interface of SoDesktop; Section V presents conclusions and future work.

II. RELATED WORK

With the development of the personal computer, desktop search has become more important, and more and more research on desktop search has been carried out. At present, there are two basic approaches to desktop search: one is to search directly for the file location, and the other is to search for files by file name, type or text content. Both methods rely on the basic information of a file, such as its name, location and update time, as the main clues for the search [3]. However, they cannot effectively manage large amounts of data. Therefore, some new desktop search technologies have been proposed, including the following:

(1) A desktop search engine can be extended to the internal network through the UPnP protocol, so that all files on all computers in a LAN can be searched [4].

(2) The retrieval precision of a desktop search system can be improved by using the relationship between the user's schedule and the data on the computer [3].

(3) By considering the structure and semantics defined by the application, a desktop search system can improve search results through the use of implicit predicates [5].

(4) By analyzing the user's activity when accessing local resources, a desktop search system can derive activity patterns and use them to produce semantic connections between desktop resources, so that information retrieval can be based on this semantic information [6]. A related method collects the user's activity information about documents and applications so that the system can detect user tasks and search files based on different tasks [7].

(5) In order to improve search accuracy, semantic knowledge can be integrated into the desktop search process [8][9][10], and querying with natural language technology [11] and concept-based search [12] have been presented.

(6) Desktop search results can also be improved based on ontology and context technologies, for example ontology and user mining [13].

Besides the improvements to desktop search technologies described above, there are also performance evaluation methods for desktop search systems. In reference [14], several desktop search systems on the market are evaluated using information retrieval methods. Based on Google's PageRank algorithm, reference [15] presents a new evaluation method for local content search results.

Currently, many domestic and international search companies have launched their own desktop search products [16], including Microsoft Windows Desktop Search [17], Yahoo Desktop Search [18], Copernic Desktop Search [19], Google Desktop Search [1], Archivarius [20] and Baidu Hard Disk Search [21]. These products scan local files for indexing, and their full-text search over documents is based on an inverted index, providing efficient search functions. However, these products cannot completely meet users' needs: some of them cannot classify documents, some have low recall and precision, and some take up too many computer resources.

III. THE ARCHITECTURE OF SODESKTOP SEARCH

The main modules of SoDesktop are Data Crawler, Task Scheduler, Data Indexer and Data Searcher. Data Crawler collects the data stored on the hard disk, using different crawling strategies for different data sources. The crawled files, whether not yet indexed or already indexed, are filtered by the File Filter module and submitted to the Priority Queue in Task Scheduler. The Priority Queue saves the information of files waiting to be indexed. According to the scheduling algorithm, Queue Scheduler submits the file data with the highest priority to the Indexer, and the Indexer uses the parser for each file type to extract the metadata and contents of these files. Using the FirteX API, the Indexer creates inverted indexes for all files, which are saved in the FirteX Index Database. Users submit query keywords to Data Searcher through the user interface. Data Searcher produces a formatted FirteX query string with the keyword Syntax Constructor, and the query string is submitted to Query Parser to produce FirteX queries. The Searcher module takes these queries and searches the data in the FirteX Index Database. The search results are ranked by the ranking algorithm in the Rank module, and the final results are shown to the user through the user interface. This process is shown in Fig. 1.

Figure 1. The Architecture of SoDesktop

A. Data Crawler
The main function of Data Crawler is to constantly scan the local file system and collect documents of different formats from it as fast as possible. Besides, it must support different data sources: the collected data may come from the white list of folders, the Real-Time File Monitor, the mail client, the user's browse history and so on. Therefore, data collection needs a different component for each data source, and all data is converted into the same format for further scheduling. Since data collection must not interfere with the user's current operation, it usually runs as a background process that starts when the computer starts up.
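As a rough illustration of this design (not taken from the SoDesktop source), the following C++ sketch shows one possible shape for the unified record that each crawler component could hand to the Priority Queue; the names FileTask, DataSource and TaskCompare are hypothetical, and the source ordering anticipates the priorities described in the Task Scheduler subsection below.

```cpp
#include <cstdint>
#include <ctime>
#include <queue>
#include <string>
#include <vector>

// Hypothetical data sources, ordered from lowest to highest priority
// (Dir Crawler < Email Crawler < Real-Time File Monitor).
enum class DataSource { DirCrawler = 0, EmailCrawler = 1, RealTimeMonitor = 2 };

// One unified record per file, regardless of which crawler produced it.
struct FileTask {
    std::string path;          // absolute path of the file (or of the converted e-mail file)
    DataSource source;         // which crawler component submitted it
    std::uint64_t sizeBytes;   // used later to estimate indexing time
    std::time_t submittedAt;   // when the task entered the queue
};

// Higher data-source priority first; ties broken by earlier submission time.
struct TaskCompare {
    bool operator()(const FileTask& a, const FileTask& b) const {
        if (a.source != b.source)
            return static_cast<int>(a.source) < static_cast<int>(b.source);
        return a.submittedAt > b.submittedAt;
    }
};

using PriorityQueue = std::priority_queue<FileTask, std::vector<FileTask>, TaskCompare>;

int main() {
    PriorityQueue queue;
    queue.push({"/home/user/report.doc", DataSource::DirCrawler, 120000, std::time(nullptr)});
    queue.push({"/home/user/mail/0001.eml", DataSource::RealTimeMonitor, 4000, std::time(nullptr)});
    const FileTask& next = queue.top();  // the real-time task, because its source has higher priority
    (void)next;
    return 0;
}
```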

In the design of SoDesktop, in order to unify the data format, all data from the different data sources is formatted, and each item is converted into a file format that the operating system supports. SoDesktop supports three data sources: the directory scanner, the Real-Time File Monitor and the e-mail client. Their data collection modules are Dir Crawler, Real-Time File Monitor and Email Crawler.

Dir Crawler is a directory data collector. It scans directories recursively and collects files constantly according to the configuration of a white list and a black list, and submits the collected data to the Priority Queue. The white list is the list of folders on the hard disk that may be scanned, while the black list is the list of folders that must not be scanned. The main problem in this component is how to resume scanning after an interruption, in other words, how to continue scanning from a breakpoint. The strategy SoDesktop adopts is to periodically save the list of folders still to be scanned and the file currently being scanned. After a system crash or power failure, Dir Crawler reads the latest saved breakpoint the next time SoDesktop starts, loads the list of folders to be scanned and the file being scanned, and continues from the file that was scanned last. Another problem of this component is handling the containment relations between folders in the white list and the black list. If folder A in the black list is a subfolder of folder B in the white list, folder A is skipped. Conversely, if folder C in the white list is a subfolder of folder D in the black list, only folder C is scanned, while the other folders and files in folder D are not.
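A minimal sketch of this containment rule follows, assuming the white list and black list are simple path-prefix sets; the helper names isScannable and isPrefixOf are illustrative, not SoDesktop's own.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// True if 'dir' equals 'prefix' or lies somewhere below it.
static bool isPrefixOf(const std::string& prefix, const std::string& dir) {
    if (dir.size() < prefix.size() || dir.compare(0, prefix.size(), prefix) != 0)
        return false;
    return dir.size() == prefix.size() || dir[prefix.size()] == '/';
}

// Decide whether a folder should be scanned: the most specific (longest)
// matching white-list or black-list entry wins, which reproduces the two
// containment cases described in the text.
static bool isScannable(const std::string& dir,
                        const std::vector<std::string>& whiteList,
                        const std::vector<std::string>& blackList) {
    std::size_t bestWhite = 0, bestBlack = 0;
    for (const auto& w : whiteList)
        if (isPrefixOf(w, dir)) bestWhite = std::max(bestWhite, w.size());
    for (const auto& b : blackList)
        if (isPrefixOf(b, dir)) bestBlack = std::max(bestBlack, b.size());
    return bestWhite > bestBlack;  // a folder listed nowhere is not scanned
}

int main() {
    std::vector<std::string> white = {"/home/user/docs", "/home/user/media/keep"};
    std::vector<std::string> black = {"/home/user/docs/tmp", "/home/user/media"};
    std::cout << isScannable("/home/user/docs/tmp/cache", white, black) << "\n";  // 0: black-list folder inside a white-list folder
    std::cout << isScannable("/home/user/media/keep/a", white, black) << "\n";    // 1: white-list folder inside a black-list folder
    return 0;
}
```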

Real-Time File Monitor monitors the folders, obtains the change information of files and updates the index base in real time. The folders monitored in real time are determined by the configured white list and black list. When a file on the hard disk is created, deleted or modified, Real-Time File Monitor immediately obtains its change information, including the file path and the kind of operation.


Real-Time File Monitor can follow one of two strategies: it can actively search for files that have changed, or it can passively receive change notifications for files. SoDesktop uses the latter, which discovers a changed file more quickly than the former. Because file changes are carried out by the operating system, SoDesktop has to use the operating system's API. Real-Time File Monitor works as follows: it enters sleep mode after startup; when a file has been changed, the operating system sends a message to wake it up; Real-Time File Monitor then performs the appropriate action according to the change information of the file and submits an index request to the Priority Queue; when this work is completed, it enters sleep mode again.

Email Crawler is a special component, unlike the two components above: it handles the user's e-mail rather than files on the hard disk. There may be one or more mail clients on the user's computer, different mail clients save e-mail in different formats, and the times at which e-mails are sent and received also differ. In order to index and query these e-mails, a collector is needed for each mail client that can convert its data into the same format and submit it to the Priority Queue for further indexing. Email Crawler is designed for this purpose. It regularly scans the folders in which each mail client saves e-mail and uses a parser for each client, such as the Foxmail Parser and the Outlook Parser, to parse every message. Each e-mail is saved as a file, and the information of the saved file is submitted to the Priority Queue for indexing.

File Filter filters the file data before it is submitted to the Priority Queue. Its role is to filter out, during scanning, files whose format the desktop search does not support, as well as files that have already been indexed and do not need to be updated. To determine whether a file has been indexed, File Filter uses the file path as a key, records indexed files in a Bloom filter, and checks whether a file has been indexed by querying the Bloom filter. If a file has not been indexed, SoDesktop checks whether its type is supported; if it is, the file is submitted to the Priority Queue, otherwise it is ignored.
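To illustrate the File Filter check, here is a small self-contained Bloom filter keyed by file path. It is only a sketch with two ad hoc hash functions, not the filter SoDesktop actually uses; note that a Bloom filter can return false positives, so a path reported as already indexed may occasionally be a new file that is then skipped, while a path reported as not indexed is guaranteed to be new.

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

// A toy Bloom filter over file paths with two hash functions.
// It never misses an inserted path, but may report a few false positives.
class PathBloomFilter {
public:
    void add(const std::string& path) {
        bits_.set(hash1(path));
        bits_.set(hash2(path));
    }
    bool probablyContains(const std::string& path) const {
        return bits_.test(hash1(path)) && bits_.test(hash2(path));
    }

private:
    static constexpr std::size_t kBits = 1 << 16;  // 64K bits for this toy example
    std::bitset<kBits> bits_;

    static std::size_t hash1(const std::string& s) {
        return std::hash<std::string>{}(s) % kBits;
    }
    static std::size_t hash2(const std::string& s) {
        // Derive a second hash by salting the input.
        return std::hash<std::string>{}(s + "#salt") % kBits;
    }
};

int main() {
    PathBloomFilter indexed;
    indexed.add("/home/user/docs/report.doc");

    std::cout << indexed.probablyContains("/home/user/docs/report.doc") << "\n";  // 1: already indexed
    std::cout << indexed.probablyContains("/home/user/docs/new.txt") << "\n";     // usually 0: submit for indexing
    return 0;
}
```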

B. Task Scheduler
The main function of Task Scheduler is to submit the data in the Priority Queue to the Indexer according to a specified scheduling algorithm, in order to improve the efficiency of indexing. Several common scheduling algorithms are as follows.

FCFS (First Come First Served) is the simplest algorithm: jobs are executed in the order of their arrival time. This algorithm favors long jobs but not short jobs, because short jobs may have to wait a long time.

SJF (Shortest Job First) orders jobs by their duration: short jobs are handled first and long jobs are postponed. This algorithm is very unfavorable to long jobs, which may not be executed for a long time, and it cannot schedule jobs based on their urgency.

HRN (Highest Response ratio Next) strikes a balance between SJF and FCFS. FCFS only considers the waiting time of each job and ignores its execution time, while SJF only considers the execution time and ignores the waiting time, so both algorithms cause trouble in some extreme cases. Taking both the waiting time and the estimated duration of each job into account, the HRN scheduling algorithm chooses the job with the highest response ratio and executes it. The response ratio R is defined as follows:

R = (W + T) / T = 1 + W / T    (1)


where T is the estimated execution time of the job and W is the amount of time the job has spent waiting in the standby queue. When scheduling, the algorithm calculates the response ratio of each job and executes the one with the highest R. In this way, W / T grows as a job's waiting time increases, so long jobs also get a chance to be executed.

Different data sources have different priorities and different urgency. Among the three data sources, data from the Real-Time File Monitor has the highest priority, followed by e-mail client data, while data from Dir Crawler has the lowest priority. This is because data from the Real-Time File Monitor must be handled immediately in order to reflect changes on the hard disk quickly and accurately, and the amount of e-mail client data is much smaller than that of Dir Crawler data. Task Scheduler therefore determines priority based on the source of the data: data from the Real-Time File Monitor is handled first, followed by the mail client data, and finally the data from Dir Crawler. If only one kind of data is in the queue, HRN is used for scheduling, where a job's estimated execution time T is the file's indexing time, which can be estimated from the size of the file: the bigger the file, the more time it takes to index. The waiting time W of a job in the standby queue is the time span between the moment the file was submitted and the moment the response ratio is calculated. This ensures that real-time data is processed with top priority, while large files also get the opportunity to be indexed.
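A rough sketch of this scheduling decision follows. It treats file size as a stand-in for the estimated indexing time T, applies the data-source priority first and the response ratio of Eq. (1) within a source, and uses hypothetical names (Task, responseRatio, pickNext) rather than SoDesktop's own.

```cpp
#include <cstddef>
#include <cstdint>
#include <ctime>
#include <vector>

enum class DataSource { DirCrawler = 0, EmailCrawler = 1, RealTimeMonitor = 2 };

struct Task {
    DataSource source;
    std::uint64_t sizeBytes;   // proxy for the estimated indexing time T
    std::time_t submittedAt;   // used to compute the waiting time W
};

// Response ratio R = (W + T) / T = 1 + W / T, from Eq. (1).
static double responseRatio(const Task& t, std::time_t now) {
    double w = std::difftime(now, t.submittedAt);              // waiting time in seconds
    double est = static_cast<double>(t.sizeBytes) / 1e6 + 1.0;  // crude time estimate, avoids division by zero
    return 1.0 + w / est;
}

// Pick the next task: higher-priority sources first; within one source, the highest response ratio wins.
static std::size_t pickNext(const std::vector<Task>& tasks, std::time_t now) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < tasks.size(); ++i) {
        const Task& a = tasks[i];
        const Task& b = tasks[best];
        if (a.source != b.source) {
            if (static_cast<int>(a.source) > static_cast<int>(b.source)) best = i;
        } else if (responseRatio(a, now) > responseRatio(b, now)) {
            best = i;
        }
    }
    return best;
}

int main() {
    std::time_t now = std::time(nullptr);
    std::vector<Task> queue = {
        {DataSource::DirCrawler, 50000000, now - 600},   // big file, has waited 10 minutes
        {DataSource::DirCrawler, 20000, now - 5},        // small file, just arrived
        {DataSource::RealTimeMonitor, 1000000, now},     // real-time change
    };
    std::size_t next = pickNext(queue, now);  // index 2: the Real-Time File Monitor task wins on source priority
    (void)next;
    return 0;
}
```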


C. Data Indexer
The main function of Data Indexer is to extract text and metadata from the files collected by the crawler and to index them. The indexed items generally include the file path, file name, author name, update time and text; the text of files needs word segmentation. The index is usually organized as an inverted index, through which the appropriate documents can be found from index terms. Two main problems have to be solved in this module: one is the extraction of metadata and text from different file types, and the other is the generation of the inverted index.
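To make the inverted-index idea concrete, the sketch below maps each term to the list of document identifiers containing it. This is a simplified stand-in for what FirteX maintains internally; the names InvertedIndex and addDocument are illustrative only, and real indexes also store positions, frequencies and other statistics.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// A minimal in-memory inverted index: term -> list of document ids.
class InvertedIndex {
public:
    void addDocument(std::uint32_t docId, const std::string& text) {
        std::istringstream words(text);
        std::string term;
        while (words >> term)
            postings_[term].push_back(docId);  // no deduplication in this toy version
    }
    // Return the documents that contain the given term.
    std::vector<std::uint32_t> lookup(const std::string& term) const {
        auto it = postings_.find(term);
        return it == postings_.end() ? std::vector<std::uint32_t>{} : it->second;
    }

private:
    std::map<std::string, std::vector<std::uint32_t>> postings_;
};

int main() {
    InvertedIndex index;
    index.addDocument(1, "desktop search engine");
    index.addDocument(2, "web search engine");
    for (std::uint32_t id : index.lookup("desktop"))
        std::cout << "doc " << id << "\n";  // prints: doc 1
    return 0;
}
```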


The metadata extraction in Data Indexer supports different file types, such as pictures, videos, music and so on. File size, author and update time are basic metadata. In addition, picture dimensions are extracted for picture files; title, artist, album and genre are extracted for music; the duration is extracted for videos; and the text is extracted if a file contains text. Different types of files therefore require different data to be extracted, and in order to support further file types in the future, the system must have good scalability. SoDesktop solves this scalability problem with a plug-in system for the data extraction of different file types. A new file type parser can be added to SoDesktop by registering it with the plug-in system. When indexing, the Indexer uses the parser matching the file type to extract the data and convert it into a unified data format for the subsequent indexing step.

After the different types of data are formatted, Data Indexer begins to generate the inverted index. Index generation in SoDesktop is based on an open source text retrieval platform named FirteX. FirteX is a powerful, high-performance, flexible text indexing and retrieval platform supporting Windows, Linux and Mac systems. It is developed in standard C++ and is open source under the GPL [2]. FirteX supports Chinese (GB2312 and GBK) and English, and its flexible architecture can easily be extended to support other languages and encodings; it supports rich search syntax, multi-field search, date range search and customized ranking of search results. Therefore, it is suitable for generating the inverted index. The Indexer generates the index for one file at a time: it uses the parser for the file type to extract the data, converts it into the Document data format, and generates the index with the FirteX index API. The index is saved in the FirteX Index Database.
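One plausible shape for such a plug-in system is an abstract parser interface plus a registry keyed by file extension, as sketched below; the interface and registry names are assumptions for illustration, not FirteX or SoDesktop APIs.

```cpp
#include <map>
#include <memory>
#include <string>

// Unified output of every parser, handed on to the indexer.
struct ParsedDocument {
    std::map<std::string, std::string> fields;  // e.g. "title", "author", "text", "duration"
};

// Interface every file-type parser plug-in implements.
class FileParser {
public:
    virtual ~FileParser() = default;
    virtual ParsedDocument parse(const std::string& path) = 0;
};

// Registry that maps a file extension to its parser.
class ParserRegistry {
public:
    void registerParser(const std::string& extension, std::unique_ptr<FileParser> parser) {
        parsers_[extension] = std::move(parser);
    }
    FileParser* find(const std::string& extension) const {
        auto it = parsers_.find(extension);
        return it == parsers_.end() ? nullptr : it->second.get();
    }

private:
    std::map<std::string, std::unique_ptr<FileParser>> parsers_;
};

// Example plug-in for plain text files.
class TxtParser : public FileParser {
public:
    ParsedDocument parse(const std::string& path) override {
        ParsedDocument doc;
        doc.fields["path"] = path;
        doc.fields["text"] = "";  // a real parser would read and segment the file contents here
        return doc;
    }
};

int main() {
    ParserRegistry registry;
    registry.registerParser(".txt", std::make_unique<TxtParser>());
    if (FileParser* p = registry.find(".txt")) {
        ParsedDocument doc = p->parse("/home/user/notes.txt");
        (void)doc;
    }
    return 0;
}
```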

D. Data Searcher
Data Searcher searches the FirteX Index Database for the keywords submitted by the user and returns the results to the User Interface. The process of this module is as follows: the query string submitted by the user is turned into a formatted FirteX query string by the keyword Syntax Constructor, and this query string is submitted to Query Parser. Data Searcher takes the resulting queries and searches for matching data in the FirteX Index Database. It then finds the related documents and gives each document a score representing the degree of relevance between the document and the query. The search results are ranked by the ranking algorithm in the Rank module, and the final results are shown to the user.

Syntax Constructor is based on the FirteX query syntax. It implements multi-keyword search and multi-field search. Different types of files have different fields: for example, text files have text information, pictures have pixel information but no text, and music files have title, artist and genre information. Users can search these fields because FirteX supports multi-field search. According to the FirteX query syntax, Syntax Constructor generates the corresponding query string for searching multi-field results. Query Parser parses the query string, generates FirteX query data and submits it to the Searcher; Data Searcher then searches for the submitted query in the FirteX Index Database. All of this is based on the FirteX API. After ranking in the Rank module, the search results are shown to the user on the User Interface: according to the ranking algorithm, the Rank module sorts the results from Data Searcher and sends them to the User Interface. The ranking algorithm can be FirteX's own algorithm or a customized one; the algorithm SoDesktop uses is a multi-field ranking algorithm based on TF-IDF.
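The paper does not give the exact formula or field weights, so the following sketch only illustrates what a multi-field TF-IDF score can look like, with assumed field weights and illustrative function names.

```cpp
#include <cmath>
#include <iostream>
#include <map>
#include <string>

// Per-document, per-field statistics for one query term.
struct FieldStats {
    int termFrequency;   // occurrences of the term in this field of this document
    int fieldLength;     // total terms in this field of this document
};

// tf-idf for one term in one field: tf * log(N / df).
static double tfIdf(const FieldStats& s, int totalDocs, int docsWithTerm) {
    if (s.fieldLength == 0 || docsWithTerm == 0) return 0.0;
    double tf = static_cast<double>(s.termFrequency) / s.fieldLength;
    double idf = std::log(static_cast<double>(totalDocs) / docsWithTerm);
    return tf * idf;
}

// Multi-field score: a weighted sum over fields (e.g. a filename match counts more than body text).
static double multiFieldScore(const std::map<std::string, FieldStats>& fields,
                              int totalDocs, int docsWithTerm) {
    const std::map<std::string, double> weights = {{"filename", 3.0}, {"title", 2.0}, {"text", 1.0}};
    double score = 0.0;
    for (const auto& [field, stats] : fields) {
        auto w = weights.find(field);
        score += (w == weights.end() ? 1.0 : w->second) * tfIdf(stats, totalDocs, docsWithTerm);
    }
    return score;
}

int main() {
    // A document where the query term appears once in the filename and twice in the text.
    std::map<std::string, FieldStats> doc = {{"filename", {1, 3}}, {"text", {2, 200}}};
    std::cout << multiFieldScore(doc, /*totalDocs=*/10000, /*docsWithTerm=*/50) << "\n";
    return 0;
}
```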

IV. USER INTERFACE OF SODESKTOP

The User Interface mainly provides users with interfaces for entering queries, displaying search results and configuring SoDesktop. The user interface of SoDesktop is shown in Fig. 2.

Figure 2. User Interface of SoDesktop


Fig. 2 shows the user interface of SoDesktop after searching for the query "desktop search". The input box and the search button are at the top of the user interface. On the left side are the display panels for the different file types, including documents, music, pictures, videos, favorites and e-mail. Each panel offers advanced search options, and different file types have different options: documents can be searched by file name, file type, file size, modification date and folder; pictures can additionally be searched by dimensions, videos by duration, and music by title, artist, album and genre. The right side shows the search results, listing each item in detail, including the file name, file path, last modification time and file size.
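The advanced options can be thought of as extra field constraints appended to the keyword query. The sketch below builds such a query string in a generic Lucene-like field:value form; the actual FirteX query syntax is not shown in the paper, so this format and the field names are assumptions.

```cpp
#include <iostream>
#include <map>
#include <string>

// Build a field-qualified query string from the free-text keywords plus
// the advanced options chosen in the UI (file name, type, date range, ...).
static std::string buildQuery(const std::string& keywords,
                              const std::map<std::string, std::string>& options) {
    std::string query = "text:(" + keywords + ")";
    for (const auto& option : options)
        query += " AND " + option.first + ":" + option.second;
    return query;
}

int main() {
    std::map<std::string, std::string> options = {
        {"filename", "report*"},
        {"filetype", "doc"},
        {"modified", "[2011-01-01 TO 2012-01-01]"},
    };
    std::cout << buildQuery("desktop search", options) << "\n";
    // text:(desktop search) AND filename:report* AND filetype:doc AND modified:[2011-01-01 TO 2012-01-01]
    return 0;
}
```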

V. CONCLUSIONS AND FUTURE WORK

This paper presents the structure of the SoDesktop desktop search engine and its components, namely Data Crawler, Task Scheduler, Data Indexer and Data Searcher, and describes the implementation of these four modules in detail. In addition, the ranking algorithm and the strategy of CPU throttling are also presented. The ranking algorithm and CPU throttling were the most difficult problems to solve in the design of SoDesktop, and there are still some issues in these two modules: the multi-field ranking algorithm based on TF-IDF sometimes does not understand the user's query well. In order to improve the ranking algorithm, semantic analysis can be introduced into desktop search in future work.

REFERENCES
[1] Google, "Google Desktop Download," http://desktop.google.com/, 2010.
[2] Key Laboratory of Network Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, "FirteX - High Performance Search Platform," http://www.firtex.org/, 2010.
[3] Y. Matsubara and I. Kobayashi, "Development of a Desktop Search System Using Correlation between User's Schedule and Data in a Computer," Proc. 2007 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2007), IEEE Press, Nov. 2007, pp. 235-238.
[4] Wei Lun Huang, Tzao Lin Lee and Chiao Szu Liao, "Desktop Search in the Intranet with Integrated Desktop Search Engines," Proc. 13th IEEE Asia-Pacific Computer Systems Architecture Conference (ACSAC 2008), IEEE Press, Aug. 2008, pp. 1-4.
[5] S. Cohen, C. Domshlak and N. Zwerdling, "On Ranking Techniques for Desktop Search," ACM Transactions on Information Systems, vol. 26, Mar. 2008, pp. 1183-1184.
[6] J. Gaugaz, S. Costache, P. Chirita, C. S. Firan and W. Nejdl, "Activity Based Links as a Ranking Factor in Semantic Desktop Search," Proc. Latin American Web Conference 2008 (LA-Web 2008), IEEE Press, Oct. 2008, pp. 49-57.
[7] S. Chernov, "Task Detection for Activity-based Desktop Search," Proc. 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008), ACM, Jul. 2008, pp. 894-894.
[8] C. Fluit, "Autofocus: Semantic Search for the Desktop," Proc. 9th International Conference on Information Visualisation (IV 2005), IEEE Press, London, Jul. 2005, pp. 480-487.
[9] P. A. Chirita, R. Gavriloaie, S. Ghita, W. Nejdl and R. Paiu, "Activity Based Metadata for Semantic Desktop Search," Proc. 2nd Annual European Semantic Web Conference (ESWC 2005), Springer Berlin, May 2005, pp. 439-454.
[10] P. A. Chirita, S. Costache, W. Nejdl and R. Paiu, "Beagle++: Semantically Enhanced Searching and Ranking on the Desktop," Proc. 3rd Annual European Semantic Web Conference (ESWC 2006), Springer Berlin, Jun. 2006, pp. 348-362.
[11] Z. Yun-tao, G. Ling and W. Yong-cheng, "Retrieval Technique with Natural Language Interface," Journal of Guangxi Normal University (Natural Science Edition), vol. 21, Mar. 2003, pp. 6-9.
[12] Z. Xin-xin, S. Hong-guang and L. Yu-shu, "Information Retrieval Algorithm Based on Improved Hanning Window," Journal of Guangxi Normal University (Natural Science Edition), vol. 24, Dec. 2006, pp. 191-194.
[13] Z. Su and M. Jianxi, "Review of Research on Semantic Desktop," Library Journal, vol. 28, Mar. 2009, pp. 58-63.
[14] L. Chang-Tien, M. Shukla, S. H. Subramanya and W. Yamin, "Performance Evaluation of Desktop Search Engines," Proc. 2007 IEEE International Conference on Information Reuse and Integration (IEEE IRI-07), IEEE Press, Aug. 2007, pp. 110-115.
[15] Y. Kabutoya, T. Yumoto, S. Oyama, K. Tajima and K. Tanaka, "Quality Estimation of Local Contents Based on PageRank Values of Web Pages," Proc. 22nd International Conference on Data Engineering (ICDE 2006), IEEE Press, Apr. 2006, pp. x134-x134.
[16] L. Wei-chao, "Analysis of Desktop Search Engine," Journal of Modern Information, vol. 27, Dec. 2007, pp. 211-213.
[17] Microsoft, "Microsoft Windows Desktop Search," http://www.microsoft.com/windows/products/winfamily/desktopsearch/default.mspx, 2010.
[18] Yahoo, "Yahoo! Desktop Search Beta," http://info.yahoo.com/privacy/in/yahoo/desktopsearch/, 2010.
[19] Copernic, "Copernic Desktop Search - The best desktop search tool," http://www.copernic.com/en/products/desktop-search/index.html, 2010.
[20] Likasoft, "Archivarius 3000," http://www.likasoft.com/cn/documentsearch/, 2010.
[21] Baidu, "Baidu Disk Search," http://disk.baidu.com/, 2010.