International Journal of Artificial Intelligence and Mechatronics, Volume 4, Issue 3, ISSN 2320 – 5121
A Comparative Study Between Proposed File Retrieval System and Related Systems

Mohammed Khaleel 1,2, H. M. El-Bakry 1, Ahmed A. Saleh 1

1 Information Systems Department, Faculty of Computers and Information, Mansoura University, Egypt
2 Ministry of Higher Education and Scientific Research, Scholarships and Cultural Relations Directorate, Baghdad, Iraq
Abstract – We study the problem of file retrieval in search systems and the caching of query result pages in Web search engines. Popular search engines receive millions of queries daily, and a result page is returned to the user for every query. The user might then submit a new query, ask for additional result pages for the same query, or stop searching. This manuscript studies query result caching in the proposed framework. We show an adaptation of a common online paging scheme to this model. In the resulting algorithm, the expected number of cache misses is no more than four times the expected number of misses that any other online caching algorithm may incur under our particular model of query generation. A set of measurements is proposed for evaluating Web search engine performance. Some measurements are adapted from the concepts of recall and precision, which are commonly used in evaluating traditional information retrieval systems. Others are newly developed to evaluate search engine stability, an issue unique to Web information retrieval systems. An experiment was conducted to test these new measurements by applying them to a performance comparison of three file search engines: 4shared, File Search Engine, and Share Digger.

Keywords – File Search Engine, File Caching, Indexing, Crawling, Caching Technique.
I. INTRODUCTION
Large-scale search engines are the primary way to access Web content. In October 2010, the indexed Web was estimated to contain no fewer than 14.6 billion pages [1]. The real size of the Web is projected to be much bigger because of the existence of dynamically created pages. The foremost function of file search systems is to bring in this huge quantity of content and store it in an efficiently searchable manner. Commercial file search systems are estimated to process billions of queries against their index of the Web every day. Hence, creating a file search system that scales even to today's Web presents many challenges [11]. Fast crawling technology is needed to gather web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second. Three main elements are involved in a full-fledged web search engine: a query processor, an indexer, and a crawler. In practice, commercial search engines contain many other components as well (e.g., a spelling corrector, a spam classifier, a web graph builder). Yet, we choose to ignore those complementary elements because they do not have any important influence on scalability, or they are specific to a certain objective. Fig. 1 shows the function of these three components.
Fig. 1. Main Components of File Search System

Commonly, these file search systems receive millions of queries daily on all aspects of life. Although millions of users submit these queries, studies show that a few popular queries account for a significant part of the query stream. These statistical characteristics of the query stream suggest caching search results [2]. A system that answers many queries from a cache, instead of processing them through its index, can decrease its hardware requirements and reduce its response time. Some studies have reported tests with caching query results. In this manuscript, we propose a file search framework that uses a hybrid of cache replacement policies. This cache implements a practical algorithm for prioritizing which entries to refresh and introduces the concept of stimulating cache entries. The manuscript is organized as follows: Section 2 discusses previous work, important parameters of file search systems, and the main challenges of this work. In Section 3, we review the framework of the system and the algorithms suggested for this work. In Section 4, we show numerous experimental outcomes. Section 5 concludes the paper. Lastly, the list of references is in Section 6.
II. PREVIOUS WORK
i) File Search System Using Caching Techniques
In [3], Markatos used a log containing a million queries submitted to the Excite file search system to drive emulations of query result caches. Using least recently used (LRU) replacement, Markatos showed that warm, large caches of file search results could achieve hit ratios of nearly 20%. A two-level caching scheme that combines the caching of inverted lists and of query results was proposed by Saraiva et al. [4]. LRU was the replacement strategy they adopted for the query result cache.
They tested logged query streams and compared their method with a system without caches. Overall, their combined caching strategy improved system throughput threefold, while preserving the response time per query. Furthermore, file search systems can prefetch results that are expected to be requested shortly, storing them in the cache alongside the results of submitted queries. A direct example is prefetching the next page of results every time a user submits a new query. In [5], a log containing more than 7 million queries submitted to the Share Digger file search system was used to test integrated schemes for the caching and prefetching of search results; hit ratios exceeding 40% were achieved. Prefetching of results proved to be of major importance, doubling the hit ratios of small caches and increasing those of larger caches by over 40%. The prefetching of search results has also been investigated in [6], albeit from a different angle: the objective was to minimize the computational cost of serving search sessions rather than to maximize the hit ratio of the result cache.
ii) Important Parameters of File Search Systems
The scalability of a search engine is affected by many parameters, which can be categorized as external and internal. External parameters are those the search engine cannot control, whereas internal parameters are those the search engine can, at least partially, control. Table 1 summarizes these parameters. Among the internal parameters, the most essential for scalability is the amount of hardware. Increasing hardware storage capacity and processing power significantly helps the efficiency goals in query processing, indexing, and crawling. However, adding machines is not a long-term solution for scalability, because maintenance and depreciation costs are considerably high. In addition to hardware, network bandwidth is an essential parameter for crawling. Assuming the backend text processing system is strong enough not to form a bottleneck, the page download rate is determined by the network bandwidth available to the crawler. For scaling the query processing component, it is important to raise the hit rate of the result cache [7], as this considerably decreases the query traffic that reaches the query processor. The peak query processing throughput that can be sustained by the query processor is another essential parameter [8]. This is generally determined by the efficiency of the data structures used by the query processor and by the quantity of available hardware.

Table 1: The parameters that impact file search system scalability

Component        | Internal Parameters                              | External Parameters
Crawling         | Amount of hardware; Available network bandwidth  | Rate of web growth; Rate of web change; Amount of malicious intent
Indexing         | Amount of hardware                               | Amount of spam content
Query Processing | Cache hit rate; Peak query processing throughput | Peak query traffic
iii) Main Challenges
A. Crawling Challenges
Another open problem is so-called push-based crawling. In this technique, instead of web pages being discovered by the crawler itself by following the link structure of pages, they are discovered by external agents and pushed to the crawler.
B. Indexing Challenges
Another open research problem is that of developing scalable systems to index and process huge quantities of real-time streaming text data. An open problem that has not yet received attention from researchers is multi-site distributed indexing. Though there have been some studies on parallel index creation, it is not clear how those works could be extended to multi-site architectures connected by wide-area networks [9].
C. Ranking Challenges
Another open research problem is that of developing ranking techniques for indexing and sorting large quantities of real-time streaming text data [10].
III. PROPOSED FRAMEWORK AND ALGORITHMS

Fig. 2. Block Diagram of Proposed Framework
3.1. The Proposed Framework
The proposed framework consists of two phases. The first phase is performed offline, as a preprocessing phase, to make all files searchable; it contains the crawling and indexing parts. The second phase is the online phase, which contains the retrieving and ranking parts.
3.1.1. Crawling Phase
The foremost objectives of the crawler are achieving high page freshness, high content quality, and high web coverage. The crawler aims to locate and fetch as many pages as possible from the Web; this increases the likelihood that more pages useful to users will be indexed by the search engine. Finally, the crawler tries to prioritize the fetching of pages, using query logs, so that comparatively more significant pages are downloaded earlier.
3.1.2. Indexing Phase
For the indexer, the main objective is extracting a rich set of features from the collection of downloaded documents and possibly from other information sources. The query processor uses the extracted features to evaluate the significance of documents to queries. Thus, it is important to identify a large number of features that are fairly good indicators of significance and to precompute them as accurately as possible. The efficiency objectives in indexing involve creating indexes that are highly compact, speeding up index update operations, and reducing the length of index deployment cycles. Fig. 3 shows the main steps of the indexing phase in our system.
Fig. 3. Indexing Phase of Proposed Framework
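To make the indexing step concrete, the following is a minimal Python sketch of building an inverted index over crawled file records. It illustrates the general technique described above, not the paper's implementation; the tokenizer and the "documents" mapping are assumptions of this example.

    # Minimal sketch: inverted index mapping each term to the documents
    # (file records) that contain it. Illustrative, not the paper's code.
    import re
    from collections import defaultdict

    def build_inverted_index(documents):
        """documents: mapping of doc_id -> text record for that file."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for term in re.findall(r"[a-z0-9]+", text.lower()):
                index[term].add(doc_id)  # term -> ids of documents containing it
        return index

    # Example using two file records drawn from the sample queries used later.
    docs = {1: "Java Tutorials in PDF", 2: "Data Structure Books PDF"}
    index = build_inverted_index(docs)
    assert index["pdf"] == {1, 2} and index["java"] == {1}

The set-per-term layout keeps the index compact and makes multi-term queries a simple set intersection, which matches the compactness and fast-lookup objectives stated above.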
3.1.3. Retrieval Phase
The success of a search engine typically correlates with the fraction of relevant results among the generated search results (i.e., precision) and the fraction of relevant results returned to the user out of all relevant results available in the index (i.e., recall). The query processing time determines the response time for a query, so we use a caching technique to minimize query response time. Rapid evaluation of queries is the foremost efficiency objective of the query processor. Fig. 4 shows the main steps of the retrieving phase in our system.
Fig. 4. Retrieving Phase of Proposed Framework
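As a small illustration of the two measures just defined, the following Python sketch computes precision and recall over sets of result identifiers. The set representation is an assumption of this example, not a structure from the paper.

    # Precision and recall over sets of result IDs; a sketch of the standard
    # definitions cited above, not code from the proposed system.
    def precision(retrieved, relevant):
        """Fraction of retrieved results that are relevant."""
        return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

    def recall(retrieved, relevant):
        """Fraction of all relevant results that were retrieved."""
        return len(retrieved & relevant) / len(relevant) if relevant else 0.0

    # E.g., 48 relevant results among 50 retrieved gives precision 0.96,
    # matching the first row of Table 2 below. Recall would also need the
    # total number of relevant files in the index, which the tables do not report.
    assert precision(set(range(50)), set(range(48))) == 0.96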
3.2. Proposed Algorithms
3.2.1. Proposed Crawling Algorithm
In contrast to a general search engine, we focus only on crawling and retrieving files. Therefore, the properties required for the first step of the crawling process are the file name, size, and extension. The suggested algorithm, presented in Algorithm 1, takes three inputs (URL, keyword, and file extension) and outputs an array of files.

ALGORITHM: File_Crawling
Input: URL, FKeyword, FExtension
Output: arrayOfFiles
1:  Procedure File_Crawling
2:  Begin
3:    B_url = getBaseURL(URL)
4:    P = Download(URL)
5:    Urls = ExtractOutgoingURLs in P with B_url
6:    Furls = Filter Urls by FKeyword
7:    Foreach Furls as Furl do
8:      IF the file extension of Furl is FExtension
9:        add Furl to arrayOfFiles
10:     Else
11:       Continue
12:   End Foreach
13:   IF arrayOfFiles equals null
14:     File_Crawling(Urls, FKeyword, FExtension)
15:   End IF

Algorithm 1: Proposed File Crawling Algorithm
3.2.2. Proposed Caching Algorithm
The cache replacement policies used in our system are MFU (Most Frequently Used) and LRU (Least Recently Used), combined and implemented as stated in Algorithm 2.
ALGORITHM: LRU with MFU Caching
Input: Class of Files, Time Stamp for each file
Output: Victim Files
1: If the needed File is in the cache (Cache Hit)
2:   Access the File from the Cache by ID
3: Else the needed File is not in the Cache (Cache Miss)
4:   Search for the File in its Class (Cluster) in the Database
5:   IF the File is found in the DB
6:     Select from the Cache, as a victim, the least used file with the minimum access count and the longest time stamp
7:     Replace it with the file found in the DB
8:   END IF
9: END IF

Algorithm 2: Proposed LRU with MFU Caching Algorithm
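A minimal Python sketch of Algorithm 2 follows. The "database" mapping stands in for the clustered file store, and the eviction key (fewest accesses, then oldest access time) is one reading of the victim rule in step 6; both are assumptions of this sketch.

    # Sketch of the LRU-with-MFU replacement of Algorithm 2: the victim is
    # the cached file with the fewest accesses and, among ties, the oldest
    # access time. `database` stands in for the clustered file store.
    import time

    class HybridFileCache:
        def __init__(self, capacity, database):
            self.capacity = capacity
            self.database = database   # file_id -> file payload
            self.entries = {}          # file_id -> [payload, hits, last_access]

        def get(self, file_id):
            entry = self.entries.get(file_id)
            if entry is not None:                    # steps 1-2: cache hit
                entry[1] += 1
                entry[2] = time.monotonic()
                return entry[0]
            payload = self.database.get(file_id)     # steps 3-4: search the DB
            if payload is None:
                return None
            if len(self.entries) >= self.capacity:
                # Step 6: victim = fewest hits, then longest-idle entry.
                victim = min(self.entries,
                             key=lambda k: (self.entries[k][1], self.entries[k][2]))
                del self.entries[victim]
            self.entries[file_id] = [payload, 1, time.monotonic()]  # step 7
            return payload

    cache = HybridFileCache(2, {1: "a.pdf", 2: "b.pdf", 3: "c.pdf"})
    cache.get(1); cache.get(1); cache.get(2)
    cache.get(3)  # evicts file 2, which has fewer hits than file 1
    assert 2 not in cache.entries and 1 in cache.entries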
IV. EVALUATION AND EXPERIMENTS
4.1. Machine Specifications
The machine specifications were a Core i7 CPU, 2 GB RAM, a 500 GB hard disk, and Windows 7. The software specifications were an Apache server (localhost) with PHP version 5.3 and a MySQL database, version 5.5. The dataset covers 100 online users. In our experiment, we present the crawling process performance and the query processing time.
4.2. Proposed System Evaluation
Table 2 and the chart in Fig. 5 show the crawling process performance for a sample of the 10 most frequently used queries, with the total number of retrieved, irrelevant, and relevant results.
Table 2: Relevant and Irrelevant Results of the Crawling Process on 10 Sample Queries

Query (10 Sample Top Queries)           | Total Results Retrieved After Filtration | Error (Irrelevant Results) | Performance (Relevant Results)
Computer Science Book                   | 50 | 2 | 48
Computer Science Tutorial               | 70 | 5 | 65
Java Tutorials                          | 44 | 3 | 41
Java Books                              | 62 | 2 | 60
Java Tutorials in PDF and DOC           | 84 | 8 | 76
Data Structure Books PDF                | 57 | 5 | 52
Data Structure Algorithms               | 68 | 6 | 62
Algorithms Examples in Computer Science | 66 | 2 | 64
Statistical Tutorials                   | 88 | 9 | 79
Mobile Computing Tutorials in PDF       | 77 | 4 | 73
4.3. Caching Evaluation
Cache hit: the requested data is found in the cache. This results in a data transfer at maximum speed.
Cache miss: the requested data is not found in the cache. The processor loads the data from main memory and copies it into the cache, which incurs an extra delay called the miss penalty.
Hit ratio = percentage of memory accesses satisfied by the cache. Miss ratio = 1 − hit ratio.
Table 3: Caching Hit and Miss Ratios with Corresponding Times in the Proposed System

Query (10 Sample Top Queries)           | Cache Hit Ratio | Hit Time (s) | Cache Miss Ratio | Miss Time (s)
Computer Science Book                   | 84% | 0.001 | 16% | 0.003
Computer Science Tutorial               | 73% | 0.001 | 27% | 0.005
Java Tutorials                          | 74% | 0.001 | 26% | 0.008
Java Books                              | 71% | 0.001 | 29% | 0.008
Java Tutorials in PDF and DOC           | 65% | 0.001 | 35% | 0.009
Data Structure Books PDF                | 84% | 0.001 | 16% | 0.008
Data Structure Algorithms               | 88% | 0.001 | 12% | 0.007
Algorithms Examples in Computer Science | 77% | 0.001 | 23% | 0.01
Statistical Tutorials                   | 50% | 0.001 | 50% | 0.01
Mobile Computing Tutorials in PDF       | 60% | 0.001 | 40% | 0.01
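As a worked illustration of how the hit and miss figures combine, the following applies the standard expected-access-time formula to the "Statistical Tutorials" row of Table 3; the formula itself is textbook caching arithmetic, not a computation from the paper.

    # Expected time per access = hit_ratio * hit_time + (1 - hit_ratio) * miss_time.
    # Values below are the "Statistical Tutorials" row of Table 3.
    hit_ratio, hit_time, miss_time = 0.50, 0.001, 0.01

    expected_time = hit_ratio * hit_time + (1 - hit_ratio) * miss_time
    print(f"{expected_time:.4f} s per access")  # 0.0055 s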
4.4. A Comparison Study between the Proposed System and Other Related Systems
This section compares the proposed system with other systems in the related field (file search engines): the File Search System, the 4shared system, and Share Digger. The comparison covers the total, relevant, and irrelevant results for the query "computer science book" as an example, in addition to the query time and the crawler used in each system, as shown in Table 4. The File Search System and the Share Digger system rely on external crawlers (see Table 4) as a custom search over the cloud, yielding a high number of results but a high error rate (irrelevant results). In contrast, the 4shared system uses its own crawler on its own datasets (not on the cloud), yielding a minimal number of results with a low error rate (irrelevant results).
Table 4: A Comparison Study between the Proposed System and Other Systems

System              | Total Results | Relevant Results | Irrelevant Results | Time of Query | Crawler Used     | Dataset
Proposed System     | 50            | 48               | 2                  | 3.94 sec      | Proposed Crawler | Cloud Dataset
File Search System  | 86            | 15               | 15                 | 5 sec         | MSN-Bot          | Windows File Storage Dataset
4shared System      | 2             | 2                | 0                  | 3 sec         | 4shared-Crawler  | 4shared-Dataset
Share Digger System | 817           | 30               | 787                | 22 sec        | Glimpse-Bot      | Cloud Dataset
Fig. 5. Performance and Error Chart for Crawling with Caching Process
V. CONCLUSION
In this study, we presented the proposed methodology for file searching in the cloud, which is faster and more accurate than other methodologies. It combines caching and crawling algorithms to retrieve files more accurately and more quickly than other approaches. The paper also described the suggested system's online and offline phases, which are designed to overcome the challenges faced by other systems.
REFERENCES
[1] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong, "Diversifying search results," in Proc. 2nd ACM International Conference on Web Search and Data Mining, New York, NY, USA, 2009, pp. 5–14.
[2] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," in Proc. 7th International WWW Conference, 1998, pp. 107–117.
[3] E. P. Markatos, "On caching search engine query results," in Proc. 5th International Web Caching and Content Delivery Workshop, 2000.
[4] P. Saraiva, E. Moura, N. Ziviani, W. Meira, R. Fonseca, and B. Ribeiro-Neto, "Rank-preserving two-level caching for scalable search engines," in Proc. 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA, USA, 2001, pp. 51–58.
[5] R. Lempel and S. Moran, "Predictive caching and prefetching of query results in search engines," in Proc. 12th World Wide Web Conference (WWW2003), Budapest, Hungary, 2003, pp. 19–27.
[6] R. Lempel and S. Moran, "Optimizing result prefetching in web search engines with segmented indices," in Proc. 28th International Conference on Very Large Data Bases, Hong Kong, China, 2002, pp. 370–381.
[7] B. B. Cambazoglu, F. P. Junqueira, V. Plachouras, S. Banachowski, B. Cui, S. Lim, and B. Bridge, "A refreshing perspective of search engine caching," in Proc. 19th International Conference on World Wide Web, New York, NY, USA, 2010, pp. 181–190.
[8] B. B. Cambazoglu, E. Varol, E. Kayaaslan, C. Aykanat, and R. Baeza-Yates, "Query forwarding in geographically distributed search engines," in Proc. 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2010, pp. 90–97.
[9] R. Blanco, E. Bortnikov, F. Junqueira, R. Lempel, L. Telloli, and H. Zaragoza, "Caching search engine results over incremental indices," in Proc. 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2010, pp. 82–89.
[10] B. B. Cambazoglu and C. Aykanat, "Performance of query processing implementations in ranking-based text retrieval systems using inverted indices," Information Processing & Management, vol. 42, no. 4, pp. 875–898, 2006.
[11] M. Khaleel, H. M. El-Bakry, and A. A. Saleh, "Developing e-learning services based on cache strategy and cloud computing," International Journal of Information Science and Intelligent System, vol. 3, no. 4, pp. 45–52, October 2014.