Content Based Parallel Information Retrieval for Text Files – Exploiting the Multiprocessor Functionality

Tarjni Vyas∗, Chirag Kharwar and Vikas Shah

Department of CSE, Faculty of CSE, Nirma Institute of Technology, Ahmedabad 382 481, India.
e-mail: [email protected], [email protected], [email protected]
Abstract. In an era of growing services, information is the biggest asset of any organization, and retrieving it quickly and efficiently helps the organization grow. Algorithms for retrieving the desired information from documents are already available, but they are sequential in nature and therefore take more time to execute. Current hardware provides multiprocessing systems, so the power of multiple cores can be used to reduce the search time of retrieval algorithms. In this experiment, we have implemented a multiprocessing program that takes advantage of the number of cores available in the system. We have tested our program against its sequential counterpart and compared the two in terms of speed and efficiency.

Keywords: Information retrieval; Parallel processing; IR models.
1. Introduction

Information retrieval is essential to the growth of any organization, and it matters equally to other, non-organizational users. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers) [1]. Parallel processing is the method of using multicore functionality to obtain better performance. When the two domains are combined, useful results can be obtained [1], and in this paper we attempt to do exactly that.

Nowadays users worry less about the relevance of the documents than about speed, i.e. how fast they get their results. Information is available in large volumes, and if sequential algorithms are applied to this much data, a query takes a long time to return its result. VLSI technology is improving day by day, so even general-purpose PCs now have multiple processors embedded in a single CPU. These processors generally remain unused whenever a program runs in sequential mode. Our aim is therefore to use these idle processors in the PCs of general-purpose users to give them a faster search mechanism.

The rest of this paper is organized as follows. Section 2 discusses the existing sequential models that are most widely used at present. Section 3 introduces multiprocessing in multicore systems. Section 4 presents our approach to performing retrieval in parallel. Section 5 compares the results of the parallel program, the sequential program and the normal Windows search time. Section 6 discusses work that can be done to further improve the relevance of the documents. Finally, Section 7 concludes the paper.

2. IRS Models

Information retrieval is a growing area in which a great deal of research has already been done, and researchers have proposed many models for searching relevant documents. We first explain what an information retrieval model is. Figure 1 shows the basic model of any information retrieval system and is self-explanatory. The retrieval function is where the algorithms are applied. The document model and the query model are given as input to the algorithm, whose job is to find the relevant documents according to some scheme. This scheme is called the model for that algorithm [1].

∗ Corresponding author
© Elsevier Publications 2014.
Figure 1. Information retrieval model.
We discuss below some of the models that are popular and in current use.

A. Boolean model

The Boolean model is one of the simplest and earliest approaches to retrieving relevant documents [2]. The adjective “Boolean” refers to the use of Boolean algebra, in which the words in the documents are logically combined with the Boolean operators AND, OR and NOT. For example, the Boolean AND of two logical statements X and Y means that both X and Y must be satisfied, while the Boolean OR means that at least one of them must be satisfied. Any number of statements can be combined using these three operators. The problem with this model is that it does not take the weight of a document into consideration: it only checks whether a word is present in the document or not. As a result, either too many or too few documents may be retrieved.

AND (∧): the intersection of two sets.
OR (∨): the union of two sets.
NOT (¬): the set inverse, or in practice the set difference.

B. Vector model

The vector model takes the document weight into consideration in order to retrieve more relevant documents. The classical vector model measures the similarity between a query and a document by the cosine of the angle between the query vector q and the document vector d [3]. The relevance, or similarity, can therefore be expressed as:

$$ \mathrm{sim}(q, d) = \frac{q \cdot d}{|q|\,|d|} $$
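As a minimal sketch of the cosine measure above, the following Java fragment computes the similarity of two term-weight vectors of equal dimension. The class and method names (`CosineSimilarity`, `cosine`) are illustrative assumptions and are not taken from the implementation described in this paper; returning 0 for an all-zero vector is likewise an assumption made only to avoid a division by zero.

```java
// Illustrative sketch only: names are hypothetical, not from the paper's code.
final class CosineSimilarity {

    // Returns (q . d) / (|q| |d|), or 0.0 when either vector has zero length.
    static double cosine(double[] q, double[] d) {
        if (q.length != d.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double normQ = 0.0;
        double normD = 0.0;
        for (int i = 0; i < q.length; i++) {
            dot += q[i] * d[i];
            normQ += q[i] * q[i];
            normD += d[i] * d[i];
        }
        if (normQ == 0.0 || normD == 0.0) {
            return 0.0; // assumption: an all-zero vector is treated as not similar
        }
        return dot / (Math.sqrt(normQ) * Math.sqrt(normD));
    }
}
```

Two vectors that share no terms give a similarity of 0, and two vectors that point in the same direction give a similarity of 1, whatever their lengths.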
The vector model uses the term frequency (TF) and the inverse document frequency (IDF), and computes the weight of each term by multiplying the two. The term frequency is the number of times the term appears in the document, and the IDF is:

$$ \mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \qquad (1) $$

where N denotes the total number of documents and the denominator denotes the number of documents in which the term t is present. By calculating a weight for each document, we can rank the documents in decreasing order of weight.

3. Multiprocessing in Multicore Systems

At the most basic level of a multiprocessing system, a task whose components/modules can be executed independently of each other is divided among the different processors/cores available in the machine. Because these subtasks use different cores, the program can achieve better performance in terms of execution speed. Multiprocessing is defined in [8] as “A single OS instance controls two or more identical processors connected to a single shared memory and distribute the task among them.”

In parallel processing, the amount of work done by each processor is decided by one of several strategies; one of them is the threshold method. In this method a fixed number is chosen: if the size of a task is no larger than that number, the partition is given to a processor; otherwise the task has to be partitioned further. This logic is used in the next section, which contains our approach to parallelization.
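The threshold method can be illustrated with a short Java sketch that only computes the partitions, without running any work on them. It is written under stated assumptions: the names `ThresholdPartitioner` and `partition` are hypothetical, tasks are identified by a half-open index range, a range is split in half when it is too large, and a piece holding exactly `threshold` items is treated as small enough.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: class and method names are hypothetical.
final class ThresholdPartitioner {

    // Splits the half-open index range [from, to) in half repeatedly until
    // every piece holds at most `threshold` items; each piece is {from, to}.
    static List<int[]> partition(int from, int to, int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        List<int[]> pieces = new ArrayList<int[]>();
        split(from, to, threshold, pieces);
        return pieces;
    }

    private static void split(int from, int to, int threshold, List<int[]> out) {
        if (to - from <= threshold) {
            if (to > from) {
                out.add(new int[] {from, to}); // small enough: give it to one processor
            }
            return;
        }
        int mid = from + (to - from) / 2;      // too large: partition further
        split(from, mid, threshold, out);
        split(mid, to, threshold, out);
    }
}
```

For example, `partition(0, 8, 2)` yields the four pieces [0, 2), [2, 4), [4, 6) and [6, 8), i.e. two items per processor.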
4. Our Approach of Parallelization

As we have seen, different approaches are available for a particular information retrieval need. We have modified the Boolean model for IR and use the following algorithm:

1. Fetch the document as a string.
2. For each document do
   a) Find whether the word is present in the document or not.
   b) If the word is present then
      • Increment the count and break the loop for that particular word.
   c) Otherwise
      • Continue searching for the next word.
3. Assign the count as a weight to the document.
4. Sort all the documents in descending order of count.

We now present the parallelization approach for the above algorithm, i.e. applying it in parallel on different cores of the CPU at the same time [4,6]. As more documents are processed in parallel, the ultimate goal of improving the performance of the system can be achieved. The approach is as follows:
Figure 2. Retrieval assuming threshold value to 2.
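Before turning to the parallel version, the four-step algorithm listed earlier in this section can be sketched sequentially in Java. This is a sketch under stated assumptions rather than the implementation used in the experiment: the names `SequentialBooleanRanker`, `score` and `rank` are hypothetical, documents are already loaded as strings, and a word counts as present when it occurs as a case-sensitive substring.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: names are hypothetical, not the paper's code.
final class SequentialBooleanRanker {

    // Number of query words that occur in the document (steps 2 and 3).
    static int score(String document, String[] queryWords) {
        int count = 0;
        for (String word : queryWords) {
            if (document.contains(word)) { // presence only; repeated hits are ignored
                count++;
            }
        }
        return count;
    }

    // Returns document indices ordered by descending score (step 4).
    // Collections.sort is stable, so equal scores keep their original order.
    static List<Integer> rank(List<String> documents, String[] queryWords) {
        final int[] scores = new int[documents.size()];
        List<Integer> order = new ArrayList<Integer>();
        for (int i = 0; i < documents.size(); i++) {
            scores[i] = score(documents.get(i), queryWords);
            order.add(i);
        }
        Collections.sort(order, new Comparator<Integer>() {
            @Override
            public int compare(Integer a, Integer b) {
                return Integer.compare(scores[b], scores[a]);
            }
        });
        return order;
    }
}
```

Because each document is scored independently of the others, the loop over documents is the part that can be divided among cores.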
As shown in Figure 2, the user submits a query and that query is processed simultaneously by different processors. Each processor is given documents and a program to process them. We use the Fork/Join API of Java, which runs the forked subtasks on a pool of worker threads inside a single JVM; by default the pool has one worker per available core, so on a quad-core processor up to four subtasks run on different cores at the same time. We assume a CPU whose cores are all dedicated to the search operations, i.e. the search operations have the highest, or real-time (in the case of Windows), priority, so that as soon as a document is ready it can get a processor to execute on.

The parallelization works as follows. Some threshold value needs to be set. In our experiment the threshold value is 1, i.e. each processor processes one document at a time. There is thus no sequential execution within a partition: if there are 4 processors, they process 4 documents at a time, as mentioned earlier. Figure 2 illustrates the case of a threshold of 2: CORE 1 processes documents 1 to 2, CORE 2 processes documents 3 to 4, and so on up to CORE N, which processes documents n − 1 to n. In this way all the cores of the CPU are utilized and better performance can be achieved in terms of speed. The utilization of the CPU in sequential and parallel execution is also discussed in the next section.

5. Implementation Results

For the implementation we used Java as the programming language and NetBeans as the editor. We used Java 7, and the Fork and Join API that comes with Java 7 was used to process the documents in parallel. The files used are 77 MB each; we tested our application on up to 2.27 GB of data, and the results suggest that it scales well. The configuration on which the application was tested is as follows:

Table 1. Workstation configuration.

Parameter               Value
Processor architecture  x86
Operating system        Windows 7 Ultimate 64-bit
Processor               Core i3 M350 2.27 GHz
RAM                     4 GB
Number of cores         4
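The use of the Fork/Join API with a threshold, as described in Section 4, might look roughly like the following Java 7 sketch. It is not the code used in the experiment: the class `ParallelScoreTask` and the method `scoreAll` are hypothetical names, the documents are assumed to be in memory as strings, and the threshold is taken to be the largest number of documents that one leaf task scores on its own. The tasks run on worker threads that write to disjoint slots of one shared array.

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Illustrative sketch only: names are hypothetical, not the paper's code.
final class ParallelScoreTask extends RecursiveAction {
    private final List<String> documents;
    private final String[] queryWords;
    private final int[] scores;  // shared result array, one slot per document
    private final int from;      // start of the half-open range of this task
    private final int to;        // end (exclusive) of the range of this task
    private final int threshold; // largest range scored without splitting

    ParallelScoreTask(List<String> documents, String[] queryWords, int[] scores,
                      int from, int to, int threshold) {
        this.documents = documents;
        this.queryWords = queryWords;
        this.scores = scores;
        this.from = from;
        this.to = to;
        this.threshold = threshold;
    }

    @Override
    protected void compute() {
        if (to - from <= threshold) {
            // Small enough: count the query words present in each document.
            for (int i = from; i < to; i++) {
                int count = 0;
                for (String word : queryWords) {
                    if (documents.get(i).contains(word)) {
                        count++;
                    }
                }
                scores[i] = count;
            }
            return;
        }
        // Too large: fork two halves and wait for both to finish.
        int mid = from + (to - from) / 2;
        invokeAll(new ParallelScoreTask(documents, queryWords, scores, from, mid, threshold),
                  new ParallelScoreTask(documents, queryWords, scores, mid, to, threshold));
    }

    // Scores every document; the default pool has one worker per available processor.
    static int[] scoreAll(List<String> documents, String[] queryWords, int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        int[] scores = new int[documents.size()];
        ForkJoinPool pool = new ForkJoinPool();
        try {
            pool.invoke(new ParallelScoreTask(documents, queryWords, scores,
                                              0, documents.size(), threshold));
        } finally {
            pool.shutdown();
        }
        return scores;
    }
}
```

With a threshold of 1 every leaf task handles a single document, which corresponds to the setting reported for the experiment; a larger threshold creates fewer tasks and so reduces the forking overhead per document.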
We measured the minimum time the program takes in both sequential and parallel mode for different numbers of files, i.e. different sizes of data. The results are as follows:

Table 2. Experiment results. Program analysis for the file size of 77 MB.

No. of Files   Sequential Time   Parallel Time   Total Size (MB)
2              1.825             3.495           154
5              4.446             4.68            385
10             8.814             7.191           770
15             12.948            9.89            1155
20             17.129            8.612           1540
25             22.667            11.357          1925
27             27.058            15.601          2079
30             36.488            21.513          2310
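To make the comparison in Table 2 easier to read, the speedup (sequential time divided by parallel time) and the efficiency (speedup divided by the number of cores) can be computed per row. The Java sketch below does this with the values copied from Table 2; the class and method names are hypothetical, and these two standard definitions are assumed here because the paper does not state its own formulas.

```java
// Illustrative sketch only: the class name is hypothetical; the numbers are those of Table 2.
final class SpeedupTable {
    static final int[] FILES = {2, 5, 10, 15, 20, 25, 27, 30};
    static final double[] SEQUENTIAL =
        {1.825, 4.446, 8.814, 12.948, 17.129, 22.667, 27.058, 36.488};
    static final double[] PARALLEL =
        {3.495, 4.68, 7.191, 9.89, 8.612, 11.357, 15.601, 21.513};

    // Speedup = sequential time / parallel time; values above 1 favour the parallel run.
    static double speedup(int row) {
        return SEQUENTIAL[row] / PARALLEL[row];
    }

    // Efficiency = speedup / number of cores (assumed definition).
    static double efficiency(int row, int cores) {
        return speedup(row) / cores;
    }

    public static void main(String[] args) {
        int cores = 4; // number of cores reported in Table 1
        for (int i = 0; i < FILES.length; i++) {
            System.out.printf("%2d files: speedup %.2f, efficiency %.2f%n",
                              FILES[i], speedup(i), efficiency(i, cores));
        }
    }
}
```

On these values the speedup is about 0.52 for 2 files, rises above 1 from 10 files onward (about 1.23), and is highest, close to 2, at 20 and 25 files; with the 4 cores listed in Table 1 this corresponds to an efficiency of about 0.5 at best.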
As seen from the graph, the parallel program outperforms the existing sequential algorithm when it is applied to a larger number of documents with a greater total size. The program can have more impact if more cores are available in the CPU. At the initial stage, when there are few documents, the overhead of forking dominates and the parallel program performs worse than the sequential one. It scales, however, as more and more documents are added, which is the typical case on the Internet or when working with massive data. The graph for the above table is:
Figure 3. Comparison of sequential and parallel program.
6. Future Work

As future work, this technique can be applied to the following two paradigms:

1. Distributed IR:
   • The documents can be given to different machines (servers) available to the search engine provider so that they can be processed in parallel. By applying two levels of parallelization (multi-core and multi-machine) in this way, better results can be achieved.
2. Using many cores:
   • The technique can be applied to machines with many cores (GPUs). Rather than restricting it to the few cores of a single CPU, the technique can be applied to GPUs, which are specially built for computation and are generally under-utilized.

Furthermore, relevance can be increased by changing or modifying the algorithm we used, so that it compares better with the more relevant vector model. This is optional, as the general trade-off between relevance and speed here favours the speed of the algorithm.

7. Conclusion

In this paper we have proposed a parallel approach to using the different algorithms of information retrieval. We modified the Boolean model and ran the resulting algorithm in parallel. The practical results show that this improves overall performance in terms of speed. From this experiment we conclude that processing documents simultaneously leads to improvements in time, thus producing the fast results that the information era demands. We have also outlined future improvements to this method: distributing the processing of documents, and using GPU power by exploiting many-core functionality to enhance search speed.

References

[1] C. D. Manning, P. Raghavan and H. Schütze, “An Introduction to Information Retrieval,” Cambridge: Cambridge UP, (2009).
[2] H. Lashkari, F. Mahdavi and V. Ghomi, “A Boolean Model in Information Retrieval For Search Engines,” in International Conference on Information Management and Engineering, (2009).
[3] L. S. Wang, “Relevance Weighting of Multi-Term Queries for Vector Space Model,” in IEEE, (2009).
[4] G. V. Cormack, C. L. A. Clarke and S. Buettcher, “Parallel Information Retrieval,” in Information Retrieval: Implementing and Evaluating Search Engines, MIT Press, pp. 492–510, (2010).
[5] S.-H. Chung, S.-C. Oh, K. Ryel Ryu and S.-H. Park, “Parallel Information Retrieval on a Distributed Memory Multiprocessor System,” in IEEE, (1997).
[6] Stanfill, The Marriage of Parallel Computing and Information Retrieval.
[7] L. Yang, Y. Zhang and H. Li, “Information Search Based on Test Mining and EXTJS,” World Congress on Software Engineering, IEEE, no. 10.11, pp. 386–391, (2009).
[8] Janssen, “Techopedia,” Techopedia, [Online]. Available: http://www.techopedia.com/definition/3393/multi-processing. [Accessed Thursday April (2014)].
[9] T. Chen, Z. Zheng, N. Zhang and J. Chen, “Heterogeneous Multi-core Design for Information Retrieval Efficiency on the Vector Space Model,” in Fifth International Conference on Fuzzy Systems and Knowledge Discovery, IEEE, (2008).
[10] R. Brightwell, M. Heroux, Z. Wen and J. Wu, “Parallel Phase Model: A Programming Model for High-end Parallel Machines with Manycores,” in International Conference on Parallel Processing, IEEE, (2009).
[11] Singh N. Kumar, S. Gera and A. Mittal, “Achieving Magnitude Order Improvement in Porter Stemmer Algorithm over Multi-Core Architecture,” in IEEE International Conference.