STRING MATCHING PROBLEM ON A CLUSTER OF PERSONAL COMPUTERS: PERFORMANCE MODELING

Panagiotis D. Michailidis and Konstantinos G. Margaritis
Department of Applied Informatics, University of Macedonia, 156 Egnatia str., P.O. Box 1591, Thessaloniki, Greece

Abstract: This paper presents an analytical performance prediction model and a methodology that can be used to predict the execution time, speedup and similar performance metrics of a string matching algorithm running on a cluster of personal computers. The developed performance model has been validated on a cluster of six personal computers and has been shown to predict the parallel run time accurately.

Key words: string matching, cluster of personal computers, performance prediction, analytical model.

1. INTRODUCTION

The string matching problem can be defined as follows. Given an alphabet Σ (a finite set of characters), a short pattern string P = P[1]P[2]...P[m] of length m and a large text string T = T[1]T[2]...T[n] of length n, where both the pattern and the text are sequences of characters from Σ and m ≤ n, the string matching problem consists of finding one or, more generally, all the exact occurrences of the pattern P in the text T. Surveys and experimental results for well-known sequential algorithms for this problem can be found in [2,5,6,10].

Implementing string matching on a cluster of personal computers [1] can provide the computing power required to speed up searches over large free-text databases. Performance data for this implementation are presented in [7]. In this paper we present a performance prediction model that can be used to predict the performance metrics of the parallel implementation of string matching on a cluster of personal computers [1].

The remainder of this paper is organized as follows: the next section briefly presents the parallel implementation of string matching. The theoretical performance model is presented in Section 3, where we also compare its predictions with experimental measurements. Finally, Section 4 contains our conclusions and future research issues.

2. PARALLEL PROGRAMMING MODEL

The string matching problem lends itself to data parallelism through the following simple partitioning technique: we decompose the text into a number of subtexts according to the number of processors allocated. Each subtext consists of k = ⌈(n − m + 1)/p⌉ + m − 1 successive characters of the complete text, so that successive subtexts overlap by m − 1 pattern characters and no occurrence straddling a boundary is missed. These subtexts are stored on the local disks of the processors.
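The overlapping partitioning above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the helper name is our own.

```python
def partition_text(n, m, p):
    """Split a text of length n into p overlapping chunks for a pattern
    of length m. Each chunk holds ceil((n - m + 1) / p) + m - 1 characters,
    with an m-1 character overlap so no match is lost at a boundary."""
    base = -(-(n - m + 1) // p)  # ceiling division
    chunks = []
    for i in range(p):
        start = i * base
        end = min(start + base + m - 1, n)
        if start < n:
            chunks.append((start, end))
    return chunks

# Example: a 100-character text, pattern length 5, 4 processors.
print(partition_text(100, 5, 4))  # → [(0, 28), (24, 52), (48, 76), (72, 100)]
```

Each chunk starts m − 1 = 4 characters before the previous one ends, which is exactly the overlap the partitioning scheme requires.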
Given the above partitioning, we follow a static master-worker model: first, the master distributes the pattern string to all the workers; second, each worker reads its subtext from its local disk into main memory and searches it using any sequential string matching algorithm [5,6]; third, the workers send the number of occurrences they found back to the master. In this paper we use the Brute-Force (BF) string matching algorithm [5,6].
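The BF algorithm each worker runs simply checks every alignment of the pattern against its subtext. A minimal sketch (our own, counting occurrences as each worker does):

```python
def brute_force_count(pattern, text):
    """Count occurrences of pattern in text by testing every alignment,
    as the Brute-Force (BF) string matching algorithm does."""
    m, n = len(pattern), len(text)
    count = 0
    for i in range(n - m + 1):
        # Compare the pattern against the text window starting at i.
        if text[i:i + m] == pattern:
            count += 1
    return count

print(brute_force_count("ab", "ababab"))  # → 3
```

With the m − 1 character overlap between subtexts, summing these per-worker counts at the master yields the total number of occurrences in the whole text.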

3. PERFORMANCE PREDICTION MODEL

We now develop an analytical model that describes the behaviour and performance of the parallel algorithm presented above. The model is used not only to verify the experimental results but also to predict the speedup trend. Using a large number of processors can sometimes be counter-productive: the speedup achieved may be lower than that of a smaller number of processors, mainly because of the computation-to-communication ratio of the specific cluster. Hence, the analytical model can save a great deal of time that would otherwise be spent running experiments to find the highest (or best) performance for a given string searching application.

Using the parallel computational model of Section 2, the execution time of the parallel algorithm can be broken up into five terms:
• Ta: the master startup time, mainly spent partitioning the text (and initializing MPI routines).
• Tb: the communication time for broadcasting the pattern string to all processors involved.
• Tc: the average I/O time for reading the subtext from the local disk of a single processor.
• Td: the average searching time of the string matching on a single processor.
• Te: the communication time for receiving (gathering) the results of the string search from each processor.

Hence, the execution time of the parallel algorithm on p processors is the sum of these five terms:

Tp = Ta + Tb + Tc + Td + Te

I/O, matching and communication times

To formulate the equations that give the expected speedup curves of the parallel algorithm, we first formulate the equations for the I/O time, matching time and communication time, which make up the execution time of the whole algorithm. The I/O time is proportional to the size of the text, so reading the text requires n accesses to the disk. Let γ be the average time to perform one I/O step. Then the I/O time is Ti/o = nγ.
If the workload is divided evenly over p processors, each node's I/O time is Ti/o = (⌈(n − m + 1)/p⌉ + m − 1)γ. Further, searching for an m-character pattern in an n-character text requires about n computation steps in practice for the BF algorithm [5,6]. Let δ be the average time to perform one computation step. Then the searching time is Tsearch = nδ, and each node's searching time is Tsearch = (⌈(n − m + 1)/p⌉ + m − 1)δ.

The total communication time of the parallel algorithm is the sum of two components: latency time and transmission time. The latency time, α, is a fixed startup overhead needed to prepare the sending of a message from one processor to another. The transmission time is proportional to the size of the message. Let β be the incremental transmission time per byte; note that usually α >> β. Then the communication time to send a message of P bytes is Tcomm = α + Pβ.

Parallel performance
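The per-node cost expressions can be collected into small helper functions. The parameter names α, β, γ, δ follow the text; the code itself is our illustrative sketch, not the authors' implementation.

```python
from math import ceil

def chunk_size(n, m, p):
    """Characters assigned to each of p nodes, including the m-1 overlap."""
    return ceil((n - m + 1) / p) + m - 1

def t_io(n, m, p, gamma):
    """Per-node I/O time: one disk step (gamma secs) per character read."""
    return chunk_size(n, m, p) * gamma

def t_search(n, m, p, delta):
    """Per-node BF search time: about one step (delta secs) per character."""
    return chunk_size(n, m, p) * delta

def t_comm(size_bytes, alpha, beta):
    """Point-to-point message time: fixed latency plus per-byte cost."""
    return alpha + size_bytes * beta
```

For example, `chunk_size(100, 5, 4)` gives 28 characters per node, matching the partitioning formula of Section 2.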

If the values of α, β, γ and δ are known, the execution time of the parallel string matching algorithm can easily be estimated, as shown below. To formulate the execution time for an m-character pattern, an n-character text and p computers, we need to determine the value of each of the following terms:
• Ta: mainly the time spent in text partitioning. Compared with the communication time and the string matching time, it is negligible.
• Tb: the communication time to broadcast the pattern string to all processors involved in the string matching. We may assume that MPI_Bcast completes in log2 p steps, each performing one or two parallel send operations per processor. Since the pattern occupies m bytes, the broadcast transfers m bytes to the other p − 1 processors, and this term is log2 p (α + mβ).
• Tc: each processor reads its subtext of ⌈(n − m + 1)/p⌉ + m − 1 characters from its local disk into a buffer in main memory, so the I/O time on a single processor is (⌈(n − m + 1)/p⌉ + m − 1)γ.
• Td: each processor searches for the pattern in its subtext of ⌈(n − m + 1)/p⌉ + m − 1 characters, so the searching time on a single processor is (⌈(n − m + 1)/p⌉ + m − 1)δ.
• Te: the master gathers the p partial results produced by the p processors concurrently. Each processor sends back one value (in our case, its number of occurrences), and MPI_Reduce completes in log2 p steps, so this term is log2 p (α + β).
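Summing these terms gives a directly computable prediction. The following sketch (ours, not the authors' code) evaluates it with the m = 5 parameter values measured later in this section:

```python
from math import ceil, log2

# Measured machine parameters from the text (m = 5 row of Table 1).
ALPHA = 0.00124742      # message latency (s)
BETA = 0.00000038977    # per-byte transmission time (s)
GAMMA = 8.87e-08        # per-character I/O time (s)
DELTA = 2.77e-07        # per-character search time (s)

def predict_tp(n, m, p, alpha=ALPHA, beta=BETA, gamma=GAMMA, delta=DELTA):
    """Predicted parallel run time:
    Tp = log2(p)(alpha + m*beta)   broadcast of the pattern
       + k*gamma + k*delta         per-node I/O and search
       + log2(p)(alpha + beta)     gather of the counts
    where k = ceil((n - m + 1)/p) + m - 1."""
    k = ceil((n - m + 1) / p) + m - 1
    return (log2(p) * (alpha + m * beta)
            + k * gamma + k * delta
            + log2(p) * (alpha + beta))

# Predicted times for a 3 MB text, m = 5, p = 1..6 processors.
for p in range(1, 7):
    print(p, round(predict_tp(3 * 2**20, 5, p), 3))
```

With these parameter values the predictions track the m = 5 column of Table 2 (roughly 1.15 s at p = 1 down to about 0.2 s at p = 6).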
The execution time of the parallel string matching, Tp, is simply the sum of the terms calculated above:

Tp = log2 p (α + mβ) + (⌈(n − m + 1)/p⌉ + m − 1)γ + (⌈(n − m + 1)/p⌉ + m − 1)δ + log2 p (α + β)

Determination of α, β, γ and δ

We now show how the values of α, β, γ and δ used in the performance prediction model are found. The I/O factor γ (and the searching factor δ) is found by measuring the time taken to perform n I/O (or computation) steps: since Ti/o = nγ and Tsearch = nδ, we obtain γ = Ti/o / n and δ = Tsearch / n. Table 1 lists the values of γ and δ (in seconds) for different pattern lengths.

m    γ          δ
5    8.87E-08   2.77E-07
10   7.91E-08   2.73E-07
30   7.95E-08   2.65E-07
60   7.90E-08   2.95E-07

Table 1: Values of γ and δ

To find the values of α and β, we run simple ping-pong tests that send and receive a number of messages between two processors. The ping-pong test consists of two processes, each residing on a single processor. Both processes do nothing but simply send and receive

messages. All timings are averages over 100 separate rounds. Using linear regression to fit a straight line to the measured communication times, we obtain α and β, the message latency time and the incremental transmission time per byte. In our computing environment they are 0.00124742 secs and 0.00000038977 secs respectively.

Expected results

Tables 2 and 3 show, for several values of m, n and p, the expected execution times obtained from the equation for Tp. Figures 1-2 and 3-4 present, for several values of n and p, the execution times and speedups, respectively, obtained in the experiments and those predicted by the equations above. The experimental results in the figures are taken from [7]. Note that the execution times and speedups plotted in Figures 1-4 are averages over the four pattern lengths. While there are small differences between the experimental and expected values, the overall behaviour is similar.

p/m   5       10      30      60
1     1.156   1.113   1.088   1.183
2     0.58    0.559   0.547   0.594
3     0.389   0.375   0.367   0.398
4     0.294   0.283   0.277   0.3
5     0.237   0.228   0.223   0.242
6     0.199   0.192   0.187   0.203

Table 2: Expected execution times (in secs) for text size 3MB using several pattern lengths

p/m   5       10      30      60
1     9.225   8.881   8.686   9.437
2     4.615   4.443   4.345   4.722
3     3.079   2.964   2.899   3.15
4     2.311   2.225   2.176   2.364
5     1.85    1.782   1.743   1.893
6     1.54    1.486   1.454   1.579

Table 3: Expected execution times (in secs) for text size 24MB using several pattern lengths
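The linear-regression fit of α and β described above reduces to a least-squares line fit of time = α + size·β. A sketch with synthetic timings (in a real run these would come from the MPI ping-pong measurements; the function name and data are ours):

```python
def fit_latency_bandwidth(sizes, times):
    """Least-squares fit of times = alpha + beta * sizes.
    Returns (alpha, beta): intercept = latency, slope = per-byte cost."""
    n = len(sizes)
    mean_s = sum(sizes) / n
    mean_t = sum(times) / n
    beta = (sum((s - mean_s) * (t - mean_t) for s, t in zip(sizes, times))
            / sum((s - mean_s) ** 2 for s in sizes))
    alpha = mean_t - beta * mean_s
    return alpha, beta

# Synthetic timings generated from known alpha = 1.25e-3, beta = 3.9e-7,
# values of the same order as those measured in the paper.
sizes = [1, 64, 256, 1024, 4096]
times = [1.25e-3 + 3.9e-7 * s for s in sizes]
alpha, beta = fit_latency_bandwidth(sizes, times)
print(alpha, beta)
```

On noiseless data the fit recovers the generating parameters exactly; on real ping-pong timings the averaging over 100 rounds plays the same smoothing role.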

(Plot omitted; axes: number of processors 1-6 vs. time in secs; curves: expected, experimental.)
Figure 1: The expected and experimental execution times as a function of the number of processors (BF search algorithm, n = 3MB)

(Plot omitted; axes: number of processors 1-6 vs. time in secs; curves: expected, experimental.)
Figure 2: The expected and experimental execution times as a function of the number of processors (BF search algorithm, n = 24MB)

(Plot omitted; axes: number of processors 1-6 vs. speedup; curves: expected, experimental.)
Figure 3: The expected and experimental speedup as a function of the number of processors (BF search algorithm, n = 3MB)

(Plot omitted; axes: number of processors 1-6 vs. speedup; curves: expected, experimental.)
Figure 4: The expected and experimental speedup as a function of the number of processors (BF search algorithm, n = 24MB)

4. CONCLUSIONS

This paper has introduced a performance prediction model that can help in the systematic design, evaluation and performance tuning of a parallel string matching algorithm. The model requires only a small set of input parameters (α, β, γ and δ) that can be obtained from cluster specifications or from trial runs on a minimal system. The method enables the programmer to examine the performance of a cluster of computers prior to implementation. Further, the model can save a great deal of time that would otherwise be spent running experiments to find the best performance for the string matching application.

The execution of the parallel string matching implementation was studied on a cluster of six personal computers and was modelled using the performance model. The performance prediction model has been shown to agree well with the experimental measurements: the maximal difference between predicted and measured values is less than 7%, and most differences are less than 2%. The model is therefore sufficiently accurate, since predictions need not be quantitatively exact; prediction errors of 10-20% are generally considered acceptable. Future work includes extending the model to heterogeneous clusters of workstations and incorporating static and dynamic load balancing into the model.

REFERENCES
1. Anderson, T., D. Culler, D. Patterson. A Case for NOW (Network of Workstations), IEEE Micro, vol. 15, no. 1, 1995, pp. 54-64.
2. Crochemore, M., W. Rytter. Text Algorithms, Oxford University Press, 1994.
3. Foster, I. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison-Wesley, 1995.
4. Gropp, W., E. Lusk, A. Skjellum. Using MPI: Portable Parallel Programming with the Message Passing Interface, The MIT Press, Cambridge, Massachusetts, 1994.
5. Michailidis, P., K. Margaritis. On-Line String Matching Algorithms: Survey and Experimental Results, accepted for publication, International Journal of Computer Mathematics, 2000.
6. Michailidis, P., K. Margaritis. String Matching Algorithms, Technical Report, Dept. of Applied Informatics, University of Macedonia, 1999 (in Greek).
7. Michailidis, P., K. Margaritis. String Matching Problem on a Cluster of Personal Computers: Experimental Results, to appear in Proc. 15th International Conference Systems for Automation of Engineering and Research (SAER'2001), 2001.
8. Pacheco, P. Parallel Programming with MPI, Morgan Kaufmann, San Francisco, CA, 1997.
9. Snir, M., S. Otto, S. Huss-Lederman, D.W. Walker, J. Dongarra. MPI: The Complete Reference, The MIT Press, Cambridge, Massachusetts, 1996.
10. Stephen, G. String Searching Algorithms, World Scientific Press, 1994.
11. Wilkinson, B., M. Allen. Parallel Programming: Techniques and Applications using Networked Workstations and Parallel Computers, Prentice Hall, 1999.
