Distributed Fast Fourier Transform (DFFT) on MapReduce Model for Arabic Handwriting Feature Extraction Technique via Cloud Computing Technologies
Hamdi Hassen1, Khemakhem Maher1,2
1 Mir@cl Lab, FSEGS, University of Sfax, Tunisia
2 Computer Science Department, Faculty of Computing and Information Technology, King Abdulaziz University, KSA
Abstract - The choice of relevant features is a very important and decisive step when building a handwriting recognition system. Indeed, a good choice can lead to a powerful system and vice versa. The Fast Fourier Transform (FFT) is among the feature extraction techniques suited to this objective. Typical Arabic handwriting recognition tasks based on FFT, especially when dealing with a massive amount of Arabic handwritten documents, require more processing power than current state-of-the-art workstations can provide. Distributed computing architectures and infrastructures appear to be a solution for such a task. Our aim is to distribute the FFT feature extraction technique for Arabic handwriting using the MapReduce programming model on a Cloud Computing architecture. Experiments were conducted with the MapReduce model on the Amazon Web Services (AWS) Cloud Computing architecture, with a real large-scale dataset from the IFN/ENIT database. Performance analysis revealed the viability of our investigation; moreover, it confirms that such infrastructures can substantially speed up the entire pattern recognition system.
Keywords: Pattern Recognition, MapReduce, FFT, Cloud computing
1. Introduction
The main task of any OCR (Optical Character Recognition) system is to convert scanned text (offline) or handwriting captured on a writing device (online) into a text document. Text recognition is a subfield of pattern recognition and has been the subject of much research over the past three decades. The recognition of Arabic handwritten characters has remained a challenge in recent years. This challenge is due to several factors: first, the great similarity between some characters; second, the shape variation of most Arabic characters according to their position in a given word or sub-word (morphological problem); third, the complexity and richness of Arabic calligraphy, especially in ancient documents; and fourth, the existence of both words and sub-words in a given text.
These difficulties are compounded when dealing with a large amount of documents, which is our main objective in this work. We believe that a good choice of feature extraction technique remains one of the most important steps for achieving good recognition accuracy. Such a choice requires knowledge of both a robust and efficient feature extraction technique that yields relevant features, and an adequate hardware architecture that can run this technique to process a large amount of documents in a reasonable time. High FFT performance is a key issue in an Arabic handwriting recognition system. Nevertheless, FFT is a costly technique when processing a massive database of Arabic handwritten text. Distributed computing architectures such as Cloud Computing technologies provide enough computing capacity to process this kind of complex algorithm. Our idea consists of using the MapReduce programming model to process large Arabic handwriting data sets with a distributed FFT algorithm on a real Cloud Computing platform. The aim of this work is to demonstrate that high FFT performance in the feature extraction step can be achieved by distributing FFT using the MapReduce programming model on a Cloud architecture.
The rest of the paper is organized as follows: Section 2 describes the feature extraction difficulties of Arabic handwritten characters. Section 3 presents a general description of FFT as a feature extraction technique and its complexity. Our approach is presented in Section 4. Section 5 provides a performance evaluation of our approach. Finally, a conclusion with some remarks and future work is presented in Section 6.
2. Arabic handwriting Features
The Arabic handwritten language presents many challenges to the Optical Character Recognition (OCR) developer [1], which can be presented as follows:
In the Arabic language, a text is composed of cursive words and sub-words, each of which is formed of consecutive letters linked to one another, sometimes with some overlaps [2]. Most letters of the Arabic alphabet have four forms according to their position in a given word: isolated, initial, medial, and final. In addition, the shapes of these forms can be completely different from one another for the same character. Moreover, some Arabic characters have similar shapes and differ only by diacritic dots, which makes them harder to distinguish. This is why the recognition rate of Arabic characters is lower than that of other scripts, especially Latin. These properties cause a high level of difficulty in the choice of pertinent and relevant features [3] during the design of an Arabic OCR system.
The purpose of our system is to recognize a large amount of Arabic handwritten text using FFT as a feature extraction technique implemented on a distributed architecture. The recognition of Arabic handwriting by FFT consists of applying the Fourier transform to the contour of the character [8]. First, the contour is detected. Second, the Freeman code of the contour is generated, and the Fourier transform is computed on it. Figure 1 depicts the FFT process.
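The contour-to-Freeman-code step can be illustrated with a minimal sketch. It assumes the contour is already available as an ordered list of 8-connected pixel coordinates, and it assumes one common numbering of the eight directions (0 = east, increasing counter-clockwise); the paper does not state which convention it uses, and the function name `freeman_chain_code` is hypothetical.

```python
# Freeman direction codes for the 8 neighbours, keyed by the step (dx, dy).
# Assumed convention: code 0 points east and codes grow counter-clockwise,
# with y growing upward. Other conventions exist.
FREEMAN_CODES = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}


def freeman_chain_code(contour):
    """Return the Freeman chain code of a closed contour.

    `contour` is an ordered list of (x, y) integer points in which every
    point is an 8-neighbour of the next one; the last point links back
    to the first one to close the contour.
    """
    codes = []
    for i, (x0, y0) in enumerate(contour):
        x1, y1 = contour[(i + 1) % len(contour)]
        step = (x1 - x0, y1 - y0)
        if step not in FREEMAN_CODES:
            raise ValueError(f"point {i} and its successor are not 8-neighbours")
        codes.append(FREEMAN_CODES[step])
    return codes


# A unit square traversed counter-clockwise: east, north, west, south.
print(freeman_chain_code([(0, 0), (1, 0), (1, 1), (0, 1)]))  # [0, 2, 4, 6]
```

The resulting sequence of codes is the one-dimensional signal on which a Fourier transform can then be computed.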
The analysis of Arabic handwritten script is further complicated compared to Latin script by the obligatory dots and strokes placed above or below most letters. Some of these dots and strokes can be lost in the preprocessing step, where the text is cleaned up so that it can be used directly and efficiently by the feature extraction technique in the OCR process [1]. Another difficulty of Arabic handwriting recognition is due to the different writing styles: Arabic handwriting can be in styles such as Ruqqah, and in others usually used for decorative calligraphy such as Kofi, Thuluth and Diwani. This variety makes recognition more difficult and makes the learning database of the recognition system even larger [1]. The choice of a robust and efficient algorithm to extract relevant and pertinent features is very important and decisive for the handwriting recognition rate. Surveys conducted by Arabic handwriting OCR researchers confirm that high performance of the fast Fourier transform (FFT) is a key issue for an Arabic handwriting recognition system; indeed, it is a robust and efficient feature extraction technique for the Arabic handwriting recognition process [4].
3. Fast Fourier transform as a feature extraction technique
The FFT is a fast algorithm that computes the Discrete Fourier Transform (DFT) and its inverse [5]. A Fourier transform converts time (or space) to frequency and vice versa; an FFT rapidly computes such transformations by factorizing the DFT matrix into a product of sparse (mostly zero) factors [6].
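As an illustration of how an FFT reaches the same result as the DFT definition with less work, here is a minimal sketch of the recursive radix-2 Cooley-Tukey scheme next to a direct DFT. The radix-2 variant is chosen here only for brevity; the paper does not say which FFT variant it uses, and this version only accepts input lengths that are powers of two.

```python
import cmath


def dft(x):
    """Direct evaluation of the DFT definition, O(n^2) operations."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft(x):
    """Recursive radix-2 FFT, O(n log n) operations.

    The input is split into its even-indexed and odd-indexed halves, each
    half is transformed recursively, and the two half-size results are
    recombined with the twiddle factors exp(-2*pi*i*k/n).
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


signal = [1, 2, 3, 4, 0, 0, 0, 0]
# Both routines agree up to floating-point rounding.
print(max(abs(a - b) for a, b in zip(fft(signal), dft(signal))) < 1e-9)
```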
Figure 1. The FFT process
Mathematically, the Fourier transform of the contour is represented by the following equations:

$$X_N(K) = A_0 + \sum_{n=1}^{N} \left( a_n \cos\frac{2\pi n K}{N} + b_n \sin\frac{2\pi n K}{N} \right) \qquad [1]$$

$$Y_N(K) = C_0 + \sum_{n=1}^{N} \left( c_n \cos\frac{2\pi n K}{N} + d_n \sin\frac{2\pi n K}{N} \right) \qquad [2]$$

Where
K: the index of a point among the k points of the contour,
N: the number of harmonics needed to approximate the contour by the Fourier coefficients,
$a_n$, $b_n$, $c_n$ and $d_n$: the Fourier coefficients corresponding to the harmonic n,
$A_0$ and $C_0$: the continuous components, which correspond to the initial points where the frequency is equal to 0.

The main principle of FFT is "divide and conquer": break down a big problem into a number of smaller problems and tackle them individually. The FFT needs to satisfy the following condition [7]:

$$\sum \text{cost(sub-problem)} + \text{cost(overhead)} < \text{cost(original problem)}$$

The FFT algorithm computes the DFT and produces exactly the same result as evaluating the DFT definition directly. The most important difference is that an FFT is much faster. FFT represents an efficient and robust feature extraction technique that uses Fourier Descriptors (FD). In fact, the descriptors can be normalized in order to be invariant to the position of the characters (translation or rotation), the size and the starting point [9]. However, FFT still has a complexity of order O(n log n) [10] [11], which is costly on massive data sets. Another weakness of FFT applied to Arabic handwriting feature extraction is that the extracted primitives consume a large amount of memory [12]. It appears that optimizing FFT is as demanding as designing an efficient feature extraction technique. Our optimization of FFT consists of using the MapReduce model to distribute the FFT algorithm over the cloud computing architecture.
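The normalization of Fourier descriptors mentioned above can be sketched as follows. This is an illustrative variant, not the paper's formulation: it treats each contour point as a complex number x + iy rather than using the separate coefficient sets of equations [1] and [2], and it evaluates the DFT directly for brevity. The function name `fourier_descriptors` and the choice of normalization steps are assumptions.

```python
import cmath


def fourier_descriptors(contour, count):
    """Return `count` normalized Fourier descriptors of a closed contour.

    Normalization steps (one common choice, assumed here):
      - the zero-frequency term is dropped       -> translation invariance
      - magnitudes are divided by |first term|   -> size invariance
      - only magnitudes are kept (phase dropped) -> rotation and
                                                    starting-point invariance
    """
    n = len(contour)
    if count + 2 > n:
        raise ValueError("contour too short for the requested descriptors")
    z = [complex(x, y) for x, y in contour]
    # Direct DFT for brevity; an FFT would give the same coefficients.
    spectrum = [
        sum(z[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / n
        for k in range(n)
    ]
    scale = abs(spectrum[1])
    if scale < 1e-12:
        raise ValueError("degenerate contour: first harmonic is zero")
    return [abs(spectrum[k]) / scale for k in range(2, 2 + count)]


square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
# Same shape rotated by 90 degrees, scaled by 3, shifted, and started elsewhere.
moved = [(-3 * y + 5, 3 * x - 7) for x, y in square]
moved = moved[3:] + moved[:3]
a, b = fourier_descriptors(square, 4), fourier_descriptors(moved, 4)
print(all(abs(p - q) < 1e-9 for p, q in zip(a, b)))
```

Keeping only magnitudes discards phase information, which is what buys the rotation and starting-point invariance at the cost of some discriminative detail.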
4. MapReduce model for distributed FFT on cloud computing architecture
MapReduce is a programming model for processing massive data sets in a parallel and distributed manner on a distributed architecture. A MapReduce program is composed of two main functions: Map() and Reduce() [13]. In the Map step, the master node takes the input, divides it into smaller sub-problems and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node. In the Reduce step, the master node collects the answers to all the sub-problems and combines them to form the output. Figure 2 explains the MapReduce model.
Figure 2: MapReduce model
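The Map and Reduce roles can be made concrete with a small single-process sketch. It only imitates the data flow of the model (map, group by key, reduce) and is not Hadoop code; the function `map_reduce` and the example mapper and reducer are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a map phase, a group-by-key (shuffle) phase and a reduce phase.

    mapper(record)       -> iterable of (key, value) pairs
    reducer(key, values) -> combined result for that key
    On a real cluster the map calls run on different worker nodes; here
    they run one after another in the same process.
    """
    grouped = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            grouped[key].append(value)     # shuffle: group values by key
    # reduce phase: one call per distinct key
    return {key: reducer(key, values) for key, values in grouped.items()}


def count_letters(word):
    """Example mapper: emit (letter, 1) for every letter of a word."""
    return [(letter, 1) for letter in word]


def total(letter, ones):
    """Example reducer: sum the partial counts of one letter."""
    return sum(ones)


print(map_reduce(["aba", "b"], count_letters, total))  # {'a': 2, 'b': 2}
```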
Our idea consists of distributing the FFT computation, used as a feature extraction technique, across a cluster of virtual servers running in the Cloud Computing architecture, using the MapReduce model to analyze and process massive amounts of Arabic handwritten text. Figure 3 describes the design of our approach.
Figure 3: FFT via MapReduce model
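A rough sketch of this design, under stated assumptions: each input record is a pair (character id, one-dimensional contour signal of power-of-two length), every map task computes FFT features for its own split of the records, and the reduce step merges the partial results into a single output set. The names `map_task`, `reduce_task`, `split_input` and `NUM_FEATURES` are hypothetical, and the map tasks run sequentially here as a stand-in for the cluster.

```python
import cmath

NUM_FEATURES = 4  # hypothetical: leading spectrum magnitudes kept per character


def fft(x):
    """Recursive radix-2 FFT (input length must be a power of two)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def split_input(records, num_workers):
    """Master side: deal the records out to the workers round-robin."""
    return [records[i::num_workers] for i in range(num_workers)]


def map_task(split):
    """Worker side: compute FFT features for every record of one split."""
    return [
        (char_id, [abs(c) for c in fft(signal)[:NUM_FEATURES]])
        for char_id, signal in split
    ]


def reduce_task(partial_results):
    """Merge the per-worker outputs into a single feature table."""
    merged = {}
    for part in partial_results:
        merged.update(part)
    return merged


def distributed_fft_features(records, num_workers):
    splits = split_input(records, num_workers)
    partials = [map_task(split) for split in splits]  # sequential stand-in
    return reduce_task(partials)


records = [("c0", [1, 0, 0, 0]), ("c1", [1, 1, 1, 1]), ("c2", [0, 1, 0, 1])]
features = distributed_fft_features(records, num_workers=2)
print(sorted(features))  # ['c0', 'c1', 'c2']
```

Because each character is transformed independently, the number of workers changes only how the records are partitioned, not the features that come out.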
The open-source framework Hadoop [14] is used as a tool to manage clusters in cloud architectures. Hadoop runs the MapReduce distributed processing model on the Amazon Web Services cloud architecture [15]: the FFT task is mapped to a set of servers for processing, then the results of the map function are combined by the reduce function into a single output set. The distribution and control of the FFT algorithm are done by a node called the master node. Figure 4 illustrates the AWS MapReduce mechanism.
Figure 4: AWS MapReduce model
5. The experimental study
5.1. Datasets
To evaluate the performance of DFFT as a feature extraction technique based on the MapReduce model on the cloud architecture, a database of 16000 pages (370 characters/page) is used.
The reference database is formed of 345 shapes representing approximately the different letters of the Arabic alphabet, randomly chosen from the Arabic handwritten word images dataset. This base contains samples of 945 names of Tunisian towns retrieved from the already normalized and preprocessed IFN/ENIT database: Institute of Communications Technology (IFN), Technical University Braunschweig, Germany, and Ecole Nationale d'Ingénieurs de Tunis (ENIT), Tunisia [16]. Figure 5 shows a sample of the studied corpus.
Figure 5: A sample of the studied corpus
5.2. Experimental environment
To set up an experimental environment that would allow us to objectively study the effectiveness of distributed FFT across a cluster of virtual servers running in a cloud architecture with the MapReduce model in the entire Arabic handwriting recognition process, we used the following tools:
• A local Intel Core 2 Duo desktop with the configuration: 3.00 GHz * 2, 2 GB of RAM, running the Windows XP operating system.
• The Cygwin shell to run Linux commands [17].
• The Java programming language, with JDK 1.6 installed.
• The Eclipse 3.4 tool, used to program and build our OCR application based on FFT as a feature extraction technique.
• The Cascading framework [18], used to easily and quickly develop the data analytics and data management tasks.
• 100 cores using the three standard Amazon EC2 instance types: the Small instance, with 1.7 GB of memory, 160 GB of instance storage and a 32-bit platform; the Large instance, with 7.5 GB of memory, 850 GB of instance storage and a 64-bit platform; and the Extra Large instance, with 15 GB of memory, 1690 GB of instance storage and a 64-bit platform.
• The Amazon S3 bucket [19], used to store the input sent to the cloud clusters and to receive their output.
The execution of FFT in the AWS MapReduce model follows several steps. First, we develop and execute our OCR application based on FFT as a feature extraction technique on the local host using the Java programming language. Second, we sign up to Amazon Web Services and upload our application and data to Amazon S3. Since the database to upload and process is massive, we used the AWS Import/Export option, which relies on physical storage devices. The third step is the configuration and launch of the clusters, in which we use the AWS management console to specify the number of EC2 instances to use in the cluster and the types of instances. Finally, Amazon EMR automatically terminates the cluster when processing is complete, and we retrieve the output from Amazon S3. Tools such as MicroStrategy [20] can be used to visualize the output.
Three job flows are created in Cascading on AWS MapReduce, one for each of three transform sizes (N = 64, N = 128, N = 256).
The execution time, the speedup factor and the efficiency factor are chosen to evaluate our experiments [21]. The execution time is the maximum length of time that FFT takes to extract features from the characters on a specific Amazon EC2 instance type for a given transform size (N). The speedup refers to how much faster the distributed FFT is than its sequential counterpart. Finally, the efficiency factor refers to the percentage of useful utilization of the resources.
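The two derived metrics can be written down as a short sketch, assuming the standard definitions from parallel computing: speedup is the sequential time divided by the distributed time, and efficiency is the speedup divided by the number of cores. The function names are hypothetical; the sample values (9.5 hours sequential, 0.120 hours on 100 cores) are the ones reported in Section 5.3.

```python
def speedup(sequential_time, parallel_time):
    """How many times faster the distributed run is than the sequential one."""
    return sequential_time / parallel_time


def efficiency(sequential_time, parallel_time, cores):
    """Fraction of the cores' capacity that is usefully employed (0 to 1)."""
    return speedup(sequential_time, parallel_time) / cores


# Values reported in Section 5.3: 9.5 h sequential, 0.120 h on 100 cores
# (Extra Large instances, N = 256).
print(round(speedup(9.5, 0.120), 3))          # 79.167
print(round(efficiency(9.5, 0.120, 100), 3))  # 0.792
```

An efficiency of 1 would mean a perfectly linear speedup; values below 1 reflect the distribution overhead.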
5.3. Results and analysis
This section presents the results and analysis of the experimental study conducted on the real IFN/ENIT database using distributed FFT as a feature extraction technique with the MapReduce model on the AWS cloud computing architecture.
The experimentally derived values of the execution time, the speedup factor and the efficiency factor of FFT on the AWS MapReduce model, for the different Amazon EC2 instance types and transform sizes, are given in Table I.
TABLE I. EXECUTION TIME, SPEEDUP FACTOR AND EFFICIENCY FACTOR OF FFT ON AWS MAPREDUCE MODEL
(Small, Large and Extra large are the Amazon EC2 instance types.)

| Transform size | Number of cores | Execution time (h), Small | Execution time (h), Large | Execution time (h), Extra large | Speedup factor, Small | Speedup factor, Large | Speedup factor, Extra large | Efficiency factor, Small | Efficiency factor, Large | Efficiency factor, Extra large |
|---|---|---|---|---|---|---|---|---|---|---|
| N = 64 | 25 | 0.671 | 0.611 | 0.551 | 14.158 | 15.548 | 17.241 | 0.566 | 0.622 | 0.690 |
| N = 64 | 50 | 0.400 | 0.384 | 0.368 | 23.750 | 24.740 | 25.815 | 0.475 | 0.495 | 0.516 |
| N = 64 | 75 | 0.316 | 0.301 | 0.280 | 30.063 | 31.561 | 33.929 | 0.401 | 0.421 | 0.452 |
| N = 64 | 100 | 0.261 | 0.254 | 0.245 | 36.398 | 37.402 | 38.776 | 0.364 | 0.374 | 0.388 |
| N = 128 | 25 | 0.551 | 0.485 | 0.432 | 17.241 | 19.588 | 21.991 | 0.690 | 0.784 | 0.880 |
| N = 128 | 50 | 0.280 | 0.265 | 0.243 | 33.929 | 35.849 | 39.095 | 0.679 | 0.717 | 0.782 |
| N = 128 | 75 | 0.180 | 0.185 | 0.175 | 52.778 | 51.351 | 54.286 | 0.704 | 0.685 | 0.724 |
| N = 128 | 100 | 0.140 | 0.135 | 0.130 | 67.857 | 70.370 | 73.077 | 0.679 | 0.704 | 0.731 |
| N = 256 | 25 | 0.530 | 0.470 | 0.420 | 17.925 | 20.213 | 22.619 | 0.717 | 0.809 | 0.905 |
| N = 256 | 50 | 0.260 | 0.240 | 0.223 | 36.538 | 39.583 | 42.601 | 0.731 | 0.792 | 0.852 |
| N = 256 | 75 | 0.160 | 0.170 | 0.165 | 59.375 | 55.882 | 57.576 | 0.792 | 0.745 | 0.768 |
| N = 256 | 100 | 0.150 | 0.130 | 0.120 | 63.333 | 73.077 | 79.167 | 0.633 | 0.731 | 0.792 |
The major observations are as follows:
• The average execution time of FFT in sequential mode is 9.5 hours, whereas on a distributed architecture with 100 cores and a transform size of N = 256 the execution time is 0.150 hours, 0.130 hours and 0.120 hours respectively for the three Amazon EC2 instance types.
• In general, the execution time decreases and the speedup and efficiency increase as the FFT size increases and as the Amazon EC2 instances have more memory capacity.
Consequently, the obtained results confirm that:
• AWS MapReduce is an adequate framework to speed up the FFT feature extraction technique applied in a computationally greedy process (the Arabic handwriting recognition system). In fact, if we use 100 cores with Extra Large AWS instances and an FFT size of 256, we can extract the relevant and pertinent features of 1370 characters per second, with a speedup factor of about 79.
• AWS MapReduce is an efficient tool to develop a distributed FFT as a large-scale Arabic handwriting feature extraction technique. In fact, for an FFT size of 256 with Extra Large AWS instances, the efficiency factor is about 79%, i.e., about 79% of the resources are usefully employed.
6. Conclusion and perspective
In this paper, we have proposed an approach to distribute FFT as a feature extraction technique for an Arabic handwriting recognition system using the MapReduce model on a cloud computing architecture.
Experimental results of DFFT on AWS MapReduce were presented and confirm the viability of our investigation. Performance analysis confirmed that FFT on AWS MapReduce can provide an effective framework to speed up the feature extraction process. Further investigations are under way and could extend this work toward the development of a powerful Arabic handwriting recognition system based on the MapReduce model on a cloud computing architecture, which constitutes our main objective.
ACKNOWLEDGMENT
The authors wish to thank the reviewers for their fruitful comments. The authors also wish to acknowledge the members of the Miracl Laboratory, Sfax, Tunisia, the members of the Department of Mathematics and Computer Science, University of Marburg, Germany, and the members of the Computer Science Department, College of Science and Arts at Al-Ola, Taibah University, KSA.
7. References
[1] Aburas, A.A., Gumah, M.E., Arabic Handwriting Recognition: Challenges and Solutions, International Symposium on Information Technology (ITSim 2008), Kuala Lumpur, Malaysia, pages 1-6, 2008.
[2] Raid S., Jihad E., Hierarchical On-line Arabic Handwriting Recognition, 10th International Conference on Document Analysis and Recognition, Barcelona, pages 86- 871, 2009.
[3] John M., Thad S., Richard S., and George C., On-line cursive handwriting recognition using speech recognition methods, in Proceedings of IEEE ICASSP'94, pages v125-v128, Adelaide, Australia, April 1994.
[4] F. Kuhl, Elliptic Fourier features of a closed contour, Computer Graphics and Image Processing 18, pages 236-258, 1982.
[5] K. Arbter, Affine-invariant Fourier descriptors, From Pixel to Features, Elsevier Science Publisher B.V. (North-Holland), 1989.
[6] Snoussi Maddouri, S., Reconnaissance de l'Ecriture Arabe Manuscrite par Réseau de Neurones Transparent et Transformée de Fourier, Journée des jeunes chercheurs sur l'Ecrit et le Document, Colloque International Francophone sur l'Ecrit et le Document (JEDCIFED), Lyon, France, 2000.
[7] Sabri A., Achraf S., Arabic Character Recognition using Modified Fourier Spectrum (MFS), GMAI '06 Proceedings of the conference on Geometric Modeling and Imaging: New Trends, Washington, pages 155-159, 2006.
[8] M. Szmulo, Boundary normalization for recognition of non touching non-degraded characters, ICDAR, IEEE, 1997, pages 463-466.
[9] K. Arbter, Affine-invariant Fourier descriptors, From Pixel to Features, Elsevier Science Publisher B.V. (North-Holland).
[10] Granlund G.H., Fourier Preprocessing for Hand Print Character Recognition, IEEE Transactions on Computers, February 1972.
[11] Kauppinen, et al., An experimental comparison of autoregressive and Fourier-based descriptors in 2D shape classification, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 2, 1995.
[12] Haikal El-Abed, Sofiene Haboubi, Samia Maddouri, Noureddine Ellouze, Invariant Primitives for Handwritten Arabic Script: A Contrastive study of four feature sets, 10th International Conference on Document Analysis and Recognition, pages 691-697, 2009.
[13] Dean J. and Ghemawat S., MapReduce: Simplified data processing on large clusters, Communications of the ACM, 50th anniversary issue: 1958-2008, vol. 51, Jan. 2008, pages 107-113.
[14] Varia J., Mathew S., Overview of Amazon Web Services, 2013.
[15] Available at: http://hadoop.apache.org/
[16] M. Pechwitz, S. S. Maddouri, V. Märgner, N. Ellouze, and H. Amiri, IFN/ENIT - database of handwritten Arabic words, in Proc. of CIFED, Tunisie, pages 129-136, 2002.
[17] Available at: http://www.cygwin.com/
[18] Available at: http://www.cascading.org/
[19] Available at: http://aws.amazon.com/s3/
[20] Available at: http://www.microstrategy.com/
[21] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Algorithm Design and Analysis, Benjamin Cummings, Redwood City, USA, 1994.