Cloud computing technology for large scale and efficient Arabic handwriting recognition system
Hamdi Hassen* and Maher Khemakhem**
*Computer Science Department, College of Science and Arts at Al-Ola, Taibah University, KSA.
[email protected]
**Mir@cl Lab, FSEGS, University of Sfax, Tunisia.
[email protected]
Abstract— Optical Character Recognition (OCR) is a process that allows computers to recognize written or printed characters, such as numbers or letters, and convert them into a form that the computer can use. Many OCR systems based on different algorithms are in use today. Popular OCR systems offer high accuracy, and most offer high speed, but until now Arabic handwriting recognition systems have been limited to recognizing small and medium-sized sets of documents.
experimental study are analyzed and conclusions are presented.
Our idea is to use a robust, efficient and scalable technology that offers a number of benefits, such as the ability to store and retrieve large amounts of data (documents) from any location at any time (user mobility in a pervasive environment).
Segmentation is a crucial step in OCR systems because it extracts meaningful regions for analysis. A poor segmentation process produces misrecognition or rejection. This step is especially important for Arabic handwriting recognition systems because of the cursive nature of Arabic script and the fact that some words overlap vertically [2].
II. ARABIC OCR SYSTEM
The goal of Optical Character Recognition (OCR) is to classify optical patterns (often contained in a digital image) corresponding to alphanumeric or other characters. The OCR process involves several steps, including segmentation, feature extraction, and classification.
We have used a distributed Arabic handwriting recognition system based on cloud technologies. The results obtained confirm that our approach provides a promising framework for scaling our application along several dimensions, such as storage, networking, server processing (CPU cycles, RAM capacity, etc.) and database transactions per second, and for integrating (combining) strong complementary approaches, which can lead to efficient and powerful handwriting OCR systems.
In the feature extraction stage, each character is represented as a feature vector, which becomes its identity. The major goal of feature extraction is to extract a set of significant features that maximizes the recognition rate with the fewest elements. Feature extraction methods can be classified into three categories: statistical, structural, and global transformations and moments [3].
The classification process consists of two steps: training and testing. These steps can be broken down further into sub-steps: preprocessing (putting the data into a suitable form), feature extraction (reducing the amount of data by extracting relevant information) and model estimation (from the finite set of feature vectors). The testing step uses the same preprocessing and feature extraction sub-steps, but its classification consists of comparing the feature vector to the various models in the reference data.
Keywords— Large scale; efficient; Arabic handwriting OCR; cloud computing; cloud storage
I. INTRODUCTION
Handwriting OCR, especially for Arabic, still constitutes a big challenge, particularly when a large number of documents must be computerized, despite the wide range of approaches and techniques proposed to solve the inherent problems [1].
III. CLOUD COMPUTING
Fortunately, distributed infrastructures such as cloud computing can provide enough computing power and storage to solve our problem.
A Cloud is a type of parallel and distributed system consisting of a collection of interconnected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources based on service-level agreements established through negotiation between the service provider and consumers [4].
The paper begins with an introduction to the Arabic OCR system and cloud computing technologies. The second section presents our problem statement. An overview of our approach to solve this problem is presented in section 3. Finally, the obtained results of the
Today's cloud computing is primarily used to deliver infrastructure, platform, and software as services, which are made available as subscription-based services in a pay-as-you-go model to consumers. In industry, these services are referred to respectively as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS) [5].
4th International Conference on Arabic Language Processing, May 2–3, 2012, Rabat, Morocco
of the Arabic text (document) is not of good enough quality. Arabic handwriting OCR is now a very attractive field of research [10], [11], especially for historical documents. Indeed, the complex morphology and the cursive nature of this writing explain the weakness of the proposed approaches. For that reason, the technologies used need to be more efficient.
Software as a Service is the idea that someone can offer you a hosted set of software (running on a platform and infrastructure) that you do not own but pay for according to some measure of utilization, such as per user or some other consumption basis. Platform as a Service is an application development and deployment platform delivered as a service to developers over the Web. Infrastructure as a Service (IaaS) is the delivery of hardware (server, storage and network) and associated software (operating systems, virtualization technology, file system) as a service [6].
The concept of efficiency suggests choosing an efficient storage infrastructure that reduces overall operating costs and satisfies the other requirements of a large-scale application in a pervasive environment. To remain competitive and to comply with national and international regulations, there is an urgent need to develop a system for monitoring and facilitating the creation of a digital library that offers a number of benefits, such as scalability and the ability to store and retrieve large amounts of data from any location at any time (user mobility in a pervasive environment). The digitization of the national Arabic handwritten cultural heritage is the motivation for our project [10]. The project is expected to connect with leading digital libraries such as Google, and to digitize many books, periodicals and manuscripts. Our project would make access to data much easier for researchers and students in all regions of the country. To ease the use of such documents as an archive and make them readable by a large audience, it is necessary to digitize them instead of leaving them in the form of binary image files.
Figure 1: Cloud computing structure
There are several deployment models of cloud computing, such as private, community, public and hybrid [7]. A public cloud describes cloud computing in the traditional, off-site sense, while a private cloud emulates cloud computing on private networks. A community cloud shares infrastructure between several organizations from a specific community with common concerns (security, compliance, etc.). Finally, the hybrid cloud, a combination of public and private cloud offerings, will be typical for most enterprises [8].
The first idea that comes to mind is to use a single computer to recognize the images. We therefore started by digitizing some documents as a sequence of words. In this case, the Arabic words are recognized sequentially on a PC (3.4 GHz CPU, 1 GB of RAM, running Windows XP Professional). The recognition process takes 5.85 minutes for a single document of 9000 words. Figure 1 illustrates the results as a graph.
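To make the single-machine baseline concrete, the short sketch below converts the reported figures (9000 words in 5.85 minutes) into a throughput and a per-word cost, which come to roughly 25.6 words per second and 39 ms per word. The function names are illustrative, and the projection helper assumes that recognition time grows linearly with the number of words, which is an assumption and not a result reported here.

```python
# Baseline figures taken from the text: one PC, one document.
BASELINE_WORDS = 9000
BASELINE_MINUTES = 5.85


def throughput_words_per_second(words: int, minutes: float) -> float:
    """Average number of words recognized per second."""
    return words / (minutes * 60.0)


def seconds_per_word(words: int, minutes: float) -> float:
    """Average recognition time for one word, in seconds."""
    return (minutes * 60.0) / words


def projected_minutes(words: int) -> float:
    """Projected sequential time for a larger corpus.

    Assumption (not from the paper): time scales linearly with word count.
    """
    return BASELINE_MINUTES * words / BASELINE_WORDS


rate = throughput_words_per_second(BASELINE_WORDS, BASELINE_MINUTES)
cost = seconds_per_word(BASELINE_WORDS, BASELINE_MINUTES)
print(f"throughput: {rate:.1f} words/s")
print(f"cost: {cost * 1000:.0f} ms per word")
print(f"projected time for 90000 words: {projected_minutes(90000):.1f} min")
```

Under the linear assumption, a corpus ten times larger would keep a single machine busy for about an hour, which illustrates why sequential recognition does not suit large collections.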
IV. PROBLEM STATEMENT
In many national libraries and archive centers, most documents are still in the form of newspapers, books, magazines, research papers, conference proceedings, dissertations, and monographs. The digitization of such documents presents a big challenge, according to the Australian project [9]. Recall that this project concerns only Latin documents. Arabic writing has attracted many researchers for several decades in attempts to build powerful printed Arabic OCR systems. Unfortunately, the results obtained remain very weak, especially if the input binary image file
platform to deploy our classification and feature extraction application, which needs enough computing power.
Figure 2: Variation of the speedup according to the size of documents.
It is obvious that this solution has many disadvantages: it does not adapt to a huge number of documents (limited storage capacity), its computing power decreases, and documents can only be accessed locally. Figure 2
Figure 4: Arabic OCR system based on cloud computing technologies (figure labels: Applications; Segmentation; Feature extraction; Classification; SaaS; PaaS; IaaS; Cloud storage (learning and training database))
The learning and training databases are partitioned, with a separate server responsible for operating on the data within each partition; we then follow a design in which data is replicated at the physical level. The management and availability of data are provided as a service.
VI. THE EXPERIMENTAL STUDY
In this section, we present the experiments and evaluation that we undertook to quantify the efficiency of our approach. The experiments were conducted on an Intel Core 2 Duo virtual machine with the following configuration: 2 × 3.00 GHz, 2 GB of RAM, running a standard Ubuntu Linux 11.10 with JDK 1.6; the network capacity was 100 Mbit/s. To demonstrate the influence of cloud technologies on our application, we used corpora of different sizes (1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000 and 9000 words) randomly chosen from the IFN/ENIT database of handwritten Tunisian town names.
Figure 3: Analytic performance using one computer
Therefore, it is necessary to build a robust application that shortens processing time and increases throughput. This is possible using a distributed system such as cloud computing.
We have also considered a reference library of 345 characters representing approximately the whole Arabic alphabet (including the variation of character shapes according to their position within words, and different placements (rotation and translation)). Figure 5 illustrates a sample of the studied corpus.
Consequently, we can conclude that a large-scale OCR system requires substantial computing power and storage.
V. THE PROPOSED APPROACH
In this paper, we propose a novel approach to distributing the Arabic handwriting OCR system. The idea is to treat the data storage in the training and test steps as a service (SaaS), using a strong and complementary approach such as cloud storage, and to consider cloud computing as
WEB-INF/cloudbees-web.xml Training data base Learning data base
Figure 5: A sample of the studied corpus
The character image is divided into N×M zones. From each zone, features are extracted to form the feature vector. The goal of zoning is to obtain local rather than global characteristics. We have used the Hough transform as a feature extraction technique for Arabic handwritten characters. This technique is insensitive to variations in the characters (rotation, translation) [12]. The implementation of the Hough transform does not require any particular parameter; we simply program the different stages of the method [13].
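The zoning step described above can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: it assumes a binary image stored as a list of rows (1 for ink, 0 for background), and it uses the ink density of each zone as a simple stand-in feature, whereas the system described here extracts Hough-transform features. The name `zoning_features` and the sample glyph are hypothetical.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def zoning_features(image: Image, n_rows: int, n_cols: int) -> List[float]:
    """Split the image into n_rows x n_cols zones and return one
    ink-density value per zone, in row-major order."""
    height = len(image)
    width = len(image[0])
    features: List[float] = []
    for zr in range(n_rows):
        # Integer boundaries cover the whole image even when the
        # height is not a multiple of n_rows.
        top = zr * height // n_rows
        bottom = (zr + 1) * height // n_rows
        for zc in range(n_cols):
            left = zc * width // n_cols
            right = (zc + 1) * width // n_cols
            area = (bottom - top) * (right - left)
            ink = sum(
                image[y][x]
                for y in range(top, bottom)
                for x in range(left, right)
            )
            # An empty zone (image smaller than the grid) contributes 0.
            features.append(ink / area if area else 0.0)
    return features


# Hypothetical 4x4 glyph split into 2x2 zones.
glyph = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
print(zoning_features(glyph, 2, 2))  # [1.0, 0.0, 0.0, 0.75]
```

Each zone contributes one value, so the feature vector has N×M elements and reflects where the ink lies rather than only how much ink there is in total.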
We have used many instances of Cloudbees with different amounts of documents to recognize (2000, 4000, 6000, 8000 and 10000 words). To analyze and monitor our experiments, we use NewRelic [17], which reports many metrics such as the response time of our application, availability, storage capacity, CPU cycles and RAM capacity.
For the classification step, there are many methods, such as k-Nearest Neighbour (k-NN), Bayes classifier, Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM), etc. [14]. There is no such thing as the "best classifier". The choice of classifier depends on many factors, such as the available training set, the number of free parameters, etc. In our case, we have used the Euclidean Minimum Distance Classifier (EMDC) for its various qualities [15].
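A minimum-distance classifier of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions, not the code used in the experiments: each class model is taken to be the mean of its training feature vectors, and a test vector is assigned to the class whose mean is nearest in Euclidean distance. The function names, the class labels and the two-dimensional sample vectors are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def train_class_means(
    samples: Dict[str, List[Sequence[float]]]
) -> Dict[str, List[float]]:
    """Training step: estimate one model (the mean vector) per class."""
    means: Dict[str, List[float]] = {}
    for label, vectors in samples.items():
        dim = len(vectors[0])
        means[label] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(dim)
        ]
    return means


def classify(vector: Sequence[float], means: Dict[str, List[float]]) -> str:
    """Testing step: return the label whose mean is closest to the vector."""
    return min(means, key=lambda label: math.dist(vector, means[label]))


# Hypothetical two-class training set with 2-D feature vectors.
training = {
    "alif": [[0.9, 0.1], [1.0, 0.0]],
    "ba": [[0.1, 0.8], [0.0, 1.0]],
}
models = train_class_means(training)
print(classify([0.8, 0.2], models))  # alif
print(classify([0.2, 0.7], models))  # ba
```

The classifier has no free parameters to tune and needs only one stored vector per class, which keeps the reference data small; `math.dist` requires Python 3.8 or later.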
Figure 6 presents the variation of the response time according to the amount of documents to recognize.
We have chosen the free version of the cloudbees [16] cloud computing platform to test, evaluate, and make use of our approach. First, we execute our application on the local host; then we deploy it as a WAR (Web application archive) file using the commands
    jar cf ../hamdi/OCR.war *
    bees getapp -a hamdi/ocr
    bees deploy -a hamdi/ocr
    bees run -a hamdi/ocr
Second, we deploy the two databases (training and learning) in cloudbees.
Figure 6: Variation of the speedup according to the size of documents using cloud technologies
Both the application and the two databases are interfaced with a cloudbees console so that they have the same overhead.
Our studies show that the response time grows slowly compared with the growth in the amount of documents to recognize.
We must insert the XML file into our application to register this database as a datasource.
Thus, our approach provides the flexibility and dependability to process a large number of documents, and a level of reliability sufficient to process thousands of images in minimal time.
offered by cloud storage technologies increases the recognition rate compared to existing results [12] obtained with a single server. Thus, cloud computing offers many benefits, such as the scalability of the platform, the availability of resources to a large number of users, the ability to conduct research into scalable computing for OCR, and the consolidation of support and maintenance (dynamically scalable and often virtualized resources provided as a service over the Internet on a utility basis).
Other performance measures of our application have also been studied, such as storage capacity, CPU cycles and RAM capacity.
Regarding the efficiency of our system, OCR based on cloud technology lets us decrease the cost of all software and hardware investment, thus increasing processing efficiency. The results show how cloud technology can simplify the task of ensuring that critical data is always available without the cost and complexity of traditional data replication [18]. Another advantage of using cloud computing for OCR systems is that our application can be used in a pervasive environment: we can access it from any mobile platform (iPhone, iPad, Android, WinCE and other popular embedded operating systems); the implementation of this task is future work.
VII. CONCLUSION AND PERSPECTIVE
Figure 7: Analytic performance using cloud computing
Figure 7 shows the linear scalability of the performance measures mentioned above as user demand (the amount of documents to recognize) increases. For a document of 2000 words, 54% of the total storage capacity is available; this figure remains at a similar value (55%) for a larger document (6000 words). The other performance measures behave in the same way as the amount of documents varies.
In this paper, we proposed the use of cloud computing technologies for a large-scale and efficient distributed Arabic handwriting recognition system. The performance evaluation of the proposed approach confirms that cloud computing can provide an effective framework to speed up the recognition process. In addition, such an infrastructure also allows the integration (combination) of strong complementary approaches in order to improve the recognition rate of the resulting system. Consequently, it will be easy to implement powerful and scalable handwritten OCR systems that are much more powerful than existing products [18][19][20].
To demonstrate the importance of cloud technologies for recognition accuracy, we increased the size of the learning database. The experiments and results are shown in Figure 8.
The proposed design approach requires further investigation. In particular, we will examine how to distribute the different stages of the OCR system, such as preprocessing, segmentation and feature extraction, between the servers of the cloud, and the idea of using several clouds at the same time (an "inter-cloud infrastructure").
ACKNOWLEDGMENT
The authors wish to thank the reviewers for their fruitful comments. The authors also wish to acknowledge the members of the Miracl laboratory, Sfax, Tunisia, and the members of the Computer Science Department, College of Science and Arts at Al-Ola, Taibah University, KSA.
Figure 8: Recognition rate according to the size of the learning database using cloud technologies
REFERENCES
[1] S. Sangsawad and C. Fung, Using Content Based Image Retrieval Techniques for the Indexing and Retrieval of Thai Handwritten Documents, IEEE Xplore, vol. 1, June 2010.
Figure 8 shows that the possibility of storing a large number of documents, especially the learning database,
[2] Mohsen Zand, Ahmadreza Naghsh Nilchi, and S. Amirhassan Monadjemi, Recognition-based Segmentation in Persian Character Recognition, World Academy of Science, Engineering and Technology 38, 2008.
[3] G. Vamvakas, B. Gatos, I. Pratikakis, N. Stamatopoulos, A. Roniotis and S.J. Perantonis, "Hybrid Off-Line OCR for Isolated Handwritten Greek Characters", The Fourth IASTED International Conference on Signal Processing, Pattern Recognition, and Applications (SPPRA 2007), ISBN: 978-0-88986-646-1, pp. 197-202, Innsbruck, Austria, February 2007.
[4] Twenty Experts Define Cloud Computing, http://cloudcomputing.sys-con.com/read/612375_p.htm [18 July 2008].
[5] M. Armbrust, A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, M. Zaharia, Above the Clouds: A Berkeley View of Cloud Computing, Technical Report No. UCB/EECS-2009-28, University of California at Berkeley, USA, Feb. 10, 2009.
[6] Sushil Bhardwaj, Leena Jain, Sandeep Jain, Cloud Computing: A Study of Infrastructure as a Service (IaaS), International Journal of Engineering and Information Technology, Vol. 2, No. 1, IJEIT 2010.
[7] Peter Mell, Timothy Grance, The NIST Definition of Cloud Computing (Draft), Special Publication 800-145 (Draft).
[8] Ke Zeng, Ann Cavoukian, Modelling Cloud Computing Architecture Without Compromising Privacy: A Privacy by Design Approach, NEC Company, Ltd. and Information and Privacy Commissioner, Ontario, Canada, May 2010.
[9] R. Holley, How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs, D-Lib Magazine, March/April 2009, vol. 15, no. 3/4.
[10] http://www.emploi.gov.tn/index.php?id=133&L=2&tx_ttnews[tt_news]=1311&tx_ttnews[backPid]=34&cHash=8de7ae1227
[11] H. Zoheir, Le role des technologies dans le stockage du manuscrit arabe, Cybrarians Journal, no. 14, September 2007.
[12] Hassen Hamdi, Maher Khemakhem, A Comparative Study of Arabic Handwritten Characters Invariant Feature, (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 2, No. 12, 2011.
[13] Charles V. Stewars, Computer Vision, The Hough Transform, February 24, 2011.
[14] C. Huang, S. Srihari, Word Segmentation of Off-line Handwritten Documents, in: Proceedings of the Document Recognition and Retrieval (DRR) XV, IST/SPIE Annual Symposium, San Jose, CA, USA, January 2008.
[15] Jon Dattorro, Convex Optimization & Euclidean Distance Geometry, Meboo Publishing USA, 2005.
[16] https://grandcentral.cloudbees.com/services
[17] https://rpm.newrelic.com/accounts/73522/setup
[18] Iman Zangeneh, Mostafa Moradi, Ali Mokhtarbaf, The Comparison of Data Replication in Distributed Systems, World Academy of Science, Engineering and Technology 59, 2011.
[19] CiyaICR product, http://www.Ciyasoft.com, 2004.
[20] M. Khemakhem and A. Belghith, Towards a Distributed Arabic OCR Based on the DTW Algorithm: Performance Analysis, The International Arab Journal of Information Technology, Vol. 6, No. 2, April 2009.
[21] Hassen Hamdi, Maher Khemakhem, Distributing Arabic Handwriting Recognition System Based on the Combination of Grid Meta-Scheduling and P2P Technologies (Omnivore), Universal Journal of Computer Science and Engineering Technology 1 (1), 31-35, Oct. 2010. © 2010 UniCSE, ISSN: 2219-2158.