4th International Conference on Next Generation Web Services Practices
A Simple Approach to Optimized Text Compression’s Performance
Tanakorn Wichaiwong1, Kitti Koonsanit2 and Chuleerat Jaruskulchai1
1 Department of Computer Science, Faculty of Science, Kasetsart University, Bangkok, Thailand
2 National Electronics and Computer Technology Center, Pathumthani, Thailand
E-mail: {g4964210, g4964202, fscichj}@ku.ac.th
Abstract
Most data exchanged on the World Wide Web is still represented as text, for example in Web Services based on XML technology and in data stored in relational databases. Unfortunately, the attractive properties of text come at the expense of data transfer performance. Compression is one way to improve it. In this paper we present a new compression algorithm that uses capitalization. The mechanism has three steps: first, remove white space; second, compress the data into the UpperCamelCase capitalization style; and last, decompress the compressed data. Our experiments show significant gains: the algorithm reduces data size by up to 22% while keeping data integrity. In addition, the compressed data remains easy to read and understand, like the naming conventions of several programming languages.

Keywords: Capitalization Styles, Information Retrieval, Compression, Performance

1. Introduction
Most data exchanged on the World Wide Web is still represented as text, for example in Web Services [1, 2, 3] based on XML [4, 5] technology and in data stored in relational databases. Efficient storage and transportation of data is therefore an important issue. XML [6] designs can be divided into two types. The first, XML for Text, contains more content than tags or elements, as in news and essays. The second, XML for Data, contains more tags or elements than content, as in electronic commerce data. In this paper we address data communicated as XML for Text.

Most data exchanged in the real world contains a lot of white space, which makes the content readable and easy for users to understand. In Information Retrieval (IR), however, index construction includes steps that eliminate tokens and stopwords in order to improve the quality of the relevant set and to reduce the space, memory, and computation time required.

Since 1970, the CamelCase capitalization styles have become widespread, and they have been adopted as the standard identifier naming convention in several programming languages, such as C, Java, and C#. In this article CamelCase is divided into two types: UpperCamelCase (or PascalCase) and lowerCamelCase. Most developers and organizations use the term CamelCase only for lowerCamelCase.

In this paper we discuss a new algorithm for compressing data efficiently in our system. It differs from other algorithms in that the compressed data is still easy to read and understand.

The paper is organized as follows. In the next section we discuss related work in text data compression, index construction in Information Retrieval, and capitalization styles. In Section 3 we describe our system design and how it processes data. In Section 4 we study the experimental results, and in Section 5 we draw conclusions and outline a path for future work.
978-0-7695-3455-8/08 $25.00 © 2008 IEEE DOI 10.1109/NWeSP.2008.12
2. Related Works
Recall that document preprocessing for inverted index construction in Information Retrieval [7, 8] consists of the following steps:
1. Lexical analysis of the text, which treats digits, hyphens, punctuation marks, and the case of letters.
2. Elimination of stopwords, which filters out words with very low discrimination value for the IR process; eliminating stopwords alone can reduce the size of the indexing structure by 40% or more.
3. Stemming of the remaining words, which removes affixes.
4. Selection of index terms.
5. Construction of term categorization.

Step 2 removes all stopwords from the document, and with them the white space, because most documents, such as those written in English, use white space only to separate words or terms. From the IR point of view, white space can therefore be treated as an unimportant term. Consequently, text compression can achieve a better compression ratio if the white space in a document is removed as well.

Lossless data compression is a mature field of research [9], based mainly on Claude Shannon’s information theory. The theory establishes a direct correlation between the probability of occurrence of a symbol and the number of bits needed to encode it. Huffman coding [10] achieves the minimum redundancy possible in a fixed set of variable-length codes. It uses statistical modeling to encode symbols according to their probability of occurrence. In contrast, a dictionary-based compression scheme looks for groups of symbols that already occur in a collection of data. When a match is found in the collection, a reference to the match is output instead of the codes for the individual symbols. The compression ratio depends on the length of the matches: the longer the match, the better the ratio. In LZ77 compression [11], for example, the collection of data consists of all the strings in a window into the previously read input stream. The deflate algorithm [12] uses a combination of LZ77 compression and Huffman coding. It is used in popular compression applications such as gzip [13].

The capitalization styles [14, 15, 16] fall into three types. First, in UpperCamelCase (Pascal), the first letter of the identifier and the first letter of each subsequent concatenated word are capitalized, as in BackColor and DataSet. Second, in lowerCamelCase (Camel), the first letter of the identifier is lowercase and the first letter of each subsequent concatenated word is capitalized, as in backColor and dataSet. Last, in UpperCase (Upper), all letters of the identifier are capitalized, as in BACKCOLOR and DATASET. Table 1 shows examples of the different types of identifiers in Microsoft .NET.
TABLE 1. EXAMPLE OF THE DIFFERENT TYPES OF IDENTIFIERS IN MICROSOFT .NET

Identifier         Case              Example
Class              UpperCamelCase    AppDomain
Enum type          UpperCamelCase    ErrorLevel
Enum values        UpperCamelCase    FatalError
Events             UpperCamelCase    ValueChanges
Exception          UpperCamelCase    IOException
Constant values    UpperCase         RED
Interface          UpperCamelCase    IDisposable
Method             UpperCamelCase    ToString
Namespace          UpperCamelCase    System.Text
Parameter          lowerCamelCase    typeName
Property           UpperCamelCase    BackColor
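As a rough illustration of the three capitalization styles described in this section, the following Python sketch converts a list of words into each style. The function names are hypothetical and chosen only for this example; they are not part of .NET or of any Python library.

```python
def upper_camel_case(words):
    """UpperCamelCase (Pascal): capitalize the first letter of every word."""
    return "".join(w[:1].upper() + w[1:].lower() for w in words)


def lower_camel_case(words):
    """lowerCamelCase (Camel): as Pascal, but the very first letter is lowercase."""
    pascal = upper_camel_case(words)
    return pascal[:1].lower() + pascal[1:]


def upper_case(words):
    """UpperCase (Upper): every letter of the identifier is capitalized."""
    return "".join(w.upper() for w in words)


words = ["back", "color"]
print(upper_camel_case(words))  # BackColor
print(lower_camel_case(words))  # backColor
print(upper_case(words))        # BACKCOLOR
```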
3. Text Compression Using Capitalization Technique
We focus on data sent over the network, most of which contains white space that is transmitted along with the content. Our analysis of English-language data such as news and essays shows that white space occurs very frequently. White space is not an essential part of the data, since the data can remain readable without it. Following the document preprocessing step of Information Retrieval, we therefore look for a way to remove or replace white space in order to improve data transfer performance and use less network bandwidth. In addition, we address a disadvantage of existing compression algorithms: compressed data must be decompressed before it can be used. This research presents a new way to compress data while keeping the integrity of the compressed data. The technique has three steps: 1) remove white space, 2) compress data, and 3) decompress data. The steps are detailed as follows:

3.1. Remove white space: white space is removed from the original message that the client site sends to request information from web services.
3.2. Compress data: the first letter of each subsequent concatenated word is capitalized.
3.3. Decompress data: the compressed data is restored to the original data.

The structure of Text Compression Using Capitalization Technique is shown in Figure 1.
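The three steps above can be sketched in Python as follows. This is only a minimal sketch, not the authors' C# implementation: it assumes the input consists of lowercase alphabetic words separated by single spaces, and the names `compress` and `decompress` are hypothetical. The paper does not state how capital letters or punctuation already present in the input are handled, so the round trip is only guaranteed under that assumption.

```python
import re


def compress(text):
    """Steps 1 and 2: drop the spaces and capitalize the first letter of each word."""
    return "".join(word[:1].upper() + word[1:] for word in text.split(" ") if word)


def decompress(packed):
    """Step 3: start a new word at every capital letter, then restore lowercase.

    Assumption: every capital letter in `packed` marks a word boundary.
    """
    words = re.findall(r"[A-Z][^A-Z]*|[^A-Z]+", packed)
    return " ".join(word.lower() for word in words)


original = "white space is removed before the data is sent"
packed = compress(original)
print(packed)              # WhiteSpaceIsRemovedBeforeTheDataIsSent
print(decompress(packed))  # white space is removed before the data is sent
```

Under the stated assumption, the compressed form saves one byte per removed space and remains readable without decompression.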
Figure 1. The structure of Text Compression Using Capitalization Technique

TABLE 2. THE CHARACTERISTICS OF EXAMPLE DOCUMENT

File          Content (KB)    White space (KB)    %
File01.txt    1,451           304                 20.95
File02.txt    2,066           336                 16.26
File03.txt    3,038           516                 16.98
File04.txt    4,691           1,019               21.72
File05.txt    5,435           1,216               22.37
File06.txt    6,031           1,084               17.97
File07.txt    7,728           1,535               19.86
File08.txt    8,205           1,614               19.67
File09.txt    9,197           1,491               16.21
File10.txt    10,753          1,873               17.42
TABLE 3. COMPARISON OF THE DATA SIZE OF THE DOCUMENTS

File          Original (KB)    Compressed (KB)    %
File01.txt    1,451            1,147              20.95
File02.txt    2,066            1,730              16.26
File03.txt    3,038            2,522              16.98
File04.txt    4,691            3,672              21.72
File05.txt    5,435            4,219              22.37
File06.txt    6,031            4,947              17.97
File07.txt    7,728            6,193              19.86
File08.txt    8,205            6,591              19.67
File09.txt    9,197            7,706              16.21
File10.txt    10,753           8,880              17.42

4. Experiment
We address the occurrences of white space in communicated data. For the experiment on Text Compression Using Capitalization Technique, we developed methods to compress and decompress data. The experiment was run on an Intel Pentium Dual-Core 1.87 GHz machine with 1 GB of memory, running Microsoft Windows XP Professional with Service Pack 2, and the system was developed with Microsoft Visual C#.NET 2008.

4.1. Effectiveness Measurement
The characteristics of the example documents [17, 18] are shown in Table 2. The effectiveness of the technique is measured as the proportion by which the data size is reduced, which can be found by using:
Proportion = [1 - (the size of compressed data / the size of actual data)] x 100%

As shown in Table 3 and Figure 2, the compression technique reduces the size of the data compared with the original data before compression, and Table 4 shows the time taken to compress and decompress. Table 5 and Figure 3 compare the data size obtained with gZip alone and with our compression technique followed by gZip. Figures 4, 5, and 6 show example data for our algorithm.
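As a worked check of the proportion formula, the short Python sketch below applies it to two rows taken from Table 3. The function name `reduction_percent` is hypothetical, and the result is expressed as a percentage to match the % column of the table.

```python
def reduction_percent(original_kb, compressed_kb):
    """Proportion of size saved: [1 - (compressed / original)], as a percentage."""
    return (1 - compressed_kb / original_kb) * 100


# (original KB, compressed KB) for two files from Table 3
sizes = {"File01.txt": (1451, 1147), "File05.txt": (5435, 4219)}

for name, (original_kb, compressed_kb) in sizes.items():
    print(f"{name}: {reduction_percent(original_kb, compressed_kb):.2f}%")
# File01.txt: 20.95%
# File05.txt: 22.37%
```

File05.txt gives the largest saving in the table, which corresponds to the reduction of up to 22% reported in the abstract.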
Figure 2. Graph showing the size of data (series: Original, Compressed; files 1 to 10)
TABLE 4. THE LENGTH OF TIME USED TO COMPRESS AND DECOMPRESS

File          Compress (ms)    Decompress (ms)
File01.txt    223              178
File02.txt    331              293
File03.txt    561              395
File04.txt    720              537
File05.txt    1,102            753
File06.txt    1,100            747
File07.txt    1,160            1,197
File08.txt    1,274            911
File09.txt    1,458            1,115
File10.txt    1,716            1,401
Figure 4. The example of original data
TABLE 5. COMPARISON OF THE SIZE WITH gZIP AND COMPRESS + gZIP

File          gZip (KB)    Compress + gZip (KB)
File01.txt    452          388
File02.txt    626          583
File03.txt    1,009        931
File04.txt    1,468        1,243
File05.txt    2,104        1,955
File06.txt    2,125        2,044
File07.txt    2,475        2,172
File08.txt    2,633        2,309
File09.txt    2,846        2,652
File10.txt    3,637        3,446
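The comparison in Table 5 could be reproduced along the following lines. This is a sketch under assumptions: `camel_pack` is a hypothetical stand-in for the paper's compressor, Python's standard `gzip` module stands in for the gZip tool, and the short repeated sample is made up for illustration, so its sizes say nothing about the figures in Table 5.

```python
import gzip


def camel_pack(text):
    """Hypothetical stand-in for the paper's compressor: drop white space, capitalize words."""
    return "".join(word[:1].upper() + word[1:] for word in text.split())


def size_report(text):
    """Return the byte sizes of the four variants compared in the experiment."""
    raw = text.encode("utf-8")
    packed = camel_pack(text).encode("utf-8")
    return {
        "original": len(raw),
        "compressed": len(packed),
        "gzip": len(gzip.compress(raw)),
        "compress+gzip": len(gzip.compress(packed)),
    }


# Made-up sample text; a real run would read the corpus files instead.
sample = "information retrieval systems remove stopwords before building the index " * 50

for label, size in size_report(sample).items():
    print(f"{label:>14}: {size} bytes")
```

Whether the capitalization step still helps after gzip depends on the input text, so the sketch reports the sizes rather than assuming an outcome.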
Figure 5. The example of compressed data
Figure 6. The example of decompress data
Figure 3. Graph showing the size of data with gZip and compression with gZip (series: gZip, Compress + gZip; files 1 to 10)

5. Conclusion
Most data exchanged on the World Wide Web is still represented as text, for example in Web Services based on XML technology and in data stored in relational databases. Unfortunately, the attractive properties of text come at the expense of data transfer performance. Data compression is important for improving data transfer on today's World Wide Web, where large amounts of data must be exchanged: it can increase the effectiveness of data transfer and use less network bandwidth. Past research on data compression has shown that data size can be reduced, but the resulting techniques are difficult to use with web services and require a decompression step before the data can be used. The technique presented in this paper, aimed at XML for Text, reduces the data size while keeping documents easy to understand, because the compressed data is in CamelCase form, like the naming conventions of several programming languages. It is also easy to use. This research stresses the importance of the data inside XML. We will therefore continue this research by applying our algorithm as part of the Web Services process.
References
[1] World Wide Web Consortium, “Web Services Transfer (WS-Transfer)”, available at http://www.w3.org/Submission/WS-Transfer/
[2] Solutions Architect, IBM jStart Emerging Technologies, “Web services architect: Part 1”, April 01.
[3] Roger Wolter, Microsoft Corporation, “XML Web Services Basics”, available at http://msdn2.microsoft.com/en-us/library/ms996507.aspx
[4] Jean Paoli, Eve Maler, Tim Bray, et al., Editors, World Wide Web Consortium, “Extensible Markup Language (XML) 1.0 (Fourth Edition)”, available at http://www.w3.org/TR/REC-xml
[5] D. Hunter, C. Cagle, D. Gibbons, N. Ozu, J. Pinnock and P. Spencer, “Beginning XML”, Wrox Press, 2002.
[6] Kevin Williams, Michael Brundage, Patrick Dengler, Jeff Gabriel, et al., Editors, “Professional XML Databases”, Wrox Press, 2000.
[7] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, 1999.
[8] C. D. Manning, P. Raghavan and H. Schütze, An Introduction to Information Retrieval, 2008.
[9] M. Nelson, The Data Compression Book, M&T Books, 1992.
[10] D. A. Huffman, “A method for the construction of minimum redundancy codes”, Proceedings of the IRE, Volume 40, Number 9, September 1952, pages 1098-1101.
[11] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression”, IEEE Transactions on Information Theory, Volume 23, Number 3, May 1977, pages 337-343.
[12] N. Jesper Larsson and Alistair Moffat, “Offline Dictionary-Based Compression”, IEEE Transactions on Information Theory, 1999.
[13] J. L. Gailly and M. Adler, “Gzip: The compressor data”, available at http://www.gzip.org/
[14] 3Suns, “CamelCase”, available at http://everything2.com/e2node/CamelCase
[15] Brad Abrams, “Design Guidelines, Managed Code and the .NET Framework”, available at http://blogs.msdn.com/brada/default.aspx
[16] David Bolton, “About Pascal and Camel Case”, available at http://cplus.about.com/od/learnc/ss/csharpclasses_5.htm
[17] Mike Hammond, “Brown corpus in plain text”, available at http://dingo.sbs.arizona.edu/~hammond/ling696fsp03/browncorpus.txt
[18] Peerachet Porkaew, “Lexitron corpus”, available at http://lexitron.nectec.or.th/clean.src
[19] BEA Systems, IBM, Microsoft, SAP AG and Siebel Systems, “Business Process Execution Language for Web Services”, available at http://www128.ibm.com/developerworks/library/specification/wsbpel/
[20] Refsnes Data, “W3 Schools”, available at http://www.w3schools.com/default.asp