3rd International Conference on Computer Science Networks and Information Technology, 26th - 27th August 2017, Held at Université du Québec, Montreal, Canada. ISBN: 9780998900032

A METHOD FOR COMPRESSION OF SHORT UNICODE STRINGS

Masoud Abedi
Rayan Tahlil Sepahan Co., Isfahan Science and Technology Town, Isfahan, Iran

Peter Luksch
Faculty of Computer Science, University of Rostock, Germany

Abbas Malekpour
Faculty of Computer Science, University of Rostock, Germany

Mohammad Reza Mojtabaei
Department of Computer, Mohajer Technical University, Isfahan, Iran
Abstract— The use of short texts in communication has been greatly increasing in recent years. Applying different languages in short texts has led to the compulsory use of Unicode strings. These strings need twice the space of common strings; hence, applying compression algorithms to accelerate transmission and reduce cost is worthwhile. Nevertheless, applying compression methods like gzip, bzip2 or PAQ is not appropriate, due to the high overhead of data at the beginning of the message. The Huffman algorithm is one of the rare algorithms effective in reducing the size of short Unicode strings. In this paper, an algorithm is proposed for compression of very short Unicode strings. The proposed algorithm has four stages of compression. In the first stage, every new character to be sent to a destination is inserted in the proposed mapping table; at the beginning, every character is new, and in case a character is repeated for the same destination, it is not considered as a new character. In the second and third stages, the new characters, together with the mapping values of repeated characters, are arranged through a specific technique. In the fourth stage, the characters of stage three are specially formatted to be transmitted. The results obtained from an assessment made on a set of short Persian and Arabic strings indicate that the proposed algorithm outperforms the Huffman algorithm in size reduction.

Keywords— Algorithms, Data compression, Decoding, Encoding, Huffman codes, Text communication

I. INTRODUCTION

As the internet is growing significantly all over the world, the speed of data transmission needs to increase day by day. Strings are used widely in many areas such as short messages, social networks, webpages, and distributed databases. The number of webpages is on a drastic rise globally: it has been shown that the number of available webpages doubles every 9 to 12 months [1]. On the other hand, the use of social networks is in constant, intense growth. Facebook, the most famous social network in 2014, recorded 1.4 billion active users in one month, and has estimated the share of its mobile users at 90% [2]. Meanwhile, the role of smartphones should not be ignored. Considering the great number of strings on the internet, the
necessity to reduce the costs and increase the speed of data transmission is an endless challenge [1]. The need to support other languages increases the use of the Unicode standard, which in turn doubles the strings' size. For example, for sending a short message through a mobile phone, 16 bits of space are used for the transmission of a single character applying the 2-byte Universal Character Set (UCS2), while it takes up only 8 bits without the use of Unicode. Therefore, using Unicode (UCS2) reduces the message length from 140 characters to 70 characters [3].

The string size appears to be very low, but considering the great number of messages sent and received in a day by only one specific user of a social network, it becomes a substantial size. Regarding the Short Message Service (SMS), the number of sent messages in December 2009 alone reached 152.7 billion through 286 million users, which means an average of 534 messages per user in one month [4]. Consider a distributed database with millions or maybe billions of records; how many strings should be sent and received for creating such a database and keeping it up to date? This extensive application justifies the importance of using string compression methods.

Information compression in a sense is the elimination of repeated information. This elimination reduces the number of bits needed for representing the information [5]. In string compression, due to high sensitivity, "lossless" methods are applied, where, unlike "lossy" methods, data accuracy is not sacrificed to reach an excellent compression level. Most text compression methods are based on statistical or dictionary classes. In the dictionary-based class, texts are stored in a data structure named a dictionary. Every time a new text fragment is found, it is saved in an entry of the dictionary marked by a pointer; this pointer is used in the process of string compression. The statistical class includes methods that develop statistical models of the text; statistical models can be static or dynamic (adaptive). In general, the modeling stage is set ahead of the coding stage in statistical methods. In the modeling stage, a probability is assigned to the input symbols. In principle, the symbols are coded based on this probability in the coding stage [6].

Most statistical models rely on either a frequency or a context approach. In the frequency method, a probability is assigned to a symbol based on the number of times the symbol is repeated. The higher the probability of a symbol, the shorter the code assigned to it. In the static model, this probability is a fixed value, but in the dynamic model the probability is modified "on the fly" as the input text is being compressed. In the context method, a probability is assigned to a symbol considering its context. This prediction is due to lack of access to the future string: the decoder and the encoder are not aware of the upcoming stream of strings; they only have access to the previously processed symbols, the context of the current symbol [6].

One of the best suggested methods for compression of Unicode strings is applying the lossless Lempel-Ziv-Welch (LZW) method. The LZW algorithm is a member of the Lempel-Ziv family. The Lempel-Ziv technique was introduced in 1977 and is named LZ77. This algorithm is the basis of the "deflate" technique applied in "zip", which is used in programs like WinZip, PKZip, gzip, and in the Unix compression utility. This method keeps its dictionary within the data themselves: the repeated pieces of a string are related to the first observation of the same piece of string using pointers. In 1978, the second version of this algorithm was introduced, named LZ78, where a dictionary is used separately. In 1984 the LZW algorithm was introduced for high-performance disk controllers, but it is mostly used in GIF (Graphics Interchange Format) files. This algorithm is the extended version of LZ78 [7, 8].

One of the Unicode string compression methods is the Huffman algorithm [9], introduced in 1952 and frequently studied ever since. This algorithm has mostly been used for compression of other languages or as a part of a compression process. In the Huffman algorithm, the number of times each character is repeated in a string is determined. The obtained list is sorted in descending order and forms a tree, through which an equivalent code is extracted for each character. This code is made of a number of bits, and its length may differ for each character. The theory behind this method relies on the frequency of some letters versus other letters in natural languages. In this method, the shortest stream of bits is used for the most frequent characters and the longest ones for the rarest characters. This algorithm is used in some formats like MP3 and JPEG [7, 8, 10].
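The tree construction just described can be illustrated with a minimal Python sketch. This is an illustration of classical Huffman coding only, not of the compression method proposed in this paper, and all names in it are this sketch's own:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table for `text` (illustrative sketch)."""
    freq = Counter(text)
    # Each heap entry: (frequency, tie-breaker, {char: code-so-far}).
    heap = [(f, i, {c: ""}) for i, (c, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate single-symbol input
        return {c: "0" for c in heap[0][2]}
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, i, t2 = heapq.heappop(heap)
        # Prefix "0"/"1" onto the codes of the two merged subtrees.
        merged = {c: "0" + code for c, code in t1.items()}
        merged.update({c: "1" + code for c, code in t2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
    return heap[0][2]

codes = huffman_codes("abracadabra")
# 'a' occurs five times and gets the shortest code of the five symbols.
assert all(len(codes["a"]) <= len(code) for code in codes.values())
```

The resulting table is prefix-free, so the variable-length codes can be concatenated and still decoded unambiguously.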
Applying the Huffman algorithm for compression of Arabic texts yields better results compared with the LZW technique, but the opposite holds true for English texts, i.e. the LZW method outperforms the Huffman algorithm in compressing English texts [8]. Applying the Huffman algorithm to Arabic strings yields inefficient results in comparison with Persian strings: Arabic strings are compressed 3.5% less than Persian strings on average. In contrast, short (e.g. 1024 bytes) Persian strings are compressed less in comparison with Arabic strings of the same length [1].

One of the features effective on the implementation of a compression algorithm is the string size. If the size is more than 5 MB, the encountered string is a very big one. In this case, every single word must be considered as a base unit for compression; this compression approach is named word-based [11]. Medium-sized strings are of 200 KB to 5 MB. In this case, selecting a word as a base unit for compression is not appropriate; therefore, a smaller grammar part (i.e. the syllable) is used. Short strings are from 100 to 200 KB, where every character must be considered as a base unit for compression. Though very small strings are about 160 KB, still the previous compression methods are applied [12, 13].

The size of an SMS is usually less than 140 bytes. Fig. 1 shows the size of more than 10100 short messages sent in English [14]; these messages do not contain any spam or advertisements. More than 50% of the messages are smaller than 50 bytes, more than 70% of them are below 75 bytes, and only 0.03% of the whole set of messages have a size of more than 150 bytes.

Fig. 1 The size and number of sent messages

The size of a message in instant messaging networks like Twitter is way less than 160 bytes. Strings of about 78 characters [15], like the ones sent in instant messaging networks, are named "Tiny-strings" by the authors of this study.

Using compression methods like gzip, bzip2 or PAQ [16, 17] for Tiny-strings is not appropriate: in most cases, the size of the overhead data at the beginning of the message together with the compressed string is more than the size of the non-compressed string [18]. Only the SMAZ algorithm [19] appears to be fit for compression of Tiny-strings [18]. This algorithm is presented using a predefined dictionary. The SMAZ algorithm reduces the size of Tiny-strings of SMS by 29% and the Tiny-strings of Twitter by 19% [18]. This algorithm is introduced and evaluated only for non-Unicode strings [18]. Finding a constant text pattern for the messages of even one person is not possible; people with an accent, for example, apply the common text pattern in their messages for the purpose of concealing their accent, while the same people have their own specific pattern in writing to their family members. The context pattern used in sending a message differs according to the receiver's social status [20].

With respect to Tiny-strings' size, the available methods are not efficient enough; therefore, an attempt is made to introduce an efficient compression method for Tiny-strings. This method might be useful in some previous research in big data analysis [21] and data processing [22, 23].

II. THE PROPOSED METHOD

Assume the existence of the character string S as a stream that includes Cx s in the form S = {(c, x) | x ∈ I, c ∈ char}, where I = {1, 2, 3, …, n} indicates the location of the character and C = {1, 2, 3, …, CwMax} indicates the corresponding character. The goal is to send stream S from point Y to point Z with optimum length. Based on the type of transmitted information, CwMax may vary in value (e.g. for sending a stream of characters subject to the Unicode standard, every character needs 2 bytes of space and CwMax is 65535). Usually, for sending any of the Cis, a space equal to the size of CwMax is
necessary. The maximum number of bits necessary to express CwMax is shown by (CwMax)b, and the size of stream S in bytes, SlengthB, is calculated through (1):

SlengthB = Σ_{k=1..n} Ck × (CwMax)b    (1)

Csupport is considered as the number of bits necessary for displaying the length of (CwMax)b. For example, for sending characters using Unicode, the number of bits necessary to display every (CwMax)b is five, as shown in (2):

(CwMax)b = 16;  Csupport = (bits required to display (CwMax)b) = 5    (2)

Assume that CLBi is the minimum number of bits necessary to display the length of every Ci in bits, with a range in accordance with (3):

1 ≤ CiLB ≤ Csupport    (3)

A. First stage of compression

At the beginning, every new Ci available in stream S is fed into a table named the mapping table, which has only two columns. The first and second columns contain the mapping value and the new C, respectively, as shown in Table I. The mapping value starts from one and increases by one unit in each row. The character string S is checked from beginning to end, and every time a character is observed for the first time it is added to this table. If the new characters are expressed by Cn and the existing ones by Cu, then S can be expressed as S = [Cn, Cn, …, Cu, …].

TABLE I
MAPPING VALUES TABLE

Mapping value | Cn value
1             | Cn
2             | Cn
3             | Cn
...           | ...

If CIndexi is considered as the mapping value of Cui after the first observation, CIndex-LBi can be expressed as the number of bits necessary for expressing the length of CIndexi. For the Cns, the CIndexi value is considered Null in the first observation. This results in four different streams, expressed in (4):

S = [Cn, Cn, …, Cu, …]
SLB = [CLB, CLB, …, CLB, …]
SIndex = [Null, Null, …, CIndex, …]
SIndex-LB = [Null, Null, …, CIndex-LB, …]    (4)

where SLB corresponds to S, provided that CLBi is the length of Cni in bits, and SIndex-LB corresponds to SIndex, provided that CIndex-LBi is the length of CIndexi in bits (for Cni, the CIndex and CIndex-LB values are set to Null in the first observation).

B. Second stage of compression

Based on a specific policy, it is possible to break the string S into smaller strings. The borderlines between the new and repeated characters are recognized as the potential breaking points. The letter A marks the first potential breaking point, as shown in Fig. 2. The potential breaking points are expressed as S = Spn1 † Spu2, Spn3 † Spu4, …. The value of CNew-Count is the number of consecutive new characters in the SLB stream, and CIndex-Count is the number of consecutive mappings in the SIndex-LB stream. Fig. 2 shows how these two values are calculated.

Fig. 2 Calculating the values of CIndex-Count and CNew-Count

For drawing a conclusion regarding the potential breaking point A, (5) is applied:

f1(Spn1, Spu2) = (CnLBMax × (CIndex-Count + CNew-Count + 1)) + Csupport + 1
f2(Spn1, Spu2) = (CnLBMax × (CNew-Count + 1)) + (CIndex-LBMax × (CIndex-Count + 1)) + 2 × (Csupport + 1)    (5)

In case f1(Spn1, Spu2) ≤ f2(Spn1, Spu2), the potential breaking point A is not taken into account, and the two sections of Cn and Cu are considered as Cn; here, the corresponding values of Spu2 are set to Null in the SIndex and SIndex-LB streams. But if f1(Spn1, Spu2) > f2(Spn1, Spu2), the potential breaking point A is kept in stream S. If there is no breakage between two consecutive regions, Spn1, Spu2, and Spn3 form a bigger Spn1, and the decision-making process is carried out for the other potential breaking points to the end of stream S.
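Under the definitions above, the first two stages can be sketched in Python as follows. The tuple layout of the stream, the run bookkeeping, and all function names are simplifying assumptions of this sketch, not the authors' reference implementation:

```python
C_SUPPORT = 5  # per (2): bits needed to express (CwMax)_b = 16

def bit_len(v):
    """Length of a code point or mapping value in bits (CLB / CIndex-LB)."""
    return max(1, v.bit_length())

def map_stream(text):
    """First stage: feed each first-seen character into the two-column
    mapping table; repeats carry their mapping value (CIndex) instead."""
    table = {}    # character -> mapping value, starting from 1
    stream = []   # (char, code, mapping value or None for a new Cn)
    for ch in text:
        if ch not in table:
            table[ch] = len(table) + 1
            stream.append((ch, ord(ch), None))        # a new Cn
        else:
            stream.append((ch, ord(ch), table[ch]))   # a repeated Cu
    return stream, table

def keep_break(new_run, repeat_run):
    """Second stage: keep the potential breaking point between a run of
    new characters and the following run of repeats only when f1 > f2,
    per (5)."""
    cn_lb_max = max(bit_len(code) for _, code, _ in new_run)
    cindex_lb_max = max(bit_len(idx) for _, _, idx in repeat_run)
    f1 = cn_lb_max * (len(repeat_run) + len(new_run) + 1) + C_SUPPORT + 1
    f2 = (cn_lb_max * (len(new_run) + 1)
          + cindex_lb_max * (len(repeat_run) + 1)
          + 2 * (C_SUPPORT + 1))
    return f1 > f2

stream, table = map_stream("salaam")
# 's', 'a', 'l', 'm' are new; the 4th and 5th characters repeat 'a' (value 2).
assert table == {"s": 1, "a": 2, "l": 3, "m": 4}
assert [idx for _, _, idx in stream] == [None, None, None, 2, 2, None]
```

The break test simply compares the bit cost of keeping one merged region of new characters against the cost of sending the runs of new and repeated characters separately, each with its own header.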
C. Third stage of compression

At this stage, S is divided into smaller parts using appropriate breaking points. For more simplicity, only the string SLB-pn is considered, which corresponds to the Spn region in the SLB stream. At first, a maximum variable with the value −∞ is set, and the points at which there is a change in the maximum value are determined from the beginning of the Spn stream. The manner in which the maximum values change is given in (6):

SLB-pn = [CnLBMax1, CnLB, …, CnLB, CnLBMax2, CnLB, …, CnLB, CnLBMax3, …]    (6)

The value of Count is the number of Cns starting from CnLBMax1 up to the one before CnLBMax2, and the value of K is the possible number of consecutive CnLBMax2s; K is at least equal to one. The calculation of these two values is shown in Fig. 3.

Fig. 3 Calculating the values of Count and K for Part0

The first appropriate breaking point is symbolized by P0 in Fig. 3. This point can be assessed in the Spn stream as an appropriate breaking point through (7):

Part0 = ((CuLBMax2 × (K + 1 + Count)) + CwMax + 1) − ((CnLBMax1 × (Count + 1) + (CnLBMax2 × K)) + 2 × (Csupport + 1) + Sum)    (7)

In (7), before making the final decision for the first break, the value of Sum is considered as zero; the manner of assigning a value to Sum will be discussed in due course. The same process is run for the points after CnLBMax2 and CnLBMax3, while the previous region(s) are considered in the calculation of Count. The manner of calculating Count and K for Part1 is presented in Fig. 4.

Fig. 4 Calculating the values of Count and K for Part1

Equation (8) evaluates Part1 for breaking from the appropriate point P1:

Part1 = ((CuLBMax3 × (K + 1 + Count)) + CwMax + 1) − ((CnLBMax2 × (Count + 1) + (CnLBMax3 × K)) + 2 × (Csupport + 1) + Sum)    (8)

By expanding the same method, the Cns can be assessed up to CnLBMaxh. The largest positive value among the h − 1 calculated values is selected and named Partbest. This process is shown in (9):

Partbest = {Max(Parti) > 0 | i = 0, …, h − 1}    (9)

The stream SLB-pn breaks at the point CnLBMaxn which is related to Partbest, and the stream Spn breaks at the corresponding point of Partbest. The manner of this breakage and the two small regions named SSmall1 and SSmall2 are illustrated in Fig. 5, and the value of Sum is changed according to (10):

Fig. 5 SSmall1 and SSmall2 regions

Sum = Partbest − CnLBMaxn    (10)

If all the calculated values for the parts are less than zero, only the second point, which is related to CnLBMaxn, is considered (Fig. 6).

Fig. 6 The negative nature of all part values and selection of the region

Now, starting from the first CnLB value after CnLBMaxh in the SLB-pn stream and again considering the maximum as −∞, the maximums are searched for and the whole third stage is repeated. Here, if all the calculated values for the parts are less than zero, only the second point related to CnLBMaxh is considered and a new region is formed together with the previous region (Fig. 7).
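Transcribing (7)–(9) as stated, the evaluation of the candidate parts and the choice of Partbest can be sketched as follows. The widths, counts, K, and the CwMax term are supplied by the caller here, since their extraction from SLB-pn is described only informally above, and all names are this sketch's own:

```python
C_SUPPORT = 5  # per (2): bits needed to express (CwMax)_b = 16

def part_value(cu_lb_next, cn_lb_cur, cn_lb_next, count, k, cw_max, sum_prev):
    """One Part_i transcribed from the pattern of (7)/(8): the cost of the
    unbroken region minus the cost of breaking at this point."""
    unbroken = cu_lb_next * (k + 1 + count) + cw_max + 1
    broken = (cn_lb_cur * (count + 1) + cn_lb_next * k
              + 2 * (C_SUPPORT + 1) + sum_prev)
    return unbroken - broken

def part_best(parts):
    """Per (9): the largest Part_i when it is positive; None in the
    all-negative case of Fig. 6, where only CnLBMax_n is used."""
    best = max(parts)
    return best if best > 0 else None

# A hypothetical Part_0: widths 2 and 3 bits, Count = 4, K = 1, and a
# small stand-in value (16) for the CwMax term of (7).
p0 = part_value(3, 2, 3, count=4, k=1, cw_max=16, sum_prev=0)
assert p0 == 10
assert part_best([p0, -4]) == 10
assert part_best([-3, -1]) is None
```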
Fig. 7 The negative nature of all part values and selection of the region in subsequent iterations

This whole process is run for the other Spu and Spn regions. When the process is finished, the initial stream is divided in the form of (11):

S = [Spn, SIndex, Spn, …]
Spn = [SSmall1, SSmall2, …]
SIndex-LB = [SISmall-LB1, SISmall-LB2, …]
SSmall = [Cn, Cn, …]
SISmall = [CIndex, CIndex, …]    (11)

It should be considered that the conversion of a region from Cu or CIndex to Cn is possible in all stages, while the opposite does not hold true.

D. Fourth stage of compression

For preparing any SSmall region to be sent, the structure introduced in Fig. 8 is applied.

Fig. 8 The structure of sending an SSmall: bit "0", then MaxLengthB(SSmall) in Csupport bits, then every Cn at the fixed size FixSize(SSmall), then a Null value; the total length is 1 + Csupport + MaxLengthB(SSmall) × (number of Cns in SSmall + 1) bits.

As observed in Fig. 8, the first bit is zero. The value of the next Csupport bits is set equal to the largest length of a Cn available in SSmall in the binary system, presented as MaxLengthB(SSmall) here. Next, every Cn in SSmall is placed at the end of the bits with a fixed size equal to the largest length of a Cn available in SSmall, shown by FixSize(SSmall) in Fig. 8. At last, a Null value is added to the end of the bits with the fixed length of the largest Cn in SSmall.

For sending any of the SISmall regions, the structure introduced in Fig. 9 is applied.

Fig. 9 The structure of sending an SISmall: bit "1", then MaxLengthB(SISmall) in Csupport bits, then every CIndex at the fixed size FixSize(SISmall), then a Null value; the total length is 1 + Csupport + MaxLengthB(SISmall) × (number of CIndex values in SISmall + 1) bits.

As observed in Fig. 9, the first bit is one. The value of the next Csupport bits is set equal to the largest length of the mappings available in SISmall in the binary system. Next, every CIndex in SISmall is set at the end of the bits with a fixed size equal to the largest length of a CIndex available in SISmall. Finally, a Null value is added with the size of the largest length of the mappings available in SISmall.

The same process is run for the other SSmall and SISmall regions available in stream S. At the end, all the bits form a queue, and the resultant bit stream is changed to bytes. These bytes are sent to destination Z as the compressed string of S. In this process, the number of bits should be a multiple of 8; otherwise, a number of zeros are added to the last byte to fill the space of the absent bits. The zeros should be inserted in a manner where the value of the last byte is not lost.

E. The receiver

The stream B in (12) is the bytes received from source Y:

B = [b11, b12, b13, b14, b15, b16, b17, b18, b21, b22, b23, b24, b25, …]    (12)

At this point, a mapping table similar to the one available at the sender is considered, which initially holds no values. If b11 is zero, the value of the next Csupport bits is known as the Length of this region (the Csupport value is known by the sender and receiver as a detail of the data type). Next, the reading of Cns begins from the bit Csupport + 1, with the Length of the same region. Every new Cn absent from the mapping table is added to the table. This process is repeated until a Null value is obtained. Then another bit is read; if its value is equal to zero, the same process is repeated using the new value of Length. If the value of the first bit in a region is one, the previous process is repeated, but a CIndex value is read instead of a Cn; that is, the character was once inserted in the mapping table and is repeated. Thus, by referring to the mapping table, the equivalent Cn of the CIndex is read. The values read directly or indirectly (from the mapping table) form the stream S that was sent.
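Reading Fig. 8 and the receiver description literally, one SSmall region can be serialized and recovered as sketched below. Using an all-zero word as the Null terminator and a Python string of '0'/'1' characters in place of a real bit queue are simplifying assumptions of this sketch:

```python
C_SUPPORT = 5  # bits used for the per-region width field, per (2)

def pack_small(codes):
    """Serialize one SSmall region per Fig. 8: flag bit '0', the fixed
    symbol width MaxLengthB in C_SUPPORT bits, every Cn at that width,
    then an all-zero word standing in for the Null terminator (this
    assumes no real code is 0)."""
    width = max(c.bit_length() for c in codes)
    bits = "0" + format(width, f"0{C_SUPPORT}b")
    for c in codes:
        bits += format(c, f"0{width}b")
    return bits + "0" * width  # Null terminator

def unpack_small(bits, pos=0):
    """Receiver side: read the flag and width, then fixed-width words
    until the all-zero Null; return the codes and the next bit position."""
    assert bits[pos] == "0", "this sketch only handles SSmall regions"
    width = int(bits[pos + 1:pos + 1 + C_SUPPORT], 2)
    pos += 1 + C_SUPPORT
    codes = []
    while True:
        word = int(bits[pos:pos + width], 2)
        pos += width
        if word == 0:          # Null terminator reached
            return codes, pos
        codes.append(word)

region = [ord(c) for c in "سلام"]   # four new Unicode characters
packed = pack_small(region)
assert unpack_small(packed)[0] == region
# Length matches Fig. 8: 1 + C_SUPPORT + width * (number of Cn s + 1).
width = max(c.bit_length() for c in region)
assert len(packed) == 1 + C_SUPPORT + width * (len(region) + 1)
```

An SISmall region would differ only in its flag bit ('1') and in carrying mapping values rather than character codes; padding the final byte and interleaving the regions are left out of this sketch.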
III. EXPERIMENTAL RESULTS

One of the common challenges in assessing text compression methods is the lack of related data sets. In this study, a set of tweets (Tiny-strings) in the Arabic language is used, which contains 2000 Tiny-strings of modern standard Arabic with Jordanian dialect. To transmit these tweets, applying Unicode is mandatory. The sizes of the tweets are shown in Fig. 10 [24]. About 50% of the tweets are less than 100 bytes; only 31% of them have a size above 160 bytes.

Fig. 10 The size of the Arabic tweets and their number

To assess the proposed algorithm, the Huffman algorithm, known as the best compression method for this task [8, 25], is applied here. These assessments are run on a laptop with an Intel(R) Core(TM) i3 CPU M 330 @ 2.13 GHz processor. The percentage of size reduction in Arabic strings using the proposed algorithm and the Huffman method is compared in Fig. 11. The newly proposed algorithm outperforms the Huffman algorithm by 12% in reducing the size of Tiny-strings of the Arabic language. The run times of the two algorithms are compared in Fig. 12; the proposed algorithm is 0.26 seconds slower than the Huffman algorithm.

Fig. 11 Comparison of the proposed algorithm and Huffman method in size reduction of Arabic Tiny-strings

Fig. 12 Comparison of execution time of the proposed algorithm and Huffman method for Arabic Tiny-strings

For assessing the proposed algorithm on compression of Persian Tiny-strings, a data set containing more than 22400 tweets obtained from Persian and international news broadcasting corporations is used [26]. These tweets are selected equally, with no bias, from the most recent posts on Twitter; it must be noted that applying Unicode is mandatory here as well. Fig. 13 shows the sizes of the tweets.

Fig. 13 The size of Persian tweets and their number

The percentage of size reduction of Persian tweets by the proposed algorithm is compared with that of the Huffman method in Fig. 14. The proposed algorithm outperforms the Huffman algorithm by more than 8% in reducing the size of Tiny-strings in the Persian language. The execution times of the two methods are compared in Fig. 15. The findings on both the Persian and Arabic languages are tabulated in Table II.
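The compression ratio column in Table II is simply the compressed size expressed as a percentage of the original size, which a quick check of the tabulated byte counts reproduces:

```python
# Byte counts from Table II: (original size, compressed size)
results = {
    ("Persian", "Huffman"):  (3243550, 2190996),
    ("Persian", "Proposed"): (3243550, 1927285),
    ("Arabic",  "Huffman"):  (265156, 181216),
    ("Arabic",  "Proposed"): (265156, 148005),
}
ratios = {key: round(100 * comp / orig, 2)
          for key, (orig, comp) in results.items()}
assert ratios[("Persian", "Huffman")] == 67.55
assert ratios[("Persian", "Proposed")] == 59.42
assert ratios[("Arabic", "Huffman")] == 68.34
assert ratios[("Arabic", "Proposed")] == 55.82
# Differences: about 8 points for Persian and about 12.5 for Arabic,
# matching the improvements reported in the text.
```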
Fig. 14 Comparison of the proposed algorithm and Huffman method in size reduction of Persian Tiny-strings

Fig. 15 Comparison of execution time of the proposed algorithm and Huffman method for Persian Tiny-strings

TABLE II
SUMMARY OF ALL COMPRESSION RESULTS

Language | Algorithm | Original size (Bytes) | Compressed size (Bytes) | Compression ratio (%) | Run time (s)
Persian  | Huffman   | 3243550               | 2190996                 | 67.55                 | 32.56
Persian  | Proposed  | 3243550               | 1927285                 | 59.42                 | 34.20
Arabic   | Huffman   | 265156                | 181216                  | 68.34                 | 1.92
Arabic   | Proposed  | 265156                | 148005                  | 55.82                 | 2.18

IV. CONCLUSION

An appropriate compression method is proposed here for Unicode Tiny-strings, which are about 78 characters long. This method is fit for compression of short messages and strings sent in instant messaging networks or any similar application. The method is subject to no restriction on the language or the number of languages present in the text. The results obtained from the assessment on Arabic and Persian Tiny-strings indicate that this method outperforms the Huffman algorithm in compression of the strings. Regarding compression of Tiny-strings, applying the proposed algorithm is more efficient for Arabic than for the Persian language. For future studies, assessing the proposed method for compression of other languages or other types of data streams [27] would be advantageous. The algorithm can be compared with other proposed methods [28, 29], and the compatibility of the method can be checked for compression of webpages [30], especially on mobile phones.

ACKNOWLEDGMENT

This research is supported by Petro Palayesh Kian Espadana Company. The authors would also like to greatly thank Dr. Ahmad Baraani-Dastjerdi, Dean of the Computer Engineering Faculty, University of Isfahan, for his invaluable help and comments. Without his persuasion and cooperation, this study would not have begun.

REFERENCES

[1] O. Jalilian, A. T. Haghighat, and A. Rezvanian, "Evaluation of Persian text based on Huffman data compression," 22nd Int. Symp. on Inform., Commun. and Automation Tech., IEEE, Oct. 2009.
[2] J.-E. Lönnqvist and F. große Deters, "Facebook friends, subjective well-being, social support, and personality," Comput. in Human Behavior, vol. 55, pp. 113–120, Feb. 2016.
[3] A. Mohd and T. M. Wong, "Short Messaging System (SMS) Compression on Mobile Phone–SMS Zipper," Jurnal Teknologi, vol. 49, no. 1, Dec. 2008.
[4] D. Luckham, Ed., "Event Processing for Business," Jan. 2012.
[5] S. Sankar and D. S. Nagarajan, "A Comparative Study: Data compression on TANGLISH Natural Language Text," Int. J. of Comput. Applicat. (IJCA), vol. 38, no. 3, pp. 33–37, Jan. 2012.
[6] D. Salomon, "Statistical Methods," Data Compression, 4th ed., California, Springer Reference, 2007, ch. 2, sec. 17, pp. 139–140.
[7] E. Doug, "A survey of Unicode compression," www.unicode.org, Jan. 2004.
[8] Z. M. Alasmer et al., "A Comparison between English and Arabic Text Compression," Contemporary Eng. Sci., vol. 6, no. 3, pp. 111–119, 2013.
[9] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Resonance (Springer), vol. 11, no. 2, pp. 91–99, Feb. 2006.
[10] J. Jiang and S. Jones, "Word-based dynamic algorithms for data compression," IEE Proceedings I: Communications, Speech and Vision, vol. 139, no. 6, p. 582, 1992.
[11] W. R. Azevedo Dias and E. D. Moreno, "Code Compression using Huffman and Dictionary-based Pattern Blocks," IEEE Latin Am. Trans., vol. 13, no. 7, pp. 2314–2321, Jul. 2015.
[12] J. Platoš, V. Snášel, and E. El-Qawasmeh, "Compression of small text files," Advanced Engineering Informatics (Elsevier), vol. 22, no. 3, pp. 410–417, 2008.
[13] P. Prochazka and J. Holub, "Compression of a Set of Files with Natural Language Content," The Computer Journal, vol. 58, no. 5, pp. 1169–1185, Jun. 2014.
[14] NUS SMS Corpus, http://wing.comp.nus.edu.sg/SMSCorpus/ (accessed on 28 January 2016).
[15] C. Tagg, "Research Into Text Messaging," A corpus linguistics study of SMS text messaging, Birmingham, England, UBIRA (University of Birmingham Research Archive), March 2009, ch. 2, sec. 3, pp. 20–21.
[16] M. V. Mahoney, "Adaptive Weighing of Context Models for Lossless Data Compression," Florida Institute of Technology CS Department, Technical Report CS-2005-16, 2005.
[17] A. S. Taylor and J. Vincent, "An SMS history," Mobile World: Past, Present and Future, California, Springer Science & Business Media, 2005, ch. 1, sec. 4, pp. 75–91.
[18] P. Gardner-Stephen et al., "Improving Compression of Short Messages," International Journal of Communications, Network and System Sciences, vol. 6, no. 12, pp. 497–504, 2013.
[19] S. Sanfilippo, "SMAZ—Compression for Very Small Strings," 2009, https://github.com/antirez/smaz (accessed on 29 January 2016).
[20] R. Ling, "The length of text messages and use of predictive texting: Who uses it and how much do they have to say?", Computer-mediated communication, Washington, DC, 2007.
[21] X. Li, M. Makkie et al., "Scalable Fast Rank-1 Dictionary Learning for fMRI Big Data Analysis," Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), 2016.
[22] M. Sedigh Fazli, K. Keshavarzi, and S. Setayeshi, "Designing a Hybrid Neuro-Fuzzy System for Classifying the Complex Data, Application on Cornea Transplant," Proceedings of the International Conference on Artificial Intelligence (ICAI), The Steering Committee of the World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), 2013.
[23] M. S. Fazli and J.-F. Lebraty, "A comparative study on forecasting polyester chips prices for 15 days, using different hybrid intelligent systems," The 2013 International Joint Conference on Neural Networks (IJCNN), Aug. 2013.
[24] Twitter Data set for Arabic Sentiment Analysis Data Set, http://archive.ics.uci.edu/ml/datasets/Twitter+Data+set+for+Arabic+Sentiment+Analysis# (accessed on 28 January 2016).
[25] E. M. Abu Jrai, "Hybrid Technique for Arabic Text Compression," M.A. thesis, Faculty of Information Technology, Middle East University, Amman, June 2013.
[26] View all tweets from any Twitter user on one page, https://www.allmytweets.net: BBC Persian (@bbcpersian), Young Journalists Club (@YJCPersian), VOA Persian (@VOAIran), Fars News Agency (@farsnews), Deutsche Welle Persian (@dw_persian), Radio Farda (@RadioFarda), Shargh Daily (@SharghDaily) (accessed on 6 February 2016).
[27] C. C. Aggarwal, "An Introduction to Data Streams," Data Streams: Models and Algorithms, Yorktown Heights, NY, Springer, 2007, ch. 1, sec. 1, pp. 1–7.
[28] P. Deutsch, "GZIP file format specification version 4.3," May 1996.
[29] M. Burrows and D. J. Wheeler, "A Block-sorting Lossless Data Compression Algorithm," Systems Research Center (SRC), May 1994.
[30] P. Skibinski, "Improving HTML Compression," IEEE Data Compression Conference 2008, UT, p. 545, Mar. 2008.

Masoud Abedi received his B.S. degree in Computer Engineering in 2012 from the Islamic Azad University of Isfahan (Khorasgan), Isfahan, Iran, and his M.S. degree in Computer Software Engineering in 2015 from Sheikh Bahaie University (SHBU), Isfahan, Iran. His research interests include algorithms, multi-objective optimization algorithms, simulations, and compression.
*Corresponding author at: Rayan Tahlil Sepahan Co., Isfahan Science and Technology Town (ISTT), Isfahan, Iran (Islamic Republic of). Email: [email protected]
Abbas Malekpour is currently an Assistant Professor in the Institute of Distributed High Performance Computing at the University of Rostock. He received his Master's degree from Stuttgart University and his Ph.D. degree from the University of Rostock, Germany. From 2002 to 2004 he was with the Institute of Telematics research group at the University of Karlsruhe, Germany, and from 2004 to 2010 he was a research assistant in the MICON Electronics and Telecommunications Research Institute at the University of Rostock, Germany. His current research interests include the areas of Mobile and Concurrent Multi-path Communication prototyping. Email: [email protected]

Peter Luksch received his Ph.D. degree in Parallel Discrete Event Simulation on Distributed Memory Multiprocessors from the Technische Universität München, Germany. He finished his postdoctoral qualification (Habilitation) lecture on Increased Productivity in Computational Prototyping with the Help of Parallel and Distributed Computing. Currently, Prof. Dr. rer. nat. habil. Peter Luksch is head of the Department of Distributed High Performance Computing at the University of Rostock, Germany. Email: [email protected]

Mohammad Reza Mojtabaei was born in 1967 in Isfahan, Iran. He got his Master's degree in Machine Intelligence & Robotics from Shiraz University, Shiraz, Iran, in 1995. He is currently a PhD candidate at the University of Isfahan, Isfahan, Iran. He is also working as an Instructor in the computer group of the Technical & Vocational University, Mohajer College, Isfahan, Iran. His research interests include pattern recognition, image processing, programming languages, data structures, and algorithm design.