Research in Computing Science
Series Editorial Board

Editors-in-Chief: Juan Humberto Sossa Azuela (Mexico), Gerhard Ritter (USA), Jean Serra (France), Ulises Cortés (Spain)

Associate Editors: Jesús Angulo (France), Jihad El-Sana (Israel), Jesús Figueroa (Mexico), Alexander Gelbukh (Russia), Ioannis Kakadiaris (USA), Serguei Levachkine (Russia), Petros Maragos (Greece), Julian Padget (UK), Mateo Valero (Spain)

Editorial Coordination: Blanca Miranda Valencia

Formatting:
Nicolás Alonzo Gutiérrez Omar Hernández Olivares M. Sindy Serrano Mendoza A. Janet Loaiza García M. Félix Hernández Araiza J. Marcos Muñoz Dávila
Lucía Muñoz Dávila Laura Muñoz León Daniel Vélez Flores
Research in Computing Science is a quarterly publication with international circulation, published by the Center for Computing Research of IPN to disseminate advances in scientific research and technological development from the international scientific community. Volume 60, October 2012. Printing: 500 copies. Certificate of Reservation of Rights to the Exclusive Use of the Title No. 04-2004-062613250000-102, issued by the Instituto Nacional de Derecho de Autor. Certificate of Lawfulness of Title No. 12897 and Certificate of Lawfulness of Content No. 10470, issued by the Comisión Calificadora de Publicaciones y Revistas Ilustradas. The authors are solely responsible for the contents of their articles. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior permission of the Center for Computing Research, except for personal or study use with explicit citation on the first page of each document. Printed in Mexico City, October 2012, in the IPN Graphic Workshop, Publication Office, Tres Guerras 27, Centro Histórico, México, D.F. Distributed by the Centro de Investigación en Computación, Av. Juan de Dios Bátiz S/N, Esq. Av. Miguel Othón de Mendizábal, Col. Nueva Industrial Vallejo, C.P. 07738, México, D.F. Tel. 57 29 60 00, ext. 56571. Responsible Editor: Juan Humberto Sossa Azuela, RFC SOAJ560723.
Volume 60
ISSN: 1870-4069
Copyright © 2012 Instituto Politécnico Nacional
Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC)
Av. Juan de Dios Bátiz s/n esq. M. Othón de Mendizábal, Unidad Profesional “Adolfo López Mateos”, Zacatenco, 07738, México D.F., México
http://www.ipn.mx
http://www.cic.ipn.mx
Indexed in LATINDEX and PERIODICA
Printing: 500
Printed in Mexico
Contents

La robótica pedagógica como herramienta didáctica en tópicos de Inteligencia Artificial. Caso de estudio: aprendizaje por refuerzo (p. 1)
  Oscar Alonso-Ramirez, Héctor-Xavier Limón-Riaño, Ángel-Juan Sánchez-García, Rafael Zamudio-Reyez, Héctor-Gabriel Acosta-Mesa

La Ingeniería de Sistemas como guía en la concepción de Software Educativo para niños Hipoacúsicos (p. 11)
  Citlalih Gutiérrez Estrada, Sergio Díaz Zagal, Rafael Cruz Reyes, Guadalupe Macedo Miranda, Julio César González, Claude Baron

Diseño y desarrollo de un sistema vibrotáctil utilizado en un robot planar para la rehabilitación de las extremidades superiores (p. 21)
  Guadalupe Salas-López, Oscar Sandoval-González, Ignacio Herrera-Aguilar, Paolo Tripicchio, Blanca González-Sánchez, Otniel Portillo-Rodríguez, Adriana Vilchis-González

Modelo de revision de creencias para determinar clientes potenciales (p. 31)
  Meliza Contreras-González, Gerardo Mendoza-Castillo, Pedro Bello-López, Miguel Rodríguez-Hernández

Variant of Huffman coding and decoding using word frequencies to improve the compression ratio (p. 41)
  Ángel-Juan Sánchez-García, Homero-Vladimir Ríos-Figueroa

Cálculo del Periodo de Rotación Solar utilizando un Telescopio de Bajo Costo (p. 47)
  Rafael Lemuz, Yolanda Zamora, Jessica N. López

Diseño e implementación de una metodología para el desarrollo de software para la administración de procedimientos de calidad (p. 55)
  Ma. Luisa Alcántara-Muñoz, J. Juan Hernández-Mora, Yesenia N. González Meneses, Blanca Estela Pedroza Méndez, María Guadalupe Medina B.

Comparación de algoritmos utilizados por el sistema planificador de rutas turísticas para el Estado de Puebla (p. 65)
  Raymundo Montiel Lira, Esther Ortega Mejía, Rosa María Rosas Vazquez

Desarrollo de objetos de aprendizaje para motivar el aprendizaje de las matemáticas en niños de primer grado de primaria usando un sensor kinect (p. 75)
  Omar Carreño Sánchez, Edmundo Bonilla Huerta, José Juan Hernández Mora, Yesenia Nohemí González Meneses, Valentin Morales Martínez
Variant of Huffman coding and decoding using word frequencies to improve the compression ratio
Ángel-Juan Sánchez-García and Homero-Vladimir Ríos-Figueroa
Universidad Veracruzana, Xalapa, Ver., México
[email protected],
[email protected]
Abstract. A variant of Huffman coding is presented. Instead of using only the frequency of individual letters, it also uses the frequency of whole words (strings of characters), thus minimizing the average number of bits per word. As a test case, we show the compression of a text file containing the Bible.
Keywords: Compression, codeword, binary tree, code.
1 Introduction
As a variable-length coding technique, Huffman coding [4-5] is one of the leading methods used in applications that deal with data compression. The method is based on the frequency of occurrence of the symbols of an alphabet. To encode a text composed of characters from an alphabet, each character is assigned a sequence of bits called a "codeword". The idea is to assign shorter codewords to the most frequent characters and longer codewords to the less common ones. This assignment of codewords is represented by a binary tree.

The Huffman method is divided into two main parts [4]: 1) determination of the codeword lengths, and 2) the encoding procedure and assignment of codewords. In the first part, the binary tree is built from the number of occurrences of each character of the alphabet in the text. The codeword of a character is the sequence of binary labels on the path from the root to that character's leaf node. The algorithm to construct the tree is as follows.

1. Initialize n single-node trees and label them with the characters of the alphabet.
2. Write the frequency of each character in the root of its tree to indicate the weight of the tree.
REPEAT
  1. Find the two trees with the lowest weights.
  2. Make them the left and right subtrees of a new tree.
  3. Write the sum of their weights in the root of the new tree as its weight.
UNTIL all the trees have been merged into a single tree.

The alternative proposed in this paper concerns the first part of the Huffman method.

© J. C. Hernández-Hernández, J. F. Ramírez Cruz, A. Cortés Fernández, J. H. Sossa Azuela (Eds.). Advances in Intelligent and Information Technologies. Research in Computing Science 60, 2012, pp. 41-45.
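The tree construction described in the Introduction can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the names `Node` and `huffman_codes`, the use of a binary heap to find the two lightest trees, and the tie-breaking counter are all assumptions made for the example. The symbols may be characters or whole words, since the procedure only depends on their weights.

```python
import heapq
from itertools import count


class Node:
    """Internal tree node (hypothetical helper); leaves are the symbols themselves."""

    def __init__(self, left, right):
        self.left, self.right = left, right


def huffman_codes(frequencies):
    """Build a {symbol: bit string} table from a {symbol: weight} table."""
    tiebreak = count()  # unique counter so equal weights never compare the trees
    # Steps 1-2: one single-node tree per symbol, weighted by its frequency.
    heap = [(w, next(tiebreak), sym) for sym, w in frequencies.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:  # a single symbol still needs a one-bit codeword
        return {heap[0][2]: "0"}
    # REPEAT ... UNTIL: merge the two lightest trees until one tree remains.
    while len(heap) > 1:
        w_left, _, left = heapq.heappop(heap)
        w_right, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w_left + w_right, next(tiebreak), Node(left, right)))
    # Read each codeword off the path from the root to the leaf.
    codes, stack = {}, [(heap[0][2], "")]
    while stack:
        node, path = stack.pop()
        if isinstance(node, Node):
            stack.append((node.left, path + "0"))
            stack.append((node.right, path + "1"))
        else:
            codes[node] = path
    return codes


weights = {"a": 5, "b": 2, "r": 2, "c": 1, "d": 1}  # letter counts of "abracadabra"
codes = huffman_codes(weights)
total_bits = sum(len(codes[s]) * w for s, w in weights.items())
print(total_bits)  # 23
```

The exact codewords depend on how ties between equal weights are broken, but the total encoded length does not, which is why the example prints only the total.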
The second part of the Huffman method may differ from one technique to another, and this may severely affect the decoding effectiveness [2-3]. A common property of every Huffman encoding is that no codeword is a prefix of another, so each codeword can be recognized in the bit stream without any guard bits attached.

Section 2 reviews improvements to the Huffman method. Section 3 describes the variant that we propose to increase the compression ratio. Section 4 presents and discusses the results obtained. Finally, Section 5 presents our conclusions.
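The property that codewords can be recognized in the bit stream without guard bits can be illustrated with a short Python sketch. The code table below is a hypothetical toy example, not one taken from the paper, and the function names `is_prefix_free` and `decode` are likewise assumptions made for illustration.

```python
def is_prefix_free(codes):
    """Return True if no codeword is a prefix of (or equal to) another."""
    words = sorted(codes.values())
    # In sorted order, a codeword that prefixes another one is always
    # immediately followed by a codeword that starts with it.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


def decode(bits, codes):
    """Decode a string of '0'/'1' characters using a prefix-free code table."""
    inverse = {codeword: symbol for symbol, codeword in codes.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # a complete codeword has been read
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bit stream ends in the middle of a codeword")
    return symbols


toy_codes = {"a": "0", "b": "10", "c": "11"}  # hypothetical code table
print(is_prefix_free(toy_codes))   # True
print(decode("010110", toy_codes))  # ['a', 'b', 'c', 'a']
```

Because no codeword is the beginning of another, the decoder can emit a symbol as soon as the bits read so far match a codeword, with no separators between codewords.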
2 Background
Over time, improvements have been made to different parts of the method to obtain better results. Some of them are the following. Benes [1] presented the architecture and design of a high-performance asynchronous Huffman decoder for compressed-code embedded processors. Chowdhury [2] presented a new data structure for Huffman coding in which, in addition to sending the symbols in order of their appearance in the Huffman tree, one needs to send the codes of all round leaf nodes, the number of which is always bounded above by half the number of symbols. Connell [3] described a variable-word-length minimum-redundancy code that reduces transmission time, storage space, and encoding and decoding time. Hashemian [4] introduced a new version of Huffman encoding and decoding that deals mainly with an encoding/decoding procedure that avoids the complete construction of the Huffman table.
3 Description of the proposed alternative assignment of codewords

In the original method, each character is encoded as a binary string, so each word in the text is encoded as the concatenation of the codewords of its characters, and the encoded length of a word is the sum of the lengths of those codewords. Changing what the frequency search counts therefore suggests alternatives for increasing the compression ratio. One alternative that shortens the resulting code is to encode each whole word as a single symbol.

Words are separated by white space or line breaks, so a word attached to a punctuation mark would be counted as a different word, which introduces redundancy. For example, if the text contains the sentence "How are you?", the algorithm separates the tokens "How", "are" and "you?". If the word "you" appears elsewhere in the text, the frequency search would treat "you" and "you?" as different words. We therefore propose that, during the frequency search, every word carrying punctuation marks (question marks, exclamation points, quotes, etc.) be split into the word and its punctuation marks. The proposed algorithm for finding the frequencies of words and punctuation marks is shown below.
Read the first word.
WHILE EOF has not been reached DO
  IF the word contains punctuation THEN
    Separate each punctuation mark from the word
    Increase the frequency of each punctuation mark found in the word
  END IF
  Increase the frequency of the remaining word
  Read the next word
END WHILE
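The frequency search above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `token_frequencies` and the set of characters treated as punctuation (ASCII punctuation plus a few Spanish and typographic marks) are choices made for this example. As a simplification, a punctuation mark in the middle of a token is also removed, so the remaining parts of the token are joined into one word.

```python
import string
from collections import Counter


def token_frequencies(text, punctuation=string.punctuation + "¿¡«»“”"):
    """Count words and punctuation marks as separate symbols."""
    frequencies = Counter()
    for token in text.split():  # tokens are delimited by spaces or line breaks
        word = []
        for ch in token:
            if ch in punctuation:
                frequencies[ch] += 1  # each punctuation mark is its own symbol
            else:
                word.append(ch)
        if word:
            frequencies["".join(word)] += 1  # the word without its punctuation
    return frequencies


freq = token_frequencies("How are you? I hope you are well.")
print(freq["you"], freq["?"], freq["you?"])  # 2 1 0
```

With this counting, "you" and "you?" contribute to the same entry, which is the redundancy the section sets out to remove. The resulting table can then be used as the weights for the tree construction.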
4 Discussion of results
The proposed algorithm was implemented for the assignment of codewords. The Bible in Spanish was used as the input text file to compress. The original file was saved in both Unicode and ANSI formats, and each version was compressed. The size of the original file is 7.77 MB in Unicode format and 3.88 MB in ANSI format. Table 1 shows the original file sizes, the sizes of the generated files in which each bit is written as a character (one or zero), and the sizes of the generated binary files. Table 2 compares our coding against the compression software WinRAR. Table 3 shows the compression ratios.

Table 1. Comparison of the results obtained with the traditional method and the proposed method.
Format   File         Character compression   Proposed word compression
Unicode  Original     7.77 MB                 7.77 MB
Unicode  Strings 0/1  17.3 MB                 7.94 MB
Unicode  Binary       2.16 MB                 0.99 MB
ANSI     Original     3.88 MB                 3.88 MB
ANSI     Strings 0/1  16.3 MB                 7.94 MB
ANSI     Binary       2.04 MB                 0.99 MB
Table 2. Comparison of compression with the WinRAR software against the proposed method.

Format   WinRAR   Proposed word compression
Unicode  1.19 MB  0.99 MB
ANSI     1.04 MB  0.99 MB
Table 3. Compression ratio results.

Format   Size of the original file  Size of the file compressed by words  Expected compression ratio  Compression ratio
Unicode  7.77 MB                    0.99 MB                               0.853                       0.873
ANSI     3.88 MB                    0.99 MB                               0.854                       0.745
The following expressions are used to calculate the expected compression ratio:

1. $x = \sum_i \text{bit\_length}_i \cdot \text{probability}_i$
2. $\text{Compression\_ratio} = (\text{maximum\_length} - x) / \text{maximum\_length}$

In both formats the maximum length was 72 bits (the codeword of the least frequent word); the value of x was 10.55 for the Unicode format and 10.50 for the ANSI format. The file contains 789,265 words, of which 29,701 are distinct (the leaf nodes). The constructed tree therefore has 59,401 nodes (2n-1 nodes for n leaves).

For example, for the word "BIBLIA" (upper case), the codeword with our method is "101110101001000010" (18 characters). With the traditional method the characters are encoded as:

B: 10101110011
I: 1010010011
B: 10101110011
L: 0111111010
I: 1010010011
A: 101001000

BIBLIA: "1010111001110100100111010111001101111110101010010011101001000" (61 characters).

Another example, one of the most common words, is "luz". With our method the codeword is "01001101111" (11 characters), and with the traditional method:

l: 11011
u: 10001
z: 110001111

luz: "1101110001110001111" (19 characters).
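The ratios reported in Table 3 can be checked with a few lines of Python. The sketch assumes that the expected ratio is (maximum_length - x) / maximum_length and that the measured ratio is one minus the compressed size divided by the original size; the second definition is not stated in the text and is inferred here because it reproduces the tabulated values. The function names are illustrative.

```python
def average_code_length(code_lengths, probabilities):
    """x: sum over all symbols of bit_length * probability."""
    return sum(code_lengths[s] * probabilities[s] for s in code_lengths)


def expected_compression_ratio(x, maximum_length):
    """Expected saving relative to using maximum_length bits for every word."""
    return (maximum_length - x) / maximum_length


def measured_compression_ratio(original_size, compressed_size):
    """Fraction of the original file size saved by compression (assumed definition)."""
    return 1 - compressed_size / original_size


print(round(expected_compression_ratio(10.55, 72), 3))   # 0.853 (Unicode)
print(round(expected_compression_ratio(10.50, 72), 3))   # 0.854 (ANSI)
print(round(measured_compression_ratio(7.77, 0.99), 3))  # 0.873 (Unicode)
print(round(measured_compression_ratio(3.88, 0.99), 3))  # 0.745 (ANSI)
print(2 * 29701 - 1)                                     # 59401 tree nodes
```

The measured ratio differs between the two formats only because the original files differ in size, while the compressed file is 0.99 MB in both cases.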
5 Conclusions
With the proposed alternative implemented, the compressed file was smaller than the one produced by well-known compression software (WinRAR). This is because shorter code lengths are obtained for each word. As future work, a faster way of performing the calculations could be sought, so that the speed becomes comparable to that of current compression software. In the original version the list of alphabet characters is known in advance, whereas in this case, since there may be many combinations of characters (many different words), the frequency list has to be built while the algorithm runs.

References

1. Benes, M., Nowick, S., Wolfe, A.: A Fast Asynchronous Huffman Decoder for Compressed-Code Embedded Processors. 0-8186-8392-9/98. (1998).
2. Chowdhury, R., Kaykobad, M.: An efficient decoding technique for Huffman codes. (2005).
3. Connell, B.: A Huffman-Shannon-Fano code. Proceedings of the IEEE. (1973).
4. Hashemian, R.: Direct Huffman coding and decoding using the table of code-lengths. Proceedings of the International Conference on Information Technology: Computers and Communications (ITCC.03), IEEE. (2003).
5. Huffman, D.: A method for the construction of minimum-redundancy codes. Proc. IRE, Vol. 40, pp. 1098-1101. (1952).
6. Vitter, J.: Design and analysis of dynamic Huffman codes. Journal of the ACM, 34(4):825-845. (1987).