IEEE – INTERNATIONAL CONFERENCE ON ADVANCES IN ENGINEERING, SCIENCE AND MANAGEMENT


A Framework for Retrieving Information by Using the Local Language on Internet Protocol Television

Chithrangi Kaushalya Kumarasinghe, Sumudu Lasitha Wicramasinghe, Pasan Hewamadduma, Nilanka Devthilina and Saminda Premaratne
Department of Information Technology, Faculty of Information Technology, University of Moratuwa, Katubadda, Moratuwa, Sri Lanka
Email: [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract: Existing Internet Protocol Television (IPTV) platforms use non-Unicode character encodings, so characters from many regional languages are not displayed correctly. We present a new Unicode-based localization framework for IPTV. We concentrate mainly on online news reading over IPTV, but the framework is extendable to other purposes. The language we focus on is Sinhala, which is used by most Sri Lankans. With simple modifications on the server, the set of tools can be extended to other Unicode languages. We convert all the Unicode characters into images dynamically on the server side and send them to the client IPTV together with the other graphical content. This provides clear, readable and correct textual news dynamically, without unknown and ambiguous character sequences.

Index Terms: IPTV, localization, Unicode, language translation, HTML parser.

I. INTRODUCTION

Nowadays the television has become an important piece of equipment in day-to-day life. With the advancement of new technologies, the television has also been enriched with new functionality. Internet Protocol Television (IPTV) is one of the latest advancements; it brings the functionality of an Internet-connected computer to the television. The major drawback of the IPTV infrastructure is that it does not support the Unicode character encoding system. Hence existing IPTV applications are restricted to the English language, which prevents them from becoming popular among the local community. This research therefore aims to develop a localization framework for IPTV that supports Unicode, so that information can be retrieved in the local language. Our work focuses mainly on developing a news portal for IPTV that displays online newspapers on the television in the Sinhala language. Moreover, this work can be extended to any other language supported by the Unicode character encoding system, provided the relevant font file is available.

Further, the IPTV infrastructure does not support Flash, iFrames, or web pages larger than 250 Kb. In this research those limitations are successfully overcome. As a future enhancement, we hope to integrate a news search mechanism into the application with the aid of a virtual keyboard, allowing the user to perform a full-text search on news articles.

II. OVERVIEW OF IPTV TECHNOLOGY

IPTV can be described as a combination of traditional TV with the Internet: a system capable of delivering (receiving and displaying) a video stream using the architecture and networking methods of the Internet Protocol Suite over a packet-switched network infrastructure [1]. The IP-based platform offers significant advantages, including the ability to integrate television with other IP-based services. IPTV is distinguished from general Internet-based or web-based multimedia services by its ongoing standardization process and by preferential deployment scenarios in subscriber-based telecommunications networks with high-speed access channels into end-user premises via set-top boxes (STB) or other customer premises equipment. An STB is similar to a microcomputer and runs a micro Linux OS.

An IPTV system first gathers content from a variety of sources: stored media, network feeds, communication links and live studio sources. The media are then converted into a form that can be managed and distributed. Next, the asset management system stores the media, moves them and sends them out at scheduled times. The distribution system then transfers the content simultaneously to the users connected to the IPTV system. Finally, users view the programs on televisions connected through an STB that converts the stream, on multimedia computers, or on IP televisions (data-only televisions).

IEEE – ICAESM 2012


III. LIMITATIONS AND CONSTRAINTS

A. Limitations of the IPTV Environment

The restricted capabilities of the hardware and software used in the IPTV infrastructure limit the performance of existing IPTV-based applications. These limitations arise mainly from the hardware and software of the STB on the client side. Because of hardware limitations such as processing power and memory, the maximum size of a web page that can be loaded at a time is 250 Kb, and web pages should be designed to render within 5-7 seconds. Because of software limitations, Flash, ActionScript and several other image and animation formats are not supported. Moreover, certain tags defined in XHTML are not supported by the IPTV technology. The system also has problems with character rendering and display, as it supports only the ASCII character encoding. The only font available for displaying characters is Arial. Therefore all textual content on IPTV has to be presented in English.

B. Limitations of the Sinhala Unicode System

There are several issues and deficiencies with the Sinhala symbols in the current version (Version 6.0) [5] of the Unicode character encoding system [3] [2].

1) Ambiguity caused by “kombuwa”, which is encoded after its consonant: The Unicode system encodes every vowel modifier after the consonant that the vowel modifies. For example, in the word “name” the vowel “a” appears after the consonant “n”. In Sinhala, however, although the symbol “kombuwa” is a vowel sign, it must appear before the base (consonant) character, on its left-hand side, and not after it [1, 2]. This conflicts with the convention used for other languages and has to be handled specifically when introducing Sinhala to the IPTV environment. In Sinhala there are six situations in which “kombuwa” is applied either as a single vowel modifier or as part of a two- or three-part vowel modifier. Figure 1 shows the six situations, the correct usage of the vowel, and the resulting character combination according to the Unicode character encoding system.

2) Problem with the Zero Width Joiner (zwj) and Zero Width Non Joiner (zwnj): As in many Indic languages, the shape of a vowel sign or glyph in Sinhala varies depending on its position and its base character. Two characters, the Zero Width Joiner (zwj) and the Zero Width Non Joiner (zwnj), are responsible for controlling this behavior. In Unicode, the Zero Width Non Joiner prevents the joining of neighboring characters that would otherwise be joined, while the Zero Width Joiner joins neighboring characters that would otherwise not be joined. However, neither character resides within the Sinhala Unicode range. Therefore some software does not recognize them as Sinhala codes and hence does not process them appropriately.

Fig. 1. Ambiguity of “kombuwa” in the current Unicode Encoding System

3) Lack of encodings for the “yansaya”, “rakaransaya” and “rephaya”: The current Unicode version does not include symbols for “yansaya”, “rakaransaya” and “rephaya”. Alternative characters are used in place of these symbols.

4) Lack of support for multi-part vowel modifiers: When multiple vowel modifiers are applied to a consonant, the Unicode system does not provide any guidance on the correct order and shape of the glyphs.

5) Deficiency of guidance on the encoding of non-standard letters: In Sinhala, the shape of some vowel signs differs according to the base letter being modified. Examples of such signs are “paapilla” and “al lakuna”. The current Unicode version provides only one shape for each such sign, which produces inappropriate vowel and consonant combinations in localized systems.

IV. NEWS PORTAL FOR IPTV

The news portal for IPTV is built around the concept of offering local news in the local language while overcoming the technological deficiencies of the IPTV infrastructure described above. The aim is therefore to develop an interactive system that lets IPTV users read news content in Sinhala, while contributing to the development of a localization framework for the IPTV environment. From the IPTV user's point of view, reading news articles from the portal should offer something that watching IPTV news channels does not. First, the portal is a single source for many types of news. Second, it saves the user's time, since there is no need to sit through long news programs on news channels. The portal also recreates the traditional experience of reading a newspaper in digital form. Browsing by news type and published date makes it easier to find news content as well. From a technology point of view, the research is also important for the IPTV environment because this is the first application introduced to IPTV technology in the Sinhala language.
The methods used to overcome the limitations and constraints of the IPTV infrastructure and the Sinhala Unicode range open new paths for localizing IPTV applications.

V. OVERVIEW OF IPTV NEWS PORTAL

When designing a news portal for IPTV under the above limitations and constraints, we had to find a mechanism that extracts only the relevant content from news web sites, filtering out unwanted content such as advertisements and sponsor links, and converts it into a format that the IPTV environment can display on the television. Our solution is to extract the relevant content of the news web sites by using HTML parsers, convert the text into images dynamically, and store the images in a suitable location with a unique identifier so that they can be displayed at the end user's request. Special care is taken during the text-to-image conversion to overcome the issues and limitations of Sinhala Unicode.
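As an illustration of the "unique identifier" idea, the following is a minimal Python sketch of a naming scheme built from the news category and the creation timestamp, as Section VI-D later describes. The function name `article_image_path`, the root folder, the year/month/day folder layout and the `.png` extension are all assumptions made for this example; the paper does not specify them.

```python
from datetime import datetime
from pathlib import PurePosixPath


def article_image_path(root, category, created):
    """Build a storage path for one article image (hypothetical layout)."""
    # Second-level resolution: names are unique only if at most one image
    # per category is created within any given second.
    stamp = created.strftime("%Y%m%d_%H%M%S")
    # Assumed folder hierarchy: <root>/<category>/<year>/<month>/<day>/
    folder = PurePosixPath(root) / category / created.strftime("%Y/%m/%d")
    return folder / f"{category}_{stamp}.png"


path = article_image_path("news_images", "sports", datetime(2012, 3, 30, 14, 15, 0))
print(path)
```

Because the timestamp has one-second resolution, two images of the same category created within the same second would receive the same name, which is why the scheme depends on the one-image-per-second assumption.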

Fig. 2. High level architecture of the system

VI. SYSTEM ARCHITECTURE

The system consists of two major parts.
• Front-end, which provides the graphical user interface with interactive menus through which the end user interacts with the system
• Back-end, which holds the news extraction, text filtering, formatting, text-to-image conversion and storing mechanisms

Figure 2 shows the high-level architecture of the system. The front-end enables end users to browse the news articles stored in the database through interactive menus organized by newspaper, category and published date. Further, by keeping track of the news articles requested by each user, the system can display news under the most viewed categories on the application's main menu. The back-end holds the novelty of the application, since the concepts that make it possible to display Sinhala on IPTV are implemented there. The back-end script runs periodically to extract news content from the news web sites, convert the news items into images, assign a unique identifier to each article image, store the images in the proper location of a folder hierarchy, and record the location of each image in the central database. Figure 3 shows the major components of the back-end.

A. News Extraction Mechanism

HTML parsers are a good mechanism for extracting information from a large number of web sites on the Internet without human involvement [4]. This module is capable of extracting the news category, regenerating the URL of each news article, and extracting the title and content of each news item. From the admin control panel, the administrator of the system can specify which HTML tag is to be parsed to retrieve the content of the article.
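The tag-driven extraction described above can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the paper does not name its parser library, and the class `TagTextExtractor`, the function `extract_article`, and the idea of narrowing the configured tag by a CSS class are hypothetical choices made for this example.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text inside every <tag> element (optionally with a class)."""

    def __init__(self, tag, css_class=None):
        super().__init__()
        self.tag = tag              # tag chosen by the administrator
        self.css_class = css_class  # hypothetical extra filter
        self.depth = 0              # > 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1     # nested element with the same tag name
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class is None or self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_article(html, tag, css_class=None):
    """Return the whitespace-normalized text of the configured element(s)."""
    parser = TagTextExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


page = '<div class="ad">Buy now</div><div class="story"><p>Hello</p> <p>world</p></div>'
print(extract_article(page, "div", "story"))
```

Text outside the configured element, such as the advertisement block in the sample page, is never collected, which mirrors the filtering of unwanted content that the portal needs.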

Fig. 3. Overall functionality of the back-end

B. Text Filtering and Formatting Mechanism

The aim of this module is to pre-process the Sinhala text before it is converted into images by the next module. The pre-processing removes unwanted characters and rearranges the character sequence produced by the HTML parsers, in order to minimize the erroneous Unicode character combinations caused by the limitations of the Sinhala Unicode range [6].

• Unnecessary tabs are removed from the character sequence by using regular expressions, since they appear as unrecognized characters after the text-to-image conversion, as shown in figure 4.


Fig. 4. Unwanted white spaces

Fig. 5. Unrecognized Zero Width Joiner





• The characters representing the Zero Width Joiner (zwj) and the Zero Width Non Joiner (zwnj) are removed from the character sequence by using regular expressions, since the text-to-image conversion technology is unable to recognize them and convert them to images correctly, as shown in figure 5.

• To overcome the problem with “kombuwa”, the Unicode character sequence of the text produced by the HTML parsers is searched for occurrences of “kombuwa” and of the other five multi-part vowel modifiers that contain “kombuwa”. Figure 1 illustrates the six situations. The problem is addressed in two different ways, depending on the number of parts in the modifier. If “kombuwa” appears as a single-part vowel modifier, it is swapped with the previous character in the character sequence (cases 1 and 3). If “kombuwa” appears as part of a multi-part vowel modifier, the modifier is removed from the sequence and an extra character representing “kombuwa” is added before the character that preceded the modifier, that is, before the consonant. The characters representing the other parts of the multi-part vowel modifier are then inserted after that consonant (cases 2, 4, 5 and 6).

The result of the text formatting is displayed in figure 6.
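The filtering and reordering steps of this module can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the code points come from the Unicode Sinhala block (U+0DD9 for “kombuwa”, U+0DDA to U+0DDE for the compound signs that contain it), while the choice of which signs count as single-part, the visual-order pieces each multi-part sign is split into, and the replacement of tab runs by a single space are assumptions, because Figure 1 and the exact regular expressions are not reproduced in the text.

```python
import re

# Zero Width Non Joiner (U+200C) and Zero Width Joiner (U+200D).
ZERO_WIDTH = re.compile("[\u200c\u200d]")
TABS = re.compile(r"\t+")

KOMBUWA = "\u0dd9"
# Assumption: signs handled as single-part and simply swapped with the consonant.
SINGLE_PART = {KOMBUWA, "\u0ddb"}
# Assumption: each multi-part sign mapped to the parts drawn after the consonant.
MULTI_PART = {
    "\u0dda": "\u0dca",        # kombuwa + al-lakuna
    "\u0ddc": "\u0dcf",        # kombuwa + aela-pilla
    "\u0ddd": "\u0dcf\u0dca",  # kombuwa + aela-pilla + al-lakuna
    "\u0dde": "\u0ddf",        # kombuwa + gayanukitta
}
# A Sinhala consonant (U+0D9A..U+0DC6) followed by one of the six signs.
CONSONANT_SIGN = re.compile("([\u0d9a-\u0dc6])([\u0dd9-\u0dde])")


def clean(text):
    """Drop zero-width (non-)joiners; turn each run of tabs into one space."""
    text = TABS.sub(" ", text)  # assumption: keep one space so words stay apart
    return ZERO_WIDTH.sub("", text)


def relocate_kombuwa(text):
    """Move the kombuwa in front of its consonant (visual order)."""
    def to_visual_order(match):
        consonant, sign = match.group(1), match.group(2)
        if sign in SINGLE_PART:
            return sign + consonant
        return KOMBUWA + consonant + MULTI_PART[sign]
    return CONSONANT_SIGN.sub(to_visual_order, text)


def preprocess(text):
    return relocate_kombuwa(clean(text))
```

The output is in visual (left-to-right drawing) order rather than valid logical Unicode order, so it is only suitable as input to a renderer that draws code points one after another without complex script shaping.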


C. Text-to-Image Conversion

To overcome limitations of the IPTV infrastructure such as the lack of support for fonts and HTML tags and the 250 Kb limit on the size of a page loaded to the television at a time, the text is converted into images, which are stored with a unique identifier at the back-end. The text-to-image conversion is implemented with the support of PHP's GD library.

D. News Storing and Managing Mechanism

The images of the articles are saved in a folder hierarchy under a unique name that indicates the news category and the date and time at which the image was created, on the assumption that no more than one image is created within one second. The location of each image is inserted into the database so that article images can be returned at the request of the end user.

VII. RESULT

To analyze the accuracy of the system, a data set of 50 articles of nearly 200 words each was analyzed. The number of erroneous character combinations in the original and output documents is shown in figure 7. As depicted in the graph, the incorrect location of “kombuwa”, which is responsible for the highest error rate, has been completely eliminated. The errors due to unrecognized characters have also been reduced dramatically. However, the current system does not provide any solution for the problems with “yansaya”, “rakaransaya” and “rephaya”. A further analysis was conducted to examine the accuracy of using HTML parsers to extract news content from the Internet, and the system shows 81 percent accuracy. This accuracy could be increased further by embedding some form of intelligence in the parsing mechanism.

VIII. CONCLUSION

In this paper we present a new localization framework with Unicode characters for Internet Protocol Television (IPTV). Since existing IPTV has many technological limitations, we had to display the textual information in the local language as images.

Fig. 6. The ambiguity is removed by relocating the “kombuwa”

Fig. 7. Erroneous character combinations of the original and output documents:
1) number of errors due to unrecognized characters
2) number of errors due to ambiguous “kombuwa”
3) number of errors due to wrong use of “al kirima”
4) number of errors due to “Pa-pilla”
5) number of errors due to “rakaransaya”
6) number of errors due to “yansaya”


Here, we concentrated mainly on online news reading using IPTV, but the framework is extendable to other purposes as well. The language we focused on is Sinhala, which is used by most Sri Lankans. The news content is extracted from online newspaper web sites via HTML parsers and converted to images dynamically, to be displayed on the television together with the other graphical content. The text is pre-processed to reduce the erroneous Sinhala Unicode character combinations. With some slight modification, the set of tools can be extended to other Unicode languages. The results show that the system provides clear, readable and correct textual news dynamically, with a minimum of unknown and ambiguous character sequences. Owing to the deficiencies in the current Unicode version, some alternative character combinations with a close articulation have been used, even though they deviate from the rules of the Sinhala language.

IX. FUTURE WORKS

A few experiments are currently under way to add new features to the application for the benefit of the user. Since IPTV is becoming increasingly popular in the local market, it is necessary that the integrated applications provide full functionality with the available resources. The main research work being carried out at present is to extend the search mechanism with the aid of a virtual keyboard. At present, news items are selected with the remote controller through drop-down menus. With the intended work, the user will be able to enter a keyword on which he or she wants the latest news. The application will then search for the keyword among the available news resources and display the results to the user. More importantly, the user will be able to search free text, since a free-text search mechanism is being researched.
ACKNOWLEDGMENT

We would like to thank Sri Lanka Telecom, the local IPTV service provider, for its great support.

REFERENCES

[1] H. Benoit, Satellite, Cable, Terrestrial, IPTV, Mobile TV in the DVB Framework, volume 1, pages 185–195. Oxford-Elsevier, 3rd edition, 2008.
[2] Gihan Dias. Challenges of enabling IT in the Sinhala language. In 27th Internationalization and Unicode Conference, 2005.
[3] Gihan Dias and Aruni Goonetilleke. Development of standards for Sinhala computing. In 1st Regional Conference on ICT and E-Paradigms, 2004.
[4] Jinzhu Hu, Xing Zhou, Jiangbo Shu, and Chunxiu Xiong. Research of self-adaptive web page parser based on templates and rules. In International Conference on Management and Service Science, pages 1–4, 2009.
[5] Unicode Incorporated. The Unicode Standard 6.0 [Online], December 2011. http://www.unicode.org/standard/standard.html.
[6] Penn State University. Sinhala Unicode entities [Online], December 2011. http://tlt.its.psu.edu/suggestions/international/bylanguage/sinhalachart.html.
