International Review on Computers and Software (IRECOS)

Contents

Automatic Processing of Amazighe Verbal Morphology: a New Approach for Analysis and Formalization, by Fatima Zahra Nejme, Siham Boualknadel, Driss Aboutajdine ... 448
Energy-Efficient Data Reporting Strategy (EDRS) for Multilayer Clustering WSN, by O. F. Mohammed, B. Hussin, A. S. H. Basari ... 458
Effective Clustering of Text Documents in Low Dimension Space Using Semantic Association Among Terms, by N. Sivaram Prasad, K. Rajasekhara Rao ... 467
Scrapple: a Flexible Framework to Develop Semi-Automatic Web Scrapers, by Alex Mathew, Harish Balakrishnan, Saravanan P. ... 475
Fast and Efficient Indexing and Similarity Searching in 2D/3D Image Databases, by Y. Hanyf, H. Silkan, H. Labani ... 481
Testing Patterns in Action: Designing a Test-Pattern-Based Suite, by Bouchaib Falah, Mohammed Akour, Nissrine El Marchoum ... 489
Edema and Nodule Pathological Voice Identification by SVM Classifier on Speech Signal, by Asma Belhaj, Aicha Bouzid, Noureddine Ellouze ... 495
Hybrid Learning Model and Acoustic Approach to Spoken Language Identification Using Machine Learning, by R. Madana Mohana, A. Rama Mohan Reddy ... 502
Spatio-Temporal Wavelet Based Video Compression: a Simulink Implementation for Acceleration, by I. Charfi, M. Atri ... 513
Generating Graphical User Interfaces Based on Model Driven Engineering, by S. Roubi, M. Erramdani, S. Mbarki ... 520
Requirement Scheduling in Software Release Planning Using Revamped Integer Linear Programming (RILP) Model, by Sandhia Valsala, Anil R. Nair ... 529
Errata corrige ... 535
International Review on Computers and Software (I.RE.CO.S.), Vol. 10, N. 5 ISSN 1828-6003 May 2015
Automatic Processing of Amazighe Verbal Morphology: a New Approach for Analysis and Formalization

Fatima Zahra Nejme, Siham Boualknadel, Driss Aboutajdine

Abstract – Due to its rich morphology and its highly complex word-formation process based on roots and patterns, Amazighe morphology poses special challenges to Natural Language Processing (NLP) systems. In this paper we present the architecture and implementation details of the lexicon and the morphological descriptions used to build a verb morphological analyzer. Our contribution consists of two main components: firstly, a linguistically motivated tool, based on the concept of patterns, which predicts the inflectional forms of verbal entries; secondly, a set of rules covering the orthographical constraints and grammaticalization phenomena observed during processing. Our analyzer exploits the efficiency and flexibility of finite state machines for modeling, using the NooJ finite state tools. Copyright © 2015 Praise Worthy Prize S.r.l. - All rights reserved.
Keywords: Amazighe Language, Natural Language Processing, Finite State Methods, NooJ, Inflectional Morphology, Verbal Morphology
I. Introduction

Automatic morphological analysis is a widely-used technique for the development of NLP systems and linguistically-annotated corpora [1]-[35]. Particularly for less-resourced languages like Amazighe, morphological analyzers are indispensable tools for both NLP researchers and corpus linguists. The Amazighe language is one of the oldest languages of humanity. However, for many decades it was considered one of the endangered languages of West Africa. Even in Morocco, where it is the mother tongue of approximately half of the population, the language has long been neglected: it remained essentially oral and was reserved for family and informal domains [1]. Hence, although there have been several linguistic studies, resources supporting computational treatment of the language, including corpora and electronic dictionaries, remain scarce relative to those available for many other languages. The general unavailability of such resources, as well as the lack of orthographic standardization characteristic of mostly spoken languages, has been an obstacle to research in the computational processing of Amazighe and to the development of automatic tools. Fortunately, interest in the language has increased in the last decade, especially since the creation of IRCAM1 (the Royal Institute of the Amazighe Culture) in 2001, with the objective of promoting the language at all levels [6]. As a result, the status of Amazighe has progressively changed, first to institutional and then to official status beside Arabic.

These changes have strengthened the possibility of promoting the Amazighe language and of introducing it into the public domain, including administration, media and the educational2 system, in collaboration with the ministries. Nevertheless, these steps are not sufficient for a less-resourced language such as Amazighe to join the well-resourced ones in the area of information technology and to make it accessible to all through electronic resources and automatic processing tools. Therefore, a set of scientific and linguistic research efforts has been undertaken to remedy the current situation. These efforts can be divided into two categories: (1) computational resources, which include optical character recognition (OCR) [19] and Amazighe corpora [23], [32]; (2) NLP tools, which remain limited and cover a light stemmer [3], a search engine [4], a concordancer [33], a verb conjugator [5], named entity recognition [28], [29] and morphological analyzers [9]-[14]. This paper presents the continuation of our previous efforts, which were restricted to noun processing, and describes a step toward the development of a full-fledged verbal morphological analyzer. Given the lack of available resources and their limitations, we chose to follow a rule-based approach and to rely on hand-constructed linguistic rules in developing our tool. Today, the state of the art for writing morphologies is to use special-purpose morphology languages based on finite-state technology.

1 Institution responsible for the preservation of heritage and the promotion of the Moroccan Amazighe culture and its development (see http://www.ircam.ma/).
2 It has become common practice to find Amazighe taught in various Moroccan schools as a subject.
The most well-known among these is NooJ [24], [25]. In terms of parsing efficiency, NooJ, a self-contained corpus analysis and comprehensive linguistic development tool, is extremely efficient compared with other3 NLP parsers. As of today, NooJ can process a dozen languages, including some Romance, Germanic, Slavic, Semitic and Asian languages, as well as Hungarian. NooJ is described in this paper as the tool used for the development of our components, which form part of the whole analyzer for Amazighe. The remainder of this paper is organized as follows: the first section presents a brief overview of the Moroccan Amazighe language. The second section exposes and discusses the verbal inflectional morphology. The third presents the processing performed by our morphological analyzer system. The fourth section covers the processing of the morphological constraints and the grammaticalization rules. The fifth gives an overview of the evaluation results, while the last section draws some conclusions and suggests future directions for our approach.
II. The Amazighe Language

II.1. Historical Background

The Amazighe language, also known as Berber or Tamazight (ⵜⴰⵎⴰⵣⵉⵖⵜ [tamaziɣt]), belongs to the African branch of the Afro-Asiatic language family, also referred to as Hamito-Semitic in the literature [15], [30]. Geographically, it is spread over "North Africa which extends from the Siwa Oasis in Egypt in the east to Senegal in the west and from Algeria in the north to Mali in the south" [2], and was also found in the Canary Islands. Nevertheless, even though the Amazighe language is still spoken in several North African nations, it is not recognized as an official language in any of these countries except Morocco. In linguistic terms, the Moroccan Amazighe language is characterized by a proliferation of dialects due to historical, geographical and sociolinguistic factors. One may distinguish three main regional varieties: Tarifit in the North, Tamazight in Central Morocco and the South-East, and Tachelhite in the South-West and the High Atlas. Since ancient times, it has been the mother tongue of approximately half of the population. However, for many decades it was only oral and exclusively reserved for family and informal domains [1]. With the creation of the Royal Institute of Amazighe Culture (IRCAM) in 2001 and the constitutional update of July 2011, the status of Amazighe has progressively changed.

The Amazighe language has its own script, called Tifinaghe, which was adapted by IRCAM in 2003 to provide an adequate and usable standard alphabetic system called Tifinaghe-IRCAM. This system contains: 27 consonants, including the labials (ⴼ, ⴱ, ⵎ), the dentals (ⵜ, ⴷ, ⵟ, ⴹ, ⵏ, ⵔ, ⵕ, ⵍ), the alveolars (ⵙ, ⵣ, ⵚ, ⵥ), the palatals (ⵛ, ⵊ), the velars (ⴽ, ⴳ), the labiovelars (ⴽⵯ, ⴳⵯ), the uvulars (ⵇ, ⵅ, ⵖ), the pharyngeals (ⵃ, ⵄ) and the laryngeal (ⵀ); 2 semi-consonants: ⵢ and ⵡ; 4 vowels: three full vowels ⴰ, ⵉ, ⵓ and the neutral vowel (or schwa) ⴻ, which has a rather special status in Amazighe phonology. Today, the Amazighe language is at a pivotal point: it holds official status4 beside Arabic, its morphological and lexical standardization process is still underway, and it represents the model taught in most schools and used in the media and in official documents.

II.2. Moroccan Amazighe Morphology: A Brief Overview

The Amazighe language is characterized by a complex and productive morphology, whose basic word-formation mechanism is root-and-pattern. It is rich in terms of its inflections, derivations and the complexities they produce. Like the Semitic languages, Amazighe morphology has a multi-tiered structure and applies non-concatenative morphotactics. Words are originally formed through the amalgamation of roots and patterns, as shown in Fig. 1. A root is a sequence of one or more consonants, and the pattern is a template of vowels (V) with slots into which the consonants (C) of the root are inserted. This process of insertion is called interdigitation. The resulting lemmas then pass through a series of affixations (to express morpho-syntactic features) and/or clitic attachments (such as pronouns: in ⵜⵉⵏⵉⴷ [tinid] "you say", the personal pronoun markers ⵜ--ⴷ [t--d] are affixed to the verb ⵉⵏⵉ [ini] "say") until they finally appear as surface forms. For example, the common noun ⴰⴳⵍⵍⴰ [aglla] "flank" is built from the root ⴳⵍ [gl] "collapse" by following a definite pattern (see Fig. 1) ⴰ12'2ⴰ "a12'2a", where the number 1 is replaced by the first consonant of the root, 2' by the duplicated copy of the second consonant of the root and 2 by the second consonant of the root. Likewise, the derived verb ⵜⵜⵢⴰⴳⴰⵍ [ttyagal] "be suspended" is built from the same root ⴳⵍ [gl] by following the pattern (see Fig. 1) ⵜⵜⵢⴰ1ⴰ2 "ttya1a2", where 1 is replaced by the first consonant of the root and 2 by the second one. To cover Amazighe morphology, four main lexical categories need to be studied: nouns, verbs, pronouns and function words, which include adverbs, prepositions, etc. [20], [6].

3 NooJ's parsers are extremely efficient compared with other NLP parsers. Compared with INTEX, GATE (Hamish Cunningham, 2002) or XFST (which have very efficient finite state parsers), NooJ's finite-state graphs are compiled dynamically during parsing, instead of being compiled (determinized and minimized) before parsing.
4 Despite the fact that the Amazighe language is still spoken in several nations, it is not admitted as one of the official languages in any country except Morocco.
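Returning to the root-and-pattern mechanism of Section II.2, the following minimal Python sketch is purely illustrative and is not part of the analyzer (which is built with NooJ resources); it shows the interdigitation step, filling the numbered slots of a pattern with the consonants of a root, for the two examples given above.

```python
# Illustrative sketch of interdigitation: fill the numbered slots of a
# pattern with the consonants of a root. A digit n stands for the n-th
# root consonant; n' marks the geminated copy of that consonant.
def interdigitate(root, pattern):
    out, i = "", 0
    while i < len(pattern):
        ch = pattern[i]
        if ch.isdigit():
            out += root[int(ch) - 1]          # slot n -> n-th root consonant
            if i + 1 < len(pattern) and pattern[i + 1] == "'":
                i += 1                        # consume the gemination mark
        else:
            out += ch                         # pattern material (vowels, affixes) is copied as is
        i += 1
    return out

print(interdigitate("ⴳⵍ", "ⴰ12'2ⴰ"))      # ⴰⴳⵍⵍⴰ [aglla] "flank"
print(interdigitate("ⴳⵍ", "ⵜⵜⵢⴰ1ⴰ2"))     # ⵜⵜⵢⴰⴳⴰⵍ [ttyagal] "be suspended"
```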
Fig. 1. Example of the word formation process following the root ⴳⵍ [gl] "collapse"

Practically speaking, nouns and verbs are the basis of Amazighe morphology and the most important categories to focus on, as the others can be derived from them. The scope of this work is verbal morphology; it is the subject matter of the rest of this section.

II.2.1. Amazighe Verbal Morphology

Amazighe verbs are single graphic words which constitute the base of Amazighe morphology because (1) they represent a remarkably rich and wide morphological class and (2) other categories can be derived from them. Verbs are classified according to the number of consonants of their lexical root: there are monoliteral verbs, biliteral ones, triliteral ones, etc. A verb occurs in two forms: basic and derived. The basic form (radical) is formed through the amalgamation of a root and a pattern (Root: ⴳⵍ [gl], Pattern: aCC, Radical: ⴰⴳⵍ [agl] "suspend"). The derived form is obtained by combining a basic verb with one of the following derivational morphemes: ⵙ/ⵙⵙ [s/ss] indicates the factitive form (ⵏⵢ [ny] "get on", ⵙⵏⵢ [sny] "bring up"), ⵜⵜ [tt] marks the passive form (ⴳⵔ [gr] "throw away", ⵜⵜⵓⴳⵔ [ttugr] "be thrown"), and ⵎ/ⵎⵎ [m/mm] designates the reciprocal (ⴽⴽⵙ [kks] "remove", ⵎⵢⵓⴽⴽⴰⵙ [myukkas] "remove mutually"). The verb, whether basic or derived, inflects in four aspects, namely aorist, perfective, negative perfective and imperfective, marked by vocalic alternations, prefixation or consonant gemination/degemination. Moreover, it displays three moods (indicative, imperative and participial), where in each mood the same personal markers are used (cf. Table II). The indicative and the participial moods are based on the four aspects, while the imperative mood has two forms, simple and intensive, based respectively on the aorist and the imperfective aspects [6]. In general, the form of an Amazighe verb can be inflected as shown in Table I and described using the template of Fig. 2. Inflection is generally considered to be more productive than derivation.

Fig. 2. Example of a template describing an Amazighe verbal form for the first singular person

Thus, in the present study, the whole question is carefully discussed with a focus on verbal inflection, i.e. how the inflectional aspects are expressed and generated.

III. Amazighe Verbal Inflectional Morphology and Related Issues

The Amazighe language, similarly to Arabic and other Semitic languages, displays a non-concatenative verbal inflectional morphology. The most influential analysis of Amazighe verbal morphology holds that each stem (i.e. overt form) consists of a basic lexical root made of an ordered set of one to five consonants, and of some paradigm associated with grammatical aspect [21], [22], [34]. The phenomenon has been the focus of a large body of research in Amazighe linguistics. At least six verbal stems, with asymmetrical distributions across Berber, are distinguished by morphological alternations. Three occur universally: the aorist, the perfective and the imperfective. Two additional stems, the negative perfective and the negative imperfective, occur in negative contexts (cf. II.2.1). Beyond the determination of aspects, there has been much debate about the formation process of tenses. The following subsection reviews these formation processes.

III.1. Related Work

Amazighe verbal morphology has already been addressed in several previous studies. The first line of work concentrates on particular dialects, such as the Ait Attab dialect [27], Imdlawn Tashlhit [7], [8] and the Ait Ayache dialect [35], and holds that the morphological rules governing the formation of the different verb forms take as a basis the lexical entry of the verb, which carries two types of information: a CV template and melodies. However, these studies are insufficient because they each deal with a single region within a geographical border. The second line of work relies on the study of the various conjugation structures inferred from the three major Moroccan Amazighe dialects [17], [31].
TABLE I
THE INFLECTIONAL FORMS OF THE VERB ⴰⴳⵍ [AGL] "SUSPEND" IN THE THREE MOODS FOR THE 2ND PERSON MASCULINE PLURAL

Moods         Aorist                      Imperfective                    Perfective       Negative Perfective
Indicative    ⴰⴳⵍⵖ [aglɣ]                 ⵜⵜⴰⴳⵍⵖ [ttaglɣ]                 ⵓⴳⵍⵖ [uglɣ]      ⵓⴳⵉⵍⵖ [ugilɣ]
Participial   ⵢⴰⴳⵍⵏ [yagln]               ⵉⵜⵜⴰⴳⵍⵏ [ittagln]               ⵢⵓⴳⵍⵏ [yugln]    ⵢⵓⴳⵉⵍⵏ [yugiln]
Imperative    ⴰⴳⵍ [agl] (simple)          ⵜⵜⴰⴳⵍ [ttagl] (intensive)       -                -
TABLE II
PERSONAL MARKERS FOR THE INDICATIVE, IMPERATIVE AND PARTICIPIAL MOODS

                         Indicative (Masc. / Fem.)    Imperative (Masc. / Fem.)    Participial (Masc./Fem.)
Singular   1st pers.     ...ⵖ / ...ⵖ                  -                            ⵉ…ⵏ
           2nd pers.     ⵜ…ⴷ / ⵜ…ⴷ                    ...Ø / ...Ø                  ⵉ…ⵏ
           3rd pers.     ⵉ… / ⵜ…                      -                            ⵉ…ⵏ
Plural     1st pers.     ⵏ… / ⵏ…                      -                            …ⵏⵉⵏ
           2nd pers.     ⵜ…ⵎ / ⵜ…ⵎⵜ                   …ⴰⵜ/ⵜ/ⵎ / …ⴰⵎⵜ/ⵎⵜ            …ⵏⵉⵏ
           3rd pers.     …ⵏ / …ⵏⵜ                     -                            …ⵏⵉⵏ

El Gholb [17], given the extent of the Amazighe language over a huge geography, chose some representative dialects of the three major ones on a sporadic basis in order to give an overview of all the relevant changes. Based on this result, he presented a draft in which he adopts a classification by verbal type (monoliteral, biliteral, triliteral, etc.), but limited only to the conjugation of simple, underived verbs of the monoliteral and biliteral types with the structures /ccv/, /c'c'v5/, /vcc/, /vc'c'/, /vcv/, /cvc/ and /vc/. Laabdelaoui et al. [31] adopt a class-based approach: the verbs are arranged into 31 classes along the aorist/perfective and aorist/imperfective conjugation oppositions. In the first 30 classes, independently of the morphophonological alternations, all verbs belonging to a specific class are modeled by the same morphotactic rules to obtain either the perfective or the imperfective forms, whereas the last class contains a set of 10 verbs that behave differently. Under these classification criteria, an Amazighe verb and its derived forms do not necessarily belong to the same class, since they may not use the same morphotactic rules to be conjugated. The class-based approach provides a straightforward way of describing a large number of verbs in a compact and generalized way, but it fails to predict the class of a new verb (one not in the list, or a derived form) or how its forms are morphologically generated. Also, given the nature of Amazighe morphology, it requires many classes (31) and, when dealing with regional varieties, may assign different conjugation classes to the same verb. To this end, we have defined our own paradigm structure, based on the pattern, which provides a compact way to define the formation process. Our goal is to provide generalizations which can be of use in understanding the nature and development of the Amazighe aspect system.
III.2. Our Approach

Our approach investigates the mechanism responsible for predicting the conjugation of Amazighe verbs in each aspect. We propose that the inflection of the verbal aspects is based on the pattern, and in this context we have developed a set of rules that generalize the inflection model of each pattern. In line with our goal, and with the aim of representing the three Moroccan varieties (Tarifit, Tamazight, Tashelhit), we adopted as the basis of our work a set of 3676 attested and standardized verb lemmas from [31]. This cross-dialectal perspective has several advantages, the main one being that it contributes to a clearer description of the system and highlights the characteristics commonly shared by the different dialects, so as to present the variations that occur. Starting from this basic list, and in order to simplify the presentation and account for the different verbal bases, we chose a classification according to the verbal type (monoliteral, biliteral, etc.), whether the verb contains a vowel or not, and whether it contains a geminate radical or not. Then we extracted the rules for each pattern of each type. Fig. 3 shows the overall architecture of our approach.
5 We use "C'" in the pattern presentation when a consonant is reduplicated.
Fig. 3. Verbal inflection architecture
Our approach is based on a hierarchical structure: the first phase determines, for each entry, the verbal type; the second determines the vowel degree (zero vowels or full vowels) together with the pattern (CV); the third, based on these two, determines the changes to be applied in order to generate the inflected forms for each aspect. To illustrate this proposition, consider the lexical entry ⴰⴳⵍ [agl] "suspend", which corresponds to the biliteral type with the pattern "aCC" (full vowels). For this template, the inflected forms are generated as in Fig. 4. Based on this classification, a set of paradigms was carefully developed to cover the six verbal types and also to handle the verbs whose inflection is exceptional and requires a particular study. As a result, we obtained a set of 553 general rules: 329 for regular verbs and 224 for exceptional ones. Table III describes these rules in more detail.
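To make the three phases above concrete, the following is a minimal sketch, written in Python and not part of the original system (which is implemented with NooJ graphs), of how a verb entry might be classified by type and CV pattern before the pattern-specific rules are applied; the example rule for the "aCC" pattern reproduces the indicative forms of Table I for ⴰⴳⵍ [agl].

```python
# Hypothetical sketch: classify an Amazighe verb by type and CV pattern,
# then apply a pattern-specific rule (here only for the "aCC" pattern).
VOWELS = set("ⴰⵉⵓⴻ")          # a, i, u and the neutral vowel e

def cv_pattern(verb):
    """Return the CV skeleton of a Tifinaghe verb, e.g. ⴰⴳⵍ [agl] -> 'aCC'."""
    return "".join("a" if ch == "ⴰ" else
                   "i" if ch == "ⵉ" else
                   "u" if ch == "ⵓ" else
                   "e" if ch == "ⴻ" else "C" for ch in verb)

def verbal_type(verb):
    """Monoliteral, biliteral, ... according to the number of root consonants."""
    n = sum(1 for ch in verb if ch not in VOWELS)
    return {1: "monoliteral", 2: "biliteral", 3: "triliteral"}.get(n, f"{n}-literal")

def inflect_aCC(verb):
    """Illustrative rules for the 'aCC' pattern (full vowels), following the
    indicative forms of Table I for ⴰⴳⵍ [agl] 'suspend'."""
    root = verb[1:]                              # drop the initial vowel ⴰ
    return {
        "aorist":              verb + "ⵖ",                               # ⴰⴳⵍⵖ [aglɣ]
        "imperfective":        "ⵜⵜ" + verb + "ⵖ",                        # ⵜⵜⴰⴳⵍⵖ [ttaglɣ]
        "perfective":          "ⵓ" + root + "ⵖ",                         # ⵓⴳⵍⵖ [uglɣ]
        "negative perfective": "ⵓ" + root[:-1] + "ⵉ" + root[-1] + "ⵖ",   # ⵓⴳⵉⵍⵖ [ugilɣ]
    }

verb = "ⴰⴳⵍ"                                      # [agl] "suspend"
print(verbal_type(verb), cv_pattern(verb))        # biliteral aCC
print(inflect_aCC(verb))
```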
IV. The Processing of Verbal Inflections: Overall Design, Implementation and Evaluation

Our main agenda is to develop a highly flexible verbal Amazighe morphological analyzer consisting of two major parts: (1) the construction of our Amazighe verb lexicon "VAmLex" (Verbal Amazighe Lexicon), and (2) the formalization of the inflectional morphology raised by our approach. Given the non-concatenative nature of Amazighe morphology, it was important to carefully select an appropriate approach to handling the morpho-phonological processes. Finite state technology is considered the preferred model for representing the phonology and morphology of natural languages; therefore, we have selected this technology for our work.

Fig. 4. Inflected forms of the lexical entry ⴰⴳⵍ [agl]

IV.1. NLP Approaches to Morphology: Brief State-of-the-Art

During the last 25 years the finite-state approach has been the most fruitful one in the field of computational morphology. Finite state morphology aims at handling morphology within the computational power of finite state automata. This approach is especially attractive for dealing with human language morphologies: it can handle concatenative and non-concatenative morphotactics, and it offers high speed and efficiency in handling large lexicon automata, whose derivations and inflections can run into millions of paths. Although there is now a variety of finite state tools, the development or adaptation of existing tools to facilitate the creation, annotation and description of Amazighe corpora remains a major challenge. For morphological analysis, several tools and frameworks have been used. Well-known tools include:
XFST: the Xerox Finite-State Tool, one of the most sophisticated tools for constructing finite state language processing applications [18], developed at XRCE by Kenneth R. Beesley and Lauri Karttunen [16]. It is "based on solid and innovative finite-state technology", designed for multi-purpose use with explicit support for automata-theoretic research. The XFST toolkit provides powerful and elegant linguistic descriptions and treatment of irregularities via different operators, at a high level of abstraction, such as restriction, replacement and left-to-right longest match replacement.
GATE: the General Architecture for Text Engineering, a development environment that provides a rich set of graphical interactive tools for the creation, measurement and maintenance of software components for processing human language.
TABLE III
DESCRIPTION OF OUR RULES

                   Zero-vowel patterns                      Full-vowel patterns
                   Regulars           Exceptions            Regulars           Exceptions
Types              Entries   Rules    Entries   Rules       Entries   Rules    Entries   Rules
Monoliterals           5       2          2       2             59      16         9       9
Biliterals           117       5         22      10            570      73        72      44
Triliterals          766       6         64      15            555      88       150      73
Quadriliterals       285       5         63      16            680      90       100      42
Quinquiliterals       21       3         13       5            119      35        10       8
Six literals           1       1          -       -              7       5         -       -
Total               1195      22        164      48           1990     307       341     176
LFG: Lexical Functional Grammar, a linguistic framework which, apart from being interesting from the theoretical linguistic perspective, has over the years proven instrumental in the development of computational linguistic models and Natural Language Processing tools. The grammar architecture of LFG, with its strong lexicon component and multiple levels of representation, seems especially suited to cross-lingual grammar induction tasks.
HPSG: Head-driven Phrase Structure Grammar, an attractive tool for capturing complex linguistic constructs. HPSG is very suitable for NLP as it integrates all the essential linguistic layers of NLP (phonology, morphology, syntax, semantics, context, etc.).
Although these technologies present a number of advantages, they also present several disadvantages, the most common one being that they provide a single formalism (a powerful grammar) that is supposed to describe all linguistic phenomena. NooJ, by contrast, is a linguistic development environment that provides linguists with one or more formal tools specifically designed to facilitate the description of each linguistic phenomenon, as well as parsing tools designed to be as computationally efficient as possible. Furthermore, it allows linguists to combine in one unified framework finite-state descriptions such as those of XFST, GPSG, LFG and HPSG. This makes NooJ an ideal tool for parsing complex constructions that involve phenomena across all linguistic levels, and allows NooJ's parsers to be extremely efficient compared with other NLP parsers.

IV.2. NooJ: Linguistic Development Framework

NooJ, released in 2002 by Max Silberztein [24]-[26], is a freeware language-engineering development environment which runs on different operating systems such as Windows, Linux, Solaris and Mac OS X, and provides a set of tools and methodologies for formalizing languages and developing Natural Language Processing (NLP) applications. It offers a package of finite state tools that integrates a broad spectrum of computational technology, from finite state automata to augmented/recursive transition networks. This package allows building and managing large-coverage electronic dictionaries and formal grammars in order to formalize the different linguistic phenomena, such as spelling, morphology (inflectional and derivational), vocabulary (simple words, compound words and frozen expressions), syntax (local, structural and transformational), disambiguation, semantics and ontology, and these resources can be applied to process texts and large corpora. For each of these levels, NooJ provides linguists with one or more formal frameworks specifically designed to facilitate the description of each phenomenon, as well as parsing, development and debugging tools designed to be as computationally efficient as possible, from finite-state to Turing machines. This approach distinguishes NooJ from other computational linguistic frameworks that provide a single formalism supposed to cover all linguistic phenomena. As of today, NooJ can process a dozen languages, including Arabic [44], a Semitic language like Amazighe. One of the important and useful features of NooJ, regarding morphologically rich languages such as Amazighe, is its simple description of morphological phenomena and its efficient morphological processing. This technology is extremely attractive and allows generating and analyzing several thousands of words per second. The NooJ lexical module used throughout this paper relies on operators performing transformations inside strings, and on morphological graphs describing grammatical rules for morphological analysis. Transformations inside strings are based on generic predefined commands, such as operators that position the cursor (|) at the beginning or at the end of the form, move the cursor one character to the right or to the left, and delete the last or the current character.

IV.3. NooJ and Amazighe Analysis: Overall Design and Implementation

Our main agenda is to build the verbal morphological analyzer as a combination of several finite state transducers using the NooJ framework.
In order to do this, and given that the linguistic resources required by the morphological analyzer include a lexicon and inflection rules for all paradigms, we started by building a verbal morphological Amazighe lexicon.

IV.3.1. VAmLex: Verb Amazighe Lexicon

The most basic and yet most needed step in the morphological analysis of any language is the development of a morphological lexicon. To this end we created, in a first step and using the robust NooJ dictionary module, our verbal Amazighe lexicon "VAmLex" ("Verb Amazighe Lexicon"). The lexical entries were developed from the Amazighe conjugation manual [ⴰⴷⵍⵉⵙ ⵏ ⵓⵙⴼⵜⵉ ⵏ ⵜⵎⴰⵣⵉⵖⵜ - adlis n usfti n tmaziṛt] [31] and also from the new grammar of Amazighe [6]. Our main lexicon currently contains 3693 entries (3183 regular6 verbs and 510 irregular7 ones), represented in the second person, singular, masculine, imperative mood. Each lexical entry carries the following details: the lemma, the lexical category, the type, a semantic feature and the translation into French and Arabic. Furthermore, each entry is linked to its inflection rule, invoked through the property "+FLX=", which carries the inflectional information.

6 Verbs whose aorist is formally identical to the perfective.
7 Verbs whose aorist is formally different from the perfective.

IV.3.2. Rules Formalization

To complete the morphological description of our lexicon, we formalized, in a second step, the inflectional paradigms that allow the automated generation of all inflected forms. Relying on the rules presented in the Amazighe conjugation manual [31] and following our approach, we formalized the verbal inflectional rules. We thus created, through hand-encoded graphs integrated in the NooJ linguistic development platform, a set of inflectional paradigms, including ones covering the exceptional cases. The inflectional descriptions include the mood (indicative, imperative or participial), the gender (masculine or feminine), the number (singular or plural), the aspect (aorist, imperfective, positive perfective or negative perfective) and the person (first, second or third). By these descriptions we refer to the set of all possible transformations which allow us to obtain, from a lexical entry, all its inflected forms. On average, there are 98 inflected forms per verb entry, and 427049 fully inflected forms in total. To give an overview of these rules, we take as an example the verb ⴰⴳⵍ [agl] "suspend", whose lexicon entry is:

ⴰⴳⵍ,V+Simple+Bilitère+Irreg+Tr+FLX=aC'C'_aff+FR=suspendre+accrocher+être suspendu+AR=ﻋﻠ ّﻖ+ﺗﻌﻠ ﱠﻖ

The inflectional grammar looks for the paradigm named "aC'C'_aff" in order to generate all the forms of this headword. Among the 98 inflectional transformations described in the inflectional paradigm "aC'C'_aff", here is one:

ⴷⵉⵜⵓ/Acc_Négatif+2+m+s

This NooJ paradigm, written with the NooJ graphic editors, consists of a number of pairs describing all the possible forms. The first part of each pair describes a change applied to the word (for instance, commands that position the cursor (|) at the beginning of the form, move it left by two characters, or delete the next character), while the second part describes the features assigned to the newly created word (e.g. Acc_Négatif+2+m+s: the verb form is in the negative perfective (Acc_Négatif), second person (2), masculine (m), singular (s)). The meaning of the transformation is: (1) alternate the first vowel ⴰ [a] to ⵓ [u], (2) insert the vowel ⵉ [i] before the last consonant and (3) finally, add the personal markers of the negative perfective aspect corresponding to the second person masculine. These operations, applied in succession, generate the form ⵜⵓⴳⵉⵍⴷ [tugild] "you have not suspended". During morphological processing, we also consider the verbs of a text separately and try to identify grammaticalization rules and orthographical constraints that follow from the distortion of some radicals by prefixation or suffixation, in order to increase the robustness of our analyzer and to elaborate an identification and distribution of the possible tags that can serve in the analysis of verb forms that do not correspond to our lexicon entries.

V. Orthographical Constraints and Grammaticalization Process: Rules Definition and Formalism

One of the main challenges of Amazighe resides in its orthographical constraints and grammaticalization. We continue our description within the NooJ linguistic platform by detailing the different lexical constraints implemented inside morphological grammars.

V.1. Orthographical Constraints

An Amazighe verb is separated by a white space from all the grammatical elements that may follow or precede it. However, depending on some morphological constraints, this separation is not always respected in written text. These constraints depend mainly on the phonetic pronunciation. Hence, verbs may be written without an intervening space before pronominal complements such as pronouns.
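As a reading aid, the following minimal Python sketch replays the three operations of the transformation just described on the entry ⴰⴳⵍ [agl]; it is not NooJ notation, and the function name is ours.

```python
# A plain-Python re-expression (not NooJ syntax) of the three operations the
# "aC'C'_aff" transformation described above performs on ⴰⴳⵍ [agl] "suspend"
# to build the negative perfective, 2nd person masculine singular.
def negative_perfective_2ms(verb):
    # (1) alternate the first vowel ⴰ [a] -> ⵓ [u]
    stem = "ⵓ" + verb[1:]
    # (2) insert the vowel ⵉ [i] before the last consonant
    stem = stem[:-1] + "ⵉ" + stem[-1]
    # (3) add the 2nd person singular markers ⵜ--ⴷ [t--d] (cf. Table II)
    return "ⵜ" + stem + "ⴷ"

print(negative_perfective_2ms("ⴰⴳⵍ"))   # ⵜⵓⴳⵉⵍⴷ [tugild]
```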
Thus, we apply some morphological transformations that segment the complex verbal forms before lexicon lookup, in order to deal with these more complex morphological phenomena. To illustrate our proposition, Fig. 5 presents an example of an orthographical-constraint graph. This graph shows that a form is recognized as the agglutination of a verb and a pronoun if it can be decomposed into two substrings satisfying the introduced constraints. These two substrings, formed by sequences of letters8, are stored respectively in the variables $V and $Suff. The contents of these two variables must satisfy two lexical constraints, which are used to certify the validity of the segmentation: the first constraint checks, through a lexicon lookup, that the root stored in the variable $V is listed as a verb; the second validates the contents of the variable $Suff as a personal pronoun. If both lexical constraints are valid, the form is recognized as verb+pronoun (V+PRO), and the grammar produces the resulting tags for the two morphemes. In these tags, the variables $1L/$2L store the lemma, $1C/$2C the morpho-syntactic category, $1S/$2S the syntactic and semantic features and $1F the inflectional information. To facilitate the reading of this graph, Table IV shows the analysis of one complex verbal form of the verb ⵙⵎⵓⵏ [smun] "assemble": ⵉⵙⵎⵓⵏⵜⵏⵜ [ismuntnt] "he assembled them".

TABLE IV
EXAMPLE OF THE RESULTING TAGS

First morpheme: the verb                          Second morpheme: the suffix
$V = ⵉⵙⵎⵓⵏ [ismun] "he assembled"                 $Suff = ⵜⵏⵜ [tnt]
$1L = ⵙⵎⵓⵏ [smun] "assemble"                      $1L = ⵜⵏⵜ [tnt]
$1C = V                                           $1C = PRO
$1S = Trilitère+Tr+Reg                            $1S = Personnel+affixe(COD)
$1F = Inacc+3+m+s                                 $1F = 3+p+f

In the same way, we have defined orthographical constraints involving the vowel ⴻ [e]. The neutral vowel (or schwa) ⴻ [e], which has a rather special status in Amazighe phonology, is written to avoid the juxtaposition of two identical consonants and to improve the readability of forms. However, it may also be inserted between two consonants depending on the phonetic pronunciation, or because of spelling mistakes. Therefore, as part of the robustness of the automatic processing task, we have implemented a finite state transducer to deal with these errors. For example, the form ⴼⴼⵖⵖ [ffɣɣ] "I released" of the verb ⴼⴼⵖ [ffɣ] "come out" can contain a ⴻ [e] between the consonant ⵖ [ɣ] of the root and the ⵖ of the personal marker of the indicative mood (ⴼⴼⵖⴻⵖ [ffɣeɣ]). To this end, decomposition operations must be applied to restore the initial form as it appears in the lexicon, and then to assign the linguistic information corresponding to the input form.

V.2. Grammaticalization Process and Frozen Verbal Expressions

In the Amazighe language, the verbal class is, in general, the main source of the grammaticalization process. To this end, we have undertaken to formalize the verb ⵉⵍⵉ [ili] "be/exist", which forms the basis of many grammatical units. We illustrate our proposal with two examples. (1) ⵍⵍⵉ [lli] (imperative mood of the verb ⵉⵍⵉ [ili] in 1+m+s) is one of the frozen forms of the verb ⵉⵍⵉ [ili]. This form operates as an anaphor when it is placed after a noun. For example, the anaphoric form ⴰⵔⴳⴰⵣ ⵍⵍⵉ [argaz lli] "the man in question" uses the semantics of the verb ⵉⵍⵉ [ili], and ⵍⵍⵉ [lli] is in this case annotated as an anaphor. (2) ⵍⵍⵉⵖ [lliɣ] (perfective mood of the verb ⵉⵍⵉ [ili] in 1+m+s) matches another grammatical term, the conjunction "during". To this end, we have formalized this process in order to produce the corresponding annotation among the generated tags.

VI. Experiment and Evaluation

The performance of a morphological analyzer should ultimately be observed in terms of its impact on the applications that use it. Here, the main goal of our experiment is to assess the flexibility of our approach. To do this, at the end of the development phase, we carried out the evaluation in two steps. (1) The first part of the experiment was devoted to the evaluation of our inflectional rules. To this end, we adopted a list of 701 distinct verbs. The list entries, manually constructed, were not used during the development of our rules, in order to get feedback and improve the modeling of the inflectional morphology of Amazighe verbs. (2) The second part was devoted to the evaluation of our orthographical constraints and grammaticalization rules. For this we adopted a collection of school texts; in order to use this corpus for verbal analysis, we first performed a manual tagging step to select the verbal category in the texts. After applying the inflectional and morphological rules to these lists, we manually analyzed the outputs to evaluate the performance of our rules. The results of the full analysis can be seen in Table V. They indicate that our verbal system, in its current state of development, has so far been successful.

8 A word form with length 1.
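To make the role of the two lexical constraints more tangible, here is a minimal Python sketch of the segmentation idea; it is an illustration only, since the actual analyzer implements this as a NooJ morphological graph, and the toy dictionaries below are hypothetical stand-ins for VAmLex and the pronoun lexicon.

```python
# Minimal sketch of the segmentation idea behind Fig. 5 (not the NooJ grammar
# itself): split an agglutinated form into a verb part and a pronominal suffix,
# and keep only the splits validated by lexicon lookups.
VERB_FORMS = {"ⵉⵙⵎⵓⵏ": ("ⵙⵎⵓⵏ", "V", "Inacc+3+m+s")}   # inflected form -> (lemma, cat, features)
PRONOUNS   = {"ⵜⵏⵜ":   ("ⵜⵏⵜ", "PRO", "3+p+f")}

def segment(form):
    """Return every decomposition verb+pronoun whose two parts pass the lexical constraints."""
    analyses = []
    for i in range(1, len(form)):
        v, suff = form[:i], form[i:]            # candidate $V and $Suff
        if v in VERB_FORMS and suff in PRONOUNS:
            analyses.append((VERB_FORMS[v], PRONOUNS[suff]))
    return analyses

print(segment("ⵉⵙⵎⵓⵏⵜⵏⵜ"))   # verb ⵙⵎⵓⵏ [smun] + pronoun ⵜⵏⵜ [tnt], as in Table IV
```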
Fig. 5. Morphological analysis of the form ⵉⵙⵎⵓⵏⵜⵏⵜ [ismuntnt] "he assembled them"

TABLE V
AMAZIGHE VERBAL ANALYZER EVALUATION

Results                                           Verbs correctly analyzed     Verbs incorrectly analyzed
                                                  Number       %               Number       %
Inflectional rules                                591          84.30%          110          15.69%
Orthographical and grammaticalization rules       2345         96.42%          84           3.45%

By taking a closer look at the entries which were not correctly analyzed, we can draw the following conclusions. For the inflectional rules, the incorrect analyses are mostly due to (1) verbs (30%) whose patterns are not among those already treated, and (2) incorrect inflections (70%). A high proportion of the latter corresponds to differences in the imperfective form (with correct perfective and negative perfective forms); these differences are due to verbs of some regional varieties of the Amazighe language, while the inflections remain correct for the standard variety. For the orthographical and grammaticalization rules, the unrecognized entries are mostly due to (1) entries (81%) whose root does not belong to our lexicon, in which case the morphological constraints cannot be applied and the input form cannot be recognized, and (2) typographical errors (19%) which depend mainly on the phonetic pronunciation (e.g. the entry ⵉⵜⵜⵡⵙⵉⵔ [ittwsir] cannot be recognized because the correct form contained in the lexicon is ⵉⵜⵜⵉⵡⵙⵉⵔ [ittiwsir] "he aged").
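For reference, the inflectional-rule percentages follow directly from the raw counts in Table V: 591 + 110 = 701 test verbs, with 591/701 ≈ 84.3% analyzed correctly and 110/701 ≈ 15.7% analyzed incorrectly.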
VII. Conclusion

Very few linguistic resources have been developed so far for Amazighe, and we believe that the development of a morphological analyzer is the first step needed for automatic text processing. In line with this, we have presented how we provided this scarcely-resourced language with a more fine-grained tool. In this paper, we presented a high-accuracy morphological analyzer for Amazighe verbs that exploits the regularity of the inflectional and morphological paradigms while employing the NooJ finite state tools to model the language in an elegant way. The research results presented above describe the first efforts aimed at investigating the mechanism responsible for predicting the conjugation of Amazighe verbs based on patterns. In addition to a large-coverage electronic lexicon, we described our morphological constraints and grammaticalization rules, which deal with complex morphological phenomena. The high accuracy figures obtained in the evaluation of our method are encouraging, and they allow us to review, correct and complete our resources in order to improve them. To emphasize further the usefulness of our approach to the morphological analysis of Amazighe verbs, we plan to (1) add new verbal lemmas and specific tags in order to enlarge the lexicon and to handle the regional varieties, and (2) formalize more morphological constraints and grammaticalization rules. Furthermore, the incorrect forms and the new patterns will be re-examined for further integration into the morphological system.

References

[1] A. Boukous, Société, langues et cultures au Maroc: Enjeux symboliques (Casablanca, Najah El Jadida, 1995).
[2] H. Eifring, R. Theil, Linguistics for Students of Asian and African Languages (Blindern: Universitet I Oslo, 2005).
[3] F. Ataa Allah, S. Boulaknadel, Pseudo-racinisation de la langue Amazighe. In Proceedings of Traitement Automatique des Langues Naturelles. Montréal, Canada (2010a).
[4] F. Ataa Allah, S. Boulaknadel, Amazighe Search Engine: Tifinaghe Character Based Approach. In Proceedings of the International Conference on Information and Knowledge Engineering. Las Vegas, Nevada, USA, pp. 255-259 (2010b).
[5] F. Ataa Allah, S. Boulaknadel, Amazigh Verb Conjugator. In Proceedings of the 9th edition of the Language Resources and Evaluation Conference, Reykjavik, Iceland (2014).
[6] F. Boukhris, A. Boumalk, E. Elmoujahid, H. Souifi, La nouvelle grammaire de l'Amazighe (Rabat, Maroc: IRCAM, 2008).
[7] F. Dell, M. Elmedlaoui, Syllabic Consonants and Syllabification in Imdlawn Tashlhiyt Berber. Journal of African Languages and Linguistics, 7: 105-130, 1985.
[8] F. Dell, M. Elmedlaoui, Clitic Ordering, Morphology and Phonology in the Verbal Complex of Imdlawn Tashlhiyt Berber, Part I. Langues Orientales Anciennes Philologie et Linguistique, 2: 165-194, 1989.
[9] F. Nejme, S. Boulaknadel, D. Aboutajdine, Analyse Automatique de la Morphologie Nominale Amazighe. Actes de la conférence du Traitement Automatique du Langage Naturel (TALN). Les Sables d'Olonne, France (2013a).
[10] F. Nejme, S. Boulaknadel, D. Aboutajdine, Finite State Morphology for Amazighe Language. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics (CICLing). Samos, Greece (2013b).
[11] F. Nejme, S. Boulaknadel, D. Aboutajdine, Toward a Noun Morphological Analyser of Standard Amazighe. In Proceedings of the International Conference on Computer Systems and Applications (AICCSA). Fes, Maroc (2013c).
[12] F. Nejme, S. Boulaknadel, D. Aboutajdine, Toward an Amazigh Language Processing. In Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing (SANLP), COLING, Mumbai, December, pp. 173-180 (2012a).
[13] F. Nejme, S. Boulaknadel, D. Aboutajdine, Formalisation de l'Amazighe standard avec NooJ. Actes de la conférence JEP-TALN-RECITAL. Grenoble, France (2012b).
[14] F. Nejme, S. Boulaknadel, D. Aboutajdine, Vers un dictionnaire électronique de l'Amazighe. Actes de la Conférence Internationale sur les Technologies d'Information et de Communication pour l'Amazighe. Rabat, Maroc (2012c).
[15] J. Greenberg, The Languages of Africa (The Hague, 1966).
[16] K. R. Beesley, L. Karttunen, Finite State Morphology (CSLI, Stanford, 2003).
[17] L. El Gholb, La conjugaison du verbe en amazighe: élément pour une organisation. Mémoire de Master 2, Université Ibn Zohr, Agadir, 2009.
[18] L. Karttunen, J.-P. Chanod, G. Greffenstette, A. Schiller, Regular Expressions for Language Engineering, Natural Language Engineering, 2:4, 305-328, 1997.
[19] M. Amrouch, A. Rachidi, M. El Yassa, D. Mammass, Handwritten Amazighe Character Recognition Based On Hidden Markov Models, International Journal on Graphics, Vision and Image Processing, 10(5), pp. 11-18, 2010.
[20] M. Ameur, A. Boumalk, Standardisation de l'Amazighe. Actes du séminaire organisé par le Centre de l'Aménagement Linguistique, Publication de l'Institut Royal de la Culture Amazighe. Rabat, Maroc (2004a).
[21] M. Lahrouchi, La structure interne des racines triconsonantiques en berbère tachelhit. In S. Chaker, A. Mettouchi & G. Philippson (eds), Etudes de phonétique et de linguistique berbères. Hommage à Naïma Louali 1961-2005, pp. 177-193. Editions Peeters: Paris, Louvain, 2009.
[22] M. Lahrouchi, On the Internal Structure of Tashlhit Berber Triconsonantal Roots. Linguistic Inquiry, 41/2: 255-285, 2010.
[23] M. Outahajala, Y. Benajiba, P. Rosso, L. Zenkouar, Using Confidence and Informativeness Criteria to Improve POS Tagging in Amazigh. Journal of Intelligent and Fuzzy Systems, doi: 10.3233/IFS-141417, 2014.
[24] M. Silberztein, The Lexical Module. In NooJ pour le Traitement Automatique des Langues, S. Koeva, D. Maurel, M. Silberztein (Eds.), MSH Ledoux, Franche-Comté Academic Presses, 2005.
[25] M. Silberztein, NooJ Manual (available at http://www.nooj4nlp.net, 2006).
[26] M. Silberztein, An Alternative Approach to Tagging. NLDB 2007: 1-11 (2007).
[27] M. Iazzi, Morphologie du verbe en tamazight (parler des Aït Attab, Haut Atlas Central), approche prosodique. Thèse de DES, Université Mohamed V, Rabat, 1991.
[28] M. Talha, S. Boulaknadel, D. Aboutajdine, Système de reconnaissance des entités nommées amazighes. Actes des Journées internationales d'Analyse statistique des Données Textuelles (JADT). Paris, France (2014a).
[29] M. Talha, S. Boulaknadel, D. Aboutajdine, RENAM: Système de Reconnaissance des Entités Nommées Amazighes. Actes de la conférence du Traitement Automatique du Langage Naturel (TALN). Les Sables d'Olonne, France (2014b).
[30] O. Ouakrim, Fonética y fonología del Bereber (Survey at the University of Autònoma de Barcelona, 1995).
[31] R. Laabdelaoui, A. Boumalk, M. Iazzi, H. Souifi, K. Ansar, Manuel de conjugaison de l'Amazighe (IRCAM, Rabat, Morocco, 2012).
[32] S. Boulaknadel, F. Ataa Allah, Building a Standard Amazighe Corpus. In Proceedings of the International Conference on Intelligent Human Computer Interaction. Prague, Tchec (2011).
[33] S. Boulaknadel, F. Ataa Allah, Online Amazighe Concordancer. In Proceedings of the International Symposium on Image Video Communications and Mobile Networks. Rabat, Maroc (2010).
[34] S. Chaker, Les bases de l'apparentement chamito-sémitique du berbère : un faisceau d'indices convergents. Etudes et documents berbères, 7: 28-57, 1990.
[35] T. A. Ernest, Tamazight Verb Structure: A Generative Approach (Volume 2 of African Series, Indiana University Publications, 1971).
Authors' information

Fatima Zahra Nejme is currently a PhD student at Mohammed V-Agdal University, Rabat, Morocco. She obtained her Master's Degree in Computer Sciences and Telecommunications at Mohammed V-Agdal University, Rabat, Morocco in 2011. Her research interests focus on the development of natural language processing tools for the Amazighe language.

Siham Boulaknadel is a researcher at the Royal Institute of Amazighe Culture, Morocco. She obtained her PhD in computer science at the University of Nantes in 2008. Her research interests focus on natural language processing, information retrieval, artificial intelligence and e-learning. She is currently involved in several national projects dealing with the development of linguistic resources and natural language processing tools for the Amazighe language. She has also largely contributed to the supervision of young researchers on various topics, especially the development of numerical learning resources for the Amazighe language. In addition, she is the author or co-author of numerous national and international publications.

Driss Aboutajdine received the PhD degree in signal processing from Mohammed V-Agdal University, Rabat, Morocco, in 1985. He joined Mohammed V-Agdal University in 1978, first as an assistant professor and, since 1990, as a professor in the Faculty of Science, heading the LRIT laboratory. He is currently the national coordinator of a National Information Technology Network of Excellence. Over 30 years he has developed research activities covering various topics of signal and image processing, wireless communication, pattern recognition and natural language processing, which have allowed him to publish over 300 journal papers and conference communications. He was elected member of the Hassan II Moroccan Academy of Science and Technology in May 2006, and was Vice President in charge of research, cooperation and partnership of Mohammed V Souissi University from September 2008 to December 2010. He is currently the director of the National Center for Scientific and Technical Research.
International Review on Computers and Software (I.RE.CO.S.), Vol. 10, N. 5 ISSN 1828-6003 May 2015
Energy-Efficient Data Reporting Strategy (EDRS) for Multilayer Clustering WSN

O. F. Mohammed1, B. Hussin2, A. S. H. Basari3

Abstract – Energy consumption is one of the key issues in data reporting in wireless sensor networks (WSNs). The unbalanced energy consumption that characterizes the many-to-one traffic pattern results in uneven energy dissipation among sensors, which significantly reduces network lifetime. In event tracking and monitoring systems with direct, constant data reporting of an ongoing event, the energy depletion of sensors in the specific region of the network where the event takes place is unavoidable. However, energy depletion can be balanced if sensors are clustered into groups and the sensed data are transmitted towards the Base Station (BS) over multiple hops through Cluster Heads (CHs) with different transmission-range capabilities. CHs should have the knowledge that enables them to decide how many packets should be transmitted either over multiple hops or directly, in order to avoid the fast energy drain of the sensors near the BS. This knowledge can be provided by the BS based on the region where the CHs are located. In this paper, we propose an energy-efficient strategy for reporting the sensed data in a multilayer clustered WSN. The proposed strategy maximizes network lifetime by balancing energy consumption, using a dual transmission range to forward a pre-defined number of data packets according to the network condition. Experimental simulations have been performed to validate the proposed strategy. Copyright © 2015 Praise Worthy Prize S.r.l. - All rights reserved.
Keywords: Clustering, Event Tracking, Energy Efficiency, Routing Protocols, Wireless Sensor Networks (WSN)
Nomenclature

WSN: Wireless Sensor Network
GPS: Global Positioning System
QoS: Quality of Service
BS: Base Station
CH: Cluster Head
CM: Cluster Member
R: Transmission range
d: Transmitter-receiver distance
R`: The smallest integer not less than the fraction of the sensor's distance from the BS (d) over R
φ: Angle of a sector
si: Slice in a sector
m: Number of slices in a sector
ECsi: Energy consumption in any slice
Nsi: Number of sensors in the ith slice
M: Uniform sensors in the area of monitoring
Erx: The reception energy needed for receiving one bit of data
Etx: The energy required to send one bit of data
l: The length of a data packet in bits
ε0: The initial energy of sensors
Lsi: The lifetime of slice i
Dt: The amount of data (in bits) that needs to be forwarded directly from sensors in si to the BS
Esi_txd: The energy required to send data from si directly to the BS
ECsi_rx: The number of relayed bits
rDt: The ratio of direct data sending to one-hop sending
m2BNsi: The total data bits sent, where BNsi is the data bits generated by sensors in si
Bsensed_si: The number of sensed bits in slice i
Brelay_si: The number of bits relayed by si
Bdirect_si: The number of bits sent directly by si to the BS
Eelec: The electronic energy of the sensor
α: The energy dissipation over the operational amplifier (op-amp) during data transmission
PLE: The path-loss exponent
SHA: Single-Hop Advertisement
DTA: Direct-Transmission Advertisement
SHAF: SHA flag
DTAF: DTA flag
I. Introduction
The growing field of information technology has enhanced the capabilities of wireless communication. Wireless Sensor Networks (WSNs) are now used in many real-world fields, with a wide variety of roles and objectives. Recently, interest in sensor networks has moved from pure research to practical deployment [1]-[30].
The demands on WSNs have increased rapidly due to the high potential and usefulness of new applications that use sensors. Nonetheless, several challenges are associated with these networks with regard to application requirements and the efficient operation of the network. Naturally, most WSNs are designed for a specific mission, which in turn places specific requirements on the network system, such as the use of certain protocols and algorithms [1]. One of the challenging issues in WSNs is that nodes may not be able to communicate with each other in a bidirectional manner, as communication links between sensors might be asymmetric. This has a negative impact on routing and data dissemination in WSNs. Architectures and protocols are the most important factors and essential elements in designing WSNs [2]. Since a WSN includes a large number of sensor nodes that are placed and deployed in an area for a sensing task, it requires reliable protocols and algorithms for control and management. This includes self-configuration, medium access control, node localization, routing, synchronization, data aggregation and network security. However, constraints on sensor nodes, such as energy, computation and storage, make it impossible to use protocols developed for traditional wireless networks (such as cellular systems and mobile ad hoc networks), as these do not consider such limitations [3]. Furthermore, for WSNs to be capable of gathering data usefully and efficiently, the locations of the sensing nodes should be known in advance [4]. So sensors should be capable of determining their location and exchanging their position information with their neighbors. Location information can be used for different purposes, such as saving energy: for instance, redundant or idle sensors can be powered down (put into sleeping mode) to save energy [5]. According to Kamal and Varalakshmi [6], sensor position or localization problems are among the most critical research areas related to the operation of WSNs. Also, WSNs should be robust to issues such as power degradation or breach. Many distributed mechanisms and algorithms developed for sensor networks count on sensors having accurate knowledge of their location [7]-[10]. Even though sensors can be equipped with a Global Positioning System (GPS), this is not achievable at all times; therefore, internal localization algorithms are required [11], [12].
II. Requirements of Sensing Applications

According to R. Merz et al. [13], any application developed for WSNs should take into account a set of requirements that can be considered as useful metrics for evaluating WSN applications. These requirements are as follows:
• Fault Tolerance: Nodes in WSNs are subject to failure, especially in harsh deployment environments where operations are unattended. Therefore, nodes must be fault tolerant and have the abilities of testing, repairing, and recovering.
• Channel Utilization: Bandwidth in WSNs is usually limited. Therefore, wireless communication protocols developed for WSNs must use the bandwidth resources efficiently to enhance channel utilization.
• Energy Efficiency: All protocols and algorithms designed for WSNs must take into consideration the limited resources, such as energy, in the sensing nodes.
• Scalability: Protocols and algorithms of WSNs should enable WSN applications to adjust themselves to the new conditions when the number of nodes in the network increases.
• Precision: The accuracy depends significantly on the purpose of the specific application.
• Robustness: WSNs may operate for a long time in rough, hostile environments. Protocols and algorithms should stay functional even when a few sensor nodes fail to operate.
• Synchronization Scope: In WSNs, protocols and algorithms could provide a global time base for all sensing nodes in the network; otherwise, only local synchronization between neighboring nodes that are close to each other is performed. In large WSNs, issues of scalability make global synchronization very difficult because of energy and bandwidth limitations. Also, aggregating data collected from remote nodes requires a common time base for these large WSNs.
• Immediacy: Some WSN applications, such as detection of emergencies or environmental changes (for example gas leaks and intruders), require immediate communication with the Base Station (BS) node or cluster leader upon the event occurrence. In these applications, delay is not acceptable when an emergency is detected.
• Adaptability: In WSNs, sensor nodes may move, join, or fail, which causes network topology changes. Hence, WSN protocols must be adaptive to such changes. Protocols should have prediction mechanisms to anticipate the change and take proactive action; this improves the efficiency of WSNs.
• Reliability: WSN applications require reliable delivery of data. Thus, WSN protocols should offer error control and correction to guarantee consistent data delivery under different wireless channel conditions, such as noise, errors, and time variation.
• Quality of Service (QoS): QoS requirements should be considered for specific WSN applications. These applications might have different QoS requirements in terms of delay, throughput, and packet loss. For instance, many monitoring applications, such as fire and gas leak monitoring, are very delay sensitive and, as a result, require well-timed delivery of data. Also, data collection applications designed for scientific exploration purposes are sensitive to packet loss.
Even though clustered WSNs offer better performance than flat WSNs, the design of clustering consumes more battery energy [18]. Nonetheless, an efficient multilayer clustering design can contribute to energy conservation by reducing broadcast communications and computations [19]. Ammari et al. [20] proposed solutions that involve adjusting the communication ranges to improve energy conservation in WSNs. They suggested that sensor nodes located within a certain distance of the BS should have an equal level of energy, while sensors placed at two different distances from the BS can have different levels of energy. Heterogeneity was added with the purpose of simultaneous energy dissipation in the sensors. However, the solutions were not feasible because the size of the fields under monitoring is limited. Later, they proposed a localized energy-aware data forwarding protocol based on the Voronoi diagram, in which sensors are homogeneous and the BS is mobile. A sensor that is close to the BS and has satisfactory residual energy can be selected as a forwarder for the sensed data. Nonetheless, the BS mobility is the key weakness, as sensors need extra energy to track the location of the BS. For balanced energy dissipation, Leone et al. [21] developed an algorithm under the assumption that sensor nodes have foreknowledge of their data generation rate. Based on the data aggregation at the BS, a suitable value for the algorithm parameter is computed. Sensor nodes are then notified of this value by the BS so that they can calculate their optimum ratio of transmission, either via multi-hop or direct transmission (see Fig. 2). However, the solution assumes that the probabilities of event occurrence are known, and it observes and deduces the probability values from the events, which is not realistic. Azad et al. [22] developed a strategy in which the transmission distance is divided into two parts, namely ring thickness and hop size, according to concentric rings formed around the BS. In their strategy, sensors regulate their transmission power to balance the energy consumption in the network by forming different topologies according to the network situation. Yet, forming topologies dynamically requires high transmission power, which limits the practical implementation of this strategy. Chen et al. [23] proposed a cooperative transmission mechanism to avoid gaps (holes) near the BS caused by dead sensor nodes.
III. Multi-Transmission Ranges for Energy Efficient Routing in WSN

The success of a WSN primarily depends on routing the sensed data from the sources (sensors) to the specified BS. Most of the sensors' energy is dissipated in routing data, which consequently reduces the lifetime of the network. Thus, the key design objective of routing protocols in WSNs is to reduce energy consumption and thereby extend the network lifetime. Network size plays an important role in sensor communication and energy consumption. Direct-hop communication between sensors and their BS can be effective as an energy-aware solution if the network size is small [14], [15]. However, for large WSNs, this is not feasible. In such cases, multi-hop communication can provide operational scalability, as intermediate nodes can forward the sensed data towards the far-located BS [16]. The clustered design of WSNs is one of the architectures that aim to minimize battery energy consumption [17]. In a clustered WSN, sensors are gathered into groups (clusters). One of the sensors within a cluster is chosen as a Cluster Head (CH) while the other sensors are considered Cluster Members (CMs). In a clustered WSN, sensor nodes can take various roles according to the clustering algorithm used, as shown in Fig. 1. The main goal of energy-efficient routing is to reduce the energy consumption by improving other factors such as CH election, CH rotation, the inter-cluster and intra-cluster communication procedure, and reporting to the BS.
Fig. 1. Various node roles in a clustered WSN

Fig. 2. Different transmission ranges with respect to distance
In their mechanism, sensors collaborate with their relay nodes to forward the sensed data using a multi-input single-output procedure. To balance the energy consumption over the network, the proposed mechanism allows energy sharing between sensors placed near the BS and sensors located farther away. This mechanism outperforms the traditional single-input single-output hop-by-hop procedure. However, using different transmission ranges in such a mechanism is not practical, and it is not feasible for a clustered, layered WSN. Thanigaivelu and Murugan [24] developed a K-level-based transmission-range strategy based on a controlled region selection procedure. In this strategy, the K value (the number of tier-level jumps) enables sensors to define their potential next hops based on residual energy. Thus, with a new K value selected in each renewal phase, random repetition and selection of a hop is avoided, in contrast to the fixed transmission range procedure. Nevertheless, the strategy is feasible for free-space situations only.
IV. Energy-Efficient Data Reporting Strategy (EDRS) for Multilayer Clustering

Sensor transmission range is one of the key elements that can be used to improve WSNs with respect to the lifetime of network operation. The lifetime of the network can be ensured if the average energy depletion is comparable for every sensor in the network. This requires an intelligent energy-efficient mechanism that considers multiple transmission ranges for transmitting or forwarding the sensed data. In an event monitoring system, it is assumed that events are generated uniformly at random in the network. Once an event is sensed, the sensed data are reported (propagated) to the BS either via one hop, via multiple hops, or directly, as shown in Fig. 3 and Fig. 4. This depends on the transmission ranges that the sensors are capable of and on the condition of the network.

Fig. 4. Multi-hop reporting in a hierarchal WSN

Direct sending can cause fast energy depletion if the sensors are located far from their BS. In such a case, unbalanced energy depletion among sensors is unavoidable. Consequently, there is a higher chance that the network fails to perform the intended tasks. On the other hand, if sensors can transmit with different transmission ranges according to their resources and the network condition, there is a higher chance of extending the network lifetime, as the energy depletion will be balanced among all sensors.
IV.1. Preliminaries

Assume that sensors are capable of sending data with two transmission ranges, R and R·R`, where R` is the smallest integer not less than the fraction of the sensor's distance from the BS (d) over the operator-set range R; that is, R` is the ceiling of d/R. Considering a sufficiently wide angle φ, and by making several sets of hierarchical slices, the entire monitoring area can be covered without R` exceeding the maximum acceptable transmission range. As the area is covered by a disk sector of angle φ, the sector can be divided into m slices (or tiers). Assuming that the innermost slice has a transmission range of R, each of the other slices si (2 ≤ i ≤ m) is bounded by two consecutive circles, one at distance R·i (= R·R`) and the other at distance R·(i−1) from the BS, as shown in Fig. 5. In the multi-hop mode, sensors close to the BS consume more energy in forwarding the sensed data sent by other sensors. Therefore, since sensors in slices close to the BS can be burdened with sensed data that needs to be forwarded to the BS, sensors in s2 are considered capable of dual transmission ranges (R and R·R`). That is, sensors in s2 alternately forward data to a selected node in s1 or send it to the BS. Thus, the energy consumption in s1 and s2 can be balanced by sending some of the sensed data directly from s2 to the BS while the rest of the sensed data is relayed in a multi-hop manner via s1.
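As a concrete illustration of the slice/range relation described above, the short sketch below computes R` = ⌈d/R⌉ for a few sensor distances; the function name and the sample values are illustrative assumptions, not taken from the paper.

```python
import math

def outer_range_multiplier(d, R):
    """Return R' = ceil(d / R), the multiple of the base range R needed
    to reach the BS directly from a sensor at distance d (Section IV.1)."""
    return math.ceil(d / R)

R = 100.0  # base transmission range in metres (illustrative value)
for d in (80.0, 150.0, 420.0):
    r_prime = outer_range_multiplier(d, R)
    # a sensor at distance d lies in slice s_{R'} and needs range R * R' to reach the BS directly
    print(f"d = {d:6.1f} m -> R' = {r_prime}, direct range = {R * r_prime:.0f} m")
```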
Fig. 3. One-hop reporting in a hierarchal WSN
Fig. 5. The monitoring area is covered by a disk sector of angle φ, which is divided into m hierarchical slices

Sensors located in the other slices transmit their sensed data in a hop-by-hop manner (not directly to the BS). As for the number of sensors in the network, it is assumed that, among the total of M uniformly deployed sensors in the monitoring area, Ns1 sensors are deployed in the innermost slice s1; thus Ns1 = M/m², where m is the number of slices in the sector. Since the sensors are uniformly deployed, with respect to the innermost slice the number of sensors in every other slice is:

  N_si = (2i − 1) · N_s1    (1)

where N_si represents the number of sensors in the i-th slice and N_s1 represents the number of sensors in the innermost slice of the sector.

In such a multi-hop forwarding with multi-transmission energy balancing strategy, sensors in s2 have the greatest impact. Therefore, energy balancing should be considered in s1 and s2 only. To balance the energy consumption in s1 and any other slice si, their lifetimes should be equal. Thus:

  EC_si = (2i − 1) · EC_s1    (2)

where EC_s1 represents the energy consumption in the innermost slice s1 and EC_si represents the energy consumption in any other slice si per unit of time. The total energy consumption and the lifetime of each slice i are as follows:

  EC_si = l · [ N_si · E_tx + Σ_{j=i+1..m} N_sj · (E_tx + E_rx) ],  1 ≤ i ≤ m − 1    (3)

  L_si = (N_si · ε0) / EC_si    (4)

where N_si denotes the number of sensors in slice i, m denotes the number of slices in the sector, E_rx denotes the reception energy needed to receive one bit of data, E_tx denotes the energy required to send one bit of data, l denotes the length of the data packet in bits, ε0 denotes the initial energy of the sensors, and L_si is the lifetime of slice i. Let Dt be the amount of data (in bits) that needs to be forwarded directly from sensors in si to the BS. If N_s1 is the number of sensors located in s1 and every sensor sends one bit per unit time, then, from Eqs. (2) and (3), the total energy consumption in the innermost slice is:

  EC_s1 = N_s1 · E_tx + (E_rx + E_tx) · [ Σ_{j=2..m} (2j − 1) · N_s1 − Dt ]    (5)

  EC_s1 = N_s1 · E_tx + (E_rx + E_tx) · (m² − 1) · N_s1 − (E_rx + E_tx) · Dt    (6)

where EC_s1 represents the total energy consumption in the innermost slice, and E_rx and E_tx respectively represent the reception and transmission energies, with l assumed to be one (l = 1).

IV.2. Dual Transmission Ranges Computation

It is important to emphasize that sensors capable of dual transmission ranges are available. Therefore, the strategy proposed in this paper can be feasible and useful for different WSN applications. Since the sensed data at s2 are forwarded to the BS either directly or in multi-hop transmissions, the optimum ratio of these transmissions needs to be derived analytically. The ratio of data sending that helps balance the energy in s1 and s2 is defined as the amount of data transmitted with a transmission range of R to the amount of data transmitted with a transmission range of R`·R. In order to balance the energy depletion in s1 and s2, the energy consumption of the sensors in those slices needs to be equal. This means that sensors in s1 and s2 will die at approximately the same time; i.e., the lifetime of sensors in s2 (Ls2) is equal to the lifetime of sensors in s1 (Ls1). The energy E_si_tx required to send data from si to the next slice towards the BS is:

  E_si_tx = (B_sensed_si + B_relay_si − B_direct_si) · E_tx    (7)

where B_sensed_si is the number of sensed bits in slice i, B_relay_si is the number of bits relayed from other slices, and B_direct_si is the number of bits sent directly to the BS.
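To make the slice geometry concrete before continuing the derivation, the short sketch below evaluates Eqs. (1), (3) and (4) for a toy configuration with no direct sending; all parameter values are illustrative assumptions, not results from the paper.

```python
# Illustrative evaluation of Eqs. (1), (3) and (4): slice populations,
# per-slice energy consumption and lifetimes, assuming Dt = 0 (no direct sending).
m = 5            # number of slices (assumed)
N_s1 = 10        # sensors in the innermost slice (assumed)
E_tx, E_rx = 180e-12, 50e-9   # J/bit, values borrowed from Section V
eps0 = 0.5       # initial energy per sensor, in J
l = 1            # packet length in bits

N = {i: (2 * i - 1) * N_s1 for i in range(1, m + 1)}           # Eq. (1)
for i in range(1, m):                                           # Eq. (3), 1 <= i <= m-1
    relayed = sum(N[j] for j in range(i + 1, m + 1))
    EC_si = l * (N[i] * E_tx + relayed * (E_tx + E_rx))
    L_si = N[i] * eps0 / EC_si                                  # Eq. (4)
    print(f"slice {i}: {N[i]:3d} sensors, EC = {EC_si:.3e} J/round, lifetime ~ {L_si:.3e} rounds")
```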
The energy consumption required to send Dt packets directly from sensors in any slice i to the BS can be computed based on the following equation:

  EC_si = E_si_tx + E_si_txd + E_si_rx    (8)

where E_si_txd represents the energy required to send data from si directly to the BS and EC_si_rx represents the number of relayed bits. Slice i would receive and relay the Σ_{j=i+1..m} (2j − 1)·N_s1 data bits that arrive from the outer slices and would also send the (2i − 1)·N_s1 bits of its own sensed data to the next hop in the neighboring slice towards the destination; Dt bits of that amount of data should be sent directly to the BS. Hence, from Eq. (8), the total energy consumption in si is:

  EC_si = E_si_tx · [ (2i − 1)·N_s1 + Σ_{j=i+1..m} (2j − 1)·N_s1 − Dt ] + Dt · E_si_txd + E_si_rx · Σ_{j=i+1..m} (2j − 1)·N_s1    (9)

As the energy consumption in s2 has a serious effect on the network lifetime, our proposed strategy is used in s2 only. Thus, i = 2 is substituted into Eq. (9) to give the energy consumption in s2:

  EC_s2 = E_s2_tx · [ 3·N_s1 + (m² − 4)·N_s1 − Dt ] + Dt · E_s2_txd + E_s2_rx · (m² − 4)·N_s1    (10)

where E_s2_txd represents the energy needed to send data from the second slice (s2) to the BS directly. This is also formulated according to the following energy model, where the energy needed for one data bit transmission over a certain distance d is computed as:

  E_tx = E_elec + α · d^n    (11)

where E_elec represents the electronic energy of the sensor, α indicates the energy dissipation over the operational amplifier (op-amp) during data transmission, d represents the transmitter-receiver distance, and n is the Path-Loss Exponent (PLE). To calculate the number of data bits Dt that needs to be sent from s2 directly to the BS, all sensors in s1 and s2 should have the same energy consumption. Therefore:

  EC_s1 / N_s1 = EC_s2 / N_s2,  and with N_s2 = (2·2 − 1)·N_s1 = 3·N_s1 this gives EC_s2 = 3·EC_s1    (12)

Substituting Equations (6) and (10) into Equation (12), with the one-hop per-bit energies E_s2_tx and E_s2_rx taken as E_tx and E_rx, yields:

  3·[ N_s1·E_tx + (E_rx + E_tx)·(m² − 1)·N_s1 − (E_rx + E_tx)·Dt ] = [ 3·N_s1 + (m² − 4)·N_s1 − Dt ]·E_tx + Dt·E_s2_txd + (m² − 4)·N_s1·E_rx    (13)

Further simplification of Eq. (13) results in the following:

  Dt = [ (N_s1 + 2m²·N_s1)·E_rx + (N_s1 + 2m²·N_s1)·E_tx ] / (2·E_tx + E_s2_txd + 3·E_rx)    (14)

Dt is the number of bits that needs to be sent directly from s2 to the BS in order to balance the energy consumption in the innermost slice (s1) and s2. Let rDt represent the ratio of direct data sending to one-hop sending in s2; that is, rDt is obtained by dividing Dt by the total number of bits sent in s2. The total number of bits (Bs2) that need to be sent in s2 can be calculated as follows:

  B_s2 = m²·BN_s1 − BN_s1    (15)

where m²·BN_s1 is the total data bits sent and BN_s1 is the data bits generated by the sensors in s1 (since every sensor generates one bit per unit time, N_s1 = BN_s1). Using Eqs. (14) and (15), rDt can be calculated as follows:

  rDt = Dt / (m²·BN_s1 − BN_s1)
      = [ (BN_s1 + 2m²·BN_s1)·E_rx + (BN_s1 + 2m²·BN_s1)·E_tx ] / [ (2·E_tx + E_s2_txd + 3·E_rx) · (m²·BN_s1 − BN_s1) ]
      = [ (1 + 2m²)·(E_rx + E_tx) ] / [ (2·E_tx + E_s2_txd + 3·E_rx) · (m² − 1) ]    (16)

From Eq. (16), it is clear that the fraction of direct sending of data bits from s2 to the BS is not subject to
the number of sensors in s2 but only to the number of slices m in the sector.
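As a quick illustration of the closed form in Eq. (16), the sketch below evaluates rDt for a few sector sizes m. It is a minimal sketch that assumes the reconstructed form of Eq. (16) above and uses the energy values quoted later in Section V, so the absolute numbers are illustrative rather than a reproduction of the paper's Table I.

```python
def direct_sending_ratio(m, E_rx, E_tx, E_s2_txd):
    """rDt from Eq. (16): the fraction of s2's traffic that should go directly to the BS."""
    numerator = (1 + 2 * m**2) * (E_rx + E_tx)
    denominator = (m**2 - 1) * (2 * E_tx + E_s2_txd + 3 * E_rx)
    return numerator / denominator

# Energy values quoted in Section V (J/bit); treat the resulting ratios as illustrative only.
E_rx, E_tx, E_s2_txd = 50e-9, 180e-12, 2.13e-6
for m in (2, 5, 10):
    print(f"m = {m:2d} slices -> rDt ~ {direct_sending_ratio(m, E_rx, E_tx, E_s2_txd):.3f}")
```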
IV.3. Algorithm Description and Operation

In this section, the algorithm of the proposed EDRS strategy is presented. In this algorithm, the sensors present in slice 2 (s2) of the network sector are chosen to operate and send data in the multi-transmission mode, while the sensors located in the other slices send data using one transmission range. In the setup phase, sensors in s2 are configured to work in dual-transmission mode (R and R·R`) so that they can switch between two transmission levels and send or forward data alternately, while the rest of the sensors send with a single transmission range of R. In this algorithm, the BS uses two types of advertisement packets: Single-Hop Advertisement (SHA) and Direct-Transmission Advertisement (DTA). Therefore, sensors in the network require two flags: the SHA flag (SHAF) and the DTA flag (DTAF). The procedure of the algorithm for choosing the sensors that use dual-transmission mode is as follows:
1. Initially, in all sensors, SHAF is set to "zero" and DTAF is set to "NON-ACTIVE".
2. The BS advertises an SHA packet with a transmission range of R. The sensors that receive this packet set their SHAF to one.
3. The BS then advertises a DTA packet with a transmission range of R`·R. Sensors in s2 that receive this packet, and that are required to send data directly to the BS, set their SHAF to 0 and their DTAF to "ACTIVE" and proceed with the R`·R transmission range.
After that, sensors in s2 with DTAF = ACTIVE send data according to the computed ratio of direct data sending to one-hop sending. Fig. 6 shows the operation of the proposed EDRS strategy in each sensor.

Fig. 6. General procedure of the proposed EDRS

V. Evaluation of the Proposed EDRS Strategy

To validate the proposed EDRS, several simulation experiments were conducted using Network Simulator 2 (ns-2). The experiments were run an average of 50 times to ensure that the simulation models are correct and reliable. The energy efficiency of the network was assessed by measuring and comparing the lifetime of the network. This helps to show the effect of the data sending method, direct or multi-hop, on the lifetime according to Eq. (16), where the direct sending ratio can be obtained for different levels of hierarchical WSNs (as there are several slices in the sector). In the simulation scenarios, the network consists of 1000 sensors deployed uniformly in 10 slices. The initial energy of each sensor is 0.5 J; thus, the total energy of the network is 500 J. The reception energy Erx and the electronic energy consumption Eelec are set to 50 nJ/bit. The transmission energy Etx is set to 180 pJ/bit. The energy needed to send data directly from the second slice (s2), Es2_txd, is 2.13 µJ/bit (based on Eq. (11) with α = 0.0013 pJ/bit/m⁴, path-loss exponent n = 4, and d = 2×100 m). The radius of the network is 100 to 1000 m. Table I presents the results of the simulations in terms of the best ratio of direct data sending (rDt) to one-hop sending in s2, where the number of slices in the network is 10. For every ten data packets, rDt is computed to decide how many of these packets should be sent directly to the BS.

TABLE I. THE BEST rDt OBTAINED IN A 10-SLICE NETWORK
Number of slices in the network sector (m) | Best ratio of direct data sending to one-hop sending in s2 (rDt)
 2 | 0.252
 3 | 0.216
 4 | 0.189
 5 | 0.181
 6 | 0.179
 7 | 0.177
 8 | 0.175
 9 | 0.174
10 | 0.172

The results obtained from the simulations show that the proposed EDRS improves the lifetime of the network compared to the normal hop-by-hop procedure. Fig. 7 illustrates the lifetime gained when using EDRS compared to the hop-by-hop reporting method.

Fig. 7. The improvement of network lifetime using EDRS compared to hop-by-hop reporting
The figure shows that, for the network radius of 200 meters, the network lifetime observed when using EDRS is 6.98E+05, while it is 4.84E+05 when using hop by hop for the same network radius. This means that EDRS increased the network lifetime by about 22%.
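The value Es2_txd = 2.13 µJ/bit quoted in the simulation setup can be reproduced from the radio model of Eq. (11) with the stated parameters; the short check below does this arithmetic (the 22% lifetime figure itself is taken from the paper and is not recomputed here).

```python
# Reproduce Es2_txd from Eq. (11): E = Eelec + alpha * d**n, using the values given in Section V.
E_elec = 50e-9          # J/bit
alpha = 0.0013e-12      # J/bit/m^4
n = 4                   # path-loss exponent
d = 2 * 100             # transmitter-receiver distance in metres

E_s2_txd = E_elec + alpha * d**n
print(f"Es2_txd = {E_s2_txd * 1e6:.2f} uJ/bit")   # ~2.13 uJ/bit, matching the stated value
```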
VI. Conclusion and Future Work

In WSNs, unbalanced energy consumption is a serious issue that has a negative impact on the network lifetime. Dual transmission ranges can help balance energy consumption in WSNs. In this paper, we proposed an Energy-Efficient Data Reporting Strategy (EDRS) for multilayer clustering WSNs. It can maximize the lifetime of WSNs by balancing energy consumption among homogeneous, uniformly deployed sensors. Several strategies have been developed to balance the energy and increase the lifetime using multiple transmission ranges; nevertheless, sensors with several different transmission ranges are not available in practice. It has been demonstrated that using dual transmission ranges (R and R·R`) improves the network lifetime noticeably. It was illustrated that, with a WSN represented by a sector of m slices, only the two innermost slices can be used to balance the energy consumption in the network, and that the second slice (s2) is the most operative slice when utilizing EDRS in such networks. Simulation results showed that EDRS can increase the lifetime of WSNs by about 22% compared with a hop-by-hop transmission method. For future work, we are going to evaluate EDRS in heterogeneous WSNs with different node deployment strategies.

Acknowledgements

This work was supported by the Faculty of Information and Communication Technology, Universiti Teknikal Malaysia Melaka, Malaysia.

References
[1] F. Akyildiz, W. Su, Y. Sankarasubramaniam and E. Cayirci, Wireless sensor networks: A survey, Computer Networks (Elsevier) Journal, Vol. 38, No. 4, pp. 393-422, 2002.
[2] J. Heidemann, F. Silva, C. Intanagonwiwat, R. Govindan, D. Estrin, and D. Ganesan, Building efficient wireless sensor networks with low-level naming, in Proceedings of the Symposium on Operating Systems Principles (SOSP), pp. 146-159, 2001.
[3] Vidhyapriya, R., Vanathi, P.T., Energy Aware Routing for Wireless Sensor Networks, International Conference on Signal Processing, Communications and Networking (ICSCN), pp. 545-550, 2007.
[4] Mili, F., Ghanekar, S., Meyer, J., Distributed algorithms for event tracking through self-assembly and self-organization, 53rd IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 173-176, 2010.
[5] Kodali, Ravi Kishore, Experimental analysis of an event tracking energy-efficient WSN, International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 1293-1298, 2013.
[6] Kamal, S., Varalakshmi, P., Energy efficient and congestion avoidance event tracking in wireless sensor networks, International Conference on Signal Processing, Communication, Computing and Networking Technologies (ICSCCN), pp. 167-171, 2011.
[7] Khan, M., Pandurangan, G., Kumar, V.S.A., Distributed Algorithms for Constructing Approximate Minimum Spanning Trees in Wireless Sensor Networks, IEEE Transactions on Parallel and Distributed Systems, Vol. 20, No. 1, pp. 124-139, 2009.
[8] Acimovic, J., Beferull-Lozano, B., Cristescu, R., Adaptive distributed algorithms for power-efficient data gathering in sensor networks, in Proceedings of the International Conference on Wireless Networks, Communications and Mobile Computing, pp. 946-951, 2005.
[9] Taherkordi, A., Mohammadi, R., Eliassen, F., A Communication-Efficient Distributed Clustering Algorithm for Sensor Networks, 22nd International Conference on Advanced Information Networking and Applications (AINAW), pp. 634-638, 2008.
[10] Miu-ling Lam, Yun-Hui Liu, Two distributed algorithms for heterogeneous sensor network deployment towards maximum coverage, in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 3296-3301, 2008.
[11] Ren, Qianqian, Guo, Longjiang, Zhu, Jinghua, Ren, Meirui, Zhu, Junqing, Distributed aggregation algorithms for mobile sensor networks with group mobility model, Tsinghua Science and Technology, Vol. 17, No. 5, pp. 512-520, 2012.
[12] Li, W.W., Several Characteristics of Active/Sleep Model in Wireless Sensor Networks, 4th IFIP International Conference on New Technologies, Mobility and Security (NTMS), pp. 1-5, 2011.
[13] R. Merz, J. Widmer, J.-Y. L. Boudec, and B. Radunovic, A joint PHY/MAC architecture for low-radiated power TH-UWB wireless ad-hoc networks, Wireless Communications and Mobile Computing Journal, Vol. 5, No. 5, pp. 567-580, 2005.
[14] Heinzelman, W., Chandrakasan, A., Balakrishnan, H., Energy-efficient communication protocol for wireless microsensor networks, in Proceedings of the 33rd Hawaii International Conference on System Sciences (HICSS), pp. 3005-3014, 2000.
[15] W. Heinzelman, A. P. Chandrakasan, and H. Balakrishnan, An Application-Specific Protocol Architecture for Wireless Microsensor Networks, IEEE Transactions on Wireless Communications, Vol. 1, No. 4, pp. 660-670, 2002.
[16] Jan, H., Paul, A., Minhas, A.A., Ahmad, A., Jabbar, S., Kim, M., Dependability and reliability analysis of intra cluster routing technique, Peer-to-Peer Networking and Applications, doi: 10.1007/s12083-014-0311-1, 2014.
[17] Abbasi, A.A., Younis, M., A survey on clustering algorithms for wireless sensor networks, Computer Communications, Vol. 30, No. 14-15, pp. 2826-2841, 2007.
[18] Weng, C.-E., Lai, T.-W., An Energy-Efficient Routing Algorithm Based on Relative Identification and Direction for Wireless Sensor Networks, Wireless Personal Communications, Vol. 69, No. 1, pp. 253-268, 2013.
[19] Jabbar, S., Minhas, A.A., Paul, A., Rho, S., MCDA: Multilayer Cluster Designing Algorithm for Network Lifetime Improvement of Homogenous Wireless Sensor Networks, The Journal of Supercomputing, Vol. 70, No. 1, pp. 104-132, 2014.
[20] H.M. Ammari and S.K. Das, Promoting Heterogeneity, Mobility, and Energy-Aware Voronoi Diagram in Wireless Sensor Networks, IEEE Transactions on Parallel and Distributed Systems, Vol. 19, No. 7, pp. 995-1008, 2008.
[21] P. Leone, S.E. Nikoletseas, and J.D.P. Rolim, Stochastic Models and Adaptive Algorithms for Energy Balance in Sensor Networks, Theory of Computing Systems, Vol. 47, No. 2, pp. 433-453, 2010.
[22] A.K.M. Azad, Joarder Kamruzzaman, Energy-Balanced Transmission Policies for Wireless Sensor Networks, IEEE Transactions on Mobile Computing, Vol. 10, No. 7, pp. 927-940, July 2011.
[23] Y. Chen, Q. Li, L. Fei, and Q. Gao, Mitigating energy holes in wireless sensor networks using cooperative communication, in Proceedings of the 23rd IEEE International Symposium on Personal Indoor and Mobile Radio Communications (PIMRC), pp. 857-862, 2012.
[24] Thanigaivelu, K., Murugan, K., K-level based transmission range scheme to alleviate energy hole problem in WSN, in Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology, pp. 476-483, 2012.
[25] Nithya, V., Ramachandran, B., Vaishanavi Devi, G., Energy Efficient Tree Routing Protocol for Topology Controlled Wireless Sensor Networks, (2015) International Journal on Communications Antenna and Propagation (IRECAP), 5 (1), pp. 1-6.
[26] Bou-El-Harmel, A., Benbassou, A., Belkadid, J., Design of a Three-Dimensional Antenna UHF in the Form Cubic Intended for RFID, Wireless Sensor Networks (WSNs) and RFID Sensor Networks (RSNs) Applications, (2014) International Journal on Communications Antenna and Propagation (IRECAP), 4 (6), pp. 260-264.
[27] Kandasamy, R., Krishnan, S., Enhanced Energy Efficient Method for WSN to Prevent Far-Zone, (2014) International Journal on Communications Antenna and Propagation (IRECAP), 4 (4), pp. 137-142.
[28] Shankar, T., Shanmugavel, S., Karthikeyan, A., Hybrid Approach for Energy Optimization in Wireless Sensor Networks Using PSO, (2013) International Journal on Communications Antenna and Propagation (IRECAP), 3 (4), pp. 221-226.
[29] Khedher, M., Liouane, H., Douik, A., XOR-Based Routing Protocol for Wireless Sensor Networks, (2015) International Journal on Communications Antenna and Propagation (IRECAP), 5 (2), pp. 70-77.
[30] Shankar, T., Shanmugavel, S., Karthikeyan, A., Modified Harmony Search Algorithm for Energy Optimization in WSN, (2013) International Journal on Communications Antenna and Propagation (IRECAP), 3 (4), pp. 214-220.
Authors' information

1 Faculty of Information and Communication Technology, University Technical Malaysia Melaka, Jalan Hang Tuah, 75300 Melaka / Hang Tuah Jaya, 76100 Durian Tunggal, Melaka, Malaysia. Mobile: +60146330155 - E-Mail: [email protected]

2 Faculty of Information and Communication Technology, University Technical Malaysia Melaka, Jalan Hang Tuah, 75300 Melaka / Hang Tuah Jaya, 76100 Durian Tunggal, Melaka, Malaysia. Phone: +6063316675 - E-Mail: [email protected]

3 Faculty of Information and Communication Technology, University Technical Malaysia Melaka, Jalan Hang Tuah, 75300 Melaka / Hang Tuah Jaya, 76100 Durian Tunggal, Melaka, Malaysia. Phone: +6063316685 - E-Mail: [email protected]

Omar Fouad Mohammed is currently pursuing a PhD in ICT, focusing on Internetworking Technology, at the Faculty of Information and Communication Technology, Universiti Teknikal Malaysia Melaka (UTeM). He received his B.Sc. degree in Computer Science from Al-Mustansiriyah University, Iraq (2006) and his M.Sc. degree in Computer Science (Internetworking Technology) from Universiti Teknikal Malaysia Melaka (UTeM) (2012). His research interests are in wireless sensor networks and simulation.

Associate Professor Dr. Burairah Hussin was born in 1973. He received his B.Sc. in Computer Science from Universiti Teknologi Malaysia (1996), his M.Sc. degree in Numerical Analysis from the University of Dundee, Scotland (1998), and his Ph.D. in Management Science at the Centre for Operational Research and Applied Statistics (CORAS), University of Salford, UK (2007). He is currently a Research Coordinator at the Centre for Advanced Computing Technologies, Faculty of Information and Communication Technology, Universiti Teknikal Malaysia Melaka (UTeM). His main research interests are in operational research and artificial intelligence, and he also has sound expertise in Internetworking Technology. He likes to work closely with faculty and students to identify, implement and review appropriate enhancements to existing practices, responding to the needs of the university.

Associate Professor Dr. Abd. Samad Hasan Basari received his B.Sc. degree in Mathematics from Universiti Kebangsaan Malaysia (1998) and his M.Sc. degree in IT-Education from Universiti Teknologi Malaysia (2004). He obtained his PhD in ICT (Maintenance Modelling Tools) at Universiti Teknikal Malaysia Melaka in 2009. His research interests are in AI applications, decision support technology and modelling.
Effective Clustering of Text Documents in Low Dimension Space Using Semantic Association Among Terms

N. Sivaram Prasad 1, K. Rajasekhara Rao 2

Abstract – Sparse and high dimensional document representation in the popular Vector Space Model results in poor clustering performance. Dimension reduction techniques are useful for a dense and low dimensional representation of documents that enhances clustering performance. This paper proposes a novel unsupervised filter method for feature selection. Filter methods assign weights to the terms used for representation of documents in the collection according to some criterion that is different from the clustering task. Unsupervised feature selection methods do not use class labels to guide the selection of features. The proposed method assigns a score to a term that is proportional to the term's overall semantic association with the rest of the terms in the document collection. The overall semantic association of a term is estimated using the co-occurrence frequencies of the term with the other terms in the collection. Clustering results on three benchmark text data sets, TDT2, Reuters-21578 and 20 Newsgroups, show that the proposed method selects features that are more discriminative in separating the intrinsic classes of documents than those selected by existing unsupervised filter-based feature selection methods. Copyright © 2015 Praise Worthy Prize S.r.l. - All rights reserved.
Keywords: Filter Method, Co-occurrence Frequency, Semantic Association, Term, Text Clustering, Unsupervised Feature Selection
Nomenclature

D  Term document matrix
T  Term vocabulary of the collection, defined as the list of terms used to represent documents in the collection
ti  ith term in the term vocabulary
dj  jth document in the document collection
w(ti, dj)  Weight of the ith term in the jth document
m  Size of the term vocabulary
n  Number of documents in the collection
tcf(t)  Number of times term t occurs in the collection
df(t)  Number of documents in which term t appears
w(t)  Weight assigned by the filter method to term t
A  Term association matrix
ai,j  Element of the term association matrix giving the semantic association between the ith term and the jth term
q  Number of top-ranked terms used for the low dimension representation of documents in the collection
D~  Term document matrix in low dimension space
k  Number of classes present in the chosen data set for clustering
P  Set of partitions / clusters given by the clustering algorithm
|Pj|  Number of documents in the jth cluster
C  Set of intrinsic classes present in the chosen data set
CM  Confusion matrix whose rows and columns are indexed by cluster labels and class labels respectively
cmi,j  Element of the confusion matrix whose value shows how many documents that actually belong to the jth class are placed in the ith cluster by the clustering algorithm
|C|  Number of classes present in the given data set
|P|  Number of partitions / clusters formed by the clustering algorithm
Σi maxj cmij  Summation over each cluster of the maximum number of documents placed in that cluster from a class
Σj maxi cmij  Summation over each class of the maximum number of documents from that class assigned to a cluster
maxi |Pi|  Size of the biggest cluster
maxj |Cj|  Size of the biggest class
ALLF  Clustering performance using all features, averaged over different numbers of classes from 2 to 10
I. Introduction

Document clustering is the task of finding groups of similar documents in a document collection. The similarity between documents is measured with the use of a similarity measure; possible similarity measures for text clustering are described in [1]. Document clustering is useful for document organization to improve retrieval efficiency, for systematic browsing of documents and for summarization of a document collection. The most widely used document representation model is the Vector Space Model (VSM) [2]. A collection of documents in the VSM can be represented using a term document matrix D as in (1):

          d1          d2         ...   dn
  t1 [ w(t1,d1)   w(t1,d2)   ...  w(t1,dn) ]
  t2 [ w(t2,d1)   w(t2,d2)   ...  w(t2,dn) ]
  t3 [ w(t3,d1)   w(t3,d2)   ...  w(t3,dn) ]    (1)
  ...
  tm [ w(tm,d1)   w(tm,d2)   ...  w(tm,dn) ]

where w(ti, dj) is the weight of the term ti in the document dj, m is the size of the term vocabulary of the document collection and n is the number of documents in the collection. A MATLAB toolbox for generating the term document matrix D from a text collection is described in [3]. Popular term weighting schemes for term document matrices are discussed in [3], [4]. The term vocabulary T of a document collection comprises all the terms in the pre-processed documents of the collection, which are used for document representation. Many variations of the VSM have been proposed in [5] that differ in what they consider as a feature or term. The most common approach is to consider the unique words that are present in the document collection after pre-processing as distinct terms. The most common document pre-processing steps are stop-word elimination and stemming. Very frequent words such as articles, prepositions and conjunctions, which almost never have any discriminative capability to find the intrinsic classes of documents, are removed during the stop-word elimination phase. Automatic identification of stop-words in a document collection using statistical tests is discussed in [6]. Documents written in natural language usually contain many morphological variants of a single term (e.g., compute, computing, computational). During stemming, all variants of a term are conflated into a single term stem by the stemming algorithm [7].

The matrix D is highly sparse because very few documents contain many of the terms used for document representation. The high dimensional term space used by the VSM consists of a small number of relevant and a large number of irrelevant terms. Terms that are not discriminative in separating the intrinsic classes of documents are called irrelevant terms. Due to the sparseness of D and the use of a large number of irrelevant terms for document representation, the distance metric used for clustering is disturbed, which results in poor quality clusters. The VSM also assumes that the terms in the term vocabulary are independent of each other; in fact, complex semantic relationships do exist between terms in the term vocabulary [8]. Because of the assumption of term independence in the VSM, a pair of similar documents, where the second document is expressed with words synonymous to those used in the first document, is considered dissimilar and the two documents are placed in different clusters. Hence, it is essential to represent documents in a low dimensional semantic space where only semantically significant terms are used for document representation. The low dimensional semantic representation is essential not only to reduce the computational effort required for document clustering but also to improve clustering performance. Dimension reduction techniques are broadly classified into feature extraction and feature selection techniques [9]. Feature extraction methods find a smaller set of new features that are linear or non-linear combinations of the original features. Since they preserve the relative distances between objects, they are less effective when there are large numbers of irrelevant attributes that hide the clusters; the new features are also difficult to interpret. Feature selection methods can be broadly divided into unsupervised [10], semi-supervised [11] and supervised feature selection [12], according to the use of class label information. They can also be divided into two categories, filters [13] and wrappers [14], according to the evaluation criterion used to search for relevant features. Wrapper methods assess the quality of every candidate feature subset by investigating the performance of the learning (clustering) algorithm, where each candidate subset is obtained by a combinatorial search through the space of all possible feature subsets. Hence, wrappers are usually more computationally demanding, but they can be superior in selecting more relevant features compared to filters, which ignore the properties of the learning task at hand [14]. In contrast, filter methods evaluate the relevance of each feature using its intrinsic properties (e.g., document frequency, feature variance, ability to preserve locality) and assign a score to the feature regardless of the subsequent learning algorithm; they then select the leading features based on the rank of these feature-level scores. This paper focuses on the filtering strategy for its efficiency in handling high dimensional data [15], [16]. The contribution of this paper is two-fold. First, it proposes a novel feature ranking technique for dimensionality reduction using the semantic relationships among the terms used for document representation.
Second, experiments are carried out to evaluate the proposed method in comparison with the state of the art unsupervised filter methods. The remainder of this paper is organized as follows: Section II presents a review of unsupervised filter methods. Section III proposes a novel unsupervised filter method. Section IV describes the experimental design. Section V presents the empirical evaluations of the proposed feature selection method on three real world high dimensional data sets. The concluding remarks are given in Section VI.
II. Existing Unsupervised Filter Methods

II.1. Collection Frequency and Inverse Document Frequency

A Collection Frequency and Inverse Document Frequency (CFIDF) based filter method was proposed by the authors in their previous work [17]. CFIDF ranks a term according to its collection frequency and document frequency and can be expressed mathematically as shown in (2):

  w(t) = tcf(t) · df(t) / log df(t)    (2)

where w(t) is the weight assigned to the term t by the filter method, tcf(t) is the term collection frequency, defined as the total number of occurrences of the term t in the document collection, and df(t) is the document frequency of the term t, given by the total number of documents in which the term t appears. Popular filter-based feature selection methods, namely Term Variance (TV), Term Variance Quality (TVQ), Laplacian Score (LS) and Term Contribution (TC), are also discussed in [17].

III. Proposed Feature Ranking Method

In this section a new feature ranking method, based on the Semantic Association of a Term (SAT), for effective document clustering is proposed.

III.1. Motivation

The goal of any document clustering method is to project documents into a subspace in which documents with different semantics are well separated, while documents with common semantics are clustered together. To accomplish this, documents should be represented using terms that have more discriminative power to separate the intrinsic classes of documents. A term has more discriminating power if its underlying concept has more semantic significance in the context of the given document collection.

III.2. Semantic Association of a Term (SAT)

According to [18], terms are semantically associated if they tend to co-occur frequently (e.g., honey and bee). Thus the underlying concept of a term is not only influenced by the term itself, but also by the terms that co-occur with it. Hence the influence of other terms on the semantic description of a term can be estimated using the co-occurrence frequencies of terms in the context of the documents to be clustered. The proposed method captures the semantic description of terms within the context of the given corpus. An m×m term association matrix A can be computed from the term document matrix D using (3):

  ai,j = 1,  if i = j
  ai,j = [ Σ_{r=1..n} w(ti, dr) · w(tj, dr) ] / [ Σ_{t∈T, t≠ti} Σ_{r=1..n} w(ti, dr) · w(t, dr) ],  if i ≠ j    (3)

where ai,j is the element of matrix A corresponding to the ith row and jth column, w(ti, dj) represents the weight of the term ti in the document dj and T is the term vocabulary of the document collection. The value of ai,j gives the semantic association between the term ti and the term tj. The ith column of A captures the semantic description of the concept underlying the term ti. The magnitude of the ith column vector of A gives the overall semantic association of the term ti with the rest of the terms in the document collection. The higher the overall semantic association of a term is, the more discriminative power it has, and vice versa.

IV. Experimental Design

Experiments are conducted to evaluate the effectiveness of the proposed SAT feature selection technique in comparison with existing unsupervised filter-based feature selection techniques, including TV, TVQ, LS, TC and CFIDF. Documents in each data set are represented using the VSM, in the form of a term document matrix D. Each element w(ti, dj) of the matrix D represents the weight of the term ti in the document dj; in this paper it is simply the frequency of occurrence of the term ti in the document dj. In all the experiments, the terms are ranked based on the score assigned to them by the feature selection technique: the term with the highest score is given the best rank, and so on.
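To make the SAT scoring of Section III.2 concrete, here is a minimal sketch that builds the term association matrix A of Eq. (3) from a toy term document matrix and ranks terms by the magnitude of their column vectors; the toy data and the use of NumPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Toy term-document count matrix D: rows = terms, columns = documents (illustrative data).
terms = ["honey", "bee", "flower", "cluster"]
D = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 1, 0],
    [0, 0, 0, 3],
], dtype=float)

m = D.shape[0]
co = D @ D.T                      # co[i, j] = sum_r w(t_i, d_r) * w(t_j, d_r)
A = np.eye(m)                     # a_{i,i} = 1 by Eq. (3)
for i in range(m):
    denom = co[i].sum() - co[i, i]        # sum over all t != t_i of the co-occurrence with t_i
    for j in range(m):
        if i != j and denom > 0:
            A[i, j] = co[i, j] / denom    # Eq. (3), i != j case

# SAT score of term t_i: magnitude of the i-th column of A (overall semantic association).
sat_scores = np.linalg.norm(A, axis=0)
for t, s in sorted(zip(terms, sat_scores), key=lambda x: -x[1]):
    print(f"{t:8s} SAT = {s:.3f}")
```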
The q number of highly ranked terms are then used for representation of documents in the collection in the form of a modified Term Document matrix D . Documents represented in the reduced dimension space given by the matrix D are partitioned by the standard k-means algorithm [19]. The effectiveness of the feature selection techniques is measured using the clustering performance. For each data set clustering is performed on a document collection that contains k number of classes. For each k, 10 test runs were conducted on different randomly chosen clusters and the average performance is reported in the results tables. To evaluate the effect of number of classes k in the document collection on clustering performance in the reduced dimension space, the value of k is varied from 2 to 10 in steps of 1. To evaluate the effect of number of features q used for document representation on clustering performance, q is varied either from 50 to 325 or from 100 to 375, at increments of 25 for a chosen value of k. The lower limit for q is selected so as to avoid empty clusters that results from insufficient number of terms used for representing documents in the collection. For each combination of k and q clustering is performed using K-Means algorithm. The K-Means algorithm available in MATLAB is used in this paper. As the algorithm randomly chooses initial cluster representatives, for the purpose of reproducing results given in this paper the random number generator algorithm in MATLAB is seeded with the following parameters: twister and 5489. The parameters used for K-Means algorithm are as follows: "Distance" option used is "cosine", "EmptyAction" option is chosen as "singleton", "Start" option is set to "cluster" and the number of replicates is set to 10.
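The clustering protocol just described (cosine distance, 10 replicates, seeded initialisation) can be approximated outside MATLAB. The sketch below is a rough Python/scikit-learn analogue and only an approximation of the described settings: scikit-learn's KMeans uses Euclidean distance, so the documents are L2-normalised first to mimic the cosine option; the placeholder data matrix is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(5489)          # fixed seed, mirroring the seeded MATLAB generator
X = rng.random((200, 50))                  # placeholder document-term matrix (docs x selected terms)

X_cos = normalize(X)                       # L2-normalise so Euclidean k-means approximates cosine distance
k = 5                                      # number of clusters; k is varied from 2 to 10 in the experiments
km = KMeans(n_clusters=k, n_init=10, random_state=5489)   # 10 replicates, deterministic start
labels = km.fit_predict(X_cos)
print(np.bincount(labels))                 # cluster sizes
```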
for the representation of documents in the data sets. The three data sets available at http://www.zjucadcg.cn/ dengcai/Data/TextData.html are used in this paper. The ratio of the size of the biggest cluster to the size of the smallest cluster in a data set is highest for Reuters21578 data set. The class distributions of the data sets are shown in Fig. 1, Fig. 2 and Fig. 3. The 20 Newsgroups data set has most uniform class distribution and Reuters21578 data set has most uneven class distribution. TABLE I CHARACTERISTICS OF TEXT DOCUMENT COLLECTIONS Characteristic TDT2 Reuters 21578 20 Newsgroups Number of documents 9394 8213 18846 Number of terms 36771 18933 26214 99.66 % 99.65 % 99.75 % Sparsity of D Number of classes 30 41 20 Maximum class size 1844 3713 999 Minimum class size 52 10 628 Median class size 131 37 994 Average class size 313 200 942
Fig. 1. Class Distribution of TDT2 data set
IV.1. Benchmark Data Sets Three text document collections, namely TDT2 Reuters-21578 and 20 Newsgroups are considered in evaluating feature selection methods. Vector Space Model (VSM) is used to represent documents in the three collections. No special term weighting measures are used
for document representation and hence w ti ,d j
Fig. 2. Class Distribution of Reuters21578 data set
is
simply frequency of occurrence of the term ti in the document d j . These three documents corpora have been among the ideal test sets for document clustering purpose because documents in the collections are manually clustered based on their topics and each document has been assigned one or more labels indicating which topic / topics it belongs to. Only documents with one class label are considered in all the three data sets. Table I provides the statistics of the three document corpora. The 20 Newsgroups data set's Term document matrix is biggest of all the three Term document matrices used
Fig. 3. Class Distribution of 20 Newsgroups data set
IV.2. Clustering Quality Evaluation As the class label information, given by domain experts after carefully reading each document, is available for all the three bench mark data sets the
clustering results are evaluated by external clustering quality evaluation measures. Two external clustering quality evaluation measures namely Normalized van Dongen (NVD) proposed by [20] and Combined BCubed measure (BCF) proposed by [21] are used. External measures compare the partitioning obtained by the clustering algorithm with a ground truth partitioning created by human annotators. Let P P1 ,P2 , ,Pk be a
text clustering evaluation metrics. The BCF value lies in the interval 0 ,1 and higher values of the measure indicates better clustering performance and vice versa.
V.
Tables II, III, IV, V, VI and VII show the minimum number of features required to achieve clustering performance, greater than or equal to that when all features are considered for document representation. The smallest value for a given number of clusters k is highlighted in boldface. When the smallest value is same for two or more feature selection methods, the value corresponding to the feature selection method that gives rise to the best clustering performance, is highlighted in boldface. The average value of clustering performance using different feature selection methods for a given data set, across a different number of classes k and a different number of features q, is compared with average clustering performance for different values of k when all features are considered in Tables VIII, IX and X. The best value for a given clustering quality evaluation measure is highlighted in boldface.
partitioning of documents obtained by a clustering algorithm, n be the total number of documents in the collection, and | Pi | be the number of documents in
i th cluster. Let C C1 ,C2 , ,Ck be the groundtruth partitioning of the documents, and cmij be the number of documents which actually belongs to the class C j but are placed in the partition Pi by the clustering algorithm. External measures use the contingency matrix to estimate the quality of clustering. CM cmij PC
IV.2.1. Normalized Van Dongen Criterion The van Dongen criterion (VD) [22] was originally proposed for evaluating a graph clustering. VD measures the representativeness of the majority objects in each class and in each cluster. A normalized version of VD proposed by [20] (NVD) as shown in (4) is used in this paper. The NVD value lies in the interval [0,1]. Smaller values of NVD indicate better clustering performance and vice versa:
2n NVD
TABLE II MIN. NUMBER OF FEATURES REQUIRED TO ACHIEVE CLUSTERING PERFORMANCE (NVD) WITH ALL FEATURES ON TDT2 k TV LS TC CFIDF SAT 2 100 550 125 100 75 3 125 425 125 175 125 4 325 325 375 175 75 5 525 250 625 150 75 6 825 > 1500 700 875 1500 7 250 475 375 175 150 8 > 1500 > 1500 1225 525 775 9 550 > 1500 800 925 575 10 350 500 1125 250 200
max j cmij maxi cmij i
j
2n maxi Pi max j C j
Results and Discussion
(4)
TABLE III MIN. NUMBER OF FEATURES REQUIRED TO ACHIEVE CLUSTERING PERFORMANCE (BCF) WITH ALL FEATURES ON TDT2 k TV LS TC CFIDF SAT 2 100 450 125 100 75 3 100 425 100 125 125 4 250 325 375 175 100 5 525 300 625 550 100 6 775 > 1500 700 875 1175 7 250 475 375 175 150 8 > 1500 > 1500 800 > 1500 200 9 575 > 1500 > 1500 925 700 10 350 500 1150 250 200
where n is the number of documents in the collection, expression max j cmij is summation over each cluster
i
maximum number of documents assigned to that cluster maxi cmij is summation over each class from a class,
j
maximum number of documents from that class assigned to a cluster, maxi Pi is the size of the biggest cluster and
max j C j is the size of the biggest class.
TABLE IV MIN. NUMBER OF FEATURES REQUIRED TO ACHIEVE CLUSTERING PERFORMANCE (NVD) WITH ALL FEATURES ON REUTERS k TV LS TC CFIDF SAT 2 50 100 50 50 50 3 50 50 50 50 50 4 350 > 1000 > 1000 > 1000 > 1000 5 175 600 300 100 200 6 100 575 225 100 75 7 325 925 250 175 225 8 550 > 1000 600 500 500 9 600 > 1000 575 475 425 10 400 850 375 350 275
IV.2.2. Combined BCubed Precision and BCubed Recall The description for BCubed precision (BP) and BCubed recall (BR) of a clustering solution is given in [23]. Combining BP and BR in a standard way to obtaine combined BCubed precision and BCubed recall is explained in [21]. This paper uses the notation ‘BCF’ to represent the combined metric. According to [21], this combined metric BCF satisfies all formal constraints on
TABLE V MIN. NUMBER OF FEATURES REQUIRED TO ACHIEVE CLUSTERING PERFORMANCE (BCF) WITH ALL FEATURES ON REUTERS k TV LS TC CFIDF SAT 2 225 175 125 75 75 3 50 50 50 50 50 4 375 > 1000 250 275 700 5 175 775 > 1000 150 350 6 100 575 75 100 75 7 325 750 250 175 275 8 550 > 1000 675 550 500 9 600 > 1000 600 475 425 10 500 > 1000 450 425 275
techniques. Results for the feature selection technique Term Variance Quality (TVQ) are not shown in the tables and figures, because its performance matches exactly with that of Term Variance (TV).
TABLE VI MIN. NUMBER OF FEATURES REQUIRED TO ACHIEVE CLUSTERING PERFORMANCE (NVD) WITH ALL FEATURES ON 20 NEWSGROUPS k TV LS TC CFIDF SAT 2 > 2500 > 2500 > 2500 > 2500 > 2500 3 2450 > 2500 > 2500 > 2500 > 2500 4 1375 > 2500 1675 1275 1275 5 > 2500 > 2500 2500 2500 1975 6 2325 > 2500 > 2500 1575 1525 7 > 2500 > 2500 2325 2475 > 2500 8 > 2500 > 2500 > 2500 > 2500 > 2500 9 > 2500 > 2500 > 2500 1925 2125 10 2250 > 2500 1650 1925 2100
Fig. 4. Clustering performance (NVD) using different number of features for document representation on TDT2 data set
TABLE VII
MIN. NUMBER OF FEATURES REQUIRED TO ACHIEVE CLUSTERING PERFORMANCE (BCF) WITH ALL FEATURES ON 20 NEWSGROUPS

k     TV       LS       TC       CFIDF    SAT
2     > 2500   > 2500   > 2500   > 2500   > 2500
3     > 2500   > 2500   > 2500   > 2500   1975
4     1300     2350     1275     1225     900
5     > 2500   > 2500   > 2500   > 2500   1975
6     2325     > 2500   2400     > 2500   1525
7     > 2500   > 2500   2300     1650     2400
8     > 2500   > 2500   > 2500   > 2500   2150
9     > 2500   > 2500   > 2500   2100     2475
10    > 2500   > 2500   2400     > 2500   2300
Fig. 5. Clustering performance (BCF) using different number of features for document representation on TDT2 data set
TABLE VIII
AVERAGE CLUSTERING PERFORMANCE ON TDT2
(TV, LS, TC, CFIDF, SAT: using features from 50 to 325 at increments of 25; "All": using all features)

Metric   TV      LS      TC      CFIDF   SAT     All
NVD      0.244   0.281   0.278   0.236   0.226   0.228
BCF      0.825   0.806   0.799   0.832   0.842   0.847
TABLE IX
AVERAGE CLUSTERING PERFORMANCE ON REUTERS
(TV, LS, TC, CFIDF, SAT: using features from 50 to 325 at increments of 25; "All": using all features)

Metric   TV      LS      TC      CFIDF   SAT     All
NVD      0.558   0.575   0.559   0.555   0.551   0.548
BCF      0.599   0.588   0.597   0.602   0.604   0.609
Fig. 6. Clustering performance (NVD) using different number of features for document representation on Reuters data set
TABLE X
AVERAGE CLUSTERING PERFORMANCE ON 20 NEWSGROUPS
(TV, LS, TC, CFIDF, SAT: using features from 100 to 375 at increments of 25; "All": using all features)

Metric   TV      LS      TC      CFIDF   SAT     All
NVD      0.729   0.770   0.720   0.727   0.713   0.632
BCF      0.374   0.356   0.378   0.378   0.386   0.433
V. Results and Discussion

Figs. 4 to 9 show the effect of the number of top-ranked features q used for document representation on the clustering performance, averaged over the different numbers of classes from 2 to 10, for the different feature selection techniques. Results for the feature selection technique Term Variance Quality (TVQ) are not shown in the tables and figures, because its performance matches exactly that of Term Variance (TV).
Fig. 7. Clustering performance (BCF) using different number of features for document representation on Reuters data set
Fig. 8. Clustering performance (NVD) using different number of features for document representation on 20 Newsgroups data set
Fig. 9. Clustering performance (BCF) using different number of features for document representation on 20 Newsgroups data set
VI. Conclusion
This paper proposes a novel filter-based unsupervised feature selection method. The proposed method exploits the semantic relationship between the terms used in the representation of the documents in the collection. The method estimates the discriminative power of a term, in separating the intrinsic classes of documents, based on the semantic association of the term with the rest of the terms in the document collection. Clustering performance is highest on the TDT2 data set and lowest on the 20 Newsgroups data set, for all feature selection methods. The poor performance of the feature selection methods for the 20 Newsgroups data set is due to the fact that the dimensionality of that data set's Term-Document matrix is 43% higher than that for the TDT2 data set and 217% higher than that for the Reuters data set. The highly sparse, very high-dimensional Term-Document matrix of the 20 Newsgroups data set leads to a poor selection of features with discriminative power, and hence the selected features are not fruitful in producing good-quality clusters. Though the Reuters data set's dimensionality is only 45% of that of the TDT2 data set, its clustering performance is relatively low because it has the highest ratio of maximum to minimum cluster size. Empirical evaluations demonstrate that the proposed method outperforms the other feature selection methods in selecting features with more discriminative power to separate the intrinsic classes of documents. The current work can be extended by making use of the transitive semantic relationships among terms.
References

[1] Huang, A., Similarity measures for text document clustering, Proceedings of the 6th New Zealand Computer Science Research Student Conference, pp. 49-56, 2008.
[2] G. Salton, A. Wong, C. S. Yang, A Vector Space Model for Automatic Indexing, Communications of the ACM, Vol. 18, n. 11, pp. 613-620, 1975.
[3] Zeimpekis, D., Gallopoulos, E., TMG: A MATLAB toolbox for generating term-document matrices from text collections, in Kogan, J., Nicholas, C. H., Teboulle, M. (Eds.), Grouping Multidimensional Data (Berlin Heidelberg: Springer-Verlag, 2006, pp. 187-210).
[4] Ke, W., Information-theoretic term weighting schemes for document clustering and classification, International Journal on Digital Libraries, pp. 1-15, 2014.
[5] M. Keikha, N. S. Razavian, F. Oroumchian, H. S. Razi, Document representation and quality of text: an analysis, in M. W. Berry, M. Castellanos (Eds.), Survey of Text Mining II (London: Springer-Verlag, 2008, pp. 219-232).
[6] Wilbur, W. J., Sirotkin, K., The automatic identification of stop words, Journal of Information Science, Vol. 18, n. 1, pp. 45-55, 1992.
[7] M. F. Porter, An algorithm for suffix stripping, Program: Electronic Library and Information Systems, Vol. 40, n. 3, pp. 211-218, 2006.
[8] Zapata Becerra, A. A., Escuela de Idiomas Modernos, in Zapata Becerra, A. A., A Handbook of General and Applied Linguistics (Venezuela: Trabajo de ascenso sin publicar, 2000).
[9] P. Cunningham, Dimension Reduction, in M. Cord, P. Cunningham (Eds.), Machine Learning Techniques for Multimedia (Berlin Heidelberg: Springer-Verlag, 2008, pp. 91-112).
[10] Dy, J. G., Brodley, C. E., Feature selection for unsupervised learning, The Journal of Machine Learning Research, Vol. 5, pp. 845-889, 2004.
[11] Zhao, Z., Liu, H., Semi-supervised Feature Selection via Spectral Analysis, Proceedings of the 7th SIAM International Conference on Data Mining, pp. 641-646, 2007.
[12] Robnik-Šikonja, M., Kononenko, I., Theoretical and empirical analysis of ReliefF and RReliefF, Machine Learning, Vol. 53, n. 1-2, pp. 23-69, 2003.
[13] A. L. Blum, P. Langley, Selection of Relevant Features and Examples in Machine Learning, Artificial Intelligence, Vol. 97, n. 1, pp. 245-271, 1997.
[14] R. Kohavi, G. H. John, Wrappers for feature subset selection, Artificial Intelligence, Vol. 97, n. 1, pp. 273-324, 1997.
[15] Cantú-Paz, E., Newsam, S., Kamath, C., Feature Selection in Scientific Applications, Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 788-793, 2004.
[16] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, The Journal of Machine Learning Research, Vol. 3, pp. 1157-1182, 2003.
[17] Prasad, N. S., Rao, K. R., Subspace clustering of text documents using collection and document frequencies of terms, International Review on Computers and Software (IRECOS), Vol. 9, n. 10, pp. 1692-1699, 2014.
[18] C. Chiarello, C. Burgess, L. Richards, A. Pollock, Semantic and Associative Priming in the Cerebral Hemispheres: Some words do, some words don't ... sometimes, some places, Brain and Language, Vol. 38, n. 1, pp. 75-104, 1990.
[19] J. Hartigan, M. Wong, Algorithm AS 136: A k-means clustering algorithm, Applied Statistics, Vol. 28, n. 1, pp. 100-108, 1979.
[20] Wu, J., Xiong, H., Chen, J., Adapting the right measures for k-means clustering, Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 877-886, 2009.
[21] E. Amigó, J. Gonzalo, J. Artiles, F. Verdejo, A comparison of extrinsic clustering evaluation metrics based on formal constraints, Information Retrieval, Vol. 12, n. 4, pp. 461-486, 2009.
[22] S. van Dongen, Performance Criteria for Graph Clustering and Markov Cluster Experiments, 2000.
[23] Bagga, A., Baldwin, B., Entity-based cross-document co-referencing using the vector space model, Proceedings of the 17th International Conference on Computational Linguistics, pp. 79-85, 1998.
Authors' information

1 Department of Information Technology, Bapatla Engineering College, Bapatla, Andhra Pradesh, India. E-mail: [email protected]

2 Department of Computer Science and Engineering, Sri Prakash College of Engineering, Andhra Pradesh, India. E-mail: [email protected]

N. Sivaram Prasad is working as a professor in the Information Technology department of Bapatla Engineering College, Bapatla, India. He received his Master's degree (M.Tech.) in computer science and engineering from Jawaharlal Nehru Technological University, Hyderabad, India, in 2001, and his Bachelor's degree (B.Tech.) from Acharya Nagarjuna University, Guntur, India, in 1995. His research interests include data mining and digital image processing.

K. Rajasekhara Rao is working as a professor in the Computer Science and Engineering department of Sri Prakash College of Engineering, Tuni, India. He received his PhD degree in computer science and engineering from Acharya Nagarjuna University, Guntur, India, in 2008, his Master's degree (MS) in software systems from BITS Pilani, India, in 1992, and his Bachelor's degree (B.Tech.) in electronics and communication engineering from Acharya Nagarjuna University, Guntur, India, in 1985. His research interests include data mining and embedded systems. Dr. Rajasekhara Rao is a fellow of IETE and a life member of IE, ISTE, ISCA and CSI.
International Review on Computers and Software (I.RE.CO.S.), Vol. 10, N. 5 ISSN 1828-6003 May 2015
Scrapple: a Flexible Framework to Develop Semi-Automatic Web Scrapers Alex Mathew, Harish Balakrishnan, Saravanan P. Abstract – The World Wide Web is the biggest source of data that the general public has access to. Students and researchers working on data-related problems need a well-maintained source to get their data from. Most online services provide APIs to use their service. However, this may not provide all the data that they actually need even though it exists on the website. In this paper, "Scrapple" – a framework for creating semi-automatic web scrapers, which extract required data from the Web, is introduced. Users do not need to write the entire scraping programs - they only need to define a configuration file which is used to build the required scraper. The configuration provides support for CSS selectors and XPath expressions. Copyright © 2015 Praise Worthy Prize S.r.l. - All rights reserved.
Keywords: Web Scraping, Web Wrapper, CSS Selector, XPath Expressions
I. Introduction

The Internet is a huge source of information. Several people may use data from the Internet to perform various activities, like research or analysis, business intelligence and analytics. Everything that we need can be found on the web. However, there are two primary issues involved with using data from the Internet:
• One may not have any way to get information from a particular website, i.e., it may not provide an API for accessing the data.
• Even if an API is provided, it may not give all the data needed. It is possible that there may be some data that is present on the web interface, but not provided through the API.
This is where web scrapers and web crawlers play a role.

Web scrapers: These, also called extractors, are used to extract content from any particular page. They may use CSS selectors or XPath expressions to point to a particular tag in the HTML structure of the page, and extract the content from that tag. The content extracted could be the text within particular tags, the links from anchor tags, and so on.

Web crawlers: These are scripts that go through multiple links from a single base page. Given a base URL, the crawler uses this page as an index page to many different pages that are linked from it. It goes through each of these pages, extracting the required content along the way.

Scrapers and crawlers can be written to extract the necessary content from any page one needs information from. But writing them is a tedious process, and analyzing the web pages before data extraction is time consuming. Extracting data from pages that are poorly structured is an arduous task, since the web scraper programs have to change accordingly. Here Scrapple, a framework which takes a configuration file as input from the user and extracts the relevant and required data from the web pages, is presented. The configuration file takes as input the base page and the CSS selectors/XPath expressions of the web content required.

II. Related Work

Data extraction from the web can be classified based on the selectors used. Selectors can be CSS selectors or XPath expressions. CSS selectors are said to be faster and are used by many browsers. Ducky [1] uses CSS selectors for extracting data from pages that are similarly structured. CSS selectors [2] provide a good balance between structure and attributes. Though property values cannot be specified as expressions, they are widely used due to their simplicity and flexibility. They can traverse down a DOM (Document Object Model) [3] according to the path specified in the selector expression. On the other hand, XPath [4] expressions are more reliable, handle text recognition better and are a powerful option to locate elements when compared to CSS selectors. Since XPath can traverse both up and down the DOM, it is widely used for easy navigation through the page to locate the elements searched for. Much research is presently going on in this area. OXPath [5] provides an extension of XPath expressions. The system created by V. Crescenzi, P. Merialdo, and D. Qiu [6] uses XPath expressions for locating the training data to create queries posed to the workers of a crowdsourcing platform.
Systems like Ducky [1] and Deixto [7] use the concept of Configuration files where the user inputs the simple details like base pages, a “next” column if there are multiple pages to be parsed. Deixto [7] uses the concept of tag filtering where the unnecessary html tags can be ignored when the DOM (Document Object Model) tree is created. H. A. Sleiman and R. Corchuelo [8] discuss and compare various systems based on performance, input and output, effectiveness and algorithm used. Scrapy [9], an open source project, provides the framework for web crawlers and extractors. This framework provides support for spider programs that are manually written to extract data from the web. It uses XPath expression to locate the content. The output formats of Ducky [1] and Scrapy [9] include XML, CSV and JSON files.
III. Proposed Framework

Scrapple helps to reduce the hassle of manually writing the scripts needed to extract the required content, as in Scrapy [9]. It involves the use of a configuration file that specifies property-value pairs [10] for the various parameters involved in constructing the required script. The configuration file is a JSON document consisting of the required key-value pairs. The user specifies the base URL of the page to work on, and also the tags of the data to be extracted. The user has a choice between CSS selectors and XPath expressions for specifying the target tags. Once the target tags have been specified, the other parameters are filled in and the configuration file is completed. This configuration file is used by Scrapple to generate the required script, and the execution is performed to generate the output of the scraper as a JSON or a CSV document, depending on the user's choice. Thus, users can obtain the data they need without having the extensive programming expertise required to manually write the scripts. Scrapple uses the lxml package [11] rather than the BeautifulSoup package for parsing the HTML structure, since benchmark measurements [12] show a difference in parsing speed of around 10 times.

IV. Implementation

Architecture: Scrapple provides a command line interface (CLI) to access a set of commands which can be used for implementing various types of web content extractors. The basic architecture of Scrapple explains how the various components are related.
Fig. 1. lxml vs. BeautifulSoup performance measures
Command line input: The command line input is the basis of definition of the implementation of the extractor. It specifies the project configuration and the options related to implementing the extractor. Configuration file: The configuration file specifies the rules of the required extractor. It contains the selector expressions for the data to be extracted and the specification of the link crawler.
Fig. 2. Scrapple Architecture
Extractor framework: The extractor framework handles the implementation of the parsing and extraction. It follows these steps:
• It makes HTTP requests to fetch the web page to be parsed.
• It traverses the element tree of the page.
• It extracts the required content, depending on the extractor rules in the configuration file.
In the case of crawlers, this process is repeated for all the pages that the extractor crawls through.
scraping: Specifies the parameters for the extractor to be created. The parameters under scraping are as follows:
– url: Specifies the URL of the base web page to be loaded.
– data: Specifies a list of selectors for the data to be extracted.
  * selector: Specifies the selector expression.
  * attr: Specifies the attribute to be extracted from the result of the selector expression.
  * field: Specifies the field name under which this data is to be stored.
  * default: Specifies the default value to be used if the selector expression fails.
– next: Specifies the crawler implementation.
  * follow_link: Specifies the selector expression for the links to be crawled.
Data format handler: According to the options specified in the CLI input, the extracted content is stored as a CSV document or a JSON document. Commands: There are 4 basic commands provided by the Scrapple CLI namely genconfig, generate, run and web.
The main objective of the configuration file is to specify extraction rules in terms of selector expressions and the attribute to be extracted. There are certain set forms of selector/attribute value pairs that perform various types of content extraction.
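To illustrate how such a selector/attribute pair drives the extraction, here is a minimal sketch (not Scrapple's actual generated code; the URL and XPath expression are those of the paper's Fig. 5) using the lxml package:

```python
import json
import requests
from lxml import html

# One "data" rule in the style of the configuration files shown in Figs. 4 and 5
config = {
    "url": "http://pyvideo.org/video/1785/python-for-humans-1",
    "data": [
        {"field": "title", "attr": "text", "selector": "//h3", "default": ""},
    ],
}

page = html.fromstring(requests.get(config["url"]).content)
record = {}
for rule in config["data"]:
    nodes = page.xpath(rule["selector"])
    if not nodes:
        record[rule["field"]] = rule["default"]
    elif rule["attr"] == "text":
        record[rule["field"]] = nodes[0].text_content().strip()
    else:
        # e.g. attr = "href" to pull a link out of an anchor tag
        record[rule["field"]] = nodes[0].get(rule["attr"], rule["default"])

print(json.dumps(record, indent=2))
```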
genconfig: This command is used for generating the skeleton configuration file. The basic syntax for genconfig is: scrapple genconfig
Output file formats: Scrapple provides two formats for output files: JSON and CSV. On specification of the format in the --output_type argument, the required file format is generated. By default, the output file format is JSON.
generate: This command generates a Python executable file which can then be used on its own for scraping the data for the specified web page configuration file. The basic syntax for generate is:
JSON: JavaScript Object Notation (JSON) files are easy to understand and create. They are easy to parse, understand and write. JSON is a language-independent format, and hence many APIs use it as a data-interchange format. A few of the data types in JSON are:
• Object: an unordered set of name/value pairs.
• Array: a set of values, enclosed in square brackets, with the values separated by commas.
• Name: the field that describes the data.
• Value: the input data for the name attribute. It can be a number, a Boolean value (true or false) or a string (enclosed in double quotes).
scrapple generate

run: This command generates the resultant output file for the configuration file fed as input. The basic syntax for run is:

scrapple run

web: This command provides a web interface for Scrapple. The interface provides a form which, when submitted, creates the configuration file. The syntax for using the web interface is:

scrapple web
Configuration file: The configuration file is the basic specification of the extractor required. It contains the URL of the web page to be loaded, the selector expressions for the data to be extracted and, in the case of crawlers, the selector expression for the links to be crawled through. The keys used in the configuration file are:
project_name: Specifies the name of the project with which the configuration file is associated.
selector_type: Specifies the type of selector expressions used. This could be "xpath" or "css".

CSV: Comma Separated Values (CSV) files consist of tabular data where the fields are separated by commas and the records by lines. CSV is stored in plain-text format, and CSV files are easy to handle and manipulate. For example, the table:
Name   Marks   Grade   Promotion
John   96      O       True
Doe    45      F       False
can be represented as:
John,96,O,True
Doe,45,F,False
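As an illustration of how a data format handler can serialize extracted records to either format (a minimal sketch, not Scrapple's implementation; the file names are hypothetical):

```python
import csv
import json

# Records as produced by an extractor run
records = [
    {"Name": "John", "Marks": 96, "Grade": "O", "Promotion": True},
    {"Name": "Doe", "Marks": 45, "Grade": "F", "Promotion": False},
]

# JSON output
with open("output.json", "w") as f:
    json.dump({"data": records}, f, indent=2)

# CSV output
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
```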
V. Experimentation and Results

There are two main types of tools that can be implemented with the Scrapple framework: single page linear scrapers and link crawlers.

Single page linear scrapers: Scrapple's single page content extraction was tested on the page "Python for Humans".

Fig. 3. Data extraction web page (http://pyvideo.org/video/1785/python-for-humans-1)

Consider that one wants to extract details of the talk, for instance the title of the talk, the event name and the speaker name. The basic configuration file is first generated, and the selectors where the data fields can be located in the page are filled in. For the generation of the skeleton configuration file, the genconfig command is used. The skeleton configuration file generated and the completed configuration file are shown in Fig. 4 and Fig. 5.

{
  "scraping": {
    "url": "http://pyvideo.org/video/1785/python-for-humans-1",
    "data": [
      {
        "field": "",
        "attr": "",
        "selector": "",
        "default": ""
      }
    ]
  },
  "project_name": "pyvideo",
  "selector_type": "xpath"
}

Fig. 4. Skeleton configuration file generated

{
  "scraping": {
    "url": "http://pyvideo.org/video/1785/python-for-humans-1",
    "data": [
      {
        "field": "title",
        "attr": "text",
        "selector": "//h3",
        "default": ""
      },
      {
        "field": "speaker_name",
        "attr": "text",
        "selector": "//div[@id='sidebar']//dd[2]//a",
        "default": ""
      },
      {
        "field": "event_name",
        "attr": "text",
        "selector": "//div[@id='sidebar']//dd[1]//a",
        "default": ""
      }
    ]
  },
  "project_name": "pyvideo",
  "selector_type": "xpath"
}

Fig. 5. Complete configuration file

On giving the run command, the content extraction is initialized. Once the extraction process is complete, the resultant output file has the content requested.

{
  "project": "test1",
  "data": [
    {
      "event_name": "PyCon US 2013",
      "speaker_name": "Kenneth Reitz",
      "title": "Python for Humans"
    }
  ]
}

Fig. 6. Output JSON file

Link crawler: Consider the event listing page as the base page (Fig. 7). To generate a skeleton configuration file for a crawler, one adds "--type=crawler" to the end of the genconfig command, and the skeleton configuration file for the crawler is generated. Assume one wants to extract the details of all the talks in the page. The configuration entry for where the talk links can be found is written, and the details to be extracted from each page are provided with the help of the selectors. The complete configuration file is run with the run command, and the output JSON file with all the talks and their details is created. On giving the generate command, the extractor program is generated as a Python file which can be run whenever required.
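The way a "next"/"follow_link" rule drives the crawl can be sketched roughly as follows (a hedged sketch, not Scrapple's implementation; helper names are hypothetical, and records from nested levels are kept separate here for brevity rather than merged as in the real output):

```python
from urllib.parse import urljoin

import requests
from lxml import html

def extract(page, rules):
    record = {}
    for r in rules:
        nodes = page.xpath(r["selector"])
        record[r["field"]] = nodes[0].text_content().strip() if nodes else r["default"]
    return record

def crawl(url, scraping, results, seen):
    """Extract the 'data' fields of this page, then recurse into every 'next' rule."""
    if url in seen:
        return
    seen.add(url)
    page = html.fromstring(requests.get(url).content)
    if scraping.get("data"):
        results.append(extract(page, scraping["data"]))
    for nxt in scraping.get("next", []):
        for link in page.xpath(nxt["follow_link"]):
            href = link.get("href")
            if href:
                crawl(urljoin(url, href), nxt["scraping"], results, seen)
```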
{ "project":"pyvideo", "data":[ { "talk_title":"Boston Python Meetup: ...", "speaker":"Stephan Richter", "event":"Boston Python Meetup" }, { "talk_title":"Boston Python Meetup: ...", "speaker":"Marshall Weir", "event":"Boston Python Meetup" }, { "talk_title":"November 2014 ...", "speaker":"AsmaMehjabeen Isaac Adorno", "event":"ChiPy" },
Fig. 7. Data extraction page for Link crawler (http://pyvideo.org/category) { "scraping":{ "url":"http://pyvideo.org/category/", "data":[ { "field":"", "attr":"", "selector":"", "default":"" } ], "next":[ { "follow_link":"//table//td[1]//a", "scraping":{ "data":[ { "field":"event", "attr":"text", "selector":"//h1", "default":"" } ], "next":[ { "follow_link":" \ //div[@class='video-summary-data']/div[1]//a", "scraping":{ "data":[ { "field":"talk_title", "attr":"text", "selector":"//h3", "default":"" }, { "field":"speaker", "attr":"text", "selector":" \ //div[@id='sidebar']//dd[2] \ ", "default":"" } ] } } ] } } ] }, "project_name":"pyvideo", "selector_type":"xpath"
### talk_list.json continues { "talk_title":"Python 2.7 & Python 3: ...", "speaker":"Kenneth Reitz", "event":"Twitter University 2014" } ] } Fig. 9. Part of the output JSON file
The CSV file output is generated on specification of the --output_type in the run command as CSV. Fig.10 shows the CSV file for the event listing with talk and event URL included.
Fig. 10. Output CSV file
Comparison between Scrapple & Ducky: From the experiments performed with the Scrapple framework, it is found that correctly written configuration files give accurate results. In the single page linear extractor example and link crawler example (where over 2800 pages were crawled through), an accuracy level of 100% was achieved. The accuracy of the implementation of the Scrapple framework is dependent on the user’s understanding of web structure and the ability to write correct selector expressions. On comparison with Ducky [1], it can be seen that Ducky also provides an accuracy of 100%. The primary difference between the Scrapple framework and the Ducky framework is the features provided.
}
Fig. 8. Complete configuration file for link crawler
TABLE I
DUCKY VS SCRAPPLE FEATURE COMPARISON

Feature                           Ducky   Scrapple
Configuration file                Yes     Yes
CSS selectors                     Yes     Yes
XPath selectors                   No      Yes
CSV output                        Yes     Yes
JSON output                       Yes     Yes
XML output                        Yes     No
Generation of extractor script    No      Yes
VI. Conclusion and Future Work

The goal of Scrapple is to provide a generalized solution to the problem of web content extraction. The framework requires a basic understanding of web page structure, which is needed to write the selector expressions. Using these selector expressions, the required web content extractors can be implemented to generate the desired datasets. Experimentation with a wide range of websites gave consistently accurate results in terms of the generated dataset. However, larger crawl jobs took a long time to complete, and it was necessary to run the execution in one stretch. Scrapple could be improved to provide restartable crawlers, using caching mechanisms to keep track of the position in the URL frontier. Tag recommendation systems could also be implemented, using complex learning algorithms, though there would be a trade-off on accuracy.

References

[1] Kei Kanaoka, Yotaro Fujii and Motomichi Toyama, Ducky: A Data Extraction System for Various Structured Web Documents, Proceedings of the 18th International Database Engineering & Applications Symposium (IDEAS '14), pp. 342-347, New York, NY, USA, 2014, ACM.
[2] Selectors. [Online] Available: http://www.w3.org/TR/CSS21/selector.html
[3] W3 Document Object Model. [Online] Available: http://www.w3.org/DOM/
[4] XPath support in ElementTree. [Online] Available: http://effbot.org/zone/element-xpath.htm
[5] T. Furche, G. Gottlob, G. Grasso, C. Schallhart, and A. Sellers, OXPath: A language for scalable data extraction, automation, and crawling on the deep web, The VLDB Journal, Vol. 22, n. 1, pp. 47-72, Feb. 2013.
[6] V. Crescenzi, P. Merialdo, and D. Qiu, Alfred: Crowd assisted data extraction, Proceedings of the 22nd International Conference on World Wide Web Companion (WWW '13 Companion), pp. 297-300, Republic and Canton of Geneva, Switzerland, 2013, International World Wide Web Conferences Steering Committee.
[7] F. Kokkoras, K. Ntonas, and N. Bassiliades, DEiXTo: A web data extraction suite, Proceedings of the 6th Balkan Conference in Informatics (BCI '13), pp. 9-12, New York, NY, USA, 2013, ACM.
[8] H. A. Sleiman and R. Corchuelo, A survey on region extractors from web documents, IEEE Trans. Knowl. Data Eng., Vol. 25, n. 9, pp. 1960-1981, July 2012.
[9] Scrapy: A fast and powerful scraping and web crawling framework. [Online] Available: https://www.scrapy.org
[10] Boris Katz, Sue Felshin, Deniz Yuret, Ali Ibrahim, Jimmy Lin, Gregory Marton, Alton Jerome McFarland, Baris Temelkuran, Omnibase: Uniform Access to Heterogeneous Data for Question Answering, Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB 2002), Stockholm, Sweden, June 2002.
[11] lxml - Processing XML and HTML with Python. [Online] Available: http://lxml.de/
[12] BeautifulSoup vs. lxml benchmark. [Online] Available: http://blog.dispatched.ch/2010/08/16/beautifulsoup-vs-lxml-performance/

Authors' information

School of Computing, SASTRA University, Thanjavur, Tamilnadu, India.

Alex Mathew received his B.Tech degree in Computer Science and Engineering from SASTRA University, Tamilnadu, India in 2015. His current research interests lie in the fields of data mining and data analytics. He has conducted beginner-level Python workshops under the Computer Society of India, SASTRA Chapter, and has attended several developer conferences.

Harish Balakrishnan received his B.Tech degree in Computer Science and Engineering from SASTRA University, Tamilnadu, India in 2015. His current research interests include data mining and cloud computing. He has a few publications in international conferences. Mr. Harish Balakrishnan is a student member of the Computer Society of India.

P. Saravanan received his B.E degree in Computer Science and Engineering from Madras University, Tamilnadu, India and the M.E degree in Computer Science and Engineering from Anna University, Chennai, India, in 2001 and 2006 respectively. He has worked in many engineering colleges at various academic levels. Currently he is working at SASTRA University, Thanjavur, Tamilnadu, India as an Assistant Professor. He has many publications in national and international conferences and journals. Mr. P. Saravanan is a life member of the Indian Society for Technical Education and a member of the Computer Society of India.
International Review on Computers and Software (I.RE.CO.S.), Vol. 10, N. 5 ISSN 1828-6003 May 2015
Fast and Efficient Indexing and Similarity Searching in 2D/3D Image Databases Y. Hanyf1, H. Silkan2, H. Labani1 Abstract – The D-index is among the most efficient similarity search indexes, but its performance depends dramatically on the choice of the ρ parameter. In this paper we propose a new criterion and a technique that ensure a good choice of the D-index ρ parameter, reducing the D-index searching cost with an acceptable construction cost. We present our results on two 2D/3D image databases: the Amsterdam Library of Object Images (ALOI-1000), which consists of 72000 color images of views, and the COIL-100 database, which consists of 7200 color images of views. We describe each view by a combination of three well-established descriptors from the MPEG-7 standard (CSD, SCD and EHD), and we use the distance measures recommended by MPEG-7 to compare the views. The results obtained prove the search efficiency of the proposed method against the original D-index and the sequential method. Copyright © 2015 Praise Worthy Prize S.r.l. - All rights reserved.
Keywords: Multimedia Retrieval, Metric Access Methods, D-Index, Metric Space, 2D/3D Images Database
The task of similarity search is to retrieve the k objects most similar to the query object, or all database objects within a certain distance of the query object. Almost all such databases can be viewed as a metric space, defined by a set of objects and a function d used to compare the similarity between objects, where d is called a distance or metric and satisfies the three following properties: strict positiveness (d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y), symmetry (d(x, y) = d(y, x)), and the triangle inequality (d(x, z) ≤ d(x, y) + d(y, z)). The main goal of metric access methods is to reduce as much as possible the number of distance evaluations needed to solve a query. These methods can be classified into two categories: pivot-based indexes and partitioning indexes. The partitioning-based MAMs, like the D-index [1] and GNAT [2], [3], are designed to partition metric spaces into subsets and to discard regions that do not intersect with the query ball at searching time. The pivoting MAMs, like AESA [2], [4], LAESA [2], [5] and EP [6], are based on the triangle inequality to discard objects and avoid some distance computations. Because computing the similarity between two 3D objects is very expensive, metric access methods have a major importance in Content-Based 3D Model Retrieval (CBMR). In the literature, many works have used metric access methods for searching and indexing 3D image databases; as examples, in [7] Aouat, Saliha, et al. have proposed a modified quad-tree for indexing 3D objects, and Lazaridis, Michalis, et al. have successfully used the inverted file to propose a complete framework for
Nomenclature

D-index            Metric access method for searching in metric data sets, proposed by Dohnal et al. [1]
ρ-split functions  Functions which partition objects into clusters separated by a distance of 2ρ
Buckets            The D-index clusters
ρ                  The distance which separates the D-index buckets
h                  The number of levels of the D-index structure
h_p                When the construction process is carried out at a level i, the number of previous levels
h_r                When the construction process is carried out at a level i, the number of remaining levels
st                 The value used to decrease the ρ parameter during the selection process
K-NN               The k nearest neighbors of a given query

I. Introduction
Thanks to the developments of multimedia databases in various forms such as 2D images, 3D images, videos, fingerprints and texts etc., the similarity search in these databases has a very important application in various fields, such as text retrieval, computational biology and pattern recognition.
multimodal searching and indexing of rich multimedia databases, including 3D objects [8]; and Hassan Silkan et al. [9] have proposed a method for the automatic selection of optimal views using an incremental algorithm based on the pivot selection techniques of proximity searching in metric spaces. Although metric access methods are successfully used for searching 3D datasets, they are still far from meeting all expectations; among the open problems we may mention the curse of dimensionality, the searching cost, the construction cost and the selection of parameters. In this work we address the problem of choosing good parameters for reducing the search cost. We especially propose criteria and techniques to choose a good ρ value for the D-index [1], [10] in order to improve the searching cost at an acceptable construction cost, and we test the proposed method on real 2D/3D image databases. The rest of the paper is organized as follows: related works are presented in Section II; the proposed method is described in Section III; Section IV presents the similarity measure and feature extraction used for indexing the 2D/3D image databases; the experimental results are shown in Section V; and we conclude in Section VI.
II. Related Works

II.1. D-Index

The D-index, considered as one of the fastest MAMs available, was presented in [1]. Its principal idea is based on partitioning the data into subsets organized in h levels; each level with n split functions contains 2^n separable buckets, in addition to a further bucket, called the exclusion bucket, that captures all the objects that cannot be stored in the previous levels. An example of a D-index structure is shown in Fig. 1.

Fig. 1. Structure of D-index (Level 0, n = 2: 4 buckets; Level 1, n = 2: 4 buckets; Level 2, n = 1: 2 buckets; plus the exclusion bucket)

The D-index construction is based on particular functions, called ρ-split functions, whose goal is to partition the database into buckets separable up to 2ρ; several such functions are proposed in [11]. For example, the ball-partitioning ρ-split function bps (Eq. (1)) uses one reference object x_v and the medium distance d_m to partition the data into three subsets (see Fig. 2):

$$bps^{1,\rho}(x) = \begin{cases} 0 & \text{if } d(x, x_v) \le d_m - \rho \\ 1 & \text{if } d(x, x_v) > d_m + \rho \\ - & \text{otherwise} \end{cases} \qquad (1)$$

Fig. 2. Data partitioning by the bps^{1,ρ}(x) ρ-split function

When a range query has a search radius up to some predefined ρ, at most one bucket has to be visited per level, plus the exclusion bucket. At the same time, the use of a pivot-filtering strategy significantly decreases the number of distance computations in the accessed buckets.
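To make the split function concrete, here is a small sketch of Eq. (1) (hypothetical helper names, not code from the paper):

```python
def bps(x, pivot, d_m, rho, dist):
    """First-order ball-partitioning rho-split function of Eq. (1)."""
    d = dist(x, pivot)
    if d <= d_m - rho:
        return 0        # first separable set
    if d > d_m + rho:
        return 1        # second separable set
    return "-"          # exclusion set

# Toy example: 1-D points with the absolute difference as metric
points = [1.0, 2.5, 4.0, 5.5, 7.0, 10.0]
print([bps(p, 5.0, 3.0, 1.0, dist=lambda a, b: abs(a - b)) for p in points])
```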
According to [1], the D-index has several qualitative properties; the most important ones can be summarized as follows:
• An object is typically inserted at the cost of one block access.
• The number of bucket accesses is at most h + 1 for all response sets whose distance to the most dissimilar object does not exceed ρ.
• For r = 0, a successful search is typically solved with one block access, and an unsuccessful search usually does not require any access.
It is well known that the D-index is among the most efficient metric access methods, but there is a critical bottleneck that limits its performance: the selection of the ρ value. In [1], the authors show that the ρ value dramatically affects the searching cost, yet they do not propose any way to choose an adequate ρ value; they build several structures using random ρ values and compare them experimentally in order to determine the best value among those compared, and they admit that the selection of optimized parameters is still an open research issue. In order to solve this problem, we propose, in the next section, a modified version of the D-index.
III. Proposed Method
The main objective of this modification is to make the D-index able to choose an adequate ρ in order to partition the data. First, we rely on an analysis of the search cases to propose criteria for selecting the ρ value, and then we modify the D-index to achieve the proposed criteria at an acceptable cost. We presented the seed of this idea in [10], but it needs more details, clarifications and illustrations.
III.1. D-Index Efficiency Criterion
The main advantage of the D-index is its ability to store objects in separable buckets in order to avoid access to some buckets during the search. If r ≤ ρ, the search algorithm accesses just one bucket at each level, and if r > ρ, the search algorithm needs to access more than one bucket plus the exclusion bucket. At the time of executing a query (q, r), there are four possible situations:
1. The query is exclusively contained in a separable bucket.
2. The query is exclusively contained in the exclusion bucket.
3. The query is in a separable bucket and intersects with the exclusion bucket.
4. The query is contained in a separable bucket or in the exclusion bucket and intersects with more than one separable bucket.

Fig. 3. Range search situations in two-dimensional space

Fig. 3 shows the previous situations in a two-dimensional space. The first situation is the most desirable one, because the search is carried out just in the concerned separable bucket; the search is more efficient in this situation if ρ is maximal and the exclusion bucket contains a large number of objects. The second situation is desirable when the exclusion bucket is small, since only the exclusion bucket is visited during the search. Because the search algorithm always visits the exclusion bucket, it should contain few objects. The final situation is the most undesirable one, because it requires access to all separable buckets that intersect with the range query, in addition to the exclusion bucket. Based on the previous analysis, we can conclude that the ideal D-index structure is the one which has the least chance of meeting situation 4 and a high possibility of meeting the first situation at search time. To decrease the possibility of meeting situation 4, the ρ value must be as large as possible; this will increase the possibility of meeting the third situation, so the exclusion bucket should contain few objects because it will be visited frequently.

Therefore, to guarantee an efficient partitioning of the D-index, the structure should respect the two following criteria:
• maximizing the distance between separable buckets (maximizing ρ);
• minimizing the exclusion bucket size.

III.2. Adequate ρ for Each Level

In the original D-index, all buckets are separated by a fixed ρ. Since the D-index ρ-split functions use different reference objects to create separable buckets at each level, it is hard to ensure the previous criteria with a fixed ρ. Accordingly, we propose to use a specific ρ_i to partition the data at each level i.
This change in the D-index structure imposes a small change on the search algorithms; indeed, in the range search and in the nearest-neighbours search algorithms, every occurrence of ρ is replaced by the ρ_i of level i. In Algorithm 1 we present the modified range search algorithm, while Algorithm 2 is the modified K-NN search algorithm.

ALGORITHM 1: MODIFIED RANGE SEARCH (the original D-index range search with the fixed ρ replaced by the level-specific ρ_i at each level i)

ALGORITHM 2: MODIFIED K-NN SEARCH (the original D-index K-NN search with the fixed ρ replaced by the level-specific ρ_i at each level i)
III.3. D-Index Partitioning Strategy

It is clear that if we maximize ρ, the number of objects in the exclusion bucket will be maximal too, and if we minimize the number of objects in the exclusion bucket, ρ will also be minimized. In Algorithm 3, we propose a strategy that balances the number of objects in the exclusion bucket with the size of ρ: we maximize ρ_i as far as possible while respecting the requirement that the exclusion bucket should not hold more than p per cent of the database objects. When the structure contains only one level, the allowed number of objects in the exclusion bucket can be computed easily with the familiar formula n·p/100, where p is the desired percentage and n is the number of objects. But in general, when the structure contains many levels, some additional effort is needed to compute the number of objects that the exclusion bucket should contain. Suppose the structure contains h levels and n objects; in order to balance the structure, each level should contain n/h objects, which means that the portion of each level in the total proportion can be computed as (n/h)·(p/100). So at level i, the exclusion bucket should contain the sum of the previous levels' portions in addition to the objects of the next levels. In order to compute the required number of objects in the exclusion bucket at level i, we adopt the following notation: h_r is the number of remaining levels and h_p is the number of previous levels. The required size of the exclusion bucket at level i can then be computed as follows:

$$\frac{n\,h_r}{h} \;+\; \frac{n\,h_p}{h}\cdot\frac{p}{100}$$

The previous formula can be expressed as:

$$\frac{n\,(100\,h_r + p\,h_p)}{100\,h}$$

Briefly, the size of the exclusion bucket at level i must be close to n(100·h_r + p·h_p)/(100·h), where h_r is the number of remaining levels, h_p is the number of previous levels, h is the number of levels and p is the desired proportion of objects in the exclusion bucket.

In the following process, we describe the proposed partitioning strategy. For each level i:
1. At the beginning, ρ_i is set to its maximal value, computed from the maximal and minimal distances between the reference objects and the database objects, so that all objects are inserted into the exclusion bucket, i.e. ρ_i is maximized.
2. If the exclusion bucket holds more than n(100·h_r + p·h_p)/(100·h) objects, decrease the value of ρ_i by the decreasing step (st).
3. Reinsert all the exclusion bucket objects into the adequate buckets.
4. Repeat steps 2 and 3 until the number of objects in the exclusion bucket is less than or equal to n(100·h_r + p·h_p)/(100·h).

ALGORITHM 3: INSERTION (the D-index insertion algorithm extended with the ρ decreasing process described above; k denotes the precision of the selection)

Fig. 4 shows an example of the ρ decreasing process in the two-dimensional space. The partitioning by ρ1 overfills the exclusion bucket of the structure; the decreasing process reduces ρ1 by the decreasing step st and reinserts the exclusion objects using ρ2 (ρ2 = ρ1 − st). The number of objects in the exclusion bucket is thereby reduced, but the exclusion set still does not satisfy the requirement. Finally, ρ2 is decreased by st and, after the reinsertion using ρ (ρ = ρ2 − st), the exclusion bucket size satisfies the condition. In order to integrate the ρ decreasing process into the D-index, we modify the insertion algorithm of the D-index (see Algorithm 3).

Fig. 4. Decreasing process in two-dimensional space
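As an illustration, here is a hedged sketch of the per-level ρ selection (assumed helper names; this is not the authors' Algorithm 3, which additionally distributes the reinserted objects into their separable buckets at every step and assumes 1-based level numbering):

```python
def required_exclusion_size(n, h, i, p):
    """Target exclusion-bucket size at level i: n*(100*h_r + p*h_p) / (100*h)."""
    h_p, h_r = i - 1, h - i          # previous and remaining levels
    return n * (100 * h_r + p * h_p) / (100 * h)

def select_rho(objects, split, rho_start, target, k=100):
    """Shrink rho by the decremental step st = rho/k until the exclusion set is small enough."""
    rho = rho_start
    exclusion = [x for x in objects if split(x, rho) == "-"]
    while len(exclusion) > target and rho > 1e-9:
        rho -= rho / k               # decreasing step st
        exclusion = [x for x in objects if split(x, rho) == "-"]
    return rho, exclusion
```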
III.4. Construction Cost Reduction Policy
In agreement with [1], it is true that the common assumption is that the cost to build and maintain an index structure is much less important compared to the costs that are needed to execute a query.
But the construction cost of metric access methods has a significant importance for large databases; methods should have an acceptable construction cost, because structure building becomes a real challenge in the case of large databases.

The proposed technique is not without cost, but we have taken care to minimize the construction cost as far as possible in order to make it acceptable. The policy used is based on two axes: we use an economical decreasing step (st) and an economical reinsertion policy.

III.4.1. Decreasing Step st

The construction cost of the modified D-index decreases when the ρ selection loop is minimal, so st should be chosen in a way that accelerates the satisfaction of the criteria. To achieve this, st should be as large as possible, but this raises another problem: the precision with which the criteria are satisfied, since the process can stop even though the criteria are far from being satisfied. The precision problem can be solved by using a small decreasing step. In order to minimize the ρ selection loop while ensuring a good precision, we propose to use a decremental st: at the beginning, st is large, when the criteria are far from being satisfied, and it decreases as the process approaches satisfaction. For these reasons we propose the decreasing step st = ρ/k, where the decreasing precision k is a value that represents the precision of the structure construction. A large value of k means that the decreasing step (st = ρ/k) is small, which implies that the exclusion buckets will contain nearly p per cent of the objects.

III.4.2. Reinsertion Policy

In the D-index, the insertion of objects requires the computation of distances between objects and reference objects. In order to avoid recomputing distances, for each reinsertion we memorize all the distances computed during the first insertion; these are used in the next reinsertions, and as soon as the final ρ_i has been selected the stored distances are freed.

IV. Similarity Search in 2D/3D Database

In order to prove the performance of the proposed method, we have tested it on 2D/3D databases. Indeed, the 2D/3D approach is largely used to represent 3D objects; it is based on the representation of objects by a set of characteristic views, because extracting view features is cheaper than obtaining the features of the 3D object directly [9]. In this paper, we propose to use a combination of three metric descriptors from the MPEG-7 global visual descriptors, two color descriptors and one texture descriptor, to ensure a good representation of the visual features of each view. They are as follows:
CSD: in order to index an image, the Color Structure Descriptor [12] relies on both the color distribution of the image and the local spatial structure of the color; we choose the 64-dimension version of this descriptor. The similarity is measured by the L1 distance.
SCD: the Scalable Color Descriptor [12] is derived from a color histogram in the Hue-Saturation-Value color space with fixed space quantization. We used the 128-coefficient version of this descriptor. The distance between two scalable color instances is measured by the L1 metric.
EHD: the Edge Histogram Descriptor [13] represents the local edge distribution in the image. The image is subdivided into 4×4 sub-images, and the edges in each sub-image are categorized into five types: vertical, horizontal, 45° diagonal, 135° diagonal, and non-directional edges. This results in 80 coefficients (5 values for each of the 16 sub-images) representing the local edge histograms. Furthermore, semi-global and global histograms can be computed based on the local histograms, and the distance is computed as a sum of weighted sub-sums of absolute differences for the local, semi-global and global histograms.
The distance between two views is computed by the linear combination of the three descriptor distances:

$$Dist(v_1, v_2) = \alpha\, d_{CSD}(v_1, v_2) + \alpha\, d_{SCD}(v_1, v_2) + \alpha\, d_{EHD}(v_1, v_2)$$

where α = 1/3 and d_SCD, d_CSD and d_EHD are the distance functions used for matching the SCD, CSD and EHD descriptors. Since d_SCD, d_CSD and d_EHD are metrics, the Dist() function is a metric too.
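A minimal sketch of this combined view distance (the descriptor-distance callables are assumed to be provided elsewhere):

```python
ALPHA = 1.0 / 3.0

def view_distance(v1, v2, d_csd, d_scd, d_ehd):
    # Each d_* is assumed to be a metric, so the weighted sum is a metric too.
    return ALPHA * (d_csd(v1, v2) + d_scd(v1, v2) + d_ehd(v1, v2))
```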
V. Results
The proposed method has already shown a significant efficiency on a 2D image database and on a genes database [9]. In these experiments, we focus on the evaluation of the querying performance on real 2D/3D image databases: we test the searching efficiency of the proposed modified D-index and compare it with the original D-index. The original and the modified D-index are implemented with the MSL [14] library.
V.1. Data Sets

The ALOI database [15] is a collection of 1000 small 3D objects; for each object, the frontal camera was used to record 72 aspects by rotating the object in the plane at 5° resolution, so each object is represented by 72 views. COIL-100 [16] (see Fig. 6) is a collection of 100 objects; the objects were placed on a motorized turntable against a black background, and the turntable was rotated through 360 degrees to vary the object pose with respect to a fixed color camera. Images of the objects were taken at pose intervals of 5 degrees, so each object is represented by 72 views.

Fig. 5. ALOI data set

Fig. 6. COIL data set

V.2. Searching Performance Evaluation

In order to compare the modified D-index with the original one on the 2D/3D image databases, we have created three original D-index structures (D-index1, D-index2 and D-index3) for each database, and we have compared them with the structure constructed by the modified D-index. For each database, the original D-index structures use three ρ values chosen randomly between the minimum and the maximum ρ generated by the modified D-index. The modified D-index structures with a small proportion of objects in the exclusion buckets are the most efficient; for this reason, we have chosen the D-index%1 structure for the comparison, i.e. the structure whose exclusion buckets contain around 1% of the objects.

The search costs are measured in terms of distance computations. All the presented cost values are averages obtained by executing 100 different query objects. All modified D-index structures are created using a fixed value of the decreasing precision (k = 100). If two pivots are close, they partition the data in nearly the same way, so in this experiment we have taken care to choose distant pivots; the pivots are views belonging to different 3D objects, because the views of one object are probably close to each other. The comparison results of the K-NN queries on the ALOI database are presented in Fig. 7, and those of the range queries in Fig. 8. Compared with the sequential search, we observe that both the classic and the modified D-index provide an interesting reduction of the distance computations (between 90% and 20% of the distances). We remark that D-index%1 is more efficient than D-index1 for all numbers of requested nearest neighbours, more efficient than D-index2 for 1 to 10 nearest-neighbour queries and more efficient than D-index3 for 1 to 5 nearest-neighbour queries. We can conclude that the modified D-index (D-index%1) chooses a good ρ value for indexing the ALOI database for K-NN queries; in particular, D-index%1 is especially efficient for queries with small numbers of nearest neighbours.

Fig. 7. Comparison of nearest neighbors search efficiency for ALOI database

Fig. 8. Comparison of range search efficiency for ALOI database
We have reached the same conclusion from the comparison results of the range queries (see Fig. 8): the modified D-index (D-index%1) chooses a good ρ value for indexing the ALOI database for range queries; in particular, D-index%1 offers an almost ideal efficiency for small-radius range queries. Fig. 9 presents the comparison results of the K-NN queries on the COIL database, and the range query results are given in Fig. 10. We remark that the modified D-index ensures a good ρ value; this follows from the fact that D-index%1 is more efficient than the classical D-index structures in solving all the nearest-neighbour and range queries used in the experiments.

Fig. 9. Comparison of nearest neighbors search efficiency for COIL database

Fig. 10. Comparison of range search efficiency for COIL database
VI. Conclusion

In almost all MAMs proposed in the literature, the performance depends directly on the way some parameters are chosen. Although many research works propose techniques to choose the parameters of some MAMs, many recent MAMs are still proposed with randomly chosen parameters. In this work, we have proposed a modified D-index that uses criteria and techniques to ensure a good selection of the ρ value for the D-index at an acceptable construction cost, and we have tested the efficiency of the proposed technique on real 2D/3D image databases. Compared with the original D-index, the experiments show that the proposed technique ensures good performance.

References

[1] V. Dohnal, C. Gennaro, P. Savino, P. Zezula, D-Index: Distance Searching Index for Metric Data Sets, Multimedia Tools and Applications, Vol. 21, pp. 9-33, 2003.
[2] E. Chavez, G. Navarro, R. Baeza-Yates, J. L. Marroquín, Searching in metric spaces, ACM Computing Surveys, Vol. 33, n. 3, pp. 273-321, 2001.
[3] S. Brin, Near neighbor search in large metric spaces, Proceedings of the 21st Conference on Very Large Databases (VLDB '95), pp. 574-584, 1995.
[4] E. Vidal, An algorithm for finding nearest neighbours in (approximately) constant average time, Pattern Recognition Letters, Vol. 4, pp. 145-157, 1986.
[5] M. L. Micó, J. Oncina, E. Vidal, A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements, Pattern Recognition Letters, Vol. 15, pp. 9-17, 1994.
[6] G. Ruiz, F. Santoyo, E. Chavez, K. Figueroa, and E. Tellez, Extreme Pivots for Faster Metric Indexes, 6th International Conference on Similarity Search and Applications (SISAP 2013), A Coruña, Spain, October 2-4, 2013, Proceedings.
[7] A. Saliha, L. Nacéra, S. Feryel, and L. Slimane, 3D object indexing and recognition, Applied Mathematics and Computation, Vol. 196, n. 1, pp. 318-332, 2008.
[8] L. Michalis, A. Apostolos, R. Dimitrios, and D. Petros, Multimedia search and retrieval using multimodal annotation propagation and indexing techniques, Signal Processing: Image Communication, Vol. 28, n. 4, pp. 351-367, 2013.
[9] H. Silkan, Y. Hanyf, A New Efficient Optimal 2D Views Selection based on Pivot Selection Techniques for indexing and retrieval of 3D models, Mediterranean Conference on Information & Communication Technologies 2015, Proceedings.
[10] Y. Hanyf, H. Silkan, and H. Labani, Criteria and Technique for choice good ρ value for D-index, First International Conference on Intelligent Systems and Computer Vision (ISCV), 2015, Proceedings.
[11] V. Dohnal, C. Gennaro, P. Savino, and P. Zezula, Separable splits in metric data sets, 9th Italian Symposium on Advanced Database Systems, Venice, Italy, June 2001, pp. 45-62, LCM Selecta Group, Milano, Proceedings.
[12] T. Sikora, The MPEG-7 visual standard for content description - an overview, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, n. 6, pp. 696-702, 2001.
[13] P. Wu, Y. M. Ro, C. S. Won, and Y. Choi, Texture descriptors in MPEG-7, Computer Analysis of Images and Patterns, pp. 21-28, Springer Berlin Heidelberg, 2001.
[14] K. Figueroa, G. Navarro, E. Chavez, Metric Spaces Library, Available: http://www.sisap.org/Metric_Space_Library.html, 2007.
[15] J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam Library of Object Images, International Journal of Computer Vision, Vol. 61, n. 1, pp. 103-112, January 2005.
[16] S. A. Nene, S. K. Nayar and H. Murase, Columbia Object Image Library (COIL-100), Technical Report CUCS-006-96, February 1996.
Authors' information
1 Université Chouaib Doukkali, Faculté des Sciences, Laboratoire LAMAPI, El Jadida, Maroc.
2 LIMA, Laboratoire d'Informatique, Mathématiques et leurs Applications, El Jadida, Maroc.
Youssef Hanyf is a PhD student in the Department of Informatics and Mathematical Sciences, Chouaib Doukkali University, Morocco. His current research interests are information retrieval, similarity search, content-based image retrieval, multimedia databases and multimedia indexing.

Halima Labani received her PhD in Mathematics from Cadi Ayyad University, FSSM, Morocco. Currently, she is a professor at Chouaib Doukkali University, Department of Mathematics, Faculty of Sciences, El Jadida, Morocco. Her research interest is mathematics and its applications.

Hassan Silkan received his PhD in Mathematics and Computer Science from Sidi Mohamed Ben Abdellah University, FSDM, Morocco. Currently, he is a professor at Chouaib Doukkali University, Department of Computer Science, Faculty of Sciences, El Jadida, Morocco. His research areas are shape representation and description, similarity search, content-based image retrieval, database indexing, and multimedia databases.
International Review on Computers and Software (I.RE.CO.S.), Vol. 10, N. 5 ISSN 1828-6003 May 2015
Testing Patterns in Action: Designing a Test-Pattern-Based Suite Bouchaib Falah1, Mohammed Akour2, Nissrine El Marchoum1
Abstract – Design patterns constitute a revolution in the field of software engineering, as they emphasize the importance of reuse and its impact on the software process and the software product quality. A special type of design patterns is testing patterns; these can be used in the testing phase to reduce redundancy, save time and resources, and provide an effective reuse mechanism for more coverage and better quality of service at the same time. Many design patterns exist to test different aspects of the implemented functionality separately. In this paper, however, we suggest a new concept, which consists of incorporating different testing patterns into the same test suite to test different aspects while running one single test suite exactly once. It will also allow users to track the performance of their test suite quality attributes using a simple representation. Copyright © 2015 Praise Worthy Prize S.r.l. - All rights reserved.
Keywords: Software Testing, Test Patterns, Design Patterns, Quality Metrics, Testing Quality
I.    Introduction

When the object oriented paradigm came as a revolution in the software engineering field, it brought one of its most useful principles: reuse. This principle has evolved from simply reusing declarations or code fragments into software reuse and even logic reuse, which is the key motivation behind what is referred to as software design patterns. These patterns are used to save software engineers, developers and software testers the burden of wasting time and resources on producing redundant modules. The idea of reusing logic to solve similar types of problems has worked well over the last decades and has proved to be a smart choice when it comes to dealing with recurring engineering problems. Design patterns in general, and testing patterns especially, have been a revelation in the field of software engineering. These patterns are considered a way to capture the knowledge and experience of the software engineering community and build a basis for reuse [1]. With the growing popularity of patterns and their incorporation into the software engineering process, many issues arose, such as compatibility, efficiency and effectiveness, along with suitability to the current scenarios. Many argued that the best way to deal with these challenges is to accompany the use of patterns with documentation [1]. This explains the remarkable number of publications suggesting new concepts such as pattern contracts, pattern test case templates and many other documentation techniques to guarantee that the patterns at hand can be used efficiently and tailored easily to the different requirements of each single system.

It goes without saying that one of the main goals of the testing activity in the software life cycle is to identify faults and errors in the system and fix them through predefined techniques. One key consideration that software testers worry about the most is the quality of the software being tested. Existing applications of testing patterns vary from drafts of concepts to actual implementations. However, these applications involving test patterns usually focus on generating separate test cases for separate quality goals. The underlying idea of this paper is to combine the existing knowledge in the field to come up with a sample test suite that incorporates different test patterns to improve different quality goals in one single test suite. Although working towards realizing multiple quality goals concurrently introduces tradeoffs, we can still come up with combinations of test patterns that can at least improve the overall performance of the testing process and satisfy the quality requirements of the customer. As much as there are countless applications of design patterns to different scenarios ranging from software design to software deployment, our ultimate focus is to emphasize the importance of using test patterns in the software testing process to enhance the overall testing quality. Therefore, the focus will be on test patterns. We will first give a glance at the existing literature related to the topic in Section II, then depict how test patterns are used to test software in Section III, then suggest our testing concept incorporating test patterns into one single test suite in Section IV and show an example in Section V. Finally, we briefly mention the future work in Section VI and then conclude.
II.    Related Work

Over the last decade, the effectiveness of using design patterns for software design and implementation has been questioned by many researchers in the field. Soundarajan and Hallstrom [2], for example, suggest a solution to test how a given pattern, used to solve a specific problem stated in the requirements, complies with the requirements of the pattern itself. This technique tests whether a pattern has been used correctly so as to enable its use for multiple systems and avoid redundancy. This emphasizes the main motivation behind using patterns in the software engineering process, which is the effective reuse of existing solutions and their tailoring to every single system accordingly. For this purpose, the authors used what is called Pattern Test Case Templates (PTCT), a codified solution of the pattern that identifies all the defects associated with its implementation in different software systems. Even after the rise of testing patterns as a class of patterns of its own, these patterns were also used in association with PTCTs to evaluate their performance. As [1] explains, these pattern test case templates are not only extracted from the requirements' contracts to predict the faults, but also act as a starting point for test suite generation depending on the intended outcomes. Another important advantage of the technique suggested by this paper is the early detection of design bugs, which are usually hard to localize and get expensive as they go undetected throughout the software engineering process. Another interesting take on the topic is depicted by Feudjio and Schieferdecker [3], who take testing patterns to a further level of automation. With the growing customer needs and the short time to market, test automation appears to be an appealing solution for testers and an effective way to test similar modules with slight modifications. This paper outlines some design patterns for reactive software systems and shows how important it is to have such patterns in an environment where interactivity is a key characteristic of the units under test. In the same context of automated testing using patterns, Rybalov [4] also emphasizes how crucial automated testing is to ease the process and help testers and developers do refactoring easily. However, the author applies automated testing with patterns to customer testing. The author draws attention to an important fact: test cases are programs, and therefore they are also vulnerable to any faults and errors the application code might have. This is an aspect of testing that has been overlooked for quite long, as testing programs have always been considered to be simple and easy. However, with the evolving complexity of units under test, writing effective test cases has also become an issue that takes up the time of testers and developers. This is where test patterns come into play, as they help minimize the cost of updating and refactoring test cases as the application code under test evolves. So the main motivation behind using patterns for testing instead of traditional test code is the reduction of maintenance costs by providing test cases which are easily adaptable to change.
III. Testing Patterns in Practice

In this section, we outline some applications of test patterns for different purposes. Although the core motivation of using these patterns is to optimize the testing process, there are many enhancements and real-world applications that show the different contexts in which a test pattern can be used successfully. Testing patterns are divided into two main categories: extensible test patterns and static ones. The former use dynamic features such as dynamic binding and dynamic typing, which makes the testing process using them fairly complex and potentially prone to errors [9]. With the increasing development speed, as well as the competitive edge each firm has to bring compared to the other entities in the market, shortening the time span allocated to testing while achieving a fault coverage as high as possible has become an increasingly important concern in the field of software testing. As a way to realize these two objectives (better quality testing and high fault coverage) at once in the most efficient way, Automatic Test Case Generation (ATCG) tools come into play [5]. These tools generate fault models and provide a mechanism to measure the fault coverage of the test patterns used. They also ensure an efficient and effective test case generation process. Test patterns can be used separately, but also in combination with other test patterns to achieve different quality goals. When this happens, a methodology needs to be used to coordinate between them. Among these algorithms we find test ordering algorithms and test data flow controlling algorithms [6]. These algorithms help prioritize test cases and pick which ones should go first. This way redundancy is avoided and tradeoffs between the different patterns are reduced. Other algorithms, on the other hand, focus on reducing test data efficiently even with multiple fault models [7]. There is also a tool used to measure the quality of test patterns themselves. It uses the ratio of the conditional branches covered by the pattern to the total number of conditional branches [8]. This ratio, called "the branch pass index", has given software testers a benchmark with which to evaluate the quality of test patterns.
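Written out explicitly (a restatement of the ratio just described, with the symbolic form chosen here only for illustration), the branch pass index of a test pattern is:

$$\text{branch pass index} \;=\; \frac{\text{conditional branches covered by the pattern}}{\text{total conditional branches in the unit under test}}$$

A value close to 1 therefore indicates that the pattern exercises nearly all of the decision logic of the unit under test.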
IV.    Patternsuite: The Testing-Pattern-Based Software Testing Suite

This technique uses an iterative, incremental six-step process. It is based on eliciting quality metrics and requirements from the customer, mapping them to generated test patterns, then tracking the metrics' realization and creating test suites based on the findings. This process, illustrated in the flow chart in Fig. 1, is thoroughly explained in this section.
Fig. 1. The six-step process of creating a test-pattern-based test suite

IV.1. Service Agreements with the Customer

The concept of service agreements is very popular in the cloud computing field. Service-level agreements have been used over the last decade to specify the level of service clients are interested in out of the wide variety offered by the cloud under the form of Software as a Service (SaaS) [11]. In the field of software testing, a new concept seems to follow the same logic; it is referred to as customer-scenario-focused end-to-end testing. This method is essential to ensure the quality of software products, as it is usually impossible for the test engineers to cover all customer scenarios in the designed test suites. That is why this method suggests a common technique that models scenario-based test suites converging towards high customer scenario coverage [12]. The way this concept is incorporated into our PATTERNSUITE technique is depicted in Fig. 2. In order to realize high coverage, the customer scenarios are modeled in terms of the services provided by the unit under test, using the Persona-Experience-Usecase hierarchy. The personas refer to the different user categories, the experience to the different functionality expected by each individual user class, and the usecase classifies the functionality requirements of each user category [12]. During this step, a clear formulation of the intended quality goals of the customer is done. Instead of taking a whole set of raw requirements, the testing engineers should help the customer refine the quality requirements and come up with a precise and, most importantly, ordered list of priorities in terms of quality goals.

Fig. 2. Scenario-based test service level agreement generation in PATTERNSUITE

IV.2. Automatic Test Pattern Generation

Automatic test pattern generation tools have attracted researchers for over thirty years in spite of the many challenges that arise with their use, such as scalability, flexibility, and the ability to handle different fault models at once [10]. Automatic Test Pattern Generation (ATPG) tools do not only provide high quality testing for different fault models; they also help identify design faults early on, which is an invaluable capability given the difficulty of pointing out those faults early in the software development stage [10]. ATPG tools function in two main phases: fault activation and fault propagation. The fault activation phase consists of producing a value at the fault site. In the fault propagation phase, the effect previously produced is propagated from the initial fault location to a primary output site [10]. The goal of the process is to come up with a test sequence (test input) that allows the testers to distinguish between the correct path and the one affected by the propagated fault. The effectiveness of the test sequence is assessed by measuring the coverage of the fault model. The effectiveness also depends on the fault model used, the unit under test, the abstraction level at which it has been represented and other quality considerations. As for the efficiency of the whole ATPG tool, it is directly related to the decision making throughout the process of giving adequate inputs to the tool [10]. This step of the process is the most innovative idea that this concept brings to software testing. Instead of selecting traditional existing testing patterns, we can use automation to save time and obtain more accurate patterns based on the program logic and data. Automatic Test Pattern Generation Tools (ATPGT) are in charge of generating fault models. These fault models can be used in combination with the existing testing techniques to measure the fault coverage and to train the testing process so as to reach the best possible coverage. They also generate test patterns that can be reused over time to maximize reusability. Those test patterns can then be fed to a finite state machine (FSM) that tests the unit under test accordingly.
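As a purely illustrative toy example of the activation-and-propagation goal described above (not part of the PATTERNSUITE proposal, and with every function name invented here for the sketch), a test input distinguishes a fault-free unit from a faulty one whenever their outputs differ:

```python
from itertools import product

# Fault-free toy unit: out = (a AND b) OR c
def good_unit(a, b, c):
    return (a and b) or c

# Hypothetical faulty version: the AND site is stuck at 0
def faulty_unit(a, b, c):
    return False or c

def generate_test_patterns():
    """Return every input that activates the fault and propagates it to the output."""
    patterns = []
    for a, b, c in product([False, True], repeat=3):
        if good_unit(a, b, c) != faulty_unit(a, b, c):   # outputs diverge -> fault detected
            patterns.append((a, b, c))
    return patterns

print(generate_test_patterns())   # [(True, True, False)]
```

In the software-testing setting of this paper the "unit" is a program module and the fault models come from the ATPG tool, but the selection principle is the same: keep the inputs on which the fault-free and faulty behaviours diverge.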
IV.3. Quality Metrics Establishment

Identifying and establishing the right metrics for software testing in general is not an easy task. It is a tedious process and usually involves a great deal of experience from the testing engineers. That is why various benchmarks are usually used to set specific values or thresholds for the quality metrics extracted in the first phase. Although using benchmarks and predefined values is of great help, the experience and intuition of software testers remain a plus. It is through dealing with recurring problems that testers develop an experience that allows them to tell, approximately, the extent to which quality metrics are realized - or not - by a given test suite. The quality metrics used for evaluating the software testing process are quantitative: rather than giving a broad description of "how well the test case does", ratios, percentages and frequencies are calculated in order to give an accurate measurement of the quality of the testing process. In this step, the testing engineers establish thresholds and values for the specific quality metrics that have been discussed in the first step. These quality metrics are usually quantitative (percentage, number, fraction):
- Fault coverage.
- Conditional branch coverage.
- Reaction to bugs.
- Cost to maintain tests.
- Time of execution.

IV.4. Test Case Design and Pattern Incorporation

After generating patterns using an automatic test pattern generation tool, it is up to the software testers to design test cases that incorporate these patterns. ATPG tools usually result in a test case which consists of test inputs. Designing test cases that incorporate the test pattern mainly aims at transforming the raw test case into an object-oriented entity which fulfills the object-oriented paradigm characteristics, such as encapsulation, separation of concerns, cohesion, etc. The traditional way of selecting test patterns would typically consist of finding the testing pattern that meets the quality requirements of the customer. Table I outlines an example of a small selection of testing patterns and their purpose. If the test patterns were to be selected manually by the software tester, the task would be to select the test pattern whose target quality metrics comply with those of the customer. But since anything done manually involves a great deal of error risk, it is better to use automatic test pattern generation tools. In this step, based on the output of the automatic test pattern generation tools, the testers design test cases that incorporate the resulting patterns.

TABLE I
A SAMPLE SELECTION OF TESTING PATTERNS ALONG WITH THEIR TARGET QUALITY GOALS

Test Pattern | Usage | Target Quality Metric
Template Pattern | Testing units, components and subsystems without changing the original code | Customer satisfaction; customer testing
Object Genie | Using a mother class to do the recurring instantiation of all the classes important to the UUT | Customer satisfaction; low coupling
Domain Test Object | Encapsulation of visual parts of the application into objects for reuse | Fast and effective GUI testing; good and easy acceptance testing
Transporter | Creation of a central entity which navigates through the units under test | Fast automated testing; good customer testing
Dumb Testing | Reduce knowledge about navigability to check logic and user interface details | Environment-change-agnostic testing strategy; low coupling
Independent Testing | The state of the UUT remains unchanged no matter if the test succeeds or fails | Stability; coherence
Don't Repeat Yourself (DRY) | Avoid repeated code or logic to optimize testing code | Lower test maintenance cost; no duplication
Multiple Failures | Continue to execute even if failures occur | Robustness; high fault coverage

Depending on the services required by the customer, in addition to the quality metrics set in the previous steps, the testers need to proceed by trial and error to make candidate test suites that combine these test cases.
IV.5. Quality Metrics Tracking Dashboard

As discussed in an earlier section, coverage measurement tools provide many mechanisms to visualize the output of the different testing techniques in terms of measured performance. Our technique can implement a user-friendly interface that allows the user to display the recorded quality metrics either individually or as charts and comparative views of the presented data, and therefore to track the testing performance and take decisions accordingly.

IV.6. Decision Making: Final Test Suite Selection

According to the statistics displayed on the dashboard, the testing engineers pick the combination of generated testing patterns that results in the best values of the selected quality metrics. This way the final test suite maximizes the quality metrics based on the customer's requirements.
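A minimal sketch of this decision step is given below. It assumes each candidate suite already carries its measured metrics; the candidate values, threshold numbers and function names are hypothetical, chosen only to mirror the kind of data shown later in the example of Section V. It applies the prioritization idea of the service-level agreement: filter by the agreed thresholds, then break ties on the first metric in the customer's priority list.

```python
# Hypothetical candidate suites: name -> (branch index, average execution time in seconds)
candidates = {
    "suite-1": (0.20, 1.0),
    "suite-2": (0.60, 1.0),
    "suite-3": (0.70, 8.0),
    "suite-4": (0.80, 3.0),
}

# Thresholds taken from the service-level agreement (illustrative values)
MIN_BRANCH_INDEX = 0.75
MAX_EXEC_TIME = 4.0

def select_final_suite(suites):
    """Keep suites meeting every threshold, then prefer the highest branch index
    (first metric on the customer's priority list), then the lowest execution time."""
    eligible = {name: (bi, t) for name, (bi, t) in suites.items()
                if bi >= MIN_BRANCH_INDEX and t <= MAX_EXEC_TIME}
    pool = eligible or suites          # fall back to all suites if none qualifies
    return max(pool, key=lambda name: (pool[name][0], -pool[name][1]))

print(select_final_suite(candidates))  # -> "suite-4"
```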
V.    Patternsuite Example
In this section, we will provide a general example to show how the suggested technique can be applied.
We suppose that we have a meeting with a client for requirements elicitation, from which we derive the quality requirements, and that we use an automatic test pattern generation tool that we will call PatternSuperTool.

V.1. Service Agreement Phase

The service level agreement with the client states the following. The first quality requirement is that the test cases applied to the unit under test should try to cover as many conditional branches as possible. Second, test cases run by the client's team during acceptance testing should not take too long to execute. Other quality requirements were discussed during the meeting, but the goal of this specific service level agreement is to focus on those two quality requirements specifically.

V.2. Automatic Test Pattern Generation

In this step, we run PatternSuperTool, feeding it with some sample data provided by the customer. This data will be used to generate fault models, and then to generate four different test patterns that we will call p1, p2, p3 and p4.

V.3. Quality Metrics Establishment Phase

Based on the service-level agreement, we establish a threshold for both quality metrics as follows: the required ratio of the branch index is 3/4; each test case should have an average execution time of 4 seconds.

V.4. Test Case Design and Pattern Incorporation and Dashboard Display Phase

Now that we have set up the quality metrics, we design test cases that use the test patterns generated in the second phase. The next step consists of creating all the possible combinations of test suites using these test cases. Let us consider the test cases t1, t2, t3 and t4 as being the test cases incorporating the generated patterns p1, p2, p3 and p4 respectively. Creating the test suites and computing the metrics for each case gives the results in Table II.

TABLE II
QUALITY METRICS MEASURED FOR CANDIDATE TEST SUITES

Test Suite | Test Cases | Branch Index | Average Execution Time
1 | t1, t2, t3 | 20% | 1 s
2 | t1, t2, t4 | 60% | 1 s
3 | t1, t3, t4 | 70% | 8 s
4 | t2, t3, t4 | 80% | 3 s

In this example we only focus on test suites that have three test cases, for simplicity. This data is displayed in a user interface that the testers can have easy access to. Data can be displayed either in table format or as a graph to give more visibility to the testers and help them in the decision making process.

V.5. Final Test Suite Selection

This is the final phase, where the testers decide which test suite from the candidates will be picked. In this case, test suite 4 is the best test suite because it gives the best results for both quality metrics, and it will therefore be selected as the final one. We should mention that the choice is not always obvious, as sometimes there is no best solution. However, in situations of ties, the testers should pick the test suite that gives the best value of the first quality metric on the list. This is where the service level agreements play their most prominent role: prioritization.

VI.    Future Work

Now that the concept is described, the future contribution would be a blueprint for a real-life application. A typical blueprint would tackle a common application example and produce all the deliverables, including the service level agreements, the test cases along with their originating fault models, the candidate test suites, the quality metrics and the results displayed on a dashboard in a simple user-friendly interface. The next step should compare the requirements realization using this technique to the already existing techniques; a benchmark should be chosen against which the comparison should be done. This would demonstrate the feasibility of the project, along with the advantages that it brings compared to the traditional testing techniques.

VII.    Conclusion

Software testing patterns are indeed a great asset at the disposition of software testers. The smart use of this tool can result in a remarkable shift in the realization of the fixed quality metrics by any given testing team. Furthermore, the use of patterns in testing can improve the quality of the process itself; logic reuse, along with building an archive of reusable mechanisms, can only get more done within less time. The main advantage of this technique is that it gives the testers a mechanism to realize many quality attributes using one single test suite. In the traditional way, it is practically extremely challenging to avoid the tradeoffs that arise when trying to balance different customer requirements at once. Furthermore, the process can be subject to many refinement iterations. This process does not have to follow a waterfall model, especially since it takes time and effort to take a decision regarding the final test suite. Therefore, the fact that this technique lends itself to an incremental or iterative paradigm can be a key factor in its success.
The suggested technique can not only be refined in terms of the sequence of activities and the precision of the generated deliverables; it can also be further enhanced by incorporating many innovative techniques. A good example would be using test pattern scenarios along with test pattern contracts for a more accurate formulation and documentation of the test patterns being used. Moreover, many algorithms can be used for test pattern prioritization, such as decompression algorithms [6]. On the other hand, the activities involved in this technique can be extremely time-consuming if not handled in a smart way. This technique could be extremely efficient, but only if carried out by testing engineers who have experience in the field of testing and good knowledge about testing patterns. It is also worth mentioning that the process of designing a test case based on a given pattern is a non-trivial task. Another drawback of the suggested technique is that it is still at the conceptual level, and it goes without saying that implementing the necessary mechanisms will bring challenges, bugs and errors that cannot be detected at the conceptual phase. Although the suggested technique - like any new concept - might not be very popular, it still constitutes a starting point for researchers in the field, as it gives incentives for new research topics. It can also be of great help to testing engineers, especially in the context of big, complex projects that would take long to test. Finally, the core motivation behind this technique is reuse. To those who might argue that the process of incorporating patterns might be time and resource consuming, it is a methodology that allows a testing mechanism to be built once and reused as many times as needed.

References
[1] Soundarajan, Hallstrom, Delibas, Shu, "Testing Patterns", Software Engineering Workshop (SEW 2007), IEEE, 2007, pp. 109-120.
[2] Soundarajan, Hallstrom, "Patterns: from system design to software testing", Innovations Syst Softw Eng, 2008, pp. 71-85.
[3] Feudjio, Schieferdecker, "Test Automation Design Patterns for Reactive Software Systems", EuroPLoP Conference, Workshop E, 2009.
[4] Rybalov, "Design Patterns For Customer Testing", white paper, 2004.
[5] Benware, Schuermyer, Tamarapalli, Tsai, Ranganathan, Madge, Rajski, Krishnamurthy, "Impact of Multiple-Detect Patterns on Product Quality", Test Conference 2003, Vol. 1, pp. 1031-1040.
[6] Novak, "Comparison of Test Patterns Decompression Techniques", Design, Automation and Test in Europe Conference and Exhibition, 2003, Czech Republic, pp. 1182-1183.
[7] Alampally, Venktesh, Shanmugasundaram, Parekhji, Agrawal, "An Efficient Test Data Reduction Technique Through Dynamic Pattern Mixing Across Multiple Fault Models", VLSI Test Symposium (VTS), 2011 IEEE 29th, pp. 285-290.
[8] T. Aoki, T. Toriyama, K. Ishikawa, and K. Fukami, "A tool for measuring quality of test pattern for LSIs' functional design", in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), 1995, pp. 75-78.
[9] W. T. Tsai, Y. Tu, W. Shao, and E. Ebner, "Testing Extensible Design Patterns in Object-Oriented Frameworks through Hierarchical Scenario Templates", Proc. of IEEE COMPSAC, 1999, pp. 166-171.
[10] Cheng, Krstic, "Current Directions in Automatic Test-Pattern Generation", Computer, Vol. 32, Issue 11, pp. 58-64.
[11] Kübert, Sewner, "High performance computing as a service with Service-Level Agreements", 2012 IEEE Ninth International Conference on Service Computing, pp. 579-585.

Authors' information
1 Al Akhawayn University in Ifrane, Morocco.
2 Yarmouk University, Jordan.

Dr. Bouchaib Falah is currently an Assistant Professor at Al Akhawayn University, teaching graduate and undergraduate software engineering courses in the School of Science and Engineering. Besides teaching high-school-level math in Morocco and college mathematics and computer science at Harrisburg Area Community College in Pennsylvania, Suny Orange Community College in New York, Pennsylvania State University in Pennsylvania, Central Pennsylvania College in Pennsylvania, Concordia College in Minnesota, and North Dakota State University in North Dakota, Dr. Bouchaib Falah has extensive industrial experience with Agri-ImaGIS, Synertich, and the Commonwealth of Pennsylvania Department of Environmental Protection. E-mail: [email protected]

Dr. Mohammed Akour is an Assistant Professor in the Department of Computer Information Systems at Yarmouk University (YU). He got his Bachelor (2006) and Master (2008) degrees from Yarmouk University in Computer Information Systems with Honors. He joined YU as a Lecturer in August 2008 after graduating with his Master in Computer Information Systems. In August 2009, he left YU to pursue his PhD in Software Engineering at North Dakota State University (NDSU). He joined YU again in April 2012 after graduating with his PhD in Software Engineering from NDSU with Honors. E-mail: [email protected]

Nissrine El Marchoum is with Allianz Managed Operations & Services SE, Munich Area, Germany. She got her Master degree from Al Akhawayn University in Ifrane. E-mail: [email protected]
International Review on Computers and Software (I.RE.CO.S.), Vol. 10, N. 5 ISSN 1828-6003 May 2015
Edema and Nodule Pathological Voice Identification by SVM Classifier on Speech Signal Asma Belhaj, Aicha Bouzid, Noureddine Ellouze
Abstract – This paper introduces two voicing parameters to describe the speech signal and studies their effects on the classification of disordered voices. These parameters are the fundamental frequency and the open quotient. The fundamental frequency is obtained from the voicing speech period, and the open quotient is defined as the ratio of the open phase to the pitch period. The open phase and pitch period are determined from the GCI and GOI obtained by the multi-scale product method (MPM) of the speech signal. The classification is operated on two pathological databases, MAPACI and MEII, by a multi-class one-against-all SVM classifier. We consider a three-category classification into edema, nodule and normal voices for the female speakers. The effects of these voicing parameters are studied when added to the MFCC coefficients, MFCC derivatives, and the energy. Copyright © 2015 Praise Worthy Prize S.r.l. - All rights reserved.
Keywords: Pathological Voices, SVM, MFCC, Open Quotient
I.    Introduction

Speech is the basis of direct communication. However, any kind of pathology related to the speaking abilities may severely impact professional as well as social activities. That is why, in recent years, many researchers in the laryngology and speech technology fields have paid more attention to the acoustic characteristics of pathological voices [1]-[3]. The evaluation and study of voices representing cases of pathology prove to be very helpful, whether for diagnosis or for therapy assessment. Assessing the voice quality can be done either by a diagnostician or by a direct examination such as laryngo-stroboscopy. Discriminating between normal and pathological voices may be ensured using objective or subjective approaches. Let us briefly identify the two approaches. The perceptive approach deals with the qualification of the patient's voice pathology. This evaluation is driven by trained professionals whose task is to rate the speech samples on a GRBAS grade scale [4] depending on their depiction of the voice disorder. This approach, however, perceived as subjective, has some limits: it is highly linked to the listener's experience and is inconsistent for evaluating pathological voice quality. The objective approach, on the other hand, deals with two different tasks: qualifying and quantifying the voice pathologies using acoustical, aerodynamic, and physiological measurements. One of the main advantages of this second approach is the fact that it is quantitative, but it is also much more comfortable for the patient. The diagnostic can be done after a clinical observation and evaluation of the patient's voice quality, through electro-glottography (EGG) [5], stroboscopy of the vocal folds [6] and high-speed cameras [7].

Given the complex and subjective nature of personal listening, researchers have developed various tools for establishing a diagnosis. Methods dealing with the acoustic evaluation of pathological voices have been introduced in the literature. Among them, the automatic classification of pathological voices from the speech signal has received considerable attention. Classifiers used in speech recognition have been adopted for the classification of pathological voices, such as neural networks, Gaussian mixture models (GMM), hidden Markov models (HMM), and support vector machines (SVM). Bi-class SVM classifiers have attained excellent performance in normal versus pathological classification. The classification between the pathologies, however, has not been dealt with in many works [8]-[10] and the results obtained are not sufficiently efficient. In this paper, we deal with voice pathology classification (edema, nodule, and normal voices) through the use of the following features: MFCC coefficients, their first and second derivatives, the energy, the fundamental frequency and the open quotient. The pitch period is defined as the time difference between two successive glottal closing instants (GCI). The open phase is the time interval separating a glottal opening instant (GOI) and the following glottal closure instant (GCI). The open quotient is the ratio of the open phase to the pitch period. MFCC coefficients are determined by Mel frequency filtering. This paper is organized as follows. In Section 2, we deal briefly with the features used in pathologic voice classification. Section 3 is about the enumeration of some classifiers of pathologic voices. Section 4 describes the MEII and MAPACI databases.
Section 5 explains in detail the fundamental frequency and open quotient determination method. Section 6 presents SVM multi-class classifiers. The results are developed in Section 7. Section 8 will be the conclusion and future work.
II.    Features Used in Pathologic Voice Discrimination

This section addresses the most important parameters used in pathological voice discrimination systems. They are essentially the Mel Frequency Cepstral Coefficients (MFCC) with their first and second derivatives, the fundamental frequency, the energy, and the harmonic to noise ratio. In fact, the MFCC are among the most discriminative parameters in the speech recognition field [11]. The Harmonic to Noise Ratio is defined as the log ratio of the energy of the periodic and aperiodic components [12]-[14], while others use the short-time autocorrelation function [15]. The fundamental frequency is an obvious parameter describing the speech voicing state. This parameter is used in most of the studies, in conjunction with the MFCC. It is used in [16]-[17] for the discrimination between normal and pathological voices using the MEEI database. Besides, the Multi-Dimensional Voice Program (MDVP) produced by KayPentax Corp [20] provides some acoustic features defined in [21] and corresponding to the speech samples in the MEEI database. Some classification systems use directly the features computed by the MDVP [22], [23] and some others deduce features inspired by those computed by the MDVP software [18]-[20].

III.    Classifier of Pathologic Voices

The aim of this section is to describe some classifiers used in voice pathology assessment. Their structure and behavior are briefly presented. The Gaussian Mixture Model (GMM) is a supervised classification system widely used in automatic speaker recognition. It was adapted from speaker identification to a classification into one grade of the GRBAS scale (from 0 (normal) to 3). GMM was used in normal/pathological classification [24]. When a speech sample has to be classified, the likelihood between the sample and each GMM is estimated and the decision relies on the maximum likelihood. For the grade classification, 95% is obtained for grade 0, corresponding to the normal subjects, while a loss of performance is observed for the pathological ones, especially between adjacent grades. The system is used in [25] to determine which kind of information is better suited to the classification of the four grades. The Support Vector Machine (SVM) [26] is a well-known classifier used in problems of classification, regression, and detection. Recent studies use this classifier for the discrimination between normal and pathological samples. For example, [2] proposes to use a set of features consisting of 11 MFCC coefficients, the Harmonic to Noise Ratio, the Normalized Noise Energy, the Glottal to Noise Excitation, the Energy, and their first derivatives. The classifier is trained on the vowels /a/ from the pathological corpus of the MEII Database. The average correct classification rate is 95.12%. An SVM classifier using features extracted from the wavelet transform of speech samples is used to discriminate between normal and pathological voices [27]. The correct classification rate is 97.5% for normal voices and 100% for pathological ones. Artificial Neural Networks are among the most widely used classifiers in various domains, such as pattern classification and recognition, and particularly speech recognition. This classifier has been applied on the MEII database to distinguish between normal and pathological samples. The input layer is composed of 26 neurons corresponding to 26 acoustic descriptors given by the MDVP software. Besides, the classifier is composed of one hidden layer and a 1-neuron output layer for the normal or pathologic decision. The average correct classification rate is 94%. The discrimination between normal and pathological samples is also operated on a database of 5 Spanish sustained vowels [29]. Each vowel is treated by a neural network which takes as input classic parameters and others extracted from the bi-coherence. The decisions from the 5 networks are then combined to decide if the input sample is healthy or not. The correct classification rate is 94.4% for the classic parameters and is increased by 4% with the added parameters.

IV.    Databases

In this work, we have used two databases: MEII and MAPACI. The MEII Database was provided by the Massachusetts Eye and Ear Infirmary (MEEI) Voice and Speech Labs (Kay Elemetrics Corp., 1994). It contains sound files sampled at 25 kHz or 50 kHz with 16 bits of resolution. The acoustic samples are sustained phonations of the vowel /a/ (3-4 s long) [30]. MAPACI is a Spanish speech pathology database. Voice samples were sampled at 44.1 kHz (2003). It consists of 12 normal male voice samples, 12 pathological male voice samples, 12 normal female voice samples and 12 pathological female voice samples pronouncing a sustained vowel /a/ of about 3 s [31].
V.    Feature Extraction

The feature extraction constitutes the first step of a classification system. This step consists in determining the MFCC coefficients, their first and second derivatives, the energy, the fundamental frequency, the open quotient and their variations. MFCC coefficients, their derivatives and the energy are computed using the melcepst function
provided by the voicebox toolbox [32]. The open quotient and the fundamental frequency are calculated from the glottal closure instant (GCI) and glottal opening instant (GOI) values. The GCI and the GOI are detected from the multi-scale product (MP) of the speech signal. For the MAPACI database, the analysis is operated with a Hamming window of 2048 samples with an overlap of 1024 samples (sampling frequency 44.1 kHz). For the MEII database, a window of 1161 samples overlaps on 581 samples at the 25 kHz sampling frequency, and a window of 2322 samples overlaps on 1161 samples at the 50 kHz sampling frequency.
V.1. GCI and GOI Detection by MP

The multi-scale product of a signal is the product of the wavelet transform coefficients for different successive scales:

$$p(n) = \prod_{j=1}^{3} w_{2^{j}} f(n) \qquad (1)$$

where $w_{2^{j}} f(n)$ is the wavelet transform of the function $f(n)$ at scale $2^{j}$. Fig. 1 shows a speech segment of a normal voice from the MEII database, and the corresponding multi-scale product. The wavelet used is the quadratic spline function. The scale combinations are S1 = 2, S2 = 5/2, S3 = 3 for female speakers and S1 = 3, S2 = 4, S3 = 5 for male speakers. The speech MP presents two types of peaks: minima corresponding to the glottal closure instants GCI, and maxima related to the glottal opening instants GOI. The positive GOI impulses are weaker but discernible, and are detected as the maxima between two GCIs.

Fig. 1. Speech normal voice corresponding to a sustained vowel /a/ extracted from the female speaker AXH1 of the MEII database and its MP

Fig. 2 presents a pathological voice signal. The MP shows negative pulses related to the GCI and positive pulses related to the GOI.

Fig. 2. Speech pathological voice corresponding to a sustained vowel /a/ extracted from paralysis AXT13 pronounced by a male speaker of the MEII database and its MP

The local pitch period of the speech is given by:

$$T_{0}(k) = GCI(k+1) - GCI(k) \qquad (2)$$

The local fundamental frequency $F_{0}(k)$ is given by the inverse of the local pitch period:

$$F_{0}(k) = \frac{1}{T_{0}(k)} \qquad (3)$$

The mean value of the fundamental frequency for the ith window is calculated according to the following relationship:

$$F_{0}^{i} = \frac{1}{N}\sum_{k=1}^{N} F_{0}^{i}(k) \qquad (4)$$

The jitter of the mean fundamental frequency is:

$$JiF_{0}^{i} = \frac{1}{N-1}\sum_{k=1}^{N-1}\left| F_{0}^{i}(k+1) - F_{0}^{i}(k) \right| \qquad (5)$$

The local open quotient is defined as the ratio of the duration of the open phase to the fundamental period:

$$O_{q}(k) = \frac{GCI(k+1) - GOI(k)}{T_{0}(k)} \qquad (6)$$

The mean value of the local open quotient for the ith window is calculated according to the following relationship:

$$O_{q}^{i} = \frac{1}{N}\sum_{k=1}^{N} O_{q}^{i}(k) \qquad (7)$$

The jitter of the mean open quotient is:

$$JiO_{q}^{i} = \frac{1}{N-1}\sum_{k=1}^{N-1}\left| O_{q}^{i}(k+1) - O_{q}^{i}(k) \right| \qquad (8)$$

The parameter i is the index of the window, k is the index of the period, and N is the number of periods. $O_{q}^{i}(k)$ is the open quotient of period k in window i [33]-[36].
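For concreteness, relations (2)-(8) can be computed directly from the detected instant sequences. The sketch below is an illustration only: it assumes that the GCI and GOI arrays (in seconds, with one GOI between consecutive GCIs) have already been obtained from the multi-scale product for one analysis window, and all function names and the toy numbers are chosen here for the sketch rather than taken from the paper's toolchain.

```python
import numpy as np

def voicing_features(gci, goi):
    """Per-window F0/Oq statistics from glottal closure (gci) and opening (goi)
    instants, following Eqs. (2)-(8): gci has N+1 entries, goi has N entries."""
    gci = np.asarray(gci, dtype=float)
    goi = np.asarray(goi, dtype=float)
    t0 = np.diff(gci)                      # Eq. (2): T0(k) = GCI(k+1) - GCI(k)
    f0 = 1.0 / t0                          # Eq. (3): F0(k) = 1 / T0(k)
    oq = (gci[1:] - goi) / t0              # Eq. (6): open phase over pitch period
    return {
        "F0_mean": f0.mean(),                          # Eq. (4)
        "F0_jitter": np.abs(np.diff(f0)).mean(),       # Eq. (5)
        "Oq_mean": oq.mean(),                          # Eq. (7)
        "Oq_jitter": np.abs(np.diff(oq)).mean(),       # Eq. (8)
    }

# Toy illustration with made-up instants (about 5 ms pitch periods, ~60% open quotient)
gci = [0.000, 0.005, 0.010, 0.0151, 0.0202]
goi = [0.002, 0.0071, 0.012, 0.0173]
print(voicing_features(gci, goi))
```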
VI.    SVM Multiclass

In order to classify automatically both normal and
pathological voices, support vector machines have been of great use [37]. Two cases can be distinguished for the SVM classifier. The first is the linearly separable case, in which the optimization algorithm of the SVM maximizes the margin between the two categories. The second is the non-linearly separable case, in which a kernel function K, satisfying the Mercer properties, is used to map the input data to a higher-dimensional space in which the classes become linearly separable. Eq. (9) shows how a data point x is attributed to one of the two classes by the classifier resulting from the margin-maximization algorithm:

$$f(x) = \sum_{i=1}^{N} \alpha_{i}\, y_{i}\, K(x, x_{i}) + b \qquad (9)$$

where $y_{i} \in \{-1, 1\}$ are the class labels, b is the bias term, $\sum_{i=1}^{N} \alpha_{i} y_{i} = 0$ and $\alpha_{i} \geq 0$. The Gaussian kernel used in this work is:

$$k(x, z) = \exp\left(-\frac{\|x - z\|^{2}}{2\sigma^{2}}\right) \qquad (10)$$

A kernel parameter γ and a penalty parameter C, which is a component of the cost function, should be determined before the training phase. A larger value of C reflects the assignment of a higher penalty to classification errors. The user should define a grid search within the intervals C = [10^0, 10^3] and γ = [10^-4, 10^-2] in order to identify the (C, γ) pair that allows the classifier to predict unknown data as precisely as possible. The training and classification of the SVM are carried out with the SVM toolbox [39]. The strategy adopted for the multi-category classification is the one-against-all strategy, used in such a way that one classifier is obtained for each pair of different classes [38], [39]. The number of binary classifications obtained is K, where K is the number of categories. The majority rule is adopted for determining the final decision: the vote for the category to which the unknown sample has been attributed is increased by one, and the sample is attributed to the class obtaining the largest vote [40].
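The paper performs this training with a dedicated SVM toolbox [39]; the snippet below is only an illustrative re-creation of the same scheme (RBF kernel, grid search over C and γ, one-vs-rest voting) using scikit-learn, with a made-up feature matrix standing in for the MFCC/F0/Oq vectors.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 14))                       # placeholder feature vectors (e.g. MFCC + F0 + Oq)
y = np.repeat(["edema", "nodule", "normal"], 20)    # three voice classes

# Grid search over the penalty C and the RBF parameter gamma, as in the text
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [1, 10, 100, 1000], "gamma": [1e-4, 1e-3, 1e-2]},
    cv=3,
)
clf = OneVsRestClassifier(grid)     # one binary SVM per class, decision by the largest score
clf.fit(X, y)
print(clf.predict(X[:5]))
```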
VII.    Classification Results

The results are determined for the MAPACI and MEII databases. We operate a normal/edema/nodule classification and we consider only female voices.

VII.1. Edema/Nodule/Normal Classification Using the MAPACI Database

In the edema/nodule/normal classification operated on the MAPACI database, we use 6 edema, 6 nodule and 12 normal voices. The classification rates are reported in Table I and Fig. 3.

TABLE I
CONFUSION MATRIX FOR THE CLASSIFICATION USING THE MAPACI DATABASE OF EDEMA, NODULE, AND NORMAL FEMALE VOICES

Feature set | Class | Edema | Nodule | Normal
MFCC | Edema | 53.45 | 43.36 | 3.19
MFCC | Nodule | 25.5 | 50.71 | 23.79
MFCC | Normal | 0 | 0 | 100
MFCC+Δ+ΔΔ | Edema | 54.39 | 43.36 | 2.25
MFCC+Δ+ΔΔ | Nodule | 15.47 | 50.42 | 34.10
MFCC+Δ+ΔΔ | Normal | 0 | 0 | 100
MFCC+E+Δ+ΔΔ | Edema | 57.57 | 36.07 | 6.36
MFCC+E+Δ+ΔΔ | Nodule | 20.92 | 53.29 | 25.79
MFCC+E+Δ+ΔΔ | Normal | 2.32 | 0 | 97.68
MFCC+F0 | Edema | 86.35 | 4.3 | 9.35
MFCC+F0 | Nodule | 0 | 100 | 0
MFCC+F0 | Normal | 0 | 0 | 100
MFCC+Oq | Edema | 34.95 | 61.12 | 3.93
MFCC+Oq | Nodule | 23.20 | 24.64 | 52.16
MFCC+Oq | Normal | 1.94 | 0 | 98.06
MFCC+F0+Oq | Edema | 70.09 | 28.78 | 1.13
MFCC+F0+Oq | Nodule | 0 | 100 | 0
MFCC+F0+Oq | Normal | 0 | 0 | 100
All parameters | Edema | 92.33 | 2.61 | 5.06
All parameters | Nodule | 0 | 100 | 0
All parameters | Normal | 0 | 0.23 | 99.77

Fig. 3. Classification rates using the MAPACI database of edema, nodule, and normal female voices using a 3-class SVM

The MFCC coefficients are efficient at recognizing normal voices with a high rate. The derivatives do not give better results, nor does the energy. The open quotient introduces confusion between the two diseases. The best recognition rates are obtained with the combination of all parameters or with MFCC + F0. The fundamental frequency appears to be the most discriminating parameter when associated with the MFCC.

VII.2. Edema/Nodule/Normal Classification Using the MEII Database

The classification operated on the MEII database uses 14 edema female voices, 14 nodule female voices, and 14 normal female voices. The classification rates are reported in Table II and Fig. 4. The MFCC coefficients
alone, MFCC + Δ + ΔΔ, and MFCC + Δ + ΔΔ + E give the same classification results. The best recognition is obtained with all parameters. The normal class is always recognized with a good score in all cases.

TABLE II
CONFUSION MATRIX FOR THE CLASSIFICATION USING THE MEII DATABASE OF EDEMA, NODULE AND NORMAL FEMALE VOICES

Feature set | Class | Edema | Nodule | Normal
MFCC | Edema | 73.38 | 26.62 | 0
MFCC | Nodule | 12.57 | 74.85 | 12.58
MFCC | Normal | 0 | 0 | 100
MFCC+Δ+ΔΔ | Edema | 73.38 | 26.62 | 0
MFCC+Δ+ΔΔ | Nodule | 12.57 | 74.85 | 12.58
MFCC+Δ+ΔΔ | Normal | 0 | 0 | 100
MFCC+Δ+ΔΔ+E | Edema | 78.42 | 21.58 | 0
MFCC+Δ+ΔΔ+E | Nodule | 12.57 | 74.25 | 13.18
MFCC+Δ+ΔΔ+E | Normal | 0 | 0 | 100
MFCC+F0 | Edema | 64.74 | 31.65 | 3.60
MFCC+F0 | Nodule | 31.14 | 40.71 | 28.15
MFCC+F0 | Normal | 0 | 0 | 100
MFCC+Oq | Edema | 40.95 | 55.12 | 3.93
MFCC+Oq | Nodule | 52.15 | 24.64 | 23.21
MFCC+Oq | Normal | 1.94 | 0 | 98.06
MFCC+F0+Oq | Edema | 67.62 | 32.38 | 0
MFCC+F0+Oq | Nodule | 33.53 | 61.08 | 5.39
MFCC+F0+Oq | Normal | 0 | 1.24 | 98.76
All parameters | Edema | 100 | 0 | 0
All parameters | Nodule | 20.96 | 79.04 | 0
All parameters | Normal | 13.65 | 0 | 86.35

Fig. 4. Classification rates using the MEII database of edema, nodule, and normal female voices using a 3-class SVM

The MFCC coefficients used alone recognize 100% of the normal speakers, the edema with 73.39% and the nodule with 74.85%. Adding the fundamental frequency to the MFCC influences the pathologic classes by lowering the recognition of edema to 64.74% and of the nodule to 40.71%; the good recognition of normal voices remains effective. The open quotient Oq has an effect on the pathologic classes by lowering the recognition of edema to 40.95% and of the nodule to 24.64%; the good recognition of normal voices remains effective.

VII.3. Performance of the Classification System

The accuracy parameter is used for performance evaluation. It is calculated from the confusion matrix. The accuracy is expressed by:

$$ACC = \frac{\sum_{i} A_{ii}}{\sum_{i,j} A_{ij}} \qquad (11)$$

The $A_{ij}$ parameters are obtained from the confusion matrix as depicted in Table III.

TABLE III
THE STRUCTURE OF A CONFUSION MATRIX

 | Edema | Nodule | Normal
Edema | A11 | A12 | A13
Nodule | A21 | A22 | A23
Normal | A31 | A32 | A33

Performance evaluation for different combinations of parameters is determined for the MAPACI and MEII databases. Table IV gives the values of the accuracy for selected parameters on the MAPACI and MEII databases according to the edema/nodule/normal 3-class SVM classification.

TABLE IV
ACCURACY RATES OF THE EDEMA/NODULE/NORMAL CLASSIFICATION FOR FEMALE VOICES OF THE MAPACI AND MEII DATABASES

Feature set | MAPACI Acc % | MEII Acc %
MFCC | 75.75 | 82.33
MFCC+Δ+ΔΔ | 76 | 82.33
MFCC+F0 | 96.5 | 68
MFCC+Oq | 63.75 | 54
All parameters | 97.5 | 88.33

Accuracy rates of edema/nodule/normal recognition by SVM classification, for female voices, on the MEII and MAPACI databases are presented in Table IV. The best accuracy for the MAPACI database is obtained by the combination "all parameters" with 97.5%, and by the combination "MFCC + F0" with 96.5%. The best accuracy for the MEII database is obtained by the combination "all parameters" with 88.33%, and by the combination "MFCC + derivatives" with 82.33%.
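As a quick check of Eq. (11), the sketch below computes the accuracy from a confusion matrix laid out as in Table III. The counts used are made up for illustration and are not taken from Tables I-II (which report per-class percentages rather than raw counts).

```python
import numpy as np

def accuracy(confusion):
    """Eq. (11): ratio of the diagonal (correct decisions) to all decisions."""
    confusion = np.asarray(confusion, dtype=float)
    return np.trace(confusion) / confusion.sum()

# Hypothetical count matrix in the layout of Table III
# (rows = true class, columns = predicted class: edema, nodule, normal)
A = [[5, 1, 0],
     [1, 4, 1],
     [0, 0, 12]]
print(accuracy(A))   # 21 correct decisions out of 24 -> 0.875
```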
VIII.    Conclusion

This paper presents the evaluation of a multiclass SVM classifier of normal, edema and nodule pathological voices using MFCC parameters. The following parameters are studied when added to the MFCC coefficients, the energy, and the MFCC derivatives: the fundamental frequency F0 and the glottal open quotient Oq. Besides, we consider a triple classification between normal, edema and nodule voices for female speakers using the MAPACI and MEII databases. The open quotient and the fundamental frequency are computed from the GOI and the GCI obtained by the multi-scale product (MP) of the speech signal. The classification is performed by a multi-class SVM system according to the one-against-all approach using the Gaussian kernel. The proposed approach is tested on two databases of pathological voices: MAPACI and MEII. For all these classifications, we vary the set of parameters to investigate the relative effect of the
In the MAPACI database, the classification into edema, nodule and normal voices for women has been tested for different configurations of parameters. The best recognition rate of the 3-class SVM classification into edema, nodule and normal based on MFCC parameters is obtained by using all the parameters. In addition, the fundamental frequency F0 appears to be the most discriminant parameter, whereas the open quotient parameter Oq does not discriminate between edema and nodule. The MFCC coefficients alone are able to recognize normal voices. In the MEII database, the same classification into edema, nodule and normal voices for women is applied, and the best recognition rate is again obtained with the combination of all parameters. The MFCC coefficients allow 100% recognition of normal voices, 73.39% of the edema and 74.85% of the nodule voices, while the open quotient Oq degrades the recognition rate of edema and nodule. The performance of the classification is evaluated with the accuracy, expressed from the coefficients of the confusion matrix. The best accuracy obtained on the MAPACI database is given by the combinations "all parameters", with a 97.5% recognition rate, and "MFCC + F0", with a recognition rate of 96.5%. The best accuracy obtained on the MEII database is given by the combinations "all parameters", with an 88.33% recognition rate, and "MFCC + derivatives", with a recognition rate of 82.33%. We can conclude that the best performance of a 3-class SVM classification into edema, nodule and normal using MFCC parameters is obtained by using all the parameters. The fundamental frequency F0 appears to be the most discriminant parameter, whereas the open quotient Oq degrades the discrimination between the edema and nodule diseases. The MFCC coefficients alone recognize normal voices. Future work concerns the classification of more pathologies and the testing of other parameters extracted from the speech multi-scale product.
Authors' information
Tunis El Manar University, National Engineering School of Tunis, Laboratory of Signal, Images and Information Technology, TUNISIA.

Dr. Asma Belhaj was born on February 9, 1985. She holds a DEA degree ("Diplôme des Etudes Approfondies") in Signal Processing and Control from the University Paris XII, obtained in 2008. She joined the Institute of Technological Studies of Tunis (Tunisia) as an associate professor in 2008 and obtained her Ph.D. at the Signal Processing Laboratory (LSTS-ENIT) in March 2014.
E-mail: [email protected]

Dr. Aïcha Bouzid was born on April 29, 1975. She obtained a diploma in electrical engineering at the Ecole Nationale d'Ingénieurs de Tunis (ENIT, Tunis, Tunisia) in 1998, a Master degree in automatic control and signal processing in 2000 and a Ph.D. at the Signal Processing Laboratory (LSTS-ENIT) in July 2004. She joined the Institute of Technological Studies of Sfax (Tunisia) as an associate professor in 1999. Currently she is working as a professor in the Department of Electronics at the Institut Supérieur d'Electronique et de Communication de Sfax. Her research interests are signal processing, speech processing and applied mathematics.
E-mail: [email protected]

Pr. Noureddine Ellouze received a Ph.D. degree in 1977 from the Institut National Polytechnique at Paul Sabatier University (Toulouse, France), and an Electronic Engineering Diploma from ENSEEIHT at the same university in 1968. In 1978, Dr. Ellouze joined the Department of Electrical Engineering (ENIT, Tunisia) as an assistant professor. In 1990, he became Professor in signal processing, digital signal processing and stochastic processes. He was General Manager and President of the Research Institute on Informatics and Telecommunication IRSIT from 1987 to 1990, and President of the Institute from 1990 to 1994. He is now Director of the Signal Processing Research Laboratory LSTS at ENIT, and is in charge of the Control and Signal Processing Master degree at ENIT. Pr. Ellouze has been an IEEE fellow since 1987; he has directed multiple Masters and Ph.D. theses and published over 300 scientific papers in journals and proceedings. He is chief editor of the scientific journal Annales Maghrébines de l'Ingénieur. His research interests include neural networks and fuzzy classification, pattern recognition, signal processing and image processing applied to biomedical, multimedia and man-machine communication.
E-mail: [email protected]
International Review on Computers and Software (I.RE.CO.S.), Vol. 10, N. 5 ISSN 1828-6003 May 2015
Hybrid Learning Model and Acoustic Approach to Spoken Language Identification Using Machine Learning
R. Madana Mohana1, A. Rama Mohan Reddy2
Abstract – Spoken Language Identification (SLId) is the process of identifying the language of an utterance from an anonymous speaker, irrespective of gender, pronunciation and accent. In this paper we present an acoustics-based learning model for spoken language identification. An acoustic feature representing the short-term power spectrum of sound, the Mel Frequency Cepstral Coefficients (MFCC), is used as part of the investigation in this paper. The proposed system uses a combination of the Gaussian Mixture Model (GMM) and Support Vector Machines (SVM) to handle the problem of multi-class classification. The model aims at detecting English, Japanese, French, Hindi and Telugu. A speech corpus was built using speech samples obtained from a plethora of online podcasts and audio books; this corpus comprises utterances spanning a uniform duration of 10 seconds. Preliminary results indicate an overall accuracy of 96%, while a more comprehensive and rigorous test indicates an overall accuracy of 80%. The acoustic model combined with the proposed learning techniques hence proves to be a viable approach for Language Identification. Copyright © 2015 Praise Worthy Prize S.r.l. - All rights reserved.
Keywords: MFCC, Language Identification, SVM, GMM, Longrun Technique
Nomenclature

MFCC     Mel Frequency Cepstral Coefficient
SVM      Support Vector Machine
GMM      Gaussian Mixture Model
V        Vertices
A        Arcs
H        Hypothesis
X        Data value
P(H/X)   Posterior probability
P(H)     Prior probability
P(X)     Probability of the occurrence of data value X
P(X/H)   Conditional probability that, given H, the tuple X satisfies it
fi       Gaussian function
s        Mean
v        Predefined positive variance of the function
W(n)     Hamming windowing
F[n]     Discrete Fourier Transform (DFT)
mf       Mel filter bank
y(k)     Discrete Cosine Transformation (DCT) of the log of the spectrum energy
Cn       MFCC
Cij      Confusion matrix

I. Introduction

Spoken language identification is the process of identifying the language uttered in a given audio excerpt. The advent of artificial intelligence gave rise to Computational Linguistics, a new branch of NLP, which devises algorithms for intelligently processing language data and takes human-machine interaction to a new level. Extensive work has been done to model language from a computational perspective. The initial years were dedicated to research in speech and speaker recognition systems, making speech a seamless input to machines [1]-[26]. The first question to consider before processing speech is what characteristics of speech can be used in computation. Speech is an audio signal characterized by many parameters, and there are three approaches to analysing these parameters for linguistic computation: 1) prosodic, 2) phonotactic and 3) acoustic.
Prosodic approach: Prosody is the rhythm, stress and intonation of speech. The prosody of oral languages involves variation in syllable length, pitch, loudness and the formant frequencies of speech sounds; this includes phoneme length and pitch contour. Some of the prosodic cues which have been proposed for Language Identification Systems are the shape of the pitch contour on the syllable, the rhythm and the phrase location (initial/mid/final).
Phonotactic approach: Phonotactics are the rules that govern the permissible sequences of phonemes in speech signals. Phonotactics defines acceptable syllable structures, vowel sequences and consonant clusters by means of phonotactic constraints. This approach becomes more meaningful when the linguistics of the language is thoroughly known. Using phonemes to identify a language is a daunting task, because many phonemes overlap across languages; hence the model should have a good set of phonemes which can help identify languages accurately.
Acoustic approach: The acoustic features are the low-level features from which the prosodic and phonotactic features are derived. They model parameters obtained from digital signal processing techniques. Acoustic features are independent of the speaker's intrinsic characteristics, and hence their performance is unprejudiced. The power spectrum of a signal is indicative of the acoustic information in speech, and the cepstral analysis of the power spectrum of the speech signal is the most common acoustic feature. A cepstrum is the result of taking the inverse Fourier transform of the logarithm of the spectrum of a signal; this data can be used to model the language feature space [23], [24]. Some of the cepstral coefficients which can be used are the Mel Frequency Cepstral Coefficient (MFCC), the Perceptual Linear Predictive Cepstrum Coefficient (PLPCC) and Linear Predictive Coding (LPC). The Mel-frequency cepstrum is a representation of the short-term power spectrum of a sound based on a linear cosine transform of a log power spectrum on a non-linear Mel scale of frequency.
Mel-frequency Cepstral Coefficients (MFCCs): MFCCs are the coefficients that together build up an MFC. The MFC frequency bands are uniformly spaced on the Mel scale, which approximates the response of the human auditory system more closely than linearly spaced frequency bands.
Linear Predictive Cepstrum Coefficients (LPCC): This is a tool used for audio signal processing and is based on the short-term spectrum of the speech. The basic idea behind the LPCC is to approximate the current sample by a series of past samples; the predicted and actual samples are used to obtain the coefficients.
Perceptual Linear Predictive Cepstrum Coefficient (PLPCC): This is also based on the short-term spectrum of the speech, which it modifies by several psychophysically based transformations.
This paper proposes an acoustic model to develop a language identification system. We briefly consider the nature of the other two approaches in order to justify the selection of the acoustic model. The prosody of oral languages involves variation in syllable length, pitch, loudness and the formant frequencies of speech sounds. This approach may seem convincing, but on a diverse language dataset its accuracy is jeopardized by the infiltration of speaker-innate features such as irony, sarcasm, focus and other emotions. The next approach is based on the phonemes of a language. Phonemes are the building blocks of speech, and phonotactics defines the permissible syllable structures, consonant clusters and vowel sequences. A model built upon these structural components, i.e. phonemes, looks more appropriate when compared to a prosodic approach, but it is resource and computation intensive, as it must maintain an extensive phoneme dictionary consisting of the syllables, vowels, consonants and phonotactic syntax of every language.
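Before moving on, a tiny numpy illustration of the cepstrum definition quoted above (inverse Fourier transform of the log spectrum); the test sinusoid and frame length are arbitrary stand-ins, not data from this work.

```python
# Real cepstrum of a short signal: IFFT of the log magnitude spectrum.
# The 440 Hz tone and 2048-sample frame are illustrative assumptions only.
import numpy as np

signal = np.sin(2 * np.pi * 440 * np.arange(2048) / 44100.0)
spectrum = np.fft.fft(signal)
cepstrum = np.real(np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)))
print(cepstrum[:5])   # first few cepstral samples
```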
The acoustic feature that we have adopted is the Mel Frequency Cepstral Coefficient (MFCC). The MFCC has by far proved to be an efficient tool in various speech processing systems, and its selection is justified by the fact that it is modelled to align with the human auditory system. Since language identification is clearly a classification problem, machine learning helps in making the final decision. We have used Support Vector Machines (SVM) as the learning engine for the LiD system. The SVM basically defines vectors and uses them to draw boundaries between languages. These boundaries are a result of the training phase of the SVM. Once the boundaries are defined, the system is subjected to test cases in the testing phase; the system uses the model developed during training to make a decision about the language of the test sample.
II. Background
Research in the field of Spoken Language Identification (SLId) started in the 1970s. During four decades of research, many methods covering different aspects have been studied to achieve high-performance language recognition. Among the many approaches, the phonotactic approach deals with modelling speech at the phoneme or syllable level. A phoneme is a sound or a group of sounds that is the smallest unit which can be used to differentiate between utterances. Different phoneme-based approaches were proposed by Berkling et al. [1]. Hieronymous and Kadambe proposed a task-independent spoken language identification system which uses Large Vocabulary Automatic Speech Recognition (LVASR) [2]. The LVASR system has many differences in the language model: different languages have different numbers of phonemes, word lengths and vocabularies. A broad-phoneme approach [3] for language identification was proposed by Berkling and Barnard; their system claims 90% accuracy in discriminating between Japanese and English. The same authors also proposed a theoretical error prediction for a language identification system [4]. A segmental approach to automatic language identification is based on the assumption that the acoustic structure of a language can be estimated by segmenting the speech into phonetic categories [5]. Zissman compared the performance of the following four approaches [6] for automatic language recognition of speech utterances: single-language phone recognition followed by language-dependent, interpolated n-gram language modelling (PRLM); parallel PRLM, which uses several single-language phone recognizers, each one trained on a different language; language-dependent Parallel Phone Recognition (PPR); and Gaussian Mixture Model (GMM) classification. Prosodic features encompass a large number of vocal-tract-dependent features such as rhythm, pitch and stress. An approach to automatic language identification using pitch contour information was proposed by Lin and Wang [7].
A segment of the pitch contour is approximated by a set of Legendre polynomials, so that the coefficients of the polynomials form a feature vector representing this pitch contour. Biadsy and Hirschberg [8] examined the role of intonation and rhythm across four Arabic dialects (Gulf, Iraqi, Levantine and Egyptian) for the purpose of automatic dialect identification; this method gave good results for utterance durations of two minutes. A novel phonotactic approach to LiD was described in language recognition using Gaussian Mixture Model tokenization [9], in which a Gaussian Mixture Model rather than a phone recognizer was used. To accomplish SLId, a variety of methods using the Gaussian Mixture Model (GMM) and the Hidden Markov Model (HMM) have been proposed. The work on phonetic, acoustic and discriminative approaches to automatic language identification [10] describes and evaluates three techniques that have been applied to the language identification problem: phone recognition, support vector machine classification and Gaussian mixture modelling. The next approach to SLId is the acoustic model, which aims at obtaining cepstral data from speech samples. The cepstral data used in the majority of LiD systems are MFCC, LPC and PLPCC. The advantage of applying the Mel scale is that it approximates the non-linear frequency resolution of the human ear. The work in [11] provides an insight into computing Mel frequency cepstral coefficients on the power spectrum. An adaptive algorithm for Mel-cepstral analysis of speech was proposed by Fukada et al. [12]. The process of generating MFCCs is described in further detail by Hasan et al. [13], who applied it to speaker verification. Mathematically, language identification is a maximum-likelihood classification problem. A system for speaker and language recognition using support vector machines was proposed by Campbell et al. [14]. An artificial neural network based LiD system was also proposed [15]; it makes use of two different sets of statistical parameters, namely prosodic and segmental features extracted from the fundamental frequency contour (F0) and from the frequency spectrum, for language classification. From a detailed examination of the literature, it can be observed that acoustic model analysis coupled with a learning technique yields a good model for language identification.
III. The Corpora

One of the challenges faced during the development of a speech-based intelligent system is the requirement of accurate and adequate data for training and testing. In our experimentation we handle this problem in a more realistic manner than the conventional counterpart: instead of building a system and testing it on a set of standard input samples, we have used a speech corpus which consists of real-time/non-standard speech input from different users with different origins and backgrounds, over the selected set of languages. The speech dataset is derived from podcasts and online audio books. This corpus comprises utterances, each of which spans a uniform duration of 10 seconds. All the samples used are recorded in a studio environment with reduced noise and glitches. The corpus is semi-spontaneous and colloquial in nature, which closely resembles the real world. This approach has its own limitations, one of them being reduced system accuracy; this is mainly due to the fact that there is no standardization of the training data, which is diverse both in the distribution of speech instances and in the speech and sound characteristics.

IV. Language Identification System

IV.1. Stages in a Language Identification System

All Language Identification systems, irrespective of the type of model, should follow some basic steps for processing utterances. These steps are visualized in Fig. 1.

Fig. 1. Stages in Language Identification System

Pre-processing: Pre-processing is the tuning stage of the system. The basic pre-processing involves background noise reduction and trimming the audio samples to durations which are suitable to extract sufficient features. This step should not be overlooked, as the accuracy may change with the type of modelling and the duration. It is advisable to give raw speech data, without any lossy compression, as input to the system. Hence the file formats, bit rate, frequency and number of recording channels have to be taken care of, to avoid losing meaningful information in the audio signal. In general, pre-processing is required to handle dissimilarities in the input and to bring them to common grounds.
Feature Extraction: Transforming the raw data into a set of features is called feature extraction, and it is a very pivotal stage in a language identification system. If the extracted features are carefully chosen, it is expected that the feature set will capture the relevant information in the input data.
The desired task can then be performed on this condensed representation instead of the full-size input. Feature extraction is a general term for methods that construct combinations of the variables which get around these problems while still describing the data with adequate accuracy. With respect to language identification, the first task is to identify features which provide information relevant to the task at hand. Feature extraction is not a one-step process but involves several sequential phases, which are generally the following. The first stage is to apply a window function to the input signal; in signal processing, a window function is a function that is zero-valued outside a chosen interval, Hamming and Hanning being the most prominently used window functions. Transformations are then applied to shift domains: frequency-domain analysis holds greater relevance for audio processing, so Fourier transforms (DFT, FFT) are usually used. Finally, filter banks are used to analyse the audio signal in different frequency bands.
Language Identification: The identification of language is essentially a classification problem. Hence various classification approaches can be used, the majority of them being machine learning techniques such as the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM).
Hidden Markov Model (HMM): The Hidden Markov Model (HMM) is a statistical model which follows the Markov property; a model which abides by the Markov property is called a Markov model. The Markov property states that one should be able to make predictions for the future of a process based solely on its present state, just as accurately as one could do by knowing the process's full history. Any model has three basic aspects: an input, states and outputs. In the case of a Hidden Markov Model the states are hidden whereas the outputs are known, but each state is mapped with a certain probability over all the possible outputs. Thus an HMM can be used to decipher the states or parameters behind an output. In the case of a Language Identification system, the HMM is used to learn the states or parameters behind a specific language sample. When trained with a large corpus of languages, the system generates a probabilistic model of states for each language; in other words, the highly probable state route for each language is marked. Any test sample is then scored against the available language models, and the model it matches with the highest probability can be taken as the resulting language. The HMM is defined as follows [16], [21]: an HMM is a directed graph ⟨V, A⟩ with vertices representing states, V = {v1, v2, ..., vn}, and arcs A = {⟨i, j⟩ | vi, vj ∈ V} showing transitions between states. Each HMM has the following additional components:
i. an initial state distribution used to determine the starting state at time 0, v0;
ii. each arc is labelled with a fixed probability pij of transitioning from vi to vj;
iii. given a set of possible observations O = {o1, o2, ..., ok}, each state vi contains a set of probabilities {pi1, pi2, ..., pik}, one for each observation.
Gaussian Mixture Model (GMM): The Gaussian mixture model estimates a probability density function for each class (here each language is a class), and then performs classification based on the Bayes rule. The Bayes theorem is used to calculate the probability of an event given that another event (one or more) has already occurred. Extrapolating this to the Language Identification scenario, the parameters of the language sample are the conditioning event and the language itself is the event of interest. Hence a GMM can be used as a classifier for a Language Identification system. The Bayes theorem is given in Eq. (1):

P(H/X) = P(X/H) P(H) / P(X)    (1)

Here, P(H/X) is the posterior probability, P(H) is the prior probability associated with the hypothesis H, P(X) is the probability of the occurrence of the data value X, and P(X/H) is the conditional probability that, given the hypothesis H, the tuple X satisfies it. The Gaussian function is defined as follows [16]: it is a bell-shaped curve with values in the range [0, 1]. A typical Gaussian function is shown in Eq. (2):

fi(S) = e^(-(S - s)² / v)    (2)

Here s is the mean and v is the predefined positive variance of the function. A typical Gaussian function is shown in Fig. 2.

Fig. 2. A typical Gaussian Function
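To make the Bayes-rule classification concrete, the following is a minimal sketch (not the authors' implementation) of GMM-based language scoring; scikit-learn's GaussianMixture and the toy mean-MFCC vectors are assumptions made purely for illustration.

```python
# Minimal sketch of GMM classification via the Bayes rule of Eq. (1):
# one GaussianMixture per language models P(X/H); the predicted language
# maximizes log P(X/H) + log P(H). Illustrative only, with invented data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
languages = ["English", "Hindi", "Telugu"]

# Hypothetical training data: 20-dimensional mean-MFCC vectors per language.
train = {lang: rng.normal(loc=i, size=(100, 20)) for i, lang in enumerate(languages)}

models, priors = {}, {}
total = sum(len(X) for X in train.values())
for lang, X in train.items():
    models[lang] = GaussianMixture(n_components=4, covariance_type="diag",
                                   random_state=0).fit(X)   # estimates P(X/H)
    priors[lang] = len(X) / total                            # P(H)

def identify(x):
    # log P(X/H) + log P(H) for each language; the argmax is the Bayes decision.
    scores = {lang: models[lang].score_samples(x[None, :])[0] + np.log(priors[lang])
              for lang in languages}
    return max(scores, key=scores.get)

print(identify(rng.normal(loc=1, size=20)))   # expected to print "Hindi"
```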
IV.2. Applications

The applications may range from casual to industrial use, each of them seeking a common response of language identification.
The Language Identification system can be used in any Contact Centre deployment to pre-sort callers based on the language they speak, so that the required service or IVR can be provided in the language appropriate to the caller. Global call centres would benefit from this, as callers from any part of the world may be redirected to the centres of their native language without human intervention. The Language Identification system can act as a switchboard, routing incoming calls to operators fluent in the language. Internationally operating companies maintain customer care centres to assist their clients; a language identification module in such centres can help route the customer's call to the language-specific location. For instance, a call from Germany can be automatically switched to a German-proficient operator. This increases the organization's efficiency in understanding the customer's problems. Deployment of a Language Identification system in a hotel lobby could cater to the queries of international customers: they can pose questions in their native languages and get help accordingly, making reservations, setting menus or setting cleaning schedules if they have a system that can identify their language. The Language Identification system also finds extensive use in the tourism industry, as tourists may or may not know the language used in the place they visit. Such systems can act as a link, enabling people from diverse communities to identify and, by further introspection, understand each other's languages. This helps in the propagation of correct information to tourists, which may otherwise get distorted due to a limited understanding of languages. International airports are common hosts to foreign travellers, whether on a direct visit or on a hop journey. Language Identification systems at airports can assist the airport authorities in meeting the needs of foreign tourists, removing the effect of the language barrier on the service the airport offers its customers. Speech-driven services receive queries from across the globe, and these may not all be in the same language; an automatic speech-activated system which understands a limited range of languages can thus be expanded to cater to a larger language space. Dialogue systems are becoming common in places like parliaments. These systems can identify the language being spoken and simultaneously broadcast it in multiple languages. One such implementation is found in the Indian parliament: at present, in the Lok Sabha, there is a facility for simultaneous interpretation in the following languages: Assamese, Bengali, Kannada, Malayalam, Manipuri, Maithili, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Tamil, Telugu and Urdu. Hence in parliaments and in conventions like the United Nations Organization, where representatives from across the globe gather, a language identification system can be very useful.
Another implementation can be found in the audio and video media. The majority of communication is through television and radio, and provisioning a language identification system followed by a speech interpreter can bridge the language barrier for the viewers. The prevailing speech recognition systems like Siri and Iris are proficient in doing so only for English. This language limitation can be surpassed by efficient Language Identification systems: the Language Identification system can be the initial module, where the recognizer first detects the language and then interprets the speech accordingly. Imagine using Siri in your native language. The primary purpose of enabling diverse languages in such systems is to bring a larger community into the user space. Rapid language identification can even save lives. There are many reported cases of 911 operators being unable to understand the language of the distressed caller. The current response service uses trained human interpreters who can handle about 140 languages; the drawback of this system is that it has an innate delay because of the human interpreters. An automated system can thus be more reliable and give faster responses. A Language Identification system can be used for indexed search engines and can be coupled with multilingual speech recognition systems to switch between recognizers. Spoken language interpretation and dialogue systems are other services which use Language Identification. Currently AT&T and Language Line Services partner to provide customer service assistance in more than 170 languages.

IV.3. Challenges

The major constraint in the field of Language Identification systems is the lack of suitable resources. The initial problem in the formative years of Language Identification research was the lack of speech data across multiple languages. Over the years more speech data became available, including multilingual speech databases suitable for Language Identification research, although most of the speech databases available for research were telephonic speech corpora. Recording across multiple languages is only a start, however, and obtaining accurate phonetic transcriptions of the speech data is mandatory; the utilization of word-level information therefore becomes an even more serious problem. Apart from the limitations of the dataset, another obstacle is the variation within the same language. Most languages have many dialects and sub-categories, and speakers of the same language may sound different or have different accents in different parts of the world. For example, English spoken in the United States and in India differs significantly in accent, and within India the accents further change based on the location. Identifying a language irrespective of these constraints is not an easy task. Thus the datasets must include a large variety of speakers, both male and female, with different accents, to make the system more robust.
Collecting such speech samples is a serious constraint in the field of Language Identification. Determining the best duration of the speech samples required for training is another task which cannot be overlooked: the feature space varies with the duration of the samples used, so fixing an optimum utterance duration is important. Another pivotal constraint is feature selection, as there is no single feature which can discriminate between languages accurately. Hence selecting an optimum set of relevant features is an important decision when implementing a language identification system.
V. Implementation

The following section describes the architecture of SLId. SLId follows an acoustic model, and this type of modelling makes use of fewer resources for training and testing than a large-vocabulary method or similar (phoneme-based) techniques. In contrast, a phonotactic model maintains a database of all possible phonemes (a sound or a group of sounds that is the smallest unit that can be used to differentiate between utterances) occurring in a particular language, and a prosodic approach relies on features which may be morphed by extraneous factors such as emotions. The design includes various phases based on the flow of data and the actions performed on this data. Fig. 3 represents the overall system architecture.

Fig. 3. SLId Architecture

The process of language identification is carried out progressively in three stages: 1) Pre-processing, 2) Feature Extraction and 3) the Machine Learning phase.
Pre-processing: In the pre-processing stage the emphasis is on collecting data which would help in achieving the goal of identifying the language, and on making the further stages of the system easier by maintaining technical consistency. Methods are adopted to bring all the input data to the same configuration of the relevant attributes. This involves background noise reduction, re-sampling and file format handling. The first step in pre-processing is file format handling, which takes care of the format of the speech data sample. This is ensured to avoid the loss of the energy or power information which is required to measure the acoustic behaviour and hence the language behaviour. The input samples are all in WAV format, which is particularly useful because, unlike compressed formats, no data is removed as part of compression. The sampling rate directly affects the amount of information contained in a speech sample: it defines the number of samples per unit of time taken from a continuous signal to make a discrete signal. Re-sampling is executed conditionally; it processes the input audio and tunes the sampling rate of every audio file to 44.1 kHz.
Feature Extraction: Feature extraction is a pivotal stage in a language identification system. If the extracted features are appropriately chosen, it is likely that the feature set will extract the relevant information from the source input, so that the required task can be completed using this reduced depiction instead of the full-size input. The chosen acoustic feature is the MFCC [18]. MFCC extraction is carried out in the following steps: 1) windowing, 2) Discrete Fourier Transformation, 3) Mel filter bank, 4) Discrete Cosine Transform, 5) mean MFCC. The block diagram for MFCC extraction is given in Fig. 4.

Fig. 4. Extraction of MFCC

Windowing: A window function in signal processing is a mathematical function that is zero-valued outside of some chosen interval. The Hamming window is optimized to decrease the maximum (nearest) side lobe, giving it a height of about one fifth that of the Hann window. We apply a Hamming window to the speech utterance. The Hamming window is given by Eq. (3):

W(n) = 0.54 - 0.46 cos(2πn/N),  0 ≤ n ≤ N    (3)

The window length is L = N + 1.
Discrete Fourier Transform (DFT): The DFT is defined mathematically in Eq. (4):

F[n] = Σ_{k=0}^{N-1} f[k] e^(-j2πnk/N)    (4)

where n ranges from 0 to N-1, F[n] is the DFT of the sequence f[k], given the sequence of N instants or samples denoted f[0], f[1], f[2], ..., f[N-1], and f[k] is the continuous signal which is the source of the data.
The sequence of N complex numbers x0, ..., xN-1 is thus transformed into another sequence of N complex numbers according to the DFT formula shown above. By applying the DFT, the input signal, which is in the time domain, is converted to the frequency domain.
Mel filter bank: MFCCs are one of the most popular filter-bank-based parameterizations used in speech technology. As with any filter-bank-based analysis technique, an array of band-pass filters is used to analyse the speech in different frequency bandwidths. A popular formula to convert a frequency f in hertz into Mel, mf, is given by Eq. (5):

mf = 2595 log10(1 + f/700)    (5)

Thus, with the help of a filter bank whose spacing is determined by the Mel scaling, it becomes easy to estimate the energies in each band; once these energies are estimated, the log of these energies, also known as the Mel spectrum, can be used to calculate the first thirteen coefficients using the DCT. Since the higher-order coefficients represent faster changes in the estimated energies, they carry less information for classification; hence only the first thirteen coefficients are calculated using the DCT and the higher ones are unused. Two well-known experiments generated the Bark and Mel scales; Fig. 5 illustrates the Mel-scale arrangement. We make use of the Mel scale to organize the filter bank used in the MFCC computation, using the function melbankm. This function returns a sparse matrix, which is converted to a regular matrix with the command full [17].

Fig. 5. Mel scale to organize the filter bank used in MFCC calculation

Discrete Cosine Transformation (DCT): A DCT expresses a finite series of data points in terms of a sum of cosine functions oscillating at different frequencies. Like the DFT, it decomposes a finite-length discrete-time vector into a sum of scaled and shifted basis functions. The property that makes the DCT well suited to compaction is its high degree of spectral compaction: at a qualitative level, a signal's DCT representation tends to have more of its energy concentrated in a small number of coefficients when compared to other transforms such as the DFT. The output of the band-pass filters is used for MFCC extraction by applying the discrete cosine transform. The DCT of the log spectrum energies is given in Eq. (6):

y(k) = w(k) Σ_{n=1}^{N} x(n) cos(π(2n - 1)(k - 1) / (2N))    (6)

where k = 1, 2, 3, ..., N and w(k) is given by Eq. (7):

w(k) = 1/√N for k = 1;  w(k) = √(2/N) for 2 ≤ k ≤ N    (7)

The output after applying the DCT is known as the MFCC (Mel Frequency Cepstrum Coefficient), given by Eq. (8):

cn = Σ_{k=1}^{m} (log Dk) cos(n (k - 1/2) π / m)    (8)

where n = 0, 1, ..., m - 1, cn represents the MFCC and m is the number of coefficients.
Mean MFCC: A mean of all the MFCCs is taken at every cepstrum. The mean is calculated by Eq. (9):

Mean = (1/m) Σ_{n=2}^{k} |MFCC(n) - MFCC(1)|    (9)

where m is the number of input parameters and n indexes the cases, varying from 2 to k. The spectrum can be reconstructed from the MFCCs; some examples are shown in Figs. 6 and 7, one segment of voiced speech and another of unvoiced speech [17], [19], [20].

Fig. 6. Voiced Speech

Fig. 7. Unvoiced Speech
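As a concrete illustration of the windowing, DFT, Mel filter bank, DCT and mean chain of Eqs. (3)-(9), the following is a minimal numpy/scipy sketch. It is not the authors' code: the frame size, hop size, number of filters and the triangular filter-bank construction are assumptions, and it stands in for the melbankm-based implementation mentioned above.

```python
# Minimal mean-MFCC sketch following Eqs. (3)-(9): Hamming window, power
# spectrum, Mel filter bank, log energies, DCT, mean over frames.
# Frame/hop sizes and the 26-filter triangular bank are illustrative choices.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)          # Eq. (5)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters with centres equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mean_mfcc(signal, sr=44100, frame=1024, step=512, n_ceps=20, n_filters=26):
    window = np.hamming(frame)                           # Eq. (3)
    fb = mel_filterbank(n_filters, frame, sr)
    ceps = []
    for start in range(0, len(signal) - frame, step):
        x = signal[start:start + frame] * window
        spectrum = np.abs(np.fft.rfft(x)) ** 2           # Eq. (4), power spectrum
        energies = np.maximum(fb @ spectrum, 1e-10)      # Mel filter bank output
        c = dct(np.log(energies), type=2, norm="ortho")  # Eqs. (6)-(8)
        ceps.append(c[:n_ceps])
    return np.mean(np.asarray(ceps), axis=0)             # Eq. (9): mean over frames

if __name__ == "__main__":
    test = np.random.randn(44100)                        # one second of noise as a stand-in
    print(mean_mfcc(test).shape)                         # (20,)
```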
Machine Learning: Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviours based on empirical data. We apply machine learning techniques to the language classification problem and use a support vector machine as the machine learning block. Support vector machines are a set of related supervised learning methods used for classification and regression. As the language of a speech sample has to be identified among the set of languages the system is trained with, we make use of a multi-class Support Vector Machine (SVM).
Support Vector Machine (SVM): The system uses support vector machines as the learning technique for the language classification problem. SLId involves identification among a set of languages, so it employs multi-class support vector machines, while the basic SVM algorithm is an efficient binary classifier. The idea behind the SVM approach to language detection is that the data are mapped to a feature space. This feature space is the basis on which the SVM algorithm determines a linear decision surface (hyperplane) using the set of labelled data within it. This surface is then used to classify future instances of data, which are assigned a class depending on which side of the decision surface they fall. The SVM is applicable to both linearly separable and non-linearly separable patterns: patterns that are not linearly separable are transformed, using kernel (mapping) functions, into linearly separable ones. It can be formulated as follows. The optimal hyperplane separating the two classes can be represented as in Eq. (10):

ω·X + β = 0    (10)

where X denotes the sample input vectors, defined as {(x1, y1), (x2, y2), ..., (xk, yk)}, xk ∈ R^n, yi ∈ {1, -1}, and ω, β are non-zero constants, ω indicating the weight component and β the bias component. Each input used to form the hyperplane is an N-dimensional vector labelled with the corresponding y, as given in Eqs. (11) and (12):

ω·xi + β ≥ 1 if yi = 1    (11)

ω·xi + β ≤ -1 if yi = -1    (12)

These can be combined into one set of inequalities, given in Eq. (13):

yi (ω·xi + β) ≥ 1    (13)

The above inequalities hold for all input samples that are linearly separable and satisfy the optimal hyperplane equation. The optimal hyperplane is the unique one which separates the training data with a maximal margin. One of the main differences between the binary and the multi-class SVM is the label set y = {1, 2, 3, ..., k} and the operations which depend on this set. An SVM operation consists of two phases. In the training phase the SVM plots the vectors in an N-dimensional space; in the case of LiD, the mean MFCCs form the vector space for the SVM [22]. In the testing phase, the speech sample is subjected to feature extraction and yields similar data, that is, 20 orders of mean MFCC values. Using the model built in the training phase, the SVM predicts the language of the test sample. The proposed SLId system makes use of Python bindings for audio feature extraction; the libraries used provide the capability to extract mean MFCC values for a given sample. In our SLId system, the server should have the libraries shown in Table I.

TABLE I
SLID SYSTEM LIBRARIES
- The SVM libraries.
- libsndfile, to enable reading WAV file formats.
- libmpg123, to enable reading MP3 audio files.
- liblapack, to provide general linear algebra routines for audio features.
- FFTW3, to use FFTW for Fast Fourier Transform computations.

The various steps of testing the proposed SLId system on given speech sample inputs are shown in Fig. 8.

Fig. 8. Steps in testing SLId system
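A minimal sketch of the training and prediction phases described above, using scikit-learn's C-SVM with an RBF kernel wrapped in an explicit one-against-all scheme; the 20-dimensional mean-MFCC vectors and the class labels are hypothetical stand-ins, not the authors' data or code.

```python
# Illustrative one-against-all RBF C-SVM over mean-MFCC vectors, mirroring the
# training/testing phases described in the text. scikit-learn is an assumption;
# the feature vectors are random stand-ins for real 20-dimensional mean MFCCs.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(1)
languages = ["English", "French", "Hindi", "Japanese", "Telugu"]

# Training phase: place the labelled vectors in the 20-dimensional MFCC space.
X_train = np.vstack([rng.normal(loc=i, scale=0.5, size=(80, 20))
                     for i in range(len(languages))])
y_train = np.repeat(np.arange(len(languages)), 80)

model = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)

# Testing phase: predict the language of an unseen mean-MFCC vector.
x_test = rng.normal(loc=3, scale=0.5, size=(1, 20))
print(languages[int(model.predict(x_test)[0])])   # expected to print "Japanese"
```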
The pseudo code for our implementation is shown in Tables II to V, consisting of the algorithms SLId(speech), generate_MFCC(resampled_speech), SVM_Train(Vector) and SVM_Predict(Model, TestSample).

TABLE II
ALGORITHM FOR SLID (SPEECH)
Algorithm SLId(speech)
Input: the speech sample whose language has to be identified
Output: mean values of the MFCC which represent the language information
  if File_Type = MP3 then speech = Convert_to_wav(speech)
  if Sampling_Rate != 44.1 kHz then resampled_speech = resample(speech)
  else resampled_speech = speech
  Vector = generate_MFCC(resampled_speech)

TABLE III
ALGORITHM FOR GENERATE_MFCC (RESAMPLED_SPEECH)
Algorithm generate_MFCC(resampled_speech)
Input: the pre-processed speech sample
Output: mean MFCC values up to the 20th cepstrum
  Vector = Add_feature(MFCC: Window = Hamming, blockSize = 1024, stepSize = 2048, CepsNbCoeffs = 20, computeMean = True)

Using this algorithm, MFCCs are generated from the re-sampled speech samples. MFCCs are coefficients that together build up an MFC. The MFC frequency bands are uniformly spaced on the Mel scale, which approximates the response of the human auditory system more closely than linearly spaced frequency bands.

TABLE IV
ALGORITHM FOR SVM_TRAIN (VECTOR)
Algorithm SVM_Train(Vector)
Input: the support vectors which hold the cepstral data
Output: a model file which represents the knowledge base
  Model = SVM(type = C-SVM, kernel = RBF)

TABLE V
ALGORITHM FOR SVM_PREDICT (MODEL, TESTSAMPLE)
Algorithm SVM_Predict(Model, TestSample)
Input: the model file built in the training phase and the test speech sample
Output: the language of the test sample
  Language = SVM(LiD(TestSample), Model)

VI. Experimental Results

The datasets for all our experiments are randomly taken from different parts of the Web, such as podcasts and online audio books. The datasets are divided into two parts: training data and testing data. N-fold cross validation is adopted for training the machine on the different languages. The system is trained over a large corpus of data and a small subset is used for testing to achieve better accuracy. The input speech samples are given in Table VI. The experiments are conducted to analyse the response of the proposed SLId system for the considered languages (English, Hindi, French, Japanese and Telugu). The result is depicted in the form of a confusion matrix (Table VII) and graphically represented in Fig. 9. This test case is the trivial case of testing the SLId with the entire training set. The confusion matrix is defined as follows: a confusion matrix illustrates the accuracy of the solution to a classification problem. Given m classes, a confusion matrix is an m×m matrix where cij indicates the number of tuples from D that were assigned to class cj but whose correct class is ci.

TABLE VI
INPUT SPEECH SAMPLES
Language    Number of speech samples
English        1093
French         1069
Hindi           853
Japanese        539
Telugu          868

TABLE VII
CONFUSION MATRIX
         Eng       Fr       Hin      Tel      Jap
Eng    98.558    0.0108   0.0027   0        0
Fr      0.0935  97.0065   0        0.0935   2.8
Hin     7.735    0.351   91.79     0.1172   0
Tel     2.9935   0.4608   0.1152  96.42     0
Jap     0        1.29     0        0.371   98.3302

Fig. 9. Performance analysis of SLId

From the confusion matrix it is evident that the system is reliable, as the diagonal elements of the matrix hold the highest values when compared with the other entries in their rows. Further experiments are conducted to demonstrate the system accuracy for a chosen language. Around 105 English speech samples were used to test the system and the LiD demonstrated around 80% classification accuracy. The graph of the classification accuracy of the system for English is shown in Fig. 10; it is evident that the system performs well as it comes across more evidence for each language. The correctly classified instances of the English language from a subset of the open-source speech corpus VoxForge reveal that 85 out of the 105 samples were classified correctly as English; the accuracy is found to be 80.95%. The TIMIT database consists of labeled segments of speech.
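The accuracy figures reported above follow directly from the cij definition of the confusion matrix (the ratio of the diagonal sum to the total). As a small illustration, with made-up labels rather than the corpus of Table VI, the matrix and the overall accuracy can be computed as follows.

```python
# Toy illustration of the confusion-matrix definition used above: c[i][j] counts
# samples of true class i assigned to class j; overall accuracy is the diagonal
# sum divided by the total. The labels below are invented for illustration only.
import numpy as np

languages = ["Eng", "Fr", "Hin", "Tel", "Jap"]
true_labels = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
predictions = np.array([0, 0, 1, 4, 2, 2, 3, 0, 4, 4])

m = len(languages)
confusion = np.zeros((m, m), dtype=int)
for t, p in zip(true_labels, predictions):
    confusion[t, p] += 1

accuracy = np.trace(confusion) / confusion.sum()
print(confusion)
print(f"overall accuracy: {accuracy:.2%}")   # 80.00% for this toy example
```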
Fig. 10. Classification Accuracy of English language

As an example, a phrase is shown in Fig. 11. The TIMIT database gives the first and last samples of each word, and the first and last sample of each phoneme too. In Fig. 11, the vertical line in blue indicates the beginning of a word and the red line indicates its end [17].

Fig. 11. Example of a phrase in the TIMIT database: "she had your dark suit in greasy wash water all year"

A snapshot is the state of a system at a particular point in time; it can refer to an actual copy of the state of a system or to a capability provided by certain systems. These snapshots show the classification accuracy of the languages used in our SLId system. Fig. 12 shows the GUI of the SLId system.

Fig. 12. GUI of SLId system

VII. Conclusion

An SLId system should be capable of accurately identifying the language of a speech sample for which it is trained. The current system is capable of identifying English, Telugu, Hindi, French and Japanese with an appreciable accuracy. There are very few SLId systems which provide support for regional languages, and adding Telugu to the set of languages is a modest contribution. The major barrier in any SLId research is the availability of a standard multilingual speech corpus for training. Our proposed system has not made use of any standard dataset and still achieves a decent accuracy. A novel approach to creating the dataset was tried: as the SLId system does not require a phoneme-level description or a syllable database, but does need noise-free speech at a constant level, online audio books without background sound and a few podcasts were used to build the corpus. The SLId system can be made more robust by increasing the number of samples for each language; adding more speech samples from different speakers and incorporating different accents of the same language can improve the accuracy. The immediate improvement could be to add more languages to the existing dataset to extend the range of identifiable languages. The feature space can be enhanced by considering more acoustic parameters apart from the MFCC and by incorporating a hybrid model comprising many parameters. The biggest improvement to the system could be to incorporate an incremental machine learning technique, that is, to learn from the utterances which the system had wrongly classified via a user feedback mechanism.

References
[1] K. M. Berkling, T. Arai and E. Barnard, "Analysis of phoneme-based features for language identification", in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing 94, Adelaide, Australia, April 1994.
[2] J. Hieronymous and S. Kadambe, "Spoken Language Identification Using Large Vocabulary Speech Recognition", in Proceedings of the 1996 International Conference on Spoken Language Processing (ICSLP 96), Philadelphia, USA, 1996.
[3] K. M. Berkling and E. Barnard, "Language Identification of Six Languages Based on a Common Set of Broad Phonemes", in Proceedings of the 1994 International Conference on Spoken Language Processing (ICSLP 94), Yokohama, Japan, September 1994.
[4] K. M. Berkling and E. Barnard, "Theoretical Error Prediction for a Language Identification System using Optimal Phoneme Clustering", in Proceedings of the 4th European Conference on Speech Communication and Technology (Eurospeech 95), Madrid, Spain, September 1995.
[5] Y. K. Muthusamy, "A Segmental Approach to Automatic Language Identification", Ph.D. thesis, Oregon Graduate Institute of Science & Technology, July 1993.
[6] M. A. Zissman, "Comparison of Four Approaches to Automatic Language Identification of Telephone Speech", IEEE Transactions on Speech and Audio Processing, SAP-4(1), January 1996.
[7] Chi-Yueh Lin and Hsiao-Chuan Wang, "Language identification using pitch contour information", IEEE ICASSP 2005, Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan, 2005.
[8] Fadi Biadsy, Julia Hirschberg, "Using Prosody and Phonotactics in Arabic Dialect Identification", in Proceedings of Interspeech 2009, Brighton, UK, 2009.
[9] Pedro A. Torres-Carrasquillo et al., "Approaches to Language Identification using Gaussian Mixture Models and Shifted Delta Cepstral Features", International Conference on Spoken Language Processing (ICSLP 2002), Denver, USA, 2002.
[10] E. Singer et al., "Acoustic, Phonetic, and Discriminative Approaches to Automatic Language Identification", in Proc. Eurospeech, 2003.
[11] Sirko Molau et al., "Computing mel-frequency cepstral coefficients on the power spectrum", Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), Salt Lake City, UT, USA, 2001.
[12] Fukada et al., "An adaptive algorithm for mel-cepstral analysis of speech", IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP-92), Information Systems Research Center, Canon, Japan, 1992.
[13] Hasan et al., "Speaker identification using mel frequency cepstral coefficients", 3rd International Conference on Electrical & Computer Engineering (ICECE 2004), 28-30 December 2004, Dhaka, Bangladesh, 2004.
[14] Campbell et al., "Support Vector Machines for Speaker and Language Recognition", Computer Speech and Language, Elsevier, MIT Lincoln Laboratory, 2006.
[15] Javad Shiekzadagen and Mahamood Reza Roohani, "Automatic spoken language identification based on ANN using fundamental frequency and relative changes in spectrum", International Conference on Speech Science and Technology (SST-2000), Research Centre of Intelligent Signal Processing, Iran, 2000.
[16] Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education, 2008.
[17] Davis S., Mermelstein P., "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 4, 1980.
[18] Wei Han, Cheong-Fat Chan, Chiu-Sing Choy and Kong-Pang Pun, "An Efficient MFCC Extraction Method in Speech Recognition", Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, IEEE-ISCAS, 2006.
[19] Lindasalwa Muda, Mumtaj Begam and I. Elamvazuthi, "Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques", Journal of Computing, Malaysia, Volume 2, Issue 3, March 2010.
[20] Shikha Gupta, Jafreezal Jaafar, Wan Fatimah Wan Ahmad and Arpit Bansal, "Feature Extraction using MFCC", Signal & Image Processing: An International Journal (SIPIJ), Vol. 4, No. 4, August 2013.
[21] Mark Gales and Steve Young, "The Application of Hidden Markov Models in Speech Recognition", Foundations and Trends in Signal Processing, Vol. 1, No. 3, UK, 2007.
[22] Shi-Huang Chen and Yu-Ren Luo, "Speaker Verification Using MFCC and Support Vector Machine", Proceedings of the International MultiConference of Engineers and Computer Scientists 2009 (IMECS 2009), Vol. 1, March 18-20, Hong Kong, 2009.
[23] Katrin Kirchhoff, Gernot A. Fink, Gerhard Sagerer, "Combining acoustic and articulatory feature information for robust speech recognition", Speech Communication 37 (2002), Elsevier, USA, 2002.
[24] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury, "Deep Neural Networks for Acoustic Modeling in Speech Recognition", IEEE Signal Processing Magazine, November 2012.
[25] Beydoun, A., Sharafeddine, Y., Alaeddine, H., Rachini, A., Khalil, F., Beydoun, B., "Optimization and Implementation of Acoustic Echo Canceller Based on LMS Algorithm Using FPGA", (2014) International Journal on Communications Antenna and Propagation (IRECAP), 4 (6), pp. 234-243.
[26] Alaeddine, H., Beydoun, A., Beydoun, B., Khalil, F., Rachini, A., A Novel Double Talk Echo Canceller Algorithm Using Multi Delay Filter, (2013) International Journal on Communications Antenna and Propagation (IRECAP), 3 (4), pp. 199-205.
Authors’ information 1
Research Scholar, Department of Computer Science and Engineering, Sri Venkateswara University College of Engineering, Sri Venkateswara University, Tirupathi - 517 502, Andhra Pradesh, India. 2
Professor, Department of Computer Science and Engineering, Sri Venkateswara University College of Engineering, Sri Venkateswara University, Tirupathi - 517 502, Andhra Pradesh, India. R. Madana Mohana is a Research Scholar in the Computer Science and Engineering Division at Sri Venkateswara University College of Engineering. His research interests include Data Mining, Computational Intelligence and optimising compilers. He received his B.Tech in Computer Science and Information Technology in 2003 from Jawaharlal Nehru Technological University, Hyderabad and ME in Computer Science and Engineering in 2006 from Satyabama University, Chennai, India. He is life member of ISTE. Dr. A. Rama Mohan Reddy is a Professor in the Computer Science and Engineering division at Sri Venkateswara University College of Engineering. His research interests include Software Architecture, Software Engineering, Data Mining and optimising compilers. He received his B.Tech. from JNT University, Hyderabad in 1986, M. Tech degree in Computer Science from National Institute of Technology in 2000 Warangal and Ph. D in Computer Science and Engineering in 2007 from Sri Venkateswara University, Tirupathi, Andhra Pradesh, India.
International Review on Computers and Software (I.RE.CO.S.), Vol. 10, N. 5 ISSN 1828-6003 May 2015
Spatio-Temporal Wavelet Based Video Compression: a Simulink Implementation for Acceleration I. Charfi, M. Atri Abstract – In this paper, we present a wavelet based video compression method. The wavelet transform is performed with the lifting scheme methodology, which allows temporal and spatial scalability as well as lossless reconstruction. The spatial decorrelation is carried out with the standard JPEG2000, and the inter-frame correlation is exploited thanks to a temporal wavelet transformation. A Simulink based implementation is then experimented with in order to accelerate the process thanks to a co-simulation step. Copyright © 2015 Praise Worthy Prize S.r.l. - All rights reserved.
Keywords: Lifting Scheme, Spatio-Temporal Wavelet Transformation, Motion Estimation, CoSimulation
t+2D video coders, a class of wavelet based compression methods, showed good performance while ensuring scalability. In this paper, we propose a t+2D spatio-temporal wavelet based video coder which first exploits the temporal redundancy between frames. The Three Step Search (TSS), a Block Matching algorithm, is used for motion estimation, a key step in video compression. A spatial decomposition is then applied to the obtained temporal sub-bands. This paper is organized as follows. In Section II, we review the related works. Section III gives an overview of the selected method. In Section IV, the experimental results and discussions are presented. The paper ends with a conclusion in Section V.
Nomenclature
MPEG    Moving Picture Expert Group
JPEG    Joint Photographic Experts Group
ME      Motion Estimation
MC      Motion Compensation
DWT     Discrete Wavelet Transform
MCTF    Motion Compensated Temporal Filtering
ES      Exhaustive Search
TSS     Three Step Search
4SS     4 Step Search
NTSS    New Three Step Search
DS      Diamond Search
ARPS    Adaptive Rood Pattern Search
SESTSS  Simple and Efficient Three Step Search
I. Introduction
Video compression plays an essential role in our daily life due to the growth of the internet and advances in multimedia. The diversity of video compression standards makes it possible to transmit, process and store video signals on different platforms and to interact with their content according to the bandwidth required for signal distribution. Furthermore, scalability has become an important feature of a video coding scheme, and it is essentially provided by the wavelet transform. Lifting scheme based wavelet transformation has been successfully applied in image processing. Later, numerous researchers extended the use of the wavelet transformation to the temporal analysis, in addition to the spatial analysis, for video compression. The result is spatio-temporal sub-bands characterized by spatial and temporal levels of resolution; spatio-temporal scalability is therefore reached.
II. State of the Art
The most active standardization bodies, ITU-T and ISO/IEC [1], developed numerous standards for video compression such as H.26X and MPEG-X [2], [3]. Subsequently, H.264/MPEG-4 AVC [4] was proposed, which reduces the transmission rate for an equivalent visual quality compared to previous standards. Later, Scalable Video Coding [5], [6] was developed jointly by ITU-T and ISO in order to afford a scalable content adapted to the expected quality of service. In fact, the need to adapt the reconstructed information to the application justifies the common quest for scalability. The wavelet transformation is a good solution to this problem, which is confirmed for t+2D, 2D+t and 3D [7] coding schemes. 3D and 2D+t coding schemes
showed a weakness since the temporal analysis is applied on spatial sub-bands, local motions are not considered and subpixel motion is not decorrelated. Motion JPEG2000 [8] is a standard for video compression in which each frame of the video stream is compressed using the standard JPEG2000; the video compression method is a simple concatenation of the encoded frames. The major interest of such a standard is the easy editing of images once encoded: random access is immediate and modifications are simple and lossless. This standard is used especially for video indexing. Nevertheless, Motion JPEG2000 does not make use of the temporal redundancy, since each frame is independently encoded. It ensures fast encoding with a low compression rate. The decorrelation between frames in video coding started truly only with t+2D schemes based on Motion Compensated Temporal Filtering (MCTF), where the wavelet transform is applied to frames along the motion direction. The resulting temporal sub-bands are spatially transformed, quantized and finally encoded. The motion vectors, resulting from the motion estimation step, are encoded and sent into the compressed flow. Wan showed the efficiency of the CRC-WVC [9] video scheme compared to H.264.
III.1. Lifting Scheme
The Lifting Scheme, also called second-generation wavelet, was proposed by Sweldens [10]. Its main steps, Split, Prediction and Update, are presented in Fig. 2. The Split operator divides the original signal into two subsets: the even samples $x_e$ and the odd samples $x_o$. The Prediction operator P is applied to the even samples in order to compute the detail signal d, which represents the high frequencies (Eq. (1)). The Update operator U is then applied to the detail signal in order to compute the low frequency (approximation) signal a (Eq. (2)):

$d = x_o - P(\{x_e\})$   (1)

$a = x_e + U(\{d\})$   (2)
These steps can be recursively applied to the signals in order to obtain different resolution levels. The greatest advantage of the Lifting Scheme is its simplicity to evaluate the wavelet transform in few steps. With the increasing need for high quality data compression techniques, Wavelet Transforms using Lifting scheme found instant acceptance among the research community. We used lifting scheme methodology for temporal filtering step and for spatial transformation.
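To make the split/predict/update steps concrete, here is a minimal sketch of one level of a 5/3-style lifting transform on a 1-D signal; it is a didactic Python/NumPy example (assuming an even number of samples and clamped borders), not the implementation used in this work.

import numpy as np

def lifting_53_forward(x):
    # One level of 5/3 lifting: split, predict (Eq. (1)), update (Eq. (2)).
    x = np.asarray(x, dtype=float)          # assumes len(x) is even
    even, odd = x[0::2], x[1::2]            # Split
    right = np.append(even[1:], even[-1])   # right even neighbour, clamped at the border
    d = odd - 0.5 * (even + right)          # Predict: detail (high-pass)
    left = np.insert(d[:-1], 0, d[0])       # left detail neighbour, clamped at the border
    a = even + 0.25 * (left + d)            # Update: approximation (low-pass)
    return a, d

def lifting_53_inverse(a, d):
    # Undoing the steps in reverse order guarantees perfect reconstruction.
    left = np.insert(d[:-1], 0, d[0])
    even = a - 0.25 * (left + d)
    right = np.append(even[1:], even[-1])
    odd = d + 0.5 * (even + right)
    x = np.empty(even.size + odd.size)
    x[0::2], x[1::2] = even, odd
    return x

Applying the forward transform recursively to the approximation signal yields the multi-resolution decomposition mentioned above.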
III. The Proposed Method
III.2. Temporal Analysis: Motion Compensated Temporal Filtering (MCTF)
We propose t+2D coding scheme which consists in building, firstly, the temporal sub-bands obtained thanks to Motion-Compensated Temporal Filtering step, named MCTF, and to use these sub-bands at the input level of a image compression standard JPEG2000 [8]. The method is based on 5/3 lifting scheme for temporal and spatial analysis which insures temporal and spatial scalability. Motion estimation (ME) and motion compensation are used at the motion compensated temporal filtering step in order to characterize the inter-frames motion. We evaluated different block matching algorithms for motion estimation and presented the results in section IV. The ME output represents the motion vectors which are encoded without loss and transmitted within compressed stream. Motion vectors are used for reconstruction thanks to the lifting scheme approach which insures reversibility. In Fig. 1 we present an overview of the proposed method.
In most of the existing video compression methods, the video sequence is first divided into Groups of Pictures (GOP), on which the same process is then applied. Motion compensation and motion estimation are applied in order to calculate and take into account the differences between frames. The motion estimation output represents the motion vectors (MV) that are used for the reconstruction of the video sequence. We illustrate in Fig. 3 the three levels of temporal decomposition of a group of 8 successive frames (GOF), using motion compensation (MC) as non-linear Prediction and Update operators. Motion compensation consists in taking the motion into account and associating each pixel with its corresponding pixel in neighboring frames.
III.2.1. Temporal Lifting Scheme and Motion Compensation
In video compression, the temporal transformation is used to exploit the temporal redundancy between frames, since motion compensation alone does not allow an optimal reconstruction. This problem is resolved thanks to the reversibility of the lifting scheme (Fig. 4), and we used motion compensation for the Prediction and Update steps as explained in Eqs. (3) and (4). The sub-bands H and L are obtained thanks to the equations below:
Fig. 1. Method overview
$H_t = x_{2t+1} - \tfrac{1}{2}\left[ MC(x_{2t}) + MC(x_{2t+2}) \right]$   (3)

$L_t = x_{2t} + \tfrac{1}{4}\left[ MC(H_{t-1}) + MC(H_t) \right]$   (4)

where $\{x_{2t+1}\}$ and $\{x_{2t}\}$ represent respectively the odd and even frames, $\{H_t\}$ and $\{L_t\}$ represent respectively the detail and approximation frames, and $MC(\cdot)$ denotes motion compensation.
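To make Eqs. (3) and (4) concrete, the following sketch computes one temporal level of motion-compensated lifting for a GOP held as a list of 2-D NumPy frames; motion_compensate is a placeholder for the block-based compensation of Section III.2.2 (identity warping here), so this is an illustration of the sub-band structure rather than the actual coder.

import numpy as np

def motion_compensate(reference, target):
    # Placeholder: the real system warps `reference` toward `target` using
    # block-matching motion vectors (SESTSS); here we simply return it.
    return reference.astype(float)

def mctf_level(frames):
    # One temporal decomposition level: detail frames H (Eq. (3)) and approximations L (Eq. (4)).
    H, L = [], []
    for t in range(len(frames) // 2):
        x_even = frames[2 * t].astype(float)
        x_odd = frames[2 * t + 1].astype(float)
        x_next = frames[min(2 * t + 2, len(frames) - 1)].astype(float)
        H.append(x_odd - 0.5 * (motion_compensate(x_even, x_odd)
                                + motion_compensate(x_next, x_odd)))
    for t in range(len(frames) // 2):
        h_prev = H[max(t - 1, 0)]
        x_even = frames[2 * t].astype(float)
        L.append(x_even + 0.25 * (motion_compensate(h_prev, x_even)
                                  + motion_compensate(H[t], x_even)))
    return H, L

Calling mctf_level recursively on the returned L frames gives the three-level GOP decomposition of Fig. 3.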
Fig. 2. The lifting scheme
Fig. 5. Bi-directional motion compensation
III.2.2. Motion Estimation
Motion estimation is a crucial step in video coding schemes: it generates the motion vectors that are used in motion compensation, and it exploits the temporal redundancy between two successive frames. In this study, we used Block Matching algorithms, which generate two-dimensional motion vectors corresponding to the coordinates of the origin of the matching block in the reference image. Each frame is divided into blocks of the same dimension (typically 8×8 or 16×16), inside which the motion is considered static. Informally, the motion estimation algorithm consists in finding, inside a search window, the corresponding block in the reference image. We used a search window of 16×16 blocks. The criterion used to find the corresponding block is the Mean of Absolute Differences (MAD):
Fig. 3. Three levels of motion compensated temporal filtering
Fig. 4. Temporal lifting scheme
The lifting scheme formalism ensures reversibility regardless of the prediction and update operations. Thus the reconstruction of the original frames is given by the equations below:

$x_{2t} = L_t - \tfrac{1}{4}\left[ MC(H_{t-1}) + MC(H_t) \right]$   (5)

$x_{2t+1} = H_t + \tfrac{1}{2}\left[ MC(x_{2t}) + MC(x_{2t+2}) \right]$   (6)

The Mean of Absolute Differences used for the block search is defined as:

$MAD = \dfrac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| C_{ij} - R_{ij} \right|$   (7)

where N is the size of the macroblock and $C_{ij}$ and $R_{ij}$ represent respectively the corresponding pixels of the current and reference macroblocks. We compared different block matching algorithms: Full Search (FS), Three Step Search (TSS), New Three Step Search (NTSS), Four Step Search (4SS), Diamond Search (DS) and Simple and Efficient Three Step Search (SESTSS) [12]. We retained the Simple and Efficient Three Step Search, which is an extension of the Three Step Search algorithm. Its principal idea is to divide the search window into 4 parts [13] and to define three positions.
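A minimal full-search block-matching sketch built around the MAD criterion of Eq. (7) is given below; the block size, search range and grayscale NumPy array layout are assumptions, and the SESTSS refinements discussed above are not reproduced here.

import numpy as np

def mad(block_a, block_b):
    # Mean of Absolute Differences between two equally sized blocks (Eq. (7)).
    return np.mean(np.abs(block_a.astype(float) - block_b.astype(float)))

def best_match(current, reference, top, left, block=16, search=7):
    # Exhaustive search of the motion vector minimising the MAD for one block.
    cur = current[top:top + block, left:left + block]
    best_vec, best_cost = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > reference.shape[0] or x + block > reference.shape[1]:
                continue
            cost = mad(cur, reference[y:y + block, x:x + block])
            if cost < best_cost:
                best_cost, best_vec = cost, (dy, dx)
    return best_vec, best_cost

Fast algorithms such as TSS or SESTSS evaluate only a small subset of these candidate positions, which explains the order-of-magnitude speed-ups reported in Tables I-III.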
For our work, we selected the commonly used 5/3 wavelet for the temporal transformation since it allows bi-directional motion estimation and better exploitation of the motion unlike the mono-directional Haar wavelet [11]. As mentioned in Fig. 5, motion compensation (MC) represents the Prediction and Update operations.
The three positions A (the origin), B and C, with S = 4, are presented in Figs. 6. The flow chart in Fig. 7 presents the rules adopted to define the search quadrant. The next step is to find the new origin that minimizes the MAD criterion with the Three Step Search method [14].
At the first level, each frame is decomposed into the four spatial sub-bands LL, HL, LH and HH. The decomposition is applied recursively to the LL and LLL sub-bands at the second and third levels (Figs. 8). A quantization step is then applied, reducing the number of bits used to encode the wavelet coefficients [15].
Figs. 6. Search zones
Figs. 8. Three levels of the spatial wavelet transformation: (a) first level, (b) second level, (c) third level
IV. Experimental Results
We evaluated the robustness of the wavelet based video coding method on three video sequences: "Claire" (slow motion), "Tennis player" (fast motion) and "Car" (very fast motion). We evaluated different Block Matching Algorithms and retained the best method in terms of reconstruction performance and processing time. We used a block size of 16 × 16 and a step size p = 7. We used the Peak Signal to Noise Ratio (PSNR) to evaluate the performance of the video coder:

$PSNR = 20 \log_{10}\left( \dfrac{255}{\sqrt{MSE}} \right)$   (8)

The Mean Squared Error (MSE) is given in the equation below:

$MSE = \dfrac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( I(i,j) - \hat{I}(i,j) \right)^2$   (9)

where $I(i,j)$ represents the pixels of the original frame and $\hat{I}(i,j)$ represents the pixels of the reconstructed frame.

Fig. 7. Rules used for the quadrant definition

III.3. Spatial Transformation: JPEG2000
JPEG2000 is a high performance image compression technique developed by the Joint Photographic Experts Group committee and based on the discrete wavelet transform. In our work, JPEG2000 compression is applied to the temporal sub-bands obtained after the MCTF analysis: the spatial DWT is applied to each temporal sub-band, producing at each level the LL, HL, LH and HH spatial sub-bands described above.
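A minimal sketch of the recursive spatial decomposition is shown below, assuming the PyWavelets package and a biorthogonal filter ('bior2.2') as a stand-in for the 5/3 kernel; the actual system uses the Jasper JPEG2000 coder, so this only illustrates the sub-band structure of Figs. 8.

import pywt

def spatial_decompose(frame, levels=3, wavelet="bior2.2"):
    # Returns the final approximation band plus the detail bands of each level.
    details, ll = [], frame
    for _ in range(levels):
        ll, (ch, cv, cd) = pywt.dwt2(ll, wavelet)   # one 2-D DWT level on the current LL band
        details.append((ch, cv, cd))                # horizontal, vertical and diagonal details
    return ll, details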
IV.1. Temporal Filtering We compared the performance obtained using 7 block matching methods [16] implemented in Matlab and presented the results in Tables I, II and III for the three selected video sequences.
For the three video sequences, the ES method gives the best performance in terms of PSNR; however, its processing time is much higher than that of the other algorithms. The 4SS, TSS, NTSS, DS and ARPS are about 10 times faster than ES in terms of processing time. This is due to the number of operations performed by the ES algorithm: inside the search window, the MADs are calculated for every candidate block, without any optimization. Compared to the simple TSS and NTSS, the SESTSS algorithm includes different rules allowing to accelerate the processing without decreasing the performance in terms of PSNR. This is confirmed by the results presented in Tables I, II and III for the three video sequences used. The SESTSS outperforms DS and ARPS for the three video sequences in terms of processing time and PSNR.
The SESTSS gives the best performance, with a good tradeoff between processing time and PSNR, which is why we selected it for our video coding method. We note that for the sequence «Claire» the results are better than those obtained for «Tennis player» and «Car», which is explained by the nature of the motion characterizing each sequence. We present in Fig. 9 three levels of temporal decomposition applied to a GOP of "Tennis player", using SESTSS for motion estimation and the 5/3 wavelet transformation.
TABLE I
PSNRS OBTAINED FOR THE SEQUENCE "CLAIRE" AFTER MCTF PROCESS WITH DIFFERENT BLOCK-MATCHING ALGORITHMS APPLIED TO A GOP OF 8 FRAMES
Frame    ES       4SS     TSS     NTSS    DS      ARPS    SESTSS
0        50.19    49.78   50.18   50.87   49.66   50.49   50.75
1        25.98    26.35   25.99   30.87   26.40   26.70   35.11
2        26.08    26.51   26.11   28.56   26.53   26.46   29.67
3        25.59    26.11   25.73   30.40   26.39   26.53   35.29
4        36.86    37.01   36.94   36.93   37.01   37.02   36.79
5        26.51    27.35   26.78   32.53   27.24   27.37   37.63
6        32.77    34.60   34.65   35.14   34.60   33.83   36.22
7        34.42    34.92   34.82   39.31   34.89   35.11   40.91
Time(s)  4481.60  469.46  455.06  353.03  556.87  321.06  260.98

TABLE II
PSNRS OBTAINED FOR THE SEQUENCE "TENNIS PLAYER" AFTER MCTF PROCESS WITH DIFFERENT BLOCK-MATCHING ALGORITHMS APPLIED TO A GOP OF 8 FRAMES
Frame    ES       4SS     TSS     NTSS    DS        ARPS    SESTSS
0        47.32    47.32   47.32   43.21   47.32     47.22   39.79
1        26.61    26.58   26.57   28.45   26.57     27.03   33.85
2        27.65    27.51   27.51   29.00   27.5296   27.89   29.05
3        26.94    26.87   26.89   28.70   26.8793   27.12   33.16
4        41.04    41.04   41.04   36.65   41.0412   40.76   33.03
5        27.64    27.64   27.65   29.06   27.6298   27.78   34.16
6        31.64    31.64   31.64   30.59   31.6391   31.26   30.25
7        33.36    33.32   33.74   33.31   33.3225   33.29   36.40
Time(s)  2710.40  458.34  403.74  350.77  500.44    304.55  196.36
Fig. 9. Temporal sub-bands for the three levels of wavelet decomposition for the Sequence “Tennis player”
IV.2. Spatial Filtering The Wavelet based spatial transformation is applied to the eight temporal sub-bands H0, H1, H2, H3, LH0, LH1, LLH0 and LLL0, outputs of the MCTF step. We used the Jasper JPEG2000 coder [17]. In Table IV, we presented the obtained PSNRs for different compression rates applied to the sequence «Tennis player» for three levels of decomposition.
TABLE III
PSNRS OBTAINED FOR THE SEQUENCE "CAR" AFTER MCTF PROCESS WITH DIFFERENT BLOCK-MATCHING ALGORITHMS APPLIED TO A GOP OF 8 FRAMES
Frame    ES       4SS     TSS     NTSS    DS      ARPS    SESTSS
0        58.60    60.33   60.33   52.44   60.33   60.34   59.47
1        19.33    19.30   19.32   21.91   19.30   19.39   27.35
2        20.96    20.93   20.92   23.28   20.97   21.05   26.72
3        19.31    19.27   19.27   21.78   19.28   19.35   26.57
4        45.32    45.06   45.24   41.83   45.16   46.31   43.06
5        20.44    20.38   20.38   22.93   20.38   20.56   27.93
6        31.84    31.88   31.88   30.03   31.88   31.57   33.03
7        27.97    27.92   27.93   29.24   27.97   27.99   36.62
Time(s)  2385.80  311.30  291.27  279.23  402.11  221.94  310.26
TABLE IV
PSNRS OBTAINED FOR DIFFERENT COMPRESSION RATES FOR THE SEQUENCE "TENNIS PLAYER"
Sub-band   10      5       2.5     1.66    1.25    1
0          22.76   24.07   25.17   25.31   25.33   25.33
1          26.66   30.05   32.54   32.77   32.84   32.84
2          24.56   26.84   28.49   28.66   28.68   28.68
3          26.60   29.57   31.55   31.77   31.83   31.83
4          21.98   21.85   22.25   22.31   22.32   22.32
5          26.90   29.90   32.05   32.27   32.31   32.31
6          23.89   24.49   25.03   25.10   25.10   25.10
7          26.95   28.31   29.23   29.31   29.30   29.30
As expected, the performance of the video codec decreases as the compression rate increases. The performance is only slightly degraded for a rate of 2.5 in comparison with rate 1 (without compression); we obtained good performance for a compression rate of 5 and acceptable performance for a rate of 10.
For “Claire”, “Car” and “Tennis Player”, the PSNRs given by DS are close to those obtained using SESTSS; however, its processing time is higher than that of SESTSS, which confirms the selection of SESTSS for the motion estimation step.
Critical results are obtained for higher compression rates due to the nature of the movement characterizing the sequence “Tennis player”. For slower movements, we reach good performance with the Jasper implementation at a compression rate of 20 (PSNR = 28 for the sequence “Claire” and PSNR = 27.13 for “Car”).
IV.3. MCTF Acceleration
The analysis of the processing time required by our video coding method showed that the motion estimation requires 98% of the total processing time. We therefore proposed to implement our video coder in the Simulink environment, which allows co-simulation as presented in Fig. 10.
Fig. 12. Simulink-based codec modelisation
We applied 5/3 wavelet based transformation for better temporal and spatial decorrelation thanks to bidirectional prediction. We proposed finally a Simulink based implementation using the embedded function which is useful for hardware/software co-simulation. Hardware implementation of the motion estimation, requiring 98% of the total processing time, allows to accelerate considerably the video coding process.
Fig. 10. Bidirectional co-simulation Simulink/ModelSim platform
The main idea is to have a graphical representation adapted to heterogeneous implementation and to implement the slowest block, motion estimation in our case, in VHDL. In Fig. 11, we presented the Simulink based implementation of a temporal sub-band including motion estimation (colored with blue), motion compensation (colored with pink). This implementation allows Matlab/ModelSim co-simulation.
References [1]
Information technology — JPEG 2000 image coding system Part 10: Extensions for three dimensional data ITU-T Recommendation T.809 | ISO/IEC 15444-10:200X [2] M. Castro Dufourny, "MPEG-4 object-based codec with Matlab”, TFE Department, Umea University, Sweden, 2006. [3] « MPEG-4 AVC (H.264) and why, only now, it can save 60% of the network video bandwidth and storage requirement», provided by www.hivision.com [4] F. Loras J. Fournier, H.264/Mpeg-4 Avc, « Un Nouveau Standard De Compression Vidéo », Conférence Internationale ; Compression Et Représentation Des Signaux Audiovisuels « Coresa », 2003. [5] H. Schwarz, D. Marpe, and T. Wiegand, “Overview of the Scalable Extension of the H.264/MPEG-4 AVC Video Coding Standard”, IEEE Trans. Circuits Syst. Video Technol., 2007. [6] G. Pau, « Ondelettes et décompositions spatio-temporelle avancées ; application aucodage vidéo scalable », Thèse de doctorat à l’Ecole nationale supérieure des télécommunicationsTélécom Paris-, 2006. [7] J. Hua, Z. Xiong, and X. Wu, “High-performance 3-D Embedded wavelet video (EWV) coding”. In Proc. Of IEEE Workshop on Multimedia Signal Processing, pages 569–574, Cannes, France, 2001. [8] G. Pearson and M. Gill , “An Evaluation of Motion JPEG 2000 for Video Archiving”, Proc. Archiving 2005 (April 26-29, Washington, D.C.), IS & T (www.imaging.org), 2005. [9] S. Wang, “Stratégie de codage conjoint de séquences vidéo à base ondelettes”, Thèse de Doctorat, Université de Poitiers, 2008. [10] Tinku Acharya and Ping-Sing Tsai. JPEG2000 Standard for Image Compression: Concepts, algorithms and VLSI architectures. Wiley-Interscience, New York, 2004 [11] Y. Andreopoulos, A. Munteanu, G. Van Der Auwera, P. Schelkens & J. Cornelis. “Wavelet-based Fully-scalable Video Coding With In-band Prediction”. In Proc. Third IEEE Benelux Signal Processing Symposium (SPS-2002), 2002. [12] S. Valette, « Modèles de maillages déformables 2D et multirésolution surfaciques 3D sur une base d'ondelettes », Thèse de Doctorat, Institut National des Sciences Appliquées de Lyon, 2002.
Fig. 11. Simulink based temporal sub-band implementation
In Fig. 12, we present the proposed codec implemented with Simulink thanks to the embedded function including SESTSS for motion estimation, 3 levels of temporal decomposition and 3 levels of spatial decomposition using Jasper for JPEG2000 standard.
V. Conclusion
We proposed a wavelet based video coder using a t+2D lifting scheme. We used the SESTSS block matching algorithm for motion estimation, which affords the best performance in comparison with the other block matching methods.
[13] P. Brault, « Estimation de mouvement et segmentation d'image », thèse de Doctorat. Université Paris-Sud XI, Faculté des Sciences d’Orsay, 2005. [14] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies. “Image coding using waveletTransform”. IEEE Transactions on Image Processing, 1(2) :205–220, 1992. [15] T. Totozafiny, « Compression d’images couleur pour application à la télésurveillance routière pour transmission vidéo à très bas débit », Thèse de Doctorat, Université de Pau et des Pays de L’Adour, 2007. [16] Barjatya, A. (2004). Block matching algorithms for motion estimation. IEEE Transactions Evolution Computation, 8(3), 225239. [17] http://www.ece.uvic.ca/~mdadams/jasper/
Authors’ information Faculty of Sciences of Monastir. Imen Charfi is a research assistant at the LEME laboratory (Tunisia). She obtained her PhD in Instrumentation and Image Processing in 2013 and her Master degree in Electronics in 2009. She is involved in real-time image processing and computer vision.
Mohamed Atri received his PhD in Microelectronics from the Faculty of Science of Monastir in 2001. He is currently a member of the Laboratory of Electronics and Microelectronics. His research includes Circuit and system design, image processing, network communication, IPs and SoCs.
International Review on Computers and Software (I.RE.CO.S.), Vol. 10, N. 5 ISSN 1828-6003 May 2015
Generating Graphical User Interfaces Based on Model Driven Engineering S. Roubi1, M. Erramdani2, S. Mbarki3 Abstract – In this article we present an approach for the generation of usable User Interfaces (UI) starting from UML models, and we show its results. The approach is based on Model Driven Engineering and on the development of specific meta models for UI, combining a UML model-based design on one hand and a set of transformations on the other hand. We defined a new meta model that is neither a use case nor an activity diagram, but our own way to describe the UI in terms of the user's interactions, and we applied the good practices of Design Patterns when developing the meta models. The approach consists in first creating a platform independent model (PIM) and then transforming this PIM into a platform specific model (PSM) through transformation rules. The PIM is created based on the UML use case and activity diagrams, to extract the main functionalities offered by the system to be developed. Then we established the PSM meta model, the target model of our transformation engine, taking into account the Composite and MVC design patterns. With this methodology, the UI can easily be analyzed, designed, and generated, speeding up system development. Copyright © 2015 Praise Worthy Prize S.r.l. - All rights reserved.
Keywords: Meta-Model, User Interface, Transformation, Model Driven Engineering (MDE), Design Pattern
I. Introduction
Today, developing high level applications requires an approach to software architecture that helps architects evolve their solutions in flexible ways. It should allow the reuse of existing effort and take into account the fact that the target infrastructure is itself evolving. To this end, a modeling approach is an efficient way to master complexity and ensure consistency. These ideas, among others, were considered central by the Object Management Group (OMG), a consortium of software organizations that develops and supports specifications to improve the practice of enterprise software development and deployment. To address this challenge, the OMG has devised a number of standards for software development under its Model Driven Architecture (MDA) approach [1]. MDA encourages efficient use of system models in the software development process. As defined by the OMG, MDA is a way to organize and manage enterprise architectures, supported by automated tools and services both for defining the models and for facilitating transformations between different model types; transformation rules, together, describe how source models are transformed into target models [2]. MDA also focuses on separating the business logic from the technical platform. Furthermore, applications' user interfaces nowadays face several challenges, among them the diversity of interaction devices, which involves multiple interaction platforms.
It is also necessary to take into account changes of context and to adapt the interface to the user. The obvious challenge is therefore to unify design and implementation around models and to ensure independence from platforms, which is why the reconciliation between the two communities, HCI and MDE, is promising. The key is to apply the concepts of MDA in order to model and represent User Interfaces that can be adapted to various platforms, an objective claimed by both communities. In this work the main objective is to unify the design and implementation of human computer interfaces around models by applying the MDA principles, introducing design patterns' practices and taking UML models as a basis. One way to tackle these challenges is by abstracting from the implementations through a Model-Driven Engineering [3] approach. Our vision is to apply the principles of MDA and to use UML models to first define a source meta-model describing the user interface and its operations, which led us to define a new meta model apart from use case or activity diagrams. Second, we defined the target meta-model to represent a graphical user interface independently from any platform. Then, we developed the transformation rules that connect the two models instantiated from the source and target meta-models. Last but not least, we defined the Model to Text transformation that generates the source code of the user interface, described first for the Java Swing platform.
The paper is organized as follows: Section 2 is dedicated to related work. In Section 3 we present the MDE principles. Section 4 defines the relationship between HCI and the Design Patterns used to develop the meta models. Sections 5 and 6 present the approach and the running example. Finally, Section 7 concludes the work and offers some perspectives.
II. Related Work
Given the generic aspect of the MDA principles, the approach has been applied to several system development layers. One of them consists in defining a new approach to generate the data warehouse multidimensional schema based on models and their transformations [5]. In [6], [7] the authors automatically generate a web application from a simplified class diagram. The application of these concepts has also shown its efficiency for generating N-tier web applications, as defined in [8]; the application of the three model layers defined in MDA on the one hand, and of the QVT transformation rules on the other, helped reach these results. Recently, several works related to the MDE area in general and to its application to HCI have emerged. The first work is based on the plasticity of User Interfaces and the application of MDA concepts for the purpose of unifying the modeling of GUIs [11]. This approach shows that the two communities have a lot in common and that their basic objective is to model and generate more plasticity when it comes to User Interfaces. Another related work, applying the MDA approach to Rich Internet Applications, is found in [4]. The approach is based on XML User Interface description languages, using XSLT as the transformation language between the different levels of abstraction. Again, this approach is oriented towards the User Interface and lacks the flexibility of an MDA approach for the whole web application. An MDA approach for AJAX web applications [12] was also the subject of a study that proposes a UML scheme using profiling for modeling AJAX user interfaces, and reports on the adoption of AndroMDA for creating an AJAX cartridge to generate the corresponding AJAX application code, in ICEfaces, with back-end integration: a meta model of AJAX was defined using the AndroMDA tool, and the generation of AJAX code was illustrated by an application that manages CRUD operations on a person. In the same perspective, in [16] the authors demonstrate the generation of the GUI of the Amazon Integration and Cooperation Project for Modernization of Hydrological Monitoring, using the stereotyping method on UML models with the AndroMDA tool. In our work, we focus on demonstrating the reconciliation between the two communities, MDA and HCI, via the application of the MDA principles, which are based on models, and we come up with a new way of modeling Human Computer Interfaces.
III. Model Driven Engineering
III.1. The OMG Approach
In November 2000, the OMG (Object Management Group), a consortium of over 1,000 companies, initiated the MDA (Model Driven Architecture) approach [8]. The OMG presents this approach as a way to develop systems that offer greater flexibility in the evolution of the system while remaining true to customer needs and satisfaction. In MDA, models expressed in a well-defined notation are a cornerstone for understanding systems for enterprise-scale solutions. MDA introduces an approach to system specification which relies on the separation into three different layers of abstraction: the Computation Independent Model (CIM), the Platform Independent Model (PIM) and the Platform Specific Model (PSM). We can define the three layers as follows. CIM: it represents a high level specification of the system's functionalities. It is often seen as a business model, as it uses a vocabulary that is familiar to the subject matter experts; it shows exactly what the system is supposed to do, but hides all the technology specifics. PIM: it allows the extraction of the common concepts of the application independently from the target platform. It exhibits a sufficient degree of independence so as to enable its mapping to one or more platforms. PSM: it combines the specifications in the PIM with the details required by the platform to stipulate how the system uses a particular type of platform, which leads it to include platform specific details. The transition from one level to another is provided by transformations, which can be defined as the operation of taking elements of one or more source models and matching them with elements of the target model. There are two types: Model To Model (M2M) and Model To Text (M2T) transformations. The first lets us go from CIM to PIM and from PIM to PSM; the second allows the generation of code for the chosen platform. Fig. 1 below shows how the transformations are done. This relationship, τ, introduced in [14] and [15], connects two models and is the first step toward automation and code generation. In addition, three key concepts are the basis of MDA: models, meta-models and transformations, each with its own relationship. Models: there is no universal definition of the model concept; however, the model and the system studied play two complementary roles. The relationship between a model and the system studied is denoted μ. In simplified terms, a model is a simplified representation used to answer questions in place of the system; the reader may refer to [13] for a fuller discussion. Meta models: a meta model can be defined as the model of a modeling language, as is the case for UML. The concept of meta-model leads to the relation denoted χ, which is discussed in [14].
The declarative part is defined by the Relation and Core languages with different levels of abstraction, while the Imperative part is defined by the operational language. Finally, a black box is defined in the MOF 2.0 QVT standard that enables escaping the whole transformation/library or its parts that are difficult or impossible to implement in pure QVT. This work uses the QVT-Operational mappings language implemented by Eclipse modeling [18]. B. Acceleo There are a number of tools in MDA aimed at the automation of applications’ development. They are known as generators of Model Driven Architecture, they have the possibility to generate significant portions of the source code of the modeled application. The principle being to parse the representation of the model in XML file format Metadata Interchange (XMI) [19] and apply a number of templates in order to generate the source code. Optionally, the developer will have to add or edit source code portions to complete its application code. Acceleo is an implementation of the "MOFM2T" standard, from the Object Management Group (OMG), for performing model-to-text transformation and code generation. It is used in our work for final transformation and code generation of the graphical user interface.
Fig. 1. Model Driven Architecture levels
III.2. Transformation of MDA Models
Once the meta models are developed, MDA provides the passage between the CIM, PIM and PSM models through the execution of model transformations. A transformation converts models with a particular perspective from one level of abstraction to another, usually from a more abstract to a less abstract view, by adding detail supplied by the transformation rules. There are two types of transformations in the MDA approach. The first one takes as input a model conforming to the defined meta model and produces another model as the output of the transformation engine; it is known as Model To Model (M2M) transformation and concerns the transition from CIM to PIM or from PIM to PSM. The second one concerns the generation of code in a specific programming language from the input model (the PSM); it is called Model To Text (M2T) transformation. These transformations can be written according to three approaches: the approach by programming, the approach by template and the approach by modeling.
IV. Human Computer Interface
The Graphical User Interface (GUI) is a type of Human Computer Interface that allows the user to interact with the device (computer, smart phone, PDA…) by manipulating conventional graphical objects with a pointing device, usually a mouse. The life of an HCI can be divided into two phases: its construction and its execution. The building of the GUI is managed while respecting usefulness, as it should offer the desired user services, and usability, in terms of ease of learning, efficiency, etc. Nowadays, GUIs are deployed in heterogeneous and dynamic interactive spaces: they are spread over a range of platforms. This leads to thinking of new ways of developing the presentation layer of the application and of lowering the boundaries between design, implementation and evaluation. If one identifies common characteristics of visual components for HCI, there are three general constants for a component: its content (internal data, stored data, etc.); its appearance (style, color, size, etc.); its behavior (mainly its reaction to events). For this purpose the Model-View-Controller architecture supports this design with 3 classes associated with each component. The model stores the content and contains methods to change this content. The view displays the content: it is responsible for drawing on the screen the data stored in the model. The controller manages the interaction with the user.
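A minimal, language-agnostic illustration of the three MVC roles described above is sketched here in Python; the class and method names are invented for the example and are not taken from the generated Java Swing code.

class RegistrationModel:
    # Model: stores the content and exposes methods to change it.
    def __init__(self):
        self.users = []
    def register(self, name, email):
        self.users.append({"name": name, "email": email})

class RegistrationView:
    # View: responsible only for rendering the data held by the model.
    def render(self, model):
        for user in model.users:
            print(f"{user['name']} <{user['email']}>")

class RegistrationController:
    # Controller: reacts to user events and coordinates model and view.
    def __init__(self, model, view):
        self.model, self.view = model, view
    def on_submit(self, name, email):
        self.model.register(name, email)
        self.view.render(self.model)

controller = RegistrationController(RegistrationModel(), RegistrationView())
controller.on_submit("Alice", "alice@example.com")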
A. MOF 2.0 QVT
The modeling approach is designed to provide sustainable and productive model transformations, independently of any execution platform. This is why the OMG has developed a standard transformation language, MOF 2.0 QVT [17], standing for Query/View/Transformation. QVT has a hybrid character (declarative / imperative) and consists of three languages: QVT-Relations, QVT-Operational and QVT-Core. Fig. 2 shows this repartition.
Fig. 2. Relationship between QVT metamodels [15]
The base meta model serves as our PIM, and the target meta model (the PSM) contains the meta elements that describe a graphical interface.
A. PIM source meta model
Based on what has been described above on the one hand, and on the MDA Platform Independent Model definition on the other hand, we elaborated the base meta model which reflects these findings. This meta model should be able to capture the main structure of the user interface described through its elements. As a result, we have defined a new way to model a user interface, which is neither a use case nor an activity diagram, and which takes into account the relationship between the user operations and the goal to be achieved. This meta model [20] is modified and improved to take the layout into account, as described further in this section.
Fig. 3. The MVC pattern
Besides, the graphical interface is composed of several components that allow the interaction between the user and the application. These components can be divided into containers that gather simpler components, which corresponds to the Composite design pattern, one of the structural design patterns; it is used when a part-whole hierarchy has to be represented. In the GUI modeling process we adopted, we used the Composite design pattern because it offers the possibility to add new kinds of components and keeps the design general, since objects and composites can be treated uniformly. Along the same lines, we opted for the design patterns previously presented, which promote the reusability of code on the one hand and significantly reduce the time required for the application implementation on the other. The meta model proposed in Section 5 describes this application.
Fig. 4. Proposed platform independent User Interface meta model
The proposed PIM meta model contains the following: Use Case: describes the main functionality offered by the system. MainOperation: express the concept of the generic operation performed by the user to interact with the system. This operation is divided into several atomic activities. Activity: represents the atomic activity done by the user to handle a part of the main operation; (select an element from a list, input an information..). Property: gives further information about the activity, such as if it is a single or multiple choice. This property narrows the translation into graphical component in the PSM meta model. ActivityType: enumeration that lists the basic types that an activity could belong to. For each meta model, relationships and cardinality improve semantic model. In our case, the use case, gathers one or more main operations. Each main operation is split into several atomic activities with a specific type that will help choosing the right graphical component. It is important to mention that the positioning of the components in the GUI is essential. So, as improvement and continuity of [20], we enhanced the proposed PIM to account for the positioning of components as desired by the user. So we added some meta elements into the meta model to take into account the layout of components as shown in the Fig. 5. The user should define the component horizontal and vertical location that will be translated within the targeted specific platform in the transformation process.
V. Model Driven Engineering for Human Computer Interface
V.1. Overall View of Our Approach
Among UML models, the two models that best describe the user's interaction with the system are the use case and activity diagrams. However, this representation remains weak, as it lacks any additional information on the graphical components that the user will be interacting with. From this, we noticed that a use case is logically divided into multiple user activities to achieve the goal it encapsulates. That is to say, in our approach, the use case is seen as a series of operations, performed by the user, that enable the direct interaction with the system. To achieve the purpose behind each activity or operation, a graphical component is included in the interface to assist the user in performing the actual operation. This is what helped us define the meta models described in the following sections.
V.2. The Modeling Process
In order to be able to model the user interface of an application during the design process, an appropriate meta model, as in [6] and [10], or a profile, as developed in [11], [18], needs to be defined first. Therefore, we chose to define meta models based on the findings and the approach described above. In the following sections, we present the base meta model used as our PIM, and then the target PSM meta model.
Each component has its layout parameters that allow its positioning in the view. The ViewPackage in our model represents the package that contains all the Views of the application. Each view is composed of several graphical components. A component can raise an event, which is the link between the component and the processing of the functionality provided.
Fig. 5. The PIM meta classes for positioning the component
B. PSM Target meta model Once we have established the source meta model for our approach, the second step while applying the MDA approach is to elaborate the target meta model that will form the PSM independently from any specific programming language. We are focused, as stated earlier in this paper, on the basic operations required in any User interface; which are: input, selection and click operations. Therefore, we used the two design patterns MVC and Composite to establish the PSM in question. Fig. 6 represents the MVC part of our meta model.
Fig. 7. The proposed PSM target meta model for HCI, specific to the Swing platform
V.3. The Transformation Process
Once the meta models developed, the next step is to specify the correspondences between two meta models and automatic generation of transformation rules. These mapping rules will then get a first PSM model from a PIM model. The second step is to transform the PSM model to the final code specific to the chosen platform; Java Swing in our case.
Fig. 6. The MVC part of the proposed PSM meta model
SwingMVCPackage: expresses the concept of a package that includes all the elements of the model, the view and the controller, which are themselves grouped, respectively, in ModelPackage, ViewPackage and ControllerPackage. View: contains the widgets of the view that will be defined hereafter. Controller: responsible for the communication between the model and the view; it contains all the listeners that handle each event raised by a user operation. Model: expresses the business logic behind the application and is composed of methods that are triggered by a specific listener. Fig. 7 describes the View part of the PSM meta model. We give hereafter a description of each element. View: describes the main element of the GUI. Widget: represents the graphical widget that handles the events described before. Container: refers to the generic component that gathers other graphical components, with a specialization for each type (buttons, labels, text fields…). Component: expresses the main Swing component. We defined the other types of widgets in a hierarchical way, as shown in Fig. 7.
V.4. The Model To Model Transformation Rules
To make MDA model construction and transformation practical, transformation rules must be developed between models [9]. For our work, we established the mapping rules that lead from PIM elements to their equivalent PSM elements. The algorithm and the flow chart presented below illustrate the sequence of required transformations. Besides, the position layout defined by the user in the input model is also taken into account in the transformation process.

A. Main algorithm:

input  srcModel  : UmlPackage
output destModel : SwingMVCPackage
begin
  create SwingMVCPackage mVCRegistrationPack
  create ViewPackage viewPack
  create ModelPackage modelPack
  create ControllerPackage controllPack
  for all c in srcModel.UseCase
    map UseCase2View(c)
    map UseCase2Controller(c)
    map UseCase2Model(c)
  end for
end
mapping UseCase2View(act : Activity) : View
begin
  create View theView
  theView.name = act.container.name + 'View'
  for all act.type
    if act.type is label   map activityToLabel()     end if
    if act.type is input   map activityToInput()     end if
    if act.type is click   map activityToButton()    end if
    if act.type is select  map activityToSelection() end if
  end for
end

mapping UseCase2Controller(c : UseCase) : Controller
begin
  create Controller ctrl
  link theView to ctrl
  link theModel to ctrl
  create Listener theListener
  link theListener to theView
end

mapping UseCase2Model(c : UseCase) : Model
begin
  create Model theModel
  create Method theMethod
  link theMethod to theEvent
end
The main algorithm and its flow chart describe the steps taken to generate the MVC Swing model for the final application. First, we create the MVC package based on the UML package from the input model, which conforms to the PIM meta model presented earlier. Then, for each use case, we create the controller, the model and the views, and we establish the connections between those elements through the events, handlers and methods that are the actual connectors of the application, respecting the MVC pattern. Finally, the graphical components are specified and added to the view depending on the activity's type. For the model parsing part, we used the Eclipse Modeling Framework (EMF), offered by the Eclipse platform, along with a number of dedicated tools. More specifically, for the parsing phase, all diagrams are traversed in their EMF format, which can be exported either from the modeling tool or from their XMI representation. This parsing is used to apply the M2M transformation rules to the models. Once the source model passes through the model transformation engine, we obtain the output target model describing the HCI modeled initially.
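As a rough, non-normative illustration of what the UseCase2View mapping does, the following Python sketch maps PIM activity types to widget kinds; the real transformation is written with QVT-Operational over EMF models, and the dictionary keys and widget names used here are simplified stand-ins.

# Simplified stand-in for the ActivityType -> graphical component rule set.
WIDGET_FOR_TYPE = {
    "label": "JLabel",
    "input": "JTextField",
    "click": "JButton",
    "select": "JComboBox",
}

def use_case_to_view(use_case):
    # Build a PSM-like view description from a PIM use case (dicts stand in for EMF objects).
    view = {"name": use_case["name"] + "View", "widgets": []}
    for activity in use_case["activities"]:
        view["widgets"].append({
            "kind": WIDGET_FOR_TYPE[activity["type"]],
            "label": activity["name"],
            "layout": activity.get("layout"),  # horizontal/vertical position taken from the PIM
        })
    return view

registration = {
    "name": "Registration",
    "activities": [
        {"type": "input", "name": "Enter Name"},
        {"type": "select", "name": "Select Country"},
        {"type": "click", "name": "Send"},
    ],
}
print(use_case_to_view(registration))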
V.5. The Model to Text Transformation Rules
Once the application has been sufficiently modeled the code generation procedure follows with the Model To Text transformation (M2T) to get, in our case, the Java Swing code, using Acceleo. This transformation follows the template approach; thus we have developed the needed templates for code generation swing taking as an input the model generated as a result from the M2M transformation.
B. Flow-chart of the main algorithm
[comment encoding = UTF-8 /]
[module generateJavaSwing('http://swingmvcmm/1.0')]
[template public generateJavaSwing(aSwingMVCPackage : SwingMVCPackage)]
[comment @main/]
[for (v : View | aSwingMVCPackage.ViewPack.views)]
  [file (v.Title.toUpperFirst().replaceAll(' ', '')+'.java', false, 'UTF-8')]
    package [aSwingMVCPackage.ViewPack.Name.replace(' ', '')/]
    [createViewClass(v)/]
  [/file]
[/for]
[for (c : Controller | aSwingMVCPackage.ControllPack.controller)]
  [file (c.Name.replace(' ', '')+'.java', false, 'UTF-8')]
    [createController(c)/]
  [/file]
[/for]
[for (m : Model | aSwingMVCPackage.ModelPack.model)]
  [file (m.Name.replace(' ', '')+'.java', false, 'UTF-8')]
    [createModel(m)/]
  [/file]
[/for]
[/template]

[template public createViewClass(view : View)]
import java.awt.*;
import javax.swing.*;
public class [view.Title.toUpperFirst()/] extends JFrame{
  [for (widget : Widget | view.widgets)]
    [declareWidget(widget)/]
  [/for]……
Fig. 9. The Acceleo main module for Java Swing code generation
In the M2T transformation with Acceleo, the execution of the templates we developed produces the source code of the application, with Java files for the views, the controllers and the models. With these generated files, we were able to create an MVC Swing project that gave us the graphical interface connected to the other parts of the application.
Fig. 8. The main algorithm flow chart
VI. Running Example
In order to validate the proposed architecture and development methodology, we took the “Registration” functionality as a case study. In this use case, several operations are implicitly defined. To achieve the registration goal, the user should provide some required information and then submit the form to the database. More precisely, the user has to enter the name, address, email and password: these operations can be gathered as Input Operations. Then he should choose the country and the gender, which can be defined as Select Operations. Finally, in order to send the information, the submission of the form is done by pressing the Send button, or the user can cancel the whole operation: these are Click Operations. In addition, we can also have label operations, which are simpler than the other operations. We illustrate this succession using a diagram similar to a UML activity diagram, as shown in Fig. 10. From this distribution, we instantiated the PIM defined earlier. Fig. 11 presents a few elements from this model and their layout information for the graphical components. Each and every operation from this scheme can be associated with a component. For instance, “Enter Name” is closely related to the “text field” component, “Select Country” will be a drop down list, and so on. In addition, the layout information taken from the entry model gives proper positioning in the generated output model. Fig. 12 illustrates the generated file respecting the PSM meta model; it gathers all the elements for generating the Java Swing MVC application and contains the View elements with all the graphical components. Once the application corresponding to the starting use case “Registration” is modeled, the code generation procedure follows, applying the templates developed for the M2T engine using Acceleo and the EMF tools for parsing the models in question. Fig. 13 shows the result of the generated code. Since the user interface has to be as ergonomic as possible, the layout of the components is essential. We took this information into account in the modeling phase and generated an interface that is better organized by using the grid bag layout, the most powerful library for the positioning of Java Swing components. The result is illustrated in Fig. 13. Note that the positions given in the model may be omitted, in which case the components are displayed in an arbitrary order and the user can then arrange them as he wishes. This modeling process can easily be reused for different cases to automatically generate a running Java Swing application. In the example below, the same process was used: the user only specified the input model, and the generated view reflects all the user's specifications. Fig. 14 shows the final result for this example.
Fig. 10. The Registration form operations
Fig. 11. The input model for the registration form
Fig. 12. The XML result file after applying M2M transformation rules
Fig. 13. The Registration Form generated with the grid bag layout
Fig. 14. The input file model for the find/replace application
VII. Conclusion and Perspectives
In this paper, we have defined a new way to model user interfaces based on MDA meta models. To do this, we introduced a new meta model, which is neither a use case nor an activity diagram, but a description of the UI operations, their types and the user's preferences for positioning the components. We first elaborated the PIM as the base meta model for designing the UI, then the target meta model representing the PSM for the Java Swing platform. After that, we defined the mapping rules for the model-to-model transformation that allow the transition between the two models. Finally, we defined the model-to-text transformation based on the output file resulting from the first transformation. Following these steps, we were able to generate an ergonomic UI described simply with a use case. In future work, we aim to develop the proposed meta models further and to couple several input PIMs in order to handle complex user interfaces, in particular for Rich Internet Applications.
Authors’ information 1
MATSI Laboratory, EST, Mohamed First University, Oujda, BP 473, Morocco.
References
2
[1] [2]
[3]
Department of Management, EST, Mohamed First University, Oujda, BP 473, Morocco.
OMG. MDA, 2008. http://www.omg.org/mda Mens T., Czarnecki K., Van Gorp P. A Taxonomy of Model Transformations Language Engineering for Model-Driven Software Development, Dagstuhl February - March 2004. D. C. Schmidt. Model-driven engineering. Computer, 39(2):25– 31, 2006.K. Elissa, “Title of paper if known,” unpublished.
3
Department of Computer Science, Faculty of Science, Ibn Tofail University, Kenitra, BP 133, Morocco.
Sarra Roubi is pursuing her PhD at Mohammed First University in the MATSI Laboratory. She graduated as a computer science engineer from ENSA (National School of Applied Sciences). Her research activities at the MATSI Laboratory (Applied Mathematics, Signal Processing and Computer Science) are focused on the MDA (Model Driven Architecture) approach applied to the dynamic generation of Graphical User Interfaces. Mohammed Erramdani teaches the concepts of Information Systems at Mohammed First University. He obtained his national doctorate in 2001. His research activities in the MATSI Laboratory (Applied Mathematics, Signal Processing and Computer Science) focus on MDA (Model Driven Architecture) integrating new technologies: XML, EJB, MVC, Web Services, etc. Samir Mbarki received his B.S. degree in applied mathematics from Mohammed V University, Morocco, in 1992, and a Doctorate of High Graduate Studies in Computer Science from Mohammed V University, Morocco, in 1997. In 1995 he joined the Faculty of Science, Ibn Tofail University, Morocco, where he is currently a Professor in the Department of Mathematics and Computer Science. His research interests include software engineering, model driven architecture and software metrics.
International Review on Computers and Software (I.RE.CO.S.), Vol. 10, N. 5 ISSN 1828-6003 May 2015
Requirement Scheduling in Software Release Planning Using Revamped Integer Linear Programming (RILP) Model
Sandhia Valsala, Anil R. Nair
Abstract – Software development generally involves four traditional processes, namely requirement management, construction and development of the architecture, delivery, and maintenance. The requirement management phase is in turn composed of two processes, namely generating the requirements and scheduling these generated requirements. Generating the requirements means selecting the best requirements, and scheduling means ordering the execution sequence of these generated requirements so that the project is delivered exactly on time; improper scheduling delays the product delivery. This paper focuses on this scheduling issue by proposing a new algorithm that provides an efficient schedule for developing the requirements so that the project duration is minimized. Hence, a Revamped Integer Linear Programming (RILP) model is proposed that considers requirement precedence and resource constraints while scheduling the generated requirements, and thus calculates an on-time-delivery project schedule which minimizes both the project span and the cost of requirement development. Copyright © 2015 Praise Worthy Prize S.r.l. - All rights reserved.
Keywords: Software Product, Requirement Management, Requirement Scheduling, Integer Linear Programming (ILP), Software Release Planning
Nomenclature
m : number of requirements
T_w : development teams, w = 1, …, W
d : number of days a team takes to develop a requirement
n : number of jobs a requirement consists of
R = {r_1, …, r_m} : requirement set
J = (j_1, …, j_n) : job set, consisting of the jobs of a requirement
q_j : resources needed for developing the requirement
j_0, j_{n+1} : two virtual jobs created before the start of the project and after the end of the project
p_j : processing time of a job
T : time span of the project
ES_j : earliest start time
LS_j : latest start time
t : possible start time of a job, between the earliest start time and the latest start time
DL : deadline
RD : release date
x_{jt} : decision variable
Q̄ : amount of resources that are available during the overtime
H_t : volume of resources that are hired in overtime
ILP : Integer Linear Programming
RILP : Revamped Integer Linear Programming
GSD : Global Software Development
ISPMA : International Software Product Management Association
I. Introduction
Software development is a sequence of actions in which the user requirements are turned into the final software product. This sequence includes converting the user requirements into a prototype (user model), developing the prototype into real software (development phase), appraising the developed product and, sometimes, maintaining the delivered product. Initially, a decision has to be made on requirement selection, that is, which requirements are to be placed in the next version of the existing software [1]. This process is called requirement prioritization, where the selected set of requirements must satisfy the various stakeholders' needs in terms of stakeholder preference, requirement quality, requirement dependencies, cost of development, development risk and so on [2]. The next essential part of software development is requirement scheduling. Since the selected requirements have dependencies on each other, requirement scheduling may be restricted in time. Likewise, some specific functions need to be added to the software before development starts [3]. Hence it is necessary
to order the optimal requirements so as to decide which requirements have to be included in the next release and a suitable time for the release. Very little research has been carried out in the area of requirement scheduling compared with the research done on requirement selection. This paper introduces the Revamped Integer Linear Programming (RILP) model for scheduling the optimal requirements that have already been selected in the requirement prioritization process. The proposed RILP is based on a mathematical model that gives a clear understanding of software release scheduling in terms of dependencies, cost and profit. The system also takes into account the many political and behavioral factors of the requirements that may influence correct scheduling. The mathematical model of the proposed RILP is not, as a whole, a decision making process, but it provides useful factors that can support decision making. This way of scheduling is applied to numerous requirements, and the optimal solution maximizes the revenue against the available resources.
II. Related Work

The author of [4] proposes that incremental software development is an evolution of the software product that is completely based on user satisfaction. New features initiated by the stakeholders lead to a new software process. Release planning involves a decision making process regarding the assignment of features, and this decision making depends upon cost, time and available resources. In requirements management, effective communication and collaboration between stakeholders are important. In Global Software Development (GSD), where software teams are spread around the world and there is geographical distance between the stakeholders, effective communication can become a problem. Considering the factors of GSD, previous research shows that the requirements management performed in collocated software development projects is not effectively usable in GSD projects. For this issue, the author of [5] proposes a requirements management method for GSD which comprises four stages: (1) establishing and maintaining a requirements repository; (2) generating a requirements traceability matrix; (3) communicating and discussing requirements; and (4) requirements change management.
In a distributed software development environment, requirements management is the most challenging and important key to producing good software products. The main reasons for the failure of software projects are poor project management and a poor requirement management process. The work in [6] was created to initiate the following:
1. Create a framework to obtain the appropriate requirements for a management framework in global software development projects.
2. Design a mixed organization structure for the traditional approach and the agile approach.
3. Generate an ontology based on the knowledge management system for the traditional approach and the agile approach, to overcome issues like missing or disapproved requirements, collaboration and knowledge management problems, and to improve the project management process in a GSD environment.
The authors of [7] conducted a state-of-practice survey used to evaluate how companies accept Software Project Management (SPM) practices and how they coordinate in practice with the framework suggested by the International Software Product Management Association (ISPMA). In the ISPMA view, the SPM framework initiates the core product management process, but the success lies in how the product management process interprets it.
The paper [8] suggests a new Test Point Analysis based Module Priority approach for computing the time to stop the testing phase and to start the procedure for product delivery. Hence it is essential to schedule the testing phase in advance: due to wrong scheduling, the testing phase duration can be reduced or increased, which leads to non-detection of bugs and thus decreases the software quality. The testing phase in general consumes much of the expenditure, so prioritizing and scheduling the testing components is an essential task, and this gives an optimal solution under time constraints [9].
The author of [10] analyzes different release planning models, where feature selection is made upon the factors of those models. A taxonomy of the requirement selection process is built based on 32 release planning models. That paper attempts to clarify the process of finding the real factors that are necessary for planning a release, and to exploit the value of the identified factors on a release plan efficiently and optimally.
The authors of [11] present a method to solve an ILP by means of an approximate integer solution to the relaxed LP. If the solution does not satisfy the exact condition, a modified form of the branch and bound algorithm is applied, and the optimal hyperplane is searched, to provide an optimal integer solution.
III. Revamped Integer Linear Programming (RILP) Model for Requirement Scheduling

Requirement prioritization is the most fundamental stage of a software release: the incoming requirements of a particular piece of software are processed, and the optimized, proficient requirements that satisfy all the stakeholders' needs are finally selected. This selection of requirements is done using some efficient algorithm. Once requirement prioritization is over, the selected requirements need to be scheduled within a fixed time interval; the time interval for developing all the requirements has to be determined [11]. Hence, in this section, the RILP model attempts to solve the requirement scheduling problem such that the
execution order of the requirements minimizes the delay of the final software release. Requirement scheduling usually involves two constraints, namely the precedence constraints present between any two requirements and the limited available resources. Scheduling the requirements with the RILP model addresses this resource-constrained project scheduling problem (RCPSP).

III.1. Problem Statement and Formulation

Let the collected requirements be organized into a set R = {r_1, r_2, …, r_m}. The software development teams that develop the requirements are denoted T_w (w = 1, 2, …, W). The number of days required by a team to develop a requirement is d. Each team works on a single requirement at a time, and there are no time constraints on the development activities within a single requirement. The activities of a single requirement are denoted as a job set J = (j_1, j_2, …, j_n), and the time for developing this single requirement is positive. Since a requirement has a set of jobs, this set is partitioned into disjoint subsets, i.e. subsets having no job in common, denoted {J(r_1), J(r_2), …, J(r_m)}. Likewise, the partitioned activities belong to individual teams, so the sets of activities belonging to the individual teams are represented as {J(T_1), J(T_2), …, J(T_W)}. As mentioned before, requirement scheduling using RILP considers requirement dependencies and resource availability while performing the scheduling: the proposed scheduling process schedules the prioritized requirements with respect to the resource constraints.

(i) Requirement dependencies
Consider a requirement set consisting of prioritized requirements and the dependencies between them. Let the requirement set be R; the requirements and the dependencies between them are represented as D = {(r, r*) | r ← r*}. This set contains the requirements r and r* and expresses the dependency that requirement r* depends on requirement r. Usually, in software companies, a single requirement of the project is partitioned into many small jobs. Since the requirements have dependencies between them, the partitioned jobs also have precedence between them [12]. The job set and its precedence relation are expressed as:

P = {(j, j*) | j ∈ J(r), j* ∈ J(r*), (r, r*) ∈ D}    (1)

This precedence relation shows that job j belongs to requirement r and job j* belongs to requirement r*. Based on the requirements, the jobs expose the same behavior, i.e. job j* depends on job j, so job j should be done before job j*. In order to make the precedence constraints more flexible and understandable, two virtual jobs are introduced, the first placed before the project starts and the second after the project ends; these virtual jobs are named "project start" (j_0) and "project end" (j_{n+1}). The job "project start" must start and end before the jobs of the real project start. Likewise, the job "project end" must start only after all the jobs of the real project are finished. Since these two jobs do not have any project work to process, their processing time is always 0. Each job j requires some processing time p_j and also needs an amount of resources q_j. The set J contains the jobs of the project; when the two virtual jobs are added to the other jobs of the project, the extended job set becomes J′ = J ∪ {j_0, j_{n+1}}. If a job of the real project has no successor or predecessor, these virtual jobs are added before and after all the jobs.

(ii) Time span representation of a project
The RILP model needs to be formulated for the RCPSP problem, which schedules the requirements in time. Let the time required to complete the whole project be T, computed as:

T = Σ_{j ∈ J} p_j    (2)

where p_j represents the developing time of job j. This corresponds to developing the requirements in sequence, without any overlapping between teams, and in time.

III.2. Metrics

The computation of the earliest start time and the latest start time is done as follows. First, the jobs are arranged according to their dependencies, so that the same order is used for the development process. Then the earliest start time of a job a is computed as:

ES_a = ℓ(j_0, a)    (3)

i.e. the time between the project start (virtual job) and the real job, where ℓ(·, ·) denotes the length, in processing time, of the longest path between two jobs in the precedence graph. After that, the latest start time of the job is computed as:

LS_a = T − ℓ(a, j_{n+1})    (4)

i.e. the time between the completion time of the project and the project end (virtual job).
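As a small worked example (with assumed processing times, not data from the case study), consider three real jobs forming a chain j_1 → j_2 → j_3 with p_1 = 3, p_2 = 2 and p_3 = 4, plus the virtual jobs j_0 and j_4. Then:

T = p_1 + p_2 + p_3 = 3 + 2 + 4 = 9
ES_{j_1} = ℓ(j_0, j_1) = 0,  ES_{j_2} = 3,  ES_{j_3} = 3 + 2 = 5
LS_{j_3} = T − ℓ(j_3, j_4) = 9 − 4 = 5,  LS_{j_2} = 9 − (2 + 4) = 3,  LS_{j_1} = 9 − 9 = 0

Because a pure chain has no parallelism, every job here has zero slack (ES equals LS); adding a second, shorter chain would give its jobs a non-trivial window [ES, LS] of possible start times.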
III.3. Methods

Revamped Integer Linear Programming (RILP) Model
In the Revamped Integer Linear Programming (RILP) model, the basic steps of the general ILP for the RCPSP problem are considered and, in addition, some more constraints are involved in formulating the general ILP; hence, the proposed method is named Revamped ILP. The proposed RILP for the RCPSP model minimizes the cost of the project in addition to the project-span minimization of the ILP. In order to provide these two minimizations, additional constraints are included [13]. Consider the time variable t, which ranges over the interval between the earliest start time ES_j and the latest start time LS_j of each job j; in the RILP formulation below, t represents a possible start time of the particular job. Each job is developed between the given release date RD and the deadline DL. A decision variable x_{jt} is introduced, equal to 1 only if the particular job j is processed (started) at time t. In some cases the jobs are developed within the specified time, or overtime is utilized; Q̄ denotes the amount of resources that are available during the overtime and H_t the volume of resources that are hired in overtime at time t. The formulation of RILP for the RCPSP problem is then as follows:
min  Σ_t t · x_{j_{n+1}, t} + Σ_t H_t    (5)

subject to:

Σ_{t = ES_j}^{LS_j} x_{jt} = 1,  ∀ j ∈ J′    (6)

Σ_t t · x_{jt} + p_j ≤ Σ_t t · x_{j*t},  for all (j, j*) ∈ P    (7)

Σ_{j ∈ J(T_w)} Σ_{s = t − p_j + 1}^{t} x_{js} ≤ 1,  for all teams T_w and all t    (8)

Σ_{j ∈ J} q_j · Σ_{s = t − p_j + 1}^{t} x_{js} ≤ Q̄ + H_t,  ∀ t    (9)

RD ≤ Σ_t t · x_{jt}  and  Σ_t t · x_{jt} + p_j ≤ DL,  ∀ j ∈ J    (10)

ES_j ≤ Σ_t t · x_{jt} ≤ LS_j,  ∀ j ∈ J′    (11)

x_{jt} ∈ {0,1},  for all j ∈ J′, t ∈ [ES_j, LS_j]    (12)

x_{jt} = 0,  ∀ t ∈ {1, …, ES_j − 1}    (13)

x_{jt} = 0,  ∀ t ∈ {LS_j + 1, …, T}    (14)

x_{jt} ∈ {0,1},  ∀ j, t    (15)

H_t ≥ 0,  ∀ t    (16)

The objective statement (5) is the goal of the RILP: it minimizes the project time and minimizes the cost for developing the requirements. In order to minimize these objectives, the following constraints are taken into account. The first constraint, (6), indicates that each job begins to process only once. Consider two requirements processed by two different teams: if the first requirement depends on the second requirement, then the second one has to be processed first and the first requirement will be processed second; they cannot be processed in a different order or at the same time. These requirement dependencies are expressed in constraint (7). Likewise, a single development team concentrates on a single job, or a single requirement, at a time; this constraint is given in (8). In some cases, the job development process may exceed the given time, and the resources given to develop the job may not be enough; in such cases, the processing stage may utilize the volume of available resources and also hire resources. To provide this utilization of resources, constraint (9) ensures that the required amount of resources does not exceed the amount of available resources even during the hiring time. Constraint (10) is a condition checking that the development time required for a job or requirement lies between the release date and the deadline. Constraint (11) ensures that the particular job begins and ends within its particular time window. Constraint (12) is the {0, 1} constraint for all the variables. Finally, constraints (13) and (14) set all the non-relevant variables to zero, and constraints (15) and (16) give the variable domains.
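Constraints of this type can be prototyped with an off-the-shelf ILP solver. The sketch below is a minimal illustration, not part of the proposed system: it assumes Google OR-Tools, a ten-day horizon and two jobs with assumed processing times, and encodes constraint (6) (each job starts exactly once) and the precedence constraint (7) with time-indexed binary variables.

// Minimal ILP sketch (assumed toy data) using Google OR-Tools' linear solver.
import com.google.ortools.Loader;
import com.google.ortools.linearsolver.MPConstraint;
import com.google.ortools.linearsolver.MPObjective;
import com.google.ortools.linearsolver.MPSolver;
import com.google.ortools.linearsolver.MPVariable;

public class RilpToySketch {
    public static void main(String[] args) {
        Loader.loadNativeLibraries();
        MPSolver solver = MPSolver.createSolver("SCIP");

        int horizon = 10;          // assumed planning horizon in days
        int[] p = {3, 2};          // assumed processing times; job 1 depends on job 0
        MPVariable[][] x = new MPVariable[2][horizon];
        for (int j = 0; j < 2; j++)
            for (int t = 0; t < horizon; t++)
                x[j][t] = solver.makeIntVar(0, 1, "x_" + j + "_" + t);

        // Constraint (6): each job starts exactly once.
        for (int j = 0; j < 2; j++) {
            MPConstraint once = solver.makeConstraint(1, 1, "once_" + j);
            for (int t = 0; t < horizon; t++) once.setCoefficient(x[j][t], 1);
        }

        // Constraint (7): start(job1) >= start(job0) + p0,
        // written as sum_t t*x_{1t} - sum_t t*x_{0t} >= p0.
        MPConstraint prec = solver.makeConstraint(p[0], solver.infinity(), "prec_0_1");
        for (int t = 0; t < horizon; t++) {
            prec.setCoefficient(x[1][t], t);
            prec.setCoefficient(x[0][t], -t);
        }

        // Simplified objective in the spirit of (5): minimize completion time of the last job.
        MPObjective obj = solver.objective();
        for (int t = 0; t < horizon; t++) obj.setCoefficient(x[1][t], t + p[1]);
        obj.setMinimization();

        if (solver.solve() == MPSolver.ResultStatus.OPTIMAL) {
            System.out.println("Minimum completion time: " + obj.value());
        }
    }
}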
IV. Results and Discussion

A successful software release planning includes two major steps: step one is requirement prioritization (optimal requirement selection) and step two is scheduling these requirements so that they are released on time. This paper analyses the process of scheduling the requirements, which are assumed to have been prioritized previously, and proposes Integer Linear Programming (ILP) with some additional constraints to solve the RCPSP problem effectively, hence named the Revamped Integer Linear Programming (RILP) model. The requirement scheduling is performed using the RILP, and the results of the scheduling are shown in Table I. The table shows the requirements, the teams and the time duration taken by each team to develop the requirements.
TABLE I
REQUIREMENTS SCHEDULING RESULTS
Requirement ID | Requirements | Team A (start day - end day) | Team B (start day - end day) | Team C (start day - end day)
35 | Adaptations in rental systems | 0 - 8 | 0 - 5 | 0 - 6
25 | Inclusion of graphical Plan Board | 9 - 15 | 6 - 12 | 7 - 10
63 | Performance improvements in order processing | 16 - 28 | 13 - 18 | 11 - 17
34 | Authorization on archiving service orders | - | 19 - 26 | 18 - 23
43 | Link with Acrobat Reader for pdf files | 29 - 36 | - | 23 - 32
12 | Authorization on order cancellation and removal | 37 - 45 | - | -
67 | Comparison of services per department | 46 - 50 | - | -
66 | Symbol Import | 51 - 54 | - | -

Fig. 1 shows the individual requirements, the teams developing the requirements, and the time duration each team takes to develop each requirement. Fig. 2 compares the possibilities of on-time delivery and delayed delivery of the software as the requirement dependencies increase, using the Revamped ILP; it shows that the proposed RILP controls the delayed delivery even when the dependencies are increasing. Fig. 3 shows the comparison between the expected scheduling results and the results achieved using RILP according to the scheduling time, where the proposed RILP achieves the expected results moderately well.
Fig. 3. Evaluation of system generated result
V. Conclusion
Requirement management is one of the key processes in the software development lifecycle. It includes scheduling the prioritized requirements efficiently so that the project time span can be minimized. Many techniques have been proposed to solve this scheduling problem. Improper scheduling leads to a delay in product delivery. Thus a new scheduling model is provided that solves the RCPSP problem and schedules the requirements efficiently. The proposed Revamped Integer Linear Programming (RILP) model improves on the previously used Integer Linear Programming (ILP) by not only minimizing the project time span, but also minimizing the cost of the project, through additional efficient constraints added to the general ILP formulation.
Fig. 1. Experimental result
References
[1] Omolade Saliu and Guenther Ruhe, "Supporting Software Release Planning Decisions for Evolving Systems", Proceedings of the IEEE/NASA Software Engineering Workshop, IEEE, pp. 14-26, 2005.
[2] Mohammad Dabbagh and Sai Peck Lee, "An Approach for Integrating the Prioritization of Functional and Nonfunctional Requirements", The Scientific World Journal, Article ID 737626, 2014.
[3] Sunil Yadav, "Efficient operating system scheduling for symmetric multi-core architectures in CPU scheduling", International Journal of Innovative Computer Science & Engineering, Volume 1, Issue 2, pp. 24-27, ISSN: 2393-8528, 2014.
[4] Amir Seyed Danesh, Rodina Ahmad, Mahmoud Reza Saybani, Amjed Tahir, "Companies Approaches in Software Release Planning – Based on Multiple Case Studies", Journal of Software,
Fig. 2. Possibilities of on-time delivery and delayed delivery
Volume 7, Issue 2, pp. 471-478, February 2012.
[5] Richard Lai, Naveed Ali, "A Requirements Management Method for Global Software Development", Advances in Information Sciences (AIS), Volume 1, Number 1, pp. 38-58, March 2013.
[6] S. Arun Kumar and T. Arun Kumar, "Study the Impact of Requirements Management Characteristics in Global Software Development Projects: An Ontology Based Approach", International Journal of Software Engineering & Applications, Volume 2, Issue 4, ISSN: 2333-9721, pp. 107-127, October 2011.
[7] Andrey Maglyas and Samuel A. Fricker, "The preliminary results from the software product management state-of-practice survey", Springer International Publishing, Volume 182, pp. 295-300, Series ISSN: 1865-1348, 2014.
[8] Praveen Ranjan Srivastava, Subrahmanyan Sankaran and Pushkar Pandey, "Optimal Software Release Policy Approach Using Test Point Analysis and Module Prioritization", MIS Review, an International Journal, Volume 18, Issue 2, ISSN: 1018-1393, pp. 19-50, March 2013.
[9] Björn Regnell and Krzysztof Kuchcinski, "Exploring Software Product Management Decision Problems with Constraint Solving – Opportunities for Prioritization and Release Planning", International Workshop on Software Product Management, IEEE, pp. 47-56, 2011.
[10] Sandhia Valsala and Anil R, "Review and Analysis of Software Release Planning Models", International Journal of Engineering and Advanced Technology, Volume 3, Issue 5, ISSN: 2249-8958, June 2014.
[11] Shinto K.G., C.M. Sushama, "An Algorithm for Solving Integer Linear Programming Problems", International Journal of Research in Engineering and Technology, Volume 02, Issue 07, pp. 107-112, pISSN: 2321-7308, July 2013.
[12] Diwakar Gupta and Brian Denton, "Appointment scheduling in health care: Challenges and opportunities", IIE Transactions, Volume 40, Issue 9, ISSN: 0740-817X, pp. 800-819, 2008.
[13] Chen Li, Marjan van den Akker, Sjaak Brinkkemper and Guido Diepen, "An integrated approach for requirement selection and scheduling in software release planning", Requirements Engineering, Springer-Verlag, Volume 15, Issue 4, pp. 375-396, ISSN: 0947-3602, 2010.
[14] T. A. Guldemond, J. L. Hurink, J. J. Paulus and J. M. J. Schutten, "Time-constrained project scheduling", Journal of Scheduling, Springer US, Volume 11, Issue 2, ISSN: 1094-6136, pp. 137-148, 2008.
Authors’ information Ms. Sandhia Valsala, holds a Master’s degree in Computer Applications from Bharatiyar University, Coimbatore and is currently pursuing her PhD from Karpagam University Coimbatore. She has completed her Mphil Computer Science from MadhuraiKamaraj University.
Dr. Anil R. Nair holds a PhD in software reliability from the Indian Institute of Technology Bombay, India. At present he is working as Associate Director at EY. He has worked as Principal Scientist with ABB Corporate Research, Bangalore, India, and before that he was Manager-QRM with Deloitte Consulting, Mumbai, India. He has published over 20 papers in international conferences and journals. He concentrates on software engineering research with a specific focus on industrial applications. Anil was Special Session co-chair of the Special Session on Modern Software Engineering Methods for Industrial Automation Systems, INDIN 2013, and of the Special Session on Software Engineering Methods, Tools and Practices for Automation Systems, ETFA 2013; he was a Program Committee member of the Third Annual World Conference on Soft Computing (WCSC 2013), the 3rd International Conference on Soft Computing for Problem Solving (SocProS 2013), and India HCI 2012. He was the lead organizer of the 1st International Workshop on Modern Software Engineering Methods for Industrial Automation (MoSEMInA 2014), which was co-located with the International Conference on Software Engineering (ICSE 2014) held in Hyderabad, India.
International Review on Computers and Software (I.RE.CO.S.), Vol. 10, N. 5 ISSN 1828-6003 May 2015
Errata corrige In the paper entitled “An Infrequent Route Selection Strategy for Unequally Clustered Wireless Sensor Networks”, published on the April 2015 issue of the Journal IRECOS, Vol. 10 N. 4, pp. 399-406, by the authors U. Hari, B. Ramachandran, N. Divya, for a print mistake Figures 4 to 7 are not correct. The correct figures are as below:
Fig. 4. Throughput of IFR and UCR
Fig. 5. Average delay of IFR and UCR
Fig. 6. Energy Consumption Rate of IFR and UCR
Fig. 7. Lifetime of IFR and UCR
Many apologies to the authors and to our readers for this mistake.
Copyright © 2015 Praise Worthy Prize S.r.l. - All rights reserved