6ème Conférence Internationale “Systèmes d'Information et Intelligence Economique” SIIE'2015, 12-14 Février 2015, Diar Lemdina, Hammamet – Tunisia. http://www.siie.fr

PROGRAMME

Wednesday, Feb. 11th, 2015 (PM) / Mercredi 11 février 2015
16h00-18h00 : Welcome and Registration / Accueil et Inscriptions
18h00 : Welcome Reception Cocktail / Réception de Bienvenue

Thursday, Feb. 12th, 2015 / Jeudi 12 février 2015
8h00-9h00 : Welcome and Registration / Accueil et Inscriptions
9h00-9h30 : Conference opening / Ouverture de la conférence

Welcome inauguration / Allocutions de bienvenue :
o Malek GHENIMA, Co-President – Université de Manouba, Tunisie
o Sahbi SIDHOM, Co-President – Université de Lorraine, France
o Nacer BOUDJLIDA, Co-President – Université de Lorraine, France

9h30-11h00 : Plenary Conference / Conférence plénière

Chair: Nacer BOUDJLIDA (Université de Lorraine, France)
Title: “The Third Digital Revolution: Agility and Fragility”
Guest speaker: Hubert TARDIEU (ATOS, France)
Discussion / Débat

11h00-11h15 : coffee break / pause café
11h15-12h45 : Parallel sessions / Sessions parallèles

Session A1: Information System, Data and Visualization
Chair: Riadh FARAH (Directeur de l'ISAMM – Université de la Manouba, Tunisie)
Tanti Marc (Centre d'épidémiologie et de santé publique des armées, France). Exploitation of "Big Data": the experience feedback of the French military health service on sanitary data
Tahar Mehenni (University Mohamed Boudiaf of M'sila, Algeria). Integration of useful links in distributed databases using decision tree classification
Cherni Ibtissem, Faiz Sami and Robert Laurini (ISG de Tunis – Tunisia; LTSIRS – Tunisia; INSA Lyon – France). ChoreMAP: extraction and visualization of visual summaries based on chorems
Sébastien Bruyère and Vincent Oechsel (Custom Solutions, France). Creation of a Data Observatory enables the uncovering of consumer behavior by client behavioral study, through the use of old generation loyalty and stimulation platforms

Session B1: Structuring Multimedia Streams
Chair: Irina ILLINA (INRIA Grand Est & Université de Lorraine, France)
Dominique Fohr and Irina Illina (LORIA, France). Neural Networks for Proper Name Retrieval in the Framework of Automatic Speech Recognition
Imran Sheikh, Irina Illina and Dominique Fohr (LORIA, France). Recognition of OOV Proper Names in Diachronic Audio News
Lina Maria Rojas-Barahona and Christophe Cerisara (LORIA, France). Enhanced discriminative models with tree kernels and unsupervised training for entity detection

Session C1: System Development and Community Applications
Chair: Zaki BRAHMI (RIADI Lab & Higher Business School of Tunis & Manouba University, Tunisia)
Zaki Brahmi and Jihen Ben Ali (RIADI Lab & Higher Inst. of Comp. Sciences and Comm. Tech., Sousse University – Tunisia). Cooperative agents-based Decentralized Framework for Cloud Services Orchestration
Ines Achour, Lamia Labed Jilani and Henda Ben Ghezala (RIADI Lab, ENSI, Manouba University – Tunisia). Proposition of Secure Service Oriented Product Line
Ilhem Feddaoui and Zaki Brahmi (Higher Institute of Computer Sciences and Communication Techniques, Sousse University – Tunisia; Jendouba University – Tunisia). Decentralized orchestration of BPEL processes based on shared space

12h45-14h15 : Lunch / Déjeuner
14h15-15h45 : Plenary Conference / Conférence plénière

Chair: Sahbi SIDHOM (LORIA Lab. & Université de Lorraine, France)


Title: « HyperHeritage: New Forms of Human Heritage Interactions in progress »
Guest speaker: Khaldoun ZREIK (CITU-Paragraphe, Université de Paris 8, France)
Discussion / Débat

15h45-16h00 : coffee break / pause café
16h00-18h00 : Doctoral Symposium / Doctoriale
Chair: Henda BEN GHEZALA, Narjès BELLAMINE & Invited Guest Speakers
Session D1 : Doctoral Symposium

Friday, Feb. 13th, 2015 / Vendredi 13 février 2015
8h00-9h00 : Welcome and Registration / Accueil et Inscriptions
9h00-10h30 : Plenary Conference / Conférence plénière

Chair: Henda BEN GHEZALA (Directrice Labo RIADI – Université de la Manouba, Tunisie)
Title: « Online Privacy Issues in Information Systems »
Guest speaker: Esma AIMEUR (Dept. of Computer Science and Operations Research, University of Montreal, Canada)
Discussion / Débat

10h30-10h45 : coffee break / pause café
10h45-12h15 : Parallel sessions / Sessions parallèles

Session A2: Information Technologies & Competitive Intelligence Systems
Chair: Ahmed SILEM (Université Jean Moulin Lyon 3, France)
Zakaria Boulouard, Amine El Haddadi, Anass El Haddadi, Abdelhadi Fennan and Lahcen Koutti (Fac. Sc., Univ. Ibn Zohr-Agadir; Fac. Sc. et Tech. Tanger; Ecole Nat. Sc. App. El Hoceima; Fac. Sc. et Tech. Tanger; Fac. Sc., Univ. Ibn Zohr-Agadir). XEWGraph: A Tool for Visualization and Analysis of Hypergraphs for a Competitive Intelligence System
Zineb Drissi (Faculté des Sciences Juridiques, Économiques et Sociales de Fès, Maroc). Competitive intelligence and decision
Naneche Fariza and Meziaini Yacine (Mouloud MAMMERI University, Algeria). Constraints to Integrate Competitive Intelligence within the Algerian Operator of the Telephony Mobile Mobilis Case

Ahmed Aloui, Okba Kazar and Samir Bourekkache (Computer Science Department, University of Biskra, Algeria). Security study of m-business: Review and important solutions

Session B2: Knowledge Management, Process and Quality
Chair: Maher GASSAB (Directeur de l'Ecole Sup. de Commerce de Tunis, Univ. de la Manouba, Tunisia)
Brahami Menaouer, Semaoune Khalissa and Benziane Abdelbaki (National School Polytech. Oran, Univ. Oran – Algeria). The Relationship between Knowledge Management and Innovation process in FERTIAL company
Imen Hamed and Faiza Jedidi (MIR@CL laboratory, University of Sfax, Tunisia). Knowledge-based approach for quality-aware ETL process
Brahami Menaouer, Semaoune Khalissa and Benziane Abdelbaki (National School Polytech. Oran, Univ. Oran – Algeria). Integration of the knowledge management process into risk management process – Moving towards actors of project approach

Session C2: Cognitive and Social Dimensions in Monitoring Process
Chair: Souad LARBAOUI (PDG, Institut de Management Stratégique Intelligence Economique, Alger, Algeria)
Marilou Kordahi (Laboratoire Paragraphe, UFR MITSIC, Université Paris 8, France). Automatic translation of text phrases into vector images for crisis communication
Fatma Fourati-Jamoussi (Institut Polytechnique LaSalle Beauvais, France). E-reputation: A case study of organic cosmetics in social media
Youssef Ait Houaich and Mustapha Belaissaoui (SIAD Laboratory, ENCG, Université Hassan I, Settat, Morocco). Measuring the maturity of open source software

12h15-14h00 : Lunch / Déjeuner
14h00-15h30 : Tutorial / Tutoriel


Chair: Malek GHENIMA (Ecole Supérieure de Commerce, Université de la Manouba)
Title: « BIG DATA Solutions »
Guest speaker: Pascal GUY (Oracle Toulouse, France)
Discussion / Débat

15h30-15h45 : pause café / coffee break
15h45-17h45 : Doctoral Symposium / Doctoriale
Chair: Henda BEN GHEZALA, Narjès BELLAMINE & Invited Guest Speakers
Session D2 : Doctoral Symposium

20h30 : Gala Dinner / Dîner gala

Saturday, Feb. 14th, 2015 / Samedi 14 février 2015
8h00-9h00 : Registration / Inscriptions
9h00-10h30 : Parallel sessions / Sessions parallèles

Session A3: Information System, Strategy and Economic Intelligence
Chair: Jameleddine ZIADI (LARTIGE, Faculté des Sciences Economiques et de Gestion de Sfax)
Sabrina Abdellaoui and Fahima Nader (Ecole nationale Supérieure d'Informatique ESI, Algérie). A Methodology for designing Competitive Intelligence System based on semantic Data Warehouse
Amine El Haddadi, Zakaria Boulouard, Anass El Haddadi, Abdelhadi Fennan and Lahcen Koutti (Fac. Sc. et Tech. Tanger; Fac. Sc., Univ. Ibn Zohr-Agadir; Ecole Nat. des Sc. App. El Hoceima; Fac. Sc. et Tech. Tanger; Fac. Sc., Univ. Ibn Zohr). Mining unstructured data for a competitive intelligence system XEW
Iman Ahdil and Boujemaa Achchab (Université Hassan I – Berrechid – Maroc). Competitive Intelligence experiences in companies: case studies on creative opportunities
Imen Gmash, Sahbi Sidhom, Malek Ghénima and Lotfi Khrifech (FSEG, Tunisia). Towards an approach of trust-based recommendation system

Session B3: Collaborative Information Retrieval, Monitoring and Added Values
Chair: Habib KAMMOUN (REGIM-Lab. & University of Sfax, Faculty of Science, Tunisia)
Safi Houssem, Jaoua Maher and Hadrich Belguith Lamia (Faculté des Sc. Eco. et de Gest. de Sfax, Univ. de Sfax, Tunisie). AXON: a personalized information retrieval system in Arabic texts based on linguistic features
Yemna Sayeb, Meriem Ayba, Sihem Chabchoub and Henda Ben Ghezala (RIADI Lab, ENSI, Manouba Univ. – Tunisia). Urba-UML: Information systems' environment of urbanization based on UML
Carole Henry, Sahbi Sidhom and Imad Saleh (Paragraphe – Univ. Paris 8; LORIA – Univ. Lorraine; Univ. Paris 8, France). The dematerialization of information carriers and their appropriation by uses: MOOC example

10h30-10h45 : coffee break / pause café
10h45-12h15 : Plenary Conference / Conférence plénière

Chair: Abdelmajid BEN HAMADOU (Lab. MIRACL, Université de Sfax, Tunisie)
Title: « XeW Data Analysis: Competitive Intelligence System in Cloud »
Guest speakers: Bernard DOUSSET, Wahiba BAHSOUN and Anass EL HADDADI (IRIT, Université Paul Sabatier – Toulouse 3, France)

Discussion / Débat

12h15-12h45 : Closing Ceremony / Cérémonie de Clôture
o Gifts to guest speakers and session chairs
o « Best papers » of the conference

12h45 : Lunch / Déjeuner
14h00 : Extra activities / Extra activités : Excursion (guided tour in Carthage – Sidi Bou Saïd).


The Third Digital Revolution: Agility and Fragility
Hubert TARDIEU (CEO Advisor and Co-Chairman of Atos Scientific Community, ATOS, France)
Mail: [email protected]

GUEST SPEAKER / CONFÉRENCIER INVITÉ

Hubert Tardieu is an engineer from the French Grande École SUPELEC, with a degree in Economics. He started his career with IBM in the USA, then spent two years in Africa and, after that, joined the French Administration in a research centre specializing in database management systems. With his team he developed Merise, a method to design and build Information Systems. He joined SEMA in 1984 as CTO, monitored large projects in Defence, Nuclear and Payment Systems, and drove a large Software Engineering research project. In 1993, he started the first transversal strategic business unit dedicated to Telecom, which he helped to grow into the first market of SEMA. He was part of the Executive Committee of Sema Group. With the acquisition of SEMA by Schlumberger, he took responsibility for the Group Finance Service Business and then for the global Service Line Systems Integration. When Atos Origin acquired SEMA, he continued as EVP Global Systems Integration, augmented with Global Consulting in 2004 (with a turnover in excess of 2.7 B€). In 2008, he was asked to drive, in addition, Innovation & Partnership. He became a member of the executive committee of Atos Origin when it was created in January 2007. Atos Origin became Atos in July 2011 after the merger with SIS, the IT arm of Siemens, with 78,500 employees and a turnover in excess of 6.8 B€. In 2014 Atos acquired Bull and in 2015 the IT service business of Xerox, bringing turnover to 11 B€. Since March 2009, he has been CEO Advisor and Co-Chairman of the Scientific Community. Hubert Tardieu has been a board member of Infovista SA for more than 10 years; he is also a reviewer of Information & Management, the reference Elsevier journal for Information Systems.

ABSTRACT / RÉSUMÉ

The “Third Digital Revolution” is the result of extensive research conducted by Atos' Top 100 scientists from the Scientific Community, which I am co-chairing within Atos (the 5th worldwide IT services company). “Agility and Fragility” is a key theme of the report: it illustrates the balance of risk and opportunity presented by the unprecedented levels of technical disruption and business change that we should expect to see. We all agree that Data will be the “black gold” of tomorrow. We are already experiencing a Data-led revolution of our ever more connected and digital world. Gathering and using Data will transform our lives whether we are at home, traveling, shopping or at work. Our new publication, Ascent Journey 2018, focuses on Data at the core of the “3rd Digital Revolution”. Why the third one? Because we are living through the convergence of two different development cycles:
- A “3rd revolution” in the way we represent information (after the creation of the cuneiform script in 3200 BC and the invention of the movable-type printing press between the 11th and 15th centuries)


- And a “3rd revolution” in the way we compute this information (after the creation of computers in the 40s and the worldwide web in the 90s)
The lifecycle of Data is now at the heart of the digital transformation. Those that are first to grasp its relevance will be the winners in the new Data Economy. Looking towards 2018, we can no longer limit our thinking to that of an evolution of business and technology. We must embrace a revolution in our thinking and do things differently, not just better. The new digital era opens up many new opportunities that are only limited by our capacity to imagine new application use cases. The full impact of the third Digital Revolution will be experienced when the link is made between the B2C and B2B worlds, bringing a wealth of opportunities for those that are ready to ride the wave.
NB: The report is publicly available on the Internet in various formats (PDF, tablet).

KEYWORDS / MOTS-CLÉS

(EN) lifecycle of Data, 3rd revolution, Information Systems, new application use cases.


HyperHeritage: New Forms of Human Heritage Interactions
Hyper-Patrimoine : nouvelles formes d'interaction Homme-Patrimoine
Khaldoun ZREIK (CITU-Paragraphe, Université Paris 8, France) – [email protected]

GUEST SPEAKER / CONFÉRENCIER INVITÉ

Khaldoun ZREIK is a full Professor at the Department of HyperMedia, University Paris 8, where he is in charge of the CITU (Cybermedia, Interaction, Transdisciplinarity and Ubiquity) research team of the Paragraphe Laboratory, which has been an actor in important projects on the Augmented & Digital City. K. Zreik has introduced new research and teaching topics such as HyperUrban (the impact of Information and Communication Technology on city design and practice), HyperHeritage, Augmented Culture, and Post-Digital Documents. Since 2009 he has been in charge (director) of the Master Program NET (Digital: Challenges and Technology) in Information and Communication Sciences. Since 2012 he has been the head of the Board of Trustees of the Scientific Interest Group (GIS) Human to Human Lab., which includes 15 prestigious Art and Design establishments in Paris and its neighborhood. Since 2006 he has been involved in the fields of HyperHeritage and Augmented Culture design.

ABSTRACT / RÉSUMÉ

HyperHeritage covers every cultural heritage environment that embeds and includes traditional and digital cultural information. Integrating Information and Communication Technology (ICT) in this field helps to discover new ways of perceiving, representing and practicing cultural heritage. Information Sciences and Technology have always been used to promote and enrich cultural heritage representation and study; however, the massive development and impressively easy use of ICT and devices have introduced new forms of public social culture: Augmented Culture (AC).

Augmented Culture (AC) suggests, to actors and consumers of cultural heritage, various new ways, very often independent of space and time, to access, process and deal with interconnected cultural information. AC could also be seen as a shortcut for bringing Virtual Reality and Augmented Reality technology into some culture representation paradigms. HyperHeritage observes that AC incessantly suggests new paradigms of Human–Information Interaction and Human–Human Interaction that invite us to explore new dimensions of the Human Culture Environment. It is important to notice that in this presentation HyperHeritage and AC do not suggest reviewing or replacing traditional institutions or forms of cultural heritage communication and promotion. They observe the emergence of new forms of interaction with information that are to be explored and revisited through multidimensional interconnected spaces. Most of these observations and reflections are based on three technological concepts that have been considered and developed in CITU-Paragraphe: the Autonomous Avatar, the Heritage Web Documentary and Cultural Heritage Open Data.

KEYWORDS / MOTS-CLÉS
(EN) Augmented Culture, Human Computer Interaction, HyperHeritage, Hypermedia, Information Design, New Culture, Open Data, Serious Game, Web Documentary. (FR) Conception de l'information, Culture Augmentée, Interaction Homme-machine, Jeux Sérieux, Multimédia, Open Data, Patrimoine Augmenté, Web Documentaire.


Online Privacy Issues in Information Systems
Esma AÏMEUR (Dept. of Computer Science and Operations Research, University of Montreal, Canada)
Mail: [email protected] ; http://diro.umontreal.ca/repertoire-departement/vue/aimeur-esma/

GUEST SPEAKER / CONFÉRENCIÈRE INVITÉE

Esma Aïmeur is a Professor in the Department of Computer Science and Operations Research at the University of Montreal. She received her Ph.D. degree from University of Paris 6 in the field of Artificial Intelligence. She was the head of the Computer Science division of the multidisciplinary Masters Program in Electronic Commerce at the University of Montreal. She has been working with her team on computer privacy for more than 15 years. She is interested in privacy-enhancing technologies in different settings, such as social networks, electronic commerce and e-learning. She also works on privacy-preserving data mining and the protection of personal data (identity theft, information disclosure, profiling and re-identification). She was appointed a member of the Data Protection Advisory Committee of the University of Montreal. Her responsibilities include helping to improve policies and decision making in security awareness by providing best practices to protect personal data. Esma Aïmeur is one of the associate editors of the International Journal of Privacy and Health Information Management (IJPHIM). She is and has been a member of more than 150 program committees of international conferences. She co-organized the 5th International Symposium on Foundations & Practice of Security (FPS 2012). She also organized several workshops on computer privacy in Electronic Commerce in Montréal. She is currently co-chairing the program committee of the First International Conference on Information Systems Security and Privacy, which will be held in France in February 2015.

ABSTRACT / RÉSUMÉ

Nowadays data generates value to individuals, organisations, and society. As a result, websites and Internet services are collecting personal data with or without the knowledge or consent of users. Not only does new technology readily provide an abundance of methods for organizations to gather and store information, people are also willingly sharing data with increasing frequency and exposing their intimate lives on social media websites. Online data brokers, search engines, data aggregators, geolocation services and many other web actors are monetizing our online presence for their own various purposes. Similarly, current technologies such as smartphones, tablets, cloud computing/SaaS, big data and BYOD also pose serious problems for individuals and businesses alike.

In this proposed talk, we will address various issues inherent to Internet data collection and disclosure


behavior in online social media. More precisely, we will examine the economic dimensions of personal data and privacy. Although there are means at our disposal to limit or at least acknowledge how and what we're sharing on the Internet, most of us do not avail ourselves of these tools. We conclude this talk by discussing the current and future challenges facing privacy in information systems.

KEYWORDS / MOTS-CLÉS

(EN) Online data brokers, search engines, data aggregators, geolocation services, web actors, Online Privacy, Information Systems.


Bigdata Solutions
Pascal GUY (Pre-Sales Architect, Business Unit Systems, Oracle France)
Mail: [email protected]

GUEST SPEAKER / CONFÉRENCIER INVITÉ

Pascal Guy (http://fr.linkedin.com/in/guypascal/) is a Pre-Sales Architect in the Oracle France Business Unit Systems. He works on the biggest deals of French customers, where Oracle's best solutions must be positioned. Pascal Guy has worked in the Business Intelligence market since 1991, with projects in many different areas. He is a specialist in Data Warehousing and Very Large Databases. He naturally evolved into the Bigdata domain, with projects for the flight industry, life sciences and governmental entities. Since 2000, Pascal has been the president of the Cursus of Master Engineer Statistics and Decisional Computing (https://cmisid.univ-tlse3.fr/) at Paul Sabatier Toulouse University. The role of the president is to establish a bridge between university and enterprise for the development of students' competencies, shared research and business market involvement.

ABSTRACT / RÉSUMÉ

Human beings are characterized by the capability to decide before acting. The decision process requires information and the interpretation of information. Information is currently living its third revolution, after the invention of the written word, the invention of the printing press and now electronic computing. Our societies are entering the era of digital transformation. Digital transformation consists in aligning, for each event of life, digital information that describes it. Business Intelligence was the first step of analytical and decisional support for managers, but it is limited to internal and structured data. Now Bigdata offers new capabilities in terms of kinds of data and algorithms, and should transform all data into rich information. Oracle is a long-term leader in data management and naturally evolves to accompany customers in exploiting and enriching Bigdata solutions.


Oracle develops a complete Bigdata stack, based on open and standard components like Hadoop, Linux and Java, completed by rich tools for data flow and analysis needs. For specific workloads, Oracle proposes optimized appliances: Exadata is dedicated to very large relational databases, the Big Data Appliance is a Hadoop and NoSQL farm for unstructured or semi-structured data, and Exalytics optimizes analytics and search workloads. Many other software bricks can be exploited all around the life of data.

KEYWORDS / MOTS-CLÉS

(EN) Very Large Database, Business Intelligence, Bigdata technologies, Enterprise Architecture, High Availability and Disaster Recovery Process. (FR) Management des grandes bases de données, solutions décisionnelles, technologies Bigdata, Architecture d'Entreprise, solutions de haute disponibilité et de plan de reprise d'activité.


XeW Data Analysis: Competitive Intelligence System in Cloud
Bernard DOUSSET*, Wahiba BAHSOUN* and Anass EL HADDADI**
[email protected], [email protected], [email protected]
* IRIT, Université de Toulouse, France
** Département Mathématiques et Informatique, ENSA Al-Hoceima, Maroc

GUEST SPEAKERS / CONFÉRENCIERS INVITÉS

ABSTRACT / RÉSUMÉ

Competitive Intelligence (CI) is the set of coordinated research, treatments and distribution of useful information to stakeholders towards action and decision making. In order to enable users to search, monitor, validate and rebroadcast strategic information, we provide our new tool Xplor EveryWhere (XEW), which can be helpful for them in their executive travels. In this paper, we focus on the architecture, multilayer model and services of the CI system XEW to describe our approach to treating different data sources (patents, papers, etc.) in cloud computing.

Bernard DOUSSET – IRIT, Université Toulouse III, équipe Systèmes d'Informations Généralisés (Bernard.Dousset at irit.fr).
Wahiba BAHSOUN – IRIT, Université Toulouse III, équipe Systèmes d'Informations Généralisés (Wahiba.Bahsoun at irit.fr).

Nowadays, companies are faced with external risk factors linked to an increasingly competitive marketplace – we know that markets are extremely dynamic and unpredictable: new competitors, mergers and acquisitions, sharp price cuts, rapid changes in consumption patterns and values, weak brands and their reputation… CI is a discipline to better anticipate risks and identify opportunities. Fifteen years after the canonical definition in French proposed by Martre [1], CI is still a concept with unstable borders. The last few years have seen multiple definitions of CI: from definitions oriented towards the process and practice of CI, or the strategic vision of CI, to others including the concepts of knowledge management, collective learning and cooperation [2]. In the context of our work, we retain the concept of CI as it was defined by the Society of Competitive Intelligence Professionals (SCIP): Competitive Intelligence is a systematic and ethical program for gathering, analyzing, and managing external information that can affect your company's plans, decisions, and operations. Put another way, CI is the process of enhancing marketplace competitiveness through a greater -- yet unequivocally ethical -- understanding of a firm's competitors and the


competitive environment. Specifically, it is the legal collection and analysis of information regarding the capabilities, vulnerabilities, and intentions of business competitors, conducted by using information databases and other "open sources" and through ethical inquiry. Effective CI is a continuous process involving the legal and ethical collection of information, analysis that doesn't avoid unwelcome conclusions, and controlled dissemination of actionable intelligence to decision makers.

In the CI process, multivariate techniques are currently well controlled for all available quantitative data, on condition that the DBMS be suitable, the database schema be adapted and the data be of the highest quality (homogeneous, current, complete...). It is always possible to extract the relevant data to a database custom built for multidimensional analysis. But textual data from all electronic sources (scientific databases, patent databases, the press, RSS, Internet, intranet, forums…) are difficult to handle: data sources have different formats or are even unstructured, they are distributed and heterogeneous, and they present many particular / singular cases, particularly when we analyse a topic from different points of view (science, technology, news, etc.). To standardize the multidimensional analysis of text data from all sources, we propose a unified structure [3, 4] for storing all relationship items encountered in the analysed documents. This method allows the construction of three-dimensional analyses among variables and time. Since 2001 [5], a first tool, Xplor, has been proposed to upload this type of structure in a client–server setting and to perform custom searches through various graphic restitutions of the results. All text data are then in the same structure and therefore share common tools of interactive investigation. An improved version of Xplor appeared in 2007 [6], and in order to enable users to search, monitor, validate and rebroadcast strategic information, we provide our new mobile tool XEW, which can be helpful for them in their executive travels. The rest of the paper is structured as follows: first, we review in Section II the literature on CI and our CI process. In Section III, we explain the architecture of XEW in cloud computing. Section IV summarizes and assesses the approach.

REFERENCES
[1] MARTRE H., CI and corporate strategy, French documentation, Paris (1994).
[2] SALLES M., CLERMONT Ph., DOUSSET B., A design method of CI system, conference communication, IDMME'2000, Montréal (2000).
[3] DOUSSET B., Integration of interactive knowledge discovery for environmental scanning, PhD report, University Paul Sabatier, Toulouse (2003).
[4] EL HADDADI A., DOUSSET B., BERRADA I., LOUBIER I., The multi-sources in the context of competitive intelligence, EGC 2010, pp. A1-125–A1-136, Tunisie (2010).
[5] SOSSON D., VASSARD M., DOUSSET B., Portal for navigation in the strategic analysis, VSST'01, Vol. 1, pp. 347-358, Barcelone, Espagne (2001).
[6] GHALAMALLAH I., GRIMEH A., DOUSSET B., Processing data stream by relational analysis, European Workshop on Data Stream Analysis, March 14-16, Caserta, Italy, No. 36, pp. 67-70 (2007).
[7] HAAG S., Management Information Systems for the Information Age, Third Edition, McGraw-Hill Ryerson (2006).
[8] GILAD B., The Future of Competitive Intelligence: Contest for the Profession's Soul, Competitive Intelligence Magazine, 11(5), 22 (2008).
[9] FAVIER L., Research and application of a methodology in information analysis for competitive intelligence, Thesis, University Lyon II, France (1998).
[10] ALABDULSALAM M., PATUREL R., Tool to help SME access to the competitive intelligence approach, CIMS, Suisse (2006).
[11] GHALAMALLAH I., A proposed model of exploratory multivariate analysis in competitive intelligence, thesis report, Toulouse University, Dec. (2009).
[12] EL HADDADI A., DOUSSET B., BERRADA I., Securing a competitive intelligence platform, conference communication, INFORSID 2010.
[13] HATIM H., EL HADDADI A., EL BAKKALI H., DOUSSET B., BERRADA I., Generic approach to control access and treatment in a competitive intelligence platform, conference communication, VSST 2010.

KEYWORDS / MOTS-CLES


Competitive Intelligence (CI), decision making, Xplor EveryWhere (XEW) tool, CI system, CI process.

Exploitation of "Big Data": the experience feedback of the french military health service on sanitary data M. Tanti

Abstract— In recent years, the number of data circulating on the Internet has exploded, causing the phenomenon of "Big Data". In the health field, the mass of data circulating on the Internet has also become bloated. From a study conducted within the French military health service, this article aims to explore the construction of knowledge generated by the "Sanitary Big Data". It analyzes the creation of value achieved and the limitations of this "Big Data". The result is an exploitation of polymorphic data coming mainly from global programs for infectious disease surveillance, including PromedMail and Sentiweb. It shows the use of data from clinical trials of the Cochrane Library and the use of data from social media and social networks. These real-time data give an exhaustive picture of the population's health status. It also shows the construction of knowledge, which quickly reveals the emergence of diseases that may affect the forces in operation, as in 2009 during the H1N1 pandemic. We may also mention the anticipation of new pharmaceutical innovations, for example to detect the effectiveness of new treatments, such as in the context of the current Ebola epidemic. As main limitations, it shows a data mining that does not offer interpretation of results, and potential ethical drifts in handling health data.

Index Terms— Sanitary Big Data, French military health service, Knowledge, Value

I. INTRODUCTION

In recent years, the number of data circulating on the Internet has exploded and continues to grow exponentially. Today, there are almost as many electronic data as there are stars in the universe. This phenomenon is known as "Big Data". We can give a definition of this phenomenon from three components:
- its volume: "Big Data" is generally defined from five terabytes of data to be processed;
- its variety: the acquired data are raw or structured, in text or image format, with owners and rights of use as different as their sources;
- its velocity: it must be possible to integrate in real time the latest available data and link them to other data sets, without restarting a full analysis at each cycle.
These three components therefore require new forms of information processing (Pouyllau, 2013). In the medical field, the mass of data circulating on the Internet has also become bloated. The sanitary sector is particularly affected by this glut of information, especially with the increasing emergence of new epidemics, the frantic rush to new treatments linked with the economic interests of pharmaceutical industries, and the publication race of public laboratories. The exploitation of this huge mass of data has opened new avenues in terms of exploration of information and production of new knowledge. This article aims to present the experience feedback of the French military health service in the use of the sanitary data of the "Big Data". The first objective is to propose a definition of « Sanitary Big Data ». The second objective is to explore the construction of the knowledge generated. The third objective is to analyze the creation of value provided. The last objective is to determine the limits of this phenomenon.

II. CONSTRUCTION OF KNOWLEDGE GENERATED BY THE EXPLOITATION OF « SANITARY BIG DATA »

A. What is « Sanitary Big Data »?

In the health sector, the global programs of emerging disease surveillance, including PromedMail (http://www.promedmail.org), are a source of bloated data and of very high strategic value. These programs broadcast on the democratized Internet, in near real time, data and trends on outbreaks affecting humans and animals, in any geographic area (number of cases, deaths, geographical location), from ground relays (WHO experts, MSF, Institut Pasteur...). Works from scientific research and global clinical trials are also an inexhaustible source of sanitary data of high value. For example, the Cochrane Library (http://www.thecochranelibrary.com/) lists the global clinical trials and their results. PubMed references over 20 million international scientific works in domains as vast as epidemiology, public health, infectious diseases and health economics. The Protein Data Bank (PDB) (http://www.rcsb.org/pdb/home/home.do) broadcasts a worldwide collection of millions of data concerning biological macromolecules (DNA, proteins...) (Berman, 2000) from genome sequencing works (human, viral, bacterial...) and from proteomics.

In France, there is a population monitoring system called "Sentiweb" (https://websenti.u707.jussieu.fr/sentiweb/?page=presentation), which is based on private and hospital sentinel physicians who transmit, every week, by secure Internet, data from the monitoring of eight health indicators of the population (measles, viral hepatitis, influenza-like illness...), which amounts to an enormous mass of data. On the same principle, GrippeNet.fr (https://www.grippenet.fr/) collects data on influenza anonymously via the Internet, but this time directly from the population. Each week, participants report the symptoms they have had since their last connection and provide a wealth of information, including the behavior of populations. In the field of worldwide detection of influenza, we can cite Google Flu Trends (http://www.google.org/flutrends/). This site, developed by Google in collaboration with the US Department of Health, detects the spread of influenza. It allows, by sophisticated correlations based on the search terms entered in the Google search engine (such as "flu", "fever"), to predict the outbreak of influenza in any part of the territory in real time. In the health sector, social media are also a source of huge data. For example, messages and conversations exchanged on social networks (Twitter...), medical discussions in forums (Doctissimo...) and posts on personal and professional blogs are a real-time reflection of the health of the population and of concerns in the field of public health (Raghupathi, 2014).

We can define the « Sanitary Big Data » as the set of large medical data, extremely varied, produced in near real time by the population and the scientific communities. These data are raw or structured. They are in any format and they provide information concerning population health status, behaviors and health concerns.

B. Construction of Knowledge

B.1. By what method?
The mass of information from the « Sanitary Big Data » is collected and analyzed by the medical intelligence service of the Military Centre of Epidemiology and Public Health, part of the French military health service, dedicated to the exploitation of the data from the « Big Data » (Boutin, 2004). To do this, it uses data mining tools and methods of collection and analysis that are automatic or semi-automatic. It uses monitoring software and specialized documentary platforms. It also uses cloud computing tools and software for clustering and data classification.

B.2. Which knowledge is built?
A number of examples of construction of knowledge from the « Sanitary Big Data » can be presented. For example, in 2009, in the middle of the H1N1 pandemic, the use of Google Flu Trends by the Medical Intelligence Service allowed it to monitor the dynamics of the epidemic in real time and to anticipate the evolution of the epidemic. This tool revealed the emergence and spread of the disease sooner than the traditional methods of field collection. Indeed, statistics from field physicians take several days to be analyzed (e.g. Sentiweb in France, and the Center for Disease Control and Prevention — CDC — in the USA). This tool is very fast, even almost instantaneous. However, a study from 2014 tempers these advantages and demonstrates that the figures produced by the tool are overestimated and exceed those from the field (CDC) by more than 50%. A recalibration with field data would be required (Lazer, 2014). In addition, the medical intelligence service, object of our study, uses data from genomics and proteomics, coming from the Protein Data Bank (PDB). Exploitation of this « Big Data » opens the way to the construction of new medical knowledge and to the exploration of new pharmaceutical innovations, including vaccines and therapies. It can be used to detect the effectiveness of influenza vaccines. The exploitation of the data from Sentiweb allows the detection of influenza or acute diarrhea epidemics in France to be tracked. It allows modeling with a view to decision support: it allows estimation of the basic parameters of transmission, it assesses the impact of control and intervention strategies, and it allows the integration of medical and economic aspects. In addition, new knowledge is constructed by crossing different data of the « Sanitary Big Data »: therapeutic, vaccine, diagnostic, sanitary, medical, economic and strategic data.

III. CREATION OF VALUE

A. Definitions
Value creation is a theme that currently raises increasing interest in different areas of management science: strategic management, corporate finance, accounting, management control, organization, marketing. Bourguignon (Bourguignon, 1998) distinguishes three meanings of value: the value in the sense of measure (especially in sciences such as mathematics and physics), the value in the economic sense and the value in the philosophical sense. The term value is synonymous with the term wealth. The theme of value is the subject of multiple looks or paradigms, e.g. visions common to the members of a particular group (Kuhn, 1983). The issue of value therefore refers to the question of the recipients of the value created: for whom do we create value? In corporate finance, value is often a financial value for the shareholder. In our study, we define value as the strategic value for the military decision-maker in health, so this is a larger value than the economic or financial one, since it also includes medical, regulatory and adjudicative aspects. Creating economic value is to make it vary in the direction of an increase; conversely, destroying value is to lower it over time. The creation of economic value is at the heart of organizational activities and at the center of their vocation and strategy (Savall, 1998). The creation of value is also at the center of the concerns of military organizations. In our study, it is defined as the increase of the strategic value for the military decision-maker in the domain of health. The creation of value will provide benefits to anticipate military decisions, to prevent risks to the soldier and to the people under his responsibility, particularly, in our framework, health risks and risks of epidemics, which by definition have an impact on military operations, peacekeeping and homeland security.

B. What value is created?
New knowledge constructed by crossing different data of the « Sanitary Big Data » is made available to decision-makers of the French military health service, in the form of a dedicated intranet platform, which allows the creation of value that we describe in this section. This platform leads to the consideration of unpublished data correlations. It offers two challenges. The first challenge will be to reuse, share and verify the knowledge resulting from the exploitation of the « Big Data ». Another challenge will be to reuse data to extract innovations and trends. The main values created are:
- the early detection of epidemic events on current, possible or probable military operation theatres;
- the monitoring of known epidemic events (epidemics, pandemics), to anticipate their occurrence and their potential impact on the soldier;
- the monitoring of the social impact of a health event or of an epidemic risk;
- the monitoring of an action or health policy, such as a vaccine or therapeutic policy;
- the identification of preventive or therapeutic innovations, to anticipate and prevent health risks and their economic and human impacts.
The exploitation of the « Big Data » finds particular value creation in the context of emerging epidemics, like the current Ebola outbreak raging in Africa at the time of writing this article. It allows the monitoring of the event in real time, for example through the use of the data from PromedMail. The exploitation of the « Sanitary Big Data » also allows the identification of therapeutic and vaccine innovations, notably through the analysis of data from the Protein Data Bank (Reynard, 2014). The identification of these innovations will create strategic value for military decision-makers: they will be able to anticipate the risks in operation and preserve the health of the population for which they are responsible.

IV. LIMITS OF BIG DATA
The limits of the « Sanitary Big Data » are, first, related to the data mining tools, which do not offer interpretation of the results: an expert analyst in data mining and a person familiar with the trade from which the data are extracted (an epidemiologist in this context) are needed to analyze the software deliverables. In addition, data quality — relevance and completeness of the data — is a necessity for data mining, but it is not enough. Input errors, duplicates, unfilled data or data indicated without reference to time also affect the quality of the results.

A limit also comes from the interoperability of the different systems and their ability to work together, which is currently not the case in the field of the « Sanitary Big Data ». Another limitation, regarding ethics, is the possible diversion of medical or personal data for any purpose other than that assigned initially. For example, threats to the privacy of individuals are possible, especially when exploring personal data collected on the Internet or on social networks, where people voluntarily reveal their health themselves. In this context, the feasibility of an exploration of the « Big Data » that would preserve the privacy of individuals should be questioned. The storage of the data necessary for their exploitation also poses another technical problem: digital data can be hacked. In this case, encryption is an existing technical solution. Finally, one of the limitations of the « Sanitary Big Data » concerns information logistics: how to ensure that the relevant information reaches the right place, at the right time, to the right person? Especially for decision making in real time or near real time? This is a micro-economic approach that is being evaluated by the medical intelligence service. However, its effectiveness also depends on the combination between micro and macro approaches to the problems.

V. CONCLUSION
The « Big Data » brings advances in the field of health. Particularly in the sanitary field, the exploitation of the « Big Data » allows the French medical forces to create value, particularly for decision-making. The « Big Data » is also an innovation in terms of economic and social models. But is it just an evolution of the performance of existing tools or a simple fad? This question remains open. Finally, the « Big Data » involves the use of ultra-sensitive data that, even if used anonymously, must be handled with care. The debate is more relevant than ever, and time will tell how to reconcile these two aspects.

REFERENCES
[1] Berman HM & al (2000). The Protein Data Bank. Nucleic Acids Research; 28 (1): 235-242.
[2] Boutin JP & al (2004). Pour une veille sanitaire de défense. Médecine et Armées; 32 (4): 366-372.
[3] Bourguignon A (1998). Management Accounting and Value Creation: Value, Yes, but What Value? Working Paper, ESSEC, November 1998, 19 p.
[4] Kuhn T (1983). La structure des révolutions scientifiques, Flammarion.
[5] Lazer D & al (2014). The Parable of Google Flu: Traps in Big Data Analysis. Science; 343 (6176): 1203–1205.
[6] Pouyllau S (2013). Web de données, big data, open data, quels rôles pour les documentalistes. Documentaliste – Sciences de l'Information; 50: 32-33.
[7] Raghupathi W & Raghupathi V (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems; 2: 3.
[8] Reynard O, Volchkov V & Peyrefitte C (2014). Une première épidémie de fièvre à virus Ebola en Afrique de l'Ouest. Médecine Sciences; 30: 671–673.
[9] Savall H, Zardet V, Cappelletti L, Beck E, Noguera, Ocler R (1999-2001). Rapport de recherche, bilan de réalisation d'une recherche-intervention conduite sur 72 entreprises du secteur de la gestion de patrimoine, ISEOR.

Integration of useful links in distributed databases using decision tree classification
Tahar Mehenni
Computer Science Department, University Mohammed Boudiaf of M'sila, 28000 M'sila, Algeria
Email: [email protected]

Abstract—Nowadays, distributed relational databases constitute a large part of the information storage handled by a variety of users. Knowledge extraction from these databases has been studied massively during the last decade. However, a problem still present in the distributed data mining process is the communication cost between the different parts of the database, which are naturally located on remote sites. We present in this paper a decision tree classification approach with a low-cost communication strategy using a set of the most useful inter-base links for the classification task. Different experiments conducted on real datasets showed a significant reduction in communication costs and an accuracy almost identical to that of some traditional approaches.

I. INTRODUCTION

Currently, information is practically always stored in relational databases. Moreover, the storage of data in different and distant sites is nowadays possible and easy, thanks to recent computer network technologies. Extracting knowledge from the huge amounts of data stored in distributed databases is a complex task with the traditional techniques used to perform a data mining task, e.g. classification. To perform a distributed data mining task, the traditional way consists in migrating all the partitions of the distributed database to a unique site. This makes it easy to apply traditional relational data mining algorithms, but this technique is often expensive, since one whole database is moved to another site, and unsafe in the case of confidential data. It is useful and economical to achieve data mining tasks on data coming from different sites without trying to migrate them (i.e. keeping data in place). Moreover, this technique can prevent the sharing of data and will give a complete proof of security that gives a tight bound on the information revealed.

The main idea developed in this paper consists in exchanging some information between the different tables of the distributed database using some useful links, in order to perform a data mining task. To transfer information between several relations through some links, it is necessary to detect these links, which are simply links between attributes of relations issued from different sites and which will serve as bridges for the information transfer. Moreover, information exchanges between sites are often expensive, especially if the sites are very distant. It is therefore very important to take into account the number of transfers achieved between the sites.

We present in this paper a decision tree based classification approach without migrating the whole distributed database to


a unique site. We use a traditional decision tree algorithm, and we integrate a technique to select the partition which has the lowest communication cost and a computing method to find the most useful attribute for the decision tree. The remainder of the paper is organized as follows. In Section 2, we review the related work on classification approaches in distributed databases. Section 3 presents our contribution, in particular the different methods and techniques of the low-cost communication strategy. In Section 4, we present the decision tree classification algorithm in which we integrate the techniques explained in Section 3. Section 5 discusses the different tests and experiments performed. Finally, Section 6 gives perspectives of the work and concludes the paper.

II. RELATED WORK

Classification-based decision tree approaches [1] were initially proposed by [2] (see also [3]), where the authors proposed a general framework for multi-relational data mining. [4] developed the Multi-Relational Decision Tree Learning (MRDTL) algorithm on the ideas of [2] and of another system based on Inductive Logic Programming (ILP) named TILDE (Top-down Induction of Logical Decision Trees) [5]. MRDTL-2 [6] is a more efficient version of MRDTL. HTILDE (Holding TILDE) [7] is an algorithm proposed to handle very large relational databases, based on TILDE and the propositional Very Fast Decision Tree (VFDT) learner [8]. CrossMine is an efficient approach for multi-relational classification, presented in [9]. To perform links between relations, CrossMine uses a virtual join technique named tuple ID propagation ([10] and [11]). The main idea of this method is to propagate the identifier (ID) of tuples as well as their classes to the different relations; the propagated IDs can then be used to identify the different features in the relations. The RDC algorithm [12] is an efficient and accurate approach for relational database classification using decision trees. The authors suggested some modifications to the MRDTL-2 algorithm, with the usage of the ID propagation technique, in order to speed up decision tree algorithms and enhance their applicability to large relational databases. Yin and Han [13] developed a rule-based classification approach by introducing the notion of link utility. Their algorithm, named MDBM, is used mainly for classification in multiple and heterogeneous databases.

Distributed data mining has been more and more studied these last years ([14], [15], [16], [17], [18] and [19]). Its objective is the

knowledge discovery from data distributed through different sites. There exist two types of distributed data:
• Horizontal data: different objects with the same attributes are stored in different sites.
• Vertical data: different attributes of the same object are stored in different sites.
A toy illustration of these two settings is sketched below.
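The following sketch is illustrative only and is not taken from the paper; the table, column names and values are hypothetical. It shows the two distribution settings for a single relation using Python and pandas:

# Illustrative sketch: horizontal vs. vertical distribution of one
# relation across two sites, using a hypothetical pandas table.
import pandas as pd

patients = pd.DataFrame({
    "id":    [1, 2, 3, 4],
    "age":   [34, 51, 42, 28],
    "city":  ["M'sila", "Algiers", "Oran", "Setif"],
    "label": ["Yes", "No", "Yes", "No"],
})

# Horizontal data: same attributes, different objects on each site.
site_A_horizontal = patients.iloc[:2]
site_B_horizontal = patients.iloc[2:]

# Vertical data: same objects, different attributes on each site
# (the key "id" is kept on both sites so tuples can be re-linked).
site_A_vertical = patients[["id", "age", "label"]]
site_B_vertical = patients[["id", "city"]]

print(site_A_horizontal, site_B_horizontal, sep="\n\n")
print(site_A_vertical, site_B_vertical, sep="\n\n")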

The traditional way of distributed data mining is to integrate in a unique place all the data located in different sites and then apply the adequate data mining algorithms [18]. Distributed data mining then operates on a table of migrated data coming from different sites. However, it is always expensive and difficult to merge distributed data, or even to migrate the whole database to another site. Recent research has been interested in classification from distributed heterogeneous data (vertical data). [15] developed a decision tree framework from heterogeneous data using an evolutionary technique. [16] presented a general decision tree strategy by exchanging, through different sites, information about tuples that verify some constraints on particular attributes. [19] developed an algorithm, named DDT (Distributed Dot Product), which builds a decision tree from vertically distributed data without centralization (i.e. without migrating the whole database to one site). To reduce the communication cost between sites, the authors used an approximation of the information gain based on a random projection-based dot product estimation and a message sharing strategy. We present in this paper a decision tree classification algorithm over heterogeneous distributed data without centralization. Our algorithm, named CLADIS (CLAssification over DIStributed data), uses a technique for selecting the best attribute coming from the most economic site, i.e. the one with the lowest communication cost ratio, in order to build the decision tree.

III. ATTRIBUTE SELECTION

A relational database D consists of a set of relations denoted R_i (i = 1 ... n) and a set of links between pairs of relations. The columns in a relation correspond to the attributes of that relation and the rows correspond to tuples; we will denote an attribute k in the relation R as R.k. Each relation has at least one key attribute, a primary key, that uniquely identifies its tuples. The remaining attributes are either descriptive attributes or foreign key attributes. A foreign key attribute is a key attribute of another relation. Each relation may have one primary key and several foreign keys. In relational decision tree classification, one of the relations in the database is the target relation R_t, with class labels associated with its tuples. A tuple associated with the value Yes is called a positive tuple, while a negative tuple is a tuple associated with the value No.

In order to transfer some information between two relations via their links, it is necessary to identify these links, which are attributes of relations located in different sites. The problem is that certain attributes can be efficient bridges for the information transfer and the data mining process, while other links can be bad means of communication. It would then be very interesting to be able to detect these links and to determine their utility before performing a join between their corresponding relations for the data mining task. We propose a prediction-based model using regression concepts to identify the useful links between two different relations coming from different sites.


A link is considered useful if it yields a significant information gain, and useless otherwise. To build a prediction model for useful links, it is necessary to define the usefulness of a link in a predictable way. This definition must indicate the potential information gain obtainable via the link, and it must be independent of the parameters of the problem. To evaluate the classification capability of attributes, we use the information gain of an attribute, which is defined as follows (see [20] and [9]):

Definition 1 (Information gain): Assume there are P positive tuples and N negative ones at a tree node. Suppose an attribute A_l, found by propagation through the link l, divides these tuples into k partitions, each containing P_i positive tuples and N_i negative tuples. Then:

gain(A_l) = entropy(P, N) - \sum_{i=1}^{k} \frac{P_i + N_i}{P + N} \cdot entropy(P_i, N_i)    (1)

where

entropy(P, N) = -\frac{P}{P+N} \log \frac{P}{P+N} - \frac{N}{P+N} \log \frac{N}{P+N}    (2)

The information gain measure is biased toward tests with many outcomes. We use an extension of information gain known as the gain ratio [20], which attempts to overcome this bias by applying a kind of normalization to information gain.

Definition 2 (Gain ratio): Suppose there are P positive tuples and N negative ones at a tree node, and an attribute A_l found by propagation through the link l. The gain ratio of A_l is defined by:

gainratio(A_l) = \frac{gain(A_l)}{-\sum_{i=1}^{k} \frac{P_i + N_i}{P + N} \cdot \log \frac{P_i + N_i}{P + N}}    (3)

We define the usefulness of a link l as the maximum gain ratio obtained from the attributes A_l found by propagation through the link l, as follows.

Definition 3 (Link usefulness): Suppose A_l is an attribute of the relation R_l found by propagation through the link l. The usefulness of l is defined by:

usefulness(l) = \max_{A_l \in R_l} gainratio(A_l)    (4)
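The following short Python sketch implements Definitions 1-3 directly from the formulas above, assuming base-2 logarithms; the partition counts in the example call are made up.

```python
import math

def entropy(p, n):
    """Binary entropy of a node with p positive and n negative tuples (Eq. 2)."""
    total = p + n
    if total == 0 or p == 0 or n == 0:
        return 0.0
    return -(p / total) * math.log2(p / total) - (n / total) * math.log2(n / total)

def gain(p, n, partitions):
    """Information gain of an attribute splitting (p, n) into partitions [(p_i, n_i), ...] (Eq. 1)."""
    total = p + n
    return entropy(p, n) - sum((pi + ni) / total * entropy(pi, ni) for pi, ni in partitions)

def gain_ratio(p, n, partitions):
    """Gain ratio (Eq. 3): information gain normalised by the split information."""
    total = p + n
    split_info = -sum((pi + ni) / total * math.log2((pi + ni) / total)
                      for pi, ni in partitions if pi + ni > 0)
    return gain(p, n, partitions) / split_info if split_info > 0 else 0.0

def link_usefulness(partitions_per_attribute, p, n):
    """Usefulness of a link (Eq. 4): best gain ratio over the attributes reached through it."""
    return max(gain_ratio(p, n, parts) for parts in partitions_per_attribute)

# Example: 6 positive / 4 negative tuples split by one attribute into two partitions.
print(gain_ratio(6, 4, [(5, 1), (1, 3)]))
```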

B. Predicting link usefulness

To build a prediction model of link usefulness, two properties of links that are related to their usefulness are selected: coverage and fan-out ([13], [21]).

Definition 4 (Coverage of a link): Suppose l is a link between two relations R_1 and R_2 joining attributes A and B, i.e. l = R_1.A -> R_2.B. Let n_12 be the number of tuples in R_1 joinable with R_2 via l, and n_1 the number of tuples in R_1. The coverage of the link l is the proportion of tuples in R_1 that are joinable with R_2 via l:

coverage(l) = \frac{n_{12}}{n_1}    (5)

Definition 5 (Fan-out of a link): Suppose l is a link between two relations R_1 and R_2 joining attributes A and B, i.e. l = R_1.A -> R_2.B, and let n_12 be the number of tuples in R_1 joinable with R_2. For each tuple v_j in R_1 (j = 1..n_12) there is a number s_j of tuples in R_2 that are joinable with v_j via the link l. The fan-out of the link l is defined as:

fanout(l) = \frac{1}{n_{12}} \sum_{j=1}^{n_{12}} s_j    (6)

Based on these two properties of links, we use regression techniques to predict their usefulness. Regression is a well-studied field, with many mature approaches such as linear or non-linear regression, support vector machines, and neural networks. We choose Support Vector Regression (SVR) [22], because experiments show that this model achieves high accuracy in predicting the usefulness of links on testing datasets [21].

1) Support Vector Regression (SVR): Support Vector Regression is a powerful machine learning method that is useful for constructing data-driven non-linear process models through a kernel function. It shares many features with Artificial Neural Networks (ANN) but possesses some additional desirable characteristics and is gaining widespread acceptance in data-driven non-linear modeling applications. SVR offers good generalization ability of the regression function, robustness of the solution, regression from sparse data and automatic control of the solution complexity. The method brings out the explicit data points from the input variables that are important for defining the regression function. This feature makes SVR interpretable in terms of the training data, in comparison with other black-box models, including ANN, whose parameters are difficult to interpret. Below is a brief description of SVR; more detailed descriptions can be found, for example, in [23], [22] and [24].

Given a dataset D = {(x_i, y_i)}_{i=1}^{N} obtained from a latent function, where x_i denotes the sample vector, y_i the corresponding response and N the total number of samples, SVR first maps the original data non-linearly into a high-dimensional feature space, and then fits a linear function to approximate the latent function between x and y. Given the training data, the linear ε-SVR algorithm aims to solve the following optimization problem with an ε-insensitive loss term:

Minimise \frac{1}{2} \|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} L(y_i - f(x_i), \varepsilon)    (7)

where the penalty parameter C is a predefined regularization parameter. The above minimization problem can further be expressed


in the following form, with the slack variables ξ_i and ξ_i^* introduced:

Minimise \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)    (8)

subject to
y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i
\langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*
\xi_i, \xi_i^* \ge 0, \quad i = 1, 2, ..., N

With the help of the Lagrange multiplier method and a Quadratic Programming (QP) algorithm, the regression function can be derived as

f(x) = \sum_{i=1}^{N} (\alpha_i^* - \alpha_i) \, k(x_i, x) + b    (9)

b = y_i - \sum_{j=1}^{N} (\alpha_j^* - \alpha_j)(x_j \cdot x_i) + \varepsilon    (10)

where α_i^* and α_i are the optimized Lagrange multipliers and k(x_i, x) is the kernel function. For regression by SVR, the user has to select three parameters: the insensitivity parameter ε, the penalty parameter C and the shape parameter of the kernel function. The choice of these parameters is vital to good regression: if C is too small, insufficient stress is placed on fitting the training data; if C is too large, the algorithm overfits the training data, which implies poor generalization. In our study ε-SVR is applied to construct the SVR model, with the Radial Basis Function (RBF) as kernel; the support vector is the center of the RBF and σ determines the area of influence this support vector has over the data space. Two parameters have to be predefined before training: the regularizing factor C and the sparsity parameter ε. The libsvm package [25] is employed in our study to construct the SVR models.

C. Economical strategy of site selection

We have defined above a technique for predicting the usefulness of a link between two relations belonging to two different sites. Nevertheless, some confusion can occur if several sites have nearly the same link usefulness. In such situations, it is necessary to choose the best site, i.e. the one that yields the lowest communication cost. We use a strategy inspired by [13], where the authors defined a way to choose the most economical attribute, i.e. the one with the lowest communication cost ratio (CCR). This ratio is defined as follows.

Definition 6 (Communication Cost Ratio (CCR)): Suppose there is a source relation R_s with |R_s| tuples, each tuple being associated with I tuple IDs on average. The Communication Cost Ratio of propagation through a link l is given by the coverage of the link l and its usefulness, which can be estimated by the prediction model:

CCR(l) = \frac{coverage(l) \cdot |R_s| \cdot I}{usefulness(l)}    (11)
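To make the link-usefulness prediction and the CCR concrete, here is a minimal Python sketch. The paper uses the libsvm package directly; this sketch substitutes scikit-learn's SVR class, which wraps libsvm's ε-SVR with an RBF kernel. The training pairs (coverage, fan-out) → usefulness and all numeric values are invented for illustration only.

```python
import numpy as np
from sklearn.svm import SVR  # scikit-learn's SVR wraps libsvm (epsilon-SVR, RBF kernel)

def coverage(r1_keys, r2_keys):
    """Eq. 5: proportion of R1 tuples joinable with R2 via the link."""
    r2 = set(r2_keys)
    return sum(1 for k in r1_keys if k in r2) / len(r1_keys)

def fanout(r1_keys, r2_keys):
    """Eq. 6: average number of R2 tuples joined by each joinable R1 tuple."""
    r2 = set(r2_keys)
    counts = [sum(1 for k2 in r2_keys if k2 == k) for k in r1_keys if k in r2]
    return sum(counts) / len(counts) if counts else 0.0

# Train an epsilon-SVR model mapping (coverage, fanout) -> observed usefulness (toy data).
X_train = np.array([[0.9, 1.2], [0.4, 3.5], [0.7, 2.0], [0.2, 5.1]])
y_train = np.array([0.62, 0.15, 0.40, 0.08])
model = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(X_train, y_train)

def ccr(link_coverage, link_fanout, n_source_tuples, avg_ids_per_tuple):
    """Eq. 11: communication cost ratio using the predicted usefulness."""
    predicted_usefulness = float(model.predict([[link_coverage, link_fanout]])[0])
    return (link_coverage * n_source_tuples * avg_ids_per_tuple) / max(predicted_usefulness, 1e-9)

print(ccr(0.8, 1.5, n_source_tuples=2000, avg_ids_per_tuple=3))
```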

D. Selection procedure

The selection procedure shown in Algorithm 1 is used to choose the best attribute with the lowest communication cost. For every active relation R_a, which is the current or the target relation, we compute the information gain of every candidate attribute. For every inactive relation R_i that can be joined with some active relation R_a, we predict the usefulness of the possible links with R_a using the regression model, compute the CCR of each link whose predicted usefulness is greater than a threshold (defined after many experiments), then choose the relation R_c having the lowest CCR and propagate the tuple IDs from R_a to R_c. Finally, we choose the attribute with the highest information gain.

Algorithm 1 Attribute selection
 1: procedure SELECT_ATTRIBUTE()
 2:   set R_t to active
 3:   for each active relation R_a do
 4:     A_max := Find_InfoGain()
 5:   end for
 6:   for each inactive relation R_i do
 7:     U_i := Predict_usefulness(R_i, R_a)
 8:     if U_i >= U_minimum then
 9:       Cost_i := CCR(R_i, R_a)
10:     end if
11:   end for
12:   choose the relation R_c having Cost_c = min(Cost_i)
13:   propagate IDs from R_a to R_c
14:   A_max := Find_InfoGain()
15: end procedure

IV. DECISION TREE CLASSIFICATION ALGORITHM

CLADIS uses the principle of the RDC algorithm [12] to build the decision tree, while integrating the best-attribute selection procedure detailed in the previous section. CLADIS adds decision nodes to the tree through a process of successive refinement until some stopping condition is met, by testing whether all the records have either the same class label or the same attribute values, or whether the number of records has fallen below some minimum threshold. Whenever a stopping condition is met, a leaf node with its corresponding class is introduced instead. Otherwise the procedure Select_attribute() is called to return the attribute that provides a good split; a left and a right branch are then introduced and the procedure is applied to each of them recursively.

The pseudo-code of CLADIS is given in Algorithm 2, where T is the current tree and R denotes the relations of the distributed database. CLADIS proceeds recursively by selecting the best attribute that partitions the records and building the decision tree by successive additions of nodes. Stopping_cond() terminates the tree-building process, Create_node() creates a new node of the tree, Classify() assigns a class label to the leaves of the tree, and Select_attribute() is the selection procedure presented above. Tree_cladis(T, R) denotes the left branch of the tree, obtained by applying the building process of R to the current tree, while Tree_cladis(T, ¬R) denotes the right branch, obtained by applying the building process of the complement of R (i.e., the objects of the database that are not selected by R) to the current tree.

Algorithm 2 CLADIS
 1: procedure TREE_CLADIS(Tree T, Relations R)
 2:   if Stopping_cond() then
 3:     leaf_node := Create_node()
 4:     leaf_node.label := Classify()
 5:     return leaf_node
 6:   else
 7:     R := Select_attribute()
 8:     T_left := TREE_CLADIS(T, R)
 9:     T_right := TREE_CLADIS(T, ¬R)
10:   end if
11: end procedure
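A minimal, runnable sketch of the recursive refinement of Algorithm 2 is given below, operating on a flat in-memory list of records rather than on distributed relations. The functions stopping_cond(), classify() and select_attribute() are simplified stand-ins for the procedures named above, and the split returned by select_attribute() here is hypothetical.

```python
# Sketch of the successive-refinement tree construction (not the distributed implementation).

def stopping_cond(records, min_size=2):
    labels = {r["label"] for r in records}
    return len(records) < min_size or len(labels) <= 1

def classify(records):
    labels = [r["label"] for r in records]
    return max(set(labels), key=labels.count)          # majority class

def select_attribute(records):
    # Placeholder: in CLADIS this is the CCR-driven selection procedure (Algorithm 1).
    return ("age", 40)                                   # hypothetical attribute and threshold

def tree_cladis(records):
    if stopping_cond(records):
        return {"leaf": True, "label": classify(records)}
    attr, threshold = select_attribute(records)
    left = [r for r in records if r[attr] <= threshold]  # records selected by R
    right = [r for r in records if r[attr] > threshold]  # complement of R
    if not left or not right:                            # degenerate split: stop
        return {"leaf": True, "label": classify(records)}
    return {"leaf": False, "split": (attr, threshold),
            "left": tree_cladis(left), "right": tree_cladis(right)}

data = [{"age": 25, "label": "Yes"}, {"age": 52, "label": "No"},
        {"age": 33, "label": "Yes"}, {"age": 61, "label": "No"}]
print(tree_cladis(data))
```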

V. EXPERIMENTS

We performed comprehensive experiments on two public datasets in order to evaluate the performance of our algorithm. The first dataset is StatLog DNA, a version of the molecular biology database from the UCI Machine Learning Repository; it is composed of 2000 nucleotide DNA sequences, each described by 180 binary attributes plus the class label (one of three values). The second dataset is the COIL 2000 Challenge dataset from the UCI repository; it contains information about the customers of an insurance company and is composed of 5822 examples and 86 attributes. In order to obtain distributed data, we split each dataset vertically into two subsets stored on two different sites.

CLADIS is implemented on a microcomputer with a 2.4 GHz processor and 2 GB of RAM running Windows 7. The implementation language is C# under Visual Studio .NET 2008 (Express edition). CLADIS is compared with two algorithms: MDBM [13] and DDT [19]. MDBM is a rule-based classification algorithm used in relational and heterogeneous databases, while DDT is a decision-tree-based classification algorithm used in distributed heterogeneous databases. We implemented these two algorithms from their respective papers, i.e. [13] and [19].

We start with a comparison of the three algorithms according to the communication cost resulting from the different transfers between the sites of the database. DDT estimates the communication cost as the number of messages exchanged between the sites, while MDBM and CLADIS estimate the cost according to formula (11). Figure 1 gives the average communication costs on StatLog DNA and COIL. It is interesting to notice that the three algorithms give fairly similar results: on StatLog DNA the three algorithms are almost equally economical, but on COIL, DDT is more expensive. On both datasets, CLADIS reduces the communication cost more efficiently than the two other algorithms.

Fig. 1. Average communication cost (KB) of DDT, MDBM and CLADIS on StatLog DNA and COIL.

Figure 2 shows the accuracy results of our experiments on StatLog DNA and COIL. MDBM, DDT and CLADIS were executed on the two datasets; cross-validation over 10 blocks of each dataset is used to obtain 10 accuracy rates, whose average is then computed. For the StatLog DNA dataset, both MDBM and DDT have an average accuracy not exceeding 90%, whereas the average accuracy of CLADIS is above 92%. For the COIL dataset, the three algorithms give an accuracy rate greater than 90%, but CLADIS reaches nearly 95%. These results show that CLADIS is more efficient than MDBM and DDT because it uses link-usefulness prediction, which provides valuable information for building an efficient decision tree.

Fig. 2. Average accuracy of DDT, MDBM and CLADIS on StatLog DNA and COIL.

Average running times are shown in Figure 3. DDT is the slowest algorithm, because of the complex computations performed in the matrix projection it uses. The running times of MDBM and CLADIS are nearly equal because both algorithms use a similar strategy; the difference lies mainly in the prediction model: MDBM uses neural networks, whereas CLADIS uses support vector regression.

Fig. 3. Average running time (sec) of DDT, MDBM and CLADIS on StatLog DNA and COIL.

VI. CONCLUSIONS AND FUTURE WORK

We presented a classification algorithm for distributed databases located on different sites. This algorithm, called CLADIS, uses the decision tree model but integrates a novel technique to select the best attribute for the construction of the tree. The basic idea consists in predicting the most economical site, i.e. the one presenting the lowest communication cost, where the most useful attribute is found, in order to build an efficient decision tree. Experiments were performed on two public datasets. The results show that our approach not only provides a meaningful way to reduce the inter-site communication cost, but also performs an efficient classification.

A primary set of directions for future work can be summarized in the following axes:

• Carrying out a comparative study of several regression models, in order to find the one that most efficiently predicts the usefulness of links.
• Adding more attribute properties to the regression model, in order to predict link usefulness more accurately.
• Using other classification approaches (neural networks, support vector machines, the naive Bayesian method, ...) and carrying out a comparative study, in order to find the most efficient and accurate classifier.

REFERENCES

[1] L. Rokach and O. Maimon, Data Mining With Decision Trees: Theory and Applications. World Scientific, Singapore, 2008.
[2] A. Knobbe, H. Blockeel, A. Siebes, and D. Van der Wallen, "Multi-relational decision tree induction," Principles of Data Mining and Knowledge Discovery, vol. 1704, pp. 378–383, 1999.
[3] A. Knobbe, Multi-relational Data Mining. IOS Press, Netherlands, 2006.
[4] H. Leiva, "A multi-relational decision tree learning algorithm," Master's thesis, Department of Computer Science, Iowa State University, 2002.
[5] H. Blockeel, "Top-down induction of first order logical decision trees," Doctoral thesis, Katholieke Universiteit Leuven, 1998.
[6] A. Atramentov and H. Leiva, Inductive Logic Programming, vol. 2835, pp. 38–56, 2003.
[7] L. Lopes and G. Zaverucha, "HTILDE: Scaling up relational decision trees for very large databases," in Proceedings of the ACM Symposium on Applied Computing, Honolulu, Hawaii, 2009, pp. 1475–1479.
[8] P. Domingos and G. Hulten, "Mining high-speed data streams," in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 71–80.
[9] X. Yin, J. Han, J. Yang, and P. Yu, "Efficient classification from multiple database relations: A CrossMine approach," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 6, pp. 770–783, 2006.
[10] X. Yin and J. Han, "Efficient multi-relational classification by tuple ID propagation," in Proceedings of the 2nd International Workshop on Multi-Relational Data Mining (MRDM-2003), Washington DC, 2003, pp. 122–134.
[11] X. Yin, "Scalable mining and link analysis across multiple database relations," Doctoral thesis, Illinois University, 2007.
[12] J. Guo, J. Li, and W. Bian, "An efficient decision tree classification algorithm," in 3rd IEEE International Conference on Natural Computation (ICNC'07), vol. 3, 2007, pp. 530–534.
[13] X. Yin and J. Han, "Efficient classification from multiple heterogeneous databases," in Knowledge Discovery in Databases (PKDD'05), vol. 3721, 2005, pp. 404–416.
[14] K. Potamias, M. Tsiknakis, V. Moustakis, and S. Orphanoudakis, "Mining distributed and heterogeneous data sources in the medical domain," in Proceedings of Machine Learning in the New Information Age, MLnet Workshop, European Conference on Machine Learning, Barcelona, 2000, pp. 27–38.
[15] B. Park, H. Kargupta, E. Johnson, E. Sanseverino, and D. Hershberger, "Distributed, collaborative data analysis from heterogeneous sites using a scalable evolutionary technique," Applied Intelligence, vol. 16, pp. 19–42, 2002.
[16] D. Caragea, A. Silvescu, and V. Honavar, "Decision tree induction from distributed heterogeneous autonomous data sources," in Proceedings of the Third International Conference on Intelligent Systems Design and Applications (ISDA03), Tulsa, USA, 2003, pp. 341–350.
[17] J. Castillo, A. Silvescu, D. Caragea, J. Pathak, and V. Honavar, "Information extraction and integration from heterogeneous, distributed, autonomous information sources - a federated ontology-driven query-centric approach," in IEEE International Conference on Information Reuse and Integration (IRI03), 2003, pp. 183–191.
[18] B. Park and H. Kargupta, The Handbook of Data Mining. Lawrence Erlbaum Associates, London, 2003, ch. Distributed Data Mining, pp. 341–362.
[19] C. Giannella, K. Liu, T. Olsen, and H. Kargupta, "Communication efficient construction of decision trees over heterogeneously distributed data," in Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM04), Brighton, UK, 2005, pp. 67–74.
[20] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, San Francisco, 2006.
[21] T. Mehenni and A. Moussaoui, "Data mining from multiple heterogeneous relational databases using decision," Pattern Recognition Letters, vol. 33, pp. 1768–1775, 2012.
[22] S. Abe, Support Vector Machines for Pattern Classification. Springer Verlag, 2005.
[23] A. Smola and B. Scholkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199–222, 2003.
[24] V. Vapnik, The Nature of Statistical Learning Theory, 2nd ed. NY: Springer, 1999.
[25] C. Chang and C. Lin, "LIBSVM: A library for support vector machines," software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

ChoreMAP: extraction and visualization of visual summaries based chorems I. Cherni, S. Faiz and R. Laurini

Abstract—Traditional cartography is an essential tool to describe the facts and relations concerning a territory. Expert users are usually satisfied with the expressive power of traditional mapping when it deals with simple cases. But in complex cases involving large amounts of data, expert users need a map that stresses the most important aspects rather than several maps with a high level of detail. It is in this context that our research was launched, in order to automatically discover spatial patterns and visualize them, based on a spatial database and chorems. This paper presents the project, focusing on the subsystem that extracts salient patterns, which are encoded in an extensible markup language (XML)-based language called the chorem markup language (ChorML) and then displayed as visual summaries by the visualization subsystem.

Index Terms—Chorem, mapping, visual summaries, patterns, extraction, visualization.

I. INTRODUCTION

Business intelligence is now a very common concept for companies; the newer concept of territorial intelligence was created in order to transfer business intelligence methods to the understanding and governance of geographic territories. According to [17], territorial intelligence can be defined as a science whose object is the sustainable development of territories and whose subject is a territorial community. Moreover, territorial intelligence:
• puts multidisciplinary knowledge about territories and their dynamics in relation,
• strengthens the abilities of territorial communities to take part in their development in a fair and sustainable way,
• improves the sharing of territorial information and spreads its analysis methods and tools thanks to Information and Communication Technologies,
• and promotes governance, decision-making processes and practices valuing participation, partnership and action research that contribute to the fair and sustainable development of the territorial community.

Ibtissem Cherni is with the LIRIS Laboratory - INSA - University of Lyon and the LTSIRS Laboratory - ENIT - University of Tunis (e-mail: [email protected]). Sami Faiz is with the LTSIRS Laboratory - ENIT - University of Tunis, and Professor in Computer Science at the Institut Supérieur des Arts Multimédias de La Manouba (ISAMM) (e-mail: [email protected]). Robert Laurini is with the LIRIS Laboratory - INSA - University of Lyon (e-mail: [email protected]).


But, what can be the ideal computing tools for territorial intelligence? Those tools must bring fresh geographic knowledge to politicians in order to help them not only be aware of problems, but also make good decisions. Usually business intelligence is based on (1) knowledge extraction especially by data mining and on (2) visualizing tools. A difference must be made between “spatial knowledge” and “geographic knowledge”. Whereas spatial knowledge derives from geometry and topology, as soon as geographic entities are concerned, then it becomes geographic knowledge. In our case, territorial intelligence is essentially concerned by geographic knowledge. Chorems, as visual schematized representations of territories and visual tools for geographic knowledge representation can be seen as a good candidate to help territorial decision-makers. Chorems were proposed by cartographers to summarize important patterns in geographic landscapes [7]. The objective of this paper is to present the challenges related to automatically extract the most significant elements, and then create chorems from the geographic data-bases. We will present chorems and a system able to extract and to visualize them. Then an example will be detailed. We conclude this paper by giving some future research perspectives. II. REPRESENTING GEOGRAPHIC KNOWLEDGE BY CHOREMS As previously said for many decisions, visual tools are necessary, and especially for spatial decision making for which geovisualization is an essential tool. When it is the cartography of facts, usually decision-makers are satisfied, but when it deals with visualization of problems, conventional cartography is rather delusive. Indeed it seems more interesting to locate problems and perhaps to help discover new problems or hidden problems. So, a research program [4] was launched between several research institutions in order to test whether cartographic solutions based on chorems can be more satisfying. By chorem-based schematized representations, one means that the more important is a sort of global vision emphasizing salient aspects. So, this definition can be considered as a good starting point to construct maps for spatial decision making. The word chorem comes from the ancient Greek χώρα which means space or territory.

In the past, chorems were made manually by geographers, essentially because they had the whole knowledge of the territory in their mind. This knowledge was essentially coming from their familiarity about the territory, its history, the climatic constraints and the main sociological and economic problems. Figure 1a shows a typical map of internal migrations in Tunisia; one notes that the information presented in this map seems a little bit unclear while Figure 1b (or chorematic map) highlights the salient elements through a formalism associated with the nature and location problems.

Fig. 1 Migrations in Tunisia; (a) conventional map; (b) chorematic map: internal flow of migration in Tunisia [1].

The objective of our research project is to extract important knowledge from a geographic database and visualize it, in order to give a visual summary of the contents of the database. A second definition of chorem can now be proposed: "a chorem can be seen as a visual way to represent geographic knowledge", and so it can be a tool to summarize geographic databases [16]. But the main problem is how to extract geographic knowledge from a geographic database. The solution is to use or develop data mining procedures to extract knowledge, usually in the form of descriptive logics. In this paper, we present an approach designed according to the following main specifications (Figure 3):
• chorem discovery based on spatial data mining, the result being a set of geographic patterns or geographic knowledge;
• chorem layout, including geometric generalization, selection and visualization algorithms.
In order to encode this knowledge, a special language named ChorML was designed. Based on XML, ChorML presents several levels (for details, see [6]):
• level 0 corresponds to the initial database in GML (Geography Markup Language) (see http://www.opengis.net/gml/ for details),
• level 1 corresponds to the list of extracted patterns,
• level 2 is a subset of SVG [http://www.w3.org/Graphics/SVG/].


For instance, at level 0, a feature such as a city can be described by longitude/latitude and some additional attributes, whereas at level 1 the feature remains only if it belongs to a selected pattern; finally, at level 2, we deal with pixel coordinates, radii, line styles, colors and textures. At level 1, the heading and complementary information are practically not modified, but in place of the GML database contents we have the list of patterns together with the way they were obtained (lineage).

III. SOFTWARE ARCHITECTURE OF A PROTOTYPE

The ChoreMAP project [1] was launched in 2010 and jointly developed between France and Tunisia. The project's mission is to display geographic information in the form of chorems after a data mining phase applied to a spatial database. Figure 2 shows the general architecture of the prototype. We distinguish three subsystems, each of which has a specific role:
• the subsystem extracting patterns from a geographic database,
• the subsystem extracting the most important patterns,
• the chorem visualization subsystem.
The Chorem Extraction System first transforms the actual database of available geographic data in order to facilitate the extraction of significant information by spatial data mining. The Chorem Visualization System manages this information by assigning it a visual representation in terms of chorems and chorematic maps. Once the patterns are extracted, the results are encoded in the ChorML language by a special subsystem for generating ChorML documents [2], and then visualized.

Fig 2. General architecture of the ChoreMAP prototype.

A. Chorem extraction subsystem

There are four kinds of patterns, as results from data mining, that appear to be the most interesting in chorem discovery:
• Facts, for instance the name of a country capital,
• Clusters, for instance any spatial regrouping of adjacent sub-territories,
• Flows (one way or both ways),
• Co-location patterns, especially to describe geographic knowledge; for instance "when there is a lake and a road leading to that lake, there is a restaurant".
In addition to that, we need to include:

• Topological constraints, for instance that a harbor must be inside a territory, not in the middle of the sea;
• and boundary descriptions, especially because outside information, such as sea or neighboring country names, is usually not included in the database.

1) Facts: A fact is considered as the result of one or more queries against the database [6]. A set of rules is defined in order to obtain basic information from the database [11]. To achieve this subsystem, two methods are required: one to analyze data through a user request, and another to encode and store the results in a ChorML file.

2) Clusters: Clustering is the method used to group data into classes, so that an object in a cluster has certain similarities with the other objects in the same cluster. For example, we could group parcels in a city by their land use type, or group regions by their ecosystem similarities. Clusters are strong candidates for generating chorems. There are many different clustering algorithms; after a detailed study of data mining methods, we selected k-means clustering as the most appropriate algorithm to group cities that are geographically close and share common characteristics into a number of groups fixed in advance (a minimal sketch of this step is given just after the flows overview below).

3) Flows: One study [6] showed that three types of flows are the most important: the flow path, the divergent-source flow and the convergent flow. The flow path type represents a flow whose origin and destination are well defined, and it may have a geometric shape (for example a large arrow). The divergent-source flow has a definite origin but a somewhat uncertain destination, which is a list of different geographic directions. Finally, the convergent flow has a definite destination, but its origin is a list of convergent geographic directions. Flows are used to represent the spatial dynamics within a territory: "We consider as flows every movement, material or immaterial, of goods, of people, of information, between different locations" [8]. Flows are generally represented by arrows in current mapping. We are particularly interested in the flow of goods.
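Following up on the clustering step described above, here is a minimal Python sketch that groups a handful of cities by approximate coordinates and one hypothetical thematic attribute using k-means. The city list, the figures and the number of clusters are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

cities = ["Tunis", "Sfax", "Sousse", "Gabes", "Gafsa", "Bizerte"]
# Columns: longitude, latitude, plus a made-up thematic attribute (e.g. a production index).
features = np.array([
    [10.17, 36.80, 0.2], [10.76, 34.74, 0.9], [10.64, 35.83, 0.5],
    [10.10, 33.88, 0.4], [ 8.78, 34.43, 0.8], [ 9.87, 37.27, 0.3],
])

X = StandardScaler().fit_transform(features)   # put coordinates and attributes on the same scale
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

for city, cluster in zip(cities, labels):
    print(city, "-> cluster", cluster)
```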

The ChorML structure of flows is presented in Figure 3.

• Flow of goods. Several authors, such as Yann and Zanin Tobelem in [9] and W. Tobler in [10], find that flows of goods are mainly related to three factors: (1) the emissivity of zone i, (2) the attractiveness of zone j and (3) an inverse function of the distance between the two zones i and j, where the distance between the zones has both a spatial and an economic separation component. Based on these studies, to extract knowledge about flows between clusters, we propose a method that studies the available quantity of goods. Through city census figures and the consumption and production of products for each city, we can get a good approximation of the movement of goods. The method compares the production and consumption of agricultural products: to extract the flow of goods, it is first necessary to study the quantity available in each city for each product, by subtracting the quantity consumed of a product from the quantity produced. The quantity calculation equation is:

Q_t = Pr - P_t * C_i    (1)

where:
P_t = population in year t,
C_i = consumption in year t of a product (P) by a person living in the city (V),
Pr = production of the product for the city (V) during year t,
Q_t = the available (or, if negative, missing) quantity of the product (P) in year t for the city (V).

After the preprocessing module, in which we store the geometric shapes in thematic layers, the descriptive data are organized into a data cube. We then apply the proposed flow-of-goods extraction method, which consists in:
1. calculating the quantities of goods available for all cities in the database;
2. determining the quantities available for each product group coming from the cluster extraction subsystem;
3. identifying the clusters whose available quantity is above a threshold; these clusters are considered emitter groups for the product;
4. for each receiver group, comparing its missing quantity with the quantities available in the emitter groups and the distance between the groups; the selected cluster is the nearest cluster that has an available quantity of the product p;
5. and finally encoding the flows in the ChorML language.
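The following minimal Python sketch walks through steps 1-4 of the method above for a single product. The cluster names, production and consumption figures and the zero surplus threshold are invented for illustration; distances are computed between cluster centres.

```python
import math

# Per-cluster data for one product: production, population, per-capita consumption, centre (x, y).
clusters = {
    "Sidi Bouzid": {"production": 900, "population": 400, "consumption": 1.0, "centre": (9.5, 35.0)},
    "Sousse":      {"production": 200, "population": 600, "consumption": 1.0, "centre": (10.6, 35.8)},
    "Gabes":       {"production": 150, "population": 350, "consumption": 1.0, "centre": (10.1, 33.9)},
}

def available_quantity(c):
    # Q_t = Pr - P_t * C_i  (Eq. 1 of this subsection)
    return c["production"] - c["population"] * c["consumption"]

def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

quantities = {name: available_quantity(c) for name, c in clusters.items()}
emitters  = [n for n, q in quantities.items() if q > 0]    # surplus above the threshold (here 0)
receivers = [n for n, q in quantities.items() if q <= 0]

flows = []
for r in receivers:
    # Choose the nearest emitter cluster for this receiver.
    e = min(emitters, key=lambda n: distance(clusters[n]["centre"], clusters[r]["centre"]))
    flows.append((e, r, min(-quantities[r], quantities[e])))

print(flows)   # each tuple: (emitter cluster, receiver cluster, transported quantity)
```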

Fig 3. The structure of flows in the ChorML language [6].


4) Co-location patterns: Co-location patterns are sets of characteristics of places that are presumed, with a certain probability, to be close to each other. Co-location rules are interesting for creating chorems because they define the organization of objects within the territory with a quantitative accuracy. The results of the extraction modules and of the cluster extraction are used to determine the relationships between the major cities of interest to the user and the groups resulting from the k-means algorithm. For example, if there are commonalities between two geometric shapes, we have a relation of type 'Touch', whose boundaries touch but whose interiors cannot touch.

B. Chorem Visualization System

In this subsection, we propose an architecture for the chorem visualization subsystem (Figure 4) containing the three major components that make up this framework, namely preprocessing of geographical coordinates, chorem construction and chorem edition.

Fig 4. General architecture of the visualization system.

1) Preprocessing of geographical coordinates: Because the Earth is round and maps are usually flat, converting information from the curved surface into a flat one requires a mathematical formula called a map projection.

2) Chorem construction: As its name suggests, the purpose of this phase is to create chorematic maps from a level-1 ChorML file. To do this, we apply the phases described below in sequence.

• Simplification and aggregation of geographic chorems. We chose to apply two map generalization operations, simplification and aggregation. Note that the purpose of generalization is to model the geographic area in order to capture phenomena at a broader level of abstraction than that of the data or the initial map. Note also that simplification and aggregation are based on spatial functions.

- Simplification. Geometric shapes are simplified by reducing the number of vertices that make up the spatial objects, which correspond to chorems in our case, while trying to keep the original shape. We chose to apply two algorithms in sequence: "Radius" and "RDP" [12]. This choice is justified by the fact that the geometric shapes to be treated consist of polygons (formed by a cyclic sequence of consecutive segments delimiting a portion of the plane); for this reason, we have to take into account the relationships between the points. In addition, the reduction by Radius, with a small tolerance, reduces the number of points without losing too much information, while the RDP algorithm, being faster with few points, is interesting in some cases. This gives a solution that is both quick and effective.

- Aggregation. The aggregation process generates a geometric representation by amalgamating characteristic elements sharing some common properties. In our case, we are interested only in the spatial characteristics of geographic chorems. An aggregation tolerance is defined for merging vertices: coordinates that are within the tolerance are considered coincident and are adjusted to share the same location. We start by computing the maximum distance based on the Pythagorean theorem. Once the value of the maximum distance is obtained, we calculate the distance between every two vertices v1 and v2. We must first find a third vertex v3 in order to build a right triangle; this vertex is the intersection of the projections of v1 and v2 along the x and y axes. The requirements of the Pythagorean theorem are then met, so the same formula can be applied to find the distance separating v1 and v2. We finally compare this value with the maximum distance:

If maximum distance >= distance(v1, v2)
then replace v1 and v2 by a single vertex with x = (x_v1 + x_v2)/2 and y = (y_v1 + y_v2)/2
else v1 and v2 are sufficiently distant and do not undergo any aggregation operation
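A minimal Python sketch of this aggregation rule follows: vertices closer than the tolerance are replaced by their midpoint, as in the rule above. The example polygon is made up.

```python
import math

def aggregate_vertices(vertices, tolerance):
    """vertices: list of (x, y); close pairs are replaced by their midpoint."""
    result = []
    skip = set()
    for i, v1 in enumerate(vertices):
        if i in skip:
            continue
        merged = v1
        for j in range(i + 1, len(vertices)):
            if j in skip:
                continue
            v2 = vertices[j]
            if math.hypot(v1[0] - v2[0], v1[1] - v2[1]) <= tolerance:
                merged = ((v1[0] + v2[0]) / 2.0, (v1[1] + v2[1]) / 2.0)
                skip.add(j)
                break                      # merge with the first close neighbour only
        result.append(merged)
    return result

polygon = [(0.0, 0.0), (0.05, 0.02), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(aggregate_vertices(polygon, tolerance=0.1))
```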

• The management of topological relations Theoretically speaking, topological relations are invariant with elastic transformations [3]. We implement two specialized algorithms to define the nature of the relationships resulting from the previous phase. Each identified relationship is compared with the original one described on the ChorML1 file. In case they are different, we try to correct the first one by performing successive operations of movements. We consider for example the case of a geographical chorem and one of annotation. Our process begins by determining the direction of one over the other: left, right, up or down. The determination is made by comparing the coordinates of the annotation chorem by: Min x, Min y, Max x and Max y of the geographical chorem. Note that these coordinates are obtained by inserting the geographical chorem in a rectangle where we consider Min x and Min y share the same position. Then, we go to the calculation of the Euclidean distance between the annotation chorem and all vertices of the geographical chorem. The smaller distance means sharing the same location between the two chorems. Through the original topological relationship contained in ChorML1, a displacement of the annotation chorem towards the nearest vertex is performed with a well-defined distance on the two axes x and y. This process is recursive until arriving to the stopping

condition, which is in our case the resolution of the identified topological conflicts.

• Placement of chorems. Once the previous phases have been applied, we convert the ChorML file to an appropriate visualization format; we chose SVG. The positions of the geographic and annotation chorems on the map are the results of the third phase. For the phenomenological chorems, we propose a treatment to determine their locations, described in the following.

Calculation of the gravity centers of geographic chorems: given the importance of response time and considering the number of clusters to handle, we propose a simple and fast solution to calculate the gravity center of a geographic chorem. It is sufficient to insert each chorem in a rectangle defined by the minimum and maximum x and y coordinates; the center coordinates are then obtained by applying the two formulas (4) and (5):

x = x_min + (x_max - x_min) / 2    (4)
y = y_min + (y_max - y_min) / 2    (5)
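A minimal Python sketch of formulas (4) and (5), computing the centre of the bounding box of a set of vertices; the example region is invented.

```python
def bounding_box_centre(vertices):
    """Centre of the rectangle enclosing a geographic chorem (formulas 4 and 5)."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    return (x_min + (x_max - x_min) / 2.0,   # formula (4)
            y_min + (y_max - y_min) / 2.0)   # formula (5)

region = [(2.0, 1.0), (6.0, 1.5), (5.0, 4.0), (2.5, 3.5)]
print(bounding_box_centre(region))   # used as an anchor point, e.g. for flow arrows between regions
```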

For the placement of phenomenological chorems, we only treat flows. We define for each of them a starting point and an ending point, which correspond to the gravity centers of the regions in question. The thickness of a flow arrow is proportional to its importance as described in the ChorML1 file.

3) Chorem edition: In order to ensure interaction with the user, our system provides the ability to change the produced chorematic map through a set of operations (moving an item, changing its size, zooming in, zooming out, etc.). These operations are available through a specialized graphical editor called "Inkscape". The use of such a tool helps to better meet the needs of users when they require further refinement of the semantic and graphical properties of chorems. In addition, a ChorML2 representation of the resulting map is generated.

IV. CASE STUDY: VISUALISATION OF MERCHANDISE FLOWS IN TUNISIA

In this section, we treat an example of merchandise flows in Tunisia.

A. Description of the dataset

The inland transport of merchandise is carried by road, rail or inland waterway. According to international definitions, transportation means a flow of goods moved over a given distance and is measured in ton-kilometers.


According to [14], given the large amount of digital information available in the world, statisticians have the difficult task of ensuring that trade analysts and others have speedy access to accurate business data. Producing data that are both accurate and up to date is costly and requires resources that are unfortunately still lacking in many developing countries. That is why the frequency and level of detail of national statistics vary considerably from one country to another. It has often been difficult to establish timely and comparable statistics on trade in goods for some developing countries, because these countries do not regularly communicate data that are consistent, comparable over time and between countries, and in accordance with international standards and guidelines.

The demographic and economic dynamics of the major Tunisian cities are an undeniable reality in the recent reconfiguration of the Tunisian territory. As a basic source of the Tunisian economy, intra-regional trade in food products constitutes the main flow of goods. Taking account of its importance and of the difficulty of its expert analysis, we find it interesting to address this limitation and propose an easier method. In fact, it is interesting to represent flows on a chorematic map: this provides an easy and synthetic vision, as the massive data about the territory and the goods are replaced by forms and symbols that are easy to understand. In what follows, we turn to the testing phase, where we use as system input a ChorML file from the ChoreMAP project containing the flows of food and agricultural products between regions.

B. Generation of the chorematic map

After loading the ChorML1 file, we integrate it with Java through a rich library, JDOM, which makes it possible to read this XML document and retrieve its components. Afterward, we apply the Mercator projection, whereby each component of the file has an identifier and its coordinates are expressed first in longitude/latitude and then in x, y (a minimal projection sketch is given below). We think it is advantageous to give the user access to the data contained in the ChorML1 file; storing these data in a database proves a good solution, as data access, interpretation and management become easier. The file components are distributed according to their types, so distinguishing between clusters, facts and flows is easier than in the original file. Via this interface, we offer the user the ability to manage all chorems: he can update them by adding, modifying or deleting one or more types (geographic chorem(s), annotation chorem(s) and/or phenomenological chorem(s)). After running the file, the cartographic generalization operations, simplification and aggregation, are applied to the geometric shapes of the clusters. The user can intervene in this process by modifying the default tolerance values; larger input values imply simpler geometric shapes for all the clusters.
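Below is the minimal projection sketch announced above. The paper only states that the Mercator projection is applied, so this sketch assumes the simple spherical variant of the formula; the Earth radius constant and the sample coordinates are illustrative.

```python
import math

EARTH_RADIUS = 6378137.0  # metres (WGS84 semi-major axis)

def mercator(lon_deg, lat_deg, radius=EARTH_RADIUS):
    """Spherical Mercator projection: longitude/latitude in degrees -> planar x, y in metres."""
    lon = math.radians(lon_deg)
    lat = math.radians(lat_deg)
    x = radius * lon
    y = radius * math.log(math.tan(math.pi / 4.0 + lat / 2.0))
    return x, y

print(mercator(10.18, 36.80))   # e.g. approximate coordinates of Tunis
```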

Once all topological conflicts are resolved and the final chorem locations are defined, our system generates the results. These results consist of a chorematic map displayed through the Inkscape editor. In addition, a ChorML2 file is produced; as indicated above, it is composed of XML and SVG tags and describes all of the following elements:
• metadata;
• a simplified list of geographic chorems, the result of the simplification and aggregation phase;
• a list describing the locations of annotation chorems obtained after the correction of topological relations;
• a list defining the locations of phenomenological chorems.
All changes made by the user are stored in the resulting file. The chorematic map and the corresponding Tunisia ChorML2 file representing merchandise flows are shown in Figure 5. With this map, we can easily analyze merchandise flows in Tunisia. We distinguish six main regions with different rates of goods production. The importance of a city's production is expressed through the diameters of the ellipses on the map, while the arrows represent the flows of merchandise between regions. The file generated by ChoreMAP indicates that the cities of Sidi Bouzid, Kairawen, Sfax, Gafsa and Gasserine produce considerable quantities of agricultural products exceeding their consumption. Their excess is distributed to the regions that lack them: Mednine Tataouine, Gabes, Kebili Touzeur, Sousse, Monastir and Mahdia. Our map indicates the destination of the different flows, and their width is proportional to the transmitted amounts. User interaction is provided through the set of operations offered by the Inkscape editor, so the user can easily customize the map produced by our system to his own needs. The choice of this graphical editor is motivated by the rich range of operations it offers and by its support of the SVG format [13]. Changes made to the ChorML2 file, composed of XML and SVG tags, are recorded. Note also that it is possible to handle more than one phenomenon on the same map; this is guaranteed thanks to the concept of chorem layers, where every phenomenon is described on a separate map and the superposition of these maps allows for a more comprehensive result.


Fig 5. User interaction with the produced map through the graphic editor.

V. CONCLUSION Traditional cartography is an essential tool to describe the facts and the relations concerning a territory. Geographic concepts are associated to geographic symbols and graphic symbols which help the readers understand immediately the visualized data. Expert users are usually satisfied with the expressive power of traditional mapping, when it deals with simple cases. But in some complex cases including a large number of data, the expert users need a map which stresses the most important aspects rather than to have several maps with a high level of details. So our objective is to define geovisualization solutions which can adequately represent the information extracted from geographic data. Visual models based on chorems can interpret and represent spaces, their geographic distributions and their dynamics. The same space can be represented in different ways, but all the corresponding maps will tell the same thing. We cannot change the message, its position, its hierarchy, its network, and all those items are expressed in the chorematic map. The representation of chorems allows us the best interpretation of problems. It is in this way that we can obtain all what we need: from the young pupils who want to learn geography, up to the researchers who investigate new forms of communication. Each chorem is a drawing that has its own form and its own meaning. The meaning can be a process which represents the dynamics of a certain place. Therefore a chorem is a powerful tool to represent the knowledge we possess about certain places due to their ability to symbolize and encapsulate a methodology and corresponding interpretation. We can show climatic, geographical, economic, sociological, geological, agricultural, issues etc. based on their spatial context, statically and temporary, due to the combination of several chorems. The most common situations to represent are those relating to the study of the structure and dynamics of the population, urban concentrations or the interaction between natural and

social systems. Chorems constitute a visual vocabulary for the description of the main characteristics of a territory, and from our point of view, they are a solid basis for decision making, because they highlight the most significant aspects by leaving aside secondary issues. When it is necessary to understand the structure of a territory, a complete map is not useful (Figure 1a), while a small pattern may be more useful (Figure 1b). So chorems are a key tool to map a territory, and allow decision makers to have a clearer view of the situation. In conclusion, we can consider that chorems are excellent candidate tools for territorial intelligence.

REFERENCES

[1] I. Cherni, S. Ouerteni, S. Faiz, S. Servigne and R. Laurini, "Chorems: A New Tool for Territorial Intelligence", 29th Urban Data Management Symposium, C. Ellul, S. Zlatanova, M. Rumor, Eds., London, pp. 67-76, Taylor & Francis, 2013.
[2] I. Cherni, K. Lopez, R. Laurini, S. Faiz, "ChorML : résumés visuels de bases de données géographiques", International Conference on Spatial Analysis and GEOmatics, Paris, France, 2009.
[3] R. Laurini, "A conceptual framework for geographic knowledge engineering", Journal of Visual Languages & Computing, vol. 25, issue 1, pp. 2-19, Elsevier, 2014.
[4] R. Laurini, F. Milleret-Raffort, K. Lopez, "A Primer of Geographic Databases Based on Chorems", Springer Verlag LNCS 4278, pp. 1693-1702, 2006.
[5] OGP Publication, "Coordinate Conversions and Transformations including Formulas", Geomatics Guidance Note Number 7, part 2, 2013.
[6] A.R. Coimbra, "ChorML: XML Extension for Modeling Visual Summaries of Geographic Databases Based on Chorems", Master Dissertation, INSA-Lyon, Université de Lyon, France, 2008.
[7] R. Brunet, "La carte-modèle et les chorèmes", Mappemonde 86/4, pp. 46, 1986.
[8] M. Egenhofer, "A Formal Definition of Binary Topological Relationships", Foundations of Data Organization and Algorithms, pp. 457-472, Springer, 1989.
[9] R. Yann, C. Zanin Tobelem, "L'Europe dans la régionalisation de l'espace mondiale : étude des flux commerciaux par un modèle d'interaction spatiale", Géocarrefour, pp. 137-149, 2009.
[10] W. Tobler, "Interaction spatiale et cartographie : les solutions de W. Tobler", Espace Populations Sociétés, pp. 467-485, 1991.
[11] Z. Guo, S. Zhou, Z. Xu, A. Zho, "G2ST: a novel method to transform GML to SVG", in: 11th ACM International Symposium on Advances in Geographic Information Systems, Association for Computing Machinery, pp. 161-168, 2003.
[12] D. Douglas, T. Peucker, "Algorithms for the reduction of the number of points required to represent a digitized line or its caricature", The Canadian Cartographer 10(2), pp. 112-122, 1973.
[13] SVG Scalable Vector Graphics (1999), [Online], Available: http://www.w3.org/Graphics/SVG.
[14] H. Escaith, "Statistiques du commerce international", Organisation mondiale du commerce, 2012.
[15] B. Lafon, C. Codemard, F. Lafon (2005), "Essai de chorème sur la thématique de l'eau au Brésil", [Online], Available: http://webetab.ac-bordeaux.fr/Pedagogie/Histgeo/espaceeleve/bresil/eau/eau.htm.
[16] V. Del Fatto, R. Laurini, K. Lopez, R. Loreto, F. Milleret-Raffort, M. Sebillo, D. Sol-Martinez, G. Vitiello, "Potentialities of Chorems as Visual Summaries of Spatial Databases Contents", Springer Verlag LNCS 4781, pp. 537-548, 2007.
[17] J.J. Girardot, "Principes, Méthodes et Outils d'Intelligence Territoriale. Évaluation participative et Observation coopérative", in Conhecer melhor para agir melhor, Actes du séminaire européen de la Direction Générale de l'Action Sociale du Portugal, EVORA, DGAS, Lisbonne, pp. 7-17, 2000.

I. Cherni was born in Tunisia in 1984. She graduated in 2008 from the Faculté des Sciences Juridiques, Economiques et de Gestion de Jendouba. She received a Master's degree in Computer Science (data, knowledge and distributed systems) in 2009 from the Faculté des Sciences Juridiques, Economiques et de Gestion de Jendouba and the Institut National des Sciences Appliquées de Lyon - University of Lyon. She is now an Assistant at the Faculté des Sciences de Gabès and a researcher at the LIRIS Laboratory - University of Lyon and the LTSIRS Laboratory - University of Tunis. Her research interests are in the area of spatial data mining and visual summaries.

Sami Faiz was born in Tunisia in 19xx. He obtained a Ph.D. in Computer Science from the University of Orsay (Paris 11) in 1996. He has been Professor in Computer Science at the Institut Supérieur des Arts Multimédias de La Manouba (ISAMM) since 2014 and is a member of the LTSIRS Laboratory - University of Tunis. He is a scientific and organizing committee member of various international conferences, and the founder of many national and international projects in geomatics. His main interests are spatial data mining and geomatics.

Robert Laurini is presently Professor Emeritus at INSA-Lyon, University of Lyon, and president of the NGO "Universitaires Sans Frontières / Academics Without Borders". During his career he has been intensively involved in international affairs; among other things, he has been a member of PhD committees in 17 countries. He was recently elected Fellow of the Knowledge Systems Institute of Chicago, Illinois, USA. He obtained an Engineer Diploma in 1970, a Doctor-Engineer degree in 1973 and a Habilitation in 1980.

Creation of a Data Observatory enables the uncovering of consumer behavior by client behavioral study, through the use of 1st generation loyalty and stimulation platforms

Sébastien Bruyère, R&D manager, Custom Solutions, Doctor of Information Sciences & Communication; Vincent Oechsel, Product Director, Custom Solutions, 135 avenue victoire, 13790 Rousset, France.

Abstract—This article presents the work that allowed the creation of a consumer data observatory to reveal and develop a consumer universe for brands wishing to refine their marketing strategies. It explains the development of techniques and practices for processing data collected in the context of consumer participation in promotions powered by web platforms marketed by top brands. It is particularly based on deficiencies observed through the participant observation methodology applied in the context of training sessions for brand managers who bought an old-generation consumer loyalty and stimulation internet platform. Based on the results, a new platform - more in line with the needs - and moving towards a consumer data observatory is considered.

Index Terms—Consumer data observatory, consumer universe, engineering marketing information systems

I. INTRODUCTION

The web platforms commercialized by major brands have the aim of centralizing their promotional offers, that is to say recruiting, stimulating and retaining clients. Specifically, we are talking about the "Obama" product, offered and marketed by Custom Solutions, which allows a major brand to promote its promotional offers online and on which its clients can actively participate. The interest for the brand, besides extending its support, is to be able to reactivate these clients every time a new promotion is launched, thus boosting the sales of its new products.

On this basis, Custom Solutions wanted to further enhance their consumer behavioral study by adding an application that allowed us to analyze the data collected, and also requested more consumer information regarding buying habits as part of their applications, through the initiation of the Promo Place project.

In terms of presentation, this led to the availability of a "Back Office Analytics" pack (BOA), allowing brands to consult the data statistics of their promotional applications.

The BOA is a major evolution which will complement the Promo Place "Front Office Shopper" (FOS), which offers the latest innovative technologies in terms of user experience. The FOS is an evolution of the "Obama" project, which deals with the prioritizing and evolvement of consumers in online promotions. The basic data statistics are presented in a Back Office summary. Even though the brand marketing managers appreciate our loyalty and stimulation frameworks, they are now looking to know more, and by observing their methods with our existing tools, we have noticed that they are using the existing data to hypothesize over potential future scenarios by means of their expert judgment, based on the current information they have extrapolated.

For example, they use consumer data stating that "X% of consumers took part in a high pressure cleaner cashback offer six months ago" to estimate the potential consumer participation rate for a cleaning liquid compatible with the high pressure cleaner, or even the potential percentage of customers who might take part in a similar cashback offer, based on a panel of consumers who took part in a similar offer three years ago, imagining the wear and tear accumulated on their previous high pressure cleaner and/or the customers' desire to change and benefit from the latest product technology.

Based on this observation, Custom Solutions wanted to enhance their BOA, making it capable of dealing with present customer data and facilitating the extrapolations through a more ergonomic, adapted interface.

We should note that the majority of people interested by the BOA are Brand marketing managers who are recognized principally for their ability to “imagine” the traits and habits of their customers, to better ensure the success of future promotional campaigns. [8]


The idea being, following a training program on the functionalities of the Obama platform, we wanted to observe the behavior of the Brand marketing managers, in order to more accurately identify the information to provide them with. The goal being to aid them with the constitution of their study projections, using present data, for the initiation of new offers or to greater understand the behavior of their consumers, and to assist with, or redefine the overall marketing of the brand.

II. METHODS AND MATERIALS

In terms of research methodology, we decided to observe the behavior of Marketing managers within the brands using the Obama loyalty and stimulation program, which provided some rudimentary statistics. In the context of training and supporting Brand marketing managers, we noted that their behavior can help us to better understand the real needs and practices of these "experts", helping them to instigate their programs or better understand their consumers.

The subject matter has been defined on the basis that Brand marketing managers are no longer looking just to evaluate the performance of marketing activities; they also need a benchmark, a daily compass, a helping hand to stay on the trajectory of the marketing plan. "They will no longer be satisfied with managing their activities; on the contrary, they will lean towards reflection and anticipation in the medium to long term" [5].

Indeed, the use of interfaces for summary data extrapolations provided by our Obama platforms to determine new target remarketing projects (Couturier, 2014) has in reality proven to be very interesting. Hence, for the sake of formalization, and in order to evolve the current Obama tool into Promo Place, and in particular the BOA, as close as possible to expectations, we adopted a participant observation approach. This approach allowed us to study the use of the deficient interfaces and their shortcomings whilst designing the BOA.

To do this, they will need to have at their disposal not only the relevant data on the direct performance of one, but of several past programs, as well as consumer behavioral data from studies and benchmarks. The question today is no longer "what segmentation to use" [10] but rather "how best to cross-reference the socio-demographic, behavioral and value data."

A. Initiation to the method

The participant observation was carried out by means of training sessions to better understand the tool. To gain in efficiency, it was structured in a way that respected the typical approaches used in this field, such as choice of object, field, observation type, timescale, gathering of prior information, and constitution of an observation chart [11].

Furthermore, as shown by the emergence of new business functions such as Data Scientist or Data Steward [1], "a need for predictively and statistically competent business intelligence software is emerging and is proving to be increasingly essential for efficient use of Big Data infrastructure", an infrastructure that will consolidate the marketing data collected through the Promo Place application. So, besides the need, it is good business practice in these technological areas that further supports the idea of developing the predictive decision aspect.

The use of participant observation is an increasingly popular practice among researchers in the field of Education Science. It allows the trainers to "widen the possibilities of understanding and explanation" [12]. Indeed, the respondents (who are in fact purchasing a solution) will ask questions of the enquirers (in this case, the trainer), who, even though in the knowledge transfer phase, will be required to answer according to the respondents' expectations and hence understand the real needs expressed during the training.

C. Choice of Field

It is worth noting that, in our model, we chose the field based on the object as it appears in the overall strategy, to benefit from the knowledge transfer of the training session [6]. So the field is materialized through the training room chosen to work in. This is conducive to participant observation because it has the necessary equipment needed to use the online loyalty and stimulation platform.

These needs are often difficult for the respondents to explain without using this scenario, even if it is just a case study for the training session. Likewise, as Dupont explains, "Once a need has been satisfied, to a great extent, the individual looks to satisfy his next need" [4], and thus progressively develops innovative ideas for the solution in question.

The exercises provide training support, allowing the observer to study the behavior and attitudes of the trainees with regard to the interface, so as to be of best use to the Marketing department.

B. Choice of Subject Matter

The choice of subject matter should naturally be differentiated from the topic of the study. In this way, we have chosen to study the behavior of Marketing managers in the context of a case study conducted on the basis of statistics generated by our first generation loyalty & stimulation platforms (a solution prior to Promo Place, christened "Obama") during the training sessions.

As outlined by Sensi, and as the optimum observation method is a participant one, "It is useful for an evaluator to do some of the participants' exercises" to promote a truly participative approach with the trainees. "The difficulty of this task from the observer's point of view resides in the relationship with the observees, as much due to their presence and participation as in the nature of publishing the results of the tasks" [14].


With regard to boundaries, it seems clear that by limiting the observation to training rooms and sessions, as so often happens in training, the trainee may end up consulting the trainer afterwards on a particular issue, either because he is no longer in a position to reproduce a scenario outside of training, or because, with hindsight, he was able to consider a use that he was unable to think of during training.

study is being conducted in order to evaluate the new version of the platform. The matching of the findings to real life is often a fixed limit for open observation but, given the audience, we can assume that the observer, through his stance as trainer, will maintain the same level vis-à-vis this limit. Similarly, the intimate understanding of social roles is not too prominent in this study, given its technical aspect. Access to information issue by issue, the possibility for note taking, and access to multiple observation situations are, according to them, the positive aspects of an open observation stance. The cross-referencing with the training profession is still appropriate vis-à-vis these beneficial tasks in the case of open participant observation. The technique used in training is interrogation, to ensure that the audience has received and understood the subject covered. Note taking is of equal importance, as it allows the trainer to note areas to which he needs to come back, or for which he must later find answers. The different situations are also observed by a trainer.

In this latter case, the observation of this action becomes all the more interesting as it can complement, and indeed lead to, an idea that is totally in sync with the object. The choice of field and participants (trainers and trainees) is in fact normally governed by accessibility, as much on the relational aspect as on the material and logistical. Effectively, the trainer and trainees have the entire adapted infrastructure at their disposal to maximize the acquisition of the trainer's knowledge. The observation field is hence ideal for the very purpose of the session. The fact that the trainees are sometimes in groups can be beneficial and help draw out ideas and opinions about the extrapolation of data from the loyalty and stimulation platforms.

There will only be one group of trainees, whom we will observe openly for our study; there will be no other group over whom we could appear as an undercover observer.

On the other hand, this situation can limit certain functions if the person being observed is of a shy or introverted nature, justifying in these cases the possibility of continuing individual participant observation post training, through an analysis of further criteria (by mail or telephone, after the training session, by the trainer).

Dress code will be that of a sales representative operating in the heart of the company.

E. Timescale of the study

The number of observation sessions was chosen on the basis of the training programs scheduled by the Marketing manager on the Obama loyalty and stimulation platform.

D. Choice of Observation method

With regard to the level of participation, and given the relationship between trainer and trainee, it would be interesting to keep the existing situation and inform the trainees of it. The trainer, being already in a position as transmitter and evaluator of the knowledge imparted, already takes an observational stance.

In terms of volume, they consist of five training sessions in the presence of the Marketing manager and Marketing assistant of major brands. On average, the training sessions last approximately two hours, with, inevitably, 30 minutes needed to set up the material and initiate the session, leaving one hour and 30 minutes for observation.

The choice of whether to participate does not arise, since the trainer is required to do exercises to illustrate the use of the platform. His acceptance by the respondents and their receptiveness are also the trainer's responsibility and will therefore likely be achieved naturally once the trainees recognize the trainer's expertise.

F. Collection of Initial Information

Before the training session, quotes will be obtained in relation to the loyalty & stimulation platforms marketed. The idea is to identify, in the description of services, the data used in the sales method, both to adapt the training and to start the behavioral study via the study of the details of the services agreed between the sales representative and the client.

A risk of "going native" [7] is often noted in the posture of the observer. However, by being conscious of this risk, it is possible for the observer to adopt a more distanced stance, enabling him to take a step back from the action in progress and to maintain a better sense of perspective and analysis.

The collection of records relating to other operations carried out by the Marketing manager of the brand in question proves interesting for identifying the overall marketing thinking of the person to be trained and observed.

The methodology will be applied live and not after the event, as a means of avoiding the aforementioned pitfalls. Access to the field being favorable to an open observation posture, an incognito observation is not applicable given the fact that open observation is a key part of the Trainer/Observer's missions. However, a clarification will be offered to the trainees on the fact that, besides being trained on the platform, a sociological


steer it. This fact was noted by four Brand marketing managers out of five.

III. RESULTS AND DISCUSSION POINTS

A. Organization

The observation sessions were conducted during the training sessions of five Brand managers, having bought access to the internally-named Obama loyalty and stimulation solution. As planned, the objects used during the sessions are those typically used in training, that is to say office furniture, a video projector, Internet access via computers connected to the company's network, and Custom Solutions notebooks and pens.

Mail management: Demands captured 7%; Mail received 24%; Capture rate 29.2%. Conformity: Compliant 55%; Irretrievably non-compliant 30%; Non-compliant 15%.

The training sessions lasted approximately three hours, divided into two hours of theory training and one hour of practical work. Training support is given to each of the participants in the form of a slide-show export with space to write notes. The Observer/Trainer is a solution expert belonging to Custom Solutions who takes the role of trainer, while keeping a detached stance to better analyze the behavior of the participants, as is his function as observer. During the course of the sessions, the Observer/Trainer consolidates the various outstanding phrases and opinions, serving as the base on which to build the post-session evaluation grid.

B. Results

We present the findings in two sub-parts.

Figure 2. Proposed "Obama" interface statistics

The first sub-part chronicles the findings related to interface usage within the proposed exercises. This collection of information is carried out not only by observing the attitudes, achievements and shortcomings of the interface use, but also by retaining the questions asked by the brand managers to the Trainer/Observer.

As the screen capture above shows, the data is presented on the basis of cumulative statistics, its presentation is simplistic, and no filter or data manipulation is possible, rendering the exploitation of these statistics very limited.
• Performance: the interface struggles to display the data on pages with numerous indicators. These delays have been reported on the basis of returns made to the support service following the training session.

The second sub-part concerns behavioral observations of brand managers as they reflect on potential remarketing strategies [3].

Usage-related observations

All the Brand marketing managers have underlined the limited nature of the available interface statistics. The number of participations is certainly indicated, but it is difficult to make projections; you cannot cross-reference the data, apply filters, etc.

Figure 1. Waiting time display on an « Obama » interface Production statistics page

It would also appear that there is no place for interactivity, notably due to delays in display performance. Below is a detailed list of findings from the usage-based observation:
• Simplistic data/stats: the statistics given by the interfaces offer visibility of the participations completed but do not allow us to look beyond this. In the same way, the data is presented cumulatively in a non-modular, simple display. Furthermore, some statistics are made on the same basis; it is only the meta-information they produce that varies. These statistics are seen as being too simplistic, or "the bare minimum", to report on the activity of an operation. They only allow us to report on the operational state of an activity once it has ended, without allowing us the chance to

• The brand managers have underlined that interactivity and data display speed are crucial for decision making concerning budget control and the orientation of current marketing activities. As shown on the screenshot above, the "Obama" interface takes more than one second to display a statistics report, rendering interactivity between the platform and the decision maker difficult.


C. Remarketing behavior observations
• Data management: the data on offer is static; it accumulates over time and it is impossible to cross-reference, apply filters or modify parameters. The brand managers cannot highlight notable results, redirect existing campaigns or rethink new operations. The fact that it is impossible to cross-reference the number of applications with the number of valid files, for example, prevents brand managers from being able to correctly manage their campaigns. This fact was noted by three marketing managers out of five.
• A non-modular solution: the platform offers a rigid statistics back-end; if additional requirements are detected during the training sessions, it is imperative to review the specific developments, which can sometimes lead to software engine adjustments. This in no way enables brand managers to stretch the system statistics to give them more visibility within a reasonable time frame, especially since the offer being proposed is a superimposed platform layer that requires the purchase of lower levels, not necessarily essential for brand managers, in order to gain access to the additional functionality residing in the one desired level. Although this is certainly connected to the way that the application was designed, it is also an envisaged marketing packaging that lacks flexibility. This point was noted by four Brand marketing managers out of five.

Figure 3. Promo Place home page

In terms of performance, Obama uses a traditional relational database management system, known for supporting read/write access with a small amount of exchangeable information. While this is adequate in some cases, it is not appropriate for projects requiring analysis on a global scale, with a large quantity of information exchanged and substantial databases. Indeed, it is in these cases that the OLAP model, based on a robust data warehouse system dedicated to the analysis of up to many terabytes of data [13], will be used to consolidate data from the Promo Place application.
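As an illustration of the kind of multidimensional cross-referencing such a data warehouse is meant to support, the following minimal sketch (in Python, using the pandas library) pivots toy participation records by region and operation; the column names (consumer_id, region, operation, amount) are hypothetical and are not taken from the actual Promo Place or DMM schema.

import pandas as pd

# Toy participation records; the schema is illustrative only.
participations = pd.DataFrame([
    {"consumer_id": 1, "region": "Paris", "operation": "cashback_cleaner", "amount": 30.0},
    {"consumer_id": 2, "region": "Lyon",  "operation": "cashback_cleaner", "amount": 25.0},
    {"consumer_id": 1, "region": "Paris", "operation": "loyalty_liquid",   "amount": 5.0},
])

# OLAP-style cube view: participations cross-referenced by region and operation.
cube = pd.pivot_table(
    participations,
    index="region",
    columns="operation",
    values="amount",
    aggfunc=["count", "sum"],
    fill_value=0,
)
print(cube)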

D. Discussion

Considering the results, it would appear that the Obama platform offers statistics oriented towards "the measuring of operational performance", whereas the Brand managers expect a real "decision engineering" tool (Nadeau & Landry, 1986). Indeed, the statistics on offer deal only with "participations", without any detail of "registrations" or of the "activity" itself. It should also be possible to gather all of the statistics together on a central dashboard, which would allow us to highlight certain facts to aid Brand managers with their decision making.

The heart of the database will be christened the Data Mart Marketing (DMM). It will be based on a data warehouse using the concepts of Big Data and Business Intelligence to enable effective management and cross-referencing of data within the framework of the constitution of the "Consumer Universe", through the cross-referencing of data via advanced segments. This structuring and modelling will give a competitive advantage to the platform thanks to a "Consumer Centric" vision, via a convergence of consumer data, a re-modelling of analysis segments, and multidimensional data analysis cross-referencing. Performance will also be enhanced.

As Mallet indicates in his research on the efficiency of dashboards for decision making, an efficient dashboard must include "three types of indicators: warning indicators, which indicate an abnormal system state needing a short repair; balance indicators, which allow us to take a look at the state of the system to ensure that we are progressing on the right track; and finally, early warning indicators, that allow us to have a broader vision enabling us to make changes to both strategy and goal" [9]. Hence, an « Operations » segmentation will enable us to follow and monitor, as a means of indicating the state of health of the operation. The « Participations » will bring visibility at a more macroscopic level, based on pre-operation captured data. The « Registered » will bring more visibility on the consumers themselves, enabling us to pre-empt potential product or campaign re-launches.

Indeed, where competing platforms offer a "Promotion Centric" vision, that is to say data collection operations stored in silos with scarce repeat-purchase information and very few multi-operation indicators, we propose to make use of all the data collected to provide insight into consumer marketing behavior. Regarding the interface itself, Obama relied on custom-made developments, which by nature are strongly linked to good practice and to the ability and skills of the developer, who is rarely an expert in all areas of the development of the application. Frameworks are "adaptable work tools; they are a collection of libraries, tools and conventions that enable the rapid development of applications", and they allow the development of "successful and easy to maintain applications" [2]. They will therefore enable an evolutionary development in correspondence with the expectations of the


available?" and "how can I use it?", contrary to Obama, where we were searching from the very beginning of the operation to identify "what are the uses that we want from the back office?" and "how can we build them?". Hence, where the Obama project is fixed, or at least subject to "typical" statistical interface constraints, Promo Place will benefit from its own mechanisms (settings, segments, etc.) which will not be tied to the data exploitation that will be done.

Brand marketing managers, notably on the marketing aspects of the offer. Given that Frameworks are built on good practice, they can display good overall performance, which could benefit the Promo Place project. It will therefore be necessary to develop the Front Office Shopper (FOS), which will be proposed as a SaaS-mode web interface based on the Zend Framework, presenting the "participations" KPIs, the management of offers, but above all the real-time management of the consumer profile constructed from the information collected. Besides the data presentation, the advanced filter application will enable targeting, firstly to isolate a target population. It will then be possible to manipulate the desired display by choosing the indicators required, and it will also be possible to save this filter for future use.

All of these technical evolutions enable us to consider a much more flexible packaging offer than the Obama model based on layers.

Figure 5. Promo Place packaging offer approach

Hence, the standard pack will consist of the Front Office Shopper (FOS), Back Office Analytics (BOA) and Data Mart Marketing (DMM). The options will be put together on the basis of additive features incrementing the three extension modules necessary for efficient use. For example, it will be possible to add an option like "Advocacy Marketing" in order to use consumers' recommendations for participation; this will automatically prepare the Data Mart Marketing (DMM) to accept new cross-referenced data and also new Back Office Analytics indicators to monitor, manage and anticipate.

Figure 4. Promo Place statistical interface

Some possible questions, answerable using the advanced segments available (a sketch of such a segment filter follows the list):
• How many of our Parisian CSP+ customers over thirty years old have equipment more than three years old?
• What is the effectiveness of the point-of-sale animation signs in this region?
• How much did my operations recover last month, all operations taken into account?
• What are the raises of my operation, in real time, on which products and which brands?
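As a minimal sketch of how such an advanced segment could be expressed over DMM-style consumer data (in Python with pandas), the first question above becomes a simple filter; the column names (region, csp, age, equipment_age_years) are assumptions, not the real Promo Place data model.

import pandas as pd

# Toy consumer records; the schema is illustrative only.
consumers = pd.DataFrame([
    {"consumer_id": 1, "region": "Paris", "csp": "CSP+", "age": 34, "equipment_age_years": 4},
    {"consumer_id": 2, "region": "Paris", "csp": "CSP-", "age": 41, "equipment_age_years": 5},
    {"consumer_id": 3, "region": "Lille", "csp": "CSP+", "age": 29, "equipment_age_years": 2},
])

# "How many of our Parisian CSP+ customers over thirty have equipment more than three years old?"
segment = consumers[
    (consumers["region"] == "Paris")
    & (consumers["csp"] == "CSP+")
    & (consumers["age"] > 30)
    & (consumers["equipment_age_years"] > 3)
]
print(len(segment))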

IV. CONCLUSION

The use of the participant observation methodology has enabled us to identify numerous needs not covered in the scope of the current situation. Hence we can consider real changes and the creation of a Back Office Analytics system dedicated to enabling Brand marketing managers to envisage future scenarios, with the aim of providing an effective marketing strategy. In conclusion, it will enable them to have a real observatory of their consumer data, to manage their current campaigns and conceive new ones, but above all to know their targets better,

The Obama projects have been put in place with one (or more) goals described and studied in advance, whereas Promo Place will be designed on the basis of an adaptive engine. The primary goal will be data capture and compilation. The exploitation will take place through an iterative learning process, answering the questions "What is the data



to weigh up remarketing campaigns or the development of new offers, or even new products. To make this happen, the observatory provides indicators on enrolment, participations, and current and completed multiple transactions, but above all the possibility to create your own new indicators, apply complex advanced filters, segment data, and cross-reference them to create a Consumer Universe.

V. ACKNOWLEDGEMENTS

Custom Solutions would like to thank the Brand Marketing managers and all employees who have worked on the "Obama" and "Promo Place" projects.

In addition, the article shows that the participant observation methodology, applied in the context of training sessions on an older-generation application, can allow us to consider innovative developments in line with our customers' expectations.

VI. AUTHORS

Sébastien Bruyère holds a PhD in Information and Communication Sciences. He manages the R&D activities of Custom Solutions. Many products, such as Cartavenue or fully online cashback operations (dematerialization), stem from the research and development activities of Custom Solutions, and R&D activities are also used for internal projects in order to optimize the efficiency of the organization. Sébastien also teaches courses in engineering schools, business schools and universities. He previously worked in an interactive agency, where he developed an innovative Business Intelligence platform. He has also worked in an incubator for innovative companies and has supported many R&D projects. Sébastien Bruyère's R&D blog is available via the following link: http://www.imotic.fr

REFERENCES
[1] P. Besse, A. Garivier et J.-M. Loubes, « Big Data Analytics - Retour vers le Futur 3 ; De Statisticien à Data Scientist », ArXiv e-prints, vol. 1403, p. 3758, mars 2014.
[2] O. Capuozzo, « Zend Framework », 2011.
[3] G. Couturier, Guide pratique des Marketing. Société des Ecrivains, 2014.
[4] L. Dupont, Le plan marketing du tourisme par la pratique. L'Harmattan, 2005.
[5] C. Garcia-Zunino, « Le tableau de bord : la boussole des marketeurs et commerciaux », JDN L'économie de demain, 2014. [En ligne]. Disponible sur : http://www.journaldunet.com/ebusiness/expert/58312/le-tableau-debord---la-boussole-des-marketeurs-et-commerciaux.shtml. [Consulté le : 12-sept-2014].
[6] R. Hess et G. Weigand, L'observation participante dans les situations interculturelles.
[7] G. Lapassade, « L'observation participante », La méthode ethnographique, 2013. [En ligne]. Disponible sur : http://vadeker.net/corpus/lapassade/ethngr1.htm. [Consulté le : 18-sept-2014].
[8] C. Lottret, « Le métier de Directeur marketing », Graphiline, 2014. [En ligne]. Disponible sur : http://www.graphiline.com/article/17807/Lemetier-de-Directeur-marketing.
[9] C. Mallet, « Innovation et mesure de l'appropriation des outils de gestion : proposition d'une démarche de construction d'un tableau de bord », présenté à En route vers Lisbonne, Lisbonne, 2006.
[10] N. Oyarbide, « Réussir sa segmentation Marketing », Réussir son CRM, 2013.
[11] A. Revillard, « Définir son statut d'observateur », 2003. [En ligne]. Disponible sur : http://annerevillard.com/enseignement/ressourcespedagogiques/initiation-investigation-empirique/fiches-techniquesinitiation-investigation-empirique/fiche-technique-n%C2%B01-definirson-statut-dobservateur/.
[12] J.-C. Sallaberry, Théorisation des pratiques : posture épistémologique et méthode, statut des modèles et des modélisations. Editions L'Harmattan, 2005.
[13] M. Santel, « Entrepot de Donnees - SGBD et Datawarehouse », 2006. [En ligne]. Disponible sur : http://www-igm.univmlv.fr/~dr/XPOSE2005/entrepot/sgbd.html. [Consulté le : 14-oct-2014].
[14] D. Sensi, L'évaluation dans les formations en entreprise. Éd. L'Harmattan, 1992.
R. Nadeau et M. Landry, L'Aide à la décision : nature, instruments et perspectives d'avenir. Presses de l'Université Laval, 1986.

Vincent Oechsel is Head of Product and a Board Member at Custom Solutions. His main objective is to seek out and transform projects and localized actions into innovative and valuable products, both client- and service-oriented. He previously worked as a CIO for 8 years and has designed many successful innovative projects for the Custom Solutions group. In 2011, he created the R&D Division in order to support the group's digital strategy.


Neural Networks for Proper Name Retrieval in the Framework of Automatic Speech Recognition Dominique Fohr, Irina Illina

by the speech recognition system, because it is missing from the vocabulary. The proper names "De Villepin" and "Gergorin" occur together in a diachronic document of the same time period. So, we can make the hypothesis that "Gergorin" is related to the lexical and semantic context of the PN "De Villepin" and could be present in the test document. Therefore, we should add it to the ASR vocabulary.

Abstract— The problem of out-of-vocabulary words, more precisely proper name retrieval, in speech recognition is investigated. The speech recognition vocabulary is extended using diachronic documents. This article explores a new method based on a neural network (NN), proposed recently by Mikolov. The NN uses a high-quality continuous representation of words learned from large amounts of unstructured text data and predicts the surrounding words of one input word. Different strategies for using the NN to take into account lexical context are proposed. Experimental results on broadcast speech recognition and a comparison with previously proposed methods show the ability of the NN representation to model the semantic and lexical context of proper names.

Manual transcription of a broadcast audio document: Dominique De Villepin en personne aurait demandé au corbeau de l’histoire Jean Louis Gergorin de je cite balancer Nicolas Sarkozy.

Output of the speech recognition system:

Index Terms— neural networks, out-of-vocabulary words, proper names, speech recognition, vocabulary extension

dominique de villepin en personne aurait demandé aux corbeaux de l’histoire jean louis gérard morin deux je cite balance est nicolas sarkozy

I. INTRODUCTION

Diachronic document :

Large-vocabulary Automatic Speech Recognition (ASR) systems are faced with the problem of out-of-vocabulary (OOV) words (words that are not in the ASR system vocabulary) when used for very large vocabulary recognition or in new domains. Among these OOVs, Proper Names (PNs) are very largely represented. These PNs evolve over time and no vocabulary will ever contain all existing PNs [6]. These missing proper names can be very important for the understanding of the test document and can be crucial for other tasks using speech recognition, like document indexing. For instance, for broadcast document indexing, proper names often contain key information for the broadcast. In the context of broadcast news recognition, we propose to use a diachronic corpus to find missing OOVs. This corpus contains documents that are contemporaneous with each test document of the test corpus. We assume that these documents will contain missing proper names because they talk about the same events. Fig. 1 presents an example of a test sentence containing an OOV proper name (Gergorin) and two proper names correctly recognized (De Villepin and Sarkozy). In this example, the proper name Jean Louis Gergorin from the manual transcription is not recognized

Le lendemain, au bureau de M. De Villepin, ils ont emporté huit Post-it, des cartes de vœux adressées à M. De Villepin par Jean-Louis Gergorin.

Fig. 1. Example of one test sentence containing an OOV proper name (Gergorin) and two proper names correctly recognized, and a diachronic document1.

Nowadays, Artificial Neural Networks (ANN) are widely used for natural language processing [3][5][16][18]. ANN models can be trained to learn a continuous vector space representation of the word distribution in the training corpus, and such continuity permits smoother generalization to unseen contexts. The representations of words in a continuous space are learned automatically (no manual tagging or labelling of the text corpus is required). In the framework of speech recognition systems, different architectures have been successfully used for language modeling: feed-forward NNLMs [1] and recurrent RNNLMs [2]. A special type of deep model equipped with parallel and scalable learning has been proposed in [4] for an information retrieval task. Our idea is to take into account the lexical and semantic context of words to retrieve missing PNs from diachronic documents, using the capability of neural networks to project

Dominique Fohr and Irina Illina are with the Loria-Inria Laboratory, Nancy, France. This work is funded by the ContNomina project supported by the French National Research Agency (ANR) under the contract ANR-12-BS02-0009.



1 Usually the recognition output does not contain punctuation or uppercase.

words in a continuous space. The rest of the paper is organized as follows: Section II introduces the proposed approach. Sections III and IV describe the experimental setup and the results of the evaluation of the proposed methods in terms of recall, word error rate and PN error rate. The conclusions are given in the last section.

information-based representations have been proposed. The occurrence-based mutual information method assumes that the greater the likelihood of statistical dependence of two proper names in the diachronic corpus, the greater the likelihood of their occurrence in the test document. The vector-based cosine-similarity method measures the similarity between two bag-of-word vectors as the angle between them. In the present work, we propose to use the high-quality continuous representation of words learned by neural networks from large amounts of unstructured text data, as proposed by Mikolov et al. [12][13][14]. The continuous Skip-gram model tries to predict the surrounding words of one input word. This is performed by maximizing the classification rate of the nearby words given the input word. Given a sequence of training words w1, w2, …, wT, the Skip-gram representation maximizes the average log probability:

II. METHODOLOGY

We have a test audio document (to be transcribed) which contains OOV words, and we have a diachronic text corpus used to retrieve OOV proper names. The diachronic text documents are contemporaneous with each test document of the test corpus. The diachronic documents allow us to build an augmented vocabulary and to take into account the temporal context. The extended vocabulary is built dynamically for each test document to avoid an excessive increase of the vocabulary size. We assume that, for a certain date, a proper name from the test corpus will co-occur with other PNs in diachronic documents corresponding to the same time period. These co-occurring PNs might contain the targeted OOV words. The idea is to exploit the relationship between PNs for a better lexical enrichment. In other words, we rely on the temporal and lexical contexts of words.

$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c\le j\le c,\; j\ne 0}\log p(w_{t+j}\mid w_t)$   (1)

where c is the context size. Compared to a classical NN, the non-linear hidden layer is removed and the projection layer is shared for all words. Fig. 2 shows this model schematically.

A. General Approach

Our general methodology contains 3 steps:
1) In-vocabulary (IV) PN extraction from each test document: for each test document, we extract the IV PNs. The goal is to use these PNs as anchors to collect linked new PNs from the diachronic corpus.
2) Temporal and lexical context extraction from diachronic documents: we build temporal contexts for each extracted IV PN. Only diachronic documents that correspond to the same time period as the test document are kept. After POS-tagging of these diachronic documents, the meaningful words are kept: verbs, adjectives, nouns and PNs. This space can be modeled by a discrete or continuous vector, which takes into account semantic relationships, depicting the lexical context. To reduce the vocabulary growth, a similarity metric, depending on the space representation, is calculated between the IV PNs found in the test document and each new2 PN occurring in the diachronic set. To better take into account the lexical context, a local-window context for each IV PN is used.
3) Vocabulary augmentation: the new PNs (that are not in our vocabulary) with the "best" metrics are added to our vocabulary. PN pronunciations are generated using a phonetic dictionary or an automatic phonetic transcription tool.
Using this methodology, we expect to extract a reduced list of all potentially missing PNs.

Fig. 2. Skip-gram model structure

An important property of this model is that the word representations learned by the Skip-gram model exhibit a linear structure: word vectors can be combined using vector addition. We propose to use this NN word representation for our task of proper name retrieval in step 2) of the general approach. Mikolov's NN will be trained on all documents of a large text corpus. Using this learned NN and following step 2), for each IV PN found in the test document and for each new PN occurring in the diachronic set (called a selected PN), the NN-projection is calculated. Each word projection is represented by a high-dimensionality vector. The cosine similarity between the NN-projection of each selected PN and each IV PN from the test document is evaluated. Following step 3) of the general approach, the selected PNs with the "best" cosine-

B. NN Word Representation Space

The PN space representation can be performed in the discrete or in the continuous space. In [8], cosine-based and mutual

2 New PN means PN that is not present in the vocabulary of the ASR.


similarity are added to our vocabulary. We hope that two semantically similar words will be projected in the same region of the representation space and so they will be close to each other in this space.
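The following minimal sketch (Python/NumPy) illustrates steps 2) and 3) of the general approach: candidate new PNs from the diachronic set are ranked by the cosine similarity between their NN-projections and those of the IV PNs found in the test document. The toy vectors stand in for the learned Skip-gram projections, and aggregating over IV PNs with a maximum is an illustrative assumption, not the paper's exact decision function.

import numpy as np

def cosine(u, v):
    # Cosine similarity between two projection vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy NN-projections (3-dimensional here; 400-dimensional in the experiments).
word_vectors = {
    "De_Villepin": np.array([0.9, 0.1, 0.0]),
    "Sarkozy":     np.array([0.8, 0.2, 0.1]),
    "Gergorin":    np.array([0.7, 0.3, 0.0]),
    "Mozart":      np.array([0.0, 0.1, 0.9]),
}
iv_pns = ["De_Villepin", "Sarkozy"]       # IV PNs recognized in the test document
candidate_pns = ["Gergorin", "Mozart"]    # new PNs found in the diachronic documents

# Score each candidate by its best similarity to any IV PN; keep the best-scored ones.
scores = {pn: max(cosine(word_vectors[pn], word_vectors[iv]) for iv in iv_pns)
          for pn in candidate_pns}
for pn, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pn, round(score, 3))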

lexical and semantic context of the IV PN. This therefore helps to choose new PNs that are relevant to the test documents and to avoid an excessive increase of the vocabulary size.

C. Different Strategies to Take into Account Lexical Context

To better select new PNs in the diachronic documents and to reduce the vocabulary growth, it is important to better take into account the lexical and semantic word context. In some manner, Mikolov's Skip-gram model takes into account the context of words (parameter c in formula (1)). But this context is considered only in the training step of the NN. During the new PN retrieval step, the vector representation for a new PN is obtained using the NN-projection. However, the context of this new PN can be different compared to the training step. To improve the context modeling, we propose different strategies to take it into account.

Fig. 4. Local-window context NN-projection for each occurrence of an IV PN in the test document and for each occurrence of a selected new PN from the diachronic documents.
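A minimal sketch of the local-window context NN-projection used by these strategies (cf. Fig. 3 and Fig. 4): the representation of a PN occurrence is the vector sum of the NN-projections of the words in a window around it, exploiting the additive property of Skip-gram vectors. The window size, the toy vectors, and the exclusion of the PN's own vector are assumptions for illustration.

import numpy as np

def context_projection(tokens, position, word_vectors, window=2):
    # Sum the NN-projections of the words surrounding tokens[position];
    # the PN's own vector is excluded here (an illustrative choice).
    lo, hi = max(0, position - window), min(len(tokens), position + window + 1)
    vectors = [word_vectors[t] for i, t in enumerate(tokens[lo:hi], start=lo)
               if i != position and t in word_vectors]
    return np.sum(vectors, axis=0) if vectors else word_vectors.get(tokens[position])

# Toy usage: context projection of the IV PN "de_villepin" in a recognized sentence.
vecs = {"dominique": np.array([0.1, 0.2]),
        "de_villepin": np.array([0.9, 0.1]),
        "demande": np.array([0.2, 0.3])}
print(context_projection(["dominique", "de_villepin", "demande"], 1, vecs, window=1))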

III. EXPERIMENTS

We call selected PNs the new proper names that we were able to retrieve from the diachronic documents using our methods. We call retrieved OOV PNs the OOV PNs that we were able to retrieve from the diachronic documents using our method and that are present in the test documents. Using the diachronic documents, we build a specific augmented lexicon for each test document according to the chosen period. Results are presented in terms of Recall (%): the number of retrieved OOV PNs versus the number of OOV PNs. For the recognition experiments, the Word Error Rate (WER) and PN Error Rate (PNER) are given. PNER is calculated like WER but taking into account only proper names.
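For reference, these metrics can be sketched as follows (Python): recall follows the definition above, and WER is computed as the word-level Levenshtein distance against a reference transcription. PNER applies the same computation restricted to proper names; the exact alignment and scoring tools used in the paper are not reproduced here.

def recall(retrieved_oov_pns, oov_pns):
    # Percentage of OOV PNs that were retrieved from the diachronic documents.
    return 100.0 * len(set(retrieved_oov_pns) & set(oov_pns)) / len(set(oov_pns))

def word_error_rate(reference, hypothesis):
    # Word-level Levenshtein distance (substitutions + insertions + deletions).
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("jean louis gergorin", "jean louis gerard morin"))  # about 66.7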

Fig. 3. Local-window context NN-projection for each occurrence of an IV PN in the test document.

As said previously, the word representations learned by the Skip-gram model have a linear structure: word vectors can be combined using vector addition. We propose to use this property during step 2) of the general approach: instead of representing one word by its NN-projection, the local-window context NN-projection of this word is used (the vector addition of the NN-projections of the context words). Two strategies are proposed:
- The local-window context NN-projection is used for each occurrence of an IV PN in the test document. After this, a cosine similarity between the local-window context NN-projection for each occurrence of an IV PN in the test document and the NN-projection for each selected new PN is calculated (cf. Fig. 3).
- The local-window context NN-projection is used for each occurrence of an IV PN of the test document and for each occurrence of a selected new PN from the diachronic documents. After this, a cosine similarity between the local-window context NN-projection for each occurrence of an IV PN in the test document and the local-window context NN-projection for each occurrence of a selected new PN in the diachronic documents is calculated (cf. Fig. 4). This modeling allows us to take into account the lexical and semantic context of retrieved new PNs that are close to the

A. Development and Test Corpora

As the development corpus, seven audio documents from the development part of ESTER2 (between 2007/07/07 and 2007/07/23) are used. For the test corpus, 13 audio documents from RFI (Radio France International) and France-Inter (test part of ESTER2) (between 2007/12/18 and 2008/01/28) [7] are used. Table I gives the average occurrences of all PNs (IV and OOV) in the development and test documents with respect to the 122k-word ASR vocabulary. To artificially increase the OOV rate, we have randomly removed 223 PNs occurring in the development and test sets from our 122k ASR vocabulary. Finally, the OOV PN rate is about 1.2%.

File | Word occ | IV PNs | IV PN occ | OOV PNs | OOV PN occ
Dev  | 4525.9   | 99.1   | 164.0     | 30.7    | 57.3
Test | 4024.7   | 89.6   | 179.7     | 26      | 46.6

Table I. Average proper name coverage for development and test corpora per file.


Table II shows that, using the diachronic documents of 1 year, on average we retrieve 118797.0 PNs per file. Among these PNs, we retrieve on average 24.0 OOV PNs per development file (compared to 30.7 in Table I). This represents a recall of 78.1%.

B. Diachronic Corpus

The GigaWord corpora are used as diachronic corpora: Agence France Presse (AFP) and Associated Press Worldstream (APW). French GigaWord is an archive of newswire text data and the timespans of the collections are as follows: for AFP, May 1994 - Dec 2008; for APW, Nov 1994 - Dec 2008. The choice of the GigaWord and ESTER2 corpora was driven by the fact that one is contemporary to the other, their temporal granularity is the day, and they have the same textual genre (journalistic) and domain (politics, sports, etc.).

B. NN-based Results

We used Mikolov's open-source NN tool, available on the web. The NN is trained on all the diachronic corpora (cf. Section III.B). For this network, the important parameters are: the model architecture, the number of neurons of the hidden layer and the context size. After several experiments, we defined the best parameter values, which are used here: 400 for the vector size (the number of neurons of the hidden layer), 30 for the context size, and the Skip-gram model architecture. A preliminary study of the extension from a word-based to a phrase-based representation (like "New York" and "Times" versus "New_York_Times", called phrases according to Mikolov) showed a limited improvement. As in our previous study [8], to evaluate the importance of the time period, we set up a temporal mismatch experiment: to select new PN candidates, we use diachronic documents (for one day, one week and one month) 10 months after the period of the development documents (cf. Table III, called "mism").
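The experiments used Mikolov's original open-source tool; purely as an illustration, an equivalent Skip-gram model with the parameter values reported above (vector size 400, context window 30) can be trained with the gensim library (version 4.x API assumed), and the resulting projections queried for cosine similarity.

from gensim.models import Word2Vec

# Toy corpus of tokenized diachronic documents.
sentences = [["dominique", "de_villepin", "aurait", "demande"],
             ["jean_louis", "gergorin", "cite", "de_villepin"]]

model = Word2Vec(
    sentences,
    vector_size=400,  # number of neurons of the hidden (projection) layer
    window=30,        # context size c of formula (1)
    sg=1,             # Skip-gram architecture
    min_count=1,
    epochs=5,
)
vector = model.wv["gergorin"]                          # NN-projection of a proper name
print(model.wv.similarity("gergorin", "de_villepin"))  # cosine similarity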

C. Transcription System

ANTS (Automatic News Transcription System) [10] is based on Context Dependent HMM phone models trained on 200 hours of broadcast news audio files. The recognition engine is Julius [11]. The baseline phonetic lexicon contains 260k pronunciations for the 122k words. Using the SRILM toolkit [17], the language model is estimated on text corpora of about 1800 million words. The language model is re-estimated for each augmented vocabulary using the whole text corpora. The best way to incorporate the new PNs in the language model is out of the scope of this paper.

IV. EXPERIMENTAL RESULTS

In a first step, we use the development corpus to set the parameters of the proposed method. In a second step, we evaluate the proposed approach on the test set.


A. Baseline Result

The baseline method consists in extracting a list of all new PNs occurring in the diachronic corpus, using some time period corresponding to the test document. This period can be, for example, a day, a week or a month. Then, our vocabulary is augmented with the list of extracted OOV PNs. The problem of this approach is that if the diachronic corpus is large, we can have a bad tradeoff between the lexical coverage and the increase of the lexicon size. Using TreeTagger [15], we extracted 160k PNs from 1 year of the diachronic corpus. Of these 160k PNs, 119k are not in our lexicon. Of these 119k, only 151 PNs are present in the development corpus (193 in the test corpus). This shows that it is necessary to filter this list of PNs to obtain a better tradeoff between the PN lexical coverage and the increase of the lexicon size.


Time period | Method  | Selected PNs | Retrieved OOV PNs | Recall (%)
1 day       | NN      | 400          | 9.9               | 32.1
1 day       | NN mism | 400          | 2.9               | 9.3
1 week      | NN      | 1500         | 11.4              | 37.2
1 week      | NN mism | 1500         | 7.1               | 23.3
1 month     | NN      | 2000         | 14.4              | 47.0
1 month     | NN mism | 2000         | 11.1              | 36.3

Table III. NN-based results according to the time period for the development corpus, with and without temporal mismatch. Values averaged over the 7 development files.

During the preliminary experiments, for each time period we evaluated the recall for different numbers of selected PNs. An acceptable compromise between recall and the number of selected PNs is 400 selected PNs for the day period, 1500 for the week and 2000 for the month. Using a higher number of selected PNs improves the recall only very slightly. These parameters have been used in Table III. Table III shows the results for the proposed NN-based method using different time periods for the development corpus. Using the one day period, the NN achieves a recall that is very close to the baseline result (32.6%). Using the one week period, the NN obtains the same recall as the baseline (37.2%) but selects only 1500 PNs instead of 2928 as in the baseline method. For the month period, using six times fewer selected PNs compared to the baseline (2000 versus 13131), a recall of 47% is obtained, which is not very far from the 57% of the baseline. This shows the good quality of the word representation produced by the NN.

Time period | Average of selected PNs per dev file | Average of retrieved OOV PNs per dev file | Recall (%)
1 day       | 532.9                                | 10.0                                      | 32.6
1 week      | 2928.4                               | 11.4                                      | 37.2
1 month     | 13131.0                              | 17.6                                      | 57.2
1 year      | 118797.0                             | 24.0                                      | 78.1

Table II. Baseline results for development corpus according to time periods.


Method                       | Average rank of retrieved OOV PNs | Median rank of retrieved OOV PNs | Recall (%)
NN                           | 1662                              | 1334                             | 37.7
NN with 1-side local window  | 1334                              | 411                              | 42.3
NN with 2-sides local window | 1672                              | 343                              | 40.5

(< 100), the best performance is obtained by the MI-based approach. Using more selected PNs, the three methods give similar results. Indeed, the one day diachronic documents contain only about 500 new PNs (cf. Table II), and reducing the vocabulary growth by using our methods allows us to obtain the same recall with about 300 selected new PNs. Using the time period of one week, the NN-based approach gives the best performance (cf. Fig. 6).


Table IV. Results for a one month period and 2000 selected new PNs for the different NN strategies (development corpus).

Table IV presents the results for the different strategies of the NN-based method (cf. Section II.C). "NN with 1-side local window" corresponds to the previously described local-window context NN-projection used for each occurrence of an IV PN in the test document (cf. Fig. 3). "NN with 2-sides local window" corresponds to the local-window context NN-projection used for each occurrence of an IV PN of the test document and for each occurrence of a selected new PN from the diachronic documents (cf. Fig. 4). "Average rank" (resp. "median rank") means the average (resp. median) rank of the retrieved OOV PNs in the list of selected PNs. The local window used only for recognized words of the test documents seems to work better than the 2-sides one. One explanation can be that our choice of the decision function (maximum, cf. Fig. 4) is not optimal. We note that the 2-sides local window method is much more time consuming, because each local window is calculated for each diachronic document and for each occurrence of a selected PN in this diachronic document. In the following experiments, the NN with 1-side local window will be used.

Fig. 6. Results for a one week time period as a function of the number of selected PNs. Development corpus.

For the time period of one month, the NN-based and MI-based methods seem to be a good compromise (cf. Fig. 7). The disadvantage of the MI method is that it requires more computational effort than the NN-based method: the mutual information is calculated for each diachronic document and for each occurrence of a selected PN in this diachronic document. In conclusion, for all time periods, the NN-based and MI-based methods give good results. However, the MI method is much more time consuming.

C. Comparison of the NN-based Method with Previously Proposed Methods

In this section we propose an experimental comparison of three methods: the NN-based method proposed in this paper, and the cosine-based and mutual information-based methods proposed in [8].

Figure 7. Results for a one month time period as a function of the number of selected PNs. Development corpus.

Fig. 5. Results for a one day time period as a function of the number of selected PNs. Development corpus.

Figure 5 shows that, for the time period of one day and a small number of selected new PNs from the diachronic documents


D. Speech Recognition Results

Automatic transcription of the 7 development documents using augmented lexicons (generating one lexicon per development file) is performed. For generating the pronunciations of the added PNs, a G2P CRF approach is used [9]. It is trained on a phonetic lexicon containing about 12000 PNs. In order to incorporate the new PNs in the language model, we re-estimated it for each augmented vocabulary using a large text corpus. The number of selected PNs per period is the same as in Table III: 400 for the day, 1500 for the week and 2000 for the month. Compared to the standard lexicon, a small improvement is obtained for the NN system in terms of WER (29.8% versus 30.2%). In terms of PN error rate, a significant difference is observed (35.3% versus 40.7%). There is no significant difference between the NN, MI and cosine-based results. Similar results are observed on the test corpus: a small improvement in terms of WER (31.4% versus 31.8% for the month period) and a significant difference in terms of PN error rate (38.2% versus 44% for the month period). There is no significant difference between the results for the different periods.


V. CONCLUSION


In this article, the problem of the retrieval of OOV words and of speech recognition vocabulary extension was investigated. We focused only on new proper name retrieval. Diachronic documents contemporary to the test documents were used to retrieve proper names to enrich the vocabulary. We proposed a continuous-space word representation using a neural network. This continuous vector representation is learned from large amounts of unstructured text data. To model the semantic and lexical context of proper names, different strategies of local context modeling were proposed. Experimental results and a comparison with previously proposed MI-based and cosine-based methods show the ability of the NN representation to model the semantic and lexical context of proper names.



REFERENCES
[1] Bengio, Y., Ducharme, R., Vincent, P. "A neural probabilistic language model", Journal of Machine Learning Research, 3: pp 1137-1155, 2003.
[2] Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., Khudanpur, S. "Recurrent neural network based language model", Proceedings of INTERSPEECH, 2010, pp 1045-1048.
[3] Hinton, G., Osindero, S., Teh, Y. "A Fast Learning Algorithm for Deep Belief Nets", Neural Computation, 18:1527-1554.
[4] Deng, L., He, X., Gao, J. "Deep stacking networks for information retrieval", Proceedings of ICASSP, 2013.
[5] Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J., Gong, Y. and Acero, A. "Recent Advances in Deep Learning for Speech Research at Microsoft", Proceedings of ICASSP, 2013.
[6] Friburger, N. and Maurel, D. "Textual Similarity Based on Proper Names", Proceedings of the workshop Mathematical/Formal Methods in Information Retrieval, 2002, pp. 155-167.
[7] Galliano, S., Geoffrois, E., Mostefa, D., Choukri, K., Bonastre, J.-F., Gravier, G. "The ESTER Phase II Evaluation Campaign for the Rich Transcription of French Broadcast News", Proceedings of Interspeech, 2005.
[8] Illina, I., Fohr, D. and Linares, G. "Proper Name Retrieval from Diachronic Documents for Automatic Transcription using Lexical and Temporal Context", Proceedings of SLAM, 2014.
[9] Illina, I., Fohr, D., Jouvet, D. "Grapheme-to-Phoneme Conversion using Conditional Random Fields", Proceedings of Interspeech, 2011.
[10] Illina, I., Fohr, D., Mella, O., Cerisara, C. "The Automatic News Transcription System: ANTS, some Real Time experiments", Proceedings of ICSLP, 2004.
[11] Lee, A. and Kawahara, T. "Recent Development of Open-Source Speech Recognition Engine Julius", Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2009.
[12] Mikolov, T., Chen, K., Corrado, G. and Dean, J. "Efficient Estimation of Word Representations in Vector Space", Proceedings of Workshop at ICLR, 2013.
[13] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. and Dean, J. "Distributed Representations of Words and Phrases and their Compositionality", Proceedings of NIPS, 2013.
[14] Mikolov, T., Yih, W. and Zweig, G. "Linguistic Regularities in Continuous Space Word Representations", Proceedings of NAACL HLT, 2013.
[15] Schmid, H. "Probabilistic part-of-speech tagging using decision trees", Proceedings of ICNMLP, 1994.
[16] Seide, F., Li, G., Yu, D. "Conversational speech transcription using context-dependent deep neural networks", Proceedings of Interspeech, 2011.
[17] Stolcke, A. "SRILM - An Extensible Language Modeling Toolkit", Proceedings of ICSLP, 2002.
[18] Vinyals, O., Ravuri, S.V. and Povey, D. "Revisiting recurrent neural networks for robust ASR", Proceedings of ICASSP, 2012.

Recognition of OOV Proper Names in Diachronic Audio News
Imran Sheikh, Irina Illina, Dominique Fohr
MultiSpeech Group, LORIA-INRIA, 54602 Villers-lès-Nancy, France
{imran.sheikh, irina.illina, dominique.fohr}@loria.fr

Abstract—LVCSR based audio indexing approaches are preferred as they allow search, navigation, browsing and structuring of audio/video documents based on their content. A major challenge with LVCSR based indexing of diachronic audio data, e.g. broadcast audio news, is OOV words, and specifically OOV PNs, which are very important for indexing applications. In this paper we propose an approach for the recognition of OOV PNs in audio news documents using PNs extracted from collections of diachronic text news from the internet. The approach has two steps: (a) reduce the long list of OOV PNs in the diachronic text corpus to a smaller list of OOV PNs which are relevant to the audio document, using probabilistic topic models; (b) perform a phonetic search for the target OOV PNs with the reduced list of relevant OOV PNs. We evaluate our approach on French broadcast news videos published over a period of 6 months. A Latent Dirichlet Allocation topic model is trained on diachronic text news to model PN-topic relationships and then to retrieve OOV PNs relevant to the audio document. Our proposed method retrieves up to 90% of the relevant OOV PNs while reducing the OOV PN search space to only 5% of the total OOV PNs. The phonetic search for target OOV PNs gives an F1-score of up to 0.392.

I. INTRODUCTION

With the proliferation of multimedia content on the internet, automatic content based indexing of audio data has been highly sought-after. In general there are two common approaches to audio indexing: one uses Large Vocabulary Continuous Speech Recognition (LVCSR), and the other uses phone recognition to carry out phonetic audio mining [1]. LVCSR based audio indexing approaches allow search, navigation, browsing and structuring of audio based on content [2], as opposed to phonetic audio mining. LVCSR based systems are often the choice for ad-hoc search on large audio databases [3]; on the other hand, phone recognition based approaches mostly serve user-query based audio document retrieval from relatively smaller databases. A major challenge with LVCSR based indexing of audio data, and specifically broadcast audio news, is the diachronic nature of the data. Diachronic news is characterised by different topics which change with time, leading to a change in the linguistic content and vocabulary. As a result, a typical problem faced by LVCSR systems processing such diachronic audio news is Out-Of-Vocabulary (OOV) words. In previous works it has been observed that the majority of OOV words are PNs (the PN percentage in OOV words being reported as: 56% in [4], 66% in [5], 57.6% in [6], 70% in [7], 72% in [8]). On the other hand, PNs in audio news are of prime importance for content based indexing and browsing applications. In this

paper we focus on the recognition of OOV PNs i.e., PNs which appear in diachronic audio news but are not present in the LVCSR vocabulary and cannot be recognised by the LVCSR system. To recognise OOV PNs in test audio documents, we rely on new PNs extracted from collections of diachronic text news from the internet (referred as diachronic corpus). Given this list of new PNs, the simplest approach would be to perform a phonetic search for the target OOV PNs1 in the LVCSR hypothesis [4]. Additional information such as error regions in the hypothesis [9] or the LVCSR lattice (instead of the 1best hypothesis) [10] can be used. However, as discussed in Section III the list of OOV PNs itself can be very large, leading to errors due to confusability. We propose to reduce the long list consisting of the OOV PNs in the diachronic corpus to a smaller list of OOV PNs which are relevant to the test audio document. To achieve this, we leverage the topic and lexical context from the audio document. We train a Latent Dirichlet Allocation (LDA) [11] topic model on diachronic text news corpus as training corpus, to model PN-topic relations. Then for the given test audio document the In-Vocabulary (IV) words are hypothesised by LVCSR and its latent topic context is inferred using the LDA topic model. A list of most relevant OOV PNs is then retrieved based on the topic and lexical context in the LVCSR hypothesis. With this list the number of PN candidates to be searched for a given audio news is highly reduced. A search in the phonetic space is then performed to recover the target OOV PNs in the audio news. The rest of the paper is organised as follows. In Section II we discuss about related works. In Section III we present two realistic diachronic broadcast news datasets used for evaluation of our proposed approach. In Section IV we discuss our proposed method to retrieve OOV PNs relevant to an audio document using probabilistic topic models and in Section V we discuss our approach for recognition of the target OOV PNs by performing a phonetic search in the LVCSR hypothesis. Section VI presents the experiments and results, followed by future work in Section VII and conclusion in Section VIII. 1 Ideally new PNs extracted from collections of diachronic text news are OOV PNs with respect to the LVCSR. However all new PNs are not present in the test set audio documents. Hence we use the term target OOV PNs to refer to the OOV PNs actually present in the test set audio documents. The general term ’OOV PNs’ is used to refer to news PNs until (or unless) they are recognised as target OOV PNs.


II. R ELATED W ORK We propose recognition of OOV PNs in an audio news document by leveraging topic, lexical and phonetic context; with the goal of indexing audio documents with PNs for browsing and structuring. We use LDA to model PN-topic relations. Previously, PNs have been modelled with LDA [12] and a similar approach based on vector space representation similar to Latent Semantic Analysis (LSA) has been tried [13]. However, these approaches estimate one LDA/LSA context model for each PN which restricts them to only frequent PNs i.e., PNs which have significant amount of associated documents to learn individual LDA/LSA models. In our approach, we train a global topic model with all the text documents. And as opposed to the usual practise, of discarding less frequent terms, in approaches based on topic models we have retained the less frequent PNs both in the training and the test set. Phonetic search based recovery of PNs has been discussed in [9]. In this work the system searches only for the most frequent PNs. Similarly, lattice-based phonetic search for proper name retrieval task has been proposed in [10]. But these approaches do not make use of rich topic and semantic information associated with PNs. An approach for combining phonetic search and semantic information inferred with Probabilistic Latent Semantic Analysis (PLSA) has been discussed in [14]. However their goal is spoken document retrieval by comparing search query and spoken documents in semantic and phonetic space. A combination of topic based context models and phonetic search has been proposed for PN recognition [12], [13], [15]. However, as mentioned earlier these approaches estimate one LDA/LSA model for each PN which restricts these approaches to only frequent PNs. III. B ROADCAST N EWS D IACHRONIC DATASETS In this section, we present two realistic broadcast news diachronic datasets which will highlight the purpose of handling OOV PNs. These datasets will be used as the training and test sets for evaluation of our proposed methods. Table I shows a description of these two datasets. The L’Express dataset is collected from the website of the French newspaper L’Express2 whereas the Euronews dataset is collected from the French website of the Euronews3 television channel. The L’Express dataset contains text news whereas the Euronews dataset contains news videos along with their text transcriptions. TreeTagger [16] is used to automatically tag PNs in the text. The words and PNs which occur in the lexicon of our Automatic News Transcription System (ANTS) [17] are tagged as IV and the remaining PNs are tagged as OOV. ANTS lexicon is based on news articles until 2008 from French newspaper LeMonde. As shown in the table 64% of OOV words in Euronews video dataset are PNs and about 47% of the videos contain OOV PNs. For our experiments the L’Express dataset is used as diachronic corpus and audio news extracted from the Euronews video dataset is used as test set. 2 http://www.lexpress.fr/ 3 http://fr.euronews.com/

TABLE I. BROADCAST NEWS DIACHRONIC DATASETS.

|                                | L'Express           | Euronews            |
| Type of Documents              | Text                | Video               |
| Time Period                    | Jan 2014 - Jun 2014 | Jan 2014 - Jun 2014 |
| Number of Documents*           | 45K                 | 3K                  |
| Vocabulary Size (unigrams)     | 150K                | 18K                 |
| Corpus Size (total word count) | 24M                 | 600K                |
| Number of PN unigrams+         | 40K                 | 2.2K                |
| Total PN count                 | 1.3M                | 19K                 |
| Documents with OOV             | 43K                 | 2172                |
| Number of OOV unigrams+        | 55K                 | 1588                |
| Total OOV count                | 450K                | 7710                |
| Documents with OOV PN          | 36K                 | 1415                |
| Number of OOV PN unigrams+     | 17K                 | 1024                |
| Total OOV PN count             | 200K                | 3128                |

*K denotes thousand and M denotes million. +Unigrams occurring only once are excluded.

IV. OOV PN RETRIEVAL USING TOPIC MODELS

Figure 1 shows a diagrammatic representation of the proposed approach for recognition of OOV PNs using topic and phonetic context. As shown in the figure, topic models are trained on a diachronic text corpus, used as training corpus, to learn relations between words, latent topics and OOV PNs. Given a test audio news document, IV words (including IV PNs) are hypothesised by the LVCSR and the topic model is used to infer and retrieve a list of the most relevant OOV PNs for the test document. Then the target OOV PNs in the test document are identified by performing a phonetic search in the LVCSR hypothesis with each of the OOV PNs from the list of most relevant OOV PNs. In this section we briefly discuss topic models and then present in detail our approach to retrieve the list of OOV PNs relevant to the test document. Phonetic search and recognition of the target OOV PNs is discussed in the next section.

[Fig. 1. OOV PN recognition in a diachronic audio document: a diachronic corpus collected from the internet feeds the topic context model and the list of OOV PNs; the audio is transcribed by the LVCSR (speech-to-text), topic based ranking selects the relevant OOV PNs, and G2P conversion followed by a phonetic search identifies the target OOV PNs.]

A. Topic Models

Latent Semantic Analysis (LSA) [18], Probabilistic LSA (PLSA) [19] and Latent Dirichlet Allocation (LDA) [11] have been the most prominent unsupervised methods for extracting topics and underlying semantic structure from collections of documents. While LSA derives semantic spaces from a word co-occurrence matrix and operates using a spatial representation, PLSA and LDA derive topics using hierarchical Bayesian analysis. We choose LDA to capture PN-topic relations as it is a well-defined generative model and new elements (or variables) can easily be incorporated into it to capture richer semantics and relations. Additionally, LDA has been shown to outperform PLSA and LSA for document classification [11] and word prediction [20] tasks.

In our approach, LDA is used to model topics in the diachronic corpus. Figure 2 shows the graphical representation (or plate diagram) of the LDA topic model. In this figure, w represents the observed word and z represents the topic corresponding to word w in a text document. Each document, represented by the inner plate around w and z, has N_d words. The entire corpus, represented by the outer plate around w, z and θ, has D documents. θ = [θ_{dt}]_{D×T} is the topic distribution for each document d, and φ = [φ_{vt}]_{N_v×T} is the topic distribution for each of the N_v words in the vocabulary of the corpus. Both θ and φ are distributions across T topics. α and β are Dirichlet priors for θ and φ. The generative process of LDA is given as:
1) For each document d, sample θ_d ∼ Dir(α)
2) For each topic t, sample φ_t ∼ Dir(β)
3) For each of the N_d words w_i in document d:
   • Sample a topic z_i ∼ Mult(θ_d)
   • Sample a word w_i ∼ Mult(φ_{z_i})

[Fig. 2. Graphical representation (plate diagram) of the LDA topic model, with Dirichlet priors α and β, document-topic distributions θ, word-topic distributions φ, topics z and observed words w, over D documents of N_d words and T topics.]

To model topics in the diachronic corpus of text documents, a topic vocabulary of size N_v, the number of topics T and the Dirichlet priors are chosen. The topic model parameters θ and φ are estimated using the Gibbs sampling algorithm [21]. For an unseen test document, the latent topic distribution can be inferred with a Gibbs sampling equation similar to that used in training [21].

B. OOV PN Retrieval using LDA Topic Model

The topics learned by LDA are used to retrieve OOV PNs relevant to the test document. Let us denote the LVCSR hypothesis of the test document by h and an OOV PN in the diachronic corpus by \tilde{v}_x. In order to retrieve OOV PNs, we calculate p(\tilde{v}_x | h) for each \tilde{v}_x and treat it as a score to rank the OOV PNs relevant to h. With the words observed in h, the latent topic mixture [p(t|h)]_T can be inferred by re-sampling from the word-topic distribution φ learned during training. Given p(\tilde{v}_x | t) = φ_{vt}, the likelihood of an OOV PN \tilde{v}_x can be calculated as:

p(\tilde{v}_x \mid h) = \sum_{t=1}^{T} p(\tilde{v}_x \mid t)\, p(t \mid h)    (1)

While this method relies on the word-topic distribution of LDA, we propose another method to retrieve relevant OOV PNs by using the document-topic distributions learned by LDA. This method relies on the topic similarity between h and each text document d' in the diachronic corpus which contains the OOV PN \tilde{v}_x. The topic mixture [p(t|d')]_T for each d' is available from θ_{d'} estimated during training. The topic mixture [p(t|h)]_T for h is inferred and the likelihood of \tilde{v}_x is calculated as:

p(\tilde{v}_x \mid h) \approx \max_{d'} \{ \mathrm{CosSim}(h, d') \} = \max_{d'} \left\{ \frac{\sum_{t=1}^{T} p(t \mid h)\, p(t \mid d')}{\sqrt{\sum_{t=1}^{T} p(t \mid h)^2}\; \sqrt{\sum_{t=1}^{T} p(t \mid d')^2}} \right\}    (2)

where CosSim(h, d') is the cosine similarity between the test document and a diachronic document in topic space. The main idea behind this technique is to associate each OOV PN with several topic distributions, each of which is derived from a document in the diachronic corpus in which the OOV PN was observed. It can be viewed as a set of document-specific topic distributions. While this method gives the best retrieval ranks for OOV PNs, it requires iterating through the diachronic corpus. We refer to the method using Equation (1) as the PN-Topic based Method and to the method using Equation (2) as the Document Similarity based Method. It should be noted that these methods can be applied to any probabilistic topic model, although we have chosen LDA for modelling topics.
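Both ranking schemes reduce, in essence, to a few operations on the LDA posteriors. The short Python sketch below is a minimal illustration of Equations (1) and (2), not the authors' implementation; the array names (`phi_oov`, `theta_docs`, `p_t_h`, `oov_pn_docs`) are hypothetical and assume that a topic model has already been trained and that the topic mixture of the hypothesis has already been inferred.

```python
import numpy as np

def rank_oov_pns_pn_topic(p_t_h, phi_oov):
    """Equation (1): score each OOV PN by sum_t p(pn|t) * p(t|h).

    p_t_h   : (T,) inferred topic mixture of the LVCSR hypothesis h
    phi_oov : (N_oov, T) word-topic probabilities p(pn|t) for the OOV PNs
    Returns OOV PN indices sorted by decreasing relevance, plus the scores."""
    scores = phi_oov @ p_t_h
    return np.argsort(-scores), scores

def rank_oov_pns_doc_sim(p_t_h, theta_docs, oov_pn_docs):
    """Equation (2): score each OOV PN by the best cosine similarity, in topic
    space, between h and the diachronic documents that contain it.

    theta_docs  : (D, T) document-topic mixtures estimated during training
    oov_pn_docs : for each OOV PN, the list of indices of the diachronic
                  documents in which that PN occurs."""
    h_norm = p_t_h / np.linalg.norm(p_t_h)
    d_norm = theta_docs / np.linalg.norm(theta_docs, axis=1, keepdims=True)
    cos = d_norm @ h_norm                      # cosine similarity to every document
    scores = np.array([cos[docs].max() if len(docs) else 0.0
                       for docs in oov_pn_docs])
    return np.argsort(-scores), scores
```

Keeping only the head of either ranked list (for instance the top 5% of the OOV PNs, the operating point used in the experiments below) is what reduces the phonetic search space.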

V. PHONETIC SEARCH FOR RECOGNITION OF OOV PNS

In this section we discuss our approach to identify the target OOV PNs by utilising phonetic information. A phonetic search is performed on the LVCSR hypothesis using each of the OOV PNs retrieved with the topic model. The phonetic form of the LVCSR text hypothesis is obtained from the LVCSR, and the OOV PNs retrieved using topic models are converted into their phonetic forms using a Grapheme-to-Phoneme (G2P) converter.


Our phonetic search algorithm is based on the classical k-differences approximate string matching algorithm [22]. To formulate it, let us consider a string h = h_1 h_2 h_3 ... h_n, representing the LVCSR hypothesis in its phonetic form, and a string P = p_1 p_2 ... p_m, representing an OOV PN in phonetic form. Given a search constraint parameter k, the task is to find all i such that the edit distance (insertions, deletions and substitutions) between P and some phone substring of h ending at h_i is at most k. With a proper choice of k, this phone substring of h which matches P can be hypothesised as a target OOV PN.

To find i we use dynamic programming. Let D be an (n+1) by (m+1) matrix such that D(i, j) is the minimum edit distance between P and any phone substring of h ending at h_i. The entries of D are calculated as:

D(i, 0) = 0, \quad 0 \le i \le n    (3a)
D(0, j) = D(0, j-1) + \delta_{del}, \quad 1 \le j \le m    (3b)
D(i, j) = \min \{\, D(i-1, j) + \delta_{ins};\;\; D(i-1, j-1) + \delta_{eq} \text{ if } h_i = p_j;\;\; D(i-1, j-1) + \delta_{sub} \text{ if } h_i \ne p_j;\;\; D(i, j-1) + \delta_{del} \,\}    (3c)

where \delta_{ins}, \delta_{del}, \delta_{eq}, \delta_{sub} are the costs for insertion, deletion, equality and substitution of a single phone. The matrix D can be evaluated in time O(nm). Whenever D(i, m) is found to be at most k for some i, there is an approximate occurrence of P ending at h_i with edit distance D(i, m) ≤ k. The corresponding matching phone string in h can be recovered by backtracking in the distance matrix D from point (i, m), or by storing all the edit operations during the calculation of D.

A problem with any algorithm based on k-differences approximate matching is that the matching distance depends on the length of P. In order to address this, the match score is normalised as:

D(i, m)_{norm} = \frac{D(i, m)}{\max(m, l_i)}    (4)

where l_i is the length of the matching phone substring in h ending at i. This normalisation brings the match score into the range 0 to 1, making scores comparable for PNs of different lengths.

It should be noted that it is not required to calculate and search the entire distance matrix D. The error and OOV regions in the LVCSR hypothesis can be hypothesised [23] and only these parts of the LVCSR output can be used for the phonetic search. In this case the distance matrix has re-initialisations, similar to Equation (3b), wherever the beginning of an OOV and/or error region is hypothesised.
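As an illustration of the search described above, the following sketch implements the same dynamic program in Python. It is a hedged reconstruction rather than the authors' code: the phone strings are plain Python sequences, the default costs are standard unit edit costs (the paper itself reports δ_ins = δ_del = δ_sub = −1 and δ_eq = 1 in Section VI), and the decision is taken on the normalised score of Equation (4), in the spirit of the threshold τ used later.

```python
def phonetic_search(h, p, tau, d_ins=1.0, d_del=1.0, d_sub=1.0, d_eq=0.0):
    """Approximate search of a phone string p inside a hypothesis phone string h.

    D[i][j] is the minimum edit cost between p[:j] and some substring of h ending
    at position i (Equations (3a)-(3c)); a candidate is reported whenever the
    length-normalised score D[i][m] / max(m, l_i) of Equation (4) is at most tau.
    Returns a list of (end_position, normalised_score) pairs."""
    n, m = len(h), len(p)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]   # D(i, 0) = 0 for all i      (3a)
    L = [[0] * (m + 1) for _ in range(n + 1)]     # length of the matched substring of h
    for j in range(1, m + 1):                     # first row: only deletions  (3b)
        D[0][j] = D[0][j - 1] + d_del
    matches = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):                 # recurrence                 (3c)
            diag = d_eq if h[i - 1] == p[j - 1] else d_sub
            D[i][j], L[i][j] = min(
                (D[i - 1][j] + d_ins, L[i - 1][j] + 1),         # consume h[i-1] only
                (D[i - 1][j - 1] + diag, L[i - 1][j - 1] + 1),  # align h[i-1] with p[j-1]
                (D[i][j - 1] + d_del, L[i][j - 1]),             # skip p[j-1]
            )
        norm = D[i][m] / max(m, L[i][m])          # Equation (4)
        if norm <= tau:
            matches.append((i, norm))
    return matches

# Toy usage with made-up phone strings:
# phonetic_search("lbarakobamaa", "barakobama", tau=0.2)
```

Only the columns of D that fall inside hypothesised error or OOV regions need to be evaluated in practice, as noted above.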

VI. EXPERIMENTS AND RESULTS

In this section we first present our experimental setup and then discuss the performance of our proposed approach. For the evaluation we use the datasets presented in Section III: the L'Express dataset of diachronic text news is used as the diachronic corpus, and audio news extracted from the Euronews video dataset is used as the test set.

A. LVCSR system

The ANTS [17] LVCSR system is used to perform automatic segmentation and speech-to-text transcription of the test audio news. ANTS is based on context-dependent HMM phone models trained on 200 hours of broadcast news audio. ANTS uses the Julius [24] speech recognition engine at the backend. The baseline phonetic lexicon contains 260k pronunciations for 122k words. Using the SRILM toolkit [25], the language model is estimated on text corpora of about 1800 million words. The automatic transcriptions of the test audio news obtained by ANTS have an average Word Error Rate (WER) of 40% compared to the manual transcriptions.

B. Topic Model

The L'Express dataset is our diachronic corpus for training the topic model. The diachronic corpus vocabulary is lemmatised and filtered by removing PNs occurring only once, non-PN words occurring less than 4 times, and by using a stoplist of common French words and non-content words which do not carry any topic-related information. Moreover, a POS based filter is employed to choose words tagged as PN, noun, adjective, verb and acronym. The filtered vocabulary has 40000 PNs and 28000 words. Out of the 40000 PNs, 17000 are not present in the ANTS LVCSR lexicon and are tagged as OOV PNs. The LDA topic model is trained with this filtered vocabulary. Model parameters are estimated with 2500 iterations of the Gibbs sampling algorithm [21] to ensure convergence. We tried different numbers of topics (in the range 20-1000) in our experiments, the best performance being obtained for 300 topics. Beyond 300 topics the improvement is not significant.

C. Performance of OOV PN Retrieval using Topic Model

The proposed OOV PN retrieval using topic models, as discussed in Section IV, is evaluated on the 1415 test audio news (in the Euronews dataset) which contain OOV PNs. As shown in Table I, these 1415 documents contain 1024 unique OOV PN unigrams occurring a total of 3128 times. However, the total number of OOV PNs to be retrieved, obtained by counting unique OOV PNs per document, is 2300. Out of the 2300 OOV PNs to be retrieved, 476 (20%) occur only 5 or fewer times in the diachronic corpus used for training the topic model. Thus we have retained the less frequent PNs both in the training and the test set. Less frequent PNs are problematic and we will discuss how our methods perform with them.

Figure 3 shows the recall performance of the PN-Topic based Method and the Document Similarity based Method, discussed in Section IV-B, on manual transcriptions of the test set. In the graph in Figure 3, the X-axis represents the number of OOV PNs selected from the diachronic corpus by the two methods, and the Y-axis represents the recall of the target OOV PNs. The PN-Topic based Method is denoted as M-1 and the Document Similarity based Method as M-2. It can be seen that OOV PN retrieval with the proposed methods can recover up to 90% of the target OOV PNs within 5% of the retrieval results, thus reducing the search space to only 5% of the OOV PNs from the diachronic corpus.

[Fig. 3. Recall of the OOV PN retrieval methods on manual transcriptions (M-1: PN-Topic based Method, M-2: Document Similarity based Method); X-axis: number of OOV PNs retrieved (log scale, up to the full list of 17000), Y-axis: OOV PN recall; the 5% operating point is marked.]

We choose the 5% operating point to compare further results. Table II compares the OOV PN retrieval performance of M-1 and M-2 for both manual and LVCSR transcripts. The comparison is in terms of Recall and Mean Average Precision (MAP) [26] obtained with the top 5% of the retrieved OOV PNs. The Document Similarity based Method (M-2) has better Recall and MAP, but as mentioned earlier this method requires iterating through the diachronic corpus and is therefore computationally more expensive. As expected, the performance of both methods is slightly degraded on LVCSR transcripts, due to LVCSR errors. However, it must be noted that this degradation is small because the proposed retrieval methods rely on the topics inferred from the transcripts and not on the transcripts directly: the topic inference step smooths out LVCSR errors.

TABLE II. RECALL AND MEAN AVERAGE PRECISION (MAP) OF THE OOV PN RETRIEVAL METHODS WITH THE TOP 5% OOV PNS.

| Transcripts | Method | Recall | MAP  |
| Manual      | M-1    | 0.87   | 0.21 |
| Manual      | M-2    | 0.91   | 0.23 |
| LVCSR       | M-1    | 0.80   | 0.20 |
| LVCSR       | M-2    | 0.87   | 0.20 |

To study the difference in the performance of M-1 and M-2, we plotted the distribution of the ranks obtained by the target OOV PNs versus the frequency of occurrence of the target OOV PNs in the diachronic corpus used to train the topic models. Figure 4 shows the rank-frequency distribution for M-1, whereas Figure 5 shows the rank-frequency distribution for M-2. As shown in these figures, M-1 performs better for frequent OOV PNs whereas M-2 is more uniform across different OOV PNs. A major reason behind this is the way the OOV PN scores are calculated in M-1 and M-2: Equation (1) is similar to a dot product between topic vectors, while Equation (2) is a cosine similarity between topic vectors. For probabilistic topic models the topic vectors contain probabilities which are directly related to frequency of occurrence, and the cosine similarity, being a normalised measure, in our case leads to a normalisation of the probabilistic topic vectors. A similar observation is also discussed in [20].

[Fig. 4. Rank-frequency distribution for the PN-Topic based OOV PN retrieval method (M-1); X-axis: retrieval ranks of the OOV PNs, Y-axis: frequency in the diachronic corpus.]

[Fig. 5. Rank-frequency distribution for the Document Similarity based OOV PN retrieval method (M-2); X-axis: retrieval ranks of the OOV PNs, Y-axis: frequency in the diachronic corpus.]

D. Performance of OOV PN Recognition with Phonetic Search

As shown in Figure 1, to recognise the target OOV PNs, a phonetic search is performed on the LVCSR hypothesis for the OOV PNs obtained with the topic models. The phonetic string corresponding to the LVCSR hypothesis is obtained using forced alignment, whereas the OOV PNs are converted to phone strings using our CRF based G2P converter [27]. The k-differences based approximate matching algorithm discussed in Section V was used for performing the phonetic search, with δ_ins = −1, δ_del = −1, δ_sub = −1 and δ_eq = 1. For the evaluation of our proposed approach, the phonetic search was performed only in those regions of the LVCSR hypothesis which do not exactly match the manual transcriptions. Figure 6 shows the OOV PN recognition performance using this approach for OOV PNs retrieved with the PN-Topic based Method and the Document Similarity based Method. The performance is evaluated in terms of the F1-score, defined in Equation (5) below.

[Fig. 6. F1-score for OOV PN recognition at different phonetic distance thresholds τ (0.1, 0.3, 0.6 and 1.0); one curve per retrieval method (PN-Topic based and Document Similarity based); the X-axes give the number of OOV PNs retrieved with the two methods.]

F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall} = \frac{2\, tp}{2\, tp + fp + fn}    (5)

where tp, fp and fn stand for the numbers of true positive, false positive and false negative OOV PNs, respectively. The F1-score is calculated for different thresholds of the phonetic match score (denoted τ) and for different numbers of OOV PNs retrieved with the topic model. The best F1-score of 0.392 is obtained for the Document Similarity based Method at τ = 0.7 (not shown in the figure) with the top 64 OOV PNs retrieved with the topic model. The PN-Topic based Method has a best F1-score of 0.370, obtained at τ = 0.6 with the top 32 OOV PNs retrieved with the topic model. The recall obtained in these cases is about 33%, with a precision of about 48%. However, it must be noted that a recall of up to 74% can be obtained in the same setting with a lower precision. This gives scope for future work which can employ additional contextual information to improve the precision of the proposed approach.

VII. DISCUSSION AND FUTURE WORK

As shown in Figures 4 and 5, our proposed methods are able to handle rare OOV PNs. However, we observe that probabilistic topic models are based on co-occurrences and are biased towards highly frequent words and PNs, and so the rare OOV PNs need to be addressed separately. There is another, related problem with topic models: they transform the words and PNs into a reduced space, and as a result many words overlap in the topic space. This extends to


PNs as well, i.e., the PNs also overlap in the topic space. As a result, retrieval of OOV PNs with topic models gives rise to many relevant but non-target OOV PNs.

We believe that there is great scope for future work and that several extensions are possible. The proposed OOV PN retrieval methods based on topic models can be used in a pipeline, i.e., the Document Similarity based Method followed by the PN-Topic based Method: the Document Similarity based Method for retrieving documents with similar topics, followed by the PN-Topic based Method to choose the relevant OOV PNs from the similar documents. Also, we have used the classical LDA topic model. The LDA probabilistic topic model can further incorporate document time stamps as labels [28] or as a model variable [29]. With such an extension of LDA, timestamp information can act as context information to further filter the retrieved OOV PNs.

VIII. CONCLUSION

LVCSR systems processing diachronic audio news, especially for content based indexing, need to handle OOV PNs. While PNs extracted from collections of diachronic text news from the internet can be used to recognise and recover the target OOV PNs, the total number of new PNs appearing is very large. In this paper we proposed a two-step approach for the recognition of OOV PNs in an audio document. The first step, retrieving OOV PNs relevant to an audio document using probabilistic topic models, reduces the search space to about 5% of the total OOV PNs while still retaining about 90% of the target OOV PNs. The second step, a phonetic search for the target OOV PNs using a k-differences approximate string matching algorithm, can recognise about 33% of the target


OOV PNs with a precision of about 48%; and about 74% of the target OOV PNs at a lower precision. These results are promising and give scope for future work which can employ additional contextual information for improving the precision of the proposed approach. ACKNOWLEDGMENT The authors would like to thank the ANR ContNomina SIMI-2 of the French National Research Agency (ANR) for funding. R EFERENCES [1] C. Bhatt and M. Kankanhalli, “Multimedia data mining: state of the art and challenges,” Multimedia Tools and Applications, vol. 51, no. 1, pp. 35–76, 2011. [2] C. Alberti, M. Bacchiani, A. Bezman, C. Chelba, A. Drofa, H. Liao, P. Moreno, T. Power, A. Sahuguet, M. Shugrina, and O. Siohan, “An audio indexing system for election video material,” in IEEE ICASSP, April 2009, pp. 4873–4876. [3] F. Seide, K. Thambiratnam, and R. Yu, “Word-lattice based spokendocument indexing with standard text indexers,” in IEEE Spoken Language Technology Workshop, Dec 2008, pp. 293–296. [4] L. Qin, “Learning out-of-vocabulary words in automatic speech recognition,” Ph.D. dissertation, Language Technologies Institute, School of Computer Science, Carnegie Mellon University, 2013. [5] C. Parada, M. Dredze, and F. Jelinek, “OOV sensitive named-entity recognition in speech,” in INTERSPEECH, 2011, pp. 2085–2088. [6] D. Palmer and M. Ostendorf, “Improving out-of-vocabulary name resolution,” Computer Speech & Language, vol. 19, pp. 107 – 128, 2005. [7] A. Allauzen and J.-L. Gauvain, “Open vocabulary ASR for audiovisual document indexation,” in IEEE ICASSP, 2005, pp. 1013–1016. [8] F. B´echet, A. Nasr, and F. Genet, “Tagging unknown proper names using decision trees,” in 38th Annual Meeting on Association for Computational Linguistics, PA, USA, 2000, pp. 77–84. [9] R. Dufour, G. Damnati, D. Charlet, and F. Bchet, “Automatic transcription error recovery for person name recognition,” in INTERSPEECH, 2012, pp. 1007–1010. [10] M. Akbacak and J. Hansen, “Spoken proper name retrieval for limited resource languages using multilingual hybrid representations,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1486–1495, Aug 2010. [11] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003. [12] G. Senay, B. Bigot, R. Dufour, G. Linar`es, and C. Fredouille, “Person name spotting by combining acoustic matching and LDA topic models,” in INTERSPEECH, 2013, pp. 1584–1588. [13] B. Bigot, G. Senay, G. Linar`es, C. Fredouille, and R. Dufour, “Person name recognition in ASR outputs using continuous context models,” in IEEE ICASSP, 2013, pp. 8470–8474. [14] P. M. Beth Logan, Patrawadee Prasangsit, “Fusion of semantic and acoustic approaches for spoken document retrieval,” in ISCA Workshop on Multilingual Spoken Document Retrieval (MSDR), 2003, pp. 1–6. [15] B. Bigot, G. Senay, G. Linar`es, C. Fredouille, and R. Dufour, “Combining acoustic name spotting and continuous context models to improve spoken person name recognition in speech.” in INTERSPEECH, 2013, pp. 2539–2543. [16] TreeTagger - a language independent part-of-speech tagger. [Online]. Available: http://www.cis.uni-muenchen.de/ schmid/tools/TreeTagger/ [17] I. Illina, D. Fohr, O. Mella, and C. Cerisara, “The Automatic News Transcription System: ANTS some Real Time experiments,” in INTERSPEECH, 2004, pp. 377–380. [18] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” J. Assoc. Inf. Sci. Technol., vol. 41, no. 6, pp. 
391–407, 1990. [19] T. Hofmann, “Probabilistic latent semantic analysis,” in Uncertainty in Artificial Intelligence, 1999, pp. 289–296. [20] T. L. Griffiths, J. B. Tenenbaum, and M. Steyvers, “Topics in semantic representation,” Psychological Review, vol. 114, p. 2007, 2007. [21] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the National Academy of Sciences, vol. 101, pp. 5228–5235, 2004.

[22] G. Navarro, “A guided tour to approximate string matching,” ACM Comput. Surv., vol. 33, no. 1, pp. 31–88, Mar. 2001. [23] B. Lecouteux, G. Linars, and B. Favre, “Combined low level and high level features for out-of-vocabulary word detection.” in INTERSPEECH, 2009, pp. 1187–1190. [24] A. Lee and T. Kawahara, “Recent development of open-source speech recognition engine julius,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2009. [25] A. Stolcke, “SRILM - an extensible language modeling toolkit,” in Proceedings International Conference on Spoken Language Processing, November 2002, pp. 257–286. [26] C. D. Manning, P. Raghavan, and H. Sch¨utze, Introduction to Information Retrieval. Cambridge, UK: Cambridge University Press, 2008. [27] I. Illina, D. Fohr, and D. Jouvet, “Multiple Pronunciation Generation using Grapheme-to-Phoneme Conversion based on Conditional Random Fields,” in XIV International Conference ”Speech and Computer” (SPECOM’2011), Kazan, Russia, Sep. 2011. [28] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning, “Labeled LDA : a supervised topic model for credit attribution in multi-labeled corpora,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1, ser. EMNLP ’09. Stroudsburg, PA, USA: Association for Computational Linguistics, 2009, pp. 248–256. [Online]. Available: http://dl.acm.org/citation.cfm?id=1699510.1699543 [29] D. M. Blei and J. D. Lafferty, “Dynamic topic models,” in Proceedings of the 23rd International Conference on Machine Learning, ser. ICML ’06. New York, NY, USA: ACM, 2006, pp. 113–120. [Online]. Available: http://doi.acm.org/10.1145/1143844.1143859


Enhanced discriminative models with tree kernels and unsupervised training for entity detection

Lina M. Rojas-Barahona (Université de Lorraine - LORIA, Nancy, France, Email: [email protected])
Christophe Cerisara (CNRS - LORIA, Nancy, France, Email: [email protected])

Abstract—This work explores two approaches to improve the discriminative models that are commonly used nowadays for entity detection: tree-kernels and unsupervised training. Feature-rich classifiers have been widely adopted by the Natural Language processing (NLP) community because of their powerful modeling capacity and their support for correlated features, which allow separating the expert task of designing features from the core learning method. The first proposed approach consists in leveraging the fast and efficient linear models with unsupervised training, thanks to a recently proposed approximation of the classifier risk, an appealing method that provably converges towards the minimum risk without any labeled corpus. In the second proposed approach, tree kernels are used with support vector machines to exploit dependency structures for entity detection, which relieve designers from the burden of carefully design rich syntactic features manually. We study both approaches on the same task and corpus and show that they offer interesting alternatives to supervised learning for entity recognition. Index Terms—Entity recognition, Tree Kernels, Unsupervised Learning.

I. I NTRODUCTION The goal of this work is to detect entities, with a focus on proper nouns, in French text documents. Entity detection is classically realized in the state-of-the-art with a sequence discriminative model, such as a conditional random fields (CRF), which can exploit rich input features typically derived from the words to tag, its surrounding words (linear context) and gazettes, which are list of known entities. This traditionnal approach is highly efficient, but still has to face some important issues, in particular: • The cost incurred to manually annotate a large enough training corpus; • The fact that the input features do not exploit the intrinsic linguistic structure of the sentences to tag, despite its fundamental importance for interpreting and relating the surface words together. We propose next to address both issues, with two original approaches. The first one proposes to use tree kernels within a baseline support vector machine (SVM) to exploit the syntactic structure of parse trees of the input sentence for supervised learning, and the second one explores the use of an unsupervised training algorithm on a discriminative linear models, which opens new research directions to reduce the requirement of prior manual annotation of a large training corpus.

Section II presents the target unsupervised approach, discusses the general issue of how to train discriminative models and some of the solutions proposed in the state-of-the-art, and describes our proposed adaptation of the algorithm for entity detection. Section III presents tree kernels, while Section IV shows how tree kernels can be used to exploit the syntactic structures for entity detection. Section V presents experimental validations of both approaches on the same broadcast news corpus in French. Section VI briefly summarizes some of the related works in the litterature, and Section VII concludes the paper. II. UNSUPERVISED TRAINING A. Context Unsupervised training of discriminative models poses serious theoretical issues, which prevent such models from being widely adopted in tasks where annotated corpora do not exist. In such cases, generative models are thus often preferred. Nevertheless, discriminative models have various advantages that might be desirable even without supervision, for example their very interesting capacity to handle correlated features and to be commonly equipped with many rich features. Hence, many efforts have been deployed to address this issue, and some unsupervised training algorithms for discriminative models have been proposed in the Natural Language Processing (NLP) community, for instance Unsearn [1], Generalized Expectation [2] or Contrastive Training [3] amongst others. Our unsupervised approach relies on a novel approximation of the risk of binary linear classifiers proposed in [4]. This approximation relies on only two assumptions: the rank of class marginal is assumed to be known, and the class-conditional linear scores are assumed to follow a Gaussian distribution. Compared to previous applications of unsupervised discriminative training methods to NLP tasks, this approach presents several advantages: first, it is proven to converge towards the true optimal classifier risk; second, it does not require any constraint; third, it exploits a new type of knowledge about class marginal that may help convergence towards a relevant solution for the target task. In this work, we adapt and validate the proposed approach on two new binary NLP tasks: predicate identification and entity recognition.


B. Classifier risk approximation

We first briefly review the approximation of the risk proposed in [4]. A binary linear classifier (with two target classes, 0 and 1) associates a score f_\theta^0(X) to the first class 0 for any input X = (X_1, \cdots, X_{N_f}) composed of N_f features X_i:

f_\theta^0(X) = \sum_{i}^{N_f} \theta_i X_i

where the parameter \theta_i \in \mathbb{R} represents the weight of the feature indexed by i for class 0. As is standard in binary classification, we constrain the scores per class to sum to 0:

f_\theta^1(X) = -f_\theta^0(X)

In the following, we may use both notations f_\theta^0(X) and f_\theta(X) equivalently. X is classified into class 0 iff f_\theta^0(X) \ge 0, otherwise X is classified into class 1. The objective of training is to minimize the classifier risk:

R(\theta) = E_{p(X,Y)}[L(Y, f_\theta(X))]    (1)

where Y is the true label of the observation X, and L(Y, f_\theta(X)) is the loss function, such as the hinge loss used in SVMs or the log-loss used in CRFs. This risk is often approximated by the empirical risk computed on a labeled training corpus. In the absence of a labeled corpus, an alternative consists in deriving the true risk as follows:

R(\theta) = \sum_{y \in \{0,1\}} P(y) \int_{-\infty}^{+\infty} P(f_\theta(X) = \alpha \mid y)\, L(y, \alpha)\, d\alpha    (2)

We use next the following hinge loss:

L(y, \alpha) = (1 + \alpha_{1-y} - \alpha_y)_+    (3)

where (x)_+ = \max(0, x), and \alpha_y = f_\theta^y(X) is the linear score for the correct class y. Similarly, \alpha_{1-y} = f_\theta^{1-y}(X) is the linear score for the wrong class. Given y and \alpha, the loss value in the integral can be computed easily. Two terms in Equation 2 remain: P(y) and P(f_\theta(X) = \alpha \mid y). The former is the class marginal and is assumed to be known. The latter is the class-conditional distribution of the linear scores, which is assumed to be normally distributed. This implies that P(f_\theta(X)) is distributed as a mixture of two Gaussians (GMM):

P(f_\theta(X)) = \sum_{y \in \{0,1\}} P(y)\, \mathcal{N}(f_\theta(X); \mu_y, \sigma_y)

where \mathcal{N}(z; \mu, \sigma) is the normal probability density function. The parameters (\mu_0, \sigma_0, \mu_1, \sigma_1) can be estimated from an unlabeled corpus U using a standard Expectation-Maximization (EM) algorithm for GMM training. Once these parameters are known, it is possible to compute the integral in Eq. 2 and thus an estimate \hat{R}(\theta) of the risk without relying on any labeled corpus. The authors of [4] prove that:
• the Gaussian parameters estimated with EM converge towards their true values;
• \hat{R}(\theta) converges towards the true risk R(\theta);
• the estimated optimum converges towards the true optimal parameters when the size of the unlabeled corpus U increases infinitely:

\lim_{|U| \to +\infty} \arg\min_\theta \hat{R}(\theta) = \arg\min_\theta R(\theta)

They further prove that this is still true even when the class priors P(y) are not known precisely, but only their relative order (rank) is known. These priors must also be different: P(y = 0) \ne P(y = 1).

Given the estimated Gaussian parameters, we use numerical integration to compute Eq. 2. We implemented both the Monte Carlo [5] and the trapezoidal methods for solving Eq. 2 numerically. In Monte Carlo integration, the integral is evaluated by sampling points \alpha_t according to a hypothesized probability distribution p(\alpha) = P(f_\theta(X)) and by computing the sum:

I = \frac{1}{n} \sum_{t=1}^{n} \frac{P(f_\theta(X) = \alpha_t \mid y)\, L(y, \alpha_t)}{p(\alpha_t)}    (4)

where n is the total number of points (i.e., the number of trials). The simplest integration method uses a uniform distribution p(\alpha) = \frac{1}{b-a}, and the sum in Equation 4 reduces to Equation 5:

I = (b - a)\, \frac{1}{n} \sum_{t=1}^{n} P(f_\theta(X) = \alpha_t \mid y)\, L(y, \alpha_t)    (5)

a and b are set broadly so as to capture most if not all possible points in the domain of the integral:

a = \min(\mu_{y,0}, \mu_{y,1}) - 6 \max(\sigma_{y,0}, \sigma_{y,1})
b = \max(\mu_{y,0}, \mu_{y,1}) + 6 \max(\sigma_{y,0}, \sigma_{y,1})

As is well known in numerical analysis, the trapezoidal rule for computing the same integral uses the approximation

\int_a^b f(x)\, dx \approx \frac{h}{2} \left( f(x_0) + 2 f(x_1) + \ldots + 2 f(x_{n-1}) + f(x_n) \right), \quad \text{where } h = \frac{b-a}{n}.

Our unsupervised training algorithm then implements a coordinated gradient descent, where the gradient of the risk is computed with finite differences.
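For concreteness, the following Python sketch shows one way the risk estimate \hat{R}(\theta) could be computed from unlabeled scores, under the assumptions above (known class marginals, Gaussian class-conditional scores). It is an illustrative reconstruction, not the authors' implementation; it uses scikit-learn's GaussianMixture for the EM step, trapezoidal integration for the integral, and default marginals P(y) = (0.9, 0.1), matching the entity-detection label priors described later.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_risk(scores, p_y=(0.9, 0.1), n_points=200):
    """Approximate R(theta) of Eq. (2) from unlabeled class-0 scores f_theta(X).

    scores : (N,) array of class-0 linear scores on the unlabeled corpus
    p_y    : assumed class marginals (P(y=0), P(y=1)); in theory only their rank
             matters, but values are needed for the numerical estimate."""
    # EM step: fit a 2-component GMM to the score distribution.
    gmm = GaussianMixture(n_components=2).fit(scores.reshape(-1, 1))
    mu = gmm.means_.ravel()
    sigma = np.sqrt(gmm.covariances_.ravel())
    # Map the heavier mixture component to the more frequent class.
    order = np.argsort(-gmm.weights_) if p_y[0] >= p_y[1] else np.argsort(gmm.weights_)
    mu, sigma = mu[order], sigma[order]

    # Integration grid covering both Gaussians (same +/- 6 sigma rule as above).
    a = mu.min() - 6 * sigma.max()
    b = mu.max() + 6 * sigma.max()
    alpha = np.linspace(a, b, n_points)

    risk = 0.0
    for y in (0, 1):
        # alpha is the class-0 score; the score of class y is +alpha or -alpha.
        alpha_y = alpha if y == 0 else -alpha
        loss = np.maximum(0.0, 1.0 - 2.0 * alpha_y)          # hinge (1 + a_{1-y} - a_y)+
        density = (np.exp(-0.5 * ((alpha - mu[y]) / sigma[y]) ** 2)
                   / (sigma[y] * np.sqrt(2 * np.pi)))        # P(f_theta(X)=alpha | y)
        risk += p_y[y] * np.trapz(density * loss, alpha)     # trapezoidal rule
    return risk
```

Plugging such an estimate into a finite-difference, coordinate-wise gradient descent over θ gives one possible realisation of the unsupervised training loop described above.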


III. TREE KERNELS

Kernel methods explore high-dimensional feature spaces on low-dimensional data, alleviating the burden of meticulously designing and extracting rich features. It is then possible to detect nonlinear relations between variables in the data by embedding the data into a kernel-induced feature space. A kernel is a similarity function over pairs of objects. Convolution kernels compute this similarity based on the similarity of object parts. Tree kernels, for instance, are convolution kernels that measure this similarity by computing the number of common substructures between two trees T1 and T2, exploring in this way rich structured spaces. Tree kernels have been widely used for a variety of NLP applications such as relation extraction [6], [7], semantic role labeling [8], as well as parsing [9] and named-entity recognition re-ranking [10].

We follow here the work of [11], [8] on convolution tree kernels. We explored the following tree spaces: (i) the subset tree (SST) kernel and (ii) the partial tree (PT) kernel (see Figure 1). The former is defined as a tree rooted in any non-terminal node along with all its descendants, in which the leaves can be non-terminal symbols, and which satisfies the constraint that grammatical rules cannot be broken. The latter is a more general form of substructure that relaxes the constraint over the SSTs.

[Fig. 1. (a-c) SST subtrees and (d) a PT subtree for constituency syntactic trees (fragments of "VP -> V NP" over "visited Paris"). Note that in (d) the grammatical rule VP -> V NP is broken.]

We are interested in studying the impact of using dependency trees (i.e. a syntactic representation that denotes grammatical relations between words) in tree kernels. Apart from the previously mentioned work on named-entity recognition re-ranking, there is still little work studying the impact of rich syntactic tree structures for the task of entity recognition. Such features would allow the model to take into account the internal syntactic structure of multi-word entities, but also to potentially model the preferred syntactic relations between named entities and their co-occurring words in the sentence. To address these questions, we studied the impact of structured tree features in supervised models by training and evaluating tree kernel-based models for the binary task of entity detection.

In our experiments we use an optimized SVM implementation of tree kernels, namely fast tree kernels [12], in which a compact representation of trees (i.e. a directed acyclic graph) is used, avoiding the processing of repeated sub-structures and as a consequence reducing the total amount of computation.

IV. ENTITY RECOGNITION FEATURES

The goal of entity recognition is to detect whether any word form in a text refers to an entity or not, where an entity is defined as a mention of a person name, a place, an organization or a product. We use the ESTER2 corpus [13], which collects broadcast news transcriptions in French that are annotated with named entities. It is worth noting that this corpus contains spontaneous speech, which is characterized by an abundance of irregularities that make it difficult to parse, such as ellipsis, disfluencies, false starts, reformulations, hesitations and ungrammaticality (i.e. incomplete syntactic structures) due to pauses and the absence of punctuation, as shown in the example presented in Figure 2 (a), in which a comma is missing just before the entity mention.

Vector Features: The following features are used in the unsupervised experiments with a linear classifier. In these experiments we adapted the supervised linear classifier of the Stanford NLP toolkit¹, to which we added methods to perform risk minimization on an unlabeled corpus. The features used in this context are:
• Character n-grams with n = 4.
• Capitalization: the pattern "Chris2useLC", as defined in Stanford NLP, describing lower case, upper case and special characters in words [14].
• POS tags: the part-of-speech tag of every word as given by the TreeTagger [15].

¹ http://nlp.stanford.edu/nlp

The part-of-speech tags as well as the capitalization of words are common important features for entity recognition, while character n-grams constitute a smoother (less sparse) alternative to word forms and are also often used in this context. The label priors P(y) are set so that 90% of the words are not entities and only 10% of the words are entities. The initial weights are obtained after training the linear classifier on 20 manually annotated sentences. The same set of features is used both in the supervised initialization and in the unsupervised risk minimization iterations.

Tree Features: The following features are used in the supervised experiments with dependency tree kernels. The input dependency trees have been obtained by automatically parsing the corpus with the MATE Parser [16] trained on the French Tree-Bank [17]. The following features are used for training tree kernels (a toy sketch of how such fragments can be extracted is given at the end of this section):
• Top-down tree: the tree fragment in which the current word is the governor (see Figure 2 (b)).
• Bottom-up tree: the tree fragment in which the current word is a dependent (see Figure 2 (c)).

We also consider the following dependency tree variations:
• Emphasize the current word: we created another kind of tree by simply introducing a prefix "CW", which stands for current word, in the node of the tree that contains the word in focus.
• POS-trees: words in dependency trees are represented by their part of speech (POS) instead of their word form.

Therefore, we can combine tree and vector features, as well as use either both types of tree features, top-down (TD) and bottom-up (BU), or only one of them.
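The following toy sketch illustrates how top-down and bottom-up fragments of the kind described above could be produced from a dependency parse. It is only an illustration under assumed data structures (word, head-index and label arrays, and a bracketed string output), not the paper's feature extractor; the toy parse of the Figure 2 sentence is an approximation of the structure shown there.

```python
def top_down_tree(word_idx, heads, labels, words, mark_cw=False):
    """Fragment in which the word in focus is the governor: the word plus
    all of its dependents, each wrapped in its dependency label."""
    node = ("CW-" if mark_cw else "") + words[word_idx]
    children = ["(" + labels[i] + " " + top_down_tree(i, heads, labels, words) + ")"
                for i, h in enumerate(heads) if h == word_idx]
    return node if not children else "(" + node + " " + " ".join(children) + ")"

def bottom_up_tree(word_idx, heads, labels, words, mark_cw=False):
    """Fragment in which the word in focus is a dependent: the chain of its
    governors up to the root, labelled with the dependency relations."""
    path = ("CW-" if mark_cw else "") + words[word_idx]
    i = word_idx
    while heads[i] != -1:                      # -1 marks the root
        path = "(" + words[heads[i]] + " (" + labels[i] + " " + path + "))"
        i = heads[i]
    return path

# Toy parse of "Vous souhaitez aider les enfants Patricia Martin" (labels assumed).
words  = ["Vous", "souhaitez", "aider", "les", "enfants", "Patricia", "Martin"]
heads  = [1, -1, 1, 4, 2, 4, 5]                # index of each word's governor
labels = ["suj", "root", "obj", "mod", "obj", "mod", "mod"]
print(top_down_tree(5, heads, labels, words, mark_cw=True))   # (CW-Patricia (mod Martin))
print(bottom_up_tree(5, heads, labels, words, mark_cw=True))
```

Such bracketed fragments, together with the vector features, are the kind of structured inputs that a tree-kernel SVM can consume.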

V. RESULTS

We present in this section the results of our experiments with both the unsupervised and the supervised tree-kernel models. We removed from the training set (but not from the tree structure) all the words that have been annotated by the TreeTagger with the following parts of speech: punctuation and determiners.

A. Risk minimization

On the Gaussianity assumption: The proposed approach assumes that the class-conditional linear scores are distributed normally. We invite the interested reader to consult [4], where theoretical arguments are given that support the validity of this assumption in various contexts. However, this assumption cannot always be taken for granted, and the authors suggest verifying it empirically.


[Fig. 2. (a) Dependency tree of the sentence "Vous souhaitez aider les enfants Patricia Martin" ("You want to help the children, Patricia Martin"), with suj, obj and mod relations; (b) the top-down tree fragment around "Patricia" (Patricia -mod-> Martin); (c) the bottom-up tree fragment around "Patricia" (souhaitez -obj-> aider -obj-> enfants -mod-> Patricia).]

TABLE I. PERFORMANCE OF THE PROPOSED UNSUPERVISED SYSTEM (SUPERVISED VS UNSUPERVISED).

| System                        | Precision | Recall | F1    |
| Stanford trained on 20 sent.  | 89.8%     | 68%    | 77.4% |
| Stanford trained on 520 sent. | 90.3%     | 84.7%  | 87.5% |
| Unsupervised trap.            | 88.7%     | 79%    | 83.5% |
| Unsupervised MC               | 88.7%     | 79%    | 83.6% |

[Fig. 3. Distribution of f_\theta(X) on the ESTER2 corpus (unlabeled dataset): (a) using the initial weights trained on 20 sentences; (b) using the weights at the final iteration of the gradient descent algorithm. The largest mode is on the right because the entity class is class 1.]

The distributions of f_\theta(X) with the initial and final weights (i.e. the weights obtained after training) on the ESTER2 corpus

are shown in Figure 3 (a) and (b), respectively. These distributions are clearly bi-normal on this corpus, which suggests that this assumption is reasonable for both our NLP tasks.

Experiments with gradient descent: Starting from the initial weights trained on 20 sentences, we now apply the gradient descent algorithm described in [4], which minimizes an estimate of the classifier risk on the full corpus without labels. Table I reports the entity detection results of the initial linear classifier trained on 20 sentences, but also when trained on the full corpus composed of 520 sentences. The latter result shows the best performance that can be reached when manually labelling a large number of training sentences. The objective of our unsupervised approach is to get as close as possible to these optimal results, but without labels. The metric used in these experiments is the F-measure, the harmonic mean of precision and recall:

F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}    (6)

The unsupervised risk estimation method relies on the continuous integration of the bimodal distribution of the classifier scores on the full corpus, which may be relatively costly to perform, especially as this computation is done at every step of the gradient descent. We have thus made preliminary experiments with two numerical integration approaches: the trapezoidal and Monte Carlo methods [5]. These methods are compared next both in terms of computational cost and approximation quality.

Figure 4 shows the approximation error when using the trapezoidal rule for integration. The x-axis represents the number of parameters of the chosen numerical integration method, i.e., here, the number of trapezoids used. The y-axis represents the squared error between the risk estimated with a nearly infinite precision and the risk estimated with numerical integration and a limited number of parameters. We use the root of the squared approximation error to better view the details, because the trapezoidal and Monte Carlo methods are known to converge in O(n^{-2}) and O(n^{-1/2}), respectively. We can observe that increasing the number of trapezoids also increases the accuracy of the numerical integration, and that the approximation error becomes smaller than 10% of the risk value for 20 trapezoids and more. Figure 5 shows a similar curve (on a different figure to have a better precision on the y-axis) but for Monte Carlo integration, where the x-axis represents the number of Monte Carlo iterations. Note that both Figures 4 and 5 show the risk approximation error, and not the final impact of this error on the entity recognition task: this is rather shown in Table I.
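The trade-off just described is easy to reproduce on a toy version of the integral. The snippet below is purely illustrative: an arbitrary Gaussian-times-hinge integrand stands in for one class-conditional term of Eq. 2, and a very fine trapezoidal estimate is used as the reference value.

```python
import numpy as np

def integrand(alpha, mu=1.0, sigma=0.5):
    """One class-conditional term of the risk: Gaussian score density times hinge loss."""
    density = np.exp(-0.5 * ((alpha - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return density * np.maximum(0.0, 1.0 - 2.0 * alpha)

a, b = 1.0 - 6 * 0.5, 1.0 + 6 * 0.5                 # the same +/- 6 sigma bounds
xs = np.linspace(a, b, 100_000)
reference = np.trapz(integrand(xs), xs)             # near-exact value

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    grid = np.linspace(a, b, n)
    trap = np.trapz(integrand(grid), grid)                   # error shrinks roughly in O(n^-2)
    mc = (b - a) * integrand(rng.uniform(a, b, n)).mean()    # error shrinks roughly in O(n^-1/2)
    print(n, abs(trap - reference), abs(mc - reference))
```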

[Fig. 4. Approximation error (root-squared error) of the trapezoidal rule with regard to the number of trapezoids (i.e., segments) used for the numerical integration when computing the unsupervised risk; x-axis: 0 to 100 trapezoids.]

[Fig. 5. Approximation error (root-squared error) of Monte Carlo integration with regard to the number of trials used for approximating the integrals of the risk; x-axis: 0 to 10000 trials.]

[Fig. 6. Computation cost (elapsed time in seconds) of the trapezoidal rule with regard to the number of trapezoids (i.e., segments) used for the numerical integration when computing the unsupervised risk; x-axis: 0 to 100 trapezoids.]

With regard to complexity, Figure 6 shows the computation time, measured in seconds, required to compute the integrals

with the trapezoidal rule during risk minimization. Figure 7 shows a similar curve but with Monte Carlo integration.

[Fig. 7. Computation cost (elapsed time in seconds) of Monte Carlo integration with regard to the number of trials used for approximating the integrals of the risk; x-axis: 0 to 10000 trials.]

The final performance figures are shown in Table I (bottom part). We can observe that the Monte Carlo method takes much more time in our experimental setup, without impacting the final entity detection rate. Indeed, according to our experiments, both trapezoidal and Monte Carlo integration reach the same performance (the differences shown in Table I are not statistically significant) after 2,000 iterations, with an F1-measure of 83.5%. In the following experiments, we have thus chosen the trapezoidal approach. Figures 8 and 9 respectively show the convergence of the risk estimate with trapezoidal integration and of the entity F1-measure as a function of the number of iterations of the gradient optimization. When evaluated on a test set of 167,249 words and 10,693 sentences, both methods outperform the supervised linear classifier trained on 20 sentences.

In general the proposed model is prone to detect person

names that are undetected by the baseline (i.e., the Stanford linear classifier trained on 20 sentences). Table II shows two examples of family names (e.g., Drouelle and Floch-Prigent) that are correctly recognized by our model but ignored by the baseline. Our model also correctly detects entities other than person names, such as the fighter aircraft F16, which are not captured by the initial model. Note also that for the first

name Fabrice and the country Iran, the unsupervised model correctly augments their probabilities (where the probabilities correspond to the normalized scores f_\theta(X) given by the model) of belonging to the entity class.

TABLE II. EXCERPT OF EXAMPLES CORRECTLY CLASSIFIED BY THE UNSUPERVISED APPROACH FOR ENTITY RECOGNITION, IMPROVING OVER THE BASELINE (I.E. THE STANFORD LINEAR CLASSIFIER TRAINED ON 20 SENTENCES). THE LAST COLUMN SHOWS THE OUTPUT PROBABILITY OF THE WINNING CLASS.

| Word          | Baseline class | Baseline prob. | Proposed class | Proposed prob. |
| Fabrice       | Entity         | 0.94           | Entity         | 0.99           |
| Drouelle      | NO             | 0.53           | Entity         | 0.79           |
| Floch-Prigent | NO             | 0.58           | Entity         | 0.69           |
| Iran          | Entity         | 0.66           | Entity         | 0.82           |
| F16           | NO             | 0.73           | Entity         | 0.91           |

[Fig. 8. \hat{R}(\theta) (from Eq. 2) for entity detection as a function of the number of iterations, up to 1600 iterations.]

[Fig. 9. F1 for entity detection as a function of the number of iterations, up to 1600 iterations.]

B. Experiments with Tree Kernels

We have run experiments on tree kernels using as features the top-down (TD) and bottom-up (BU) trees as well as the vector features (i.e. the same features used in the unsupervised experiments). In these experiments we also used the tree kernel spaces (SST and PT) introduced in Section III. Furthermore, we used either dependency trees or modified dependency trees (as explained in Section IV), in which nodes contain the part of speech of words instead of the word form. The baseline is the SVM with a linear kernel where only vector features are used for training. We performed further experiments by introducing or not the tree variations, of which Table III shows a summary.

In general PT kernels perform better than SST kernels, in agreement with [11], where PT was found more accurate when using dependency structures. Indeed, SSTs were mainly designed for constituency trees, as they do not allow trees with broken grammatical production rules. However, SST tree kernels are more accurate than PT when using POS-trees, suggesting that POS tags behave as non-terminals. Although POS tags could help the classifier capture a more generic tree structure without all the word-form variations, word-form dependency trees clearly outperform POS-trees. Bottom-up trees seem to better capture the structural context of entities, because entities are more likely to be dependents (leaves) than governors (heads). In fact, bottom-up trees increase the F1-measure by +0.5 and +0.36 for SST and PT trees respectively. In conclusion, much better results are obtained when combining both top-down and bottom-up trees, especially when using the word-in-focus or current word (CW) distinction.

VI. RELATED WORK

A number of previous works have already proposed approaches to train discriminative models without or with few labels. Please refer, e.g., to [18], [19] for a general and theoretical view on this topic. For NLP tasks several approaches have also been proposed. Hence, the traditional self- and co-training paradigm can be used to leverage supervised classifiers with unsupervised data [20], [21]. [2] exploits the Generalized Expectation objective function, which penalizes the mismatch between model predictions and linguistic expectation constraints. In contrast, our proposal does not use any manually defined prototype nor linguistic constraint. Another interesting approach is Unsearn [1], which predicts the latent structure Y and then a corresponding "observation" \hat{X}, with a loss function that measures how well \hat{X} predicts X. This method is very powerful and generalizes the EM algorithm, but its performance heavily depends on the quality of the feature set chosen for discriminating between the target classes. A related principle is termed "Minimum Imputed Risk" in [22] and applied to machine translation. Our proposed approach also depends on the chosen features, but in a less crucial way thanks to both new assumptions, respectively the known label priors and the discrimination of classes based on individual Gaussian distributions of scores. Another interesting generalization of EM used to train log-linear models without labels is Contrastive Estimation, where the objective function is modified to locally remove probability mass from implicit negative evidence in the neighborhood of the observations and transfer this mass onto the observed examples [3]. Comparatively, the main advantage of our proposed approach comes from the fact that the algorithm optimizes the standard classifier risk, without any modification or constraint. The objective function (and the related optimal parameters) is thus the same as in classical supervised training.

The authors of [23], [24] state the problem of considering syntactic structures for named entity detection as a joint optimization of the two tasks, parsing and named-entity recognition. Although this is a sophisticated solution that avoids cascade errors, the cost of optimizing joint models is high while the improvement is still modest with respect to performing both tasks in a pipeline.


TABLE III. PERFORMANCE OF THE TREE-KERNELS

K. Space | Features                        | Precision | Recall | F1
Linear   | Vector                          | 89.48%    | 79.56% | 84.23%
SST      | TD trees + vector               | 93.55%    | 76.60% | 84.23%
PT       | TD trees + vector               | 93.07%    | 77.49% | 84.57%
SST      | TD POS-trees + vector           | 88.72%    | 75.39% | 81.51%
PT       | TD POS-trees + vector           | 82.85%    | 75.43% | 78.97%
SST      | TD CW trees + vector            | 94.09%    | 76.59% | 84.44%
PT       | TD CW trees + vector            | 93.56%    | 77.73% | 84.91%
SST      | BU trees + vector               | 93.20%    | 77.67% | 84.73%
PT       | BU trees + vector               | 93.35%    | 77.91% | 84.93%
SST      | BU POS-trees + vector           | 85.41%    | 71.48% | 77.82%
PT       | BU POS-trees + vector           | 85.25%    | 68.31% | 75.85%
SST      | BU CW trees + vector            | 93.19%    | 77.22% | 84.46%
PT       | BU CW trees + vector            | 93.07%    | 78.09% | 84.92%
SST      | TD and BU trees + vector        | 94.17%    | 77.71% | 85.15%
PT       | TD and BU trees + vector        | 94.24%    | 78.51% | 85.66%
SST      | TD and BU POS-trees + vector    | 89.95%    | 76.08% | 82.43%
PT       | TD and BU POS-trees + vector    | 85.89%    | 74.75% | 79.93%
SST      | TD and BU CW-trees + vector     | 94.17%    | 78.18% | 85.43%
PT       | TD and BU CW-trees + vector     | 94.26%    | 78.75% | 85.81%
SST      | TD and BU CW-POS-trees + vector | 90.70%    | 74.48% | 81.79%
PT       | TD and BU CW-POS-trees + vector | 86.73%    | 74.19% | 79.97%

VI. RELATED WORK

A number of previous works have already proposed approaches to train discriminative models without or with few labels; see, e.g., [18], [19] for a general and theoretical view of this topic. For NLP tasks, several approaches have also been proposed. Hence, the traditional self- and co-training paradigm can be used to leverage supervised classifiers with unsupervised data [20], [21]. The authors of [2] exploit the Generalized Expectation objective function, which penalizes the mismatch between model predictions and linguistic expectation constraints. In contrast, our proposal does not use any manually defined prototype or linguistic constraint. Another interesting approach is Unsearn [1], which predicts the latent structure Y and then a corresponding "observation" X̂, with a loss function that measures how well X̂ predicts X. This method is very powerful and generalizes the EM algorithm, but its performance heavily depends on the quality of the feature set chosen for discriminating between the target classes. A related principle is termed "Minimum Imputed Risk" in [22] and applied to machine translation. Our proposed approach also depends on the chosen features, but in a less crucial way thanks to two new assumptions, respectively the known label priors and the discrimination of classes based on individual Gaussian distributions of scores. Another interesting generalization of EM used to train log-linear models without labels is Contrastive Estimation, where the objective function is modified to locally remove probability mass from implicit negative evidence in the neighborhood of the observations and transfer this mass onto the observed examples [3]. Comparatively, the main advantage of our proposed approach comes from the fact that the algorithm optimizes the standard classifier risk, without any modification or constraint; the objective function (and the related optimal parameters) is thus the same as in classical supervised training. The authors of [23], [24] state the problem of considering syntactic structures for named entity detection as a joint optimization of two tasks, parsing and named-entity recognition. Although this sophisticated solution avoids cascade errors, the cost of optimizing joint models is high, while the improvement remains modest with respect to performing both tasks in a pipeline. Other works exploit tree kernels for named-entity recognition re-ranking [9], [10]. The authors of [25] further use tree kernels for named entity recognition; however, they use neither SST nor PT kernels. They rather introduce a different tree kernel, the sliding tree kernel, which may not be convolutive like the SST and PT kernels.


VII. CONCLUSION

This work explores two original solutions to improve the traditional discriminative classifiers used in the task of entity detection. These solutions address two classical problems of traditional named entity detection systems: the high cost required to manually annotate a large enough training corpus, and the limitations of the input features, which often encode linear word contexts instead of the more linguistically relevant syntactic contexts. The former problem is addressed by adapting a newly proposed unsupervised training algorithm for discriminative linear models. Contrary to other methods proposed in the literature to train discriminative models without supervision, this approach optimizes the same classifier risk as the one approximated by a supervised classifier trained on a labeled corpus, hence theoretically leading to the same optimal solution. We thus demonstrate the applicability of this approach to the entity detection NLP task, and further study the computational complexity of two numerical integration approaches in this context. We also show that the main assumption of the approach, i.e., the Gaussianity of the class-conditional distributions of the linear scores, is fulfilled in this task. The latter problem is addressed by considering rich structured input features for an SVM, thanks to an adapted tree kernel that exploits dependency graphs automatically computed on broadcast news sentences. Both approaches are validated on the same French corpus for entity detection and exhibit interesting and encouraging performances, which suggest that there is still room for improvement in the task of entity detection thanks to more linguistically rich features and to unsupervised training on larger unlabeled corpora.

ACKNOWLEDGMENT This work was partly supported by the French ANR (Agence Nationale de la Recherche) funded project ContNomina.


R EFERENCES [1] H. Daum´e III, “Unsupervised search-based structured prediction,” in Proc. of ICML, Montreal, Canada, 2009. [2] G. Druck, G. Mann, and A. McCallum, “Semi-supervised learning of dependency parsers using generalized expectation criteria,” in Proc. of ACL, Suntec, Singapore, Aug. 2009, pp. 360–368. [3] N. A. Smith and J. Eisner, “Unsupervised search-based structured prediction,” in Proc. of ACL, 2005. [4] K. Balasubramanian, P. Donmez, and G. Lebanon, “Unsupervised supervised learning II: Margin-based classification without labels,” Journal of Machine Learning Research, vol. 12, pp. 3119–3145, 2011. [5] H. Gould and J. Tobochnik, An introduction to computer simulation methods: applications to physical systems, ser. Addison-Wesley series in physics. Addison-Wesley, 1988, no. v. 1-2. [Online]. Available: http://books.google.fr/books?id=JfpQAAAAMAAJ [6] C. M. Cumby and D. Roth, “On kernel methods for relational learning,” in Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, 2003, pp. 107–114. [Online]. Available: http://www.aaai.org/Library/ ICML/2003/icml03-017.php [7] A. Culotta and J. Sorensen, “Dependency tree kernels for relation extraction,” in Proceedings of the 42Nd Annual Meeting on Association for Computational Linguistics, ser. ACL ’04. Stroudsburg, PA, USA: Association for Computational Linguistics, 2004. [Online]. Available: http://dx.doi.org/10.3115/1218955.1219009 [8] A. Moschitti, D. Pighin, and R. Basili, “Tree kernels for semantic role labeling,” Computational Linguistics, 2008. [9] M. Collins and N. Duffy, “New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ser. ACL ’02. Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 263–270. [Online]. Available: http://dx.doi.org/10.3115/1073083.1073128 [10] T.-V. T. Nguyen and A. Moschitti, “Structural reranking models for named entity recognition,” Intelligenza Artificiale, vol. 6, pp. 177–190, December 2012. [11] A. Moschitti, “Efficient convolution kernels for dependency and constituent syntactic trees,” in In European Conference on Machine Learning (ECML, 2006. [12] A. Severyn and A. Moschitti, “Fast support vector machines for convolution tree kernels,” Data Min. Knowl. Discov., vol. 25, no. 2, pp. 325–357, Sep. 2012. [Online]. Available: http://dx.doi.org/10.1007/ s10618-012-0276-8 [13] S. Galliano, G. Gravier, and L. Chaubard, “The ester 2 evaluation campaign for the rich transcription of french radio broadcasts,” in Proc. of INTERSPEECH, 2009, pp. 2583–2586. [14] D. Klein, J. Smarr, H. Nguyen, and C. D. Manning, “Named entity recognition with character-level models,” in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 Volume 4, ser. CONLL ’03. Stroudsburg, PA, USA: Association for Computational Linguistics, 2003, pp. 180–183. [Online]. Available: http://dx.doi.org/10.3115/1119176.1119204 [15] H. Schmid, “Improvements in part-of-speech tagging with an application to german,” in Proc. Workshop EACL SIGDAT, Dublin, 1995. [16] A. Bj¨orkelund, L. Hafdell, and P. Nugues, “Multilingual semantic role labeling,” in Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, ser. CoNLL ’09. Stroudsburg, PA, USA: Association for Computational Linguistics, 2009, pp. 43–48. [Online]. 
Available: http://dl.acm.org/citation.cfm?id= 1596409.1596416 [17] M.-H. Candito, B. Crabb´e, P. Denis, and F. Gu´erin, “Analyse syntaxique du franc¸ais : des constituants aux d´e pendances,” in Actes de TALN, Senlis, 2009. [18] A. Kapoor, “Learning discriminative models with incomplete data,” Ph.D. dissertation, Massachusetts Institute of Technology, Feb. 2006. [19] A. B. Goldberg, “New directions in semi-supervised learning,” Ph.D. dissertation, Univ. of Wisconsin-Madison, 2010. [20] X. Liu, K. Li, M. Zhou, and Z. Xiong, “Enhancing semantic role labeling for tweets using self-training,” in Proc. AAAI, 2011, pp. 896–901. [21] R. S. Z. Kaljahi, “Adapting self-training for semantic role labeling,” in Proc. Student Research Workshop, ACL, Uppsala, Sweden, Jul. 2010, pp. 91–96.

[22] Z. Li, Z. Wang, J. Eisner, S. Khudanpur, and B. Roark, “Minimum imputed-risk: Unsupervised discriminative training for machine translation,” in Proc. of EMNLP, 2011, pp. 920–929. [23] J. R. Finkel and C. D. Manning, “Joint parsing and named entity recognition,” in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics on ZZZ. Association for Computational Linguistics, 2009, pp. 326–334, computer science, stanford university. [24] ——, “Hierarchical joint learning: Improving joint parsing and named entity recognition with non-jointly labeled data,” in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ser. ACL ’10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 720–728. [Online]. Available: http://dl.acm.org/citation.cfm?id=1858681.1858755 [25] R. Patra and S. K. Saha, “A kernel-based approach for biomedical named entity recognition,” The Scientific World Journal, vol. 2013, 2013.


Cooperative agents-based Decentralized Framework for Cloud Services Orchestration

Zaki Brahmi

Jihen Ben Ali

Higher Institute of Computer Sciences and Communication Techniques Sousse University, Tunisia Email: [email protected]

Higher Institute of Computer Sciences and Communication Techniques Sousse University, Tunisia Email: [email protected]

Abstract—Software as a Service (SaaS) provides complete software systems and is known as «on-demand software». In Cloud Computing, the Business Process Execution Language (BPEL) is widely used for SaaS application development. BPEL is the de facto standard for business process modeling in today's enterprises and a promising candidate for the integration of business processes. It allows describing the control flow needed to orchestrate a set of services into a meaningful business process. However, current BPEL implementations do not provide an orchestration framework that takes into account the quality of service (QoS) of Clouds and services, and they rely on a centralized deployment. In this paper, we propose a decentralized approach to the orchestration of Cloud services using a multi-agent system (MAS). The proposed framework dynamically orchestrates concrete services by delegating an agent to each activity of a BPEL process.
Index Terms—Cloud Computing, Multi-Agent System, Orchestration, BPEL.

I. INTRODUCTION

In service-oriented architecture (SOA) [8], service orchestration is the coordination and arrangement of multiple services exposed as a single aggregate service. In other terms, a single service plays the role of «conductor», as it knows the logic of the composition. One of the languages used to describe service composition is the Business Process Execution Language (BPEL) [7], the de facto standard for business process modeling in today's enterprises. Cloud Computing is a new technology that has changed the entire computer industry, and Cloud orchestration differs from orchestration in service-oriented architecture (SOA): it is a typically complex operation in terms of service virtualization and localization. Orchestration of web services is a process of collaboration of services following predefined patterns, based on decisions about how they interact with one another, whereas Cloud Computing orchestration composes architecture, tools and processes to deliver a defined service. Indeed, in a Cloud Computing environment, services are defined in an abstract way. In addition, when the BPEL process is specified, the location of these services is unknown; during the execution of the process, discovering a specific service is a necessary step. Listings 1 and 2 further illustrate this difference. In Cloud Computing (as shown in

listing 2), the target service to be invoked is described via a so-called partnerLink that - besides others - contains two important elements: (1) partnerLinkType and (2) EndpointReference (EPR). (1) is static information which must be known at design time. It refers to the WSDL description of a partner service’s portType, while (2) refers to the concrete service to be invoked. An EPR contains a service’s name, its port (and binding mechanism) and its address given by a URI. At runtime, the EPR is evaluated and may even be set. Listing 1. Simple BPEL process

[Listings 1 and 2 are not reproduced here; the EndpointReference in Listing 2 carries the service address (http://FQDN:PORT/SERVICE-ADDRESS) and the service name (ns:SERVICE-NAME).]


Listing 2. BPEL process in Cloud Computing environment
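To make the role of the EPR more concrete, the following illustrative Java sketch (not taken from the paper's listings) builds such an EndpointReference at runtime with the standard JAX-WS W3CEndpointReferenceBuilder, once the orchestration engine has discovered which Cloud hosts the concrete service. The namespace and service names below are placeholders.

import javax.xml.namespace.QName;
import javax.xml.ws.wsaddressing.W3CEndpointReference;
import javax.xml.ws.wsaddressing.W3CEndpointReferenceBuilder;

/** Illustrative runtime binding of a partner service through an EPR. */
public class DynamicPartnerBinding {

    public static W3CEndpointReference buildEpr(String address, String namespace, String serviceName) {
        return new W3CEndpointReferenceBuilder()
                .address(address)                                // URI of the discovered service
                .serviceName(new QName(namespace, serviceName))  // ns:SERVICE-NAME of the WSDL service
                .build();
    }

    public static void main(String[] args) {
        // At design time only the partnerLinkType (WSDL portType) is known;
        // the concrete address is filled in at runtime, e.g. after discovery.
        W3CEndpointReference epr = buildEpr(
                "http://FQDN:PORT/SERVICE-ADDRESS",  // placeholder, as in Listing 2
                "http://example.org/ns",             // hypothetical namespace
                "SERVICE-NAME");
        System.out.println(epr);                      // prints the EPR infoset
    }
}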

In reality, Cloud Computing embraces a multi-layered resource stack (IaaS, PaaS and SaaS) orchestrated in an intricate manner to ensure that the delivered applications reach an acceptable QoS level for the users. Moreover, orchestration is divided into two main areas: orchestration of resources (IaaS) [1] [2] and orchestration of services (SaaS) [3] [4] [5]. In our case, we are interested in the orchestration of services (SaaS). To ensure the orchestration, we need a composition language (BPEL) and an orchestration engine, which is the goal of our work. Although some works have been developed in the literature ([3] [4] [5]) and have shown their effectiveness for a limited number of Cloud services, they seem unable to cope with the dynamicity and the growing scale of Clouds and services. Moreover, these approaches do not take into account the quality of service (QoS) of Clouds and services. Indeed, in a highly dynamic environment (services crashing, overloaded servers, services changing their behavior or their address, etc.), new approaches taking into account various aspects such as QoS, response time and the implementation architecture are required. To provide a solution to this problem, we propose in this paper a decentralized approach to Cloud service orchestration based on a multi-agent system (MAS) [9]. Our idea is to delegate an agent to each activity of a BPEL process. Indeed, the use of intelligent and autonomous agents can handle the dynamicity of services in Cloud Computing. These agents cooperate to find the best orchestration, seeking the best Cloud that offers the desired service during the execution of the BPEL process. The paper is organized as follows: major related work is discussed in Section 2; Section 3 outlines our current work, which expands the previous one by incorporating multi-agent system technology; Section 4 is dedicated to the interaction protocol; finally, Section 5 concludes the paper.

II. RELATED WORK

The primary area of research related to this work is Cloud orchestration based on the BPEL language. This section presents some prominent approaches dealing with this issue. To orchestrate Software as a Service (SaaS), the authors of [3] propose an approach that describes an abstract process of SaaS as a BPEL program. With this approach, distributed services can be orchestrated by describing the virtual information of the services without identifying the actual information of the service. In addition, to improve the performance and the success rate of service orchestration, the authors select the service with minimal ping time among the candidate services. However, the

presented approach does not consider the quality of service (QoS). In this paper, we introduce an approach that aims at avoiding similar disadvantages. The approach presented in [4] exploits a BPEL implementation to ensure the dynamicity of services, based on the principle of choosing, at runtime, the target machine with a low load. In peak-load situations, this approach automatically launches a virtual machine from a Cloud Computing infrastructure. This may solve the problem, since BPEL implementations do not provide features for scheduling service calls based on the load of possible target hosts when using Cloud Computing infrastructures in peak-load situations. However, the authors did not take into consideration the fact that using Cloud Computing resources can be costly, because data is frequently transferred from inside to outside the Cloud; in addition, the approach in [4] did not take the data flow into consideration while scheduling. The authors of [5] present a solution that takes into account the data dependencies between workflow steps and the utilization of resources at runtime, in order to avoid the problems caused by the previous work [4]: (1) due to frequent data transfers between hosts, the workflow system's throughput might be suboptimal, and (2) when Cloud resources are used, workflow execution might be expensive since data is frequently transferred into and out of the Cloud. Assign operations (which copy data from a source variable to a destination variable) are used to model the data flow between activities. Once the data flow graph has been generated, a heuristic algorithm is applied; it takes into account the current load of all possible target machines, the network bandwidth between the machines and the amount of data to be transferred between them. After scheduling a critical path computation, the requested resources are allocated by means of a reservation mechanism. The problem with this approach is that the authors did not consider the cost of using Cloud resources and the consequences of deploying a centralized architecture. In [6], the authors have proposed a decentralized and dynamic platform for the orchestration of web services that takes exception handling into account. The proposed solution relies on a multi-agent system (MAS) framework in order to facilitate migration between nodes across the network, and the authors use mobile agents to deal with system exceptions.

III. CLOUD SERVICE ORCHESTRATION SOLUTION

A. Background Work

Our framework builds on the work studied in [6]. Our approach is based on autonomous agents: their autonomy and sociability facilitate the implementation of an orchestration engine while taking into consideration the quality of service (QoS: response time, consumption cost, scalability, etc.) of each visited Cloud. Our approach is based on the following ideas:

59

• For each requested service, find the best Cloud offering this service;
• Organize the directory of services by classifying services according to their inputs/outputs;
• Delegate an agent to each activity, in order to distribute the work among cooperative agents.

B. The Framework Orchestration

The architecture, as illustrated in figure 1, is organized into three layers:

User layer: This layer announces that a process or service has just started. It is the interface between the user and the system. It encloses the User Agent (UsrAg), which offers the user all the functionalities needed to specify his BPEL process P and returns to him the result of the execution of P.
Intermediate layer: This is the main layer of our system; it represents the orchestration engine of Cloud services. It contains the BPEL Orchestration Engine Agent (BPELOEAg), the Invoke Agent (InvAg) and the Provisioner Agent (ProvAg).
Service layer: This layer aims to make Cloud services accessible to our framework over the Internet.

Fig. 1. Architecture of the framework of Cloud Services Orchestration

C. Architecture and role of agents

Our framework consists of four types of agents: User Agent (UsrAg), BPEL Orchestration Engine Agent (BPELOEAg), Invoke Agent (InvAg) and Provisioner Agent (ProvAg). As shown in figure 2, the AUML class diagram [13] [14] describes the types of agents and their relationships in the static system. The classes «UsrAg», «BPELOEAg», «InvAg» and «ProvAg» inherit from the class «Agent», and the package «Directory» groups the services offering the same functions into a service class. This organization can accelerate the search for the Clouds providing the desired service.

Fig. 2. Class diagram illustrating the used agents

User Agent (UsrAg): the initiator of the conversation. The UsrAg presents a graphical interface through which the user can select the BPEL and WSDL files.
BPEL Orchestration Engine Agent (BPELOEAg): the orchestration engine of our system; it interacts with the other agents. Its role is to parse the BPEL process and, for each activity, to launch an InvAg. Figure 3 shows the architecture of the BPELOEAg.


Fig. 3. BPELOEAg architecture

Invoke Agent (InvAg): this agent is responsible for invoking services and handling exceptions. It communicates with the ProvAg to get the list of Clouds offering the invoked service. After receiving this list, it is sorted in


the ascending order of QoS to get the best Cloud offering the desired service. Figure 4 shows the architecture of InvAg.

Fig. 4. InvAg architecture

The operating process of this agent is the following:
1) The InvAg sends the characteristics of the requested service to the ProvAg;
2) The ProvAg sends back to the InvAg the list of Clouds offering the desired service;
3) The InvAg sorts this list according to the QoS of the Clouds;
4) The InvAg invokes the desired service;
5) The Cloud returns the result of the requested service;
6) The InvAg verifies the result, which can be:
• a response from the proper execution of the called service, which is then communicated to the BPELOEAg;
• a system exception (SE), in which case the parameters of the invoked service are communicated to the ProvAg; otherwise, in the case of a business logic exception (BLE), the result is communicated to the BPELOEAg.
Provisioner Agent (ProvAg): this agent acts as a directory containing the list of available Clouds. It contains two modules (the registry and the provisioner) to accomplish its task.
a) Registry module: it contains information about the machines and the services offered by each machine. Services offering the same functions are grouped into a service class1. These classes are then connected according to a dependency graph [12], in which nodes represent service classes1 and arcs represent the connections between these classes2. This grouping speeds up the search on the one hand and allows searching in terms of QoS on the other hand. Figure 5 shows an example of the service classes used in our approach. In this example, the services offering the same functionality (input/output) are grouped together. For instance, the providers of the service S11 are C1 and C3, whose optimal QoS values are assumed to be 20 ms and 10 ms respectively; the InvAg would therefore choose C3 as the Cloud offering the service S11, with an optimal QoS of 10 ms (see the sketch after Fig. 5).

1 A class managing a set of Cloud services that provide similar functionalities (inputs and outputs) with different providers.
2 Two classes are connected according to the following rule: class1 is matched to class2 if and only if some outputs of class1 can match some inputs of class2.

Fig. 5. The service classes
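The following minimal Java sketch (ours, purely illustrative; class and field names are not part of the framework) shows how the registry could group Cloud offers into service classes keyed by their input/output signature, and how the InvAg could then pick the offer with the lowest QoS value, reproducing the S11 example of Fig. 5.

import java.util.*;

/** One offer of a service by a Cloud, with its QoS value (lower is better). */
class CloudOffer {
    final String cloudId;
    final double qosMillis; // e.g. response time + latency
    CloudOffer(String cloudId, double qosMillis) { this.cloudId = cloudId; this.qosMillis = qosMillis; }
}

/** A service class groups all offers with the same inputs/outputs. */
class ServiceClass {
    final String signature;
    final List<CloudOffer> offers = new ArrayList<>();
    ServiceClass(String signature) { this.signature = signature; }
}

public class Registry {
    private final Map<String, ServiceClass> classes = new HashMap<>();

    public void register(String signature, String cloudId, double qosMillis) {
        classes.computeIfAbsent(signature, ServiceClass::new)
               .offers.add(new CloudOffer(cloudId, qosMillis));
    }

    /** Returns the offers sorted in ascending order of QoS, as the InvAg does. */
    public List<CloudOffer> lookup(String signature) {
        List<CloudOffer> result =
                new ArrayList<>(classes.getOrDefault(signature, new ServiceClass(signature)).offers);
        result.sort(Comparator.comparingDouble(o -> o.qosMillis));
        return result;
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.register("S11", "C1", 20.0); // values taken from the Fig. 5 example
        registry.register("S11", "C3", 10.0);
        System.out.println("Best Cloud for S11: " + registry.lookup("S11").get(0).cloudId); // C3
    }
}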

b) Provisioner module: it provides general abstractions (enabling/disabling virtual machines) and collects information about the load of the virtual machines (VMs). When a machine is turned off, the provisioner must inform the registry so that the data is updated (and this host is no longer selected when looking for a service).

Fig. 6. ProvAg architecture

Figure 6 shows the architecture and the operation process of the ProvAg:
1) The provisioner module collects information about the load of the virtual machines (VMs) by sending a ping;
2) When a machine is off, the provisioner module updates the directory; in this case, the registry module removes all the services installed on this machine.

IV. THE INTERACTION PROTOCOL

The agents interact with one another to achieve the whole process. To communicate, they use the FIPA language [10]. In addition to the communication primitives [11] of the FIPA-ACL language, the conversations between agents can be described through the following primitives:
• Request_Execute(BPEL): this message is sent by the UsrAg to the BPELOEAg in order to start the execution of the BPEL file.


• Start_InvAg(): this message is sent by the BPELOEAg to the InvAg as a launch order.
• Return(Ci, Ai): the ProvAg sends this message to give the InvAg Ai the list Ci.
• Select(Ci): select the best Cloud from the list Ci based on the QoS value.
• Invoke(Sj): the InvAg invokes the service Sj.
• Check(R): the InvAg checks the result returned by the Cloud; the result R can be a success or a BLE.
The operation process of our system is as follows (Figure 7):
1) Initially, the User Agent (UsrAg) receives the user query, which represents the BPEL process, and sends it to the BPELOEAg;
2) The BPELOEAg parses the BPEL file and delegates an InvAg for each activity;
3) The InvAg contacts the ProvAg to request the list of Clouds offering the desired service (a simplified JADE sketch of steps 3 to 5 is given below);
4) The ProvAg selects, among the available Clouds, those which provide the desired service and sends the resulting list to the InvAg;
5) The InvAg orders the returned list according to the QoS;
6) The InvAg invokes the requested service;
7) Finally, the InvAg checks the result returned by the selected Cloud and informs the BPELOEAg.

Fig. 7. Sequence diagram of the framework
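The sketch below (ours, simplified, not the authors' implementation) illustrates the InvAg side of this exchange on the JADE platform used in Section V: it sends a FIPA REQUEST to the ProvAg, waits for the INFORM reply and keeps the best-QoS Cloud. The agent names and the plain-string content encoding are assumptions made only for the example.

import jade.core.AID;
import jade.core.Agent;
import jade.core.behaviours.OneShotBehaviour;
import jade.lang.acl.ACLMessage;

/** Illustrative InvAg behaviour for steps 3 to 5 of the interaction protocol. */
public class InvAgSketch extends Agent {

    @Override
    protected void setup() {
        addBehaviour(new OneShotBehaviour() {
            @Override
            public void action() {
                // Step 3: ask the ProvAg for the Clouds offering service "S11".
                ACLMessage request = new ACLMessage(ACLMessage.REQUEST);
                request.addReceiver(new AID("ProvAg", AID.ISLOCALNAME));
                request.setContent("S11");
                send(request);

                // Step 4: the ProvAg answers with the list, e.g. "C3:10,C1:20" (assumed encoding).
                ACLMessage reply = blockingReceive();
                if (reply != null && reply.getPerformative() == ACLMessage.INFORM) {
                    // Step 5: the list is assumed to be sorted in ascending order of QoS.
                    String bestCloud = reply.getContent().split(",")[0];
                    System.out.println("Invoking service S11 on " + bestCloud);
                }
            }
        });
    }
}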

V. IMPLEMENTATION AND EVALUATION

The experiments were performed on an Intel(R) Core(TM)2 Duo CPU T6570 @ 2.10 GHz with 3 GB of RAM. To implement the various agents of our system, we used the JADE multi-agent platform under ECLIPSE. We first used a benchmark containing a set of Clouds; each Cloud is characterized by its id, QoS, address, machines and the services available on each machine. Our experiments evaluate the execution time with respect to the number of invoke activities and to the number of Clouds available in the system. To simplify the understanding of the functioning of our system, the QoS is represented as a single value corresponding to the sum of the response time and the latency. In the first test, we set the number of available Clouds to 20 and varied the number of invoke activities. Figure 8 shows two curves representing the evaluation of the execution time with respect to the number of invoked services: the first curve corresponds to our approach (Cloud Service Orchestration) and the other to the approach presented by K. Jong et al. [3]. The results show that our solution is much faster; moreover, the scalability of our system allows it to generate an optimal exact orchestration in terms of QoS in a time that does not exceed 4000 ms.

Fig. 8. Execution time compared to the number of the invoked service

In the following, we varied the number of Clouds while setting the number of invoked activities to 4. Figure 9 shows the evaluation of the execution time of our approach. The number of Clouds was increased up to 20 in order to have a clear comparison between the two approaches. The results show that our system is more efficient than the approach developed by K. Jong et al. [3].


Fig. 9. Execution time compared to the number of Clouds

VI. C ONCLUSION In this paper, we introduce our research efforts to address Cloud service orchestration issues. The characteristics of our approach are the dynamicity that can support a highly dynamic environment, a decentralized architecture as well as taking into account the different QoS metrics. As a future work, we intend the implementation of our real Cloud orchestration engine environment by using the Open source platforms (such as OpenStack, CloudStack, etc.) and improving the scheduling method. R EFERENCES [1] L.Changbin, M.Yun, F.Mary, V.Jacobus, «Cloud Resource Orchestration: A DataCentric Approach», 5th Biennial Conference on Innovative Data Systems Research, California, USA, January-2011 [2] L.Changbin, M.Yun, L.Boon Thau, «Declarative Automated Cloud Resource Orchestration», SOCC-11, Cascais, Portugal, October-2011 [3] K.Jong-Phil, H.Jang-Eui, C.Jae-Young,C.Young-Hwa, «Dynamic Service Orchestration for SaaS Application in Web Environment», ICUIMC-12, Kuala Lumpur, Malaysia, February-2012 [4] D.Tim, J.Ernst, F.Bernd, «On-Demand Resource Provisioning for BPEL Workflows Using Amazon’s Elastic Compute Cloud», 9th IEEE/ ACM International Symposium on Cluster Computing and the Grid, Shanghai, China, 2009 [5] D.Tim, J.Ernst, N.Thomas, S.Dominik, F.Bernd, «Data Flow Driven Scheduling of BPEL Workflows Using Cloud Resources», IEEE Computer Socety, Marburg, Germany,2010 [6] Zaki BRAHMI, Mounira ILAHI, «Enhancing Decentralized MAS-based Framework for Composite Web Services Orchestration and Exception Handling by means of Mobile Agents Technology», Active Media Technology, 5th International Conference, AMT 2009, Beijing, China, October 22-24, 2009 [7] B.Mathew, J.Matjaz, Poornachandra, «Business Process Execution Language for WS», ISBN : 1904811817, January-2006 [8] B.Douglas, D.David, «Web Services, Service-Oriented Architectures and Cloud Computings», The Savvy Manager’s Guide Second Edition, 2013 [9] J.Nick, «Intelligent Agents: Agent Theories, Architectures, and Languages», Springer-Verlag USA-1999 [10] FIPA English Auction Interaction Protocol Specification, «http://www.fipa.org/specs/fipa00031/index.html», 2001 [11] T. Finin, R. Fritzson, and D. McKay, «An overview of KQML : A Knowledge Query and Manipulation Language», Technical report, University of Maryland Baltimore Country, 1992 [12] Zaki Brahmi, M.M. Gammoudi, «QoS-aware Automatic Web Service Composition based on Cooperative Agents», IEEE 22nd International Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises, (WETICE), 2013, pp.27, 32, 17-20 June 2013. [13] O.James, P.Dyke, B.Bernhard, «Representing Agent Interaction Protocols in UML», AAAI conference, Barcelona, 2000 [14] O.James, P.Dyke, B.Bernhard, «Extending UML for Agents», AgentOriented Information Systems Workshop at the 17th National conference on Artificial Intelligence, 2000



Proposition of Secure Service Oriented Product Line

Achour Ines, Lamia Labed and Henda Ben Ghezala

Abstract— The SOPL approach (Service Oriented Product Line) can be used in various domains where SOA-based applications are needed, such as e/m-government, e-business, e-learning and so on. This approach is a combination of Service-Oriented Architecture (SOA) and product line engineering. SOA offers a solution to problems of integration and interoperability between services, but it does not follow a specific reuse approach. In the Product Line approach, reuse is systematic, with two phases (domain engineering and application engineering), and mechanisms are offered for managing variability. Ensuring secure services is vital in order to establish trust between users and service providers. In this context, we aim to extend the SOPL phases with security activities in order to produce secure service-oriented applications, and for each activity we indicate means (methods, tools, etc.) that allow us to perform the security activities that we have introduced. In this work, we propose Secure SOPL: a development process that integrates security into the SOPL process. The originality of our work lies in the integration of security into SOPL: existing works either integrate security into classical development processes or enrich SPL with security activities, but they do not treat the SOPL approach.

problems; however it has not offered solutions to security problems especially as we live in a connected and opened world. The idea is to integrate security activities throughout SOPL development cycle. We are based on the idea that the improvement of software products must go through improving their development process [1]. This leads us to propose Secure SOPL: a development process that integrates security in the SOPL process and that aims to produce secured software products based on SOA. In this paper, we present in section 2 an overview of software security and in section 3 we outline SOPL approach. We reserve section 4, 5 and 6 for introducing Secure SOPL and its two phases. In section 7 we illustrate the feasibility of our work with a brief case study and in section 8 we give an overview of related works. Finally, section 9 summarizes our proposal and outlines our future work.

Key words— secure domain engineering, secure application engineering, Service Oriented Product Line, software security.

According to [27, 28], we can classify secure software development into four categories: reactive, corrective, proactive and hybrid approaches. - Reactive approaches are adopted in maintenance phase. They are performed by adding security mechanisms at the end of the development process, independently of the other features of the system, to protect the application against potential threats. - Corrective approaches are used with iterative transition between implementation and test phases. - Proactive or preventive approaches which are development processes in which most security problems are addressed and resolved in advance. The applications resulting of this approach are designed and implemented so that they can function correctly in the presence of attacks. In this paper, we adopt this approach and we integrate security issues early in a SOPL process. - Hybrid approaches are a mixture of previously cited ones but it seems very difficult to apply because of the luck of processes and practices to support this kind of approaches. As we focus on the development of secure services which are in reality software applications, we can not ignore efforts in strengthening the assurance of security services [2], [16] and the proposition of WS-Security [35] which is an important building block for fending attacks. But, [30]

II. SOFTWARE SECURITY ISSUES

I. INTRODUCTION


Nowadays, companies rely on distributed architectures and more specifically on service-based architectures, namely SOA (Service Oriented Architecture). This type of architecture is a solution to problems of integration and interoperability between services. In addition, companies hope to satisfy customers' needs such as reducing the cost, effort and time to market of software products. These advantages are offered by the SOPL approach (Service Oriented Product Line), which is a combination of the key concepts of SOA and those of SPL (Software Product Line). It is true that SOPL could solve some
Achour Ines is with the Department of Computer Science, ENSI, Lab. RIADI-GDL, University of Manouba (e-mail: [email protected]). Lamia Labed is with the Department of Computer Science, ENSI, Lab. RIADI-GDL, University of Manouba (e-mail: [email protected]). Henda Ben Ghezala is with the Department of Computer Science, ENSI, Lab. RIADI-GDL, University of Manouba (e-mail: [email protected]).


28 produce during the domain analysis a list of components, services and composite services candidates for reuse. Afterward, and during the domain design, the reference architecture is produced taking into account the variability of different services and components. Regards that there are different stakeholders involved in a project, the reference architecture is presented through different views (structural view, layer view, interaction view, dependency view, concurrency view and physical view) [26]. We conclude this phase with the implementation of different components, services and services orchestration. We start the application engineering [3], with the application analysis phase on which we select components, services and services orchestration from the list identified in the domain engineering. The selection responds to the functionalities required by the developed application. Next, the configuration and the specialization of selected components, services and composite services are performed in order to propose a specific architecture of the developed application. Product construction concludes the application implantation phase. Several studies have focused on the SOPL approach [3], [8], [12], [17], [22], [38], [44] but we have not identified works treating the concept of security and its influence on this approach.

notice that the use of WS-Security is not enough to prevent against network threats. III. SOPL PROCESS Service Oriented Product Line (SOPL) was introduced at the 11th edition of the International Conference SPLC (the ‘Software Product Line Conference’) in 2007. The SOPL approach is a combination of Software Product Line engineering and Service Oriented Architecture approach providing thus solutions to many common software problems as reuse and interoperability issues [17]. It allows developing applications oriented SOA as Software Product Lines (SPL). In fact, Service Oriented Architecture (SOA) is an approach that aids solving integration and interoperability problems [17]. Nevertheless, it does not provide support for systematic planned reuse as does Software Product Line engineering. This latter has the principles of variability, customization and systematic planned reuse and can be used to assist SOA to reach its benefits. Thus, the term ServiceOriented Product Line is used for service oriented applications that share commonalities and vary in an identifiable manner. Therefore, we deal with a service line which is a group of services sharing a common set of features. In fact, SOPL process introduces variability management concepts into Service Oriented Architecture and applies SPL engineering life cycles as shown in figure 1.

IV. SECURE SOPL PROCESS For synthesizing the most important security activities and indispensable for secure development process, we based our study on the three validated secure development processes which Touchpoints approach of McGraw, CLASP of OWASP and SDL of Microsoft [5], [18], [37], [41]. These are the most used and proven processes in this field. We integrate the activities identified on the life cycle of the SOPL described in [3], [8], [12], [17], [22’], [38], [44]. This leads us to strengthen security throughout the SOPL approach. Our Secure SOPL inherits from the classic SOPL his two principle phases. We show in figure 2 the two phases of Secure SOPL: - Secure domain engineering phase which defines development for reuse. It has for aims to offer components and services useful for the derivation of secure applications based on SOA. - Secure application engineering phase that presents the development with reuse. Its aim is to use components and services developed in the secure domain engineering to quickly produce secure application based on SOA.

Fig. 1. Service Oriented Product Line process [21].

V. SECURE DOMAIN ENGINEERING

The domain engineering phase defines the development for reuse. The application engineering phase presents the development with reuse. The SOPL process begins with a domain engineering phase [17]. It uses the feature model and the business process model as inputs [12], in order to

Secure domain engineering is divided into four phases, namely, 1) Training and awareness, 2) Domain analysis, 3) Domain design and 4) Domain implantation.



Fig. 2. Secure SOPL (overview of the secure domain engineering and secure application engineering phases and their activities, from training and awareness through domain/application analysis, design and implantation, up to testing, documentation and service publication).

28 by security choices undertaken during the architecture specification. Thus, according to [19] several security principles can be considered such as the assumption that any interaction with a third-party product is risky and so by considering incoming data as suspect and isolating critical assets, etc. - Mitigate security risks which consists on the synthesis and prioritization of risks (developers can use RMF, STRIDE or DREAD, etc) and the definition of the risk mitigation strategy (for example they can remove the feature, solve the problem, etc. [18]). - Built of the reference architecture of our service line. We notice that we can produce different architectural views [17] (a structural view for representing the static structure of the architecture specified; a layer view for representing components and services organized in their layers; an interaction view showing communications and interactions between components and services when achieving a particular functionality; a dependency view showing the dependence information among services and components; a concurrency view showing parallel communication among services and components; a physical view showing the communication protocols and distribution of components and services) [26] as cited above.

A. Training and awareness The three secure development processes mentioned above [5], [18], [37], [41] emphasize the importance of training and education in security software to every team member of the project. Thus, during this phase we propose that developers community for reuse are trained in security engineering basics in order to increase their awareness of the importance of the domain problems on which they operate and its broad scope. In this sense, it is essential to: -Plan regular courses for the project reuse development team [41]. These courses cover the latest security software issues since it is an evolving field where it frequently reflects the emergence of new threats. -Promote sharing and communication artifacts such as security threat models within the development team [37]. -Assign a security adviser to the project. This adviser helps the developers with security related fields [37]. This phase is used for the capitalization of knowledge in software security. This knowledge will be used firstly by the members of the development team for reuse and secondly by the members of the development team by reuse. B. Domain analysis For this phase, we can list the steps below: - We start with identifying and analyzing business requirements. We are based on features models and the business process model to identify a list of components, services and services orchestration candidates for reuse. After that we conduct the variability analysis activity. This activity starts with an analysis of similarities and variabilities between services and components with the purpose to reduce the number of service candidates [17]. The literature [20], [40] proposes several mechanisms of variability management such as parameterization, inheritance, conditional compilation, etc. - Referring to knowledge acquired on the Training and awareness and based on information gathered in the previous step, we identify security requirements of the studied domain. To carry out this step several means are at our disposal. Among these means we quote: Abuse Case Diagram [23], SQUARE methods [33], RMF (Risk Management Framework [18] or standards such as Common Criteria [9, 10, 11]. - Once the domain security needs are identified, the developers must identify threats that can harm the studied domain. This is possible through a variety of ways which are namely: the bases of vulnerabilities [34], [36], [42] the STRIDE method [31], the Attack Tree [39], Threat Tree [15] or standards [9, 10, 11].

D. Domain implantation Domain implantation consists on the implementation of components, services and services orchestration identified and on their validity according to the different tests that we can conduct. Over this phase we can find the steps below: - Components and services implementation: during this step developer for reuse build components, services and services orchestrations. They should respect the development instructions and guidelines identified in previous phases and respect a set of practices [29] to improve the security code such as: choosing a programming language that offers more security (such J2EE [40]), using security standards for service (for example using XML Encryption to provide confidentiality [4]), using code analysis tools [6], etc. - The developers for reuse must prepare documentation to the development community with reuse by defining software security practices during the secure engineering domain and by presenting instructions and conditions for using the developed components and services. - Components and services testing: after finishing the coding stage, the development team for reuse focuses on testing the developed components and services. They precede unitary test for such components and services (we can use White-hat or Black-hat approaches [5]). The secure domain engineering is finalized by the publication of the developed services service repository (UDDI) [32].

C. Domain design For the domain design, we can list the steps below: - The developers start with the definition of security strategies. The robustness of a system proceeds imperatively



VI. SECURE APPLICATION ENGINEERING

After detailing the secure domain engineering, we address the secure application engineering, which is divided into four phases: 1) Training and awareness, 2) Application analysis, 3) Application design and 4) Application implantation.

A. Training and awareness

At this level, training and awareness concern the development team by reuse. We recommend that project managers and the other stakeholders use the knowledge consolidated by the development team for reuse. They should also encourage sharing all artifacts among the development team by reuse.

B. Application analysis

For this phase, we can list the steps below:
- The developers with reuse must identify the specific requirements of the application, based on the study conducted in the Secure Domain Engineering phase.
- They must select the common security requirements from the knowledge gained during the domain analysis.
- They must identify the specific security requirements which are not common to the applications of our service line and which characterize the studied application. This step can be conducted with the same means used for the identification of security requirements carried out in the domain analysis, namely: Abuse Case Diagrams [23], the SQUARE method [33], RMF (Risk Management Framework) [17] or standards such as the Common Criteria [9, 10, 11].
- Once the application security needs are identified, the developers must identify the threats that can harm the application. First, the developers must select the common threats based on the knowledge gained during the domain analysis. Second, they must identify the threats specific to the studied application. This can be done using vulnerability databases [34], [36], [42], the STRIDE method [31], Attack Trees [39], Threat Trees [15] or standards [9, 10, 11]; these are the same means used in the threat identification step performed in the secure domain engineering.

C. Application design

For the application design, we can list the steps below:
- The team of developers with reuse starts by selecting the security strategies appropriate to the studied application and defines possible new ones.
- The developers must conduct a global threat modeling. This is explained by the fact that the threat modeling of the secure domain engineering phase is conducted for each identified component, service and services orchestration: it is a local threat modeling. The threat modeling of this phase, however, must be performed for all components, services and services orchestrations together: it is a global threat modeling. This step can be supported by the STRIDE method [31], as illustrated in the sketch below.
- The next step consists of the synthesis and prioritization of all risks (risks selected from the ones identified in the domain design and risks suggested to fit the requirements of the studied application); this step also defines the risk mitigation strategy.
- The architecture specification is performed by instantiating the reference architecture, keeping only the components, services and services orchestrations selected for the studied application.
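The following minimal Java sketch (ours, not part of Secure SOPL) simply records the STRIDE categories assigned to the threats gathered during global threat modeling; the categorization of the example threat, taken from the case study of Fig. 5, is only indicative.

import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative container for STRIDE-classified threats. */
public class StrideModel {

    enum Stride {
        SPOOFING, TAMPERING, REPUDIATION,
        INFORMATION_DISCLOSURE, DENIAL_OF_SERVICE, ELEVATION_OF_PRIVILEGE
    }

    private final Map<String, Set<Stride>> threats = new LinkedHashMap<>();

    public void classify(String threat, Set<Stride> categories) {
        threats.put(threat, categories);
    }

    public Map<String, Set<Stride>> getThreats() {
        return threats;
    }

    public static void main(String[] args) {
        StrideModel model = new StrideModel();
        // Example threat from Fig. 5; the chosen categories are indicative only.
        model.classify("The hacker obtains the credentials by monitoring the network",
                EnumSet.of(Stride.INFORMATION_DISCLOSURE, Stride.SPOOFING));
        model.getThreats().forEach((t, c) -> System.out.println(t + " -> " + c));
    }
}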

D. Application implantation

For this phase, we can list the steps below:
- The developers search for the services that fit the studied application. They can find a requested service and invoke it [32] synchronously, asynchronously or through an ESB (Enterprise Service Bus), combine existing services, adapt existing ones or implement new ones.
- To validate the developed application, we propose to rely on integration tests conducted by the developers before deployment and on acceptance tests performed by the end user after product deployment. Integration tests [43] validate that components, services and services orchestrations developed independently work together coherently; these tests can be automated by tools such as Eggplant [45] and JUnit [24]. Acceptance tests [7] allow testing the product in its real environment by the end user; the client may then provide the development team with feedback from the performed tests.
- After ensuring that the developed application fits the business requirements and the security requirements, the development team with reuse must produce the documentation and guides for the end users.

VII. CASE STUDY

In order to show the feasibility of our Secure SOPL, we study a range of governmental services offered by the Tunisian Ministry of the Interior and Local Development, such as the request for the National Identity Card (CIN), the Passport and the Bulletin n°3 (an official paper stating whether the person has a criminal record or not). We proceed by following the different steps cited above; because of the limited space, we present only some steps of the Secure Domain Engineering. To analyze our domain, we use the feature model (illustrated in Fig. 3, with a sketch after the figure), which is based on the study of the business requirements of our service line. This model is based on a hierarchy of characteristics (functional, non-functional or parameters) where some branches are mandatory, some are optional, and others are mutually exclusive [25]; it shows us the commonality and the variability of the services.



Fig. 3. Feature Model of the online administration
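To make the notation of Fig. 3 concrete, the following Java sketch (ours) models a FODA-style feature node with mandatory, optional and mutually exclusive (alternative) children; since the figure itself is not reproduced, the feature names below are hypothetical.

import java.util.ArrayList;
import java.util.List;

/** Illustrative FODA-style feature node; ALTERNATIVE children of the same parent are mutually exclusive. */
public class Feature {

    enum Kind { MANDATORY, OPTIONAL, ALTERNATIVE }

    final String name;
    final Kind kind;
    final List<Feature> children = new ArrayList<>();

    Feature(String name, Kind kind) {
        this.name = name;
        this.kind = kind;
    }

    Feature add(Feature child) {
        children.add(child);
        return this;
    }

    public static void main(String[] args) {
        // Hypothetical extract of an online-administration service line.
        Feature root = new Feature("OnlineAdministration", Kind.MANDATORY)
                .add(new Feature("Authentication", Kind.MANDATORY)
                        .add(new Feature("PasswordLogin", Kind.ALTERNATIVE))
                        .add(new Feature("CertificateLogin", Kind.ALTERNATIVE)))
                .add(new Feature("OnlinePayment", Kind.OPTIONAL));
        System.out.println(root.name + " has " + root.children.size() + " top-level features");
    }
}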

[Fig. 4 content: abuse cases of a malicious user, namely making illegal demands of documents, theft of server identity, theft of data, theft of user identity, listening to the traffic and authentication data interception, and disclosure of sensitive data.]

Fig. 4. An extract of the abuse case diagram of the online administration.

Threat                                                        | D | R | E | A | D | Total | Level
The hacker obtains the credentials by monitoring the network | 3 | 3 | 2 | 2 | 2 | 12    | High

Fig. 5. An extract of a risk evaluation using the DREAD method.
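The computation behind Fig. 5 can be sketched as follows (our illustration): each DREAD criterion is rated from 1 to 3 and the total drives the risk level. The Low/Medium/High bands used below are the commonly used ones for a 1-3 scale and are an assumption, not taken from the paper.

/** Minimal sketch of the DREAD evaluation of Fig. 5. */
public class DreadScore {

    public static int total(int damage, int reproducibility, int exploitability,
                            int affectedUsers, int discoverability) {
        return damage + reproducibility + exploitability + affectedUsers + discoverability;
    }

    public static String level(int total) {
        if (total >= 12) return "High";    // 12-15 (assumed band)
        if (total >= 8)  return "Medium";  // 8-11  (assumed band)
        return "Low";                      // 5-7   (assumed band)
    }

    public static void main(String[] args) {
        // Threat from Fig. 5: "The hacker obtains the credentials by monitoring the network"
        int total = total(3, 3, 2, 2, 2);
        System.out.println(total + " -> " + level(total)); // prints: 12 -> High
    }
}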


Fig. 6. Static view of the reference architecture



To identify the requirements of the secure domain, we choose to model our security requirements with an abuse case diagram; Fig. 4 illustrates an extract of this diagram, showing the abuse cases related to the authentication service reported in Fig. 3. To identify the threats of the secure domain, we rely on the OSVDB (Open Source Vulnerability Database) [36] to determine the threats related to the security requirements mentioned in our abuse case diagram. To carry out the mitigation of the risks of the secure domain, we choose the DREAD method to evaluate the risks; Fig. 5 illustrates an extract of this evaluation. Finally, to perform the domain design, we modeled our reference architecture; Fig. 6 shows its static view.

VIII. RELATED WORKS

Extensive work has been carried out on software security during the last few years, and several works deal with security at the early stages of the development lifecycle, as Secure SOPL does. We summarize below some proposals particularly close to ours and explain their relation to Secure SOPL.
SREP [13] (Security Requirements Engineering Process) describes how to integrate security requirements into the software engineering process in a systematic and intuitive way. In order to achieve this goal, the approach is based on the integration of the Common Criteria (CC) (ISO/IEC 15408) into the software lifecycle model. However, SREP focuses only on the activities directly concerning security requirements elicitation and specification, while Secure SOPL deals with the whole lifecycle; moreover, our approach concerns the SOPL approach, whereas SREP is applied to a classical software development lifecycle.
SREPPLine [14] (Security Requirements Engineering Process for software Product Lines) is a standard-based process that describes how to integrate security requirements into the software engineering process in a systematic and intuitive way, with a simple integration with the rest of the requirements and the different phases/processes of the SPL development lifecycle. Additionally, this process facilitates the fulfilment of the IEEE 830:1998 standard. However, SREPPLine focuses only on security requirements and is applied to SPL, while Secure SOPL deals with the two phases of the SOPL approach.
S2D-ProM [27, 28] (Secure Software Development Process Model) is a strategy-oriented process model, based on the MAP formalism, that provides two levels of guidance: (a) a strategic guidance helping the developer to choose among existing techniques, methods, standards and best practices useful for producing secure software, and (b) a tactical guidance on how to achieve this selection. However, S2D-ProM is applied to a classical software development lifecycle, while Secure SOPL deals with the SOPL approach.

IX. CONCLUSION

The principal aim of this work is to integrate security activities into the Service Oriented Product Line process, in order to proactively produce secure services. Such services are much requested in an open world such as the Internet, in order to ensure a climate of trust. In this work, we were greatly inspired by the SDL, CLASP and McGraw approaches [5], [18], [37], [41], which are the most used, proven and validated approaches for producing secure applications. We have thus added several security activities to the SOPL process and advised the use of different security methods, concepts, standards and frameworks (such as RMF, STRIDE and the Common Criteria) which are well suited to given situations and contexts. This work aims to ensure the development of a product (a service in our case) by taking advantage of the contributions of three concepts: large-scale systematic reuse through product line engineering, service-oriented architecture and software security. We developed a case study related to a range of governmental services offered by the Tunisian Ministry of the Interior and Local Development (PRF Tunisian National Project, "Federated Project Search", of which we are one of the partners) to show the feasibility of our proposition. Our perspectives are, first, to provide a tool which supports Secure SOPL and, second, to validate the proposed approach in different contexts such as e-commerce, e-learning, etc.

REFERENCES
[1] A. Finkelstein, J. Kramer, and B. Nuseibeh, "Software Process Modelling and Technology". Advanced Software Development Series, Research Studies Press/John Wiley & Sons, 1994.
[2] A. Guruge, "Web Services: Theory and Practices". Digital Press, 2004.
[3] A. Heferich, G. Herzwurm, and S. Jess, "Software Product Lines and Service-Oriented Architecture: A Systematic Comparison of Two Concepts". In: the First Workshop on Service-Oriented Architectures and Software Product Lines, pp. 31-37, 2008.
[4] A. Toms, "Threats, Challenges and Emerging Standards in Web Services Security". Technical report HS-IKI-TR-04-001, Department of Computer Science, University of Skövde, 2009.
[5] B. De Win, R. Scandariato, K. Buyens, J. Grégoire, and W. Joosen, "On the secure software development process: CLASP, SDL and Touchpoints compared". Information and Software Technology, Vol. 51, No. 7, pp. 1152-1171, 2009.
[6] B. Schneier, "Attack trees: Modeling security threats". Dr. Dobb's Journal, 1999.
[7] Buzzle, [online] http://www.buzzle.com/articles/software-testingacceptance-testing.html.
[8] C. Wienands, "Synergies between Service-Oriented Architecture and Software Product Lines". Siemens Corporate Research, Princeton, NJ, 2006.
[9] Common Criteria for Information Technology Security Evaluation, Norm ISO 15408 – "Part 1: Introduction and general model – version 3.1", 2009.
[10] Common Criteria for Information Technology Security Evaluation, Norm ISO 15408 – "Part 2: Security functional requirements – version 3.1", 2009.
[11] Common Criteria for Information Technology Security Evaluation, Norm ISO 15408 – "Part 3: Security assurance requirements – version 3.1", 2009.
[12] D. Benavides, P. Trinidad, and A. Ruiz-Cortés, "Automated Reasoning on Feature Models". LNCS, Advanced Information Systems Engineering. In: 17th International Conference, CAiSE, 2005.
[13] D. Mellado, E. Fernández-Medina, and M. Piattini, "A common criteria based security requirements engineering process for the development of secure information systems". Computer Standards and Interfaces, 29(2), pp. 244-253, 2007.
[14] D. Mellado, E. Fernández-Medina, and M. Piattini, "Towards security requirements management for software product lines: A security domain requirements engineering process". Computer Standards & Interfaces, Volume 30, Issue 6, pp. 361-371, 2008.
[15] E. G. Amoroso, "Fundamentals of Computer Security Technology". Prentice-Hall, 1994.
[16] E. Ort, "Service-Oriented Architecture and Web Services: Concepts, Technologies, and Tools". Technical report, SUN, 2005.
[17] F. Medeiros, S. Romero, and E. Santana, "Towards an Approach for Service-Oriented Product Line Architectures". 13th International Software Product Line Conference (SPLC 2009), San Francisco, CA, USA, 2009.
[18] G. McGraw, "Software Security: Building Security In". IEEE Computer Society, IEEE Security and Privacy, 2004.
[19] G. Stoneburner, C. Hayden, and A. Feringa, "Engineering Principles for Information Technology Security (A Baseline for Achieving Security)", Revision A. Recommendations of the National Institute of Standards and Technology, 2004.
[20] H. Gomaa, "Designing Software Product Lines with UML: From Use Cases to Pattern-Based Software Architectures". Addison-Wesley Professional, 2004.
[21] I. Achour, Sh. Khadouma, L. Lamia, and H. Ben Ghezala, "Towards a Secure Service Oriented Product Line". In: the International Conference on Software Engineering Research and Practice SERP'11, Las Vegas, Nevada, USA, 2011.
[22] J. Lee, M. Kim, D. Muthig, M. Naab, and S. Park, "Identifying and Specifying Reusable Services of Service Centric Systems Through Product Line Technology". In: the First Workshop on Service-Oriented Architectures and Software Product Lines, pp. 57-67, 2008.
[23] J. McDermott and C. Fox, "Using Abuse Case Models for Security Requirements Analysis". In: 15th Annual Computer Security Applications Conference, Phoenix, Arizona, 1999.
[24] JUnit, [online] http://www.JUnit.org.
[25] K. Kang, S. Cohen, J. Hess, W. Novak, and S. Peterson, "Feature-Oriented Domain Analysis (FODA) Feasibility Study". Technical report CMU/SEI-90-TR-21, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, 1990.
[26] L. Dikmans and R. V. Luttikhuizien, "SOA Made Simple". Packt Publishing, 2012.
[27] M. Essafi, "Approche multi-démarches avec guidage flexible pour le développement de logiciels sécurisés". Thesis, Manouba University, 2014.
[28] M. Essafi, L. Labed, and H. Ben Ghezala, "S2D-ProM: A Strategy Oriented Process Model for Secure Software Development". In: the Second International Conference on Software Engineering Advances (ICSEA 2007), Cap Esterel, French Riviera, France, 2007.
[29] M. Howard, "Microsoft Corporation: Fundamental practices for secure software development". Stacy Simpson, SAFECode, 2008.
[30] M. Jensen, N. Gruschka, R. Herkenhoner, and N. Luttenberger, "SOA and Web Services: New Technologies, New Standards – New Attacks". ECOWS '07, 2007.
[31] Microsoft Corporation, "The STRIDE Threat Model", [online] http://msdn.microsoft.com/enus/library/ee823878%28v=cs.20%29.aspx.
[32] N. M. Josuttis, "SOA in Practice: the art of distributed system design". O'Reilly Media, 2007.
[33] N. R. Mead, E. D. Hough, and T. R. Stehney, "Security Quality Requirements Engineering (SQUARE) Methodology". Technical report CMU/SEI-2005-TR-009, Carnegie Mellon University, 2005.
[34] National Institute of Standards and Technology, "National vulnerability database", [online] http://nvd.nist.gov/.
[35] OASIS Corporation, "OASIS Web Services Security (WSS) TC", [online] https://www.oasisopen.org/committees/tc_home.php?wg_abbrev=wss#overview.
[36] Open Security Foundation (OSF), "Open Source Vulnerability Database (OSVDB)", [online] http://osvdb.org.
[37] OWASP Corporation, "CLASP Comprehensive Lightweight Application Security Process", 2006.
[38] R. Helali, "L'approche Lignes de Produits pour la dérivation d'applications logicielles en E-Gouvernement". Master, University of Tunis, 2010.
[39] C. Rolland, N. Prakash, and A. Benjamen, "A Multi-Model View of Process Modelling". Requirements Engineering Journal, 1999.
[40] S. Krakowiak, T. Coupaye, V. Quema, L. Seinturier, and J. Stefani, "Intergiciel et Construction d'Applications Réparties", 2007.
[41] S. Lipner, "The Trustworthy Computing Security Development Lifecycle". Computer Security Applications Conference, 20th Annual Publication, ISSN: 1063-9527, ISBN: 0-7695-2252-1, pp. 2-13, 2004.
[42] SecurityFocus, "SecurityFocus vulnerability database", [online] http://www.securityfocus.com/Vulnerabilities.
[43] T. Bradley, "Integration Testing", 2008.
[44] T. Mannisto, V. Myllarniemi, and M. Raatikainen, "Comparison of Service and Software Product Family Modeling". In: the First Workshop on Service-Oriented Architectures and Software Product Lines, pp. 47-57, 2008.
[45] TestPlant, [online] http://www.testplant.com/products/eggplant_functional_tester.


Decentralized orchestration of BPEL processes based on shared space

Zaki BRAHMI, RIADI Lab, Sousse University; Ilhem FEDDAOUI, Jendouba University

Abstract—Decentralized orchestration offers performance improvements in terms of increased throughput and reduced response time. The idea of the decentralized orchestration of BPEL is to split the program into different partitions in such a way that each partition is executed by a different orchestrator; all sub-BPEL programs must then be coordinated to achieve the overall program. On the other hand, decentralized orchestration raises the computational complexity of partitioning the BPEL program so as to find a partition that maximizes the profit of service users, and a bottleneck can still appear at the orchestration engine when running activities. The main objective of this paper is to provide a decentralized execution environment which optimizes the execution of a business process described in BPEL. Our approach is based on two basic technologies, namely: i) a shared space, which provides the communication environment between intelligent agents, and ii) a set of cooperative agents which share the execution.

Index Terms—Cloud Computing, Orchestration, Service, BPEL, Shared Space, Agent.

I. INTRODUCTION

Business processes are typically complex operations, including numerous individual stages, and in the context of Service Oriented Architecture (SOA) each such stage is realized as a web service. The Business Process Execution Language (BPEL) is the current industry standard frequently used to specify the composition of these steps (control flow, data flow, etc.) and to express Web Service (WS) orchestrations. A BPEL process defines how multiple service interactions between partners can be coordinated internally in order to achieve a business goal (orchestration) [1][2]. It is predominantly deployed on centralized servers, which implies that all interactions and intermediate data must go through one server.


Therefore, the problems encountered with centralized management in non-service environments, including poor performance, impaired reliability, limited scalability and restricted flexibility [3], arise here as well. Moreover, these problems are aggravated by the long-running nature of BPEL processes, caused by the exchange of voluminous data with external web services that are concurrently accessed by large numbers of users. This represents one of the major obstacles to a wide deployment of web service technology, especially for applications where the transfer of large amounts of intermediate data is needed. To address these problems and to overcome the single-server bottleneck, previous works have presented decentralized execution modes for composite web services (web service orchestrations) [3], [4], [5], [6], [14] and [15]. Unlike centralized orchestration, decentralized orchestration consists of executing the overall composite web service on several servers. In a decentralized orchestration there are multiple engines, each of which executes a portion of the original composite web service. These engines communicate directly with each other to transfer data and handle control in a loosely coupled manner. This model brings several benefits [15]: i) there is no centralized coordinator that could become a potential bottleneck; ii) data distribution reduces communication and improves transfer time; iii) control distribution improves concurrency; and iv) asynchronous messaging between the engines improves throughput. The previous works [3], [4], [5], [6], [14] and [15] that address the decentralized web service orchestration problem rely on different decentralization styles, such as BPEL partitioning and message passing. The most used technique for decentralized orchestration is BPEL process partitioning, which consists of dividing the BPEL process into a set of sub-processes; these sub-processes are executed by the different engines and coordinated so as to execute the overall process in a distributed fashion. In the BPEL partitioning problem, the BPEL process is divided into two sets of activities, fixed activities and mobile activities, and the aim is to find the optimal allocation of the mobile activities to the fixed activities in order to minimize the response delay and to improve the throughput.

The weakness of these approaches is that they assume that service providers have the necessary infrastructure to run a sub-process. In order to address the problems presented above, we propose an approach that allows the decentralization of a BPEL business process. Our approach is based on the cooperation of a set of intelligent agents which interact with each other through the shared space paradigm. This involves taking into account several parameters such as the execution cost, the time and the quality of service. Our approach also allows providers to participate in the choice of service.
The rest of the paper is organized as follows: Section 2 provides a review of approaches to decentralized orchestration. In Section 3 we present our approach to decentralized orchestration based on shared space. Finally, we conclude this research in Section 4.

II. RELATED WORK

Many works have dealt with the problem of the decentralized execution of composite web services. We classify these works into two classes according to their level of action. The first class acts at a high level: [4], [5], [6] and [15]. At this level, the focus is on the description of the program code, without needing to define how the Turing machine manages its tape or stores data on it. These approaches are characterized by a certain difficulty of implementation and limited scalability, and they are not adaptable to problems of large size and complexity: the more the size and complexity of the problems increase, the more these approaches reveal their weaknesses. [4] and [5] propose two different ways to partition a BPEL process represented as a Program Dependence Graph (PDG) [13]: i) Nanda et al. [4] proposed two heuristic partitioning algorithms named, respectively, Merge-by-Def-Use (MDU) and Pooling-and-Greedy-Merge (PGM). The aim of MDU is to merge the nodes along the loop-independent flow dependency edges. However, due to the large computation time of the MDU algorithm, the authors choose to apply the PGM heuristic, which is a combination of: (1) the greedy-merge heuristic, a refinement of MDU that minimizes the number of nodes in the PDG, and (2) the pooling heuristic, which tries to minimize the total number of edges in the PDG. ii) Lifeng et al. [5] proposed another approach based on a genetic algorithm whose characteristic is that dedicated mechanisms handle the precedence dependency (PD) constraints and the control dependency (CD) constraints; furthermore, a local optimizer is incorporated into the genetic algorithm to improve the quality of the solution. Unlike [4], the authors transform the PDG into two graphs: a dependency graph to handle the PD constraints and a control flow graph to handle the CD constraints.
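To make the merging idea above concrete, the following minimal Python sketch is a toy illustration only, not the actual MDU or PGM algorithms of [4]: it greedily merges the pair of partitions sharing the most dependency edges until a target number of partitions is reached. The activity names and the edge list are invented for the example.

from itertools import combinations

def greedy_merge(activities, edges, k):
    """Greedily merge partitions until only k remain.
    edges: list of (a, b) dependencies between activities."""
    parts = [{a} for a in activities]          # start with one partition per activity
    def cross(p, q):                           # number of edges crossing two partitions
        return sum((a in p and b in q) or (a in q and b in p) for a, b in edges)
    while len(parts) > k:
        # pick the pair of partitions with the most edges between them
        p, q = max(combinations(parts, 2), key=lambda pq: cross(*pq))
        parts.remove(p); parts.remove(q)
        parts.append(p | q)                    # merging removes those cross edges
    return parts

# hypothetical BPEL activities and dependencies
acts = ["receive", "assign1", "invokeA", "invokeB", "assign2", "reply"]
deps = [("receive", "assign1"), ("assign1", "invokeA"),
        ("assign1", "invokeB"), ("invokeA", "assign2"),
        ("invokeB", "assign2"), ("assign2", "reply")]
print(greedy_merge(acts, deps, 3))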


In [15], the authors introduced a framework comprising a Peer-to-Peer (P2P) architecture and a set of distributed algorithms to support the decentralized enactment of BPEL processes in Peer-to-Peer systems. BPEL processes are deployed, executed and monitored by a set of nodes organized in a hypercube P2P topology, named BPELCube. Each node in the hypercube topology is capable of executing one or more individual BPEL activities as part of a given process instance execution, while also maintaining one or more of the instance's data variables.
The second class acts at a low level: [3], [6] and [14]. This level gives more details, describing how the Turing machine is used to solve the stated problem. These solutions are characterized by a high response delay. The authors in [3] proposed an approach based on partitioning the execution of the BPEL process among a series of mobile agents. Each mobile agent is responsible for performing one invoke activity of the BPEL process and migrates to the web service provider in order to reduce the communication flow during the execution of the BPEL process. In [27], the authors define a variant of Petri nets called Executable Workflow Networks (EWFN), based on their capacity to build many individual modules, each responsible for a particular task of the overall system. The main idea behind the development of EWFNs, however, is their use in decentralized workflow enactment, being executed "natively" on an extended Linda-like tuplespace system. Each tuplespace can reside on a different machine in the network, and even the execution of a single process instance may be arbitrarily distributed. EWFNs are specifically used by the authors in [6] to represent BPEL workflows in a way that enables distributed and decentralized execution; the data handled during the execution of the BPEL process are explicitly described by a set of EWFNs, and the proposed method runs in three phases based on different types of nodes (fixed, heavy and light). Another approach, based on binary trees, was developed in [14]. The main idea is to create a binary tree for each composition request and to store all the information about the process in this data structure, as follows: each service (single or composite) is represented by a node of the tree; the root node reflects the output parameter of a web service request; a composite service corresponds to a branch of the tree (a set of nodes connected in series). The authors propose two parallelization techniques to partition a web service composition represented by a binary tree. The first technique consists of partitioning the binary tree into a set of sub-trees, each sub-tree being handled by a thread using its own stack. The second parallelization technique consists of creating a thread for each node taken from a shared stack.

In [16], the authors propose a novel approach based on the division of labour in ant colonies as a solution to the BPEL partitioning problem. Fundamentally, the authors consider the BPEL partitioning process as a labour division problem: each fixed activity takes part in the execution of the set of mobile activities with a certain probability, computed from a stimulus associated with each fixed activity and an internal response associated with each mobile activity.

III. PROPOSED APPROACH

In this section, we propose an approach to the decentralized orchestration of web services. The idea is to share the execution of a BPEL program between a set of cooperative agents representing the web service providers. Each agent marks an activity to perform according to its resources and the cost of executing that BPEL activity. All activities are marked except the "receive" activity, which is executed by the main space. Our approach is based on three main ideas, namely:
- marking the BPEL activities, by adding attributes to the BPEL process;
- cooperative agents: our approach is composed of fully cooperative agents;
- using a shared space as the communication model between the agents representing the service providers.

A. Marking activities
Marking is a basic concept in our approach: it coordinates the actions of the providers' agents. The marking is intended to:
- prevent two providers from choosing the same activity;
- allow any provider to check, when needed, the execution result of an activity already performed;
- reserve an activity for a specific provider.
We distinguish the following types of marking:
- mark the activities that a provider may run;
- add the result of the execution of an activity to the BPEL file;
- add the cost of executing a given activity;
- add the time needed for the execution of an activity.
To achieve these types of marking, we add the following attributes to the BPEL file:
- Active: the provider writes its number in this attribute when it takes charge of the activity; otherwise Active is set to 0.
- Result: the execution result of the given activity.
- Cost: the cost at which the provider offers to execute the activity.
- Time: the execution time of the activity, added by the provider.
Figure 1 shows a diagram of the concept of marking applied to an "invoke" activity of the BPEL file.

Fig. 1. Marking the BPEL program.

B. Shared Space
The shared space model is a parallel computing model in which data can be distributed and viewed by all participants [6]. The shared space approach provides a high-level abstraction that simplifies the task of programming such systems: it offers a communication mechanism based on a logical shared memory of tuples, as shown in Figure 2. In this work a tuple is strictly defined by the couple (BPEL, ID), with:
- ID: the identifier of the tuple;
- BPEL: the set of activities of the BPEL process to be performed by the cooperative agents.

C. Structural Architecture of our system
Figure 2 shows the architecture of our approach, which is composed of three layers: the interface layer, the orchestration layer and the providers' layer.

Fig. 2. General architecture of our approach.
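As a rough illustration of how the tuple (BPEL, ID) and the marking attributes could be represented, the following Python sketch is a hypothetical model of ours, not the authors' implementation: each activity carries the Active, Result, Cost and Time attributes, and a provider agent can mark a free activity, the "receive" activity staying with the main space.

class Activity:
    """One BPEL activity published in the shared space, with the marking attributes."""
    def __init__(self, name):
        self.name = name
        self.active = 0        # 0 = free, otherwise the number of the provider that marked it
        self.result = None     # execution result written back by the provider
        self.cost = None       # cost proposed by the provider
        self.time = None       # execution time reported by the provider

class SpaceTuple:
    """A tuple (BPEL, ID): the set of activities of one BPEL process instance."""
    def __init__(self, tuple_id, activity_names):
        self.id = tuple_id
        self.bpel = [Activity(n) for n in activity_names]

    def mark(self, activity_name, provider_id, cost):
        """A provider agent reserves a free activity; 'receive' is never markable."""
        for act in self.bpel:
            if act.name == activity_name and act.name != "receive" and act.active == 0:
                act.active, act.cost = provider_id, cost
                return True
        return False           # already marked by another provider, or not markable

# hypothetical usage
t = SpaceTuple("T1", ["receive", "assign", "invoke", "while", "reply"])
print(t.mark("invoke", provider_id=17, cost=9.0))   # True: the activity is now reserved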


1) Interface layer
It represents the interface between the user and the system. It contains the user agent, which provides the user with all the functionalities needed to execute the BPEL process.

2) Orchestration layer
This is the main layer of our system. It is composed of:
- an agent responsible for the space (AR), which plays the role of director of the space and of intermediary between the orchestration layer and the providers on the one hand, and between the orchestration layer and the interface on the other hand;
- a dynamicity module (DM), which looks for providers able to run an activity whose execution has failed.

3) Providers layer
In this layer there are two types of providers: the providers of the partner services of the BPEL process, and the other providers on the web. The agents of the provider layer can perform the activities of a BPEL program; they decide, according to their capacity, whether or not to perform the advertised activities. The operation of the provider layer is described by the interaction between the agents as follows:
- Each provider interacts with the administrator of the space to identify the activities for which it has the resources to execute them (in some cases a provider interacts with the agent responsible for the space or with the discovery agent).
- After comparing its costs with the costs of the activities proposed in the space, each provider identifies the set of activities it is capable of performing.
- If a provider has chosen to run an activity not selected by another provider, it must mark it.
All these layers work together to meet the needs of the user.

D. Role of agents
1) User Agent (UA)
It plays the role of intermediary between the user and the agent responsible for the shared space. It provides the user with a graphical interface to publish a BPEL process and to receive the execution result.
2) Agent Responsible for the Space (AR)
This agent plays the role of director of the space and is the essential agent of our system. It posts tuples in the shared space and updates the space; if it finds an activity not yet executed, it may call upon the dynamicity module.
3) Discovery Agent (DA)
The discovery agent receives from the AR agent the requests for services to discover. It can trigger a call to the providers' agents in order to generate a list L of services.
4) Selection Agent (SA)
This agent is activated when it receives a message from the discovery agent containing the list L of providers chosen by the discovery agent. The selection agent chooses the best service in the list L; the selected service is sent to the agent responsible for the space, which updates the BPEL process.
5) Provider Agent (Pr)
A Pr agent can read the activities proposed in the shared space and choose those which have the same parameters as its service. Provider agents mark the chosen activities and send their proposals to the AR agent.
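To make the interaction between a provider agent and the space concrete, here is a small hedged sketch of our own; the names read_activities and propose are invented and the matching rule is simplified. The provider reads the published activities, keeps those matching its service and whose current cost it can beat, and sends a marking proposal to the AR agent.

def provider_cycle(provider_id, my_services, my_costs, read_activities, propose):
    """One decision cycle of a provider agent (simplified illustration).
    read_activities(): activities currently published in the shared space,
                       each a dict with 'name', 'active' and 'cost' fields.
    propose(name, provider_id, cost): sends a marking proposal to the AR agent."""
    for act in read_activities():
        if act["name"] not in my_services or act["name"] == "receive":
            continue                         # not our service, or reserved for the main space
        offered = my_costs[act["name"]]
        if act["active"] == 0 or (act["cost"] is not None and offered < act["cost"]):
            propose(act["name"], provider_id, offered)   # free, or we are cheaper: try to mark it

# hypothetical usage with an in-memory space
space = [{"name": "invoke", "active": 0, "cost": None},
         {"name": "while", "active": 22, "cost": 9.0}]
proposals = []
provider_cycle(200, {"invoke", "while"}, {"invoke": 6.0, "while": 7.0},
               lambda: space, lambda n, p, c: proposals.append((n, p, c)))
print(proposals)   # [('invoke', 200, 6.0), ('while', 200, 7.0)]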

E. Illustrative example
In this part of the work we present an illustrative example of our approach. The example involves a BPEL process composed of a set of activities such as receiveInput, assign, while, etc. Table 1 shows the BPEL process; in this example, three providers offer a cost for each activity, as shown in the table.

TABLE 1: COST FOR EACH PROVIDER

1) Scenario 1
Providers number 17 and 22 marked, respectively, the assign and while activities, whereas the Assign2 activity was not marked because no provider proposed it at a cost that does not exceed 10 (as shown in Figure 4).

Fig. 4. Marking of the activities.

2) Scenario 2
The agent responsible for the space updates the shared space and communicates with the dynamicity module to get the Assign2 activity performed. The discovery agent looks for the providers capable of performing the Assign2 activity and sends the list L of providers to the selection agent, which makes its selection based on the criteria of each provider, so as to end up with a provider able to run the Assign2 activity at a cost that does not exceed 10. The selection agent chooses provider number 120, since it will execute the activity at a cost equal to 8.5. Once the agent responsible for the space has received the provider number, the marking of the Assign2 activity is done by provider 120, chosen by the selection agent. Figure 5 illustrates this step.

Fig. 5. Running the Assign2 activity.
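The selection step of Scenario 2 can be sketched as follows. This is only a hypothetical illustration: provider 120 and its cost of 8.5 come from the example, while the other provider costs are invented, and select_provider is our own name, not a function of the authors' system.

def select_provider(offers, max_cost):
    """offers: list of (provider_id, cost) proposed for one activity.
    Keep only the offers within the cost threshold and return the cheapest one."""
    eligible = [(p, c) for p, c in offers if c <= max_cost]
    return min(eligible, key=lambda pc: pc[1]) if eligible else None

# Scenario 2: Assign2 must cost at most 10; provider 120 offers 8.5 (other costs invented)
print(select_provider([(17, 12.0), (22, 11.0), (120, 8.5)], max_cost=10))   # -> (120, 8.5)

The same cost comparison is what allows a cheaper offer to replace an existing marking, as in Scenario 3 below.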

3) Scenario 3
Provider number 200 marked the While activity, which had already been marked by provider number 22. Since provider number 200 offered a lower cost than the cost proposed by provider number 22, the tagging is done again by provider number 200. Figure 6 illustrates this concept.

Fig. 6. The new marking of the While activity.

IV. EXPERIMENTAL RESULTS

We present in this section the experiments carried out after implementing our shared-space approach to the orchestration of web services, compared with an algorithm based on Swarm Intelligence (SI). The basic test results are presented in Figure 7: the computation time of our shared-space approach increases slowly with the problem size, whereas the computation time of the SI algorithm increases significantly with the size of the problem. For small problem sizes, the 6-7 and 10-9 problems, the computation time of our approach is only 25 ms and 34 ms respectively, while the SI algorithm takes 80 ms and 93 ms respectively. For the 42-40 problem, which has a large size, our approach spends 80 ms to find a solution, while the SI algorithm requires 145 ms to solve the problem. We can conclude that our approach based on shared space is more efficient.

Fig. 7. Comparison of the computation time of our shared-space approach versus the Swarm Intelligence (SI) algorithm for the resulting partitioning, for different complexity levels (in milliseconds).

V. CONCLUSION

In this article we presented our survey of approaches to decentralized orchestration, and we proposed our decentralized orchestration approach based on shared space, which addresses the problems encountered by the other approaches.


Our approach is based on three main ideas:
− the use of a shared space;
− the use of cooperative agents;
− the marking of activities.
As future work, we plan to address the following limitations:
− provider agents do not have knowledge of the other agents;
− when the number of providers increases, access to the shared space becomes challenging and the marking of activities becomes somewhat heavy;
− the agent responsible for the space may become a bottleneck.

VI. REFERENCES
[1] F. Abouzaid and J. Mulins, "A Calculus for Generation, Verification and Refinement of BPEL Specifications", Proc. of the 3rd International Workshop on Automated Specification and Verification of Web Systems, Electronic Notes in Theoretical Computer Science, 2008, pp. 43-65.
[2] D. Habich, S. Richly, M. Grasselt, S. Preissler, W. Lehner and A. Maier, "BPEL - Data-Aware Extension of BPEL to Support Data-Intensive Service Applications", In: Whitestein Series in Software Agent Technologies and Automatic, 2008, pp. 111-128.
[3] M. Ilahi, Z. Brahmi and M. M. Gammoudi, "Enhancing Decentralized MAS-Based Framework for Composite Web Services Orchestration and Exception Handling by Means of Mobile Agents Technology", Proc. of AMT'09, Beijing, China, Vol. 5820/2009, Springer Berlin/Heidelberg, 2009, pp. 347-356.
[4] M. G. Nanda, S. Chandra and V. Sarkar, "Decentralizing Execution of Composite Web Services", Proceedings of the 19th Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA '04, October 2004, pp. 170-187.
[5] Ai. Lifeng, T. Moalin and F. Colin, "Partitioning composite web services for decentralized execution using a genetic algorithm", Future Generation Computer Systems, Elsevier, Vol. 27, August 2011, pp. 157-172.
[6] D. Wutke, D. Martin and F. Leymann, "A Method for Partitioning BPEL Processes for Decentralized Execution", Proceedings of the 1st Central-European Workshop on Services and their Composition, Stuttgart, Germany, Lecture Notes in Computer Science, March 2009, pp. 109-114.
[14] P. Hennig and W.-T. Balke, "Highly Scalable Web Service Composition using Binary Tree-based Parallelization", Proceedings of the 8th IEEE International Conference on Web Services (ICWS'10), Miami, Florida, July 2010, pp. 123-130.
[15] M. Pantazoglou, I. Pogkas and A. Tsalgatidou, "Decentralized Enactment of BPEL Processes", IEEE Transactions on Services Computing, IEEE Computer Society, Vol. 99, February 2013.
[16] T. Mohsni and Z. Brahmi, "Toward an approach for partitioning BPEL program", in Proc. of Electrical Engineering and Information Technology, WITPress, 2014, pp. 319-326.

As described in the adopted approach, defining a feature and using it to visualize hyperedges linking the nodes that share the same value of this feature brought considerable advantages compared to the web prototype of VisuGraph that we had earlier tried to integrate into Xplor EveryWhere:
• Much less time during the pre-knowledge generation phase: we could execute this process in 1h instead of 2h30 for a corpus of 65000 data sources processed on a standard PC.
• On the data visualization side, we could visualize three times more significant data using the same JavaScript engine.
We have also tested XEWGraph on one of the aspects where Competitive Intelligence can be helpful, namely Scientific Monitoring, by analyzing a corpus of 65000 scientific papers. This analysis enabled us to get an idea of the cooperation levels between research groups just by reading the resulting hypergraph. These encouraging results have prompted us to push the performance of XEWGraph further by adopting, on the one hand, a clustered architecture based on Hadoop and, on the other hand, genetic algorithms as well as an extension of graph morphing [14], in order to alter the visualized hypergraphs simply by changing the feature that generates them, instead of launching a new query every time.
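A minimal sketch of the grouping idea behind this, our own illustration rather than the XEWGraph implementation: nodes sharing the same value of the chosen feature are gathered into one hyperedge.

from collections import defaultdict

def hyperedges_by_feature(nodes, feature):
    """Group node ids into hyperedges keyed by the value of the chosen feature.
    nodes: dict id -> dict of feature values (hypothetical data)."""
    edges = defaultdict(set)
    for node_id, attrs in nodes.items():
        edges[attrs[feature]].add(node_id)
    return dict(edges)

# hypothetical corpus records: papers annotated with their research group
papers = {"p1": {"group": "LORIA"}, "p2": {"group": "IRIT"}, "p3": {"group": "LORIA"}}
print(hyperedges_by_feature(papers, "group"))   # e.g. {'LORIA': {'p1', 'p3'}, 'IRIT': {'p2'}}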

REFERENCES
[1] K. Rees (2010, October 07). Data Visualization Review: Gephi, Free Graph Exploration Software. [Online]. Available: http://infosthetics.com/archives/2010/07/review_gephi_graph_exploration_software.html
[2] B. Dousset, "Intelligence Economique: Proposition d'un outil dédié à l'analyse relationnelle," SciWatch Journal, vol. III, Issue n°2, 2008.
[3] A. El Haddadi, "Fouille multidimensionnelle sur les données textuelles visant à extraire les réseaux sociaux et sémantiques pour leur exploitation via la téléphonie mobile," Ph.D. dissertation, IRIT, Paul Sabatier Univ., Toulouse, France, 2011.
[4] I. Ghalamallah, "Proposition d'un modèle d'analyse exploratoire multidimensionnelle dans un contexte d'Intelligence Economique," Ph.D. dissertation, IRIT, Paul Sabatier Univ., Toulouse, France, 2009.
[5] S. K. Card, J. D. Mackinlay, and B. Shneiderman, "Information Visualization," in Readings in Information Visualization: Using Vision to Think, San Francisco: Morgan Kaufmann Publishers, 1999, pp. 1-34.
[6] R. Spence, "The Issues," in Information Visualization: Design for Interaction, 2nd ed. New York: ACM Press, 2007, pp. 16-28.
[7] C. Ware, "Foundations for an Applied Science of Data Visualization," in Information Visualization: Perception for Design, 3rd ed. San Francisco: Morgan Kaufmann Publishers, 2013, pp. 1-29.
[8] E. Mäkinen, "How to draw a hypergraph," International Journal of Computer Mathematics, vol. 34, 1990, pp. 177-185.
[9] H. Klemetti, I. Lapinleimu, E. Mäkinen, and M. Sieranta, "A Programming Project: Trimming the Spring Algorithm for Drawing Hypergraphs," ACM SIGCSE Bulletin, vol. 27, 1995, pp. 34-38.
[10] J. R. Bertini, Jr., M. C. Nicotelli, and L. Zhao, "Attribute-based Decision Graphs for Multiclass Data Classification," in IEEE Congress on Evolutionary Computation, Cancun, Mexico, 2013, pp. 1779-1785.
[11] P. Eades, "A heuristic for Graph Drawing," Congressus Numerantium, vol. 42, 1984, pp. 149-160.
[12] T. M. J. Frutcherman and E. M. Reingold, "Graph Drawing by Force-Directed Placement," Software – Practice and Experience, 21, 1991, pp. 1129-1164.
[13] A. Frick, A. Ludwig, and H. Lehldau, "A fast adaptive layout algorithm for undirected graphs," in Proceedings of Graph Drawing '94, 1994, pp. 388-403.
[14] E. Loubier, W. Bahsoun, and B. Dousset, "Visualisation de l'évolution des informations relationnelles par morphing de graphes," in Journées Francophones Extraction et Gestion de Connaissances, Namur, Belgium, 2007, pp. 43-54.


Zineb Drissi, Assistant Professor of Higher Education, USMBA / FSJES, Fès, Morocco

Competitive intelligence and decision 

In the light of the important role that Competitive intelligence plays, we outline in this research the main elements of the theoretical framework of Competitive intelligence; we will try to show its impact on decisions and, in the end, we present a state of the art of the practice of strategic foresight within Moroccan exporting firms as well as its impact on decision making.

I. INTRODUCTION

The new economic realities require new ways of learning in order to identify the knowledge which can enable firms to face a changing environment. In the face of globalization, leaders must be aware of what is happening in the environment in order to come up with the necessary actions. Thus, adapting to changes in the environment has become essential in order to deal with increasingly strong competition at the international level. One of the possible solutions enabling firms to observe the environment in depth is Competitive intelligence.

Part 1: Theoretical framework of Competitive intelligence

Competitive intelligence is a process by which an individual, or a group of individuals, tracks, in a voluntary manner, and uses anticipatory information about changes that may occur in the external environment, with the aim of creating business opportunities and, generally, of reducing both risks and uncertainty1.

Henceforth, in this information society, the competitiveness of a firm requires the mastery of information (which acts as a raw material and an essential strategic tool for the firm) and, more precisely, knowledge of the environment in all domains, in order to identify new trends and maintain a vision of the world in which the firm develops.

We can say that strategic foresight is an informational process through which an organization observes its environment and follows its evolution in order to decide and act while respecting its objectives. In fact, competitive intelligence is not a passive act confined to a simple observation of the environment; it is rather a voluntary action aiming at the mastery of strategic information.

The concept of Competitive intelligence is a response to this strategic need for adaptation, ensuring a quick reaction at the right time.

Before starting this first part, let us recall the definition of competitive intelligence according to the standard XP X 50053: a continuous and largely iterative activity designed to monitor the technological, commercial, etc. environment, in order to anticipate its evolutions2.

In fact, the role of information management systems is to help leaders turn information into knowledge, and knowledge into action; inter alia, they provide decision makers with information and support them in their decision process, a role made possible by competitive intelligence. It is in this framework that our research is placed: competitive intelligence is a type of information system whose goal is to help in decision making.

I- Organization of the competitive intelligence

1 LESCA, H. (1997), Veille stratégique, concepts et démarche de mise en place dans l'entreprise. Guides pour la pratique de l'information scientifique et technique. Ministère de l'Education Nationale, de la Recherche et de la Technologie.
2 Prestations de veille et prestations de mise en place d'un système de veille, AFNOR, avril 1998, page 6.



1- Between formal and informal
Not all firms necessarily need an organized approach to the problem of Competitive intelligence. In fact, opting for a purely formal system implies that the firm already knows its needs and has already identified its targets and the persons involved in the Competitive intelligence process3. In this case, the people in charge of the scanning are often appointed by the management, the scanning process is clearly identified, and the information is usually centralised. It would seem that the more the process is formalized (i.e. follows a previously defined procedure), the more it corresponds to problem zones already known and analyzed with precision4.
On the contrary, in a purely informal process, no procedure is imposed by the management; each employee chooses how to organize his or her own scanning activities, according to his or her own preferences and skills.

2- Information sources
2-1 Formalized sources
The firm must be aware of its environment. Among the formalized sources liable to provide information, we can cite: research, seminars, works, industrial films, radio or TV documentaries, technical product catalogues and activity reports. We also have to take into account internal information published within the firm, such as mission reports.
2-2 Informal sources
The more information is formalized, the more it dates and the less interest it has. Most of the time, the strategic advantage that a firm wishes to obtain is to get access to the information before any competitor. Informal sources are those which become useful once they receive an appropriate treatment; they are useful through the way we use them.

II- Different types of business intelligence

To achieve a good watch, one has to know what to observe in the light of the priorities and objectives of the firm. The scanning focuses on technology, competition, the client and the general environment, and we can normally distinguish four types of scanning: technological, competitive, commercial and environmental. In what follows, we briefly describe each type of scanning.

1- Technology watch, sometimes called scientific watch, is interested in: scientific and technical acquisitions, the results of fundamental and applied research, in goods (or services), design, manufacturing procedures, information systems, and service performances in which the image factor is very strong and which make the transition to the commercial lookout.

2- The competitive monitoring: The competitive lookout is interested in current or potential competitors and in newcomers to the market who may offer substitution products. The acquired information may cover very large domains; we cite only a few of them: the nature of competitive products, distribution areas, marketing and sales, cost analysis, the organization and culture of the firm, the capacity of the board of directors, the activity portfolio of the firm, etc.

3- The commercial monitoring: Alongside the competitive and technological monitoring, the firm must also develop an active commercial watch, which focuses on: clients, markets, suppliers/providers and the job market.

3 Guechtouli M., Comment organiser son système de veille ?, Symposium ATELIS, Beaulieu-sur-Mer, November 25-26, 2009.
4 Pateyron E., 1997, « La veille stratégique », Encyclopédie de gestion, Economica, Paris.


4- The environmental monitoring: This watch covers the rest of the company's environment (excluding technology, competition and the commercial field). A firm that is capable of integrating elements of its legal, cultural, social and political environment is able to stand out from its competitors. In fact, these different lookouts are interdependent and mutually enriching.

III- The stages of competitive intelligence

We now browse through the different phases of a competitive intelligence system, of which we retain seven.

1- The targeting phase: Targeting is the operation by which the part of the external environment that the company wants to place under anticipatory lookout is delimited and defined, that is, the part on which it wants to focus a voluntary attention. It also consists in expressing in a clear and explicit manner what would interest the different participants in the business intelligence process.

2- The research phase: This is the phase of the monitoring process which is the most frequently cited and documented by writers. It designates all the operations of research and data collection carried out by different categories of people, in relation to the information sources familiar to them5.

3- The selection phase: The selection of information is the operation which consists of keeping only the business intelligence information capable of interesting potential users within the firm. It also consists in making sure that the information is both valid and useful, and that its source is reliable. A lack of selection leads to too much information, while a too restrictive selection makes the process poor and dry.

4- The circulation and distribution phase: This stage deals with information technologies as well as with the organization of the information path from the monitoring unit to the concerned actor6. Firms often think that information circulates in a fluid manner; it is rarely the case. We have to make sure that the right information reaches the right recipient at the right time. We distinguish two basic models used to circulate information and knowledge:
- the "flow" approach, where the manager of the information flows is "pro-active" and the user "passive";
- the "stock" approach, where the manager of the stock of information is "passive" and the user "pro-active".

5- The storage phase: The storage of business intelligence information is a necessary condition for valuing and exploiting such information. It materializes information sharing. The stored information must be easily accessible at any time by the authorized people.

5 LESCA, H. (2003), Veille stratégique : La méthode L.E.SCAnning®, Editions EMS.
6 Véronique Coggia, Intelligence économique et prise de décision dans les PME, L'Harmattan, 2009.


6- The exploitation phase: The exploitation of the anticipatory information about the environment is the most critical phase of the competitive intelligence process. The collected information must be exploited in a way that leads to the construction of meaningful explanations out of a mass of incoherent, incomplete and disordered information, in order to be able to predict, control and manage the changes in the firm's environment. In fact, as Simon7 (1983) argues: 'the information treatment systems of our contemporary world are overwhelmed by an excessive abundance of information and symbols. In such a world, the rare resource is not information, but the treatment capacity to deal with such information'. Therefore, competitive intelligence is not limited to an alert department or to electronic documentation. Indeed, if documentation and news feed CI, it must make a dynamic reading of them in order to detect the evolution of knowledge8.

7- The animation phase: Who says "animation" says "animator". The animation is a vital function of the strategic foresight device, ensuring its functioning and sustainability; 'to animate' means to give real life to this device. In a monitoring process, the facilitator will be responsible for:
- guiding the discussions,
- establishing consensus,
- identifying the prospects9.

Part 2: Competitive intelligence and decision taking

To decide, you need information. In fact, in order to take quality decisions, the firm has to look for information within its own organization, but also outside the firm; once this is done, the firm proceeds to the treatment of the collected information. The competitive intelligence process meets these needs and constitutes a support for decision taking.

I- The decision maker in relation to decision and information

1. The decision problem: the emergence conditions
The decision maker is constantly faced with a diversity of events coming from the environment, which may give rise to a decision problem. Two questions, in this context, seem predominant:
- What are these signs and how are we to recognize them as such?
- What is the problem and who or what defines it?

1-1 The importance of weak signals
The expression 'weak signals' was probably introduced into the domain of management by I. Ansoff. This notion is very interesting, especially because of its orientation towards anticipation, and more precisely towards the eventual discontinuities and ruptures which may occur in the environment of the firm10. Weak signals are thus potentially rich in information, and have value only if they are rapidly interpreted and used. Nevertheless, the decision maker does not always possess the right strategy that would allow him to make the most of these signals and to reach an understanding of the environment changes within a relatively short time. Even though the observer can manage these signals from their detection to their operational exploitation, it is the decision maker11 who is the appropriate person to decipher them and to attach to them the conviction relating to the meaning given to their appearance.

7 In « Emmanuel-Arnaud PATEYRON », 1994, Le management stratégique de l'information : applications à l'entreprise, Economica.
8 Valérie Brosset-Heckel et Michèle Champagne, Management des expertises et veille, Afnor, 2011.
9 Valérie Brosset-Heckel et Michèle Champagne, Management des expertises et veille, Afnor, 2011.
10 www.veille-strategique.org


1-2 From the detection of signals to the decision problem
Certain conditions have to be present for a transition to happen between the identification of a signal and the existence of a decision problem. Newell and Simon12 argue that two conditions are necessary for a decision problem to exist: a strong motivation of the decision maker to take action in a given situation, and the lack of immediate understanding of the actions which would enable him to find a solution. Lebraty adds two hypotheses: 'the expression of a wish for the disappearance of the differences between the wished-for and the real' and 'the existence of the skills and resources to solve the problem'13. Moreover, the perception of a decision problem depends on the initial representation which the decision maker has of the situation. In order to solve a problem, we should implement a process of collecting and treating information, and use common-sense methods. In this case, the discovery of solutions using information resulting from the implementation of a competitive intelligence process aims at easing the representation of the problem and at helping in its resolution. The solution of a problem is not confined merely to looking for informational and technical solutions, but requires the modification of the analysis framework and of the way the decision problem is presented. If we refer to Kotler and Dubois14, business intelligence is a crucial element of the strategic decision: it is the basis of the recognition of strategic problems. The strategic decision does not, like the operational decision, consist in solving a clearly identified problem; it includes the recognition of the problem itself.

II. COMPETITIVE INTELLIGENCE, A TOOL TO HELP IN DECISION?

1. Competitive intelligence and action
Business intelligence is not limited to receiving or classifying information of strategic value in order to take good decisions; it is a collectively organized process. Having the 'good' information does not mean anything if this 'good' information is in the head of someone who is not involved in decision making, or is not available at the right time or to the right person, or is unusable, or is not well interpreted. We cannot speak of a field of knowledge and a field of decisive action as separated. The value of strategic information is related to that of its eventual consequences; by consequences we mean those which the other actors would draw and the modifications of their action; the consequences which they would draw themselves from what goes before and which, in turn, will modify the situation and the implementation of strategies (theirs and those of the other actors), the new needs of knowledge, that is of cognitive strategy, which will follow: new decisions, new actions, reactions, interactions, and so on.

11 Modélisation du problème informationnel du veilleur dans la démarche d'intelligence économique, Thèse de Doctorat de l'Université Nancy 2, par Philippe KISLIN, 2007.
12 Newell A., Simon H.A., Human Problem Solving, Englewood Cliffs, N.J.: Prentice Hall, 1972, cited in Modélisation du problème informationnel du veilleur dans la démarche d'intelligence économique, Thèse de Doctorat de l'Université Nancy 2, par Philippe KISLIN, 2007.
13 Lebraty J.F., Nouvelles technologies de l'information et processus de prise de décision : modélisation, identification et interprétation, Thèse en Sciences de Gestion, Université de Nice Sophia-Antipolis, octobre 1994.
14 Kotler et Dubois (1990), Marketing management, Publi Union, 6e éd.


To the cognitive aspect of information is added that of conviction; that is, the information must be operational. In fact, information is supposed to rationally clarify and orient the decision maker in his decision taking. If the decision system of the firm is almost absent, no effective business intelligence process can be implemented: the scanning depends on the decision system. Conversely, when the decision system functions well, the business intelligence components adapt to the requirements and internal structures of the firm15.

Business intelligence, whether reactive or anticipatory, is based on a strong idea: every actor of the firm is liable to possess elements of information, and it is the synthesis of these elements which gives rise to information that can be used for action.

2. How can Competitive intelligence be useful in decisions?
The observer in competitive intelligence must be capable of facilitating the work of decision takers, who have little time to scrutinize the global environment. Consequently, he has to channel and direct the information so that the people in charge have a number of elements liable to enrich their strategic decisions. The decision maker should thus consider all the information issued from the strategic lookout process and relate the items to each other in order to bring out their similarities, their points of divergence, of convergence and of complementarity, and to use them as a support for decision taking. The scanning is a tool whose results aim at actions and decisions relating to the future of the firm, in order to 'alert in time', to 'seize the opportunities' and 'not to be taken unawares'. The lookout is therefore 'necessary for action and decision'; it appears as a competitive advantage, a competitive factor, and a key factor of success for firms.

III. THE IMPACT LEVELS OF COMPETITIVE INTELLIGENCE ON THE STRATEGIC DECISION IN THE FIRM

Business intelligence requires an identification of the decision structures. It studies them deeply in order to delimit them and use their channels in an effective manner. The information acquired from the competitive intelligence process will have to go through different filters: the decision maker will not face an overwhelming mass of information, but information which is classified, validated and of high quality. The decision maker will thus do his job without wasting time trying to deal with irrelevant information. In fact, we think that information can constitute both a precious competitive advantage and the origin of a major malfunction, unless it is treated in such a way as to allow the decision maker to obtain useful and filtered information.

15 « Dynamisation du dispositif de veille stratégique pour la conduite de stratégies proactives dans les entreprises industrielles », Thèse de Doctorat réalisée par CHALUS épouse SAUVANNET Marie-Christine, Université Lumière Lyon 2, 2000.


Filters of the decision information captured by the BI device16.

This diagram represents the different stages of the progression of strategic information. The information is first captured and forwarded to a first decision level. Various actions are possible:
1. Selection of data (validation or invalidation of the information collected).
2. Treatment of the information and initiation of action: this information relates directly to the decision makers as players in their business segment, and they can act without informing their superiors.
3. The actors validate relevant information which must be transmitted to a superior level of decision.
Decisions are made at each stage of the scheme. It is this set of micro-decisions that carries the major strategic decisions at the highest level and that impacts the organization.

The goal of establishing a business intelligence pattern is to help in decision making. A tool such as the scanning allows, in the current context, to optimise the chances of success: the risks are big, the deadlines are shortened a lot, and the chances of failure are reduced to the minimum. It is a means to be informed correctly and in time to decide. There are not only objective and formalized factors in the decision-taking process: the decision maker also has to follow his instinct and the advice of those around him. The lookout and its information are only one factor; some leaders abuse it, others abandon it.

Part 3: The practical framework of the study

Competitive intelligence is still a new concept for Moroccan companies in general and exporting ones in particular. Indeed, the competitive intelligence activity is not yet part of the traditions and habits of the majority of managers, since it is little or not at all formalized in these companies. It therefore seemed interesting to us to explore the behavior of Moroccan exporters in the field of competitive intelligence and to see how this process can help in decision taking.

I – Research methodology

The requirements of this analysis lead us to conduct a quantitative research which will come up with statistics relating to the studied phenomenon and allow us to provide answers to the research question through the validation of our hypotheses. Concerning the investigation method, we have chosen:
- firms situated in Fes, Meknes, Tangiers and Casablanca;
- companies of different sizes, in terms of the number of their personnel;
- sectors that are most in need of this activity within their firms.
Considering these parameters, 360 companies were selected.

16 « Dynamisation du dispositif de veille stratégique pour la conduite de stratégies proactives dans les entreprises industrielles », Thèse de Doctorat réalisée par CHALUS épouse SAUVANNET Marie-Christine, Université Lumière Lyon 2, 2000.


After elimination of incomplete questionnaires, we obtained the collaboration of 66 exporting companies. We formulated four hypotheses:

1. In Moroccan exporting companies, competitive intelligence is mainly the domain of big companies;
2. The adoption of competitive intelligence clarifies reality by providing quality information;
3. The quality of information contributes positively to decision making;
4. The adoption of competitive intelligence clarifies decision making.

II. Results and analyses

1. Descriptive study

One of our first questions, 'Do you have a competitive intelligence cell?', aims at guiding the interviewee towards the right scenario of the questionnaire. The 51.5 per cent of the organizations which answered 'no' to this question went directly to the questions relating to the case of non-practice of competitive intelligence. The organizations which answered 'yes' (48.5 per cent) continued normally with the questions relating to the practices of competitive intelligence and its relation with the decision-making process.

This percentage seems high, indicating that the leaders of exporting firms give little importance to researching information on their external environment and are consequently refractory to the informational challenges addressed by a monitoring process. Thus, this paradox may be justified by the scepticism of the firms towards a scanning process: small and medium firms seem to implement information-research actions without identifying them as a lookout activity. This result shows that competitive intelligence is still an uncommon process in our firms, since a little more than 50 per cent of them do not practice it.

1-1. Organization of competitive intelligence in the firms having answered 'yes'

More than half (66.7 per cent) of the firms engaged in competitive intelligence are limited companies, which can be explained by the fact that in most cases limited companies are big firms. So we can say that this result confirms our first hypothesis that competitive intelligence remains the prerogative of big firms.

Concerning the competitive intelligence unit, we found that in the majority of cases (87.5 per cent) firms practicing this process do not have an independent competitive intelligence unit. This shows that, in spite of the integration of this process, it does not have the same importance as the other services which have a distinct status in the firm.

As for the degree of formalization, more than half (62.5 per cent) say that they have a formal lookout team. Regarding the types of competitive intelligence practiced, the one which comes on top is commercial monitoring, followed by the observation of competitive changes. To carry it out, trade-show visits are the most mentioned source used for environmental monitoring, followed by research through the informal network.

As regards the different phases of the competitive intelligence process, a little more than half (53.1 per cent) do not have a targeting process, i.e. they conduct environmental monitoring randomly, without delimiting the external space placed under surveillance. The research phase is used by 81.3 per cent of respondents; then comes the diffusion step, with 75 per cent of companies claiming to have diffusion circuits for the information issued from the competitive intelligence process. 84.4 per cent of the interviewed managers think it is necessary to store the collected information so that it can be accessed and used by the right people in the context of their work. Also, the majority of surveyed managers believe they exploit the information collected and do not have more information than they need. Finally, regarding the animation phase, companies are almost evenly split between those who believe that the competitive intelligence process must have a leader to operate it and those who do not.

So we can say that the least used phases are targeting and animation. Finally, we wanted to measure the degree of satisfaction with the performance of their process: 56.3 per cent of the organizations in our study are satisfied and 15.6 per cent are somewhat satisfied, which indicates a potential for growth and an interest in further developments. This result, which reflects a general satisfaction, may seem a little surprising when we analyse the real performance of the components of their scanning process and the difficulties faced. It can also be explained by a misunderstanding of both the sources and the tools that would optimize the competitive intelligence system.

1-2. Arguments for the absence of a competitive intelligence team

For firms which do not practice competitive intelligence, we tried to find out why. The major reason (50 per cent) is that it is not considered necessary. Other reasons are the lack of knowledge of the process and of the sources (35.3 per cent and 29.4 per cent respectively). This represents a potential for development for information specialists. Indeed, the lack of knowledge of the process and of its usefulness appears clearly in their responses, since the majority of these companies (58.8 per cent) have expressed a need for information during decision making, which means that if they were aware of the usefulness of the monitoring process they would find it useful and use it to minimize their need for information. Also, half of the companies (50 per cent) that do not have a competitive intelligence cell try to find out information on competitors or suppliers. This shows that the leaders of these companies, even if they do not have a competitive intelligence process, have a need for information; this confirms their lack of knowledge of competitive intelligence processes, which give the right information to the right person at the right time to make the right decision17.

2. Exploration and confirmation study

Our first hypothesis was validated through the descriptive study carried out in Sphinx. Concerning our last three hypotheses, we verified them first through an exploration study using PCA, MCA and ANOVA; we then proceeded to a confirmation study. Both studies were carried out with the SPSS programme.

The study uses a number of variables from the information systems literature which are linked to the problem of evaluating information use in the framework of competitive intelligence. We thus tested the impact of the competitive intelligence process on decision making. In order to look for this impact, we established a relation among three variables:
Variable 1: the organizational process of competitive intelligence;
Variable 2: the quality of information;
Variable 3: decision making.

17 Porter M., 1980, Competitive strategy: Techniques for analysing industries and competitors, Free Press.


2-1 Exploration study

The linear regression carried out on the principal components (PCA) allowed us to validate, in a global manner, the positive and significant relations existing between the different variables of our model, namely the organization of competitive intelligence, the quality of information and decision making. The results are summarised in the table below:

Hypothesis | T | Significance | Validation
The adoption of competitive intelligence clarifies reality by providing quality information | 2.906 | 0.007 | Validated
The quality of information contributes positively to decision making | 2.881 | 0.007 | Validated
The adoption of competitive intelligence clarifies decision making | 2.943 | 0.006 | Validated

Concerning the MCA method, it gives strong support to our hypotheses, given that a positive link exists among our variables. Finally, through the ANOVA analysis, we rejected the null hypothesis18: there is, thus, a statistically significant relation between the dependent and independent variables.

To sum up, following the exploration study we can say that the organization of competitive intelligence enables companies to obtain and generalize the dissemination of information in order to take better decisions. Competitive intelligence can thus be approached as a process guided by a strategy that gives it meaning and which, through its organization, has an impact on the decision-making process.
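The exploration study was run in SPSS on survey data that is not reproduced here. As a rough, hypothetical illustration of how statistics of the kind reported in the table (t-values near 2.9 with significance around 0.007, plus an ANOVA F-test of the null hypothesis) are typically computed, the following sketch uses synthetic data and invented variable names:

```python
# Minimal sketch (not the authors' SPSS procedure): illustrates how a regression
# t-statistic and a one-way ANOVA F-test of the kind reported above can be
# computed. The data and variable names below are synthetic and hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
n = 32  # roughly the number of firms practicing competitive intelligence

# Hypothetical aggregated survey scores per firm.
ci_organization = rng.normal(3.5, 0.8, n)               # organization of the CI process
info_quality = 0.5 * ci_organization + rng.normal(0, 0.7, n)
decision_quality = 0.6 * info_quality + rng.normal(0, 0.7, n)

df = pd.DataFrame({"ci_organization": ci_organization,
                   "info_quality": info_quality,
                   "decision_quality": decision_quality})

# Simple linear regression: does information quality relate to decision quality?
model = smf.ols("decision_quality ~ info_quality", data=df).fit()
print("t =", round(model.tvalues["info_quality"], 3),
      "p =", round(model.pvalues["info_quality"], 3))

# One-way ANOVA: compare decision quality between firms with low vs. high
# CI organization (median split), i.e. a test of the null hypothesis of no relation.
low = df[df.ci_organization <= df.ci_organization.median()]["decision_quality"]
high = df[df.ci_organization > df.ci_organization.median()]["decision_quality"]
f_stat, p_value = f_oneway(low, high)
print("ANOVA: F =", round(f_stat, 2), "p =", round(p_value, 3))
```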

2-2 Confirmation study

As for the confirmation phase, the PLS-SEM modelling allowed us to deepen this study by providing a detailed analysis of the types of relations existing between the constructs (direct and indirect relations) and their origins (taking into account some sub-components). In order to do this, we came up with sub-hypotheses, i.e. theoretical assumptions made to determine the origin of the links existing among the variables:

H2:



H2.1 the organization of the competitive intelligence has an impact on the quality of information



H2.2 the research process has an impact on the quality of information



H2.3 the process of access and distribution has an impact on the quality of information



H3:

H3.1 the quality of information on the suppliers, competitors and consumers has an impact on the decision making

18 The null hypothesis states that there is no relationship between the dependent variable and the independent variable, so that the independent variable does not predict the dependent variable.


H3.2 the relevance of information has a positive impact on decision making

H4:

H4.1 the organization of competitive intelligence has an impact on decision making

H4.2 the research process has an impact on decision making

H4.3 the process of access and distribution has an impact on decision making

The first two sub-hypotheses have been validated through a positive, reliable and convergent relationship between these indicators and their constructs, and through a direct, positive and significant impact between the organization of the competitive intelligence cell and the quality of information on the one hand, and between the research process and the quality of information on the other hand. So we can conclude that a good organization of the competitive intelligence unit and an organized research methodology within the company have a significant effect on the quality of the information available in the company. As for the third sub-hypothesis, we found that there exists a positive link between the variables, but this link is not significant. The influence of storage and distribution on the quality of information, although statistically small, is justified, as the causal factor is positive: information storage and organization, as well as the use of an adapted diffusion circuit, help the company to maintain the necessary data and facilitate user access to information.

The fourth sub-hypothesis has a positive, significant and indirect impact on the decision process. The fifth sub-hypothesis has been validated through a positive and significant link. The sixth and seventh sub-hypotheses have not been validated: we found a positive but not significant relationship between the variables. Finally, the last sub-hypothesis has been verified through a positive and significant link. While the latent variables measuring the organization of competitive intelligence and the research process have little direct effect on decision making, they have an important indirect effect on decision making through the quality of the information that they provide.

In sum, the organization of the competitive intelligence process has a significant influence on decision making through direct linear relationships (access and circulation) and indirect linear relationships (organization of the monitoring unit and the research process). We can say that the organization of competitive intelligence (its organization, its research process, its degree of formalization…) provides quality information which, in turn, has an impact on the decision process. However, we have observed that the direct impact of competitive intelligence adoption on the decision process is verified only through the access and distribution variable, which requires an information system in order to store and distribute information to the relevant people and help them take good decisions.

In fact, it is necessary that decision makers identify their information-research problem. This stage allows them to solve their decision problems and orient their actions. Competitive intelligence is, thus, a process that supports decision making.
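The confirmation phase relied on PLS-SEM, which is typically run in dedicated software; the sketch below does not reproduce that model. Under the stated assumption of synthetic data and invented variable names, it merely illustrates the reported pattern (a small direct effect of CI organization on decision making, with a larger indirect effect passing through information quality) using two ordinary least-squares regressions:

```python
# Hedged sketch of the direct-vs-indirect effect pattern described above.
# The original study used PLS-SEM; here two ordinary regressions only illustrate
# how an effect can be small directly but substantial through a mediator.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 66  # sample size of the survey

ci_org = rng.normal(0, 1, n)                            # organization of the CI process
info_quality = 0.7 * ci_org + rng.normal(0, 0.7, n)     # CI organization -> quality
decision = 0.1 * ci_org + 0.6 * info_quality + rng.normal(0, 0.7, n)

# Mediated model: decision ~ ci_org + info_quality
X = sm.add_constant(np.column_stack([ci_org, info_quality]))
mediated = sm.OLS(decision, X).fit()
coef = np.asarray(mediated.params)      # [const, ci_org, info_quality]

# Path a (ci_org -> info_quality) and path b (info_quality -> decision)
a = np.asarray(sm.OLS(info_quality, sm.add_constant(ci_org)).fit().params)[1]
b = coef[2]

print("direct effect of CI organization:", round(float(coef[1]), 2))
print("indirect effect via information quality (a*b):", round(float(a * b), 2))
```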


Conclusion

The acquisition of information has become a necessity. Competitive intelligence has the mission of providing decision makers with useful information, allowing them to build a knowledge base in order to take better decisions and to support collective intelligence. Therefore, scanning is, through its organization and its different stages, an important tool for preparing decisions.

Finally, we can say that competitive intelligence is a very relevant research topic for firms, especially exporting ones, as they are the ones which deal with the outside world. There is a lot of work to be done to create awareness about this issue, in order to show our decision makers that competitive intelligence is a profitable investment.

Bibliography

- Brosset-Heckel V. et Champagne M., 2011, Management des expertises et veille, Afnor.
- Chalus-Sauvannet M.-C., « Dynamisation du dispositif de veille stratégique pour la conduite de stratégies proactives dans les entreprises industrielles », Thèse de Doctorat, Université Lumière Lyon 2, 2000.
- Coggia V., 2009, Intelligence économique et prise de décision dans les PME, L'Harmattan.
- Guechtouli M., « Comment organiser son système de veille ? », Symposium ATELIS, Beaulieu-sur-Mer, November 25-26, 2009.
- Kislin P., Modélisation du problème informationnel du veilleur dans la démarche d'intelligence économique, Thèse de Doctorat de l'Université Nancy 2, 2007.
- Kotler P. et Dubois B., 1990, Marketing management, Publi Union, 6e éd.
- Lebraty J.-F., Nouvelles technologies de l'information et processus de prise de décision : modélisation, identification et interprétation, Thèse en Sciences de Gestion, Université de Nice Sophia-Antipolis, octobre 1994.
- Lesca H., 1997, Veille stratégique : concepts et démarche de mise en place dans l'entreprise, Guides pour la pratique de l'information scientifique et technique, Ministère de l'Éducation Nationale, de la Recherche et de la Technologie.
- Lesca H., 2003, Veille stratégique : la méthode L.E.SCAnning®, Éditions EMS.
- Newell A. et Simon H. A., 1972, Human problem solving, Englewood Cliffs, N.J.: Prentice Hall.
- Pateyron E., 1997, « La veille stratégique », Encyclopédie de gestion, Economica, Paris.
- Pateyron E.-A., 1994, Le management stratégique de l'information : applications à l'entreprise, Economica.
- Porter M., 1980, Competitive strategy: Techniques for analysing industries and competitors, Free Press.
- Prestations de veille et prestations de mise en place d'un système de veille, AFNOR, avril 1998, p. 6.


Constraints to Integrate Competitive Intelligence within the Algerian Operator of the Telephony Mobile “Mobilis” Case Fariza NANECHE, Department of management, Mouloud MAMMERI University, Algeria [email protected] Yacine MEZIAINI, Department of management, Mouloud MAMMERI University, Algeria [email protected]

Abstract_ In this paper we approach competitive intelligence from a management perspective. We study the case of a public Algerian company, “Mobilis”, in order to highlight its assets and constraints with respect to the implementation of competitive intelligence practices. Our national companies, operating in both the public and the private sectors, must regenerate their strategic management. This requires, in the era of information-based business, a permanent appropriation and exploitation of pertinent information with a view to creating competitive advantages. However, that cannot be reached without setting up and exploiting a coherent process affecting the competitiveness of the whole company, which we call competitive intelligence.

In this era, companies often perform informational activities largely underlying their competitiveness, likely, those activities aiming to outperform all others value company’s activity. Therefore, the Competitive Intelligence, can also be considered, one of them, which is as well as a determinant key of gaining sustainable competitive advantages. Competitive intelligence, is in a broad sense presented as a strategic voluntarist process allowing the company to monitor both internal and external environment, in so doing, it captures needed information, thus, improve daily decisions making. Indeed, the number of companies around the world adopting this activity has dramatically increased, since; scholars strongly highlighted the extent of competitive intelligence in enhancing company’s competitiveness. However, the notion of competitive intelligence was introduced in the Algerian economy, for more than a decade. Nowadays, the number of Algerian companies’ likelihood performing competitive intelligence’s practices is insignificant in both public companies and private ones. Indeed, some initiatives were done, for example, the creation of the Regional Economic Monitoring of the East (REME), in 1994, which, comprises near twenty companies both public and private. At the same time, a process of technologic monitoring was set within the national company for information systems. Mainly, too many goals were assigned to those initiatives, which for want of pursuing and framing have failed. This paper aims to help our national companies to respond to the challenge of the era of information based business, by highlighting what could be the eventual constraints restraining the appropriation of the competitive intelligence within national companies.

Keywords_ Information; competitive intelligence; competitiveness; informational patrimony’s security; influencing actions; strategic monitoring I. INTRODUCTION Information and telecommunication revolution is sweeping through economies. No company can escape its effects. Moreover, success in today’s business world consequently requires capturing, manipulating, and channelling information necessary to perform activities. Obviously, every activity creates and uses information of some kind, a service activity, for instance, uses information about service requests to schedule calls and order parts, and generates information on product failures that a company can use to revise product designs.


To do so, we wish to bring a new contribution. But first we would to remind that the notion of competitive intelligence still discussed in the managerial literature. For that, we first encompass the notion of competitive intelligence. Then, we investigate the required conditions to set competitive intelligence’s practices, which we apply to the case of the national operator of telephony mobile “Mobilis”, in so doing, we will to identify strengths and weaknesses of “Mobilis” in the appropriation of competitive intelligence.

organizational activity systemizing the collection, treatment and the exploitation of environmental information (Baumard, 1991), CI is also, defined as a process of information’s exchange (Bloch, 1999), furthermore, it is considered like the ability to identify strategic opportunities (Marmuse, 1996), or, as an organized strategic approach, that aims to enhance competitiveness, by the collection, treatment of information, and the diffusion of useful knowledge to control environment (Bournois & Romani, 2000). Moreover, we suggest defining CI as organizational activity procuring information required, to perform company’s activity; it enhances the interactivity on a fluctuating environment, with all warranties of immaterial patrimony’s protection. CI is the combination of three informational practices: the strategic monitoring, the informational patrimony’s security and last the influencing actions (Larivet, 2006). The influence between the company and its environment is mutual, therefore, the company cannot be considered as a locked system, so starting to supervise the environment, becomes primordial. The strategic monitoring is a voluntarist informational process, through the company can capture information; this process enables the company to anticipate threats and opportunities, in a way to reduce incertitude’s factors. Beyond, the strategic monitoring is a cycle composed on four steps: first, identifying and formalizing informational needs, second, collecting information, third, treating information, and last, diffusing them to final users. Consequently, the strategic monitoring can be considered as this source of information asymmetry among companies. Information coming from the process of strategic monitoring, will be diffused both in and out of the company, nevertheless, the diffused information don’t have to be accessible to everyone (Marcon & Moinet, 2006), thus, this function have to protect information held or issued, in particular, to avoid their appropriation by rivals in order to sustain information asymmetry. Securing the company’s information, lies on identifying and selecting, those which have to be protected, in fact, none company can secure all its information, specially, that threats’ sources are multiple, as threats over products

II. ENCOMPASSING THE NOTION OF COMPETITIVE INTELLIGENCE The competitive intelligence (CI) is built on the idea that companies exploit information in order to outcompete theirs rivals, and achieve sustainable competitive advantages, therefore, our understanding about how the use of information can improve the company’s competitiveness can be a determinant factor in the integration of the competitive intelligence in our companies. The essence of the CI is that the information changes the way companies operate, it also, affects the entire process by which companies create value, in this way, information has acquired strategic significance, otherwise, information must be considered as a strategic resource, which gives today’s company competitive advantage. Anglo-Saxon were pioneers in the implementing and exploiting the CI, obviously, the first who wrote on this topic, were American. So, Luhn (1959) proposed the notion of business intelligence system which, he defined as a communication facility serving the conduct of a business. Wilensky (1967), also talked about organizational intelligence, considered as a process of collecting, treating, interpreting and communicating necessary information to decisions making. Porter (1982) was likelihood, the founder of the CI, through his researches focused on the process of the transformation of information into intelligence which, supports the making of decisions. Since, terms referring to the use of information for strategic purposes have widely multiplied. Despite the will of conceptualizing the notion of CI, definitions are numerous, and somehow conflicting. Consequently, CI is defined as an


(pirating license, counterfeit, sabotage), threats over sites (intrusion, listening), threats over information (rumors, disinformation), (Levet & Paturel,1996). As we have seen, the information is nowadays considered as a strategic resource, therefore, triggering an informational culture within the company became an exigency, particularly, that the most information’s loss comes from involuntary human’s errors. Influencing actions highlight perfectly the CI’s interactive dimension; by performing influencing actions, the company anticipates changes, moreover, it initiates and leads the transformations, and shapes its environment in its favor and forces other companies to follow. There are several types of influencing actions, according to the users, to the influence’s target, and to the concerned scope. As noted earlier, the company captures information by the strategic monitoring, that information are often used to make informed decisions, as well as influencing actions, channel issued information to decisions process of other companies like rivals, partners, and the public sphere. In doing so, the company affects its environment, in fact, success in today’s company lies on their ability to trigger changes in the industries at the expenses of its competitors. Scholars have acknowledged that the success of CI’s process, requires the intervention of three kind of key actors: watchers, experts and decision makers (Jakobiak, 2004), actually, watchers, have to capture and collect information from external environment, thus, the company can confer this function to its employees or to external ones. Experts have to analysis treat information captured by watchers, at this point; the information will be transformed to exploitable and interpretable information which assist the decision making. Mainly, making a decision involve pertinent information, which reduce the incertitude of the company’s environment. Those actors can coordinate their actions closely, and form a network in order to provide decision makers with useful information, missing their decisional puzzle (Lesca, 2003). The figure 1 represents the CI’s actors, highlighting the linkages between them, in terms of capturing, treating and using information, as a strategic resource.

Consequently, in any company, CI has a powerful effect on competitive advantage, using pertinent information; it allows companies to achieve competitive advantage by leading and exploiting changes in competitive scope on the one hand, and sustaining information asymmetry at the expenses of its rivals, on the other hand. However, the ambition of our work is nothing less than to explain what restrains national companies from setting CI, as we noted earlier, CI has a great role in achieving competitive advantage, and then reinforce the company’s competitiveness. This goal can only be reached, if we inspect every determinant factor in the process of implementing competitive intelligence within a company in a broad sense, then, we transpose these factors to the case of the Algerian company of the telephony mobile “Mobilis”, in order to highlight the effective source of this company’s deficiency. III. CONDITIONS TO IMPLEMENT COMPETITIVE INTELLIGENCE IN A COMPANY Considering the factors which can make easy the adoption of the CI’s practices, shows that it doesn’t exist a unique model of CI for all, because each company can build its inherent model, nevertheless, the main axis remain similar. In the purpose of identifying priors to the appropriation of CI’s practices in companies, we take into consideration three factors constituting the most companies: resources, the organizational structure, and informational culture.


A. Required Resources to Implement Competitive Intelligence

Three kinds of resources seem indispensable for implementing CI: human resources, technology and, last, financial resources. The adoption of CI practices is often the result of the perception of a need, a project, or a piece of work combined with the awareness of political and economic decision makers (Levet, 2008). Implementing CI requires several heterogeneous and transversal competencies, just as the comprehension of CI draws on numerous disciplinary fields. Therefore, integrating CI leads back to considering the company's human resources from two perspectives: the mobilization of collaborators on the one hand, and organizational learning on the other hand. Furthermore, this project claims a concrete and constant contribution from the company's members, which demands growing awareness and progressive training in CI practices. Managers evoke, to varying degrees, employee motivation as a fundamental criterion of success, although this motivation is acquired through confidence within the company and through the fulfilment that collaborators achieve in performing their jobs. CI is developed on collective competencies: the mere presence of multiple competencies is not enough to generate an intelligence behaviour if they simply coexist; ideally, those competencies have to be shared in an interactive way within an enriching and mutually dynamic organizational learning, where everyone collaborates and takes advantage of it. The project of implementing CI will be easier when the company is embedded in an internal learning dynamic; the cognitive company is the best place to set up CI practices.

Technology was not considered before as a determinant factor in most models of strategic analysis; the technology revolution was mainly regarded as an external phenomenon, intrinsic to the competitive environment, which the company has no ability to control. Nowadays, technological resources are at the essence of the definition of today's company. Technology is far from being that exogenous variable which companies cannot control at all; in fact, it affects the entire company, and each value activity is affected deeply. Moreover, two technological aspects are very pertinent: the proper usage of information and communication technology (ICT) on the one hand, and the setting up of an effective information system on the other hand; the success of CI lies in the quality of those technological aspects. The control of ICT has a great role in the appropriation of CI: information lies at the junction of both ICT and CI, since ICT produces information and CI harnesses it for a specific purpose. CI often takes advantage of ICT facilities in terms of capturing, treating, storing and channelling information. The appropriation of CI also inevitably needs an original reflection on the internal organization on the one hand, and on information systems on the other hand.

Most companies perceive that implementing CI requires colossal financial resources, and for that reason small companies do not plan to set it up, although scholars advocate that costs are elastic and adaptable to each kind of budget: even a very small company can implement CI. A company sets up CI in order to improve decision making, thus enhancing the possibility of making right decisions and reducing the possibility of making wrong ones; so managers have to take into consideration a certain acceptable level of risk that the company can support after taking a wrong decision. In fact, the risk level and the financial investment in CI fluctuate in opposite directions (Martinet & Marti, 1995). Yet it is up to the company to decide how much risk it can admit and, consequently, what budget it should devote to adopting CI. The measure is not obvious or easy to make, but we already know, as the following figure shows, that if the company does not plan a budget for CI, the risk will be higher.


B. The Appropriate Organizational Structure for Competitive Intelligence

Companies have articulated their organizational structure in a new way, adapting it quickly and vigorously to environment changes in order to stay alive. The appropriation of CI demands an adapted organizational structure, since some organizational structures restrain this project, in particular structures where the partitioning between components is high. Other organizational structures are favourable to implementing CI: the catalysis of CI needs a network structure rather than a hierarchical one, where communication is more vertical and formal. In a network structure the communication is mainly transversal and every collaborator can contribute. Hence, the ultimate authority does not belong to the top; the other levels co-determine the result of a particular decision. The real stake for the company is nothing less than to slide from a hierarchical authority to a knowledge network, in order to lay the required bases for an effective collective communication within the company, which is necessary to appropriate CI.

C. Informational Culture in the Company

The informational culture of a company is considered a core element for the integration of CI, so CI requires an appropriate culture which can encompass it. The opportune informational culture needed to implement CI is a mix of two complementary attitudes: the valorization culture and the collective culture. Valorizing information within the company involves being aware that, in the era of information-based business, information is nothing less than a strategic resource, and that harnessing it cleverly must be acquired. In fact, the best way to motivate the company's members to use information is to show them the advantages that exploiting information creates. With unshared information the company cannot achieve competitive advantage; it is actually the sharing of information, or the collective culture, which confers power and added value to the company. Very often, companies spend 90% of their time and resources performing the capturing and treating steps, while they spend only 10% on sharing and channelling information to final users.

With information sharing, the company avoids threats, takes advantage of opportunities and anticipates competitors' behaviours. Furthermore, this sharing brings flexibility and engenders a decisive competitive advantage. Therefore, the lack of a collective culture within the company can be the result of a structural deficiency: the information then channels badly or does not channel at all, as the partitioning of departments and the multiplication of hierarchical levels restrain the sharing of information within the company. The origins of this structural deficiency lie mainly in a lack of coherence and cooperation among the company's collaborators, because they think that withholding information confers power or, on the contrary, they misunderstand the information's value and so do not want to share it; another reason is the absence of feedback from receivers.


Also, we first present the national operator of telephony mobile and its economic importance, then, we deal with its strengths and weaknesses to adopt CI. A. A historic National Operator Mobilis is a national company owning a social capital of 100.000.000, 00 DA, which is divided on 1000 shares. Pioneer of the telephony mobile, it was a monopolist over the cellular’s telephony Algerian markets as a subsidiary of the historic operator Algérie Télecom. The monopole of Mobilis is ended on 2001, when the Regulation Authority of Post and Telecommunications (RAPT) gave Orascom Telecom Algeria a license to furnish the telephony mobile’s services. After dealing with some crisis, in particular, a collaborators’ displeasure and the accentuation of the syndicalism’s movements, Mobilis was reorganized and a new staff were appointed with a board directors comprising nine collaborators belonging to several professional horizons. 2004, outstands the set of new strategies for all functions like the sales, technical and deployment, thus, the necessary conditions to an effective starting were gathered. An innovative marketing approach, an effective communication policy besides to new network proceedings’ deployment, permit to this young company to yield superior results in few years. Mobilis performs on a national market of 37.5 million telephone subscribers, like the last statistic published of the RAPT shows. The number of telephone subscribers was enhanced between 2013 and 2014 by 5.3%. RAPT mentioned on 2013, 35.6 million telephone subscribers in Algeria, divided on the three operators of the telephony mobile: Vimpelcom (Djezzy), Algerie Telecom (Mobilis), Qtel (Ooredoo). The telephony mobile market share’s evolution shows that Djezzy owns 47.55% of telephone subscribers, Mobilis 28.31% and Ooredoo 24.14%, the next figure reflects those data:
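The market-share figure referred to above is not reproduced here. As a quick arithmetic check of the quoted statistics, the following snippet converts the RAPT percentages into approximate subscriber counts out of the 37.5 million total; the operator names and figures are those quoted in the text, and the code itself is only illustrative:

```python
# Quick arithmetic check of the RAPT statistics quoted in the text:
# convert market shares into approximate subscriber counts.
total_subscribers = 37.5e6
shares = {"Djezzy": 0.4755, "Mobilis": 0.2831, "Ooredoo": 0.2414}

for operator, share in shares.items():
    print(f"{operator}: ~{share * total_subscribers / 1e6:.1f} million subscribers")

print(f"shares sum to {sum(shares.values()) * 100:.2f}%")
```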

Implementing the CI is more an intrinsic project, this is the reason for we have chosen to enumerate and analyze determinant factors allowing this appropriation. Moreover, each company has to value its strengths in an original way, and to develop as a matter of fact an idiosyncratic CI. In each case, the CI must not be an aim in itself, but as this instrument enabling the company to consolidate its market position, and so, its competitiveness in a hostile environment. Keep its eyes on this evidence, optimizes in a large sense the CI appropriation, since it affects the entire company, otherwise, total interest of stakeholders. IV. THE ALGERIAN OPERATOR OF THE TELEPHONY MOBILE” MOBILIS” CASE In management sciences, studying cases is a classic approach, yet, we don’t develop asset and inconvenient of this methodology, furthermore, we use some information’s sources in order to build this study; it mainly concerns documentary researches and semidirecting interviews. We have chosen in representing this paper, to include some integral manager’s responses, with whom we had discussions, other answers are gathered in a different way, in order to facilitate the reading. Our framework proposes to highlight constraints of this company which restrains the appropriation of CI’s practices, at the same time, we study several factors which can motivate this eventual project, thus, we have exploited the implementation’s conditions as a reading grid for this particular case.


Mobilis performs in a very competitive sector, in which, the operator having the ability to harness information for strategic purposes, achieves competitive advantage, since, the information is considered as a strategic resource. As a national operator of telephony mobile Mobilis, can outperform its two rivals which are subsidiaries of international holdings, if it integrates to its core strategy of growing, the exploitation of information in particular, technologic, competitive and sale’s information. However this exploitation cannot be effective without an organized approach around the diffusion and using of information in order to achieve strategic goals. B. Mobilis’ Resources Impact to Implement the Competitive Intelligence

Mobilis is an Algerian public company owning an important capital of resources, as follows.

1) Human resources: the national operator Mobilis employs more than 4,000 employees, divided among engineers, technicians, application engineers, …

Despite the good levels of education of Mobilis' employees, it remains urgent to adapt and update their competencies to the needs of their posts; the training directorate therefore intervenes to continually train Mobilis' collaborators by dispensing two main types of training: - technical training for executives and engineers; - managerial training for managers and sellers. Furthermore, most of Mobilis' collaborators are young; more than half of them are aged between 24 and 30. This is an important indicator, as it pushes back retirements and thus the loss of valuable collaborators. In fact, those collaborators represent a huge investment for this company, since they are a source of creativity and dynamism and, moreover, leaders of change. As we mentioned earlier, the integration of CI requires collaborators having some indispensable, polyvalent and complementary competencies, in particular competencies intended to collect, manipulate and transform information into actionable knowledge. CI is a connection between knowledge and actions (Baumard, 1991). Consequently, Mobilis has to develop its collaborators' competencies around two fundamental axes:

- Information and knowledge; - Action. And to mobilize its collaborators by informing and explaining to each one, that he already is a potential actor of the CI. The project of implementing CI cannot be reached without a real participation of all, with the proviso that, it is accompanied by the top management’s legitimate authority. Furthermore, if the information is individual, the intelligence is collective (Moinet, 2011). Then, implementing and exploiting CI need collective competencies. Contrary to, individual ones, collective competencies are resulted from an internal process of organizational learning. Every collaborator contributes to that learning process. For that, our national operator, in addition to formation dispensed, must set an internal organizational learning process, in order, to create new collective competencies articulated around information and actionable knowledge. In fact, without those competencies, adopting CI’s practices cannot be realized. Mostly, Mobilis’ youth collaborators suit triggering and sustaining a dynamic of an effective learning process.


These functions perfectly underlie CI’s practices, performing it, enables Mobilis to set easily the CI and to exploit it effectively.

2) Technological Resources Mobilis is an operator of telephony mobile, it contributes to broadcast ICT in Algeria, but also it uses them in performing its daily activities. When we were at this company we noticed that all work posts were connected to the Internet, this company owns it inherent Intranet, which allows its collaborators to compress the spaces, expend the information storage, gain a precious time and a flexible usage. ICT produces, diffuses and stocks information, as the information system manager of Mobilis underlines, he added that information from 80% of this company’ patrimony, so, the usage of ICT within this company is inescapable, in order to: - Enrich more with new information Mobilis patrimony; - Channel and share information within Mobilis, and stock information at the same time, for that, this company builds an emergency center which shelter all information, in order, to assure continuity service, if undesirable events happen.

3) Financial Resources On the contrary of some public companies that use subventions to perform their activities, Mobilis uses its own resources in financing investments. The turnover of this company was enhanced by over 11.4% to reach 59 billion dinars in 2012, thus, the benefit of Mobilis has doubled comparing to the year before. Furthermore, the results of Mobilis for the first semester of 2013 are very positive, registering a growth of 25%, compared to the same period of the previous year. Consequently, with those results, it likely seems that Mobilis have consequent financial resources. Financial resources of Mobilis cannot be considered as a constraint to the appropriation of the CI. Moreover, this public operator can take advantage of its financial resources and palliate others deficits in terms of: - Recruiting collaborators having needed competencies to integrate CI’s practices; - Procuring coaches, in order to form and update the collaborators of Mobilis; - Acquiring more modern technologies affecting the capture, treatment, exploiting of information for strategic purposes; - Renewing and consolidating the effectiveness of its informational system; - Restructuring Mobilis’ organizational structure; - Creating strong links within the network developed; - Gathering required conditions to set an effective informational culture.

One of these technologies that Mobilis owns, is the information system “Lotus Domino”, which is considered as an internal communication system, enabling users to stay all the time connected on the one hand, and constituting their own networks on the other hand. This system offers a perfect control of channeling information within the company, also, it permits to select receivers of the information shared, and over the team room, this system strengths the confidentiality of information which have to be exchanged between members of a common project. However, owning an effective information system is a condition to implement CI; consequently, we argue that Mobilis has an important arsenal of appropriate technologies that can be exploited to implement CI within this public Algerian company. Consequently, Mobilis ought to exploit its owning ICT and the “Lotus Domino” system in order to: - Collect pertinent competitive, sale and technology’s information; - Treat collected information; - Diffuse information selectively, within and out this public operator; - Memorize and to stock information used.
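As a loose illustration of the selective diffusion described above, and explicitly not of the Lotus Domino API, the following sketch shows the general idea of delivering a piece of monitored information only to the members of the team room authorized to receive it; all names and structures are hypothetical:

```python
# Generic illustration of selective diffusion (not the Lotus Domino API):
# a collected piece of information is routed only to members of the team room
# that is authorized to receive it. Names and structures are hypothetical.
from dataclasses import dataclass, field

@dataclass
class TeamRoom:
    name: str
    members: set[str] = field(default_factory=set)

def diffuse(item: str, room: TeamRoom, recipients: set[str]) -> list[str]:
    """Deliver `item` only to the requested recipients who belong to the room."""
    authorized = recipients & room.members
    return [f"to {person}: {item}" for person in sorted(authorized)]

room = TeamRoom("3G network deployment", {"r.ali", "s.meriem", "k.yacine"})
messages = diffuse("competitor launched a new data offer", room,
                   recipients={"s.meriem", "unknown.user"})
print(messages)  # only s.meriem receives the information
```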

Integrating CI somehow demands financial resources; in fact, the financial resource is considered the one which is responsible for the creation of other resources. In the case of the national operator of the telephony mobile Mobilis, its important financial resources must be that base, subtending the set of the CI’s practices, by acquiring and gathering lost indispensable resources.


C. The Mobilis Organizational Structure

The organizational structure of Mobilis contains many hierarchical levels; it is built of a central management and operational management gathered into three divisions. There are six levels from the top to the base of Mobilis for each division, and this operator has three divisions, which is considerable, as we know that today's company requires a light organizational structure with few hierarchical levels, which engenders more flexibility and so more effectiveness. Consequently, Mobilis is likely a piling-up of hierarchical levels which reduces its flexibility, all the more so as this criterion conditions the company's behaviour towards its competitive environment, because a flexible company is more reactive to eventual environmental fluctuations. The multiplication of hierarchical levels within Mobilis is a source of: - heaviness and inertness in its daily activities; - huge efforts to channel and diffuse information; - accentuation of rumours and alteration of the quality of channelled information; - alteration of feedback actions when information is received. The organizational structure of Mobilis is not appropriate for harnessing and taking advantage of the information channelled within it daily, because with such a structure the appropriation of information within Mobilis and its exploitation are strongly unfocused, and therefore not effective. Moreover, it is the diffusion and channelling of information which fail the most: with a huge partitioning of departments and services, the information required for decision making cannot be easily and effectively communicated at the desired moments.

D. The Mobilis Informational Culture

Therefore, we tried to inspect two attitudes with the managers with whom we discussed, namely the valorization informational culture and the collective informational culture. Most of those managers show a great awareness concerning the importance of the use of information within this company; in fact, they are willing to give more extent and formalism to the exploitation of information, which thus requires elevating information to a strategic resource, allowing companies to create informational asymmetry and to achieve competitive advantage. Contrary to the first attitude, the second one, concerning the sharing of information within Mobilis, is not effective because of: - human weaknesses relative either to insufficient awareness or to collaborators' training; - an inappropriate organizational structure containing too many hierarchical levels; - the fact that sharing is not systematic and has no particular purpose; - the fact that sharing is masked by the importance of rumours, besides the collaborators' individualism culture, so that team work is frequent but not effective. Subsequently, the collective informational culture within Mobilis is still not mature, and that for a simple reason: information is still considered in Algeria as an instrument of authority, so no one is willing to share it. The sharing of information within companies confers competitive advantages (Achard, 2005). Regrettably, our national operator is far from setting up an effective collective informational culture. This does not mean that Mobilis is forever condemned: the necessary elements are gathered; Mobilis just needs to trigger strong linkages among each constitutive factor in order to develop an appropriate and idiosyncratic informational culture. Let us remind again that, to adopt CI practices, an effective informational culture remains inescapable for every company.

CONCLUSION CI represents a pertinent notion in the literature of management; it is built on a combination of three interdependent functions: the strategic monitoring, the informational patrimony’s security and influencing actions, in fact, those


functions are applicable in all types of companies. Furthermore, highlighting several conditions to the implementation of the CI is useful as an analyzing grid for constraints and assets of companies to this project. In an empirical perspective, we have attempted to identify which can constitute strengths and weaknesses of the national operator of the telephony mobile Mobilis to appropriate the CI, however, technological and financial resources correspond to strategic levers which must be exploited to palliate other deficits on the implementation of the CI within this operator. In other words, Mobilis detains some modern technologies as the system of “Lotus Domino”, so it is important to integrate these technologies and to exploit them efficiently in the implementation of the CI. In addition, our public operator has good financial conditions. Therefore, it is appropriate to harness its financial resources with a view to acquire and to gather lost resources to facilitate the adoption of the CI’s practices. Although, human resources represent the main deficiency of Mobilis since on the one hand collaborators are not well trained to CI’s practices, and on the other hand, top managers don’t aim to set CI. Consequently, triggering a dynamic mobilization and a process of organizational learning, enable this operator to create collective competencies required to set the CI. The organizational structure of Mobilis is not appropriate to this project and thus, is considered as a constraint. Both the collect and the share of information are ineffective within Mobilis, with an

apparent partitioning of directions and divisions, efforts spending on the collect are unfocused and the share is unsuccessful, so, pertinent information needed to make decision cannot be available on time. Manager continue to make decisions even if they have not pertinent information, in this case, environment changes are not take into account, then, Mobilis’ reactivity fails in front of two aggressive rivals. The Mobilis’ informational culture is also another constraint to the appropriation of the CI within this operator. The informational culture of valorizing information, certainly, is recognized among most of Mobilis’ collaborators but it stills in its infancy. Also, the collective culture of information is not well developed in Mobilis. That affects the diffusion and the channeling of information to the final users and so the quality of decision making, in particular, strategic ones. Nevertheless, to consider the information as a new strategic resource requires to unlearn and to regenerate the collaborators’ knowledge in order to prepare them to new challenges about the use information for strategic aims. Hence, the necessity, to divide pertinent information becomes primordial. For future perspectives, we suppose that is also interesting to consider what can make raise the CI’s practices within national companies, and in a managerial approach to take advantage of contextual conditions defining the effectiveness of the CI for the long term.

[5] BOURNOIS F & ROMANI P-J, « L’intelligence économique et stratégique dans les entreprises Françaises », ECONOMICA, 2000, pp 19. [6] FARNEL. F-J, « Le lobbying : stratégies et techniques d’intervention », les éditions d’organisation, 1994, pp 24. [7] JAKOBIAK. F, «L’intelligence économique : la comprendre, l’implanter, l’utiliser », les éditions de l’organisation, 2004, pp 88-90. [8] LEVET. J-L, «Les pratiques de l’intelligence économique », 2nd ed, ECONOMICA, 2008, pp 10.

REFERENCES [1] ACHARD. P, « La dimension humaine de l’intelligence économique », LAVOISIER, 2005, pp 130. [2] BESSON. B & POSSIN. J-C, « Du renseignement à l’intelligence économique », 2nd ed, DUNOD, 1998, pp 126. [3] BAUMARD P, « Stratégie et surveillance des environnements concurrentiels » MASSON, 1991, pp 43. [4] BLOCH A, « L’intelligence économique », 2nd ed, ECONOMICA, 1999, pp 25.


[9] LUHN H-P, « A business intelligence system », IBM Journal, October 1958, pp 314. [10] MARCON. C & MOINET. N, «L’intelligence économique », DUNOD, 2006, pp 75. [11] MARTINET. B & MARTI. Y-M, « L’intelligence économique : les yeux et les oreilles de l’entreprise », Les éditions d’ORGANISATION, 1995, pp 200. [12] MASSE. G & THIBAUT. F, « Intelligence économique : un guide pour une économie de l’intelligence », DEBOECK, 2001, pp 58. [13] MOINET N & DESCHAMPS C, « La boite à outils de l’intelligence économique », DUNOD, 2011, pp 42. [14] NORLAIN. B & LA SPIRE. L-T, « L’intelligence économique au service de l’entreprise », PUBLISUD, 1999, pp 11. [15] PORTER M, « Choix stratégique et Concurrence », ECONOMICA, 1982, pp 80100. [16] TABATONI. P, « Les systèmes de gestion: politiques et structures », PUF, 1975, pp 227. [17] WILENSKY H « Organizational intelligence: knowledge and policy in government and industry », 1967, pp 75.

[18] BENSBAA. F, « Les organisations ne peuvent plus ignorer l’intelligence économique », 5th conference de l’intelligence économique VIP Group, Algiers, Esplanade du Sofitel, November 28th, 2011, pp 05. [19] DUSSAUGE. P & RAMANANTSOA. B, « Technologies et stratégie », Harvard L’Expansion, Summer 1986, pp 62. [20] JUILLET. A, « Intelligence économique et pôles de compétitivité », conference Avignon, May 11st, 2006. [21] LARIVET S, « L’intelligence économique: étude de cas d’une pratique managériale accessible aux PME », 8th International Congres of Entrepreneurship, october 25th- 27th 2006, pp 9-14. [22] LEVET. J-L & PATUREL. R, « L’intégration de la démarche d’intelligence économique dans le management stratégique », Vth international conference of strategic management, May 13rd-15th 1996, pp 08.

Fariza NANECHE was born in the village of Ait Bouaddou, Tizi-ouzou, in 1988. She obtained the Licence degree in management from the University Mouloud MAMMERI of Tizi-ouzou, Algeria, in 2011, and the Master degree, in 2014. Since 2013, she has been a member of the research team CNEPRU. She is preparing her Doctorate in management, since 2014, at Mouloud MAMMERI

University. Her research interests include the implementation of competitive intelligence within companies, in particular national companies, the exploitation of information for strategic purposes and also the analysis of determinants of setting strategic monitoring systems in the small companies of electricity, electronics and household electrical appliances, industries in Algeria. She has been an Assistant Lecturer in the management department, Mouloud MAMMERI University, since 2012.

Yacine MEZIAINI was born in the city of Tizi-ouzou, in 1984. He obtained the Licence degree in management from the University Mouloud MAMMERI, of Tizi-ouzou, Algeria, in 2008, and the Master degree, in management of companies, in 2012, at the UMMTO.

He has been a member of the research team CNEPRU, since 2013. He is preparing his Doctorate in management since 2013. His researches interests include the integration of the monitoring process within companies, and also the corporate social responsibility, in particular, national public ones. He has been an Assistant lecturer with the management department Mouloud MAMMERI University, since 2012.


Security study of m-business: Review and important solutions

Ahmed Aloui (1), Okba Kazar (2), Samir Bourekkache (3), Merouane Zoubeidi (4)
(1, 2) Laboratoire d'INFormatique Intelligente, University of Biskra, Algeria: (1) [email protected], (2) [email protected]
(3, 4) Laboratoire d'INFormatique Intelligente, University of Biskra, Algeria: (3) [email protected], (4) [email protected]

Abstract — In our previous work [1], we proposed a mobile business approach based on mobile agents. Dealing with the same subject, here we treat the security aspects of m-business. Security is considered of major importance in the electronic world, and it has become a key factor for the success of mobile activity. The success of m-business will depend on an end-to-end security strategy which keeps the risks at an acceptable level for companies and users. In this article, we propose the security requirements (the challenges), the attenuation strategies (possible solutions) and the stages of securing the m-business architecture.

Ensuring the security of e-commerce transactions can reassure service providers and potential customers of mobile business and increase the success factor of m-business applications.

Index Terms: E-business, M-business, Security, Challenge Security, the attenuation strategies.

In m-business, security measures will be subjected to more stress and will be vulnerable to a range of violations. It is important to understand these threats and to plan a strategy to manage them.

I. INTRODUCTION The mobile phones and other wireless devices (PDA, pocket computers, embarked devices) become omnipresent, which have increased the mobility of our lifestyle. Moreover, mobile business (m-business) has become a commercial reality. Therefore, many people want to go a little further to make the shopping and conduct business anywhere and at any time, using a range of mobile devices [1]. The main key for the success of every application of m-business is a high level of security. It is very clear that one of the biggest obstacles in the mobile business is to win the trust of the customer. Thus, the success depends on the development of a security strategy that meets the new challenges of m-business. The security was always considered as a major importance in the digital world. To secure the various network infrastructures (wireless and wired), and the various models of payment, constitutes the main key of the development and the generalization of the m-business.

The security and privacy of personal data can be considered a major obstacle to the successful implementation of m-business applications. The wireless e-business environment is very different from the wired environment.

This paper is organized as follows: Section 2 presents the architecture of m-business. Section 3 presents the security properties. The security challenges of m-business are presented in Section 4. Section 5 then presents the attenuation strategies (possible solutions). Section 6 presents the stages of securing an m-business architecture. Section 7 explains the next step and research focus. Section 8 presents related works. Finally, Section 9 discusses future work and concludes this paper.

II. THE ARCHITECTURE OF M-BUSINESS
We propose a mobile agent-based approach designed for m-business [1]. Figure 1 shows the architecture of m-business. A consumer can connect his mobile device, such as a PDA or a mobile phone, to the application server via a wireless connection and then send a request for the creation of a mobile agent to begin a specific business task on his behalf. The application server provides services such as the creation of mobile agents according to the demands of consumers. After being created, the mobile agents travel autonomously to several agent-based servers on the Internet when the consumer wishes to carry out a comparison on several world markets.
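To make this flow concrete, the following is a minimal, purely illustrative Python sketch (not the implementation of [1]); the task structure, supplier names and the fetch_offer callback are hypothetical stand-ins for the mobile agent's visits to the supplier sites of Fig. 1.

    # Hypothetical sketch of the Fig. 1 flow: the application server (broker)
    # creates an agent task on behalf of a mobile client and lets it visit
    # several supplier sites to compare offers. All names are illustrative.
    from dataclasses import dataclass

    @dataclass
    class AgentTask:
        client_id: str
        product: str
        suppliers: list          # agent servers the mobile agent will visit

    def run_agent(task: AgentTask, fetch_offer) -> dict:
        """Visit each supplier 'site' and keep the best (cheapest) offer."""
        best = None
        for site in task.suppliers:
            offer = fetch_offer(site, task.product)   # e.g. a remote call made by the agent
            if best is None or offer["price"] < best["price"]:
                best = {"site": site, **offer}
        return best

    # Example use with a stubbed supplier lookup:
    offers = {"supplier01": 120.0, "supplier02": 99.5, "supplier03": 110.0}
    result = run_agent(
        AgentTask("client-42", "tablet", list(offers)),
        lambda site, product: {"price": offers[site]},
    )
    print(result)   # {'site': 'supplier02', 'price': 99.5}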

[Figure 1 (not reproduced here) depicts the components of the architecture: Mobile Client, Application Server, Home Server, Repertoire Server, Mobile Agent Server, and the sites of suppliers 01, 02 and 03.]
Fig. 1. Architecture of the system [1]

III. THE PROPERTIES OF SECURITY
A secure m-business system must have the following properties [2]:
• Confidentiality: the information and the systems have to be protected from any unauthorized person, process or device.
• Authentication: the parties participating in a transaction must be authenticated (to establish trust) and authorized to perform the requested transaction.
• Integrity: the transmitted information must not be modified or distorted (altered) by parties outside the transaction.
• Availability: the system must be accessible to authorized users at any time (no Denial-of-Service attacks).
• Authorization: the system must provide a set of procedures to verify that the user can make the requested purchases.
• Non-repudiation: the system has to ensure that a user cannot deny a transaction he carried out, and has to supply a proof if such a situation arises.

IV. THE CHALLENGES OF SECURITY IN M-BUSINESS
In this section, we analyze the security challenges of mobile business. The analysis is made from the point of view of every entity of the architecture, namely the mobile users, the application server (the broker) and the service providers. Figure 1 shows that there are points of vulnerability in the m-business environment. The reach and the nature of the threats vary enormously according to each application and the environment in which it operates. However, the following points illustrate some of the typical challenges (threats) that mobile business presents:

A. The increasing complexity
It is difficult to ensure the confidentiality and the integrity of a company's data because enterprises exchange data using wireless networks. Mobile devices are the new interfaces to m-business applications, but their security capabilities are severely limited [3].

B. Confidentiality of personal data
The protection of data and the management of identity are very active subjects of research in the academic world. Many research projects focus on different aspects of supporting privacy and identity management. Confidentiality of personal data should be supported by identity management (that is, before the users publish data) and trust management (that is, after the users publish data) [4].

C. The anonymity and the respect of privacy
Most users do not like to reveal their identity unnecessarily when requesting a mobile service. Anonymity guarantees that a user can use a resource or a service without revealing his real identity.

D. Integrity and authenticity of service descriptions and results
Integrity protects against any unauthorized modification of information [5]. The integrity of the service descriptions stored and transmitted by the application server (the broker) is obviously very critical for the users and the service providers, because it influences the choices of the users when they decide to use a service. Opponents are particularly interested in modifying the information on price, location and quality in the descriptions. Service descriptions of an application server that have been modified by opponents mislead users and endanger the business of service providers.
To modify the service descriptions, attackers can use a number of different methods. They can modify the service descriptions of the application server. This modification can be made when the service descriptions are either in the communication channel or in the repository of the application server (the Repertoire Server in our approach). Another method is for opponents (false application servers) to pose as an authentic application server and send modified service descriptions to mobile users. Modified service descriptions can cause many serious side effects; for example, users can be charged more money than the required cost or pressured into accepting a poor service. Worse still, users can be directed to a false service provider whose only purpose is to learn their personal data and to steal their credit card numbers.
In addition to the protection against unauthorized modification of messages on the channel, the data stored in the repository (Repertoire Server) of an application server should obviously be authentic. Service providers (as opponents) can aim to overtake the other providers by modifying the repository in such a way that their service descriptions become more attractive to users.

E. Authentication and Authorization
After the registration process, service providers can request the application server to update their service descriptions. To avoid any unauthorized modification of the reference table (Repertoire Server), the application server must authenticate and authorize service providers. Authenticated service providers are then authorized to modify only the entries they own. Service providers can also require the authentication of authentic application servers as opposed to parasite servers, and thus bidirectional authentication is applied. Users may want to authenticate service providers to protect their personal data from hostile opponents that claim to be service providers. On the other hand, many service providers have to authenticate their customers, for example for accounting purposes.

F. Confidentiality (privacy) of the communication
Communication messages transmitted among the main entities of the architecture (application server, suppliers and mobile users) contain sensitive information such as personal data, credit card numbers, location, requests of mobile users, registration data of suppliers, results of the application server and service providers, etc. Identity management allows the users to control the personal data they transmit, but the disclosure of such sensitive information would not be difficult in mobile networks, where data transmitted over the air are easily received by any mobile device.

G. Confidentiality (privacy) of data stored locally
On mobile devices, the users do not always know what data is stored locally by the applications they use. Most applications store basic configuration data, but some perform local caching of downloaded data to reduce the use of the network or the waiting time [6].

H. Security of the mobile payment
"The use of a mobile terminal, such as a smartphone, allows envisaging simple and fast means of payment, up to payment with one click," explained Alex Rolfe, Managing Director and publisher of Payments Cards & Mobile. Mobile payments could well upset the business in the coming years. In practice, consumers use their mobile more and more to inquire before the purchase, if not to buy online after a visit to the shop. The challenge of this development of e-business is to continue to capture this flow of customers who deserted PCs for smartphones and tablets. Several factors, such as the multiplicity of hardware platforms and operating systems, the personalization possibilities left to smartphone users, not to mention the size of these new means of communication which makes them more vulnerable, make payment using mobiles a tricky and risky operation.
In many aspects, the development of mobile payment promises to be a new beginning for the industry. Banks are the first to recognize that the mobile establishes a new relationship with their customers, which requires reviewing all the regulations of security and authentication. Mobile payments imply operations in which monetary values are transferred by mobile customers to service providers to pay for the offered services. A misuse of credit card numbers or other means of payment can be a serious problem for the users and the service providers.

I. Denial of Service attacks
A denial of service attack is a computer attack which aims to make a service unavailable and prevents legitimate users from using it. An attack by denial of service can thus block a file server, make it impossible to access a web server or prevent the distribution of mail in a company. Denial of service attacks have changed over time. By permanently transmitting large quantities of data to the wireless device, the bandwidth of the network can be saturated, leading to a degradation of performance or to unavailability [3].

J. Location-based services
M-business offers location-based services, which allow users to be tracked. This raises new privacy and confidentiality concerns for consumers.

K. Non-repudiation
For financial transactions, companies must be able to provide non-repudiation, in other words, to prove that a mobile transaction actually took place.

L. The viruses
The variety and the immaturity of wireless devices, operating systems, applications and network technologies increase the threat of virus attacks and malicious code.


These risks could lead to data and applications being accessed, destroyed, processed or copied by an unauthorized person. Confidential and personal data may be disclosed or modified. There is also the risk that the privacy rights of the users could be misused and that the users' business could become a target for fraudulent activities.

V. THE ATTENUATION STRATEGIES
M-business without a secure environment is unacceptable, essentially for transactions which imply monetary exchanges. There are several m-business security challenges relating to the mobile devices, to the infrastructure of the wireless and wired networks, and to the various technologies of m-business services.

A. For privacy of personal data
• For identity management, Jendricke [7] presents an identity manager (iManager) to control the personal data sent from mobile devices through networks. Jendricke conceived an application called Identity-Manager (iManager). It sits between the Internet applications and the network and allows the users to manage their identities. Consequently, it controls the flow of data to and from the network and provides a friendly interface for the required security functionality. The Identity Manager provides an interface used to create different virtual identities (IDs) or pseudonyms and to bind a subset of the personal data to every ID. When contacting a service provider, the user chooses an identifier that is appropriate for this particular type of communication. Before any personal information is sent to a service provider, the user is explicitly asked to allow the transmission.
• P3P (Platform for Privacy Preferences) and APPEL (A P3P Preference Exchange Language) are W3C recommendations that help individuals build a relationship of trust with servers and service providers [8]. Obviously, the architecture cannot control the further use of the information once it has been sent to service providers. Wrong use of collected personal information should be prohibited at the company level, for example by establishing a Privacy Management Code of Practice which is mandatory for all service providers who register with the application server [9].

B. The anonymity and the respect for private life
• Encryption protects the communication partners from disclosing their secret messages, but cannot prevent traffic analysis and the leakage of information on "who communicates with whom". However, reliably providing anonymity is essential in many applications, particularly m-business. A partial solution for anonymity is the pseudonym. Pseudonyms are simulated names, such as nicknames. When communicating with service providers, users are presented by their pseudonyms instead of their real identities. A more complete solution to the anonymity problem is identity management [7], which allows the users to keep control over their personal and confidential information. When the users communicate directly with service providers, personal information (IP address, country, operating system and much more) is all disclosed to the service providers, which can then use this information to identify the users and to profile them. Solutions based on Mix-Nets [12] are well accepted in academia and have also been designed and deployed for various application scenarios.

C. Integrity and authenticity of service descriptions and results
• Digital signatures [10] can be applied as cryptographic methods to provide, at the same time, the integrity and the authenticity that require, respectively, verification of unauthorized modification of messages and verification of the origin of the service descriptions and the results. To apply a digital signature scheme, the application server (the broker) and the service providers have to hold a pair of public and private keys. The application server must sign its service descriptions with its private key and then distribute them. Service providers must apply the same method to the results of their services. The users should verify the integrity of the service descriptions and of the results with the public key of the application server and of the service providers, respectively. Digital signature solutions generally require a certificate management system to exist in the m-business architecture.
• The solution for the integrity of the repository (Repertoire Server) is the application of authentication, authorization and intrusion detection mechanisms. Authentication allows only authenticated administrators to have access to the repository. Authorization ensures that authenticated administrators can work only with the data to which they have been granted access. Intrusion detection tools such as Tripwire [11] can check changes in the repository and notify administrators of changed data.
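As an illustration of the digital-signature solution sketched above, here is a minimal example using the open-source Python 'cryptography' package (not the authors' implementation; the service description shown is invented):

    # The broker signs a service description with its private key;
    # clients verify it with the broker's public key before trusting it.
    from cryptography.hazmat.primitives.asymmetric import rsa, padding
    from cryptography.hazmat.primitives import hashes
    from cryptography.exceptions import InvalidSignature

    broker_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    description = b'{"service": "hotel-booking", "price": "80 EUR", "provider": "supplier02"}'

    signature = broker_key.sign(
        description,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256(),
    )

    # On the mobile client: verify integrity and origin before using the description.
    try:
        broker_key.public_key().verify(
            signature,
            description,                      # any modification here raises InvalidSignature
            padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
            hashes.SHA256(),
        )
        print("service description is authentic")
    except InvalidSignature:
        print("service description was modified or forged")

In a deployed system the broker's public key would be distributed through the certificate management system mentioned above rather than generated locally.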

D. Multifactor authentication and authorization
• Authentication can be performed by three different methods: authentication by "something that you know", for example passwords; authentication by "something that you have", such as a proximity card to enter an office or a magnetic card for a hotel room; and finally authentication by "something you are", which involves biometrics, for example the fingerprint readers of laptops.
• Multifactor authentication: single-factor authentication (user name and password) is inadequate to protect against the constant and complex threats of the current environment. Two or even three factors of authentication are required to prevent attackers from reaching the systems by appropriating or diverting authentication credentials. A combination of two methods (two-factor authentication) is generally used, because each of the three methods by itself would supply only weak authentication. Two-factor systems consist not only in identifying the user by means of a secret piece of information, hence the method "an element which you know", but also in applying the formula "an element which you have" or "an element which is part of you", thus forcing the attacker to identify himself by means of a specific element [6].
• A completely natural solution is to combine two-factor authentication with single sign-on (SSO) mechanisms [13] to guarantee the ease of use of the system. Single sign-on (SSO) is a method that allows a user to access multiple computer applications (or secure web sites) with only a single authentication. To authenticate users and service providers, single sign-on (SSO) should be integrated into the application server.
• Cryptographic techniques based on cryptographic credentials and zero-knowledge proofs provide a solution for anonymity: the authenticator can verify that the user is actually a legal subscriber, but learns nothing about the identity of the user.
• Authorization: to obtain authorization, many solutions exist, such as access control lists (ACL), certificate-based authorization (for example KeyNote, SPKI and SAML [14]), role-based authorization, etc.

E. Confidentiality of communications
Many telecommunication technologies supply encryption mechanisms between the sender and the network carrier. For end-to-end security, the SSL (Secure Socket Layer) protocol can be implemented.
• SSL: a method based on public key cryptography to ensure the secure transmission of data over the Internet. Its principle consists in establishing a secure communication channel between two machines (a client and a server) after an authentication stage. Secure transactions with SSL are based on an exchange of keys between client and server [15]. In the SSL protocol, messages are encrypted with a symmetric key, but public-key encryption is used for the exchange of the session keys. Consequently, the SSL protocol also requires a system for managing certificates. In this context, authenticated encryption with associated data (AEAD) [16] could significantly accelerate the communication compared with conventional methods, in particular on mobile devices with low capacity.
• Web Service Security (WS-Security): WS-SecureConversation is a specification that provides secure communication between web services using session keys. WS-SecureConversation [17] works by defining and implementing an encryption key to be shared among all the entities involved in a communication session. Contrary to SSL, WS-SecureConversation supports end-to-end encryption. For example, if a message needs to pass through any number of intermediaries before reaching the final receiver and each intermediary must verify the full or partial content of the message, the sender can encrypt the message individually for each intermediary. Even different parts of a message can be encrypted for various intermediaries, and identity authentication of multiple parties is also possible. However, WS-Security libraries are not currently available for mobile device platforms.
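To illustrate the AEAD idea mentioned above [16], here is a minimal sketch using AES-GCM from the Python 'cryptography' package; the session key is assumed to have been agreed beforehand (e.g. during an SSL/TLS-style handshake), and the payload shown is invented:

    # One AEAD pass gives both confidentiality and integrity, with cheap
    # symmetric operations suited to constrained mobile devices.
    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key = AESGCM.generate_key(bit_length=256)     # stands in for a shared session key
    aesgcm = AESGCM(key)

    nonce = os.urandom(12)                        # never reuse a nonce with the same key
    payload = b'{"card": "************1234", "amount": "25.00"}'
    header = b"client-42/order-7"                 # authenticated but not encrypted

    ciphertext = aesgcm.encrypt(nonce, payload, header)
    plaintext = aesgcm.decrypt(nonce, ciphertext, header)   # raises InvalidTag if tampered
    assert plaintext == payload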

F. Confidentiality of data stored locally
In the mobile domain, where the theft of devices is very common, confidentiality is particularly required to protect data stored locally on the mobile devices. These data are sensitive, because they contain private information such as the name, the address, special interests and probably even credit card numbers.
• Public-key encryption [18] can be used. The mobile user encrypts his local data with the public key. The corresponding private key is stored by a remote system and can only be recovered after authentication by a password.

G. Mobile payment security
The mobile payment protocol deployed in the m-business architecture should rely on strong encryption methods to ensure the confidentiality of monetary values transmitted over unreliable networks. In payment systems, if any conflict arises between the customer and the merchant, they need a trusted third party to solve the conflict. In the m-business architecture, the application server can take the role of such a trusted party. To provide evidence in cases of conflict, the payment protocols should be able to record all the transactions of both parties. They should also supply anonymous payment for certain applications, accountability (non-repudiation) for users and service providers, as well as mechanisms verifying the authenticity and the integrity of protocol messages. However, there are some trends and options that companies can consider. These include:


H. WPKI: Wireless Public Key Infrastructure
It is not a new PKI, but an optimized extension of traditional PKI for the wireless environment. It consists of two basic elements: public-key cryptography and digital certificates. The WPKI includes the necessary cryptographic technology and a set of standards for managing the security of m-business. WPKIs, like PKIs, reinforce the rules of m-business transactions and manage the relationship between the communicating parties, keys and certificates. The WPKI extends e-business to the wireless and mobile environments [19].

I. TTP (Trusted Third Party)
The trusted third party (TTP) manages the authentication of the parties of the transaction and the authorization for the settlement of the payment. The different roles can be merged into a single organization, such as a banking system that is able to act as a content provider/merchant (CP/M), a payment service provider (PSP) and a TTP simultaneously. The roles of a PSP and a TTP can generally be played by the same organization [2].

VI. THE STAGES OF M-BUSINESS SECURITY
The following steps describe the success factors that are essential for planning and implementing an effective security solution for m-business [3]:

A. Define objectives
Understand the objectives of the company, the goals and the critical success factors during the planning of the security strategy, as well as the impact on the company if they are not reached.

B. Identify the points of vulnerability
Identify the vulnerabilities in the processes, the organization and the technologies of the company, and anticipate the way they could be exploited.

C. Manage the risks
Determine and measure the risks for the company, then identify the security needs for reducing these risks to an acceptable level.

D. Formalize the plan
The security strategy must be clearly defined and structured. Formalize the security requirements, the policies and the processes to avoid the risk of uncertainty or misinterpretation.

E. Develop a security architecture
Either build a new security architecture or integrate new m-business controls. Make sure that all the threats and vulnerabilities are taken into account.

F. Implement a security solution based on the business
The security technologies must be selected and configured to meet the security policy requirements and the m-business strategy.

G. Test and validate the solution
This consists in assessing the security solutions against a variety of threats and attack scenarios and then refining the solution accordingly.

VII. NEXT STEP AND RESEARCH FOCUS
Up to this point, we have specified the security requirements and the possible solutions in the m-business framework. The next step is to design a security architecture for m-business. The security architecture must be flexible enough that it can easily be adapted to mobile devices, PCs and servers. It should support an identity manager that manages pseudonyms and personal data. To manage the problems arising from payments, m-Wallet (for mobile devices) and e-Wallet (for suppliers and brokerage services) components must be implemented. A crypto provider gives access to many cryptographic libraries for encryption, digital signatures, SSL support, strong password generation, etc. Thus, our research objective is to build an open, evolutionary and flexible security architecture supporting the numerous possible solutions to surmount the challenges which can arise in different mobile business applications.

VIII. RELATED WORKS
In the literature, there are several research works which treat the security aspects of m-business; their security objective is essentially limited to privacy, confidentiality and anonymity.

A. Work of Longyi Li and Lihua Tao
Longyi Li and Lihua Tao [20] presented a framework for a WPKI security system. They also developed the security mechanism of WPKI (wireless public key infrastructure) and presented some perspectives of WPKI in mobile business applications, providing full support for confidentiality, integrity, non-repudiation and personal identification in mobile business. In order to adapt to authentication and encryption in wireless networks, the WPKI technology is gradually developing; it is even applied in wireless data services. WPKI is a set of management systems for certificates and keys following an established standard, which introduces the PKI security mechanisms of e-business into a wireless network environment. WPKI is the extension and optimization of PKI in the wireless environment.

[Figure 2 (not reproduced here) shows the WPKI security framework: a Mobile Terminal exchanging request/response messages with a WAP Gateway Server, which relays HTTP requests over the Internet to a Business Server, built on a platform of network security protocols (WTLS, IPv6, SSL) and a platform of security infrastructure (WPKI).]
Fig. 2. Security framework of WPKI [20]

In this framework (Figure 2), WPKI, as a security infrastructure platform, is a foundation on which security protocols can be implemented effectively. WPKI provides a highly scalable authentication means in a distributed network. WPKI can be combined with WTLS (Wireless Transport Layer Security) to perform the functions of digital signature and authentication. WPKI adopts the safer ECC encryption algorithm (Elliptic Curve Cryptosystems), and it is stricter in the authentication of the mobile terminal. But WPKI has current problems: for example, wireless terminals are weak in terms of resources, processing capacity and storage capacity, so the length and complexity of certificate data should be minimized as much as possible. The wireless channel is weak in resources and reliability, and its bandwidth costs are high. Thus, the reliability and speed of secure transmission must be achieved technically.

B. Work of Yang Yang
Yang Yang et al. [21] proposed an integrated security framework based on the security demands of business applications on mobile networks. Yang Yang presented the functionalities of this framework as follows: (1) it supports various mobile communication protocols, and it can automatically switch to the most appropriate security channel according to link status and user demands; (2) it provides a multi-element authentication mechanism to reduce the risk of a stolen terminal; (3) it shares the computing load between the smart card and the terminal device to fit the capability of a PDA; (4) it is independent of upper-layer applications. This framework supports various mobile communication methods, including GPRS, CDMA, SMS, MMS and 802.11. In addition, it can automatically change the security channel to fit the different link topologies. The framework provides various VPN (Virtual Private Network) link patterns, an automatic link switch mechanism, and a secure SMS channel as the backup option. VPN connections are not necessarily encrypted.

C. Work of Thomas Walter
Thomas Walter et al. [22] presented a framework offering many solutions for the development of secure mobile business applications that takes into account the need for strong security credentials, e.g. based on smart cards. This framework consists of software and abstractions that allow for the separation of the core business logic from the security logic in applications. Thomas Walter presented and discussed an architecture and framework that provides a comprehensive and evolutionary approach for the implementation of secure mobile business applications. This architecture and framework allow for the integration of existing technologies (if useful, for example the Bouncy Castle Java Cryptographic Extension) as well as for their complementation (if required, the WiTness security module); it essentially consists in an open architecture and framework based on standards (the WIM smart card specification).

D. Work of Pan Tiejun and Zheng Leina
Pan Tiejun and Zheng Leina [23] proposed a security solution based on WPKI. In this security solution, they selected the Bluetooth earphone as the medium for storing the digital certificate and implementing the asymmetric encryption and signature algorithms, so that WPKI can provide better-secured information between the user and the mobile commerce server [24]. In their security solution, the user accesses a certificate with a passphrase stored on a Bluetooth earphone. This certificate represents the individual as his or her personal identity throughout the session, and they transfer the WIM function (WAP Identity Module or Wireless Identification Module) into the Bluetooth earphone, which can be widely accepted by people. Also, this security solution uses the WIM Bluetooth earphone to protect the message at the application layer and avoid the security gap problem, so as to ensure end-to-end security. In this work, they only paid attention to application-domain security; they chose a security solution based on WPKI with a Bluetooth earphone, and they selected the WIM Bluetooth earphone as the core component to ensure end-to-end mobile commerce security at the application layer.


IX. CONCLUSION AND PERSPECTIVES
In this article, we presented a number of security requirements, the mitigation strategies (possible solutions) and the steps for securing an m-business architecture. Finally, we presented some related work. Of course, it is never possible to eliminate all the threats. However, the success of m-business will depend on the ability to have an end-to-end strategy that reduces the risk to a level acceptable to companies and users. We intend to design a security architecture for m-business. Moreover, the architecture must be easily configurable on the mobile devices, the application servers and the service providers.

REFERENCES
[1] A. Aloui, O. Kazar, "Architecture for mobile business based on mobile agent," IEEE Multimedia Computing and Systems (ICMCS), 2012 International Conference (ISBN: 978-1-4673-1518-0).
[2] Wen-Chen Hu, Chung-wei Lee, Weidong Kou, "Advances in Security and Payment Methods for Mobile Commerce," University of North Dakota, USA; Auburn University, USA; Chinese State Key Lab. of Integrated Service Networks, China; Idea Group Publishing, 2005.
[3] Daniel Keely, "A security strategy for mobile e-business," EMEA Wireless Security Competency Leader, IBM Security and Privacy Services organization, 2001.
[4] N. Diezmann, "Neue Wege zum mobilen Kunden," chapter Payment Sicherheit und Zahlung per Handy (in German), 2001.
[5] Deborah Russell and G.T. Gangemi Sr., "Computer Security Basics," O'Reilly & Associates, Inc., 1992.
[6] Groupe CGI Inc., "Connexion sans fil : assurer la sécurité des entreprises mobiles," étude technique, 2013.
[7] Uwe Jendricke, Daniela Gerd tom Markotten, "Usability meets Security - The Identity-Manager as your Personal Security Assistant for the Internet," in Proceedings of the 16th Annual Computer Security Applications Conference, pages 344-353, December 2000.
[8] Lorrie Cranor, "The Platform for Privacy Preferences 1.0 (P3P1.0) Specification," W3C Recommendation, 16 April 2002.
[9] Sarah Spiekermann, "Location-based Services," Morgan Kaufmann, 2004.
[10] S.R. Subramanya, Byung, "Digital signatures," IEEE Potentials, 2006.
[11] Tripwire. http://www.tripwire.org.
[12] Krishna Sampigethaya, Radha Poovendran, "A Survey on Mix Networks and Their Secure Applications," Proceedings of the IEEE, Vol. 94, No. 12, December 2006.
[13] Niels Ferguson, Bruce Schneier, "Practical Cryptography," chapter 22: Storing Secrets, pages 357-358. John Wiley and Sons, Inc., 2003.
[14] Sebastian Wiesner, "Simple PKI," Seminar Innovative Internet Technologies and Mobile Communications, Chair for Network Architectures and Services, 2013.
[15] Dieter Gollmann, "Computer Security," chapter 13: Network Security, pages 232-235. John Wiley and Sons, Ltd., 1st edition, 1999.
[16] Phillip Rogaway, "Authenticated-encryption with associated-data," in CCS '02: Proceedings of the 9th ACM Conference on Computer and Communications Security, pages 98-107. ACM Press, 2002.
[17] Hongbin Liu, "A Multi-party Implementation of WS-SecureConversation," Community Grids Lab, Indiana University, Bloomington, Indiana 47404.
[18] CGI Group Inc., "Public Key Encryption and Digital Signature: How do they work?," Business solutions through information technology, 2004.
[19] The Open Mobile Alliance, Wireless Application Protocol Public Key Infrastructure, Version 24-Apr-2001, http://www.wapforum.org/.
[20] Longyi Li, Lihua Tao, "Security Study of Mobile Business Based on WPKI," South China University of Technology, Eighth International Conference on Mobile Business, 2009.
[21] Yang Yang, Chengxiang Tan and Haihang Wang, "An Integrated Security Framework for Mobile Business Application," Electronics and Information Engineering College, IEEE, 2008.
[22] Thomas Walter et al., "Secure Mobile Business Applications - Framework, Architecture and Implementation," Elsevier Ltd., 2004.
[23] Pan Tiejun and Zheng Leina, "New Mobile Commerce Security Solution Based on WPKI," International Conference on Communication Systems and Network Technologies, 2012.
[24] Y. Canjun, "Design of PKI based mobile bank security system," Journal of Chongqing University of Posts and Telecommunications (Natural Science), vol. 19, pp. 80-85, Jun. 2007.


What is the impact of knowledge mapping on the innovation process? In a first part, we study the main theoretical concepts used in our paper, namely knowledge mapping, innovation, and the use of the TRIZ method with its different dimensions. In a second part, we propose a new approach which is based, on the one hand, on knowledge mapping using the MASK method [12][4] and, on the other hand, on the innovation process through the use of the TRIZ method [1], in order to support decision-making.

II. INTERRELATION BETWEEN KNOWLEDGE MAPPING AND INNOVATION

A. Knowledge mapping - Definitions and Background
In recent years, awareness has grown that the strategic value of an organization is linked to its knowledge and to the exploitation of that knowledge. The potential damage caused by the loss of a key competence, and the volume of departures, scheduled or not, of the most experienced staff, highlight ever more strongly the need to adopt a knowledge management strategy. Indeed, tacit/explicit knowledge is extremely rich and dynamic, and it has become necessary to model it. This modeling is used to transform large amounts of data, from interviews with experts to documents found in multiple repositories related to the trades' activities [13] [17]. To this end, a multitude of tools and methods exist for knowledge discovery in data, expert interviews, and/or reference materials. These methods are classified into two categories: explicitation (capitalization) methods and methods for automatically extracting knowledge [13]; [26]; [31; 32].


Knowledge mapping, which is considered as a method of knowledge explicitation, aims to showcase the trade’s critical knowledge of the company [4]. Knowledge mapping is primarily a managerial approach whose finality is to identify the patrimonies of know-how that are strategic in the actions of the trades in the organization. The identification of the latter in an organization is to sustain develop knowledge that is related to the company's business as its work strategy. In the other words, knowledge mapping is a process by which organizations can identify and categorize knowledge assets within their organization – people, processes, content, and technology. It allows an organization to fully leverage the existing expertise resident in the organization, as well as identify barriers and constraints to fulfilling strategic goals and objectives. It is constructing a roadmap to locate the information needed to make the best use of resources, independent of source or form [13]. Knowledge mapping is an important practice consisting of survey, audit, and synthesis. It aims to track the acquisition and loss of information and knowledge. It explores personal and group competencies and proficiencies. It illustrates or "maps" how knowledge flows throughout an organization [17]. Its main purpose consists in quickly showing the collaborators of an organization, a network or pathway, where is located the expertise sought. Similarly, it allows for the indication of the importance of knowledge that is at risk of being lost and that must be preserved [26]. Several approaches to the evolution of mapping have been proposed for organizing the cognitive resources of a company. Aubertin [4] proposed three different approaches for the realization of mapping by functional classification, which respectively use the organization chart, classification by process, and classification by domains. Matta and Ermine [27] conducted a project for mapping the knowledge and the technical competence that are critical within the direction of the innovation and research of the INRS. Ermine and boughzala [13] completed a project at Chronopost International (observatory of trades), which relies on the following two objectives: first, to identify the know-how of trades that are affected by the strategy; and second, to consider the evolution of critical skills in the future. For this, Ermine J.L. built upon the project in several phases: the first phase is the realization a mapping that is strategic in regards to business actions and that is formalized by the graphical model approach of “a map of knowledge domains.” The second phase consists of an analysis of the know-how of the trades that are critical. This is done through the use of criticality criteria and takes into account the specifics of Chronopost International. Chabot [7] proposed a complete mapping of the different areas of expertise to the company HYDRO-Quebec. However, the primary objective was to on one hand, identify the areas of knowledge, and on the other hand, to do a study of criticality in order to bring out the critical knowledge domains with the help of the French Society Kadrant. Barroso and Ricciardi [6] conducted a project at the center of radio pharmacy in Sao Paulo (IPEN). Since the nuclear domain suffers from problems related to this considerable


accumulation of knowledge, such as the risk of non-preservation, the difficulty of transfer, etc., they developed the project in several steps using a process approach. The process was described in a conventional manner in the form of flow diagrams linking the activities that compose the process.

Knowledge Engineering [8]; [5] offers a rational framework allowing a representation of knowledge obtained through experiments [25]. This technique has found great application in knowledge management and especially in knowledge capitalization [10]. For that, we find in these approaches, on the one hand, models representing tasks, manipulated concepts and problem-solving strategies, and, on the other hand, methods to extract and model knowledge. We note for instance the MASK [12]; [27] and REX [24] methods. These methods are used mainly to extract expert knowledge and allow the definition of corporate memories.

B. Innovation - Definitions and Background
1) Definitions
Innovation is crucial to the success and survival of companies. It is seen as the single most important building block of competitive advantage. "Successful innovation of products or processes (or services) gives a company something unique that its competitors lack" [21]. Different types of innovation can be delivered; for example, it may be a product, process or organizational innovation. The scope of innovation can range from radical/disruptive to incremental/evolutionary innovation [9]; [36]; [40]. Depending on the type, complexity and scope, the role of knowledge in the innovation process is crucial. For more radical innovations, new knowledge needs to be created or applied from very different contexts. For incremental innovations, it is more important to re-use existing knowledge in many aspects of the product's design, manufacture and delivery [22]; [41]. Various mechanisms exist to deliberately feed new knowledge into the organization, for example communities of practice, the reading of technical journals, conversations with customers and suppliers, etc. The literature supports the view that new, external knowledge is needed to generate innovation [29]. Tasmin and Woods [34], commenting at an organizational level, suggested that borrowing rather than invention was fueling innovation. Additionally, information useful to innovation can come from other internal units of the organization. So, in different organizations, particular sets of practices for feeding and creating knowledge, and sources from which it is drawn, may be found.



The analysis of the literature highlights that TRIZ is for sure one of the most known systematic approaches for creative design. According to Zhang [39], TRIZ methods are based upon prevailing trends of system evolution (predominantly, technological systems). These trends were identified by examining statistically significant information from different areas of intellectual activities (mainly, technological innovation). In paper of Yamashina [38] TRIZ method is presented as a method that “opens up the pragmatic orientation of engineering creativity, represented more modest by value analysis (engineering) and by numerous analitico-matricial methods (e.g. arrays of discovery). According to Apte [3], TRIZ was developed successfully as a powerful problem solving tool, especially for product innovation design in conceptual design phase, to promise the engineers with breakthrough thinking. The goal of TRIZ, as it is known today, is to support inventors when they have to solve primarily technical or technical-economical problems. The fundamental idea of TRIZ is to provide them with easy access to a wide range of experiences and knowledge of former inventors, and thus use previous solutions for solving new inventive problems (see figure 2).

Fig. 1.Problem solving with TRIZ tools at different levels of abstraction (Leung and Yu, 2007)

Problem solving within TRIZ can be described using a four-element model [14]; [40]; [41]:  The problem-solver should analyze this specific problem in detail. This is similar to many other creative problem-solving approaches.  He should match his specific problem to an abstract problem.  On an abstract level, the problem-solver should search for an abstract solution.  If the problem-solver has found an abstract solution, he should transform this solution into a specific solution for his specific problem. During this process, TRIZ can support the problem-solver by accumulating innovative experiences and providing access to effective solutions independent of application area [11]. Consequently, TRIZ has the capacity to considerably restraint the search space for innovative solutions and to guide thinking towards solutions or strategies, that have demonstrated its efficiency in the past

in a similar problem and, in this process, to produce an environment where generating a potential solution is almost systematic [21].

III. RESEARCH APPROACH
A. Research context
Managing the knowledge and the skills of companies is one of the challenges ahead: knowledge constitutes an essential factor of development, performance, profitability and innovation. To evolve, any person needs to discern his know-how and to evaluate his skills in order to reinforce them. Experience feedback is then made at the individual level, at the team level and at the company level. To realize our new approach, we have opted for the Algerian fertilizer company FERTIAL. The latter conducts a research project in collaboration with the research team in order to bring elements of response to support the innovation process.

A national flagship of the petrochemical industry, FERTIAL [42], the Fertilizer Company of Algeria, is a company resulting from a partnership signed in August 2005 between the Algerian group ASMIDAL [43] and the Spanish group Grupo Villar Mir [44]. It is composed of five major divisions specialized in numerous activities related especially to the manufacture of fertilizers and agricultural fertilizers. Indeed, security is a key factor in its industrial and human resources policy, as are staff training, quality and respect for the environment. The most important goal for FERTIAL is to achieve zero accidents and to ensure the industrial safety of the surrounding communities, by proposing an approach of knowledge capitalization in the trades and its exploitation in the projects. These projects were intended, among others, for the renovation and modernization of industrial facilities to improve their capacity, the acquisition of a new digital control system, the environment and security. Finally, the FERTIAL company is fully engaged in a strongly competitive fertilizer market, striving to reduce its costs and its development cycles within a continuous improvement approach of its quality and its innovation process. Developing an innovation process closely linked to the trade knowledge of the group is therefore a priceless asset.

B. Proposed approach
The technology literature shows us that the success of innovation generation depends on the context in which the patrimony of knowledge is situated. In order to address the need for generating innovations from the capitalized knowledge (tacit and explicit), we decomposed our approach into two parts (see figure 3):
- the knowledge capitalization via critical knowledge mapping, using the principle of the MASK method, in order to highlight pathways of innovation;
- the generation of innovation with the capitalized knowledge (critical knowledge mapping) as a resource.
The model of "generation of innovation starting from capitalized knowledge" is formalized with the aid of software tools that we consider a full-fledged product (given the inseparable nature of the two pieces of software: the knowledge capitalization tool and the innovation generation tool). As we have seen previously, the direct passage from knowledge (tacit and explicit) to innovation is not possible. However, the model that we propose is based on the hypothesis, which remains to be verified, that it is possible to generate innovation from capitalized knowledge by passing through the knowledge base. In the life cycle of the knowledge base, the first phase of our approach feeds the knowledge base through the XMind tool [45] (a tool for analysis and knowledge visualization) with the strategic/trade critical knowledge mapping. The second phase exploits the knowledge base through the TRIZ method (see figure 3).

Fig. 2. The industrial problem of supporting innovation guided by knowledge mapping
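As a purely hypothetical illustration of the two phases described above (the data structures, domain names and criticality values below are invented and do not come from the FERTIAL project), a first phase could expose the critical-knowledge map produced with MASK/XMind to a second phase that attaches an abstract, TRIZ-style suggestion to each critical domain:

    critical_map = {                          # phase 1: output of the knowledge mapping (invented values)
        "ammonia unit start-up": {"criticality": 0.9, "abstract_problem": "knowledge held by few experts"},
        "granulation recipe tuning": {"criticality": 0.4, "abstract_problem": "conflicting quality vs. cost"},
    }

    triz_patterns = {                         # abstract problem -> abstract solution (simplified stand-ins)
        "knowledge held by few experts": "segmentation: split and document the know-how",
        "conflicting quality vs. cost": "separation in time: alternate operating modes",
    }

    def innovation_candidates(knowledge_map, patterns, threshold=0.7):
        """Phase 2: keep the critical domains and attach an abstract TRIZ-style suggestion."""
        for domain, info in knowledge_map.items():
            if info["criticality"] >= threshold:
                yield domain, patterns[info["abstract_problem"]]

    for domain, suggestion in innovation_candidates(critical_map, triz_patterns):
        print(domain, "->", suggestion)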


M. BRAHAMI was born in Mostaganem, Algeria, in 1972. He received a Ph.D. in computer science from the University of Oran (Algeria) in 2014. He is a professor in the Computer Science Department at the National Polytechnic School of Oran (ENPO). His research interests include knowledge management, knowledge mapping, knowledge visualization, knowledge representation, knowledge discovery in databases, data mining, and cellular automata.

K. SEMAOUNE was born in Oran, Algeria. He obtained his Master of Science degree in Economics from the same department in 2007. He is currently a Ph.D. candidate in the Economic Science Department at the University of Oran. His research interests include CSR, performance indicators, global performance, the Balanced Scorecard, knowledge and skills mapping, and sustainable development.

N. MATTA was born in France. She received an HDR degree in computer science from the University of Technology of Troyes (France) in 2004. She is a teacher-researcher at the UTT (University of Technology of Troyes) in Computer Science. Her research interests include Information Systems, Knowledge Engineering, Knowledge Management, Knowledge Capitalization, Project Memory, and Knowledge Tracing.


A Knowledge-based approach for quality-aware ETL process

Imen Hamed, Research scholar, and Faiza Jedidi, Assistant professor

Abstract — ETL processes (standing for Extract, Transform and Load) are a focal component of data warehousing projects. They supply the warehouse with the necessary integrated and reconciled data. However, they are the first to blame when wrong business decisions are made because incorrect or misleading data were provided. Therefore, a correct design of this process at the early stages of a data warehouse (DW) project is required. This calls for specific knowledge to design an ETL process able to provide data of good quality. A way to achieve this is to provide the ETL worker (designer, monitor, developer) with the necessary knowledge. Accordingly, we propose to anticipate the exceptions most likely to happen during the ETL process and then to resolve them. Consequently, we provide a set of best practices and methodologies, modeled as knowledge, for the benefit of the ETL worker during the process lifecycle. Finally, we instantiate a prototype as an initial validation of this approach.

Index Terms — ETL, Knowledge, Metadata, Exceptions.

I. INTRODUCTION
Today's challenge is to make intelligent, right-time business decisions. In fact, the existence of data does not ensure the robustness of the decisions undertaken. Indeed, data quality problems arise during the different warehousing stages. The quality of data may change depending on how data are received, maintained, processed (extracted, transformed and cleansed) and loaded. Thus, it is highly impacted by the ETL process, whose implementation is the task with the greatest effort [3]. Otherwise, the ETL risks turning the garbage received from various data sources (Garbage IN (GI)) into further garbage (Garbage OUT (GO)) that serves as input to the data warehouse, as shown in figure 1. This reveals the complexity of this process. Indeed, "ETL process development is complex, error-prone and time consuming" [3]. It is considered the most challenging step in data warehousing projects in terms of cost and time. Eckerson [21] reports that the cost of ETL and data cleaning tools is estimated to be at least a third of the effort and budget expenses of a data warehouse. Besides, "ETL is a complex combination of process and technology which requires the skills of business analysts, database designers and application developers" [16]. Novice

T

Submission date : 5th January 2015. Imen Hmaed : Research scholar (email : [email protected]) Faiza Jedidi: Assistant professor with Higher institute of computing and multimedia, MIR@CL laboratory, University of Sfax, Tunisia (email: [email protected])

Fig 1. ETL architecture

designers frequently lack experience and have incomplete knowledge about the application being designed. The purpose of this paper is twofold: it mainly approaches the data quality problems from a metadata perspective and attempts to reduce the time and the cost spent in inadequate ETL process design. The complexity of the ETL process and the non-expertise of the ETL designer/developer/monitor impose new challenges. Several problems may be faced during the process implementation. So, the ETL worker requires in-depth knowledge about the exceptions, errors and failures that he may face. In order to facilitate this mission, we identify the set of possible causes of data quality issues. We classify the collected causes as exceptions related to each ETL phase. A second step is to resolve these exceptions. To do so, we propose to capitalize existing knowledge and past experiences. Thus, the ETL worker is supplied with the necessary knowledge to sustain a smooth process running and workflow efficacy. The remainder of this paper is organized as follows: The


first section is dedicated to exposing the different related works in the literature. Our contributions may be summarized as follows: 1- The modeling of the ETL process as a business process is the object of the second section. 2- The third section illustrates an attempt to model the knowledge in a reusable way. 3- An implementation phase, which is a prototype validation, is the object of the fourth section. Finally, we close the paper with a conclusion and open issues.

II. RELATED WORKS “The appropriate design and maintenance of the ETL processes are key factors in the success of data warehousing projects” [19, 20]. Basing on this principle, several research efforts about ETL processes have emerged. These latter, labeled costly and risky, [4] present a very rich field to explore starting from its modeling phase to the maintenance one. Most of the works focused on the ETL design phase, considering it as a complex phase [2]. So, they present relative conceptual models. The authors in [4] propose a new UML based approach for the design phase of ETL processes. In this perspective, they provide the necessary mechanisms as UML stereotypes to represent the most common ETL operations such as: Aggregation, Conversion, Filter, Join, Merge.... This approach facilitates the design and subsequent maintenance of ETL processes at any modeling phase. Another proposition of Mrunalini and al (2006) [22] restrict their work to the extraction phase omitting the transformation and the loading ones. They model the extraction scenario using UML diagrams. Using UML in the ETL modeling phase presents various ups that can be summarized into three weighty profits: Firstly, UML insures the ease of use, being a well known modeling language. Secondly, since the ETL modeling still lacks standardization [9], UML may be nominated as standard for it. The third advantage concerns the seamless integration of the ETL process with the data warehouse conceptual schema. Conceptual models were not restricted to UML. El Akkaoui and al come with new approach of ETL modeling. It consists in considering ETL as business process (BP). So, they present several interventions: In [5], the authors highlight the importance of merging ETL with other business processes in the organization. The ETL process was defined by two main processes: Data process and control process. Then, they provide the modeling of both processes using BPMN 2.0 (Business Process Modeling and Notation). In another work [7], the same group pursuit their approach and aim to align it to MDA (Model Driven Architecture) standard. Indeed, recent work consists in following an MDD (Model Driven Development) process to automatically generate vendorindependent code of ETL processes. This approach is an attempt to standardize ETL process development in order to reduce cost and especially to share and reuse methodologies among different projects. A recent publication [6] is a perpetuation of the alignment of ETL process development to MDD framework. But this time the authors cover overall ETL development process and allow an automatic maintenance phase equally. Another approach introduced in [7] proposes

The major advantage of using BPMN is the integration of the ETL process within the enterprise. In line with the preceding efforts, a new proposition based on ontologies has been introduced in the ETL field. This proposal has two major advantages. The first is the automation of some ETL tasks, such as inter-attribute semantic mapping and the automatic selection of transformation functions [8]. The second benefit of using ontologies in the ETL field is the improvement of data quality: in [9], the authors consider the earlier stages of data warehouse implementation, focusing first on business requirement collection and then providing its semantic analysis; semantic integration in this phase may significantly reduce possible misunderstandings of user requirements. Also, in [10], the suggestion is to prevent the source heterogeneity problem by providing the ETL process with the corresponding ontology, thus ensuring the same semantics for integrated data records and, especially, preventing constraint contradictions among sources. Although essentially conceptual, the mentioned approaches have contributed to attenuating complexity and improving data quality. Table I sums up the studied approaches. We refer the reader to the survey in [3] for a precise literature review.

TABLE I
ETL DESIGN APPROACHES

- UML based approaches. Advantages: ease of use; standardization; integration with the DW schema. Limits: no requirement analysis; no data flow management; no analysis of data source content.
- BPMN based approaches. Advantages: data flow management; business view; business requirement analysis. Limits: modeling efficiency only.
- Ontology based approaches. Advantages: better requirement analysis; automation of some ETL tasks. Limits: ontology construction alone is not sufficient; no data flow management; enormous effort in building the ontology of the required information.

III. DESIGN ETL AS BUSINESS PROCESS

Modeling is a crucial step in the process lifecycle. It has numerous advantages; according to Booch et al. [11], we model a system in order to:
- Express its structure and behavior
- Understand the system
- View and control the system
- Manage the risk



In order to clearly identify the ETL process problems, it is crucial to model the process as a first step. Such a model allows the definition and analysis of the various what-if situations during the process lifecycle, which in turn leads to constantly reviewing and improving the existing process. Among the reasons for the ETL process complexity is the difficulty of satisfying different business requirements. Business process modeling techniques allow business rules and alternative scenarios to be captured as part of the main business process logic. Several modeling languages are available for process modeling, such as IDEF0, IDEF3, UML 2.0 (activity diagrams) and BPMN 2.0. Based on a comparative study in [12], BPMN 2.0 appears to be the modeling language that best meets our needs. Modeling ETL with BPMN, or considering ETL as a BP, also presents several advantages:

- To have a business view of the ETL
- To have a unified formalism for modeling both operational and ETL processes
- A layered methodology: the ETL flow can be modeled and represented in a form appropriate for users
- To easily identify which data are required and, especially, when and how to load them
- Real-time decision making

ETL is often a complex combination of process and technology. It consumes a significant portion of the data warehouse development effort and requires the skills of business analysts, database designers and application developers. ETL processes must be designed for ease of modification, because the data warehouse evolves over time; to support efficient decisions, the ETL should support all these modifications. The ETL process creates a physical and logical separation between the source systems and the destination. It profiles the different sources, considering the suitable information that brings value to the enterprise. As a consequence, different sources are available and different file formats exist; the ETL process should therefore support the variety and multiplicity of data source types. It also needs to effectively integrate sources that rely on different DBMSs (Data Base Management Systems), operating systems, hardware and communication protocols. The ETL process is composed of three main operations (detailed next). To model this process, we choose ARIS (Architecture of Integrated Information Systems), which is both a BPM methodology and an architectural framework for designing enterprise architectures; it also narrows the gap between business requirements and IT. The diagram presented in Fig. 2 is the BPMN representation of the ETL process. We first separate the three main operations into three pools.

Fig. 2. ETL design as a business process

Each pool encompasses the activities of the corresponding operation. Pools are used to indicate the participants of a process and to express their organizational affiliation; in our case, they indicate the set of activities of each sub-process.

Extraction: in the first operation, called Extraction, the different data sources are captured and those containing the appropriate data are selected. The ETL worker has to profile the data sources and select the appropriate one for the extraction event, as indicated by the icons in the upper left corner of each task. The selection of the data sources is a user task; one advantage of using BPMN is the possibility of visualizing and differentiating task types. The extraction activity itself is a receive task whose output is the extracted data; this data flow is sent to the data staging area. Another benefit of BPMN is the possibility to manage the flow between different tasks and, equally, between different pools.

Transformation: the transformation operation is the core of the ETL process, where a sequence of tasks takes place, for instance filtering, converting codes and checking for null values. These activities can be represented by constraint functions: the ETL worker specifies the constraints and the process performs the actions, so they are system tasks.

Loading: the last step of the ETL process is the loading phase, which is decomposed into three activities. The loading pool contains two user tasks: the ETL worker has to select the target to load and to join the source and the destination. The process then establishes the mapping between the source and target attributes according to the mapping rules. Finally, the data are embedded in the warehouse through a send task.

We thus get a good insight into how the process works: we can identify structural, organizational and technological weak points and even bottlenecks, and identify potential improvements to the process. We also identify possible exception scenarios which interrupt the usual process flow; therefore, we need to specify how these exceptions will be handled.

ARIS offers a variety of tools to model the business process, manage the data flow and also handle events. The ETL is a dynamic process that varies as data sources change; this change may impose new costs and generate risks. Indeed, business process modeling helps in estimating these costs and mitigating those risks.

IV. KNOWLEDGE MODELING

Dramatically bad business decisions often trace back to incorrect or misleading data provided by the ETL process. Consequently, the accuracy and correctness of data are key factors in the success or failure of DW projects. Data quality issues can arise at any stage of the ETL process, during the different data operations (extraction, transformation or loading), and they may also exist in the data sources and data destinations. To maintain the flexibility and robustness of the ETL process and, especially, to maintain a high quality of data, we collect the exceptions most likely to occur during the different ETL phases: exceptions which block the ETL process or affect the quality of data. In this paper we expose data source exceptions as an example. Table II regroups the exceptions that can arise in data sources and affect the ETL process enactment; each exception is related to the component it affects (the data source itself, the attribute, or the presence of multiple data sources).

TABLE II
DATA SOURCES EXCEPTIONS

- Orphaned or dangling data
- Important components hidden and floating in text fields
- Varying timeliness of data sources
- Inadequate knowledge of interdependencies among data sources
- Data and metadata mismatch
- Wrong number of delimiters in the source files
- Columns having incorrect data values
- Non-compliance of data in sources with the standards
- Multi-purpose fields present in data sources
- Additional columns
- Misspelled data
- Presence of outliers
- Data loss (rejected records)
- Usage of uncontrolled applications and databases as data sources
- Lack of validation routines at sources
- Different business rules of the various data sources
- Lack of business ownership across the entire enterprise
- Presence of duplicate records of the same data in multiple sources
- Semantic heterogeneity generated by multiple data sources
- Data values straying from their field description and business rules
- Missing columns in data sources
- Missing values
- Failure to update sources in a timely manner
- Not specifying the null character properly in flat files

Exceptions can be related to the data source itself (structure, metadata, etc.) or to the content of the data source, i.e. the attribute (data value, data meaning, etc.). Exceptions may also be generated by the presence of several data sources, such as the heterogeneity problem. A simple design suffers from a lack of formalism and does not provide re-usability of the presented models. Therefore, we intend to model our exceptions as design patterns encapsulating valuable knowledge and storing past experiences; the definition of standards allows the reuse and portability of solutions across different tools and application contexts. Besides, since our approach is quality aware, our proposition (a metamodel) is based on the CWM data quality metamodel [13]. This paper adopts a definition of data quality criteria that closely follows the one presented in [14]: data is considered of good quality if it fulfills the following criteria:



- Completeness: this criterion defines whether the requisite information is available, e.g., there is no missing data value or unusable state.
- Consistency: this criterion determines whether distinct occurrences of the same data are in conflict.
- Validity: this criterion specifies the correctness and reasonableness of data.
- Conformity: this criterion determines the correctness of the data format.
- Accuracy: this criterion defines whether the data objects represent the 'real world' data values, e.g., there is no incorrect spelling of product or person names.
- Timeliness: this criterion defines the availability of data when needed (up-to-date data).

The aforementioned criteria are considered as data quality indicators. To quantify them, we plan to add the corresponding metrics; in this way we will be able to measure the quality of each piece of data and obtain an iterative improvement of data quality.
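As an illustration, a minimal sketch of such a metric is given below; it computes the completeness of a column as the share of non-null values. The class and method names are illustrative assumptions and are not part of the prototype.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Objects;

    /** Minimal sketch of a data quality metric (names are illustrative). */
    public final class QualityMetrics {

        /** Completeness of a column: share of non-null values, between 0.0 and 1.0. */
        public static double completeness(List<?> columnValues) {
            if (columnValues.isEmpty()) {
                return 1.0; // an empty column is vacuously complete
            }
            long nonNull = columnValues.stream().filter(Objects::nonNull).count();
            return (double) nonNull / columnValues.size();
        }

        public static void main(String[] args) {
            // Two of five values are missing: completeness = 0.6
            System.out.println(completeness(Arrays.asList("A1", null, "A3", null, "A5")));
        }
    }

Analogous ratios could be defined for the other criteria (e.g., the share of values matching the expected format for conformity), but the exact metrics remain to be fixed, as stated above.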

We now aim to manage the exceptions in an easy and pertinent way, so we adopt ECA (Event-Condition-Action) rules. These are a well-known technique in the field of knowledge representation and offer a flexible, adaptive and modular approach to realizing business processes [15]. ECA rules allow exceptions to be managed like normal situations, which makes their handling straightforward; they are therefore a convenient mechanism for exception handling and we adopt them in our approach. As a result, the events represent the ETL operations, the conditions are the exceptions listed above, and the actions are the methods attached to each class in the diagram, which embody the knowledge collected from data experts. The diagram shown in Fig. 3 regroups the different exceptions that may arise during the ETL stages: in each task, and relative to its sub-tasks, the ETL worker may face different exceptions that can affect the quality of data and, subsequently, the effectiveness of decision making. Therefore, a collection of exceptions and situations has to be handled and managed from the very beginning of the process lifecycle. The scope of this paper is metadata support for the exceptions that affect the ETL process and the quality of data. To illustrate our approach, we treat the data source exceptions. To reach a high quality of data, a set of rules has to be followed; considering those rules together with the ECA rules, we obtain the diagram shown in Fig. 4.
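To make the ECA mechanism concrete, the following minimal sketch shows one way such a rule could be expressed in Java: the event is the triggering ETL operation, the condition tests a record for an exception, and the action applies the corrective knowledge. The type, field and rule names are assumptions made for the example and do not reproduce the prototype code.

    import java.util.Map;
    import java.util.UUID;
    import java.util.function.Predicate;
    import java.util.function.UnaryOperator;

    /** Minimal ECA (Event-Condition-Action) rule sketch for ETL exception handling. */
    public class EcaRule {
        private final String event;                              // triggering ETL operation, e.g. "EXTRACTION"
        private final Predicate<Map<String, Object>> condition;  // detects the exception on a record
        private final UnaryOperator<Map<String, Object>> action; // corrective action from the knowledge base

        public EcaRule(String event, Predicate<Map<String, Object>> condition,
                       UnaryOperator<Map<String, Object>> action) {
            this.event = event;
            this.condition = condition;
            this.action = action;
        }

        /** Applies the action when the rule's event fires and its condition holds. */
        public Map<String, Object> fire(String firedEvent, Map<String, Object> record) {
            if (event.equals(firedEvent) && condition.test(record)) {
                return action.apply(record);
            }
            return record; // normal flow, no exception detected
        }

        public static void main(String[] args) {
            // Illustrative rule for "null primary key in a flat file source".
            EcaRule nullKeyRule = new EcaRule(
                    "EXTRACTION",
                    rec -> rec.get("id") == null,
                    rec -> { rec.put("id", "GEN-" + UUID.randomUUID()); return rec; });
            Map<String, Object> row = new java.util.HashMap<>();
            row.put("id", null);
            row.put("name", "product A");
            System.out.println(nullKeyRule.fire("EXTRACTION", row)); // id now holds a surrogate value
        }
    }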

Fig. 3. Exceptions metamodel

Fig 4. Data sources exceptions metamodel

Our metamodel may be split into three levels. Most of the classes in the diagram possess very few or no attributes. "DS exceptions" is the top-level metaclass, with one attribute. The first classification level considers the origin of the detected exception: it may be generated by the data source itself (such as a data and metadata mismatch), the attribute (the piece of data) may violate the data quality criteria and thus need to be fixed, or the presence of different data sources may generate heterogeneity. As a result, two child classes are created. The class in the upper left part is a further classification of the exceptions related to the attribute itself. Generally, data quality may be specified by a set of rules, and the rule packages defined above may help in exception resolution on the one hand and in knowledge representation on the other hand. Five classes extend the attribute class; at this level, the metamodel moves toward a rule-driven approach. The first class, "Uniqueness rule", embodies the features required to identify attributes whose values should not repeat; it regroups all the problems related to primary key identification and duplicate data. The second one, "Nullity rule", is related to one of the most common symptoms of low data quality: null values. Data absence prevents the user from getting a specific piece of information, so the requirements for such a problem have to be specified. The "Adjacency and consecutiveness rule" eliminates the existence of gaps between data values: it specifies the continuity of values for some specific fields, e.g., in the presence of outliers in data sources. In addition to these factors, some other considerations are required to fully define data quality; this implies that data quality is defined with respect to a business context of application. Accordingly, the rule class encompasses a business rule subclass where exceptions related to the business context are classified, such as data values that stray from their field description and business rules. Data quality problems may also be generated by mistyping, outdated values, etc.; those rules concern data syntax and structure, and we classify the corresponding exceptions under a class named "Dictionary rules" (an example is misspelled data). All the discussed classes encapsulate the appropriate knowledge to resolve the ETL exceptions in an easy and pertinent way through the ECA rules; hence, we sustain the robustness and flexibility of the ETL process regardless of the expertise of the ETL worker. The upper right part of the metamodel shows two classes regrouping the exceptions related to one data source or to many. The "Source exceptions" class possesses one attribute, "data source type", whose related enumeration indicates the different types a source can be; this class is an instance of the parent class "Multiple data sources exceptions", which is a set of resolved exceptions arising from the presence of many data sources at once (mainly the heterogeneity of source semantics, time, business rules, etc.). Conceptual design is probably the most critical process in developing a quality application, and the reuse of existing resources and solutions has become a strategy for cost reduction and efficiency improvement. This classification allows, first and foremost, extensibility: newly generated exceptions can be added to the proposed metamodel without any problem. Metamodel refinement is also possible: the particular needs of particular applications may be imported.
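To illustrate how this classification could be carried over to code, the sketch below renders a small fragment of the metamodel as Java classes. It is an assumption-based simplification: the class names only mirror the rule packages discussed above, and the resolution bodies are placeholders.

    /** Simplified, illustrative fragment of the exceptions metamodel. */
    abstract class DsException {
        final String description;            // the single attribute of the top-level metaclass
        DsException(String description) { this.description = description; }
        /** Corrective knowledge attached to the exception (the ECA action). */
        abstract void resolve();
    }

    /** Exceptions related to the attribute, refined by rule packages. */
    abstract class AttributeException extends DsException {
        AttributeException(String d) { super(d); }
    }

    class UniquenessRuleException extends AttributeException {
        UniquenessRuleException() { super("Duplicate or unidentifiable primary key values"); }
        @Override void resolve() { /* e.g. generate a surrogate key or merge duplicates */ }
    }

    class NullityRuleException extends AttributeException {
        NullityRuleException() { super("Null values in a required field"); }
        @Override void resolve() { /* e.g. assign a default value defined by the business rules */ }
    }

    /** Exceptions arising from the simultaneous presence of several data sources. */
    class MultipleSourcesException extends DsException {
        MultipleSourcesException() { super("Semantic, temporal or business-rule heterogeneity"); }
        @Override void resolve() { /* e.g. align semantics through a shared dictionary */ }
    }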

V. IMPLEMENTATION

To emphasize our proposition, we implement the instantiation of the following exception: "Not specifying the null character properly in flat file data sources". We follow the demarche below during the implementation phase.

Fig. 5. Proposed demarche

Fig. 5 sums up our proposition: the demarche consists of collecting the appropriate knowledge to resolve the anticipated exceptions; this knowledge is then represented in a reusable metamodel. Afterward, the UML diagram is translated into an XML file serving as a knowledge repository. Depending on the ETL environment, the collected knowledge is processed with the appropriate code; in our case, we opt for Java, in accordance with our ETL environment. This section illustrates the applicability of our approach through the chosen example. Null values are a symptom of low data quality. In the following, we test whether the requirements for non-nullity are specified in the system implementation. For this purpose, we create a flat file containing a null primary key and other null fields, as illustrated below.

Fig. 6. Flat file with null primary key

A first step on the implementation side is to extract data from this flat file in an ETL environment. In our case, we select an open source ETL tool so that we can modify its code and add specific functionalities. Based on a comparative study of different open source ETL tools [18], we chose Talend Open Studio, which best fits our needs thanks to its numerous functionalities, its ergonomic interface and its rich component palette.

Accordingly, we implement our ETL prototype and propose to extract data from the aforementioned flat file. When the extraction process starts, Talend assigns "0" to each null primary key; hence a duplicate-entry problem is generated, as seen in Fig. 7, and the ETL process is blocked. For a novice designer, this exception is difficult to handle. According to Talend, some components manage null fields in data sources, such as tFilterRow, tReplace and tMap expressions. First, the tFilterRow component eliminates all the rows containing null values, so we get data loss, as shown in Fig. 8. Second, the tReplace component replaces all null values with the same value, which is inappropriate in some cases, such as replacing a null primary key. Third, tMap is designed to manage null fields in a data source except for a null primary key. We therefore need an efficient way to regroup the different best practices for null-value handling, whatever the field type; the importance of our approach is highlighted here.

Fig. 7. Data extraction from flat file containing null primary key

Fig. 8. Data extraction using tFilterRow


Thus, we aim to help novice designers who lack the corresponding knowledge to achieve good data quality and to reduce the time spent on inadequate solutions. We propose to include our own knowledge, processed as Java code, in the Talend environment. To start, we created an XML file in which we store the exception metadata; XML is widely used to store data and, especially, to exchange it between programs. The exceptions are stored as classes and subclasses with their attributes and methods.
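As a rough illustration of how such a repository could be read back at run time, the sketch below loads a hypothetical exceptions.xml file with the standard DOM API. The file name and the element and attribute names are assumptions made for the example, not the actual structure of our repository.

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    /** Sketch: reading exception metadata from a hypothetical XML repository. */
    public class ExceptionRepositoryReader {
        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                                                 .newDocumentBuilder()
                                                 .parse("exceptions.xml");      // assumed file name
            NodeList exceptions = doc.getElementsByTagName("exception");        // assumed element name
            for (int i = 0; i < exceptions.getLength(); i++) {
                Element e = (Element) exceptions.item(i);
                // "name" and "phase" are assumed attributes describing each exception
                System.out.println(e.getAttribute("name") + " / " + e.getAttribute("phase"));
            }
        }
    }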

Fig. 9. XML excerpt from the data sources metadata

Then, we implement our exceptions in our ETL environment. When the extraction process begins, a check for null fields takes place; if the data source contains null fields, the Java code is triggered and the stored knowledge is processed.

Fig. 10. Data extraction from flat file with ECA implementation

According to the best practices collected to manage the null-field exception, a default value should be attributed to the important fields. In our case, a random number is assigned to the primary key field; hence, the data are sent to the destination without any blocking. The problem of primary key duplication is resolved and the null value is handled.
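The sketch below shows, under our assumptions, the kind of routine that such a rule can trigger (for instance from a tJavaRow-style component): it returns the key itself when present and a generated surrogate otherwise. We use a UUID-based surrogate here to guarantee uniqueness, whereas the prototype assigns a random number; the class name is illustrative.

    import java.util.UUID;

    /** Sketch of the null-primary-key handling routine (illustrative, not the prototype code). */
    public final class NullKeyHandler {

        /** Returns the original key when usable, otherwise a unique surrogate value. */
        public static String fixPrimaryKey(String primaryKey) {
            if (primaryKey == null || primaryKey.trim().isEmpty()) {
                return "GEN-" + UUID.randomUUID();   // surrogate avoids duplicate-entry errors on load
            }
            return primaryKey;
        }

        public static void main(String[] args) {
            System.out.println(fixPrimaryKey(null));   // e.g. GEN-7f1c...
            System.out.println(fixPrimaryKey("1023")); // unchanged
        }
    }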

Fig. 11. Null primary key handled

In this example, we have handled the null primary key in flat files only.

VI. CONCLUSION AND OPEN ISSUES

Complexity and cost are two well-known labels of the ETL process, and design and modeling efforts have been pursued as a way out of these issues. However, ETL processes still "require intensive human effort from the designers to create them" [3]. Furthermore, the quality of data is a matter of concern when creating the ETL process. The proposed approach is therefore an attempt to reduce the complexity of the ETL process and the intensity of human effort on the one hand, and to target data quality improvement on the other hand. The collected and modeled exceptions may considerably improve the quality of data; they also help designers avoid many errors that can disrupt the ETL process enactment, preventing them from reinventing the wheel and wasting time on inadequate designs. Considering the importance of data quality, as future research we aim first to align our approach with an MDD framework, so as to provide iterative refinement of our metamodel through model-to-model (M2M) transformations; moreover, a model driven framework allows the generation of vendor-independent code by using model-to-text (M2T) transformations. Another important issue is to introduce metrics to evaluate the quality of data, so that the collected methodologies and best practices address the continuous improvement of data.


REFERENCES

[1] A. Simitsis and P. Vassiliadis, "A method for the mapping of conceptual designs to logical blueprints for ETL processes", Decision Support Systems, vol. 45(1), pp. 22–40, 2008.
[2] P. Vassiliadis, A. Simitsis and E. Baikousi, "A taxonomy of ETL activities", Proc. ACM Twelfth International Workshop on Data Warehousing and OLAP, 2009, pp. 25–32.
[3] A. Kabiri and D. Chiadmi, "Survey on ETL processes", Journal of Theoretical and Applied Information Technology, vol. 54(2), 2013.
[4] J. Trujillo and S. Luján-Mora, "A UML based approach for modeling ETL processes in data warehouses", in Conceptual Modeling - ER, pp. 307–320, 2003.
[5] Z. El Akkaoui, J. Mazón, A. Vaisman and E. Zimányi, "BPMN based conceptual modeling of ETL processes", in Data Warehousing and Knowledge Discovery, pp. 1–14, 2012.
[6] Z. El Akkaoui, E. Zimányi, J. Mazón and J. Trujillo, "A BPMN based design and maintenance framework for ETL processes", International Journal of Data Warehousing and Mining, vol. 9(3), 2013, pp. 46–72.
[7] Z. El Akkaoui and E. Zimányi, "Defining ETL workflows using BPMN and BPEL", Proc. DOLAP'09, 2009, pp. 41–48.
[8] S. Bergamaschi, F. Guerra, M. Orsini, C. Sartori and M. Vincini, "A semantic approach to ETL technologies", Data & Knowledge Engineering, pp. 717–731, 2011.
[9] A. Ta'a and M. Syazwan Abdullah, "Goal-ontology approach for modeling and designing ETL processes", Proc. Computer Science, 2011, pp. 942–948.
[10] D. Skoutas and A. Simitsis, "Designing ETL processes using semantic web technologies", Proc. 9th ACM International Workshop on Data Warehousing and OLAP, 2006, pp. 67–74.
[11] G. Booch, J. Rumbaugh and I. Jacobson, The Unified Modeling Language User Guide, Pearson Higher Education, 2004.
[12] L. Businska and M. Kirikova, "Knowledge dimension in business process modeling", IS Olympics: Information Systems in a Diverse World, pp. 186–201, 2012.
[13] P. Gomes, J. Farinha and M. Trigueiros, "A data quality metamodel extension to CWM", Proc. Fourth Asia-Pacific Conference on Conceptual Modelling, 2007, pp. 17–26.
[14] R. Singh and K. Singh, "A descriptive classification of causes of data quality problems in data warehousing", International Journal of Computer Science Issues, vol. 7(3), 2010, pp. 41–50.
[15] F. Bry, M. Eckert, P. Patranjan and I. Romanenko, Realizing Business Processes with ECA Rules: Benefits, Challenges, Limits, Springer, 2006.
[16] Z. El Akkaoui, E. Zimányi, J. Norberto Mazón and J. Trujillo, "A model driven framework for ETL process development", Proc. ACM 14th International Workshop on Data Warehousing and OLAP, 2011, pp. 45–52.
[17] S. Ali El-Sappagh, A. Ahmed Hendawi and A. Hamed El Bastawissy, "A proposed model for data warehouse ETL processes", Journal of King Saud University - Computer and Information Sciences, vol. 23(2), 2011, pp. 91–104.
[18] J. Francheteau, "Rapport de stage : Étude des ETL open source", 2007.
[19] M. Solomon, "Ensuring a successful data warehouse initiative", IS Management, vol. 22(1), 2005, pp. 26–36.
[20] S. March and A. Hevner, "Integrated decision support systems: a data warehousing perspective", Decision Support Systems, vol. 43(3), 2007, pp. 1031–1043.
[21] W. Eckerson and C. White, Evaluating ETL and Data Integration Platforms.
[22] M. Mrunalini, T. V. Suresh, D. Kumar, E. Geetha and K. Rajanikanth, "Modeling of data extraction in ETL process using UML 2.0", Bulletin of Information Technology, 26:3–9, 2006.


Integration of the knowledge management process into the risk management process – Moving towards an actors of project approach

Brahami Menaouer

Semaoune Khalissa and Benziane Abdelbaki

LIO laboratory University of Oran, Department of Computer Science Oran, Algeria [email protected]

LAREEM laboratory University of Oran Oran, Algeria [email protected] and [email protected]

Abstract—The management of knowledge and know-how is becoming more and more important in organizations. Building corporate memories to preserve and share knowledge has become a rather common practice. However, research in knowledge management focuses mainly on the processes of creation, capitalization and transfer of knowledge. Researchers have also centered on establishing the knowledge management process in companies, but little on the interaction between the knowledge management process and the risk management process. In this paper, we propose a new approach for integrating the knowledge management process, represented by the GAMETH method, into the risk management process. We apply our approach to the ammonia industry, represented by the Algerian-Spanish company FERTIAL.

Keywords—knowledge management; knowledge mapping; knowledge capitalization; risk management; GAMETH method; knowledge sharing

I. INTRODUCTION

Many companies use risk management to develop their activity (construction, computer science, ecology, industry, pharmaceuticals, health, etc.). Among the different research themes addressed in the literature, the reduction of project risks remains one of the most studied, together with an important literature on detection, evaluation, estimation, and the solutions and tools to be implemented. However, it appears that appropriation (learning) and experience (know-how) are effective ways to prevent risks.

Such knowledge acquired in the past must be managed to allow for more effective risk management: this is the role of knowledge management, a systematic way of managing tacit and explicit knowledge. Its purpose is to retain, transmit and develop knowledge in order to:
- Improve the management of skills,
- Facilitate the activity of individuals in terms of decision making,
- Increase productivity,
- Promote innovation and creativity.

In a first part, we study the main concepts used in this paper regarding the context and elements of a project, the theoretical concepts of risk management, and knowledge management. In a second part, we present our approach for merging the knowledge management process, using the GAMETH method [12], into the risk management process.

II. THEORETICAL FRAMING

This section presents the main concepts that are used in this article regarding context and elements.

A. Context and elements of projects

In the management literature and according to [18], the term project corresponds to the situation in which one has to reach a goal with ad hoc means and in a given time frame. According to [19], a project is defined as a specific, new action of limited duration, which methodically and progressively structures a reality to come; in addition, a project is a complex system of stakeholders, means and actions, constituted to provide a response to a demand elaborated to satisfy the need of a project owner. For [15], a project is a specific approach that allows a reality to come to be structured methodically and progressively; it is defined and implemented to elaborate a response to the requirements of a user, a client or customers, and it implies an objective and actions to be undertaken with given resources.

At the conclusion of these three definitions, we find that the word project is closely bound to the terms objective, means and time. The diagram below, adapted from Briner and Geddes [7], represents the fact that the realization of a project is influenced by the policy of the organization, by some external constraints and by the needs of some people in the organization's environment as well as inside the organization. Consequently, these factors must be considered throughout the life cycle of a project (see Fig. 1).

Fig. 1. The project management triangles

In addition to its different elements, the project evolves in a particular context which confers specific characteristics on it. According to [22], this context is composed of the organizational structure of the enterprise, the direct environment of the project and the general environment of the enterprise. For [15], the organizational structure of the enterprise (matrix, per project, functional) determines the organization of the project because it affects the roles and tasks of the actors. The direct environment of the project (users, project team, management type) frames the objectives and progress of the project, while the general environment (competition, sector of activity, etc.) legitimizes, regulates and/or strengthens the project. The project, its elements and its context are managed by the project leader through project management, for which the literature provides abundant definitions. Project management is first presented as the application of tools and techniques to resources for the accomplishment of a single, complex task subject to constraints of time, cost and quality [6]. Another author specifies that project management addresses these challenges by putting in place an organization and a planning of all the activities aimed at ensuring the achievement of the project objectives (quality, cost, time), monitoring them, anticipating the changes to be implemented and the risks, deciding and communicating [9]. Finally, some describe project management as a process of organization, planning and coordination of means [2] rather than a process of control and monitoring of tasks. Not all projects are successful: by way of example, we can cite the studies of the Standish Group published in the "Chaos Report", which only cover information system projects; the latest public release dates from 2009 and is given below [17] (see Fig. 2).

Fig. 2. Projects success rate

B. The risk management process

Project failures lead us to treat the existing risks that prevent projects from reaching their end or from meeting their initial specifications. Risks are defined in different ways in the literature. For some authors, a risk is an undesired situation with negative consequences resulting from the occurrence of one or more events whose occurrence is uncertain [3; 4]. Other authors add that a risk is more or less important depending on the uncertainty and the probability of its being realized [20; 8; 16]. A risk is not solely associated with a negative result; it can also lead to a positive result [21]. On the other hand, [2] distinguishes risks from risk factors. Also, [25] presents risk as a gap in knowledge, in the sense that risk can be reduced through knowledge capitalization, thus putting forward the interest of knowledge management for risk management. According to [21], risks are present in all the systems of the model presented above: strategic, technical, social, structural, and project management. The project manager must then seek to reduce them and, when they cannot be reduced, monitor their evolution; in this perspective he sets up risk management, which is a principal component of project management. According to several authors, the risk management process is defined as a sequence of five steps (see Fig. 3).


Fig. 3. The risk management process

To be reduced, a risk must first be identified; this is the first step. Once the risk is known, the analysis phase that follows consists in finding the causes of this risk and evaluating its consequences.


The project team then searches for possible solutions to reduce it and sets up the one that seems most effective. The implementation of the solution is piloted and regularly monitored in order to check that it matches the expectations of the team, which makes changes if necessary.

C. Knowledge management

Before examining how knowledge management allows risks to be reduced in projects, we recall its own characteristics. The literature offers several definitions of knowledge management; we focus in this paper on a few of them. Barclay et al. [1] define knowledge management as "a process of identification, formalization, disseminating and use of knowledge in order to promote creativity and innovation in companies". According to [10], knowledge capitalization in an organization aims to promote the growth, transmission and preservation of knowledge in this organization. According to [11], capitalizing on a company's knowledge means considering certain knowledge used and produced by the company as a storehouse of riches and drawing from these riches interest that contributes to increasing the company's capital. For [23], knowledge management systems are designed to capture, create, store, organize and disseminate organizational knowledge. This process (see Fig. 4) takes into account the transformation and evolution from tacit to explicit knowledge [23] and from individual to collective knowledge.

It can carry both theoretical knowledge and know-how of the company. It requires the management of company knowledge resources to facilitate their access and their re-use [24]. It consists of capturing and representing knowledge of the company, facilitating its access, sharing and re-use. This very complex problem can be approached by several points of view: socio-organizational, economic, financial, technical, human and legal [14] (see figure 5).

Fig. 5. The principles and practice of Knowledge Management [14]

We can find in the literature different proposals of life cycle used to realize of knowledge management (such as GAMETH, MASK, REX, KOD, etc.). In our paper, we adopted the knowledge management life-cycle proposed by Grundstein [13], where, according to him, "in any operation of knowledge capitalization, it is important to identify the strategic knowledge to be capitalized" (see fig 6).

Fig. 4. A Model of Dynamic Organizational Knowledge Creation [23]


Fig. 6. The Generic KM Processes (GAMETH method) [13]

III. THE REDUCTION OF RISKS BY THE INTEGRATION OF THE KNOWLEDGE MANAGEMENT PROCESS

Capitalization gathers the processes that allow the "acquired" knowledge to be valorized: feedback from experience on risk reduction, capitalization around the solutions found to improve teamwork, the use of modeling tools, and planning within the company FERTIAL. FERTIAL, the Algerian fertilizer company and a national flagship of the petrochemical industry [26], results from a partnership signed in August 2005 between the Algerian group ASMIDAL [27] and the Spanish group Grupo Villar Mir [28]. It is composed of five major divisions specialized in numerous activities, especially the manufacture of fertilizers and agricultural fertilizers. Safety is a key factor in its industrial and human resources policy, as are staff training, quality and respect for the environment. The most important goal for FERTIAL is to achieve zero accidents and ensure the industrial safety of the surrounding communities, by proposing an approach of knowledge capitalization in the trades and its exploitation in projects. These projects were intended, among others, for the renovation and modernization of industrial facilities to improve their capacity, the acquisition of a new digital control system, and the environment and security (see Fig. 7).

The knowledge management process can be integrated into the risk management process. Indeed, as we shall see, the different phases of risk management correspond to the operational chain of the knowledge management process (see Fig. 8). During the identification phase, the project team pools all the knowledge related to the sources of risk and searches for the presence of these sources at all levels of the project. This knowledge is generated and accumulated during previous learning and from the experience of past projects. The acquisition of such knowledge, which corresponds to the acquisition step of the knowledge management process, is carried out using different means: learning, feedback from experience, and the transfer of knowledge between the actors of the project team. This stock of knowledge feeds the discussions which occur during the identification phase. During the risk analysis phase, the knowledge gained in the past relating to the methods of evaluation, estimation and measurement of risk is put to use. The solution to reduce and/or control the risk arises from the analysis developed just before; through the knowledge held by the team, its trade actors can more or less predict the consequences entailed by the establishment of the solution. Thus, the identification and analysis steps constitute the locating phase of the knowledge management process according to Grundstein (GAMETH method), as defined previously (see Fig. 6).

Fig. 7. Destination of investment projects (production unit, storage installation, charging device and transport, auxiliary installation, environment and security, infrastructure, integral management system)



PRESERVE Formalize Modelize Keep VALUE Access Combine Better to use Fig. 8. Integration of the knowledge management process (GAMETH method) in the risk management process

Moreover, the establishment of the solution is piloted and controlled by the project leader through the evaluation of the effect of the solution on the risk, using for example monitoring dashboards or dialogue with the relevant actors. The interaction between the solution and the two managerial skills of piloting and control is similar to the preservation phase of the knowledge management process (see Fig. 8). Indeed, the implementation of the solution is akin to a process of action which is tested and regulated by the control and piloting of the project leader. From this control and steering comes a more or less thorough evaluation of the effects of the envisaged solution, and this evaluation is the basis for knowledge valorization, the third step of the knowledge management process. Evaluating the solution consists in comparing the results obtained to the desired results; this difference, positive or negative, between real and expected results allows the team to make a self-criticism of the solution and thus to define the advantages and disadvantages of the solution developed, so as to accumulate knowledge. The knowledge management process is then updated with the new knowledge (the fourth phase of the process), and this new knowledge is managed in a project memory so that the knowledge management process can reduce the risks of future projects (the fifth phase of the process). The knowledge management cycle is thus closed: the new knowledge accumulated during risk management is memorized and ready to be disseminated to future project teams. Risk management will be more effective because the phases of identification, analysis and solution implementation will benefit from the experience of past projects.

The integration of knowledge management therefore allows risk to be reduced directly. However, it also acts indirectly on the sources of risk: the lack of responsiveness and cognitive biases.

IV. CONCLUSION

We can summarize the contributions of knowledge management to risk reduction in the following way:
1. Knowledge management allows the cognitive processes of the various project actors to evolve.
2. Knowledge management favors the acquisition of knowledge about risks by making explicit the tacit knowledge of the different actors on the risks, retaining such knowledge and transferring it.
However, it must be noted that the use of knowledge management to reduce risk is only relevant if an assessment and experience feedback of projects is performed by all the actors of the project. It is only at this point that knowledge about the risks can grow, thanks to the measurement of deviations between what was expected of the project and the real results, the analysis of these deviations, and the recording of this analysis in the knowledge base. One must also not lose sight of the fact that knowledge management must take an interest in the environment that surrounds the project, because the risk is closely related to this environment:


the solutions applied to reduce the risks may differ according to the project environment.

Acknowledgment

This project is registered in the context of a collaboration between the company FERTIAL, our research team of the LAREEM laboratory and the National Polytechnic School of Oran (ENPO). The authors also thank the FERTIAL service team for its assistance in finalizing this project.

References

[1] Barclay, R. and Murray, P. C. (2004). What Is Knowledge Management?, Knowledge Praxis.
[2] Barker, S. and Cole, R. E. (2007). Brilliant Project Management: What the Best Project Managers Know, Say and Do, Pearson Education, ISBN 0273707930, pp. 5–150.
[3] Banham, R. (2004). Enterprising views of risk management, Journal of Accountancy, vol. 197, no. 6, pp. 65–72.
[4] Baranoff, E. (2004). Mapping the evolution of risk management, Contingencies, vol. 16, no. 4, pp. 23–27.
[5] Boucher, K. D., Conners, K., Johnson, J. and Robinson, J. (2001). Collaboration: Development & Management - Collaborating on Project Success, Software Magazine, February/March.
[6] Buttrick, R. (2012). Project Management: The Comprehensive Guide to Project Management, 4th edition, Pearson, ISBN-10 2744076406, pp. 5–300.
[7] Briner, L., Geddes, M. and Hastings, C. (1993). Le manager de projet : un leader, Paris, AFNOR, ISBN 9782124783113, pp. 1–177.
[8] Chapman, C. and Ward, S. (2003). Project Risk Management: Processes, Techniques and Insights, 2nd edition, John Wiley and Sons Ltd, UK, ISBN 978-0-470-85355-9.
[9] Corbel, J.-C. (2012). Project Management: Fundamentals - Methods - Tools, 3rd edition, Eyrolles, ISBN-10 2212554257, pp. 23–68.
[10] Dieng-Kuntz, R., Corby, O., Gandon, F., Giboin, A., Golebiowska, J., Matta, N. and Ribière, M. (2001). Méthodes et outils pour la gestion des connaissances : une approche pluridisciplinaire du knowledge management, 2e édition, Dunod.
[11] Grundstein, M. (2007). GAMETH®: a constructivist and learning approach to identify and locate crucial knowledge, International Journal of Knowledge and Learning, vol. 5, no. 3/4, 2009, Inderscience Publishers, pp. 289–305.
[12] Grundstein, M. and Rosenthal-Sabroux, C. (2005). Towards a Model for Global Knowledge Management within the Enterprise (MGKME), IRMA 2005 Information Resources Management Association International Conference, Reading, United States.
[13] Grundstein, M. (2012). Three Postulates That Change Knowledge Management Paradigm, in Huei-Tse Hou (Ed.), New Research in Knowledge Management, Models and Methods, Chap. 1, pp. 1–26, InTech, ISBN 978-953-51-0190-1.
[14] Grundstein, M. and Barthès, J.-P. (1996). An Industrial View of the Process of Capitalizing Knowledge, Fourth International ISMICK Symposium Proceedings, Rotterdam, The Netherlands, October 21–22, 1996, Ergon Verlag.
[15] Giard, V. (2004). Project Management, Economica, Collection Management, Series Production and Quantitative Techniques Applied to Management, pp. 3–23.
[16] Hillson, D. A. and Simon, P. (2007). Practical Project Risk Management - The ATOM Methodology, Management Concepts Inc., USA, ISBN 978-1-56726-202-5.
[17] International Organization for Standardization (2007). Draft ISO 31000, Risk Management - Guidelines on Principles and Implementation of Risk Management, final version issued in 2009.
[18] Morley, C. (2003). Project Management Information Systems: Principles, Techniques, Implementation and Tools, 4th edition, Dunod, Paris, pp. 5–208.
[19] Marciniak, R. and Rowe, F. (2008). Information Systems, Dynamics and Organization, 3rd edition, Economica, Collection Management, ISBN-10 2717855823.
[20] Miller, R. and Lessard, D. (2001). Understanding and managing risks in large engineering projects, International Journal of Project Management, 19(8), pp. 437–443.
[21] Moulard, M. (2003). The use of risk management in software development projects, an exploratory study, Journal of Technology Management, vol. 14, no. 1, School of Management, Grenoble, France.
[22] Newcombe, R. (2000). The anatomy of two projects: a comparative analysis approach, International Journal of Project Management, 18, pp. 189–199.
[23] Nonaka, I. and Takeuchi, H. (1995). The Knowledge-Creating Company: How Japanese Companies Create the Dynamics of Innovation, New York, Oxford University Press.
[24] O'Leary, D. E. (1998). Enterprise knowledge management, Computer, 31(3), pp. 54–61.
[25] Pender, S. (2001). Managing incomplete knowledge: why risk management is not sufficient, International Journal of Project Management, vol. 19(2), pp. 79–87.
[26] FERTIAL, official web site: http://www.fertial-dz.com/
[27] ASMIDAL, official web site: http://www.asmidal-dz.com/. The ASMIDAL group specializes in the development of fertilizers, ammonia and derivatives.
[28] Grupo Villar Mir, official web site: http://www.grupovillarmir.es/. Grupo Villar Mir, S.L., through its subsidiaries, is engaged in real estate, electrometallurgy, electric energy production, fertilizers, construction, concessions, services, etc.

A Methodology for designing Competitive Intelligence System based on semantic Data Warehouse

Sabrina ABDELLAOUI

Fahima NADER

Ecole Nationale Supérieure d’Informatique ESI Algiers, Algeria [email protected]

Ecole Nationale Supérieure d’Informatique ESI Algiers, Algeria [email protected]

Abstract—Competitive Intelligence (CI) is a systematic and ethical process for gathering, analyzing, and managing information about the business environment of an organization in order to identify relevant information for decision making process. CI enables the development of strategies that confer companies a significant competitive advantage. The best decisions are made when all the relevant data available are taken into consideration. As the amount of data grows very fast inside and outside of an enterprise, exploiting these mountains of data efficiently is a crucial issue that needs to be addressed. Note that such data are particularly heterogeneous, distributed, autonomous and evolving in a dynamic environment. The main difficulty is the identification and resolution of structural and semantic conflicts between heterogeneous data, usually spread in multiple sources. Data integration is one of the relevant solutions that solve this problem. It can be viewed as a process by which several heterogeneous sources are consolidated into a single data source associated with a global schema. Data Warehouse (DW) systems are defined as Data Integration Systems (DIS), where data sources are duplicated in the same repository after applying an ExtractTransform-Load (ETL) process. In a decisional context, DW is the most suitable solution offering various analysis and visualization tools used within a decision making process. In this paper, we propose a methodology for designing DW that takes into account semantic sources and decision maker requirements. As the result a Competitive Intelligence System (CIS) for a successful strategic management is generated. Keywords—Competitive intelligence, Integration Systems, Data Warehouse, Semantic Databases.

I. INTRODUCTION Competitive Intelligence (CI) is a systematic and ethical process for gathering, analyzing, and managing information about the business environment of an organization (e.g. competitors, customers, suppliers, governments, technological trends, or ecological developments) in order to identify relevant information for decision making process [1]. CI enables the development of strategies that confer companies a significant competitive advantage. The best decisions are made when all the relevant data available are taken into consideration. As the amount of data grows very fast inside and outside of an enterprise, exploiting these mountains of data efficiently is a crucial issue that needs to be addressed. Data integration is one of the relevant solutions that solve this problem. It can be viewed as a process by which several data sources (where each source is associated with a local schema) are consolidated into a single data source associate with a global schema. DW systems are defined as Data

Integration Systems (DIS), where data sources are duplicated in the same repository after applying an Extract-Transform-Load (ETL) process [2]. Formally, a data integration system is a triplet I = <G, S, M>, where G represents the global schema (defined on an alphabet AG) that models the integrated schema, S is the set of source schemas (defined on an alphabet AS) describing the structure of the sources participating in the integration process, and M is a mapping between G and S that establishes the connection between the elements of the global schema and those of the sources. To interrogate the integrated system, queries are expressed in terms of the constructs of the global schema G [3]. The construction of a data integration system is a hard task due to the following main points. (a) The explosive growth of data sources: the number of data sources involved in the integration process is increasing, and integrating these mountains of data requires automatic solutions. (b) Autonomy of sources: the data sources are created independently by various designers at different moments; these designers are quite free to modify their schema or update their content without informing users. (c) Distribution of sources: the data sources are often stored on supports that are geographically distributed. (d) Heterogeneity of data: the types of heterogeneity that can affect data have been widely studied in the literature and can be grouped into two main categories: structural heterogeneity, due to the use of different structures and/or different formats to store data, and semantic heterogeneity, due to different interpretations of real-world objects. Semantic conflicts occur when (1) the same symbolic name covers different concepts (homonyms), or (2) several symbolic names cover the same concept (synonyms). The main difficulty is the identification and resolution of structural and semantic conflicts between heterogeneous data, usually spread over multiple sources. Different categories of conflicts may be encountered; the following taxonomy was suggested (Figure 1): naming conflicts, scaling conflicts, confounding conflicts and representation conflicts [4]. (1) Naming conflicts arise when the same name is used for different objects (homonyms) or different names are used for the same object (synonyms). Example: the property Situation appears in S1 and S2 with two different meanings.


In S1 it indicates the student's employment status (employee or not), while in S2 it refers to the student's family status (married or single); on the other hand, the same concept is designated by Student in source S1 and by Probationer in source S2. (2) Scaling conflicts occur when different reference systems are used to measure a value; for example, the Grant of a Student is measured in dollars in S1 and in euros in S2. (3) Confounding conflicts occur when concepts seem to have the same meaning but differ in reality due to different measuring contexts; for example, the Grant is assigned to all students in S1 and only to the students that are not employed in S2. (4) Representation conflicts arise when two source schemas describe the same concept in different ways; for example, the concept Student is represented by two classes (Person and Student) and seven properties in S1, and by one class (Probationer) and six properties in S2.
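For illustration only, the sketch below shows how two of these conflicts could be resolved when populating the global schema; the conversion rate and the name mapping are assumptions made for the example, not values defined by the approach.

    /** Illustrative resolution of a scaling conflict and of a naming conflict. */
    public class ConflictResolutionExample {

        private static final double USD_PER_EUR = 1.10;   // assumed fixed rate for the example

        /** Scaling conflict: grants coming from S2 are in euros, the global schema uses dollars. */
        public static double grantToGlobalUnit(double grantInEuros) {
            return grantInEuros * USD_PER_EUR;
        }

        /** Naming conflict: "Probationer" in S2 is a synonym of the global concept "Student". */
        public static String conceptToGlobalSchema(String localConcept) {
            return "Probationer".equals(localConcept) ? "Student" : localConcept;
        }

        public static void main(String[] args) {
            System.out.println(grantToGlobalUnit(500.0));             // 550.0 dollars
            System.out.println(conceptToGlobalSchema("Probationer")); // Student
        }
    }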

Fig. 1. Taxonomy of semantic conflicts.

Ontologies play an important role in reducing this heterogeneity and in ensuring automatic data integration by resolving syntactic and semantic conflicts. They have received a lot of attention in recent years from various research domains. The massive use of ontologies generates a large amount of semantic data, and persistent solutions to store and query these mountains of semantic data have been proposed to facilitate their management. This gave rise to a new type of database, called semantic databases (SDB). The emergence of SDBs makes these sources candidates for DW systems; a DW integrating SDBs is called a Semantic Data Warehouse (SDW). The DW design cycle is composed of five main phases [5]: requirements definition, conceptual design, logical design, the ETL phase and physical design. Two main approaches are used for the initial design of a DW [6]: the supply-driven approach and the demand-driven approach. In the supply-driven approach, the DW is designed starting from a detailed analysis of the data sources, without taking user requirements into account; this approach considers only the actual availability of data in the operational sources. The demand-driven approach starts from determining the information requirements of DW users or decision makers; it provides a model focusing on what is required rather than on what is available. A confrontation between the schemas generated from the requirements and the data sources is needed. The conceptual design phase aims at providing a conceptual model of the DW annotated with multidimensional concepts (facts, measures, dimensions); this model is an abstract representation independent of any constraint and technical implementation. The logical design step translates the conceptual schema into a logical schema on the chosen logical model (relational, multidimensional, hybrid), adapted to the specific implementation model; three main logical representations are distinguished: ROLAP, MOLAP and HOLAP. The ETL phase aims at extracting data from their sources, eliminating the conflicts between data and finally storing them in the target repository. The physical phase implements the logical model of the DW and specifies the optimization techniques. In this paper, we propose a methodology for designing a SDW that takes into account both the sources and the decision maker requirements; as a result, a Competitive Intelligence System for a successful strategic management is generated.

II. RELATED WORK

Several approaches, models and tools covering the CI process have been proposed in the literature; we present in this section the most important works. According to the Society of Competitive Intelligence Professionals (SCIP), the CI process is run in a continuous cycle, called the CI cycle [1]. Five phases constitute this cycle: planning and direction, collection, analysis, dissemination and feedback (Figure 2). The first phase, planning and direction, requires the identification of the key intelligence topics; it involves working with decision makers to discover their intelligence needs and then translating those needs into specific intelligence requirements. Collection activities include the identification of all potential sources of information and then gathering the right data from which the required intelligence should be generated to support decision making. Unlike business espionage, which implies illegal means of information gathering, CI is restricted to the gathering of public sources that can be legally and ethically identified and accessed. After the collection of data, the analysis involves interpreting and translating the collected raw data into actionable intelligence [7] that will improve planning and decision making or enable the development of strategies that offer a sustainable competitive advantage; the analysis phase must therefore produce a recommendation for a specific action. The dissemination phase is the step where the CI practitioner communicates the results of the analysis to the decision makers in a format that is easily understood [7]. Feedback is the last stage of the intelligence cycle; feedback activities involve measuring the impact of the intelligence that was provided to the decision makers, and they provide the analyst with important areas for continuous improvement or further investigation. The SCIP approach allows the definition of a set of models, methods and technological tools to conduct CI analysis and research, using various analytical models such as SWOT analysis, Porter's Five Forces, etc. [1]. In the MEDESIIE method [8], competitive intelligence is seen as a cognitive process whose primary purpose is to provide assistance in management processes and to produce representations of the environment in order to create new knowledge.


seen as a cognitive process whose primary purpose is providing assistance in management processes and producing representations of the environment in order to create new knowledge. MEDESIIE carries out an analysis of decision-maker needs and proposes different models (a model of the enterprise, a model of the environment, a model of strategy, a model for the collection, analysis and validation of needs, and a model for defining the economic intelligence products). The SITE research team proposes different models for the different phases of the CI process (the Model for Decision Problem specification MEPD [9], the Watcher's Information Search Problem model WISP [10], the Model for Information Retrieval query Annotations Based on Expression Levels MIRABEL [11], etc.). These models are used to represent the tacit knowledge or skills of actors with respect to specific phases of the CI process. The CI process exploits several data sources (scientific databases, media, news, RSS feeds, intranets, wikis, social networks, internal databases, etc.) in order to identify relevant information for the decision-making process. Such data are particularly heterogeneous, distributed, autonomous and evolving in a dynamic environment. The majority of approaches dedicated to CI do not address the problem of heterogeneity. The XPlor approach emphasizes this aspect and assumes that the documents contained in the target corpus can come from heterogeneous sources. It defines a unified view of the documents in the target corpus, based on specific and generic format descriptors, but it does not take into consideration the semantic conflicts between heterogeneous data [12].

Fig. 2. CI cycle [1].

On the other hand, many integration systems have been proposed in the literature. A classification of existing integration systems based on three orthogonal criteria was proposed in [13]. The first criterion is data representation. Two main data integration architectures are proposed in the literature: materialized (warehouse) and virtual (mediator). In a materialized architecture, the data of the local sources are duplicated and stored in a single database called the data warehouse. In the virtual architecture, data remain in the local sources and are accessed through a mediator; the mediator translates a user query into source queries, synthesizes the results and returns the answers. The second criterion is the mapping direction between the global and local schemas. In Global-as-View (GaV) systems, the global schema is expressed as a

view over the data sources. This approach facilitates query reformulation by reducing it to a simple execution of views, as in traditional databases. However, a change in a source schema, or the addition of a new data source, requires the designer to revise the global schema and the mappings between the global schema and the source schemas. Thus, GaV does not scale well for large applications. The reverse approach is Local-as-View (LaV). In this approach, the designer creates a global schema independently of the source schemas; for a new source, the designer only has to give a source description that describes the source relations as views over the global schema. Therefore, LaV scales better. However, evaluating a query in this approach requires rewriting it in terms of the data sources, and rewriting queries using views is a difficult problem in databases; LaV therefore performs poorly when queries are complex. The third criterion is mapping automation. This criterion specifies whether the mapping between the global schema and the local schemas is manual, semi-automatic based on vocabularies or linguistic ontologies, or automatic using conceptual domain ontologies as a reference model.
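To make the GaV/LaV distinction concrete, the following minimal Python sketch (not taken from the surveyed systems; relation and source names are invented) contrasts the two mapping directions over a hypothetical global relation Product(name, price).

# Two local sources with their own schemas (invented data).
source_a = [("laptop", 900), ("phone", 400)]        # SrcA(item, cost)
source_b = [("tablet", 300, "EUR")]                 # SrcB(label, amount, currency)

# --- GaV: the global relation is defined as a view over the sources. ---
def product_gav():
    """Global Product(name, price) expressed as a query over SrcA and SrcB."""
    rows = [(item, cost) for item, cost in source_a]
    rows += [(label, amount) for label, amount, _ in source_b]
    return rows

# Query answering under GaV is simple view unfolding/execution:
cheap = [r for r in product_gav() if r[1] < 500]

# --- LaV: each source is described as a view over the global schema. ---
# Adding a new source only adds a new description; the global schema is untouched.
lav_descriptions = {
    "SrcA": "SrcA(item, cost) is contained in Product(name, price)",
    "SrcB": "SrcB(label, amount, currency) is contained in Product(name, price)",
}
# Answering a query over Product now requires rewriting it in terms of the
# source views (e.g. bucket-style algorithms), which is the hard part of LaV.
print(cheap, lav_descriptions["SrcB"])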

III. BACKGROUND
We present in this section the background related to ontologies and semantic databases, and the formalization of the SDW.
A. Ontologies and semantic databases
An ontology is defined by Gruber [14] as an explicit specification of a conceptualization. The massive use of ontologies generates a large amount of semantic data. To facilitate its management, persistent solutions for storing and querying these mountains of semantic data have been proposed, giving rise to a new type of databases, called semantic databases (SDB). SDBs use different ontological formalisms such as RDF, RDFS, OWL, PLIB, FLIGHT, etc. On the other hand, three main relational representations are distinguished: vertical, binary and horizontal [15]. The vertical representation stores data in a single table of three columns (subject, predicate, object). In the binary representation, classes and properties are stored in different tables. The horizontal representation translates each class into a table having a column for each property of the class. Three architectures are used to store SDBs. Systems with a type I architecture use the same architecture as traditional databases, with two parts: the data schema part and the meta-schema part. In systems with a type II architecture, the ontology model is separated from its data, which gives an architecture with three parts: the ontology model, the data model and the meta-schema; the ontology model and the data model can be stored in different storage schemas. Systems with a type III architecture consider an architecture with four parts, where a new part representing the meta-schema of the ontology is added.
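As an illustration of the three relational layouts, the sketch below (invented data, not tied to any particular SDB engine) shows how the same set of triples can be viewed vertically, in binary form, and horizontally.

from collections import defaultdict

# Vertical layout: a single (subject, predicate, object) table.
triples = [
    ("s1", "rdf:type", "Student"), ("s1", "name", "Alice"), ("s1", "takesCourse", "c1"),
    ("s2", "rdf:type", "Student"), ("s2", "name", "Bob"),
]

# Binary layout: one two-column table per property
# (class membership ends up in the rdf:type table).
binary = defaultdict(list)
for s, p, o in triples:
    binary[p].append((s, o))

# Horizontal layout: one table per class, one column per property of that class.
def horizontal(class_name):
    members = sorted({s for s, p, o in triples if p == "rdf:type" and o == class_name})
    rows = {s: {"id": s} for s in members}
    for s, p, o in triples:
        if s in rows and p != "rdf:type":
            rows[s][p] = o
    return list(rows.values())

print(horizontal("Student"))
# [{'id': 's1', 'name': 'Alice', 'takesCourse': 'c1'}, {'id': 's2', 'name': 'Bob'}]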


B. Framework formalization for constructing the SDW
In this section we describe a generic framework for a DW system integrating SDBs. The generic framework is composed of a global schema G representing the intensional knowledge (the global ontology), a set of local sources S and mappings M between G and S; the extensional knowledge, or instances, is stored in the local sources. The framework is thus defined as a triple DIS: <G, S, M>.
The global schema G: the global schema is defined by its conceptual structure, which we call the Information Model (IM). IM is defined as IM: <C, R, Ref, Formalism>, where:
• C denotes the concepts of the model (atomic concepts and concept descriptions).
• R denotes the roles (relationships) of the model. Roles can be relationships relating concepts to other concepts, or relationships relating concepts to data values (Integers, Floats, etc.).
• Ref : C → (Operator, Exp(C, R)) is a function defining the terminological axioms of a DL TBox. Operators can be inclusion (⊑) or equality (≡). Exp(C, R) is an expression over the concepts and roles of IM using description logic constructors such as union, intersection, restriction, etc. (e.g., Ref(Student) → (⊑, Person ⊓ ∀ takesCourse(Person, Course))).
• Formalism is the formalism followed by the global ontology model, such as RDF, OWL, etc.
The local sources S: each local source Si is defined as Si: <IM, I, Pop, SMIM, SMI, Ar>, where:
• IM is the information model of the source.
• I represents the instances, or data, of the source.
• Pop: C → 2^I is a function that relates each concept to its instances.
• SMIM is the storage model of the information model (vertical, binary or horizontal).
• SMI is the storage model of the instance part I.
• Ar is the architecture of the source (type I, type II or type III).
The mappings M: the mappings between the global and local schemas are defined as M: <MapSchemaG, MapSchemaS, MapElmG, MapElmS, Interpretation, SemanticRelation, Strength, Type>. This formalization is based on a meta-model defined for conceptual mappings.
• MapSchemaG and MapSchemaS represent respectively the mappable schema of the global schema and of the local schema (the information model).
• MapElmG and MapElmS represent respectively the mappable element of the global schema and of the local source. This element can be a simple concept or an expression over the schema.
• Interpretation indicates whether the mapping has an intensional or extensional interpretation.
• SemanticRelation is the type of semantic relationship between MapElmG and MapElmS. Three relationships are possible: Equivalence, Containment (Sound, Complete) or Overlap. Equivalence states that the connected elements represent the same aspect of the real world. Containment states that the element in one schema represents a more specific aspect of the world than the element in the other schema. Overlap states that some objects described by the element in one schema may also be described by the connected element in the other schema [16].
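A minimal Python sketch of this formalization, assuming simple string-based encodings for concepts, roles and DL expressions, could look as follows; the Strength and Type components of the mapping tuple are kept as plain fields since they are not detailed in this excerpt.

from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

@dataclass
class InformationModel:          # IM = <C, R, Ref, Formalism>
    C: Set[str]                              # concepts
    R: Set[str]                              # roles
    Ref: Dict[str, Tuple[str, str]]          # concept -> (operator, DL expression)
    Formalism: str                           # e.g. "OWL", "RDF"

@dataclass
class LocalSource:               # Si = <IM, I, Pop, SMIM, SMI, Ar>
    IM: InformationModel
    I: List[str]                             # instances
    Pop: Dict[str, Set[str]]                 # concept -> its instances
    SMIM: str                                # "vertical" | "binary" | "horizontal"
    SMI: str
    Ar: str                                  # "type I" | "type II" | "type III"

@dataclass
class Mapping:                   # one element of M
    MapSchemaG: str
    MapSchemaS: str
    MapElmG: str
    MapElmS: str
    Interpretation: str                      # "intensional" | "extensional"
    SemanticRelation: str                    # "Equivalence" | "Containment" | "Overlap"
    Strength: str                            # named in the tuple, not detailed here
    Type: str                                # named in the tuple, not detailed here

# Example following Ref(Student) = (⊑, Person ⊓ ∀takesCourse.Course):
G = InformationModel(
    C={"Person", "Student", "Course"},
    R={"takesCourse"},
    Ref={"Student": ("⊑", "Person ⊓ ∀takesCourse.Course")},
    Formalism="OWL",
)
print(G.Ref["Student"])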

IV. THE DESIGN METHODOLOGY
We propose a hybrid approach to requirements definition that takes into account both the decision makers' requirements and the sources. We assume that the decision makers express their needs and strategic goals in natural language. Each strategic goal is decomposed into sub-goals. Our methodology uses the decomposition of the main goal to determine which categories of the organization's environment are concerned (competitors, customers, suppliers, governments, technological trends, ecological developments, etc.). We assume that each category of the business environment is associated with a shared ontology (SO). A local ontology of the DW (DWO) is defined by extracting the ontological classes and properties of the SO that are used to express the goals. The conceptual view of the DW is represented by the DWO. The multidimensional model of the DW is represented by the DWO annotated with multidimensional concepts (facts, measures, dimensions, hierarchies and dimension attributes). A fact is the subject analyzed; it consists of measures, or attributes, that correspond to information related to the domain of interest. A dimension is an analysis context of a fact.
The ETL phase aims at extracting data from the heterogeneous sources. The extracted data are then propagated to the temporary storage area of the warehouse, called the Data Staging Area (DSA), where their transformation, homogenization and cleansing take place. Finally, the data are loaded into the target DW. [2] has defined ten generic conceptual operators typically encountered in an ETL process: (1) EXTRACT(S, E): identifies the elements E (Class, Property) of the source schema S from which data should be extracted; (2) RETRIEVE(S, C): retrieves the instances of the class C from the source S; (3) MERGE(S, I): merges instances belonging to the same source; (4) UNION(C, C'): merges instances whose corresponding classes C and C' belong to different sources S and S'; (5) JOIN(C, C'): combines instances whose corresponding classes C and C' are related by a property; (6) STORE(S, C, I): loads the instances I corresponding to the class C into the target data store S; (7) DD(I): detects duplicate values in the incoming record sets; (8) FILTER(S, C, C'): filters incoming record sets, allowing only records with values of the element specified by C'; (9) CONVERT(C, C'): converts incoming record sets from the format of the element C to the format of the element C'; (10) AGGREGATE(F, C, C'): aggregates incoming record sets by applying the aggregation function F (COUNT, SUM, AVG, MAX) defined in the target data store. A simplified sketch of a few of these operators is given below.
In the logical design step we translate the conceptual schema into a relational schema; we adopt a ROLAP representation. Finally, the physical phase of our methodology implements the logical model of the DW on Oracle 11g.
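The sketch below gives simplified, in-memory versions of a few of the operators listed above (RETRIEVE, UNION, FILTER and AGGREGATE); the record structures are invented for illustration, and real implementations would operate over the semantic sources.

def retrieve(source, cls):
    """RETRIEVE(S, C): instances of class cls from a source shaped {class: [records]}."""
    return list(source.get(cls, []))

def union(instances_c, instances_c2):
    """UNION(C, C'): merge instances of corresponding classes from two sources."""
    return instances_c + instances_c2

def filter_(instances, predicate):
    """FILTER(S, C, C'): keep only records satisfying the condition on C'."""
    return [r for r in instances if predicate(r)]

def aggregate(instances, key, value, func=sum):
    """AGGREGATE(F, C, C'): apply F to the values of C' grouped by key."""
    groups = {}
    for r in instances:
        groups.setdefault(r[key], []).append(r[value])
    return {k: func(v) for k, v in groups.items()}

# Tiny illustration with invented sources:
src1 = {"Sale": [{"region": "north", "amount": 10}, {"region": "south", "amount": 5}]}
src2 = {"Sale": [{"region": "north", "amount": 7}]}
sales = union(retrieve(src1, "Sale"), retrieve(src2, "Sale"))
big = filter_(sales, lambda r: r["amount"] >= 7)     # records with amount >= 7
print(aggregate(sales, "region", "amount"))          # {'north': 17, 'south': 5}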



Fig. 3. Proposed methodology.

V. CONCLUSION
The integration of data from heterogeneous and distributed sources has become a critical need for competitive intelligence systems. With the explosion in the number of data sources, automatic integration solutions are needed. These solutions are confronted with problems related to the structural and semantic heterogeneity of the sources. The fundamental challenge is first the identification of conflicts between concepts in different sources, and then the resolution of these conflicts between semantically related concepts. Ontologies play an important role in reducing this heterogeneity and ensuring automatic data integration by resolving syntactic and semantic conflicts; as a result, semantic databases become candidates to feed the DW. Data warehouse technology is the incontestable tool for businesses and organizations to make strategic decisions and ensure their competitiveness. We proposed in this paper a methodology for designing an SDW; this SDW constitutes the core of a competitive intelligence system.
REFERENCES
[1] R. Bose, "Competitive intelligence process and tools for intelligence analysis", Industrial Management and Data Systems, Vol. 108, No. 4, pp. 510–528, 2008.
[2] D. Skoutas and A. Simitsis, "Ontology-based conceptual design of ETL processes for both structured and semi-structured data", Int. J. Semantic Web Inf. Syst., Vol. 3, No. 4, pp. 1–24, 2007.
[3] M. Lenzerini, "Data integration: A theoretical perspective", In PODS, pp. 233–246, 2002.
[4] C. H. Goh, S. Bressan, S. E. Madnick, and M. D. Siegel, "Context interchange: New features and formalisms for the intelligent integration of information", ACM Transactions on Information Systems, Vol. 17, No. 3, pp. 270–293, 1999.
[5] M. Golfarelli, "Data warehouse life-cycle and design", In Encyclopedia of Database Systems, pp. 658–664, Springer US, 2009.
[6] R. Winter and B. Strauch, "A method for demand-driven information requirements analysis in data warehousing projects", In 36th HICSS, p. 231, 2003.
[7] S. H. Miller, "Competitive Intelligence – An Overview", Society of Competitive Intelligence Professionals, Alexandria, VA, available at: www.scip.org/2_overview.php, 2001.
[8] M. Salles, "Stratégies des PME et intelligence économique. Une méthode d'analyse du besoin", Economica, Paris, 2006.
[9] N. Bouaka, "Développement d'un modèle pour l'explicitation d'un problème décisionnel : un outil d'aide à la décision dans un contexte d'intelligence économique", Thèse de doctorat, Université Nancy 2, France, 2004.
[10] P. Kislin, "Modélisation du problème informationnel du veilleur", Thèse de doctorat, Université Nancy 2, France, 2007.
[11] S. Goria, "Proposition d'une démarche d'aide à l'expression des problèmes de recherche d'informations dans un contexte d'intelligence territoriale", Thèse de doctorat, Université Nancy 2, France, 2006.
[12] A. El Haddadi, B. Dousset, I. Berrada, "Establishment and application of Competitive Intelligence System in Mobile Devices", Journal of Intelligence Studies in Business, Vol. 1, No. 1, 2011, pp. 87–96.
[13] L. Bellatreche, D. Nguyen Xuan, G. Pierra, H. Dehainsala, "Contribution of ontology-based data modeling to automatic integration of electronic catalogues within engineering databases", Computers in Industry, Vol. 57, No. 8–9, 2006, pp. 711–724.
[14] T. Gruber, "A translation approach to portable ontology specifications", Knowledge Acquisition, Vol. 5, No. 2, pp. 199–220, 1993.
[15] H. Dehainsala, G. Pierra, and L. Bellatreche, "OntoDB: An ontology-based database for data-intensive applications", In DASFAA, pp. 497–508, April 2007.
[16] S. Brockmans, P. Haase, L. Serafini, and H. Stuckenschmidt, "Formal and conceptual comparison of ontology mapping languages", In Modular Ontologies, pp. 267–291, Springer-Verlag, Berlin, Heidelberg, 2009.


these works remain limited to numerical analyses and to the counting of instances, hence the problem of analyzing textual content. To address this problem, the authors of these papers integrate data mining techniques into the multidimensional modeling. We can also cite the approach of [6], which suggests combining multidimensional modeling and information retrieval techniques to supply the documents relevant to the current analysis. This work proposes to link the information contained in documents to multidimensional data in order to explain the facts. A first approach consists in extending the multidimensional representation models designed for numerical data to documents [7, 8]. This approach relies on the definition of dimensions modelling the structure of the documents. A “structure” dimension is built from the structures extracted from the documents via the tree structure of XML documents (DTD or Schema). Each parameter of these dimensions models the various levels of granularity of a document (section, sub-section, paragraph, etc.). A second approach proposes the “Galaxy” model to mitigate the difficulties of the models by extension [9, 10]. This galaxy modeling rests on the idea of using a single concept to represent the data, which can be employed symmetrically as a subject or as an axis of analysis. This work proposes a multidimensional modeling of XML documents. Our objective goes further, since we wish to be able to analyze and model all types of documents, whatever their format, by combining text mining techniques and multidimensional modeling. Our approach consists in processing the data in their basic form, which offers several advantages: better reactivity and easier updates. Nevertheless, in order to adapt to the majority of structures, it is necessary to use metadata, which describe the structure. For this, we use web services and a NoSQL DBMS. Note, in addition, that more than 90% of the cases encountered can be processed without any reformatting. Therefore, to standardize multidimensional mining on the textual data of all sources, we propose a unified structure allowing all the inter-item relations found in the analyzed documents to be stored. This technique makes it possible to build cubes crossing any two variables and time. This modeling allows the extraction of the existing dependence relations between the various attributes of the processed information corpora. Our goal is to present, in the form of a multidimensional model, the dependence relations between the variables present in a large document collection. The description of these relations and their analysis make it possible to establish scenarios that tend to explain the complex mechanisms governing the environment of a field or an actor. The goal is to reduce the informational space in order to control it better, by eliminating the independent elements so as to keep only the most significant relations in strategic terms. Many measures of dependence can be used: covariances, correlations, coincidences, contingencies, co-occurrences.


These measures give different but complementary views of the same reality.
B. Multidimensional model definition
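The formal definition itself is not legible in the extracted text. A plausible formalization, assuming each cube cell counts the documents in which two modalities co-occur during a period, is:

C_{XY}(x, y, t) \;=\; \bigl|\{\, d \in \mathcal{D} \;:\; x \in d,\; y \in d,\; \mathrm{date}(d) \in t \,\}\bigr|

where X and Y are two qualitative variables with modalities x and y, \mathcal{D} is the target corpus and t is a period of the temporal dimension.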

Figure 1. Multidimensional presentation
The multidimensional model is thus a model with three dimensions. It enables us to define the various dependence relations between the values of the elements while taking the temporal structure into account.

The qualitative variables can be:
• Ordinal: year of publication, hours of connection, days of the week or of the month, etc.
• Hierarchical: hierarchical thesauri, semantic geographical zones, inclusions, access paths to files, etc.
• Nominal: authors, journals, countries, keyword dictionaries, etc.
Moreover, qualitative variables can be:
• Unimodal: presence or absence of a characteristic.
• Multimodal: year, journal, language, type of document, source, etc. (exactly one modality of the variable is then required for each document).
C. Meta-model of unstructured data
The multidimensional model aims at identifying all the existing dependence relations between the different variables of the subject of analysis. These relations are defined by


> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < matrices of co-occurences, which indicate the simultaneous presence of the methods of two qualitative variables in a document. We adopt these matrices by adding a third temporal variable there (Year, Month, Days, Hours), which consists in indicating the presence of such a relation, in such a moment. The meta-model of data makes it possible to gather the existing relations in a corpus in periods. The composition of the corpus is based on the taking into account of the relations of existing dependences in the structure of the meta-model by removal of the independent elements. In order to build the multidimensional model, we will keep only the boxes whose values are equal to or higher than one. Example. Following presents a formed multidimensional presentation of collaborations between the authors in cells and of three edges graduated respectively by the sets of themes of research, the organizations and of the publication dates. This presentation is not limited to three axes but spreads into metameta-model or the number of axes is unspecified being able to go until several tens.

III. SOURCING AND WAREHOUSING SERVICE ARCHITECTURE

Figure 4. Service architecture for XEW

Figure 2. Laboratory collaboration
D. Homogenisation of the information sources
The final objective is to obtain a unified view of the collected sources, which will be used throughout the analysis process. This view must meet the following objectives:
• a homogeneous view, shared by the various data whatever their sources;
• a reduced view of the information, to facilitate and accelerate processing;
• a view that facilitates the analysis of any type of information and returns information within the very short times required by economic intelligence.


Figure 4. Sourcing Service XEW

This unified view, associated with the target corpus, corresponds to a logical, structured, predefined representation of all of its collections in the form of a strategic data warehouse.
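As an illustration of this unified view, the sketch below maps a raw record from an arbitrary source to a minimal common descriptor that a document-oriented (NoSQL-style) store could keep as JSON; the field names are illustrative and do not claim to be XEW's actual metadata schema.

import json
from datetime import date

def to_unified_view(raw, source):
    """Map a raw record from any source to a minimal common descriptor.
    Field names are illustrative; the real XEW schema may differ."""
    return {
        "source": source,
        "title": raw.get("title") or raw.get("headline", ""),
        "authors": raw.get("authors", []),
        "date": raw.get("date", date.today().isoformat()),
        "keywords": raw.get("keywords", []),
        "body": raw.get("text") or raw.get("content", ""),
    }

# A document store can then keep these descriptors as JSON documents:
record = to_unified_view({"headline": "New competitor product", "content": "..."},
                         source="rss")
print(json.dumps(record, ensure_ascii=False))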

