A New Automatic Approach for Understanding the Spontaneous Utterance in Human-Machine Dialogue Based on Automatic Text Categorization

Mohamed Lichouri, Amar Djeradi, Rachida Djeradi

Laboratory of Spoken Communication and Signal Processing
Faculty of Electronics and Computer Science (FEI), USTHB
BP 32, El Alia, Bab Ezzouar, 16111, Algiers, Algeria

[email protected]

ABSTRACT

In the present paper, we propose an implementation of an automatic utterance understanding system for human-machine communication. The architecture we adopted is based on a stochastic approach that treats the understanding of an utterance as a simple theme identification process. We therefore present a new theme identification method based on a document retrieval technique, namely text (document) classification [2]. The method was validated on a basic platform that provides information related to university schooling management (querying a student database), with textual input in French. It achieved a theme identification rate of 95% and an utterance understanding rate of about 91.66%.

One of the most significant challenges facing society today, in terms of information access, is the creation of intelligent human-machine interfaces that accept and support the natural style of human communication. The growing need for IT applications in all fields of life has created a need for effective, responsive, and natural interfaces that ease access to information and its use. This need will continue to rise as computer systems grow in complexity and the time available to users for accomplishing tasks and learning new technology shrinks.

Categories and Subject Descriptors I.2.7 [Natural Language Processing]: Language parsing and understanding; H.1.2 [User/Machine Systems]: Human information processing; H.3.3 [Information Search and Retrieval]: Query formulation

General Terms Algorithms, Design, Human Factors, Languages

Keywords Communication, Human-Machine dialog, Understanding, Utterance, Thematic, Text Classification

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. IPAC '15, November 23-25, 2015, Batna, Algeria. © 2015 ACM. ISBN 978-1-4503-3458-7/15/11 ... $15.00

DOI: http://dx.doi.org/10.1145/2816839.2816840

1. INTRODUCTION

In this regard, several systems have been realized, such as the TINA system [21], the PHILIPS system (train schedules) [1], the CACAO (computer-aided automatic understanding based on conceptual segments) environment [3], and a stock exchange system [7]. Others are still to be realized (e.g., patient care robots, smart TVs) in which the understanding of utterances plays an important role. However, these systems remain restricted to a single field of application, and given the increasing number of applications, an implementation of a generic system (multi-application, multilingual, and multi-modal) is required. The understanding of an utterance in an SDS (Spoken Dialog System) is of great importance for this generalization, because on one hand it determines the semantic content of the utterance, and on the other hand it is dependent on the application (more so than the speech recognition system or the dialogue manager). Several techniques have been suggested to model the problem of utterance understanding, following three approaches: the linguistic approach (based mainly on a language model and speech acts), the stochastic approach (focusing on statistical properties of utterances, such as frequency of occurrence), and the hybrid approach [3, 7, 12] (combining a language model with statistical properties). In addition to these approaches, considering the problem of understanding from a human point of view leads us to adopt the CTU (Cognitive-based Text Understanding) [4] paradigm. This approach draws on cognitive theories inspired by the processes of human production and perception, to bring machine understanding of utterances as close as possible to that of humans.

In this paper, we first describe the architecture of our utterance understanding system (for textual utterances), based on the understanding approach suggested in [2, 7, 12]. Then, we present a new approach for determining automatically the themes of our corpus. To fulfill this aim, we introduce a new method of vectorial representation of words, then use the k-means classifier for an unsupervised classification process. Finally, we come to the last stage of the understanding process, in which we classify requests using the Naïve Bayes classifier and generate the associated SQL commands, and we integrate the understanding module into a real human-machine utterance understanding platform. We evaluate this system on a schooling management application.

2. AUTOMATIC UNDERSTANDING OF UTTERANCES: THEMATIC APPROACH

One of the human cognitive properties is the tendency to understand an utterance without having to understand all its terms one by one, but rather to grasp the main idea or theme it describes. Let S be an utterance consisting of N words. The understanding of this utterance is then summarized in one step: the identification of its theme. In order to exploit this cognitive property, we keep the same architecture as [12]. However, we no longer consider concepts of terms, but rather categories of sentences. These categories represent the various themes discussed across all the utterances. It is worth mentioning that text categorization is widely used in the information retrieval field, in identifying the language of a document, and in filtering unwanted (SPAM) emails. In our case, we use it to extract the theme of each utterance (a text consisting of a single sentence).

Figure 1: Utterance Understanding System in Human-Machine Communication

The general architecture of an utterance understanding system, shown in Figure 1 [8][9], contains four analyzers divided into two phases.

2.1 Phase One: Theme Identification Process

This phase identifies the relevant theme of each utterance. To do so, we rely on three types of analyzers: a structural analyzer, a lexical analyzer, and a semantic analyzer.

2.1.1 Structural analyzer

It is in charge of extracting the sentences contained in our corpus of study [6]. The corpus was built from real samples of requests for information (queries, not dialogues) of the university schooling type (notes, degrees, and certificates). To extract these requests, we resort to sentence boundary disambiguation [17], using the "Punkt" module of NLTK (Natural Language Toolkit) [16]. This tool detects boundaries by considering capitalization and punctuation (see Figure 2) [8].

Figure 2: Architecture of the Punkt System

Punkt is designed to disambiguate periods and ellipses. It proceeds in two stages.

Step 1: Identification of abbreviations and ellipses, using a length criterion:
• If a collocation followed by a period is long enough, it is considered an abbreviation and receives the abbreviation annotation. Otherwise the candidate belongs to the class of non-abbreviations and receives the end-of-sentence annotation.
• If a period is followed by further periods over a certain length, the sequence is considered an ellipsis and receives the ellipsis annotation.

Step 2: Identification of the end of the sentence. Among the tokens already found, some will be considered sentence ends. This ambiguity is removed by considering the context:
• If the word following an abbreviation begins with an uppercase letter, the abbreviation is treated as an end of sentence and receives the end-of-sentence annotation. Otherwise it is not considered as such and keeps its initial annotation.
• The same reasoning applies to ellipses: in the first case the token receives the end-of-sentence annotation, otherwise it keeps its initial annotation.
• For the list of non-abbreviations (acronyms), such as initials (Mr.) and ordinal numbers (2.5), there are two possibilities:
– For an initial: if it is followed by a space and the next word begins with a capital letter (Mr. Mohamed LICHOURI), we must check the possible collocations. If this word belongs to the collocation list, the initial is not considered an end of sentence and receives the abbreviation annotation. On the other hand, if the word does not belong to the collocation list, the initial is considered an end of sentence and receives the end-of-sentence annotation.
– For ordinal numbers: if the number is followed by a space and the next word begins with a capital letter, it is considered an end of sentence and keeps the end-of-sentence annotation. If the next word does not begin with a capital, the number is not regarded as an end of sentence.
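To make the boundary rules above concrete, here is a deliberately simplified sketch of the period-plus-capital heuristic. This is not Punkt itself (Punkt learns its abbreviation list unsupervised from the corpus); the hard-coded abbreviation list and the function name are purely illustrative, and in practice one would use the trained Punkt tokenizer shipped with NLTK.

```python
import re

# Tiny hand-made abbreviation list for illustration only;
# Punkt induces this set from the data instead.
ABBREVIATIONS = {"mr.", "dr.", "etc.", "e.g.", "i.e."}

def split_sentences(text):
    """Split on periods followed by whitespace and an uppercase letter,
    unless the token ending in '.' is a known abbreviation or a
    single-letter initial (e.g. 'M.')."""
    sentences, start = [], 0
    for m in re.finditer(r"\.\s+(?=[A-Z])", text):
        token = text[start:m.start() + 1].split()[-1].lower()
        # Abbreviations and one-letter initials do not end a sentence.
        if token in ABBREVIATIONS or re.fullmatch(r"[a-z]\.", token):
            continue
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    sentences.append(text[start:].strip())
    return sentences
```

For example, `split_sentences("I saw Mr. Smith. He waved. We left.")` keeps "Mr." inside the first sentence while still splitting at the two real boundaries.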

In Figure 3, we show an example of using the Punkt system of NLTK on a three-sentence paragraph.

Figure 3: Example of usage for our study

2.1.2 Lexical analyzer

It receives as input the set of sentences S = {p1, p2, ..., pm}, where m is the number of sentences contained in the corpus, and returns the set of significant terms T = {t1, t2, ..., tn}, where n is the number of terms. Two stages are necessary. The first is to split (tokenize) each sentence into substrings using a regular expression and a whitespace criterion. This is followed by a filtering step [20] that removes words unnecessary to understanding, keeping only the words longer than three characters or those that do not belong to the stop list [5].

2.1.3 Semantic analyzer

During this analysis (see Figure 4), we recover the categories of the sentences. For this purpose, two stages are necessary.

Figure 4: Semantic Analysis (Thematic)

Sentence representation by LSA (Latent Semantic Analysis). LSA is a statistical model originally dedicated to document retrieval [10], but thanks to its performance it quickly spread to other fields. The objective of LSA [22] is to represent the ideas behind the terms, by considering the context in which every term appears. This model considers that two contexts (sentences) are similar if they contain similar terms. To achieve this goal, we build an occurrence matrix A weighted by TF-IDF (Term Frequency-Inverse Document Frequency) [13]. This matrix is constructed by accumulating the number of occurrences of each term ti in every sentence pj. We then apply a singular value decomposition (SVD) of A. The SVD [11] states that:

A(n×m) = U(n×r) × S(r×r) × V^T(r×m)    (1)

After performing this decomposition, we obtain three matrices U, S, and V. By construction, U is a term-concept similarity matrix, the diagonal elements of S represent the "strength" of every concept, and V^T is the sentence-category¹ similarity matrix (this last matrix is the one used in the processing that follows).

¹To differentiate between the concepts of words and the themes of sentences, we use in what follows the naming "category" [24] for sentences.

Dimension reduction. We then apply a dimensionality reduction, because the longer the corpus, the larger the number of concepts r becomes, while we only need the categories said to be strong. The principle consists in decreasing the number of categories kept, the criterion being the strength of the categories, already expressed by the matrix S. Keeping k categories amounts to keeping only the first k singular values of S, which is in turn equivalent to keeping the first k rows of V^T and the first k columns of U. We keep only the sentence representation vectors taken from V^T. These new vectors are advantageous because (i) two sentences sharing the same theme show a much higher similarity in the sentence-category space than in the term-sentence space, and (ii) two vectors that are close in the sentence-category space share the same theme and thus belong to the same category.

k-means classifier. Retrieving the categories then amounts to clustering the sentence representation vectors, for which we use the unsupervised k-means classifier [19, 18]. Since our application aims at offering information about university schooling management (notes, degrees, and certificates), this brings us to find three categories.
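The TF-IDF weighting and the truncation of Eq. (1) can be sketched with NumPy alone. This is our own minimal illustration, not the paper's implementation: the function names and the toy sentences are assumptions, and the resulting k-dimensional sentence vectors are exactly what would then be fed to a k-means clusterer.

```python
import numpy as np

def tfidf_matrix(sentences):
    """Build the term-by-sentence occurrence matrix A, weighted by
    TF-IDF (terms as rows, sentences as columns)."""
    vocab = sorted({t for s in sentences for t in s.split()})
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for t in s.split():
            A[index[t], j] += 1.0
    # idf = log(m / df), with m the number of sentences
    df = np.count_nonzero(A, axis=1)
    idf = np.log(len(sentences) / df)
    return A * idf[:, None], vocab

def lsa_sentence_vectors(A, k):
    """Decompose A = U S V^T and keep the k strongest categories:
    each column of the truncated V^T is one sentence's vector."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:k, :]  # shape (k, number of sentences)

# Toy schooling-style sentences (illustrative only).
sentences = ["note exam note", "degree diploma", "certificate schooling"]
A, vocab = tfidf_matrix(sentences)
vectors = lsa_sentence_vectors(A, k=2)
```

Each column of `vectors` lives in the sentence-category space described above; close columns should share a theme.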

2.2 Phase Two: Pragmatic Analysis

In this phase, we use the categories found during the previous phase to train the Naive Bayes classifier [14] on the set of documents D with the classes C (see Algorithm 1). We then find the best class c for a document d (see Algorithm 2). The probabilities involved are estimated as:

P(Cj) = (number of terms in category Cj) / (total number of terms)    (3)

P(Wk|Cj) = (N(Wk, Cj) + m·p) / (total number of terms + m)    (4)

with:
• m a constant.
• the a priori probability p of P(Wk|Cj) equal to 1/N(Wk, Cj).

The probability P(Wk|Cj) is thus obtained through an estimator called the m-estimator [15]. During the test phase, the user's query (sentence) is classified; an inference engine then produces an SQL query interpretable by the database.

Algorithm 1: Training of the multinomial Naive Bayes classifier

TrainMultinomialNB(C, D)
  Data: set of documents D with classes C
  Result: set of terms plus their prior and conditional probabilities
  V = ExtractVocabulary(D)
  N = CountDocs(D)
  for c in C do
    Nc = CountDocsInClass(D, c)
    prior[c] = Nc / N
    text_c = ConcatenateTextOfAllDocsInClass(D, c)
    for t in V do
      Tct = CountTokensOfTerm(text_c, t)
    end
    for t in V do
      condprob[t][c] = (Tct + m * prior[c]) / (Sum(Tct) + m)
    end
  end
  return V, prior, condprob
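The training and test procedures of Algorithms 1 and 2 can be sketched in a few lines of Python. This is a hedged illustration under our own naming (`train_multinomial_nb`, `apply_multinomial_nb`, and the toy schooling queries are assumptions, not the paper's code); it follows the pseudocode's m-estimate smoothing, which uses prior[c] as the prior estimate p.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, m=1.0):
    """docs: list of (text, class) pairs. Returns the vocabulary,
    the class priors, and m-estimate conditional probabilities,
    mirroring Algorithm 1."""
    classes = {c for _, c in docs}
    n = len(docs)
    prior, counts, totals, vocab = {}, {}, {}, set()
    for c in classes:
        in_class = [d for d, cc in docs if cc == c]
        prior[c] = len(in_class) / n
        tokens = " ".join(in_class).split()
        counts[c] = Counter(tokens)
        totals[c] = len(tokens)
        vocab.update(tokens)
    condprob = {
        t: {c: (counts[c][t] + m * prior[c]) / (totals[c] + m)
            for c in classes}
        for t in vocab
    }
    return vocab, prior, condprob

def apply_multinomial_nb(vocab, prior, condprob, doc):
    """Return the best class for doc, mirroring Algorithm 2:
    argmax over classes of log-prior plus summed log-conditionals."""
    tokens = [t for t in doc.split() if t in vocab]
    scores = {
        c: math.log(prior[c]) + sum(math.log(condprob[t][c]) for t in tokens)
        for c in prior
    }
    return max(scores, key=scores.get)

# Toy schooling-style training requests (illustrative only).
docs = [
    ("note exam result", "NOTE"),
    ("degree diploma transcript", "DEGREE"),
    ("certificate schooling attendance", "CERTIFICATE"),
]
vocab, prior, condprob = train_multinomial_nb(docs)
best = apply_multinomial_nb(vocab, prior, condprob, "exam note")
```

On this toy corpus, the query "exam note" is assigned to the NOTE category.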

3. RESULTS AND DISCUSSION

For the practical aspect, and to assess the quality of our understanding system, we put it in a real environment with text input and voice output, using the schooling management application. The validation of the results is ensured by the F1-measure [23], defined by:

F1 = (2 × precision × recall) / (precision + recall)    (5)
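As a small sanity check of Eq. (5), the F1-measure is just the harmonic mean of precision and recall; the example values below are arbitrary, not the paper's results.

```python
def f1_score(precision, recall):
    """F1-measure as in Eq. (5): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# e.g. precision 0.9 and recall 0.8 give F1 ≈ 0.847
value = f1_score(0.9, 0.8)
```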

The following figures show the corpus of the application and the obtained categories, produced both automatically and manually.

Figure 5: Sample of school query corpus

Figure 6: The list of manual categories

Figure 7: The list of automatic categories

Algorithm 2: Test of the multinomial Naive Bayes classifier

ApplyMultinomialNB(C, V, prior, condprob, d)
  Data: set of classes C, vocabulary V, and a document d
  Result: best assigned class c for d
  W = ExtractTokensFromDoc(V, d)
  for c in C do
    score[c] = log(prior[c])
    for t in W do
      score[c] += log(condprob[t][c])
    end
  end
  return argmax over c in C of score[c]

The choice of this classifier is justified by the fact that it does not require a large corpus during the learning phase, which suits us well, since we cannot collect such a corpus. Furthermore, the naive character of this classifier lies in its assumption that terms are independent (which is not always true). Under this assumption, the probability of a sentence is equal to the product of the probabilities of the different words constituting it. In brief, this classifier assigns a sequence of words (a sentence) to a class C according to the following law:

Ci = argmax over Cj of P(Cj) × ∏k P(Wk|Cj)    (2)

The rate of identification of categories obtained on the school management application, using the LSA model on a TF-IDF matrix, is of the order of 95% on a corpus of 21 requests (142 words), while the understanding rate reached on 60 requests (340 words) is of the order of 91.66%, which is much better than the rate obtained with the keyword search approach, of the order of 68.33%. To make our system more complete, we added another modality besides the written text: a speech synthesizer at the output.


4. CONCLUSION

The present work showed the possibility of building a general architecture for an utterance understanding system. Moreover, we observed that the thematic approach is more advantageous than the conceptual one [12]. We therefore believe that the use of an interpretation module will further improve the quality of the dialogue. Finally, we will investigate how to integrate our system into a multi-modal and multilingual environment.

5. REFERENCES

[1] H. Aust, M. Oerder, F. Seide, and V. Steinbiss. The Philips automatic train timetable information system. Speech Communication, 17(3):249-262, 1995.
[2] A. Bawakid and M. Oussalah. A semantic-based text classification system. In Cybernetic Intelligent Systems (CIS), 2010 IEEE 9th International Conference on, pages 1-6, Sept. 2010.
[3] C. Bousquet-Vernhettes. Un environnement de compréhension de la parole pour les serveurs interactifs: l'environnement CACAO. Rencontres Jeunes Chercheurs en Interaction Homme Machine (RJC-IHM'2000), pages 77-80, 2000.
[4] CTU. The 3rd International Workshop on Cognitive-based Text Understanding and Web Wisdom. http://iic.shu.edu.cn/huiyi/CTUW3rdCFP.html, 19/09/2011. [Online; accessed 13/12/2014].
[5] B. Huber. Stoplist. http://textalyser.net/stoplist.html, 2004. [Online; accessed 13/11/2014].
[6] N. Indurkhya and F. J. Damerau. Handbook of Natural Language Processing, volume 2. CRC Press, 2010.
[7] S. Jamoussi, K. Smaïli, and J.-P. Haton. Understanding speech based on a Bayesian concept extraction method. In Text, Speech and Dialogue, pages 181-188. Springer, 2003.
[8] T. Kiss and J. Strunk. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4):485-525, 2006.
[9] A. Korzybski and D. Kohn. Une carte n'est pas un territoire: Prolégomènes aux systèmes non-aristotéliciens et à la sémantique générale. Éditions de l'éclat, 2001.
[10] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211, 1997.
[11] J. Leskovec. Dimensionality reduction: PCA, SVD, MDS, ICA, and friends. Machine Learning recitation, April 27, 2006.
[12] M. Lichouri, A. Djeradi, and R. Djeradi. Une approche statistico-linguistique pour l'extraction de concepts sémantiques: une première étape vers un système générique de dialogue homme-machine. Personal communication, 2015.
[13] S. Loria. Tutorial: Finding important words in text using TF-IDF. http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/, 01/09/2013. [Online; accessed 12/09/2014].
[14] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval, volume 1. Cambridge University Press, 2008. [http://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf].
[15] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3):103-134, 2000.
[16] NLTK. Natural Language Toolkit. http://www.nltk.org. [Online; accessed 13/03/2015].
[17] D. D. Palmer and M. A. Hearst. Adaptive multilingual sentence boundary disambiguation. Computational Linguistics, 23(2):241-267, 1997.
[18] P. Preux. Segmentation par les k-moyennes. http://www.grappa.univ-lille3.fr/~ppreux/ensg/miashs/fouilleDeDonneesII/tp/k-moyennes/index.html, 08/04/2010. [Online; accessed 12/09/2014].
[19] quetedesavoir. Classification topologique non supervisée pour des données catégorielles. http://quetedesavoir.blogspot.com/2011/06/classification-topologique-non_20.html, 20/06/2011. [Online; accessed 12/10/2014].
[20] L. A. Ramshaw and M. P. Marcus. Text chunking using transformation-based learning. arXiv preprint cmp-lg/9505040, 1995. [http://www.aclweb.org/anthology-new/W/W95/W95-0107.pdf].
[21] S. Seneff. TINA: A natural language system for spoken language applications. Computational Linguistics, 18(1):61-86, 1992.
[22] J. Steinberger and K. Jezek. Using latent semantic analysis in text summarization and summary evaluation. In Proc. ISIM '04, pages 93-100, 2004.
[23] C. Van Rijsbergen. Information Retrieval. Dept. of Computer Science, University of Glasgow, 1979.
[24] Y. Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1-2):69-90, 1999.
