
Recent Advances in Information Science

Arabic Text Representation using Rich Semantic Graph: A Case Study

SALLY S. ISMAIL [email protected]

IBRAHIM F. MOAWAD [email protected]

MOSTAFA AREF [email protected]

Computer Science Department Information System Department Computer Science Department Faculty of Computer and Information Sciences, Ain Shams University, Abbassia, Cairo, Egypt

Abstract: Representing Arabic text semantically using a Rich Semantic Graph (RSG) is one of the recent techniques that facilitate manipulating the Arabic language in the Natural Language Processing (NLP) field. The work presented in this paper is part of an ongoing research effort to create an abstractive summary of a single input document in the Arabic language. The abstractive summary is generated through three modules: converting the input Arabic text into a Rich Semantic Graph, performing graph reduction, and finally generating the summarized text from the reduced graph, all with the aid of a domain Ontology. In this paper, we present the first module together with a detailed case study verifying our work.

Key-Words: - Arabic Language, Text Representation, Rich Semantic Graph (RSG), Natural Language Processing (NLP), Text Summarization, Ontology.

1 Introduction
Text summarization has quickly grown into a major research area due to the high growth rate in the amount of accessible information, which makes it impossible to read all the relevant content. A possible solution is condensing data and obtaining the kernel of a text to explore, analyze, and discover knowledge from documents. Text summarization techniques can be classified according to the number of input documents (single-document versus multi-document), the document type (textual versus multimedia), or the output type (extractive versus abstractive).

Research in Arabic text summarization has not yet received as much attention as that in other languages. Most Arabic NLP work has focused on manipulating and processing the structure of the language at the morphological, lexical, and syntactic levels, and only little consideration has gone to the semantic level. Moreover, most studies consider the language's derivational and inflectional aspects a disadvantage rather than an advantage reflecting its richness. Thus, the specific characteristics of the Arabic language were barely utilized; instead, approaches that were successful with other languages, such as English and French, were applied to it.

Some of the aspects that slow down progress in Arabic NLP include its complex morphology, the absence of diacritics in written text, and the fact that Arabic does not use capitalization. In addition, there is a shortage of Arabic corpora, lexicons, Ontologies, and machine-readable dictionaries, tools that are essential to advance research in different areas. A domain-specific Ontology, for example, is essential for many NLP applications such as Information Retrieval, Question Answering, Natural Language Generation, Machine Translation, and Text Summarization. Thus, utilizing an Ontology to represent text as a semantic graph can facilitate extracting the semantics behind Arabic text.

Our work in this paper is part of an ongoing research effort on an Ontology-based approach for Arabic text summarization. In earlier work, we proposed a model in which a single document's text is represented as a Rich Semantic Graph, the RSG is then reduced, and finally a sentence generation module is invoked to produce an abstractive summary [1,2]. All of this is done with the aid of an Arabic domain Ontology, which helps in describing concepts, relations, and generalizations. In this paper, we verify the first module of the suggested model, which is the Text

ISBN: 978-960-474-344-5


to RSG representation module, presented from a new perspective using a new case study. The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 gives an overview of the design of the proposed model as well as the phases of the first module; Section 4 illustrates a new test case demonstrating the phases of the first module; and Section 5 concludes the paper and outlines future work.

2. Related Work
There are a number of research activities related to our topic. The majority of the early work stays at the parsing level. Nevertheless, Leskovec et al. [3-5] presented in 2004 a novel approach to abstractive document summarization by generating a semantic representation of the document. It introduces an intermediate, more generic layer of text representation within which the structure and content of the document and summary are captured. The intermediate semantic graph representation opens up new areas of exploration. A machine learning technique is then applied to extract a sub-graph that corresponds to the semantic structure of a human-extracted document summary, using features that capture semantic structure (i.e., concepts and relations), in contrast to previous attempts in which the linguistic features are of finer granularity (i.e., keywords and noun phrases).

Although this direction of research had been outlined earlier by Sparck Jones [6] as a still unexplored avenue, it was not explored until recently. One of the most recent methods, conducted in 2009, extracts summary sentences based on the document's semantic graph representation. This method generates a semantic representation of the input document and applies a machine learning technique to extract full sentences suitable for creating summaries. It starts with a deep syntactic analysis of the whole text, then extracts logical form triples (subject-predicate-object) for each sentence. After that, it applies cross-sentence pronoun resolution, co-reference resolution, and semantic normalization to refine the set of triples and merge them into a semantic graph. This procedure is applied to both the document and the corresponding summary extracts. Finally, a linear support vector machine is trained on the logical form triples to learn how to extract triples that belong to sentences in document summaries; the classifier is then used for the automatic creation of summaries of test documents [7,8].

Another approach incorporates object-orientation techniques in knowledge representation. It supports data abstraction and hence increases the modularity of natural language processing applications. It organizes knowledge into classes of objects (sub- and super-classes), which is important in knowledge representation to avoid redundant declarations or specifications [9].

On the other hand, semantic graph representation in its traditional form is incomplete due to its limited support for explicit operational or procedural knowledge, so more structure needs to be assigned to nodes as well as to links. Object-oriented modeling, in turn, reflects the structure of the data and the behavior of software objects, not the world's concepts and their structure; traditional object-oriented techniques alone are therefore not enough for good knowledge representation, and it is difficult to build a large, coherent, and complete representation with them. Finally, granularity is an important issue in knowledge representation: it concerns the level of detail that should be represented and which primitives (fundamental concepts) need to be constructed.

Moawad et al. [10-12] adopted the idea of graph representation, but used an Ontology rather than machine learning techniques to perform the abstraction. They designed a system model for abstractive summarization of English text and implemented a prototype that transforms an input document into an Ontology-based RSG. The Ontology primitives (concepts) are the language's nouns and verbs only; the Ontology preserves word hierarchies together with their semantic constraints and the ontological relations among them. The output graph is therefore neither complex nor huge, but rich, because each concept has its own linguistic and semantic attributes and relations that can be deduced from the analyzed input text. By this, they overcame the shortcomings of both the traditional and the object-oriented techniques. Our model design follows the same track, but for the Arabic language.

As for work related to Arabic text summarization, some efforts have been made, such as the Optimized Dual Classification System [13], an Arabic extractive text summarization system. Based on Rhetorical Structure Theory (RST), AlSanie [14] developed his Arabic text summarization system; Ikhtasir [15] is also an integrated RST-based system with a certain scoring scheme. On the other hand, there are systems that summarize multi-lingual sets of documents, such as Lakhas [16] and CLASSY (Clustering, Linguistics, And Statistics for Summarization Yield) [17].

3. Arabic Text to RSG Module
The Text to RSG representation module is part of the proposed Arabic text summarization model [1,2]. The proposed model follows the track of Moawad et al.'s design [10-12] in utilizing a domain Ontology to create an abstractive summary of a single Arabic input document. This approach is a novel one for the Arabic language, since none of the aforementioned Arabic text summarization systems utilizes an Ontology to perform abstractive summarization. In previous research [1,2], we proposed the model design depicted in Figure 1. It consists of three modules: representing the document as an RSG, reducing the graph, and finally applying a text generation technique to the semantic graph to generate the abstractive summary.

Fig 1. Overview on the proposed design

The main objective of the first module (Text to RSG) is to represent the input document semantically using a rich semantic graph, where the words and concepts of the input document are represented as instance objects of the corresponding word classes in the graph, with edges corresponding to the semantic and topological relations between them. It is called a Rich Semantic Graph because it is able to capture the meaning of words, sentences, and paragraphs by enriching each node with its attributes, which include the value of the node, its type (noun or verb), descriptors, and attachments. For example, if the concept in a node is a verb, then its attributes include its subject(s), object(s), tense, place, etc.

The Text to RSG module is accomplished through 5 phases:
1. Pre-processing.
2. Word Sense Instantiation.
3. Concept Validation.
4. Sentence Ranking.
5. RSG Generation.
In a previous research paper [2], the details of each phase were discussed, and a case study was used for illustration. In this paper, we verify the same module from another perspective through a new case study.
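As a concrete illustration of the rich node structure described above, the nodes and edges can be sketched as plain data records. This is a minimal sketch of our own: the class and field names are illustrative assumptions, not the authors' actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RSGNode:
    """One instance object in the Rich Semantic Graph."""
    value: str                                     # the word/concept itself
    type: str                                      # "noun" or "verb"
    descriptors: List[str] = field(default_factory=list)
    attachments: List[str] = field(default_factory=list)

@dataclass
class VerbNode(RSGNode):
    """Verb concepts additionally carry subject(s), object(s), tense, place, etc."""
    subjects: List[str] = field(default_factory=list)
    objects: List[str] = field(default_factory=list)
    tense: Optional[str] = None
    place: Optional[str] = None

@dataclass
class RSGEdge:
    """A semantic or topological relation between two instance objects."""
    source: RSGNode
    target: RSGNode
    relation: str
```

A graph then is simply a collection of such node instances connected by relation edges, with each node carrying its own linguistic and semantic attributes.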


4. Case Study
The details of the case study presented in this paper are shown in Figure 2. It consists of 3 paragraphs, divided into 6 main sentences, which are subdivided into 9 sub-sentences; the total number of words, including stop words, is 54.


Fig. 2. Case Study

For illustration, we chose sub-sentence 1.3 to show how the 5 phases are applied.


Table 1. Results of Preprocessing, W.S.I and Concept Validation

After removing stop words, the sentence contains 9 words. In the Preprocessing phase, concepts are POS-tagged to determine their syntactic type, and named entity recognition is applied. Stemming is also performed on the concepts to give more accurate results when consulting the Ontology. Finally, pronominal and co-reference resolution are carried out to help identify the relations between the different concepts. In the Word Sense Instantiation (W.S.I.) phase, each concept tagged as a verb or noun is considered, and its senses are extracted from the Ontology. For the examined sentence, the originally fetched senses would have yielded 2,079,710,640 combinations of RS Sub-Gs (obtained by multiplying the numbers of senses of all the concepts in the sentence). This number is reduced to 234,594,360 by reviewing the words' tags and considering only senses of the matching type (i.e., only noun senses for a concept tagged as a noun, and similarly for verbs). The remaining senses are validated syntactically and semantically in the Concept Validation phase. Some senses are rejected syntactically because the form of the word fetched from the Ontology differs from that in the original context. Other senses are rejected semantically due to the lack of accompanying words, which we call Attributes, and a few are rejected because they belong to a domain other than that of the context. The number of combinations is thus reduced to give only 180 RS Sub-Gs after this phase. Table 1 shows the number of senses for each concept in the aforementioned sentence and the effect of the Preprocessing, W.S.I., and Concept Validation phases on reducing this number as well as the total number of RS Sub-Gs. Afterwards, the Sentence Ranking phase is invoked on both the concept and sentence levels. On the concept level, for each concept (Ci), we apply Equation 1 to every sense to give it a Sense Rank value (SRi,j) from 1 to 10, where n is the number of senses fetched for the concept at hand and j is the order of the fetched sense.


For the nine concepts of the examined sentence, the numbers of senses fetched from the Ontology were 7, 3, 9, 29, 18, 4, 17, 31, and 10, whose product gives the 2,079,710,640 RS Sub-G combinations mentioned above. W.S.I. tag filtering (rejecting, for instance, adjective and adverb senses of noun-tagged words) reduces the combinations to 234,594,360, and Concept Validation (rejecting senses for syntactic form mismatches, missing Attributes, and out-of-domain senses) leaves only 180 RS Sub-Gs. (The per-concept breakdown of rejected senses, originally given in Table 1, did not survive extraction.)
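The tag-matching reduction performed in the W.S.I. phase amounts to a simple filter followed by a product over the per-concept sense counts. The sketch below uses made-up sense lists, not the case study's actual Ontology entries:

```python
from math import prod

def matching_senses(senses, tag):
    """W.S.I.: keep only fetched senses whose part of speech matches the word's tag."""
    return [s for s in senses if s["pos"] == tag]

# Two hypothetical concepts: a noun with 9 fetched senses (5 noun, 4 verb)
# and a verb with 4 fetched senses (3 verb, 1 noun).
concepts = [
    {"tag": "N", "senses": [{"pos": "N"}] * 5 + [{"pos": "V"}] * 4},
    {"tag": "V", "senses": [{"pos": "V"}] * 3 + [{"pos": "N"}] * 1},
]

# Combinations before tag filtering: 9 * 4 = 36.
before = prod(len(c["senses"]) for c in concepts)
# Combinations after tag filtering: 5 * 3 = 15.
after = prod(len(matching_senses(c["senses"], c["tag"])) for c in concepts)
```

The same product over all nine concepts of sub-sentence 1.3 yields the reduction from 2,079,710,640 to 234,594,360 combinations reported above.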

We assume that the higher the position of a sense in the fetched senses list, the more frequently it is used; accordingly, the higher the value of SRi,j, the higher the ranking and relevance of the sense:

SRi,j = (10 × j) / n ----- Eq. 1

We then chose a threshold value of 6 for the sense ranks: all senses with a Sense Rank below this value are excluded automatically. The number of thresholded senses for each concept (nT) is then used to compute the reduced number of combinations of RS Sub-Gs, which for the considered sentence reached only 18 combinations.

Subsequently, on the sentence level, the Average Sentence Rank (ASRk) of each sentence combination is calculated using Equation 2:

ASRk = (1/N) Σi SRi,j ----- Eq. 2

where SRi,j is the Sense Rank of the sense (j) chosen for concept (i) in the combination, and N is the total number of concepts in the sentence. This gives a rank to each of the suggested RS Sub-Gs, and we again shrink the number of combinations by applying a threshold on ASRk. In our case study, we chose the threshold value 9.5, excluding any RS Sub-G whose rank falls below it. Table 2 illustrates the details of the Sentence Ranking phase, where the number of RS Sub-Gs is shrunk again to only 7.
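Under the linear reading of Eq. 1, the two-level ranking and thresholding can be sketched as follows; the helper names are ours, and the nT split at the end is a hypothetical illustration rather than the case study's actual per-concept values:

```python
from math import prod

def sense_rank(n: int, j: int) -> float:
    """Eq. 1: rank the j-th of n fetched senses on a scale up to 10
    (later positions in the fetched list are assumed more frequent)."""
    return 10.0 * j / n

def average_sentence_rank(ranks) -> float:
    """Eq. 2: ASR_k is the mean Sense Rank over the sentence's N concepts."""
    return sum(ranks) / len(ranks)

def kept_senses(n: int, threshold: float = 6.0):
    """Concept-level threshold: keep only senses ranking at least `threshold`."""
    return [j for j in range(1, n + 1) if sense_rank(n, j) >= threshold]

# A concept with 6 validated senses keeps the ranks 6.66, 8.33, and 10 (nT = 3).
n_t = len(kept_senses(6))

# The surviving combinations are the product of the nT values across the
# sentence's concepts; this split is hypothetical, but the paper reports
# 18 combinations for sub-sentence 1.3.
combinations = prod([1, 1, 3, 1, 1, 1, 2, 3, 1])
```

A combination of eight top-ranked senses and one second-best sense of a six-sense concept, for instance, averages to an ASR of about 9.81, matching the values reported in Table 3.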


Table 2. Results of Sentence Ranking Phase

(The table lists, for each concept Ci of the examined sentence, the thresholded senses Si,j with their Sense Ranks SRi,j and counts nT, together with the resulting combinations and their Average Sentence Ranks ASRk; its per-cell contents did not survive extraction.)

Finally, in the RSG Generation phase, we work on the paragraph level: all validated sub-sentences are related together, and the RS Sub-Gs of the 9 sub-sentences are connected. Table 3 shows all combinations of the validated senses for the examined sentence. Some combinations are rejected at this point due to inconsistent relations between words, which are figured out through the Ontology. The RSG Generation phase thus re-reduces the number of RS Sub-Gs to a final count of 4. For a sample of how the RSG looks, please refer to our work in [2].


Table 3. RSG Generation Phase

RSG#  SRi,j of the nine concepts               Status     ASRk
1     (10, 10, 10, 10, 10, 10, 10, 10, 10)     accepted   10
2     (10, 10, 10, 10, 10, 10, 10, 8.33, 10)   accepted   9.81
3     (10, 10, 10, 10, 10, 10, 10, 10, 8)      rejected   9.78
4     (10, 10, 6.66, 10, 10, 10, 10, 10, 10)   rejected   9.63
5     (10, 10, 10, 10, 10, 10, 10, 6.66, 10)   accepted   9.63
6     (10, 10, 10, 10, 10, 10, 10, 8.33, 8)    rejected   9.59
7     (10, 10, 10, 10, 10, 10, 10, 10, 6)      accepted   9.56
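The ASRk column of Table 3 follows directly from Eq. 2, the mean of the nine concept sense ranks (the accept/reject status, by contrast, comes from the Ontology's relation-consistency check, not from the rank). A quick check of a few of the reported values:

```python
# Sense-rank tuples for a few rows of Table 3 (keyed by RSG#).
rows = {
    1: (10, 10, 10, 10, 10, 10, 10, 10, 10),
    2: (10, 10, 10, 10, 10, 10, 10, 8.33, 10),
    3: (10, 10, 10, 10, 10, 10, 10, 10, 8),
    7: (10, 10, 10, 10, 10, 10, 10, 10, 6),
}

# Eq. 2: ASR_k is the average of the concept sense ranks, rounded as in the table.
asr = {k: round(sum(v) / len(v), 2) for k, v in rows.items()}
```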


Table 4 sums up the effect of all the phases on the number of RSGs for the entire case study.

Table 4. Effect of each phase on the number of RSGs

Sentence #   Original # of RS Sub-Gs   W.S.I. and Concept Validation   Sentence Ranking   RSG Generation
1.1.1        1,820                     10                              2                  2
1.1.2        907,200                   120                             2                  2
1.2          61,286,400                480                             5                  3
1.3          2,079,710,640             180                             7                  4
2.1.1        15,120                    36                              1                  1
2.1.2        10                        8                               1                  1
2.1.3        4                         1                               1                  1
2.2          6,174                     66                              3                  3
3.1          19,600                    20                              2                  1
Total        2,141,946,968             921                             24                 18

5. Conclusion
In this paper, we have presented the Text to RSG representation module, with a detailed case study, for Arabic text summarization. It is one of three modules in an ongoing research effort on abstractive Arabic text summarization using a domain Ontology. The approach is a novel one for Arabic text summarization, since all previous work on Arabic text performs extractive summarization, and none of it utilizes an Ontology or an RSG. In our ongoing research, we are going to develop the other two modules and implement a prototype of the proposed design, making use of an existing Arabic domain Ontology.

References
[1] Sally Saad, Ibrahim Moawad, and Mostafa Aref, "Ontology-Based Approach for Arabic Text Summarization", in Proceedings of the International Conference on Intelligent Computing and Information Systems (ICICIS), pp. 18-22, Egypt, July 2011.
[2] Sally S. Ismail, Mostafa Aref, and Ibrahim F. Moawad, "Rich Semantic Graph: A New Semantic Text Representation Approach for Arabic Language", in Proceedings of the 7th WSEAS European Computing Conference (ECC '13), Dubrovnik, Croatia, June 2013. http://www.wseas.org/wseas/cms.action?id=4087
[3] Jure Leskovec, Marko Grobelnik, and Natasa Milic-Frayling, "Learning Sub-structures of Document Semantic Graphs for Document Summarization", in KDD 2004 Workshop on Link Analysis, 2004.
[4] Jure Leskovec, Marko Grobelnik, and Natasa Milic-Frayling, "Extracting Summary Sentences Based on the Document Semantic Graph", Microsoft Research, Technical Report No. MSR-TR-2005-07, January 2005.
[5] Jure Leskovec, Marko Grobelnik, and Natasa Milic-Frayling, "Impact of Linguistic Analysis on the Semantic Graph Coverage and Learning of Document Extracts", in Proceedings of the 20th National Conference on Artificial Intelligence (AAAI), vol. 3, Pittsburgh, Pennsylvania, 2005.
[6] Karen Sparck Jones, "Automatic summarizing: factors and directions", in Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization, MIT Press, 1999.
[7] D. Rusu, B. Fortuna, M. Grobelnik, and D. Mladenić, "Semantic Graphs Derived from Triplets with Application in Document Summarization", an international journal of Computing and Informatics, Volume 33, Number 3, 2009.
[8] Lorand Dali, Delia Rusu, Blaž Fortuna, Dunja Mladenić, and Marko Grobelnik, "Question Answering Based on Semantic Graphs", Semantic Search 2009 Workshop, 18th International World Wide Web Conference (WWW2009), Madrid, Spain, April 21, 2009.
[9] Mostafa Aref, "Object Orientation in Natural Language Processing", in Proceedings of the 13th International Conference on IEA/AIE 2000, New Orleans, pp. 591-600, June 19-22, 2000.
[10] Ibrahim F. Moawad and Mostafa Aref, "A Semantic Graph Reduction Approach for Abstractive Text Summarization", the Ninth Conference on Language Engineering, Cairo, Egypt, December 6-7, 2009.
[11] Mostafa Mahmoud Aref, Ibrahim Fathy Moawad, and Soha Said Ibrahim, "Rich Semantic Graph Generation System Prototype", the Tenth Conference on Language Engineering (ESOLEC), the Egyptian Society of Language Engineering, December 2010.
[12] Ibrahim F. Moawad, Mostafa Aref, and Soha Said, "Ontology-Based Model for Generating Text Semantic Representation", International Journal on Intelligent Computing and Information Sciences (IJCICIS), Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt, 2010.
[13] I. Sobh, N. Darwish, and M. Fayek, "An Optimized Dual Classification System for Arabic Extractive Generic Text Summarization", 2006. Available at: http://www.rdi.eg.com/rdi/technologies/papers.htm
[14] W. AlSanie, "Towards an Infrastructure for Arabic Text Summarization Using Rhetorical Structure Theory", M.Sc. Thesis, Dept. of Computer Science, King Saud University, Riyadh, Saudi Arabia, 2005.
[15] A. Azmi and S. Al-thanyyan, "Ikhtasir --- A user selected compression ratio Arabic text summarization system", in Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE 2009), pp. 1-7, 2009.
[16] F.S. Douzidia and G. Lapalme, "Lakhas, an Arabic Summarization System", in Proceedings of the 2004 Document Understanding Conference (DUC2004), Boston, MA, 2004.
[17] J.M. Conroy, D.P. O'Leary, and J.D. Schlesinger, "CLASSY Arabic and English Multi-Document Summarization", in Multi-Lingual Summarization Evaluation, 2006.
