Graph-based Arabic NLP Techniques: A Survey - Science Direct

Procedia Computer Science 142 (2018) 328–333 ... of the scientific committee of the 4th International Conference on Arabic Computational Linguistics.

The 4th International Conference on Arabic Computational Linguistics (ACLing 2018), November 17-19 2018, Dubai, United Arab Emirates

Graph-based Arabic NLP Techniques: A Survey

Wael Etaiwi* and Arafat Awajan

King Hussein Faculty of Computing Sciences, Princess Sumaya University for Technology, Amman, Jordan

Abstract

Improvements to natural language processing applications such as machine translation and text summarization are crucial, and can be achieved using many different techniques, including graphs, deep learning, word embedding and others. This survey investigates several research studies conducted in the field of Arabic natural language processing using graph representation. The literature on the use of graphs in Arabic natural language processing is limited and relatively new compared to the available literature on other languages, such as English. This paper summarizes the major techniques used in graph-based Arabic NLP and discusses the role of graph-based techniques in solving natural language processing problems.

© 2018 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/). Peer-review under responsibility of the scientific committee of the 4th International Conference on Arabic Computational Linguistics.

Keywords: Arabic, Natural Language Processing, Graph-based.

1. Introduction

Natural Language Processing (NLP) is one of the artificial intelligence domains concerned with designing and developing computer systems that are able to analyze, understand and synthesize natural human languages [1]. The huge amount of data published and distributed over networks increases the demand for new and efficient methods and applications to handle, analyze, summarize, and extract knowledge from this huge amount of text. Therefore, many different NLP applications have arisen in recent years, such as sentiment analysis [2], text summarization [3], named entity recognition [4], text classification and clustering, and others.

* Corresponding author. Tel.: +962795744288. E-mail address: [email protected]
1877-0509 © 2018 The Authors. Published by Elsevier B.V. 10.1016/j.procs.2018.10.488

NLP techniques for English are mature and deeply investigated. However, not much work has been done for Arabic NLP applications due to the complexity of the language in terms of both structure and morphology. The presence of templatic morphemes and concatenative morphemes is the main feature of the Arabic language that makes its NLP a challenging task [5]. Concatenative morphemes include stems, affixes, and clitics. Affixes include prefixes, suffixes, and circumfixes. Clitics, which represent another token such as a pronoun, conjunction or preposition, include proclitics and enclitics. The general structure of an Arabic word can be represented as follows: [Proclitic(s) + [Prefix(es)]] + stem + [Suffix(es) + [Enclitic]]. The most frequent words in Arabic, mainly the stop words, generally account for more than 40% of the words in texts; on the other hand, the majority of words in a text appear only once [5].

Many different techniques and theories are used in NLP, such as graph theory, fuzzy logic, statistical methods, machine learning, and others. In this survey, we focus on NLP techniques proposed for the Arabic language that use graph theory and algorithms as the main tool to achieve the application's objective. A graph, which can represent many types of data such as networks, webpages, social relations and text components, is a set of vertices (also called nodes) and connecting edges. Graphs are classified according to their edge properties into many different categories, such as weighted or unweighted, directed or undirected, and cyclic or acyclic graphs.
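As an illustration of the graph categories mentioned above, the following minimal Python sketch builds a directed, weighted graph from a token sequence and checks whether it is cyclic. The function names, the adjacency-count weighting, and the placeholder tokens are assumptions made for illustration; they are not taken from any of the surveyed studies.

```python
from collections import defaultdict


def build_adjacency_graph(tokens):
    """Directed, weighted graph: the weight of edge (u, v) counts how often v follows u."""
    graph = defaultdict(lambda: defaultdict(int))
    for current, following in zip(tokens, tokens[1:]):
        graph[current][following] += 1
    # Convert to plain dicts so that later lookups do not create new keys.
    return {u: dict(vs) for u, vs in graph.items()}


def has_cycle(graph):
    """Depth-first search for a back edge, i.e. a directed cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = defaultdict(int)

    def visit(u):
        color[u] = GRAY
        for v in graph.get(u, {}):
            if color[v] == GRAY:
                return True
            if color[v] == WHITE and visit(v):
                return True
        color[u] = BLACK
        return False

    return any(color[u] == WHITE and visit(u) for u in list(graph))


# Placeholder tokens (hypothetical); a real system would use preprocessed Arabic words.
tokens = "a b c a b".split()
graph = build_adjacency_graph(tokens)
```

Dropping the counts gives an unweighted graph, and adding each edge in both directions gives an undirected one, so the same structure covers the categories listed above.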
In this survey, we categorize graph-based techniques used to handle Arabic NLP problems based on the type of data handled and the techniques employed, according to the following criteria:

• Static or dynamic: a static graph does not change over time while the technique operates. In contrast, a few proposed methods use dynamic graphs that are changed and altered while the technique runs. Such changes may affect the graph structure, its size, or both.
• Core technique used: many different techniques use graph theory and methods to handle NLP problems. Aggregation, simplification and similarity are the most common techniques applied using graphs. In the aggregation technique, the graph is used to aggregate data and represent a large amount of data as a small number of graph components (vertices and edges); this technique is mainly used for compression purposes. The simplification technique is used to remove the less important components from the original text and keep the key components. Finally, similarity techniques are used to find the most appropriate (similar) component that matches a given text; this technique is mainly used in categorization, clustering and identification problems.
• Main objective: graphs with the same structure and format can be used to handle many different problems. These problems fall into two main groups: compression, in which the main purpose is to shrink the data size and minimize the total amount of data required to solve problems efficiently; and query efficiency, in which the main concern is to enhance the performance of solving problems.

2. Graph-based Arabic NLP techniques

Ranking techniques are used in graph-based algorithms to rank different text units such as words, sections or paragraphs, where each unit is considered as a node. Edges represent the lexical or semantic relations between two nodes (such as similarity).

A graph-based text summarization method for Arabic is proposed by Alami et al. [6]. The method uses the PageRank graph-based ranking algorithm to generate a salience score for each sentence. The vertices of the graph represent the sentences of the text, while edges represent the inter-connection between sentences, with similarity measured as a function of content overlap. The Maximal Marginal Relevance (MMR) method is used to eliminate redundant important vertices (sentences), and the best-scoring vertices (sentences) that are less similar to each other are selected to form the final summary. A preprocessing stage normalizes, tokenizes, removes stop words from, and stems the original text. The experimental results showed that the proposed method outperformed other methods (such as LexRank, TextRank and a clustering technique) in terms of precision, recall and F-measure.

An abstractive Arabic multi-document text summarization model is proposed by Alwan and Onsi [7]. The proposed model uses a textual graph to remove redundancy and generate a coherent summary. It converts the original text into a textual graph in which vertices represent the unique stems of every word in the original text (including stop words), and edges represent the adjacency relation between words (sequential flow adjacency). After building the textual graph, a graph traversal algorithm (Depth First Search) is applied to concatenate the related sentences from the multiple documents, and finally the lower-weighted phrases are removed. The proposed method consists of four main stages: 1) Preprocessing, to prepare the original text and remove text noise. 2) Building the textual graph by representing the multiple documents as a directed weighted graph. 3) Traversing the textual graph and applying structural rules to generate the summary sentences. 4) Refining the sentences that contain unwanted parts and adding them to the final summary. A dataset of 1651 documents collected from an online shopping website and Twitter.com was used in the experiments to evaluate the proposed model. The results were evaluated manually by human experts, who rated their degree of satisfaction with the output. The experiments showed promising results, with an 88% reduction ratio.

A new graph representation model is proposed by Hadni and Gouiouez [8], which represents each document in a separate graph; the representative graph encodes the relationships between the different named entities in the document.
The proposed approach consists of two main steps: 1) A learning phase, which represents text by mapping terms to their associated concepts. 2) A classification phase, in which the text is converted to a graph and compared to the predefined concept graph, which is built from the BabelNet knowledge resource (a multilingual lexicalized semantic and ontology resource) for Arabic text categorization. Furthermore, a new semantic similarity measure is proposed to compare the text graph with the concept graph in order to assign the text (document) to its appropriate class. The experimental results showed that the representation model using the Support Vector Machine (SVM) algorithm outperforms the Naïve Bayes algorithm with regard to Precision, Recall, and the F1 measure.

Graph approaches for topic identification of noisy Arabic texts are proposed by Abainia et al. [9]. Three graph-based implementations were proposed. The first is LIGA, which was first introduced by Tromp and Pechenizkiy [10] for language identification. The second and third are modified versions of LIGA based on tf-idf weights. An in-house dataset of noisy Arabic text collected from different discussion forums, called ANTSIX, was used to evaluate the proposed approach. The experimental results showed that the total accuracy of the proposed approach reached 98%.

Al-Taani and Al-Omour [11] used a shortest path algorithm to extract a summary of Arabic text. Each sentence in the text is represented by a vertex in the graph and ranked according to statistical features such as sentence length, sentence position, term frequency and title similarity. Cosine similarity is used to measure the similarity between sentences based on three basic units (stem, word, and n-gram), and PageRank is used to calculate the final score. The similarity measure is represented by an edge between vertices (sentences) in the final graph.
The summary is extracted by finding the shortest path between the first sentence in the original text and the last sentence (the first and last nodes in the graph). Accordingly, the nodes on the shortest path form the extracted summary. The experimental results showed that n-gram-based summarization achieved better results than stem-based and word-based summarization in terms of F-measure.

A graph-based approach for automatic document indexing is proposed by El Bazzi et al. [12]. In the proposed approach, each document is represented by a graph, and the TextRank algorithm [13] is used to score each node (word) according to its importance in the document. Furthermore, a term's weight is computed to estimate the relevance of the term to the document. Experiments conducted on 1084 news documents showed that the proposed approach is suitable for semantic and contextual indexation, and that it outperforms a statistical approach (TF-IDF) by 12% in terms of F-measure.

The TextRank approach is also adapted by El Bazzi et al. [14], where it is used to extract key phrases from Arabic text automatically. Each document is represented as a graph in which each term of the document is a vertex and term co-occurrence within a fixed window is represented by edges. For comparison purposes, experiments were conducted on TextRank and KPMiner (a system for Arabic key phrase extraction that uses TF-IDF for weighting terms). The results showed that TextRank outperforms KPMiner in terms of Precision, Recall and F-measure.

In order to overcome the challenge of the large vocabulary of continuous Arabic speech recognition systems, a new unsupervised graph-based method for improving Arabic speech recognition systems is proposed by Labidi et al. [15]. In the proposed method, an oriented weighted graph is constructed, where each node represents a word and each edge represents the relationship of succession between two words in the Arabic language. A graph search algorithm is then used to detect the false words in the transcription and to replace each of them with the best word. The experimental results showed that the proposed approach reduces the word error rate in the Arabic speech recognition task by 4.6%.

A hybrid approach for extractive Arabic text summarization is proposed by Alami et al. [16]. The approach is based on a two-dimensional undirected and weighted graph in which each sentence is represented by a vertex and every two vertices are connected by two edges: the first represents a statistical similarity measure built on the content overlap between the two sentences, while the second represents a semantic similarity measure based on semantic information extracted from the Arabic WordNet (AWN) ontology.
The PageRank algorithm is used to compute the final score of each sentence (vertex), in addition to other statistical features of the text such as TF-ISF and sentence position. The top-ranking sentences are selected for the final text summary after applying an adapted maximal marginal relevance (MMR) method to deal with redundancy and information diversity issues. Experiments were conducted on the Essex Arabic Summaries Corpus (EASC) in comparison with four other Arabic text summarization methods, and the results showed that the proposed method outperforms the others in terms of Precision, Recall and F1-measure.

Natural language processing can be used to extract data from databases; the main idea is to map the natural language to a query language that can access the database and retrieve data. Bais et al. [17] proposed a model that uses an Arabic NLP interface based on graph theory for databases, without the need for users to know the internal structure of the database. The proposed model uses graph theory methods to allow users to view the database as a single table. It consists of three components: 1) The Linguistic Component (LC). 2) The Natural Language Query Definition Component (NLQDC). 3) The Database Knowledge Component (DBKC). In the LC, the Arabic Natural Language Query (ANLQ) undergoes several analysis operations, producing an Intermediate XML Logical Query (IXLQ). This query corresponds to the logical interpretation of the input ANLQ. The NLQDC is used to reuse ANLQs that have already been processed, which reduces the waiting time required for the translation. Finally, in the last component (DBKC), the IXLQ is translated into an SQL query independently of the database domain. The database schema is represented as an undirected connected graph, and the Dijkstra algorithm is used to find the access path to the required data. The database graph consists of database relations represented as nodes, and database constraints (primary keys and foreign keys) represented as edges between relations (nodes).
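The access-path search described for [17] can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the relation names, the unit cost assigned to every key constraint, and the function names are all hypothetical, and only the general idea (Dijkstra's algorithm over an undirected graph whose nodes are relations and whose edges are key constraints) comes from the description above.

```python
import heapq


def undirected(edges):
    """Build an undirected schema graph; every key constraint gets cost 1 (an assumption)."""
    schema = {}
    for a, b in edges:
        schema.setdefault(a, {})[b] = 1
        schema.setdefault(b, {})[a] = 1
    return schema


def shortest_join_path(schema, source, target):
    """Dijkstra's algorithm: returns the list of relations from source to target, or None."""
    dist = {source: 0}
    prev = {}
    queue = [(0, source)]
    visited = set()
    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbor, cost in schema.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(queue, (nd, neighbor))
    if target not in dist:
        return None
    # Walk the predecessor links back from the target to recover the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical schema: each pair is a foreign-key link between two relations.
schema = undirected([
    ("student", "enrollment"),
    ("enrollment", "course"),
    ("course", "teacher"),
    ("teacher", "department"),
])
path = shortest_join_path(schema, "student", "course")
```

The returned list of relations is the sequence that would have to be joined to answer a query touching both end relations; translating that path into join conditions is a separate step that this sketch does not cover.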
The experimental results for this model showed that 92.4% of the access paths were generated correctly. Of the errors, 31.57% are produced by the Dijkstra algorithm, 26.31% arise in the translation of the access path into relation constraints (primary keys and foreign keys), and 12.10% are caused by other parts of the system, generally in the semantic analysis phase.

In order to access the knowledge stored in large semantic knowledge bases such as Aljazeera.net, a new framework is proposed by Al-kouz et al. [18]. The proposed framework, called the Arabic Semantic Graph Extraction Framework (ASGEF), is used to mine the explicit and implicit lexical semantic information embedded in Aljazeera.net, which is then used to build an Arabic semantic graph. After crawling the Aljazeera.net website, two parsers are used in the parsing stage: a File Parser and a Web Page Parser. The File Parser is capable of parsing the hierarchical directory structure, while the Web Page Parser is used to parse the HTML pages within the hierarchical directory structure to extract textual content. Finally, the target Arabic semantic graph is produced by a semantic graph builder based on Wikipedia and Wiktionary.
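The combination of PageRank-based sentence ranking and MMR redundancy removal described in this section for extractive summarization [6, 16] can be sketched as follows. This is a simplified illustration under stated assumptions, not a reproduction of either method: Jaccard overlap stands in for the content-overlap similarity, whitespace splitting stands in for the preprocessing stage, and the function names, damping factor and trade-off value are hypothetical choices.

```python
def overlap(a, b):
    """Jaccard overlap between two token lists (an assumed similarity function)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def pagerank(weights, damping=0.85, iterations=50):
    """Power iteration for weighted PageRank on a square weight matrix."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sum = [sum(row) for row in weights]
    for _ in range(iterations):
        new = []
        for i in range(n):
            # Each neighbor j passes on a share of its score proportional to w(j, i).
            incoming = sum(weights[j][i] / out_sum[j] * scores[j]
                           for j in range(n) if out_sum[j] > 0)
            new.append((1 - damping) / n + damping * incoming)
        scores = new
    return scores


def summarize(sentences, k=2, trade_off=0.7):
    """Pick k sentences: high PageRank score, low similarity to those already chosen."""
    tokens = [s.lower().split() for s in sentences]
    n = len(tokens)
    weights = [[overlap(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    scores = pagerank(weights)
    selected = []
    candidates = set(range(n))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((weights[i][j] for j in selected), default=0.0)
            return trade_off * scores[i] - (1 - trade_off) * redundancy
        best = max(sorted(candidates), key=mmr)
        selected.append(best)
        candidates.remove(best)
    # Return the chosen sentences in their original document order.
    return [sentences[i] for i in sorted(selected)]


# Placeholder English sentences (hypothetical); the surveyed methods operate on stemmed Arabic text.
sentences = [
    "graph methods rank sentences",
    "graph methods rank sentences well",
    "dijkstra finds shortest paths",
]
summary = summarize(sentences, k=2)
```

In this toy input the first two sentences are near-duplicates, so after one of them is selected the MMR penalty steers the second pick toward the unrelated third sentence, which is the redundancy-reduction behaviour the surveyed methods rely on.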


Table 1: Articles Summary

ID  Article  Static vs. Dynamic  Technique       Application             Objective         Directed  Weighted
1   [6]      Static              Similarity      Text Summarization      Compression       NO        NO
2   [8]      Static              Similarity      Text Categorization     Query efficiency  YES       YES
3   [9]      Static              Simplification  Text Categorization     Query efficiency  YES       YES
4   [7]      Static              Simplification  Text Summarization      Compression       YES       NO
5   [11]     Static              Similarity      Text Summarization      Compression       NO        NO
6   [12]     Static              Simplification  Document Indexing       Query efficiency  NO        NO
7   [14]     Static              Simplification  Key phrases Extraction  Query efficiency  NO        NO
8   [15]     Dynamic             Simplification  Speech Recognition      Query efficiency  YES       YES
9   [16]     Static              Simplification  Text Summarization      Query efficiency  NO        YES
10  [17]     Dynamic             Simplification  Querying Databases      Query efficiency  NO        YES
11  [18]     Dynamic             Aggregation     Semantic Graph          Query efficiency  N/A       N/A

3. Discussion

It is clear that graph-based methods for Arabic NLP have received growing attention in recent years. In this survey, we have focused our discussion on the applications and problems solved using graphs in Arabic NLP, and have taken special note of how graphs can be built and used in ways that help overcome the challenges of the Arabic language. Most graph-based Arabic NLP studies used static graphs rather than dynamic ones, which could be explained by the complexity of dealing with the Arabic language due to its structure and morphology. On the other hand, graphs are most commonly used to simplify Arabic NLP problems because of their ability to formalize huge and complex structures in a standard and formal way. Table 1 shows that the majority of the researchers use graph-based techniques for query efficiency purposes, and that these techniques also produce competitive results compared to other techniques in terms of Precision, Recall and F-measure.

4. Conclusion

In this survey we have presented the state of the art in graph-based methods used to handle Arabic NLP problems and applications, distinguishing between the types of graph used and the core techniques. We introduced the key details of each method and explored how graphs can be used to obtain better results in the Arabic NLP field. It is clear that research in Arabic NLP is still at an early stage and that more work is needed. The lack of Arabic NLP resources poses a strong challenge to improving Arabic NLP applications. Much further work could be done on using graphs in Arabic NLP applications, such as hybrid approaches that combine graphs with other techniques (e.g. statistical methods, neural networks, fuzzy logic) in order to get better results and overcome single-technique limitations. A possible future direction is to combine graph and deep learning techniques in order to enhance the performance of Arabic NLP applications.

References

[1] N. Ranjan, K. Mundada, K. Phaltane, and S. Ahmad, “A Survey on Techniques in NLP,” Int. J. Comput. Appl., vol. 134, no. 8, pp. 6–9, 2016.
[2] M. Biltawi, W. Etaiwi, S. Tedmori, A. Hudaib, and A. Awajan, “Sentiment classification techniques for Arabic language: A survey,” 2016 7th Int. Conf. Inf. Commun. Syst. ICICS 2016, pp. 339–346, 2016.
[3] M. Gambhir and V. Gupta, “Recent automatic text summarization techniques: a survey,” Artif. Intell. Rev., vol. 47, no. 1, pp. 1–66, Jan. 2017.
[4] K. Shaalan, “A survey of Arabic named entity recognition and classification,” Comput. Linguist., vol. 40, no. 2, pp. 469–510, Jun. 2014.
[5] A. Awajan, “Keyword Extraction from Arabic Documents Using Term Equivalence Classes,” ACM Trans Asian Low-Resour Lang Inf Process, vol. 14, no. 2, p. 7:1–7:18, Apr. 2015.
[6] N. Alami, M. Meknassi, S. A. Ouatik, and N. Ennahnahi, “Arabic text summarization based on graph theory,” in 2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA), 2015, pp. 1–8.
[7] M. A. Alwan and H. M. Onsi, “A Proposed Textual Graph Based Model for Arabic Multi-document Summarization,” Int. J. Adv. Comput. Sci. Appl. Ijacsa, vol. 7, no. 6, 2016.
[8] M. Hadni and M. Gouiouez, “Graph Based Representation for Arabic Text Categorization,” in Proceedings of the 2nd International Conference on Big Data, Cloud and Applications, New York, NY, USA, 2017, p. 75:1–75:7.
[9] K. Abainia, S. Ouamour, and H. Sayoud, “Topic Identification of Noisy Arabic Texts Using Graph Approaches,” in 2015 26th International Workshop on Database and Expert Systems Applications (DEXA), 2015, pp. 254–258.
[10] E. Tromp and M. Pechenizkiy, “Graph-based n-gram language identification on short texts,” in Proc. 20th Machine Learning conference of Belgium and The Netherlands, 2011.
[11] A. T. Al-Taani and M. M. Al-Omour, “An extractive graph-based Arabic text summarization approach,” in The International Arab Conference on Information Technology, Jordan, 2014.
[12] M. S. E. Bazzi, D. Mammass, T. Zaki, and A. Ennaji, “A graph based method for Arabic document indexing,” in 2016 7th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT), 2016, pp. 308–312.
[13] R. Mihalcea and P. Tarau, “TextRank: Bringing Order into Text,” Proc. 2004 Conf. Empir. Methods Nat. Lang. Process., 2004.
[14] M. S. E. Bazzi, D. Mammass, T. Zaki, and A. Ennaji, “A Graph-Based Ranking Model for Automatic Keyphrases Extraction from Arabic Documents,” in Advances in Data Mining. Applications and Theoretical Aspects, 2017, pp. 313–322.
[15] M. Labidi, M. Maraoui, and M. Zrigui, “Unsupervised Method for Improving Arabic Speech Recognition Systems,” Proc. 31st Pac. Asia Conf. Lang. Inf. Comput., pp. 161–168, 2017.
[16] N. Alami, Y. E. Adlouni, N. En-nahnahi, and M. Meknassi, “Using Statistical and Semantic Analysis for Arabic Text Summarization,” in International Conference on Information Technology and Communication Systems, 2017, pp. 35–50.
[17] H. Bais, M. Machkour, and L. Koutti, “An Arabic natural language interface for querying relational databases based on natural language processing and graph theory methods,” Int. J. Reason.-Based Intell. Syst., vol. 10, no. 2, pp. 155–165, Jan. 2018.
[18] A. Al-kouz, A. Awajan, M. Jeet, and A. Al-Zaqqa, “Extracting Arabic semantic graph from Aljazeera.net,” in 2013 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT), 2013, pp. 1–6.
