
ScienceDirect Procedia Computer Science 103 (2017) 489 – 494

XIIth International Symposium «Intelligent Systems», INTELS’16, 5-7 October 2016, Moscow, Russia

Improving the quality of bionic resource retrieval by visualizing a specific bionic-oriented thesaurus

A. Sigov, V. Baranyuk, V. Nechaev, A. Melikhov*, O. Smirnova

Federal State Budget Education Institution of Higher Education «Moscow Technological University» (MIREA), 78, Vernadsky Avenue, Moscow, 119454, Russia

Abstract

This article describes how the quality of information resource retrieval can be improved through the use of a specialized thesaurus. It considers the possibilities of using a graphical representation of a thesaurus in the field of bionics and bionic technology, as well as the approach used for the automated formation of a primary multilingual thesaurus.

© 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the scientific committee of the XIIth International Symposium «Intelligent Systems».

Keywords: intellectual system; bionics; bionic technologies; knowledge engineering; information retrieval thesaurus

1. Introduction

Bionics can be described as the application of biological methods and systems found in nature to the study and design of engineering systems. Bionics regards biology and engineering from the viewpoint of their commonalities and differences. Among English speakers, the term biomimetics is also used within the framework of this discipline. This term originates from two ancient Greek words, βίος (life) and μίμησις (mimicry), and describes an engineering approach based on revealing and applying designs borrowed from living nature [1]. The development of cybernetics, which is the basic framework for studying and formalizing the basic principles of control and interconnection in living creatures and machines, led to the need for a broader study of the inner structure and functions of biological systems, directed towards their practical implementation in the design of brand-new devices, mechanisms, materials and so on.

* Corresponding author. Tel.: +7-499-215-6565; fax: +7-495-434-9287. E-mail address: [email protected]

1877-0509 © 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the scientific committee of the XIIth International Symposium “Intelligent Systems” doi:10.1016/j.procs.2017.01.032


The primary directions of study in bionics are commonly given as:
• the study of the human and animal nervous system and the modelling of individual brain cells (neurons) and neural networks, intended for the further development of computational systems and state-of-the-art elements and devices for automation and telemechanics (neurobionics);
• the study of the sensory organs, aimed at the research and development of new sensors and detection systems;
• the study of the basic principles of animal spatial orientation, for implementing these principles in engineering;
• the study of the morphological, physiological and biochemical characteristics of living organisms, for the promotion of new technical and scientific ideas.

Bionics helps to create original technical systems based on ideas borrowed from nature. It is closely related to biology, physics and chemistry, and provides a basis for new inventions in various engineering sciences, such as electronics, navigation, communication, seamanship and others. In fact, bionics aggregates knowledge, which requires sharing pure ideas, experimental data and other useful information.

2. Raising the quality of the information search results

Our study is dedicated to creating an intellectual information support system intended for the research and development of promising bionic technologies and for the formation and early evaluation of ideas. It covers various aspects of knowledge engineering and intellectual information support systems that provide methods and techniques for improving information retrieval [2-7]. Knowledge base engineering involves formalizing concepts and the links between them; this can be implemented using various approaches based on metadata, ontologies and thesauri. An ontology is a knowledge base describing the facts that are assumed to be always true within a certain community, on the basis of the generally accepted properties of a thesaurus.
It can be used as an intermediary between the user and the information system or between community members. There are many definitions of the concept of a "thesaurus". However, the following one is the most suitable for the information retrieval issues addressed in the current study. A thesaurus (from the ancient Greek θησαυρός, treasure) is a dictionary, a collection of information covering the concepts, definitions and terms specific to a certain scope, which should facilitate correct lexical and corporate communication (mutual understanding in communication and interaction between individuals associated with one discipline or profession) [8]. As the thesaurus contains semantic relations between lexical units, it can be considered a basis for the description of an individual scope. Broadly speaking, the thesaurus in the scope of bionics and bionic technology can be regarded as a description of the system of knowledge about the scope. At the same time, a thesaurus helps the various participants of a study to reach a common understanding of the descriptions of objects, phenomena and processes.

One of the most common types of thesaurus is the information retrieval thesaurus: a normative dictionary of the descriptors of an information retrieval language, in which the paradigmatic relations between lexical units are fixed [9]. Until recently, the terms "ontology" and "thesaurus" were used as synonyms, but nowadays "thesaurus" is more frequently used to describe the vocabulary in its projection onto semantics, and "ontology" to describe the modelling of semantics and pragmatics in their projection onto the representation language [10]. The purpose of an information retrieval thesaurus is to improve the quality of information retrieval in information retrieval systems. Typically, information retrieval thesauri (IRT) are used to translate natural language text, or to display the paradigmatic relations between descriptors.

3. Thesaurus and its graphic representation

As mentioned above, one of the basic elements of an intellectual system is a knowledge base. The foundation of this knowledge base is formed by the specific thesaurus, so special attention should be given to its formation and visualization. The graphic display of a thesaurus is a set of semantic diagrams or maps representing the paradigmatic relations between descriptors in a visual form (using graphs, arrows, etc.) [9]. The vertices represent terms, and the links between them are represented by arcs, which can be of various types.
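As a rough illustration of this vertex-and-arc structure, the sketch below stores terms as vertices and typed links as arcs in plain Python. The class name `ThesaurusGraph`, the relation labels `NT` (narrower term) and `RT` (related term), and the sample terms are assumptions made for the example only; they are not taken from the thesaurus described in this paper.

```python
from collections import defaultdict, deque


class ThesaurusGraph:
    """Terms are vertices; typed, directed arcs are relations between terms."""

    def __init__(self):
        # term -> list of (relation_type, target_term)
        self._arcs = defaultdict(list)

    def add_relation(self, source, relation, target):
        self._arcs[source].append((relation, target))
        self._arcs[target]  # touching the key registers the target as a vertex

    def terms(self):
        return sorted(self._arcs)

    def relations(self, term):
        return list(self._arcs.get(term, []))

    def neighbourhood(self, term, radius=1):
        """Terms within `radius` arcs of `term`, ignoring arc direction."""
        undirected = defaultdict(set)
        for source, arcs in self._arcs.items():
            for _, target in arcs:
                undirected[source].add(target)
                undirected[target].add(source)
        seen = {term}
        queue = deque([(term, 0)])
        while queue:
            current, distance = queue.popleft()
            if distance == radius:
                continue
            for neighbour in undirected[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, distance + 1))
        return seen


# Hypothetical sample data, for illustration only.
graph = ThesaurusGraph()
graph.add_relation("bionics", "NT", "neurobionics")
graph.add_relation("neurobionics", "RT", "neural network")
graph.add_relation("bionics", "RT", "biomimetics")

print(graph.relations("bionics"))
print(sorted(graph.neighbourhood("neural network", radius=1)))
```

Keeping the relation type on each arc is what would let a viewer show the link type for a selected link, and the `neighbourhood` helper corresponds to drawing only the area around a user-selected term instead of the whole graph.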


It should be noted that the visual representation of the thesaurus for a vast domain will also be large, so it should be possible to display separately the areas related to user-selected concepts. However, it is also advisable to display the thesaurus at the scale of the whole scope, giving the user a sense of the role that a selected element plays in the whole knowledge base.

The software implementation of the specific thesaurus supports the following actions:
• visualizing the whole scope as an overview;
• visualizing a selected area of the scope in detail;
• displaying the term data when the cursor hovers over a term;
• displaying the source or the definition of a term when the cursor hovers over the term or its definition, respectively;
• displaying the link type when the cursor hovers over a link;
• updating the thesaurus.

A graphical representation of a thesaurus on bionics and bionic technology and its implementation make it possible to visualize:
• the current role of a term in the whole scope;
• the references and links for a selected term;
• the links from a term to the original documents.

With the help of the thesaurus, the user can:
• find the desired terms while browsing the scope;
• use term search;
• navigate through the thesaurus to find the resources that contain the required term.

The capabilities of the system can be extended by adding fuzzy term search based on the affinity of terms. It is also recommended to develop a multilingual information retrieval thesaurus containing parallel monolingual thesauri, with a single knowledge base storing the matching links between equivalent terms from different thesauri [11]. The primary purpose of a multilingual IRT is to form a language-independent knowledge exchange environment.

4. Producing the primary thesaurus

Bionics as a science stems from various scientific disciplines. Each of these disciplines has its own terminology, so the creation of a large-scale bionic knowledge base will inevitably run into the problem of establishing consistency between different terminological systems.
Unfortunately, this problem does not have a universal solution, but it is possible to help researchers by forming the initial thesaurus with an automated data extraction method based on indexing algorithms. Let us describe the implementation of this method. Its initial idea is that the efficiency of a typical web crawler depends on the size and contents of its indexing database, so populating it with a large amount of consistent data will bring better search results. However, the larger the amount of data, the harder it is to maintain consistency. This vicious circle can be broken by changing the way the indexing data is collected, namely by excluding words and word combinations that carry no meaning when taken apart, i.e. have no semantics of their own. For example, these can be separate function words, or sets of words that appear next to one another in a sentence but belong to different phrases. Thus, the filtering can be performed using a sentence parser.

There are several kinds of syntactic parsing methods, but only two of them are widespread. The first, Universal Dependencies (UD), originates from the hypothesis that the elements of a sentence in any natural language (NL) are connected to each other by binary relations and that the possible variety of these relations is finite [12]. For example, let us consider a simple English sentence, "Business biomimetics is the latest development in the application of biomimetics", and define the set of binary relations for it, as shown in Figure 1.


[Figure 1: dependency arcs drawn over the example sentence, labelled compound, nsubj, cop, det, amod, case, nmod:in and nmod:of.]

Fig. 1. Binary relations.

In a more compact way, they can be represented as a list: compound(biomimetics-2, Business-1); nsubj(development-6, biomimetics-2); cop(development-6, is-3); det(development-6, the-4); amod(development-6, latest-5); case(application-9, in-7); det(application-9, the-8); nmod:in(development-6, application-9); case(biomimetics-11, of-10); nmod:of(application-9, biomimetics-11). The resulting list contains only grammatically valid relations, and their number is always equal to the sentence length minus one. This example demonstrates that applying UD will not substantially reduce an IRT, but it makes it possible to count those word combinations whose elements are not adjacent to each other.

In the second case, the data structure that describes the grammatical relations is a tree [13]. Terminal elements are single words, and the branches, in turn, form syntactic groups, which can be nested (see Figure 2).
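The dependency list for the example sentence can be checked mechanically. The following Python sketch transcribes the ten relations as (relation, head position, dependent position) triples, confirms that their number equals the sentence length minus one, and picks out the pairs whose elements are not adjacent. The triple encoding and the helper names `word` and `non_adjacent_pairs` are illustrative assumptions, not part of any parser's API.

```python
SENTENCE = ("Business biomimetics is the latest development "
            "in the application of biomimetics").split()

# (relation, head position, dependent position); positions are 1-based,
# matching the word indices used in the dependency list of the example.
RELATIONS = [
    ("compound", 2, 1),
    ("nsubj", 6, 2),
    ("cop", 6, 3),
    ("det", 6, 4),
    ("amod", 6, 5),
    ("case", 9, 7),
    ("det", 9, 8),
    ("nmod:in", 6, 9),
    ("case", 11, 10),
    ("nmod:of", 9, 11),
]


def word(position):
    """Return the word at a 1-based position in the sentence."""
    return SENTENCE[position - 1]


def non_adjacent_pairs(relations):
    """Return (relation, head word, dependent word) for words not side by side."""
    return [
        (relation, word(head), word(dependent))
        for relation, head, dependent in relations
        if abs(head - dependent) > 1
    ]


# One relation per word, except for the root of the sentence.
assert len(RELATIONS) == len(SENTENCE) - 1

pairs = non_adjacent_pairs(RELATIONS)
for relation, head, dependent in pairs:
    print(f"{relation}: {head} <- {dependent}")
```

A plain sliding window over the sentence would never produce a pair such as "development" and "application", because the two words are three positions apart; the dependency list links them directly through nmod:in.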

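[Figure 2: constituency parse tree of the example sentence. In bracketed form: (S (NP (NNP Business) (NNS biomimetics)) (VP (VBZ is) (NP (NP (DT the) (JJS latest) (NN development)) (PP (IN in) (NP (NP (DT the) (NN application)) (PP (IN of) (NP (NNS biomimetics))))))) (. .))]

Fig. 2. Parse tree.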


Traversing the tree makes it possible to determine the number and composition of the syntactic groups that form the sentence. Each syntactic group contains one or more elements that agree in grammatical characteristics such as gender, number, case (explicit or implicit), tense, etc. For example, the n-grams shown in Table 1 can be extracted from the sentence.

Table 1. Complete n-gram list.

N-gram | Frequency
of biomimetics | 2
the application | 2
business biomimetics | 2
the latest development | 2
the application of biomimetics | 4
in the application of biomimetics | 5
the latest development in the application of biomimetics | 8
is the latest development in the application of biomimetics | 9
business biomimetics is the latest development in the application of biomimetics | 12
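The extraction of syntactic groups from a parse tree can be sketched in a few lines of Python. The bracketed tree string below is a hand transcription of the parse tree of the example sentence and is an assumption of this sketch, as are the helper names `parse`, `words` and `constituents`; a real pipeline would take the tree from whichever parser is in use. Only groups of two or more words are kept, and the final full stop is not counted as a word.

```python
import re

# Hand-transcribed constituency tree of the example sentence (an assumption).
TREE = (
    "(S (NP (NNP Business) (NNS biomimetics)) "
    "(VP (VBZ is) (NP (NP (DT the) (JJS latest) (NN development)) "
    "(PP (IN in) (NP (NP (DT the) (NN application)) "
    "(PP (IN of) (NP (NNS biomimetics))))))) (. .))"
)
PUNCTUATION_TAGS = {".", ","}


def parse(tokens):
    """Parse one bracketed node into (label, children); leaves are strings."""
    if tokens.pop(0) != "(":
        raise ValueError("expected an opening bracket")
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        if tokens[0] == "(":
            children.append(parse(tokens))
        else:
            children.append(tokens.pop(0))
    tokens.pop(0)  # consume the closing bracket
    return (label, children)


def words(node):
    """Collect the words under a node, skipping punctuation."""
    label, children = node
    collected = []
    for child in children:
        if isinstance(child, str):
            if label not in PUNCTUATION_TAGS:
                collected.append(child)
        else:
            collected.extend(words(child))
    return collected


def constituents(node):
    """List (label, phrase) for every syntactic group of two or more words."""
    label, children = node
    span = words(node)
    found = [(label, " ".join(span))] if len(span) >= 2 else []
    for child in children:
        if not isinstance(child, str):
            found.extend(constituents(child))
    return found


tokens = re.findall(r"\(|\)|[^\s()]+", TREE)
root = parse(tokens)
found = constituents(root)
for label, phrase in found:
    print(f"{label}: {phrase.lower()}")
```

With this tree the sketch yields nine phrases, the same nine word groups that are listed in Table 1.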

It is important to note that a parse tree provides greater clarity and also produces n-grams independently of the current word order. The only weakness of this method is that the tree structure strictly depends on the applied language model, so applying different parsers to the same sentence can potentially result in two or more different parse trees. To eliminate such inconsistencies, it is strongly recommended to parse the whole text base with a single parser. The maximum number N of n-grams of a given length in a sentence can be calculated using the formula:

$$N = L - (n - 1) = L - n + 1 \qquad (1)$$

where L is the sentence length and n is the n-gram length. The parse tree experiment demonstrates that the number of n-grams actually extracted is always lower than this maximum, which indicates a reduction of the data. The output of the proposed methods can then be vectorised according to the TF-IDF metric [14]. The term frequency (TF) of an n-gram can be calculated using the formula:

$$\mathrm{tf}(t, d) = \frac{n_i}{\sum_k n_k} \qquad (2)$$

where $n_i$ is the number of occurrences of the n-gram and $\sum_k n_k$ is the overall number of n-grams of this length. Then, the inverse document frequency (IDF) is calculated as:

$$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{ d_i \ni t_i \}|} \qquad (3)$$

where $|D|$ is the total number of documents in the text collection and $|\{ d_i \ni t_i \}|$ is the number of documents in which the n-gram occurred. The overall weight is calculated as the product of TF and IDF:

$$\mathrm{tf\text{-}idf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D) \qquad (4)$$
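Equations (1)-(4) can be tried out on a toy example. The Python sketch below is only an illustration under several assumptions: the three short documents are invented, the n-grams are produced by a plain sliding window (standing in for the parser-based extraction described above), and the natural logarithm is used because the base of the logarithm in Eq. (3) is not specified.

```python
import math
from collections import Counter

# Invented toy documents, for illustration only.
DOCS = [
    "business biomimetics is the latest development",
    "the application of biomimetics",
    "the latest development in bionics",
]


def sliding_ngrams(tokens, n):
    """All contiguous n-grams; there are L - n + 1 of them, as in Eq. (1)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf(ngram, doc_ngrams):
    """Eq. (2): occurrences of the n-gram over all n-grams of that length."""
    return Counter(doc_ngrams)[ngram] / len(doc_ngrams)


def idf(ngram, corpus):
    """Eq. (3): log of the corpus size over the number of documents with the n-gram."""
    containing = sum(1 for doc_ngrams in corpus if ngram in doc_ngrams)
    if containing == 0:
        raise ValueError(f"{ngram!r} does not occur in the corpus")
    return math.log(len(corpus) / containing)


def tf_idf(ngram, doc_ngrams, corpus):
    """Eq. (4): product of TF and IDF."""
    return tf(ngram, doc_ngrams) * idf(ngram, corpus)


tokens = DOCS[0].split()
corpus = [sliding_ngrams(doc.split(), 2) for doc in DOCS]

for ngram in corpus[0]:
    print(f"{ngram}: {tf_idf(ngram, corpus[0], corpus):.4f}")
```

In this toy corpus the bigram "the latest" appears in two of the three documents and therefore receives a lower weight than "business biomimetics", which appears in only one.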

The proposed data reduction approach can be used with different weight estimation metrics, but it is vital to use a single metric for the whole IRT.


5. Conclusion

The proposed implementation of a specific thesaurus has a dual nature. On the one hand, it can be regarded as an independent source of information; on the other hand, it serves as a technology for searching and organizing resources that intensifies the work conducted by various researchers in the field of bionics.

Acknowledgements

This research was conducted at the Federal State Budget Education Institution of Higher Education «Moscow Technological University» (MIREA) and supported by the Russian Science Foundation under Grant №14-11-00854.

References

1. Bar-Cohen Y. Biomimetics--using nature to inspire human innovation. Bioinspir Biomim. 2006; 1. p. 1-12.
2. Sigov AS, Nechaev VV, Baranyuk VV, Koshkarev MI, Smirnova OS, Melikhov AA, Bogoradnikova AV. Architecture of domain-specific data warehouse for bionic information resources. Ecology, environment and conservation. Suppl. Issue 2015; 21. p. 181–186.
3. Baranjuk VV, Smirnova OS, Bogoradnikova AV. Intellectual system for supporting the development of perspective bionic technologies. Int. Journal of Open Information Technologies 2014; 12. p. 17–20.
4. Nechaev VV, Baranyuk VV, Smirnova OS, Koshkarev MI, Volodina AM, Bogoradnikova AV, Markelov KS. Information resources and technologies 2015. (Moscow: Mirea). p. 92.
5. Baranjuk VV, Smirnova OS. Expanding the bionics ontology by the description of swarm intelligence. International Journal of Open Information Technologies. 12. p. 13–17.
6. Baranjuk VV, Smirnova OS. Detailed swarm intelligence algorithms description for expanding the bionics ontology. International Journal of Open Information Technologies 2015; 12. p. 18–27.
7. Smirnova OS, Bogoradnikova AV, Blinov MU. Describing the swarm algorithms, inspired by abiocen and bacterias, in the bionics ontology. Int. Journal of Open Information Technologies 2015; 12. p. 28–37.
8. Staab S, Studer R. Handbook on Ontologies. Second edition. New York: Springer; 2009.
9. GOST 7.74 System of standards on information, librarianship and publishing. Information retrieval languages. Terms and definitions. P. 38.
10. Narinyani AS. Centaur named Theon: + Thesaurus Ontology. Proc. of the DIALOG’2001 Int. Workshop 2001; 1. p. 184-188.
11. GOST 7.24-2007. System of standards on information, librarianship and publishing. Multilingual thesaurus for information retrieval. 10. Composition, structure and basic requirements for development. M 2007; p. 14.
12. de Marneffe MC, Dozat T, Silveira N, Haverinen K, Ginter F, Nivre J, Manning CD. Universal Stanford dependencies: A cross-linguistic typology. Proc. of 9th Int. Conf. on Language Resources and Evaluation 2014; 4. p. 585-592.
13. Carnie C. Syntax: a generative introduction. Oxford: Blackwell Publishing; 2007.
14. Robertson S. Understanding inverse document frequency: on theoretical arguments for IDF. Journal of Documentation 2004; 5. p. 503–520.
