World Applied Sciences Journal 24 (Information Technologies in Modern Industry, Education & Society): 55-61, 2013 ISSN 1818-4952 © IDOSI Publications, 2013 DOI: 10.5829/idosi.wasj.2013.24.itmies.80012
System of Physical Effects Extraction from Natural Language Text in the Internet
Dmitriy Mikhaylovich Korobkin, Sergey Alekseevich Fomenkov, Sergey Grigoryevich Kolesnikov and Yuriy Fedorovich Voronin
VSTU, Volgograd, Russia
Submitted: Aug 5, 2013; Accepted: Sep 15, 2013; Published: Sep 25, 2013
Abstract: This article describes the process of extracting descriptions of Physical Effects (PE) from natural language text. One of the most important current tasks is the automation of the early stages of design (the requirements specification and technical proposal stages) of new technical systems (TS) and technologies, at which the fundamental decisions about the operating principles and structure of the design object are made. One of the most promising approaches to these early stages uses structured physical knowledge in the form of Physical Effects (PE) for the automatic synthesis and selection of the physical operating principle of the technical system under development. Because most modern discoveries in physics are described in English, automating the extraction of PE descriptions from the text of these publications is a pressing problem.
Key words: Semantic Analysis, Ontology, Semantic Network, Physical Effect
INTRODUCTION
One of the most important current tasks is the automation of the early stages of design (the requirements specification and technical proposal stages) of new technical systems (TS) and technologies, at which the fundamental decisions about the operating principles and structure of the design object are made. One of the most promising approaches to these early stages uses structured physical knowledge in the form of Physical Effects (PE) for the automatic synthesis and selection of the physical operating principle of the technical system under development. A Physical Effect [1] is an objective relation between two or more physical phenomena, each of which is described by an appropriate physical quantity. Since any physical phenomenon is realized in a material medium, it is convenient to represent a PE as a "black box" diagram A → B → C, where A is the input, B is the object and C is the output. In the course of work on automating the early stages of design, an established scientific school was formed at Volgograd State Technical University (VSTU). The Department of Computer Aided Design and Searching Construction (VSTU) developed a principal model of PE description, and on its basis a database of Physical Effects (about 1300 descriptions) [2] was constructed. The database includes knowledge from different fields of physics and is also extended on the basis of new discoveries and inventions.
Until now, however, the PE database has been replenished only from Soviet (Russian) periodical literature, for example the Russian journal Uspekhi Fizicheskikh Nauk. The research and discoveries of foreign physicists were tracked only through translated literature [3]. As a result, many physical inventions and discoveries of foreign scientists that were not translated into Russian were not included. Because foreign publications are mainly in English, the current task is to automate the extraction of descriptions of physical effects (PE) from the text of these publications.
Method of Physical Effects Extraction from Natural Language Text: The principal model of PE description [1] consists of two parts (documents): $M = \langle M_1, M_2 \rangle$, where $M_1$ is the model of input information (the "input card") of the PE, which is based on descriptors from the thesaurus of the PE database and is used when searching for PEs and synthesizing the structures of the physical operating principles of technical systems; $M_2$ is the model of output information (the "output card") of the PE, which is used to present to the user all available information about the PE (including graphs, formulas, etc.).
Corresponding Author: Dmitriy Mikhaylovich Korobkin, VSTU, Lenin Av., 28, 400005, Volgograd, Russia
An Overview of Analogues: The article analyzes existing text mining systems (systems for searching information in unstructured text arrays), both commercial (AeroText, TextAnalyst, WordStat, Attensity) and non-commercial (Carrot2, GATE, OpenNLP, Natural Language Toolkit, RapidMiner). Most of these systems can build semantic networks, extract facts and concepts, perform keyword search and create taxonomies and thesauri, but none of them can automate the extraction of physical effects from text. To accomplish this task we used a knowledge-based approach founded on patterns, which make it possible to mark a definite syntactic and semantic construction in the text and then label the found text fragment, from which knowledge is extracted and represented in the required form. Semantic Role Labeling makes it possible to perform syntactic and semantic text analysis and to represent text segments in the form of Dependency Tree Semantics. We considered ontologies developed for the English language: FrameNet, VerbNet, PropBank and NomBank. We compared systems that are built on these ontologies and provide Semantic Role Labeling: the EP4IR parser, Link Grammar, SENNA and SwiRL. We chose the Semantic Role Labeler as the base system.
The Ontology of Subject Field: To automate the extraction of PE descriptions [3, 4] from English texts, we define the components of the formal description of the Subject Field (SF) "Physical Effect": an ontology [5, 6], which includes the concepts of the subject field and their relations, and a subject dictionary (thesaurus) [7], which includes the terms by which the concepts and relations of the ontology can be expressed in text.
Figure 1 shows a taxonomy diagram of the SF concepts. The concepts of the Subject Field are described by a taxonomy with the relations "IS-A" and "HAS-PART". As an example of the "HAS-PART" relation, the presence in the text of the physical quantity "Magnetic Induction" definitely indicates the presence of the influence "Magnetic Field". The thesaurus of SF concepts is formed from the terms, and their synonyms, that express the concepts in Natural Language (NL).
According to the model of the Physical Effect [1] developed at the Department of Computer Aided Design and Searching Construction of Volgograd State Technical University, input influences on the object of a Physical Effect produce output influences on the environment or on the object itself. Therefore, in a text that contains a PE description, it is necessary to select predicates that express some "influence" on arguments with certain semantic roles within that "influence". We compiled a set of SF predicates typical of PE descriptions in text, such as change (increase, decrease), dependence (depend, be directly proportional, be inversely proportional), influence (relate, cause), etc. For every predicate we obtained the semantic roles of its arguments, "A0 (Subject)" (what exerts the influence), "A1 (Object)" (what is influenced) and "AM-LOC (Location)" (where the influence is realized), and matched them to the elements of the PE description: input A, output C and object B.
Model of Physical Effect Description in the Text: We developed a model of the Physical Effect description [4] related to the given ontology and to the semantic predicate-argument structure:
$M_{FE} = \langle C, D, B, R_C, R_B \rangle$, (1)
where $C$ is the set of Subject Field predicates typical of PE descriptions in text, $c_i \in C$ is a predicate of the Subject Field; $D$ is the set of semantic roles of the arguments of Subject Field predicates {Subject, Object, Indirect Object, Location, Direction}, $D_i \subseteq D$ is the list of semantic roles for $c_i$, $d_j \in D_i$; $B$ is the set of elements of the Physical Effect description (A, B, C), $B_k \subseteq B$,
$\forall c_i \in C \;\; \exists d_j \in D_i \; [\, d_j \xrightarrow{def} B_k \,]$,
where $B_k \subseteq$ {input (A), output (C), object (B)}; $def$ is an operator that defines the correspondence between the semantic role $d_j$ of an argument of the predicate $c_i$ and the set of PE description elements $B_k$; $R_C$ is a relation on $C \times D$: a pair $(c_i, d_j) \in R_C$ uniquely determines the element (or elements) of the PE description that plays the semantic role $d_j$ within the predicate structure of $c_i$; $R_B$ is a relation on $R_C \times B$: a pair $((c_i, d_j), B_k) \in R_B$ determines the set of SF concepts corresponding to an element of the PE description $b_k$, $b_k \in B_k$. For instance, the relation between the roles of the predicate Decrease and the PE description elements is shown in Figure 2.
Fig. 1: Taxonomy of SF concepts (relations IS-A and HAS-PART). The diagram links the elements of PE description (Input PE, Output PE, Object PE) to groups of SF concepts (name of influence; quality characteristics of influence; parametric and non-parametric physical quantities of influence; structure of object; object characteristics) and to example concepts such as Electric Field, Weak Electric Field, Strong Electric Field, Electric Field Intensity, Potential Difference, Magnetic Field, Homogeneous Magnetic Field, Magnetic Induction, Temperature, Conductivity, Mixture, Contact, Solid, Crystalline Solid and Amorphous Solid.
Fig. 2: Predicate-argument structure of the predicate Decrease with the description elements of PE (Subject: Input PE; Object: Input PE, Output PE; Location: Object PE).
An example of filling the model $M_{FE}$ (Figure 2): the SF predicate is $c_1$ = Decrease, $c_1 \in C$; $D_1$ = {Subject, Object, Location}, $D_1 \subseteq D$, is the set of semantic roles for the predicate $c_1$; $B$ = {input PE, object PE, output PE}.
For each semantic role of an argument of $c_1$, the corresponding set of PE description elements is:
$(c_1, Subject) \in R_C$: semantic role "Subject" of an argument of the predicate $c_1$, $B_1$ = {input PE};
$(c_1, Object) \in R_C$: semantic role "Object" of an argument of the predicate $c_1$, $B_2$ = {input PE, output PE};
$(c_1, Location) \in R_C$: semantic role "Location" of an argument of the predicate $c_1$, $B_3$ = {object PE}.
The sets of SF concepts corresponding to a PE description element $b_k$, $b_k \in B_k$, are:
$((c_1, Subject), \{InputPE\}) \in R_B$: {Electric Field, Magnetic Field, etc.};
$((c_1, Object), \{InputPE, OutputPE\}) \in R_B$: {Electric Field, Magnetic Field, etc.};
$((c_1, Location), \{ObjectPE\}) \in R_B$: {Solid, Liquid, Plasma, etc.}.
As an example of sentence analysis, consider: "(Role: Location) {Object PE: in metallic conductors} (Role: Object) {Output PE: the electrical resistivity} (Predicate: decreases) gradually as the temperature is lowered" or "As (Role: Object) {Input PE: the temperature} is (Predicate: decreased), the strength of current starts to increase".
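The role-to-element correspondence of the Decrease example can be illustrated with a small lookup. This is only a sketch under stated assumptions: the names `ROLE_TO_ELEMENTS`, `ELEMENT_TO_CONCEPTS` and `pe_elements` are hypothetical, and the concept sets are a tiny illustrative thesaurus (the entries "temperature", "electrical resistivity" and "metallic conductor" are taken from the example sentences, not from the actual PE database).

```python
from typing import Dict, FrozenSet, Set, Tuple

# Relation R_C for the predicate Decrease:
# (predicate, semantic role) -> admissible elements of the PE description.
ROLE_TO_ELEMENTS: Dict[Tuple[str, str], FrozenSet[str]] = {
    ("decrease", "Subject"): frozenset({"input"}),
    ("decrease", "Object"): frozenset({"input", "output"}),
    ("decrease", "Location"): frozenset({"object"}),
}

# Abbreviated stand-in for R_B: PE description element -> SF concepts.
# Illustrative only; a real thesaurus would be far larger.
ELEMENT_TO_CONCEPTS: Dict[str, Set[str]] = {
    "input": {"electric field", "magnetic field", "temperature"},
    "output": {"electric field", "magnetic field", "electrical resistivity"},
    "object": {"solid", "liquid", "plasma", "metallic conductor"},
}


def pe_elements(predicate: str, role: str, term: str) -> Set[str]:
    """Return the PE description elements a term may fill in a given role.

    An element qualifies if the (predicate, role) pair admits it and the
    term is listed among the concepts of that element.
    """
    allowed = ROLE_TO_ELEMENTS.get((predicate, role), frozenset())
    return {e for e in allowed if term in ELEMENT_TO_CONCEPTS[e]}
```

With these tables, the Object argument "electrical resistivity" of Decrease resolves to the output of the PE, while the Object argument "temperature" resolves to the input, which mirrors the two analyzed sentences.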
Semantic Network of PE Description in the Text: The vertices of the conceptual graph are $(V_j^1, V_i^2)$:
$V_j^1 = c_j$, (2)
where $c_j$ are the predicates of the SF;
$V_i^2 = (T_i, B_i)$, (3)
where $T_i$ is an argument (a Natural Language term) of the SF predicate $c_j$ that plays a certain role, $T_i \in R_B$; $B_i$ is the element (or elements) of the PE description that plays a semantic role within the predicate, $B_i \in R_C$.
As an example of a semantic network, that is, a combination of several graphs obtained from a sentence by the rules of conjunction and simplification, consider the sentence "In electrical circuits, any electric current produces a magnetic field and hence generates a total magnetic flux acting on the circuit". Its network is shown in Figure 3, where $c_2$ is the relation Produce, $c_3$ is the relation Generate and $c_4$ is the relation Act.
Fig. 3: Semantic network. Vertices: O1 = (T1, B1), T1 = {electric current}, B1 = {input PE}; O2 = (T2, B2), T2 = {magnetic field}, B2 = {output PE}; O3 = (T3, B3), T3 = {electrical circuit}, B3 = {object PE}; O4 = (T4, B4), T4 = {magnetic flux}, B4 = {output PE}; O5 = (T5, B5), T5 = {circuit}, B5 = {object PE}. The predicates c2 = Produce, c3 = Generate and c4 = Act are connected to these vertices by arcs labeled Subject, Object and Location.
On the basis of the proposed ontology, model and semantic network, we developed an algorithm [3] for extracting structured physical knowledge in the form of physical effects from English texts (Figure 4).
Fig. 4: An algorithm of extracting structured physical information in the form of PE from text. Flowchart steps: Begin; initial source text; 1. Semantic analysis; 2. Semantic-linguistic analysis (finding conceptual relations of SF; identifying roles and related terms; decision "Terms ∈ DB concepts of SF Physical Effect", True/False; if True, construction of semantic networks of PE in text sentences); 3. Making of primary "input card" and "output card" of PE; End.
The algorithm consists of the following sequential procedures. Semantic analysis represents the text of the initial source in the form of syntactic-semantic trees. Linguistic semantic analysis begins by searching the text for terms from the thesaurus of SF predicates, which are the roots of the syntactic-semantic trees of the text sentences. In a syntactic-semantic tree, the important elements are the arguments of the given predicate, which are related to it by certain roles. The argument terms must be present in the conceptual thesaurus of the SF Physical Effect.
The next operation of linguistic semantic analysis is building the semantic network of the PE in the sentence (Figure 3), identifying for each text term the corresponding SF concept and the group of SF concepts to which it belongs: Input PE, Output PE or Object PE (Figure 1).
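A minimal sketch of this network-building step for the example sentence of Figure 3 is given here. It assumes that a semantic role labeler has already produced predicate frames; the `FRAMES` list is hand-written, and its role assignments for Generate and Act are one possible reading of Figure 3 rather than the output of any real tool. The names `THESAURUS`, `Vertex` and `build_network` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical thesaurus: term -> element of PE description (values as in Figure 3).
THESAURUS: Dict[str, str] = {
    "electric current": "input",
    "magnetic field": "output",
    "electrical circuit": "object",
    "magnetic flux": "output",
    "circuit": "object",
}


@dataclass(frozen=True)
class Vertex:
    """Second-kind vertex V2 = (T, B): a term and its PE description element."""
    term: str
    element: str


# Hand-written frames standing in for semantic role labeler output (assumption).
FRAMES: List[Tuple[str, Dict[str, str]]] = [
    ("produce", {"Subject": "electric current",
                 "Object": "magnetic field",
                 "Location": "electrical circuit"}),
    ("generate", {"Subject": "electric current", "Object": "magnetic flux"}),
    ("act", {"Subject": "magnetic flux", "Object": "circuit"}),
]


def build_network(frames: List[Tuple[str, Dict[str, str]]]
                  ) -> List[Tuple[str, str, Vertex]]:
    """Return edges (predicate, role, Vertex) for every argument found in the thesaurus.

    Arguments whose terms are not in the thesaurus are skipped, which mirrors
    the "Terms in DB concepts of SF" check of the algorithm.
    """
    edges = []
    for predicate, arguments in frames:
        for role, term in arguments.items():
            element = THESAURUS.get(term)
            if element is None:
                continue  # term is not an SF concept; ignore this argument
            edges.append((predicate, role, Vertex(term, element)))
    return edges


network = build_network(FRAMES)
```

Each edge connects a predicate vertex (first kind) to a term vertex (second kind) through a role label, so the list of edges is a flat encoding of the conceptual graph.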
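A term is the linguistic constituent of a Subject Field concept: a word or set expression that denotes a given physical concept in Natural Language (NL). The element (or elements) of the PE description (input, output, object PE) assigned to a term reflects the sense of this concept within the framework of the PE model.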
The next procedure of the algorithm (Figure 4) is the construction of the primary "input card" of the PE by joining concepts. The document text is divided into a sequence of fragments; the authors' paragraphs are used as the basis for fragmentation. The semantic networks of the PE descriptions in the sentences of a paragraph are joined into a single transformed semantic network.
For example, consider the following text fragment: "Temperature increase of crystalline dielectric changes its electrical resistance. Also we can see magnetic permeability growth of solid material near Curie-point". In this case (Figure 5) we obtain descriptions of two PEs that share temperature as the input influence and a crystalline solid (dielectric) as the object, but differ in the output influence: 1) electrical resistance; 2) magnetic permeability.
Documents Filtration: To make a large array of text documents manageable, the information sources must be divided into thematic groups. In this work, document filtration is based on a multistep clustering algorithm: at the first stage we use Kohonen self-organizing maps (SOM) [8, 9] to reduce the dimensionality of the feature space; at the second stage we use the FOREL algorithm to determine the number of clusters automatically. We chose the "term-document" model to represent documents in the term space because it allows morphological analysis to be applied and noise filtering to be used. At the stage of thematic filtration we perform a semantic analysis of the document, determining the term frequency (TF) and the inverse document frequency (IDF). The theme of a document is determined, according to the developed algorithm, from the TF-IDF characteristics [10] and the weights of the neural network.
System of Physical Effects Extraction from Natural Language Text in the Internet: The strategy for searching the Internet for documents that contain PE descriptions is based on two approaches: 1) work with an initial array of hyperlinks, defined by the system administrator, to resources with content in the field of physics (for example, the sites of the journals The Success of Physical Science, Journal of Applied Physics, Physics of Solid, etc.); 2) use of search engine indexes (Google, Yandex, etc.).
System Architecture: The text analysis system is implemented as a system with a hierarchical organization of search agent interaction (Figure 6). A search agent extracts internal URLs (on the same host as the parsed document) and external URLs (on hosts different from that of the parsed document) and passes them to the meta-agent. The meta-agent distributes the URLs among the search agents using the algorithm of bypassing the tree of extracted URLs (Figure 7). The meta-agent works with a search engine index compiled from the information transmitted by the search agents. In the DB PE extension mode, the meta-agent passes initial URLs to a search agent; in the DB PE modernization mode, it sends requests built from a modified PE description. The meta-agent also performs the extraction of PE descriptions from the initial source text. The search agent loads documents using URLs passed by the meta-agent or obtained through the Google Web API or Yandex.XML. The search agent performs HTML parsing, document filtering and recursive bypassing of URLs.
The initial URLs are on the upper level of the tree. Each top-level URL i is assigned a probability P(i) (equal to 1 at initialization) that the URLs from this document point to relevant physical documents. In the URL tree we choose a node i that has not yet been considered and for which P(i) is maximal. The document at this URL is loaded and then filtered. If the document passes filtering, new nodes are formed (one for each URL found in the document) with relevance equal to 1. If the document does not pass, the relevance of the URLs from this node is set to 0.
Relevance is recalculated for the URLs that have not yet been considered and that share a document depository with the checked URL:
$P(i) = \frac{1 + rl(Pr(i))}{1 + rl(Pr(i)) + nrl(Pr(i))}$, (4)
where $Pr(i)$ is the document depository for URL i, $rl(Pr(i))$ is the number of relevant URLs from $Pr(i)$, and $nrl(Pr(i))$ is the number of irrelevant URLs from $Pr(i)$.
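The relevance estimate of Eq. (4) and the choice of the next URL to visit can be sketched as follows. This is an illustrative fragment, not the system's implementation: the function names `relevance` and `select_next` are hypothetical, and the depository is represented only by its two counters.

```python
from typing import Dict, Optional, Set


def relevance(rl: int, nrl: int) -> float:
    """Eq. (4): P(i) = (1 + rl) / (1 + rl + nrl).

    rl  - number of relevant URLs already seen in the same depository,
    nrl - number of irrelevant URLs already seen in the same depository.
    """
    return (1 + rl) / (1 + rl + nrl)


def select_next(prob: Dict[str, float], visited: Set[str]) -> Optional[str]:
    """Pick the unvisited URL with the highest P(i), or None if none is left."""
    candidates = [url for url in prob if url not in visited]
    if not candidates:
        return None
    return max(candidates, key=lambda url: prob[url])
```

With no evidence the estimate equals 1, which agrees with the initialization described in the text; each irrelevant document found in the same depository lowers the estimate for the URLs that remain unchecked there.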
Fig. 5: Process of making a primary description of PE. The diagram joins the semantic networks of the two example sentences: the predicates c1 = Increase, c2 = Change and c3 = Growth connect Temperature and Temperature (Curie-point) (Input PE; joined as Temperature, a parametric influence, through IS-A and HAS-PART relations), Electrical Resistance and Magnetic Permeability (Output PE), and Dielectric, Crystalline Dielectric and Solid Material (Object PE; joined as Crystalline Solid through IS-A relations).
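The joining illustrated in Figure 5 can be sketched as grouping per-sentence descriptions by their normalized input and object. The sketch assumes that each sentence has already been reduced to an (input, object, output) triple; the `NORMALIZE` table is a hypothetical stand-in for the IS-A and HAS-PART lookups in the ontology, and `join_descriptions` is an illustrative name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical normalization table standing in for ontology lookups (Figure 5).
NORMALIZE: Dict[str, str] = {
    "temperature (curie-point)": "temperature",
    "crystalline dielectric": "crystalline solid",
    "solid material": "crystalline solid",
}


def canon(term: str) -> str:
    """Map a term to its joined concept; unknown terms are only lowercased."""
    key = term.lower()
    return NORMALIZE.get(key, key)


def join_descriptions(
    descriptions: Iterable[Tuple[str, str, str]]
) -> Dict[Tuple[str, str], List[str]]:
    """Group (input, object, output) triples by normalized (input, object)."""
    joined: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for inp, obj, out in descriptions:
        key = (canon(inp), canon(obj))
        output = canon(out)
        if output not in joined[key]:
            joined[key].append(output)
    return dict(joined)


paragraph = [
    ("Temperature", "Crystalline dielectric", "Electrical resistance"),
    ("Temperature (Curie-point)", "Solid material", "Magnetic permeability"),
]
cards = join_descriptions(paragraph)
```

For the example paragraph the result holds a single key, (temperature, crystalline solid), with two outputs, which corresponds to the two primary PE descriptions named in the text.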
Fig. 6: Architecture of text analysis system. Components: sites with physical content (page content, URL); search agent (loading and parsing of a document, filtering documents, defining the strategy of bypassing URLs, information about URL); Google Web API and Yandex.XML (request, URL); meta-agent (search engine index, extracting of PE description).
Analysis of System Efficiency: The efficiency of the system was tested on a special array of documents consisting of 60 documents with non-physical content, 17 documents with physical content but without a PE description, and 74 documents containing PE descriptions. Thus, with relevance judged by the presence of PE descriptions, the test array contains $D_{rel} = 74$ relevant documents and $D_{nrel} = 77$ irrelevant documents. The results of using the system in filtering mode are shown in Table 1.
Fig. 7: Algorithm of bypassing URL tree. Flowchart steps: Begin; 1. Initialization of bypassing URL tree; while unchecked URLs are present: 2. Selecting URL; 3. Modification of bypassing URL tree; End.
Here $D_{relretr}$ is the number of relevant documents that passed the filter, $D_{nrelretr}$ is the number of irrelevant documents that passed the filter, and $D_{retr}$ is the number of documents found by the system;
$P = \frac{|D_{rel} \cap D_{retr}|}{|D_{retr}|}$ - precision, (5)
$R = \frac{|D_{rel} \cap D_{retr}|}{|D_{rel}|}$ - recall, (6)
$F = \frac{|D_{nrel} \cap D_{retr}|}{|D_{nrel}|}$ - F-measure. (7)
Table 1: Results of system efficiency verification
Mode                            Drelretr   Dnrelretr   Dretr   Precision   Recall   F-measure
Filtering                       73         3           76      0.961       0.986    0.039
Extraction of PE description    49         78          127     0.386       0.598    -
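The values in Table 1 can be reproduced from the counts with Eqs. (5)-(7). The sketch below uses a hypothetical function name, `evaluate`. The third measure, which the paper labels F-measure, is computed exactly as Eq. (7) states, that is, as the share of irrelevant items that were retrieved. For the extraction row the number of relevant PE descriptions is 82, as stated in the text, and no count of irrelevant items is given, so the third measure is left undefined.

```python
from typing import Optional, Tuple


def evaluate(rel_retr: int, nrel_retr: int, n_rel: int,
             n_nrel: Optional[int] = None
             ) -> Tuple[float, float, Optional[float]]:
    """Compute the measures of Eqs. (5)-(7) from raw counts.

    rel_retr  - relevant items retrieved (Drelretr)
    nrel_retr - irrelevant items retrieved (Dnrelretr)
    n_rel     - all relevant items (Drel)
    n_nrel    - all irrelevant items (Dnrel), if known
    """
    retrieved = rel_retr + nrel_retr          # Dretr
    precision = rel_retr / retrieved          # Eq. (5)
    recall = rel_retr / n_rel                 # Eq. (6)
    third = nrel_retr / n_nrel if n_nrel else None  # Eq. (7)
    return precision, recall, third


# Filtering row: Drel = 74, Dnrel = 77.
filtering = evaluate(73, 3, 74, 77)
# Extraction row: Drel = 82, Dnrel not reported.
extraction = evaluate(49, 78, 82)
```

Rounded to three decimals, both tuples agree with the corresponding rows of Table 1.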
The 74 documents with PE descriptions contain $D_{rel} = 82$ PEs. The results of using the system in PE extraction mode are shown in Table 1, where $D_{relretr}$ is the number of relevant primary PE descriptions, $D_{nrelretr}$ is the number of irrelevant primary PE descriptions, and $D_{retr}$ is the number of constructed PE descriptions. Table 1 shows the results averaged over 100 test runs.
CONCLUSION
On the basis of the proposed ontology, model and semantic network, we developed a system for extracting PE descriptions from English text. The system compiles the primary "input card" of a PE [1] by merging the semantic networks of PE descriptions into a single semantic network. The primary "output card" of a PE is formed from the text sentences on which the semantic networks were built. The developed approach can be used for other tasks [2] related to the processing of semi-structured texts. For example, it can be used to extract structured chemical knowledge in the form of chemical effects.
ACKNOWLEDGEMENT
This work was partly supported by the RFBR (grants 13-07-97032 and 13-01-00301).
REFERENCES
1. Fomenkov, S.A., D.A. Davydov and V.A. Kamaev, 2004. Modelirovanie i avtomatizirovannoe ispol'zovanie strukturirovannyh fizicheskih znanij: monografija. Mashinostroenie, Volgograd, VSTU, pp: 256 (in Russian).
2. Fomenkov, S.A., D.M. Korobkin and A.M. Dvorjankin, 2012. Programmnyj kompleks predstavlenija i ispol'zovanija strukturirovannyh fizicheskih znanij. Vestnik Komp'juternyh I Informacionnyh Tehnologij, 11: 24-28 (in Russian).
3. Korobkin, D.M. and S.A. Fomenkov, 2009. Metodika vydelenija strukturirovannoj fizicheskoj informacii v vide fizicheskih jeffektov iz teksta. Vestnik Komp'juternyh I Informacionnyh Tehnologij, 10: 35-39 (in Russian).
4. Korobkin, D.M. and S.A. Fomenkov, 2009. Modeli predstavlenija strukturirovannoj predmetnoj informacii v vide fizicheskih jeffektov v tekste na estestvennom russkom jazyke. Vestnik Komp'juternyh I Informacionnyh Tehnologij, 7: 17-21 (in Russian).
5. Benjamins, V., D. Fensel, S. Decker and A. GemezPerez, 2010. Building Ontologies for the Internet. Mid Term Report, pp: 302.
6. Domingue, J., 2009. Tadzebao and WebOnto: Discussing, Browsing and Editing Ontologies on the Web. Proc. of the Workshop on Knowledge Acquisition, Modeling and Management, Banff, Canada, pp: 24-35.
7. Bechhofer, S., I. Horrocks, C. Goble and R. Stevens, 2009. OilEd: A Reason-able Ontology Editor for the Semantic Web. Proc. of German/Austrian conf. on Artificial Intelligence, Springer-Verlag, Berlin, pp: 396-408.
8. Príncipe, J.C. and R. Miikkulainen, 2009. Advances in Self-Organizing Maps. Springer, Berlin, pp: 132.
9. Carpenter, G.A. and S. Grossberg, 2011. Normal and amnesia learning, recognition and memory by a neural model of cortico-hippocampal interactions. Trends in Neurosci, 16: 131-137.
10. Wu, H.C., R.W.P. Luk, K.F. Wong and K.L. Kwok, 2008. Interpreting tf–idf term weights as making relevance decisions. ACM Transactions on Information Systems, 26(3): 1-37.