Know-what: A Development of Object-Property Extraction from Thai Texts and Query System
Authapon Kongwan and Asanee Kawtrakul
NAiST Laboratory, Department of Computer Engineering, Kasetsart University, Bangkok, Thailand 10900
[email protected],
[email protected]
Abstract In the agricultural domain, Object-Property knowledge is useful knowledge that provides the ‘know-what’ to a farmer or an agriculturalist, such as a pest’s characteristics (color, size, etc.). This paper proposes a framework for extracting objects and their properties, together with a query system. The extraction problems consist of object identification, property relation identification, and property value identification. The query problem is how to retrieve an object based on fuzzy property values. For extraction, we propose NLP techniques for extracting the intended information, organized into 2 modules: property identification and object identification. For querying, we use fuzzy concepts to represent property values and propose a similarity measure for retrieving objects from given property information.
1 Introduction
Object-Property knowledge is knowledge about the properties (color, size, taste, etc.) of an object. In the agricultural domain, Object-Property knowledge is useful because it provides the ‘know-what’ to a farmer or an agriculturalist. The ‘know-what’ that a farmer needs is the answer to a question such as “What kind of insect has a black body that is 2 millimeters long?”. To respond to such a question, a question-answering system needs a knowledge base from which to retrieve answers for the user. Today, texts are an important store of human knowledge and they grow rapidly,
therefore, we can build an incremental knowledge base by extracting knowledge from texts. Knowledge Acquisition (KA), Knowledge Extraction (KE), and Knowledge Discovery (KD) are neighboring fields of research concerned with discovering various kinds of knowledge from various kinds of sources. The methods for discovering knowledge from texts can be grouped into 3 categories: Machine Learning, Pattern-Template Matching, and Text Data Mining. Machine learning techniques, such as Inductive Logic Programming (ILP), are used for constructing knowledge from texts. Delannoy et al. (Delannoy et al., 1993; Delisle et al., 1994) presented knowledge extraction systems called TANKA and MaLTe. English technical text is processed with natural language processing techniques, and the semantics of each sentence is represented in a logic format. After that, machine learning techniques, namely Inductive Logic Programming (ILP) and Explanation-Based Generalization (EBG), are used for constructing Horn clause rules. With this method, knowledge is built as rules learned from the arguments of the logical semantics. Although this method can yield unexpected knowledge, which may be valuable for discovering new knowledge in an open domain, we need explicit knowledge in a specific domain. The pattern-template matching method is widely used in the knowledge acquisition process. Gomez et al. (Gomez et al., 1994) proposed a methodology for acquiring knowledge from encyclopedic texts and implemented a program called SNOWY. The program acquires new concepts and conceptual relations about topics dealing with the dietary habits of animals, their classifications, and their habitats from the World Book Encyclopedia. The sentence is parsed on a syntactic basis until the meaning of the verb is recognized by the rules that
are called VM rules. The VM rules determine the meaning, or verbal concept, of the verb. After that, the semantics of the sentence is interpreted by activating the VM rules, and the final representation is constructed as a frame-like structure. Hahn and Schnattinger (Hahn and Schnattinger, 1997a; Hahn and Schnattinger, 1997b; Hahn and Schnattinger, 1998) described a German-language text knowledge assimilation system called SYNDIKATE. The system takes a knowledge-intensive approach. It starts when it reads an unknown lexical item and generates concept hypotheses from linguistic patterns and the ontology. The concept hypotheses are reduced using linguistic and conceptual constraints and are then assimilated into the domain knowledge base. Our work also relies on linguistic patterns. However, some elements of Object-Property knowledge cannot be extracted with linguistic templates alone; more semantic analysis is needed. Nahm and Mooney (Nahm and Mooney, 2002) presented a framework for text mining called DiscoTEX. Unstructured text is transformed into a structured database by using information extraction (IE) techniques. From the information extraction step, a collection of structured records is created, and data mining techniques are then applied to discover interesting relationships in the database. This framework can find interesting trends, associations, and relationships in a collection of documents. However, we need more semantic analysis to derive some elements of Object-Property knowledge in order to create the knowledge base for the question-answering system. In our work, Object-Property knowledge is extracted by 2 modules: property identification and object identification. The object in a Thai sentence may be omitted through zero anaphora or textual ellipsis. In that case, the object is identified from the previous sentences by our proposed algorithm.
Object-Property knowledge is then identified with linguistic patterns and constructed into a logic-format knowledge base. The user can use the query system to retrieve an object by giving property information. In this paper, we experimented and evaluated with texts in the agricultural domain. This paper is organized as follows: Section 2 describes the linguistic problems that appear in the Thai language. In Section 3, the framework for acquiring Object-Property knowledge is proposed. In Section 4, fuzzy concepts are used to query objects from given property information. Finally,
the evaluation and conclusion are given in Section 5.
2 Nontrivial Problems
2.1 Extraction Problems
Object-Property knowledge consists of 3 parts: the object, the property relation, and the property value. We categorize properties by the type of their value into 2 types: numerical properties and symbolic properties. A numerical value consists of 3 parts: the quantifier, the measurement, and the number. A symbolic value consists of 3 parts: the quantifier, the operator, and the symbol. These can be formalized as follows:
OPKB = (Ob, P, V)
P = {Pnum, Psym}
Pnum = {height, length, width, diameter, weight}
Psym = {color, taste, smell}
V = {Vnum, Vsym}
Vnum = ([Qnum], M, N)
Vsym = ([Qsym], [Op], S)
where:
OPKB is the Object-Property knowledge
Ob is the object
P is the property relation
V is the property value
Pnum is the numerical property relation
Psym is the symbolic property relation
Vnum is the numerical property value
Vsym is the symbolic property value
Qnum is the set of numeric quantifiers such as ประมาณ(approximately), ไม่น้อยกว่า(not less than)
Qsym is the set of symbolic quantifiers such as มาก(very), เข้ม(dark)
M is the set of measurements such as นิ้ว(inches), เมตร(meters)
Op is the set of symbolic operators such as ปน, อม, แกม
N is the number
S is the symbol
To extract each part of the knowledge from texts, several linguistic phenomena have to be handled.
2.1.1 Object Identification Problems
There are two major linguistic phenomena that cause problems in object identification. They are described below.
Zero Anaphora
Thai sentences make frequent use of zero anaphora, which leaves the object missing from the sentence. Zero anaphora is the omission of a noun phrase that refers to a noun phrase introduced earlier in the text. In sentences that contain Object-Property knowledge, zero anaphora almost always appears in the subject position. With zero anaphora in the subject, a Thai sentence can consist of a verb phrase only. An example of zero anaphora is shown below:
1) เพลี้ยไฟเป็นแมลงขนาดเล็ก
The aphid is a small insect.
2) Φ มีสีเหลืองหรือสีน้ำตาลอ่อน
Φ has a yellow or light brown color.
The subject (the aphid) of the second sentence is omitted at the Φ symbol, so the target object of the color property is missing. Thus, the target object of the second sentence has to be identified from the previous sentence.
Textual Ellipsis
A prepositional phrase attached to a noun phrase may be dropped from the sentence. This ellipsis leaves the semantics of the noun phrase incomplete. An example of textual ellipsis is shown below:
1) เพลี้ยไฟเป็นแมลงที่มีขนาดเล็ก
The aphid is a small insect.
2) ไข่(ของเพลี้ยไฟ)มีสีขาว
The egg (of the aphid) is white.
From the above, the noun phrase ‘egg’ in the second sentence means ‘the egg of the aphid’, not ‘an egg in general’. Thus, we have to recover the elided textual element before constructing the knowledge.
2.1.2 Property Relation Identification Problems
There are 2 nontrivial problems that have to be resolved in property relation identification. They are described below.
Explicit & Implicit Indication
The property relation may be indicated explicitly in the text. In this case, we can extract the property relation and its value directly from the text, as in the sentence below:
ดอกกุหลาบมีสีแดง
The rose has a red color.
From the above, the property relation (color) is indicated directly by the word ‘สี(color)’. In some cases the property relation is indicated only implicitly. The value is then a crucial cue from which we can derive the property relation, as in the sentence below:
ลูกมะม่วงเปรี้ยวมาก
The mango fruit is very sour.
From the above, the property relation (taste) is derived from the word ‘เปรี้ยว(sour)’, which is a value of the taste property. Furthermore, the structure of the value is also an important cue for deriving the property relation, as in the sentence below:
โต๊ะขนาด 120x60x75 เซ็นติเมตร
The table’s size is 120x60x75 centimeters.
From the above, the number ‘120x60x75’ indicates three property relations: width, length, and height. In texts, the size of an object is described by the structure of the value: the first number is the width, the second is the length, and the third is the height.
Ambiguity in Explicit Indication
A word that indicates the property relation can be ambiguous, as in the sentences below:
1)
ปลาคาร์พขนาดประมาณ 1 กิโลกรัม
The carp’s size is approximately 1 kilogram.
2) ต้นปาล์มขนาด 3 เมตร
The palm tree’s size is 3 meters.
From the above, the word ‘ขนาด(size)’ is ambiguous: it indicates the weight property in the first sentence and the height property in the second. In this case, the unit of measurement is the cue that determines the property relation.
2.1.3 Property Value Identification Problem
Values appear in texts in many patterns, and the extraction system has to recognize all elements of a value. Quantifiers such as ‘ประมาณ(approximately)’ and ‘ไม่ต่ำกว่า(not less than)’ and measurements such as ‘นิ้ว(inches)’ and ‘เมตร(meters)’ attach to numerical values. Quantifiers such as ‘มาก(very)’ and ‘เข้ม(dark)’ attach to symbolic values. Furthermore, in a symbolic value, the operator is a crucial element that combines two or more symbols into one value, for example ‘เขียวอมเหลือง(yellow-green)’.
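As a minimal sketch of how the elements of a numerical value might be recognized, the following Python snippet matches an optional quantifier, a number, and a measurement word, and uses the measurement to narrow down the candidate properties when the cue word is the ambiguous ‘ขนาด(size)’. The lexicon entries, the names `QUANTIFIERS`, `UNITS`, `NumericValue`, and `parse_numeric_value`, the candidate properties listed for each unit, and the use of a regular expression over raw text are illustrative assumptions, not the implementation of the system described in this paper.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical lexicons built only from words quoted in the text.
QUANTIFIERS = {"ประมาณ": "approx", "ไม่ต่ำกว่า": "notless", "ไม่น้อยกว่า": "notless"}

# Measurement word -> (unit name, properties the unit can describe).
# The candidate lists are an assumption made for this sketch.
UNITS = {
    "นิ้ว": ("inch", ("length", "height", "diameter")),
    "เมตร": ("meter", ("length", "height", "diameter")),
    "กิโลกรัม": ("kilogram", ("weight",)),
}


def _alternation(words):
    """Build a regex alternation, longest word first to avoid partial matches."""
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


# Optional quantifier, then a number, then a measurement word.
VALUE_RE = re.compile(
    rf"(?P<q>{_alternation(QUANTIFIERS)})?\s*"
    rf"(?P<n>\d+(?:\.\d+)?)\s*"
    rf"(?P<u>{_alternation(UNITS)})"
)


@dataclass(frozen=True)
class NumericValue:
    """Vnum = ([Qnum], M, N) plus the properties the unit allows."""
    quantifier: Optional[str]
    number: float
    measure: str
    candidate_properties: Tuple[str, ...]


def parse_numeric_value(text: str) -> Optional[NumericValue]:
    """Return the first numerical value found in text, or None."""
    match = VALUE_RE.search(text)
    if match is None:
        return None
    measure, candidates = UNITS[match.group("u")]
    quantifier = QUANTIFIERS[match.group("q")] if match.group("q") else None
    return NumericValue(quantifier, float(match.group("n")), measure, candidates)


if __name__ == "__main__":
    # The two 'ขนาด(size)' sentences from Section 2.1.2.
    print(parse_numeric_value("ปลาคาร์พขนาดประมาณ 1 กิโลกรัม"))
    print(parse_numeric_value("ต้นปาล์มขนาด 3 เมตร"))
```

In this sketch the kilogram in the carp sentence leaves weight as the only candidate property, while the meter in the palm tree sentence rules weight out. Choosing among the remaining length-like properties would still need further context.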
2.2 Query Problem
The query problem is how to retrieve an object based on fuzzy property values. To query an object, the user must give property information to the query system. Some property values are similar to one another, such as red, dark red, and orange; therefore, a good representation and a similarity measure are needed for retrieving the object.
3 Framework of Property Extraction
3.1 Corpus Preparation
We prepared a corpus of 440 sentences that are chunked into phrases. The sentences were chunked by a parser (Charniak, 1997; Johnson, 1998) and verified by a linguistic expert. An example of a sentence in the corpus is shown below.
[ [เมล็ด/ncn หน่อไม้ฝรั่ง/ncn]/NP [มี/vt [เส้นผ่าศูนย์กลาง/ncn]/NP ประมาณ/qubo /blk 0.2/nnum /blk นิ้ว/cl ]/VP ]/S
เมล็ดหน่อไม้ฝรั่งมีเส้นผ่าศูนย์กลางประมาณ 0.2 นิ้ว
The seed of asparagus has a diameter of approximately 0.2 inches.
Figure 1: Example of Sentence in Corpus.
3.2 Object Identification
Zero anaphora and textual ellipsis are resolved in this module. From our observation, zero anaphora in a sentence that contains Object-Property knowledge can be resolved by taking the object from the subject of the previous sentence. Noun phrases such as ‘ขา(leg)’ and ‘ปีก(wing)’ are part-whole objects. In Thai, the possessor of a part-whole object is almost always omitted. We can find the possessor of a part-whole object by searching the previous sentences for a noun that has a part-of relation with the object in WordNet. The ellipsis resolution algorithm is given in Figure 2.
Input: S is the current sentence
Output: S is the resolved sentence
EllipsisRes(S):
  N = getSubject(S)
  if N has no possessive preposition:
    Sp = getPreviousSentence(S)
    if N has the "part, piece" or "plant part" sense:
      while Sp != Null:
        I = getPreviousSubject(Sp)
        if N is the same as I:
          if I has no prep(I):
            Sp = getPreviousSentence(Sp)
            continue
          addPrep(N, prep(I))
          return S
        else if N is a meronym of I:
          addPrep(N, I)
          return S
        else if N is a meronym of prep(I):
          addPrep(N, prep(I))
          return S
        else:
          Sp = getPreviousSentence(Sp)
  return S
Figure 2: Ellipsis Resolution Algorithm.
3.3 Property Identification
The property relation and the property value are identified in this module. There are 2 types of pattern in this module: the sentence pattern and the value pattern. Each element of knowledge is identified in the sentence by a sentence pattern. An example of the sentence pattern for the sentence in Figure 1 is shown below:
[ object [ มี/vt [ prop ]/NP value ]/VP ]/S
The ‘object’ in the sentence pattern marks the location of the object identified in the sentence. The ‘prop’ marks the location of the word that identifies the property relation. Table 1 shows examples of the property lexicon from which the property relation is derived. The ‘value’ marks the location of the value pattern, which identifies each element of the property value. An example of the value pattern is shown below:
qnum /blk num /blk measure
The ‘qnum’ is the quantifier of the numerical value, the ‘num’ is its number, and the ‘measure’ is its unit of measurement. Background knowledge for each element of the property value is used to construct the knowledge. Table 2 shows examples of the background knowledge for each element of the property value. The final representation is formulated in logic format as below:
pname(x1, หน่อไม้ฝรั่ง), part(x1, เมล็ด), prop(diameter, x1, v1), value(v1, 0.2, inch, approx)

Property | Word
length | ความ/pref1 ยาว/vi, ยาว/vi
height | ความ/pref1 สูง/vi, สูง/vi
diameter | เส้นผ่าศูนย์กลาง/ncn, เส้นผ่านศูนย์กลาง/ncn
weight | น้ำหนัก/ncn, หนัก/vi
taste | รส/ncn, รสชาด/ncn
Table 1: Examples of Property Lexicons

Word | Measure | Property | Ratio
นิ้ว/cl | inch | length, height, diameter | 25.4
เซ็นติเมตร/cl | centimeter | length, height, diameter | 10
ปอนด์/cl | pound | weight | 453.59

Word | Qnum
ประมาณ/qubo | approx
ไม่/neg ต่ำ/vi กว่า/qubo | notless

Symbolic Value | Property
เปรี้ยว/vi | taste
หวาน/vi | taste
แดง/adj | color
เขียว/adj | color
Table 2: Examples of Background Knowledge

4 Query System
Fuzzy theory (Zadeh, 1965; Zadeh, 1983) makes it feasible to compute with values that are expressed in words, through linguistic variables whose values can be words rather than numbers. Property values contain quantifiers such as “ประมาณ(approximately)”, “อ่อน(light)”, and “เข้ม(dark)” that make the values fuzzy. Thus, fuzzy theory plays an important role in representing and computing the linguistic modifiers of property values in Object-Property knowledge. Figure 3 depicts examples of fuzzy values. To measure similarity, we choose the cosine angle distance. Property values are divided into 2 categories: numerical values and symbolic values.
Figure 3: Fuzzy Value. (a) Numerical Values: membership functions µ for the values “10 approximately” and “10-50 approximately”. (b) Symbolic Values: membership functions µ over Red, Green, and Blue for the values “red” and “dark red”.
Symbolic Value Similarity
Let i be the query property, j the target object property, and k the index over the sampling points of the membership function µ. Then the similarity function sim(i, j) for symbolic values is:
\[ sim(i, j) = \frac{\sum_k \left( \mu(i)_k \times \mu(j)_k \right)}{\sqrt{\left( \sum_k \mu(i)_k^2 \right) \times \left( \sum_k \mu(j)_k^2 \right)}} \]
Numerical Value Similarity
The similarity measure for numerical values is adapted from the similarity measure for symbolic values. With i, j, and k defined as above, the similarity function sim(i, j) for numerical values is:
\[ sim(i, j) = \frac{\sum_k \left( \mu(i)_k \times \left( \mu(i)_k \cap \mu(j)_k \right) \right)}{\sqrt{\sum_k \mu(i)_k^2 \times \sum_k \left( \mu(i)_k \cap \mu(j)_k \right)^2}} \]
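The two similarity measures of this section can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in this work: the membership functions are sampled into equal-length lists, the fuzzy intersection ∩ is taken to be the pointwise minimum, and the triangular membership shape and all function names (`cosine`, `symbolic_similarity`, `numerical_similarity`, `triangular`) are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equally sampled membership vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def symbolic_similarity(mu_i: Sequence[float], mu_j: Sequence[float]) -> float:
    """Similarity of a query value i and a target value j for symbolic values."""
    return cosine(mu_i, mu_j)


def numerical_similarity(mu_i: Sequence[float], mu_j: Sequence[float]) -> float:
    """Cosine between the query and its intersection with the target.

    Assumption: the intersection is the pointwise minimum.
    """
    intersection = [min(x, y) for x, y in zip(mu_i, mu_j)]
    return cosine(mu_i, intersection)


def triangular(center: float, spread: float, xs: Sequence[float]) -> List[float]:
    """Hypothetical membership function for 'approximately center'."""
    return [max(0.0, 1.0 - abs(x - center) / spread) for x in xs]


if __name__ == "__main__":
    xs = list(range(0, 101))          # sampling points k
    query = triangular(10, 5, xs)     # "approximately 10"
    near = triangular(12, 5, xs)      # "approximately 12"
    far = triangular(80, 5, xs)       # "approximately 80"
    print(numerical_similarity(query, near))
    print(numerical_similarity(query, far))
    print(symbolic_similarity([1.0, 0.0, 0.0], [1.0, 1.0, 0.0]))
```

Because the numerical measure compares the query with its intersection with the target, a target whose fuzzy value does not overlap the query at all scores 0 in this sketch, and a target identical to the query scores 1.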
5 Evaluation & Conclusion
We evaluate the extraction system on agricultural documents containing 440 sentences. The extraction system is evaluated with the following precision and recall, where K is the Object-Property knowledge.
\[ Precision = \frac{\#\ \text{correctly extracted } K}{\#\ \text{all extracted } K} \]
\[ Recall = \frac{\#\ \text{correctly extracted } K}{\#\ K \text{ in documents}} \]
The extraction system achieves a precision of 88.88% and a recall of 47.05%. The extraction system extracts Object-Property knowledge from sentences, but Object-Property knowledge can also be expressed in a noun phrase, such as ‘เพลี้ยกระโดดสีน้ำตาล(Brown planthopper)’. In some cases, the property value is expressed as a comparison, such as ‘ผลมังคุดจะมีขนาดเล็กกว่ากำมือเล็กน้อย(the mangosteen fruit’s size is slightly smaller than a fist)’. To improve performance, noun phrase analysis and inference over comparisons are needed. In summary, this paper proposes a framework for extracting objects and their properties, together with a query system. In the extraction module, we propose NLP techniques for extracting the intended information with linguistic patterns, as well as an algorithm for ellipsis resolution. In the query system, we propose a similarity measure for retrieving objects.
References
Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of AAAI/IAAI, pages 598–603.
J. F. Delannoy, C. Feng, S. Matwin, and S. Szpakowicz. 1993. Knowledge extraction from text: Machine learning for text-to-rule translation. In Proceedings of the Workshop on Machine Learning Techniques and Text Analysis, European Conference on Machine Learning (ECML-93), Vienna, Austria.
S. Delisle, K. Barker, J. F. Delannoy, S. Matwin, and S. Szpakowicz. 1994. From text to Horn clauses: Combining linguistic analysis and machine learning. In Proceedings of the 10th Canadian Artificial Intelligence Conference, CAI-94, Canada.
Fernando Gomez, Richard Hull, and Carlos Segami. 1994. Acquiring knowledge from encyclopedia texts. In Proceedings of the 4th ACL Conference on Applied Natural Language Processing, Stuttgart, Germany, October.
Udo Hahn and Klemens Schnattinger. 1997a. Deep knowledge discovery from natural language texts. In Proceedings of Knowledge Discovery and Data Mining, pages 175–178.
Udo Hahn and Klemens Schnattinger. 1997b. Knowledge mining from textual sources. CIKM 1997.
Udo Hahn and Klemens Schnattinger. 1998. Towards text knowledge engineering. In Proceedings of AAAI/IAAI, pages 524–531.
Mark Johnson. 1998. PCFG models of linguistic tree representations. Computational Linguistics, 24(4):613–632.
Un Yong Nahm and Raymond J. Mooney. 2002. Text mining with information extraction. In Proceedings of the AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases, Stanford, CA, March.
Lotfi A. Zadeh. 1965. Fuzzy sets. Information and Control, 8:338–353.
Lotfi A. Zadeh. 1983. A computational approach to fuzzy quantifiers in natural languages. Computers and Mathematics with Applications, 9:149–184.