Natural Language Processing

Application and Difficulty of Natural Language Processing in Chinese Temporal Information Extraction

Wenjie Li

Kam-Fai Wong

Chunfa Yuan

Department of Computing The Hong Kong Polytechnic University, Hong Kong

Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong

Department of Computer Science and Technology, Tsinghua University, Beijing, P.R.China


Abstract
Conventional information systems cannot cater for temporal information effectively. They may not be able to handle users' queries concerning how one action relates to another in time. It is therefore useful to capture and maintain the temporal knowledge associated with each action in an information system. In this paper, we propose a general frame structure for maintaining extracted temporal concepts and a system, known as TICS, for extracting time-dependent information from Hong Kong financial news. In the system, temporal knowledge is represented in the form of types of temporal concepts and temporal relations. When analyzing a sentence, TICS first determines the situation related to the predicate verb, which in turn identifies the type of temporal concept. Based on that, the relevant temporal information is extracted and temporal relations are derived. These relations link relevant concept frames together in chronological order and provide the knowledge needed to fulfill users' queries.

Keywords: Chinese language processing, temporal information extraction, temporal classification, temporal relations

1. Introduction
Temporal information is at least as important as other kinds of information in domains where extracting and tracking information over time occurs frequently, such as planning [2], scheduling [4] and question answering [6]. It may be given as an explicit, direct expression in written language, such as 公司于 97 年 5 月倒閉 (the company was closed in May 1997), or it may be left implicit, to be recovered by readers from the surrounding text. For example, one may know that 地震前公司已經倒閉了 (the company was closed before the earthquake) without knowing the exact time of the bankruptcy. Relative temporal knowledge of this kind, where the precise time is unavailable, is typically worked out by human readers. An information system that does not account for it properly is rather restrictive.

It is hard to separate temporal information discovery from natural language processing (NLP). In English and other Western languages, tense and aspect, reflected by different verb forms, are important elements of a sentence for expressing temporal information and for transforming situations into temporal logic operators. In contrast, a Chinese verb appears in only one form, no matter whether the event it describes finished in the past or will take place in the future. The lack of regular morphological tense markers makes Chinese temporal expressions complicated, and the conventional approach of determining temporal information from verb affixation is therefore inapplicable.

Over the past years, there has been considerable progress in information extraction (IE) and temporal logic for English [5,6]. Nevertheless, only a few researchers have investigated these areas for Chinese. This motivates us to exploit Chinese language processing techniques for temporal IE and temporal reasoning. "Extracting temporal information from Chinese financial news" is a collaborative research project between the Chinese University of Hong Kong and Tsinghua University, Beijing (see footnote 1), which aims to design and develop a general framework for temporal IE. The final system, referred to as TICS, will accept a series of Chinese financial texts as input, analyze the sentences one by one to extract the desired temporal information, represent each piece of information in a concept frame, link all frames together in chronological order based on inter- or intra-event relations, and finally apply this linked knowledge to fulfill users' queries [11, 12, 21]. This paper introduces the concepts and techniques involved in temporal classification and temporal relation discovery. The difficulties encountered in the task are also discussed.

The rest of the paper is organized as follows. Sections 2 and 3 explain temporal classification and temporal relations in natural languages, respectively, and describe the corresponding models in TICS. Section 4 discusses the difficulties in Chinese temporal information extraction. Finally, Section 5 concludes the paper.

Footnote 1. Under the NSFC (China) grant (project number 69975008) and the RGC (Hong Kong) direct grant (project number 2050237).

2. Temporal Classification

2.1. Temporal Classification in Natural Languages
It is well known that the semantics of temporal expressions in any language cannot be modeled adequately without a classification of verbs in terms of the situations they describe. In general, the properties used to determine verb categories are (1) static/dynamic; (2) instantaneous/durative; and (3) telic/atelic (see footnote 2).

static/dynamic: static is a property referring to stableness. A static situation is considered as existing rather than happening. In most cases, the static/dynamic distinction is straightforward.

durative/instantaneous: a dynamic situation is either instantaneous or durative. An instantaneous situation can only persist for a very short time period and can therefore hardly contain a relatively stable process. In contrast, a situation is said to be durative when there is an obvious distance between its starting and ending points.

telic/atelic: many situations describing changes include inherent climaxes, i.e. the points which have to be reached for the changes to be considered complete. Once such situations begin, they progress step by step towards their climaxes, where they finally stop. In contrast, other actions may be viewed as complete regardless of where they end.

Based on the above three criteria, a number of verb classifications have been proposed in the literature. In spite of their differences in the number, names, nature and properties of the categories, most of them originated from Vendler's taxonomy [20], in which verbs were divided into state, activity, accomplishment and achievement. A similar classification scheme, but with three rather than four classes, was proposed independently by Kenny [10]. His three categories were activities, performances and states. In his scheme, Vendler's accomplishments and achievements were treated as two sub-species of performances. Moens [14], however, distinguished between five categories, namely state, process, culminated process, culmination and point. Vendler's classification was later extended by Mourelatos [15], who took conscious and sensory occurrences into consideration. He believed that the trichotomy activity-performance-state fell under an ontological trichotomy of wider scope, viz. process-event-state. The event can be further divided into development and punctual occurrence if necessary. This idea was also adopted by Allen [3] and Parsons [5]. Based on their ideas, similar classification schemes for Chinese verbs have been proposed, including the work of Deng [22] and of Chen [23]. Based on past experience, a three-way classification scheme is adopted in the proposed IE system, TICS. Its classes are defined as follows:

Definition 1: A state is defined as a stable situation, which does not involve changes (see footnote 3). It refers to a fact or a property. It differs from the others in that it cannot be qualified as an action at all.

Definition 2: A process refers to an ongoing or continual activity. It is an essential feature of processes that they have durations, but the time stretches are inherently indefinite. In other words, they can start or end arbitrarily.

Definition 3: An event describes a situation which occurs instantaneously. Its beginning and ending boundaries are so close to each other that it is difficult to perceive them as separate. Such an event takes time but does not last, in the sense that it lacks a process of change.

The need for verb classification is apparent. It is necessary for natural language understanding, in particular for explaining time measurement. For instance, following a durative verb, a duration refers to the duration of the action itself, e.g. 等了三天 (waited for three days). In contrast, following an instantaneous verb, a duration refers to the period from the realization of the action or state to the time of speech, e.g. 死了三天 (died three days ago). The interpretation of time can benefit information deduction and concept reasoning. For example, consider the news 公司去年三月倒閉 (the company was closed last March). Since the verb 倒閉 (close) indicates an event, we can deduce that the action must have occurred at a certain time point in last March. As a result, a new state, i.e. that the company no longer runs its normal business after that time point, can be deduced.

2.2. Linguistic Criteria for Temporal Classification
The simplest way to determine temporal classes is to examine the syntactic properties of a sentence. Vendler's original observation was that a small number of simple grammatical tests were sufficient to distinguish the four categories in his proposed scheme [20]. According to him, neither achievement verbs nor stative verbs accept continuous tenses, and both achievements and accomplishments are compatible with the for-adverbial. Despite the differences in meaning, both also admit the in-adverbial. Further advancement was made in Kenny's analysis [10]. He introduced a table of tense-implications and nine supplementary linguistic criteria, which included permissible adverbial phrases, paraphrase possibilities and transformations of mood or voice.

Footnote 2. An action is said to be telic when it describes a change including an inherent climax, i.e. the point to be reached when the change completes. The sentence objects could accommodate this distinction.

Footnote 3. A state may arise as a result of a change and may provide the trigger for a change, but the state itself does not constitute the change.

Mourelatos extended Vendler's scheme by emphasizing the roles of verb arguments and of aspect [15]. He claimed that temporal classifications would be incorrect if they were derived simply from the semantics of individual verbs. His paper states that "in all cases a total of six factors are involved: (1) the verb's inherent meaning; (2) the nature of the verb's arguments, i.e. of subject and of object(s); (3) adverbials; (4) aspect; (5) tense as phase; (6) tense as time reference." Moreover, the effects of tense and aspect, especially the progressive and perfect forms, were emphasized in [7,14,15,16].

2.3. A Model for Automatic Temporal Classification in TICS
As mentioned above, the aspectual properties of English sentences are not determined simply by the lexical meaning of the main verbs. The same holds for Chinese. A Chinese verb may have more than one temporal attribute, so the determination of the proper class depends heavily on context. For instance, the verb 掛 (hang) may introduce two situations, i.e. 往牆上掛照片 (hang a picture on the wall) and 牆上掛著照片 (a picture is hung on the wall). The former indicates an event while the latter indicates a state. Thus, if we simply classified sentences based on the temporal properties of the verb, ambiguities would arise. To enable an information extraction system to model the temporal aspects of Chinese verbs, a two-step approach to automatic temporal classification is proposed [13, 21]: (1) analyze the temporal nature of the verbs and classify them into five groups (verb level); and (2) if a verb is temporally polysemous, use contextual information to resolve the ambiguity (sentence level). The flow chart of the model is shown in Figure 1.
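The two-step procedure can be illustrated with a minimal Python sketch. It is not the TICS implementation: the lexicon entries, the mapping from verb-level groups to sentence-level classes, and the single context rule (the aspect marker 著 directly after the verb signals a state) are all assumptions made for illustration, using only the examples given in the text.

```python
from typing import Optional

# Hypothetical verb-level lexicon: verb -> one of the five verb-level groups
# named in Figure 1 (attribute, mentality, instantaneous, activity, ambiguity).
VERB_GROUPS = {
    "具備": "attribute",
    "等": "activity",
    "倒閉": "instantaneous",
    "掛": "ambiguity",
}

# Assumed mapping from unambiguous groups to sentence-level classes.
GROUP_TO_CLASS = {
    "attribute": "state",
    "mentality": "state",
    "activity": "process",
    "instantaneous": "event",
}


def classify(sentence: str, verb: str) -> Optional[str]:
    """Return 'state', 'process' or 'event', or None for an unknown verb."""
    # Step 1 (verb level): look up the verb's temporal group.
    group = VERB_GROUPS.get(verb)
    if group is None:
        return None
    if group != "ambiguity":
        return GROUP_TO_CLASS[group]
    # Step 2 (sentence level): resolve the ambiguity from context.
    # Toy rule: verb followed by the durative marker 著 reads as a state.
    if verb + "著" in sentence:
        return "state"
    return "event"


print(classify("往牆上掛照片", "掛"))
print(classify("牆上掛著照片", "掛"))
```

A real system would need many more context rules than the single marker test used here; the sketch only shows where the verb-level and sentence-level decisions sit relative to each other.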

3. Temporal Relation Discovery

3.1. Background
In English, temporal information is basically reflected by tense [14]. The pioneering work of Reichenbach [18] on tenses forms the basis of many subsequent research efforts in temporal NLP, e.g. the work of Prior on tense logic [17] and of Hwang et al. on the analysis of temporal adverbials [9]. Reichenbach argued that the tense system provides predication over three underlying times, namely S (speech time), R (reference time), and E (event time). In addition, he pointed out that tenses could be applied to processes by coupling them with aspects. Later, Bruce introduced the multiple temporal references model [6]. To facilitate logic manipulation, he proposed seven first order logic relationships based on time intervals and a method to map nine English tenses into temporal first order expressions. The seven relationships are symbolized as R(A,B) for a relation R and time intervals A and B, where R is one of before, after, during, contains, same-time, overlaps and overlapped-by. His work laid the foundation of temporal logic in natural language. These relations were then expanded to nine in [1] and further to thirteen in [2], where meet, met-by, starts, started-by, finishes and finished-by supplement Bruce's temporal relations.

For quite a long time, linguists have argued about whether tenses exist in Chinese and, if they do, how they are expressed. Unlike verbs in Western languages such as English and French, Chinese verbs appear in only one form. We believe that Chinese does have tenses, but that they are not expressed through different verb forms. Rather, they are determined with the assistance of temporal adverbs and aspect auxiliary words. For example, 在...呢, 已經...了 and 要... express an ongoing action, a situation started or finished in the past, and a situation which will occur in the future, respectively. Following this idea, Li defined four Chinese tenses on absolute time points and seven on relative time points [24].

3.2. Temporal Relations Defined in TICS
TICS handles two types of temporal relations, namely absolute and relative relations.

3.2.1. Absolute Relation
The role of absolute relations is to position concepts on the time axis. These relations depict the beginning and/or ending time bounds of an occurrence or its relevance to reference times. Seven such relations are illustrated in Figure 2. They are further classified into two sub-types, namely definite and indefinite relations. Definite relations, such as ON and BEGIN, are used when the time of an occurrence is explicitly specified in a sentence. Indefinite relations, on the other hand, represent situations where only reference times are mentioned.

Figure 1. Flow chart of the temporal classification model in TICS. (Resources: Modern Chinese Verb Machine-Readable Dictionary, Modern Chinese Dictionary. Verb-level classification of documents: attribute, mentality, instantaneous, activity, ambiguity. Sentence-level classification: state, process, event.)

Figure 2. Absolute temporal relations. (Elementary, definite: ON, BEGIN, END, ONGOING. Elementary, indefinite: PAST, FUTURE. Composite: CONTINUED. The legend distinguishes time bounds that exist from those that may or may not exist.)

Reference times include explicit time, speech time and news publication time. If an explicit reference time does not appear in the sentence, the speech time is assumed to be the reference time. If the speech time is not present either, the publication time serves as the reference time.
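The fallback order just described can be written as a short Python sketch. The function name and the use of plain strings for times are assumptions for illustration; the source does not specify how TICS represents times internally.

```python
from typing import Optional


def resolve_reference_time(
    explicit: Optional[str],
    speech: Optional[str],
    publication: str,
) -> str:
    """Pick the reference time: explicit time, else speech time, else publication time."""
    if explicit is not None:
        return explicit
    if speech is not None:
        return speech
    # The news publication time is assumed to be always available.
    return publication


print(resolve_reference_time(None, None, "1999-10-05"))
```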

3.2.2. Relative Relation
In many cases, the time when a concept takes place may not be known, but its relevance to the time of another occurrence is given. For instance, one may know that event Ex occurred before event Ey without knowing the exact time when they took place. This kind of relative temporal knowledge is captured by relative relations in TICS. Allen has proposed thirteen relations. The same relations are adopted in our system but are extended by including a time distance d as an optional parameter, where d(Ex,Ey) indicates the distance between the two events. Relative relations are derived either directly from a sentence describing two concepts, or indirectly from the absolute relations of two individual concepts. Clearly, an effective deduction mechanism could produce more information than that directly expressed by the sentences.
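As an illustration of the indirect case, the following Python sketch derives one of the thirteen interval relations from the known time bounds of two occurrences. It is a sketch under stated assumptions, not the TICS deduction mechanism: intervals are (start, end) pairs of numbers with start < end, the relation names follow those listed in Section 3.1, and defining the distance d only for before/after (as the gap between the two intervals) is a choice made here for illustration.

```python
from typing import Optional, Tuple

Interval = Tuple[float, float]  # (start, end), assumed start < end


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the relation holding between interval a and interval b."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meet"
    if b2 == a1:
        return "met-by"
    if a1 == b1 and a2 == b2:
        return "same-time"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    # Only partial overlap remains at this point.
    return "overlaps" if a1 < b1 else "overlapped-by"


def relate(a: Interval, b: Interval) -> Tuple[str, Optional[float]]:
    """Return (relation, d), where d is the gap for before/after and None otherwise."""
    rel = allen_relation(a, b)
    if rel == "before":
        return rel, b[0] - a[1]
    if rel == "after":
        return rel, a[0] - b[1]
    return rel, None


print(relate((1, 2), (5, 6)))
```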

3.3. A Model for Temporal Relation Discovery in TICS
Suppose TR indicates a temporal relation, E indicates an event and T indicates time. The absolute and relative relations are symbolized as OCCUR(Ei, TR(T)) (see footnote 4) and TR(Ei, Ej), respectively. For the absolute relation of a single event, T is a very important parameter, which includes the event time te, the reference time tr (see footnote 5) and the speech time ts.

Some Chinese words function as temporal indicators. These include time words (tw), time position words (f), temporal adverbs (adv), auxiliary words (aux), prepositions (p), auxiliary verbs (va), trend verbs (vc) and some special verbs (vv). They are all regarded as elements of the temporal indicator TI. Each type of indicator, e.g. TW, contains a set of words, such as TW = twlist = {tw1, tw2, ..., twn}, with each word having a temporal attribute, indicated by ATT. The core of the model is a rule set R, which maps the combined effect of all indicators, TI, in a sentence to the corresponding temporal relation, TR.

To determine the proper temporal relation for each relevant sentence, the input documents are first pre-processed. For each input document, sentences containing indicator words are identified. For each sentence, verbs are processed semi-automatically, i.e. they are automatically marked based on the verb list. Undetected verbs are then manually checked and added to the verb list. After pre-processing, all documents are tagged. Rules from the Rule Base are applied to the tagged texts to discover the desired temporal relations (TR). Figure 3 shows the flow chart of the process.

Footnote 4. OCCUR is a predicate for the happening of a single event. Where there is no ambiguity, Ei can be omitted. Thus, OCCUR(Ei,TR(T)) is simplified to TR(T).

Footnote 5. There may exist more than one reference time in a statement.
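The idea of a rule set R that maps the combined attributes of the indicators in a sentence to a temporal relation can be sketched in Python as follows. The indicator entries, attribute names and the three rules are hypothetical and cover only the patterns mentioned in Section 3.1 (已經...了, 在...呢, 要/將); the actual TICS rule base is not given in the text, and the input is assumed to be already segmented.

```python
from typing import List, Optional

# Hypothetical indicator lexicon: word -> (indicator type, temporal attribute ATT).
INDICATORS = {
    "已經": ("adv", "past"),
    "了": ("aux", "completed"),
    "要": ("va", "future"),
    "將": ("adv", "future"),
    "在": ("adv", "progressive"),
    "呢": ("aux", "progressive"),
}

# Hypothetical rule set R: (required attributes, resulting temporal relation).
# Rules are tried in order; the first whose attributes are all present wins.
RULES = [
    ({"past", "completed"}, "PAST"),
    ({"progressive"}, "ONGOING"),
    ({"future"}, "FUTURE"),
]


def discover_tr(tokens: List[str]) -> Optional[str]:
    """Map the indicator words found in a segmented sentence to a temporal relation."""
    attributes = {INDICATORS[t][1] for t in tokens if t in INDICATORS}
    for required, relation in RULES:
        if required <= attributes:
            return relation
    return None


print(discover_tr(["公司", "已經", "倒閉", "了"]))
```

Because the lookup is purely lexical, this sketch inherits exactly the segmentation and tagging problems discussed in Section 4, for example a preposition use of 將 would be misread as a future marker.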

4. Difficulties in Temporal Information Extraction

4.1. Difficulties in Parsing

4.1.1. Word Segmentation
A word segmenter usually adopts the maximal matching algorithm: when looking for a word, it attempts to match the longest word possible. This unavoidably introduces some lexically reasonable but semantically incorrect phrases (see E1), which could lead to an incorrect interpretation of the temporal effect of the sentence. For example, 年前 三個季度 (three quarters before 1999) instead of 九九年 前三個季度 (the first three quarters in 1999) would be wrongly recognized as the time for which 發電量完成情況 (situations of the generated energy) is reported.

E1. 國際電力宣佈其九九年前三個季度發電量完成情況 → 國際 電力 宣佈 其 九九 年前 三個 季度 發電量 完成 情況 (International Electricity announced its accomplishment of the generated energy in the first three quarters in 1999)

In addition, segmentation often produces many single-character words, which may then be assigned to the wrong part-of-speech category. For example, in 會景閣雖然只錄得三宗成交 (although three transactions are recorded for 會景閣), 會景閣 as a whole is a noun, the name of a building, and should not be broken up. Breaking it into two words (such as 會_景閣) runs the risk of incorrectly recognizing 會 (will) as a temporal indicator for a future event.
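The segmentation error in E1 can be reproduced with a minimal forward maximal matching sketch in Python. The toy lexicon is an assumption for illustration: it contains 年前 but, like many real lexicons, has no entry for the open-class year expression 九九年, so the greedy longest-match strategy splits the sentence as 九九 / 年前.

```python
from typing import List, Set

# Toy lexicon (an assumption): note that 年前 is listed but 九九年 is not.
LEXICON: Set[str] = {
    "國際", "電力", "宣佈", "九九", "年前",
    "三個", "季度", "發電量", "完成", "情況",
}


def forward_max_match(text: str, lexicon: Set[str], max_len: int = 3) -> List[str]:
    """Greedy left-to-right segmentation that always takes the longest lexicon match."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


sentence = "國際電力宣佈其九九年前三個季度發電量完成情況"
print(" ".join(forward_max_match(sentence, LEXICON)))
```

With this lexicon the output contains 九九 年前 三個 季度, which is the lexically reasonable but semantically incorrect reading described in E1.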

4.1.2. Part-of-Speech (POS) Tagging
The major difficulty related to part of speech is that many Chinese words belong to more than one POS category, depending on their position relative to neighboring words. For instance:

Figure 3. Flow chart of the temporal relation discovery model in TICS. (Pre-processing of documents uses the Indicator Word List and the Verb List, with automatic tagging, manual marking and the addition of new verbs. The Rule Base is then applied to discover TR, e.g. ON, PAST, AFTER.)

(1) In E2, is 收購 (take-over) a predicate verb or a non-predicate verb? When searching for news related to it, should (b) be ignored because 收購 does not appear to be the predicate verb there? If not, what should the time slot be filled with?

E2 (a) 向集團收購零售全部股份 (take-over the complete shares of the sales department of the corporation) (b) 正商討收購事議 (discussing take-over issues)

(2) In E3, is 公佈 (announce) a verb with a noun object or a verb with a sentential object? The former indicates an event by itself, whereas the latter indicates that the event described by the following embedded sentence is declared.

E3 (a) 本周半島豪庭新盤會公佈定價 (prices of Peninsula Villa will be announced this week) (b) 香港中旅國際投資公佈于九九年上半年業勣理想 (China International Investment HK Co. announced that the business in the first half year of 1999 was satisfactory)

(3) The word 將 (will) has two syntactic functions: it is either an adverb indicating a future event, or a preposition with a meaning similar to 把 (put), see E4. Wrong tagging leads to a wrong result, because the second function carries no information useful for temporal information extraction.

E4 (a) 大東電報局將會出售香港電訊百分之五十四的股權 (Tai Tung Telegraph Co. will sell 54% of Hong Kong Telecom) (b) 因此將期內紅酒及白酒的生產比率調整 (thus, they adjust the production ratio of red and white wines)

4.1.3. Grammatical Structure Analysis
(1) A time phrase can either modify a verb as a temporal adverb, or modify a noun as a restrictive modifier. For example, consider 投資部經理表示一月份已發行股本增加三成五 (the investment department manager said that the issued shares in January have increased by 35%). 一月份 (January) alone is a time phrase. In this case, however, it acts as an adjective modifying 股本 (issued shares). Separating it from 股本 (issued shares) may give rise to the mistaken reading that the action 增加 (increase) happens in January.

(2) A sentence may contain more than one verb, but there is no simple and direct rule to tell which one is the predicate. For example, in 昨日表決通過了向集團收購零售全部股份的建議 (yesterday, voted to approve the proposal to take-over the complete shares of the sales department of the corporation), there are three verbs (or even four if 建議 (propose) is also identified as a verb), i.e. 表決 (vote), 通過 (approve), and 收購 (take-over). The actions 表決 (vote) and 通過 (approve) took place on 昨日 (yesterday). With reference to the reporting date, they happened in the past. But what about 收購 (take-over)? It is actually a future event.

4.2. Difficulties in Semantic Understanding

4.2.1. Dependence of Temporal Relations on Types of Temporal Concepts
Temporal relation determination is conditionally dependent on temporal classification. For instance, most of the time, the aspectual auxiliary word 了 (was) in conjunction with the adverb 已經 (have) marks a past completion, as shown in E5(a). Note that the verb 獲得 (receive) there indicates an event. But when the verb represents a state, as 具備 (have) does in E5(b), the 已經...了 pattern means that a change has finished and that a new state has begun which continues into the present. The influence of temporal classification on temporal relations will be explored in the future.

E5 (a) 熊貓 GM410 手機已經獲得了信息產業部頒發的電信設備進網許可證 (Panda GM410 mobile phone has already received the telecommunication and network entry certificate awarded by the Property and Business Department) (b) 這樣的家庭在國外已經具備了購買轎車的能力 (this type of family abroad has already had the capability to purchase a car)

4.2.2. Reference Times
The sentences in an article are temporally related. They may share the same reference time, indicated in a preceding sentence, or the event time in one sentence may serve as the reference point for the next, see E6. The reference time can therefore shift between sentences and tie different sentences together into a temporally complete concept. Note that a wrong reference time leads to a wrong relation. Identifying whether a reference time is continued from the preceding sentence or is the same as an omitted speech time, and how the reference time shifts, requires discourse analysis.

E6 上週曾與銀行工會就轉按上昇問題進行磋商 (last week, we discussed the rising re-mortgage problem with the Bank Trade Association) TR=ON(磋商,上週); 但未有落實具體指引 (but no concrete indication has been established) TR=FUTURE(落實,報導日期) **correct: TR=FUTURE(落實,上週)

4.2.3. Negation
The negated form of a verb may have two focuses. One emphasizes an event which is expected to become a fact but has not yet happened, as in E7(a). It implies that the event will take place in the future. The other emphasizes a status in which the event did not happen throughout a specified duration, as in E7(b). Is it possible to find out the focus of the negation? According to our preliminary understanding, when negation applies to a durative event, it most likely implies a situation like E7(b), whereas when it applies to an instantaneous event, the event is most likely expected to take place at some future time. More investigation is needed to determine whether this consideration is correct and sufficient.

E7 (a) 目前尚未確定 (it has not been decided yet) (b) 上月沒有跌破 16000 點水平 (last month, it did not drop below the 16,000-point level)

5. Concluding Remarks
Hitherto, time-dependent information extraction and temporal reference have not been well studied. There is some research for Western languages, but comparable work for Chinese is rare. For this reason, our research will be significant to the IE research field.

References

1. Allen J.F., "An Interval-based Representation of Temporal Knowledge", In Proceedings of the 7th International Joint Conference on Artificial Intelligence, 1981, pp221-226.
2. Allen J.F. and Koomen J.A., "Planning Using a Temporal World Model", In Proceedings of the 8th International Joint Conference on Artificial Intelligence, Karlsruhe, Germany, 1983, pp741-747.
3. Allen J.F., "Towards a General Theory of Action and Time", Artificial Intelligence, 23, 1984, pp123-154.
4. Androutsopoulos I., Ritchie G.D. and Thanisch P., "Time, Tense and Aspect in Natural Language Database Interface", Natural Language Engineering, 4(3), 1998, pp229-276, (http://xxx.lang.gov/cmp-lg/9803002).
5. Antony G., "Temporal Logics and Their Applications", Department of Computer Science, University of Exeter, Academic Press, 1987.
6. Bruce B.C., "A Model for Temporal References and its Application in Question-answering Program", Artificial Intelligence, 3, 1972, pp1-25.
7. Dowty D., "The Effects of Aspectual Class on the Temporal Structure of Discourse: Semantics or Pragmatics?", Linguistics and Philosophy, 9, 1982, pp37-62.
8. Glasgow B., Mandell A., Binney D. and Fisher F., "An Information Extraction Approach to Analysis of Free Form Text in Life Insurance Application", In Proceedings of the 9th Conference on Innovative Application of Artificial Intelligence, Menlo Park, USA, 1997, pp992-999.
9. Hwang C.H. and Schubert L.K., "Interpreting Tense, Aspect and Time Adverbials: A Compositional, Unified Approach", In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, 1992, pp232-240.
10. Kenny A., Action, Emotion and Will, New York, Humanities Press, 1963. (Chapter 8)
11. Li W.J. and Wong K.F., "Towards Automatic Chinese Temporal Information Extraction", to appear in Journal of the American Society for Information Science, 2001.
12. Li W.J., Wong K.F. and Yuan C.F., "Modeling Temporal Relationships embedded in Chinese Sentences", submitted to ACM Transactions on Asian Language Information Processing, 2001.
13. Li W.J., Wong K.F. and Yuan C.F., "Temporal Classification of English and Chinese Sentences", submitted to Journal of the American Society for Information Science, January 2001.
14. Moens M. and Steedman M., "Temporal Ontology and Temporal Reference", Computational Linguistics, 14(2), 1988, pp15-28.
15. Mourelatos A.P.D., "Events, Processes, and States", Linguistics and Philosophy, 2, 1978, pp415-434.
16. Parsons T., "The Progressive in English: Events, States and Processes", Linguistics and Philosophy, 12, 1989, pp213-241.
17. Prior A.N., Past, Present and Future, Oxford, Clarendon Press, 1967.
18. Reichenbach H., Elements of Symbolic Logic, Berkeley CA, University of California Press, 1947.
19. Soderland S., Aronow D., Fisher D., Aseltine J. and Lehnert W., "Machine Learning of Text Analysis Rules for Clinical Records", Technical Report, TE39, Department of Computer Science, University of Massachusetts, USA, 1995.
20. Vendler Z., Linguistics in Philosophy, Cornell University Press, Ithaca, New York, 1967.
21. Wong K.F., Li W.J., Yuan C.F. and Zhu X.D., "Temporal Representation and Classification in Chinese", submitted to International Journal of Computer Processing of Oriental Language, January 2001.
22. 鄧守信, "漢語動詞的時間結構", 《第一屆國際漢語教學討論會論文選》, 北京語言學院出版社, 1986 年, pp30-36.
23. 陳平, "論現代漢語時間系統的三元結構", 《中國語文》, 1988 年第 6 期, pp401-422.
24. 李臨定, 《現代漢語動詞》, 中國社會科學出版社, 1990 年.
25. 馬慶株, "時量賓語和動詞的類", 《中國語文》, 1981 年第 2 期.