Journal of Computer Information Systems, Summer 2012. Received: October 5, 2011.
A RULE-BASED SYSTEM TO EXTRACT FINANCIAL INFORMATION

Mahmudul Sheikh, Rust College, Holly Springs, MS 38635

Sumali Conlon, University of Mississippi, University, MS 38677

ABSTRACT

Extracting up-to-date information from financial documents can be important in making investment decisions. However, the unstructured nature and sheer volume of such data make manual analysis tedious and time consuming. Information extraction technology can be applied to automatically extract the most relevant and precise financial information. This paper introduces a rule-based information extraction methodology for the extraction of highly accurate financial information to aid investment decisions. The methodology includes a rule-based symbolic learning model trained by the Greedy Search algorithm and a similar model trained by the Tabu Search algorithm. The methodology has been found very effective in extracting financial information from NASDAQ-listed companies. Moreover, the Tabu Search based model performed better than some well-known systems. The simple rule structure makes the system portable and should make parallel-processing implementations easier.

Keywords and phrases: Information Extraction, Greedy Search, Tabu Search, Rule-based Model, Symbolic-Learning Model (SLM)

1. INTRODUCTION

Investors need to analyze financial information to make wise investment decisions. Information about the financial standing of companies can be obtained from reliable and easily obtainable text documents. Financial information can be extracted from these texts to produce well-structured data such as the entries of a relational database. The extracted information can be used to perform causal analysis among the determinants of financial success. Information Extraction (IE) technology is used to collect prespecified and structured information from free text [19, 38] or the web [10, 29]. In an IE system, users need only specify the field names they are interested in, such as the name and location of a company. However, the construction of such IE systems is difficult.
The absence of phrase boundaries, exact field values, and the contexts of the words to be extracted complicates an IE task. Information Retrieval techniques such as Naïve-Bayes classifiers or Average Mutual Information [9] are not effective for IE tasks. Two general IE tasks are slot-filling and template generation [4]. A slot can be thought of as a field of a table, and a template is a user-specified set of interrelated slots [8]. An example of a slot is a company name, and the slot-filler (the value of the slot) might be “Voxware Software Inc.” An example of a template is a user-specified set of interrelated slots such as “Company


Name,” “Current Net Earnings,” and “Forward PE Ratio.” The experiments of this study were performed on a template containing six interrelated slots termed “Financial Factor,” “Previous Financial Factor,” “Current Volume,” “Previous Volume,” “Change Type,” and “Change Volume.” The value of “Financial Factor” can be the recent profit, net earnings, loss, revenue, or any other financial determinant. The value of “Previous Financial Factor” is the comparable financial factor for the previous period. These two slots may contain values such as net earnings for the current quarter and net loss for the year-earlier quarter. The values of the slots “Current Volume” and “Previous Volume” are related to the respective financial factors. “Change Type” may contain values such as “up” and “decrease,” while “Change Volume” contains the corresponding numerical figure.

We developed an IE methodology that was effective in extracting financial information. The methodology includes a rule-based Symbolic-Learning Model (SLM) that uses human-annotated training documents. The usefulness of an IE system for a financial application depends on the quality of the extracted information. To address this issue, this study attempted to answer the following questions:

1. How effective is a rule-based SLM model for the extraction of financial information?
2. How effective is the Greedy Search method in generalizing an SLM model for a financial domain?
3. Is it feasible to apply the Tabu Search method to generalize an SLM model for this domain?

We compared the performance of a Greedy Search-based IE system with that of some briefly trained human analysts. We also compared the training time of the Tabu Search method with that of the Greedy Search method to determine its feasibility. The trained models were tested on 20 randomly selected un-annotated documents. The best performing Tabu Search model obtained high precision and recall on the six extraction slots.
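As an illustration, the six-slot template described above can be represented as a simple record. The slot names below follow the paper; the sample values are hypothetical, not taken from the corpus:

```python
# A hypothetical filled template for the six interrelated slots.
# Slot names follow the paper; the values are illustrative only.
template = {
    "Financial Factor": "net earnings",
    "Previous Financial Factor": "net loss",
    "Current Volume": "$2,344,000",
    "Previous Volume": "$3,118,000",
    "Change Type": "decrease",
    "Change Volume": "$774,000",
}

# A slot is left unfilled when the document provides no value for it.
unfilled = {slot: None for slot in template}
```

An IE system fills as many of these slots as the input document supports; unfilled slots stay empty rather than being guessed.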
The performance is competitive with a system developed for a similar corpus [23]. The remaining sections of this paper are organized as follows: §2 covers previous research in the IE area, §3 introduces the interrelated methods used, §4 presents the results and related comparisons, and §5 presents conclusions and future directions.

2. RESEARCH BACKGROUND

In an IE system, free-text documents are fed as input and the relevant information is produced as output in a template structure. A template is similar to a table of a relational database. After

Revised: January 4, 2012

Accepted: January 18, 2012

tokenization, sentence-level syntactic and semantic analyses [64] are performed on the input text. Syntactic analysis identifies syntactic patterns such as noun phrases [12]. Some special features such as punctuation and special characters are also used. Lexical and semantic analyses are also performed on the initially identified patterns [24]. IE systems depend on either shallow [8] or deep-level parsing [41]. Two methods, Knowledge Engineering and Automated Training, have been used before for various IE tasks [4]. The first method depends on human-specified extraction rules. The second method uses training documents and a learning algorithm to train an extraction model. The Message Understanding Conference (MUC) introduced precision, recall, and F-Measure for assessing IE performance [4].

Both rule-based learners and feature-based statistical learners have been used for various IE tasks. The following are several prominent rule-based systems. AutoSlog [44] builds a domain-specific dictionary of patterns by using CIRCUS [33] for concept analysis. The LIEP system [25] learns rules from the syntactic constituents of phrases. CRYSTAL [51] induces a dictionary of Concept-Node definitions from manually tagged documents. HASTEN [28] allows users to annotate examples by labeling the important regions of texts and the relationships among the targets and their concepts (e.g., management succession). WHISK [52] uses syntactic constituents and semantic labels to extract information from semi-structured and free texts. The RAPIER (Robust Automated Production of Information Extraction Rules) system [7] learns rules by using Inductive Logic Programming techniques; it uses manually filled templates to learn the rules by utilizing semantic information. Xiao et al. [58] used a combination of soft and hard patterns with bootstrapped training. Yu et al. [61] implemented a two-pass IE system.
The first pass identifies the information blocks and the second pass extracts the target. They developed two models, one using a Hidden Markov Model (HMM) and the other using a Support Vector Machine (SVM). They found that the SVM model performs better in extracting names and addresses, while the HMM model performs better in extracting educational information. Culotta et al. [14] used a conditional random field-based system that uses corrective user feedback. An HMM model that exploits the lexicographer classes from WordNet [37] was developed by Ciaramita and Altun [11]. Tratz and Sanfilippo [56] developed a system that combines a semi-supervised and a supervised method. Yu et al. [62] developed a Markov Chain Monte Carlo algorithm-based system that approximates Bayesian inference. An unsupervised relation extraction system that utilizes both deep and shallow patterns was developed by Yan et al. [60]. Mykowiecka et al. [39] developed a system that uses hierarchical typed features and a dedicated grammar set. Kim et al. [27] developed a soft pattern matching system based on local trees. Because the above systems were developed for different domains and corpora, direct performance comparisons among them are not meaningful.

Most of the rule-learning systems used either the straightforward Greedy Search (GS) algorithm or a variant of it. Satpal [48] used the GS algorithm to optimize a Markov Logic Network for web-based IE. Sarma et al. [47] used the GS method to find the optimal set of rules. Various GS methods were applied in other IE tasks such as topic selection for expert identification [15], feature selection for semantic role classification [16], label (term) selection for answering structured


query [35], ad-hoc query generation for web crawling [36, 34], and named-entity disambiguation [42]. We used a Greedy Search method for one of our rule-learning models to establish the baseline performance. Greedy Search methods usually end up with locally optimal rules. As a global search strategy, we applied a customized variant of the Tabu Search (TS) method to generalize the rules for our second model. The TS method has been applied before for various text processing applications such as sentiment extraction [5], optimal feature selection from biomedical data [40, 63], text clustering for topic extraction [57], large-scale text classification [6], and document retrieval through a Relevance Feedback approach [30].

Our system trains and generalizes a set of extraction rules from the annotated corpus. A separate set of rules is trained for each slot. Unlike other systems, our system considers only the presence and absence of the feature constraints to generalize the rules. We developed two different extraction models: one using the GS method and the other using the TS method. The simplicity of our rule structure allowed us to apply the TS method.

3. METHODOLOGY

The methodology includes a learning model that uses a set of features such as exact-words, special characters, and punctuation characters. The exact-words are provided by the method described in §3.1. Considering domain relevancy, the special characters are selected by default. The parts of speech provided by the Brill tagger were used to determine the contexts of the target phrases. A set of rules is learned by using the Symbolic-Learning method (SLM) [7] to extract each target. Unlike a statistical model [14], an SLM model uses the relationships among the surrounding words and other features to determine a target. In an IE task, SLM models [45] learn which words, word-features, and special characters determine the target.
SLM models have been found effective in extracting information from various domains. In an SLM model, the rules do not depend on the probability of sequential n-grams [7]. Similar to [18], our SLM model covers each type of target (e.g., company name) by learning a separate rule set. The distinctive characteristics of our system are:

1. Minimal requirement of manual data processing
2. Simple representation of rules that makes the system portable to other domains
3. Only the absence and presence of the feature constraints of a rule are considered
4. An initial rule is not thrown out before generalization because its performance is unknown
5. Application of an advanced search method to improve the extraction performance

The learning model considers an exact-word only if it exists within the six words before or after the target phrase and has a PMI value (§3.1) greater than 0.5. Each initial rule contains the exact-words and word-features found for a target phrase. Figure 1 shows the system flowchart for learning a rule set. The training method generalizes each initial rule found in the training documents for a slot. Then it retains only the best performing rules that cover all of the training examples for that


slot. Unlike other SLM models, our model does not combine similar rules to construct one complex rule for a slot.

3.1 Preprocessing

The documents for the corpus were collected manually from the online EDGAR database and various online news sources, via the Yahoo Finance website, for 200 NASDAQ-listed companies. The financial news and several 10-Q quarterly reports of each company were manually combined to create a text file for each company. The preprocessing includes removing unnecessary symbols, separating sentences by delimiters, excluding irrelevant sentences, and tagging the documents with the Brill Tagger (http://www.calcaria.net). The sentence selector method was designed to exclude irrelevant sentences from the training documents and thereby reduce the training time of the learning methods. It was also used to select the exact-words for the initial rules. In order to exclude the irrelevant sentences, target phrases such as “pre-tax revenue” and “net loss” were provided to the system. Unlike document-level term-weighting methods [46] that use Mutual Information (MI) [59, 3], we used Point-wise Mutual Information (PMI) to identify the relevant sentences. PMI measures have been used in various text mining [32] applications such as selecting implicit features in customer reviews [54] and feature selection for text categorization [49]. The PMI is calculated by the following formula:
PMI(x, y) = log [ p(x, y) / ( p(x) p(y) ) ]
In the above formula, x and y represent the target phrase and a surrounding word, respectively. To constrain the PMI to values between -1 and +1, it was normalized by the following formula:
npmi(x, y) = PMI(x, y) / ( -log p(x, y) )
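The PMI and npmi measures just described, and the sentence-selection criterion built on them, can be sketched as follows. This is a simplified illustration: probabilities are estimated from sentence-level co-occurrence counts, the target is treated as a single token, and the function and variable names are ours, not the paper's:

```python
import math

def npmi(target: str, word: str, sentences: list) -> float:
    """Normalized point-wise mutual information, estimated from
    sentence-level co-occurrence counts (a simplified sketch)."""
    n = len(sentences)
    has_t = sum(1 for s in sentences if target in s)
    has_w = sum(1 for s in sentences if word in s)
    both = sum(1 for s in sentences if target in s and word in s)
    if both == 0:
        return -1.0                       # never co-occur: lower bound
    p_xy = both / n
    p_x, p_y = has_t / n, has_w / n
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)          # bounded in [-1, +1]

def is_relevant(sentence: str, target: str, vocab_npmi: dict,
                window: int = 6, threshold: float = 0.5) -> bool:
    """Keep a sentence if the target occurs and at least two words
    within +/- window positions have npmi above the threshold."""
    words = sentence.split()
    if target not in words:
        return False
    i = words.index(target)               # first occurrence suffices
    nearby = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
    significant = [w for w in nearby if vocab_npmi.get(w, -1.0) > threshold]
    return len(significant) >= 2
```

A word pair that always co-occurs reaches npmi = 1; a pair that never co-occurs is assigned the lower bound of -1.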
The npmi (Normalized PMI) measures have been found effective in collocation extraction [20, 50]. Our system selects a relevant sentence by using the occurrence of a target phrase and at least two significant surrounding words. The significance of a surrounding word is determined by two criteria: first, it has to appear within six words before or after the target; second, its npmi has to be greater than 0.5. Figure 2 shows a sample from the corpus and the retained sentences. In Figure 2, the target phrase(s) of a selected sentence is underlined and the two significant surrounding words are shown in boldface. For each selected sentence, there were more than two significant surrounding words. In all three sentences, there is more than one occurrence of the target. For example, in Sentence 1 “net loss” and “loss” are both targets, but the first target is sufficient for relevancy. All significant surrounding words of a target are used as the exact-word constraints for the initial rule for that target.

3.2 Rule Construction and Generalization

The goal of the rule construction phase is to find a set of extraction rules [18]. The rules were learned by using a learning technique similar to [55], applied for Symbolic Learning Models (SLM) [7, 33, 38]. For this domain, symbols are defined as words and word-features [2] or entity-relations [1, 31]. SLM models use the relationships among the symbols such as

June 27, 2007 (Reuters): Diedrich Coffee, Inc. today announced operating results for its fiscal year ended June 27, 2007. For fiscal year 2007, the Company reported a net loss of $1,765,000, or $0.33 per share, compared to a loss of $7,796,000, or $1.47 per share, in fiscal year 2006. [...]

June 27, 2007 (EDGAR Online): In fiscal year 2007, the Company accounted for its Diedrich Coffee and Coffee People company-operated retail operations as Discontinued Operations. The Company continues to own and operate the wholesale, Gloria Jean’s retail and Gloria Jean’s domestic franchise businesses that together comprise Continuing Operations. The net loss for the fourth quarter of fiscal year 2007 was $2,344,000, or $0.42 per share, compared to a net loss of $3,118,000, or $0.59 per share, for the fourth quarter of fiscal 2006. Results for fiscal year 2007 include the after-tax gain of $3,580,000, or $0.66 per share, from the sale of the majority of its Diedrich Coffee and Coffee People company-operated locations and a loss from discontinued operations of $1,596,000, or $0.29 per share. [...]

Sentences retained by the Sentence Selector:

For fiscal year 2007, the Company reported a net loss of $1,765,000, or $0.33 per share, compared to a loss of $7,796,000, or $1.47 per share, in fiscal year 2006.

The net loss for the fourth quarter of fiscal year 2007 was $2,344,000, or $0.42 per share, compared to a net loss of $3,118,000, or $0.59 per share, for the fourth quarter of fiscal 2006.

Results for fiscal year 2007 include the after-tax gain of $3,580,000, or $0.66 per share from the sale of the majority of its Diedrich Coffee and Coffee People company-operated locations and a loss from discontinued operations of $1,596,000, or $0.29 per share.

FIGURE 1: The System Flowchart


FIGURE 2: Example of relevant sentences retained from the input document


contextual appearance of at least two significant surrounding words (§3.1). The only preprocessing we performed on the randomly selected test documents was parts-of-speech tagging.

The system finds the initial rules for a slot in the training data by matching the target-values [7, 23]. The rule for a matched target is constructed by combining the target-value, the exact-words, the parts-of-speech tags, and the special features. Figure 3 lists the special features used. The initial rules are the most specific because they contain all of the exact-words (Keyword) and the other feature constraints. Using all features and exact-words, the system constructs the initial rules for each target. Figure 4 shows an initial rule generated for the “Financial Factor” slot. The Prefix part (denoted PRF) of the rule in Figure 4 represents the constraints on the prefix. The constraints on the target phrase of this rule are POS tags and numeric constraints. The remaining part of the rule represents the constraints on the postfix (denoted PSF).

The rule generalization process is specific-to-general, starting with the initial rule. An initial rule is generalized by updating one constraint at a time. Figure 4 includes constraints such as: the POS tag of the first token of the prefix is “NNP,” and the exact-word of the first token of the postfix is “of.” For this application, most of the exact-words for the non-numeric targets are known. Some target phrases, for example “available-for-sale and held-to-maturity securities portfolios,” are not available in the training corpus. Thus, no exact-word constraint was imposed on the target of a rule. The learning process generalizes only the prefix and the postfix of the rules. Similar to [14, 52], we used the information-gain metric [43] to guide the rule-generalization process.
For this corpus, the information gain is always positive if a rule covers at least one more positive example or one fewer negative example than the previous form of the rule, so, unlike [43], we omitted the log term. Updating a constraint means either dropping a constraint or adding a previously dropped constraint. For example, if the exact-word constraint “of” in the last row of Figure 4 is dropped in one iteration, this constraint can be added back in a future iteration.
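The log-free gain measure described above can be written directly from the paper's formula, t * ( P1/(P1+N1) - P0/(P0+N0) ), where the function name and argument names below are ours:

```python
def information_gain(p0: int, n0: int, p1: int, n1: int, t: int) -> float:
    """Log-free information gain: compares a candidate rule covering
    p1 positives / n1 negatives with the current rule covering p0 / n0;
    t is the number of positive examples covered by both rules."""
    return t * (p1 / (p1 + n1) - p0 / (p0 + n0))

# Dropping a constraint that sheds one negative example yields a
# positive gain, so the update would be accepted:
gain = information_gain(p0=4, n0=2, p1=4, n1=1, t=4)
```

The gain is positive exactly when the candidate rule's precision on the training examples exceeds that of the current rule, which matches the acceptance behavior described in the text.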

words, Parts-of-Speech (POS) tags, and special characters to learn the rules. For simplicity, let us assume that the input sentences are not POS tagged and the SLM model has to learn a rule to predict the target-value for the slot “Current Volume” for the input sentence “For fiscal year 2007, the Company reported a net loss of $1,765,000, or $0.33 per share, compared to a loss of $7,796,000, or $1.47 per share, in fiscal year 2006.” The SLM model may end up learning the following rule from the training sentences:

Rule: S[-3]=net AND S[+3]=per AND S[+4]=, AND S[+5]=compared

For this rule, “S[-3]=net” means that the third symbol before the target-value is “net”; “S[+4]=,” means that the fourth symbol after the target-value is a comma, and the exact-word constraint “S[+4]=share” for this position was ignored. The learned (generalized) rule contains only the determining symbols. The training documents for learning the rules were manually annotated to identify each target-value for a slot. To ensure cross-annotation agreement, seven accounting students were provided with five training documents. There was 94% cross-annotation agreement across the slots. The selection of the sample sentences for training was based on the occurrence of the target and the
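An offset-style rule of the S[k] form illustrated above can be matched with a small helper. This is a sketch under our own assumptions: the `matches` helper is hypothetical, punctuation is split into separate tokens, and the offsets in `rule` are adjusted to that tokenization rather than copied verbatim from the inline example:

```python
def matches(tokens: list, i: int, constraints: dict) -> bool:
    """Check a candidate target position i against offset constraints
    such as {-3: "net"}, meaning the third token before position i
    must be "net" (the S[k] notation used above)."""
    for offset, word in constraints.items():
        j = i + offset
        if j < 0 or j >= len(tokens) or tokens[j] != word:
            return False
    return True

# Offsets assume punctuation is tokenized separately (our choice).
rule = {-3: "net", +1: ",", +4: "per"}
sent = ("For fiscal year 2007 , the Company reported a net loss of "
        "$1,765,000 , or $0.33 per share , compared to a loss of "
        "$7,796,000 , or $1.47 per share , in fiscal year 2006 .").split()
hits = [tok for i, tok in enumerate(sent) if matches(sent, i, rule)]
```

Only the position of “$1,765,000” satisfies all three constraints; the second dollar figure fails the S[-3]=net test, which is how such a rule distinguishes the current-period volume from the prior-period one.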

Orthographic features:
- Title case: the first character of the word is upper case
- All upper case: all characters of the word are upper case
- All lower case: all characters of the word are lower case
- All numeric: all characters of the word are numeric characters
- Alpha numeric: the word contains numeric and non-numeric characters

Domain-specific features found in this financial domain:
- Special string: “Co.”, “Inc.”, “Corp.”, “Org.”
- Special character: , . : $ # %

FIGURE 3. The features used.
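The features listed in Figure 3 are straightforward to compute per token. The sketch below is ours, not the paper's code; the feature keys mirror the figure's labels:

```python
def orthographic_features(word: str) -> dict:
    """Compute the orthographic and domain-specific word features
    listed in Figure 3 (a sketch; key names are ours)."""
    has_alpha = any(c.isalpha() for c in word)
    has_digit = any(c.isdigit() for c in word)
    return {
        "title_case": word[:1].isupper(),          # first char upper case
        "all_upper": word.isupper(),               # e.g. "NASDAQ"
        "all_lower": word.islower(),               # e.g. "revenue"
        "all_numeric": word.isdigit(),             # e.g. "2007"
        "alpha_numeric": has_alpha and has_digit,  # e.g. "10-Q"
        "special_string": word in {"Co.", "Inc.", "Corp.", "Org."},
    }
```

The features are not mutually exclusive: “Corp.” is simultaneously title case and a special string, and both facts become constraints in the initial rule.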

Tagged Sentence: In/IN June/NNP 2007,/CD the/DT Company’s/NNP Board/NNP of/IN Directors/NNP declared/VBD a/DT quarterly/JJ cash/NN dividend/NN of/IN $0.34/CD per/IN common/JJ share,/VBG a/DT 13%/CD increase/NN compared/VBN to/TO the/DT same/JJ quarter/NN of/IN 2006./JJ

Target for “Financial Factor”: cash/NN dividend/NN

Prefix (PRF) and Postfix (PSF) constraints of the rule, one column per token position:

Constraint     Tok1   Tok2   Tok3   Tok4       Tok5   Tok6
PRF Tag        NNP    IN     NNP    VBD        DT     JJ
PSF Tag        IN     CD     IN     JJ         VBD    DT
PRF Char1      null   null   null   null       null   null
PSF Char1      null   $      null   null       ,      null
PRF Char2      null   null   null   null       null   null
PSF Char2      null   null   .      null       null   null
PRF First Cap  true   false  true   false      false  false
PSF First Cap  false  false  false  false      false  false
PRF All Cap    false  false  false  false      false  false
PSF All Cap    false  false  false  false      false  false
PRF Keyword    null   of     null   declared   null   quarterly
PSF Keyword    of     null   per    common     share  null

FIGURE 4. Example of a rule to extract “Financial Factor”


3.3 Rule Generalization by Greedy Search

Most rule-learning approaches (§2) used various forms of the original Greedy Search (GS) method [13]. Satpal [48] used the GS method to optimize a Markov Logic Network for web-based IE. Sarma et al. [47] used the GS method to find the optimal set of extraction rules. Various GS methods were applied in other IE tasks such as topic selection [15], feature selection for semantic classification [16], term selection [35], ad-hoc query generation [36], and named-entity disambiguation [42]. We used a GS method for one of our rule-learning models to establish the baseline performance of our system. Figure 5 shows the rule generalization algorithm using the GS method.

Input: trainingExamples, targetPhrases, wordFeatures
• getInitialRules(trainingExamples, targetPhrases) {Return: initialRuleSet}
• findCoverage(trainingExamples, targetPhrases, R) {Return: totalPositiveCovered, totalNegativeCovered}

generalizeRule(initialRuleSet, trainingExamples, targetPhrases)
Begin
  Loop For each InitialRule in initialRuleSet
    R1 ← any rule from initialRuleSet
    maxGain = -1
    Loop Until informationGain < 0 for N iterations
      Loop Until all possible next updates are checked
        findCoverage(trainingExamples, targetPhrases, R1)
        P0 = totalPositiveCovered
        N0 = totalNegativeCovered
        R2 ← R1 with one constraint updated, not considered before
        findCoverage(trainingExamples, targetPhrases, R2)
        P1 = totalPositiveCovered
        N1 = totalNegativeCovered
        t = total common positive examples covered by R1 and R2
        informationGain = t * ( P1/(P1+N1) - P0/(P0+N0) )
        If (informationGain >= maxGain)
          Replace R1 by R2 and assign informationGain to maxGain
      End Loop
    End Loop
  End Loop
  Return: generalizedRuleSet
End

FIGURE 5.
The Greedy Search algorithm for rule generalization

Let us consider only the six words before and after the target used to construct the initial rule for the slot “Financial Factor.” For simplicity, assume that the rule contains only the exact-word constraints and ignore the positions of the words. The initial rule for the sentence part “Board/NNP of/IN Directors/NNP declared/VBD a/DT quarterly/JJ cash/NN dividend/NN of/IN $0.34/CD per/IN common/JJ share,/VBG a/DT” will be:

R1 ← Prefix: declared quarterly AND Postfix: per common share

The findCoverage function returns the number of positive and negative examples covered by R1 in the training sentences. These values are saved in the variables P0 and N0. Covering a negative

14

example means that the rule extracts a target-value from a training sentence that does not match the manually tagged target-value. At this stage, the only option to update this rule is to drop any of these exact-words. If the word “declared” is dropped, the rule becomes:

R2 ← Prefix: null quarterly AND Postfix: per common share

The findCoverage function is called again to find the number of positive and negative examples covered by R2, and they are saved in P1 and N1. The value of t is the number of common positive examples covered by R1 and R2. If the information gain for this update is 0.6, for example, maxGain becomes 0.6. For the next update to qualify, the information gain has to be at least 0.6. If more than one possible update yields the highest information gain, the tie is broken by choosing the nearest item to the previously updated item. If the innermost loop updates any constraint, R1 is replaced by R2. Then the innermost loop starts again with the updated rule and tries to generalize it further. Updating the rule continues in this fashion until there is no information gain for N iterations (in the second innermost loop), where N is set to the number of constraints in the initial rule. In some situations, the rule may be updated by adding a constraint that was dropped in a previous iteration. The reason is that adding back a constraint may yield information gain by covering fewer negative examples. The outermost loop ends after all initial rules are generalized.

3.4 Rule Generalization by Tabu Search

Greedy Search (GS) methods are prone to local convergence in a rule learning process [53]. As a global search strategy, we applied a customized version of the Tabu Search (TS) method to generalize the rules for our second model. The TS method has been applied before for various text processing applications such as sentiment extraction [5] and optimal feature selection [40, 63].
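The greedy generalization procedure of §3.3 can be sketched as a hill climb over single-constraint updates. This is a simplified illustration, not the paper's implementation: rules are abstract objects, the common-positives count t is approximated, the restart-for-N-iterations bookkeeping is collapsed into a single no-improvement stop, and `coverage`/`neighbors` are hypothetical hooks:

```python
def greedy_generalize(rule, coverage, neighbors):
    """Greedy sketch of the Figure 5 loop: repeatedly apply the single
    constraint update with the highest positive information gain.
    coverage(rule) -> (positives, negatives) on the training data;
    neighbors(rule) yields rules with one constraint updated."""
    improved = True
    while improved:
        improved = False
        p0, n0 = coverage(rule)
        best_gain, best_rule = 0.0, None
        for candidate in neighbors(rule):
            p1, n1 = coverage(candidate)
            if p1 == 0:
                continue                   # covers no positives: skip
            t = min(p0, p1)                # simplified common positives
            gain = t * (p1 / (p1 + n1) - p0 / (p0 + n0))
            if gain > 0 and gain >= best_gain:
                best_gain, best_rule = gain, candidate
        if best_rule is not None:
            rule, improved = best_rule, True
    return rule
```

In the paper's setting, `coverage` would evaluate a rule against the annotated training sentences and `neighbors` would enumerate drops of existing constraints and re-additions of previously dropped ones.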
More general text processing applications include text clustering [57], text classification [6], and document retrieval [30]. Zhang and Sun [63] found that the TS method takes less time than branch-and-bound and other sub-optimal methods such as the Genetic Algorithm. To the best of our knowledge, no previous rule-based IE application has been developed with the TS method. We used the TS method to obtain a set of near-optimal rules, guided by the information-gain heuristic. Because of the higher training time, we used a simple form of the TS algorithm [21]. The generalization process moves from one form of a rule to another until a stopping criterion is reached. It checks all possible changes in the neighborhood of the current rule to find a better rule. In order to avoid redundant rules, the process maintains a list of recent moves in a Tabu-List. A move is defined as adding a feature to, or dropping a feature from, the current rule. If the length of a Tabu-List is 4, for example, the list includes the 4 latest moves. While the process finds a better rule, it accepts the new rule and updates the Tabu-List. For this application, the size of each Tabu-List was one-fourth of the number of features of a rule, with a minimum size of 2. The aspiration criterion overrides the Tabu status of the element of the Tabu-Add or Tabu-Drop list that yields the highest information gain when no admissible better rule is available. Figure 6 explains our TS method [20].
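The Tabu-Add / Tabu-Drop bookkeeping described above can be sketched with two fixed-length queues: a recently dropped constraint may not be re-added, and a recently added one may not be re-dropped, until it ages out of its list. The class and method names are ours, and the sketch omits the aspiration override:

```python
from collections import deque

class TabuLists:
    """Sketch of the Tabu-Add / Tabu-Drop lists for one rule.
    deque(maxlen=...) silently evicts the oldest move, so a
    constraint becomes admissible again after enough newer moves."""
    def __init__(self, n_features: int):
        size = max(2, n_features // 4)       # 1/4 of features, minimum 2
        self.tabu_add = deque(maxlen=size)   # recently dropped -> add is tabu
        self.tabu_drop = deque(maxlen=size)  # recently added -> drop is tabu

    def record_drop(self, constraint):
        self.tabu_add.append(constraint)

    def record_add(self, constraint):
        self.tabu_drop.append(constraint)

    def can_add(self, constraint) -> bool:
        return constraint not in self.tabu_add

    def can_drop(self, constraint) -> bool:
        return constraint not in self.tabu_drop

tabu = TabuLists(n_features=12)              # list size is max(2, 3) = 3
tabu.record_drop("of")                       # "of" was just dropped...
assert not tabu.can_add("of")                # ...so re-adding it is tabu
```

Under the aspiration criterion described above, a tabu move would still be taken when no admissible move improves the rule, with the opposite list updated accordingly.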


Similar to the GS algorithm, adding or dropping a constraint is evaluated against the training data. The admissibility of adding or dropping a constraint depends on the corresponding Tabu-Add and Tabu-Drop restrictions. The search process looks for a better rule, first by dropping an admissible constraint and then by adding an admissible feature. The best new rule found by dropping a constraint is saved as a potential next rule (Rule-A). The best new rule found by adding a constraint is also saved as a potential next rule (Rule-B). If Rule-A is better than or identical to the current rule and better than Rule-B, the current rule is replaced by Rule-A, and the Tabu-Add list is updated. An identical new rule does not provide any information gain with respect to the current rule. The reason for accepting an identical rule found by dropping a constraint is that, once accepted, it performs as well as the current rule while using fewer constraints.

If Rule-B is better than Rule-A, and also better than the current rule, the current rule is replaced by Rule-B and the Tabu-Drop list is updated. If Rule-B is not better than Rule-A, and Rule-A is not better than or identical to the current rule, the search process violates the Tabu restrictions. If the best of the rules found by violating the Tabu restrictions is better than the current rule, the current rule is replaced by the new rule (Rule-C). In this case, if Rule-C required dropping a Tabu constraint, the related drop-constraint is eliminated from the Tabu-Drop list and the Tabu-Add list is updated to include the recently dropped element. On the other hand, if Rule-C required adding a constraint, the related add-constraint is eliminated from the Tabu-Add list and the Tabu-Drop list is updated to include the recently added constraint. If Rule-A is not better than Rule-B or Rule-C and these latter two rules are not better than the current rule, the best of rule A,

FIGURE 6. The Tabu Search algorithm for rule generalization


B, and C, is chosen (randomly, if all are identical) as the new temporary rule (RT). When the number of consecutive iterations producing non-improving rules equals the number of constraints of the initial rule, the search process stops. When the generalization process is complete for either the GS or the TS method, the rules are sorted by the ratio of positive to negative examples covered by each rule.

4. RESULTS AND DISCUSSION

Twenty test documents, randomly selected out of 200, were used to test the performance of the trained models. The experimental results show the effectiveness of the models. The template shown at the bottom of Table 1 was chosen because it captures important information that a user might need. Seven accounting students were given the corpus to choose the slots of their interest. After examining their choices, the slots and their names were selected to reflect a certain level of meaningful abstraction.

Because the system does not provide inter-sentence inferences, template instantiations with the slot-values were limited to one-to-one alignments of the sentences to the template. The extracted slot-values of Table 1 are from the output of the model generalized by the Tabu Search (TS) method. Sentence 3 of Table 1 shows a wrong slot-value extracted by a rule. Sentence 4 shows a slot-value assigned to a wrong slot.

For both models, the average extraction performance of each slot over all test documents is given in Table 2. The performance of different slots varied significantly due to differences in sparseness and in linguistic variation of the slot-values in the training corpus. The precision of each slot is higher than the recall. The reason might be that there are not enough pattern variations in the training data. This lack of pattern variation led the rules

TABLE 1. Sample Input Data for Beacon Power Corporation, March 31, 2007

Input Sentences
1. For/IN the/DT three/CD months/NNS ending/VBG March/NNP 31,/CD 2007,/CD the/DT Bank/NNP reported/VBD a/DT net/JJ loss/NN of/IN $3.0/CD million/CD or/CC ($0.51)/CD per/IN basic/JJ and/CC diluted/VBN share/NN as/IN compared/VBN to/TO net/JJ income/NN of/IN $114,000/CD or/CC $0.02/CD per/IN basic/JJ and/CC diluted/VBN share/NN for/IN the/DT three/CD months/NNS ending/VBG March/NNP 31,/CD 2006./JJ
2. During/IN the/DT three/CD months/NNS ended/VBD December/NNP 31,/CD 2006,/CD the/DT Company’s/NNP assets/NNS decreased/VBN by/IN $18.0/CD million,/NN or/CC 2.3%/CD from/IN $785.8/CD million/CD at/IN September/NNP 30,/CD 2006/CD to/TO $767.8/CD million/CD at/IN December/NNP 31,/CD 2006./JJ
3. The/DT Bank’s/NNP total/NN interest/NN expense/NN was/VBD $1.4/CD million/CD in/IN the/DT quarter/NN ended/VBD June/NNP 30,/CD 2007/CD compared/VBN to/TO $1.3/CD million/CD in/IN the/DT quarter/NN ended/VBD June/NNP 30,/CD 2006,/CD an/DT increase/NN of/IN $132,000/CD or/CC 10.2%./JJ
4. Our/PRP$ available-for-sale/JJ and/CC held-to-maturity/JJ securities/NNS portfolios/NNS decreased/VBN by/IN $11.1/CD million/CD in/IN the/DT aggregate/NN at/IN June/NNP 30,/CD 2007/CD compared/VBN to/TO December/NNP 31,/CD 2006,/CD primarily/RB reflecting/VBG normal/JJ amortization./NN

Output Slots

| Sentence | Financial Factor | Previous Financial Factor | Current Volume | Previous Volume | Change Type | Change Volume |
| 1 | net loss | net income | $3.0 million | $114,000 | -- | -- |
| 2 | assets | -- | $767.8 million | $785.8 million | decreased | $18.0 million |
| 3 | total interest (Correct: total interest expense) | -- | $1.4 million | $1.3 million | increase | 10.2% |
| 4 | available-for-sale and held-to-maturity securities portfolios | -- | $11.1 million (Correct slot: Change Volume) | -- | decreased | -- |
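To make the one-sentence-to-one-template alignment concrete, a naive hand-written pattern can fill the Table 1 slots for Sentence 1. The regular expression below is purely illustrative; it is not one of the system's learned rules, and the field names simply follow Table 1.

```python
import re

# A naive illustrative pattern, NOT one of the system's learned rules.
PATTERN = re.compile(
    r"(?P<factor>net \w+) of \$(?P<current>[\d.,]+(?: million)?).*?"
    r"compared to (?P<prev_factor>net \w+) of \$(?P<previous>[\d.,]+(?: million)?)"
)

# Sentence 1 of Table 1, with the part-of-speech tags stripped.
sentence = ("For the three months ending March 31, 2007, the Bank reported a "
            "net loss of $3.0 million or ($0.51) per basic and diluted share "
            "as compared to net income of $114,000 or $0.02 per basic and "
            "diluted share for the three months ending March 31, 2006.")

m = PATTERN.search(sentence)
template = {
    "Financial Factor": m.group("factor"),                # 'net loss'
    "Previous Financial Factor": m.group("prev_factor"),  # 'net income'
    "Current Volume": "$" + m.group("current"),           # '$3.0 million'
    "Previous Volume": "$" + m.group("previous"),         # '$114,000'
    "Change Type": None,   # not expressed in this sentence
    "Change Volume": None,
}
```

The learned rules play the role of this pattern, but are induced automatically from the annotated training sentences rather than written by hand.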

TABLE 2. Extraction Performance by Greedy Search (GS) and Tabu Search (TS)

| Slot/Entity | Precision (GS) | Recall (GS) | F-Measure (GS) | Precision (TS) | Recall (TS) | F-Measure (TS) |
| Financial Factor | 89.4 | 70.5 | 78.8 | 83.7 | 82.4 | 83.1 |
| Previous Financial Factor | 80.0 | 74.1 | 76.9 | 91.3 | 77.8 | 84.0 |
| Current Volume | 87.6 | 81.8 | 84.6 | 95.6 | 83.4 | 89.1 |
| Previous Volume | 89.5 | 87.6 | 88.5 | 95.8 | 88.6 | 92.0 |
| Change Type | 94.5 | 76.3 | 84.4 | 93.2 | 91.9 | 92.5 |
| Change Volume | 82.9 | 77.3 | 80.0 | 86.8 | 77.8 | 82.0 |
| Average Performance | 87.3 | 77.9 | 82.2 | 91.1 | 83.6 | 87.1 |
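As a quick sanity check on Table 2, each F-measure is the harmonic mean of the corresponding precision and recall, and the Average row is the mean of the per-slot values. For example, for the Greedy Search columns:

```python
# Per-slot (precision, recall) pairs from the Greedy Search columns of Table 2.
gs = {"Financial Factor": (89.4, 70.5), "Previous Financial Factor": (80.0, 74.1),
      "Current Volume": (87.6, 81.8), "Previous Volume": (89.5, 87.6),
      "Change Type": (94.5, 76.3), "Change Volume": (82.9, 77.3)}

def f_measure(p, r):
    """Harmonic mean of precision and recall (the balanced F1 score)."""
    return 2 * p * r / (p + r)

f_scores = {slot: round(f_measure(p, r), 1) for slot, (p, r) in gs.items()}
avg_f = round(sum(f_scores.values()) / len(f_scores), 1)
# f_scores["Financial Factor"] -> 78.8; avg_f matches the Average row: 82.2
```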


to be generalized insufficiently. Another possible reason is the presence of noise, which prevented the rules from being generalized enough. The experimental results show that the Greedy Search (GS) model achieves higher average precision and F-measure (87.3% and 82.2%), though lower recall (77.9%), than the Edgar Extraction System (EES) [13] applied to a similar corpus (72.62% precision, 82.71% recall, and 77.34% F-measure). The TS model performed somewhat better than the GS model: it improved the average precision by 3.8%, the recall by 5.7%, and the F-measure by 4.9%. To compare the performance of our system with other well-known systems, we also tested it on the Seminar Announcement corpus. All of the systems listed in Table 3 extract the speaker name, the place of the seminar, the starting time, and the ending time, and all of them use keywords and parts of speech in their rules. Table 3 presents the performance of the Tabu Search (TS)-based model alongside three other well-known IE systems on this corpus.

TABLE 3. Performance of Different Systems on Seminar Announcement Data (Slots: Speaker, Stime, Etime, Location)

| System | Precision (%) | Recall (%) | F-Measure |
| RAPIER | 91.3 | 73.2 | 79.7 |
| SRV | 82.3 | 79.9 | 81.0 |
| Whisk | 76.9 | 63.4 | 65.9 |
| TS | 95.1 | 77.4 | 84.7 |

Table 3 shows that our TS method outperformed the other methods. It improved precision over the best performer, RAPIER, by 3.8%, and the F-measure over the best performer, SRV, by 3.7%. In addition, we performed paired-sample tests to answer the research questions posed in this study. Research Question (RQ) one concerned the effectiveness of a rule-based SLM in extracting financial information. Our SLM models built by the GS and TS methods obtained high precision and recall. RQ two concerned the effectiveness of the GS method in generalizing an SLM for a financial domain. To investigate it, we compared the performance of our GS-based model with that of several briefly trained human analysts: the same 20 test documents were given to 20 accounting students. The paired-sample test shows that the precision of the system was not significantly higher than that of the human analysts (P value 0.56). The recall was significantly (P value was

5. CONCLUSIONS

than the Greedy Search model. However, because the generalization process for each rule is independent, the method should be easy to implement on a parallel processing platform. On average, there were roughly 3,000 training sentences per slot, and annotating an extraction slot for training took about 60 human-hours. On a 3 GHz dual-core processor, rule generalization took about 6 hours per slot for the GS method and approximately 18 hours for the TS method. The main limitation of this study concerns template construction: the current system can construct a complete template only if the values of all of its slots are contained in a single sentence. Building a template from slot-values extracted from multiple sentences will require incorporating a co-reference resolution method to establish relationships between sentences. In the future, we will apply the methodology to larger domains. Customizing the system should be easy because external features such as exact words are selected automatically. We will also apply the system to semi-structured domains such as HTML and XML.
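The paired-sample test used in the system-versus-analyst comparison of Section 4 can be computed with a few lines of standard-library code. The per-document scores below are hypothetical placeholders to show the mechanics; the paper does not publish its raw per-document data or the test software it used.

```python
import math
import statistics

def paired_t(a, b):
    """Paired-sample t statistic for matched observations a[i] and b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))  # standard error of the mean difference
    return mean_d / se

# Hypothetical per-document precision scores for a system and human analysts.
system_scores = [0.90, 0.85, 0.92, 0.88, 0.91]
human_scores = [0.89, 0.86, 0.90, 0.88, 0.90]
t = paired_t(system_scores, human_scores)
# The t statistic is compared against the t distribution with
# len(diffs) - 1 degrees of freedom to obtain the P value.
```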
