2009 International Conference on Electrical Engineering and Informatics 5-7 August 2009, Selangor, Malaysia
Automatic Extraction of Performance Indicators from Financial Statements
Siti Sakira Kamaruddin#1, Abdul Razak Hamdan#2, Azuraliza Abu Bakar#3, Fauzias Mat Nor*4
# Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43650 Selangor, Malaysia
1 [email protected]
2 [email protected]
3 [email protected]
* Graduate School of Business, Universiti Kebangsaan Malaysia, 43650 Selangor, Malaysia
4 [email protected]
Abstract— We present a method to automatically analyze financial statements for the purpose of recognizing and extracting relevant financial indicators together with their values and related narratives. We employ a rule-based approach to handle the syntactic and morphological variations contained in financial text. The information is extracted using a multi-pass scan that processes the text in a series of pre-programmed functions. Experiments were carried out to demonstrate the feasibility of the system by deriving its precision and recall scores. The extracted results can be used to construct and augment knowledge bases for more complex text mining systems.
Keywords— Information extraction, rule-based approach, financial statement analysis, text mining.
I. INTRODUCTION
Extracting and mining relevant information from vast amounts of text is a daunting task because the documents lack formal structure. Financial statements are more complex still because of their alphanumeric characteristics, so they need a specific fact extraction process tailored to the problem domain. This research is concerned with the efficient extraction of information from alphanumeric documents. The basic purpose of extracting information from financial statements is to measure the business performance of companies. Specific performance indicators are distributed at widely spaced and irregular intervals throughout the financial statements. These indicators can objectively show the state of a company. We therefore focus our extraction on three main indicators, i.e. net profit/loss, share capital and total assets, because they are the basis of profitability ratios such as return on assets (ROA) and return on equity (ROE). Furthermore, profitability ratios are widely used to measure managerial efficiency [1].
Information extraction is the basis of text mining technology. It is the process of scanning the text to identify and extract key phrases and relevant sentences [2]. Texts that contain predictable information and can be expressed unambiguously using a set of linguistic forms are better candidates for information extraction [3].
The paper is arranged as follows. Section II presents related work on information
extraction. Section III describes the characteristics of the data involved in this research. Section IV explains the proposed method. We present the results and evaluation in Section V. The paper ends with a conclusion in Section VI.
II. RELATED WORK
Earlier information extraction (IE) systems were domain specific and were preprogrammed or trained to recognize predefined information. JASPER (Carnegie Group), a former commercial information extraction system, depended on complex handcrafted templates [3]. A more advanced technique for information extraction is reported in [4], where patterns are discovered through observations on regions in typical documents without relying on any predefined templates. A more recent study [5] performed general NLP tasks such as tagging and parsing before conducting domain-specific relation extraction on the parse tree. Relation extraction is also explored in [6], where a linear-chain conditional random field is constructed to extract relations from biographical text. Most of the work reported here focuses on unstructured full-text documents, e.g. news and e-mails.
Financial statements pose an additional challenge due to their alphanumeric and tabular characteristics. The authors of [7] semi-automatically infer the structure of financial statements and use a matching language together with templates to extract financial values from them. Our approach follows the multi-pass strategy proposed in [8], employing specific rule-based methods first to recognize the noun phrases and performance indicators within the documents and then to extract them efficiently from the financial statements for further processing.
III. THE CHARACTERISTICS OF DATA
A financial statement is an alphanumeric document, i.e. a combination of complete sentences arranged in paragraphs and monetary values arranged in tabular form. The nature of financial text is different from natural conversational language.
It contains a variety of morphologies and synonyms. The financial terms and jargon of the domain are normally repeated throughout the text. Further empirical observation of the financial text reveals that there are rare cases of semantic
ambiguities occurring, and the words and phrases in the text can be classified into a limited number of groups.
The dataset used in this work is a collection of financial statements from a local Islamic bank covering a period of 9 years (2000 – 2008). The documents were obtained from the annual reports published on the bank’s website. These documents, originally in PDF format, were converted into text files, preserving their layout as far as possible. Altogether the corpus contains 909 pages with approximately 163,000 words arranged in 24,000 paragraphs and 51,000 lines. As mentioned in Section I, we focus our extraction on three major managerial performance indicators, i.e. net profit/loss, share capital and total assets. We extracted not only the indicator values but also the sentences narrating these indicators.
IV. METHOD OF EXTRACTION
The fundamental challenge in information extraction is to ensure that the information is extracted efficiently according to the requirements; our aim is therefore to develop an extractor tailored to the special needs of extracting the financial performance indicators together with their respective textual descriptions. The extractor should be able to excerpt relevant information and filter out irrelevant information from the lengthy financial statements.
We have adopted a rule-based approach to perform the information extraction. A rule-based approach defines a set of rules for possible textual relationships, which tends to give higher retrieval accuracy [9]. The rule-based approach is preferred because it is able to perform well either with or without the syntactic information of the text. This implies that the approach can be applied without prior parsing and tagging of the text, as was done in this work. The system was implemented in an integrated development environment named VisualText using the NLP++ programming language. The performance indicators were extracted through a series of steps.
Figure 1 portrays the steps involved. The method begins with multi-pass scanning of the raw text. Each scan incrementally processes a single input file and annotates a single parse tree based on the information learned so far.
[Figure: flowchart. Raw Text → Tokenizing → Joining → Zoning → Noun Phrase Recognition (input: Phrase List) → Indicator & Value Recognition (input: Indicator List) → Posting → Extracted Text]
Fig. 1 Information Extraction Components
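The multi-pass idea in Fig. 1 can be illustrated with a small sketch. The original system was written in NLP++ inside VisualText; the Python below is only an illustration under stated assumptions, not the authors' implementation. The `Node` class, the pass functions and the leaf labels `_alpha`, `_num`, `_punct` and `_white` are hypothetical names; only `_number`, `_monetary` and `_percent` come from the paper. Date joining is omitted for brevity.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One parse-tree node: a label plus its text and any child nodes."""
    label: str
    text: str = ""
    children: list = field(default_factory=list)


# Hypothetical leaf labels; each alternative is tried left to right.
TOKEN = re.compile(
    r"(?P<_alpha>[A-Za-z]+)|(?P<_num>\d+)|(?P<_white>\s+)|(?P<_punct>.)", re.S
)


def tokenize_pass(tree):
    """Pass 1: break raw text into alphabetic, numeric, whitespace and punctuation leaves."""
    tree.children = [Node(m.lastgroup, m.group()) for m in TOKEN.finditer(tree.text)]
    return tree


def join_pass(tree):
    """Pass 2: join adjacent leaves into _number, _monetary and _percent units."""
    leaves, out, i = tree.children, [], 0
    while i < len(leaves):
        prefix = ""
        # A currency symbol directly followed by digits starts a monetary value.
        if leaves[i].text == "RM" and i + 1 < len(leaves) and leaves[i + 1].label == "_num":
            prefix, i = "RM", i + 1
        if leaves[i].label != "_num":
            out.append(leaves[i])
            i += 1
            continue
        text = leaves[i].text
        i += 1
        # Absorb ",ddd" groups that directly follow the first digit run.
        while i + 1 < len(leaves) and leaves[i].text == "," and leaves[i + 1].label == "_num":
            text += "," + leaves[i + 1].text
            i += 2
        if prefix:
            out.append(Node("_monetary", prefix + text))
        elif i < len(leaves) and leaves[i].text == "%":
            out.append(Node("_percent", text + "%"))
            i += 1
        else:
            out.append(Node("_number", text))
    tree.children = out
    return tree


def run_passes(raw_text, passes=(tokenize_pass, join_pass)):
    """Apply each pass in order to one shared tree, as in a multi-pass scan."""
    tree = Node("_ROOT", raw_text)
    for scan in passes:
        tree = scan(tree)
    return tree


tree = run_passes("Total assets RM20,000 rose 75% from 534,234")
joined = [(n.label, n.text) for n in tree.children if n.label != "_white"]
print(joined)
```

Each pass reads and rewrites the same tree, so later passes (zoning, phrase and indicator recognition) could be appended to the `passes` tuple without changing the earlier ones.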
A. Tokenizing
In the first scan, the document is tokenized. Tokenizing is the process of converting the characters of the raw text into tokens. In this component, the raw text is first broken into text units of alphabetic, numeric, punctuation, and white-space characters. After this process, the raw text has been transformed into a parse tree with the tokenized text units as its leaves. This parse tree is updated in the subsequent passes.
B. Joining
The second pass performs a joining operation on the token leaves of the parse tree. This operation is essential because some tokens must be joined so that they are treated as one group. In financial statements, numerical text units may have to be joined to form large numbers, monetary values, percentages and dates. The following are some examples of tokens that are joined.
• ‘534,234’ is a combination of numerals and commas and will be recognized as _number
• ‘RM20,000’ is a combination of a monetary symbol, numerals and commas and will be recognized as _monetary
• ‘75%’ is a combination of numerals and a percentage symbol and will be recognized as _percent
• ‘30 June, 2008’ is a combination of numerals, text, a comma, and numerals and will be recognized as _date
C. Zoning
In this pass, a zoning function is applied to the parse tree resulting from the previous operation. The parse tree is zoned into paragraphs, headers, sentences, and table zones. Zoning facilitates searching because the search space can be reduced by focusing directly on certain headers. This further improves the process of finding the required information. We have successfully zoned the financial statements according to the standard predefined headers, e.g. directors' report, balance sheet, income statement, notes to the accounts etc. With these zones, we are able to restrict the search for the performance indicators to a specific zone. The search itself is done in the Indicator Recognition component. For example,
the search for total assets was focused on the balance sheet zone. This is easily achieved by specifying the path along which the parse tree is explored.
D. Noun Phrase Recognition
We perform noun phrase recognition on the zoned documents to identify important phrases. Some blank lines and white-space characters are removed before this operation takes place. A list of noun phrases specific to Islamic banking is given as input to this process. Some examples of these noun phrases are principle of shariah, zakat and tax, mudharabah depositors, takaful funds etc. Recognition is done by a pattern matching technique incorporating specific rules. In addition, proper nouns such as the names of cities, states and subsidiary companies are also searched for and recognized. This new information is also annotated in the parse tree.
E. Indicator Recognition
In this component, a list of financial indicators to be searched for and extracted is provided, i.e. net profit/loss, share capital and total assets. Since the pattern matching technique is used in this component as well, we provide all the case variants of the words: uppercase, lowercase, sentence case etc. Financial indicators are identified and extracted together with their attributes or values and the accompanying text that explains them.
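A minimal sketch of zone-focused indicator recognition follows, written in Python rather than the NLP++ used for the original system. It makes several assumptions: the `zones` dictionary is a hypothetical stand-in for the zoned parse tree, the figures in it are made up for illustration, and case-insensitive matching replaces the explicit list of case variants described above.

```python
import re

# Indicator names taken from the paper; "net profit/loss" is split into two patterns.
INDICATORS = ["net profit", "net loss", "share capital", "total assets"]

# Digit runs with optional thousands separators and an optional RM prefix.
VALUE = re.compile(r"(?:RM)?\d+(?:,\d{3})*")


def find_indicators(zones, zone_name, indicators=INDICATORS):
    """Search one zone only and return each indicator with its values and source line."""
    hits = []
    for line in zones.get(zone_name, []):
        for name in indicators:
            match = re.search(re.escape(name), line, re.IGNORECASE)
            if match:
                # Values are read from the text that follows the indicator name.
                values = VALUE.findall(line[match.end():])
                hits.append({"indicator": name, "values": values, "text": line.strip()})
    return hits


# Hypothetical zoned document with made-up figures, keyed by header name.
zones = {
    "balance sheet": [
        "TOTAL ASSETS        19,333,156    14,543,022",
        "Share capital          563,000       563,000",
    ],
    "income statement": [
        "Net profit for the year    255,490    190,110",
    ],
}

hits = find_indicators(zones, "balance sheet")
for hit in hits:
    print(hit["indicator"], hit["values"])
```

Restricting the loop to one zone is what keeps the search space small: the income statement lines are never scanned when looking for total assets.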
F. Posting
Once the needed information is extracted and placed in the parse tree, the output can be generated easily by traversing the tree and printing out the results. The result of all these operations is an extracted text that contains the relevant performance indicators and phrases. Figure 2 shows a sample of the extracted information.
Fig. 2 Sample of extracted information
V. RESULTS AND EVALUATION
In order to evaluate our proposed method and the developed performance indicator extraction system, an experiment was performed on the dataset. To evaluate the accuracy of the extracted information, the following Extractor Precision (EP) and Extractor Recall (ER) ratios were defined:
$$EP = \frac{R}{R+I} \tag{1}$$
$$ER = \frac{R}{R+N} \tag{2}$$
where R represents the number of relevant items retrieved, I represents the number of irrelevant items retrieved and N represents the number of relevant items not retrieved. Our information extractor achieved high precision and recall scores for all performance indicators, i.e. total assets, share capital and net profit/loss. The average results of the experiment over all performance indicators are shown in Table I. The scores are comparable to, and on some occasions better than, those of other information extraction systems reported in the literature.
TABLE I PRECISION AND RECALL SCORES
Measurement    EP     ER
Scores         85%    71%
VI. CONCLUSIONS
This paper describes an automatic information extraction system that recognizes and extracts relevant performance indicators from financial statements. The method presented here employs a rule-based approach and a multi-pass scan of the financial statement to process the text in a series of preprogrammed functions. We demonstrated the feasibility of our method through an experiment whose results are comparable with those of common information extraction systems. This research contributes to different areas such as natural language processing, information retrieval, and text mining, which benefit from accurate extraction of text contents. The automatic extraction of relevant performance indicators from financial statements facilitates the process of obtaining knowledge from the extracted information and lays the foundation for a larger financial text mining system.
REFERENCES
[1] A. Samad and M. K. Hassan, "The performance of Malaysian Islamic Bank During 1984-1997: An Exploratory study," International Journal of Islamic Financial Services, vol. 1, 1998.
[2] W. Fan, L. Wallace, S. Rich, and Z. Zhang, "Tapping the Power of Text Mining," Communications of the ACM, vol. 49, pp. 77-82, 2006.
[3] P. M. Andersen, P. J. Hayes, A. K. Huettner, L. M. Schmandt, I. B. Nirenburg, and S. P. Weinstein, "Automatic extraction of facts from press releases to generate news stories," in Proceedings of the Third Conference on Applied Natural Language Processing, 1992, pp. 170-177.
[4] L. Ma and J. Shepherd, "Information Extraction Using Two-Phase Pattern Discovery," in The 27th Annual International ACM SIGIR Conference. Sheffield, South Yorkshire, UK: ACM, 2004.
[5] H. Mangassarian and H. Artail, "A general framework for subjective information extraction from unstructured English text," Data & Knowledge Engineering, vol. 62, pp. 352-367, 2007.
[6] A. Culotta, A. McCallum, and J. Betz, "Integrating probabilistic extraction models and data mining to discover relations and patterns in text," in Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT/NAACL), 2006.
[7] W. Kornfeld and J. Wattecamps, "Automatically Locating, Extracting and Analyzing Tabular Data," in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 347-348.
[8] A. Meyers, "Multi-Pass Multi-Strategy NLP," Text Analysis International, Inc.
[9] D. Zhou and Y. He, "Extracting interaction between proteins from the literature," Journal of Biomedical Informatics, vol. 41, pp. 397-407, 2008.