
Midas: Integrating Public Financial Data

Sreeram Balakrishnan∗, Vivian Chu, Mauricio A. Hernández, Howard Ho, Rajasekar Krishnamurthy, Shi Xia Liu+, Jan H. Pieper, Jeffry S. Pierce, Lucian Popa, Christine M. Robson, Lei Shi+, Ioana R. Stanoi, Edison L. Ting∗, Shivakumar Vaithyanathan, Huahai Yang

IBM Research - Almaden, 650 Harry Road, San Jose, CA 95120
∗IBM Silicon Valley Lab, 555 Bailey Ave., San Jose, CA 95141
+IBM Research - China, Beijing, 100193

ABSTRACT

The primary goal of the Midas project is to build a system that enables easy and scalable integration of unstructured and semi-structured information present across multiple data sources. As a first step in this direction, we have built a system that extracts and integrates information from regulatory filings submitted to the U.S. Securities and Exchange Commission (SEC) and the Federal Deposit Insurance Corporation (FDIC). Midas creates a repository of entities, events, and relationships by extracting, conceptualizing, integrating, and aggregating data from unstructured and semi-structured documents. This repository enables applications to use the extracted and integrated data in a variety of ways, including mashups with other public data and complex risk analysis.

Figure 1: Architecture

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD'10, June 6–11, 2010, Indianapolis, Indiana, USA. Copyright 2010 ACM 978-1-4503-0032-2/10/06 ...$10.00.

Categories and Subject Descriptors: H.4.m [Information Systems Applications]: Miscellaneous
General Terms: Design

1. INTRODUCTION

Publicly traded companies are required to disclose their financial statements periodically to the U.S. Securities and Exchange Commission (SEC) and the Federal Deposit Insurance Corporation (FDIC). Information in these filings, predominantly buried in text and semi-structured formats, is of crucial interest to investors, financial analysts, lawyers, and bankers. Potential investors need to understand the web of relationships that a company has with other companies and its subsidiaries. The activities of company officers and directors (i.e., insiders) have a significant impact on a company's health and prospects and provide clues about its future performance. Furthermore, when a company applies for a loan, loan officers need to understand certain inter- or intra-company relationships to determine if, for example, the total debt of subsidiaries is excessive. Providing such an integrated view of the information is challenging due to the format of the data, the quality of the entries, and the sheer size of the corpus. For instance, the SEC currently receives close to a million new filings each year. Each filing varies in size (from hundreds of KBytes to over 10 MBytes) and may use a variety of data formats (e.g., English text, html, xml).

We present Midas, a system that unleashes the value of the information buried in SEC filings by extracting, conceptualizing, integrating, and aggregating data from semi-structured or text filings. We focus not only on the semantic challenges of integrating heterogeneous data from text and from semi-structured formats, but also on the efficiency and scalability of processing these complex tasks over large amounts of information. Midas, built from the ground up in 2009, is a scalable Hadoop-based system currently running at IBM Research – Almaden. Through sophisticated extraction, integration, and aggregation features, it transforms the data from a document or record view of the world to an object-centric view, where multiple facts about the same real-world entity are merged into one object with, ideally, clean and complete attributes. In a sense, as recently suggested by [4], we are building a concept-centric repository for the financial domain, and we are creating tools to replicate our efforts in other domains [7]. One of the most salient features of our system is the synergistic integration into one framework of multiple components spanning the entire, end-to-end integration flow. The main stages in such a flow are:

• Information Extraction: Extraction of various facts from the multitude of text or html documents archived by the SEC or present in other public data repositories.


• Information Integration: Mapping of the extracted facts to a target model or schema, resolving and merging references to the same real-world entity (i.e., entity resolution), and creating correct relationships among the resulting objects.

• Temporal Analysis and Fusion: Transforming a collection of unprocessed but time-stamped facts into objects with a clearly defined time line or history.

Our broader research aims to develop novel algorithms and tools, as well as scalable and reusable software modules, for all of the stages mentioned above. In particular, we are looking at new algorithms for entity resolution that can be integrated with mapping and fusion algorithms and that can be applied on a continuous basis (i.e., as new documents or data sources are discovered). Finally, one of our most important goals is defining high-level abstractions and models that can be used to specify the entire integration flow declaratively and at a high level. This will make the resulting framework and system applicable to new domains and to new users (i.e., domain experts who are not necessarily data integration experts).

2. MIDAS ARCHITECTURE AND FLOW

Figure 1 describes the architecture used to run Midas, which was inspired by the "Content Analytics Platform" described in [3]. We use Hadoop as a distributed store and as a map/reduce execution engine. On top of Hadoop, we use Jaql [2] as a high-level language for data transformation; Jaql uses JSON as its native data model and is compiled into Hadoop map/reduce jobs. To extract structured data from unstructured sources (e.g., from text), we defined extraction rules for identifying concepts mentioned in text and html documents using the declarative query interface exposed by SystemT [5]. Finally, Lucene (http://lucene.apache.org/) and an extended version of Nutch (http://lucene.apache.org/nutch) are used to index and crawl data, respectively.

Figure 2 shows the steps in our current Midas flow. The flow starts by "crawling" SEC data using Nutch. The SEC maintains a public repository of regulatory filings that goes back to 1993 and currently contains nearly 10,000,000 filings (the number increases daily). Our crawling process identifies and downloads all SEC filings of certain form types that are related to financial companies and services. To determine which filings are potentially related to financial services, we use all "industry codes" reported by the issuer. Moreover, to restrict our analysis to data that can be considered "recent," we arbitrarily picked a 5-year window and ignore documents whose reporting period is prior to 2005.

Figure 2: Midas Components Flow

With these filters in place, Midas crawled about 1,000,000 filings (50 GBytes) over several months (the SEC limits the time and frequency of any crawl). From the FDIC site, Midas downloaded and integrated nearly 77,000 documents ("Call Reports") for banking subsidiaries.

Information Extraction: The second step is to extract annotations from the unstructured data. We wrote a large collection of information extraction modules (annotators) using SystemT to identify concepts such as entities, events, and relationships, as well as financial data such as investment and lending information. The complexity of these rules varies greatly depending on the extraction task. For instance, we have rules that specialize in detecting company names, person names, titles, board and committee memberships, biographies, loan information, company holdings, and shareholder information. In many cases, general rules that apply to simple patterns are reused by more specialized rules to perform specific data extraction. This step results in a sequence of annotated objects, which is then indexed and used for searching documents via a semantic-search interface.

Information Integration: Midas executes several entity resolution steps to combine data from individual documents into raw entities. For example, in this step, extracted data containing the biography of a person is linked with other data containing the current affiliation of the person and recent transactions. This step runs a sequence of matching rules, implemented in Jaql, which rely on matching key values, names, addresses, stock tickers, or any other available piece of information. Midas then maps and fuses the raw entity data into the chosen schema of our entities; in our case, Midas maps the raw data into our company or person entity schema. During this process, duplicate values are created for some fields, and Midas needs to fuse them into one (or more) normalized values. For instance, the title of a person reported as "CEO" from one source, "Chief Executive Officer" from another, and "CEO, Citigroup" from yet another, is combined into one title, "CEO".

Temporal Analysis: A final step in the flow analyzes the temporal values (timestamps) associated with the various data values in the target objects. In general, each data value that we produce comes with a set of dates corresponding to all the individual documents in which that value appears. The goal of temporal analysis is to identify the time span (i.e., start date – end date) for an attribute or a relationship (e.g., a position within a company), and also to identify the currency or recency of the data. An important aspect here is that the more forms we are able to understand and analyze, the more accurate the time lines we obtain (since we have more data points). Another form of temporal analysis involves deriving the most recent holdings (of a certain type of stock or option) that a key executive has in a company.
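As a toy illustration of the fusion and temporal-analysis steps above, the following Python sketch normalizes duplicate title mentions and derives a time span from filing dates. This is only a sketch: the actual Midas rules are written in Jaql and SystemT, and the synonym table and helper names here are invented.

```python
from datetime import date

# Hypothetical synonym table; the real Midas fusion rules are far richer.
TITLE_SYNONYMS = {
    "chief executive officer": "CEO",
    "ceo": "CEO",
}

def normalize_title(raw: str) -> str:
    """Map a raw title mention to a canonical form.

    Drops trailing company qualifiers such as 'CEO, Citigroup' and
    folds synonyms like 'Chief Executive Officer' into one value.
    """
    head = raw.split(",")[0].strip()          # drop ', Citigroup'-style suffixes
    return TITLE_SYNONYMS.get(head.lower(), head)

def fuse_title(observations):
    """Fuse timestamped title mentions into one value plus a time span.

    observations: list of (raw_title, filing_date) pairs gathered from
    individual documents. Returns (canonical_title, start, end).
    """
    canonical = [(normalize_title(t), d) for t, d in observations]
    title = canonical[0][0]                   # assume one position per time span
    dates = sorted(d for _, d in canonical)
    return title, dates[0], dates[-1]

obs = [
    ("CEO", date(2008, 3, 1)),
    ("Chief Executive Officer", date(2008, 6, 15)),
    ("CEO, Citigroup", date(2009, 1, 20)),
]
print(fuse_title(obs))
# → ('CEO', datetime.date(2008, 3, 1), datetime.date(2009, 1, 20))
```

The key design point mirrored here is that the fused value carries the full set of document dates, so the start/end of the time span falls out of the same data used for deduplication.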

Computing these latest holdings involves keeping track of the transaction and holding records for a person across multiple documents, understanding when and how the values change (increase or decrease), and then finding the latest values. The final data is indexed and exposed to users through a search interface that helps them retrieve and browse the entities and relationships. We currently have extensive information about over 25,000 company and key-people entities.

3. EXAMPLE

We now illustrate some of the complexities Midas faces while merging and cleansing data using an example. Figure 3 shows an actual screenshot of our "employment history" view constructed from the merged data of Mr. John A. Thain. The data for this merged entity comes from multiple SEC documents. Figure 3 shows fragments of three such filings: one from Merrill Lynch [6], another from Bank of America Corp. [1], and one filed by Mr. Thain himself [8]. The arrows indicate the provenance of some of the data in the employment history view. None of the source documents, by itself, contains all the data needed to create our employment history view.

Midas knows John A. Thain is a person of interest because he reports holdings and transactions using SEC Forms 3/4/5. In fact, some details of his employment history are gathered from these forms. In this example, Midas "learns" about Mr. Thain's position with Bank of America from his Form 3 filing in January 2009 [8]. However, if Midas used only these forms, it could only make a rough estimate of the starting and ending dates for each position. Many of the actual start and end dates can be extracted from Mr. Thain's executive biography in Merrill Lynch's 2008 Proxy Statement [6] (e.g., Chairman and CEO "since December 2007"). A final important piece of information is in Bank of America's Current Report (filed in January 2009) announcing Mr. Thain's resignation. Notice how relevant data appears in documents spanning several time periods and filed by different related entities. From the point of view of the SEC, the three example fragments were filed by three entities (two companies and a person), each one with its own SEC identifier.

Figure 3: Finding John Thain's Employment History

4. THE DEMO

Our demo showcases the merged and cleansed data produced by Midas. The current Midas front-end uses keyword search as its entry point. Given a search term (or several terms), the created indexes are queried for companies, persons, and/or forms that match the term. Users select which set of results to explore. For instance, if a user enters "Bank of America" in the search box, our front-end application finds and organizes 11,442 matching SEC filings. These matching filings can be filtered further using "facet" values computed from the query results, narrowing the matching documents by reporting period (the date of the report), form type (e.g., current reports, proxy statements, etc.), industry codes, etc. The front-end also finds a "Bank of America Corp" entity that the user can explore in more detail.

Figure 4 shows our interface's "Company View" for Bank of America. This view shows a selected few relationships between Bank of America and other entities (and allows navigation into those related companies). For instance, Figure 4 shows Bank of America's banking subsidiaries, board members who are also board members of other companies, and several other types of relationships. Links from this view take the user to OLAP reports detailing recent loan activities, banking subsidiary data, and institutional holdings.
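To make the faceted filtering described above concrete, here is a minimal Python sketch. The real front-end computes facets from its Lucene indexes; the records, field names, and counts below are invented for illustration.

```python
from collections import Counter

# Hypothetical search results; real facets come from the Lucene indexes.
filings = [
    {"company": "Bank of America Corp", "form": "8-K",     "period": "2009-01"},
    {"company": "Bank of America Corp", "form": "8-K",     "period": "2008-11"},
    {"company": "Bank of America Corp", "form": "DEF 14A", "period": "2008-04"},
]

def facet_counts(results, facet_field):
    """Count how many matching filings fall under each value of a facet
    field (e.g., form type or reporting period), so the front-end can
    offer those values as filters alongside the result list."""
    return Counter(r[facet_field] for r in results)

def filter_by(results, facet_field, value):
    """Narrow the result set to filings with the given facet value."""
    return [r for r in results if r[facet_field] == value]

print(facet_counts(filings, "form"))          # counts per form type
print(len(filter_by(filings, "form", "8-K"))) # filings left after narrowing
```

Each narrowing step simply re-runs the count over the filtered subset, which is how drill-down by reporting period, form type, and industry code composes.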


Figure 4: Relationships graph

Figure 5: Loans Report for Bank of America

Users can graphically drill up and down among several dimensions (e.g., by year and quarter). For example, Figure 5 shows some of the graphs produced for the loans report. The top graph shows the size of each loan and the exposure Bank of America has on each loan. Users can drill into the details of each loan. The bottom graph shows the number of joint loans between Bank of America and other companies in our dataset. Much of the data needed to create these loan reports appears embedded within the text of loan agreements (reported in SEC Form 8-K filings). Notice that reports of this kind can only be produced by culling data from multiple such agreements over numerous reporting periods, and by correctly resolving references to other companies within the text of the agreements.

During the demo we will also go behind the scenes, showing the JSON objects produced by Midas and comparing them with their source SEC filings. We will walk the audience through the process of extracting, mapping, and fusing the data and discuss the queries used to process and merge it. We will also discuss what quality guarantees we have for the produced data and how we plan to apply Midas' tools to other domains (e.g., retail companies, government spending).
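As a hint of what that behind-the-scenes view looks like, the following snippet builds a toy entity object in the spirit of Midas' JSON output. The schema, field names, and provenance strings here are invented for illustration, not Midas' actual schema; only the "since December 2007" start date comes from the example above.

```python
import json

# Toy object-centric entity in the spirit of Midas' JSON output.
# Field names and provenance strings are invented, not the real schema.
person = {
    "name": "John A. Thain",
    "employment_history": [
        {
            "company": "Merrill Lynch & Co., Inc.",
            "title": "Chairman and CEO",
            "start": "2007-12",  # "since December 2007" in the proxy statement
            "provenance": ["DEF-14A (April 2008)", "Form 3 (January 2009)"],
        },
    ],
}
print(json.dumps(person, indent=2))
```

The point is that each fused attribute keeps a provenance list back to the filings it was extracted from, which is what lets the demo compare objects against their source documents.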

5. ACKNOWLEDGEMENTS

Felix Naumann helped with the implementation of an initial version of Midas that merged Linked Open Data (LOD) sources. Yi-Hong Chu, Melanie Herschel, Jen-Wei Huang, Calvin Lin, and Antonio Sala participated in discussions and contributed to the Midas code. Echo Feng, Yiqiao Li, Zi Li, and Gary Wei, interns from Fordham University, provided useful insights on extracting data from SEC filings.

6. REFERENCES

[1] Bank of America Corp. Current Report. http://www.sec.gov/Archives/edgar/data/70858/000119312509012615/d8k.htm, January 2009. Form 8-K.
[2] K. S. Beyer and V. Ercegovac. Jaql: a Query Language for JSON. http://code.google.com/p/jaql/, 2009.
[3] K. S. Beyer, V. Ercegovac, R. Krishnamurthy, S. Raghavan, J. Rao, F. Reiss, E. J. Shekita, D. E. Simmen, S. Tata, S. Vaithyanathan, and H. Zhu. Towards a Scalable Enterprise Content Analytics Platform. IEEE Data Eng. Bull., 32(1):28–35, 2009.
[4] N. N. Dalvi, R. Kumar, B. Pang, R. Ramakrishnan, A. Tomkins, P. Bohannon, S. Keerthi, and S. Merugu. A Web of Concepts. In PODS, pages 1–12, 2009.
[5] R. Krishnamurthy, Y. Li, S. Raghavan, F. Reiss, S. Vaithyanathan, and H. Zhu. SystemT: a System for Declarative Information Extraction. SIGMOD Record, 37(4):7–13, 2008.
[6] Merrill Lynch & Co., Inc. Proxy Statement. http://www.sec.gov/Archives/edgar/data/65100/000093041308001703/c52269 def14a.htm, April 2008. Form DEF-14A.
[7] A. Sala, C. Lin, and H. Ho. Midas for Government: Integration of Government Spending Data on Hadoop. In Second Int'l Workshop on New Trends in Information Integration (NTII), 2010.
[8] J. A. Thain. Statement of Beneficial Ownership. http://www.sec.gov/Archives/edgar/data/70858/000122520809000096/0001225208-09-000096.txt, January 2009. Form 3.
