
R-Cubes: OLAP Cubes Contextualized with Documents

Juan Manuel Pérez, Rafael Berlanga, María José Aramburu
Universitat Jaume I, 12071 Castellón, Spain
{martinej,berlanga,aramburu}@uji.es

Torben Bach Pedersen
Aalborg University, 9220 Aalborg, Denmark
[email protected]

1-4244-0803-2/07/$20.00 ©2007 IEEE

1. Introduction

Current data warehouse and OLAP [3] technologies can be efficiently applied to analyze the huge amounts of structured data that companies produce. These organizations also produce many text documents and use the Web as their largest source of external information. Although these documents include highly valuable information that companies should also exploit, they cannot be analyzed by current OLAP technologies because they are unstructured and mainly contain text. The current trend is for these documents to be available in XML-like formats. Our proposal is to build XML document warehouses that companies can use to store unstructured information coming from their internal and external sources.

In [6] we proposed an architecture for the integration of a corporate warehouse of structured data with a warehouse of text-rich XML documents. We call the resulting warehouse a contextualized warehouse. Since the XML document warehouse may contain documents about many different topics, we apply well-known Information Retrieval (IR) [2] techniques to select the context of analysis from the document warehouse. First, the user specifies an analysis context by supplying a sequence of keywords (e.g., an IR condition like “financial crisis”). Then, the analysis is performed on a so-called R-cube (Relevance cube), which is materialized by retrieving the documents and facts related to the selected context. Each fact in the R-cube is linked to the set of documents that describe its context, and is assigned a numerical value representing its relevance with respect to the specified context (e.g., how important the fact is for a “financial crisis”). In [6] we provided R-cubes with a data model and an algebra.

This paper presents a prototype R-cube system and explains how to use it.

2. Usage Example

Let us consider the warehouse of a stock market analyst. In this data warehouse the user keeps a historical record of the major world market indexes as measured by Morgan Stanley Capital International Perspective [1]. The OLAP interface of Figure 1 allows the analyst to query the data warehouse, and shows an analysis cube with two dimensions: Markets and Date.
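To make the two-dimensional analysis cube concrete, the following is a minimal Python sketch of a roll-up over the Date dimension, from days to months, for each market. The fact values, the tuple layout, and the function name `roll_up_to_month` are illustrative assumptions; they are not taken from the prototype or from the multidimensional model of [4].

```python
from collections import defaultdict
from statistics import mean

# Hypothetical facts: (market, date as "YYYY-MM-DD", index value).
# The numbers are invented for illustration only.
facts = [
    ("Japan", "1990-08-01", 1000.0),
    ("Japan", "1990-08-02", 900.0),
    ("Germany", "1990-08-01", 500.0),
    ("Germany", "1990-08-02", 490.0),
]


def roll_up_to_month(facts):
    """Aggregate daily index values to the average per (market, month) cell."""
    cells = defaultdict(list)
    for market, date, value in facts:
        month = date[:7]  # keep only "YYYY-MM"
        cells[(market, month)].append(value)
    return {cell: mean(values) for cell, values in cells.items()}


cube = roll_up_to_month(facts)
for (market, month), avg in sorted(cube.items()):
    print(market, month, avg)
```

Each key of `cube` corresponds to one cell of the Markets by Date cube at the month level; a real OLAP engine would also support drill-down and other aggregation functions.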

Figure 1. OLAP window

Let us suppose that there is recent news about an important conflict in the Middle East. Conflicts have been frequent in this area during recent decades, so the analyst decides to study how the stock markets reacted to the Iraq war of 1990. Performing such an analysis directly over the raw factual data of the corporate warehouse (even through the OLAP interface) is a hard task, since stock markets are sensitive to almost any event. It would be much more useful to have some extra information describing the circumstances of each fact (its context).

Our example analyst has access to a digital collection of international business newspapers. Among other things, these articles report the trends of the markets. It is common to find news explaining how stock markets are affected by financial circumstances, e.g.: “The reaction of German market to the rise of the interest rates is expected to be . . . ”. By using the IR tool presented in Figure 2, the user can explore the collection of newspaper articles and retrieve the document set that will describe the analysis context (e.g., “Iraq”). Afterwards, by clicking on the “Contextualize” button, the user can continue the analysis in the OLAP window shown in Figure 3.
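The keyword-based selection of a document set can be sketched with a small inverted index. This is only an illustration under simplifying assumptions: it uses plain Boolean matching (every keyword must occur) with no stemming and no ranking, whereas the prototype relies on the relevance-based IR model of [5]. The documents and all function names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def retrieve(index, query):
    """Return the ids of the documents that contain every query keyword."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical article snippets, invented for this example.
docs = {
    "d1": "Tokyo shares fell after the invasion of Kuwait by Iraq.",
    "d2": "The German market reacted to the rise of the interest rates.",
    "d3": "Projects in Iraq and Kuwait were frozen.",
}
index = build_inverted_index(docs)
hits = retrieve(index, "Iraq")
print(sorted(hits))
```

Here `hits` plays the role of the document set that describes the analysis context for the keyword “Iraq”.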


Figure 2. IR window

Figure 4. Contextualized warehouse (architecture diagram; recoverable labels: Document Warehouse, Corporate Warehouse, Fact Extractor, Q, XPath, Dimensions, Contexts & Facts, Contextualized Facts, Corp. Facts, R-cube, OLAP Cube, Document Analysts, Contexts & Facts Analysts)

Figure 3. OLAP window showing an R-cube

The window in Figure 3 presents an R-cube that shows the relevance and the context assigned to each fact of the original cube. The most relevant facts correspond to Japan and the months of August and September 1990. As shown in Figure 3, the Japanese market index fell sharply during these months, by 100 points, whereas the other markets fell by about 10 points on average. By selecting the fact that represents the average index of the Japanese market in August, the analyst makes the system show the documents that describe its context (see the right side of Figure 3). There, the analyst discovers that “... plant engineering companies fell as their projects in Iraq and Kuwait were frozen because of the economic sanction of Japan against Iraq”. Now that a new conflict is happening in the Middle East, the analyst will watch Japanese investments.
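The structure of an R-cube fact, a measure plus a relevance value and a set of context documents, can be sketched as follows. The relevance formula used here (the share of the total retrieval score carried by the documents linked to a fact) is an assumption made only for illustration; the actual relevance computation is defined by the model in [6]. The class `RFact`, the function `contextualize`, the document ids, and all scores are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RFact:
    market: str
    month: str
    index_change: float  # illustrative measure (change in index points)
    relevance: float = 0.0
    context_docs: list = field(default_factory=list)


def contextualize(facts, doc_scores, links):
    """Attach context documents and a relevance value to each fact.

    doc_scores: doc id -> retrieval score for the chosen context.
    links: (market, month) -> ids of documents that mention that fact.
    Relevance is ASSUMED to be the fraction of the total retrieval score
    contributed by the fact's documents; this is not the formula of [6].
    """
    total = sum(doc_scores.values()) or 1.0
    result = []
    for f in facts:
        docs = [d for d in links.get((f.market, f.month), []) if d in doc_scores]
        rel = sum(doc_scores[d] for d in docs) / total
        result.append(RFact(f.market, f.month, f.index_change, rel, docs))
    # Most relevant facts first, as an analyst would want to see them.
    return sorted(result, key=lambda f: f.relevance, reverse=True)


facts = [
    RFact("Germany", "1990-08", -10.0),
    RFact("Japan", "1990-08", -100.0),
]
doc_scores = {"d1": 0.5, "d3": 0.25, "d4": 0.25}  # retrieved for the context
links = {
    ("Japan", "1990-08"): ["d1", "d3"],
    ("Germany", "1990-08"): ["d4", "d9"],  # d9 was not retrieved for this context
}
r_cube = contextualize(facts, doc_scores, links)
for f in r_cube:
    print(f.market, f.month, f.index_change, f.relevance, f.context_docs)
```

Selecting a fact in the interface then amounts to reading its `context_docs` list, which mirrors the document panel on the right side of Figure 3.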

3. The Prototype

The demonstration will present our prototype R-cube system for the scenario above. We have considered a total of 132 articles from well-known international business newspapers published during 1990, and only the market indexes of that year, which yields 1396 facts at the lowest dimension categories. Testing the performance of the system with larger data sets is future work.


Figure 4 shows the architecture of the contextualized warehouse as proposed in [6]. Its main components are a corporate warehouse, an XML document warehouse, and the Fact Extractor module. To evaluate keyword-based searches over the XML collection, the document warehouse keeps an inverted file index [2] and implements the Relevance Modeling logic of the IR model presented in [5]. Stemming and proper noun recognition are performed by the Tree Tagger tool [7]. The corporate cubes and OLAP operations are supported by an implementation of the data model and algebra operators of the base multidimensional model [4]. The Fact Extractor module relates each fact to the documents that describe its context by looking for date, stock market, and region references in the paragraphs of the documents. Finally, analysis capabilities over R-cubes are provided by an implementation of the data model and algebra operators discussed in [6].
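The idea behind the Fact Extractor, scanning each paragraph for market and date references and linking the document to the matching facts, can be sketched as below. This is a simplified illustration, not the prototype's extractor: the term lists, the month lookup, the `default_year` parameter, and the function name `extract_links` are all assumptions, and a real extractor would rely on the tagging and proper noun recognition mentioned above.

```python
import re

# Hypothetical vocabularies; a real system would derive these from the
# dimension categories of the corporate warehouse.
MARKET_TERMS = {
    "Japan": {"japan", "japanese", "tokyo"},
    "Germany": {"germany", "german", "frankfurt"},
}
MONTHS = {
    "january": "01", "february": "02", "march": "03", "april": "04",
    "june": "06", "july": "07", "august": "08", "september": "09",
    "october": "10", "november": "11", "december": "12",
}  # "may" is left out on purpose: it is ambiguous with the modal verb


def extract_links(doc_id, paragraphs, default_year):
    """Return (market, "YYYY-MM", doc_id) links found in the paragraphs.

    A link is created only when a market reference and a month reference
    occur in the same paragraph. The year is ASSUMED to be the
    publication year of the document.
    """
    links = set()
    for paragraph in paragraphs:
        words = set(re.findall(r"[a-z]+", paragraph.lower()))
        markets = [m for m, terms in MARKET_TERMS.items() if words & terms]
        months = [num for name, num in MONTHS.items() if name in words]
        for market in markets:
            for month in months:
                links.add((market, f"{default_year}-{month}", doc_id))
    return links


paragraphs = [
    "Japanese plant engineering companies fell in August.",
    "The German market was calm.",  # no date reference, so no link
]
links = extract_links("d1", paragraphs, default_year=1990)
print(sorted(links))
```

Requiring the market and the date to co-occur in one paragraph is a deliberate design choice in this sketch: it keeps links precise at the cost of missing references that are spread over several paragraphs.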

References

[1] Morgan Stanley Capital International Inc. http://www.msci.com. Last accessed July 14, 2006.
[2] R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999.
[3] R. Kimball and M. Ross. The Data Warehouse Toolkit. John Wiley & Sons, 2002.
[4] T. B. Pedersen, C. S. Jensen, and C. E. Dyreson. A foundation for capturing and querying complex multidimensional data. Inf. Syst., 26(5):383–423, 2001.
[5] J. M. Pérez, R. Berlanga, and M. J. Aramburu. A Document Model Based on Relevance Modeling Techniques for Semistructured Information. In Proc. of DEXA, pages 318–327. Springer, 2004.
[6] J. M. Pérez, R. Berlanga, T. B. Pedersen, and M. J. Aramburu. A relevance-extended multi-dimensional model for a data warehouse contextualized with documents. In Proc. of DOLAP, pages 19–28. ACM Press, 2005.
[7] H. Schmid. Probabilistic part-of-speech tagging using decision trees. In Proc. of NeMLaP, 1994.
