Decision Tree Induction for Identifying Trends in Line Graphs

Peng Wu, Sandra Carberry, Daniel Chester, and Stephanie Elzer

Dept. of Computer Science, University of Delaware, Newark, DE 19716 USA
Abstract. Information graphics (such as bar charts and line graphs) in popular media generally convey a message. This paper presents our approach to a significant problem in extending our message recognition system to line graphs — namely, the segmentation of the graph into a sequence of visually distinguishable trends. We use decision tree induction on attributes derived from statistical tests and features of the graphic. This work is part of a long-term project to summarize multimodal documents and to make them accessible to blind individuals.
1 Introduction
Information graphics (non-pictorial graphs, such as line graphs and bar charts) appear often in popular media such as Newsweek and Business Week. Such graphics generally have a message that the graphic designer intended to convey. For example, consider the information graphic shown in Figure 1, which appeared in Business Week. Its intended message is ostensibly that there has been a changing trend in global manufacturing utilization, falling from 2000 to 2002 and then rising until the end of 2006. We are developing a system that reasons about a graphic and the communicative signals present in it to hypothesize the graphic's intended message. The message recognition system plays an integral role in two very different projects:

1. A digital libraries project whose goal is to construct a more complete summary of a multimodal document, one that captures not only the article's text but also its information graphics.
2. An assistive technology project whose goal is to provide blind users with access to graphics in popular media by conveying the graphic's message via natural language.

Previous work on these projects has produced a system that can recognize the message of a simple bar chart [1], along with an interface that provides sight-impaired users with access to the message recognition system [2]. This paper provides our solution to a significant problem encountered in extending our message recognition system to line graphs. Freedman et al. [3] noted
This material is based upon work supported by the National Science Foundation under Grant No. IIS-0534948.
A. An et al. (Eds.): ISMIS 2008, LNAI 4994, pp. 399–409, 2008. © Springer-Verlag Berlin Heidelberg 2008
Fig. 1. Line graph with Change-trend message, from Business Week
Fig. 2. Ragged line graph from USA Today
that information should be presented in a line graph if the goal is to convey a quantitative trend. It is essential that, in reasoning about a line graph's message, we treat the line graph as capturing a sequence of visually distinguishable trends rather than as representing a set of data points connected by small line segments. For example, the line graph in Figure 2 consists of many short rises and falls, but a viewer summarizing it would be likely to regard it as consisting of a short overall stable trend from 1900 to 1930 followed by a long rising trend (both with high variance). This paper focuses on our graph segmentation module, which uses decision tree induction on a variety of attributes of the line graph to develop a model identifying how the graph should be segmented so as to capture the sequence of trends apparent in the graphic. The identified sequence of trends will then be used by the message recognition system, along with other communicative signals, to identify the graphic's intended message, such as the changing-trend message for the graphic in Figure 1.
2 Related Work
Keogh et al. [4] discussed three approaches to linearly segmenting time series: sliding window, top-down, and bottom-up. In our work, we use the top-down approach. Bradley et al. [5] introduced an iterative method for smoothing data series using a Runs Test. Lin et al. [6] and Toshniwal et al. [7] discussed different ways of finding similar time series segments. Vieth [8] discussed piecewise linear regression applied to biological responses, and Dasgupta et al. [9] presented an algorithm for detecting anomalies using ideas from immunology. Yu et al. [10] constructed textual summaries of time-series data sets for gas turbine engines. However, their work was concerned with identifying interesting patterns, such as spikes and oscillations, that were important for a particular problem. The
above research efforts, and other related work, have mainly been concerned with detecting similar patterns or anomalies whereas the goal of our work is the identification of visually apparent trends. In addition, we use decision tree induction to investigate the contribution of a variety of different features, rather than settling on one or two features from the outset.
3 Trend Analysis in Simple Line Graphs

3.1 Sampling of Line Graphs
Our graph segmentation module works on a set of data points sampled from a representation of the original line graph. Sampling is done uniformly across the x-axis, and then additional points are added to capture change points in the graph. A Visual Extraction Module [11] is responsible for processing an electronic image and producing an XML representation of the graphic that includes all change points in the line graph, from which the sampling can be done. However, to train our graph segmentation model, we scanned the 197 hard-copy line graphs, converted them to digital form, and then manually sampled each of them.

3.2 Graph Segmentation
Our graph segmentation module takes a top-down approach to identifying sequences of rising, falling, and stable segments in a graph. For example, the graph in Figure 3a should be identified as composed of three trends (a short rising trend, a longer falling trend, and a rising trend), as shown in Figure 3b.

The graph segmentation module starts with the original graph as a single segment. At each iteration, the module decides whether a segment in the current segmentation of the graph should be viewed as capturing a single trend or whether it should be split into two subsegments. If a decision is made to split the segment into two subsegments, then the segment is split at the point that is farthest from the straight line connecting the two end points of
(a) Line graph with three trends
(b) Trends of Figure 3a
Fig. 3. Line graph with three trends
the segment. Although this method for selecting the split point has produced good empirical results, in rare cases it can select an outlier as a split point; in future work, we will use outlier detection (see Section 3.2.4) to eliminate outliers as possible split points. The graph segmentation module recursively processes each segment and stops when no segment is identified as needing further splitting.

At this point, the individual segments must be represented as straight-line trends. Although the least squares regression line is a mathematically correct representation of a segment as a line, it does not necessarily capture the visual appearance of the trend and also results in disconnected segments representing the overall graph. Thus, once the graph has been broken into subsegments, each segment is represented by a straight line connecting the segment's end points, producing a representation of the overall graph as a sequence of connected line segments, each of which is presumed to capture a visually distinguishable trend in the original graphic.

Decision tree induction is used to build a model for deciding whether to further split a segment. Thirteen attributes are considered in building the decision tree. The next four sections discuss the statistical tests that are the basis for many of the attributes in our decision tree and the motivation for using them.

3.2.1 Correlation Coefficient

A trend can be viewed as a linear relation between the x and y variables. The Pearson product-moment correlation coefficient measures the tendency of the dependent variable to have a rising or falling linear relationship with the independent variable. It is obtained by dividing the covariance of the two random variables x and y by the product of their standard deviations:

r_{xy} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - (\sum x_i)^2}\,\sqrt{n \sum y_i^2 - (\sum y_i)^2}}

The correlation is 1 in the case of an increasing linear relationship, −1 in the case of a decreasing linear relationship, and some value in between for all other cases.
The closer the coefficient is to either −1 or 1, the stronger the correlation between the variables; we use the absolute value of the correlation coefficient in our experiments. We hypothesize that the correlation coefficient may be useful in determining that a set of jagged short segments, such as the interval from 1930 to 2003 in Figure 2, should be captured as a single rising trend and not be split further.

3.2.2 F Test

Although the correlation coefficient is useful in detecting when a segment should be viewed as a single trend (and thus not split further), it is not sufficient by itself. For example, a long smooth rise in a line graph may overshadow a shorter stable portion of the graph (as in Figure 4), resulting in a high correlation coefficient even though the graph should be split into two segments. Conversely, a relatively flat segment, such as the line graph in Figure 5, will have a low correlation coefficient even though it should not be split into subsegments.
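As a concrete illustration, the absolute-correlation attribute can be computed directly from a segment's sampled points. The following Python sketch is ours, not the paper's implementation; the function name is illustrative, and returning 0.0 for a degenerate (constant) coordinate is our own convention.

```python
import math

def correlation_attribute(points):
    """Absolute Pearson correlation |r| of a segment's sampled (x, y) points.

    Returns 0.0 when either coordinate is constant, in which case the
    correlation is undefined (a convention we adopt for this sketch).
    """
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    syy = sum(y * y for _, y in points)
    sxy = sum(x * y for x, y in points)
    denom = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
    if denom == 0:
        return 0.0
    return abs((n * sxy - sx * sy) / denom)
```

A perfectly rising segment yields 1.0, while a flat segment yields 0.0, matching the behavior described above.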
Fig. 4. Graph with high correlation coefficient but which should be treated as two trends

Fig. 5. Graph with low correlation coefficient, but which should be treated as a single trend
To address this, we make use of the F test [8,12], which can measure whether a two-segment regression is significantly different from a one-segment regression based on the differences in their respective standard deviations. The null hypothesis is that the two regression models are equal, suggesting that the segment need not be split further into subsegments. The F test statistic is computed as

F = \frac{(RSS_L - RSS)/2}{RSS/(n - 4)}
where n is the total number of points, RSS_L is the residual sum of squares of the one-phase least squares linear regression, and RSS is the residual sum of squares of the two-phase piecewise least squares linear regression. F here follows an F-distribution with (2, n − 4) degrees of freedom, as given in [12]. For each sample point x_i where 1 < i < n − 1, we test whether it is appropriate to treat x_1 to x_i and x_{i+1} to x_n as two linear regressions. We use a significance level of α = 0.05 and the critical value based on sample size given in [12]. We hypothesize that attributes based on the F test may be useful in identifying whether to split a segment into two subsegments.

3.2.3 Runs Test

In using the F test to suggest when a segment might represent a sequence of two trends, we consider every possible way of breaking the segment into two subsegments. This is computationally impractical when considering more than two subsegments. However, we still need to recognize when a segment consists of more than two trends, such as the graph in Figure 6. For this graph, the correlation coefficient is high and the two-segment F test fails. Thus we resort to the Runs Test [5].

The Runs Test detects whether a regression fits the data points well. For each point, it calculates the residual from the regression line and categorizes it as +1 or −1, according to whether the residual is positive or negative. Then the number of runs is calculated, where a run is a maximal sequence of consecutive residuals belonging to the same category. If N_+ is the number of positive residual points and N_− is the number
Fig. 6. Line graph with three trends in it, sampled from Business Week

Fig. 7. Line graph of falling trend, sampled from USA Today
of negative residual points, the mean and standard deviation of the number of runs are approximated as

R_{mean} = \frac{2 N_+ N_-}{N_+ + N_-} + 1, \qquad SD = \sqrt{\frac{2 N_+ N_- (2 N_+ N_- - N_+ - N_-)}{(N_+ + N_-)^2 (N_+ + N_- - 1)}}
If the number of runs computed from the data points is sufficiently close to R_{mean} ± SD, the residuals are probably a reasonable approximation of the error from the regression, and the regression model may be regarded as a good fit to the data points. In our application, we use the least squares linear regression through the sampled points as a linear approximation of the segment and use the Runs Test to check how well this regression fits the data points. If the actual number of runs R is larger than R_{mean} − SD, then the Runs Test suggests that the segment represents a single trend. Thus we hypothesize that attributes based on the Runs Test may be helpful in inducing a decision tree for deciding when to split a segment.

Although the Runs Test appears powerful in suggesting whether a segment should be split further, it alone is insufficient. The Runs Test uses only the sign of the residual, not its value. It may suggest that the line graph in Figure 7 should be split, rather than viewing it as a single falling trend. However, other attributes, such as the correlation coefficient discussed earlier, will suggest otherwise.

3.2.4 Outlier Detection

A line graph may have one or more points that diverge significantly from the overall trend; such points should perhaps be viewed as outliers and not cause a segment to be split further. Thus we employ an outlier detection test based on residuals [13]. To detect the presence of outliers, we assume that the trend can be represented as a regression line connecting the two end points; thus all the points in the segment can be represented as y_i = b_1 + b_2 x_i + \epsilon_i, where \epsilon_1 and \epsilon_n are both 0. The residual is e_i = y_i - b_1 - b_2 x_i, and the estimated standard deviation of e_i is

s_i = \hat{\sigma} \sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2}}
where \hat{\sigma} = \sqrt{\sum e_i^2 / (n - 2)}. If \hat{\sigma} equals 0, there are no outliers. Otherwise, the standardized residuals r_i = e_i / s_i are computed and R_m = \max |e_i / s_i| is used as a test statistic for outlier detection. We use a significance level of α = 0.01 and the critical value given in [13] (based on the sample size). If R_m is greater than the critical value, outlier detection suggests the presence of an outlier in the sampled data points. If several r_i exceed the critical value, then several outliers are suggested. Thus our decision tree induction includes attributes based on outlier detection.

3.3 Inducing the Model for Splitting
Table 1 presents the features of a segment in a line graph that are used to train a model for deciding whether to split a segment into subsegments. (Recall that initially the entire line graph is a single segment, which the model may recursively split into subsegments until each subsegment may be viewed as a single trend.) The first two features capture the absolute number of points both in the overall graph and in the segment under consideration, and the third feature captures the proportion of points in the segment; these features were included because it appeared that the length of a segment, or the size of a segment in relation to the overall graph, might influence whether the segment should be split. The remainder of the features in Table 1 are derived from the statistical tests discussed in the previous sections. The fourth feature is obtained from the correlation coefficient; the fifth and sixth features result from the F test; features 7-11 are obtained from the Runs Test; and features 12 and 13 are produced by outlier detection. The C5.0 decision tree algorithm is used to build a classification tree based on these 13 attributes. The target value of the decision tree is a binary decision, split or no-split; a decision of no-split indicates that the segment should be viewed as consisting of a single trend and not split further.

Table 1. All attributes used in decision tree

| #  | Attribute name                               | Type    | Description |
|----|----------------------------------------------|---------|-------------|
| 1  | total number of points                       | numeric | number of sampling points in the whole graph |
| 2  | number of points in current segment          | numeric | the number of sampling points in the current segment |
| 3  | percentage of the total points               | numeric | ratio of attribute 2 to attribute 1: indicates the length of the current segment as a percentage of the whole line graph |
| 4  | correlation coefficient                      | numeric | correlation coefficient calculated from the data points in the current segment |
| 5  | F test                                       | 0 or 1  | result of the F test: 1 if there exists an F value greater than the critical value; otherwise 0 |
| 6  | changing points in F test                    | numeric | the number of x_i where the F value exceeds the critical value |
| 7  | Runs Test                                    | 0 or 1  | result from the Runs Test: 1 when the actual number of runs is less than R_mean − SD; otherwise 0 |
| 8  | actual runs                                  | numeric | number of runs detected by the Runs Test for the current segment |
| 9  | mean runs                                    | numeric | R_mean calculated in the Runs Test |
| 10 | standard deviation of runs                   | numeric | the SD calculated in the Runs Test |
| 11 | difference between actual runs and mean runs | numeric | the absolute difference between R and R_mean, divided by R_mean |
| 12 | outlier detection                            | 0 or 1  | result from outlier detection: 1 when R_m is greater than the critical value; otherwise 0 |
| 13 | number of outliers                           | numeric | the number of standardized residuals r_i that are greater than the critical value |
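To make the F-test attributes (features 5 and 6) concrete, here is a Python sketch of how they might be computed for a segment. The function names, the handling of a perfect two-phase fit, and the choice of requiring at least two points per side are our own, not taken from the paper; the critical value would come from the table in [12].

```python
def rss_linear(points):
    """Residual sum of squares of the least-squares line through points."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    if denom == 0:                       # all x equal: degenerate, no residual spread in x
        return 0.0
    b = (n * sxy - sx * sy) / denom      # slope
    a = (sy - b * sx) / n                # intercept
    return sum((y - (a + b * x)) ** 2 for x, y in points)

def f_test_attributes(points, critical_value):
    """F-test attributes: (flag, count) for a segment's sampled points.

    For each interior break point, compares the two-phase regression to the
    single regression via F = ((RSS_L - RSS) / 2) / (RSS / (n - 4)).
    flag is 1 if any break point's F exceeds critical_value; count is the
    number of break points that exceed it.
    """
    n = len(points)
    rss_l = rss_linear(points)
    count = 0
    for i in range(2, n - 2):            # keep >= 2 points on each side
        rss = rss_linear(points[:i]) + rss_linear(points[i:])
        if rss == 0:
            # perfect two-phase fit: significant only if the one-phase fit is not perfect
            f = float('inf') if rss_l > 0 else 0.0
        else:
            f = ((rss_l - rss) / 2.0) / (rss / (n - 4))
        if f > critical_value:
            count += 1
    return (1 if count > 0 else 0), count
```

A straight line yields (0, 0), while a V-shaped segment produces at least one significant break point.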
4 Evaluation and Analysis of Results
We collected 197 line graphs from various local and national newspapers and from popular magazines such as Business Week and Newsweek. For each line graph in this set, the system computed all 13 features for the graphic and asked the human supervisor whether the graph should be split into subsegments. The values of the 13 features and the split/no-split decision were recorded as one instance in the training dataset. If the human supervisor indicated that the graph should be split, then the split point was computed as described in Section 3.2, the segment was split into two subsegments, and the process was repeated on the two subsegments, thereby producing additional training instances. Our 197 line graphs produced a training set containing 754 instances.

Training on this dataset using C5.0 produced the decision tree given in Table 3. The target value 1 means a split decision, and the value 0 means a no-split decision. Correlation coefficient and percentage of the total points appear at the top levels of the decision tree, indicating that they are the attributes deemed most important in making the split/no-split decision. The correlation coefficient measures a linear rising or falling relationship between the x and y variables; a high correlation coefficient seems to be the strongest predictor of whether a segment should be viewed as a single trend. Percentage of the total points indicates how large a particular segment is compared to the whole line graph. This attribute is important because, under a global view, frequent changes over a small subsegment are less noticeable than if the segment covered a large part of the graph. Thus the decision tree requires a higher correlation coefficient to make a no-split decision when the segment constitutes a large portion of the graph, as shown in the top half of Table 3.
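The overall procedure — selecting the split point farthest from the chord joining a segment's end points and recursing while the induced model says "split" — can be sketched as follows. This Python sketch is ours: the function names are illustrative, and the induced C5.0 model is abstracted as a black-box `should_split` predicate (in the real system it would consume the 13-attribute vector rather than raw points).

```python
import math

def max_distance_split(points):
    """Index of the interior point farthest from the chord joining the end points."""
    (x1, y1), (x2, y2) = points[0], points[-1]
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy) or 1.0
    # perpendicular distance from each point to the chord
    dists = [abs(dy * (x - x1) - dx * (y - y1)) / norm for x, y in points]
    return max(range(1, len(points) - 1), key=dists.__getitem__)

def segment(points, should_split, min_points=3):
    """Top-down segmentation: recursively split while the model says 'split'.

    should_split: predicate standing in for the induced decision tree.
    Returns a list of segments, each a list of (x, y) points; adjacent
    segments share their boundary point so the trend lines stay connected.
    """
    if len(points) < min_points or not should_split(points):
        return [points]
    i = max_distance_split(points)
    return (segment(points[:i + 1], should_split, min_points)
            + segment(points[i:], should_split, min_points))
```

With a predicate that splits only the full graph, a V-shaped series is divided exactly at its vertex into a falling and a rising segment.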
Similarly, when the correlation coefficient is low, [...]

[Table 3. Decision tree induced by C5.0; the excerpt is garbled in this copy. Recoverable splits include thresholds on correlation coefficient (0.815541, 0.962782, with > 0.962782 yielding 0 (28)) and on percentage of the total points (0.62963, 0.866667, with one branch yielding 1 (30/7)).]