Text Processing for Classification

V. Cho, B. Wüthrich and J. Zhang
Computer Science Department
The Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong
{wscho,beat,[email protected]}

Abstract

These days textual information is becoming increasingly available through the Web. This makes text an attractive resource from which to mine knowledge. The major difficulty in mining textual data is that the information is unstructured. Hence the data has to be preprocessed first so as to obtain some form of structured data which is amenable to data mining techniques. This paper focuses on this preprocessing step. That is, methods and techniques are presented that enable the use of text as an information source to solve classification problems. Novel text processing schemes based on keyword record counting are proposed. The classification performance achieved by the various preprocessing techniques is measured and compared on an extremely challenging problem, the forecasting of stock market movements. The prediction accuracy achieved by the best text processing method is very close to what can be expected from human experts.

1 Introduction

These days more and more crucial and commercially valuable information becomes available on the World Wide Web. Financial services companies, too, are making their products increasingly available on the Web. There are various types of financial information sources on the Web. The Wall Street Journal (www.wsj.com) and the Financial Times (www.ft.com) maintain excellent electronic versions of their daily issues. Reuters (www.investools.com), Dow Jones (www.asianupdate.com), and Bloomberg (www.bloomberg.com) provide real-time news and quotations of stocks, bonds and currencies. Whereas newspapers are updated once or twice a day, the real-time news sources are frequently updated on the spot. All these information sources contain global and regional political and economic news, citations from influential bankers and politicians, as well as recommendations from financial analysts. This is the kind of information that moves bond, stock and currency markets in Asia, Europe and America.

There are thousands of financial Web pages updated each day. Each page contains valuable information about stock markets. The URLs of the pages are usually fixed, for example, the

page www.wsj.com/edition/current/articles/HongKong.htm contains news about Hong Kong’s stock market. This page is considered as one data source in the sequel. However, some news page addresses change from time to time, usually these pages are under a content page. For example, the page www.ft.com/hippocampus/ftstats.htm is the content page for various regional financial news pages. But the individual news pages under this content page have changing URL addresses. So the fixed content page together with its underlying news pages is considered as one identifiable data source in this case.

The rich variety of on-line data sources is an attractive resource from which to mine knowledge. This gained knowledge or insight can be used, for instance, to predict financial markets. The conventional approach to financial market prediction is to use numeric time series data to forecast stock, bond and currency markets. In this paper, however, we take a novel approach: text is used as input. Textual statements contain not only the effect (e.g., stocks down) but also the possible causes of the event (e.g., stocks down because of weakness in the dollar and consequently a weakening of the treasury bonds). Exploiting textual information may therefore increase the quality of the input. Research on using text for prediction purposes has just started and not many results are available yet. One of the key issues is how to process the text. This paper presents several novel text processing techniques and investigates their appropriateness for predicting financial markets.

This research proposes, investigates and compares four text processing methods. The suggested methods focus on predicting stock markets in particular. The four text processing schemes are based on keyword record counting. A keyword record can be bond jump or dollar weak against mark. We are counting how many times such keyword records (keywords for short)

occur in the news. The keyword records have been created by hand; several hundred records were formulated which are believed to be correlated with stock movements. Knowing that currencies like the US dollar, bonds, interest rates, the Dow Jones index etc. all influence Hong Kong's stock market, it is straightforward to create these records: Dow rise, Dow plunge, dollar strong, etc. A sensitivity analysis showed that as long as there are at least one hundred keyword records, the forecasting accuracy is about the same for a particular prediction technique. The most crucial question then is what to do with these keyword counts. Clearly, at some point a statistical or machine learning


technique has to be employed to do the prediction. But probably more important than the question of what prediction technique to use is the issue of how to preprocess the record counts before feeding them into the prediction engine. According to our experience all prediction techniques have their strengths and weaknesses, but most crucial is the input given into these systems. Similarly, there are many Web search engines available, all based on keyword counting. The search engines differ from each other mostly in the way the keyword counts are processed. Though our forecasting approach and search engine techniques are based on word counting, there are major differences. Firstly, we count word records and not individual words. Secondly, our aim is to predict financial markets whereas search engines determine the relevance of a piece of news with respect to a query.

Figure 1: real-time stock index predictions available via www.cs.ust.hk/~beat.

The rest of the paper is organized as follows. Related work is recalled in Section 2. Section 3 presents an overview of the prediction process and Section 4 concentrates on the most important phase of this process, the record processing. Four promising ways of preprocessing record counts are presented. Section 5 uses these methods to predict the Hang Seng Index, Hong Kong's major stock market


index. The final Section 6 summarizes the results, which have been incorporated into the real-time prediction system available via www.cs.ust.hk/~beat/Predict, see Figure 1.

2 Related Work

Most of the work on text processing and preprocessing has been done in the context of information retrieval. The goal of information retrieval is to find the documents that are most relevant with respect to a query. Wüthrich et al. [1998] introduced the idea of using text to forecast financial markets. This research investigates more sophisticated text processing techniques to achieve more stable prediction accuracy.

There is a long history of using keyword counts for text processing. The existing techniques have been developed mainly for document retrieval purposes [Keen 1991, Salton and Buckley 1988]. The typical document retrieval scenario is as follows. The user provides a query consisting of several keywords. The occurrences of the user-provided keywords are counted in the documents to help identify the most relevant documents. Once the keyword counts are available, the counts are transformed into weights. The computation of document relevance is finally done based on those weights. The following weighting techniques are employed:

1. Boolean weighting [Salton and Buckley 1988]: the weight is assigned one if the keyword occurs in the document and zero otherwise.
2. Term frequency (TF): the number of occurrences of the keyword record in the document.
3. Normalization [Keen 1991]: TF is normalized so as to lie between 0 and 1.
4. Inverse document frequency (IDF) [Sparck 1972]: this measures the discrimination of a document from the remainder of the collected documents.
5. TF × IDF [Wong and Lee 1993]: the product of term frequency and inverse document frequency. A high value indicates the keyword occurs frequently in a document but infrequently in the remainder of the collected documents.
6. Augmented normalized term frequency [Lucarella 1988]: TF is divided by the maximum TF and further normalized to lie between 0.5 and 1.
7. Document discrimination [El-Hamdouchi et al. 1988]: this measures the power of the keyword to separate documents into various classes.
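To make these classic schemes concrete, the following is a minimal sketch (ours, not taken from any of the cited systems) that computes the Boolean, term frequency, normalized TF, IDF and TF × IDF weights of a single keyword over a small document collection; the toy documents and variable names are illustrative assumptions.

import math
from collections import Counter

def keyword_weights(docs, keyword):
    """Classic IR weights of `keyword` for each document in `docs`.

    docs    : list of token lists (one list per document)
    keyword : the term to weight
    Returns one dict per document with boolean, tf, normalized tf, idf and tf*idf."""
    counts = [Counter(doc)[keyword] for doc in docs]      # raw term frequencies
    max_tf = max(counts) or 1                             # avoid division by zero
    df = sum(1 for c in counts if c > 0)                  # document frequency
    idf = math.log(len(docs) / df) if df else 0.0         # inverse document frequency
    return [{
        "boolean": 1 if tf > 0 else 0,                    # scheme 1
        "tf": tf,                                         # scheme 2
        "normalized_tf": tf / max_tf,                     # scheme 3
        "idf": idf,                                       # scheme 4
        "tf_idf": tf * idf,                               # scheme 5
    } for tf in counts]

# toy usage with made-up documents
docs = [["bond", "rise", "dollar"], ["stock", "mixed"], ["bond", "rise", "bond", "rise"]]
print(keyword_weights(docs, "rise"))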


Our approach to forecast or classify stock markets is similar to determining document relevance. We count the occurrences of a fixed set of several hundreds of keyword records. The counts are then transformed into weights. Finally, the weights are the input into a prediction engine (e.g. a neural net, a rule based system, or regression analyzer) which forecasts the stock markets.

As it turns out, the computation of weights using a combination of information retrieval techniques does not necessarily yield accurate forecasts [Leung 1997, Peramunetilleke and Wüthrich 1997]. The weighting scheme suggested in Section 4.1 actually consists of three components: term frequency, document discrimination, and normalization. However, it can be algebraically simplified and is therefore called "simple weighting". Leung (1997), Peramunetilleke (1997) and Wüthrich (1998) show that simple weighting outperforms other information retrieval weighting schemes on various financial classification problems. This paper introduces novel weight computation methods and compares them to simple weighting.

Some research on financial market prediction is recalled next. There is a wide variety of prediction techniques used by stock market analysts. Very popular among financial experts is technical analysis [Pring 1991]. The main concern of technical analysis is to identify the trend of movements from charts. Technical analysis helps to visualize and anticipate the future trend of the stock market. These techniques include peak-and-trough analysis (which indicates a trend reversal when the series of rising peaks and troughs is interrupted), the moving average (which reduces the fluctuations of stock prices into a smoothed trend so that the underlying trend is more clearly visible), and so on. Technical analysis only makes use of quantifiable information in terms of charts. But charts or numeric time series data only contain the event and not the cause of the event. Hence, a modeling approach such as ours, which can also take the cause into account, is desirable and complementary.

A multitude of promising forecasting methods has been developed to predict currency and stock market movements from numeric data. Among these methods are statistics [Iman and Conover 1989, Nazmi 1993], ARIMA or Box-Jenkins [Wood 1996, Reynolds and Maxwell 1995] and


stochastic models [Pictet et al. 1996]. These techniques take as input huge amounts of numeric time series data to find a model extrapolating the financial markets into the future. These methods are mostly for short-term predictions whereas Purchasing Power Parity [Zhou S., 1997; Bahmani-Oskooee M., 1992] is a successful medium- to long-term forecasting technique. The successful Quest system [Agrawal et al. 1996] compares time series data and identifies similar sequences.

Due to the rich literature on forecasting it is not possible to elaborate on all successful approaches predicting financial markets. We are, however, not aware of any publicly available, regular and precise predictions about short-term stock market movements. In this respect our real-time forecasts (www.cs.ust.hk/~beat/Predict) are unique.

3 Overview of the Prediction Process

The aim is to forecast the Hang Seng Index (HSI). If the closing value of the HSI versus the previous day's closing value moves up by at least x%, then the classification is up. If the closing value of the HSI versus the previous day's closing value slides down by at least x%, then the classification is down. Finally, when the HSI neither goes up nor down, it is steady. In our case, we set x to 0.5 as this way each of the classes up, down and steady occurs about equally often. Suppose a domain expert provided a fixed set of keyword records such as bond lost or stock mixed which are correlated with the Hang Seng Index. We collect Web pages

containing financial news in the mornings over a period of time such as 14 Feb. to 6 Nov., which contains one hundred and seventy-nine trading days. The actual outcomes of the HSI are also collected over the same period. The provided and collected data is shown shaded in Figure 2. The prediction of the HSI movement, up, steady or down, is done in four steps. The four steps are executed between 6:30 am and 7:30 am local time so that the prediction for today's movement is ready before the actual trading day starts.

1. Keyword counting. The number of occurrences of the keyword records in the news of each training day is counted, see Figure 2.


2. Computation of weights. The occurrences of the keyword records are transformed into weights (a real number between zero and one). This way, each keyword gets a weight for each class and each day, see Figure 2. This is described in Section 4.

Figure 2: weights are computed from keyword record occurrences.

3. Training the forecast engine. From the weights (a set of vectors for each term on the period day 1 to day 100) and the closing values of the training days (day 1 to day 100), probabilistic rules are generated, see Figure 3. (Alternatively, also neural net or regression analysis for instance could be used as forecast engine.) Further details on how the probabilistic rules are generated will be described later.


Figure 3: rules are generated from weighted keywords and historic index outcomes.

4. Applying the trained forecast engine. The generated rules are finally applied to today's (day 101) Web page content (containing yesterday's news) at 7:30 am local time, see Figure 4. This yields a forecast of the closing stock index for day 101: either up, down or steady.

Figure 4: the rules are applied to the weights computed from today's news in the morning; this yields a forecast of today's closing stock index. (In the example shown, the probabilities of up, steady and down are 0.811, 0.018 and 0.171, and the forecast is that the HSI goes up today.)
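As a side note on the training data used in these steps, the class label of each day follows directly from the definition at the beginning of this section; below is a minimal sketch (ours), assuming the 0.5% threshold and using illustrative closing values.

def classify_day(prev_close, close, x=0.5):
    """Label a trading day as 'up', 'down' or 'steady'.

    A day is 'up' if the index gained at least x percent versus the previous
    close, 'down' if it lost at least x percent, and 'steady' otherwise."""
    change_pct = (close - prev_close) / prev_close * 100.0
    if change_pct >= x:
        return "up"
    if change_pct <= -x:
        return "down"
    return "steady"

# illustrative closing values (not actual HSI data)
closes = [10789, 10740, 10802, 11593]
print([classify_day(p, c) for p, c in zip(closes, closes[1:])])  # ['steady', 'up', 'up']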


In our case we use a rule-based forecast engine. The rule bodies consist of the keyword records, and their evaluation yields a probability saying how likely the particular index is to go up, go down or remain steady. Wüthrich [1995] formally defines the probabilistic semantics of such rules. The following is a sample rule set generated by the rule generation algorithm described in [Wüthrich 1997].

r1:  HSI_UP(T) ← STOCK_ROSE(T-1), NOT INTEREST_WORRY(T-1), NOT BOND_STRONG(T-2), NOT INTEREST_HIKE(T-2)
r2:  HSI_UP(T) ← STERLING_ADD(T-1), BOND_STRONG(T-2)
r3:  HSI_UP(T) ← YEN_PLUNG(T-1), NOT GOLD_SELL(T-2), STOCK_ROSE(T-1)

So the likelihood of the HSI going up depends for instance on the computed weight for stock rose yesterday and on the weight of attribute or keyword record bond strong two days ago.

Unlike other rule-based approaches, these rules can also deal with weights and are hence more powerful. Suppose the following weights for the last two days, say day 99 and day 100.

STOCK_ROSE(100)     : 1.0
INTEREST_WORRY(100) : 0.2
BOND_STRONG(99)     : 0.7
INTEREST_HIKE(99)   : 0.0
STERLING_ADD(100)   : 0.5
YEN_PLUNG(100)      : 0.6
GOLD_SELL(99)       : 0.1

Applying the rule set R = {r1, r2, r3} to those weights computes the probability of the HSI going up. More specifically, the rules compute how likely the index moves up from the beginning to the end of period 101.

evalR(101) = 1*(1-0.2)*(1-0.7)*(1-0) + 0.5*0.7 + 0.6*(1-0.1)*1
                   // likelihood that the first rule is true, or the second
                   // rule is true, or the third rule is true
           - 0     // since the first and second rule are contradictory
           - 1*(1-0.2)*(1-0.7)*(1-0)*0.6*(1-0.1)
                   // likelihood that the first and third rule are
                   // both true; note stock_rose is taken only once
           - 0.5*0.7*0.6*(1-0.1)*1
                   // likelihood that the second and third rule are true
           + 0     // the three rule bodies together are contradictory
           = 0.811
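The inclusion-exclusion computation above can be mechanized. The following sketch is our illustration (not the authors' implementation): each rule body is a set of possibly negated attributes, a subset of rules contributes the product of the weights of its distinct literals, contradictory subsets contribute zero, and subsets are summed with alternating signs.

from itertools import combinations

def body_prob(literals, weights):
    """Probability that a conjunction of literals holds.
    literals: set of (attribute, positive) pairs; weights: attribute -> value in [0,1].
    Returns 0.0 if the conjunction is contradictory (an attribute required both ways)."""
    required = {}
    for attr, positive in literals:
        if attr in required and required[attr] != positive:
            return 0.0                                    # e.g. BOND_STRONG and NOT BOND_STRONG
        required[attr] = positive
    prob = 1.0
    for attr, positive in required.items():               # each distinct attribute counted once
        w = weights[attr]
        prob *= w if positive else (1.0 - w)
    return prob

def eval_rules(rules, weights):
    """Inclusion-exclusion over all non-empty subsets of rule bodies."""
    total = 0.0
    for k in range(1, len(rules) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for subset in combinations(rules, k):
            total += sign * body_prob(set().union(*subset), weights)
    return total

# weights for days 99 and 100 as in the running example
w = {"STOCK_ROSE": 1.0, "INTEREST_WORRY": 0.2, "BOND_STRONG": 0.7,
     "INTEREST_HIKE": 0.0, "STERLING_ADD": 0.5, "YEN_PLUNG": 0.6, "GOLD_SELL": 0.1}
r1 = frozenset({("STOCK_ROSE", True), ("INTEREST_WORRY", False),
                ("BOND_STRONG", False), ("INTEREST_HIKE", False)})
r2 = frozenset({("STERLING_ADD", True), ("BOND_STRONG", True)})
r3 = frozenset({("YEN_PLUNG", True), ("GOLD_SELL", False), ("STOCK_ROSE", True)})
print(round(eval_rules([r1, r2, r3], w), 3))  # 0.811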

If a rule also has an attached confidence (the conditional probability that the head is true given that the body is true), then every term involving that rule is additionally multiplied by this rule's confidence. As probabilistic rules are extensions of first-order rules (whenever all the facts are either true or false, probabilistic rules degenerate to first-order rules), they also inherit all the strengths of rule classifiers: comprehensible models and relatively fast learning algorithms. Compared to conventional rules, probabilistic rules have the advantage that they are able to handle continuous attributes and do not rely on Boolean tests. The learning algorithm is explained next.

Suppose the head of rule r is HSI_UP (the cases HSI_STEADY and HSI_DOWN are analogous). The confidence of rule r, denoted conf(r), is defined as follows:

conf(r) = ( Σt eval{r}(t) × up(t) ) / ( Σt eval{r}(t) )

where t is a training example and up(t) is 1 if the actual outcome is up and 0 otherwise. The evaluation of the single rule r on example t, denoted by eval{r}(t), has been explained above. The algorithm generating a rule set R is as follows.

R = ∅
while |R| ≤ maxRules do {
    C = {r | r is a most general rule}
    repeat {
        r' = r
        C = {s | r > s} ∪ {r}
        r = the rule s ∈ C minimizing mse(R ∪ {s})
    } until (r = r')
    attach conf(r) to r
    R = R ∪ {r}
}
R' = R
R = the rule set S ⊆ R' minimizing mse(S)


In the inner loop, the algorithm selects the rule s with minimal mean square error (mse) of the rule set R ∪ {s}:

mse(R ∪ {s}) = Σt ( up(t) − evalR∪{s}(t) )²

The evaluation of an example at time point t using the rules R generated so far, with their confidences, plus the rule s is denoted by evalR∪{s}(t). The summation goes over all training examples t, and up(t) is defined as before (assuming the rule set being built is for hsi_up; for the rule sets steady and down it is analogous). Note that the mean square error is used to measure the quality of a rule. This is an appropriate goodness measure for applications where the classification problem is expected to be relatively difficult (no perfect models possible). Regression analysis, neural net learning based on back propagation and nearest neighbor algorithms are also based on mean square error or square distance considerations. The last statement of the algorithm selects the subset S of the generated rules R' which has the least mean square error. This is a common rule set simplification and yields the final result R.

Once the rules are generated, they are applied to the most recently collected textual news. So the likelihood of the HSI going up depends, for instance, on the weight computed for the keyword record stock rose, on that of interest worry, etc. From those probabilities, i.e. how likely the HSI is to go up, go down or remain steady, the final decision is taken based on some consistent treatment method such as maximum likelihood [Cho and Wüthrich 1998]. For example, the final decision could be that the HSI moves up.

Section 4 describes the most crucial step of the prediction process, the computation of the weights, see Figure 2. Four different methods to compute the weights are presented. Their effect on the forecasting accuracy and stability is experimentally investigated in Section 5.
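Before moving on, the goodness measure that drives this search can be illustrated with a brief sketch (ours, ignoring rule confidences for simplicity), reusing the eval_rules function sketched earlier.

def mse(rule_set, training_days):
    """Mean square error of a rule set.

    training_days: list of (weights, label) pairs, where weights maps attribute
    names to values in [0,1] and label is 1 if the day's outcome was 'up', else 0."""
    return sum((label - eval_rules(rule_set, weights)) ** 2
               for weights, label in training_days)

def best_extension(current_rules, candidates, training_days):
    """Greedy step of the learner: add the candidate rule that minimizes mse."""
    return min(candidates, key=lambda s: mse(current_rules + [s], training_days))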

4 Weight Computation

There is a long history in text retrieval of using keyword weighting to classify and rank documents. In contrast to text or information retrieval, we do not rank a document with respect


to a query, but rather we would like to make a classification (up, down, or steady) based on a rather small set of fixed documents. Therefore different weighting schemes are needed. Our weighting schemes are derived from keyword record occurrences and classification outcomes. Figure 5 shows the distribution of keyword bond rise for a period of 72 days. On 16 days the stock market went up, on 30 days it was steady, and on 26 days it went down. There are for example 5 days for which the number of occurrences is 10 and for which the outcome is steady.

Figure 5: Distribution of keyword record bond rise and class centers (number of days versus word count; the centers of the classes down, steady and up are marked).

According to Figure 5, the number of occurrences of bond rise is higher for days on which the Hang Seng Index is up. The distributions for the three classes are in general quite different. The total number of keyword counts in this example is 110 for the 26 days on which the class is down. The corresponding means are calculated according to the description in Section 4.2. The average occurrence on up-days is 19.2 and the average number of times a particular frequency occurs on up-days is 1.97. This determines the center of the distribution for the class up. Figure 5 suggests the following:

1. The closer a keyword count on a particular day is to the center of a class, the greater the probability that this particular day belongs to this class.


2. The wider the separation between the class distributions, the more useful this keyword record is.

3. A particular distribution has in general more than one local maximum. Each peak is the center of a cluster. The corresponding clustering algorithm is described in Section 4.4. The weight of a keyword should be related to these cluster centers.

The class distributions are in general more complicated than the one shown in Figure 5. Figure 6 is the graph of keyword record share drop in the period 14 Feb. 1997 to 6 Nov. 1997. The keyword is counted on 41 financial Internet sources and its occurrences on individual sources are added together. How the different cluster centers are located is described in Section 4.4.

Figure 6: The class distributions of share drop computed from 41 sources during 179 trading days (number of days versus word count; the centers of clusters up1-up3, down1, down2 and steady are marked).

The underlying assumption is that the classification can be derived from the class distributions of some of the several hundred keywords. Four schemes are introduced below to compute the weight w(t): Simple Weighting, Vector Weighting with Class Relevance, Vector Weighting with Class Relevance and Discrimination, and Vector Weighting with Cluster Relevance and Discrimination. As an overview, the four weighting schemes, which exploit these distributions, are briefly described as follows. The resulting classification accuracy of these methods is then discussed in Section 5.


Simple Weighting
This scheme simply normalizes the keyword occurrences to numbers between zero and one.

Vector Weighting with Class Relevance
The class relevance is the distance of a keyword count from the center (mean) of a class. The closer a keyword count is to the center, the higher its weight. A keyword gets three different weights, one for each class.

Vector Weighting with Class Relevance and Discrimination
This scheme extends the previous one by also considering how the class centers differ from each other, called class discrimination. For a highly relevant keyword record, the centers of its classes are very different and the current count is close to one of the centers.

Vector Weighting with Cluster Relevance and Discrimination
This scheme extends the third scheme by further dividing classes into clusters. The definitions of class relevance and class discrimination are extended to cope with multiple clusters.

We use the following notation. Let T denote a set of time points, the training set. A feature occurs n(t) times in the documents received at time point t ∈ T. Each time point is classified into one of several mutually exclusive classes c1, …, ck. Let w(t) be the weight of the feature for time point t. A weight is a real number in the interval [0,1]. Let Tj denote the set of all those time points for which the outcome is cj. The problem can then be formulated as follows. Given the keyword occurrences n(t) for all features and all time points t ∈ T, how do we compute the weight w(t) so that the classification accuracy is as high as possible and as stable as possible? Stable means that high accuracy is achieved with many different combinations of available news sources.

4.1 Simple Weighting

The weight of a keyword record is made proportional to its occurrence and is normalized with its maximum occurrence. Figure 7 is derived from Figure 5 by ignoring the number of days


information. Figure 7 highlights the relationship between the weight and the number of occurrences.

Figure 7: weight of keyword record bond rise (the weight for the classes up, steady and down is always the same under this scheme).

Formally, let n be the maximum of all n(t) (t ∈ T). Then

w(t) = n(t) / n

As shown in Figure 5, the maximum count for bond rise is 23. So if the keyword count for a day is, say, 7, then the resulting weight is 7/23 = 0.304. Note that in this scheme, the weight of a keyword record is the same for all the classes up, steady, and down. That is, when generating the rule set hsi_up, we use the same weights as when generating the rules for hsi_steady and hsi_down. It is straightforward to see that the computation of the weights from the counts can be done in O(k|T|) time, where |T| is the size of the training data and k is the number of keyword records. Wüthrich et al. [1998] report that this is the best of the weighting schemes that use concepts of inverse document frequency, class discrimination and normalization. It can be shown that simple weighting is equivalent to term frequency multiplied with document discrimination and normalized by the maximum keyword occurrence [Leung 1997]. Peramunetilleke and Wüthrich [1997] report that this weighting scheme also outperforms other text retrieval schemes on a different application, forecasting intra-day currency exchange rates from news headlines. Leung


(1997) and Peramunetilleke (1997) confirm that the best weighting scheme for the rule-based approach is also the best for other forecast engines, including regression analysis, k-nearest neighbour and neural nets.
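A minimal sketch of the simple weighting scheme (ours), assuming the counts of one keyword record are given as a list over the training days:

def simple_weights(counts):
    """Simple weighting: normalize keyword record counts by their maximum.

    counts: list of occurrence counts n(t), one per training day t.
    Returns w(t) = n(t) / n with n the maximum count; the same weight is
    used for all three classes (up, steady, down)."""
    n = max(counts) or 1            # guard against an all-zero keyword record
    return [c / n for c in counts]

# running example: maximum count 23, today's count 7
print(round(simple_weights([23, 7, 0, 12])[1], 3))  # 0.304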

4.2 Vector Weighting with Class Relevance

In Figure 5, the centers of the three classes down, steady and up are 4.2, 11.2, and 19.2 respectively. The weight now depends on the difference between these centers and the current count. We therefore introduce a new concept, class relevance. The closer a keyword count is to the center of a class, the higher the relevance of this word with respect to this class. A keyword therefore now gets three different weights, one for each of the classes up, steady and down. The weights derived from Figure 5 are shown in Figure 8 and formally defined next.

Figure 8: weight of keyword record bond rise for the classes up, steady and down.

Let s(t) = n(t) / n. We define S(ci) to be the set containing all s(t) for which t is of class ci:

S(ci) = {s(t) | t ∈ Ti}

The center s̄(ci) is

s̄(ci) = ( Σ_{t ∈ Ti} s(t) ) / |Ti|

The relevance d(s(t), S(ci)) with respect to class ci is defined by

d(s(t), S(ci)) = |s(t) − s̄(ci)| / δi   if |s(t) − s̄(ci)| < δi
               = 1                      otherwise

where the radius δi is the maximum distance from the center of class ci among the elements in S(ci):

δi = Max_{t ∈ Ti} |s(t) − s̄(ci)|

If t ∉ Ti, the distance between s(t) and s̄(ci) could be greater than δi, and the relevance is then clamped to one. The feature weight of s(t) with respect to class ci is

w(t) = 1 − d(s(t), S(ci))

Using our running example bond rise from Figure 5, we get the situation shown in Table 1.

Table 1: Class center and corresponding radius

            down     steady    up
  s̄(ci)    0.184    0.491     0.836
  δi        0.294    0.334     0.185

If the count for t is again 7, then s(t) for bond rise is 0.304. The corresponding weight for class down is 1 − (|0.304 − 0.184|)/0.294 = 0.591. Similarly, the weights for steady and up are 0.441 and 0 respectively. The weight of class up is 0 because 0.836 − 0.304 is greater than the radius of class up, 0.185; that is, 0.304 is too far away from the center of class up. This method requires the calculation of the center s̄(ci) for each class ci and the maximum deviation of s(t) from s̄(ci). Thus, the algorithm needs to scan the text data only twice. Hence, for k different keyword records, the time complexity is O(k|T|).
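The following sketch (ours, with assumed data structures) computes the class-relevance weights from the normalized counts and reproduces the running example using the centers and radii of Table 1.

def class_stats(s_by_class):
    """Center (mean) and radius (max deviation) of the normalized counts per class."""
    stats = {}
    for cls, values in s_by_class.items():
        center = sum(values) / len(values)
        stats[cls] = (center, max(abs(v - center) for v in values))
    return stats

def class_relevance_weight(s_t, center, radius):
    """w(t) = 1 - d(s(t), S(ci)), with the distance clamped at the class radius."""
    d = abs(s_t - center) / radius if abs(s_t - center) < radius else 1.0
    return 1.0 - d

# centers and radii as in Table 1, today's normalized count s(t) = 7/23
stats = {"down": (0.184, 0.294), "steady": (0.491, 0.334), "up": (0.836, 0.185)}
for cls, (center, radius) in stats.items():
    print(cls, round(class_relevance_weight(7 / 23, center, radius), 3))
# down 0.591, steady 0.441, up 0.0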


4.3 Vector Weighting with Class Relevance and Discrimination

This method introduces an additional concept, class discrimination, which reflects how much the centers of the classes differ from each other. For a highly relevant keyword record, the centers of the classes are very different. If the centers of all classes are close to each other, then class discrimination is low and nearly the same for all classes. On the other hand, if the centers are very different, then class discrimination fluctuates a lot. Figure 9 shows the resulting weights when also employing class discrimination. Figure 9 is again derived from Figure 5.

Figure 9: weight of keyword record bond rise for the classes up, steady and down.

The discrimination factor λi and the corresponding weight w(t) with respect to class ci are defined as follows:

λi = (1 − d(s(t), S(ci))) / Σ_{j=1..k} (1 − d(s(t), S(cj)))

w(t) = λi(t) (1 − d(s(t), S(ci)))

If the radii δ and the centers of class c are close for all classes, d(s(t), S(c)) is rather the same for


all classes and the discrimination factor λ is around one-third. This reduces the weight of the corresponding keyword record. On the other hand, if the radii δ and the centers of the classes differ, then d(s(t), S(ci)) differs from d(s(t), S(cj)) and the discrimination factor λ will favour the largest 1 − d(s(t), S(ci)). This enhances the weight when the keyword count is close to a particular class center. Continuing our previous example, the calculated weights for the three classes down, steady and up are (0.591)²/(0.591+0.441+0) = 0.338, (0.441)²/(0.591+0.441+0) = 0.188, and 0 respectively. It follows from our previous considerations that the time complexity of this weight computation scheme is again O(k|T|).
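Continuing the previous sketch (and reusing class_relevance_weight and stats defined there), the discrimination factor can be added as follows; again this is our illustration of the formulas rather than the authors' code. Note that exact arithmetic gives 0.189 for steady, where the text, using rounded intermediate values, reports 0.188.

def discrimination_weights(s_t, stats):
    """Class relevance multiplied by the discrimination factor lambda_i.

    stats maps each class to its (center, radius); returns a dict of weights."""
    relevance = {cls: class_relevance_weight(s_t, c, r)   # 1 - d(s(t), S(ci))
                 for cls, (c, r) in stats.items()}
    total = sum(relevance.values())
    if total == 0:
        return {cls: 0.0 for cls in stats}                # the keyword says nothing today
    return {cls: (rel / total) * rel for cls, rel in relevance.items()}

print({c: round(v, 3) for c, v in discrimination_weights(7 / 23, stats).items()})
# {'down': 0.338, 'steady': 0.189, 'up': 0.0}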

4.4 Vector Weighting with Cluster Relevance and Discrimination

As highlighted in Figure 5, the word occurrences of a class are not normally distributed. The distribution can be further partitioned into various clusters as shown in Figure 10. Each cluster is characterized by a center and a radius. The definitions of class relevance and class discrimination are therefore extended to cope with multiple clusters. Figure 10 shows that there are 2, 2, and 1 clusters for down, steady and up, namely down1, down2, steady1, steady2, and up respectively. Figure 11 shows the resulting weights.

Figure 10: Distribution of keyword record bond rise and centers of clusters (number of days versus word count; the centers of clusters down1, down2, steady1, steady2 and up are marked).


Figure 11: weight of keyword record bond rise for the classes up, steady and down.

Each class may consist of several clusters. A clustering algorithm partitions the set S(ci) into disjoint clusters OiA, OiB, …, OiM. The distance between two clusters is defined as

d(OiL, OiP) = | ŌiL − ŌiP |

where the cluster mean is

ŌiL = ( Σ_{s(t) ∈ OiL} s(t) ) / |OiL|

The clustering is done as follows. Initially, each sample point s(t1) ∈ S(ci) is a cluster of its own, OiL = {s(t1)} with L = {t1}. Now we calculate the distances between any two clusters. If the minimum distance d(OiL, OiP) among the clusters is less than a pre-determined threshold value h, then the two clusters OiL and OiP are combined:

Oi(L∪P) = OiL ∪ OiP

This process continues until all the cluster distances are greater than h. For each computed cluster OiL, we define its radius as

δiL = Max_{s(t) ∈ OiL} | s(t) − ŌiL |

The cluster relevance of s(t), where t ∈ T, is then

d(s(t), S(ci)) = Min_{L ∈ {A, B, …, M}} | s(t) − ŌiL | / δiL

Again, discrimination is defined by

λi = (1 − d(s(t), S(ci))) / Σ_{j=1..k} (1 − d(s(t), S(cj)))

The feature weight for class ci is finally the product of relevance and discrimination:

w(t) = λi(t) (1 − d(s(t), S(ci)))

For an individual class, the number of clusters should not be too large, as otherwise the clusters become meaningless. Xu [1997] discusses the problem of how to determine the number of clusters. By comparing various threshold values in our experiment, setting the threshold h to 0.35 resulted in at most three clusters per class. This threshold is used to produce Figure 10 and Table 2.

Table 2: Cluster centers and radii

            down             steady           up
  ŌiL      0.049   0.40     0.385   0.74     0.84
  δiL      0.168   0.139    0.167   0.174    0.18

For s(t) = 7/23 = 0.304, our running example, the recalculated minima d(s(t), S(ci)) for down, steady and up are (|0.304 − 0.40|)/0.139 = 0.691, (|0.304 − 0.385|)/0.167 = 0.485, and 1 respectively. Hence the weights for the classes down, steady and up are (1−0.691)²/((1−0.691)+(1−0.485)+(1−1)) = 0.116, (1−0.485)²/((1−0.691)+(1−0.485)+(1−1)) = 0.322, and 0 respectively. Compared with the previous weighting scheme, this weighting scheme gives more weight to class


steady. This is because s(t) is closer to the first cluster of class steady than to the second cluster of class down. However, these two clusters are not very distinguishable from each other; thus the weights for the two classes steady and down are not very different and rather low. This method requires an iterative algorithm to determine the cluster centers. This clustering algorithm has time complexity O(|T|²). For k different keyword records, the overall time complexity for computing the weights is O(k|T|²). Although the time order is |T| times that of the previous methods, this worst-case complexity should be low enough for most classification tasks.

Table 3: Weights calculated using the four schemes

                                                                Down     Steady   Up
  Simple Weighting                                              0.304    0.304    0.304
  Vector Weighting with Class Relevance                         0.591    0.441    0.000
  Vector Weighting with Class Relevance and Discrimination      0.338    0.188    0.000
  Vector Weighting with Cluster Relevance and Discrimination    0.116    0.322    0.000
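A compact sketch of the agglomerative clustering step described above (threshold h, cluster means and radii); this is our illustration with assumed data structures, not the authors' implementation.

def cluster(values, h=0.35):
    """Merge single-point clusters until all cluster means are at least h apart.

    values: normalized counts s(t) of one class; returns a list of clusters."""
    clusters = [[v] for v in values]                      # every point starts alone
    mean = lambda c: sum(c) / len(c)
    while len(clusters) > 1:
        # closest pair of clusters by the distance of their means
        dist, i, j = min((abs(mean(a) - mean(b)), i, j)
                         for i, a in enumerate(clusters)
                         for j, b in enumerate(clusters) if i < j)
        if dist >= h:
            break
        clusters[i] += clusters[j]                        # merge the closest pair
        del clusters[j]
    return clusters

def centers_and_radii(clusters):
    """Center (mean) and radius (max deviation from the mean) per cluster."""
    result = []
    for c in clusters:
        m = sum(c) / len(c)
        result.append((m, max(abs(v - m) for v in c)))
    return result

# toy example with illustrative normalized counts of one class
print(centers_and_radii(cluster([0.02, 0.05, 0.08, 0.35, 0.40, 0.45])))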

5 Experimental Setup and Results

We compare the suggested weighting schemes on the task of predicting the daily movements of the Hang Seng Index. Our textual input comes from five popular financial web sites: The Wall Street Journal (www.wsj.com), Financial Times (www.ft.com), CNN (www.cnnfn.com), International Herald Tribune (www.iht.com), and Bloomberg (www.bloomberg.com). These web sites provide daily updated, high-quality textual financial news. From those five web sites, we selected 41 web sources considered to be relevant to the task of predicting the Hang Seng Index. We collected these data in the period 14 Feb. 1997 to 6 Nov. 1997. This provides a total of 179 stock trading days for training and testing. In this period, there are 62, 58, and 59 days on which the Hang Seng Index goes up, remains steady and goes down respectively. It goes up if it appreciates by at least 0.5%, it is down when it declines by at least 0.5%, and it remains steady otherwise. Financial experts such as investment analysts and foreign exchange dealers recommended 392


keyword records relevant and adequate to forecast the Hang Seng Index. Certainly, out of the 41 sources, certain combinations of sources merged together yield better predictions than others. In order to select a well-performing combination of sources, forward source selection is used. Each source is tested individually and the one achieving the highest prediction accuracy is selected, say sj. Next, all pairs of sources {s1, sj}, {s2, sj}, …, {sn, sj} are tested and again the best performing source pair is selected, and so on. Conceptually, all sources in a source set are merged together and the keywords are counted in this combined text. The weights of all 392 keyword records are calculated for each trading day during the period 14 Feb. 1997 to 6 Nov. 1997. The results on the test data when using 100 training and 79 test days are shown in Figure 12. When day 101 is predicted, days 1 to 100 are used for training; for predicting day 102, days 2 to 101 are used for training, and so on. Figure 13 shows the average prediction accuracy for forward source selection under the four discussed weighting schemes using five-fold cross validation.
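Forward source selection can be sketched as follows; this is our illustration, and the evaluate function, which would train and test on the text of a merged source set, is an assumed callback.

def forward_source_selection(sources, evaluate, max_sources=7):
    """Greedy forward selection of news sources.

    sources  : list of candidate source identifiers (e.g. URLs)
    evaluate : function taking a list of sources and returning the prediction
               accuracy obtained when their text is merged
    Returns (accuracy, selected sources) after each step."""
    selected, history = [], []
    remaining = list(sources)
    for _ in range(min(max_sources, len(sources))):
        # try adding each remaining source to the current combination
        best_acc, best_src = max((evaluate(selected + [s]), s) for s in remaining)
        selected.append(best_src)
        remaining.remove(best_src)
        history.append((best_acc, list(selected)))
    return history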

Figure 12: Prediction accuracy of the weighting schemes on 79 test cases (average prediction accuracy versus the number of sources combined using forward source selection, for Schemes 1-4).


Figure 13: Mean prediction accuracy of the weighting schemes on 79 test cases using five-fold cross validation (average prediction accuracy versus the number of sources combined using forward source selection, for Schemes 1-4).

When simply predicting up for each day, the prediction accuracy would be 62/179 = 0.346. It can be seen that the accuracy of all four weighting schemes is well above this benchmark. There are several criteria to select the best performing weighting scheme.

• The maximum accuracy achieved.

• The smoothness of the accuracy curve. It should be smoothly increasing and not fluctuate widely. This is because as the number of combined sources increases, the information content for the training becomes richer and the prediction accuracy should therefore increase at the beginning. This is particularly the case as all possible input sources are of high quality already and differ mostly in the subjects covered.

• Interpretation of the combination of selected input sources.

We argue that the last scheme, vector weighting with cluster relevance and discrimination (Scheme 4), is the best. Its highest accuracy achieved is 0.51 in Figure 12 (0.468 in Figure 13), which is close to the global maximum of 0.52 (0.472 respectively) achieved by simple weighting. So there is not much difference in this respect. The curve of simple weighting, however, is choppy whereas the curve produced by the fourth weighting scheme is very smooth and mostly increasing. The last criterion by which to judge the weighting schemes is by selected sources. The best sources chosen by forward selection are the following.


Scheme 1
  no cross validation (Figure 12):
    1. wsj.com/edition/current/articles/AmericasRoundup.htm
    2. wsj.com/edition/current/summaries/europe.htm
    3. cnnfn.com/markets/bridge/800.1.html
  cross validation (Figure 13):
    1. cnnfn.com/markets/bridge/100.1.html
    2. iht.com/IHT/TODAY

Scheme 2
  no cross validation (Figure 12):
    1. wsj.com/edition/current/articles/AmericasRoundup.htm
    2. wsj.com/edition/current/summaries/wwide.htm
    3. cnnfn.com/markets/bridge/2200.1.html
    4. wsj.com/edition/resources/documents/toc.htm
  cross validation (Figure 13):
    1. wsj.com/edition/resources/documents/AmericasRoundup.htm
    2. cnnfn.com/markets/bridge/150.1.html

Scheme 3
  no cross validation (Figure 12):
    1. cnnfn.com/markets/bridge/800.1.html
    2. cnnfn.com/markets/bridge/70.1.html
    3. wsj.com/edition/current/articles/ForeignExchange.htm
    4. wsj.com/edition/resources/documents/toc.htm
    5. cnnfn.com/markets/bridge/2270.1.html
    6. wsj.com/edition/current/summaries/economy.htm
  cross validation (Figure 13):
    1. cnnfn.com/markets/bridge/100.1.html
    2. ft.com/hippocampus/ftbrief.htm
    3. cnnfn.com/markets/bridge/150.1.html

Scheme 4
  no cross validation (Figure 12):
    1. bloomberg.com/bbn/snapshot.html
    2. bloomberg.com/bbn/usmov.html
    3. wsj.com/edition/current/articles/HongKong.htm
    4. cnnfn.com/markets/bridge/67.1.html
    5. wsj.com/edition/current/summaries/front.htm
    6. wsj.com/edition/resources/documents/toc.htm
  cross validation (Figure 13):
    1. bloomberg.com/bbn/usprv.html
    2. bloomberg.com/bbn/topten.html
    3. wsj.com/edition/current/summaries/economy.htm
    4. cnnfn.com/markets/bridge/800.1.html
    5. cnnfn.com/markets/bridge/2270.1.html

We consider the sources chosen when not employing cross validation. The selections made by scheme 4 seem intuitively reasonable. First, a source containing a summary of the action on all major stock markets around the world is selected. Human experts too, when asked to


guess the next day, would first study what happened yesterday on the world's major markets. The second choice describes only the major action in the US markets. Indeed, Hong Kong's stock market is heavily influenced by what happened the previous day on US markets. This nicely complements the first choice. The third source selected reports on how Hong Kong's market behaved yesterday. Looking specifically at what Hong Kong did yesterday to predict today's action is most natural. The remaining sources are complementary to the news selected so far; they report mostly on US and Japanese financial markets and politics. On the other hand, weighting scheme 2 reaches its highest accuracy by reading news from only two sources. The first source provides a preview of very few individual stocks in the US. The second source summarizes the top ten economic events around the globe. Though these choices are reasonable, there is almost no information about Hong Kong specifically. Human experts would supplement these two sources with some more information regarding Asia and Hong Kong in particular. Hence the selections of scheme 4 look superior. It is also interesting that only scheme 4 selects sources from Bloomberg.

In what follows, we determine the probability of the outcomes if our system did random prediction. In this case, each prediction is independent of the other predictions. So we have a binomial distribution with mean n*p and variance n*p*(1-p), where p is the probability of success and n is the number of times we predict. If n is rather large, the binomial distribution is approximately a normal distribution. We consider only weighting scheme 4. The probability that the prediction accuracy is equal to or above 46.8% under random guessing can be calculated as follows. The number of trials is n = 79 * 5 = 395 (five cross validations, each with 79 test days). Since there are three classes, the expected value of p is µp = 0.333 if the prediction is random guessing, and the variance of p is σp² = npq / n² = pq / n = 0.333 * 0.667 / 395 = 0.00056. Hence we have

Pr(p ≥ 0.468) = 1 − Pr( (p − µp)/σp < (0.468 − µp)/σp )

where Z = (p − µp)/σp follows a standard normal distribution. This is 1 − Pr(Z < 5.705) ≈ 1 − 1 = 0. The probability that the prediction accuracy is equal to or above 46.8% under random guessing is thus practically zero. Considering the worst prediction accuracy of 0.402 in Figure 13, the probability of achieving this value by random guessing is 0.0018. Hence, our system performs almost certainly better than random guessing.

We also ran experiments to see how the number of keyword records affects the prediction results. Surprisingly, it does not affect the accuracy results in any significant way as long as one keeps at least about one hundred keywords. When having keyword records like bonds up, bonds strong, and bonds rising, one can without too much implication delete up to

two of those records carrying the same meaning. However, if all the keyword records related to bonds are deleted then this decreases the prediction accuracy. Similarly, this applies to stock, currency, or inflation.
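The significance argument above is easy to reproduce. The sketch below (ours, using only the figures quoted in the text) evaluates the normal approximation of the random-guessing baseline:

import math

def prob_random_guess_at_least(accuracy, n_classes=3, n_trials=395):
    """Probability that random guessing reaches at least the given accuracy,
    using the normal approximation of the binomial distribution."""
    p = 1.0 / n_classes                        # expected accuracy of random guessing
    variance = p * (1 - p) / n_trials          # variance of the observed accuracy
    z = (accuracy - p) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2))   # 1 - Phi(z)

print(prob_random_guess_at_least(0.468))            # essentially 0 (z is about 5.7)
print(round(prob_random_guess_at_least(0.402), 4))  # about 0.002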

A standard issue in data mining is the quality of the data. In our case, for example, it would affect the forecast if a news headline used double negation as in "Fed denies dollar will be weaker" instead of the formulation "Fed says dollar will be strong". This has implications since, if we have the keyword records "dollar weak" and "dollar strong", the first occurs in the first version of the news whereas the second occurs in the second version. Despite the two versions carrying the same meaning, their implications on the keyword records are opposite: one points to a weaker dollar, the other to a stronger dollar. To a certain degree this can be taken care of by our system. The rules we generate can have negated attributes and can therefore express the same situation using either of the two keywords. That is, not dollar_weak is about dollar_strong. However, in general, news headlines almost never use negation. The financial reporters producing such news (for example, Dow Jones has hired 3,500 journalists around the world) have strict guidelines on how to write their reports. One such guideline explicitly states that any kind of negation has to be avoided.

6 Conclusions

There is a rich variety of financial on-line data sources available. These mostly textual data sources can be used to predict financial markets. The conventional approach to financial market prediction has been to use numeric time series data to forecast stock, bond and currency markets. Unlike time series data, textual statements contain not only the effect (e.g., stocks


down) but also the possible causes of the event (e.g., stocks down because of weakness in the dollar and consequently a weakening of the treasury bonds). Exploiting textual information therefore increases the quality of the input. Research on using text for prediction purposes has just started and not too many results are available yet. One of the key issues is how to preprocess the text. This paper presents several novel text processing techniques and investigates their appropriateness for predicting financial markets.

We improve upon the experience gained in previous work, where we started this new breed of financial market forecasting from textual input. Standard information retrieval text processing techniques are recalled. It is pointed out that information retrieval and financial market prediction have different objectives, hence different text processing techniques may be adequate. Four different text processing schemes are introduced. Simple weighting incorporates already proven concepts and is shown by Wüthrich et al. [1998] to be superior to many other text processing techniques incorporating information retrieval and statistical techniques. The most sophisticated weighting scheme, vector weighting with cluster relevance and discrimination, is shown to improve upon simple weighting, especially in terms of stability. This is demonstrated experimentally using a data set consisting of forty-one high-quality financial news sources. The news has been collected over a period of about half a year. Furthermore, the experiments are continuously going on, as the performance of the forecasting system can now be observed in real time via www.cs.ust.hk/~beat/Predict.

It has to be said, however, that the most promising weighting scheme, vector weighting with cluster relevance and discrimination, is computationally also the most demanding. It is therefore not suited for information retrieval purposes where the document collections are huge and queries should be answered instantly. For daily or even hourly prediction purposes, the time deadlines can be met using this scheme. There are many textual news sources which are not publicly available on the Web, e.g. the real-time news from Reuters, Bloomberg, Dow Jones and others. It is believed, and experimentally supported by [Peramunetilleke and Wüthrich 1998], that a successful weighting scheme and prediction algorithm can also be applied to such more frequently updated news. Future research on text processing for financial market prediction will surely also include issues of semantics of keyword records, i.e. identifying clusters of records with similar meaning and exploiting the


existence of such clusters.

References

[Agrawal R., et al.] "The Quest Data Mining System", Proc. KDD96, 1996.
[Bahmani-Oskooee M.] "A Time-Series Approach to Test the Productivity Bias Hypothesis in Purchasing Power Parity", Kyklos, 45(2): 227-236, 1992.
[Cho V. and Wüthrich B.] "Towards Real-time Discovery from Distributed Information Sources", 3rd Pacific Asia Conf. on Knowledge Discovery and Data Mining, Melbourne, 1998.
[El-Hamdouchi A. and Willett P.] "An Improved Algorithm for the Calculation of Exact Term Discrimination Values", Information Processing & Management, 24(1): 17-22, 1988.
[Iman R.L. and Conover W.J.] Modern Business Statistics, Wiley, 1989.
[Keen E.M.] "Query Term Weighting Schemes for Effective Ranked Output Retrieval", 15th International Online Information Meeting Proceedings, pp. 135-142, 1991.
[Leung S.] "Automatic Stock Market Prediction from World Wide Web Data", MPhil thesis, The Hong Kong University of Science and Technology, Jan 1997.
[Lucarella D.] "A Document Retrieval System Based on Nearest Neighbour Searching", Journal of Information Science Principles & Practice, 14(1): 25-33, 1988.
[Nazmi N.] "Forecasting Cyclical Turning Points with an Index of Leading Indicators: A Probabilistic Approach", Journal of Forecasting, Vol. 12, No. 3&4, pp. 216-226, 1993.
[Peramunetilleke D. and Wüthrich B.] "A System for Exchange Rate Forecasting from News Headlines", MPhil thesis, The Hong Kong University of Science and Technology, Jan 1997. (revised version under submission to IEEE TKDE)
[Pictet O.V. et al.] Genetic Algorithms with Collective Sharing for Robust Optimization in Financial Applications, TR Olsen & Assoc. Ltd., Zurich, 1996.
[Pring M.J.] Technical Analysis Explained, McGraw-Hill, 1991.
[Reynolds S.B. and Maxwell A.] "Box-Jenkins Forecast Model Identification", AI Expert, 10(6), pp. 15-28, 1995.
[Salton G. and Buckley C.] "Term-weighting Approaches in Automatic Text Retrieval", Information Processing and Management, Vol. 24, No. 5, pp. 513-523, 1988.
[Sparck J.K.] "A Statistical Interpretation of Term Specificity and Its Application in Retrieval", Journal of Documentation, 28(1): 11-21, 1972.
[Wong W.Y. and Lee D.L.] "Implementations of Partial Document Ranking Using Inverted Files", Information Processing & Management, 29(5): 647-669, 1993.
[Wood D. et al.] "Classifying Trend Movements in the MSCI USA Capital Market Index - a Comparison of Regression, ARIMA and Neural Network", Computers & Operations Research, 23(6), pp. 611-622, 1996.


[Wüthrich B.] "Probabilistic Knowledge Bases", IEEE Transactions on Knowledge and Data Engineering, Vol. 7, No. 5, pp. 691-698, 1995.
[Wüthrich B.] "Discovering Probabilistic Decision Rules", Int. Journal of Intelligent Systems in Accounting, Finance and Management, Vol. 6, pp. 269-277, 1997.
[Wüthrich B., Leung S., Peramunetilleke D., Cho V., Zhang J., Lam W.] "Daily Prediction of Stock Market Indices from Textual WWW Data", 4th Int. Conf. on Knowledge Discovery and Data Mining, to appear, New York, 1998. (This paper is also invited for the IEEE Int. Conf. on SMC, 1998.)
[Xu L.] "Bayesian Ying-Yang Machine, Clustering and Number of Clusters", Pattern Recognition Letters, Vol. 18, No. 11-13, pp. 1167-1178, 1997.
[Zhou S.] "Purchasing Power Parity in High-Inflation Countries: A Cointegration Analysis of Integrated Variables with Trend Breaks", Southern Economic Journal, 64(2): 450-467, 1997.
