Outlier Mining on Multiple Time Series Data in Stock ... - Springer Link

Outlier Mining on Multiple Time Series Data in Stock Market

Chao Luo, Yanchang Zhao, Longbing Cao, Yuming Ou, and Li Liu

Faculty of Engineering & IT, University of Technology, Sydney, Australia
{chaoluo,yczhao,lbcao,yuming,liliu}@it.uts.edu.au

Abstract. With the dramatic increase of stock market data, traditional outlier mining technologies have shown their limitations in efficiency and precision. In this paper, an outlier mining model for stock market data is proposed, which aims to detect anomalies in multiple complex stock market data. The model improves the precision of outlier mining on individual time series. Experiments on real-world stock market data show that the proposed outlier mining model is effective and outperforms traditional technologies.

Keywords: Outlier mining, time series, stock market.

1 Introduction

In the stock market, the key surveillance function is to identify market anomalies, such as insider trading and market manipulation, so as to provide a fair and efficient trading platform [2,6]. Insider trading refers to trades based on privileged information unavailable to the public [8]. Market manipulation refers to a trade or action that aims to interfere with the demand for or supply of a given stock so that its price increases or decreases in a particular way [3]. Recently, new intelligent technologies have been required to deal with the challenges posed by the rapid increase of stock data. Outlier mining technologies have been used to detect market manipulation and insider trading. The objective of outlier mining is to find the data objects that are grossly different from or inconsistent with the majority of the data. However, in stock market data, outliers are highly intermixed with normal data [4], and it is difficult to judge whether an object is an outlier or not. Therefore, a more effective and more efficient approach is in demand.

This paper presents a new technique for outlier detection on multiple time series data in the stock market. First, the principal curve algorithm is used to detect outliers from individual measurements of the stock market. Then, the generated outliers are measured by their probability of being real alerts. To improve accuracy and precision, these outliers are combined by rules associated with domain knowledge. The experimental results on real stock market data show that the proposed model is feasible in practice and achieves higher accuracy and precision than traditional methods.

This work was partly supported by the Australian Research Council (ARC) Linkage Project LP0775041 and Discovery Projects DP0667060 & DP0773412.

T.-B. Ho and Z.-H. Zhou (Eds.): PRICAI 2008, LNAI 5351, pp. 1010–1015, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Related Work

A qualified surveillance function is expected to capture all the anomalies from a large amount of complex market records, while avoiding false alerts so as to reduce the waste of time and human resources [1]. Alerts are typically generated by rule-based approaches: whenever an actual value is above a predetermined threshold, a specific alert is triggered. In addition, statistical methods are used to improve the effectiveness of surveillance by analysing the mean and standard deviation of values.

Recently, several studies have made valuable theoretical progress in improving surveillance with information technologies. Palshikar et al. [9] studied collusion set detection using graph clustering. A set of traders is a candidate collusion set when they trade heavily among themselves, as compared to their trading with others. They proposed new graph clustering algorithms for this problem. Lee and Yang [5] provided a prototype artificial immune abnormal trading detecting system (AIAS), which aims to detect abnormal trading in stock markets.

An effective method to evaluate outliers is the variance-based outlier mining model (VOMM) proposed by Qi and Wang [10]. In their model, outliers are viewed as the top k samples holding maximal abnormal information in a dataset. VOMM is executed by using the principal curve algorithm. Their experiments on a real-world dataset show that it performs better than the Gaussian model and the GARCH model. VOMM detects outliers based on stock price only, which carries limited information about the stock market. We will present a new technique that improves VOMM for more effective outlier detection on multiple time series data in the stock market.
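The rule-based and statistical alerts mentioned at the start of this section can be illustrated with a minimal sketch. The function names, the threshold value and the z-score rule are assumptions made for illustration; the cited surveillance systems may define their rules differently.

```python
from statistics import mean, stdev


def threshold_alerts(values, threshold):
    """Rule-based alert: flag every index whose value exceeds a fixed threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def zscore_alerts(values, z=3.0):
    """Statistical alert (assumed rule): flag values more than z sample
    standard deviations away from the mean of the series."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # a constant series has no deviating values
    return [i for i, v in enumerate(values) if abs(v - mu) > z * sigma]


# Hypothetical daily values with one obvious spike at index 4.
daily_values = [1.0, 1.1, 0.9, 1.0, 5.0, 1.05]
rule_hits = threshold_alerts(daily_values, threshold=2.0)
stat_hits = zscore_alerts(daily_values, z=1.5)
```

Both rules look at one measure at a time, which is the limitation the multi-series approach in this paper is meant to address.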

3 Outlier Mining on Multiple Time Series (OMM)

There are multiple measures in the stock market, such as price, volume and volatility, and each measure forms a time series. In order to combine multiple time series in the stock market efficiently, it is necessary first to choose appropriate data based on financial knowledge of the stock market and then to define a quantitative measurement of outliers. Price movement and trading amount are regarded as good measurements for anomalies [7]. Price movement can be measured by the price return and the price fluctuation range during one day. The price fluctuation range is the difference between the highest price and the lowest price in one day.

Voting-based OMM. Our first model is Voting-based OMM (Voting-based Outlier Mining on Multiple time series). Let $D \subseteq R^n$ be the sample space of the stock market. Let $T \subseteq D$ ($\mathrm{Card}(T) = n$) be a set of samples drawn from $D$. A simple way to find the optimal function is majority voting. That is, every time series is used to detect outliers individually, and a day is then output as an outlier if an outlier was found on that day in the majority of the time series. In the voting-based OMM, the top k outliers are detected from each individual time series with the principal curve algorithm. This produces n candidate outlier sets $V_i$, $i = 1, 2, \ldots, n$. Let the function $\mathrm{Counter}(X)$ count the number of times $X$ appears in $V_i$, $i = 1, 2, \ldots, n$. Hence, $\mathrm{Counter}(X) \in \{1, 2, \ldots, n\}$ for $X \in (V_1 \cup V_2 \cup \ldots \cup V_n)$. Let the function $f(X)$ evaluate whether $X$ is an outlier. If $\mathrm{Counter}(X)$ indicates that $X$ wins the majority vote, $f(X) = 1$; otherwise, $f(X) = 0$.

Probability-based OMM. Our second model is Probability-based OMM (Probability-based Outlier Mining on Multiple time series). In probability-based OMM, we define a quantitative measurement of outliers. It is similar to the Dixon Ratio Test. Let $HV$ be the value of the test sample, $AV$ the average value of all samples that are less than the test sample, and $LV$ the lowest value. Our test ratio $R$ is calculated as

$$R = \frac{HV - AV}{HV - LV}. \tag{1}$$
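As a worked illustration of Formula (1), the following minimal sketch computes the test ratio R for one test value against a list of samples. The function name and the sample values are hypothetical; only the ratio itself comes from the text.

```python
def outlier_ratio(samples, test_value):
    """Test ratio R = (HV - AV) / (HV - LV) from Formula (1).

    HV is the test value, AV the mean of all samples strictly below it,
    and LV the lowest sample value.
    """
    lower = [s for s in samples if s < test_value]
    if not lower:
        raise ValueError("no sample lies below the test value")
    av = sum(lower) / len(lower)
    lv = min(samples)
    return (test_value - av) / (test_value - lv)


# Hypothetical samples: HV = 10, AV = (1 + 2 + 3) / 3 = 2, LV = 1.
r = outlier_ratio([1.0, 2.0, 3.0, 10.0], 10.0)
```

With these values the ratio is (10 - 2) / (10 - 1) = 8/9. The ratio approaches 1 when the remaining samples cluster near the lowest value, that is, when the test value stands far apart from the rest.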

In the probability-based OMM method, the top k outliers are chosen from each individual time series with the principal curve algorithm. This produces n candidate outlier sets $V_i$, $i = 1, 2, \ldots, n$. Then we calculate the outlier ratio $R$ based on Formula (1). Let $\mathrm{Dis}(X_i)$ be the distance between $X_i$ and the generated principal curve, where $X_i \in T$, $i = 1, 2, \ldots, k$. Let $LD = \min(\mathrm{Dis}(X_i))$ be the lowest distance for all the $X$ in $V_i$. Let $P(X_i) \in (0, 1)$ be the probability of $X_i$ being an outlier, and we get

$$P(X_i) = \frac{\mathrm{Dis}(X_i) - \frac{1}{n-i} \sum_{j=i+1}^{n} \mathrm{Dis}(X_j)}{\mathrm{Dis}(X_i) - LD}, \tag{2}$$

where $i = 1, 2, \ldots, k$. For each $X \in (V_1 \cup V_2 \cup \ldots \cup V_n)$, the outlier test ratios $P_1(X), P_2(X), \ldots, P_n(X)$ are calculated corresponding to the n individual dimensions. Let the final $P(X)$ be the maximum value:

$$P(X) = \max(P_1(X), P_2(X), \ldots, P_n(X)), \quad X \in (V_1 \cup V_2 \cup \ldots \cup V_n). \tag{3}$$

The final step is to sort the candidate outlier set T in descending order of P(X), and then choose the top k samples as the outliers.
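The two combination schemes of this section can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the distances to the principal curve are taken as precomputed inputs (the principal curve fitting itself is not shown), "majority" is read as a strict majority of the series, and in Formula (2) the sum and the lowest distance LD are taken over all samples of a series sorted by descending distance. All function names and the toy data are hypothetical.

```python
from collections import Counter


def ranked_days(distances):
    """Day indices sorted by descending distance to the principal curve."""
    return sorted(range(len(distances)), key=lambda d: distances[d], reverse=True)


def voting_omm(series_distances, k):
    """Voting-based OMM: keep days that appear in the top-k candidate set
    of a strict majority of the series (assumed reading of 'majority')."""
    candidate_sets = [set(ranked_days(d)[:k]) for d in series_distances]
    counts = Counter(day for v in candidate_sets for day in v)
    n = len(series_distances)
    return {day for day, c in counts.items() if c > n / 2}


def outlier_probabilities(distances, k):
    """Formula (2) for the top-k candidates of a single series."""
    order = ranked_days(distances)
    lowest = min(distances)  # LD, assumed to range over the whole series
    probs = {}
    for rank, day in enumerate(order[:k]):
        rest = [distances[d] for d in order[rank + 1:]]
        denom = distances[day] - lowest
        if not rest or denom == 0:
            continue  # the ratio is undefined for this candidate
        probs[day] = (distances[day] - sum(rest) / len(rest)) / denom
    return probs


def probability_omm(series_distances, k):
    """Probability-based OMM: Formula (3) takes the maximum probability of
    each day over all series, then the k highest-ranked days are returned."""
    best = {}
    for distances in series_distances:
        for day, p in outlier_probabilities(distances, k).items():
            best[day] = max(p, best.get(day, 0.0))
    return sorted(best, key=lambda day: best[day], reverse=True)[:k]


# Toy distances for three series over six days (hypothetical values).
toy = [
    [0.1, 5.0, 0.2, 4.0, 0.3, 0.1],
    [0.2, 6.0, 0.1, 0.3, 3.0, 0.2],
    [0.1, 0.2, 0.3, 7.0, 0.1, 2.0],
]
voted = voting_omm(toy, k=2)
ranked = probability_omm(toy, k=2)
```

In this toy run the voting scheme can return fewer or more than k days, because the size of its output depends on how the candidate sets overlap, while the probability scheme always returns at most k days.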

4 Experiments

Experimental Method. The experimental data are daily transaction records from the Shanghai Stock Exchange over 425 trading days from 1 June 2004 to 3 Mar 2006. The attributes of the data sets include the daily highest price, the daily lowest price, the daily closing price and the daily trade amount. Daily price return and daily price fluctuation range are calculated based on financial domain knowledge. We choose the real alerts generated by the China stock exchange during the above timeframe as the benchmark for our experiments. By taking a day as abnormal if there are one or more alerts during the day, the alerts are converted into 21 abnormal trading days. Hence, the trading days on which alerts were found are regarded as exceptional days. The Time Constraint Principal Component Analysis (TCPCA) approach is used to preserve the temporal sequence information under the condition of the principal curve algorithm [11]. In our experiments, we set the scale factor λ = 2 according to the observations in experiments.

Table 1. Comparison on the Number of Correctly Detected Outliers

Method                   k=60    k=50    k=40    k=30    k=20    k=10
VOMM on Price Return     18      17      16      15      13      9
VOMM on Price Range      17      17      16      16      15      9
VOMM on Trade Amount     15      15      13      13      13      9
Voting-based OMM         20/52   20/45   17/32   17/27   16/19   9/9
Probability-based OMM    21      20      20      20      16      10

Experimental Results. The experimental results are shown in Table 1. The columns stand for the factor k in the above experiment, which indicates the expected number of outliers. For example, k = 20 means that the top 20 samples are regarded as outliers, while the rest of the samples are regarded as normal samples. The value in each row is the number of alerts correctly identified by the corresponding method. For example, the value 16 in the first row under k = 40 means that 16 alerts are identified by VOMM on the price return measure. One special case is Voting-based OMM, where the left value is the number of real alerts detected, while the right value is the calculated number of outliers.

Fig. 1(a), Fig. 1(b) and Fig. 1(c) show the results of VOMM on daily price return, daily price range and daily trade amount. The smooth curve passing through the middle of the data sets is the generated principal curve. The X axis is the temporal sequence, which is tuned by the scale factor λ = 2. The Y axis is the value of the individual measure. The samples marked only by "o" are expected outliers but not alerts, and the samples marked only by "*" are real alerts that were not identified. The samples marked by both "o" and "*" are identified real alerts. Fig. 1(d) shows the experimental results of Probability-based OMM. The X axis indicates the trading day, while the Y axis shows the probability of being an outlier. The samples marked by "o" are detected with k = 60. The samples marked by "*" show real alerts.

Experimental Result Analysis.
The experimental results are analyzed based on four variables: True Positive (TP) is the number of detected outliers that are real alerts; False Positive (FP) is the number of detected outliers that are not actual alerts; False Negative (FN) is the number of days identified as normal that are real alerts; and True Negative (TN) is the number of days identified as normal that are not alerts. In order to compare the results in an intuitive way, we calculate the accuracy, specificity, precision and recall based on the following formulae:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{4}$$

Fig. 1. Outliers Detected. Panels: (a) VOMM on Daily Price Return, (b) VOMM on Daily Price Range, (c) VOMM on Daily Trade Amount (1,000,000 RMB), each plotted against Trading Day; (d) Probability-Based OMM, plotting Probability of being an outlier against Trading Day, with legend entries Real Alerts Points and Outliers Ranked with Probabilities.

$$\mathrm{Specificity} = \frac{TN}{FP + TN} \tag{5}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{6}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{7}$$
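The four measures can be computed directly from the counts reported in the text. The sketch below is illustrative: the function name is hypothetical, the defaults of 425 trading days and 21 abnormal days are the figures stated in the experimental method, and the example call uses the Probability-based OMM entry of Table 1 at k = 20 (16 real alerts among 20 detected outliers).

```python
def surveillance_metrics(tp, detected, alerts=21, days=425):
    """Accuracy, specificity, precision and recall from Formulas (4)-(7).

    tp       -- detected outliers that are real alerts
    detected -- total number of detected outliers (k, or the right-hand
                value in Table 1 for Voting-based OMM)
    alerts   -- number of abnormal trading days in the benchmark
    days     -- total number of trading days
    """
    fp = detected - tp         # detected but not real alerts
    fn = alerts - tp           # real alerts that were missed
    tn = days - detected - fn  # normal days correctly left unflagged
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "specificity": tn / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Probability-based OMM at k = 20: TP = 16, FP = 4, FN = 5, TN = 400.
m = surveillance_metrics(tp=16, detected=20)
```

For this entry the accuracy is 416/425, the specificity 400/404, the precision 16/20 and the recall 16/21.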

Fig. 2 shows the accuracy, specificity, precision and recall of the models, where the X axis stands for k and the Y axis indicates the values of the four measures. We can see that the Voting-based OMM (V-BOMM) and Probability-based OMM (P-BOMM) perform better than VOMM on all four measures, whatever the value of k. Another finding is that the accuracy is optimal when k = 20. As k increases, the precision and specificity decrease and the recall increases.

Fig. 2. Comparison on Accuracy, Specificity, Precision and Recall of Different Models. Four panels (Accuracy, Precision, Specificity, Recall) plotted against K Value for VOMM on price return, VOMM on price range, VOMM on trade amount, V-BOMM and P-BOMM.

5 Conclusion

In this paper, we have studied outlier mining on multiple time series in the stock market and proposed two models for outlier mining on multiple time series (OMM). The experimental results show that the proposed models perform better on all four measures than the previous outlier mining model VOMM on a single time series. In future work, we will study how to improve OMM to detect exceptional patterns on multiple time series, especially in the stock market. Another potential direction for future research is using microstructure knowledge to assist stock market surveillance.

References

1. Brown, P., GoldSchmidt, P.: Alcod idss: Assisting the Australian stock market surveillance teams review process. Applied Artificial Intelligence 10, 625–641 (1996)
2. Cheng, L., Firth, M., Leung, T., Rui, O.: The effects of insider trading on liquidity. Pacific-Basin Finance Journal 14, 467–483 (2006)
3. Dobson, M., Felixson, K., Pelli, A.: Day end returns-Stock price manipulation. Journal of Multinational Financial Management 9, 95–127 (1999)
4. Han, J., Kamber, M.: Data Mining: concepts and techniques. Morgan Kaufmann Publishers, San Francisco (2001)
5. Lee, V.C.S., Yang, X.J.: Development and test of an artificial-immune-abnormal-trading-detection system for financial markets. In: Huang, D.-S., Zhang, X.-P., Huang, G.-B. (eds.) ICIC 2005. LNCS, vol. 3644, pp. 410–419. Springer, Heidelberg (2005)
6. Lucas, H.C.: Market expert surveillance system. Communications of the ACM 36, 27–34 (1993)
7. Meulbroek, L.K.: An empirical analysis of illegal insider trading. The Journal of Finance 47, 1661–1699 (1992)
8. Minenna, M.: Insider trading abnormal return and preferential information: Supervising through a probabilistic model. Journal of Banking and Finance 27, 59–86 (2003)
9. Palshikar, G.K., Apte, M.M.: Collusion set detection using graph clustering. Data Mining and Knowledge Discovery 16, 135–164 (2008)
10. Qi, H., Wang, J.: A model for mining outliers from complex data sets. In: The 2004 ACM symposium on Applied computing, pp. 595–599. ACM, New York (2004)
11. Reinhard, K., Niranjan, M.: Parametric subspace modeling of speech transitions. Speech Communication 27, 19–42 (1999)
