2013 10th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD)
Designing a Big Data Processing Platform for Algorithm Trading Strategy Evaluation

Xiongpai Qin (1), Xiaoyun Zhou (2)
(1) School of Information, Renmin University of China, Beijing, 100872, China
(2) Computer Science Department, Jiangsu Normal University, Xuzhou, Jiangsu, 221116, China
[email protected], [email protected]
Abstract—Algorithm trading techniques are adopted by institutional and individual investors with the expectation of making profit. Not only are the algorithms getting more complex, but the data on which they run is becoming bigger, and various kinds of data are exploited to make more accurate predictions. The next generation of algorithm trading will be big data driven. This paper presents a big data processing platform for algorithm trading strategy evaluation, which facilitates the back testing process so that algorithms can be put into production use with a high degree of confidence. Various data processing techniques are integrated in a unified system, and some data access optimization techniques are also discussed.

Keywords—Algorithm Trading; Big Data; Back Testing
I. INTRODUCTION
With the development of computer and communication techniques, algorithm trading is on the rise. Algorithm trading uses computer programs to automate trading actions; in some cases the programs decide on many aspects of trading, such as timing, quantity, and buy/sell/hold actions, and execute orders without human intervention [1]. Algorithm trading is used by buy-side traders as well as sell-side traders. In this paper, the discussion focuses on buy-side traders, including institutional traders and individual traders, who use algorithm trading in their arbitrage and speculation.

The heart of algorithm trading is the trading strategy, which relies on rules, fuzzy rules, time series analysis, machine learning, text mining, and other analytic techniques to generate buying and selling signals. Not only are the algorithms getting more complex, but the volume of data to be analyzed is becoming larger, and various kinds of data are exploited to make more accurate predictions. People use TICK format price data, news data, and web data (blogs, tweets, etc.) to identify trading opportunities. Some researchers have combined technical analysis with sentiment analysis for stock price prediction; the joint analysis of structured data (prices, indicators) with unstructured data has shown great potential for making profit [2]. Paul Hawtin in London assessed the collective mood of the populace and generated a global sentiment score by monitoring 340 million Twitter posts every day, and then traded millions of dollars of financial assets; he was reported to have achieved a gain of more than 7 percent in the first quarter of 2012 [3].

Before a new trading strategy is put into production use, it is first run against historical data, and then against real-time price data from exchanges and real-time data from the web in a simulation manner. The simulated running of a trading strategy verifies that the strategy can adapt to different market situations, make profit, and stop losses in exceptional situations. The evaluation process is also called back testing of an algorithm trading strategy.

II. DATA PROCESSING REQUIREMENTS OF TRADING STRATEGY EVALUATION
The first type of data used in trading strategy evaluation is price data. Price data is highly structured and can be managed by an RDBMS. The toughest challenge here is the large volume and high velocity of the data. When analysis is performed at the finest granularity of price, the TICK format, the volume of data to process is huge. Imagine that one is trying to capture arbitrage opportunities between any arbitrary pair of assets in a market: he or she must continuously collect price data for thousands of assets. According to [4], in North America alone, the average number of daily TICKs processed by Interactive Data in October 2011 was 336,351,522 in futures markets and 304,437,791 in stock markets, while in options markets the number was an order of magnitude larger, a striking 9,030,826,468. A special class of algorithm trading, high frequency trading, uses TICK data to trade. As more and more trading orders are entered by machines, trading is performed at very high speed, and the data must be processed in time; otherwise one will lag behind the market and lose money, not to mention making profit.

Data such as news and web data is highly unstructured, and its volume is even larger than that of the price data. Such data cannot easily be stored in the tables of an RDBMS, which calls for a different kind of data store. The analysis of unstructured data also differs from that of price data: one needs natural language processing, text mining, and other techniques to extract insights from the data for trading decisions.

III. DESIGNING OF THE BIG DATA PROCESSING PLATFORM
As mentioned above, there are two types of data to be stored in and processed by the data platform. Historical price data is often stored in an RDBMS for back testing; when a trading strategy is run against real-time price data, the data should be processed by a streaming engine. For unstructured data such as news and blogs, an unstructured data processing tool should be used, and MapReduce is one alternative. MapReduce has risen in recent years as a standard tool for processing large volumes of unstructured data [5]. Besides unstructured data, MapReduce can also process structured data well: beyond simple SQL summation, researchers have migrated many complex algorithms to the MapReduce platform, including OLAP, data mining, machine learning, information retrieval, text mining, multimedia data processing, science data processing, and social network algorithms. Hadoop is an open source implementation of the MapReduce technique.

TABLE I. VARIOUS TYPES OF DATA USED IN TRADING

Data Type    | Data              | Data Processing Techniques
Structured   | Price, ...        | RDBMS, Stream Database
Unstructured | News, Blogs, ...  | Hadoop

The work is supported by NSF of China under Grant No. 61170013 and NSF of Jiangsu Province under Grant No. BK2012578.
978-1-4673-5253-6/13/$31.00 ©2013 IEEE
A. Developing of a Trading Strategy

From designing to production use, a trading strategy goes through several stages. First, the trading strategy should be run thoroughly against historical data to assess its performance. During this stage, one may tune the parameters of the trading strategy for better performance. Testing algorithm strategies on historical data is a data-intensive task: the price data and news/web data are fed into the trading strategy to drive it forward, and finally several performance metrics are measured and a report is generated. Investors are concerned with a number of performance metrics, including the number of profitable trades, the number of losing trades, the winning ratio, trading fees, consecutive wins, consecutive losses, the biggest win, the biggest loss, total wins, total losses, the maximum drawdown, return volatility, the closed profit, the percentage of gain, etc. Some investors are more risk tolerant, while others favor steadier trading strategies.

Second, the trading strategy should be run against real-time market data to see whether it works in real market situations. The testing period should be long enough to cover typical market situations, including trending markets (up and down) and sideways markets. In this stage, the algorithm trades using a local virtual account; slippage and partial order fulfillment cannot be fully simulated. News data and web data are also fed in a real-time manner. To maintain only one copy of the trading strategy, the interface used to access the data in the historical database should be identical to the one used to access the real-time market data from exchanges, news companies, and the web.

Figure 1. Designing, Testing, and Production Running of Trading Strategies
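Several of the performance metrics listed above can be computed directly from the sequence of closed-trade profits and losses produced by a back test. A minimal sketch (the trade list and function names are illustrative, not the platform's actual report format):

```python
# Compute two of the back-testing metrics mentioned above (winning ratio,
# maximum drawdown) from a sequence of closed-trade profits/losses.

def winning_ratio(pnls):
    """Fraction of closed trades that were profitable."""
    wins = sum(1 for p in pnls if p > 0)
    return wins / len(pnls)

def max_drawdown(pnls, initial_balance=100_000.0):
    """Largest peak-to-trough drop of the running account balance."""
    balance, peak, drawdown = initial_balance, initial_balance, 0.0
    for p in pnls:
        balance += p
        peak = max(peak, balance)
        drawdown = max(drawdown, peak - balance)
    return drawdown

trade_pnls = [120.0, -80.0, 300.0, -50.0, -40.0, 210.0]  # illustrative data
print(winning_ratio(trade_pnls))   # 3 of 6 trades won -> 0.5
print(max_drawdown(trade_pnls))
```

The remaining metrics (return volatility, biggest winner/loser, and so on) are similar single passes over the trade journal, which is why the platform stores simulation results back into the RDBMS for later analysis.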
After the first and second stages, investors will have discarded unprofitable trading strategies. The remaining strategies are ranked, and some of them are put into production running. The production version of a trading strategy does not run in the back testing system; it is migrated to a separate production system, because its trading orders really get executed in the exchanges and make or lose money. To move from designing to simulation and then to production seamlessly, the trading strategy should use only one set of interfaces to access historical data and real-time data. It is worth mentioning that, during the simulation stages and the production running stage, when the performance of a trading strategy is not acceptable, one should go back to the designing stage and revise it.

B. Use Scenarios, the Interfaces, and the Data Service Layer

The architecture of the data platform is discussed using a bottom-up approach. The data platform has several use scenarios, including simulation of trading strategies on historical data, simulation of trading strategies on real-time data, and journaling and auditing of trading strategies in production running.

• Simulation of Trading Strategies on Historical Data

In this use scenario, both price data and news/web data are used. Price data is stored in an RDBMS, while news data and web data are stored in a Hadoop cluster. The price data is injected into the RDBMS via subscriptions to exchanges; news data can be bought from news companies such as Reuters and Bloomberg; web data (mainly blogs and tweets) is crawled from the web. When a trading strategy runs in simulation mode, it accesses the price database and the news/web data through a standard set of interfaces. A time service coordinates the simulation: it is started before the simulation, advances on its own, and passively serves time requests from trading strategies.

Figure 2. Simulation of Trading Strategies on Historical Data
The simulation results are stored back into the RDBMS for later analysis. When trading signals are generated, they are routed to an order executor, the order is executed, and a virtual balance account is updated when the position is closed: when money is made the balance increases, and when money is lost it decreases. In some cases, for a more lifelike simulation, trading fees are charged and withdrawn from the account. The simulation is performed on selected financial assets over a specific period of time.

Different Assets: To fully verify a trading strategy, one may be interested in its performance on different financial assets. A trading strategy may perform well on some assets while performing poorly on others, so thoroughly testing a strategy over a number of assets is necessary for strategy selection.

Different Parameter Combinations: A trading strategy has parameters that can be tuned, and these parameters can take different value combinations. It is necessary to find out which combination is optimal according to criteria such as lower risk and higher profit.

Comparison of a Bunch of Strategies: Sometimes people want to know which of several trading strategies is better by some standard. In this case, a ranking function can be used to compute ranking scores for the strategies and select the best one.

Several Trading Strategies as a Package: No single trading strategy performs well in all market situations for all financial assets. Some people combine several trading strategies into a package and run it against different assets, expecting the package to make a reasonable profit. The system should also support this kind of simulation.
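The order executor and virtual balance account described above can be sketched as follows. Class and method names are illustrative assumptions, the contract symbol is hypothetical, and the flat per-trade fee stands in for a real fee schedule:

```python
# A minimal virtual account for historical-data simulation: the balance
# changes only when a position is closed, and a flat fee per round trip
# stands in for real trading fees (both are simplifying assumptions).

class VirtualAccount:
    def __init__(self, balance=100_000.0, fee_per_trade=2.0):
        self.balance = balance
        self.fee_per_trade = fee_per_trade
        self.positions = {}  # symbol -> (quantity, entry_price)

    def open_position(self, symbol, quantity, price):
        self.positions[symbol] = (quantity, price)

    def close_position(self, symbol, price):
        """Realize profit/loss and deduct the trading fee."""
        quantity, entry = self.positions.pop(symbol)
        pnl = (price - entry) * quantity
        self.balance += pnl - self.fee_per_trade
        return pnl

acct = VirtualAccount()
acct.open_position("IF1306", quantity=10, price=2500.0)  # hypothetical symbol
pnl = acct.close_position("IF1306", price=2512.0)
print(pnl, acct.balance)  # 120.0 profit; balance 100118.0 after the fee
```

Per-asset and per-parameter-combination runs described above would each drive their own `VirtualAccount` instance, with the results written back to the RDBMS for ranking.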
• Simulation of Trading Strategies on Real Time Data

After simulation on historical data, some trading strategies are screened out, and the remaining strategies move into the second stage. The data sources differ from the first stage: price data is fetched from exchanges just in time, news data is fed into the trading strategies from news companies using a pull or push approach, and web data is crawled from the web in real time. The interfaces used to access the data do not change, but their implementations are adapted to the new data sources.

Since news data and web data are huge in volume, it is not appropriate to fetch them again for later use; after the data is consumed by the trading strategies, it is also stored in the Hadoop cluster for later simulation use. As for price data, for efficiency it can be loaded into the RDBMS in batch mode rather than inserted one price at a time.

Figure 3. Simulation of Trading Strategies on Real Time Data

• Production Running of Trading Strategies

After simulation on historical and real-time data, one may be confident in some trading strategies, and some of them can be put into production running. Most exchanges in the world now expose price data fetching and order execution functions through APIs. We can implement the standard set of interfaces by adapting to these APIs and attach the trading strategies to them. The trading strategies can then route trading signals directly to the exchanges; no time service is needed, and the trading orders are executed in the exchanges. For later assessment of the trading strategies, a journal of the executed orders is recorded in the local RDBMS, and the local virtual account is updated accordingly. News data and web data are still received from their respective sources, and after being consumed by the trading strategies they are stored in the Hadoop cluster.

Figure 4. Production Running of Trading Strategies

• Interfaces for Data Accessing

Several interfaces need to be defined: an interface to advance the time service, interfaces to fetch price data, news data, and web data, an interface to execute orders, an interface to journal trading actions, and an interface to store web data. The interfaces can be implemented in different ways so that the trading strategies can run in several modes without any modification; changing a configuration file is enough to switch among simulation on historical data, simulation on real-time data, and production running. As an example, the time service interface, the price data fetching interface, and the order execution interface are listed below.

Time Service Interface
1. startTimeService(initial_timestamp); // after being started, the time service advances on its own
2. stopTimeService();
3. getCurrentTime(); // returns the current timestamp

Price Data Fetching Interface
1. fetchPriceData(current_timestamp, symbol_name); // returns the quote of a symbol at the given time, including best bid, bid size, offer, offer size, etc.

Order Execution Interface
1. executeOrder(current_timestamp, symbol_name, quantity, buy_sell_flag, type); // executes the order and returns whether it is fully or partially executed; besides normal orders, the type parameter can indicate other order types such as stop-loss orders

Figure 5. Interfaces for Time Service, Price Data Fetching, Order Execution
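A Python rendering of the three interfaces might look as follows. The class names and the in-memory tick store are illustrative assumptions; the historical-data backend shown is only one of the interchangeable implementations described in the text:

```python
# One implementation per running mode sits behind the same interface;
# this sketch shows a historical-data backend. The dict-based tick store
# stands in for the RDBMS, and orders always fill completely.

class TimeService:
    def __init__(self):
        self.now = None

    def start_time_service(self, initial_timestamp):
        self.now = initial_timestamp  # a real service advances on its own

    def stop_time_service(self):
        self.now = None

    def get_current_time(self):
        return self.now

    def advance(self, seconds=1):  # explicit driver for this sketch
        self.now += seconds

class HistoricalPriceFeed:
    def __init__(self, ticks):
        self.ticks = ticks  # {(timestamp, symbol): quote}

    def fetch_price_data(self, current_timestamp, symbol_name):
        # a real quote carries best bid, bid size, offer, offer size, etc.
        return self.ticks.get((current_timestamp, symbol_name))

class SimulatedOrderExecutor:
    def execute_order(self, current_timestamp, symbol_name, quantity,
                      buy_sell_flag, order_type="normal"):
        # a real executor reports full or partial fills
        return {"symbol": symbol_name, "filled": quantity, "status": "full"}

ts = TimeService()
ts.start_time_service(1_000)
feed = HistoricalPriceFeed({(1_000, "AAPL"): {"bid": 100.0, "offer": 100.1}})
quote = feed.fetch_price_data(ts.get_current_time(), "AAPL")
result = SimulatedOrderExecutor().execute_order(
    ts.get_current_time(), "AAPL", 10, "buy")
```

Swapping `HistoricalPriceFeed` for an exchange-API adapter, without touching the strategy code, is exactly the configuration-driven mode switch the text describes.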
• Writing of Algorithm Trading Strategies

For easy simulation and running, algorithm trading strategies should comply with the API specification. The procedure of a typical trading strategy is as follows.
1. Start the time service.
2. Get the current time.
3. Fetch the price data; calculate indicators from the current and previously received price data.
4. Fetch the news data and perform analysis on it.
5. Fetch the web data and perform analysis on it.
6. Decide on trading signals using rules, fuzzy rules, time series algorithms, machine learning algorithms, etc.
7. Decide on the quantity of an order according to the preset risk management policy and cash management policy.
8. Initiate the order execution using the newly fetched current timestamp.
9. Receive the order execution result and record the information for later analysis.
10. Go back to step 2 until the current time goes beyond the testing period.
Figure 6. The Flow of a Trading Strategy
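The flow in Figure 6 can be sketched as a driver loop. Every data service and decision function here is a hypothetical stub; only the control flow mirrors the listed steps:

```python
# A driver loop following the ten steps of Figure 6. The price series,
# the signal rule, and the position-sizing function are illustrative
# stand-ins for the strategy's real logic and data interfaces.

def decide_signal(prices):
    # step 6 stand-in: "buy" when the price ticks up, otherwise hold
    return "buy" if len(prices) >= 2 and prices[-1] > prices[-2] else "hold"

def position_size(signal):
    return 10  # step 7 stand-in for risk/cash management policies

def run_strategy(price_series, end_time, step=1):
    """price_series: {timestamp: price}, a stand-in for the price interface."""
    now = 0                                     # step 1: start time service
    prices, journal = [], []
    while now <= end_time:                      # step 10: loop until the end
        # step 2: current time is `now`; step 3: fetch price, update history
        if now in price_series:
            prices.append(price_series[now])
        # steps 4-5: news/web analysis omitted in this sketch
        signal = decide_signal(prices)          # step 6
        if signal == "buy":
            qty = position_size(signal)         # step 7
            # step 8: initiate order execution (always fills in this sketch)
            journal.append((now, signal, qty))  # step 9: record the result
        now += step
    return journal

orders = run_strategy({0: 100.0, 1: 100.5, 2: 100.2, 3: 100.9}, end_time=3)
print(orders)
```

Because the loop only ever talks to the interfaces, the same strategy body runs unmodified against the historical, real-time, and production backends.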
C. Some Optimization Techniques

After developing a new trading strategy, people are eager to find out whether it works, so back testing a trading strategy as quickly as possible is a necessity.

For fast access to the price data, sharding can be used: price data of different assets is separated and stored in different databases, so that accesses to them do not interfere with each other. Another technique is main memory databases. A main memory database can be set up in front of an RDBMS instance, acting as a cache for fast data retrieval; the most recently accessed data resides in memory and low latency is achieved. This boosts simulation performance, especially in highly concurrent situations. Caching and in-memory processing techniques for Hadoop [7] can likewise accelerate access to news data and web data. Because news data and web data arrive continuously, processing them for trading decisions and storing them into the Hadoop cluster in a timely manner is challenging; streaming techniques for Hadoop [6] can be used in collaboration with the Hadoop platform to process the data.

Figure 7. Sharding and Caching of Price Data

The data platform should also support trading strategy contests. Contenders write algorithms and submit them to the simulation system; the programs are compiled if necessary and run in a container that provides the needed data access APIs, and the simulation results are written back into an RDBMS for later ranking. When several trading strategies run against the same data set, they can share the scanning and caching of the data using a data-driven scheduling method.

After the data platform is built, individual investors will be interested in using its data to verify their own trading strategies. These people prefer to run the trading strategies on their desktops, so exposing data access APIs through the web is enough; in this case, data is sold like a product.

D. Continuous Evaluation of Production Running Trading Strategies

After trading strategies are put into production running, people worry about how they perform from then on. The data platform records a journal for every trading strategy in production running mode. Performance metrics should be computed from the journal data as quickly as possible so that people can monitor the trading strategies; when something goes wrong, one can adjust the parameters of a trading strategy or stop it. Some practitioners have adopted a real-time parameter adjustment approach to make trading strategies adapt to market situations continuously. Continuous monitoring and adjustment of a trading strategy requires a high performance computing facility as well as the efficient stream data processing techniques mentioned before: the data platform should immediately process price data and news/web data, perform an immediate simulation of the parameter adjustment, and then adjust the parameters according to the simulation results.

E. The Whole Architecture Diagram

The whole architecture is depicted in Figure 8. The containers for trading strategy simulation and production running could be implemented using the J2EE [8] technique or the .NET technique. Service components running in the containers provide access to the different data sources through an identical set of interfaces.

Figure 8. The Whole Architecture
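The symbol-based sharding of price data that appears in the architecture (and in Section III-C) can be sketched as a simple routing layer. The shard names, the hash-based placement, and the dict-backed stores are illustrative assumptions; a real deployment would route to separate database instances:

```python
# Route price-data accesses to per-asset shards so that queries on
# different assets do not interfere. CRC32-based placement is an
# illustrative choice of stable hash, not the platform's actual scheme.

import zlib

class ShardedPriceStore:
    def __init__(self, shard_names):
        self.names = list(shard_names)
        self.shards = {name: {} for name in self.names}  # stand-in stores

    def shard_for(self, symbol):
        # stable hash: a symbol always lands on the same shard
        idx = zlib.crc32(symbol.encode()) % len(self.names)
        return self.names[idx]

    def insert(self, symbol, timestamp, quote):
        self.shards[self.shard_for(symbol)][(symbol, timestamp)] = quote

    def fetch(self, symbol, timestamp):
        return self.shards[self.shard_for(symbol)].get((symbol, timestamp))

store = ShardedPriceStore(["shard0", "shard1", "shard2"])
store.insert("AAPL", 1, {"bid": 100.0})
store.insert("MSFT", 1, {"bid": 30.0})
quote = store.fetch("AAPL", 1)
```

A main memory database in front of each shard, as described in Section III-C, would then serve the hottest symbols from RAM.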
Trading strategies can be migrated seamlessly from the simulation container for historical data, to the simulation container for real-time data, and then to the production running container, because they use the same set of data access interfaces. Internal users of the data platform submit trading strategies into the containers. Outside users access the data services through web interfaces; their requests are routed to in-container data service components by adapter components. The container for production running of trading strategies offers almost the same services as the one for simulation on real-time data, except that trading actions are routed to exchanges rather than to a virtual account. In practice, the containers for simulation on historical data and real-time data can be deployed on one physical machine, but production running of trading strategies should use dedicated servers.

IV. CONCLUSIONS
The next generation of algorithm trading will be big data driven. The authors present various data management techniques for a big data processing platform for algorithm trading strategy evaluation. The data flows of the simulation and production running of trading strategies are analyzed, and some optimization techniques for fast data access are discussed.
REFERENCES

[1] Wikipedia, "Algorithmic trading", http://en.wikipedia.org/wiki/Algorithmic_trading, 2012.
[2] S. Deng, T. Mitsubuchi, K. Shioda, T. Shimada, A. Sakurai, "Combining Technical Analysis with Sentiment Analysis for Stock Price Prediction", 9th International Conference on Dependable, Autonomic and Secure Computing, pp. 800-807, 2011.
[3] A. E. Cha, "'Big data' from social media, elsewhere online redefines trend-watching", http://www.washingtonpost.com/business/economy/big-data-from-social-media-elsewhere-online-take-trend-watching-to-new level/2012/06/06/gJQArWWpJV_story.html, 2012.
[4] Interactive Data White Paper, "Big Data: Challenges and Opportunities", 2011.
[5] K. H. Lee, Y. J. Lee, H. Choi, Y. D. Chung, B. Moon, "Parallel data processing with MapReduce: a survey", SIGMOD Record, 40(4), pp. 11-20, 2011.
[6] N. Marz, "Real-time analytics with Storm and Hadoop", Hadoop Summit, 2012.
[7] C. Engle, A. Lupher, R. Xin, M. Zaharia, M. J. Franklin, S. Shenker, I. Stoica, "Shark: Fast Data Analysis Using Coarse-grained Distributed Memory", SIGMOD, pp. 689-692, 2012.
[8] P. J. Perrone, V. S. R. K. R. Chaganti, T. Schwenk, "J2EE Developer's Handbook", Sams Publishing, 2004.