Unexpected Challenges in Large Scale Machine Learning

Charles Parker
BigML, Inc.
2851 NW 9th St., Ste. D
Corvallis, OR 97330

[email protected]

ABSTRACT
In machine learning, scale adds complexity. The most obvious consequence of scale is that data takes longer to process. At certain points, however, scale makes trivial operations costly, thus forcing us to re-evaluate algorithms in light of the complexity of those operations. Here, we will discuss one important way a general large scale machine learning setting may differ from the standard supervised classification setting and show the results of some preliminary experiments highlighting this difference. The results suggest that there is potential for significant improvement beyond obvious solutions.

Categories and Subject Descriptors
I.5 [Pattern Recognition]: Implementation

Keywords
Big data, non-stationary analysis, concept drift, supervised learning

1. INTRODUCTION

The field of large scale machine learning has grown in recent years and is beginning to reach a state of relative maturity. At the same time, cloud computing and pervasive data collection have enabled small numbers of people to collect huge volumes of data. As large scale machine learning makes its way “into the wild”, it will begin to fill the gap between these volumes of data and the human understanding thereof.

The road from ignorance to understanding of large data is bound to be a rocky one. Part of this rockiness will result from the disconnect between academic researchers working on large scale algorithms and practitioners in the field trying to deploy these solutions. As data grows larger, the architectures constructed to store and transport this data increasingly become a factor in an algorithm’s performance, and these architectures are typically not the ones used by researchers.


Many things that are possible when the data is stored in a single file on disk become difficult when the data is stored on many machines scattered across the country. This leads to the invalidation of many otherwise promising algorithms; one or more steps in an algorithm may be intractable given these new concerns. We might call this phenomenon the curse of modularity: many learning algorithms are developed using strategies and building blocks that are proven to work with in-memory data. Unfortunately, this means that when one of those building blocks breaks at large scale, an entire class of algorithms breaks with it.

However, this can be an important source of opportunity for the machine learning research community. When there are important settings under which many algorithms fail, techniques need to be created that address these settings. We will here discuss such a setting and posit a few solutions. Our results suggest that improvement beyond the most obvious candidate solutions is readily achievable.

2. MAJOR TRENDS IN BIG LEARNING

As the amount of work in large scale machine learning has increased significantly in recent years, a number of subfields have attracted much of the research attention. Here we mention those fields and give a sample of some of the major recent developments.

2.1 Speed and Efficiency

There have been impressive developments in increasing the speed and efficiency of learning algorithms in recent years, sometimes by orders of magnitude over previous algorithms on similar hardware. Some of the notable ones include PEGASOS [25], Hastie’s work on regularization paths [11], and Langford’s Vowpal Wabbit [18]. In addition to these marquee contributions, there are a number of similar pieces of work at big learning workshops each year [31, 27].

2.2 Parallelism

Many machine learning algorithms are trivially parallel [8, 3], and many more are non-trivially so. There is thus a cottage industry in showing how to parallelize machine learning algorithms [12, 14] or finding performance guarantees on various parallelized algorithms [16, 10]. Another subgroup of researchers has concentrated on creating tools that make it easier to deploy machine learning on clusters of machines using MapReduce and/or Hadoop [13, 1]. Often these tools are tightly integrated with an existing programming language [22] or are languages in their own right [29]. Many of the tools utilize GPUs [17, 26] for greater speed and further parallelism.

2.3 Streaming Data

In large scale data manipulation, we lose the advantage of being able to hold the entire dataset in memory. An important quality of a large scale algorithm, then, is that it must be memory-bounded and able to process the data as a stream. As with parallelism, many machine learning algorithms are trivially streaming algorithms. Some algorithms, such as decision trees, are not streaming algorithms in their standard incarnations, and there has been recent work in creating data structures [15, 4] that allow these algorithms to work in a streaming fashion.

Finally, we will mention the field of online learning [5]. While online learning is not directly related to large scale learning, algorithms from an online learning context often work very well at scale. A large part of the reason for this is that the algorithms tend to be memory-bounded, trivially parallelizable, and very fast, and to have strong theoretical guarantees [2]. Important examples are the venerable Winnow [19] and Weighted Majority [20] algorithms.
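To make this concrete, the following is a minimal sketch of the Winnow update rule for binary features, in the spirit of [19]; the promotion factor and threshold are the usual textbook defaults rather than values taken from any of the systems cited above.

def winnow(stream, n_features, alpha=2.0):
    """Online Winnow for binary features. `stream` yields (active, y) pairs,
    where `active` is the set of feature indices equal to 1 and y is 0 or 1.
    Memory use is O(n_features) no matter how long the stream is."""
    w = [1.0] * n_features          # one weight per feature, nothing else retained
    threshold = float(n_features)   # standard textbook threshold
    for active, y in stream:
        y_hat = 1 if sum(w[i] for i in active) >= threshold else 0
        if y_hat == 0 and y == 1:   # false negative: promote active weights
            for i in active:
                w[i] *= alpha
        elif y_hat == 1 and y == 0: # false positive: demote active weights
            for i in active:
                w[i] /= alpha
    return w

Since each example is examined once and then discarded, the same loop applies whether the stream comes from a local file or from data arriving over the network.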

3. SOME UNANSWERED QUESTIONS

Having briefly surveyed some of the popular threads of work in the machine learning community, we will now turn to some of the less-examined questions in the area that are of practical importance. In particular, we will point out how certain assumptions in traditional algorithms are violated and the implications of these violations.

3.1 Mismatched Assumptions

In many of the algorithms above, the i.i.d. assumption is relied on for quick and reasonable convergence of the algorithm, as well as for many proven error bounds. We contend that it is inappropriate to rely heavily on this assumption, for three possible reasons:

1. Data may arrive in a learning system in any order. Often, this order is non-random and the data may be sorted by any of the columns, including the objective. The usual cure for this ailment is to randomize the data before learning. In cases of terabyte-scale data, this may be impractical to the point of infeasibility. At the very least, it requires a pass over the data, where the compute resources to make such a pass might be measured in the hundreds of dollars.

2. Even if we are able to randomize data at large scales, the data may not yet have arrived. In the case of terabyte-scale data, bandwidth may only allow tens of gigabytes per day to be transferred. In other cases, data may simply be collected and uploaded daily. In either case, it may take significant or infinite time to acquire “all” of the data, and as such true randomization may be impossible.

3. Finally, the objective may be non-stationary. If data is arriving over time as in the case above, there is little way to know a priori whether that data is drawn from the same distribution as previously seen data.

Note that the above points all stem from a single cause: as data grows larger, seemingly simple activities can dominate the cost of execution. In this case, reading and shuffling the data impacts runtime and responsiveness significantly enough to drive algorithmic decisions (see, e.g., [7]). Because of this, we believe that the construction of algorithms that operate on non-i.i.d. data is an important imperative for the large-scale machine learning community.
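With respect to the first two points, when a uniform sample (rather than a full shuffle) is sufficient, reservoir sampling is a standard single-pass workaround; the sketch below is a generic illustration rather than part of the system evaluated later.

import random

def reservoir_sample(stream, k, rng=random.Random(0)):
    """Keep a uniform random sample of size k from a stream of unknown,
    possibly unbounded, length using one pass and O(k) memory."""
    reservoir = []
    for t, item in enumerate(stream):
        if t < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, t)    # uniform over [0, t], inclusive
            if j < k:
                reservoir[j] = item  # item survives with probability k / (t + 1)
    return reservoir

The sample remains uniform over everything seen so far even if the stream never terminates, which addresses the second point; it does nothing, however, for a non-stationary objective.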

3.2 Related Work

While the large-scale machine learning community has been somewhat quiet on non-i.i.d. analysis, the machine learning community in general has done a fair amount of work in this area. A number of papers have been published in recent years, and there has even been a recent workshop. Several of these papers [23, 21] prove that there are performance guarantees on existing algorithms as long as certain mixing conditions [6] are met. The flavor of many of these mixing conditions is essentially that data encountered far enough apart in time are effectively independent.

The work above provides interesting theory around types of time dependence other than the standard Markov assumptions. However, we note that the assumptions implicit in strong mixing conditions do not necessarily match the nature of common user data. For example, many sources of data have periodicity at one or more time scales (hourly, yearly, etc.). Data from previous iterations may thus be relevant to classification in a given iteration if the data is from the same point in the cycle.

Another notable line of work is concept drift detection [9, 30, 28], whereby we can detect when a classifier may perform poorly due to a possible change in the objective function. Ideas from this area will likely be relevant to algorithm development in the learning setting we describe below.
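To give a flavor of this line of work, the following is a minimal sketch of an error-rate-based drift check, loosely in the spirit of detectors such as [9, 30]; the window size and threshold are illustrative choices, not values taken from those papers.

from collections import deque
import math

class DriftMonitor:
    """Flag a possible change in the objective when the error rate over a
    recent window drifts several standard errors above the long-run rate.
    Window size and threshold are illustrative, not tuned."""

    def __init__(self, window=500, z_threshold=3.0):
        self.recent = deque(maxlen=window)
        self.errors = 0
        self.seen = 0
        self.z_threshold = z_threshold

    def update(self, was_error):
        """Feed one prediction outcome; return True if drift is suspected."""
        self.recent.append(1 if was_error else 0)
        self.errors += 1 if was_error else 0
        self.seen += 1
        if self.seen < 2 * self.recent.maxlen:
            return False                              # not enough history yet
        p_long = self.errors / self.seen
        p_recent = sum(self.recent) / len(self.recent)
        se = math.sqrt(p_long * (1.0 - p_long) / len(self.recent)) or 1e-12
        return (p_recent - p_long) / se > self.z_threshold

A positive signal from such a monitor is one natural trigger for the kind of selective retraining explored below.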

3.3 Problem Statement

To formalize some of the above, suppose we have a sequence X of n vectors x.

If the current model’s performance on its own training sample exceeds its performance on the newly arrived block, that is, if m_c(S_M) > m_c(S_n), then we create a new sample drawn from S_M and S_n according to the difference in performance. In this way, we adapt retraining to empirical performance on the data. Key principles at work here are parsimony, in that the model is only retrained and/or augmented with new data if the current model does not explain the data adequately, and flexibility, in that the size of the change to the model can be very small or large depending on its current performance. In the next section, we show that these simple rules are sufficient to improve overall performance under a number of different assumptions about the stationarity of the data, when compared to the simpler approaches described above.
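A minimal sketch of this retraining rule follows, with the sampling step simplified; the proportions used to mix S_M and S_n are an illustrative choice here, and the train and performance functions are placeholders for whatever learner and measure are in use.

import random

def maybe_retrain(model, S_M, S_n, train, performance, rng=random.Random(0)):
    """Retrain only when the current model scores worse on the new block S_n
    than on its own sample S_M; mix in new data in proportion to the gap.
    `train` builds a model from a list of examples and `performance` returns
    a score such as R^2. The mixing proportion is an illustrative choice."""
    gap = performance(model, S_M) - performance(model, S_n)
    if gap <= 0:
        return model, S_M                     # model still explains the data: do nothing
    frac_new = min(1.0, gap)                  # larger drop in performance -> more new data
    n_new = int(frac_new * len(S_M))
    mixed = rng.sample(S_M, len(S_M) - n_new) + rng.sample(S_n, min(n_new, len(S_n)))
    return train(mixed), mixed

Because the rule declines to retrain whenever the gap is non-positive, it skips work when the model is still adequate, which is the source of the compute savings noted in the results.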

4.3 Results

In the following experiments, we train the system for three simulated years, measuring the performance of the system after retraining at the end of each week. We measure the performance of the classifier on two sets of data per week: the following week’s worth of data, and a set of points sampled randomly from throughout the year. In the cases where the objective is non-stationary, the samples are taken from a hypothetical year in which the objective remains constant and is the same as the objective for the week being measured.

We perform three experiments. In the first, the objective is stationary; obviously, each algorithm must wait an entire year to have access to the entire input space, but the mapping from input to output remains constant. In the second, the output increases by 2% per week relative to the first week. In the third, we remove seasonality effects from the traffic, and there is instead a random −50% to +50% change in traffic every six weeks relative to the first week.

In Figures 2, 3, and 4 we see the results of these experiments. The lighter red bars indicate the average weekly performance of the algorithms measured on the following week of data, and the darker blue bars indicate average weekly performance measured on a sample taken from throughout the year. Performance is given in terms of R².
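For concreteness, the shape of this weekly evaluation loop is sketched below. Only the R² computation is fully specified; the retraining scheme, the weekly blocks, and the year-wide sampler are hypothetical stand-ins for the simulation, whose code is not reproduced here.

def r_squared(y_true, y_pred):
    """Coefficient of determination, the performance measure reported in the figures."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true) or 1e-12
    return 1.0 - ss_res / ss_tot

def weekly_evaluation(scheme, weekly_blocks, year_sample, n_weeks=3 * 52):
    """Schematic protocol: after each week's (possible) retraining, score the
    model on the next week's block and on a sample drawn from throughout the
    year. `scheme`, `weekly_blocks`, and `year_sample` are hypothetical
    stand-ins for the simulation described in the text."""
    next_block, overall = [], []
    for week in range(n_weeks - 1):
        model = scheme.update(*weekly_blocks[week])   # retraining policy decides what to do
        nb_X, nb_y = weekly_blocks[week + 1]
        yr_X, yr_y = year_sample(week)                # consistent with this week's objective
        next_block.append(r_squared(nb_y, model.predict(nb_X)))
        overall.append(r_squared(yr_y, model.predict(yr_X)))
    return next_block, overall

The bars in the figures correspond to averages of these two score sequences for each retraining scheme.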


Figure 2: Performance of all retraining schemes under a stationary objective.

In Figure 2, we see that the adaptive sampling algorithm has performance similar to the standard uniform sampling algorithm for a stationary objective, with slightly better performance on the data from the next week and slightly worse on the data from throughout the year. This is to be expected, as the adaptive resampling encourages taking samples from new data when it thinks performance can be improved. The other algorithms, which learn only from recent subsets of the data, are able to increase performance slightly on next week’s data at the expense of significant performance losses on data from the rest of the year.


Figure 3: Performance of retraining schemes under a constantly increasing objective.

While the uniform sampling approach obviously succeeds when the objective is stationary, Figure 3 shows us that it fails for non-stationary objectives. For both next week’s data and data from throughout the year, the adaptive resampling algorithm outperforms uniform sampling. The strictly local algorithms outperform adaptive sampling slightly for next week’s data, but adaptive resampling has an edge for the full-year predictions due to its better ability to preserve history when it is relevant.

Finally, Figure 4 shows what happens when the objective maintains long periods of stationarity, punctuated by dramatic changes. As we can see, both the uniform sample and the four-week window are not able to move quickly enough to accommodate the new objective, and their performance suffers. Only adaptive resampling and training on the most recent data are able to demonstrate the necessary flexibility.

As a final anecdotal note: adaptive resampling is able to “reject” the option to retrain about 60% of the time during these tests, whereas the other types of classifiers shown require retraining on every iteration. As such, we not only get a classifier that is robust to several different types of non-stationarity, but one that also uses fewer compute resources than the other approaches mentioned here.


Figure 4: Performance of retraining schemes under an objective that changes dramatically and randomly.

5. CONCLUSION

Obviously, these results are far from conclusive. We have used fairly unconvincing straw men in only a single domain. Yet the problems seem general enough to encompass a fair variety of common concerns, and the results tantalizingly suggest that if we are clever about how and when new classifiers are trained, we can have a system that is fairly robust even when our objective is not entirely stable. To summarize, we have tried to draw attention to the following issues:

1. In large-scale learning problems, there are several ways in which the i.i.d. assumption can be violated. An interesting set of problems are those that are block-wise stationary but about which little else can be assumed. This is a space that is somewhat lacking in concrete algorithms that perform demonstrably better than their static counterparts.

2. The costs of large-scale learning are increasingly different from the costs of small-scale learning. At large scales, the costs of learning arise primarily from bandwidth and disk reads. This means it is difficult to get a subset of the data into memory, but once it is there, the actual processing is relatively cheap. Thus, there is an important algorithmic middle ground where small batches of data can be read into memory and processed significantly, and this can be done iteratively. In particular, learning a classifier can become a primitive operation within such algorithmic contexts, especially as part of in situ measurements of performance that influence the algorithmic progression.

3. Parsimony is an important principle in large scale algorithmic development. This is for two reasons: first, because processing data is, now more than ever, measurably expensive in a monetary sense; and second, because the range of algorithms that are practical becomes smaller as data size increases. An increasingly important aspect of large scale learning algorithms is thus to “right-size” the data with regard to the objective being learned.

We hope that researchers will consider some of these intuitions when building large scale algorithms in the future.

6. BIOGRAPHY

Dr. Charles Parker received his Ph.D. in Computer Science in 2007 under Professor Prasad Tadepalli at Oregon State University. His thesis, “Structured Gradient Boosting”, presented a gradient-based approach to structured prediction useful in information retrieval and planning domains. From 2007 to 2011, he worked for the Eastman Kodak Company on various problems in data mining for machine reliability, scanned document analysis, and consumer video indexing, and was promoted to the rank of Research Associate. He currently works for BigML, Inc., helping to develop a web-scale infrastructure and interface for machine learning. His work has appeared in ICML, AAAI, ICDM, and other notable venues.

7. REFERENCES

[1] A. Gesmundo and N. Tomeh. HadoopPerceptron: a toolkit for distributed perceptron training and prediction with MapReduce. In Proceedings of the 2012 Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, 2012.
[2] P. L. Bartlett. Optimal online prediction in adversarial environments. In Proceedings of the 13th International Conference on Discovery Science, DS '10, pages 371–371, 2010.
[3] R. Bekkerman, M. Bilenko, and J. Langford. Scaling up machine learning: parallel and distributed approaches. In Proceedings of the 17th ACM SIGKDD International Conference Tutorials, KDD '11 Tutorials, 2011.
[4] Y. Ben-Haim and E. Tom-Tov. A streaming parallel decision tree algorithm. Journal of Machine Learning Research, 11:849–872, Mar. 2010.
[5] A. Blum. On-line algorithms in machine learning. In Proceedings of the Workshop on On-Line Algorithms, pages 306–325, 1996.
[6] R. Bradley. Basic properties of strong mixing conditions, pages 165–192. 1986.
[7] C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 281–288. MIT Press, Cambridge, MA, 2007.
[8] P. Domingos and G. Hulten. A general method for scaling up machine learning algorithms and its application to clustering. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 106–113. Morgan Kaufmann, 2001.
[9] A. Dries and U. Rückert. Adaptive concept drift detection. Statistical Analysis and Data Mining, 2(5–6):311–327, Dec. 2009.
[10] J. Duchi, M. Wainwright, and P. Bartlett. Randomized smoothing for (parallel) stochastic optimization. In NIPS 2011 Workshop on Parallel and Large-Scale Machine Learning (BigLearn), December 2011.
[11] J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, Feb. 2010.
[12] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '11, pages 69–77, 2011.
[13] A. Ghoting, P. Kambadur, E. Pednault, and R. Kannan. NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on MapReduce. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '11, pages 334–342, 2011.
[14] J. E. Gonzalez, Y. Low, and C. Guestrin. Residual splash for optimally parallelizing belief propagation. In Artificial Intelligence and Statistics (AISTATS), pages 177–184, 2009.
[15] S. Guha, N. Koudas, and K. Shim. Approximation and streaming algorithms for histogram construction problems. ACM Transactions on Database Systems, 31(1):396–438, 2006.
[16] D. Hsu, N. Karampatziakis, J. Langford, and A. J. Smola. Parallel online learning. CoRR, abs/1103.4204, 2011.
[17] N. S. L. P. Kumar, S. Satoor, and I. Buck. Fast parallel expectation maximization for Gaussian mixture models on GPUs using CUDA. In HPCC, pages 103–109, 2009.
[18] P. Li, A. Shrivastava, and A. C. König. Training logistic regression and SVM on 200GB data using b-bit minwise hashing and comparisons with Vowpal Wabbit (VW). CoRR, abs/1108.3072, 2011.
[19] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, Apr. 1988.
[20] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, Feb. 1994.
[21] R. Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5–34, Apr. 2000.
[22] H. Miller, P. Haller, and M. Odersky. Tools and frameworks for big learning in Scala: Leveraging the language for high productivity and performance. In NIPS 2011 Workshop on Parallel and Large-Scale Machine Learning (BigLearn), 2011.
[23] M. Mohri and A. Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research (JMLR), 11:798–814, 2010.
[24] N. Reyhani and P. Bickel. Nonparametric ICA for nonstationary instantaneous mixtures. In ECML 2009 Workshop on Learning from non-IID data: Theory, Algorithms and Practice, September 2009.
[25] S. Shalev-Shwartz, Y. Singer, and N. Srebro. PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 807–814, 2007.
[26] D. Steinkraus, P. Y. Simard, and I. Buck. Using GPUs for machine learning algorithms. In Proceedings of the Eighth International Conference on Document Analysis and Recognition, ICDAR '05, pages 1115–1119, 2005.

[27] A. Subramanya and J. Bilmes. Large-scale graph-based transductive inference. In NIPS Workshop on Large-Scale Machine Learning: Parallelism and Massive Datasets, December 2009. [28] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’03, pages 226–235, 2003. [29] M. Weimer, T. Condie, and R. Ramakrishnan. Machine learning in ScalOps, a higher order cloud computing language. In NIPS 2011 Workshop on Parallel and Large-Scale Machine Learning (BigLearn), December 2011. [30] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69–101, Apr. 1996. [31] M. Zinkevich. Theoretical analysis of a warm start technique. In NIPS 2011 Workshop on Parallel and Large-Scale Machine Learning (BigLearn), December 2011.
