Feb 28, 2016 - under test-time budget constraints. We propose a novel approach applicable to a wide range of structured prediction problems in computer vi-.
Structured Prediction with Test-time Budget Constraints
arXiv:1602.08761v1 [stat.ML] 28 Feb 2016
Tolga Bolukbasi1 Kai-Wei Chang2 Joseph Wang1 Venkatesh Saligrama1,2 1 Boston University, 8 Saint Mary’s Street, Boston, MA 2 Microsoft Research New England, 1 Memorial Drive, Cambridge, MA
Abstract We study the problem of structured prediction under test-time budget constraints. We propose a novel approach applicable to a wide range of structured prediction problems in computer vision and natural language processing. Our approach seeks to adaptively generate computationally costly features during test-time in order to reduce the computational cost of prediction while maintaining prediction performance. We show that training the adaptive feature generation system can be reduced to a series of structured learning problems, resulting in efficient training using existing structured learning algorithms. This framework provides theoretical justification for several existing heuristic approaches found in literature. We evaluate our proposed adaptive system on two real-world structured prediction tasks, optical character recognition (OCR) and dependency parsing. For OCR our method cuts the feature acquisition time by half coming within a 1% margin of top accuracy. For dependency parsing we realize an overall runtime gain of 20% without significant loss in performance.
1. Introduction Structured prediction is a powerful and flexible framework for making a joint prediction over mutually dependent output variables that has recently been applied successfully to a wide range of computer vision and natural language processing tasks ranging from text classification to human detection. However, the superior performance and flexibility of structured predictors come at the cost of computational complexity. The high computational cost leads to a tradeoff between the expressiveness and speed of these models. The cost of inference in structured prediction can be broken down into three parts: acquiring the features, evaluating the part responses and solving a combinatorial optimization problem to make a prediction based on part responses. Past research has focused on evaluating part responses and solving the combinatorial optimization problem, and proposed
TOLGAB @ BU . EDU KW @ KWCHANG . NET JOEWANG @ BU . EDU SRV @ BU . EDU
efficient inference algorithms for specific (e.g., Viterbi and CKY parsing algorithms) and general structured (e.g., variational inference in graphical models (Jordan et al., 1999)). However, these methods overlook the first two parts, which are bottlenecks particularly when underlying structure is relative simple or is solved efficiently. Consider the dependency parsing task as an example, where the goal is to create a directed tree, which describes semantic relations between words in a sentence. One can formulate it using a structured prediction model, where the inference problem concerns finding the maximum spanning trees (MSTs) in a directed graphs (McDonald et al., 2005). Each node in the graph represents a word, and the directed edge (xi , xj ) represents how likely xj depends on xi . The edge score can be parametrized by a weight vector with features, where the weights are learned from the training data. Fig. 1 shows an example of dependency parse and the trade off between a rich set of features and the prediction time. Empirically, we find that introducing complex features has the potential to increase system performance, however, is only required to distinguish among a small subset of difficult parts. As a result, computing complex features for all parts of all examples is not only computationally costly but also unnecessary to achieve high levels of performance. We address the problem of structured prediction under testtime budget constraints, where the goal is to learn a system that is computationally efficient during test-time with little loss of predictive performance. We consider test-time costs associated with the computational cost of evaluating feature transforms or acquiring new sensor measurements. Intuitively, our goal is to learn a system which identifies the parts in each example which are incorrectly classified/localized using “cheap” features and additionally yield large reductions in error over the entire structure given “expensive” features due to improved distinguishability and relationships to other parts in the example. We consider two forms of the budgeted structured learning problem that have been examined for the supervised learning problem, prediction under expected budget constraints and anytime prediction. For both cases, we con-
Structured Prediction with Test-time Budget Constraints
Root
I
saw
a
friend
Features POS POS+Lexicon Adaptive (proposed) Rand (baseline)
today
test-time 9.4 15.5 12.3 12.3
Performance 87.4 90.0 89.7 88.6
Figure 1. The figure on the left-hand side illustrates an example of dependency tree. The table on the right-hand side shows the trade-off between test-time complexity and the test performance. When predicting the dependency tree, some dependencies (e.g., the dashed edges) are easy to be resolved, and there is no need to use expressive features for them to make the prediction. The proposed algorithm is able to achieve a good performance under test-time budget constraints. Details can be found in the experiment section.
sider streaming test-time scenario, namely, a system that operates on each test example without observation or interaction with other test examples. In the expected budget constraint setting, the system chooses a set of features to acquire for each example, with a goal of minimizing prediction error subject to an average feature acquisition cost constraint. A fixed budget is given by the user during training time, and during test-time, for each example encountered, the system is tasked with both allocating resources to the example as well as determining the features to be acquired with the allocated resources. In the anytime structured prediction setting, the system chooses features to be acquired sequentially for each example such that the prediction error is minimized at each time step, allowing for accurate prediction at anytime. No budget is specified by the user, with the system instead tasked with sequentially choosing features to minimize prediction error at any time as features are acquired. In contrast to the expected budget constraint setting, the anytime setting requires a single system capable of making strong predictions for any budget constraint on each example. In both of these settings, we learn systems capable of adaptive feature acquisition. We propose to learn policy functions capable of exploiting relationships between parts and adapting to varying length examples, naturally leading to a reduction of policy learning to a structured learning problem. The resulting systems reduce prediction cost during test-time with minimal loss of predictive performance. We summarize our contributions as follows: • General formulation of structured prediction under expected budget constraints and anytime prediction. • Reduction of both these settings to conventional structured prediction problems. • Demonstrate significant increase in speed with little loss in accuracy for OCR and dependency parsing.
2. Budgeted Structured Learning In this section, we present the budgeted structured learning problem. We first provide an introduction of notation and brief overview of structured prediction. Following this, we formulate the problem of structured prediction under an expected budget constraint and present a reduction of this problem. We then extend this formulation to the problem of anytime structured prediction.
2.1. Structured Prediction Structured prediction is a form of supervised learning, with the goal of learning a function, F , that maps from an input space, X , to a structure space, Y. In contrast to traditional supervised learning problems, the space of outputs Y is not simply categorical as in classification or real-valued as in regression, but instead is assumed to be some exponential space of outputs (often of varying size dependent on the feature space) containing some underlying structure, generally represented by multiple parts and relationships between parts. For example, in the dependency parsing task, x ∈ X are features representing a sentence (such as words, position tags, etc.), and y ∈ Y is a parse tree. In a structured prediction model, the mapping function F is often modeled as F ≡ maxy∈Y Ψ(x, y), where Ψ : (X, Y ) → R is a scoring function. We assume the score can be broken P up into sub-scores across components C, Ψ(x, y) = c∈C ψc (x, yc ), where yc is the output assignment associated with the component c.1 For the dependency parsing example, each c is an edge in the directed graphs, and yc is an indicator variable for whether the edge is in the parse tree. The score of a parse tree consists of the scores ψc (x, yc ) of all its edges. Usually, the same class of models with the same set of features is used to formulate ψc (x, yc ), ∀c. When using an expressive model, acquiring complex features or searching for ψc (x, yc ) often results in algorithms that either are unwieldy during test-time or exclusion of powerful features/models, resulting in poor performance. In the next section, rather than choose between using computationally costly features and reduced performance, we instead propose an adaptive feature acquisition framework, where different sets of features are acquired adaptively for each component of each new example during test-time. 2.2. Structured Prediction Under an Expected Budget Our goal is to reduce the cost of prediction during testtime (representing computational time, energy consumption, etc.). We consider the case where a variety of scoring functions are available to be used for each component. Additionally, associated with each scoring function is an evaluation cost (such as the time or energy consumption 1
The number of sub-components, |C|, varies across examples.
Structured Prediction with Test-time Budget Constraints
required to extract the features for the scoring function). For each example, we define a state S ∈ S, where the space of states is defined S = {0, 1}K×|C| , representing which of the K features is used for the |C| components during prediction. In the state, the element S(k, c) = 1 indicates that the k th feature will be used during prediction for component c. For any state S, we define the evaluation cost: XX c(S) = S(k, c)δk , c∈C k∈K
where δk is the (known) cost of evaluating the k th feature for a single part. We assume that we are given a structured prediction model F : X × S → Y that maps from a set of features X ∈ X and a state S ∈ S to a structured label prediction Yˆ ∈ Y. For a predicted label, we have a loss L : Y × Y → R that maps from a predicted and true structured label, Yˆ and Y , respectively, to an error cost, generally either an indicator error, k Y L(Yˆ , Y ) = 1 − 1Yˆ (i)=Y (i) , i=1
or a Hamming error, L(Yˆ , Y ) =
k X
1Yˆ (i)6=Y (i) .
i=1
For an example (X, Y ) and state S, we now define the modified loss
Note that the objective of the minimization can be expanded with respect to the space of states, allowing the optimization problem in (2) to be expressed π ∗ = argmin π∈Π
argmin π∈Π
n X X
˜ − C (Xi , Yi , S) max C(Xi , Yi , S)
i=1 S∈S
˜ S∈S
1π(Xi )6=S .
Proofs can be found in Suppl. Material. Theorem 2.1 maps the policy learning problem in (2) to a weighted structured learning problem. For each example Xi , an example/label pair is created for each state S with an importance weight representing the savings lost by not choosing the state S. Unfortunately, the expansion of the cost function over the space of states introduces the summation over the combinatorial space of states. To avoid this, we instead introduce an approximation to the objective in (2). Using a single indicator function, we formulate the approximate policy " n X π ˆ = argmin W (Xi , Yi )1π(Xi )6=S ∗ (Xi ,Yi ) π∈Π
i=1
# +C (Xi , Yi , S (Xi , Yi )) ,
We define a policy π : X → S that maps from the feature space X and the initial state S0 to a new state. For ease of reference, we refer to this policy as the feature selection policy. Our goal is to learn a policy π chosen from a family of functions Π that, for a given example X, maps to a state with minimal expected modified loss,
(3)
∗
that represents the error induced by predicting a label from X using the sensors in S combined with the cost of acquiring the sensors in S, where λ is a trade-off pattern adjusted according to the budget B 2 . A small value of λ encourages correct classification at the expense of feature cost, whereas a large value of λ penalizes use of costly features, enforcing a tighter budget constraint.
π = argmin ED [C (X, Y, π (X))] .
C (Xi , Yi , S) 1π(Xi )=S .
i=1 S∈S
From this expression, we can reformulate the problem of learning a policy as a structured learning problem. Theorem 2.1. The minimization in (2) is equivalent to the structured learning problem:
C(X, Y, S) = L (F (X, S), Y ) + λc(S)
∗
n X X
(4)
where the pseudo-label is defined S ∗ (Xi , Yi ) = argmin C (Xi , Yi , S)
(5)
S∈S
and the example weight is defined as W (Xi , Yi ) = max C (Xi , Yi , S 0 )−C (Xi , Yi , S ∗ (Xi , Yi )) . 0 S ∈S
This formulation reduces the objective from a summation over the combinatorial set of states to a single indicator function for each example and represents an upper-bound on the original risk.
(1)
Theorem 2.2. The objective in (4) is an upper-bound on the objective in (2).
(2)
Note that the second term in (4) is not dependent on π. Therefore, Theorem 2.2 leads to an efficient algorithm for learning a policy function π by solving the importanceweighted structured learning problem:
π∈Π
In practice, D denotes a set of I.I.D training examples: π ∗ = argmin π∈Π 2
n X
C (Xi , Yi , π (Xi )) .
i=1
Our framework does not restrict the type of modified loss, C(X, Y, S), or the state cost, C(S) and extends to general losses
π ˆ = argmin π∈Π
n X i=1
W (Xi , Yi )1π(Xi )6=S ∗ (Xi ,Yi ) ,
(6)
Structured Prediction with Test-time Budget Constraints
where each example Xi having a pseudo-label S ∗ (Xi , Yi ) and importance weight W (Xi , Yi ). Combinatorial Search Space: Finding the psuedo-label in Eqn. (5) involves searching over the combinatorially large search space of states, S, which is computationally intractable. Instead, we present two approaches for approximating the pseudo-label S ∗ : a trajectory-based pseudolabel and a parsimonious pseudo-label. Trajectory Search: The trajectory-based pseudo-label is a greedy approximation to the optimization in Eqn. (5). To this end, define Si+ as the 1-stage feasible transitions: Sit = {S | d(S, Sˆit−1 ) ≤ 1, S ∧ Sˆit−1 = Sˆit−1 } where d is the Hamming distance. We define a trajectory of states Sˆit where Sˆit = argminS∈Sit C(Xi , Yi , S). The initial state is assumed to be Sˆi0 = 0K×|C| where none of the K features are evaluated for the C components. For each example i, we obtain a trajectory Si0 , Si1 , . . . , SiT , where the terminal state SiT is the all-one state. We choose the optimal pseudo-label from the trajectory: Sˆi∗ =
argmin
C(Xi , Yi , S).
(7)
ˆ0 ,...,S ˆT } S∈{S i i
Note that by restricting the search space of states differing by a single component, the approximation needs to only perform a polynomial search over states as opposed to the exhaustive combinatorial search in Eqn. (5). Observe that the modified loss is not strictly decreasing, as the cost of adding features may outweigh the reduction in loss at any time. Empirically, this approach is computationally tractable and is shown to produce strong results. Parsimonious Search: Rather than a trajectory search, which requires an inference update as we acquire more features, we consider an alternative one stage update here. The idea is to look for 1-step transitions that can potentially improve the cost. We then simultaneously update all the features that produce improvement. This obviates the need for a trajectory search. In addition we can incorporate a guaranteed loss improvement for our parsimonious search. Si+
∈ argmin 1{C(Xi ,Yi ,S t−1 )≥C(Xi ,Yi ,S)+τ } . S∈Sit
i
Note that the potential candidate transitions can be nonunique and thus we generate a collection of potential state transitions, Si+ . To obtain the final state we take the union W over these transitions, namely, Sˆ∗ = S∈S + S. Suppose i we set the margin τ = 0, replace the cost-function with the loss function then this optimization is relatively simple (assuming that acquiring more features does not increase the loss). This is because the new state is simply the collection of transitions where the sub-components are incorrect.
Finding the parsinomious pseudo-label is computationally efficient and empirically shows similar performance to the trajectory-based pseudo-label. Choosing the pseudo-label requires knowledge of the budget B to set the cost trade-off parameter λ. In the case where the budget is unspecified or varies over time, a system capable of adapting to changing budget demands is necessary. To handle this scenario, we propose an anytime system in the next section. 2.3. Anytime Structured Prediction In many applications, the budget constraint is unknown a priori or varies from example to example due to changing resource availability. In these cases, training a system to satisfy an expected budget as in Section 2.2 does not yield a feasible system. To handle these cases, we consider the problem of learning an anytime system (Grubb & Bagnell, 2012). In this setting, a single system is designed such that when a new example arrives during test-time, features are acquired until an arbitrary budget constraint (that may vary over different examples) is met for the particular example. Note that an anytime system is a special case of the expected budget constrained system, where for a given budget, the system acquires features with equal cost across all examples to satisfy the constraint. Additionally, the same system is applied to all feasible budgets, as opposed to learning unique systems for each budget constraint. We model the anytime learning problem as sequential state selection. The goal is to select a trajectory of states, starting from an initial state S0 = 0k×|C| where all components use features with negligible cost. To select this trajectory of states, we define policy functions π1 , . . . , πT , where πt : X × S → S is a function that maps from a set of structured features X and current state S to a new state S 0 . The sequential selection system is then defined by the policy functions π1 , . . . , πT . For an example X, the policy functions produce a trajectory of states S 1 (X) . . . , S T (X) defined as follows: S t (X) = π t (X, S t−1 (X)), S 0 (X) = S 0 . Our goal is to learn a system with small expected loss at any time t ∈ [0, T ]. Formally, we define this as the average modified loss of the system over the trajectory of states: " T # X 1 ED C X, Y, S t (X) (8) π1∗ , ...,πT∗ = argmin π1 ,...,πT ∈Π T t=1 where Π is a user-specified family of functions. Unfortunately, the problem of learning the policy functions is highly coupled due to the dependence of the state trajectory on all policy functions. Note that as there is no fixed budget, the choice of λ dictates the behavior of the anytime system. Decreasing λ leads to larger increases in classifi-
Structured Prediction with Test-time Budget Constraints
cation performance at the expense of budget granularity. We propose a greedy approximation to the policy learning problem by sequentially learning policy functions π1 , . . . , πT that minimize the modified loss: πt = argminπ∈Π ED C X, Y, S t (X) (9) for t ∈ {1, . . . , T }. Note that the πt selected in (10) does not take into account the future effect on the loss in (8). We consider πt in (10) to be a greedy approximation as it is instead chosen to minimize the immediate loss at time t. We restrict the output space of states for the policy πt to have the same non-zero components as the previous state with a single feature added. This space of states can be defined S(S) = {S 0 |d(S 0 , S) = 1, S 0 ∧S = S}, where d is the Hamming distance. Note that this mirrors the trajectory used for the trajectory-based pseudo-label. As in Section 2.2, we take an empirical risk minimization approach to learning policies. To this end we sequentially learn a set of function π1 , . . . , πT minimizing the risk: πt = argmin π∈Π
n X
C Xi , Yi , S t (Xi ) .
(10)
i=1
The minimization can equivalently be expressed argmin π∈Π
n X
X C (Xi , Yi , s) 1π(Xi ,S t−1 (Xi ))=s
As in Thm. 2.1, the problem of learning the sequence of policy functions π1 , . . . , πT can be viewed as a weighted structured learning problem. Theorem 2.3. The optimization problem in (11) is equivalent to solving an importance weighted structured learning problem using an indicator risk of the form:
π∈Π
n X
X
W (Xi , Yi , s)1π(Xi ,S t−1 (Xi ))6=s ,
i=1 s∈S(S t−1 (Xi ))
(12) where the weight is defined W (Xi , Yi , s) = max C(Xi , Yi , s0 ) − C (Xi , Yi , s) . 0 t−1 s ∈S(S
Algorithm 1 Anytime Policy Learning input Training set, {Xi , Yi }i=1,...,n set Si0 = 0 ∀i = 1, ..., n, t = 1 while A(Si ) 6= ∅ for any i do Train πt according to Thm. 2.3 for i ∈ [n] do Update states: Sit = πt (Xi , Sit−1 ) t←t+1 return π = {π0 , . . . , πT }
Algorithm 2 Anytime Structured Prediction input Policy, π1 , ..., πT , Example, X, Budget, B set S 0 = 0, t = 0 while A(S t ) 6= ∅ and c(S t ) < B do S t+1 = πt (X, S t ), t ← t + 1 return y = F (X, S t )
(11)
i=1 s∈S(S t−1 (Xi ))
by enumerating over the space of states that the policy πt can map for each example. Note that the space of states S(S t−1 (Xi )) may be empty if all features have been acquired for example Xi by step t − 1.
argmin
Theorem 2.3 reduces the problem of learning a policy to an importance weighted structured learning problem. Replacement of the indicators with upper-bounding convex surrogate functions results in a convex minimization problem to learn the policies π1 , ..., πT . In particular, use of a hinge-loss surrogate converts this problem to the commonly used structural SVM. Experimental results show significant cost savings by applying this sequential policy.
(Xi ))
This is equivalent to an importance weighted structured learning problem, where each state s in S(S t−1 (Xi )) defines a pseudo-label for the example Xi with an associated importance maxs0 ∈S(S t−1 (Xi )) C(Xi , Yi , s0 ) − C (Xi , Yi , s) .
The training algorithm is presented in Algorithm 1. Starting with at time t = 0, the policy π1 is trained to minimize the immediate loss. Given this policy, the states of examples at time t = 1 are fixed, and π2 is trained to minimize the immediate loss given these states. The algorithm continues learning policies until every feature for every example as been acquired. During test-time, the system sequentially applies the trained policy functions until the specified budget is reaches, as shown in Algorithm 2.
3. Related Work While adaptive feature selection has received attention in the supervised learning community, there are few methods applicable to structured prediction with streaming test samples. Detection cascades (Viola & Jones, 2001; Zhang & Zhang, 2010; Chen et al., 2012) have been widely applied to supervised learning approaches, however the inflexibility in feature acquisition and restriction to multi-class case make their application to the structured prediction problem impractical. Weiss et al. (2013) incorporate test-time budgets in structured prediction through a classifier cascade, where the models are ordered in increasing complexity. Unfortunately, this approach only operates on the entire example, with complex features generated for none of the parts or all of the parts, and during test-time can only be run on batches of examples as opposed to streaming ex-
Structured Prediction with Test-time Budget Constraints
amples. Their policy also depends on confidence based meta-features which require running the structured predictor after every step, increasing the computational overhead. Wang et al. (2014a) solve the limitations of having a fixed model order by making use of a tree graph rather than a cascade. Although this framework allows for streaming decisions on test examples, as above the system is limited to choosing to generate features on an example by example basis as opposed to the part by part basis we propose. Reinforcement learning techniques have been applied to the budgeted supervised learning problem (Busa-Fekete et al., 2012; Karayev et al., 2013), however, they generally rely on heuristic measures of performance (such as margin in the supervised learning case). Adaptive classifiers have been proposed based on minimizing empirical risk subject to budget constraints (Chen et al., 2012; Xu et al., 2013; Kusner et al., 2014; Nan et al., 2015). Adaptive decision systems choosing between fixed classifiers have been proposed (Trapeznikov & Saligrama, 2013; Wang et al., 2014b; 2015), which allow for systems to be learned with arbitrary losses, however they are mostly restricted to cascades and limited to the multi-class settings. These empirical risk based approaches are not easily adapted to the adaptive part by part feature selection problem and is complementary to our work. Recently, research has focused on improving speed of individual algorithms such as object detection using deformable parts models (Felzenszwalb et al., 2010; Zhu et al., 2014) and dependency parsing (He et al., 2013). These methods are not designed for general structured learning problems, in particular, they do not generalize to varying graph size and structures and rely on problemspecific heuristics or algorithms. In fact, a few of these approaches are special cases of our framework. The dynamic feature selection approach in (He et al., 2013) is related to Eqn. (6) with a parsimonious search but using imitation learning. On the other hand, several approaches have been designed using adaptive features to improve test performance, including easy-first decoding strategy (Goldberg & Elhadad, 2010; Stoyanov & Eisner, 2012) and learning to search approaches (Chang et al., 2015a; Ross et al., 2011) for structured learning, however the goals of these methods differ from our proposed approach. Finally, there have been methods concentrated on the third part of inference( predicting given part responses) (Weiss & Taskar, 2010; Shi et al., 2015), and therefore, are complementary to our approach.
4. Experiments In this section, we demonstrate the effectiveness of the proposed algorithm on two real structured prediction tasks in different domains: dependency parsing and OCR. We re-
port the results on both anytime and expected case policies and refer to the latter one as one-shot policy. Methods of comparison. We compare our system to two baselines, a uniform policy and a myopic policy. The uniform policy proposed by Weiss et al. (2013) which generates random sets of features that are uniform over all positions. We extend this uniform baseline to anytime case by randomly choosing a position at each step rather than using the states predicted by the policy. As a second baseline, we adapt the myopic policy used by Trapeznikov et. al. (2013) to the structured prediction case. The myopic policy runs the structured predictor initially on all cheap features, then looks at the total confidence of the classifier normalized by the sample size (e.g. sentence length). If the confidence is below a threshold, it chooses to acquire expensive features for all positions. The threshold therefore controls the budget level. Note that this policy has to run the predictor twice for examples that acquire expensive features to produce final labels. We omit comparison to the feature cascade proposed by Weiss et al. (2013) and the LP tree proposed by Wang et al. (2014a) for these tasks. In the case of the feature cascade, feature acquisition can only be done on batches of test examples as opposed to streaming examples. This changes the problem setup significantly since it not clear when to stop acquiring features given a single example and the generalization is only on the ranking among test samples. Additionally, it is often unclear how to choose meta-features to characterize each example over the variety of structures (such as different graph sizes and structures). The metafeatures chosen by the authors depend on the confidence of the predictor, which as we show in Section 4.2, add significant computational overhead. The LP tree operates on streaming examples, however requires all examples to have identical structure and cannot vary in size. For example, in the dependency parsing problem, the LP tree only operates on sentences of a fixed length. The authors get around this by using confidence based meta-features and constraining the feature states to be uniform over all positions. However, this requires running the predictor once after every feature acquisition which is not affordable for our tasks (running inference once in dependency parsing costs 75% of the difference between expensive and cheap features). 4.1. Implementation details We adopt Structured-SVM (Tsochantaridis et al., 2005) to solve the policy learning problems for expected and anytime cases defined in (6) and (12) respectively. Also 0-1 loss on the pseudo-labels is relaxed to Hamming loss over positions and upper-bounded with structured hinge loss. For the structure of the policy π we use a graph with no edges due to its simplicity. In this form, the policy learn-
Structured Prediction with Test-time Budget Constraints
ing problem can be written as a sample weighted SVM. We use liblinear (Fan et al., 2008) and train the policy using L2-Loss SVM that supports weighted instances. Note that the cost of policy is dependent on the structured predictor. Therefore, learning policy on the same training set of the predictor may cause the structured loss to be overly optimistic. We follow the cross validation scheme in (He et al., 2013; Weiss et al., 2013) to deal with this issue by splitting the training data into n folds. For each fold, we generate label predictions based on the structured predictor trained on the remaining folds. Finally, we gather these label predictions and train a policy on the complete data. 4.2. Dependency Parsing We conduct experiments on the English Penn Treebank (PTB) corpus (Marcus et al., 1993) and implement our algorithm based on a structured prediction library, IllinoisSL (Chang et al., 2015b), which supports a graph-based dependency parser described in (McDonald et al., 2005). We assume the edge scores between words xi and xj in the dependency graph is modeled by a linear function wT φ(xi , xj ), where w is a weight vector learned from the training data. Two sets of features are considered for the parser in the experiments.3 The first (ψ Full ) considers the part-of-speech (POS) tags and lexicons of xi , xj , and their surrounding words (see (McDonald et al., 2005)). The other (ψ POS ) only considers the POS features. ψ Full provides a more accurate estimation of the edge scores, but evaluating ψ POS takes less time. We assume the feature selection policy assigns one of these two feature templates to each word in the sentence, such that all the directed edges (xi , xj ) corresponding to the word xi share the same feature template. To learn the feature selection policy, we extract POS tags of the target word and surrounding words as features. The feature template costs are proportional to the actual running time of generating the feature vector and computing the edge score. Costs are computed by averaging time measurements on 56,684 words from a part of PTB corpus using a single threaded dependency parser on an i7 laptop. The first feature template, ψ POS , takes 165 µs per word and the second feature template, ψ Full , takes 275 µs per word to extract the features and compute edge scores. The decoding time given the scores is 75 µs per word supporting our hypothesis that feature extraction dominates computation time. We split PTB corpus into two parts, Sections 02-22 and Section 23 as training and test sets. We then conduct a modified cross-validation mechanism (see section 4.1) to train the feature selector and the dependency parser. The depen3 We verified that our parser performs similar to other implementations for the same set of features. Note that our framework can be used with higher-order features (e.g., (Koo et al., 2008)).
Figure 2. Performance of various adaptive policies for varying budget levels (dependency tree accuracy vs. total execution time), is compared to a uniform strategy on word and sentence level, and myopic policy for the 23 section of PTB dataset.
dency parser is trained by the averaged Structured Perceptron model (Collins, 2002) with learning rate and number of epochs set to be 0.1 and 50, respectively. This setting achieves the best test performance as reported in (Chang et al., 2015b). Notice that if we trained two dependency models with different feature sets separately the scale of the edge scores may be different, resulting sub-optimal test performance. To fix this issue, we generate data with random edge features and these two models jointly. We test results on PTB Section 23. We found that for dependency parsing expensive features are only necessary in several critical locations in the sentence. Therefore, budget levels above 10% turned out to be unachievable for any featuretradeoff parameter lambda in the pseudo-labels. To obtain those regions, we varied the class weights of feature templates in the training of one-shot feature selector. Fig. 2 shows the test performance along with inference time for the budgeted feature selection policy using different combinatorial search approximations, the anytime policy, and the two baselines. We see that all one-shot policy curves are performing similarly, losing negligible accuracy when using half of the available expensive features. It results in 20% speed gain compared to the predictor using all features and the random baselines. Although marginal, one-shot policy with greedy trajectory does better in low budget regions. This is because the greedy trajectory search has a better granularity than parsimonious in choosing positions that decrease the loss early on. The anytime policy is below one-shot policy for all budget levels. As discussed in 2.3, the anytime policy is more constrained in that, it has to achieve a fixed budget for all examples. The naive myopic policy is worse than random. The linear increase in accuracy with budget shows that using margins for classifier confidence does not work as well for this task. Also
Structured Prediction with Test-time Budget Constraints
Figure 3. Distribution of parse-tree depth for words that use cheap (green) or expensive features (orange) for anytime policy. Time increases from left to right. Each group of columns show the distribution of depths from 0(root) to 7. We see the policy is concentrated on acquiring the features for lower depth words.
Figure 5. The performance of our anytime policy (in blue) and our one-shot policy (in black) is compared to the uniform strategy (in orange) for the OCR dataset. The budget is shown on the x-axis vs. the average letter recognition accuracy on the y-axis.
Figure 4. Graphical illustration of the dependency tree for the sentence ”But some saw it as a classic negotiation tactic.” in the test data. Orange nodes represent where the expensive features are acquired by the one-shot policy. It is easy to identify parents of the adjectives and determiner. However, additional features are required for the root(verb), subject and object.
for samples that acquire the expensive features we need to decode again which adds approximately 4.5 seconds of extra time for the full test dataset (hence top performance is reached at 19 secs). We then explore the effect of importance weights for the greedy policy. We notice a small improvement. This is probably because the policy functional complexity is a limiting factor. We also conducted ablative studies to better understand the policy behavior. Fig. 3 shows the distribution of depth for the words that use expensive and cheap features in the ground truth dependency tree. We expect investing more time on the low-depth words (root in the extreme) to yield higher accuracy gains. We observe this phenomenon empirically, and the policy concentrates on extracting features close to the root. We see that the number of cheap features extracted for low-depth words decreases with time. 4.3. Optical Character Recognition We also tested our algorithm on a sequence-label problem, the OCR dataset (Taskar et al., 2003) composed of 6,877 handwritten words, where each word is represented as a sequence of 16x8 binary letter images. We use a linear-chain Markov model, and following the approach in (Weiss et al., 2013; Wang et al., 2014a), use raw pixel values and HOG
Figure 6. An example word from the test dataset is shown. Note that the word is initially incorrectly identified due to degradation in letters ”u” and ”n”. The letter classification accuracy increases after the policy acquires the HOG features at strategic positions.
features with 3x3 cell size. We set the cost of raw pixel values to 0 and HOG features to 1. To train the structured predictor, the HOG features are randomly removed with a probability of 0.5 to avoid over-dependence on expensive features. We use the first 90% percent of the data for training and last 10% for test. Fig. 5 shows the average test letter accuracy vs. average test-time budget on the OCR dataset. The proposed system reduces the budget for all performance levels, with a savings of up to 40 percent at the top performance. We see that, as in the dependency parsing case, one-shot policy is performing better than the anytime. This is in line with our intuition, since the anytime policy is more constrained and is applied at every budget level, whereas each one-shot policy minimize a specific budget. We further analyze policy behavior on individual examples. Consider the example shown in Fig. 6. The letters ”u” and ”n” are noisy, causing the first half of the word to be misclassified. The policy chooses to compute HOG features for letter ”u”, reducing the error by 7 letters with the acquisition of a single feature. The policy acts as we intuitively desire, with HOG features computed for a noisy letter connected to other misclassified letters.
Structured Prediction with Test-time Budget Constraints which is equivalent to the value of (2) if π(Xi ) = S ∗ (Xi , Yi ). Otherwise, π(Xi ) 6= S ∗ (Xi , Yi ), and therefore:
5. Conclusion We present an approach to reducing feature acquisition cost in structured learning. We propose two approaches for different test-time settings and show both can be reduced to solving structured learning problems. Experimentally, these algorithms reduce feature acquisition cost with minimal loss of predictive power.
W (Xi , Yi )1π(Xi )6=S ∗ (Xi ,Yi ) + C(Xi , Yi , S ∗ (Xi , Yi )) = W (Xi , Yi ) + C(Xi , Yi , S ∗ (Xi , Yi )) = max C(Xi , Yi , S 0 ). 0 S ∈S
This is an upper-bound on (2), and therefore (4) is a valid upperbound on (2).
6. Appendix Proof of Theorem 2.3
Proof of Theorem 2.1
Note that the objective in (11) can be expressed:
The objective in (3) can be expressed: n X
n X
=
X
i=1 S∈S n X X
C (Xi , Yi , S) 1π(Xi )=S
i=1 S∈S(S t−1 (Xi ))
=
C (Xi , Yi , S) 1 − 1π(Xi )6=S
=
n X
=
"
" n X
i=1
X
+
−
S ∈S
C (Xi , Yi , S) 1 − 1π(Xi )6=s
max
S 0 ∈S(S t−1 (Xi )
i=1
max C(Xi , Yi , S 0 ) − max C(Xi , Yi , S 0 ) 0
S 0 ∈S
max
S 0 ∈S(S t−1 (Xi ))
.
"
+
max C(Xi , Yi , s0 )
+
=
i=1
+
−
=
max C(Xi , Yi , S 0 )
+
S∈S
0 max C(X i , Yi , S ) − C (Xi , Yi , S) 0
S ∈S
X
.
C (Xi , Yi , S) !
#
n X
S 0 ∈S
X
#
S∈S(S t−1 (Xi ))
S ∈S
s∈S
" n X
C (Xi , Yi , S) 1 − 1π(Xi ,S t−1 (Xi ))6=S
X
S 0 ∈S(S t−1 )(Xi )
i=1
X 0 C (Xi , Yi , S) − max C(X , Y , S ) i i 0
1 − 1π(Xi )6=S
C(Xi , Yi , S 0 )
P Note that S∈S(S t−1 )(Xi ) 1 − 1π(Xi ,St−1 (Xi ))6=S = 1, allowing for further simplification: " n X = max C(Xi , Yi , S 0 )
S 0 ∈S
i=1
C(Xi , Yi , S 0 )
S∈S(S t−1 (Xi ))
P Note that S∈S 1 − 1π(Xi )6=S = 1, allowing for further simplification: =
#
S∈S
n X
C (Xi , Yi , S) 1 − 1π(Xi ,S t−1 (Xi ))6=S
X
i=1 S∈S(S t−1 (Xi ))
i=1 S∈S n X
C (Xi , Yi , S) 1π(Xi ,S t−1 (Xi ))=S
X
i=1
1π(Xi )6=s − 1
#
.
0
max
S 0 ∈S(S t−1 (Xi ))
C(Xi , Yi , S )
" max
S 0 ∈S(S t−1 (Xi ))
+
Proof of Theorem 2.2 For a single example/label pair (Xi , Yi ), consider the two possible values of the term in the summation of (4). In the event that π(Xi ) = S ∗ (Xi , Yi ): W (Xi , Yi )1π(Xi )6=S ∗ (Xi ,Yi ) + C(Xi , Yi , S ∗ (Xi , Yi )) = C(Xi , Yi , S ∗ (Xi , Yi )),
C(Xi , Yi , S 0 )
X
max
S∈S(S t−1 (X
i ))
! Removing constant terms (that do not affect the output of the argmin) yields the expression in (5).
1 − 1π(Xi ,S t−1 (Xi ))6=s
#
−C (Xi , Yi , S)
S 0 ∈S(S t−1 )(Xi ))
C(Xi , Yi , S 0 )
1π(Xi ,St−1 (Xi ))6=S − 1
#
.
Removing constant terms (that do not affect the output of the argmin) yields the expression in (12).
Structured Prediction with Test-time Budget Constraints
References Busa-Fekete, R, Benbouzid, D, and K´egl, B. Fast classification using sparse decision DAGs. In 29th International Conference on Machine Learning (ICML 2012), pp. 951–958. Omnipress, 2012. Chang, Kai-wei, Krishnamurthy, Akshay, Agarwal, Alekh, Daume, Hal, and Langford, John. Learning to Search Better than Your Teacher. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 2058–2066, 2015a. Chang, Kai-Wei, Upadhyay, Shyam, Chang, Ming-Wei, Srikumar, Vivek, and Roth, Dan. IllinoisSL: A JAVA Library for Structured Prediction. arXiv preprint arXiv:1509.07179, 2015b. Chen, Minmin, Weinberger, Kilian Q, Chapelle, Olivier, Kedem, Dor, and Xu, Zhixiang. Classifier cascade for minimizing feature evaluation cost. In International Conference on Artificial Intelligence and Statistics, pp. 218–226, 2012. Collins, Michael. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, pp. 1–8. Association for Computational Linguistics, 2002. Fan, Rong-En, Chang, Kai-Wei, Hsieh, Cho-Jui, Wang, XiangRui, and Lin, Chih-Jen. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9: 1871–1874, 2008. Felzenszwalb, Pedro F, Girshick, Ross B, McAllester, David, and Ramanan, Deva. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010. Goldberg, Yoav and Elhadad, Michael. An efficient algorithm for easy-first non-directional dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 742–750. Association for Computational Linguistics, 2010. Grubb, Alexander and Bagnell, Drew. Speedboost: Anytime prediction with uniform near-optimality. In International Conference on Artificial Intelligence and Statistics, pp. 458–466, 2012. He, He, Daum´e III, Hal, and Eisner, Jason. Dynamic Feature Selection for Dependency Parsing. In EMNLP, pp. 1455–1464, 2013. Jordan, Michael I, Ghahramani, Zoubin, Jaakkola, Tommi S, and Saul, Lawrence K. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999. Karayev, Sergey, Fritz, Mario, and Darrell, Trevor. Dynamic feature selection for classification on a budget. In International Conference on Machine Learning (ICML): Workshop on Prediction with Sequential Models, 2013. Koo, Terry, Carreras P´erez, Xavier, and Collins, Michael. Simple semi-supervised dependency parsing. In 46th Annual Meeting of the Association for Computational Linguistics, pp. 595–603, 2008.
Kusner, Matt J, Chen, Wenlin, Zhou, Quan, Xu, Zhixiang Eddie, Weinberger, Kilian Q, and Chen, Yixin. Feature-Cost Sensitive Learning with Submodular Trees of Classifiers. In AAAI, pp. 1939–1945, 2014. Marcus, Mitchell P, Marcinkiewicz, Mary Ann, and Santorini, Beatrice. Building a large annotated corpus of English: The Penn Treebank. Computational linguistics, 19(2):313–330, 1993. McDonald, Ryan, Pereira, Fernando, Ribarov, Kiril, and Hajiˇc, Jan. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 523–530. Association for Computational Linguistics, 2005. Nan, Feng, Wang, Joseph, and Saligrama, Venkatesh. FeatureBudgeted Random Forest. In Proceedings of The 32nd International Conference on Machine Learning, pp. 1983—-1991, 2015. Ross, St´ephane, Gordon, Geoffrey J, and Bagnell, Drew. A Reduction of Imitation Learning and Structured Prediction to NoRegret Online Learning. In International Conference on Artificial Intelligence and Statistics, pp. 627–635, 2011. Shi, Tianlin, Steinhardt, Jacob, and Liang, Percy. Learning Where to Sample in Structured Prediction. In AISTATS, 2015. Stoyanov, Veselin and Eisner, Jason. Easy-first Coreference Resolution. In COLING, pp. 2519–2534. Citeseer, 2012. Taskar, Ben, Guestrin, Carlos, and Koller, Daphne. Max-Margin Markov Networks. In Advances in Neural Information Processing Systems, pp. None, 2003. Trapeznikov, Kirill and Saligrama, Venkatesh. Supervised sequential classification under budget constraints. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, pp. 581–589, 2013. Tsochantaridis, Ioannis, Joachims, Thorsten, Hofmann, Thomas, and Altun, Yasemin. Large margin methods for structured and interdependent output variables. In Journal of Machine Learning Research, pp. 1453–1484, 2005. Viola, Paul and Jones, Michael. Robust real-time object detection. International Journal of Computer Vision, 4, 2001. Wang, Joseph, Bolukbasi, Tolga, Trapeznikov, Kirill, and Saligrama, Venkatesh. Model selection by linear programming. In Computer Vision–ECCV 2014, pp. 647–662. Springer, 2014a. Wang, Joseph, Trapeznikov, Kirill, and Saligrama, Venkatesh. An LP for Sequential Learning Under Budgets. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pp. 987–995, 2014b. Wang, Joseph, Trapeznikov, Kirill, and Saligrama, Venkatesh. Efficient Learning by Directed Acyclic Graph For Resource Constrained Prediction. In Advances in Neural Information Processing Systems, pp. 2143–2151, 2015. Weiss, David and Taskar, Ben. Structured Prediction Cascades. In International Conference on Artificial Intelligence and Statistics, pp. 916–923, 2010.
Structured Prediction with Test-time Budget Constraints Weiss, David, Sapp, Benjamin, and Taskar, Ben. Dynamic structured model selection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2656–2663, 2013. Xu, Zhixiang, Kusner, Matt, Chen, Minmin, and Weinberger, Kilian Q. Cost-Sensitive Tree of Classifiers. In Proceedings of the 30th International Conference on Machine Learning (ICML13), pp. 133–141, 2013. Zhang, Cha and Zhang, Zhengyou. A survey of recent advances in face detection, 2010. Zhu, Menglong, Atanasov, Nikolay, Pappas, George J, and Daniilidis, Kostas. Active deformable part models inference. In Computer Vision–ECCV 2014, pp. 281–296. Springer, 2014.