Improving Policies without Measuring Merits

Peter Dayan
CBCL E25-201, MIT, Cambridge, MA 02139
Satinder P Singh
Abstract

Performing policy iteration in dynamic programming should only require knowledge of relative rather than absolute measures of the utility of actions: what Baird (1993) calls the advantages of actions at states. Nevertheless, existing methods in dynamic programming (including Baird's) compute some form of absolute utility function. For smooth problems, advantages satisfy two differential consistency conditions (including the requirement that they be free of curl), and we show that enforcing these can lead to appropriate policy improvement solely in terms of advantages.
1 Introduction

In deciding how to change a policy at a state, an agent only needs to know the differences between the total return based on taking each action a for one step and then following the policy forever after, and the total return based on always following the policy (the conventional value of the state under the policy). The main theoretical attraction of these differences, called advantages,[1] is that they are like differentials: they do not depend on the local levels of the total return. For instance, in a conventional undiscounted maze problem with a penalty for each move, the advantages for the actions might typically be -1, 0 or 1, whereas the values vary between 0 and the maximum distance to the goal. Advantages should therefore be easier to represent than absolute value functions in a generalising system such as a neural network and, possibly, easier to learn.

[1] Baird's use of them has the additional merit that they can cope with arbitrarily small time steps. This is a separate issue, and the method can be used for Q-learning too.

Although the
advantages are differential, existing methods for learning them, notably Baird (1993), require the agent simultaneously to learn the total return from each state. This removes their main attraction. The underlying trouble is that advantages do not appear to satisfy any form of a Bellman equation (or Hamilton-Jacobi-Bellman equation for continuous problems). Whereas it is clear that the value of a state should be closely related to the values of its neighbours, it is not obvious that the advantage of action a at a state should be equally closely related to its advantages nearby. In this paper, we show that under some circumstances it is possible to use a solely advantage-based scheme for policy iteration, using the spatial derivatives of the value function rather than the value function itself. Advantages satisfy a particular consistency condition, and, given a model of the dynamics and reward structure of the environment, an agent can use this condition to acquire the spatial derivatives of the value function directly. It turns out that this condition alone may not impose enough constraints to specify these derivatives (this is a consequence of the problem described above); however, the value function is like a potential function for these derivatives, and this allows extra constraints to be imposed. Section 2 develops the theory behind this scheme, in the context of continuous dynamic programming (DP), and uses the linear quadratic regulator as a didactic example; section 3 shows it working using a radial-basis-function approximator on both regulator and non-regulator problems; section 4 discusses the results and shows the relationship with Mitchell and Thrun's (1993) Explanation-Based Q-learning method.
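As a concrete anchor for the terms used above, here is a minimal sketch in standard notation; the symbols $Q^{\pi}$, the dynamics $f$, and the maximisation/minimisation conventions are assumptions for illustration and are not fixed by the text above. In discrete time, the advantage of action $a$ at state $x$ under policy $\pi$ is the action value minus the state value, and greedy policy improvement needs only these differences:
\[
  A^{\pi}(x,a) \;=\; Q^{\pi}(x,a) - V^{\pi}(x),
  \qquad
  \pi'(x) \;=\; \arg\max_{a} A^{\pi}(x,a)
  \quad\text{(return convention; an $\arg\min$ under the cost convention of section 2).}
\]
In the continuous-time setting, assuming deterministic dynamics $\dot y(t) = f(y(t),u(t))$ and an integrated cost $r$, the corresponding consistency condition involves only the spatial derivative $\nabla V^{\pi}$ of the value:
\[
  A^{\pi}(x,u) \;=\; r(x,u) + \nabla V^{\pi}(x) \cdot f(x,u),
  \qquad
  A^{\pi}(x,\pi(x)) \;=\; 0,
\]
and, because $\nabla V^{\pi}$ is the gradient of a scalar potential, it must be curl-free:
\[
  \frac{\partial\, [\nabla V^{\pi}(x)]_{i}}{\partial x_{j}}
  \;=\;
  \frac{\partial\, [\nabla V^{\pi}(x)]_{j}}{\partial x_{i}}
  \quad\text{for all } i,j.
\]
Only $\nabla V^{\pi}$, and never $V^{\pi}$ itself, appears in these relations, which is what makes a purely advantage-based scheme conceivable.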
2 Continuous DP, Advantages and Curl

Consider the problem of controlling a deterministic system to minimise
\[
  V(x_0) \;=\; \min_{u(t)} \int_0^{\infty} r(y(t), u(t))\, dt
\]
where $y(t) \in$