TOWARDS THE USE OF DEEP REINFORCEMENT LEARNING WITH GLOBAL POLICY FOR QUERY-BASED EXTRACTIVE SUMMARISATION
Diego Mollá

PROBLEM
• Supervised machine learning approaches to text summarisation are usually based on predicted scores of individual sentences/extracts.
• The resulting system is therefore not optimised for the global, multi-sentence summary.

ACTIONS, REWARD
1. The agent needs to decide whether sentence i is part of the summary or not.
2. The reward is delayed until all n sentences have been processed.
3. The reward of the summary is its ROUGE_L score:

   r = 0        if i < n
   r = ROUGE_L  if i = n
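To make the delayed reward concrete, here is a minimal Python sketch. The environment class, the action encoding (1 = keep sentence i, 0 = skip it) and the simplified LCS-based ROUGE-L are assumptions for illustration only; the actual implementation is in the bioasq-rl repository linked under Source Code, and the next state (described later under "The State") is omitted here to keep the sketch focused on the reward.

# Sketch of the delayed-reward rule above (illustrative assumptions throughout).

def rouge_l(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F-score based on the longest common subsequence of tokens."""
    c, ref = candidate.split(), reference.split()
    if not c or not ref:
        return 0.0
    dp = [[0] * (len(ref) + 1) for _ in range(len(c) + 1)]
    for i, x in enumerate(c, 1):
        for j, y in enumerate(ref, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    prec, rec = lcs / len(c), lcs / len(ref)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

class DelayedRewardEnv:
    """One episode = one document: decide sentence by sentence, reward at the end."""

    def __init__(self, sentences, reference_summary):
        self.sentences = sentences
        self.reference = reference_summary
        self.i = 0            # index of the sentence under consideration
        self.selected = []    # sentences kept in the summary so far

    def step(self, action):
        if action == 1:
            self.selected.append(self.sentences[self.i])
        self.i += 1
        done = self.i == len(self.sentences)
        # r = 0 if i < n;  r = ROUGE_L of the whole summary if i = n
        reward = rouge_l(" ".join(self.selected), self.reference) if done else 0.0
        return reward, done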
CONTRIBUTIONS
1. Focus on query-based extractive summarisation.
2. Use reinforcement learning to directly optimise the final multi-sentence summary.
3. Learn a global policy using policy gradient.
REINFORCEMENT LEARNING
[Diagram: the agent sends action a to the environment; the environment returns state s, reward r, and flag done.]
• a: Action made by the agent.
• r: Reward given to the action.
• s: State returned after applying the action.
• done: True if an episode has completed.
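The diagram corresponds to the usual agent-environment loop; a minimal sketch is below, with a hypothetical agent object, and env.reset/env.step following the interface used in the Policy Gradient box.

# Illustrative agent-environment loop matching the diagram above.

def run_episode(env, agent, sample):
    """Process one document: the agent decides, per sentence, whether to keep it."""
    s = env.reset(sample)          # initial state for this document
    done = False
    total_reward = 0.0
    while not done:
        a = agent.act(s)           # action: keep or skip the current sentence
        s, r, done = env.step(a)   # next state, reward (0 until the end), done flag
        total_reward += r
    return total_reward            # equals the final ROUGE_L of the summary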
THE GLOBAL POLICY
1. Use a training set to learn a global policy.
2. The global policy predicts the probability that selecting sentence i would give the highest reward.
3. Implemented as a neural network with one hidden layer.
   • The final unit is a Bernoulli logistic unit.
4. Trained using policy gradient.

   Pr(a = 0 | s; Wh, Ws, bh, bs) = o
   o = σ(h · Wh + bh)
   h = relu(s · Ws + bs)
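These equations translate almost line by line into NumPy; the sketch below is illustrative only, with assumed dimensions and initialisation (the poster specifies just one hidden ReLU layer and a Bernoulli logistic output unit).

import numpy as np

# Sketch of the one-hidden-layer policy network defined by the equations above.
# Dimensions and initialisation are illustrative assumptions.
state_size, hidden_size = 100, 50
rng = np.random.default_rng(0)
Ws = rng.normal(scale=0.1, size=(state_size, hidden_size)); bs = np.zeros(hidden_size)
Wh = rng.normal(scale=0.1, size=(hidden_size, 1));          bh = np.zeros(1)

def policy(s):
    """Return Pr(a = 0 | s), the output of the Bernoulli logistic unit."""
    h = np.maximum(0.0, s @ Ws + bs)           # h = relu(s · Ws + bs)
    o = 1.0 / (1.0 + np.exp(-(h @ Wh + bh)))   # o = sigma(h · Wh + bh)
    return float(o)                            # Pr(a = 0 | s; Wh, Ws, bh, bs)

# Example: a random state vector
print(policy(rng.normal(size=state_size)))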
POLICY GRADIENT
The learning algorithm for the global policy is a variant of the REINFORCE algorithm [1] that uses gradient descent with cross-entropy gradients multiplied by the reward [2, Chapter 16].

Data: train_data
Result: θ = (Wh, bh, Ws, bs)
sample ∼ Uniform(train_data);
s ← env.reset(sample);
all_gradients ← ∅;
episode ← 0;
while True do
    ξ ∼ Bernoulli((Pr(a = 0) + p) / (1 + 2 × p));   (p is defined under Early Exploration)
    y ← 1 − ξ;
    gradient ← ∇θ cross_entropy(y, Pr(a = 0));
    all_gradients.append(gradient);
    s, r, done ← env.step(ξ);
    episode ← episode + 1;
    if done then
        θ ← θ − α × r × mean(all_gradients);
        sample ∼ Uniform(train_data);
        s ← env.reset(sample);
        all_gradients ← ∅;
    end
end
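A sketch of this loop using TensorFlow (the toolkit of reference [2]) follows. The env object follows the pseudocode's reset/step interface; the layer sizes, learning rate and finite step budget are assumptions, and the Early Exploration smoothing is left out to keep the sketch short.

import random
import tensorflow as tf

# Sketch of the REINFORCE-style loop above, using tf.GradientTape for the
# cross-entropy gradients.  Sizes, learning rate and step budget are illustrative.

state_size, hidden_size, alpha = 100, 50, 0.01

model = tf.keras.Sequential([
    tf.keras.Input(shape=(state_size,)),
    tf.keras.layers.Dense(hidden_size, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # outputs Pr(a = 0 | s)
])

def train(env, train_data, n_steps=10_000):
    sample = random.choice(train_data)                 # sample ~ Uniform(train_data)
    s = env.reset(sample)
    all_gradients, episode = [], 0
    for _ in range(n_steps):                           # "while True" in the pseudocode
        with tf.GradientTape() as tape:
            p_a0 = model(tf.constant([s], dtype=tf.float32))[0, 0]
            # xi ~ Bernoulli(Pr(a = 0)); the Early Exploration smoothing
            # could be applied to float(p_a0) here before sampling.
            xi = 1.0 if random.random() < float(p_a0) else 0.0
            y = 1.0 - xi
            # cross_entropy(y, Pr(a = 0))
            loss = -(y * tf.math.log(p_a0 + 1e-8) + (1 - y) * tf.math.log(1 - p_a0 + 1e-8))
        all_gradients.append(tape.gradient(loss, model.trainable_variables))
        s, r, done = env.step(xi)
        episode += 1
        if done:
            # theta <- theta - alpha * r * mean(all_gradients)
            for var, *grads in zip(model.trainable_variables, *all_gradients):
                var.assign_sub(alpha * r * tf.reduce_mean(tf.stack(grads), axis=0))
            sample = random.choice(train_data)
            s = env.reset(sample)
            all_gradients = []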
EARLY EXPLORATION
• We encourage exploration in the initial steps.
  – Exploration diminishes in later steps.
• This way we may avoid getting locked into local minima early.

   ξ ∼ Bernoulli((Pr(a = 0) + p) / (1 + 2 × p)),   where p = 0.2 × 3000 / (3000 + episode)
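The smoothed sampling can be packaged as a small helper; the constants 0.2 and 3000 are taken from the formula above, while the function names are illustrative.

import random

# Sketch of the exploration-smoothed sampling above.

def exploration_bonus(episode: int) -> float:
    """p starts at 0.2 and decays towards 0 as training progresses."""
    return 0.2 * 3000 / (3000 + episode)

def sample_action(pr_a0: float, episode: int) -> int:
    """Draw xi ~ Bernoulli((Pr(a = 0) + p) / (1 + 2*p))."""
    p = exploration_bonus(episode)
    prob = (pr_a0 + p) / (1 + 2 * p)
    return 1 if random.random() < prob else 0

# Early on, even a confident Pr(a = 0) is pulled towards 0.5:
print(exploration_bonus(0))       # p = 0.2, so Pr = 0.95 becomes (0.95 + 0.2)/1.4 ≈ 0.82
print(exploration_bonus(30000))   # p ≈ 0.018, so the smoothing is almost gone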
RESULTS ON BIOASQ 5B DATA
[Results figure/table on the BioASQ 5b data; not recoverable from the extracted text.]
THE STATE
• The state should contain all the information needed to choose sentence i.
• The state must encode an arbitrary number of sentences.
• The sentences are processed sequentially and decisions cannot be undone.

The state is built from five tf.idf vectors (see the sketch below):
1. tf.idf of the candidate sentence i.
2. tf.idf of the entire input text to summarise.
3. tf.idf of the summary generated so far.
4. tf.idf of the candidate sentences that are yet to be processed.
5. tf.idf of the question.
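As an illustration, such a state could be assembled with scikit-learn's TfidfVectorizer; fitting the vectoriser per document and concatenating the five vectors into one array are assumptions of this sketch, not details given on the poster.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch of the state described above: five tf.idf vectors, one per listed feature.

def build_state(sentences, i, selected, question):
    """State for candidate sentence i, given the sentences selected so far."""
    vec = TfidfVectorizer()
    vec.fit(sentences + [question])
    tfidf = lambda text: vec.transform([text]).toarray()[0]
    return np.concatenate([
        tfidf(sentences[i]),                 # 1. candidate sentence i
        tfidf(" ".join(sentences)),          # 2. entire input text
        tfidf(" ".join(selected)),           # 3. summary generated so far
        tfidf(" ".join(sentences[i + 1:])),  # 4. candidates yet to be processed
        tfidf(question),                     # 5. the question
    ])

# Example
sents = ["Gene X regulates Y.", "Y is a kinase.", "Unrelated sentence."]
state = build_state(sents, i=1, selected=[sents[0]], question="What does gene X regulate?")
print(state.shape)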
SOURCE CODE
https://github.com/dmollaaliod/bioasq-rl
CONTACT
• http://comp.mq.edu.au/~diego/
• [email protected]
REFERENCES
[1] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.
[2] Aurélien Géron. 2017. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media.