# Exploration and Exploitation
## Setting

Learning agents need to trade off two things:

- Exploitation: maximise performance based on current knowledge
- Exploration: increase knowledge
## The Multi-Armed Bandit

A multi-armed bandit is a set of actions ("arms") $\mathcal{A}$, each with an unknown reward distribution. At each step $t$ the agent selects an action $A_t \in \mathcal{A}$ and the environment generates a reward $R_t$. The goal is to maximise the cumulative reward $\sum_{\tau=1}^{t} R_\tau$.
## Values and Regret

The action value for action a is the expected reward:

$$q(a) = \mathbb{E}\left[ R_t \mid A_t = a \right]$$

The optimal value is

$$v_* = \max_a q(a)$$

The regret of an action a is the gap to the optimal value:

$$\Delta_a = v_* - q(a)$$

The regret of the optimal action is zero. We want to minimise the total regret:

$$L_T = \mathbb{E}\left[ \sum_{t=1}^{T} v_* - q(A_t) \right] = \mathbb{E}\left[ \sum_{t=1}^{T} \Delta_{A_t} \right]$$

Maximising cumulative reward is equivalent to minimising total regret. The summation spans the full 'lifetime of learning'.
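
As a tiny worked example (the numbers are made up for illustration): with two arms

$$q(a_1) = 0.8, \qquad q(a_2) = 0.5 \quad \Rightarrow \quad v_* = 0.8, \quad \Delta_{a_1} = 0, \quad \Delta_{a_2} = 0.3,$$

an algorithm that selects $a_2$ on 100 of its first 1000 steps incurs total regret $L_{1000} = 100 \times 0.3 = 30$.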
## Algorithms

- Greedy
- ε-greedy
- UCB
- Thompson sampling
- Policy gradients

The first three all use action value estimates $Q_t(a) \approx q(a)$.
### Action values

The action value for action a is the expected reward:

$$q(a) = \mathbb{E}\left[ R_t \mid A_t = a \right]$$

A simple estimate is the average of the sampled rewards:

$$Q_t(a) = \frac{\sum_{n=1}^{t} \mathbb{1}(A_n = a) R_n}{N_t(a)}$$

where the count for action a is

$$N_t(a) = \sum_{n=1}^{t} \mathbb{1}(A_n = a)$$

This can also be updated incrementally:

$$Q_t(A_t) = Q_{t-1}(A_t) + \alpha_t \left( R_t - Q_{t-1}(A_t) \right), \qquad Q_t(a) = Q_{t-1}(a) \;\; \forall a \neq A_t,$$

where $\alpha_t$ is a step size. Choosing $\alpha_t = \frac{1}{N_t(A_t)}$ recovers the sample average; a constant $\alpha$ would lead to tracking (recent rewards count more), rather than averaging.
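
A minimal sketch of this incremental estimate in Python (the class name and interface are my own, not from the lecture):

```python
import numpy as np

class ActionValues:
    """Sample-average action-value estimates, updated incrementally."""

    def __init__(self, num_actions):
        self.q = np.zeros(num_actions)  # Q(a) estimates
        self.n = np.zeros(num_actions)  # N(a) action counts

    def update(self, action, reward):
        self.n[action] += 1
        alpha = 1.0 / self.n[action]  # 1/N(a) -> exact sample average
        # alpha = 0.1                 # constant step size -> tracking instead
        self.q[action] += alpha * (reward - self.q[action])
```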
### The greedy policy

One of the simplest policies: always select the action with the highest estimated value,

$$A_t = \operatorname*{argmax}_a Q_t(a)$$

The greedy policy can lock onto a suboptimal action forever, so it can have linear expected total regret.
### ε-Greedy Algorithm

With probability $1 - \varepsilon$ select the greedy action, and with probability $\varepsilon$ select an action uniformly at random:

$$\pi_t(a) = \begin{cases} (1 - \varepsilon) + \varepsilon / |\mathcal{A}| & \text{if } Q_t(a) = \max_b Q_t(b) \\ \varepsilon / |\mathcal{A}| & \text{otherwise} \end{cases}$$

ε-greedy keeps exploring, but with a constant ε it also has linear expected total regret.
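
A short sketch of the action selection in Python (the function name is mine):

```python
import numpy as np

def eps_greedy_action(q, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))  # explore
    return int(np.argmax(q))              # exploit
```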
### Policy gradients

Policy search: learn policies $\pi(a)$ directly, instead of learning values, by parametrising the policy with action preferences $H_t(a)$:

$$\pi_t(a) = \frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}} \qquad \text{(softmax)}$$

The preferences are not values: they are just learnable policy parameters. Goal: learn the policy by optimising the preferences.
Idea: update the policy parameters $\theta$ such that the expected value increases, using gradient ascent:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \mathbb{E}\left[ R_t \mid \pi_{\theta_t} \right]$$
### Gradient bandits

Log-likelihood trick (also known as the REINFORCE trick; Williams, 1992):

$$\nabla_\theta \mathbb{E}\left[ R_t \mid \pi_\theta \right] = \mathbb{E}\left[ R_t \, \nabla_\theta \log \pi_\theta(A_t) \right]$$

The right-hand side is an expectation over the policy, so we can sample it, giving the stochastic gradient-ascent update

$$\theta_{t+1} = \theta_t + \alpha R_t \nabla_\theta \log \pi_{\theta_t}(A_t)$$

For the softmax policy over preferences this becomes $H_{t+1}(a) = H_t(a) + \alpha R_t \left( \mathbb{1}(a = A_t) - \pi_t(a) \right)$.
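
A self-contained sketch of a gradient bandit in Python (the `pull` callback and all names are assumptions for illustration):

```python
import numpy as np

def gradient_bandit(pull, num_actions, steps=1000, alpha=0.1, seed=0):
    """Softmax policy over preferences H(a), trained by the sampled REINFORCE update.

    `pull(a)` is assumed to return a sampled reward for action a.
    """
    rng = np.random.default_rng(seed)
    h = np.zeros(num_actions)                           # preferences H(a)
    for _ in range(steps):
        pi = np.exp(h - h.max())                        # numerically stable softmax
        pi /= pi.sum()
        a = rng.choice(num_actions, p=pi)
        r = pull(a)
        h += alpha * r * (np.eye(num_actions)[a] - pi)  # H(a) += alpha R (1(a=A_t) - pi(a))
    return h

# Example: two Bernoulli arms with success probabilities 0.3 and 0.7.
print(gradient_bandit(lambda a: float(np.random.random() < (0.3, 0.7)[a]), 2))
```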
## Theory

Theorem (Lai and Robbins). Asymptotically, the total regret of any algorithm is at least logarithmic in the number of steps:

$$\lim_{T \to \infty} L_T \geq \log T \sum_{a \,:\, \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}\!\left( \mathcal{R}_a \,\|\, \mathcal{R}_{a^*} \right)}$$

where $\mathrm{KL}(\mathcal{R}_a \| \mathcal{R}_{a^*})$ is the KL divergence between the reward distribution of arm a and that of the optimal arm.

### Counting Regret

The total regret can be rewritten in terms of the expected action counts and the gaps:

$$L_T = \mathbb{E}\left[ \sum_{t=1}^{T} \Delta_{A_t} \right] = \sum_a \mathbb{E}\left[ N_T(a) \right] \Delta_a$$

A good algorithm ensures small counts for actions with large gaps.
## Optimism in the Face of Uncertainty

The more uncertainty there is about an action's value, the more important it is to explore that action: it could turn out to be the best one. Which action should we pick? Optimism in the face of uncertainty says: the action that is most plausibly optimal, i.e., whose upper bound on value is highest.
## Algorithms: UCB

Upper Confidence Bounds: estimate an upper confidence $U_t(a)$ for each action value, such that $q(a) \leq Q_t(a) + U_t(a)$ with high probability, and select the action that maximises the bound:

$$A_t = \operatorname*{argmax}_a \left( Q_t(a) + c \sqrt{\frac{\log t}{N_t(a)}} \right)$$

where $c$ is a hyper-parameter trading off exploration and exploitation.
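
A minimal sketch of the selection rule in Python (trying every untried action first is a common implementation choice, not from the slide):

```python
import numpy as np

def ucb_action(q, n, t, c=2.0):
    """Select argmax_a Q(a) + c * sqrt(log t / N(a))."""
    if np.any(n == 0):
        return int(np.argmax(n == 0))  # try every action once before using the bound
    return int(np.argmax(q + c * np.sqrt(np.log(t) / n)))
```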
## Theory: the optimality of UCB

Hoeffding's Inequality: let $X_1, \ldots, X_n$ be i.i.d. random variables in $[0, 1]$ with true mean $\mu$, and let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ be the sample mean. Then

$$p\left( \bar{X}_n + u \leq \mu \right) \leq e^{-2nu^2}$$
### Calculating Upper Confidence Bounds

Apply Hoeffding's inequality to the rewards of an action: pick a probability $p$ that the true value exceeds the bound and solve for the bound,

$$p\left( q(a) > Q_t(a) + U_t(a) \right) \leq e^{-2 N_t(a) U_t(a)^2} = p \quad \Rightarrow \quad U_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}$$

Reducing $p$ as we observe more rewards, e.g. $p = 1/t$, then yields a bonus of the form $c \sqrt{\log t / N_t(a)}$, i.e. the UCB algorithm above, which achieves logarithmic expected total regret.
## Bayesian approaches

### Bayesian Bandits

Bayesian approach: model distributions over values, $p(q(a) \mid \theta_t)$, where $\theta_t$ are the parameters of the posterior after t steps. This quantifies the uncertainty about each value, which we can then use to guide exploration.

### Bayesian Learning

Update the posterior over the value parameters with Bayes' rule:

$$p(\theta \mid R_1, \ldots, R_t) \propto p(R_t \mid \theta) \, p(\theta \mid R_1, \ldots, R_{t-1})$$

Example: for Bernoulli rewards, a $\mathrm{Beta}(x_a, y_a)$ prior on each arm's success probability is conjugate; after each pull, increment $x_a$ on a reward of 1 and $y_a$ on a reward of 0.

### Bayesian Bandits with Upper Confidence Bounds

The posterior gives an upper confidence bound directly: e.g. set $U_t(a) = c\, \sigma_t(a)$, where $\sigma_t(a)$ is the posterior standard deviation of $q(a)$, and select

$$A_t = \operatorname*{argmax}_a \left( Q_t(a) + c\, \sigma_t(a) \right)$$
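
A small sketch for the Bernoulli case in Python (the $\mathrm{Beta}(1, 1)$ prior and the function name are my own choices):

```python
import numpy as np

def bayes_ucb_action(x, y, c=2.0):
    """x, y: per-arm arrays of observed 1s and 0s, with Beta(1, 1) priors."""
    a, b = 1.0 + x, 1.0 + y                              # Beta posterior parameters
    mean = a / (a + b)                                   # posterior mean of q(a)
    std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))  # posterior std of q(a)
    return int(np.argmax(mean + c * std))
```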
### Thompson sampling

Probability matching selects an action according to the probability that it is optimal:

$$\pi_t(a) = p\left( q(a) = \max_b q(b) \,\middle|\, \theta_t \right)$$

Probability matching is optimistic in the face of uncertainty: actions have higher probability of being selected when either the estimated value is high or the uncertainty is high. It can be difficult to compute $\pi_t(a)$ analytically from the posterior (but it can be done numerically).

Thompson sampling implements probability matching by sampling: draw one value $\tilde{q}(a) \sim p(q(a) \mid \theta_t)$ per action and select $A_t = \operatorname*{argmax}_a \tilde{q}(a)$.
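
A minimal Bernoulli Thompson sampling sketch in Python (the Beta(1, 1) prior and names are assumptions):

```python
import numpy as np

def thompson_action(x, y, rng=None):
    """Sample q~(a) ~ Beta(1 + x_a, 1 + y_a) for each arm, then act greedily."""
    rng = rng or np.random.default_rng()
    return int(np.argmax(rng.beta(1.0 + x, 1.0 + y)))

# After observing reward r in {0, 1} from arm a, update the posterior counts:
#   x[a] += r;  y[a] += 1 - r
```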
## Planning to explore

### Information State Space

We can treat the agent's own uncertainty as part of the state: an information state $\tilde{s}$ summarises all the information accumulated so far, and each action causes a transition to a new information state (by adding information). This defines an MDP over information states, so the exploration problem itself can be solved by planning.

Example: Bernoulli Bandits. The information state can be the counts $\tilde{s} = (x_a, y_a)_a$ of rewards 1 and 0 observed for each arm; pulling arm a increments $x_a$ or $y_a$.

Solving information state space bandits: e.g., learn a Bayesian reward distribution and plan into the future. Bayes-adaptive RL optimally trades off exploration and exploitation with respect to the prior distribution. In full RL, the same idea extends to the transition model as well.
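
A toy sketch of planning in the Bernoulli information state space (horizon, priors, and names are all my own assumptions; real Bayes-adaptive planning is usually intractable and must be approximated):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def value(s, steps_left):
    """Bayes-optimal expected future reward from information state s.

    s = ((x1, y1), (x2, y2)): counts of rewards 1/0 per arm,
    on top of a Beta(1, 1) prior for each arm.
    """
    if steps_left == 0:
        return 0.0
    best = float("-inf")
    for a in (0, 1):
        x, y = s[a]
        p = (x + 1) / (x + y + 2)  # posterior mean of arm a
        win, lose = list(s), list(s)
        win[a], lose[a] = (x + 1, y), (x, y + 1)
        q = p * (1.0 + value(tuple(win), steps_left - 1)) \
            + (1 - p) * value(tuple(lose), steps_left - 1)
        best = max(best, q)
    return best

# Expected reward of the Bayes-optimal policy over 10 pulls, from uniform priors:
print(value(((0, 0), (0, 0)), 10))
```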
All included screenshots are credited to the Lecture 2 slides, the wiki, and the COMP0142 lectures.
- Title: Exploration and Exploitation
- Author: wy
- Created at: 2023-07-10 06:57:52
- Updated at: 2023-08-02 04:58:29
- Link: https://yue-ruby-w.site/2023/07/10/2023-07-10-RL-EE-RLEE/
- License: This work is licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0).