Exploration and Exploitation


Exploration and Exploitation

Setting


Learning agents need to trade off two things:

Exploitation: Maximise performance based on current knowledge

Exploration: Increase knowledge

The Multi-Armed Bandit


Values and Regret

The action value for action $a$ is the expected reward:

$$q(a) = \mathbb{E}\left[ R_t \mid A_t = a \right]$$

The optimal value is

$$v_* = \max_a q(a)$$

The regret of an action $a$ is its gap to the optimal value:

$$\Delta_a = v_* - q(a)$$

The regret for the optimal action is zero.

We want to minimise the total regret:

$$L_t = \sum_{n=1}^{t} \left( v_* - q(A_n) \right)$$

Maximising cumulative reward ≡ minimising total regret.

The summation spans the full ‘lifetime of learning’.

Algorithms

  • Greedy

  • ε-greedy

  • UCB

The first three all use action value estimates $Q_t(a) \approx q(a)$

  • Thompson sampling

  • Policy gradients

Action values

The action value for action $a$ is the expected reward:

$$q(a) = \mathbb{E}\left[ R_t \mid A_t = a \right]$$

A simple estimate is the average of the sampled rewards:

$$Q_t(a) = \frac{\sum_{n=1}^{t} \mathcal{I}(A_n = a)\, R_n}{N_t(a)}$$

The count for action $a$ is

$$N_t(a) = \sum_{n=1}^{t} \mathcal{I}(A_n = a)$$

This can also be updated incrementally:

$$Q_t(A_t) = Q_{t-1}(A_t) + \alpha_t \left( R_t - Q_{t-1}(A_t) \right)$$

where $\alpha_t$ is a step size. Choosing $\alpha_t = \frac{1}{N_t(A_t)}$ recovers the sample average; a constant $\alpha$ instead leads to tracking (an exponentially weighted average of recent rewards) rather than averaging.
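As a sketch (not from the slides), the incremental update takes only a few lines; the reward sequence below is invented for illustration:

```python
# Incremental action-value update (sample-average case).
# With step size 1/n this reproduces the plain average of the rewards;
# a constant step size would instead track recent rewards.

def update(q, n, reward):
    """Update the value estimate q and count n after observing reward."""
    n += 1
    q += (reward - q) / n  # Q_t = Q_{t-1} + alpha_t * (R_t - Q_{t-1})
    return q, n

q, n = 0.0, 0
for r in [1.0, 0.0, 1.0, 1.0]:  # made-up rewards for one action
    q, n = update(q, n, r)

print(q)  # equals the plain average: 0.75
```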

The greedy policy

One of the simplest policies: always select the action with the highest estimated value, $A_t = \arg\max_a Q_t(a)$. Pure greedy selection can lock onto a suboptimal action forever, because it never explores.

ε-Greedy Algorithm

With probability $1 - \varepsilon$ select the greedy action, and with probability $\varepsilon$ select an action uniformly at random. This guarantees continual exploration.
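A minimal sketch of ε-greedy action selection, assuming a small list of value estimates `Q`; the values and ε below are illustrative:

```python
# Epsilon-greedy: explore uniformly with probability epsilon,
# otherwise exploit the current value estimates.
import random

def epsilon_greedy(Q, epsilon, rng=random):
    """Return an action index chosen epsilon-greedily w.r.t. Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))               # explore: uniform random
    return max(range(len(Q)), key=Q.__getitem__)   # exploit: argmax_a Q(a)

random.seed(0)
Q = [0.2, 0.8, 0.5]  # made-up value estimates
counts = [0, 0, 0]
for _ in range(1000):
    counts[epsilon_greedy(Q, epsilon=0.1)] += 1
print(counts)  # the greedy action (index 1) dominates, but all are tried
```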

Policy gradients

Policy search: learn action preferences $H_t(a)$

Learn policies $\pi(a)$ directly, instead of learning values:

$$\pi(a) = \frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}} \qquad \text{(softmax)}$$

The preferences are not values: they are just learnable policy parameters.

Goal: learn the policy by optimising the preferences.

Policy gradients

Update the policy parameters such that the expected value increases, using gradient ascent:

$$\theta_{t+1} = \theta_t + \alpha \nabla_{\theta} \mathbb{E}\left[ R_t \mid \pi_{\theta_t} \right]$$

Gradient bandits

Log-likelihood trick (also known as the REINFORCE trick, Williams 1992):

$$\nabla_{\theta} \mathbb{E}\left[ R_t \mid \pi_{\theta} \right] = \mathbb{E}\left[ R_t \nabla_{\theta} \log \pi_{\theta}(A_t) \right]$$

The right-hand side can be sampled, giving the stochastic update

$$\theta_{t+1} = \theta_t + \alpha R_t \nabla_{\theta} \log \pi_{\theta_t}(A_t)$$

For softmax preferences this yields

$$H_{t+1}(a) = H_t(a) + \alpha R_t \left( \mathcal{I}(a = A_t) - \pi_t(a) \right)$$
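The log-likelihood trick leads to a simple gradient-bandit procedure: sample an action from the softmax policy, observe a reward, and nudge the preferences along the sampled gradient. This is a sketch with invented mean rewards and step size (and, for simplicity, no reward baseline):

```python
# Gradient bandit: softmax policy over preferences H, updated with the
# sampled gradient H(a) += alpha * R * (I(a == A) - pi(a)).
import math
import random

def softmax(H):
    m = max(H)                                # subtract max for stability
    e = [math.exp(h - m) for h in H]
    z = sum(e)
    return [x / z for x in e]

def gradient_bandit_step(H, alpha, true_means, rng):
    """Sample an action, observe a noisy reward, update the preferences."""
    pi = softmax(H)
    a = rng.choices(range(len(H)), weights=pi)[0]
    r = true_means[a] + rng.gauss(0, 0.1)     # invented reward model
    for b in range(len(H)):
        H[b] += alpha * r * ((1.0 if b == a else 0.0) - pi[b])

rng = random.Random(0)
H = [0.0, 0.0, 0.0]
for _ in range(2000):
    gradient_bandit_step(H, alpha=0.1, true_means=[0.1, 0.9, 0.4], rng=rng)
print(softmax(H))  # most probability mass lands on the best arm (index 1)
```

In practice a baseline (e.g. the average reward so far) is usually subtracted from $R_t$ to reduce the variance of the update.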

Theory

Theorem (Lai and Robbins)

Asymptotically, the total regret of any algorithm is at least logarithmic in the number of steps:

$$\lim_{t \to \infty} L_t \geq \log t \sum_{a \mid \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}\left( \mathcal{R}_a \,\|\, \mathcal{R}_{a^*} \right)}$$

Hard problems have similar-looking reward distributions with different means (small KL divergence, non-zero gap).

Counting Regret

The total regret can be rewritten in terms of the action counts and gaps:

$$L_t = \sum_{n=1}^{t} \left( v_* - q(A_n) \right) = \sum_a N_t(a)\, \Delta_a$$

A good algorithm ensures small counts $N_t(a)$ for actions with large gaps $\Delta_a$.
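The equivalence between the per-step and counting forms of total regret is easy to check numerically; the action values and action sequence below are made up:

```python
# Total regret two ways: summing per-step gaps vs. counting form
# L_t = sum_a N_t(a) * Delta_a.
q = [0.3, 1.0, 0.6]               # invented true action values; v* = 1.0
v_star = max(q)
actions = [0, 1, 1, 2, 1, 0, 1, 2]  # invented action sequence

# Per-step definition: sum of gaps of the chosen actions
L = sum(v_star - q[a] for a in actions)

# Counting form: action counts times gaps
counts = [actions.count(a) for a in range(len(q))]
gaps = [v_star - q[a] for a in range(len(q))]
L_counting = sum(n * d for n, d in zip(counts, gaps))

print(L, L_counting)  # both equal 2.2
```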

Optimism in the Face of Uncertainty

The more uncertain we are about an action's value, the more important it is to explore that action.

Which action should we pick?


Algorithms: UCB

Upper Confidence Bounds

Estimate an upper confidence bound $U_t(a)$ for each action value, such that $q(a) \leq Q_t(a) + U_t(a)$ with high probability, and select the action maximising it:

$$a_t = \arg\max_a \left( Q_t(a) + U_t(a) \right)$$

The bound should depend on the count $N_t(a)$: a small count means an uncertain estimate, hence a large $U_t(a)$.

Theory: the optimality of UCB

Hoeffding’s Inequality

Let $X_1, \ldots, X_n$ be i.i.d. random variables in $[0, 1]$ with true mean $\mu = \mathbb{E}[X]$, and let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ be the sample mean. Then

$$p\left( \bar{X}_n + u \leq \mu \right) \leq e^{-2nu^2}$$

Calculating Upper Confidence Bounds

Applying Hoeffding's inequality to the rewards of action $a$ and solving for the bound at confidence level $p$,

$$e^{-2 N_t(a) U_t(a)^2} = p \quad \Rightarrow \quad U_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}$$

Reducing $p$ over time, e.g. $p = 1/t$, yields the UCB algorithm:

$$a_t = \arg\max_a \left( Q_t(a) + c \sqrt{\frac{\log t}{N_t(a)}} \right)$$
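As an illustrative sketch, UCB can be implemented with an exploration bonus of $c\sqrt{\log t / N_t(a)}$; the Bernoulli reward probabilities and the constant $c$ below are invented:

```python
# UCB action selection on an invented 3-armed Bernoulli bandit.
import math
import random

def ucb_action(Q, N, t, c=1.0):
    """Pick argmax_a Q(a) + c * sqrt(log t / N(a)); untried actions first."""
    for a, n in enumerate(N):
        if n == 0:
            return a  # infinite bonus: try every action at least once
    return max(range(len(Q)),
               key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))

rng = random.Random(0)
true_means = [0.2, 0.5, 0.8]      # invented success probabilities
Q = [0.0] * 3
N = [0] * 3
for t in range(1, 3001):
    a = ucb_action(Q, N, t)
    r = 1.0 if rng.random() < true_means[a] else 0.0  # Bernoulli reward
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]     # incremental average
print(N)  # the best arm (index 2) is pulled most often
```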

Bayesian approaches

Bayesian Bandits

Bayesian approach:

model distributions over values: $p(q(a) \mid \theta_t)$

Bayesian Learning

Example: Bernoulli rewards with a Beta prior over each action's success probability. The posterior after observing a reward is again a Beta distribution, with the success/failure counts incremented.

Bayesian Bandits with Upper Confidence Bounds

E.g., select the action maximising the posterior mean plus a multiple of the posterior standard deviation: $a_t = \arg\max_a \left( Q_t(a) + c\, \sigma_t(a) \right)$.

Thompson sampling

Probability Matching

Probability matching selects each action according to the probability that it is optimal under the posterior:

$$\pi_t(a) = p\left( q(a) = \max_b q(b) \,\middle|\, \theta_t \right)$$

Probability matching is optimistic in the face of uncertainty: actions have higher probability when either the estimated value is high or the uncertainty is high.

It can be difficult to compute $\pi_t(a)$ analytically from the posterior (but it can be done numerically). Thompson sampling implements probability matching by sampling action values from the posterior, $\hat{q}(a) \sim p(q(a) \mid \theta_t)$, and selecting the action with the highest sample.
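Thompson sampling for Bernoulli bandits is easy to sketch with Beta posteriors: sample one value per arm from its posterior and act greedily on the samples. The true success probabilities below are invented:

```python
# Thompson sampling on an invented 3-armed Bernoulli bandit,
# with Beta(alpha, beta) posteriors over each arm's success probability.
import random

rng = random.Random(0)
true_p = [0.3, 0.7, 0.5]      # invented success probabilities
alpha = [1.0] * 3             # Beta(1, 1) = uniform priors
beta = [1.0] * 3

counts = [0] * 3
for _ in range(2000):
    # Sample a value for each action from its posterior,
    # then act greedily on the samples (probability matching).
    samples = [rng.betavariate(alpha[a], beta[a]) for a in range(3)]
    a = max(range(3), key=samples.__getitem__)
    r = 1 if rng.random() < true_p[a] else 0
    alpha[a] += r             # posterior update: success count
    beta[a] += 1 - r          # posterior update: failure count
    counts[a] += 1
print(counts)  # the best arm (index 1) dominates
```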

Planning to explore

Information State Space

Example: Bernoulli Bandits

For Bernoulli bandits, the information state can be summarised by the per-action counts of observed successes and failures.

Solving Information State Space Bandits

E.g., learn a Bayesian reward distribution and plan into the future.

Bayes-adaptive RL optimally trades off exploration and exploitation with respect to the prior distribution.

Extending this to full RL additionally requires a transition model.

All included screenshots credit to the Lecture 2 slides, wiki, and comp0142 lectures.

  • Title: Exploration and Exploitation

  • Author: wy

  • Created at: 2023-07-10 14:57:52

  • Updated at: 2023-08-02 12:58:29

  • Link: https://yuuee-www.github.io/blog/2023/07/10/RL/EE/RLEE/

  • License: This work is licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0).