I. What is RL (Reinforcement Learning)?
One important branch of computer science is AI (Artificial Intelligence). Machine learning is a subcategory of AI that has recently become a hot research area.
Machine learning in general can be classified into three categories:
1) Supervised learning (SL): learning in which both the inputs and the desired outputs are known.
2) Unsupervised learning (USL): learning in which the inputs are known but the outputs are not.
3) Reinforcement learning (RL): learning that falls between the first two categories. The inputs are given but the outputs are not; instead there is a "critic" signal indicating whether the learner's output is correct or wrong. RL is usually framed as the problem of an agent acting in an environment, trying to achieve the best performance over time by trial and error.
The standard model of RL is as follows [3, p.368]: the agent, embedded in its environment, chooses an action a_i, obtains a reward r_i, and moves from state s_i to state s_{i+1}. The goal is to maximize the long-term discounted reward Σ_i γ^i r_i, where γ is called the discount factor.
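As an informal sketch of this interaction loop in Python (the toy chain environment, the random policy, and the value γ = 0.9 are illustrative assumptions, not part of the model in [3]):

# A minimal sketch of the agent-environment loop described above.
# The environment interface (reset/step) and the random policy are
# illustrative assumptions, not taken from any particular library.
import random

GAMMA = 0.9  # discount factor (gamma), an assumed value

class ToyEnvironment:
    """A 5-state chain: action 1 moves right, action 0 moves left.
    Reaching the rightmost state gives reward 1 and ends the episode."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done

env = ToyEnvironment()
state = env.reset()
rewards = []
for t in range(100):                      # interact for at most 100 steps
    action = random.choice([0, 1])        # a placeholder (random) policy
    state, reward, done = env.step(action)
    rewards.append(reward)
    if done:
        break

# Discounted long-term reward: R = sum_i gamma^i * r_i
discounted_return = sum(GAMMA ** i * r for i, r in enumerate(rewards))
print("discounted return:", discounted_return)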
II. Central features and issues of RL
RL dates back to the 1960s, originating from research on dynamic programming and Markov decision processes. Monte Carlo methods are another source of RL methods, learning from stochastic samples drawn from the sample space. A third group of methods specific to RL is the temporal difference methods (TD(λ)), which combine the merits of dynamic programming and Monte Carlo methods; they were developed in the 1980s mostly through the work of Sutton and Barto.
One of the simplest RL problems is the bandit problem. One important RL algorithm is Q-learning, introduced in the next section.
Models of optimal behavior in RL can be classified into 1) the finite-horizon model, 2) the infinite-horizon discounted model, and 3) the average-reward model; rough formulas are sketched below. [1]
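Written out informally (following the notation of [1], with r_t the reward received at step t, h the horizon, and γ the discount factor), the three models roughly correspond to maximizing:
Finite-horizon: E[ r_0 + r_1 + ... + r_h ]
Infinite-horizon discounted: E[ Σ_t γ^t r_t ], with 0 ≤ γ < 1
Average-reward: lim_{h→∞} E[ (1/h)(r_0 + r_1 + ... + r_h) ]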
To measure learning performance, criteria include 1) eventual convergence to optimality, 2) speed of convergence to (near-)optimality, and 3) regret, the expected loss of reward (relative to an optimal policy) incurred while executing the RL algorithm. [1]
The three main categories of RL algorithms are: 1) dynamic programming, 2) Monte Carlo methods, and 3) temporal difference methods. [2]
Ad-hoc strategies used to balance exploration and exploitation in RL include greedy strategies, randomized strategies (e.g., Boltzmann exploration), interval-based techniques, and more; two of these are sketched below. [1]
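As a rough illustration (the function names, the ε value, and the temperature parameter below are assumptions made for the sketch, not taken from [1]), the two most common strategies for choosing an action from estimated action values can be written as:

# Sketch of two common exploration strategies over estimated action
# values q_values[a]; names and parameters are illustrative assumptions.
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, temperature=1.0):
    """Boltzmann (softmax) exploration: higher-valued actions are chosen
    with higher probability; the temperature controls the randomness."""
    prefs = [math.exp(q / temperature) for q in q_values]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]

print(epsilon_greedy([0.2, 0.5, 0.1]))
print(boltzmann([0.2, 0.5, 0.1], temperature=0.5))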
RL algorithms can also be classified into model-free methods and model-based methods. Model-free methods include Monte Carlo methods and temporal difference methods. Model-based methods include dynamic programming, certainty-equivalent methods, Dyna, prioritized sweeping/Queue-Dyna, RTDP (Real-Time Dynamic Programming), the Plexus planning system, etc. These methods are briefly described in [1].
Some of the central issues of RL are:
- Exploration vs. exploitation
This is illustrated by the bandit problem, in which a bandit machine has several levers with different payoff values. The player is given a fixed number of chances to pull the levers, and must balance the number of trials spent searching for the lever with the best payoff against the number of trials spent pulling only that lever (a small simulation sketch is given after this list).
- Online learning
Learning is real-time and continues throughout the entire life of the agent; the agent learns and acts simultaneously. This kind of life-long learning is also called "online learning".
- Immediate vs. delayed reward
An RL agent needs to maximize the expected long-term reward. To achieve this goal the agent must weigh immediate reward against delayed reward, and avoid getting trapped in a policy that is only locally optimal.
- Generalization over input and action
When a model-free method is used to find the strategy (e.g., in Q-learning), the problem is how to apply the learned knowledge to unseen parts of the world. Model-based methods are better in this respect, but they need enough prior knowledge about the environment, which may be unrealistic, and their computational burden suffers from the curse of dimensionality. Model-free methods, on the other hand, require no prior knowledge, but they make inefficient use of collected experience, thus require much more experience to learn, and do not generalize well.
- Partially observable environments
The real world may not allow the agent to have a full and accurate perception of the environment, so partial information often has to guide the agent's behavior.
- Scaling up
So far, available RL algorithms lack a way to scale up from toy applications to real-world applications.
- Principles vs. field knowledge
This is a general problem faced by AI: a general problem solver based on first principles does not exist. Different algorithms are needed to solve different problems. Moreover, field (domain) knowledge is often beneficial, and sometimes necessary, for significantly improving the performance of a solution.
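As a concrete illustration of the exploration/exploitation tradeoff described above, here is a small bandit simulation in Python; the payoff probabilities, the number of pulls, and the ε value are made-up numbers for illustration only:

# Epsilon-greedy play of a 3-lever bandit machine with a fixed budget
# of pulls; all numeric values are illustrative assumptions.
import random

true_payoffs = [0.2, 0.5, 0.8]      # unknown mean payoff of each lever
estimates = [0.0] * len(true_payoffs)
pulls = [0] * len(true_payoffs)
epsilon = 0.1                        # fraction of pulls spent exploring
total_reward = 0.0

for t in range(1000):                # fixed budget of lever pulls
    if random.random() < epsilon:    # explore: try a random lever
        a = random.randrange(len(true_payoffs))
    else:                            # exploit: pull the best lever so far
        a = max(range(len(true_payoffs)), key=lambda i: estimates[i])
    reward = 1.0 if random.random() < true_payoffs[a] else 0.0
    pulls[a] += 1
    # incremental update of the running average payoff estimate
    estimates[a] += (reward - estimates[a]) / pulls[a]
    total_reward += reward

print("estimated payoffs:", [round(e, 2) for e in estimates])
print("total reward:", total_reward)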
III. Q-learning
The Q-learning algorithm, introduced by Watkins in 1989, is rooted in dynamic programming and is a special case of TD(λ) with λ = 0. It handles the discounted infinite-horizon MDP (Markov Decision Process), is easy to implement, is exploration insensitive, and is so far one of the most popular and seemingly most effective model-free algorithms for learning from delayed reinforcement. However, it does not address the scaling problem and may converge quite slowly.
The Q-learning rule is [3, p.373]:
Q(s, a) <- Q(s, a) + α [ r + γ max_a' Q(s', a') - Q(s, a) ]
where: s - current state, s' - next state, a - action, a' - an action available in the next state, r - immediate reward, α - learning rate, γ - discount factor, Q(s, a) - expected discounted reinforcement of taking action a in state s. <s, a, r, s'> is an experience tuple.
The Q-learning algorithm is:
For each s, a, initialize the table entry Q(s, a) <- 0
Observe the current state s
Do forever:
    Select an action a and execute it
    Receive the immediate reward r
    Observe the new state s'
    Update the table entry for Q(s, a) as follows:
        Q(s, a) <- Q(s, a) + α [ r + γ max_a' Q(s', a') - Q(s, a) ]
    s <- s'
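A minimal tabular implementation of this algorithm might look as follows in Python; the chain environment, the number of episodes, and the values of α, γ, and the exploration rate ε are illustrative assumptions, not part of the algorithm itself:

# Tabular Q-learning on a tiny deterministic chain environment.
# Environment and hyper-parameters are illustrative assumptions.
import random
from collections import defaultdict

N_STATES = 6                  # states 0..5; reaching state 5 ends an episode
ACTIONS = [-1, +1]            # move left or right along the chain
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

def step(s, a):
    """Deterministic transition; reward 1 only on reaching the goal state."""
    s2 = max(0, min(N_STATES - 1, s + a))
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r, s2 == N_STATES - 1

Q = defaultdict(float)        # table entries Q(s, a), initialized to 0

for episode in range(500):
    s = 0                     # observe the current (initial) state
    done = False
    while not done:
        # select an action (epsilon-greedy) and execute it
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)                        # receive r, observe s'
        best_next = max(Q[(s2, a2)] for a2 in ACTIONS)  # max_a' Q(s', a')
        # Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

print({(s, a): round(Q[(s, a)], 2) for s in range(N_STATES) for a in ACTIONS})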
The convergence criteria are:
- The system is a deterministic MDP.
- The immediate reward values are bounded by some constant.
- The agent visits every possible state-action pair infinitely often.
IV. Applications of RL
Some example applications of RL include game playing, robotics, elevator control, network routing, and finance.
TD-Gammon is an application of the TD(λ) algorithm to backgammon. It achieved the competence level of the best human players. However, some argue that its designer was already an accomplished backgammon programmer before incorporating the TD(λ) algorithm, and thus borrowed a lot of experience from prior programming knowledge of the game. Indeed, similar success has not been found in other games such as chess or Go. Another attempt that claims very good performance is a trial on Tetris [16].
In robotics, many applications and experiments have already been carried out.
V. Literature
At this time the most important textbook on RL is [2], written by Sutton and Barto in 1998. [3] is a popular machine learning textbook that devotes Chapter 13 to RL. [1] gives a brief introduction to the major topics in RL but lacks detail. [4] discusses the relationship between RL and dynamic programming. A lot of material about RL is available from Sutton's homepage [5, 6, 12] and from the RLAI lab [8] at the University of Alberta, where Sutton resides. The value iteration and policy iteration methods used in RL can be traced back to the work of Howard [19].
References:
[1] Leslie Pack Kaelbling, Michael L. Littman, Andrew W. Moore. Reinforcement Learning: A Survey. (1996)
[2] Richard S. Sutton, Andrew G. Barto. Reinforcement Learning: An Introduction. (1998)
[3] Tom M. Mitchell. Machine Learning. (1997)
[4] A. Barto, S. Bradtke, S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, special volume on computational research on interaction and agency, 72(1), 81-138. (1995)
[5] Homepage of Richard S. Sutton.
[6] Richard S. Sutton. 499/699 courses on Reinforcement Learning. University of Alberta, Spring 2006.
[7] Reinforcement learning - a good source of RL materials and readings online.
[8] RL and AI - the RL community.
[9] RL research repository at UM - a centralized resource for research on RL.
[10] RL introduction warehouse.
[11] RL using NN, with applications to motor control - a PhD thesis (2002, French).
[12] RL FAQ - by Sutton; initiated 8/13/2001, last updated 2/4/2004.
[13] RL and AI research team - iCore, Sutton.
[14] RL research problems - 1) scaling up, 2) partially observable MDPs.
[15] Application of RL to dialogue strategy selection in a spoken dialogue system for email. (2000)
[16] RL Tetris example - few applications of RL seem to exist besides backgammon; here is an attempt with Tetris, with seemingly good results. (1998)
[17] Q-learning by examples - numeric examples, Tower of Hanoi, using Matlab, Excel, etc.
[18] RL course website - Utrecht University, Spring 2006.
[19] Ronald A. Howard. Dynamic Programming and Markov Processes. (1960)