Event-learning and robust policy heuristics
A. Lőrincz, I. Pólik and I. Szita
Technical Report
NIPG-ELU-14-05-2001
Abstract
In this paper we introduce a novel form of reinforcement learning called event-learning or E-learning. Events are ordered pairs of consecutive states, and we define the corresponding event-value function. We derive learning rules that are guaranteed to converge to the optimal event-value function. Combining our method with a known robust control method, the SDS algorithm, we introduce Robust Policy Heuristics (RPH). It is shown that RPH, a fast-adapting non-Markovian policy, is particularly useful for coarse models of the environment and for partially observed systems. Fast adaptation may make it possible to separate the time scale of learning to control a Markovian process from the time scale of adaptation of a non-Markovian policy. In our E-learning framework, the definition of modules is straightforward. E-learning is well suited for policy switching and planning, whereas RPH alleviates the 'curse of dimensionality' problem. Computer simulations of a two-link pendulum with coarse discretization and a noisy controller are presented to demonstrate the underlying principle.
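Since the abstract names the event-value function but does not state the learning rule, the following minimal Python sketch only illustrates the idea of a tabular learner over events, i.e. ordered pairs (state, desired next state), using a Q-learning-style update as an analogy. All names and parameter values here (EventLearner, alpha, gamma, epsilon) are assumptions for illustration and are not taken from the report.

```python
# Illustrative sketch only: a tabular value function over events (state pairs),
# updated with a Q-learning-style rule by analogy; not the report's derived rule.
import numpy as np

class EventLearner:
    """Tabular learner over events, i.e. ordered pairs (state, desired next state)."""

    def __init__(self, n_states, alpha=0.1, gamma=0.95):
        self.E = np.zeros((n_states, n_states))  # event-value table E[x, y]
        self.alpha = alpha                        # learning rate (assumed)
        self.gamma = gamma                        # discount factor (assumed)

    def select_desired_state(self, x, epsilon=0.1):
        """Epsilon-greedy choice of the desired successor state for state x."""
        if np.random.rand() < epsilon:
            return np.random.randint(self.E.shape[1])
        return int(np.argmax(self.E[x]))

    def update(self, x, y_desired, y_actual, reward):
        """Update the value of the attempted event (x, y_desired),
        bootstrapping from the best event available at the state actually reached."""
        target = reward + self.gamma * np.max(self.E[y_actual])
        self.E[x, y_desired] += self.alpha * (target - self.E[x, y_desired])
```

In the full framework, an inner controller (such as the SDS-based RPH mechanism mentioned above) would be responsible for actually steering the system toward the desired successor state; the sketch abstracts that part away.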