Reinforcement Learning: An Introduction

Kybernetes

ISSN: 0368-492X

Article publication date: 1 December 1998


Citation

Andrew, A.M. (1998), "Reinforcement Learning: An Introduction", Kybernetes, Vol. 27 No. 9, pp. 1093-1096. https://doi.org/10.1108/k.1998.27.9.1093.3

Publisher

Emerald Group Publishing Limited


This book introduces a new approach to the study of systems, living or artificial, that can learn from experience, and in so doing it indicates a new focus of activity within artificial intelligence. The approach is new insofar as this is the first comprehensive account to appear, but references are given to earlier papers in which the ideas were developed, back to at least 1982 for the particular approach now given, and very much earlier for the origins of specific aspects.

The idea of machines that would learn from experience is of course quite old and there have been speculations by Shannon, Turing and Ashby, among others. Principles that underlie the present approach are used in the famous checker‐playing program of Samuel, and there is reference to early discussions by Minsky, as well as to the pioneering work of a number of Russian workers including Tsetlin and Feldbaum. (Some ideas were given an airing at the 1960 IFAC Congress in Moscow.)

A capacity for learning is a feature of most schemes for artificial neural nets. Neural nets receive attention in the present book, but the main concern is overall principles of operation, with the comment that neural nets may or may not prove to be the most appropriate implementation. (The earlier papers by the authors are somewhat more strongly oriented towards neural‐net implementations.)

It is refreshing to see that the general theory of learning systems is progressing, at a fundamental level. It can be seen as a refutation of the claim of some workers in AI, notably Schank (1985), that learning is only usefully studied within an expert‐system, or knowledge‐based framework. The aspects studied within this framework are also valid, and grist to the mill, but not to an extent that warrants the attempt to hijack learning theory.

Previous studies of machine learning have tended to look to pattern classification as a source of tasks of a suitable level of difficulty, and the majority have simplified matters still further by assuming that learning would be “with a teacher” rather than by interaction with a raw environment. In particular, most work on neural nets is directed at pattern classification in the “with a teacher” situation. The limitations of this approach are discussed in the book, where instead learning is viewed as interaction between the learning agent and an incompletely‐known environment, in such a way as to maximise, in the long term, a reward obtainable from the environment.

This view of learning has been adopted by many previous workers, and Selfridge coined the attractive term “hedony” to denote the reward signal. It has been usual to see the reward as a simple quantitative feedback, even though it was sometimes acknowledged that it could be subject to unknown delay.

The term “reinforcement learning” is now used to denote a more general approach, with the environment represented in finite‐automaton terms, so that an action applied to it in a particular state produces a (perhaps stochastic) immediate reward and a (perhaps stochastic) change of state. The long‐term return from the action is the sum of the immediate reward and the return achievable from the new state. The return obtainable from each state, or “value” of the state, is therefore evaluated by a form of bootstrapping, since the value of any one state depends on the values of others that can be reached from it.
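
For concreteness, the bootstrapping relation just described can be written as a Bellman-style equation. The notation below (π for the policy, γ for a discount factor, P and R for the transition and reward model) is standard but need not match the book's own symbols:

```latex
% Value of state s under policy \pi: the expected immediate reward plus the
% (discounted) value of the successor state -- the bootstrapping relation.
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)
\left[ R(s, a, s') + \gamma \, V^{\pi}(s') \right]
```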

The return achievable from a state, or “value” of the state, depends on the policy according to which actions are selected as a function of state. The general idea is the basis of dynamic programming, but, unlike it, the learning procedure operates on an unknown environment and works to determine state values (or in some versions values associated with state‐action pairs) and from these values to derive favourable policies. Since the state values are functions of policy, and policy improvement depends in turn on the values, policies and valuations have to evolve jointly. Algorithms are described which allow such joint improvement and for most of them the reader is assured of the existence of a convergence proof.
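
As an illustration only, the following sketch shows one well-known instance of this joint improvement, a Q-learning-style update (one of the algorithms covered in the book), in which state-action values and the policy read off them evolve together while interacting with an unknown environment. The `env` object, with `reset()` and `step()` methods, is a hypothetical stand-in for such an environment, not anything prescribed by the book:

```python
# Hedged sketch, not taken from the book: Q-learning-style joint improvement
# of state-action values and the (epsilon-greedy) policy derived from them.
# `env` is a hypothetical environment with reset() -> state and
# step(action) -> (next_state, reward, done).
import random
from collections import defaultdict

def learn(env, n_actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    q = defaultdict(float)                        # value of each (state, action) pair

    def policy(state):                            # policy derived from current values
        if random.random() < eps:                 # occasional exploratory action
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: q[(state, a)])

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            best_next = max(q[(next_state, a)] for a in range(n_actions))
            # bootstrapped target: observed reward plus the estimated return
            # achievable from the successor state
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```

Because the policy is always read off the current value estimates, improving the values and improving the policy proceed together, exactly the joint evolution described above.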

It may be useful to mention that a similar alternation of evaluation and policy‐adjustment, though in a known environment and therefore coming under the heading of straightforward dynamic programming, is given in an introductory text by Andrew (1985), following Dreyfus (1961).

This approach, corresponding closely to dynamic programming, is contrasted with the alternative referred to as the Monte-Carlo method. The latter is only applicable to the learning of tasks that are episodic, in that they come to a natural end, as does a play of a board game. The term is used in a rather unusual sense, as it is applied to observations on real-world events as well as to simulations depending on random or pseudo-random number generation. It essentially denotes derivation of state values without bootstrapping.
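
A minimal sketch of the idea, assuming an episodic task: the value of a state is simply the average of the complete returns observed after visiting it, with no bootstrapping from other estimates. The `run_episode` function, which returns a finished trajectory, is hypothetical:

```python
# Hedged sketch of first-visit Monte-Carlo state-value estimation.
# `run_episode()` is assumed to return a list of (state, reward) pairs in
# time order, where `reward` is the reward received after leaving `state`.
from collections import defaultdict

def monte_carlo_values(run_episode, episodes=1000, gamma=1.0):
    totals = defaultdict(float)    # sum of returns observed from each state
    counts = defaultdict(int)      # number of first visits to each state

    for _ in range(episodes):
        trajectory = run_episode()             # episode run to its natural end
        g = 0.0
        returns = []
        for state, reward in reversed(trajectory):
            g = reward + gamma * g             # full return from this point onward
            returns.append((state, g))
        seen = set()
        for state, g in reversed(returns):     # forward order, first visit only
            if state not in seen:
                seen.add(state)
                totals[state] += g
                counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}
```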

A third form of operation is then introduced, having features of both the dynamic-programming and Monte-Carlo versions. It is termed temporal-difference learning, and it is introduced by observing that data may become available during interaction with the environment that warrant revision of an earlier estimate of some quantity. The introductory example is the estimate of the time taken to travel home from work, which will be revised during the journey if a particular leg of the route takes more (or less) time than expected.
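
The travel-home example corresponds to a very small update rule: the estimate for the current state is moved toward the reward actually observed plus the current estimate for the next state, so a revision can be made as soon as one leg of the journey is complete. A hedged sketch, with invented state names and numbers:

```python
# Minimal TD(0) sketch, not the book's own code.  `values` maps each state
# to a current estimate; one update nudges the estimate for `state` toward
# the observed reward plus the estimate for `next_state`.
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=1.0):
    v_s = values.get(state, 0.0)
    target = reward + gamma * values.get(next_state, 0.0)   # revised estimate, available mid-journey
    values[state] = v_s + alpha * (target - v_s)
    return values

# Hypothetical travel-home usage: values are predicted minutes remaining,
# the reward is the time a leg actually took.
values = {"office": 30.0, "highway": 20.0, "home": 0.0}
td0_update(values, "office", reward=15.0, next_state="highway")  # this leg took 15 minutes
```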

These considerations provide the basis for powerful methods of reinforcement learning. It is acknowledged that not all environments fit comfortably into the essentially finite-automaton representation adopted: often the number of states will be unmanageably large. In these cases some mathematical function can be used as an approximation, and the reinforcement-learning principles can be adapted. All of this receives attention in the book, as do means of modifying the model as experience is gained, as distinct from operation to improve the fit of a given model.
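
When the state space is too large for a table, the same bootstrapped target can be used to adjust the parameters of an approximating function rather than a table entry. The following linear, semi-gradient sketch assumes NumPy and a hand-crafted feature vector for each state, neither of which is prescribed by the book:

```python
# Hedged sketch of value-function approximation: v(s) is approximated as a
# linear function of a feature vector x(s), and the TD update adjusts the
# weight vector instead of a table entry.
import numpy as np

def semi_gradient_td0(weights, features, reward, next_features, alpha=0.01, gamma=0.9):
    """One linear semi-gradient TD(0) step, with v(s) = weights . features."""
    v = weights @ features
    v_next = weights @ next_features
    td_error = reward + gamma * v_next - v           # same bootstrapped target as before
    return weights + alpha * td_error * features     # gradient of v w.r.t. weights is x(s)
```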

The power of the method is illustrated by description of a number of case studies, in which Samuel’s early work on checker (or draughts) playing is cited as exemplifying the general principles, especially of bootstrapping of state‐values. Much more recently the methods have been used in constructing a powerful program to play backgammon. Another example described is their application to design of a despatching system for lifts in a multistorey building, and others refer to dynamic channel allocation in a communication network, and to job‐shop scheduling. In all of these applications, impressive new results have been achieved, sometimes by taking substantial liberties with the basic methods.

A particularly interesting area of application is to robotics, and one case study is in this area. A reference is given (Connell and Mahadevan, 1993) to a publication in which robotics applications of reinforcement learning are treated more fully.

The treatment is accessible without great demands on the reader’s knowledge of mathematics. At many points, more difficult mathematical treatments are referred to, for example the convergence proofs mentioned earlier. The whole book is, in fact, admirably presented and is a magnificent piece of work. Attention is given to the historical development of the topic, and there is a large and valuable, though inevitably far from exhaustive, collection of references.

One minor criticism, with regard to the mathematics, is that although a useful summary of notation has been provided, it is tucked away between the references and the subject index and is easily overlooked. The symbols are, of course, defined when first used in the text and are mostly fairly obvious, but my initial perusal, when I had not noticed the summary of notation, was hampered by confusion between the curly‐R representing the set of real numbers, and a rather similar‐looking character (though with a subscript and superscript) indicating an expected immediate reward. Reference to the summary of notation would have clarified matters immediately.

Another comment is that the treatment uses old‐fashioned classical probability theory, although much recent discussion, notably by Klir (1989), indicates the value of other ways of dealing with uncertainty. Perhaps future developments will encompass these, though there is no clear indication in the present study of inadequacy of the classical viewpoint.

This is definitely an extremely important and well‐prepared work.

References

Andrew, A.M. (1985), Computational Techniques in Operations Research, Abacus Press, Tunbridge Wells, pp. 123-33.

Connell, J. and Mahadevan, S. (1993), Robot Learning, Kluwer, Boston.

Dreyfus, S. (1961), "Dynamic programming", in Ackoff, R.L. (Ed.), Progress in Operations Research, Vol. 1, Wiley, Ch. 5, pp. 211-42.

Klir, G.J. (1989), "Is there more to uncertainty than some probability theorists might have us believe?", Int. J. General Systems, Vol. 15, pp. 347-78.

Schank, R.C. (1985), "Looking at learning", in Steels, L. and Campbell, J.A. (Eds), Progress in Artificial Intelligence, Ellis Horwood, Chichester, pp. 17-29.
