Information Theory, Inference, and Learning Algorithms


ISSN: 0368-492X

Article publication date: 1 August 2004




Andrew, A.M. (2004), "Information Theory, Inference, and Learning Algorithms", Kybernetes, Vol. 33 No. 7, pp. 1217-1218.



Emerald Group Publishing Limited

Copyright © 2004, Emerald Group Publishing Limited

This is a quite remarkable work to have come from a single author and the three topic areas of its title actually fall short of indicating its full breadth of coverage. The book is physically quite weighty, at just over 1.5 kg. The print size is smaller than usual, so there is a great deal of material here, even though the text is interspersed with numerous illustrations.

The author's aims in writing the book are explained in his Preface:

This book is aimed at senior undergraduates and graduate students in Engineering, Science, Mathematics, and Computing. It expects familiarity with calculus, probability theory, and linear algebra as taught in a first‐ or second‐year course on mathematics for scientists and engineers.

Conventional courses on information theory cover not only the beautiful theoretical ideas of Shannon, but also practical solutions to communication problems. This book goes further, bringing in Bayesian data modelling, Monte Carlo methods, variational methods, clustering algorithms, and neural networks.

Why unify information theory and machine learning? Because they are two sides of the same coin. In the 1960s, a single field, cybernetics, was populated by information theorists, computer scientists, and neuroscientists, all studying common problems. Information theory and machine learning still belong together. Brains are the ultimate compression and communication systems. And the state‐of‐the‐art algorithms for both data compression and error‐correcting codes use the same tools as machine learning.

A great range of topics is treated following this motivation, in a total of no fewer than 50 chapters and three appendices. The respective topics are interlinked, and indications are given about which should be regarded as prerequisites for others, and which might be omitted in a first reading. Apart from the hierarchical or “prerequisite” links, however, there are others that have implications for the structure of the subject area. For example, the treatment of statistical inference is linked, appropriately, to learning in neural nets. The treatment everywhere is mathematical and getting to grips with it is a major task, but the author helps matters by writing in a chatty and often amusing style. In one of the reviewers' comments quoted on the back cover it is suggested that readers will enjoy the presentation so much that each will want to have two copies, one in the office and one at home for entertainment.

The treatment is especially valuable because the author has made it completely up‐to‐date. In particular, the book describes very recent methods for error‐detecting and error‐correcting coding that can be considered state‐of‐the‐art. Consistent with this aim of keeping abreast of developments is a large number of references to Internet sources, scattered through the text, as well as a large conventional bibliography at the end.

In spite of the author's care to integrate his topics, the book will certainly be useful as a source book for piecemeal reference, for example to the descriptions of the latest advances in error‐correcting codes. Of the three appendices, one defines the mathematical notation used throughout, and the other two are entitled, respectively, “Some Physics” and “Some Mathematics”; although they are quite short, the book would be well worth consulting for these alone. The physics appendix gives an admirable discussion of the phase transitions that underlie the Ising models of statistical physics, often referred to in connection with neural nets. The mathematics appendix gives valuable introductions to both Galois field theory and linear algebra.

The book has seven parts, of which the last holds only the appendices and each of the others has between 4 and 18 chapters. Some idea of the topics covered is given by the six headings: (I) Data Compression; (II) Noisy‐Channel Coding; (III) Further Topics in Information Theory; (IV) Probabilities and Inference; (V) Neural Networks; (VI) Sparse Graph Codes.

Although it is helped along by chatty remarks and wide‐ranging comments, the treatment is essentially mathematical throughout. In connection with one of the topics treated, namely “Supervised Learning in Multilayer Networks” in Chapter 44 within Part V, it has been argued (Andrew, 2001) that the strictly mathematical approach may obscure wider considerations of structure. This is a very minor, and probably controversial, criticism of the approach adopted in the book. The concentration on mathematics actually increases the value of the book as a reference source since for most people the mathematical parts of theories are the hardest to grasp.

The book is extremely valuable in introducing a new integrated viewpoint, and it is clearly an admirable basis for taught courses, as well as for self‐study and reference. I am very glad to have it on my shelves.

Alex M. Andrew


Andrew, A.M. (2001), “Backpropagation”, Kybernetes, Vol. 30 Nos 9/10, pp. 1110-17.
