Content‐based Video Retrieval: A Database Perspective

Peter Enser (University of Brighton, UK)

Journal of Documentation

ISSN: 0022-0418

Article publication date: 1 October 2004




Enser, P. (2004), "Content‐based Video Retrieval: A Database Perspective", Journal of Documentation, Vol. 60 No. 5, pp. 586-588.



Emerald Group Publishing Limited

Copyright © 2004, Emerald Group Publishing Limited

The purpose of this modest text is to report on the application of techniques for adding content‐based video retrieval functionality to a database management system. The work is set within the context of the semantic gap, interpreted here as the problem of inferring semantics from raw video data.

The semantic gap is a feature of the visual information retrieval landscape which has begun to excite a great deal of attention among the research community in computer science. The authors are members of that community. They bring a pragmatic approach to the task of narrowing the semantic gap, recognising that, in the short term at least, solutions are only available in highly constrained domains of knowledge.

Having established the context, the authors proceed to an overview of databases, covering fundamental concepts and architecture, followed by a description of the database management system (the Moa‐Monet database platform) which is later used to host the content‐based video retrieval functionality. This chapter also offers a brief characterisation of information retrieval and the measurement of system performance. There is a much more developed explanation of the principles and practice of video retrieval, in which the multimodal nature of video data is emphasised.

Retrieval techniques are clearly described, in the main. The problems associated with textual annotation are expressed succinctly, and content‐based techniques are quite clearly explained, together with mixed‐mode approaches. The chapter concludes with a brief description of MPEG‐7, followed by a summary. This latter feature is present in each chapter, and provides in each case a good summarisation of the material covered in the chapter. There is also a comprehensive set of references at the conclusion of each chapter.

The third chapter deals with modelling video data. Syntactic segmentation of the video stream into shots, using automatic shot boundary detection, is considered, followed by semantic segmentation of shots into scenes or of scenes into story units. The difficulty of achieving the latter is emphasised, since it requires domain knowledge or user interaction and the application of human intellect, whereas the extraction of visual features can often be done automatically and is commonly domain independent. Experimental systems for both still and moving images are cited, but there is a tendency to reproduce screen displays which are too small to be legible.
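To make the idea of automatic shot boundary detection concrete, the sketch below uses a common baseline: a new shot is assumed to start wherever the grey-level histograms of two consecutive frames differ by more than a threshold. This is an illustration only and is not taken from the book. The frame representation (a flat list of grey levels), the function names, the number of bins and the threshold are all assumptions made for the example.

```python
from typing import List, Sequence

# Assumption: a frame is a non-empty flat sequence of grey levels in 0..255.
Frame = Sequence[int]


def histogram(frame: Frame, bins: int = 8) -> List[float]:
    """Return the normalised grey-level histogram of one frame."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    total = len(frame)
    return [count / total for count in counts]


def shot_boundaries(frames: Sequence[Frame], threshold: float = 0.5) -> List[int]:
    """Return each index i at which a new shot is assumed to start.

    The threshold is a hypothetical tuning parameter, not a published value.
    """
    if not frames:
        return []
    boundaries = []
    previous = histogram(frames[0])
    for i in range(1, len(frames)):
        current = histogram(frames[i])
        # Half the L1 distance between the histograms, so it lies in [0, 1].
        distance = sum(abs(a - b) for a, b in zip(previous, current)) / 2
        if distance > threshold:
            boundaries.append(i)
        previous = current
    return boundaries


# Two dark frames followed by two bright frames: one cut, at frame 2.
toy_frames = [[10] * 16, [12] * 16, [200] * 16, [205] * 16]
cuts = shot_boundaries(toy_frames)
```

A hard threshold of this kind only finds abrupt cuts. Gradual transitions such as fades and dissolves need more than a comparison of adjacent frames.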

The Cobra (Content‐Based RetrievAl) modelling framework is presented as a means of integrating low level (feature‐based) and high level (annotation‐based) representations of video content. A case study in the form of tennis court action footage is used to instantiate it. The aim is to automatically extract events like net‐playing, rally and longest point.

In the succeeding chapter the Cobra framework is extended by means of object and event grammars which seek to formalise (spatio‐temporal) descriptions of high‐level concepts. The tennis case study is used to show how the spatio‐temporal approach can be used to map features into high‐level events like net‐play, so that rallies, lobs, longest points and such like can be automatically extracted from tennis videos. The creation of primitive object and event descriptions in the object grammar is very labour intensive, and the formalisation of “complex actions of non‐rigid objects involving many objects and features” is demanding (p. 69); the authors acknowledge that a naïve user might find it very difficult to create either type of description.
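The Cobra grammar notation is not reproduced in this review, so the following is only a rough sketch of the kind of spatio‐temporal rule involved: a "net‐play" event is declared when a tracked player stays close to the net for a minimum number of consecutive frames. The input (one player‐to‐net distance per frame), the function name and both thresholds are hypothetical.

```python
from typing import List, Sequence, Tuple


def net_play_intervals(
    distances: Sequence[float],
    near: float = 3.0,       # hypothetical "close to the net" distance
    min_frames: int = 3,     # hypothetical minimum duration, in frames
) -> List[Tuple[int, int]]:
    """Return (first_frame, last_frame) pairs for each net-play event.

    An event is a run of at least `min_frames` consecutive frames in which
    the player's distance to the net is at most `near`.
    """
    intervals = []
    start = None  # index at which the current run of "near" frames began
    for i, distance in enumerate(distances):
        if distance <= near:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_frames:
                intervals.append((start, i - 1))
            start = None
    # Close a run that is still open when the sequence ends.
    if start is not None and len(distances) - start >= min_frames:
        intervals.append((start, len(distances) - 1))
    return intervals


# The single near frame at index 6 is too short to count as an event.
toy_distances = [9, 8, 2.5, 2, 1, 7, 2, 8, 1, 1, 1]
events = net_play_intervals(toy_distances)
```

Even this toy rule needs two hand‐chosen thresholds, which gives some sense of why writing such descriptions for complex actions is labour intensive.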

In a bid to address these difficulties attention shifts to stochastic modelling in the following chapter. Hidden Markov Models (HMMs) and Bayesian belief networks are proposed since they have been successfully used in comparable problem contexts. Both techniques are described succinctly, but the reader will need to be armed with a knowledge of probability theory and notation in order to follow the argument.

The tennis case study is used to illustrate the use of HMMs for stroke recognition, e.g. forehand, backhand, smash, etc. Television broadcast videos from a number of tournaments are used as the experimental base. The specification and implementation of these stochastic processes is a considerable challenge (the mere segmentation, or differentiation, of the player from the tennis court poses difficulties), and the authors are very honest about the robustness problems encountered with such techniques.
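For readers unfamiliar with the approach, a standard way of using HMMs for classification is to train one model per class and assign an observation sequence to the model under which it is most likely, with the likelihood computed by the forward algorithm. The sketch below shows that general scheme with two tiny hand‐made models. The states, the observation symbols and all probabilities are invented for illustration and do not come from the book.

```python
from typing import Dict, List, Sequence


class HMM:
    """A discrete hidden Markov model with states 0..n-1."""

    def __init__(
        self,
        start: List[float],        # start[s] = P(first state is s)
        trans: List[List[float]],  # trans[r][s] = P(next state s | state r)
        emit: List[List[float]],   # emit[s][o] = P(symbol o | state s)
    ) -> None:
        self.start = start
        self.trans = trans
        self.emit = emit


def likelihood(model: HMM, observations: Sequence[int]) -> float:
    """Forward algorithm: P(observations | model), for a non-empty sequence.

    Raw probabilities are used, which is adequate for short sequences only;
    long sequences would need scaling or log probabilities.
    """
    n = len(model.start)
    alpha = [model.start[s] * model.emit[s][observations[0]] for s in range(n)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[r] * model.trans[r][s] for r in range(n)) * model.emit[s][obs]
            for s in range(n)
        ]
    return sum(alpha)


def classify(models: Dict[str, HMM], observations: Sequence[int]) -> str:
    """Return the name of the model that gives the highest likelihood."""
    return max(models, key=lambda name: likelihood(models[name], observations))


# Hypothetical symbols: 0 = racket on one side of the body, 1 = on the other.
transitions = [[0.9, 0.1], [0.1, 0.9]]
models = {
    "forehand": HMM([0.5, 0.5], transitions, [[0.2, 0.8], [0.1, 0.9]]),
    "backhand": HMM([0.5, 0.5], transitions, [[0.8, 0.2], [0.9, 0.1]]),
}
label = classify(models, [1, 1, 1])
```

In a real system the observation symbols would be derived from features of the segmented player, which is where the robustness problems mentioned above arise.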

A second strand of experimentation is described in which video footage of Formula 1 motor racing is used. The aim in this case was to achieve automatic detection of race highlights and specific events (start, vehicle passing and vehicle fly out). One line of investigation used mono‐media audio analysis, seeking to find soundtrack segments containing excited speech and/or the use of specific keywords by the commentator or announcer. The authors also report on the use of an integrated audio‐visual fusion technique (audio, video and video‐embedded text) based on dynamic Bayesian networks. This showed that high recall and precision values for race highlights and specific events could be achieved, although the technique appears to be very sensitive to the characteristics of the camera work.

A prototype video database management system provides the host architecture for the implementation of these techniques for automatic detection of complex events in video sequences. Of note is the section on querying, which indicates that the system can only address a highly constrained set of pre‐defined queries. As with many other such experimental, content‐based systems, the experimentation prescribes the user transactions, such are the intellectual and computational demands of the formal modelling of information retrieval scenarios. In the authors’ defence, they do acknowledge that much greater functionality is to be gained by the integration of a content‐based search facility with a concept‐based (text‐based) search engine operating on a digital library.

The concluding chapter provides a useful summary of the work as a whole, and recommendations for future research include relevance feedback, video summarization, query representation and multimedia query language.

The book is published by Kluwer as part of their International Series in Engineering and Computer Science. Why the publisher appears to have permitted the work to go to print without copy‐editing it is a mystery: the work is littered with examples of poor English. That apart, this book has the look and feel of a good PhD thesis. It addresses a topic which is very much in vogue, and it justifies its claim to show that the semantic gap can be narrowed in restricted domains of knowledge.
