It has been said that the newspaper industry in South Africa publishes about 80% of the electronic text created in our country. What happens to that text? What is the value of this information to the newspaper industry as well as to the man in the street? This paper will describe the current situation and explain how the ‘morgue’ in a newspaper company is currently used. It will also discuss the future directions that newspapers intend to follow. New technology such as text retrieval systems, CDROM technology and the Internet may change the face of the newspaper library forever. The paper will also discuss the technological challenges of establishing an electronic newspaper archive, the information requirements of journalists and how they use online information, and new developments in technology that may in the future give us an electronic newspaper personalised according to our specific information needs.
This paper will discuss the integration of document image processing and text retrieval principles in order to process and load existing paper documents automatically into an electronic document database that broadens the user's capability to retrieve relevant information more accurately, without going through costly processes to get paper documents into electronic text. The principles of document image processing systems, as well as the problems and shortcomings of most of today's document image processing systems, will be discussed. Then concept retrieval, as the latest development in text retrieval, will be discussed, with specific reference to the ability of the TOPIC intelligent text retrieval system to allow users to build up a knowledge base of search objects or concepts that can be used at any point in time by all users of the system. The paper will further look specifically at the automatic processing of paper documents by converting the scanned document image pages to electronic text. It will discuss the use of optical character recognition (OCR) technology, the indexing and loading of the documents in a text database, the automatic linking of the documents to the related document images, and the retrieval technology available in TOPIC, specifically the TYPO operator, which was developed to handle so-called dirty data such as common misspellings, character transpositions and the ‘dirty’ text received as output from the OCR process. A possible solution for loading paper documents quickly and cost-effectively into an electronic document database will be discussed and demonstrated in detail. The advantages and disadvantages of this approach will be discussed with specific reference to an electronic news clipping service application.
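The idea of retrieving documents despite misspellings, character transpositions and OCR noise can be illustrated with a small sketch. The abstract does not describe how the TYPO operator works internally, so the code below is not TOPIC's method. It swaps in a standard technique, the optimal string alignment distance (edit distance that also counts a swap of two adjacent characters as one edit), and the names `osa_distance`, `typo_search` and `max_edits` are hypothetical, chosen only for illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Counts insertions, deletions, substitutions and swaps of two
    adjacent characters, each as a single edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "pe" read as "ep".
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def typo_search(query: str, words: list[str], max_edits: int = 1) -> list[str]:
    """Return the indexed words within max_edits of the query (hypothetical helper)."""
    q = query.lower()
    return [w for w in words if osa_distance(q, w.lower()) <= max_edits]


# Hypothetical index terms, as they might come out of an OCR pass.
indexed = ["archive", "archlve", "arcihve", "article"]
matches = typo_search("archive", indexed)
```

With a tolerance of one edit, the query "archive" matches the clean term, the OCR-style substitution "archlve" and the transposition "arcihve", but not the unrelated word "article". A production system would normally apply such matching against an index rather than scanning every term.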
Online & CDROM Review here offers abstracts of the papers presented at the Second Southern African Online Information Meeting, held in Pretoria on 2–4 June 1993. The full Proceedings are published in a special edition of our sister journal, The Electronic Library, August/October 1993, vol. 11, no. 4/5.