Search results

11 – 20 of over 1000
Article
Publication date: 1 July 2014

Wen-Feng Hsiao, Te-Min Chang and Erwin Thomas

Abstract

Purpose

The purpose of this paper is to propose an automatic metadata extraction and retrieval system to extract bibliographical information from digital academic documents in portable document formats (PDFs).

Design/methodology/approach

The authors use PDFBox to extract text and font size information, a rule-based method to identify titles, and a hidden Markov model (HMM) to extract the titles and authors. Finally, the extracted titles and authors (possibly incorrect or incomplete) are sent as query strings to digital libraries (e.g. ACM, IEEE, CiteSeerX, SDOS and Google Scholar) to retrieve the rest of the metadata.
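As a rough illustration of how font size information can drive a rule-based title guess, the sketch below picks the first run of consecutive lines set in the largest font. It is not the authors' rule set, which the abstract does not describe. The function name `guess_title`, the `min_chars` threshold and the sample page are all hypothetical, and the (text, font size) pairs are assumed to have been extracted beforehand by a tool such as PDFBox.

```python
from typing import List, Tuple

Line = Tuple[str, float]  # (line text, font size in points); assumed input shape


def guess_title(lines: List[Line], min_chars: int = 5) -> str:
    """Return the first run of consecutive lines set in the largest font size.

    Hypothetical heuristic: very short lines are ignored, and the title is
    assumed to be the largest text on the first page.
    """
    candidates = [(text.strip(), size) for text, size in lines
                  if len(text.strip()) >= min_chars]
    if not candidates:
        return ""
    largest = max(size for _, size in candidates)
    title_parts = []
    for text, size in candidates:
        if abs(size - largest) < 0.1:   # same font size, within tolerance
            title_parts.append(text)
        elif title_parts:               # the run of title lines has ended
            break
    return " ".join(title_parts)


# Made-up first-page lines, for illustration only.
sample_page = [
    ("Running journal header", 8.0),
    ("Automatic metadata extraction", 18.0),
    ("from academic PDF documents", 18.0),
    ("A. Author and B. Author", 11.0),
    ("Abstract", 10.0),
]
print(guess_title(sample_page))
```

A heuristic like this only produces a candidate; in the system described above, a sequence model and an online query then refine or correct it.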

Findings

Four experiments are conducted to examine the feasibility of the proposed system. The first experiment compares two HMM models: a multi-state model and a one-state model (the proposed model). The results show that the one-state model performs comparably to the multi-state model but is better suited to handling unknown states in real-world data. The second experiment shows that the proposed model (without the aid of online queries) performs as well as other researchers' models on the Cora paper header dataset. The third experiment examines the performance of the system on a small dataset of 43 real PDF research papers. The results show that the proposed system (with online queries) performs well on bibliographical data extraction and even outperforms the free citation management tool Zotero 3.0. Finally, the fourth experiment uses a larger dataset of 103 papers to compare the system with Zotero 4.0. The results show that the system significantly outperforms Zotero 4.0, which supports the feasibility of the proposed model.

Research limitations/implications

In terms of academic implications, the system is unique in two respects. First, it uses only the Cora header set for HMM training, without other tagged datasets or gazetteer resources, which makes the system lightweight and scalable. Second, the system is workable and can be applied to extract metadata from real-world PDF files. The extracted bibliographical data can then be imported into citation software such as EndNote or RefWorks to increase researchers’ productivity.

Practical implications

In terms of practical implications, the system can outperform an existing tool, Zotero 4.0. This gives practitioners a good opportunity to develop similar products for real applications, although doing so might require some knowledge of HMM implementation.

Originality/value

The HMM implementation itself is not novel. What is innovative is that the system combines two HMM models: the main model is adapted from Freitag and McCallum (1999), and the authors add to it the word features of the Nymble HMM (Bikel et al., 1997). The system is workable even without manually tagging datasets before training the model (the authors train on the Cora dataset only and test on real-world PDF papers), which differs significantly from other work so far. The experimental results provide sufficient evidence of the feasibility of the proposed method in this respect.

Details

Program, vol. 48 no. 3
Type: Research Article
ISSN: 0033-0337

Article
Publication date: 1 April 2001

D.C. Veal Doverton

Abstract

Present and possible future developments in the techniques of document management are reviewed, the major ones being text retrieval and scanning and OCR. Acquisition, indexing and thesauri, publishing and dissemination and the document management industry are also addressed. The emerging standards are reviewed and the impact of the Internet is analysed.

Details

Journal of Documentation, vol. 57 no. 2
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 1 August 2004

Wang Shaofeng

Abstract

This paper mainly discusses the author's prototype implementation of Java‐based electronic publishing system (JEPS) that facilitates the creation and delivery of electronic documents with Java technology. JEPS packages the document and viewer in a Java applet. The documents can be viewed on any computer platform with the identical content and style. This paper describes the framework of JEPS and compares JEPS with other Web publishing technologies such as PDF and XML. This paper concludes by considering the potential opportunities and prospects that JEPS provides in the area of electronic publishing over the Internet.

Details

The Electronic Library, vol. 22 no. 4
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 1 February 2002

A.C.M. Fong, S.C. Hui and H.L. Vu

Abstract

Research organisations and individual researchers increasingly choose to share their research findings by providing lists of their published works on the World Wide Web. To facilitate the exchange of ideas, the lists often include links to published papers in portable document format (PDF) or PostScript (PS) format. Generally, these publication Web sites are updated regularly to include new works. While manual monitoring of relevant Web sites is tedious, commercial search engines and information monitoring systems are ineffective in finding and tracking scholarly publications. Analyses the characteristics of publication index pages and describes effective automatic extraction techniques that the authors have developed. The authors’ techniques combine lexical and syntactic analyses with heuristics. The proposed techniques have been implemented and tested on more than 14,000 Web pages and achieved consistently high success rates of around 90 percent.
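To make the idea of a publication index page concrete, here is a minimal sketch that collects links to PDF or PostScript files from an HTML page using only the Python standard library. It is a simple file-extension heuristic, not the lexical and syntactic analysis the authors describe. The names `PaperLinkParser` and `find_paper_links`, the list of extensions and the example URL are assumptions made for illustration.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

# Assumed set of file extensions that indicate a downloadable paper.
PAPER_EXTENSIONS = (".pdf", ".ps", ".ps.gz")


class PaperLinkParser(HTMLParser):
    """Collect absolute URLs of anchors that point at PDF or PostScript files."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        path = href.split("?")[0].lower()   # ignore any query string
        if path.endswith(PAPER_EXTENSIONS):
            self.links.append(urljoin(self.base_url, href))


def find_paper_links(html: str, base_url: str) -> List[str]:
    """Return the paper links found in `html`, resolved against `base_url`."""
    parser = PaperLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Made-up index page, for illustration only.
page = (
    '<ul><li><a href="papers/a.pdf">Paper A</a></li>'
    '<li><a href="/b.PS">Paper B</a></li>'
    '<li><a href="about.html">About</a></li></ul>'
)
print(find_paper_links(page, "http://example.org/pubs/"))
```

A real monitoring system would also need to associate each link with its citation text and detect newly added entries, which this sketch does not attempt.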

Details

Online Information Review, vol. 26 no. 1
Type: Research Article
ISSN: 1468-4527

Article
Publication date: 1 August 2001

Siriginidi Subba Rao

Abstract

Highlights the evolution and potential of electronic books (eBooks) and presents a comprehensive definition drawn from the various definitions reported for eBooks, together with their types, pros, cons and users. Available eBook hardware, such as Rocket eBook Reader, SoftBook Reader, EveryBook Dedicated Reader and Millennium eBook Reader, and software, viz. Adobe Acrobat, Microsoft Reader, Glassbook Reader, DocAble, SoftBook Reader, RocketLibrarian, PeanutReader, etc., are listed together with sources for eBook titles. Also briefly discusses eBook standards and copyright protection. Concludes that eBooks are rapidly becoming a viable alternative to the traditional medium and will continue to stay in one form or another.

Details

The Electronic Library, vol. 19 no. 4
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 1 January 2006

Susan J. Sullivan

Abstract

Purpose

This article sets out to explain the purpose of PDF/A, how it addresses archival and records management concerns, how PDF/A was designed to have “desirable properties of a long‐term preservation format”, and the future of PDF/A.

Design/methodology/approach

The contents of this article are based on the author's knowledge and experience of the subject.

Findings

It is emphasized that PDF/A must be implemented in conjunction with policies and procedures, including quality assurance procedures to ensure acceptable replication of source material.

Originality/value

This article will be of interest to anyone working with PDF files. Work has already begun on PDF/A Part 2 which will be based on PDF 1.6. Application notes and a listing of frequently asked questions will be made publicly available to assist developers of PDF/A applications to better understand the requirements of the file format and provide implementation guidance.

Details

Records Management Journal, vol. 16 no. 1
Type: Research Article
ISSN: 0956-5698

Article
Publication date: 3 April 2018

Qian Pu, Xiaomin Zhu, Donghua Chen and Runtong Zhang

Abstract

Purpose

This paper aims to provide an optimization method of workflow for publishing houses and electronic book (e-book) studies in the field of digital publishing.

Design/methodology/approach

Based on studies of publishing houses in Beijing, the present conversion workflow is illustrated using a functional modeling methodology. Then, the workflow is analyzed using the 5W1H (why, who, what, where, when, how) methodology and optimized using ECRSI (eliminate, combine, rearrange, simplify and increase) principles. To validate the optimization effect, the workflows before and after optimization are generated and implemented in the ExtendSim® simulation software.

Findings

The simulation results show that, under similar circumstances, both the quantity and the quality of the products are improved after optimization, which indicates that the optimization method is effective.

Practical implications

Electronic PUBlication (EPUB) has significant requirements to satisfy the needs of the mobile reading market and to earn increased profits, whereas some e-books are still preserved in portable document format (PDF). This study results in enhanced EPUB quality and production efficiency in the PDF-to-EPUB format conversion workflow in publishing houses. Publishing houses around the world can refer to this study to make a similar optimization when handling PDF-to-EPUB conversion.

Originality/value

This research introduces traditional industrial engineering analytical techniques to the workflow optimization of e-book conversion. Compared with most other methods used to optimize workflows, this method is simpler, more efficient and more suitable for e-book format conversion.

Details

The Electronic Library, vol. 36 no. 2
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 1 September 2001

Nick Poole

Abstract

The purpose of this article is to examine the existing tools and guidance available to museums, archives and libraries, and then to consider new technologies such as accessible Portable Document Format files and additional modules for existing web development software. The article reviews current tools, standards and guidelines in accessibility such as WAI, the RNIB Digital Access Campaign, the Information Age Government Champions guidelines, the Bobby validator, Access Adobe and the Macromedia Dreamweaver Accessibility Extension. Two case studies concerning accessibility are included.

Details

VINE, vol. 31 no. 3
Type: Research Article
ISSN: 0305-5728

Article
Publication date: 1 February 1998

Paul Nieuwenhuysen and Patrick Vanouplines

Abstract

This contribution looks at some recent advanced tools, techniques, methods and standards related to the Internet which form the basis for mixtures of documents and services, which we can call ‘document+program hybrids’. The new Internet systems contribute to an evolution away from documents on the one side and computer programs on the other, neatly separated and without much interaction, so that the static document can also exist without computers and networks. The evolution leads towards hybrid systems where the classical distinction between the contents and the container is blurred and where all components are integrated, interwoven and exist in synergy with each other. By exploiting the power of computers and networks, such systems can be more dynamic and interactive than classical, static documents. A collection is presented of Internet‐based sources (URLs) that can serve as illustrations. Recent methods, techniques, standards and protocols on the Internet that form the basis of the evolution are listed. As professional information intermediaries, the authors also consider the impact in the area of online access to information and knowledge.

Details

Online and CD-Rom Review, vol. 22 no. 2
Type: Research Article
ISSN: 1353-2642

Article
Publication date: 1 March 2000

Jennifer Rowley

Abstract

Electronic journals are an important alternative form of document delivery. Document delivery is performed by library networks and consortia, CD‐ROM suppliers, document delivery services, library suppliers and subscription agents, and electronic journal suppliers. This article reviews the general issues associated with electronic journals, and illustrates these with reference to the products and projects that are available in the UK. Subsequent to the early projects such as BLEND and Project Quartet, projects on electronic journals have been led by either publishers or consortia whose members include both major libraries and publishers. Among these projects are Ariel, EDDIS, EDIL, ADONIS, APPEAL and the UK Pilot Site Initiative. In order that electronic journals become an established option for document and information delivery, there are a number of questions that need to be answered from the perspectives both of libraries, and of the information industry. This article summarises some of these questions, and identifies some of the broader issues that will determine progress towards wide acceptance of electronic journals.

Details

Library Hi Tech, vol. 18 no. 1
Type: Research Article
ISSN: 0737-8831
