
Search results

1 – 10 of over 13,000
Article
Publication date: 1 March 1998

Information extraction: beyond document retrieval

Robert Gaizauskas and Yorick Wilks

Abstract

In this paper we give a synoptic view of the growth of the text processing technology of information extraction (IE) whose function is to extract information about a pre‐specified set of entities, relations or events from natural language texts and to record this information in structured representations called templates. Here we describe the nature of the IE task, review the history of the area from its origins in AI work in the 1960s and 70s till the present, discuss the techniques being used to carry out the task, describe application areas where IE systems are or are about to be at work, and conclude with a discussion of the challenges facing the area. What emerges is a picture of an exciting new text processing technology with a host of new applications, both on its own and in conjunction with other technologies, such as information retrieval, machine translation and data mining.
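
To make the notion of a template concrete, the following minimal sketch (an illustrative Python example, not from the paper) shows a pre-specified event structure being filled with entities found in a sentence; the event type and field names are invented for the illustration.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class SuccessionEvent:                  # a pre-specified event type
        organisation: Optional[str] = None
        post: Optional[str] = None
        person_in: Optional[str] = None
        person_out: Optional[str] = None

    # "Acme Corp. named Jane Doe president, replacing John Smith."
    filled_template = SuccessionEvent(organisation="Acme Corp.",
                                      post="president",
                                      person_in="Jane Doe",
                                      person_out="John Smith")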

Details

Journal of Documentation, vol. 54 no. 1
Type: Research Article
DOI: https://doi.org/10.1108/EUM0000000007162
ISSN: 0022-0418

Keywords

  • Text retrieval
  • Information control
  • Documents

Article
Publication date: 1 July 2014

Extracting bibliographical data for PDF documents with HMM and external resources

Wen-Feng Hsiao, Te-Min Chang and Erwin Thomas

Abstract

Purpose

The purpose of this paper is to propose an automatic metadata extraction and retrieval system to extract bibliographical information from digital academic documents in portable document formats (PDFs).

Design/methodology/approach

The authors use PDFBox to extract text and font-size information, a rule-based method to identify titles, and a Hidden Markov Model (HMM) to extract the titles and authors. Finally, the extracted titles and authors (possibly incorrect or incomplete) are sent as query strings to digital libraries (e.g. ACM, IEEE, CiteSeerX, SDOS and Google Scholar) to retrieve the rest of the metadata.
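
The sketch below (Python; all probabilities and features are invented for illustration, whereas the paper estimates its parameters from the Cora header set) shows how an HMM with Viterbi decoding can label header tokens as title, author or other:

    import math

    STATES = ["title", "author", "other"]           # hypothetical label set
    start = {"title": 0.6, "author": 0.2, "other": 0.2}
    trans = {"title":  {"title": 0.7,  "author": 0.2,  "other": 0.1},
             "author": {"title": 0.05, "author": 0.75, "other": 0.2},
             "other":  {"title": 0.05, "author": 0.15, "other": 0.8}}

    def emit(state, token):
        """Toy emission model based on simple word features (capitalisation, commas)."""
        if state == "author":
            return 0.6 if token[:1].isupper() or token == "," else 0.1
        if state == "title":
            return 0.5 if token[:1].isupper() else 0.3
        return 0.3

    def viterbi(tokens):
        """Return the most likely label sequence for a sequence of header tokens."""
        best = [{s: (math.log(start[s]) + math.log(emit(s, tokens[0])), [s]) for s in STATES}]
        for tok in tokens[1:]:
            col = {}
            for s in STATES:
                score, prev = max(
                    (best[-1][p][0] + math.log(trans[p][s]) + math.log(emit(s, tok)), p)
                    for p in STATES)
                col[s] = (score, best[-1][prev][1] + [s])
            best.append(col)
        return max(best[-1].values())[1]

    print(viterbi("Extracting Bibliographical Data with HMMs Wen-Feng Hsiao".split()))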

Findings

Four experiments are conducted to examine the feasibility of the proposed system. The first experiment compares two different HMM models: a multi-state model and a one-state model (the proposed model). The results show that the one-state model performs comparably to the multi-state model but is better suited to dealing with real-world unknown states. The second experiment shows that the proposed model (without the aid of online queries) achieves performance as good as other researchers' models on the Cora paper-header dataset. The third experiment examines the performance of the system on a small dataset of 43 real PDF research papers. The results show that the proposed system (with online queries) performs well on bibliographical data extraction and even outperforms the free citation management tool Zotero 3.0. Finally, the fourth experiment uses a larger dataset of 103 papers to compare the system with Zotero 4.0. The results show that the system significantly outperforms Zotero 4.0. The feasibility of the proposed model is thus justified.

Research limitations/implications

In terms of academic implications, the system is unique in two respects. First, it uses only the Cora header set for HMM training, without other tagged datasets or gazetteer resources, which makes the system light and scalable. Second, the system is workable and can be applied to extracting metadata from real-world PDF files. The extracted bibliographical data can then be imported into citation software such as EndNote or RefWorks to increase researchers' productivity.

Practical implications

In terms of practical implications, the system outperforms an existing tool, Zotero v4.0. This gives practitioners a good opportunity to develop similar products for real applications, though doing so may require some knowledge of HMM implementation.

Originality/value

The HMM implementation is not novel. What is innovative is that it combines two HMM models: the main model is adapted from Freitag and McCallum (1999), and the authors add the word features of the Nymble HMM (Bikel et al., 1997) to it. The system works even without manually tagging datasets before training the model (the authors use only the Cora dataset for training and test on real-world PDF papers), which is significantly different from what other works have done so far. The experimental results provide sufficient evidence of the feasibility of the proposed method in this respect.

Details

Program, vol. 48 no. 3
Type: Research Article
DOI: https://doi.org/10.1108/PROG-12-2011-0059
ISSN: 0033-0337

Keywords

  • Bibliographical information
  • Hidden Markov Model
  • Information extraction
  • PDF documents

Article
Publication date: 19 August 2019

Ontology-based approach to enhance medical web information extraction

Nassim Abdeldjallal Otmani, Malik Si-Mohammed, Catherine Comparot and Pierre-Jean Charrel

Abstract

Purpose

The purpose of this study is to propose a framework for extracting medical information from the Web using domain ontologies. Patient–Doctor conversations have become prevalent on the Web. For instance, solutions like HealthTap or AskTheDoctors allow patients to ask doctors health-related questions. However, most online health-care consumers still struggle to express their questions efficiently due mainly to the expert/layman language and knowledge discrepancy. Extracting information from these layman descriptions, which typically lack expert terminology, is challenging. This hinders the efficiency of the underlying applications such as information retrieval. Herein, an ontology-driven approach is proposed, which aims at extracting information from such sparse descriptions using a meta-model.

Design/methodology/approach

A meta-model is designed to bridge the gap between the vocabulary of the medical experts and the consumers of the health services. The meta-model is mapped with SNOMED-CT to access the comprehensive medical vocabulary, as well as with WordNet to improve the coverage of layman terms during information extraction. To assess the potential of the approach, an information extraction prototype based on syntactical patterns is implemented.
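
As a rough illustration of the idea (not the actual prototype), the toy dictionaries below stand in for SNOMED-CT concepts and WordNet synonym sets, and a simple lookup maps layman phrases in a patient question to expert concepts; all entries are illustrative placeholders.

    # Toy stand-ins for SNOMED-CT concepts and WordNet-style synonym mappings.
    SNOMED_LIKE = {
        "abdominal pain": "SCTID:21522001",      # hypothetical concept entries
        "pyrexia": "SCTID:386661006",
    }
    WORDNET_LIKE = {
        "belly ache": "abdominal pain",          # layman term -> expert term
        "tummy ache": "abdominal pain",
        "fever": "pyrexia",
    }

    def extract_concepts(layman_text):
        """Map layman phrases in a patient question to expert medical concepts."""
        found = []
        text = layman_text.lower()
        for layman, expert in WORDNET_LIKE.items():
            if layman in text:
                found.append((layman, expert, SNOMED_LIKE[expert]))
        return found

    print(extract_concepts("I have had a belly ache and a fever since Monday"))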

Findings

The evaluation of the approach on the gold standard corpus defined in Task 1 of ShARe CLEF 2013 showed promising results, with an F-score of 0.79 for recognizing medical concepts in real-life medical documents.

Originality/value

The originality of the proposed approach lies in the way information is extracted. The context defined through a meta-model proved to be efficient for the task of information extraction, especially from layman descriptions.

Details

International Journal of Web Information Systems, vol. 15 no. 3
Type: Research Article
DOI: https://doi.org/10.1108/IJWIS-03-2018-0017
ISSN: 1744-0084

Keywords

  • Web search and information extraction
  • Metadata and ontologies
  • Knowledge engineering
  • Online patient-doctor conversation

Article
Publication date: 23 November 2010

Towards flexible and lightweight integration of web applications by end‐user programming

Hao Han and Takehiro Tokuda

Abstract

Purpose

The purpose of this paper is to present a method to realize the flexible and lightweight integration of general web applications.

Design/methodology/approach

An information extraction and functionality emulation method is proposed to realize web information integration for general web applications. All processes of web information searching, submitting and extraction run at the client side through end-user programming, as if calling a real web service.
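
A minimal sketch of the search/submit/extract pattern is given below; it is written in Python rather than the paper's Java package, and the endpoint and extraction pattern are hypothetical placeholders.

    import re
    import urllib.parse
    import urllib.request

    def search_and_extract(keyword):
        # 1. Submit the query string as a normal web application would.
        url = "https://example.org/search?" + urllib.parse.urlencode({"q": keyword})
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        # 2. Extract the result titles from the returned HTML (hypothetical markup).
        return re.findall(r'<h3 class="result">(.*?)</h3>', html, flags=re.S)

    # Calling the wrapper feels like calling a web service:
    # for title in search_and_extract("information extraction"): print(title)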

Findings

The implementation shows that the required programming techniques are within the abilities of general web users and do not require writing a large amount of code.

Originality/value

A Java-based class package was developed for web information searching/submitting/extraction, which users can easily integrate with general web applications.

Details

International Journal of Web Information Systems, vol. 6 no. 4
Type: Research Article
DOI: https://doi.org/10.1108/17440081011090257
ISSN: 1744-0084

Keywords

  • Internet
  • Computer applications
  • Programming
  • Information retrieval

Article
Publication date: 8 July 2010

Excavating grey literature: A case study on the rich indexing of archaeological documents via natural language‐processing techniques and knowledge‐based resources

Andreas Vlachidis, Ceri Binding, Douglas Tudhope and Keith May

Abstract

Purpose

This paper sets out to discuss the use of information extraction (IE), a natural language‐processing (NLP) technique to assist “rich” semantic indexing of diverse archaeological text resources. The focus of the research is to direct a semantic‐aware “rich” indexing of diverse natural language resources with properties capable of satisfying information retrieval from online publications and datasets associated with the Semantic Technologies for Archaeological Resources (STAR) project.

Design/methodology/approach

The paper proposes the use of the English Heritage extension (CRM-EH) of the standard core ontology in cultural heritage, CIDOC CRM, and the exploitation of domain thesauri for driving and enhancing an ontology-oriented information extraction process. The process of semantic indexing is based on a rule-based information extraction technique, which is facilitated by the General Architecture for Text Engineering (GATE) toolkit and expressed as Java Annotation Pattern Engine (JAPE) rules.
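
The sketch below gives the flavour of such a rule-based step (in Python, not GATE/JAPE): a small gazetteer of thesaurus terms drives a rule that annotates matching spans with CRM-style concepts; the terms and mappings are invented examples in the spirit of the approach.

    import re

    GAZETTEER = {
        "roman": "E49.Time_Appellation",       # period term -> CRM concept
        "medieval": "E49.Time_Appellation",
        "coin": "E19.Physical_Object",         # find term -> CRM concept
        "pottery": "E19.Physical_Object",
    }

    def annotate(text):
        """Return (start, end, surface, concept) annotations for gazetteer matches."""
        spans = []
        for term, concept in GAZETTEER.items():
            for m in re.finditer(r"\b%s\b" % re.escape(term), text, flags=re.I):
                spans.append((m.start(), m.end(), m.group(0), concept))
        return sorted(spans)

    print(annotate("A Roman coin and medieval pottery were recovered from the ditch fill."))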

Findings

Initial results suggest that the combination of information extraction with knowledge resources and standard conceptual models is capable of supporting semantic‐aware term indexing. Additional efforts are required for further exploitation of the technique and adoption of formal evaluation methods for assessing the performance of the method in measurable terms.

Originality/value

The value of the paper lies in the semantic indexing of 535 unpublished online documents, often referred to as "grey literature", from the Archaeology Data Service OASIS corpus (Online AccesS to the Index of archaeological investigationS), with respect to the CRM ontological concepts E49.Time Appellation and E19.Physical Object.

Details

Aslib Proceedings, vol. 62 no. 4/5
Type: Research Article
DOI: https://doi.org/10.1108/00012531011074708
ISSN: 0001-253X

Keywords

  • Information management
  • Semantics
  • Data handling

Article
Publication date: 22 June 2020

ACRank: a multi-evidence text-mining model for alliance discovery from news articles

Yilu Zhou and Yuan Xue

Abstract

Purpose

Strategic alliances among organizations are some of the central drivers of innovation and economic growth. However, the discovery of alliances has relied on pure manual search and has limited scope. This paper proposes a text-mining framework, ACRank, that automatically extracts alliances from news articles. ACRank aims to provide human analysts with a higher coverage of strategic alliances compared to existing databases, yet maintain a reasonable extraction precision. It has the potential to discover alliances involving less well-known companies, a situation often neglected by commercial databases.

Design/methodology/approach

The proposed framework is a systematic process of alliance extraction and validation using natural language processing techniques and alliance domain knowledge. The process integrates news article search, entity extraction, and syntactic and semantic linguistic parsing techniques. In particular, the Alliance Discovery Template (ADT) component identifies a number of linguistic templates expanded from expert domain knowledge and extracts potential alliances at the sentence level. Alliance Confidence Ranking (ACRank) further validates each unique alliance based on multiple features at the document level. The framework is designed to deal with extremely skewed, noisy data from news articles.
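
The following sketch (Python; the patterns and weights are invented, not the authors' implementation) illustrates the two stages: sentence-level templates propose candidate alliance pairs, and a document-level score combines several pieces of evidence.

    import re

    TEMPLATES = [
        r"(?P<a>[A-Z][\w.& ]+?) (?:announced|formed|signed) (?:a|an) "
        r"(?:alliance|partnership|joint venture) with (?P<b>[A-Z][\w.& ]+?)(?=[.,;]|$)",
        r"(?P<a>[A-Z][\w.& ]+?) (?:teamed up|partnered) with (?P<b>[A-Z][\w.& ]+?)(?=[.,;]|$)",
    ]

    def extract_candidates(sentence):
        """Apply each template to one sentence and return (orgA, orgB) candidates."""
        pairs = []
        for pattern in TEMPLATES:
            for m in re.finditer(pattern, sentence):
                pairs.append((m.group("a").strip(), m.group("b").strip()))
        return pairs

    def confidence(n_mentions, n_templates_matched, in_title):
        """Toy document-level confidence combining several evidence features."""
        return (0.5 * min(n_mentions, 5) / 5
                + 0.3 * min(n_templates_matched, 2) / 2
                + 0.2 * in_title)

    print(extract_candidates("IBM announced a partnership with Acme Robotics."))
    # [('IBM', 'Acme Robotics')]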

Findings

An evaluation of ACRank on a gold standard data set of IBM alliances (2006–2008) showed that sentence-level ADT-based extraction achieved 78.1% recall and 44.7% precision and eliminated over 99% of the noise in news articles. ACRank further improved precision to 97% for the top 20% of extracted alliance instances. A further comparison with the Thomson Reuters SDC database showed that SDC covered less than 20% of the total alliances, while ACRank covered 67%. When applied to Dow 30 company news articles, ACRank is estimated to achieve a recall between 0.48 and 0.95, and only 15% of the alliances appeared in SDC.

Originality/value

The research framework proposed in this paper indicates a promising direction of building a comprehensive alliance database using automatic approaches. It adds value to academic studies and business analyses that require in-depth knowledge of strategic alliances. It also encourages other innovative studies that use text mining and data analytics to study business relations.

Details

Information Technology & People, vol. 33 no. 5
Type: Research Article
DOI: https://doi.org/10.1108/ITP-06-2018-0272
ISSN: 0959-3845

Keywords

  • Strategic alliances
  • Knowledge discovery
  • Business intelligence
  • Web mining
  • Text mining
  • Information extraction
  • Template-based
  • Chunk parsing

Article
Publication date: 1 February 2002

Effective techniques for automatic extraction of Web publications

A.C.M. Fong, S.C. Hui and H.L. Vu

Abstract

Research organisations and individual researchers increasingly choose to share their research findings by providing lists of their published works on the World Wide Web. To facilitate the exchange of ideas, the lists often include links to published papers in portable document format (PDF) or PostScript (PS) format. Generally, these publication Web sites are updated regularly to include new works. While manual monitoring of relevant Web sites is tedious, commercial search engines and information monitoring systems are ineffective in finding and tracking scholarly publications. The paper analyses the characteristics of publication index pages and describes effective automatic extraction techniques that the authors have developed. These techniques combine lexical and syntactic analyses with heuristics. The proposed techniques have been implemented and tested on more than 14,000 Web pages, achieving consistently high success rates of around 90 percent.
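
The sketch below (Python, an illustrative heuristic rather than the authors' system) shows the general idea: scan a publication index page for links to PDF/PS files and keep the cleaned anchor text as a candidate citation.

    import re

    PAPER_LINK = re.compile(
        r'<a\s+href="(?P<url>[^"]+\.(?:pdf|ps))"[^>]*>(?P<text>.*?)</a>',
        re.IGNORECASE | re.DOTALL)

    def extract_publications(html):
        """Return (url, anchor text) pairs that look like links to published papers."""
        pubs = []
        for m in PAPER_LINK.finditer(html):
            title = re.sub(r"<[^>]+>|\s+", " ", m.group("text")).strip()
            pubs.append((m.group("url"), title))
        return pubs

    sample = '<li><a href="papers/ie98.pdf">Information extraction: beyond document retrieval</a> (1998)</li>'
    print(extract_publications(sample))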

Details

Online Information Review, vol. 26 no. 1
Type: Research Article
DOI: https://doi.org/10.1108/14684520210418347
ISSN: 1468-4527

Keywords

  • Internet
  • Research
  • Electronic publishing
  • Content analysis

Article
Publication date: 27 September 2011

Automatic extraction of metadata from scientific publications for CRIS systems

Aleksandar Kovačević, Dragan Ivanović, Branko Milosavljević, Zora Konjović and Dušan Surla

Abstract

Purpose

The aim of this paper is to develop a system for automatic extraction of metadata from scientific papers in PDF format for the information system for monitoring the scientific research activity of the University of Novi Sad (CRIS UNS).

Design/methodology/approach

The system is based on machine learning and performs automatic extraction and classification of metadata into eight pre-defined categories. The extraction task is realised as a classification process. For the purpose of classification, each row of text is represented by a vector of features: formatting, position, characteristics of the words, etc. Experiments were performed with standard classification models. Both a single classifier covering all eight categories and eight individual classifiers were tested. The classifiers were evaluated using five-fold cross-validation on a manually annotated corpus comprising 100 scientific papers in PDF format, collected from various conferences, journals and authors' personal web pages.
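
The sketch below uses scikit-learn's SVC to illustrate the row-classification idea; the paper does not prescribe a particular library, and the feature vectors and labels here are invented.

    from sklearn.svm import SVC

    # Each row of a paper is a feature vector, e.g.
    # [font size, bold?, vertical position on page, fraction of digits, contains "@"?]
    X_train = [
        [18.0, 1, 0.05, 0.00, 0],   # title row
        [11.0, 0, 0.10, 0.00, 0],   # author row
        [10.0, 0, 0.12, 0.05, 1],   # e-mail/affiliation row
        [10.0, 0, 0.30, 0.02, 0],   # abstract row
    ]
    y_train = ["title", "author", "affiliation", "abstract"]

    clf = SVC(kernel="linear")      # each of the eight per-category models could
    clf.fit(X_train, y_train)       # instead be trained as a binary classifier

    print(clf.predict([[17.5, 1, 0.04, 0.0, 0]]))   # likely ['title'] on this toy data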

Findings

Based on the performance obtained in the classification experiments, eight separate support vector machine (SVM) models (each of which recognises its corresponding category) were chosen. All eight models were found to perform well: the F-measure was over 85 per cent for almost all of the classifiers and over 90 per cent for most of them.

Research limitations/implications

Automatically extracted metadata cannot be entered directly into CRIS UNS but requires review by the curators.

Practical implications

The proposed system for automatic metadata extraction using support vector machines model was integrated into the software system, CRIS UNS. Metadata extraction has been tested on the publications of researchers from the Department of Mathematics and Informatics of the Faculty of Sciences in Novi Sad. Analysis of extracted metadata from these publications showed that the performance of the system for the previously unseen data is in accordance with that obtained by the cross‐validation from eight separate SVM classifiers. This system will help in the process of synchronising metadata from CRIS UNS with other institutional repositories.

Originality/value

The paper documents a fully automated system for metadata extraction from scientific papers. The system is based on an SVM classifier and open-source tools, and is capable of extracting eight types of metadata from scientific articles in any format that can be converted to PDF. Although developed as part of CRIS UNS, the proposed system can be integrated into other CRIS systems, as well as institutional repositories and library management systems.

Details

Program, vol. 45 no. 4
Type: Research Article
DOI: https://doi.org/10.1108/00330331111182094
ISSN: 0033-0337

Keywords

  • Automatic metadata extraction
  • Classification
  • Support vector machines
  • PDF format
  • Data handling
  • Machine oriented languages

Article
Publication date: 19 June 2017

Dynamic cataloguing of the old Arabic manuscripts by automatic extraction of metadata

Mohammed Ourabah Soualah, Yassine Ait Ali Yahia, Abdelkader Keita and Abderrezak Guessoum

Abstract

Purpose

The purpose of this paper is to provide online access to digitised images of Arabic manuscripts, which requires a catalogue. Traditional bibliographic cataloguing is unsuitable for old Arabic manuscripts, so a new cataloguing model is needed. The authors propose a new cataloguing model based on manuscript annotations and transcriptions, which can be an effective solution for dynamically cataloguing old Arabic manuscripts. To this end, the authors use automatic extraction of metadata based on the structural similarity of the documents.

Design/methodology/approach

This work follows an experimental methodology. All the proposed concepts and formulas were tested for validation, which allows the authors to draw concise conclusions.

Findings

Cataloguing old Arabic manuscripts faces the problem of unavailable information. However, this information may be found elsewhere, in a copy of the original manuscript. Thus, cataloguing an Arabic manuscript cannot be done in a single pass; it is a continual process that requires information updating. The idea is to pre-catalogue a manuscript and then complete and improve the record through a dedicated platform. Consequently, the authors propose a new cataloguing model, which they call "dynamic cataloguing".

Research limitations/implications

The success of the proposed model depends on the involvement of all its actors; it relies on the conviction and motivation of the actors of the collaborative platform.

Practical implications

The model can be used in several cataloguing fields where the encoding is based on XML. It is innovative, implements smart cataloguing and is delivered through a web platform, allowing a catalogue to be updated automatically.

Social implications

The model prompts users to participate in enriching the catalogue, allowing them to move from a passive to an active role.

Originality/value

The dynamic cataloguing model is a new concept that has not previously been proposed in the literature. The proposed cataloguing model is based on automatic extraction of metadata from user annotations/transcriptions. It is a smart system that automatically updates or fills the catalogue with the extracted metadata.
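
As a rough illustration of such automatic catalogue updating (not the paper's encoding model; element names and values are hypothetical), metadata extracted from an annotation could be merged into an XML record as follows:

    import xml.etree.ElementTree as ET

    record = ET.fromstring(
        "<manuscript><title/><author/><copyDate/></manuscript>")

    extracted = {"title": "Kitab al-Hayawan", "copyDate": "1201 AH"}  # from an annotation

    def update_catalogue(record, extracted):
        """Fill empty catalogue fields with newly extracted metadata."""
        for field, value in extracted.items():
            element = record.find(field)
            if element is not None and not (element.text or "").strip():
                element.text = value        # only fill fields that are still missing
        return record

    print(ET.tostring(update_catalogue(record, extracted), encoding="unicode"))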

Details

Library Hi Tech, vol. 35 no. 2
Type: Research Article
DOI: https://doi.org/10.1108/LHT-07-2016-0076
ISSN: 0737-8831

Keywords

  • Transcription
  • Digital library
  • Annotations
  • Automatic extraction of metadata
  • Dynamic cataloguing
  • Structural similarity

Article
Publication date: 15 February 2008

Converting PDF files to XML files

Wende Zhang

Abstract

Purpose

The purpose of this paper is to develop a system that can convert PDF files to XML files.

Design/methodology/approach

The system works with XML as an information display model and XSLT as an information extraction rule. The process is illustrated by converting a scientific and technological paper in PDF to a valid XML file.

Findings

Because the PDF format is self-descriptive, content information and display information exist in different objects, so it is not easy to extract information directly from the PDF source file. The indirect way of solving this problem in the system design was to convert the PDF source file to an intermediate format that is relatively easy to process, which can then be automatically converted to the target file according to the relevant rules.
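
The sketch below illustrates this indirect route using lxml's XSLT support in Python (the paper does not specify an implementation, and the intermediate format and element names here are hypothetical): a converter-produced intermediate XML is transformed into the target XML by an XSLT extraction rule.

    from lxml import etree

    # Pretend this intermediate XML was produced from the PDF by an external converter.
    intermediate = etree.fromstring(
        "<page><line size='18'>Converting PDF files to XML files</line>"
        "<line size='11'>Wende Zhang</line></page>")

    # An XSLT "extraction rule": large-font lines become the title, smaller ones the author.
    stylesheet = etree.fromstring("""\
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/page">
        <article>
          <title><xsl:value-of select="line[@size &gt;= 16][1]"/></title>
          <author><xsl:value-of select="line[@size &lt; 16][1]"/></author>
        </article>
      </xsl:template>
    </xsl:stylesheet>""")

    transform = etree.XSLT(stylesheet)
    print(str(transform(intermediate)))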

Originality/value

It is important to be able to easily and conveniently extract information from PDF files and this paper shows how it can be done. The design ideas contained in the paper can also be applied to information extraction from other types of files.

Details

The Electronic Library, vol. 26 no. 1
Type: Research Article
DOI: https://doi.org/10.1108/02640470810851743
ISSN: 0264-0473

Keywords

  • Portable document format
  • Extensible Markup Language
  • Information exchange
