Search results
1 – 10 of over 13,000
Robert Gaizauskas and Yorick Wilks
In this paper we give a synoptic view of the growth of the text processing technology of information extraction (IE) whose function is to extract information about a…
Abstract
In this paper we give a synoptic view of the growth of the text processing technology of information extraction (IE) whose function is to extract information about a pre‐specified set of entities, relations or events from natural language texts and to record this information in structured representations called templates. Here we describe the nature of the IE task, review the history of the area from its origins in AI work in the 1960s and 70s till the present, discuss the techniques being used to carry out the task, describe application areas where IE systems are or are about to be at work, and conclude with a discussion of the challenges facing the area. What emerges is a picture of an exciting new text processing technology with a host of new applications, both on its own and in conjunction with other technologies, such as information retrieval, machine translation and data mining.
Wen-Feng Hsiao, Te-Min Chang and Erwin Thomas
The purpose of this paper is to propose an automatic metadata extraction and retrieval system to extract bibliographical information from digital academic documents in…
Abstract
Purpose
The purpose of this paper is to propose an automatic metadata extraction and retrieval system to extract bibliographical information from digital academic documents in portable document formats (PDFs).
Design/methodology/approach
The authors use PDFBox to extract text and font size information, a rule-based method to identify titles, and a Hidden Markov Model (HMM) to extract the titles and authors. Finally, the extracted titles and authors (possibly incorrect or incomplete) are sent as query strings to digital libraries (e.g. ACM, IEEE, CiteSeerX, SDOS, and Google Scholar) to retrieve the rest of the metadata.
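As an illustration of the rule-based title identification described above, here is a minimal Python sketch (not the authors' code, which builds on PDFBox and an HMM); it assumes the text and font size of each first-page line have already been extracted, and the sample data are invented:

```python
# Illustrative sketch: a simple font-size heuristic of the kind described,
# applied to first-page lines whose text and font size are already known.

def guess_title(lines):
    """lines: list of (text, font_size) tuples for the first page."""
    # Ignore very short fragments (page numbers, running headers, etc.).
    candidates = [(text, size) for text, size in lines if len(text.split()) >= 3]
    if not candidates:
        return None
    # Rule of thumb: the title is usually the text set in the largest font.
    largest = max(size for _, size in candidates)
    return " ".join(text for text, size in candidates if size == largest)

first_page = [
    ("Journal of Hypothetical Studies", 9.0),            # running header
    ("Automatic Metadata Extraction from PDFs", 18.0),   # title
    ("Jane Doe and John Smith", 12.0),                   # authors
    ("Abstract", 11.0),
]
print(guess_title(first_page))  # -> "Automatic Metadata Extraction from PDFs"
```

In the system described, the identified title and the HMM-extracted authors would then be sent as query strings to a digital library to retrieve the remaining metadata.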
Findings
Four experiments are conducted to examine the feasibility of the proposed system. The first experiment compares two different HMM models: a multi-state model and a one-state model (the proposed model). The results show that the one-state model performs comparably to the multi-state model but is better suited to handling unknown states in real-world data. The second experiment shows that the proposed model (without the aid of online query) achieves performance as good as other researchers' models on the Cora paper-header dataset. The third experiment examines the performance of the system on a small dataset of 43 real PDF research papers; the results show that the proposed system (with online query) performs well on bibliographical data extraction and even outperforms the free citation management tool Zotero 3.0. Finally, the fourth experiment compares the system with Zotero 4.0 on a larger dataset of 103 papers; the results show that the system significantly outperforms Zotero 4.0. The feasibility of the proposed model is thus justified.
Research limitations/implications
In terms of academic implications, the system is unique in two respects: first, it uses only the Cora header set for HMM training, without other tagged datasets or gazetteer resources, which keeps the system light and scalable; second, it is workable and can be applied to extracting metadata from real-world PDF files. The extracted bibliographical data can then be imported into citation software such as EndNote or RefWorks to increase researchers' productivity.
Practical implications
In terms of practical implications, the system outperforms the existing tool Zotero v4.0. This gives practitioners a good opportunity to develop similar products for real applications, though doing so may require some knowledge of HMM implementation.
Originality/value
The HMM implementation itself is not novel; what is innovative is the combination of two HMM models. The main model is adapted from Freitag and McCallum (1999), to which the authors add the word features of the Nymble HMM (Bikel et al., 1997). The system works without manually tagging datasets before training the model (the authors train only on the Cora dataset and test on real-world PDF papers), which differs significantly from what other works have done so far. The experimental results provide sufficient evidence of the feasibility of the proposed method in this respect.
Nassim Abdeldjallal Otmani, Malik Si-Mohammed, Catherine Comparot and Pierre-Jean Charrel
The purpose of this study is to propose a framework for extracting medical information from the Web using domain ontologies. Patient–Doctor conversations have become…
Abstract
Purpose
The purpose of this study is to propose a framework for extracting medical information from the Web using domain ontologies. Patient–Doctor conversations have become prevalent on the Web. For instance, solutions like HealthTap or AskTheDoctors allow patients to ask doctors health-related questions. However, most online health-care consumers still struggle to express their questions efficiently due mainly to the expert/layman language and knowledge discrepancy. Extracting information from these layman descriptions, which typically lack expert terminology, is challenging. This hinders the efficiency of the underlying applications such as information retrieval. Herein, an ontology-driven approach is proposed, which aims at extracting information from such sparse descriptions using a meta-model.
Design/methodology/approach
A meta-model is designed to bridge the gap between the vocabulary of the medical experts and the consumers of the health services. The meta-model is mapped with SNOMED-CT to access the comprehensive medical vocabulary, as well as with WordNet to improve the coverage of layman terms during information extraction. To assess the potential of the approach, an information extraction prototype based on syntactical patterns is implemented.
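To make the pattern-based extraction concrete, the following minimal Python sketch mimics the idea with a toy layman-term lexicon standing in for the SNOMED-CT and WordNet mappings; the terms, mappings and pattern are invented for illustration and are not the authors' meta-model:

```python
import re

# Toy layman-term lexicon standing in for the SNOMED-CT / WordNet mappings
# of the meta-model (all mappings here are invented for illustration).
LAYMAN_TO_CONCEPT = {
    "tummy ache": "Abdominal pain",
    "throwing up": "Vomiting",
    "short of breath": "Dyspnoea",
}

# One simple syntactic pattern: "<layman symptom> for <duration>".
PATTERN = re.compile(
    r"(?P<symptom>" + "|".join(re.escape(t) for t in LAYMAN_TO_CONCEPT) + r")"
    r"\s+for\s+(?P<duration>\d+\s+\w+)")

def extract(description):
    findings = []
    for m in PATTERN.finditer(description.lower()):
        findings.append({"concept": LAYMAN_TO_CONCEPT[m.group("symptom")],
                         "duration": m.group("duration")})
    return findings

print(extract("I have been throwing up for 3 days and I am short of breath"))
# -> [{'concept': 'Vomiting', 'duration': '3 days'}]
```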
Findings
The evaluation of the approach on the gold standard corpus defined in Task 1 of ShARe CLEF 2013 showed promising results: an F-score of 0.79 for recognizing medical concepts in real-life medical documents.
Originality/value
The originality of the proposed approach lies in the way information is extracted. The context defined through a meta-model proved to be efficient for the task of information extraction, especially from layman descriptions.
The purpose of this paper is to present a method to realize the flexible and lightweight integration of general web applications.
Abstract
Purpose
The purpose of this paper is to present a method to realize the flexible and lightweight integration of general web applications.
Design/methodology/approach
Information extraction and functionality emulation methods are proposed to realize web information integration for general web applications. All processes of web information searching, submitting and extraction run on the client side through end-user programming, much like a real web service.
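A minimal Python sketch of the client-side "functionality emulation" idea follows (the paper's implementation is a Java class package; the URL, form field and markup pattern here are hypothetical):

```python
# Illustrative sketch only: emulating a site's search form from the client
# side and scraping the result page, wrapped so it can be called like a
# web service. The endpoint, field names and regex are hypothetical.
import re
import urllib.parse
import urllib.request

def search_catalogue(keyword):
    query = urllib.parse.urlencode({"q": keyword})          # emulate form submission
    url = f"https://catalogue.example.org/search?{query}"   # hypothetical endpoint
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # Extract each result title from a (hypothetical) markup convention.
    return re.findall(r'<h3 class="result-title">(.*?)</h3>', html, re.S)

# A client application could now call search_catalogue("information extraction")
# as if the catalogue exposed a real web service.
```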
Findings
The implementation shows that the required programming techniques are within the abilities of general web users and do not require writing a large amount of code.
Originality/value
A Java-based class package was developed for web information searching, submitting and extraction, which users can easily integrate with general web applications.
Andreas Vlachidis, Ceri Binding, Douglas Tudhope and Keith May
This paper sets out to discuss the use of information extraction (IE), a natural language‐processing (NLP) technique to assist “rich” semantic indexing of diverse…
Abstract
Purpose
This paper sets out to discuss the use of information extraction (IE), a natural language‐processing (NLP) technique to assist “rich” semantic indexing of diverse archaeological text resources. The focus of the research is to direct a semantic‐aware “rich” indexing of diverse natural language resources with properties capable of satisfying information retrieval from online publications and datasets associated with the Semantic Technologies for Archaeological Resources (STAR) project.
Design/methodology/approach
The paper proposes use of the English Heritage extension (CRM‐EH) of the standard core ontology in cultural heritage, CIDOC CRM, and exploitation of domain thesauri resources for driving and enhancing an Ontology‐Oriented Information Extraction process. The process of semantic indexing is based on a rule‐based Information Extraction technique, which is facilitated by the General Architecture for Text Engineering (GATE) toolkit and expressed by Java Annotation Patterns Engine (JAPE) rules.
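For readers unfamiliar with this style of rule-based extraction, the following Python fragment imitates, very roughly, what a gazetteer-fed JAPE rule does when annotating CRM E49.Time Appellation spans; it is an illustration only, not GATE/JAPE code, and the gazetteer terms are examples:

```python
import re

# Illustrative stand-in for a JAPE-style rule: a small thesaurus-derived
# gazetteer of period terms feeds a pattern that annotates matching spans
# as instances of the CRM concept E49.Time Appellation.
PERIOD_GAZETTEER = ["Roman", "Iron Age", "Bronze Age", "Medieval", "Post-Medieval"]

period_rule = re.compile(
    r"\b(?:Early|Middle|Late)?\s*(?:%s)\b" % "|".join(PERIOD_GAZETTEER),
    re.IGNORECASE,
)

def annotate_time_appellations(text):
    return [
        {"span": m.group().strip(), "class": "E49.Time_Appellation"}
        for m in period_rule.finditer(text)
    ]

print(annotate_time_appellations(
    "A ditch of Late Iron Age date cut an earlier Bronze Age pit."))
# -> spans "Late Iron Age" and "Bronze Age", both tagged E49.Time_Appellation
```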
Findings
Initial results suggest that the combination of information extraction with knowledge resources and standard conceptual models is capable of supporting semantic‐aware term indexing. Additional efforts are required for further exploitation of the technique and adoption of formal evaluation methods for assessing the performance of the method in measurable terms.
Originality/value
The value of the paper lies in the semantic indexing of 535 unpublished online documents, often referred to as “Grey Literature”, from the Archaeological Data Service OASIS corpus (Online AccesS to the Index of archaeological investigationS), with respect to the CRM ontological concepts E49.Time Appellation and E19.Physical Object.
Strategic alliances among organizations are some of the central drivers of innovation and economic growth. However, the discovery of alliances has relied on pure manual…
Abstract
Purpose
Strategic alliances among organizations are some of the central drivers of innovation and economic growth. However, the discovery of alliances has relied on pure manual search and has limited scope. This paper proposes a text-mining framework, ACRank, that automatically extracts alliances from news articles. ACRank aims to provide human analysts with a higher coverage of strategic alliances compared to existing databases, yet maintain a reasonable extraction precision. It has the potential to discover alliances involving less well-known companies, which are often neglected by commercial databases.
Design/methodology/approach
The proposed framework is a systematic process of alliance extraction and validation using natural language processing techniques and alliance domain knowledge. The process integrates news article search, entity extraction, and syntactic and semantic linguistic parsing techniques. In particular, the Alliance Discovery Template (ADT) step uses a number of linguistic templates expanded from expert domain knowledge to extract potential alliances at the sentence level. Alliance Confidence Ranking (ACRank) further validates each unique alliance based on multiple features at the document level. The framework is designed to deal with the extremely skewed, noisy data found in news articles.
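The following toy Python sketch illustrates the general shape of sentence-level template extraction followed by a document-level confidence score; the single pattern and the scoring rule are invented stand-ins for the many expert-derived ADT templates and the richer ACRank features:

```python
import re

# Illustrative sketch only: one toy ADT-style template plus a crude
# document-level confidence score (frequency of the extracted pair).
ALLIANCE_TEMPLATE = re.compile(
    r"(?P<org1>[A-Z][\w&.]*(?:\s+[A-Z][\w&.]*)*)\s+"
    r"(?:formed|signed|entered into|announced)\s+"
    r"(?:a|an)\s+(?:strategic\s+)?(?:alliance|partnership|joint venture)\s+with\s+"
    r"(?P<org2>[A-Z][\w&.]*(?:\s+[A-Z][\w&.]*)*)"
)

def extract_alliances(sentences):
    hits = [m.groupdict() for s in sentences if (m := ALLIANCE_TEMPLATE.search(s))]
    for hit in hits:
        pair = (hit["org1"], hit["org2"])
        # Toy confidence: how often the same pair recurs in this document.
        hit["confidence"] = sum(1 for h in hits if (h["org1"], h["org2"]) == pair) / len(hits)
    return hits

doc = ["IBM formed a strategic alliance with Acme Robotics last week.",
       "Analysts said IBM formed a strategic alliance with Acme Robotics."]
print(extract_alliances(doc))
```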
Findings
Evaluation of ACRank on a gold standard data set of IBM alliances (2006–2008) showed that sentence-level ADT-based extraction achieved 78.1% recall and 44.7% precision and eliminated over 99% of the noise in news articles. ACRank further improved precision to 97% for the top 20% of extracted alliance instances. Comparison with the Thomson Reuters SDC database showed that SDC covered less than 20% of total alliances, while ACRank covered 67%. When ACRank was applied to Dow 30 company news articles, it is estimated to achieve a recall between 0.48 and 0.95, and only 15% of the alliances appeared in SDC.
Originality/value
The research framework proposed in this paper indicates a promising direction of building a comprehensive alliance database using automatic approaches. It adds value to academic studies and business analyses that require in-depth knowledge of strategic alliances. It also encourages other innovative studies that use text mining and data analytics to study business relations.
A.C.M. Fong, S.C. Hui and H.L. Vu
Research organisations and individual researchers increasingly choose to share their research findings by providing lists of their published works on the World Wide Web…
Abstract
Research organisations and individual researchers increasingly choose to share their research findings by providing lists of their published works on the World Wide Web. To facilitate the exchange of ideas, the lists often include links to published papers in portable document format (PDF) or Postscript (PS) format. Generally, these publication Web sites are updated regularly to include new works. While manual monitoring of relevant Web sites is tedious, commercial search engines and information monitoring systems are ineffective in finding and tracking scholarly publications. This paper analyses the characteristics of publication index pages and describes effective automatic extraction techniques that the authors have developed. The techniques combine lexical and syntactic analyses with heuristics. They have been implemented and tested on more than 14,000 Web pages, achieving consistently high success rates of around 90 percent.
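As an illustration of the kind of heuristics involved, the short Python sketch below treats anchors that point to PDF/PS files on a publication index page as candidate publication records; it is a simplified stand-in for the authors' lexical and syntactic analyses, and the sample HTML is invented:

```python
from html.parser import HTMLParser

# Illustrative heuristic only: anchors pointing to PDF/PS files on a
# publication index page are taken as candidate publication records,
# keyed by their anchor text.
class PublicationLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self._href = None
        self.publications = []   # list of (anchor_text, href)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith((".pdf", ".ps", ".ps.gz")):
                self._href = href

    def handle_data(self, data):
        if self._href and data.strip():
            self.publications.append((data.strip(), self._href))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

parser = PublicationLinkParser()
parser.feed('<li><a href="papers/ie2006.pdf">Monitoring scholarly publications</a> (2006)</li>')
print(parser.publications)  # [('Monitoring scholarly publications', 'papers/ie2006.pdf')]
```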
Aleksandar Kovačević, Dragan Ivanović, Branko Milosavljević, Zora Konjović and Dušan Surla
The aim of this paper is to develop a system for automatic extraction of metadata from scientific papers in PDF format for the information system for monitoring the…
Abstract
Purpose
The aim of this paper is to develop a system for automatic extraction of metadata from scientific papers in PDF format for the information system for monitoring the scientific research activity of the University of Novi Sad (CRIS UNS).
Design/methodology/approach
The system is based on machine learning and performs automatic extraction and classification of metadata into eight pre-defined categories. The extraction task is realised as a classification process: each row of text is represented by a vector comprising different features (formatting, position, characteristics of the words, etc.). Experiments were performed with standard classification models, testing both a single classifier covering all eight categories and eight individual classifiers. Classifiers were evaluated using five-fold cross-validation on a manually annotated corpus of 100 scientific papers in PDF format, collected from various conferences, journals and authors' personal web pages.
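A minimal sketch of this row-classification setup, using scikit-learn rather than the authors' tooling, is given below; the feature vectors (font size, bold flag, vertical position, digit count, comma count) and labels are invented, a single multiclass SVM is used instead of the paper's eight per-category classifiers, and only two folds are used because the toy dataset is tiny (the paper uses five):

```python
# Illustrative sketch only: classifying text rows into metadata categories
# with an SVM and cross-validation, on made-up row feature vectors
# [font size, bold flag, normalised y-position, digit count, comma count].
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X = np.array([
    [18.0, 1, 0.95, 0, 0],   # large bold row near the top  -> title
    [12.0, 0, 0.90, 0, 2],   # commas just below the title  -> authors
    [10.0, 0, 0.80, 0, 0],   #                              -> abstract
    [10.0, 0, 0.75, 4, 1],   #                              -> abstract
    [17.5, 1, 0.94, 0, 0],   #                              -> title
    [12.0, 0, 0.89, 0, 3],   #                              -> authors
])
y = np.array(["title", "authors", "abstract", "abstract", "title", "authors"])

clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=2)   # two folds only, for the toy data
print(scores.mean())
```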
Findings
Based on the performance obtained in the classification experiments, eight separate support vector machine (SVM) models (each of which recognises its corresponding category) were chosen. All eight models were found to perform well: the F-measure was over 85 per cent for almost all of the classifiers and over 90 per cent for most of them.
Research limitations/implications
Automatically extracted metadata cannot be entered directly into CRIS UNS; it requires review by the curators.
Practical implications
The proposed system for automatic metadata extraction using support vector machines model was integrated into the software system, CRIS UNS. Metadata extraction has been tested on the publications of researchers from the Department of Mathematics and Informatics of the Faculty of Sciences in Novi Sad. Analysis of extracted metadata from these publications showed that the performance of the system for the previously unseen data is in accordance with that obtained by the cross‐validation from eight separate SVM classifiers. This system will help in the process of synchronising metadata from CRIS UNS with other institutional repositories.
Originality/value
The paper documents a fully automated system for metadata extraction from scientific papers. The system is based on the SVM classifier and open source tools, and is capable of extracting eight types of metadata from scientific articles in any format that can be converted to PDF. Although developed as part of CRIS UNS, the proposed system can be integrated into other CRIS systems, as well as institutional repositories and library management systems.
Mohammed Ourabah Soualah, Yassine Ait Ali Yahia, Abdelkader Keita and Abderrezak Guessoum
The purpose of this paper is to provide online access to digitised images of Arabic manuscripts, which requires the use of a catalogue. Bibliographic cataloguing is unsuitable…
Abstract
Purpose
The purpose of this paper is to provide online access to digitised images of Arabic manuscripts, which requires the use of a catalogue. Bibliographic cataloguing is unsuitable for old Arabic manuscripts, so it is imperative to establish a new cataloguing model. The authors propose a new cataloguing model based on manuscript annotations and transcriptions, which can be an effective solution for dynamically cataloguing old Arabic manuscripts. In this context, the authors use automatic extraction of metadata based on the structural similarity of the documents.
Design/methodology/approach
This work is based on an experimental methodology. All of the proposed concepts and formulas were tested for validation, which allows the authors to draw concise conclusions.
Findings
Cataloguing old Arabic manuscripts faces the problem of unavailable information. However, this information may be found elsewhere, in a copy of the original manuscript. Thus, cataloguing an Arabic manuscript cannot be done in a single pass; it is a continual process that requires information updating. The idea is to pre-catalogue a manuscript and then complete and improve the record through a specific platform. Consequently, the authors propose a new cataloguing model, which they call “dynamic cataloguing”.
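Viewed as code, dynamic cataloguing amounts to an incremental merge of partial records; the following minimal Python sketch (with invented fields and values, not the authors' XML-based implementation) shows the idea of filling only the fields that are still missing:

```python
# Illustrative sketch only: "dynamic cataloguing" as an incremental merge,
# where metadata recovered later (e.g. from user annotations or from a
# copy of the manuscript) fills fields still missing in the record.
def update_record(record, extracted):
    for field, value in extracted.items():
        if value and not record.get(field):
            record[field] = value
    return record

record = {"title": "Kitab al-Hypothetical", "author": None, "copy_date": None}
# Metadata later recovered from an annotated copy (hypothetical values).
record = update_record(record, {"author": "Ibn Example", "copy_date": "1243 AH"})
print(record)
```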
Research limitations/implications
The success of the proposed model depends on the involvement of all of its actors; it relies on the conviction and motivation of the actors of the collaborative platform.
Practical implications
The model can be used in several cataloguing fields where the encoding is based on XML. It is innovative, implements smart cataloguing, and is usable through a web platform, which allows automatic updating of a catalogue.
Social implications
The model prompts users to participate in enriching the catalogue, allowing them to move from a passive role to an active one.
Originality/value
The dynamic cataloguing model is a new concept that has not previously been proposed in the literature. The model is based on automatic extraction of metadata from user annotations and transcriptions; it is a smart system that automatically updates or fills the catalogue with the extracted metadata.
The purpose of this paper is to develop a system that can convert PDF files to XML files.
Abstract
Purpose
The purpose of this paper is to develop a system that can convert PDF files to XML files.
Design/methodology/approach
The system uses XML as the information display model and XSLT as the information extraction rules. The process is illustrated by converting a scientific and technical paper in PDF into a valid XML file.
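A minimal illustration of using XSLT as the extraction rule is given below, using Python and lxml; the intermediate markup and the stylesheet are invented for the example and are not the paper's actual rules:

```python
# Illustrative sketch only: applying an XSLT "extraction rule" to an
# intermediate XML form of the document, in the spirit described.
from lxml import etree

intermediate = etree.fromstring(
    "<page><line size='18'>A Study of PDF Conversion</line>"
    "<line size='10'>Body text...</line></page>")

stylesheet = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/page">
    <article>
      <title><xsl:value-of select="line[@size &gt; 14]"/></title>
    </article>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(stylesheet)
print(etree.tostring(transform(intermediate), pretty_print=True).decode())
# -> <article><title>A Study of PDF Conversion</title></article>
```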
Findings
Because the PDF file adopts a self-descriptive definition, its content information and display information exist in different objects; therefore, it is not easy to extract information directly from the PDF source file. The indirect way to solve this problem in the system design was to convert the PDF source file into an intermediate format that is easier to process, which can then be automatically converted to the target file in accordance with the relevant rules.
Originality/value
It is important to be able to easily and conveniently extract information from PDF files and this paper shows how it can be done. The design ideas contained in the paper can also be applied to information extraction from other types of files.