Search results

1 – 10 of over 7000
Article
Publication date: 2 December 2020

Yohanes Sigit Purnomo W.P., Yogan Jaya Kumar and Nur Zareen Zulkarnain

Abstract

Purpose

Extracting information from unstructured data is a challenging task for computational linguistics. Statements by public figures that journalists attribute in a story are one type of information that can be processed into structured data. A knowledge base built from such data would therefore be valuable for downstream uses such as opinion mining, claim detection and fact-checking. This study aims to understand statement extraction tasks and the models already applied to them in order to formulate a framework for further study.

Design/methodology/approach

This paper presents a literature review from selected previous research that specifically addresses the topics of quotation extraction and quotation attribution. Research works that discuss corpus development related to quotation extraction and quotation attribution are also considered. The findings of the review will be used as a basis for proposing a framework to direct further research.

Findings

There are three findings in this study. Firstly, the extraction process still consists of two main tasks, namely, the extraction of quotations and the attribution of quotations. Secondly, most extraction algorithms rely on a rule-based algorithm or traditional machine learning. Lastly, the available corpora are limited in quantity and depth. Based on these findings, a statement extraction framework for Indonesian language corpus and model development is proposed.
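
The two tasks named in these findings, quotation extraction and quotation attribution, can be illustrated with a minimal rule-based sketch. It is not the framework proposed in the paper: the pattern, the list of reporting verbs and the function name `extract_statements` are all assumptions made for illustration, and the rule handles only straight double quotes followed by a verb and a capitalised name.

```python
import re

# Illustrative list of reporting verbs (an assumption, not taken from the paper).
REPORTING_VERBS = ("said", "stated", "claimed", "added")

# Hypothetical rule: "<quote>" <reporting verb> <Capitalised Name>
QUOTE_PATTERN = re.compile(
    r'"(?P<quote>[^"]+)"\s*,?\s*(?P<verb>' + "|".join(REPORTING_VERBS) + r")\s+"
    r"(?P<speaker>[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)"
)


def extract_statements(text):
    """Return (speaker, quote) pairs found by the single rule above."""
    return [
        # Extraction gives the quoted span; attribution gives the speaker.
        (match.group("speaker"), match.group("quote").rstrip(","))
        for match in QUOTE_PATTERN.finditer(text)
    ]
```

Calling `extract_statements('"The budget will be revised," said Jane Doe on Monday.')` yields one pair linking the invented speaker Jane Doe to the quotation. Indirect speech, curly quotation marks and speakers named before the quote all fall outside this single rule, which is one reason the reviewed literature also turns to machine learning.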

Originality/value

The paper serves as a guideline to formulate a framework for statement extraction based on the findings from the literature study. The proposed framework includes a corpus development in the Indonesian language and a model for public figure statement extraction. Furthermore, this study could be used as a reference to produce a similar framework for other languages.

Details

Global Knowledge, Memory and Communication, vol. 70 no. 6/7
Type: Research Article
ISSN: 2514-9342

Article
Publication date: 8 July 2010

Andreas Vlachidis, Ceri Binding, Douglas Tudhope and Keith May

Downloads
839

Abstract

Purpose

This paper sets out to discuss the use of information extraction (IE), a natural language‐processing (NLP) technique to assist “rich” semantic indexing of diverse archaeological text resources. The focus of the research is to direct a semantic‐aware “rich” indexing of diverse natural language resources with properties capable of satisfying information retrieval from online publications and datasets associated with the Semantic Technologies for Archaeological Resources (STAR) project.

Design/methodology/approach

The paper proposes use of the English Heritage extension (CRM‐EH) of the standard core ontology in cultural heritage, CIDOC CRM, and exploitation of domain thesauri resources for driving and enhancing an Ontology‐Oriented Information Extraction process. The process of semantic indexing is based on a rule-based Information Extraction technique, which is facilitated by the General Architecture for Text Engineering (GATE) toolkit and expressed by Java Annotation Pattern Engine (JAPE) rules.

Findings

Initial results suggest that the combination of information extraction with knowledge resources and standard conceptual models is capable of supporting semantic‐aware term indexing. Additional efforts are required for further exploitation of the technique and adoption of formal evaluation methods for assessing the performance of the method in measurable terms.

Originality/value

The value of the paper lies in the semantic indexing of 535 unpublished online documents, often referred to as “Grey Literature”, from the Archaeology Data Service OASIS corpus (Online AccesS to the Index of archaeological investigationS), with respect to the CRM ontological concepts E49.Time Appellation and E19.Physical Object.

Details

Aslib Proceedings, vol. 62 no. 4/5
Type: Research Article
ISSN: 0001-253X

Open Access
Article
Publication date: 14 August 2020

Paramita Ray and Amlan Chakrabarti

Abstract

Social networks have significantly changed communication patterns. Information available from different social networking sites can be used to analyze users’ opinions. Organizations would therefore benefit from a platform that analyzes public sentiment on social media about their products and services and so adds value to their business processes. Over the last few years, deep learning has become very popular in areas such as image classification and speech recognition. However, research on the use of deep learning methods in sentiment analysis is limited. It has been observed that in some cases existing machine learning methods for sentiment analysis fail to extract some implicit aspects and might not be very useful. Therefore, we propose a deep learning approach for extracting aspects from text and analyzing users’ sentiment toward each aspect. A seven-layer deep convolutional neural network (CNN) is used to tag each aspect in the opinionated sentences. We combine the deep learning approach with a set of rule-based approaches to improve the performance of both the aspect extraction method and the sentiment scoring method. We also try to improve the existing rule-based approach to aspect extraction by categorizing aspects into a predefined set of categories using a clustering method, and we compare our proposed method with some state-of-the-art methods. The overall accuracy of our proposed method is 0.87, while that of other state-of-the-art methods such as the modified rule-based method and the CNN is 0.75 and 0.80 respectively. The overall accuracy of our proposed method is thus 7–12 percentage points higher than that of the state-of-the-art methods.
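
The reported 7–12 range can be checked directly from the three accuracy figures in the abstract. The short sketch below does only that arithmetic; the function name `percentage_point_gain` is a hypothetical helper, and the result is a difference in percentage points rather than a relative improvement.

```python
def percentage_point_gain(proposed, baseline):
    """Difference between two accuracies, expressed in percentage points."""
    return round((proposed - baseline) * 100, 1)


# Accuracy figures quoted in the abstract.
PROPOSED = 0.87
BASELINES = {"modified rule-based method": 0.75, "CNN": 0.80}

gains = {
    name: percentage_point_gain(PROPOSED, accuracy)
    for name, accuracy in BASELINES.items()
}
```

Here `gains` maps the modified rule-based method to 12.0 and the CNN to 7.0, which matches the range stated in the abstract.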

Details

Applied Computing and Informatics, vol. 18 no. 1/2
Type: Research Article
ISSN: 2634-1964

Article
Publication date: 1 June 2015

Quang-Minh Nguyen and Tuan-Dung Cao

Abstract

Purpose

The purpose of this paper is to propose an automatic method to generate semantic annotations of football transfers in the news. Current automatic news integration systems on the Web constantly face the challenge of the diversity and heterogeneity of sources. Syntax-based approaches to information representation and storage have certain limitations in searching, sorting, organizing and appropriately linking news. Semantic representation models promise to be the key to solving these problems.

Design/methodology/approach

The authors’ approach leverages Semantic Web technologies to improve the performance of detection of hidden annotations in the news. The paper proposes an automatic method to generate semantic annotations based on named entity recognition and rule-based information extraction. The authors have built a domain ontology and knowledge base integrated with the knowledge and information management (KIM) platform to implement the former task (named entity recognition). The semantic extraction rules are constructed based on defined language models and the developed ontology.

Findings

The proposed method is implemented as a part of the sport news semantic annotations-generating prototype BKAnnotation. This component is a part of the sport integration system based on Web Semantics BKSport. The semantic annotations generated are used for improving features of news searching – sorting – association. The experiments on the news data from SkySport (2014) channel showed positive results. The precisions achieved in both cases, with and without integration of the pronoun recognition method, are both over 80 per cent. In particular, the latter helps increase the recall value to around 10 per cent.

Originality/value

This is one of the initial proposals in automatic creation of semantic data about news, football news in particular and sport news in general. The combination of ontology, knowledge base and patterns of language model allows detection of not only entities with corresponding types but also semantic triples. At the same time, the authors propose a pronoun recognition method using extraction rules to improve the relation recognition process.

Details

International Journal of Pervasive Computing and Communications, vol. 11 no. 2
Type: Research Article
ISSN: 1742-7371

Article
Publication date: 6 August 2019

Abir Boujelben and Ikram Amous

Abstract

Purpose

One key issue in maintaining Web information systems is guaranteeing the consistency of their knowledge base, in particular, the rules governing them. There are currently few methods that can ensure that rule base management scales to the amount of knowledge in these systems’ environments.

Design/methodology/approach

In this paper, the authors propose a method to detect correct dependencies between rules. This work represents a preliminary step for a proposal to eliminate rule base anomalies. The authors previously developed a method that aimed to ameliorate the extraction of rules dependency relationships using a new technique. In this paper, they extend the proposal with other techniques to increase the number of extracted rules dependency relationships. The authors also add some modules to filter and represent them.

Findings

The authors evaluated their own method against other semantic methods. The results show that this work succeeded in extracting better numbers of correct rules dependency relationships. They also noticed that the rule groups deduced from this method’s results are very close to those provided by the rule bases developers.

Originality/value

This work can be applied to knowledge bases that include a fact base and a rule base. In addition, it is independent of the field of application.

Details

International Journal of Web Information Systems, vol. 15 no. 5
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 10 August 2010

Yang Hai‐feng, Zhang Ji‐fu and Hu Li‐hua

Downloads
179

Abstract

Purpose

The purpose of this paper is to examine the important application value of extending the concept of classification rule, so that it can describe and measure the uncertainty of classification knowledge.

Design/methodology/approach

The rough concept lattice (RCL), which is an effective tool for uncertain data analysis and knowledge discovery, reflects a kind of unification of concept intent and upper/lower approximation extent, as well as the certain and uncertain relations between objects and attributes.

Findings

A classification rule extraction algorithm based on the RCL, the extraction algorithm of classification rule (EACR), is presented by adapting the rough degree to measure the uncertainty of a classification rule. The EACR algorithm is experimentally validated by taking star spectrum data as the decision context.

Practical implications

An efficient way for classification rule extraction is provided.

Originality/value

The algorithm EACR based on the RCL is presented by adapting the rough degree to measure uncertainty of classification rule.

Details

Kybernetes, vol. 39 no. 8
Type: Research Article
ISSN: 0368-492X

Article
Publication date: 1 July 2014

Wen-Feng Hsiao, Te-Min Chang and Erwin Thomas

Abstract

Purpose

The purpose of this paper is to propose an automatic metadata extraction and retrieval system to extract bibliographical information from digital academic documents in Portable Document Format (PDF).

Design/methodology/approach

The authors use PDFBox to extract text and font size information, a rule-based method to identify titles, and a Hidden Markov Model (HMM) to extract the titles and authors. Finally, the extracted titles and authors (possibly incorrect or incomplete) are sent as query strings to digital libraries (e.g. ACM, IEEE, CiteSeerX, SDOS, and Google Scholar) to retrieve the rest of the metadata.
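
A rule-based title step that works from font size could look like the sketch below. The abstract does not state the paper's actual rules, so this is only a common heuristic: it assumes the title is the first run of consecutive lines set in the largest font, and that a PDF text extractor has already produced `(text, font_size)` pairs. The function name `guess_title` and the input shape are hypothetical.

```python
def guess_title(lines):
    """Guess a title from (text, font_size) pairs given in reading order.

    Heuristic (an assumption, not the paper's rule): the title is the first
    run of consecutive lines that use the largest font size on the page.
    """
    if not lines:
        return ""
    largest = max(size for _, size in lines)
    title_parts = []
    for text, size in lines:
        if size == largest:
            title_parts.append(text.strip())
        elif title_parts:
            # The first smaller-font line after the title run ends the title.
            break
    return " ".join(title_parts)


# Invented first-page lines, for illustration only.
first_page = [
    ("Journal of Examples", 9.0),
    ("Automatic Metadata", 18.0),
    ("Extraction from PDFs", 18.0),
    ("A. Author", 11.0),
]
```

For `first_page`, `guess_title` joins the two 18-point lines into one title string. Pages with large-font running headers or drop caps break the heuristic, which is in line with the abstract's remark that the extracted titles may be incorrect or incomplete before the digital library lookup.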

Findings

Four experiments are conducted to examine the feasibility of the proposed system. The first experiment compares two different HMM models: a multi-state model and a one-state model (the proposed model). The result shows that the one-state model can achieve performance comparable to the multi-state model, but is more suitable for dealing with real-world unknown states. The second experiment shows that the proposed model (without the aid of online queries) can perform as well as other researchers’ models on the Cora paper header dataset. In the third experiment, the paper examines the performance of the system on a small dataset of 43 real PDF research papers. The result shows that the proposed system (with online queries) performs well on bibliographical data extraction and even outperforms the free citation management tool Zotero 3.0. Finally, the paper conducts the fourth experiment with a larger dataset of 103 papers to compare the system with Zotero 4.0. The result shows that the system significantly outperforms Zotero 4.0. The feasibility of the proposed model is thus justified.

Research limitations/implications

In terms of academic implications, the system is unique in two respects. First, it uses only the Cora header set for HMM training, without other tagged datasets or gazetteer resources, which means the system is light and scalable. Second, the system is workable and can be applied to extracting metadata from real-world PDF files. The extracted bibliographical data can then be imported into citation software such as EndNote or RefWorks to increase researchers’ productivity.

Practical implications

In terms of practical implications, the system can outperform the existing tool, Zotero v4.0. This gives practitioners a good opportunity to develop similar products for real applications, though doing so might require some knowledge of HMM implementation.

Originality/value

The HMM implementation is not novel. What is innovative is that it actually combines two HMM models. The main model is adapted from Freitag and McCallum (1999) and the authors add word features of the Nymble HMM (Bikel et al., 1997) to it. The system is workable even without manually tagging the datasets before training the model (the authors just use the Cora dataset to train and test on real-world PDF papers), which differs significantly from what other works have done so far. The experimental results provide sufficient evidence of the feasibility of the proposed method in this respect.

Details

Program, vol. 48 no. 3
Type: Research Article
ISSN: 0033-0337

Article
Publication date: 23 August 2013

Auhood Alfaries, David Bell and Mark Lycett

Abstract

Purpose

The purpose of the research is to speed up the process of semantic web services by transformation of current Web services into semantic web services. This can be achieved by applying ontology learning techniques to automatically extract domain ontologies.

Design/methodology/approach

The work here presents a Service Ontology Learning Framework (SOLF), the core aspect of which extracts Structured Interpretation Patterns (SIP). These patterns are used to automate the acquisition (from production domain specific Web Services) of ontological concepts and the relations between those concepts.

Findings

A Semantic Web of accessible and re‐usable software services is able to support the increasingly dynamic and time‐limited development process. This is premised on the efficient and effective creation of supporting domain ontology.

Research limitations/implications

Although WSDL documents provide important application-level service descriptions, they alone are not sufficient for ontology learning (OL), because they typically provide technical descriptions only and, in many cases, Web services use XSD files to provide data type definitions. The need to include (and combine) other Web service resources in the OL process is therefore an important one.

Practical implications

Web service domain ontologies are the general means by which semantics are added to Web services; they are typically used as a common domain model and referenced by annotated or externally described Web artefacts (e.g. Web services). The development and deployment of Semantic Web services by enterprises and the wider business community has the potential to radically improve planned and ad‐hoc service re‐use. Progress in practice is slower, however, in good part because developing an appropriate ontology is an expensive, error-prone and labor-intensive task. The proposed SOLF aims to overcome this problem by contributing a framework and a tool that can be used to build Web service domain ontologies automatically.

Originality/value

The output of the SOLF process is an automatically generated OWL domain ontology, a basis from which future Semantic Web Services can be delivered using existing Web services. It can be seen that the ontology created moves beyond basic taxonomy – extracting and relating concepts at a number of levels. More importantly, the approach provides integrated knowledge (represented by the individual WSDL documents) from a number of domain experts across a group of banks.

Article
Publication date: 14 November 2016

Monireh Ebrahimi, Amir Hossein Yazdavar, Naomie Salim and Safaa Eltyeb

Abstract

Purpose

Many opinion-mining systems and tools have been developed to provide users with the attitudes of people toward entities and their attributes or the overall polarities of documents. In addition, side effects are one of the critical measures used to evaluate a patient’s opinion of a particular drug. However, side effect recognition is a challenging task, since side effects coincide with disease symptoms lexically and syntactically. The purpose of this paper is to extract drug side effects from drug reviews as integral implicit-opinion words.

Design/methodology/approach

This paper proposes a detection algorithm for a medical-opinion-mining system using rule-based and support vector machine (SVM) algorithms. A corpus of 225 drug reviews was manually annotated by a medical expert for training and testing.

Findings

The results show that SVM significantly outperforms a rule-based algorithm. However, the results of both algorithms are encouraging and a good foundation for future research. Obviating the limitations and exploiting combined approaches would improve the results.

Practical implications

Automatic extraction of adverse drug effect information from online text can help regulatory authorities screen and extract information rapidly instead of inspecting it manually, and can contribute to faster medical decision support and safety alert generation.

Originality/value

The results of this study can help database curators in compiling adverse drug effects databases and researchers to digest the huge amount of textual online information which is growing rapidly.

Details

Online Information Review, vol. 40 no. 7
Type: Research Article
ISSN: 1468-4527

Open Access
Article
Publication date: 7 June 2018

Zhang Yanjie and Sun Hongbo

Abstract

Purpose

For many pattern recognition problems, the relation between the sample vectors and the class labels is known during the data acquisition procedure. However, finding the useful rules or knowledge hidden in the data is both important and challenging. Rule extraction methods are very useful for mining the important and heuristic knowledge hidden in the original high-dimensional data. They can help construct predictive models that use only a few attributes of the data, providing valuable model interpretability and shorter training times.

Design/methodology/approach

In this paper, a novel rule extraction method with the application of biclustering algorithm is proposed.

Findings

To choose the most significant biclusters from the huge number of detected biclusters, a specially modified information entropy calculation method is also provided. It will be shown that all of the important knowledge is in practice hidden in these biclusters.

Originality/value

The novelty of the new method lies in the fact that the detected biclusters can be conveniently translated into if-then rules. It provides an intuitively explainable and comprehensive approach to extracting rules from high-dimensional data while maintaining high classification accuracy.
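
How a bicluster might be turned into an if-then rule can be sketched as follows. The abstract does not describe the paper's procedure, so this is an illustration under stated assumptions: a bicluster is a set of rows and columns on which every selected row has the same values, the rule's conclusion is the majority class label of those rows, and the names `bicluster_to_rule`, `samples` and `classes` are hypothetical.

```python
from collections import Counter


def bicluster_to_rule(data, labels, rows, cols):
    """Translate a constant-valued bicluster into an if-then rule.

    data   : list of dicts mapping attribute name to value
    labels : class label for each row of data
    rows   : row indices belonging to the bicluster
    cols   : attribute names belonging to the bicluster
    Returns (rule_text, confidence); confidence is the share of bicluster
    rows that carry the majority label.
    """
    conditions = {}
    for col in cols:
        values = {data[r][col] for r in rows}
        if len(values) != 1:
            raise ValueError(f"attribute {col!r} is not constant in the bicluster")
        conditions[col] = values.pop()
    majority_label, count = Counter(labels[r] for r in rows).most_common(1)[0]
    antecedent = " AND ".join(f"{col} = {val}" for col, val in conditions.items())
    return f"IF {antecedent} THEN class = {majority_label}", count / len(rows)


# Invented toy data, for illustration only.
samples = [
    {"a": 1, "b": 0, "c": 1},
    {"a": 1, "b": 0, "c": 0},
    {"a": 0, "b": 1, "c": 1},
]
classes = ["pos", "pos", "neg"]
```

With the toy data, `bicluster_to_rule(samples, classes, rows=[0, 1], cols=["a", "b"])` produces a rule over attributes `a` and `b` with full confidence, because both selected rows agree on those attributes and share a label. Only two attributes appear in the rule, which matches the abstract's point about predictive models built from few attributes.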

Details

International Journal of Crowd Science, vol. 2 no. 2
Type: Research Article
ISSN: 2398-7294
