Search results

1–10 of over 14,000
Article
Publication date: 14 August 2017

Xiu Susie Fang, Quan Z. Sheng, Xianzhi Wang, Anne H.H. Ngu and Yihong Zhang

Abstract

Purpose

This paper aims to propose a system for generating actionable knowledge from Big Data and use this system to construct a comprehensive knowledge base (KB), called GrandBase.

Design/methodology/approach

In particular, this study extracts new predicates from four types of data sources, namely, Web texts, Document Object Model (DOM) trees, existing KBs and query stream to augment the ontology of the existing KB (i.e. Freebase). In addition, a graph-based approach to conduct better truth discovery for multi-valued predicates is also proposed.
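
As an illustration of the graph-based idea behind truth discovery for multi-valued predicates, the sketch below runs a simple trust-weighted voting loop over a source-claim graph and keeps every value whose confidence clears a threshold, instead of forcing a single winner per predicate. The sources, values, threshold and update rule are assumptions for exposition, not the GrandBase implementation.

```python
# Hypothetical sketch of multi-valued truth discovery over a source-claim
# graph (not the authors' GrandBase code). Each source claims a set of
# values for one (subject, predicate); source trust and value confidence
# are re-estimated iteratively, and every value above a threshold is kept.
from collections import defaultdict

claims = {                        # source -> values claimed for one (subject, predicate)
    "web_text":  {"C++", "Python"},
    "dom_tree":  {"C++", "Java"},
    "query_log": {"C++", "Python", "Java"},
}

trust = {s: 0.5 for s in claims}                  # initial trust in every source
for _ in range(10):                               # simple fixed-point iteration
    conf = defaultdict(float)
    for source, values in claims.items():         # each value collects trust-weighted votes
        for v in values:
            conf[v] += trust[source]
    top = max(conf.values())
    conf = {v: c / top for v, c in conf.items()}  # normalise value confidences to [0, 1]
    for source, values in claims.items():         # source trust = mean confidence of its claims
        trust[source] = sum(conf[v] for v in values) / len(values)

true_values = {v for v, c in conf.items() if c >= 0.6}   # keep every sufficiently confident value
print(true_values)
```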

Findings

Empirical studies demonstrate the effectiveness of the approaches presented in this study and the potential of GrandBase. Future research directions regarding GrandBase construction and extension are also discussed.

Originality/value

To revolutionize modern society with the wisdom of Big Data, numerous KBs have been constructed to feed knowledge-driven applications with Resource Description Framework triples. The key challenges in KB construction are extracting information from large-scale, possibly conflicting and differently structured data sources (i.e. the knowledge extraction problem) and reconciling the conflicts that reside in those sources (i.e. the truth discovery problem). Substantial research effort has been devoted to both problems. However, existing KBs are far from comprehensive and accurate: first, existing knowledge extraction systems retrieve data from only a limited range of Web sources; second, existing truth discovery approaches commonly assume that each predicate has only one true value. This paper focuses on the problem of generating actionable knowledge from Big Data. A system consisting of two phases, namely, knowledge extraction and truth discovery, is proposed to construct a broader KB, called GrandBase.

Details

PSU Research Review, vol. 1 no. 2
Type: Research Article
ISSN: 2399-1747

Article
Publication date: 3 August 2021

Irvin Dongo, Yudith Cardinale, Ana Aguilera, Fabiola Martinez, Yuni Quintero, German Robayo and David Cabeza

This paper aims to perform an exhaustive revision of relevant and recent related studies, which reveals that both extraction methods are currently used to analyze…

Abstract

Purpose

This paper aims to perform an exhaustive review of relevant and recent related studies, which reveals that both extraction methods are currently used to analyze credibility on Twitter. Thus, there is clear evidence of the need for different options to extract different data for this purpose. Nevertheless, none of these studies performs a comparative evaluation of the two extraction techniques. Moreover, the authors extend a previous comparison, which uses a recently developed framework that offers both data extraction alternatives and implements a previously proposed credibility model, by adding a qualitative evaluation and a Twitter Application Programming Interface (API) performance analysis from different locations.

Design/methodology/approach

As one of the most popular social platforms, Twitter has been the focus of recent research aimed at analyzing the credibility of shared information. To do so, several proposals use either the Twitter API or Web scraping to extract the data for the analysis. Qualitative and quantitative evaluations are performed to discover the advantages and disadvantages of both extraction methods.
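
For readers unfamiliar with the two extraction routes being compared, the sketch below times a minimal fetch of one tweet via the official API (using the tweepy client) against a plain page scrape. The bearer token, tweet ID, URL and the crude text grab are placeholders, not the paper's framework code; real Twitter pages are JavaScript-rendered and change often, which is exactly the fragility the findings discuss.

```python
# Illustrative comparison of the two extraction routes (assumed placeholders
# throughout; not the authors' framework). In practice the scraped page is
# JavaScript-rendered, so a headless browser would be needed to get the
# tweet body -- one of the practical differences between the two routes.
import time
import requests
import tweepy
from bs4 import BeautifulSoup

TWEET_ID = "20"                       # placeholder tweet ID
BEARER = "YOUR_BEARER_TOKEN"          # placeholder credential

# Route 1: official Twitter API (v2) via tweepy
t0 = time.perf_counter()
client = tweepy.Client(bearer_token=BEARER)
resp = client.get_tweet(TWEET_ID, tweet_fields=["created_at", "public_metrics"])
api_text = resp.data.text
api_time = time.perf_counter() - t0

# Route 2: Web scraping the public tweet page
t0 = time.perf_counter()
html = requests.get(f"https://twitter.com/i/status/{TWEET_ID}", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
scraped_text = soup.get_text(" ", strip=True)     # crude: grab all visible text
scrape_time = time.perf_counter() - t0

print(f"API:      {api_time:.2f}s -> {api_text[:60]!r}")
print(f"Scraping: {scrape_time:.2f}s -> {scraped_text[:60]!r}")
```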

Findings

The study demonstrates the differences in accuracy and efficiency between the two extraction methods and highlights further problems related to this area in the pursuit of true transparency and legitimacy of information on the Web.

Originality/value

Results report that some Twitter attributes cannot be retrieved by Web scraping. Both methods produce identical credibility values when a robust normalization process is applied to the text (i.e. the tweet). Moreover, concerning time performance, Web scraping is faster than the Twitter API and more flexible in terms of obtaining data; however, Web scraping is very sensitive to website changes. Additionally, the response time of the Twitter API is proportional to the distance from the central server in San Francisco.
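
The abstract does not spell out the "robust normalization process", so the following is only one plausible pipeline (Unicode folding, lower-casing, URL/mention stripping, whitespace collapsing) under which the text obtained via the API and via scraping would reduce to the same string and hence yield the same credibility value.

```python
# One plausible tweet normalisation pipeline (an assumption; the abstract
# does not give the authors' exact steps). The goal is that the same tweet
# fetched via the API and via scraping normalises to an identical string,
# so downstream credibility scores match.
import re
import unicodedata

def normalize_tweet(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)      # fold Unicode variants
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)        # drop URLs
    text = re.sub(r"[@#]\w+", "", text)             # drop mentions and hashtags
    text = re.sub(r"\s+", " ", text).strip()        # collapse whitespace
    return text

print(normalize_tweet("Check THIS  https://t.co/x #news"))   # -> "check this"
print(normalize_tweet("check this #breaking https://t.co/y  "))  # -> "check this"
```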

Details

International Journal of Web Information Systems, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 9 August 2021

Xintong Zhao, Jane Greenberg, Vanessa Meschke, Eric Toberer and Xiaohua Hu

Abstract

Purpose

The output of academic literature has increased significantly due to digital technology, presenting researchers with a challenge across every discipline, including materials science, as it is impossible to manually read and extract knowledge from the millions of published articles. The purpose of this study is to address this challenge by exploring knowledge extraction in materials science, as applied to digital scholarship. An overriding goal is to help inform readers about the status of knowledge extraction in materials science.

Design/methodology/approach

The authors conducted a two-part analysis: a comparison of knowledge extraction methods applied to materials science scholarship, across a sample of 22 articles, followed by a comparison of HIVE-4-MAT, an ontology-based knowledge extraction application, and MatScholar, a named entity recognition (NER) application. This paper covers contextual background and a review of three tiers of knowledge extraction (ontology-based, NER and relation extraction), followed by the research goals and approach.
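
As a generic illustration of the NER tier of knowledge extraction (not MatScholar's pipeline, which relies on a materials-science-trained model), the snippet below runs an off-the-shelf spaCy model over a materials-style sentence; the weak coverage of domain terms such as chemical formulae echoes the paper's call for materials science focused corpora.

```python
# Generic NER illustration, not the MatScholar system. A general-purpose
# English model will miss domain entities such as "Bi2Te3", which is one
# reason the paper stresses the need for materials-science-focused corpora.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")        # general-purpose English model
doc = nlp("Bi2Te3 thin films were annealed at 400 K to improve the "
          "thermoelectric figure of merit reported by Toberer et al.")
for ent in doc.ents:
    print(ent.text, ent.label_)           # e.g. "400 K" may be tagged QUANTITY
```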

Findings

The results indicate three key needs for researchers to consider for advancing knowledge extraction: the need for materials science focused corpora; the need for researchers to define the scope of the research being pursued; and the need to understand the trade-offs among different knowledge extraction methods. This paper also points to future materials science research potential with relation extraction and the increased availability of ontologies.

Originality/value

To the best of the authors’ knowledge, there are very few studies examining knowledge extraction in materials science. This work makes an important contribution to this underexplored research area.

Details

The Electronic Library, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 11 June 2020

Yuh-Min Chen, Tsung-Yi Chen and Lyu-Cian Chen

Abstract

Purpose

Location-based services (LBS) have become an effective commercial marketing tool. However, regarding retail store location selection, it is challenging to collect analytical data. In this study, location-based social network data are employed to develop a retail store recommendation method by analyzing the relationship between user footprint and point-of-interest (POI). According to the correlation analysis of the target area and the extraction of crowd mobility patterns, the features of retail store recommendation are constructed.

Design/methodology/approach

Calculations of industrial density, area category, clustering and area saturation between POIs are designed. Methods such as kernel density estimation and K-means are used to calculate the influence of area relevance on retail store selection.
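
A minimal sketch of the density and clustering calculations named above, using scikit-learn's KernelDensity and KMeans on synthetic POI coordinates; the bandwidth, cluster count and the "saturation" proxy are illustrative assumptions rather than the parameters or data used in the study.

```python
# Illustrative density/clustering calculations on made-up POI coordinates
# (not the paper's Foursquare/Facebook data or tuned parameters).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
poi_xy = rng.uniform(0, 1, size=(200, 2))            # fake POI coordinates (lon, lat)

# Industrial density around a candidate retail site via kernel density estimation
kde = KernelDensity(bandwidth=0.05).fit(poi_xy)
candidate = np.array([[0.4, 0.6]])
density = np.exp(kde.score_samples(candidate))[0]     # density at the candidate site

# Area clustering via K-means on POI locations, plus a crude saturation proxy
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(poi_xy)
area_saturation = np.bincount(labels).max() / len(poi_xy)

print(f"density={density:.2f}, saturation={area_saturation:.2f}")
```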

Findings

The coffee retail industry is used as an example to analyze the retail location recommendation method and assess the accuracy of the method.

Research limitations/implications

This study is mainly limited by the size and density of the datasets. Owing to the limitations imposed by the location-based privacy policy, it is challenging to perform experimental verification using the latest data.

Originality/value

An industrial relevance questionnaire is designed, and the responses are arranged using a simple checklist to conveniently establish a method for filtering the industrial nature of the adjacent areas. The New York and Tokyo datasets from Foursquare and the Tainan city dataset from Facebook are employed for feature extraction and validation. A higher evaluation score is obtained compared with relevant studies with regard to the normalized discounted cumulative gain index.

Details

Online Information Review, vol. 45 no. 2
Type: Research Article
ISSN: 1468-4527

Article
Publication date: 14 May 2019

Ahsan Mahmood, Hikmat Ullah Khan, Zahoor Ur Rehman, Khalid Iqbal and Ch. Muhmmad Shahzad Faisal

Abstract

Purpose

The purpose of this research study is to extract and identify named entities from Hadith literature. Named entity recognition (NER) refers to the identification of named entities in computer-readable text annotated with categorization tags for information extraction. NER is an active research area in information management and information retrieval systems. NER serves as a baseline for machines to understand the context of a given text and helps in knowledge extraction. Although NER is considered a solved task in major languages such as English, in languages such as Urdu it is still a challenging task. Moreover, NER depends on the language and domain of study; thus, it is gaining the attention of researchers in different domains.

Design/methodology/approach

This paper proposes a knowledge extraction framework using finite-state transducers (FSTs) – KEFST – to extract the named entities. KEFST consists of five steps: content extraction, tokenization, part of speech tagging, multi-word detection and NER. An extensive empirical analysis using the data corpus of Urdu translation of Sahih Al-Bukhari, a widely known hadith book, reveals that the proposed method effectively recognizes the entities to obtain better results.
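
To make the finite-state idea concrete, the toy transducer below walks a token stream through a hand-written transition table and emits an entity label when it halts in an accepting state. The states, transitions and example tokens are invented for illustration; KEFST's actual FSTs operate on Urdu text together with part-of-speech and multi-word information.

```python
# Toy finite-state transducer for entity tagging (an illustration of the
# general FST idea, not the authors' KEFST rules).
TRANSITIONS = {                  # (state, token) -> next state
    ("START", "hazrat"): "TITLE",
    ("TITLE", "umar"):   "NAME",
    ("TITLE", "ali"):    "NAME",
}
ACCEPTING = {"NAME": "PERSON"}   # accepting state -> entity label

def tag(tokens):
    entities, i = [], 0
    while i < len(tokens):
        state, j = "START", i
        while j < len(tokens) and (state, tokens[j].lower()) in TRANSITIONS:
            state = TRANSITIONS[(state, tokens[j].lower())]
            j += 1
        if state in ACCEPTING:                       # halted in an accepting state
            entities.append((" ".join(tokens[i:j]), ACCEPTING[state]))
            i = j
        else:
            i += 1
    return entities

print(tag("narrated by Hazrat Umar that".split()))   # [('Hazrat Umar', 'PERSON')]
```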

Findings

The significant performance in terms of f-measure, precision and recall validates that the proposed model outperforms the existing methods for NER in the relevant literature.

Originality/value

This research is novel in that no previous work has been proposed in the Urdu language to extract named entities using FSTs, and no previous work has addressed NER for Urdu hadith data.

Details

The Electronic Library, vol. 37 no. 2
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 24 June 2020

Yilu Zhou and Yuan Xue

Abstract

Purpose

Strategic alliances among organizations are some of the central drivers of innovation and economic growth. However, the discovery of alliances has relied on purely manual search and has had limited scope. This paper proposes a text-mining framework, ACRank, that automatically extracts alliances from news articles. ACRank aims to provide human analysts with a higher coverage of strategic alliances compared to existing databases, yet maintain a reasonable extraction precision. It has the potential to discover alliances involving less well-known companies, a situation often neglected by commercial databases.

Design/methodology/approach

The proposed framework is a systematic process of alliance extraction and validation using natural language processing techniques and alliance domain knowledge. The process integrates news article search, entity extraction, and syntactic and semantic linguistic parsing techniques. In particular, the Alliance Discovery Template (ADT) identifies a number of linguistic templates expanded from expert domain knowledge and extracts potential alliances at the sentence level. Alliance Confidence Ranking (ACRank) further validates each unique alliance based on multiple features at the document level. The framework is designed to deal with extremely skewed, noisy data from news articles.
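
A rough sketch of how a sentence-level linguistic template plus a document-level confidence score might be wired together; the regular expression, feature weights and example sentences are illustrative assumptions, not the ADT templates or ACRank features from the paper.

```python
# Illustrative template-based alliance extraction with a toy document-level
# confidence score (not the paper's ADT/ACRank rules or features).
import re

ADT = re.compile(
    r"(?P<a>[A-Z][\w.&-]*(?:\s[A-Z][\w.&-]*)*)\s+"
    r"(?:announced|formed|signed)\s+(?:a\s+)?"
    r"(?:strategic\s+)?(?:alliance|partnership|joint venture)\s+with\s+"
    r"(?P<b>[A-Z][\w.&-]*(?:\s[A-Z][\w.&-]*)*)")

def extract(sentence):
    m = ADT.search(sentence)
    return (m.group("a"), m.group("b")) if m else None

def confidence(pair, doc_sentences):
    hits = sum(1 for s in doc_sentences if extract(s) == pair)   # repeated mentions
    in_headline = extract(doc_sentences[0]) == pair              # appears in first sentence
    return 0.5 * min(hits, 3) / 3 + 0.5 * in_headline            # toy weighted score

docs = ["IBM formed a strategic alliance with Lenovo to co-develop servers.",
        "Analysts expect IBM shares to rise."]
pair = extract(docs[0])
print(pair, confidence(pair, docs))   # ('IBM', 'Lenovo') with a confidence around 0.67
```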

Findings

Evaluation of ACRank on a gold standard data set of IBM alliances (2006–2008) showed that sentence-level ADT-based extraction achieved 78.1% recall and 44.7% precision and eliminated over 99% of the noise in news articles. ACRank further improved precision to 97% for the top 20% of extracted alliance instances. A further comparison with the Thomson Reuters SDC database showed that SDC covered less than 20% of total alliances, while ACRank covered 67%. When applied to Dow 30 company news articles, ACRank is estimated to achieve a recall between 0.48 and 0.95, and only 15% of the alliances appeared in SDC.

Originality/value

The research framework proposed in this paper indicates a promising direction of building a comprehensive alliance database using automatic approaches. It adds value to academic studies and business analyses that require in-depth knowledge of strategic alliances. It also encourages other innovative studies that use text mining and data analytics to study business relations.

Details

Information Technology & People, vol. 33 no. 5
Type: Research Article
ISSN: 0959-3845

Article
Publication date: 5 October 2015

Oduetse Matsebe, Khumbulani Mpofu, John Terhile Agee and Sesan Peter Ayodeji

Abstract

Purpose

The purpose of this paper is to present a method to extract corner features for map building purposes in man-made structured underwater environments using the sliding-window technique.

Design/methodology/approach

The sliding-window technique is used to extract corner features, and Mechanically Scanned Imaging Sonar (MSIS) is used to scan the environment for map building purposes. The tests were performed with real data collected in a swimming pool.

Findings

The change in application environment and the use of MSIS present some important differences, which must be taken into account when dealing with acoustic data. These include motion-induced distortions, continuous data flow, low scan frequency and high noise levels. Only part of the data stored in each scan sector is important for feature extraction; therefore, a segmentation process is necessary to extract the more significant information. To deal with the continuous flow of data, the data must be separated into 360° scan sectors. Although the vehicle is assumed to be static, there is a drift in both its rotational and translational motions because of currents in the water; these drifts induce distortions in the acoustic images. Therefore, the bearing information and the current vehicle pose corresponding to the selected scan-lines must be stored and used to compensate for motion-induced distortions in the acoustic images. As the data received are very noisy, an averaging filter should be applied to achieve an even distribution of data points, although this is partly achieved through the segmentation process. On the selected sliding window, all point pairs must pass the distance and angle tests before a corner can be initialised. This minimises mapping of outlier data points but can make the algorithm computationally expensive if the selected window is too wide. The results show the viability of this procedure under very noisy data. The technique has been applied to 50 data sets/scan sectors with a success rate of 83 per cent.
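
The distance and angle tests described above can be pictured with the toy sliding-window corner detector below, run on already-segmented 2D scan points; the window size, thresholds and the sample "walls" are assumptions for illustration, not the paper's tuned values.

```python
# Simplified sliding-window corner test on segmented 2D scan points
# (an illustrative reading of the distance and angle tests; thresholds
# and window size are assumed, not the paper's values).
import math

def find_corners(points, window=7, min_dist=0.5, angle_tol=20.0):
    corners = []
    half = window // 2
    for i in range(half, len(points) - half):
        cx, cy = points[i]                       # candidate corner at window centre
        ax, ay = points[i - half]
        bx, by = points[i + half]
        v1 = (ax - cx, ay - cy)
        v2 = (bx - cx, by - cy)
        d1, d2 = math.hypot(*v1), math.hypot(*v2)
        if d1 < min_dist or d2 < min_dist:       # distance test on the point pair
            continue
        cosang = (v1[0] * v2[0] + v1[1] * v2[1]) / (d1 * d2)
        ang = math.degrees(math.acos(max(-1.0, min(1.0, cosang))))
        if abs(ang - 90.0) <= angle_tol:         # angle test: near-right-angle corner
            corners.append(points[i])
    return corners

# Two walls meeting at the origin: points along +x, then along +y
wall = [(x, 0.0) for x in (3, 2.5, 2, 1.5, 1, 0.5)] + [(0.0, 0.0)] + \
       [(0.0, y) for y in (0.5, 1, 1.5, 2, 2.5, 3)]
print(find_corners(wall))          # expect a corner near (0.0, 0.0)
```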

Research limitations/implications

MSIS produces very noisy data, and only limited sensing modalities are available for underwater applications.

Practical implications

The extraction of corner features in structured man-made underwater environments opens the door for SLAM systems to a wide range of applications and environments.

Originality/value

A method to extract corner features for map building purposes in man-made structured underwater environments is presented using the sliding-window technique.

Details

Journal of Engineering, Design and Technology, vol. 13 no. 4
Type: Research Article
ISSN: 1726-0531

Article
Publication date: 1 July 2014

Wen-Feng Hsiao, Te-Min Chang and Erwin Thomas

Abstract

Purpose

The purpose of this paper is to propose an automatic metadata extraction and retrieval system to extract bibliographical information from digital academic documents in portable document formats (PDFs).

Design/methodology/approach

The authors use PDFBox to extract text and font size information, a rule-based method to identify titles, and a Hidden Markov Model (HMM) to extract the titles and authors. Finally, the extracted titles and authors (possibly incorrect or incomplete) are sent as query strings to digital libraries (e.g. ACM, IEEE, CiteSeerX, SDOS and Google Scholar) to retrieve the rest of the metadata.
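
The rule-based title step can be pictured with the heuristic sketch below, which assumes the (text, font size) pairs have already been extracted from page one (the paper uses PDFBox for that) and simply joins the consecutive largest-font lines near the top of the page; the rule and the sample lines are assumptions, not necessarily the authors' exact rule set.

```python
# Illustrative rule-based title pick from (line, font_size) pairs that are
# assumed to have been extracted already. The "largest font near the top of
# page one" rule is a common heuristic and an assumption here.
def pick_title(first_page_lines, top_n=15):
    """first_page_lines: list of (text, font_size) in reading order."""
    head = [(t.strip(), s) for t, s in first_page_lines[:top_n] if t.strip()]
    if not head:
        return None
    max_size = max(s for _, s in head)
    title_lines = []
    for text, size in head:
        if size == max_size:              # collect consecutive largest-font lines
            title_lines.append(text)
        elif title_lines:                 # title block ended
            break
    return " ".join(title_lines)

lines = [("Journal of Examples 48(3)", 9.0),
         ("Automatic metadata extraction from", 18.0),
         ("PDF academic documents", 18.0),
         ("Wen-Feng Hsiao, Te-Min Chang", 11.0)]
print(pick_title(lines))   # -> "Automatic metadata extraction from PDF academic documents"
```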

Findings

Four experiments are conducted to examine the feasibility of the proposed system. The first experiment compares two different HMM models: a multi-state model and a one-state model (the proposed model). The result shows that the one-state model achieves performance comparable to the multi-state model but is more suitable for dealing with real-world unknown states. The second experiment shows that the proposed model (without the aid of online query) achieves performance as good as other researchers' models on the Cora paper-header dataset. The third experiment examines the performance of the system on a small dataset of 43 real PDF research papers. The result shows that the proposed system (with online query) performs well on bibliographical data extraction and even outperforms the free citation management tool Zotero 3.0. Finally, a fourth experiment with a larger dataset of 103 papers compares the system with Zotero 4.0. The result shows that the system significantly outperforms Zotero 4.0. The feasibility of the proposed model is thus justified.

Research limitations/implications

For academic implications, the system is unique in two respects: first, it uses only the Cora header set for HMM training, without other tagged datasets or gazetteer resources, which means the system is light and scalable; second, it is workable and can be applied to extracting metadata from real-world PDF files. The extracted bibliographical data can then be imported into citation software such as EndNote or RefWorks to increase researchers' productivity.

Practical implications

For practical implications, the system can outperform the existing tool, Zotero v4.0. This gives practitioners a good opportunity to develop similar products for real applications, though it might require some knowledge of HMM implementation.

Originality/value

The HMM implementation is not novel. What is innovative is that it combines two HMM models. The main model is adapted from Freitag and McCallum (1999), and the authors add word features from the Nymble HMM (Bikel et al., 1997) to it. The system is workable even without manually tagging the datasets before training the model (the authors use only the Cora dataset for training and test on real-world PDF papers), which is significantly different from what other works have done so far. The experimental results provide sufficient evidence of the feasibility of the proposed method in this respect.

Details

Program, vol. 48 no. 3
Type: Research Article
ISSN: 0033-0337

Article
Publication date: 14 June 2019

Nora Madi, Rawan Al-Matham and Hend Al-Khalifa

Abstract

Purpose

The purpose of this paper is to provide an overall review of grammar checking and relation extraction (RE) literature, their techniques and the open challenges associated with them; and, finally, suggest future directions.

Design/methodology/approach

The review of grammar checking and RE was carried out using the following protocol: the authors prepared research questions, planned the search strategy, defined paper selection criteria to distinguish relevant works, extracted data from these works and, finally, analyzed and synthesized the data.

Findings

The output of error detection models could be used to create a profile of a given writer. Such profiles can be used for author identification, native language identification or even estimating the level of education, to name a few applications. The automatic extraction of relations could be used to build or complete electronic lexical thesauri and knowledge bases.

Originality/value

Grammar checking is the process of detecting and sometimes correcting erroneous words in the text, while RE is the process of detecting and categorizing predefined relationships between entities or words that were identified in the text. The authors found that the most obvious challenge is the lack of data sets, especially for low-resource languages. Also, the lack of unified evaluation methods hinders the ability to compare results.

Details

Data Technologies and Applications, vol. 53 no. 3
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 5 February 2020

Mona Mohamed, Sharma Pillutla and Stella Tomasi

Abstract

Purpose

The purpose of this paper is to establish a new conceptual iterative framework for extracting knowledge from open government data (OGD). OGD is becoming a major source of knowledge and innovation that can generate economic value, if properly used. However, there are currently no standards or frameworks for applying knowledge continuum tactics, techniques and procedures (TTPs) to improve knowledge extraction from OGD in a consistent manner.

Design/methodology/approach

This paper is based on a comprehensive review of the literature on both OGD and knowledge management (KM) frameworks. It provides insights into the extraction of knowledge from OGD by applying a vast array of phased KM TTPs across the OGD lifecycle phases.

Findings

The paper proposes a knowledge iterative value network (KIVN) as a new conceptual model that applies the principles of KM on OGD. KIVN operates through applying KM TTPs to transfer and transform discrete data into valuable knowledge.

Research limitations/implications

This model covers the most important knowledge elicitation steps; however, users who are interested in using KIVN phases may need to slightly customize it based on their environment and OGD policy and procedure.

Practical implications

After validation, the model will facilitate the systematic manipulation of OGD, allowing both data-consuming industries and data-producing governments to establish new business models and governance schemes that make better use of OGD.

Originality/value

This paper offers new perspectives on eliciting knowledge from OGD and discusses a crucial but overlooked area of the OGD arena, namely, knowledge extraction through KM principles.

Details

VINE Journal of Information and Knowledge Management Systems, vol. 50 no. 3
Type: Research Article
ISSN: 2059-5891
