Search results

1 – 10 of 271
Article
Publication date: 7 April 2015

Andreas Vlachidis and Douglas Tudhope

Abstract

Purpose

The purpose of this paper is to present the role and contribution of natural language processing techniques, in particular negation detection and word sense disambiguation in the process of Semantic Annotation of Archaeological Grey Literature. Archaeological reports contain a great deal of information that conveys facts and findings in different ways. This kind of information is highly relevant to the research and analysis of archaeological evidence but at the same time can be a hindrance for the accurate indexing of documents with respect to positive assertions.

Design/methodology/approach

The paper presents a method for adapting the biomedicine-oriented negation algorithm NegEx to the context of archaeology and discusses the evaluation results of the new modified negation detection module. A particular form of polysemy, introduced by the definition of ontology classes and concerning the semantics of small finds in archaeology, is addressed by a domain-specific word-sense disambiguation module.
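
For readers unfamiliar with NegEx, the sketch below illustrates the trigger-and-scope idea it builds on; the trigger phrases, scope window and example are illustrative stand-ins, not the archaeology-adapted trigger list the paper develops.

```python
import re

# Illustrative NegEx-style pre-negation triggers; the paper's
# archaeology-adapted trigger list is not reproduced here.
PRE_NEGATION = [r"\bno\b", r"\bnot\b", r"\babsence of\b", r"\bno evidence of\b"]
SCOPE_WINDOW = 5  # tokens after a trigger treated as negated (illustrative)

def negated_terms(sentence, terms):
    """Return the subset of `terms` that fall inside a negation scope."""
    tokens = sentence.lower().split()
    negated = set()
    for i in range(len(tokens)):
        window = " ".join(tokens[i:i + SCOPE_WINDOW + 1])
        if any(re.match(p, window) for p in PRE_NEGATION):
            scope = " ".join(tokens[i + 1:i + SCOPE_WINDOW + 1])
            negated.update(t for t in terms if t in scope)
    return negated

print(negated_terms("no evidence of roman pottery was recovered", {"pottery"}))
# -> {'pottery'}
```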

Findings

The performance of the negation detection module is compared against a “Gold Standard” that consists of 300 manually annotated pages of archaeological excavation and evaluation reports. The evaluation results are encouraging, delivering overall scores of 89 per cent precision, 80 per cent recall and 83 per cent F-measure. The paper addresses limitations and future improvements of the current work and highlights the need for ontological modelling to accommodate negative assertions.

Originality/value

The discussed NLP modules contribute to the aims of the OPTIMA pipeline, delivering an innovative application of such methods to the semantic annotation of archaeological grey literature with respect to the CIDOC-CRM ontology.

Article
Publication date: 23 August 2013

Ivo Lašek and Peter Vojtáš

Abstract

Purpose

The purpose of this paper is to focus on the problem of named entity disambiguation. The paper disambiguates named entities on a very detailed level: each entity is assigned the concrete identifier of the corresponding Wikipedia article describing it.

Design/methodology/approach

For such a fine-grained disambiguation, a correct representation of the context is crucial. The authors compare various context representations: bag-of-words representation, linguistic representation and structured co-occurrence representation. Models for each representation are described and evaluated. The authors also investigate the possibilities of multilingual named entity disambiguation.
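
As a rough illustration of the simplest of these representations, the sketch below scores candidate Wikipedia articles against a mention's surrounding context with a bag-of-words (TF-IDF) model; the function name and candidate texts are hypothetical, and the paper's linguistic and structured co-occurrence models are richer than this.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def disambiguate(mention_context, candidate_articles):
    """Pick the Wikipedia article whose text is most similar to the
    mention's context (bag-of-words variant, for illustration)."""
    ids, texts = zip(*candidate_articles.items())
    matrix = TfidfVectorizer(stop_words="english").fit_transform(
        list(texts) + [mention_context]
    )
    sims = cosine_similarity(matrix[-1], matrix[:-1])[0]
    return ids[sims.argmax()]

candidates = {
    "Java_(programming_language)": "Java is an object-oriented programming "
                                   "language that runs on a virtual machine",
    "Java_(island)": "Java is an island of Indonesia on the Indian Ocean",
}
print(disambiguate("the code runs on any machine with a virtual machine",
                   candidates))
```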

Findings

Based on this evaluation, the structured co-occurrence representation provides the best disambiguation results. The method also proved successfully applicable to languages other than English.

Research limitations/implications

Despite its good results, the structured co-occurrence context representation has several limitations. It trades precision for recall, which might not be desirable in some use cases. It is also unable to disambiguate two entities of different types that are mentioned under the same name in the same text. These limitations can be overcome by combining it with the other described methods.

Practical implications

The authors provide a ready-made web service, which can be directly plugged into existing applications via a REST interface.

Originality/value

The paper proposes a new approach to named entity disambiguation exploiting various context representation models (bag-of-words, linguistic and structural representations). The authors constructed a comprehensive dataset based on all English Wikipedia articles for named entity disambiguation. They evaluated and compared the individual context representation models on this dataset and also evaluated support for multiple languages.

Details

International Journal of Web Information Systems, vol. 9 no. 3
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 3 June 2019

Hongqi Han, Yongsheng Yu, Lijun Wang, Xiaorui Zhai, Yaxin Ran and Jingpeng Han

Abstract

Purpose

The aim of this study is to present a novel approach based on semantic fingerprinting, which converts inventor records into 128-bit semantic fingerprints, and a clustering algorithm called density-based spatial clustering of applications with noise (DBSCAN). Inventor disambiguation is a method used to discover a unique set of underlying inventors and map a set of patents to their corresponding inventors. Resolving the ambiguities between inventors is necessary to improve the quality of the patent database and to ensure accurate entity-level analysis. Most existing methods are based on machine learning and, while they often show good performance, this comes at the cost of time, computational power and storage space.

Design/methodology/approach

The metadata and textual data in inventor records are first converted into 128-bit semantic fingerprints. Rather than using a string comparison or cosine similarity to calculate the distance between pairwise fingerprint records, a binary number comparison function is used in DBSCAN, which then clusters the inventor records based on this distance to disambiguate inventor names.
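
A minimal sketch of this clustering step, assuming the fingerprints are expanded to 0/1 bit vectors so that scikit-learn's Hamming metric plays the role of the binary comparison function; the synthetic data and the eps threshold are illustrative only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Stand-in fingerprints: 50 "true inventors", each appearing on four
# patent records with ~2% of bits flipped (real fingerprints come from
# the semantic fingerprinting step, not from random data).
true = rng.integers(0, 2, size=(50, 128))
records = np.repeat(true, 4, axis=0)
flips = rng.random(records.shape) < 0.02
records = np.where(flips, 1 - records, records)

# Hamming distance (fraction of differing bits) stands in for the
# binary comparison the paper uses instead of string/cosine similarity.
labels = DBSCAN(eps=0.1, min_samples=2, metric="hamming").fit_predict(records)

# Records sharing a label are treated as the same underlying inventor;
# label -1 marks noise/singletons.
print(len(set(labels)) - (1 if -1 in labels else 0), "inventor clusters")
```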

Findings

Experiments conducted on the PatentsView campaign database of the United States Patent and Trademark Office show that this method disambiguates inventor names with recall greater than 99 per cent, in less time and with substantially smaller storage requirements.

Research limitations/implications

A better semantic fingerprint algorithm and a better distance function may improve precision. Setting different clustering parameters for each block, or using other clustering algorithms, will be considered to improve the accuracy of the disambiguation results even further.

Originality/value

Compared with the existing methods, the proposed method does not rely on feature selection and complex feature comparison computation. Most importantly, running time and storage requirements are drastically reduced.

Details

The Electronic Library, vol. 37 no. 2
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 23 October 2007

Jung‐ran Park

Abstract

Purpose

The purpose of this paper is to present descriptive characteristics of the historical development of concept networks. The linguistic principles, mechanisms and motivations behind the evolution of concept networks are discussed. Implications emanating from the idea of the historical development of concept networks are discussed in relation to knowledge representation and organization schemes.

Design/methodology/approach

Natural language data, including both speech and text, are analyzed by examining the discourse contexts in which a linguistic element such as a polysemous word or homonym occurs. Linguistic literature on the historical development of concept networks is reviewed and analyzed.

Findings

Semantic sense relations in concept networks can be captured in a systematic and regular manner. The mechanism and impetus behind the process of concept network development suggest that semantic senses in concept networks are closely intertwined with pragmatic contexts and discourse structure. The interrelation and permeability of the semantic senses of concept networks are captured on a continuum scale based on three linguistic parameters: concrete shared semantic sense; discourse and text structure; and contextualized pragmatic information.

Research limitations/implications

Research findings signify the critical need for linking discourse structure and contextualized pragmatic information to knowledge representation and organization schemes.

Originality/value

The idea of linguistic characteristics, principles, motivation and mechanisms underlying the evolution of concept networks provides theoretical ground for developing a model for integrating knowledge representation and organization schemes with discourse structure and contextualized pragmatic information.

Details

Journal of Documentation, vol. 63 no. 6
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 13 September 2019

Collins Udanor and Chinatu C. Anyanwu

Abstract

Purpose

Hate speech has in recent times become a troubling development. It has different meanings to different people in different cultures. The anonymity and ubiquity of social media provide a breeding ground for hate speech and make combating it seem like a lost battle. However, what may constitute hate speech in a culturally or religiously neutral society may not be perceived as such in a polarized multi-cultural and multi-religious society like Nigeria. Defining hate speech, therefore, may be contextual. Hate speech in Nigeria may be perceived along ethnic, religious and political boundaries. The purpose of this paper is to check for the presence of hate speech on social media platforms like Twitter and to determine to what degree it is present. It also intends to find out what monitoring mechanisms social media platforms like Facebook and Twitter have put in place to combat hate speech. Lexalytics is a term the authors coined from the words lexical analytics, for the purpose of opinion mining unstructured texts like tweets.

Design/methodology/approach

This research developed Python software called the polarized opinions sentiment analyzer (POSA), adopting an ego social network analytics technique in which an individual’s behavior is mined and described. POSA uses a customized Python N-gram dictionary of local context-based terms that may be considered hate terms. It then applies the Twitter API to stream tweets from popular and trending Nigerian Twitter handles in politics, ethnicity, religion, social activism, racism, etc., and filters the tweets against the custom dictionary, using unsupervised classification of the texts as either positive or negative sentiments. The outcome is visualized using tables, pie charts and word clouds. A similar implementation was also carried out using R-Studio code; both results are compared, and a t-test was applied to determine whether there was a significant difference between them. The research methodology can be classified as both qualitative and quantitative: qualitative in terms of data classification, and quantitative in terms of being able to identify the results as either negative or positive from the computation of text to vector.
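
A minimal sketch of the dictionary-filtering step described above, with a hypothetical two-entry term list standing in for POSA's much larger custom N-gram dictionary; the real system additionally streams live tweets via the Twitter API.

```python
# Hypothetical miniature of the custom N-gram dictionary; POSA's real
# dictionary of local context-based hate terms is far larger.
HATE_NGRAMS = {"get rid of them", "they are vermin"}

def classify(tweet):
    """Label a tweet negative if any dictionary N-gram occurs in it,
    positive otherwise (the unsupervised filtering step above)."""
    text = tweet.lower()
    return "negative" if any(ng in text for ng in HATE_NGRAMS) else "positive"

tweets = ["We should get rid of them all",
          "Great turnout at the rally today"]
for t in tweets:
    print(classify(t), "-", t)
```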

Findings

The findings from the two sets of experiments on POSA and R are as follows. In the first experiment, the POSA software found that the Twitter handles analyzed contained between 33 and 55 percent hate content, while the R results show hate content ranging from 38 to 62 percent. Performing a t-test on both the positive and negative scores for POSA and R-Studio reveals p-values of 0.389 and 0.289, respectively, at an α value of 0.05, implying that there is no significant difference between the results from POSA and R. From the second experiment, performed on 11 local handles with 1,207 tweets, the authors deduce the following: the percentage of hate content classified by POSA is 40 percent, while the percentage classified by R is 51 percent; the accuracy of hate speech classification predicted by POSA is 87 percent, and 86 percent for free speech; and the accuracy of hate speech classification predicted by R is 65 percent, and 74 percent for free speech. This study also reveals that neither Twitter nor Facebook has an automated monitoring system for hate speech, and no benchmark is set to decide the level of hate content allowed in a text. The monitoring is instead done by humans, whose assessment is usually subjective and sometimes inconsistent.
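
The significance test reported here has the following shape in SciPy; the score lists below are placeholders, not the study's actual per-handle scores.

```python
from scipy import stats

# Placeholder sentiment scores; the study's actual per-handle scores
# from POSA and R-Studio are not reproduced here.
posa_scores = [0.41, 0.38, 0.55, 0.33, 0.47]
r_scores = [0.49, 0.43, 0.62, 0.38, 0.51]

t_stat, p_value = stats.ttest_ind(posa_scores, r_scores)
# p > 0.05 would mirror the paper's conclusion of no significant
# difference between the two implementations.
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```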

Research limitations/implications

This study establishes that hate speech is on the increase on social media. It also shows that hate mongers can actually be pinned down by the contents of their messages. The POSA system can be used as a plug-in by Twitter to detect and stop hate speech on its platform. The study was limited to public Twitter handles only. N-grams are effective features for word-sense disambiguation, but the resulting feature vector can take on enormous proportions, in turn increasing the sparsity of the feature vectors.

Practical implications

The findings of this study show that if urgent measures are not taken to combat hate speech, there could be dire consequences, especially in highly polarized societies that are always heated up along religious and ethnic sentiments. On a daily basis, tempers flare on social media over comments made by participants. This study has also demonstrated that it is possible to implement a technology that can track and terminate hate speech in a micro-blog like Twitter. This can also be extended to other social media platforms.

Social implications

This study will help to promote a more positive society, ensuring social media is positively utilized to the benefit of mankind.

Originality/value

The findings can be used by social media companies to monitor user behaviors, and pin hate crimes to specific persons. Governments and law enforcement bodies can also use the POSA application to track down hate peddlers.

Details

Data Technologies and Applications, vol. 53 no. 4
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 21 October 2019

Priyadarshini R., Latha Tamilselvan and Rajendran N.

Abstract

Purpose

The purpose of this paper is to propose a fourfold semantic similarity that results in more accuracy compared to the existing literature. A framework in which the fourfold semantic similarity is applied facilitates change detection in the URL and the recommendation of the source documents. The latest trends in technology emerge with the continuous growth of resources on the collaborative web. This interactive and collaborative web poses big challenges for recent technologies like cloud and big data.

Design/methodology/approach

The enormously growing resources on the web should be accessed in a more efficient manner, and this requires clustering and classification techniques. The resources on the web must also be described in a more meaningful manner.

Findings

They can be described in the form of metadata constituted by the resource description framework (RDF). A fourfold similarity is proposed, compared to the threefold similarity proposed in the existing literature. The fourfold similarity includes semantic annotation based on named entity recognition in the user interface, domain-based concept matching with an improvised score-based classification based on ontology, a sequence-based word sensing algorithm and RDF-based updating of triples. All these similarity measures are aggregated, spanning components such as the semantic user interface, semantic clustering, sequence-based classification and a semantic recommendation system with RDF updating in change detection.
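
As a small illustration of the RDF-based triple-updating component, the rdflib sketch below replaces a stale triple when a change is detected; the namespace, predicate and resource names are hypothetical.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()

doc = EX["virtual-document-42"]
g.add((doc, EX.lastModified, Literal("2019-01-10")))

# When change detection sees a new version of the source document,
# the stale triple is removed so later semantic matching stays current.
g.remove((doc, EX.lastModified, None))
g.add((doc, EX.lastModified, Literal("2019-06-03")))

for s, p, o in g:
    print(s, p, o)
```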

Research limitations/implications

The existing work suggests that linking resources semantically increases the retrieving and searching ability. Previous literature shows that keywords can be used to retrieve linked information from the article to determine the similarity between the documents using semantic analysis.

Practical implications

These traditional systems also suffer from scalability and efficiency issues. The proposed study is to design a model that pulls and prioritizes knowledge-based content from the Hadoop distributed framework. This study also proposes a Hadoop-based pruning system and a recommendation system.

Social implications

The pruning system gives an alert about the dynamic changes in the article (virtual document). The changes in the document are automatically updated in the RDF document. This helps in semantic matching and retrieval of the most relevant source with the virtual document.

Originality/value

The recommendation and detection of changes in the blogs are performed semantically using n-triples and automated data structures. The user-focussed and choice-based crawling proposed in this system also assists collaborative filtering, which in turn recommends the user-focussed source documents. The entire clustering and retrieval system is deployed on multi-node Hadoop in the Amazon AWS environment, and graphs are plotted and analyzed.

Details

International Journal of Intelligent Unmanned Systems, vol. 7 no. 4
Type: Research Article
ISSN: 2049-6427

Article
Publication date: 1 February 2016

Manoj Manuja and Deepak Garg

Abstract

Purpose

Syntax-based text classification (TC) mechanisms have been largely replaced by semantic-based systems in recent years. Semantic-based TC systems are particularly useful in scenarios where the similarity among documents is computed considering the semantic relationships among their terms. Kernel functions have received major attention because of the unprecedented popularity of SVMs in the field of TC. Most kernel functions exploit the syntactic structures of the text, but quite a few also use a priori semantic information for knowledge extraction. The purpose of this paper is to investigate semantic kernel functions in the context of TC.

Design/methodology/approach

This work presents a performance and accuracy analysis of seven semantic kernel functions (Semantic Smoothing Kernel, Latent Semantic Kernel, Semantic WordNet-based Kernel, Semantic Smoothing Kernel having Implicit Superconcept Expansions, Compactness-based Disambiguation Kernel Function, Omiotis-based S-VSM semantic kernel function and Top-k S-VSM semantic kernel) implemented with SVM as the kernel method. All seven semantic kernels are implemented in the SVM-Light tool.
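
As an indication of how such kernels plug into an SVM, the sketch below implements the general shape of a semantic smoothing kernel, K(x, z) = x S Sᵀ zᵀ, with scikit-learn's SVC standing in for SVM-Light; the proximity matrix S and the toy data are random placeholders, whereas the surveyed kernels derive S from WordNet relations, superconcepts, latent semantics and similar resources.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_terms = 100

# Hypothetical term-proximity matrix S (random here; real kernels build
# it from semantic resources).
S = rng.random((n_terms, n_terms))

def semantic_smoothing_kernel(X, Z):
    """K(x, z) = x S S^T z^T: documents are compared after being
    smoothed through the term-proximity matrix. S S^T is positive
    semi-definite, so this is a valid kernel."""
    return X @ S @ S.T @ Z.T

# Toy document-term counts and binary labels, for illustration only.
X = rng.integers(0, 3, size=(40, n_terms)).astype(float)
y = rng.integers(0, 2, size=40)

clf = SVC(kernel=semantic_smoothing_kernel).fit(X, y)
print(clf.predict(X[:5]))
```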

Findings

Performance and accuracy parameters of the seven semantic kernel functions have been evaluated and compared. The experimental results show that the Top-k S-VSM semantic kernel has the highest performance and accuracy among all the evaluated kernel functions, which makes it a preferred building block for kernel methods for TC and retrieval.

Research limitations/implications

A combination of semantic kernel functions with syntactic kernel functions needs to be investigated, as there is scope for further improvement in terms of accuracy and performance in all seven semantic kernel functions.

Practical implications

This research provides an insight into TC using a priori semantic knowledge. Three commonly used data sets are exploited. It will be quite interesting to explore these kernel functions on live web data, which may test their actual utility in real business scenarios.

Originality/value

Comparison of performance and accuracy parameters is the novel point of this research paper. To the best of the authors’ knowledge, this type of comparison has not been done previously.

Details

Program, vol. 50 no. 1
Type: Research Article
ISSN: 0033-0337

Article
Publication date: 5 March 2018

Sajjad Tofighy and Seyed Mostafa Fakhrahmad

Abstract

Purpose

This paper aims to propose a statistical and context-aware feature reduction algorithm that improves sentiment classification accuracy. Classifying reviews of different granularities into two classes, with negative and positive polarities, is among the objectives of sentiment analysis. One of the major issues in sentiment analysis is feature engineering, which severely affects both the time complexity and the accuracy of sentiment classification.

Design/methodology/approach

In this paper, a feature reduction method is proposed that uses context-based knowledge as well as synset statistical knowledge. To do so, a one-dimensional representation proposed for SentiWordNet is used to calculate statistical knowledge comprising the polarity concentration and variation tendency of each synset. Feature reduction involves two phases. In the first phase, features that satisfy both semantic and statistical similarity conditions are put in the same cluster. In the second phase, features are ranked and the lower-ranked features are eliminated. The experiments are conducted with support vector machine (SVM), naive Bayes (NB), decision tree (DT) and k-nearest neighbors (KNN) algorithms to classify the vectors of unigram and bigram features into two classes of positive or negative sentiment.
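
The synset-level polarity scores underlying such statistics can be explored with NLTK's SentiWordNet interface; the sketch below reports a simple pos-minus-neg score per sense as a stand-in, since the paper's exact concentration and variation-tendency formulas are not reproduced here.

```python
# Requires: nltk.download("wordnet") and nltk.download("sentiwordnet")
from nltk.corpus import sentiwordnet as swn

def polarity_per_synset(word):
    """One-dimensional polarity summary for each synset of `word`:
    positive score minus negative score (a stand-in for the paper's
    own statistics)."""
    return {
        s.synset.name(): round(s.pos_score() - s.neg_score(), 3)
        for s in swn.senti_synsets(word)
    }

print(polarity_per_synset("happy"))
```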

Findings

The results showed that the applied clustering algorithm reduces the SentiWordNet synsets to less than half, which in turn reduced the size of the feature vector by less than half. In addition, the accuracy of sentiment classification improved by at least 1.5 per cent.

Originality/value

The presented feature reduction method is the first use of synset clustering for feature reduction. The proposed algorithm first aggregates similar features into clusters and then eliminates the unsatisfactory clusters.

Details

Kybernetes, vol. 47 no. 5
Type: Research Article
ISSN: 0368-492X

Article
Publication date: 16 November 2015

Hoang-Minh Nguyen, Hong-Quang Nguyen, Khoi-Nguyen Tran and Xuan-Vinh Vo

Abstract

Purpose

This paper aims to improve the semantic-disambiguation capability of an information-retrieval system by taking advantage of a well-crafted classification tree. The unstructured nature and sheer volume of information accessible over networks have made it drastically difficult for users to seek relevant information. Many information-retrieval methods have been developed to address this problem, and the keyword-based approach is amongst the most common. Such an approach is often inadequate to cope with the conceptualization associated with user needs and contents. This brings about the problem of semantic ambiguation, which refers to disagreement in the meaning of terms between the parties of a communication due to polysemy, leading to increased complexity and reduced accuracy in information integration, migration, retrieval and other related activities.

Design/methodology/approach

A novel ontology-based search approach, named GeTFIRST (short for Graph-embedded Tree Fostering Information Retrieval SysTem), is proposed to disambiguate keywords semantically. The contribution is twofold. First, a search strategy is proposed to prune irrelevant concepts for accuracy improvement using our Graph-embedded Tree (GeT)-based ontology. Second, a path-based ranking algorithm is proposed to incorporate and reward the content specificity.
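
GeTFIRST's actual scoring is not spelled out in the abstract; as a guess at the general idea of rewarding content specificity, the sketch below ranks matched concepts by their depth in a hypothetical concept tree (deeper, i.e. more specific, concepts score higher).

```python
# Hypothetical concept tree (child -> parent), a stand-in for the
# GeT-based ontology; GeTFIRST's real ranking is not reproduced here.
PARENT = {
    "semiconductor_laser": "laser",
    "laser": "optical_device",
    "optical_device": "device",
    "device": None,
}

def depth(concept):
    """Number of edges from the concept up to the root."""
    d = 0
    while PARENT.get(concept):
        concept, d = PARENT[concept], d + 1
    return d

def rank(matched_concepts):
    """Rank matches so deeper (more specific) concepts come first."""
    return sorted(matched_concepts, key=depth, reverse=True)

print(rank(["device", "semiconductor_laser", "laser"]))
# -> ['semiconductor_laser', 'laser', 'device']
```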

Findings

An empirical evaluation was performed on United States Patent and Trademark Office (USPTO) patent datasets to compare the proposed approach with full-text patent search approaches. The results showed that GeTFIRST handled ambiguous keywords with higher keyword-disambiguation accuracy than traditional search approaches.

Originality/value

The search approach of this paper copes with semantic ambiguation by using the proposed GeT-based ontology and a path-based ranking algorithm.

Details

International Journal of Web Information Systems, vol. 11 no. 4
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 29 April 2021

Heng-Yang Lu, Yi Zhang and Yuntao Du

Abstract

Purpose

Topic models have been widely applied to discover important information from a vast amount of unstructured data. Traditional long-text topic models such as Latent Dirichlet Allocation may suffer from the sparsity problem when dealing with short texts, which mostly come from the Web. These models also suffer from a readability problem when displaying the discovered topics. The purpose of this paper is to propose a novel model called the Sense Unit based Phrase Topic Model (SenU-PTM) to address both the sparsity and readability problems.

Design/methodology/approach

SenU-PTM is a novel phrase-based short-text topic model built on a two-phase framework. The first phase introduces a phrase-generation algorithm that exploits word embeddings and aims to generate phrases from the original corpus. The second phase introduces a new concept of the sense unit, which consists of a set of semantically similar tokens, for modeling topics with the token vectors generated in the first phase. Finally, SenU-PTM infers topics based on these two phases.
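
As a rough stand-in for the phrase-generation phase, gensim's collocation-based Phrases shows the shape of the phrase-mining step, though SenU-PTM's own algorithm is driven by word embeddings rather than the collocation counts used here; the corpus and thresholds are toy values.

```python
from gensim.models.phrases import Phrases, Phraser

# Toy tokenized short texts, for illustration only.
corpus = [
    ["machine", "learning", "on", "short", "texts"],
    ["machine", "learning", "for", "topic", "models"],
    ["short", "texts", "are", "sparse"],
]

# Mine frequent token pairs and freeze the model for fast application.
phrases = Phraser(Phrases(corpus, min_count=1, threshold=0.5))
print([phrases[doc] for doc in corpus])
# Frequent pairs such as "machine learning" come out as machine_learning.
```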

Findings

Experimental results on two real-world and publicly available datasets show the effectiveness of SenU-PTM from the perspectives of topical quality and document characterization. They reveal that modeling topics on sense units can solve the sparsity problem of short texts and improve the readability of topics at the same time.

Originality/value

The originality of SenU-PTM lies in the new procedure of modeling topics on the proposed sense units with word embeddings for short-text topic discovery.

Details

Data Technologies and Applications, vol. 55 no. 5
Type: Research Article
ISSN: 2514-9288
