Search results

1 – 10 of 55
Article
Publication date: 15 March 2021

Putta Hemalatha and Geetha Mary Amalanathan

Adequate resources for learning and training the data are an important prerequisite for developing an efficient classifier with outstanding performance. The data usually follows a…

Abstract

Purpose

Adequate resources for learning and training the data are an important prerequisite for developing an efficient classifier with outstanding performance. The data usually follows a biased distribution of classes, reflecting an unequal distribution of classes within a dataset. This issue is known as the imbalance problem and is one of the most common issues occurring in real-time applications. Learning from imbalanced datasets is a ubiquitous challenge in the field of data mining, as imbalanced data degrades the performance of the classifier by producing inaccurate results.

Design/methodology/approach

In this work, a novel fuzzy-based Gaussian synthetic minority oversampling technique (FG-SMOTE) is proposed to process imbalanced data. The Gaussian SMOTE mechanism is based on the nearest-neighbour concept to balance the ratio between the minority and majority class datasets, and this ratio is balanced using a fuzzy-based Levenshtein distance measure.
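As a rough illustration of the oversampling idea described above (not the authors' FG-SMOTE implementation; the function name, the plain Euclidean neighbour search and the Gaussian interpolation weight are assumptions for this sketch), a Gaussian SMOTE-style step can be written as:

```python
import numpy as np

def gaussian_smote(X_min, n_synthetic, k=5, sigma=0.1, rng=None):
    """Gaussian SMOTE-style oversampling sketch.

    For each synthetic sample: pick a random minority point, pick one of its
    k nearest minority neighbours, and interpolate with a Gaussian-perturbed
    weight instead of the uniform weight used by classic SMOTE.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]            # k nearest neighbours per point

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))                     # random minority sample
        j = neighbours[i, rng.integers(k)]               # one of its neighbours
        lam = np.clip(rng.normal(0.5, sigma), 0.0, 1.0)  # Gaussian interpolation weight
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# usage: grow a toy minority class of 20 points by 80 synthetic samples
X_min = np.random.default_rng(0).normal(size=(20, 4))
print(gaussian_smote(X_min, n_synthetic=80, k=5).shape)  # (80, 4)
```

In the paper, a fuzzy Levenshtein-based distance measure takes the place of the plain Euclidean neighbour search used in this sketch.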

Findings

The performance and accuracy of the proposed algorithm were evaluated using a deep belief network classifier. The results show the efficiency of the fuzzy-based Gaussian SMOTE technique, which achieved an AUC of 93.7%, an F1 score of 94.2% and a geometric mean score of 93.6%, computed from the confusion matrix.

Research limitations/implications

The proposed research still has open challenges that need to be addressed, such as applying FG-SMOTE to multiclass imbalanced datasets and evaluating the dataset imbalance problem in a distributed environment.

Originality/value

The proposed algorithm addresses the fundamental issues and challenges involved in handling imbalanced data; FG-SMOTE aids in balancing the minority and majority class datasets.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 14 no. 2
Type: Research Article
ISSN: 1756-378X

Keywords

Article
Publication date: 23 October 2009

Ching‐Chieh Kiu and Chien‐Sing Lee

The purpose of this paper is to present an automated ontology mapping and merging algorithm, namely OntoDNA, which employs data mining techniques (FCA, SOM, K‐means) to resolve…

Abstract

Purpose

The purpose of this paper is to present an automated ontology mapping and merging algorithm, namely OntoDNA, which employs data mining techniques (FCA, SOM, K‐means) to resolve ontological heterogeneities among distributed data sources in organizational memory and subsequently generate a merged ontology to facilitate resource retrieval from distributed resources for organizational decision making.

Design/methodology/approach

OntoDNA employs unsupervised data mining techniques (FCA, SOM, K-means) to resolve ontological heterogeneities and integrate distributed data sources in organizational memory. Unsupervised methods are needed as an alternative when no prior knowledge is available for managing this knowledge. Given two ontologies to be merged as input, their conceptual pattern is first discovered using FCA. String normalizations are then applied to transform their attributes in the formal context prior to lexical similarity mapping, and mapping rules are applied to reconcile the attributes. Subsequently, SOM and K-means are applied for semantic similarity mapping based on the conceptual pattern discovered in the formal context, reducing the problem size of the SOM clusters as validated by the Davies-Bouldin index. The mapping rules are then applied to discover semantic similarity between ontological concepts in the clusters, and the ontological concepts of the target ontology are updated into the source ontology based on the merging rules. The result is a merged ontology represented as a concept lattice.
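A minimal sketch of the string-normalization and lexical-similarity-mapping step is given below; the normalisation rules, the SequenceMatcher ratio and the threshold are assumptions for illustration, not the OntoDNA code:

```python
import re
from difflib import SequenceMatcher

def normalise(name):
    """Rough string normalisation for concept names: split CamelCase and
    underscores, lowercase, drop punctuation."""
    name = re.sub(r"([a-z])([A-Z])", r"\1 \2", name)
    name = re.sub(r"[_\-]+", " ", name)
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()

def lexical_mappings(source_concepts, target_concepts, threshold=0.85):
    """Pair up concepts whose normalised names are lexically similar."""
    mappings = []
    for s in source_concepts:
        for t in target_concepts:
            score = SequenceMatcher(None, normalise(s), normalise(t)).ratio()
            if score >= threshold:
                mappings.append((s, t, round(score, 3)))
    return mappings

print(lexical_mappings(["JournalArticle", "Author"],
                       ["journal_article", "Writer", "author"]))
# [('JournalArticle', 'journal_article', 1.0), ('Author', 'author', 1.0)]
```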

Findings

In experimental comparisons between PROMPT and the OntoDNA ontology mapping and merging tool based on precision, recall and f-measure, the average mapping result for OntoDNA is 95.97 percent compared with PROMPT's 67.24 percent. In terms of recall, OntoDNA outperforms PROMPT on all paired ontologies except one. For the merging of one paired ontology, PROMPT fails to identify the mapping elements. OntoDNA significantly outperforms PROMPT owing to its use of FCA to capture attributes and the inherent structural relationships among concepts. The better performance of OntoDNA is due to the following reasons. First, semantic problems such as synonymy and polysemy are resolved prior to contextual clustering. Second, the unsupervised data mining techniques (SOM and K-means) reduce the problem size. Third, string matching performs better than PROMPT's linguistic-similarity matching in addressing semantic heterogeneity, which also contributes to the OntoDNA results: string matching resolves concept names based on the similarity between concept names in each cluster, whereas linguistic-similarity matching resolves concept names based on concept-representation structure and relations between concepts.

Originality/value

The OntoDNA automates ontology mapping and merging without the need for any prior knowledge to generate a merged ontology. String matching is shown to perform better than linguistic-similarity matching in resolving concept names. OntoDNA will be valuable for organizations interested in merging ontologies from distributed or different organizational memories; for example, an organization might want to merge its organization-specific ontologies with community standard ontologies.

Details

VINE, vol. 39 no. 4
Type: Research Article
ISSN: 0305-5728

Keywords

Article
Publication date: 1 April 1982

J.J. POLLOCK

Not only does the problem of correcting spelling errors by computer have a long history, it is evidently of considerable current interest as papers and letters on the topic…

Abstract

Not only does the problem of correcting spelling errors by computer have a long history, it is evidently of considerable current interest as papers and letters on the topic continue to appear rapidly. This is not surprising, since techniques useful in detecting and correcting mis‐spellings normally have other important applications. Moreover, both the power of small computers and the routine production of machine‐readable text have increased enormously over the last decade to the point where automatic spelling error detection/correction has become not only feasible but highly desirable.
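A minimal dictionary-based detection/correction sketch of the kind discussed here, using Levenshtein edit distance; the word list, the distance threshold and the tie-breaking rule are illustrative assumptions:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct(word, dictionary, max_dist=2):
    """Detection: a word missing from the dictionary is flagged.
    Correction: return the closest dictionary entry within max_dist."""
    if word in dictionary:
        return word
    best = min(dictionary, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else word

print(correct("spelng", {"spelling", "speaking", "spending"}))  # spelling
```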

Details

Journal of Documentation, vol. 38 no. 4
Type: Research Article
ISSN: 0022-0418

Open Access
Article
Publication date: 17 July 2020

Mukesh Kumar and Palak Rehan

Social media networks such as Twitter, Facebook and WhatsApp are among the most commonly used media for sharing news and opinions and for staying in touch with peers. Messages on Twitter are…


Abstract

Social media networks such as Twitter, Facebook and WhatsApp are among the most commonly used media for sharing news and opinions and for staying in touch with peers. Messages on Twitter are limited to 140 characters, which has led users to create their own novel syntax in tweets to express more in fewer words. The free writing style, use of URLs, markup syntax, inappropriate punctuation, ungrammatical structures and abbreviations make it harder to mine useful information from tweets. For each tweet, we can obtain an explicit time stamp, the name of the user, the social network the user belongs to, and even the GPS coordinates if the tweet is created with a GPS-enabled mobile device. With these features, Twitter is by nature a good resource for detecting and analyzing real-time events happening around the world. Using the speed and coverage of Twitter, we can detect events, i.e. sequences of important keywords being talked about, in a timely manner, which can be used in applications such as natural calamity relief support, earthquake relief support, product launches and suspicious activity detection. Keyword detection from Twitter can be seen as a two-step process: detection of keywords in raw text form (words as posted by the users) and keyword normalization (reforming the users' unstructured words into complete, meaningful English words). In this paper, a keyword detection technique based upon graphs, spanning trees and the PageRank algorithm is proposed. A text normalization technique based upon a hybrid approach using Levenshtein distance, the demetaphone algorithm and dictionary mapping is proposed to work upon the unstructured keywords produced by the proposed keyword detector. The proposed normalization technique is validated using the standard lexnorm 1.2 dataset. The proposed system is used to detect keywords from Twitter text posted in real time, and the detected and normalized keywords are further validated against search engine results at a later time for the detection of events.
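A hedged sketch of the hybrid normalization idea (dictionary lookup, then a phonetic match, then string similarity). The crude phonetic key below merely stands in for the paper's metaphone-style step, and the vocabulary and similarity cutoff are illustrative:

```python
import difflib
import re

def phonetic_key(word):
    """Crude phonetic key (a stand-in for the metaphone step): keep letters,
    collapse repeats, drop vowels after the first character."""
    w = re.sub(r"[^a-z]", "", word.lower())
    w = re.sub(r"(.)\1+", r"\1", w)                      # collapse repeated letters
    return w[:1] + re.sub(r"[aeiou]", "", w[1:])

def normalise_token(token, dictionary):
    """Hybrid normalisation: exact dictionary hit, then phonetic match,
    then closest dictionary word by string similarity."""
    if token in dictionary:
        return token
    key = phonetic_key(token)
    phonetic_hits = [w for w in dictionary if phonetic_key(w) == key]
    if phonetic_hits:
        return phonetic_hits[0]
    close = difflib.get_close_matches(token, dictionary, n=1, cutoff=0.6)
    return close[0] if close else token

vocab = ["earthquake", "tomorrow", "great"]
print([normalise_token(t, vocab) for t in ["erthquake", "2moro", "gr8"]])
# e.g. ['earthquake', 'tomorrow', 'gr8'] — 'gr8' would need an abbreviation dictionary
```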

Details

Applied Computing and Informatics, vol. 17 no. 2
Type: Research Article
ISSN: 2634-1964

Keywords

Article
Publication date: 12 November 2018

Gunikhan Sonowal and KS Kuppusamy

This paper aims to propose a model entitled MMSPhiD (multidimensional similarity metrics model for screen reader user to phishing detection) that amalgamates multiple approaches…

Abstract

Purpose

This paper aims to propose a model entitled MMSPhiD (multidimensional similarity metrics model for screen reader user to phishing detection) that amalgamates multiple approaches to detect phishing URLs.

Design/methodology/approach

The model consists of three major components: a machine learning-based approach, a typosquatting-based approach and a phoneme-based approach. The major objectives of the proposed model are to detect phishing URLs, typosquatting and phoneme-based domains, and to suggest the legitimate domain targeted by attackers.
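For illustration only (not the MMSPhiD implementation), the typosquatting component can be sketched as an edit-distance comparison of a candidate domain against a list of legitimate domains, with an arbitrarily chosen distance threshold:

```python
def levenshtein(a, b):
    """Standard edit distance (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def likely_typosquat(domain, legitimate, max_dist=2):
    """Flag a domain that is close to, but not equal to, a known brand name,
    and suggest the brand it probably targets."""
    name = domain.lower().split(".")[0]                  # compare the registrable label only
    for brand in legitimate:
        d = levenshtein(name, brand)
        if 0 < d <= max_dist:
            return True, brand
    return False, None

print(likely_typosquat("paypa1.com", ["paypal", "google", "amazon"]))   # (True, 'paypal')
print(likely_typosquat("example.com", ["paypal", "google", "amazon"]))  # (False, None)
```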

Findings

The result of the experiment shows that the MMSPhiD model can successfully detect phishing with 99.03 per cent accuracy. In addition, this paper has analyzed 20 leading domains from Alexa and identified 1,861 registered typosquatting and 543 phoneme-based domains.

Research limitations/implications

The proposed model uses machine learning together with a list-based approach; building and maintaining the list is a limitation.

Practical implications

The results of the experiments demonstrate that the model achieved higher performance due to the incorporation of multi-dimensional filters.

Social implications

In addition, this paper has incorporated the accessibility needs of persons with visual impairments and provides an accessible anti-phishing approach.

Originality/value

This paper assists persons with visual impairments in detecting phoneme-based phishing domains.

Details

Information & Computer Security, vol. 26 no. 5
Type: Research Article
ISSN: 2056-4961

Keywords

Article
Publication date: 4 April 2016

Ilija Subasic, Nebojsa Gvozdenovic and Kris Jack

The purpose of this paper is to describe a large-scale algorithm for generating a catalogue of scientific publication records (citations) from crowd-sourced data, to demonstrate…

Abstract

Purpose

The purpose of this paper is to describe a large-scale algorithm for generating a catalogue of scientific publication records (citations) from crowd-sourced data, to demonstrate how to learn an optimal combination of distance metrics for duplicate detection and to introduce a parallel duplicate clustering algorithm.

Design/methodology/approach

The authors developed the algorithm and compared it with state-of-the-art systems tackling the same problem. They used benchmark data sets (3k data points) to test the effectiveness of the algorithm and a real-life data set (>90 million data points) to test its efficiency and scalability.

Findings

The authors show that duplicate detection can be improved by an additional step they call duplicate clustering. They also show how to improve the efficiency of the map/reduce similarity calculation algorithm by introducing a sampling step. Finally, they find that the system is comparable to state-of-the-art systems for duplicate detection and that it can scale to deal with hundreds of millions of data points.

Research limitations/implications

Academic researchers can use this paper to understand some of the issues of transitivity in duplicate detection and its effects on digital catalogue generation.

Practical implications

Industry practitioners can use this paper as a case study of a large-scale, real-life catalogue generation system that deals with millions of records in a scalable and efficient way.

Originality/value

In contrast to other similarity calculation algorithms developed for map/reduce frameworks, the authors present a specific variant of similarity calculation that is optimized for duplicate detection of bibliographic records by extending the previously proposed e-algorithm based on inverted index creation. In addition, the authors are concerned with more than duplicate detection and investigate how to group the detected duplicates. They develop distinct algorithms for duplicate detection and duplicate clustering and use the canopy clustering idea for multi-pass clustering. The work extends the current state of the art by including the duplicate clustering step and demonstrates new strategies for speeding up map/reduce similarity calculations.
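A simplified, single-machine sketch of the two steps described above: blocking with an inverted index to generate candidate pairs, then merging detected duplicates transitively with union-find. The similarity measure, threshold and toy records are assumptions, and the real system runs these steps on a map/reduce framework:

```python
from collections import defaultdict
from difflib import SequenceMatcher

def candidate_pairs(records):
    """Blocking via an inverted index: only records sharing a token become
    candidate pairs, avoiding a full quadratic comparison."""
    index = defaultdict(set)
    for rid, text in records.items():
        for token in set(text.lower().split()):
            index[token].add(rid)
    pairs = set()
    for rids in index.values():
        rids = sorted(rids)
        for i in range(len(rids)):
            for j in range(i + 1, len(rids)):
                pairs.add((rids[i], rids[j]))
    return pairs

def cluster_duplicates(records, threshold=0.8):
    """Duplicate detection followed by duplicate clustering: similar pairs are
    merged transitively with a union-find structure."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in candidate_pairs(records):
        if SequenceMatcher(None, records[a], records[b]).ratio() >= threshold:
            parent[find(a)] = find(b)                    # union the two clusters

    clusters = defaultdict(list)
    for rid in records:
        clusters[find(rid)].append(rid)
    return list(clusters.values())

recs = {1: "Duplicate detection in citation records",
        2: "Duplicate detection in citation records.",
        3: "An unrelated bibliographic entry"}
print(cluster_duplicates(recs))   # e.g. [[1, 2], [3]]
```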

Details

Program, vol. 50 no. 2
Type: Research Article
ISSN: 0033-0337

Keywords

Article
Publication date: 18 October 2021

Anna Jurek-Loughrey

In the world of big data, data integration technology is crucial for maximising the capability of data-driven decision-making. Integrating data from multiple sources drastically…

Abstract

Purpose

In the world of big data, data integration technology is crucial for maximising the capability of data-driven decision-making. Integrating data from multiple sources drastically expands the power of information and allows us to address questions that are impossible to answer using a single data source. Record Linkage (RL) is a task of identifying and linking records from multiple sources that describe the same real world object (e.g. person), and it plays a crucial role in the data integration process. RL is challenging, as it is uncommon for different data sources to share a unique identifier. Hence, the records must be matched based on the comparison of their corresponding values. Most of the existing RL techniques assume that records across different data sources are structured and represented by the same scheme (i.e. set of attributes). Given the increasing amount of heterogeneous data sources, those assumptions are rather unrealistic. The purpose of this paper is to propose a novel RL model for unstructured data.

Design/methodology/approach

In the previous work (Jurek-Loughrey, 2020), the authors proposed a novel approach to linking unstructured data based on the application of the Siamese Multilayer Perceptron model. It was demonstrated that the method performed on par with other approaches that make constraining assumptions regarding the data. This paper expands the previous work originally presented at iiWAS2020 [16] by exploring new architectures of the Siamese Neural Network, which improves the generalisation of the RL model and makes it less sensitive to parameter selection.
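For orientation, a minimal Siamese multilayer perceptron is sketched below in PyTorch; the layer sizes, the absolute-difference pairing and the binary cross-entropy loss are assumptions for this sketch and do not reproduce the authors' architectures:

```python
import torch
import torch.nn as nn

class SiameseMLP(nn.Module):
    """Minimal Siamese multilayer perceptron: the same encoder embeds both
    records, and a small head scores the pair as match / non-match."""

    def __init__(self, in_dim=300, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, x1, x2):
        e1, e2 = self.encoder(x1), self.encoder(x2)      # shared weights
        return self.head(torch.abs(e1 - e2)).squeeze(-1)

# toy training step on random record-pair feature vectors
model = SiameseMLP()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
x1, x2 = torch.randn(32, 300), torch.randn(32, 300)
labels = torch.randint(0, 2, (32,)).float()              # 1 = same entity, 0 = different
loss = nn.functional.binary_cross_entropy(model(x1, x2), labels)
loss.backward()
optimiser.step()
print(float(loss))
```

The Autoencoder-based and hybrid variants discussed in the paper would replace or augment the encoder part of this sketch.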

Findings

The experimental results confirm that the new Autoencoder-based architecture of the Siamese Neural Network obtains better results than the Siamese Multilayer Perceptron model proposed in Jurek et al. (2020), with better results achieved on three out of four data sets. Furthermore, it has been demonstrated that the second proposed (hybrid) architecture, which integrates the Siamese Autoencoder with a Multilayer Perceptron model, makes the model more stable in terms of parameter selection.

Originality/value

To address the problem of unstructured RL, this paper presents a new deep learning-based approach that improves the generalisation of the Siamese Multilayer Perceptron model and makes it less sensitive to parameter selection.

Details

International Journal of Web Information Systems, vol. 17 no. 6
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 11 July 2016

Iana Atanassova, Marc Bertin and Vincent Larivière

Scientific abstracts reproduce only part of the information and the complexity of argumentation in a scientific article. The purpose of this paper is to provide a first analysis of the…

Abstract

Purpose

Scientific abstracts reproduce only part of the information and the complexity of argumentation in a scientific article. The purpose of this paper is to provide a first analysis of the similarity between the text of scientific abstracts and the body of articles, using sentences as the basic textual unit. It contributes to the understanding of the structure of abstracts.

Design/methodology/approach

Using sentence-based similarity metrics, the authors quantify the phenomenon of text re-use in abstracts and examine, within the introduction, methods, results and discussion structure, the positions of the body sentences that are similar to abstract sentences, using a corpus of over 85,000 research articles published in the seven Public Library of Science journals.
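A toy sketch of the sentence-matching idea: pair each abstract sentence with its most similar body sentence and record the relative position of that sentence in the article. The SequenceMatcher similarity used here is an assumption for illustration, not the paper's metric:

```python
from difflib import SequenceMatcher

def best_matches(abstract_sentences, body_sentences):
    """For each abstract sentence, find the most similar body sentence and
    its relative position (0 = start of the article, 1 = end)."""
    results = []
    for a in abstract_sentences:
        scores = [SequenceMatcher(None, a.lower(), b.lower()).ratio()
                  for b in body_sentences]
        best = max(range(len(body_sentences)), key=scores.__getitem__)
        position = best / max(len(body_sentences) - 1, 1)
        results.append((a, body_sentences[best],
                        round(scores[best], 2), round(position, 2)))
    return results

abstract = ["We propose a new oversampling method."]
body = ["Imbalanced data is common.",
        "In this work we propose a new oversampling method.",
        "Results are reported in Table 2."]
for row in best_matches(abstract, body):
    print(row)   # matches the second body sentence, relative position 0.5
```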

Findings

The authors provide evidence that 84 percent of abstracts have at least one sentence in common with the body of the paper. Studying the distributions of the body sentences that are re-used in abstracts, the authors show that there exists a strong relation between the rhetorical structure of articles and the zones that authors re-use when writing abstracts, with sentences mainly coming from the beginning of the introduction and the end of the conclusion.

Originality/value

Scientific abstracts contain what the author(s) consider to be the information that best describes a document's content. This is a first study that examines the relation between the contents of abstracts and the rhetorical structure of scientific articles. The work might provide new insight for improving automatic abstracting tools as well as information retrieval approaches in which text organization and structure are important features.

Details

Journal of Documentation, vol. 72 no. 4
Type: Research Article
ISSN: 0022-0418

Keywords

Article
Publication date: 3 February 2023

Frendy and Fumiko Takeda

Partners are responsible for allocating audit tasks and facilitating knowledge sharing among team members. This study considers changes in the composition of partners to proxy for…

Abstract

Purpose

Partners are responsible for allocating audit tasks and facilitating knowledge sharing among team members. This study considers changes in the composition of partners to proxy for the continuity of the audit team. This study examines the effect of audit team continuity on audit outcomes (audit quality and report lags), pricing and its determinant (lead partner experience), which have not been thoroughly examined in previous studies.

Design/methodology/approach

This study employs string similarity metrics to measure audit team continuity. Multivariate panel data regression models are estimated on a sample of 26,007 firm-years of listed Japanese companies from 2008 to 2019.
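As an illustration of a string-similarity-based continuity measure (the study's exact metric and data are not reproduced here), the partner lists of two consecutive engagements can be compared as follows:

```python
from difflib import SequenceMatcher

def team_continuity(partners_prev, partners_curr):
    """Continuity of the audit team between two consecutive engagements,
    measured as string similarity of the sorted partner lists
    (1 = identical team, 0 = completely different team)."""
    prev = " ".join(sorted(p.lower() for p in partners_prev))
    curr = " ".join(sorted(p.lower() for p in partners_curr))
    return SequenceMatcher(None, prev, curr).ratio()

print(round(team_continuity(["Sato", "Tanaka"], ["Sato", "Tanaka"]), 2))  # 1.0 (same team)
print(round(team_continuity(["Sato", "Tanaka"], ["Sato", "Suzuki"]), 2))  # partial continuity
```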

Findings

The study reveals that audit team continuity is negatively associated with audit fees, regardless of the auditor's size. This finding contributes to the existing literature by showing that audit team continuity is one of the determinants of audit fees. For clients of large audit firms, companies with higher (lower) audit team continuity issue audit reports in less (more) time. The experience of lead partners is a strong predictor of audit team continuity, irrespective of audit firm size. Audit quality is not associated with audit team continuity for either large or small audit firms.

Originality/value

This study proposes and examines audit team continuity measures that employ string similarity metrics to quantify changes in the composition of partners in consecutive audit engagements. Audit team continuity expands upon the tenure of individual audit partners, which is commonly used in prior literature as a measure of client–partner relationships.

Details

Journal of Accounting Literature, vol. 45 no. 2
Type: Research Article
ISSN: 0737-4607

Keywords

Article
Publication date: 3 July 2017

Kai Hoberg, Margarita Protopappa-Sieke and Sebastian Steinker

The purpose of this paper is to identify the interplay between a firm’s financial situation and its inventory ownership in a single-firm and a two-firm perspective.


Abstract

Purpose

The purpose of this paper is to identify the interplay between a firm’s financial situation and its inventory ownership in a single-firm and a two-firm perspective.

Design/methodology/approach

The analysis uses different secondary data sources to quantify the effect of both financial constraints and cost of capital on inventory holdings of public US firms. The authors first adopt a single-firm perspective and analyze whether financial constraints and cost of capital do generally affect the amount of inventory held. Next, the authors adopt a two-firm perspective and analyze the inventory ownership in customer-supplier relationships.
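A hedged sketch of the single-firm perspective as a regression of inventory holdings on financing variables, using simulated data and illustrative variable names rather than the study's actual dataset or specification:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated firm-year panel (the study uses secondary data on public US firms;
# the variable names and distributions here are illustrative assumptions).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "inventory_ratio": rng.uniform(0.05, 0.4, n),   # inventory / total assets
    "cost_of_capital": rng.uniform(0.04, 0.15, n),
    "fin_constraint": rng.uniform(0, 1, n),          # e.g. a financial-constraint index
    "firm_size": rng.normal(6, 1.5, n),              # log assets as a control
})

# Single-firm perspective: do financing costs and constraints relate to
# inventory holdings, controlling for firm size?
model = smf.ols("inventory_ratio ~ cost_of_capital + fin_constraint + firm_size",
                data=df).fit()
print(model.params)
```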

Findings

Inventory levels are affected by financial constraints and cost of capital. The results indicate that higher costs of capital are weakly associated with lower inventories. However, contrary to the authors' expectations, firms that are less financially constrained hold lower inventories than firms that are more financially constrained. Finally, the authors find that customers hold the larger fraction of supply chain inventory in supplier-customer dyads.

Practical implications

The results indicate that financial considerations generally play a role in inventory management. However, inventory holdings seem to be influenced only slightly by financing costs, and inventory holdings between supplier and customer seem to be less than optimal from a financial perspective. Considering these financial aspects can lead to relevant financial advantages.

Originality/value

In contrast to other recent research, the authors study how the financial situation of a firm affects its inventory levels (not vice versa) and also consider inventories from a two-firm perspective.

Details

International Journal of Physical Distribution & Logistics Management, vol. 47 no. 6
Type: Research Article
ISSN: 0960-0035

Keywords
