Search results

Article

Publication date: 14 May 2021

Data cleaning issues in class imbalanced datasets: instance selection and missing values imputation for one-class classifiers

Zhenyuan Wang, Chih-Fong Tsai and Wei-Chao Lin

Downloads

302

Abstract

Purpose

Class imbalance learning, which exists in many domain problem datasets, is an important research topic in data mining and machine learning. One-class classification techniques, which aim to identify anomalies as the minority class from the normal data as the majority class, are one representative solution for class imbalanced datasets. Since one-class classifiers are trained using only normal data to create a decision boundary for later anomaly detection, the quality of the training set, i.e. the majority class, is one key factor that affects the performance of one-class classifiers.

Design/methodology/approach

In this paper, we focus on two data cleaning or preprocessing methods to address class imbalanced datasets. The first method examines whether performing instance selection to remove some noisy data from the majority class can improve the performance of one-class classifiers. The second method combines instance selection and missing value imputation, where the latter is used to handle incomplete datasets that contain missing values.
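The two cleaning steps described above can be illustrated with a small sketch. It is a stand-in under stated assumptions, not the paper's method: mean imputation replaces the CART-based imputation, and a simple distance-to-centroid filter replaces IB3, DROP3 or the GA. The function names `impute_mean` and `select_instances`, the `factor` parameter and the toy data are all hypothetical.

```python
from math import dist
from statistics import mean, median


def impute_mean(rows):
    """Fill None entries with the mean of the observed values in that column.

    Assumption: a simple stand-in for the CART-based imputation in the paper.
    """
    columns = list(zip(*rows))
    fills = [mean([v for v in col if v is not None]) for col in columns]
    return [[fills[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def select_instances(rows, factor=2.0):
    """Keep rows whose distance to the centroid is at most factor * median distance.

    Assumption: a naive noise filter, not IB3, DROP3 or a genetic algorithm.
    """
    centroid = [mean(col) for col in zip(*rows)]
    distances = [dist(row, centroid) for row in rows]
    cutoff = factor * median(distances)
    return [row for row, d in zip(rows, distances) if d <= cutoff]


# Toy majority-class ("normal") training set with one missing value
# and one obviously noisy instance.
normal = [[1.0, 2.0], [1.2, None], [0.9, 2.1], [1.1, 1.9], [9.0, 9.5]]

# Imputation first, then instance selection.
cleaned = select_instances(impute_mean(normal))
```

In this sketch, imputation has to run first because `select_instances` cannot compute distances over missing entries; the reverse order would need a selection step that tolerates missing values. The cleaned set would then be passed to a one-class classifier for training.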

Findings

The experiments use 44 class imbalanced datasets, three instance selection algorithms (IB3, DROP3 and the GA), the CART decision tree for missing value imputation, and three one-class classifiers (OCSVM, IFOREST and LOF). The results show that if the instance selection algorithm is carefully chosen, performing this step can improve the quality of the training data, so that one-class classifiers outperform the baselines without instance selection. Moreover, when class imbalanced datasets contain some missing values, combining missing value imputation and instance selection, regardless of which step is performed first, can maintain data quality similar to that of datasets without missing values.

Originality/value

The novelty of this paper lies in investigating the effect of instance selection on the performance of one-class classifiers, which has never been done before. Moreover, this study is the first attempt to consider the scenario in which missing values exist in the training set used to train one-class classifiers. In this case, performing missing value imputation and instance selection in different orders is compared.

Details

Data Technologies and Applications, vol. 55 no. 5

Type: Research Article

ISSN: 2514-9288

Article

Publication date: 7 August 2017

Top 10 data mining techniques in business applications: a brief survey

Wei-Chao Lin, Shih-Wen Ke and Chih-Fong Tsai

Downloads

1900

Abstract

Purpose

Data mining is widely considered necessary in many business applications for effective decision-making. The importance of business data mining is reflected by the existence of numerous surveys in the literature focusing on the investigation of related works using data mining techniques for solving specific business problems. The purpose of this paper is to answer the following question: What are the widely used data mining techniques in business applications?

Design/methodology/approach

The aim of this paper is to examine related surveys in the literature and thus to identify the frequently applied data mining techniques. To ensure the recent relevance and quality of the conclusions, the criteria for selecting related studies are that the works be published in reputed journals within the past 10 years.

Findings

There are 33 different data mining techniques employed in eight different application areas. Most of them are supervised learning techniques, and the application area where such techniques are most often seen is bankruptcy prediction, followed by the areas of customer relationship management, fraud detection, intrusion detection and recommender systems. Furthermore, the ten most widely used data mining techniques for business applications are the decision tree (including C4.5 decision tree and classification and regression tree), genetic algorithm, k-nearest neighbor, multilayer perceptron neural network, naïve Bayes and support vector machine as the supervised learning techniques, and association rule, expectation maximization and k-means as the unsupervised learning techniques.

Originality/value

The originality of this paper is to survey the recent 10 years of related survey and review articles about data mining in business applications to identify the most popular techniques.

Details

Kybernetes, vol. 46 no. 7

Type: Research Article

ISSN: 0368-492X

Article

Publication date: 1 February 2016

SAFQuery: a simple and flexible advanced Web search interface

Wei-Chao Lin, Shih-Wen Ke and Chih-Fong Tsai

Downloads

358

Abstract

Purpose

This paper aims to introduce a prototype system called SAFQuery (Simple And Flexible Query interface). In many existing Web search interfaces, simple and advanced query processes are treated separately and cannot be issued interchangeably. In addition, after several rounds of queries for specific information need(s), users might wish to re-examine the retrieval results corresponding to some previous queries or to slightly modify some of the specific queries issued before. However, it is often hard to remember what queries have been issued. These factors make the current Web search process not very simple or flexible.

Design/methodology/approach

In SAFQuery, the simple and advanced query strategies are integrated into a single interface, in which query specifications can easily be formulated when needed. Moreover, query history information is provided that displays the past query specifications, which can reduce the user's memory load.

Findings

The authors' experiments by user evaluation show that most users had a positive experience when using SAFQuery. Specifically, it is easy to use and can simplify the Web search task.

Originality/value

The proposed prototype system provides simple and flexible Web search strategies. Particularly, it allows users to easily issue simple and advanced queries based on one single query interface, interchangeably. In addition, users can easily input previously issued queries without spending time to recall what the queries are and/or to re-type previous queries.

Details

The Electronic Library, vol. 34 no. 1

Type: Research Article

ISSN: 0264-0473

Article

Publication date: 29 July 2014

Modeling credit scoring using neural network ensembles

Chih-Fong Tsai and Chihli Hung

Downloads

1135

Abstract

Purpose

Credit scoring is important for financial institutions in order to accurately predict the likelihood of business failure. Related studies have shown that machine learning techniques, such as neural networks, outperform many statistical approaches to solving this type of problem, and advanced machine learning techniques, such as classifier ensembles and hybrid classifiers, provide better prediction performance than single machine learning based classification techniques. However, it is not known which type of advanced classification technique performs better in terms of financial distress prediction. The paper aims to discuss these issues.

Design/methodology/approach

This paper compares neural network ensembles and hybrid neural networks over three benchmarking credit scoring related data sets, which are Australian, German, and Japanese data sets.

Findings

The experimental results show that hybrid neural networks and neural network ensembles outperform the single neural network. Although hybrid neural networks perform slightly better than neural network ensembles in terms of prediction accuracy and errors with two of the data sets, there is no significant difference between the two types of prediction models.

Originality/value

The originality of this paper is in comparing two types of advanced classification techniques, i.e. hybrid and ensemble learning techniques, in terms of financial distress prediction.

Details

Kybernetes, vol. 43 no. 7

Type: Research Article

ISSN: 0368-492X

Article

Publication date: 29 April 2014

Dimensionality and data reduction in telecom churn prediction

Wei-Chao Lin, Chih-Fong Tsai and Shih-Wen Ke

Downloads

708

Abstract

Purpose

Churn prediction is a very important task for successful customer relationship management. In general, churn prediction can be achieved by many data mining techniques. However, during data mining, dimensionality reduction (or feature selection) and data reduction are two important data preprocessing steps. In particular, the aims of feature selection and data reduction are to filter out irrelevant features and noisy data samples, respectively. The purpose of this paper is to perform these data preprocessing tasks so that the mining algorithm produces good quality mining results.

Design/methodology/approach

Based on a real telecom customer churn data set, seven different preprocessed data sets based on performing feature selection and data reduction by different priorities are used to train the artificial neural network as the churn prediction model.

Findings

The results show that performing data reduction first by self-organizing maps and feature selection second by principal component analysis allows the prediction model to provide the highest prediction accuracy. In addition, this priority allows the prediction model to learn more efficiently, since 66 and 62 percent of the original features and data samples are reduced, respectively.

Originality/value

The contribution of this paper is to understand the better procedure of performing the two important data preprocessing steps for telecom churn prediction.

Details

Kybernetes, vol. 43 no. 5

Type: Research Article

ISSN: 0368-492X

Article

Publication date: 22 March 2013

A comparative study of hybrid machine learning techniques for customer lifetime value prediction

Chih‐Fong Tsai, Ya‐Han Hu, Chia‐Sheng Hung and Yu‐Feng Hsu

Downloads

2444

Abstract

Purpose

Customer lifetime value (CLV) has received increasing attention in database marketing. Enterprises can retain valuable customers by the correct prediction of valuable customers. In the literature, many data mining and machine learning techniques have been applied to develop CLV models. Specifically, hybrid techniques have shown their superiority over single techniques. However, it is unknown which hybrid model can perform the best in customer value prediction. Therefore, the purpose of this paper is to compare two types of commonly‐used hybrid models, the classification+classification and clustering+classification hybrid approaches, in terms of customer value prediction.

Design/methodology/approach

To construct a hybrid model, multiple techniques are usually combined in a two‐stage manner, in which the first stage is based on either clustering or classification techniques, which can be used to pre‐process the data. Then, the output of the first stage (i.e. the processed data) is used to construct the second stage classifier as the prediction model. Specifically, decision trees, logistic regression, and neural networks are used as the classification techniques and k‐means and self‐organizing maps for the clustering techniques to construct six different hybrid models.

Findings

The experimental results over a real case dataset show that the classification+classification hybrid approach performs the best. In particular, combining two stages of decision trees provides the highest rate of accuracy (99.73 percent) and lowest rate of Type I/II errors (0.22 percent/0.43 percent).
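The accuracy and Type I/II error rates reported above can be computed from predicted and true labels. The following is a minimal sketch with assumptions: the abstract does not define the two error types, so Type I error is taken here as the false positive rate and Type II error as the false negative rate; the function name `classification_rates` and the toy labels are hypothetical.

```python
def classification_rates(y_true, y_pred, positive=1):
    """Return (accuracy, type_i_error, type_ii_error) for binary labels.

    Assumption: Type I = false positive rate, Type II = false negative rate.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against classes that do not occur in y_true.
    type_i = fp / (fp + tn) if (fp + tn) else 0.0
    type_ii = fn / (fn + tp) if (fn + tp) else 0.0
    return accuracy, type_i, type_ii


# Toy example: 3 positive and 5 negative cases, one error of each kind.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
accuracy, type_i, type_ii = classification_rates(y_true, y_pred)
```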

Originality/value

The contribution of this paper is to demonstrate that hybrid machine learning techniques perform better than single ones. In addition, this paper allows us to find out which hybrid technique performs best in terms of CLV prediction.

Details

Kybernetes, vol. 42 no. 3

Type: Research Article

ISSN: 0368-492X

Article

Publication date: 3 August 2012

Scenery image retrieval by meta‐feature representation

Chih‐Fong Tsai and Wei‐Chao Lin

Downloads

412

Abstract

Purpose

Content‐based image retrieval suffers from the semantic gap problem: that images are represented by low‐level visual features, which are difficult to directly match to high‐level concepts in the user's mind during retrieval. To date, visual feature representation is still limited in its ability to represent semantic image content accurately. This paper seeks to address these issues.

Design/methodology/approach

In this paper the authors propose a novel meta‐feature representation method for scenery image retrieval. In particular, some class‐specific distances (namely meta‐features) between low‐level image features are measured, for example the distance between an image and its class centre, and the distances between the image and its nearest and farthest images in the same class.
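The three example distances named above can be sketched as follows. This is only an illustration under assumptions: the abstract does not state the distance measure or the full set of meta‐features, so Euclidean distance is assumed, and the function name `meta_features` and the toy feature vectors are hypothetical.

```python
from math import dist
from statistics import mean


def meta_features(image, same_class):
    """Return (distance to class centre, nearest distance, farthest distance).

    `same_class` holds the low-level feature vectors of one class and must
    contain `image` itself plus at least one other vector.
    Assumption: Euclidean distance between feature vectors.
    """
    centre = [mean(column) for column in zip(*same_class)]
    # Distances to every other image in the class (skip the image itself).
    others = [dist(image, vec) for vec in same_class if vec is not image]
    return dist(image, centre), min(others), max(others)


# Toy class with three 2-D feature vectors.
scenery_class = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
to_centre, nearest, farthest = meta_features(scenery_class[0], scenery_class)
```

For the first vector, the class centre is (3, 4), so the three meta‐features are 5, 5 and 10.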

Findings

Three experiments based on 190 concrete, 130 abstract, and 610 categories in the Corel dataset show that the meta‐features extracted from both global and local visual features significantly outperform the original visual features in terms of mean average precision.

Originality/value

Compared with traditional local and global low‐level features, the proposed meta‐features have higher discriminative power for distinguishing a large number of conceptual categories for scenery image retrieval. In addition the meta‐features can be directly applied to other image descriptors, such as bag‐of‐words and contextual features.

Details

Online Information Review, vol. 36 no. 4

Type: Research Article

ISSN: 1468-4527

Article

Publication date: 13 June 2008

Sensitivity analysis of mapping local image features into conceptual categories

Chih‐Fong Tsai and David C. Yen

Downloads

502

Abstract

Purpose

Image classification, or more specifically annotating images with keywords, is one of the important steps during image database indexing. However, current research in image retrieval concentrates more on how conceptual categories can be well represented by extracted low‐level features for effective classification. Consequently, image feature representation, including segmentation and low‐level feature extraction schemes, must be genuinely effective to facilitate the process of classification. The purpose of this paper is to examine the effect on annotation effectiveness of using different (local) feature representation methods to map into conceptual categories.

Design/methodology/approach

This paper compares tiling (five and nine tiles) and regioning (five and nine regions) segmentation schemes and the extraction of combinations of color, texture, and edge features in terms of the effectiveness of a particular benchmark, automatic image annotation set up. Differences between effectiveness on concrete or abstract conceptual categories or keywords are further investigated, and progress towards establishing a particular benchmark approach is also reported.

Findings

In the context of local feature representation, the paper concludes that the combined color and texture features are the best to use for the five tiling and regioning schemes, and this evidence would form a good benchmark for future studies. Another interesting finding (but perhaps not surprising) is that when the number of concrete and abstract keywords increases or it is large (e.g. 100), abstract keywords are more difficult to assign correctly than the concrete ones.

Research limitations/implications

Future work could consider: conducting user‐centered evaluation instead of evaluation only by some chosen ground truth dataset, such as Corel, since this might impact effectiveness results; using different numbers of categories for scalability analysis of image annotation as well as larger numbers of training and testing examples; using Principal Component Analysis or Independent Component Analysis, or indeed machine learning techniques, for low‐level feature selection; using other segmentation schemes, especially more complex tiling schemes and other regioning schemes; using different datasets, other low‐level features and/or combinations of them; and using other machine learning techniques.

Originality/value

This paper is a good start for analyzing the mapping between some feature representation methods and various conceptual categories for future image annotation research.

Details

Library Hi Tech, vol. 26 no. 2

Type: Research Article

ISSN: 0737-8831

Article

Publication date: 9 November 2015

A hybrid indicator for journal ranking: An example from the field of Health Care Sciences and Services

Wen-Chin Hsu, Chih-Fong Tsai and Jia-Huan Li

Downloads

465

Abstract

Purpose

Although journal rankings are important for authors, readers, publishers, promotion, and tenure committees, it has been argued that the use of different measures (e.g. the journal impact factor (JIF), and Hirsch’s h-index) often leads to different journal rankings, which renders it difficult to make an appropriate decision. A hybrid ranking method based on the Borda count approach, the Standardized Average Index (SA index), was introduced to solve this problem. The paper aims to discuss these issues.

Design/methodology/approach

Citations received by the articles published in 85 Health Care Sciences and Services (HCSS) journals in the period of 2009-2013 were analyzed with the use of the JIF, the h-index, and the SA index.

Findings

The SA index exhibits a high correlation with the JIF and the h-index (γ > 0.9, p < 0.01) and yields results with higher accuracy than the h-index. The new, comprehensive citation impact analysis of the 85 HCSS journals shows that the SA index can help researchers to find journals with both high JIFs and high h-indices more easily, thereby harvesting references for paper submissions and research directions.

Originality/value

The contribution of this study is the application of the Borda count approach to combine the HCSS journal rankings produced by the two widely accepted indices of the JIF and the h-index. The new HCSS journal rankings can be used by publishers, journal editors, researchers, policymakers, librarians, and practitioners as a reference for journal selection and the establishment of decisions and professional judgment.
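As an illustration of how a Borda count merges two rankings, here is a minimal sketch. It implements a plain Borda count only; the standardization used by the SA index is not described in the abstract and is not reproduced. The function name `borda_combine`, the tie-breaking rule and the journal labels are hypothetical.

```python
def borda_combine(rankings):
    """Merge several best-first rankings of the same items by Borda count.

    Each ranking awards n-1 points to its top item, n-2 to the next, and so
    on down to 0. Items are returned by total points, highest first.
    Assumption: ties are broken alphabetically, purely for determinism.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - 1 - position)
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical journals ranked by two different indices.
by_jif = ["A", "B", "C", "D"]
by_h_index = ["B", "C", "A", "D"]
combined = borda_combine([by_jif, by_h_index])
```

Here journal B collects 2 + 3 = 5 points, A collects 3 + 1 = 4, C collects 1 + 2 = 3 and D collects 0, so the combined order is B, A, C, D.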

Details

Online Information Review, vol. 39 no. 7

Type: Research Article

ISSN: 1468-4527

Article

Publication date: 23 November 2012

OGIR: an ontology‐based grid information retrieval framework

Chihli Hung, Chih‐Fong Tsai, Shin‐Yuan Hung and Chang‐Jiang Ku

Downloads

688

Abstract

Purpose

A grid information retrieval model has benefits for sharing resources and processing mass information, but cannot handle conceptual heterogeneity without integration of semantic information. The purpose of this research is to propose a concept‐based retrieval mechanism to catch the user's query intentions in a grid environment. This research re‐ranks documents over distributed data sources and evaluates performance based on the user judgment and processing time.

Design/methodology/approach

This research uses the ontology lookup service to build the concept set in the ontology and captures the user's query intentions as a means of query expansion for searching. The Globus toolkit is used to implement the grid service. The modification of the collection retrieval inference (CORI) algorithm is used for re‐ranking documents over distributed data sources.

Findings

The experiments demonstrate that this proposed approach successfully describes the user's query intentions evaluated by user judgment. For processing time, building a grid information retrieval model is a suitable strategy for the ontology‐based retrieval model.

Originality/value

Most current semantic grid models focus on construction of the semantic grid, and do not consider re‐ranking search results from distributed data sources. The significance of evaluation from the user's viewpoint is also ignored. This research proposes a method that captures the user's query intentions and re‐ranks documents in a grid based on the CORI algorithm. This proposed ontology‐based retrieval mechanism calculates the global relevance score of all documents in a grid and displays those documents with higher relevance to users.

Details

Online Information Review, vol. 36 no. 6

Type: Research Article

ISSN: 1468-4527
