Search results

1 – 9 of 9

Article

Publication date: 14 May 2021

Data cleaning issues in class imbalanced datasets: instance selection and missing values imputation for one-class classifiers

Zhenyuan Wang, Chih-Fong Tsai and Wei-Chao Lin

Downloads

311

Abstract

Purpose

Class imbalance learning, which exists in many domain problem datasets, is an important research topic in data mining and machine learning. One-class classification techniques, which aim to identify anomalies as the minority class from the normal data as the majority class, are one representative solution for class imbalanced datasets. Since one-class classifiers are trained using only normal data to create a decision boundary for later anomaly detection, the quality of the training set, i.e. the majority class, is one key factor that affects the performance of one-class classifiers.

Design/methodology/approach

In this paper, we focus on two data cleaning or preprocessing methods to address class imbalanced datasets. The first method examines whether performing instance selection to remove some noisy data from the majority class can improve the performance of one-class classifiers. The second method combines instance selection and missing value imputation, where the latter is used to handle incomplete datasets that contain missing values.
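The two preprocessing orders described above can be illustrated with a minimal sketch. It is not the authors' implementation: column-mean imputation stands in for the CART imputer, a keep-the-points-nearest-the-centroid filter stands in for IB3, DROP3 and the GA, and the function names, the `keep_ratio` parameter and the toy data are all hypothetical. No one-class classifier is trained here; the sketch only shows that the two steps can be composed in either order.

```python
from statistics import mean


def impute_missing(rows):
    """Fill None entries with the column mean of the observed values.

    Assumption: a simple stand-in for CART-based imputation.
    Each column must contain at least one observed value.
    """
    col_means = [mean(v for v in col if v is not None) for col in zip(*rows)]
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def select_instances(rows, keep_ratio=0.8):
    """Keep the rows closest to the centroid of the complete rows.

    Assumption: a simple stand-in for IB3 / DROP3 / GA instance selection.
    Missing features are skipped when measuring distance, and at least
    one complete row is required to compute the centroid.
    """
    complete = [r for r in rows if None not in r]
    centroid = [mean(col) for col in zip(*complete)]

    def distance(row):
        return sum((v - c) ** 2
                   for v, c in zip(row, centroid) if v is not None) ** 0.5

    n_keep = max(1, round(len(rows) * keep_ratio))
    return sorted(rows, key=distance)[:n_keep]


# Hypothetical majority-class training set with missing values and one noisy row.
normal = [
    [1.0, 2.0],
    [1.2, None],
    [0.9, 2.1],
    [None, 1.9],
    [9.0, 9.5],  # noisy instance
]

impute_then_select = select_instances(impute_missing(normal))
select_then_impute = impute_missing(select_instances(normal))
```

On this toy data both orders discard the noisy row and return four complete rows, although the imputed values differ because the column means are computed over different sets of rows.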

Findings

Experimental results on 44 class imbalanced datasets, using three instance selection algorithms (IB3, DROP3 and the GA), the CART decision tree for missing value imputation, and three one-class classifiers (OCSVM, IFOREST and LOF), show that if the instance selection algorithm is carefully chosen, performing this step can improve the quality of the training data, which makes one-class classifiers outperform the baselines without instance selection. Moreover, when class imbalanced datasets contain some missing values, combining missing value imputation and instance selection, regardless of which step is performed first, can maintain data quality similar to that of datasets without missing values.

Originality/value

The novelty of this paper lies in investigating the effect of instance selection on the performance of one-class classifiers, which has never been done before. Moreover, this study is the first attempt to consider the scenario in which missing values exist in the training set used to train one-class classifiers. In this case, the two possible orders of performing missing value imputation and instance selection are compared.

Details

Data Technologies and Applications, vol. 55 no. 5

Type: Research Article

ISSN: 2514-9288

Article

Publication date: 7 August 2017

Top 10 data mining techniques in business applications: a brief survey

Wei-Chao Lin, Shih-Wen Ke and Chih-Fong Tsai

Downloads

1917

Abstract

Purpose

Data mining is widely considered necessary in many business applications for effective decision-making. The importance of business data mining is reflected by the existence of numerous surveys in the literature focusing on the investigation of related works using data mining techniques for solving specific business problems. The purpose of this paper is to answer the following question: What are the widely used data mining techniques in business applications?

Design/methodology/approach

The aim of this paper is to examine related surveys in the literature and thus to identify the frequently applied data mining techniques. To ensure the recent relevance and quality of the conclusions, the criterion for selecting related studies is that the works be published in reputed journals within the past 10 years.

Findings

There are 33 different data mining techniques employed in eight different application areas. Most of them are supervised learning techniques, and the application area where such techniques are most often seen is bankruptcy prediction, followed by the areas of customer relationship management, fraud detection, intrusion detection and recommender systems. Furthermore, the ten most widely used data mining techniques for business applications are the decision tree (including C4.5 decision tree and classification and regression tree), genetic algorithm, k-nearest neighbor, multilayer perceptron neural network, naïve Bayes and support vector machine as supervised learning techniques, and association rule, expectation maximization and k-means as unsupervised learning techniques.

Originality/value

The originality of this paper lies in surveying the past 10 years of survey and review articles about data mining in business applications to identify the most popular techniques.

Details

Kybernetes, vol. 46 no. 7

Type: Research Article

ISSN: 0368-492X

Article

Publication date: 1 February 2016

SAFQuery: a simple and flexible advanced Web search interface

Wei-Chao Lin, Shih-Wen Ke and Chih-Fong Tsai

Downloads

358

Abstract

Purpose

This paper aims to introduce a prototype system called SAFQuery (Simple And Flexible Query interface). In many existing Web search interfaces, simple and advanced query processes are treated separately and cannot be issued interchangeably. In addition, after several rounds of queries for specific information need(s), users might wish to re-examine the retrieval results corresponding to some previous queries or to slightly modify some of the specific queries issued before. However, it is often hard to remember what queries have been issued. These factors make the current Web search process neither simple nor flexible.

Design/methodology/approach

In SAFQuery, the simple and advanced query strategies are integrated into a single interface, in which query specifications can easily be formulated when needed. Moreover, a query history is provided that displays past query specifications, which reduces the user's memory load.

Findings

The authors' user evaluation shows that most users had a positive experience when using SAFQuery. Specifically, SAFQuery is easy to use and can simplify the Web search task.

Originality/value

The proposed prototype system provides simple and flexible Web search strategies. Particularly, it allows users to easily issue simple and advanced queries based on one single query interface, interchangeably. In addition, users can easily input previously issued queries without spending time to recall what the queries are and/or to re-type previous queries.

Details

The Electronic Library, vol. 34 no. 1

Type: Research Article

ISSN: 0264-0473

Article

Publication date: 3 August 2012

Scenery image retrieval by meta‐feature representation

Chih‐Fong Tsai and Wei‐Chao Lin

Downloads

412

Abstract

Purpose

Content‐based image retrieval suffers from the semantic gap problem: that images are represented by low‐level visual features, which are difficult to directly match to high‐level concepts in the user's mind during retrieval. To date, visual feature representation is still limited in its ability to represent semantic image content accurately. This paper seeks to address these issues.

Design/methodology/approach

In this paper the authors propose a novel meta‐feature representation method for scenery image retrieval. In particular, some class‐specific distances (namely meta‐features) between low‐level image features are measured, for example the distance between an image and its class centre, and the distances between the image and its nearest and farthest images in the same class.
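The three example distances named above can be computed as in the following minimal sketch. It assumes Euclidean distance, a class centre defined as the mean of all feature vectors in the class (including the image itself), and a hypothetical function name and toy feature vectors; the paper's actual distance measure and feature extraction are not specified here.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def meta_features(index, class_vectors):
    """Return (distance to class centre, distance to nearest same-class
    image, distance to farthest same-class image) for one image.

    Assumptions: Euclidean distance; the centre is the mean of all
    vectors in the class; the class holds at least two images.
    """
    vector = class_vectors[index]
    centre = [sum(col) / len(col) for col in zip(*class_vectors)]
    to_others = [dist(vector, other)
                 for i, other in enumerate(class_vectors) if i != index]
    return dist(vector, centre), min(to_others), max(to_others)


# Hypothetical low-level feature vectors for three images of one class.
scenery_class = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
to_centre, to_nearest, to_farthest = meta_features(0, scenery_class)
```

For the first image the class centre is (3, 4), so the three meta-features are approximately 5, 5 and 10.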

Findings

Three experiments based on 190 concrete, 130 abstract, and 610 categories in the Corel dataset show that the meta‐features extracted from both global and local visual features significantly outperform the original visual features in terms of mean average precision.

Originality/value

Compared with traditional local and global low‐level features, the proposed meta‐features have higher discriminative power for distinguishing a large number of conceptual categories for scenery image retrieval. In addition the meta‐features can be directly applied to other image descriptors, such as bag‐of‐words and contextual features.

Details

Online Information Review, vol. 36 no. 4

Type: Research Article

ISSN: 1468-4527

Article

Publication date: 8 August 2016

Research impact of general and funded papers: A citation analysis of two ACM international conference proceeding series

Cheng-Che Shen, Ya-Han Hu, Wei-Chao Lin, Chih-Fong Tsai and Shih-Wen Ke

Downloads

560

Abstract

Purpose

The purpose of this paper is to examine the research impact of papers written with and without funding. Specifically, the citation analysis method is used to compare the general and funded papers published in two leading international conferences, ACM SIGIR and ACM SIGKDD.

Design/methodology/approach

The authors investigate the number of general and funded papers to see whether the number of funded papers is larger than the number of general papers. In addition, the total citations and the number of highly cited papers with and without funding are also compared.

Findings

The analysis results of ACM SIGIR papers show that in most cases the number of funded papers is larger than the number of general papers. Moreover, the total citations, the average number of citations per paper, and the number of highly cited papers all reveal the superiority of funded papers over general papers. However, the findings are somewhat different for the ACM SIGKDD papers. This may be because ACM SIGIR began much earlier than ACM SIGKDD, which relates to the maturity of the research problems addressed in these two conferences.

Originality/value

This paper is the first attempt to examine the research impact of general and funded research papers by the citation analysis method. The research impact of other research areas can be further investigated by other analysis methods.

Details

Online Information Review, vol. 40 no. 4

Type: Research Article

ISSN: 1468-4527

Article

Publication date: 9 September 2014

Citation impact analysis of research papers that appear in oral and poster sessions: A case study of three computer science conferences

Shih-Wen Ke, Wei-Chao Lin, Chih-Fong Tsai and Ya-Han Hu

Downloads

550

Abstract

Purpose

Conference publications are an important aspect of research activities. There are generally both oral presentations and poster sessions at large international conferences. One can hypothesise that, for the same conferences, the papers presented in oral sessions should have a higher research impact than the papers presented in poster sessions. However, there has been no related study examining the validity of this hypothesis. In other words, the difference in research impact between papers presented orally and those presented in poster sessions has not been discussed in the literature. Therefore, the purpose of this paper is to conduct a citation analysis to compare the research impact of papers presented in oral and poster sessions.

Design/methodology/approach

In this paper, data from three leading conferences in the field of computer vision are examined, namely CVPR (2011 and 2012), ICCV (2011) and ECCV (2012). Several types of citation-related statistics are collected, including the number of highly cited papers (i.e. high number of citations) presented in oral and poster sessions, the total citations of both types of papers, the average citations of oral and poster papers, and the average citations of each frequently cited paper of both types.

Findings

There are three main findings. First, a larger proportion of highly cited papers are from oral sessions than poster sessions. Second, the average number of citations per paper is larger for those presented in oral sessions than poster sessions. Third, the average number of citations for highly cited papers presented in oral sessions is not necessarily greater than for the ones presented in poster sessions.

Originality/value

The originality of this paper is that it is the first attempt to examine the differences in citation impact between conference papers presented in oral and poster sessions. The findings of this study will allow future bibliometrics research to further explore this issue for longer periods and different fields.

Details

Online Information Review, vol. 38 no. 6

Type: Research Article

ISSN: 1468-4527

Article

Publication date: 8 June 2015

Correlation analysis for comparison of the citation impact of journals, magazines, and conferences in computer science

Wei-Chao Lin, Chih-Fong Tsai and Shih-Wen Ke

Downloads

683

Abstract

Purpose

In many research areas, there are a variety of different types of academic publications, including journals, magazines and conferences, which provide outlets for researchers to present their findings. Generally speaking, although there are differences in the reviewing criteria and publication processes of different publication types, in the same research area, there is certainly overlap in terms of the problems addressed and the audience for different publication types. Therefore, the research impacts of different publication types in the same research area should be moderately or highly correlated. The paper aims to discuss these issues.

Design/methodology/approach

To test this hypothesis, the authors examine the correlation coefficients of citation impact for different types of publications in seven research areas of computer science from 2000 to 2013. In particular, four citation statistics are examined for each publication type: average citations per paper, average citations per year, average annual increase in individual h-index, and h-index.
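Two of the quantities mentioned above, the h-index and a correlation coefficient, can be sketched as follows. The h-index follows its standard definition; the choice of Pearson's coefficient is an assumption, since the abstract does not say which correlation coefficient is used, and the citation counts are hypothetical illustration data, not results from the paper.

```python
from math import sqrt


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count < rank:
            break
        h = rank
    return h


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumption: Pearson is used only for illustration. Both sequences
    must have non-zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical per-paper citation counts for one publication outlet.
example_h = h_index([10, 8, 5, 4, 3])

# Hypothetical yearly average citations for two publication types.
journal_avg = [1.0, 2.0, 3.0, 4.0]
conference_avg = [2.0, 4.0, 6.0, 8.0]
example_r = pearson(journal_avg, conference_avg)
```

Here four papers have at least four citations each, so the h-index is 4, and the two perfectly proportional series give a coefficient of about 1.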

Findings

The analysis results show only a partial correlation in terms of several specific citation measures for different publication types in the same research area. Moreover, the level of correlation of the citation impact between different publication types is different, depending on the research area.

Originality/value

The contribution of this paper is to investigate whether the research impact of different types of publications in the same area is correlated. The findings can help researchers and academics choose the most appropriate publication outlets.

Details

Online Information Review, vol. 39 no. 3

Type: Research Article

ISSN: 1468-4527

Article

Publication date: 29 April 2014

Dimensionality and data reduction in telecom churn prediction

Wei-Chao Lin, Chih-Fong Tsai and Shih-Wen Ke

Downloads

708

Abstract

Purpose

Churn prediction is a very important task for successful customer relationship management. In general, churn prediction can be achieved by many data mining techniques. However, during data mining, dimensionality reduction (or feature selection) and data reduction are two important data preprocessing steps. In particular, the aims of feature selection and data reduction are to filter out irrelevant features and noisy data samples, respectively. The purpose of this paper is to examine how performing these data preprocessing tasks can make the mining algorithm produce good-quality mining results.

Design/methodology/approach

Based on a real telecom customer churn data set, seven different preprocessed data sets, obtained by performing feature selection and data reduction in different orders, are used to train an artificial neural network as the churn prediction model.

Findings

The results show that performing data reduction first by self-organizing maps and feature selection second by principal component analysis allows the prediction model to provide the highest prediction accuracy. In addition, this order allows the prediction model to learn more efficiently, since 66 and 62 percent of the original features and data samples are reduced, respectively.

Originality/value

The contribution of this paper is to identify the better order of performing the two important data preprocessing steps for telecom churn prediction.

Details

Kybernetes, vol. 43 no. 5

Type: Research Article

ISSN: 0368-492X

Article

Publication date: 1 February 1999

Breakwater construction: an effective method for industrial waste utilization

Dulcy M. Abraham and M.H. Joanne Yeh

Downloads

448

Abstract

The Environmental Protection Bureau of Taiwan established the South Star Project in Kaohsiung, Taiwan, as a solution to two problems facing the city—the urgent need to dispose of industrial wastes and the need to increase land for the city. To embank land from the sea, breakwaters were constructed. The material used to construct breakwaters was a mixture of furnace slag (waste from the steel industry) and fly ash (waste from power plants). After constructing the breakwaters, the ‘reclaimed land’ was used as a landfill for construction and public waste. In the future, these reclaimed lands will be used for the development of a deepwater port or sea airport. Construction of breakwaters is a very repetitive process, and any improvements made would help contractors reduce the duration of the operation, improve efficiency in the process and thereby reduce costs. This paper discusses the process of breakwater construction and the utilization of industrial wastes for the concrete work on the project. Data collected from the first stage of the South Star Project is used in the modelling, simulation and analysis of the process, in order to examine the interaction between different resources.

Details

Engineering, Construction and Architectural Management, vol. 6 no. 2

Type: Research Article

ISSN: 0969-9988
