Search results

1 – 10 of over 1000
Article
Publication date: 5 May 2023

Nguyen Thi Dinh, Nguyen Thi Uyen Nhi, Thanh Manh Le and Thanh The Van


Abstract

Purpose

The problem of image retrieval and image description exists in various fields. In this paper, a model of content-based image retrieval and image content extraction based on the KD-Tree structure is proposed.

Design/methodology/approach

A Random Forest structure was built to classify the objects in each image on the basis of a balanced multibranch KD-Tree structure. From this, a KD-Tree structure was generated by the Random Forest to retrieve a set of images similar to an input image. A KD-Tree structure was then applied to determine relationship words at the leaves, extracting the relationships between objects in an input image. The content of an input image is described using class names and the relationships between objects.
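The retrieval step can be sketched as a plain KD-tree nearest-neighbour search. The tree below is a minimal two-branch version (the paper's structure is multibranch and forest-based), and the image feature vectors are invented toy data:

```python
# Minimal sketch, not the authors' implementation: build a balanced
# two-branch KD-tree over toy image feature vectors and retrieve the
# k most similar images by Euclidean distance.
import heapq

def build_kdtree(points, depth=0):
    """Recursively build a balanced KD-tree; each node stores (id, vector)."""
    if not points:
        return None
    axis = depth % len(points[0][1])            # cycle through feature axes
    points = sorted(points, key=lambda p: p[1][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def knn(node, query, k, depth=0, heap=None):
    """Collect the k nearest stored points to `query` (max-heap on -dist)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    idx, vec = node["point"]
    dist = sum((a - b) ** 2 for a, b in zip(vec, query))
    heapq.heappush(heap, (-dist, idx))
    if len(heap) > k:
        heapq.heappop(heap)
    axis = depth % len(query)
    diff = query[axis] - vec[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    knn(near, query, k, depth + 1, heap)
    if len(heap) < k or diff ** 2 < -heap[0][0]:  # far side may still hold a closer point
        knn(far, query, k, depth + 1, heap)
    return heap

# Toy "image features": (image_id, feature_vector)
features = [(0, (1.0, 2.0)), (1, (3.0, 4.0)), (2, (0.5, 1.5)), (3, (5.0, 5.0))]
tree = build_kdtree(features)
nearest = sorted(idx for _, idx in knn(tree, (1.0, 2.0), k=2))
print(nearest)  # ids of the two images most similar to the query
```

The balanced split (median along the cycling axis) is what keeps queries logarithmic; the paper's multibranch variant generalizes this two-way split.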

Findings

A model of image retrieval and image content extraction was built on the proposed theoretical basis; experiments were run on multi-object image datasets, Microsoft COCO and Flickr, with average image retrieval precisions of 0.9028 and 0.9163, respectively. The experimental results were compared with those of other works on the same image datasets to demonstrate the effectiveness of the proposed method.

Originality/value

A balanced multibranch KD-Tree structure was built, on the basis of the original KD-Tree structure, and applied to relationship classification. Then, a KD-Tree Random Forest was built to improve classifier performance and retrieve a set of images similar to an input image. Concurrently, the image content was described by combining class names and the relationships between objects.

Details

Data Technologies and Applications, vol. 57 no. 4
Type: Research Article
ISSN: 2514-9288


Article
Publication date: 14 September 2023

Cheng Liu, Yi Shi, Wenjing Xie and Xinzhong Bao


Abstract

Purpose

This paper aims to provide a complete analysis framework and prediction method for the construction of the patent securitization (PS) basic asset pool.

Design/methodology/approach

This paper proposes an integrated classification method based on the genetic algorithm and the random forest algorithm. First, the patent value evaluation model and the SME credit evaluation model are considered together, and 17 indicators are determined to measure patent value and SME credit. Second, classification labels for high-quality basic assets are established. Then, the genetic algorithm and the random forest model are used to predict and screen high-quality basic assets. Finally, the performance of the model is evaluated.
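A minimal sketch of the integrated idea: a genetic algorithm evolves a bitmask over the 17 indicators, where the fitness would in practice be a random forest's cross-validated accuracy on that feature subset. Here a stand-in fitness function, favouring an invented "good" subset, keeps the loop self-contained and runnable:

```python
# Hypothetical sketch of GA-based indicator selection (not the paper's code).
# In the real pipeline, fitness() would train a random forest on the chosen
# indicators and return its cross-validated accuracy.
import random

random.seed(0)
N_INDICATORS = 17
GOOD = {0, 3, 5, 8, 12}            # pretend these indicators carry the signal

def fitness(mask):
    """Stand-in for cross-validated random-forest accuracy on the subset."""
    chosen = {i for i, bit in enumerate(mask) if bit}
    return len(chosen & GOOD) - 0.2 * len(chosen - GOOD)

def evolve(pop_size=30, generations=40):
    pop = [[random.randint(0, 1) for _ in range(N_INDICATORS)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # truncation selection (elitist)
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_INDICATORS)
            child = a[:cut] + b[cut:]             # one-point crossover
            if random.random() < 0.1:             # point mutation
                j = random.randrange(N_INDICATORS)
                child[j] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(sorted(i for i, bit in enumerate(best) if bit))
```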

Findings

The machine learning model proposed in this study is mainly used to solve the screening problem of high-quality patents that constitute the underlying asset pool of PS. The empirical research shows that the integrated classification method based on genetic algorithm and random forest has good performance and prediction accuracy, and is superior to the single method that constitutes it.

Originality/value

The main contributions of the article are twofold: first, the proposed machine learning model determines the standards for high-quality basic assets; second, the article addresses the screening of basic assets in PS.

Details

Kybernetes, vol. 53 no. 2
Type: Research Article
ISSN: 0368-492X


Article
Publication date: 21 December 2021

Laouni Djafri



Abstract

Purpose

This work can be used as a building block in other settings such as GPU, Map-Reduce or Spark. DDPML can also be deployed on other distributed systems such as P2P networks, clusters, cloud computing or other technologies.

Design/methodology/approach

In the age of Big Data, companies want to benefit from large amounts of data, which can help them understand their internal and external environment and anticipate associated phenomena, as data turn into knowledge that can later be used for prediction. This knowledge becomes a great asset in companies' hands, and extracting it is precisely the objective of data mining. With data and knowledge now produced at a much faster pace, the field has moved to Big Data mining. The authors' proposed work therefore aims at solving the problems of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. The problem raised in this work is how to make machine learning algorithms run in a distributed and parallel way at the same time without losing classification accuracy. To solve it, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML). The work is divided into two parts. In the first, the authors propose a distributed architecture controlled by a Map-Reduce algorithm that in turn depends on a random sampling technique; this architecture is designed to handle big data processing coherently and efficiently together with the sampling strategy proposed in this work, and it also allows the authors to verify the classification results obtained using the representative learning base (RLB). In the second part, the authors extract the representative learning base by sampling at two levels using the stratified random sampling method; this method is also applied to extract the shared learning base (SLB) and the partial learning bases for the first level (PLBL1) and the second level (PLBL2).
The experimental results show the efficiency of the proposed solution without significant loss of classification quality. In practical terms, the DDPML system is dedicated to big data mining and works effectively in distributed systems with a simple structure, such as client-server networks.
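The two-level stratified sampling step can be illustrated in a few lines. The data, class proportions and sampling fractions below are invented; only the PLBL1/PLBL2 names follow the paper:

```python
# Hedged sketch of two-level stratified random sampling (toy data, not DDPML).
import random
from collections import Counter

random.seed(1)
# Toy labeled dataset: (features, class_label), 60% class A / 40% class B
data = [((i,), "A") for i in range(60)] + [((i,), "B") for i in range(40)]

def stratified_sample(rows, fraction):
    """Draw `fraction` of each class, preserving class proportions."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[1], []).append(row)
    sample = []
    for label, rows_c in by_class.items():
        k = max(1, round(fraction * len(rows_c)))
        sample.extend(random.sample(rows_c, k))
    return sample

plbl1 = stratified_sample(data, 0.5)    # first-level partial learning base
plbl2 = stratified_sample(plbl1, 0.4)   # second-level partial learning base
print(Counter(label for _, label in plbl2))
```

Note that both levels keep the original 60/40 class ratio, which is what makes the reduced base representative of the full one.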

Findings

The experiments yielded very satisfactory classification results.

Originality/value

The DDPML system is specially designed to handle big data mining classification smoothly.

Details

Data Technologies and Applications, vol. 56 no. 4
Type: Research Article
ISSN: 2514-9288


Article
Publication date: 9 September 2022

Lucie Maruejols, Hanjie Wang, Qiran Zhao, Yunli Bai and Linxiu Zhang


Abstract

Purpose

Despite rising incomes and the reduction of extreme poverty, the feeling of being poor remains widespread. Support programs can improve well-being, but they first require identifying which households judge their income insufficient to meet their basic needs, and what factors are associated with subjective poverty.

Design/methodology/approach

Households report the income level they judge sufficient to make ends meet, and are classified as subjectively poor if their own monetary income falls below that level. The study then compares the performance of three machine learning algorithms, the random forest, support vector machines and least absolute shrinkage and selection operator (LASSO) regression, applied to a set of socioeconomic variables to predict subjective poverty status.
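The comparison design can be reproduced on synthetic data (not the authors' rural-China survey) with the three algorithms named above, using an L1-penalized logistic regression as the LASSO-style classifier for the binary poverty label:

```python
# Illustrative only: the paper's three-way model comparison on synthetic
# stand-in data for the socioeconomic predictors.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=12, n_informative=5,
                           random_state=0)   # stand-in for survey variables
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(kernel="rbf", random_state=0),
    "LASSO-style logistic": LogisticRegression(penalty="l1", solver="liblinear"),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: share of correct predictions = {acc:.3f}")
```

Which model wins depends on the data at hand, which is exactly the paper's methodological point about the middle-income group.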

Findings

The random forest generates 85.29% of correct predictions using a range of income and non-income predictors, closely followed by the other two techniques. For the middle-income group, the LASSO regression outperforms random forest. Subjective poverty is mostly associated with monetary income for low-income households. However, a combination of low income, low endowment (land, consumption assets) and unusual large expenditure (medical, gifts) constitutes the key predictors of feeling poor for the middle-income households.

Practical implications

To reduce the feeling of poverty, policy intervention should continue to focus on increasing incomes. However, improvements in non-income domains such as health expenditure, education and family demographics can also relieve the feeling of income inadequacy. Methodologically, which algorithm performs better depends on the data at hand.

Originality/value

For the first time, the authors show that prediction techniques are reliable for identifying subjective poverty prevalence, with an example from rural China. The analysis pays specific attention to modest-income households, who may feel poor but not be identified as such by objective poverty lines, and is relevant when policy-makers seek to address the "next step" after ending extreme poverty. Prediction performance and mechanisms for three machine learning algorithms are compared.

Details

China Agricultural Economic Review, vol. 15 no. 2
Type: Research Article
ISSN: 1756-137X


Book part
Publication date: 30 September 2020

Hera Khan, Ayush Srivastav and Amit Kumar Mishra


Abstract

A detailed description will be provided of the classification algorithms that have been widely used in the domain of medical science. The foundation will be laid by a comprehensive overview of the background and history of classification algorithms, followed by an extensive discussion of the various classification techniques in machine learning (ML), concluding with their applications to data analysis in medical science and health care. The initial sections deal with the fundamentals required for a profound understanding of classification in ML, comprising the underlying differences between unsupervised and supervised learning, the basic terminology of classification and its history. The chapter then covers the types of classification algorithms, ranging from linear classifiers such as logistic regression and naïve Bayes to nearest neighbour, support vector machines, tree-based classifiers and neural networks, together with their respective mathematics. Ensemble algorithms such as majority voting, boosting, bagging and stacking will also be discussed at length along with their applications. Furthermore, the chapter elucidates the areas of application of these classification algorithms in biomedicine and health care and their contribution to decision-making systems and predictive analysis. To conclude, the chapter offers a thorough insight into classification algorithms and their applications in the healthcare development sector, making it valuable for research and development.
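One of the ensemble schemes the chapter surveys, majority voting, reduces to a few lines; the base-classifier outputs below are invented, not from the book:

```python
# Majority voting: the ensemble predicts the label chosen by most base
# classifiers (ties resolved by first-seen order).
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most base classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical base classifiers voting on one patient record:
votes = ["disease", "healthy", "disease"]
print(majority_vote(votes))
```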

Details

Big Data Analytics and Intelligence: A Perspective for Health Care
Type: Book
ISBN: 978-1-83909-099-8


Article
Publication date: 13 February 2024

Marcelo Cajias and Anna Freudenreich


Abstract

Purpose

This is the first article to apply a machine learning approach to the analysis of time on market in real estate markets.

Design/methodology/approach

The random survival forest approach is introduced to the real estate market. The most important predictors of time on market are revealed, and the response of the survival probability of residential rental apartments to these major characteristics is analyzed.
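A random survival forest averages per-leaf survival curves; its elementary building block, a Kaplan-Meier-style estimator, can be sketched as follows. The times on market (in days) are invented, with event = 1 meaning the apartment was rented and 0 meaning the listing was censored; distinct event times are assumed for simplicity:

```python
# Not the authors' model: a minimal Kaplan-Meier estimator of the survival
# function S(t) = P(still on market after t days), the quantity a random
# survival forest aggregates across its leaves.
def kaplan_meier(times, events):
    """Return [(t, S(t))] after each event time; censored rows only shrink
    the risk set. Assumes no tied times."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv, curve = 1.0, []
    for i in order:
        if events[i]:                        # an actual rental (event)
            surv *= (at_risk - 1) / at_risk
            curve.append((times[i], round(surv, 4)))
        at_risk -= 1                         # this listing leaves the risk set
    return curve

times  = [5, 8, 12, 20, 30]
events = [1, 1, 0, 1, 1]
print(kaplan_meier(times, events))
```

In a full random survival forest (e.g. as implemented in the scikit-survival library), each tree partitions listings by covariates such as rent and living area, and curves like this one are estimated per leaf and averaged.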

Findings

Results show that price, living area, construction year, year of listing and the distances to the nearest hairdresser, bakery and city center have the greatest impact on the marketing time of residential apartments. Time on market for an apartment in Munich is lowest at a rent of €750 per month, an area of 60 m², a construction year of 1985 and a distance of 200–400 meters from the important amenities.

Practical implications

The findings might be interesting for private and institutional investors to derive real estate investment decisions and implications for portfolio management strategies and ultimately to minimize cash-flow failure.

Originality/value

Although machine learning algorithms have frequently been applied to the real estate market for the analysis of prices, their application to examining time on market is completely novel. This is the first paper to apply a machine learning approach to survival analysis in the real estate market.

Details

Journal of Property Investment & Finance, vol. 42 no. 2
Type: Research Article
ISSN: 1463-578X


Article
Publication date: 23 March 2021

Mostafa El Habib Daho, Nesma Settouti, Mohammed El Amine Bechar, Amina Boublenza and Mohammed Amine Chikh


Abstract

Purpose

Ensemble methods have been widely used in the field of pattern recognition due to the difficulty of finding a single classifier that performs well on a wide variety of problems. Despite the effectiveness of these techniques, studies have shown that ensemble methods generate a large number of hypotheses that, in most cases, contain redundant classifiers. Several works in the state of the art attempt to reduce the set of hypotheses without affecting performance.

Design/methodology/approach

In this work, the authors propose a pruning method that takes into consideration the correlation between classifiers and classes, and between each classifier and the rest of the set. The authors used the random forest algorithm as the tree-based ensemble classifier, and pruning was performed with a technique inspired by the CFS (correlation feature selection) algorithm.
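The CFS-inspired selection can be sketched with the usual CFS merit, k·r_ct / sqrt(k + k(k-1)·r_cc), scoring a subset of k classifiers by mean classifier-target accuracy (r_ct) against mean pairwise agreement (r_cc). The base-tree predictions are invented, and this exhaustive pair search stands in for the paper's ordering-based procedure:

```python
# Hedged sketch of CFS-style ensemble selection (not the CES implementation):
# prefer subsets whose members are accurate but not redundant.
import itertools, math

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def agreement(p, q):
    return sum(a == b for a, b in zip(p, q)) / len(p)

def merit(subset, preds, truth):
    """CFS merit: k*r_ct / sqrt(k + k*(k-1)*r_cc)."""
    k = len(subset)
    r_ct = sum(accuracy(preds[i], truth) for i in subset) / k
    pairs = list(itertools.combinations(subset, 2))
    r_cc = sum(agreement(preds[i], preds[j]) for i, j in pairs) / len(pairs) if pairs else 0.0
    return k * r_ct / math.sqrt(k + k * (k - 1) * r_cc)

truth = [0, 1, 1, 0, 1, 0]
preds = {                      # invented base-tree outputs on six samples
    0: [0, 1, 1, 0, 1, 0],     # accurate
    1: [0, 1, 1, 0, 1, 1],     # accurate, highly redundant with tree 0
    2: [1, 1, 0, 0, 1, 0],     # less accurate but more diverse
}
best = max(itertools.combinations(preds, 2), key=lambda s: merit(s, preds, truth))
print(sorted(best))
```

The denominator is what penalizes redundancy: two trees that always agree add little beyond one of them alone.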

Findings

The proposed method, CES (correlation-based ensemble selection), was evaluated on ten datasets from the UCI machine learning repository, and its performance was compared to six ensemble pruning techniques. The results showed that the proposed pruning method selects a small ensemble in less time while improving classification rates compared to the state-of-the-art methods.

Originality/value

CES is a new ordering-based method that uses the CFS algorithm. CES selects, in a short time, a small sub-ensemble that outperforms results obtained from the whole forest and the other state-of-the-art techniques used in this study.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 14 no. 2
Type: Research Article
ISSN: 1756-378X


Article
Publication date: 28 November 2019

Amitava Choudhury, Tanmay Konnur, P.P. Chattopadhyay and Snehanshu Pal


Abstract

Purpose

The purpose of this paper is to predict the various phases and crystal structures of multi-component alloys. Nowadays, the concept and strategies of the development of multi-principal element alloys (MPEAs) significantly increase the number of potential candidate alloy systems, which demands proper screening of a large number of alloy systems based on the nature of their phase and structure. Experimentally obtained data linking elemental properties and their resulting phases for MPEAs are abundant; hence, there is strong scope for categorization/classification of MPEAs based on structural features of the resultant phase, along with distinctive connections between elemental properties and phases.

Design/methodology/approach

In this paper, several machine-learning algorithms have been used to recognize the underlying patterns in MPEA data sets and to classify the alloys based on structural features of their resultant phase, such as single-phase solid solution, amorphous and intermetallic compounds. Further classification of MPEAs having a single-phase solid solution is performed based on crystal structure using an ensemble-based machine-learning algorithm known as the random-forest algorithm.

Findings

The model developed by implementing the random-forest algorithm has resulted in an accuracy of 91 per cent for phase prediction and 93 per cent for crystal structure prediction for the single-phase solid solution class of MPEAs. Five input parameters are used in the prediction model, namely, valence electron concentration, difference in Pauling electronegativity, atomic size difference, mixing enthalpy and mixing entropy. It has been found that the valence electron concentration is the most important feature with respect to the prediction of phases. To avoid overfitting, fivefold cross-validation has been performed. To compare performance, different algorithms such as K-nearest neighbour, support vector machine, logistic regression, the naïve Bayes approach, decision tree and neural network have been applied to the data set.
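The prediction setup can be illustrated on synthetic data with the five named inputs and fivefold cross-validation; the labelling rule below is invented to echo the reported dominance of valence electron concentration, and none of the numbers correspond to the paper's results:

```python
# Illustrative sketch only: random-forest phase prediction with the five
# inputs the paper names, on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
# columns: VEC, electronegativity diff, atomic size diff, dH_mix, dS_mix
X = rng.normal(size=(n, 5))
# Invented rule: phase driven mostly by VEC (column 0)
y = (X[:, 0] + 0.3 * X[:, 3] > 0).astype(int)  # 1 = solid solution, 0 = other

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)       # fivefold CV against overfitting
clf.fit(X, y)
print("CV accuracy: %.2f" % scores.mean())
print("most important feature index:", int(np.argmax(clf.feature_importances_)))
```

The `feature_importances_` ranking is how a study like this would surface VEC as the dominant predictor.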

Originality/value

In this paper, the authors described the phase selection and crystal structure prediction mechanism for the MPEA data set and achieved better accuracy using machine learning.

Details

Engineering Computations, vol. 37 no. 3
Type: Research Article
ISSN: 0264-4401


Article
Publication date: 18 October 2018

Kalyan Nagaraj, Biplab Bhattacharjee, Amulyashree Sridhar and Sharvani GS


Abstract

Purpose

Phishing is one of the major threats affecting businesses worldwide in current times. Organizations and customers face hazards arising from phishing attacks because of anonymous access to vulnerable details. Such attacks often result in substantial financial losses. Thus, there is a need for effective intrusion detection techniques to identify and possibly nullify the effects of phishing. Classifying phishing and non-phishing web content is a critical task in information security protocols, and foolproof mechanisms have yet to be implemented in practice. The purpose of the current study is to present an ensemble machine learning model for classifying phishing websites.

Design/methodology/approach

A publicly available data set comprising 10,068 instances of phishing and legitimate websites was used to build the classifier model. Feature extraction was performed by deploying a group of methods, and relevant features extracted were used for building the model. A twofold ensemble learner was developed by integrating results from random forest (RF) classifier, fed into a feedforward neural network (NN). Performance of the ensemble classifier was validated using k-fold cross-validation. The twofold ensemble learner was implemented as a user-friendly, interactive decision support system for classifying websites as phishing or legitimate ones.
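A hedged reconstruction of the twofold idea (not the authors' exact pipeline, features or data): the random forest's class probabilities become the inputs of a small feedforward network, here on synthetic stand-in "website feature" data:

```python
# Sketch of an RF -> NN two-stage ensemble on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=800, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Stage 1: random forest over the raw website features
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Stage 2: feedforward network fed with the forest's class probabilities
nn = MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000, random_state=0)
nn.fit(rf.predict_proba(X_tr), y_tr)

acc = nn.score(rf.predict_proba(X_te), y_te)
print("two-stage accuracy: %.3f" % acc)
```

A more careful stacking setup would generate the stage-1 probabilities out-of-fold to avoid leaking the forest's training fit into the network; this sketch keeps the simpler direct wiring for brevity.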

Findings

Experimental simulations were performed to assess and compare the performance of the ensemble classifiers. The statistical tests estimated that the RF_NN model gave superior performance, with an accuracy of 93.41 per cent and a minimal mean squared error of 0.000026.

Research limitations/implications

The research data set used in this study is publicly available and easy to analyze. Comparative analysis with other real-time data sets of recent origin must be performed to ensure generalization of the model against various security breaches. Different variants of phishing threats must be detected rather than focusing particularly on phishing website detection.

Originality/value

To the best of the authors' knowledge, the twofold ensemble model has not been applied to the classification of phishing websites in any previous study.

Details

Journal of Systems and Information Technology, vol. 20 no. 3
Type: Research Article
ISSN: 1328-7265


Article
Publication date: 5 June 2017

Hao Wu


Abstract

Purpose

This paper aims to inspect defects of solder joints on printed circuit boards in a real-time production line; simple computation and high accuracy are the primary considerations for the feature extraction and classification algorithm.

Design/methodology/approach

In this study, the author presents an ensemble method for the classification of solder joint defects. The new method extracts color and geometry features after solder image acquisition and uses decision trees to guarantee the algorithm's runtime efficiency. To improve accuracy, the author proposes a random forest ensemble, combining several trees, for the classification of solder joints.
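The feature-extraction step can be illustrated on a synthetic patch: simple color (mean RGB) and geometry (area, aspect ratio) features of the kind that would be handed to the random forest. The patch and the feature choices are invented, not the paper's:

```python
# Toy version of the color/geometry feature step for a solder-joint patch.
import numpy as np

patch = np.zeros((8, 8, 3), dtype=float)
patch[2:6, 2:7] = [0.8, 0.5, 0.2]        # a bright solder blob on a dark board

mask = patch.sum(axis=2) > 0             # segment the joint from the background
area = int(mask.sum())                   # geometry: blob area in pixels
ys, xs = np.nonzero(mask)
aspect = (xs.max() - xs.min() + 1) / (ys.max() - ys.min() + 1)  # width/height
mean_rgb = patch[mask].mean(axis=0)      # color: mean RGB over the blob

features = [area, round(aspect, 2), *np.round(mean_rgb, 2)]
print(features)
```

A vector like this per joint image, stacked over many samples, is what a decision-tree or random-forest classifier would then consume.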

Findings

The proposed method has been tested on 280 samples of solder joints, including good joints and various defect types. The results show that the proposed method has high accuracy.

Originality/value

The author extracted color and geometry features and used decision trees to guarantee the algorithm's runtime efficiency. To improve accuracy, the author proposes a random forest ensemble combining several trees for the classification of solder joints, which achieves high accuracy.

Details

Soldering & Surface Mount Technology, vol. 29 no. 3
Type: Research Article
ISSN: 0954-0911

