Search results
1 – 10 of over 1000
Nguyen Thi Dinh, Nguyen Thi Uyen Nhi, Thanh Manh Le and Thanh The Van
Abstract
Purpose
The problem of image retrieval and image description exists in various fields. In this paper, a model of content-based image retrieval and image content extraction based on the KD-Tree structure was proposed.
Design/methodology/approach
A Random Forest structure was built to classify the objects in each image on the basis of the balanced multibranch KD-Tree structure. For that purpose, a KD-Tree structure was generated by the Random Forest to retrieve a set of similar images for an input image. A KD-Tree structure with relationship words at its leaves is applied to extract the relationships between the objects in an input image. The content of an input image is then described based on class names and the relationships between objects.
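As a rough illustration of the retrieval step, the sketch below builds a KD-Tree over feature vectors and returns the nearest stored vector for a query. This is not the authors' implementation: the 2-D feature vectors and the single-tree search are illustrative stand-ins for their balanced multibranch structure.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_kdtree(points, depth=0):
    """Recursively build a KD-Tree, splitting on alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, depth=0, best=None):
    """Return the stored point closest to `target`."""
    if node is None:
        return best
    axis = depth % len(target)
    point = node["point"]
    if best is None or dist2(point, target) < dist2(best, target):
        best = point
    # Search the side of the split containing the target first.
    if target[axis] < point[axis]:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, target, depth + 1, best)
    # Cross the split only if a closer point could lie on the other side.
    if (target[axis] - point[axis]) ** 2 < dist2(best, target):
        best = nearest(far, target, depth + 1, best)
    return best

features = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_kdtree(features)
print(nearest(tree, (9.0, 2.0)))  # → (8.0, 1.0)
```

In an image-retrieval setting, each stored point would be a feature vector extracted from an indexed image, and the query would return the most similar indexed images.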
Findings
A model of image retrieval and image content extraction was proposed based on the theoretical basis above; experiments were conducted on multi-object image datasets, including Microsoft COCO and Flickr, with average image retrieval precisions of 0.9028 and 0.9163, respectively. The experimental results were compared with those of other works on the same image datasets to demonstrate the effectiveness of the proposed method.
Originality/value
A balanced multibranch KD-Tree structure was built for relationship classification on the basis of the original KD-Tree structure. Then, a KD-Tree Random Forest was built to improve classifier performance and retrieve a set of similar images for an input image. Concurrently, the image content was described by combining class names and the relationships between objects.
Cheng Liu, Yi Shi, Wenjing Xie and Xinzhong Bao
Abstract
Purpose
This paper aims to provide a complete analysis framework and prediction method for the construction of the patent securitization (PS) basic asset pool.
Design/methodology/approach
This paper proposes an integrated classification method based on the genetic algorithm and the random forest algorithm. First, the patent value evaluation model and the SME credit evaluation model are considered together, and 17 indicators are determined to measure patent value and SME credit. Second, classification labels for high-quality basic assets are established. Then, the genetic algorithm and the random forest model are used to predict and screen high-quality basic assets. Finally, the performance of the model is evaluated.
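The genetic-algorithm screening step can be sketched as follows. This is a minimal illustration, not the paper's model: each candidate is a 17-bit mask over the indicators, and the toy fitness function (which assumes, purely for the example, that the first six indicators carry signal) stands in for the random forest's prediction accuracy.

```python
import random

random.seed(0)

N_INDICATORS = 17
# Illustrative ground truth only; a real fitness would be classifier accuracy.
INFORMATIVE = set(range(6))

def fitness(mask):
    """Reward selecting informative indicators, penalize noise ones."""
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in INFORMATIVE)
    noise = sum(1 for i, bit in enumerate(mask) if bit and i not in INFORMATIVE)
    return hits - 0.5 * noise

def crossover(a, b):
    """Single-point crossover of two indicator masks."""
    cut = random.randrange(1, N_INDICATORS)
    return a[:cut] + b[cut:]

def mutate(mask, rate=0.05):
    """Flip each bit with a small probability."""
    return [bit ^ (random.random() < rate) for bit in mask]

def genetic_search(pop_size=30, generations=40):
    pop = [[random.randint(0, 1) for _ in range(N_INDICATORS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]  # truncation selection keeps the best half
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

best = genetic_search()
print("selected indicators:", [i for i, bit in enumerate(best) if bit])
```

Because the best half of each generation survives unchanged, the top fitness never decreases across generations.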
Findings
The machine learning model proposed in this study is mainly used to solve the screening problem of high-quality patents that constitute the underlying asset pool of PS. The empirical research shows that the integrated classification method based on genetic algorithm and random forest has good performance and prediction accuracy, and is superior to the single method that constitutes it.
Originality/value
The main contributions of the article are twofold: first, the machine learning model proposed in this article determines the standards for high-quality basic assets; second, the article addresses the screening of basic assets in PS.
Abstract
Purpose
This work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other. Also, DDPML can be deployed on other distributed systems such as P2P networks, clusters, cloud computing or other technologies.
Design/methodology/approach
In the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can be used for later prediction. This knowledge thus becomes a great asset in companies' hands, and exploiting it is precisely the objective of data mining. With data and knowledge now produced in larger volumes and at a faster pace, the field has moved to Big Data mining. The proposed work therefore aims at solving the problems of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. The problem raised in this work is how to make machine learning algorithms work in a distributed and parallel way at the same time without losing the accuracy of the classification results.

To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML). The work is divided into two parts. In the first, the authors propose a distributed architecture controlled by a Map-Reduce algorithm, which in turn depends on a random sampling technique. This architecture is designed to handle big data processing coherently and efficiently with the sampling strategy proposed in this work, and it also allows the authors to verify the classification results obtained using the representative learning base (RLB). In the second part, the authors extract the representative learning base by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB) and the partial learning bases for the first level (PLBL1) and the second level (PLBL2).
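A minimal sketch of the two-level stratified sampling idea follows. The PLBL1/PLBL2 names come from the abstract; the class labels, field names and sampling fractions are illustrative assumptions, not the authors' configuration.

```python
import random
from collections import defaultdict

random.seed(1)

def stratified_sample(records, label_key, fraction):
    """Sample `fraction` of the records within each class (stratum),
    preserving the class distribution of the full data set."""
    strata = defaultdict(list)
    for r in records:
        strata[r[label_key]].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(random.sample(group, k))
    return sample

# Illustrative data: two classes with an 80/20 split.
data = ([{"x": i, "y": "a"} for i in range(80)]
        + [{"x": i, "y": "b"} for i in range(20)])

# Level 1: draw a partial learning base; level 2: subsample it again.
plbl1 = stratified_sample(data, "y", 0.5)   # 40 of class a, 10 of class b
plbl2 = stratified_sample(plbl1, "y", 0.2)  # 8 of class a, 2 of class b
```

Both levels keep the 80/20 class ratio intact, which is the point of stratifying rather than sampling uniformly.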
The experimental results show the efficiency of the proposed solution without significant loss of classification accuracy. In practical terms, the DDPML system is dedicated to big data mining processing and works effectively in distributed systems with a simple structure, such as client-server networks.
Findings
The authors obtained very satisfactory classification results.
Originality/value
DDPML system is specially designed to smoothly handle big data mining classification.
Lucie Maruejols, Hanjie Wang, Qiran Zhao, Yunli Bai and Linxiu Zhang
Abstract
Purpose
Despite rising incomes and the reduction of extreme poverty, the feeling of being poor remains widespread. Support programs can improve well-being, but they first require identifying the households that judge their income insufficient to meet their basic needs, and the factors associated with subjective poverty.
Design/methodology/approach
First, households report the income level they judge sufficient to make ends meet; they are then classified as subjectively poor if their own monetary income is below the level they indicated. Second, the study compares the performance of three machine learning algorithms, the random forest, support vector machines and least absolute shrinkage and selection operator (LASSO) regression, applied to a set of socioeconomic variables to predict subjective poverty status.
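The labeling rule in the first step can be written down directly; the field names below are illustrative, not the survey's variable names.

```python
def subjectively_poor(household):
    """A household is labeled subjectively poor when its actual income
    falls below the income it reports as sufficient to make ends meet."""
    return household["income"] < household["sufficient_income"]

# Two hypothetical survey rows (monetary units are arbitrary).
households = [
    {"id": 1, "income": 1200, "sufficient_income": 1000},
    {"id": 2, "income": 800,  "sufficient_income": 1100},
]
labels = {h["id"]: subjectively_poor(h) for h in households}
# household 2 judges its income insufficient, so it is labeled poor
```

These binary labels are what the three classifiers are then trained to predict from the socioeconomic variables.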
Findings
The random forest generates 85.29% of correct predictions using a range of income and non-income predictors, closely followed by the other two techniques. For the middle-income group, the LASSO regression outperforms random forest. Subjective poverty is mostly associated with monetary income for low-income households. However, a combination of low income, low endowment (land, consumption assets) and unusual large expenditure (medical, gifts) constitutes the key predictors of feeling poor for the middle-income households.
Practical implications
To reduce the feeling of poverty, policy intervention should continue to focus on increasing incomes. However, improvements in non-income domains such as health expenditure, education and family demographics can also relieve the feeling of income inadequacy. Methodologically, which algorithm performs better depends on the data at hand.
Originality/value
For the first time, the authors show that prediction techniques are reliable for identifying subjective poverty prevalence, with an example from rural China. The analysis pays specific attention to modest-income households, which may feel poor without being identified as such by objective poverty lines, and is relevant when policy-makers seek to address the “next step” after ending extreme poverty. Prediction performance and mechanisms for three machine learning algorithms are compared.
Hera Khan, Ayush Srivastav and Amit Kumar Mishra
Abstract
A detailed description will be provided of the classification algorithms that have been widely used in the domain of medical science. The foundation will be laid by giving a comprehensive overview of the background and history of these algorithms, followed by an extensive discussion of the various classification techniques in machine learning (ML), concluding with their applications to data analysis in medical science and health care. The chapter begins with the fundamentals required for a profound understanding of classification techniques in ML, comprising the underlying differences between unsupervised and supervised learning, the basic terminology of classification and its history. It then covers the types of classification algorithms, ranging from linear classifiers such as logistic regression and naïve Bayes to nearest neighbour, support vector machine, tree-based classifiers and neural networks, along with their respective mathematics. Ensemble algorithms such as majority voting, boosting, bagging and stacking will also be discussed at length along with their applications. Furthermore, the chapter elucidates the areas of application of these classification algorithms in biomedicine and health care and their contribution to decision-making systems and predictive analysis. In conclusion, the chapter contributes to research and development by providing thorough insight into classification algorithms and their applications in the healthcare sector.
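As one concrete example of the ensemble techniques mentioned, a minimal majority-voting combiner might look like this; the base classifiers and the 'sick'/'healthy' labels are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the class labels predicted by several base classifiers
    for one sample; the most common label wins (ties -> first seen)."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from three base classifiers for four samples.
clf_a = ["sick", "healthy", "sick", "healthy"]
clf_b = ["sick", "sick", "healthy", "healthy"]
clf_c = ["healthy", "sick", "sick", "healthy"]

ensemble = [majority_vote(votes) for votes in zip(clf_a, clf_b, clf_c)]
print(ensemble)  # → ['sick', 'sick', 'sick', 'healthy']
```

Note that the ensemble can be right on a sample where one base classifier is wrong, which is the intuition behind combining diverse learners.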
Marcelo Cajias and Anna Freudenreich
Abstract
Purpose
This is the first article to apply a machine learning approach to the analysis of time on market in real estate markets.
Design/methodology/approach
The random survival forest approach is introduced to the real estate market. The most important predictors of time on market are revealed, and the response of the survival probability of residential rental apartments to these major characteristics is analyzed.
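Random survival forests build on standard survival-analysis quantities. As background, a minimal Kaplan-Meier estimator of the probability that a listing is still on the market after t days might look like this; the listing durations and censoring flags below are invented for illustration, and this is the classical estimator, not the paper's forest model.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of S(t): the probability that a listing is
    still on the market after t days. observed=False means the listing
    was censored (left the sample without the event occurring)."""
    # At tied times, events are processed before censored observations.
    pairs = sorted(zip(durations, observed), key=lambda p: (p[0], not p[1]))
    at_risk = len(pairs)
    surv = 1.0
    curve = []
    for t, event in pairs:
        if event:
            surv *= (at_risk - 1) / at_risk
            curve.append((t, round(surv, 4)))
        at_risk -= 1
    return curve

# Hypothetical days on market for five listings; False = still listed
# (censored) at the end of the observation window.
days = [5, 8, 8, 12, 15]
rented = [True, True, False, True, False]
print(kaplan_meier(days, rented))  # → [(5, 0.8), (8, 0.6), (12, 0.3)]
```

A random survival forest estimates such curves conditionally on covariates (price, living area, and so on) by aggregating them over the trees.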
Findings
Results show that price, living area, construction year, year of listing and the distances to the nearest hairdresser, bakery and city center have the greatest impact on the marketing time of residential apartments. The time on market for an apartment in Munich is lowest at a price of €750 per month, an area of 60 m², a construction year of 1985 and a location within 200–400 meters of the important amenities.
Practical implications
The findings might be interesting for private and institutional investors to derive real estate investment decisions and implications for portfolio management strategies and ultimately to minimize cash-flow failure.
Originality/value
Although machine learning algorithms have been applied frequently to the real estate market for the analysis of prices, their application to examining time on market is completely novel. This is the first paper to apply a machine learning approach to survival analysis in the real estate market.
Mostafa El Habib Daho, Nesma Settouti, Mohammed El Amine Bechar, Amina Boublenza and Mohammed Amine Chikh
Abstract
Purpose
Ensemble methods have been widely used in the field of pattern recognition because of the difficulty of finding a single classifier that performs well on a wide variety of problems. Despite the effectiveness of these techniques, studies have shown that ensemble methods generate a large number of hypotheses that, in most cases, contain redundant classifiers. Several works in the state of the art attempt to reduce the set of hypotheses without affecting performance.
Design/methodology/approach
In this work, the authors propose a pruning method that takes into consideration the correlation between classifiers/classes and of each classifier with the rest of the set. The authors use the random forest algorithm as the tree-based ensemble classifier, and the pruning is performed with a technique inspired by the CFS (correlation feature selection) algorithm.
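A rough sketch of correlation-based ensemble selection in the spirit of this design follows. This is not the authors' code: pairwise prediction agreement is used as a simple stand-in for classifier correlation, and the toy validation predictions are invented.

```python
from math import sqrt

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def agreement(a, b):
    """Proxy for classifier correlation: fraction of identical outputs."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cfs_merit(subset, preds, truth):
    """CFS-style merit k*r_cf / sqrt(k + k*(k-1)*r_ff): favors subsets of
    accurate classifiers (high r_cf) that disagree with each other (low r_ff)."""
    k = len(subset)
    r_cf = sum(accuracy(preds[i], truth) for i in subset) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for i in subset for j in subset if i < j]
    r_ff = sum(agreement(preds[i], preds[j]) for i, j in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

def select_ensemble(preds, truth):
    """Greedy forward selection: grow the sub-ensemble while merit improves."""
    chosen, best = [], 0.0
    while len(chosen) < len(preds):
        merit, c = max((cfs_merit(chosen + [c], preds, truth), c)
                       for c in range(len(preds)) if c not in chosen)
        if merit <= best:
            break
        chosen, best = chosen + [c], merit
    return chosen

# Validation-set predictions of four hypothetical trees; tree 3 is useless.
truth = [0, 1, 0, 1, 1, 0]
preds = [[0, 1, 0, 1, 0, 0],
         [0, 1, 1, 1, 1, 0],
         [1, 1, 0, 1, 1, 0],
         [1, 0, 1, 0, 0, 1]]
print(select_ensemble(preds, truth))  # the useless tree is pruned away
```

In this toy case the three informative trees each make one mistake in a different place, so their majority vote is perfect and the merit criterion keeps all three while rejecting the uninformative tree.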
Findings
The proposed method, CES (Correlation-based Ensemble Selection), was evaluated on ten datasets from the UCI machine learning repository, and its performance was compared with six ensemble pruning techniques. The results showed that the proposed pruning method selects a small ensemble in less time while improving classification rates compared to the state-of-the-art methods.
Originality/value
CES is a new ordering-based method that uses the CFS algorithm. CES selects, in a short time, a small sub-ensemble that outperforms results obtained from the whole forest and the other state-of-the-art techniques used in this study.
Amitava Choudhury, Tanmay Konnur, P.P. Chattopadhyay and Snehanshu Pal
Abstract
Purpose
The purpose of this paper is to predict the various phases and crystal structures of multi-component alloys. Nowadays, the concepts and strategies of the development of multi-principal element alloys (MPEAs) significantly increase the number of candidate alloy systems, which demands proper screening of a large number of alloy systems based on the nature of their phase and structure. Experimentally obtained data linking elemental properties and the resulting phases of MPEAs are abundant; hence, there is strong scope for the categorization/classification of MPEAs based on the structural features of the resultant phase, along with distinctive connections between elemental properties and phases.
Design/methodology/approach
In this paper, several machine-learning algorithms have been used to recognize the underlying patterns in MPEA design data sets and to classify MPEAs based on the structural features of their resultant phase, such as single-phase solid solution, amorphous and intermetallic compounds. Further classification of MPEAs having a single-phase solid solution is performed based on crystal structure using an ensemble-based machine-learning algorithm known as the random-forest algorithm.
Findings
The model developed by implementing the random-forest algorithm has resulted in an accuracy of 91 per cent for phase prediction and 93 per cent for crystal structure prediction for the single-phase solid solution class of MPEAs. Five input parameters are used in the prediction model, namely, valence electron concentration, difference in Pauling electronegativity, atomic size difference, mixing enthalpy and mixing entropy. It has been found that the valence electron concentration is the most important feature with respect to the prediction of phases. To avoid overfitting, fivefold cross-validation has been performed. To understand the comparative performance, different algorithms such as K-nearest neighbor, support vector machine, logistic regression, naïve Bayes, decision tree and neural network have been applied to the data set.
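The fivefold cross-validation mentioned here follows the standard scheme, which can be sketched at the index level as follows (model fitting is omitted; the sample count is illustrative):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n_samples, k=5):
    """Yield (train, test) index pairs; each fold serves once as the test set."""
    folds = k_fold_indices(n_samples, k)
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Fivefold split of 100 alloy samples: every sample is tested exactly once.
splits = list(cross_validate(100, k=5))
```

Averaging the model's accuracy over the five held-out folds gives an estimate that is less prone to overfitting than a single train/test split.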
Originality/value
In this paper, the authors described the phase selection and crystal structure prediction mechanism in MPEA data set and have achieved better accuracy using machine learning.
Kalyan Nagaraj, Biplab Bhattacharjee, Amulyashree Sridhar and Sharvani GS
Abstract
Purpose
Phishing is one of the major threats affecting businesses worldwide in current times. Organizations and customers face the hazards arising out of phishing attacks because of anonymous access to vulnerable details. Such attacks often result in substantial financial losses. Thus, there is a need for effective intrusion detection techniques to identify and possibly nullify the effects of phishing. Classifying phishing and non-phishing web content is a critical task in information security protocols, and foolproof mechanisms have yet to be implemented in practice. The purpose of the current study is to present an ensemble machine learning model for classifying phishing websites.
Design/methodology/approach
A publicly available data set comprising 10,068 instances of phishing and legitimate websites was used to build the classifier model. Feature extraction was performed by deploying a group of methods, and relevant features extracted were used for building the model. A twofold ensemble learner was developed by integrating results from random forest (RF) classifier, fed into a feedforward neural network (NN). Performance of the ensemble classifier was validated using k-fold cross-validation. The twofold ensemble learner was implemented as a user-friendly, interactive decision support system for classifying websites as phishing or legitimate ones.
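A schematic of the twofold (stacked) ensemble idea follows, with decision stumps standing in for the random forest stage and a single logistic neuron standing in for the feedforward NN. The features, thresholds and training rows are all invented for illustration; this is not the study's model.

```python
import math

def stump_predict(x, feature, threshold):
    """A decision stump: stand-in for one tree of the first-stage forest."""
    return 1.0 if x[feature] > threshold else 0.0

# Hypothetical stage-one forest: three stumps over two normalized website
# features (say, URL length and count of suspicious tokens).
stumps = [(0, 0.5), (1, 0.3), (0, 0.8)]

def base_outputs(x):
    return [stump_predict(x, f, t) for f, t in stumps]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train_meta(data, labels, epochs=500, lr=0.5):
    """Stage two: one logistic neuron (a minimal stand-in for the
    feedforward NN) learns to weight the stump votes."""
    w, b = [0.0] * len(stumps), 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            h = base_outputs(x)
            p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
            err = p - y  # gradient of the log-loss with respect to z
            w = [wi - lr * err * hi for wi, hi in zip(w, h)]
            b -= lr * err
    return w, b

# Invented training data: [url_len, suspicious_tokens], label 1 = phishing.
data = [[0.9, 0.6], [0.7, 0.4], [0.2, 0.1],
        [0.1, 0.2], [0.95, 0.9], [0.3, 0.05]]
labels = [1, 1, 0, 0, 1, 0]
w, b = train_meta(data, labels)

def classify(x):
    z = sum(wi * hi for wi, hi in zip(w, base_outputs(x))) + b
    return sigmoid(z) > 0.5
```

The design choice being illustrated is that the second stage does not see the raw features, only the first stage's outputs, which is what makes the ensemble "twofold".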
Findings
Experimental simulations were performed to assess and compare the performance of the ensemble classifiers. The statistical tests estimated that the RF_NN model gave superior performance, with an accuracy of 93.41 per cent and a minimal mean squared error of 0.000026.
Research limitations/implications
The research data set used in this study is publicly available and easy to analyze. Comparative analysis with other real-time data sets of recent origin must be performed to ensure generalization of the model against various security breaches. Different variants of phishing threats must be detected, rather than focusing solely on phishing website detection.
Originality/value
To the best of the authors' knowledge, the twofold ensemble model has not been applied to the classification of phishing websites in any previous study.
Abstract
Purpose
This paper aims to inspect defects of the solder joints of printed circuit boards in a real-time production line; simple computation and high accuracy are the primary considerations for the feature extraction and classification algorithm.
Design/methodology/approach
In this study, the author presents an ensemble method for the classification of solder joint defects. The new method is based on extracting color and geometry features after solder image acquisition and using decision trees to guarantee the algorithm's running efficiency. To improve accuracy, the author proposes an ensemble method of random forest that combines several trees for the classification of solder joints.
Findings
The proposed method has been tested using 280 samples of solder joints, including good and various defect types, for experiments. The results show that the proposed method has a high accuracy.
Originality/value
The author extracted color and geometry features and used decision trees to guarantee the algorithm's running efficiency, and proposes an ensemble method of random forest that combines several trees to improve the accuracy of solder joint classification.