Search results
1 – 10 of over 3000
Abstract
Purpose
This work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or others. DDPML can also be deployed on other distributed systems such as P2P networks, clusters, cloud computing or other technologies.
Design/methodology/approach
In the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, because the data turn into knowledge that can later be used for prediction. This knowledge thus becomes a great asset in companies' hands, which is precisely the objective of data mining. But with data and knowledge produced at an ever faster pace, the field now speaks of Big Data mining. For this reason, the proposed work mainly aims at solving the problems of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. The problem raised in this work is how machine learning algorithms can run in a distributed and parallel way at the same time without losing the accuracy of the classification results. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML). The work is divided into two parts. In the first, the authors propose a distributed architecture controlled by a Map-Reduce algorithm, which in turn depends on a random sampling technique. The distributed architecture is specially designed to handle big data processing and operates coherently and efficiently with the sampling strategy proposed in this work. This architecture also allows the authors to verify the classification results obtained using the representative learning base (RLB). In the second part, the authors extract the representative learning base by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB) and the partial learning bases for the first level (PLBL1) and the second level (PLBL2).
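A two-level stratified random sampling step like the one described above can be sketched in plain Python. The dataset, class labels, sampling fractions and the use of the PLBL1/RLB names below are illustrative assumptions, not the authors' actual learning bases:

```python
import random
from collections import defaultdict

def stratified_sample(records, label_of, fraction, seed=0):
    """Draw a simple random sample within each class stratum,
    preserving the class proportions of the input."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[label_of(r)].append(r)
    sample = []
    for label, rows in strata.items():
        k = max(1, round(len(rows) * fraction))
        sample.extend(rng.sample(rows, k))
    return sample

# Two-level sampling: first extract a partial learning base, then
# stratify again inside it to obtain a representative learning base.
# (Names follow the abstract; the data and fractions are made up.)
data = [(i, "pos" if i % 3 == 0 else "neg") for i in range(300)]
plbl1 = stratified_sample(data, lambda r: r[1], 0.5, seed=1)
rlb = stratified_sample(plbl1, lambda r: r[1], 0.4, seed=2)
```

Because sampling happens per stratum, the positive/negative ratio of the original data carries through both levels.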
Findings
The experimental results show the efficiency of the proposed solution without significant loss in the classification results. In practical terms, the DDPML system is dedicated to big data mining and works effectively in distributed systems with a simple structure, such as client-server networks. The authors obtained very satisfactory classification results.
Originality/value
The DDPML system is specially designed to handle big data mining classification smoothly.
Hendrik Kohrs, Benjamin Rainer Auer and Frank Schuhmacher
Abstract
Purpose
In short-term forecasting of day-ahead electricity prices, incorporating intraday dependencies is vital for accurate predictions. However, it quickly leads to dimensionality problems, i.e. ill-defined models with too many parameters, which require an adequate remedy. This study addresses this issue.
Design/methodology/approach
In an application for the German/Austrian market, this study derives variable importance scores from a random forest algorithm, feeds the identified variables into a support vector machine and compares the resulting forecasting technique to other approaches (such as dynamic factor models, penalized regressions or Bayesian shrinkage) that are commonly used to resolve dimensionality problems.
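A minimal sketch of the two-stage idea described above: rank candidate predictors with random-forest importance scores, then fit a support vector machine on the top-ranked ones. The synthetic data, the lag structure and the cut-off of five features are assumptions for illustration, not the study's actual setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 24))          # stand-in for 24 hourly price lags
y = 2.0 * X[:, 3] - 1.5 * X[:, 17] + rng.normal(scale=0.1, size=200)

# Stage 1: derive variable importance scores from a random forest.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:5]   # importance profile

# Stage 2: feed only the identified variables into an SVM.
svm = SVR().fit(X[:, top], y)
pred = svm.predict(X[:, top])
```

Restricting the SVM to the high-importance lags is what keeps the model well-defined despite the large number of candidate intraday predictors.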
Findings
This study develops full importance profiles stating which hours of which past days have the highest predictive power for specific hours in the future. Using the profile information in the forecasting setup leads to very promising results compared to the alternatives. Furthermore, the importance profiles provide a possible explanation why some forecasting methods are more accurate for certain hours of the day than others. They also help to explain why simple forecast combination schemes tend to outperform the full battery of models considered in the comprehensive comparative study.
Originality/value
With the information contained in the variable importance scores and the results of the extensive model comparison, this study essentially provides guidelines for variable and model selection in future electricity market research.
R. Shashikant and P. Chetankumar
Abstract
Cardiac arrest is a severe heart anomaly that results in billions of annual casualties. Smoking is a specific hazard factor for cardiovascular pathology, including coronary heart disease, but data on smoking and cardiac death have not previously been reviewed together. In this paper, heart rate variability (HRV) parameters are used to predict cardiac arrest in smokers with machine learning techniques. Machine learning is a method of computing that learns automatically from experience and improves its performance to strengthen prognosis. This study compares the performance of logistic regression, decision tree and random forest models in predicting cardiac arrest in smokers. The machine learning techniques were implemented on a dataset received from the data science research group MITU Skillogies, Pune, India. To determine whether a patient is at risk of cardiac arrest, three predictive models were developed with 19 HRV indices as input features and two output classes. The models were evaluated on their accuracy, precision, sensitivity, specificity, F1 score and area under the curve (AUC). The logistic regression model achieved an accuracy of 88.50%, precision of 83.11%, sensitivity of 91.79%, specificity of 86.03%, F1 score of 0.87 and AUC of 0.88. The decision tree model achieved an accuracy of 92.59%, precision of 97.29%, sensitivity of 90.11%, specificity of 97.38%, F1 score of 0.93 and AUC of 0.94. The random forest model achieved an accuracy of 93.61%, precision of 94.59%, sensitivity of 92.11%, specificity of 95.03%, F1 score of 0.93 and AUC of 0.95. The random forest model achieved the best classification accuracy, followed by the decision tree, with logistic regression showing the lowest.
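The five figures reported in the abstract can all be derived from a binary confusion matrix. The counts below are invented for illustration; only the formulas mirror the evaluation described:

```python
def metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # recall / true-positive rate
    specificity = tn / (tn + fp)          # true-negative rate
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, precision, sensitivity, specificity, f1

# Hypothetical counts for a cardiac-arrest classifier on 200 patients.
acc, prec, sens, spec, f1 = metrics(tp=90, fp=10, tn=85, fn=15)
```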
Mostafa El Habib Daho, Nesma Settouti, Mohammed El Amine Bechar, Amina Boublenza and Mohammed Amine Chikh
Abstract
Purpose
Ensemble methods have been widely used in the field of pattern recognition due to the difficulty of finding a single classifier that performs well on a wide variety of problems. Despite the effectiveness of these techniques, studies have shown that ensemble methods generate a large number of hypotheses and that contain redundant classifiers in most cases. Several works proposed in the state of the art attempt to reduce all hypotheses without affecting performance.
Design/methodology/approach
In this work, the authors propose a pruning method that takes into consideration the correlation between classifiers and classes and between each classifier and the rest of the set. The authors used the random forest algorithm as a tree-based ensemble classifier, and the pruning was performed using a technique inspired by the CFS (correlation feature selection) algorithm.
Findings
The proposed method, CES (correlation-based Ensemble Selection), was evaluated on ten datasets from the UCI machine learning repository, and its performance was compared to six ensemble pruning techniques. The results showed that the proposed pruning method selects a small ensemble in a smaller amount of time while improving classification rates compared to the state-of-the-art methods.
Originality/value
CES is a new ordering-based method that uses the CFS algorithm. CES selects, in a short time, a small sub-ensemble that outperforms results obtained from the whole forest and the other state-of-the-art techniques used in this study.
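The CFS-inspired idea behind CES can be sketched as a merit score that rewards classifiers correlated with the true class but uncorrelated with each other, combined with greedy forward selection. The correlation values below are invented, and this is a sketch of the general CFS merit applied to classifiers, not the authors' exact procedure:

```python
import math

def merit(subset, corr_with_class, corr_between):
    """CFS-style merit: k*r_cf / sqrt(k + k(k-1)*r_ff), where r_cf is the
    mean classifier-class correlation and r_ff the mean pairwise one."""
    k = len(subset)
    r_cf = sum(corr_with_class[i] for i in subset) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for i in subset for j in subset if i < j]
    r_ff = sum(corr_between[i][j] for i, j in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def greedy_select(n, corr_with_class, corr_between):
    """Add the classifier that most improves merit; stop when none does."""
    selected, best = [], -1.0
    while len(selected) < n:
        gains = [(merit(selected + [i], corr_with_class, corr_between), i)
                 for i in range(n) if i not in selected]
        m, i = max(gains)
        if m <= best:
            break
        selected.append(i)
        best = m
    return selected

# Four hypothetical trees: 1 is accurate but redundant with 0 and 3.
corr_with_class = [0.9, 0.85, 0.6, 0.88]
corr_between = [[0, 0.95, 0.20, 0.90],
                [0, 0,    0.25, 0.92],
                [0, 0,    0,    0.30],
                [0, 0,    0,    0]]
chosen = greedy_select(4, corr_with_class, corr_between)
```

Note how the redundant tree (index 1) is left out even though it is individually accurate, which is the point of correlation-based pruning.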
Suraj Kulkarni, Suhas Suresh Ambekar and Manoj Hudnurkar
Abstract
Purpose
Increasing health-care costs are a major concern, especially in the USA. The purpose of this paper is to predict the hospital charges of a patient before admission. This will help a patient who is being admitted "electively" to plan his/her finances. It can also be used as a tool by payers (insurance companies) to better forecast the amount that a patient might claim.
Design/methodology/approach
This research uses secondary data collected from New York state's patient discharges of 2017. A stratified sampling technique is used to sample the data from the population, and feature engineering is performed on the categorical variables. Different regression techniques are used to predict the target value "total charges."
Findings
Total cost varies linearly with the length of stay. Among all the machine learning algorithms considered, namely, random forest, stochastic gradient descent (SGD) regressor, K-nearest neighbors regressor, extreme gradient boosting regressor and gradient boosting regressor, the random forest regressor had the best accuracy, with an R2 value of 0.7753. "Age group" was the most important predictor among all the features.
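The R2 value reported above measures the share of variance in total charges explained by the model. A from-scratch version, using toy numbers rather than the study's data:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical charges (true vs predicted), in dollars.
y_true = [1000, 2000, 3000, 4000]
y_pred = [1100, 1900, 3200, 3900]
r2 = r_squared(y_true, y_pred)
```

An R2 of 0.7753 thus means roughly 78% of the variance in charges is captured by the predictors.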
Practical implications
This model can be helpful for patients who want to compare the cost at different hospitals and can plan their finances accordingly in case of “elective” admission. Insurance companies can predict how much a patient with a particular medical condition might claim by getting admitted to the hospital.
Originality/value
Health care can be a costly affair if not planned properly. This research gives patients and insurance companies a better prediction of the total cost that they might incur.
Vinod Nistane and Suraj Harsha
Abstract
Purpose
In rotary machines, bearing failure is one of the major causes of machinery breakdown. Monitoring bearing degradation is therefore a major concern for the prevention of bearing failures. This paper aims to present a combination of stationary wavelet decomposition and extra-trees regression (ETR) for the evaluation of bearing degradation.
Design/methodology/approach
Higher-order cumulant features are extracted from the bearing vibration signals using stationary wavelet decomposition (the stationary wavelet transform [SWT]). The extracted features are then fed to the ETR model to characterize the normal and failure states. A dominance-level curve, built from the dissimilarity data of the test object, is retained as a health degradation indicator for the evaluation of bearing health.
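The regression stage can be roughly sketched as follows: features extracted from vibration windows (here random stand-ins, not real SWT cumulants) are mapped to a degradation level with extra-trees regression. Everything below is synthetic and only illustrates the model class named in the abstract:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(150, 8))          # stand-in wavelet-cumulant features
health = np.linspace(1.0, 0.0, 150)    # 1 = normal state, 0 = failure state
X[:, 0] += 3 * (1 - health)            # let one feature track degradation

# Fit extra-trees on the feature/health pairs and read back an indicator.
etr = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X, health)
indicator = etr.predict(X)
```

In a real monitoring loop the indicator would be evaluated on fresh vibration windows and tracked over time as the health degradation curve.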
Findings
Experiments were conducted to verify and assess the effectiveness of ETR for evaluating bearing degradation. To establish the superiority of the recommended approach, it is compared with random forest regression and multi-layer perceptron regression.
Originality/value
The experimental results indicate that the adopted method detects degradation more accurately at an early stage. Furthermore, diagnostics and prognostics have been receiving much attention in the field of vibration analysis, where they play a significant role in avoiding accidents.
Daniel Abreu Vasconcellos de Paula, Rinaldo Artes, Fabio Ayres and Andrea Maria Accioly Fonseca Minardi
Abstract
Purpose
Although credit unions are nonprofit organizations, their objectives depend on the efficient management of their resources and credit risk aligned with the principles of the cooperative doctrine. This paper aims to propose the combined use of credit scoring and profit scoring to increase the effectiveness of the loan-granting process in credit unions.
Design/methodology/approach
The sample is composed of data on personal loan transactions of a Brazilian credit union.
Findings
The analysis reveals that the use of statistical methods significantly improves the predictability of default compared to subjective techniques, and shows the superiority of the random forest model in estimating credit scoring and profit scoring compared to logit and ordinary least squares (OLS) regression. The study also illustrates how both analyses can be used jointly for more effective decision-making.
Originality/value
Replacing subjective analysis with objective credit analysis using deterministic models will benefit Brazilian credit unions. The credit decision will be based on the input variables and on clear criteria, turning the decision-making process impartial. The joint use of credit scoring and profit scoring allows granting credit for the clients with the highest potential to pay debt obligation and, at the same time, to certify that the transaction profitability meets the goals of the organization: to be sustainable and to provide loans and investment opportunities at attractive rates to members.
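The joint decision rule described above can be sketched as: approve a loan only if the estimated default probability (credit score) is low enough and the expected profit (profit score) clears a hurdle. The thresholds and applicant figures below are invented for illustration:

```python
def decide(default_prob, expected_profit,
           max_default_prob=0.10, min_profit=0.0):
    """Combine a credit score and a profit score into one loan decision."""
    if default_prob > max_default_prob:
        return "reject"          # fails the credit-scoring screen
    if expected_profit < min_profit:
        return "reject"          # fails the profit-scoring screen
    return "approve"

# (default probability, expected profit) for three hypothetical applicants.
applicants = [(0.03, 120.0), (0.25, 300.0), (0.05, -40.0)]
decisions = [decide(p, e) for p, e in applicants]
```

Note that the second applicant is profitable but too risky, and the third is safe but unprofitable; only the first passes both screens, which is the point of using the two scores jointly.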
Abstract
Purpose
This paper aims to inspect defects of solder joints on printed circuit boards in a real-time production line; simple computation and high accuracy are the primary considerations for the feature extraction and classification algorithms.
Design/methodology/approach
In this study, the author presents an ensemble method for the classification of solder joint defects. The method is based on extracting color and geometry features after solder image acquisition and uses decision trees to guarantee the algorithm's running efficiency. To improve accuracy, the author proposes a random forest ensemble that combines several trees for the classification of solder joints.
Findings
The proposed method has been tested using 280 samples of solder joints, including good and various defect types, for experiments. The results show that the proposed method has a high accuracy.
Originality/value
The author extracted color and geometry features and used decision trees to guarantee the algorithm's running efficiency. To improve accuracy, a random forest ensemble combining several trees is proposed for the classification of solder joints, and it achieves high accuracy on the test samples.
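The random-forest step ultimately reduces to a majority vote over the individual decision trees. A minimal vote function; the defect labels are hypothetical, not the author's actual classes:

```python
from collections import Counter

def forest_predict(tree_outputs):
    """Return the label that the most trees agree on."""
    return Counter(tree_outputs).most_common(1)[0][0]

# Outputs of five hypothetical trees for one solder-joint image.
votes = ["good", "bridge", "good", "good", "insufficient"]
label = forest_predict(votes)
```

Combining many shallow, fast trees this way is what lets the method keep per-image computation simple while raising accuracy over any single tree.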
Samar Shilbayeh and Rihab Grassa
Abstract
Purpose
Bank creditworthiness refers to the evaluation of a bank's ability to meet its financial obligations. It is an assessment of the bank's financial health, stability and capacity to manage risks. This paper aims to investigate the credit rating patterns that are crucial for assessing the creditworthiness of Islamic banks, thereby evaluating the stability of their industry.
Design/methodology/approach
Three distinct machine learning algorithms are exploited and evaluated for the desired objective. This research initially uses the decision tree machine learning algorithm as a base learner, conducting an in-depth comparison with the ensemble decision tree and random forest. Subsequently, the Apriori algorithm is deployed to uncover the most significant attributes impacting a bank's credit rating. To appraise the models, a ten-fold cross-validation method is applied: the data set is segmented into ten folds, with nine folds used for training and one for testing, rotated across ten iterations. This approach mitigates potential biases that could arise during the learning and training phases. Following this process, accuracy is assessed and depicted in a confusion matrix, as outlined in the methodology section.
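The ten-fold scheme described above can be sketched as follows: each fold serves as the test set exactly once while the remaining nine train the model. The index-splitting logic below is generic, not tied to the study's bank data:

```python
def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k folds; yield (train, test) per rotation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits

# Ten rotations over a hypothetical data set of 100 banks.
splits = k_fold_indices(100, k=10)
```

Every observation is tested exactly once and trained on nine times, which is what averages out the sampling bias of any single train/test split.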
Findings
The findings of this investigation reveal that the random forest machine learning algorithm outperforms the others, achieving an impressive 90.5% accuracy in predicting credit ratings. Notably, the research sheds light on the loan-to-deposit ratio as the primary attribute affecting credit rating predictions; it appears to be the single most influential bank attribute in this respect. In addition, the deposit-to-assets ratio and the profit-sharing investment account ratio are found to be effective in credit rating prediction, and the ownership structure criterion emerges as one of the essential bank attributes.
Originality/value
These findings contribute significant evidence to the understanding of attributes that strongly influence credit rating predictions within the banking sector. This study uniquely contributes by uncovering patterns that have not been previously documented in the literature, broadening our understanding in this field.
Ganisha N.P. Athaudage, H. Niles Perera, P.T. Ranil S. Sugathadasa, M. Mavin De Silva and Oshadhi K. Herath
Abstract
Purpose
The crude oil supply chain (COSC) is one of the most complex and largest supply chains in the world. It is easily vulnerable to extreme events. Recently, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (often known as COVID-19) pandemic created a massive imbalance between supply and demand which caused significant price fluctuations. The purpose of this study is to explore the influential factors affecting the international COSC in terms of consumption, production and price. Furthermore, it develops a model to predict the international crude oil price during disease outbreaks using Random Forest (RF) regression.
Design/methodology/approach
This study uses both qualitative and quantitative approaches. A qualitative study is conducted using a literature review to explore the influential factors on COSC. All the data are extracted from Web sources. In addition to COVID-19, four other diseases are considered to optimize the accuracy of predictive results. A principal component analysis is deployed to reduce the number of variables. A forecasting model is developed using RF regression.
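A sketch of the quantitative pipeline named above: reduce correlated indicators with principal component analysis, then fit a random-forest regressor on the components. All numbers below are synthetic stand-ins, not the study's oil-market data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
base = rng.normal(size=(120, 3))                  # three latent drivers
# Twelve observed indicators: four noisy copies of the three drivers.
X = np.hstack([base + 0.05 * rng.normal(size=(120, 3)) for _ in range(4)])
price = base @ np.array([5.0, -2.0, 1.0]) + 60.0  # stand-in crude price

# Step 1: PCA collapses the correlated columns into a few components.
pca = PCA(n_components=3).fit(X)
Z = pca.transform(X)

# Step 2: RF regression maps the components to the price.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(Z, price)
pred = rf.predict(Z)
```

Because the twelve columns are driven by only three factors, a handful of components retains nearly all the variance, which is the rationale for applying PCA before the forecasting model.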
Findings
The findings of the qualitative analysis characterize the factors that influence the international COSC. The findings of the quantitative analysis emphasize that production and consumption contribute most to the variance of the data set. The study also found that the impact on the crude oil price varies by region. Most importantly, the model built using the RF technique provides high predictive ability over short horizons during events such as infectious disease outbreaks. This study delivers future directions and insights to researchers and practitioners to expand the study further.
Originality/value
This is one of the few available pieces of research which uses the RF method in the context of crude oil price forecasting. Additionally, this study examines international COSC in the events of emergencies, specifically disease outbreaks using machine learning techniques.