Search results

1 – 10 of 227
Article
Publication date: 7 August 2017

Sathyavikasini Kalimuthu and Vijaya Vijayakumar

Abstract

Purpose

Diagnosing genetic neuromuscular disorders such as muscular dystrophy is complicated when the defect arises during splicing. This paper aims to predict the type of muscular dystrophy from gene sequences by extracting well-defined descriptors related to splicing mutations. An automatic model is built to classify the disease using pattern recognition techniques coded in Python with the scikit-learn framework.

Design/methodology/approach

In this paper, the cloned gene sequences are synthesized based on the mutation position and its location on the chromosome using the positional cloning approach. For instance, in the Human Gene Mutation Database (HGMD), the mutational information for a splicing mutation is specified as IVS1-5 T > G, where IVS denotes the intervening sequence (intron): the variant lies in the first intron, five nucleotides before the consensus intron site AG, with nucleotide G altered to T. IVS (+ve) denotes the forward strand (3′), with positive numbers counted from the G of the invariant donor site, and IVS (−ve) denotes the backward strand (5′), with negative numbers counted from the G of the acceptor site. The key idea in this paper is to identify discriminative descriptors from diseased gene sequences based on splicing variants and to provide an effective machine learning solution for predicting the type of muscular dystrophy associated with splicing mutations. Multi-class classification is carried out through data modeling of the gene sequences. Synthetic mutational gene sequences are created, as diseased gene sequences are not readily obtainable for this intricate disease. The positional cloning approach supports generating disease gene sequences based on the mutational information acquired from HGMD. SNP-, gene- and exon-based discriminative features are identified and used to train the model. A muscular dystrophy disease prediction model is built using supervised learning techniques in the scikit-learn environment. The data frame is built from the extracted features as a NumPy array. The data are normalized by transforming the feature values into the range between 0 and 1, which scales the input attributes for the model. Naïve Bayes, decision tree, K-nearest neighbor and SVM models are developed using the scikit-learn Python library.
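
A minimal sketch of the kind of scikit-learn workflow described above, using placeholder data in place of the extracted SNP-, gene- and exon-based descriptors (feature values, label set and model parameters are assumptions, not the paper's):

# Sketch: scale descriptors to the 0-1 range and fit the four classifiers named
# in the abstract. The arrays below are random placeholders, not HGMD-derived data.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X = np.random.rand(200, 12)             # placeholder SNP-, gene- and exon-based descriptors
y = np.random.randint(0, 4, size=200)   # placeholder muscular dystrophy subtype labels

X_scaled = MinMaxScaler().fit_transform(X)   # normalize features into the 0-1 range

models = {
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "svm": SVC(kernel="rbf", C=1.0, gamma="scale"),
}
for name, model in models.items():
    model.fit(X_scaled, y)
    print(name, model.score(X_scaled, y))   # training accuracy only, for illustration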

Findings

To the best of the authors' knowledge, this is the first pattern recognition model to classify muscular dystrophy pertaining to splicing mutations. Essential SNP-, gene- and exon-based descriptors related to splicing mutations are proposed and extracted from the cloned gene sequences. The model is built using statistical learning techniques through scikit-learn in the Anaconda environment. This paper also discusses the results of statistical learning carried out on the same set of gene sequences with synonymous and non-synonymous mutational descriptors.

Research limitations/implications

The data frame is built from a NumPy array. Normalizing the data by transforming the feature values into the range between 0 and 1 scales the input attributes for the model. Naïve Bayes, decision tree, K-nearest neighbor and SVM models are developed using the scikit-learn Python library. While learning the SVM model, the cost, gamma and kernel parameters are tuned to attain good results. Scoring parameters of the classifiers are evaluated with tenfold cross-validation using the metric functions of the scikit-learn library. Results of the disease identification model based on non-synonymous, synonymous and splicing mutations were analyzed.
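
A hedged illustration of the SVM tuning and tenfold cross-validated scoring described above (the parameter grid and data are assumptions, not the values the authors report):

# Sketch: tune the SVM cost, gamma and kernel parameters, then score the best
# estimator with tenfold cross-validation. Data and grid values are placeholders.
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X = np.random.rand(200, 12)             # placeholder descriptors (as in the previous sketch)
y = np.random.randint(0, 4, size=200)   # placeholder subtype labels

param_grid = {
    "C": [0.1, 1, 10, 100],              # cost
    "gamma": ["scale", 0.01, 0.001],
    "kernel": ["rbf", "linear", "poly"],
}
search = GridSearchCV(SVC(), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)

scores = cross_val_score(search.best_estimator_, X, y, cv=10)
print(search.best_params_, round(scores.mean(), 3))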

Practical implications

Essential SNP-, gene- and exon-based descriptors related to splicing mutations are proposed and extracted from the cloned gene sequences. The model is built using statistical learning techniques through scikit-learn in the Anaconda environment. The performance of the classifiers is increased by using different estimators from the scikit-learn library. Several types of mutations, such as missense, nonsense and silent mutations, are also considered to build models through statistical learning, and their results are analyzed.

Originality/value

To the best of the authors' knowledge, this is the first pattern recognition model to classify muscular dystrophy pertaining to splicing mutations.

Details

World Journal of Engineering, vol. 14 no. 4
Type: Research Article
ISSN: 1708-5284

Book part
Publication date: 4 December 2020

Gauri Rajendra Virkar and Supriya Sunil Shinde

Abstract

Predictive analytics is the science of decision-making that removes guesswork from the decision-making process and applies proven scientific procedures to find the right solutions. Predictive analytics provides insight into future downtimes and rejections, thereby aiding in taking preventive action before abnormalities occur. Given these advantages, predictive analytics has been adopted in diverse fields such as health care, finance, education, marketing and automotive. Predictive analytics tools can be used to predict various behaviors and patterns, saving their users time and money. Many open-source predictive analytics tools, namely R, scikit-learn, Konstanz Information Miner (KNIME), Orange, RapidMiner and Waikato Environment for Knowledge Analysis (WEKA), are freely available. This chapter aims to identify the most accurate tools and techniques for the classification task that aid in decision-making. Our experimental results show that no single tool provides the best results in all scenarios; rather, performance depends on the dataset and the classifier.

Article
Publication date: 6 February 2023

Francina Malan and Johannes Lodewyk Jooste

Abstract

Purpose

The purpose of this paper is to compare the effectiveness of the various text mining techniques that can be used to classify maintenance work-order records into their respective failure modes, focussing on the choice of algorithm and preprocessing transforms. Three algorithms are evaluated, namely Bernoulli Naïve Bayes, multinomial Naïve Bayes and support vector machines.

Design/methodology/approach

The paper has both a theoretical and an experimental component. In the literature review, the various algorithms and preprocessing techniques used in text classification are considered from three perspectives: the domain-specific maintenance literature, the broader short-form literature and the general text classification literature. The experimental component consists of a 5 × 2 nested cross-validation with an inner optimisation loop performed using a randomised search procedure.
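
A rough sketch of a 5 × 2 nested cross-validation with a randomised-search inner loop in scikit-learn; the classifier, preprocessing options and parameter distributions below are illustrative assumptions rather than the paper's exact setup:

# Sketch: 5 x 2 nested cross-validation. The outer loop is 5 repetitions of 2-fold
# CV; the inner loop is a randomised search over a text-classification pipeline.
# Work-order texts, labels and parameter ranges are placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV, RepeatedKFold, cross_val_score

texts = ["pump seal leaking", "motor bearing noise", "valve stuck open"] * 40
labels = np.array([0, 1, 2] * 40)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
param_distributions = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],   # higher-order n-grams on/off
    "tfidf__stop_words": [None, "english"],   # stop-word removal on/off
    "clf__alpha": np.logspace(-3, 1, 20),
}
inner = RandomizedSearchCV(pipeline, param_distributions, n_iter=10, cv=2, random_state=0)
outer = RepeatedKFold(n_splits=2, n_repeats=5, random_state=0)   # the 5 x 2 outer loop
scores = cross_val_score(inner, texts, labels, cv=outer)
print(scores.mean(), scores.std())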

Findings

From the literature review, the aspects most affected by short document length are identified as the feature representation scheme, higher-order n-grams, document length normalisation, stemming, stop-word removal and algorithm selection. However, from the experimental analysis, the selection of preprocessing transforms seemed more dependent on the particular algorithm than on short document length. Multinomial Naïve Bayes performs marginally better than the other algorithms, but overall, the performances of the optimised models are comparable.

Originality/value

This work highlights the importance of model optimisation, including the selection of preprocessing transforms. Not only did the optimisation improve the performance of all the algorithms substantially, but it also affected model comparisons, with multinomial Naïve Bayes going from the worst- to the best-performing algorithm.

Details

Journal of Quality in Maintenance Engineering, vol. 29 no. 3
Type: Research Article
ISSN: 1355-2511

Article
Publication date: 26 September 2022

Christian Nnaemeka Egwim, Hafiz Alaka, Oluwapelumi Oluwaseun Egunjobi, Alvaro Gomes and Iosif Mporas

Abstract

Purpose

This study aims to compare and evaluate commonly used machine learning (ML) algorithms for developing models that assess the energy efficiency of buildings.

Design/methodology/approach

This study first combined building energy efficiency ratings from several data sources and used them to create predictive models with a variety of ML methods. Second, to test the hypothesis regarding ensemble techniques, the study designed a hybrid stacking ensemble approach based on the best-performing bagging and boosting ensemble methods identified in its predictive analysis.
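
A minimal sketch of a stacking ensemble combining a bagging-style learner with a boosting learner in scikit-learn; the specific estimators, meta-learner and regression framing are assumptions rather than the study's exact composition:

# Sketch: stack an extra-trees (bagging-style) learner and a gradient-boosting
# learner under a ridge meta-learner. Building features and ratings are placeholders.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X = np.random.rand(500, 10)   # placeholder building attributes
y = np.random.rand(500)       # placeholder energy efficiency ratings

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("extra_trees", ExtraTreesRegressor(n_estimators=200, random_state=0)),
        ("boosting", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=Ridge(),
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))   # R^2 on the held-out set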

Findings

Based on performance evaluation metric scores, the extra trees model was shown to be the best predictive model. More importantly, this study demonstrated that the cumulative result of ensemble ML algorithms is generally better in terms of prediction accuracy than that of a single method. Finally, stacking was found to be a superior ensemble approach to bagging and boosting for analysing building energy efficiency.

Research limitations/implications

While the proposed method of analysis is assumed to be applicable for assessing the energy efficiency of buildings within the sector, the data transformation used in this study may not, as is typical of any data-driven model, be transferable to data from regions other than the UK.

Practical implications

This study aids in the initial selection of appropriate and high-performing ML algorithms for future analysis. This study also assists building managers, residents, government agencies and other stakeholders in better understanding contributing factors and making better decisions about building energy performance. Furthermore, this study will assist the general public in proactively identifying buildings with high energy demands, potentially lowering energy costs by promoting avoidance behaviour and assisting government agencies in making informed decisions about energy tariffs when this novel model is integrated into an energy monitoring system.

Originality/value

This study fills a gap concerning the rationale for selecting appropriate ML algorithms for assessing building energy efficiency. More importantly, it demonstrated that the cumulative result of ensemble ML algorithms is generally better in terms of prediction accuracy than that of a single method.

Details

Journal of Engineering, Design and Technology, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 1726-0531

Article
Publication date: 23 September 2022

Hossein Sohrabi and Esmatullah Noorzai

Abstract

Purpose

The present study aims to develop a risk-supported case-based reasoning (RS-CBR) approach for water-related projects by incorporating various uncertainties and risks in the revision step.

Design/methodology/approach

The cases were extracted by studying 68 water-related projects. This research employs earned value management (EVM) factors to consider time and cost features; economic, natural, technical and project risks to account for uncertainties; and supervised learning models to estimate cost overrun. Time-series algorithms were also used to predict construction cost indexes (CCI) and to model improvements in future forecasts. Outliers were removed during pre-processing. Next, the datasets were split into training and testing sets, and the algorithms were implemented. The accuracy of the different models was measured with the mean absolute percentage error (MAPE) and the normalized root mean square error (NRMSE) criteria.
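
For reference, a small sketch of the two accuracy criteria mentioned (MAPE and NRMSE) on a hypothetical train/test split; the feature set, the SVR configuration and the range-based normalisation of the RMSE are assumptions:

# Sketch: split placeholder data, fit a sigmoid-kernel SVR (the single model the
# findings highlight) and compute MAPE and a range-normalised RMSE.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

X = np.random.rand(300, 8)           # placeholder EVM and risk features
y = 100 + 20 * np.random.rand(300)   # placeholder cost overrun values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = SVR(kernel="sigmoid")
model.fit(X_train, y_train)
pred = model.predict(X_test)

mape = mean_absolute_percentage_error(y_test, pred)
nrmse = np.sqrt(mean_squared_error(y_test, pred)) / (y_test.max() - y_test.min())
print(mape, nrmse)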

Findings

The findings show an improvement in the accuracy of predictions using datasets that consider uncertainties, and ensemble algorithms such as Random Forest and AdaBoost had higher accuracy. Also, among the single algorithms, the support vector regressor (SVR) with the sigmoid kernel outperformed the others.

Originality/value

This research is the first attempt to develop a case-based reasoning model based on various risks and uncertainties. The developed model shows a satisfactory overlap with machine learning models in predicting cost overruns. The model has been implemented on the collected water-related projects, and the results have been reported.

Details

Engineering, Construction and Architectural Management, vol. 31 no. 2
Type: Research Article
ISSN: 0969-9988

Article
Publication date: 24 July 2020

Lafaiet Silva, Nádia Félix Silva and Thierson Rosa

Abstract

Purpose

This study aims to analyze Kickstarter data along with social media data from a data mining perspective. Kickstarter is a crowdfunding platform, a form of fundraising that is increasingly being adopted to make projects viable. Despite its importance and growing adoption, the success rate of crowdfunding campaigns was 47% in 2017 and has decreased over the years. One way to increase the chances of campaign success would be to predict, using machine learning techniques, whether a campaign will be successful. By applying classification models, it is possible to estimate whether or not a campaign will achieve success, and by applying regression models, the authors can forecast the amount of money to be raised.

Design/methodology/approach

The authors propose a solution in two phases, namely, launching and campaigning, resulting in models better suited to each point in a campaign's life cycle.
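
A hedged sketch of such a two-model setup, pairing a classifier for campaign success with a regressor for the amount raised (the features, the random-forest choice and the data are placeholders, not the authors' models):

# Sketch: one classifier for success and one regressor for the funded amount,
# trained on placeholder campaign and social media features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error

X = np.random.rand(1000, 15)               # placeholder campaign/social features
success = np.random.randint(0, 2, 1000)    # placeholder success labels
pledged = np.random.rand(1000) * 10_000    # placeholder funded amounts

X_tr, X_te, s_tr, s_te, p_tr, p_te = train_test_split(X, success, pledged, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, s_tr)
reg = RandomForestRegressor(random_state=0).fit(X_tr, p_tr)

print("accuracy:", accuracy_score(s_te, clf.predict(X_te)))
print("rmse:", np.sqrt(mean_squared_error(p_te, reg.predict(X_te))))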

Findings

The authors produced a static predictor capable of classifying campaigns with an accuracy of 71%. The regression method for phase one achieved a root mean squared error of 6.45. The dynamic classifier achieved 85% accuracy within the first 10% of the campaign duration, the equivalent of 3 days for a 30-day campaign. At that same point in time, it achieved a forecasting performance of 2.5 root mean squared error.

Originality/value

The authors carried out this research and present results on a set of real data from a crowdfunding platform. The results are discussed in relation to the existing literature. This provides a comprehensive review, detailing important research directions for advancing this field.

Details

International Journal of Web Information Systems, vol. 16 no. 4
Type: Research Article
ISSN: 1744-0084

Book part
Publication date: 18 January 2023

Steven J. Hyde, Eric Bachura and Joseph S. Harrison

Abstract

Machine learning (ML) has recently gained momentum as a method for measurement in strategy research. Yet, little guidance exists regarding how to appropriately apply the method for this purpose in our discipline. We address this by offering a guide to the application of ML in strategy research, with a particular emphasis on data handling practices that should improve our ability to accurately measure our constructs of interest using ML techniques. We offer a brief overview of ML methodologies that can be used for measurement before describing key challenges that exist when applying those methods for this purpose in strategy research (i.e., sample sizes, data noise, and construct complexity). We then outline a theory-driven approach to help scholars overcome these challenges and improve data handling and the subsequent application of ML techniques in strategy research. We demonstrate the efficacy of our approach by applying it to create a linguistic measure of CEOs' motivational needs in a sample of S&P 500 firms. We conclude by describing steps scholars can take after creating ML-based measures to continue to improve the application of ML in strategy research.

Article
Publication date: 2 May 2023

Aliakbar Marandi, Misagh Tasavori and Manoochehr Najmi

Abstract

Purpose

This study aims to use big data analysis to shed light on key hotel features that play a role in customers' revisit intention. In addition, it endeavors to highlight hotel features for different customer segments.

Design/methodology/approach

This study uses a machine learning method to analyze around 100,000 reviews by customers of 100 selected hotels around the world who had indicated on Trip Advisor their intention to return to a particular hotel. The important hotel features are then extracted in terms of the 7Ps of the marketing mix. The study then segments customers intending to revisit hotels based on the similarities in their reviews.
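
A hedged sketch of this kind of review mining, counting word frequencies to surface candidate hotel features and clustering reviews into customer segments (reviews, cluster count and vectoriser settings are placeholders; the study's 7Ps mapping is not reproduced here):

# Sketch: word-frequency view of the reviews plus a k-means segmentation of
# reviewers by review similarity. All inputs are placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.cluster import KMeans

reviews = [
    "great room and friendly staff, will return",
    "good food but hard to access the hotel",
    "clean room, excellent breakfast, coming back",
] * 50

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(reviews)
totals = np.asarray(counts.sum(axis=0)).ravel()
top_terms = np.array(vec.get_feature_names_out())[totals.argsort()[::-1][:10]]
print(top_terms)   # most frequently repeated candidate feature words

tfidf = TfidfVectorizer(stop_words="english").fit_transform(reviews)
segments = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(tfidf)
print(segments[:10])   # cluster label per review, i.e. a customer segment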

Findings

In total, 71 important hotel features are extracted using text analysis of comments. The most important features are the room, staff, food and accessibility. Also, customers are segmented into 15 groups, and key hotel features important for each segment are highlighted.

Research limitations/implications

In this research, word repetition counts were used to identify key hotel features; sentence-based analysis or analysis of groups of adjacent words could also be used.

Practical implications

This study highlights key hotel features that are crucial for customers’ revisit intention and identifies related market segments that can support managers in better designing their strategies and allocating their resources.

Originality/value

By using text mining analysis, this study identifies and classifies important hotel features that are crucial for the revisit intention of customers based on the 7Ps. Methodologically, the authors suggest a comprehensive method to describe the revisit intention of hotel customers based on customer reviews.

Details

International Journal of Contemporary Hospitality Management, vol. 36 no. 1
Type: Research Article
ISSN: 0959-6119

Article
Publication date: 5 April 2021

Seungpeel Lee, Honggeun Ji, Jina Kim and Eunil Park

Abstract

Purpose

With the rapid increase in internet use, most people tend to purchase books through online stores. Several such stores also provide book recommendations for buyer convenience, and both collaborative and content-based filtering approaches have been widely used for building these recommendation systems. However, both approaches have significant limitations, including cold start and data sparsity. To overcome these limitations, this study aims to investigate whether user satisfaction can be predicted based on easily accessible book descriptions.

Design/methodology/approach

The authors collected a large-scale Kindle Books data set containing book descriptions and ratings, and predicted whether a specific book would receive a high rating. For this purpose, several feature representation methods (bag-of-words, term frequency–inverse document frequency [TF-IDF] and Word2vec) and machine learning classifiers (logistic regression, random forest, naive Bayes and support vector machine) were used.
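
A minimal sketch of the best-performing combination reported in the findings, TF-IDF features feeding a random forest classifier (the data, split and hyperparameters are placeholders):

# Sketch: predict whether a book receives a high rating from its description
# using a TF-IDF + random forest pipeline. Descriptions and labels are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

descriptions = ["a gripping thriller with a twist", "a gentle introduction to gardening"] * 100
high_rating = [1, 0] * 100   # placeholder labels: 1 = high rating

X_train, X_test, y_train, y_test = train_test_split(descriptions, high_rating, random_state=0)

model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))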

Findings

The classifiers used show substantial accuracy in predicting reader satisfaction. Among them, the random forest classifier combined with the TF-IDF feature representation method exhibited the highest accuracy at 96.09%.

Originality/value

This study revealed that user satisfaction can be predicted based on book descriptions and shed light on the limitations of existing recommendation systems. Further, both practical and theoretical implications have been discussed.

Details

The Electronic Library, vol. 39 no. 1
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 19 February 2021

Zulkifli Halim, Shuhaida Mohamed Shuhidan and Zuraidah Mohd Sanusi

Abstract

Purpose

In previous studies of financial distress prediction, deep learning techniques performed better than traditional techniques on time-series data. This study investigates the performance of the deep learning models recurrent neural network, long short-term memory and gated recurrent unit for financial distress prediction among Malaysian public listed corporations on time-series data. It also compares the performance of logistic regression, support vector machine, neural network, decision tree and the deep learning models on single-year data.

Design/methodology/approach

The data used are the financial data of public listed companies in Malaysia classified as PN17 (distressed) and non-PN17 (not distressed). This study was conducted using machine learning libraries of the Python programming language.
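
A hedged sketch of an LSTM classifier of the kind described, written with the Keras API (the abstract names only Python machine learning libraries, so the library choice, layer sizes and input shape here are assumptions):

# Sketch: a small LSTM binary classifier over multi-year financial ratios.
# Swap layers.LSTM for layers.GRU or layers.SimpleRNN to mirror the other models.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_companies, n_years, n_features = 500, 5, 12
X = np.random.rand(n_companies, n_years, n_features)   # placeholder time-series ratios
y = np.random.randint(0, 2, n_companies)               # 1 = PN17 (distressed), 0 = non-PN17

model = keras.Sequential([
    layers.Input(shape=(n_years, n_features)),
    layers.LSTM(32),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", keras.metrics.Precision(), keras.metrics.Recall()])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)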

Findings

The findings indicate that all deep learning models used in this study achieved 90% accuracy and above, with long short-term memory (LSTM) and gated recurrent unit (GRU) reaching 93%. In addition, the deep learning models consistently performed well compared with the other models on single-year data: LSTM and GRU achieved 90% accuracy and the recurrent neural network (RNN) 88%. LSTM and GRU also achieved better precision and recall than RNN. The findings show that the deep learning approach leads to better performance in financial distress prediction studies. In addition, time-series data should be highlighted in financial distress prediction studies since they have a big impact on credit risk assessment.

Research limitations/implications

The first limitation of this study is that hyperparameter tuning was applied only to the deep learning models. Second, the time-series data were used only for the deep learning models, since the other models fit optimally on single-year data.

Practical implications

This study recommends deep learning as a new approach that leads to better performance in financial distress prediction studies. In addition, time-series data should be highlighted in financial distress prediction studies since they have a big impact on the assessment of credit risk.

Originality/value

To the best of the authors' knowledge, this article is the first study to use the gated recurrent unit for financial distress prediction based on time-series data for Malaysian public listed companies. The findings can help financial institutions and investors find a better and more accurate approach to credit risk assessment.

Details

Business Process Management Journal, vol. 27 no. 4
Type: Research Article
ISSN: 1463-7154
