Search results: 1–10 of 227

Sathyavikasini Kalimuthu and Vijaya Vijayakumar
Abstract
Purpose
Diagnosing genetic neuromuscular disorders such as muscular dystrophy is complicated when the defect arises during splicing. This paper aims to predict the type of muscular dystrophy from gene sequences by extracting well-defined descriptors related to splicing mutations. An automatic model is built to classify the disease through pattern recognition techniques coded in Python using the scikit-learn framework.
Design/methodology/approach
In this paper, cloned gene sequences are synthesized based on the mutation position and its location on the chromosome, using the positional cloning approach. For instance, in the Human Gene Mutation Database (HGMD), the mutational information for a splicing mutation is specified as IVS1-5 T > G, indicating (IVS - intervening sequence, or intron) the first intron, five nucleotides before the consensus intron acceptor site AG, where the nucleotide G is altered to T. IVS (+ve) denotes the forward strand, 3′ - positive numbers counted from the G of the invariant donor site, and IVS (−ve) denotes the backward strand, 5′ - negative numbers counted from the G of the acceptor site. The key idea in this paper is to identify discriminative descriptors from diseased gene sequences based on splicing variants and to provide an effective machine learning solution for predicting the type of muscular dystrophy disease with splicing mutations. Multi-class classification is worked out through data modeling of the gene sequences. Synthetic mutational gene sequences are created, as diseased gene sequences are not readily obtainable for this intricate disease; the positional cloning approach supports generating disease gene sequences based on mutational information acquired from HGMD. SNP-, gene- and exon-based discriminative features are identified and used to train the model. A muscular dystrophy disease prediction model is built using supervised learning techniques in the scikit-learn environment. The data frame is built with the extracted features as a NumPy array. The data are normalized by transforming the feature values into the range between 0 and 1, which aids in scaling the input attributes for a model. Naïve Bayes, decision tree, K-nearest neighbor and SVM models are developed using the Python library framework in scikit-learn.
Findings
To the best of the authors' knowledge, this is the first pattern recognition model to classify muscular dystrophy disease pertaining to splicing mutations. Essential SNP-, gene- and exon-based descriptors related to splicing mutations are proposed and extracted from the cloned gene sequences. The model is built using statistical learning techniques through scikit-learn in the Anaconda framework. This paper also discusses the results of statistical learning carried out on the same set of gene sequences with synonymous and non-synonymous mutational descriptors.
Research limitations/implications
The data frame is built as a NumPy array. Normalizing the data by transforming the feature values into the range between 0 and 1 aids in scaling the input attributes for a model. Naïve Bayes, decision tree, K-nearest neighbor and SVM models are developed using the Python library framework in scikit-learn. While training the SVM model, the cost, gamma and kernel parameters are tuned to attain good results. Scoring parameters of the classifiers are evaluated with tenfold cross-validation using the metric functions of the scikit-learn library. Results of the disease identification model based on non-synonymous, synonymous and splicing mutations were analyzed.
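The SVM tuning step described here (cost, gamma and kernel searched under tenfold cross-validation) has a standard scikit-learn shape. The sketch below uses a synthetic dataset and an illustrative parameter grid; the actual grid values used in the study are not stated in the abstract.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the gene-descriptor data
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Illustrative search space over the cost (C), gamma and kernel parameters
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.1],
    "kernel": ["rbf", "linear"],
}
# Tenfold cross-validation, as in the abstract
search = GridSearchCV(SVC(), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
best_params = search.best_params_
```

`search.best_score_` then gives the cross-validated accuracy of the tuned SVM, comparable against the other classifiers' scores.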
Practical implications
Essential SNP-, gene- and exon-based descriptors related to splicing mutations are proposed and extracted from the cloned gene sequences. The model is built using statistical learning techniques through scikit-learn in the Anaconda framework. The performance of the classifiers is improved by using different estimators from the scikit-learn library. Several types of mutations, such as missense, nonsense and silent mutations, are also considered to build models through statistical learning, and their results are analyzed.
Originality/value
To the best of the authors' knowledge, this is the first pattern recognition model to classify muscular dystrophy disease pertaining to splicing mutations.
Gauri Rajendra Virkar and Supriya Sunil Shinde
Abstract
Predictive analytics is the science of decision-making: it takes the guesswork out of the decision-making process and applies proven scientific procedures to find the right solutions. Predictive analytics provides insight into the occurrence of future downtimes and rejections, thereby aiding in taking preventive actions before abnormalities occur. Considering these advantages, predictive analytics is adopted in diverse fields such as health care, finance, education, marketing and automotive. Predictive analytics tools can be used to predict various behaviors and patterns, thereby saving users' time and money. Many open-source predictive analytics tools, namely R, scikit-learn, Konstanz Information Miner (KNIME), Orange, RapidMiner and Waikato Environment for Knowledge Analysis (WEKA), are freely available. This chapter aims to reveal the most accurate tools and techniques for the classification task that aid in decision-making. Our experimental results show that no specific tool provides the best results in all scenarios; rather, it depends upon the dataset and the classifier.
Francina Malan and Johannes Lodewyk Jooste
Abstract
Purpose
The purpose of this paper is to compare the effectiveness of the various text mining techniques that can be used to classify maintenance work-order records into their respective failure modes, focussing on the choice of algorithm and preprocessing transforms. Three algorithms are evaluated, namely Bernoulli Naïve Bayes, multinomial Naïve Bayes and support vector machines.
Design/methodology/approach
The paper has both a theoretical and an experimental component. In the literature review, the various algorithms and preprocessing techniques used in text classification are considered from three perspectives: the domain-specific maintenance literature, the broader short-form literature and the general text classification literature. The experimental component consists of a 5 × 2 nested cross-validation with an inner optimisation loop performed using a randomised search procedure.
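The 5 × 2 nested cross-validation described here can be sketched with scikit-learn: two outer folds repeated five times, with a randomised search fitted inside each outer training fold. The data, estimator and parameter distribution below are illustrative placeholders, not the paper's setup.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import (RandomizedSearchCV, RepeatedStratifiedKFold,
                                     cross_val_score)
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Inner loop: randomised search over a hyperparameter distribution
inner = RandomizedSearchCV(SVC(), {"C": loguniform(1e-2, 1e2)},
                           n_iter=5, cv=2, random_state=0)

# Outer loop: the 5x2 scheme (2 folds, repeated 5 times)
outer = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)
nested_scores = cross_val_score(inner, X, y, cv=outer)
```

The ten entries of `nested_scores` estimate generalisation performance with the hyperparameter optimisation kept inside each fold, avoiding the optimistic bias of tuning on the whole dataset.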
Findings
From the literature review, the aspects most affected by short document length are identified as the feature representation scheme, higher-order n-grams, document length normalisation, stemming, stop-word removal and algorithm selection. However, from the experimental analysis, the selection of preprocessing transforms seemed more dependent on the particular algorithm than on short document length. Multinomial Naïve Bayes performs marginally better than the other algorithms, but overall, the performances of the optimised models are comparable.
Originality/value
This work highlights the importance of model optimisation, including the selection of preprocessing transforms. Not only did the optimisation improve the performance of all the algorithms substantially, but it also affected model comparisons, with multinomial Naïve Bayes going from the worst- to the best-performing algorithm.
Christian Nnaemeka Egwim, Hafiz Alaka, Oluwapelumi Oluwaseun Egunjobi, Alvaro Gomes and Iosif Mporas
Abstract
Purpose
This study aims to compare and evaluate commonly used machine learning (ML) algorithms applied to develop models for assessing the energy efficiency of buildings.
Design/methodology/approach
This study first combined building energy efficiency ratings from several data sources and used them to create predictive models using a variety of ML methods. Second, to test the hypothesis that ensemble techniques improve performance, this study designed a hybrid stacking ensemble approach based on the best-performing bagging and boosting ensemble methods generated from its predictive analytics.
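A stacking ensemble of the kind described here can be sketched in scikit-learn: a bagging-style and a boosting-style base model are combined under a meta-learner. The base estimators, meta-learner and synthetic data below are illustrative placeholders; the abstract does not name the study's exact models.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the combined energy-efficiency dataset
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stack a bagging-style and a boosting-style model under a linear meta-learner
stack = StackingRegressor(
    estimators=[("bagging", RandomForestRegressor(random_state=0)),
                ("boosting", GradientBoostingRegressor(random_state=0))],
    final_estimator=RidgeCV(),
)
stack.fit(X_tr, y_tr)
r2 = stack.score(X_te, y_te)
```

The meta-learner is fitted on out-of-fold predictions of the base models, which is what lets stacking outperform either base ensemble alone.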
Findings
Based on performance evaluation metric scores, the extra trees model was shown to be the best predictive model. More importantly, this study demonstrated that the cumulative result of ensemble ML algorithms is usually better in terms of predictive accuracy than a single method. Finally, it was discovered that stacking is superior to bagging and boosting as an ensemble approach for analysing building energy efficiency.
Research limitations/implications
While the proposed contemporary method of analysis is assumed to be applicable in assessing the energy efficiency of buildings within the sector, the unique data transformation used in this study may not, as is typical of any data-driven model, be transferable to data from regions other than the UK.
Practical implications
This study aids in the initial selection of appropriate and high-performing ML algorithms for future analysis. This study also assists building managers, residents, government agencies and other stakeholders in better understanding contributing factors and making better decisions about building energy performance. Furthermore, this study will assist the general public in proactively identifying buildings with high energy demands, potentially lowering energy costs by promoting avoidance behaviour and assisting government agencies in making informed decisions about energy tariffs when this novel model is integrated into an energy monitoring system.
Originality/value
This study fills a gap in the lack of a rationale for selecting appropriate ML algorithms for assessing building energy efficiency. More importantly, this study demonstrated that the cumulative result of ensemble ML algorithms is usually better in terms of predictive accuracy than a single method.
Hossein Sohrabi and Esmatullah Noorzai
Abstract
Purpose
The present study aims to develop a risk-supported case-based reasoning (RS-CBR) approach for water-related projects by incorporating various uncertainties and risks in the revision step.
Design/methodology/approach
The cases were extracted by studying 68 water-related projects. This research employs earned value management (EVM) factors to consider time and cost features; economic, natural, technical and project risks to account for uncertainties; and supervised learning models to estimate cost overrun. Time-series algorithms were also used to predict construction cost indexes (CCI) and model improvements in future forecasts. Outliers were deleted in the pre-processing step. Next, datasets were split into testing and training sets, and the algorithms were implemented. The accuracy of the different models was measured with the mean absolute percentage error (MAPE) and the normalized root mean square error (NRMSE) criteria.
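The two accuracy criteria named above are straightforward to compute. In the sketch below, MAPE comes directly from scikit-learn, while NRMSE is the root mean squared error normalised by the range of the observed values (one common convention; the paper's exact normalisation is not stated in the abstract). The numbers are toy values.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

# Toy actual vs. predicted project costs
y_true = np.array([100.0, 120.0, 150.0, 90.0])
y_pred = np.array([110.0, 115.0, 140.0, 100.0])

# MAPE: mean of |error| / |actual|
mape = mean_absolute_percentage_error(y_true, y_pred)

# NRMSE: RMSE divided by the range of the observed values
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
nrmse = rmse / (y_true.max() - y_true.min())
```

Because MAPE is scale-free and NRMSE is normalised, both allow models to be compared across projects of very different cost magnitudes.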
Findings
The findings show an improvement in the accuracy of predictions using datasets that consider uncertainties, and ensemble algorithms such as Random Forest and AdaBoost had higher accuracy. Also, among the single algorithms, the support vector regressor (SVR) with the sigmoid kernel outperformed the others.
Originality/value
This research is the first attempt to develop a case-based reasoning model based on various risks and uncertainties. The developed model shows a promising overlap with machine learning models in predicting cost overruns. The model has been applied to the collected water-related projects and the results have been reported.
Lafaiet Silva, Nádia Félix Silva and Thierson Rosa
Abstract
Purpose
This study aims to analyze Kickstarter data along with social media data from a data mining perspective. Kickstarter is a crowdfunding platform and an increasingly adopted form of fundraising for achieving the viability of projects. Despite its importance and growth in adoption, the success rate of crowdfunding campaigns was 47% in 2017 and has decreased over the years. One way of increasing the chances of success would be to predict, using machine learning techniques, whether a campaign will be successful. By applying classification models, it is possible to estimate whether or not a campaign will achieve success, and by applying regression models, the authors can forecast the amount of money to be funded.
Design/methodology/approach
The authors propose a solution in two phases, namely launching and campaigning, resulting in models better suited to each point in a campaign's life cycle.
Findings
The authors produced a static predictor capable of classifying campaigns with an accuracy of 71%. The regression method for phase one achieved a root mean squared error of 6.45. The dynamic classifier achieved 85% accuracy within the first 10% of the campaign duration, the equivalent of three days for a 30-day campaign. At the same point in time, the forecasting model achieved a root mean squared error of 2.5.
Originality/value
The authors carried out this research with a set of real data from a crowdfunding platform, and the results are discussed with respect to the existing literature. This provides a comprehensive review, detailing important research directions for advancing this field.
Steven J. Hyde, Eric Bachura and Joseph S. Harrison
Abstract
Machine learning (ML) has recently gained momentum as a method for measurement in strategy research. Yet, little guidance exists regarding how to appropriately apply the method for this purpose in our discipline. We address this by offering a guide to the application of ML in strategy research, with a particular emphasis on data handling practices that should improve our ability to accurately measure our constructs of interest using ML techniques. We offer a brief overview of ML methodologies that can be used for measurement before describing key challenges that exist when applying those methods for this purpose in strategy research (i.e., sample sizes, data noise, and construct complexity). We then outline a theory-driven approach to help scholars overcome these challenges and improve data handling and the subsequent application of ML techniques in strategy research. We demonstrate the efficacy of our approach by applying it to create a linguistic measure of CEOs' motivational needs in a sample of S&P 500 firms. We conclude by describing steps scholars can take after creating ML-based measures to continue to improve the application of ML in strategy research.
Aliakbar Marandi, Misagh Tasavori and Manoochehr Najmi
Abstract
Purpose
This study aims to use big data analysis to shed light on key hotel features that play a role in the revisit intention of customers. In addition, this study endeavors to highlight hotel features for different customer segments.
Design/methodology/approach
This study uses a machine learning method to analyze around 100,000 reviews posted on Trip Advisor by customers of 100 selected hotels around the world in which they indicated an intention to return to a particular hotel. The important features of the hotels are then extracted in terms of the 7Ps of the marketing mix. The study then segments customers intending to revisit hotels based on the similarities in their reviews.
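Segmenting reviewers by the similarity of their review texts, as described here, is commonly done by vectorising the text and clustering. The sketch below is an illustrative stand-in, not the authors' method: the toy reviews and the cluster count are placeholders (the study reports 15 segments over roughly 100,000 reviews).

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy hotel reviews standing in for the Trip Advisor corpus
reviews = [
    "great room and friendly staff, will return",
    "the food was excellent and the location central",
    "spacious room, clean and quiet, would stay again",
    "breakfast buffet and restaurant food were superb",
]

# Vectorise the text, then cluster reviewers by review similarity
X = TfidfVectorizer(stop_words="english").fit_transform(reviews)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
segments = kmeans.labels_
```

Inspecting the highest-weighted terms per cluster would then surface the hotel features that matter to each segment, analogous to the 7Ps mapping in the study.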
Findings
In total, 71 important hotel features are extracted using text analysis of comments. The most important features are the room, staff, food and accessibility. Also, customers are segmented into 15 groups, and key hotel features important for each segment are highlighted.
Research limitations/implications
In this research, word-frequency counts were used to identify key hotel features, whereas sentence-based analysis or analysis of groups of adjacent words could be used instead.
Practical implications
This study highlights key hotel features that are crucial for customers’ revisit intention and identifies related market segments that can support managers in better designing their strategies and allocating their resources.
Originality/value
By using text mining analysis, this study identifies and classifies important hotel features that are crucial for the revisit intention of customers based on the 7Ps. Methodologically, the authors suggest a comprehensive method to describe the revisit intention of hotel customers based on customer reviews.
Seungpeel Lee, Honggeun Ji, Jina Kim and Eunil Park
Abstract
Purpose
With the rapid increase in internet use, most people tend to purchase books through online stores. Several such stores also provide book recommendations for buyer convenience, and both collaborative and content-based filtering approaches have been widely used for building these recommendation systems. However, both approaches have significant limitations, including cold start and data sparsity. To overcome these limitations, this study aims to investigate whether user satisfaction can be predicted based on easily accessible book descriptions.
Design/methodology/approach
The authors collected a large-scale Kindle Books data set containing book descriptions and ratings, and predicted whether a specific book will receive a high rating. For this purpose, several feature representation methods (bag-of-words, term frequency–inverse document frequency [TF-IDF] and Word2vec) and machine learning classifiers (logistic regression, random forest, naive Bayes and support vector machine) were used.
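One feature-representation/classifier pairing from this design (TF-IDF with a random forest) can be sketched as a scikit-learn pipeline. The toy descriptions and labels below are placeholders, not the Kindle data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy book descriptions with placeholder satisfaction labels
descriptions = [
    "a gripping thriller full of twists",
    "dry and repetitive, hard to finish",
    "a heartwarming story with memorable characters",
    "poorly edited and confusing plot",
]
satisfied = [1, 0, 1, 0]  # 1 = high rating, 0 = low rating

# TF-IDF features feeding a random forest classifier
clf = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
clf.fit(descriptions, satisfied)
pred = clf.predict(["memorable characters and a gripping plot"])
```

Swapping `TfidfVectorizer` for `CountVectorizer` (bag-of-words) or a Word2vec embedding step, and the forest for the other classifiers, reproduces the grid of combinations the study compares.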
Findings
The used classifiers show substantial accuracy in predicting reader satisfaction. Among them, the random forest classifier combined with the TF-IDF feature representation method exhibited the highest accuracy at 96.09%.
Originality/value
This study revealed that user satisfaction can be predicted based on book descriptions and shed light on the limitations of existing recommendation systems. Further, both practical and theoretical implications have been discussed.
Zulkifli Halim, Shuhaida Mohamed Shuhidan and Zuraidah Mohd Sanusi
Abstract
Purpose
In previous studies of financial distress prediction, deep learning techniques performed better than traditional techniques on time-series data. This study investigates the performance of deep learning models (recurrent neural network, long short-term memory and gated recurrent unit) for financial distress prediction among Malaysian public listed corporations over time-series data. This study also compares the performance of logistic regression, support vector machine, neural network, decision tree and the deep learning models on single-year data.
Design/methodology/approach
The data used are the financial data of public listed companies in Malaysia classified as PN17 status (distressed) and non-PN17 (not distressed). The study was conducted using machine learning libraries of the Python programming language.
Findings
The findings indicate that all deep learning models used in this study achieved 90% accuracy and above, with long short-term memory (LSTM) and gated recurrent unit (GRU) models reaching 93%. In addition, the deep learning models consistently performed well compared to the other models on single-year data: LSTM and GRU reached 90% accuracy and the recurrent neural network (RNN) 88%. The results also show that LSTM and GRU achieve better precision and recall than RNN. The findings indicate that the deep learning approach leads to better performance in financial distress prediction studies. In addition, time-series data should be highlighted in any financial distress prediction study, since they have a big impact on credit risk assessment.
Research limitations/implications
The first limitation of this study is that hyperparameter tuning was only applied to the deep learning models. Second, time-series data are only used for the deep learning models, since the other models fit optimally on single-year data.
Practical implications
This study recommends deep learning as a new approach that will lead to better performance in financial distress prediction studies. Besides that, time-series data should be highlighted in any financial distress prediction study, since the data have a big impact on the assessment of credit risk.
Originality/value
To the best of the authors' knowledge, this article is the first study to use the gated recurrent unit for financial distress prediction based on time-series data for Malaysian public listed companies. The findings can help financial institutions and investors find a better and more accurate approach to credit risk assessment.