Search results
Mariam Elhussein and Samiha Brahimi
Abstract
Purpose
This paper proposes a novel way of using textual clustering as a feature selection method. The method is applied to identify the most important keywords for profile classification and is demonstrated on the problem of sick-leave promoters on Twitter.
Design/methodology/approach
Four machine learning classifiers were used on a total of 35,578 tweets posted on Twitter. The data were manually labeled into two categories: promoter and nonpromoter. Classification performance was compared when the proposed clustering feature selection approach and the standard feature selection were applied.
Findings
Random forest achieved the highest accuracy, 95.91%, which exceeds the accuracy reported in comparable work. Furthermore, using clustering as a feature selection method improved the sensitivity of the model from 73.83% to 98.79%. Sensitivity (recall) is the most important measure of classifier performance when detecting promoters’ accounts that have spam-like behavior.
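To make the sensitivity (recall) measure concrete, the following is a minimal sketch of how it is computed from true and predicted labels. The function name and the toy labels are illustrative assumptions; they are not the study's data or code.

```python
def sensitivity(y_true, y_pred, positive=1):
    """Return recall: true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("y_true contains no positive examples")
    return tp / (tp + fn)


# Hypothetical labels: 1 = promoter, 0 = nonpromoter.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(sensitivity(y_true, y_pred))  # 3 of 4 promoters found -> 0.75
```

A high sensitivity means few promoter accounts are missed, which is why it matters more than raw accuracy in this setting.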
Research limitations/implications
The method applied is novel; more testing on other datasets is needed before its results can be generalized.
Practical implications
The model applied can be used by Saudi authorities to report on the accounts that sell sick-leaves online.
Originality/value
The research proposes a new way of using textual clustering for feature selection.
Khalid Iqbal and Muhammad Shehrayar Khan
Abstract
Purpose
In this digital era, email is the most pervasive form of communication between people. Many users become victims of spam emails, and their data are exposed.
Design/methodology/approach
Researchers have addressed this problem by focusing on advanced machine learning algorithms and improved models for detecting spam emails, but a gap remains with respect to features. Features also play an important role in achieving good results. To evaluate the performance of the applied classifiers, 10-fold cross-validation is used.
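As an illustration of the 10-fold cross-validation mentioned above, the sketch below splits sample indices into k folds so that each sample is used for testing exactly once. The function name is an assumption, and the sketch omits shuffling and stratification, which are commonly applied in practice.

```python
def k_fold_indices(n_samples, k=10):
    """Return a list of (train_indices, test_indices) pairs for k-fold CV."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(25, k=10)
print(len(folds))                         # 10
print([len(test) for _, test in folds])   # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```

A classifier would be trained on each `train` part and scored on the matching `test` part, with the ten scores averaged.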
Findings
The results confirm that spam emails are correctly classified with an accuracy of 98.00% for the support vector machine and 98.06% for the artificial neural network, higher than for the other machine learning classifiers applied.
Originality/value
In this paper, point-biserial correlation is computed between each feature and the class label of the University of California Irvine (UCI) Spambase email dataset to select the best features. Extensive experiments are conducted on the selected features by training different classifiers.
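The point-biserial correlation between a continuous feature and a binary label can be sketched as follows, using the standard formula r = (M1 - M0) / s * sqrt(n1 * n0 / n^2) with the population standard deviation s. The function names and the toy data are assumptions for illustration; the paper's selection threshold and implementation are not given here.

```python
import math


def point_biserial(values, labels):
    """Point-biserial correlation between numeric values and 0/1 labels."""
    group1 = [v for v, l in zip(values, labels) if l == 1]
    group0 = [v for v, l in zip(values, labels) if l == 0]
    n1, n0, n = len(group1), len(group0), len(values)
    if n1 == 0 or n0 == 0:
        raise ValueError("both classes must be present")
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population std
    if std == 0:
        return 0.0  # a constant feature carries no information about the label
    m1, m0 = sum(group1) / n1, sum(group0) / n0
    return (m1 - m0) / std * math.sqrt(n1 * n0 / n ** 2)


def rank_features(columns, labels):
    """Return feature indices ordered by decreasing absolute correlation."""
    scores = [abs(point_biserial(col, labels)) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)


# Hypothetical features: column 0 is constant, column 1 tracks the label.
labels = [0, 0, 1, 1]
columns = [[7, 7, 7, 7], [1, 2, 3, 4]]
print(rank_features(columns, labels))  # [1, 0]
```

Features at the top of the ranking would be kept and the rest discarded before training the classifiers.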
Jan Svanberg, Tohid Ardeshiri, Isak Samsten, Peter Öhman, Presha E. Neidermeyer, Tarek Rana, Frank Maisano and Mats Danielson
Abstract
Purpose
The purpose of this study is to develop a method to assess social performance. Traditionally, environmental, social and governance (ESG) rating providers use subjectively weighted arithmetic averages to combine a set of social performance (SP) indicators into one single rating. To overcome this weighting problem, this study investigates the preconditions for a new methodology for rating the SP component of the ESG by applying machine learning (ML) and artificial intelligence (AI) anchored to social controversies.
Design/methodology/approach
This study proposes the use of a data-driven rating methodology that derives the relative importance of SP features from their contribution to the prediction of social controversies. The authors use the proposed methodology to solve the weighting problem with overall ESG ratings and further investigate whether prediction is possible.
Findings
The authors find that ML models are able to predict controversies with high predictive performance and validity. The findings indicate that the weighting problem with the ESG ratings can be addressed with a data-driven approach. The decisive prerequisite, however, for the proposed rating methodology is that social controversies are predicted by a broad set of SP indicators. The results also suggest that predictively valid ratings can be developed with this ML-based AI method.
Practical implications
This study offers practical solutions to ESG rating problems that have implications for investors, ESG raters and socially responsible investments.
Social implications
The proposed ML-based AI method can help to achieve better ESG ratings, which will in turn help to improve SP, which has implications for organizations and societies through sustainable development.
Originality/value
To the best of the authors’ knowledge, this research is one of the first studies that offers a unique method to address the ESG rating problem and improve sustainability by focusing on SP indicators.
Loris Nanni and Sheryl Brahnam
Abstract
Purpose
Automatic DNA-binding protein (DNA-BP) classification is now an essential proteomic technology. Unfortunately, many systems reported in the literature are tested on only one or two datasets/tasks. The purpose of this study is to create an optimal, general-purpose system for DNA-BP classification, one that performs competitively across several DNA-BP classification tasks.
Design/methodology/approach
Efficient DNA-BP classifier systems require the discovery of powerful protein representations and feature extraction methods. Experiments were performed that combined and compared descriptors extracted from state-of-the-art matrix/image protein representations. These descriptors were trained on separate support vector machines (SVMs) and evaluated. Convolutional neural networks with different parameter settings were fine-tuned on two matrix representations of proteins. Decisions were fused with the SVMs using the weighted sum rule and evaluated to experimentally derive the most powerful general-purpose DNA-BP classifier system.
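The weighted sum rule used to fuse classifier decisions can be sketched as below. The scores, weights and decision threshold are hypothetical, and the sketch assumes that all classifiers output scores on a common scale; the paper's actual weights and normalization are not stated in this abstract.

```python
def weighted_sum_fusion(score_lists, weights):
    """Fuse per-classifier scores: fused[i] = sum_k weights[k] * scores_k[i]."""
    if len(score_lists) != len(weights):
        raise ValueError("one weight is required per classifier")
    n = len(score_lists[0])
    fused = [0.0] * n
    for scores, weight in zip(score_lists, weights):
        if len(scores) != n:
            raise ValueError("all classifiers must score the same samples")
        for i, score in enumerate(scores):
            fused[i] += weight * score
    return fused


# Hypothetical scores for three proteins from an SVM and a CNN.
svm_scores = [0.9, 0.2, 0.6]
cnn_scores = [0.7, 0.4, 0.3]
fused = weighted_sum_fusion([svm_scores, cnn_scores], weights=[0.5, 0.5])
decisions = [score > 0.5 for score in fused]  # assumed threshold of 0.5
print(decisions)  # [True, False, False]
```

Raising the weight of one classifier lets its scores dominate the fused decision, which is how an ensemble can be tuned toward its stronger members.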
Findings
The best ensemble proposed here produced comparable, if not superior, classification results on a broad and fair comparison with the literature across four different datasets representing a variety of DNA-BP classification tasks, thereby demonstrating both the power and generalizability of the proposed system.
Originality/value
Most DNA-BP methods proposed in the literature are only validated on one (rarely two) datasets/tasks. In this work, the authors report the performance of their general-purpose DNA-BP system on four datasets representing different DNA-BP classification tasks. The excellent results of the proposed best classifier system demonstrate the power of the proposed approach. These results can now be used for baseline comparisons by other researchers in the field.
Abstract
Purpose
On-ramp merging areas are typical bottlenecks in the freeway network because merging on-ramp vehicles may cause intensive disturbances to the mainline traffic flow and lead to various negative impacts on traffic efficiency and safety. Connected and autonomous vehicles (CAVs), with their capabilities of real-time communication and precise motion control, hold great potential to facilitate ramp merging operations through enhanced coordination strategies. This paper aims to present a comprehensive review of the existing ramp merging strategies leveraging CAVs, focusing on the latest trends and developments in the research field.
Design/methodology/approach
The review comprehensively covers 44 papers recently published in leading transportation journals. Based on the application context, control strategies are grouped into three categories: merging into single-lane freeways with CAVs only, merging into single-lane freeways with mixed traffic flows and merging into multilane freeways.
Findings
Relevant literature is reviewed regarding the required technologies, control decision level, applied methods and impacts on traffic performance. More importantly, the authors identify the existing research gaps and provide insightful discussions on the potential and promising directions for future research based on the review, which facilitates further advancement in this research topic.
Originality/value
Many strategies based on the communication and automation capabilities of CAVs have been developed over the past decades, devoted to facilitating the merging/lane-changing maneuvers at freeway on-ramps. Despite the significant progress made, to the best of the authors’ knowledge an up-to-date review covering these latest developments is missing. This paper conducts a thorough review of the cooperation/coordination strategies that facilitate freeway on-ramp merging using CAVs, focusing on the latest developments in this field. Based on the review, the authors identify the existing research gaps in CAV ramp merging and discuss the potential and promising future research directions to address the gaps.
Martin Nečaský, Petr Škoda, David Bernhauer, Jakub Klímek and Tomáš Skopal
Abstract
Purpose
Semantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemes, shared attributes, vocabulary, structure and semantics. The existing dataset catalogs provide basic search functionality relying on keyword search in brief, incomplete or misleading textual metadata attached to the datasets. The search results are thus often insufficient. However, there exist many ways of improving the dataset discovery by employing content-based retrieval, machine learning tools, third-party (external) knowledge bases, countless feature extraction methods and description models and so forth.
Design/methodology/approach
In this paper, the authors propose a modular framework for rapid experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible catalog of components prepared to form custom pipelines for dataset representation and discovery.
Findings
The study proposes several proof-of-concept pipelines including experimental evaluation, which showcase the usage of the framework.
Originality/value
To the best of the authors’ knowledge, there is no similar formal framework for experimentation with various similarity methods in the context of dataset discovery. The framework has the ambition to establish a platform for reproducible and comparable research in the area of dataset discovery. The prototype implementation of the framework is available on GitHub.
Lisa M. Young and Swapnil Rajendra Gavade
Abstract
Purpose
The purpose of this paper is to use the data analysis method of sentiment analysis to improve the understanding of a large data set of employee comments from an annual employee job satisfaction survey of a US hospitality organization.
Design/methodology/approach
Sentiment analysis is used to examine the employee comments by identifying meaningful patterns, frequently used words and emotions. The sentiment analysis process, run in the statistical computing language R, scans each employee survey comment, compares its words with a predefined word dictionary and classifies the comment into the appropriate emotion category.
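The dictionary-based step described above can be sketched as a simple lookup. The study used R; this sketch uses Python for illustration only, and the tiny lexicon, the function name and the tie-breaking rule are all assumptions rather than the study's dictionary or code.

```python
import re
from collections import Counter

# Hypothetical word-to-emotion dictionary; a real lexicon is far larger.
EMOTION_LEXICON = {
    "happy": "joy", "great": "joy",
    "angry": "anger", "unfair": "anger",
    "sad": "sadness", "tired": "sadness",
}


def classify_comment(comment, lexicon=EMOTION_LEXICON):
    """Return the emotion whose lexicon words occur most often in the comment."""
    words = re.findall(r"[a-záéíóúñü']+", comment.lower())
    counts = Counter(lexicon[word] for word in words if word in lexicon)
    if not counts:
        return "none"  # no dictionary word matched
    return counts.most_common(1)[0][0]


print(classify_comment("I am sad and tired, but my team is great"))  # sadness
print(classify_comment("The schedule is posted"))                    # none
```

Comments in Spanish would need a Spanish lexicon; the lookup logic itself stays the same.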
Findings
Employee responses written in English and in Spanish are compared with significant differences identified between the two groups, triggering further investigation of the Spanish comments. Sentiment analysis was then conducted on the Spanish comments comparing two groups, front-of-house vs back-of-house employees and employees with male supervisors vs female supervisors. Results from the analysis of employee comments written in Spanish point to higher scores for job sadness and anger. The negative comments referred to desires for improved healthcare, requests for increased wages and frustration with difficult supervisor relationships. The findings from this study add to the growing body of literature that has begun to focus on the unique work experiences of Latino employees in the USA.
Originality/value
This is the first study to examine a large unstructured English and Spanish text database from a hospitality organization’s employee job satisfaction surveys using sentiment analysis. Applying this big data analytics process to advance new insights into the human capital aspects of hospitality management is intriguing to many researchers. The results of this study demonstrate an issue that needs to be further investigated particularly considering the hospitality industry’s employee demographics.
Min Wang, Shuguang Li, Lei Zhu and Jin Yao
Abstract
Purpose
Analysis of characteristic driving operations can help develop supports for drivers with different driving skills. However, the existing knowledge on analysis of driving skills focuses only on single driving operations and cannot reflect differences in proficiency in coordinating driving operations. Thus, the purpose of this paper is to analyze driving skills from coordinated driving operations. There are two main contributions: the first involves a method for feature extraction based on AdaBoost, which selects features critical for coordinating operations of experienced drivers and inexperienced drivers, and the second involves a generating method for candidate features, called the combined features method, through which two or more different driving operations at the same location are combined into a candidate combined feature. A series of experiments based on a driving simulator and a specific course with several different curves were carried out, and the results indicated the feasibility of analyzing driving behavior through AdaBoost and the combined features method.
Design/methodology/approach
AdaBoost was used to extract features and the combined features method was used to combine two or more different driving operations at the same location.
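One common way to use AdaBoost for feature selection is to boost single-feature decision stumps and record which feature each round picks. The sketch below follows that generic scheme; it is an assumption about the general technique, not the authors' procedure, and the toy data (feature 1 separates the two classes, feature 0 does not) are hypothetical.

```python
import math


def train_adaboost_stumps(X, y, rounds=3):
    """Discrete AdaBoost with one-feature threshold stumps; labels are +1/-1.

    Returns a list of (feature_index, threshold, sign, alpha) tuples.
    """
    n, d = len(X), len(X[0])
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        for j in range(d):
            for thr in sorted({row[j] for row in X}):
                for sign in (1, -1):
                    preds = [sign if row[j] > thr else -sign for row in X]
                    err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign, preds)
        err, j, thr, sign, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by 0
        alpha = 0.5 * math.log((1 - err) / err)
        # Increase the weight of misclassified samples, then renormalize.
        weights = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
        model.append((j, thr, sign, alpha))
    return model


def predict(model, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * (sign if row[j] > thr else -sign) for j, thr, sign, alpha in model)
    return 1 if score >= 0 else -1


# Hypothetical data: +1 = experienced driver, -1 = inexperienced driver.
X = [[5, 1], [1, 2], [4, 3], [2, 4], [3, 5], [6, 6]]
y = [-1, -1, -1, 1, 1, 1]
model = train_adaboost_stumps(X, y)
print(model[0][0])  # index of the first selected feature: 1
```

The features chosen across rounds, weighted by their `alpha` values, indicate which inputs matter most for separating the two groups.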
Findings
A series of experiments based on a driving simulator and a specific course with several different curves were carried out, and the results indicated the feasibility of analyzing driving behavior through AdaBoost and the combined features method.
Originality/value
There are two main contributions: the first involves a method for feature extraction based on AdaBoost, which selects features critical for coordinating operations of experienced drivers and inexperienced drivers, and the second involves a generating method for candidate features, called the combined features method, through which two or more different driving operations at the same location are combined into a candidate combined feature.
Afreen Khan, Swaleha Zubair and Samreen Khan
Abstract
Purpose
This study aimed to assess the potential of the Clinical Dementia Rating (CDR) Scale in the prognosis of dementia in elderly subjects.
Design/methodology/approach
Staging dementia severity is an essential clinical task, so the authors applied machine learning (ML) to magnetic resonance imaging (MRI) features to identify and study the impact of various MR readings on the classification of demented and nondemented patients. The authors used cross-sectional MRI data in this study. The designed ML approach established the role of CDR in the prognosis of afflicted and normal patients. Moreover, the pattern analysis indicated CDR as a strong cohort among the various attributes, with CDR having a significant value of p < 0.01. The authors employed 20 ML classifiers.
Findings
The mean prediction accuracy varied with the ML classifier used, with the bagging classifier (random forest as a base estimator) achieving the highest (93.67%). A series of ML analyses demonstrated that the model including the CDR score had better prediction accuracy and other related performance metrics.
Originality/value
The results suggest that the CDR score, a simple clinical measure, can be used in real community settings. It can be used to predict dementia progression with ML modeling.
Abdullah Alharbi, Wajdi Alhakami, Sami Bourouis, Fatma Najar and Nizar Bouguila
Abstract
We propose in this paper a novel reliable detection method to recognize forged inpainting images. Detecting potential forgeries and authenticating the content of digital images is extremely challenging and important for many applications. The proposed approach involves developing new probabilistic support vector machine (SVM) kernels from a flexible generative statistical model named “bounded generalized Gaussian mixture model”. The developed learning framework has the advantage of properly combining the benefits of both discriminative and generative models and of including prior knowledge about the nature of the data. It can effectively recognize whether an image has been tampered with and identify both forged and authentic images. The obtained results confirmed that the developed framework performs well on numerous inpainted images.