Search results
1 – 10 of 185
Harleen Kaur and Vinita Kumari
Abstract
Diabetes is a major metabolic disorder that can adversely affect the entire body system. Undiagnosed diabetes can increase the risk of cardiac stroke, diabetic nephropathy and other disorders. Millions of people worldwide are affected by this disease, and early detection of diabetes is very important to maintaining a healthy life. The disease is a cause of global concern as cases of diabetes are rising rapidly. Machine learning (ML) is a computational method for learning automatically from experience and improving performance to make more accurate predictions. In the current research we applied machine learning techniques to the Pima Indian diabetes dataset to develop trends and detect patterns with risk factors using the R data manipulation tool. To classify patients into diabetic and non-diabetic, we developed and analyzed five different predictive models in R, using the supervised machine learning algorithms linear kernel support vector machine (SVM-linear), radial basis function (RBF) kernel support vector machine, k-nearest neighbour (k-NN), artificial neural network (ANN) and multifactor dimensionality reduction (MDR).
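The paper builds its five models in R; as a rough illustration of how one of them (k-NN) classifies patients as diabetic or non-diabetic, here is a minimal Python/NumPy sketch on synthetic data (not the real Pima dataset; the two-cluster data and k=5 are assumptions for the example):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Classify each test row by majority vote among its k nearest
    training rows under Euclidean distance."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(dists)[:k]]
        # Majority vote over the k nearest labels (0 = non-diabetic, 1 = diabetic)
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

# Tiny synthetic stand-in for the Pima features (hypothetical data)
rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(50, 4))   # "non-diabetic" cluster
X1 = rng.normal(3.0, 1.0, size=(50, 4))   # "diabetic" cluster
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 50 + [1] * 50)

print(knn_predict(X_train, y_train, np.array([[0, 0, 0, 0], [3, 3, 3, 3]])))
```

The same fit/predict pattern applies to the SVM and ANN models, only with a different decision rule.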
Noura AlNuaimi, Mohammad Mehedy Masud, Mohamed Adel Serhani and Nazar Zaki
Abstract
Organizations in many domains generate a considerable amount of heterogeneous data every day. Such data can be processed to enhance these organizations’ decisions in real time. However, storing and processing large and varied datasets (known as big data) in real time is challenging. In machine learning, streaming feature selection has long been considered a superior technique for selecting the relevant subset of features from highly dimensional data and thus reducing learning complexity. In the relevant literature, streaming feature selection refers to features that arrive consecutively over time: the total number of features is not known in advance, while the number of instances is fixed. Many scholars in the field have proposed streaming-feature-selection algorithms in attempts to find a proper solution to this problem. This paper presents an exhaustive and methodical introduction to these techniques. It reviews the traditional feature-selection algorithms and then scrutinizes the current algorithms that use streaming feature selection, to determine their strengths and weaknesses. The survey also sheds light on ongoing challenges in big-data research.
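As a toy illustration of the streaming setting the survey describes, the sketch below (Python; the threshold and data are assumptions for the example) accepts or discards each candidate feature as it arrives, based on its correlation with the label. Real streaming-feature-selection algorithms additionally test redundancy against the already-selected features:

```python
import numpy as np

def stream_select(label, feature_stream, threshold=0.3):
    """Toy online relevance filter: as each candidate feature arrives,
    keep it only if its absolute Pearson correlation with the label
    exceeds a fixed threshold. No redundancy check is performed here."""
    selected = []
    for name, values in feature_stream:
        r = np.corrcoef(values, label)[0, 1]
        if abs(r) > threshold:
            selected.append(name)
    return selected

rng = np.random.default_rng(1)
y = rng.normal(size=200)
stream = [
    ("relevant", y + 0.1 * rng.normal(size=200)),   # strongly correlated
    ("noise", rng.normal(size=200)),                # unrelated
]
print(stream_select(y, stream))
```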
Qinxu Ding, Ding Ding, Yue Wang, Chong Guan and Bosheng Ding
Abstract
Purpose
The rapid rise of large language models (LLMs) has propelled them to the forefront of applications in natural language processing (NLP). This paper aims to present a comprehensive examination of the research landscape in LLMs, providing an overview of the prevailing themes and topics within this dynamic domain.
Design/methodology/approach
Drawing from an extensive corpus of 198 records published between 1996 and 2023, retrieved from a relevant academic database and encompassing journal articles, books, book chapters, conference papers and selected working papers, this study delves deep into the multifaceted world of LLM research. The authors employed the BERTopic algorithm, a recent advancement in topic modeling, to conduct a comprehensive analysis of the data after it had been meticulously cleaned and preprocessed. BERTopic leverages transformer-based language models such as bidirectional encoder representations from transformers (BERT) to generate more meaningful and coherent topics. This approach facilitates the identification of hidden patterns within the data, enabling the authors to uncover valuable insights that might otherwise have remained obscure.
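BERTopic itself pairs transformer embeddings with dimensionality reduction and density-based clustering; the deliberately simplified sketch below substitutes bag-of-words counts and a single greedy cosine-similarity pass purely to illustrate how documents can be grouped into topic clusters (toy documents, not the study's corpus):

```python
import numpy as np

def count_vectors(docs):
    """Bag-of-words term counts over a shared vocabulary (a stand-in
    for the transformer embeddings BERTopic would use)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            X[i, index[w]] += 1
    return X

def greedy_cluster(X, threshold=0.2):
    """Assign each document to the first cluster whose representative
    (its first member) is cosine-similar enough; otherwise start a
    new cluster."""
    reps, labels = [], []
    for x in X:
        sims = [x @ r / (np.linalg.norm(x) * np.linalg.norm(r)) for r in reps]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            reps.append(x)
            labels.append(len(reps) - 1)
    return labels

docs = [
    "language models for nlp tasks",
    "nlp language understanding",
    "clinical notes in medical records",
    "medical diagnosis from clinical text",
]
print(greedy_cluster(count_vectors(docs)))
```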
Findings
The analysis revealed four distinct clusters of topics in LLM research: “language and NLP”, “education and teaching”, “clinical and medical applications” and “speech and recognition techniques”. Each cluster embodies a unique aspect of LLM application and showcases the breadth of possibilities that LLM technology has to offer. In addition to presenting the research findings, this paper identifies key challenges and opportunities in the realm of LLMs. It underscores the necessity for further investigation in specific areas, including the paramount importance of addressing potential biases, transparency and explainability, data privacy and security, and responsible deployment of LLM technology.
Practical implications
This classification offers practical guidance for researchers, developers, educators, and policymakers to focus efforts and resources. The study underscores the importance of addressing challenges in LLMs, including potential biases, transparency, data privacy, and responsible deployment. Policymakers can utilize this information to shape regulations, while developers can tailor technology development based on the diverse applications identified. The findings also emphasize the need for interdisciplinary collaboration and highlight ethical considerations, providing a roadmap for navigating the complex landscape of LLM research and applications.
Originality/value
This study stands out as the first to examine the evolution of LLMs across such a long time frame and across such diversified disciplines. It provides a unique perspective on the key areas of LLM research, highlighting the breadth and depth of LLM’s evolution.
Mamdouh Abdel Alim Saad Mowafy and Walaa Mohamed Elaraby Mohamed Shallan
Abstract
Purpose
Heart diseases have become one of the leading causes of death among Egyptians. With 500 deaths per 100,000 occurring annually in Egypt, it has been noticed that medical data face a high-dimensionality problem that leads to a decrease in the classification accuracy of heart data. The purpose of this study is therefore to improve the classification accuracy of heart disease data, helping doctors diagnose heart disease efficiently, by using a hybrid classification technique.
Design/methodology/approach
This paper used a new approach based on integrating the dimensionality reduction techniques multiple correspondence analysis (MCA) and principal component analysis (PCA) with fuzzy c-means (FCM), and then with both multilayer perceptron (MLP) and radial basis function networks (RBFN), to separate patients into different categories based on their diagnosis results. A comparative study of performance was carried out across six structures, namely MLP, RBFN, MLP via FCM–MCA, MLP via FCM–PCA, RBFN via FCM–MCA and RBFN via FCM–PCA, to reach the best classifier.
Findings
The results show that the MLP via FCM–MCA classifier structure has the highest classification accuracy and the best performance, superior to the other methods, and that smoking was the factor most strongly associated with heart disease.
Originality/value
This paper shows the importance of integrating statistical methods in increasing the classification accuracy of heart disease data.
Nicola Castellano, Roberto Del Gobbo and Lorenzo Leto
Abstract
Purpose
The concept of productivity is central to performance management and decision-making, although it is complex and multifaceted. This paper aims to describe a methodology based on the use of Big Data in a cluster analysis combined with a data envelopment analysis (DEA) that provides accurate and reliable productivity measures in a large network of retailers.
Design/methodology/approach
The methodology is described using a case study of a leading kitchen furniture producer. More specifically, Big Data is used in a two-step analysis prior to the DEA to automatically cluster a large number of retailers into groups that are homogeneous in terms of structural and environmental factors and assess a within-the-group level of productivity of the retailers.
Findings
The proposed methodology helps reduce the heterogeneity among the units analysed, which is a major concern in DEA applications. The data-driven factorial and clustering technique allows for maximum within-group homogeneity and between-group heterogeneity by reducing subjective bias and dimensionality, which is embedded with the use of Big Data.
Practical implications
The use of Big Data in clustering applied to productivity analysis can provide managers with data-driven information about the structural and socio-economic characteristics of retailers' catchment areas, which is important in establishing potential productivity performance and optimizing resource allocation. The improved productivity indexes enable the setting of targets that are coherent with retailers' potential, which increases motivation and commitment.
Originality/value
This article proposes an innovative technique to enhance the accuracy of productivity measures through the use of Big Data clustering and DEA. To the best of the authors’ knowledge, no attempts have been made to benefit from the use of Big Data in the literature on retail store productivity.
Abstract
Purpose
Since the beginning of 2020, economies have faced many changes as a result of the coronavirus disease 2019 (COVID-19) pandemic. The effect of COVID-19 on the Egyptian Exchange (EGX) is investigated in this research.
Design/methodology/approach
To explore the impact of COVID-19, three periods were considered: (1) 17 months before the spread of COVID-19 and the start of the lockdown, (2) 17 months after the spread of COVID-19 and during the lockdown and (3) 34 months comprehending the whole period (before and during COVID-19). Due to the large number of variables that could be considered, a dimensionality reduction method, principal component analysis (PCA), is followed. This method helps in determining the individual stocks that contribute most to the main EGX index (EGX 30). PCA also addresses the multicollinearity between the variables under investigation. Additionally, a principal component regression (PCR) model is developed to predict the future behavior of the EGX 30.
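A minimal NumPy sketch of the PCA-plus-PCR pipeline described above, run on synthetic correlated series rather than EGX data (the factor structure and noise levels are assumptions for the example):

```python
import numpy as np

def pca(X, k):
    """Project centred data onto the top-k principal components and
    report the fraction of variance they explain."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]
    components = eigvec[:, order[:k]]
    explained = eigval[order[:k]].sum() / eigval.sum()
    return Xc @ components, explained

def pcr_fit(X, y, k):
    """Principal component regression: ordinary least squares on the
    top-k principal component scores (with an intercept)."""
    Z, _ = pca(X, k)
    Z1 = np.column_stack([np.ones(len(Z)), Z])
    beta, *_ = np.linalg.lstsq(Z1, y, rcond=None)
    return beta

# Synthetic stand-in for correlated stock series (not EGX data):
# eight series driven by one common factor plus small noise
rng = np.random.default_rng(2)
common = rng.normal(size=(300, 1))
X = common @ rng.normal(size=(1, 8)) + 0.1 * rng.normal(size=(300, 8))
y = X @ rng.normal(size=8) + 0.05 * rng.normal(size=300)

scores, explained = pca(X, 3)
print(round(explained, 2))  # a few PCs capture most of the variability
beta = pcr_fit(X, y, 3)
```

Regressing on the PC scores rather than the raw series is also what removes the multicollinearity, since the scores are mutually uncorrelated by construction.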
Findings
The results demonstrate that the first three principal components (PCs) explain 89%, 85% and 88% of data variability (1) before COVID-19, (2) during COVID-19 and (3) over the whole period, respectively. Furthermore, the food and beverage, basic resources and real estate sectors were not affected by COVID-19. The resulting PCR model performs very well, as can be concluded by comparing the observed values of EGX 30 with the predicted ones (R-squared estimated as 0.99).
Originality/value
To the best of our knowledge, no research has been conducted to investigate the effect of COVID-19 on the EGX following an unsupervised machine learning method.
Abstract
This paper provides a further investigation into the application of Correspondence Analysis (CA) as outlined by Greenacre (1984, 1993), which is one technique for “quantifying qualitative data” in research on learning and teaching. It also builds on the utilisation of CA in the development of the emerging discipline of English as an International Language provided by Hassall and Ganesh (1996, 1999). This is accomplished by considering its application to the analysis of attitudinal data that positions the developing pedagogy of Teaching English as an International Language (TEIL) (see Hassall, 1996a & ff.) within the more established discipline of World Englishes (cf. Kachru, 1985, 1990). The multidimensional statistical technique Correspondence Analysis is used to provide an assessment of the interdependence of the rows and columns of a data matrix (primarily, a two-way contingency table). In this case, attitudinal data, produced at a number of international workshops which focused on the development of a justifiable pedagogy for Teaching English as an International Language (TEIL), are examined to provide a more complete picture of how these venues differed from each other with respect to the collective responses of the respondents. CA facilitates dimensionality reduction and provides graphical displays in low-dimensional spaces. In other words, it converts the rows and columns of a data matrix or contingency table into a series of points on a graph. The current study presents analyses of two different interpretations of this data.
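The SVD-based computation behind correspondence analysis can be sketched as follows (Python/NumPy; the venue-by-response contingency table is hypothetical, not the workshop data analysed in the paper):

```python
import numpy as np

def correspondence_analysis(table, k=2):
    """Correspondence analysis of a two-way contingency table via the
    SVD of the standardized residuals, yielding principal row and
    column coordinates in a k-dimensional space."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U[:, :k] * sing[:k]) / np.sqrt(r)[:, None]
    col_coords = (Vt.T[:, :k] * sing[:k]) / np.sqrt(c)[:, None]
    inertia = (sing ** 2).sum()           # total inertia (chi-square / n)
    return row_coords, col_coords, inertia

# Hypothetical venue-by-response counts (three venues, three responses)
table = [[20, 5, 3],
         [4, 18, 6],
         [2, 7, 21]]
rows, cols, inertia = correspondence_analysis(table)
print(rows.shape, cols.shape)
```

Plotting `rows` and `cols` in the same two-dimensional space gives the low-dimensional graphical display the paper refers to, with nearby row and column points indicating association.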
Abstract
Purpose
This paper proposes a multi-facet sentiment analysis system.
Design/methodology/approach
This paper uses multidomain resources to build a sentiment analysis system. Manual lexicon-based features extracted from these resources are fed into a machine learning classifier so that their performance can be compared afterward. The manual lexicon is then replaced with a custom bag-of-words (BOW) to avoid its time-consuming construction. To help the system run faster and make the model interpretable, feature reduction is performed using different existing and custom approaches such as term occurrence, information gain, principal component analysis, semantic clustering and POS-tagging filters.
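One of the listed feature-reduction criteria, information gain, can be sketched as follows (Python; the term-presence indicators and sentiment labels are hypothetical):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label vector, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    """Reduction in label entropy from splitting on a discrete feature
    (e.g. presence/absence of a term in a document)."""
    total = entropy(labels)
    cond = 0.0
    for v in np.unique(feature):
        mask = feature == v
        cond += mask.mean() * entropy(labels[mask])
    return total - cond

# Hypothetical term-presence indicators vs. binary sentiment labels
sentiment = np.array([1, 1, 1, 0, 0, 0])
term_good = np.array([1, 1, 1, 0, 0, 0])   # perfectly predictive term
term_the  = np.array([1, 0, 1, 0, 1, 0])   # weakly informative term
print(information_gain(term_good, sentiment),
      information_gain(term_the, sentiment))
```

Ranking terms by this score and keeping only the top ones shrinks the BOW, which is what makes the model both faster and more interpretable.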
Findings
The proposed system, featuring automated lexicon extraction and feature-size optimization, proved its efficiency when applied to multidomain and benchmark datasets, reaching 93.59% accuracy, which makes it competitive with state-of-the-art systems.
Originality/value
The originality lies in the construction of a custom BOW and in optimizing features through existing and custom feature-selection and clustering approaches.
Bothaina A. Al-Sheeb, A.M. Hamouda and Galal M. Abdella
Abstract
Purpose
The retention and success of engineering undergraduates are of increasing concern for higher-education institutions. The study of success determinants is an initial step in any remedial initiative targeted at enhancing student success and preventing premature withdrawals. This study provides a comprehensive approach to the prediction of student academic performance through the lens of the knowledge, attitudes and behavioral skills (KAB) model. The purpose of this paper is to improve the modeling accuracy of students’ performance by introducing two methodologies based on variable selection and dimensionality reduction.
Design/methodology/approach
The performance of the proposed methodologies was evaluated using a real data set of ten critical-to-success factors on both attitude and skill-related behaviors of 320 first-year students. The study used two models. In the first model, exploratory factor analysis is used. The second model uses regression model selection. Ridge regression is used as a second step in each model. The efficiency of each model is discussed in the Results section of this paper.
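Ridge regression, the second step in both models, has a closed form. A minimal NumPy sketch on synthetic data (not the study's survey data; the coefficient values are assumptions for the example) also shows why results are sensitive to the penalization parameter, which shrinks every coefficient toward zero:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: beta = (X'X + lam*I)^(-1) X'y.
    Larger lam means stronger shrinkage of the coefficients."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical standardized survey scores (ten KAB factors) for 320
# students, with only the first three factors actually predictive
rng = np.random.default_rng(3)
X = rng.normal(size=(320, 10))
true_beta = np.zeros(10)
true_beta[:3] = [0.5, -0.3, 0.2]
y = X @ true_beta + 0.1 * rng.normal(size=320)

beta = ridge_fit(X, y, lam=1.0)
print(np.round(beta[:3], 2))  # close to the predictive coefficients
```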
Findings
The two methods were powerful in providing small mean-squared errors and hence, in improving the prediction of student performance. The results show that the quality of both methods is sensitive to the size of the reduced model and to the magnitude of the penalization parameter.
Research limitations/implications
First, the survey could have been conducted in two parts; students needed more time than expected to complete it. Second, if the study is to be carried out for second-year students, grades of general engineering courses can be included in the model for better estimation of students’ grade point averages. Third, the study only applies to first-year and second-year students because factors covered are those that are essential for students’ survival through the first few years of study.
Practical implications
The study proposes that vulnerable students could be identified as early as possible in the academic year. These students could be encouraged to engage more in their learning process. Carrying out such measurement at the beginning of the college year can provide professionals and college administration with valuable insight into students’ perception of their own skills and attitudes toward engineering.
Originality/value
This study employs the KAB model as a comprehensive approach to the study of success predictors and implements two new methodologies to improve the prediction accuracy of student success.
Xuan Ji, Jiachen Wang and Zhijun Yan
Abstract
Purpose
Stock price prediction is a hot topic, and traditional prediction methods are usually based on statistical and econometric models. However, these models have difficulty dealing with nonstationary time-series data. With the rapid development of the internet and the increasing popularity of social media, online news and comments often reflect investors’ emotions and attitudes toward stocks, and thus contain a lot of important information for predicting stock prices. This paper aims to develop a stock price prediction method that takes full advantage of social media data.
Design/methodology/approach
This study proposes a new prediction method based on deep learning technology, which integrates traditional stock financial index variables and social media text features as inputs to the prediction model. The study uses Doc2Vec to build long-text feature vectors from social media and then reduces their dimensionality with a stacked auto-encoder to balance the dimensions between the text feature variables and the stock financial index variables. Meanwhile, based on the wavelet transform, the time series of stock prices is decomposed to eliminate the random noise caused by stock market fluctuation. Finally, the study uses a long short-term memory (LSTM) model to predict the stock price.
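The wavelet-denoising step can be illustrated with a one-level Haar transform and hard thresholding (a simplified NumPy sketch on a hypothetical price series; the paper does not specify the wavelet family or the number of decomposition levels, so real work would use a wavelet library with a multi-level transform):

```python
import numpy as np

def haar_denoise(x, threshold):
    """One-level Haar wavelet decomposition of an even-length series:
    split into a smooth approximation band and a detail band, zero out
    small detail coefficients (treated as noise), then reconstruct."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2)          # approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2)          # detail coefficients
    d = np.where(np.abs(d) < threshold, 0.0, d)   # hard thresholding
    y = np.empty_like(x)
    y[0::2] = (a + d) / np.sqrt(2)
    y[1::2] = (a - d) / np.sqrt(2)
    return y

# Hypothetical noisy price series: linear trend plus small noise
rng = np.random.default_rng(4)
t = np.arange(64)
price = 100 + 0.5 * t + rng.normal(0, 0.2, size=64)
smoothed = haar_denoise(price, threshold=0.5)
print(smoothed.shape)
```

With `threshold=0` the reconstruction is exact, which is a handy sanity check that the transform and its inverse match.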
Findings
The experimental results show that the method performs better than all three benchmark models on every evaluation indicator and can effectively predict stock prices.
Originality/value
This study proposes a new stock price prediction model, based on deep learning technology, that incorporates traditional financial features together with text features derived from social media.