Stock market prediction by applying big data mining

Purpose – There is a gap in knowledge about the Gulf Cooperation Council (GCC) because most studies are undertaken in countries outside the Gulf region, such as China, India, the US and Taiwan. The stock market contains rich, valuable and considerable data, and these data need careful analysis so that good decisions can be made that increase the efficiency of a business. Data mining techniques offer data processing tools and applications that support decision-makers. This study aims to predict the Kuwait stock market by applying big data mining. Design/methodology/approach – The methodology is quantitative, using mathematical and statistical models that describe a wide array of relationships among variables. Four techniques were implemented to predict the direction of stock market returns: logistic regression, decision trees, support vector machine and random forest. Findings – All variables are statistically significant at the 5% level except gold price and oil price. Also, the variables that do not have an influence on the direction of the rate of return of Boursa Kuwait are money supply and gold price, unlike the Kuwait index, which has the highest coefficient. Furthermore, the variable with the highest importance score for the direction of the rate of return is the firms variable, and the overall accuracy of the four models is nearly 50%. Research limitations/implications – Some of the limitations identified for this study are as follows: (1) location limitation: Kuwait Stock Exchange; (2) time limitation: the amount of time available to accomplish the study, which was completed within the academic years 2019-2020 and 2020-2021.
During 2020, the coronavirus pandemic (COVID-19), which was a major obstacle, occurred during data collection and analysis; (3) data limitation: The Kuwait Stock Exchange data were collected from May 2019 to March 2020, while the factors affecting the stock exchange data were collected in July 2020 due to the corona pandemic. Originality/value – The study used new titles, variables and techniques such as using data mining to predict the Kuwait stock market. There are no adequate studies that predict the stock market by data mining in the GCC, especially in Kuwait. There is a gap in knowledge in the GCC as most studies are in foreign countries, such as China, India, the US and Taiwan.


Introduction
Computers are used in all aspects of daily transactions, and, as a result, a large amount of data is generated. The volume of data is expected to continue to grow, which creates the need to analyze large amounts of data. Big data analysis is a data mining technique that can be used in many sectors: economic, industrial and commercial.
Data mining can be defined as preparing, visualizing and exploring massive databases (Parr & Vaudrevange, 2020), whereas the techniques for discovering patterns from these databases are based on the knowledge to be mined, such as descriptive, estimation, prediction, classification, clustering and association (Alsultanny, 2013).
The COVID-19 pandemic has inflicted heavy human and economic losses and disrupted social and health systems around the world. Understanding and counteracting the pandemic requires recognizing its properties and attributes by collecting and analyzing relevant big data. Consequently, big data analytics tools are an essential requirement for those needing to make decisions and establish precautionary measures (Almuhaideb, 2021). Also, Tariq et al. (2022) emphasized that software automation plays an important role in e-health and generally improves health care services for individuals by bringing efficiency to the systems. Moreover, many countries around the world are trying to flatten the curve of the epidemic with the help of smartphone apps. Because of this, it is very important to carefully carry out strategies to protect data privacy when utilizing big data. Improvements in technology can offer several benefits, but they can also pose a risk of breaching privacy. Governments and companies in many industries use big data as a basis for automating processing and extracting important insights to aid decision-making. While big data has been confirmed as being useful in analysis and prediction, it is important to implement security procedures for maintaining confidential data on big data systems (Haafza et al., 2021; Rafiq et al., 2022).
The researcher found that the effective research tool for the analysis of big data is data mining, which is also known as data analytics and predictive analytics (Murray & Scime, 2015). Predictive analytics can be defined as building and evaluating a model that is aimed at creating empirical predictions, including an empirical predictive model, which is a statistical model, along with other methods, such as data mining algorithms designed for predicting new or future observations, or events and methods for evaluating and estimating the quality of the predictive power of a model (Shmueli & Koppius, 2011).
Big data is used in the forecasting process, particularly in the financial market, where forecasting is used to create automatic predictions of volatility in share prices. The main purpose of forecasting by data mining in the stock market is to discover knowledge that can assist decision-makers. It is important that companies use data mining with utmost care to improve their business by increasing revenue and reducing costs (Ahmed, 2004). For example, Amazon encourages its customers to use the Amazon Price Check Mobile App to collect a database of market price data. Another example is Google's collaboration with the Centers for Disease Control and Prevention (CDC) (Petersen, 2016).
Recently, many researchers, such as Gupta, Bhatia, Dave, and Jain (2019), Kohli, Zargar, Arora, and Gupta (2019) and Petrova, Pauwels, Svidt, and Jensen (2019), have used data mining techniques to predict the stock market, which contains rich, valuable and considerable data that need careful analysis before good decisions can be made. The Kuwait Stock Exchange (KSE) is one of the oldest stock markets in the Gulf Cooperation Council (GCC): it was inaugurated in 1952 and has used an automated trading system since 1995 (Bley & Chen, 2006). Electronic trading led to an accumulation of data from many sources, such as the market index, the sector indices and the index for each company. The Kuwait stock market has 173 companies distributed across three markets (premier market, main market and auction market), where shares and investment funds are traded (Boursakuwait, 2019).

AGJSR 40,2
Thus, the current study will focus on predicting the stock returns in the KSE by using data mining techniques, namely, regression, support vector machine, decision tree and random forest.

Background
Previous research has used several data mining techniques to predict future results and trends. The support vector machine (SVM) method was used to predict the direction of the stock market by Chen and Hao (2017), Lai and Liu (2010) and Yu, Wang, and Lai (2005) from China, and by Usmani, Adil, Raza and Ali (2016) from Pakistan. All agreed that the SVM strongly forecasts the performance of the stock markets.
Two studies used the decision tree technique: Tsai and Hsiao (2010) from Taiwan, and Al-Radaideh, Assaf, and Alnagi (2013) from Jordan. The Taiwan study indicated that the movement of the stock market could be forecast by the decision tree model, but Al-Radaideh et al.'s data from Jordan suggested that the accuracy of the decision tree model is low. In addition, Tsai and Hsiao presented 85 variables as important factors affecting stocks, including the observation that the US stock market has a leading effect on the Taiwan stock market. Ou and Wang (2009) assessed the ability of ten data mining techniques, including tree-based classification, the logistic regression model and SVM, to predict the movement of the Hang Seng Index in Hong Kong. The results of the study showed that the SVM had a better level of prediction than the decision tree and the logistic regression model. Also, Imandoust and Bolandraftar (2014) from Iran predicted the stock trend based on the decision tree and random forest, and found the performance of the decision tree model to be better than that of the random forest. Awan et al. (2021a), in "Social media and stock market prediction: a big data approach", predicted future pricing and sales of products by using linear regression and random forest, and found that linear regression gives higher accuracy than random forest.
There are also studies using four methods, including "A big data approach to Black Friday sales" by Awan et al. (2021b), which used linear regression, generalized linear regression, random forest and decision tree to predict market trends, and found that linear regression, random forest and generalized linear regression provide an accuracy of 80%-98%, while the decision tree did not perform as well. Shashaank, Sruthi, Vijayalakshimi and Garcia (2015) also used a full mix of classification algorithms (random forest, decision tree, support vector machine and multinomial logistic regression) to predict the stock price. The results of this Indian study showed that random forest had the best prediction performance, followed by decision tree, then SVM and, lastly, multinomial logistic regression.

Methodology
The methodology is quantitative: it uses mathematical and statistical models that describe a wide array of relationships among variables, and that help managers gain insight into problems and facilitate daily decision-making. Statistical algorithms are the processes of collecting a sample and organizing, analyzing and interpreting data; the numeric values of the characteristics analyzed in this process help with problem-solving and decision-making (Devi & Devaki, 2019).
Quantitative methods used to predict the direction of stock market returns aim to assist decision-makers in taking action to buy or sell stocks at the best possible time. Prediction is one of the data mining techniques adopted in this research to achieve the research objective of using data analysis tools (regression, support vector machine, decision tree and random forest) to extract knowledge. The predictive approach is a data mining technique that makes forecasts based on historical data or on aggregate indicators, such as key performance indicators, so that potential problems can be detected ahead of time and thereby managed and mitigated (Metzger et al., 2014).

Limitations
Some of the limitations identified for this study are as follows:
(1) Location limitation: KSE.
(2) Time limitation: The amount of time available to accomplish the study, where the period was completed within the academic year 2019-2020 and the academic year 2020-2021. During 2020, the coronavirus pandemic (COVID-19), which was a major obstacle, occurred during data collection and analysis.
(3) Data limitation: The KSE data were collected from May 2019 to March 2020, while the factors affecting the stock exchange data were collected in July 2020 due to the corona pandemic.
The research framework of this study is summarized in Figure 1. As depicted there, the process starts with data collection from various sources. The data are then pre-processed and converted to a proper format, ready for analysis. The next step is exploratory data analysis (EDA) to understand the data before the final step, which is developing the model for prediction. A more detailed description of the steps in this framework follows.

Data collection
The data were collected from two sources:
(1) Stock market data were collected for companies in Boursa Kuwait from January 6, 2015 to November 26, 2019.
The oil price, gold price and exchange rates of KWD to USD were collected from investing websites.
The money supply and interest rate were collected from the Central Bank of Kuwait website. Drawing up the charts helped us decide to choose the banking sector and telecommunication sector to analyze. Figure 2 shows the market capitalization for all Kuwait stock market sectors, with the banking and telecommunication sectors representing three-quarters of the stock market.

Data preprocessing
This step converts the raw historical data into an understandable form (CSV). A CSV file records data in a tabular format, which is easy for researchers to handle (Mao et al., 2018). Two processes are applied in this study to prepare the data for forecasting the stock market:
(1) Preparing the data for forecasting by arranging the data in tables in a single Excel sheet (.XLS).
The database for the regression test, the decision tree test, the support vector machine test and the random forest test is partitioned into training and testing subsets of 75% and 25%: the 16,982 observations of 14 variables are divided into 12,736 training observations and 4,246 testing observations. In order to measure the effect on the rate of return in Boursa Kuwait of factors such as oil price, gold price, exchange rate of KWD to USD, money supply, interest rate, EPS, DPS and the five indices of the Gulf stock markets, a new feature has been added to the table, which is the rate of return.
The following mathematical equation was used (Dayananda et al., 2002):

$$\text{Rate of return} = \ln(\text{close price}_n) - \ln(\text{close price}_{n-1})$$

This feature was then analyzed based on its direction: downward, upward or stable.
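The preprocessing described above can be illustrated with a short Python sketch. It is not the authors' R code: the closing prices are made up, the function names are hypothetical, and the rule that a return of exactly zero counts as "stable" (the `tolerance` parameter) is an assumption, since the text does not state how the stable class was delimited. The class codes 1, 0 and -1 follow the labels used elsewhere in the paper.

```python
import math
import random


def log_returns(close_prices):
    """Rate of return: ln(close_n) - ln(close_{n-1}) for consecutive closes."""
    return [math.log(cur) - math.log(prev)
            for prev, cur in zip(close_prices, close_prices[1:])]


def direction(rate, tolerance=0.0):
    """Label a return as upward (1), downward (-1) or stable (0).

    `tolerance` is a hypothetical parameter; 0.0 means only an exactly
    unchanged close price is treated as stable (an assumption).
    """
    if rate > tolerance:
        return 1
    if rate < -tolerance:
        return -1
    return 0


def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up closing prices, for illustration only.
closes = [0.500, 0.505, 0.505, 0.498]
rates = log_returns(closes)
labels = [direction(r) for r in rates]

# Row indices standing in for the 16,982 observations reported in the text.
train, test = train_test_split(range(16982))
print(labels, len(train), len(test))
```

With a 75% training fraction, the 16,982 observations split into the 12,736 training and 4,246 testing observations reported above.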

Data analysis
In this step, the data were converted to information by applying different data mining techniques (regression, SVM, decision tree and random forest) to a sample of Kuwait stock market data, along with variables that affect the Kuwaiti stock market. RStudio software was used to build a model for each technique, and the models were then compared based on their accuracy rates. RStudio is free and open-source software for data science, scientific research and the technical community, which uses the R language. R is a free language and environment for statistical computing and graphics, which provides a wide variety of statistical and graphical techniques (Misra, 2020; RStudio, 2020).

Results and discussion
Regression test
This study used the multinomial logistic regression analysis test, which is a type of regression that predicts the probabilities of more than two possible outcomes of a categorically distributed dependent variable based on independent variables. The regression test therefore presents the effect of the selected variables on the direction of the rate of return for the Kuwait stock market in three categories (upward, downward and stable) and provides an equation for making predictions about the rate of return based on the selected variables. The accuracy of the multinomial logistic regression test is 54.24%. All variables are statistically significant at the 5% level except gold price and oil price. Also, the variables that do not have an influence on the direction of the rate of return of Boursa Kuwait are money supply and gold price. The regression equation of the logistic model is based on this test; with three categories, there are two logit functions. The first logit function is for the probability of a stable direction relative to the probability of a downward direction, and the second is for the probability of an upward direction relative to the probability of a downward direction. The first of these equations is given as:

$$
\begin{aligned}
\ln(P_0/P_{-1}) = {} & 3.98 - 0.76\,(\text{ALMUTAHED}) - 1.12\,(\text{AUB}) - 0.43\,(\text{BOUBYAN}) - 1.04\,(\text{BURG}) \\
& + 0.42\,(\text{CBK}) - 1.1\,(\text{GBK}) - 1.64\,(\text{KFH}) - 0.93\,(\text{KIB}) - 1.6\,(\text{NBK}) \\
& - 1.13\,(\text{OOREDOO}) - 1.58\,(\text{VIVA}) - 1.4\,(\text{Warba}) - 1.68\,(\text{ZAIN}) \\
& - 0.02\,(\text{Oil price}) - 1.03\,(\text{Exch. rate}) - 117.16\,(\text{Int. rate})
\end{aligned}
$$

Table 1 shows the confusion matrix of multinomial logistic regression for training data and testing data. The highest value in the confusion matrix for both training and testing data occurs when the actual direction of the rate of return is stable and the model predicts stability. The lowest value for both occurs when the actual direction is stable and the model predicts a downward direction. Also, the accuracy scores for training data and testing data are nearly the same.
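To illustrate how two logit functions of this kind translate into a predicted direction, the following Python sketch converts a pair of baseline-category logits into three class probabilities. The function name and the two logit values are hypothetical; in practice each logit would be obtained by evaluating a fitted equation such as the one above on an observation's variables.

```python
import math


def class_probabilities(logit_stable, logit_upward):
    """Turn two logits measured against the downward class into probabilities.

    logit_stable = ln(P_0 / P_-1) and logit_upward = ln(P_1 / P_-1),
    so the downward class (-1) acts as the reference category.
    """
    odds_stable = math.exp(logit_stable)
    odds_upward = math.exp(logit_upward)
    denominator = 1.0 + odds_stable + odds_upward
    return {
        -1: 1.0 / denominator,
        0: odds_stable / denominator,
        1: odds_upward / denominator,
    }


# Hypothetical logit values, for illustration only.
probs = class_probabilities(0.4, 0.1)
predicted = max(probs, key=probs.get)  # class with the largest probability
print(probs, predicted)
```

The three probabilities sum to one by construction, and the predicted direction is simply the class with the largest probability.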

Support vector machine test
The support vector machine test is built in RStudio software. Table 2 displays the confusion matrix of the SVM with a polynomial kernel function and a regularization parameter (C) of 10. The highest value in the confusion matrix occurs when the actual direction of the rate of return is stable and the model predicts stability, and the lowest value occurs when the actual direction is stable and the model predicts an upward direction. The accuracy of the SVM with the polynomial kernel function is 52.73%. Table 3 indicates the per-class statistics of the direction of the rate of return for the polynomial kernel function of the SVM; the class with the highest accuracy is the stable class, and the positive and negative classes are nearly the same. Table 4 displays the confusion matrix of the SVM with a radial basis kernel function and a regularization parameter (C) of 10. Table 5 indicates the per-class statistics of the direction of the rate of return for the radial kernel function of the SVM; the class with the highest accuracy is the stable class, and the upward and downward classes are very close.
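For readers unfamiliar with the two kernel functions compared here, the Python sketch below writes them out in their standard textbook forms. The values of `degree`, `gamma` and `coef0` are assumptions chosen for illustration; the paper reports only the kernel type and C = 10, not these settings.

```python
import math


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=0.0):
    """Polynomial kernel: (gamma * <x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


def radial_kernel(x, y, gamma=1.0):
    """Radial basis kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two made-up feature vectors.
u, v = [1.0, 2.0], [3.0, 4.0]
print(polynomial_kernel(u, v), radial_kernel(u, v))
```

The radial kernel depends only on the distance between two observations, whereas the polynomial kernel depends on their inner product, which is one reason the two can yield different accuracies on the same data.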

Decision trees test
The decision tree is used to predict the direction of the rate of return in Boursa Kuwait and is constructed by RStudio software. Figure 3 shows the results of implementing the decision tree. The size of this tree is 9. The leaves of the tree explain the decision tree's prediction rules.
The tree also shows that if the interest rate is less than 0.019, the predicted direction is stable (0). Furthermore, stability dominates the results: it is predicted by three rules, whereas a downward direction and an upward direction are each predicted by one rule. The accuracy of the decision tree is 53.56%. Table 6 presents the confusion matrix of the decision tree, where the highest value occurs when the actual direction of the rate of return is stable and the model predicts stability. Table 7 indicates the per-class statistics of the direction of the rate of return for the decision tree; the class with the highest accuracy is the stable class, and the positive and negative classes are nearly the same.
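The overall accuracy and the per-class statistics read from confusion matrices in this section can be computed as in the Python sketch below. The counts are made up, the orientation (rows as predicted classes, columns as actual classes) is an assumption, and treating the per-class "accuracy" as balanced accuracy (the mean of sensitivity and specificity) is also an assumption about how the reported statistics were defined.

```python
def overall_accuracy(matrix):
    """Share of observations on the diagonal, i.e. correct predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def class_statistics(matrix, k):
    """Sensitivity, specificity and balanced accuracy for class index k.

    Assumes rows are predicted classes and columns are actual classes.
    """
    total = sum(sum(row) for row in matrix)
    true_pos = matrix[k][k]
    false_neg = sum(row[k] for row in matrix) - true_pos  # rest of column k
    false_pos = sum(matrix[k]) - true_pos                 # rest of row k
    true_neg = total - true_pos - false_neg - false_pos
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Made-up counts; class order is downward (-1), stable (0), upward (1).
matrix = [
    [30, 10, 20],  # predicted downward
    [40, 80, 40],  # predicted stable
    [30, 10, 40],  # predicted upward
]
accuracy = overall_accuracy(matrix)
stable_stats = class_statistics(matrix, 1)
print(accuracy, stable_stats)
```

In this toy matrix the stable class has high sensitivity but low specificity, which is the typical pattern when a model predicts the most frequent class too often.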

Random forest test
First, the random forest test searches for the best number of variables available for splitting at each tree node, from 2 to 10, based on accuracy. Table 8 shows the accuracy for each number of variables; the highest accuracy, 53.1%, is obtained with 8 variables. Table 9 displays the confusion matrix of the random forest, where the parameters used are the number of variables for splitting at each tree node (5) and the number of trees to grow (100). The accuracy of the random forest is 53.04%. Table 10 indicates the per-class statistics of the direction of the rate of return for the random forest; the class with the highest accuracy is the stable class, then the positive class, followed by the negative class.

Table 3. Statistics by classes of the direction of rate of return of the polynomial kernel function of SVM
Table 5. Statistics by classes of the direction of rate of return of the radial kernel function of SVM
While the random forest model is popular for its predictive performance, it also provides a fully non-parametric measure of variable importance (VIMP), which supplies insight into a system by identifying which variables play a key role in prediction (Ishwaran & Lu, 2019). Figure 4 illustrates the variable importance for our developed random forest model. As shown in Figure 4, the three most important variables affecting the direction of the rate of return in Boursa Kuwait are the firms variable, the Kuwait index and the EPS. We can also observe that money supply has the lowest effect on the direction of the rate of return in Boursa Kuwait.
The multinomial logistic regression test was used to measure the correlation and the influence of the variables on the direction of the rate of return. Fourteen variables have a significant effect on the direction of the rate of return, and the Kuwait index has the highest coefficient, equal to 99.460. The results of this test were applied to predict the direction of the rate of return based on the equations.
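The idea behind the variable importance measure (VIMP) discussed for the random forest can be sketched in Python as a permutation test: shuffle one variable and record how much the accuracy drops. This is a simplified illustration, not the authors' procedure. A random forest normally evaluates this on the out-of-bag data of each tree, whereas the sketch uses a fixed, hypothetical `toy_model` and made-up data.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction equals the label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(rows)


def permutation_importance(predict, rows, labels, column, seed=0):
    """Accuracy lost when the values of one column are shuffled across rows."""
    values = [row[column] for row in rows]
    random.Random(seed).shuffle(values)
    permuted = [row[:column] + [value] + row[column + 1:]
                for row, value in zip(rows, values)]
    return accuracy(predict, rows, labels) - accuracy(predict, permuted, labels)


def toy_model(row):
    """Hypothetical classifier that looks only at the first variable."""
    if row[0] > 0:
        return 1
    if row[0] < 0:
        return -1
    return 0


# Made-up data: column 0 drives the label, column 1 is pure noise.
rng = random.Random(1)
rows = [[rng.choice([-1, 0, 1]), rng.random()] for _ in range(200)]
labels = [toy_model(row) for row in rows]

importance_used = permutation_importance(toy_model, rows, labels, column=0)
importance_unused = permutation_importance(toy_model, rows, labels, column=1)
print(importance_used, importance_unused)
```

A variable the model never uses has an importance of zero, while shuffling a variable that drives the prediction lowers the accuracy; ranking variables by this drop gives an importance chart of the kind shown in Figure 4.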
In the support vector machine test, the radial kernel function has better accuracy than the polynomial kernel function. Also, the stable class has the highest accuracy of the three classes.
The decision tree test has an accuracy of 53.56% and, based on the confusion matrix, its best predictions are for the stable class. Furthermore, the stable class performs best compared to the other classes, with a sensitivity of 80.29 and an accuracy of 69.08.
The random forest test has an accuracy of 53.04% and, based on the confusion matrix, its best predictions are for the stable class. Moreover, the stable class performs best compared to the other classes, with a sensitivity of 62.9 and an accuracy of 67.65. Additionally, the variable with the highest importance score for the direction of the rate of return is the firms variable. Table 11 indicates the overall accuracy of the four models. Accuracy is the parameter for evaluating the performance of a model; all the tests have accuracy scores of around 50%, which means that the models are only moderately effective. Ranked by accuracy, the models are ordered as follows: multinomial logistic regression, radial kernel function in SVM, decision tree, polynomial kernel function in SVM and, finally, random forest.

Conclusion
The main purpose of this study was to estimate the probabilities of more than two possible directions of the rate of return of Boursa Kuwait based on other variables, and also to determine which class of direction is predicted most accurately. Furthermore, this study assists the decision-making process by mapping out the different potential directions of the rate of return of Boursa Kuwait with decision trees, identifies the variables that affect these directions and recognizes suitable methods for analyzing the big data of the Boursa market.
The multinomial logistic regression analysis is used in this study to indicate the effect of selected variables on the direction of the rate of return for the Kuwait stock market. The variable that has the greatest effect on the direction of the rate of return of Boursa Kuwait is the Kuwait index, and the variables that have no influence on it are money supply and gold price.
The support vector machine test and the random forest test are used to identify which class can more accurately predict the directions of the rate of return of Boursa Kuwait, and both tests agree that the highest-class accuracy is the stable class.

The decision tree test is used in this study to identify the direction of the rate of return of Boursa Kuwait based on independent variables; therefore, the decision tree is constructed based on the interest rate. The other variables that affect the direction of the rate of return are the Kuwait index and firms.
The random forest test provides a non-parametric measure of variable importance (VIMP) that can identify which variables play a key role in prediction. The most important variable affecting the direction of the rate of return in Boursa Kuwait is the firms variable, followed by the Kuwait index and then EPS.
Based on the accuracy scores provided by the models used in this study, all tests showed very similar accuracy and were only moderately effective. Ranked by accuracy, the models are ordered as follows: multinomial logistic regression, radial kernel function in SVM, decision tree, polynomial kernel function in SVM and, finally, random forest.

Recommendations
Based on the results of this study, the following recommendations are suggested:
(1) In the case of data mining results, accuracy depends on the quality of the data used, so it is paramount that an effort is made to verify and preprocess the data.
(2) The three highest variables that affect the direction of the rate of returns in Boursa Kuwait are firms, the Kuwait index and EPS.
(3) Employ data mining techniques in the stock market in order to provide more considered findings, which will lead to an increase in the quality of decisions.
(4) Encourage the decision-makers to utilize data mining techniques within their analytical and strategic planning efforts.
(5) Conduct further studies on utilizing more data mining techniques and tools to support decisions in the stock market.

Future works
For further works, the following are a few suggestions:
(1) Improve the models that are used in this study, such as multinomial logistic regression, SVM, decision tree and random forest, by applying the models to all the companies listed in the Kuwait stock market.
(3) Reconsider the factors that affect the Kuwait stock market return, such as trading volume, financial news, political news, global indicators and Morgan Stanley Capital International (MSCI).
(4) Finally, employ these data mining techniques on other stock markets, such as GCC, Middle East countries and global markets.