Stock price indices prediction combining deep learning algorithms and selected technical indicators based on correlation

Purpose – The prediction of stock market (SM) indices is a fascinating task. An in-depth analysis in this field can provide valuable information to investors, traders and policymakers in attractive SMs. This article aims to apply a correlation feature selection model to identify important technical indicators (TIs), which are combined with multiple deep learning (DL) algorithms for forecasting SM indices.
Design/methodology/approach – The methodology involves using a correlation feature selection model to select the most relevant features. These features are then used to predict the fluctuations of six markets using various DL algorithms, and the results are compared with predictions made using all features by using a range of performance measures.
Findings – The experimental results show that the combination of TIs selected through correlation and Artificial Neural Network (ANN) provides good results in the MADEX market. The combination of selected indicators and Convolutional Neural Network (CNN) in the NASDAQ 100 market outperforms all other combinations of variables and models. In other markets, the combination of all variables with ANN provides the best results.
Originality/value – This article makes several significant contributions, including the use of a correlation feature selection model to select pertinent variables, comparison between multiple DL algorithms (ANN, CNN and Long Short-Term Memory (LSTM)), combining selected variables with algorithms to improve predictions, evaluation of the suggested model on six datasets (MASI, MADEX, FTSE 100, S&P 500, NASDAQ 100 and EGX 30) and application of various performance measures (Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Squared Logarithmic Error (MSLE) and Root Mean Squared Logarithmic Error (RMSLE)).

1. Introduction
SM prediction is a long-standing and difficult task due to the inherent complexities of financial time series, such as high volatility, non-stationarity and non-linearity (Long, Chen, He, Wu, & Ren, 2019). The Efficient Market Hypothesis posits that it is impossible to predict stock price movements and that prices behave randomly (Fama, 1965). In contrast, Technical Analysis (TA) claims that prices incorporate all available information and that trend detection makes price prediction easier (Patel, 2014).
Investment decisions in financial markets can be made through either fundamental analysis or TA. Fundamental analysis involves evaluating the actual price against the intrinsic value and deciding to buy or sell based on this comparison. TA, on the other hand, relies on historical data and employs TIs to help traders determine when to buy and sell assets (Naik & Mohan, 2019; Ratto, Merello, Ma, Oneto, & Cambria, 2019).
In recent times, various studies have combined Artificial Intelligence (AI) algorithms with TIs for more accurate financial market predictions. The most commonly used models include Artificial Neural Networks (ANN), Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) (Sezer, Gudelek, & Ozbayoglu, 2019). This study aims to predict different SM indices by utilizing ANN, CNN and LSTM combined with TIs. Correlation is employed as the feature selection method for selecting relevant TIs. Related works are discussed in the following section, followed by a description of our methodology and accuracy metrics in Section 3. Our findings are analyzed in Section 4, and the conclusion, along with a discussion of future work, is provided in Section 5.

Literature review
The growth of computer technology has made it easier to use and develop TA in the SM. With the help of these tools, investors are now able to create powerful decision support systems that can increase their profits and minimize losses. Chopra, Yadav, and Chopra (2019) employed an ANN model to predict the stock prices of nine companies with diverse market capitalizations and the CNX NIFTY50 index on the Indian Stock Exchange. Their research highlights the model's effectiveness in predicting stock prices, especially for the most volatile prices before and after the demonetization process. Dash and Dash (2016) presented a hybrid stock trading framework that combines TA with machine learning techniques. They proposed a decision support system called Computational Efficient Functional Link Artificial Neural Network (CEFLANN) to help investors make more informed decisions with less risk. The system uses an Extreme Learning Machine (ELM) to generate investment decisions and was compared to other models such as Support Vector Machine (SVM), Naive Bayesian, K Nearest Neighbor (KNN) and Decision Tree (DT). The results showed that CEFLANN is more profitable than these other models. Agrawal, Khan, and Shukla (2019) developed a model that forecasts SM price movements based on TIs. They utilized the "Optimal Long Short-Term Memory (O_LSTM)" model, which was able to predict both short-term and long-term trends. This model showed superior performance compared to other models such as ELSTM, SVM and linear regression (LR).
Selvamuthu, Kumar, and Mishra (2019) compared the predictive abilities of three neural network learning algorithms: Levenberg-Marquardt, Scaled Conjugate Gradient and Bayesian Regularization. The results of this study revealed that these three algorithms achieve a predictive accuracy of 99.9%.
Qiu and Song (2016) employed an ANN model in conjunction with a "Genetic Algorithm (GA)" model and introduced two types of inputs in the form of TA indicators. Their results showed that this hybrid model was effective in daily forecasting of the Nikkei 225 index compared to other studies that used alternative models. Efat, Bashar, Imtiaz Ud-Din, and Bhuiyan (2018) utilized a new model called "Trend Estimation with Linear Regression" to predict market trends. When comparing their findings with those from the ARIMA and PROPHET models, their results were more favorable. Sezer, Ozbayoglu, and Dogdu (2017) developed a decision support system for investors to determine entry points, using neural networks applied to the Relative Strength Index (RSI) and moving average. Sahoo and Mohanty (2020) combined an ANN model with "The Gray Wolf Optimization" to forecast prices on the Bombay Stock Exchange. They found that this combination works better than an ANN model alone. Sang and Di Pierro (2019) examined the application of machine learning techniques in stock trading and proposed the use of an LSTM neural network to enhance the accuracy of predictions made with TA. The authors tested their proposed method using historical stock data and found that it outperforms traditional TA methods. Their research concludes that incorporating LSTM with TA can improve stock price predictions.
Ayala, García-Torres, Noguera, Gómez-Vela, and Divina (2021) present a study on enhancing SM index predictions through the integration of machine learning techniques into a TA strategy. The authors suggest a method that optimizes the TIs employed in the strategy by training a machine learning algorithm to determine the most significant indicators. The method was evaluated on several SM indices and found to improve the performance of predictions compared to using TIs alone or a standard machine learning method. The study's findings indicate that combining TA with machine learning can enhance the accuracy of SM index predictions. Kamara, Chen, and Pan (2022) propose a novel hybrid model for stock price forecasting. The model blends deep learning models and TA to enhance prediction accuracy. The article examines the use of an ensemble technique, which is a combination of multiple models, to improve prediction performance. The model was trained and tested on historical data of an SM index, and the results demonstrate that the proposed model outperforms traditional TA and machine learning models. Chandar (2022) presents a technique for incorporating TIs in stock trading with a CNN. The author describes the use of the CNN to recognize patterns in TIs such as moving averages and relative strength indexes, and how these patterns can be utilized to make predictions about future stock prices. The study found that the CNN model outperformed traditional machine learning models in stock trading by achieving higher prediction accuracy. Niu, Xu, and Wang (2020) combined variational mode decomposition (VMD) and an LSTM network to predict four stock indices (HSI, FTSE, S&P 500 and IXIC). In this work, the authors demonstrate that the combination of the two models performs much better than a single model. Furthermore, the results of this work show that VMD-LSTM outperforms VMD-ELM, VMD-CNN and VMD-BPNN for the SPX and IXIC data series.
Ecer, Ardabili, Band, and Mosavi (2020) compared the predictive performance of two combined models, MLP-GA and MLP-PSO, using two different output functions, Tanh(x) and Gauss. As input data, they used TIs calculated from the historical data of the Borsa Istanbul 100 index. The results of this study show that the Tanh(x) output function improves the accuracy of the models and that the MLP-PSO model with a population size of 125 outperforms the other models used in that work. Orimoloye, Sung, Ma, and Johnson (2020) compared the abilities of deep feedforward neural networks and shallow architectures for predicting 34 stock price indices in different markets (emerging and developed) and in different time frames (daily, hourly, minute and tick). The results of that paper demonstrate that deep NNs using the ReLU function outperform the shallow architectures across time horizons, except on tick-level data.

Jiang, Liu, Zhang, and Chunyu (2020) combined TIs and macroeconomic variables to predict three major indices in the USA (S&P 500, NASDAQ 100 and Dow 30). They used a stacking method to make 20-day predictions and found that this method outperforms ensemble learning algorithms and deep learning models. Nikou, Mansourfar, and Bagherzadeh (2019) compared different machine and deep learning algorithms in predicting the close price of iShares MSCI UK. They found that deep learning models outperform machine learning models. Goel et al. (2022) aimed to predict the close price of the Bombay Stock Exchange (BSE) using an ANN model and macroeconomic variables. They found that ANN works well in this market and can make predictions with 93% accuracy.
From our review of the literature on the prediction of SM indices, two major shortcomings were identified. First, most works focus on predicting developed markets such as the S&P 500 and NASDAQ 100. Second, feature selection to identify the pertinent variables has not been undertaken. Our goal is to contribute to filling these significant gaps in the literature.

Methodology
This paper aims to predict the SM indices of MASI, MADEX, FTSE 100, EGX 30, NASDAQ 100 and S&P 500 using three distinct models: ANN, CNN and LSTM. The data was collected from Investing.com using the Investpy library in Python.
Table 1 displays the periods of the data series, including the observation period, the number of observations and the observations used for training and testing for each index. The observation period spans from a specific date in the past to April 16, 2021, with the number of observations ranging from 4,809 for MASI to 8,962 for NASDAQ 100. The observations used for training range from 3,847 for MASI to 7,169 for NASDAQ 100, while the observations used for testing range from 962 for MASI to 1,793 for NASDAQ 100. Deep learning models need a lot of data to make good predictions; this is why we imported the maximum available data for each index, which explains the difference in the number of observations across the indices. The aim of this work is to make predictions for the selected indices, and the difference in the number of observations will not impact the results.
We also compute TIs for each index using the Ta-Lib library, as shown in Table 2. The table lists various TIs used in financial analysis, including their symbols and order, which are commonly employed by traders and investors to analyze historical security or market performance and make predictions about future performance (Ifleh & El Kabbouri, 2022; Ecer et al., 2020).
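To make concrete what a few of the indicators in Table 2 compute, the following is a minimal sketch in plain NumPy. It is a stand-in for illustration only: the paper computes its indicators with Ta-Lib, and the helper names `sma`, `momentum` and `rate_of_change`, the toy `close` series and the periods 3 and 2 are all assumptions chosen for the example.

```python
import numpy as np

def sma(close, period):
    """Simple moving average; the first period-1 entries are NaN."""
    close = np.asarray(close, dtype=float)
    out = np.full(close.shape, np.nan)
    window_sums = np.convolve(close, np.ones(period), mode="valid")
    out[period - 1:] = window_sums / period
    return out

def momentum(close, period):
    """Price change over `period` steps: close[t] - close[t - period]."""
    close = np.asarray(close, dtype=float)
    out = np.full(close.shape, np.nan)
    out[period:] = close[period:] - close[:-period]
    return out

def rate_of_change(close, period):
    """Percentage change over `period` steps."""
    close = np.asarray(close, dtype=float)
    out = np.full(close.shape, np.nan)
    out[period:] = 100.0 * (close[period:] / close[:-period] - 1.0)
    return out

# Toy closing prices (hypothetical, not taken from the paper's data)
close = [100.0, 102.0, 101.0, 105.0, 107.0, 106.0]
sma_3 = sma(close, 3)
mom_2 = momentum(close, 2)
roc_2 = rate_of_change(close, 2)
print(sma_3, mom_2, roc_2)
```

Each indicator becomes one candidate input column; the leading NaN entries mark the warm-up period during which the indicator is not yet defined and would have to be dropped before training.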
Table 2 (continued). Technical indicators
Fast stochastic (FastK): A TI that measures the momentum of an asset's price
High - Low (High - Low): A TI that measures the difference between the highest and lowest prices of an asset over a specified period
Momentum (Mom): A TI that measures the rate of change in an asset's price over a specified period
Money flow index (MFI): A TI that measures the strength of buying and selling pressure in an asset
Moving average convergence divergence (MACD): A TI that measures the difference between two exponential moving averages. Order: 9, 12,
On Balance Volume (OBV): A TI that measures buying and selling pressure by adding or subtracting volume based on whether the price moves up or down
Open - Close (Open - Close): A TI that measures the difference between the opening and closing prices of an asset over a specified period
Percentage price oscillator (PPO): A TI that measures the difference between two exponential moving averages as a percentage

Our method consists of seven primary steps, as depicted in Figure 1. First, we extract the data from Investing.com. Second, we calculate the TIs for each index using the Ta-Lib library. Third, we use the Correlation Feature Selection method to choose the most relevant features for the model. Fourth, we split the data into training and testing sets, with an 80% and 20% ratio, respectively, as indicated in Table 1. Fifth, we train the above-mentioned models on the training data. Sixth, we make predictions using the trained models. Finally, in the seventh step, we evaluate the predictions using various accuracy metrics.
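The 80%/20% split can be sketched as follows. This assumes a chronological split (earliest observations for training, no shuffling) and truncation of the training size to an integer. Neither detail is stated in the paper, but together they reproduce the MASI and NASDAQ 100 counts it reports. The function name `chronological_split` is hypothetical.

```python
def chronological_split(series, train_ratio=0.8):
    """Split an ordered series into train/test parts without shuffling."""
    n_train = int(len(series) * train_ratio)  # truncate toward zero
    return series[:n_train], series[n_train:]

# Placeholder series with the observation counts reported for each index
masi = list(range(4809))
nasdaq = list(range(8962))

masi_train, masi_test = chronological_split(masi)
nasdaq_train, nasdaq_test = chronological_split(nasdaq)

print(len(masi_train), len(masi_test))      # 3847 962
print(len(nasdaq_train), len(nasdaq_test))  # 7169 1793
```

Keeping the test set strictly after the training set in time avoids leaking future prices into training, which is the usual reason for not shuffling a financial time series.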

Feature selection
With the rise in data, reducing its dimensionality has become crucial for efficient processing. In various domains, a problem may require a large number of variables, which can cause challenges such as information loss due to noisy data, complexity and extended computation time. Correlation is a commonly used method of feature selection that measures the degree of connection or the extent to which two variables vary together. The Pearson correlation coefficient is one of the most commonly used measures of correlation.
The selection of variables as inputs for a model is based on the principle that they should be correlated with the dependent variable and not correlated with other independent variables.
Using linear correlation as a measure of input quality offers several advantages. Firstly, it enables us to eliminate inputs that are independent of the output. Secondly, it reduces the redundancy among the selected inputs.
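The selection principle above (keep inputs correlated with the target, drop inputs correlated with one another) can be sketched as a greedy filter. This is one possible reading, not the authors' exact procedure: the function name, the feature names and the two thresholds (`min_target_corr=0.5`, `max_pair_corr=0.8`) are assumptions, since the paper does not report cut-off values.

```python
import numpy as np

def select_by_correlation(X, y, names, min_target_corr=0.5, max_pair_corr=0.8):
    """Greedy filter: relevant to the target, not redundant with chosen features."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # |Pearson r| between each feature column and the target
    target_corr = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    )
    # |Pearson r| between every pair of feature columns
    pair_corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = []
    # Visit features from most to least correlated with the target
    for j in np.argsort(-target_corr):
        if target_corr[j] < min_target_corr:
            break  # all remaining features are even less relevant
        if all(pair_corr[j, k] < max_pair_corr for k in selected):
            selected.append(int(j))
    return [names[j] for j in selected]

# Toy data built from three mutually orthogonal, zero-mean patterns
u = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
v = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
w = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
y = u + v                              # target
X = np.column_stack([u + 0.2 * w,      # feat_a: relevant
                     u + 0.3 * w,      # feat_b: relevant but nearly a copy of feat_a
                     v,                # feat_c: relevant, independent of feat_a
                     w])               # feat_d: unrelated to the target
selected = select_by_correlation(X, y, ["feat_a", "feat_b", "feat_c", "feat_d"])
print(selected)
```

In the toy data, `feat_b` is rejected as redundant with `feat_a` and `feat_d` is rejected as irrelevant, which mirrors the two advantages listed above.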
Figures 2-7 present the correlation matrix heatmaps of the TIs employed in this work for each index. The correlation ranges between −1 and 1; light (dark) regions indicate a positive (negative) relationship.
Table 3 presents the relevant variables selected through correlation. The specific indicators included for each index are those that have shown a strong correlation with the close price of that index and are independent of one another, as determined by the analysis.
For example, in the MASI index, the indicators SAR, DX, ADX, macd, macdhist, STDV, RSI, Volume, Open-Close and TRIX_60 have been selected because they have a strong correlation with the close price of the index and are not correlated with one another. It is worth mentioning that the specific indicators selected for each index may vary, reflecting the application of the correlation analysis to different indexes.

Mobilised models
In this work we used three models: ANN, CNN and LSTM.
(1) ANN
Artificial neural networks (ANNs) are made of multiple units, called perceptrons. Every perceptron simulates the natural neurons of the human brain. Figure 8 shows that the first perceptron receives inputs $x_i$, each of which is multiplied by a weight $w_i$. In the following stage, the outcome is derived by comparing the weighted sum $\sum_i w_i x_i$ to a threshold: when the weighted sum is under the given threshold, the output is zero; otherwise, the output is one (Chopra et al., 2019). A multilayer perceptron is called an ANN. ANNs comprise several perceptrons organized in layers. The first layer gets the inputs and transmits them to the intermediate layers, known as hidden layers. The values calculated in one layer are transmitted to the following layer, and the last layer generates the outputs. In this work, we build our ANN model and train it using 100 epochs and 2 hidden layers because this configuration minimizes the loss function.
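The threshold rule and the layered structure described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' network: the weights are hand-picked toy values, and the ReLU hidden activation and linear output are assumptions, since the paper specifies only the number of hidden layers (two) and epochs.

```python
import numpy as np

def perceptron(x, w, threshold):
    """Return 0 when the weighted sum is under the threshold, else 1."""
    return 1 if float(np.dot(w, x)) >= threshold else 0

def mlp_forward(x, layers):
    """Forward pass: ReLU on hidden layers (assumed), linear output layer."""
    h = np.asarray(x, dtype=float)
    for i, (W, b) in enumerate(layers):
        h = W @ h + b
        if i < len(layers) - 1:      # hidden layers only
            h = np.maximum(h, 0.0)
    return h

x = np.array([1.0, 2.0])

# Single perceptron: weighted sum is 0.5*1 + 0.5*2 = 1.5
fires = perceptron(x, np.array([0.5, 0.5]), threshold=1.0)
silent = perceptron(x, np.array([0.5, 0.5]), threshold=2.0)

# Two hidden layers followed by one output unit (toy weights)
layers = [
    (np.array([[1.0, -1.0], [0.5, 0.5]]), np.zeros(2)),   # hidden layer 1
    (np.array([[1.0, 1.0], [-1.0, 1.0]]), np.zeros(2)),   # hidden layer 2
    (np.array([[2.0, -1.0]]), np.array([0.5])),           # output layer
]
prediction = mlp_forward(x, layers)
print(fires, silent, prediction)
```

Training would adjust the entries of each `W` and `b` over the epochs so that the output approaches the observed close price; only the forward computation is shown here.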
(2) CNN
A convolutional neural network (CNN) is a DL architecture specifically developed for images. CNNs are highly similar to an ordinary NN. They are provided with weights and biases that work with the neurons to produce a score that ranks the input data. However, the main difference is that CNNs assume that the input data are images, which allows the architecture to be tailored to specific types of data patterns, to be more efficient, and to reduce the number of parameters in the network. For example, since the input is assumed to be an image, the network can form associations with only the neighboring pixels instead of the entire image. This avoids the unnecessary use of neurons and parameters. We build our CNN model and train it using 100 epochs and 2 hidden layers because this configuration minimizes the loss function.
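The idea of connecting each unit only to neighboring inputs can be illustrated with a one-dimensional convolution over a price window. This is a sketch under assumptions: the paper does not describe how the series are shaped for the CNN, so the 1D layout, the toy `prices` and the difference kernel are hypothetical choices for illustration.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` without padding.

    Each output value depends only on len(kernel) neighboring inputs,
    and the same kernel weights are reused at every position.
    """
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    n_out = len(signal) - len(kernel) + 1
    return np.array(
        [np.dot(signal[i:i + len(kernel)], kernel) for i in range(n_out)]
    )

prices = [1.0, 2.0, 4.0, 7.0, 11.0]
diff_kernel = [-1.0, 1.0]   # hypothetical filter: responds to the local change
feature_map = conv1d_valid(prices, diff_kernel)
print(feature_map)
```

Here two shared weights produce four outputs from five inputs, whereas a fully connected layer with the same input and output sizes would need 5 × 4 = 20 weights. This is the parameter saving the paragraph above refers to.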
(3) LSTM
The LSTM model was first proposed by Hochreiter and Schmidhuber in 1997. It addresses the vanishing gradient problem in RNNs. This problem arises because information is not retained for a long period of time and the gradient in the deepest layers becomes useless.
To resolve this problem, the LSTM model includes a memory cell $C_t$, which is able to keep information for a long period of time. Every memory cell contains three gates: the input gate $I_t$, the forget gate $f_t$ and the output gate $O_t$ (Sethia & Raut, 2019).
The input gate $I_t$ determines whether the input should change the content of the cell, the forget gate $f_t$ chooses whether to reset the content of the cell to zero and the output gate $O_t$ determines whether the content of the cell should contribute to the output of the neuron.
The gates are sigmoid functions whose values lie between 0 and 1, where 0 means that nothing passes and 1 that everything passes (see Figure 9). In this work, we build our LSTM model and train it using 100 epochs and 1 hidden layer because this configuration minimizes the loss function.
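One time step of the gate mechanism described above can be written out as follows. This is a sketch of the standard LSTM cell update, not the authors' trained model: the parameter names (`W_i`, `b_i`, and so on), the layer sizes and the random initial weights are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step with input (i), forget (f) and output (o) gates."""
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])    # how much new content enters the cell
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])    # how much old content is kept
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])    # how much of the cell is exposed
    c_hat = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate cell content
    c_t = f_t * c_prev + i_t * c_hat                    # updated memory cell
    h_t = o_t * np.tanh(c_t)                            # output of the unit
    return h_t, c_t, (i_t, f_t, o_t)

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4   # hypothetical sizes

params = {}
for gate in ("i", "f", "o", "c"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))
    params[f"b_{gate}"] = np.zeros(n_hidden)

x_t = rng.normal(size=n_in)
h_t, c_t, gates = lstm_step(x_t, np.zeros(n_hidden), np.zeros(n_hidden), params)
print(h_t, c_t)
```

Because the forget gate multiplies the previous cell content directly, a gate value near 1 lets information persist across many steps, which is how the memory cell retains information over long periods.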

Forecasting performance measures
There is a wide range of performance measures for judging the precision of a prediction model; in our work we use the following measures (Ifleh & El Kabbouri, 2021; Ecer et al., 2020): MSE, RMSE, MAE, MSLE and RMSLE. They should be nearer to zero to offer better prediction results (Klimberg, Sillup, Boyle, & Tavva, 2010).
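For reference, the five measures named in this paper are commonly defined as below. These are the standard textbook forms, given here as an assumption about what the authors used. $f_t$ (forecast value in period $t$) and $n$ (number of periods forecasted) follow the paper's notation, while the symbol $y_t$ for the actual value in period $t$ is introduced here for illustration. The logarithmic errors use $\ln(1+\cdot)$, which is one common convention.

```latex
\begin{align}
\mathrm{MSE}   &= \frac{1}{n}\sum_{t=1}^{n} \left(y_t - f_t\right)^2 \\
\mathrm{RMSE}  &= \sqrt{\mathrm{MSE}} \\
\mathrm{MAE}   &= \frac{1}{n}\sum_{t=1}^{n} \left|y_t - f_t\right| \\
\mathrm{MSLE}  &= \frac{1}{n}\sum_{t=1}^{n} \bigl(\ln(1 + y_t) - \ln(1 + f_t)\bigr)^2 \\
\mathrm{RMSLE} &= \sqrt{\mathrm{MSLE}}
\end{align}
```

MSE, RMSE and MAE are expressed in (squared) index points, so they are not comparable across indices of different levels. MSLE and RMSLE measure relative error and are therefore closer to scale-free.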

Results and discussion
Table 4 shows the prediction results obtained using correlation as the feature selection method for the MASI, MADEX, FTSE 100, EGX 30, NASDAQ 100 and S&P 500 indexes. The table compares the performance of three different models (ANN, CNN and LSTM) for each index.
The performance of the models is evaluated using different evaluation metrics such as MSE, RMSE, MSLE, RMSLE and MAE.Lower values for these metrics indicate better performance.
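A minimal sketch of how these five metrics could be computed is shown below, using only the standard library. The function name `regression_errors` and the toy values are hypothetical, and the logarithmic errors use the common `log(1 + x)` form, which is an assumption about the convention the authors used.

```python
import math

def regression_errors(actual, forecast):
    """Return MSE, RMSE, MAE, MSLE and RMSLE for two equal-length sequences."""
    n = len(actual)
    pairs = list(zip(actual, forecast))
    mse = sum((a - f) ** 2 for a, f in pairs) / n
    mae = sum(abs(a - f) for a, f in pairs) / n
    # log1p(x) = ln(1 + x); requires values greater than -1
    msle = sum((math.log1p(a) - math.log1p(f)) ** 2 for a, f in pairs) / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": mae,
        "MSLE": msle,
        "RMSLE": math.sqrt(msle),
    }

# Toy close prices and forecasts (hypothetical values)
actual = [100.0, 102.0, 101.0, 105.0]
forecast = [101.0, 101.0, 103.0, 105.0]
errors = regression_errors(actual, forecast)
print(errors)
```

A perfect forecast gives zero for all five metrics, which is why values nearer to zero indicate better predictions.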
For example, for the MASI index, the ANN model has an MSE of 895.324, an RMSE of 29.922, an MSLE of 0.0000, an RMSLE of 0.003 and an MAE of 26.295. This indicates that the ANN model has a lower error than, for example, the LSTM model, which has an MSE of 14388.787, an RMSE of 119.953, an MSLE of 0.0001, an RMSLE of 0.011 and an MAE of 88.562.
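The reported MASI figures can be checked for internal consistency, since RMSE is by definition the square root of MSE. The sketch below reads the reported values with a decimal separator (an MSE of 895.324 and an RMSE of 29.922 for ANN; 14388.787 and 119.953 for LSTM). The dictionary layout is an assumption made for the example.

```python
import math

# MASI results as reported for two of the models
reported = {
    "ANN":  {"MSE": 895.324,   "RMSE": 29.922},
    "LSTM": {"MSE": 14388.787, "RMSE": 119.953},
}

# Recompute RMSE from MSE and compare with the reported RMSE
recomputed = {model: math.sqrt(m["MSE"]) for model, m in reported.items()}
for model, value in recomputed.items():
    print(model, round(value, 3), reported[model]["RMSE"])
```

Both recomputed values agree with the reported RMSE to three decimals. This supports reading these figures as decimal fractions of index points rather than as thousands.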
It is worth noting that different models might perform better on different indexes, depending on the complexity and volatility of the SM. Different indexes might also have different characteristics that affect the prediction accuracy, so correlation feature selection should be done carefully and with a good understanding of the SM.
The outcomes also show that ANN outperforms the other models in predicting the MASI, MADEX and FTSE 100 indices, while CNN outperforms the other models in predicting EGX 30, NASDAQ 100 and S&P 500.
(1) Comparison
By comparing the predictions made using the pertinent variables with the predictions made using all the variables, we observe that using all the variables as inputs improves the predictions of the ANN model in all markets except MADEX and NASDAQ 100, where the pertinent variables perform better.
Also, the predictions using CNN and LSTM combined with selected variables based on correlation outperform the predictions using all the variables in all indices except MASI (see Table 5).

In other words, it is better to predict MADEX and NASDAQ 100 by combining selected variables with ANN and CNN, respectively. Compared to other researchers (Kamara et al., 2022; Sang & Di Pierro, 2019; Chandar, 2022; Ifleh & El Kabbouri, 2021, 2022), in this work we employed a new methodology to make predictions and worked on different markets (emerging and developed). The results show the markets in which correlation feature selection can be employed.

Conclusion
In this study, we aimed to predict six different SM indices using various machine learning models (ANN, CNN and LSTM). We also examined whether using variables selected based on correlation would result in more accurate predictions than using all variables.
We built and trained our models using 100 epochs and 2 hidden layers (1 hidden layer for the LSTM model) because these settings minimize the loss function.
Our results showed that ANN outperformed other models in predicting MASI, MADEX and FTSE 100 indices, while CNN outperformed other models in predicting EGX 30, NASDAQ 100 and S&P 500.
When comparing the results with predictions made using all features, we found that the combination of ANN and all variables generally provided better results, except in the case of MADEX and NASDAQ 100. Additionally, predictions made using CNN and LSTM combined with selected variables based on correlation outperformed predictions made using all variables for all indices except MASI.
There are many possibilities for improving the predictive ability of this study. One promising avenue is to explore alternative feature selection models, such as the random forest algorithm, which could yield more robust results. In addition, combining various predictive models appears to be another viable strategy.
Broadening the scope by integrating a wider range of features could prove useful in highlighting the most important elements of the predictions. These refinements could then inform and guide future research efforts.
In addition, we recommend that researchers consider incorporating additional variables, such as macroeconomic and sentiment indicators, into their research. Combining the results obtained with these variables with our findings could yield valuable information and enrich the body of knowledge.
Table 2. Technical indicators
measures the strength of a trend
Average true range (ATR): A TI that measures the volatility of an asset
Bollinger bands (BB): A TI that measures the volatility of an asset and consists of three lines, a simple moving average, an upper band and a lower band. Order: 14
Chaikin A/D line (A/D): A TI that measures the cumulative flow of money into and out of an asset
Commodity channel index (CCI): A TI that measures the deviation of an asset's price from its average price
Daily return (Return): A TI that measures the percentage change in an asset's price from one day to the next
Directional movement index (DX): A TI that measures the strength of a trend and the direction of the trend
Double Exponential Moving Average (DEMA): A TI that smooths out price data by taking into account two exponential moving averages. Order: 5; 20; 60
Exponential moving average (EMA): A TI that smooths out price data by taking into account a specified number of past prices. Order: 5; 20; 60
Fast stochastic (FastK) 12,
Rate of change (ROC): A TI that measures the percentage change in an asset's price over a specified period
Relative strength index (RSI): A TI that measures the magnitude of recent price changes to evaluate overbought or oversold conditions
Simple moving average (SMA): A TI that smooths out price data by taking into account a specified number of past prices. Order: 5; 20; 60
Standard deviation (STDV): A TI that measures the variability of an asset's price over a specified period
Stochastic (Stoch): A TI that measures the momentum of an asset's price
Stop and reverse (SAR): A TI that identifies potential reversals in an asset's price
Triangular moving average (TRIMA): A TI that smooths out price data by taking into account a specified number of past prices and giving more weight to more recent prices. Order: 5; 20; 60
Williams' %R (R%): A TI that measures the momentum of an asset's price and indicates overbought or oversold conditions
Source(s): Table created by authors

Figure 1. Proposed method
Figure 2. The correlation matrix heatmap of the variables (MASI)
Figure 4. The correlation matrix heatmap of the variables (FTSE 100)

$f_t$: Forecast value in time period $t$; $n$: Number of periods forecasted. The evaluation metrics above give us an idea about the model's prediction performance; they compare the predicted close price with the real close price to indicate the accuracy of the model (Ecer et al., 2020).

Table 3. Selected features