Forecasting stock price movement: new evidence from a novel hybrid deep learning model

Purpose – This study explores whether a new machine learning method can more accurately predict the movement of stock prices.
Design/methodology/approach – This study presents a novel hybrid deep learning model, Residual-CNN-Seq2Seq (RCSNet), to predict the trend of stock price movement. RCSNet integrates the autoregressive integrated moving average (ARIMA) model, a convolutional neural network (CNN) and a sequence-to-sequence (Seq2Seq) long short-term memory (LSTM) model.
Findings – The hybrid model is able to forecast both the linear and non-linear time-series components of a stock dataset. CNN and Seq2Seq LSTMs can be effectively combined for dynamic modeling of short- and long-term-dependent patterns in non-linear time-series forecasting. Experimental results show that the proposed model outperforms baseline models on the S&P 500 index dataset from January 2000 to August 2016.
Originality/value – This study develops the RCSNet hybrid model to tackle the challenge by combining both linear and non-linear models. New evidence has been obtained in predicting the movement of stock market prices.


Introduction
It is widely acknowledged that predicting the trend of stock price movement is a difficult financial problem. A relatively precise prediction of a stock's future price movement can maximize investors' profits. However, because of the variability and instability of the stock market, it remains an open question.
Traditional stock forecasting methods mainly fall into two categories: technical analysis and fundamental analysis. Fundamental analysis examines a company's financial environment, operations, and macroeconomic and microeconomic indicators to predict its stock price. In recent years, with the massive growth of internet content, the development of natural language processing (NLP) techniques has enabled investors to capture market movement trends online. However, the quality of online content about the stock market is not guaranteed; low-quality content, and even fake news and comments, cannot be excluded. As a result, fundamental-analysis-based methods are difficult to model. Therefore, this work focuses on technical analysis methods.
Technical analysis methods are based on historical time-series data, and stock price movement prediction is treated as a time-series forecasting problem. Rather than forecasting the raw series directly, stock price time-series data are usually decomposed into two components, a linear one and a non-linear one; hence, the linear and non-linear forecasts can be studied separately. Classical linear time-series models, including the vector autoregression (VAR) model and the autoregressive integrated moving average (ARIMA) model (Lütkepohl, 2005), have proved effective for linear time-series forecasting in fields such as economics (Hamilton, 1989) and power price forecasting (Contreras et al., 2003), but perform poorly on non-linear forecasts. On the other hand, traditional non-linear forecasting models, such as the support vector machine (SVM) (Hossain et al., 2009) and the back-propagation (BP) neural network (NN) (Maier and Dandy, 2000), specialize in describing non-linear data, while performing worse on linear data and on long-term time-series forecasting.
A successful stock time-series forecasting model should satisfy two requirements. First, it should suit both linear and non-linear data, since stock movement data contain both. Second, it should capture multi-frequency (short- and long-term) patterns so that the non-linear part can be predicted accurately.
In recent years, driven by the fast growth of computing power and massive data, deep learning methods have been widely adopted in speech recognition (Hinton et al., 2012), image classification, machine translation and other areas. These methods are built from derivatives of the artificial neural network (ANN), such as the convolutional neural network (CNN) and the recurrent neural network (RNN). CNN models achieve outstanding image recognition performance by extracting local features at various granularity levels from input images. The RNN (as shown in Figure 1(a)) is a type of non-linear model that can, moreover, model long-term dependencies, which makes it well suited to both non-linear data and long-term dependencies. However, long-term dependencies are hard for a traditional RNN to detect because it suffers from the gradient vanishing problem (Bengio et al., 2002). To solve this problem, long short-term memory (LSTM) units (as shown in Figure 1(b)) (Hochreiter and Schmidhuber, 1997) and the gated recurrent unit (GRU) (Chung et al., 2014) have achieved great success in various domains such as computer vision and NLP.
Building on the recent success of deep learning, we develop the Residual-CNN-Seq2Seq (RCSNet) hybrid model to tackle this challenge by combining linear and non-linear models. As shown in Figure 2, the model consists of an ARIMA layer, a CNN layer, a Seq2Seq LSTM layer and a fully connected (FC) layer. Hence, the model is capable of forecasting both linear and non-linear time series and can process the different frequency patterns separately in the Seq2Seq LSTM layer. Based on these characteristics, we apply it to forecast the trend of stock movement.
The rest of this work is organized as follows. Section 2 presents the preliminaries and related works. Section 3 provides the details of the RCSNet model proposed in this study. Section 4 compares the results of our model with those of traditional models. Section 5 presents the conclusion and future work.

Preliminary and related works
This study primarily involves three groups of models: linear models, non-linear models and hybrid models. Accordingly, traditional linear, non-linear and hybrid models for stock movement forecasting are reviewed in this section.
2.1 Autoregressive model
The ARIMA model is one of the most commonly used autoregressive models and has been extensively studied (Hamilton, 1989; Contreras et al., 2003). Such linear models can be applied effectively to forecast the behavior of economic and financial series (Tsay, 2005; Sims, 1980). The ARIMA model is generally expressed as ARIMA(p, d, q), where p, d and q are non-negative integer parameters: p is the number of time lags of the autoregressive part (the order of the AR terms), d denotes the order of differencing, and q indicates the order of the moving-average part. The ARIMA model is formulated as:

x_t = \sum_{i=1}^{p} \phi_i x_{t-i} + \sum_{j=1}^{q} \theta_j e_{t-j} + e_t

where \phi_i denotes a parameter of the autoregressive part of the model and \theta_j is a parameter of the moving-average part. In addition, e_t is the error term at time t, and the e_{t-j} are past error terms, usually assumed to be independent samples from a normal distribution with zero mean. x_t represents the forecast of the autoregressive method.

2.2 Back-propagation neural network model
The BP neural network (BPNN) has been applied to stock forecasting (Göçken et al., 2016). In this study, we assume that the BPNN is composed of non-linear layers with the sigmoid function \sigma(\cdot) as the activation function. We design three layers, including an input, a hidden and an output layer, with each inner layer fully connected to the previous one:

h_i = \sigma(W_i h_{i-1} + b_i)

where W_i is the weight (parameter) matrix for the ith layer and b_i is its bias term. After training, h_i represents the forecast of the BPNN model.

2.3 Recurrent neural network model
Recurrent neural networks have also been applied to time-series forecasting (Kuan and Liu, 1995). An adaptive "forget gate," presented by F. Gers, allows an LSTM cell to learn to reset itself at a suitable time, thereby releasing internal resources during forecasting (Gers et al., 2000).

JABES 29,2
2.3.1 Simple recurrent neural network model. A simple RNN model can deal with a time series of inputs by utilizing its internal memory. At each time step, the input is propagated in standard feed-forward fashion, while fixed back-connections cause the context units to hold a copy of the previous hidden values. The equations are given by:

h_t = \sigma(W x_t + U h_{t-1} + b)
y_t = \sigma(W_y h_t + b_y)

In the model, X = (x_1, x_2, ..., x_t), h_t and y_t represent the input vector, hidden vector and output vector (prediction), respectively; (W, U, b) are the parameter matrices and bias vector, with the output side using its own parameters (W_y, b_y). Moreover, \sigma is the activation function, and y_t denotes the prediction output.
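The recurrence above can be sketched in a few lines of NumPy. The dimensions, the tanh activation and the linear read-out are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Illustrative dimensions: 1 input feature, 4 hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 1, 4
W = rng.normal(scale=0.1, size=(n_hid, n_in))   # input-to-hidden weights
U = rng.normal(scale=0.1, size=(n_hid, n_hid))  # hidden-to-hidden (recurrent) weights
b = np.zeros(n_hid)                             # bias
W_y = rng.normal(scale=0.1, size=(1, n_hid))    # hidden-to-output weights

def rnn_step(x_t, h_prev):
    """One recurrence: h_t = tanh(W x_t + U h_{t-1} + b)."""
    h_t = np.tanh(W @ x_t + U @ h_prev + b)
    y_t = W_y @ h_t          # linear read-out (prediction)
    return h_t, y_t

# Unroll over a toy series: the hidden state carries past context forward.
series = np.array([[0.1], [0.2], [0.15], [0.3]])
h = np.zeros(n_hid)
for x_t in series:
    h, y = rnn_step(x_t, h)
```

The same hidden state `h` is threaded through every step, which is exactly what lets the network retain a memory of earlier inputs.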
2.3.2 Simple long short-term memory model. The simple LSTM model is composed of LSTM units. An LSTM unit contains a memory cell, an input gate, an output gate and a forget gate. The equations are given by:

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t
h_t = o_t \circ \tanh(c_t)

where the W and U matrices gather the weights of the input and recurrent connections, respectively; the input gate i_t, output gate o_t and forget gate f_t are computed with the sigmoid activation \sigma; \tilde{c}_t indicates the updated (candidate) cell state; c_t is the memory cell; and h_t denotes the output of the model used for forecasting.
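A minimal NumPy implementation of one LSTM update, following the standard gate equations. The weight shapes and the convention of stacking input and previous hidden state are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update: gates, candidate cell state, new cell and hidden state."""
    z = np.concatenate([x_t, h_prev])            # stack input and previous hidden state
    f = sigmoid(p["Wf"] @ z + p["bf"])           # forget gate
    i = sigmoid(p["Wi"] @ z + p["bi"])           # input gate
    o = sigmoid(p["Wo"] @ z + p["bo"])           # output gate
    c_tilde = np.tanh(p["Wc"] @ z + p["bc"])     # candidate cell state
    c = f * c_prev + i * c_tilde                 # new memory cell
    h = o * np.tanh(c)                           # new hidden state / output
    return h, c

rng = np.random.default_rng(1)
n_in, n_hid = 1, 3
p = {k: rng.normal(scale=0.1, size=(n_hid, n_in + n_hid))
     for k in ("Wf", "Wi", "Wo", "Wc")}
p.update({k: np.zeros(n_hid) for k in ("bf", "bi", "bo", "bc")})

h = c = np.zeros(n_hid)
for x_t in ([0.1], [0.4], [0.2]):
    h, c = lstm_step(np.array(x_t), h, c, p)
```

The additive cell update `c = f * c_prev + i * c_tilde` is what lets gradients flow over long horizons, mitigating the vanishing-gradient issue described above.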

2.4 Hybrid model
A hybrid model combines two or more base models of various kinds to obtain a better model, and such models have been extensively researched. Pai and Lin (2005) studied a hybrid model combining ARIMA with SVMs on time-series data, proposing a hybrid methodology that exploits the respective strengths of the ARIMA and SVM models to predict stock prices. Jain and Kumar (2007) developed a hybrid time-series NN model that combines the advantages of traditional time-series methods and ANNs. Nayak et al. (2015) proposed a hybridized framework of SVM with a K-nearest-neighbor approach for predicting Indian stock market indices. Gers et al. (2001) designed a hybrid model that combined a time-window-based MLP with an LSTM for forecasting: the MLP was trained first, its weights were then frozen, and finally the LSTM was employed to reduce the forecast error.
As discussed above, the related works ignore the following: (1) Stock time-series data have both linear and non-linear dependency components.
(2) The non-linear component of stock time-series data contains both long- and short-term patterns.
Hence, we design a hybrid model that addresses these problems to forecast the trend of stock movement.

Residual-CNN-Seq2Seq model
This section discusses the details of the model for stock time-series prediction. RCSNet comprises the ARIMA, CNN, Seq2Seq LSTM and FC layers. The objective function and the optimization strategy are also discussed.

Problem statement and algorithm
The problem is typically described as follows: given a target series X = (x_1, x_2, ..., x_t), we train a model that learns the internal rule mapping the observed values of the target series to the predicted value, X_predict = F(x_1, x_2, ..., x_t). Finding the most suitable parameters for RCSNet is crucial for stock movement forecasting. RCSNet first extracts the linear dependence component with the ARIMA model. The residual error series (the target data minus ARIMA's forecast output) is then treated as the non-linear component. The CNN layer extracts short- and long-term trading patterns, after which the Seq2Seq layer predicts the residual error series of the different patterns to generate the non-linear intermediate forecasts. Finally, the FC layer combines the linear and non-linear intermediate results into the final forecast.
Generally, RCSNet can be described as follows. The linear model takes inputs up to time step t and produces the output \hat{x}_{t+h}, the ARIMA forecast at horizon h; x_{t+h} denotes the actual value at time t + h. The linear model's residual is e_{t+h} = x_{t+h} - \hat{x}_{t+h}. This residual contains multi-frequency trading patterns, and the CNN layer extracts the sub-frequency patterns. A Seq2Seq LSTM is then used to model the non-linear residuals, taking the sub-frequency residual series as input; its objective is to predict the error \hat{e}_{t+h} that the linear model will make in its next forecast at time step t + h. The final model output is generated by combining the residual forecast \hat{e}_{t+h} with the linear forecast \hat{x}_{t+h}.
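The overall pipeline can be sketched as below. The `linear_forecast` and `nonlinear_forecast` functions here are deliberately trivial stand-ins (a persistence forecast and a residual mean), used only to show how the linear forecast, the residual series and the combined output relate to each other; they are not the actual ARIMA, CNN or Seq2Seq components:

```python
import numpy as np

def linear_forecast(x):
    """Stand-in for the ARIMA layer: a naive persistence forecast."""
    return x[-1]

def nonlinear_forecast(residuals):
    """Stand-in for the CNN + Seq2Seq LSTM layers: predict the next
    residual as the mean of past residuals."""
    return residuals.mean()

def rcsnet_step(history):
    # 1. Linear component: one-step-ahead forecast at each past point.
    linear_preds = np.array([linear_forecast(history[: k + 1])
                             for k in range(len(history) - 1)])
    # 2. Non-linear component: residual series (target minus linear forecast).
    residuals = history[1:] - linear_preds
    # 3. Forecast the next value and the next residual, then combine
    #    them (the role of the FC layer).
    x_hat = linear_forecast(history)
    e_hat = nonlinear_forecast(residuals)
    return x_hat + e_hat

history = np.array([10.0, 10.5, 10.2, 10.8, 11.0])
prediction = rcsnet_step(history)
```

The key structural point is that the non-linear model is trained on the linear model's errors, not on the raw prices.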

Autoregressive integrated moving average layer
RCSNet uses the ARIMA model as the linear filter, and the Seq2Seq LSTM model is trained on the residual of the linear model (the non-linear component). As Figure 3 shows, the ARIMA filter separates the linear component from the historical stock data series, leaving the non-linear data series.
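As a hedged illustration of the linear-filter idea (not the paper's ARIMA implementation), a plain AR(p) model can be fitted by least squares and its in-sample residuals taken as the non-linear component:

```python
import numpy as np

def fit_ar(series, p=2):
    """Least-squares fit of an AR(p) model with intercept; returns the
    coefficients and the in-sample residuals (the 'non-linear' component)."""
    # Lagged design matrix: row k holds [x_{k+p-1}, ..., x_k, 1].
    X = np.column_stack([series[p - i - 1: len(series) - i - 1]
                         for i in range(p)] + [np.ones(len(series) - p)])
    y = series[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return coef, residuals

rng = np.random.default_rng(2)
# A synthetic, mostly linear AR(1)-like series plus small noise.
x = np.empty(200)
x[0] = 1.0
for t in range(1, 200):
    x[t] = 0.8 * x[t - 1] + 0.05 * rng.normal()

coef, resid = fit_ar(x, p=1)
```

The fitted coefficient recovers the generating dynamics (about 0.8), and `resid` is the zero-mean remainder that the downstream non-linear layers would model. In practice a library routine such as `statsmodels`' ARIMA would handle the differencing and MA terms as well.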

Convolutional neural network layer
The second layer of RCSNet, designed to extract short- and long-term patterns along the time dimension, is a convolutional layer without pooling. It comprises multiple filters of height w, set equal to the number of variables; the k-th filter sweeps through the input matrix X. Long-term patterns reflect seasonal and monthly trading frequencies, while short-term patterns express weekly and daily trading frequencies. Taking these different trading-frequency patterns into consideration makes precise forecasting of the time series more likely.
The k-th filter output is

h_k = \mathrm{ReLU}(W_k * X + b_k)

where * indicates the convolution operation, and the output h_k is a vector with a single column in our work. Each vector h_k is kept at length T by zero-padding the input matrix X on the left. The size of the convolutional layer's output matrix is m_c × T, where m_c is the number of filters.
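The left-zero-padded convolution with ReLU can be sketched for a single one-dimensional filter; the two-tap smoothing filter is an illustrative choice:

```python
import numpy as np

def causal_conv1d(x, w, b=0.0):
    """1-D convolution with left zero-padding, so the output keeps
    length T and position t only sees inputs up to time t."""
    k = len(w)
    x_pad = np.concatenate([np.zeros(k - 1), x])  # zeros padded on the left
    out = np.array([x_pad[t: t + k] @ w[::-1] for t in range(len(x))])
    return np.maximum(out + b, 0.0)               # ReLU activation

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.5])          # a 2-tap smoothing filter
h = causal_conv1d(x, w)
```

Because the padding is only on the left, the output at each time step never depends on future values, which is what a forecasting model requires.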

Seq2Seq long-short-term memory layer
Inspired by the success of machine translation, we recognize the power of the Seq2Seq model in NLP. A standard Seq2Seq model has two crucial components: an encoder and a decoder. The former maps the source input x to a vector representation, while the latter produces an output series based on that representation. Both the encoder and the decoder are LSTMs. By transmitting the last memory state of the encoder to the decoder as its initial memory state, the decoder is capable of accessing information from the encoder. The input and output sides generally use different LSTMs with their own parameters so as to capture different compositional patterns. We apply a Seq2Seq LSTM model as the third layer to address the non-linear time-series forecasting problem. Figure 4 shows the model, which consists of an encoder and a decoder: in the encoder, an input LSTM reads in the series data; in the decoder, an output LSTM decodes the encoder's hidden states across all preceding time steps.
3.4.1 Encoder. The encoder module is essentially an LSTM that encodes the input series into a characteristic representation. Given N input series X = (x_1, x_2, ..., x_t) ∈ R^{N×T}, where t denotes the window size of the sequence, the encoder learns a mapping from x_t to h_t at time step t, h_t = f_enc(h_{t-1}, x_t), where h_t ∈ R^M is the encoder hidden state at time t, M is the size of the hidden state, and f_enc is a non-linear activation function. At time t, each LSTM unit has a memory cell with state c_enc(t), and \tilde{c}_enc(t) indicates the updated (candidate) cell state for the encoder. Three sigmoid gates control access to the memory cell: the forget gate f_enc(t), the input gate i_enc(t) and the output gate o_enc(t). The LSTM unit update is summarized as:

c_enc(t) = f_enc(t) \circ c_enc(t-1) + i_enc(t) \circ \tilde{c}_enc(t)
h_enc(t) = o_enc(t) \circ \tanh(c_enc(t))

3.4.2 Decoder. The decoder is a feed-forward neural language-model-style network that generates the next data series based on the previously generated data and the encoder states. It is trained to produce the next data series (the intermediate predicted result) for the FC layer, given the previous state of the encoder. Importantly, the decoder uses the encoder's hidden state vector as its initial state, from which it gets its initial information. Effectively, the decoder learns to generate targets h^dec_n given the input series (∅, h^dec_1, ..., h^dec_{t-1}), with cell state c^dec_t and, in particular, c^dec_0 = h^enc_t, conditioned on the input series; \tilde{c}_dec(t) indicates the updated cell state for the decoder, where t is the window size of the output series. The outputs of the Seq2Seq LSTM layer are h^dec_1, h^dec_2, ..., h^dec_t.
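The encoder-decoder state handoff (c^dec_0 = h^enc_t) can be illustrated with plain tanh recurrent cells standing in for the LSTMs; all parameter shapes here are illustrative, and the decoder feeds its own previous prediction back in as input:

```python
import numpy as np

rng = np.random.default_rng(3)
n_hid = 4

def cell(x_t, h_prev, W, U, b):
    """A plain tanh recurrent cell (an LSTM would be used in RCSNet)."""
    return np.tanh(W * x_t + U @ h_prev + b)

# Separate parameters for encoder and decoder, so each side can
# capture its own compositional patterns, as described above.
W_enc, W_dec = rng.normal(scale=0.3, size=(2, n_hid))
U_enc, U_dec = rng.normal(scale=0.3, size=(2, n_hid, n_hid))
b_enc = b_dec = np.zeros(n_hid)
W_out = rng.normal(scale=0.3, size=n_hid)  # decoder read-out weights

def seq2seq(inputs, out_steps):
    # Encoder: fold the whole input series into a final hidden state.
    h = np.zeros(n_hid)
    for x_t in inputs:
        h = cell(x_t, h, W_enc, U_enc, b_enc)
    # Decoder: start from the encoder's last state and feed back
    # each previous prediction as the next input.
    y, outputs = 0.0, []
    for _ in range(out_steps):
        h = cell(y, h, W_dec, U_dec, b_dec)
        y = W_out @ h
        outputs.append(y)
    return np.array(outputs)

preds = seq2seq([0.1, -0.2, 0.05], out_steps=3)
```

The decoder never sees the raw inputs directly; everything it knows about the input series is carried in the state handed over by the encoder.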

Fully connected layer
The FC layer has two hidden layers comprised of N rectified linear units (ReLUs) (Nair and Hinton, 2010). Each unit in the hidden layers is fully connected to the previous layer.
Each hidden layer computes

h_i = \mathrm{ReLU}(W_i h_{i-1} + b_i)

where W_1 is the weight matrix for the first hidden layer, the W_i are the matrices for all subsequent layers, and b_i is the bias. The layer receives the linear forecast series from the ARIMA filter and the non-linear forecast series from the Seq2Seq LSTM layer, and jointly generates the final forecast results h_i from these linear and non-linear intermediate outputs.
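A sketch of the FC layer's forward pass, assuming the two intermediate forecasts are simply concatenated before the two ReLU hidden layers. The layer sizes here are illustrative, not the 32/64 units used in the experiments:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fc_forward(linear_pred, nonlinear_pred, params):
    """Combine the ARIMA (linear) and Seq2Seq (non-linear) intermediate
    forecasts through two ReLU hidden layers and a linear output."""
    h = np.concatenate([np.atleast_1d(linear_pred),
                        np.atleast_1d(nonlinear_pred)])
    h = relu(params["W1"] @ h + params["b1"])      # first hidden layer
    h = relu(params["W2"] @ h + params["b2"])      # second hidden layer
    return float(params["W3"] @ h + params["b3"])  # final forecast

rng = np.random.default_rng(4)
params = {
    "W1": rng.normal(scale=0.2, size=(8, 2)), "b1": np.zeros(8),
    "W2": rng.normal(scale=0.2, size=(4, 8)), "b2": np.zeros(4),
    "W3": rng.normal(scale=0.2, size=(1, 4)), "b3": np.zeros(1),
}
y_hat = fc_forward(11.0, 0.25, params)
```

Because the combination is learned rather than a fixed sum, the FC layer can weight the linear and non-linear parts differently depending on the data.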

Objective function and optimization strategy
To adjust the parameters and evaluate the results, we adopt the squared error as the loss function. In our model, the corresponding optimization objective is formulated as

\min \sum_n (Y_n - \hat{Y}_n)^2

where Y_n is the actual (target) data and \hat{Y}_n is the final prediction. Our optimization strategy is similar to that of traditional time-series prediction models. Thus, the objective of this research becomes a regression task over a group of feature-value pairs (\hat{Y}_n, Y_n) and can be optimized by stochastic gradient descent (SGD) or its variants, such as the Adam algorithm (Kingma and Ba, 2014).
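The squared-error objective and a plain SGD update can be illustrated on a one-parameter regression toy (the model y_hat = w * x and the learning rate are illustrative):

```python
import numpy as np

def squared_error_loss(y_true, y_pred):
    """L = sum_n (Y_n - Yhat_n)^2, the objective above."""
    return float(np.sum((y_true - y_pred) ** 2))

# One-parameter model y_hat = w * x, fitted by gradient descent:
# w <- w - lr * dL/dw.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])      # true relation: y = 2x
w, lr = 0.0, 0.05
for _ in range(200):
    y_hat = w * x
    grad = np.sum(2.0 * (y_hat - y) * x)   # dL/dw for the squared error
    w -= lr * grad
loss = squared_error_loss(y, w * x)
```

In practice an adaptive variant such as Adam replaces the fixed learning-rate step, but the gradient of the same squared-error objective drives the update either way.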

Experiments
In this section, we conduct a comprehensive set of experiments, present the experimental details and compare our results with the baselines.

Dataset
We choose the S&P 500 index as our dataset. It is an American stock market index covering the market capitalizations of 500 major corporations with ordinary shares listed on the NYSE or NASDAQ, and it is representative among international stock price indices. The dataset, downloaded from the Yahoo Finance database, covers roughly 16 years of trading price movement from 2000-01-02 to 2016-12-07 and consists of 4,262 observations of daily close prices, as shown in Figure 5. The study splits the data into two consecutive segments: one (70%) for training and the other (30%) for testing.
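The consecutive 70/30 split (no shuffling, so temporal order is preserved) can be written as below; the synthetic `prices` array is only a stand-in for the 4,262 S&P 500 closes:

```python
import numpy as np

def consecutive_split(series, train_frac=0.7):
    """Split a time series into consecutive train/test segments,
    preserving temporal order (no shuffling)."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

prices = np.arange(4262, dtype=float)   # stand-in for the daily close prices
train, test = consecutive_split(prices, 0.7)
```

Keeping the segments consecutive matters for time series: a random split would leak future information into training.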

Evaluation
This research adopts two metrics to assess the performance of the different methods on stock movement time-series prediction: the root mean squared error (RMSE) (Plutowski et al., 1996) and the mean absolute error (MAE), both scale-dependent measures. Specifically, assuming y_t and \hat{y}_t are the target and predicted values at time t, respectively, RMSE and MAE are calculated as:

\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} (y_t - \hat{y}_t)^2}, \qquad \mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} |y_t - \hat{y}_t|
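The two metrics, written directly in NumPy:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - yhat)^2))."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean(|y - yhat|)."""
    return float(np.mean(np.abs(y_true - y_pred)))

y = np.array([10.0, 12.0, 11.0])
y_hat = np.array([9.0, 12.0, 13.0])
```

RMSE penalizes large errors more heavily than MAE because of the squaring, which is why the two metrics can rank models differently.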

Training and testing
Training proceeds as follows. First, the dataset is normalized (zero mean and unit variance, scaled between 0 and 1). We set the batch size to 90, so one training sample contains 90 days of index data. Next, we set ARIMA(p, d, q) to ARIMA(5, 1, 0) and use the ARIMA layer to extract the linear component of the data as the linear intermediate output.
Third, the residual data are passed through four CNN filters to capture one-day, three-day, seven-day (weekly) and 14-day (half-month) patterns. After that, we use a single-layer LSTM with 128 units for the Seq2Seq encoder and a single-layer LSTM with 128 units for the Seq2Seq decoder to generate the non-linear intermediate predictions from the four frequency components extracted by the CNN. Finally, we integrate the linear intermediate output and the four non-linear intermediate predictions into the final prediction using the FC layer, whose two hidden layers comprise 32 and 64 ReLUs, respectively. We set the learning rate to 0.001 and the number of training epochs to 1,000. All parameter matrices, including W_enc, U_enc, b_enc and W_dec, U_dec, b_dec, are randomly initialized.
Testing proceeds as follows. First, the test set (the remaining 30% of the data) is normalized in the same way as the training data. Then, to demonstrate the effectiveness of the RCSNet model, we compare it with three baseline models: ARIMA(5, 1, 0), BPNN (three layers) and a simple LSTM (128 units). We perform one-, three-, seven- and 14-day-ahead prediction for all the test variables with the four models, i.e. the prediction horizon h (length of time steps) is 1, 3, 7 and 14.

Results
First, we compare the results of the various methods for one-, three-, seven- and 14-day-ahead prediction. As Table 1 shows, when the time step h is set to 1, the ARIMA model achieves an MAE of 10.939 and an MAPE of 15.108, outperforming the other methods. However, as the time step h increases, RCSNet outperforms the average performance of the baseline models. For longer time steps: when h is 3, RCSNet improves on the baseline average by 81.17%; when h is 7, by 78.72%; and when h is 14, by 74.2%. RCSNet thus clearly performs better, as shown in Figure 6. In summary, the experiments confirm the effectiveness of our model: (1) The hybrid model is able to forecast both the linear and non-linear time-series components of the stock dataset.
(2) CNN and Seq2Seq LSTMs can be effectively combined for dynamic modeling of short- and long-term-dependent patterns in non-linear time-series forecasting.
(3) The experimental results show that our model outperforms the baseline models on the S&P 500 index dataset from January 2000 to August 2016.
For future research, we will take into account social media information related to a company's products and economic environment, and study how to filter "fake" information, so that predictions can draw not only on technical analysis but also on fundamental analysis using NLP tools. Besides, we will use time-series cross-validation rather than a simple split for time-series experiments.