A VBA module simulation for finding optimal lag order in time series models and its use on teaching financial data computation

In this study, a tool has been designed and developed for learning about the concept of lag order within a dynamic model, which can be used in any teaching classes on statistics and financial data computation. To showasolutionforacomplexandmulti-stepprocessoffindingtheoptimallagorderformultiplevariablesdataseriesbasedonaninformationcriterionamoduleusingVisualBasicforApplications(VBA)forMicrosoftExcel(MSExcel)isbeingdeveloped.Thismodulecanbeusedforestimatingamultivariatedynamicmodelas wellasdeterminingtheoptimallagorderofsuchamodel.


Introduction
The vector autoregression (VAR) model is regularly utilized by practitioners in the empirical analysis of time series data. This model is a seminal contribution by Sims [1], which allows for interaction between the variables in a multivariate sense. The VAR model is known for having good forecasting properties. It can also be used for cointegration and causality testing. Another advantage of this model is that it provides the possibility to generate impulse response functions and variance decomposition in order to capture the impact of an exogenous shock within the VBA module simulation in time series models system. A crucial issue within this context is to determine the optimal lag order of the VAR model for the reason that all inference based on this model is correct only if the optimal lag length is correctly specified. Thus, paying attention to the process of selecting the optimal lag order in the multivariate model before using the underlying model for empirical analysis is an important issue. Despite its well-known properties, there is currently no module in the MS-Excel, to the best knowledge, that can be used for estimating and determining the optimal lag order of the VAR model. Hence, the main objective of the current paper is to provide a consumer-friendly module in the MS-Excel that can be used for both estimating the parameters in the VAR as well as for determining the optimal lag order based on minimizing an information criterion.
There are numerous tools available to facilitate data analysis, which either make use of a general programming language for application development such as C [2], Cþþ [3], C# [4], Java [5], R [6], and Python [7] with the support of their relevant libraries such as Scikit-Learn [8] and NumPy [9] for Python and BLAS [10], LAPACK [11] and CUBLAS [12] for Cþþ; or more specialised statistical application software, such as Gauss [13], EViews [14], SPSS [15], Matlab [16] and Octave [17]. There is also various alternative application software available, which do not require any programming skills to develop an environment for advanced data analysis such as Weka [18]. As it has now become clear that by checking the long list of statistical application software and list of programs for developing such applications, a single point brightens up the argument in which all those applications require an environment for data processing. Microsoft Excel similar to Weka is not far from this end user interaction between machines and human cognitive thinking in the way of providing simulations [19,20].
It is a well-known fact that Microsoft Excel is used for small dataset-based projects, providing daily support to small business processes, as it works very well for analysing data and producing results in the form of numbers or charts as a visual reporting tool. All tools are handy and no advanced programming skill is required but knowing a few commands, which can be used to implement few logical and mathematical algorithms. However, there is still one fact remaining, which is the capability of this software when it comes to dealing with complex algorithmic data processing. Few institutions have developed a series of plugins to expand the power of this application so to provide more tools to data scientist and so to work with larger projects. A good list of such add-ins can be found at VeryViz website [21], such as FRED's data resource provider add-ins at [22], DataForecasting at [23] and SimTools and FormList at [24]. Few of those plug-ins are not free and they can be obtained with a payment paid to those institutions for developing such modules.
On the other hand, one of the major issues in instructional design is on the application of theoretical concepts and representation of its facts in a form which can be delivered to a learner. The approach of using chalk and board or e-smartboards have been used to draw pictures for visual representations of concepts, or a tactile approach (for example a mechanical movement of physical staff) to represent movements, the method has been used for thousands of years in teaching a child by his father or a mentor. In recent decades of advancements on computer hardware and software, this need has been extremely simplified and with the use of special software on a powerful computer any of most complicated models can easily be animated. However, this approach constrains the other end of electronic instructional design with an extreme demand for new applications of e-simulators. In this study, a simulation has been designed and developed that provides a detailed and simplified solution for presenting a complex mathematical calculation for asymmetric causality tests for multivariate autoregression model (VAR). A sample test also has been provided to add to the flexibility of module usage and ease of accessibility of its training module.

Theoretical background 2.1 Role of simulations software in education
It is well-known in the literature that an instructional designer 1 needs a set of skills on hardware and software combined with a pedagogical philosophy. But at the end, the content ACI matters and so to run a simulation software the content must be designed in such a way that the user (learner/researcher) should be able to mimic it any time with any suitable dataset [25][26][27]. Hence, in this research, we investigate the use of an algorithm named asymmetric causality test which is based on the optimal lag order determined by the information criterion, and design and develop a VBA module suitable for MS Excel 2013 so to help its user avoiding tasks of calculating the optimal lag order for a dataset. The reasoning behind the use of MS Excel is that the majority of office users have access to it and it is being used greatly for the small dataset on either as part of their daily business work or research.

Finding optimal lag order as a sample algorithm
The VAR is one of the most applied models in time series analysis since its introduction by Sims [1]. The mathematical idea comes from Granger [28] with massive literature on identifying causality between data with the use of the VAR model. Recently, asymmetric causality tests have been introduced in the literature by Hatemi-J [40] that separates the causal impact of positive changes from the negative ones. However, before testing for causality the optimal lag order needs to be selected in the VAR model. Hence, the Hatemi-J Criterion (HJC) [29] has been used to find the optimal lag order, which can also be used for forecasting purposes. 2 This model allows for dynamic and interaction between the underlying variables. It is simple to use and it has excellent forecasting properties. The VAR model can also be used for implementing causality tests. In addition, multivariate cointegration analysis is usually conducted within the context of a VAR model [30]. Impulse response functions and variance decompositions can also be estimated by utilizing the average movement representation of the VAR model. A crucial issue in this context is the correct selection of optimal lag order in the VAR model because all inference is based on the correctness of the selected lag order. Consider the following vector autoregressive model of order p, VAR(p): here v represents a n 3 1 vector of intercepts and u t is a n 3 1 vector of error terms. The A r is a n 3 n matrix of coefficients for lag order r (p 5 1, . . ., p-max). X t is an n 3 1 vector of variables that are quantified across time.
It is widely agreed that lagged correlations are known elements of data science to find a relationship between two or more time series of events. In nature, many factors could be involved in affecting each time series, whether sooner or later. Understanding lagging correlations is a major branch of data science which can present new findings on any multidimensional time series data analysis [31]. Asymmetric causality test as a sample of the algorithm is considered as an advanced level of lagged correlations studies [29,32].
Eq. 2: Formula for calculating Optimal Lag Order.
Where j b Ωpj is the determinant of the estimated variance-covariance matrix of the error terms in the VAR model based on lag order p. The denotation k represents the number of features or equations in the VAR model and T is the number of observations. The lag order that minimizes the HJC information criterion is the optimal lag order. For simulation results pertinent to this issue see [29,32] and understanding stepwise of VAR(p) calculations the reader is referred to [34].

Methodology
A VBA based module is being developed and used as an extension to MS Excel 2013 spreadsheet. The algorithm for developing such application was already presented as per mentioned in equation Eq. (2). The pseudocode for writing the code for this program is VBA module simulation in time series models devised into three main sections (Appendix A). Reading data from each sheet, retrieving data for processing, and publishing its outcome into new sets of sheets. To create a new sheet and add records to it in multiple turns depending on the number of sub-calculations needed for different sections of the formula in Eq. (2): The software produces a series of sheets which contain individual steps of manipulating variables as per denotations on Eq. (2). Each group of sheets marked by its relevant lag such as Y_1 for identifying Y values and Z_All_5_T which is a transposed collection of all Z matrices for lag 5.
Reading and writing of the first section of the module which is labelled on the dashboard (Figure 1) as "Determine Optimal Lag -Detailed" is using a series of for-loops to while interacting with individual sheets. On the second method which is labelled as "Determine Optimal Lag -No Details" uses the array techniques; and since this method does not involve reading and writing approach of data from sheets, the calculation of those formulae are much faster in comparison to the first one. The tricky part of this method was on identifying the start and ending points of each loop so to support calculations of each section of the formulae indicated on Eq. (2) and appendix B.

Development
The VBA (from Microsoft Excel) module for calculating the optimal lag order which is a combination of two types of outputs (see Figure 1) was built [35]. The slow version which supports learning how data are being calculated, and the fast version which provides only the last result for HJC and a visual representation of optimal lag order.
The first button labelled as "Load Sample Data" is a sample set of exemplary data from Chinese and US market, which contains 65 sets of observations. The dataset included quarterly gross domestic product (GDP) data from both China and US between 1991 and 2015 obtainable from FRED database provided by Federal Reserve Bank of St. Louise and it is available online [36]. Now if a researcher is interested in viewing all calculations done to calculate each part of HJC equation from Eq. (2) on individual sheet separately, s/he should use the "Determine Optimal Lag -Detailed" button. It contains all necessary calculations for finding Optimal lag order, which might be a suitable tool for learning about the data computation process of the variance and covariance matrices for different lag orders and other calculations such as detailed results of the HJC values for different lag orders and its minimum value on a sheet labelled as "HJC_All_and_min" (see Figure 3).

ACI
The outcome of HJC calculations for different lags are given in a table named "HJC_All_and_min_arr" (Figure 2) to show the details of why the indicated value is a minimum value. Figures 4-6 give other views of the module with presenting the last outcome of the calculation with estimated parameters of the VAR model for the found lag in Figure 6.
In addition to that in Figure 7, we can clearly see the curve HJC values created based on their lags (values of "p" in the Eq. (2)) and the optimal lag at the bottom of its curve which is the minimum value for the range of HJC values selected prior to calculations. As a sample of data analysis, the maximum lag order needs to be selected a prior. We select 15 as the maximum lag order in this application. But if we are not careful and choose to have a larger value for lag with a small number of observations, a disturbed curve will be drowned as sampled in Figure 8.
This test can be applied as many times as a teacher (researcher) would like to, in order to show possible variations of a maximum number of lag order to have for obtaining a minimum set of information criterion; which in this case clearly depends on the number of observations.

Results and discussions
As part assessment of this module, a series of 65 observations from stock markets of the US and China were chosen. Based on the minimization of the information criterion, the optimal lag order of 3 was found. As part of vector auto-regression analysis of data this new module clearly can support this data analysis as an add-in to capabilities of MS Excel. After selecting the optimal lag order, the VAR model can be used for multivariate cointegration analysis [37], conducting causality tests [28,38] and estimating the impulse response functions and variance decompositions [39,40].  Figure 7 clearly shows the optimal lag value of min (HJC lag orders) 5 3 as the asymmetric causality between both the US and China markets. The VAR model used in this study uses a set of data on US and China financial markets which can be acquired at FRED database provided by Federal Reserve Bank of St. Louise and it is available online [36]. The VBA module developed for MS Excel (this module was tested on MS Excel 2013), can find the optimal lag order in a vector autoregressive model based on minimization of an information criterion. Simulation results show that this information criterion is successful in correctly finding the optimal lag order if the underlying variables are non-stationary and have unit roots. It is also robust to autoregressive conditional heteroscedasticity (ARCH) [41] effects and it has good forecasting properties. For the technical description please refer to [29,32].

VBA module simulation in time series models
Furthermore, as part of the delivery of knowledge in the form of instructional design, simulation in teaching plays an important factor on learners understanding of the process of a complex mathematical calculation, which is the core part of empirical research. This new tool has used MS Excel spreadsheet, a well-known application software which is common in every office to produce a simulation module for converting theoretical concepts into practical data and visual aid to be used in classrooms and research labs.   Visual representation of optimal lag order curve for maximum 30 lags, which clearly shows the concept of over-using the number of lags and wrong doing of data analysis.

Conclusions and future work
This paper introduces a module [35] for estimating and determining the optimal lag order in the VAR model, which is developed in the Visual Basic for Applications (VBA) for Microsoft Excel. It is widely agreed that MS-Excel is among popular applications used in academia and industry and it has the capability of data processing for small sets of data while its performance is of up to the expectation [19,20]. There are numerous other applications that exist such as Gauss, EViews and Octave but there is a major drawback on working with those applications and its usage while working with a small set of data based on a complex algorithm. These mentioned applications usually require an expert understanding of those tools and there are not many modules available to automate those processes. Another advantage of MS Excel is that it is widely accessible to practitioners. The core component of this newly developed module, which is a unique module on its form, produces a series of sheets with processed data as its outcome. The application is easy to work with and the outcome is understandable in two formats of quantitative measures (for example, the sheet labelled as HJC_All_and_min) and a chart (for example, labelled as Optimal Lag Chart) holding a visual representation of a series of lag orders and its optimal lag value assigned on the chart.
After choosing the optimal lag order the interested researcher has the possibility of conducting tests for cointegration or causality. It is also possible to use the underlying model for estimating the impulse response functions and the variance decompositions in order to capture the dynamic interaction between the underlying variables.
A copy of this application for downloading is available at [35] and any type of users (students, faculties, researchers and practitioners) can use it as long as it relates to the concept of asymmetric causality for multiple variables which can lead to the Wald test (as an example).
This application is used to show the powerfulness of MS Excel modules in data science for small projects similar to other advanced applications used for developing such data analysis in Gauss, EViews, and Octave. There is one point that needs to be clarified, which is the limitations of this module. Based on the limitations set by MS-Excel, it is only capable of processing 1048575 rows (one for headers). In terms of other factors, the limitation of 52 3 52 rows for inversing matrixes within this module based on the MS-Excel prior to earlier VBA module simulation in time series models versions than 2007 has been removed [42]. Hence, the maximum lag value that practitioners are interested in usually (which requires the calculation of the inversed matrices) would rarely need the use of any value reaching this limitation. It should be mentioned that the default value is set to five. Thus, for a small project of datasets, the only limitation a researcher faces is the available memory on the computer. A potential extension of this module is to make the underlying analysis for panel data operational such as the approach developed by Hatemi-J [43]. Panel data combines the cross-sectional dimension with the time series dimension and thereby it creates higher degrees for freedom.
In the field of e-learning systems the term of teacher has been being replaced by 'Instructional Designer' since the middle of last century. 2 There are also other information criteria that can be used for this purpose. For simulation results pertinent to the performance of different information criteria see, among others, Hacker and Hatemi-J [33].
Calculations of matrices and their setups (adapted from [34]) This is a list of variables used to create an individual sheet for importing, processing and producing a suitable outcome for the next part of the calculations.
Then, presenting the list of the values for the HJC in a sheet and a chart as well as finding the minimum value between them, which provides the optimal lag order for the set data.
Corresponding author Alan Mustafa can be contacted at: alan.mustafa@ieee.org For instructions on how to order reprints of this article, please visit our website: www.emeraldgrouppublishing.com/licensing/reprints.htm Or contact us for further details: permissions@emeraldinsight.com VBA module simulation in time series models