An overlapping sliding window and combined features based emotion recognition system for EEG signals

Purpose – The purpose of this study is to propose an alternative efficient 3D emotion recognition model for variable-length electroencephalogram (EEG) data. Design/methodology/approach – The classical AMIGOS data set, which comprises multimodal records of varying lengths on mood, personality and other physiological aspects of emotional response, is used for empirical assessment of the proposed overlapping sliding window (OSW) modelling framework. Two features are extracted using the Fourier and wavelet transforms: normalised band power (NBP) and normalised wavelet energy (NWE), respectively. The arousal, valence and dominance (AVD) emotions are predicted using one-dimensional (1D) and two-dimensional (2D) convolution neural networks (CNNs) for both single and combined features. Findings – The 2D CNN outcomes on the EEG signals of the AMIGOS data set are observed to yield the highest accuracies, that is, 96.63%, 95.87% and 96.30% for AVD, respectively, which is at least 6% higher than the other available competitive approaches. Originality/value – The present work focusses on the less explored, complex AMIGOS (2018) data set, which is imbalanced and of variable length, whereas EEG emotion recognition work is widely available on simpler data sets. The following challenges of the AMIGOS data set are addressed in the present work: handling of tensor-form data; proposing an efficient method for generating sufficient equal-length samples corresponding to imbalanced and variable-length data; selecting a suitable machine learning/deep learning model; improving the accuracy of the applied model.


Introduction
Emotions are a manifestation of intuitive states of the mind. They are known to be generated by events occurring in a person's environment or internally generated by thoughts [1]. Identification and classification of these emotions using computers have been widely studied under affective computing and human-computer interface [2].
Emotions are recognised using physiological or non-physiological signals [3]. Electroencephalogram (EEG), electrocardiogram (ECG) [4], galvanic skin response (GSR), blood volume pulse (BVP) [5] and respiration (RSP) [6] are popular tools used in the literature to obtain physiological signals, while facial expressions [7], speech [8], body gestures and videos [9] give non-physiological signals. The advantage of using physiological signals for ER is that they are captured directly from the human body, which gives a true response of human intuitions [10], unlike non-physiological signals, which can be synthetically elicited. Thus, EEG signals are a suitable tool for the current research. However, since EEG signals involve studying human behaviour directly, there is a limit to the number of samples that can be collected, while deep learning (DL) methods require a large number of samples to work efficiently. Therefore, an innovative resampling method is needed before DL methods can be applied.
The EEG signals are generated by electrical waves corresponding to brain activity elicited by external stimuli [11]. The raw signals need to be pre-processed, and then appropriate features need to be extracted to recover emotions from the signals. Lastly, an efficient classifier is applied to obtain an appropriate recognition of emotions.
The features of EEG signals are frequently extracted in the time, frequency and time-frequency domains. The features extracted in the time domain are the Hjorth feature [12], fractal dimension feature [13] and higher-order crossing feature [14]. The features used in the frequency domain are power spectral density (PSD) [15], spectral entropy (SE) [16] and differential entropy [17]. Wavelets and the short-time Fourier transform (STFT) [18] have been used to extract the time-frequency domain features.
After feature extraction, machine learning (ML) and DL methods are primarily applied in the literature for classification [19]. The ML methods applied for ER are k-nearest neighbour (KNN), random forest (RF), decision tree (DT), neural network (NN) and support vector machine (SVM). The DL methods used for ER are the convolution neural network (CNN), long short-term memory (LSTM), recurrent neural network (RNN) and several other variants. The DL methods are found to work with greater accuracy [20]. Table 1 shows a summary of DL methods applied in recent years.
Apart from these, nature-inspired algorithms have also been applied to ER tasks for feature selection, for example on the DEAP data set with particle swarm optimisation (PSO) [21] and firefly optimisation (FO) [30], with LSTM and SVM used as classifiers. Feature selection through FO has been reported to achieve an accuracy of 86.90%, while PSO-based feature selection recorded an accuracy of 84.16%.
Emotions in ER can be classified in two ways: as discrete emotions, such as anger, happiness, sadness, disgust, fear and neutral, or through emotion models. There are two types of emotion models: two-dimensional (2D) [31] and three-dimensional (3D) [32]. The 2D emotion model consists of valence and arousal; valence measures pleasantness versus unpleasantness, and arousal measures excitement versus calmness. The 3D emotion model comprises arousal, valence and dominance (AVD). The arousal and valence dimensions are the same as in the 2D emotion model; dominance is the third emotional aspect, representing dependence versus independence.

Contribution
The objective of the present work is to develop an efficient ER model for the AMIGOS [33] data set in 3D emotional space (i.e. AVD) using DL models. AMIGOS is a newer data set among the popular EEG data sets for ER. The following challenges of the AMIGOS data set are addressed in the present work:
(1) Handling of tensor-form data.
(2) Proposing an efficient method for generating sufficient equal-length samples corresponding to imbalanced and variable-length data.
(3) Selecting a suitable machine learning/deep learning model.
(4) Improving the accuracy of the applied model.

ACI
The equal-length data samples are generated here by the OSW method. Although the data can be oversampled using the Synthetic Minority Oversampling Technique (SMOTE) [34], available in Python through the imbalanced-learn library, SMOTE generates data by interpolating between existing examples without adding genuinely new information. Thus, the OSW method is proposed in the present work, which induces variability in the sample records by avoiding repetition of the signals. Feature extraction is undertaken in two modes using the normalised band power and the normalised wavelet energy. The rest of this paper comprises three additional sections: Section 2 provides details of the emotion recognition system proposed in this research, Section 3 presents the results and discussion, and Section 4 provides the conclusions.

Emotion recognition system
The proposed emotion recognition system (ERS) is modelled in three stages implemented for AVD: (1) data pre-processing, (2) feature extraction and (3) classification. Figure 1 shows the framework adopted for the OSW-based ERS. The important concepts used in the present research are described as follows:

Decomposition of signal using OSW
The emotion samples are amplified in the current research using OSW, as a large amount of data is recommended for efficient model building with DL methods [35]. The EEG signals produced in the different experiments were decomposed into windows of 512 samples with a shift of 32, as shown in Figure 2. The portion of a signal not covered by a complete 512-sample window was trimmed and not used for computation. The window and shift sizes were decided experimentally.
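As a minimal sketch (the function name and the NumPy-based implementation are illustrative, not the paper's code), the OSW decomposition with a window of 512 samples and a shift of 32 can be written as:

```python
import numpy as np

def overlapping_windows(signal, window=512, shift=32):
    """Decompose a 1D signal into overlapping fixed-length windows.

    Trailing samples that do not fill a complete window are trimmed,
    mirroring the trimming described in the text.
    """
    n_windows = (len(signal) - window) // shift + 1
    if n_windows <= 0:
        return np.empty((0, window))
    return np.stack([signal[i * shift: i * shift + window]
                     for i in range(n_windows)])

# A signal of 1000 samples yields (1000 - 512) // 32 + 1 = 16 windows.
segments = overlapping_windows(np.arange(1000.0))
```

Because consecutive windows share 480 of their 512 samples, this multiplies the number of training samples far beyond what a non-overlapping split would produce, which is the core of the resampling idea.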

Feature extraction
Once the signals were decomposed into equal-length samples using overlapping windows, the NBP and NWE features were extracted using the discrete Fourier transform (DFT) [36] and the discrete wavelet transform (DWT) [37], respectively.

Normalised band power (NBP).
To calculate the NBP feature, the Fourier transform $X_k$ of each windowed signal was first computed using Eqn (1):

$X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2\pi k n / N}$ (1)

where $N$ is the length of the vector $x$ and $0 \le k \le N-1$.
Once the signal is converted to the frequency domain, five frequency bands (4-8 Hz, 8-13 Hz, 13-16 Hz, 16-30 Hz and 30-45 Hz) were extracted. The beta band was decomposed into two (beta1 and beta2) to equalise the dimensions with the wavelet transform. The band power and normalised band power were then calculated for each band using Eqns (2) and (3):

$P_B = \sum_{k \in B} |X_k|^2$ (2)

$\hat{P}_B = \dfrac{P_B}{\sum_{B'} P_{B'}}$ (3)

where $P_B$ represents the power of band $B$, $k$ ranges over the frequency bins of the band, and $\hat{P}_B$ is called the NBP.
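The NBP computation of Eqns (1)-(3) can be sketched as follows; the helper name and the 128 Hz sampling rate are assumptions for illustration, and NumPy's FFT is used in place of whichever DFT routine the authors employed:

```python
import numpy as np

FS = 128  # assumed sampling rate (Hz) of the pre-processed EEG
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta1": (13, 16),
         "beta2": (16, 30), "gamma": (30, 45)}

def normalised_band_power(window, fs=FS):
    """Eqns (1)-(3): DFT, per-band power, then normalisation across bands."""
    spectrum = np.abs(np.fft.rfft(window)) ** 2            # |X_k|^2
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    power = np.array([spectrum[(freqs >= lo) & (freqs < hi)].sum()
                      for lo, hi in BANDS.values()])       # P_B, Eqn (2)
    return power / power.sum()                             # NBP, Eqn (3)

nbp = normalised_band_power(np.random.default_rng(0).standard_normal(512))
```

The returned vector has one entry per band and sums to one, so the feature is invariant to the overall signal amplitude.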

Normalised wavelet energy (NWE).
In the DWT, the different frequencies of the signal are separated at different levels, a process called the multi-level wavelet transform, defined in Eqn (4):

$W(\tau, s) = \dfrac{1}{\sqrt{s}} \int x(t) \, \psi^{*}\!\left(\dfrac{t - \tau}{s}\right) dt$ (4)

where $\tau = k \cdot 2^{-j}$ and $s = 2^{-j}$ represent translation and scale, respectively, and $\psi$ is the mother wavelet, taken here as the Daubechies 4 (db4) wavelet. The signal is decomposed into $cA_n$ and $cD_n$, called the approximation coefficients at the $n$th level (low frequencies) and the detail coefficients at the $n$th level (high frequencies), respectively. Because the EEG signal provided in the pre-processed data set lies in the range 4 Hz-45 Hz, a five-level decomposition is sufficient for the required four-band information, as shown in Figure 3. After decomposition of the signal into multi-level wavelet coefficients, the wavelet energy is calculated using the detail coefficients $cD_n$ of the above five levels, because the emotion information is mostly carried by the higher frequencies. The wavelet energy is given in Eqn (5) and the NWE in Eqn (6):

$E_n = \sum_{k} |cD_n(k)|^2$ (5)

$\hat{E}_n = \dfrac{E_n}{\sum_{m} E_m}$ (6)
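A sketch of the NWE feature using the PyWavelets library (the helper name and the use of `pywt` are assumptions; the db4 wavelet and five-level decomposition follow the text):

```python
import numpy as np
import pywt  # PyWavelets

def normalised_wavelet_energy(window, wavelet="db4", level=5):
    """Eqns (5)-(6): energy of the detail coefficients cD1..cD5, normalised."""
    coeffs = pywt.wavedec(window, wavelet, level=level)  # [cA5, cD5, ..., cD1]
    details = coeffs[1:]                                 # keep only the cD_n
    energy = np.array([np.sum(cd ** 2) for cd in details])   # Eqn (5)
    return energy / energy.sum()                             # NWE, Eqn (6)

nwe = normalised_wavelet_energy(np.random.default_rng(0).standard_normal(512))
```

Dropping the approximation coefficients `cA5` implements the text's choice of using only the high-frequency detail bands for the energy feature.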

Convolution neural network
A CNN is a multilayer structure consisting of several types of layers, including input, convolution, pooling, fully connected, softmax/logistic and output layers [38]. The extracted features are fed into two types of CNN: 1D and 2D. Both follow the same architecture, using Conv1D and Conv2D convolution layers, respectively, each preceded by batch normalisation. A max pooling layer with a ReLU activation function follows every convolution layer. The last max pooling layer is connected to an adaptive average pooling layer, which is passed through a flattening layer and followed by four output dense layers: the first three are linear, and the last is a sigmoid layer for binary classification. The architectures of the 1D and 2D CNNs are shown in Figures 4 and 5, respectively.
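A hypothetical PyTorch sketch of the described layer ordering (channel counts, kernel sizes and dense-layer widths are illustrative assumptions, not the paper's values):

```python
import torch
import torch.nn as nn

class EmotionCNN1D(nn.Module):
    """Illustrative 1D CNN following the layer order described in the text:
    batch norm -> conv -> ReLU -> max pool, then adaptive average pooling,
    flattening and dense layers ending in a sigmoid for binary output."""

    def __init__(self, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.BatchNorm1d(in_channels),
            nn.Conv1d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.BatchNorm1d(16),
            nn.Conv1d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(32, 16),
            nn.Linear(16, 8),
            nn.Linear(8, 1),
            nn.Sigmoid(),  # one binary prediction per emotion index
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# A batch of 4 feature vectors of length 10 (e.g. {NBP, NWE} per channel).
out = EmotionCNN1D()(torch.randn(4, 1, 10))
```

The 2D variant would swap `Conv1d`/`MaxPool1d`/`BatchNorm1d` for their 2D counterparts while keeping the same ordering.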

Results and discussions
All experiments in the present work were performed on an Intel i5 system with 8 GB RAM using the Python 3.7 programming language. PyTorch version 1.7.0 was used to implement the CNNs, which were trained on a Kaggle GPU.
The present work is executed in the following steps:

Preparation of data
The data set used in this research was originally prepared by Correa et al. (2018) to identify affect, mood and personality, and is stored in an intricate format. It comprises 40 folders, where each folder corresponds to one participant and contains a MATLAB file with the fields listed in Table 2.
In the present study, the data for the 16 short videos were taken for the 14 EEG channels together with their respective AVD labels from the self-assessment list. Responses for each emotion index under AVD were coded as 1 and 0 according to Table 3.
3.1.1 Balancing for emotions. After preparing the data set, the number of samples in each AVD category is plotted in Figure 6(a). It is evident that the number of samples recorded as low emotions in each category is significantly smaller than the number recorded as high. Thus, the low and high emotions of each category were balanced using SMOTE. The result of the upsampling is shown in Figure 6(b).
The resultant number of samples is still insufficient for applying DL methods. Moreover, replication in the data reduces the accuracy of the models, as shown in Table 6. To overcome these limitations, the data are generated by non-overlapping sliding window (NOSW) and OSW methods in the present work. The resultant numbers of samples are shown in Table 4.

Feature extraction and classification
The decomposed signals were cleaned by removing NaN values. Five NBP and five NWE features corresponding to the five EEG bands were then extracted by the Fourier and wavelet transforms, respectively. A combined vector of both features {NBP, NWE} is also formed by appending the NWE features to the NBP features. A total of 70 (14 × 5) features were extracted from the 14 EEG channels by each of NBP and NWE separately; thus, 140 features are present in the combined vector. The features extracted by the different resampling methods are shown in Table 5.
The CNN classifiers discussed in Section 2.3 were applied to the individual and combined features. The train, validation and test samples are divided in a 70:40:30 ratio. The learning rate, batch size and optimiser are taken as 0.001, 32 and Adam, respectively. A binary cross-entropy function is used as the loss function.
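These hyperparameters can be wired up in PyTorch as follows; the stand-in model and random batch are illustrative, and only the learning rate, batch size, optimiser and loss follow the text:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the CNN classifier.
model = nn.Sequential(nn.Linear(10, 1), nn.Sigmoid())

# Hyperparameters from the text: lr = 0.001, batch size = 32, Adam, BCE loss.
optimiser = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCELoss()

X = torch.randn(32, 10)                       # one batch of 32 samples
y = torch.randint(0, 2, (32, 1)).float()      # binary emotion labels

for _ in range(5):                            # a few illustrative steps
    optimiser.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimiser.step()

final_loss = float(loss_fn(model(X), y))
```

In practice this loop would run per epoch over the 70% training split, with the validation split used to decide when to stop.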
The training of the CNN continues until the accuracy of the network plateaus or starts decreasing. The emotion recognition accuracies of the two DL classifiers are compared with a baseline ML model, SVM, in Table 6; the highest accuracy is shown in italic. From Figure 7(a), it can be observed that SVM outperforms the other methods, and no specific pattern indicates whether the individual or combined features perform better. It is evident from Figure 7(b) that the NWE feature provides higher accuracy in the case of NOSW, whereas the NBP feature provides higher accuracies with OSW for the DL methods, as shown in Figure 7(c). The combined features for the 2D CNN give higher accuracies for both NOSW and OSW, as shown in Figures 7(b) and 7(c). Thus, combining the observations from Table 6 and Figures 7(b) and 7(c), a 2D CNN classifier with a combined feature vector is found to be the best for all the emotion indices, with 96.63%, 95.87% and 96.30% accuracies for AVD, respectively.

An execution history of the 2D CNN with combined features for the overlapping window is shown in Figure 8 in terms of the loss and accuracy curves for arousal, valence and dominance, respectively. The loss curves show the training and validation losses, which are expected to remain as close as possible. The accuracy curves show the accuracy obtained for each emotion index over 20 epochs.
The results are also compared in terms of the execution time of the individual versus combined features, shown in Figure 9. Figure 9 shows that as the sample size increases from SMOTE → NOSW → OSW, the execution time increases significantly for SVM with both individual and combined features. The reason for this observation is that the SVM cannot be executed on GPUs, since it involves complex calculations. As observed in this study, the basic SVM performs poorly when the sample size is large (as in the case of the combined feature with OSW in Table 6); the same is reported in [39]. Table 7 compares the results obtained in the present study with ERS articles published from 2018 onwards on the AMIGOS data set. As shown in Table 7, the emotions are recognised using only EEG data in [29,33,40,42]; the other studies were carried out using multimodal data. The first study [33] conducted on the AMIGOS data set provides an initial analysis, produces a very low accuracy of 57.7% and therefore poses an open research challenge. The accuracy was improved to 71.54% in [40] in the same year, in which the features were extracted using CNN. A multimodal ERS was proposed in [28,41], producing accuracies of up to 84%. The highest accuracy achieved on the AMIGOS data set prior to this work is 90.54%, using a CNN + SVM model in [29]. Finally, the present proposed model improves the accuracy up to 96.63% with a single modality (EEG) through a 2D CNN classifier. The AMIGOS data set has varying lengths of data, which indicates the necessity of an efficient pre-processing method prior to classification. The present paper offers the most efficient classification strategy for EEG records of varying lengths through decomposition of the data using the OSW approach, which provides an efficient alternative for handling imbalanced, variable-length data prior to classification.

Conclusions
Despite significant developments in the field of DL and its suitability to various applications, almost 59% of researchers have used an SVM with RBF kernels for BCIs [19]. This is due to the unavailability of large-scale data sets for BCIs, whereas DL models are widely applied in the speech and visual modalities. A BCI data set provides genuine human responses, as they are taken directly from the human body; thus, ER using brain signals is preferred. There is a need for an "off-the-shelf" method to conduct research on BCIs with high accuracy, as the accuracy found in BCIs is generally low, especially for the AMIGOS data set. The present contribution focusses on obtaining predictive outcomes for the 3D emotion responses of EEG signals in the context of imbalanced, variable-length records. The novelty of the present paper is the application of OSW with CNN to the intricate AMIGOS data set, aimed at highly accurate prediction of 3D emotions, in contrast to the accuracy achieved by the existing approaches available in the literature. Most earlier analyses of the AMIGOS data set have pivoted on 2D emotion analysis. The current paper views EEG (14 channels) in 3D emotion space for predictive inference and presents a comparative assessment of the predictive accuracy with that of Siddharth et al. (2018) [40]. Thus, the present approach is found to have the highest accuracy with respect to all three AVD emotion indices as compared with similar works referenced in the literature (Table 7).
The present work can be further extended to multiple physiological modalities as well as to responses to video interventions, such as an automatic video recommendation system for enhancing the mood of individuals. Another possible extension is to represent the signal features in 2D/3D form and subsequently combine them with the respective video/image features.