Subject independent emotion recognition using EEG and physiological signals – a comparative study

Purpose – The aim of this study is to investigate subject independent emotion recognition capabilities of EEG and peripheral physiological signals, namely electrooculogram (EOG), electromyography (EMG), electrodermal activity (EDA), temperature, plethysmograph and respiration. The experiments are conducted on both modalities independently and in combination. This study arranges the physiological signals in order based on the prediction accuracy obtained on test data using time and frequency domain features.
Design/methodology/approach – The DEAP dataset is used in this experiment. Time and frequency domain features of EEG and physiological signals are extracted, followed by correlation-based feature selection. Classifiers, namely Naïve Bayes, logistic regression, linear discriminant analysis, quadratic discriminant analysis, logit boost and stacking, are trained on the selected features. Based on the performance of the classifiers on the test set, the best modality for each dimension of emotion is identified.
Findings – The experimental results with EEG as one modality and all physiological signals as another modality indicate that EEG signals are better at arousal prediction than physiological signals by 7.18%, while physiological signals are better at valence prediction than EEG signals by 3.51%. The valence prediction accuracy of EOG is superior to zygomaticus electromyography (zEMG) and EDA by 1.75% at the cost of a higher number of electrodes. This paper concludes that valence can be measured from the eyes (EOG) while arousal can be measured from changes in blood volume (plethysmograph). The sorted order of physiological signals based on arousal prediction accuracy is plethysmograph, EOG (hEOG + vEOG), vEOG, hEOG, zEMG, tEMG, temperature, EMG (tEMG + zEMG), respiration, EDA, while based on valence prediction accuracy the sorted order is EOG (hEOG + vEOG), EDA, zEMG, hEOG, respiration, tEMG, vEOG, EMG (tEMG + zEMG), temperature and plethysmograph.
Originality/value – Many of the emotion recognition studies in the literature are subject dependent, and the limited subject independent emotion recognition studies in the literature report an average leave one subject out (LOSO) validation result as accuracy. The work reported in this paper sets the baseline for subject independent emotion recognition using the DEAP dataset by clearly specifying the subjects used in the training and test sets. In addition, this work specifies the cut-off score used to classify the scale as low or high in the arousal and valence dimensions. Generally, statistical features are used for emotion recognition using physiological signals as a modality, whereas in this work, time and frequency domain features of both physiological signals and EEG are used. This paper concludes that valence can be identified from EOG while arousal can be predicted from the plethysmograph.


Introduction
Subject independent emotion recognition using single or multiple modalities is a burgeoning area of research in affective computing. Emotion recognition (ER) plays a vital role in human computer interaction (HCI), as it tries to make HCI similar to human-human interaction (HHI) by incorporating ER and emotion expression capabilities in machines. These ER and emotion expression capabilities of humans are what currently distinguish HHI from HCI.
Humans recognize others' emotions via facial expression and contextual information in day-to-day life. Emotions serve as evolved communication and hence should evoke behaviors that reveal the subjects' emotional state to others [1]. The emotional state of a person can be inferred from behavior in face, voice, whole-body and observer ratings. James's emotion theory [2] states that emotional response can be measured using peripheral physiological signals. Some of the peripheral physiological signals used in ER are electrodermal activity (EDA), cardiovascular activity and respiration activity. Cannon's emotion theory [3] suggests that emotions are derived from subcortical centers, and this led to the study of emotional responses of central nervous system (CNS) signals using EEG, neuroimaging techniques and electrooculogram (EOG).
Subject dependent unimodal and multimodal ER provides considerable accuracy, while subject independent ER needs improvement. One aspect that hinders establishing a baseline for subject independent ER models is the non-availability of subject independent test sets for the publicly available multimodal ER datasets. Many of the subject independent ER studies in the literature report an average leave one subject out (LOSO) validation score as the final accuracy. In this work, the test subjects used for validation of the model are specified explicitly, so that any future work can use these model scores as a baseline.
This research explores the subject independent ER capabilities of time and frequency domain features of EEG and peripheral physiological signals, namely EOG, EMG, EDA, temperature, plethysmograph and respiration, both independently and in combination, on the DEAP dataset in the arousal and valence dimensions using the classifiers Naïve Bayes, logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), logit boost and stacking. Through this research work, it is found that, from an ergonomic perspective, valence can be measured from the eyes while arousal can be measured from changes in blood volume. The model scores of this work can be used as a baseline for future work, as this work reports results on a truly subject independent test set.

Related works
The recent advances in multimodal ER are in the areas of feature extraction, feature selection, modeling and fusion strategies. Multimodal ER involves three important aspects: extracting shared representations from multiple modalities, removing redundant features and learning key features from each modality. To address all three aspects, the multimodal deep belief network (MDBN) was investigated [4]. Recent studies in the literature used global average pooling [5], a deep belief network (DBN) [6] and a multi-hypergraph neural network [7] to investigate the aspect of correlation among features in multimodal ER. The optimal combination of features plays a significant role in multimodal ER and was studied using a multi-kernel learning approach [8] and a deep learning-based hierarchical feature fusion approach [9]. Recent studies have explored the significance of features in ER [10].
The body of work in the literature has explored the feature extraction ability of deep learning networks for end-to-end ER architectures, whose performance was determined by the strength of the input signals [11]. Deep learning architectures like the ensemble convolutional neural network (ECNN) [5], DBN [6], inception ResNet v2 [12], spiking neural networks (SNN) [13], autoencoders [14], the hierarchy modular neural network (HMNN) [15], MDBN [16], transfer learning [17], a transformer-based architecture using CNN [17] and the high resolution network (HRNet) [18] were explored for ER. Decision level fusion versus feature level fusion is a long-standing contention in the field of multimodal ER. Decision level fusion improves accuracy by 5% in comparison to unimodal accuracy [19], whereas feature level fusion provides ER accuracy comparable to decision level fusion with less computation time [10]. Some of the literature [10,12,19] reported the LOSO validation score as the final accuracy, which is a limitation in subject independent ER.
The work reported in this paper sets the baseline for subject independent ER using DEAP dataset by clearly specifying the subjects used in training and test set. In addition, this work specifies the cut-off score used to classify the scale as low or high in arousal and valence dimensions. Generally, statistical features are used for ER using physiological signals as a modality, whereas in this work, time and frequency domain features of physiological signals and EEG are used. The experiment is conducted on both modalities independently and in combination. This work arranges the physiological signals in order based on the prediction accuracy obtained on test data using time and frequency domain features.

Materials and methods
The DEAP dataset is used to compare the prediction ability of time and frequency domain features of EEG and physiological signals over the same set of classifiers and to sort the physiological signals. In this experiment, two ensemble classifiers (logit boost and stacking) and two statistical classifiers (Naïve Bayes and QDA) are used. All four classifiers are applied to EEG and physiological signals independently and in combination. The feature selection and training of classifiers are performed using the Weka software [20]. The proposed methods for arousal and valence prediction in multimodal and unimodal environments are shown in Figure 1.

DEAP dataset description
The DEAP [21] dataset has EEG and peripheral physiological signal recordings of 32 participants (16 of each gender). The signals were recorded while the participants watched one-minute music videos. Each participant watched a subset of 40 music videos and rated the valence, arousal, dominance and liking of each video. For each trial, 32 channels of EEG signals and 12 channels of peripheral signals were recorded using the Biosemi ActiveTwo system at 512 Hz.

Evaluation measures
The evaluation metrics used to compare different models in this experiment are accuracy and F1-score. Additional metrics, namely ROC area and kappa statistic, are reported for the proposed methods.
3.2.1 Accuracy. Accuracy is the measure of correctly classified instances. The accuracy percentage ranges from 0 to 100, where 100 is the best possible accuracy, and is computed as shown in equation (1):

Accuracy = (TP + TN) / (TP + TN + FP + FN) × 100 (1)
3.2.2 F1-score. F1-score is the harmonic mean of precision and recall. The F1-score ranges from 0 to 1, where 0 is the worst possible score and 1 the best, and is computed as shown in equation (2). F1-score gives a better measure of incorrectly classified instances.

F1-score = 2 × (Precision × Recall) / (Precision + Recall) (2)
ROC area.
The area under the ROC curve measures the ability of a binary classifier to distinguish between classes. The value ranges from 0 to 1, where 1 implies the classifier perfectly distinguishes between the classes.

Cohen's kappa.
Cohen's kappa values range from −1 to 1, where 1 indicates perfect agreement between the model's predictions and the ground truth. A kappa value of 0 indicates that the model is no better than a chance classifier.
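To make the metrics above concrete, the following is a minimal Python sketch that computes accuracy, F1-score and Cohen's kappa from binary low/high predictions. This is a plain re-implementation for illustration only, not the Weka code used in this work, and the function names are our own.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = high, 0 = low)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def accuracy(y_true, y_pred):
    """Equation (1): percentage of correctly classified instances."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return 100.0 * (tp + tn) / (tp + fp + tn + fn)

def f1_score(y_true, y_pred):
    """Equation (2): harmonic mean of precision and recall."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def cohens_kappa(y_true, y_pred):
    """Agreement beyond chance: (p_observed - p_expected) / (1 - p_expected)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    n = tp + fp + tn + fn
    p_observed = (tp + tn) / n
    # Expected chance agreement from the marginal label frequencies.
    p_expected = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    return (p_observed - p_expected) / (1 - p_expected)
```

A model that predicts the majority class on an imbalanced test set can score high accuracy while its kappa stays near 0, which is why kappa is reported alongside accuracy in Table 6.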

Training and test dataset split
The dataset is split into subject independent training and test sets in the ratio 70:30. The data of 22 participants is used as the training set, while the data of the remaining 10 participants is used as the test set. Subjects s02, s04, s05, s09, s15, s20, s23, s28, s29 and s30 are used in the test set, while the rest of the subjects are used in the training set.
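The fixed subject-wise split can be expressed directly in code. The sketch below assumes trials are represented as (subject_id, features, label) tuples, a representation we introduce here purely for illustration.

```python
# Held-out test subjects, as specified in this work (10 of 32 DEAP participants).
TEST_SUBJECTS = {"s02", "s04", "s05", "s09", "s15", "s20", "s23", "s28", "s29", "s30"}
ALL_SUBJECTS = {f"s{i:02d}" for i in range(1, 33)}   # DEAP has 32 participants
TRAIN_SUBJECTS = ALL_SUBJECTS - TEST_SUBJECTS        # remaining 22 subjects

def split_trials(trials):
    """Partition (subject_id, features, label) trials by subject, never by trial,
    so that no test subject's data leaks into training."""
    train = [t for t in trials if t[0] in TRAIN_SUBJECTS]
    test = [t for t in trials if t[0] in TEST_SUBJECTS]
    return train, test
```

Splitting by subject rather than by trial is what makes the test set truly subject independent: every trial of a held-out subject ends up on the same side of the split.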

Labeling strategy
As the objective is to train classifiers using supervised learning, the continuous scale ratings of valence and arousal are converted into labels by splitting the continuous scale. The scale range [0, 5.0] is considered low, while (5.0, 9.0] is considered high. The scale value of 5.0 is chosen as the split point because the mean value of the ratings lies approximately around 5.0. The labeling strategy and the distribution of labels across the training and test sets are shown in Table 1.
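The labeling rule is a single threshold at 5.0; a minimal sketch (the function name is ours) is:

```python
def label_rating(rating):
    """Map a continuous DEAP rating to a binary class:
    [0, 5.0] -> 'low', (5.0, 9.0] -> 'high'."""
    return "low" if rating <= 5.0 else "high"
```

Note that the boundary value 5.0 itself falls in the low class, since the low interval is closed at 5.0 and the high interval is open there.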

Pre-processing
The DEAP dataset provides pre-processed data, as explained in this sub-section. The EEG signals were down-sampled to 128 Hz and EOG artefacts were removed. A bandpass filter with a frequency range of 4.0 to 45.0 Hz was applied. The EEG data were averaged to the common average reference and the pre-trial baseline was removed. The physiological signals were down-sampled to 128 Hz and the pre-trial baseline was removed.
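Since DEAP ships the signals already pre-processed, no pre-processing is implemented in this work. Purely for illustration, a pipeline with the analogous steps (downsampling, 4.0–45.0 Hz bandpass, pre-trial baseline removal) could be sketched with SciPy as follows; EOG artefact removal and re-referencing are omitted, and a 3 s pre-trial baseline is assumed.

```python
import numpy as np
from scipy.signal import butter, decimate, filtfilt

FS_RAW, FS_TARGET = 512, 128   # DEAP recording rate and target rate

def preprocess_eeg(raw, baseline_sec=3):
    """Illustrative DEAP-style pipeline: downsample 512 -> 128 Hz,
    bandpass 4.0-45.0 Hz, then subtract the pre-trial baseline mean."""
    x = decimate(np.asarray(raw, dtype=float), FS_RAW // FS_TARGET)
    b, a = butter(4, [4.0, 45.0], btype="band", fs=FS_TARGET)
    x = filtfilt(b, a, x)               # zero-phase bandpass
    n_base = baseline_sec * FS_TARGET   # pre-trial baseline samples
    return x[n_base:] - x[:n_base].mean()
```

`filtfilt` is used so that the bandpass introduces no phase shift, which matters when time domain features such as the Hjorth parameters are computed afterwards.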

Feature extraction
Time domain and frequency domain features were used to find the electrode positions for the top-30 features, and it was found that the frequency-based power spectral density provided better accuracy [22]. In contrast, another study found that power spectral density did not perform well [23]. As emotions vary with time, Hjorth features are widely used in ER, as they are useful in monitoring time-varying EEG signals [24]. Hence, this work extracts both time domain features (Hjorth activity and Hjorth complexity) and a frequency domain feature, power spectral density (PSD), and uses a feature selection process to select the best performing features [25]. The Hjorth activity and Hjorth complexity features are computed over the entire time range of the signal. A total of 280 features are computed, as shown in Table 2.
Horizontal EOG, vertical EOG, zygomaticus major EMG and trapezius EMG are computed by subtracting the corresponding values between two channels, as shown in Table 2. In this work, EEG is considered as one modality, and all other signals are grouped under physiological signals. For the feature extraction process, y(t) denotes the signal and dy(t)/dt the first derivative of the signal.
3.6.1 Hjorth activity. The Hjorth activity [24] parameter is the total power of the signal. It is the surface of the power spectrum in the frequency domain and is shown in equation (3):

Activity = Variance(y(t)) (3)

3.6.2 Hjorth complexity. Hjorth complexity [24] is a dimensionless parameter defined as the ratio of the mobility of the first derivative of the signal to the mobility of the signal, as shown in equation (4). Mobility is defined as the square root of the ratio of the variance of the first derivative of the signal to the variance of the signal, as shown in equation (5):

Complexity = Mobility(dy(t)/dt) / Mobility(y(t)) (4)

Mobility(y(t)) = sqrt(Variance(dy(t)/dt) / Variance(y(t))) (5)

The mobility of the signal represents the frequency variance of the power spectrum and can be illustrated as the standard deviation of the power spectrum along the frequency axis. The Hjorth complexity gives an estimate of the bandwidth of the signal and indicates the shape similarity of the signal to a pure sine wave.
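Equations (3)–(5) translate directly into code; a minimal NumPy sketch, using the variance of the first difference as the discrete stand-in for the derivative's variance, is:

```python
import numpy as np

def hjorth_activity(y):
    """Hjorth activity: the variance (total power) of the signal, equation (3)."""
    return np.var(y)

def hjorth_mobility(y):
    """Hjorth mobility, equation (5): sqrt(Var(dy/dt) / Var(y)),
    with np.diff as the discrete first derivative."""
    return np.sqrt(np.var(np.diff(y)) / np.var(y))

def hjorth_complexity(y):
    """Hjorth complexity, equation (4): Mobility(dy/dt) / Mobility(y).
    Close to 1 for a pure sine wave, larger for broader-band signals."""
    return hjorth_mobility(np.diff(y)) / hjorth_mobility(y)
```

For a sampled pure sine wave the complexity is close to 1, which matches its interpretation as a measure of shape similarity to a sine wave.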

Feature selection
In this work, feature selection is done using a best-first search strategy with correlation-based feature subset selection [27], where the correlation between each feature and the output class, and the correlation among the features, are computed. Feature selection is done such that the subset of features is highly correlated with the class while the intercorrelation among the selected features is low. The best-first strategy starts with an empty feature list and iteratively includes and excludes single attributes. Single features that have a high correlation with the class are added to the search space. If an added feature does not improve the accuracy, the algorithm backtracks to the last best subset in the feature space and continues the search. To avoid exploring the entire feature space, a stopping criterion is used: the search procedure is terminated if there is no improvement for the last five iterations.
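As a rough illustration of the selection criterion, the sketch below implements the CFS merit together with a simplified greedy forward search. Weka's actual best-first search also backtracks through previously expanded subsets, which is not reproduced here, and the function names are ours.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Correlation-based feature subset merit [27]:
    merit = k * mean|corr(f, class)| / sqrt(k + k*(k-1)*mean|corr(f, f')|)."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(X, y, patience=5):
    """Greedy forward search over the CFS merit, stopping after `patience`
    consecutive non-improving additions (the five-iteration criterion)."""
    selected, best_merit, stall = [], -np.inf, 0
    remaining = list(range(X.shape[1]))
    while remaining and stall < patience:
        # Try the candidate that maximizes the merit of the enlarged subset.
        j = max(remaining, key=lambda f: cfs_merit(X, y, selected + [f]))
        m = cfs_merit(X, y, selected + [j])
        if m > best_merit:
            selected.append(j)
            best_merit, stall = m, 0
        else:
            stall += 1
        remaining.remove(j)
    return selected
```

The merit's denominator penalizes redundant features: a candidate that correlates strongly with already-selected features raises the mean feature–feature correlation and so lowers the merit, which is what drives subsets toward high class correlation and low intercorrelation.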
The features selected for the EEG, physiological and combined modalities are shown in Table 3. From the features selected, it is observed that only frequency-based PSD features are selected for physiological signals, while time-based Hjorth features are selected for the T7, P7, Fz, FP1 and FC6 electrodes of the EEG signal. The positions of the T7, P7, Fz, FP1 and FC6 electrodes are associated with the superior temporal gyrus, lateral occipital cortex, superior frontal gyrus, frontal pole and precentral gyrus, respectively [28]. From the feature selection process, it is observed that the time domain features of electrodes associated with the gyri and frontal pole brain regions are selected [28]. This is in accordance with the literature, which states that the gyrus [29] and the frontal pole [30] have a role in emotion regulation.

Results and discussion
The accuracy and F1-score of the experiment are shown in Table 4. The graphic illustration is available as Figure S1 at: https://github.com/armanjupriya-er/er-comparison-supplementary. The results obtained using EEG and physiological signals as independent modalities indicate that EEG signals are better at arousal prediction compared to physiological signals by 7.18%, while physiological signals are better at valence prediction compared to EEG signals by 3.51%. Combining EEG and physiological modalities, the arousal prediction is better than the physiological signal modality by 2.39% and inferior to the EEG modality by 4.46%, while the valence prediction of the combined modality is better than the EEG modality by 3.07% and inferior to the physiological modality by 0.42%. From the prediction accuracy in the arousal and valence dimensions, it is observed that EEG as a single modality and physiological signals as a single modality perform better than combining EEG with physiological signals. A one-way ANOVA test was conducted to validate whether there is any significant difference in prediction ability between the EEG and physiological modalities using the same set of features. One-way ANOVA for arousal accuracy (F(1,6) = 7.05, p = 0.0378) shows that there is a significant difference in the accuracy levels reported by EEG and physiological signals, while the difference in F1-score (F(1,6) = 5.07, p = 0.0653) is not significant at the 5% level of significance. One-way ANOVA for valence accuracy (F(1,6) = 0.08, p = 0.7874) and F1-score (F(1,6) = 0.25, p = 0.6372) shows no significant difference in accuracy and F1-score between the EEG and physiological modalities at the 5% level of significance.
Table 4. Accuracy and F1-score for EEG, physiological and EEG + physiological modalities

Based on the observation that the FC6 electrode is common to arousal and valence prediction, the prediction ability of the FC6 electrode is studied and shown in Table 5. The experimental results suggest that the ability of the FC6 electrode to predict valence is 59.00%, which is equal to the best prediction accuracy obtained using all of the physiological signals. The prediction accuracy of the FC6 electrode with respect to arousal is 52.25%, which is on par with the prediction accuracy of the physiological signals. The FC6 electrode position corresponds to the primary motor cortex area of the brain, which is associated with controlling different muscle groups [36]. This indicates that the prediction accuracy of the FC6 electrode in the valence dimension comes from muscle activity.
On further analysis of the selected features list, it is observed that zEMG plays a significant role in the prediction of both arousal and valence. To study their prediction ability, classifiers were trained on the physiological signals EOG, EMG, EDA, temperature, plethysmograph and respiration using the features listed in Table 2. To sort the physiological signals based on prediction accuracy, the same set of features is fed to the classifiers listed earlier.
The best prediction accuracy obtained and the corresponding classifier for each of the physiological signals are shown in Table 5. The graphic illustration is available as Figure S2 at: https://github.com/armanjupriya-er/er-comparison-supplementary.
Study of the prediction capability of time and frequency domain features of EOG, EMG, EDA, temperature, plethysmograph and respiration indicates that the plethysmograph shows an arousal prediction accuracy of 55.50%, which is inferior to the EEG modality by 0.89%, while the EOG shows a valence prediction accuracy of 60.00%, which is better than the combination of all physiological signals by 1.69%. Features of EDA and zEMG each resulted in a valence prediction accuracy of 58.25%. The sorted order of physiological signals based on arousal prediction accuracy is as follows: plethysmograph, EOG (hEOG + vEOG), vEOG, hEOG, zEMG, tEMG, temperature, EMG (tEMG + zEMG), respiration, EDA, whereas based on valence prediction accuracy the sorted order is EOG (hEOG + vEOG), EDA, zEMG, hEOG, respiration, tEMG, vEOG, EMG (tEMG + zEMG), temperature, plethysmograph. The valence prediction accuracy of EOG is superior to zEMG and EDA by 1.75% at the cost of a higher number of electrodes (EOG requires four electrodes, whereas zEMG and GSR each require two). The results indicate that the valence prediction accuracy comes from muscle activity. Another notable observation is that ensemble classifiers (logit boost, stacking) perform better in a high dimensional feature space (Table 4), while statistical models (logistic regression, LDA, QDA) perform better in a low dimensional feature space (Table 5), which is in line with the literature [37].
The results of the experiment in comparison with the state-of-the-art (SOTA) are presented in supplementary Table S1. The results in Tables 4, 5, 6 and S1 are from the test set. The performance of regularized deep fusion of kernel machines (RDFKMs) on EEG, EMG, EDA and respiratory rate [38] and of a pretrained inception ResNet v2 on facial expression, EEG and GSR modalities [12] were explored in recent literature. Similarly, recent research investigated the performance of statistical features on combinations of multiple modalities [10,19]. Unlike the experiment carried out in this work, all the above-mentioned recent research reported the average LOSO validation score as the final accuracy. Also, some of the recent works did not publish the cut-off score used to distinguish low and high values in the arousal and valence dimensions [10,38], whereas two other recent works mentioned the cut-off score as 4.5 [19] and cut-off score ranges of [1.0, 3.0] (for low) and [7.0, 9.0] (for high) [12]. This is in contrast to the experiment carried out in this work, which uses scale ranges of [0, 5.0] and (5.0, 9.0] as low and high, respectively. The accuracy obtained by the proposed unimodal valence recognition using EOG and multimodal valence recognition using zEMG and EOG is better than the accuracy obtained in the literature [10,12] by 5.44% and 11.52%, respectively, but less than the accuracy obtained in the literature [19,38] by 16.87% and 6.97%, respectively. The accuracy obtained by the proposed unimodal arousal recognition using EEG or plethysmograph is better than the accuracy obtained in Ref. [12] by 4.08% and is less compared to all other methods. This research work is not compared with the subject dependent ER studies listed in Table S1, as this experiment is about subject independent ER. The low accuracy reported in this experiment can be partly attributed to the dataset used to report the test accuracy.
This experiment specifically uses a separate test set, while all other subject independent ER works [10,12,19,38] report an average LOSO accuracy. Also, in this experiment, the same set of features is used across different modalities. More research is needed to determine whether modality specific features improve the prediction accuracy. Table 6 shows additional evaluation metrics for the proposed methods, including accuracy, ROC area, kappa statistic, precision, recall, true positive rate (TPR), false positive rate (FPR), F1-score, true positive count (TP count), false positive count (FP count), true negative count (TN count) and false negative count (FN count). According to the kappa statistic, F1-score and accuracy, EEG is better suited for arousal prediction, whereas EOG is better suited for valence prediction. The ROC area reported for arousal prediction by the EEG modality is less than that of the plethysmograph modality by 0.028. From an ergonomic perspective, obtaining a plethysmograph signal is easier than obtaining EEG signals.

Conclusion
The experimental results of this work suggest that arousal dimension prediction ability is high for EEG signals, while valence dimension prediction ability is high for the combination of EOG and zEMG signals. In addition, valence can be measured from the eyes (EOG) while arousal can be measured from the changes in blood volume (plethysmograph). Also, muscle activity plays a significant role in valence prediction.
Further research is required to examine whether the prediction ability of the EEG signal is resulting from brain regions associated with muscle activity or not. Whether modality specific features improve the prediction accuracy or not is yet to be explored. The experiment needs to be repeated on other existing or new datasets to identify the best modality for each emotion dimension. To determine the effect of stimulus on eye muscle, further study of eye movements while expressing emotions can be performed.