Longitudinal estimation of stress-related states through bio-sensor data

Purpose – The authors aim to develop a conceptual framework for longitudinal estimation of stress-related states in the wild (IW), based on machine learning (ML) algorithms that use physiological and non-physiological bio-sensor data. Design/methodology/approach – The authors propose a conceptual framework for longitudinal estimation of stress-related states consisting of four blocks: (1) identification; (2) validation; (3) measurement and (4) visualization. The authors implement each step of the proposed conceptual framework, using the Gaussian mixture model (GMM) and K-means algorithms as examples. These ML algorithms are trained on the data of 18 workers from the public administration sector who wore biometric devices for about two months. Findings – The authors confirm the convergent validity of the proposed conceptual framework IW. Empirical data analysis suggests that two-cluster models achieve five-fold cross-validation accuracy exceeding 70% in identifying stress. Accuracy decreases to around 45% for three-cluster models. The authors conclude that identification models may serve to derive longitudinal stress-related measures. Research limitations/implications – The proposed conceptual framework may guide researchers in creating validated stress-related indicators. At the same time, physiological sensing of stress through identification models is limited because of subject-specific reactions to stressors. Practical implications – Longitudinal indicators of stress allow estimation of the long-term impact of the external environment on stress-related states. Such stress-related indicators can become an integral part of mobile/web/computer applications supporting stress management programs. Social implications – Timely identification of excessive stress may improve individual well-being and prevent the development of stress-related diseases.
Originality/value – The study develops a novel conceptual framework for longitudinal estimation of stress-related states using physiological and non-physiological bio-sensor data, given that scientific knowledge on validated longitudinal indicators of stress is in an emergent state.


Introduction
Digital technology, including bio-sensor devices and well-being applications, may facilitate coping with stress and improve individual well-being [1]. Ubiquitous computing literature examines remote stress pattern recognition, data processing and feedback to users. In the context of digital stress pattern recognition, it is particularly important to understand how algorithms for remote stress identification can be used effectively. Automatic remote identification of stress episodes requires training ML algorithms using related data [2,3].
However, previous research has not clearly established the accuracy of indirect measurement through unsupervised algorithms [3][4][5][6] and has not formalized the process of longitudinal evaluation of stress-related states.
In particular, remote stress identification is not always accompanied by validation [4,5]. However, longitudinal evaluation of stress-related states using bio-sensor data requires validated measurements. This study contributes to the stress pattern recognition literature by addressing the gap regarding stress validation. It develops and implements a generalizable conceptual framework for the longitudinal evaluation of stress-related states, using clustering-based algorithms as examples.

Background
Academic literature on pattern recognition, data mining and ML [7][8][9] provides a theoretical explanation of how stress identification algorithms operate. Algorithms can be trained by combining physiological and non-physiological (contextual) activations in the time and frequency domains to infer stress-related states. Algorithms for stress identification may follow supervised [10,11] or unsupervised [2,3] learning approaches. They involve individual [12,13] and ensemble learning [14,15] models. Algorithmic training can be conducted in a laboratory (LAB) setting [13,16] or in an uncontrolled open environment, also known as "in the wild" (IW) [11,15,17]. Different stress identification models are designed for different uses, which correspond to specific types of environments: restricted, semi-restricted and nonrestricted daily life environments [1].
The measurement of stress-related signals can be direct [18,19] or indirect [3][4][5][6]. Direct measurement is usually conducted through isolated signals, while the stress level can be determined based on the level of variable(s) associated with stress. Longitudinal evaluations based on direct measurement usually follow (1) measurement and (2) visualization steps. In contrast, indirect measurement is concerned with stress pattern recognition through identification models (algorithms). Measurement involves physiological and/or nonphysiological (contextual) variables that react to stress triggers. Physiological signals often involve the heart rate (HR), galvanic skin response (GSR) or blood pressure. Nonphysiological variables include user physical activity, voice intensity and location.
There is a relatively small body of stress recognition literature that is concerned with longitudinal estimation of stress. The existing literature focuses particularly on the longitudinal estimation steps, the ML algorithm and indicators. The studies report two contrasting measurement types: direct and indirect. The indirect measurement method differs from research on direct longitudinal measures of stress using measurement and visualization steps [18,19] because it requires additional identification and validation steps to make the derived indices more reliable IW. Specifically, validation of the identification models may use precision, recall, F1-measure, or mean absolute error. Researchers used direct measures of the HR and blood pressure [7] and GSR [8]. Other researchers used indirect measurement with ML algorithms. Some indirect algorithm-based measures follow a fully unsupervised approach [4,5] that does not include validation. Two recent studies include implied IW validation [3,6]. Yet, the role of validation in indirect measurement remains underexplored. The present study aims to address this research gap by proposing and implementing a process model for the longitudinal evaluation of stress-related states.

Proposed process model
The stress pattern recognition literature suggests that longitudinal evaluation could be based on indirect measurement often involving (1) identification, (2) validation, (3) measurement and (4) visualization [3,6]. This study develops and applies a process model for the longitudinal evaluation of stress-related states consisting of four stages (Figure 1): identification through a constructed algorithm (model), model validation, measurement and visualization.
The first stage corresponds to the construction of a stress identification model from IW data. The Yerkes-Dodson law [20] and catastrophe theory of arousal [21,22] may underpin clustering-based stress pattern recognition and subsequent analytic data labeling [23]. The Yerkes-Dodson law assumes the existence of three ranges of arousal: (1) low range of arousal, (2) middle range of arousal and (3) high range of arousal. The first range has a strong connection to relaxation, the second range to optimal performance and the third range to stress.
Validation is the second step, performed by calculating convergent validity coefficients of accuracy and recall based on (1) transfer learning and (2) LAB-collected data with labels. Transfer learning involves training GMM and K-means on data collected IW, with GSR, HR and motion values as input. The fitted model parameters are then used to predict cluster labels for the LAB dataset based on the same input variables. The computational order of fitting the models (COM) at the identification step is as follows: two-cluster K-means and GMM models, then three-cluster K-means and GMM models, where models are fit independently. The estimation and validation consist of four stages: (1) standardize IW data; (2) standardize LAB data and create five splits; (3) fit models on IW data and save IW model parameters (two-cluster K-means and GMM, then three-cluster K-means and GMM, according to COM); and (4) input LAB data to the trained IW models, predict labels and compare them to the LAB data labels based on five-fold splits according to COM.
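The four estimation and validation stages can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the arrays `iw` and `lab` are synthetic placeholders with columns ordered GSR, HR and motion, and the separate scalers, the shuffled folds and the random seeds are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic placeholders (assumption): columns are GSR, HR, motion.
iw = rng.normal(loc=[3.7, 76.0, 1.7], scale=[3.0, 11.5, 1.2], size=(500, 3))
lab = rng.normal(loc=[4.0, 80.0, 2.0], scale=[2.5, 10.0, 1.0], size=(200, 3))

# (1) Standardize IW data; (2) standardize LAB data and create five splits.
iw_z = StandardScaler().fit_transform(iw)
lab_z = StandardScaler().fit_transform(lab)
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
folds = [test_idx for _, test_idx in splitter.split(lab_z)]

# (3) Fit models on IW data: two-cluster models first, then three-cluster models.
models = {}
for k in (2, 3):
    models[f"kmeans_{k}"] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(iw_z)
    models[f"gmm_{k}"] = GaussianMixture(n_components=k, random_state=0).fit(iw_z)

# (4) Predict cluster indices for each LAB fold with the IW-trained models.
predictions = {
    name: [model.predict(lab_z[idx]) for idx in folds]
    for name, model in models.items()
}
```

The predicted cluster indices still have to be mapped to arousal labels before they can be compared with the LAB labels.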
Validation ends by comparing clustering labels and LAB-collected labels to establish accuracy and recall. The final steps are to (3) derive stress-related indicators from identification models and (4) visualize them. The dataset to which this model is applied and data analysis are described below.

Participants and data collection
To apply the process model (i.e. conceptual framework) for longitudinal evaluation of stress-related states, the current research draws upon two datasets. The first dataset (collected IW) serves to fit the models to the data. The second dataset (LAB data) serves to estimate the validation coefficients. The IW dataset (Table 1) was collected from the Municipal Fiscal Administration (Switzerland). A total of 18 public service employees participated in the experiment, which included males and females of different ages. Participants wore wearable biometric devices for about two months. Registrations of the wearable devices included GSR, HR, motion intensity (MI) and other variables. The sampling frequency was 1 Hz. Registrations that did not meet manufacturer quality values were automatically detected and deleted from the dataset.
The validation dataset (i.e. LAB) was collected under LAB conditions (Table 1), where 15 users (male and female) wore RespiBAN sensors under induced stress, amusement and baseline conditions (data labels). The devices produced GSR, HR, accelerometer and other data. The sampling frequency was 700 Hz. Further details on the LAB data collection can be found in the Wearable Stress and Affect Detection (WESAD) scientific report [24].

Pre-processing, feature extraction and measurements
The sampling process for the IW data entailed choosing three days of device use at random for each of the 18 users. In total, 1,018,497 observations appeared in the sample. This corresponds to approximately 283 h of device use over 14 working days in November and 10 working days in December. The sliding windows technique was then used, with a 60-s step size and no overlap, for calculating averages. After deleting incorrect values, the dataset was reduced to 14,133 instances.
Next, pre-processing using the sliding windows technique was performed (window length of one min and 55 s of overlap) on the validation LAB data. The window length corresponded to 42,000 observations, the overlap corresponded to 35,000 observations and the increment corresponded to 7,000 data points. The resulting dataset contained 6,475 instances.
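The windowed averaging described above can be illustrated with a short sketch. The function name `window_means` and the synthetic signal are hypothetical; the sketch only assumes that each instance is the mean of one full window and that the overlap equals the window length minus the step. For a 1 Hz signal a 60-s window spans 60 samples, whereas at 700 Hz it spans 60 × 700 = 42,000 samples.

```python
from statistics import mean

def window_means(signal, window, step):
    """Mean of each full window of `window` samples, advancing by `step` samples."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        mean(signal[start:start + window])
        for start in range(0, len(signal) - window + 1, step)
    ]

# IW-style setting: 1 Hz signal, 60-sample windows, no overlap (step == window).
one_hz = list(range(180))  # three minutes of synthetic samples
no_overlap = window_means(one_hz, window=60, step=60)

# Overlapping setting: neighbouring windows share (window - step) samples.
overlapping = window_means(one_hz, window=60, step=10)
```

A smaller step yields more instances from the same recording, which is why overlapping windows increase the number of instances available for cross-validation.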
Motion intensity and physiological signals (HR and GSR) were combined. Each instance was composed of GSR, HR and MI mean interval values. HR was measured in beats per minute (bpm) [25] and GSR was measured in kilo-ohms (kohms) [26,27]. GSR reflects electrical conductivity, which changes when the skin glands produce ionic sweat [28]. It is linked to the emotional arousal state, where higher levels may increase the GSR [26,27]. Motion is an additional measure reflecting the intensity of physical activity, and its score ranges on a continuous scale from 0 to 100 [29]. For example, GSR or HR may increase with increased physical activity. This, in turn, causes physiological arousal, which does not always imply stress. Therefore, combining physiological sensing with the additional contextual variable of motion may improve stress identification [29].

Data analysis
Data pre-processing and feature extraction for IW data were performed using Matrix Laboratory (MATLAB). Unsupervised models were built using the sklearn module in Python, with original code written for labeling the output of K-means and GMM.

Table 1. Characteristics of datasets

3.4.1 Identification. K-means and GMM models were fit using the IW dataset. In this study, the models were fit to standardized data. Therefore, the procedure involved scaling the data, fitting unsupervised models and assigning labels. The K-means and GMM algorithms [7] were applied to two- and three-cluster problems. K-means performs hard assignments, optimizing the cost function defined in Eqn (1). GMM is a soft clustering algorithm that estimates the cluster means and covariances. It optimizes the cost function defined in Eqn (2).
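For orientation, the standard textbook forms of the K-means and GMM objectives can be written as follows. The notation ($x_i$ for an observation, $C_k$ and $\mu_k$ for a cluster and its mean, $\pi_k$ and $\Sigma_k$ for the mixture weight and covariance) is an assumption and may differ from the notation of Eqns (1) and (2).

```latex
% K-means: within-cluster sum of squared distances (minimized)
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2}

% GMM: log-likelihood of the mixture (maximized, typically via EM)
\ln p(X \mid \pi, \mu, \Sigma)
  = \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)
```

In the first expression each observation belongs to exactly one cluster (hard assignment), while in the second every component contributes to every observation with weight $\pi_k$ (soft assignment).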
Cluster labeling is based on the mean GSR values for each cluster. A cluster with a greater average GSR (acting as the decider variable) receives a label corresponding to a greater level of arousal. For a two-class problem, the cluster with the highest average GSR is assigned the "High Arousal" label. Similarly, the cluster with the lowest average GSR is assigned the "Low Arousal" label.
For a three-class problem, it assigns the label "Emotional Overarousal" to observations having the highest average GSR value and "Arousal" to observations with the second-highest average GSR. Similarly, it assigns the "Relaxation" label to the cluster with the lowest average GSR.
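The labeling rule can be sketched as follows. This is an illustrative reading of the rule, not the original labeling code: the function name `label_clusters`, the `LABELS` table and the sample values are hypothetical, and the only logic assumed is ranking clusters by their mean GSR.

```python
from collections import defaultdict
from statistics import mean

# Labels ordered from lowest to highest mean GSR.
LABELS = {
    2: ["Low Arousal", "High Arousal"],
    3: ["Relaxation", "Arousal", "Emotional Overarousal"],
}

def label_clusters(cluster_ids, gsr):
    """Map each cluster index to an arousal label ranked by mean GSR."""
    by_cluster = defaultdict(list)
    for cid, value in zip(cluster_ids, gsr):
        by_cluster[cid].append(value)
    # Rank cluster indices from lowest to highest mean GSR.
    ranked = sorted(by_cluster, key=lambda cid: mean(by_cluster[cid]))
    return dict(zip(ranked, LABELS[len(ranked)]))

cluster_ids = [0, 0, 1, 1, 2, 2]
gsr = [5.0, 6.0, 0.5, 1.0, 2.5, 3.0]
mapping = label_clusters(cluster_ids, gsr)
```

Because clustering algorithms return arbitrary cluster indices, ranking by the decider variable gives the same label to the same kind of cluster across repeated fits.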
3.4.2 Validation. A five-fold cross-validation step compares the predictions of the clustering output to the LAB data labels. Once cluster labels are assigned, this step proceeds by computing the proportion of cases that match the "laboratory ground truth labels." For two-class learning, convergent validity coefficients are computed by matching clustering-based outputs with relevant and similar classes from the LAB dataset: "High Arousal" == "Stress" and "Low Arousal" == "Non-Stress," where "Non-Stress" comprises "Amusement" and "Baseline Condition." For the three-class learning problem, the convergent validity is based on the comparison rule: "Emotional Overarousal" == "Stress," "Arousal" == "Amusement" and "Relaxation" == "Baseline." A match between clustering labels and LAB labels is defined as a True Positive.
The IW training dataset consisted of 14,132 observations (90.84%). The LAB test dataset consisted of 1,295 observations in each of the five splits (9.16%).
Next, validation accuracy and recall are computed to establish convergent validity and to evaluate to what extent the identification models reflect stress. The accuracy and recall are defined using (4) and (5), respectively, as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (5)$$

where TP is true positive, FP is false positive, TN is true negative and FN is false negative.
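A minimal sketch of how accuracy and recall could be computed for one fold of the two-class problem is shown below. It assumes the conventional binary definitions with "Stress" as the positive class; the function name `accuracy_recall`, the `TO_LAB` mapping table and the sample labels are hypothetical.

```python
def accuracy_recall(true_labels, predicted_labels, positive="Stress"):
    """Return (accuracy, recall) for one fold, treating `positive` as the target class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Recall is undefined when the fold contains no positive cases.
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, recall

# Two-class comparison rule: "High Arousal" == "Stress", "Low Arousal" == "Non-Stress".
TO_LAB = {"High Arousal": "Stress", "Low Arousal": "Non-Stress"}

lab_truth = ["Stress", "Stress", "Non-Stress", "Non-Stress", "Non-Stress"]
clusters = ["High Arousal", "Low Arousal", "Low Arousal", "Low Arousal", "High Arousal"]
acc, rec = accuracy_recall(lab_truth, [TO_LAB[c] for c in clusters])
```

Averaging the per-fold values over the five splits would give the cross-validated coefficients.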

ML algorithms for estimating stress
Finally, the motion variable for the LAB dataset corresponds to the MI as follows:

Measurement and visualization.
Finally, the incidence indicators are computed and visualized. Incidence indicators count the average number of specific stress-related incidents per time window. Stress-related realization corresponds to the identification of the clustering algorithm output. For example, it is possible to count the average number of arousal episodes in a given time window. These counts distributed over time serve to reflect longitudinal stress-related incidence. The identification models serve to derive the high-arousal index and low-arousal index for the two-cluster problem, whereas the emotional over-arousal index, arousal index and relaxation index apply to the three-cluster problem. This approach allows us to obtain a longitudinal index of stress-related incidents associated with accuracy measures. They allow evaluation of the impact of the external environment on stress-related states over time. Furthermore, such indices should account for the number of hours that a biosensor device is active. For example, the number of stress episodes divided by the amount of time when the device is active during a time frame should ensure comparability between data points. This is important when device use is not uniform over time. In the final step, the visualization output is evaluated to account for the longitudinal realizations of stress-related episodes.
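The normalization by active device time can be sketched as follows. The function name `incidence_index`, the one-label-per-minute layout and the sample frames are assumptions made for illustration; the only rule taken from the text is that the number of episodes is divided by the time the device was active during the frame.

```python
def incidence_index(labels, target, windows_per_hour=60):
    """Episodes of `target` per active device hour within one time frame.

    `labels` holds one label per active window; windows without device
    data are assumed to be absent from the list.
    """
    if not labels:
        return 0.0
    active_hours = len(labels) / windows_per_hour
    return sum(label == target for label in labels) / active_hours

# Two time frames with unequal device use (synthetic labels, one per minute).
frame_a = ["High Arousal"] * 12 + ["Low Arousal"] * 108  # 120 active minutes
frame_b = ["High Arousal"] * 6 + ["Low Arousal"] * 24    # 30 active minutes
index_a = incidence_index(frame_a, "High Arousal")
index_b = incidence_index(frame_b, "High Arousal")
```

In this example the first frame contains more high-arousal episodes in absolute terms, yet the second frame has the higher rate per active hour, which is the comparison the normalization is meant to enable.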

Results
This section presents the results of applying the process model (Figure 1).

Identification
To describe the data characteristics, means and standard deviations were computed for HR (M = 76.17; SD = 11.57), GSR (M = 3.75; SD = 2.96) and motion (M = 1.75; SD = 1.17). Average cluster characteristics were computed for two-cluster K-means and GMM models in standardized form, together with variance-covariance matrices for GMM (Table 2). Model characteristics were then used to make predictions using the LAB-collected dataset based on similar inputs: scaled HR, GSR and MI. In total, 50 iterations (i.e. repeated experiments) were performed to fit the data and determine the most common solution. Clusters with the lowest mean GSR values received "Low Arousal" labels and clusters with the highest GSR values received "High Arousal" labels. Results indicate that a greater GSR is associated with a higher HR.

Table 2. Two-cluster models

The next step computes the average cluster characteristics for three-cluster K-means and GMM models in standardized form and variance-covariance matrices for GMM (Table 3). Similar to the two-cluster models, predictions on the LAB dataset were performed by running 50 iterations and determining the most common solution. Clusters with the lowest mean GSR values received "Relaxation" labels; clusters with second-highest mean GSR values received "Arousal" labels and clusters with the highest GSR values received "Emotional Overarousal" labels. The results indicate that the highest average motion values were associated with either the arousal or emotional over-arousal classes.
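One way to determine the most common solution over repeated fits is sketched below for K-means. The text does not specify how solutions were compared, so the approach here is an assumption: each run uses a different random seed, the cluster centers are sorted by the first column (taken to be GSR) and rounded, and the most frequent set of centers is kept. The function name `most_common_solution` and the synthetic data are hypothetical.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

def most_common_solution(data, n_clusters, n_runs=50, decimals=1):
    """Fit K-means `n_runs` times and return the most frequent set of centers."""
    counts = Counter()
    for seed in range(n_runs):
        model = KMeans(n_clusters=n_clusters, n_init=1, random_state=seed).fit(data)
        # Sort centers by the first column so that cluster order does not
        # matter, then round so that near-identical fits compare equal.
        centers = model.cluster_centers_
        ordered = centers[np.argsort(centers[:, 0])]
        key = tuple(map(tuple, np.round(ordered, decimals)))
        counts[key] += 1
    solution, frequency = counts.most_common(1)[0]
    return np.array(solution), frequency

rng = np.random.default_rng(1)
low = rng.normal(loc=[-1.0, -1.0, -1.0], scale=0.1, size=(100, 3))
high = rng.normal(loc=[1.0, 1.0, 1.0], scale=0.1, size=(100, 3))
centers, frequency = most_common_solution(np.vstack([low, high]), n_clusters=2)
```

The returned frequency indicates how stable the solution is: a value close to the number of runs suggests that the clustering does not depend much on initialization.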

Validation
Two-cluster and three-cluster K-means and GMM models were trained using the IW dataset (Table 4). Data labels were then predicted using the LAB dataset, where the labels obtained from models trained on the IW data were compared to the LAB labels (true states). Results indicate that the two-cluster model is more relevant for reflecting stress, whereas the three-cluster model may be more relevant for reflecting arousal. Two-cluster models distinguish between stress and any other state with accuracy greater than 70%, whereas the three-cluster models only achieve accuracies slightly above 45%.

Measurement and visualization
Measurement is defined as an average of stress-related incidence during a time interval using K-means clustering. It corresponds to deriving "High Arousal" and "Low Arousal" indices based on the two-cluster model and "Relaxation," "Arousal," and "Emotional Overarousal" indices for the three-cluster model. Low-arousal episodes dominated stress-related dynamics up to 450 min (Figure 2); however, there was a mixture of high and low arousal states after 450 min. Figure 2 also shows some emotional over-arousal episodes in later stages.

Table 3. Three-cluster models

Table 4. Validation coefficients for K-means and GMM models

Discussion and conclusion

Findings
This study proposed a conceptual framework for the longitudinal estimation of stress-related states through bio-sensor data consisting of four stages: (1) identification, (2) validation, (3) measurement and (4) visualization. The proposed process model facilitated the development of stress-related indices and could be generalized to supervised and unsupervised learning models. The results indicate that it is possible to use unlabeled datasets collected IW for two-class stress pattern recognition, achieving an accuracy of approximately 70%. The findings reveal that it is possible to visualize longitudinal stress-related incidence using line plots. Thus, the evaluation framework provides additional information regarding the extent to which a longitudinal index reflects stress.
A comparison between the current study and the background literature reveals both similarities and divergence points (Table 5). The approach used in this study is comparable in longitudinal estimation steps to those used by other researchers. The present research reflects Study 3 [4] and Study 4 [5] involving three estimation steps: identification, measurement and visualization. The evaluation framework is similar to that found in Study 5 [3] and Study 6 [6] involving four estimation steps.
The ML algorithms used are also aligned with extant stress recognition literature. Clustering algorithms were used in Study 4 [5] and Study 5 [3], similar to the proposed evaluation framework. The HR measure was previously used in Study 1 [18], Study 5 [3] and Study 6 [6]. GSR was used in Study 2 [19] and motion was used in Study 3 [4]. Most importantly, the application of clustering algorithms is consistent with the methodological literature on pattern recognition, data mining and ML [7][8][9].
Simultaneously, the present study diverges from extant research and broader methodological studies on stress recognition. Compared with the present study, which proposes a formalized process model for longitudinal estimation involving four explicit steps, Study 5 [3] and Study 6 [6], also using four estimation steps, do so implicitly and without deliberately formalizing them. Additionally, Study 3 [4] and Study 4 [5], focusing on indirect measures, do not include a validation step.
Furthermore, the ML algorithm differs from Study 5 [3] and Study 6 [6]. The present study is concerned with the validation of the IW data analysis results based on a LAB setting, while Study 5 [3] and Study 6 [6] focus on IW validation. More specifically, the present study applies transfer learning when a model trained using IW data is tested using labeled data collected in a LAB setting. Moreover, the proposed evaluation framework uses emotional arousal and relaxation indicators, diverging from the extant literature except for Study 4 [5], drawing upon an index of arousal. However, our results on stress indicators are more consistent with a recent study using a theoretically justified model for identifying three stress-related states [23].
The current study contributes to the stress pattern recognition literature in two ways. First, it is the first study that formalizes four steps of longitudinal evaluation of stress-related states based on indirect measurement in a distinguishable and consecutive manner. Drawing upon the convergent validity criteria, this model applies methodological rigor to the process of longitudinal evaluation of stress-related states. Demonstrating the possibility of using transfer learning is the second contribution of the present study to the stress pattern recognition literature.

Limitations
Despite its contributions to the stress recognition literature, the present study has limitations. The visualization includes data windows with uniform hours of use, but variable user activity over disjoint time frames should be analyzed to determine whether adjustments are required. Visualization could also involve benchmark curves for high and low stress, helping to situate excessive stress, optimum and relaxation states, although the gradient color background used for visualization of regions with high stress intensity and relaxation already enhances the graphical interface, helping to situate those states. Although there is a large sliding window overlap in the LAB-collected dataset, it allows the creation of large numbers of instances for cross-validation. Finally, the motion indicator for the IW dataset is automatically generated based on manufacturer data and may be computationally different from the MI derived from LAB data.

Implications for practice
The findings involving indirect measurement of stress-related states through unsupervised learning models imply that indirect measurement may support longitudinal stress monitoring. Digital monitoring may become an integral part of digital applications embedded in mobile, portable, wearable devices and/or context-based recommendation systems. Finally, the present study confirms that digital stress management is relevant in both work and private environments.

Future research
Future studies should consider the identification of stress-related episodes through deep learning. Notably, ensemble learning involving several models and indicators can improve the theoretical representation of stress constructs and enhance the reliability of stress pattern recognition. Future research should examine this approach. Furthermore, predictive time series models of stress could be developed to identify pre-stress conditions. More research should be dedicated to validating remote stress pattern recognition algorithms in real-life contexts. Future research could address individual reactions to stressors, as well as the complementarity and the pros and cons of training the algorithms in several environments.

Conclusion
This study concludes that indirect stress identification models may serve to derive valid stress-related indices. This study proposed a conceptual framework for the longitudinal estimation of stress-related states through bio-sensor data for any type of stress identification model and confirmed its convergent validity. The findings show that indirect measurement may support longitudinal stress monitoring in real time. Consequently, the present study confirms the feasibility of digital stress management in professional and private settings.