Neonatal pain detection in videos using the iCOPEvid dataset and an ensemble of descriptors extracted from Gaussian of Local Descriptors

Diagnosing pain in neonates is difficult but critical. Although approximately thirty manual pain instruments have been developed for neonatal pain diagnosis, most are complex, multifactorial, and geared toward research. The goals of this work are twofold: 1) to develop a new video dataset for automatic neonatal pain detection called iCOPEvid (infant Classification Of Pain Expressions videos), and 2) to present a classification system that sets a challenging comparison performance on this dataset. The iCOPEvid dataset contains 234 videos of 49 neonates experiencing a set of noxious stimuli, a period of rest, and an acute pain stimulus. From these videos 20 s segments are extracted and grouped into two classes: pain (49) and nopain (185), with the nopain video segments handpicked to produce a highly challenging dataset. An ensemble of twelve global and local descriptors with a Bag-of-Features approach is utilized to improve the performance of some new descriptors based on Gaussian of Local Descriptors (GOLD). The basic classifier used in the ensembles is the Support Vector Machine, and decisions are combined by sum rule. These results are compared with standard methods, some deep learning approaches, and 185 human assessments. Our best machine learning methods are shown to outperform the human judges.


Introduction
Erroneous beliefs about the nociceptive physiology of neonates have greatly hindered research into neonatal pain. For many years, it was assumed that pain pathways were not fully developed in neonates [1]. Not only did this assumption inhibit research in this area until the mid-1980s, but it also justified inhumane surgeries performed on neonates without full anesthesia or postoperative pain interventions [2]. Not until studies revealed that nociceptive systems are fully developed in neonates did the scientific community take neonatal pain seriously [2]. Today it is widely recognized that infants experience acute physiologic, behavioral, metabolic, cortical, and hormonal responses to pain. Given that pain inhibitory pathways are not fully developed in infants, neonates may even experience greater levels of pain than do older children and adults [3].
More recent studies show that if left untreated, pain in neonates can have long-term effects on neurologic and behavioral development [4][5][6][7][8], especially in preterm infants [9][10][11]. Repeated unexpected acute pain experiences have been associated with slower weight gain [12], alterations in cerebral white matter and subcortical gray matter [13], and changes in the brain mechanisms involved in blood and oxygen supply [14,15]. These changes in the brain of neonates can result in learning and developmental disabilities as well as in lower IQ (for a review of the literature, see [16]). For all these reasons, it is critical that pain in neonates be detected and treated as early as possible.
Diagnosing pain in neonates is difficult, however. The clinical definition of pain assumes that the person experiencing pain has the ability to describe the location, duration, quality, and intensity of his or her pain experiences. Unable to speak, neonates must rely exclusively on the proxy judgments of others [17]. As a result, assessing pain in neonates is highly subjective and often results in misdiagnosis with the unwanted consequences of unnecessary interventions and failures to treat. Moreover, there is no universal method of evaluation [18] or consensus on the best proxy modalities (physiologic, behavioral, metabolic, or cortical) for assessing infant pain [19]. Recent research even suggests a discordance among, as well as within, specific modalities [20,21].
Most current pain instruments for neonatal pain assessment make use of two modalities: physiological indicators of pain (such as changes in heart and respiratory rates, blood pressure, vagal tone, and palmar sweating) [22] and behavioral indicators (such as body movement, crying, and facial expressions) [23]. The most popular validated scales are NIPS (Neonatal Infant Pain Scale) [24], PIPP (Premature Infant Pain Profile) [25], specifically, PIPP-R (Revised) [26], and DAN (Douleur Aiguë du Nouveau-né) [27]. Use of such scales is not without controversy, however [18]. Although approximately thirty neonatal pain instruments have been developed, most are complex, multifactorial, and geared towards research, with few having validated clinical utility. Complaints concern the complexity of the instruments (involving so many measurements that pain scores are delayed) and the inability of most instruments to provide a reliable indication of pain over time, which is clinically important, for example, for assessing chronic and repeated acute pain and the effectiveness and necessity of additional treatment. Studies have also shown problems with observer bias based on professional attitudes towards the instruments, desensitization due to overexposure to neonatal pain in the clinic, context, culture, gender, and perceived attractiveness of the neonate [28][29][30][31]. One way of overcoming the complexity of pain diagnosis and of reducing delays in treatment is to develop machine systems capable of automatically detecting pain in neonates. Not only would such systems help counter observer bias, but they would also offer the advantage of providing close monitoring of neonates over extended periods of time.
Research in automatic pain recognition began to take off about a decade ago with the research of 1) Brahnam et al. [32][33][34][35][36], who explored classifier systems to detect the facial expressions of pain in 204 static images of neonates experiencing stressful stimuli, 2) Barajas-Montiel and Reyes-García [37], who explored classifying cry states in 1623 samples, and 3) Pal et al. [38], who combined facial features with cry features from 100 samples to recognize different emotional states, including pain. In the last ten years, research in automatic pain detection has continued to focus on these two behavioral indicators of neonatal pain: facial expressions [32][33][34][35][36][39][40][41][42][43][44][45][46] and infant cries [38,47]. Little research to date has made use of the various physiological measures to detect pain [37,[48][49][50], and none that we are aware of have involved neonates. The general consensus in the literature is that physiological measures alone are highly unreliable primarily because responses vary greatly among individual infants, but it is also difficult to distinguish physiological changes associated with pain from those associated with other distressful emotions, such as fear and anxiety [51] (for a recent review of the literature on physiological indices of acute pain in neonates, see [52]).
In this paper we focus on the automatic detection of pain in videos of neonates. Most research to date uses the iCOPE (infant Classification Of Pain Expressions) benchmark dataset [33][34][35], which contains 204 photographs of 26 neonates experiencing different noxious stimuli, rest, and an acute pain stimulus. This database was developed to detect pain experiences but was expanded by Gholami et al. [40,41] to classify pain intensity as well. The iCOPE dataset has been used to evaluate state-of-the-art single classifiers, such as Support Vector Machines (SVM) [33,34], Relevance Vector Machines [40,41], neural networks [32,34], ensembles fusing different texture descriptors, such as the Local Binary Pattern (LBP) and its variants [36,39], and handcrafted features combined with deep features [46]. Research using this dataset is limited in value, however, since the dataset fails to capture the dynamic patterns of facial expressions, which most likely provide important information for discriminating pain.
Recently, a couple of researchers have worked on motion-based pain detection. Zamzami et al. [42], for example, used optical flow, a well-known method of motion estimation, to calculate optical strain magnitudes in the facial tissues of ten infants video recorded during an acute procedure. SVM and K-nearest-neighbors (KNN) classifiers were used to classify segmented expressions into pain and nopain categories with 96% accuracy. In Sikka et al. [53], a method based on FACS (Facial Action Coding System) was used to estimate pain levels in the facial expressions of 50 youth experiencing both acute and chronic pain. An SVM produced an AUC of 0.84 for acute pain detection and 0.94 for chronic pain detection. A major drawback in these studies involves the datasets: the infant dataset contains a small number of samples of a small group of infants and the contrasting stimuli to pain may not have been sufficiently challenging; moreover, the larger child dataset contains no videos of neonates.
The main contributions of this study are two-fold. Our first goal was to develop a challenging database of neonatal pain video, which we call iCOPEvid, that would be made available to other researchers. This dataset contains video of 49 neonates experiencing a set of noxious stimuli, periods of rest, and an acute pain stimulus. Raw video images were segmented into 20 s segments of pain (49) and nopain (185), with the nopain video segments handpicked to produce a highly challenging dataset (we selected video segments that resemble pain expressions but are not stimulated by a pain stimulus). Our second goal was to develop a system for classifying the iCOPEvid segments into the two categories of pain and nopain and compare it with other state-of-the-art classifier systems, with the aim of producing a challenging classification rate for future comparisons. The general outline of the classification approach proposed in this paper is as follows: the neonate's face is detected using Discriminative Response Map Fitting (DRMF), proposed in [54], which is based on a discriminative regression approach for the Constrained Local Models (CLMs) framework; the image is enhanced using [55], which is based on both an intra-layer optimization and inter-layer aggregation; descriptors are used to represent each frame (the whole image is divided into four equal regions, and from each region and from the whole image a set of features is extracted and used to train an SVM); the SVMs trained on these descriptors are fused; and the scores of the different frames of each video are then combined to classify the video segments.
For improving the texture descriptors based on Bag-of-Features (BoF), we build an ensemble of codebooks for the well-known Bag-of-Words (BoW) method [56]. These codebooks are obtained using various strategies: different patch sizes, different PCA (Principal Component Analysis) projections [57] for reducing the size of the extracted feature vectors, and different patch shapes. As a result, a variety of global texton vocabularies are created, and for each vocabulary a different SVM is trained. Each PCA projection retains a different training subset for building a different projection matrix; this technique provides a method for building an ensemble of classifiers by varying the projection matrix. In addition, we propose some variants of the Gaussian of Local Descriptors (GOLD) [58], itself an improvement of BoW [56], for extracting features from an image. Our idea is to treat the covariance matrix obtained using the standard GOLD approach as an image that can be represented using standard texture descriptors. Different local descriptors, codebook generation methods, and subwindow configurations are combined to form an ensemble for detecting neonatal pain.
The remainder of this document is organized as follows. In Section 2 we provide a detailed description and justification of the study design, including details on how the raw data were collected, processed, selected, and organized. In Section 3 we describe twelve powerful texture descriptors. GOLD is described in Section 4, and BoF in Section 5. Our ensemble approach is presented in Section 6, and results comparing these approaches as well as several Convolutional Neural Network (CNN) approaches are presented in Section 7. In Section 8, we present 185 human assessments on the same dataset and describe this study in detail. Results in Sections 7 and 8 show that our best ensemble, which combines the best feature set with the GOLD approach, outperforms all other methods, including those of the human judges. We conclude in Section 9 with suggestions for future research. The iCOPEvid dataset is freely available and can be obtained by emailing the first author. All MATLAB source code used in our experiments is included in the dataset.

iCOPEvid pain video dataset
The objective of the iCOPEvid dataset was to obtain a representative yet highly challenging set of video sequences of neonatal facial expressions for evaluating automatic neonatal pain detection systems. In the medical literature, a number of noxious stimuli are typically compared in studies on neonatal pain: a pain-inducing stimulus (pinprick or puncture of a lancet) and various noxious stimuli, such as friction on the external lateral surface of the heel [59,60] and disturbances such as exposure to bright light [31] and diaper change [61]. These noxious stimuli are also compared with facial expressions in a resting state.
Following the designs of these studies, the first still image iCOPE database [32,34,35,62] collected a series of facial photographs of 26 infants (13 boys and 13 girls) responding to four stimuli: 1) the puncture of a heel lance, 2) friction on the external lateral surface of the heel, 3) transport from one crib to another, and 4) an air stimulus to provoke an eye squeeze (a facial characteristic of neonatal expressions of pain). After each stimulus, the infants were given a period of rest and photographs were collected during this period. Infants could be crying, awake, or asleep during the resting period. In addition, all infants were swaddled to obtain unobstructed images of each infant's face.
The goal of the iCOPEvid database was to obtain raw video of infants as they naturally appear in a hospital neonatal nursery, where some infants are swaddled and some are not and where flailing fists and legs often occlude facial expressions. The stimuli introduced in this study are also those commonly experienced by neonates in the nursery and include a state mandatory blood test requiring puncture of a heel lance (pain stimulus), friction on the external lateral surface of the heel administered by a cotton swab dabbed in alcohol in preparation of the heel lance, and recordings during and after two additional movement stimuli: transport from one crib to another and occasional physical disturbances. In addition, raw video was collected from neonates left undisturbed for a period of time (at rest) between stimuli administration. As with the iCOPE study design, videos were collected during the resting periods, and infants could be crying, awake, or asleep during these breaks between stimuli.

Subjects
This study complied with the protocols and ethical directives for research involving human subjects at Mercy Hospital in Springfield, Missouri, USA. Informed consent was obtained from a parent, usually the mother in consultation with the father. Parents were recruited in the neonatal unit of Mercy Hospital shortly after delivery. Only mothers who had experienced uncomplicated deliveries were approached. Video images were collected from 49 neonates (26 boys and 23 girls) ranging in age from 34 to 70 h. Forty-one infants were Caucasian, one African American, one Korean, two Hispanic, and four interracial. Five males had been circumcised the day before and one male 43 h before the videos were taken. The last feeding time before video sessions ranged from 20 min to approximately 4 h. All infants were in good health.

Apparatus
All High Definition (HD) video was captured using a Sony HDR-XR520V HDD Camcorder following a white balance under ambient lighting conditions in a room separated from other newborns. The resolution of the video sequences, recorded as MP4 files, is 1920 × 1080 with a frame rate of 29.97 FPS.

Procedure
The facial expressions of the neonates were captured in one session. All stimuli were administered by an attending nurse. A total of 16 neonates were swaddled, and 33 were allowed to move freely during and following stimulus administration; as a result, the unswaddled infants often obstructed their faces with their hands and feet or by twisting their bodies.

Following the requirements of standard medical procedures, videos of four stimuli and three resting periods were taken in the following sequence:
(1) Movement 1 Stimulus (Rest/Cry): The neonate was transported from one crib to another, after which 1 min of video was taken. The state of the neonate was noted as either crying or resting.
(2) Resting 1: The infant was allowed to rest undisturbed for 1 min and video was taken of the infant; the infant's state during this period varied.
(3) Movement 2 Stimulus: The second movement stimulus was applied by periodically physically disturbing the neonate for 1 min. Video was taken during the disturbance and for 1 min following the application of the stimulus.
(4) Resting 2: The infant was allowed to rest undisturbed for 1 min and video was taken of the infant; the infant's state during this period varied.
(5) Friction Stimulus: The neonate received vigorous friction on the external lateral surface of the heel with cotton wool soaked in 70% alcohol for one full minute, and 1 min of video was taken as soon as the stimulus was applied.
(6) Resting 3: The infant was allowed to rest undisturbed for 1 min and video was taken of the infant; the infant's state during this period varied.
(7) Pain Stimulus: The external lateral surface of the heel was punctured for blood collection, and at least 1 min of video was taken, starting immediately after the introduction of the lancet and while the skin of the heel was squeezed for blood samples.

Image classes
A total of 234 video segments were extracted from the raw video footage and appropriately labeled for the iCOPEvid dataset. The first 20 s of video immediately following administration of the four noxious stimuli were extracted to form the pain/nopain categories, along with one to four 20 s resting segments that presented, where possible, facial expressions that varied and that ideally shared characteristics similar to pain expressions (e.g. yawning and crying). As illustrated in Figure 1, the resting segments were hand-selected by two of the investigators to provide difficult expressions to classify. The pain category contains 49 video segments (one for each infant) and the nopain category contains 185 video segments (several from each infant).

Evaluated texture descriptors
In this section, we describe twelve of the texture descriptors tested in this paper. Most are LBP-based [63] approaches, which depend on an analysis of patterns within local pixel neighborhoods; descriptors extracted from these neighborhoods are then aggregated to form a global representation for the whole image or for larger regions of an image [64]. The LBP variants outlined here include Local Ternary Patterns (LTP) [65], Local Phase Quantization (LPQ) [66], Binarized Statistical Image Features (BSIF) [67], Local Binary Pattern Histogram Fourier (LHF) [68], Rotation Invariant Co-occurrence among adjacent LBPs (RICLBP) [69], Extended Local Binary Patterns (ELBP) [70], and Local Configuration Pattern (LCP) [71].
Because the descriptors presented in this section have been extensively described in the literature and are familiar to most researchers in image/video classification, we only very briefly describe them in Table 1, where we also provide the parameter settings used for each descriptor. The two descriptors based on Bag-of-Words (BoW), Gaussian of Local Descriptors (GOLD) [58] and Bag-of-Features (BOF) [77], on which our approach is based, are extensively discussed in Sections 4 and 5, respectively. We compare our approach with CNN-based descriptors at the end of the experimental section. Thus, in this study, we compare and evaluate the main types of texture descriptors as outlined in Liu et al.'s [64] recent review of the literature on texture descriptors.

Gaussian of Local descriptors (GOLD)
In this paper we not only test the original Gaussian of Local Descriptors (GOLD) features [58] in our ensembles, but we also treat the covariance matrix C generated in the process (see Section 4.4 below) as an image from which we extract standard texture descriptors. GOLD [58] improves the BoW [77] approach, which is designed to handle viewpoint, illumination, occlusions, and inter-class variability. A bag of words is a sparse histogram over a vocabulary in a group of documents. In computer vision, a bag of visual words is a histogram over a vocabulary of local image features. The BoW descriptor extracts local features to generate a codebook that is then used to encode local features into codes that form a global image representation. The codebook generation step is based on vector quantization, typically through k-means clustering. Once BoW features are extracted, they are sent to a classifier for training and testing.
Basing the final step in BoW on vector quantization is not ideal because it binds the dataset characteristics too tightly to the feature representation, which creates a hidden dependency that is reflected in many specializations of BoW [79,80]. This has led researchers to investigate building codebooks that are more independent [79,80]. One way to avoid hidden dependencies between training set characteristics and the feature representation is simply to eliminate the quantization step. In [81] this is accomplished by modeling each set of vectors by a probability density function (pdf) that is compared with a kernel defined over the pdfs. Choosing a pdf with zero-mean Gaussian distribution, as proposed by [81], is equivalent to using average pooling, which is what was done in a method proposed by [82]. This insight led

to the method proposed in [58], where a reference pdf that is a multivariate Gaussian distribution with mean and full covariance is employed to obtain the new vector representation of GOLD, which uses the dot product to approximate a distance between distributions. GOLD has recently been shown to obtain state-of-the-art performance on many publicly available datasets [58]. The GOLD approach is a four-step process [58]:
Step 1. Feature extraction: dense SIFT descriptors are extracted on a regular grid of the input facial image.
Step 2. Spatial pyramid decomposition: the facial image is decomposed into subregions via multilevel recursive image decomposition, and features are softly assigned to regions based on a local weighting approach.
Step 3. Parametric probability density estimation: each region is represented as a multivariate Gaussian distribution of the extracted local descriptors by inferring the local mean and covariance.
Step 4. Projection on the tangent Euclidean space: the covariance matrix is projected on the tangent Euclidean space and concatenated to the mean, producing the final region descriptor.
[Fragment of Table 1: LTP (Local Ternary Patterns) is a three-value variant of LBP that addresses LBP's high sensitivity to noise in near-uniform regions [78]. We used two (radius, neighboring points) configurations, (1,8) and (2,16), and tested both LTP with uniform bins (LTP-u) and LTP with rotation invariant uniform bins (LTP-r).]
Each of these steps, along with some variations applied in this work, is detailed more fully in sections 4.2-4.5 below.

Details step 1: feature extraction
In the original description of GOLD [58], feature extraction is performed by calculating SIFT descriptors (we use the function vl_phow from the vl_feat library [68]) or their color variations at four scales, defined by setting the width of the spatial bins to {4, 6, 8, 10} pixels over a regular grid spaced by three pixels. Of course, features could be described using other methods; in this work, however, we use the features proposed in the original GOLD paper.

Details step 2: spatial pyramid decomposition
Spatial pyramid decomposition is accomplished by incrementally dividing a given image into increasingly smaller subregions, starting with level zero, where the decomposition is the entire image, followed by level one, where the image is subdivided into four quadrants, and so on. A soft assignment of descriptors to these regions is performed using a weighting strategy that assigns each descriptor a weight based on its distance from the region's center. Given a region R, centered at (c_x, c_y) and with dimensions R_w × R_h, and a local descriptor D ∈ ℝ^n computed at (d_x, d_y), a weight w(D, R) is computed for the pair. The function w(D, R) takes values in the range [0, 1] and is designed such that descriptors extracted from the center of the region receive the maximum weight, while descriptors placed on the border between two regions are equally considered for both regions. In this work one level of decomposition is used.
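The stated properties of the weighting (maximum weight at the region center, equal weight for both regions on a shared border) can be illustrated with a simple separable function. This exact form is our assumption for illustration and is not necessarily the formula used in [58]:

```python
def region_weight(dx, dy, cx, cy, rw, rh):
    """Illustrative soft-assignment weight for a descriptor at (dx, dy)
    w.r.t. a region centered at (cx, cy) with dimensions rw x rh.
    Separable form: 1.0 at the region center, 0.5 on the border shared
    with a horizontally or vertically adjacent region (so both regions
    weight a border descriptor equally), decaying to 0 farther away.
    NOTE: a plausible sketch only, not the exact formula from [58]."""
    wx = max(0.0, 1.0 - abs(dx - cx) / rw)
    wy = max(0.0, 1.0 - abs(dy - cy) / rh)
    return wx * wy
```

With 100 × 100 regions, a descriptor at a region's center gets weight 1.0, while one on the vertical border between two regions (50 pixels from either center) gets weight 0.5 from each.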

Details step 3: parametric probability density estimation
Parametric probability density estimation uses the set of weighted descriptors belonging to each region to infer the parameters of a multivariate Gaussian distribution. If D = {D_1, ..., D_N} is the set of weighted local descriptors of a region, μ ∈ ℝ^n the mean vector, and C ∈ ℝ^(n×n) the covariance matrix of the multivariate Gaussian distribution of the region, then the standard Gaussian parameter estimates are computed over the weighted descriptors:

μ = (1/N) Σ_{i=1}^{N} D_i,   C = (1/N) Σ_{i=1}^{N} (D_i − μ)(D_i − μ)^T.

In this work we show that the covariance matrix C can be treated as an image so that standard texture descriptors can be used to describe it. Each of these feature vectors is fed into an SVM with a radial basis function kernel [83].
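The estimation step and the covariance-as-image idea can be sketched as follows. The weighted-estimate normalization is our assumption, and the min-max rescaling of C to [0, 255] is one plausible way to turn the covariance matrix into a grayscale image for texture descriptors:

```python
import numpy as np

def region_gaussian(descriptors, weights):
    """Estimate the mean and covariance of a region's local descriptors
    (rows of `descriptors`), with the soft-assignment weights applied.
    The normalization by the weight sum is our assumption; [58] may
    normalize differently."""
    w = weights / weights.sum()
    mu = w @ descriptors                       # weighted mean, shape (n,)
    centered = descriptors - mu
    C = (centered * w[:, None]).T @ centered   # weighted covariance, (n, n)
    return mu, C

# C is symmetric positive semi-definite; rescaling it to [0, 255] lets it
# be treated as a grayscale image from which texture features are extracted.
rng = np.random.default_rng(0)
D = rng.normal(size=(200, 8))                  # 200 local descriptors, n = 8
mu, C = region_gaussian(D, np.ones(200))
C_img = np.uint8(255 * (C - C.min()) / (C.max() - C.min()))
```

In our system, standard texture descriptors are then computed on `C_img` exactly as they would be on any grayscale image.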

Details step 4: projection on the tangent Euclidean space
The parameters of μ and C provide a good representation of a region but are extremely high in dimensionality. To obtain a descriptor that can be fed into a classifier, the covariance matrix C is mapped into a point in the Euclidean space and concatenated to the mean μ so that the final region descriptor is a fixed length descriptor appropriate for linear classifiers based on the dot product.
Projection is performed in two steps: first, the covariance matrix C is projected on a Euclidean space tangent to the Riemannian manifold at a specific tangency matrix P; second, the orthonormal coordinates of the projected vector are extracted. In [58] the best choice for P was determined to be the identity matrix I, since the neighborhood relation between the points in the new space remains unchanged wherever the projection point P is located. The projection of the covariance matrix C into a vector point c therefore simply applies the vector operator to the standard matrix logarithm:

c = vec(log(C)),

where log is the matrix logarithm operator and vec is the vector operator on the tangent space at the identity, which for a symmetric matrix M is defined as vec(M) = [m_{1,1}, √2 m_{1,2}, √2 m_{1,3}, ..., m_{2,2}, √2 m_{2,3}, ..., m_{n,n}]. The final GOLD descriptor has size (n² + 3n)/2. This feature vector is fed into an SVM with a histogram kernel.
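Step 4 can be sketched as follows. Computing the matrix logarithm through an eigendecomposition and adding a small regularization `eps` (to keep the log well defined for near-singular covariances) are implementation choices of ours, not details taken from [58]:

```python
import numpy as np

def gold_vector(mu, C, eps=1e-6):
    """Project covariance C onto the tangent space at the identity via the
    matrix logarithm, vectorize the upper triangle with sqrt(2) scaling on
    off-diagonal entries, and concatenate the mean mu. The result has
    length n + n(n+1)/2 = (n^2 + 3n)/2."""
    n = C.shape[0]
    # matrix logarithm of a symmetric positive-definite matrix via eigh
    vals, vecs = np.linalg.eigh(C + eps * np.eye(n))
    logC = vecs @ np.diag(np.log(vals)) @ vecs.T
    # vec(): row-major upper triangle, off-diagonals scaled by sqrt(2)
    iu = np.triu_indices(n)
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    c = scale * logC[iu]              # length n(n+1)/2
    return np.concatenate([mu, c])    # length (n^2 + 3n)/2
```

For example, with n = 6 the final descriptor has (36 + 18)/2 = 27 elements.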

Bag-of-Features (BOF)
BOF [77], inspired by BoW [77], learns a visual vocabulary from extracted features and represents images by the frequencies of the extracted visual words. BOF feature extraction is basically a four-step process.
Step 1. Data organization and partitioning: the purpose of this step is to set up image category sets by first organizing a given image dataset into categories and second by partitioning the dataset into training and testing sets.
Step 2. Feature extraction: in this step features (such as SIFT) are extracted from representative images in each category. There are many ways to extract features: sample features along a grid structure as with GOLD above, or extract features using interest point detectors or salient regions.

Step 3. Dictionary/vocabulary building: in this step a vocabulary of visual words is constructed using k-means clustering on the features extracted in Step 2 from the training set. The resulting k clusters are mutually exclusive and compact, each grouping features with similar characteristics. Each cluster center becomes a visual word, and the collection of centers (centroids) becomes the dictionary or codebook of visual words.
Step 4. Vector quantization: in this step an image can be quantified and abstractly represented using the dictionary by applying the following processes: i. Extract features from the given image as in Step 2; ii. For each feature, compute (using some distance measure) its nearest neighbor in the dictionary produced in Step 3; iii. Take the set of nearest neighbor labels and build a histogram of size k. The histogram length corresponds to the number of visual words, and the bins are incremented based on the proximity of the descriptor to a particular cluster center. The histogram is the BOF descriptor.
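Steps 3 and 4 above can be sketched as follows, with a minimal k-means (Lloyd's algorithm) standing in for a library implementation; feature extraction (Step 2) is assumed to have already produced the rows of `features`:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means for codebook construction (Step 3): each of the k
    centroids becomes one visual word."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers

def bof_histogram(features, codebook):
    """Step 4: assign each local feature to its nearest visual word and
    return the normalized k-bin histogram -- the BOF descriptor."""
    d = ((features[:, None, :] - codebook[None]) ** 2).sum(-1)
    hist = np.bincount(d.argmin(1), minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```

The resulting fixed-length histogram is what gets fed to the SVM, regardless of how many local features the image produced.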
In this work a simple BOF approach is used for codebook assignation: each descriptor is assigned to one codebook according to the minimum distance criterion, and a dense sample keypoint extraction is performed. SVM is used as the classifier.
Since several feature descriptors are tested, several SVMs are trained, with classifier results combined using the sum rule (where the final score is the sum of the scores of a pool of classifiers that belong to an ensemble). Before fusion, the scores of each classifier are normalized to mean 0 and standard deviation 1.
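The normalization and sum-rule combination described above can be sketched as:

```python
import numpy as np

def sum_rule_fusion(score_matrix):
    """Sum-rule fusion of an ensemble: each row of `score_matrix` holds one
    classifier's scores over the test patterns. Every row is first
    normalized to mean 0 and standard deviation 1 (z-score), then the
    normalized scores are summed across classifiers."""
    s = np.asarray(score_matrix, dtype=float)
    z = (s - s.mean(axis=1, keepdims=True)) / s.std(axis=1, keepdims=True)
    return z.sum(axis=0)
```

The z-score step matters because SVMs trained on different descriptors produce scores on different scales; without it, one classifier could dominate the sum.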
To improve performance, an ensemble of codebooks is built using the following strategies. First, a different set of visual words is created for each class of the dataset: images are clustered in groups of NIMG = 50 (due to computational issues), and for each group visual words are built by clustering the local descriptors with k-means, with k randomly selected between 10 and 40; for each descriptor the final vocabulary (codebook) is obtained by concatenating the visual words over all classes (this is done because k-means is an unstable clustering approach). Second, a set of 50,000 patches is randomly extracted from the training set (considering the different classes), and these patches are then used to construct a PCA matrix (one projection matrix for each descriptor). This step is done three times: the first two times by retaining 99.9% of the variance and the third time by retaining 98% of the variance; a different codebook is created from each of these matrices and used to train a separate SVM. Third, each image is divided into overlapping patches whose size is specified as a percentage (ps) of the original image, taken at fixed steps st = min(ps × l, ps × h)/2, where l × h is the size of the original image. We tested different values of ps; in our final version, we set ps = 12% and ps = 8%. Fourth, different shapes of the extracted patches are used, both square and circular.
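The PCA strategy, building a projection matrix from randomly sampled training patches while retaining a target fraction of the variance, can be sketched as follows (the eigendecomposition route is our choice, and the patch sampling is assumed already done):

```python
import numpy as np

def pca_matrix(patches, retained=0.999):
    """Build a PCA projection matrix from sampled patches (one patch per
    row), keeping the smallest number of leading components whose
    cumulative explained variance reaches `retained` (99.9% or 98% in the
    text). Project new data with: X_centered @ pca_matrix(...)."""
    X = patches - patches.mean(0)
    # eigendecomposition of the covariance; eigh returns ascending order
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    vals, vecs = vals[::-1], vecs[:, ::-1]
    ratio = np.cumsum(vals) / vals.sum()
    k = int(np.searchsorted(ratio, retained) + 1)
    return vecs[:, :k]
```

Because each of the three runs samples its own patches and retains a different variance fraction, each run yields a different projection matrix, and hence a different codebook and SVM, which is what makes the ensemble diverse.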

Proposed ensemble approach
In our proposed approach, a different SVM is trained for each descriptor. We used two different kernels in our experiments: 1) a histogram kernel for BOF and the original GOLD algorithm, and 2) the radial basis function kernel for all the other descriptors (including the image variants of GOLD).
In this section, we explain the five steps involved in our best performing approach. In outline form, they are the following:
STEP 1: FACE DETECTION. For face detection we use Discriminative Response Map Fitting (DRMF), proposed in [54], which is a discriminative regression-based approach for the Constrained Local Models (CLMs) framework. The advantage of DRMF is that it can be represented by a small set of parameters that can be efficiently used for reconstructing unseen response maps. The training procedure for the DRMF method can be broken down into two steps. In the first step, a dictionary is trained for the response map approximation so that it can be used for extracting the relevant features for learning the fitting update model. The second step involves iteratively learning the fitting update model, which is achieved by a modified boosting procedure.
STEP 2: ENHANCEMENT. This step enhances the contrast of each image using the method proposed in [55], which is based on both intra-layer optimization and inter-layer aggregation. A 2-D histogram h(k, k+1) is extracted by counting pairs of adjacent pixels with gray-levels k and k+1 using a tree-like layered structure. At each layer, a constrained optimization problem is formulated for enhancement and solved to obtain a difference vector. The difference vectors at all layers are then aggregated into a single unified difference vector. Notice that before each fusion the scores of each classifier are normalized to mean 0 and standard deviation 1.

Experimental results
We use the Area Under the receiver operating characteristic Curve (AUC) [84], which is a plot of the sensitivity vs. false positives (1 − specificity), as the performance indicator for all tests (including the evaluation of pain videos by human subjects in Section 8), where AUC is expressed as a percentage of the area of the unit square (i.e., as a value in the [0,100] range).
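AUC on this percentage scale can be computed directly from scores and binary labels via the equivalence with the Mann-Whitney U statistic; a minimal sketch (function name ours, not the authors' code):

```python
# AUC as the fraction of (positive, negative) pairs the classifier ranks
# correctly, scaled to [0, 100] as in the text; ties count 0.5.
def auc_percent(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return 100.0 * wins / (len(pos) * len(neg))
```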
The testing protocol adopted was a 10-fold cross-validation: since we have 49 neonates, nine folds contain the videos of five neonates and one fold contains four. All the frames of a given video segment belonged either to the training set or to the testing set.
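A minimal sketch of such a subject-wise split (function name ours): every video of a neonate lands in the same fold, yielding nine folds of five subjects and one of four.

```python
# Assign subjects (not videos) to folds, so no neonate appears in both
# training and testing; round-robin by sorted subject id.
def subject_folds(subject_ids, n_folds=10):
    subjects = sorted(set(subject_ids))
    return [subjects[i::n_folds] for i in range(n_folds)]
```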
The first experiment, reported in Tables 2 and 3, was aimed at comparing the descriptors listed in Section 3. We report the performance using two different testing protocols:
Images: the AUC is calculated considering each frame as a different still image.
Video: the AUC is calculated considering each video as a single pattern, where the score of each video is the average of the scores of the frames belonging to that video.
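The Video protocol's score aggregation is simply a per-video mean of frame scores; a one-line sketch (names ours):

```python
# frame_scores maps a video id to the list of its per-frame SVM scores;
# the video-level score is their average, as described above.
def video_scores(frame_scores):
    return {vid: sum(s) / len(s) for vid, s in frame_scores.items()}
```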
In Table 2 the method named FUSg is the fusion by sum rule among RICLBP, LCP, BSIF, MORPHO, and LPQ. We tested all the texture descriptors mentioned in Section 3, but to avoid a huge table cluttered with details on descriptors that performed poorly, we report only the highest performing approaches here.
The cells of the row labeled Image contain two values: the first is the AUC where features are extracted from the whole image only (labeled WholeIm), and the second is the AUC considering features extracted from the four subwindows (as detailed in Section 6 and labeled SubIm). Clearly, the best performance is obtained when considering the subwindows; for this reason, in the row labeled Video, only the performance obtained considering the subwindows is reported.
Due to the high computation time of BOF, it was tested using only four features. In Table 3 we compare several BOF variants to demonstrate the value of combining different codebooks: Whole: features extracted from the whole image; both values of ps are used, with only one PCA projection (98% of the variance), for building the codebooks. Since only square patches are considered, the number of codebooks is two.
Itera: these are features extracted as described with Whole, but all three PCA projections described in Section 5 are used for building the codebooks. Since only square patches are considered, the number of codebooks is six.
Circular: these are features extracted as described with Itera, but both square and circular patches are used. To reduce computation time, when the circular patches are extracted, only one PCA projection (98% of the variance) is used for building the codebooks. Since both ps values are used, the number of codebooks is eight.
Sub: these are features extracted as described with Circular, but the codebooks are built from the four subwindows as well as from the full image; thus, the number of codebooks is forty-five.
Sub_easy: these are features extracted as described with Whole, but the codebooks are built from the four subwindows as well as from the full image; thus, the number of codebooks is ten.

Clearly BOF outperforms the global approaches, as does combining different codebooks. The fusion among all four descriptors, labeled FUSbof, which is based on different features, also improves performance.
In the last row we report the performance obtained by Sub using the Video testing protocol. Obviously, combining the four different descriptors by sum rule, i.e. FUSbof, improves performance.
In Table 4 we report the performance of GOLD and its variants. TB is the performance obtained by an SVM trained on the texture descriptors extracted from the covariance matrix (see Section 4.4). TB_SUM is the sum rule among the different TBs. The fusions a × A + b × B denote the weighted sum rule between A, with weight a, and B, with weight b. In the last row of Table 4, we combine the approaches based on GOLD and the approaches based on TB.
Although the improvement is not impressive, it is clear that the idea of extracting texture features from the covariance matrix works (see [85]) and that further tests should be performed for optimizing the results.
The novel ensemble proposed here is given by the sum rule fusion among FUSg, FUSbof, and (2 × GOLD + TB_SUM), obtaining an AUC (video classification rate) of 0.80. Note: before fusion, the scores of each method are normalized to mean 0 and standard deviation 1.
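The normalization-then-fusion step can be sketched as follows (a minimal illustration with names of our choosing); the weights argument covers weighted combinations such as 2 × GOLD + TB_SUM.

```python
# Normalize each method's scores to mean 0 / std 1, then combine by
# (optionally weighted) sum rule, as described in the text.
def normalize(scores):
    m = sum(scores) / len(scores)
    sd = (sum((s - m) ** 2 for s in scores) / len(scores)) ** 0.5
    return [(s - m) / sd for s in scores] if sd else [0.0] * len(scores)

def sum_rule(score_lists, weights=None):
    weights = weights or [1.0] * len(score_lists)
    normed = [normalize(s) for s in score_lists]
    return [sum(w * col[i] for w, col in zip(weights, normed))
            for i in range(len(score_lists[0]))]
```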
In an attempt to improve our best ensemble method, we used a powerful feature selection approach, Maximum Relevance Minimum Redundancy [86], to select a subset of all the features extracted by each texture descriptor. Table 5 reports the performance of 2 × GOLD + TB_SUM as the percentage of retained features is varied. It shows that feature selection did not significantly improve the performance; however, it did show that a competitive performance can be obtained with a reduced number of features.
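The greedy shape of mRMR can be sketched as below. This is only illustrative: the criterion in [86] is defined in terms of mutual information, whereas this sketch substitutes absolute Pearson correlation for both relevance and redundancy, and all names are ours.

```python
# Greedy mRMR sketch: at each step pick the feature maximizing
# relevance(feature, labels) - mean redundancy(feature, selected).
def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def mrmr(features, labels, n_select):
    """features: list of feature columns; returns selected column indices."""
    remaining = list(range(len(features)))
    selected = []
    while remaining and len(selected) < n_select:
        def score(j):
            relevance = abs(corr(features[j], labels))
            redundancy = (sum(abs(corr(features[j], features[s]))
                              for s in selected) / len(selected)) if selected else 0.0
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```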
In addition to the above methods, in Table 6 we show our results using the following transfer learning approaches, under both the Image and Video protocols:
VGGface_e: based on [87], which combines LBP and HOG features with deep learning-based features extracted from two pretrained CNNs (VGG face [88] and MBPCNN [89]).
ResNet: this net is the same as ResNet-34 [92] except that it has fewer layers and 50% fewer filters per layer. ResNet has obtained state-of-the-art classification performance on the LFW face benchmark [93] and was trained from scratch on 3 million faces. We used ResNet as a feature extractor feeding into an SVM. ResNet is available at https://github.com/davisking/dlib-models.
Adding the CNN to our best performing ensemble (the fusion by sum rule among FUSg, FUSbof, and 2 × GOLD + TB_SUM) did not significantly improve results. Our GOLD-based ensemble, therefore, should provide researchers with a challenging performance rate for future comparisons.
[94]. Ratings were collected online through a website developed specifically for this study. Participants were presented with all 234 videos in the iCOPEvid dataset. All videos were 20 s in length and did not include sound. Each video was presented to participants in random order, and they were asked to use their best judgment to rate the level of pain experienced by the neonate in each video. Participants were unaware of the actual pain or nopain condition experienced by the neonate. Participants indicated pain levels with a slide control anchored by "Absolutely No Pain" and "Extreme Pain." Responses were converted to a numeric value ranging from 0 to 100. Due to the large amount of time required to complete the survey, participants were allowed to exit the website and resume later at the point where they had left off. A total of 39,865 pain assessments were collected in this manner.

Human judgments
The AUC of human subjects lies, with 95% confidence, between 0.665 (lower bound) and 0.677 (upper bound), a performance significantly lower than that of our best ensemble. Human subjects found it most difficult to discern pain versus nopain in the presence of the friction stimulus. Statistical analysis using a Z-score of subject performance across individual stimuli confirms these differences. A Z-score indicates significant differences between two AUC values calculated from independent samples [95]. The critical ratio Z is defined as Z = (AUC1 − AUC2) / sqrt(SE_AUC1² + SE_AUC2²), where SE_AUC1 refers to the standard error for AUC1 and SE_AUC2 refers to the standard error for AUC2. The value of Z is compared to a critical value CV_Z at the preferred level of significance referenced in tables of the normal distribution. A value of Z greater than CV_Z is considered evidence that the AUCs are statistically different. Alternatively, one may interpret the p-value for the statistic: if the p-value is less than the desired level of significance (e.g., p < .05), then one may reasonably assume that the two AUCs are different. Human subject AUCs for each stimulus measured individually against pain are: rest (0.699), movement (0.693), and friction (0.600). We find that the AUC for the friction stimulus is statistically different from the AUCs for rest (Z = 15.54, p < .001) and movement (Z = 13.44, p < .001). This supports the poor results of the human subjects in their assessment of the friction stimulus. Removing the friction videos from the sample set significantly improves the overall AUC rating of human subjects from 0.671 to 0.697 (Z = 4.64, p < .001). An improvement is also obtained in the AUC rating of our best ensemble, which went from 0.800 to 0.815 (the CNN-based method increased from 66.7 to 67.5 on the Image set and from 74.3 to 75.5 on the Video set). A possible explanation for these differences is that some neonates in our study experienced pain when receiving the friction stimulus.
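The critical-ratio comparison above is straightforward to compute; a minimal sketch, assuming the usual independent-samples form Z = (AUC1 − AUC2) / sqrt(SE1² + SE2²) (function name ours):

```python
# Z-score comparing two AUCs from independent samples; |Z| beyond the
# normal-table critical value indicates the AUCs differ significantly.
import math

def auc_z(auc1, se1, auc2, se2):
    return (auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2)
```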
In Figures 2 and 3, we show some images that proved very difficult to classify, not only by machine but also by the human judges. Difficult pain frames: the human rating of this subject (not just these images) was 8.1% correct, and the best machine rating for this subject was 13.8% correct. Difficult nopain frames: the human rating of this subject (not just these images) was 35.2% correct, and the best machine rating for this subject was 43.1% correct.

Conclusion
The aim of this work was 1) to develop a challenging video dataset for the detection of pain in videos of neonates, and 2) to develop a powerful classification system that would provide a challenging performance rate for future comparisons. For the video dataset, we collected 234 videos of 49 neonates experiencing a set of noxious stimuli, a period of rest, and an acute pain stimulus, and divided the videos into two classes: pain and nopain. We purposefully extracted a set of challenging videos for the nopain class.
We experimentally developed our classification system by comparing a number of SVMs trained on twelve different state-of-the-art texture descriptors, along with an ensemble of local approaches based on Bag-of-Features (BOF) and descriptors based on Gaussian of Local Descriptors (GOLD). A set of features was extracted from the covariance matrix generated in the GOLD process, under the assumption that the features extracted from different image patches are drawn from a multivariate Gaussian distribution. An ensemble of global descriptors and BOF was then used to improve the performance of the new descriptors based on GOLD.
To improve performance with respect to some standard approaches (which we also report), we proposed a set of features that can be extracted from the covariance matrix used in GOLD. To improve the texture descriptors based on BOF, we built an ensemble of codebooks using different strategies. The results of our best ensemble proved superior not only to the standard approaches but also to some deep learning approaches. Our best ensemble was also shown to outperform 185 human judges.
Although adding a CNN to our best ensemble failed to improve results, additional studies using CNNs are needed. In the future, we plan on exploring more CNN-based approaches. For example, to overcome the problem of a small dataset, we want to collect different datasets of infant faces and train a CNN on them; we will then explore fine-tuning that network on our neonatal pain dataset.
Similarly, although the improvement using the covariance matrix was not impressive, it is clear that the idea of extracting texture features from the covariance matrix works (see [85]) and that further tests should be performed for optimizing results. Finally, we plan on extracting features from the entire set of images to see if this will enhance performance.