A lightweight deep learning approach to mouth segmentation in color images

Purpose – Mouth segmentation is one of the challenging tasks in the development of lip-reading applications due to illumination variation, low chromatic contrast and complex mouth appearance. Recently, deep learning methods have solved mouth segmentation problems with state-of-the-art performance. This study presents a modified Mobile DeepLabV3-based technique with a comprehensive evaluation on mouth datasets.

Design/methodology/approach – This paper presents a novel approach to mouth segmentation using the Mobile DeepLabV3 technique, integrating decode and auxiliary heads. Extensive data augmentation, online hard example mining (OHEM) and transfer learning were applied. CelebAMask-HQ and a mouth dataset collected from 15 healthy subjects in the department of rehabilitation medicine, Ramathibodi hospital, were used to validate mouth segmentation performance.

Findings – Extensive data augmentation, OHEM and transfer learning were performed in this study. The proposed technique achieved better performance on CelebAMask-HQ than existing segmentation techniques, with a mean Jaccard similarity coefficient (JSC), mean classification accuracy and mean Dice similarity coefficient (DSC) of 0.8640, 93.34% and 0.9267, respectively. It also achieved better performance on the mouth dataset, with a mean JSC, mean classification accuracy and mean DSC of 0.8834, 94.87% and 0.9367, respectively. The proposed technique achieved an inference time of 48.12 ms per image.

Originality/value – The modified Mobile DeepLabV3 technique was developed with extensive data augmentation, OHEM and transfer learning. It achieved better mouth segmentation performance than existing techniques, which makes it suitable for implementation in further lip-reading applications.


Introduction
Mouth segmentation is an important process in lip reading that can be applied in several applications, such as video conferencing, lip-synching, visual face recognition, speech recognition and medical disease detection [1-4]. The accuracy of each application depends on segmentation performance. However, mouth segmentation is challenging in an unconstrained environment due to luminance variation, low chromatic contrast, complex mouth appearance, fitness, occlusion, reflection and cosmetic agents on the lip [1-3, 5, 6]. Currently, various techniques are used for separating the lip region from the background, such as contour-based and region-based approaches.
However, several challenges still exist, such as overlap between lip and non-lip colors and the absence of an obvious color gradient between the lip and the skin [3, 5, 6].
After the introduction of AlexNet [7], the first end-to-end multi-resolution deep learning-based semantic segmentation technique was the fully convolutional network (FCN) [8]. It achieved higher accuracy than conventional techniques. Later, mouth segmentation techniques were continuously developed. Newer techniques can segment without color space transformation, manual feature extraction or even a sliding window for pixel-wise prediction.
In this paper, we propose an automatic deep learning-based mouth segmentation method evaluated on the publicly available CelebAMask-HQ dataset [9] and on a mouth dataset collected from 15 healthy people, annotated by four personnel in rehabilitation medicine, Ramathibodi hospital, and verified by two rehabilitation doctors. We applied transfer learning from COCO-Stuff [10] to the CelebAMask-HQ dataset, and from CelebAMask-HQ to the mouth dataset.
The key contribution of this paper is the validation of the Mobile DeepLabV3-based technique for mouth segmentation on the publicly available dataset and on the self-collected mouth dataset from 15 healthy people. We integrated decode and auxiliary heads into Mobile DeepLabV3 to enhance supervision during training. This study applied extensive data augmentation and online hard example mining (OHEM) to relieve class imbalance. Our proposed model achieves better performance than standard segmentation techniques. The second contribution is the application of transfer learning, taking the model pretrained on COCO-Stuff to CelebAMask-HQ, and the model trained on CelebAMask-HQ to the mouth dataset, using a smaller amount of data for re-training. Moreover, our proposed solution requires neither preprocessing nor postprocessing. Thus, it can be easily integrated into mouth segmentation-related applications.
The rest of this paper is organized as follows: Section 2 describes the related work. Section 3 describes the materials and methods. Section 4 provides the experimental results. Section 5 discusses the results. Section 6 draws the conclusion.

Related work
Mouth segmentation techniques have been actively researched to handle unconstrained conditions with varying illumination, mouth shape, reflection and cosmetic agents on the lip [1-3, 5, 6]. These can be separated into three categories: contour-based, region-based and deep learning-based methods.
First, the contour-based technique separates the lip and the background by a gradient between the lip and the non-lip pixels. Ozgur et al. [11] proposed PCA (principal component analysis) template matching and a K-means algorithm for lip corner detection. The likelihood of segmented lip pixels is estimated by a Gaussian mixture model from the detected lip corners. Malek et al. [12] applied an active contour and a parametric model to obtain the lip contour. Then, a level set method finds the key points to position the result of the parametric model to fit lip deformity. Lu and Liu [2] proposed a localized active contour model from an illumination-equalized RGB image, combined with the U component of the CIE 1976 CIELUV image and the C2 and C3 components of the discrete Hartley transformed image. This study applied an initial rhombus contour to the closed mouth and combined semi-ellipses to the open mouth. Malek and Messaoud [13] proposed two methods. First, the authors proposed lip landmark detection by the geodesic active contour and a distance level set evolution model with a combination of Gaussian, median and average filters [13]. Next, a parametric model based on cubic curves estimates the lip deformity from a lip landmark [13].
Second, the region-based approach applies clustering or thresholding techniques to separate a lip from a background. Sandhya et al. [14] applied Otsu's thresholding and K-means clustering to the grayscale lip-printed image. The separation of K-means clusters is based on Euclidean distance. Wang et al. [6] proposed multi-class and shape-guided fuzzy C-means (MS-FCM) from the CIE 1976 CIELAB and CIELUV color spaces. The pixel vector from the selected channels L*, a*, b*, u* and v* was separated between the lip and complex backgrounds like skin, beards and mustaches. Gritzman et al. [3] applied shape-based adaptive thresholding (SAT) in two steps. First, this study used linear discriminant analysis with support vector regression to output the segmentation error. Next, it adjusted the color-based threshold value to estimate the best value to reduce the segmentation error until it was acceptable.
The third approach is the deep learning-based technique. Ju et al. [5] proposed the lip segmentation network (LSN), which combines features from two architectures. First, an FCN-based architecture maps RGB to a binary image. Second, a proposed CNN architecture based on average pooling with a 1×1 convolution kernel is employed to reduce the influence of bad annotations. Guan et al. [15] proposed the lip segmentation fuzzy CNN (LSFCNN), a U-net-like architecture with fuzzy learning modules. Zhang and Zhao [16] proposed a U-net-based local feature extractor to extract visual information from lip images with complex environmental changes and different facial attributes. They also proposed a graph-based adjacent feature extractor to effectively capture features of lips between adjacent frames. Guan et al. [17] proposed LSDNet, a combination of complex teacher and student networks. It combines three loss functions: cross-entropy, distillation and remedy losses. LSDNet increases segmentation performance, inference speed and segmentation ability on hard samples.
Nowadays, little research applies end-to-end CNNs with an auxiliary head, extensive data augmentation, OHEM and transfer learning to solve mouth segmentation in unconstrained conditions. Moreover, no research has studied lip and teeth segmentation performance together with computational complexity. Therefore, this paper validates the Mobile DeepLabV3-based technique on lip and teeth segmentation and evaluates computational complexity by reporting model parameters, model size and time usage per image.

Dataset
The first experiment was applied to CelebAMask-HQ [9], a large-scale publicly available high-resolution face dataset with fine-masked labels of 19 facial component categories such as eye, nose and mouth regions. CelebAMask-HQ has high-quality control from several rounds of verification and refinement of each annotated mask to reduce noise. The dataset contains 30,000 face images of 512×512 resolution.
The next experimental study was applied to videos collected from 15 healthy people working in the department of rehabilitation medicine at Ramathibodi hospital. This experiment was approved by the institutional review board of Ramathibodi hospital, Mahidol University (certificate of approval (COA) number MURA2021/73). The inclusion criteria were as follows:
(1) The subject must be Thai.
(2) The subject should be between 18 and 80 years old.
(3) The subject works in the Faculty of Engineering at Mahidol University or the department of rehabilitation medicine at Ramathibodi hospital.
(4) The subject does not have a neck movement disorder or a history of cervical surgeries or trauma.

The exclusion criteria were a relationship with the research team and unavailability during testing. Consent was obtained from all subjects before participating in the experiment.
The videos were acquired with a smartphone camera and a Razer webcam in an unconstrained environment in the department of rehabilitation medicine at Ramathibodi hospital. We extracted each video frame and saved it as a picture. The extracted frames were precisely annotated with the Universal Data Tool (v.0.14.17) by four personnel working in the same department under the supervision of two rehabilitation doctors. Precise annotation with high-quality control and supervision reduces noise, which affects training performance [5]. This dataset contains 15,495 images.

CNN architecture
The model architecture used in this study is based on Mobile DeepLabV3 [18,19]. It consists of three parts, i.e., the backbone, the auxiliary head and the decode head (Figure 1). First, the backbone architecture is derived from MobileNetV2 [19].
Second, an auxiliary head [20,21] processes the output of the 5th inverted residual block to assist training optimization. The reason is the vanishing gradient problem: in a deeper network the gradient decreases to near zero, which prevents parameter fine-tuning. Placing an auxiliary head on shallower layers strengthens the backpropagation signal and adds regularization. Thus, an auxiliary head increases classifier performance, an insight from InceptionNet [21]. This head consists of a 3×3 convolutional layer with 256 output channels, a dropout layer with a rate of 0.1 and a 1×1 convolutional layer. It outputs two classes for CelebAMask-HQ and three classes for the mouth dataset. The head is discarded during inference.
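A minimal PyTorch sketch of this auxiliary head is given below; the framework, the 96-channel input width of MobileNetV2's 5th inverted residual stage, and the use of batch normalization with ReLU are assumptions, not details stated by the text.

    import torch
    import torch.nn as nn

    class AuxiliaryHead(nn.Module):
        """Sketch: 3x3 conv (256 channels) -> dropout(0.1) -> 1x1 classifier."""

        def __init__(self, in_channels: int = 96, num_classes: int = 2):
            # in_channels = 96 assumes the 5th inverted residual stage of
            # MobileNetV2; BatchNorm + ReLU after the 3x3 conv is an assumption.
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(in_channels, 256, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(256),
                nn.ReLU(inplace=True),
            )
            self.dropout = nn.Dropout2d(p=0.1)
            self.classifier = nn.Conv2d(256, num_classes, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Used only during training; discarded at inference time.
            return self.classifier(self.dropout(self.conv(x)))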
Third, the decode head is the main head; it processes the output of the 7th inverted residual block and outputs the same classes as the auxiliary head. This head consists of four steps, i.e., Atrous spatial pyramid pooling (ASPP) [18,22], a 3×3 convolution layer with 512 output channels, a dropout layer with a rate of 0.1 and a 1×1 convolutional layer. ASPP is a powerful tool for capturing semantic information at various scales from the computed feature maps by enlarging the model's receptive field. It consists of five parallel paths. Four parallel paths are Atrous convolutions [18,22]: three 3×3 convolution layers with different dilation rates of 12, 24 and 36, and one 1×1 convolution layer with a dilation rate of 1. The last path is an image-level feature extraction with three steps: 1×1 2D global average pooling, a 1×1 convolution layer with 512 output channels, and a resize layer with bilinear interpolation back to the resolution the path received. The outputs of the five parallel paths are concatenated before passing through a 3×3 convolutional layer.
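To make the head structure concrete, the following is a minimal PyTorch sketch of ASPP and the decode head as described above. It is a sketch under assumptions, not the authors' implementation: the framework, batch normalization with ReLU after each convolution, and the 320-channel input width of MobileNetV2's last inverted residual block are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ASPP(nn.Module):
        """Sketch of ASPP: one 1x1 conv, three 3x3 atrous convs (dilation
        12/24/36) and an image-level branch, concatenated and fused by a
        3x3 conv, as described in the text."""

        def __init__(self, in_channels: int, out_channels: int = 512):
            super().__init__()

            def conv(k, d):
                pad = 0 if k == 1 else d  # keep spatial size under dilation d
                return nn.Sequential(
                    nn.Conv2d(in_channels, out_channels, k, padding=pad,
                              dilation=d, bias=False),
                    nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))

            self.branches = nn.ModuleList(
                [conv(1, 1), conv(3, 12), conv(3, 24), conv(3, 36)])
            self.image_pool = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(in_channels, out_channels, 1, bias=False),
                nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))
            self.fuse = nn.Sequential(
                nn.Conv2d(5 * out_channels, out_channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))

        def forward(self, x):
            h, w = x.shape[2:]
            feats = [b(x) for b in self.branches]
            # Image-level branch: global pooling, 1x1 conv, bilinear resize back.
            pooled = F.interpolate(self.image_pool(x), size=(h, w),
                                   mode='bilinear', align_corners=False)
            return self.fuse(torch.cat(feats + [pooled], dim=1))

    class DecodeHead(nn.Module):
        """Sketch of the decode head: ASPP (incl. the fusing 3x3 conv) ->
        dropout(0.1) -> 1x1 classifier."""

        def __init__(self, in_channels: int = 320, num_classes: int = 3):
            super().__init__()
            self.aspp = ASPP(in_channels, 512)
            self.dropout = nn.Dropout2d(p=0.1)
            self.classifier = nn.Conv2d(512, num_classes, kernel_size=1)

        def forward(self, x):
            return self.classifier(self.dropout(self.aspp(x)))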

Methods
The first experiment was the assessment of CelebAMask-HQ [9], prepared in four steps. First, the lip area was cropped by taking the extreme-point coordinates of the masked areas of the upper and lower lips. Second, the dataset was reannotated by labeling the upper and lower lip-masked areas as the lip and everything else as the background; the dataset thus contains 29,928 background areas and 29,505 lip areas. Third, the reannotated images were resized to a resolution of 640×480 pixels. Last, the 30,000 images were separated into 20,950 training, 5,987 validation and 2,991 testing images. The remaining 72 images were excluded because the extreme points of their lip-masked areas could not be found.
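As an illustration of these preparation steps, a sketch in Python might look as follows. The CelebAMask-HQ label ids and the tight bounding-box crop around the extreme points are assumptions, since the exact ids and crop margin are not stated in the text.

    import numpy as np
    import cv2  # OpenCV; the library choice is an assumption

    # Hypothetical label ids for the CelebAMask-HQ upper/lower lip masks.
    UPPER_LIP, LOWER_LIP = 12, 13

    def prepare_sample(image: np.ndarray, mask: np.ndarray):
        """Sketch of the four preparation steps; returns a 640x480 image and a
        binary lip/background label, or None when no lip pixels exist (the
        excluded case)."""
        lip = np.isin(mask, (UPPER_LIP, LOWER_LIP))
        ys, xs = np.nonzero(lip)
        if ys.size == 0:
            return None  # extreme points cannot be found; sample is excluded
        # 1. Crop with the extreme points of the combined lip mask.
        y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
        image, lip = image[y0:y1 + 1, x0:x1 + 1], lip[y0:y1 + 1, x0:x1 + 1]
        # 2. Re-annotate: lip = 1, background = 0.
        label = lip.astype(np.uint8)
        # 3. Resize to 640x480 (nearest neighbour keeps the label discrete).
        image = cv2.resize(image, (640, 480), interpolation=cv2.INTER_LINEAR)
        label = cv2.resize(label, (640, 480), interpolation=cv2.INTER_NEAREST)
        return image, label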
This experiment applied Mobile DeepLabV3 [18,19] pretrained on the COCO-Stuff dataset [10], trained with the training subset and validated with the validation subset. The network was trained with the Adam optimizer for 140 epochs. The learning rate and weight decay were 0.001 and 0.0001, respectively. The initial random seed was set to 0. We applied OHEM [23,24] to the segmented pixels with a confidence value of less than 0.7. OHEM selects the difficult segmentation pixels with low confidence values for backpropagation. A class neglected during training produces a loss high enough to reach the probability of being sampled. Thus, OHEM mitigates the large imbalance between the annotated objects and the background in the mouth dataset.
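A minimal PyTorch sketch of this OHEM step is given below. The 0.7 confidence threshold comes from the text; the `min_kept` safeguard and the exact formulation are assumptions borrowed from common OHEM practice.

    import torch
    import torch.nn.functional as F

    def ohem_cross_entropy(logits: torch.Tensor, target: torch.Tensor,
                           conf_thresh: float = 0.7,
                           min_kept: int = 100_000) -> torch.Tensor:
        """Keep only pixels whose predicted probability for the ground-truth
        class is below conf_thresh and backpropagate through those pixels."""
        with torch.no_grad():
            prob = F.softmax(logits, dim=1)                        # (N, C, H, W)
            gt_prob = prob.gather(1, target.unsqueeze(1)).squeeze(1)
            hard = gt_prob < conf_thresh
            if hard.sum() < min_kept:
                # Safeguard (an assumption): keep at least the hardest pixels.
                k = min(min_kept, gt_prob.numel())
                thresh = gt_prob.flatten().kthvalue(k).values
                hard = gt_prob <= thresh
        loss = F.cross_entropy(logits, target, reduction='none')   # per pixel
        return loss[hard].mean()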
The loss function in the main and auxiliary heads is the combination of two components: the cross-entropy loss ($L_{CE}$) and the Dice loss ($L_{Dice}$).
The cross-entropy loss ($L_{CE}$) is the sum over all classes of the cross-entropy between the ground truth ($y_i$) and the prediction ($p_i$), where the prediction is the softmax of the network outputs: the exponential of the output value of the current class divided by the sum of the exponentials of the output values ($z_c$) of all classes. The total number of classes is $C$. The cross-entropy loss is shown in equation (1):

$$L_{CE} = -\sum_{i=1}^{C} y_i \log p_i, \qquad p_i = \frac{e^{z_i}}{\sum_{c=1}^{C} e^{z_c}} \qquad (1)$$
The Dice loss ($L_{Dice}$) is one minus the average of the Dice coefficient over all classes. In each class, twice the sum of correctly predicted pixels is the numerator, and the sum of the prediction and ground-truth pixels is the denominator. $p_i$ represents the pixel values of the prediction, $g_i$ the pixel values of the ground truth, $N_c$ the number of pixels in each class and $C$ the total number of classes. The Dice loss is shown in equation (2):

$$L_{Dice} = 1 - \frac{1}{C}\sum_{c=1}^{C} \frac{2\sum_{i=1}^{N_c} p_i g_i}{\sum_{i=1}^{N_c} p_i + \sum_{i=1}^{N_c} g_i} \qquad (2)$$
The final loss ($L_{total}$) for the main and auxiliary heads combines the two terms, as shown in equation (3):

$$L_{total} = L_{CE} + L_{Dice} \qquad (3)$$
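For illustration, a minimal PyTorch sketch of the Dice term in equation (2) and the combined loss in equation (3) might look as follows; the equal weighting of the two terms and the softmax/one-hot formulation are assumptions.

    import torch
    import torch.nn.functional as F

    def dice_loss(logits: torch.Tensor, target: torch.Tensor,
                  eps: float = 1e-6) -> torch.Tensor:
        """Equation (2): one minus the class-averaged Dice coefficient."""
        num_classes = logits.shape[1]
        prob = F.softmax(logits, dim=1)                           # (N, C, H, W)
        one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
        dims = (0, 2, 3)  # sum over batch and spatial positions, per class
        intersection = (prob * one_hot).sum(dims)
        union = prob.sum(dims) + one_hot.sum(dims)
        dice = (2 * intersection + eps) / (union + eps)
        return 1 - dice.mean()

    def total_loss(logits, target):
        """Equation (3); the equal weighting is an assumption, since the text
        states only that the two components are combined."""
        return F.cross_entropy(logits, target) + dice_loss(logits, target)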
After the training, the model was tested with the testing dataset and compared to the baselines (Part A of LSN [5], LSFCNN [15], LSDNet [17], U-Net [16,25], FCN [8], PSPNet [20], Residual U-Net++ [26] and DeepLabV3 [18]) for segmentation accuracy. For LSN [5], only Part A was selected in this study because the authors provided insufficient detail on the structure of Part B.

The next experiment was the assessment of the dataset collected from healthy people, containing 15,495 images and preprocessed in three steps. First, the dataset was annotated by the personnel working in the department of rehabilitation medicine to create masked images in three classes: the lip, the teeth and the background. The dataset thus contains 15,495 background areas, 15,487 lip areas and 4,894 teeth areas. Second, the annotated images were resized to 640×480 pixels. Last, the dataset was separated into 10,851 training, 3,097 validation and 1,547 testing images. The pretrained model from the previous experiment was trained on the training subset with the same training parameters as before and validated on the validation subset. After training, the testing subset was used for the evaluation and compared to the baselines for segmentation accuracy, except for LSFCNN and LSDNet. The main reason for their exclusion is that their model architectures and loss functions were specially designed for lip segmentation and were not flexible enough to include teeth.
Data augmentation [27] was applied to the training sets of both datasets and used for all techniques to improve the sufficiency and diversity of the training data through synthetic data generation. A model trained with data augmentation copes better with variations in color, illumination and geometric transformation. Data augmentation consists of three steps: random crop, random flip and photometric distortion, the last of which applies random brightness, random contrast, BGR-to-HSV conversion, random saturation, random hue, HSV-to-BGR conversion and random contrast.
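A minimal sketch of this augmentation pipeline in Python/OpenCV is shown below; the distortion ranges, application probabilities and crop size are assumptions, as the text lists only the operations.

    import random
    import numpy as np
    import cv2  # OpenCV; the library choice is an assumption

    def random_crop(img, label, crop_hw=(480, 640)):
        # Random crop to a fixed size (the crop size is an assumption).
        ch, cw = crop_hw
        h, w = img.shape[:2]
        y = random.randint(0, max(h - ch, 0))
        x = random.randint(0, max(w - cw, 0))
        return img[y:y + ch, x:x + cw], label[y:y + ch, x:x + cw]

    def random_flip(img, label, p=0.5):
        # Random horizontal flip of image and label together.
        if random.random() < p:
            return img[:, ::-1], label[:, ::-1]
        return img, label

    def photometric_distortion(img):
        # BGR uint8 input; follows the listed order: brightness, contrast,
        # BGR->HSV, saturation, hue, HSV->BGR, contrast.
        img = img.astype(np.float32)
        if random.random() < 0.5:
            img += random.uniform(-32, 32)           # random brightness
        if random.random() < 0.5:
            img *= random.uniform(0.5, 1.5)          # random contrast
        img = np.clip(img, 0, 255).astype(np.uint8)
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
        if random.random() < 0.5:
            hsv[..., 1] *= random.uniform(0.5, 1.5)  # random saturation
        if random.random() < 0.5:                    # random hue (0..179 range)
            hsv[..., 0] = (hsv[..., 0] + random.uniform(-18, 18)) % 180
        hsv = np.clip(hsv, 0, 255).astype(np.uint8)
        img = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR).astype(np.float32)
        if random.random() < 0.5:
            img *= random.uniform(0.5, 1.5)          # random contrast (again)
        return np.clip(img, 0, 255).astype(np.uint8)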
The third experiment was an ablation study. The same method as in the second experiment was applied to the proposed model without ASPP, the auxiliary head, transfer learning and OHEM, respectively. The results were compared to the proposed model.
The performance evaluation metrics in the validation and testing phases of the three experiments were the mean Jaccard similarity coefficient (mean JSC), the mean classification accuracy and the mean Dice similarity coefficient (mean DSC).
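For clarity, all three metrics can be computed from a per-class confusion matrix, as in the following sketch; interpreting "classification accuracy" as per-class recall averaged over classes is an assumption.

    import numpy as np

    def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, num_classes: int):
        """Sketch: mean JSC (IoU), mean per-class accuracy and mean DSC from a
        confusion matrix over all pixels."""
        idx = gt.ravel().astype(np.int64) * num_classes + pred.ravel().astype(np.int64)
        cm = np.bincount(idx, minlength=num_classes ** 2)
        cm = cm.reshape(num_classes, num_classes)
        tp = np.diag(cm).astype(np.float64)
        fp = cm.sum(axis=0) - tp   # predicted as class c but labelled otherwise
        fn = cm.sum(axis=1) - tp   # labelled class c but predicted otherwise
        jsc = tp / (tp + fp + fn)            # Jaccard per class
        acc = tp / (tp + fn)                 # per-class accuracy (an assumption)
        dsc = 2 * tp / (2 * tp + fp + fn)    # Dice per class
        return jsc.mean(), acc.mean(), dsc.mean()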
The fourth experiment was the computational performance evaluation. It was performed on an Intel Core i7-4770 with a clock speed of 4.50 GHz and an NVIDIA RTX 3060 to obtain the number of model parameters, the model size in MB and the inference time per image in milliseconds (ms).
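A sketch of how the per-image inference time could be measured on the GPU is given below; the warm-up count and the assumption that `images` is a list of batched input tensors are illustrative, not details from the text.

    import time
    import torch

    @torch.no_grad()
    def mean_inference_ms(model, images, device='cuda', warmup=10):
        """Sketch: mean forward-pass time per image in milliseconds.
        CUDA synchronization ensures the timer measures completed GPU work."""
        model.eval().to(device)
        for img in images[:warmup]:        # warm-up iterations (an assumption)
            model(img.to(device))
        torch.cuda.synchronize()
        start = time.perf_counter()
        for img in images:
            model(img.to(device))
        torch.cuda.synchronize()
        return (time.perf_counter() - start) * 1000 / len(images)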

Results
Figures 2 and 3 show the training and validation graphs for CelebAMask-HQ and the dataset collected from healthy people. Each figure shows two learning curves, the training cross-entropy and Dice losses, which converge early on all training graphs because we applied transfer learning from COCO-Stuff, a large dataset, to CelebAMask-HQ, and from CelebAMask-HQ to the same domain on the mouth dataset. For validation on CelebAMask-HQ, the mean JSC, mean classification accuracy and mean DSC reached 0.8698, 93.66% and 0.9300, respectively. For validation on the mouth dataset, the mean JSC, mean classification accuracy and mean DSC reached 0.8382, 93.39% and 0.9067, respectively.
The first experiment's result on the testing subset of CelebAMask-HQ is shown in Table 1. Mobile DeepLabV3 demonstrated promising results, achieving a mean JSC, mean classification accuracy and mean DSC of 0.8640, 93.34% and 0.9267, respectively. The results demonstrated a statistically significant improvement over the baselines (p < 0.05). Examples of the ground truth images, labels and segmentation results are shown in Figure 4.
The second experiment's result on the testing subset of the collected dataset is shown in Table 2. Mobile DeepLabV3 demonstrated promising results, achieving a mean JSC, mean classification accuracy and mean DSC of 0.8834, 94.87% and 0.9367, respectively. This technique demonstrated a statistically significant improvement over the baselines (p < 0.05), except for DeepLabV3 on mean JSC and DSC and Residual U-Net++ on DSC. Examples of the ground truth images, labels and segmentation results are shown in Figure 5.
The third experiment's result, shown in Table 4, is the ablation study on the testing subset of the collected dataset, removing ASPP, the auxiliary head, the transfer learning approach and OHEM from Mobile DeepLabV3. Statistical analysis was applied to evaluate the significance of the difference between each ablated variant and the proposed model (p < 0.05).
Fourth, the computational performance evaluation on mouth segmentation is shown in Table 3. Mobile DeepLabV3 has fewer parameters and a smaller model size than the baselines, except for Part A of LSN [5], LSFCNN [15] and LSDNet [17].

Discussion
First, ASPP [18,22] captures semantic information at various scales by enlarging the model's receptive field with different dilation rates in the Atrous convolutions [18,22].
Second, an auxiliary head [20,21] in the intermediate layer assists the training process by backpropagation through the shallow layers. This prevents the vanishing gradient problem [20,21].
The third factor is MobileNetV2 [19], whose inverted residual blocks have linear bottlenecks. The bottleneck transfers the necessary information between residual blocks, decreasing the information and performance loss caused by the non-linear transformation of ReLU6.
Fourth, OHEM [23,24] filters the difficult segmentation pixels with low confidence values for backpropagation. A class neglected during training produces a loss high enough to reach the probability of being sampled.
Fifth, the network-based transfer learning approach [28] applies the reusability and transferability properties of a trained deep learning model. This mitigates the large dataset requirement, given the limited amount of training data in the mouth dataset.
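A minimal PyTorch sketch of this network-based transfer: reuse every pretrained weight whose shape matches and keep the rest (e.g. classifier layers whose class counts differ) at their fresh initialization. The helper name and checkpoint path are hypothetical.

    import torch

    def load_transferred_weights(model: torch.nn.Module,
                                 checkpoint_path: str) -> torch.nn.Module:
        """Sketch: copy shape-matching pretrained parameters into the model."""
        pretrained = torch.load(checkpoint_path, map_location='cpu')
        current = model.state_dict()
        # Keep only parameters that exist in the new model with matching shapes.
        transferable = {k: v for k, v in pretrained.items()
                        if k in current and v.shape == current[k].shape}
        current.update(transferable)
        model.load_state_dict(current)
        return model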
Compared with the conventional techniques, this technique requires neither preprocessing nor lip contour initialization and detection. This benefits from the automatic feature extraction inherent in deep learning. It also does not require additional conventional modules like fuzzy units [15], which increase computational complexity.
Compared to the baselines, Mobile DeepLabV3 [18,19] performs better. None of the baselines combine MobileNetV2 as the backbone with an auxiliary head for supervision and OHEM. Almost all baselines lack ASPP and transfer learning, except for DeepLabV3 [18] and LSDNet [17], respectively. The lack of these segmentation performance improvement factors leads to deteriorating segmentation accuracy. Moreover, Part A of LSN [5], LSFCNN [15] and the U-net-based techniques [16,25,26] performed worst. They achieved the lowest segmentation accuracy compared to the other baselines and Mobile DeepLabV3 [18,19]. They misclassified the inside and outside mouth areas as the lip and teeth.

Conclusion
In this paper, we proposed a mouth segmentation technique based on Mobile DeepLabV3, using MobileNetV2 as the backbone architecture with a decode head based on ASPP and an auxiliary head, together with extensive data augmentation, the application of OHEM to relieve the class imbalance problem, and transfer learning from COCO-Stuff to CelebAMask-HQ and from CelebAMask-HQ to the mouth dataset. Among the baseline techniques, the proposed method has been verified to be more accurate and faster in inference than the others for the mouth segmentation problem. This technique is suitable for implementation in further lip-reading applications, visual face recognition, speech identification, video conferencing and medical disease detection.

Figure 1. The Mobile DeepLabV3 segmentation technique

Figure 2. The training graphs on the training subsets of CelebAMask-HQ and the collected dataset from 15 healthy people

Table 2.
Figures 4 and 5 show satisfactory qualitative results for Mobile DeepLabV3 [18,19]. However, Mobile DeepLabV3 still misclassified the tongue, oral mucosa, skin and nails as the lip area due to color similarity to the lip, the absence of an obvious RGB color difference, low chromatic contrast and occlusion. In Figures 4 and 5, the first row provides the segmentation result from the proposed method.