Using transfer learning for diabetic retinopathy stage classification

Purpose – Diabetic retinopathy (DR) is one of the dangerous complications of diabetes. Its grade level must be tracked to manage its progress and to start the appropriate decision for treatment in time. Effective automated methods for the detection of DR and the classification of its severity stage are necessary to reduce the burden on ophthalmologists and diagnostic contradictions among manual readers.
Design/methodology/approach – In this research, convolutional neural network (CNN) was used based on colored retinal fundus images for the detection of DR and classification of its stages. CNN can recognize sophisticated features on the retina and provides an automatic diagnosis. The pre-trained VGG-16 CNN model was applied using a transfer learning (TL) approach to utilize the already learned parameters in the detection.
Findings – By conducting different experiments set up with different severity groupings, the achieved results are promising. The best-achieved accuracies for 2-class, 3-class, 4-class and 5-class classifications are 86.5, 80.5, 63.5 and 73.7, respectively.
Originality/value – In this research, VGG-16 was used to detect and classify DR stages using the TL approach. Different combinations of classes were used in the classification of DR severity stages to illustrate the ability of the model to differentiate between the classes and verify the effect of these changes on the performance of the model.


Introduction
Diabetes mellitus is a chronic disease caused either by the inability of the pancreas to produce a sufficient amount of insulin, the hormone that regulates blood sugar, or by the inability of the body to use the produced insulin effectively. High blood sugar is a prevalent result of uncontrolled diabetes and eventually affects many systems of the body, such as blood vessels and nerves. Therefore, it is a main cause of blindness, heart attacks, strokes and kidney failure [1]. Diabetic retinopathy (DR) is considered one of the serious complications of diabetes and is responsible for 2.6% of overall blindness. High levels of blood sugar destroy the blood vessels in the retina. This raises the probability of fluid leakage and bleeding, which results in dangerous vision problems that might lead to blindness [2]. To decrease the dangerous effect of DR, early detection, precise diagnosis and appropriate treatment are required [3,4]. Therefore, an intelligent automated method for early and accurate detection of DR is required to manage the progress of the disease and thus guarantee appropriate treatment.
Classification of DR involves weighting many features and locating them in the image. This is an exhausting, time-consuming task for ophthalmologists, and it is prone to mistakes. Therefore, ophthalmologists can be supported by computer-aided diagnosis systems, which can detect abnormalities and classify the severity of different cases. Such systems can decrease the load on ophthalmologists and reduce inconsistencies between manual readers. Considerable work has been done on detecting DR automatically using traditional methods such as k-nearest neighbor (k-NN) and support vector machine (SVM), which depend on hand-crafted feature extraction and then classify cases according to the selected features [5,6]. In contrast, features can be learned automatically from the original images during the training phase using deep learning [7]. The advancement in deep learning has motivated researchers to use it in medical image analysis. The convolutional neural network (CNN) is a type of deep learning network specialized for image analysis. The layers nearer to the input of the model learn low-level features such as lines, the layers in the middle learn complex abstract features that integrate the lower-level features, and the layers closer to the output interpret the extracted features in the context of the classification [8]. High-performing CNN models developed for one image classification task can be imported and used for another image classification task using the transfer learning (TL) approach.
The TL approach utilizes a pre-trained model to train a new model. It takes the knowledge obtained while solving one problem and exploits it in solving different but related problems. The features learned by pre-training on a large dataset can be transferred to the new network, where only the classification component is trained on the new, smaller dataset. TL saves considerable time in developing and training a deep CNN model [9]. There are many high-performing pre-trained models that can be imported and used for image recognition. Most of these models have been developed as part of the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Examples of these models from the published literature are visual geometry group (VGG) [10], inception modules (GoogLeNet) [11,12], residual neural network (ResNet) [13] and neural architecture search network (NasNetLarge) [14]. These models were trained using ImageNet data, which consists of 1,000,000 images in 1,000 classes, so they have learned to detect generic features, and their learned weights are made available for use in similar problems. They achieved state-of-the-art performance and remain effective when reused for other image recognition tasks [15,16].
In this research, a TL approach using the pre-trained VGG-16 model was utilized to detect DR and classify its stages based on retinal fundus images. The remainder of the paper is organized as follows: related research articles about the detection and diagnosis of DR are reviewed in section 2. The proposed method for detection and classification of DR is introduced in section 3. The experimental results of the proposed model are illustrated in section 4. The discussion and comparison with the literature are presented in section 5. Finally, the conclusion and future work are presented in section 6.

Related work
Many systems were proposed in the literature for the detection and diagnosis of DR using various machine learning techniques (MLTs). These systems are either based on conventional MLTs, which depend on hand-crafted features extraction, or deep learning where the features are extracted automatically during the training. In the next sections, some of these systems that were found in the literature will be illustrated.
Based on conventional MLTs, random forest (RF) was used to classify fundus images according to DR grades based on 35 features extracted from the detected red lesions, and achieved an accuracy of 74.1% on the Messidor dataset [17]. Three classifiers (neural network, RF and SVM) were applied to the DIAbetic RETinopathy DataBase fundus images to classify microaneurysms, which are early indicators of DR, based on patches collected from the images. An AUC of 0.985 and an F-measure of 0.926 were achieved using the SVM classifier, which outperformed the other classifiers [18]. The fuzzy technique was used in different tasks of DR classification, such as filtering and histogram equalization in the preprocessing stage, and was also used in the detection of 4 retinal structures. An accuracy of 0.93, specificity of 1 and sensitivity of 0.8679 were achieved using k-NN [19]. Gaussian mixture was used for region segmentation, AlexNet for feature extraction, linear discriminant analysis and principal component analysis (PCA) for feature selection and finally SVM for classification of DR. The best achieved accuracy was 97.93% with a sensitivity of 1 and a specificity of 0.93 [20].
Based on deep learning, PCA was used to reduce the dimensionality, followed by grey wolf optimization to select the optimal parameters, and finally a deep neural network was trained on the Debrecen dataset from the UCI machine learning repository to classify the extracted features into "affected with DR" or not. The achieved accuracy was 97.3% with a sensitivity of 91% and a specificity of 97% [21]. A CNN was trained on the Kaggle fundus images dataset to classify DR and achieved an accuracy of 75%, sensitivity of 30% and specificity of 95% [22]. TL based on pre-trained models was used; GoogLeNet and AlexNet were applied to the Kaggle and Messidor-1 datasets. The achieved accuracies were 74.5%, 68.8% and 57.2% for 2-ary, 3-ary and 4-ary classification models, respectively [23]. The GoogLeNet Inception v3 classifier was applied to the Kaggle dataset. The achieved accuracies were 61.3%, 60.3% and 37.7% for 2-class, 3-class and 5-class, respectively [24]. A synergic deep learning model was applied to the Messidor dataset to detect DR and classify its severity. It achieved an accuracy of 99.28, sensitivity of 98 and specificity of 99 [25]. The VGG-16 model was applied to 35,126 images from the Kaggle dataset. The achieved 5-class classification accuracy was 74%, the sensitivity was 80%, the specificity was 65% and the AUC was 0.80 [26]. DenseNet and VGG-16 were utilized to classify the fundus images into the 5 stages of DR using 3,662 images from the Kaggle dataset. The achieved accuracies were 0.9611 and 0.7326 for DenseNet and VGG-16, respectively [27]. VGG-16 was also applied to 3,662 images from the Kaggle dataset to classify the severity level of DR with an accuracy of 84.31%, an F1 score of 84 and an AUC of 97 [28]. AlexNet, VGG-16 and SqueezeNet were applied to 1,200 images of the MESSIDOR dataset to classify the severity level of DR.

The proposed method
In this section, the used data and the proposed model are described. First, the dataset used for developing the proposed model is presented. Then, the full process, which contains the "Pre-processing the data" and "Developing transfer learning-based CNN model" phases, is explained. In the "Pre-processing the data" phase, the data is prepared for developing the CNN model based on the TL approach in the second phase.

The used dataset
In this research, the proposed model was developed using data from the publicly available Kaggle benchmark dataset [32]. The dataset contains color fundus images with highly diverse levels of illumination. A set of 35,126 retinal images from the Kaggle dataset was used to develop the model.
Kaggle dataset images are in PNG format and they are re-sized to 224 × 224 pixels. Each image is labeled as left or right eye. Each image is categorized according to the level of DR severity into one of 5 class labels (0-4) to represent the (normal, mild, moderate, severe, proliferate_DR) stages. Figure 1 shows different samples from the Kaggle dataset representing different stages, where Figure 1(a) is a normal sample and samples (b)-(e) represent different stages of severity.

Pre-processing the data
To develop the proposed CNN model based on the TL approach, pre-processing steps were applied to the retinal fundus images to prepare them for the learning phase. The pre-processing steps can be summarized as follows:
(1) The retinal fundus image region was cropped automatically from each image to remove the background and unwanted region. Figure 2(a) shows a sample of an original image from the Kaggle dataset, while Figure 2(b) shows the same image after removing the unwanted region.
(2) One of the most important challenges in the development of a deep learning model is unbalanced and limited data. In this research, the used data does not suffer from data limitation, especially since the adopted learning approach is TL, which relatively overcomes the data limitation problem. However, as is clear from Table 1, there is a balancing problem in the used data, where the representation of the classes is unbalanced. Classes 3 and 4 are not as well represented as the other classes, which is an obstacle to detecting the cases belonging to these classes. Therefore, augmentation was applied to the poorly represented classes 3 and 4 to solve the data balancing problem. Each training image belonging to these classes was rotated by three angles of 90°, 180° and 270° and then flipped to enlarge the representation of these classes in the dataset. Column 4 in Table 1 shows the number of images in the different classes after augmentation. Figure 2(c) shows the augmentation by different orientations of the sample in Figure 2(b), which belongs to proliferate_DR (class 4) in the Kaggle dataset.
Figure 2. Pre-processing steps for one image from the proliferate_DR images of the Kaggle dataset
(3) All images were re-sized to the same size to satisfy the CNN requirement of equally sized images that are provided as input to CNN model.
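The augmentation in step (2) can be illustrated with a minimal sketch. It assumes that "flipped" means a horizontal mirror applied to the image and to each of its rotations; the text does not specify the flip direction or which copies were flipped, and the function name `augment_minority` is hypothetical.

```python
import numpy as np

def augment_minority(image: np.ndarray) -> list:
    """Return rotated (90, 180, 270 degrees) and mirrored copies of one image."""
    # np.rot90 rotates in the plane of the first two axes (height, width).
    rotations = [np.rot90(image, k) for k in (1, 2, 3)]
    # Assumption: a horizontal mirror of the original and of each rotation.
    flipped = [np.fliplr(img) for img in [image] + rotations]
    return rotations + flipped

# A random 224 x 224 RGB array stands in for a cropped fundus image.
rng = np.random.default_rng(0)
dummy = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
augmented = augment_minority(dummy)
```

Under these assumptions each image of class 3 or 4 yields seven additional training samples; the actual multiplier in Table 1 may differ.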

Developing transfer learning-based CNN model
TL utilizes a pre-trained model to train another model. Pre-trained models such as VGG, ResNet and Inception are trained on ImageNet, which is a large dataset. The developers of these models made them publicly available to enable more research on the use of these representations in computer vision. Because these pre-trained models contain many millions of parameters, training them from scratch requires a very long computation time and a huge number of input images. TL is therefore a suitable solution for many problems, as it can exploit pre-trained models to solve other problems such as the one presented in this research. The used TL architecture is shown in Figure 3. As shown in the figure, the pre-trained CNN model was trained using ImageNet, which is a large public dataset that contains 1,000,000 images to be classified into 1,000 classes. The retinal fundus dataset was then used, after pre-processing, to re-train the pre-trained network.
The top 2 layers of the pre-trained model, which are employed to classify 1,000 classes, were removed and replaced by an output layer with a SoftMax activation function as a classifier with 5 nodes to supply 5 output classes, which represent the stages of DR. The 5 nodes can be changed to 2-4 nodes according to the different combinations of severity groupings to specify the required output, as will be shown in the "Experimental results" section. The remaining components of the CNN were used as a feature extractor for the new dataset, with the pre-trained model weights kept unchanged. The new network was re-trained on the retinal fundus image dataset with a learning rate of 0.001 and the Adam optimizer for 10-20 epochs.
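The idea of training only a new SoftMax output layer on top of frozen features can be sketched without a deep learning framework. The following is a simplified stand-in, not the Keras implementation used in this work: random synthetic vectors play the role of the frozen VGG-16 features, each epoch is assumed to be one full-batch Adam step (the batch size is not stated in the text), and the function name `train_softmax_head` is hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax_head(features, labels, n_classes, lr=0.001, epochs=20,
                       beta1=0.9, beta2=0.999, eps=1e-8):
    """Fit only a softmax output layer on fixed (frozen) features with Adam."""
    n, d = features.shape
    W, b = np.zeros((d, n_classes)), np.zeros(n_classes)
    mW, vW = np.zeros_like(W), np.zeros_like(W)
    mb, vb = np.zeros_like(b), np.zeros_like(b)
    Y = np.eye(n_classes)[labels]                 # one-hot targets
    for t in range(1, epochs + 1):
        P = softmax(features @ W + b)
        gW = features.T @ (P - Y) / n             # cross-entropy gradients
        gb = (P - Y).mean(axis=0)
        for g, m, v, p in ((gW, mW, vW, W), (gb, mb, vb, b)):
            m *= beta1; m += (1 - beta1) * g      # first-moment estimate
            v *= beta2; v += (1 - beta2) * g * g  # second-moment estimate
            m_hat = m / (1 - beta1 ** t)          # bias correction
            v_hat = v / (1 - beta2 ** t)
            p -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return W, b

# Synthetic, well-separated "features" for 5 classes (illustrative only).
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=500)
features = 3.0 * np.eye(5)[labels] + 0.3 * rng.normal(size=(500, 5))
W, b = train_softmax_head(features, labels, n_classes=5)
accuracy = ((features @ W + b).argmax(axis=1) == labels).mean()
```

Changing `n_classes` to 2, 3 or 4 corresponds to resizing the output layer for the different severity groupings; the feature extractor is untouched in every case.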
In this research, a TL-based model was developed to classify the retinal fundus images dataset into the different stages of DR severity. The three most common pre-trained models, which are VGG, ResNet and Inception, were used to classify the Kaggle dataset into its five severity levels. Because VGG achieved the best result in this task, it was adopted as the model in this research and was further investigated in several experiments that classify the dataset into different combinations of classes. VGG-16 has a depth of 16 weight layers. The input to VGG-16 is an image of size 224 × 224. The network contains a set of convolutional filters of size 3 × 3. A stride of 1 pixel is used for all convolution filters, and the padding is 1 pixel for the 3 × 3 convolutional filters. The rectified linear unit (ReLU) activation function is used for all hidden layers. Five of the convolutional layers are followed by 2 × 2 max-pooling layers with a stride of 2. Finally, 2 fully connected (FC) layers with 4,096 channels each are applied, followed by the output layer of 1,000 channels (one for each class) with a SoftMax activation function.
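The effect of these layer settings on the spatial size of the feature maps can be checked with the standard output-size formula, floor((n + 2p - k) / s) + 1, for input size n, kernel size k, stride s and padding p. The short sketch below is illustrative; the helper name `conv_output_size` is hypothetical.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
sizes = [size]
for _ in range(5):  # five max-pooling stages
    # A 3 x 3 convolution with stride 1 and padding 1 keeps the size unchanged.
    size = conv_output_size(size, kernel=3, stride=1, padding=1)
    # A 2 x 2 max-pooling with stride 2 halves it.
    size = conv_output_size(size, kernel=2, stride=2, padding=0)
    sizes.append(size)

print(sizes)  # [224, 112, 56, 28, 14, 7]
```

The 224 × 224 input is therefore reduced to 7 × 7 feature maps before the fully connected layers.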

Experimental results
This section presents the analysis and the experimental results of the proposed model. To validate the efficiency of the proposed model and to compare the results with others, a benchmark dataset was used for implementation. The Keras Python deep learning library on top of the TensorFlow framework was used to implement the model based on VGG with 16 layers (VGG-16) on a machine with an Intel® Core™ i7 CPU @ 3.6 GHz, 32 GB RAM and a Titan X Pascal Graphics Processing Unit (GPU). Extensive experiments were conducted to find the setting that achieves the best results.
The dataset was randomly split into training and test sets, where the training set represents 70% of the whole data and the remaining 30% was used to test the model. The classification model was implemented according to the proposed architecture previously described using the training set and tested using the test set. As mentioned before, the proposed model was applied using the three most common pre-trained models, which are ResNet, Inception and VGG, to classify the retinal fundus images of the Kaggle dataset into the five severity levels. The achieved accuracies were 66.24%, 63.41% and 73.7% for ResNet50, Inception and VGG-16, respectively. Table 2 shows the achieved accuracies of the used pre-trained models and the input shape required for each model. Since VGG-16 achieved the best result, it was adopted as the model and used for further experiments in which it classifies the dataset into different combinations of classes, as described in the remainder of this section.
Figure 3. Transfer learning architecture
First, to test the capability of the proposed model to detect abnormality in general, experiment #1 was conducted. It is a binary classification task that classifies the cases into normal and abnormal, where abnormal includes the other 4 classes, which are {mild, moderate, severe, proliferate_DR}. These 4 classes were therefore merged into a single abnormal class. The achieved accuracy of this experiment was 75.5% in detecting abnormality.
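The random 70/30 split described above could be sketched as follows. The seed value and the function name `split_dataset` are hypothetical, and the sketch assumes a plain shuffle; the text does not state whether the split was stratified by class or grouped by patient.

```python
import random

def split_dataset(items, train_fraction=0.7, seed=0):
    """Shuffle items and split them into training and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Integer IDs stand in for the 35,126 Kaggle images.
image_ids = list(range(35126))
train_ids, test_ids = split_dataset(image_ids)
```

With 35,126 images this gives 24,588 training and 10,538 test samples before any augmentation of the training set.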
As mentioned before, each image in the Kaggle dataset is categorized into one of the 5-classes (0-4) according to the level of severity to represent (normal, mild, moderate, severe, proliferate_DR) stages. To test the capability of the model to classify the cases into the different 5 severity levels, experiment #2 was conducted. The achieved accuracy was 73.7%.
According to the consulted ophthalmologists, the mild cases in the Kaggle database did not form an obvious class, as some of them could be classified as normal while others were more likely to belong to moderate. It was therefore suspected that the model might not be able to distinguish them from the normal and moderate classes. Likewise, the severe and proliferative classes were not easily distinguishable. Therefore, in experiment #3, "normal" and "mild" cases were merged, as were "severe" and "proliferative" cases. In experiment #4, "mild" and "moderate" cases were merged, as were "severe" and "proliferative" cases. The achieved accuracies were 80.5% and 76.4%, respectively, which indicates that the differentiating traits of the mild class are not evident and that the class labels of the dataset are not fully accurate. The higher accuracy in experiment #3 compared to experiment #4 indicates that the mild class is closer to the normal class, which may mean that more cases in the mild class incline to the normal class than to the moderate class.
In another approach, the intention was to determine the severity level among the abnormal cases only, neglecting the normal cases. In experiment #5, the model classified the 4 abnormal classes. The achieved accuracy was 63.5%. The reduction in accuracy compared to experiment #2 is due to the absence of the "normal" class, which was shown to be clearly distinct. Because of the similarity between classes noted above, classes were merged while again excluding the "normal" cases. In experiment #6, mild was merged with moderate, while severe was merged with proliferate_DR. The achieved accuracy improved to 85.78%.
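The severity groupings of experiments #1 to #6 amount to remapping the original labels before training. The sketch below is one possible way to express them; the dictionary `GROUPINGS` and the function `regroup` are hypothetical names, and the numbering of the merged classes is an assumption.

```python
# Original Kaggle labels: 0 normal, 1 mild, 2 moderate, 3 severe, 4 proliferate_DR.
GROUPINGS = {
    1: {0: 0, 1: 1, 2: 1, 3: 1, 4: 1},  # normal vs abnormal
    2: {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},  # the five original classes
    3: {0: 0, 1: 0, 2: 1, 3: 2, 4: 2},  # {normal, mild}, moderate, {severe, proliferative}
    4: {0: 0, 1: 1, 2: 1, 3: 2, 4: 2},  # normal, {mild, moderate}, {severe, proliferative}
    5: {1: 0, 2: 1, 3: 2, 4: 3},        # abnormal classes only
    6: {1: 0, 2: 0, 3: 1, 4: 1},        # {mild, moderate} vs {severe, proliferative}
}

def regroup(labels, experiment):
    """Map original labels to grouped labels, dropping excluded classes."""
    mapping = GROUPINGS[experiment]
    return [mapping[y] for y in labels if y in mapping]

print(regroup([0, 1, 2, 3, 4], experiment=3))  # [0, 0, 1, 2, 2]
print(regroup([0, 1, 2, 3, 4], experiment=6))  # [0, 0, 1, 1]
```

In a full pipeline the images of excluded classes would have to be dropped together with their labels, and the number of output nodes set to the number of distinct grouped labels.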
By consulting ophthalmologists, we found that the 4 stages of abnormality can be mainly categorized into proliferative diabetic retinopathy (PDR) and non-proliferative diabetic retinopathy (NPDR) according to severity level. Therefore, experiment #7 was conducted by classifying cases into proliferate_DR and non-proliferative {mild, moderate, severe}. The achieved accuracy was 86.5%. Table 3 shows the results of the different experiments of the model built using TL based on VGG-16. The evaluation metrics are computed from the confusion-matrix counts, where TP is the number of true positives, FP the number of false positives, TN the number of true negatives and FN the number of false negatives.
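The metric formulas themselves are not reproduced in this text. As an assumption, the sketch below uses the standard definitions of accuracy, sensitivity and specificity in terms of TP, FP, TN and FN; the metrics actually reported in Table 3 may differ, and the counts in the example are hypothetical.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,    # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),    # true-positive rate
        "specificity": tn / (tn + fp),    # true-negative rate
    }

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

For the multi-class experiments, such counts would typically be computed per class in a one-versus-rest fashion and then averaged.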

Discussion
Detecting DR and classifying its severity stages are among the biggest challenges for ophthalmologists. The contribution of this work is a model that helps in detecting DR and classifying its different stages. Since the number of cases available in different DR datasets is relatively limited, TL is a suitable approach for the proposed work, as it employs pre-trained models to build a model that can classify the DR cases using the available data.
To evaluate the proposed work, it was compared with previous works that applied TL to the same dataset, the Kaggle dataset, for a fair comparison, as illustrated in Table 4. Chowdhury et al. [24] used GoogLeNet to classify DR into 2, 3 and 5 classes, and the achieved accuracies were 61.3%, 60.3% and 37.7%, respectively. Lam et al. [23] applied the TL approach with AlexNet and GoogLeNet, and stated that GoogLeNet achieved better accuracies than AlexNet, which were 74.5%, 68.8% and 57.2% for 2, 3 and 4 classes, respectively. Although the two studies [23,24] used the same model, GoogLeNet, and the same dataset, their results differ. This may result from different pre-processing steps and changes in the settings of the TL network. Pratt et al. [22] classified the cases into the 5 classes and the achieved accuracy was 75%. Thota and Reddy [26] used VGG-16 to classify the Kaggle dataset into the 5 classes and the achieved accuracy was 75%. Pradhan et al. [30] also used VGG-16 to classify the Kaggle dataset into 5 severity stages with an accuracy of 78%.
Table 4. Comparison between proposed work and related works
As shown, the results of the proposed model outperform the two works that used the same dataset and TL approach but with GoogLeNet. The third, fourth and fifth works are better than the proposed model in classifying the cases into 5 classes, which is the only classification applied in these studies. The third and fourth achieved an accuracy of 75% using different models (a CNN and TL with VGG-16, respectively), whereas the proposed model achieved an accuracy of 73.7%. It is worth noting that when the proposed model was applied without augmentation, the results of the different experiments were better, but the model suffered from overfitting. This was evident because the accuracy remained the same through all epochs, even when different models (VGG-16, ResNet50 and Inception) were used. As an example, they all achieved 75% in classifying DR into 5 severity stages (experiment #2) and 91.93% for PDR and NPDR (experiment #7).

Conclusion and future work
Recently, the number of diabetes patients has increased dramatically and consequently the number of DR patients has increased. To help in the detection of DR and the classification of its grade stages, deep learning was used in this research. The TL approach, which utilized the VGG-16 pre-trained CNN model, was applied to the Kaggle retinal fundus images dataset. The pre-trained VGG-16 model was used for feature extraction, and the top 2 layers were replaced by a new output layer with a SoftMax activation function, whose number of classes was set to 2-5 according to the experiment.
According to the results of the different experiments, it was concluded that the borderline between different classes is not sharp, especially between mild and normal, and also between severe and proliferative. Even ophthalmologists find it difficult to distinguish between these classes. More work is therefore needed in the future to find more accurate techniques or models to extract the subtle features that can distinguish between different classes. The proposed architecture can also be applied to other datasets to investigate the behavior of the model with similar severity groupings. This is a possible topic for future work. Application of the model in real life can be beneficial in the diagnosis of DR and the prevention of its complications.