Detecting and staging diabetic retinopathy in retinal images using multi-branch CNN

Purpose – This paper aims to propose a solution for detecting and grading diabetic retinopathy (DR) in retinal images using a convolutional neural network (CNN)-based approach. It could classify input retinal images into a normal class or an abnormal class, which would be further split into four stages of abnormalities automatically.
Design/methodology/approach – The proposed solution is developed based on a newly proposed CNN architecture, namely, DeepRoot. It consists of one main branch, which is connected by two side branches. The main branch is responsible for the primary feature extraction of both high-level and low-level features of retinal images. Then, the side branches further extract more complex and detailed features from the features outputted from the main branch. They are designed to capture details of small traces of DR in retinal images, using modified zoom-in/zoom-out and attention layers.
Findings – The proposed method is trained, validated and tested on the Kaggle dataset. The generalization of the trained model is evaluated using unseen data samples, which were self-collected from a real scenario in a hospital. It achieves a promising performance with a sensitivity of 98.18% under the two-class scenario.
Originality/value – The new CNN-based architecture (i.e. DeepRoot) is introduced with the concept of a multi-branch network. It could assist in solving the problem of an unbalanced dataset, especially when there are common characteristics across different classes (i.e. four stages of DR). Different classes could be outputted at different depths of the network.


Introduction
Diabetic retinopathy (DR) is one of the most commonly seen complications of diabetes. It could lead to blindness, especially when it is left untreated. DR is graded into four stages (i.e. Stage 1 to Stage 4). DR detection and grading can therefore be framed as a five-class classification problem on retinal images: four DR stages plus one normal class (Stage 0). In addition, this paper also considers a two-class classification problem that differentiates normal cases from DR cases of any stage.
This paper focuses on the challenges of detecting and grading DR in retinal images [1]. The proposed method addresses the difficulty of grading five stages of DR, whose traces could be tiny, especially in the early stage of the disease. A modified zoom-in/zoom-out module with attention layers is deployed to address this problem. In addition, in a general classification setting, the final decisions for all classes are made at a single final output layer. However, for this particular research question, some classes may be more straightforward than others, so learning at greater depth could lead to overfitting for those classes. The proposed solution therefore allows decisions for different classes to exit at different levels of the CNN architecture. This technical contribution could improve the grading of DR in retinal images.

Literature review
In this section, our literature review is structured into two parts regarding two main ways of solving the problem of DR detection and staging.
1.1.1 The first type of approach: segmenting/detecting traces of DR for DR detection and staging. The first way is to detect/segment traces of DR. Microaneurysms are detected for identifying Stage 1, and exudate is segmented for identifying Stage 2 [2]. In addition, Stage 3 and Stage 4 could be identified using abnormalities of retinal blood vessels. For example, the method proposed by Ref. [3] segmented patches of microaneurysm in retinal images using the autoencoder-regularized neural network, while the feature-transfer network with local background suppression was proposed by Ref. [4] for microaneurysm detection. The microaneurysm is the earliest signal of DR, whose size is tiny (i.e. less than 2% of the entire image's size). Compared with CNN-based solutions, the segmentation-based solutions could achieve higher performances for identifying Stage 1 of DR.
In addition, there have been several publications of exudate segmentation/detection for identifying Stage 2 of DR. For example, using a conventional solution, a hybrid solution of instance learning (iterative graph cut) and supervised learning (neural network) was proposed for the segmentation [5]. Recently, the dual-branch network-based solution was proposed by Ref. [6], where one branch was designed to focus on a large-sized exudate, while another branch was designed to focus on a small-sized exudate. The paper [7] introduced the CNN-based solution emphasizing super-pixel multi-feature extraction. This technique also focused on solving a small-size challenge of segmentation. The exudate segmentation seems to be also crucial for the DR detection in Stage 2, due to a small-sized trace. For Stage 3 and Stage 4 [2], there was a paper [8] that attempted to detect hemorrhage for detecting an abnormality in diabetic patients. The solution applied the modified VGG19 to extract image features before using the extreme learning machine for pixel-based hemorrhage detection. However, based on our literature reviews for Stage 3 and Stage 4, it is more popular to work on stage classification directly instead of segmentation.
1.1.2 The second type of approach: image-level output solutions for DR detection and staging. The second way is to directly apply machine learning techniques such as CNN for classifying stages and abnormalities in retinal images. There are two main types of existing solutions: (1) classifying into two classes of normal and DR and (2) classifying into five classes of normal and four classes of four stages.

ACI
In the first type of the second way, for example, the method in Ref. [9] did not rely on deep learning-based techniques. A fusion of textural and ridgelet features was learned using Sequential Minimal Optimization (SMO) to classify DR. Similarly, in Ref. [10], a fusion of handcrafted features was also used but learned using Darknet53. In contrast, Ref. [11] proposed a feature extraction solution based on six convolutional layers. Then, SVM, AdaBoost, Naive Bayes, Random Forest and J48 were tried for classifying retinal images into normal or DR classes. In the work proposed by Ref. [12], a multi-task CNN was developed with three decoders: a classification head, a regression head and an ordinal regression head. The regression head could be further used for the cut-off into multiple stages of DR. The method by Ref. [13] also relied on a CNN-based solution. In addition, unsharp masking was applied to enhance retinal images. Two channels, the green and entropy channels, were fed as the input of the CNN. Moreover, in Ref. [14], the well-known pre-trained Inception-ResNet-v2 network, with an additional block of CNN layers, was transferred for detecting DR in retinal images.
In the second type of the second way, retinal images are classified into different stages of DR. In Ref. [15], a CNN-based solution was developed to identify intricate features for classifying stages of DR, such as microaneurysms, exudate and hemorrhages. The method proposed by Ref. [16] was also based on CNN. To enhance the performance of the stage grading, it incorporated the distances between stages of DR into the loss function. The methods introduced by Refs [17][18][19][20][21] were also developed based on newly designed CNN architectures for DR staging, while the method proposed by Ref. [22] applied well-known CNN architectures, including Resnet50, Inceptionv3, Xception, Dense121 and Dense169, for DR staging. Also, the well-known Inception-v3, ResNet50, InceptionresNet50 and Xception were attempted by Ref. [23] for DR grading.
In existing works, different stages of DR were outputted from different output nodes in the final layer of the network. However, in this paper, the proposed CNN architecture is developed based on the assumption that different stages of DR are identified by different characteristics, which should be extracted from retinal images at different feature levels. This differs from the multi-level feature concept proposed in Ref. [24], where two levels were applied. The first level was a fusion of conventional image descriptors, including SIFT and GIST, while the second level referred to features extracted using CNN on the fused first-level features. In contrast, the multi-level features proposed in our paper refer to features extracted at multiple depths of the CNN architecture.
1.2 Background knowledge: multi-branch network
1.2.1 Motivation. The main network of our proposed solution is developed based on the multi-branch network. It is a combination of sequential branches, each consisting of convolutional layers, pooling layers and a fully connected layer. Even though the sequential CNN can perform well on some problem domains, it still gives poor results on some tasks, e.g. DR stage classification, person re-identification and medical image segmentation.
For example, in Ref. [15], the authors proposed a solution based on a sequential CNN, consisting of a stack of convolution layers and three fully connected layers. Their proposed network is deep but not wide and has a large number of training parameters. Unfortunately, the reported result was poor for some DR stages. This is one of the reasons why the multi-branch network is developed here in our work to handle such complex tasks. These tasks require complex layer structures that can extract small and sparse features, i.e. microaneurysm, hemorrhage and small blood vessels in DR cases.
1.2.2 Existing networks. The method proposed by Ref. [25] was based on a multi-branch network for hyperspectral image classification. Typically, a hyperspectral remote sensing image (HSI) has a large data volume and high spectral resolution, with limited labeled data and a small training dataset. This makes the classification very challenging. Therefore, they proposed a multi-branch fusion CNN-based network to overcome such problems. Instead of making one sequential network deeper and wider, which can lead to high complexity and a large number of parameters, they added additional branches. This technique provided excellent classification results when trained on the small-sized dataset. This illustrates one of the benefits of the multi-branch network: it can extract very small features efficiently and converge when trained on a small dataset. This suggests that the multi-branch network is suitable for the DR stage classification problem.
One of the strengths of the multi-branch network is its high performance on small-sized feature extraction. For instance, Ref. [26] proposed the LadderNet, which is a chain of multiple U-Nets [27]. The purpose of the LadderNet is the same as that of U-Net, namely semantic segmentation, but with better capability. In their experiments, the DRIVE dataset [28], a retinal dataset for blood vessel segmentation, was used in the evaluation. The segmentation results show that, owing to its multi-branch structure, the LadderNet outperformed the previous networks, of which U-Net was one. In addition, the LadderNet used the shared-weights residual block technique, i.e. weight sharing among the branches. This technique significantly reduced the number of the LadderNet's parameters.
Another example is Ref. [29], whose experiments focused on multiple sclerosis lesion segmentation. Their proposed CNN included a multi-branch downsampling path, which enables the network to encode information from different sources. Each branch of the network was a ResNet [30]. Information from each branch was combined at each step of the encoding process with a filter size of 64, 256 or 512. So, the network could obtain more information than a single straight network, leading to more accurate segmentation. As a result, their solution was among the best solutions for the ISBI challenge. These are examples of the strong performance of the multi-branch network on small feature extraction.
The CNN architecture of a single straight-branch structure has a stack of convolutional layers with different filter sizes, topped by fully connected layers [31]. For simple problems, this CNN performs well. However, its results were not good on complex tasks, e.g. in the DR stage classification [15], especially on Stage 1 and Stage 3 classification. This experiment indicated that a standard sequential CNN could not handle the DR stage classification problem.
In the CNN architecture of a multi-branch structure (e.g. Siamese network), each branch receives a separate input. Then, features generated from multiple branches are concatenated together at the end of the network, so the final output comes from the concatenated features of the two branches. Since the multi-branch structure takes multiple inputs from multiple branches, this affects the training time: the network converges on the input dataset faster than a single straight-branch network with the same number of convolution layers. This advantage means more branches can be added to the network as long as the graphics card has enough memory. One more point should be mentioned about the multi-branch structure. As noted by Ref. [29], its CNN architecture has no connection between branches and no weight sharing, except at the end, in the feature concatenation step. Although the weight-sharing technique can make the network converge faster and go deeper, it can sometimes cause confusion between the branches' weights if the structures of the branches are very different. This issue can be fixed easily by making the structure of every branch the same. However, the network then loses its complexity and cannot achieve high performance. Therefore, the proposed CNN in this paper, DeepRoot, aims to overcome this multi-branch problem by changing the structure of the normal multi-branch CNN while keeping its complexity.
1.2.3 Multi-branch applied to the proposed solution. Our proposed CNN architecture, DeepRoot, comprises one main branch and two side branches. The main branch is designed for extracting the base features from retinal images. The network then splits from the main branch into two side branches, which differ in their design details and purposes. The detailed technical explanations are given in Section 2 on the proposed method. The outputs of different stages are defined at different branches of the network. The proposed CNN architecture is trained and validated with the Kaggle dataset [32]. Then, the trained model is tested with the Kaggle testing dataset and unseen samples of self-collected retinal images from the real scenario of a hospital.
The main novelty of this paper is to propose a concept of combining outputs from multiple learned side branches for classifying each DR class independently. In addition, a zooming structure is also proposed for the main CNN structure for capturing small details of distinguishing DR classes. The validating process is also performed on cross-datasets where a test set was collected from real-world cases of a hospital.
The rest of this paper is organized as follows. Details of the proposed method are described in Section 2. Experiments and results are discussed in Section 3. Then, conclusions are drawn in Section 4.

Proposed method
This section explains the details of the proposed method. The proposed CNN architecture is mainly introduced to train a model for classifying retinal images into five classes: one normal class and four abnormal stages. The labeled training retinal images are fed into the architecture for learning the model. Therefore, this section mainly explains the details of the proposed CNN architecture, as described in the subsections below. In addition, supplementary materials with additional figures are located at https://github.com/worapanda/ACI_DiabeticRetinopathy-.git

DeepRoot network
In this paper, the proposed network, DeepRoot, is developed based on a combination of the main branch and two side branches. Its shape, shown in Figure 1, explains why it is called DeepRoot. The idea of the DeepRoot network originates from recent advanced convolutional neural networks that have multiple branches. These branches are typically concatenated together at some point in the network or at the end. Even though this can increase the network performance, it could also waste the opportunity to use this extra information for another classification. Also, the extra classification can sometimes make the network converge quickly or be used as part of a combined classification. The structure of the DeepRoot network, as shown in Figure 1, consists of one main branch and two side branches. Descriptions of the three branches and their parameters are listed in Table 1.
In addition, in the proposed solution, each of the three branches (i.e. one main branch and two side branches) generates its own output classes. Therefore, in Figure 1, the fusion is referred to as score-level fusion, instead of feature-level fusion as in other existing methods of the multi-branches network. The learning loss is a combination of the three outputs of the three branches. More details on each branch are explained in the following subsections.
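The combined learning loss described above can be pictured as a weighted sum of one loss term per branch. The following is a minimal sketch, assuming a cross-entropy term for each branch and equal weights; the function names, the weights and the per-branch label layout are illustrative assumptions, since the text does not state how the three terms are combined.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class for one sample."""
    eps = 1e-12  # guards against log(0)
    return -math.log(max(probs[label], eps))

def combined_loss(branch_probs, branch_labels, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-branch losses.

    branch_probs  -- one probability vector per branch (assumed layout)
    branch_labels -- true class index in each branch's own label space
    weights       -- per-branch weights (equal weights are an assumption)
    """
    return sum(
        w * cross_entropy(p, y)
        for w, p, y in zip(weights, branch_probs, branch_labels)
    )

# Example: three branches, each with its own output classes.
loss = combined_loss(
    [[0.7, 0.3], [0.2, 0.5, 0.3], [0.1, 0.9]],
    [0, 1, 1],
)
```

Because every branch contributes its own term, each branch receives a gradient signal from its own output as well as through the shared main branch.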
2.1.1 Main branch. The main branch of the proposed DeepRoot network acts as the primary feature extractor for both high-level and low-level features. Then, the extracted features are passed to the side branches for more complex feature extraction. Thus, the structure of the main branch can be easily changed or even replaced by other well-known CNN architectures. This makes the DeepRoot network flexible and easy to adapt to various problem domains. In this proposed method, the EfficientNet [31] is used as the main branch. At the final stage, the output from the main branch is concatenated with the side branches' outputs. This can prevent diminishing gradients thanks to the extra information from the top layer.
2.1.2 Side branch. The side branches of the DeepRoot network do not have to share the same structure or perform the same task. One of the advantages of a multi-branch network with multiple outputs, like the DeepRoot network, is flexibility. Because each side branch can have a different shape according to its distinct purpose, it can be designed to do various tasks such as extracting finer feature detail, up-sampling the feature size or grouping features for global information. The DeepRoot network consists of two side branches designed to extract finer feature detail via a zoom-in/zoom-out [35] module and collect dominant features via an attention layer.

Zoom-in/zoom-out
In retinal images, many signs of diseases are very small, especially for DR. This can make the convolutional neural network struggle to converge on this data type. So, an additional module is needed for the network to achieve high performance. This is where a zoom-in/zoom-out module is added to the proposed network. The structure of zoom-in/zoom-out module is shown in Figure 2 (Left).
As shown in Figure 2 (Left), features extracted from this zoom-in/zoom-out block are the concatenation of features before and after the zooming process. This zooming process consists of four steps: zoom-in, convolution, zoom-out and convolution. The structure of the zoom-in/zoom-out module, as introduced by Ref. [35], is the process to re-size the feature. The zoom-in is for low-level feature extraction, whereas the zoom-out is for up-sampling back to the original size. This lets the network learn the low-level feature information and pass this extracted information to the up-sampling process. Then, the extracted data is re-sized to the original size for concatenation with the shortcut feature.
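The four-step zooming process and the shortcut concatenation can be illustrated on a single-channel feature map. This is only a sketch under stated assumptions: zoom-in is taken to be a 2x2 average pooling, zoom-out a nearest-neighbour up-sampling, and a fixed 3x3 mean filter stands in for the learned convolution; none of these choices is given by the text, and all function names are hypothetical.

```python
def avg_pool2(fmap):
    """Zoom-in stand-in: halve height and width (both must be even)."""
    h, w = len(fmap) // 2, len(fmap[0]) // 2
    return [
        [
            (fmap[2 * i][2 * j] + fmap[2 * i][2 * j + 1]
             + fmap[2 * i + 1][2 * j] + fmap[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]

def upsample2(fmap):
    """Zoom-out stand-in: nearest-neighbour up-sampling by a factor of 2."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def box_conv3(fmap):
    """Placeholder for a learned convolution: 3x3 mean filter, zero padding."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        total += fmap[y][x]
            out[i][j] = total / 9.0
    return out

def zoom_block(fmap):
    """Zoom-in, convolution, zoom-out, convolution, then concatenation."""
    z = box_conv3(avg_pool2(fmap))
    z = box_conv3(upsample2(z))
    # Two output channels: the shortcut feature and the zoomed feature.
    return [fmap, z]
```

The zoomed channel has the same height and width as the input, which is what allows it to be concatenated with the shortcut feature.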
From this procedure, the network has extra information on tiny-size features for the classification at the terminal. However, the typical structure of zoom-in/zoom-out still cannot converge well on the DR problem, especially in Stage 1 and Stage 2 of the disease. This is because features that define the symptoms in Stage 1 and Stage 2, i.e. microaneurysm and hard exudate, are hard to see due to their tiny sizes. Therefore, the proposed DeepRoot network improves the structure of zoom-in/zoom-out by replacing the convolution2d layer with the Inception module A [36]. The resulting zoom-in/zoom-out module is shown in Figure 2 (Right). This adds complexity to the module and gives a high performance on small-sized features. The Inception module A is used inside Inception V3 and V4, which are state-of-the-art classification models. By applying this module, the zoom-in/zoom-out has more receptive fields for extra signals.
Table 1. Descriptions of the three branches of the proposed architecture and their parameters

Attention layer
To detect DR, only the parts of an image that contain traces of DR are helpful in the detection process. Thus, irrelevant information in retinal images must be ignored in the learning and inference processes. Attention layers [16,37] would help emphasize traces of the disease, which would be learned from common features seen across input data samples. In particular, Stage 3 and Stage 4 of DR contain many small new blood vessels and various diffuse patterns. This type of layer can be used for enhancing the performance of detecting Stage 3 and Stage 4. It could also help separate Stages 3 and 4 from Stages 1 and 2. In the proposed DeepRoot network, attention layers are applied after the main branch's output and before the two side branches, as shown in Figure 1. The adopted attention layers contain three sequential sets of normalization, ReLU and convolutional layers.

Output fusion
Since the DeepRoot network contains three outputs from three different branches, the classification process needs an extra step to combine all three outputs. Unlike other well-known CNNs that combine branches together at some point in the network, the DeepRoot network uses all three outputs in the classification process with pre-defined fusion conditions. These conditions combine all outputs into one final output. The conditions can be changed for specific problems. The conditions for the DR stage classification are shown in Table 2.
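Score-level fusion with pre-defined conditions amounts to a small set of if/else rules applied to the three branch outputs. The sketch below shows only the mechanism. The rules, the threshold and the assignment of classes to branches are hypothetical placeholders and are not the conditions of Table 2.

```python
def fuse_outputs(main_scores, side_a_scores, side_b_scores, threshold=0.5):
    """Combine three branch outputs into one stage label (0 to 4).

    Hypothetical layout, for illustration only:
      main_scores   -> [p_normal, p_abnormal]
      side_a_scores -> [p_stage1, p_stage2]
      side_b_scores -> [p_stage3, p_stage4]
    """
    # Placeholder rule 1: a confident "normal" from the main branch wins.
    if main_scores[0] >= threshold:
        return 0
    # Placeholder rule 2: otherwise take the more confident side branch.
    best_a, best_b = max(side_a_scores), max(side_b_scores)
    if best_b >= best_a:
        return 3 + side_b_scores.index(best_b)
    return 1 + side_a_scores.index(best_a)
```

Keeping the conditions in one function like this makes it straightforward to swap in a different rule set for another problem, which matches the statement that the conditions can be changed.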

Experiments and discussions
In the experiments, two datasets have been used in the evaluations. The first dataset is the well-known public dataset from Kaggle's DR competition. The second dataset is a self-collected dataset from the real environment of a hospital. The Kaggle DR dataset is very popular because it was released for the competition. Also, its large size makes it suitable for training a complex CNN. The distributions of the training and testing datasets are shown in Figure 3, which indicates that both sets are unbalanced. In addition, Stage 4 has the fewest images in both the training and testing datasets. So, the numbers of images used in the experiments are 700 and 1,200 per class for the training and testing processes, respectively. Then, 10% of the selected training images are set aside for validation. To reduce noise and irrelevant details, the datasets are pre-processed with a low-pass filter on the green channel and central cropping. Figure 4 shows examples of original and pre-processed images. It could be noticed that the details of each image are reduced after the pre-processing step. However, the irrelevant information in each image is also significantly reduced. In the fourth column of Figure 4, some details of hemorrhages (i.e. color information) are removed, but the corresponding patterns of the lesion remain in the image.
In addition, our self-collected dataset contains a limited number of images compared with the Kaggle dataset. However, it includes images from a real scenario in a hospital. The distribution of all stages is shown in Figure 3. This dataset is used for testing only, to validate the generalization of the trained model when it must be applied to unseen data samples in the real scenario. The same pre-processing technique as used for the Kaggle dataset is applied to it, and sample images are shown in Figure 4.
These fundus images were captured with a 45°-50° field of view (FoV) and converted into a jpeg file format. The original size of images from the self-collected dataset is 1,604 × 1,206,
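The class balancing and the validation hold-out can be sketched as a per-class random selection followed by a split. This is a minimal illustration, assuming the 10% validation share is taken within each class; the function name, the seed and the stratified split are assumptions rather than details given in the text.

```python
import random

def balanced_split(labels, per_class, val_fraction=0.1, seed=0):
    """Pick `per_class` sample indices per class, then hold out a share.

    Returns (train_indices, val_indices). Raises ValueError if a class
    has fewer than `per_class` samples.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val = [], []
    n_val = int(round(per_class * val_fraction))
    for label in sorted(by_class):
        picked = rng.sample(by_class[label], per_class)
        val.extend(picked[:n_val])
        train.extend(picked[n_val:])
    return train, val

# Toy labels: five classes with 1,000 images each (made-up sizes).
labels = [stage for stage in range(5) for _ in range(1000)]
train_idx, val_idx = balanced_split(labels, per_class=700)
```

With 700 images per class and a 10% hold-out, this leaves 630 training and 70 validation images per class.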

Table 2. The conditions for outputs combination at the classification process (columns: Conditions; Final output)
In the experiments, the proposed CNN network is trained using four GPUs of Nvidia A100 with VRAM of 40 GB and system memory of 1,024 GB. The proposed network is trained using the Kaggle training dataset for 300 epochs, with an image's size of 1500 × 1500 pixels. The training time is approximately 21 h. Adam Optimizer is used in the optimization process, where a batch size is 16, and a learning rate is dynamically adjusted, starting from a value of 0.001 and reduced by 1/10 in a period of epochs. Also, early stopping and data augmentation are applied to prevent overfitting. The augmentation includes vertical flips, linear contrast and rotation.
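The learning-rate adjustment can be read as a step schedule. The sketch below assumes that the rate is multiplied by 0.1 once per fixed period; the period of 100 epochs is a made-up value, since the text does not state the period length.

```python
def step_lr(epoch, base_lr=0.001, factor=0.1, period=100):
    """Learning rate at a given epoch under a step decay schedule.

    `period` (epochs between reductions) is an assumed value.
    """
    return base_lr * factor ** (epoch // period)

# Learning rate at the start of each assumed period over 300 epochs.
schedule = [step_lr(epoch) for epoch in (0, 100, 200)]
```

Under these assumptions the rate stays at 0.001 for the first period and drops by a factor of ten at each later period boundary.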
Then, the Kaggle validation dataset is used at the end of every epoch for measuring the model performance. The epoch checkpoint with the highest performance is chosen as the final model. Finally, the Kaggle test dataset and self-collected dataset are used for the testing. A confusion matrix is used to demonstrate the corresponding results. It is a tool that shows correct and incorrect classification results, making clear into which classes test images are misclassified [38].
In Figure 5a, it could be seen that the trained model can achieve good performances in Stage 0 (normal) and Stage 4, where there is no disease at all in Stage 0 and a significant trace of disease in Stage 4. It obtains moderate performances in Stage 2 and Stage 3, where traces of the disease are still sufficiently significant and could be noticed. In contrast, it achieves a low performance on Stage 1, since this stage is identified by traces of microaneurysm, which are very small (less than 2% of the total size of a retinal image). In particular, the dataset from Kaggle is very difficult, since it contains many challenges owing to its original purpose of being used in the competition. On the five-class scenario, the proposed solution achieves accuracies of 0.83, 0.19, 0.53, 0.48 and 0.72 for Stages 0-4, respectively. In comparison, a recent technique using the same dataset, the method in Ref. [34], achieves accuracies of 0.98, 0.54, 0.84, 0.35 and 0.29. The proposed solution outperforms it in detecting Stages 3 and 4.
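For readers less familiar with the tool, a confusion matrix and the per-class accuracy read from its rows can be computed as follows. This is a generic sketch with toy labels; it is not the evaluation code used in the experiments, and the function names are illustrative.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix with true classes as rows and predictions as columns."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def per_class_accuracy(cm):
    """Share of each true class that was predicted correctly."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(cm)
    ]

# Toy example with three classes (made-up labels).
cm = confusion_matrix([0, 0, 1, 1, 2], [0, 1, 1, 1, 0], n_classes=3)
acc = per_class_accuracy(cm)
```

Off-diagonal entries in a row show which classes the images of that true class were confused with.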
Then, Figure 5c shows the confusion matrix of classifying two classes (normal and abnormal classes). The sensitivity is 79.64%, whereas the specificity is 83.08% for unseen data samples. However, the trained model of the proposed CNN architecture could perform better on unseen samples that were collected from the real scenario in a hospital. The results are shown in Figure 5b.
In addition, an ablation study is performed on the two-class scenario (normal and abnormal) based on the Kaggle dataset. Four components of the proposed solution are investigated. The experimental results are shown in Table 3. The results indicate that all three additional components (side branch, zoom and attention) enhance the performance on top of the main branch. So, all components of the proposed solution are applied for the rest of the experiments.
In Figure 5b, the trained model is tested with unseen data samples, which were collected from a real scenario in a hospital, where test images were classified into five classes, including a normal class and four abnormal classes of four stages. Then, Figure 5d shows the confusion matrix of classifying two classes (normal and abnormal). The sensitivity is 98.18%, whereas the specificity is 54.55% for unseen data samples. Compared with testing on Kaggle, the trained model could detect the abnormality better (i.e. with higher sensitivity) on the unseen data of the self-collected dataset. This could be because the retinal images collected from a hospital are of better quality than the retinal images in the Kaggle dataset, which was designed for a competition.
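Sensitivity and specificity for the normal-versus-abnormal case can be derived from a multi-class confusion matrix by treating class 0 as normal and all other classes as DR. The sketch below assumes that the two-class result is obtained by collapsing the stages in this way; the matrix values are made up for illustration and are not the reported results.

```python
def binary_metrics(cm):
    """Sensitivity and specificity with class 0 as normal, others as DR.

    cm -- confusion matrix, rows are true classes, columns are predictions.
    """
    tn = cm[0][0]                              # normal predicted as normal
    fp = sum(cm[0][1:])                        # normal predicted as any stage
    fn = sum(row[0] for row in cm[1:])         # DR predicted as normal
    tp = sum(sum(row[1:]) for row in cm[1:])   # DR predicted as any stage
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Toy three-class matrix (made-up counts).
sens, spec = binary_metrics([[8, 2, 0], [1, 5, 4], [0, 3, 7]])
```

Note that a DR image assigned to the wrong stage still counts as a true positive here, which is one reason two-class sensitivity can be much higher than per-stage accuracy.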
The trained model developed in this paper is also compared with other existing techniques under both scenarios: two classes, and five classes with four stages. The results are shown in Table 4. For the two-class scenario, the performances are reported in terms of sensitivity, specificity and accuracy, while for the five-class scenario, the performances are reported using a weighted average of accuracies from the five classes.
In Table 4, for a fair comparison, the performances of our proposed method are also evaluated on the dataset without the test set balancing that was done for the results reported in Figure 5. It can be seen that the proposed model is broadly comparable with the other methods. The main objective of this paper is also to introduce the new CNN-based architecture, as explained above. Also, the trained model is validated to be sufficiently generalized on unseen samples of the self-collected dataset, with a sensitivity of 98.18% under the two-class scenario.
In addition, it is sometimes useful in a real-world scenario to classify the retinal image into three classes, where Stages 1 and 2 are grouped and Stages 3 and 4 are grouped. Thus, this experiment further validates the proposed method in this three-class scenario. Figure 5e shows that the accuracy across the three classes becomes more stable, at 81.86%, 72.71% and 84.29%. The sensitivity and specificity values of classifying normal from abnormal cases are also more balanced, at 81.86% and 78.5%, respectively.
The advantage of the proposed solution is that it could distinguish DR from non-DR cases with a high accuracy of over 80%. It could raise the sensitivity up to 98% at the cost of a lower specificity. The early stages (i.e. Stage 1 and Stage 2) could also be differentiated from the severe stages (i.e. Stage 3 and Stage 4) with a high accuracy of 80% on average. However, the main limitation of the proposed solution is grading individual stages. It still suffers from a low performance in separating Stage 1 from Stage 2, since the traces of the early stages of DR are very small (i.e. cover less than 2% of the whole image). In future work, the proposed solution can be used to classify Stage 1 and Stage 2 into the same category. Then, the two classes could be further split using microaneurysm, exudate and hemorrhage detections.
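The three-class grouping can be obtained from a five-class confusion matrix by merging the rows and columns of the grouped stages. The following is a minimal sketch, assuming the three-class results are derived by merging five-class predictions rather than by retraining; the counts are made up and the names are illustrative.

```python
def group_confusion(cm, groups):
    """Merge rows and columns of a confusion matrix according to `groups`."""
    k = len(groups)
    out = [[0] * k for _ in range(k)]
    for gi, rows in enumerate(groups):
        for gj, cols in enumerate(groups):
            out[gi][gj] = sum(cm[r][c] for r in rows for c in cols)
    return out

# Normal, early stages (1 and 2), severe stages (3 and 4).
THREE_CLASS = [(0,), (1, 2), (3, 4)]

# Toy five-class matrix (made-up counts).
cm5 = [
    [5, 1, 0, 0, 0],
    [1, 3, 2, 0, 0],
    [0, 2, 4, 1, 0],
    [0, 0, 1, 3, 2],
    [0, 0, 0, 1, 6],
]
cm3 = group_confusion(cm5, THREE_CLASS)
```

Confusions between Stage 1 and Stage 2, or between Stage 3 and Stage 4, move onto the diagonal after merging, which is why the grouped accuracies are higher and more even.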

Conclusion
This paper proposed a new CNN architecture, DeepRoot, to detect and grade DR in retinal images. DeepRoot was designed to cope with fine-level features for detecting tiny traces of DR, such as microaneurysm and exudate. The staging outputs were determined at different locations in different layers of DeepRoot. DeepRoot was then trained and validated with the retinal image dataset from Kaggle. The trained model was also tested with unseen data samples, i.e. images self-collected from a hospital. The model could achieve a very high sensitivity of 98.18% for the scenario of classifying into the two classes of normal and DR. It could also be seen from the confusion matrix that the model could handle the severe Stages 3 and 4 well, owing to the advantage of the added attention layers. However, the performance dropped significantly in the early stages. In future work, the techniques of DR trace segmentation and the CNN-based solution of DR staging should be combined. The segmentation-based solution should be employed for detecting Stage 1 and Stage 2, by segmenting microaneurysm and exudate. Then, Stage 3 and Stage 4 should be detected using the CNN-based classifier.