Deep learning for Plankton and Coral Classification

Oceans are the essential lifeblood of the Earth: they provide over 70% of the oxygen and over 97% of the water. Plankton and corals are two of the most fundamental components of ocean ecosystems, the former because of their role at many levels of the ocean food chain, the latter because they provide spawning and nursery grounds to many fish populations. Studying and monitoring plankton distribution and coral reefs is vital for environmental protection. In recent years there has been a massive proliferation of digital imagery for the monitoring of underwater ecosystems, and much research is devoted to the automated recognition of plankton and corals. In this paper, we present a study of an automated system for monitoring underwater ecosystems. The proposed system is based on the fusion of different deep learning methods: we study how to create an ensemble of different CNN models, fine-tuned on several datasets, with the aim of exploiting their diversity. Our study addresses the feasibility of fine-tuning pretrained CNNs for underwater imagery analysis, the opportunity of using different datasets for pretraining models, and the possibility of designing an ensemble using the same architecture with small variations in the training procedure. The experimental results are very encouraging: our experiments on five well-known datasets (three plankton and two coral datasets) show that fusing such different CNN models into a heterogeneous ensemble grants a substantial performance improvement with respect to other state-of-the-art approaches in all the tested problems. One of the main contributions of this work is a wide experimental evaluation of well-known CNN architectures, reporting the performance of both single CNNs and ensembles of CNNs on different problems. Moreover, we show how to create an ensemble which improves on the performance of the best single model.


Introduction
Oceans are the essential lifeblood of the Earth: they provide over 70% of the oxygen and over 97% of the water. Without our oceans, all life, including humans, would not survive. Increases in human population and resource use have drastically intensified pressures on marine ecosystem services; monitoring and maintaining the oceanic ecosystem is therefore essential to the preservation of marine habitats. These habitats include plankton populations and coral reefs, which are critical to marine food cycles, habitat provision and nutrient cycling [1]. Plankton are one of the main components of ocean ecosystems, due to their function in the ocean food chain, and studying variations in plankton distribution yields useful indicators of oceanic health. Coral reefs are among the oldest ecosystems on Earth. They are created by the accumulation of hard calcium carbonate skeletons that coral species leave behind when they die. Not only are coral reefs biologically rich ecosystems and a source of natural beauty, they also provide spawning and nursery grounds to many fish populations, protect coastal communities from storm surges and erosion from waves, and give many other services that could be lost forever if a coral reef were degraded or destroyed.
Therefore, the study of plankton and coral distribution is crucial to protecting marine ecosystems. In recent years there has been a massive proliferation of digital imagery [2] for the monitoring of underwater ecosystems. Considering that, typically, less than 2% of the acquired imagery can be manually inspected by a marine expert, this increase in image data has driven the need for automatic detection and classification systems. Many researchers have explored automated methods for accurate annotation of marine imagery using computer vision and machine learning techniques [3]; the accuracy of these systems often depends on the availability of high-quality ground-truth datasets.
Deep learning has certainly been one of the most used techniques for underwater imagery analysis in the recent past [3], and a growing number of works use CNNs for underwater marine object detection and recognition [4][5]. Researchers have increasingly abandoned traditional techniques [6][7], where feature extraction was based on hand-crafted descriptors (such as SIFT and LBP) and classification was performed with Support Vector Machines or Random Forests, in favor of deep learning approaches [8][9] that exploit Convolutional Neural Networks (CNNs) [10] for image classification. CNNs are multi-layered neural networks whose architecture is somewhat similar to that of the human visual system: they use restricted receptive fields and a hierarchy of layers which progressively extract more and more abstract features. A great advantage of CNNs over traditional approaches is the use of view-invariant representations learnt from large-scale data, which makes most kinds of pre-processing unnecessary.
The earliest attempts to use deep learning for underwater imagery analysis date back to 2015, in the National Data Science Bowl for plankton image classification. The winner of the competition [11] proposed an ensemble of over 40 convolutional neural networks including layers designed to increase the network's robustness to cyclic variation: this is strong evidence of the performance advantage of CNN ensembles over single models.
The availability of a large training set has encouraged other works: Py et al. [12] proposed a CNN inspired by GoogleNet and improved with an inception module; Lee et al. [8] addressed the class-imbalance problem by performing transfer learning, pre-training the CNN on class-normalized data; Dai et al. [9] proposed an ad-hoc model, named ZooplanktoNet, inspired by AlexNet and VGGNet; Dai et al. [13] proposed a hybrid 3-channel CNN which takes as input the original image and two preprocessed versions of it. When large training sets were not available, automatic labelling based on Deep Active Learning was proposed [14]. Cui et al. [15] proposed a transfer learning approach starting from a model trained on several datasets. In [16] the authors showed that deep learning outperforms handcrafted features for plankton classification, and that handcrafted approaches add no value even in an ensemble with deep learned methods. In [17] CNNs are used as feature extractors in combination with a Support Vector Machine for plankton classification.
Even fewer works use CNNs for coral classification, since it is a very challenging task due to the high intra-class variance and the fact that some coral species tend to appear together. In [18] the authors proposed a new handcrafted descriptor for coral classification and used a CNN only for the classification step. In [19] Mahmood et al. reported the application of generic CNN representations combined with hand-crafted features for coral reef classification. Afterwards, the same authors [20] proposed a framework for coral classification which employs transfer learning from a pre-trained CNN, thus avoiding the problem of small training sets. Beijbom et al. [21] proposed the first architecture specifically designed for coral classification: a five-channel CNN based on the CIFAR architecture. Gomez-Rios et al. [22] tested several CNN architectures and transfer learning approaches for classifying coral images from small datasets.
In this work we study ensembles of different CNN models, fine-tuned on several datasets, with the aim of exploiting their diversity in designing an ensemble of classifiers. We deal with: (i) the feasibility of fine-tuning pre-trained CNNs for underwater imagery analysis, (ii) the possibility of using different datasets for pre-training models, and (iii) the possibility of designing an ensemble using the same architecture with small variations.
Our ensembles are validated on five well-known datasets (three plankton datasets and two coral datasets) and compared with other state-of-the-art approaches proposed in the literature. Our ensembles, based on the combination of different CNNs, grant a substantial performance improvement with respect to state-of-the-art results in all the tested problems. Despite its complexity in terms of memory requirements, the proposed system has the great benefit of working well "out of the box" on different problems, requiring little parameter tuning and no dataset-specific pre-processing or optimization.
The paper is organized as follows. In Section 2 we present the different CNN architectures used in this work, as well as the training options/methods used for fine-tuning the networks. In Section 3 we describe the experimental environment, including the five datasets used for the experiments, the testing protocols and the performance indicators; moreover, we present and discuss a set of experiments to evaluate our ensembles. In Section 4 conclusions are given, along with some proposals for future research.

Methods
In this work the deep learned methods are based on fine-tuning well-known CNN architectures according to different training strategies: one- and two-round training (see the end of this section for details), different activation functions, and preprocessing before training. We test several CNN architectures among the most promising models proposed in the literature; the aim of our experiments is both to evaluate the most suitable model for these classification problems and to exploit their diversity to design an ensemble.
CNNs are a class of deep neural networks designed for computer vision tasks such as image classification, image clustering by similarity, and object recognition. Applications of CNNs include face identification, object recognition, medical image analysis, and pedestrian and traffic sign recognition. CNNs are designed to work similarly to the human brain in visually perceiving the world: they are made of neurons (the basic computation units of neural networks) that are activated by specific signals. The neurons of a CNN are stacked in "layers", which are the building blocks of a neural network. A CNN is a repeated concatenation of some classes of (hidden) layers included between the input and output layers [23]:
- Convolutional layers (CONV) perform feature extraction: a CONV layer makes use of a set of learnable filters to detect the presence of specific features or patterns in the input image. Each filter, a matrix with a smaller dimension but the same depth as the input volume, is convolved across the input to return an activation map.
- Activation layers (ACT) implement functions that help to decide whether a neuron should fire or not. An activation function is a non-linear transformation of the input signal. Since activation functions play a vital role in the training of CNNs, several have been proposed, including Sigmoid, Tanh and the Rectified Linear Unit (ReLU). In this work we test a variation of the standard ReLU recently proposed in [24] and named Scaled Exponential Linear Unit (SELU). SELU is basically an exponential function multiplied by an additional parameter, designed to avoid the problem of vanishing or exploding gradients.
- Pooling layers (POOL) are subsampling layers used to reduce the number of parameters and computations in the network with the aim of controlling overfitting. The most used pooling functions are max, average and sum.
- Fully connected layers (FC) are those where the neurons are connected to all the activations from the previous layer. The aim of an FC layer is to use the activations from the previous layers to classify the input image into the various classes. Usually the last FC layer takes an input volume and outputs an N-dimensional vector, where N is the number of classes of the target problem.
- Classification layers (CLASS) perform the final classification by selecting the most likely class. They are usually implemented using a SoftMax function for single-label problems, or using a sigmoid activation function with a multiclass output layer for multi-label problems.
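As a concrete reference, the ReLU and SELU activations discussed above can be written as follows (a minimal NumPy sketch; the α and λ constants are the self-normalizing values proposed in [24]):

```python
import numpy as np

# Fixed constants from the SELU paper [24]; they are chosen so that
# activations self-normalize towards zero mean and unit variance.
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def relu(x):
    """Standard Rectified Linear Unit: clips negative inputs to zero."""
    return np.maximum(0.0, x)

def selu(x):
    """Scaled Exponential Linear Unit: a scaled exponential for negative
    inputs, scaled identity for positive inputs."""
    return SCALE * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))  # negative inputs are clipped to zero
print(selu(x))  # negative inputs saturate towards -SCALE*ALPHA
```

Unlike ReLU, SELU keeps a non-zero gradient for negative inputs, which is what mitigates vanishing gradients.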
In our experiments, we test and combine the following pre-trained models available in the MATLAB Deep Learning Toolbox; all the models are modified by changing the last FC and CLASS layers to fit the number of classes of the target problem, without freezing the weights of the previous layers. Moreover, a variant of each model is evaluated in which a SELU activation function replaces each ReLU layer that follows a convolution. The models evaluated are:
- AlexNet [25]. AlexNet (the winner of the ImageNet ILSVRC challenge in 2012) is a model including 5 CONV layers followed by 3 FC layers, with some max-POOL layers in between. Fast training is achieved by applying ReLU activations after each convolutional and fully connected layer. AlexNet accepts images of 227×227 pixels.
- GoogleNet [26]. GoogleNet (the winner of the ImageNet ILSVRC challenge in 2014) is an evolution of AlexNet based on the new "Inception" layers (INC), which combine CONV layers at different granularities and concatenate their outputs into a single output vector. This solution makes the network deeper while limiting the number of parameters to be inferred. GoogleNet is composed of 27 layers, but has fewer parameters than AlexNet. GoogleNet accepts input images of 224×224 pixels.
- InceptionV3 [27]. InceptionV3 is an evolution of GoogleNet (also known as InceptionV1) based on the factorization of 7×7 convolutions into 2 or 3 consecutive layers of 3×3 convolutions. InceptionV3 accepts larger images of 299×299 pixels.
- VGGNet [28]. VGGNet (the network placed second in ILSVRC 2014) is a very deep network which includes 16 or more CONV/FC layers, each based on small 3×3 convolution filters, interspersed with POOL layers (one for each group of 2 or 3 CONV layers). The total number of trainable layers is 23 or more, depending on the net; in our experiments we consider two of the best-performing VGG models, VGG-16 and VGG-19, where 16 and 19 denote the number of layers. The VGG models accept images of 224×224 pixels.
- ResNet [29]. ResNet (the winner of ILSVRC 2015) is a network about 8 times deeper than VGGNet. ResNet introduces a new "network-in-network" architecture using residual (RES) layers. Moreover, differently from the above models, ResNet uses global average pooling layers instead of FC layers at the end of the network. The result is a model deeper than VGGNet, yet smaller in size. In this work we use ResNet50 (a 50-layer residual network) and ResNet101 (a deeper variant of ResNet50). Both models accept images of 224×224 pixels.
- DenseNet [30]. DenseNet is an evolution of ResNet which includes dense connections among layers: each layer is connected to every following layer in a feed-forward fashion. Therefore the number of connections increases from the number of layers L to L×(L+1)/2. DenseNet improves the performance of previous models at the cost of increased computational requirements. DenseNet accepts images of 224×224 pixels.
- MobileNetV2 [31]. MobileNet is a light architecture designed for mobile and embedded vision applications. The model is based on a streamlined architecture that uses depth-wise separable convolutions to build lightweight deep neural networks. The network is made of only 54 layers and has an image input size of 224×224.
- NasNet [32]. NasNet is a well-performing model whose overall architecture is predefined, but whose blocks or cells are learned by a reinforcement learning search method. The basic idea of NasNet is to learn architectural blocks on a small dataset and transfer them to the target problem. The network training is quite heavy and requires large images (input size of 331×331).
In this work we tested three different approaches for fine-tuning the models, using one or two training sets of the target problem:
- One-round tuning (1R): one round is the standard approach for fine-tuning pre-trained networks; the net is initialized with pre-trained weights (obtained on the large ImageNet dataset) and retrained using the training set of the target problem. Differently from other works that freeze the weights of the first layers, we retrain all the layers' weights using the same learning rate throughout the network.
- Two-round tuning (2R): this strategy involves a first round of fine-tuning on a dataset similar to the target one and a second round using the training set of the target problem. The first step consists in fine-tuning the net (initialized with pre-trained weights) on an external dataset including images from classes not present in the target problem. The second step is a one-round tuning performed starting from the network tuned on the external dataset. The motivation behind this method is to first teach the network to recognize underwater patterns, which are very different from the images in the ImageNet dataset; the second round then adjusts the classification weights to the target problem. The plankton and coral datasets used for preliminary tuning are described in Section 3.
- Incremental tuning (INC): this strategy starts as 1R tuning, then performs sequential tuning steps using the same training set; all the resulting models are kept and form a final ensemble of CNNs.
In Fig. 1 a schema of the three methods is reported, where each color corresponds to a separate approach. One-round tuning involves a single fine-tuning: the training set is used to fine-tune the input model (yellow arrow) to obtain the final 1R-tuned CNN (yellow dotted arrow). Two-round tuning involves a first tuning of the input model on an external dataset (green arrow); the resulting "domain-trained CNN" (output of the first green dotted arrow) is re-tuned on the training set to obtain the final 2R CNN. Incremental tuning starts as 1R tuning, then sequential tuning steps using the same training set (orange arrows) are performed to obtain a final ensemble of CNNs (resulting from the dotted orange arrows).
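The three tuning schedules can be summarized with the following sketch (pure Python; `fine_tune` is a hypothetical placeholder for an actual training routine, used here only to make the order of the tuning rounds explicit):

```python
# Hypothetical sketch of the three tuning schedules. `fine_tune` stands in
# for a real training routine (retraining all layers of a pretrained CNN at
# a single learning rate); here it only records which dataset each round used.

def fine_tune(model, dataset):
    """Placeholder for one round of fine-tuning; returns a new 'model'
    tagged with the dataset it was tuned on."""
    return model + [dataset]

def one_round(pretrained, train_set):                 # 1R
    return fine_tune(pretrained, train_set)

def two_rounds(pretrained, external_set, train_set):  # 2R
    domain_model = fine_tune(pretrained, external_set)
    return fine_tune(domain_model, train_set)

def incremental(pretrained, train_set, steps=3):      # INC
    """Sequential tuning steps on the same training set; every
    intermediate model is kept as a member of the final ensemble."""
    ensemble, model = [], pretrained
    for _ in range(steps):
        model = fine_tune(model, train_set)
        ensemble.append(model)
    return ensemble

imagenet_model = ["ImageNet"]  # pretrained initialization
print(two_rounds(imagenet_model, "external", "target"))
# → ['ImageNet', 'external', 'target']
```

The 2R model thus sees ImageNet weights, then the external underwater dataset, then the target training set; INC yields one model per step rather than a single final network.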
The training options are the following: 30 epochs of training (45 for INC tuning, in steps of 3), mini-batch size varying from 16 to 128 observations (depending on the memory requirements of the model), and a learning rate of 0.001. Unlike most works published in the literature, we do not use data augmentation, since it did not grant appreciable performance improvements in our experiments.
Each model trained according to the three fine-tuning approaches listed above has been evaluated both as a stand-alone method and as a component of several ensembles. We tested several selection rules to design the ensembles: the first exploits architectural diversity and is the fusion of different models trained using the same approach (Fus_1R, Fus_2R, …); the second is the fusion of the best stand-alone model (DenseNet) trained using different approaches (DN_1R+2R, DN_1R+2R+INC, …); the third and the fourth are two trained rules whose aim is to select the best stand-alone models (named SFFS and WS, respectively). SFFS is based on one of the best-performing feature selection approaches, Sequential Forward Floating Selection [33], adapted here to select the most performing/independent classifiers to be added to the ensemble. In the SFFS method, the ensemble is grown by adding at each step the model which provides the highest performance gain over the existing subset of models; then a backtracking step is performed in order to exclude the worst model from the current ensemble. Since SFFS requires a training phase in order to select the best-suited models, we perform a leave-one-dataset-out selection. The pseudo-code of SFFS selection using the leave-one-dataset-out protocol is reported in Fig. 2. A detailed description of each ensemble tested in this work is given in Section 3. The WS rule assigns a weight to each classifier and learns the weights by minimizing a loss function. In order to force WS to assign a positive weight to only a few classifiers, the loss function is the sum of the usual cross-entropy loss and a regularization term given by L_REG = Σᵢ wᵢ^γ with γ < 1. Since the sum of the weights is constrained to be 1, the regularization loss is minimized when only one classifier has a positive weight. Hence, the algorithm must find a balance between high accuracy and a small number of classifiers; this balance depends on the value of γ. The optimization is performed using Stochastic Gradient Descent, and the training follows a leave-one-dataset-out protocol.
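A minimal sketch of the SFFS selection over classifier score matrices might look as follows (pure Python; `ensemble_accuracy` and the sum-rule fusion are simplified stand-ins for the actual implementation, and the leave-one-dataset-out protocol is omitted):

```python
def ensemble_accuracy(members, scores, labels):
    """Accuracy of the sum-rule fusion of the selected members;
    scores[i] is a (samples x classes) score matrix for classifier i."""
    n_classes = len(scores[0][0])
    correct = 0
    for s, y in enumerate(labels):
        fused = [sum(scores[m][s][c] for m in members) for c in range(n_classes)]
        correct += fused.index(max(fused)) == y
    return correct / len(labels)

def sffs(scores, labels, target_size):
    """Sequential Forward Floating Selection over classifiers: at each step
    add the classifier giving the largest accuracy gain, then try a
    backtracking step that drops the least useful member if that helps."""
    selected = []
    while len(selected) < target_size:
        # forward step: add the best remaining classifier
        best = max((i for i in range(len(scores)) if i not in selected),
                   key=lambda i: ensemble_accuracy(selected + [i], scores, labels))
        selected.append(best)
        # floating (backward) step: drop the member whose removal helps most
        if len(selected) > 2:
            current = ensemble_accuracy(selected, scores, labels)
            worst = max(selected,
                        key=lambda i: ensemble_accuracy(
                            [m for m in selected if m != i], scores, labels))
            if worst != best and ensemble_accuracy(
                    [m for m in selected if m != worst], scores, labels) > current:
                selected.remove(worst)
    return selected
```

The floating (backward) step is what distinguishes SFFS from plain greedy forward selection: a classifier added early can later be discarded if the ensemble performs better without it.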

Experiments
To validate our approaches we perform experiments on five well-known datasets (three plankton datasets and two coral datasets): for plankton classification we use the same three datasets used by [7], while for coral classification we use the two coral datasets tested in [22]:
- WHOI is a dataset containing 6600 greyscale images stored in tiff format. The images, acquired by an Imaging FlowCytobot from Woods Hole Harbor water, belong to 22 manually categorized plankton classes with an equal number of samples per class. In our experiments, we use the same testing protocol proposed by the authors of [34], based on splitting the dataset into training and testing sets of equal size.
- ZooScan is a small dataset of 3771 greyscale images acquired using ZooScan technology from the Bay of Villefranche-sur-mer. Since the images contain artifacts (due to manual segmentation), all the images have been automatically cropped before classification. The images belong to 20 classes with a variable number of samples per class. In this work we use the same testing protocol proposed by [7]: 2-fold cross-validation.
- Kaggle is a subset, selected by the authors of [7], of the large dataset acquired by ISIIS technology in the Straits of Florida and used for the 2015 National Data Science Bowl competition. The selected subset includes 14374 greyscale images from 38 classes. The distribution among classes is not uniform, but each class has at least 100 samples. In this work we use the same testing protocol proposed by [7]: 5-fold cross-validation.
- EILAT is a coral dataset containing 1123 RGB image patches of size 64×64. The patches are cut out from larger images acquired from coral reefs near Eilat in the Red Sea. The dataset is divided into 8 classes with an imbalanced distribution. In this work we use the same testing protocol proposed by [22]: 5-fold cross-validation.
- RSMAS is a small coral dataset including 766 RGB image patches of size 256×256. The patches are cut out from larger images acquired by the Rosenstiel School of Marine and Atmospheric Sciences of the University of Miami. These images were taken with different cameras in different places. The dataset is divided into 14 classes with an imbalanced distribution. In this work we use the same testing protocol proposed by [22]: 5-fold cross-validation.
For the two-round training we used a further training dataset for the plankton problems, obtained by fusing the images from the dataset used for the National Data Science Bowl and not included in the Kaggle dataset (15962 images from 83 classes) with the "Esmeraldo" dataset (11005 samples, 13 classes) obtained from the ZooScan [35] site. For the coral problems we simply perform the first-round training using the coral dataset not used for testing: EILAT for RSMAS and vice versa. In Fig. 3 some sample images (2 images per class) from the five datasets are shown; from top to bottom: WHOI, ZooScan, Kaggle, EILAT and RSMAS.
In all the experiments the class distribution has not been maintained when splitting the dataset between training and testing, in order to better deal with the dataset drift problem, i.e. the variation of distribution between training and test set, which often causes performance degradation (e.g. [36]). Moreover, we wish to stress that our experiments have been carried out without ad hoc preprocessing for each dataset. The first experiment exhaustively evaluates the ten CNN models according to the one-round fine-tuning strategy. Since CNNs require input images of fixed size, we compare two different resizing strategies: square resize (SqR) pads the image to a square before resizing it to the CNN input size, while padding (Pad) simply pads the image to the CNN input size (only in the few cases where the image is larger than the CNN input size is it resized). Padding is performed by adding white pixels to plankton images, but it is not suited for RGB coral images; therefore, in the two coral datasets we use tiling (Tile), which consists in replicating the original patch up to a standard size (256×256) and then resizing.
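The three resizing strategies can be sketched as follows (a NumPy illustration with a simple nearest-neighbour resize; the actual implementation may differ in interpolation and padding details):

```python
import numpy as np

def nn_resize(img, size):
    """Nearest-neighbour resize of a 2-D (grayscale) image to size x size."""
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows[:, None], cols]

def square_resize(img, size, fill=255):
    """SqR: pad the image to a square (white background), then resize."""
    h, w = img.shape
    side = max(h, w)
    canvas = np.full((side, side), fill, dtype=img.dtype)
    canvas[(side - h) // 2:(side - h) // 2 + h,
           (side - w) // 2:(side - w) // 2 + w] = img
    return nn_resize(canvas, size)

def pad(img, size, fill=255):
    """Pad: centre the image on a white canvas of the CNN input size
    (resize only if the image is larger than the input size)."""
    h, w = img.shape
    if h > size or w > size:
        return square_resize(img, size, fill)
    canvas = np.full((size, size), fill, dtype=img.dtype)
    canvas[(size - h) // 2:(size - h) // 2 + h,
           (size - w) // 2:(size - w) // 2 + w] = img
    return canvas

def tile(img, size, base=256):
    """Tile: replicate the patch to a base x base canvas, then resize."""
    h, w = img.shape
    reps = ((base + h - 1) // h, (base + w - 1) // w)
    canvas = np.tile(img, reps)[:base, :base]
    return nn_resize(canvas, size)
```

SqR preserves the aspect ratio at the cost of added background, Pad preserves the original scale of the organism, and Tile avoids introducing artificial white background into coral texture patches.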
In Table 1 the performance (in terms of F-measure) obtained by the different models, fine-tuned according to the 1R strategy, is reported. The results of all the CNNs were obtained using Stochastic Gradient Descent as optimizer, with a fixed learning rate of 0.001. The last two rows of Table 1 report the classification results obtained by the fusion, at the score level, of the above approaches:
- Fus_SqR / Fus_PT: the sum rule among the models trained using the same resizing strategy.
- Fus_1R: the sum rule among Fus_SqR + Fus_PT.
DenseNet is the best-performing model (Table 1), while NasNet, which has proved to be one of the best-performing architectures in several problems [32], works worse than expected; the reason may be that its automatic block learning is overfitted to ImageNet. Another interesting observation from Table 1 is that the performance of single architectures can be further improved by ensemble approaches. Even the lightweight MobileNetV2, which is one of the worst-performing architectures on these datasets, is useful in the ensemble (with respect to the results reported in [16], where MobileNetV2 was not considered). Since SqR is the resizing strategy that works better for most datasets and models, we fixed it for the following experiments. Nevertheless, it is interesting to note that fusing the scores obtained from different resizing strategies grants better results than the other ensembles.
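The sum-rule fusion used throughout these ensembles is simply the element-wise sum of the score matrices produced by the individual networks, followed by an argmax (a minimal NumPy sketch with hypothetical scores):

```python
import numpy as np

def sum_rule(score_list):
    """Sum-rule fusion: add the (samples x classes) score matrices of the
    ensemble members and take the argmax class per sample."""
    fused = np.sum(score_list, axis=0)
    return fused.argmax(axis=1)

# Hypothetical scores from two CNNs on three samples, two classes:
cnn_a = np.array([[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]])
cnn_b = np.array([[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]])
print(sum_rule([cnn_a, cnn_b]))  # → [0 0 1]
```

Note how the second sample, on which the two networks disagree, is decided by the more confident classifier; this is the mechanism by which diverse members can correct each other's errors.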
In Tables 2 and 3 exhaustive experiments with the different CNN models are reported, using the following methods (NasNet is excluded for computational reasons): 1R (simple fine-tuning using the SqR resizing strategy), 2R (two-round tuning using the SqR resizing strategy), INC (ensemble of models obtained by incremental training), and SELU (a variation of each model based on SELU activations, trained by 1R). In Table 4 the results obtained by several ensembles are reported. We consider both the ensembles obtained by fusing all the CNN models trained with the same strategy (named "Fus_*") and the fusion of the best single model, DenseNet (named "DN_*"), trained according to different approaches:
- Fus_1R is the fusion (already reported in Table 1) among the models trained by 1R tuning.
- Fus_2R is the fusion among the models trained by 2R tuning.
- Fus_INC is the fusion of the ensembles obtained by incremental training.
- Fus_SELU is the fusion among the models modified by means of SELU activation layers.
- DN_1R is the fusion of the two DenseNet models fine-tuned by 1R tuning using the two resizing strategies (SqR + Pad/Tile).
- DN_1R+2R is the fusion of DN_1R and the DenseNet model trained by 2R tuning.
- DN_1R+2R+INC is the fusion of DN_1R+2R and the INC version of DenseNet.
- DN_1R+2R+INC+SELU is the fusion of the above ensemble and the SELU version of DenseNet.
The last column of Table 4 shows the rank of the average rank, which is obtained by ranking the methods on each dataset, averaging the results and ranking the approaches again. From Tables 2 and 3 it is clear that a single fine-tuning is enough for the tested problems, perhaps because the datasets used in the first-round tuning are not sufficiently similar to the target problem, or more probably because the training set of the target problem is large enough for training. The INC version of each model slightly improves the performance in some cases, but does not grant a substantial advantage.
As for SELU, it works better than ReLU only in a few cases and does not work in the VGG models. Nevertheless, from the ensembles of Table 4 we can see that the use of a preliminary training (2R) or other variations produces classifiers diverse from 1R, and their fusion can significantly improve the performance in these classification problems. Clearly, the ensembles of different CNN models (Fus_*) strongly outperform the stand-alone CNNs on all five tested datasets. However, for computational reasons, we also considered lighter ensembles based on a single architecture (we selected the best-performing one, i.e. DenseNet): it is interesting to note that DN_1R+2R obtains a very good performance using only three networks.
Moreover, we ran some experiments using a CNN as a feature extractor for training Support Vector Machine classifiers. We used the same approach proposed in [37], starting from DenseNet trained by the 1R-SqR approach. The results are reported in Table 5: the first row reports the performance of DenseNet trained by 1R-SqR from Table 2 (here named DN_SqR), the second row reports the performance of the ensemble of SVMs trained on the features extracted by DenseNet (DN_SVM), and the last row (Sum) is the sum rule between DN_SVM and DN_SqR. The performance improvement is almost negligible, although a slight improvement is obtained on all the datasets. The last experiment is aimed at reducing the computational requirements of the best ensemble. To this end, we tested the two "classifier selection" approaches detailed in Section 2: SFFS and WS.
The results obtained selecting 11 and 3 classifiers are reported in Table 6 and are very interesting, since they demonstrate that a reduced set of 11 classifiers improves the performance with respect to the previous best ensemble. Using fewer classifiers yields a lighter approach: SFFS (11 classifiers) has an average memory usage (measured as the total size of the CNN models) of 2026 MB and WS (11 classifiers) of 2129 MB; SFFS (3 classifiers) has an average memory usage of 582.8 MB and WS (3 classifiers) of 502 MB, compared to the ~5.5 GB of Fus_2R + Fus_1R. The performance increase obtained going from the best stand-alone approach (i.e. DN_SqR) to our best ensemble is shown in Fig. 4, where each step towards our best solution is compared. To show that the performance increase, in the case of imbalanced distributions, is not limited to the larger classes but extends to the small ones, we show in Fig. 5 the confusion matrices obtained by DN_SqR and SFFS(11) on the ZooScan dataset, which is the dataset with the largest performance increase (the confusion matrices for the other datasets are included as supplementary material). The comparison of the two matrices confirms an improvement distributed over all the classes. Finally, in Table 7 and Table 8 we report the comparison between the ensembles proposed in this work and other state-of-the-art approaches evaluated on the same datasets:
- FUS_Hand [38] is an ensemble of handcrafted descriptors.
- Gaussian SVM [7] is a handcrafted approach based on an SVM classifier.
- MKL [7] is a handcrafted approach based on multiple kernel learning classifiers.
- DeepL [22] is a deep learned approach based on ResNet.
- Opt [18] is a handcrafted approach based on a novel feature descriptor; the reported result is the best among all the tested feature descriptors.
- EnsHC [39] is an ensemble of several handcrafted features (i.e. completed local binary patterns, grey-level co-occurrence matrices, Gabor filter responses, …) and classifiers.
The results are reported in terms of F-measure or accuracy, depending on the performance indicator used in the literature; the same testing protocol is used for all the methods. The same ensemble, not adapted to each given dataset, obtains state-of-the-art results in all five tested datasets.
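For reference, the macro-averaged F-measure used as a performance indicator can be computed as follows (a minimal NumPy sketch; `macro_f_measure` is a hypothetical helper name):

```python
import numpy as np

def macro_f_measure(y_true, y_pred, n_classes):
    """Macro-averaged F-measure: the per-class F1 scores are averaged over
    classes, so small classes weigh as much as large ones (which matters
    for imbalanced datasets such as EILAT and RSMAS)."""
    f_scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f_scores.append(f1)
    return float(np.mean(f_scores))
```

Because each class contributes equally to the average, an improvement on small classes (as seen in the ZooScan confusion matrices) is directly reflected in this indicator.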

Conclusions
Underwater imagery analysis is a challenging task due to the large number of classes, the great intra-class variance, the small inter-class differences, and the lighting variations caused by the water. In this paper we studied several deep learned approaches for plankton and coral classification, with the aim of exploiting their diversity in designing an ensemble of classifiers. Our final system is based on the fine-tuning of several CNN models trained according to different strategies, which, fused together in a final ensemble, achieve higher performance than any single CNN. In our experiments, carried out on five datasets (three plankton and two coral ones), we evaluated well-known CNN models fine-tuned on the target problem with some training variations (different resizing of input images, tuning on similar datasets, small variations of the original CNN model): the experimental results show that the best stand-alone model for most of the target datasets is DenseNet, but the combination of several CNNs in an ensemble grants a substantial performance improvement with respect to the single best model. In order to reduce the complexity of the resulting ensemble, we used a selection approach aimed at choosing the best classifiers to include in the fusion: the final result is a lighter version of the ensemble, including only 11 classifiers, which outperforms all the other ensembles proposed.
All the MATLAB code used in our experiments will be freely available in our GitHub repository (https://github.com/LorisNanni), in order to reproduce the reported experiments and for future comparisons.