Classification assessment methods

Classification techniques have been applied to many applications in various fields of sciences. There are several ways of evaluating classification algorithms, and the analysis of such metrics and their significance must be interpreted correctly for evaluating different learning algorithms. Most of these measures are scalar metrics and some of them are graphical methods. This paper introduces a detailed overview of the classification assessment measures with the aim of providing the basics of these measures and showing how they work, to serve as a comprehensive source for researchers who are interested in this field. This overview starts by highlighting the definition of the confusion matrix in binary and multi-class classification problems. Many classification measures are also explained in detail, and the influence of balanced and imbalanced data on each metric is presented. An illustrative example is introduced to show (1) how to calculate these measures in binary and multi-class classification problems, and (2) the robustness of some measures against balanced and imbalanced data. Moreover, some graphical measures such as the Receiver operating characteristics (ROC), Precision-Recall (PR), and Detection error trade-off (DET) curves are presented in detail. Additionally, in a step-by-step approach, different numerical examples are demonstrated to explain the preprocessing steps of plotting ROC, PR, and DET curves.


Introduction
Classification techniques have been applied to many applications in various fields of sciences. In classification models, the training data are used for building a classification model to predict the class label for a new sample. The outputs of classification models can be discrete, as in the decision tree classifier, or continuous, as in the Naive Bayes classifier [7]. However, the outputs of learning algorithms need to be assessed and analyzed carefully, and this analysis must be interpreted correctly, so as to evaluate different learning algorithms.
The classification performance is represented by scalar values as in different metrics such as accuracy, sensitivity, and specificity. Comparing different classifiers using these measures is easy, but it has many problems such as the sensitivity to imbalanced data and ignoring the performance of some classes. Graphical assessment methods such as Receiver operating characteristics (ROC) and Precision-Recall curves give different interpretations of the classification performance.
Some of the measures which are derived from the confusion matrix for evaluating a diagnostic test are reported in [19]. In that paper, only eight measures were introduced. Powers introduced an excellent discussion of the precision, Recall, F-score, ROC, Informedness, Markedness, and Correlation assessment methods with detailed explanations [16]. Sokolova et al. reported some metrics which are used in medical diagnosis [20]. Moreover, a good investigation of some measures and their robustness against different changes in the confusion matrix is introduced in [21]. Tom Fawcett presented a detailed introduction to the ROC curve including (1) good explanations of the basics of the ROC curve, (2) a clear example for generating the ROC curve, (3) comprehensive discussions, and (4) good explanations of the Area under curve (AUC) metric [8]. Jesse Davis and Mark Goadrich reported the relationship between the ROC and Precision-Recall curves [5]. Our paper introduces a detailed overview of the classification assessment methods with the goal of providing the basic principles of these measures and showing how they work, to serve as a comprehensive source for researchers who are interested in this field. This paper has details of most of the well-known classification assessment methods. Moreover, this paper introduces (1) the relations between different assessment methods, (2) numerical examples to show how to calculate these assessment methods, (3) the robustness of each method against imbalanced data, which is one of the most important problems in real-time applications, and (4) explanations of different curves in a step-by-step approach.
This paper is divided into eight sections. Section 2 gives an overview of the classification assessment methods. This section begins by explaining the confusion matrix for binary and multi-class classification problems. Based on the data that can be extracted from the confusion matrix, many classification metrics can be calculated. Moreover, the influence of balanced and imbalanced data on each assessment method is introduced. Additionally, an illustrative numerical example is presented to show (1) how to calculate these measures in both binary and multi-class classification problems, and (2) the robustness of some measures against balanced and imbalanced data. Section 3 introduces the basics of the ROC curve, which are required for understanding how to plot and interpret it. This section also presents visualized steps with an illustrative example for plotting the ROC curve. The AUC measure is presented in Section 4. In this section, the AUC algorithm with detailed steps is explained. Section 5 presents the basics of the Precision-Recall curve and how to interpret it. Further, in a step-by-step approach, different numerical examples are demonstrated to explain the preprocessing steps of plotting ROC and PR curves in Sections 3 and 5. Classification assessment methods for biometric models including steps of plotting the DET curve are presented in Section 6. In Section 7, results in terms of different assessment methods of a simple experiment are presented. Finally, concluding remarks will be given in Section 8.

Classification performance
The assessment method is a key factor in evaluating the classification performance and guiding the classifier modeling. There are three main phases of the classification process, namely, the training phase, validation phase, and testing phase. The model is trained using input patterns, and this phase is called the training phase. These input patterns are called training data, which are used for training the model. During this phase, the parameters of a classification model are adjusted. The training error measures how well the trained model fits the training data. However, the training error is always smaller than the testing error and the validation error because the trained model fits the same data which are used in the training phase. The goal of a learning algorithm is to learn from the training data to predict class labels for unseen data; this is the testing phase. However, the testing error or out-of-sample error cannot be estimated because the class labels or outputs of testing samples are unknown. This is the reason why the validation phase is used for evaluating the performance of the trained model. In the validation phase, the validation data provide an unbiased evaluation of the trained model while tuning the model's hyperparameters. According to the number of classes, there are two types of classification problems, namely, binary classification where there are only two classes, and multi-class classification where the number of classes is higher than two. Assume we have two classes, i.e., binary classification, P for the positive class and N for the negative class. An unknown sample is classified to P or N. The classification model that was trained in the training phase is used to predict the true classes of unknown samples. This classification model produces continuous or discrete outputs.
The discrete output that is generated from a classification model represents the predicted discrete class label of the unknown/test sample, while continuous output represents the estimation of the sample's class membership probability. Figure 1 shows that there are four possible outputs which represent the elements of a 2 × 2 confusion matrix or a contingency table. The green diagonal represents correct predictions and the pink diagonal indicates the incorrect predictions. If the sample is positive and it is classified as positive, i.e., correctly classified positive sample, it is counted as a true positive (TP); if it is classified as negative, it is considered as a false negative (FN) or Type II error. If the sample is negative and it is classified as negative, it is considered as a true negative (TN); if it is classified as positive, it is counted as a false positive (FP), false alarm, or Type I error.
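As a concrete sketch, the four counts can be derived from lists of true and predicted labels. The helper name and the sample labels below are illustrative, not from the paper:

```python
def binary_confusion(y_true, y_pred, positive=1):
    """Count (TP, FN, FP, TN) for a binary problem (hypothetical helper)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fn, fp, tn

# Three positive and three negative samples; one FN and one FP (Type II and Type I errors).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(binary_confusion(y_true, y_pred))  # (2, 1, 1, 2)
```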
As we will present in the next sections, the confusion matrix is used to calculate many common classification metrics. Figure 2 shows the confusion matrix for a multi-class classification problem with three classes (A, B, and C). As shown, TP_A is the number of true positive samples in class A, i.e., the number of samples that are correctly classified from class A, and E_AB is the number of samples from class A that were incorrectly classified as class B, i.e., misclassified samples. Thus, the false negative in class A (FN_A) is the sum of E_AB and E_AC (FN_A = E_AB + E_AC), which indicates the sum of all class A samples that were incorrectly classified as class B or C. Simply, the FN of any class, which is located in a column, can be calculated by adding the errors in that class/column, whereas the false positive for any predicted class, which is located in a row, represents the sum of all errors in that row. For example, the false positive in class A is calculated as FP_A = E_BA + E_CA.

Classification metrics with imbalanced data

Different assessment methods are sensitive to imbalanced data, i.e., when the samples of one class in a dataset outnumber the samples of the other class(es) [25]. To explain why this is so, consider the confusion matrix in Figure 1. The class distribution, i.e., the ratio between the positive and negative samples (P/N), represents the relationship between the left column and the right column. Any assessment metric that uses values from both columns will be sensitive to imbalanced data, as reported in [8]. For example, some metrics such as accuracy and precision 1 use values from both columns of the confusion matrix; thus, as data distributions change, these metrics will change as well, even if the classifier performance does not. Therefore, such metrics cannot distinguish between the numbers of correctly classified samples from different classes [11].
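Returning to the multi-class confusion matrix of Figure 2, the column/row rules for FN and FP can be sketched as follows; the class counts below are hypothetical, with rows as predicted classes and columns as actual classes:

```python
# Hypothetical 3-class confusion matrix: rows = predicted class, columns = actual class.
labels = ["A", "B", "C"]
cm = [[50, 3, 2],   # predicted A
      [4, 45, 6],   # predicted B
      [1, 2, 47]]   # predicted C

def per_class_counts(cm, i):
    tp = cm[i][i]
    fn = sum(cm[r][i] for r in range(len(cm)) if r != i)  # errors in column i
    fp = sum(cm[i][c] for c in range(len(cm)) if c != i)  # errors in row i
    return tp, fn, fp

# For class A: FN_A = 4 + 1 = 5 (column errors), FP_A = 3 + 2 = 5 (row errors).
print(per_class_counts(cm, 0))  # (50, 5, 5)
```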
This fact is partially true, because there are some metrics, such as the Geometric Mean (GM) and Youden's index (YI) 2, that use values from both columns and can still be used with balanced and imbalanced data. This can be interpreted as follows: the metrics which use values from only one column cancel the changes in the class distribution, but some metrics which use values from both columns are also not sensitive to imbalanced data, because the changes in the class distribution cancel each other. For example, the accuracy is defined as Acc = (TP + TN)/(TP + TN + FP + FN) and the GM is defined as GM = √(TPR × TNR) = √((TP/(TP + FN)) × (TN/(TN + FP))); thus, both metrics use values from both columns of the confusion matrix. Changing the class distribution can be obtained by increasing/decreasing the number of samples of the negative/positive class. With the same classification performance, assume that the negative class samples are increased by α times; thus, the TN and FP values will be αTN and αFP, respectively, and the accuracy becomes Acc = (TP + αTN)/(TP + αTN + αFP + FN). This means that the accuracy is affected by the changes in the class distribution. On the other hand, the GM metric becomes GM = √((TP/(TP + FN)) × (αTN/(αTN + αFP))) = √((TP/(TP + FN)) × (TN/(TN + FP))), and hence the changes in the negative class cancel each other. This is the reason why the GM metric is suitable for imbalanced data. Similarly, any metric can be checked to know whether it is sensitive to imbalanced data or not.
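The cancellation argument above can be checked numerically; the confusion-matrix counts below are hypothetical:

```python
import math

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def geometric_mean(tp, tn, fp, fn):
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return math.sqrt(tpr * tnr)

tp, fn, tn, fp = 80, 20, 70, 30   # hypothetical balanced counts
alpha = 10                         # scale the negative class by 10x

print(accuracy(tp, tn, fp, fn))                      # 0.75
print(accuracy(tp, alpha * tn, alpha * fp, fn))      # changes (accuracy is sensitive)
print(geometric_mean(tp, tn, fp, fn))
print(geometric_mean(tp, alpha * tn, alpha * fp, fn))  # unchanged (GM cancels alpha)
```

Scaling the negative column by α shifts the accuracy but leaves GM untouched, matching the derivation above.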

Accuracy and error rate
Accuracy (Acc) is one of the most commonly used measures of classification performance, and it is defined as the ratio between the correctly classified samples and the total number of samples as follows [20]: Acc = (TP + TN)/(P + N), where P and N indicate the number of positive and negative samples, respectively. The complement of the accuracy metric is the Error rate (ERR) or misclassification rate. This metric represents the number of misclassified samples from both positive and negative classes, and it is calculated as ERR = (FP + FN)/(P + N) = 1 − Acc [4]. Both the accuracy and error rate metrics are sensitive to imbalanced data. Another problem with the accuracy is that two classifiers can yield the same accuracy but perform differently with respect to the types of correct and incorrect decisions they provide [9]. However, Takaya Saito and Marc Rehmsmeier reported that the accuracy is suitable with imbalanced data because they found that the accuracy values of the balanced and imbalanced data in their example were identical [17]. The reason why the accuracy values were identical in their example is that the sum of TP and TN in the balanced and imbalanced data was the same.

Sensitivity and specificity

Sensitivity, True positive rate (TPR), hit rate, or recall of a classifier represents the ratio of the correctly classified positive samples to the total number of positive samples, and it is estimated according to Eq. (2), TPR = TP/(TP + FN) [20], whereas specificity, True negative rate (TNR), or inverse recall is expressed as the ratio of the correctly classified negative samples to the total number of negative samples, TNR = TN/(TN + FP) [20]. Thus, the specificity represents the proportion of the negative samples that were correctly classified, and the sensitivity is the proportion of the positive samples that were correctly classified. Generally, we can consider sensitivity and specificity as two kinds of accuracy, the first for actual positive samples and the second for actual negative samples. Sensitivity depends on TP and FN, which are in the same column of the confusion matrix, and similarly, the specificity metric depends on TN and FP, which are in the same column; hence, both sensitivity and specificity can be used for evaluating the classification performance with imbalanced data [9].
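The two "per-class accuracies" can be sketched directly from their definitions; the counts below are hypothetical (90 of 100 positives and 60 of 100 negatives classified correctly):

```python
def sensitivity(tp, fn):
    return tp / (tp + fn)   # TPR = TP / (TP + FN)

def specificity(tn, fp):
    return tn / (tn + fp)   # TNR = TN / (TN + FP)

print(sensitivity(90, 10))  # 0.9
print(specificity(60, 40))  # 0.6
```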
The accuracy can also be defined in terms of sensitivity and specificity as follows [20]: Acc = TPR × P/(P + N) + TNR × N/(P + N).

False positive and false negative rates

The False positive rate (FPR) is also called the false alarm rate (FAR) or Fallout, and it represents the ratio between the incorrectly classified negative samples and the total number of negative samples [16]. In other words, it is the proportion of the negative samples that were incorrectly classified. Hence, it complements the specificity, FPR = FP/(FP + TN) = 1 − TNR, as in Eq. (4) [21]. The False negative rate (FNR) or miss rate is the proportion of positive samples that were incorrectly classified. Thus, it complements the sensitivity measure, FNR = FN/(FN + TP) = 1 − TPR, as defined in Eq. (5). Both FPR and FNR are not sensitive to changes in data distributions, and hence both metrics can be used with imbalanced data [9].

Predictive values
Predictive values (positive and negative) reflect the performance of the prediction. Positive predictive value (PPV) or precision represents the proportion of positive samples that were correctly classified to the total number of positive predicted samples, as indicated in Eq. (6) [20]. On the contrary, Negative predictive value (NPV), inverse precision, or true negative accuracy (TNA) measures the proportion of negative samples that were correctly classified to the total number of negative predicted samples, as indicated in Eq. (7) [16]. These two measures are sensitive to imbalanced data [21,9]. The False discovery rate (FDR) and False omission rate (FOR) measures complement the PPV and NPV, respectively (FDR = 1 − PPV and FOR = 1 − NPV; see Eqs. (6) and (7)).
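A small sketch of the predictive values and their complements, using hypothetical counts (PPV = TP/(TP + FP), NPV = TN/(TN + FN)):

```python
def predictive_values(tp, tn, fp, fn):
    ppv = tp / (tp + fp)   # precision: correct positives among predicted positives
    npv = tn / (tn + fn)   # inverse precision: correct negatives among predicted negatives
    return ppv, npv

ppv, npv = predictive_values(90, 80, 20, 10)
print(ppv, npv)              # ≈ 0.818, ≈ 0.889
print(1 - ppv, 1 - npv)      # FDR and FOR complement them
```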
The accuracy can also be defined in terms of precision and inverse precision as follows [16]: Acc = PPV × (TP + FP)/(P + N) + NPV × (TN + FN)/(P + N).

Likelihood ratio
The likelihood ratio combines both sensitivity and specificity, and it is used in diagnostic tests. In such tests, not all positive results are true positives, and the same holds for negative results; hence, the positive and negative results change the probability/likelihood of disease. The likelihood ratio measures the influence of a result on that probability. The positive likelihood ratio (LR+) measures how much the odds of the disease increase when a diagnostic test is positive, and it is calculated as LR+ = TPR/(1 − TNR), as in Eq. (9) [20]. Similarly, the negative likelihood ratio (LR−) measures how much the odds of the disease decrease when a diagnostic test is negative, and it is calculated as LR− = (1 − TPR)/TNR. Both measures depend on the sensitivity and specificity measures; thus, they are suitable for balanced and imbalanced data [6].
Both LR+ and LR− are combined into one measure which summarizes the performance of the test; this measure is called the Diagnostic odds ratio (DOR). The DOR metric represents the ratio between the positive likelihood ratio and the negative likelihood ratio, DOR = LR+/LR−, as in Eq. (10). This measure is utilized for estimating the discriminative ability of the test and also for comparing two diagnostic tests. From Eq. (10) it can be remarked that the value of DOR increases when (1) the TP and TN are high and (2) the FP and FN are low [18].
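The three diagnostic quantities can be sketched together; the sensitivity and specificity values below are hypothetical:

```python
def likelihood_ratios(sens, spec):
    lr_pos = sens / (1 - spec)    # LR+ = TPR / FPR
    lr_neg = (1 - sens) / spec    # LR- = FNR / TNR
    return lr_pos, lr_neg

sens, spec = 0.9, 0.8             # hypothetical diagnostic test
lr_pos, lr_neg = likelihood_ratios(sens, spec)
dor = lr_pos / lr_neg             # diagnostic odds ratio
print(lr_pos, lr_neg, dor)        # ≈ 4.5, 0.125, 36.0
```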

Youden's index
Youden's index (YI) or Bookmaker Informedness (BM) is one of the well-known diagnostic tests. It evaluates the discriminative power of the test. The formula of Youden's index combines the sensitivity and specificity, as in the DOR metric, and it is defined as follows: YI = TPR + TNR − 1 [20]. The YI metric ranges from zero, when the test is poor, to one, which represents a perfect diagnostic test. It is also suitable for imbalanced data. One of the major disadvantages of this test is that it does not change with respect to the differences between the sensitivity and specificity of the test. For example, given two tests, the sensitivity values for the first and second tests are 0.7 and 0.9, respectively, and the specificity values for the first and second tests are 0.8 and 0.6, respectively; the YI value for both tests is 0.5.
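The example from the text can be verified directly; the two tests trade sensitivity against specificity yet share the same YI:

```python
def youden_index(sens, spec):
    # YI = TPR + TNR - 1
    return sens + spec - 1

# The two tests from the text: different trade-offs, identical YI (~0.5).
print(youden_index(0.7, 0.8))
print(youden_index(0.9, 0.6))
```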

Other metrics
There are many other metrics that can be calculated from the previous ones. Some details about each measure are as follows: Matthews correlation coefficient (MCC): this metric was introduced by Brian W. Matthews in 1975 [14]; it represents the correlation between the observed and predicted classifications, and it is calculated directly from the confusion matrix as in Eq. (11). A coefficient of +1 indicates a perfect prediction, −1 represents total disagreement between prediction and true values, and zero means that the prediction is no better than random [16,3]. This metric is sensitive to imbalanced data.
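A sketch of MCC from the four confusion-matrix counts, using the standard formula MCC = (TP·TN − FP·FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)); the counts below are hypothetical:

```python
import math

def mcc(tp, tn, fp, fn):
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0   # convention: 0 when a marginal is empty

print(mcc(50, 50, 0, 0))   # 1.0, a perfect prediction
print(mcc(90, 80, 20, 10)) # positive correlation, well below 1
```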
Discriminant power (DP): this measure depends on the sensitivity and specificity, and it is defined as DP = (√3/π)(log X + log Y), where X = TPR/(1 − TPR) and Y = TNR/(1 − TNR) [20]. This metric evaluates how well the classification model distinguishes between positive and negative samples. Since this metric depends only on the sensitivity and specificity metrics, it can be used with imbalanced data.
F-measure: this is also called the F1-score, and it represents the harmonic mean of precision and recall as in Eq. (12) [20]. The value of the F-measure ranges from zero to one, and high values indicate high classification performance. This measure has another variant, called the Fβ-measure, which represents the weighted harmonic mean of precision and recall as in Eq. (13). This metric is sensitive to changes in data distributions. Assume that the negative class samples are increased by α times; thus, the F-measure becomes F = 2TP/(2TP + αFP + FN), and hence this metric is affected by the changes in the class distribution.
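The weighted harmonic mean Fβ = (1 + β²)·P·R/(β²·P + R) can be sketched as below; the precision and recall values are hypothetical:

```python
def f_beta(precision, recall, beta=1.0):
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.75, 0.6
print(f_beta(p, r))           # F1: plain harmonic mean of precision and recall
print(f_beta(p, r, beta=2))   # F2: weights recall more heavily than precision
```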
The Adjusted F-measure (AGF) was introduced in [13]. The F-measure uses only three of the four elements of the confusion matrix, and hence two classifiers with different TNR values may have the same F-score. Therefore, the AGF metric was introduced to use all elements of the confusion matrix and to give more weight to correctly classified samples of the minority class. This metric is defined as AGF = √(F2 × InvF0.5), where F2 is the Fβ-measure with β = 2 and InvF0.5 is the Fβ-measure with β = 0.5 calculated on a new confusion matrix in which the class label of each sample is switched (i.e., positive samples become negative and vice versa).
Markedness (MK): this is defined based on the PPV and NPV metrics as MK = PPV + NPV − 1 [16]. This metric is sensitive to data changes and hence it is not suitable for imbalanced data, because it depends on the PPV and NPV metrics and both are sensitive to changes in data distributions.
Balanced classification rate or balanced accuracy (BCR): this metric combines the sensitivity and specificity metrics, and it is calculated as BCR = (TPR + TNR)/2. Also, the Balanced error rate (BER) or Half total error rate (HTER) represents 1 − BCR. Both the BCR and BER metrics can be used with imbalanced datasets.
Geometric Mean (GM): the main goal of all classifiers is to improve the sensitivity without sacrificing the specificity. However, sensitivity and specificity are often conflicting aims, especially when the dataset is imbalanced. Hence, the Geometric Mean (GM) metric aggregates both sensitivity and specificity, GM = √(TPR × TNR), according to Eq. (15) [3]. The Adjusted Geometric Mean (AGM) was proposed to obtain as much information as possible about each class [11]. The AGM metric is defined according to Eq. (16).
The GM metric can be used with imbalanced datasets. Lopez et al. reported that the AGM metric is suitable for imbalanced data [12]. However, changing the distribution of the negative class has a small influence on the AGM metric, and hence it is not fully robust to imbalanced data. This can be shown by assuming that the negative class samples are increased by α times: this changes the proportion of negative samples on which the AGM depends, while leaving GM and TNR unchanged; as a consequence, the AGM metric is slightly affected by the changes in the class distribution.
Optimization precision (OP): this metric is defined as OP = Acc − |TPR − TNR|/(TPR + TNR), where the second term, |TPR − TNR|/(TPR + TNR), computes how balanced the two class accuracies are, and the metric represents the difference between the global accuracy and that term [9]. A high OP value indicates high accuracy together with well-balanced class accuracies. Since the OP metric depends on the accuracy metric, it is not suitable for imbalanced data.

Jaccard: this metric is also called the Tanimoto similarity coefficient. The Jaccard metric explicitly ignores the correct classification of negative samples: Jaccard = TP/(TP + FP + FN). The Jaccard metric is sensitive to changes in data distributions. Figure 4 shows the relations between different classification assessment methods; as shown, all of them can be calculated from the confusion matrix. In the figure there are two classes, a red class and a blue class, and the classifier is represented by a black circle: the samples inside the circle are classified as red class samples, and the samples outside the circle are classified as blue class samples. Additionally, it is clear from the figure that many assessment methods depend on the TPR and TNR metrics.

Receiver operating characteristics (ROC)
The receiver operating characteristics (ROC) curve is a two-dimensional graph in which the TPR represents the y-axis and FPR the x-axis. The ROC curve has been used to evaluate many systems such as diagnostic systems, medical decision-making systems, and machine learning systems [26]. It is used to make a balance between the benefits, i.e., true positives, and the costs, i.e., false positives. Any classifier that has discrete outputs, such as decision trees, is designed to produce only a class decision, i.e., a decision for each testing sample, and hence it generates only one confusion matrix, which in turn corresponds to one point in the ROC space. However, there are many methods that were introduced for generating a full ROC curve from a classifier instead of only a single point, such as using class proportions [26] or using some combinations of scoring and voting [8]. On the other hand, in continuous output classifiers such as the Naive Bayes classifier, the output is represented by a numeric value, i.e., a score, which represents the degree to which a sample belongs to a specific class. The ROC curve is generated by changing the threshold on the confidence score; hence, each threshold generates only one point in the ROC curve [8]. Figure 5 shows the perfect classification performance. It is the green curve which rises vertically from (0,0) to (0,1) and then horizontally to (1,1). This curve reflects that the classifier perfectly ranked the positive samples relative to the negative samples.
A point in the ROC space is better than all other points that are to its southeast, i.e., the points that have lower TPR, higher FPR, or both (see Figure 5). Therefore, any classifier that appears in the lower right triangle performs worse than a classifier that appears in the upper left triangle. Figure 6 shows an example of the ROC curve. In this example, a test set consists of 20 samples from two classes; each class has ten samples, i.e., ten positive and ten negative samples. As shown in the table in Figure 6, the initial step to plot the ROC curve is to sort the samples according to their scores. Next, the threshold value is changed from maximum to minimum to plot the ROC curve. To scan all samples, the threshold ranges from ∞ to −∞. A sample is classified into the positive class if its score is higher than or equal to the threshold; otherwise, it is classified as negative [8]. Figures 7 and 8 show how changing the threshold value changes the TPR and FPR. As shown in Figure 6, the threshold value is first set at maximum (t1 = ∞); hence, all samples are classified as negative samples, the values of FPR and TPR are zero, and the position of t1 is in the lower left corner (the point (0,0)). The threshold value is then decreased to 0.82, and the first sample is classified correctly as a positive sample (see Figures 6-8(a)). The TPR increases to 0.1, while the FPR remains zero. As the threshold is further reduced to 0.8, the TPR increases to 0.2 and the FPR remains zero. As shown in Figure 7, increasing the TPR moves the ROC curve up, while increasing the FPR moves the ROC curve to the right, as at t4. The ROC curve must pass through the point (0,0), where the threshold value is ∞ (in which case all samples are classified as negative samples), and the point (1,1), where the threshold is −∞ (in which case all samples are classified as positive samples).
Figure 8 shows graphically the performance of the classification model with different threshold values. From this figure, the following remarks can be drawn.
t1: The value of this threshold was ∞, as shown in Figure 8(a); hence, all samples are classified as negative samples. This means that (1) all positive samples are incorrectly classified, so the value of TP is zero, and (2) all negative samples are correctly classified, so there is no FP (see also Figure 6). As the threshold decreases, on the contrary, more negative samples are misclassified, and this increases FP and reduces TN.
t20: As shown in Figure 8(f), decreasing the threshold value hides the FN area. This is because all positive samples are correctly classified. Also, from the figure, it is clear that the FP area is much larger than the area of TN. This is because 90% of the negative samples are incorrectly classified, and only 10% of negative samples are correctly classified.
From Figure 7 it is clear that the ROC curve is a step function. This is because we only used 20 samples (a finite set of samples) in our example; a smoother curve is obtained as the number of samples increases. The figure also shows that the best accuracy (70%) (see Table 1) is obtained at (0.1,0.5) when the threshold value was ≥ 0.6, rather than at ≥ 0.5 as we might expect with balanced data. This means that the given learning model identifies positive samples better than negative samples. Since the ROC curve depends mainly on changing the threshold value, comparing classifiers with different score ranges would be meaningless. For example, assume we have two classifiers, the first generating scores in the range [0,1] and the other generating scores in the range [−1,+1]; we cannot compare these classifiers using the ROC curve. The steps of generating the ROC curve are summarized in Algorithm 1. The algorithm requires O(n log n) for sorting the samples and O(n) for scanning them, resulting in O(n log n) total complexity, where n is the number of samples. As shown, the two main steps to generate ROC points are (1) sorting the samples according to their scores and (2) changing the threshold value from maximum to minimum to process one sample at a time and update the values of TP and FP each time. The algorithm shows that TP and FP start at zero. The algorithm scans all samples; the value of TP is increased for each positive sample, while the value of FP is increased for each negative sample. Next, the values of TPR and FPR are calculated and pushed into the ROC stack (see step 6). When the threshold becomes very low (threshold → −∞), all samples are classified as positive samples, and hence the values of both TPR and FPR are one.
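The threshold sweep described above can be sketched as follows. This is a simplified reading of the algorithm, not the paper's exact pseudocode; the score lists are illustrative:

```python
def roc_points(scores, labels):
    """Sort by score (descending) and sweep the threshold, emitting (FPR, TPR) points.

    `labels` holds 1 for positive samples and 0 for negative ones."""
    pairs = sorted(zip(scores, labels), key=lambda x: -x[0])
    P = sum(labels)
    N = len(labels) - P
    tp = fp = 0
    points = []
    prev_score = None
    for score, label in pairs:
        if score != prev_score:            # emit a point only when the score changes,
            points.append((fp / N, tp / P))  # which handles ties in scores
            prev_score = score
        if label == 1:
            tp += 1
        else:
            fp += 1
    points.append((fp / N, tp / P))        # the final point is always (1, 1)
    return points

# Perfectly ranked scores trace the ideal curve through (0, 1).
print(roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]))
```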
Steps 5-8 handle sequences of equally scored samples. Assume we have a test set which consists of P positive samples and N negative samples. In this test set, assume we have p positive samples and n negative samples with the same score value. There are two extreme cases. In the first case, the optimistic case, all positive samples end up at the beginning of the sequence, and this case represents the upper L segment of the rectangle in Figure 5. In the second case, the pessimistic case, all the negative samples end up at the beginning of the sequence, and this case represents the lower L segment of the rectangle in Figure 5. The ROC curve represents the expected performance, which is the average of the two cases, and it corresponds to the diagonal of the rectangle in Figure 5. The size of this rectangle is pn/(PN), and the number of errors in both the optimistic and pessimistic cases is pn/(2PN).
In multi-class classification problems, plotting the ROC curve becomes much more complex than in binary classification problems. One of the well-known methods to handle this problem is to produce one ROC curve for each class. For plotting the ROC curve of class i (c_i), the samples from c_i represent the positive samples and all the other samples are negative samples.
ROC curves are robust against changes to class distributions: if the ratio of positive to negative samples changes in a test set, the ROC curve will not change. In other words, ROC curves are insensitive to imbalanced data. This is because ROC depends on TPR and FPR, and each of them is a columnar ratio 3.
The following example compares the ROC curve using balanced and imbalanced data. Assume the data is balanced and consists of two classes, each with 1000 samples. The point (0.2,0.5) on the ROC curve means that the classifier obtained 50% sensitivity (500 positive samples are correctly classified out of 1000 positive samples) and 80% specificity (800 negative samples are correctly classified out of 1000 negative samples). Now let the class distribution change to be imbalanced, with 1000 samples in the first class and 10,000 in the second. The same point (0.2,0.5) then means that the classifier obtained 50% sensitivity (500 positive samples are correctly classified out of 1000 positive samples) and 80% specificity (8000 negative samples are correctly classified out of 10,000 negative samples). The AUC 4 score for both cases is the same, while the other metrics which are sensitive to imbalanced data will change. For example, the accuracy rates of the classifier using the balanced and imbalanced data are 65% and 77.3%, respectively, and the precision values are 0.71 and 0.20, respectively. These results reflect how the precision and accuracy metrics are sensitive to imbalanced data, as mentioned in Section 2.1.
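The numbers in this example can be reproduced directly from the stated counts:

```python
# Balanced case: 1000 positives and 1000 negatives; TPR = 0.5, FPR = 0.2.
tp, fn, tn, fp = 500, 500, 800, 200
acc_bal = (tp + tn) / 2000
prec_bal = tp / (tp + fp)

# Imbalanced case: negatives grow to 10,000 with the same TPR and FPR.
tn, fp = 8000, 2000
acc_imb = (tp + tn) / 11000
prec_imb = tp / (tp + fp)

print(round(acc_bal, 3), round(prec_bal, 2))   # 0.65 0.71
print(round(acc_imb, 3), round(prec_imb, 2))   # 0.773 0.2
```

The same ROC point thus maps to very different accuracy and precision values once the class distribution changes.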
It is worth mentioning that a comparison between different classifiers using ROC is valid only when (1) there is a single dataset, or (2) there are multiple datasets with the same data size and the same positive:negative ratio.

Area under the ROC curve (AUC)
Comparing different classifiers using the ROC curve alone is not easy, because there is no scalar value that represents the expected performance. Therefore, the Area under the ROC curve (AUC) metric is used to summarize the ROC curve in a single value. The AUC score is always bounded between zero and one, and no realistic classifier has an AUC lower than 0.5 [4,15]. Figure 9 shows the AUC values of two classifiers, A and B. As shown, the AUC of classifier B is greater than that of A; hence, B achieves better performance. Moreover, the gray shaded area is common to both classifiers, while the red shaded area represents the region where classifier B outperforms classifier A. It is possible for a classifier with a lower AUC to outperform a classifier with a higher AUC in a specific region. For example, in Figure 9, classifier B outperforms A except at FPR > 0.6, where A has a slight advantage (blue shaded area). Note also that two classifiers with two different ROC curves may have the same AUC score.
The AUC value is calculated as in Algorithm 2. As shown, the steps in Algorithm 2 represent a slight modification of Algorithm 1: instead of generating ROC points as in Algorithm 1, Algorithm 2 adds the areas of successive trapezoids⁵ under the ROC curve [4]. Figure 9 shows an example of one trapezoid; the base of this trapezoid is (FPR_2 − FPR_1) and the average of its two parallel sides is (TPR_1 + TPR_2)/2; hence, the total area of this trapezoid is (FPR_2 − FPR_1) × (TPR_1 + TPR_2)/2. The AUC can also be calculated under the PR curve using the trapezoidal rule as in the ROC curve, and the AUC score of the perfect classifier in PR curves is one, as in ROC curves.
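A minimal sketch of this trapezoidal AUC computation, assuming the ROC points are already sorted by increasing FPR; the points below are hypothetical.

```python
def auc_trapezoid(fpr, tpr):
    """Trapezoidal AUC: for each pair of consecutive ROC points, add
    base * average height, i.e. (FPR2 - FPR1) * (TPR1 + TPR2) / 2."""
    area = 0.0
    for i in range(1, len(fpr)):
        base = fpr[i] - fpr[i - 1]
        avg_height = (tpr[i] + tpr[i - 1]) / 2
        area += base * avg_height
    return area

# Hypothetical ROC points running from (0,0) to (1,1).
fpr = [0.0, 0.2, 0.5, 1.0]
tpr = [0.0, 0.6, 0.8, 1.0]
print(round(auc_trapezoid(fpr, tpr), 2))  # 0.72
```

A perfect classifier's curve through (0,0), (0,1), (1,1) gives an area of exactly one under the same rule.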
In multi-class classification problems, Provost and Domingos calculated the total AUC of all classes by generating a ROC curve for each class and calculating the AUC value for each ROC curve [10]. The total AUC (AUC_total) is the sum of all AUC scores weighted by the prior probability of each class: AUC_total = Σ_{c_i ∈ C} AUC(c_i) · p(c_i), where AUC(c_i) is the AUC under the ROC curve of the class c_i, C is the set of classes, and p(c_i) is the prior probability of c_i [10]. This method of calculating the AUC score is simple and fast, but it is sensitive to class distributions and error costs.
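The weighted total can be sketched as follows; the per-class AUC scores and the class priors below are hypothetical.

```python
def total_auc(auc_per_class, prior_per_class):
    """AUC_total = sum over classes of AUC(c_i) * p(c_i)."""
    return sum(a * p for a, p in zip(auc_per_class, prior_per_class))

# Hypothetical per-class AUC scores and class prior probabilities.
aucs = [0.95, 0.80, 0.70]
priors = [0.5, 0.3, 0.2]   # priors must sum to one
print(round(total_auc(aucs, priors), 3))  # 0.855
```

Because the weights are the class priors, a shift in class distribution changes AUC_total even if every per-class AUC stays the same, which is the sensitivity noted above.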

Precision-Recall (PR) curve
Precision and recall metrics are widely used for evaluating classification performance. The Precision-Recall (PR) curve follows the same concept as the ROC curve, and it can be generated by changing the threshold as in ROC. However, while the ROC curve shows the relation between sensitivity/recall (TPR) and 1-specificity (FPR), the PR curve shows the relationship between recall and precision. Thus, in the PR curve, the x-axis is the recall and the y-axis is the precision, i.e., the x-axis of the ROC curve is the y-axis of the PR curve [8]. Hence, the PR curve does not require the TN value.
In the PR curve, the precision value of the first point is undefined because the number of positive predictions is zero, i.e., TP = 0 and FP = 0. This problem can be solved by estimating the first point in the PR curve from the second point; there are two cases for this estimation, depending on the value of TP of the second point. As shown in Figure 10, the PR curve is often a zigzag curve; hence, PR curves tend to cross each other much more frequently than ROC curves. In the PR curve, a curve that lies above another indicates better classification performance. The perfect classification performance in the PR curve is represented in Figure 10 by a green curve. As shown, this curve runs from (0,1) horizontally to (1,1) and then vertically to (1,0), where (0,1) represents a classifier that achieves 100% precision and 0% recall, (1,1) represents a classifier that obtains 100% precision and sensitivity (the ideal point in the PR curve), and (1,0) represents a classifier that obtains 100% sensitivity and 0% precision. Hence, we can say that the closer the PR curve is to the upper right corner, the better the classification performance. Since the PR curve depends only on the precision and recall measures, it ignores the performance of correctly handling negative examples (TN) [16]. Eq. (18) gives the nonlinear interpolation of the PR curve that was introduced by Davis and Goadrich [5].
where TP_A and TP_B represent the true positives of the first and second points, respectively; FP_A and FP_B represent the false positives of the first and second points, respectively; y is the precision of the new point; and x determines the position of the new point and can take any value between zero and |TP_B − TP_A|. A smooth curve can be obtained by calculating many intermediate points between two points A and B. In our example in Figure 10, assume the first point is the fifth point and the second point is the sixth point (see Table 1). From Table 1, the point A is (0.3, 0.75) and the point B is (0.4, 0.8). The value of |TP_B − TP_A| = |4 − 3| = 1; hence, the value of x can be any value between zero and one. Let x = 0.5, which is the middle point between A and B. The precision of the end point of the PR curve is P/(P+N). This is because (1) the recall increases as the threshold is lowered, and at the end point the recall reaches its maximum, and (2) lowering the threshold increases both TP and FP. Therefore, if the data are balanced, the precision of the end point is P/(P+N) = 1/2. The horizontal line that passes through P/(P+N) represents a classifier with a random performance level. This line separates the area of the PR curve into (1) the area above the line, which is the area of good performance, and (2) the area below the line, which is the area of poor performance (see Figure 10). Thus, the ratio of positives to negatives defines the baseline; changing this ratio moves that line and hence changes the classification performance.
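The Davis and Goadrich interpolation can be sketched as below. The counts used here (TP_A = 3, FP_A = 1, TP_B = 4, FP_B = 1, with P = 10 positives) are inferred so that they reproduce the points A = (0.3, 0.75) and B = (0.4, 0.8) quoted in the text; the actual Table 1 values may differ.

```python
def pr_interpolate(tp_a, fp_a, tp_b, fp_b, p, x):
    """Nonlinear PR interpolation between points A and B (Davis and
    Goadrich); x ranges over [0, TP_B - TP_A]."""
    slope = (fp_b - fp_a) / (tp_b - tp_a)   # extra FPs per extra TP
    recall = (tp_a + x) / p
    precision = (tp_a + x) / (tp_a + x + fp_a + slope * x)
    return recall, precision

# x = 0.5, the midpoint between A and B.
r, prec = pr_interpolate(3, 1, 4, 1, 10, x=0.5)
print(round(r, 2), round(prec, 2))  # 0.35 0.78
```

At x = 0 the formula returns point A itself, and at x = TP_B − TP_A it returns point B, so sweeping x yields a smooth segment between the two measured points.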
As indicated in Eq. (6), lowering the threshold value increases TP or FP. Increasing TP increases the precision while increasing FP decreases it; hence, lowering the threshold value makes the precision fluctuate. On the other hand, as indicated in Eq. (2), lowering the threshold may leave the recall value unchanged or increase it. Because of the precision axis, the PR curve is sensitive to imbalanced data. In other words, PR curves and their AUC values differ between balanced and imbalanced data.

Biometrics measures
Biometric matching differs slightly from other classification problems, and hence it is sometimes called a two-instance problem. In this problem, instead of classifying one sample into one of c groups or classes, a biometric model determines whether two samples belong to the same group. This can be achieved by identifying an unknown sample through matching it against all known samples, which generates a score or similarity distance between the unknown sample and each of the others. The model assigns the unknown sample to the person with the most similar score; if this level of similarity is not reached, the sample is rejected. In other words, if the similarity score exceeds a pre-defined threshold, the corresponding sample is said to be matched; otherwise, the sample is not matched. Theoretically, the scores of clients (persons known by the biometric system) should always be higher than the scores of imposters (persons who are not known by the system). In biometric systems, a single threshold separates the two groups of scores and can thus be utilized for differentiating between clients and imposters. In real applications, however, imposter samples sometimes generate scores higher than those of some client samples. Accordingly, no matter how well the classification threshold is chosen, some classification errors occur. For example, given a high threshold, the imposters' scores will not exceed this limit; as a result, no imposters are incorrectly accepted by the model, but some clients are falsely rejected (see Figure 11 (top panel)). Conversely, lowering the threshold value accepts all clients but also falsely accepts some imposters.
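The threshold decision described above can be sketched as follows; the similarity scores and the threshold are hypothetical illustration values.

```python
def decide(similarity_score, threshold):
    """Accept the sample as a match only if its similarity score
    exceeds the pre-defined threshold; otherwise reject it."""
    return "matched" if similarity_score > threshold else "not matched"

# Hypothetical scores: a genuine client usually scores higher than an
# imposter, but in practice the two distributions can overlap.
print(decide(0.92, threshold=0.7))  # matched
print(decide(0.41, threshold=0.7))  # not matched
```

Raising the threshold rejects more imposters but also more clients; lowering it does the opposite, which is the trade-off the FAR and FRR measures quantify.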
Two of the most commonly used measures in biometrics are the False acceptance rate (FAR) and the False rejection/recognition rate (FRR). The FAR, also called the false match rate (FMR), is the ratio of the number of false acceptances to the total number of imposter attempts. It thus measures the likelihood that the biometric model will incorrectly accept an access attempt by an imposter or an unauthorized user. Hence, to prevent imposter samples from being accepted by the model, the similarity score has to exceed a certain level (see Figure 11) [2]. The FRR or false non-match rate (FNMR)
measures the likelihood that the biometric model will incorrectly reject a client, and it represents the ratio of the number of false rejections to the total number of client attempts [2]. For example, FAR = 10% means that out of one hundred access attempts by imposters, ten will succeed; hence, increasing FAR decreases the accuracy of the model. On the other hand, FRR = 10% means that ten authorized persons will be rejected out of 100 attempts; hence, reducing FRR helps avoid a high number of trials by authorized clients. As a consequence, FAR and FRR in biometrics are analogous to the false positive rate (FPR) and false negative rate (FNR), respectively (see Section 2.4). The Equal error rate (EER) measure partially solves the problem of selecting a threshold value, and it
represents the failure rate when the values of FMR and FNMR are equal. Figure 11 shows the FAR and FRR curves together with the EER measure. The Detection Error Trade-off (DET) curve is used for evaluating biometric models. In this curve, as in the ROC and PR curves, the threshold value is varied and the values of FAR and FRR are calculated at each threshold; hence, this curve shows the relation between FAR and FRR. Figure 12 shows an example of the DET curve. As in the ROC curve, the DET curve is plotted by changing the threshold on the confidence score; thus, each threshold generates exactly one point in the DET curve. The ideal point in this curve is the origin, where the values of both FRR and FAR are zero, and hence the perfect classification performance in the DET curve is represented in Figure 12 by a green curve. As shown, this curve runs from the point (0,1) vertically to (0,0) and then horizontally to (1,0), where (1) the point (0,1) represents a classifier that achieves 100% FAR and 0% FRR, (2) the point (0,0) represents a classifier that obtains 0% FAR and FRR, and (3) the point (1,0) represents a classifier with 0% FAR and 100% FRR. Thus, we can say that the closer a DET curve is to the lower left corner, the better the classification performance.
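A minimal sketch of computing FAR and FRR over a sweep of thresholds, as when plotting a DET curve, and estimating the EER as the threshold at which the two rates are closest; the score lists below are hypothetical.

```python
def far_frr(imposter_scores, client_scores, threshold):
    """FAR: fraction of imposter attempts wrongly accepted (score > t).
       FRR: fraction of client attempts wrongly rejected (score <= t)."""
    far = sum(s > threshold for s in imposter_scores) / len(imposter_scores)
    frr = sum(s <= threshold for s in client_scores) / len(client_scores)
    return far, frr

# Hypothetical similarity scores with a small overlap between groups.
imposters = [0.1, 0.2, 0.3, 0.4, 0.6]
clients = [0.5, 0.7, 0.8, 0.9, 0.95]

# Sweep thresholds (each threshold gives one DET point) and take the
# threshold where FAR and FRR are closest as an EER estimate.
thresholds = [i / 100 for i in range(101)]
eer_t = min(thresholds,
            key=lambda t: abs(far_frr(imposters, clients, t)[0]
                              - far_frr(imposters, clients, t)[1]))
far, frr = far_frr(imposters, clients, eer_t)
print(eer_t, far, frr)  # 0.5 0.2 0.2
```

Pairing each threshold's (FRR, FAR) values and plotting them yields the DET curve itself.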

Experimental results
In this section, an experiment was conducted to evaluate the classification performance using different assessment methods. In this experiment, we used the Iris dataset, one of the standard classification datasets, obtained from the University of California at Irvine (UCI) Machine Learning Repository [1]. This dataset has three classes, each class has 50 samples, and each sample is represented by four features. We used (1) Principal component analysis (PCA) [23] for reducing the four features to two and (2) a Support vector machine (SVM)⁶ for classification.
In our experiment, we used different assessment methods for evaluating the learning model. Figure 13 shows the ROC and Precision-Recall curves. As shown, there are three curves, one for each class, and the first class obtained better results than the other two classes. Figure 14 shows the confusion matrix for each class. From these confusion matrices we can calculate different metrics as mentioned before (see Figure 3). For example, for the first class, the Acc, TPR, TNR, PPV, and NPV values were 99.33%, 100%, 98.0%, 99.01%, and 100%, respectively. The results of the other two classes can be calculated similarly.
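A sketch of deriving these scalar metrics from a one-vs-rest confusion matrix; the counts below are hypothetical illustration values, not the paper's Iris matrices.

```python
def metrics_from_confusion(tp, fn, fp, tn):
    """Scalar metrics derived from a binary (one-vs-rest) confusion matrix."""
    return {
        "Acc": (tp + tn) / (tp + fn + fp + tn),
        "TPR": tp / (tp + fn),   # sensitivity / recall
        "TNR": tn / (tn + fp),   # specificity
        "PPV": tp / (tp + fp),   # precision
        "NPV": tn / (tn + fn),   # negative predictive value
    }

# Hypothetical counts: 50 positives all found, 2 of 100 negatives
# misclassified as positive.
m = metrics_from_confusion(tp=50, fn=0, fp=2, tn=98)
print({k: round(v * 100, 2) for k, v in m.items()})
# {'Acc': 98.67, 'TPR': 100.0, 'TNR': 98.0, 'PPV': 96.15, 'NPV': 100.0}
```

Applying the same function to each class's confusion matrix in Figure 14 reproduces the per-class results quoted above.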

Conclusions
In this paper, the definition, mathematics, and visualizations of the most well-known classification assessment methods were presented and explained. The paper aimed to give a detailed overview of the classification assessment measures. Based on the confusion matrix, different measures were introduced with detailed explanations. The relations between these measures and the robustness of each of them against imbalanced data were also discussed. Additionally, an illustrative numerical example was used to explain how to calculate different classification measures in binary and multi-class problems and to show the robustness of different measures against imbalanced data. Graphical measures such as ROC, PR, and DET curves were also presented with illustrative examples and visualizations. Finally, various classification measures for evaluating biometric models were presented.

Notes

1. More details about these two metrics are in Sections 2.2 and 2.5.
2. More details about these two metrics are in Section 2.8.
3. As mentioned before, TPR = TP/(TP + FN) = TP/P, and both TP and FN are in the same column; similarly for FNR.
4. The AUC metric will be explained in Section 4.
5. A trapezoid is a 4-sided shape with two parallel sides.
6. More details about SVM can be found in [24].