Neural networks for Anatomical Therapeutic Chemical (ATC) classification

Motivation: Automatic Anatomical Therapeutic Chemical (ATC) classification is a critical and highly competitive area of research in bioinformatics because of its potential for expediting drug development and research. Predicting an unknown compound's therapeutic and chemical characteristics according to how these characteristics affect multiple organs/systems makes automatic ATC classification a challenging multi-label problem. Results: In this work, we propose combining multiple multi-label classifiers trained on distinct sets of features, including sets extracted from a Bidirectional Long Short-Term Memory Network (BiLSTM). Experiments demonstrate the power of this approach, which is shown to outperform the best methods reported in the literature, including the state of the art developed by the fast.ai research group. Availability: All source code developed for this study is available at https://github.com/LorisNanni. Contact: loris.nanni@unipd.it


Introduction
From start to market, engineering a new drug can take decades before final approval and is now estimated to cost 2.8 billion USD [1]. Of all drugs currently under development, approximately 86% will fail to outperform placebo [2] or will prove to cause more harm than good [3]. The need to weed out new drugs with a low probability of being efficacious and safe has led researchers to search for automatic methods for classifying compounds according to the organs they are likely to affect, based on the compounds' Anatomical Therapeutic Chemical (ATC) classes. An automatic classification system with good ATC prediction would not only accelerate research but also significantly reduce drug development costs. The ATC coding system [4] classifies compounds into one or more classes at five levels in terms of the drug's effects on organs or physiological systems. Most relevant to the automatic ATC classification problem is the first ATC level, which determines the general anatomical groups that a particular compound targets, coded with fourteen semi-mnemonic letters. These alphabetic codes range from A (alimentary tract and metabolism) to V, a category that includes various groups. Levels 2 and 3 are pharmacological subgroups, and levels 4 and 5 contain chemical subgroups. A compound is assigned as many ATC codes as are relevant at each of these five levels.
Despite the serviceability of the ATC classification system for assessing the clinical value of a compound, most pharmaceuticals have yet to be assigned ATC codes. Accurate coding involves expensive, labor-intensive experimental procedures. Hence, the pressing need for machine learning (ML) to be applied to this problem.
Early ML systems tended to simplify the complexity of the ATC classification problem by reducing the level 1 multi-class problem to a single-class problem. Dunkel et al. [5], for example, took advantage of a compound's unique structure to identify its class, while Wu et al. [6] based their approach on extracting relationships among level 1 subgroups. Chen et al. [7] tackled the multi-label complexity of ATC classification by examining a drug's chemical-chemical interactions. The authors also established the de facto benchmark data set for ATC classification. Cheng et al., in [8] and [9], designed ML systems to handle class overlapping by fusing different descriptors: structural similarity, fingerprint similarity, and chemical-chemical interaction. Nanni and Brahnam [10] transformed these same 1D vectors into images (matrices) and extracted texture descriptors from them; ensembles of multi-label classifiers were then trained on these descriptors.
Convolutional Neural Networks (CNNs) were trained on 2D descriptors in [11] and in [12], but in Lumini and Nanni [12], a set of features was extracted from deep learners for training two multi-label classifiers. This approach was further expanded in Nanni, Brahnam, and Lumini [13].
Ensembles of CNNs were constructed by adjusting batch sizes and learning rates, and different methods were applied to handle multi-label inputs.
In this work, an ensemble of different feature descriptors and classifiers is proposed that strongly outperforms the state-of-the-art classification results on the ATC benchmark data set developed by Chen et al. [7]. The system proposed here was experimentally developed by comparing and evaluating multi-label classifiers trained on different feature/descriptor sets. Our best results were obtained by combining a Bidirectional Long Short-Term Memory Network (BiLSTM) [14] with a multi-label classifier.

Methods
The approach taken in this study is to experimentally produce ensembles that combine multi-label classifiers (hML) based on Multiple Linear Regression with LSTM classification and with hML classifiers trained on features taken from the LSTM. All these classifiers are trained on a set of three different descriptors (DDI, FRAKEL, NRAKEL, detailed in section 3.1). The features extracted from the LSTM are also fed into hML classifiers. As illustrated in Figure 1, the results are then combined and evaluated. The LSTM feature-extraction process and the multi-label classifiers are discussed in sections 2.1-2.2. In this work, we also examine the FastAI Tabular model [15], detailed in section 2.3, a method which has obtained the best classification result on the Chen benchmark [7].

LSTM multi-label classifier and feature extractor
LSTM is a Recurrent Neural Network that makes a decision about what to remember at every time step. As illustrated in Figure 2, this network contains three gates: 1) input gate I, 2) output gate O, and 3) forget gate F, each of which consists of one layer with the sigmoid (σ) activation function.
LSTM also contains a specialized single-layer candidate network C̄, which has a tanh activation function. In addition, there are four state vectors: 1) memory state C_t with 2) its previous memory state C_{t-1}, and 3) hidden state h_t with 4) its previous state h_{t-1}. The variable x_t in Figure 2 represents the current input at time step t. The process for updating the LSTM at time t is as follows. Given x_t and h_{t-1}, and letting W, U, b be the learnable weights of the network (each independent of t), the candidate layer is

C̄_t = tanh(W_C x_t + U_C h_{t-1} + b_C).

The next memory cell is

C_t = F_t * C_{t-1} + I_t * C̄_t,

where * is element-wise multiplication.
The gates are defined as

I_t = σ(W_I x_t + U_I h_{t-1} + b_I),
F_t = σ(W_F x_t + U_F h_{t-1} + b_F),
O_t = σ(W_O x_t + U_O h_{t-1} + b_O).

The output is h_t = O_t * tanh(C_t), the element-wise product of the output gate and the tanh of the memory state.
Regarding input, all sequences for this task have the same length, so sorting input by length is not required. The output of the LSTM can be the entire hidden-state sequence (which permits several layers to be stacked in a single network) or only the last term of this sequence.
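The update process above can be sketched in a few lines of NumPy. This is a minimal single-step illustration (not the MATLAB implementation used in the experiments); the weight dictionaries W, U, b keyed by gate name are a hypothetical parameterization chosen for readability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM update. W, U, b hold the parameters of the input (I),
    forget (F), output (O) gates and the candidate (C) layer."""
    i = sigmoid(W["I"] @ x_t + U["I"] @ h_prev + b["I"])      # input gate I_t
    f = sigmoid(W["F"] @ x_t + U["F"] @ h_prev + b["F"])      # forget gate F_t
    o = sigmoid(W["O"] @ x_t + U["O"] @ h_prev + b["O"])      # output gate O_t
    c_bar = np.tanh(W["C"] @ x_t + U["C"] @ h_prev + b["C"])  # candidate C̄_t
    c_t = f * c_prev + i * c_bar   # element-wise memory update
    h_t = o * np.tanh(c_t)         # hidden state / output
    return h_t, c_t
```

Stacking calls to `lstm_step` over the time axis yields the full hidden-state sequence; keeping only the final `h_t` gives the last-term output mentioned above.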
An LSTM that has two stacked layers trained on the same set of samples is called a Bidirectional LSTM (BiLSTM): the second LSTM connects to the end of the first sequence and processes it in reverse. BiLSTM is well suited to data without an inherent temporal ordering. Accordingly, this study uses a BiLSTM, as implemented in the MATLAB LSTM toolbox, with the number of hidden units set to 100, the number of classes to 14, and a remaining training parameter to 27.
LSTM is not ordinarily considered a multi-label classifier but can perform multi-label classification if the training strategy outlined in [13] is implemented, which involves replicating each sample once for each of its labels. To assign a test pattern to more than one class, a rule is applied at the final softmax layer: a given pattern is assigned to every class whose score is larger than a given threshold.
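The two halves of this strategy, sample replication for training and score thresholding for prediction, can be sketched as follows. This is a schematic of the idea under the assumption of a fixed threshold; the threshold value 0.5 and the arg-max fallback for patterns with no score above threshold are illustrative choices, not taken from the source.

```python
import numpy as np

def replicate_for_multilabel(X, label_sets):
    """Turn a multi-label training set into a single-label one by
    repeating each sample once per label it carries."""
    X_rep, y_rep = [], []
    for x, labels in zip(X, label_sets):
        for lab in labels:
            X_rep.append(x)
            y_rep.append(lab)
    return np.array(X_rep), np.array(y_rep)

def predict_multilabel(softmax_scores, threshold=0.5):
    """Assign each pattern to every class whose softmax score exceeds
    the threshold (falling back to the arg-max class if none does)."""
    return [set(np.flatnonzero(row >= threshold)) or {int(row.argmax())}
            for row in softmax_scores]
```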
LSTM can function not only as a classifier but also as a feature extractor. As noted in Figure 1, in this study the LSTM functions in both capacities. Feature extraction with LSTM is accomplished by representing each pattern using the activations of the last layer, which produces a feature vector whose dimension equals the number of classes. Feature perturbation and extraction are performed several times by randomly reordering the original set of features used to train the LSTM.
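The feature-perturbation step can be sketched as follows: each randomly re-ordered copy of the feature columns would train a separate LSTM, whose last-layer activations then serve as a new descriptor. The function below only generates the perturbed views (the number of views and the seed are illustrative assumptions).

```python
import numpy as np

def perturbed_feature_sets(X, n_views=10, seed=0):
    """Return n_views copies of X, each with its feature columns
    randomly re-ordered. Each view trains one LSTM feature extractor."""
    rng = np.random.default_rng(seed)
    views = []
    for _ in range(n_views):
        perm = rng.permutation(X.shape[1])  # random column ordering
        views.append(X[:, perm])
    return views
```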

Classification by hML
The algorithm hML-KNN, proposed in [16], is a multi-label classifier that integrates a feature score and a neighbor score. The feature score decides whether a sample belongs to a particular class using the global information contained in the whole training set; in contrast, the neighbor score decides a sample's class labels based on the class assignments of its neighbors. The feature score s1(x, C_i) for a given pattern x with respect to an anatomical group C_i is calculated with a regression model to evaluate whether the pattern belongs to the group C_i. The neighbor score s2(x, C_i) measures the significance of the class membership of the K neighbors of x with respect to a given group C_i: the neighbor score increases as more neighbors of x carry the label C_i, reaching 1 if all neighbors of x belong to C_i and 0 if none do. The final score of x is a weighted sum of the two factors: s(x, C_i) = α·s1(x, C_i) + (1 − α)·s2(x, C_i). In our experiments, we use the default values, where the weight factor α is set to 0.5 and the number of neighbors is K = 15.
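The score combination can be sketched as below. This is a simplified illustration, not the full hML-KNN algorithm of [16]: the feature score s1 is taken as given (e.g. from a regression model), the neighbor score is computed as the fraction of the K nearest training samples carrying each label, and Euclidean distance is an assumed metric.

```python
import numpy as np

def hml_knn_score(s1, X_train, Y_train, x, k=15, alpha=0.5):
    """Combine a precomputed feature score s1 (one value per class)
    with a neighbor score s2. Y_train is an (n_samples, n_classes)
    0/1 label matrix."""
    d = np.linalg.norm(X_train - x, axis=1)  # distances to training set
    nn = np.argsort(d)[:k]                   # indices of k nearest neighbors
    s2 = Y_train[nn].mean(axis=0)            # per-class fraction of neighbors
    return alpha * s1 + (1.0 - alpha) * s2   # weighted final score
```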

Classification by FastAI Tabular Model
In addition to hML and LSTM, we explore the FastAI Tabular model [15], a powerful deep learning technique for tabular/structured data based on embedding layers for categorical variables. This deep learner uses embedding layers to represent each categorical variable by a numerical vector whose values are learned during training. Embeddings allow relationships between categories to be captured, and they can also serve as inputs to other models. The embedded variables are fed into the following layers, which, in our experiments, are two hidden layers and one output layer, as illustrated in Figure 3. We also use a binary encoding to represent binary variables, and the resulting variable is treated as categorical.
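The embedding idea can be sketched in plain NumPy, a minimal lookup-table analogue of what the FastAI library builds internally (the class and function names here are hypothetical, and the random initialization scale is an assumption): each category index selects a learnable row vector, and the embedded categorical columns are concatenated with the continuous ones before the fully connected layers.

```python
import numpy as np

class EmbeddingLayer:
    """Minimal sketch of an embedding table for one categorical
    variable: each category index maps to a dense vector whose values
    would be updated by gradient descent during training."""
    def __init__(self, n_categories, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.standard_normal((n_categories, dim)) * 0.01

    def __call__(self, idx):
        return self.table[idx]  # row lookup for each sample's category

def embed_and_concat(embeddings, cat_cols, cont):
    """Embed each categorical column and concatenate the results with
    the continuous features, forming the input to the hidden layers."""
    parts = [emb(col) for emb, col in zip(embeddings, cat_cols)]
    return np.concatenate(parts + [cont], axis=1)
```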

The data set and descriptors
The ensembles generated by the proposed approach are compared and evaluated on the data set in [7] (Supporting Information S1). This data set is a collection of 3883 ATC-coded pharmaceuticals taken from KEGG [17], a publicly available drug databank. The most classes any one compound belongs to is six. Counting each drug once for every label it carries yields 4912 virtual samples; this virtual set is called N(Vir). The average number of labels per sample is thus 4912/3883 ≈ 1.27.
The following descriptors represent the drugs in this data set:
• DDI represents each drug with three mathematical expressions: the maximum drug interaction score, the maximum structural similarity score, and the molecular fingerprint similarity score, with each expression based on its correlation with the 14 level 1 classes. Thus, the resulting descriptor is of size 14×3 = 42 (available in the supplementary material of Nanni and Brahnam [10]).
• FRAKEL represents each drug by its ECFP fingerprint [18], which is a 1024-dimensional binary vector (located at http://cie.shmtu.edu.cn/iatc/index). The descriptor is obtained by feeding the drug into RDKit (http://www.rdkit.org/), a free ML toolkit for chemistry informatics. From this 1024-dimensional binary vector, a 64-dimensional categorical descriptor is obtained by representing each group of 16 bits as an integer. This version of FRAKEL has been used with the FastAI Tabular model.
• NRAKEL represents a drug by a 700-dimensional descriptor obtained from the Mashup algorithm [19], which generates output from seven drug networks (five based on chemical-chemical interaction and two on drug similarities).
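The 1024-bit-to-64-integer conversion used for FRAKEL can be sketched as follows. The grouping into 16-bit blocks follows the description above; the most-significant-bit-first ordering within each block is an assumption, since the source does not specify the bit order.

```python
import numpy as np

def pack_fingerprint(bits):
    """Pack a 1024-bit ECFP fingerprint into 64 categorical integers,
    one per group of 16 consecutive bits (MSB-first is assumed)."""
    bits = np.asarray(bits, dtype=np.int64).reshape(64, 16)
    weights = 2 ** np.arange(15, -1, -1)  # 2^15 ... 2^0
    return bits @ weights                 # 64 integers in [0, 65535]
```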

Testing protocol
The jackknife testing protocol is used here to generate both the training and testing sets. At each iteration of this protocol, one sample is placed in the testing set and the remainder in the training set. Iteration continues until each pattern has taken a turn in the testing set. K-fold cross-validation is also applied. The jackknife protocol was selected as stipulated in [20].
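The jackknife (leave-one-out) protocol described above amounts to the following split generator:

```python
def jackknife_splits(n):
    """Leave-one-out (jackknife) protocol: at each iteration one sample
    forms the test set and the remaining n-1 samples the training set,
    until every sample has taken a turn in the test set."""
    for i in range(n):
        test = [i]
        train = [j for j in range(n) if j != i]
        yield train, test
```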

Performance indicators
ATC classification is evaluated using the standard performance indicators defined in [20] and repeated below:

Aiming = (1/N) Σᵢ |Lᵢ ∩ Lᵢ*| / |Lᵢ*|
Coverage = (1/N) Σᵢ |Lᵢ ∩ Lᵢ*| / |Lᵢ|
Accuracy = (1/N) Σᵢ |Lᵢ ∩ Lᵢ*| / |Lᵢ ∪ Lᵢ*|
Absolute True = (1/N) Σᵢ Δ(Lᵢ, Lᵢ*)
Absolute False = (1/N) Σᵢ (|Lᵢ ∪ Lᵢ*| − |Lᵢ ∩ Lᵢ*|) / M

where M is the number of classes, N is the number of samples, Lᵢ is the true label set, Lᵢ* is the predicted label set, and Δ(·,·) returns 1 if the two sets have the same elements, 0 otherwise.
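The five indicators of [20] can be sketched directly from their definitions (a minimal implementation over label sets, assuming predictions are non-empty and adopting the convention that an empty prediction contributes 0 to Aiming):

```python
def multilabel_metrics(true_sets, pred_sets, M):
    """Aiming, Coverage, Accuracy, Absolute True and Absolute False,
    averaged over the N samples; M is the number of classes."""
    N = len(true_sets)
    aim = cov = acc = atr = afa = 0.0
    for L, Lp in zip(true_sets, pred_sets):
        inter = len(L & Lp)   # |L ∩ L*|
        union = len(L | Lp)   # |L ∪ L*|
        aim += inter / len(Lp) if Lp else 0.0
        cov += inter / len(L)
        acc += inter / union if union else 1.0
        atr += 1.0 if L == Lp else 0.0        # Δ(L, L*)
        afa += (union - inter) / M
    names = ("aiming", "coverage", "accuracy",
             "absolute_true", "absolute_false")
    return {k: v / N for k, v in zip(names, (aim, cov, acc, atr, afa))}
```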

Experiments
The first experiment (see Table 1) compares the three multi-label classifiers described in section 2. Also compared are three other standard classifiers, each trained on the three sets of features (DDI, FRAKEL, and NRAKEL). As already mentioned, LSTM is not a native multi-label classifier; thresholding was used as described in section 2.1 to adapt this classifier to the ATC classification problem.
In the cell Tab-FRAKEL, the reported value was obtained by transforming the original 1024-bit feature vector into 64 int16 features, since the original descriptor yielded very low performance (0.3165). To avoid overfitting, default parameters were used for the classifiers. Examining the results in Table 1, Tab is the best standalone approach, producing an outstanding 0.7422 absolute true rate using the NRAKEL descriptors. Of note as well is LSTM, which produced good results on all three descriptors.
The second experiment, reported in Table 2, considers the following ensembles:
• LS, a stacking method based on the approach described in section 2, where the LSTM is used as a feature extractor and the resulting descriptors are given as input to an hML classifier;
• X+Y, fusion by the average rule between methods X and Y;
• eLS, an ensemble generated by random feature perturbation, obtained as the fusion of ten LS methods trained on random rearrangements of the input features.
The results reported in Table 2 show a strong performance improvement when moving from LS (a single hML classifier trained with LSTM features) to eLS (an ensemble of ten LS classifiers). The best performance is obtained by the ensemble eLS+LSTM+hML+Tab, the fusion of the methods with the greatest diversity compared to the others. This ensemble produces the highest performance in this classification problem, outperforming all the standalone approaches for each of the three descriptors.
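The average-rule fusion used in the X+Y ensembles above is a simple mean over the per-class score matrices of the fused classifiers; a minimal sketch:

```python
import numpy as np

def average_rule(score_matrices):
    """Fuse classifiers by averaging their (n_samples, n_classes)
    score matrices; thresholding is then applied to the fused scores."""
    return np.mean(np.stack(score_matrices, axis=0), axis=0)
```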
In the third experiment (see Table 3), fusion at the feature level is tested. The starting descriptor is the concatenation of two or three sets of features for the Tab approach, while for the other classifiers, the combination is the average rule applied to each of them (e.g., LSTM trained on DDI is combined by the average rule with LSTM trained on NRAKEL).
When a cell in Table 3 spans more than one column, the related classifier is trained using more feature sets; for each feature set, a different classifier is trained, with results fused using the average rule. The results reported in Table 3 show the usefulness of the ensemble: all the approaches that contain Tab outperform the method of the fast.ai research group, which had achieved the highest classification score to date.
Finally, in Table 4, we report a comparison of our proposed method with the literature. Clearly, our ensemble strongly outperforms the other approaches. Note the performance difference between the original papers on NRAKEL [24] and FRAKEL [18] and the classifiers tested in this work.
The main reason for this difference is that the classifiers were not optimized here, since we are using a single data set. Our concern in this regard was to avoid any risk of overfitting by running the approaches with default values.

Conclusion
Since ATC classification is a difficult multi-label problem, the goal of this study was to improve performance by generating ensembles trained on three different feature vectors.The original input vectors were fed into a BiLSTM, which functioned (with modification) not only as a multi-label classifier but also as a feature extractor, with features taken from the output layer.
Two other classifiers aside from LSTM were evaluated: one based on Multiple Linear Regression and another, a deep learning technique for tabular/structured data that creates embedding layers for categorical variables. To boost the performance of these classifiers, they were trained on the feature sets with results fused via the average rule. Comparisons of the best ensembles were made with the standalone classifiers and other notable systems. Results show that the top-performing ensemble constructed by the method proposed here obtained superior results for ATC classification across five performance indicators.
Future work will explore the performance of different LSTM and CNN topologies combined using many activation functions.The fusion of other deep learning topologies for extracting features will also be the focus of an investigation.

Figure 1 .
Figure 1. Schematic of the proposed ATC classification approach.

Figure 3 .
Figure 3. Schematic of the FastAI Tabular model.

Table 1 .
Absolute true rates achieved by the classifiers trained on the three descriptors.

Table 2 .
Absolute true rates achieved by the ensembles on the three descriptors.

Table 3 .
Combinations of descriptors (absolute true rates) achieved by the ensembles using combinations of features.

Table 4 .
Comparison of the best ensemble here with the best reported in the literature.