Are you a good borrower? Mining interpretable pattern structures in credit scoring

Purpose – The purpose of this study is to show that closure-based classification and regression models provide both high accuracy and interpretability. Design/methodology/approach – Pattern structures allow one to approach the knowledge extraction problem in case of partially ordered descriptions. They provide a way to apply techniques based on closed descriptions to non-binary data. To provide scalability of the approach, the author introduced a lazy (query-based) classification algorithm. Findings – The experiments support the hypothesis that closure-based classification and regression allow one to both achieve higher accuracy in scoring models as compared to results obtained with classical banking models and retain interpretability of model results, whereas black-box methods grant better accuracy at the cost of losing interpretability. Originality/value – This is original research showing the advantage of closure-based classification and regression models in the banking sphere.


Introduction
Banks and credit institutions face a classification problem each time they consider a loan application. A bank aims to have a tool to discriminate between solvent and potentially delinquent borrowers, i.e. a tool to predict whether the applicant is going to meet his or her obligations or not. Before the 1950s, this decision-making process was expert-driven and involved no explicit statistical modeling (Thomas et al., 2002). The decision whether to grant a loan or not was made on the basis of an interview and information about the spouse and close relatives. Since the 1960s, banks have adopted statistical scoring systems trained on data sets of applicants, consisting of their socio-demographic factors and loan application features (Thomas et al., 2002; Siddiqi, 2005).
Classification algorithms can either produce so-called "black box" models with limited interpretability of the model result or, on the contrary, provide interpretable results and a transparent model structure (Baesens et al., 2003). As a rule, black-box models have superior accuracy, while less sophisticated models may provide less accurate predictions. This has also been shown in previous work, where the credit scoring problem was approached with various statistical and machine learning techniques (Yap et al., 2011; Lee et al., 2006; Nanni and Lumini, 2009).
However, the key feature of banking risk management practice is that, regardless of the model accuracy, the model should not be a black box. Regulators require that banks are able to provide reject reasons for borrowers. In addition, when central banks examine bank models, they have to understand the economic intuition behind them to be convinced that the models will show expected and stable performance (bis.org, 2020; law.cornell, 2020).
That is why methods such as neural networks and support vector machine (SVM) classifiers have not earned much trust within the banking community. The dividing hyperplane in an artificial high-dimensional space (dependent on the chosen kernel) cannot be easily interpreted to state the reject reason for the client (Ghodselahi, 2011). Neural networks also do not provide the user with a set of reasons why a particular loan application has been approved or rejected. In other words, these algorithms do not provide a decision-maker with knowledge. The predicted class is produced, but no intuition is retrieved from the data. This paper introduces data analysis algorithms that are more accurate than the simple algorithms widely adopted within banks (such as logistic regression, decision trees and scorecards) and still remain interpretable in the sense that they provide a decision-maker with a set of rules applicable to assessing the borrower.

Lattices of closed descriptions in classification problem
Methods such as generating association rules, emerging patterns and decision trees provide the user with easily interpretable rules which can be applied to the loan application. Algorithms based on formal concept analysis (FCA) also belong to this group of methods, as they use clearly defined concepts to classify objects (Ganter and Wille, 1999; Kuznetsov, 1999; Meddouri et al., 2014). The intent of a concept can be seen as a set of rules supported by the extent of the concept. However, for non-binary data the computation of the concepts and their relations can be very time-consuming. In credit scoring we deal with numerical data, since categorical variables can be transformed into a set of dummy variables. Lazy (query-based) classification (Aha, 1997) seems to be appropriate in this case, as it provides a decision-maker with a set of explicit rules for the loan application and can be easily parallelized.

Main definitions
Let $G$ be a set (of objects) and let $D$ be a set of all possible object descriptions equipped with a "more general than" (subsumption) partial order $\sqsubseteq$, which is a very natural requirement. For many description types this partial order induces an intersection operation which is idempotent, commutative and associative (i.e. induces a semilattice on descriptions): for binary attributes this is just set-theoretic intersection, for multisets this is component-wise min or max, etc. For sequences and graphs this is an intersection on sets of sequences and graphs based on maximal common subsequences and subgraphs (Kuznetsov, 1999; Kaytoue et al., 2015). So, in what follows we assume that the set of descriptions $D$ is equipped with such an intersection $\sqcap$, such that, given $c, d \in D$, $c \sqsubseteq d \iff c \sqcap d = c$.
So, let $(D, \sqcap)$ be a meet-semilattice of all possible object descriptions, called patterns, and let $\delta : G \to D$ be a mapping taking each object to its description. Then $(G, \underline{D}, \delta)$, where $\underline{D} = (D, \sqcap)$, is called a pattern structure (Ganter and Kuznetsov, 2001). The operation $\sqcap$ is also called a similarity operation. A pattern structure $(G, \underline{D}, \delta)$ gives rise to the following derivation operators $(\cdot)^{\square}$:
$$A^{\square} = \bigsqcap_{g \in A} \delta(g) \ \text{ for } A \subseteq G, \qquad d^{\square} = \{ g \in G \mid d \sqsubseteq \delta(g) \} \ \text{ for } d \in D.$$
Here, $A^{\square}$ is the similarity (as intersection) of all objects from the set $A$, and $d^{\square}$ is the set of all objects with descriptions subsuming (i.e. more specific than or equal to) $d$. The pairs $(A, d)$ satisfying $A \subseteq G$, $d \in D$, $A^{\square} = d$ and $A = d^{\square}$ are called pattern concepts of $(G, \underline{D}, \delta)$ with pattern extent $A$ and pattern intent $d$. The operator $(\cdot)^{\square\square}$ is a closure operator on patterns, as it is idempotent, extensive and monotone (Ganter and Kuznetsov, 2001).
A pattern $h \in D$ is called an $\alpha$-weak positive hypothesis iff it is the similarity of some set of positive examples and the proportion of objects of the opposite class covered by $h$ does not exceed the threshold $\alpha$. In credit scoring we work with pattern structures where descriptions are tuples of intervals of many-valued attributes. Instead of binarizing (scaling) data, one can work directly with many-valued attributes by applying an interval pattern structure.
For two intervals $[a_1, b_1]$ and $[a_2, b_2]$, with $a_1, b_1, a_2, b_2 \in \mathbb{R}$, the meet operation (or similarity operator) is defined as follows (Kaytoue et al., 2011):
$$[a_1, b_1] \sqcap [a_2, b_2] = [\min(a_1, a_2), \max(b_1, b_2)]. \qquad (3)$$
This definition may seem counterintuitive at first glance, as the intersection of two intervals gives a larger interval. The explanation is that the intersection actually gives less information, as the attribute values are allowed to vary in a larger interval. To make these main definitions clear, let us provide an example and apply the meet operator (3) to a "toy" data set provided in Table 1.
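The interval similarity described above (the smallest interval covering both arguments, applied attribute by attribute) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `meet_interval`, `meet`, `describe` and `subsumes` and the numeric values are assumptions made for the example.

```python
from functools import reduce

Interval = tuple[float, float]


def meet_interval(a: Interval, b: Interval) -> Interval:
    """Similarity of two intervals: the smallest interval covering both."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def meet(d1: list[Interval], d2: list[Interval]) -> list[Interval]:
    """Component-wise meet of two interval patterns of equal length."""
    if len(d1) != len(d2):
        raise ValueError("patterns must have the same number of attributes")
    return [meet_interval(a, b) for a, b in zip(d1, d2)]


def describe(values: list[float]) -> list[Interval]:
    """Description of a single object: each value v becomes the interval [v, v]."""
    return [(v, v) for v in values]


def subsumes(d: list[Interval], obj: list[Interval]) -> bool:
    """True if d is more general than (or equal to) obj, i.e. obj lies inside d."""
    return all(lo <= a and b <= hi for (lo, hi), (a, b) in zip(d, obj))


# Illustrative objects with two numerical attributes (hypothetical values).
g1 = describe([30.0, 10.0])
g2 = describe([35.0, 12.0])
pattern = reduce(meet, [g1, g2])
print(pattern)  # [(30.0, 35.0), (10.0, 12.0)]
```

The resulting pattern is wider than either object description, which matches the remark that the intersection carries less information: any object whose values fall inside both intervals is covered by it.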

Lazy classification with pattern structures
To classify test objects efficiently one can employ the lazy learning approach (Veloso et al., 2006; Veloso and Wagner, 2011), where one does not need to generate all possible good classifiers in advance. Having a similarity operation on object descriptions, one can compute the similarity of the test object with the objects from the training set. Consider a pattern structure $(G, (D, \sqcap), \delta)$ and suppose that we have a training set given by disjoint sets $G_+, G_- \subseteq G$ of positive and negative examples w.r.t. a target attribute, and a set of unclassified test objects $G_t$. Then the value of the target attribute of a test object $g_n \in G_t$ appears in the closure of the intersection of the description of $g_n$ with the descriptions of every object $g \in G$.
If for some object $g$ the closure contains the target attribute, then $g_n$ is classified positively; otherwise the test object is classified negatively. More formally, this can be described as the following simple two-stage procedure:
(1) For every $g \in G$ compute $(\delta(g_n) \sqcap \delta(g))^{\square}$, i.e. select all objects from $G$ whose descriptions contain $\delta(g_n) \sqcap \delta(g)$. This takes $O(|G| \, (p(\sqcap) + |G| \, p(\sqsubseteq)))$ time.
(2) If for some $g \in G$ all objects from $(\delta(g_n) \sqcap \delta(g))^{\square}$ have the target attribute, classify $g_n$ positively, otherwise negatively. This takes $O(|G|^2)$ time for looking for the target attribute in object descriptions in at most $|G|$ families of object subsets, each subset consisting of at most $|G|$ objects.
Lazy classification is thus reduced to computing $(\delta(g_n) \sqcap \delta(g))^{\square}$ and testing the target attribute in all objects of this set. This computation is easily parallelizable: one partitions the data set $G$ as $G = G_1 \cup \ldots \cup G_k$, where $k$ is the number of processors, computes the set of objects $(\delta(g_n) \sqcap \delta(g))^{\square}$ in each $G_i$, and tests the target attribute for all objects in the union of these sets over $i$.
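The two-stage procedure can be sketched for interval patterns as follows. This is a sequential sketch under stated assumptions: descriptions are lists of intervals, the target attribute is kept as a separate Boolean label rather than inside the description, and the names `lazy_classify` and `covers` are hypothetical.

```python
Pattern = list[tuple[float, float]]


def meet(d1: Pattern, d2: Pattern) -> Pattern:
    """Component-wise interval meet (smallest box covering both patterns)."""
    return [(min(a, c), max(b, d)) for (a, b), (c, d) in zip(d1, d2)]


def covers(d: Pattern, obj: Pattern) -> bool:
    """True if the object's description lies inside the pattern d."""
    return all(lo <= a and b <= hi for (lo, hi), (a, b) in zip(d, obj))


def lazy_classify(test: Pattern, train: list[Pattern], labels: list[bool]) -> bool:
    """Classify positively if some similarity has an all-positive extent."""
    for g in train:
        # Stage 1: similarity of the test object with g, then its extent.
        d = meet(test, g)
        extent = [i for i, other in enumerate(train) if covers(d, other)]
        # Stage 2: every object of the extent must have the target attribute.
        if all(labels[i] for i in extent):
            return True
    return False


# Hypothetical one-attribute training set: two positive and one negative example.
train = [[(1.0, 1.0)], [(2.0, 2.0)], [(10.0, 10.0)]]
labels = [True, True, False]
print(lazy_classify([(1.5, 1.5)], train, labels))   # True
print(lazy_classify([(11.0, 11.0)], train, labels))  # False
```

The loop over `train` is the part that would be distributed over processors: each worker can compute extents within its own partition, and the label test is applied to the union.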

Query-based classification algorithm
In credit scoring, however, the original lazy classification setting may become inappropriate (Kuznetsov, 2016; Masyutin et al., 2015). The reason is that the data is typically numerical, and features can have arbitrary distributions and take a wide range of values. At the same time, categorical variables and dummies can occur. A relatively large number of attributes (over 100) produces a high-dimensional space of continuous variables. So, the meet operator (3) gives a very specific result, i.e. for almost every $g \in G$ only $g$ and $g_n$ have the description $\delta(g_n) \sqcap \delta(g)$. This happens because numerical variables, ratios especially, can have unique values for every object. As a result, for a test object $g_n$ the number of positive and negative classifiers is close to the number of examples in $G_+$ and $G_-$, respectively. Thus, it seems reasonable to seek concepts with larger extents and with not too specific intents. At the same time, we would like to preserve the advantages of lazy classification: we do not need to compute the full concept lattice and we can take advantage of parallelization.
The query-based classification algorithm is our modification. The idea behind the proposed algorithm is to check whether the test object is more similar to the positive or to the negative class.
The first step is a mining step, in which we extract $\alpha$-weak hypotheses from the data. The procedure is carried out for $G_+$ and $G_-$ separately. A random subsample of examples is extracted and their descriptions are intersected. The resulting description $d = \delta(g_1) \sqcap \ldots \sqcap \delta(g_s)$ is checked for being $\alpha$-weak. If the hypothesis is $\alpha$-weak, it is added to the set of hypotheses to be used for classification later.
The subsample size $s$ is a hyperparameter of the algorithm, and it is tuned via grid search. The number of times (i.e. the number of iterations) we randomly select a subsample is the second hyperparameter, which is also tuned through grid search. As we mentioned, the greater the subsample size, the more likely it is that $(\delta(g_1) \sqcap \ldots \sqcap \delta(g_s))^{\square}$ contains an example of the opposite class. The threshold $\alpha$ is the hyperparameter that controls this issue.
The second step is an updating step, in which the test object description $\delta(g_n)$ is intersected with each $\alpha$-weak hypothesis. If the resulting description is also $\alpha$-weak, it is considered to be an $\alpha$-weak classifier, i.e. a rule relevant for this particular test object $g_n$.
The third and final step is a voting step, in which all $\alpha$-weak classifiers vote to produce a prediction for the test object $g_n$.
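The three steps (mining, updating, voting) can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the paper: the $\alpha$-weak test is read as "the pattern covers at most a fraction $\alpha$ of the opposite class", voting is a plain majority count rather than a confidence-weighted scheme, and a tie is treated as abstention. All function names are hypothetical.

```python
import random
from functools import reduce

Pattern = list[tuple[float, float]]


def meet(d1: Pattern, d2: Pattern) -> Pattern:
    """Component-wise interval meet (smallest box covering both patterns)."""
    return [(min(a, c), max(b, d)) for (a, b), (c, d) in zip(d1, d2)]


def covers(d: Pattern, obj: Pattern) -> bool:
    """True if the object's description lies inside the pattern d."""
    return all(lo <= a and b <= hi for (lo, hi), (a, b) in zip(d, obj))


def is_alpha_weak(d: Pattern, opposite: list[Pattern], alpha: float) -> bool:
    """Assumed criterion: fraction of the opposite class covered by d is <= alpha."""
    n_opp = sum(covers(d, g) for g in opposite)
    return n_opp / len(opposite) <= alpha


def mine(own, opposite, s, n_iter, alpha, rng):
    """Mining step: intersect random subsamples of size s, keep alpha-weak results."""
    hypotheses = []
    for _ in range(n_iter):
        sample = rng.sample(own, min(s, len(own)))
        d = reduce(meet, sample)
        if is_alpha_weak(d, opposite, alpha):
            hypotheses.append(d)
    return hypotheses


def update(test, hypotheses, opposite, alpha):
    """Updating step: intersect each hypothesis with the test object description."""
    candidates = (meet(h, test) for h in hypotheses)
    return [d for d in candidates if is_alpha_weak(d, opposite, alpha)]


def qbca_predict(test, pos, neg, s=2, n_iter=200, alpha=0.0, seed=0):
    """Voting step (simplified): majority of classifiers; None means abstention."""
    rng = random.Random(seed)
    n_pos = len(update(test, mine(pos, neg, s, n_iter, alpha, rng), neg, alpha))
    n_neg = len(update(test, mine(neg, pos, s, n_iter, alpha, rng), pos, alpha))
    if n_pos == n_neg:
        return None
    return 1 if n_pos > n_neg else -1


# Hypothetical one-attribute data: positive class near 0..2, negative near 10..12.
pos = [[(0.0, 0.0)], [(1.0, 1.0)], [(2.0, 2.0)]]
neg = [[(10.0, 10.0)], [(11.0, 11.0)], [(12.0, 12.0)]]
print(qbca_predict([(1.5, 1.5)], pos, neg))    # 1
print(qbca_predict([(11.5, 11.5)], pos, neg))  # -1
```

Each iteration of the mining loop is independent of the others, so the hypotheses can be mined in parallel and merged before the updating step.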

Voting schemes
The final classification for a test object is based on the voting of $\alpha$-weak classifiers. In the most general case a voting scheme $F$ is a mapping
$$F(g_n, c^+_1, \ldots, c^+_p, c^-_1, \ldots, c^-_n) \to \{-1, \varnothing, 1\},$$
where $g_n$ is the test object with unknown class, $c^+_i$ is a positive $\alpha$-weak classifier for all $i = 1, \ldots, p$, $c^-_j$ is a negative $\alpha$-weak classifier for all $j = 1, \ldots, n$, $-1$ is the label for the negative class, and $1$ is the label for the positive class (i.e. defaulters). In other words, $F$ is an aggregating rule that takes classifiers to classification labels (the empty label is allowed).
If the label is empty, the algorithm is said to abstain from classification. This can happen when no $\alpha$-weak classifier is found for the test object, which can be due to poor algorithm tuning (e.g. an inappropriately low $\alpha$ threshold or a small number of iterations).
There may be different approaches to building aggregating rules. The voting scheme is built upon a weighting function $v(\cdot)$, an aggregation operator $A(\cdot)$ and a comparing operator.
To configure a new weighting scheme it is sufficient to define the operators and the weighting function. In this paper, the best weighting scheme (in terms of accuracy) is based on the relative number of positive versus negative $\alpha$-weak classifiers, weighted by their confidence. However, there are many different ways to build voting schemes, and a number of them can be found in the Github code repository [1].
One can think of margin a b as a measure for discrimination between two classes and consider, e.g. the decision boundary based on receiver operating characteristic analysis. Once the decision boundary is defined (i.e. when a b > x then 1 else -1), the voting scheme produces the predicted label.

Experiments with open data
In this paper we use open data sets for credit scoring. To compare our algorithm against benchmarks we hold out a portion of the data as a validation sample, which is not used when training the model, and then calculate the performance metric (Gini coefficient) on that sample. Where applicable we use grid search over hyperparameters to tune benchmark performance. The results are summarized in Table 2.
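For readers unfamiliar with the metric, the sketch below assumes the convention common in credit scoring, where the Gini coefficient of a scoring model equals 2·AUC − 1 and AUC is the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The paper does not spell out its exact computation, so this is an illustration of the usual definition; the function names and the data are hypothetical.

```python
def auc(scores: list[float], labels: list[int]) -> float:
    """Pairwise AUC: ties between a positive and a negative count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(scores: list[float], labels: list[int]) -> float:
    """Gini coefficient under the assumed convention Gini = 2 * AUC - 1."""
    return 2.0 * auc(scores, labels) - 1.0


# Hypothetical validation sample: label 1 marks a defaulter.
labels = [1, 1, 0, 0]
print(gini([0.9, 0.8, 0.3, 0.1], labels))  # 1.0
print(gini([0.9, 0.2, 0.8, 0.1], labels))  # 0.5
```

The pairwise loop is quadratic in the sample size; for samples of tens of thousands of observations a rank-based computation would be preferable.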

Lending club loan data
Lending Club is a large US peer-to-peer online credit platform. It has accumulated hundreds of thousands of payment profiles on loans issued from 2007 to the present [2]. The data has been used widely as a benchmark when testing machine learning models (kaggle.com, 2020; triamus.github.io, 2020). We extracted 25,000 observations with nine features. They represent client credit history, loan term, rate, borrower's ownership information, income, employment length, etc. The target attribute is the loan status, which indicates whether the payments were made on time or not.

Credit Scoring Catalonia data
This data set was designed for educational purposes but is nevertheless applicable for benchmark analysis. The data consists of 4,456 observations and 13 numeric and categorical features with a single target attribute [3]. Categorical variables were one-hot encoded before applying the algorithms. The features are similar to the ones in the previous data set.

Give me some credit data
The "Give Me Some Credit" data set is taken from a Kaggle contest held in 2012 [4]. The data has a binary target variable (class label) indicating whether the borrower defaulted or not. We develop a scorecard and examine its accuracy via out-of-sample validation with the provided target variable. The data set we used consists of 25,000 observations with ten numeric features. They describe the client's status and previous credit experience and contain information on total balance on credit cards, monthly income, debt-to-income ratio, number of dependents (children, spouse), current revolving utilization limit, etc.

Benchmarking: scorecards and black-box methods
We compare our algorithm to both the classical algorithms adopted in banks and ML algorithms. As far as classical algorithms are concerned, we use scorecards (Siddiqi, 2005), which are, in effect, logistic regressions run on transformed features. The features are transformed according to the weight-of-evidence (WOE) transformation (stats.stackexchange.com, 2020; documentation.statsoft.com, 2020). The WOE transformation was controlled for the maximum number of observations in the final nodes of one-factor trees to avoid overfitting at the starting point. An example of variable binning is provided in Figure 1.
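As an illustration of the WOE transformation, the sketch below computes one WOE value per bin of a numeric feature. It assumes a common convention, WOE = ln(share of non-defaulters in the bin / share of defaulters in the bin); the sign convention differs between sources. The bin edges are supplied by hand here, whereas the paper derives them from one-factor trees. The function name `woe_table` and the data are hypothetical.

```python
import math
from bisect import bisect_right


def woe_table(values: list[float], defaults: list[int], edges: list[float]):
    """WOE per bin; bin i collects values v with edges[i-1] <= v < edges[i]."""
    n_bins = len(edges) + 1
    good = [0] * n_bins
    bad = [0] * n_bins
    for v, y in zip(values, defaults):
        b = bisect_right(edges, v)
        if y == 1:
            bad[b] += 1
        else:
            good[b] += 1
    total_good, total_bad = sum(good), sum(bad)
    table = []
    for g, b in zip(good, bad):
        if g == 0 or b == 0:
            table.append(None)  # undefined without smoothing
        else:
            table.append(math.log((g / total_good) / (b / total_bad)))
    return table


# Hypothetical feature with one split at 5: low values default more often.
values = [1, 2, 3, 4, 6, 7, 8, 9]
defaults = [1, 1, 1, 0, 0, 0, 0, 1]
print(woe_table(values, defaults, edges=[5]))  # approximately [-1.0986, 1.0986]
```

Each raw feature value is then replaced by the WOE of its bin before the logistic regression is fitted, which is what makes the scorecard coefficients directly comparable across features.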
Once the factors were transformed, the individual Gini coefficients were calculated to assess the predictive power of the features. We excluded variables that showed high pairwise correlation or a Gini drop on the validation sample; the rest were fed to logistic regression and the final model was fitted. The pipeline can be found in the Bitbucket code repository [5].
Finally, we applied the XGBoost, Random Forest, Decision Tree and kNN algorithms to the same data to estimate the classification quality achievable with the "black-box" models as well.
As we can see, XGBoost performs better in terms of Gini. However, its results are not interpretable, and the best explanation for classification that one can extract from the trained XGBoost model is the estimated feature importance, based on the number of splits in the trees that were made on each feature.

Random sampling alteration
In subsection 3.5, we study an alternative approach to generating the $\alpha$-weak hypotheses described in the previous section. The modification affects the way we extract random subsamples from $G_+$ or $G_-$. In the previous setting a subsample of fixed size $s$ is extracted. Thus, such classifiers have some fixed predictive power, and therefore their effectiveness can be lower for some test objects and higher for others. To solve this problem, we propose to vary the subsample size while the algorithm is running and to extract subsamples of different sizes in each iteration.
In this paper, two ways of specifying the size of the subsample in each iteration were considered:
(1) Random choice of a number from the uniform distribution from 1 to $N$, where $N$ is the size of the training sample.
(2) Selection of a number $m$ from 1 to $N$ with a probability proportional to the number of combinations $\binom{N}{m}$.

It is worth noting that the second method is equivalent to uniform sampling from the power set of the context. Thus, the second method provides convergence to classic lazy classification, as it considers descriptions of all possible combinations of objects (after a large number of iterations). Nevertheless, even with a small training sample size $N$, convergence will never be achieved in practice, as the number of all subsets $2^N$ is too large. In practice, given large values of the subsample size $m$, the description $\delta(g_1) \sqcap \ldots \sqcap \delta(g_m)$ turns out to be too "general" and the proportion of objects of the opposite class in the set $(\delta(g_1) \sqcap \ldots \sqcap \delta(g_m))^{\square}$ exceeds the given threshold $\alpha$ in each iteration. That is, for any value of the threshold $\alpha$, it makes sense to extract only subsamples whose size does not exceed a certain maximum value $M$. Thus, the two methods described above for setting the subsample size in each iteration can be slightly modified as follows:
(1) Random choice of a number from the uniform distribution from 1 to $M$, where $M$ is a number determined experimentally, depending on the data set used.
(2) Selection of a number $m$ from 1 to $M$ with a probability proportional to the number of combinations $\binom{N}{m}$.
Note that in the second method, the probabilities of drawing a subsample of size $k$ and of size $k + 1$ are related as follows:
$$\frac{P(k+1)}{P(k)} = \frac{\binom{N}{k+1}}{\binom{N}{k}} = \frac{N-k}{k+1}.$$
It means that a sample of size $k + 1$ will be extracted approximately $\frac{N-k}{k+1}$ times more frequently than a sample of size $k$. It follows that, given a large amount of data and a fixed number of iterations, small sample sizes are practically not used in the classification. The most probable size of the subsamples equals $\min(M, N/2)$. As a result of experiments with the Kaggle data set, it was found that the optimal maximal subsample size is $M = 20$.
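The second sampling rule can be sketched with the standard library alone. This is a minimal illustration, assuming the size is drawn with `random.choices` using the binomial coefficients as unnormalised weights; the function names `size_weights` and `sample_size` are hypothetical.

```python
import math
import random


def size_weights(n: int, max_size: int) -> list[int]:
    """Unnormalised weights C(n, m) for subsample sizes m = 1..min(max_size, n)."""
    return [math.comb(n, m) for m in range(1, min(max_size, n) + 1)]


def sample_size(n: int, max_size: int, rng: random.Random) -> int:
    """Draw a subsample size m with probability proportional to C(n, m)."""
    weights = size_weights(n, max_size)
    return rng.choices(range(1, len(weights) + 1), weights=weights, k=1)[0]


# The ratio of consecutive weights equals (N - k) / (k + 1); here N = 10, k = 3.
assert math.comb(10, 4) * (3 + 1) == math.comb(10, 3) * (10 - 3)

# With N = 10 and no effective cap the most probable size is N / 2 = 5.
weights = size_weights(10, 20)
print(weights.index(max(weights)) + 1)  # 5

rng = random.Random(0)
print([sample_size(100, 20, rng) for _ in range(5)])
```

Because the weights grow so quickly with $m$, almost all draws land at or near the cap when $M$ is much smaller than $N/2$, which is the effect noted above. For very large $N$ and $M$ the binomial coefficients exceed the floating-point range used internally by `random.choices`, so log-weights would be needed in that case.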

Visualization and interpretability of hypotheses
There is no unified definition of model interpretability in machine learning. However, there are some general requirements which are common among researchers. Some authors focus on the rule induction criterion (Feraud and Clerot, 2002), i.e. a model is interpretable if it provides the user with a set of simple rules to make a decision.
Causality is emphasized by Miller (2017), who states that interpretability is the degree to which a human can understand the cause of a decision.
Further, Kim et al. (2016) distinguish between global and local interpretability. Global interpretability shows which features have a major impact on the prediction and the direction of that impact. Local interpretability answers the question of why a particular test object received a particular prediction.
In this paper, we try to combine the previously mentioned criteria and, to be more precise, we outline three properties of interpretability:
(1) Prediction is performed based on rules derived from initial factors, preserving the initial feature space.
(2) The algorithm processes the initially defined target attribute.
(3) Prediction can be explained individually at the test object level.
For example, decision trees and random forests are considered to be interpretable algorithms since they process the initial feature space and target attribute and produce rules which are then applied to test objects.
On the contrary, SVM with kernels does not satisfy the first condition, since classification is performed in an artificially constructed feature space. XGBoost lacks the second property, since each next tree fits the errors of the previous one, which is not the initially defined target attribute. Neural networks do not provide rules for a decision-maker. So, none of these algorithms is interpretable in the sense above.
The situation is different with the query-based classification algorithm: since it works with hypotheses, it is an interpretable algorithm.
Hypotheses are mined in the initial feature space and, in effect, they are just tuples of intervals. So, hypotheses define an area in the initial feature space and serve as rules for a decision-maker. Therefore, a premise can be visualized as a hypercube in a space of dimension $d$, where $d$ is the number of intervals (and features). To visualize the premise, one can project this hypercube onto a plane. As far as prediction explanation is concerned, each test object receives a number of rules (i.e. $\alpha$-weak classifiers), which can be treated as portraits of good and bad borrowers. So, if a loan applicant is rejected, it is because he or she is more similar to delinquent clients, i.e. more positive $\alpha$-weak classifiers were found for the applicant. Figure 2 shows two positive and two negative hypotheses on a plane of two features. Positive hypotheses are given in red, and negative ones are given in blue. To construct each positive hypothesis, two objects from the positive context were randomly extracted. Then the meet operator was applied and a set of intervals was obtained. After that, only the intervals for the two features were kept. The same procedure was run for negative hypotheses and the negative context.
In Figure 3, ten positive and ten negative hypotheses were built according to the same procedure, whereas in Figure 4 their number reaches fifty. One can see that they are localized in different areas. As we extract more random hypotheses, the boundary between good and bad regions becomes more and more obvious. For 1,000 random hypotheses, the boundary is almost clear (Figure 5). As the number of extracted negative and positive hypotheses increases, the number of multiple intersections between them grows. One can observe an expansion of the area with sparse positive hypotheses: while negative ones are fixed at the interval from 0 to 1, positive intervals are in the range from 0.4 to 2.
It is interesting to note that certain patterns can be extracted from the query-based classification algorithm (QBCA) model. We can observe rules such as: if a loan applicant's age is greater than 50, there was no delinquency in the past and the overall revolving utilization of unsecured lines was less than 11%, then the probability of default is almost four times lower than average. On the other hand, applicants younger than 30 with revolving utilization of unsecured lines greater than 72% default 1.5 times more frequently than average. This is where we enjoy the advantage of interval pattern structures: they represent rules that can be easily interpreted, and at the same time they make a prediction for each new object in the validation data set individually, which makes it possible to improve classification accuracy over the default scorecard model. In addition, it is possible to see disputable areas (depicted in purple), which are areas of feature values shared by both positive and negative hypotheses. Furthermore, one can see that for some hypotheses the right border of the interval of the RevolvingUtilizationOfUnsecuredLines feature is 1, but some hypotheses have a right-hand boundary greater than 1. Based on this, one can draw a conclusion about data errors or heterogeneity of the values of a given feature (the hypotheses were constructed on data without preprocessing). Thus, such visualization has additional practical value.

Lattices of closed descriptions in regression problem
Classification is not the only problem that arises in the credit scoring process. One also has to predict other client parameters, which can be continuously distributed. For example, when loans are granted online, borrowers fill in their income amount in the loan application form. It is necessary to verify those amounts by taking into account all other borrower features such as education, employment experience, ownership, etc. Therefore, an income prediction model is built to compare the predicted value with the one filled in the application. If the latter dramatically exceeds the value predicted by the model, a warning alert can be sent indicating that the borrower may have provided fake information.
In Section 4, we adapt the previously developed QBCA model to the case of a continuous target variable (i.e. the regression problem).

Augmented interval pattern structures
For the case where the target attribute is not a class label but a continuous variable, we adjust the definition of the interval pattern structure by equipping it with an additional component $h$.
Let us define an augmented interval pattern structure as a quadruple $(G, \underline{D}, \delta, h)$, where the description $\delta$ consists of two elements $\delta_x$ and $\delta_y$ ($\delta_y$ is an interval for the target attribute $y \in \mathbb{R}$ and $\delta_x$ is a vector of intervals for the explanatory attributes $x$, which are supposed to predict the target attribute $y$), $\delta : G \to D$ and $h \in H$, where $H$ is a family of density distribution functions for the target attribute $y$, i.e. $\int_{-\infty}^{+\infty} h(s)\,ds = 1$. We will also use the notation $d_x$ and $d_y$ to distinguish between descriptions containing explanatory attributes and the target attribute, respectively. The definition of the meet operation is left unchanged.
Let $A_0 = \{g_1, \ldots, g_J\} \subseteq G$ be a set of objects with descriptions $\delta(g_j) = \{[x_{1j}; x_{1j}], \ldots, [x_{Mj}; x_{Mj}], [y_j; y_j]\}$, for $j = 1, \ldots, J$, where $M$ is the number of explanatory attributes. Then we define the derivation operator in the following way: $A_0^{\oplus}$ consists of the explanatory description $d_{x0} = \delta_x(g_1) \sqcap \ldots \sqcap \delta_x(g_J)$, the target attribute description $d_{y0} = \delta_y(g_1) \sqcap \ldots \sqcap \delta_y(g_J)$, which is in fact a single interval $[y_{\min}, y_{\max}]$, and a density function $h_0$. The function $h_0$ is in effect a target attribute density distribution function based on the observations of $A_0$. Let $t_0, \ldots, t_K$ be a partition of $d_{y0}$ with $t_0 = y_{\min}$, $t_K = y_{\max}$ and $\Delta t_i = \frac{y_{\max} - y_{\min}}{K}$; the density is estimated on this partition. Thus, $h$ is a function of the target attribute values $y$ of the objects in $A$.
We will use the second derivation operator in a way similar to how it was used with interval pattern structures; however, it returns the image for the description $d_{x0}$ whatever the target description $d_{y0}$ and the density function $h$ are. The result carries a new target description $d_{y1}$, i.e. only the target attribute description $d_y$ is updated, and so is the density function $h$, while the explanatory variables description $d_{x0}$ remains the same.
To approach the target attribute prediction problem it is useful to define $\alpha$-weak descriptions by analogy with the binary target case. An $h$-augmented interval pattern $d \in D$ is called an $\alpha$-weak hypothesis (and, by analogy, an $\alpha$-weak regressor) when the share, relative to $|A|$, of objects whose target values fall outside $d_y$ does not exceed $\alpha$. Here $d_y$ is a single interval $[d_y^{\min}; d_y^{\max}]$ for the target attribute $y$, and $h$ is a density function which reflects the distribution of the target attribute within the interval $d_y$ based on the objects from $A$. This definition involves the parameter $\alpha$ that controls the frequency of hypothesis falsifications, i.e. how dramatically it is falsified. To emphasize the connection between $\alpha$-weak descriptions in the classification and regression cases it is convenient to apply a small transformation to the definition above:
$$1 - \frac{|\{g \in G \mid d_y^{\min} \le \delta_y(g) \le d_y^{\max}\}|}{|A|} \le \alpha \iff 1 - \frac{|\{g \in G \mid d_y \sqsubseteq \delta_y(g)\}|}{|A|} \le \alpha \iff \frac{|\{g \in G \mid d_y \not\sqsubseteq \delta_y(g)\}|}{|A|} \le \alpha.$$
The example below demonstrates the crucial differences between $\alpha$-weak descriptions in the cases of binary and continuous target variables.
Example 4.1. Consider the data set provided in Table 3. Suppose $A_0 = \{g_1, g_2\}$. First, let us calculate $A_0^{\oplus}$: $d_0 = (d_{0x}, d_{0y}) = ([30; 35], [10; 12], [0.5; 0.7])$. Next, let us find $A_0^{\oplus\oplus}$.
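The computation in Example 4.1 can be traced with the short sketch below, using the three objects of Table 3. It follows the verbal description of the operators: the first derivation intersects the descriptions of the chosen objects, and the second collects all objects whose explanatory values fall inside the explanatory description, regardless of their target values. The function names are hypothetical, and the density component is omitted.

```python
# Objects of Table 3: explanatory values (x1, x2) and target value y.
data = {
    "g1": ((30.0, 10.0), 0.5),
    "g2": ((35.0, 12.0), 0.7),
    "g3": ((31.5, 11.5), 0.8),
}


def first_derivation(objects):
    """Meet of the descriptions: per-attribute [min, max] for x and for y."""
    xs = [data[g][0] for g in objects]
    ys = [data[g][1] for g in objects]
    d_x = [(min(col), max(col)) for col in zip(*xs)]
    d_y = (min(ys), max(ys))
    return d_x, d_y


def second_derivation(d_x):
    """Objects whose explanatory values lie inside d_x, and their target range."""
    extent = [g for g, (x, _) in data.items()
              if all(lo <= v <= hi for v, (lo, hi) in zip(x, d_x))]
    ys = [data[g][1] for g in extent]
    return extent, (min(ys), max(ys))


d_x0, d_y0 = first_derivation(["g1", "g2"])
extent, d_y1 = second_derivation(d_x0)
print(d_x0, d_y0)    # [(30.0, 35.0), (10.0, 12.0)] (0.5, 0.7)
print(extent, d_y1)  # ['g1', 'g2', 'g3'] (0.5, 0.8)

# Share of covered objects whose target value falls outside d_y0.
inside = sum(d_y0[0] <= data[g][1] <= d_y0[1] for g in extent)
share_outside = 1 - inside / len(extent)
```

Object $g_3$ satisfies the explanatory description but its target value 0.8 lies outside $[0.5; 0.7]$, so one of the three covered objects falsifies the target interval. Under the inequality above, with $A$ taken to be the covered objects (an assumption), the pattern would be $\alpha$-weak only for $\alpha \ge 1/3$.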

Query-based regression algorithm
Assume that we have a set of objects $G$ and numerical data with a set of explanatory attributes $x_1, \ldots, x_M$ and a continuous target attribute $y$. Now, suppose we receive a test object $g_t$ with observable attributes $x$, but with an unknown value of the target attribute $y$. Next, we describe an approach to predicting $y$ using interval pattern structures. The steps of the query-based regression algorithm (QBRA) are similar to the case of classification. First, we mine $\alpha$-weak hypotheses. Second, we calculate $\alpha$-weak regressors by intersecting the description $\delta(g_t)$ with the hypotheses. Third, we predict the target attribute for the test object $g_t$ using the mined $\alpha$-weak regressors.
Let us start by choosing the subsample size parameter, which is the number of objects randomly extracted from the set of objects $G$. Upon random extraction of objects $A_0 = \{g_1, \ldots, g_K\}$ we calculate the pattern $d_0 = \delta(g_1) \sqcap \ldots \sqcap \delta(g_K)$ and the density distribution function $h_0$ for the target attribute values. If $d_0$ is an $\alpha$-weak hypothesis, it is added to the collection of hypotheses that will be used for prediction later. Together with the pattern it is necessary to store the density function $h_0$.
Table 3. Toy data set for example 2

        x_1     x_2     y
g_1     30      10      0.5
g_2     35      12      0.7
g_3     31.5    11.5    0.8

At the second stage of our algorithm we derive $\alpha$-weak regressors by calculating the intersections $d_x \sqcap \delta(g_t)$ and obtain the new density functions $h_1$. Having finished mining the regressors, we move on to the next stage, which is building a prediction for the target attribute. In our case, the resulting prediction was defined by the mixture of distributions from all regressors. In practice, all target attribute values stored within the regressors were put together to form a final distribution. Finally, we tried both the average and the median of this distribution as the prediction for the target attribute. Such an approach takes into account the supports of the regressors (the regressors supported by a greater number of objects contribute more). But which of $h_0$, $h_1$ or another density do we have to use?
Here we introduce another hyperparameter of the algorithm, which is called "capped." Capped is a Boolean value. If it is true, the range for the target attribute $d_{y1}$ in $d_0^{\oplus\oplus}$ is truncated to $d_{y0}$ and the corresponding density function is $h_1$ calculated on the truncated set of target values. If the capped hyperparameter is false, we keep $d_{y1}$ and calculate the density function based on all target values that fell into $d_{y1}$, based on the objects from $d_0^{\oplus}$. The whole procedure is repeated as many times as set by the number of iterations parameter.
However, one can argue that regressors differ in terms of anti-support and deviation in target attribute values. Indeed, we would put more weight on a prediction based on regressors with a narrow range of target attribute values. Therefore, we added target values to the final distribution with different weights, and both the weighted average and the weighted median were used as the prediction.
We introduced two Boolean hyperparameters that control the weighting scheme. The first is "account for anti-support" and the second is "penalty for high deviation."
When account for anti-support is true, the target values $\delta^{y}(g)$ of objects $g \in A$ associated with the regressor $d$ are weighted according to the anti-support of that regressor. When penalty for high deviation is true, the weight decreases as the deviation in the target attribute values grows. If either parameter is false, the corresponding weight is equal to one. The final weight for the target attribute value of the object $g$, with which it contributes to the aggregate distribution used for prediction, is defined as the product of the two weights.
Finally, suppose that $P$ is the set of mined $\alpha$-weak regressors. The prediction for the target attribute $y$ of a test object $g_t$ can be based either on the weighted average or on the weighted median of the aggregate distribution. If $P$ is an empty set, the prediction is the average or the median of all target attribute values in the knowledge base, i.e. the prediction is based on the "naive" model.
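The aggregation step can be sketched as follows. This is only an illustration under stated assumptions: the two per-regressor weights are taken as given inputs (their exact formulas are not reproduced here), the weighted median is taken as the smallest value at which the cumulative weight reaches half of the total, and all function names are hypothetical.

```python
def weighted_average(values, weights):
    """Weighted mean of the pooled target values."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total
    (one common convention; an assumption of this sketch)."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    return pairs[-1][0]


def predict(regressors, knowledge_base_targets, use_median=False):
    """Combine the target values stored in the mined regressors.

    regressors: list of (target_values, w_anti_support, w_deviation);
    a weight of 1.0 corresponds to the matching hyperparameter being false.
    """
    values, weights = [], []
    for targets, w_anti_support, w_deviation in regressors:
        for y in targets:
            values.append(y)
            # Final weight is the product of the two weights.
            weights.append(w_anti_support * w_deviation)
    if not values:
        # Empty set of regressors: fall back to the "naive" model.
        values = list(knowledge_base_targets)
        weights = [1.0] * len(values)
    aggregate = weighted_median if use_median else weighted_average
    return aggregate(values, weights)
```

With all weights equal to one this reduces to the plain average or median of the pooled values, so regressors supported by more objects still contribute more through the number of values they add.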

Data and experiments
The data used for the calculations contain verified borrower income information. We predict the income level using all other borrower features. An income prediction model can then be applied to borrowers whose income cannot be verified (e.g. in online lending). When the predicted value is dramatically lower than the value stated in the loan application form, one can suspect that the borrower is trying to embellish his or her financial standing. Other use cases for income modeling include client base segmentation, where different products are offered to customers based on their expected income level. All three data sets (Lending Club, Credit Scoring Catalonia and Give Me Some Credit) contain an income column, which serves as the new target variable for QBRA.
The data were randomly divided into two parts, with 70% of observations in one part and 30% in the other. The larger part was used for training the benchmarks and QBRA, and the remaining 30% was used as a test set to evaluate the predictions; their accuracy is summarized in Table 4. The accuracy of predictions was evaluated in terms of mean absolute percentage error (MAPE) [6]:
$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|,$$
where $y_i$ is the target attribute (income) for the $i$-th client in the test set, $\hat{y}_i$ is the prediction and $n$ is the number of clients in the test set. The accuracy of the algorithm was compared to benchmarks represented by random forests, since their predictions are also based on a combination of simple rules. The proposed query-based regression algorithm showed comparable quality in the majority of runs, and in certain regions of the hyperparameter space it outperformed random forests.
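The evaluation protocol can be sketched in a few lines. This is a minimal illustration, assuming MAPE in its standard form (mean of absolute relative errors, expressed as a fraction rather than a percentage) and strictly nonzero target values; the function names and the fixed seed are hypothetical.

```python
import random


def train_test_split(rows, test_share=0.3, seed=0):
    """Randomly split rows into (train, test) with the given test share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.1 means 10%).

    Assumes every true value is nonzero, as is natural for income.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = (abs((y - p) / y) for y, p in zip(y_true, y_pred))
    return sum(errors) / len(y_true)
```

For example, predictions of 90 and 220 for true incomes of 100 and 200 are each off by 10%, so the MAPE is 0.1.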

Conclusion
In this paper, we propose a query-based classification approach to the credit scoring problem. Model interpretability versus model accuracy is discussed, and it is emphasized that model transparency is an important requirement in banking risk management. We define properties of interpretability and demonstrate that the proposed algorithms satisfy them. We also address the prediction of a continuous target variable with a query-based regression algorithm.
We argue that the proposed algorithms allow the user to extract interpretable rules for prediction and, at the same time, outperform methods that are widespread in banking (e.g. scorecards, decision trees). Black-box models, however, may have superior accuracy at the cost of interpretability.
The introduced algorithms are applied to open data on retail loan applicants, and a benchmark analysis is performed. The Gini coefficient is used as the model quality metric. We performed a grid search by running the algorithms with different hyperparameter values.
We then describe a query-based regression algorithm and apply it to the income prediction problem. The proposed algorithm shows quality comparable with the benchmarks, and in certain regions of the hyperparameter space it outperforms random forests. Mean absolute percentage error is used as the model quality metric.
As an area for further research, one can consider and compare the accuracy obtained with other voting schemes. It is expected that, by taking into account the specificity of $\alpha$-weak classifiers, one can improve the overall accuracy of the classification algorithm or, alternatively, reach the same accuracy with fewer iterations, which would reduce the time required for calculations. As for the regression algorithm, one can consider keeping the density function $h$ in regressors not only for the target attribute but also for the explanatory attributes. It can be expected that, if $\alpha$-weak regressors are mined based on some additional properties of the feature distributions, they will be more relevant for test objects and will produce more accurate predictions for the target attribute.