Ordering-Based Pruning for Improving the Performance of Ensembles of Classifiers in the Framework of Imbalanced Datasets

This Website contains additional material to the paper

M. Galar, A. Fernández, E. Barrenechea, H. Bustince and F. Herrera, Ordering-Based Pruning for Improving the Performance of Ensembles of Classifiers in the Framework of Imbalanced Datasets. Submitted to Information Sciences (2015),

according to the following summary:

  1. Paper Content
  2. Addressing Imbalanced Problems by Ensemble Techniques
    1. Bagging Based Ensembles
    2. Boosting Based Ensembles
    3. Hybrid Based Ensembles
  3. Data-set partitions employed in the Paper
  4. Experimental Results

Paper Content

Abstract: The scenario of classification with imbalanced datasets has gained notable significance in recent years. This is due to the fact that a large number of problems where classes are highly skewed may be found, affecting the global performance of the system. Accordingly, a great number of approaches to address this problem have been developed. These techniques have traditionally been proposed under three different perspectives: data treatment, adaptation of algorithms, and cost-sensitive learning.

Ensemble-based models are an extension of the former solutions: they build a pool of classifiers, and they can in turn integrate any of these proposals. The quality and performance of this type of methodology over baseline solutions have been shown in several studies of the specialized literature.

The goal of this work is to improve the capabilities of ensemble-based solutions that were specifically designed for imbalanced classification, focusing on the best behaving bagging- and boosting-based ensembles in this scenario. In order to do so, this paper proposes several new metrics for ordering-based pruning, which are properly adapted to address the skewed-class distribution. From our experimental study we show two main results: on the one hand, the use of the newly developed metrics allows pruning to be a very successful approach in this scenario; on the other hand, the Under-Bagging model excels, achieving the highest gain from the usage of pruning, as it enables the selection of those random undersampling sets that complement each other best. Accordingly, this scheme outperforms the ensemble models selected from the state-of-the-art.

Summary:

  1. Introduction.
  2. Preliminaries: Ensembles for Classification with Imbalanced Datasets.
    1. Basic concepts on classification with imbalanced datasets.
    2. Ensemble methods.
  3. A Proposal for Ordering-Based Pruning Scheme for Ensembles in Imbalanced Domains.
    1. Pruning in ordered ensembles.
    2. Ordering-based ensemble pruning metrics for imbalanced domains.
  4. Experimental framework.
    1. Benchmark data.
    2. Selected parameters.
    3. Statistical tests for performance comparison.
    4. Web page associated with the paper.
  5. Experimental study.
    1. Analysis of the pruning metrics of ensembles in imbalanced classification.
    2. Intra-family comparison for ensemble ordering-based pruning in imbalanced classification.
  6. Lessons learned.
  7. Concluding remarks.

Addressing Imbalanced Problems by Ensemble Techniques

Ensemble-based classifiers, also known as multiple classifier systems (R. Polikar, Ensemble based systems in decision making, IEEE Circuits and Systems Magazine 6 (3) (2006) 21–45), try to improve the performance of single classifiers by inducing several classifiers and combining them to obtain a new classifier that outperforms every one of them. Hence, the basic idea is to construct several classifiers from the original data and then aggregate their predictions when unknown instances are presented.
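The aggregation step described above can be illustrated with a minimal sketch (the function and variable names below are ours, not taken from any of the cited works): a simple majority vote over a pool of classifiers.

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Aggregate the predictions of several classifiers for instance `x`
    by majority vote; the most frequent label wins."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Three toy threshold "classifiers" over a single numeric feature.
clfs = [lambda x: int(x > 0.3), lambda x: int(x > 0.5), lambda x: int(x > 0.7)]
print(majority_vote(clfs, 0.6))  # two of the three vote for class 1 -> prints 1
```

In practice the combined prediction is often better than that of any single member, provided the members make (at least partially) independent errors.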

In recent years, ensembles of classifiers have arisen as a possible solution to the class imbalance problem, attracting great interest among researchers. Ensemble-based methods are built by combining the ensemble learning algorithm with one of the previously discussed techniques, namely data-level approaches or algorithmic approaches based on cost-sensitive learning. When a data-level approach is added to the ensemble learning algorithm, the new hybrid method usually preprocesses the data before training each classifier. Cost-sensitive ensembles, on the other hand, instead of modifying the base classifier in order to accept costs in the learning process, guide the cost minimization via the ensemble learning algorithm. This way, the modification of the base learner is avoided, but the major drawback, which is the definition of the costs, is still present.

A complete taxonomy of ensemble methods for learning with imbalanced classes can be found in a recent review (M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 42 (4) (2012) 463–484). We encourage readers to refer to this paper for a global vision of ensembles for imbalanced classification, as well as to check the original references for each of the methods described next.

In the former review, the authors distinguished four different families of ensemble approaches for imbalanced learning. On the one hand, they identified cost-sensitive boosting approaches, which are similar to cost-sensitive methods, but where the cost minimization procedure is guided by a boosting algorithm. On the other hand, they distinguished three more families that share a common feature: all of them consist of embedding a data preprocessing technique in an ensemble learning algorithm. These three families are categorized depending on the ensemble learning algorithm used, i.e., bagging, boosting and hybrid ensembles.

Bagging Based Ensembles

Many approaches have been developed using bagging ensembles to deal with class imbalance problems, due to their simplicity and good generalization ability. The hybridization of bagging and data preprocessing techniques is usually simpler than their integration in boosting. A bagging algorithm does not need to recompute any kind of weights; therefore, there is no need to adapt a weight-update formula or to change any computations in the algorithm. In these methods, the key factor is the way each bootstrap replica is collected, that is, how the class imbalance problem is handled in order to obtain a useful classifier in each iteration, without forgetting the importance of diversity.

We have selected three algorithms in this family, the first two being SMOTE-Bagging (S. Wang, X. Yao, Diversity analysis on imbalanced data sets by using ensemble models, in: Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining (CIDM'09), 2009) and UnderBagging (R. Barandela, R. Valdovinos, J. Sánchez, New applications of ensembles of classifiers, Pattern Analysis and Applications 6 (3) (2003) 245–256).

  • SMOTE-Bagging: An easy way to overcome the class imbalance problem in each bag is to take the classes of the instances into account when they are randomly drawn from the original data-set. Hence, instead of performing a random sampling of the whole data-set, an oversampling process can be carried out before training each classifier (OverBagging). This procedure can be developed in at least two ways. Oversampling consists in increasing the number of minority class instances by replication; all majority class instances can be included in each new bootstrap sample, but another option is to resample them in order to increase diversity. Notice that in OverBagging probably all instances will take part in at least one bag, but each bootstrap replica will contain many more instances than the original data-set.

    In this case, minority class instances are oversampled via the SMOTE preprocessing algorithm, leading to the SMOTEBagging approach. This differs from the use of random oversampling not only in the preprocessing mechanism, but also in the way each bag is created. As in OverBagging, both classes contribute to each bag with $N_{maj}$ instances. However, a SMOTE resampling rate ($a\%$) is set in each iteration (ranging from $10\%$ in the first iteration to $100\%$ in the last, always in multiples of $10$), and this ratio defines the number of positive instances ($a\%\cdot N_{maj}$) randomly resampled (with replacement) from the original data-set in each iteration. The rest of the positive instances are generated by the SMOTE algorithm. Besides, the set of negative instances is bootstrapped in each iteration in order to form a more diverse ensemble.

  • Under-Bagging: Contrary to OverBagging, the UnderBagging procedure uses undersampling instead of oversampling. However, in the same manner as OverBagging, it can be developed in at least two ways. The undersampling procedure is usually applied only to the majority class; however, a resampling with replacement of the minority class can also be applied in order to obtain a priori more diverse ensembles. Note that in UnderBagging it is more probable to ignore some useful negative instances, but each bag has fewer instances than the original data-set (contrary to OverBagging).
  • Roughly-Balanced Bagging: This algorithm sets the number of positive samples in each bag equal to the number of positive examples in the original data-set. If they are sampled without replacement, all of the positive examples will be contained in every sampled subset. In contrast, the number of negative samples is decided probabilistically according to a negative binomial distribution, whose parameters are the number of minority (i.e., positive) examples and the probability of success q = 0.5. The key is that the examples of both classes are drawn with equal probability, but only the number of negative samples varies, while the number of positive samples is kept constant since it is small. There are no significant differences in results between sampling with and without replacement. In prediction, the aggregated model simply outputs the class label with the highest average probability estimated by the base models. The implementation considered in our work includes all minority class examples in each bag and performs resampling with replacement only over the majority class.
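The bag-construction schemes above can be sketched as follows (a simplified illustration with hypothetical function names, not code from the original papers): UnderBagging draws a random subset of negatives of the same size as the positive class, while Roughly-Balanced Bagging draws the negative-sample size from a negative binomial distribution.

```python
import random

def under_bagging_bag(positives, negatives, resample_minority=False):
    """Build one balanced UnderBagging bag: the majority (negative) class is
    randomly undersampled down to the minority-class size; the minority class
    is kept whole, or bootstrapped when extra a priori diversity is desired."""
    n_pos = len(positives)
    pos = ([random.choice(positives) for _ in range(n_pos)]
           if resample_minority else list(positives))
    neg = random.sample(negatives, n_pos)  # undersampling without replacement
    return pos + neg

def roughly_balanced_negative_size(n_pos, q=0.5):
    """Draw the negative-sample size for one Roughly-Balanced Bagging bag from
    a negative binomial distribution: count the failures observed before
    `n_pos` successes of probability q occur."""
    failures = successes = 0
    while successes < n_pos:
        if random.random() < q:
            successes += 1
        else:
            failures += 1
    return failures

random.seed(0)
pos = [("pos", i) for i in range(10)]
neg = [("neg", i) for i in range(100)]
print(len(under_bagging_bag(pos, neg)))  # 20: 10 positives + 10 negatives
```

With q = 0.5 the expected number of negatives equals the number of positives, so each bag is only roughly (not exactly) balanced, which is the distinguishing trait of the method.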

Boosting Based Ensembles

In this family we have included those algorithms which embed data preprocessing techniques into boosting algorithms. In this manner, these methods alter and bias the weight distribution used to train the next classifier towards the minority class in every iteration. Inside this family, we include the SMOTEBoost (N. V. Chawla, A. Lazarevic, L. O. Hall, K. W. Bowyer, SMOTEBoost: Improving prediction of the minority class in boosting, in: Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'03), 2003), RUSBoost (C. Seiffert, T. Khoshgoftaar, J. Van Hulse, A. Napolitano, RUSBoost: A hybrid approach to alleviating class imbalance, IEEE Transactions on Systems, Man and Cybernetics, Part A 40 (1) (2010) 185–197) and EUSBoost (M. Galar, A. Fernández, E. Barrenechea, F. Herrera, EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling, Pattern Recognition 46 (12) (2013) 3460–3471) algorithms.

  • SMOTEBoost: This method is based on the AdaBoost.M2 algorithm, and it introduces synthetic instances using SMOTE just before computing the new weights. The weights of the new instances are proportional to the total number of instances in the new data-set; hence, their weights are always the same (in all iterations and for all new instances), whereas the weights of the original data-set instances are normalized in such a way that they form a distribution together with the new instances. After training a classifier, the weights of the original data-set instances are updated, and then another sampling phase is applied (again, modifying the weight distribution). The repetition of this process also brings along more diversity in the training data, which generally benefits the ensemble learning.
  • RUSBoost: RUSBoost performs similarly to SMOTEBoost, but it removes instances from the majority class by randomly undersampling the data-set in each iteration. In this case, it is not necessary to assign new weights to the instances: it is enough to simply normalize the weights of the remaining instances in the new data-set with respect to their total sum. The rest of the procedure is the same as in SMOTEBoost.
  • EUSBoost: This technique follows the same scheme as RUSBoost, but with two main differences: (1) it uses the Evolutionary UnderSampling (EUS) approach as the preprocessing method; (2) it promotes the diversity of the ensemble using the Q-statistic diversity measure in the evolutionary process.
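The weight handling that distinguishes RUSBoost from SMOTEBoost, renormalizing the surviving weights after undersampling rather than assigning weights to new instances, can be sketched as follows (a simplified illustration; the function name is ours):

```python
def renormalize_after_undersampling(weights, kept_indices):
    """After random undersampling removes majority-class instances, no new
    weights need to be assigned: the weights of the remaining instances are
    divided by their total sum so that they again form a distribution."""
    kept = {i: weights[i] for i in kept_indices}
    total = sum(kept.values())
    return {i: w / total for i, w in kept.items()}

# Four equally weighted instances; undersampling keeps only instances 0 and 2.
weights = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
print(renormalize_after_undersampling(weights, [0, 2]))  # {0: 0.5, 2: 0.5}
```

The renormalized weights sum to one, so the next boosting round can use them directly as a sampling or error-weighting distribution.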

Hybrid Based Ensembles

The main difference between the algorithms in this category and the previous ones is that they carry out a double ensemble learning, that is, they combine both bagging and boosting (together with a preprocessing technique). The main representative of this type of approach is EasyEnsemble (X.-Y. Liu, J. Wu, Z.-H. Zhou, Exploratory undersampling for class-imbalance learning, IEEE Transactions on Systems, Man and Cybernetics, Part B 39 (2) (2009) 539–550), which is referred to as an exploratory undersampling technique. This method uses bagging as the main ensemble learning method, but instead of training a single classifier for each new bag, it trains each bag using AdaBoost. Hence, the final classifier is an ensemble of ensembles.

In the same manner as UnderBagging, each balanced bag is constructed by randomly undersampling instances from the majority class and using all the instances from the minority class. The difference between these methods is the way in which they treat the negative instances after each iteration as explained below.

This approach does not perform any operation on the instances of the original data-set after each AdaBoost iteration; hence, all the classifiers can be trained in parallel. Note that EasyEnsemble can be seen as UnderBagging where the base learner is AdaBoost: if we fix the total number of classifiers, EasyEnsemble will train fewer bags than UnderBagging, but more classifiers will be assigned to learn each single bag.
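This trade-off between the number of bags and the classifiers assigned to each bag can be made concrete with a small sketch (an illustration under our own naming, not code from the original paper):

```python
def easy_ensemble_layout(total_classifiers, boosting_rounds):
    """For a fixed budget of base classifiers, EasyEnsemble trains
    total_classifiers // boosting_rounds bags, fitting `boosting_rounds`
    AdaBoost members on each bag. UnderBagging with the same budget would
    instead train `total_classifiers` bags with one classifier each."""
    n_bags = total_classifiers // boosting_rounds
    return [[(bag, rnd) for rnd in range(boosting_rounds)]
            for bag in range(n_bags)]

layout = easy_ensemble_layout(40, 10)
print(len(layout), len(layout[0]))  # 4 bags, 10 boosted classifiers per bag
```

Each inner list stands for one balanced bag learned by a full AdaBoost run, which is why the final classifier is an ensemble of ensembles.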

Data-set partitions employed in the paper

In this study, we selected sixty-six data-sets from the KEEL data-set repository. AUC and GM metric estimates were obtained by means of a 5×5 DOB-SCV (Distribution Optimally Balanced Stratified Cross-Validation). That is, the data-set was split into 5 folds, each one containing 20% of the patterns of the data-set. For each fold, the algorithm was trained with the examples contained in the remaining folds and then tested on the current fold, using 5 different seeds to obtain the partition folds.
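For reference, the GM metric mentioned above is commonly defined as the geometric mean of the true positive and true negative rates; we assume that standard definition here (the helper name is ours):

```python
import math

def geometric_mean(tp, fn, tn, fp):
    """GM = sqrt(TPR * TNR): balances the true rates of both classes, so a
    classifier that ignores the minority class scores 0 no matter how high
    its overall accuracy is."""
    tpr = tp / (tp + fn)  # sensitivity on the positive (minority) class
    tnr = tn / (tn + fp)  # specificity on the negative (majority) class
    return math.sqrt(tpr * tnr)

print(round(geometric_mean(tp=8, fn=2, tn=90, fp=10), 4))  # sqrt(0.8*0.9) = 0.8485
```

Unlike plain accuracy, GM penalizes any imbalance between the two per-class rates, which makes it well suited to skewed class distributions.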

Table 1 summarises the properties of the selected data-sets: for each data-set, the name, number of examples, number of attributes, and IR (ratio between the number of majority and minority class instances) are shown. Data-sets are ordered with respect to their degree of imbalance. Multi-class problems were modified to obtain two-class imbalanced problems, defining the union of one or more classes as positive and the union of one or more of the remaining classes as negative, as indicated by the name of the data-set. The last column of this table contains a link for downloading the 5-fold cross-validation partitions for each data-set in KEEL format. You may also download all data-sets by clicking here.

 

Table 1. Summary description of data-sets.
Name #Ex. #Atts IR Download    Name #Ex. #Atts IR Download
glass1 214 9 1.82 (zip)    glass04vs5 92 9 9.22 (zip)
ecoli0vs1 220 7 1.86 (zip)    ecoli0346vs5 205 7 9.25 (zip)
wisconsin 683 9 1.86 (zip)    ecoli0347vs56 257 7 9.28 (zip)
pima 768 8 1.90 (zip)    yeast05679vs4 528 8 9.35 (zip)
iris0 150 4 2.00 (zip)    ecoli067vs5 220 6 10.00 (zip)
glass0 214 9 2.06 (zip)    vowel0 988 13 10.10 (zip)
yeast1 1484 8 2.46 (zip)    glass016vs2 192 9 10.29 (zip)
vehicle2 846 18 2.52 (zip)    glass2 214 9 10.39 (zip)
vehicle1 846 18 2.52 (zip)    ecoli0147vs2356 336 7 10.59 (zip)
vehicle3 846 18 2.52 (zip)    led7digit02456789vs1 443 7 10.97 (zip)
haberman 306 3 2.68 (zip)    ecoli01vs5 240 6 11.00 (zip)
glass0123vs456 214 9 3.19 (zip)    glass06vs5 108 9 11.00 (zip)
vehicle0 846 18 3.23 (zip)    glass0146vs2 205 9 11.06 (zip)
ecoli1 336 7 3.36 (zip)    ecoli0147vs56 332 6 12.28 (zip)
newthyroid2 215 5 4.92 (zip)    cleveland0vs4 177 13 12.62 (zip)
newthyroid1 215 5 5.14 (zip)    ecoli0146vs5 280 6 13.00 (zip)
ecoli2 336 7 5.46 (zip)    ecoli4 336 7 13.84 (zip)
segment0 2308 19 6.01 (zip)    shuttle0vs4 1829 9 13.87 (zip)
glass6 214 9 6.38 (zip)    yeast1vs7 459 8 13.87 (zip)
yeast3 1484 8 8.11 (zip)    glass4 214 9 15.47 (zip)
ecoli3 336 7 8.19 (zip)    pageblocks13vs4 472 10 15.85 (zip)
pageblocks0 5472 10 8.77 (zip)    abalone918 731 8 16.68 (zip)
ecoli034vs5 200 7 9.00 (zip)    glass016vs5 184 9 19.44 (zip)
yeast2vs4 514 8 9.08 (zip)    shuttle2vs4 129 9 20.50 (zip)
ecoli067vs35 222 7 9.09 (zip)    yeast1458vs7 693 8 22.10 (zip)
ecoli0234vs5 202 7 9.10 (zip)    glass5 214 9 22.81 (zip)
glass015vs2 172 9 9.12 (zip)    yeast2vs8 482 8 23.10 (zip)
yeast0359vs78 506 8 9.12 (zip)    yeast4 1484 8 28.41 (zip)
yeast0256vs3789 1004 8 9.14 (zip)    yeast1289vs7 947 8 30.56 (zip)
yeast02579vs368 1004 8 9.14 (zip)    yeast5 1484 8 32.78 (zip)
ecoli046vs5 203 6 9.15 (zip)    yeast6 1484 8 39.15 (zip)
ecoli01vs235 244 7 9.17 (zip)    ecoli0137vs26 281 7 39.15 (zip)
ecoli0267vs35 244 7 9.18 (zip)    abalone19 4174 8 128.87 (zip)

Experimental Results

We present the results of the experimental study divided into three parts, according to the performance metric employed:

  • The analysis of the ordering-based pruning approaches in terms of AUC. (xls)
  • The analysis of the ordering-based pruning approaches in terms of GM. (xls)
  • Finally, a comparative study carried out to examine the differences with respect to the baseline classifiers according to the true rates of the positive and negative classes. (xls)

All the results obtained in these subsections and reported in the paper can be downloaded as Excel documents by clicking on the corresponding links. We provide results for each ensemble classifier and pruning method.
 


Page Maintained by A. Fernández