Ordering-Based Pruning for Improving the Performance of Ensembles of Classifiers in the Framework of Imbalanced Datasets

This Website contains additional material to the paper

M. Galar, A. Fernández, E. Barrenechea, H. Bustince and F. Herrera, Ordering-Based Pruning for Improving the Performance of Ensembles of Classifiers in the Framework of Imbalanced Datasets. Submitted to Information Sciences (2015),

according to the following summary:

  1. Paper Content
  2. Addressing Imbalanced Problems by Ensemble Techniques
    1. Bagging Based Ensembles
    2. Boosting Based Ensembles
    3. Hybrid Based Ensembles
  3. Data-set partitions employed in the Paper
  4. Experimental Results

Paper Content

Abstract: The scenario of classification with imbalanced datasets has gained notable significance in recent years. This is due to the fact that a large number of problems where classes are highly skewed may be found, affecting the global performance of the system. Accordingly, a great number of approaches to address this problem have been developed. These techniques have traditionally been proposed under three different perspectives: data treatment, adaptation of algorithms, and cost-sensitive learning.

Ensemble-based models are an extension of the former solutions: they build a pool of classifiers, and they can in turn integrate any of these proposals. The quality and performance of this type of methodology over baseline solutions have been shown in several studies of the specialized literature.

The goal of this work is to improve the capabilities of ensemble-based solutions that were specifically designed for imbalanced classification, focusing on the best behaving bagging- and boosting-based ensembles in this scenario. In order to do so, this paper proposes several new metrics for ordering-based pruning, which are properly adapted to address the skewed-class distribution. From our experimental study we show two main results: on the one hand, the use of the newly developed metrics allows pruning to be a very successful approach in this scenario; on the other hand, the Under-Bagging model excels, achieving the highest gain from the usage of pruning, as it enables the selection of those random undersampling sets that complement each other best. Accordingly, this scheme outperforms the ensemble models selected from the state-of-the-art.

Summary:

  1. Introduction.
  2. Preliminaries: Ensembles for Classification with Imbalanced Datasets.
    1. Basic concepts on classification with imbalanced datasets.
    2. Ensemble methods.
  3. A Proposal for Ordering-Based Pruning Scheme for Ensembles in Imbalanced Domains.
    1. Pruning in ordered ensembles.
    2. Ordering-based ensemble pruning metrics for imbalanced domains.
  4. Experimental framework.
    1. Benchmark data.
    2. Selected parameters.
    3. Statistical tests for performance comparison.
    4. Web page associated with the paper.
  5. Experimental study.
    1. Analysis of the pruning metrics of ensembles in imbalanced classification.
    2. Intra-family comparison for ensemble ordering-based pruning in imbalanced classification.
  6. Lessons learned.
  7. Concluding remarks.

Addressing Imbalanced Problems by Ensemble Techniques

Ensemble-based classifiers, also known as multiple classifier systems (R. Polikar, Ensemble based systems in decision making, IEEE Circuits and Systems Magazine 6 (3) (2006) 21–45), try to improve the performance of single classifiers by inducing several classifiers and combining them to obtain a new classifier that outperforms every one of them. Hence, the basic idea is to construct several classifiers from the original data and then aggregate their predictions when unknown instances are presented.
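The aggregation step described above can be illustrated with a minimal sketch (the function and variable names below are ours, not taken from any of the cited works): a simple majority vote over a pool of classifiers.

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Aggregate the predictions of several classifiers for instance `x`
    by majority vote; the most frequent label wins."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Three toy threshold "classifiers" over a single numeric feature.
clfs = [lambda x: int(x > 0.3), lambda x: int(x > 0.5), lambda x: int(x > 0.7)]
print(majority_vote(clfs, 0.6))  # two of the three vote for class 1 -> prints 1
```

In practice the combined prediction is often better than that of any single member, provided the members make (at least partially) independent errors.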

In recent years, ensembles of classifiers have arisen as a possible solution to the class imbalance problem, attracting great interest among researchers. Ensemble-based methods are built by combining the ensemble learning algorithm with one of the previously discussed techniques, namely data-level approaches or algorithmic approaches based on cost-sensitive learning. When a data-level approach is added to the ensemble learning algorithm, the new hybrid method usually preprocesses the data before training each classifier. Cost-sensitive ensembles, on the other hand, instead of modifying the base classifier in order to accept costs in the learning process, guide the cost minimization via the ensemble learning algorithm. This way, the modification of the base learner is avoided, but the major drawback, which is the definition of the costs, is still present.

A complete taxonomy of ensemble methods for learning with imbalanced classes can be found in a recent review (M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 42 (4) (2012) 463–484). We encourage readers to refer to this paper for a global vision of ensembles for imbalanced classification, as well as to check the original references for each of the methods described next.

In the former review, the authors distinguished four different families of ensemble approaches for imbalanced learning. On the one hand, they identified cost-sensitive boosting approaches, which are similar to cost-sensitive methods, but where the cost minimization procedure is guided by a boosting algorithm. On the other hand, they distinguished three more families that share a common feature: all of them consist of embedding a data preprocessing technique in an ensemble learning algorithm. These three families are categorized depending on the ensemble learning algorithm used, i.e., bagging, boosting and hybrid ensembles.

Bagging Based Ensembles

Many approaches have been developed using bagging ensembles to deal with class imbalance problems, due to their simplicity and good generalization ability. The hybridization of bagging and data preprocessing techniques is usually simpler than their integration in boosting. A bagging algorithm does not need to recompute any kind of weights; therefore, there is no need to adapt a weight-update formula or to change any computations in the algorithm. In these methods, the key factor is the way each bootstrap replica is collected, that is, how the class imbalance problem is handled in order to obtain a useful classifier in each iteration, without forgetting the importance of diversity.

We have selected three algorithms in this family, the first two being SMOTE-Bagging (S. Wang, X. Yao, Diversity analysis on imbalanced data sets by using ensemble models, in: Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining (CIDM'09), 2009) and UnderBagging (R. Barandela, R. Valdovinos, J. Sánchez, New applications of ensembles of classifiers, Pattern Analysis and Applications 6 (3) (2003) 245–256).

  • SMOTE-Bagging: An easy way to overcome the class imbalance problem in each bag is to take the classes of the instances into account when they are randomly drawn from the original data-set. Hence, instead of performing a random sampling of the whole data-set, an oversampling process can be carried out before training each classifier (OverBagging). This procedure can be developed in at least two ways. Oversampling consists in increasing the number of minority class instances by replication; all majority class instances can be included in each new bootstrap sample, but another option is to resample them in order to increase diversity. Notice that in OverBagging probably all instances will take part in at least one bag, but each bootstrap replica will contain many more instances than the original data-set.

    In this case, minority class instances are oversampled via the SMOTE preprocessing algorithm, leading to the SMOTEBagging approach. This differs from the use of random oversampling not only in the preprocessing mechanism, but also in the way each bag is created. As in OverBagging, both classes contribute to each bag with $N_{maj}$ instances. However, a SMOTE resampling rate ($a\%$) is set in each iteration (ranging from $10\%$ in the first iteration to $100\%$ in the last, always in multiples of $10$), and this ratio defines the number of positive instances ($a\%\cdot N_{maj}$) randomly resampled (with replacement) from the original data-set in each iteration. The rest of the positive instances are generated by the SMOTE algorithm. Besides, the set of negative instances is bootstrapped in each iteration in order to form a more diverse ensemble.

  • Under-Bagging: Contrary to OverBagging, the UnderBagging procedure uses undersampling instead of oversampling. However, in the same manner as OverBagging, it can be developed in at least two ways. The undersampling procedure is usually applied only to the majority class; however, a resampling with replacement of the minority class can also be applied in order to obtain a priori more diverse ensembles. Note that in UnderBagging it is more probable to ignore some useful negative instances, but each bag has fewer instances than the original data-set (contrary to OverBagging).
  • Roughly-Balanced Bagging: This algorithm sets the number of positive samples in each bag equal to the number of positive examples in the original data-set. If they are sampled without replacement, all of the positive examples will be contained in every sampled subset. In contrast, the number of negative samples is decided probabilistically according to a negative binomial distribution, whose parameters are the number of minority (i.e., positive) examples and the probability of success q = 0.5. The key is that the examples of both classes are drawn with equal probability, but only the number of negative samples varies, while the number of positive samples is kept constant since it is small. There are no significant differences in results between sampling with and without replacement. In prediction, the aggregated model simply outputs the class label with the highest average probability estimated by the base models. The implementation considered in our work includes all minority class examples in each bag and performs resampling with replacement only over the majority class.
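The bag-construction schemes above can be sketched as follows (a simplified illustration with hypothetical function names, not code from the original papers): UnderBagging draws a random subset of negatives of the same size as the positive class, while Roughly-Balanced Bagging draws the negative-sample size from a negative binomial distribution.

```python
import random

def under_bagging_bag(positives, negatives, resample_minority=False):
    """Build one balanced UnderBagging bag: the majority (negative) class is
    randomly undersampled down to the minority-class size; the minority class
    is kept whole, or bootstrapped when extra a priori diversity is desired."""
    n_pos = len(positives)
    pos = ([random.choice(positives) for _ in range(n_pos)]
           if resample_minority else list(positives))
    neg = random.sample(negatives, n_pos)  # undersampling without replacement
    return pos + neg

def roughly_balanced_negative_size(n_pos, q=0.5):
    """Draw the negative-sample size for one Roughly-Balanced Bagging bag from
    a negative binomial distribution: count the failures observed before
    `n_pos` successes of probability q occur."""
    failures = successes = 0
    while successes < n_pos:
        if random.random() < q:
            successes += 1
        else:
            failures += 1
    return failures

random.seed(0)
pos = [("pos", i) for i in range(10)]
neg = [("neg", i) for i in range(100)]
print(len(under_bagging_bag(pos, neg)))  # 20: 10 positives + 10 negatives
```

With q = 0.5 the expected number of negatives equals the number of positives, so each bag is only roughly (not exactly) balanced, which is the distinguishing trait of the method.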

Boosting Based Ensembles

In this family we have included those algorithms which embed data preprocessing techniques into boosting algorithms. In this manner, these methods alter and bias the weight distribution used to train the next classifier towards the minority class in every iteration. Inside this family, we include the SMOTEBoost (N. V. Chawla, A. Lazarevic, L. O. Hall, K. W. Bowyer, SMOTEBoost: Improving prediction of the minority class in boosting, in: Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'03), 2003), RUSBoost (C. Seiffert, T. Khoshgoftaar, J. Van Hulse, A. Napolitano, RUSBoost: A hybrid approach to alleviating class imbalance, IEEE Transactions on Systems, Man and Cybernetics, Part A 40 (1) (2010) 185–197) and EUSBoost (M. Galar, A. Fernández, E. Barrenechea, F. Herrera, EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling, Pattern Recognition 46 (12) (2013) 3460–3471) algorithms.

  • SMOTEBoost: This method is based on the AdaBoost.M2 algorithm, and it introduces synthetic instances using SMOTE just before computing the new weights. The weights of the new instances are proportional to the total number of instances in the new data-set; hence, their weights are always the same (in all iterations and for all new instances), whereas the weights of the original data-set instances are normalized in such a way that they form a distribution together with the new instances. After training a classifier, the weights of the original data-set instances are updated, and then another sampling phase is applied (again, modifying the weight distribution). The repetition of this process also brings along more diversity in the training data, which generally benefits the ensemble learning.
  • RUSBoost: RUSBoost performs similarly to SMOTEBoost, but it removes instances from the majority class by randomly undersampling the data-set in each iteration. In this case, it is not necessary to assign new weights to the instances: it is enough to simply normalize the weights of the remaining instances in the new data-set with respect to their total sum. The rest of the procedure is the same as in SMOTEBoost.
  • EUSBoost: This technique follows the same scheme as RUSBoost, but with two main differences: (1) it uses the Evolutionary UnderSampling (EUS) approach as the preprocessing method; (2) it promotes the diversity of the ensemble using the Q-statistic diversity measure in the evolutionary process.
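The weight handling that distinguishes RUSBoost from SMOTEBoost, renormalizing the surviving weights after undersampling rather than assigning weights to new instances, can be sketched as follows (a simplified illustration; the function name is ours):

```python
def renormalize_after_undersampling(weights, kept_indices):
    """After random undersampling removes majority-class instances, no new
    weights need to be assigned: the weights of the remaining instances are
    divided by their total sum so that they again form a distribution."""
    kept = {i: weights[i] for i in kept_indices}
    total = sum(kept.values())
    return {i: w / total for i, w in kept.items()}

# Four equally weighted instances; undersampling keeps only instances 0 and 2.
weights = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
print(renormalize_after_undersampling(weights, [0, 2]))  # {0: 0.5, 2: 0.5}
```

The renormalized weights sum to one, so the next boosting round can use them directly as a sampling or error-weighting distribution.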

Hybrid Based Ensembles

The main difference between the algorithms in this category and the previous ones is that they carry out a double ensemble learning, that is, they combine both bagging and boosting (together with a preprocessing technique). The main representative of this type of approach is EasyEnsemble (X.-Y. Liu, J. Wu, Z.-H. Zhou, Exploratory undersampling for class-imbalance learning, IEEE Transactions on Systems, Man and Cybernetics, Part B 39 (2) (2009) 539–550), which is referred to as an exploratory undersampling technique. This method uses bagging as the main ensemble learning method, but instead of training a single classifier for each new bag, it trains each bag using AdaBoost. Hence, the final classifier is an ensemble of ensembles.

In the same manner as UnderBagging, each balanced bag is constructed by randomly undersampling instances from the majority class and using all the instances from the minority class. The difference between these methods is the way in which they treat the negative instances after each iteration as explained below.

This approach does not perform any operation on the instances of the original data-set after each AdaBoost iteration; hence, all the classifiers can be trained in parallel. Note that EasyEnsemble can be seen as UnderBagging where the base learner is AdaBoost: if we fix the total number of classifiers, EasyEnsemble will train fewer bags than UnderBagging, but more classifiers will be assigned to learn each single bag.
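This trade-off between the number of bags and the classifiers assigned to each bag can be made concrete with a small sketch (an illustration under our own naming, not code from the original paper):

```python
def easy_ensemble_layout(total_classifiers, boosting_rounds):
    """For a fixed budget of base classifiers, EasyEnsemble trains
    total_classifiers // boosting_rounds bags, fitting `boosting_rounds`
    AdaBoost members on each bag. UnderBagging with the same budget would
    instead train `total_classifiers` bags with one classifier each."""
    n_bags = total_classifiers // boosting_rounds
    return [[(bag, rnd) for rnd in range(boosting_rounds)]
            for bag in range(n_bags)]

layout = easy_ensemble_layout(40, 10)
print(len(layout), len(layout[0]))  # 4 bags, 10 boosted classifiers per bag
```

Each inner list stands for one balanced bag learned by a full AdaBoost run, which is why the final classifier is an ensemble of ensembles.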

Data-set partitions employed in the paper

In this study, we selected sixty-six data-sets from the KEEL data-set repository. AUC and GM metric estimates were obtained by means of a 5×5 DOB-SCV (Distribution Optimally Balanced Stratified Cross-Validation). That is, the data-set was split into 5 folds, each one containing 20% of the patterns of the data-set. For each fold, the algorithm was trained with the examples contained in the remaining folds and then tested on the current fold, using 5 different seeds to obtain the partition folds.
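For reference, the GM metric mentioned above is commonly defined as the geometric mean of the true positive and true negative rates; we assume that standard definition here (the helper name is ours):

```python
import math

def geometric_mean(tp, fn, tn, fp):
    """GM = sqrt(TPR * TNR): balances the true rates of both classes, so a
    classifier that ignores the minority class scores 0 no matter how high
    its overall accuracy is."""
    tpr = tp / (tp + fn)  # sensitivity on the positive (minority) class
    tnr = tn / (tn + fp)  # specificity on the negative (majority) class
    return math.sqrt(tpr * tnr)

print(round(geometric_mean(tp=8, fn=2, tn=90, fp=10), 4))  # sqrt(0.8*0.9) = 0.8485
```

Unlike plain accuracy, GM penalizes any imbalance between the two per-class rates, which makes it well suited to skewed class distributions.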

Table 1 summarises the properties of the selected data-sets: for each data-set, the name, number of examples, number of attributes, and IR (ratio between the number of majority and minority class instances) are shown. Data-sets are ordered with respect to their degree of imbalance. Multi-class problems were modified to obtain two-class imbalanced problems, defining the union of one or more classes as positive and the union of one or more of the remaining classes as negative, as indicated by the name of the data-set. The last column of this table contains a link for downloading the 5-fold cross-validation partitions for each data-set in KEEL format. You may also download all data-sets by clicking here.

 

Table 1. Summary description of data-sets.
Name #Ex. #Atts IR Download    Name #Ex. #Atts IR Download
glass1 214 9 1.82 (zip)    glass04vs5 92 9 9.22 (zip)
ecoli0vs1 220 7 1.86 (zip)    ecoli0346vs5 205 7 9.25 (zip)
wisconsin 683 9 1.86 (zip)    ecoli0347vs56 257 7 9.28 (zip)
pima 768 8 1.90 (zip)    yeast05679vs4 528 8 9.35 (zip)
iris0 150 4 2.00 (zip)    ecoli067vs5 220 6 10.00 (zip)
glass0 214 9 2.06 (zip)    vowel0 988 13 10.10 (zip)
yeast1 1484 8 2.46 (zip)    glass016vs2 192 9 10.29 (zip)
vehicle2 846 18 2.52 (zip)    glass2 214 9 10.39 (zip)
vehicle1 846 18 2.52 (zip)    ecoli0147vs2356 336 7 10.59 (zip)
vehicle3 846 18 2.52 (zip)    led7digit02456789vs1 443 7 10.97 (zip)
haberman 306 3 2.68 (zip)    ecoli01vs5 240 6 11.00 (zip)
glass0123vs456 214 9 3.19 (zip)    glass06vs5 108 9 11.00 (zip)
vehicle0 846 18 3.23 (zip)    glass0146vs2 205 9 11.06 (zip)
ecoli1 336 7 3.36 (zip)    ecoli0147vs56 332 6 12.28 (zip)
newthyroid2 215 5 4.92 (zip)    cleveland0vs4 177 13 12.62 (zip)
newthyroid1 215 5 5.14 (zip)    ecoli0146vs5 280 6 13.00 (zip)
ecoli2 336 7 5.46 (zip)    ecoli4 336 7 13.84 (zip)
segment0 2308 19 6.01 (zip)    shuttle0vs4 1829 9 13.87 (zip)
glass6 214 9 6.38 (zip)    yeast1vs7 459 8 13.87 (zip)
yeast3 1484 8 8.11 (zip)    glass4 214 9 15.47 (zip)
ecoli3 336 7 8.19 (zip)    pageblocks13vs4 472 10 15.85 (zip)
pageblocks0 5472 10 8.77 (zip)    abalone918 731 8 16.68 (zip)
ecoli034vs5 200 7 9.00 (zip)    glass016vs5 184 9 19.44 (zip)
yeast2vs4 514 8 9.08 (zip)    shuttle2vs4 129 9 20.50 (zip)
ecoli067vs35 222 7 9.09 (zip)    yeast1458vs7 693 8 22.10 (zip)
ecoli0234vs5 202 7 9.10 (zip)    glass5 214 9 22.81 (zip)
glass015vs2 172 9 9.12 (zip)    yeast2vs8 482 8 23.10 (zip)
yeast0359vs78 506 8 9.12 (zip)    yeast4 1484 8 28.41 (zip)
yeast0256vs3789 1004 8 9.14 (zip)    yeast1289vs7 947 8 30.56 (zip)
yeast02579vs368 1004 8 9.14 (zip)    yeast5 1484 8 32.78 (zip)
ecoli046vs5 203 6 9.15 (zip)    yeast6 1484 8 39.15 (zip)
ecoli01vs235 244 7 9.17 (zip)    ecoli0137vs26 281 7 39.15 (zip)
ecoli0267vs35 244 7 9.18 (zip)    abalone19 4174 8 128.87 (zip)

Experimental Results

We present the results of the experimental study divided into three parts, according to the performance metric employed:

  • The analysis of the ordering-based pruning approaches in terms of AUC. (xls)
  • The analysis of the ordering-based pruning approaches in terms of GM. (xls)
  • Finally, a comparative study carried out to examine the differences with respect to the baseline classifiers according to the true rates of the positive and negative classes. (xls)

All the results obtained in these subsections and reported in the paper can be downloaded as Excel documents by clicking on the corresponding links. We provide results for each ensemble classifier and pruning method.
 


Page Maintained by A. Fernández