Addressing the Noisy and Borderline Examples Problem in Classification with Imbalanced Datasets via a Class Noise Filtering Method-based Re-sampling Technique

This Website contains complementary material to the paper:

José A. Sáez, Julián Luengo, Jerzy Stefanowski, Francisco Herrera, Addressing the Noisy and Borderline Examples Problem in Classification with Imbalanced Datasets via a Class Noise Filtering Method-based Re-sampling Technique. Pattern Recognition, SUBMITTED.

The web is organized according to the following summary:

  1. Abstract
  2. Introduction
  3. Re-sampling Techniques
  4. Analysis Methodology
  5. Datasets
  6. Performance Evaluation of Re-sampling Methods

Abstract

When performing a classification task, datasets often present an uneven distribution of examples among their classes. This problem is known as imbalanced classification and it is currently a relevant topic in the literature. Traditionally, balancing the number of examples of each class has been the problem to solve, and re-sampling techniques have been widely used to this end.

However, recent works have claimed that class imbalance is not a problem in itself: the degradation of classification performance is associated with other factors related to data distribution. One of them is the higher or lower presence of examples in the area surrounding class boundaries, where classes overlap; these are the so-called borderline examples. A classic way in standard classification to reduce the negative effects of borderline and noisy examples on the performance of a classifier is the application of class noise filtering methods.

This paper aims to propose the combined use of re-sampling techniques along with class noise filters to deal with imbalanced datasets with noisy and borderline examples. Concretely, we propose a method based on the Synthetic Minority Over-sampling Technique and the Iterative-Partitioning Filter. We compare it against many other re-sampling techniques over a set of synthetic datasets created with different levels of borderline examples, and also over real-world ones. The results obtained show the suitability of our proposal in the different scenarios considered.

Introduction

Many current real-world classification problems present a highly imbalanced distribution of examples between classes. These problems, formally known as imbalanced classification problems, contain more examples of one class (the majority or negative class) than of the other (the minority or positive class). The latter usually represents the most interesting concept from the point of view of learning.

This issue has been intensively researched in the last decade and several methods have been proposed to address it. They can be divided into algorithmic and data approaches. The former introduce modifications into the operation of the algorithms, making them cost-sensitive towards the minority class. The latter are classifier-independent and modify the data distribution to change the balance between classes, taking into account local characteristics of examples. Re-sampling of data can be done by means of under-sampling, removing instances from the data, or over-sampling, replicating or generating new minority class examples. There have been numerous works exemplifying their advantages. The Synthetic Minority Over-sampling Technique (SMOTE) is one of the most well-known over-sampling techniques. Its main idea is to form new minority class examples by interpolating between several minority class examples that lie together.

However, there are works in the literature that have already claimed that class imbalance is not a problem in itself: the degradation of classification performance is linked to other factors related to data distributions. Among them, in \cite{Napierala10Learning} a study about the effect of noisy and borderline examples on classification performance in imbalanced datasets is carried out. Borderline examples are defined as the examples located in the area surrounding class boundaries, where the minority and majority classes overlap. The authors demonstrated the influence of the quantity of examples of this kind on the degradation of the performance of classifiers.

In standard classification, class noise filtering methods are classically used in order to detect and eliminate noisy examples from training datasets. These methods also help to clean up the data and to create more regular class boundaries. Among these methods, the Iterative-Partitioning Filter (IPF) stands out due to its good results when detecting this kind of example.

In this work, our aim is to propose and analyze the suitability of an approach to deal with imbalanced problems with noisy and borderline examples. Our proposal is based on the combined use of a re-sampling technique, the SMOTE algorithm, and a class noise filter, the IPF method. We will study the differences between this strategy and other re-sampling techniques through an analysis of the performance of different classifiers over the preprocessed datasets, which we will also contrast using the proper statistical tests as recommended in the specialized literature. We claim that the distribution of borderline and noisy examples causes difficulties for learning algorithms, and thus we focus our interest on carefully processing these examples.

The expected good performance of our proposal is based on the two methods complementing each other perfectly to address imbalanced problems with borderline examples:

  • The SMOTE algorithm fulfills a dual function: it balances the class distribution and it helps to fill the interior of the minority class clusters defined by borderline examples.
  • The IPF filter removes the noisy examples originally present in the dataset and also those created by SMOTE. Besides, IPF cleans up the boundaries of the classes, making them more regular.
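As an illustration of the filtering side of this combination, the following is a minimal sketch of the iterative partitioning idea behind IPF. It is not the published algorithm: a 1-NN rule stands in for the C4.5 base learners usually employed by IPF, the majority voting scheme is fixed, and the stopping rule is simplified.

```python
import numpy as np

def one_nn_predict(Xtr, ytr, X):
    """Minimal 1-NN rule used here as the base learner (IPF normally
    uses decision trees such as C4.5)."""
    d = np.linalg.norm(X[:, None, :] - Xtr[None, :, :], axis=2)
    return ytr[np.argmin(d, axis=1)]

def ipf_filter(X, y, n_parts=3, max_iter=5, stop_frac=0.01, rng=0):
    """Simplified IPF loop: split the data into n_parts partitions, train
    one classifier per partition, flag examples misclassified by a
    majority of the classifiers, remove them, and repeat until almost
    nothing is flagged."""
    rng = np.random.default_rng(rng)
    keep = np.arange(len(y))                  # indices still considered clean
    for _ in range(max_iter):
        idx = rng.permutation(len(keep))
        parts = np.array_split(idx, n_parts)
        votes = np.zeros(len(keep), dtype=int)
        for p in parts:
            pred = one_nn_predict(X[keep][p], y[keep][p], X[keep])
            votes += (pred != y[keep]).astype(int)
        noisy = votes > n_parts // 2          # majority voting scheme
        if noisy.sum() <= stop_frac * len(keep):
            break                             # simplified stopping criterion
        keep = keep[~noisy]
    return keep
```

On a toy dataset with one mislabeled point inside the opposite class's cluster, the loop removes that point while keeping the clean examples.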

Re-sampling Techniques

In the specialized literature, several papers study re-sampling techniques from the point of view of the effect of the class distribution on classification with imbalanced datasets. It has been shown that applying a preprocessing step in order to balance the class distribution is a positive solution to the imbalanced dataset problem. Besides, the main advantage of these techniques is that they are independent of the classifier used. In this work we evaluate different re-sampling techniques to adjust the class distribution in the training data. Specifically, we have used the following methods:

Synthetic Minority Over-sampling Technique (SMOTE). SMOTE is an over-sampling method whose main idea is to form new minority class examples by interpolating between several minority class examples that lie together.
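The interpolation idea can be sketched as follows. This is a simplification of the published algorithm: one synthetic point is generated per randomly chosen seed, rather than SMOTE's systematic per-example generation driven by the over-sampling percentage.

```python
import numpy as np

def smote_sample(minority, k=5, n_new=10, rng=None):
    """Generate n_new synthetic minority points, each interpolated between
    a random seed and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # distances from the seed to every minority example
        d = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the seed itself
        z = minority[rng.choice(neighbours)]
        gap = rng.random()                    # random position on the segment
        synthetic.append(x + gap * (z - x))
    return np.array(synthetic)
```

Because each synthetic point is a convex combination of two minority examples, all generated points stay inside the bounding box of the minority class.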

SMOTE + Tomek Links (SMOTE+TL). Frequently, class clusters are not well defined, since some majority class examples might be invading the minority class space. The opposite can also be true, since interpolating minority class examples can expand the minority class clusters, introducing artificial minority class examples too deeply into the majority class space. Inducing a classifier under such a situation can lead to overfitting. In order to create better-defined class clusters, Tomek Links are applied to the over-sampled training set as a data cleaning method. Thus, instead of removing only the majority class examples that form Tomek Links, examples from both classes are removed.
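A Tomek link is a pair of mutual nearest neighbours with different class labels. A naive detection sketch (the function name is illustrative; a brute-force distance matrix is used for clarity) could be:

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links: each is the
    other's nearest neighbour and their class labels differ."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)               # a point is not its own neighbour
    nn = np.argmin(d, axis=1)                 # nearest neighbour of each point
    links = []
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j] and i < j:
            links.append((i, j))
    return links
```

The cleaning step then deletes the examples involved in each detected pair (from both classes, in the SMOTE+TL variant described above).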

SMOTE + Edited Nearest Neighbor (SMOTE+ENN). The motivation behind this method is similar to that of SMOTE+TL. ENN tends to remove more examples than Tomek Links do, so it is expected to provide a more in-depth data cleaning. ENN is used to remove examples from both classes: any example that is misclassified by its three nearest neighbors is removed from the training set.
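A minimal sketch of the ENN cleaning rule described above (three nearest neighbours, applied to both classes; brute-force distances for clarity) might be:

```python
import numpy as np

def enn_filter(X, y, k=3):
    """Wilson's ENN editing rule: keep only the examples whose class label
    matches the majority label among their k nearest neighbours."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)               # exclude the point itself
    keep = []
    for i in range(len(y)):
        nn = np.argsort(d[i])[:k]
        # majority label among the k neighbours
        if np.bincount(y[nn]).argmax() == y[i]:
            keep.append(i)
    return np.array(keep)
```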

Borderline + SMOTE. This method only oversamples, or strengthens, the borderline minority examples. First, it finds the set $P$ of borderline minority examples, defined as the examples of the minority class having more than half, but not all, of their $m$ nearest neighbors belonging to the majority class. Then, for each of these examples, it computes its $k$ nearest neighbors from $P$ (in the algorithm version B1+SMOTE) or from all the training data (in the algorithm version B2+SMOTE) and operates similarly to SMOTE: synthetic examples are generated from them and added to the original training set.
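The first step, identifying the borderline (in the original formulation, "DANGER") minority examples, could be sketched as follows; examples with all $m$ neighbours in the majority class are treated as noise and excluded, and examples with fewer than half are considered safe:

```python
import numpy as np

def danger_examples(X, y, minority_label=1, m=5):
    """Flag minority examples with more than half, but not all, of their
    m nearest neighbours in the majority class."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    danger = []
    for i in np.where(y == minority_label)[0]:
        nn = np.argsort(d[i])[:m]
        n_maj = np.sum(y[nn] != minority_label)
        if m / 2 <= n_maj < m:                # borderline, not pure noise
            danger.append(i)
    return np.array(danger)
```

Synthetic generation then proceeds as in SMOTE, but seeded only from the returned indices.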

Safe Levels + SMOTE (SL+SMOTE). This method assigns each positive instance its safe level before generating synthetic instances. Each synthetic instance is positioned closer to the instance with the larger safe level, so all synthetic instances are generated only in safe regions.
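A simplified reading of how the interpolation gap can be biased by the safe levels (safe level = number of minority examples among an instance's k nearest neighbours) is sketched below; the exact case analysis of the published method may differ in detail:

```python
import numpy as np

def safe_level_gap(sl_p, sl_n, rng):
    """Choose the gap g for x_syn = p + g * (n - p) from the safe levels of
    seed p and neighbour n, pushing the synthetic point towards the
    safer of the two (simplified Safe-Level-SMOTE case analysis)."""
    if sl_p == 0 and sl_n == 0:
        return None                      # both unsafe: generate nothing
    if sl_n == 0:
        return 0.0                       # only the seed is safe: duplicate it
    ratio = sl_p / sl_n
    if ratio == 1:
        return rng.random()              # equally safe: anywhere on the segment
    if ratio > 1:
        return rng.random() / ratio      # seed safer: stay close to the seed
    return 1 - ratio * rng.random()      # neighbour safer: stay close to it
```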

SPIDER2. This method consists of two phases, corresponding to the preprocessing of the majority and minority classes respectively. In the first phase, it identifies the characteristics of examples from the majority class, and it either removes or relabels the noisy examples of this class. In the second phase, it identifies the characteristics of examples from the minority class, considering the changes introduced in the first phase. Then, noisy examples from the minority class are amplified by replicating them.

The papers associated with these methods can be downloaded here: ZIP file

The implementation can be found in the KEEL Software Tool

Analysis Methodology

In order to validate our hypothesis and to extract meaningful conclusions, we will carry out controlled experiments with special synthetic datasets. These datasets present different shapes of the minority class and different levels of borderline examples. In addition, we will also consider a set of real-world datasets. All of them were used in \cite{Napierala10Learning} and are available in the KEEL-dataset repository. We will preprocess all these datasets with many well-known re-sampling techniques in order to study the goodness of our strategy for imbalanced problems with borderline examples. Over these preprocessed datasets, we will evaluate classifiers built with several classification algorithms.


Classification Algorithms and Parameters

The effect of the above mentioned preprocessing techniques will be analyzed with respect to several well-known classification algorithms from the literature. A brief description of these algorithms is given in the following:

  • C4.5 is a decision tree generating algorithm. It has been used in many recent analyses of imbalanced data. It induces classification rules in the form of decision trees from a set of given examples. The decision tree is constructed in a top-down way, using the normalized information gain (difference in entropy) that results from choosing an attribute to split the data. The attribute with the highest normalized information gain is the one used to make the decision.
  • k-Nearest Neighbors (k-NN) finds a group of k instances in the training set that are closest to the test pattern. The predicted class label is based on the predominance of a particular class in this neighborhood. The distance and the number of neighbors are key elements of this algorithm.
  • Support Vector Machine (SVM) maps the original input space into a high-dimensional feature space via a certain kernel function (avoiding the explicit computation of coordinates in that space). In the new feature space, the optimal separating hyperplane with maximal margin is determined in order to minimize an upper bound of the expected risk instead of the empirical risk. We use the SMO training algorithm to obtain the SVM classifiers.
  • Repeated Incremental Pruning to Produce Error Reduction (RIPPER) builds a decision list of rules to predict the corresponding class for each instance. The list of rules is grown one by one and immediately pruned. Once a decision list for a given class is completely learned, an optimization stage is performed.
  • PART is based on the C4.5 algorithm. In each iteration, a partial C4.5 tree is generated and its best rule is extracted (the one which covers the most examples). The algorithm ends when all the examples are covered. A partial tree is one whose branches are not fully explored. When a node has grown all its children, it becomes eligible for pruning; if it is not pruned, exploration ends.
  • Quadratic Discriminant Analysis (QDA) algorithm separates examples of two or more classes by a quadric surface. QDA is closely related to Linear Discriminant Analysis, where it is assumed that there are only two classes of examples, and that the measurements are normally distributed. Unlike LDA however, in QDA there is no assumption that the covariance of each of the classes is identical. When the assumption is true, the best possible test for the hypothesis that a given measurement is from a given class is the likelihood ratio test.
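As an illustration of the last of these classifiers, the following is a compact QDA sketch: a Gaussian with its own mean and full covariance is fitted per class, and prediction picks the class with the highest log-density plus log-prior (equivalent to the likelihood-ratio rule mentioned above). Function names are illustrative.

```python
import numpy as np

def qda_fit(X, y):
    """Fit a per-class Gaussian (mean, full covariance -- unlike LDA's
    shared covariance) and the class prior."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (Xc.mean(axis=0), np.cov(Xc.T), len(Xc) / len(y))
    return model

def qda_predict(model, X):
    """Assign each point to the class maximizing log N(x; mu_c, S_c) + log p_c."""
    classes = sorted(model)
    scores = []
    for c in classes:
        mu, cov, prior = model[c]
        diff = X - mu
        inv = np.linalg.inv(cov)
        logdet = np.linalg.slogdet(cov)[1]
        s = -0.5 * (np.sum(diff @ inv * diff, axis=1) + logdet) + np.log(prior)
        scores.append(s)
    return np.array(classes)[np.argmax(scores, axis=0)]
```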

The classification algorithms have been executed with the KEEL tool using the parameters shown in the following table:

The papers associated with these methods can be downloaded here: ZIP file

The implementation can be found in the KEEL Software Tool

Performance Measures and Statistical Tests

In order to evaluate the performance of the classifiers built, we use the Area Under the ROC Curve (AUC) obtained by the classification algorithms over the test sets, averaged across all classification problems. Over the test AUC results, we also apply Wilcoxon's signed-ranks statistical test with a significance level of α = 0.1.
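As an illustration, such a paired comparison over per-dataset AUC values can be run with SciPy; the numbers below are made up for the example, not results from the paper:

```python
import numpy as np
from scipy import stats

# Hypothetical test AUC values per dataset for two settings
# (illustrative numbers only)
auc_smote_ipf = np.array([0.91, 0.88, 0.84, 0.93, 0.79, 0.86, 0.90, 0.82])
auc_none      = np.array([0.85, 0.86, 0.81, 0.85, 0.75, 0.85, 0.83, 0.77])

# Wilcoxon's signed-ranks test on the paired per-dataset differences
stat, p_value = stats.wilcoxon(auc_smote_ipf, auc_none)
if p_value < 0.1:   # alpha = 0.1, as in the analysis methodology
    print("reject the null hypothesis of equivalent performance")
```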

Datasets

Synthetic Imbalanced Datasets with Borderline Examples

As mentioned in X, class imbalance in itself is not the main problem in imbalanced domains; the degradation of classification performance is linked to other factors related to data distributions. Following these motivations, we prepare our synthetic datasets varying the presence and frequency of borderline examples, as pointed out in Y. We also consider the decomposition of the minority class into smaller sub-clusters, in a similar way to that described in Y, and the role of changing the decision boundary between classes from linear to non-linear shapes. Concretely, the building of the synthetic datasets is based on the procedure followed in Y, whose main characteristics are:

  • Number of classes and attributes. We focus on binary classification problems (the minority versus the majority class) with examples randomly and uniformly distributed in a two-dimensional real-valued space.
  • Number of examples and imbalanced ratios. We generate multiple datasets with two different numbers of examples and imbalance ratios: datasets with 600 examples and $IR = 5$ and datasets with 800 examples and $IR = 7$.
  • Shapes of the minority class. We consider three different shapes of the minority class surrounded uniformly by the majority class:
    • Sub-cluster (Sub-figure (a)): the examples from the minority class are located inside rectangles following related works on small disjuncts \cite{Japkowicz03ClassImbalance}.
    • Clover (Sub-figure (b)): represents a more difficult, non-linear setting, where the minority class resembles a flower with elliptic petals.
    • Paw (Sub-figure (c)): the minority class is decomposed into 3 elliptic subregions of varying cardinalities, where two subregions are located close to each other, and the remaining smaller sub-region is separated.
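A rough sketch of how a "sub-cluster"-style dataset with a chosen imbalance ratio could be generated is given below. The rectangle sizes, counts and placement are illustrative, not the paper's exact procedure, and for simplicity majority points are drawn over the whole unit square rather than strictly outside the rectangles:

```python
import numpy as np

def make_subcluster_dataset(n=600, ir=5, n_clusters=5, side=0.08, rng=0):
    """Minority examples uniform inside small rectangles, majority examples
    uniform over the unit square; IR = n_majority / n_minority."""
    rng = np.random.default_rng(rng)
    n_min = n // (ir + 1)                       # e.g. n=600, IR=5 -> 100 vs 500
    n_maj = n - n_min
    centres = rng.random((n_clusters, 2)) * 0.8 + 0.1   # keep boxes inside [0,1]^2
    # minority: pick a rectangle per point, then a uniform point inside it
    c = centres[rng.integers(n_clusters, size=n_min)]
    X_min = c + (rng.random((n_min, 2)) - 0.5) * side
    # majority: uniform over the unit square
    X_maj = rng.random((n_maj, 2))
    X = np.vstack([X_min, X_maj])
    y = np.array([1] * n_min + [0] * n_maj)     # 1 = minority, 0 = majority
    return X, y
```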

You can also download all these datasets by clicking here: ZIP file

Real-world Datasets

The choice of the real-world datasets has been based on the work about imbalanced classification with borderline examples presented in \cite{Napierala10Learning}. Concretely, we have obtained them from the KEEL-dataset repository. Multi-class datasets are modified to obtain two-class imbalanced problems, defining the union of one or more classes as positive and the remainder as negative. The accuracy estimation of each classifier in a dataset is again obtained by means of 5 runs of a stratified 5-fold cross-validation. Table \ref{tab:datasets} shows these datasets' characteristics. For each one, the number of instances (\#Ins), the number of attributes (\#Att), the minority class (Min), the number of examples in the minority class (\#Min) and in the majority class (\#Maj), and the imbalance ratio (IR) are shown.

You can also download all these datasets by clicking here: ZIP file

Performance Evaluation of Re-sampling Methods

Here we show the AUC results of each re-sampling method for both kinds of datasets considered in our work: synthetic and real-world. In addition, the statistical tests performed with each re-sampling technique are also provided.


Experimentation with Synthetic Datasets

In the following tables you can download the files with the results of each re-sampling method considered. In the Test AUC row you can find the XLS file with the test AUC of each classification algorithm for each single dataset. In the None row, the results of Wilcoxon's test between SMOTE+IPF and no preprocessing are found. Finally, in the Classic and Borderline rows, the Friedman Aligned and Hochberg's post-hoc test results for the generic and borderline approaches are found, respectively.


             C4.5       k-NN       SVM        RIPPER     PART
Test AUC     XLS file   XLS file   XLS file   XLS file   XLS file
None         PDF file   PDF file   PDF file   PDF file   PDF file
Classic      PDF file   PDF file   PDF file   PDF file   PDF file
Borderline   PDF file   PDF file   PDF file   PDF file   PDF file


Experimentation with Real-world Datasets


             C4.5       k-NN       SVM        RIPPER     PART
Test AUC     XLS file   XLS file   XLS file   XLS file   XLS file
None         PDF file   PDF file   PDF file   PDF file   PDF file
Classic      PDF file   PDF file   PDF file   PDF file   PDF file
Borderline   PDF file   PDF file   PDF file   PDF file   PDF file


Experimentation with Real-world Datasets with Class Noise

             C4.5       k-NN       SVM        RIPPER     PART
Test AUC     XLS file   XLS file   XLS file   XLS file   XLS file
None         PDF file   PDF file   PDF file   PDF file   PDF file
Classic      PDF file   PDF file   PDF file   PDF file   PDF file
Borderline   PDF file   PDF file   PDF file   PDF file   PDF file


Experimentation with Real-world Datasets with Attribute Noise


             C4.5       k-NN       SVM        RIPPER     PART
Test AUC     XLS file   XLS file   XLS file   XLS file   XLS file
None         PDF file   PDF file   PDF file   PDF file   PDF file
Classic      PDF file   PDF file   PDF file   PDF file   PDF file
Borderline   PDF file   PDF file   PDF file   PDF file   PDF file