Class Noise Cleaning by Ensemble Filtering and Noise Scoring

This website contains complementary material to the paper:

Julian Luengo, Seong-o Shim, Saleh Alshomrani, Abdulrahman Altalhi and Francisco Herrera, CNC-NOS: Class Noise Cleaning by Ensemble Filtering and Noise Scoring. Knowledge-Based Systems, accepted.

The website is organized according to the following summary:

  1. Abstract
  2. Proposal
  3. Datasets
  4. Performance results

Abstract

Obtaining data in the real world is subject to imperfections. In the data collection process, a common consequence of these imperfections is the appearance of noise. In classification, noisy data may deteriorate the performance of a classifier depending on the sensitivity of the learning method to data corruptions. A particularly disruptive type of noise in classification occurs when noise affects the example class labels, as it may severely mislead the model building.

Several strategies have emerged to deal with class noise in classification. Among the most popular is filtering. However, instance filtering can be harmful, as it may eliminate more examples than necessary or cause a loss of information. For this reason, we advance a new proposal based on an ensemble of noise filters with the goal not only of filtering the instances, but also of correcting, when possible, those that are mislabeled. A noise score is applied to decide whether each instance identified as noise is filtered, maintained or relabeled, relying on the label that each base filter considers the most appropriate for the instance. The proposal, named CNC-NOS (Class Noise Cleaner with Noise Scoring), is compared against state-of-the-art noise filters, showing that it is able to deliver a quality training instance set that overcomes the limitations of such filters, both in terms of classification accuracy and properly treated instances.

Proposal

Data correcting methods are an ideal solution to the aforementioned cases, since they are able to keep more instances in the dataset by means of relabeling instead of removal. However, the complete repair of the dataset is not always possible due to the complexity of this process. For instance, whereas noise filtering only requires detecting the noisy examples, noise correction also needs an additional phase in which one among all the possible classes of the problem must be chosen for each noisy example.

The main idea is to relabel those examples affected by class noise only when the class label they belong to can be established with a high degree of confidence. In the cases where the instance cannot be repaired due to the impossibility of determining the true class label, a safe filtering can be applied. We have called this preprocessing technique Class Noise Cleaning with Noise Scoring (CNC-NOS for short). Figure 1 shows a scheme of the proposal. The five phases carried out by the method are described as follows (a minimal code sketch of the main loop is given after the list):

  1. Application of an ensemble of noise filters. Several noise filters $F_1$, ..., $F_n$ are built from the partially filtered data of the previous iteration ($T_i$). A requirement of these filters is that they must internally build a model to assign a class label to each training example in $T_i$ (which is then compared to the original class label to determine whether the example is noisy). Thus, both a set of noisy examples and a proposal of corrected class labels for them are obtained from each noise filter in this step.

  2. Construction of the final set of noisy examples. The second step consists of building the final set of noisy examples $N_F$ from the combination of the different sets of noisy examples $N_i$ obtained in the previous step. This combination is carried out by applying a decision combination scheme to the noisy sets $N_i$.

  3. Computation of the Noise Score $wNS$ for each noisy example. Once the noisy examples have been identified, a score is computed for each one to measure how noisy it is and to help decide whether to remove or repair it.

  4. Treatment of each noisy example. When the set of noisy examples $N_F$ and their scores are obtained, it is necessary to decide which treatment to apply to each example in this set: the correction of its class label or the complete removal of the example. A decision combination scheme is used to make this decision. The noise score is also used to decide whether the main loop of CNC-NOS should stop.

  5. Final adaptive filtering. Once the main loop of CNC-NOS has finished and the instances have been repaired or filtered, a conservative filtering phase is applied to eliminate possible wrongly relabeled instances or instances that have not yet been identified as noise. Since the quality of the dataset has increased by the end of the main loop, the remaining corrupted instances can be identified more accurately.
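
To make the loop concrete, the following minimal sketch in Python condenses steps 1-4 into a single iteration. The base filters are assumed to expose a fit/predict interface, and both the majority-vote combination and the vote-fraction score used here in place of the paper's $wNS$ (as well as the score_threshold value) are illustrative assumptions rather than the paper's exact definitions.

    from collections import Counter

    def cnc_nos_iteration(X, y, base_filters, score_threshold=0.5):
        # Step 1: each base filter builds an internal model and predicts
        # a class label for every training example; a prediction that
        # differs from the current label is a vote for "noisy".
        predictions = []
        for f in base_filters:
            f.fit(X, y)
            predictions.append(f.predict(X))

        kept_X, kept_y = [], []
        for i, label_i in enumerate(y):
            votes = [p[i] for p in predictions]
            noisy_votes = sum(1 for v in votes if v != label_i)

            # Step 2: majority combination of the filters' noisy sets.
            if noisy_votes <= len(base_filters) // 2:
                kept_X.append(X[i])
                kept_y.append(label_i)
                continue

            # Step 3: a per-example noise score; the fraction of filters
            # agreeing on a single label stands in for the paper's wNS.
            best_label, count = Counter(votes).most_common(1)[0]
            score = count / len(base_filters)

            # Step 4: relabel when the ensemble is confident about the
            # true class; otherwise remove the example (safe filtering).
            if best_label != label_i and score >= score_threshold:
                kept_X.append(X[i])
                kept_y.append(best_label)
            # else: the example is dropped from the training set

        return kept_X, kept_y

The final adaptive filtering of step 5 would then run a conservative filter (for instance, one requiring consensus among all base filters) over the cleaned set returned by the last iteration of this loop.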

Datasets

The experimentation is based on 31 datasets from the KEEL-dataset repository. They are described in the following table, where #Examples refers to the number of examples, #Attributes to the number of attributes and #Classes to the number of classes. Examples containing missing values are removed from the datasets before use.

Several levels of uniform class noise have been introduced into the training partitions of these datasets, ranging from 5% to 40%, following a 5-fold cross-validation (5-fcv) scheme. Five different seeds have been used to introduce the noise, resulting in five different versions of each dataset for a given noise level.
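
As an illustration of this protocol, the sketch below corrupts a toy label vector at every noise level and seed, in Python. It assumes that a corrupted example always receives a class different from its original one; the exact definition of uniform class noise used in the paper may differ.

    import random

    def add_uniform_class_noise(y, noise_level, seed):
        # Pick a fraction `noise_level` of the examples at random and
        # change each selected label to a different class chosen
        # uniformly among the remaining ones.
        rng = random.Random(seed)
        y = list(y)
        classes = sorted(set(y))
        n_noisy = round(noise_level * len(y))
        for i in rng.sample(range(len(y)), n_noisy):
            y[i] = rng.choice([c for c in classes if c != y[i]])
        return y

    labels = ["pos", "neg", "neu"] * 20            # toy label vector
    levels = [l / 100 for l in range(5, 45, 5)]    # 5%, 10%, ..., 40%
    noisy_versions = {(lvl, s): add_uniform_class_noise(labels, lvl, s)
                      for lvl in levels for s in range(5)}  # 5 seeds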

You can download all these datasets by clicking here.

Performance results

In the following table you can download the files with the results of each method considered and the analysis of the examples treated by our proposal.

  1. Accuracy results (xls file)
  2. Analysis of the treated examples (xls file)

The accumulated differences in performance between CNC-NOS and all the filters/cleaners, sorted by number of instances, attributes or classes, can be found here:

Execution times

In the following link you may download the average execution times for all the methods compared in the paper:

You may also be interested in the difference in execution times between applying the final adaptive filtering and disabling it. You can find these differences here: