SCI2S Thematic Public WebsitesSCI2S Thematic Public Websites SCI2S Complementary Material Websites   SCI2S Thematic Public Websites
Icono GFSGenetic
Fuzzy
Systems
Icono Computing with Words in DMComputing
with Words in
Decision Making
Icono Statistical Inference in Computational Intelligence and Data MiningStatistical Inference in
Computational Intelligence
and Data Mining
Icono HindexH-index
&
Variants
Icono MV in DMMissing Values
in
Data Mining
Evolutionary Algorithms and other Metaheuristics for Continuous Optimization ProblemsE. A. & Metaheur.
for Continuous
Optim. Problems
Icono Interpretability of FRBSsInterpretability
of
FRBSs
Icon PRPrototype Reduction
in
Nearest Neighbor Classification

 

               A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning

This Website contains complementary material to the paper:

S. García, J. Luengo, J.A. Sáez, V. López and F.Herrera, A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering 25:4 (2013) 734-750, doi: 10.1109/TKDE.2012.35 iconPdf.png .

Summary:

  1. Abstract
  2. Experimental Framework
    1. Data-sets partitions employed in the paper
    2. Algorithms and parameters
  3. Results obtained
  4. Statistical Tests
  5. JAVA code for the methods used

S. García, J. Luengo, J.A. Sáez, V. López and F.Herrera, A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning.

Abstract

Discretization is an essential preprocessing technique used in many knowledge discovery and data mining tasks. Its main goal is to transform a set of continuous attributes into discrete ones, by associating categorical values to intervals and thus transforming quantitative data into qualitative data. In this manner, symbolic data mining algorithms can be applied over continuous data and the representation of information is simplified, making it more concise and specific. The literature provides numerous proposals of discretization and some attempts to categorize them into a taxonomy can be found. However, in previous papers, there is a lack of consensus in the definition of the properties and no formal categorization has been established yet, which may be confusing for practitioners. Furthermore, only a small set of discretizers have been widely considered, while many other methods have gone unnoticed. With the intention of alleviating these problems, this paper provides a survey of discretization methods proposed in the literature from a theoretical and empirical perspective. From the theoretical perspective, we develop a taxonomy based on the main properties pointed out in previous research, unifying the notation and including all the known methods up to date. Empirically, we conduct an experimental study in supervised classification involving the most representative and newest discretizers, different types of classifiers and a large number of data sets. The results of their performances measured in terms of accuracy, number of intervals and inconsistency have been verified by means of nonparametric statistical tests. Additionally, a set of discretizers are highlighted as the best performing ones.


Top of Page


2. Experimental framework

2.1 Data-sets partitions employed in the paper

The performance of discretization algorithms is analyzed by using forty data sets from the KEEL-datasets repository. The data sets considered are partitioned using the ten fold cross-validation (10-fcv) procedure.

Table 1 summarize the properties of the selected data sets. It shows, for each data-set, the name, number of examples, number of attributes (numeric and nominal) and number of classes. The last column of this table contains a link for downloading the 10-fold cross validation partitions for each data set in KEEL format. All data sets may be downloaded by clicking here.

Table 1: Summary description for classification data sets
Data set#Ex.#Atts.#Num.#Nom.#Cl.DownloadData set#Ex.#Atts.#Num.#Nom.#Cl.Download
abalone417487128zip filemammographic9615502zip file
appendicitis1067702zip filemovement3609090015zip file
australian69014862zip filenewthyroid2155503zip file
autos2052515106zip filepageblocks5472101005zip file
balance6254403zip filepenbased109921616010zip file
banana53002202zip filephoneme54045502zip file
bands539191902zip filepima7688802zip file
bupa3456602zip filesaheart4629812zip file
cleveland303131305zip filesatimage6435363607zip file
contraceptive14739903zip filesegment2310191907zip file
crx69015692zip filesonar208606002zip file
dermatology366343406zip filespambase4597575702zip file
ecoli3367708zip filespecfheart267444402zip file
flare-solar10669902zip filetae1515503zip file
glass2149907zip filetitanic22013302zip file
haberman3063302zip filevehicle846181804zip file
hayes1604403zip filevowel9901313011zip file
heart270131302zip filewine178131303zip file
hepatitis155191902zip filewisconsin6999902zip file
iris1504403zip fileyeast148488010zip file

2.2 Algorithms and parameters

We have used 30 discretizers in the experimental study, those that we have identified as the most relevant ones. Table 2 presents an enumeration of these discretizers, containing their complete name and the abbreviation used to identify them in the results section. For further details on their descriptions, the reader can visit the URL associated to the KEEL project. This web contains the original reference for each discretizer in the experimental study and the Java implementation for each one of them integrated in the KEEL Software tool.

Table 2: Discretizers used in the experimental study
Complete nameAbbreviation
Equal Width DiscretizerEqualWidth
Equal Frequency DiscretizerEqualFrequency
ChiMergeChiMerge
One-Rule Discretizer1R
Iterative Dichotomizer 3 DiscretizerID3
Minimum Description Length PrincipleMDLP
Class-Attribute Dependent DiscretizerCADD
Bayesian DiscretizerBayesian
Cluster Analysis DiscretizerClusterAnalysis
ZetaZeta
Distance-based DiscretizerDistance
Chi2Chi
FUSINTERFUSINTER
Multivariate DiscretizationMVD
Modified Chi2Modified Chi2
Unparametrized Supervised DiscretizerUSD
KhiopsKhiops
Class-Attribute Interdependence MaximizationCAIM
Extended Chi2Extended Chi2
Heterogeneity DiscretizerHeter-Disc
Unsupervised Correlation Preserving DiscretizerUCPD
MODLMODL
Hellinger-Based DiscretizerHellingerBD
Distribution Index-Based DiscretizerDIBD
Interval Distance DiscretizerIDD
Class-Attribute Contingency CoefficientCACC
AmevaAmeva
Proportional DiscretizerPKID
Fixed Frequency DiscretizerFFD
Hypercube Division DiscretizerHDD

Furthermore, in this study six classifiers have been used in order to find differences in performance among the discretizers. The classifiers are:

Bolita C4.5: A well-known decision tree, considered one of the top 10 DM algorithms.

Bolita DataSqueezer: This learner belongs to the family of inductive rule extraction. In spite of its relative simplicity, DataSqueezer is a very effective learner. The rules generated by the algorithm are compact and comprehensible, but accuracy is to some extent degraded in order to achieve this goal.

Bolita KNN: One of the simplest and most effective methods based on similarities among a set of objects. It is also considered one of the top 10 DM algorithms and it can handle nominal attributes using proper distance functions such as HVDM. It belongs to the lazy learning family.

Bolita Naive Bayes: This is another of the the top 10 DM algorithms. Its aim is to construct a rule which will allow us to assign future objects to a class, assuming independence of attributes when probabilities are established.

Bolita PUBLIC: It is an advanced decision tree that integrates the pruning phase with the building stage of the tree in order to avoid the expansion of branches that would be pruned afterwards.

Bolita Ripper: This is a widely used rule induction method based on a separate and conquer strategy. It incorporates diverse mechanisms to avoid overfitting and to handle numeric and nominal attributes simultaneously. The models obtained are in the form of decision lists.


The parameters of the discretizers and classifiers are those recommended by their respective authors. They are specified in Tables 3 and 4 for those methods which require them. We assume that the choice of the values of parameters is optimally chosen by their own authors. Nevertheless, in discretizers that require the input of the number of intervals as a parameter, we use a rule of thumb which is dependent on the number of instances in the data set. It consists in dividing the number of instances by 100 and taking the maximum value between this result and the number of classes.

Table 3: Parameter specification for the discretizers employed in the experimentation
MethodParameters
1R6 examples of the same class per interval
CADDconfidence threshold = 0.01
Chi2inconsistency threshold = 0.02
ChiMergeconfidence threshold = 0.05
FDDfrequency size = 30
FUSINTERα = 0.975, λ = 1
HDDcoefficient = 0.8
IDDneighborhood = 3, windows size = 3, nominal distance
MODLoptimized process type
UCPDintervals = [3, 6], KNN map type, neighborhood = 6, minimum support = 25, merged threshold = 0.5, scaling factor = 0.5, use discrete
Table 4: Parameter specification for the classifiers employed in the experimentation
MethodParameters
C4.5pruned tree, confidence = 0.25, 2 examples per leaf
DataSqueezerpruning and generalization threshold = 0.05
KNNK = 3, HVDM distance
PUBLIC25 nodes between prune
Ripperk = 2, grow set = 0.66


Top of Page


3. Results obtained

In this section, we report the full results obtained in the experimental study. We divide the results in two parts: the first part contains results associated to the intrinsic properties of the results obtained by the discretizers while the second part presents the results associated to the classifiers when they use the discretized data sets.

We present these data in xls and ods files available for download. These files are organized as follows: each data set is put in the rows of the file and each discretizer performance results are placed in the columns of the file. There are two columns for every discretizer: one representing the average performance results (number of intervals, inconsistency rate, accuracy or kappa) and another with the standard deviation for that average performance results. Additionally, for measures computed in train and test partitions, the corresponding file contains two sheets with this information: one for the training set results and the other for the test set results.

Results from intrinsic properties of the discretizers

In this section we obtain the number of intervals generated and the the inconsistency rate for the train and test sets generated for each pair data set/discretizer. The files available for download are in Table 5.

Table 5: Intrinsic properties files of the discretizers
Performance measurexls fileods file
Number of intervalsxls fileods file
Inconsistency ratexls fileods file

Results for classifiers performance over the discretized data sets

In this section we obtain the classifiers performance over the discretized data sets generated with each discretizer. We use two performance measures that are widely used because of their simplicity and successful application when multi-class classification problems are treated. We refer to accuracy and Cohen's kappa measures, which will be adopted to measure the efficacy discretizers in terms of generalization classification rate.

Bolita Accuracy: is the number of successful hits relative to the total number of classifications. It has been by far the most commonly used metric for assessing the performance of classifiers for years.

Bolita Cohenís kappa: is an alternative to accuracy, a method, known for decades, which compensates for random hits. Its original purpose was to measure the degree of agreement or disagreement between two people observing the same phenomenon. Cohen's kappa can be adapted to classification tasks and its use is recommended because it takes random successes into consideration as a standard, in the same way as the AUC measure.

Tables 6 and 7 present the Accuracy and Cohen's kappa files associated to the performance of the algorithms used in the study.

Table 6: Accuracy for the classifiers over the discretized data sets
Algorithmxls fileods file
C4.5xls fileods file
DataSqueezerxls fileods file
KNNxls fileods file
Naive Bayesxls fileods file
PUBLICxls fileods file
Ripperxls fileods file


Table 7: Cohen's kappa for the classifiers over the discretized data sets
Algorithmxls fileods file
C4.5xls fileods file
DataSqueezerxls fileods file
KNNxls fileods file
Naive Bayesxls fileods file
PUBLICxls fileods file
Ripperxls fileods file

Summary results for all data

Finally, this section summarizes the results from the previous sections. Here, we present three differenf files with information from the previous sections. All these files have the following structure: they sum up the information contained in the files from Tables 5, 6 and 7. Each row of those tables is summarized into a sheet of the file resulting in files that have the discretizer results in their rows and the average performance measure in their columns.

Bolita Intrinsic Measures Summary: xls file ods file

Bolita Accuracy Results Summary: xls file ods file

Bolita Cohen's Kappa Results Summary: xls file ods file


Top of Page


4. Statistical tests

Statistical analysis are carried out by means of nonparametric statistical tests. The reader can look up the Statistical Inference in Computational Intelligence and Data Mining web page for details in nonparametric statistical analysis.

The Wilcoxon test is used in order to conduct pairwise comparisons among all discretizers considered in the study. Specifically, the Wilcoxon test is adopted in this study considering a level of significance equal to α = 0.05.

In Table 8 we provide the results of the Wilcoxon test for every performance measure used in this study. For each comparison we provide two different files: a pdf file with the detailed results of the Wilcoxon test and a second file with the summary associated to it.

Table 8: Wilcoxon Test results for every performance measure used in the study
Comparison testedDetailed Wilcoxon testSummary of the Wilcoxon test
Number of Intervalspdf filepdf file
Inconsistency Rate in Trainingpdf filepdf file
Inconsistency Rate in Testpdf filepdf file
Accuracy using C4.5pdf filepdf file
Kappa using C4.5pdf filepdf file
Accuracy using DataSqueezerpdf filepdf file
Kappa using DataSqueezerpdf filepdf file
Accuracy using KNNpdf filepdf file
Kappa using KNNpdf filepdf file
Accuracy using Naive Bayespdf filepdf file
Kappa using Naive Bayespdf filepdf file
Accuracy using PUBLICpdf filepdf file
Kappa using PUBLICpdf filepdf file
Accuracy using Ripperpdf filepdf file
Kappa using Ripperpdf filepdf file


Top of Page


5. JAVA code for the methods used

In this section, we provide information about the source code of the methods used in the experimental framework and their descriptions.

All the algorithms used in the comparison are integrated in the KEEL Software Tool. The complete list of algorithms available in it can be found here, including the discretization algorithms used in this study and the classifiers used over the generated discretized sets.

You can download the KEEL Sofware Tool here. Since the KEEL Sofware Tool is an open source (GPLv3) tool, its source code it is also available for its usage here.

                


Top of Page


(c) Copyright 2011 SCI2S (Soft Computing and Intelligent Information Systems)
This Web page was created and maintained by Victoria López