
This Website contains complementary material to the paper:
Summary:
S. García, J. Luengo, J.A. Sáez, V. López and F.Herrera, A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning.
Abstract
Discretization is an essential preprocessing technique used in many knowledge discovery and data mining tasks. Its main goal is to transform a set of continuous attributes into discrete ones, by associating categorical values to intervals and thus transforming quantitative data into qualitative data. In this manner, symbolic data mining algorithms can be applied over continuous data and the representation of information is simplified, making it more concise and specific. The literature provides numerous proposals of discretization and some attempts to categorize them into a taxonomy can be found. However, in previous papers, there is a lack of consensus in the definition of the properties and no formal categorization has been established yet, which may be confusing for practitioners. Furthermore, only a small set of discretizers have been widely considered, while many other methods have gone unnoticed. With the intention of alleviating these problems, this paper provides a survey of discretization methods proposed in the literature from a theoretical and empirical perspective. From the theoretical perspective, we develop a taxonomy based on the main properties pointed out in previous research, unifying the notation and including all the known methods up to date. Empirically, we conduct an experimental study in supervised classification involving the most representative and newest discretizers, different types of classifiers and a large number of data sets. The results of their performances measured in terms of accuracy, number of intervals and inconsistency have been verified by means of nonparametric statistical tests. Additionally, a set of discretizers are highlighted as the best performing ones.
2. Experimental framework
2.1 Datasets partitions employed in the paper
The performance of discretization algorithms is analyzed by using forty data sets from the KEELdatasets repository. The data sets considered are partitioned using the ten fold crossvalidation (10fcv) procedure.
Table 1 summarize the properties of the selected data sets. It shows, for each dataset, the name, number of examples, number of attributes (numeric and nominal) and number of classes. The last column of this table contains a link for downloading the 10fold cross validation partitions for each data set in KEEL format. All data sets may be downloaded by clicking here.
Data set  #Ex.  #Atts.  #Num.  #Nom.  #Cl.  Download  Data set  #Ex.  #Atts.  #Num.  #Nom.  #Cl.  Download 

abalone  4174  8  7  1  28  mammographic  961  5  5  0  2  
appendicitis  106  7  7  0  2  movement  360  90  90  0  15  
australian  690  14  8  6  2  newthyroid  215  5  5  0  3  
autos  205  25  15  10  6  pageblocks  5472  10  10  0  5  
balance  625  4  4  0  3  penbased  10992  16  16  0  10  
banana  5300  2  2  0  2  phoneme  5404  5  5  0  2  
bands  539  19  19  0  2  pima  768  8  8  0  2  
bupa  345  6  6  0  2  saheart  462  9  8  1  2  
cleveland  303  13  13  0  5  satimage  6435  36  36  0  7  
contraceptive  1473  9  9  0  3  segment  2310  19  19  0  7  
crx  690  15  6  9  2  sonar  208  60  60  0  2  
dermatology  366  34  34  0  6  spambase  4597  57  57  0  2  
ecoli  336  7  7  0  8  specfheart  267  44  44  0  2  
flaresolar  1066  9  9  0  2  tae  151  5  5  0  3  
glass  214  9  9  0  7  titanic  2201  3  3  0  2  
haberman  306  3  3  0  2  vehicle  846  18  18  0  4  
hayes  160  4  4  0  3  vowel  990  13  13  0  11  
heart  270  13  13  0  2  wine  178  13  13  0  3  
hepatitis  155  19  19  0  2  wisconsin  699  9  9  0  2  
iris  150  4  4  0  3  yeast  1484  8  8  0  10 
2.2 Algorithms and parameters
We have used 30 discretizers in the experimental study, those that we have identified as the most relevant ones. Table 2 presents an enumeration of these discretizers, containing their complete name and the abbreviation used to identify them in the results section. For further details on their descriptions, the reader can visit the URL associated to the KEEL project. This web contains the original reference for each discretizer in the experimental study and the Java implementation for each one of them integrated in the KEEL Software tool.
Complete name  Abbreviation 

Equal Width Discretizer  EqualWidth 
Equal Frequency Discretizer  EqualFrequency 
ChiMerge  ChiMerge 
OneRule Discretizer  1R 
Iterative Dichotomizer 3 Discretizer  ID3 
Minimum Description Length Principle  MDLP 
ClassAttribute Dependent Discretizer  CADD 
Bayesian Discretizer  Bayesian 
Cluster Analysis Discretizer  ClusterAnalysis 
Zeta  Zeta 
Distancebased Discretizer  Distance 
Chi2  Chi 
FUSINTER  FUSINTER 
Multivariate Discretization  MVD 
Modified Chi2  Modified Chi2 
Unparametrized Supervised Discretizer  USD 
Khiops  Khiops 
ClassAttribute Interdependence Maximization  CAIM 
Extended Chi2  Extended Chi2 
Heterogeneity Discretizer  HeterDisc 
Unsupervised Correlation Preserving Discretizer  UCPD 
MODL  MODL 
HellingerBased Discretizer  HellingerBD 
Distribution IndexBased Discretizer  DIBD 
Interval Distance Discretizer  IDD 
ClassAttribute Contingency Coefficient  CACC 
Ameva  Ameva 
Proportional Discretizer  PKID 
Fixed Frequency Discretizer  FFD 
Hypercube Division Discretizer  HDD 
Furthermore, in this study six classifiers have been used in order to find differences in performance among the discretizers. The classifiers are:
C4.5: A wellknown decision tree, considered one of the top 10 DM algorithms.
DataSqueezer: This learner belongs to the family of inductive rule extraction. In spite of its relative simplicity, DataSqueezer is a very effective learner. The rules generated by the algorithm are compact and comprehensible, but accuracy is to some extent degraded in order to achieve this goal.
KNN: One of the simplest and most effective methods based on similarities among a set of objects. It is also considered one of the top 10 DM algorithms and it can handle nominal attributes using proper distance functions such as HVDM. It belongs to the lazy learning family.
Naive Bayes: This is another of the the top 10 DM algorithms. Its aim is to construct a rule which will allow us to assign future objects to a class, assuming independence of attributes when probabilities are established.
PUBLIC: It is an advanced decision tree that integrates the pruning phase with the building stage of the tree in order to avoid the expansion of branches that would be pruned afterwards.
Ripper: This is a widely used rule induction method based on a separate and conquer strategy. It incorporates diverse mechanisms to avoid overfitting and to handle numeric and nominal attributes simultaneously. The models obtained are in the form of decision lists.
The parameters of the discretizers and classifiers are those recommended by their respective authors. They are specified in Tables 3 and 4 for those methods which require them. We assume that the choice of the values of parameters is optimally chosen by their own authors. Nevertheless, in discretizers that require the input of the number of intervals as a parameter, we use a rule of thumb which is dependent on the number of instances in the data set. It consists in dividing the number of instances by 100 and taking the maximum value between this result and the number of classes.
Method  Parameters 

1R  6 examples of the same class per interval 
CADD  confidence threshold = 0.01 
Chi2  inconsistency threshold = 0.02 
ChiMerge  confidence threshold = 0.05 
FDD  frequency size = 30 
FUSINTER  α = 0.975, λ = 1 
HDD  coefficient = 0.8 
IDD  neighborhood = 3, windows size = 3, nominal distance 
MODL  optimized process type 
UCPD  intervals = [3, 6], KNN map type, neighborhood = 6, minimum support = 25, merged threshold = 0.5, scaling factor = 0.5, use discrete 
Method  Parameters 

C4.5  pruned tree, confidence = 0.25, 2 examples per leaf 
DataSqueezer  pruning and generalization threshold = 0.05 
KNN  K = 3, HVDM distance 
PUBLIC  25 nodes between prune 
Ripper  k = 2, grow set = 0.66 
3. Results obtained
In this section, we report the full results obtained in the experimental study. We divide the results in two parts: the first part contains results associated to the intrinsic properties of the results obtained by the discretizers while the second part presents the results associated to the classifiers when they use the discretized data sets.
We present these data in xls and ods files available for download. These files are organized as follows: each data set is put in the rows of the file and each discretizer performance results are placed in the columns of the file. There are two columns for every discretizer: one representing the average performance results (number of intervals, inconsistency rate, accuracy or kappa) and another with the standard deviation for that average performance results. Additionally, for measures computed in train and test partitions, the corresponding file contains two sheets with this information: one for the training set results and the other for the test set results.
Results from intrinsic properties of the discretizers
In this section we obtain the number of intervals generated and the the inconsistency rate for the train and test sets generated for each pair data set/discretizer. The files available for download are in Table 5.
Performance measure  xls file  ods file 

Number of intervals  
Inconsistency rate 
Results for classifiers performance over the discretized data sets
In this section we obtain the classifiers performance over the discretized data sets generated with each discretizer. We use two performance measures that are widely used because of their simplicity and successful application when multiclass classification problems are treated. We refer to accuracy and Cohen's kappa measures, which will be adopted to measure the efficacy discretizers in terms of generalization classification rate.
Accuracy: is the number of successful hits relative to the total number of classifications. It has been by far the most commonly used metric for assessing the performance of classifiers for years.
Cohen’s kappa: is an alternative to accuracy, a method, known for decades, which compensates for random hits. Its original purpose was to measure the degree of agreement or disagreement between two people observing the same phenomenon. Cohen's kappa can be adapted to classification tasks and its use is recommended because it takes random successes into consideration as a standard, in the same way as the AUC measure.
Tables 6 and 7 present the Accuracy and Cohen's kappa files associated to the performance of the algorithms used in the study.
Algorithm  xls file  ods file 

C4.5  
DataSqueezer  
KNN  
Naive Bayes  
PUBLIC  
Ripper 
Algorithm  xls file  ods file 

C4.5  
DataSqueezer  
KNN  
Naive Bayes  
PUBLIC  
Ripper 
Summary results for all data
Finally, this section summarizes the results from the previous sections. Here, we present three differenf files with information from the previous sections. All these files have the following structure: they sum up the information contained in the files from Tables 5, 6 and 7. Each row of those tables is summarized into a sheet of the file resulting in files that have the discretizer results in their rows and the average performance measure in their columns.
4. Statistical tests
Statistical analysis are carried out by means of nonparametric statistical tests. The reader can look up the
Statistical Inference in Computational Intelligence and Data Mining web page for details in nonparametric statistical analysis.
The Wilcoxon test is used in order to conduct pairwise comparisons among all discretizers considered in the study. Specifically, the Wilcoxon test
is adopted in this study considering a level of significance equal to α = 0.05.
In Table 8 we provide the results of the Wilcoxon test for every performance measure used in this study. For each comparison we provide two
different files: a pdf file with the detailed results of the Wilcoxon test and a second file with the summary associated to it.
Comparison tested  Detailed Wilcoxon test  Summary of the Wilcoxon test 

Number of Intervals  
Inconsistency Rate in Training  
Inconsistency Rate in Test  
Accuracy using C4.5  
Kappa using C4.5  
Accuracy using DataSqueezer  
Kappa using DataSqueezer  
Accuracy using KNN  
Kappa using KNN  
Accuracy using Naive Bayes  
Kappa using Naive Bayes  
Accuracy using PUBLIC  
Kappa using PUBLIC  
Accuracy using Ripper  
Kappa using Ripper 
5. JAVA code for the methods used
In this section, we provide information about the source code of the methods used in the experimental framework and their descriptions.
All the algorithms used in the comparison are integrated in the KEEL Software Tool. The complete list of algorithms available in it can be found here, including the discretization algorithms used in this study and the classifiers used over the generated discretized sets.
You can download the KEEL Sofware Tool here. Since the KEEL Sofware Tool is an open source (GPLv3) tool, its source code it is also available for its usage here.
(c) Copyright 2011 SCI^{2}S (Soft Computing and Intelligent Information Systems)
This Web page was created and maintained by Victoria López