A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning

This Website contains complementary material to the paper:

S. García, J. Luengo, J.A. Sáez, V. López and F.Herrera, A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering 25:4 (2013) 734-750, doi: 10.1109/TKDE.2012.35 PDF Icon

Summary:

  1. Abstract
  2. Experimental Framework
    1. Data-sets partitions employed in the paper
    2. Algorithms and parameters
  3. Results obtained
  4. Statistical Tests
  5. JAVA code for the methods used

Abstract

Discretization is an essential preprocessing technique used in many knowledge discovery and data mining tasks. Its main goal is to transform a set of continuous attributes into discrete ones, by associating categorical values to intervals and thus transforming quantitative data into qualitative data. In this manner, symbolic data mining algorithms can be applied over continuous data and the representation of information is simplified, making it more concise and specific. The literature provides numerous proposals of discretization and some attempts to categorize them into a taxonomy can be found. However, in previous papers, there is a lack of consensus in the definition of the properties and no formal categorization has been established yet, which may be confusing for practitioners. Furthermore, only a small set of discretizers have been widely considered, while many other methods have gone unnoticed. With the intention of alleviating these problems, this paper provides a survey of discretization methods proposed in the literature from a theoretical and empirical perspective. From the theoretical perspective, we develop a taxonomy based on the main properties pointed out in previous research, unifying the notation and including all the known methods up to date. Empirically, we conduct an experimental study in supervised classification involving the most representative and newest discretizers, different types of classifiers and a large number of data sets. The results of their performances measured in terms of accuracy, number of intervals and inconsistency have been verified by means of nonparametric statistical tests. Additionally, a set of discretizers are highlighted as the best performing ones.

Experimental framework

Data-sets partitions employed in the paper

The performance of discretization algorithms is analyzed by using forty data sets from the KEEL-datasets repository. The data sets considered are partitioned using the ten fold cross-validation (10-fcv) procedure.

Table 1 summarize the properties of the selected data sets. It shows, for each data-set, the name, number of examples, number of attributes (numeric and nominal) and number of classes. The last column of this table contains a link for downloading the 10-fold cross validation partitions for each data set in KEEL format. All data sets may be downloaded by clicking here.

Table 1: Summary description for classification data sets

Data set #Ex. #Atts. #Num. #Nom. #Cl. Download Data set #Ex. #Atts. #Num. #Nom. #Cl. Download
abalone 4174 8 7 1 28 iconZip.png mammographic 961 5 5 0 2 iconZip.png
appendicitis 106 7 7 0 2 iconZip.png movement 360 90 90 0 15 iconZip.png
australian 690 14 8 6 2 iconZip.png newthyroid 215 5 5 0 3 iconZip.png
autos 205 25 15 10 6 iconZip.png pageblocks 5472 10 10 0 5 iconZip.png
balance 625 4 4 0 3 iconZip.png penbased 10992 16 16 0 10 iconZip.png
banana 5300 2 2 0 2 iconZip.png phoneme 5404 5 5 0 2 iconZip.png
bands 539 19 19 0 2 iconZip.png pima 768 8 8 0 2 iconZip.png
bupa 345 6 6 0 2 iconZip.png saheart 462 9 8 1 2 iconZip.png
cleveland 303 13 13 0 5 iconZip.png satimage 6435 36 36 0 7 iconZip.png
contraceptive 1473 9 9 0 3 iconZip.png segment 2310 19 19 0 7 iconZip.png
crx 690 15 6 9 2 iconZip.png sonar 208 60 60 0 2 iconZip.png
dermatology 366 34 34 0 6 iconZip.png spambase 4597 57 57 0 2 iconZip.png
ecoli 336 7 7 0 8 iconZip.png specfheart 267 44 44 0 2 iconZip.png
flare-solar 1066 9 9 0 2 iconZip.png tae 151 5 5 0 3 iconZip.png
glass 214 9 9 0 7 iconZip.png titanic 2201 3 3 0 2 iconZip.png
haberman 306 3 3 0 2 iconZip.png vehicle 846 18 18 0 4 iconZip.png
hayes 160 4 4 0 3 iconZip.png vowel 990 13 13 0 11 iconZip.png
heart 270 13 13 0 2 iconZip.png wine 178 13 13 0 3 iconZip.png
hepatitis 155 19 19 0 2 iconZip.png wisconsin 699 9 9 0 2 iconZip.png
iris 150 4 4 0 3 iconZip.png yeast 1484 8 8 0 10 iconZip.png

Algorithms and parameters

We have used 30 discretizers in the experimental study, those that we have identified as the most relevant ones. Table 2 presents an enumeration of these discretizers, containing their complete name and the abbreviation used to identify them in the results section. For further details on their descriptions, the reader can visit the URL associated to the KEEL project. This web contains the original reference for each discretizer in the experimental study and the Java implementation for each one of them integrated in the KEEL Software tool.

Table 2: Discretizers used in the experimental study

Complete name Abbreviation
Equal Width Discretizer EqualWidth
Equal Frequency Discretizer EqualFrequency
ChiMerge ChiMerge
One-Rule Discretizer 1R
Iterative Dichotomizer 3 Discretizer ID3
Minimum Description Length Principle MDLP
Class-Attribute Dependent Discretizer CADD
Bayesian Discretizer Bayesian
Cluster Analysis Discretizer ClusterAnalysis
Zeta Zeta
Distance-based Discretizer Distance
Chi2 Chi
FUSINTER FUSINTER
Multivariate Discretization MVD
Modified Chi2 Modified Chi2
Unparametrized Supervised Discretizer USD
Khiops Khiops
Class-Attribute Interdependence Maximization CAIM
Extended Chi2 Extended Chi2
Heterogeneity Discretizer Heter-Disc
Unsupervised Correlation Preserving Discretizer UCPD
MODL MODL
Hellinger-Based Discretizer HellingerBD
Distribution Index-Based Discretizer DIBD
Interval Distance Discretizer IDD
Class-Attribute Contingency Coefficient CACC
Ameva Ameva
Proportional Discretizer PKID
Fixed Frequency Discretizer FFD
Hypercube Division Discretizer HDD

Furthermore, in this study six classifiers have been used in order to find differences in performance among the discretizers. The classifiers are:

  • C4.5: A well-known decision tree, considered one of the top 10 DM algorithms.
  • DataSqueezer: This learner belongs to the family of inductive rule extraction. In spite of its relative simplicity, DataSqueezer is a very effective learner. The rules generated by the algorithm are compact and comprehensible, but accuracy is to some extent degraded in order to achieve this goal.
  • KNN: One of the simplest and most effective methods based on similarities among a set of objects. It is also considered one of the top 10 DM algorithms and it can handle nominal attributes using proper distance functions such as HVDM. It belongs to the lazy learning family.
  • Naive Bayes: This is another of the the top 10 DM algorithms. Its aim is to construct a rule which will allow us to assign future objects to a class, assuming independence of attributes when probabilities are established.
  • PUBLIC: It is an advanced decision tree that integrates the pruning phase with the building stage of the tree in order to avoid the expansion of branches that would be pruned afterwards.
  • Ripper: This is a widely used rule induction method based on a separate and conquer strategy. It incorporates diverse mechanisms to avoid overfitting and to handle numeric and nominal attributes simultaneously. The models obtained are in the form of decision lists.

The parameters of the discretizers and classifiers are those recommended by their respective authors. They are specified in Tables 3 and 4 for those methods which require them. We assume that the choice of the values of parameters is optimally chosen by their own authors. Nevertheless, in discretizers that require the input of the number of intervals as a parameter, we use a rule of thumb which is dependent on the number of instances in the data set. It consists in dividing the number of instances by 100 and taking the maximum value between this result and the number of classes.

Table 3: Parameter specification for the discretizers employed in the experimentation

Method Parameters
1R 6 examples of the same class per interval
CADD confidence threshold = 0.01
Chi2 inconsistency threshold = 0.02
ChiMerge confidence threshold = 0.05
FDD frequency size = 30
FUSINTER α = 0.975, λ = 1
HDD coefficient = 0.8
IDD neighborhood = 3, windows size = 3, nominal distance
MODL optimized process type
UCPD intervals = [3, 6], KNN map type, neighborhood = 6, minimum support = 25, merged threshold = 0.5, scaling factor = 0.5, use discrete

Table 4: Parameter specification for the classifiers employed in the experimentation

Method Parameters
C4.5 pruned tree, confidence = 0.25, 2 examples per leaf
DataSqueezer pruning and generalization threshold = 0.05
KNN K = 3, HVDM distance
PUBLIC 25 nodes between prune
Ripper k = 2, grow set = 0.66

Results obtained

In this section, we report the full results obtained in the experimental study. We divide the results in two parts: the first part contains results associated to the intrinsic properties of the results obtained by the discretizers while the second part presents the results associated to the classifiers when they use the discretized data sets.

We present these data in xls and ods files available for download. These files are organized as follows: each data set is put in the rows of the file and each discretizer performance results are placed in the columns of the file. There are two columns for every discretizer: one representing the average performance results (number of intervals, inconsistency rate, accuracy or kappa) and another with the standard deviation for that average performance results. Additionally, for measures computed in train and test partitions, the corresponding file contains two sheets with this information: one for the training set results and the other for the test set results.

Results from intrinsic properties of the discretizers

In this section we obtain the number of intervals generated and the the inconsistency rate for the train and test sets generated for each pair data set/discretizer. The files available for download are in Table 5.

Table 5: Intrinsic properties files of the discretizers

Performance measure xls file ods file
Number of intervals iconExcel.jpg ods file
Inconsistency rate iconExcel.jpg ods file

Results for classifiers performance over the discretized data sets

In this section we obtain the classifiers performance over the discretized data sets generated with each discretizer. We use two performance measures that are widely used because of their simplicity and successful application when multi-class classification problems are treated. We refer to accuracy and Cohen's kappa measures, which will be adopted to measure the efficacy discretizers in terms of generalization classification rate.

  • Accuracy: is the number of successful hits relative to the total number of classifications. It has been by far the most commonly used metric for assessing the performance of classifiers for years.
  • Cohen’s kappa: is an alternative to accuracy, a method, known for decades, which compensates for random hits. Its original purpose was to measure the degree of agreement or disagreement between two people observing the same phenomenon. Cohen's kappa can be adapted to classification tasks and its use is recommended because it takes random successes into consideration as a standard, in the same way as the AUC measure.

Tables 6 and 7 present the Accuracy and Cohen's kappa files associated to the performance of the algorithms used in the study.

Table 6: Accuracy for the classifiers over the discretized data sets

Algorithm xls file ods file
C4.5 iconExcel.jpg ods file
DataSqueezer iconExcel.jpg ods file
KNN iconExcel.jpg ods file
Naive Bayes iconExcel.jpg ods file
PUBLIC iconExcel.jpg ods file
Ripper iconExcel.jpg ods file

Table 7: Cohen's kappa for the classifiers over the discretized data sets

Algorithm xls file ods file
C4.5 iconExcel.jpg ods file
DataSqueezer iconExcel.jpg ods file
KNN iconExcel.jpg ods file
Naive Bayes iconExcel.jpg ods file
PUBLIC iconExcel.jpg ods file
Ripper iconExcel.jpg ods file

Summary results for all data

Finally, this section summarizes the results from the previous sections. Here, we present three differenf files with information from the previous sections. All these files have the following structure: they sum up the information contained in the files from Tables 5, 6 and 7. Each row of those tables is summarized into a sheet of the file resulting in files that have the discretizer results in their rows and the average performance measure in their columns.

Intrinsic Measures Summary: iconExcel.jpg ods file

Accuracy Results Summary: iconExcel.jpg ods file

Cohen's Kappa Results Summary: iconExcel.jpg ods file

Statistical tests

Statistical analysis are carried out by means of nonparametric statistical tests. The reader can look up the Statistical Inference in Computational Intelligence and Data Mining web page for details in nonparametric statistical analysis.

The Wilcoxon test is used in order to conduct pairwise comparisons among all discretizers considered in the study. Specifically, the Wilcoxon test is adopted in this study considering a level of significance equal to α = 0.05.

In Table 8 we provide the results of the Wilcoxon test for every performance measure used in this study. For each comparison we provide two different files: a pdf file with the detailed results of the Wilcoxon test and a second file with the summary associated to it.

Table 8: Wilcoxon Test results for every performance measure used in the study

Comparison tested Detailed Wilcoxon test Summary of the Wilcoxon test
Number of Intervals PDF Icon PDF Icon
Inconsistency Rate in Training PDF Icon PDF Icon
Inconsistency Rate in Test PDF Icon PDF Icon
Accuracy using C4.5 PDF Icon PDF Icon
Kappa using C4.5 PDF Icon PDF Icon
Accuracy using DataSqueezer PDF Icon PDF Icon
Kappa using DataSqueezer PDF Icon PDF Icon
Accuracy using KNN PDF Icon PDF Icon
Kappa using KNN PDF Icon PDF Icon
Accuracy using Naive Bayes PDF Icon PDF Icon
Kappa using Naive Bayes PDF Icon PDF Icon
Accuracy using PUBLIC PDF Icon PDF Icon
Kappa using PUBLIC PDF Icon PDF Icon
Accuracy using Ripper PDF Icon PDF Icon
Kappa using Ripper PDF Icon PDF Icon

JAVA code for the methods used

In this section, we provide information about the source code of the methods used in the experimental framework and their descriptions.

All the algorithms used in the comparison are integrated in the KEEL Software Tool. The complete list of algorithms available in it can be found here, including the discretization algorithms used in this study and the classifiers used over the generated discretized sets.

You can download the KEEL Sofware Tool here. Since the KEEL Sofware Tool is an open source (GPLv3) tool, its source code it is also available for its usage here.