A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning

This Website contains complementary material to the paper:

S. García, J. Luengo, J.A. Sáez, V. López and F.Herrera, A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering 25:4 (2013) 734-750, doi: 10.1109/TKDE.2012.35

Summary:

Abstract
Experimental Framework
1. Data-sets partitions employed in the paper
2. Algorithms and parameters
Results obtained
Statistical Tests
JAVA code for the methods used

Abstract

Discretization is an essential preprocessing technique used in many knowledge discovery and data mining tasks. Its main goal is to transform a set of continuous attributes into discrete ones, by associating categorical values to intervals and thus transforming quantitative data into qualitative data. In this manner, symbolic data mining algorithms can be applied over continuous data and the representation of information is simplified, making it more concise and specific. The literature provides numerous proposals of discretization and some attempts to categorize them into a taxonomy can be found. However, in previous papers, there is a lack of consensus in the definition of the properties and no formal categorization has been established yet, which may be confusing for practitioners. Furthermore, only a small set of discretizers have been widely considered, while many other methods have gone unnoticed. With the intention of alleviating these problems, this paper provides a survey of discretization methods proposed in the literature from a theoretical and empirical perspective. From the theoretical perspective, we develop a taxonomy based on the main properties pointed out in previous research, unifying the notation and including all the known methods up to date. Empirically, we conduct an experimental study in supervised classification involving the most representative and newest discretizers, different types of classifiers and a large number of data sets. The results of their performances measured in terms of accuracy, number of intervals and inconsistency have been verified by means of nonparametric statistical tests. Additionally, a set of discretizers are highlighted as the best performing ones.

Experimental framework

Data-sets partitions employed in the paper

The performance of discretization algorithms is analyzed by using forty data sets from the KEEL-datasets repository. The data sets considered are partitioned using the ten fold cross-validation (10-fcv) procedure.

Table 1 summarize the properties of the selected data sets. It shows, for each data-set, the name, number of examples, number of attributes (numeric and nominal) and number of classes. The last column of this table contains a link for downloading the 10-fold cross validation partitions for each data set in KEEL format. All data sets may be downloaded by clicking here.

Table 1: Summary description for classification data sets

Data set	#Ex.	#Atts.	#Num.	#Nom.	#Cl.	Data set	#Ex.	#Atts.	#Num.	#Nom.	#Cl.
abalone	4174	8	7	1	28	mammographic	961	5	5	0	2
appendicitis	106	7	7	0	2	movement	360	90	90	0	15
australian	690	14	8	6	2	newthyroid	215	5	5	0	3
autos	205	25	15	10	6	pageblocks	5472	10	10	0	5
balance	625	4	4	0	3	penbased	10992	16	16	0	10
banana	5300	2	2	0	2	phoneme	5404	5	5	0	2
bands	539	19	19	0	2	pima	768	8	8	0	2
bupa	345	6	6	0	2	saheart	462	9	8	1	2
cleveland	303	13	13	0	5	satimage	6435	36	36	0	7
contraceptive	1473	9	9	0	3	segment	2310	19	19	0	7
crx	690	15	6	9	2	sonar	208	60	60	0	2
dermatology	366	34	34	0	6	spambase	4597	57	57	0	2
ecoli	336	7	7	0	8	specfheart	267	44	44	0	2
flare-solar	1066	9	9	0	2	tae	151	5	5	0	3
glass	214	9	9	0	7	titanic	2201	3	3	0	2
haberman	306	3	3	0	2	vehicle	846	18	18	0	4
hayes	160	4	4	0	3	vowel	990	13	13	0	11
heart	270	13	13	0	2	wine	178	13	13	0	3
hepatitis	155	19	19	0	2	wisconsin	699	9	9	0	2
iris	150	4	4	0	3	yeast	1484	8	8	0	10

Algorithms and parameters

We have used 30 discretizers in the experimental study, those that we have identified as the most relevant ones. Table 2 presents an enumeration of these discretizers, containing their complete name and the abbreviation used to identify them in the results section. For further details on their descriptions, the reader can visit the URL associated to the KEEL project. This web contains the original reference for each discretizer in the experimental study and the Java implementation for each one of them integrated in the KEEL Software tool.

Table 2: Discretizers used in the experimental study

Complete name	Abbreviation
Equal Width Discretizer	EqualWidth
Equal Frequency Discretizer	EqualFrequency
ChiMerge	ChiMerge
One-Rule Discretizer	1R
Iterative Dichotomizer 3 Discretizer	ID3
Minimum Description Length Principle	MDLP
Class-Attribute Dependent Discretizer	CADD
Bayesian Discretizer	Bayesian
Cluster Analysis Discretizer	ClusterAnalysis
Zeta	Zeta
Distance-based Discretizer	Distance
Chi2	Chi
FUSINTER	FUSINTER
Multivariate Discretization	MVD
Modified Chi2	Modified Chi2
Unparametrized Supervised Discretizer	USD
Khiops	Khiops
Class-Attribute Interdependence Maximization	CAIM
Extended Chi2	Extended Chi2
Heterogeneity Discretizer	Heter-Disc
Unsupervised Correlation Preserving Discretizer	UCPD
MODL	MODL
Hellinger-Based Discretizer	HellingerBD
Distribution Index-Based Discretizer	DIBD
Interval Distance Discretizer	IDD
Class-Attribute Contingency Coefficient	CACC
Ameva	Ameva
Proportional Discretizer	PKID
Fixed Frequency Discretizer	FFD
Hypercube Division Discretizer	HDD

Furthermore, in this study six classifiers have been used in order to find differences in performance among the discretizers. The classifiers are:

C4.5: A well-known decision tree, considered one of the top 10 DM algorithms.
DataSqueezer: This learner belongs to the family of inductive rule extraction. In spite of its relative simplicity, DataSqueezer is a very effective learner. The rules generated by the algorithm are compact and comprehensible, but accuracy is to some extent degraded in order to achieve this goal.
KNN: One of the simplest and most effective methods based on similarities among a set of objects. It is also considered one of the top 10 DM algorithms and it can handle nominal attributes using proper distance functions such as HVDM. It belongs to the lazy learning family.
Naive Bayes: This is another of the the top 10 DM algorithms. Its aim is to construct a rule which will allow us to assign future objects to a class, assuming independence of attributes when probabilities are established.
PUBLIC: It is an advanced decision tree that integrates the pruning phase with the building stage of the tree in order to avoid the expansion of branches that would be pruned afterwards.
Ripper: This is a widely used rule induction method based on a separate and conquer strategy. It incorporates diverse mechanisms to avoid overfitting and to handle numeric and nominal attributes simultaneously. The models obtained are in the form of decision lists.

The parameters of the discretizers and classifiers are those recommended by their respective authors. They are specified in Tables 3 and 4 for those methods which require them. We assume that the choice of the values of parameters is optimally chosen by their own authors. Nevertheless, in discretizers that require the input of the number of intervals as a parameter, we use a rule of thumb which is dependent on the number of instances in the data set. It consists in dividing the number of instances by 100 and taking the maximum value between this result and the number of classes.

Table 3: Parameter specification for the discretizers employed in the experimentation

Method	Parameters
1R	6 examples of the same class per interval
CADD	confidence threshold = 0.01
Chi2	inconsistency threshold = 0.02
ChiMerge	confidence threshold = 0.05
FDD	frequency size = 30
FUSINTER	α = 0.975, λ = 1
HDD	coefficient = 0.8
IDD	neighborhood = 3, windows size = 3, nominal distance
MODL	optimized process type
UCPD	intervals = [3, 6], KNN map type, neighborhood = 6, minimum support = 25, merged threshold = 0.5, scaling factor = 0.5, use discrete

Table 4: Parameter specification for the classifiers employed in the experimentation

Method	Parameters
C4.5	pruned tree, confidence = 0.25, 2 examples per leaf
DataSqueezer	pruning and generalization threshold = 0.05
KNN	K = 3, HVDM distance
PUBLIC	25 nodes between prune
Ripper	k = 2, grow set = 0.66

Results obtained

In this section, we report the full results obtained in the experimental study. We divide the results in two parts: the first part contains results associated to the intrinsic properties of the results obtained by the discretizers while the second part presents the results associated to the classifiers when they use the discretized data sets.

We present these data in xls and ods files available for download. These files are organized as follows: each data set is put in the rows of the file and each discretizer performance results are placed in the columns of the file. There are two columns for every discretizer: one representing the average performance results (number of intervals, inconsistency rate, accuracy or kappa) and another with the standard deviation for that average performance results. Additionally, for measures computed in train and test partitions, the corresponding file contains two sheets with this information: one for the training set results and the other for the test set results.

Results from intrinsic properties of the discretizers

In this section we obtain the number of intervals generated and the the inconsistency rate for the train and test sets generated for each pair data set/discretizer. The files available for download are in Table 5.

Table 5: Intrinsic properties files of the discretizers

Performance measure	xls file	ods file
Number of intervals
Inconsistency rate

Results for classifiers performance over the discretized data sets

In this section we obtain the classifiers performance over the discretized data sets generated with each discretizer. We use two performance measures that are widely used because of their simplicity and successful application when multi-class classification problems are treated. We refer to accuracy and Cohen's kappa measures, which will be adopted to measure the efficacy discretizers in terms of generalization classification rate.

Accuracy: is the number of successful hits relative to the total number of classifications. It has been by far the most commonly used metric for assessing the performance of classifiers for years.
Cohen’s kappa: is an alternative to accuracy, a method, known for decades, which compensates for random hits. Its original purpose was to measure the degree of agreement or disagreement between two people observing the same phenomenon. Cohen's kappa can be adapted to classification tasks and its use is recommended because it takes random successes into consideration as a standard, in the same way as the AUC measure.

Tables 6 and 7 present the Accuracy and Cohen's kappa files associated to the performance of the algorithms used in the study.

Table 6: Accuracy for the classifiers over the discretized data sets

Algorithm	xls file	ods file
C4.5
DataSqueezer
KNN
Naive Bayes
PUBLIC
Ripper

Table 7: Cohen's kappa for the classifiers over the discretized data sets

Algorithm	xls file	ods file
C4.5
DataSqueezer
KNN
Naive Bayes
PUBLIC
Ripper

Summary results for all data

Finally, this section summarizes the results from the previous sections. Here, we present three differenf files with information from the previous sections. All these files have the following structure: they sum up the information contained in the files from Tables 5, 6 and 7. Each row of those tables is summarized into a sheet of the file resulting in files that have the discretizer results in their rows and the average performance measure in their columns.

Intrinsic Measures Summary:

Accuracy Results Summary:

Cohen's Kappa Results Summary:

Statistical tests

Statistical analysis are carried out by means of nonparametric statistical tests. The reader can look up the Statistical Inference in Computational Intelligence and Data Mining web page for details in nonparametric statistical analysis.

The Wilcoxon test is used in order to conduct pairwise comparisons among all discretizers considered in the study. Specifically, the Wilcoxon test is adopted in this study considering a level of significance equal to α = 0.05.

In Table 8 we provide the results of the Wilcoxon test for every performance measure used in this study. For each comparison we provide two different files: a pdf file with the detailed results of the Wilcoxon test and a second file with the summary associated to it.

Table 8: Wilcoxon Test results for every performance measure used in the study

Comparison tested	Detailed Wilcoxon test	Summary of the Wilcoxon test
Number of Intervals
Inconsistency Rate in Training
Inconsistency Rate in Test
Accuracy using C4.5
Kappa using C4.5
Accuracy using DataSqueezer
Kappa using DataSqueezer
Accuracy using KNN
Kappa using KNN
Accuracy using Naive Bayes
Kappa using Naive Bayes
Accuracy using PUBLIC
Kappa using PUBLIC
Accuracy using Ripper
Kappa using Ripper

JAVA code for the methods used

In this section, we provide information about the source code of the methods used in the experimental framework and their descriptions.

All the algorithms used in the comparison are integrated in the KEEL Software Tool. The complete list of algorithms available in it can be found here, including the discretization algorithms used in this study and the classifiers used over the generated discretized sets.

You can download the KEEL Sofware Tool here. Since the KEEL Sofware Tool is an open source (GPLv3) tool, its source code it is also available for its usage here.

You are here