A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning
This Website contains complementary material to the paper:
S. García, J. Luengo, J.A. Sáez, V. López and F.Herrera, A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering 25:4 (2013) 734-750, doi: 10.1109/TKDE.2012.35
- Experimental Framework
- Results obtained
- Statistical Tests
- JAVA code for the methods used
Discretization is an essential preprocessing technique used in many knowledge discovery and data mining tasks. Its main goal is to transform a set of continuous attributes into discrete ones, by associating categorical values to intervals and thus transforming quantitative data into qualitative data. In this manner, symbolic data mining algorithms can be applied over continuous data and the representation of information is simplified, making it more concise and specific. The literature provides numerous proposals of discretization and some attempts to categorize them into a taxonomy can be found. However, in previous papers, there is a lack of consensus in the definition of the properties and no formal categorization has been established yet, which may be confusing for practitioners. Furthermore, only a small set of discretizers have been widely considered, while many other methods have gone unnoticed. With the intention of alleviating these problems, this paper provides a survey of discretization methods proposed in the literature from a theoretical and empirical perspective. From the theoretical perspective, we develop a taxonomy based on the main properties pointed out in previous research, unifying the notation and including all the known methods up to date. Empirically, we conduct an experimental study in supervised classification involving the most representative and newest discretizers, different types of classifiers and a large number of data sets. The results of their performances measured in terms of accuracy, number of intervals and inconsistency have been verified by means of nonparametric statistical tests. Additionally, a set of discretizers are highlighted as the best performing ones.
Data-sets partitions employed in the paper
The performance of discretization algorithms is analyzed by using forty data sets from the KEEL-datasets repository. The data sets considered are partitioned using the ten fold cross-validation (10-fcv) procedure.
Table 1 summarize the properties of the selected data sets. It shows, for each data-set, the name, number of examples, number of attributes (numeric and nominal) and number of classes. The last column of this table contains a link for downloading the 10-fold cross validation partitions for each data set in KEEL format. All data sets may be downloaded by clicking here.
Table 1: Summary description for classification data sets
Algorithms and parameters
We have used 30 discretizers in the experimental study, those that we have identified as the most relevant ones. Table 2 presents an enumeration of these discretizers, containing their complete name and the abbreviation used to identify them in the results section. For further details on their descriptions, the reader can visit the URL associated to the KEEL project. This web contains the original reference for each discretizer in the experimental study and the Java implementation for each one of them integrated in the KEEL Software tool.
Table 2: Discretizers used in the experimental study
|Equal Width Discretizer||EqualWidth|
|Equal Frequency Discretizer||EqualFrequency|
|Iterative Dichotomizer 3 Discretizer||ID3|
|Minimum Description Length Principle||MDLP|
|Class-Attribute Dependent Discretizer||CADD|
|Cluster Analysis Discretizer||ClusterAnalysis|
|Modified Chi2||Modified Chi2|
|Unparametrized Supervised Discretizer||USD|
|Class-Attribute Interdependence Maximization||CAIM|
|Extended Chi2||Extended Chi2|
|Unsupervised Correlation Preserving Discretizer||UCPD|
|Distribution Index-Based Discretizer||DIBD|
|Interval Distance Discretizer||IDD|
|Class-Attribute Contingency Coefficient||CACC|
|Fixed Frequency Discretizer||FFD|
|Hypercube Division Discretizer||HDD|
Furthermore, in this study six classifiers have been used in order to find differences in performance among the discretizers. The classifiers are:
- C4.5: A well-known decision tree, considered one of the top 10 DM algorithms.
- DataSqueezer: This learner belongs to the family of inductive rule extraction. In spite of its relative simplicity, DataSqueezer is a very effective learner. The rules generated by the algorithm are compact and comprehensible, but accuracy is to some extent degraded in order to achieve this goal.
- KNN: One of the simplest and most effective methods based on similarities among a set of objects. It is also considered one of the top 10 DM algorithms and it can handle nominal attributes using proper distance functions such as HVDM. It belongs to the lazy learning family.
- Naive Bayes: This is another of the the top 10 DM algorithms. Its aim is to construct a rule which will allow us to assign future objects to a class, assuming independence of attributes when probabilities are established.
- PUBLIC: It is an advanced decision tree that integrates the pruning phase with the building stage of the tree in order to avoid the expansion of branches that would be pruned afterwards.
- Ripper: This is a widely used rule induction method based on a separate and conquer strategy. It incorporates diverse mechanisms to avoid overfitting and to handle numeric and nominal attributes simultaneously. The models obtained are in the form of decision lists.
The parameters of the discretizers and classifiers are those recommended by their respective authors. They are specified in Tables 3 and 4 for those methods which require them. We assume that the choice of the values of parameters is optimally chosen by their own authors. Nevertheless, in discretizers that require the input of the number of intervals as a parameter, we use a rule of thumb which is dependent on the number of instances in the data set. It consists in dividing the number of instances by 100 and taking the maximum value between this result and the number of classes.
Table 3: Parameter specification for the discretizers employed in the experimentation
|1R||6 examples of the same class per interval|
|CADD||confidence threshold = 0.01|
|Chi2||inconsistency threshold = 0.02|
|ChiMerge||confidence threshold = 0.05|
|FDD||frequency size = 30|
|FUSINTER||α = 0.975, λ = 1|
|HDD||coefficient = 0.8|
|IDD||neighborhood = 3, windows size = 3, nominal distance|
|MODL||optimized process type|
|UCPD||intervals = [3, 6], KNN map type, neighborhood = 6, minimum support = 25, merged threshold = 0.5, scaling factor = 0.5, use discrete|
Table 4: Parameter specification for the classifiers employed in the experimentation
|C4.5||pruned tree, confidence = 0.25, 2 examples per leaf|
|DataSqueezer||pruning and generalization threshold = 0.05|
|KNN||K = 3, HVDM distance|
|PUBLIC||25 nodes between prune|
|Ripper||k = 2, grow set = 0.66|
In this section, we report the full results obtained in the experimental study. We divide the results in two parts: the first part contains results associated to the intrinsic properties of the results obtained by the discretizers while the second part presents the results associated to the classifiers when they use the discretized data sets.
We present these data in xls and ods files available for download. These files are organized as follows: each data set is put in the rows of the file and each discretizer performance results are placed in the columns of the file. There are two columns for every discretizer: one representing the average performance results (number of intervals, inconsistency rate, accuracy or kappa) and another with the standard deviation for that average performance results. Additionally, for measures computed in train and test partitions, the corresponding file contains two sheets with this information: one for the training set results and the other for the test set results.
Results from intrinsic properties of the discretizers
In this section we obtain the number of intervals generated and the the inconsistency rate for the train and test sets generated for each pair data set/discretizer. The files available for download are in Table 5.
Table 5: Intrinsic properties files of the discretizers
|Performance measure||xls file||ods file|
|Number of intervals|
Results for classifiers performance over the discretized data sets
In this section we obtain the classifiers performance over the discretized data sets generated with each discretizer. We use two performance measures that are widely used because of their simplicity and successful application when multi-class classification problems are treated. We refer to accuracy and Cohen's kappa measures, which will be adopted to measure the efficacy discretizers in terms of generalization classification rate.
- Accuracy: is the number of successful hits relative to the total number of classifications. It has been by far the most commonly used metric for assessing the performance of classifiers for years.
- Cohen’s kappa: is an alternative to accuracy, a method, known for decades, which compensates for random hits. Its original purpose was to measure the degree of agreement or disagreement between two people observing the same phenomenon. Cohen's kappa can be adapted to classification tasks and its use is recommended because it takes random successes into consideration as a standard, in the same way as the AUC measure.
Tables 6 and 7 present the Accuracy and Cohen's kappa files associated to the performance of the algorithms used in the study.
Table 6: Accuracy for the classifiers over the discretized data sets
|Algorithm||xls file||ods file|
Table 7: Cohen's kappa for the classifiers over the discretized data sets
|Algorithm||xls file||ods file|
Summary results for all data
Finally, this section summarizes the results from the previous sections. Here, we present three differenf files with information from the previous sections. All these files have the following structure: they sum up the information contained in the files from Tables 5, 6 and 7. Each row of those tables is summarized into a sheet of the file resulting in files that have the discretizer results in their rows and the average performance measure in their columns.
Statistical analysis are carried out by means of nonparametric statistical tests. The reader can look up the Statistical Inference in Computational Intelligence and Data Mining web page for details in nonparametric statistical analysis.
The Wilcoxon test is used in order to conduct pairwise comparisons among all discretizers considered in the study. Specifically, the Wilcoxon test is adopted in this study considering a level of significance equal to α = 0.05.
In Table 8 we provide the results of the Wilcoxon test for every performance measure used in this study. For each comparison we provide two different files: a pdf file with the detailed results of the Wilcoxon test and a second file with the summary associated to it.
Table 8: Wilcoxon Test results for every performance measure used in the study
JAVA code for the methods used
In this section, we provide information about the source code of the methods used in the experimental framework and their descriptions.
All the algorithms used in the comparison are integrated in the KEEL Software Tool. The complete list of algorithms available in it can be found here, including the discretization algorithms used in this study and the classifiers used over the generated discretized sets.