Complementary material to a paper on Missing Values (MV)
This website contains two complementary documents related to the paper:
J. Luengo, S. García, F. Herrera, On the choice of an imputation method for missing values. A study of three groups of classification methods: rule induction learning, lazy learning and approximate methods. Knowledge and Information Systems 32:1 (2012) 77-108, doi:10.1007/s10115-011-0424-2
- Paper abstract
- Complementary Document 1: Tables with the accuracy results for the different classification methods used in the paper
- Complementary Document 2: Tables with the Wilcoxon Signed Rank test results summarized for each classification method used in the paper
J. Luengo, S. García, F. Herrera, On the choice of an imputation method for missing values. A study of three groups of classification methods: rule induction learning, lazy learning and approximate methods.
In real-life data, loss of information is very frequent in data mining and is produced by the presence of missing values in attributes. Several schemes have been studied to overcome the drawbacks produced by missing values in data mining tasks; one of the most well known is based on preprocessing and is commonly known as imputation. In this work we focus on a classification task in which twenty-three classification methods and fourteen different approaches to the treatment of Missing attribute Values are presented and analysed. The analysis involves a group-based approach, in which we distinguish between three different categories of classification methods. Each category shows different behaviour, and the evidence obtained shows that the use of particular Missing Values imputation methods can improve the accuracy obtained by these methods.
From this study, we conclude that it is convenient to use imputation methods to pre-process data sets with Missing Values. The analysis suggests that the use of particular imputation methods, chosen according to the group, is required.
Complementary Document 1:
Summary: In this document we present the accuracy results obtained for each classification method. These are the complete results of the study performed in the contribution mentioned above. Both training and test accuracies are given for each method.
We have used a 10-fold cross-validation (10-fcv) scheme. Each data set has been preprocessed with 14 imputation methods; therefore, there are 14 results for each fold, that is, the training and test partitions have been preprocessed with 14 different Missing Values estimation approaches. The mean of the 10 folds for each imputation method is shown in the rows of Tables 2 to 24. We have stressed in bold the highest accuracy among the imputation methods for a given data set, that is, within each column. Thus, we can observe the highest accuracy obtained by the classification method with a particular imputation method on each data set.
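As a minimal illustration (not the authors' code; the accuracy values below are synthetic), the per-method rows of Tables 2 to 24 can be obtained by averaging a (methods × folds) accuracy array and locating the best imputation method for a data set:

```python
import numpy as np

# Synthetic test-accuracy results: 14 imputation methods x 10 folds.
# The values are made up for illustration; the real tables hold the
# results obtained in the paper.
rng = np.random.default_rng(0)
acc = rng.uniform(0.70, 0.95, size=(14, 10))

# Row of the table for each imputation method: mean over the 10 folds.
mean_acc = acc.mean(axis=1)

# The entry stressed in bold for this data set: the best-performing method.
best_method = int(mean_acc.argmax())
print(f"best imputation method index: {best_method}, "
      f"mean test accuracy: {mean_acc[best_method]:.4f}")
```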
Complementary Document 2:
Summary: In this document we present the summary of the Wilcoxon Signed Rank test results obtained for each classification method. These are the complete Wilcoxon tables for the study performed in the contribution mentioned above.
The procedure followed to obtain the tables is as follows:
- We create an n x n table for each classification method. In each cell, the outcome of the Wilcoxon signed rank test is shown.
- In the aforementioned tables, if the p-value obtained by the Wilcoxon test for the considered classifier and a pair of imputation methods is higher than our α level, namely 0.1, then we establish that there is a tie in the comparison (i.e., no significant difference was found), represented by a D in the cell.
- If the p-value obtained by the Wilcoxon test is lower than our α level, namely 0.1, then we establish that there is a win (represented by a W) or a loss (represented by an L) in the comparison. If the method in the row has a better ranking than the method in the column in the Wilcoxon test, then it is a win; otherwise, it is a loss.
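The computation of one cell can be sketched as follows (our own illustration, not the authors' code), using scipy.stats.wilcoxon and taking the direction of the comparison from the signed-rank sums R+ and R-; the accuracy vectors at the end are hypothetical:

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

ALPHA = 0.1  # the alpha level used in the tables

def wilcoxon_cell(row_method_acc, col_method_acc):
    """Return 'W', 'D' or 'L' for the (row, column) cell of the n x n table."""
    x = np.asarray(row_method_acc, dtype=float)
    y = np.asarray(col_method_acc, dtype=float)
    p_value = wilcoxon(x, y).pvalue
    if p_value > ALPHA:
        return "D"  # tie: no significant difference was found
    # Direction: rank the absolute differences and compare R+ against R-.
    d = x - y
    d = d[d != 0]
    ranks = rankdata(np.abs(d))
    r_plus = ranks[d > 0].sum()
    r_minus = ranks[d < 0].sum()
    return "W" if r_plus > r_minus else "L"

# Hypothetical accuracies of two imputation methods over 21 data sets,
# where the second method is consistently worse than the first.
a = np.linspace(0.70, 0.90, 21)
b = a - np.linspace(0.01, 0.05, 21)
print(wilcoxon_cell(a, b))  # 'W': the row method significantly outperforms b
```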
Then, for each table, we have appended three extra columns.
- The column "Ties+Wins" represents the number of Ds and Ws present in the row, that is, the number of times that the imputation method performs better than or equal to the rest for the classifier.
- The column "Wins" represents the number of Ws present in the row, that is, the number of times that the imputation method performs better than the rest for the classifier.
- The column "RANKING" shows the average ranking derived from the two previous columns. The higher the "Ties+Wins" of a method, the better. If there is a draw in "Ties+Wins", then "Wins" is used to break the tie; again, the higher the "Wins" of a method, the better. If there is also a tie in "Wins", then an average ranking is established for all the tied methods mi to mj, given by:
AvgRank(mi, ..., mj) = (lRank() + hRank()) / 2
where lRank() represents the lowest ranking that all the tied methods can obtain (i.e., the last assigned ranking plus 1) and hRank() represents the highest possible ranking for them, that is
hRank() = lRank() + (j - i)
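Assuming the standard average-rank convention for ties, the RANKING column can be reproduced with scipy.stats.rankdata; the "Ties+Wins" and "Wins" counts below are made up for illustration:

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical "Ties+Wins" and "Wins" counts for 5 imputation methods.
ties_wins = np.array([13, 13, 12, 12, 10])
wins      = np.array([ 5,  5,  7,  4,  2])

# Sort first by "Ties+Wins", then by "Wins": encode both in one composite
# score (valid because wins is always smaller than the multiplier).
score = ties_wins * (len(ties_wins) + 1) + wins

# Higher score -> better (lower) rank; methods still tied on both columns
# share the average of the rank positions they occupy: here the first two
# methods occupy positions 1 and 2, so each gets (1 + 2) / 2 = 1.5.
ranking = rankdata(-score, method="average")
print(ranking)
```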
Data sets and Experimental results
For our study we have selected twenty-one data sets from the UCI repository. In all the experiments, we have adopted a 10-fold cross-validation model, i.e., we have split each data set randomly into 10 folds, each containing 10% of the patterns of the data set. Thus, nine folds have been used for training and one for testing. Table 1 summarizes the properties of the selected data sets. It shows, for each data set, the number of instances (#Inst.), the number of attributes (#Atts.), the number of classes (#Cl.), the total percentage of Missing Values (% MV) and the percentage of instances with at least one Missing Value (% Inst. with MV). The last column of this table contains a link for downloading the 10-fold cross-validation partitions of each data set in KEEL format. You may also download all the data sets by clicking here.
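The splitting scheme described above can be sketched as follows (a generic illustration, not the KEEL partitioning code; the data set size of 150 patterns is arbitrary):

```python
import numpy as np

def ten_fold_partitions(n_patterns, seed=0):
    """Split pattern indices into 10 random folds; each fold in turn is
    the test set and the remaining nine folds form the training set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_patterns)
    folds = np.array_split(indices, 10)
    for k in range(10):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(10) if j != k])
        yield train, test

# Each pattern appears exactly once as a test instance across the 10 folds.
partitions = list(ten_fold_partitions(150))
```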