Complementary material to a paper on MV

This website contains two complementary documents related to the paper:

J. Luengo, S. García, F. Herrera, On the choice of an imputation method for missing values. A study of three groups of classification methods: rule induction learning, lazy learning and approximate methods. Knowledge and Information Systems 32:1 (2012) 77-108, doi:10.1007/s10115-011-0424-2

Summary:

Paper abstract
Complementary Document 1: Tables with the accuracy results for the different classification methods used in the paper
Complementary Document 2: Tables with the Wilcoxon Signed Rank test results summarized for each classification method used in the paper

Paper abstract.

In real-life data, loss of information is very frequent in data mining and is caused by the presence of missing values in attributes. Several schemes have been studied to overcome the drawbacks produced by missing values in data mining tasks; one of the best known is based on preprocessing, formally known as imputation. In this work we focus on a classification task with twenty-three classification methods and fourteen different approaches to Missing attribute Values treatment that are presented and analysed. The analysis involves a group-based approach, in which we distinguish between three different categories of classification methods. Each category behaves differently, and the evidence obtained shows that the use of certain Missing Values imputation methods could improve the accuracy obtained for these methods.

This study establishes the convenience of using imputation methods for preprocessing data sets with Missing Values. The analysis suggests that particular imputation methods should be chosen according to the group of classification methods.

Complementary Document 1:

Tables with the accuracy results for the different classification methods used in the paper

Summary: In this document we present the accuracy results obtained for each classification method. These are the complete results from the study performed in the contribution mentioned above. Both training and test accuracy are present for each method.

We have used a 10-fold cross-validation (10-fcv) scheme. Each data set has been preprocessed with 14 imputation methods. Therefore, there are 14 results for each fold; that is, the training and test partitions have been preprocessed with 14 different Missing Values estimation approaches. The mean over the 10 folds for each imputation method is shown in the rows of Tables 2 to 24. We have highlighted in bold the highest accuracy among the imputation methods for a given data set, that is, within each column. Thus, we can observe which imputation method yields the highest accuracy for the classification method on each data set.
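The aggregation described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the method names, the accuracy values and the helper names `mean_table` and `best_methods_per_dataset` are all hypothetical, and only two folds are shown per cell for brevity (the study uses ten).

```python
from statistics import mean

def mean_table(fold_results):
    """fold_results[method][dataset] is a list of per-fold accuracies.
    Returns the same nesting with each list replaced by its mean."""
    return {
        method: {ds: mean(accs) for ds, accs in per_ds.items()}
        for method, per_ds in fold_results.items()
    }

def best_methods_per_dataset(table):
    """For each data set (column), list the imputation methods (rows)
    whose mean accuracy is the highest; these are the cells set in bold."""
    datasets = next(iter(table.values())).keys()
    best = {}
    for ds in datasets:
        top = max(table[m][ds] for m in table)
        best[ds] = [m for m in table if table[m][ds] == top]
    return best

# Made-up accuracies for two hypothetical imputation methods.
fold_results = {
    "method_A": {"CLE": [0.50, 0.60], "WIS": [0.90, 0.95]},
    "method_B": {"CLE": [0.60, 0.70], "WIS": [0.80, 0.85]},
}
best = best_methods_per_dataset(mean_table(fold_results))
```

Returning a list per data set keeps every method that reaches the maximum, so several cells in one column can be bold at once.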

These tables can be downloaded as an Excel document by clicking on the following link.

Complementary Document 2:

Tables with the Wilcoxon Signed Rank test results summarized for each classification method used in the paper

Summary: In this document we present the summary of the Wilcoxon Signed Rank test results obtained for each classification method. These are the complete Wilcoxon tables for the study performed in the contribution mentioned above.

The procedure followed to obtain the tables is as follows:

We create an n x n table for each classification method. In each cell, the outcome of the Wilcoxon signed rank test is shown.
In these tables, if the p-value obtained by the Wilcoxon test for the considered classification method and a pair of imputation methods is higher than our α level, namely 0.1, then we establish that there is a tie in the comparison (i.e. no significant difference was found), represented by a D in the cell.
If the p-value obtained by the Wilcoxon test is lower than our α level, namely 0.1, then we establish that there is a win (represented by a W) or a loss (represented by an L) in the comparison. If the method presented in the row has a better ranking than the method presented in the column in the Wilcoxon test, then it is a win; otherwise it is a loss.
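The decision rule for a single cell can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `wilcoxon_cell` and its arguments are hypothetical, "better ranking" is taken to mean the larger of the two rank sums produced by the test, and a p-value exactly equal to α (a case the text does not cover) is treated as a tie.

```python
def wilcoxon_cell(p_value, row_rank_sum, col_rank_sum, alpha=0.1):
    """Return 'D', 'W' or 'L' for the cell (row method vs. column method).

    p_value      -- p-value of the Wilcoxon signed rank test for the pair
    row_rank_sum -- sum of ranks in favour of the row method (assumption)
    col_rank_sum -- sum of ranks in favour of the column method (assumption)
    """
    if p_value >= alpha:
        # No significant difference found: tie.
        # Treating p_value == alpha as a tie is an assumption.
        return "D"
    # Significant difference: the method with the better ranking wins.
    return "W" if row_rank_sum > col_rank_sum else "L"

# Made-up values for illustration only.
examples = [
    wilcoxon_cell(0.50, 80.0, 25.0),  # not significant
    wilcoxon_cell(0.03, 80.0, 25.0),  # significant, row method ranks better
    wilcoxon_cell(0.03, 25.0, 80.0),  # significant, column method ranks better
]
```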

Then for each table we have attached three extra columns.

The column "Ties+Wins" represents the number of D and W cells present in the row, that is, the number of times that the imputation method performs better than or equal to the rest for the classifier.
The column "Wins" represents the number of W cells present in the row, that is, the number of times that the imputation method performs better than the rest for the classifier.
The column "RANKING" shows the average ranking derived from the two previous columns. The higher the "Ties+Wins" value of a method, the better. If there is a draw in "Ties+Wins", then "Wins" is used to break the tie; again, the higher the "Wins" value, the better. If there is also a tie in "Wins", then an average ranking is established for all the tied methods m_i to m_j, given by:

$$RANKING=\frac{lRank(m_i,m_{i+1},\ldots,m_j)+hRank(m_i,m_{i+1},\ldots,m_j)}{2}$$

where lRank() represents the lowest ranking that all the tied methods can obtain (i.e. the last assigned ranking plus 1) and hRank() represents the highest possible rank for all the tied methods, that is

$$hRank(m_i,m_{i+1},\ldots,m_j)=lRank(m_i,m_{i+1},\ldots,m_j)+|(m_i,\ldots,m_j)|$$
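The two count columns and the resulting ordering can be sketched as below. The helper names `summarize_row` and `order_methods` and the sample table are hypothetical. The sketch stops at grouping the methods that tie on both counts; the averaged RANKING value for such a group follows the formula above and is not computed here.

```python
def summarize_row(cells):
    """Return (Ties+Wins, Wins) for one row of W/L/D cells."""
    wins = cells.count("W")
    ties_plus_wins = wins + cells.count("D")
    return ties_plus_wins, wins

def order_methods(table):
    """table maps each imputation method to the cells of its row
    (comparisons against every other method).
    Returns groups of methods, best group first; methods in the same
    group tie on both "Ties+Wins" and "Wins"."""
    groups = {}
    for method, cells in table.items():
        groups.setdefault(summarize_row(cells), []).append(method)
    # Tuples sort by "Ties+Wins" first and by "Wins" second.
    return [groups[key] for key in sorted(groups, reverse=True)]

# Made-up outcomes for four hypothetical imputation methods.
table = {
    "A": ["W", "W", "D"],  # vs. B, C, D
    "B": ["L", "D", "W"],  # vs. A, C, D
    "C": ["L", "D", "W"],  # vs. A, B, D
    "D": ["D", "L", "L"],  # vs. A, B, C
}
ordering = order_methods(table)
```

Here B and C tie on both counts, so they form one group and would share an averaged ranking.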

Data sets and Experimental results

For our study we have selected twenty-one data sets from the UCI repository. In all the experiments, we have adopted a 10-fold cross-validation model, i.e., we have split each data set randomly into 10 folds, each one containing 10% of the patterns of the data set. Thus, nine folds have been used for training and one for testing. Table 1 summarizes the properties of the selected data sets. It shows, for each data set, the number of instances (#Inst.), the number of attributes (#Atts.), the number of classes (#Cl.), the total percentage of Missing Values (% MV) and the percentage of instances with at least one Missing Value (% Inst. with MV). The last column of this table contains a link for downloading the 10-fold cross-validation partitions for each data set in KEEL format. You may also download all data sets by clicking here.
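A random 10-fold split of the kind described above might look like the following sketch. It is not the procedure that generated the downloadable partitions: the function name `ten_fold_partitions`, the fixed seed and the plain (non-stratified) shuffle are assumptions made for illustration.

```python
import random

def ten_fold_partitions(n_instances, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold.

    The instance indices are shuffled once and dealt into n_folds folds of
    roughly equal size; each fold serves as the test set exactly once while
    the remaining folds form the training set."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary choice
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# Example with 303 instances (the size listed for Cleveland in Table 1).
partitions = list(ten_fold_partitions(303))
```

To reproduce the published results, the KEEL partitions linked from Table 1 should be used instead of a fresh random split.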

Table 1. Summary description of the data sets used
Data set	Acronym	#Inst.	#Atts.	#Cl.	% MV	% Inst. with MV	Download.
Cleveland	CLE	303	14	5	0.14	1.98
Wisconsin	WIS	699	10	2	0.23	2.29
Credit	CRX	689	16	2	0.61	5.37
Breast	BRE	286	10	2	0.31	3.15
Autos	AUT	205	26	6	1.11	22.44
Primary tumor	PRT	339	18	21	3.69	61.06
Dermatology	DER	365	35	6	0.06	2.19
House-votes-84	HOV	434	17	2	5.3	46.54
Water-treatment	WAT	526	39	13	2.84	27.76
Sponge	SPO	76	46	12	0.63	28.95
Bands	BAN	540	40	2	4.63	48.7
Horse-colic	HOC	368	24	2	21.82	98.1
Audiology	AUD	226	71	24	1.98	98.23
Lung-cancer	LUN	32	57	3	0.27	15.63
Hepatitis	HEP	155	20	2	5.39	48.39
Mushroom	MUS	8124	23	2	1.33	30.53
Post-operative	POS	90	9	3	0.37	3.33
Echocardiogram	ECH	132	12	4	4.73	34.09
Soybean	SOY	307	36	19	6.44	13.36
Mammographic	MAM	961	6	2	2.81	13.63
Ozone	OZO	2534	73	2	8.07	27.11
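The two percentage columns of Table 1 can be computed along the lines of the sketch below. The function name `missing_value_stats` and the use of `None` as the missing-value marker are assumptions, as is the choice to pass only attribute values in each row; how the class column was counted in the study is not stated here.

```python
def missing_value_stats(rows):
    """rows is a list of instances, each a list of attribute values,
    with None marking a missing value (hypothetical convention).
    Returns (% MV, % Inst. with MV)."""
    n_cells = sum(len(row) for row in rows)
    n_missing = sum(1 for row in rows for value in row if value is None)
    n_incomplete = sum(1 for row in rows if any(v is None for v in row))
    pct_mv = 100.0 * n_missing / n_cells
    pct_inst_with_mv = 100.0 * n_incomplete / len(rows)
    return pct_mv, pct_inst_with_mv

# Tiny made-up data set: 4 instances, 3 attributes, 3 missing values.
rows = [
    [1, None, 3],
    [4, 5, 6],
    [None, None, 9],
    [1, 2, 3],
]
stats = missing_value_stats(rows)
```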
