               Statistical Inference in Computational Intelligence and Data Mining


This website contains additional material for the SCI2S research group papers on the use of non-parametric tests for Data Mining and Computational Intelligence.

The website is organized according to the following summary:

  1. Introduction to Inferential Statistics
  2. Conditions for the safe use of Nonparametric Tests
  3. Nonparametric tests
    3.1. Pairwise Comparisons
    3.2. Multiple Comparisons with a control method
    3.3. Post-hoc procedures
    3.4. Adjusted p-values
    3.5. Multiple Comparisons among all methods
  4. Case Studies
    4.1. Multiple Comparisons with a control method
    4.2. Multiple Comparisons among all methods
  5. Considerations and Recommendations on the use of Nonparametric tests
    5.1. Considerations on the use of Nonparametric tests
      5.1.1. Pairwise Comparisons
      5.1.2. Multiple Comparisons with a control method
    5.2. Recommendations on the use of Nonparametric tests
      5.2.1. Pairwise Comparisons
      5.2.2. Multiple Comparisons with a control method
      5.2.3. Multiple Comparisons among all methods
  6. Relevant Journal Papers with Data Mining and Computational Intelligence Case Studies
  7. Relevant books on Non-parametric tests
  8. Topic Slides
  9. Software and User's Guide

1. Introduction to Inferential Statistics

The experimental analysis of the performance of a new method is a crucial and necessary task in research on Data Mining and Computational Intelligence techniques. Deciding when one algorithm is better than another may not be a trivial task.

Hypothesis testing and p-values: In inferential statistics, sample data are primarily employed in two ways to draw inferences about one or more populations. One of them is hypothesis testing.

The most basic concept in hypothesis testing is a hypothesis. It can be defined as a prediction about a single population or about the relationship between two or more populations. Hypothesis testing is a procedure in which sample data are employed to evaluate a hypothesis. There is a distinction between research hypothesis and statistical hypothesis. The first is a general statement of what a researcher predicts. In order to evaluate a research hypothesis, it is restated within the framework of two statistical hypotheses. They are the null hypothesis, represented by the notation H0, and the alternative hypothesis, represented by the notation H1.

The null hypothesis is a statement of no effect or no difference. Since the statement of the research hypothesis generally predicts the presence of a difference with respect to whatever is being studied, the null hypothesis will generally be a hypothesis that the researcher expects to be rejected. The alternative hypothesis represents a statistical statement indicating the presence of an effect or a difference. In this case, the researcher generally expects the alternative hypothesis to be supported.

An alternative hypothesis can be nondirectional (two-tailed hypothesis) or directional (one-tailed hypothesis). The first type does not make a prediction in a specific direction; e.g. H1 : µ ≠ 100. The latter implies a choice of one of the following directional alternative hypotheses; e.g. H1 : µ > 100 or H1 : µ < 100.

Upon collecting the data for a study, the next step in the hypothesis testing procedure is to evaluate the data through the use of the appropriate inferential statistical test. An inferential statistical test yields a test statistic. The latter value is interpreted by employing special tables that contain information with regard to the expected distribution of the test statistic. Such tables contain extreme values of the test statistic (referred to as critical values) that are highly unlikely to occur if the null hypothesis is true. Such tables allow a researcher to determine whether or not the results of a study are statistically significant.

The conventional hypothesis testing model employed in inferential statistics assumes that, prior to conducting a study, a researcher stipulates whether a directional or nondirectional alternative hypothesis is employed, as well as the level of significance at which the null hypothesis is to be evaluated. The probability value which identifies the level of significance is represented by α.

When one employs the term significance in the context of scientific research, it is instructive to make a distinction between statistical significance and practical significance. Statistical significance only implies that the outcome of a study is highly unlikely to have occurred as a result of chance, but it does not necessarily suggest that any difference or effect detected in a set of data is of any practical value. For example, no one would normally care if algorithm A in continuous optimization solves the sphere function to within 10^-10 of error of the global optimum and algorithm B solves it to within 10^-15. Statistical significance could be found between them, but in a practical sense this difference is not significant.

Instead of stipulating a priori a level of significance α, one could calculate the smallest level of significance that results in the rejection of the null hypothesis. This is the definition of the p-value, which is a useful and interesting datum for many consumers of statistical analysis. A p-value provides information about whether a statistical hypothesis test is significant or not, and it also indicates something about “how significant” the result is: the smaller the p-value, the stronger the evidence against the null hypothesis. Most importantly, it does this without committing to a particular level of significance.

The most common way of obtaining the p-value associated with a hypothesis is by means of normal approximations: once the statistic associated with a statistical test or procedure has been computed, a specific expression or algorithm is used to obtain a z value, which follows a normal distribution. Then, by using normal distribution tables, we can obtain the p-value associated with z.
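As a minimal sketch of the last step, the table lookup can be replaced by the complementary error function from the Python standard library. The function names below are illustrative only and assume that z follows a standard normal distribution under the null hypothesis.

```python
import math


def two_sided_p_from_z(z):
    """Two-tailed p-value: P(|Z| >= |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def upper_tail_p_from_z(z):
    """One-tailed p-value: P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    print(two_sided_p_from_z(1.96))   # close to 0.05
    print(upper_tail_p_from_z(0.0))   # 0.5
```

For a directional alternative the one-tailed value is the relevant one; for a nondirectional alternative the two-tailed value is used.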

The reader can find more information about Inferential Statistics in chapter 1 of Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, Boca Raton (2006).




2. Conditions for the safe use of Nonparametric Tests

In order to distinguish a nonparametric test from a parametric one, we must check the type of data used by the test. A nonparametric test is one which uses nominal or ordinal data. This does not mean it can only be used for these types of data: it is possible to transform real-valued data into rank-based data. In this way, a non-parametric test can be applied to the data classically handled by a parametric test when they do not satisfy the conditions required by that test. As a general rule, a non-parametric test is less restrictive than a parametric one, although it is less powerful than a parametric one when the data are well conditioned.

Parametric tests have been commonly used in the analysis of experiments. For example, a common way to test whether the difference between two algorithms' results is non-random is to compute a paired t-test, which checks whether the average difference in their performance over the data sets is significantly different from zero. When comparing a set of multiple algorithms, the common statistical method for testing the differences between more than two related sample means is the repeated-measures ANOVA (or within-subjects ANOVA) (R. A. Fisher. Statistical methods and scientific inference (2nd edition). Hafner Publishing Co., New York, 1959.). The "related samples" are again the performances of the algorithms measured across the same problems. The null-hypothesis being tested is that all classifiers perform the same and the observed differences are merely random.

Unfortunately, parametric tests are based on assumptions which are most probably violated when analyzing the performance of computational intelligence and data mining algorithms (Zar, J.H.: Biostatistical Analysis. Prentice Hall, Englewood Cliffs (1999)). These assumptions are:

Independence: In statistics, two events are independent when the fact that one occurs does not modify the probability of the other one occurring.

Normality: An observation is normal when its behaviour follows a normal or Gaussian distribution with a certain mean μ and variance σ². A normality test applied over a sample can indicate the presence or absence of this condition in observed data.

Homoscedasticity: This property indicates the hypothesis of equality of variances. Levene’s test is used for checking whether or not k samples present this homogeneity of variances. When the observed data do not fulfill the normality condition, this test’s result is more reliable than that of Bartlett’s test (Zar, J.H.: Biostatistical Analysis. Prentice Hall, Englewood Cliffs (1999)), which checks the same property.

These conditions are usually not satisfied when analyzing results provided by computational intelligence experiments or comparisons of data mining algorithms. Let us show a case study involving a set of Neural Network classifiers run over multiple data sets using the well-known 10-fold cross validation procedure (10fcv). We apply the normality tests of Kolmogorov–Smirnov and D’Agostino–Pearson with a level of significance of α = 0.05 (we employ the SPSS statistical software package). The next tables show the results in 10fcv, where the symbol ‘*’ indicates that normality is not satisfied and the value in brackets is the p-value needed for rejecting the normality hypothesis.

Tab. 1. Test of Normality of Kolmogorov-Smirnov for 10fcv

Tab. 2. Test of Normality of D’Agostino–Pearson for 10fcv

As we can observe from the two tests, the conditions needed for the application of parametric tests are not fulfilled in some cases. The normality condition is not always satisfied even though the sample of results is large enough (50 in this case). A main factor that influences this condition seems to be the nature of the problem, since there are some problems in which it is never satisfied. D’Agostino–Pearson’s test is the most suitable test in these situations, where the sample of results frequently contains ties.

In addition, we present a case study done for a given sample of results. Figure 1 presents an example of graphical representations of histograms and Q–Q graphics. A histogram represents a statistical variable by using bars, so that the area of each bar is proportional to the frequency of the represented values. A Q–Q graphic plots the quantiles of the observed data against those of the normal distribution. In Fig. 1 we observe a common case of absolute lack of normality. The case corresponds to the run of the RBFN decremental algorithm in a Hold-Out Validation.

Fig. 1. Results of RBFN Decremental over crx data set in Hold-Out Validation: histogram and Q–Q graphic.

With regard to the homoscedasticity study, the next table shows the results of applying Levene’s test, where the symbol ‘*’ indicates that the variances of the distributions of the different algorithms for a certain data set are not homogeneous (we reject the null hypothesis). The homoscedasticity property is even more difficult to fulfill, since the variances associated with each problem also depend on the algorithm’s results, that is, on the capacity of the algorithms to offer similar results under random seed variations. This implies that an analysis of the performance of ANN methods carried out through parametric statistical treatment could lead to erroneous conclusions.

Tab. 3. Test of Homoscedasticity of Levene (based on means)




3. Nonparametric Tests

This section describes the most widely used nonparametric tests in the analysis of results:

Pairwise comparisons.

Multiple comparisons with a control method.

Multiple comparisons among all methods.


3.1. Pairwise comparisons.
In the discussion of the tests for comparing two methods over multiple cases of problems we will make two points. First, we warn against the widely used t-test as usually conceptually inappropriate and statistically unsafe. Since we will finally recommend the Wilcoxon signed-ranks test (F. Wilcoxon. Individual comparisons by ranking methods. Biometrics, 1:80–83, 1945), it will be presented in more detail. Another, even more rarely used test is the sign test, which is weaker than the Wilcoxon test but also has its distinct merits. Second, the described statistics measure differences between the methods from different aspects, so the selection of the test should be based not only on statistical appropriateness but also on what we intend to measure.

The Sign Test

A popular way to compare the overall performances of algorithms is to count the number of cases on which an algorithm is the overall winner. When multiple algorithms are compared, pairwise comparisons are sometimes organized in a matrix.

Some authors also use these counts in inferential statistics, with a form of binomial test that is known as the sign test (Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, Boca Raton (2006)). If the two algorithms compared are, as assumed under the null hypothesis, equivalent, each should win on approximately N/2 out of N problems. The number of wins is distributed according to the binomial distribution; the critical number of wins can be found in the next table. For a greater number of cases, the number of wins under the null hypothesis is distributed according to N(N/2, √N/2), which allows for the use of the z-test: if the number of wins is at least N/2 + 1.96·√N/2 (or, as a quick rule of thumb, N/2 + √N), the algorithm is significantly better with p < 0.05. Since tied matches support the null hypothesis, we should not discount them but split them evenly between the two methods; if there is an odd number of them, we again ignore one.

#cases   5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
α=0.05   5  6  7  7  8  9  9 10 10 11 12 12 13 13 14 15 15 16 17 18 18
α=0.1    5  6  6  7  7  8  9  9 10 10 11 12 12 13 13 14 14 15 16 16 17

Table 4. Critical values for the two-tailed sign test at α = 0.05 and α = 0.1. A method is significantly better than another if it wins on at least the number of cases shown in the corresponding row.
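The counting procedure of the sign test can be sketched as follows. This is an illustrative sketch only: the function name and its return format are assumptions, ties are split evenly as described above, and the exact p-value is computed as a two-tailed binomial probability, alongside the z value of the normal approximation.

```python
import math


def sign_test(wins, losses, ties=0):
    """Sign test on win/loss counts of one algorithm against another.

    Ties are split evenly between both algorithms; if their number is odd,
    one tie is ignored. Returns (n, x, p_exact, z), where n is the number of
    cases used, x the larger win count, p_exact the two-tailed binomial
    p-value and z the normal-approximation statistic.
    """
    half = ties // 2                  # an odd leftover tie is dropped
    w = wins + half
    l = losses + half
    n = w + l
    x = max(w, l)
    # P(X >= x) for X ~ Binomial(n, 1/2), doubled for a two-tailed test
    tail = sum(math.comb(n, i) for i in range(x, n + 1)) / 2 ** n
    p_exact = min(1.0, 2.0 * tail)
    # Under H0 the number of wins is approximately N(n/2, sqrt(n)/2)
    z = (x - n / 2.0) / (math.sqrt(n) / 2.0)
    return n, x, p_exact, z


if __name__ == "__main__":
    print(sign_test(9, 1))
```

The normal approximation is only advisable for a larger number of cases; for small N the exact binomial value is the safer of the two.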

The Wilcoxon Test

Wilcoxon’s test is used for answering this question: do two samples represent two different populations? It is a non-parametric procedure employed in a hypothesis testing situation involving a design with two samples. It is the analogue of the paired t-test among non-parametric statistical procedures; therefore, it is a pairwise test that aims to detect significant differences between the behavior of two algorithms.

The null hypothesis for Wilcoxon’s test is H0 : θD = 0; in the underlying populations represented by the two samples of results, the median of the difference scores equals zero. The alternative hypothesis is H1 : θD ≠ 0, although H1 : θD > 0 or H1 : θD < 0 can also be used as directional hypotheses.

In the following, we describe the test computations. Let di be the difference between the performance scores of the two algorithms on the i-th out of N cases. The differences are ranked according to their absolute values; average ranks are assigned in case of ties. Let R+ be the sum of ranks for the cases on which the second algorithm outperformed the first, and R− the sum of ranks for the opposite. Ranks of di = 0 are split evenly among the sums; if there is an odd number of them, one is ignored:

$$R^{+} = \sum_{d_i > 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i), \qquad R^{-} = \sum_{d_i < 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i)$$

Let T be the smaller of the sums, T = min(R+, R−). If T is less than or equal to the critical value of the Wilcoxon distribution for N degrees of freedom (Table B.12 in Zar, J.H.: Biostatistical Analysis. Prentice Hall, Englewood Cliffs (1999)), the null hypothesis is rejected.

The p-value associated with a comparison is obtained by means of the normal approximation for the Wilcoxon T statistic (Section VI, Test 18 in Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, Boca Raton (2006)). Furthermore, the computation of the p-value for this test is usually included in well-known statistical software packages (SPSS, SAS, R, etc.).
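The computation of R+, R− and T can be sketched in a few lines. This is a hedged illustration rather than a reference implementation: the function names are hypothetical, higher scores are assumed to be better, and the z value uses the common approximation with mean N(N+1)/4 and variance N(N+1)(2N+1)/24, without any tie correction.

```python
import math


def average_ranks(values):
    """Rank values in ascending order (1 = smallest), averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def wilcoxon_signed_ranks(first, second):
    """Return (R+, R-, T, z, two-tailed p) for paired scores of two algorithms."""
    d = [b - a for a, b in zip(first, second)]
    zeros = [i for i, v in enumerate(d) if v == 0]
    if len(zeros) % 2 == 1:
        d.pop(zeros[0])               # with an odd number of zeros, ignore one
    ranks = average_ranks([abs(v) for v in d])
    zero_share = sum(r for r, v in zip(ranks, d) if v == 0) / 2.0
    r_plus = sum(r for r, v in zip(ranks, d) if v > 0) + zero_share
    r_minus = sum(r for r, v in zip(ranks, d) if v < 0) + zero_share
    t = min(r_plus, r_minus)
    n = len(d)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (t - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return r_plus, r_minus, t, z, p


if __name__ == "__main__":
    print(wilcoxon_signed_ranks([1, 2, 3, 4, 5], [2, 4, 6, 8, 4]))
```

For small N the normal approximation is rough, so the tabulated critical values of T remain preferable in that case.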

3.2. Multiple comparisons with a control method.
Wilcoxon’s test performs individual comparisons between two algorithms (pairwise comparisons). The p-value in a pairwise comparison is independent from another one. If we try to extract a conclusion involving more than one pairwise comparison in a Wilcoxon analysis, we will obtain an accumulated error coming from the combination of pairwise comparisons. In statistical terms, we are losing control of the Family-Wise Error Rate (FWER), defined as the probability of making one or more false discoveries among all the hypotheses when performing multiple pairwise tests. The true statistical significance for combining pairwise comparisons is given by:

$$p = P(\text{Reject } H_0 \mid H_0 \text{ true}) = 1 - \prod_{i=1}^{k-1} \left(1 - p_{H_i}\right)$$

where $p_{H_i}$ is the p-value of the i-th of the k − 1 pairwise comparisons performed.

So, a pairwise comparison test, such as Wilcoxon's test, should not be used to conduct various comparisons involving a set of algorithms, because the FWER is not controlled. The expression defined above computes the true significance obtained after performing several comparisons; hence the level of significance cannot be set before performing the comparisons and the statistical significance cannot be known a priori.

In order to carry out a comparison which involves more than two methods at a level of significance established prior to the statistical analysis, multiple comparison tests should be used. In this part, we describe the most widely used tests for performing multiple comparisons, together with a set of post-hoc procedures to compare a control method with other methods (1 x n comparisons). We refer to the Friedman test and its derivatives.
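To illustrate how quickly the error accumulates, the short sketch below combines independent pairwise p-values as one minus the product of (1 − p) over all comparisons. The function name is hypothetical and the independence of the comparisons is assumed.

```python
def combined_p_value(p_values):
    """True significance of a conclusion drawn from several independent
    pairwise comparisons: 1 - prod(1 - p_i)."""
    prod = 1.0
    for p in p_values:
        prod *= (1.0 - p)
    return 1.0 - prod


if __name__ == "__main__":
    # Five comparisons, each individually significant at 0.05
    print(combined_p_value([0.05] * 5))   # about 0.226
```

Five comparisons that are each just significant at 0.05 therefore support a joint conclusion only at a level of roughly 0.23.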

Multiple Sign Test

The multiple sign test allows us to compare all of the other algorithms with a control algorithm. It carries out the following steps:

  1. Represent by xi1 and xij the performances of the control and the jth classifier in the ith data set.
  2. Compute the signed differences dij = xij - xi1. In other words, pair each performance with the control and, in each data set, subtract the control performance from that of the jth classifier.
  3. Let rj equal the number of differences, dij, that have the less frequently occurring sign (either positive or negative) within a pairing of an algorithm with the control.
  4. Let M1 be the median response of a sample of results of the control method and Mj be the median response of a sample of results of the jth algorithm. Apply one of the following decision rules:
    1. For testing H0 : Mj ≥ M1 against H1 : Mj < M1, reject H0 if the number of plus signs is less than or equal to the critical value of Rj appearing in Table A.1 in (S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental Analysis of Power. Information Sciences 180 (2010) 2044–2064.) for k - 1 (number of algorithms excluding control), n and the chosen experimentwise error rate.
    2. For testing H0 : Mj ≤ M1 against H1 : Mj > M1, reject H0 if the number of minus signs is less than or equal to the critical value of Rj appearing in Table A.1 in (S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental Analysis of Power. Information Sciences 180 (2010) 2044–2064.) for k - 1, n and the chosen experimentwise error rate.
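The counting steps above can be sketched as follows. The function names and the layout of `results` (one row per data set, one column per algorithm) are assumptions made for illustration, and the critical value is not computed here: it must be looked up in the cited Table A.1 and passed in by the user.

```python
def multiple_sign_counts(results, control=0):
    """Count plus and minus signs of d_ij = x_ij - x_i,control.

    results[i][j] is the performance of algorithm j on data set i.
    Returns {j: {"plus": ..., "minus": ..., "r": ...}} for every j != control,
    where r is the count of the less frequent sign.
    """
    k = len(results[0])
    counts = {}
    for j in range(k):
        if j == control:
            continue
        diffs = [row[j] - row[control] for row in results]
        plus = sum(1 for d in diffs if d > 0)
        minus = sum(1 for d in diffs if d < 0)
        counts[j] = {"plus": plus, "minus": minus, "r": min(plus, minus)}
    return counts


def reject_h0(count, critical_value, alternative):
    """Decision rule; critical_value comes from the cited table (user supplied).

    alternative="less":    H1: Mj < M1, uses the number of plus signs.
    alternative="greater": H1: Mj > M1, uses the number of minus signs.
    """
    observed = count["plus"] if alternative == "less" else count["minus"]
    return observed <= critical_value


if __name__ == "__main__":
    data = [[0.8, 0.7, 0.9], [0.6, 0.5, 0.7], [0.9, 0.95, 0.92], [0.7, 0.6, 0.8]]
    print(multiple_sign_counts(data, control=0))
```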

Contrast Estimation based on Medians

Using the data resulting from the run of various classifiers over multiple data sets in an experiment, the researcher could be interested in estimating the difference between the performance of two classifiers. A procedure for this purpose assumes that the expected differences between performances of algorithms are the same across data sets. We assume that the performance is reflected by the magnitudes of the differences between the performances of the algorithms. Consequently, we are interested in estimating the contrast between medians of samples of results considering all pairwise comparisons. The procedure obtains a quantitative difference, computed through medians, between two algorithms over multiple data sets, but the value obtained will change when other data sets are used in the experiment.

It carries out the following steps:

  1. For every pair of the k algorithms in the experiment, we compute the difference between the performances of the two algorithms in each of the n data sets. In other words, we compute the differences
    $$D_{i(uv)} = x_{iu} - x_{iv}$$
    where i = 1, ..., n; u = 1, ..., k; and v = 1, ..., k. We form performance pairs only for those in which u < v.
  2. We find the median of each set of differences and call it $Z_{uv}$. We call $Z_{uv}$ the unadjusted estimator of $M_u - M_v$. Since $Z_{vu} = -Z_{uv}$, we only have to calculate $Z_{uv}$ for the case where u < v. There are k(k - 1)/2 of these medians. Also note that $Z_{uu} = 0$.
  3. We compute the mean of each set of unadjusted medians having the same first subscript and call the result $m_u$; that is, we compute
    $$m_u = \frac{\sum_{j=1}^{k} Z_{uj}}{k}, \quad u = 1, \ldots, k$$
  4. The estimator of $M_u - M_v$ is $m_u - m_v$, where u and v range from 1 through k. For example, the difference between $M_1$ and $M_2$ is estimated by $m_1 - m_2$.
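The four steps can be sketched compactly as below. The function name and the layout of `results` (rows are data sets, columns are algorithms) are assumptions for illustration, and the sketch relies on the median of the reversed differences being the negative of the original median.

```python
from statistics import median


def contrast_estimation(results):
    """Return a k x k matrix whose entry [u][v] estimates M_u - M_v.

    results[i][j] is the performance of algorithm j on data set i.
    """
    k = len(results[0])
    # Steps 1 and 2: unadjusted medians Z_uv of the pairwise differences
    z = [[0.0] * k for _ in range(k)]
    for u in range(k):
        for v in range(u + 1, k):
            z[u][v] = median(row[u] - row[v] for row in results)
            z[v][u] = -z[u][v]
    # Step 3: mean of the unadjusted medians sharing the first subscript
    m = [sum(z[u]) / k for u in range(k)]
    # Step 4: the estimator of M_u - M_v is m_u - m_v
    return [[m[u] - m[v] for v in range(k)] for u in range(k)]


if __name__ == "__main__":
    data = [[1, 2, 4], [2, 3, 5], [3, 4, 6]]
    for row in contrast_estimation(data):
        print(row)
```

In this toy input the differences between algorithms are constant across data sets, so the estimated contrasts coincide with those constant differences.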

The Friedman Test

Friedman’s test is used for answering this question: In a set of k samples (where k ≥ 2), do at least two of the samples represent populations with different median values? It is a non-parametric procedure employed in a hypothesis testing situation involving a design with two or more samples. It is the analogue of the repeated-measures ANOVA among non-parametric statistical procedures; therefore, it is a multiple comparison test that aims to detect significant differences between the behavior of two or more algorithms.

The null hypothesis for Friedman’s test is H0 : θ1 = θ2 = ··· = θk ; the median of population i equals the median of population j, i ≠ j, 1 ≤ i ≤ k, 1 ≤ j ≤ k. The alternative hypothesis is H1 : Not H0, so it is non-directional.

Next, we describe the test computations. For each problem i, the observed results of the k algorithms are ranked, assigning rank 1 to the best of them and rank k to the worst; let $r_i^j$ be the rank of algorithm j on problem i. Under the null hypothesis, which states that the results of the algorithms are equivalent and, therefore, their rankings are also similar, Friedman’s statistic

$$\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right]$$

is distributed according to a χ² distribution with k − 1 degrees of freedom, where $R_j = \frac{1}{N} \sum_{i=1}^{N} r_i^j$ and N is the number of cases of the problem considered. The critical values for Friedman’s statistic coincide with those of the χ² distribution when N > 10 and k > 5.
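As a sketch of the computation, the code below ranks the algorithms within each case, averages the ranks per algorithm, and evaluates the usual form of the Friedman statistic, 12N/(k(k+1)) · [Σ R_j² − k(k+1)²/4]. The function names, the `higher_is_better` flag and the use of average ranks for ties are assumptions made for this illustration.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest), averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def friedman_statistic(results, higher_is_better=True):
    """Return (average ranks R_j, Friedman chi-square statistic).

    results[i][j] is the performance of algorithm j on case i.
    """
    n = len(results)
    k = len(results[0])
    rank_sums = [0.0] * k
    for row in results:
        # Negate so that the best performance receives rank 1
        keys = [-x for x in row] if higher_is_better else list(row)
        for j, r in enumerate(average_ranks(keys)):
            rank_sums[j] += r
    avg = [s / n for s in rank_sums]
    chi2 = 12.0 * n / (k * (k + 1)) * (sum(r * r for r in avg) - k * (k + 1) ** 2 / 4.0)
    return avg, chi2


if __name__ == "__main__":
    data = [[0.9, 0.8, 0.7]] * 4   # algorithm 0 always best, algorithm 2 always worst
    print(friedman_statistic(data))
```

The p-value is then read from a χ² distribution with k − 1 degrees of freedom, for instance with a statistics package.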

The Iman and Davenport Test

Iman and Davenport (Iman, R.L., Davenport, J.M.: Approximations of the critical region of the Friedman statistic. Commun. Stat. 18, 571–595 (1980)) proposed a statistic derived from Friedman’s, given that the latter is undesirably conservative. The proposed statistic is

$$F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2}$$

and it is distributed according to an F distribution with k − 1 and (k − 1)(N − 1) degrees of freedom.
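A minimal sketch of this correction follows, using the usual form F_F = (N − 1)χ²_F / (N(k − 1) − χ²_F). The function name is hypothetical, and returning infinity when the denominator vanishes (perfectly consistent rankings) is a choice made here, not something prescribed by the text.

```python
import math


def iman_davenport(chi2_f, n, k):
    """Return (F_F, (df1, df2)) from a Friedman statistic chi2_f computed
    on n cases and k algorithms."""
    denom = n * (k - 1) - chi2_f
    df = (k - 1, (k - 1) * (n - 1))
    if denom == 0:
        # Friedman's statistic reached its maximum value n(k - 1)
        return math.inf, df
    return (n - 1) * chi2_f / denom, df


if __name__ == "__main__":
    print(iman_davenport(6.0, 4, 3))   # (9.0, (2, 6))
```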

Computation of the p-values given a χ² or F_F statistic can be done by using the algorithms in Abramowitz, M.: Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables. Dover, New York (1974). Also, most statistical software packages include it.

The rejection of the null hypothesis in both tests described above does not identify which of the compared algorithms differ. They only inform us about the presence of differences among all the samples of results compared. In order to conduct pairwise comparisons within the framework of multiple comparisons, we can proceed with a post-hoc procedure. In this case, a control algorithm (maybe a proposal to be compared) is usually chosen. Then, the post-hoc procedures compare the control algorithm with the remaining k − 1 algorithms. Post-hoc procedures are the subject of Section 3.3.

Friedman Aligned Ranks Test

The Friedman test is based on n sets of ranks, one set for each data set in our case; and the performances of the algorithms analyzed are ranked separately for each data set. Such a ranking scheme allows for intra-set comparisons only, since inter-set comparisons are not meaningful. When the number of algorithms for comparison is small, this may pose a disadvantage. In such cases, comparability among data sets is desirable and we can employ the method of aligned ranks (J.L. Hodges, E.L. Lehmann, Ranks methods for combination of independent experiments in analysis of variance, Annals of Mathematical Statistics 33 (1962) 482–497).


In this technique, a value of location is computed as the average performance achieved by all algorithms in each data set. Then, the difference between the performance obtained by an algorithm and the value of location is calculated. This step is repeated for each combination of algorithm and data set. The resulting differences, called aligned observations, which keep their identities with respect to the data set and the combination of algorithms to which they belong, are then ranked from 1 to kn relative to each other. The ranking scheme is then the same as that employed by a multiple comparison procedure which employs independent samples, such as the Kruskal–Wallis test. The ranks assigned to the aligned observations are called aligned ranks.


The Friedman Aligned Ranks test statistic can be written as

$$T = \frac{(k-1)\left[\sum_{j=1}^{k} \hat{R}_{\cdot j}^{2} - \frac{kn^{2}}{4}(kn+1)^{2}\right]}{\frac{kn(kn+1)(2kn+1)}{6} - \frac{1}{k}\sum_{i=1}^{n} \hat{R}_{i \cdot}^{2}}$$

where $\hat{R}_{i \cdot}$ is equal to the rank total of the ith data set and $\hat{R}_{\cdot j}$ is the rank total of the jth algorithm.

The test statistic T is compared for significance with a chi-square distribution with k - 1 degrees of freedom. Critical values can be found in Table A3 in (Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, Boca Raton (2006)). Furthermore, the p-value could be computed through normal approximations (Abramowitz, M.: Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables. Dover, New York (1974)). If the null hypothesis is rejected, we can proceed with a post hoc test.
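The aligned-ranks procedure can be sketched as follows. The sketch assumes the statistic T = (k − 1)[Σ_j R̂²_.j − (kn²/4)(kn + 1)²] / [kn(kn + 1)(2kn + 1)/6 − (1/k) Σ_i R̂²_i.], with R̂_.j the rank total of algorithm j and R̂_i. the rank total of data set i. The function names, the `higher_is_better` flag and the averaging of tied ranks are illustrative assumptions.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest), averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def friedman_aligned_ranks(results, higher_is_better=True):
    """Return (rank totals per algorithm, rank totals per data set, T).

    results[i][j] is the performance of algorithm j on data set i.
    """
    n = len(results)
    k = len(results[0])
    aligned = []
    for row in results:
        location = sum(row) / k                  # average performance in the data set
        aligned.extend(x - location for x in row)
    # Rank all k*n aligned observations together; best observation gets rank 1
    keys = [-a for a in aligned] if higher_is_better else aligned
    ranks = average_ranks(keys)
    row_totals = [sum(ranks[i * k:(i + 1) * k]) for i in range(n)]
    col_totals = [sum(ranks[i * k + j] for i in range(n)) for j in range(k)]
    num = (k - 1) * (sum(c * c for c in col_totals) - (k * n * n / 4.0) * (k * n + 1) ** 2)
    den = k * n * (k * n + 1) * (2 * k * n + 1) / 6.0 - sum(r * r for r in row_totals) / k
    return col_totals, row_totals, num / den


if __name__ == "__main__":
    print(friedman_aligned_ranks([[1, 3], [2, 6]]))
```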

Quade Test

The Friedman test considers all data sets to be equal in terms of importance. An alternative to this could take into account the fact that some data sets are more difficult or the differences registered on the run of various algorithms over them are larger. The rankings computed on each data set could be scaled depending on the differences observed in the algorithms’ performances. The Quade test conducts a weighted ranking analysis of the sample of results (D. Quade, Using weighted rankings in the analysis of complete blocks with additive block effects, Journal of the American Statistical Association 74 (1979) 680–683.).

The procedure starts by finding the ranks $r_i^j$ in the same way as the Friedman test does. The next step requires the original performance values of the classifiers, $x_{ij}$. Ranks are assigned to the data sets themselves according to the size of the sample range in each data set. The sample range within data set i is the difference between the largest and the smallest observations within that data set:

$$\text{Range in data set } i = \max_{j}\{x_{ij}\} - \min_{j}\{x_{ij}\}$$

Obviously, there are n sample ranges, one for each data set. Assign rank 1 to the data set with the smallest range, rank 2 to the second smallest, and so on to the data set with the largest range, which gets rank n. Use average ranks in case of ties. Let Q1,Q2, ... ,Qn be the ranks assigned to data sets 1, 2, ... , n, respectively.

Finally, the data set rank $Q_i$ is multiplied by the difference between the rank within data set i, $r_i^j$, and the average rank within data sets, (k + 1) / 2, to get the product $S_{ij}$, where

$$S_{ij} = Q_i \left[ r_i^j - \frac{k+1}{2} \right]$$

is a statistic that represents the relative size of each observation within the data set, adjusted to reflect the relative significance of the data set in which it appears.

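For convenience and to establish a relationship with the Friedman test, we will also use rankings without average adjusting:

$$W_{ij} = Q_i \, r_i^j$$

Let $S_j$ denote the sum for each classifier,

$$S_j = \sum_{i=1}^{n} S_{ij} \quad \text{and} \quad W_j = \sum_{i=1}^{n} W_{ij}, \quad j = 1, \ldots, k$$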

Next we must calculate the terms:

$$A_2 = \frac{n(n+1)(2n+1)\,k(k+1)(k-1)}{72}, \qquad B = \frac{1}{n} \sum_{j=1}^{k} S_j^2$$

The test statistic is

$$F_Q = \frac{(n-1)\,B}{A_2 - B}$$

which is distributed according to the F-distribution with k - 1 and (k - 1)(n - 1) degrees of freedom. The table of critical values for the F-distribution is given in Table A10 in Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, Boca Raton (2006). Moreover, the p-value could be computed through normal approximations (Abramowitz, M.: Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables. Dover, New York (1974)). If A2 = B, consider the point to be in the critical region of the statistical distribution and calculate the p-value as (1/k!)^(n-1). If the null hypothesis is rejected, we can proceed with a post hoc test.
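The weighted ranking of the Quade test can be sketched as below. The sketch assumes S_ij = Q_i [r_ij − (k + 1)/2], W_ij = Q_i r_ij, A2 = n(n + 1)(2n + 1)k(k + 1)(k − 1)/72, B = (1/n) Σ_j S_j² and F_Q = (n − 1)B / (A2 − B); this closed form of A2 presumes no ties. Function names and the `higher_is_better` flag are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values in ascending order (1 = smallest), averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def quade_statistic(results, higher_is_better=True):
    """Return (S_j list, W_j list, F_Q) for results[i][j] = algorithm j on data set i."""
    n = len(results)
    k = len(results[0])
    # Ranks within each data set (1 = best), as in the Friedman test
    within = [average_ranks([-x for x in row] if higher_is_better else list(row))
              for row in results]
    # Rank the data sets by their sample range (1 = smallest range)
    q = average_ranks([max(row) - min(row) for row in results])
    s = [sum(q[i] * (within[i][j] - (k + 1) / 2.0) for i in range(n)) for j in range(k)]
    w = [sum(q[i] * within[i][j] for i in range(n)) for j in range(k)]
    a2 = n * (n + 1) * (2 * n + 1) * k * (k + 1) * (k - 1) / 72.0
    b = sum(sj * sj for sj in s) / n
    if a2 == b:
        return s, w, math.inf        # point lies in the critical region
    return s, w, (n - 1) * b / (a2 - b)


if __name__ == "__main__":
    print(quade_statistic([[1, 2], [4, 1], [7, 2]]))
```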

3.3. Post-hoc Procedures.
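We focus on the comparison between a control method, which is usually the proposed method, and a set of algorithms used in the empirical study. This set of comparisons is associated with a set or family of hypotheses, all of which are related to the control method. Any of the post hoc tests is suitable for application to nonparametric tests working over a family of hypotheses. The test statistic for comparing the ith algorithm and the jth algorithm depends on the main nonparametric procedure used:

Friedman test:

$$z = \frac{R_i - R_j}{\sqrt{\frac{k(k+1)}{6n}}}$$

where $R_i$ and $R_j$ are the average rankings of the algorithms compared, computed by the Friedman test.

Friedman Aligned Ranks test:

$$z = \frac{\hat{R}_i - \hat{R}_j}{\sqrt{\frac{k(n+1)}{6}}}$$

where $\hat{R}_i$ and $\hat{R}_j$ are the average rankings of the algorithms compared, computed by the Friedman Aligned Ranks test.

Quade test:

$$z = \frac{T_i - T_j}{\sqrt{\frac{k(k+1)(2n+1)(k-1)}{18n(n+1)}}}$$

where $T_i = \frac{W_i}{n(n+1)/2}$, $T_j = \frac{W_j}{n(n+1)/2}$, and $W_i$ and $W_j$ are the rankings without average adjusting computed by the Quade test.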
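For the Friedman case, the comparison of two average rankings is commonly based on z = (R_i − R_j) / sqrt(k(k + 1)/(6n)), whose p-value is read from the normal distribution. The sketch below assumes this form; the function name is hypothetical and the returned p-value is unadjusted, so a post-hoc procedure still has to be applied to it.

```python
import math


def friedman_posthoc_z(avg_rank_i, avg_rank_j, k, n):
    """Return (z, unadjusted two-tailed p-value) for two Friedman average ranks,
    with k algorithms compared over n data sets."""
    standard_error = math.sqrt(k * (k + 1) / (6.0 * n))
    z = (avg_rank_i - avg_rank_j) / standard_error
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


if __name__ == "__main__":
    # Control with average rank 1.5 against an algorithm with average rank 3.0
    print(friedman_posthoc_z(1.5, 3.0, k=4, n=24))
```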

3.5. Multiple comparisons among all methods.
Friedman’s test is an omnibus test which can be used to carry out these types of comparison. It allows differences to be detected considering the global set of classifiers. Once Friedman’s test rejects the null hypothesis, we can proceed with a post-hoc test in order to find the specific pairwise comparisons which produce differences. Previously, we focused on procedures that control the family-wise error when comparing with a control classifier, arguing that the objective of a study is to test whether a newly proposed method is better than the existing ones. For this reason, we have described and studied procedures such as Bonferroni-Dunn, Holm’s and Hochberg’s methods.

When our interest lies in carrying out a multiple comparison in which all possible pairwise comparisons need to be computed (n x n comparison), the classic procedure that can be used is the Nemenyi procedure (P. B. Nemenyi. Distribution-free Multiple comparisons. PhD thesis, Princeton University, 1963). It adjusts the value of α in a single step by dividing the value of α by the number of comparisons performed, m = k(k−1)/2. This procedure is the simplest but it also has little power.
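The single-step adjustment is simple enough to state in a few lines. In this sketch the function name is hypothetical, and the unadjusted p-values of the pairwise comparisons are assumed to have been computed beforehand.

```python
def nemenyi_decisions(p_values, k, alpha=0.05):
    """Compare each unadjusted pairwise p-value against alpha / m,
    where m = k(k - 1)/2 is the number of pairwise comparisons.

    Returns (adjusted alpha, list of booleans: True means H0 is rejected).
    """
    m = k * (k - 1) // 2
    adjusted_alpha = alpha / m
    return adjusted_alpha, [p <= adjusted_alpha for p in p_values]


if __name__ == "__main__":
    # Four classifiers give m = 6 comparisons, so alpha becomes 0.05 / 6
    print(nemenyi_decisions([0.001, 0.02, 0.2, 0.004, 0.6, 0.03], k=4))
```

With only four classifiers the threshold already drops to about 0.0083, which illustrates why the procedure has little power.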

The hypotheses being tested belonging to a family of all pairwise comparisons are logically interrelated so that not all combinations of true and false hypotheses are possible. As a simple example of such a situation suppose that we want to test the three hypotheses of pairwise equality associated with the pairwise comparisons of three classifiers Ci; i = 1, 2, 3. It is easily seen from the relations among the hypotheses that if any one of them is false, at least one other must be false. For example, if C1 is better/worse than C2, then it is not possible that C1 has the same performance as C3 and C2 has the same performance as C3. C3 must be better/worse than C1 or C2 or the two classifiers at the same time. Thus, there cannot be one false and two true hypotheses among these three.

Based on this argument, Shaffer proposed two procedures which make use of the logical relation among the family of hypotheses for adjusting the value of α (J.P. Shaffer. Modified sequentially rejective multiple test procedures. Journal of the American Statistical Association, 81(395):826–831, 1986.).
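The logical relations can be made concrete by listing how many hypotheses can be true at the same time. The sketch below assumes the recursive characterization S(k) = union over j = 1..k of { C(j,2) + x : x in S(k − j) }, with S(0) = S(1) = {0}, as used in the description of Shaffer's procedures; the function name is hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def possible_true_counts(k):
    """Set of possible numbers of true pairwise hypotheses among k methods."""
    if k <= 1:
        return frozenset({0})
    counts = set()
    for j in range(1, k + 1):
        # j methods are mutually equal (C(j, 2) true hypotheses);
        # the remaining k - j methods are handled recursively.
        for x in possible_true_counts(k - j):
            counts.add(comb(j, 2) + x)
    return frozenset(counts)


print(sorted(possible_true_counts(3)))
print(sorted(possible_true_counts(5)))
```

For three classifiers the value 2 never appears, which matches the argument above that there cannot be exactly two true hypotheses and one false one.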

Bergmann and Hommel proposed a procedure based on the idea of finding all elementary hypotheses which cannot be rejected (G. Bergmann and G. Hommel. Improvements of general multiple test procedures for redundant systems of hypotheses. In P. Bauer, G. Hommel, and E. Sonnemann, editors, Multiple Hypotheses Testing, pages 100–115. Springer, Berlin, 1988). In order to formulate Bergmann-Hommel’s procedure, we need the following definition.

Definition 1: An index set of hypotheses I ⊆ {1, ..., m} is called exhaustive if exactly all Hj, j ∈ I, could be true.
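To make the definition more tangible, the following sketch enumerates exhaustive sets for a small number of methods. It assumes that each exhaustive set corresponds to one way of partitioning the methods into groups of equal performance, so that a hypothesis is true exactly when both of its methods fall in the same group. Hypotheses are indexed here by pairs (i, j) rather than by 1, ..., m, and all names are illustrative.

```python
from itertools import combinations


def set_partitions(items):
    """Yield every partition of a list into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Add `first` to each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or place it in a new block of its own.
        yield [[first]] + partition


def exhaustive_sets(k):
    """Sets of pairwise hypotheses that could be exactly the true ones."""
    result = set()
    for partition in set_partitions(list(range(1, k + 1))):
        true_hypotheses = frozenset(
            pair
            for block in partition
            for pair in combinations(sorted(block), 2)
        )
        result.add(true_hypotheses)
    return result


for hypotheses in sorted(exhaustive_sets(3), key=len):
    print(sorted(hypotheses))
```

The empty set (all hypotheses false) is included in this enumeration; some presentations leave it out. The number of sets grows quickly with k, which is why the procedure is computationally demanding for many methods.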

Under this definition, Bergmann-Hommel procedure works as follows.

Finally, we will explain how to compute the APVs for the three post-hoc procedures described above, following the indications given in Wright, S.P.: Adjusted p-values for simultaneous inference. Biometrics 48, 1005–1013 (1992).
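Since the formulas are not reproduced here, the following sketch shows adjusted p-values for two of the procedures under their commonly stated definitions: Nemenyi as min{m·p_i, 1}, and Holm as min{max over j ≤ i of (m − j + 1)·p_j, 1} with p-values sorted in increasing order. These definitions and the function names are assumptions of this illustration.

```python
def nemenyi_apv(p_values):
    """Single-step adjustment: multiply every p-value by m, cap at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


def holm_apv(p_values):
    """Step-down adjustment, returned in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for position, idx in enumerate(order):
        # position is 0-based, so the multiplier (m - j + 1) is m - position.
        running_max = max(running_max, (m - position) * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Hypothetical unadjusted p-values for three comparisons.
p = [0.04, 0.001, 0.01]
print(nemenyi_apv(p))
print(holm_apv(p))
```

An APV can be compared directly with the chosen level of significance, so a hypothesis is rejected whenever its APV is below α.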




4. Case Studies

In this section, two case studies are presented to illustrate the use of nonparametric multiple comparison tests in Computational Intelligence and Data Mining experiments:

Multiple comparisons with a control method.

Multiple comparisons among all methods.


4.1. Multiple comparisons with a control method.
We present a case study where four rule induction algorithms will be compared with a control method. The complete study can be found in S. García, A. Fernández, J. Luengo, F. Herrera, A Study of Statistical Techniques and Performance Measures for Genetics-Based Machine Learning: Accuracy and Interpretability. Soft Computing 13:10 (2009) 959-977, doi:10.1007/s00500-008-0392-y.

Four of them are Genetics-based Machine Learning (GBML) methods:

and one of them is the classic CN2 algorithm (Clark P, Niblett T (1989) The CN2 induction algorithm. Machine Learn 3(4):261–283). One of the GBML algorithms, XCS, is chosen as the control algorithm (Wilson SW (1995) Classifier fitness based on accuracy. Evol Comput 3(2):149–175). Two performance measures are used for comparing the behaviour of the algorithms: Accuracy and Cohen's kappa (Cohen JA (1960) Coefficient of agreement for nominal scales. Educ Psychol Meas 37–46).

First of all, we have to test whether significant differences exist among all the mean values. The next table shows the result of applying the Friedman and Iman–Davenport tests. It reports the Friedman and Iman–Davenport statistics, χ²_F and F_F respectively, and relates them to the corresponding critical values of each distribution at a level of significance α = 0.05. The p-value obtained is also reported for each test. Given that the Friedman and Iman–Davenport statistics are clearly greater than their associated critical values, there are significant differences among the observed results at a level of significance α ≤ 0.05. According to these results, a post-hoc statistical analysis is needed in the two cases.

Then, we will employ a Bonferroni-Dunn test to detect significant differences for the control algorithm in each measure. It obtains the values CD = 1.493 and CD = 1.34 for α = 0.05 and α = 0.10 respectively in the two measures considered. The following figures summarize the ranking obtained by the Friedman test and draw the threshold of the critical difference of the Bonferroni–Dunn procedure, with the two levels of significance mentioned above. They display a bar chart in which the height of each bar is proportional to the average ranking obtained by each algorithm in each measure studied. If we take the smallest bar (which corresponds to the best algorithm) and add to its height the critical difference obtained by the Bonferroni-Dunn method (CD value), we obtain a cut line that runs across the whole graph. Those bars which are higher than this cut line belong to the algorithms whose performance is significantly worse than that of the control algorithm.
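The reported critical differences can be approximately reproduced with a short Python sketch. It assumes the usual form CD = q_α · sqrt(k(k+1) / (6N)), with q_α taken as the standard normal quantile at 1 − α / (2(k − 1)). Here k = 5 follows from the five algorithms in the study, while N = 14 data sets is an assumption of this example, chosen because it reproduces the reported values; the number of data sets is not stated in this section.

```python
from math import sqrt
from statistics import NormalDist


def bonferroni_dunn_cd(alpha, k, n):
    """Critical difference between average ranks for k methods over n data sets."""
    # Two-sided normal quantile with the level divided by k - 1 comparisons.
    q = NormalDist().inv_cdf(1 - alpha / (2 * (k - 1)))
    return q * sqrt(k * (k + 1) / (6 * n))


K = 5   # four compared algorithms plus the control
N = 14  # assumed number of data sets (not given in this section)

print(round(bonferroni_dunn_cd(0.05, K, N), 3))
print(round(bonferroni_dunn_cd(0.10, K, N), 3))
```

An algorithm is then declared significantly worse than the control when its average rank exceeds the control's average rank by more than the CD.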

We will apply more powerful procedures, such as Holm's and Hochberg's, to compare the control algorithm with the rest of the algorithms. The next table shows all the adjusted p-values for each comparison which involves the control algorithm. The p-value is indicated for each comparison, and the algorithms which are worse than the control at a level of significance α = 0.05 are stressed in bold.

Taken from S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences 180 (2010) 2044–2064 doi:10.1016/j.ins.2009.12.010, we present a case study where three different computational intelligence algorithms will be compared with a control method.

The three algorithms and the control method are:

The following tables show the results in the final form of APVs for the experimental study considered in this case. As we can see, this example is suitable for observing the difference of power among the test procedures. Also, these tables can provide information about the state of retention or rejection of any hypothesis, comparing its associated APV with the level of significance fixed at the beginning of the statistical analysis.

First of all, we can observe that the Bonferroni procedure obtains the highest APV. Theoretically, the step-down procedures usually have less power than step-up ones, and the Li procedure seems to be the multiple comparison test with the highest power. On the other hand, comparing the nonparametric multiple comparison tests (Friedman, Friedman Aligned Ranks and Quade), we can see that Quade’s procedure is the one which obtains the lowest unadjusted p-value in this example.

4.2. Multiple comparisons among all methods.
Taken from the paper S. García, F. Herrera, An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons. Journal of Machine Learning Research 9 (2008) 2677-2694, this case study shows an example involving the four procedures for all pairwise comparisons described above, applied to a comparison of five classifiers:

We have used 10-fold cross validation and standard parameters for each algorithm. The results correspond to average accuracy or 1 − class_error in test data. We have used 30 data sets. The next table shows the overall process of computation of average rankings.
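The computation of average rankings can be sketched as follows. The accuracy values are hypothetical and much smaller than the 30 × 5 study described here. The Friedman statistic is assumed to take its usual form χ²_F = 12n / (k(k+1)) · [Σ R_j² − k(k+1)² / 4], and the function names are illustrative.

```python
def rank_row(scores):
    """Rank one data set: highest score gets rank 1, ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


def average_ranks(results):
    """Average rank of each classifier (column) over all data sets (rows)."""
    rows = [rank_row(row) for row in results]
    n, k = len(results), len(results[0])
    return [sum(row[j] for row in rows) / n for j in range(k)]


def friedman_statistic(avg_ranks, n):
    """Friedman chi-square statistic from the average ranks over n data sets."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


# Hypothetical accuracies: 4 data sets (rows) x 3 classifiers (columns).
results = [
    [0.90, 0.85, 0.80],
    [0.70, 0.75, 0.60],
    [0.88, 0.88, 0.70],
    [0.95, 0.90, 0.91],
]
ranks = average_ranks(results)
print(ranks, friedman_statistic(ranks, len(results)))
```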

The Friedman and Iman-Davenport tests check whether the measured average ranks are significantly different from the mean rank R_j = 3. They respectively use the χ² and the F statistical distributions to determine if a distribution of observed frequencies differs from the theoretical expected frequencies. Their statistics use nominal (categorical) or ordinal level data, instead of using means and variances. Demsar (2006) detailed the computation of the critical values in each distribution. In this case, the critical values are 9.488 and 2.45, respectively, at α = 0.05, and the Friedman and Iman-Davenport statistics are:

χ²_F = 39.647; F_F = 14.309.
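The relation between the two statistics can be checked numerically. The sketch below assumes the usual Iman-Davenport correction F_F = (n − 1)χ²_F / (n(k − 1) − χ²_F), with n = 30 data sets and k = 5 classifiers as in this case study; the function name is illustrative.

```python
def iman_davenport(chi2_f, n, k):
    """Iman-Davenport statistic derived from the Friedman statistic."""
    return (n - 1) * chi2_f / (n * (k - 1) - chi2_f)


n, k = 30, 5
f_f = iman_davenport(39.647, n, k)
degrees_of_freedom = (k - 1, (k - 1) * (n - 1))
print(round(f_f, 3), degrees_of_freedom)
```

The resulting value agrees with the reported F_F to three decimals, and it is compared against the F distribution with 4 and 116 degrees of freedom.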

Since the critical values are lower than the respective statistics, we can proceed with the post-hoc tests in order to detect significant pairwise differences among all the classifiers. For this, we have to compute and order the corresponding statistics and p-values. The standard error in the pairwise comparison between two classifiers is SE = 0.408. The following table presents the family of hypotheses ordered by their p-value and the adjustment of α by Nemenyi’s, Holm’s and Shaffer’s static procedures.
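The standard error and the unadjusted p-value of one pairwise comparison can be sketched as follows, assuming SE = sqrt(k(k+1) / (6n)) and a two-sided normal p-value for z = (R_i − R_j) / SE. The two average ranks used below are hypothetical, since the table of rankings is not reproduced here, and the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def pairwise_se(k, n):
    """Standard error of the difference between two average ranks."""
    return sqrt(k * (k + 1) / (6 * n))


def pairwise_p_value(rank_i, rank_j, k, n):
    """Return (z, two-sided p-value) for one pairwise comparison."""
    z = (rank_i - rank_j) / pairwise_se(k, n)
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


k, n = 5, 30
print(round(pairwise_se(k, n), 3))

# Hypothetical average ranks of two classifiers.
z, p = pairwise_p_value(2.0, 3.5, k, n)
print(z, p)

# Holm's step-down thresholds alpha / (m - i + 1) for i = 1..m, with m = 10.
m = k * (k - 1) // 2
holm_alphas = [0.05 / (m - i) for i in range(m)]
print(holm_alphas)
```

Each ordered p-value is compared with the threshold in the same position; Nemenyi uses the first (smallest) threshold for every hypothesis.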

The next table shows the results in the final form of APVs for the example considered in this section. As we can see, this example is suitable for observing the difference in power among the test procedures. This table also indicates whether any hypothesis is retained or rejected, by comparing its associated APV with the level of significance previously fixed.




5. Considerations and Recommendations on the Use of Nonparametric Tests

This section highlights some considerations and recommendations about nonparametric tests when they are used for any type of comparison.

5.1. Considerations on the use of Nonparametric Tests.

Pairwise comparisons.

Multiple comparisons with a control method.


5.1.1. Pairwise comparisons.

 

5.1.2. Multiple Comparisons with a control method.

 

5.2. Recommendations on the use of Nonparametric Tests.

Pairwise comparisons.

Multiple comparisons with a control method.

Multiple comparisons among all methods.


5.2.1. Pairwise comparisons.

 

5.2.2. Multiple Comparisons with a control method.

 

5.2.3. Multiple Comparison among all methods.

 

 




6. Relevant Journal Papers with Data Mining and Computational Intelligence Case Studies

J. Demsar, Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research 7 (2006) 1-30.


Abstract: While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams.

Summary:

  1. 1. Introduction.
  2. 2. Previous Work.
    1. 2.1. Related Theoretical Work.
    2. 2.2. Testing in Practice: Analysis of ICML Papers.
  3. 3. Statistics and Tests for Comparison of Classifiers.
    1. 3.1. Comparisons of Two Classifiers.
    2. 3.2. Comparisons of Multiple Classifiers.
  4. 4. Empirical Comparison of Tests.
    1. 4.1. Experimental Setup.
    2. 4.2. Comparisons of Two Classifiers.
    3. 4.3. Comparisons of Multiple Classifiers.
  5. 5. Conclusion.


S. García, F. Herrera, An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons. Journal of Machine Learning Research 9 (2008) 2677-2694.


Abstract: In a recently published paper in JMLR, Demsar (2006) recommends a set of non-parametric statistical tests and procedures which can be safely used for comparing the performance of classifiers over multiple data sets. After studying the paper, we realize that the paper correctly introduces the basic procedures and some of the most advanced ones when comparing a control method. However, it does not deal with some advanced topics in depth. Regarding these topics, we focus on more powerful proposals of statistical procedures for comparing n x n classifiers. Moreover, we illustrate an easy way of obtaining adjusted and comparable p-values in multiple comparison procedures.

Summary:

  1. 1. Introduction.
  2. 2. Comparison of Multiple Classifiers: Performing All Pairwise Comparisons.
    1. 2.1. Advanced Procedures for Performing All Pairwise Comparisons.
    2. 2.2. Performing All Pairwise Comparisons: A Case Study.
  3. 3. Adjusted P-Values.
  4. 4. Experimental Framework.
  5. 5. Conclusions.


J. Luengo, S. García, F. Herrera, A Study on the Use of Statistical Tests for Experimentation with Neural Networks: Analysis of Parametric Test Conditions and Non-Parametric Tests. Expert Systems with Applications 36 (2009) 7798-7808 doi:10.1016/j.eswa.2008.11.041


Abstract: In this paper, we focus on the experimental analysis on the performance in artificial neural networks with the use of statistical tests on the classification task. Particularly, we have studied whether the sample of results from multiple trials obtained by conventional artificial neural networks and support vector machines checks the necessary conditions for being analyzed through parametrical tests. The study is conducted by considering three possibilities on classification experiments: random variation in the selection of test data, the selection of training data and internal randomness in the learning algorithm. The results obtained state that the fulfillment of these conditions are problem-dependent and indefinite, which justifies the need of using non-parametric statistics in the experimental analysis.

Summary:

  1. 1. Introduction.
  2. 2. Classification algorithms and experimentation framework.
    1. 2.1. Artificial neural networks and support vector machines.
    2. 2.2. Experimentation framework
  3. 3. Study on the initial conditions for parametric tests using artificial neural networks
    1. 3.1. Conditions for the use of parametric tests.
    2. 3.2. Normality test over the group of data sets and algorithms.
    3. 3.3. Case studies of the normality property.
  4. 4. On the use of rank-based non-parametric tests: a short experimental study.
    1. 4.1. Rank-based non-parametric tests.
    2. 4.2. Experimental study: results and analysis.
  5. 5. Conclusions


S. García, A. Fernández, J. Luengo, F. Herrera, A Study of Statistical Techniques and Performance Measures for Genetics-Based Machine Learning: Accuracy and Interpretability. Soft Computing 13:10 (2009) 959-977, doi:10.1007/s00500-008-0392-y


Abstract: The experimental analysis on the performance of a proposed method is a crucial and necessary task to carry out in a research. This paper is focused on the statistical analysis of the results in the field of genetics-based machine Learning. It presents a study involving a set of techniques which can be used for doing a rigorous comparison among algorithms, in terms of obtaining successful classification models. Two accuracy measures for multiclass problems have been employed: classification rate and Cohen’s kappa. Furthermore, two interpretability measures have been employed: size of the rule set and number of antecedents. We have studied whether the samples of results obtained by genetics-based classifiers, using the performance measures cited above, check the necessary conditions for being analysed by means of parametrical tests. The results obtained state that the fulfillment of these conditions are problem-dependent and indefinite, which supports the use of non-parametric statistics in the experimental analysis. In addition, non-parametric tests can be satisfactorily employed for comparing generic classifiers over various data-sets considering any performance measure. According to these facts, we propose the use of the most powerful non-parametric statistical tests to carry out multiple comparisons. However, the statistical analysis conducted on interpretability must be carefully considered.

Summary:

  1. 1. Introduction.
  2. 2. Genetics-based machine learning algorithms for classification.
  3. 3. Performance measures and experimental results.
    1. 3.1. Accuracy measures for multi-class problems.
    2. 3.2. Interpretability measures.
    3. 3.3. Experimental results.
  4. 4. Study on the initial conditions for parametric tests using genetics-based machine learning.
    1. 4.1. Conditions for a safe use of parametric tests.
    2. 4.2. Analysis of the conditions for a safe use of parametric tests.
    3. 4.3. Case studies of the normality property.
  5. 5. Non-parametric tests for comparing two algorithms in multiple data-set analysis.
    1. 5.1. Wilcoxon signed-ranks test.
    2. 5.2. A case study in GBML: performing pairwise comparisons.
  6. 6. Non-parametric tests for multiple comparisons among more than two algorithms.
    1. Friedman test and post-hoc tests.
    2. A case study in GBML: performing multiple comparisons.
  7. 7. Analysing interpretability of models.
  8. 8. Conclusions.


S. García, D. Molina, M. Lozano, F. Herrera, A Study on the Use of Non-Parametric Tests for Analyzing the Evolutionary Algorithms' Behaviour: A Case Study on the CEC'2005 Special Session on Real Parameter Optimization. Journal of Heuristics 15 (2009) 617-644. doi:10.1007/s10732-008-9080-4


Abstract: In recent years, there has been a growing interest for the experimental analysis in the field of evolutionary algorithms. It is noticeable due to the existence of numerous papers which analyze and propose different types of problems, such as the basis for experimental comparisons of algorithms, proposals of different methodologies in comparison or proposals of use of different statistical techniques in algorithms’ comparison.
In this paper, we focus our study on the use of statistical techniques in the analysis of evolutionary algorithms’ behaviour over optimization problems. A study about the required conditions for statistical analysis of the results is presented by using some models of evolutionary algorithms for real-coding optimization. This study is conducted in two ways: single-problem analysis and multiple-problem analysis. The results obtained state that a parametric statistical analysis could not be appropriate specially when we deal with multiple-problem results. In multiple-problem analysis, we propose the use of non-parametric statistical tests given that they are less restrictive than parametric ones and they can be used over small size samples of results. As a case study, we analyze the published results for the algorithms presented in the CEC’2005 Special Session on Real Parameter Optimization by using non-parametric test procedures.

Summary:

  1. 1. Introduction.
  2. 2. Preliminaries: settings of the CEC’2005 Special Session.
    1. 2.1. Evolutionary algorithms.
    2. 2.2. Test functions.
    3. 2.3. Characteristics of the experimentation.
  3. 3. Study of the required conditions for the safe use of parametric tests.
    1. 3.1. Conditions for the safe use of parametric tests.
    2. 3.2. On the study of the required conditions over single-problem analysis.
    3. 3.3. On the study of the required conditions over multiple-problem analysis.
  4. 4. A case study: on the use of non-parametric statistics for comparing the results of the CEC’2005 Special Session in Real Parameter Optimization.
  5. 5. Some considerations on the use of non-parametric tests.
  6. 6. Conclusions


S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental Analysis of Power. Information Sciences 180 (2010) 2044–2064. doi:10.1016/j.ins.2009.12.010


Abstract: Experimental analysis of the performance of a proposed method is a crucial and necessary task in an investigation. In this paper, we focus on the use of nonparametric statistical inference for analyzing the results obtained in an experiment design in the field of computational intelligence. We present a case study which involves a set of techniques in classification tasks and we study a set of nonparametric procedures useful to analyze the behavior of a method with respect to a set of algorithms, such as the framework in which a new proposal is developed.

Particularly, we discuss some basic and advanced nonparametric approaches which improve the results offered by the Friedman test in some circumstances. A set of post hoc procedures for multiple comparisons is presented together with the computation of adjusted p-values. We also perform an experimental analysis for comparing their power, with the objective of detecting the advantages and disadvantages of the statistical tests described. We found that some aspects such as the number of algorithms, number of data sets and differences in performance offered by the control method are very influential in the statistical tests studied. Our final goal is to offer a complete guideline for the use of nonparametric statistical procedures for performing multiple comparisons in experimental studies.

Summary:

  1. 1. Introduction.
  2. 2. Experimental Framework.
  3. 3. Basic nonparametric tests for performing multiple comparisons: Friedman test, Multiple Sign-test and Contrast Estimation based on medians.
    1. 3.1. Friedman test and Iman-Davenport extension.
    2. 3.2. Multiple Sign-Test.
    3. 3.3. Contrast Estimation based on medians.
  4. 4. Advanced nonparametric tests for performing multiple comparisons: Friedman Aligned Ranks and the test of Quade.
    1. 4.1. Friedman Aligned Ranks.
    2. 4.2. Quade Test.
  5. 5. A candidate set of post hoc tests: p-values and adjusted p-values.
    1. 5.1. Post hoc procedures.
    2. 5.2. Experimental Study.
  6. 6. Experimental analysis: power of the multiple comparisons tests.
    1. 6.1. Analysis of the power of nonparametric multiple comparisons tests.
    2. 6.2. Analysis of the power of the post hoc procedures.
    3. 6.3. Analysis of the stability of the Quade test.
  7. 7. Summary and suggestions.
  8. 8. Conclusions


C. Drummond, N. Japkowicz, Warning: statistical benchmarking is addictive. Kicking the habit in machine learning. Journal of Experimental & Theoretical Artificial Intelligence 22:1 (2010) 67–80 doi:10.1080/09528130903010295


Abstract: Algorithm performance evaluation is so entrenched in the machine learning community that one could call it an addiction. Like most addictions, it is harmful and very difficult to give up. It is harmful because it has serious limitations. Yet, we have great faith in practicing it in a ritualistic manner: we follow a fixed set of rules telling us the measure, the data sets and the statistical test to use. When we read a paper, even as reviewers, we are not sufficiently critical of results that follow these rules. Here, we will debate what are the limitations and how to best address them. This article may not cure the addiction but hopefully it will be a good first step along that road.

Summary:

  1. 1. Introduction.
  2. 2. What is wrong with what we are doing now?.
    1. 2.1. What are we measuring?.
    2. 2.2. What does a statistical test buy us?.
    3. 2.3. What do our data sets represent?.
  3. 3. What is the alternative?.
    1. 3.1. What should we measure?.
    2. 3.2. What tests should we do?.
    3. 3.3. What data should we use?.
  4. 4. Conclusions.


J. Derrac, S. García, D. Molina, F. Herrera, A Practical Tutorial on the Use of Nonparametric Statistical Tests as a Methodology for Comparing Evolutionary and Swarm Intelligence Algorithms. Swarm and Evolutionary Computation 1:1 (2011) 3-18. doi:10.1016/j.swevo.2011.02.002


Abstract: The interest in nonparametric statistical analysis has grown recently in the field of computational intelligence. In many experimental studies, the lack of the required properties for a proper application of parametric procedures – independence, normality, and homoscedasticity – yields to nonparametric ones the task of performing a rigorous comparison among algorithms.

In this paper, we will discuss the basics and give a survey of a complete set of nonparametric procedures developed to perform both pairwise and multiple comparisons, for multi-problem analysis. The test problems of the CEC’2005 special session on real parameter optimization will help to illustrate the use of the tests throughout this tutorial, analyzing the results of a set of well-known evolutionary and swarm intelligence algorithms. This tutorial is concluded with a compilation of considerations and recommendations, which will guide practitioners when using these tests to contrast their experimental results.

Summary:

  1. 1. Introduction.
  2. 2. Preliminaries.
    1. 2.1. Benchmark functions: CEC’2005 special session on real parameter optimization.
    2. 2.2. Evolutionary and swarm intelligence algorithms.
    3. 2.3. Some basic concepts on inferential statistics.
  3. 3. Pairwise comparisons.
    1. 3.1. A simple first-sight procedure: the Sign test.
    2. 3.2. The Wilcoxon signed ranks test.
  4. 4. Multiple comparisons with a control method.
    1. 4.1. Multiple Sign test.
    2. 4.2. The Friedman, Friedman Aligned Ranks, and Quade tests.
    3. 4.3. Post-hoc procedures.
    4. 4.4. Contrast Estimation.
  5. 5. Multiple comparisons among all methods.
  6. 6. Considerations and recommendations on the use of nonparametric tests.
    1. 6.1. General considerations.
    2. 6.2. Multiple comparisons with a control method.
    3. 6.3. Multiple comparisons among all methods.
  7. 7. Conclusions


A. Ulas, O. T. Yildiz, E. Alpaydin, Cost-conscious comparison of supervised learning algorithms over multiple data sets. Pattern Recognition 45 (2012) 1772–1781 doi:10.1016/j.patcog.2011.10.005


Abstract: In the literature, there exist statistical tests to compare supervised learning algorithms on multiple data sets in terms of accuracy but they do not always generate an ordering. We propose Multi2Test, a generalization of our previous work, for ordering multiple learning algorithms on multiple data sets from "best" to "worst" where our goodness measure is composed of a prior cost term additional to generalization error. Our simulations show that Multi2Test generates orderings using pairwise tests on error and different types of cost using time and space complexity of the learning algorithms.

Summary:

  1. 1. Introduction.
  2. 2. Comparing multiple algorithms over multiple datasets.
    1. 2.1. The sign test.
    2. 2.2. Multiple pairwise comparisons on multiple datasets.
    3. 2.3. Correction for multiple comparisons.
  3. 3. Multi2Test.
    1. 3.1. MultiTest.
    2. 3.2. Multi2Test.
  4. 4. Results.
    1. 4.1. Experimental setup.
    2. 4.2. The sign test over averages.
    3. 4.3. Friedman’s test and Bergmann-Hommel’s dynamic procedure.
    4. 4.4. The sign test over pair wise tests.
    5. 4.5. Applying Multi2Test.
      1. 4.5.1. Training time as cost.
      2. 4.5.2. Space complexity as cost.
    6. 4.6. Testing MultiTest and Multi2Test.
    7. 4.7. Verification of results on test.
  5. 5. Discussions and conclusions.


D. Berrar, J. A. Lozano, Significance tests or confidence intervals: which are preferable for the comparison of classifiers?. Journal of Experimental & Theoretical Artificial Intelligence, In press (2012) doi:10.1080/0952813X.2012.680252


Abstract: Null hypothesis significance tests and their p-values currently dominate the statistical evaluation of classifiers in machine learning. Here, we discuss fundamental problems of this research practice. We focus on the problem of comparing multiple fully specified classifiers on a small-sample test set. On the basis of the method by Quesenberry and Hurst, we derive confidence intervals for the effect size, i.e. the difference in true classification performance. These confidence intervals disentangle the effect size from its uncertainty and thereby provide information beyond the p-value. This additional information can drastically change the way in which classification results are currently interpreted, published and acted upon. We illustrate how our reasoning can change, depending on whether we focus on p-values or confidence intervals. We argue that the conclusions from comparative classification studies should be based primarily on effect size estimation with confidence intervals, and not on significance tests and p-values.

Summary:

  1. 1. Introduction.
  2. 2. Problem statement.
  3. 3. Significance tests and CIs: two sides of the same coin?.
  4. 4. Formal preliminaries.
  5. 5. Significance test for the difference in performance on the same test set and a 0-1 loss function.
  6. 6. CIs for the effect size.
  7. 7. Materials and methods.
  8. 8. Results.
  9. 9. Discussion and conclusions.





7. Relevant books on Non-parametric tests

The reader can extend the information provided in this web page by reading the following books:




8. Topic Slides

S. García, F. Herrera (November 2010). How must I conduct statistical comparisons in my Experimental Study? Design of Experiments in Data Mining/Computational Intelligence. On the use Non-parametric Statistical Tests. Some Cases of Study.

H. Takagi (April 2013). Statistical Tests for Computational Intelligence Research and Human Subjective Tests. 2013 IEEE Symposium Series on Computational Intelligence. http://www.design.kyushu-u.ac.jp/~takagi/TAKAGI/StatisticalTests.html




9. Software and User's Guide

We offer software developed in Java which computes all the multiple comparison procedures described in this web page. It accepts CSV files as input and produces as output a LaTeX file with tabulated information about the Friedman, Iman-Davenport, Friedman Aligned Ranks, Quade, Contrast Estimation, Bonferroni-Dunn, Holm, Hochberg, Hommel, Holland, Rom, Finner, Li, Shaffer and Bergmann-Hommel tests. It also computes and shows the adjusted p-values.

The first one (CONTROLTEST package) can be used to:

The second one (MULTIPLETEST package) can be used to:

 

The programs are written in Java, so a JVM must be installed on the computer in order to run them. To run a program, execute:
java Friedman <data file>
An example data file is included in the package. The output is written in LaTeX format to standard output. We recommend redirecting the output to a file in the following manner:
java Friedman data.csv > output.tex




 


© Copyright 2010 SCI2S (Soft Computing and Intelligent Information Systems)
This Web page was created and maintained by Salvador García López