|
|
![]() |
This Website contains additional material to the SCI2S research group papers on the use of non-parametric tests for data mining and Computational Intellingece:
- P1: S. García, F. Herrera, An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons. Journal of Machine Learning Research 9 (2008) 2677-2694 link .

- P2: J. Luengo, S. García, F. Herrera, A Study on the Use of Statistical Tests for Experimentation with Neural Networks: Analysis of Parametric Test Conditions and Non-Parametric Tests. Expert Systems with Applications 36 (2009) 7798-7808 doi:10.1016/j.eswa.2008.11.041.

- P3: S. García, A. Fernández, J. Luengo, F. Herrera, A Study of Statistical Techniques and Performance Measures for Genetics-Based Machine Learning: Accuracy and Interpretability. Soft Computing 13:10 (2009) 959-977, doi:10.1007/s00500-008-0392-y.

- P4: S. García, D. Molina, M. Lozano, F. Herrera, A Study on the Use of Non-Parametric Tests for Analyzing the Evolutionary Algorithms' Behaviour: A Case Study on the CEC'2005 Special Session on Real Parameter Optimization. Journal of Heuristics, 15 (2009) 617-644. doi: 10.1007/s10732-008-9080-4.

They are complementary work to the Demsar's paper for using non-parametric tests in data mining.
- J. Demsar, Statistical Comparisons of Classifiers over Multiple Data Sets. Journal Of Machine Learning Research 7 (2006) 1-30 link.

The web is organized according to the following summary:
- 1. Introduction to Inferential Statistics
- 2. Conditions for the safe use of Nonparametric Tests
- 3. Nonparametric tests
- 3.1. Pairwise Comparisons
- 3.2. Multiple Comparisons with a control method
- 3.3. Multiple Comparisons among all methods
- 4. Case Studies
- 5. Considerations on the use of Nonparametric tests
- 6. Relevant Journal Papers with Data Mining and Computational Intelligence Case Studies
- 7. Relevant books on Non-parametric tests
- 8. Topic Slides
- 9. Software and User's Guide
Introduction to Inferential Statistics
The experimental analysis on the performance of a new method is a crucial and necessary task to carry out in a research on Data Mining, Computational Intelligence techniques. Deciding when an algorithm is better than other one may not be a trivial task.
Hyphotesis testing and p-values:
In inferential statistics, sample data are primarily employed in two ways to draw inferences about one or more populations. One of them
is the hypothesis testing.
The most basic concept in hypothesis testing is a hypothesis. It can be defined as a prediction about a single
population or about the relationship between two or more populations. Hypothesis testing is a procedure in which sample data are employed
to evaluate a hypothesis. There is a distinction between research hypothesis and statistical hypothesis. The first is a general statement
of what a researcher predicts. In order to evaluate a research hypothesis, it is restated within the framework of two statistical
hypotheses. They are the null hypothesis, represented by the notation H0, and the alternative hypothesis, represented by the notation H1.
The null hypothesis is a statement of no effect or no difference. Since the statement of the research hypothesis generally predicts the
presence of a difference with respect to whatever is being studied, the null hypothesis will generally be a hypothesis that the
researcher expects to be rejected. The alternative hypothesis represents a statistical statement indicating the presence of an effect
or a difference. In this case, the researcher generally expects the alternative hypothesis to be supported.
An alternative hypothesis can be nondirectional (two-tailed hypothesis) and directional (one-tailed hypothesis). The first type does not
make a prediction in a specific direction; i.e. H1 : µ ≠ 100. The latter implies a choice of one of the following
directional alternative hypothesis; i.e. H1:µ > 100 or H1:µ < 100.
Upon collecting the data for a study, the next step in the hypothesis testing procedure is to evaluate the data through use of the
appropriate inferential statistical test. An inferential statistical test yields a test statistic. The latter value is interpreted by
employing special tables that contain information with regard to the expected distribution of the test statistic. Such tables contain
extreme values of the test statistic (referred to as critical values) that are highly unlikely to occur if the null hypothesis
is true. Such tables allow a researcher to determine whether or not the results of a study is statistically significant.
The conventional hypothesis testing model employed in inferential statistics assumes that prior to conducting a study, a researcher
stipulates whether a directional or nondirectional alternative hypothesis is employed, as well as at what level of significance
is represented the null hypothesis to be evaluated. The probability value which identifies the level of significance is represented
by α.
When one employs the term significance in the context of scientific research, it is instructive to make a distinction between statistical
significance and practical significance. Statistical significance only implies that the outcome of a study is highly unlikely to have
occurred as a result of chance, but it does no necessarily suggest that any difference or effect detected in a set of data is of any
practical value. For example, no-one would normally care if algorithm A in continuos optimization solves the sphere function to within
10-10 of error of the global optimum and algorithm B solves it within 10-15. Between them, statistical significance
could be found, but in practical sense, this difference is not significant.
Instead of stipulating a priori a level of significance α, one could calculate the smallest level of significance that results
in the rejection of the null hypothesis. This is the definition of p-value, which is an useful and interesting datum for many consumers
of statistical analysis. A p-value provides information about whether a statistical hypothesis test is significant or not, and it also
indicates something about “how significant” the result is: The smaller the p-value, the stronger the evidence against
the null hypothesis. Most important, it does this without committing to a particular level of significance.
The most common way for obtaining the p-value associated to a hypothesis is by means of normal approximations, that is, once computed
the statistic associated to a statistical test or procedure, we can use a specific expression or algorithm for obtaining a z value, which
corresponds to a normal distribution statistics. Then, by using normal distribution tables, we could obtain the p-value associated with
z.
The reader can found more information about the introduction of Inferential Statistics in the chapter 1 of Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, Boca Raton
(2006).
Conditions for the safe use of Nonparametric Tests
In order to distinguish a nonparametric test from a parametric one, we must check the type of data used by the test. A nonparametric
test is that which uses nominal or ordinal data. This fact does not force it to be used only for these types of data. It is possible to
transform the data from real values to ranking based data. In such way, a non-parametric test can be applied over classical data of
parametric test when they do not verify the required conditions imposed by the test. As a general rule, a non-parametric test is less
restrictive than a parametric one, although it is less robust than a parametric when data are well conditioned.
Parametric tests have been commonly used in the analysis of experiments. For example, a common way to test whether the difference between
two algorithms' results is non-random is to compute a paired t-test, which checks whether the average difference in their performance
over the data sets is significantly different from zero. When comparing a set of multiple algorithms, the common statistical method for
testing the differences between more than two related sample means is the repeated-measures ANOVA (or within-subjects ANOVA)
(R. A. Fisher. Statistical methods and scientific inference (2nd edition). Hafner Publishing Co., New
York, 1959.). The "related samples" are again the performances of the algorithms measured across the same problems.
The null-hypothesis being tested is that all classifiers perform the same and the observed differences are merely random.
Unfortunately, Parametric tests are based on assumptions which are most probably violated when analyzing the performance of
computational intelligence and data mining algorithms (Zar, J.H.: Biostatistical Analysis. Prentice Hall,
Englewood Cliffs (1999)). These assumptions are:
Independence: In statistics, two events are independent when the fact that one occurs does not modify the probability of the other one occurring.
When we compare two optimization algorithms they are usually independent.
When we compare two machine learning methods, it depends on the partition:
The independency is not truly verified in 10-fcv (a portion of samples is used either for training and
testing in different partitions.
Hold out partitions can be safely take as independent, since training and test partitions do not overlap.
Normality:An observation is normal when its behaviour follows a normal or Gauss distribution with a certain value of average μ and variance σ . A normality test applied over a sample can indicate the presence or absence of this condition in observed data.
Kolmogorov–Smirnov: it compares the accumulated distribution
of observed data with the accumulated distribution
expected for a Gaussian distribution, obtaining the p-value
based on both discrepancies. Therefore, it is a quality of fit
procedure that can be used to test the hypothesis of normality
in the population distribution. However, this method performs
poorly because it possess very low power.
Shapiro–Wilk: it analyzes the observed data for computing
the level of symmetry and kurtosis (shape of the curve) in
order to compute the difference with respect to a Gaussian
distribution afterwards, obtaining the p-value from the sum
of the squares of these discrepancies. The power of this test
has been shown to be excellent. However, the performance
of this test is adversely affected in the common situation
where there is tied data.
D’Agostino–Pearson: it first computes the skewness and kurtosis
to quantify how far from Gaussian the distribution is
in terms of asymmetry and shape. It then calculates how
far each of these values differs from the values expected with
a Gaussian distribution, and computed a single p-value form
the sum of these discrepancies. The performance of this test
is not as good as that of Shapiro–Wilk’s procedure, but it is
not affected by tied data.
Homoscedasticity:
This property indicates the hypothesis
of equality of variances. Levene’s test is used for checking whether or
not k samples present this homogeneity of variances. When
observed data does not fulfill the normality condition, this test’s result is more
reliable than Bartlett’s test (Zar, J.H.: Biostatistical Analysis. Prentice Hall,
Englewood Cliffs (1999)), which checks the same property.
The conditions are usually not satisfied in the case of analyzing results provided by computational intelligence experiments or data
mining algorithms' comparisons.
Let us show a case study involving a set of Neural Networks classifiers run over multiple data sets and using a well-known 10-fold cross validation procedure. We apply the normality test of Kolmogorov–Smirnov and D’Agostino–Pearson by considering
a level of confidence of α = 0.05 (we employ SPSS statistical software package). Next tables show the results in
10fcv where the symbol ‘*’ indicates
that the normality is not satisfied and the value in brackets is the
p-value needed for rejecting the normality hypothesis.
Tab. 1. Test of Normality of Kolmogorov-Smirnov for 10fcv
Tab. 2. Test of Normality of D’Agostino–Pearson for 10fcv
As we can observe in the run of the two tests, we can declare that the conditions needed for the application of parametric tests are not fulfilled in some cases. The normality condition is not always satisfied although the size of the sample of results would be enough (50 in this case). A main factor that influences this condition seems to be the nature of the problem, since there exist some problems in which it is never satisfied. D’Agostino–Pearson’s test is the most suitable test in these situations, where it is frequent that the sample of results would contain some ties.
In addition, we present a case study done for a given sample of results. Figure 1 presents an example of graphical representations of histograms and Q–Q graphics. An histogram represents a statistical variable by using bars, so that the area of each bar is proportional to the frequency of the represented values. A Q–Q graphic represents a confrontation between the quartiles from data observed and those from the normal distribution. In Fig. 1 we observe a common case of absolute lack of normality. The case corresponds to the run of the RBFN decremental algorithm in a Hold-Out Validation
Fig. 1. Results of RBFN Decremental over crx data set in Hold-Out Validation: histogram and Q–Q graphic.
In relation to the heteroscedasticity study, next table shows the results by applying Levene’s test, where the symbol ‘*’ indicates that the variances of the distributions of the different algorithms for a certain data set are not homogeneous (we reject the null hypothesis). The homoscedasticity property is even more difficult to be fulfilled, since the variances associated to each problem also depend on the algorithm’s results, that is, the capacity of the algorithms for offering similar results with random seed variations. This fact implies that an analysis of performance of ANN methods performed through parametric statistical treatment could mean erroneous conclusions.
Tab. 3. Test of Homoscedasticity of Levene (based on means)
Nonparametric Tests
This section tries to describe the most used nonparametric tests in the analysis of results:
Pairwise comparisons.
In the discussion of the tests for comparisons of two methods over multiple cases of problems we will make
two points. We shall warn against the widely used t-test as usually conceptually inappropriate and
statistically unsafe. Since we will finally recommend the Wilcoxon (F. Wilcoxon. Individual comparisons
by ranking methods. Biometrics, 1:80–83, 1945) signed-ranks test, it will
be presented with more details. Another, even more rarely used test is the sign test which is weaker
than the Wilcoxon test but also has its distinct merits. The other message will be that the described
statistics measure differences between the methods from different aspects, so the selection of the
test should be based not only on statistical appropriateness but also on what we intend to measure.
The Sign Test
A popular way to compare the overall performances of algorithms is to count the number of cases on which an algorithm is the overall winner. When multiple algorithms are compared, pairwise comparisons are sometimes organized in a matrix.
Some authors also use these counts in inferential statistics, with a form of binomial test that
is known as the sign test Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures.
CRC Press, Boca Raton (2006). If the two algorithms compared are, as assumed under the null-hypothesis, equivalent, each should
win on approximately N/2 out of N problems. The number of wins is distributed according to the binomial distribution; the critical number
of wins can be found in next Table. For a greater number of cases, the number of wins is under
the null-hypothesis distributed according to N(N/2,SQRT(N)/2), which allows for the use of z-test: if
the number of wins is at least N/2+1.96*SQRT(N)/2 (or, for a quick rule of a thumb, N/2+SQRT(N)), the
algorithm is significantly better with p < 0.05. Since tied matches support the null-hypothesis we
should not discount them but split them evenly between the two methods; if there is an odd number
of them, we again ignore one.
| #cases | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| α=0.05 | 5 | 6 | 7 | 7 | 8 | 9 | 9 | 10 | 10 | 11 | 12 | 12 | 13 | 13 | 14 | 15 | 15 | 16 | 17 | 18 | 18 |
| α=0.1 | 5 | 6 | 6 | 7 | 7 | 8 | 9 | 9 | 10 | 10 | 11 | 12 | 12 | 13 | 13 | 14 | 14 | 15 | 16 | 16 | 17 |
Table 4. Critical values for the two-tailed sign test at α = 0.05 and α = 0.1. A method is significantly better than another if it performs better on at least the cases presented in each row.
The Wilcoxon Test
Wilcoxon’s test is used for answering this question: do two samples represent two different
populations? It is a non-parametric procedure employed in a hypothesis testing
situation involving a design with two samples. It is the analogous of the paired t-test
in non-parametrical statistical procedures; therefore, it is a pairwise test that aims to
detect significant differences between the behavior of two algorithms.
The null hypothesis for Wilcoxon’s test is H0 : θD = 0; in the underlying populations represented by the two samples of results, the median of the difference scores equals zero. The alternative hypothesis is H1 : θD ≠ 0, but also can be used H1 : θD > 0 or H1 : θD < 0 as directional hypothesis.
In the following, we describe the tests computations. Let di be the difference between the performance scores of the two algorithms on i-th out of N cases. The differences are ranked according to their absolute values; average ranks are assigned in case of ties. Let R+ be the sum of ranks for the functions on which the second algorithm outperformed the first, and R− the sum of ranks for the opposite. Ranks of di = 0 are split evenly among the sums; if there is an odd number of them, one is ignored:
Let T be the smallest of the sums, T = min(R+,R−). If T is less than or equal to the value of the distribution of Wilcoxon for N degrees of freedom (Table B.12 in Zar, J.H.: Biostatistical Analysis. Prentice Hall, Englewood Cliffs (1999)), the null hypothesis of equality of means is rejected.
The obtaining of the p-value associated to a comparison is performed by means of the normal approximation for the Wilcoxon T statistic (Section VI, Test 18 in Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, Boca Raton (2006) . Furthermore, the computation of the p-value for this test is usually included in well-known statistical software packages (SPSS, SAS, R, etc.).
Multiple comparisons with a control method.
Wilcoxon’s test performs individual comparisons between two algorithms (pairwise
comparisons). The p-value in a pairwise comparison is independent from another
one. If we try to extract a conclusion involving more than one pairwise comparison
in a Wilcoxon’s analysis, we will obtain an accumulated error coming from the
combination of pairwise comparisons. In statistical terms, we are losing the control
on the Family Wise Error Rate (FWER), defined as the probability of making one or
more false discoveries among all the hypotheses when performing multiple pairwise
tests. The true statistical significance for combining pairwise comparisons is given
by:
So, a pairwise comparison test, such as Wilcoxon's test, should not be used to conduct various comparisons involving a set of algorithms, because the FWER is not controlled. The expresion defined above computes the true significance obtained after performing several comparisons, hence the level of significance cannot be set before performing the comparisons and the statistical significance cannot be known a priori.
In order to carry out a comparison which involves more than two methods, under the assumption of being significance at a certain level of significance, established previously to the statistical analysis, the multiple comparisons tests should be used. In this part, we describe the most used test for performing multiple test comparisons together with a set of post-hoc procedures to compare a control method with other methods (1 x n comparisons). We refer to the Friedman Test and derivatives.
The Friedman Test
Friedman’s test is used for answering this question: In a set of k samples (where k ≥ 2), do at least two of the samples represent populations with different median values? It is a non-parametric procedure employed in a hypothesis testing situation involving a design with two or more samples. It is the analogous of the repeatedmeasures ANOVA in non-parametrical statistical procedures; therefore, it is a multiple comparison test that aims to detect significant differences between the behavior of two or more algorithms.
The null hypothesis for Friedman’s test is H0 : θ1 = θ2 = ··· = θk ; the median of the population i represents the median of the population j , i ≠ j, 1 ≤ i ≤ k, 1 ≤ j ≤ k. The alternative hypothesis is H1 : Not H0, so it is non-directional.
Next, we describe the tests computations. It computes the ranking of the observed results for algorithm (rj for the algorithm j with k algorithms) for each function, assigning to the best of them the ranking 1, and to the worst the ranking k. Under the null hypothesis, formed from supposing that the results of the algorithms are equivalent and, therefore, their rankings are also similar, the Friedman’s statistic
is distributed according to χF2
with k − 1 degrees of freedom, being Rj =
, and N the number of cases of the problem considered. The critical values for the Friedman’s
statistic coincide with the established in the χ2 distribution when N > 10 and
k > 5.
The Iman and Davenport Test
Iman and Davenport (Iman, R.L., Davenport, J.M.: Approximations of the critical region of the Friedman statistic. Commun. Stat. 18, 571–595 (1980)) proposed a derivation from the Friedman’s statistic given that this last metric produces a conservative undesirably effect. The proposed statistic is
and it is distributed according to a F distribution with k − 1 and (k − 1)(N − 1) degrees of freedom.
Computation of the p-values given a χ2 or FF statistic can be done by using the algorithms in Abramowitz, M.: Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables. Dover, New York (1974). Also, most of the statistical software packages include it.
The rejection of the null hypothesis in both tests described above does not involve the detection of the existing differences among the algorithms compared. They only inform us about the presence of differences among all samples of results compared. In order to conducting pairwise comparisons within the framework of multiple comparisons, we can proceed with a post-hoc procedure. In this case, a control algorithm (maybe a proposal to be compared) is usually chosen. Then, the post-hoc procedures proceed to compare the control algorithm with the remain k − 1 algorithms. Next, we describe three post-hoc procedures:
Post-Hoc Procedures for Comparing a Control Method
Bonferroni-Dunn’s procedure (Zar, J.H.: Biostatistical Analysis. Prentice Hall, Englewood Cliffs (1999)): it is similar to Dunnet’s test for ANOVA
designs. The performance of two algorithms is significantly different if the corresponding average of rankings is at least as great as its critical difference (CD).
Holm procedure (Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian J. Statist. 6, 65–70 (1979)): for contrasting the procedure of Bonferroni-Dunn, we
dispose of a procedure that sequentially checks the hypotheses ordered according
to their significance. We will denote the p-values ordered by p1,p2, ... ,
in the way that p1 ≤ p2 ≤ ··· ≤ pk−1. Holm’s method compares each pi with α/(k−i) starting from the most significant p-value. If p1 is below than α/(k−1),
the corresponding hypothesis is rejected and it leaves us to compare p2 with α/(k − 2). If the second hypothesis is rejected, we continue with the process. As soon as a certain hypothesis can not be rejected, all the remaining hypotheses are
maintained as supported. The statistic for comparing the i algorithm with the j algorithm is:
Hochberg procedure (Hochberg, Y.: A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75, 800–803
(1988)): It is a step-up procedure that works in the opposite
direction to Holm’s method, comparing the largest p-value with α, the next largest
with α/2 and so forth until it encounters a hypothesis it can reject. All hypotheses
with smaller p values are then rejected as well. Hochberg’s method is more
powerful than Holm’s (Shaffer, J.P.: Multiple hypothesis testing. Annu. Rev. Psychol. 46, 561–584 (1995)).
The value of qα is the critical value of Q' for a multiple non-parametric comparison with a control (Table B.16 in Zar, J.H.: Biostatistical Analysis. Prentice Hall, Englewood Cliffs (1999)).
The value of z is used for finding the corresponding probability from the table of the normal distribution (p-value), which is compared with the corresponding value of α. Holm’s method is more powerful than Bonferroni-Dunn’s and it does no additional assumptions about the hypotheses checked.
When a p-value is within a multiple comparison it reflects the probability error of a certain comparison, but it does not take into account the remaining comparisons belonging to the family. One way to solve this problem is to report Adjusted P -Values (APVs) which take into account that multiple tests are conducted. An APV can be directly taken as the p-value of a hypothesis belonging to a comparison of multiple algorithms.
Finally, we will explain how to compute the APVs for the three post-hoc procedures described above, following the indications given in Wright, S.P.: Adjusted p-values for simultaneous inference. Biometrics 48, 1005–1013 (1992).
Bonferroni APVi: min{v;1}, where v = (k −1)pi .
Holm APVi: min{v;1}, where v = max{(k −j)pj : 1 ≤ j ≤ i}.
Hochberg APVi: max{(k -j)pj : (k - 1) = j = i}.
Multiple comparisons among all methods.
Friedman’s test is an omnibus test which can
be used to carry out these types of comparison. It allows to detect differences considering the global
set of classifiers. Once Friedman’s test rejects the null hypothesis, we can proceed with a post-hoc
test in order to find the concrete pairwise comparisons which produce differences. Before, we focused
on procedures that control the family-wise error when comparing with a control classifier, arguing
that the objective of a study is to test whether a newly proposed method is better than the existing ones. For this reason, we have described and studied procedures such as Bonferroni-Dunn, Holm’s and Hochberg’s methods.
When our interest lies in carrying out a multiple comparison in which all possible pairwise comparisons need to be computed (n x n comparison), the classic procedure that can be used is the Nemenyi procedure (P. B. Nemenyi. Distribution-free Multiple comparisons. PhD thesis, Princeton University, 1963). It adjusts the value of a in a single step by dividing the value of α by the number of comparisons performed, m = k(k−1)=2. This procedure is the simplest but it also has little power.
The hypotheses being tested belonging to a family of all pairwise comparisons are logically interrelated so that not all combinations of true and false hypotheses are possible. As a simple example of such a situation suppose that we want to test the three hypotheses of pairwise equality associated with the pairwise comparisons of three classifiers Ci; i = 1, 2, 3. It is easily seen from the relations among the hypotheses that if any one of them is false, at least one other must be false. For example, if C1 is better/worse than C2, then it is not possible that C1 has the same performance as C3 and C2 has the same performance as C3. C3 must be better/worse than C1 or C2 or the two classifiers at the same time. Thus, there cannot be one false and two true hypotheses among these three.
Based on this argument, Shaffer proposed two procedures which make use of the logical relation among the family of hypotheses for adjusting the value of α (J.P. Shaffer. Modified sequentially rejective multiple test procedures. Journal of the American Statistical Association, 81(395):826–831, 1986.).
Shaffer’s static procedure: following Holm’s step down method, at stage j, instead of rejecting
Hi if pi ≤ α /(m − i + 1), reject Hi if pi ≤ α /
ti, where ti is the maximum number of hypotheses
which can be true given that any (i − 1) hypotheses are false. It is a static procedure, that is,
t1, ..., tm are fully determined for the given hypotheses H1, ..., Hm, independent
of the observed
p-values. The possible numbers of true hypotheses, and thus the values of ti can be obtained
from the recursive formula
where S(k) is the set of possible numbers of true hypotheses with k methods being compared, k ≥ 2, and S(0) = S(1) = {0}.
In G. Bergmann and G. Hommel. Improvements of general multiple test procedures for redundant systems of hypotheses. In P. Bauer, G. Hommel, and E. Sonnemann, editors, Multiple Hypotheses Testing, pages 100–115. Springer, Berlin, 1988 was proposed a procedure based on the idea of finding all elementary hypotheses which cannot be rejected. In order to formulate Bergmann-Hommel’s procedure, we need the following definition.
Definition 1: An index set of hypotheses I ⊆ {1, ..., m} is called exhaustive if exactly all Hj, j ∈ I, could be true.
Under this definition, Bergmann-Hommel procedure works as follows.
Bergmann and Hommel procedure (G. Bergmann and G. Hommel. Improvements of general multiple test procedures for redundant
systems of hypotheses. In P. Bauer, G. Hommel, and E. Sonnemann, editors, Multiple Hypotheses
Testing, pages 100–115. Springer, Berlin, 1988): Reject all Hj with j not in A, where the acceptance set
is the index set of null hypotheses which are retained.
For this procedure, one has to check for each subset I of {1, ..., m} if I is exhaustive, which leads to intensive computation. Due to this fact, we will obtain a set, named E, which will contain all the possible exhaustive sets of hypotheses for a certain comparison. A rapid algorithm which was described in G. Hommel and G. Bernhard. A rapid algorithm and a computer program for multiple test procedures using procedures using logical structures of hypotheses. Computer Methods and Programs in Biomedicine, 43:213–216, 1994 allows a substantial reduction in computing time. Once the E set is obtained, the hypotheses that do not belong to the A set are rejected.
Next figure shows a valid algorithm for obtaining all the exhaustive sets of hypotheses, using as input a list of algorithms C. E is a set of families of hypotheses; likewise, a family of hypotheses is a set of hypotheses. The most important step in the algorithm is the number 6. It performs a division of the classifiers into two subsets, in which the last classifier k always is inserted in the second subset and the first subset cannot be empty. In this way, we ensure that a subset yielded in a division is never empty and no repetitions are produced.
Finally, we will explain how to compute the APVs for the three post-hoc procedures described above, following the indications given in Wright, S.P.: Adjusted p-values for simultaneous inference. Biometrics 48, 1005–1013 (1992).
Shaffer static APVi: min {v;1}, where v = max {tjpj : 1 ≤ j ≤ i}.
Bergmann-Hommel APVi: min {v;1}, where v = max {|I|·min{pj; j ∈ I} : I exhaustive; i ∈
I}.
Case Studies
In this section, two case studies will be presented in order to illustrate the use of multiple comparisons nonparametric test in Computational Intelligence and Data Mining experimentations:
Multiple comparisons with a control method.
We present a case study where four rule induction algorithms will be compared
with a control method. The complete study can be found in S. García, A. Fernández, J. Luengo, F. Herrera, A Study of Statistical Techniques and Performance Measures for Genetics-Based Machine Learning: Accuracy and Interpretability. Soft Computing 13:10 (2009) 959-977, doi:10.1007/s00500-008-0392-y.
Four of them are Genetics-based Machine Learning (GBML) methods:
XCS (Wilson SW (1995) Classifier fitness based on accuracy. Evol Comput
3(2):149–175)
GASSIST-ADI (Bacardit J, Garrell JM (2007) Bloat control and generalization
pressure using the minimum description length principle for
Pittsburgh approach learning classifier system. In: Kovacs T,
Llorá X, Takadama K (eds) Advances at the frontier of learning
classifier systems, vol 4399. LNCS, USA, pp 61–80)
Pitts-GIRLA (Corcoran AL, Sen S (1994) Using real-valued genetic algorithms to
evolve rule sets for classification. In: Proceedings of the IEEE
conference on evolutionary computation, pp 120–124)
HIDER (Aguilar-Ruiz JS, Giráldez R, Riquelme JC (2007) Natural encoding
for evolutionary supervised learning. IEEE Trans Evol Comput
11(4):466–479)
and one of them is the classic CN2 algorithm (Clark P, Niblett T (1989) The CN2 induction algorithm. Machine Learn 3(4):261–283). One of the GBML algorithms is chosen as the control algorithm, XCS (Wilson SW (1995) Classifier fitness based on accuracy. Evol Comput 3(2):149–175). Two performance measures are used for compariong the behaviour of the algorithms: Accuracy and Cohen's kappa (Cohen JA (1960) Coefficient of agreement for nominal scales. Educ Psychol Meas 37–46).
First of all, we have to test whether significant differences exist among all the mean values. The next table shows the result of applying a Friedman and Iman–Davenport tests. The table shows the Friedman and Iman–Davenport values, χF2 and FF,respectively, and it relates them with the corresponding critical values for each distribution by using a level of significance α = 0.05. The p value obtained is also reported for each test. Given that the statistics of Friedman and Iman–Davenport are clearly greater than their associated critical values, there are significant differences among the observed results with a level of significance α ≤ 0.05. According to these results, a posthoc statistical analysis is needed in the two cases.
Then, we will employ a Bonferroni-Dunn test to detect significant differences for the control algorithm in each measure. It obtains the values CD = 1.493 and CD = 1.34 for α = 0.05 and α = 0.10 respectively in the two measures considered. The following figures summarize the ranking obtained by the Friedman test and draw the threshold of the critical difference of Bonferroni–Dunn’ procedure, with the two levels of significance mentioned above. They display a graphical representation composed by bars whose height is proportional to the average ranking obtained for each algorithm in each measure studied. If we choose the smallest of them (which corresponds to the best algorithm), and we sum its height with the critical difference obtained by the Bonferroni method (CD value), we represent a cut line that goes through all the graphic. Those bars which are higher than this cut line belong to the algorithms whose performance is significantly worse than that of the control algorithm.

We will apply more powerful procedures, such as Holm and Hochbergs ones, for comparing the control algorithm with the rest of algorithms. The next table shows all the adjusted p values for each comparison which involves the control algorithm. The p value is indicated in each comparison and we stress in bold the algorithms which are worse than the control, considering a level of significance α = 0.05.
Multiple comparisons among all methods.
Taken from the paper S. García, F. Herrera, An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons. Journal of Machine Learning Research 9 (2008) 2677-2694, this case study show an example involving the four procedures of all pairwise comparison described with a comparison
of five classifiers:
C4.5 (J. R. Quinlan. Programs for Machine Learning. Morgan Kauffman, 1993)
One Nearest Neighbor (1-NN) with Euclidean distance
NaiveBayes
Kernel (G. J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley Series in Probability
and Mathematical Statistics, 2004)
CN2 (P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261–283, 1989)
We have used 10-fold cross validation and standard parameters for each algorithm. The results correspond to average accuracy or 1 − class_error in test data. We have used 30 data sets. Next table shows the overall process of computation of average rankings.
Friedman and Iman and Davenport tests check whether the measured average ranks are significantly different from the mean rank Rj = 3. They respectively use the χ2 and the F statistical distributions to determine if a distribution of observed frequencies differs from the theoretical expected frequencies. Their statistics use nominal (categorical) or ordinal level data, instead of using means and variances. Demsar (2006) detailed the computation of the critical values in each distribution. In this case, the critical values are 9.488 and 2.45, respectively at α = 0.05, and the Friedman’s and Iman-Davenport’s statistics are:
χF2 = 39.647; FF = 14.309.
Due to the fact that the critical values are lower than the respective statistics, we can proceed with the post-hoc tests in order to detect significant pairwise differences among all the classifiers. For this, we have to compute and order the corresponding statistics and p-values. The standard error in the pairwise comparison between two classifiers is SE = 0.408. The following table presents the family of hypotheses ordered by their p-value and the adjustment of a by Nemenyi’s, Holm’s and Shaffer’s static procedures.
Nemenyi’s test rejects the hypotheses [1–4] since the corresponding p-values are smaller than
the adjusted α’s.
Holm’s procedure rejects the hypotheses [1–5].
Shaffer’s static procedure rejects the hypotheses [1–6].
Bergmann-Hommel’s dynamic procedure first obtains the exhaustive index set of hypotheses.
It obtains 51 index sets. From the index sets, it computes the A
set. It rejects all hypotheses Hj with j not in A, so it rejects the hypotheses [1–8].
Next table shows the results in the final form of APVs for the example considered in this section. As we can see, this example is suitable for observing the difference of power among the test procedures. Also, this table can provide information about the state of retainment or rejection of any hypothesis, comparing its associated APV with the level of significance previously fixed.
Considerations on the Use of Nonparametric Tests
Taking into consideration all the characteristics and empirical results obtained on the application of the non-parametric tests, we can suggest some aspects and details about the use of non-parametric statistical techniques:
A multiple comparison of various algorithms must be carried out first by using a
statistical method for testing the differences among the related samples means, that
is, the results obtained by each algorithm. Once this test rejects the hypothesis of
equivalence of means, the detection of the concrete differences among the algorithms
can be done with the application of post-hoc statistical procedures, which
are methods used for comparing a control algorithm with two or more algorithms.
Holm’s procedure can always be considered better than Bonferroni-Dunn’s one,
because it appropriately controls the FWER and it is more powerful than the
Bonferroni-Dunn’s. We strongly recommend the use of Holm’s method in a rigorous
comparison. Nevertheless, the results offered by the Bonferroni-Dunn’s test are suitable to be visualized in graphical representations.
Hochberg’s procedure is more powerful than Holm’s. The differences reported
between it and Holm’s procedure are in practice rather small. We recommend the use of this test together with Holm’s
method.
Although Wilcoxon’s test and the remaining post-hoc tests for multiple comparisons
belong to the non-parametric statistical tests, they operate in a different way.
The main difference lies in the computation of the ranking. Wilcoxon’s test computes
a ranking based on differences between case problems independently, whereas
Friedman and derivative procedures compute the ranking between algorithms.
In relation to the sample size (number of case problems when performing Wilcoxon’s
or Friedman’s tests in multiple-problem analysis), there are two main aspects to be determined. Firstly, the minimum sample considered acceptable for each test needs
to be stipulated. There is no established agreement about this specification. Statisticians
have studied the minimum sample size when a certain power of the statistical
test is expected. In our case, the employment of a,
as large as possible, sample size is preferable, because the power of the statistical
tests (defined as the probability that the test will reject a false null hypothesis) will
increase. Moreover, in a multiple-problem analysis, the increasing of the sample
size depends on the availability of new case problems (which should be well-known in
computational intelligence or data mining field). Secondly, we have to study how the results are
expected to vary if there was a larger sample size available. In all statistical tests
used for comparing two or more samples, the increasing of the sample size benefits
the power of the test. In the following items, we will state that Wilcoxon’s test is
less influenced by this factor than Friedman’s test. Finally, as a rule of thumb, the
number of case problems (N) in a study should be N = a · k, where k is the number of
algorithms to be compared and a ≥ 2.
Taking into account the previous observation and knowing the operations performed
by the non-parametric tests, we can deduce that Wilcoxon’s test is influenced
by the number of case problems used. On the other hand, both the number of
algorithms and case problems are crucial when we refer to the multiple comparisons
tests (such as Friedman’s test), given that all the critical values depend on the value
of N (see expressions above). However, the increasing/decreasing of
the number of case problems rarely affects in the computation of the ranking. In these
procedures, the number of functions used is an important factor to be considered
when we want to control the FWER.
An appropriate number of algorithms in contrast with an appropriate number of
case problems are needed to be used in order to employ each type of test. The number
of algorithms used in multiple comparisons procedures must be lower than
the number of case problems. In general, p-values are lower agreeing with the increasing of the number of case problems used in multiple comparison procedures; therefore, the differences among the algorithms are more detectable.
The previous statement may not be true in Wilcoxon’s test. The influence of the
number of case problems used is more noticeable in multiple comparisons procedures
than in Wilcoxon’s test.
When comparing all algorithms among themselves, we do not recommend the use of Nemenyi’s test, because it is a very conservative procedure
and many of the obvious differences may not be detected.
However, conducting the Shaffer static procedure means a not very significant increase of the
difficulty with respect to the Holm procedure. Moreover, the benefit of using information
about logically related hypothesis is noticeable, thus we strongly encourage the use of this
procedure.
Bergmann-Hommel’s procedure is the best performing one, but it is also the most difficult
to understand and computationally expensive. We recommend its usage when the situation
requires so (i.e., the differences among the classifiers compared are not very significant),
given that the results it obtains are as valid as using other testing procedure.
Relevant Journal Papers with Data Mining and Computational Intelligence Case Studies
J. Demsar, Statistical Comparisons of Classifiers over Multiple Data Sets. Journal Of Machine Learning Research 7 (2006) 1-30 link.
![]()
Abstract:While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams.
Summary:
- 1. Introduction.
- 2. Previous Work.
- 2.1. Related Theoretical Work.
- 2.2. Testing in Practice: Analysis of ICML Papers.
- 3. Statistics and Tests for Comparison of Classifiers.
- 3.1. Comparisons of Two Classifiers.
- 3.2. Comparisons of Multiple Classifiers.
- 4. Empirical Comparison of Tests.
- 4.1. Experimental Setup.
- 4.2. Comparisons of Two Classifiers.
- 4.3. Comparisons of Multiple Classifiers.
- 5. Conclusion.
S. García, F. Herrera, An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons. Journal of Machine Learning Research 9 (2008) 2677-2694 link .
![]()
Abstract: In a recently published paper in JMLR, Demsar (2006) recommends a set of non-parametric statistical tests and procedures which can be safely used for comparing the performance of classifiers over multiple data sets. After studying the paper, we realize that the paper correctly introduces the basic procedures and some of the most advanced ones when comparing a control method. However, it does not deal with some advanced topics in depth. Regarding these topics, we focus on more powerful proposals of statistical procedures for comparing n x n classifiers. Moreover, we illustrate an easy way of obtaining adjusted and comparable p-values in multiple comparison procedures.
Summary:
- 1. Introduction.
- 2. Comparison of Multiple Classifiers: Performing All Pairwise Comparisons.
- 2.1. Advanced Procedures for Performing All Pairwise Comparisons.
- 2.2. Performing All Pairwise Comparisons: A Case Study.
- 3. Adjusted P-Values.
- 4. Experimental Framework.
- 5. Conclusions.
J. Luengo, S. García, F. Herrera, A Study on the Use of Statistical Tests for Experimentation with Neural Networks: Analysis of Parametric Test Conditions and Non-Parametric Tests. Expert Systems with Applications 36 (2009) 7798-7808 doi:10.1016/j.eswa.2008.11.041.
![]()
Abstract: In this paper, we focus on the experimental analysis on the performance in artificial neural networks with the use of statistical tests on the classification task. Particularly, we have studied whether the sample of results from multiple trials obtained by conventional artificial neural networks and support vector machines checks the necessary conditions for being analyzed through parametrical tests. The study is conducted by considering three possibilities on classification experiments: random variation in the selection of test data, the selection of training data and internal randomness in the learning algorithm. The results obtained state that the fulfillment of these conditions are problem-dependent and indefinite, which justifies the need of using non-parametric statistics in the experimental analysis.
Summary:
- 1. Introduction.
- 2. Classification algorithms and experimentation framework.
- 2.1. Artificial neural networks and support vector machines.
- 2.2. Experimentation framework
- 3. Study on the initial conditions for parametric tests using artificial neural networks
- 3.1. Conditions for the use of parametric tests.
- 3.2. Normality test over the group of data sets and algorithms.
- 3.3. Case studies of the normality property.
- 4. On the use of rank-based non-parametric tests: a short experimental study.
- 4.1. Rank-based non-parametric tests.
- 4.2. Experimental study: results and analysis.
- 5. Conclusions
S. García, A. Fernández, J. Luengo, F. Herrera, A Study of Statistical Techniques and Performance Measures for Genetics-Based Machine Learning: Accuracy and Interpretability. Soft Computing 13:10 (2009) 959-977, doi:10.1007/s00500-008-0392-y.
![]()
Abstract: The experimental analysis on the performance of a proposed method is a crucial and necessary task to carry out in a research. This paper is focused on the statistical analysis of the results in the field of genetics-based machine Learning. It presents a study involving a set of techniques which can be used for doing a rigorous comparison among algorithms, in terms of obtaining successful classification models. Two accuracy measures for multiclass problems have been employed: classification rate and Cohen’s kappa. Furthermore, two interpretability measures have been employed: size of the rule set and number of antecedents. We have studied whether the samples of results obtained by genetics-based classifiers, using the performance measures cited above, check the necessary conditions for being analysed by means of parametrical tests. The results obtained state that the fulfillment of these conditions are problem-dependent and indefinite, which supports the use of non-parametric statistics in the experimental analysis. In addition, non-parametric tests can be satisfactorily employed for comparing generic classifiers over various data-sets considering any performance measure. According to these facts, we propose the use of the most powerful non-parametric statistical tests to carry out multiple comparisons. However, the statistical analysis conducted on interpretability must be carefully considered.
Summary:
- 1. Introduction.
- 2. Genetics-based machine learning algorithms for classification.
- 3. Performance measures and experimental results.
- 3.1. Accuracy measures for multi-class problems.
- 3.2. Interpretability measures.
- 3.3. Experimental results.
- 4. Study on the initial conditions for parametric tests using genetics-based machine learning.
- 4.1. Conditions for a safe use of parametric tests.
- 4.2. Analysis of the conditions for a safe use of parametric tests.
- 4.3. Case studies of the normality property.
- 5. Non-parametric tests for comparing two algorithms in multiple data-set analysis.
- 5.1. Wilcoxon signed-ranks test.
- 5.2. A case study in GBML: performing pairwise comparisons.
- 6. Non-parametric tests for multiple comparisons
among more than two algorithms.
- Friedman test and post-hoc tests.
- A case study in GBML: performing multiple comparisons.
- 7. Analysing interpretability of models.
- 8. Conclusions.
S. García, D. Molina, M. Lozano, F. Herrera, A Study on the Use of Non-Parametric Tests for Analyzing the Evolutionary Algorithms' Behaviour: A Case Study on the CEC'2005 Special Session on Real Parameter Optimization. Journal of Heuristics, 15 (2009) 617-644. doi: 10.1007/s10732-008-9080-4.
![]()
Abstract: In recent years, there has been a growing interest for the experimental
analysis in the field of evolutionary algorithms. It is noticeable due to the existence
of numerous papers which analyze and propose different types of problems, such as
the basis for experimental comparisons of algorithms, proposals of different methodologies
in comparison or proposals of use of different statistical techniques in algorithms’
comparison.
In this paper, we focus our study on the use of statistical techniques in the analysis
of evolutionary algorithms’ behaviour over optimization problems. A study about
the required conditions for statistical analysis of the results is presented by using
some models of evolutionary algorithms for real-coding optimization. This study is
conducted in two ways: single-problem analysis and multiple-problem analysis. The
results obtained state that a parametric statistical analysis could not be appropriate
specially when we deal with multiple-problem results. In multiple-problem analysis,
we propose the use of non-parametric statistical tests given that they are less restrictive
than parametric ones and they can be used over small size samples of results.
As a case study, we analyze the published results for the algorithms presented in the CEC’2005 Special Session on Real Parameter Optimization by using non-parametric
test procedures.
Summary:
- 1. Introduction.
- 2. Preliminaries: settings of the CEC’2005 Special Session.
- 2.1. Evolutionary algorithms.
- 2.2. Test functions.
- 2.3. Characteristics of the experimentation.
- 3. Study of the required conditions for the safe use of parametric tests.
- 3.1. Conditions for the safe use of parametric tests.
- 3.2. On the study of the required conditions over single-problem analysis.
- 3.3. On the study of the required conditions over multiple-problem analysis.
- 4. A case study: on the use of non-parametric statistics for comparing the results of the CEC’2005 Special Session in Real Parameter Optimization.
- 5. Some considerations on the use of non-parametric tests.
- 6. Conclusions
Relevant books on Non-parametric tests
The reader can increase the information provided in this web page. We recommend the reading of the following books:

D.J. Sheskin. Handbook of Parametric and Non-Parametric Statistical Procedures. CRC Press (2007) 
J.H. Zar. Biostatistical Analysis. Prentice Hall. (2009) 
W.W. Daniel. Applied Nonparametric Statistics. Houghton Mifflin Harcourt. (1990) 
W.J. Conover. Practical Nonparametric Statistics. Wiley. (1998) 
M. Hollander and D.A. Wolfe. Nonparametric Statistical Methods.Wiley-Interscience. (1999) 
J.J. Higgins. Introduction to Modern Nonparametric Statistics. Duxbury Press. (2003).
Topic Slides
F. Herrera (November 2008). How must I Do my Experimental Study? Design of Experiments in Data Mining/Computational Intelligence. Using Non-parametric Tests. Some Cases of Study.
Software and User's Guide
We offer a software developed in JAVA which calculates all the multiple comparisons procedures described in this web page. It allows as input files in CSV format and obtains as output a LaTeX file with tabulated information about Friedman, Iman-Davenpor. Bonferroni-Dunn, Holm, Hochberg, Shaffer and Bergamnn-Hommel tests. It also computes and shows the adjusted p-values.
The software can be found here and can be used to:
Read data of results of k algorithms over N case problems in CSV format. The
data can correspond to accuracy, AUC or any other performance measure.
Compute the rankings through the Friedman procedure of k algorithms
over N case problems.
Compute the Friedman and Iman-Davenport Statistics corresponding to
the input data.
Show the tables with the set of hypotheses, unadjusted p-values for each
comparison and adjusted level of significance for Bonferroni-Dunn, Holm and
Hochberg procedures: 1 x n comparison.
Show the table with adjusted p-values for the procedures 1 x n mentioned
in the previous item.
Show the tables with the set of hypotheses, unadjusted p-values for each
comparison and adjusted level of significance for Nemenyi, Holm, Shaffer's
static and Bergmann-Hommel's dynamic procedures: n x n comparison.
Show the table with adjusted p-values for the procedures n x n mentioned
in the previous item.
Give a report detailing the rejected hypotheses considering the levels of
significance α = 0.05 and α = 0.10.
The program is written in JAVA, so an installed JVM in the computer is
needed in order to run it. To do this, execute:
java Friedman < data file >
An example of data is included in the package. The output is given in LATEX
format on standard output. We recommend to redirect the output to a file in
the following manner:
java Friedman data.csv > output.tex
© Copyright 2009 SCI2S (Soft Computing and Intelligent Information Systems)
This Web page was created and maintained by Salvador García López
Genetic
Computing
Statistical Inference in
H-index
Missing Values
E. A. & Metaheur. 


SECABA Software







