An Overview of Ensemble Methods for Binary Classifiers in Multi-class Problems: Experimental Study on One-vs-One and One-vs-All Schemes
This Website contains additional material to the paper
M. Galar, A. Fernández, E. Barrenechea, H. Bustince and F. Herrera, An Overview of Ensemble Methods for Binary Classifiers in Multi-class Problems: Experimental Study on One-vs-One and One-vs-All Schemes. Pattern Recognition 44:8 (2011) 1761-1776, doi: 10.1016/j.patcog.2011.01.017.
according to the following summary:
- Paper Content
- Reducing Multi-class Problems by Binarization Techniques
- State-of-the-art on aggregation schemes for binarization techniques
- Data-sets partitions employed in the Paper
- Experimental Study
Paper Content
Abstract: Classification problems involving multiple classes can be addressed in different ways. One of the most popular techniques consists in dividing the original data-set into two-class subsets, learning a different binary model for each new subset. These techniques are known as binarization strategies.
In this work, we are interested in ensemble methods based on binarization techniques; in particular, we focus on the well-known One-vs-One and One-vs-All decomposition strategies, paying special attention to the final step of the ensembles, the combination of the outputs of the binary classifiers. Our aim is to develop an empirical analysis of different aggregations to combine these outputs. To do so, we carry out a double study: first, we use different base classifiers in order to observe the suitability and potential of each combination within each classifier. Then, we compare the performance of these ensemble techniques with the classifiers themselves, thereby also analysing the improvement with respect to the classifiers that handle multiple classes inherently.
We carry out the experimental study with several well-known algorithms from the literature, such as Support Vector Machines, Decision Trees, Instance Based Learning and Rule Based Systems. We will show, supported by several statistical analyses, the goodness of the binarization techniques with respect to the base classifiers and, finally, we will point out the most robust techniques within this framework.
Summary:
- Introduction.
- Reducing multi-class problems by binarization techniques.
- Preliminaries: decomposition strategies in multi-classification.
- One-vs-One decomposition scheme.
- One-vs-All decomposition scheme.
- State-of-the-art on aggregation schemes for binarization techniques.
- Aggregations in One-vs-One.
- Aggregations in One-vs-All.
- Experimental framework.
- Algorithms used for the study.
- Performance measures.
- Statistical tests.
- Data-sets.
- Parameters.
- Web page associated to the paper.
- Experimental study.
- Which is the most appropriate aggregation for each decomposition scheme?
- Analysis of aggregations for OVO scheme.
- Analysis of aggregations for OVA scheme.
- Should we do binarization? How should we do it?
- Discussion: lessons learned and future work.
- Concluding remarks.
Reducing Multi-class Problems by Binarization Techniques
Many proposals have been developed under the label of binarization for multi-classification (A.C. Lorena, A.C. Carvalho, and J.M. Gama. A review on the combination of binary classifiers in multiclass problems. Artificial Intelligence Review, 30(1-4):19–37, 2008). The underlying idea is to undertake the multi-classification using binary classifiers with a divide-and-conquer strategy. Binary problems are simpler to solve than the original multi-category problem; however, drawbacks exist: the outputs of each new classifier have to be combined in order to make the final decision on the predicted class. Hence, a correct management of the outputs is crucial to produce a correct prediction.
The most common decomposition strategies are OVO and OVA. The former consists of using a binary classifier to discriminate between each pair of classes, while the latter uses a binary classifier to distinguish between a single class and the remaining ones. In both cases, the simplest combination is the application of a voting strategy, where each classifier votes for the predicted class and the class with the largest number of votes is predicted (in OVA only one positive answer is expected).
One-vs-One decomposition scheme
The OVO decomposition scheme divides an m-class problem into m(m-1) / 2 binary problems. Each problem is faced by a binary classifier, which is responsible for distinguishing between a different pair of classes. The learning phase of each classifier uses as training data only the subset of instances from the original data-set that contains either of its two corresponding class labels; instances with other class labels are simply ignored.
In the validation phase, a pattern is presented to each one of the binary classifiers. The output of a classifier, given by rij in [0,1], is the confidence of the binary classifier discriminating classes i and j in favour of the former class. The confidence of the classifier for the latter is computed by rji = 1 - rij if the classifier does not provide it (the class with the largest confidence is the output class of the classifier). These outputs are represented by a score matrix R:
$$R=\begin{pmatrix} - & r_{12} & \cdots & r_{1m} \\ r_{21} & - & \cdots & r_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & - \end{pmatrix}$$
The final output of the system is derived from the score matrix by different aggregation models. The state-of-the-art on the combinations to obtain the final output is summarised in Subsection 3.1. A voting strategy is the simplest case, where each classifier gives a vote for the predicted class and the class with the largest number of votes is predicted.
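As an illustration only (not the KEEL implementation used in the paper), the following Python sketch trains one binary classifier per pair of classes and fills the score matrix R for a single test pattern; LogisticRegression is just a hypothetical stand-in for any base learner offering a confidence via `predict_proba`. One of the aggregations of Subsection 3.1 is then applied to R to obtain the predicted class.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression  # hypothetical placeholder base learner

def train_ovo(X, y):
    """Train one binary classifier for each pair of classes (OVO decomposition)."""
    classes = np.unique(y)
    models = {}
    for ci, cj in combinations(classes, 2):
        mask = np.isin(y, [ci, cj])  # keep only the patterns of classes ci and cj
        models[(ci, cj)] = LogisticRegression(max_iter=1000).fit(
            X[mask], (y[mask] == ci).astype(int))
    return classes, models

def score_matrix(x, classes, models):
    """Fill the m x m score matrix R for a single test pattern x."""
    idx = {c: k for k, c in enumerate(classes)}
    R = np.zeros((len(classes), len(classes)))
    for (ci, cj), clf in models.items():
        r_ij = clf.predict_proba(x.reshape(1, -1))[0, 1]  # confidence in favour of class ci
        R[idx[ci], idx[cj]] = r_ij
        R[idx[cj], idx[ci]] = 1.0 - r_ij                  # r_ji = 1 - r_ij when not provided
    return R
```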
One-vs-All decomposition scheme
The OVA decomposition divides an m-class problem into m binary problems. Each problem is faced by a binary classifier, which is responsible for distinguishing one of the classes from all the others. The learning step of each classifier uses the whole training data, considering the patterns of the single class as positives and all other examples as negatives.
In the validation phase, a pattern is presented to each one of the binary classifiers, and the classifier giving a positive output indicates the output class. In many cases the positive output is not unique, and some tie-breaking technique is required. The most common approach uses the confidence of the classifiers to decide the final output, predicting the class of the classifier with the largest confidence. Instead of a score matrix, when dealing with the outputs of OVA classifiers (where ri in [0,1] is the confidence for class i), a score vector is used:
$$R=(r_1,r_2,...,r_i,...,r_m)$$
In Subsection 3.2 we summarise the state-of-the-art on the combinations for the OVA approach.
State-of-the-art on aggregation schemes for binarization techniques
We divide this section into two subsections, the first one is oriented to the combinations for OVO decomposition where the aggregation is made from a score matrix. The second one reviews the combinations for OVA scheme, where the outputs of the classifiers are given by a score vector.
Aggregations in One-vs-One
Below we give a short description of the aggregation methods employed in the experimental study to obtain the predicted class from the score matrix produced by the classifiers of an OVO decomposition scheme; a small code sketch of the two voting-based rules (VOTE and WV) is given after the list.
- Voting strategy (VOTE) (also called binary voting and Max-Wins rule (J.H. Friedman. Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University, 1996)). Each binary classifier gives a vote for the predicted class. The votes received by each class are counted and the class with the largest number of votes is predicted:
$$Class=arg \ max_{i=1,...,m} \displaystyle\sum_{1 \le j \ne i \le m}s_{ij}$$
where sij is 1 if rij > rji and 0 otherwise.
- Weighted voting strategy (WV). Each binary classifier votes for both classes. The weight for the vote is given by the confidence of the classifier predicting the class. The class with the largest sum value is the final output class:
$$Class=arg \ max_{i=1,...,m} \displaystyle\sum_{1 \le j \ne i \le m}r_{ij}$$
- Classification by Pairwise Coupling (PC) (T. Hastie and R. Tibshirani. Classification by pairwise coupling. Annals of Statistics, 26(2):451–471, 1998). This method estimates the joint probability for all classes from the pairwise class probabilities of the binary classifiers. Hence, when rij = Prob(Classi | Classi or Classj), the method finds the best approximation of the class posterior probabilities $\widehat{p}=(\widehat{p_1},...,\widehat{p_m})$ according to the classifiers' outputs. The class with the largest posterior probability is predicted:
$$Class=arg max_{i=1,...,m} \widehat{p_i}$$
To compute the posterior probabilities, the Kullback-Leibler (KL) distance between $r_{ij}$ and $\mu_{ij}$ is minimised:
$$l(p)= \displaystyle\sum_{i < j}n_{ij}\left(r_{ij}\log \frac{r_{ij}}{\mu_{ij}}+(1-r_{ij})\log \frac{1-r_{ij}}{1-\mu_{ij}}\right)$$
where $\mu_{ij}=p_i/(p_i+p_j)$, $r_{ji}=1-r_{ij}$ and $n_{ij}$ is the number of training instances from the ith and jth classes.
- Decision Directed Acyclic Graph (DDAG) (J.C. Platt, N. Cristianini and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems, pages 547–553. MIT Press, 2000). The DDAG method constructs a rooted binary acyclic graph where each node is associated with a list of classes and a binary classifier. At each level, a classifier discriminates between two classes, and the class which is not predicted is removed from the list. The last class remaining on the list is the final output class.
- Learning Valued Preference for Classification (LVPC) (E. Hüllermeier and K. Brinker. Learning valued preference structures for solving classification problems. Fuzzy Sets and Systems, 159(18):2337–2352, 2008, and J.C. Huhn and E. Hüllermeier. FR3: A fuzzy rule learner for inducing reliable classifiers. IEEE Transactions on Fuzzy Systems, 17(1):138–149, 2009). This method considers the score matrix as a fuzzy preference relation; based on fuzzy preference modelling, the original relation is decomposed into three new relations with different meanings: the strict preference, the conflict and the ignorance. A decision rule based on the voting strategy is proposed to obtain the output class from them:
$$Class=arg \ max_{i=1,...,m} \displaystyle\sum_{1 \le j \ne i \le m}P_{ij}+\frac{1}{2}C_{ij}+\frac{N_i}{N_i+N_j}I_{ij}$$
where Ni is the number of examples from class i in the training data (and hence, an unbiased estimate of the class probability), Cij is the degree of conflict (the degree to which both classes are supported), Iij is the degree of ignorance (the degree to which none of the classes is supported) and, finally, Pij and Pji are respectively the strict preferences for i and j. Preference, conflict and ignorance degrees are computed as follows:
$$P_{ij}=r_{ij}-min(r_{ij},r_{ji})$$
$$P_{ji}=r_{ji}-min(r_{ij},r_{ji})$$
$$C_{ij}=min(r_{ij},r_{ji})$$
$$I_{ij}=1-max(r_{ij},r_{ji})$$
- Preference Relations Solved by Non-Dominance Criterion (ND) (A. Fernández, M. Calderón, E. Barrenechea, H. Bustince, and F. Herrera. Enhancing fuzzy rule based systems in multi-classification using pairwise coupling with preference relations. In EUROFUSE'09: Workshop on Preference Modelling and Decision Analysis, pages 39–46, 2009). The Non-Dominance Criterion was originally defined for decision making with fuzzy preference relations (S.A. Orlovsky. Decision-making with a fuzzy preference relation. Fuzzy Sets and Systems, 1(3):155–167, 1978). In this case, as in LVPC, the score matrix is considered as a fuzzy preference relation. The relation has to be normalised; then the degree of non-dominance is computed (the degree to which class i is dominated by none of the remaining classes) and the class with the largest degree is predicted.
- Binary Tree of Classifiers (BTC). The Binary Tree of SVM (BTS) (B. Fei and J. Liu. Binary tree of SVM: a new fast multiclass training and classification algorithm. IEEE Transactions on Neural Networks, 17(3):696–704, 2006) can easily be extended to any type of binary classifier. The idea behind this method is to reduce the number of classifiers and increase the global accuracy by using some of the binary classifiers, trained to discriminate between a pair of classes, to distinguish other classes at the same time. The tree is constructed recursively and, in a similar way to the DDAG approach, each node has an associated binary classifier and list of classes. In this case, however, the decision of the classifier can distinguish classes other than the pair used for training, so in each node, once the decision is made, more than one class can be removed from the list. In order to avoid false assumptions, a probability is used so that, when the examples of a class are near the discriminant boundary, the class is not removed from the lists of the following level.
- Nesting One-vs-One (NEST) (Z. Liu, B. Hao and X. Yang. Nesting algorithm for multi-classification problems. Soft Computing, 11(4):383–389, 2007, and Z. Liu, B. Hao and E.C.C. Tsang. Nesting one-against-one algorithm based on SVMs for pattern classification. IEEE Transactions on Neural Networks, 19(12):2044–2052, 2008). This method is designed specifically to tackle the unclassifiable region produced by the voting strategy (it is easy to see that, in a three-class problem, if each binary classifier votes for a different class there is no winner, so some tie-breaking technique has to be applied). Nesting OVO uses the voting strategy, but when there exist examples within the unclassifiable region, a new OVO system is constructed using only the examples in that region in order to make them classifiable. This process is repeated until no examples remain in the unclassifiable region of the nested OVO.
- Wu, Lin and Weng Probability Estimates by Pairwise Coupling approach (PE) (T.-F. Wu, C.-J. Lin and R.C. Weng. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5:975–1005, 2004). PE is similar to PC in that it also estimates the posterior probabilities (p) of each class starting from the pairwise probabilities. In this case, while the decision rule is equivalent (predicting the class with the largest probability), the optimisation formulation is different. PE optimises the following problem:
$$min_p \displaystyle\sum_{i=1}^{m} \displaystyle\sum_{1 \le j \ne i \le m}(r_{ji}p_i - r_{ij}p_j)^2 \ subject \ to \ \displaystyle\sum_{i=1}^{m}p_i=1, \ p_i \ge 0, \ \forall i$$
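To make the voting-based rules above concrete, here is a minimal numpy sketch of VOTE and WV operating on the score matrix R defined in Section 2 (our own illustration; the remaining aggregations follow the formulations of their respective references):

```python
import numpy as np

def vote(R):
    """VOTE: each pairwise classifier votes for the class with the larger confidence."""
    S = (R > R.T).astype(int)      # s_ij = 1 if r_ij > r_ji, 0 otherwise
    np.fill_diagonal(S, 0)
    return int(np.argmax(S.sum(axis=1)))

def weighted_vote(R):
    """WV: every classifier votes for both classes, weighting each vote by r_ij."""
    W = R.copy()
    np.fill_diagonal(W, 0.0)
    return int(np.argmax(W.sum(axis=1)))

# Usage with the score matrix R of a test pattern:
#   predicted = classes[vote(R)]   or   predicted = classes[weighted_vote(R)]
```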
Aggregations in One-vs-All
Below we give a brief description of the combination methods employed to obtain the predicted class from the score vector of outputs of an OVA scheme; a small code sketch of both is given after the list.
- Voting strategy (MAX). It is similar to the voting strategy of OVO systems: the output class is that of the classifier with the largest positive answer:
$$Class=arg \ max_{i=1,...,m}r_{i}$$
- Dynamically Ordered One-vs-All (DOO) (J.-H. Hong, J.-K. Min, U.-K. Cho, and S.-B. Cho. Fingerprint classification using one-vs-all support vector machines dynamically ordered with naïve Bayes classifiers. Pattern Recognition, 41(2):662–671, 2008). This method does not base its decision on the confidence of the OVA classifiers. Instead, a naïve Bayes classifier is trained (using samples from all classes) together with all the other classifiers. This new classifier establishes the order in which the OVA classifiers are executed for a given pattern. The instance is then submitted to each OVA classifier in that order until a positive answer is obtained, which indicates the predicted class. This is done dynamically for each example. In this manner, ties are avoided a priori by the naïve Bayes classifier instead of relying on the confidence degrees given by the outputs of the classifiers.
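The following sketch (again our own illustration, assuming scikit-learn models rather than the KEEL implementations used in the paper) shows an OVA decomposition combined with the MAX rule and a simplified version of the DOO ordering based on a Gaussian naïve Bayes model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # hypothetical placeholder base learner
from sklearn.naive_bayes import GaussianNB

def train_ova(X, y):
    """One binary classifier per class: class c vs. all the others."""
    classes = np.unique(y)
    models = [LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int)) for c in classes]
    return classes, models

def predict_max(x, classes, models):
    """MAX rule: predict the class whose classifier returns the largest confidence r_i."""
    r = np.array([clf.predict_proba(x.reshape(1, -1))[0, 1] for clf in models])
    return classes[int(np.argmax(r))]

def predict_doo(x, classes, models, nb):
    """Simplified DOO: query the OVA classifiers in the order given by a naive Bayes
    model trained on all classes (e.g. nb = GaussianNB().fit(X, y)) until one of
    them answers positively."""
    order = np.argsort(-nb.predict_proba(x.reshape(1, -1))[0])   # most likely class first
    for k in order:
        if models[k].predict(x.reshape(1, -1))[0] == 1:          # positive answer found
            return classes[k]
    return classes[order[0]]  # assumption: fall back to the naive Bayes choice if no positive answer
```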
Technical description report
This Website provides a complementary PDF associated with the paper, which contains a more extensive and detailed description of the aggregation methods for the OVO and OVA schemes:
Aggregation schemes for binarization techniques. Methods' description
Data-sets partitions employed in the paper
In the study, we selected fifteen data-sets from the UCI repository. Accuracy rate and kappa metric estimates were obtained by means of a 5-fold cross-validation. That is, the data-set was split into 5 folds, each one containing 20% of the patterns of the data-set. For each fold, the algorithm was trained with the examples contained in the remaining folds and then tested with the current fold.
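The validation protocol can be approximated with scikit-learn as sketched below; note that the study uses the pre-computed KEEL partitions provided in Table 1, so this snippet is only an illustrative, assumed equivalent of the procedure:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.tree import DecisionTreeClassifier  # stand-in for any base classifier

def five_fold_estimates(X, y, make_model=DecisionTreeClassifier):
    """Average accuracy and Cohen's kappa over a 5-fold cross-validation."""
    accs, kappas = [], []
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in cv.split(X, y):
        model = make_model().fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], pred))
        kappas.append(cohen_kappa_score(y[test_idx], pred))
    return float(np.mean(accs)), float(np.mean(kappas))
```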
Table 1 summarises the properties of the selected data-sets. It shows, for each data-set, the number of examples (#Ex.), the number of attributes (#Atts.), the number of numerical (#Num.) and nominal (#Nom.) attributes, and the number of classes (#Cl.). The last column of this table contains a link for downloading the 5-fold cross-validation partitions for each data-set in KEEL format. You may also download all data-sets by clicking here. Some of the largest data-sets (nursery, page-blocks, penbased, satimage, shuttle and led7digit) were stratified-sampled at 10% in order to reduce the computational time required for training. In the case of data-sets with missing values (autos, cleveland and dermatology), we removed the corresponding instances before creating the partitions.
The selection of these data-sets was carried out according to the premise of having more than 3 classes and a good behaviour with all the base classifiers, that is, an average accuracy higher than 50%. This allows us to carry out a sound analysis based on data-sets with a large representation of classes and without the noise introduced by data-sets with low classification rates, so that we obtain more meaningful results from a multi-classification point of view.
Experimental Study
Algorithms used for the study
We have selected several well-known Machine Learning algorithms as base classifiers (a rough scikit-learn approximation of this setup is sketched after the list):
- SVM (V. Vapnik. Statistical Learning Theory. New York: Wiley, 1998) maps the original input space into a high-dimensional feature space via a certain kernel function (avoiding the explicit computation of the inner product of two vectors). In the new feature space, the optimal separating hyperplane with maximal margin is determined in order to minimise an upper bound of the expected risk instead of the empirical risk. We use the SMO training algorithm (J.C. Platt. Fast training of support vector machines using sequential minimal optimization. MIT Press, Cambridge, MA, USA, 1999) to obtain the SVM base classifiers.
- C4.5 (J.R. Quinlan. C4.5: Programs for Machine Learning. San Mateo-California: Morgan Kaufmann Publishers, 1st edition, 1993.) is a decision tree generating algorithm. It induces classification rules in the form of decision trees from a set of given examples. The decision tree is constructed top-down using the normalised information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalised information gain is the one used to make the decision.
- kNN (G.J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. JohnWiley and Sons, 2004.) k-Nearest Neighbours finds a group of k instances in the training set that are closest to the test pattern. The predicted class label is based on the predominance of a particular class in this neighbourhood. The distance and the number of neighbours are key elements of this algorithm.
- Ripper (W.W. Cohen. Fast effective rule induction. In ICML’95: Proceedings of the Twelfth International Conference on Machine Learning, pages 1–10, 1995.) Repeated Incremental Pruning to Produce Error Reduction builds a decision list of rules to predict the corresponding class for each instance. The list of rules is grown one by one and immediately pruned. Once a decision list for a given class is completely learned, an optimization stage is performed.
- PDFC (Y. Chen and J.Z. Wang. Support vector learning for fuzzy rule-based classification systems. IEEE Transactions on Fuzzy Systems, 11(6):716–728, 2003). The Positive Definite Fuzzy Classifier constructs a fuzzy rule-based classification system by extracting fuzzy rules from a trained SVM. Since the learning process minimises an upper bound on the expected risk instead of the empirical risk, the classifier usually has good generalisation ability.
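For readers who wish to reproduce a rough approximation of this setup outside KEEL, the base classifiers could be instantiated along the following lines; these scikit-learn models are assumed stand-ins (there is no exact C4.5, Ripper or PDFC implementation in scikit-learn, and the HVDM distance used for kNN is not available either), not the algorithms actually used in the paper:

```python
from sklearn.svm import SVC                       # SMO-trained SVM with polynomial kernel, as in Table 2
from sklearn.tree import DecisionTreeClassifier   # CART as a rough stand-in for C4.5
from sklearn.neighbors import KNeighborsClassifier

base_classifiers = {
    "SVM": SVC(kernel="poly", degree=1, C=1.0, probability=True),
    "C4.5-like": DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2),
    "1NN": KNeighborsClassifier(n_neighbors=1),  # Euclidean distance here, HVDM in the paper
    "3NN": KNeighborsClassifier(n_neighbors=3),
    # Ripper and PDFC have no scikit-learn counterpart; the paper uses the KEEL implementations.
}
```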
Parameters
In Table 2 we detail the parameter values for the different algorithms selected in this study. They have been set following the recommendations of the corresponding authors, which correspond to the default parameter settings included in the KEEL software (J. Alcalá-Fdez, L. Sánchez, S. García, M. del Jesus, S. Ventura, J. Garrell, J. Otero, C. Romero, J. Bacardit, V. Rivas, J. Fernández, and F. Herrera. KEEL: A software tool to assess evolutionary algorithms to data mining problems. Soft Computing, 13(3):307–318, 2009; and J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera. KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, in press).
| Algorithm | Parameters |
|---|---|
| SVM | C = 1.0, Tolerance Parameter = 0.001, Epsilon = 1.0E-12, Kernel Type = Polynomial, Polynomial Degree = 1, Fit Logistic Models = True |
| C4.5 | Confidence level = 0.25, Minimum number of item-sets per leaf = 2 |
| 1NN | k = 1, Distance metric = Heterogeneous Value Difference Metric (HVDM) |
| 3NN | k = 3, Distance metric = Heterogeneous Value Difference Metric (HVDM) |
| Ripper | Size of growing subset = 66%, Repetitions of the optimization stage = 2 |
| PDFC | C = 100.0, Tolerance Parameter = 0.001, Epsilon = 1.0E-12, Kernel Type = Polynomial, Polynomial Degree = 1, PDRF Type = Gaussian |
Some of the aggregation methods base their decision on the confidence of the predictions of the base classifiers. We obtain the confidence for each classifier as follows:
- SVM: Logistic model parameter is set to True in order to use the probability estimates from the SVM (J.C. Platt. Fast training of support vector machines using sequential minimal optimization. MIT Press, Cambridge, MA, USA, 1999) as the confidence for the predicted class.
- C4.5: The confidence is obtained from the accuracy of the leaf which makes the prediction. The accuracy of a leaf is the percentage of correctly classified train examples from the total number of covered train instances.
- kNN: We use the following equation to estimate the confidence of kNN:
$$Confidence=\frac{\sum_{l=1}^{k}\frac{e_l}{d_l}}{\sum_{l=1}^{k}\frac{1}{d_l}}$$
where dl is the distance between the input pattern and the lth neighbour and el = 1 if neighbour l is from the class and 0 otherwise (a small code sketch of this estimate is given after the list). Note that when k > 1, the probability estimate depends on the distance to the neighbours of each class; hence the estimate is not restricted to a few values (if only the numbers of neighbours from each class were considered, a coarse multi-valued result would be obtained, which is not desirable).
- Ripper: The confidence is taken from the accuracy of the rule used in the prediction, that is, the percentage of correctly classified train instances out of the total number of train instances covered by that rule.
- PDFC: The confidence depends on the prediction of the classifier; a confidence of 1 is given to the predicted class.
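A minimal sketch of the kNN confidence estimate defined above (our own illustration; the distances and labels of the k nearest neighbours are assumed to be already computed, and a small `eps` is added to avoid divisions by zero, which is not part of the original formula):

```python
import numpy as np

def knn_confidence(distances, neighbour_labels, target_class, eps=1e-12):
    """Distance-weighted confidence for target_class among the k nearest neighbours.

    distances        -- array with the distances d_l to the k neighbours
    neighbour_labels -- array with the class label of each neighbour
    """
    d = np.asarray(distances, dtype=float) + eps         # avoid division by zero on exact matches
    e = (np.asarray(neighbour_labels) == target_class)   # e_l = 1 if neighbour l is from the class
    return float(np.sum(e / d) / np.sum(1.0 / d))
```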
For the models whose confidence degree takes values in {0,1}, such as 1NN and PDFC, note that some aggregation methods are equivalent, e.g. VOTE = WV = LVPC. In these cases, we consider VOTE as their representative.
Regarding the aggregation strategies, the Binary Tree of Classifiers has a parameter that defines the reasonability of reassignment. In this study we set this parameter to δ = 5%, which is the value recommended by the authors in B. Fei and J. Liu. Binary tree of SVM: a new fast multiclass training and classification algorithm. IEEE Transactions on Neural Networks, 17(3):696–704, 2006.
Analysis of Ensemble Methods based on One-vs-One and One-vs-All Schemes
We present the results of the experimental study divided into three parts:
- The analysis of the aggregations for the OVO strategy.
- The analysis of the aggregations for the OVA strategy.
- Finally, a comparative study of the differences between the best aggregations from the OVO and OVA strategies and the baseline classifiers.
All the results obtained in these subsections and reported in the paper can be downloaded as an Excel document by clicking on the corresponding link. We provide both accuracy rate and Cohen's kappa results for each base classifier and aggregation method.
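The "Avg. Rank" columns reported below give the average ranking of each method across the data-sets (rank 1 for the best method on a given data-set, rank 2 for the second best, and so on). A minimal sketch of how such average rankings can be computed (our own illustration) is:

```python
import numpy as np
from scipy.stats import rankdata

def average_ranks(results):
    """results: (n_datasets, n_methods) array of test accuracies (or kappa values).
    Returns the average rank of each method, where rank 1 is the best."""
    ranks = np.array([rankdata(-row) for row in results])  # negate so that larger is better
    return ranks.mean(axis=0)
```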
From the results of the first two parts, we derive the best OVO and OVA decompositions for each base classifier and then compare them together with the original base classifier. These results are summarised in the following table (bold values indicate the best test results for each base classifier and metric):
| Base Classifier | Aggregation | Accuracy (Test) | Accuracy (Avg. Rank) | Kappa (Test) | Kappa (Avg. Rank) |
|---|---|---|---|---|---|
| SVM | NESTovo | **81.14 ± 3.32** | 1.37 (1) | **.7243 ± .0559** | 1.32 (1) |
| SVM | DOOova | 78.75 ± 3.15 | 1.63 (2) | .6868 ± .0565 | 1.68 (2) |
| C4.5 | C4.5 | 80.51 ± 3.85 | 2.05 (2) | .7203 ± .0554 | 2.14 (2) |
| C4.5 | WVovo | **81.59 ± 3.28** | 1.42 (1) | **.7348 ± .0485** | 1.21 (1) |
| C4.5 | DOOova | 78.78 ± 4.36 | 2.53 (3) | .6938 ± .0649 | 2.64 (3) |
| 1NN | 1NN | 81.24 ± 2.98 | 1.84 (1) | .7369 ± .0475 | 1.82 (1) |
| 1NN | PEovo | **82.30 ± 3.11** | 2.05 (2) | **.7453 ± .0497** | 2.05 (2) |
| 1NN | DOOova | 81.77 ± 4.45 | 2.11 (3) | .7368 ± .0701 | 2.13 (3) |
| 3NN | 3NN | 81.54 ± 2.65 | 2.24 (3) | .7335 ± .0452 | 2.42 (3) |
| 3NN | NDovo | **83.07 ± 2.93** | 1.87 (1) | **.7524 ± .0500** | 1.84 (2) |
| 3NN | DOOova | 82.76 ± 4.38 | 1.89 (2) | .7473 ± .0710 | 1.74 (1) |
| Ripper | Ripper | 76.52 ± 4.00 | 2.61 (3) | .6799 ± .0554 | 2.42 (3) |
| Ripper | WVovo | **80.54 ± 3.03** | 1.66 (1) | **.7249 ± .0455** | 1.58 (1) |
| Ripper | DOOova | 79.12 ± 4.67 | 1.74 (2) | .7004 ± .0716 | 2 (2) |
| PDFC | PCovo | **84.12 ± 3.05** | 1.26 (1) | **.7670 ± .0529** | 1.26 (1) |
| PDFC | MAXova | 83.59 ± 3.12 | 1.74 (2) | .7556 ± .0589 | 1.74 (2) |