KEEL: A software tool to assess evolutionary algorithms for Data Mining problems (regression, classification, clustering, pattern mining and so on)

Introduction

Imbalanced data sets are a special case of the classification problem in which the class distribution is not uniform among the classes. Typically, they are composed of two classes: the majority (negative) class and the minority (positive) class.

This type of data set poses a new challenge for Data Mining, since standard classification algorithms usually assume a balanced training set, which results in a bias towards the majority class.
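To illustrate this bias, the following minimal sketch evaluates a trivial classifier that always predicts the majority class. The function name and the example counts are hypothetical and are not taken from any data set on this page.

```python
def majority_baseline(n_majority, n_minority):
    """Accuracy and minority-class recall of a classifier that always answers 'majority'."""
    total = n_majority + n_minority
    accuracy = n_majority / total  # every majority example is classified correctly
    minority_recall = 0.0          # no minority example is ever predicted
    return accuracy, minority_recall


# Hypothetical counts: 990 negative (majority) and 10 positive (minority) examples.
accuracy, recall = majority_baseline(990, 10)
print(f"accuracy={accuracy:.2f}, minority recall={recall:.2f}")
```

With these counts the accuracy is 99% although no positive example is recognized, which is why accuracy alone can be misleading on imbalanced data.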

Each data file has the following structure:

@relation: Name of the data set
@attribute: Description of an attribute (one for each attribute)
@inputs: List with the names of the input attributes
@output: Name of the output attribute
@data: Starting tag of the data

The rest of the file contains all the examples belonging to the data set, expressed in comma-separated values format.
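As a rough illustration of this structure, the sketch below reads the header tags and the comma-separated examples into a dictionary. It is not part of the KEEL tool: the function name `parse_keel`, the sample content and the attribute descriptions are hypothetical, values are kept as strings, and accepting the plural spelling `@outputs` alongside `@output` is an assumption.

```python
def parse_keel(text):
    """Split KEEL-style text into header fields and data rows (values kept as strings)."""
    dataset = {"relation": None, "attributes": [], "inputs": [], "outputs": [], "rows": []}
    in_data = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if in_data:
            # After @data, every non-empty line is one comma-separated example.
            dataset["rows"].append([value.strip() for value in line.split(",")])
            continue
        tag, _, rest = line.partition(" ")
        tag, rest = tag.lower(), rest.strip()
        if tag == "@relation":
            dataset["relation"] = rest
        elif tag == "@attribute":
            # Assumes the attribute name and its description are separated by a space.
            name, _, description = rest.partition(" ")
            dataset["attributes"].append((name, description.strip()))
        elif tag in ("@inputs", "@output", "@outputs"):
            names = [name.strip() for name in rest.split(",") if name.strip()]
            dataset["inputs" if tag == "@inputs" else "outputs"] = names
        elif tag == "@data":
            in_data = True
    return dataset


# Hypothetical file content, written only to exercise the parser.
SAMPLE = """\
@relation toy
@attribute width real [0.0, 1.0]
@attribute count integer [0, 10]
@attribute class {positive, negative}
@inputs width, count
@output class
@data
0.1, 3, negative
0.9, 7, positive
"""

toy = parse_keel(SAMPLE)
print(toy["relation"], toy["inputs"], toy["outputs"], len(toy["rows"]))
```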

We offer information about experimental studies using these data sets (result files, papers and more) in the Experimental studies with imbalanced data sets section of the repository.

Imbalanced data sets

All the imbalanced data sets presented on this web page are partitioned using 5-fold stratified cross-validation. Five folds are used so that each test partition contains a sufficient number of minority class examples. In this way, the test partition examples are more representative of the underlying knowledge.
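The idea of a stratified partition can be sketched as follows. This is only an illustration, not the procedure used to build the partitions offered here: the function name `stratified_folds`, the seed and the round-robin assignment are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign example indices to k folds so that each class is spread evenly over the folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the examples of this class to the folds in turn.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical labels: 90 negative and 10 positive examples.
labels = ["negative"] * 90 + ["positive"] * 10
for number, test_indices in enumerate(stratified_folds(labels), start=1):
    positives = sum(1 for i in test_indices if labels[i] == "positive")
    print(f"fold {number}: {len(test_indices)} examples, {positives} positive")
```

Each fold serves once as the test partition, with the remaining four folds used for training. With only 10 positive examples, a larger number of folds would leave some test partitions with one positive example or none.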

We divide our Imbalanced data sets into the following sections:

-    Imbalance ratio between 1.5 and 9
-    Imbalance ratio higher than 9 - Part I
-    Imbalance ratio higher than 9 - Part II
-    Imbalance ratio higher than 9 - Part III
-    Multiple class imbalanced problems
-    Noisy and Borderline Examples
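Most of these sections are grouped by imbalance ratio (IR). This page does not state the formula; the sketch below assumes the common definition, the number of examples of the largest class divided by the number of examples of the smallest class. The function name `imbalance_ratio` is hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class (assumed definition)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are required")
    return max(counts.values()) / min(counts.values())


# Hypothetical two-class data set with 100 negative and 50 positive examples.
labels = ["negative"] * 100 + ["positive"] * 50
print(imbalance_ratio(labels))
```

Under this assumed definition, a 100/50 split gives an IR of 2, which would be consistent with a data set of 150 examples listed with IR 2.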

Imbalance ratio between 1.5 and 9

From Fernández, A., García, S., del Jesus, M. J., and Herrera, F. 2008. A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets. Fuzzy Sets and Systems 159, 18 (Sep. 2008), 2378-2398.

Below you can find all the imbalanced data sets available with an imbalance ratio between 1.5 and 9. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples and its imbalance ratio (IR).

Each data set in the tables on this page can be downloaded in KEEL format (inside a ZIP file). Additionally, it is possible to obtain the data set already partitioned by means of a 5-fold cross-validation procedure.

By clicking on the column headers of a table, you can sort it by name (alphabetically), number of attributes, number of examples or IR. Clicking again sorts the rows in reverse order.

Name	#Attributes (R/I/N)	#Examples	IR
glass1	9 (9/0/0)	214	1.82
ecoli-0_vs_1	7 (7/0/0)	220	1.86
wisconsin	9 (0/9/0)	683	1.86
pima	8 (8/0/0)	768	1.87
iris0	4 (4/0/0)	150	2
glass0	9 (9/0/0)	214	2.06
yeast1	8 (8/0/0)	1484	2.46
haberman	3 (0/3/0)	306	2.78
vehicle2	18 (0/18/0)	846	2.88
vehicle1	18 (0/18/0)	846	2.9
vehicle3	18 (0/18/0)	846	2.99
glass-0-1-2-3_vs_4-5-6	9 (9/0/0)	214	3.2
vehicle0	18 (0/18/0)	846	3.25
ecoli1	7 (7/0/0)	336	3.36
new-thyroid1	5 (4/1/0)	215	5.14
new-thyroid2	5 (4/1/0)	215	5.14
ecoli2	7 (7/0/0)	336	5.46
segment0	19 (19/0/0)	2308	6.02
glass6	9 (9/0/0)	214	6.38
yeast3	8 (8/0/0)	1484	8.1
ecoli3	7 (7/0/0)	336	8.6
page-blocks0	10 (4/6/0)	5472	8.79
All data sets

Imbalance ratio higher than 9 - Part I

From Fernández, A., del Jesus, M. J., and Herrera, F. 2009. Hierarchical fuzzy rule based classification systems with genetic rule selection for imbalanced data-sets. Int. J. Approx. Reasoning 50, 3 (Mar. 2009), 561-577.

Below you can find the first block of the imbalanced data sets available with an imbalance ratio higher than 9. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples and its imbalance ratio (IR).

Name	#Attributes (R/I/N)	#Examples	IR
yeast-2_vs_4	8 (8/0/0)	514	9.08
yeast-0-5-6-7-9_vs_4	8 (8/0/0)	528	9.35
vowel0	13 (10/3/0)	988	9.98
glass-0-1-6_vs_2	9 (9/0/0)	192	10.29
glass2	9 (9/0/0)	214	11.59
shuttle-c0-vs-c4	9 (0/9/0)	1829	13.87
yeast-1_vs_7	7 (7/0/0)	459	14.3
glass4	9 (9/0/0)	214	15.47
ecoli4	7 (7/0/0)	336	15.8
page-blocks-1-3_vs_4	10 (4/6/0)	472	15.86
abalone9-18	8 (7/0/1)	731	16.4
glass-0-1-6_vs_5	9 (9/0/0)	184	19.44
shuttle-c2-vs-c4	9 (0/9/0)	129	20.5
yeast-1-4-5-8_vs_7	8 (8/0/0)	693	22.1
glass5	9 (9/0/0)	214	22.78
yeast-2_vs_8	8 (8/0/0)	482	23.1
yeast4	8 (8/0/0)	1484	28.1
yeast-1-2-8-9_vs_7	8 (8/0/0)	947	30.57
yeast5	8 (8/0/0)	1484	32.73
ecoli-0-1-3-7_vs_2-6	7 (7/0/0)	281	39.14
yeast6	8 (8/0/0)	1484	41.4
abalone19	8 (7/0/1)	4174	129.44
All data sets

Imbalance ratio higher than 9 - Part II

Below you can find the second block of the imbalanced data sets available with an imbalance ratio higher than 9. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples and its imbalance ratio (IR).

Name	#Attributes (R/I/N)	#Examples	IR
ecoli-0-3-4_vs_5	7 (7/0/0)	200	9
ecoli-0-6-7_vs_3-5	7 (7/0/0)	222	9.09
ecoli-0-2-3-4_vs_5	7 (7/0/0)	202	9.1
glass-0-1-5_vs_2	9 (9/0/0)	172	9.12
yeast-0-3-5-9_vs_7-8	8 (8/0/0)	506	9.12
yeast-0-2-5-7-9_vs_3-6-8	8 (8/0/0)	1004	9.14
yeast-0-2-5-6_vs_3-7-8-9	8 (8/0/0)	1004	9.14
ecoli-0-4-6_vs_5	6 (6/0/0)	203	9.15
ecoli-0-1_vs_2-3-5	7 (7/0/0)	244	9.17
ecoli-0-2-6-7_vs_3-5	7 (7/0/0)	224	9.18
glass-0-4_vs_5	9 (9/0/0)	92	9.22
ecoli-0-3-4-6_vs_5	7 (7/0/0)	205	9.25
ecoli-0-3-4-7_vs_5-6	7 (7/0/0)	257	9.28
ecoli-0-6-7_vs_5	6 (6/0/0)	220	10
ecoli-0-1-4-7_vs_2-3-5-6	7 (7/0/0)	336	10.59
led7digit-0-2-4-5-6-7-8-9_vs_1	7 (7/0/0)	443	10.97
glass-0-6_vs_5	9 (9/0/0)	108	11
ecoli-0-1_vs_5	6 (6/0/0)	240	11
glass-0-1-4-6_vs_2	9 (9/0/0)	205	11.06
ecoli-0-1-4-7_vs_5-6	6 (6/0/0)	332	12.28
cleveland-0_vs_4	13 (13/0/0)	177	12.62
ecoli-0-1-4-6_vs_5	6 (6/0/0)	280	13
All data sets

Imbalance ratio higher than 9 - Part III

Below you can find the third block of the imbalanced data sets available with an imbalance ratio higher than 9. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples and its imbalance ratio (IR).

Name	#Attributes (R/I/N)	#Examples	IR
dermatology-6	34 (0/34/0)	358	16.9
zoo-3	16 (0/0/16)	101	19.2
shuttle-6_vs_2-3	9 (0/9/0)	230	22
lymphography-normal-fibrosis	18 (0/3/15)	148	23.67
flare-F	11 (0/0/11)	1066	23.79
car-good	6 (0/0/6)	1728	24.04
car-vgood	6 (0/0/6)	1728	25.58
kr-vs-k-zero-one_vs_draw	6 (0/0/6)	2901	26.63
kr-vs-k-one_vs_fifteen	6 (0/0/6)	2244	27.77
winequality-red-4	11 (11/0/0)	1599	29.17
poker-9_vs_7	10 (0/10/0)	244	29.5
kddcup-guess_passwd_vs_satan	41 (26/0/15)	1642	29.98
abalone-3_vs_11	8 (7/0/1)	502	32.47
winequality-white-9_vs_4	11 (11/0/0)	168	32.6
kr-vs-k-three_vs_eleven	6 (0/0/6)	2935	35.23
winequality-red-8_vs_6	11 (11/0/0)	656	35.44
abalone-17_vs_7-8-9-10	8 (7/0/1)	2338	39.31
abalone-21_vs_8	8 (7/0/1)	581	40.5
winequality-white-3_vs_7	11 (11/0/0)	900	44
winequality-red-8_vs_6-7	11 (11/0/0)	855	46.5
kddcup-land_vs_portsweep	41 (26/0/15)	1061	49.52
abalone-19_vs_10-11-12-13	8 (7/0/1)	1622	49.69
kr-vs-k-zero_vs_eight	6 (0/0/6)	1460	53.07
winequality-white-3-9_vs_5	11 (11/0/0)	1482	58.28
poker-8-9_vs_6	10 (0/10/0)	1485	58.4
shuttle-2_vs_5	9 (0/9/0)	3316	66.67
winequality-red-3_vs_5	11 (11/0/0)	691	68.1
abalone-20_vs_8-9-10	8 (7/0/1)	1916	72.69
kddcup-buffer_overflow_vs_back	41 (26/0/15)	2233	73.43
kddcup-land_vs_satan	41 (26/0/15)	1610	75.67
kr-vs-k-zero_vs_fifteen	6 (0/0/6)	2193	80.22
poker-8-9_vs_5	10 (0/10/0)	2075	82
poker-8_vs_6	10 (0/10/0)	1477	85.88
kddcup-rootkit-imap_vs_back	41 (26/0/15)	2225	100.14
All data sets

Multiple class imbalanced problems

Below you can find all the multi-class imbalanced data sets available. For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples and its imbalance ratio (IR).

Name	#Attributes (R/I/N)	#Examples	IR
wine	13 (13/0/0)	178	1.5
hayes-roth	4 (0/4/0)	132	1.7
contraceptive	9 (6/0/3)	1473	1.89
penbased	16 (16/0/0)	1100	1.95
new-thyroid	5 (4/1/0)	215	4.84
dermatology	34 (0/34/0)	366	5.55
balance	4 (4/0/0)	625	5.88
glass	9 (9/0/0)	214	8.44
autos	25 (15/0/10)	159	16
yeast	8 (8/0/0)	1484	23.15
thyroid	21 (6/0/15)	720	36.94
lymphography	18 (3/0/15)	148	40.5
ecoli	7 (7/0/0)	336	71.5
pageblocks	10 (10/0/0)	548	164
shuttle	9 (0/9/0)	2175	853
All data sets

Noisy and Borderline Examples

From K. Napierala, J. Stefanowski, S. Wilk. Learning from Imbalanced Data in Presence of Noisy and Borderline Examples. 7th International Conference on Rough Sets and Current Trends in Computing (RSCTC2010). LNCS 6086, Springer 2010, Warsaw (Poland, 2010) 158-167.

Below you can find several synthetic imbalanced data sets used in the above paper, whose examples are divided by the authors into 3 categories: safe, borderline and noisy examples.

   - Borderline examples are located in the area surrounding class boundaries, where the minority and majority classes overlap.
   - Safe examples are placed in relatively homogeneous areas with respect to the class label.
   - Noisy examples are individuals from one class occurring in safe areas of the other class.

For each data set, the table shows its name, its number of attributes (Real/Integer/Nominal valued), its number of examples and its imbalance ratio (IR).

Name	#Attributes (R/I/N)	#Examples	IR
paw02a-600-5-70-BI	2 (2/0/0)	600	5
paw02a-600-5-60-BI	2 (2/0/0)	600	5
paw02a-600-5-50-BI	2 (2/0/0)	600	5
paw02a-600-5-30-BI	2 (2/0/0)	600	5
paw02a-600-5-0-BI	2 (2/0/0)	600	5
04clover5z-600-5-70-BI	2 (2/0/0)	600	5
04clover5z-600-5-60-BI	2 (2/0/0)	600	5
04clover5z-600-5-50-BI	2 (2/0/0)	600	5
04clover5z-600-5-30-BI	2 (2/0/0)	600	5
04clover5z-600-5-0-BI	2 (2/0/0)	600	5
03subcl5-600-5-70-BI	2 (2/0/0)	600	5
03subcl5-600-5-60-BI	2 (2/0/0)	600	5
03subcl5-600-5-50-BI	2 (2/0/0)	600	5
03subcl5-600-5-0-BI	2 (2/0/0)	600	5
03subcl5-600-5-30-BI	2 (2/0/0)	600	5
paw02a-800-7-60-BI	2 (2/0/0)	800	7
paw02a-800-7-50-BI	2 (2/0/0)	800	7
paw02a-800-7-30-BI	2 (2/0/0)	800	7
paw02a-800-7-0-BI	2 (2/0/0)	800	7
04clover5z-800-7-70-BI	2 (2/0/0)	800	7
04clover5z-800-7-60-BI	2 (2/0/0)	800	7
04clover5z-800-7-50-BI	2 (2/0/0)	800	7
04clover5z-800-7-30-BI	2 (2/0/0)	800	7
04clover5z-800-7-0-BI	2 (2/0/0)	800	7
03subcl5-800-7-70-BI	2 (2/0/0)	800	7
03subcl5-800-7-60-BI	2 (2/0/0)	800	7
03subcl5-800-7-50-BI	2 (2/0/0)	800	7
03subcl5-800-7-30-BI	2 (2/0/0)	800	7
03subcl5-800-7-0-BI	2 (2/0/0)	800	7
paw02a-800-7-70-BI	2 (2/0/0)	800	7
All data sets

Preprocessed data sets

This subsection contains a collection of some of the previous data sets already preprocessed with several oversampling techniques. For each technique, a ZIP file containing the 5-fold cross-validation partitions of each of the data sets on this page is provided. A brief description of each method, with references, can be found below:

Imbalance ratio between 1.5 and 9

Type of preprocessing	Data sets
SMOTE
SMOTE+ENN
SMOTE+Tomek Links

Imbalance ratio higher than 9 - Part I

Type of preprocessing	Data sets
SMOTE
SMOTE+ENN
SMOTE+Tomek Links
SMOTE-RSB*

Imbalance ratio higher than 9 - Part II

Type of preprocessing	Data sets
SMOTE
SMOTE+ENN
SMOTE+Tomek Links
Borderline 1
Borderline 2
SafeLevels
SMOTE-RSB*

SMOTE: The Synthetic Minority Over-sampling Technique (Chawla et al, 2002) is an oversampling technique of the minority class. It works by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbours.
SMOTE+ENN: This method consists of applying the Edited Nearest Neighbor rule (ENN, Wilson, 1972) as a cleaning method over the data set obtained by applying SMOTE. It was proposed by Batista et al, 2004, where the use of 3 neighbors for ENN is suggested.
SMOTE+Tomek Links: This method consists of applying Tomek Links (Tomek, 1976) as a cleaning method over the data set obtained by applying SMOTE. It was proposed by Batista et al, 2004.
Borderline: These methods only oversample or strengthen the borderline minority examples (Han et al, 2005). First, the borderline minority examples are identified; then, synthetic examples are generated from them and added to the original training set. For every example pi of the minority class P, the method calculates its m nearest neighbors from the whole training set. The number of majority examples among the m nearest neighbors is n. If all the m nearest neighbors are majority examples (n = m), pi is considered to be noise and is not used in the following steps. If m/2 <= n < m, namely the number of pi's majority nearest neighbors is larger than the number of its minority ones, pi is considered to be easily misclassified and is put into a set called DANGER. If 0 <= n < m/2, pi is safe and does not need to participate in the following steps. The examples in the DANGER set are the borderline data of the minority class P. For each example in DANGER, its k nearest neighbors from P are calculated and synthetic examples are generated as in SMOTE.
SafeLevels: This method (Bunkhumpornpat et al, 2009) computes for each positive instance its safe level before generating synthetic instances. Each synthetic instance is positioned closer to the largest safe level, so all synthetic instances are generated only in safe regions.
SMOTE-RSB*: This method (Ramentol et al, 2011) first applies the SMOTE algorithm, and then, it only selects the minority synthetic examples that belong to the lower approximation using Rough Set Theory (Pawlak, 1982). This process is repeated until the training set is balanced.
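
The interpolation step of SMOTE and the DANGER rule of the Borderline methods can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used to build the preprocessed files: the function names, the Euclidean distance, the parameter defaults and the toy points are hypothetical, and numeric attributes only are assumed.

```python
import math
import random


def nearest(index, points, k):
    """Indices of the k points closest to points[index], excluding the point itself."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: math.dist(points[index], points[j]))
    return others[:k]


def smote(minority, per_sample=1, k=5, seed=0):
    """Create per_sample synthetic points for each minority point (needs >= 2 minority points)."""
    rng = random.Random(seed)
    synthetic = []
    for i, base in enumerate(minority):
        neighbours = nearest(i, minority, k)
        for _ in range(per_sample):
            other = minority[rng.choice(neighbours)]
            gap = rng.random()  # position along the segment joining base and other
            synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def danger_set(minority, majority, m=5):
    """Minority points with m/2 <= n < m majority points among their m nearest neighbours."""
    points = list(minority) + list(majority)
    danger = []
    for i, point in enumerate(minority):
        # Indices >= len(minority) refer to majority points.
        n = sum(1 for j in nearest(i, points, m) if j >= len(minority))
        if m / 2 <= n < m:
            danger.append(point)
    return danger


# Hypothetical two-dimensional toy data.
minority = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2)]
majority = [(5.1, 5.0), (4.9, 5.0), (9.0, 9.0)]
print(danger_set(minority, majority, m=3))
print(smote(minority, per_sample=1, k=2))
```

In a Borderline-style variant, the interpolation would be applied only to the points returned by `danger_set`, with their neighbours still taken from the whole minority class.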

Contact information

Collecting Data Sets

If you have some example data sets and you would like to share them with the rest of the research community by means of this page, please be so kind as to send your data to the Webmaster Team with the following information:

-    People answerable for the data (full name, affiliation, e-mail, web page, ...).
-    Training and test data sets considered, preferably in ASCII format.
-    A brief description of the application.
-    References where it is used.
-    Results obtained by the methods proposed by the authors or used for comparison.
-    Type of experiment developed.
-    Any additional useful information.

Collecting Results

If you have applied your methods to some of the problems presented here, we will be glad to show your results on this page. Please be so kind as to send the following information to the Webmaster Team:

-    Name of the application considered and type of experiment developed.
-    Results obtained by the methods proposed by the authors or used for comparison.
-    References where the results are shown.
-    Any additional useful information.

If you are interested in being informed of each update made to this page, or you would like to comment on it, please contact the Webmaster Team.