KEEL: A software tool to assess evolutionary algorithms for Data Mining problems (regression, classification, clustering, pattern mining and so on)

Imbalanced data sets are a special case of the classification problem in which the class distribution is not uniform among the classes. Typically, they are composed of two classes: the majority (negative) class and the minority (positive) class.

This type of data set poses a challenging problem for Data Mining, since standard classification algorithms usually assume a balanced training set, which biases them towards the majority class.

Each data file has the following structure:

@relation: Name of the data set
@attribute: Description of an attribute (one for each attribute)
@inputs: List with the names of the input attributes
@output: Name of the output attribute
@data: Starting tag of the data

The rest of the file contains all the examples belonging to the data set, expressed in comma-separated values format.
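The structure described above can be read with a few lines of Python. The following is a minimal sketch, not an official KEEL reader: the function name parse_keel, the SAMPLE text and the exact attribute syntax in it are assumptions made for illustration, and the sketch also accepts "@outputs" as a spelling of the output tag, which is likewise an assumption.

```python
def parse_keel(text):
    """Split KEEL-style text into a header dictionary and a list of data rows."""
    header = {"relation": None, "attributes": [], "inputs": [], "output": None}
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if in_data:
            # After @data every line is one example in comma-separated format.
            rows.append([value.strip() for value in line.split(",")])
            continue
        tag, _, rest = line.partition(" ")
        tag, rest = tag.lower(), rest.strip()
        if tag == "@relation":
            header["relation"] = rest
        elif tag == "@attribute":
            # The attribute description is kept as raw text (its syntax is not assumed).
            header["attributes"].append(rest)
        elif tag == "@inputs":
            header["inputs"] = [name.strip() for name in rest.split(",")]
        elif tag in ("@output", "@outputs"):  # second spelling is an assumption
            header["output"] = rest
        elif tag == "@data":
            in_data = True
    return header, rows


# Hypothetical file content, invented for this example.
SAMPLE = """@relation toy
@attribute a real [0.0, 1.0]
@attribute b real [0.0, 1.0]
@attribute class {positive, negative}
@inputs a, b
@output class
@data
0.1, 0.2, negative
0.9, 0.8, positive
"""

header, rows = parse_keel(SAMPLE)
print(header["relation"], header["inputs"], header["output"], len(rows))
```

All values are returned as strings; converting them to numbers according to each attribute description is left to the caller.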

We offer information about experimental studies using these data sets (result files, papers and more) in the Experimental studies with imbalanced data sets section of the repository.

Below you can find all the imbalanced data sets available. For each data set, the table shows its name, its number of examples (instances), its number of attributes (broken down into Real/Integer/Nominal attributes) and its IR (Imbalance Ratio, the ratio between the number of instances of the majority class and of the minority class). The table also shows whether the data set has missing values.
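As an illustration of the IR definition, the ratio can be computed directly from the class labels. This is a minimal sketch; the function name imbalance_ratio and the label values are assumptions, and the 100/50 split is only an example of a 150-instance set whose IR is 2.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (majority class size) / (minority class size) for a two-class problem."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    return max(counts.values()) / min(counts.values())


# Hypothetical labels: 100 negative (majority) and 50 positive (minority) examples.
labels = ["negative"] * 100 + ["positive"] * 50
print(imbalance_ratio(labels))  # 2.0
```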

Each data set can be downloaded from the table in KEEL format (inside a ZIP file). The data set is also available already partitioned by means of a 5-fold cross-validation procedure. Finally, we provide a header file with additional information about each data set and its attributes.
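For readers who want to build comparable partitions themselves, a 5-fold split can be sketched as follows. This is an illustration only: the page does not state how the provided partitions were generated, so the stratification by class (keeping the class proportions in every fold), the random seed and the function names are all assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign example indices to k folds, spreading each class evenly (assumed scheme)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_test_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices); each fold is used once as the test set."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)


# Hypothetical labels for a 150-example data set.
labels = ["negative"] * 100 + ["positive"] * 50
for train, test in train_test_splits(labels):
    print(len(train), len(test))
```

Stratifying matters for imbalanced data because a purely random split can leave a fold with very few minority examples, or none at all.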

By clicking on the column headers, you can sort the table by name (alphabetically), by number of examples or attributes, by IR or by the presence of missing values. Clicking again sorts the rows in reverse order.

Name	#Attributes (R/I/N)	#Examples	IR	Miss Val.
glass1	9 (9/0/0)	214	1.82	No
ecoli-0_vs_1	7 (7/0/0)	220	1.86	No
wisconsin	9 (0/9/0)	683	1.86	No
pima	8 (8/0/0)	768	1.87	No
iris0	4 (4/0/0)	150	2	No
glass0	9 (9/0/0)	214	2.06	No
yeast1	8 (8/0/0)	1484	2.46	No
haberman	3 (0/3/0)	306	2.78	No
vehicle2	18 (0/18/0)	846	2.88	No
vehicle1	18 (0/18/0)	846	2.9	No
vehicle3	18 (0/18/0)	846	2.99	No
glass-0-1-2-3_vs_4-5-6	9 (9/0/0)	214	3.2	No
vehicle0	18 (0/18/0)	846	3.25	No
ecoli1	7 (7/0/0)	336	3.36	No
new-thyroid1	5 (4/1/0)	215	5.14	No
new-thyroid2	5 (4/1/0)	215	5.14	No
ecoli2	7 (7/0/0)	336	5.46	No
segment0	19 (19/0/0)	2308	6.02	No
glass6	9 (9/0/0)	214	6.38	No
yeast3	8 (8/0/0)	1484	8.1	No
ecoli3	7 (7/0/0)	336	8.6	No
page-blocks0	10 (4/6/0)	5472	8.79	No
yeast-2_vs_4	8 (8/0/0)	514	9.08	No
yeast-0-5-6-7-9_vs_4	8 (8/0/0)	528	9.35	No
vowel0	13 (10/3/0)	988	9.98	No
glass-0-1-6_vs_2	9 (9/0/0)	192	10.29	No
glass2	9 (9/0/0)	214	11.59	No
shuttle-c0-vs-c4	9 (0/9/0)	1829	13.87	No
yeast-1_vs_7	7 (7/0/0)	459	14.3	No
glass4	9 (9/0/0)	214	15.47	No
ecoli4	7 (7/0/0)	336	15.8	No
page-blocks-1-3_vs_4	10 (4/6/0)	472	15.86	No
abalone9-18	8 (7/0/1)	731	16.4	No
glass-0-1-6_vs_5	9 (9/0/0)	184	19.44	No
shuttle-c2-vs-c4	9 (0/9/0)	129	20.5	No
yeast-1-4-5-8_vs_7	8 (8/0/0)	693	22.1	No
glass5	9 (9/0/0)	214	22.78	No
yeast-2_vs_8	8 (8/0/0)	482	23.1	No
yeast4	8 (8/0/0)	1484	28.1	No
yeast-1-2-8-9_vs_7	8 (8/0/0)	947	30.57	No
yeast5	8 (8/0/0)	1484	32.73	No
ecoli-0-1-3-7_vs_2-6	7 (7/0/0)	281	39.14	No
yeast6	8 (8/0/0)	1484	41.4	No
abalone19	8 (7/0/1)	4174	129.44	No
All data sets

This subsection contains the previous data sets already preprocessed with several oversampling techniques. For each technique, we provide a ZIP file containing the 5-fold cross-validation partitions of each of the 44 imbalanced data sets on this page. A brief description of each method, with references, is given below:

Type of preprocessing	Data sets
SMOTE
SMOTE+ENN
SMOTE+Tomek Links

SMOTE: The Synthetic Minority Over-sampling Technique (Chawla et al., 2002) is an oversampling technique for the minority class. It works by taking each minority class sample and introducing synthetic examples along the line segments joining it to any or all of its k nearest minority class neighbors.
SMOTE+ENN: This method applies the Edited Nearest Neighbor rule (ENN, Wilson, 1972) as a cleaning method to the data set obtained by applying SMOTE. It was proposed by Batista et al., 2004, where the use of 3 neighbors for ENN is suggested.
SMOTE+Tomek Links: This method applies Tomek Links (Tomek, 1976) as a cleaning method to the data set obtained by applying SMOTE. It was proposed by Batista et al., 2004.
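The interpolation step of SMOTE and the detection of Tomek links can be sketched in plain Python. This is a simplified illustration, not the code used to build the ZIP files: the function names, the Euclidean distance, the random choice of base samples (instead of visiting every minority sample a fixed number of times) and the default k=5 are assumptions. A Tomek link is taken here to be a pair of examples from different classes that are each other's nearest neighbor; how the linked examples are then removed is left out.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority points by interpolating towards near minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))  # simplification: random base sample
        base = minority[i]
        # Indices of the other minority samples, nearest first.
        others = sorted((j for j in range(len(minority)) if j != i),
                        key=lambda j: math.dist(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


def tomek_links(points, labels):
    """Return pairs (i, j), i < j, of mutual nearest neighbors with different labels."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))

    links = []
    for i in range(len(points)):
        j = nearest(i)
        if i < j and labels[i] != labels[j] and nearest(j) == i:
            links.append((i, j))
    return links


# Hypothetical minority points on the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(len(smote(minority, n_synthetic=10, k=2)))
```

Because each synthetic point lies on a segment between two existing minority samples, it always stays inside the region already spanned by the minority class.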

Collecting Data Sets

If you have some example data sets and you would like to share them with the rest of the research community by means of this page, please be so kind as to send your data to the Webmaster Team with the following information:

People responsible for the data (full name, affiliation, e-mail, web page, ...).
Training and test data sets considered, preferably in ASCII format.
A brief description of the application.
References where it is used.
Results obtained by the methods proposed by the authors or used for comparison.
Type of experiment developed.
Any additional useful information.

Collecting Results

If you have applied your methods to some of the problems presented here, we will be glad to show your results on this page. Please be so kind as to send the following information to the Webmaster Team:

Name of the application considered and type of experiment developed.
Results obtained by the methods proposed by the authors or used for comparison.
References where the results are shown.
Any additional useful information.

If you would like to be informed of each update made to this page, or would like to comment on it, please contact the Webmaster Team.