On the use of MapReduce for Imbalanced Big Data using Random Forest

This Website contains complementary material to the paper:

S. Río, V. López, J.M. Benítez and F. Herrera, On the use of MapReduce for Imbalanced Big Data using Random Forest. Information Sciences 285 (2014) 112-137, doi: 10.1016/j.ins.2014.03.043

Summary:

  1. Abstract
  2. Experimental Framework
    1. Data-sets partitions employed in the paper
    2. Algorithms and parameters
  3. Results obtained
    1. Analysis of the sequential versions when the size of the data available is increased
    2. Analysis of the effectiveness in classification
    3. Analysis of the runtime


Abstract

In this age, big data applications are increasingly becoming the main focus of attention because of the enormous growth in data generation and storage that has taken place in recent years. This situation becomes a challenge when huge amounts of data are processed to extract knowledge because the data mining techniques are not adapted to the new space and time requirements. Furthermore, real-world data applications usually present a class distribution where the samples belonging to one class, which is precisely the main interest, are hugely outnumbered by the samples of the other classes. This circumstance, known as the class imbalance problem, complicates the learning process because standard learning techniques do not correctly address this situation.

In this work, we analyze the performance of several techniques used to deal with imbalanced datasets in the big data scenario using the Random Forest classifier. Specifically, oversampling, undersampling and cost-sensitive learning have been adapted to big data using MapReduce so that these techniques are able to manage datasets as large as needed providing the necessary support to correctly identify the underrepresented class. The Random Forest classifier provides a solid basis for the comparison because of its performance, robustness and versatility.

An experimental study is carried out to evaluate the performance of the diverse algorithms considered. The results show that there is no approach to imbalanced big data classification that outperforms the others for all the data considered when using Random Forest. Moreover, even for the same type of problem, the best performing method depends on the number of mappers selected to run the experiments. In most cases, increasing the number of splits improves the running times; however, this speed-up is obtained at the expense of a slight drop in the achieved accuracy. This decrease in performance is related to the lack of density problem, which is evaluated in this work from the imbalanced data point of view, as this issue degrades the performance of classifiers in the imbalanced scenario more severely than in standard learning.

Experimental framework

Data-sets partitions employed in the paper

In order to analyze the quality of the solutions provided by the different algorithms for imbalanced big data that we are comparing, we have selected three big data problems that are available in the UCI Machine Learning Repository: the KDD Cup 1999 dataset, the Record Linkage Comparison Patterns (RLCP) dataset and the Poker Hand dataset. Where missing values were present, the attributes that contained them were discarded.

As the selected problems contain multiple classes, we have derived several cases of study from them to address each issue separately. Specifically, we have created new binary datasets in which the classes with a notably larger number of samples with respect to the rest act as majority classes, while the less represented classes act as minority classes. The data are summarized in Table 1, which includes the number of examples (#Ex.), the number of attributes (#Atts.), the name of each class (majority and minority), the number of instances of each class, the class distribution and the IR. For all the experiments we use a 5-fold stratified cross-validation partitioning scheme.

Table 1: Summary of imbalanced datasets

Dataset #Ex. #Atts. Class (maj;min) #Class(maj; min) %Class(maj; min) IR
kddcup_DOS_versus_normal 4856151 41 (DOS; normal) (3883370; 972781) (79.968; 20.032) 3.992
kddcup_DOS_versus_PRB 3924472 41 (DOS; PRB) (3883370; 41102) (98.953; 1.047) 94.481
kddcup_DOS_versus_R2L 3884496 41 (DOS; R2L) (3883370; 1126) (99.971; 0.029) 3448.819
kddcup_DOS_versus_U2R 3883422 41 (DOS; U2R) (3883370; 52) (99.999; 0.001) 74680.192
kddcup_normal_versus_PRB 1013883 41 (normal; PRB) (972781; 41102) (95.946; 4.054) 23.667
kddcup_normal_versus_R2L 973907 41 (normal; R2L) (972781; 1126) (99.884; 0.116) 863.926
kddcup_normal_versus_U2R 972833 41 (normal; U2R) (972781; 52) (99.995; 0.005) 18707.327
poker_0_vs_2 562530 10 (0; 2) (513702; 48828) (91.32; 8.68) 10.521
poker_0_vs_3 535336 10 (0; 3) (513702; 21634) (95.959; 4.041) 23.745
poker_0_vs_4 517680 10 (0; 4) (513702; 3978) (99.232; 0.768) 129.136
poker_0_vs_5 515752 10 (0; 5) (513702; 2050) (99.603; 0.397) 250.586
poker_0_vs_6 515162 10 (0; 6) (513702; 1460) (99.717; 0.283) 351.851
poker_1_vs_2 481925 10 (1; 2) (433097; 48828) (89.868; 10.132) 8.870
poker_1_vs_3 454731 10 (1; 3) (433097; 21634) (95.242; 4.758) 20.019
poker_1_vs_4 437075 10 (1; 4) (433097; 3978) (99.09; 0.91) 108.873
poker_1_vs_5 435147 10 (1; 5) (433097; 2050) (99.529; 0.471) 211.267
poker_1_vs_6 434557 10 (1; 6) (433097; 1460) (99.664; 0.336) 296.642
RLCP 5749132 2 (FALSE; TRUE) (5728201; 20931) (99.636; 0.364) 273.671
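
To make the construction of Table 1 concrete, the following minimal sketch (in Python; the helper names and the generic list of (example, label) pairs are illustrative, not the code used in the paper) derives a binary case of study from a multi-class problem and computes the #Ex., %Class and IR columns.

    # Sketch only: derive a binary case of study (e.g. "kddcup_DOS_versus_normal")
    # from a multi-class dataset and compute the statistics reported in Table 1.

    def derive_case_of_study(examples, majority_class, minority_class):
        """Keep only the examples belonging to the two selected classes."""
        return [(x, y) for (x, y) in examples if y in (majority_class, minority_class)]

    def summarize(case, majority_class, minority_class):
        """Compute #Ex., %Class(maj; min) and the imbalance ratio (IR)."""
        n_maj = sum(1 for _, y in case if y == majority_class)
        n_min = sum(1 for _, y in case if y == minority_class)
        n = n_maj + n_min
        return {"#Ex.": n,
                "%Class(maj; min)": (100.0 * n_maj / n, 100.0 * n_min / n),
                "IR": n_maj / n_min}

    # With the counts of kddcup_DOS_versus_normal (3883370; 972781) this gives
    # %Class = (79.968; 20.032) and IR = 3883370 / 972781 = 3.992, as in Table 1.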
 

Algorithms and parameters

In our comparisons we test all the algorithms that have been described in this work. When the sequential versions are examined, we check the behavior of the original Random Forest (RF) and its adaptation to the imbalanced scenario (RF-CS). The basic RF is also combined with the sequential preprocessing approaches, namely "Random OverSampling" (ROS), "Random UnderSampling" (RUS) and "Synthetic Minority Oversampling TEchnique" (SMOTE), to deal with imbalanced datasets.

Turning to the algorithms that are able to deal with imbalanced big data, we test the behavior of the Random Forest Partial Implementation (RF-BigData) and the cost-sensitive modification we have added to it to properly deal with imbalanced data (RF-BigDataCS). The RF-BigData algorithm is also combined with the ROS, RUS and SMOTE algorithms, which have been adapted to big data using MapReduce. These algorithms are:

  • RF-BigData: The Random Forest MapReduce algorithm that allows building a Random Forest from large datasets. In this MapReduce design, each map task is responsible for building a subset of the forest using only the data available in its partition (a simplified sketch is given after this list).
  • RF-BigDataCS: A cost-sensitive approach for Random Forest MapReduce algorithm (RF-BigData) to deal with Imbalanced Big Data.
  • ROS+RF-BigData: The Random Forest MapReduce algorithm (RF-BigData) is combined with a MapReduce version of the ROS algorithm to deal with big data.
  • RUS+RF-BigData: The Random Forest MapReduce algorithm (RF-BigData) is combined with a MapReduce version of the RUS algorithm to deal with big data.
  • SMOTE+RF-BigData: The Random Forest MapReduce algorithm (RF-BigData) is combined with a MapReduce version of the SMOTE algorithm to deal with big data.
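
The following simplified, single-machine sketch (in Python, with scikit-learn decision trees standing in for the tree-building step; the actual RF-BigData runs as a MapReduce job) illustrates the decomposition described in the list above: each map task builds its share of the trees using only its data partition, and the final forest is the union of the trees produced by all the map tasks.

    # Sketch of the RF-BigData map/collect decomposition (not the paper's code).
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def map_task(X_part, y_part, trees_for_this_mapper, num_features, rng):
        """'Map' phase: build a subset of the forest using ONLY this partition."""
        trees, n = [], len(y_part)
        for _ in range(trees_for_this_mapper):
            idx = rng.integers(0, n, size=n)          # bootstrap sample of the partition
            tree = DecisionTreeClassifier(max_features=num_features)
            trees.append(tree.fit(X_part[idx], y_part[idx]))
        return trees

    def rf_bigdata(X, y, num_mappers, num_trees, num_features, seed=0):
        """Driver: split the data into num_mappers partitions, run the 'map' phase
        on each one, and gather every mapper's trees into a single forest."""
        rng = np.random.default_rng(seed)
        partitions = np.array_split(np.arange(len(y)), num_mappers)
        trees_per_mapper = num_trees // num_mappers   # remainder ignored in this sketch
        forest = []
        for part in partitions:                       # in MapReduce these run in parallel
            forest.extend(map_task(X[part], y[part], trees_per_mapper, num_features, rng))
        return forest

    def predict(forest, X):
        """Majority vote over all trees (assumes integer class labels)."""
        votes = np.array([tree.predict(X) for tree in forest])
        return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])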

The RF variants are run using the following parameters: maxDepth = unlimited, numFeatures = log2(Natts)+1 and numTrees = 100. These parameters control how the forest is built: numTrees indicates how many trees compose the forest, maxDepth limits the depth of each generated tree (unlimited in our experiments) and numFeatures sets how many attributes are randomly selected as candidates when building the nodes of each tree.

For the RF-BigData variants we also include the parameter numMappers = 8/16/32/64, which relates to the MapReduce procedure included to deal with big data: it specifies the number of subsets into which the original data are split and provided to the map tasks. Accordingly, the number of trees assigned to each map task also changes with the number of partitions of the data, as the final forest is composed of the trees generated by all the map tasks.
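
As a hypothetical worked example of how these parameters translate to the datasets in this study (assuming the usual integer rounding of log2(Natts)+1; the exact rounding and the handling of any remaining trees are implementation details):

    from math import floor, log2

    def num_features(n_atts):
        # numFeatures = log2(Natts) + 1, rounded down to a whole number of attributes
        return floor(log2(n_atts)) + 1

    print(num_features(41))   # kddcup cases of study (41 attributes) -> 6
    print(num_features(10))   # poker cases of study (10 attributes)  -> 4

    # Each map task builds roughly numTrees / numMappers trees, e.g. with
    # numTrees = 100: 8 mappers -> about 12-13 trees each, 64 mappers -> about 1-2 each.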

The preprocessing algorithms (ROS, RUS and SMOTE) modify the dataset until the class distribution becomes completely balanced. The SMOTE algorithm generates new instances using the 5 nearest neighbors of each minority class sample, computing the distances with the Euclidean function. The misclassification costs, C(+|–) = IR and C(–|+) = 1, are used both by the cost-sensitive learning approaches and to compute the β value of the f-measure.
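
The sketch below (plain Python, not the MapReduce implementations used in the paper) illustrates the two oversampling strategies on the minority class: ROS replicates randomly chosen minority examples until the target size is reached, while SMOTE interpolates each synthetic example between a minority instance and one of its 5 nearest minority neighbours under the Euclidean distance.

    import numpy as np

    def random_oversampling(X_min, n_target, rng):
        """ROS sketch: replicate randomly chosen minority examples until the
        minority class reaches n_target instances (the majority class size)."""
        extra = rng.integers(0, len(X_min), size=n_target - len(X_min))
        return np.vstack([X_min, X_min[extra]])

    def smote(X_min, n_target, rng, k=5):
        """SMOTE sketch: create synthetic examples by interpolating between a
        minority instance and one of its k nearest minority neighbours."""
        synthetic = []
        while len(X_min) + len(synthetic) < n_target:
            i = rng.integers(0, len(X_min))
            dist = np.linalg.norm(X_min - X_min[i], axis=1)   # Euclidean distances
            neighbours = np.argsort(dist)[1:k + 1]            # skip the point itself
            j = rng.choice(neighbours)
            gap = rng.random()                                # uniform in [0, 1)
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
        return np.vstack([X_min, synthetic])

    # rng = np.random.default_rng(0); RUS would instead subsample the majority class.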

Results obtained

This section is divided into three parts. The first part provides the results of the study of the limitations that the sequential versions encounter when dealing with datasets of increasing size. The second part presents the performance results for the approaches according to the imbalanced classification measures. Finally, the third part provides the runtime results.

Analysis of the sequential versions when the size of the data available is increased

In this section, we analyze the behavior of the sequential versions when they are faced with data that grows in size. To do so, we select three of the cases of study derived from the KDD Cup 1999 dataset, the largest of the selected problems. For each of these cases of study, we create smaller versions by selecting 10%, 20%, 30%, 40% and 50% of the samples of the original problem while maintaining the proportion between the classes. Table 2 shows the details of the selected cases of study and their generated reduced versions.

Table 2: Summary of imbalanced datasets with reduced sizes

Dataset #Ex. #Atts. Class (maj;min) #Class(maj; min) %Class(maj; min) IR
kddcup_10_DOS_versus_normal 485615 41 (DOS; normal) (388337; 97278) (79.968; 20.032) 3.992
kddcup_20_DOS_versus_normal 971230 41 (DOS; normal) (776674; 194556) (79.968; 20.032) 3.992
kddcup_30_DOS_versus_normal 1456845 41 (DOS; normal) (1165011; 291834) (79.968; 20.032) 3.992
kddcup_40_DOS_versus_normal 1942460 41 (DOS; normal) (1553348; 389112) (79.968; 20.032) 3.992
kddcup_50_DOS_versus_normal 2428075 41 (DOS; normal) (1941685; 486390) (79.968; 20.032) 3.992
kddcup_DOS_versus_normal 4856151 41 (DOS; normal) (3883370; 972781) (79.968; 20.032) 3.992
kddcup_10_DOS_versus_U2R 388342 41 (DOS; U2R) (388337; 5) (99.999; 0.001) 77667.400
kddcup_20_DOS_versus_U2R 776684 41 (DOS; U2R) (776674; 10) (99.999; 0.001) 77667.400
kddcup_30_DOS_versus_U2R 1165026 41 (DOS; U2R) (1165011; 15) (99.999; 0.001) 77667.400
kddcup_40_DOS_versus_U2R 1553368 41 (DOS; U2R) (1553348; 20) (99.999; 0.001) 77667.400
kddcup_50_DOS_versus_U2R 1941711 41 (DOS; U2R) (1941685; 26) (99.999; 0.001) 74680.192
kddcup_DOS_versus_U2R 3883422 41 (DOS; U2R) (3883370; 52) (99.999; 0.001) 74680.192
kddcup_10_normal_versus_R2L 97390 41 (normal; R2L) (97278; 112) (99.885; 0.115) 868.554
kddcup_20_normal_versus_R2L 194781 41 (normal; R2L) (194556; 225) (99.884; 0.116) 864.693
kddcup_30_normal_versus_R2L 292171 41 (normal; R2L) (291834; 337) (99.885; 0.115) 865.976
kddcup_40_normal_versus_R2L 389562 41 (normal; R2L) (389112; 450) (99.884; 0.116) 864.693
kddcup_50_normal_versus_R2L 486953 41 (normal; R2L) (486390; 563) (99.884; 0.116) 863.925
kddcup_normal_versus_R2L 973907 41 (normal; R2L) (972781; 1126) (99.884; 0.116) 863.926
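
A minimal sketch of how reduced versions like those in Table 2 can be generated (an illustration assuming generic NumPy arrays of examples and labels, not the exact procedure used in the paper): the requested fraction is sampled from each class independently, so the class distribution, and therefore the IR, is preserved.

    import numpy as np

    def stratified_subsample(X, y, fraction, seed=0):
        """Keep `fraction` of the examples of each class (e.g. 0.10 for the 10%
        version), preserving the original class distribution and IR."""
        rng = np.random.default_rng(seed)
        keep = []
        for cls in np.unique(y):
            idx = np.flatnonzero(y == cls)
            n_keep = int(round(fraction * len(idx)))
            keep.append(rng.choice(idx, size=n_keep, replace=False))
        keep = np.concatenate(keep)
        return X[keep], y[keep]

    # e.g. stratified_subsample(X, y, 0.10) builds something analogous to
    # kddcup_10_DOS_versus_normal from kddcup_DOS_versus_normal.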

We present the performance results for all the sequential Random Forest versions in xls, xlsx and ods files available for download. These files are organized as follows: each dataset is represented in the rows of the file and the performance results for the approaches are placed in the columns of the file. There are two columns for every approach: one representing the average results in training and another with the average results in test.

Table 3 links to the performance results, available for download, of all the sequential Random Forest versions considered in the study over the three selected cases of study and their increasingly sized versions, using the Geometric Mean and the β-f-measure as effectiveness measures.

Table 3: Average results for the sequential RF versions for the imbalanced big data cases of study

Performance measure xls file xlsx file ods file
Geometric Mean [download] [download] [download]
β-f-measure [download] [download] [download]

Analysis of the effectiveness in classification

In this section we present the results obtained by the MapReduce Random Forest versions that are able to manage imbalanced big data over all the datasets considered in the study, according to the effectiveness measures tested in this work: the GM and the β-f-measure. Instead of using all the datasets together to identify which approaches are best for this type of data, we have grouped them into three groups (from larger datasets to smaller) to better observe the correct identification of instances of each class: the cases of study derived from the kddcup dataset define the first group; the second group involves the RLCP dataset; and the third group is composed of the cases of study derived from the poker dataset.
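
For reference, the two measures can be computed from the confusion matrix as in the following sketch (assuming the standard formulations: the geometric mean of the true positive and true negative rates, and the F_β measure on the minority class with β derived from the costs given in the experimental framework, i.e. β = IR):

    from math import sqrt

    def geometric_mean(tp, tn, fp, fn):
        """GM = sqrt(TPrate * TNrate): low if either class is poorly recognised."""
        tp_rate = tp / (tp + fn)                     # sensitivity (minority class)
        tn_rate = tn / (tn + fp)                     # specificity (majority class)
        return sqrt(tp_rate * tn_rate)

    def f_measure(tp, fp, fn, beta):
        """Standard F_beta on the positive (minority) class."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)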

We present the performance results obtained by the MapReduce Random Forest versions in xls, xlsx and ods files available for download. The files associated with the cases of study derived from the kddcup and poker datasets are organized as follows: each dataset is represented in the rows of the file and the performance results for the approaches are placed in the columns, with two columns for every approach: one with the average results in training and another with the average results in test. The files associated with the RLCP dataset are organized differently: each approach is represented in the rows of the file, and there are two columns of results: one with the average results in training and another with the average results in test.

Table 4 links to the average training and test results, available for download, of all the MapReduce approaches studied in this work using 8, 16, 32 and 64 mappers respectively over the kddcup cases of study.

Table 4: Average results for the MapReduce RF versions using 8, 16, 32 and 64 mappers on the imbalanced big data cases based on the kddcup dataset

Performance measure xls file xlsx file ods file
Geometric Mean [download] [download] [download]
β-f-measure [download] [download] [download]

Table 5 links to the average training and test results, available for download, of all the MapReduce approaches studied in this work using 8, 16, 32 and 64 mappers respectively over the RLCP dataset.

Table 5: Average results for the MapReduce RF versions using 8, 16, 32 and 64 mappers on the RLCP dataset

Performance measure xls file xlsx file ods file
Geometric Mean [download] [download] [download]
β-f-measure [download] [download] [download]

Table 6 links to the average training and test results, available for download, of all the MapReduce approaches studied in this work using 8, 16, 32 and 64 mappers respectively over the poker cases of study.

Table 6: Average results for the MapReduce RF versions using 8, 16, 32 and 64 mappers on the imbalanced big data cases based on the poker dataset

Performance measure xls file xlsx file ods file
Geometric Mean [download] [download] [download]
β-f-measure [download] [download] [download]

Analysis of the runtime

In this section we present the runtime results in a similar manner to the analysis of the effectiveness given in the previous section: the first group contains the results for the cases of study based on the kddcup dataset; then, the runtime associated with the RLCP dataset; and finally, the time spent on the cases of study based on the poker dataset.

We present the runtime results in xls, xlsx and ods files available for download. The files associated with the cases of study derived from the kddcup and poker datasets are organized as follows: each dataset is represented in the rows of the file and the runtime results for the approaches are placed in the columns of the file. Moreover, the files associated with the RLCP dataset are organized as follows: each approach is represented in the rows of the file and the runtime results for the approaches are placed in the columns of the file.
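
The tables below report each runtime both in raw seconds and in the hh:mm:ss.SS format; a small helper such as the following (a simple illustration, not part of the experimental software) converts between the two:

    def to_hhmmss(seconds):
        """Format a runtime in seconds as hh:mm:ss.SS (hundredths of a second)."""
        hours, rem = divmod(seconds, 3600)
        minutes, secs = divmod(rem, 60)
        return f"{int(hours):02d}:{int(minutes):02d}:{secs:05.2f}"

    print(to_hhmmss(3723.456))   # -> 01:02:03.46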

Table 7 shows the time elapsed, in seconds and in the hh:mm:ss.SS format, for the cases of study derived from the kddcup dataset using the MapReduce algorithms selected for imbalanced big data with 8, 16, 32 and 64 mappers respectively.

Table 7: Runtime elapsed in seconds and in the hh:mm:ss.SS format for the MapReduce RF versions on the imbalanced big data cases based on the kddcup dataset for 8, 16, 32 and 64 mappers

Time format xls file xlsx file ods file
seconds [download] [download] [download]
hh:mm:ss.SS [download] [download] [download]

Table 8 displays in seconds and in the hh:mm:ss.SS format the runtime for the RLCP dataset using the MapReduce algorithms developed for imbalanced big data with 8, 16, 32 and 64 mappers respectively.

Table 8: Runtime elapsed in seconds and in the hh:mm:ss.SS format for the MapReduce RF versions on the RLCP dataset

Time format xls file xlsx file ods file
seconds [download] [download] [download]
hh:mm:ss.SS [download] [download] [download]

Table 9 presents the runtime for the poker cases of study in seconds and in the hh:mm:ss.SS format. These runtimes correspond to the MapReduce algorithms developed for imbalanced big data, tested with 8, 16, 32 and 64 mappers respectively.

Table 9: Runtime elapsed in seconds and in the hh:mm:ss.SS format for the MapReduce RF versions on the imbalanced big data cases based on the poker dataset for 8, 16, 32 and 64 mappers

Time format xls file xlsx file ods file
seconds [download] [download] [download]
hh:mm:ss.SS [download] [download] [download]

Page Maintained by S. del Rio