Data Pre-processing for Big Data
Data preprocessing (S. García, J. Luengo, and F. Herrera, Data Preprocessing in Data Mining. Cham, Switzerland: Springer, 2015) is a crucial research topic in Data Mining (DM), since most real-world databases are strongly affected by negative elements such as noise, missing values, and inconsistent or superfluous data. Data reduction is also an essential task, especially when dealing with large data sets, focusing on the selection or extraction of the most informative features or instances in the data. During the last few decades, the dimensionality of the datasets employed in DM tasks has increased significantly. This presents an unprecedented challenge for researchers, since existing algorithms do not always respond in an adequate time when dealing with this new extreme dimensionality (both in the number of features and of instances). New technologies, paradigms and algorithms are thus needed to efficiently process these large quantities of data and obtain information within tolerable elapsed times.
Feature Selection
With the advent of extremely high-dimensional datasets, dimensionality reduction techniques are becoming mandatory. Among the many available techniques, feature selection (FS) is attracting growing interest as a tool to identify the relevant features in huge datasets, huge both in number of instances and in number of features. From the beginning, data scientists have generally focused on only one side of Big Data, which in the early days referred to the huge number of instances, paying less attention to the feature side. Big Dimensionality (Zhai Y., Ong Y., Tsang I.W. (2014) "The emerging 'big dimensionality'," IEEE Computational Intelligence Magazine 9(3):14–26), though, calls for new FS strategies and methods that are able to deal with the feature explosion problem. This explosion is reflected in many of the most popular dataset repositories in computational intelligence (such as UCI or libSVM), where most of the recently added datasets have a humongous dimensionality (on the order of millions of features).
Isolating high-value features from the raw set of features (potentially irrelevant, redundant and noisy), while keeping measurement and storage requirements manageable, is one of the most important tasks in Big Data research. In this field, we have developed several software packages to address this problem.
- An Information Theoretic Feature Selection Framework for Spark
This package implements FS on Spark for application to Big Data problems. It contains a generic implementation of greedy Information Theoretic Feature Selection methods. The implementation is based on the common theoretic framework presented in (Peng, H., Long, F., and Ding, C., "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005). Implementations of mRMR, ICAP, JMI and other commonly used FS filters are provided. The journal publication describing this package is currently under review.
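For illustration, the sketch below shows the kind of greedy criterion such a framework unifies: the relevance of a candidate feature minus a weighted redundancy with the already selected ones. It is a minimal, single-machine Scala sketch, not the package's actual API; the mutualInfo helper, the beta weight and all names are illustrative assumptions.

```scala
// Minimal, single-machine sketch of a greedy information-theoretic FS
// criterion (relevance minus weighted redundancy). This is NOT the
// package's API; all names and the beta weight are illustrative.
object GreedyInfoTheoreticFS {

  // Mutual information I(X;Y) between two discrete variables.
  def mutualInfo(x: Array[Int], y: Array[Int]): Double = {
    val n = x.length.toDouble
    val joint = x.zip(y).groupBy(identity).map { case (k, v) => k -> v.length / n }
    val px = x.groupBy(identity).map { case (k, v) => k -> v.length / n }
    val py = y.groupBy(identity).map { case (k, v) => k -> v.length / n }
    joint.map { case ((a, b), pxy) =>
      pxy * math.log(pxy / (px(a) * py(b)))
    }.sum
  }

  // Greedy selection: at each step pick the feature maximizing
  //   I(Xi;Y) - beta * sum_{Xj in S} I(Xi;Xj)   (an mRMR-style criterion).
  def select(features: Array[Array[Int]], labels: Array[Int],
             k: Int, beta: Double = 1.0): Seq[Int] = {
    val remaining = scala.collection.mutable.Set(features.indices: _*)
    val selected  = scala.collection.mutable.ArrayBuffer.empty[Int]
    val relevance = features.map(f => mutualInfo(f, labels)) // computed once
    while (selected.size < k && remaining.nonEmpty) {
      val best = remaining.maxBy { i =>
        val redundancy = selected.map(j => mutualInfo(features(i), features(j))).sum
        relevance(i) - beta * redundancy
      }
      remaining -= best
      selected  += best
    }
    selected.toSeq
  }
}
```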
- An improved implementation of the minimum Redundancy Maximum Relevance (mRMR) feature selection algorithm
This is an improved implementation of the classical minimum Redundancy Maximum Relevance (mRMR) feature selection method presented by Peng et al. in (Peng, H., Long, F., and Ding, C., "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005). It includes several optimizations, such as caching of marginal probabilities, accumulation of redundancy (greedy approach), and column-wise data access.
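The accumulated-redundancy optimization can be sketched as follows: instead of re-summing the redundancy over all selected features at every iteration, each remaining candidate keeps a running total that is updated only with the feature chosen in the previous step. This is a single-machine Scala sketch under the same assumptions as above (a discrete mutual information function, here passed as a parameter); it is not the package's code and does not show the column-wise data access or the probability caching details.

```scala
// Sketch of the accumulated-redundancy trick in greedy mRMR. Names are
// illustrative; mutualInfo is assumed to be a discrete MI estimator.
object MRMRAccumulated {
  def mrmr(features: Array[Array[Int]], labels: Array[Int], k: Int)
          (mutualInfo: (Array[Int], Array[Int]) => Double): Seq[Int] = {
    val relevance     = features.map(f => mutualInfo(f, labels)) // cached once
    val accRedundancy = Array.fill(features.length)(0.0)         // running sum per candidate
    val remaining     = scala.collection.mutable.Set(features.indices: _*)
    val selected      = scala.collection.mutable.ArrayBuffer.empty[Int]

    while (selected.size < k && remaining.nonEmpty) {
      // mRMR score: relevance minus mean accumulated redundancy.
      val best = remaining.maxBy { i =>
        val meanRed = if (selected.isEmpty) 0.0 else accRedundancy(i) / selected.size
        relevance(i) - meanRed
      }
      remaining -= best
      selected  += best
      // Update the running redundancy of the remaining candidates with the
      // newly selected feature only (instead of recomputing the whole sum).
      remaining.foreach(i => accRedundancy(i) += mutualInfo(features(i), features(best)))
    }
    selected.toSeq
  }
}
```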
Discretization
Discretization is one of the most important tasks in the data mining process, aimed at simplifying and reducing continuous-valued data in large datasets. In spite of the great interest in this reduction mechanism, only a few simple discretization techniques have been adapted to Big Data in the literature.
Among its main benefits, discretization yields remarkable improvements in the learning speed and accuracy of learning methods. Moreover, although most real-world problems involve numerical attributes, many algorithms can only handle categorical attributes, for example some feature selection methods (which are particularly relevant in the Big Data picture).
Classical discretization methods are not expected to scale well when managing huge data (both in terms of features and instances), so their application can be undermined or even become impracticable. To address this, we have developed the software package described below:
- Distributed Minimum Description Length Discretizer for Spark
This package implements Fayyad and Irani's discretizer (Fayyad, U., and Irani, K., "Multi-interval discretization of continuous-valued attributes for classification learning," 1993), based on the Minimum Description Length Principle (MDLP), to treat non-discrete datasets from a distributed perspective. We have developed a distributed version of the original algorithm, introducing some important changes.
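As a reference for what the discretizer evaluates, the following single-machine Scala sketch shows the Fayyad-Irani MDLP acceptance test for one candidate binary split: the split is kept only if its information gain exceeds the MDL-derived threshold. It illustrates the criterion only; the distributed cut-point search of the package is not shown, and all names are illustrative.

```scala
// Single-machine sketch of the Fayyad-Irani MDLP acceptance test that an
// MDLP discretizer applies recursively to candidate cut points.
// Illustration of the criterion, not the package's code.
object MDLPCriterion {

  private def log2(x: Double): Double = math.log(x) / math.log(2)

  // Shannon entropy of a set of class labels.
  def entropy(labels: Seq[Int]): Double = {
    val n = labels.size.toDouble
    labels.groupBy(identity).values
      .map(g => g.size / n)
      .map(p => -p * log2(p))
      .sum
  }

  // Accept the binary split (left, right) of `all` only if the information
  // gain exceeds the MDLP threshold (log2(N - 1) + delta) / N.
  def acceptSplit(all: Seq[Int], left: Seq[Int], right: Seq[Int]): Boolean = {
    val n    = all.size.toDouble
    val entS = entropy(all)
    val entL = entropy(left)
    val entR = entropy(right)
    val gain = entS - (left.size / n) * entL - (right.size / n) * entR

    val k  = all.distinct.size    // classes in S
    val k1 = left.distinct.size   // classes in S1
    val k2 = right.distinct.size  // classes in S2
    val delta = log2(math.pow(3, k) - 2) - (k * entS - k1 * entL - k2 * entR)

    gain > (log2(n - 1) + delta) / n
  }
}
```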
Prototype Generation
Prototype Reduction (PR) techniques (I. Triguero, J. Derrac, S. García, and F. Herrera, "A taxonomy and experimental study on prototype generation for nearest neighbor classification," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 1, pp. 86–100, 2012) are instance reduction methods that aim to improve the classification capabilities of the Nearest Neighbor (NN) rule. These techniques may select instances from the original data set, or build new artificial prototypes, to form a resulting set of prototypes that better adjusts the decision boundaries between classes in NN classification.
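As a toy illustration of the idea (not any method from the cited taxonomy), the Scala sketch below replaces the training set with one artificial prototype per class and then applies the 1-NN rule over that reduced set; the centroid-based generation and all names are assumptions made purely for illustration.

```scala
// Toy prototype generation: one centroid per class stands in for a real PG
// method, and new points are classified with 1-NN over the reduced set.
object PrototypeGenerationToy {
  type Point = (Array[Double], Int) // (features, class label)

  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (ai, bi) => (ai - bi) * (ai - bi) }.sum)

  // Build one artificial prototype (centroid) per class.
  def generatePrototypes(data: Seq[Point]): Seq[Point] =
    data.groupBy(_._2).toSeq.map { case (label, pts) =>
      val dim = pts.head._1.length
      val centroid = (0 until dim).map(d => pts.map(_._1(d)).sum / pts.size).toArray
      (centroid, label)
    }

  // 1-NN rule over the (reduced) prototype set.
  def classify(prototypes: Seq[Point], x: Array[Double]): Int =
    prototypes.minBy { case (p, _) => dist(p, x) }._2

  def main(args: Array[String]): Unit = {
    val training: Seq[Point] = Seq(
      (Array(0.1, 0.2), 0), (Array(0.2, 0.1), 0),
      (Array(0.9, 0.8), 1), (Array(0.8, 0.9), 1)
    )
    val prototypes = generatePrototypes(training)    // 2 prototypes instead of 4 instances
    println(classify(prototypes, Array(0.15, 0.15))) // expected: 0
  }
}
```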
The main problems found when dealing with large-scale data are the following:
- Runtime: The complexity of PR models is O((n·D)^2) or higher, where n is the number of instances and D the number of features. Although these techniques are applied only once, if this process takes too long, it could become unusable in real applications.
- Memory consumption: Most PR methods need to store in main memory many partial calculations, intermediate solutions, and/or the entire dataset. When the dataset is too big, it can easily exceed the available RAM.
In this field, we have developed a MapReduce-based framework for PR built on a stratification procedure (a conceptual sketch follows the reference below):
- I. Triguero, D. Peralta, J. Bacardit, S. García, F. Herrera. MRPR: A MapReduce Solution for Prototype Reduction in Big Data Classification. Neurocomputing 150 (2015), 331-345. doi: 10.1016/j.neucom.2014.04.078
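The conceptual Spark/Scala sketch below outlines the stratification idea behind MRPR: the training set is split into disjoint chunks, a prototype reduction step is applied independently inside each chunk (map phase), and the partial prototype sets are simply unioned (the simplest of the reducer strategies discussed in the paper). The per-class-centroid reduction used here is only a stand-in for a real PR method, and all class and function names are illustrative assumptions, not MRPR's actual code.

```scala
// Conceptual sketch of the MRPR workflow on Spark: partition (stratify) the
// training data, run a PR stand-in per partition, and union the prototypes.
import org.apache.spark.sql.SparkSession

object MRPRSketch {
  case class Instance(features: Array[Double], label: Int)

  // Stand-in for a PR method applied to one data chunk: per-class centroids.
  def reduceChunk(chunk: Iterator[Instance]): Iterator[Instance] = {
    chunk.toSeq.groupBy(_.label).map { case (label, insts) =>
      val dim = insts.head.features.length
      val centroid = (0 until dim).map(d => insts.map(_.features(d)).sum / insts.size).toArray
      Instance(centroid, label)
    }.iterator
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MRPR-sketch")
      .master("local[2]") // local master just for this toy example
      .getOrCreate()
    val sc = spark.sparkContext

    val training = sc.parallelize(Seq(
      Instance(Array(0.1, 0.2), 0), Instance(Array(0.2, 0.1), 0),
      Instance(Array(0.9, 0.8), 1), Instance(Array(0.8, 0.9), 1)
    ), numSlices = 2)

    // Map phase: apply the PR stand-in inside each partition (stratum).
    // Reduce phase: collect and union the partial prototype sets.
    val prototypes = training.mapPartitions(reduceChunk).collect()

    prototypes.foreach(p => println(s"class ${p.label}: ${p.features.mkString(",")}"))
    spark.stop()
  }
}
```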