BigDaPSpark Library
BigDaPTOOLS Project
This library contains all the data preprocessing algorithms we have developed for Apache Spark. They are available on both Spark Packages and GitHub.
Our algorithms
Spark Infotheoric Feature Selection Framework
Information-theory-based framework for feature selection that includes minimum redundancy maximum relevance (mRMR), InfoGain, JMI, and other commonly used feature selection filters.
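The criteria above are all built on empirical mutual information. The sketch below is not the framework's Spark API; it is a minimal single-machine illustration of mutual information and a greedy mRMR selection loop over discrete features (the function names `mutual_information` and `mrmr` are ours, chosen for illustration).

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr(features, labels, k):
    """Greedy mRMR: repeatedly pick the feature maximizing relevance to the
    labels minus mean redundancy with the already-selected features.
    `features` maps feature name -> list of discrete values."""
    selected, candidates = [], set(features)
    while len(selected) < k and candidates:
        def score(f):
            rel = mutual_information(features[f], labels)
            red = (sum(mutual_information(features[f], features[s])
                       for s in selected) / len(selected)) if selected else 0.0
            return rel - red
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

The distributed framework computes the same quantities, but aggregates the contingency counts across partitions instead of over in-memory lists.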
A Distributed Evolutionary Multivariate Discretizer (DEMD)
Evolutionary discretizer that uses a binary chromosome representation and a wrapper fitness function to optimize the cut-point selection problem by trading off two factors: the simplicity of solutions and their classification accuracy. To alleviate the complexity, the evaluation phase is parallelized by splitting the sets of chromosomes and instances into different partitions and performing a random cross-evaluation process.
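A chromosome here is a bit vector over candidate cut points: a set bit keeps that cut. DEMD's real fitness is a distributed wrapper evaluation; the sketch below only illustrates the trade-off shape with a purity proxy in place of classifier accuracy, and `alpha` is an assumed weighting parameter, not a DEMD setting.

```python
from collections import Counter

def discretize(value, cuts):
    """Bin index of `value` under the sorted cut points `cuts`."""
    return sum(value > c for c in cuts)

def fitness(chromosome, candidates, xs, ys, alpha=0.7):
    """Score a chromosome (tuple of 0/1 bits over `candidates`) by trading
    off bin purity (an accuracy proxy) against the number of cuts kept."""
    cuts = [c for bit, c in zip(chromosome, candidates) if bit]
    bins = {}
    for x, y in zip(xs, ys):
        bins.setdefault(discretize(x, cuts), []).append(y)
    # Fraction of instances matching their bin's majority label.
    purity = sum(max(Counter(b).values()) for b in bins.values()) / len(xs)
    # Fewer cuts means a simpler discretization.
    simplicity = 1 - len(cuts) / len(candidates)
    return alpha * purity + (1 - alpha) * simplicity
```

An evolutionary search then mutates and recombines chromosomes, keeping those with the best fitness; DEMD additionally shards chromosomes and instances across partitions for the evaluation.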
PCARD
An ensemble method that is a distributed upgrade of the method presented by A. Ahmad. It applies Random Discretization and Principal Component Analysis (PCA) to the input data, joins the results, and trains a decision tree on the resulting data.
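The Random Discretization step picks thresholds at random from the observed values rather than optimizing them. This is a minimal local sketch of that step only (the PCA projection and tree training are omitted); `random_discretize` is an illustrative name, not the package's API.

```python
import random

def random_discretize(values, n_bins, rng):
    """Random Discretization: draw n_bins - 1 thresholds at random from the
    observed values and bin each value by how many thresholds it exceeds."""
    cuts = sorted(rng.sample(values, n_bins - 1))
    return [sum(v > c for c in cuts) for v in values]
```

Because the cuts are random, the ensemble repeats this step with different seeds, concatenates the discretized view with the PCA view, and trains a tree per repetition.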
NoiseFramework
This framework implements two Big Data preprocessing approaches to remove noisy examples: a homogeneous ensemble (HME-BD) filter and a heterogeneous ensemble (HTE-BD) filter, with special emphasis on their scalability and performance traits. A simple filtering approach based on similarities between instances (ENN-BD) is also implemented.
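The similarity-based filter follows the Edited Nearest Neighbours idea: an instance whose label disagrees with its neighbourhood is treated as noise. The sketch below is a local illustration of that rule, not the framework's distributed ENN-BD implementation.

```python
from collections import Counter

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def enn_filter(points, labels, k=3):
    """Edited Nearest Neighbours: return indices of instances whose label
    agrees with the majority label of their k nearest neighbours;
    the rest are flagged as noise and dropped."""
    keep = []
    for i, (p, y) in enumerate(zip(points, labels)):
        neighbours = sorted((j for j in range(len(points)) if j != i),
                            key=lambda j: euclidean(p, points[j]))[:k]
        majority = Counter(labels[j] for j in neighbours).most_common(1)[0][0]
        if majority == y:
            keep.append(i)
    return keep
```

HME-BD and HTE-BD replace the raw k-NN vote with predictions from an ensemble of (homogeneous or heterogeneous) classifiers trained on data partitions.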
SmartReduction
This framework implements four distance-based Big Data preprocessing algorithms for prototype selection and generation: FCNN_MR, SSMASFLSDE_MR, RMHC_MR, and MR_DIS, with special emphasis on their scalability and performance traits.
SmartFiltering
This framework implements four distance-based Big Data preprocessing algorithms to remove noisy examples: the ENN_BD, AllKNN_BD, NCNEdit_BD, and RNG_BD filters, with special emphasis on their scalability and performance traits.
Smart_Imputation
This contribution implements two scalable approaches to k Nearest Neighbor imputation for handling big datasets: k Nearest Neighbor - Local Imputation and k Nearest Neighbor - Global Imputation. The global approach takes all the instances into account when computing the k nearest neighbors. The local approach only considers the instances within the same partition, achieving faster runtimes but losing some information, since it does not consider all the samples.
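The core of both variants is the same: fill each missing value with the mean of that column over the k rows nearest in the observed columns. The sketch below illustrates the global variant on in-memory lists (missing values as `None`); `knn_impute` is an illustrative name, not the package's API. The local variant simply applies the same routine independently to each partition's rows.

```python
def _dist(a, b):
    """Euclidean distance over the columns observed in both rows."""
    common = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return sum((x - y) ** 2 for x, y in common) ** 0.5

def knn_impute(rows, k=2):
    """Global kNN imputation: every row is a candidate donor for every
    missing cell. Returns a new list of completed rows."""
    out = []
    for row in rows:
        filled = list(row)
        for j, v in enumerate(row):
            if v is None:
                donors = sorted((r for r in rows
                                 if r is not row and r[j] is not None),
                                key=lambda r: _dist(row, r))[:k]
                filled[j] = sum(r[j] for r in donors) / len(donors)
        out.append(filled)
    return out
```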
Minimum Description Length Discretizer (MDLP)
This method implements Fayyad's discretizer, based on the Minimum Description Length Principle (MDLP), to treat continuous attributes from a distributed perspective. It supports sparse data, parallel processing of attributes, etc.
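Fayyad and Irani's method picks the boundary cut maximizing information gain and accepts it only if the gain beats an MDL-derived threshold. This is a minimal single-attribute, single-split sketch of that criterion (the real discretizer recurses on each accepted split and parallelizes across attributes).

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_mdlp_cut(xs, ys):
    """Return the cut point with the highest information gain that passes
    the MDLP acceptance criterion, or None if every split is rejected."""
    pairs = sorted(zip(xs, ys))
    n, base, best = len(pairs), entropy(ys), None
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cuts only between distinct values
        cut = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) \
                    - (len(right) / n) * entropy(right)
        # MDLP criterion: gain must exceed (log2(n-1) + delta) / n.
        k, k1, k2 = len(set(ys)), len(set(left)), len(set(right))
        delta = math.log2(3 ** k - 2) \
              - (k * base - k1 * entropy(left) - k2 * entropy(right))
        if gain > (math.log2(n - 1) + delta) / n and \
                (best is None or gain > best[1]):
            best = (cut, gain)
    return best[0] if best else None
```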
Imb-sampling-ROS_and_RUS
Spark implementations of two data sampling methods (random oversampling and random undersampling) for imbalanced data.
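Both methods rebalance the class sizes before training. The sketch below shows the two sampling rules on plain lists; it is not the package's Spark API (which operates on RDDs/DataFrames), and the function names `ros`/`rus` are ours.

```python
import random

def ros(majority, minority, rng):
    """Random oversampling: resample the minority class with replacement
    until both classes have the same size."""
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def rus(majority, minority, rng):
    """Random undersampling: sample the majority class without replacement
    down to the minority class size."""
    return rng.sample(majority, len(minority)) + minority
```

Oversampling keeps all the information at the cost of a larger dataset; undersampling is cheaper but may discard useful majority-class examples.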
Third-party algorithms
DiReliefF
An Apache Spark package containing a distributed implementation of the classical ReliefF algorithm.
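Relief-family algorithms weight each feature by how well it separates an instance from its nearest "miss" (different class) versus its nearest "hit" (same class). The sketch below is the basic two-class Relief update on in-memory data, not DiReliefF's distributed implementation; ReliefF itself generalizes this to k neighbours, multiple classes, and missing values.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def relief(points, labels):
    """Basic Relief: for every instance, reward features that differ on the
    nearest miss and penalize features that differ on the nearest hit."""
    n_feat = len(points[0])
    w = [0.0] * n_feat
    for i, (p, y) in enumerate(zip(points, labels)):
        others = [j for j in range(len(points)) if j != i]
        hit = min((j for j in others if labels[j] == y),
                  key=lambda j: euclidean(p, points[j]))
        miss = min((j for j in others if labels[j] != y),
                   key=lambda j: euclidean(p, points[j]))
        for f in range(n_feat):
            w[f] += abs(p[f] - points[miss][f]) - abs(p[f] - points[hit][f])
    return [x / len(points) for x in w]
```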
Multithreaded and Spark parallelization of feature selection filters
Reimplementation of four popular feature selection algorithms included in Weka, providing both multithreaded versions previously unavailable in Weka and a parallel Spark implementation of each algorithm.
Sparkling Water
Provides a connection between H2O and Spark algorithms. It enables launching H2O on top of Spark and using H2O capabilities, including its various ML algorithms, graphical user interface, and R integration.
Model Matrix
A framework on Apache Spark for solving a large-scale feature engineering problem: building model features for machine learning with high feature sparsity. It is built on Spark DataFrames and can read and write data from/to HDFS (CSV, Parquet) and Hive.
Spark Stemming
Spark MLlib wrapper around Snowball, a string processing language for creating stemming algorithms for use in Information Retrieval.
t-Distributed Stochastic Neighbor Embedding
Apache Spark implementation of a distributed version of t-Distributed Stochastic Neighbor Embedding (t-SNE), a dimensionality reduction technique, that approximates the reference design but can scale horizontally.