BigDaPSpark Library

BigDaPTOOLS Project

 

This library contains all the data preprocessing algorithms we have developed for Apache Spark. They are available on both Spark Packages and GitHub.

Our algorithms

Spark Infotheoretic Feature Selection Framework

An information theory based framework for feature selection that includes minimum Redundancy Maximum Relevance (mRMR), InfoGain, JMI, and other commonly used feature selection filters.

GitHub | Spark Packages | code as of 29/1/2018:
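
As an illustration, here is a minimal sketch of selecting features with the mRMR criterion. The InfoThCriterionFactory and InfoThSelector entry points follow the package's published examples, but the exact signatures may vary between versions:

    import org.apache.spark.mllib.feature.{InfoThCriterionFactory, InfoThSelector}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils

    // Load the data as an RDD[LabeledPoint] (sc is the SparkContext).
    val data = MLUtils.loadLibSVMFile(sc, "data.libsvm")

    // Pick the information-theoretic criterion: "mrmr", "mim", "jmi", ...
    val criterion = new InfoThCriterionFactory("mrmr")

    // Select the 100 most relevant features using 100 partitions
    // (constructor arguments as in the package examples).
    val selector = new InfoThSelector(criterion, 100, 100).fit(data)

    // Project every instance onto the selected features.
    val reduced = data.map(lp => LabeledPoint(lp.label, selector.transform(lp.features)))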

 

A Distributed Evolutionary Multivariate Discretizer (DEMD)

An evolutionary discretizer that uses a binary chromosome representation and a wrapper fitness function to optimize the cut-point selection problem by trading off two factors: the simplicity of the solutions and their classification accuracy. To alleviate the computational complexity, the evaluation phase is parallelized by splitting the sets of chromosomes and instances into different partitions and performing a random cross-evaluation process, as sketched below.

GitHub | Spark Packages | code as of 6/3/2018:
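
A rough, self-contained sketch of the random cross-evaluation idea (all names below are hypothetical, not the package API): instead of scoring every chromosome on the full dataset, each block of chromosomes is paired with one randomly chosen block of instances.

    import scala.util.Random

    // Hypothetical types: a chromosome is a binary mask over candidate cut points.
    type Chromosome = Array[Boolean]
    type Instance   = (Array[Double], Double) // (features, label)

    // Wrapper fitness trading off accuracy and simplicity; the classifier
    // accuracy estimator is passed in, and the weights are illustrative.
    def fitness(c: Chromosome, sample: Seq[Instance],
                accuracyOf: (Chromosome, Seq[Instance]) => Double): Double = {
      val simplicity = 1.0 - c.count(identity).toDouble / c.length
      0.7 * accuracyOf(c, sample) + 0.3 * simplicity
    }

    // Random cross-evaluation: each chromosome block is scored against a
    // single random instance block rather than against all instances.
    def crossEvaluate(chromosomeBlocks: Seq[Seq[Chromosome]],
                      instanceBlocks: Seq[Seq[Instance]],
                      accuracyOf: (Chromosome, Seq[Instance]) => Double,
                      rng: Random): Seq[(Chromosome, Double)] =
      chromosomeBlocks.flatMap { block =>
        val sample = instanceBlocks(rng.nextInt(instanceBlocks.size))
        block.map(c => c -> fitness(c, sample, accuracyOf))
      }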

 

PCARD

An ensemble method that is a distributed upgrade of the method presented by A. Ahmad. It applies Random Discretization and Principal Components Analysis (PCA) to the input data, joins the results, and trains a decision tree on the resulting data.

GitHub | Spark Packages | code as of 6/3/2018:
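
A minimal usage sketch, assuming the PCARD.train entry point shown in the package examples (arguments: training data, number of trees, number of discretization bins); signatures may differ between versions:

    import org.apache.spark.mllib.tree.PCARD
    import org.apache.spark.mllib.util.MLUtils

    val train = MLUtils.loadLibSVMFile(sc, "train.libsvm")
    val test  = MLUtils.loadLibSVMFile(sc, "test.libsvm")

    // Build an ensemble of 10 trees over randomly discretized (5 bins)
    // and PCA-transformed views of the training data.
    val model = PCARD.train(train, 10, 5)

    // Majority-vote predictions for the test instances.
    val predictions = model.predict(test)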

 

NoiseFramework

This framework implements two Big Data preprocessing approaches to remove noisy examples, a homogeneous ensemble (HME-BD) filter and a heterogeneous ensemble (HTE-BD) filter, with special emphasis on their scalability and performance. A simple filtering approach based on similarity between instances (ENN-BD) is also implemented.

Code as of 30/10/2018:
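
A minimal sketch of the homogeneous ensemble filter, assuming the HME_BD class and runFilter method from the package examples (arguments: training data, number of trees, partitions, tree depth, seed; these may differ between versions):

    import org.apache.spark.mllib.feature.HME_BD
    import org.apache.spark.mllib.util.MLUtils

    val training = MLUtils.loadLibSVMFile(sc, "train.libsvm")

    // Random-forest-based homogeneous ensemble: instances misclassified
    // by the ensemble in a partition-wise cross-validation are dropped.
    val hme = new HME_BD(training, 100, 4, 10, 987654321L)

    // Returns the filtered RDD containing only the retained examples.
    val cleanData = hme.runFilter()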

 

SmartReduction

This framework implements four distance-based Big Data preprocessing algorithms for prototype selection and generation, FCNN_MR, SSMASFLSDE_MR, RMHC_MR, and MR_DIS, with special emphasis on their scalability and performance.

GitHub | Spark Packages | code as of 29/1/2018:

 

SmartFiltering

This framework implements four distance-based Big Data preprocessing algorithms to remove noisy examples, the ENN_BD, AllKNN_BD, NCNEdit_BD, and RNG_BD filters, with special emphasis on their scalability and performance.

GitHub | Spark Packages | code as of 29/1/2018:

 

Smart_Imputation

This contribution implements two scalable approaches to k Nearest Neighbor imputation for handling big datasets: k Nearest Neighbor Local Imputation and k Nearest Neighbor Global Imputation. The global approach takes all the instances into account when computing the k nearest neighbors. The local approach considers only the instances within the same partition, achieving better runtimes but losing some information, since it does not consider all the samples.

GitHub | Spark Packages | code as of 29/1/2018:
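
To illustrate the local variant's trade-off, here is a toy, plain-Spark sketch (not the package API) that imputes missing values, encoded as NaN, from the k nearest complete neighbors found inside the same partition only:

    import org.apache.spark.rdd.RDD

    // Toy local kNN imputation: neighbors are searched within each
    // partition, trading some accuracy for speed as described above.
    def localKnnImpute(data: RDD[Array[Double]], k: Int): RDD[Array[Double]] =
      data.mapPartitions { it =>
        val rows = it.toArray
        val complete = rows.filter(!_.exists(_.isNaN)) // rows with no missing values
        def dist(a: Array[Double], b: Array[Double]): Double =
          math.sqrt(a.indices.collect {
            case i if !a(i).isNaN && !b(i).isNaN => (a(i) - b(i)) * (a(i) - b(i))
          }.sum)
        rows.iterator.map { row =>
          if (!row.exists(_.isNaN)) row
          else {
            val neighbors = complete.sortBy(dist(row, _)).take(k)
            row.zipWithIndex.map { case (v, i) =>
              if (v.isNaN && neighbors.nonEmpty)
                neighbors.map(_(i)).sum / neighbors.length // mean over the k neighbors
              else v
            }
          }
        }
      }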

 

Minimum Description Length Discretizer (MDLP)

This method implements Fayyad's discretizer, based on the Minimum Description Length Principle (MDLP), to treat non-discrete datasets from a distributed perspective. It supports sparse data, parallel processing of attributes, and more.

Code as of 31/10/2018:
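
A minimal sketch following the package's example usage (argument meanings per its README; signatures may vary across versions):

    import org.apache.spark.mllib.feature.MDLPDiscretizer
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "data.libsvm")

    // Train the discretizer: None = treat all features as continuous,
    // at most 25 bins per feature, up to 10000 elements per partition.
    val discretizer = MDLPDiscretizer.train(data, None, 25, 10000)

    // Replace continuous values by their bin indices.
    val discrete = data.map(lp => LabeledPoint(lp.label, discretizer.transform(lp.features)))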

 

Imb-sampling-ROS_and_RUS

Spark implementations of two data sampling methods (random oversampling and random undersampling) for imbalanced data.

Code as of 30/10/2018:
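
The underlying idea is simple enough to sketch with plain Spark operations (binary case, minority label 1.0; this mirrors the technique, not the package's exact API):

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Random oversampling: replicate minority examples (sampling with
    // replacement) until they roughly match the majority class size.
    def randomOversample(data: RDD[LabeledPoint], seed: Long = 42L): RDD[LabeledPoint] = {
      val pos = data.filter(_.label == 1.0)
      val neg = data.filter(_.label == 0.0)
      val ratio = neg.count.toDouble / pos.count
      neg.union(pos.sample(withReplacement = true, ratio, seed))
    }

    // Random undersampling: keep a random fraction of the majority class
    // so that it roughly matches the minority class size.
    def randomUndersample(data: RDD[LabeledPoint], seed: Long = 42L): RDD[LabeledPoint] = {
      val pos = data.filter(_.label == 1.0)
      val neg = data.filter(_.label == 0.0)
      val ratio = pos.count.toDouble / neg.count
      pos.union(neg.sample(withReplacement = false, ratio, seed))
    }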

 

Third-party algorithms

DiReliefF

An Apache Spark package containing a distributed implementation of the classical ReliefF algorithm.

GitHub | Spark Packages | code as of 7/3/2018:

 

Multithreaded and Spark parallelization of feature selection filters

A reimplementation of four popular feature selection algorithms included in Weka, providing multithreaded versions previously not available in Weka, as well as a parallel Spark implementation of each algorithm.

Code as of 7/3/2018:

 

Sparkling Water

Provides a connection between H2O and Spark. It enables launching H2O on top of Spark and using H2O capabilities, including various ML algorithms, a graphical user interface, and R integration.

GitHub | Spark Packages | code as of 7/3/2018:
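
A minimal sketch of launching H2O from a Spark application (in recent Sparkling Water versions H2OContext.getOrCreate() takes no arguments; older versions take the SparkSession):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.h2o.H2OContext

    val spark = SparkSession.builder().appName("sparkling-water-demo").getOrCreate()

    // Start H2O on top of the running Spark cluster.
    val h2oContext = H2OContext.getOrCreate()

    // Convert a Spark DataFrame into an H2OFrame to use H2O's algorithms
    // or explore it in the Flow web UI.
    val df = spark.read.parquet("data.parquet")
    val h2oFrame = h2oContext.asH2OFrame(df)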

 

Model Matrix

A framework on Apache Spark for solving a large-scale feature engineering problem: building model features for machine learning with high feature sparsity. It is built on Spark DataFrames and can read input data from, and write results to, HDFS (CSV, Parquet) and Hive.

GitHub | Spark Packages | code as of 7/3/2018:

 

Spark Stemming

A Spark MLlib wrapper around Snowball, a string processing language for creating stemming algorithms for use in information retrieval.

GitHub | Spark Packages | code as of 7/3/2018:
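
A minimal sketch, assuming the Stemmer transformer exposed by the package (an ML pipeline stage configured with input/output columns and a Snowball language):

    import org.apache.spark.mllib.feature.Stemmer

    // DataFrame with a string column of tokens to stem.
    val df = spark.createDataFrame(Seq(
      ("fishing", 1), ("fished", 2), ("fisher", 3)
    )).toDF("word", "id")

    // Snowball stemming as an ML pipeline stage.
    val stemmed = new Stemmer()
      .setInputCol("word")
      .setOutputCol("stemmed")
      .setLanguage("English")
      .transform(df)

    stemmed.show()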

 

t-Distributed Stochastic Neighbor Embedding

An Apache Spark implementation of a distributed version of t-Distributed Stochastic Neighbor Embedding (t-SNE), a dimensionality reduction technique; it approximates the reference design but can scale horizontally.

GitHub | Spark Packages | code as of 7/3/2018: