This library contains all the data preprocessing algorithms developed in Apache Spark. They are available on both Spark Packages and GitHub.
Information-theoretic framework for feature selection that includes minimum redundancy maximum relevance (mRMR), InfoGain, JMI, and other commonly used feature selection filters.
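To illustrate the kind of filter this framework provides, here is a minimal single-machine sketch of an information-gain ranking over discrete features. It does not use the library's distributed API; the dataset and feature names are illustrative.

```python
# Minimal sketch of an information-gain feature-selection filter
# (illustrative only; not the package's Spark API).
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of a list of discrete labels."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def info_gain(feature, labels):
    """InfoGain I(X; Y) = H(Y) - H(Y | X) for discrete columns."""
    n = len(labels)
    cond = 0.0
    for v in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# Toy dataset: f1 perfectly predicts the class, f2 is noise.
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
y  = [0, 0, 1, 1]
ranking = sorted([("f1", info_gain(f1, y)), ("f2", info_gain(f2, y))],
                 key=lambda t: -t[1])
print(ranking[0][0])  # f1 ranks first
```

The distributed versions compute the same mutual-information quantities, but aggregate the counts across partitions.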
Evolutionary discretizer that uses a binary chromosome representation and a wrapper fitness function to optimize the cut-point selection problem by trading off two factors: the simplicity of the solutions and their classification accuracy. To alleviate the computational cost, the evaluation phase is parallelized by splitting the set of chromosomes and instances into different partitions and performing a random cross-evaluation process.
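The trade-off in the fitness function can be sketched as follows. This is a toy illustration of the idea only: the weight `alpha` and the scoring formula are assumptions, not the package's actual fitness definition.

```python
# Toy sketch of a wrapper fitness: a binary chromosome marks which
# candidate cut points are kept, and fitness trades off classifier
# accuracy against solution complexity (weights are illustrative).
def fitness(chromosome, accuracy, alpha=0.7):
    """Higher is better: reward accuracy, penalise complex solutions."""
    simplicity = 1.0 - sum(chromosome) / len(chromosome)
    return alpha * accuracy + (1 - alpha) * simplicity

sparse = [1, 0, 0, 0]   # one cut point kept
dense  = [1, 1, 1, 1]   # all cut points kept
# At equal accuracy, the simpler discretization scores higher.
print(fitness(sparse, 0.9) > fitness(dense, 0.9))  # True
```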
Ensemble method that is a distributed upgrade of the method presented by A. Ahmad. It applies Random Discretization and Principal Component Analysis (PCA) to the input data, joins the results, and trains a decision tree on the resulting data.
This framework implements two Big Data preprocessing approaches to remove noisy examples, a homogeneous ensemble (HME-BD) and a heterogeneous ensemble (HTE-BD) filter, with special emphasis on their scalability and performance traits. A simpler filtering approach based on similarities between instances (ENN-BD) is also implemented.
This framework implements four distance-based Big Data preprocessing algorithms for prototype selection and generation (FCNN_MR, SSMASFLSDE_MR, RMHC_MR, and MR_DIS), with special emphasis on their scalability and performance traits.
This framework implements four distance-based Big Data preprocessing algorithms to remove noisy examples (the ENN_BD, AllKNN_BD, NCNEdit_BD, and RNG_BD filters), with special emphasis on their scalability and performance traits.
This contribution implements two approaches to k Nearest Neighbor Imputation, both focused on scalability in order to handle big datasets: k Nearest Neighbor Local Imputation and k Nearest Neighbor Global Imputation. The global approach takes all instances into account when computing the k nearest neighbors. The local approach considers only the instances within the same partition, achieving shorter runtimes but losing information because it does not consider all the samples.
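The global/local distinction can be illustrated on a single machine. The sketch below is not the package's Spark API; the donor sets simply stand in for "all partitions" versus "the row's own partition".

```python
# Minimal sketch contrasting global vs local kNN imputation
# (illustrative only; not the package's distributed implementation).
from math import sqrt

def knn_impute(row, donors, idx, k=2):
    """Fill attribute `idx` of `row` with the mean of that attribute
    over the k nearest complete donor rows (distance ignores `idx`)."""
    def dist(a, b):
        return sqrt(sum((x - y) ** 2
                        for i, (x, y) in enumerate(zip(a, b)) if i != idx))
    nearest = sorted(donors, key=lambda d: dist(row, d))[:k]
    return sum(d[idx] for d in nearest) / k

row = [1.0, float("nan")]          # second attribute is missing
# Global imputation: donors drawn from every partition.
all_donors = [[0.9, 10.0], [1.1, 12.0], [9.0, 99.0]]
# Local imputation: only donors in the row's own partition; faster,
# but it may miss closer neighbours stored elsewhere.
local_donors = [[9.0, 99.0], [8.0, 95.0]]

print(knn_impute(row, all_donors, idx=1))    # 11.0 (two closest rows)
print(knn_impute(row, local_donors, idx=1))  # 97.0 (biased by partition)
```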
This method implements Fayyad's discretizer, based on the Minimum Description Length Principle (MDLP), to treat non-discrete datasets from a distributed perspective. It supports sparse data, parallel processing of attributes, and more.
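At the core of Fayyad and Irani's discretizer is the MDLP stopping criterion: a candidate cut point is accepted only if its information gain exceeds a description-length penalty. A minimal single-machine sketch of that test (not the distributed implementation):

```python
# Sketch of the Fayyad-Irani MDLP acceptance test for one cut point
# on one attribute (illustrative; the package evaluates attributes
# in parallel over a distributed dataset).
from collections import Counter
from math import log2

def ent(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def mdlp_accepts(pairs, cut):
    """MDLP criterion: accept `cut` on (value, label) pairs iff
    Gain > (log2(N - 1) + Delta) / N."""
    left  = [y for x, y in pairs if x <= cut]
    right = [y for x, y in pairs if x > cut]
    labels = [y for _, y in pairs]
    n = len(labels)
    gain = ent(labels) - (len(left) / n) * ent(left) \
                       - (len(right) / n) * ent(right)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * ent(labels)
                                - k1 * ent(left) - k2 * ent(right))
    return gain > (log2(n - 1) + delta) / n

# A cut that cleanly separates the classes is accepted...
clean = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.4, "a"),
         (1.1, "b"), (1.2, "b"), (1.3, "b"), (1.4, "b")]
print(mdlp_accepts(clean, 0.5))   # True
# ...while a cut on interleaved noise is rejected.
noisy = [(0.1, "a"), (0.2, "b"), (0.3, "a"), (0.4, "b")]
print(mdlp_accepts(noisy, 0.25))  # False
```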
Spark implementations of two data sampling methods (random oversampling and random undersampling) for imbalanced data.
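The two sampling methods are simple to state, so a single-machine sketch suffices to show the idea (the package runs the same logic on Spark RDDs; the function names here are illustrative):

```python
# Minimal sketch of random oversampling/undersampling for binary
# class imbalance (illustrative; not the package's Spark API).
import random

def random_oversample(minority, target_size, seed=0):
    """Grow the minority class by sampling with replacement."""
    rng = random.Random(seed)
    return minority + [rng.choice(minority)
                       for _ in range(target_size - len(minority))]

def random_undersample(majority, target_size, seed=0):
    """Shrink the majority class by sampling without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, target_size)

majority = list(range(100))   # 100 majority examples
minority = list(range(5))     # 5 minority examples
print(len(random_oversample(minority, len(majority))))   # 100
print(len(random_undersample(majority, len(minority))))  # 5
```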
An Apache Spark package containing a distributed implementation of the classical ReliefF algorithm.
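For intuition, here is a simplified single-machine sketch of the Relief idea that ReliefF generalizes (binary classes, one nearest hit and miss per instance, Manhattan distance); the distributed package handles multi-class data, k neighbors, and partitioned distance computation.

```python
# Simplified Relief weight update (illustrative sketch only; not the
# distributed ReliefF implementation in the package).
def relief_weights(X, y, n_features):
    def dist(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))
    w = [0.0] * n_features
    for i, (xi, yi) in enumerate(zip(X, y)):
        others = [(xj, yj) for j, (xj, yj) in enumerate(zip(X, y)) if j != i]
        hit  = min((x for x, c in others if c == yi),
                   key=lambda x: dist(x, xi))   # nearest same-class row
        miss = min((x for x, c in others if c != yi),
                   key=lambda x: dist(x, xi))   # nearest other-class row
        for f in range(n_features):
            # Relevant features differ on misses and agree on hits.
            w[f] += abs(xi[f] - miss[f]) - abs(xi[f] - hit[f])
    return [v / len(X) for v in w]

# Feature 0 separates the classes; feature 1 is constant noise.
X = [[0.0, 1.0], [0.1, 1.0], [1.0, 1.0], [0.9, 1.0]]
y = [0, 0, 1, 1]
w = relief_weights(X, y, 2)
print(w[0] > w[1])  # True: feature 0 gets the higher weight
```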
Reimplementation of four popular feature selection algorithms included in Weka, providing multithreaded implementations previously not available in Weka as well as a parallel Spark implementation of each algorithm.
Provides a connection between H2O and Spark. It enables launching H2O on top of Spark and using H2O capabilities, including various ML algorithms, a graphical user interface, and R integration.
Framework on Apache Spark for solving a large-scale feature engineering problem: building model features for machine learning with high feature sparsity. It is built on Spark DataFrames and can read input data from and write results to HDFS (CSV, Parquet) and Hive.
Spark MLlib wrapper around Snowball, a string processing language for creating stemming algorithms for use in Information Retrieval.
Apache Spark implementation of a distributed version of t-Distributed Stochastic Neighbor Embedding (t-SNE), a technique for dimensionality reduction; it approximates the reference design but can scale horizontally.