BigDaPFlink Library

BigDaPTOOLS Project


DPASF (Data Preprocessing Algorithms for Streaming in Flink)

This library contains six of the most popular data preprocessing algorithms for online data (data streams): three for discretization and three for feature selection. They are implemented on top of Apache Flink, a stream processing framework for Big Data.

Associated paper:


Fast Correlation-Based Filter (FCBF)

FCBF is a multivariate feature selection method where the class relevance and the dependency between each feature pair are taken into account. Based on information theory, FCBF uses symmetrical uncertainty to calculate dependencies of features and the class relevance. Starting with the full feature set, FCBF heuristically applies a backward selection technique with a sequential search strategy to remove irrelevant and redundant features. The algorithm stops when there are no features left to eliminate.
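
The symmetrical uncertainty measure mentioned above can be illustrated with a small standalone sketch. It is written in plain Python for illustration only and is not the library's Flink implementation; the function names `entropy` and `symmetrical_uncertainty` are hypothetical, and the features are assumed to be already discrete.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        # Both variables are constant: treat them as carrying no information.
        return 0.0
    hxy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    mutual_info = hx + hy - hxy     # I(X; Y)
    return 2.0 * mutual_info / (hx + hy)


# A feature identical to the class is fully relevant; an independent one is not.
label = [0, 0, 1, 1]
relevant = [0, 0, 1, 1]
irrelevant = [0, 1, 0, 1]
su_relevant = symmetrical_uncertainty(relevant, label)
su_irrelevant = symmetrical_uncertainty(irrelevant, label)
```

In a filter of this kind, SU between a feature and the class would measure relevance, and SU between two features would measure redundancy.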

GitHub code as of 28/09/2018:

Online Feature Selection (OFS)

OFS is an ε-greedy online feature selection method. It selects features using the weights generated by an online classifier (neural networks), making a trade-off between exploration and exploitation of features.
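
The exploration/exploitation trade-off can be sketched as follows. This is a simplified illustration in plain Python, not the library's code: the function `choose_features`, the `budget` parameter, and the rule of ranking by absolute weight are assumptions made for the example.

```python
import random


def choose_features(weights, budget, epsilon, rng):
    """Pick `budget` feature indices with an epsilon-greedy rule.

    With probability `epsilon` the features are drawn uniformly at random
    (exploration); otherwise the features with the largest absolute
    classifier weights are kept (exploitation).
    """
    d = len(weights)
    if rng.random() < epsilon:
        return sorted(rng.sample(range(d), budget))
    ranked = sorted(range(d), key=lambda i: abs(weights[i]), reverse=True)
    return sorted(ranked[:budget])


rng = random.Random(42)
weights = [0.1, -0.9, 0.5, 0.0]  # hypothetical classifier weights
greedy = choose_features(weights, budget=2, epsilon=0.0, rng=rng)
explored = choose_features(weights, budget=2, epsilon=1.0, rng=rng)
```

With `epsilon=0.0` the choice is purely greedy and returns the indices of the two heaviest weights; with `epsilon=1.0` every call explores.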

GitHub code as of 28/09/2018:

Information Gain

This feature selection (FS) scheme consists of two steps: a) an incremental feature ranking method, and b) an incremental learning algorithm (Naive Bayes) that can consider a subset of the features during prediction.
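
Step a) can be sketched by keeping running counts per feature and recomputing information gain on demand. The class below is a minimal illustration in plain Python, assuming discrete feature values; the name `IncrementalInfoGain` and its methods are hypothetical and do not reflect the library's API.

```python
import math
from collections import Counter


class IncrementalInfoGain:
    """Ranks discrete features by information gain using running counts."""

    def __init__(self, n_features):
        self.n = 0
        self.class_counts = Counter()
        # One counter of (feature value, class label) pairs per feature.
        self.joint_counts = [Counter() for _ in range(n_features)]

    def update(self, x, label):
        """Fold a single labeled example into the counts."""
        self.n += 1
        self.class_counts[label] += 1
        for j, v in enumerate(x):
            self.joint_counts[j][(v, label)] += 1

    @staticmethod
    def _entropy(counts, total):
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def info_gain(self, j):
        """IG(C; X_j) = H(C) - H(C | X_j) from the counts seen so far."""
        h_class = self._entropy(self.class_counts.values(), self.n)
        value_counts = Counter()
        for (v, _), c in self.joint_counts[j].items():
            value_counts[v] += c
        h_cond = 0.0
        for v, nv in value_counts.items():
            per_class = [c for (vv, _), c in self.joint_counts[j].items() if vv == v]
            h_cond += (nv / self.n) * self._entropy(per_class, nv)
        return h_class - h_cond

    def ranking(self):
        """Feature indices ordered from most to least informative."""
        return sorted(range(len(self.joint_counts)), key=self.info_gain, reverse=True)


ranker = IncrementalInfoGain(n_features=2)
for x, label in [([0, 7], 0), ([1, 7], 1), ([0, 7], 0), ([1, 7], 1)]:
    ranker.update(x, label)
```

Here feature 0 mirrors the class and feature 1 is constant, so feature 0 ranks first. A downstream learner would then use only the top-ranked features.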

GitHub code as of 28/09/2018:

Incremental Discretization Algorithm (IDA)

Incremental Discretization Algorithm (IDA) approximates quantile-based discretization on the entire data stream encountered to date by maintaining a random sample of the data which is used to calculate the cut points. IDA uses the reservoir sampling algorithm to maintain a sample drawn uniformly at random from the entire stream up until the current time.
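
The combination of reservoir sampling and quantile cut points can be sketched for a single numeric feature as follows. This is an illustrative plain-Python version, not the library's implementation; the class name `ReservoirDiscretizer` and its parameters are assumptions, and the sample is simply re-sorted on each query rather than kept in a specialized structure.

```python
import bisect
import random


class ReservoirDiscretizer:
    """Equal-frequency discretizer backed by a uniform reservoir sample."""

    def __init__(self, sample_size, n_bins, seed=0):
        self.sample_size = sample_size
        self.n_bins = n_bins
        self.seen = 0
        self.sample = []
        self.rng = random.Random(seed)

    def update(self, value):
        """Reservoir sampling: each value seen so far is kept with equal probability."""
        self.seen += 1
        if len(self.sample) < self.sample_size:
            self.sample.append(value)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.sample_size:
                self.sample[j] = value

    def cut_points(self):
        """Approximate quantile boundaries computed from the current sample."""
        ordered = sorted(self.sample)
        n = len(ordered)
        if n == 0:
            return []
        return [ordered[(k * n) // self.n_bins] for k in range(1, self.n_bins)]

    def discretize(self, value):
        """Index of the interval that `value` falls into."""
        return bisect.bisect_right(self.cut_points(), value)


disc = ReservoirDiscretizer(sample_size=100, n_bins=4)
for v in range(100):
    disc.update(v)
cuts = disc.cut_points()  # quartile boundaries of the values 0..99
```

Once the stream exceeds `sample_size` values, the reservoir stays at a fixed size while still representing the whole stream seen so far.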

GitHub code as of 28/09/2018:

Partition Incremental Discretization algorithm (PiD)

PiD performs incremental discretization. The basic idea is to perform the task in two layers. The first layer receives the sequence of input data and keeps some statistics on the data using many more intervals than required. Based on the statistics stored by the first layer, the second layer creates the final discretization. The proposed architecture processes streaming examples in a single scan, in constant time and space even for infinite sequences of examples.
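
The two-layer idea can be sketched as follows. This is a simplified plain-Python illustration, not the library's implementation: it assumes a known value range, a fixed equal-width first layer that never splits intervals, and an equal-frequency second layer. The class name `TwoLayerDiscretizer` and its parameters are hypothetical.

```python
class TwoLayerDiscretizer:
    """Layer 1 keeps counts over many fine intervals; layer 2 merges them."""

    def __init__(self, low, high, n_fine):
        self.low = low
        self.high = high
        self.width = (high - low) / n_fine
        self.counts = [0] * n_fine

    def update(self, value):
        """Layer 1: constant-time count update for one incoming value."""
        idx = int((value - self.low) / self.width)
        idx = min(max(idx, 0), len(self.counts) - 1)  # clamp out-of-range values
        self.counts[idx] += 1

    def cut_points(self, n_bins):
        """Layer 2: merge fine intervals into roughly equal-frequency bins."""
        total = sum(self.counts)
        if total == 0:
            return []
        cuts, running, k = [], 0, 1
        for i, c in enumerate(self.counts):
            running += c
            if k < n_bins and running >= k * total / n_bins:
                cuts.append(self.low + (i + 1) * self.width)
                # Skip any further targets already covered by this interval.
                while k < n_bins and running >= k * total / n_bins:
                    k += 1
        return cuts


pid = TwoLayerDiscretizer(low=0, high=100, n_fine=100)
for v in range(100):
    pid.update(v)
final_cuts = pid.cut_points(n_bins=4)
```

Memory use depends only on the number of fine intervals, not on the length of the stream, and the second layer can be rerun at any time without revisiting the data.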

GitHub code as of 28/09/2018:

Local Online Fusion Discretizer (LOFD)

LOFD is an online, self-adaptive discretizer for streaming classification. It smoothly adapts its interval limits, reducing the negative impact of shifts, and analyzes interval labeling and interaction problems in data streaming. The interaction between discretizer and learner is addressed by providing two similar solutions. The algorithm aims at reducing the negative impact of fluctuations in evolving intervals.

GitHub code as of 28/09/2018: