Multithreaded and Spark parallelization of FS filters

Multithreaded and Spark parallelization of feature selection filters

Vast amounts of data are generated every day, in volumes that are challenging to analyze. Feature selection is an advisable preprocessing step when tackling large datasets. Among the tools that provide this functionality, Weka is one of the most popular, but its implementations struggle with large datasets and their running times become impractical. Parallel processing can help alleviate this problem. Multithreading and distributed programming harness the computational power of multicore machines and can dramatically speed up the feature selection process, allowing users to work with Big Data. This work focuses on the reimplementation of four popular feature selection algorithms included in Weka. For each algorithm, a multithreaded implementation, previously not included in Weka, and a parallel Spark implementation were developed. Experimental results obtained from tests on real-world datasets show that the new versions offer significant reductions in processing time.

Use

To build the project using Maven, run:

mvn clean package

Then, to use the ReliefF method:

spark-submit --class org.apache.spark.mllib.feature.ReliefFFeatureSelector PATH_TO_COMPILED_JAR PATH_TO_LIBSVM_FILE

To use CFS:

spark-submit --class org.apache.spark.mllib.feature.CFSFeatureSelector PATH_TO_COMPILED_JAR PATH_TO_LIBSVM_FILE

To use SVM-RFE:

spark-submit --class org.apache.spark.mllib.feature.SVMRFEFeatureSelector PATH_TO_COMPILED_JAR PATH_TO_LIBSVM_FILE
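
All three commands take a LIBSVM file as input. If your data is not in that format yet, the following is a minimal Python sketch of one way to produce it. It assumes the selectors read standard LIBSVM text (one sample per line, `<label> <index>:<value> ...`, with 1-based ascending indices and zero values omitted). The helper names `to_libsvm_line` and `write_libsvm`, the toy data and the output file name are hypothetical and not part of this project.

```python
from typing import Iterable, Sequence


def to_libsvm_line(label: float, features: Sequence[float]) -> str:
    """Format one sample as a LIBSVM line: '<label> <index>:<value> ...'.

    Indices are 1-based and ascending. Zero-valued features are skipped,
    which is what makes the format sparse.
    """
    pairs = [
        f"{index}:{float(value)!r}"
        for index, value in enumerate(features, start=1)
        if value != 0
    ]
    return " ".join([f"{label:g}"] + pairs)


def write_libsvm(
    path: str,
    labels: Iterable[float],
    rows: Iterable[Sequence[float]],
) -> None:
    """Write labels and dense feature rows to `path`, one sample per line."""
    with open(path, "w", encoding="utf-8") as out:
        for label, row in zip(labels, rows):
            out.write(to_libsvm_line(label, row) + "\n")


if __name__ == "__main__":
    # Hypothetical toy data: 3 samples, 4 features, binary labels.
    labels = [1, 0, 1]
    rows = [
        [0.5, 0.0, 2.0, 0.0],
        [0.0, 1.5, 0.0, 0.0],
        [3.0, 0.0, 0.0, 4.25],
    ]
    write_libsvm("toy.libsvm", labels, rows)
```

The resulting file can then be passed as PATH_TO_LIBSVM_FILE in any of the commands above.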

 

Release

The latest version is: Spark, Weka

Reference

Eiras-Franco, Carlos, Verónica Bolón-Canedo, Sabela Ramos, Jorge González-Domínguez, Amparo Alonso-Betanzos, and Juan Touriño. "Multithreaded and Spark parallelization of feature selection filters." Journal of Computational Science 17 (2016): 609-619.