Model Matrix

A Spark framework for solving a large-scale feature engineering problem: building model features for machine learning with high feature sparsity. It is built on top of Spark DataFrames and can read input data from, and write output to, HDFS (CSV, Parquet) and Hive. It is an alternative to the Spark machine learning pipeline feature extractors, focused on building sparse feature vectors.
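
For context, a sparse feature vector stores only the non-zero entries of a very wide feature space. A minimal sketch using Spark's MLlib vector type (the dimension, indices and values below are made up for illustration):

import org.apache.spark.mllib.linalg.Vectors

// 1,000,000 feature columns, of which only three are non-zero
val features = Vectors.sparse(
  1000000,                  // total number of feature columns
  Array(3, 42187, 999999),  // indices of the non-zero features
  Array(1.0, 1.0, 0.75)     // their values
)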

Use

To get the latest version of Model Matrix, add the following resolver to your SBT build:

resolvers += "Collective Media Bintray" at "https://dl.bintray.com/collectivemedia/releases"

And use the following library dependency:

libraryDependencies += "com.collective.modelmatrix" %% "modelmatrix-client" % "0.0.1"

Once you have installed the Model Matrix schema in your PostgreSQL database, you need to package the CLI distribution; in the project root directory, run:

sbt universal:packageBin

Then unzip the generated archive to start using Model Matrix. Next, run a simple CLI command to ensure that the schema was installed successfully:

bin/modelmatrix-cli definition list

Add a new model matrix definition from a config file and view the new model definition:

bin/modelmatrix-cli definition add --config ./model.conf
bin/modelmatrix-cli definition view features --definition-id 1
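
The model definition is read from a Typesafe Config (HOCON) file. The sketch below only shows how such a file can be loaded and inspected from Scala; the feature names and keys in it are illustrative placeholders, not the actual Model Matrix configuration schema, so consult the project documentation for the real format.

import com.typesafe.config.ConfigFactory
import scala.collection.JavaConverters._

// Hypothetical model.conf content; the real keys are defined by Model Matrix
val conf = ConfigFactory.parseString(
  """features {
    |  ad_network  { extract = "network",  transform = "top" }
    |  ad_position { extract = "position", transform = "index" }
    |}""".stripMargin)

// Print the configured feature names
conf.getConfig("features").root().keySet().asScala.foreach(println)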

Check that the definition can be used to build a model instance from a Hive table:

bin/modelmatrix-cli instance validate --definition-id 1 --source hive://mm.clicks_2015_05_05
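
The hive:// source points at a regular Hive table. If you need a table to experiment with, a sketch along these lines creates one from Spark (the column names below are assumptions; use whatever columns your model definition actually extracts):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object PrepareClicksTable extends App {
  val sc = new SparkContext(new SparkConf().setAppName("prepare-clicks-table"))
  val hc = new HiveContext(sc)

  hc.sql("CREATE DATABASE IF NOT EXISTS mm")
  // Column names are illustrative placeholders
  hc.sql(
    """CREATE TABLE IF NOT EXISTS mm.clicks_2015_05_05 (
      |  auction_id  STRING,
      |  ad_network  STRING,
      |  ad_position STRING,
      |  price       DOUBLE
      |) STORED AS PARQUET""".stripMargin)
}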

Create a model instance by applying the definition to input data (this calculates the categorical and continuous feature transformations based on the shape of the input data):

bin/modelmatrix-cli instance create \
       --definition-id 1 \
       --source hive://mm.clicks_2015_05_05 \
       --name clicks \
       --comment "getting started"

View the instance feature transformations and columns:

bin/modelmatrix-cli instance view features --instance-id ID
bin/modelmatrix-cli instance view columns --instance-id ID

where ID is the instance id printed by the previous command.

Release

No releases have been published to Spark Packages yet.