Model Matrix
A Spark framework for solving large-scale feature engineering problems: building model features for machine learning with high feature sparsity. It is built on top of Spark DataFrames and can read input data from and write results to HDFS (CSV, Parquet) and Hive. It is an alternative to the Spark machine learning pipeline feature extractors, focused on building sparse feature vectors.
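For intuition about "high feature sparsity": each row ends up as a vector in which only a handful of the feature columns are non-zero. A minimal Spark MLlib sketch of such a vector (this is only an illustration of the output shape, not the Model Matrix API; the sizes and indices are made up):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// A row with 100,000 feature columns of which only three are non-zero,
// e.g. two one-hot encoded categorical values plus one continuous feature.
val row: Vector = Vectors.sparse(
  100000,                  // total number of feature columns
  Array(3, 1547, 99002),   // indices of the non-zero columns
  Array(1.0, 1.0, 0.73)    // values at those indices
)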
Use
To get the latest version of Model Matrix, add the following resolver to your SBT build:
resolvers += "Collective Media Bintray" at "https://dl.bintray.com/collectivemedia/releases"
and add the following library dependency:
libraryDependencies += "com.collective.modelmatrix" %% "modelmatrix-client" % "0.0.1"
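Put together, a minimal build.sbt might look like the sketch below (the project name and Scala version are placeholder assumptions, not requirements of Model Matrix; use the Scala version matching your Spark build):

// build.sbt - minimal sketch of a project depending on the Model Matrix client
name := "my-modelmatrix-app"

scalaVersion := "2.10.5" // example only

resolvers += "Collective Media Bintray" at "https://dl.bintray.com/collectivemedia/releases"

libraryDependencies += "com.collective.modelmatrix" %% "modelmatrix-client" % "0.0.1"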
Once the Model Matrix schema is installed in your PostgreSQL database, package the CLI distribution by running the following in the project root directory:
sbt universal:packageBin
Unzip the resulting archive to start using Model Matrix. Next, run a simple CLI command to verify that the schema was installed successfully:
bin/modelmatrix-cli definition list
Add a new model matrix definition from a config file and view the features of the new definition:
bin/modelmatrix-cli definition add --config ./model.conf
bin/modelmatrix-cli definition view features --definition-id 1
Check that the definition can be used to build a model instance from a Hive table:
bin/modelmatrix-cli instance validate --definition-id 1 --source hive://mm.clicks_2015_05_05
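The hive://mm.clicks_2015_05_05 source refers to the clicks_2015_05_05 table in the mm Hive database. Model Matrix reads it internally; as a rough illustration only (not the Model Matrix API), the equivalent direct read with Spark would be:

import org.apache.spark.sql.hive.HiveContext

// sc is an existing SparkContext; the table name mirrors the hive:// source above
val hiveContext = new HiveContext(sc)
val clicks = hiveContext.table("mm.clicks_2015_05_05")
clicks.printSchema() // these columns must include the ones referenced by the definition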
Create a model instance by applying the definition to input data (this calculates the categorical and continuous feature transformations based on the shape of the input data):
bin/modelmatrix-cli instance create \
  --definition-id 1 \
  --source hive://mm.clicks_2015_05_05 \
  --name clicks \
  --comment "getting started"
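The categorical transformations mentioned above are conceptually similar to string indexing followed by one-hot encoding. The Spark ML sketch below is only an analogy for intuition, not how Model Matrix is implemented; the ad_network column name is made up, and clicks is the DataFrame read from Hive in the earlier snippet (any DataFrame with a string column works):

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Map each distinct categorical value to a numeric index
val indexer = new StringIndexer()
  .setInputCol("ad_network")
  .setOutputCol("ad_network_idx")

// Expand the index into a sparse vector with one column per observed value
val encoder = new OneHotEncoder()
  .setInputCol("ad_network_idx")
  .setOutputCol("ad_network_vec")

val indexed = indexer.fit(clicks).transform(clicks)
val encoded = encoder.transform(indexed)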
View the feature transformations and columns of the new instance:
bin/modelmatrix-cli instance view features --instance-id ID
bin/modelmatrix-cli instance view columns --instance-id ID
where ID is the instance id printed by the instance create command above.
Release
No releases have been published to Spark Packages yet.