Simple Regression with TransmogrifAI using Runner

Author : Rajdeep Dua
Last Updated : May 20 2019

In this article we look at how to use TransmogrifAI for a regression use case: predicting the profit of a food truck based on the population of its city.

Dataset

Context

You are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has food trucks in various cities, and you have data on profits and populations for those cities. This is the dataset for our linear regression exercise: the first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss.

Content

The dataset consists of one predictor and one response variable (the variable whose value we want to predict).

Columns of the dataset

  • population of the city
  • profit (could be positive or negative)

Dataset

../_images/simple_regression_data_table.png

Dataset plotted: Population vs Profit

../_images/simple_regression_data_plot.png

Pre-Requisites

  • Scala, sbt and your favorite IDE (IntelliJ IDEA or Eclipse)

Schema Definition in Scala

We start by defining a case class that describes the schema of the dataset:

package com.salesforce.hw.regression

case class SimpleRegression (
    population: Double,
    profit: Double
)

Feature Engineering

Define features from the schema definition and mark each one as a predictor or a response.

import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._

trait SimpleRegressionFeatures extends Serializable {
  val population = FeatureBuilder.RealNN[SimpleRegression]
    .extract(_.population.toRealNN)
    .asPredictor

  val profit = FeatureBuilder.RealNN[SimpleRegression]
    .extract(_.profit.toRealNN)
    .asResponse
}

Reader and Encoder

Next we define the encoder and the CSV DataReader that map rows of the input file to the case class SimpleRegression.

implicit val srEncoder = Encoders.product[SimpleRegression]
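
The text above also mentions a CSV DataReader; a minimal sketch of it, using the DataReaders.Simple.csvCase factory that appears again later in this article:

import com.salesforce.op.readers.DataReaders

// Reads the CSV file and converts each row into a SimpleRegression instance,
// using the implicit srEncoder defined above.
val srReader = DataReaders.Simple.csvCase[SimpleRegression]()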

Transmogrify features

Next we create a Seq of all features and call transmogrify() on it.

The transmogrify() function performs the following tasks:

  • converts the features into a single vector feature, using the feature engineering steps most likely to give good results based on the types of the individual features passed in
  • accepts an optional label parameter for stages that require the label column (not applicable in this case)
  • returns a vector feature

val features = Seq(population).transmogrify()

Modelling and Evaluation

Next we define a DataSplitter and use it to create a RegressionModelSelector, setting the label and features as its inputs. The output is the prediction, which we get by calling getOutput() on the selector.

val randomSeed = 42L

// Reserve 20% of the data as a holdout test set for model selection.
val splitter = DataSplitter(reserveTestFraction = 0.2, seed = randomSeed)

val prediction = RegressionModelSelector
  .withCrossValidation(
    dataSplitter = Some(splitter), seed = randomSeed,
    modelTypesToUse = Seq(OpGBTRegressor, OpRandomForestRegressor)
  ).setInput(profit, features).getOutput()

Evaluator and Workflow

Finally we define the evaluator using Evaluators.Regression. This is a factory that computes evaluation metrics for regression. The metrics returned include RootMeanSquaredError (RMSE), MeanSquaredError (MSE), R2 and MeanAbsoluteError (MAE).

val evaluator = Evaluators.Regression()
  .setLabelCol(profit)
  .setPredictionCol(prediction)
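
If only a single metric is needed, per-metric helpers can be used instead, mirroring the Evaluators.MultiClassification.f1() pattern shown later in this article; a sketch, assuming the rmse() helper:

// Evaluator that optimizes and reports root mean squared error only
// (rmse() is assumed by analogy with the MultiClassification.f1() helper).
val rmseEvaluator = Evaluators.Regression.rmse()
  .setLabelCol(profit)
  .setPredictionCol(prediction)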

Now we define the workflow. A TransmogrifAI workflow takes the final features that the user wants to generate as inputs, constructs the full DAG needed to generate them from the features' lineage, and then fits any estimators in the pipeline DAG to create a sequence of transformations that are saved in a workflow model.

val workflow = new OpWorkflow().setResultFeatures(prediction, profit)
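
As an aside, the workflow can also be fitted directly, without the Runner machinery shown below; a minimal sketch, assuming the srReader defined earlier and an implicit SparkSession in scope:

// Attach the reader and fit every estimator in the DAG; train() returns
// an OpWorkflowModel holding the fitted transformation sequence.
val model = workflow.setReader(srReader).train()

// Print a human-readable summary of the model selection results.
println(model.summaryPretty())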

Putting it together

All the snippets above come together in the class OpSimpleRegressionBase, which the runner objects below instantiate.

class OpSimpleRegressionBase extends SimpleRegressionFeatures {

  implicit val srEncoder = Encoders.product[SimpleRegression]

  val srReader = DataReaders.Simple.csvCase[SimpleRegression]()

  val features = Seq(population).transmogrify()

  val randomSeed = 42L

  val splitter = DataSplitter(reserveTestFraction = 0.2, seed = randomSeed)

  val prediction = RegressionModelSelector
    .withCrossValidation(
      dataSplitter = Some(splitter), seed = randomSeed,
      modelTypesToUse = Seq(OpGBTRegressor, OpRandomForestRegressor)
    ).setInput(profit, features).getOutput()

  val evaluator = Evaluators.Regression().setLabelCol(profit)
    .setPredictionCol(prediction)

  val workflow = new OpWorkflow().setResultFeatures(prediction, profit)
}

Training the model

We extend OpAppWithRunner with SimpleRegressionFeatures

object OpSimpleRegressionTrain extends OpAppWithRunner with SimpleRegressionFeatures {
 ...
}

In the OpSimpleRegressionTrain object we override the main(args: Array[String]) function of OpAppWithRunner and implement the runner() function. Notice the myArgs array, which contains all the parameters listed below:

  • --run-type=train
  • --model-location=/tmp/sr-model
  • --read-location SimpleRegression=./src/main/resources/SimpleRegressionDataset/simple_regression.csv

object OpSimpleRegressionTrain extends OpAppWithRunner with SimpleRegressionFeatures {

  val conf = new SparkConf().setMaster("local[*]").setAppName("SRPrediction")
  implicit val spark = SparkSession.builder.config(conf).getOrCreate()

  val opSRBase = new OpSimpleRegressionBase()

  def runner(opParams: OpParams): OpWorkflowRunner =
    new OpWorkflowRunner(
      workflow = opSRBase.workflow,
      trainingReader = opSRBase.srReader,
      scoringReader = opSRBase.srReader,
      evaluationReader = Option(opSRBase.srReader),
      evaluator = Option(opSRBase.evaluator),
      featureToComputeUpTo = Option(opSRBase.features)
    )

  override def main(args: Array[String]): Unit = {
    val myArgs = Array("--run-type=train", "--model-location=/tmp/sr-model",
      "--read-location", "SimpleRegression=./src/main/resources/SimpleRegressionDataset/simple_regression.csv")
    val (runType, opParams) = parseArgs(myArgs)
    val batchDuration = Duration(opParams.batchDurationSecs.getOrElse(1), TimeUnit.SECONDS)
    val (spark, streaming) = sparkSession -> sparkStreamingContext(batchDuration)
    run(runType, opParams)(spark, streaming)
  }

}

Output of Training

Output of the training is stored in /tmp/sr-model. The actual model is in a JSON file: /tmp/sr-model/op-model.json/part-00000
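
The saved model can also be reloaded programmatically for inspection; a minimal sketch, assuming the workflow defined earlier and an active SparkSession:

// Reload the fitted model from the training output directory.
val model = workflow.loadModel("/tmp/sr-model")

// Show which model type and hyper-parameters won model selection.
println(model.summaryPretty())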

Scoring the Model

To score the model, we follow the same process as above but with slightly different flags. Notice the --run-type parameter, which is now --run-type=score.

object OpSimpleRegressionScore extends OpAppWithRunner with SimpleRegressionFeatures {

  val conf = new SparkConf().setMaster("local[*]").setAppName("SRPrediction")
  implicit val spark = SparkSession.builder.config(conf).getOrCreate()

  val opSRBase = new OpSimpleRegressionBase()

  def runner(opParams: OpParams): OpWorkflowRunner =
    new OpWorkflowRunner(
      workflow = opSRBase.workflow,
      trainingReader = opSRBase.srReader,
      scoringReader = opSRBase.srReader,
      evaluationReader = Option(opSRBase.srReader),
      evaluator = Option(opSRBase.evaluator),
      featureToComputeUpTo = Option(opSRBase.features)
    )

  override def main(args: Array[String]): Unit = {
    val myArgs = Array("--run-type=score", "--model-location=/tmp/sr-model",
      "--read-location", "SimpleRegression=./src/main/resources/SimpleRegressionDataset/simple_regression.csv",
      "--write-location=/tmp/sr-scores")
    val (runType, opParams) = parseArgs(myArgs)
    val batchDuration = Duration(opParams.batchDurationSecs.getOrElse(1), TimeUnit.SECONDS)
    val (spark, streaming) = sparkSession -> sparkStreamingContext(batchDuration)
    run(runType, opParams)(spark, streaming)
  }

}

Output is stored in /tmp/sr-scores as specified above.
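
Equivalently, scoring can be done programmatically instead of through the score run-type; a sketch, assuming the workflow and opSRBase values defined earlier:

// Reload the model, attach the reader and compute scores as a DataFrame.
val scores = workflow.loadModel("/tmp/sr-model")
  .setReader(opSRBase.srReader)
  .score()

scores.show(5)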

Evaluating the model

We need to use --run-type=evaluate and the write-location flag --write-location=/tmp/sr-eval.

object OpSimpleRegressionEvaluate extends OpAppWithRunner with SimpleRegressionFeatures {

  ...
  override def main(args: Array[String]): Unit = {
    val metricsLocation = "/tmp/sr-metrics"
    val evalLocation = "/tmp/sr-eval"

    val myArgs = Array("--run-type=evaluate", "--model-location=/tmp/sr-model",
      "--metrics-location=" + metricsLocation,
      "--read-location",
      "SimpleRegression=./src/main/resources/SimpleRegressionDataset/simple_regression.csv",
      "--write-location=" + evalLocation)

    // Parse the arguments and launch the run, as in the training object above.
    val (runType, opParams) = parseArgs(myArgs)
    val batchDuration = Duration(opParams.batchDurationSecs.getOrElse(1), TimeUnit.SECONDS)
    val (spark, streaming) = sparkSession -> sparkStreamingContext(batchDuration)
    run(runType, opParams)(spark, streaming)
  }
}

Model Metrics

The model produces the following metrics:

  • ...

Selecting the best model

AutoML compared error metrics across candidate models to arrive at the best one. It tried GBT (Gradient Boosted Trees) Regression and Random Forest with various hyper-parameters.

The plot below compares the error for Random Forest vs GBT (Gradient Boosted Trees).

../_images/simple_regression_algo_vs_error_plot.png

The plot below shows how error rates vary with the max-depth hyper-parameter for the Random Forest algorithm.

../_images/simple_regression_randomforest_error_vs_maxdepth.png

The plot below shows how error rates vary with the max-depth hyper-parameter for the Gradient Boosted Trees regression algorithm.

../_images/simple_regression_gbt_error_vs_maxdepth.png