Simple Regression using TransmogrifAI

Author : Rajdeep Dua
Last Updated : May 20 2019

In this article we look at how to use TransmogrifAI for a regression use case: predicting the profit of a food truck based on the population of its city.

DataSet

Context

Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities, and you have data on profits and populations from those cities. The file contains the dataset for our linear regression exercise. The first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss.

Content

The dataset consists of one predictor and one response variable (the variable that is the outcome).

Columns of the dataset

  • population of the city
  • profit (could be positive or negative)

Dataset

../_images/simple_regression_data_table.png

Dataset plotted: Population vs Profit

../_images/simple_regression_training_data.png

Pre-Requisites

Scala, sbt and your favorite IDE (IntelliJ or Eclipse).

Schema Definition in Scala

We start by defining the case class:

package com.salesforce.hw.regression

case class SimpleRegression (
    population: Double,
    profit: Double
)

Instantiate Spark

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val conf = new SparkConf().setMaster("local[*]").setAppName("..")
implicit val spark = SparkSession.builder.config(conf).getOrCreate()

Feature Engineering

Next we define features from the schema and mark each one as a predictor or a response. We use FeatureBuilder, a factory for building features, and start by importing the following classes.

import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
val population = FeatureBuilder.RealNN[SimpleRegression].extract(_.population.toRealNN)
        .asPredictor
val profit = FeatureBuilder.RealNN[SimpleRegression].extract(_.profit.toRealNN)
        .asResponse

Reader Encoder

Next we define the encoder.

import org.apache.spark.sql.Encoders
implicit val srEncoder = Encoders.product[SimpleRegression]

Encoders.product creates an encoder for Scala product types (case classes and tuples). Encoders convert objects to and from Spark's internal row format using Catalyst expressions and code generation.

Training path and DataReader

Now we define trainFilePath, the location of the CSV file, and load the file with DataReaders.Simple.csvCase, which parses each row into the case class.

import com.salesforce.op.readers.DataReaders

val trainFilePath = "./src/main/resources/SimpleRegressionDataset/simple_regression.csv"

val trainDataReader = DataReaders.Simple.csvCase[SimpleRegression](
  path = Option(trainFilePath)
)

DataReaders contains an object Simple, which has a csvCase method. Internally this creates a CSVProductReader and calls its read method, shown below. SimpleRegression provides the schema for reading the file.

The code below is for information only:

override def read(params: OpParams = new OpParams())(
        implicit sc: SparkSession): Either[RDD[T], Dataset[T]] = Right {
    val finalPath = getFinalReadPath(params)
    val data: Dataset[T] = sc.read
      .options(options.toSparkCSVOptionsMap)
      .schema(implicitly[Encoder[T]].schema)
       // without this, every value gets read in as a string
      .csv(finalPath)
      .as[T]
    maybeRepartition(data, params)
  }

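
To make the role of the case class concrete, here is a minimal plain-Scala sketch (no Spark) of turning CSV lines into SimpleRegression values. It is not how CSVProductReader is implemented. The names CsvSketch and parseLine and the sample rows are hypothetical and used only for illustration.

```scala
import scala.util.Try

// Same shape as the case class defined earlier in the article.
case class SimpleRegression(population: Double, profit: Double)

object CsvSketch {
  // Parse one "population,profit" line; returns None for malformed rows.
  def parseLine(line: String): Option[SimpleRegression] =
    line.split(",").map(_.trim) match {
      case Array(pop, profit) =>
        Try(SimpleRegression(pop.toDouble, profit.toDouble)).toOption
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical rows, for illustration only (not taken from the real file).
    val lines = Seq("6.0,17.5", "5.5,9.0", "not,a-number", "8.5")
    // Malformed rows are dropped; valid rows become typed records.
    lines.flatMap(parseLine).foreach(println)
  }
}
```

The point of the sketch is that the field names and types of the case class decide how each column is interpreted, which is the role SimpleRegression plays for the reader above.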

Transmogrify features

Next we create a Seq of all predictor features and call transmogrify() on it.

The transmogrify() function performs the following tasks:

  • Converts the features into a single vector feature, using the feature engineering steps most likely to give good results for the types of the individual features passed in
  • Takes an optional label parameter: a label feature passed into stages that require the label column (not used in this case)
  • Returns a vector feature

import com.salesforce.op._
val features = Seq(population).transmogrify()

Modelling and Evaluation

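Next we define a DataSplitter and use it to create a RegressionModelSelector, setting the label and the feature vector as inputs. The output is the prediction feature, which we get by calling getOutput() on the selector.

import com.salesforce.op.stages.impl.regression.RegressionModelSelector
import com.salesforce.op.stages.impl.regression.RegressionModelsToTry.{OpGBTRegressor, OpRandomForestRegressor}
import com.salesforce.op.stages.impl.tuning.DataSplitter

val randomSeed = 42L
// Splitter passed to the selector; default settings are assumed here
val splitter = DataSplitter(seed = randomSeed)

val prediction = RegressionModelSelector
  .withCrossValidation(
    dataSplitter = Some(splitter), seed = randomSeed,
    modelTypesToUse = Seq(OpGBTRegressor, OpRandomForestRegressor)
  ).setInput(profit, features).getOutput()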
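
For readers new to cross-validation, the idea behind withCrossValidation can be sketched in plain Scala: the rows are shuffled with a seed and divided into k folds, and each fold in turn serves as the validation set while the rest is used for training. This is only an illustration of the general technique, not TransmogrifAI's implementation. The names KFoldSketch and kFold and the values of n and k are assumptions.

```scala
import scala.util.Random

object KFoldSketch {
  // Split row indices 0 until n into k (train, validation) pairs.
  def kFold(n: Int, k: Int, seed: Long): Seq[(Seq[Int], Seq[Int])] = {
    require(k >= 2 && k <= n, "need 2 <= k <= n")
    // A fixed seed makes the shuffle, and so the folds, reproducible.
    val shuffled = new Random(seed).shuffle((0 until n).toVector)
    // Round-robin assignment keeps fold sizes within one row of each other.
    val folds = (0 until k).map { f =>
      shuffled.zipWithIndex.collect { case (idx, pos) if pos % k == f => idx }
    }
    (0 until k).map { f =>
      val validation = folds(f)
      val train = (0 until k).filter(_ != f).flatMap(j => folds(j))
      (train, validation)
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sizes: 10 rows, 5 folds.
    kFold(n = 10, k = 5, seed = 42L).foreach { case (train, validation) =>
      println(s"train=${train.sorted.mkString(",")} validation=${validation.sorted.mkString(",")}")
    }
  }
}
```

Each candidate model is fitted on the training part of every split and scored on the validation part, and the average score is what lets a selector compare candidates fairly.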

Create Evaluator

Finally we define the evaluator using Evaluators.Regression, a factory for evaluators that compute regression metrics. The metrics returned are root mean squared error (RMSE), mean squared error (MSE), R2 and mean absolute error (MAE).

import com.salesforce.op.evaluators.Evaluators
val evaluator = Evaluators.Regression()
  .setLabelCol(profit)
  .setPredictionCol(prediction)

Workflow - create and train

Create workflow

Now we define the workflow. An OpWorkflow takes the final features that the user wants to generate as inputs and constructs the full DAG needed to generate them from the lineage of those features. It then fits any estimators in the pipeline DAG to create a sequence of transformations that are saved in a workflow model.

val workflow = new OpWorkflow().setResultFeatures(prediction, profit).
    setReader(trainDataReader)

Train the workflow

The train() method is called on the workflow instance to fit all of the estimators in the pipeline and return a pipeline model containing only transformers. It uses the data loaded by the data reader to generate the initial data set.

val workflowModel = workflow.train()

Score and Evaluate the Model

To score and evaluate the model, we call scoreAndEvaluate(..) on workflowModel as shown below and get the scores and evaluation parameters.

val dfScoreAndEvaluate = workflowModel.scoreAndEvaluate(evaluator)
val dfScore = dfScoreAndEvaluate._1.
    withColumnRenamed("population-profit_3-stagesApplied_Prediction_00000000000f",
    "predicted_profit")
val dfEvaluate = dfScoreAndEvaluate._2

Here we renamed the scored column to predicted_profit.

Evaluator output

Print the output of the evaluator:

println("Evaluate:\n" + dfEvaluate.toString())

Output of the print statement above will be:

{
  "RootMeanSquaredError" : 3.2384424508547105,
  "MeanSquaredError" : 10.487509507497863,
  "R2" : 0.6509976667047017,
  "MeanAbsoluteError" : 2.3879064702007646
}

The chosen model has an MSE of about 10.49 and an RMSE of about 3.24.
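
The four reported metrics follow standard definitions, and RMSE is simply the square root of MSE (3.238 squared is roughly 10.49, consistent with the output above). The sketch below computes them in plain Scala on a tiny made-up example. The names RegressionMetricsSketch and compute and the sample values are hypothetical, and this is not the code the evaluator runs.

```scala
object RegressionMetricsSketch {
  final case class Metrics(mse: Double, rmse: Double, mae: Double, r2: Double)

  // labels: observed values; preds: model predictions (same length, non-empty).
  // R2 is undefined (NaN or infinite) when all labels are identical.
  def compute(labels: Seq[Double], preds: Seq[Double]): Metrics = {
    require(labels.nonEmpty && labels.size == preds.size, "inputs must align")
    val n = labels.size.toDouble
    val errors = labels.zip(preds).map { case (y, p) => y - p }
    val sse = errors.map(e => e * e).sum           // sum of squared errors
    val mae = errors.map(e => math.abs(e)).sum / n // mean absolute error
    val mean = labels.sum / n
    val ssTot = labels.map(y => (y - mean) * (y - mean)).sum
    val mse = sse / n
    Metrics(mse, math.sqrt(mse), mae, 1.0 - sse / ssTot)
  }

  def main(args: Array[String]): Unit = {
    // Made-up values for illustration only.
    println(compute(Seq(1.0, 2.0, 3.0), Seq(1.0, 2.0, 4.0)))
  }
}
```

An R2 of about 0.65, as reported above, means the model explains roughly 65% of the variance in profit on the scored data.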

Show the scores

dfScore.show(false)

Output will be similar to:

+--------------------+-------+----------------------------------+
|key                 |profit |predicted_profit                  |
+--------------------+-------+----------------------------------+
|4888594985882367692 |17.592 |[prediction -> 4.230561150603311] |
|8020157188189231317 |9.1302 |[prediction -> 2.5443535144523106]|
|734589599645700297  |13.662 |[prediction -> 7.781771232315265] |
|2073615787707515974 |11.854 |[prediction -> 4.704125138578778] |
|8227609953757034682 |6.8233 |[prediction -> 3.4318698944369954]|
|-3430766866446242375|11.886 |[prediction -> 7.737034008872041] |
|7526338684630736359 |4.3483 |[prediction -> 4.687224111323368] |
|4921755212716439631 |12.0   |[prediction -> 7.781771232315265] |
|-466660369762144986 |6.5987 |[prediction -> 4.497698003899803] |
|-2662461837352799172|3.8166 |[prediction -> 1.5378404994651533]|
|5748526649699471540 |3.2522 |[prediction -> 2.8269199669197933]|
|-4491317630949059399|15.505 |[prediction -> 15.253309190943435]|
|-1036466213603419833|3.1551 |[prediction -> 3.4318698944369954]|
|49827461235003711   |7.2258 |[prediction -> 7.737034008872041] |
|8021242688795550819 |0.71618|[prediction -> 2.7841598383912647]|
|-7789624635709463055|3.5129 |[prediction -> 1.6806952063426013]|
|4172135869839283180 |5.3048 |[prediction -> 4.497698003899803] |
|2924631943149725529 |0.56077|[prediction -> 1.5378404994651533]|
|-2298902207799678212|3.6518 |[prediction -> 4.497698003899803] |
|8510971026908274006 |5.3893 |[prediction -> 4.704125138578778] |
+--------------------+-------+----------------------------------+

We extract the predicted_profit value and store it in a file for plotting.

// Read the "prediction" entry from the map-typed column at index 2 (predicted_profit)
val dfScoreMod = dfScore.rdd.map(x => x.getMap[String, Double](2)("prediction"))
dfScoreMod.foreach(println)
dfScoreMod.saveAsTextFile("./output/simple_regression/predictions")

In the next section we plot the output along with the training data values.

Plotting the Actual vs Predicted Score

We used pandas to plot the actual vs predicted values from the original dataset and the output obtained above.

../_images/simple_regression_actual_vs_predicted_profit.png

Summary

In this article we looked at a simple regression use case with one input variable (population) and one dependent variable (profit). We used TransmogrifAI to select a good first model, the candidate with the lowest error as measured by RMSE.