# Simple Regression using TransmogrifAI

```
Author : Rajdeep Dua
Last Updated : May 20 2019
```

In this article we look at how to use TransmogrifAI for a regression use case: predicting the profit of a food truck based on the population of its city.

## Dataset

### Context

Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities, and you have data for profits and populations from those cities. The file contains the data for our linear regression exercise. The first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss.

### Content

The dataset consists of one predictor and one response variable (the variable that is the outcome).

Columns of the dataset

• population of the city
• profit (could be positive or negative)
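
Because there is a single predictor, a straight-line fit is a natural reference point for data of this shape. The sketch below is not part of the TransmogrifAI pipeline. It is a minimal ordinary least squares fit in plain Scala, using a hypothetical `fit` helper and made-up data points, shown only to make the predictor/response relationship concrete.

```scala
object LeastSquaresSketch {
  // Hypothetical helper: fit profit = intercept + slope * population
  // by ordinary least squares. Returns (intercept, slope).
  def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    require(xs.size == ys.size && xs.size >= 2, "need at least two paired points")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
    require(sxx > 0, "population values must not all be equal")
    val slope = sxy / sxx
    (meanY - slope * meanX, slope)
  }

  def main(args: Array[String]): Unit = {
    // Made-up points that lie exactly on profit = -3 + 2 * population.
    val population = Seq(1.0, 2.0, 3.0, 4.0)
    val profit = Seq(-1.0, 1.0, 3.0, 5.0)
    val (intercept, slope) = fit(population, profit)
    println(f"profit = $intercept%.2f + $slope%.2f * population")
  }
}
```

A fit like this can serve as a sanity check against the tree-based models that the article selects later.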

Dataset

Dataset plotted: Population vs Profit

## Source Code

github.com_transmogrify_samples

## Pre-Requisites

Scala, sbt and your favorite IDE (IntelliJ or Eclipse)

## Schema Definition in Scala

We start by defining the case class that describes one row of the dataset.

```scala
package com.salesforce.hw.regression

case class SimpleRegression(
  population: Double,
  profit: Double
)
```
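
To make the mapping from file to schema concrete, here is a minimal sketch in plain Scala of how one CSV line corresponds to the case class. The `parseLine` helper and the sample lines are hypothetical and are not part of TransmogrifAI. The library's own reader, introduced later, does this work for you.

```scala
import scala.util.Try

object CsvRowSketch {
  // Same two fields as the case class above, repeated so the sketch is self-contained.
  case class SimpleRegression(population: Double, profit: Double)

  // Hypothetical helper: parse one "population,profit" line.
  // Returns None when the line does not have exactly two numeric fields.
  def parseLine(line: String): Option[SimpleRegression] =
    line.split(",").map(_.trim) match {
      case Array(pop, prof) =>
        Try(SimpleRegression(pop.toDouble, prof.toDouble)).toOption
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    // Made-up lines for illustration; the last one is malformed and is skipped.
    val lines = Seq("5.0,3.5", "8.2,-1.25", "not,a number")
    lines.flatMap(parseLine).foreach(println)
  }
}
```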

## Instantiate Spark

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val conf = new SparkConf().setMaster("local[*]").setAppName("..")
implicit val spark = SparkSession.builder.config(conf).getOrCreate()
```

## Feature Engineering

We define features from the schema and mark each one as a predictor or a response. We start by importing the following classes. We will use FeatureBuilder, which is a factory for building features.

```scala
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
```

```scala
val population = FeatureBuilder.RealNN[SimpleRegression]
  .extract(_.population.toRealNN)
  .asPredictor
val profit = FeatureBuilder.RealNN[SimpleRegression]
  .extract(_.profit.toRealNN)
  .asResponse
```

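## Reader Encoder

Next we define the encoder.

```scala
import org.apache.spark.sql.Encoders
implicit val srEncoder = Encoders.product[SimpleRegression]
```

Encoders is a factory for constructing encoders that convert objects and primitives to and from Spark's internal row format using Catalyst expressions and code generation. Encoders.product builds such an encoder for Scala product types, which include case classes like SimpleRegression.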

## Training path and DataReader

Now we define trainFilePath, the location of the CSV file. We load the file with DataReaders.Simple.csvCase, which parses it according to the case class.

```scala
val trainFilePath = "./src/main/resources/SimpleRegressionDataset/simple_regression.csv"

val trainDataReader = DataReaders.Simple.csvCase[SimpleRegression](
  path = Option(trainFilePath)
)
```

DataReaders contains an object Simple, which has a csvCase method. Internally, this creates a CSVProductReader and calls the read method on it, shown below. SimpleRegression provides the schema for reading the file.

The code below is library code, shown for information only.

```scala
override def read(params: OpParams = new OpParams())(
  implicit sc: SparkSession): Either[RDD[T], Dataset[T]] = Right {
  val finalPath = getFinalReadPath(params)
  val data: Dataset[T] = sc.read
    .options(options.toSparkCSVOptionsMap)
    .schema(implicitly[Encoder[T]].schema)
    // without this, every value gets read in as a string
    .csv(finalPath)
    .as[T]
  maybeRepartition(data, params)
}
```


## Transmogrify features

Next we create a Seq of all predictor features and call transmogrify() on it.

The transmogrify() function performs the following tasks:

• It converts the features into a single vector feature, using the feature engineering steps most likely to give good results for the types of the individual features passed in
• It takes an optional label parameter: a label feature to be passed into stages that require the label column (not applicable in this case)
• It returns a vector feature

```scala
val features = Seq(population).transmogrify()
```

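## Modelling and Evaluation

Next we define a DataSplitter and use it to create a RegressionModelSelector with cross-validation, restricted here to gradient-boosted tree and random forest regressors. We set the label (profit) and the feature vector as inputs. The output is the prediction feature, which we get by calling getOutput() on the selector.

```scala
val randomSeed = 42L

// Assumption: a default DataSplitter, seeded for reproducibility
val splitter = DataSplitter(seed = randomSeed)

val prediction = RegressionModelSelector
  .withCrossValidation(
    dataSplitter = Some(splitter), seed = randomSeed,
    modelTypesToUse = Seq(OpGBTRegressor, OpRandomForestRegressor)
  ).setInput(profit, features).getOutput()
```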
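
The selector above compares candidate models with cross-validation. As a rough illustration of the idea only, the sketch below splits row indices into k folds in plain Scala, so that each row is used for validation exactly once. The `kFold` helper is hypothetical and says nothing about how TransmogrifAI implements its validator internally.

```scala
import scala.util.Random

object KFoldSketch {
  // Hypothetical helper: split row indices 0 until n into k (train, validation) pairs.
  def kFold(n: Int, k: Int, seed: Long): Seq[(Seq[Int], Seq[Int])] = {
    require(k >= 2 && k <= n, "need 2 <= k <= n")
    // Shuffle once with a fixed seed so the split is reproducible.
    val shuffled = new Random(seed).shuffle((0 until n).toVector)
    (0 until k).map { fold =>
      // The row at shuffled position pos is validated in fold pos % k.
      val (validation, train) =
        shuffled.zipWithIndex.partition { case (_, pos) => pos % k == fold }
      (train.map(_._1), validation.map(_._1))
    }
  }

  def main(args: Array[String]): Unit = {
    kFold(n = 10, k = 3, seed = 42L).zipWithIndex.foreach { case ((train, validation), i) =>
      println(s"fold $i: train=${train.size} rows, validation=${validation.size} rows")
    }
  }
}
```

Each candidate model would be trained on the train indices of every fold and scored on the matching validation indices. The average score across folds is what lets the models be compared fairly.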

## Create Evaluator

Finally we define the evaluator using Evaluators.Regression(), a factory for evaluators that compute regression metrics. The metrics returned are root mean squared error (RMSE), mean squared error (MSE), R2 and mean absolute error (MAE).

```scala
val evaluator = Evaluators.Regression()
  .setLabelCol(profit)
  .setPredictionCol(prediction)
```

## Workflow - create and train

Create workflow

Now we define the workflow. An OpWorkflow takes the final features that the user wants to generate as inputs and constructs the full DAG needed to generate them from the lineage of those features. It then fits any estimators in the pipeline DAG to create a sequence of transformations that are saved in a workflow model.

```scala
val workflow = new OpWorkflow()
  .setResultFeatures(prediction, profit)
  .setReader(trainDataReader)
```

Train the workflow

The train() method is called on the workflow instance to fit all of the estimators in the pipeline and return a pipeline model containing only transformers. It uses the data loaded by the data reader to generate the initial dataset.

```scala
val workflowModel = workflow.train()
```

## Score and Evaluate the Model

To score and evaluate the model, we call scoreAndEvaluate(..) on workflowModel as shown below. It returns a pair: the scored data and the evaluation metrics.

```scala
val dfScoreAndEvaluate = workflowModel.scoreAndEvaluate(evaluator)
val dfScore = dfScoreAndEvaluate._1
  .withColumnRenamed(
    "population-profit_3-stagesApplied_Prediction_00000000000f",
    "predicted_profit")
val dfEvaluate = dfScoreAndEvaluate._2
```

We renamed the scored column to predicted_profit. The original column name is generated by the workflow, so check the actual name in your own run.

Evaluator output

Print the output of the evaluator:

```scala
println("Evaluate:\n" + dfEvaluate.toString())
```

The output of the print statement above will be:

```
{
"RootMeanSquaredError" : 3.2384424508547105,
"MeanSquaredError" : 10.487509507497863,
"R2" : 0.6509976667047017,
"MeanAbsoluteError" : 2.3879064702007646
}
```

The chosen model has an MSE of about 10.49 and an RMSE of about 3.24.
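
For readers who want to see how the four reported metrics relate to each other, here is a minimal sketch in plain Scala using their standard definitions. The `evaluate` helper, the `Metrics` class and the sample values are hypothetical. This is not the evaluator's own code. Note that RMSE is simply the square root of MSE, which matches the output above: the square root of 10.4875 is about 3.2384.

```scala
object RegressionMetricsSketch {
  final case class Metrics(mse: Double, rmse: Double, mae: Double, r2: Double)

  // Hypothetical helper: compute the standard regression metrics from paired values.
  def evaluate(actual: Seq[Double], predicted: Seq[Double]): Metrics = {
    require(actual.nonEmpty && actual.size == predicted.size, "need paired, non-empty inputs")
    val n = actual.size
    val errors = actual.zip(predicted).map { case (y, yHat) => y - yHat }
    val mse = errors.map(e => e * e).sum / n
    val mae = errors.map(e => math.abs(e)).sum / n
    val meanY = actual.sum / n
    val totalSumSquares = actual.map(y => (y - meanY) * (y - meanY)).sum
    // R2 = 1 - SS_res / SS_tot; it is undefined (NaN here) when all labels are equal.
    val r2 = if (totalSumSquares == 0.0) Double.NaN else 1.0 - (mse * n) / totalSumSquares
    Metrics(mse, math.sqrt(mse), mae, r2)
  }

  def main(args: Array[String]): Unit = {
    // Made-up values for illustration only.
    val actual = Seq(1.0, 2.0, 3.0, 4.0)
    val predicted = Seq(1.5, 2.0, 3.0, 3.5)
    println(evaluate(actual, predicted))
  }
}
```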

Show the scores

```scala
dfScore.show(false)
```

Output will be similar to:

```
+--------------------+-------+----------------------------------+
|key                 |profit |predicted_profit                  |
+--------------------+-------+----------------------------------+
|4888594985882367692 |17.592 |[prediction -> 4.230561150603311] |
|8020157188189231317 |9.1302 |[prediction -> 2.5443535144523106]|
|734589599645700297  |13.662 |[prediction -> 7.781771232315265] |
|2073615787707515974 |11.854 |[prediction -> 4.704125138578778] |
|8227609953757034682 |6.8233 |[prediction -> 3.4318698944369954]|
|-3430766866446242375|11.886 |[prediction -> 7.737034008872041] |
|7526338684630736359 |4.3483 |[prediction -> 4.687224111323368] |
|4921755212716439631 |12.0   |[prediction -> 7.781771232315265] |
|-466660369762144986 |6.5987 |[prediction -> 4.497698003899803] |
|-2662461837352799172|3.8166 |[prediction -> 1.5378404994651533]|
|5748526649699471540 |3.2522 |[prediction -> 2.8269199669197933]|
|-4491317630949059399|15.505 |[prediction -> 15.253309190943435]|
|-1036466213603419833|3.1551 |[prediction -> 3.4318698944369954]|
|49827461235003711   |7.2258 |[prediction -> 7.737034008872041] |
|8021242688795550819 |0.71618|[prediction -> 2.7841598383912647]|
|-7789624635709463055|3.5129 |[prediction -> 1.6806952063426013]|
|4172135869839283180 |5.3048 |[prediction -> 4.497698003899803] |
|2924631943149725529 |0.56077|[prediction -> 1.5378404994651533]|
|-2298902207799678212|3.6518 |[prediction -> 4.497698003899803] |
|8510971026908274006 |5.3893 |[prediction -> 4.704125138578778] |
+--------------------+-------+----------------------------------+
```

We extract the predicted_profit value from each row and store the values in a file for plotting.

```scala
// Column 2 (predicted_profit) is a map; read its "prediction" entry as a Double
val dfScoreMod = dfScore.rdd.map(x => x.getMap[String, Double](2)("prediction"))
dfScoreMod.foreach(println)
dfScoreMod.saveAsTextFile("./output/simple_regression/predictions")
```

In the next section we will plot the output along with the training data values.

## Plotting the Actual vs predicted score

We used pandas to plot the actual vs predicted values from the original dataset and the output obtained above.

## Summary

In this article we looked at a simple regression use case with one input variable (population) and one dependent variable (profit). We used TransmogrifAI to select the best initial model based on the lowest error, measured by RMSE.