Pima Indians Diabetes Detection using TransmogrifAI

Author : Rajdeep Dua
Last Updated : April 15 2019

In this article we look at how we can use TransmogrifAI to detect diabetes among patients in the Pima Indians dataset.

DataSet

Context

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Content

The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Columns of the dataset

  • Pregnancies : Number of times pregnant
  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg/(height in m)^2)
  • DiabetesPedigreeFunction: Diabetes pedigree function
  • Age: Age (years)
  • Outcome: Class variable (0 or 1) 268 of 768 are 1, the others are 0
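
The raw file is a headerless CSV with the nine columns above, in the order listed. A row looks like this (values shown for illustration only; check your copy of the file):

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0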

Pre-Requisites

  • Scala, sbt and your favorite IDE (IntelliJ or Eclipse)
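
TransmogrifAI and Spark also need to be on the classpath. A minimal build.sbt sketch (the version numbers are assumptions; use the ones that match your environment):

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.5.3",
  "org.apache.spark" %% "spark-mllib" % "2.3.2"
)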

Schema Definition in Scala

We will start by defining the case class that models a row of the dataset.

case class PimaIndians
(
  numberOfTimesPreg: Double,
  plasmaGlucose: Double,
  bp: Double,
  spinThickness: Double,
  serumInsulin: Double,
  bmi: Double,
  diabetesPredigree: Double,
  ageInYrs: Double,
  piClass: String
)

Feature Engineering

Define features from the schema definition and mark each one as a predictor or a response.

import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._

trait PimaIndianFeatures extends Serializable {
  val numberOfTimesPreg = FeatureBuilder.Real[PimaIndians].extract(_.numberOfTimesPreg.toReal).asPredictor
  val plasmaGlucose = FeatureBuilder.Real[PimaIndians].extract(_.plasmaGlucose.toReal).asPredictor
  val bp = FeatureBuilder.Real[PimaIndians].extract(_.bp.toReal).asPredictor
  val spinThickness = FeatureBuilder.Real[PimaIndians].extract(_.spinThickness.toReal).asPredictor
  val serumInsulin = FeatureBuilder.Real[PimaIndians].extract(_.serumInsulin.toReal).asPredictor
  val bmi = FeatureBuilder.Real[PimaIndians].extract(_.bmi.toReal).asPredictor
  val diabetesPredigree = FeatureBuilder.Real[PimaIndians].extract(_.diabetesPredigree.toReal).asPredictor
  val ageInYrs = FeatureBuilder.Real[PimaIndians].extract(_.ageInYrs.toReal).asPredictor
  val piClass = FeatureBuilder.Text[PimaIndians].extract(_.piClass.toText).asResponse
}

Reader, Encoder and Indexed Labels

Next we define the encoder and a CSV DataReader that maps rows to the case class PimaIndians, and create the indexed labels:

import com.salesforce.op.readers.DataReaders
import org.apache.spark.sql.Encoders

implicit val piEncoder = Encoders.product[PimaIndians]

val piReader = DataReaders.Simple.csvCase[PimaIndians]()
val labels = piClass.indexed()

Transmogrify features

Next we create a Seq of all features and call transmogrify() on it.

The transmogrify() function performs the following tasks:

  • converts the features into a single vector feature, using the feature engineering steps most likely to give good results based on the types of the individual features passed in
  • takes an optional label parameter, to be passed into stages that require the label column
  • returns a vector feature

val features = Seq(numberOfTimesPreg, plasmaGlucose, bp, spinThickness, serumInsulin,
  bmi, diabetesPredigree, ageInYrs).transmogrify()

Modelling and Evaluation

Next we define a DataCutter and use it to create a MultiClassificationModelSelector, setting labels and features as its inputs. The prediction feature is obtained by calling getOutput() on the selector.

val randomSeed = 42L

val cutter = DataCutter(reserveTestFraction = 0.2, seed = randomSeed)

val prediction = MultiClassificationModelSelector
  .withCrossValidation(splitter = Option(cutter), seed = randomSeed)
  .setInput(labels, features).getOutput()
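
Note that the Outcome label has only two classes, so a BinaryClassificationModelSelector would be an equally natural choice. A sketch of that variant, assuming DataSplitter (the binary counterpart of DataCutter) and the same inputs:

// Alternative sketch: model the 0/1 outcome as binary classification.
val binaryPrediction = BinaryClassificationModelSelector
  .withCrossValidation(
    splitter = Option(DataSplitter(reserveTestFraction = 0.2, seed = randomSeed)),
    seed = randomSeed)
  .setInput(labels, features).getOutput()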

Evaluator and Workflow

Finally we define the evaluator using Evaluators.MultiClassification. This is a factory that evaluates metrics for multi-class classification. The metrics returned are Precision, Recall, F1 and Error Rate.

val evaluator = Evaluators.MultiClassification.f1().setLabelCol(labels)
   .setPredictionCol(prediction)

Now we define the workflow. An OpWorkflow takes the final features the user wants to generate as inputs, constructs the full DAG needed to generate them from the feature lineage, and then fits any estimators in the pipeline DAG to create a sequence of transformations that are saved in a workflow model.
val workflow = new OpWorkflow().setResultFeatures(prediction, labels)
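
For quick experiments the workflow can also be driven directly, without the runner machinery shown below. A minimal sketch, assuming an implicit SparkSession is in scope and a reader constructed with an explicit path (the same file path is used later in this article):

// Sketch: attach a reader that knows its data path, then train in-process.
val trainReader = DataReaders.Simple.csvCase[PimaIndians](
  path = Option("./src/main/resources/PimaIndiansDataset/Pimaindiansdiabetes.data"))
val model = workflow.setReader(trainReader).train()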

Putting it together

All the code above is collected in the class OpPimaIndiansBase

class OpPimaIndiansBase extends PimaIndianFeatures {

  implicit val piEncoder = Encoders.product[PimaIndians]

  val piReader = DataReaders.Simple.csvCase[PimaIndians]()
  val labels = piClass.indexed()

  val features = Seq(numberOfTimesPreg, plasmaGlucose, bp, spinThickness, serumInsulin,
    bmi, diabetesPredigree, ageInYrs).transmogrify()

  val randomSeed = 42L

  val cutter = DataCutter(reserveTestFraction = 0.2, seed = randomSeed)

  val prediction = MultiClassificationModelSelector
    .withCrossValidation(splitter = Option(cutter), seed = randomSeed)
    .setInput(labels, features).getOutput()

  val evaluator = Evaluators.MultiClassification.f1().setLabelCol(labels)
    .setPredictionCol(prediction)

  val workflow = new OpWorkflow().setResultFeatures(prediction, labels)
}

Training the model

We extend OpAppWithRunner with PimaIndianFeatures

object OpPimaIndiansTrain extends OpAppWithRunner with PimaIndianFeatures {
 ...
}

In the OpPimaIndiansTrain object we override the main(args: Array[String]) function of OpAppWithRunner and implement the runner() function. Notice myArgs, which contains all the parameters listed below:

  • --run-type=train
  • --model-location=/tmp/pi-model
  • --read-location PimaIndians=./src/main/resources/PimaIndiansDataset/Pimaindiansdiabetes.data
object OpPimaIndiansTrain extends OpAppWithRunner with PimaIndianFeatures {

  val conf = new SparkConf().setMaster("local[*]").setAppName("PimaPrediction")
  implicit val spark = SparkSession.builder.config(conf).getOrCreate()

  val opPIBase = new OpPimaIndiansBase()

  def runner(opParams: OpParams): OpWorkflowRunner =
    new OpWorkflowRunner(
      workflow = opPIBase.workflow,
      trainingReader = opPIBase.piReader,
      scoringReader = opPIBase.piReader,
      evaluationReader = Option(opPIBase.piReader),
      evaluator = Option(opPIBase.evaluator),
      featureToComputeUpTo = Option(opPIBase.features)
    )

  override def main(args: Array[String]): Unit = {
    val myArgs = Array("--run-type=train", "--model-location=/tmp/pi-model",
      "--read-location",
      "PimaIndians=./src/main/resources/PimaIndiansDataset/Pimaindiansdiabetes.data"
    )

    val (runType, opParams) = parseArgs(myArgs)
    val batchDuration = Duration(opParams.batchDurationSecs.getOrElse(1),
                                 TimeUnit.SECONDS)
    val (spark, streaming) = sparkSession -> sparkStreamingContext(batchDuration)
    run(runType, opParams)(spark, streaming)
  }
}

Output of Training

The output of training is stored in /tmp/pi-model. The actual model is a JSON file: /tmp/pi-model/op-model.json/part-00000
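
The saved model can be loaded back and reused without retraining. A small sketch, assuming the workflow from OpPimaIndiansBase and a reader with an explicit data path, as in the training sketch above:

// Sketch: reload the fitted model from disk and score the dataset with it.
val fittedModel = workflow.loadModel("/tmp/pi-model")
val scoredDF = fittedModel.setReader(trainReader).score()
scoredDF.show(5)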

Scoring the Model

To score the model, we follow the same process as above but with slightly different command-line flags:

object OpPimaIndiansScore extends OpAppWithRunner with PimaIndianFeatures {

  val conf = new SparkConf().setMaster("local[*]").setAppName("PimaPrediction")
  implicit val spark = SparkSession.builder.config(conf).getOrCreate()

  val opPIBase = new OpPimaIndiansBase()

  def runner(opParams: OpParams): OpWorkflowRunner =
    ... // same runner definition as in OpPimaIndiansTrain

  override def main(args: Array[String]): Unit = {
    val myArgs = Array("--run-type=score", "--model-location=/tmp/pi-model",
      "--read-location",
      "PimaIndians=./src/main/resources/PimaIndiansDataset/Pimaindiansdiabetes.data",
      "--write-location=/tmp/pi-scores"
    )

    val (runType, opParams) = parseArgs(myArgs)
    val batchDuration = Duration(opParams.batchDurationSecs.getOrElse(1), TimeUnit.SECONDS)
    val (spark, streaming) = sparkSession -> sparkStreamingContext(batchDuration)
    run(runType, opParams)(spark, streaming)
  }

}

The scores are stored in /tmp/pi-scores, as specified by the --write-location flag.
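
Since the scores are written out through Spark, they can be inspected with Spark as well. A sketch, assuming the records are written as JSON (swap in the appropriate reader if your configuration writes a different format such as Avro):

val savedScores = spark.read.json("/tmp/pi-scores")
savedScores.show(5, truncate = false)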

Evaluating the model

We need to use --run-type=evaluate, a metrics location (--metrics-location=/tmp/pi-metrics) and a write location (--write-location=/tmp/pi-eval):

object OpPimaIndiansEvaluate  extends OpAppWithRunner with PimaIndianFeatures {

  ...
  override def main(args: Array[String]): Unit = {
    val metricsLocation = "/tmp/pi-metrics"
    val evalLocation = "/tmp/pi-eval"

    val myArgs = Array("--run-type=evaluate", "--model-location=/tmp/pi-model",
      "--metrics-location=" + metricsLocation,
      "--read-location",
      "PimaIndians=./src/main/resources/PimaIndiansDataset/Pimaindiansdiabetes.data",
      "--write-location=" + evalLocation
    )
    val (runType, opParams) = parseArgs(myArgs)
    val batchDuration = Duration(opParams.batchDurationSecs.getOrElse(1), TimeUnit.SECONDS)
    val (spark, streaming) = sparkSession -> sparkStreamingContext(batchDuration)
    run(runType, opParams)(spark, streaming)
  }
}

Model Metrics

The model gives the following metrics:

  • Precision : 0.7577300712020105
  • Recall : 0.76171875
  • F1 : 0.7597191752951721
  • Error : 0.23828125

Selecting the best model

The model selector used the following validation results to arrive at the best model. It tried RandomForest and LogisticRegression with various hyper-parameters.

modelName                            modelType                 regParam  maxDepth  error
OpLogisticRegression_000000000019_1  OpLogisticRegression      0.010     NaN       0.229517
OpLogisticRegression_000000000019_2  OpLogisticRegression      0.100     NaN       0.220464
OpLogisticRegression_000000000019_3  OpLogisticRegression      0.200     NaN       0.226691
OpLogisticRegression_000000000019_4  OpLogisticRegression      0.001     NaN       0.216625
OpLogisticRegression_000000000019_5  OpLogisticRegression      0.010     NaN       0.234593
OpLogisticRegression_000000000019_6  OpLogisticRegression      0.100     NaN       0.222506
OpLogisticRegression_000000000019_7  OpLogisticRegression      0.200     NaN       0.291327
OpRandomForestClassifier_.000001a_0  OpRandomForestClassifier  NaN       3.0       0.245524
OpRandomForestClassifier_.000001a_1  OpRandomForestClassifier  NaN       6.0       0.243518
OpRandomForestClassifier_.000001a_2  OpRandomForestClassifier  NaN       12.0      0.243823
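
The table above is assembled from the validation results that the model selector records on the fitted model. A small sketch of retrieving them, assuming model is the OpWorkflowModel produced by training:

// Sketch: print every model evaluated during cross validation,
// together with its hyper-parameters and metric values.
println(model.summaryPretty())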

The plot below compares the error for Random Forest vs. Logistic Regression.

../_images/pima_indians_lr_vs_rf.png

The plot below shows how error rates vary with the hyperparameters of the Random Forest classification algorithm.

../_images/pima_indians_rf_hyperparameters.png

The plot below shows how error rates vary with the hyperparameters of the Logistic Regression classification algorithm.

../_images/pima_indians_lr_hyperparameters.png