Sentiment Prediction from a Movie Review DataSet with Einstein Language APIs

Author : Rajdeep Dua
Last Updated : Sep 6 2017

In this article, we look at how the Einstein Language APIs can be used to detect sentiment in text from the Movie Review dataset.

Background

Salesforce launched the Einstein Language APIs a couple of months ago to help developers build services that detect sentiment in text. In this article, we look at how these APIs can be applied to the well-known Movie Review dataset.

../_images/einstein_language_phrases_basic.png

Movie Review DataSet

“There’s a thin line between likably old-fashioned and fuddy-duddy, and The Count of Monte Cristo ... never quite settles on either side.”

The Rotten Tomatoes movie review dataset is a corpus of movie reviews used for sentiment analysis, originally collected by Pang and Lee [1].

We have modified the dataset to reduce the number of sentiment labels from 5 to 3.

The dataset consists of tab-separated files with phrases from the Rotten Tomatoes dataset. The train/test split has been preserved for the purposes of benchmarking, but the sentences have been shuffled from their original order. Each sentence has been parsed into many phrases by the Stanford parser. Each phrase has a PhraseId, and each sentence has a SentenceId. Phrases that are repeated (such as short or common words) are included only once in the data.

train.tsv contains the phrases and their associated sentiment labels. test.tsv contains just phrases; the task is to assign a sentiment label to each of them.

The sentiment labels are:

  • 1 - negative
  • 2 - neutral
  • 3 - positive
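The article does not state how the five original labels were reduced to three. A minimal sketch of one plausible rule follows, assuming the original labels run from 0 (negative) to 4 (positive) and that the two negative and the two positive classes were merged. The mapping, the column names (PhraseId, SentenceId, Phrase, Sentiment), and the sample phrases are all assumptions made for illustration.

```python
import csv
import io

# Assumed mapping from the original five labels (0-4) to the three used here.
# The article does not state the exact rule; this is only an illustration.
FIVE_TO_THREE = {0: 1, 1: 1, 2: 2, 3: 3, 4: 3}


def collapse_rows(tsv_text):
    """Yield (phrase, new_label) pairs from five-label TSV text.

    Assumes a header row containing 'Phrase' and 'Sentiment' columns.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        yield row["Phrase"], FIVE_TO_THREE[int(row["Sentiment"])]


# Hypothetical sample rows, not taken from the real dataset.
sample = (
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "1\t1\ta hypothetical dull phrase\t1\n"
    "2\t1\ta hypothetical plain phrase\t2\n"
    "3\t2\ta hypothetical delightful phrase\t4\n"
)
collapsed = list(collapse_rows(sample))
print(collapsed)
```

Keeping the mapping in a single dictionary makes it easy to try a different merge rule if the assumed one turns out not to match the modified dataset.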

DataSet size

  • Total Labels : 3
  • Number of examples : 155835
Sentiment   Label   Number of Samples
negative    1       34298
neutral     2       79463
positive    3       42074
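The label counts are unevenly distributed, which matters when reading the accuracy figures later on. The short sketch below, which uses only the counts listed above, computes each label's share and the accuracy of a trivial classifier that always predicts the most common label. The variable names are illustrative.

```python
# Label counts copied from the dataset size figures above.
counts = {"negative": 34298, "neutral": 79463, "positive": 42074}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the most common label.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

print(total)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"majority baseline ({majority_label}): {majority_baseline:.1%}")
```

Roughly half of the phrases are neutral, so a model needs to score well above 51% accuracy to be doing better than always answering "neutral".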

We uploaded the dataset with standard curl commands and then trained the model.

Creating DataSet in Einstein

We need to make an HTTP POST call to https://api.einstein.ai/v2/language/datasets/upload with the appropriate parameters, as shown below.

curl -X POST -H "Authorization: Bearer <TOKEN>" \
             -H "Cache-Control: no-cache" \
             -H "Content-Type: multipart/form-data" \
             -F "path=https://www.dropbox.com/s/ezrbw1pep9yqh1i/train_3_labels.tsv?dl=1" \
             -F "type=text-sentiment" https://api.einstein.ai/v2/language/datasets/upload

Note : The upload is asynchronous, so you will have to poll the dataset endpoint (see Check Upload Status of the DataSet below) to find out when the upload has finished.

The response will be similar to the listing below.

{
   "id":1012242,
   "name":"train_3_labels.tsv; filename*=UTF-8''train_3_labels.tsv",
   "createdAt":"2017-09-06T09:28:05.000+0000",
   "updatedAt":"2017-09-06T09:28:05.000+0000",
   "labelSummary":{
      "labels":[

      ]
   },
   "totalExamples":0,
   "available":false,
   "statusMsg":"UPLOADING",
   "type":"text-sentiment",
   "object":"dataset"
}

Note the dataset id, which is 1012242 in our case.

Check Upload Status of the DataSet

curl -X GET -H "Authorization: Bearer <TOKEN>" \
            -H "Cache-Control: no-cache" \
            https://api.einstein.ai/v2/language/datasets/1012242

Once the upload has finished, the response will be similar to the listing below.

{
   "id":1012242,
   "name":"train_3_labels.tsv; filename*=UTF-8''train_3_labels.tsv",
   "createdAt":"2017-09-06T09:28:05.000+0000",
   "updatedAt":"2017-09-06T09:36:06.000+0000",
   "labelSummary":{
      "labels":[
         {
            "id":116112,
            "datasetId":1012242,
            "name":"1",
            "numExamples":34298
         },
         {
            "id":116113,
            "datasetId":1012242,
            "name":"2",
            "numExamples":79463
         },
         {
            "id":116114,
            "datasetId":1012242,
            "name":"3",
            "numExamples":42074
         }
      ]
   },
   "totalExamples":155835,
   "totalLabels":3,
   "available":true,
   "statusMsg":"SUCCEEDED",
   "type":"text-sentiment",
   "object":"dataset"
}
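Because the upload is asynchronous, a client usually repeats the status call until the dataset becomes available. The sketch below shows one way to do that, assuming readiness can be read from the available and statusMsg fields exactly as they appear in the two responses in this article. Other status values, such as failures, are not handled. The fetch_status callable is a hypothetical stand-in for the GET request and is replaced by canned responses here so the example runs offline.

```python
import time


def dataset_ready(dataset):
    """Return True once the dataset JSON reports a finished upload."""
    return bool(dataset.get("available")) and dataset.get("statusMsg") == "SUCCEEDED"


def wait_for_dataset(fetch_status, attempts=30, delay_seconds=10.0):
    """Poll until the dataset is ready or the attempts run out.

    fetch_status is a hypothetical caller-supplied function that performs
    the status GET request and returns the decoded JSON as a dict.
    """
    for attempt in range(attempts):
        dataset = fetch_status()
        if dataset_ready(dataset):
            return dataset
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    raise TimeoutError("dataset was not ready after polling")


# Offline demonstration with trimmed versions of the two responses shown here.
responses = iter([
    {"id": 1012242, "available": False, "statusMsg": "UPLOADING"},
    {"id": 1012242, "available": True, "statusMsg": "SUCCEEDED"},
])
result = wait_for_dataset(lambda: next(responses), delay_seconds=0)
print(result["statusMsg"])
```

Passing the fetch function in as an argument keeps the polling logic independent of any particular HTTP client.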

Training the DataSet

We will make an HTTP POST call to https://api.einstein.ai/v2/language/train and pass the dataset id (1012242 in our case) as the datasetId parameter.

Training Command

curl -X POST -H "Authorization: Bearer <TOKEN>" \
             -H "Cache-Control: no-cache" \
             -H "Content-Type: multipart/form-data" \
             -F "name=movie sentiment analysis" \
             -F "datasetId=<DATASET ID>" https://api.einstein.ai/v2/language/train

Note : It takes about 1 hour to train the model.

Once the model is trained, its metrics can be retrieved:

{
   "createdAt":"2017-09-06T10:51:14.000+0000",
   "metricsData":{
      "f1":[
         0.5569429817962991,
         0.7427413501088796,
         0.6161539435371566
      ],
      "labels":[
         "1",
         "2",
         "3"
      ],
      "testAccuracy":0.6708281797054414,
      "trainingLoss":0.7768682526393756,
      "confusionMatrix":[
         [
            3702, 2373, 714
         ],
         [
            1793, 12279, 1614
         ],
         [
            1010, 2726, 4867
         ]
      ],
      "trainingAccuracy":0.6583838530397638
   },
   "id":"VLEITGZ347BJBAQOHCDFV4VXTI",
   "object":"metrics"
}

Interpret Training Results

F1 Scores by Label

Sentiment   Label   F1 Score
negative    1       0.5569
neutral     2       0.7427
positive    3       0.6162

Confusion Matrix

Label       negative   neutral   positive
negative    3702       2373      714
neutral     1793       12279     1614
positive    1010       2726      4867
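The reported accuracy and F1 scores can be recomputed from the confusion matrix as a sanity check. The sketch below does this with plain Python. It does not assume which axis holds the true labels, because both accuracy and per-label F1 come out the same either way.

```python
labels = ["negative", "neutral", "positive"]
confusion = [
    [3702, 2373, 714],
    [1793, 12279, 1614],
    [1010, 2726, 4867],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

f1 = {}
for i, name in enumerate(labels):
    row_sum = sum(confusion[i])
    col_sum = sum(row[i] for row in confusion)
    # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (row_sum + col_sum),
    # which is the same whichever axis holds the true labels.
    f1[name] = 2 * confusion[i][i] / (row_sum + col_sum)

print(f"examples in matrix: {total}")
print(f"accuracy: {accuracy:.4f}")
for name, score in f1.items():
    print(f"{name}: {score:.4f}")
```

The recomputed values agree with the testAccuracy and f1 fields in the metrics response, which suggests the matrix was built from a held-out portion of the uploaded data.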

Predicting the Label from the Phrase

Let us take a phrase from the test dataset and try to predict its label.

Prediction Command

We will send a prediction HTTP POST request to https://api.einstein.ai/v2/language/sentiment with the authorization token, the model id, and the text to be classified.

curl -X POST -H "Authorization: Bearer <TOKEN>" \
             -H "Cache-Control: no-cache" \
             -H "Content-Type: multipart/form-data" \
             -F "modelId=VLEITGZ347BJBAQOHCDFV4VXTI" \
             -F "document=An intermittently pleasing but mostly routine effort" \
             https://api.einstein.ai/v2/language/sentiment

Results

{
   "probabilities":[
      {
         "label":"3",
         "probability":0.563523
      },
      {
         "label":"2",
         "probability":0.32334027
      },
      {
         "label":"1",
         "probability":0.11313678
      }
   ],
   "object":"predictresponse"
}

Prediction Result

  • Input : An intermittently pleasing but mostly routine effort

  • Output :

    • Negative (label 1) - 11%
    • Neutral (label 2) - 32%
    • Positive (label 3) - 56%
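In client code, the usual next step is to pick the most probable label and translate it back to a sentiment name. A minimal sketch follows, using the response shown above and the label numbering from this article. The helper name top_sentiment is illustrative and not part of the API.

```python
import json

# Label numbering used for the three-label dataset in this article.
LABEL_NAMES = {"1": "negative", "2": "neutral", "3": "positive"}

# The prediction response shown above, reproduced as text.
response_text = """
{
   "probabilities":[
      {"label":"3","probability":0.563523},
      {"label":"2","probability":0.32334027},
      {"label":"1","probability":0.11313678}
   ],
   "object":"predictresponse"
}
"""


def top_sentiment(response):
    """Return (sentiment_name, probability) for the most likely label."""
    best = max(response["probabilities"], key=lambda p: p["probability"])
    return LABEL_NAMES[best["label"]], best["probability"]


name, probability = top_sentiment(json.loads(response_text))
print(f"{name} ({probability:.0%})")
```

Selecting the maximum explicitly, instead of taking the first list entry, avoids relying on the order in which the probabilities are returned.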

Summary

As we can see, Einstein Language returns a probability for each sentiment label of a phrase from the movie review test set. For this phrase, it assigns the highest probability (56%) to the positive label.