Classification of Text into Topics

Author : Rajdeep Dua
Last Updated : Oct 7 2017


In this article we will learn how to use PredictionIO to perform topic-based classification, using a pre-built PredictionIO template.


It is assumed that you have compiled the PredictionIO distribution and configured it to work with PostgreSQL.

We tested this with PredictionIO-0.10.0-incubating distribution.

PredictionIO Template

We are going to use the following template for classification.

Algorithm Used

The template uses the Naive Bayes algorithm implemented in the Spark MLlib package.

import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
val nbModel = NaiveBayes.train(pd.labeledpoints, lambda = ap.lambda)

Here ap.lambda = 1.0 is the smoothing parameter, and pd is a custom class containing an RDD of LabeledPoint instances.
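The Scala call above delegates the training to MLlib. For intuition, here is a minimal pure-Python sketch of multinomial Naive Bayes with additive (Laplace) smoothing, where lam plays the role of ap.lambda above. The toy documents and labels are illustrative only, not taken from the template.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, lam=1.0):
    """docs: list of (label, tokens). Returns (log_priors, log_likelihoods, vocab)."""
    class_counts = Counter(label for label, _ in docs)
    token_counts = defaultdict(Counter)  # per-class token frequencies
    vocab = set()
    for label, tokens in docs:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = len(docs)
    log_priors = {c: math.log(n / total_docs) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        # Additive smoothing: lam is added to every token count
        denom = sum(token_counts[c].values()) + lam * len(vocab)
        log_lik[c] = {t: math.log((token_counts[c][t] + lam) / denom) for t in vocab}
    return log_priors, log_lik, vocab

def classify(tokens, log_priors, log_lik, vocab):
    # Pick the class maximizing log prior + sum of token log-likelihoods
    scores = {
        c: log_priors[c] + sum(log_lik[c][t] for t in tokens if t in vocab)
        for c in log_priors
    }
    return max(scores, key=scores.get)

docs = [
    ("sports", ["cricket", "bat", "ball"]),
    ("sports", ["baseball", "bat", "pitch"]),
    ("politics", ["election", "vote", "party"]),
]
model = train_naive_bayes(docs, lam=1.0)
print(classify(["cricket", "ball"], *model))  # prints "sports"
```

MLlib's implementation works on RDDs of feature vectors rather than raw tokens, but the smoothing and scoring logic is the same.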

Clone the Template

Let us assume we clone it under the PredictionIO-0.10.0-incubating distribution directory.

git clone

Download the Wikipedia Content

We will use the wikipedia Python library to download page content and store it in sample_wiki_pages_data.csv. For this sample we are using a list of pages from Politics and Sports.

import wikipedia
import pandas

def get_wikiPages(filename):
    pagesFile = open(filename, "r")
    df = pandas.DataFrame(columns=['title', 'content'])
    titles = list()
    content = list()
    for pageName in pagesFile:
        # Fetch the Wikipedia page for each title listed in the file
        page = wikipedia.page(pageName.strip())
        print('Page title:' + page.title)
        titles.append(page.title)
        content.append(page.content)
    pagesFile.close()
    df['title'] = titles
    df['content'] = content
    return df

df = get_wikiPages('Pages_Names_Sample.txt')
df.to_csv('sample_wiki_pages_data.csv', index=False)



Change to the data directory of the template before running the script:

cd data

Application Name and Key

We are going to use the following key and appname

  • key : 584a7a65e-e626-4557-a98a-5638f9a61b26
  • appname : topics_wikipedia

Note: You can also generate your own access key instead of using the one above.

Event Server

  1. Start the Event Server

    ./bin/pio eventserver &

    You can check the status of the event server at http://localhost:7070

  2. Create a new App by giving an access key.

    cd template-Labelling-Topics-with-wikipedia
    ../bin/pio app new topics_wikipedia --access-key 584a7a65e-e626-4557-a98a-5638f9a61b26

    Output will be similar to the listing below

    [INFO] [App$] Initialized Event Store for this app ID: 3.
    [INFO] [App$] Created new app:
    [INFO] [App$]       Name: topics_wikipedia
    [INFO] [App$]         ID: 3
    [INFO] [App$] Access Key: 584a7a65e-e626-4557-a98a-5638f9a61b26
  3. Export the Access Key into an Environment Variable

    export ACCESS_KEY=584a7a65e-e626-4557-a98a-5638f9a61b26
  4. Import Data into Event Server

    python data/ --access_key $ACCESS_KEY
  5. View the events in the browser at http://localhost:7070/events.json?accessKey=584a7a65e-e626-4557-a98a-5638f9a61b26&limit=-1
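The import script in step 4 POSTs one event per Wikipedia page to the Event Server's REST API. A minimal sketch using only the standard library is shown below; the event name and the property names (text, label) are assumptions for illustration, so check the template's own data script for the exact schema it expects.

```python
import json
import urllib.request

ACCESS_KEY = "584a7a65e-e626-4557-a98a-5638f9a61b26"
EVENT_URL = "http://localhost:7070/events.json?accessKey=" + ACCESS_KEY

def build_event(entity_id, text, label):
    # Shape required by the PredictionIO Event API. The "event" name and
    # "properties" keys here are illustrative assumptions, not the
    # template's verified schema.
    return {
        "event": "documents",
        "entityType": "content",
        "entityId": str(entity_id),
        "properties": {"text": text, "label": label},
    }

def send_event(event):
    # POST the JSON event to the running Event Server
    req = urllib.request.Request(
        EVENT_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example payload for one row of sample_wiki_pages_data.csv:
event = build_event(0, "Cricket is a bat-and-ball game...", "sports")
```

With the Event Server running, send_event(event) would store the event and return its generated event ID.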


Classification Engine

  1. Build the Engine

    ../bin/pio build --verbose
    [INFO] [Console$] Your engine is ready for training.
  2. Train the Engine

    ../bin/pio train

    Output will be similar to the listing below

    [INFO] [Engine$] org.apache.spark.mllib.recommendation.ALSModel does not support data sanity check. Skipping check.
    [INFO] [Engine$] EngineWorkflow.train completed
    [INFO] [Engine] engineInstanceId=7cb0a26d-f2e6-4954-9831-67944a625ac6
    [INFO] [CoreWorkflow$] Inserting persistent model
    [INFO] [CoreWorkflow$] Updating engine instance
    [INFO] [CoreWorkflow$] Training completed successfully.
  3. Start the Engine

    ../bin/pio deploy

    Output will be similar to the listing below. Note the address at which the engine is listening.

    [INFO] [HttpListener] Bound to /
    [INFO] [MasterActor] Engine is deployed and running.
    Engine API is live at

Browse to the link to see the Engine output.

[Screenshots: pio-classification-engine-wikipedia-1.png, pio-classification-engine-wikipedia-2.png]

Find Topics from Text

Query the engine's REST endpoint:

curl -H "Content-Type: application/json" -d '{"topics": [["cricket","baseball"]]}' http://localhost:8000/queries.json
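The same query can be issued from Python with the standard library, assuming the engine is deployed at localhost:8000 as above:

```python
import json
import urllib.request

def build_query(topics):
    # JSON body matching the curl example above
    return json.dumps({"topics": topics})

def query_engine(topics, url="http://localhost:8000/queries.json"):
    # POST the query to the deployed engine and return its JSON response
    req = urllib.request.Request(
        url,
        data=build_query(topics).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With the engine running:
# query_engine([["cricket", "baseball"]])
```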