Classification of Text into Topics

Author : Rajdeep Dua
Last Updated : Jan 30 2017

Introduction

In this article we will learn how to use PredictionIO to perform topic-based classification of text. We will be using a pre-built PredictionIO template.

Pre-Requisites

It is assumed you have compiled the PredictionIO distribution and configured it to work with PostgreSQL.

We tested this with the PredictionIO-0.10.0-incubating distribution.

PredictionIO Template

We are going to use the following template for classification.

https://github.com/peoplehum/template-Labelling-Topics-with-wikipedia

Algorithm Used

The template uses the NaiveBayes algorithm implemented in the Spark MLlib package.

import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
val nbModel = NaiveBayes.train(pd.labeledpoints, lambda = ap.lambda)

Here ap.lambda = 1.0 is the additive-smoothing parameter and pd is a custom class containing an RDD of labeled points.
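To make the role of lambda concrete, here is a minimal pure-Python sketch of multinomial Naive Bayes with additive (lambda) smoothing, the same smoothing scheme MLlib's NaiveBayes uses. The helper names are ours for illustration, not part of the template:

```python
from collections import Counter, defaultdict
from math import log

def train_nb(docs, lam=1.0):
    """Train multinomial Naive Bayes with additive (lambda) smoothing.

    docs is a list of (label, words) pairs, mirroring the labeled
    points the template feeds to MLlib.
    """
    vocab = set()
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()             # label -> number of documents
    for label, words in docs:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    total_docs = sum(label_counts.values())
    model = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        denom = total + lam * len(vocab)  # lambda smooths every word count
        prior = log(label_counts[label] / total_docs)
        likelihood = {w: log((word_counts[label][w] + lam) / denom)
                      for w in vocab}
        unseen = log(lam / denom)         # probability mass for unseen words
        model[label] = (prior, likelihood, unseen)
    return model

def predict(model, words):
    """Return the label with the highest posterior log-probability."""
    def score(label):
        prior, likelihood, unseen = model[label]
        return prior + sum(likelihood.get(w, unseen) for w in words)
    return max(model, key=score)

docs = [("Sport", ["cricket", "bat", "ball"]),
        ("Politics", ["election", "vote", "ballot"])]
model = train_nb(docs, lam=1.0)
print(predict(model, ["cricket", "ball"]))   # -> Sport
```

With lam = 0 an unseen word would make the whole product zero; lam = 1.0 (the template's default) gives every word a small floor probability instead.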

Clone the Template

Let us assume we cloned it under the PredictionIO-0.10.0-incubating distribution directory.

git clone https://github.com/peoplehum/template-Labelling-Topics-with-wikipedia

Download the Wikipedia Content

We will use the wikipedia Python library to download content and store it in sample_wiki_pages_data.csv. For this sample we are using page lists from Politics and Sports.

import wikipedia
import pandas

def get_wikiPages(filename):
  titles = list()
  content = list()
  with open(filename, "r") as pagesFile:
    for pageName in pagesFile:
      pageName = pageName.strip()  # drop the trailing newline
      if not pageName:
        continue
      try:
        page = wikipedia.page(pageName)
      except Exception:  # skip pages that are missing or ambiguous
        continue
      titles.append(page.title)
      content.append(page.content)
      print('Page title: ' + page.title)
  df = pandas.DataFrame({'title': titles, 'content': content})
  return df

df = get_wikiPages('Pages_Names_Sample.txt')
print(df)

df.to_csv('sample_wiki_pages_data.csv', sep=',', header=True,
  columns=["title", "content"], encoding='utf-8')

Command

cd data
python get_wikiPages.py

Application Name and Key

We are going to use the following key and app name.

  • key : 584a7a65e-e626-4557-a98a-5638f9a61b26
  • appname : topics_wikipedia

Note: You can generate your own access key at https://www.uuidgenerator.net/version4
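Alternatively, Python's standard uuid module generates the same kind of version-4 UUID locally:

```python
import uuid

# A version-4 (random) UUID serves as a PredictionIO access key.
access_key = str(uuid.uuid4())
print(access_key)
```

Any sufficiently random string works as an access key; a UUID is simply a convenient, collision-resistant choice.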

Event Server

  1. Start the Event Server

    ./bin/pio eventserver &
    

    You can check the status of the event server by going to http://localhost:7070

    [Screenshot: pio-eventserver-status.png]
  2. Create a new App by giving an access key.

      cd template-Labelling-Topics-with-wikipedia
    
    ../bin/pio app new topics_wikipedia --access-key 584a7a65e-e626-4557-a98a-5638f9a61b26
    

    Output will be similar to the listing below

    [INFO] [App$] Initialized Event Store for this app ID: 3.
    [INFO] [App$] Created new app:
    [INFO] [App$]       Name: topics_wikipedia
    [INFO] [App$]         ID: 3
    [INFO] [App$] Access Key: 584a7a65e-e626-4557-a98a-5638f9a61b26
    
  3. Export the Access Key into an Environment Variable

    export ACCESS_KEY=584a7a65e-e626-4557-a98a-5638f9a61b26
    
  4. Import Data into Event Server

    python data/import_eventServer.py --access_key $ACCESS_KEY
    
  5. View the events in the browser at http://localhost:7070/events.json?accessKey=584a7a65e-e626-4557-a98a-5638f9a61b26&limit=-1

[Screenshot: pio-events-classification-wikipedia.png]
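The template ships its own import script (data/import_eventServer.py). As an illustration only, here is a minimal sketch of what such a script might send to the PredictionIO Event API using the standard library. The entity and property names ("content", "text", "label") are assumptions for illustration, not the template's verified schema:

```python
import json
import urllib.request

EVENT_SERVER = "http://localhost:7070/events.json"

def build_event(title, content, category):
    # A "$set" event attaches properties to an entity. The property
    # keys used here are assumptions, not the template's actual schema.
    return {
        "event": "$set",
        "entityType": "content",
        "entityId": title,
        "properties": {"text": content, "label": category},
    }

def send_event(access_key, event):
    # POST one event to the Event Server, authenticated by access key.
    url = "%s?accessKey=%s" % (EVENT_SERVER, access_key)
    req = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With the Event Server running:
# send_event("584a7a65e-e626-4557-a98a-5638f9a61b26",
#            build_event("Cricket", "Cricket is a bat-and-ball game.", "Sport"))
```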

Classification Engine

  1. Build the Engine

    ../bin/pio build --verbose
    
    ...
    
    [INFO] [Console$] Your engine is ready for training.
    
  2. Train the Engine

    ../bin/pio train
    

    Output will be similar to the listing below

    [INFO] [Engine$] org.apache.spark.mllib.recommendation.ALSModel does not support data sanity check. Skipping check.
    [INFO] [Engine$] EngineWorkflow.train completed
    [INFO] [Engine] engineInstanceId=7cb0a26d-f2e6-4954-9831-67944a625ac6
    [INFO] [CoreWorkflow$] Inserting persistent model
    [INFO] [CoreWorkflow$] Updating engine instance
    [INFO] [CoreWorkflow$] Training completed successfully.
    
  3. Start the Engine

    ../bin/pio deploy
    

    Output will be similar to the listing below. In our case the engine is listening at http://0.0.0.0:8000

    [INFO] [HttpListener] Bound to /0.0.0.0:8000
    [INFO] [MasterActor] Engine is deployed and running.
    Engine API is live at http://0.0.0.0:8000.
    

Browse to http://0.0.0.0:8000 to see the engine status output.

[Screenshot: pio-classification-engine-wikipedia-1.png] [Screenshot: pio-classification-engine-wikipedia-2.png]

Find Topics from Text

Query the engine's REST endpoint deployed above:

curl -H "Content-Type: application/json" -d '{"topics": [["cricket","baseball"]]}' http://localhost:8000/queries.json

Output

{"Category":"Sport"}
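The same query can be sent from Python using only the standard library. query_topics is our helper name for illustration, not part of the template:

```python
import json
import urllib.request

ENGINE_URL = "http://localhost:8000/queries.json"

def query_topics(words):
    """Send the same query as the curl call above and return the parsed JSON."""
    payload = json.dumps({"topics": [words]}).encode("utf-8")
    req = urllib.request.Request(
        ENGINE_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With the engine deployed:
#   query_topics(["cricket", "baseball"])  ->  {"Category": "Sport"}
```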