Naive Bayes Classifier for Sentiment Analysis of Google Play Store App Reviews Using Scikit-learn

Photo by Pathum Danthanarayana on Unsplash

Suppose you want to classify user reviews as positive or negative. Sentiment analysis is a common task for data scientists. This is a simple guide to building a Google Play store review classifier (sentiment analysis) in Python using a Naive Bayes classifier and Scikit-learn.

You can find all the code used in this tutorial along with additional materials in this GitHub Repository

Naive Bayes is one of the simplest and fastest classification algorithms, and it copes well with large amounts of data. It has been used successfully in applications such as spam filtering, text classification, sentiment analysis, and recommendation systems. It uses Bayes' theorem to predict the class of an unseen example.
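To make Bayes' theorem concrete, here is a minimal sketch for a single word. The counts are made up for illustration and are not taken from the reviews dataset. The "naive" part of Naive Bayes is that a real classifier multiplies such per-word likelihoods together as if the words were independent given the class.

```python
# Hypothetical counts, for illustration only (not from the reviews dataset).
n_pos, n_neg = 60, 40      # number of positive / negative reviews
awesome_in_pos = 30        # positive reviews containing "awesome"
awesome_in_neg = 2         # negative reviews containing "awesome"

# Priors: P(class)
p_pos = n_pos / (n_pos + n_neg)
p_neg = n_neg / (n_pos + n_neg)

# Likelihoods: P("awesome" | class)
p_word_given_pos = awesome_in_pos / n_pos
p_word_given_neg = awesome_in_neg / n_neg

# Bayes' theorem: P(class | word) = P(word | class) * P(class) / P(word)
p_word = p_word_given_pos * p_pos + p_word_given_neg * p_neg
p_pos_given_word = p_word_given_pos * p_pos / p_word

print(round(p_pos_given_word, 4))
```

With these invented counts, a review containing "awesome" comes out as far more likely to be positive than negative, which is the kind of evidence the classifier accumulates across all words in a review.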

Data Overview

Reviews of 23 popular mobile apps were scraped. To create the dataset, each review was manually labeled as positive or negative. The data can be found here: Reviews DataSet.


In this tutorial, we need all of the following Python libraries.

pandas: Python Data Analysis Library. pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

NumPy: NumPy is the fundamental package for scientific computing with Python. It contains, among other things:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

scikit-learn: Simple and efficient tools for data mining and data analysis.

scipy: SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering.

python-dateutil: The dateutil module provides powerful extensions to the standard datetime module, available in Python.

pytz: pytz brings the Olson tz database into Python. This library allows accurate and cross-platform time zone calculations using Python 2.4 or higher.

Read into Python

Let’s first read the required data from the CSV file using the pandas library.

import pandas as pd
from sklearn.model_selection import train_test_split
import joblib
from sklearn.feature_extraction.text import CountVectorizer
data = pd.read_csv('google_play_store_apps_reviews_training.csv')

Now, let’s see what the data looks like…
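A quick way to inspect a DataFrame is sketched below. Because the CSV file is not bundled here, the example builds a tiny stand-in frame named `sample` with the same three columns used in this tutorial; the rows themselves, and the assumption that `polarity` is 1 for positive and 0 for negative, are hypothetical.

```python
import pandas as pd

# Stand-in rows for illustration only; the real file has one row per scraped review.
# Assumption: polarity is encoded as 1 (positive) and 0 (negative).
sample = pd.DataFrame({
    "package_name": ["com.example.app", "com.example.app", "com.example.game"],
    "review": ["Love this app", "Keeps crashing on startup", "  Great fun  "],
    "polarity": [1, 0, 1],
})

print(sample.head())                      # first few rows
print(sample.shape)                       # (number of rows, number of columns)
print(sample["polarity"].value_counts())  # how many reviews per class
```

The same three calls can be run on the `data` frame loaded from the CSV to check the columns and the class balance before training.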


Pre-process Data

We need to remove the package name, as it’s not relevant, and then convert the review text to lowercase. This is the data pre-processing stage.

def preprocess_data(data):
    # Remove package name as it's not relevant
    data = data.drop('package_name', axis=1)

    # Convert text to lowercase
    data['review'] = data['review'].str.strip().str.lower()
    return data

data = preprocess_data(data)

Note: There are many different and more sophisticated ways to clean text data that would likely produce better results than what I did here. I kept the cleaning minimal to make this tutorial as easy as possible. I also generally think it’s best to get baseline predictions with the simplest possible solution before spending time on transformations that may turn out to be unnecessary.

Splitting Data

First, separate the columns into independent and dependent variables (the features and the label). Then split those variables into training and test sets.

# Split into training and testing data
x = data['review']
y = data['polarity']
x, x_test, y, y_test = train_test_split(x,y, stratify=y, test_size=0.25, random_state=42)
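The `stratify=y` argument asks `train_test_split` to keep the class proportions the same in both splits. A minimal sketch on toy labels (hypothetical data with an 80/20 class balance, not the real dataset) shows the effect:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, for illustration only: 32 positive and 8 negative labels.
labels = pd.Series([1] * 32 + [0] * 8)
texts = pd.Series([f"review {i}" for i in range(len(labels))])

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, stratify=labels, test_size=0.25, random_state=42
)

print(len(x_tr), len(x_te))      # sizes of the training and test sets
print(y_tr.mean(), y_te.mean())  # share of positive labels in each split
```

Without stratification, a small or imbalanced dataset can end up with noticeably different class shares in the two splits, which makes the test accuracy harder to interpret.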

Then vectorize the text reviews into numbers.

# Vectorize text reviews to numbers
vec = CountVectorizer(stop_words='english')
x = vec.fit_transform(x).toarray()
x_test = vec.transform(x_test).toarray()

Vectorization: For our machine learning algorithm to make sense of this data, we need to convert each review to a numerical representation. This conversion is called vectorization.
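As a small illustration of what `CountVectorizer` produces, the sketch below vectorizes two made-up sentences. The names `docs`, `vec_demo`, and `counts` are only for this example. It uses the default settings, so unlike the tutorial code it does not drop English stop words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews, for illustration only.
docs = ["love this app", "hate this app, hate the ads"]

vec_demo = CountVectorizer()  # defaults: lowercase text, keep tokens of 2+ characters
counts = vec_demo.fit_transform(docs).toarray()

# Learned vocabulary, ordered by column index
vocab = sorted(vec_demo.vocabulary_, key=vec_demo.vocabulary_.get)

print(vocab)
print(counts)  # one row per review, one column per vocabulary word
```

Each row is a review and each column counts how often one vocabulary word appears in it. Word order is discarded, which is why this representation is often called a bag of words.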

Model Generation

After splitting the data and vectorizing the text reviews into numbers, we will fit a Multinomial Naive Bayes model on the training set and make predictions on the test set features.

from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(x, y)

Evaluating Model

After fitting the model, check its accuracy on the test set by comparing the actual and predicted values. Our model accuracy is 85%.

model.score(x_test, y_test)
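For a classifier, `score` returns the mean accuracy, that is, the share of test examples whose predicted label matches the actual one. The sketch below makes that comparison explicit on a tiny made-up dataset; the texts, labels, and the names `toy_vec` and `toy_model` are hypothetical and only stand in for the real reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up dataset, for illustration only (1 = positive, 0 = negative).
train_texts = ["love it", "great app", "awesome and useful",
               "hate it", "terrible app", "useless and buggy"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["great and useful", "buggy and terrible"]
test_labels = [1, 0]

toy_vec = CountVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_vec.fit_transform(train_texts), train_labels)

# Predict on the held-out texts, then compare with the actual labels
predicted = toy_model.predict(toy_vec.transform(test_texts))

print(accuracy_score(test_labels, predicted))    # same quantity as model.score
print(confusion_matrix(test_labels, predicted))  # rows: actual, columns: predicted
```

A confusion matrix is a useful complement to a single accuracy number, since it shows whether the errors are mostly false positives or false negatives.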

Then check a prediction…

model.predict(vec.transform(['Love this app simply awesome!']))

And there it is: a very simple classifier with a pretty decent 85% accuracy out of the box.

Thanks for reading! This is my first Medium post, so please comment on any questions or suggestions you may have.




Passionate about data science {📊, 💻, 🛠️} LOVE !t
