Naive Bayes Classification

Melisa Krisnawati
5 min read · Jun 27, 2021

Hello all! Today I'm going to show you Naive Bayes classification using the Python programming language. First of all, classification is a machine learning task that discriminates between different objects based on certain features. A Naive Bayes classifier is a probabilistic machine learning model used for classification tasks, built on Bayes' theorem: P(A|B) = P(B|A) · P(A) / P(B).

Using Bayes' theorem, we can find the probability of A happening given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent: the presence of one particular feature does not affect the others. That is why the model is called "naive".
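To make the theorem concrete, here is a tiny worked example. The scenario and the numbers are invented purely for illustration: a spam filter asking how likely a message is to be spam given that it contains the word "free".

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Toy numbers (made up for illustration):
#   P(spam) = 0.2              -> prior probability of the hypothesis A
#   P("free" | spam) = 0.6     -> likelihood of the evidence B given A
#   P("free") = 0.16           -> overall probability of the evidence B
p_spam = 0.2
p_free_given_spam = 0.6
p_free = 0.16

# Posterior: probability a message containing "free" is spam
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(p_spam_given_free)  # 0.75
```

Even though the prior probability of spam is only 0.2, observing the evidence raises the posterior to 0.75.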

Advantages

  • It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
  • When the independence assumption holds, a Naive Bayes classifier performs better than other models like logistic regression, and you need less training data.
  • It performs well with categorical input variables compared to numerical ones. For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).

Disadvantages

  • If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a zero probability and will be unable to make a prediction. This is often known as the zero-frequency problem. To solve it, we can use a smoothing technique; one of the simplest is Laplace estimation.
  • Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
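Here is a minimal sketch of Laplace (add-one) smoothing for the zero-frequency problem described above. The helper function and the sample data are my own illustration, not part of the original article:

```python
from collections import Counter

def smoothed_prob(value, observations, n_categories, alpha=1):
    """Laplace (add-alpha) estimate of P(value) from a list of observations.

    Adding alpha to every count means an unseen category gets a small
    non-zero probability instead of exactly 0.
    """
    counts = Counter(observations)
    return (counts[value] + alpha) / (len(observations) + alpha * n_categories)

weather = ["sunny", "rainy", "sunny", "windy"]

# "snowy" never appears in the training data, but it no longer gets
# probability 0: (0 + 1) / (4 + 1 * 5) = 1/9
print(smoothed_prob("snowy", weather, n_categories=5))
```

With alpha = 0 this reduces to the plain frequency estimate, which is exactly where the zero-probability problem comes from.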

Applications

  • Real-time prediction: Naive Bayes is an eager learning classifier and is certainly fast, so it can be used for making predictions in real time.
  • Multi-class prediction: This algorithm is also well known for its multi-class prediction capability; we can predict the probability of multiple classes of the target variable.
  • Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are widely used in text classification (thanks to good results on multi-class problems and the independence assumption) and often achieve higher success rates than other algorithms. As a result, Naive Bayes is widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment).
  • Recommendation systems: A Naive Bayes classifier can be used to build a recommendation system that applies machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource.

Example

Let us take an example to build some better intuition. There are actually many datasets that could be used here; for reference, some other options are the Iris dataset, the Wine dataset, and the Adult dataset. For now, though, we will consider the problem of weekend activities. The dataset is represented below.

We classify whether a day is suitable for going to the cinema, playing tennis, shopping, or staying in, given the features of the day. The columns represent these features and the rows represent individual entries. If we take the first row of the dataset, we can observe that people will go to the cinema if the weather is rainy, Parents is Yes, and Money is Rich.

The Coding Exercise

I don't provide the dataset, so write it out in Excel like the example above and then follow the instructions below. After that, open your Google Colaboratory and let's start!

1. Import the libraries that we will use: pandas, matplotlib, scipy, and numpy.
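The imports for this step could look like the following; the aliases are the conventional ones, and note that the library is spelled scipy:

```python
# Libraries used throughout this exercise
import numpy as np                 # array operations
import pandas as pd                # reading and manipulating the dataset
import matplotlib.pyplot as plt    # plotting, used later for visualisation
from scipy import stats            # statistical helpers
```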

2. Load the dataset you created in Excel by pointing to its path on your Drive. In this case I put it inside the folder Colab Notebooks. Then delete the Weekend column, because it is unnecessary.
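A sketch of this step is below. The Drive path and file name are assumptions (adjust them to your own setup), and since the article does not ship the spreadsheet, the code falls back to a small stand-in DataFrame whose rows I invented to match the example described above:

```python
import pandas as pd

# In Colab, after mounting Drive, the file could be read like this
# (folder and file name are assumptions; adjust to your own path):
#   from google.colab import drive; drive.mount("/content/drive")
#   df = pd.read_excel("/content/drive/MyDrive/Colab Notebooks/weekend.xlsx")

# Self-contained stand-in for the spreadsheet (rows invented for illustration):
df = pd.DataFrame({
    "Weekend":  ["W1", "W2", "W3", "W4"],
    "Weather":  ["Rainy", "Sunny", "Windy", "Rainy"],
    "Parents":  ["Yes", "No", "Yes", "No"],
    "Money":    ["Rich", "Rich", "Poor", "Poor"],
    "Decision": ["Cinema", "Tennis", "Cinema", "Stay in"],
})

# The Weekend column is only a row label, so drop it
df = df.drop(columns=["Weekend"])
print(df.columns.tolist())  # ['Weather', 'Parents', 'Money', 'Decision']
```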

3. Get the target attribute (the Decision column) and the input attributes, in this case the three columns Weather, Parents, and Money, and save them into variables.
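This step could be written as follows; the DataFrame is the same invented stand-in used for illustration, and the variable names are my own:

```python
import pandas as pd

# Stand-in for the loaded spreadsheet (rows invented for illustration)
df = pd.DataFrame({
    "Weather":  ["Rainy", "Sunny", "Windy", "Rainy"],
    "Parents":  ["Yes", "No", "Yes", "No"],
    "Money":    ["Rich", "Rich", "Poor", "Poor"],
    "Decision": ["Cinema", "Tennis", "Cinema", "Stay in"],
})

# Decision is the target attribute; the other three columns are the inputs
target = df["Decision"]
inputs = df[["Weather", "Parents", "Money"]]
print(inputs.columns.tolist())  # ['Weather', 'Parents', 'Money']
```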

4. Get the input attribute values and target attribute values; the results will be in array form.
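Extracting the raw values could look like this (again using the invented stand-in dataset); `.values` turns the pandas columns into NumPy arrays:

```python
import pandas as pd

# Stand-in dataset (rows invented for illustration)
df = pd.DataFrame({
    "Weather":  ["Rainy", "Sunny", "Windy", "Rainy"],
    "Parents":  ["Yes", "No", "Yes", "No"],
    "Money":    ["Rich", "Rich", "Poor", "Poor"],
    "Decision": ["Cinema", "Tennis", "Cinema", "Stay in"],
})

X = df[["Weather", "Parents", "Money"]].values  # input attribute values
y = df["Decision"].values                       # target attribute values
print(type(X))  # <class 'numpy.ndarray'>
```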

5. Then you can count the instances and the target distribution.
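One way to sketch this step is `len` for the instance count and `value_counts` for the target distribution, still on the invented stand-in dataset:

```python
import pandas as pd

# Stand-in dataset (rows invented for illustration)
df = pd.DataFrame({
    "Weather":  ["Rainy", "Sunny", "Windy", "Rainy"],
    "Parents":  ["Yes", "No", "Yes", "No"],
    "Money":    ["Rich", "Rich", "Poor", "Poor"],
    "Decision": ["Cinema", "Tennis", "Cinema", "Stay in"],
})

n_instances = len(df)                          # total number of rows
distribution = df["Decision"].value_counts()   # how often each class occurs
print(n_instances)   # 4
print(distribution)  # Cinema: 2, Tennis: 1, Stay in: 1
```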

6. If you want to see the distinct values of each attribute, here I'll take the Weather attribute and its values as an example. You can do the same for the other attributes to check their values.
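For the Weather column, this could look like the following (on the invented stand-in dataset); `unique` lists the distinct values and `value_counts` shows how often each occurs:

```python
import pandas as pd

# Stand-in dataset (rows invented for illustration)
df = pd.DataFrame({
    "Weather":  ["Rainy", "Sunny", "Windy", "Rainy"],
    "Parents":  ["Yes", "No", "Yes", "No"],
    "Money":    ["Rich", "Rich", "Poor", "Poor"],
    "Decision": ["Cinema", "Tennis", "Cinema", "Stay in"],
})

weather_values = df["Weather"].unique()   # distinct values of the attribute
print(weather_values)                     # ['Rainy' 'Sunny' 'Windy']
print(df["Weather"].value_counts())       # count of each value
```

The same two calls work unchanged for the Parents and Money columns.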

In the next article I will show you the Gini and entropy functions that will be used in this case, so stay tuned! And don't forget to clap for this article! Happy coding :)


Melisa Krisnawati

Google Developer Student Club Leader @Universitas Ciputra Surabaya Indonesia.