
Naïve Bayes Classifier

Naïve Bayes Classifier Algorithm

  • The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems.
  • It is mainly used in text classification, which typically involves a high-dimensional training dataset.
  • The Naïve Bayes Classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.
  • It is a probabilistic classifier, which means it predicts the class of an object on the basis of probabilities.
  • Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm is made up of two words, Naïve and Bayes, which can be described as:

  • Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the other features.
  • Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:

  • Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability of a hypothesis with prior knowledge. It depends on the conditional probability.
  • The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.
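
As a quick sanity check on the formula, here is a minimal Python sketch; the numbers are made up purely for illustration and do not come from the examples below:

    def bayes_posterior(prior_a, likelihood_b_given_a, marginal_b):
        # P(A|B) = P(B|A) * P(A) / P(B)
        return likelihood_b_given_a * prior_a / marginal_b

    # Illustrative values: P(A) = 0.3, P(B|A) = 0.8, P(B) = 0.5
    print(bayes_posterior(0.3, 0.8, 0.5))  # 0.48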

Working of Naïve Bayes' Classifier:

The working of the Naïve Bayes classifier can be understood with the help of the example below:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the steps below:

  1. Convert the given dataset into frequency tables.
  2. Generate a likelihood table by finding the probabilities of the given features.
  3. Use Bayes' theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the dataset below:

     | Outlook  | Play
0    | Rainy    | Yes
1    | Sunny    | Yes
2    | Overcast | Yes
3    | Overcast | Yes
4    | Sunny    | No
5    | Rainy    | Yes
6    | Sunny    | Yes
7    | Overcast | Yes
8    | Rainy    | No
9    | Sunny    | No
10   | Sunny    | Yes
11   | Rainy    | No
12   | Overcast | Yes
13   | Overcast | Yes

Frequency table for the Weather Conditions:

Weather  | Yes | No
Overcast | 5   | 0
Rainy    | 2   | 2
Sunny    | 3   | 2
Total    | 10  | 4

Likelihood table for the weather conditions:

Weather  | No          | Yes          | P(Weather)
Overcast | 0           | 5            | 5/14 = 0.35
Rainy    | 2           | 2            | 4/14 = 0.29
Sunny    | 2           | 3            | 5/14 = 0.35
All      | 4/14 = 0.29 | 10/14 = 0.71 |

Applying Bayes' theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|No)= 2/4=0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
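
For readers who prefer to verify this in code, here is a minimal Python sketch (written for this article, using only the standard library) that reproduces the counts and posteriors above; small differences in the last digit come from rounding P(Sunny) to 0.35 in the hand calculation:

    from collections import Counter

    # The 14 Outlook/Play observations from the table above
    data = [
        ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes"),
    ]

    n = len(data)
    play_counts = Counter(play for _, play in data)            # Yes: 10, No: 4
    sunny_counts = Counter(play for outlook, play in data if outlook == "Sunny")
    p_sunny = sum(sunny_counts.values()) / n                   # P(Sunny) = 5/14

    for label in ("Yes", "No"):
        prior = play_counts[label] / n                         # P(Yes), P(No)
        likelihood = sunny_counts[label] / play_counts[label]  # P(Sunny|Yes), P(Sunny|No)
        print(label, round(likelihood * prior / p_sunny, 2))   # Yes 0.6, No 0.4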


Example-2

Let’s consider a simple binary classification problem where we want to classify whether an email is spam or not spam based on three features:

  1. Whether the email contains the word "offer."
  2. Whether the email contains the word "free."
  3. Whether the email contains the word "win."

We have the following training data:

Email   | Contains "offer" | Contains "free" | Contains "win" | Class (Spam/Not Spam)
Email 1 | Yes              | No              | Yes            | Spam
Email 2 | No               | Yes             | No             | Not Spam
Email 3 | Yes              | Yes             | No             | Spam
Email 4 | No               | No              | Yes            | Not Spam

We want to classify a new email that contains "offer" and "win" but does not contain "free."

Step 1: Calculate Prior Probabilities

We first calculate the prior probabilities of each class (spam or not spam) based on the training data.

  • P(Spam) = Number of Spam Emails / Total Number of Emails = 2 / 4 = 0.5
  • P(Not Spam) = Number of Not Spam Emails / Total Number of Emails = 2 / 4 = 0.5

Step 2: Calculate Likelihoods for Each Feature

Next, we calculate the likelihoods of each feature given the class.

For the class Spam:

  • P(Contains "offer" | Spam) = Number of Spam Emails containing "offer" / Total Spam Emails = 1 / 2 = 0.5
  • P(Contains "free" | Spam) = Number of Spam Emails containing "free" / Total Spam Emails = 1 / 2 = 0.5
  • P(Contains "win" | Spam) = Number of Spam Emails containing "win" / Total Spam Emails = 1 / 2 = 0.5

For the class Not Spam:

  • P(Contains "offer" | Not Spam) = Number of Not Spam Emails containing "offer" / Total Not Spam Emails = 0 / 2 = 0
  • P(Contains "free" | Not Spam) = Number of Not Spam Emails containing "free" / Total Not Spam Emails = 1 / 2 = 0.5
  • P(Contains "win" | Not Spam) = Number of Not Spam Emails containing "win" / Total Not Spam Emails = 1 / 2 = 0.5

Step 3: Calculate Posterior Probabilities

Now, we use Bayes' Theorem to calculate the posterior probability for each class (Spam or Not Spam) given the features of the new email. Because the evidence term (the probability of this particular combination of features) is the same for both classes, we can drop it and simply compare the resulting unnormalized scores.

For Spam:

  • P(Spam | Contains "offer", Contains "win", Does not contain "free")
    = P(Spam) * P(Contains "offer" | Spam) * P(Does not contain "free" | Spam) * P(Contains "win" | Spam)
    = 0.5 * 1 * (1 - 0.5) * 0.5
    = 0.5 * 1 * 0.5 * 0.5
    = 0.125

For Not Spam:

  • P(Not Spam | Contains "offer", Contains "win", Does not contain "free")
    = P(Not Spam) * P(Contains "offer" | Not Spam) * P(Does not contain "free" | Not Spam) * P(Contains "win" | Not Spam)
    = 0.5 * 0 * (1 - 0.5) * 0.5
    = 0.5 * 0 * 0.5 * 0.5
    = 0

Step 4: Make a Prediction

The Naive Bayes Classifier selects the class with the highest posterior probability. In this case, the score for the email being Spam is 0.125, while the score for it being Not Spam is 0. Since the spam score is greater, the classifier predicts that the email is Spam.
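
The hand calculation above can be reproduced with a short Python sketch (plain Python written for this example, not a library implementation):

    # Each email: (contains "offer", contains "free", contains "win", class)
    emails = [
        (1, 0, 1, "Spam"),
        (0, 1, 0, "Not Spam"),
        (1, 1, 0, "Spam"),
        (0, 0, 1, "Not Spam"),
    ]
    new_email = (1, 0, 1)  # contains "offer" and "win", but not "free"

    scores = {}
    for label in ("Spam", "Not Spam"):
        rows = [e for e in emails if e[3] == label]
        score = len(rows) / len(emails)                   # prior P(class)
        for i, present in enumerate(new_email):
            p_word = sum(r[i] for r in rows) / len(rows)  # P(word present | class)
            score *= p_word if present else (1 - p_word)
        scores[label] = score

    print(scores)                       # {'Spam': 0.125, 'Not Spam': 0.0}
    print(max(scores, key=scores.get))  # Spam

Note that a zero count such as P(Contains "offer" | Not Spam) = 0 forces the whole score to zero; in practice this is usually softened with Laplace smoothing (for example, scikit-learn's BernoulliNB uses alpha = 1 by default), which would make both scores non-zero without changing the prediction here.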

Advantages of Naïve Bayes Classifier:

  • Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
  • It can be used for binary as well as multi-class classification.
  • It performs well in multi-class predictions compared to many other algorithms.
  • It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

  • Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.

Applications of Naïve Bayes Classifier:

  • It is used for Credit Scoring.
  • It is used in medical data classification.
  • It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
  • It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

  • Gaussian: The Gaussian model assumes that the features follow a normal distribution. This means that if the predictors take continuous values instead of discrete ones, the model assumes these values are sampled from a Gaussian distribution.
  • Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., determining which category a particular document belongs to, such as sports, politics, or education.
    The classifier uses the frequency of words as the predictors.
  • Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also popular for document classification tasks. (A usage sketch of all three variants follows this list.)
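
Here is a minimal usage sketch of the three variants, assuming scikit-learn is installed; the tiny toy arrays are made up purely for illustration:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

    y = np.array([0, 0, 1, 1])  # two toy classes

    # Gaussian: continuous-valued features (e.g. measurements)
    X_cont = np.array([[1.2, 3.4], [1.0, 3.1], [5.6, 0.2], [5.9, 0.4]])
    print(GaussianNB().fit(X_cont, y).predict([[1.1, 3.0]]))

    # Multinomial: word counts per document
    X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [0, 3, 3]])
    print(MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]]))

    # Bernoulli: binary word presence/absence
    X_bin = (X_counts > 0).astype(int)
    print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 1]]))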

Q. What is the difference between an optimal Bayes classifier and a naive Bayes classifier?

The Optimal Bayes Classifier and the Naive Bayes Classifier are both probabilistic models based on Bayes' Theorem, but they differ in key ways, particularly in how they handle data and make predictions. Here's a breakdown of their main differences:

1. Assumptions about Features

  • Optimal Bayes Classifier:
    The Optimal Bayes Classifier makes no assumptions about the relationships or dependencies between the features. It considers all possible hypotheses and uses the joint distribution of the features to make predictions, which makes it theoretically the most accurate classifier if all probabilities are known.

  • Naive Bayes Classifier:
    The Naive Bayes Classifier makes a strong independence assumption: it assumes that all features are conditionally independent given the class label. This means that it simplifies the joint probability of features by treating them as independent. While this assumption is often unrealistic, it allows for faster computation and works well in many practical situations.

2. Complexity

  • Optimal Bayes Classifier:
    The Optimal Bayes Classifier is computationally infeasible in most real-world scenarios because it requires knowledge of the full joint probability distribution of the features. Calculating this joint distribution becomes increasingly complex as the number of features grows, especially when the features are not independent.

  • Naive Bayes Classifier:
    The Naive Bayes Classifier is computationally efficient because of its independence assumption. It calculates probabilities separately for each feature and combines them to estimate the probability of a class. This makes it scalable even with large datasets and high-dimensional data.

3. Accuracy

  • Optimal Bayes Classifier:
    The Optimal Bayes Classifier provides the most accurate possible predictions in theory, as it considers the full probability distribution of all features without making any simplifying assumptions. It minimizes the error rate and is considered the ideal classifier.

  • Naive Bayes Classifier:
    The Naive Bayes Classifier can be less accurate than the Optimal Bayes Classifier, especially when the independence assumption does not hold. However, despite its simplistic assumptions, it often performs surprisingly well in practice, particularly in cases like text classification and spam detection.

4. Training Data Requirements

  • Optimal Bayes Classifier:
    Since it needs the full joint probability distribution of all features, the Optimal Bayes Classifier typically requires large amounts of training data to estimate these probabilities accurately.

  • Naive Bayes Classifier:
    The Naive Bayes Classifier can work with smaller datasets because it estimates probabilities for individual features independently, reducing the need for large amounts of training data.

5. Practical Use

  • Optimal Bayes Classifier:
    It is mostly of theoretical importance and not used in practice for large, complex problems due to its computational complexity and the difficulty of obtaining complete knowledge of the joint distribution of features.

  • Naive Bayes Classifier:
    The Naive Bayes Classifier is widely used in practical applications because it is simple to implement, fast, and performs well even when its independence assumption is violated. It's commonly applied in text classification tasks, spam filtering, sentiment analysis, and medical diagnosis.

Summary of Differences

Feature                    | Optimal Bayes Classifier                                     | Naive Bayes Classifier
Assumptions about Features | No independence assumptions                                  | Assumes conditional independence between features
Complexity                 | Computationally infeasible for large datasets                | Computationally efficient and scalable
Accuracy                   | Theoretically optimal, minimal error rate                    | Less accurate if independence assumption is violated
Data Requirements          | Requires large amounts of data to model joint probabilities  | Can work with smaller datasets
Practical Use              | Mostly theoretical, rarely used in practice                  | Widely used in many practical applications

Conclusion

The Optimal Bayes Classifier is theoretically the best model, providing the most accurate predictions by considering the full joint distribution of features, but it is computationally complex and impractical for large datasets. On the other hand, the Naive Bayes Classifier simplifies the problem by assuming feature independence, making it computationally efficient and suitable for many real-world problems, even though its assumptions may not always hold.


Example: Weather Prediction

We want to predict whether it will rain or not rain based on two features:

  1. Cloudy: Whether the sky is cloudy.
  2. Windy: Whether it is windy.

Dataset

Let’s consider this simplified dataset:

Day | Cloudy | Windy | Rain/Not Rain
1   | Yes    | Yes   | Rain
2   | No     | Yes   | Not Rain
3   | Yes    | No    | Rain
4   | Yes    | Yes   | Rain
5   | No     | No    | Not Rain

The goal is to predict whether it will rain or not rain based on whether it is cloudy and windy.


Optimal Bayes Classifier

The Optimal Bayes Classifier uses the exact joint probabilities of the features (cloudy and windy) and the class label (rain or not rain).

It calculates the posterior probabilities P(Rain | Cloudy and Windy) and P(Not Rain | Cloudy and Windy) directly by considering both features together (cloudy and windy) at the same time. This approach does not assume that cloudy and windy are independent of each other.

For example, to predict whether it will rain when it is both cloudy and windy, the Optimal Bayes Classifier would compute:

  • P(Rain | Cloudy = Yes, Windy = Yes) = Number of times it was cloudy and windy and it rained / Number of times it was cloudy and windy
  • P(Not Rain | Cloudy = Yes, Windy = Yes) = Number of times it was cloudy and windy and it didn’t rain / Number of times it was cloudy and windy

From the dataset:

  • P(Rain | Cloudy = Yes, Windy = Yes) = 2/2 (Day 1 and Day 4 are cloudy, windy, and rainy)
  • P(Not Rain | Cloudy = Yes, Windy = Yes) = 0/2 (No day is cloudy, windy, and not rainy)

So, based on this joint distribution, if it is both cloudy and windy, the Optimal Bayes Classifier would predict that it will definitely rain because all the examples with cloudy and windy conditions led to rain.

This method is very accurate but requires knowing or estimating all possible combinations of features, which can be computationally expensive and require a lot of data.
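
A minimal Python sketch of this joint-count estimate (plain Python written for this example) looks like the following:

    # The five Cloudy/Windy/Rain observations from the dataset above
    days = [
        ("Yes", "Yes", "Rain"),
        ("No",  "Yes", "Not Rain"),
        ("Yes", "No",  "Rain"),
        ("Yes", "Yes", "Rain"),
        ("No",  "No",  "Not Rain"),
    ]

    # Condition on the exact feature combination Cloudy = Yes, Windy = Yes
    matching = [label for cloudy, windy, label in days if cloudy == "Yes" and windy == "Yes"]
    for label in ("Rain", "Not Rain"):
        print(label, matching.count(label) / len(matching))  # Rain 1.0, Not Rain 0.0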


Naive Bayes Classifier

The Naive Bayes Classifier simplifies things by assuming that the features (cloudy and windy) are independent given the class label (rain or not rain). Instead of looking at the combination of cloudy and windy together, it calculates the probabilities separately and multiplies them.

In this case, the classifier calculates the class-conditional likelihoods separately:

  • P(Cloudy = Yes | Rain) and P(Windy = Yes | Rain)
  • P(Cloudy = Yes | Not Rain) and P(Windy = Yes | Not Rain)

Then, it combines these likelihoods with the class priors using the assumption of independence.

Step-by-Step Calculation for Naive Bayes

  1. Calculate Prior Probabilities:

    • P(Rain) = 3/5 (Three days it rained out of five)
    • P(Not Rain) = 2/5 (Two days it did not rain out of five)
  2. Calculate Conditional Probabilities:

    • P(Cloudy = Yes | Rain) = 3/3 (All three rainy days were cloudy)
    • P(Cloudy = Yes | Not Rain) = 0/2 (Neither of the two non-rainy days was cloudy)
    • P(Windy = Yes | Rain) = 2/3 (Two out of three rainy days were windy)
    • P(Windy = Yes | Not Rain) = 1/2 (One out of two non-rainy days was windy)
  3. Make Prediction (if it’s both cloudy and windy):

    • P(Rain | Cloudy = Yes, Windy = Yes) (using independence assumption) = P(Cloudy = Yes | Rain) * P(Windy = Yes | Rain) * P(Rain)

      • P(Rain | Cloudy = Yes, Windy = Yes) = (3/3) * (2/3) * (3/5) = 0.40
    • P(Not Rain | Cloudy = Yes, Windy = Yes) = P(Cloudy = Yes | Not Rain) * P(Windy = Yes | Not Rain) * P(Not Rain)

      • P(Not Rain | Cloudy = Yes, Windy = Yes) = (0) * (1/2) * (2/5) = 0.00
  4. Final Prediction:

    • The classifier will choose the class with the higher score. Since P(Rain | Cloudy = Yes, Windy = Yes) (0.40) is greater than P(Not Rain | Cloudy = Yes, Windy = Yes) (0.00), the Naive Bayes Classifier predicts Rain. (These values are unnormalized scores; because both classes share the same evidence term, they can be compared directly. A code sketch follows this list.)
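
The same calculation as a minimal Python sketch (plain Python written for this example, with the dataset repeated so the snippet is self-contained):

    days = [
        ("Yes", "Yes", "Rain"),
        ("No",  "Yes", "Not Rain"),
        ("Yes", "No",  "Rain"),
        ("Yes", "Yes", "Rain"),
        ("No",  "No",  "Not Rain"),
    ]
    query = ("Yes", "Yes")  # Cloudy = Yes, Windy = Yes

    scores = {}
    for label in ("Rain", "Not Rain"):
        rows = [d for d in days if d[2] == label]
        prior = len(rows) / len(days)                               # P(class)
        p_cloudy = sum(d[0] == query[0] for d in rows) / len(rows)  # P(Cloudy=Yes | class)
        p_windy = sum(d[1] == query[1] for d in rows) / len(rows)   # P(Windy=Yes | class)
        scores[label] = p_cloudy * p_windy * prior

    print(scores)                       # Rain ≈ 0.4, Not Rain = 0.0
    print(max(scores, key=scores.get))  # Rain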

Key Difference in This Example:

  • Optimal Bayes Classifier: It directly uses the joint probability of the features together (cloudy and windy) to make a prediction. In this example, it sees that every time it was cloudy and windy, it rained, so it predicts rain with certainty.

  • Naive Bayes Classifier: It simplifies things by calculating probabilities for each feature separately and assumes that being cloudy and being windy are independent of each other given the class. While it also predicts rain, it does so with less certainty because it doesn't account for the possible dependency between cloudy and windy.

The Naive Bayes Classifier assumes that cloudy and windy are unrelated, which simplifies the calculation, but this assumption might not be true in reality (e.g., cloudiness and wind could be correlated). However, Naive Bayes often works surprisingly well even with this assumption.

Summary:

  • The Optimal Bayes Classifier gives the most accurate result by considering the dependencies between all features but is computationally expensive and needs a lot of data.
  • The Naive Bayes Classifier simplifies things by assuming that all features are independent given the class, making it faster and easier to implement, though it may be less accurate if features are highly dependent.
