Naive Bayes is not an algorithm; it is a class of algorithms. Naive Bayes is very easy to understand and reasonably accurate, making it a great class of algorithms to use when starting a classification project.
Classification is a machine learning technique in which we wish to predict what class (or category) an item belongs to. Examples of classification include deciding whether to play golf given the day's weather conditions and determining whether a phrase comes from a baseball story or a business story; we will work through both examples below.
Two important definitions:
Naive Bayes algorithms follow the general form of Bayes' theorem:

$P(B|A) = \dfrac{P(A|B) \cdot P(B)}{P(A)}$

Here, $B$ is the class we want to predict and $A$ is the evidence: our input data.
Today's talk will show how the Naive Bayes class of algorithms works, solve a simplified form by hand, and then use R packages to solve larger-scale problems.
There are several forms of Naive Bayes algorithms that we will not discuss, but they can be quite useful under certain circumstances.
Supposing multiple inputs, we can combine them together like so:
$P(B|A) = \dfrac{P(x_1|B) \cdot P(x_2|B) \cdot \ldots \cdot P(x_n|B) \cdot P(B)}{P(A)}$

This is because we assume that the inputs are independent of one another.
Given $B_1, B_2, ..., B_N$ as possible classes, we want to find the $B_i$ with the highest probability.
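Since the denominator $P(A)$ is identical for every candidate class, we can skip computing it and simply pick the class with the largest numerator; we will see this cancellation again in the golf example below:

$\hat{B} = \underset{B_i}{\arg\max}\ P(B_i) \cdot \prod_{j=1}^{n} P(x_j|B_i)$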
Goal: determine, based on input conditions, whether we should go play golf.
Steps:
Suppose today = {Sunny, Hot, Normal, False}. Let's compare P(golf) versus P(no golf):
$P(Yes|today) = \dfrac{P(Sunny|Yes) \cdot P(Hot|Yes) \cdot P(Normal|Yes) \cdot P(False|Yes) \cdot P(Yes)}{P(today)}$

$P(No|today) = \dfrac{P(Sunny|No) \cdot P(Hot|No) \cdot P(Normal|No) \cdot P(False|No) \cdot P(No)}{P(today)}$
Note the common denominator: because we're comparing P(Yes|today) versus P(No|today), the common denominator cancels out.
Putting this in numbers:
The probability of playing golf:

$P(Yes|today) = \dfrac{2}{9} \cdot \dfrac{2}{9} \cdot \dfrac{6}{9} \cdot \dfrac{6}{9} \cdot \dfrac{9}{14} = 0.0141$

The probability of not playing golf:

$P(No|today) = \dfrac{3}{5} \cdot \dfrac{2}{5} \cdot \dfrac{1}{5} \cdot \dfrac{2}{5} \cdot \dfrac{5}{14} = 0.0068$

Since 0.0141 > 0.0068, it's time to golf!
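As a quick sanity check, here is the same hand calculation in R; the fractions are hard-coded straight from the worked example above:

```r
# Numerators only; the common denominator P(today) cancels out.
p_yes <- (2/9) * (2/9) * (6/9) * (6/9) * (9/14)  # 0.01411
p_no  <- (3/5) * (2/5) * (1/5) * (2/5) * (5/14)  # 0.00686

if (p_yes > p_no) "Play golf" else "Stay home"   # "Play golf"
```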
Our test text: "Threw out the runner"
Goal: determine, based on input conditions, whether we should categorize this as a baseball phrase or a business phrase.
Calculating the prior probability is easy: the count of "Baseball" categories versus the total number of phrases is the prior probability of selecting the Baseball category: $\dfrac{3}{6}$, or 50%. The same goes for Business.
So what are our features? The answer is, individual words!
To calculate $P(threw|Baseball)$, count how many times "threw" appears in Baseball texts and divide by the total number of words in Baseball texts.
The answer here is $\dfrac{1}{18}$.
What about the word "the"? It doesn't appear in any of the baseball texts, so it would have a result of $\dfrac{0}{18}$.
Because we multiply all of the word probabilities together, a single 0 leads us to a total probability of 0%.
But you're liable to see new words in any real data set, so letting a single unseen word zero out the entire probability isn't a good solution.
To fix the zero probability problem, we can apply Laplace smoothing: add 1 to each count so it is never zero. Then, add N (the number of unique words) to the denominator.
There are 29 unique words in the entire data set:
a and bullish fell hitter investors no nobody of on opportunity out percent pitched prices runners second seized shares situation stock the third thirty threw tough up were with
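Plugging in our numbers, with 18 total words in the Baseball texts and $N = 29$:

$P(threw|Baseball) = \dfrac{1 + 1}{18 + 29} = \dfrac{2}{47}$

$P(the|Baseball) = \dfrac{0 + 1}{18 + 29} = \dfrac{1}{47}$

No conditional probability is ever zero, so a single unseen word no longer wipes out the entire product.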
Running the same smoothed calculation for each word in our test phrase, for both categories, Baseball ends up with the higher probability. Baseball is therefore the best category for our phrase.
Ways that we can improve prediction quality:
The naivebayes Package

The naivebayes package is a fast Naive Bayes solver, with built-in versions of the plot and predict functions.
First, we will use the famous iris data set.
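Here is a minimal sketch of how this might look; the seed and the 70/30 train/test split are arbitrary choices on my part:

```r
# A minimal sketch, assuming the naivebayes package is already installed.
library(naivebayes)

set.seed(1773)
# Shuffle iris and hold out roughly 30% of rows as a test set.
shuffled <- iris[sample(nrow(iris)), ]
train <- shuffled[1:105, ]
test  <- shuffled[106:150, ]

# Train a Naive Bayes classifier: predict Species from the other columns.
model <- naive_bayes(Species ~ ., data = train)

# The package's built-in plot shows each feature's distribution by class.
plot(model)

# Predict on the held-out rows (dropping the Species column) and
# compare predictions against the true species.
predictions <- predict(model, newdata = test[, 1:4])
table(predictions, test$Species)
```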
We can use the naivebayes package in R to solve Natural Language Processing problems as well. Just like before, we need to featurize our language samples. Also like before, we will use the bag of words technique to build our corpus.
Unlike before, we will perform several data cleanup operations beforehand to normalize our input data set.
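As a rough sketch of what that pipeline can look like: the training phrases below are hypothetical stand-ins rather than the talk's actual corpus, and multinomial_naive_bayes is one of several specialized model types the naivebayes package provides.

```r
# A toy bag-of-words pipeline; training phrases are made-up stand-ins.
library(naivebayes)

train_text <- c("nobody pitched a no hitter",
                "threw out the runners at second",
                "stock prices fell thirty percent",
                "investors seized the opportunity")
labels <- factor(c("Baseball", "Baseball", "Business", "Business"))

# Normalize the input: lowercase, strip punctuation, split into words.
tokenize <- function(txt) strsplit(gsub("[[:punct:]]", "", tolower(txt)), "\\s+")
tokens <- tokenize(train_text)
vocab  <- sort(unique(unlist(tokens)))

# Build a document-term matrix of word counts, one row per phrase.
# Words outside the training vocabulary are simply dropped.
to_dtm <- function(tok) t(sapply(tok, function(ws) table(factor(ws, levels = vocab))))
dtm <- to_dtm(tokens)

# laplace = 1 applies the same Laplace smoothing we worked through by hand.
model <- multinomial_naive_bayes(x = dtm, y = labels, laplace = 1)

# Classify our test phrase.
predict(model, newdata = to_dtm(tokenize("Threw out the runner")))
```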
After looking at Naive Bayes, you might be interested in a few other algorithms:
The Naive Bayes class of algorithms is simple to understand and reasonably accurate, making it a good starting point for data analysis. There are a number of superior algorithms for specific problems, but starting with Naive Bayes will give you an idea of whether the problem is solvable and what your expected baseline of success should be.
To learn more, go here:
https://CSmore.info/on/naivebayes
And for help, contact me:
feasel@catallaxyservices.com | @feaselkl
Catallaxy Services consulting:
https://CSmore.info/on/contact