bibliography | |
---|---|
|
We have looked at the application of Bayesian approaches to linear
regression. We now turn to its application to classification. This is
the problem of assigning a class to a point in D-dimensional space
given by the coordinates . In this section we seek to
validate the use of a Nave Bayesian classifier to solve the problem of
optical character recognition. This is where an image of a character,
which is often hand-written, is used as an input. The algorithm must
then determine what that character is. The Nave Bayesian classifier is a
simple algorithm. However, this algorithm can provide effective
results.
To test the effectiveness of this algorithm, we now turn to the MNIST
dataset[@lecun]. This consists of 60 000 handwritten digits of numbers,
between 0 and 9, and their corresponding labels. These can be used to
train the algorithm. It also provides a further 10 000 digits that can
be used to test the effectiveness of the developed algorithm. A benefit
of the MNIST dataset, is that all digits have been shifted to the centre
of the image and scaled to a set size. This reduces the amount of
preprocessing required to handle the digits, allowing more focus to be
put on the development of the classification algorithm. Sample digits
from the MNIST dataset are shown in figure [fig:P2:MNIST].
In order to apply the Nave Bayesian classifier to the MNIST dataset, we first need features which can be extracted from the characters. We now consider various features and provide motivation for their use.
{width="9.00000%"} {width="9.00000%"} {width="9.00000%"} {width="9.00000%"} {width="9.00000%"} {width="9.00000%"} {width="9.00000%"} {width="9.00000%"} {width="9.00000%"} {width="9.00000%"}
The ark length is defined as the length of the longest continuous set of pixels, which have a value greater than a threshold. This feature was considered in the hopes that each character has a different ark length. This can be seen easily in the character “1” which has a very short length when compared to other characters. It was decided to provide various bins for this data rather than processing it as continuous Gaussian data. This is because this descriptor will be multi-modal. Providing bins for the data, we are better able to capture aspects of this multi-modal distribution. We can analyse the effectiveness of this descriptor by considering Kononenko’s information gain. This is defined as follows:
Here we have defined the base of the logarithm as . It is common practice to use , as this can denote the effective number of bits gained in information. However, we will use , where is the number of classes. This choice of , provides an upper bound to the information gain as . It is therefore easier to interoperate. An alternative view of this, is to determine the information gain of a class given a predicted class . The information gain formula now becomes: Using the training dataset to compute the information gain yields the following result:From , we can see that the ark length is good at
determining if a digit belongs to the class ’1’. It is also reasonable
at determining if a digit belongs to class ’5’.
We can also plot the ark length versus the count of digits in each
class. This provides a visual on how well the descriptor splits the
data. This is shown in figure [fig:P2:arkLenIG] and confirms that this
descriptor is good at determining if a digit is a ’1’.
The ark length alone does not provide excellent results but it does
split the data to a degree. Therefore, this feature is suitable to be
used in the Nave Bayesian classifier, provided it is used amongst other
descriptors.
The next feature considered is the number of pixels enclosed by a contour. This concept is explained with figure [fig:P2:Enclosed0], where the area enclosed is coloured in red. The reason for choosing this descriptor is to try and distinguish between numbers enclosing a space and those that don’t. We also hope to distinguish various characters by how much space they enclose. This can be seen when considering a ’0’ and a ’6’. More often than not, a ’0’ encloses more total space than a ’6’.\
The information gain for this feature is given below:
This feature does exactly as expected and provides a large information gain to characters which encircle pixels This feature assigns the same value to characters that don’t enclose a space and hence don’t provide much information. Figure \[fig:P2:IGEnclosed\] provides a visual depiction on how this feature splits the data. The bin associated with 0 pixels has been removed as a large portion of the data was contained in this bin. Due to this descriptors ability to differentiate between digits which encircle an area, it is a suitable feature to use in the Nave Bayesian classifier.The number of contours is defined as the number of continuous edges of the digit. This concept is shown in figure [fig:P2:8Cont], where each contour is highlighted in orange. The total number of contours in this image is 3. Small contours are removed from this count to avoid counting fullstops which are present in some of the images. This feature is included to attempt to distinguish ’8’ better from other digits. This is useful as many descriptors struggle to differentiate between ’3’ and ’8’.\
The information gain of this algorithm is given in equation [eqn:P2:IGNOC], and a histogram containing the number of classes, for each value of this descriptor, is given in feature [fig:P2:IGCont]. Both of these show that this feature is indeed able to identify an ’8’ to some degree. Therefore, this feature will provide useful information to the Nave Bayesian classifier.
This feature moves a window over the image and determines the magnitude
and direction of the gradients in that window. It then builds a
histogram containing bins for various gradient directions. This feature
is used in various human detection algorithms. This descriptor was used
in [@ebrahimzadeh2014efficient] for OCR. This algorithm was tested on
the MNIST and obtained an accuracy of . Due to this, this
feature is of great interest.
We are able to use this feature by considering each bar as a separate
feature which can be used by the algorithm. One can also adjust the
window size in order to change the granularity of the search. Smaller
windows tend to provide a better accuracy up to a point but require more
time to compute and more space to store.
The information gain of all the features produced by the HOG algorithm
is given in equation [eqn:P2:IGHOG]. Samples of some of the features
produced by the algorithm are also provided in figure [fig:P2:IGHOG].
This feature yields excellent results and could even be used by itself.
{width="49.00000%"} {width="49.00000%"} {width="49.00000%"} {width="49.00000%"}
Image moments provide a numerical value for how the image intensities are distributed. The following equation, given in [@1057692], is used to calculate image moments:
Where $M_{ij}$ is given in equation \[eqn:P2:Mij\] and is the pixel intensity at position .The information gain of this feature is given in equation [eqn:P2:IGGrad]. Figures [fig:P2:IGGrad] and [fig:P2:IGGrad2] provides the histograms containing the likelihood of each moment considered. This descriptor also provides excellent results in splitting the data.
{width="49.00000%"} {width="49.00000%"} {width="49.00000%"} {width="49.00000%"}
{width="49.00000%"} {width="49.00000%"} {width="49.00000%"} {width="49.00000%"}
In order to classify he data, we look to derive the probability . This is easily determined using Bayes rule as follows:
If we now assume that the prior probability is relatively flat over all the classes, we can simplify this to:This result can now be extended to multiple features as follows:
This can be further simplified, as shown in figure \[eqn:P3:indPxGx\], if we assume that each feature is independent. This requires that features should not depend on similar data, which is not always practically possible. However, if the dependence between features are kept minimal, the mutual independence assumption provides a decent approximation to the true distribution. It is important to note that this algorithm only works in one dimension at a time. Better results could be achieved if the relations between multiple features where considered together in a D-dimensional space. However, this would require more computation power and more complex mechanisms. Hence there is a trade-off between accuracy and complexity. However, the accuracy gain can often be minimal.\ We now consider the evaluation of . There are 2 important cases to consider when determining this probability from the training data. These two cases are when is discrete and when is continuous. We first consider the case where is discrete and can take on one of K values, formally denoted . The can be simply given as: Where is the count of the data observed having value and class . We have also defined as the count of data observed belonging to class .\ We now turn our attention to the case of as a continuous variable. In order to evaluate , we need to make an assumption on the expected form of this probability distribution. We assume that the data takes on the form of a Gaussian distribution. This distribution appears often due to the central limit theory. However, various features can have a multi-modal distribution. Therefore, it is possible to miss some detail in the distribution by making this assumption. It still produces excellent results when put to the test. With the assumption on the form of , we can show that: Where we can use the following equations to estimate and from the training data within each class: We can the normalise over all the classes to calculate the value of . It is important to notice the minus 1 in the denominator of equation \[eqn:P2:sigma\], as this accounts for the degree of freedom used when fitting the mean to the data.There are 2 main stages in the Nave Bayesian classifier. Firstly, the algorithm needs to be trained. We then need to make predictions with the trained models. The implementation of these are considered separately below.
Discrete variables can be trained with the use of a matrix which has the classes across the columns and the across the rows. Each new observation in the training set then just increments the counter in the relevant row and column. One important adjustment is to set the initial count of this matrix to 1. This is done to avoid a ’0’ probability occurring which can greatly influence the overall probability. The initialisation and training for discrete variable are given in algorithms [alg:DI] and [alg:DT].
When training continuous variables, we need to consider all the data as a whole to calculate the mean and variance. Due to this, the training is split into two parts. The first part collects all the data associated with each class and the second part calculated the mean and average within each class. Algorithms [alg:CI], [alg:CO] and [alg:CT] show how this could be implemented.
We now show how the trained classifiers can be used to make predictions. To do this we first need to calculate the for both discrete and continuous variables. Algorithms [alg:DP] and [alg:CP] provide possible pseudo code for achieving this. One we know the probability for each input , we can simply multiply all of these together to get the total probability . Due to equation [eqn:cGxpxGc], we can now assign the point given by by the class that gives the maximum value for .
To test this algorithm, we will use MNIST dataset. Features will be generated for both the test set and the training set which contains 10 000 digits and 60 000 digits respectively. The generated features of the training set and their associated labels are then used to train the classifier. Thereafter, we will use the training set features to classify each digit. It is important that we use a separate dataset for testing and training as this allows us to test for over-fitting. In this section, the tests performed will be described.
Aim
This experiment aims to acquire the mean accuracy of the system and also
provide a confusion matrix. The confusion matrix will be used to provide
insight into what the algorithm struggles to classify. This could be
used to make later improvements.
Expectations
One could expect to see many misclassification of the digit ’7’ and ’9’
as these two digits are very similar in structure. There is also very
little structural difference between an ’8’ and a ’0’.
Procedure
The following is performed in order to obtain the results:
-
A 1010 matrix is initialised to 0. This is used as the confusion matrix.
-
An item containing the features of a single digit is read from a file.
-
This digit is classified setting the most likely class as the predicted class.
-
The actual class is also loaded from the file.
-
The matrix value at the index provided by the actual label and the predicted label is incremented by 1.
-
This process is repeated for all the test digits provided by the MNIST dataset.
-
The mean accuracy is then calculated as
$100\times\text{Trace}(\textit{confusion matrix})/\text{Sum}(\textit{confusion matrix})$
Aim
This experiment aims to provide an indication on how accurate this
classifier would work on various other datasets. This is achieved by
sub-sampling the test dataset and testing the variance of the
sub-samples.
Expectations
One would expect that the classifier would yield similar results for
similar datasets.
Procedure
The following is performed in order to obtain the results:
-
Two empty lists are initialised
-
An item containing the features of a single digit is read from a file.
-
The probabilities associated with this digit belonging to the various classes are determined as described in section [sec:P2:theorAnal].
-
A ’1’ is appended to an array if the correct classification was made, else a ’0’ is appended.
-
We also obtain the classifies probability of the actual label being correct. This is done by indexing the probability array with the actual label. This is also appended to a list
-
This process is repeated for all the test digits provided by the MNIST dataset.
-
We now take 1000 random test samples from the first array and count the total number of correctly classified digits (1’s). It is important to note that the 1000 samples are taken from a uniform distribution with no repetition in digits.
-
This number is then divided by 10 and appended to a list to get the accuracy in percentage.
-
We also take 1000 random samples out of the second list in a similar fashion.
-
The sum of these probabilities are calculated and divided by 10 to get how sure the classifier is of the correct decision, on average, in percentage.
-
This sampling process is repeated 100 000 times.
-
The respective means, medians, quartiles, minimums and maximums can be calculated. These provide insight into how the accuracy varies with a variation in the test dataset.
The confusion matrix for this classifier is given in table [tbl:Conf] and the overall accuracy is determined to be .
0 1 2 3 4 5 6 7 8 9
0 915 2 13 8 0 4 15 0 20 3
1 0 1075 19 2 13 1 0 5 19 1
2 9 1 956 29 3 2 0 4 27 1
3 0 2 26 925 2 23 1 7 16 8
4 0 7 27 0 892 2 10 8 5 31
5 7 0 7 50 3 777 10 2 32 4
6 10 10 7 1 9 26 881 0 14 0
7 0 6 35 10 32 0 0 842 10 93
8 30 9 23 26 9 16 2 11 828 20
9 9 9 17 10 8 8 0 26 16 906
: Confusion matrix for the Nave Bayesian classifier[]{data-label="tbl:Conf"}
Figure [IMG:P2:BoxAndWhisacar] shows the box and whisker diagram for the randomly sampled data. In this experiment, we obtain a mean accuracy of and a mean certainty of .
[ ytick=[1,2]{}, yticklabels=[Certainty, Accuracy]{}, ] +[ boxplot prepared=[ median=89.93, upper quartile=90.53, lower quartile=89.33, upper whisker=93.57, lower whisker=85.97 ]{}, ] coordinates ; +[ boxplot prepared=[ median=90.0, upper quartile=90.6, lower quartile=89.4, upper whisker=93.9, lower whisker=85.3 ]{}, ] coordinates ;
The Nave Bayesian classifier was able to take a set of input variables
and assign classes to a set of target variables with a high accuracy.
This is very impressive when considering the simplicity of the
algorithm. However, it would be possible to achieve a higher accuracy if
we where able to select features which where completely independent.
This would be equivalent to warping the D-Dimensional space to an axis
which splits the data the most.
It is interesting to see how each feature contributes to the
effectiveness of the algorithm. When analysing the information gain, we
can see that the ark length is good at separating 1’s and 5’s out from
other digits. The detection of the digit ’1’ can be explain by its
relatively short ark length. The digit ’5’ on the other hand is due to
its consistent ark length when compared to other digits like ’4’. The
next feature that can be considered is the area enclosed. This feature
is best at detecting 0’s due to the large area enclosed in the centre.
Using he number of contours as a feature, we could detect 8’s really
well. This is because the figure ’8’ usually contains 3 distinct
contours. These features provided minimal information gain when compared
to the HOG descriptor and the moments of the image. This is due to the
amount of information that can be captured by these features.
Due to the confusion matrix, it is possible to determine where the
algorithm excelled and where improvements could be made. A large portion
of the invalid predictions are due to 7’s getting detected as 9’s and
visa versa. This amounts to a total of 119 misclassified images, which
is equivalent to . When considering the similarity of a ’7’ and
a ’9’, this result could be expected. This effect is magnified by the
various handwriting styles, where a ’7’ could be drawn with
imperfections which could make it visually similar to a ’9’. Looking
further into the confusion matrix, a better accuracy could be achieved
if more features were tailored to detecting 8’s and 9’s, as most of the
misclassification’s occur in these columns.
In experiment 2, the consistency of the classifier is tested by using
smaller random samples of the bigger test set. Assuming that the larger
test set contains enough data-points to represent all possible data
which could be provided to the classifier, we can assume that the
variance of the algorithm within the dataset, will approximate the
variance of the algorithm in a real-world environment. Therefore, figure
[IMG:P2:BoxAndWhisacar] shows us that the algorithm will most likely
have an accuracy of at least for most real world data. However,
with the MNIST dataset, the classifier obtained an average accuracy
. This is poor when compared to recent algorithms, which can
achieve error rates of only [@lecun]. However, these modern
algorithms can take advantage of multiple layers of abstraction,
creating linear separations to complex relationships between the data.
In this paper, we have seen how probability theory can be used to
quantify uncertainty. We have also seen how this theorem is applicable
to many real-life scenarios and how they can be used to predict the
likelihood of various events occurring. We also see how ones intuition
on probabilities can often be incorrect.
We continue to explore the application of probability theory to the
problem of regression. This is compared to a frequentest approach, where
the advantage of a probabilistic approach become apparent. One can see
that a probabilistic approach avoids the problem of over fitting, which
is present when using a frequentest approach. This can be seen when
comparing figures [fig:E3:Or9:LSQ] and [fig:E3:Or9:Bays]. One can
also see that, a probabilistic approach, allows one to determine the
level of certainty that the data will be contained within a certain
range. It is interesting to not that the certainty decreases when there
is a lack of data around a given range of input variables. This can be
seen in figure [fig:E3:Or9:Bays:RandRem], where observations have been
removed. This creates a distribution which is more broad around the area
of the missing data. One would expect this to occur as the algorithm has
no information about the shape of the curve in the proximity of these
input variables. The final advantage of using a probabilistic approach
lies in model selection. Given a list of possible models, a
probabilistic approach allows one to select the most probable model
based on the training data alone. This is not possible using a
frequentest approach and would require a separate testing dataset to
avoid over fitting. This is a large waste of data which could be used
for further training.
Lastly, we look at how a probabilist approach could be taken to data
classification. A Nave Bayesian classifier was developed to solve the
problem of optical character recognition (OCR). This classifier needs
either discrete or continuous data from an image. In order to achieve
this, various feature extraction methods where explored and their use
validated. The classification algorithm could then be tested on the
MNIST dataset[@lecun], where it achieved an average accuracy of
.
Probability theory provides a powerful framework, whereby predictions can be made from previous observations. Therefore, this is an essential tool in artificial intelligence and forms the basis for the majority of machine learning algorithms. It is undeniable that probabilistic approaches provide significant benefits when used for regression and classification. We see that problems such as over-fitting, model selection and certainty estimation can be solved with probabilistic approaches to linear regression. Finally, we see that a probabilistic approach to the problem of classification, can provide a simplistic but effective approach to a very complex problem.