Integrated Knowledge Solutions

Faulty LED Display Digit Recognition: Illustration of Naive Bayes Classifier using Excel

Originally published on July 17, 2017.

The Naive Bayes (NB) classifier is widely used in machine learning for its appealing tradeoffs in terms of design effort and performance as well as its ability to deal with missing features or attributes. It is particularly popular for text classification. In this blog post, I will illustrate designing a naive Bayes classifier for digit recognition where each digit is formed by selectively turning on/off segments of a seven segment LED display arranged in a certain fashion as shown below. The entire exercise will be carried out in Excel. By writing Excel formulas and seeing the results in a spreadsheet is likely to result in a better understanding of the naive Bayes classifier and the entire design process.

We will represent each digit as a 7-dimensional binary vector where a 1 in the representation implies the corresponding segment to be on. The representations for all ten digits, 0-9, is shown below. Furthermore, we assume the display to be faulty in the sense that with probability p a segment doesn't turn on(off) when it is supposed to be on(off). Thus, we want to design a naive Bayes classifier that accepts a 7-dimensional binary vector as an input and predicts the digit that was meant to be displayed.

Basics of Naive Bayes

A Naive Bayes (NB) classifier uses Bayes' theorem and independent features assumption to perform classification. Although the feature independence assumption may not hold true, the resulting simplicity and performance close to complex classifiers offer complelling reasons to treat features to be independent. Suppose we have $d$ features, $x_1,\cdots, x_d$, and two classes $ c_1\text{ and } c_2$. According to Bayes' theorem, the probability that the observation vector $ {\bf x} = [x_1,\cdots,x_d]^T$ belongs to class $ c_j$ is given by the following relationship:

$ P(c_j|{\bf x}) = \frac{P(x_1,\cdots,x_d|c_j)P(c_j)}{P(x_1,\cdots,x_d)}, j= 1, 2$

Assuming features to be independent, the above expression reduces to:

$ P(c_j|{\bf x}) = \frac{P(c_j)\prod_{i=1}^{d}P(x_i|c_j)}{P(x_1,\cdots,x_d)}, j= 1, 2$

The denominator in above expression is constant for a given input. Thus, the classification rule for a given observation vector can be expressed as:

Assign

$ {\bf {x}}\rightarrow c_1\text { if }P(c_1)\prod_{i=1}^{d}P(x_i|c_1)\geq P(c_2)\prod_{i=1}^{d}P(x_i|c_2)$

Otherwise assign

$ {\bf {x}}\rightarrow c_2$

For classification problems with C classes, we can write the classification rule as:

$ {\bf {x}}\rightarrow c_j \text{ where } P(c_j)\prod_{i=1}^{d}P(x_i|c_j) > P(c_k)\prod_{i=1}^{d}P(x_i|c_k), k=1,...,C \text{ and } k\neq j$

In case of ties, we break them randomly. The implementation of the above classification rule requires estimating different probabilities using the training set under the assumption that the training set is a representative of the classification problem at hand.

There are two major advantages of the NB classification when working with binary features. First, the naive assumption of feature independence reduces the number of probabilities that need to be calculated. This, in turn, reduces the requirement on the size of training set. As an example, consider the number of binary features to be 10. Without the naive independence assumption, we will need to calculate $ 2^{10}$ (1024) probabilities for each class. With the independent features assumption, the number of probabilities to be calculated per class reduces to 10. Another advantage of NB classification is that it is still possible to perform classification even if one or more features are missing; in such situations the terms for missing features are simply omitted from calculations.

Faulty Display Digit Recognition Steps

In order to design a classifier, we need to have training data. We will generate such data using Excel. To do so, we first enter the seven dimensional representation for each digit in Excel and name the cell ranges for each digit as digit1, digit2 etc. as shown below.

Next, we use Excel's RAND() function to decide whether the true value of a segment should be flipped or not (1 to 0 or 0 to 1). We repeat this as many times as the number of training examples for each digit need to be generated. In discussion here, we will generate 20 examples for each digit. The figure below shows some of the 20 such examples and the Excel formula used to generate them for digit 1. Noiselevel in the formula refers to a cell where we store the probabilty p of a segment being faulty. This value was set to 0.2. Similar formulas are used to generate 20 examples for each digit.

The 200 training examples generated as described are next copied and pasted into a new worksheet. This is the sheet that will be used for designing the classifier. The paste operation is carried out using the "Values Only" option. This is done to avoid anymore changes in the generated noisy examples.

Naive Bayesian Classifier Design

Having generated 200 examples of faulty display digits, we are now ready to design our NB classifier. Designing NB classifier means we need to compute/estimate class priors and conditional probabilities. Class priors are taken as the fraction of examples from each class in the training set. In the present case, all class priors are equal. This means that class priors do not play any role in arriving at the class membership decision in our present example. Thus, we need to estimate only conditional probabilities. The conditional probabilities are the frequencies of each attribute value for each class in our training set. The following relationship provides us with the probability of segment 1 being equal to 1 conditioned on that the digit being displayed is digit 1.

$P(s_{1}=1|digit1) = \frac{\text{count of 1's for segment 1 in digit1 training examples}}{\text{number of digit1 training examples}}$

Since only two possible states, 1 and 0, are possible for each segment, we can calculate the probability of segment 1 being equal to 0 conditioned on that the digit being displayed is digit 1 by the following relationship:

$ P(s_{1}=0|digit1) = 1 - P(s_{1}=1|digit1)$

In practice, however, a correction is applied to conditional probabilities calculations to ensure that none of the probabilities is 0. This correction, known as Laplace smoothing, is given by the following relationship:

$ P(s_{1}=1|digit1) = \frac{1+\text{count of 1's for segment 1 in digit1 training examples}}{2+\text{number of digit1 training examples}}$

Adding 1 to the numerator count ensures probability value doesnot become 0. Adding 2 to the denominator reflects the number of states that are possible for the attribute under consideration. In this case we have binary attributes. Note that in text classification applications, for example in email classification, where we use words in text as attributes, the denominator correction term will be V with V being the number of words in the dictionary formed by all words in the training examples. Also you will find the term Bernoulli NB being used when the feature vector is a binary vector as in the present case, and the term Multinomial NB being used when working with words as features.

Going back to our training set, we are now ready to compute conditional probabilities. The formula for one such computation is shown below along with a set of training examples for digit1. Similar formulas are used to compute the remaining conditional probabilities and the training examples to obtain 70 conditional probabilities needed to perform classification.

Testing the Classifier

Having calculated conditional probabilities, we are now ready to see how well our classifier will work on test examples. For this, we first generate five test examples for each digit following the steps outlined earlier. The test examples are copied and pasted (using the "Value Only" paste option). We also copy the probabilties computed above to the same worksheet where test examples have been pasted, just for convenience. This done, we next write formulas to compute the probabilty for each digit given a test example and the set of conditional probabilities. This is shown below in a partial screenshot of Excel worksheet where the formula shown for calculating the probability of displayed digit being 1 based on the states of seven segments. The references to cells in the formula are where we have copied the table of conditional probabilities.

While the higlighted columns indicate the highest probability value in each row and thus the classification result, the following formula in column "S" results in the classifier output as the label to be assigned to the seven component binary input representing the status of the faulty display.

=MOD(MATCH(MAX(I2:R2),I2:R2,0),10)

Next, comparing the labels in columns H (true label) and S (predicted label) we can generate the confusion matrix to tabulate the performance of the classifier. Doing so results in the following confusion matrix with 80% correct classification rate.

The 80% accuracy is for 20% noise level. If desired we can go back and rerun the entire simulation again for different noise levels and determine how the accuracy varies with varying noise levels.

Finally, it would be nice to have a visual interface where we can input a row number referencing a test example, and display the faulty digit as well as the predicted digit. Such a display can be easily created using conditional formatting and adjusting the shape and size of certain Excel cells (See a post on this). One such display is shown below. By entering a number in the range of 2-51 (50 test examples) in cell AE1, we can pull out the segment values using Indirect function of Excel. For example, the segment value shown in cell W3 in figure below is obtained by the following formula =INDIRECT("A"&$AE$1). Similary, the value in cell X3 is obtained by =INDIRECT("B"&$AE$1), and so on. The segment values in the cell range W3:AC3 are then used in conditional formatting. The predicted digit display is based on segments states corresponding to the predicted digit label read from "S: column for the row number in AE2.

As this exercise demonstrates, the design of a naive Bayes classifier is pretty straightforward. Hopefully working with Excel has provided a better understanding of the steps involved in the entire process of developing a classifier.

Understanding Knowledge Graphs - What They Are and How to Build Them

Knowledge graphs are a way to represent information and relationships between the real world entities, i.e. objects, events, facts etc., in a structured, interconnected way. They provide a powerful framework for capturing the rich semantics and context of data. Knowledge graphs are also called semantic networks.

At its core, a knowledge graph is a type of data model that structures information as entities (nodes) and their relationships (edges). Entities represent real-world objects like people, places, concepts, or events, while the relationships define how these entities are linked. An example of a small knowledge graph is shown below.

Some key characteristics of knowledge graphs are:

- Entities have properties that describe their attributes (e.g. name, date of birth for a person)

- Relationships have types that define how entities are connected (e.g. bornIn, employedBy)

- They provide context by capturing how pieces of data relate to one another

- Information is stored in a graph structure rather than tables

This graph-based approach provides flexibility in modeling complex, multi-dimensional data in a way that can be easily extended as new information emerges.

Applications of Knowledge Graphs

Knowledge graphs have broad applications across many domains:

- Search engines like Google use knowledge graphs to enhance search results with contextualized information

- Recommendation engines leverage knowledge graphs to provide relevant suggestions based on user preferences and behaviors

- Biomedical research uses knowledge graphs to represent relationships between diseases, genes, drugs, and more

- Enterprise knowledge graphs help organizations better understand their data assets and how different systems/processes are interconnected

How to Build a Knowledge Graph

There are generally five key steps in creating a knowledge graph:

1. Data Ingestion - Gathering data from various structured and unstructured sources

2. Entity Extraction - Identifying real-world entities mentioned in the data using techniques like named entity recognition (NER)

3. Relationship Extraction - Determining relationships between entities using methods like pattern matching or machine learning

4. Data Transformation - Converting the extracted triples (entity-relationship-entity) into the proper graph data model format

5. Graph Population - Loading the transformed data into the graph database

There are many different knowledge graph construction toolkits and frameworks available depending on your use case and data needs. Some popular open source options include:

- Stanford's CoreNLP for NLP processing

- OpenKBC for end-to-end knowledge graph creation

- Python libraries like rdflib and owlready2 for working with semantic web standards

Knowledge graphs unlock new opportunities for data integration, analysis, and discovery. As our ability to capture, process, and connect information continues growing, knowledge graphs will become an increasingly vital tool.

What is Neuro-Symbolic AI and Why do we Need it?

Neuro-symbolic AI, also known as symbolic neural integration or hybrid AI, has been in news lately. It is an approach that combines symbolic reasoning and neural networks to create more robust and flexible artificial intelligence systems. This integration aims to leverage the strengths of both symbolic and neural approaches, addressing some of the limitations each paradigm faces independently.

Let's look at the following image to understand the need for the neuro-symbolic approach. The deep learning model looking at the input image comes to a conclusion that the image depicts a cow that is 5.2 meter long. Obviously, the system has been trained to recognize cows and other objects but is unable to reason that cows are not that long. This is not surprising as deep learning models are excellent at recognizing features but have no capability to understand or draw conclusions based on the spatial relationships between the detected features. One way to provide such capabilities is to integrate symbolic reasoning along with the feature detection and pattern recognition capabilities of deep neural networks, and thus the interest in the neuro-symbolic approach.

The ways Neuro-Symbolic AI can help

Improved Reasoning: Integrating symbolic reasoning with neural networks can enhance AI's ability to understand and apply common sense knowledge, making it more adept in real-world scenarios.
Medical Diagnosis: Neuro-symbolic approaches can be applied to medical diagnosis, where symbolic rules can be used to represent expert knowledge, and neural networks can learn from diverse patient data to improve diagnostic accuracy.
Robotics and Autonomous Systems: Combining symbolic reasoning with neural networks can enhance the decision-making capabilities of robots, allowing them to navigate complex environments and handle uncertainty.
Natural Language Understanding: Neuro-symbolic AI can be used to improve natural language understanding systems by combining semantic representation with the ability to learn from diverse textual data.
Financial Fraud Detection: Integrating symbolic rules for detecting fraud patterns with neural networks that can adapt to new, emerging fraud trends can enhance the accuracy and robustness of fraud detection systems

Challenges and Approaches

Integrating the neural and symbolic components in a principled and effective way remains a key challenge in neuro-symbolic AI. Several promising approaches are being explored:

Neural Networks with Symbolic Inductive Biases: This approach aims to build neural architectures with inductive biases derived from symbolic AI, such as logical operations, algebraic reasoning, or compositional structure. For example, graph neural networks can inject relational inductive biases, while neural programmer-interpreters combine neural modules with program-like operations.
Augmenting Neural Models with External Memory/Reasoning: Rather than directly encoding symbolic knowledge in the neural architecture, another approach uses external memory or reasoning components that interact with a neural network. Examples include memory augmented neural networks, neural semantic encoders, and neural-symbolic concept learners.
Differentiable Inductive Logic Programming: Inspired by inductive logic programming (ILP) in symbolic AI, this approach uses neural networks to learn first-order logic rules or program-like primitives in a differentiable way that integrates symbolic reasoning into end-to-end training.
Learning Symbolic Representations from Data: Techniques in this area aim to have neural networks automatically learn disentangled, symbolic-like representations from raw data in an unsupervised or self-supervised manner. Examples include sparse factor graphs, conceptor-based models, and neuro-symbolic concept invention.

For Further Exploration

You can check the following links for further exploration of this topic:

Despite progress, several key challenges remain, including scaling symbolic reasoning, imposing harder constraints in neural architectures, achieving systematic compositionality, and maintaining interpretability as models get larger. Hybrid neuro-symbolic approaches hold promise but will require significant research efforts across multiple fields.