Precision, Recall and F1
FUNDAMENTALS



Surender Singh
Aug 26, 2025
When we evaluate machine learning models, accuracy is not always enough. Two other metrics – precision and recall – often matter more, especially when the cost of mistakes is high. Let’s start with the basics.
The Definitions
Precision
Precision tells us: Of all the things the model flagged as positive, how many were actually correct?
Precision = True Positives / (True Positives + False Positives), or TP / (TP + FP)
Recall
Recall tells us: Of all the actual positives, how many did the model manage to catch?
Recall = True Positives / (True Positives + False Negatives), or TP / (TP + FN)
F1 Score
The F1 score balances both precision and recall by taking their harmonic mean.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
This is useful when you need a single number that accounts for both catching all positives and minimizing false alarms.
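If it helps to see the formulas as code, here is a minimal sketch in Python; the counts at the bottom are made up purely for illustration.

```python
def precision(tp: int, fp: int) -> float:
    """Of everything the model flagged as positive, what fraction was correct?"""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of all actual positives, what fraction did the model catch?"""
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Made-up counts, just to exercise the formulas
p = precision(tp=8, fp=2)   # 0.80
r = recall(tp=8, fn=4)      # ~0.67
print(f"precision={p:.2f}, recall={r:.2f}, f1={f1(p, r):.2f}")
```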
To get more clarity, let’s break it down using the concepts of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
The Confusion Matrix: A Quick Overview
Imagine your model is predicting whether an email is spam or not spam. Each prediction can fall into one of four categories:
Outcome: True Positive (TP)
What it means: Model predicts spam, and it really is spam
Example: Email is spam → predicted as spam
Notes: Correct detection
Outcome: True Negative (TN)
What it means: Model predicts not spam, and it really isn’t spam
Example: Email is not spam → predicted as not spam
Notes: Correct rejection
Outcome: False Positive (FP)
What it means: Model predicts spam, but it isn’t spam
Example: Normal email → wrongly flagged as spam
Notes: Type I error
Outcome: False Negative (FN)
What it means: Model predicts not spam, but it actually was spam
Example: Spam email → missed as not spam
Notes: Type II error
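If you want to pull these four counts out of a model’s predictions programmatically, here is a small sketch using scikit-learn’s confusion_matrix on made-up spam labels (1 = spam, 0 = not spam):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for ten emails: 1 = spam, 0 = not spam
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # what the emails really are
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]   # what the model predicted

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")   # TP=3, TN=5, FP=1, FN=1
```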
The Wildfire Drone Operator
Imagine you’re a drone operator tasked with wildfire surveillance. You receive satellite pings about possible fires in remote forests. Your mission: fly the drone, scan the target, and decide whether it’s an actual wildfire.
Your role is critical:
If you miss a real wildfire (False Negative), the fire spreads, causing devastating damage.
If you raise too many false alarms (False Positive), firefighters rush to unnecessary sites, stretching resources thin.
This is exactly the tradeoff machine learning models face when balancing precision and recall.
Let’s look at another example.
The Confusion Matrix: An Airport Scanner
Think of an airport scanner that checks bags for prohibited items (like knives). Each bag either really has a prohibited item (positive) or it doesn’t (negative). The scanner can either flag it or clear it.
Confusion Matrix Term: True Positive (TP)
Meaning: Bag has a knife and scanner flags it
Example: 1 bag contains a knife, and the machine alerts security
Impact: Correct catch
Confusion Matrix Term: True Negative (TN)
Meaning: Bag does not have a knife and scanner clears it
Example: 50 clean bags, and the scanner lets them pass
Impact: Correct rejection
Confusion Matrix Term: False Positive (FP)
Meaning: Bag does not have a knife, but scanner flags it
Example: Metal water bottle mistaken for a knife
Impact: False alarm (not dangerous, but inconvenient)
Confusion Matrix Term: False Negative (FN)
Meaning: Bag has a knife, but scanner does not flag it
Example: 1 bag has a knife, but the machine misses it
Impact: Serious error (dangerous item slips through)
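Turning the scanner story into numbers (assuming one knife caught, 50 clean bags cleared, one false alarm for the water bottle, and one knife missed), a quick sketch shows how the metrics diverge:

```python
# Counts read off the scanner example, assuming a single false alarm (the water bottle)
tp, tn, fp, fn = 1, 50, 1, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)   # ~0.96 – looks excellent
precision = tp / (tp + fp)                   # 0.50 – half the alarms are false
recall = tp / (tp + fn)                      # 0.50 – half the knives slip through
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```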
Precision in the Drone Game
Suppose your drone flagged 10 hotspots as wildfires.
7 turned out to be actual wildfires (True Positives).
3 were just campfires mistaken for wildfires (False Positives).
Your precision is:
Precision = 7 / (7 + 3) = 0.7
In other words, when you raise an alarm, you’re right 70% of the time.
High precision means you’re rarely crying wolf. But it doesn’t tell us whether you’re catching all the fires out there.
Recall in the Drone Game
Now imagine there were 12 real wildfires in total that day.
You correctly caught 7 (True Positives).
You missed 5 (False Negatives).
Your recall is:
Recall = 7 / (7 + 5) = 0.58
So, you managed to detect only 58% of the actual wildfires. High recall would mean catching almost every real fire, even if that means sometimes mistaking campfires for wildfires.
Balancing the Two: F1 Score
If we combine the numbers:
Precision = 0.7
Recall = 0.58
Then your F1 score is:
F1 = 2 * (0.7 * 0.58) / (0.7 + 0.58) ≈ 0.63
This gives you a more holistic view: you’re doing okay, but both missing fires and raising false alarms need improvement.
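You can reproduce these numbers with scikit-learn by constructing synthetic labels that yield 7 true positives, 3 false positives, and 5 false negatives (the number of true negatives is arbitrary here and does not affect these three metrics):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Synthetic labels arranged to give 7 TP, 3 FP, 5 FN (and 5 TN, which these metrics ignore)
y_true = [1] * 7 + [0] * 3 + [1] * 5 + [0] * 5
y_pred = [1] * 7 + [1] * 3 + [0] * 5 + [0] * 5

print(round(precision_score(y_true, y_pred), 2))   # 0.7
print(round(recall_score(y_true, y_pred), 2))      # 0.58
print(round(f1_score(y_true, y_pred), 2))          # 0.64 (the ~0.63 above used rounded inputs)
```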
What does it all mean?
High Precision, Low Recall: You almost never cry wolf, but you miss a lot of real fires.
High Recall, Low Precision: You catch nearly every fire, but also mistake every campfire, barbecue, or sunset glow for wildfires.
High F1: You strike a balance, ensuring both efficiency and safety.
Key Takeaways
Precision and recall aren’t abstract math – they’re real tradeoffs with life-or-death consequences. As the wildfire drone operator, your decisions affect forests, communities, and emergency responders.
Similarly, in AI systems – whether detecting fraud, diagnosing disease, or scanning for wildfires – choosing between precision and recall depends on what costs you can afford:
Missing positives, or
Triggering false alarms.
The goal isn’t always perfect precision or perfect recall—but the right balance for the problem at hand.
Next time you see a model boasting “95% accuracy,” ask yourself: But what about precision and recall?
Why Accuracy Alone Can Mislead
Let’s say you’re building a model to detect rare diseases. If only 1 out of 100 people has the disease, a model that always predicts healthy will be 99% accurate. Sounds impressive - but the model never actually detects the disease. This shows why we can’t rely only on accuracy.
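A tiny sketch of that failure mode, with 1 sick patient out of 100 and a model that always predicts “healthy”:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 100 people, 1 actually sick (label 1); the model always predicts "healthy" (0)
y_true = [1] + [0] * 99
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))                    # 0.99 – looks impressive
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0  – no positive call was ever right
print(recall_score(y_true, y_pred))                      # 0.0  – the one sick patient is missed
```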
Let’s take another example: imagine you are designing a model to detect defective lightbulbs in a factory. If only a handful of bulbs in every thousand are defective, a model that labels every bulb as fine will again score near-perfect accuracy while catching zero defects. Precision and recall on the defective class expose the failure that accuracy hides.

