Precision, Recall and F1
FUNDAMENTALS



Surender Singh
Aug 26, 2025
When we evaluate machine learning models, accuracy is not always enough. Two other metrics – precision and recall – often matter more, especially when the cost of mistakes is high. Let’s start with the basics.
The Definitions
Precision
Precision tells us: Of all the things the model flagged as positive, how many were actually correct?
Precision = True Positives / (True Positives + False Positives), or TP / (TP + FP)
Recall
Recall tells us: Of all the actual positives, how many did the model manage to catch?
Recall = True Positives / (True Positives + False Negatives), or TP / (TP + FN)
F1 Score
The F1 score balances both precision and recall by taking their harmonic mean.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
This is useful when you need a single number that accounts for both catching all positives and minimizing false alarms.
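If it helps to see the formulas as code, here is a minimal sketch in Python; the counts at the bottom are made up purely for illustration.

```python
def precision(tp: int, fp: int) -> float:
    """Of everything the model flagged as positive, what fraction was correct?"""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of all actual positives, what fraction did the model catch?"""
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Made-up counts, just to exercise the formulas
p = precision(tp=8, fp=2)   # 0.80
r = recall(tp=8, fn=4)      # ~0.67
print(f"precision={p:.2f}, recall={r:.2f}, f1={f1(p, r):.2f}")
```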
To get more clarity, let’s break it down using the concepts of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
The Confusion Matrix: A Quick Overview
Imagine your model is predicting whether an email is spam or not spam. Each prediction can fall into one of four categories:
Outcome: True Positive (TP)
What it means: Model predicts spam, and it really is spam
Example: Email is spam → predicted as spam
Notes: Correct detection
Outcome: True Negative (TN)
What it means: Model predicts not spam, and it really isn’t spam
Example: Email is not spam → predicted as not spam
Notes: Correct rejection
Outcome: False Positive (FP)
What it means: Model predicts spam, but it isn’t spam
Example: Normal email → wrongly flagged as spam
Notes: Type I error
Outcome: False Negative (FN)
What it means: Model predicts not spam, but it actually was spam
Example: Spam email → missed as not spam
Notes: Type II error
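If you want to pull these four counts out of a model’s predictions programmatically, here is a small sketch using scikit-learn’s confusion_matrix on made-up spam labels (1 = spam, 0 = not spam):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for ten emails: 1 = spam, 0 = not spam
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # what the emails really are
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]   # what the model predicted

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")   # TP=3, TN=5, FP=1, FN=1
```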
The Wildfire Drone Operator
Imagine you’re a drone operator tasked with wildfire surveillance. You receive satellite pings about possible fires in remote forests. Your mission: fly the drone, scan the target, and decide whether it’s an actual wildfire.
Your role is critical:
If you miss a real wildfire (False Negative), the fire spreads, causing devastating damage.
If you raise too many false alarms (False Positive), firefighters rush to unnecessary sites, stretching resources thin.
This is exactly the tradeoff machine learning models face when balancing precision and recall.
Let’s look at another example.
The Confusion Matrix: An Airport Scanner
Think of an airport scanner that checks bags for prohibited items (like knives). Each bag either really has a prohibited item (positive) or it doesn’t (negative). The scanner can either flag it or clear it.
Confusion Matrix Term: True Positive (TP)
Meaning: Bag has a knife and scanner flags it
Example: 1 bag contains a knife, and the machine alerts security
Impact: Correct catch
Confusion Matrix Term: True Negative (TN)
Meaning: Bag does not have a knife and scanner clears it
Example: 50 clean bags, and the scanner lets them pass
Impact: Correct rejection
Confusion Matrix Term: False Positive (FP)
Meaning: Bag does not have a knife, but scanner flags it
Example: Metal water bottle mistaken for a knife
Impact: False alarm (not dangerous, but inconvenient)
Confusion Matrix Term: False Negative (FN)
Meaning: Bag has a knife, but scanner does not flag it
Example: 1 bag has a knife, but the machine misses it
Impact: Serious error (dangerous item slips through)
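Turning the scanner story into numbers (assuming one knife caught, 50 clean bags cleared, one false alarm for the water bottle, and one knife missed), a quick sketch shows how the metrics diverge:

```python
# Counts read off the scanner example, assuming a single false alarm (the water bottle)
tp, tn, fp, fn = 1, 50, 1, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)   # ~0.96 – looks excellent
precision = tp / (tp + fp)                   # 0.50 – half the alarms are false
recall = tp / (tp + fn)                      # 0.50 – half the knives slip through
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```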
Precision in the Drone Game
Suppose your drone flagged 10 hotspots as wildfires.
7 turned out to be actual wildfires (True Positives).
3 were just campfires mistaken for wildfires (False Positives).
Your precision is:
Precision = 7 / (7 + 3) = 0.7
In other words, when you raise an alarm, you’re right 70% of the time.
High precision means you’re rarely crying wolf. But it doesn’t tell us whether you’re catching all the fires out there.
Recall in the Drone Game
Now imagine there were 12 real wildfires in total that day.
You correctly caught 7 (True Positives).
You missed 5 (False Negatives).
Your recall is:
Recall = 7 / (7 + 5) = 0.58
So, you managed to detect only 58% of the actual wildfires. High recall would mean catching almost every real fire, even if that means sometimes mistaking campfires for wildfires.
Balancing the Two: F1 Score
If we combine the numbers:
Precision = 0.7
Recall = 0.58
Then your F1 score is:
F1 = 2 * (0.7 * 0.58) / (0.7 + 0.58) ≈ 0.63
This gives you a more holistic view: you’re doing okay, but both missing fires and raising false alarms need improvement.
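You can reproduce these numbers with scikit-learn by constructing synthetic labels that yield 7 true positives, 3 false positives, and 5 false negatives (the number of true negatives is arbitrary here and does not affect these three metrics):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Synthetic labels arranged to give 7 TP, 3 FP, 5 FN (and 5 TN, which these metrics ignore)
y_true = [1] * 7 + [0] * 3 + [1] * 5 + [0] * 5
y_pred = [1] * 7 + [1] * 3 + [0] * 5 + [0] * 5

print(round(precision_score(y_true, y_pred), 2))   # 0.7
print(round(recall_score(y_true, y_pred), 2))      # 0.58
print(round(f1_score(y_true, y_pred), 2))          # 0.64 (the ~0.63 above used rounded inputs)
```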
What does it all mean?
High Precision, Low Recall: You almost never cry wolf, but you miss a lot of real fires.
High Recall, Low Precision: You catch nearly every fire, but also mistake every campfire, barbecue, or sunset glow for wildfires.
High F1: You strike a balance, ensuring both efficiency and safety.
Key Takeaways
Precision and recall aren’t abstract math – they’re real tradeoffs with life-or-death consequences. As the wildfire drone operator, your decisions affect forests, communities, and emergency responders.
Similarly, in AI systems – whether detecting fraud, diagnosing disease, or scanning for wildfires – choosing between precision and recall depends on what costs you can afford:
Missing positives, or
Triggering false alarms.
The goal isn’t always perfect precision or perfect recall—but the right balance for the problem at hand.
Next time you see a model boasting “95% accuracy,” ask yourself: But what about precision and recall?
Why Accuracy Alone Can Mislead
Let’s say you’re building a model to detect rare diseases. If only 1 out of 100 people has the disease, a model that always predicts healthy will be 99% accurate. Sounds impressive - but the model never actually detects the disease. This shows why we can’t rely only on accuracy.
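A tiny sketch of that failure mode, with 1 sick patient out of 100 and a model that always predicts “healthy”:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 100 people, 1 actually sick (label 1); the model always predicts "healthy" (0)
y_true = [1] + [0] * 99
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))                    # 0.99 – looks impressive
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0  – no positive call was ever right
print(recall_score(y_true, y_pred))                      # 0.0  – the one sick patient is missed
```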
Let’s take another example: imagine you are designing a model to detect defective lightbulbs in a factory. If only a handful of bulbs in every thousand are defective, a model that labels every bulb as fine will again score near-perfect accuracy while catching zero defects. Precision and recall on the defective class expose the failure that accuracy hides.

