– Sujoy De
Let's begin with the obvious question: why do we need a confusion matrix?
There are several metrics by which we can evaluate a classification problem, and accuracy is one of the most popular.
Accuracy = Correct predictions/Total population
But accuracy does not always serve our purpose. In an imbalanced classification problem (where the number of samples in one class is significantly larger than in the other), accuracy can be misleadingly high. For example, in a cancer detection problem where more than 99% of the population is non-cancerous, we can predict all cases to be non-cancerous and that alone pushes our accuracy above 99%.
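A quick sketch of this failure mode in plain Python (the labels and the naive all-negative "model" are made up for illustration):

```python
# Toy imbalanced dataset: 1 cancerous case out of 1000.
y_true = [1] + [0] * 999      # 1 = cancerous, 0 = non-cancerous
y_pred = [0] * 1000           # naive model: predict everything non-cancerous

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.999 -- 99.9% accuracy while missing every cancer case
```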
How do we solve this problem for an imbalanced classification problem?
We turn our attention to the confusion matrix and its evaluation metrics (more on that later). For a binary classification problem, a confusion matrix normally looks like this:

                    Predicted Negative    Predicted Positive
Actual Negative            TN                    FP
Actual Positive            FN                    TP

TN = True Negatives; TP = True Positives; FP = False Positives; FN = False Negatives
Let's say we are interested in finding whether photos in a particular folder contain images of dogs. Here our positive class means an image containing a dog and our negative class means an image without a dog in it.
TN = photos without a dog and the model predicting it correctly
TP = photos with a dog and the model predicting it correctly
FP = photos without a dog but the model predicting otherwise
FN = photos with a dog but the model predicting otherwise
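These four cells can be tallied in a few lines of plain Python (the labels below are hypothetical, with 1 meaning the photo contains a dog):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TN, FP, FN, TP) for a binary classification problem."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# 1 = photo contains a dog, 0 = no dog (made-up labels)
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
print(confusion_counts(y_true, y_pred))  # (3, 1, 2, 2)
```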
Coming back to evaluation metrics apart from accuracy, we have precision, recall and F1-score.
• Precision = TP/ (TP + FP)
Intuitively precision means out of all predicted positive class samples, how many are correct.
Normally precision is used when we want to decrease false positives (cases which are actually negative but have been classified as positive). Example – marking an email as spam: marking a legitimate email as spam is costly compared to missing a spam mail. In this case precision can be used as the metric.
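As a quick sketch with made-up spam-filter counts:

```python
# Hypothetical counts: 40 mails flagged as spam, of which 36 really are spam.
tp, fp = 36, 4
precision = tp / (tp + fp)
print(precision)  # 0.9 -- 90% of the mails we flag are genuinely spam
```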
• Recall = TP/ (TP + FN)
Intuitively, recall means: out of all actual positive class samples, how many did we predict correctly. Normally recall is used when we want to decrease false negatives (cases which are actually positive but have been classified as negative). Example – building a model to detect fraudulent transactions: missing a fraudulent transaction is costly compared to raising a false alarm.
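With made-up fraud-detection counts:

```python
# Hypothetical counts: 50 fraudulent transactions in total,
# of which the model catches 35 and misses 15.
tp, fn = 35, 15
recall = tp / (tp + fn)
print(recall)  # 0.7 -- we catch 70% of all fraudulent transactions
```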
How do we determine when to use precision or recall? It really depends on the cost of a false positive vs a false negative. If the cost of a false positive is higher than the cost of a false negative (cost meaning the business cost, which can be both tangible and intangible), then precision should be chosen. If it's the other way around, then recall should be chosen.
• F1 score = (2 * precision * recall)/ (precision + recall)
Increasing precision or recall usually comes at the cost of the other: a reduction in false positives might result in an increase in false negatives and vice versa. F1 score is the harmonic mean of precision and recall. It is used when we want the best of both worlds, i.e. when we want to keep both false negatives and false positives low.
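A small sketch with hypothetical precision and recall values, showing that the harmonic mean sits below the arithmetic mean and so penalises a gap between the two:

```python
precision, recall = 0.9, 0.7   # hypothetical values
f1 = 2 * precision * recall / (precision + recall)
arithmetic_mean = (precision + recall) / 2
print(round(f1, 4), round(arithmetic_mean, 4))  # 0.7875 0.8
```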
Let's dig a bit deeper.
What if we want these metrics for a multi-class classification problem?
- Precision as defined earlier is TP/(TP + FP). Here it is similar, just computed per class. Precision for class A is the number of correctly predicted class A samples (TP-A), out of all samples predicted as class A.
Mathematically, it is TP-A/(TP-A + FP-A), where FP-A counts the samples of other classes that were wrongly predicted as class A.
- Recall as defined earlier is TP/(TP + FN). Again, it is computed per class. Recall for class A is the number of correctly predicted class A samples (TP-A), out of all actual class A samples.
Mathematically, it is TP-A/(TP-A + FN-A), where FN-A counts the actual class A samples that were wrongly predicted as some other class.
- F1-score per class is, as before, the harmonic mean of that class's precision and recall.
F1 score for class A = (2 * precision for class A * recall for class A)/ (precision for class A + recall for class A).
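A minimal pure-Python sketch of these per-class metrics (the three-class labels below are made up for illustration):

```python
def per_class_prf(y_true, y_pred, cls):
    """Precision, recall and F1 for one class of a multi-class problem."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["A", "A", "A", "B", "B", "C", "C", "C", "C"]
y_pred = ["A", "B", "A", "B", "C", "C", "C", "A", "C"]
p, r, f1 = per_class_prf(y_true, y_pred, "A")
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.667 0.667
```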
But then how do we combine the per-class F1-scores of all classes into a single value for comparison purposes?
- macro-averaged F1-score – the arithmetic mean of the per-class F1-scores. We can do a similar thing for precision and recall.
- weighted-average F1-score – the average of the per-class F1-scores, weighted by the number of samples in each class (the support). We can do a similar thing for precision and recall.
- micro-averaged F1-score – here, we pool all the classes together to calculate precision and recall. By definition, for a given class, Precision = TP/(TP + FP) and Recall = TP/(TP + FN). When we pool all classes (say a three-class problem with classes A, B, C), Precision = Recall = (TP-A + TP-B + TP-C)/(total samples). This is because every misclassified sample is a false positive for one class and a false negative for another, so total FP = total FN. That makes precision = recall, and hence micro-averaged F1-score = precision = recall = accuracy.
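The three averaging schemes can be sketched in plain Python on a made-up three-class example; the final line checks that the micro-averaged F1 equals plain accuracy:

```python
def class_counts(y_true, y_pred, cls):
    """TP, FP, FN for one class of a multi-class problem."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

y_true = ["A", "A", "A", "B", "B", "C", "C", "C", "C"]
y_pred = ["A", "B", "A", "B", "C", "C", "C", "A", "C"]
classes = sorted(set(y_true))
counts = {c: class_counts(y_true, y_pred, c) for c in classes}

# Macro: plain arithmetic mean of the per-class F1-scores.
macro = sum(f1_from_counts(*counts[c]) for c in classes) / len(classes)

# Weighted: per-class F1-scores weighted by class frequency (support).
weighted = sum(f1_from_counts(*counts[c]) * y_true.count(c)
               for c in classes) / len(y_true)

# Micro: pool TP/FP/FN over all classes first, then compute one F1.
tp, fp, fn = (sum(counts[c][i] for c in classes) for i in range(3))
micro = f1_from_counts(tp, fp, fn)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(macro, 4), round(weighted, 4), round(micro, 4), round(accuracy, 4))
# 0.6389 0.6667 0.6667 0.6667 -- micro-F1 matches accuracy
```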
Which of the above methods should you use to combine multi-class metrics? It really depends on the problem statement. Try to develop business intuition about which metric will work for you and, if it is a multi-class classification problem, which averaging method will work best for combining them.