Discover the most popular methods for evaluating classification models and some best practices for working with classifiers.
A classification model or a classifier is a type of machine learning algorithm that assigns categories or labels to data points. For example, a model could analyze an email and determine whether it is spam. To make such a decision, the model would use the patterns it learned from examples in the training dataset. If you’ve seen the show Silicon Valley, you might recall the “brilliant” idea for an app that classifies images and shows whether the photo contains a hotdog or not.
Source: https://www.oreilly.com/content/not-hotdog-vs-mission-critical-ai-applications-for-the-enterprise/
While the example of a hotdog classifier was simple and quite low-stakes, classification models are used in pretty much all industries for a wide range of often critical tasks. Let’s look into a few examples that will demonstrate why it is crucial to evaluate these models properly:
All of our email accounts rely on classification models to filter out spam. Poor evaluation can lead to important emails being marked as spam or potentially malicious spam messages reaching our inboxes.
In healthcare, classifiers are used to diagnose diseases. Misdiagnosing patients can result in wrong treatments or severe health risks.
In hiring, a poorly evaluated model may be biased toward certain demographics, thus leading to unfair hiring practices.
In financial services, credit scoring models are used to evaluate the creditworthiness of applicants. Without effective evaluation, these models might reject qualified applicants or approve high-risk ones. Such decisions can greatly impact both the lives of the individuals and the financial stability of these institutions.
In autonomous driving, classification models serve as an integral part of the more complex object recognition systems. After an object is detected by another component, the classifier determines whether it is a pedestrian, a vehicle, or another potential obstacle. I do not think I need to elaborate on why this task has very high stakes: any errors can result in accidents or even loss of life.
I don’t think you need to be convinced about why proper evaluation of classification models is crucial. While evaluating our models, we can encounter various types of challenges, such as ensuring fairness across various demographic groups, addressing biases inadvertently learned by the models, handling class imbalances, and working with evaluation metrics that may fail to capture the real-world impact of classification errors.
In this article, we will explore the most popular methods for evaluating classification models, focus on strategies to overcome some of these challenges, and outline best practices to keep in mind when working with classifiers.
Evaluating Classification Models
I believe it will be easier to explain various metrics with a concrete example. For simplicity, let’s again use spam classification, where the positive label would be “spam”. While this might sound counterintuitive at first, in statistics, it is common to consider the presence of something (spam, as opposed to non-spam) as positive, and the absence of something as negative. So we are definitely not trying to convince anyone that having a spam message is positive! Another example is COVID-19 testing, where a positive result indicates that a person is infected. In this context, “positive” does not mean a favorable outcome, it simply means the virus was detected.
We also know that we are dealing with the ground truth (the actual label, i.e., whether a given email is spam or not) and the predictions of our classification model (whether the model predicts that the email is spam or not). With that in mind, we have the following four cases:
True positive (TP): The model predicts spam, and the email is actually spam.
False positive (FP): The model predicts spam, but the email is not spam.
True negative (TN): The model predicts that the email is not spam, and indeed it is not.
False negative (FN): The model predicts that the email is not spam, but in fact, it is.
The confusion matrix summarizes all possible combinations of the predicted values versus the ground truth.
From these four cases, we can derive multiple evaluation metrics:
Accuracy: Measures the model’s ability to correctly predict the class of an observation.
Precision: Measures the fraction of all predictions of the positive class (in our case, spam) that were correct. In other words, it answers the question: Out of all predictions of spam, how many were actually spam? Alternatively: When the model predicts spam, how often is it correct?
Recall: Measures the fraction of all positive cases that were correctly predicted. Recall is also called sensitivity or the true positive rate. In our example, this metric answers the question: What fraction of all spam emails observed did the model predict correctly?
Specificity: Measures the fraction of negative cases (non-spam emails) that were correctly identified. Specificity can be thought of as the recall of the negative class.
F1 Score: A harmonic mean of precision and recall. The harmonic mean is used instead of the arithmetic mean because it considers the balance between the two scores, punishing extreme outcomes and discouraging highly unequal values. For example, using a simple average, a classifier with precision = 1 and recall = 0 would score 0.5, while using the harmonic mean, the score would be 0.
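To make these definitions concrete, here is a minimal sketch of how the metrics can be computed with scikit-learn. The y_true and y_pred arrays are hypothetical placeholders, with 1 standing for spam and 0 for non-spam.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground-truth labels (1 = spam, 0 = not spam)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # labels predicted by the classifier

# For binary 0/1 labels, scikit-learn lays the confusion matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)  # also known as sensitivity or TPR
specificity = tn / (tn + fp)           # recall of the negative class
f1 = f1_score(y_true, y_pred)

print(accuracy, precision, recall, specificity, f1)
```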
Possible Pitfalls
Without thoroughly understanding these basic metrics, we cannot correctly evaluate a model’s performance. It is equally important to be aware of the potential pitfalls when using these metrics. For example, accuracy can be highly misleading when dealing with class imbalance. Let’s consider a scenario in which 99% of the data is not spam and only 1% is actually spam. A naïve model that classifies every observation as non-spam would achieve 99% accuracy, but would be essentially worthless in practice.
In such cases, we can gain more useful insights from other metrics, and it might make more sense to optimize for them instead of for accuracy. However, the choice of the metric depends on the context (that is, the business case), as the cost of false positives and false negatives is often not equal. Understanding these context-specific costs helps decide which metric to optimize for and whether a balance among them needs to be achieved. That is why model evaluation must reflect the real-world trade-offs that the organization cares most about. Let’s consider two examples:
Failing to detect a disease (for example, cancer) when it is present (a false negative) could be far more harmful than incorrectly diagnosing someone with a disease (a false positive). In such a scenario, a false negative may delay treatment, potentially leading to fatal outcomes. That is why in this case, we might prioritize recall to minimize missed diagnoses, even if it results in a higher number of false positives.
While working on fraud detection, the costs of false positives (incorrectly labeling a legitimate transaction as fraudulent) and false negatives (failing to identify an actual fraud) are not equal. False positives can frustrate customers by blocking legitimate transactions, while false negatives may allow fraudulent transactions to proceed. Depending on business priorities, we might optimize the model for precision to minimize customer inconvenience. Or we might choose to focus on recall to catch as many fraud cases as possible.
We can encounter another challenge when working with composite metrics that combine multiple evaluation criteria into a single value, for example, the F1 Score. While they are very useful and appealing for summarizing the model’s overall performance, composite metrics can obscure individual trade-offs between the underlying metrics. This makes it harder to understand the model’s strengths and weaknesses. For example, a high F1 score might indicate success, but it does not actually show whether the model prioritizes high precision or high recall. A model optimized for precision could achieve a high F1 score by (to some extent) sacrificing recall, which might not be desirable for the business case.
The next two pitfalls are related to aggregating performance metrics. The first is that we might miss crucial details if we focus only on overall metrics while ignoring data slices (specific subsets of data). For example, a model might perform well across all countries combined but fail entirely for certain regions.
The second potential issue arises when we evaluate all classes jointly. For example, a model might achieve a total F1-score of 95%, but this could be mostly due to strong performance on frequent and potentially easy-to-classify classes, while rare or often critical classes have very low scores (e.g., 20%). This imbalance can be especially problematic in cases where underperforming classes represent high-stakes scenarios.
The solution to both problems is to break down the metrics by subset or class to ensure that performance is acceptable across all scenarios.
Lastly, since evaluation metrics are often designed with purely mathematical goals in mind, it can be quite challenging to translate them into real-world outcomes that are relevant to and easy for business stakeholders to understand.
In the fraud detection example, to effectively translate ML metrics into business-relevant information, we need to understand how the model’s behavior impacts business objectives. False positives might result in unhappy customers and eventually lead to churn. False negatives could cause direct financial losses. Depending on their department, our business stakeholders will be interested in different objectives. Some will focus on reducing customer complaints (cost of false positives), while others will try to minimize financial losses (cost of false negatives). In both cases, their primary concern is not the raw values of precision or recall.
More Evaluation Metrics
We have already discussed the most popular metrics, along with some of their shortcomings. Now, we will walk through additional metrics that can complement the analysis, making sure that we cover a few more possible angles and that we are more confident that our classification model performs as expected.
Let’s start with the ROC Curve, which graphically represents the performance of a classifier across all possible thresholds. Taking a step back, most frequently a classification model returns the probability or likelihood that a given observation belongs to the positive class. By default, anything above 0.5 is classified as positive. Naturally, we can adjust this decision threshold depending on our use case, and this is precisely what the ROC curve represents.
On the y-axis, it plots the True Positive Rate (TPR), which is the same as recall. On the x-axis, it shows the False Positive Rate (FPR), which indicates how many actual negative cases (e.g., non-spam emails) were wrongly identified as positive (spam). As you adjust the threshold for predicting the positive class, the TPR and FPR change. The ROC curve illustrates the trade-off between these values as the threshold changes.
A perfect classifier would score an FPR of 0 and a TPR of 1. This hints that the (0,1) point is the sweet spot in the ROC plot. A capable model’s curve would be as close to this point as possible. On the other hand, a model with no skill will have a line close to the 45° line, indicating performance equivalent to random guessing.
The ROC curve should be used when we are dealing with balanced classes. In the case of class imbalance, the ROC curve tends to present an overly optimistic picture of the model.
While the ROC curve provides a complete picture of a classifier’s performance for each decision threshold, we might also want a single, composite metric to summarize it. That is exactly what Area Under the Curve (AUC) offers. The AUC measures the total area under the ROC curve, summarizing the classifier’s ability to distinguish between positive and negative classes. An AUC of 1 represents a perfect classifier. An AUC of 0.5 indicates a random classifier. That is essentially the area under the 45° line. Anything below 0.5 implies that the model is performing worse than random guessing.
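As a rough sketch of how this looks in practice, the snippet below plots a ROC curve and computes the AUC with scikit-learn; y_true and the predicted probabilities y_proba are hypothetical placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                    # ground-truth labels
y_proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]  # model's predicted P(spam)

fpr, tpr, thresholds = roc_curve(y_true, y_proba)    # TPR/FPR at every threshold
auc_score = roc_auc_score(y_true, y_proba)

plt.plot(fpr, tpr, label=f"classifier (AUC = {auc_score:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="no skill (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()
```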
Having discussed the ROC curve, it makes sense to also mention another curve, that is, the Precision-Recall (PR) Curve. As the name suggests, it plots the trade-off between precision and recall for different classification thresholds. It comes in handy when classes are imbalanced, as it focuses only on the positive class and disregards true negatives.
Similar to the ROC curve, the PR curve also has a “no-skill line”, which represents a classifier that cannot really tell the difference between classes (and predicts randomly) or assigns a constant class to all predictions. However, unlike the ROC curve, the no-skill line in the PR curve is not fixed and depends on the class balance of the dataset. It is a horizontal line at the proportion of positive samples in the data; for example, for a balanced dataset, this line would be at 0.5.
Lastly, similar to the ROC curve, we can calculate the area under the PR curve (AUCPR). As with AUC, higher values indicate better performance.
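Here is a minimal sketch of the PR curve and the area under it, again with scikit-learn and the same kind of hypothetical labels and probabilities.

```python
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

precision, recall, thresholds = precision_recall_curve(y_true, y_proba)

aucpr = auc(recall, precision)                 # area under the PR curve
ap = average_precision_score(y_true, y_proba)  # a closely related summary metric

no_skill = sum(y_true) / len(y_true)           # baseline: the fraction of positives
print(aucpr, ap, no_skill)
```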
A potential pitfall of using metrics such as precision, recall, or F1-Score is that these metrics are asymmetric, that is, they focus only on the positive class. This is clear when inspecting their formulas, which do not account for the true negative category. The Matthews Correlation Coefficient (MCC) addresses this issue by considering all elements of the confusion matrix.
By examining the formula, we see that MCC treats the true class and the predicted class as two binary variables and essentially calculates their correlation coefficient. Its values range between -1 (completely inverted predictions, where all positives are classified as negatives, and all negatives are classified as positives) and 1 (indicating a perfect classifier). 0 represents a classifier that performs no better than random guessing. As MCC is a symmetric metric, achieving a high value requires the classification model to perform well in predicting both the positive and negative classes.
A potential drawback of this metric is that it is less intuitive and harder to interpret compared to the F1-Score. However, when the costs of low precision and low recall are unknown or unquantifiable, MCC can be more informative than the F1-Score, as it provides a more balanced (symmetric) evaluation of a classifier.
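Below is a minimal sketch of computing MCC, both through scikit-learn and directly from the confusion matrix entries; the labels are hypothetical.

```python
import math

from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

mcc = matthews_corrcoef(y_true, y_pred)

# Equivalent computation from the confusion matrix entries.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc_manual = (tp * tn - fp * fn) / denominator if denominator else 0.0

print(mcc, mcc_manual)  # -1 = inverted, 0 = random, 1 = perfect
```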
Another interesting metric used for evaluating classifiers is Cohen’s Kappa, which measures the agreement between two systems. In our case, these would be a classification model and ground truth labels. Its added value, compared to metrics like accuracy, is that it also considers the possibility of agreement occurring by chance. Cohen’s Kappa can be calculated as
𝜅 = (Po − Pe) / (1 − Pe)

where Po is the observed agreement, i.e., the ratio of the number of correct predictions to the total number of predictions (in other words, it is the accuracy of the model), and Pe represents the expected agreement, i.e., the proportion of agreement expected purely by chance, based on the distribution of the classes. We can calculate that proportion using the confusion matrix: we multiply the marginal probabilities of the true and predicted labels for each class and sum these products across all classes.
In terms of the possible values of the metric, 𝜅 = 1 indicates perfect agreement, 𝜅 = 0 means agreement is no better than random chance, and values below 0 indicate agreement worse than random.
As Cohen’s Kappa accounts for the possibility that some predictions might be correct purely by chance, it is particularly useful when working with imbalanced datasets (as long as the dataset is not extremely imbalanced). However, while it provides more information than some other metrics, it is less intuitive to interpret.
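Here is a minimal sketch of Cohen’s Kappa, computed with scikit-learn and manually from the observed and expected agreement described above (hypothetical labels).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

kappa = cohen_kappa_score(y_true, y_pred)

# Manual computation: Po is the accuracy, Pe the agreement expected by chance.
cm = confusion_matrix(y_true, y_pred)
n = cm.sum()
po = np.trace(cm) / n
pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
kappa_manual = (po - pe) / (1 - pe)

print(kappa, kappa_manual)
```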
Another very important metric is log loss (also known as logarithmic loss or cross-entropy loss). It is used to evaluate the performance of classifiers that output probabilities. It provides a way to quantify how well the model’s predicted probabilities align with the true outcomes (labels). In the case of binary classification, the formula is as follows:
Log Loss = −(1/N) Σ_i [y_i log(p_i) + (1 − y_i) log(1 − p_i)]

where y_i is the true label for the i-th observation and p_i is the predicted probability of the positive class for that observation.
Log loss is the negative average log-likelihood of the true labels given the predicted probabilities. The lower the metric, the better the model.
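A minimal sketch using scikit-learn’s implementation, with hypothetical labels and predicted probabilities:

```python
from sklearn.metrics import log_loss

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]  # predicted P(positive class)

loss = log_loss(y_true, y_proba)
print(loss)  # lower is better; confident wrong predictions inflate it sharply
```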
Some of the advantages of this metric include:
It penalizes wrong predictions, especially those made with high confidence. This can be extremely useful in cases we have already discussed, such as medical diagnoses or fraud detection.
Log loss helps assess how well-calibrated the predicted probabilities are. In some cases, a model may be accurate in classification, but its predicted probability estimates may not reflect the true likelihood of the class. Log loss helps identify such discrepancies.
Log loss can be useful when dealing with imbalanced datasets, as it considers the predicted probabilities and does not treat all errors equally.
However, it also comes with certain drawbacks:
Unlike simpler metrics such as accuracy, log loss can be less intuitive to interpret for non-technical stakeholders.
As the metric penalizes wrong predictions made with high confidence, log loss can be heavily influenced by a few very confident but ultimately incorrect predictions. A single large error can significantly increase the log loss.
Similar to log loss, the Brier Score is a metric used to evaluate the accuracy of probabilistic predictions. While it quantifies the same thing, it differs in how it is calculated and how it penalizes errors.
As seen in the formula below, it measures the mean squared difference between the predicted probabilities and the actual outcomes:

Brier Score = (1/N) Σ_i (p_i − y_i)²

A lower Brier Score indicates that the model’s predicted probabilities are closer to the true outcomes, meaning the model is well-calibrated and makes more accurate probability estimates.
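A minimal sketch with scikit-learn, using the same kind of hypothetical inputs:

```python
from sklearn.metrics import brier_score_loss

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

brier = brier_score_loss(y_true, y_proba)  # mean squared error of the probabilities
print(brier)  # 0 is perfect; 0.25 corresponds to always predicting 0.5
```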
Pros:
Similar to log loss, it takes into account both the accuracy and confidence of the model. This makes it useful for evaluating the calibration of a probabilistic model.
Since it uses squared error, it is easier to understand than log loss, which uses logarithms.
Like log loss, it is useful when dealing with imbalanced datasets.
Cons:
Just like log loss, the Brier Score can be sensitive to outliers or wrong predictions with high confidence. However, since it uses squared error instead of a logarithmic function, the penalty is not as severe.
It is a less popular metric compared to simpler metrics like accuracy, precision, or recall.
Accounting for Fairness while Evaluating Classification Models
So far, we have focused on discussing various statistical metrics used to assess the performance of the classification models. We have also touched upon the business perspective, highlighting scenarios in which certain errors might be more costly than others. However, we have not yet discussed the issue of fairness.
In the context of evaluating ML models, fairness refers to the principle that these models should make predictions that do not systematically bias (either positively or negatively) specific groups or individuals based on certain sensitive attributes. These could be, for example, race, gender, age, religion, or disability. In other words, the outcomes of ML models should be equitable across different groups, even if there were historical biases in the training data.
To illustrate what can go wrong when fairness is not a part of the evaluation process, let’s look at a few examples from recent years:
Amazon’s recruiting tool was biased against women. This bias came from the training data, which reflected past inequalities – the majority of resumes in the training data belonged to men, as there were historical gender imbalances in the tech industry. As such, the model learned to penalize job applications containing variations of the word “woman”, for example, as in “women’s chess club captain”.
When Apple introduced its Apple Card, customers noticed that the credit limits favored men over women, even in cases where women had higher credit scores. This issue likely originated from biases in the training datasets and the use of black-box models, which are notoriously difficult to interpret and debug.
Several studies revealed that facial recognition systems had significantly higher error rates for women and people with darker skin tones.
Now that we are more aware of how important fairness is in the context of evaluating machine learning models, there are several approaches that we can use to promote fairness:
We can modify the input data to reduce or eliminate biases before training any model, for example, by resampling or anonymizing certain sensitive attributes.
We can incorporate fairness constraints or objectives directly into the model training process, for example, by adjusting the loss function or adding fairness-aware regularization.
We can adjust the model’s outputs or decisions after training to ensure fairness, for example, by recalibrating predictions to achieve demographic parity.
We should regularly evaluate models for fairness using dedicated metrics and tools.
Let’s focus on the last point, that is, the metrics that we can use to evaluate the fairness of our classification models.
Fairness-Aware Metrics
Let’s start with the fairness-aware metric called statistical parity (or demographic parity). We can use it to evaluate whether the outcomes of a classification model are equally distributed across different groups. These groups would be defined by certain sensitive attributes, for example, race or gender.
The hypothesis that we want to investigate using this metric is that a model’s predictions should be independent of the sensitive attribute. To be more precise, the positive prediction rate should be consistent across groups. So let’s imagine a scenario in which we have a model used for approving loans and two groups defined by a certain sensitive attribute. If statistical parity holds, the percentage of loan approvals should be the same for both groups.
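As a rough sketch, checking statistical parity can be as simple as comparing positive prediction rates per group; the group labels and model decisions below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],  # sensitive attribute
    "y_pred": [1, 0, 1, 1, 0, 1, 0, 0],                 # model decisions (1 = loan approved)
})

# Statistical parity holds if these rates are (roughly) equal across groups.
positive_rates = df.groupby("group")["y_pred"].mean()
print(positive_rates)
```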
Statistical parity is especially useful when making sure that different demographic groups have equal access or opportunities, for example, in hiring or healthcare. However, it also has certain drawbacks:
It does not account for legitimate differences in the characteristics of groups.
The metric might conflict with other fairness metrics, for example, when groups have different base rates.
Statistical parity only focuses on group-level fairness, while individuals within the same group might be treated inconsistently.
The second metric to consider is called disparate impact, and it evaluates whether a classifier disproportionately affects different groups, once again, defined by certain sensitive attributes. Disparate impact measures the degree to which a model’s decisions (e.g., positive predictions) are distributed unequally across these groups. As such, it is typically defined as the ratio of positive prediction rates for two groups.
The rule of thumb, which is also used in legal contexts such as employment, is that the model is fair if the ratio is no less than 0.8. If the ratio falls below this threshold, the model is considered to have a potential disparate impact.
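A minimal sketch of the disparate impact ratio and the 0.8 rule of thumb, using hypothetical per-group positive prediction rates:

```python
# Hypothetical positive prediction rates per group (e.g., loan approval rates).
rate_group_a = 0.40
rate_group_b = 0.30

# Ratio of the lower rate to the higher rate.
disparate_impact = min(rate_group_a, rate_group_b) / max(rate_group_a, rate_group_b)
verdict = "passes the 0.8 rule" if disparate_impact >= 0.8 else "potential disparate impact"
print(f"{disparate_impact:.2f} -> {verdict}")
```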
Pros:
Frequently used in legal and regulatory contexts to measure fairness.
Provides a simple and clear ratio to assess fairness between groups.
It helps identify systemic biases that may disproportionately affect specific groups.
Cons:
Similar to statistical parity, ensuring fairness at the group level does not guarantee fairness at the individual level.
The metric identifies disparities but does not reveal their origin, for example, whether it is due to bias in the model, data, or historical inequalities.
Correcting for disparate impact may reduce the model’s overall predictive performance.
Sensitive attributes are often multidimensional (e.g., race and gender combined). As such, a single ratio may not capture the full picture.
The third metric to consider for evaluating fairness is called equalized odds. It analyzes whether the model has equal true positive rates and false positive rates across all groups. In other words, the model’s predictions should not disproportionately benefit or harm any specific group.
The equalized odds metric is often used in scenarios where the cost of errors varies across groups and has ethical implications, for example, in criminal justice, healthcare, or loan approvals.
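Here is a minimal sketch of checking equalized odds by comparing TPR and FPR per group; the labels, predictions, and group assignments are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "y_true": [1, 0, 1, 0, 1, 0, 1, 0],
    "y_pred": [1, 0, 0, 0, 1, 1, 1, 0],
})

def error_rates(sub: pd.DataFrame) -> pd.Series:
    tp = ((sub["y_true"] == 1) & (sub["y_pred"] == 1)).sum()
    fn = ((sub["y_true"] == 1) & (sub["y_pred"] == 0)).sum()
    fp = ((sub["y_true"] == 0) & (sub["y_pred"] == 1)).sum()
    tn = ((sub["y_true"] == 0) & (sub["y_pred"] == 0)).sum()
    return pd.Series({"TPR": tp / (tp + fn), "FPR": fp / (fp + tn)})

# Equalized odds holds if both TPR and FPR are (roughly) equal across groups.
print(df.groupby("group")[["y_true", "y_pred"]].apply(error_rates))
```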
Pros:
Makes sure that no group disproportionately suffers from false positives or false negatives.
Addresses fairness in both correct and incorrect predictions.
Promotes group-level fairness, making it useful in socially sensitive applications.
Cons:
Enforcing equalized odds can decrease the overall accuracy of the model, especially when base rates differ across groups.
Ensuring fairness at the group level does not necessarily mean all individuals are treated fairly.
Equalized odds may conflict with metrics like statistical parity when base rates vary.
The last metric we will discuss is the demographic parity gap. It evaluates the disparity in positive predictions across groups. As such, the gap is defined as the absolute difference in positive prediction rates between two groups. A smaller gap suggests that the model treats the groups more equally, at least in terms of positive predictions.
Pros:
Easy to compute and interpret as a direct comparison of positive prediction rates.
Useful for identifying systemic disparities in outcomes across demographic groups.
Suitable for various domains, including hiring, healthcare, and credit scoring.
Cons:
The metric does not consider that groups might have different base rates, such as actual risk levels.
Forcing demographic parity may reduce overall model performance, especially if groups have inherently different distributions.
Does not ensure fairness at the individual level within groups.
A common practice is to pair the demographic parity gap with other fairness metrics, such as equalized odds, to not only balance fairness with predictive performance, but also address diverse fairness concerns.
Methods for Detecting Bias in Machine Learning Models
After learning about the possible evaluation metrics, let’s also look at how we can identify potential biases in our classifiers. First of all, we have to remember that bias can arise from various sources: biased training data, flawed model assumptions, or even inherent inequalities in the real-world system the model is applied to.
One possible approach to detecting bias is to use Shapley values. They have gained considerable popularity as a method for evaluating feature importance in machine learning models, but we can also use them to detect bias.
If we calculate the contribution of each feature to individual predictions, we can clearly identify if sensitive attributes are disproportionately influencing the model’s predictions. For example, if the model’s output is strongly impacted by race or gender, despite these not being part of the design, this could indicate a fairness issue.
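A minimal sketch with the shap package; the trained model, the feature matrix X, and the presence of a “gender” column are all assumptions made for illustration.

```python
import shap

# model and X are assumed to already exist (e.g., a fitted tree-based classifier
# and the DataFrame of features it was trained on).
explainer = shap.Explainer(model, X)
shap_values = explainer(X)

# The bar plot shows the mean absolute SHAP value per feature; a large bar for a
# sensitive attribute such as "gender" would suggest it is driving the predictions.
shap.plots.bar(shap_values)
```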
Another approach we could take would be counterfactual testing. This framework involves creating modified versions of an input (called counterfactuals) by changing the sensitive attributes. Then, we would observe how the model’s predictions change. The goal is to evaluate how the model would behave if these sensitive attributes were different while keeping all other aspects of the input constant. The desired outcome would be that the model makes similar predictions regardless of the sensitive attribute.
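A minimal sketch of this idea, assuming a fitted model with a predict method, a feature DataFrame X, and a binary sensitive column (all hypothetical):

```python
import numpy as np
import pandas as pd

def counterfactual_flip_rate(model, X: pd.DataFrame, sensitive_col: str) -> float:
    """Fraction of observations whose prediction changes when the sensitive attribute is flipped."""
    X_cf = X.copy()
    X_cf[sensitive_col] = 1 - X_cf[sensitive_col]  # assumes a binary 0/1 attribute

    original = model.predict(X)
    counterfactual = model.predict(X_cf)
    return float(np.mean(original != counterfactual))

# A value close to 0 means predictions are largely insensitive to the attribute.
# flip_rate = counterfactual_flip_rate(model, X, "gender")
```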
Best Practices for Evaluating Classification Models
To wrap up this article, let’s look at a list of best practices that you can apply the next time you evaluate a classification model:
Use context-aware metrics: Choose metrics that align with your organization’s business goals. For example, you might optimize for precision in fraud detection or recall in medical diagnoses.
Evaluate using realistic datasets: Make sure that the model is tested on data that reflects the current real-world distributions so that you do not end up with a model overfitted to training data that may not even represent reality anymore.
Watch out for composite metrics: When working with composite metrics, always examine the individual components as well to fully understand the model’s behavior.
Use more than one metric: Always assess the model’s performance across multiple objectives and use various metrics, as each of them has its strengths and weaknesses. By doing so, you will make sure that the evaluation is aligned with diverse business needs.
Incorporate fairness and bias detection: Do not forget to also evaluate the models using fairness-aware metrics and bias-detection methods. This helps prevent the models from systematically disadvantaging any group based on sensitive attributes.
Use explainability and interpretability tools: You can use approaches such as Shapley values to interpret the classifier’s predictions. This helps stakeholders understand why a model makes certain predictions and identify potential biases or errors.