Enterprise AI Model Metrics: Part 1

The market for AI applications that leverage techniques from machine learning, deep learning, signal processing, and other statistical methods is anticipated to reach $391 billion by 2025, growing at a compound annual growth rate of 46 percent from 2019. To get the most out of that growth, it is essential for organizations that develop or consume AI-enabled applications to pay close attention to how the benefits of the AI techniques are quantified.

In this multi-part series, I will explain the pros and cons of metrics that are widely used to evaluate models in enterprise AI applications. They include binary classification metrics, multi-class classification metrics, distance metrics, and regression and forecasting metrics.

This article is part 1 of the series, and it introduces the following fundamental metrics for evaluating AI models: accuracy, true positive rate, and false positive rate. It is targeted toward business leaders new to AI and toward AI practitioners looking to brush up on their fundamentals. (Don’t underestimate the value of the occasional review; these metrics have confused many candidates who apply for a data scientist position at C3.ai.) This article differentiates itself from others on the same topic by using a visual approach to explain the definitions and relationships between the metrics.

Negative = Happiness

As of this writing, the COVID-19 pandemic has gripped the globe. Those of us who have managed to get tested know that, despite the word’s connotation, a “negative” test result is actually good news! The test was looking for the virus and couldn’t find it.

At C3.ai, many of our enterprise AI applications deal with similar detection problems. Just to make sure you’ve been paying attention, I will give you one quick example to demonstrate how we use “positive” and “negative” to denote model outcomes and let you fill in a second example.

  1. C3 AI Reliability. In this application, we are looking for failures in industrial system components.
    1. AI model predicts a failure: Positive
    2. AI model predicts no failure: Negative
  2. Diagram Parsing (part of C3 AI Digital Twin). In this application, we are looking for the presence of a specific symbol in an image:
    1. AI model predicts that the symbol does not exist: __________
    2. AI model predicts that the symbol exists: __________

Notice that in the first example, “negative” refers to a desirable business outcome (ideally, we see no failures). In the second example, however, the presence of a symbol may not have any desirable or undesirable business implications.

The Confusion Matrix

Now that we have covered the terms “positive” and “negative,” let’s confuse things a little!

The academic community refers to the following 2×2 matrix as the “confusion matrix.” It is the fundamental piece to understanding all metrics that concern binary classification problems.
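
With the model’s predictions as rows and the actual outcomes as columns, the matrix looks like this:

                          Actual Positive          Actual Negative
  Predicted Positive      True Positive (TP)       False Positive (FP)
  Predicted Negative      False Negative (FN)      True Negative (TN)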

I have conducted several interviews with candidates for a data scientist position at C3.ai who have confused the terms in the confusion matrix. I want to help you nail the terms with the following key:

Positive/Negative: What did my model predict?
True/False: Did my model’s prediction match what actually happened?

Here’s a pop quiz: Sketch a blank 2×2 matrix and try filling in the cells by thinking through which cell is positive/negative and true/false.

Answer key:

  1. False Positive
  2. True Negative
  3. True Positive
  4. False Negative

In the rest of this article, I will use the acronyms TP, FP, TN, and FN for True Positive, False Positive, True Negative, and False Negative, respectively.
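
To make the four cells concrete, here is a minimal Python sketch (an illustration only, not code from any C3 AI application) that tallies TP, FP, TN, and FN from parallel lists of actual and predicted labels:

    # Tally the four confusion-matrix cells from parallel lists of
    # booleans, where True means "positive."
    def confusion_counts(actual, predicted):
        tp = fp = tn = fn = 0
        for a, p in zip(actual, predicted):
            if p and a:
                tp += 1      # predicted positive, actually positive
            elif p and not a:
                fp += 1      # predicted positive, actually negative
            elif not p and not a:
                tn += 1      # predicted negative, actually negative
            else:
                fn += 1      # predicted negative, actually positive
        return tp, fp, tn, fn

    print(confusion_counts([True, True, False, False], [True, False, True, False]))
    # (1, 1, 1, 1): one example of each cell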

Note that, in multi-class classification problems, the confusion matrix will have more than two rows and columns (it is still square).

Metric 1: Accuracy

Perhaps the most used – and misused – model scoring metric is “accuracy.” Here is how it is defined:
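
Accuracy = (TP + TN) / (TP + TN + FP + FN)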

Referring back to the confusion matrix, visual learners will notice that accuracy is the sum of the main diagonal (TP and TN) divided by the sum of all elements of the matrix.

Accuracy is only useful when the dataset is balanced, that is, when the number of actual positives is close to the number of actual negatives. Unfortunately, it is often misused for detection problems in which actual positives are far outnumbered by actual negatives.

Let’s look at an example of an AI-based prediction for COVID-19 to see why this metric is not useful when actual negatives outweigh actual positives.

Example: In a set of 100 patients

  • Actual:
    • 5 infected
    • 95 not infected
  • Predicted by AI model:
    • 0 infected
    • 100 not infected

What is the accuracy? Look back at the confusion matrix and try to infer the values in the cells: TP = 0, FP = 0, FN = 5, and TN = 95, which gives an accuracy of (0 + 95) / 100 = 95 percent.

Let us take a moment to reflect on that. The AI model missed all five cases of infected patients. Despite that, its accuracy was 95 percent! Clearly, accuracy is a poor metric for this model. So, the next time a data scientist in your organization boasts of a model with 95 percent accuracy, remember to view it with skepticism.
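
To see the arithmetic in code, here is a small Python sketch with made-up labels that mirror the example above; a model that predicts “not infected” for every patient still scores 95 percent accuracy:

    # 100 patients: 5 actually infected, 95 not; the model predicts no infections.
    actual    = [True] * 5 + [False] * 95   # True = infected
    predicted = [False] * 100               # model never predicts an infection

    tp = sum(a and p for a, p in zip(actual, predicted))              # 0
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))  # 95
    accuracy = (tp + tn) / len(actual)
    print(f"Accuracy: {accuracy:.0%}")      # 95%, despite missing every infected patient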

Metrics 2 & 3: True Positive Rate and False Positive Rate

Two more metrics that candidates often confuse are True Positive Rate (TPR) and False Positive Rate (FPR).

TPR is also called Recall, Sensitivity, or Hit Rate, and it is defined as follows:
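
TPR = TP / (TP + FN)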

FPR is also called Fall-out, and is 1 – True Negative Rate, or 1 – Specificity:
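
FPR = FP / (FP + TN)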

Knowing the different names of TPR and FPR is useful because different clients or data scientists you interact with may be familiar with one name but not the others.

TPR can also be thought of as answering the question: what fraction of the actual positives did the model get right?

Conversely, FPR can be thought of as answering the question: what fraction of the actual negatives did the model get wrong, that is, incorrectly flag as positive?

Let’s try to relate TPR and FPR back to the confusion matrix.

Visual learners will notice that the numerators in the formulae above (TP and FP) both sit in the top row of the confusion matrix, and each is divided by the sum of its own column. TPR and FPR thus complement each other by focusing on distinct columns of the matrix.

Let us take an example of an AI-based model that predicts whether critical machine components are going to fail in the next 10 days (a core feature of C3 AI Predictive Maintenance).

Example: In a set of 100 critical machine components

  • Actual:
    • 5 will fail
    • 95 will not fail
  • Predicted by AI model:
    • 0 will fail
    • 100 will not fail

If you plug in the values from the confusion matrix (TP = 0, FN = 5, FP = 0, TN = 95), you’ll find that TPR = 0 / 5 = 0 percent and FPR = 0 / 95 = 0 percent.

In this case, one could say the above AI model sacrificed TPR in order to get an FPR of 0 – a bad thing in this example because it missed all the failures. Conversely, if the AI model had predicted that every component will fail, then the TPR and FPR would both be 100 percent. That is also bad: a TPR of 100 percent is of little value when it comes with an FPR of 100 percent, because every healthy component would be flagged as failing. In general, there is a trade-off between TPR and FPR, and we will discuss that trade-off in depth in Part 2 of this article.
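
A short Python sketch (again with made-up labels mirroring the example) makes the trade-off visible: the model that predicts no failures and the model that predicts every component will fail sit at opposite corners of the TPR/FPR range.

    # 100 components: 5 actually fail, 95 do not (True = failure).
    actual = [True] * 5 + [False] * 95

    def tpr_fpr(actual, predicted):
        tp = sum(a and p for a, p in zip(actual, predicted))
        fn = sum(a and not p for a, p in zip(actual, predicted))
        fp = sum(p and not a for a, p in zip(actual, predicted))
        tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
        return tp / (tp + fn), fp / (fp + tn)

    print(tpr_fpr(actual, [False] * 100))  # (0.0, 0.0): misses every failure
    print(tpr_fpr(actual, [True] * 100))   # (1.0, 1.0): flags every component as failing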

No Confusion

It is possible, although rare, that an AI model is not confused. I have participated in a BHC3 Reliability project that had a dataset containing anomalous cases, all of which were egregiously different from normal cases. As a result, the AI model was able to produce scores that clearly separated normal cases from anomalous cases.

Referring back to the confusion matrix, that scenario produced zero FPs and FNs, but non-zero TPs and TNs. In other words, the confusion matrix was a diagonal matrix.

When there is no confusion, it is possible to have a TPR of 100 percent while having an FPR of 0 percent.
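
For instance, with hypothetical counts of TP = 5, TN = 95, and FP = FN = 0, the TPR would be 5 / 5 = 100 percent and the FPR would be 0 / 95 = 0 percent.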

Conclusion

I hope you take four points away from this article:

  1. AI models can be evaluated using various metrics, some better suited to certain types of problems than others.
  2. The confusion matrix forms a canvas for visually describing those metrics and how they are related to each other.
  3. Accuracy, a popular metric for evaluating AI models, can paint an overly optimistic view of a model’s performance – especially for imbalanced datasets.
  4. In rare cases where AI models are not confused, it is possible to have a TPR of 100 percent while having an FPR of 0 percent.

Varun Badrinath Krishna is a lead data scientist at C3.ai. He holds a Ph.D. in computer engineering from the University of Illinois at Urbana-Champaign, where he was named a Siebel Scholar. Varun loves solving hard technical problems and enjoys helping his teammates grow technically, professionally, and socially.