The No. 1 Question to Ask When Evaluating AI Tools

Here is an excerpt from an article published in the MIT Sloan Management Review. To read the complete article, check out others, and obtain subscription information, please click here.

Credit: Andy Potts

* * *

Determining whether an AI solution is worth implementing requires looking past performance reports and finding the ground truth on which the AI has been trained and validated.

The Research

In the fast-moving and highly competitive artificial intelligence sector, developers’ claims that their AI tools can make critical predictions with a high degree of accuracy are key to selling prospective customers on their value. Because it can be daunting for people who are not AI experts to evaluate these tools, leaders may be tempted to rely on the high-level performance metrics published in sales materials. But doing so often leads to disappointing or even risky implementations.

Over the course of an 11-month investigation, we observed managers in a leading health care organization as they conducted internal pilot studies of five AI tools. Impressive performance results had been promised for each, but several of the tools did extremely poorly in their pilots. Analyzing the evaluation process, we found that an effective way to determine an AI tool’s quality is to understand and examine its ground truth.1 In this article, we’ll explain what that is and how managers can dig into it to better assess whether a particular AI tool may enhance or diminish decision-making in their organization.

What Is the Ground Truth of the AI Tool?

The quality of an AI tool — and the value it can bring your organization — is enabled by the quality of the ground truth used to train and validate it. In general, ground truth is defined as information that is known to be true based on objective, empirical evidence. In AI, ground truth refers to the data in training data sets that teaches an algorithm how to arrive at a predicted output; ground truth is considered to be the “correct” answer to the prediction problem that the tool is learning to solve. This data set then becomes the standard against which developers measure the accuracy of the system’s predictions. For instance, teaching a model to identify the best job candidates requires training data sets describing candidates’ features, such as education and years of experience, where each is associated with a classification of either “good candidate” (true) or “not a good candidate” (false). Training a model to flag inappropriate content such as bullying or hate speech requires data sets full of text and images that have been classified “appropriate” (true) or “not appropriate” (false). The aim is that once the model is in production, it has learned the pattern of features that predicts the correct output for a new data point.
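To make this concrete, here is a minimal Python sketch of the candidate-screening setup described above. The data, feature names, and decision rule are all hypothetical illustrations, not a real system; the point is that "accuracy" is always measured against whatever labels the designers declared to be the correct answers.

```python
# Hypothetical toy data: each training example pairs a feature tuple with a
# ground-truth label that the model treats as the "correct" answer.
training_data = [
    # (years_experience, has_degree) -> label: 1 = "good candidate", 0 = "not"
    ((6, 1), 1),
    ((1, 0), 0),
    ((4, 1), 1),
    ((0, 1), 0),
]

# A trivial "learned" rule for illustration: predict 1 when experience >= 3.
def predict(features):
    years_experience, _has_degree = features
    return 1 if years_experience >= 3 else 0

# Accuracy is computed by matching predictions against the ground-truth
# labels -- so the labels' quality bounds what "accuracy" can even mean.
correct = sum(predict(x) == y for x, y in training_data)
print(correct / len(training_data))  # fraction agreeing with the ground truth
```

If the labelers had judged these same candidates differently, the identical rule would score differently; the metric inherits the labels.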

In recent years, there has been growing awareness of the risks of using features from the training data sets that are not representative or that contain bias.2 There is surprisingly little discussion, however, about the quality of the labels that serve as the ground truth for model development. It is critical that managers ask, “Is the ground truth really true?”

The quality of an AI tool — and the value it can bring your organization — is enabled by the quality of the ground truth used to train and validate it.

The first step in gaining clarity into the ground truth for a tool is to investigate the metric typically used by AI companies to support performance claims, known as the AUC (short for area under the receiver operating characteristic curve). The AUC metric summarizes the model’s accuracy in making predictions on a scale of 0 to 1, where 1 represents perfect accuracy.3 Managers often fixate on this metric as evidence of AI quality — and take at face value the comparison with an AUC for the same prediction task done by humans.

The AUC is calculated by comparing AI outputs with the ground truth labels chosen by the AI’s designers. An output is considered correct if it matches the ground truth label and incorrect if it does not. The usefulness and relevance of the AUC metric are therefore contingent on the quality of the ground truth labels, which cannot simply be assumed to be high-quality sources of truth.
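The mechanics can be shown in a short, self-contained sketch (the scores and labels are hypothetical). AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one; note how the very same model scores yield a very different AUC when the ground truth labels change.

```python
def auc(scores, labels):
    """AUC as the probability a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores   = [0.9, 0.8, 0.6, 0.4, 0.2]  # model's predicted probabilities
labels_a = [1, 1, 0, 0, 0]            # one choice of ground truth
labels_b = [1, 0, 0, 1, 0]            # a different labeling of the same cases

print(auc(scores, labels_a))  # 1.0 -- "perfect" against labeling A
print(auc(scores, labels_b))  # ~0.667 -- mediocre against labeling B
```

The model did not change between the two lines; only the ground truth did. A vendor's headline AUC is therefore only as meaningful as the labels behind it.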

Here’s the underlying problem: For many critical decisions in organizations, there is rarely an objective “truth” ready to be fed to an algorithm. Instead, AI designers construct ground truth data, and they have considerable latitude in how to accomplish this. For example, in the medical context, AI developers make significant trade-offs when choosing what ground truth will be used to train and validate a cancer diagnosis model. They could use biopsy results to serve as the ground truth, which would provide an externally validated result for whether cancer was detected. However, most patients never undergo biopsy tests (thankfully), and acquiring these results for all patients in the training data set would require enormous investment and patient cooperation.

Alternatively, developers may use the diagnosis recorded by the clinical physician overseeing a given patient at the time. This data is relatively easy to acquire from historical electronic health records. Developers could also recruit an expert physician, or a panel of experts, to produce a diagnosis for a sample of cases in the training data set, using the average or majority of their opinions as the ground truth label. Creating this type of data set may be costly and time-consuming, but it is commonly done in the medical AI community. In any case, AI developers weigh the relative costs and benefits when deciding how to assign ground truth labels — a decision that has great influence on the overall quality and potential value of the tool.
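The expert-panel approach can be sketched as follows (the case IDs and votes are hypothetical): the ground truth label is simply whichever diagnosis the majority of panelists recorded, so the "truth" the model learns from inherits the panel's disagreements.

```python
from collections import Counter

def majority_label(expert_opinions):
    # Assign ground truth as the majority of the panel's diagnoses.
    # This encodes the experts' collective judgment, not an objective test.
    return Counter(expert_opinions).most_common(1)[0][0]

# Hypothetical panel readings for three cases (1 = cancer, 0 = no cancer)
cases = {
    "case_01": [1, 1, 0],  # experts disagree; majority says 1
    "case_02": [0, 0, 0],  # unanimous
    "case_03": [1, 0, 0],  # labeled 0 despite one positive read
}
labels = {case: majority_label(votes) for case, votes in cases.items()}
print(labels)  # {'case_01': 1, 'case_02': 0, 'case_03': 0}
```

In case_01 and case_03, a third of the panel disagreed with the label the model will be trained and scored against; that dissent disappears from the "ground truth" entirely.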

To identify an AI tool’s ground truth, simply ask the vendor or developers. Verify their answers by searching for “ground truth” or “label” in technical research reports and methodology summaries. For medical tools subject to regulatory approval, this information is publicly available on the U.S. Food and Drug Administration website. We recommend deeply engaging with AI vendors and internal development teams and having open conversations about their ground truth selections, the logic behind those choices, and any trade-offs they considered. Reticence to discuss these topics transparently should be interpreted as a serious red flag.

* * *

Here is a direct link to the complete article.


1. S. Lebovitz, N. Levina, and H. Lifshitz-Assaf, “Is AI Ground Truth Really True? The Dangers of Training and Evaluating AI Tools Based on Experts’ Know-What,” MIS Quarterly 45, no. 3 (September 2021): 1501-1525.

2. C. DeBrusk, “The Risk of Machine-Learning Bias (and How to Prevent It),” MIT Sloan Management Review, March 26, 2018.

