In operating reviews and boardrooms, I keep seeing the same pattern: leadership asks for rigor, teams deliver the numbers, and promising AI efforts get judged as underperforming before the organization has actually learned what it takes to make them real. Then someone pulls the plug, scales back the investment, or lets the initiative quietly expire.
Sometimes they’re right. But often, they’ve just used the wrong test.
The problem isn’t that leaders care about measurement. Strong measurement discipline is exactly what separates organizations that scale AI from those that accumulate pilots. The problem is that many leaders are applying a mature-business scorecard to work that isn’t mature yet—and the result is a predictable misread.
The scorecard mismatch
Think about how most established businesses evaluate success: ROI within a defined window, cost takeout, headcount efficiency. These are sensible metrics for stable operations. Used too early on emerging AI work, they don’t create discipline. They create false negatives.
AI initiatives don’t mature on the same timeline as a product refresh or a cost-reduction program. The first value often surfaces as faster decisions, reduced rework, or improved data quality—not as a line item in next quarter’s P&L. Workflow redesign—the real work of integrating AI into how people actually operate—is slow, disruptive, and invisible to traditional financial reporting until it isn’t.
When leaders demand conventional ROI on a one-to-three year horizon, teams respond rationally: they optimize for what’s measurable. They chase near-term efficiency wins, avoid the messier work of process redesign, and build pilots designed to survive a financial review rather than to learn something. It’s not bad faith. It’s a logical response to the incentives the scorecard creates.
The result is what’s now being called “proof-of-concept fatigue”—organizations running dozens of AI experiments, few of which ever reach production. Gartner predicts that at least 30% of generative AI projects will be abandoned after proof of concept by the end of 2025. That’s not primarily a technology failure rate. It’s a measurement failure rate.
Four forms of value that fall off the scorecard
When organizations apply legacy metrics to AI work, four things consistently disappear from the frame.
Learning value. Early AI initiatives should be generating organizational knowledge—about which processes are actually AI-ready, where the data problems are, which teams can absorb change and which can’t. None of that appears on a standard ROI dashboard. If learning isn’t being tracked, it isn’t being valued. Eventually, it stops happening.
Adoption reality. A model that performs well in a controlled pilot but fails at the point of deployment doesn’t signal a technology problem. It signals a measurement design problem: the pilot criteria didn’t include the humans who would actually use it. Healthcare is full of examples: AI tools evaluated on administrative metrics that then crater when clinicians encounter them in real workflows. The benchmark omitted the most important variable.
Workflow value. McKinsey research identifies workflow redesign—not model accuracy—as the single largest driver of AI’s EBIT impact. But workflow redesign is expensive and disruptive. When leaders measure AI against near-term efficiency targets, teams have every incentive to skip it. The faster path to a defensible number is a narrow pilot that proves almost nothing about whether AI can actually change how the business operates.
Capability value. Organizations that get compounding returns from AI develop internal judgment over time—about where AI helps, where it doesn’t, how to integrate it without losing human accountability. That doesn’t show up in year-one cost savings. It shows up years later as a competitive advantage. MIT Sloan research found that organizations updating their KPIs to reflect how AI creates value were three times more likely to see meaningful financial benefit than those that didn’t. The metric change came before the financial gain.
Metrics are not neutral
This is the part that often gets lost in conversations about measurement rigor: the metrics you choose signal what you actually value.
When leadership sets traditional ROI as the primary standard for an AI initiative, they’re not just defining a measurement framework. They’re telling the team what matters. And if what matters is a short-term number, teams will build for that. You get the outcome your scorecard rewards, which may have nothing to do with the transformation you said you wanted.
Over 40% of companies report struggling to define or measure the impact of their AI initiatives, and fewer than half are using AI-specific KPIs at all. That’s not a data problem. It’s a leadership problem. If the people setting the measurement standard haven’t updated their thinking about what early-stage AI value looks like, no amount of analytical capability downstream will fix it.
The questions worth sitting with
I’m not arguing against measurement. I’m arguing for measurement that fits the stage of the work.
A few questions: Are the metrics you’re applying to this initiative the same ones you’d use to evaluate a mature business line? If so, why? What would you need to see in year one to know you’re building toward something real—even if traditional ROI isn’t visible yet? Is your team optimizing for learning, or for a number that will survive a budget review?
The goal isn’t softer standards. It’s smarter ones. There’s a real difference between an initiative generating genuine learning and building toward scale, and one producing theater for a quarterly review. Good measurement tells those two things apart.
The wrong scorecard doesn’t just misread AI value. It trains the organization to produce less of it.
