If there is one skill more valuable than learning how to use LLMs, it's building high-quality AI products.
In 2026, using AI productively is table stakes. Shipping an AI feature is no longer hard - prototypes are easy and demos look impressive. You can ship an agent in a weekend without writing a line of code yourself.
Getting AI products to be consistently good and fast for most of your users is hard. It’s no surprise that most companies are struggling to drive adoption of their AI launches, and public software stocks are down >30%.
95% of AI pilots might fail, but 5% will rule the world
The challenge of going from an AI demo to a systematically improving product is real for startups and incumbents alike. MIT’s study last year demonstrated this, finding that 95% of pilots were still not considered successful by enterprise clients.
The gap between demo and production isn't technical—it's methodological. Teams that ship impressive prototypes often hit a wall when trying to make their AI reliable enough for real users. The difference between the 5% that succeed and the 95% that fail comes down to having a system for continuous improvement.
Only a small group of AI-native startups and rapidly transforming SaaS leaders have successfully built an **AI Flywheel — a systematic way to ship and iteratively improve AI agents and LLM-powered products.** At the center of this flywheel is AI Evals.
Here's what that looks like in practice:
Notion’s AI team has accelerated 10X thanks to systematic evals, going from fixing 3 to 30 issues every day. When a new model drops, they can ship to prod in <24 hours. AI has accelerated the company’s growth, with premium AI features reaching over 50% adoption across their customer base.
Anthropic’s sprints now start with evals: “…it’s the best way to stress-test whether the product requirements are concrete enough to start building."
Intercom’s eval focus has famously supported Fin’s evolution from a single-channel chatbot with ~50% accuracy to a multilingual, voice-enabled platform covering 45 languages, with >90% accuracy and a whopping 66% average resolution rate. Sierra is taking it even further with simulated environments where customers can run their own evals.
Companies like Cursor marry internal evaluations with real-time user adoption signals - tracking whether users actually keep the generated code, as a single "risky edit" can destroy developer trust.
What these companies have in common is treating quality as measurable and improvable, not something you hope for after launch. They've moved from "ship and pray" to "ship and improve."
Whether we call it monitoring, test-driven development, or verifiability, it all comes down to evals.
Without evals, AI products stagnate after launch
The real work in AI development starts after that first release.
Users have wildly different experiences. Some love the product. Others get stuck or frustrated. The team argues about whether quality is improving. No one can agree on where to focus or what to prioritize. Leadership wants to see steady progress, but what hill are we climbing?
Sound familiar? This chaos is what happens when you try to manage a probabilistic system with deterministic tools. Your old playbook of user stories, bug tickets, and A/B tests breaks down when every user gets a different output from the same input.
This is exactly the scene we walk in on when we start working with clients — particularly for companies with a large existing customer base and high reputation risk. AI development is stochastic, with no two users having the same experience. Your old way of fixing bugs becomes a frustrating game of whack-a-mole, with >50% of users never coming back.
The traditional PM toolkit falls short because:
Improvements are invisible. You tweak the prompt, redeploy, and... did it actually get better? For whom? In what scenarios?
Bugs aren’t always reproducible. When users report problems, the exact issue might not be reproducible since the next output will be different.
User feedback is directional, not diagnostic. You know something's wrong, but not what or why.
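To make the contrast concrete, here is a minimal sketch of what measurement looks like once outputs are probabilistic: instead of a single pass/fail verdict per bug ticket, you sample the same input many times and track a pass rate. The `model` stub, the `passes` check, and the prompt are all hypothetical stand-ins, not any real product's code.

```python
import random

def model(prompt: str, rng: random.Random) -> str:
    """Stub standing in for an LLM call -- returns a different
    answer on each call, like a sampled model output."""
    return rng.choice(["Paris", "Paris, France", "Lyon"])

def passes(output: str) -> bool:
    """A simple programmatic check; a real eval might use a rubric
    or an LLM judge instead of substring matching."""
    return "Paris" in output

def pass_rate(prompt: str, n: int = 100, seed: int = 0) -> float:
    """Sample the same prompt n times and report the fraction that
    pass. This number -- not a single repro -- is what you improve."""
    rng = random.Random(seed)
    return sum(passes(model(prompt, rng)) for _ in range(n)) / n

print(pass_rate("What is the capital of France?"))
```

With a seeded generator the number is reproducible run to run, which is exactly the property that lets a team argue about whether a prompt tweak helped rather than trading anecdotes.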
This era of AI products needs its own system for measuring, analyzing, and improving metrics. Agents and AI products deliver work, not workflows. Your product is no longer a set of deterministic flows. It’s a probabilistic system producing a wide distribution of outcomes for every user.
As building workflow software gets increasingly commoditized, the best companies and career opportunities are moving to building AI agents that deliver the work. 2026 is the tipping point.
This is why we’re excited to launch AI Evals, a new course on Reforge for product builders focused on the practical craft of defining, measuring, and improving the quality of AI-powered products.
“Evals aren’t just a technical matter — they are fundamentally an exercise in product thinking. PMs can’t just be sitting on the sidelines when it comes to measuring AI product success. They need to be defining what good looks like and leading the team in getting there.”
— Brian Balfour
Why you should take this course
The only way to learn AI skills is by dedicating time to building and experimenting. We have designed this course as a practical introduction to AI Evals for product managers, structured around the same AI Product Flywheel we’ve developed working with leading teams on real world agents.
You have already listened to all the podcasts you need. Dive into the world of being an AI PM with demos, step-by-step walkthroughs, case studies and practical guides. This course spans 8 hours of live sessions and dozens of hours of self-paced exercises. By the end of the course, you will walk away with these skills:
Designing PRDs for AI products and constructing an eval-driven roadmap
Understanding the complete evaluation lifecycle: eval rubrics, golden datasets, and trace analysis
Designing automated evaluators and aligning them to human judgement + product taste
Evaluating agent architectures with emerging best practices in context engineering
Decomposing multi-step pipelines, visualizing them, and monitoring the user experience
Building custom admin tools to accelerate your team’s progress
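As a taste of the lifecycle the skills above describe, here is a minimal sketch of an eval harness: a hand-curated golden dataset, a rubric-based automated evaluator, and a single aggregate score a team can track release over release. Every name, example, and rubric criterion here is illustrative, not from any real course material or product.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenExample:
    input: str
    must_include: list[str]  # rubric criteria (hypothetical schema)

# A tiny hand-curated golden dataset -- contents are made up.
GOLDEN = [
    GoldenExample("Summarize our refund policy", ["30 days", "receipt"]),
    GoldenExample("Greet a new user", ["welcome"]),
]

def rubric_eval(output: str, example: GoldenExample) -> float:
    """Score 0..1 by the fraction of rubric criteria the output meets.
    A production evaluator might be an LLM judge aligned to human labels."""
    hits = sum(c.lower() in output.lower() for c in example.must_include)
    return hits / len(example.must_include)

def run_evals(generate: Callable[[str], str]) -> float:
    """Run the whole golden set and return the mean score -- the
    single number to compare across prompts, models, and releases."""
    scores = [rubric_eval(generate(ex.input), ex) for ex in GOLDEN]
    return sum(scores) / len(scores)

# Usage with a stand-in "model":
fake_model = lambda prompt: "Welcome! Refunds within 30 days with receipt."
print(run_evals(fake_model))  # → 1.0
```

The design choice worth noticing: the evaluator is decoupled from the generator, so the same golden set can score a prompt tweak, a new model, or a whole new agent architecture on equal footing.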
Ready to join the 5%?
The companies winning at AI aren't the ones with the best demos — they're the ones with the best systems for making their AI better every day. Learn how to build yours.
If you're interested in a preview of this course, you can attend this event on Feb 26th.