Building AI products: the mistakes PMs make before a line of code is written


Your AI product will probably fail, and it won’t be the model’s fault

Here’s a pattern I’ve seen at three different companies: A PM gets excited about AI. They spec out a feature. Engineering builds it. The model actually works — 85% accuracy, reasonable latency, impressive demos. Six months later, the feature is quietly deprecated because users don’t trust it, the edge cases are nightmarish, and nobody thought through how to evaluate success.

The technical team delivered exactly what was asked for. The product still failed.

Building AI products requires a different mental model than building traditional software, and most of what breaks happens before anyone writes a line of code. The mistakes are strategic, not technical. They’re about how you frame the problem, define success, and structure requirements.

This isn’t a post about machine learning concepts or how transformers work. It’s about the decisions you make as a PM that determine whether your AI feature becomes a core part of the product or an embarrassing checkbox on a roadmap slide.

Why AI products fail differently

Traditional software is deterministic. You click a button, the same thing happens. If there’s a bug, you fix the code. Users develop predictable expectations, and the product either meets them or doesn’t.

AI products break this contract in ways that cascade through the entire user experience:

Non-deterministic outputs: The same input can produce different outputs. Users notice this. They interpret it as inconsistency, unreliability, or — worse — the product being “broken.” I watched a user test where someone asked the same question twice, got slightly different answers, and concluded the system “didn’t actually know anything.” The answers were both correct.

Data-dependent quality: Your model is only as good as the data it was trained on. This sounds obvious until you realize that data quality issues often don’t surface until users from underrepresented segments start complaining. By then, you’ve shipped to thousands of people.

Trust sensitivity: Users scrutinize AI decisions in ways they never scrutinize traditional software. They’ll accept a recommendation algorithm serving them content without question, but the moment you label something as “AI-powered,” they want to know why, how confident you are, and what happens if you’re wrong.

Continuous evolution: Models improve or degrade over time. Data drift happens. User behavior changes. The product you shipped six months ago may not be the product users are experiencing today, and you might not know it.

These differences mean you can’t apply your standard PM playbook and expect it to work. You need to think about AI as a system, not a feature.

Mistake 1: Treating AI as a feature instead of a system

When PMs spec AI functionality, they tend to write requirements like this: “The system should recommend relevant products based on user browsing history.” Clean, simple, familiar.

The problem is that this requirement describes an outcome without acknowledging the infrastructure that makes the outcome possible. AI features require:

  • Feedback loops: How does the model know if its recommendations were good? What signals are you capturing? Click-through? Purchases? Time spent? Returns?
  • Data pipelines: Where does training data come from? How is it labeled? Who reviews it? How often is it refreshed?
  • Evaluation frameworks: How do you know if version 2 of the model is better than version 1? What metrics matter? What’s your testing methodology?
  • Monitoring systems: How do you detect when the model starts underperforming? What alerts exist? Who’s responsible for responding?

None of these show up in a typical feature spec. They’re “implementation details.” But they’re also the difference between an AI feature that improves over time and one that silently degrades until users stop using it.

I worked with a team that shipped a document classification feature without building proper feedback loops. The model worked well at launch because the training data was recent. Eighteen months later, users’ document formatting norms had shifted, the model’s accuracy had dropped 12 percentage points, and nobody knew until a customer escalation. There was no system in place to detect the drift.
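
For a sense of how little it would have taken to catch that earlier, here is a minimal sketch of the kind of drift check that was missing. It assumes you keep a small, freshly labeled evaluation set and score the live model against it on a schedule; the model interface, the baseline, and the alert threshold are all hypothetical, for illustration only.

    # Minimal accuracy-drift monitor (illustrative sketch, not a production system).
    # Assumes a small, freshly labeled eval_set of (text, label) pairs and a model
    # object with a predict() method; names and thresholds are hypothetical.
    LAUNCH_ACCURACY = 0.87        # accuracy measured on the eval set at launch
    ALERT_THRESHOLD = 0.05        # alert if accuracy drops more than 5 points

    def accuracy(model, eval_set):
        correct = sum(1 for text, label in eval_set if model.predict(text) == label)
        return correct / len(eval_set)

    def check_for_drift(model, eval_set, notify):
        current = accuracy(model, eval_set)
        drop = LAUNCH_ACCURACY - current
        if drop > ALERT_THRESHOLD:
            notify(f"Classifier accuracy is {current:.0%}, {drop:.0%} below the "
                   f"launch baseline of {LAUNCH_ACCURACY:.0%}; review recent data.")
        return current

Run on a schedule against a refreshed evaluation set, a check like this turns “nobody knew” into an alert long before a customer escalation.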

When you’re building AI products, the system is the product. The model is just one component.

Mistake 2: Confusing capability with value

A state-of-the-art generative model can produce images from a text prompt. That’s a capability. Whether users want that capability, trust it, or will pay for it — those are separate questions.

This confusion is everywhere in AI product development. A data science team demonstrates something technically impressive in a notebook. A PM gets excited. A roadmap gets updated. Six months later, you’re in user research watching people ignore the feature because:

  • They don’t trust the output enough to act on it
  • The value proposition isn’t clear
  • The workflow doesn’t fit how they actually work
  • The error rate is acceptable for demos but unacceptable for their use case

Capability ≠ value. The question isn’t “what can the AI do?” It’s “what will users actually trust the AI to do, given the stakes of being wrong?”

Consider a medical diagnosis AI. Technically capable of identifying conditions from imaging data. But physicians won’t let it make final decisions — the stakes are too high. So the actual product value isn’t “automated diagnosis.” It’s “surface potential findings for physician review, reducing time to diagnosis by 20%.” Different framing, different success metrics, different product design.

Before building anything, answer these questions:

  • What’s the user’s current workflow without AI?
  • What are the consequences of the AI being wrong?
  • Will users trust the AI enough to change their behavior?
  • What evidence do we have that users want this problem solved this way?

If you can’t answer these from user research, you’re speculating. And speculating with AI features is expensive because the iteration cycles are longer and the technical investment is higher.

Mistake 3: Not defining “good enough” before building

Here’s a conversation I’ve heard variations of at least a dozen times:

PM: “How accurate is the model?”
Engineer: “About 87% on our test set.”
PM: “Is that good?”
Engineer: “It depends on the use case.”
PM: “Well, is it good enough to ship?”
Engineer: “That’s a product decision.”

Both people are right. And nothing gets resolved because nobody defined success criteria upfront.

Before any AI development starts, you need to answer:

  • What accuracy is acceptable? (And how do you measure accuracy for this specific task?)
  • What latency is acceptable? (Sub-second for real-time features? Minutes for background processing?)
  • What’s the cost of false positives vs. false negatives? (Often asymmetric — spam filters should let some spam through rather than block legitimate email)
  • What does graceful degradation look like? (When the model can’t provide a confident answer, what happens?)

These aren’t technical decisions that engineering makes later. They’re product decisions that shape the entire build approach. A 95% accuracy requirement might double development time compared to 85%. You need to decide if that’s worth it.

I use a simple framework: map error consequences to acceptable error rates.

If the AI recommends a restaurant and it’s closed, the consequence is mild inconvenience. 80% accuracy might be fine. If the AI flags a financial transaction as fraudulent and blocks it, wrong answers mean customer complaints and potential churn. You need 99%+ accuracy or a fast human review process.

Define your tolerances before building. Write them down. Make them specific.
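
To make “write them down” concrete, here is one way to capture tolerances as something the team can test against rather than as prose. This is a minimal sketch; the fraud-flagging scenario, the field names, and every number in it are invented for illustration.

    # Hypothetical "good enough" definition for a fraud-flagging feature, written
    # down before development starts. Every value is an example, not a benchmark.
    from dataclasses import dataclass

    @dataclass
    class SuccessCriteria:
        min_precision: float        # of transactions we flag, share that is really fraud
        min_recall: float           # of real fraud, share we must catch
        max_latency_ms: int         # the decision must come back within this budget
        false_positive_cost: float  # cost of wrongly blocking a good transaction ($)
        false_negative_cost: float  # cost of missing real fraud ($)

        def expected_error_cost(self, fp_rate: float, fn_rate: float) -> float:
            # Asymmetric by design: a missed fraud case costs far more than a
            # wrongly blocked transaction, so the two tolerances differ.
            return fp_rate * self.false_positive_cost + fn_rate * self.false_negative_cost

    fraud_flagging = SuccessCriteria(
        min_precision=0.99, min_recall=0.90, max_latency_ms=300,
        false_positive_cost=25.0, false_negative_cost=400.0,
    )

The code itself doesn’t matter. What matters is that precision, recall, latency, and the asymmetric cost of each error type are pinned to specific numbers before anyone trains a model.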

The five risks of AI products

Marty Cagan’s framework identifies four product risks: value, usability, feasibility, and business viability. For AI products, I’d modify this slightly because AI introduces a fifth risk that deserves explicit attention: ethics risk.

Value risk: Will users actually use this? With AI, this becomes “will users trust this enough to change their behavior?” Adoption barriers are higher because users are skeptical of AI decisions. You need to earn trust, not just demonstrate capability.

Usability risk: Can users successfully use this? With AI, this includes: Can users understand what the AI is doing? Can they correct it when it’s wrong? Can they predict when it will or won’t work? Non-deterministic systems create usability challenges that traditional products don’t have.

Feasibility risk: Can we build this? With AI, feasibility depends heavily on data availability. You might have a great model architecture but insufficient training data. Or the data you need exists but is siloed across systems that don’t talk to each other. Or the compute costs make the feature economically unviable.

Business viability risk: Does this work for the business? AI features often have ongoing costs — compute, data labeling, model maintenance — that traditional features don’t. A feature that looks viable at 1,000 users might be prohibitively expensive at 1 million.

Ethics risk: Should we build this? AI products can harm people in ways that software bugs don’t. Biased hiring algorithms. Discriminatory credit decisions. Surveillance systems that violate privacy. This isn’t an edge case — it’s a systematic risk that requires explicit evaluation.

For every AI feature, I run a quick assessment:

  • Could this feature harm users from marginalized groups differently than others?
  • What data is used, and was it collected ethically?
  • If this feature fails, who bears the consequences?
  • Would we be comfortable if this was reported on by a journalist?

Ethics risk isn’t hypothetical. It’s the reason several major companies have had to kill AI products after launch, with significant reputation damage.

How to write requirements for AI features

Traditional requirements focus on deterministic behavior: “When the user clicks X, the system does Y.” AI requirements need to handle probabilistic behavior, define success metrics, and specify failure modes.

Here’s a template I’ve refined over multiple AI products:

AI feature requirements template

1. Problem statement

  • What user problem are we solving?
  • What does the user currently do without this feature?
  • What’s the cost of the status quo?

2. Capability definition

  • What should the AI be able to do?
  • What inputs does it receive?
  • What outputs does it produce?
  • What are the explicit boundaries? (What should it NOT do?)

3. Success criteria

  • Target accuracy/performance metric: [specific number]
  • Acceptable latency: [specific number]
  • False positive tolerance: [specific number and rationale]
  • False negative tolerance: [specific number and rationale]

4. Failure handling

  • When confidence is low, what happens?
  • What’s the fallback experience?
  • How does the user recover from AI errors?
  • What feedback mechanisms exist?

5. Data requirements

  • What training data is needed?
  • Where does it come from?
  • How is it labeled?
  • What are the known gaps or biases?

6. Monitoring and evaluation

  • How do we know if the feature is working in production?
  • What signals indicate degradation?
  • What’s the feedback loop for improvement?

7. Ethics assessment

  • Who could be harmed by this feature?
  • What bias risks exist in the data or approach?
  • What safeguards are in place?

This template isn’t exhaustive, but it forces conversations that typical requirements skip. Most importantly, it makes you specify numbers where you’d otherwise use vague language.

Evaluating AI features in production

You can’t improve what you don’t measure. AI evaluation is harder than traditional product analytics because the relationship between user behavior and AI performance is indirect.

Three types of signals matter:

Explicit feedback: Thumbs up/down. Star ratings. “Was this helpful?” These are valuable but have low response rates. You’ll hear from users at the extremes — delighted or frustrated — and miss the middle.

Behavioral signals: Did users accept the AI’s suggestion? Did they modify it? Did they ignore it entirely? Did they complete their workflow faster? These are higher volume but require careful interpretation. An ignored suggestion could mean the AI was wrong, or that the user already knew what they wanted.

Downstream outcomes: If the AI recommends products, did users buy them? If it classifies support tickets, did the ticket get resolved? These are the most meaningful but have the longest feedback loops.

Good evaluation combines all three. Explicit feedback tells you about user perception. Behavioral signals tell you about in-the-moment usefulness. Downstream outcomes tell you about actual impact.

One specific technique I’ve found valuable: measure the correction rate. When users can edit AI outputs, track how often they make changes and what kinds of changes they make. High correction rates in specific categories point to systematic model weaknesses. Decreasing correction rates over time indicate the model is learning (or users are giving up).
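
If the product logs what the AI produced and what the user finally kept, the correction rate falls out of a few lines of analysis. The event fields below are an assumption about what such a log might contain, not a reference to any particular analytics tool.

    # Correction rate from an event log (illustrative sketch; the event schema
    # is assumed). Each event records the AI's output, what the user kept, and
    # a category so weaknesses can be localized.
    from collections import defaultdict

    def correction_rates(events):
        edited = defaultdict(int)
        total = defaultdict(int)
        for e in events:
            total[e["category"]] += 1
            if e["final_output"].strip() != e["ai_output"].strip():
                edited[e["category"]] += 1
        return {cat: edited[cat] / total[cat] for cat in total}

Categories with persistently high rates point at systematic model weaknesses; tracking the trend per category over time is what separates “the model is learning” from “users gave up and stopped editing.”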

Building trust with users

Trust is the ultimate constraint on AI product value. Users who don’t trust the AI won’t change their behavior, regardless of how good the model actually is.

Trust is built through:

Transparency: Tell users what the AI is doing and why. Not technical explanations, but functional ones. “We recommended this because you’ve bought similar items before” is more trust-building than a mysterious “Recommended for you.”

Explainability: Let users see the reasoning. Highlight the factors that influenced a decision. This doesn’t need to be technically accurate to how the model works internally — it needs to be conceptually honest about the logic.

Confidence indicators: Show users how certain the AI is. If the model isn’t confident, say so. “I’m not sure about this, but…” is more trustworthy than false confidence. Notion AI does this well — it qualifies uncertain outputs. (A short sketch of this kind of confidence gating follows the last item in this list.)

Graceful failure: Design the experience so that AI failures don’t feel catastrophic. Let users easily correct mistakes. Make the undo path obvious. Treat AI outputs as suggestions, not decisions.

Human oversight: For high-stakes decisions, keep humans in the loop. Even if the AI is technically capable of making the decision alone, human review builds trust. Over time, as users develop confidence, you can adjust the balance.

The most important principle: earn trust incrementally. Start with low-stakes applications where being wrong is cheap. Let users develop confidence in the AI’s judgment through repeated positive experiences. Expand to higher-stakes applications only after trust is established.

Google Maps earned trust over years of accurate turn-by-turn directions before users started following it blindly. Your AI feature doesn’t get that trust on day one.

What to do next

If you’re currently speccing or building an AI feature, pull up your requirements document and check: Have you specified a numeric accuracy target? Do you know your false positive vs. false negative tolerance? Is there a plan for measuring success in production?

If the answer to any of those is no, stop and fill in the gaps before more code gets written. The time you spend now will save you months of rework later.

Building AI products that actually work means making decisions that feel like “implementation details” but actually determine success or failure. The model is the easy part. The system around it is where products succeed or fail.

Frequently asked questions

What makes AI products different from traditional software?

AI products are non-deterministic (same input may produce different outputs), data-dependent (quality depends on training data), continuously improving (or degrading), and require trust-building with users who may be skeptical of AI decisions.

What are the biggest risks in AI product development?

Hallucination (AI confidently wrong), bias (unfair outputs for certain groups), latency (AI is slower than rule-based systems), explainability (why did the AI do that?), and data quality (garbage in, garbage out).

How do you write requirements for AI features?

Define the desired behavior with examples, specify acceptable error rates, define what ‘good enough’ looks like for the model, identify what data is needed for training, and specify what the fallback behavior is when the AI fails or is uncertain.

Ty Sutherland

Ty Sutherland is the editor of Product Management Resources. With a quarter-century of product expertise under his belt, Ty is a seasoned veteran in the world of product management. A dedicated student of lean principles, he is driven by the ambition to transform organizations into Exponential Organizations (ExO) with a massive transformative purpose. Ty's passion isn't just limited to theory; he's an avid experimenter, always eager to try out a myriad of products and services. While he has a soft spot for tools that enhance the lives of product managers, his curiosity knows no bounds. If you're ever looking for him online, there's a good chance he's scouring his favorite site, Product Hunt, for the next big thing. Join Ty as he navigates the ever-evolving product landscape, sharing insights, reviews, and invaluable lessons from his vast experience.
