Mise

Meta · AI Product · Internal Tools · Data & Analytics · 2025–2026

Designing for Trust at Scale

Mise is Meta's internal AI for analytics work — the #1 trending Metamate agent, with 6,600+ weekly active users and 56% of analytics ICs relying on it for daily tasks. I led design for the context management system that turned its biggest engineering constraint into its strongest trust signal.

AI Product · Trust & Transparency · Design Systems · Cross-functional Leadership · Eval-Driven Design

01 The Challenge

When the Model Knows More Than the User Can See

Mise had a problem that didn't look like a design problem on the surface. It looked like an engineering problem. The agent's context window — the working memory that lets it answer questions accurately — was filling up faster than the team could expand it. Long sessions degraded. Multi-step analyses lost the thread. Users blamed the AI, not the architecture, and trust eroded one stale answer at a time.

The team had been treating context as a capacity problem: more tokens, smarter retrieval, better cookbooks. I argued it was a trust problem. Users weren't asking for more context — they were asking to understand what the agent knew, when, and why. The fix wasn't bigger memory. It was visible memory.

My Role

Product Design Manager, Meta · Design lead on Mise Context Management
Working with: 1 PM, 1 EM, engineering team, content design, eval and research partners
Scope: Context Management system (knowledge bases, response evaluation, AI self-critique sub-agent, Spot Check)

Core User Problems

Users couldn't tell whether the agent was working from fresh, relevant context or stale defaults
Errors compounded silently — by the time a user noticed, they'd already shared the output
The feedback loop for fixing failing outputs was buried in engineering tooling, not accessible to designers or analysts
Trust was binary — when the agent was wrong once, users defaulted to manual work for weeks

02 The Design Question

How might we turn context, usually invisible AI infrastructure, into a trust surface that is legible to non-technical users and actionable for analysts?

The opportunity wasn't to hide the machinery better. It was to expose the right parts of it at the right moments. Three audiences needed three different views of the same context state: the casual user at a glance, the curious user one click deeper, and the debugger who needs the full reasoning trail.

Our Strategy

Progressive disclosure as a trust mechanic

Glance, Curious, Debugger — three depth levels, one system. Surface confidence at the top; layer reasoning underneath.

Earn the right to be ignored

Inline signals should be quiet when the agent is performing well. They should escalate only when the user needs to look. Trust is built by being unobtrusive on good days.

Non-blocking by default

Trust features flag potential issues but never gate the user's flow. The user decides when to slow down.

Honest about uncertainty

When the model isn't sure, say so. A confident "I don't know" outperforms a confident wrong answer every time.
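The four principles above can be sketched as a single rendering function. This is a minimal illustration, not Mise's implementation: the `Answer` fields, the `QUIET_ABOVE` threshold, and the wording of each signal are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Depth(Enum):
    GLANCE = 1
    CURIOUS = 2
    DEBUGGER = 3


@dataclass
class Answer:
    text: str
    confidence: float            # hypothetical 0.0-1.0 score
    sources: list[str]           # hypothetical source labels
    reasoning_trace: list[str]   # hypothetical step descriptions


QUIET_ABOVE = 0.8  # assumed threshold; above it, the Glance view stays silent


def trust_signal(answer: Answer, depth: Depth) -> list[str]:
    """Return annotation lines for one depth level.

    The answer text is never withheld: signals annotate, they do not gate.
    """
    uncertain = answer.confidence < QUIET_ABOVE
    if depth is Depth.GLANCE:
        # Quiet on good days; escalate only when the user needs to look.
        return ["Low confidence: worth a second look"] if uncertain else []
    lines = [
        f"Confidence: {answer.confidence:.0%}",
        "Sources: " + ", ".join(answer.sources),
    ]
    if depth is Depth.DEBUGGER:
        # The full reasoning trail is reserved for the deepest level.
        lines.extend(
            f"Step {n}: {step}"
            for n, step in enumerate(answer.reasoning_trace, start=1)
        )
    return lines


good = Answer("Revenue rose 4%", 0.92, ["table_a"], ["parsed question", "ran query"])
shaky = Answer("Revenue rose 4%", 0.40, ["table_a"], ["parsed question", "ran query"])
```

In this sketch, `trust_signal(good, Depth.GLANCE)` returns an empty list, while the same call on `shaky` returns a single low-confidence line. Each deeper level adds detail to the same underlying state rather than showing a different system.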

03 The Work

A Three-Tier Context System

I designed the conceptual model that the team built against: Immediate Context Window (what's loaded right now), Grounding & Retrieval (what the AI is pulling in from indexed sources), and Externalized State (the knowledge base and configuration layer that persists across sessions). Naming these tiers explicitly let engineering, design, and product debate them as a system, not a black box.

Tier 1: Immediate Context Window — the agent's working memory for the current turn
Tier 2: Grounding & Retrieval — RAG-driven context surfaced from indexed sources
Tier 3: Externalized State — cookbooks, recipes, eval results that persist across sessions
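One way to picture the three tiers is as a small data model. The sketch below is illustrative only: the field names, the token counts, and the `summary` method are hypothetical and do not come from the Mise codebase.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    IMMEDIATE = "Immediate Context Window"
    RETRIEVAL = "Grounding & Retrieval"
    EXTERNALIZED = "Externalized State"


@dataclass
class ContextItem:
    tier: Tier
    source: str   # hypothetical label, e.g. a cookbook or indexed table
    tokens: int   # hypothetical size estimate

    @property
    def persists(self) -> bool:
        # Only externalized state outlives the session in this model.
        return self.tier is Tier.EXTERNALIZED


@dataclass
class ContextState:
    items: list[ContextItem] = field(default_factory=list)

    def add(self, item: ContextItem) -> None:
        self.items.append(item)

    def by_tier(self, tier: Tier) -> list[ContextItem]:
        return [item for item in self.items if item.tier is tier]

    def summary(self) -> dict[str, int]:
        """Token count per tier, the kind of figure a top-level view could show."""
        return {
            tier.value: sum(item.tokens for item in self.by_tier(tier))
            for tier in Tier
        }


state = ContextState()
state.add(ContextItem(Tier.IMMEDIATE, "current question", 400))
state.add(ContextItem(Tier.RETRIEVAL, "indexed metrics table", 1200))
state.add(ContextItem(Tier.EXTERNALIZED, "team cookbook", 800))
state.add(ContextItem(Tier.EXTERNALIZED, "eval results", 300))
```

Naming the tiers as an explicit type is the code-level equivalent of the point above: each layer can be inspected and debated on its own.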

Knowledge Bases and Response Evals as a Designer Surface

Mise is used across 35+ product teams, each with its own domain knowledge and expectations. I worked with engineering to design the Improvement Loop, a five-step flow: view a failing evaluation, read the AI's diagnosis, accept or edit the fix, re-run, and see the score comparison.

Improvement Loop UI: failing eval → diagnosis → accept/edit → re-run → score delta
Three eval creation paths sized to expertise level — from "describe what you want in plain language" to "write the grader spec yourself"
Coverage targets that became real: Q1 hit 100+ evals across 10+ teams; Q2 targeting 20% knowledge base coverage
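The Improvement Loop reads naturally as a small state machine. The sketch below is a hypothetical model of that flow; the class name, step names, and scores are assumptions for illustration, not the shipped design.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Step(Enum):
    FAILING_EVAL = auto()
    DIAGNOSIS = auto()
    ACCEPT_OR_EDIT = auto()
    RE_RUN = auto()
    SCORE_DELTA = auto()


ORDER = list(Step)  # Enum iteration preserves definition order


@dataclass
class ImprovementLoop:
    eval_name: str
    score_before: float
    step: Step = Step.FAILING_EVAL
    score_after: Optional[float] = None

    def advance(self) -> Step:
        """Move through the review steps up to (and including) RE_RUN."""
        if self.step in (Step.RE_RUN, Step.SCORE_DELTA):
            raise ValueError("call rerun() to record a score, or start a new loop")
        self.step = ORDER[ORDER.index(self.step) + 1]
        return self.step

    def rerun(self, new_score: float) -> float:
        """Record the re-run score and return the score delta."""
        if self.step is not Step.RE_RUN:
            raise ValueError("the fix must be accepted or edited before re-running")
        self.score_after = new_score
        self.step = Step.SCORE_DELTA
        return self.score_after - self.score_before


loop = ImprovementLoop("weekly-retention-query", score_before=0.5)
loop.advance()              # read the diagnosis
loop.advance()              # accept or edit the proposed fix
loop.advance()              # ready to re-run
delta = loop.rerun(0.75)    # score delta shown to the user
```

Enforcing the order in code mirrors the UI constraint: a score comparison only exists once a fix has been reviewed and re-run.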

Critic AI: The Sub-Agent That Earned a Seat

Critic AI is a sub-agent that validates Mise responses before they reach the user. With Critic AI in the loop, eval accuracy improved from ~63% to ~72.4%, a lift of roughly 9 percentage points. The design challenge: the critic was invisible, not configurable, and felt like a mode rather than a layer.

Inline trust signals at Glance, Curious, and Debugger depths
User-configurable critic strength (off / light / strict)
Latency cost disclosed upfront — roughly doubles response time when engaged
Critic disagreements logged for the eval pipeline, closing the feedback flywheel
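A configurable critic layer along these lines could look like the sketch below. Everything here is an assumption for illustration: the pass thresholds, the `score_fn` scorer, the notice wording, and the shape of the disagreement log are not taken from the actual system.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class CriticStrength(Enum):
    OFF = "off"
    LIGHT = "light"
    STRICT = "strict"


# Assumed thresholds: the minimum critic score a draft needs to pass unflagged.
PASS_THRESHOLD = {CriticStrength.LIGHT: 0.5, CriticStrength.STRICT: 0.8}


@dataclass
class CriticResult:
    response: str   # always the draft; the critic annotates, it does not block
    flagged: bool
    notice: str     # latency disclosure shown when the critic is engaged


@dataclass
class CriticLayer:
    score_fn: Callable[[str], float]   # hypothetical critic scorer, 0.0-1.0
    disagreements: list[dict] = field(default_factory=list)

    def review(self, draft: str, strength: CriticStrength) -> CriticResult:
        if strength is CriticStrength.OFF:
            return CriticResult(draft, flagged=False, notice="")
        score = self.score_fn(draft)
        flagged = score < PASS_THRESHOLD[strength]
        if flagged:
            # Logged for the eval pipeline, so disagreements become eval cases.
            self.disagreements.append(
                {"draft": draft, "score": score, "strength": strength.value}
            )
        notice = "Critic engaged: response time roughly doubles"
        return CriticResult(draft, flagged, notice)


critic = CriticLayer(score_fn=lambda draft: 0.6)  # stub scorer for the example
light = critic.review("Revenue rose 4%", CriticStrength.LIGHT)
strict = critic.review("Revenue rose 4%", CriticStrength.STRICT)
off = critic.review("Revenue rose 4%", CriticStrength.OFF)
```

With the stub scorer, the same draft passes at light strength and is flagged at strict strength, and only the strict run adds an entry to the disagreement log.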

Spot Check: Human Review as a Trust Signal

Spot Check is a user-initiated review feature that sends agent-generated analysis to a human reviewer before the user acts on it. When the agent detects low confidence in its output, it proactively recommends a Spot Check in the chat — but the decision to request one always stays with the user.

I led the design direction for Spot Check, working closely with a junior designer who executed the visual design and interaction flows under my guidance. The feature was designed around two principles: keep it non-blocking (never a gate, always an offer), and close the loop (reviewer feedback feeds directly back into the model's training, improving confidence over time for similar queries).

Agent surfaces a Spot Check recommendation inline when confidence is low — framed as a suggestion, not a warning
User decides whether to send for review; the flow adds no friction to the primary path
Reviewer feedback is structured and routed back into model training, building a compound improvement loop
Over time, Spot Check reduces its own frequency — as the model learns from human corrections, the queries that previously triggered low confidence are resolved at the model layer

This last point mattered for adoption: Spot Check had to be useful enough to use, but designed to make itself less necessary. That required the feedback architecture to be a first-class design consideration, not an engineering afterthought.
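The non-blocking offer and the feedback loop can be sketched together. This is a hypothetical model: the `LOW_CONFIDENCE` threshold, the suggestion wording, and the `SpotCheckQueue` structure are assumptions, and the step that feeds reviewer feedback into training is represented only by a list.

```python
from dataclasses import dataclass, field
from typing import Optional

LOW_CONFIDENCE = 0.6  # assumed threshold for recommending a Spot Check


@dataclass
class AgentReply:
    analysis: str                     # always delivered, never held back
    confidence: float
    suggestion: Optional[str] = None  # inline offer, present only when useful


@dataclass
class SpotCheckQueue:
    pending: list[str] = field(default_factory=list)
    feedback: list[dict] = field(default_factory=list)

    def request(self, reply: AgentReply) -> None:
        """Called only when the user chooses to send the analysis for review."""
        self.pending.append(reply.analysis)

    def record_feedback(self, analysis: str, correct: bool, note: str) -> None:
        """Structured reviewer feedback, kept for routing back into training."""
        self.pending.remove(analysis)
        self.feedback.append({"analysis": analysis, "correct": correct, "note": note})


def deliver(analysis: str, confidence: float) -> AgentReply:
    suggestion = None
    if confidence < LOW_CONFIDENCE:
        # Framed as a suggestion, not a warning; the user decides.
        suggestion = "Want a second pair of eyes? You can request a Spot Check."
    return AgentReply(analysis, confidence, suggestion)


queue = SpotCheckQueue()
confident = deliver("Churn fell 2% week over week", 0.9)
unsure = deliver("Churn fell 2% week over week", 0.4)
queue.request(unsure)  # user opted in
queue.record_feedback(unsure.analysis, correct=False, note="wrong date range")
```

Note that `deliver` returns the analysis in both cases and nothing enters the queue until `request` is called, which is what keeps the primary path free of friction in this sketch.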

04 Outcomes

The #1 Trending Agent

The system shipped and Mise became the top trending internal agent at Meta. It was cited by Meta's CMO in his Davos 2026 remarks as the canonical example of "adding agents to your team."

6,600+ WAU

Weekly active users across analytics teams at Meta

56%

Share of analytics ICs using Mise for daily work

~63% → ~72.4%

Eval accuracy lift with Critic AI in the loop

#1

Trending Metamate agent across Meta, cited by Meta's CMO at Davos 2026

05 What's Next

The Foundation for the Next Wave of Agent Work

The context management system is now the foundation for the next wave of agent work — the knowledge base and eval patterns have been inherited directly by Research Studio's agentic layer.

The harder open question: as more sub-agents enter the system, how does the trust surface stay coherent? One critic is a feature. Five critics is a UI problem I haven't fully solved yet.

Status

Shipped (Beta) · Scale phase in planning · Cookbook and eval patterns inherited by Research Studio agentic layer.

06 What This Work Represents

The hardest part of designing AI products isn't the chat interface. It's the invisible substrate. Most AI design work focuses on the visible turn — the prompt, the response, the buttons. The interesting work is one layer deeper: the context, the memory, the eval system that decides what "good" means.

Make the invisible legible — but only the parts that earn the user's attention.
A trust surface is a system, not a screen. Design the layers, not just the labels.
If designers aren't in the eval loop, the AI's idea of "good" will diverge from the user's.

"You can't design a chat interface your way out of a context problem. You have to design the context."