Mise
Designing for Trust at Scale
Mise is Meta's internal AI for analytics work — the #1 trending Metamate agent, with 6,600+ weekly active users and 56% of analytics ICs relying on it for daily tasks. I led design for the context management system that turned its biggest engineering constraint into its strongest trust signal.
When the Model Knows More Than the User Can See
Mise had a problem that didn't look like a design problem on the surface. It looked like an engineering problem. The agent's context window — the working memory that lets it answer questions accurately — was filling up faster than the team could expand it. Long sessions degraded. Multi-step analyses lost the thread. Users blamed the AI, not the architecture, and trust eroded one stale answer at a time.
The team had been treating context as a capacity problem: more tokens, smarter retrieval, better cookbooks. I argued it was a trust problem. Users weren't asking for more context — they were asking to understand what the agent knew, when, and why. The fix wasn't bigger memory. It was visible memory.
My Role
Product Design Manager, Meta · Design lead on Mise Context Management.
Working with: 1 PM, 1 EM, engineering team, content design, eval and research partners.
Scope: Context Management system (knowledge bases, response evaluation, AI self-critique sub-agent, Spot Check).
Core User Problems
- Users couldn't tell whether the agent was working from fresh, relevant context or stale defaults
- Errors compounded silently — by the time a user noticed, they'd already shared the output
- The feedback loop for fixing failing outputs was buried in engineering tooling, not accessible to designers or analysts
- Trust was binary — when the agent was wrong once, users defaulted to manual work for weeks
How might we turn context, usually invisible AI infrastructure, into a trust surface that is legible to non-technical users and actionable for analysts?
The opportunity wasn't to hide the machinery better. It was to expose the right parts of it at the right moments. Three audiences needed three different views of the same context state: the casual user at a glance, the curious user one click deeper, and the debugger who needs the full reasoning trail.
Our Strategy
Glance, Curious, Debugger — three depth levels, one system. Surface confidence at the top; layer reasoning underneath.
Inline signals should be quiet when the agent is performing well. They should escalate only when the user needs to look. Trust is built by being unobtrusive on good days.
Trust features flag potential issues but never gate the user's flow. The user decides when to slow down.
When the model isn't sure, say so. A confident "I don't know" outperforms a confident wrong answer every time.
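To make the depth levels concrete, here is a minimal Python sketch of one trust signal rendered at Glance, Curious, and Debugger depth. It is an illustration of the principles above, not Mise's implementation: `TrustSignal`, `render`, and the `QUIET_ABOVE` cutoff are hypothetical names and values.

```python
from dataclasses import dataclass, field
from enum import Enum


class Depth(Enum):
    GLANCE = 1    # casual user, at a glance
    CURIOUS = 2   # one click deeper
    DEBUGGER = 3  # full reasoning trail


@dataclass
class TrustSignal:
    confidence: float                                  # 0.0 to 1.0
    reasons: list[str] = field(default_factory=list)   # short explanations
    trace: list[str] = field(default_factory=list)     # step-by-step reasoning


QUIET_ABOVE = 0.8  # hypothetical cutoff: stay silent when confidence is high


def render(signal: TrustSignal, depth: Depth) -> list[str]:
    """Return the lines a UI could show for this signal at the given depth."""
    lines: list[str] = []
    # Escalate only when the user needs to look; otherwise stay quiet.
    if signal.confidence < QUIET_ABOVE:
        lines.append("Low confidence: worth a closer look")
    # Layer reasoning underneath the top-level signal.
    if depth.value >= Depth.CURIOUS.value:
        lines.extend(signal.reasons)
    if depth is Depth.DEBUGGER:
        lines.extend(signal.trace)
    return lines
```

On a good day, `render` at Glance depth returns nothing at all, which is the point: the signal only takes up space when confidence drops or the user asks for more.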
A Three-Tier Context System
I designed the conceptual model that the team built against: Immediate Context Window (what's loaded right now), Grounding & Retrieval (what the AI is pulling in from indexed sources), and Externalized State (the knowledge base and configuration layer that persists across sessions). Naming these tiers explicitly let engineering, design, and product debate them as a system, not a black box.
- Tier 1: Immediate Context Window — the agent's working memory for the current turn
- Tier 2: Grounding & Retrieval — RAG-driven context surfaced from indexed sources
- Tier 3: Externalized State — cookbooks, recipes, eval results that persist across sessions
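The three tiers can be pictured as one data model. The Python sketch below is only an assumption about how such a model might look: every class, field, and default value (including the token budget and the 30-day staleness cutoff) is hypothetical and not taken from Mise's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ImmediateContext:
    """Tier 1: what is loaded for the current turn."""
    messages: list[str] = field(default_factory=list)
    token_budget: int = 8000  # illustrative number, not a real limit
    tokens_used: int = 0


@dataclass
class RetrievedChunk:
    """Tier 2: one piece of grounding pulled from an indexed source."""
    source: str
    text: str
    age_days: int  # how old the indexed source is


@dataclass
class ExternalizedState:
    """Tier 3: artifacts that persist across sessions."""
    cookbooks: list[str] = field(default_factory=list)
    eval_results: dict[str, float] = field(default_factory=dict)


@dataclass
class ContextState:
    immediate: ImmediateContext
    retrieved: list[RetrievedChunk]
    externalized: ExternalizedState

    def summary(self, stale_after_days: int = 30) -> dict[str, object]:
        """Facts a UI could surface about what the agent currently knows."""
        stale = [c.source for c in self.retrieved if c.age_days > stale_after_days]
        fill = self.immediate.tokens_used / self.immediate.token_budget
        return {
            "window_fill": round(fill, 2),
            "sources": [c.source for c in self.retrieved],
            "stale_sources": stale,
            "cookbooks": list(self.externalized.cookbooks),
        }
```

Naming the tiers as separate types mirrors the design intent: each one can be inspected and debated on its own, and `summary` shows how a single legible view could be derived from all three.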
Knowledge Bases and Response Evals as a Designer Surface
Mise is used across 35+ product teams, each with their own domain knowledge and expectations. I worked with engineering to design the Improvement Loop — view a failing evaluation, read the AI's diagnosis, accept the fix, re-run, see the score comparison.
- Improvement Loop UI: failing eval → diagnosis → accept/edit → re-run → score delta
- Three eval creation paths sized to expertise level — from "describe what you want in plain language" to "write the grader spec yourself"
- Coverage targets that became real: Q1 hit 100+ evals across 10+ teams; Q2 targeting 20% knowledge base coverage
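The Improvement Loop steps can be sketched as a small function. This is a hedged illustration in Python, not the shipped pipeline: `improvement_loop`, `LoopResult`, and the three callbacks are hypothetical stand-ins for the AI diagnosis, the human accept/edit step, and the eval re-run.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoopResult:
    before: float
    after: float

    @property
    def delta(self) -> float:
        """Score change shown to the user after the re-run."""
        return round(self.after - self.before, 4)


def improvement_loop(
    score_before: float,
    diagnose: Callable[[], str],
    review: Callable[[str], Optional[str]],
    rerun: Callable[[str], float],
) -> Optional[LoopResult]:
    """failing eval -> diagnosis -> accept/edit -> re-run -> score delta."""
    diagnosis = diagnose()   # the AI explains why the eval failed
    fix = review(diagnosis)  # a person accepts, edits, or rejects (None)
    if fix is None:
        return None          # nothing is applied without a human decision
    score_after = rerun(fix) # re-run the eval with the fix in place
    return LoopResult(before=score_before, after=score_after)
```

For example, a reviewer who accepts the diagnosis as-is and sees the score move from 0.5 to 0.75 would get a `delta` of 0.25; a reviewer who rejects the diagnosis gets `None` and no re-run happens.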
Critic AI: The Sub-Agent That Earned a Seat
Critic AI is a sub-agent that validates Mise responses before they reach the user. Eval accuracy improved from ~63% to ~72.4% with Critic AI in the loop. The design challenge: the critic was invisible, not configurable, and felt like a mode rather than a layer.
- Inline trust signals at Glance, Curious, and Debugger depths
- User-configurable critic strength (off / light / strict)
- Latency cost disclosed upfront — roughly doubles response time when engaged
- Critic disagreements logged for the eval pipeline, closing the feedback flywheel
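A configurable critic of this kind might look like the sketch below. It is an assumption-laden illustration: the `CriticGate` class, the numeric thresholds, and the shape of the disagreement log are all hypothetical. Only the three strength settings and the "roughly doubles" latency figure come from the list above.

```python
from dataclasses import dataclass, field
from enum import Enum


class CriticStrength(Enum):
    OFF = "off"
    LIGHT = "light"
    STRICT = "strict"


# Hypothetical thresholds: the minimum critic score counted as agreement.
PASS_THRESHOLD = {CriticStrength.LIGHT: 0.5, CriticStrength.STRICT: 0.8}


@dataclass
class CriticGate:
    strength: CriticStrength
    disagreements: list[dict] = field(default_factory=list)

    def estimated_latency(self, base_seconds: float) -> float:
        """Disclose the cost upfront: roughly double when the critic runs."""
        if self.strength is CriticStrength.OFF:
            return base_seconds
        return base_seconds * 2

    def check(self, response: str, critic_score: float) -> bool:
        """Return True when the critic agrees; log disagreements for evals."""
        if self.strength is CriticStrength.OFF:
            return True
        agrees = critic_score >= PASS_THRESHOLD[self.strength]
        if not agrees:
            # The log is what an eval pipeline could later consume.
            self.disagreements.append(
                {"response": response, "score": critic_score,
                 "strength": self.strength.value}
            )
        return agrees
```

Note that `check` only reports agreement and records the disagreement. What the interface does with a `False` result (for instance, showing an inline signal) is left to the caller, in keeping with the principle that trust features flag but never gate.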
Spot Check: Human Review as a Trust Signal
Spot Check is a user-initiated review feature that sends agent-generated analysis to a human reviewer before the user acts on it. When the agent detects low confidence in its output, it proactively recommends a Spot Check in the chat — but the decision to request one always stays with the user.
I led the design direction for Spot Check, working closely with a junior designer who executed the visual design and interaction flows under my guidance. The feature was designed around two principles: keep it non-blocking (never a gate, always an offer), and close the loop (reviewer feedback feeds directly back into the model's training, improving confidence over time for similar queries).
- Agent surfaces a Spot Check recommendation inline when confidence is low — framed as a suggestion, not a warning
- User decides whether to send for review; the flow adds no friction to the primary path
- Reviewer feedback is structured and routed back into model training, building a compound improvement loop
- Over time, Spot Check reduces its own frequency — as the model learns from human corrections, the queries that previously triggered low confidence are resolved at the model layer
This last point mattered for adoption: Spot Check had to be useful enough to use, but designed to make itself less necessary. That required the feedback architecture to be a first-class design consideration, not an engineering afterthought.
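The non-blocking rule can be expressed in a few lines. The Python sketch below assumes a numeric confidence score and a hypothetical `LOW_CONFIDENCE` cutoff; the function names, the suggestion copy, and the review queue are illustrative and not drawn from the shipped feature.

```python
from dataclasses import dataclass
from typing import Optional

LOW_CONFIDENCE = 0.6  # hypothetical cutoff for recommending a Spot Check


@dataclass
class AgentReply:
    answer: str
    confidence: float
    suggestion: Optional[str] = None  # an offer, never a gate


def with_spot_check_offer(answer: str, confidence: float) -> AgentReply:
    """Always return the answer; attach a suggestion only when confidence is low."""
    suggestion = None
    if confidence < LOW_CONFIDENCE:
        suggestion = "Want a second pair of eyes? You can request a Spot Check."
    return AgentReply(answer, confidence, suggestion)


def request_review(reply: AgentReply, user_opted_in: bool, queue: list) -> bool:
    """Only the user's choice sends anything to a human reviewer."""
    if user_opted_in:
        queue.append(reply)
    return user_opted_in
```

The answer is returned in every case, so the primary path is identical with or without the suggestion. The review queue only grows when the user opts in.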
The #1 Trending Agent
The system shipped and Mise became the top trending internal agent at Meta. It was cited by Meta's CMO in his Davos 2026 remarks as the canonical example of "adding agents to your team."
- 6,600+ weekly active users across analytics teams at Meta
- 56% of analytics ICs using Mise for daily work
- ~63% to ~72.4% eval accuracy with Critic AI in the loop
- #1 trending Metamate agent across Meta, cited by the CMO at Davos 2026
The Foundation for the Next Wave of Agent Work
The context management system is now the foundation for the next wave of agent work — the knowledge base and eval patterns have been inherited directly by Research Studio's agentic layer.
The harder open question: as more sub-agents enter the system, how does the trust surface stay coherent? One critic is a feature. Five critics is a UI problem I haven't fully solved yet.
Status
Shipped (Beta) · Scale phase in planning · Cookbook and eval patterns inherited by Research Studio agentic layer.
The hardest part of designing AI products isn't the chat interface. It's the invisible substrate. Most AI design work focuses on the visible turn — the prompt, the response, the buttons. The interesting work is one layer deeper: the context, the memory, the eval system that decides what "good" means.
- Make the invisible legible — but only the parts that earn the user's attention.
- A trust surface is a system, not a screen. Design the layers, not just the labels.
- If designers aren't in the eval loop, the AI's idea of "good" will diverge from the user's.
"You can't design a chat interface your way out of a context problem. You have to design the context."