When you're improving AI agents in production, you need to understand why they fail. Not just "it broke" but the actual patterns: Does it act on ambiguous information? Make assumptions instead of asking? Miss critical context?
The standard approach is manual annotation by domain experts. You review every failure, ask why it happened, and group similar patterns together. It's a core tenet of building reliable agents, but it's slow. We spent 20 hours with a panel of domain experts manually annotating 207 agent traces from production client data, using rigorous qualitative research methods (open coding and axial coding). This created 17 detailed failure categories.
Then LangSmith released their Insights Agent, claiming it could automatically discover failure patterns from trace data. We had the perfect dataset to test it: 207 annotated traces with expert-validated ground truth labels. This was a rare opportunity to rigorously measure how well automated clustering compares to expert human analysis.
It found 87.92% of our failure patterns in 35 minutes.
The Setup
We were evaluating a production AI agent that processes natural language user requests and performs actions through a series of tool calls. The agent has access to multiple tools and must make decisions about which actions to take on behalf of users.
Out of 207 test cases from production traces, 136 failed. That's a 66% failure rate, which meant we had significant work ahead to understand what was breaking.
Our team conducted a rigorous qualitative analysis using open coding and axial coding methodologies. A panel of domain experts reviewed every failed case over two weeks. For each failure, we asked: Why did this break? What pattern does this represent? How is it similar to other failures?
The process was meticulous and resulted in 17 distinct categories spanning three major themes:
Most failures (42%) were about decision-making and information gathering: the agent would act when critical information was ambiguous (80 cases), make assumptions instead of confirming details (24 cases), or miss important context in outputs (5 cases).
Protocol and communication failures (9%) involved incorrect routing decisions, inappropriate escalation choices, or tone and formatting issues.
Verification failures (3%) happened when the agent failed to get required confirmations or provided insufficient option diversity.
The remaining 5% were edge cases, each appearing only once or twice.
Some failures belonged to multiple categories. A single trace might exhibit both "acting without clarification" and "making assumptions" simultaneously, which meant the average failed case had 1.46 category labels.
This gave us a detailed understanding of what was breaking and why. But it took 20 hours of expert time across the panel. Could an AI do this automatically?
Testing the Insights Agent
We uploaded the traces to LangSmith and configured the Insights Agent with three simple questions:
- What does your agent do? (an AI agent that processes user requests and makes tool calls)
- What patterns should we look for? (behavioral failures where the agent violates policies or makes incorrect decisions)
- How is the data structured? (natural language inputs, tool call sequences as outputs)
Five minutes of setup. Thirty minutes of processing. Then we got results.
Results: Clean Run with Sanitized Data
Coverage Metrics
| Metric | Count | Percentage | 95% CI* |
|---|---|---|---|
| Total Runs | 207 | 100.00% | - |
| Categorized by Insights | 182 | 87.92% | ±4.4% |
| Uncategorized | 25 | 12.08% | ±4.4% |
| Failed Runs | 136 | 65.70% | ±6.4% |
| Failed Runs Categorized | 122 | 89.71% | ±5.1% |
| Success Runs | 63 | 30.43% | ±6.2% |
| Success Runs Categorized | 53 | 84.13% | ±8.9% |
| Unlabeled Runs | 8 | 3.86% | ±2.6% |
*Confidence intervals calculated using Wilson score interval
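For reference, here is a minimal sketch of the Wilson score interval computation behind these confidence intervals. The formula is the standard one and isn't LangSmith-specific; the example uses the 182/207 categorization count from the table above:

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion of k successes in n trials."""
    if n == 0:
        return (0.0, 0.0)
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin, center + margin)

# 182 of 207 runs categorized by Insights
low, high = wilson_interval(182, 207)
print(f"{182/207:.2%}, 95% CI [{low:.1%}, {high:.1%}]")
# 87.92%, 95% CI [82.8%, 91.7%]  (half-width of roughly the +/-4.4% shown in the table)
```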
An important observation: Insights categorized both successes (84%) and failures (90%) at similar rates, suggesting it was detecting behavioral patterns rather than just errors. This symmetry is a good sign—it means the clustering isn't simply separating "good" from "bad" traces, but finding meaningful patterns in how the agent behaves.
Insights Categories (12 Automated Clusters)
The Insights Agent created the following behavioral categories based on observed patterns in tool call sequences:
| Pattern | Runs | % of Total | % of Categorized | Agreement Rate* |
|---|---|---|---|---|
| Pattern A - Ambiguous input handling | 48 | 23.19% | 26.37% | 81.82% (36/44) |
| Pattern B - Temporal constraint interpretation | 43 | 20.77% | 23.63% | 34.38% (11/32) |
| Pattern C - Explicit constraint handling | 16 | 7.73% | 8.79% | 33.33% (3/9) |
| Pattern D - Multi-party coordination | 15 | 7.25% | 8.24% | 60.00% (9/15) |
| Pattern E - Priority signal detection | 14 | 6.76% | 7.69% | 20.00% (1/5) |
| Pattern F - Recurring context handling | 13 | 6.28% | 7.14% | 71.43% (5/7) |
| Pattern G - Environmental constraint handling | 11 | 5.31% | 6.04% | 44.44% (4/9) |
| Pattern H - Urgency signal processing | 11 | 5.31% | 6.04% | 66.67% (2/3) |
| Pattern I - Confirmation requirements | 3 | 1.45% | 1.65% | 50.00% (1/2) |
| Pattern J - Conflict detection | 3 | 1.45% | 1.65% | 100.00% (1/1) |
| Pattern K - Request type misclassification | 3 | 1.45% | 1.65% | N/A (0 manual) |
| Pattern L - State verification errors | 2 | 0.97% | 1.10% | 50.00% (1/2) |
*Agreement rate = % of traces in this Insights category that map to the same manual category (top manual category only)
The distribution shows clear concentration at the top. The three largest patterns (A, B, and C) account for just over half of all runs (51.69%, or 107 out of 207 traces). This is typical for failure distributions: a few high-volume patterns dominate, with a long tail of edge cases.
Agreement rates vary widely, ranging from 20% to 100% with a mean of 55.64% and median of 50%. About two-thirds of patterns (7 out of 11 with data) achieve at least 50% agreement, suggesting that most clusters have some coherence even if they're not perfectly aligned with manual categories. The standard deviation of 23.35% indicates substantial variation. Some patterns are highly coherent (Pattern A at 81.82%), while others are heterogeneous catch-alls (Pattern B at 34.38%). Pattern K had zero traces with manual labels, so we excluded it from agreement rate calculations.
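To make the agreement-rate definition concrete, here is a minimal sketch of how it can be computed. The record format and field names (`insights_pattern`, `manual_category`) are hypothetical stand-ins for however the two sets of labels are stored; the logic mirrors the footnote above (top manual category per cluster only):

```python
from collections import Counter, defaultdict

# Hypothetical records: one Insights pattern per trace, one primary manual label
# (or None for unlabeled traces).
traces = [
    {"insights_pattern": "Pattern A", "manual_category": "Cat 1"},
    {"insights_pattern": "Pattern A", "manual_category": "Cat 1"},
    {"insights_pattern": "Pattern A", "manual_category": "Cat 2"},
    {"insights_pattern": "Pattern B", "manual_category": "Cat 4"},
    {"insights_pattern": "Pattern B", "manual_category": None},  # unlabeled
]

def agreement_rates(traces):
    """Per cluster: share of labeled traces that fall in the cluster's
    most common manual category."""
    by_pattern = defaultdict(list)
    for t in traces:
        if t["manual_category"] is not None:   # unlabeled traces are excluded
            by_pattern[t["insights_pattern"]].append(t["manual_category"])
    rates = {}
    for pattern, cats in by_pattern.items():
        top_cat, top_count = Counter(cats).most_common(1)[0]
        rates[pattern] = (top_cat, top_count / len(cats))
    return rates

print(agreement_rates(traces))
# {'Pattern A': ('Cat 1', 0.67), 'Pattern B': ('Cat 4', 1.0)}  (rounded)
```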
Agreement Analysis: Automated vs. Manual Labels
Per-Category Coverage Statistics
| Manual Category | Instances | Insights Coverage | Coverage % | Top Insights Pattern | Alignment |
|---|---|---|---|---|---|
| Category 1 | 80 | 70 | 87.50% | Pattern A (n=36, 51.4%) | Strong |
| Category 2 | 24 | 23 | 95.83% | Pattern B (n=11, 47.8%) | Strong |
| Category 4 | 8 | 8 | 100.00% | Pattern B (n=3, 37.5%) | Moderate |
| Category 5 | 7 | 7 | 100.00% | Pattern C (n=3, 42.9%) | Moderate |
| Category 7 | 3 | 3 | 100.00% | Pattern B (n=1, 33.3%) | Weak |
| Category 8 | 3 | 3 | 100.00% | Pattern A (n=1, 33.3%) | Weak |
| Category 3 | 5 | 4 | 80.00% | Pattern L (n=1, 25.0%) | Weak |
| Category 6 | 4 | 3 | 75.00% | Pattern B (n=2, 66.7%) | Moderate |
| Category 9-17 | 9 | 8 | 88.89% | Various | Mixed |
| Unlabeled | 8 | 6 | 75.00% | Various | N/A |
The coverage story here is remarkably consistent. Insights found nearly every instance of our largest categories (Category 1 at 87.5%, Category 2 at 95.8%). Even the smaller categories achieved perfect coverage, though the alignment quality varied. Five categories reached 100% coverage, but that doesn't mean Insights understood them perfectly—it just means it assigned those traces to some pattern. The more interesting question is whether it grouped similar failures together, which is where agreement rate comes in.
Coverage Tiers:
| Tier | Coverage Rate | Categories | Total Instances |
|---|---|---|---|
| Perfect | 100% | 5 | 21 |
| Strong | 80-99% | 4 | 113 |
| Weak | 0-79% | 2 | 2 |
| Weighted Average | 87.81% | - | - |
The weighted average of 87.81% means that if you randomly picked a trace from our dataset, there's an 88% chance Insights would include it in one of its automated categories. The strong coverage tier (80-99%) contained 113 instances across 4 categories—the majority of our failures. This suggests Insights is good at finding the high-volume patterns that matter most for triage.
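Here is a sketch of how this kind of coverage table can be computed. The record format is again hypothetical; each trace may carry several manual labels (the compound failures discussed later), and "covered" simply means Insights assigned the trace to any pattern at all:

```python
from collections import defaultdict

def coverage_by_category(traces):
    """Per manual category: fraction of its instances that Insights assigned
    to any pattern, plus the instance-weighted average across categories."""
    totals, covered = defaultdict(int), defaultdict(int)
    for t in traces:
        assigned = t["insights_pattern"] is not None
        for cat in t["manual_categories"]:
            totals[cat] += 1
            if assigned:
                covered[cat] += 1
    per_category = {cat: covered[cat] / totals[cat] for cat in totals}
    weighted_avg = sum(covered.values()) / sum(totals.values())
    return per_category, weighted_avg

traces = [
    {"insights_pattern": "Pattern A", "manual_categories": ["Cat 1"]},
    {"insights_pattern": "Pattern B", "manual_categories": ["Cat 1", "Cat 2"]},
    {"insights_pattern": None,        "manual_categories": ["Cat 2"]},  # uncategorized
]
per_cat, avg = coverage_by_category(traces)
print(per_cat)          # {'Cat 1': 1.0, 'Cat 2': 0.5}
print(f"{avg:.1%}")     # 75.0%
```

Note that a trace with several manual labels counts once per category here, so the weighted average is over category instances rather than unique traces; which convention you choose shifts the headline number slightly.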
Insights Agreement Rate Analysis
Agreement rate measures something different from coverage. It asks: when Insights creates a cluster, do the traces in that cluster actually belong to the same manual category? A high agreement rate (like Pattern A's 81.82%) means the cluster is coherent—the traces genuinely share a common failure mode. A low agreement rate (like Pattern B's 34.38%) means the cluster is heterogeneous, mixing multiple distinct failure types.
| Pattern | Agreement Rate | Diversity | Interpretation | Secondary Categories |
|---|---|---|---|---|
| Pattern A (Ambiguous input handling) | 81.82% (36/44 → Cat 1) | 7 categories | Agent should clarify when inputs are ambiguous but often acts without doing so | Cat 2 (n=3, 6.82%), Cat 8 (n=1, 2.27%) |
| Pattern F (Recurring context) | 71.43% (5/7 → Cat 1) | 3 categories (lowest) | Homogeneous cluster focused on recurring context issues | - |
| Pattern B (Temporal constraints) | 34.38% (11/32 → Cat 1) | 9 categories (highest) | Heterogeneous cluster covering multiple temporal-related issues | Cat 2 (n=11, 34.4%), Cat 4 (n=3, 9.4%) |
Pattern A is the success story. Of the 44 traces Insights assigned to this pattern, 36 (81.82%) belonged to our Manual Category 1 (acting on ambiguous inputs). This is a genuinely useful cluster—if you wanted to prioritize fixes, you could confidently focus on these 44 traces knowing that most represent the same underlying issue.
Pattern F shows similar coherence with 71.43% agreement, but with only 7 traces total, the sample size is small. Still, the low diversity (only 3 manual categories present) suggests it's finding a real pattern.
Pattern B tells a different story. With 9 different manual categories spread across 32 traces and only 34.38% agreement, this cluster is essentially a catch-all for "anything temporal." Insights detected that timing was involved but couldn't distinguish between urgent requests, vague scheduling, timezone issues, and conflicting appointments. These require human judgment to separate.
Cross-Tabulation: Insights × Manual Categories
The cross-tabulation matrix shows exactly where each Insights pattern overlaps with manual categories. Reading across a row shows you the diversity of a pattern. Reading down a column shows you how our manual categories got distributed across Insights patterns.
| Insights Pattern | Cat 1 | Cat 2 | Cat 3 | Cat 4 | Cat 5 | Cat 6 | Cat 7 | Cat 8 | Cat 9-17 | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Pattern A | 36 | 3 | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 44 |
| Pattern B | 11 | 11 | 1 | 3 | 1 | 2 | 1 | 0 | 2 | 32 |
| Pattern D | 9 | 2 | 0 | 2 | 1 | 0 | 0 | 1 | 0 | 15 |
| Pattern C | 2 | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 2 | 9 |
| Pattern G | 4 | 2 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 9 |
| Pattern F | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 7 |
| Pattern E | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 5 |
| Pattern H | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| Pattern I | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 2 |
| Pattern L | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 |
| Pattern J | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| Pattern K | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Chi-Square Test: χ²(80) = 152.0, p < 0.001
The chi-square test answers a specific question: could this cross-tabulation have happened by random chance? The answer is definitively no. With p < 0.001, there's less than 0.1% probability these patterns are random noise.
Here's what the test does. For each cell in the matrix above, it calculates what we'd expect if Insights patterns were completely random—just arbitrary groupings with no relationship to actual failure modes. For example, if Pattern A has 44 traces and Manual Category 1 has 70 traces out of 207 total, we'd expect about 14.9 traces in that cell by chance alone (44 × 70 ÷ 207). But we observed 36 traces. That's a big deviation.
The test measures this "surprise" for every cell: (Observed - Expected)² ÷ Expected. Sum it over the table (dropping the all-zero Pattern K row leaves 11 patterns × 9 categories = 99 cells) and you get χ² = 152.0. Compare this to the chi-square distribution with (11 - 1) × (9 - 1) = 80 degrees of freedom, and the probability of seeing this by chance is effectively zero.
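If you want to reproduce the test, `scipy.stats.chi2_contingency` does the whole calculation from the observed matrix. Below is a minimal sketch using the cross-tabulation above, with the all-zero Pattern K row dropped (a row of zeros produces zero expected counts and breaks the test). Depending on exactly which traces the original analysis pooled, the statistic may not come out at 152.0 to the decimal, but the mechanics are the same:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Patterns A, B, D, C, G, F, E, H, I, L, J (Pattern K dropped: all zeros).
# Columns: manual categories 1-8 plus the pooled 9-17 group.
observed = np.array([
    [36,  3, 0, 0, 0, 1, 1, 1, 2],   # A
    [11, 11, 1, 3, 1, 2, 1, 0, 2],   # B
    [ 9,  2, 0, 2, 1, 0, 0, 1, 0],   # D
    [ 2,  1, 1, 0, 3, 0, 0, 0, 2],   # C
    [ 4,  2, 0, 2, 0, 0, 0, 1, 0],   # G
    [ 5,  1, 0, 0, 0, 0, 0, 0, 1],   # F
    [ 1,  1, 1, 0, 1, 0, 0, 0, 1],   # E
    [ 1,  2, 0, 0, 0, 0, 0, 0, 0],   # H
    [ 0,  0, 1, 0, 1, 0, 0, 0, 0],   # I
    [ 1,  0, 0, 1, 0, 0, 0, 0, 0],   # L
    [ 0,  0, 0, 0, 0, 0, 1, 0, 0],   # J
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2({dof}) = {chi2:.1f}, p = {p:.3g}")  # dof = (11-1)*(9-1) = 80
```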
This doesn't mean Insights is perfect. It means the clustering is picking up real signal, not just random groupings. The patterns discovered have genuine statistical relationship to the patterns we manually identified. But statistical significance doesn't tell us if the patterns are useful—that's what agreement rate measures.
What Insights Captured Well
The automation excelled at detecting structural patterns—behavioral signatures visible in the sequence and type of tool calls alone.
Pattern A (Ambiguous input handling) is the clearest example. Insights detected cases where the agent called action tools without first calling clarification tools. This is observable from the tool call sequence: if you see perform_action without a preceding ask_clarification, that's a structural signal. The 81.82% agreement rate shows this pattern genuinely corresponds to our Manual Category 1 (acting on ambiguous inputs). The agent isn't reading our semantic labels—it's finding actual behavioral differences.
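This kind of structural check is easy to express directly. A minimal sketch, assuming traces are reduced to an ordered list of tool names (the `perform_action` and `ask_clarification` names come from the example above; your tool names will differ):

```python
def acted_without_clarifying(tool_calls: list[str]) -> bool:
    """True if the first action tool call happens before any clarification call."""
    for name in tool_calls:
        if name == "ask_clarification":
            return False   # clarified before (or without) acting
        if name == "perform_action":
            return True    # acted first
    return False           # never acted at all

print(acted_without_clarifying(["lookup_info", "perform_action"]))        # True
print(acted_without_clarifying(["ask_clarification", "perform_action"]))  # False
```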
Pattern D (Multi-party coordination) shows similar success with a 60% agreement rate. When the agent interacted with multiple distinct external entities in a single conversation (detectable by counting unique entity IDs in tool arguments), Insights flagged it as a coordination pattern. This makes sense: multi-party interactions are structurally different from single-party interactions.
Pattern A captured 36 out of 80 instances from Manual Category 1 (45% of the category). That's the largest single alignment in the dataset. Combined with Pattern F's additional 5 instances, Insights found 51.25% of Category 1 cases across just two patterns. For a completely unsupervised clustering algorithm, that's impressive.
The coverage statistics tell another success story. Category 1 achieved 87.5% coverage (70/80 instances). Category 2 hit 95.83% (23/24). Even smaller categories like Categories 4, 5, 7, and 8 reached 100% coverage, though with varying alignment quality. Insights didn't miss many traces: it found them and assigned them somewhere.
What Insights Missed
The challenge came with semantic nuance and edge cases—failures that look structurally similar but require domain knowledge to distinguish.
Pattern B (Temporal constraints) is the clearest example of this limitation. With 34.38% agreement and 9 different manual categories represented, this cluster is heterogeneous. Insights detected that timing was involved (probably from temporal keywords in inputs or time-related tool arguments), but it couldn't distinguish between:
- Urgent requests requiring immediate action (Manual Category 4)
- Vague timing language that needs clarification (Manual Category 1)
- Scheduling conflicts (Manual Category 2)
- Timezone handling issues (various edge cases)
These all have "temporal" signals, but the why of the failure differs. Pattern E (Priority signals) shows the same issue with 20% agreement across 5 categories. "Priority" and "urgency" appear in inputs, but distinguishing genuine emergencies from routine requests with urgent language requires understanding context.
Fine-grained semantic distinctions proved difficult. Categories 2, 4, and 6 all involve judgment calls about tone, appropriateness, and communication protocol. These are behavioral in principle but don't have clear structural signatures. An agent might use identical tool sequences but violate policy through word choice or escalation decisions. Insights can't detect that from tool call patterns alone.
Edge cases with only one or two instances sometimes received 0% coverage; with a single example, there's no pattern to detect. Manual Categories 9-17 (the rarest failure modes) showed mixed results: some were captured, others weren't.
Finally, 29.5% of traces (61/207) had multiple manual labels, meaning they exhibited compound failure modes. For example, a trace might simultaneously involve "acting without clarification" (Category 1) and "making assumptions" (Category 2). Insights assigned each trace to exactly one pattern, which means it had to choose. This is a fundamental limitation of hard clustering algorithms: they assume each data point belongs to exactly one cluster, which isn't always true for real failures.
Setup: The 5-Minute Configuration
The Insights Agent configuration took about 5 minutes using LangSmith's conversational interface. It asks three questions, and our responses seemed to matter for the quality of results.
We focused on describing the agent's decision-making role rather than technical implementation: "an agent that must gather information, make decisions, and take actions while following policies." We asked it to find "behavioral patterns" rather than "technical errors," which probably helped it focus on decision-making issues instead of infrastructure failures. We also explicitly stated what was NOT in the data ("NO success/failure status in outputs"), which seemed important for preventing it from looking for signals that didn't exist.
The entire process took about 5 minutes of writing. The system automatically inferred clustering parameters, similarity thresholds, and feature extraction logic from these natural language answers. We didn't tune anything manually.
| Phase | Manual Axial Coding | Insights Agent |
|---|---|---|
| Setup | 2 hours (codebook design) | 5 minutes (3 questions) |
| Execution | 18 hours (expert review) | ~30 minutes (automated) |
| Total | ~20 hours | ~35 minutes |
The 34× time savings is significant, though it comes with trade-offs in granularity and agreement rate. Whether that's worth it depends on your use case—if you're doing initial triage on hundreds of traces, probably yes. If you need precise root cause analysis for compliance, probably not.
It's worth experimenting with different framings to see what works. We suspect that including domain-specific business logic or explicitly naming categories you're looking for might change results, though we didn't test this systematically. The impact of different prompt styles on clustering quality would be an interesting follow-up study.
Key Insights
1. Understanding Your Data Matters
Contaminated Run (with status fields): Coverage of 42.03% overall and 58.82% for failures. Created 8 mostly technical clusters. Top category was "Action status errors" with 60.2% coverage and 100% agreement rate. The problem: it was literally reading the "status": "error" field, not detecting patterns.
Clean Run (sanitized): Coverage jumped to 87.92% overall and 89.71% for failures (a +109.2% relative improvement). Created 12 behavioral patterns. Top category was "Pattern A" with 23.2% coverage and 81.8% agreement rate. Success: genuine pattern detection from input→output relationships.
Statistical significance: McNemar's test, p < 0.001 (the difference is not due to chance).
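McNemar's test works on paired outcomes: for each of the 207 traces, was it categorized in the contaminated run, and was it categorized in the clean run? A minimal sketch with `statsmodels` is below; the off-diagonal counts are hypothetical placeholders chosen only to be consistent with the 42% and 88% marginals, since the per-trace pairing isn't reproduced here:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Paired 2x2 table over the same 207 traces.
# Rows: contaminated run (categorized / not); columns: clean run (categorized / not).
# Row and column totals match 87 (42%) and 182 (88%); the 7/102 split is hypothetical.
table = np.array([
    [80,   7],
    [102, 18],
])

result = mcnemar(table, exact=False, correction=True)
print(f"statistic = {result.statistic:.1f}, p = {result.pvalue:.3g}")
```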
The lesson here: understand what data you're uploading to LangSmith. Don't assume you can just integrate LangSmith into your system and it will automatically collect the right information. The Insights Agent will find whatever patterns exist in your data. If your tool call outputs include status fields or error messages, it'll cluster based on those signals instead of behavioral patterns. If your outputs are confusing or have too little information, that will affect the quality of insights you get.
Take time to look at what your traces actually contain. Make sure the data has the right level of detail for the patterns you want to discover. This isn't specific to LangSmith. Any clustering system works better when you understand what you're feeding it.
2. Coverage vs. Granularity Trade-off
| Dimension | Insights (Automated) | Manual (Expert) | Ratio |
|---|---|---|---|
| Categories | 12 | 17 | 0.71× |
| Coverage | 87.92% | 100% | 0.88× |
| Avg agreement rate | 51.15% | 100% | 0.51× |
| Max agreement rate | 81.82% | 100% | 0.82× |
| Time | 35 min | 20 hrs | 0.03× |
The trade-off is clear: Insights creates fewer, broader categories (12 vs. 17) with lower agreement rates (average 51% vs. 100%), but completes in 35 minutes instead of 20 hours. The automated system achieves 88% coverage, which means it finds most traces but groups them more coarsely than human experts would. For initial triage of hundreds of traces, this trade-off seems favorable. For precise root cause analysis or compliance work, you'd probably still need human review.
3. Pattern Detection Strengths
Insights Agent excels at detecting structural and observable patterns from action sequences:
| Pattern | Observable Signal | Behavioral Inference | Why Detection Works |
|---|---|---|---|
| Pattern A (Ambiguous handling) | Agent called action tools without first calling clarification tools | Should gather more information before acting | Tool call sequences directly reveal this pattern |
| Pattern D (Multi-party coordination) | Agent interacted with multiple external entities in sequence | Coordination failures across multiple parties | Number of unique entities in tool arguments is structurally detectable |
What makes these patterns detectable is that they're structurally observable. You don't need to understand the semantic meaning of the conversation to see that an action happened without a preceding clarification, or that multiple entities were involved. These signatures are visible in the tool call sequence and arguments alone. When behavior leaves structural traces, automated clustering works well.
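And here is a sketch of the second signal: count distinct entities referenced across tool call arguments. The `entity_id` argument name is a hypothetical stand-in for whatever identifier your tools actually pass:

```python
def looks_multi_party(tool_calls: list[dict], threshold: int = 2) -> bool:
    """True if the trace references at least `threshold` distinct external entities."""
    entity_ids = {
        call["args"]["entity_id"]
        for call in tool_calls
        if "entity_id" in call.get("args", {})
    }
    return len(entity_ids) >= threshold

trace = [
    {"tool": "send_message", "args": {"entity_id": "vendor_1", "text": "..."}},
    {"tool": "send_message", "args": {"entity_id": "vendor_2", "text": "..."}},
]
print(looks_multi_party(trace))  # True
```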
4. Semantic Understanding Limitations
The Insights Agent struggled with nuanced behavioral distinctions that require domain knowledge:
| Pattern | Diversity | Why Low Agreement | Human Insight Needed |
|---|---|---|---|
| Pattern B (Temporal constraints) | 9 categories (34.4% agreement) | "Temporal" is overloaded—could mean urgency, vague timing, timezone issues, or timing conflicts | Distinguishing between these requires understanding why the temporal handling was inappropriate |
| Pattern E (Priority signals) | 5 categories (20% agreement) | "Priority" signals look similar structurally (urgent language in inputs) but have different failure modes | Understanding when to escalate vs. handle independently |
The limitation here is semantic. Words like "urgent" or "ASAP" appear in inputs, and Insights clusters based on that signal. But knowing why urgency matters (genuine emergency vs. routine request with urgent language) requires domain knowledge. Similarly, temporal keywords create structural similarity even when the actual failures are semantically distinct. Human judgment is needed to separate these.
5. The Single-Assignment Limitation
Insights assigned each trace to exactly one pattern, but 29.5% of our traces (61/207) had multiple manual labels, meaning they exhibited compound failure modes. A trace might simultaneously involve "acting without clarification" (Category 1) and "making assumptions" (Category 2). Insights had to choose one.
This affected agreement rates. When a trace has multiple issues and Insights picks the secondary one as primary, it looks like a misclassification even though the pattern is present. It also means compound failure modes get split across clusters, making it harder to see that certain issues co-occur frequently.
This is a systematic limitation of hard clustering algorithms—each data point gets exactly one label. Soft clustering or multi-label classification would better capture these compound failures, though at the cost of interpretability. Whether that trade-off is worth it probably depends on how common compound failures are in your domain.
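One way to estimate how much the single-assignment constraint costs is to score agreement against any of a trace's manual labels rather than only the primary one, and compare the two numbers. A minimal sketch, with hypothetical records and a hypothetical pattern-to-dominant-category mapping of the kind produced by the agreement analysis above:

```python
def any_label_agreement(traces, dominant_category):
    """Share of labeled traces whose Insights pattern's dominant manual
    category appears among *any* of the trace's manual labels."""
    labeled = [t for t in traces if t["manual_categories"]]
    hits = sum(
        dominant_category.get(t["insights_pattern"]) in t["manual_categories"]
        for t in labeled
    )
    return hits / len(labeled) if labeled else 0.0

# Hypothetical mapping and records:
dominant_category = {"Pattern A": "Cat 1", "Pattern B": "Cat 1"}
traces = [
    {"insights_pattern": "Pattern B", "manual_categories": ["Cat 2", "Cat 1"]},
    {"insights_pattern": "Pattern A", "manual_categories": ["Cat 2"]},
]
print(any_label_agreement(traces, dominant_category))  # 0.5
```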
What We Learned
This doesn't replace manual analysis, but it could inform it. The Insights Agent covered 88% of our test cases and created reasonable clusters, but with broader categories (12 vs. 17) and lower precision than expert review. It's good at discovering unknown patterns at scale and identifying them for triage. You could use it to get initial coverage quickly, then apply manual analysis to the high-volume clusters that matter most.
It's a tool for clustering unknown unknowns. The value here is finding patterns you didn't know to look for. When you have hundreds or thousands of traces and don't yet know what's failing, Insights can surface the major themes in 35 minutes instead of 20 hours. That combination of breadth and speed is what makes it useful for initial exploration.
There are natural follow-up experiments. It would be worth testing whether seeding Insights with the manually created categories improves alignment, though that job might be better handled by individual LLM-as-judge evaluators than by the broader clustering agent. The current approach is best for discovering patterns, not validating against known categories.
A Note on Data Quality
Something happened during this experiment that's worth sharing. Our first run of the Insights Agent gave us a 100% agreement rate with 60.2% coverage. The clustering looked perfect. The categories were precise. We were ready to declare victory.
Then we looked closer at what it was actually clustering:
```json
{
  "tool_name": "perform_action",
  "status": "error",  // ← Wait.
  "explanation": "Failed to complete action due to X"
}
```
We'd accidentally left status fields in the trace outputs. The Insights Agent wasn't detecting behavioral patterns; it was reading the labels. It was essentially running `grep "status" | sort`. High confidence, zero insight.
After we cleaned the data and removed all success/failure hints, coverage doubled from 42% to 88%. The agreement rate dropped to 81.82%, but now it was finding real patterns.
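Here is a minimal sketch of the kind of sanitization pass we mean: strip outcome-revealing keys from tool outputs before export. The key names are examples of what to look for, not an exhaustive list:

```python
LEAKY_KEYS = {"status", "error", "success", "explanation"}

def sanitize_output(output: dict) -> dict:
    """Recursively drop keys that reveal success or failure from a tool output."""
    clean = {}
    for key, value in output.items():
        if key in LEAKY_KEYS:
            continue
        clean[key] = sanitize_output(value) if isinstance(value, dict) else value
    return clean

raw = {
    "tool_name": "perform_action",
    "status": "error",
    "explanation": "Failed to complete action due to X",
    "result": {"status": "error", "details": "..."},
}
print(sanitize_output(raw))
# {'tool_name': 'perform_action', 'result': {'details': '...'}}
```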
The lesson bears repeating: the Insights Agent will find whatever patterns exist in your data. If your tool call outputs include status fields or error messages, it will cluster on those signals instead of on behavior; if they're confusing or too sparse, that will limit the quality of the insights you get.
The system is easily steerable by the data you feed it. Make sure you know what's in there.