Announcing US$ 3.2M Seed round · OneVC · Maya · Norte Ventures
Strattum Evals

Measure the context.
Not the LLM.

Evals assesses the quality of the context Strattum delivers to the agent — completeness, accuracy, relevance. Data teams control what they can control; the rest is up to the LLM.

Input (same for every run)
Question: "What is Maria Silva's balance and her Q4 churn risk?"
Context: v17 · Suite: golden, 128 questions · Matrix: 4 prompts × 4 LLMs
LLMs: Claude · ChatGPT · Gemini · Grok
Prompts: P1 base · P2 CoT · P3 few-shot · P4 self-ask
Score matrix, run #1247: scores from 81% to 96%
Best run (same answer, best combo): 96%
faithfulness 0.98 · grounded 0.96 · recall 0.95 · precision 0.94
Prompt: P2 · CoT · LLM: Claude
TAKEAWAY
With the same context, the score climbs 12 pp when you swap the prompt, versus 1 pp when you swap the LLM.

LLM eval has become a commodity.
Context eval is where enterprise AI wins.

Completeness

Did retrieval bring back EVERYTHING that was relevant? A missing active contract, recent ticket, or NPS score: Strattum measures it.

Accuracy

Is the returned information correct and current? Strattum checks each chunk against ground-truth or rules.

Relevance

Was what came back relevant to the question? Measures retrieval noise — throwaway chunks are expensive for the LLM to process.
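
One way to make the completeness and relevance dimensions concrete is to treat them as recall and precision over retrieved chunk IDs. The sketch below assumes exactly that; it is not Strattum's actual scoring, and the function name `score_context` and the chunk IDs are hypothetical. Accuracy (checking each chunk against ground truth or rules) is left out.

```python
def score_context(retrieved: list[str], expected: set[str]) -> dict[str, float]:
    """Score one golden question by comparing retrieved chunk IDs to expected ones."""
    retrieved_set = set(retrieved)
    hits = retrieved_set & expected
    # Completeness: share of expected chunks that were actually retrieved (recall).
    completeness = len(hits) / len(expected) if expected else 1.0
    # Relevance: share of retrieved chunks that were expected (precision).
    relevance = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    return {"completeness": completeness, "relevance": relevance}


# Hypothetical chunk IDs: two expected chunks found, one missing, two noise chunks.
scores = score_context(
    retrieved=["contract:active", "ticket:recent", "newsletter:2019", "faq:generic"],
    expected={"contract:active", "ticket:recent", "nps:latest"},
)
print(scores)
```

Here the missing NPS chunk lowers completeness, while the two throwaway chunks lower relevance, so the two dimensions can regress independently.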

Diff between strategies

A/B test: ontology v2 vs v3, embedding model A vs B, top_k 5 vs 10. Data-driven decisions.
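
A strategy diff of this kind can be sketched as a per-metric comparison of mean scores over the same golden set. The example assumes each strategy yields one score per question per metric; `diff_strategies` and the sample numbers are illustrative, not real results or a Strattum API.

```python
from statistics import mean


def diff_strategies(
    baseline: dict[str, list[float]], candidate: dict[str, list[float]]
) -> dict[str, float]:
    """Return mean(candidate) - mean(baseline) per metric, in percentage points."""
    return {
        metric: round((mean(candidate[metric]) - mean(baseline[metric])) * 100, 1)
        for metric in baseline
    }


# Illustrative per-question scores for two retrieval configurations.
top_k_5 = {"completeness": [0.50, 0.75, 1.00], "relevance": [1.00, 0.75, 0.50]}
top_k_10 = {"completeness": [0.75, 1.00, 1.00], "relevance": [0.50, 0.50, 0.50]}

delta = diff_strategies(top_k_5, top_k_10)
print(delta)
```

In this made-up case a larger top_k raises completeness but lowers relevance, which is the trade-off a side-by-side diff is meant to expose.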

From question
to data-driven decision.

1

Define the golden set

A set of questions, each with an expected answer (or approval rules). Strattum handles the rest.

  • UI for curation
  • CSV import
  • Git-versioned
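
As a rough illustration of what a CSV-based golden set could look like, the sketch below parses rows into typed items and rejects incomplete ones. The column names `question` and `expected_answer` are assumptions for this example, not Strattum's documented import format, and the expected answer is a placeholder.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenItem:
    question: str
    expected_answer: str


def load_golden_set(csv_text: str) -> list[GoldenItem]:
    """Parse a golden set from CSV text, failing fast on incomplete rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    items = []
    for row in reader:
        question = row["question"].strip()
        expected = row["expected_answer"].strip()
        if not question or not expected:
            raise ValueError(f"incomplete golden item on line {reader.line_num}")
        items.append(GoldenItem(question, expected))
    return items


# Hypothetical file contents; the expected answer is a placeholder.
sample = '''question,expected_answer
"What is Maria Silva's balance and her Q4 churn risk?","<expected answer goes here>"
'''
golden = load_golden_set(sample)
print(len(golden), golden[0].question)
```

Because the file is plain text, it can be reviewed and versioned in Git like any other change.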
2

Eval runs automatically

Every ontology, Skill, or transform PR triggers Evals. Reports arrive before merge.

  • CI/CD integration
  • Threshold gates
  • Visual diff
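
A threshold gate in CI can be as simple as comparing each metric against a configured minimum and failing the job when any check fails. This is a minimal sketch under that assumption; the `gate` function, the metric names, and the threshold values are hypothetical rather than Strattum's configuration format.

```python
def gate(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return failure messages; an empty list means the gate passes."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        # A missing metric is treated as a failure rather than silently passing.
        if value is None or value < minimum:
            failures.append(f"{metric}: {value} < {minimum}")
    return failures


# Hypothetical scores for one PR and hypothetical minimums.
failures = gate(
    scores={"completeness": 0.94, "accuracy": 0.984, "relevance": 0.88},
    thresholds={"completeness": 0.90, "accuracy": 0.98, "relevance": 0.90},
)
print(failures)
```

In a pipeline, a non-empty list would typically map to a non-zero exit code so the merge is blocked.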
3

Reports + alerts

Dashboards per dimension. Regressions trigger alerts. PR approval is based on score.

  • Dashboards per dimension
  • Regression alerts
  • Approval gates

Evaluation infrastructure
for production context.

Three-dimension scoring

Completeness, accuracy, and relevance as separate metrics. Independent regression detection per dimension.

Ground-truth management

UI for curating golden sets. CSV import. Version controlled in Git. Teams can update golden sets without engineering support.

CI/CD integration

Every PR to ontology, Skills, or transforms runs Evals automatically. Blocking gates configurable by threshold.

Strategy comparison

Run the same golden set against two retrieval configurations. Clear diff before deploying changes to production.

Regression alerts

Automatic notification on score drop. Slack, PagerDuty, email. Configurable per metric and per golden set.
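
Regression detection of this kind can be sketched as a comparison between the previous and current run, flagging any metric that drops by more than a configured tolerance. The function name, the `max_drop` parameter, and the scores are assumptions for illustration; delivery to Slack, PagerDuty, or email is out of scope here.

```python
def find_regressions(
    previous: dict[str, float], current: dict[str, float], max_drop: float = 0.02
) -> dict[str, float]:
    """Return the score change for every metric that dropped by more than max_drop."""
    return {
        metric: round(current[metric] - previous[metric], 4)
        for metric in previous
        if metric in current and previous[metric] - current[metric] > max_drop
    }


# Hypothetical scores from two consecutive runs of the same golden set.
regressions = find_regressions(
    previous={"completeness": 0.962, "accuracy": 0.991},
    current={"completeness": 0.91, "accuracy": 0.99},
)
print(regressions)
```

Only completeness is flagged in this example; the small accuracy change stays within the tolerance, which is how a per-metric setting keeps alerts from firing on noise.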

Historical tracking

Metric history per version. See how each ontology change or new connector affected retrieval quality over time.

From subjective to measurable
enterprise context quality.

Ontology Change

Ontology v3 deployment without production regression

Before merging ontology v3, Evals runs the full golden set against both versions. The diff shows completeness up 12% and accuracy unchanged. The team merges with confidence.

"Ontology v3: completeness +12%, accuracy 98.4% (same as v2), relevance +3%. Approved for production." — automated report in the PR.
New Connector

Validate that new SAP connector improves retrieval

After ingesting SAP data, Evals reruns the financial services golden set. Credit questions that previously returned incomplete context now score markedly higher on completeness.

"SAP connector: completeness of credit-related questions went from 67% to 94%. 3 golden set items still failing — missing contract fields. Backlog created."
Compliance

Demonstrate context quality to the Risk Committee

The Risk Committee requests evidence that the agent is answering with complete and accurate data. Evals provides historical reports per dimension, per golden set, and per production version.

"Last 90 days: average completeness 96.2%, average accuracy 99.1%. 2 regressions detected and corrected before reaching production." — auditable report.

Evals closes the quality loop
across the platform.

Memory Graph

Evals measures completeness of entity context delivered to agents — before and after every ontology change.

Explore Memory Graph →

Knowledge

Evals measures retrieval quality per source, per document, and per golden question.

Explore Knowledge →

Observability

Evals scores feed Observability dashboards — quality metrics alongside latency and freshness.

Explore Observability →

Make context quality
an observable metric.

Schedule a 30-minute demo. We'll show Strattum running on data similar to yours, in an architecture your enterprise can adopt.