Strattum Evals — Context quality, not LLM quality

LLM evaluation has become a commodity.
Context eval is where enterprise AI wins.

Completeness

Did retrieval return everything that was relevant? If an active contract, a recent ticket, or an NPS score is missing from the context, Strattum measures the gap.
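One way to read completeness is as recall over the items a golden question expects. The sketch below is illustrative only: the function name and the item identifiers are assumptions, not Strattum's API.

```python
def completeness(expected_ids: set[str], retrieved_ids: set[str]) -> float:
    """Fraction of expected items that appear in the retrieved context."""
    if not expected_ids:
        return 1.0  # nothing was expected, so nothing can be missing
    return len(expected_ids & retrieved_ids) / len(expected_ids)


# Hypothetical identifiers for the three examples named above.
expected = {"contract:active", "ticket:recent", "nps:latest"}
retrieved = {"contract:active", "ticket:recent", "invoice:2023"}

score = completeness(expected, retrieved)
missing = sorted(expected - retrieved)  # what retrieval failed to bring back
```

Here two of the three expected items came back, and `missing` names the one that did not.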

Accuracy

Is the returned information correct and current? Strattum checks each chunk against ground-truth or rules.
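A chunk-level accuracy check could combine both ideas: compare against ground truth where a reference value exists, and apply rules (such as freshness) to every chunk. The chunk shape, the rule, and all names below are assumptions made for illustration.

```python
from datetime import date
from typing import Callable

# Hypothetical chunk shape: {"key": str, "value": object, "as_of": date}
Chunk = dict
Rule = Callable[[Chunk], bool]


def is_accurate(chunk: Chunk, ground_truth: dict, rules: list[Rule]) -> bool:
    """A chunk passes if it matches ground truth (when known) and every rule."""
    key = chunk["key"]
    if key in ground_truth and chunk["value"] != ground_truth[key]:
        return False
    return all(rule(chunk) for rule in rules)


def accuracy(chunks: list[Chunk], ground_truth: dict, rules: list[Rule]) -> float:
    """Share of returned chunks that pass the accuracy check."""
    if not chunks:
        return 1.0
    return sum(is_accurate(c, ground_truth, rules) for c in chunks) / len(chunks)


def is_current(chunk: Chunk) -> bool:
    # Example rule: data older than this cutoff counts as stale.
    return chunk["as_of"] >= date(2024, 1, 1)


ground_truth = {"contract_status": "active", "credit_limit": 75000}
chunks = [
    {"key": "contract_status", "value": "active", "as_of": date(2024, 5, 1)},
    {"key": "credit_limit", "value": 50000, "as_of": date(2024, 5, 1)},  # wrong value
    {"key": "nps", "value": 42, "as_of": date(2023, 3, 1)},  # stale
]
score = accuracy(chunks, ground_truth, [is_current])
```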

Relevance

Was what came back relevant to the question? Measures retrieval noise — throwaway chunks are expensive for the LLM to process.
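Relevance can be sketched as the share of returned chunks that a reviewer marked relevant, with the token count of the rest as a rough proxy for wasted LLM cost. The identifiers and token counts are made up for the example, and the functions are not part of any Strattum API.

```python
def relevance(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are relevant to the question."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


def wasted_tokens(token_counts: dict[str, int], relevant_ids: set[str]) -> int:
    """Tokens the LLM must process for chunks that add nothing."""
    return sum(n for chunk_id, n in token_counts.items() if chunk_id not in relevant_ids)


# Hypothetical retrieval result: chunk id -> token count.
token_counts = {"contract:active": 200, "ticket:recent": 150, "faq:1": 300, "blog:7": 120}
relevant = {"contract:active", "ticket:recent"}

score = relevance(list(token_counts), relevant)
noise = wasted_tokens(token_counts, relevant)
```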

Diff between strategies

A/B test: ontology v2 vs v3, embedding model A vs B, top_k 5 vs 10. Data-driven decisions.

From question
to data-driven decision.

Define the golden set

A set of questions, each with an expected answer (or approval rules). Strattum handles the rest.

UI for curation
CSV import
Git-versioned
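As a rough illustration of what a CSV golden set might look like, the sketch below parses questions and their expected items with the standard `csv` module. The column names (`question`, `expected_ids`) and the semicolon separator are assumptions; the actual import format is not specified here.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenItem:
    question: str
    expected_ids: frozenset[str]


def load_golden_set(text: str) -> list[GoldenItem]:
    """Parse CSV text with hypothetical columns: question, expected_ids."""
    items = []
    for row in csv.DictReader(io.StringIO(text)):
        ids = frozenset(p.strip() for p in row["expected_ids"].split(";") if p.strip())
        items.append(GoldenItem(row["question"].strip(), ids))
    return items


CSV_TEXT = """question,expected_ids
What is the customer's active contract?,contract:active
"Any open issues, and how satisfied are they?",ticket:recent;nps:latest
"""

golden = load_golden_set(CSV_TEXT)
```

Because the file is plain text, it diffs cleanly in Git, which fits the versioning point above.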

Eval runs automatically

Every ontology/Skill/transform PR triggers Evals. Reports before merge.

CI/CD integration
Threshold gates
Visual diff
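A threshold gate can be as simple as comparing each dimension's score with a configured minimum and failing the CI step if any falls short. This is a minimal sketch under assumed names and example thresholds, not the product's actual gate logic.

```python
def gate(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per failed threshold; an empty list means the gate passes."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None or value < minimum:
            failures.append(f"{metric}: {value} < {minimum}")
    return failures


# Hypothetical scores from an eval run and thresholds from repo config.
scores = {"completeness": 0.94, "accuracy": 0.984, "relevance": 0.71}
thresholds = {"completeness": 0.90, "accuracy": 0.98, "relevance": 0.75}

failures = gate(scores, thresholds)
exit_code = 1 if failures else 0  # a CI step would exit with this code
```

A missing metric counts as a failure, so a gate cannot pass silently when a dimension was never scored.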

Reports + alerts

Dashboards per dimension. Regression triggers alert. PR approval based on score.

Dashboards per dimension
Regression alerts
Approval gates

Evaluation infrastructure
for production context.

Three-dimension scoring

Completeness, accuracy, and relevance as separate metrics. Independent regression detection per dimension.

Ground-truth management

UI for curating golden sets. CSV import. Version-controlled in Git. Teams can update golden sets without engineering support.

CI/CD integration

Every PR to ontology, Skills, or transforms runs Evals automatically. Blocking gates configurable by threshold.

Strategy comparison

Run the same golden set against two retrieval configurations. Clear diff before deploying changes to production.
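The comparison can be pictured as running one golden set through two retrievers and reporting the score difference plus the questions whose result changed. The toy corpus, the `top_k` retrievers, and the function names below are all hypothetical stand-ins for real retrieval configurations.

```python
from typing import Callable

Retriever = Callable[[str], set[str]]
GoldenSet = list[tuple[str, set[str]]]  # (question, expected item ids)

# Toy ranked results per question, used only to make the example runnable.
RANKED = {
    "q1": ["contract:active", "invoice:2023", "ticket:recent"],
    "q2": ["nps:latest", "ticket:recent"],
}


def make_retriever(top_k: int) -> Retriever:
    return lambda question: set(RANKED[question][:top_k])


def completeness(expected: set[str], retrieved: set[str]) -> float:
    return len(expected & retrieved) / len(expected) if expected else 1.0


def compare(golden: GoldenSet, a: Retriever, b: Retriever) -> dict:
    """Mean completeness for each configuration and the questions that changed."""
    per_a = {q: completeness(exp, a(q)) for q, exp in golden}
    per_b = {q: completeness(exp, b(q)) for q, exp in golden}
    mean_a = sum(per_a.values()) / len(golden)
    mean_b = sum(per_b.values()) / len(golden)
    changed = [q for q, _ in golden if per_a[q] != per_b[q]]
    return {"a": mean_a, "b": mean_b, "delta": mean_b - mean_a, "changed": changed}


golden = [("q1", {"contract:active", "ticket:recent"}), ("q2", {"nps:latest"})]
report = compare(golden, make_retriever(top_k=2), make_retriever(top_k=3))
```

In this toy case the larger `top_k` recovers the ticket that the smaller one cut off, so only `q1` changes.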

Regression alerts

Automatic notification on score drop. Slack, PagerDuty, email. Configurable per metric and per golden set.
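Per-metric configuration might amount to an allowed drop for each dimension, with an alert raised only when the drop exceeds it. The sketch below shows the detection step alone; routing to Slack, PagerDuty, or email is left out, and all names and numbers are illustrative assumptions.

```python
def regressions(
    previous: dict[str, float],
    current: dict[str, float],
    max_drop: dict[str, float],
) -> list[str]:
    """Return the metrics whose score fell by more than the allowed drop."""
    alerts = []
    for metric, allowed in max_drop.items():
        drop = previous[metric] - current[metric]
        if drop > allowed:
            alerts.append(metric)
    return alerts


# Hypothetical scores for one golden set across two runs.
previous = {"completeness": 0.962, "accuracy": 0.991}
current = {"completeness": 0.910, "accuracy": 0.989}
max_drop = {"completeness": 0.02, "accuracy": 0.005}  # tolerance per metric

alerts = regressions(previous, current, max_drop)
```

Keeping a tolerance per metric avoids paging on small fluctuations while still catching a real drop in one dimension.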

Historical tracking

Metric history per version. See how each ontology change or new connector affected retrieval quality over time.
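Tracking over time reduces to storing one score per version and reading off the change each version introduced. The version labels and scores below are invented purely to show the shape of the calculation.

```python
def deltas(history: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """For each version after the first, the score change versus its predecessor."""
    return [
        (version, round(score - prev_score, 4))
        for (_, prev_score), (version, score) in zip(history, history[1:])
    ]


# Illustrative completeness history, oldest first (values are made up).
history = [("ontology-v2", 0.84), ("ontology-v3", 0.96), ("new-connector", 0.94)]
changes = deltas(history)
```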

From subjective to measurable
enterprise context quality.

Ontology Change

Ontology v3 deployment without production regression

Before merging ontology v3, Evals runs the full golden set against both versions. The diff shows completeness up 12% and accuracy unchanged. The team merges with confidence.

"Ontology v3: completeness +12%, accuracy 98.4% (same as v2), relevance +3%. Approved for production." — automated report in the PR.

New Connector

Validate that new SAP connector improves retrieval

After SAP data is ingested, Evals reruns the financial services golden set. Credit questions that previously returned incomplete context now score full completeness.

"SAP connector: completeness of credit-related questions went from 67% to 94%. 3 golden set items still failing — missing contract fields. Backlog created."

Compliance

Demonstrate context quality to the Risk Committee

Risk Committee requests evidence that the agent is answering with complete and accurate data. Evals provides historical reports per dimension, per golden set, and per production version.

"Last 90 days: average completeness 96.2%, average accuracy 99.1%. 2 regressions detected and corrected before reaching production." — auditable report.

Evals closes the quality loop
across the platform.

Memory Graph

Evals measures completeness of entity context delivered to agents — before and after every ontology change.

Explore Memory Graph →

Knowledge

Evals measures retrieval quality per source, per document, and per golden question.

Explore Knowledge →

Observability

Evals scores feed Observability dashboards — quality metrics alongside latency and freshness.

Explore Observability →

Measure the context.
Not the LLM.

Make context quality
an observable metric.
