Ontology v3 deployment without production regression
Before merging ontology v3, Evals runs the full golden set against both versions. The diff shows completeness up 12%, precision unchanged. Team merges with confidence.
Evals assesses the quality of the context Strattum delivers to the agent — completeness, accuracy, relevance. Data teams control what they can control; the rest is up to the LLM.
Did the retrieval return everything that was relevant? If an active contract, a recent ticket, or an NPS score is missing, Strattum measures the gap.
Is the returned information correct and current? Strattum checks each chunk against ground truth or rules.
Was what came back relevant to the question? Measures retrieval noise — throwaway chunks are expensive for the LLM to process.
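The three dimensions above can be illustrated with a minimal sketch. It assumes each golden question lists the chunk IDs that should come back and that each retrieved chunk has already been checked against ground truth or rules. The `Chunk` type, the `is_correct` flag, and `score_retrieval` are hypothetical names for illustration, not Strattum's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    # Hypothetical flag: True when the chunk passed a ground-truth or rule check.
    is_correct: bool


def score_retrieval(retrieved: list[Chunk], expected_ids: set[str]) -> dict[str, float]:
    """Score one retrieval result on three independent dimensions."""
    retrieved_ids = {c.chunk_id for c in retrieved}
    hits = retrieved_ids & expected_ids

    # Completeness: share of the expected chunks that were actually retrieved.
    completeness = len(hits) / len(expected_ids) if expected_ids else 1.0
    # Relevance: share of the retrieved chunks that were expected (the rest is noise).
    relevance = len(hits) / len(retrieved_ids) if retrieved_ids else 0.0
    # Accuracy: share of the retrieved chunks that passed the correctness check.
    accuracy = sum(c.is_correct for c in retrieved) / len(retrieved) if retrieved else 0.0

    return {"completeness": completeness, "accuracy": accuracy, "relevance": relevance}


# Illustrative data only: the NPS chunk is missing and two chunks are noise.
expected = {"contract-active", "ticket-recent", "nps-latest"}
retrieved = [
    Chunk("contract-active", True),
    Chunk("ticket-recent", False),  # returned, but stale or incorrect
    Chunk("newsletter-2019", True),
    Chunk("blog-post", True),
]
scores = score_retrieval(retrieved, expected)
print(scores)
```

Keeping the three scores separate means a change that adds noise without losing expected chunks shows up as a relevance drop while completeness stays flat.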
A/B test: ontology v2 vs v3, embedding model A vs B, top_k 5 vs 10. Data-driven decisions.
A golden set is a set of questions, each with an expected answer (or approval rules). Strattum handles the rest.
Every ontology/Skill/transform PR triggers Evals. Reports before merge.
Dashboards per dimension. Regression triggers alert. PR approval based on score.
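As a rough sketch of the workflow above, a golden set can be modeled as a list of questions with either an expected answer or approval rules, and a PR check can compare scores against minimum thresholds. Every name here (`GoldenQuestion`, `gate_failures`), the rule format, and the numbers are assumptions for illustration; the source does not describe Strattum's actual schema or configuration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GoldenQuestion:
    """One golden-set entry: a question plus an expected answer or approval rules."""
    question: str
    expected_answer: Optional[str] = None
    # Hypothetical rule format: plain-text conditions the returned context must meet.
    approval_rules: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.expected_answer is None and not self.approval_rules:
            raise ValueError("each question needs an expected answer or approval rules")


def gate_failures(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that falls below its configured minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric, 0.0)  # a missing metric counts as failing
        if value < minimum:
            failures.append(f"{metric}: {value:.2f} is below the {minimum:.2f} threshold")
    return failures


golden_set = [
    GoldenQuestion(
        "Which contracts are active for this customer?",
        approval_rules=["context includes the active contract"],
    ),
    GoldenQuestion("What was the latest NPS score?", expected_answer="illustrative answer"),
]

# Illustrative scores and thresholds, not real results.
scores = {"completeness": 0.82, "accuracy": 0.95, "relevance": 0.60}
thresholds = {"completeness": 0.80, "accuracy": 0.90, "relevance": 0.70}

failures = gate_failures(scores, thresholds)
for message in failures:
    print("BLOCKED:", message)
# A CI job would typically exit non-zero when `failures` is non-empty.
```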
Completeness, accuracy, and relevance as separate metrics. Independent regression detection per dimension.
UI for curating golden sets. CSV import. Version controlled in Git. Team can update without engineering.
Every PR to ontology, Skills, or transforms runs Evals automatically. Blocking gates configurable by threshold.
Run the same golden set against two retrieval configurations. Clear diff before deploying changes to production.
Automatic notification on score drop. Slack, PagerDuty, email. Configurable per metric and per golden set.
Metric history per version. See how each ontology change or new connector affected retrieval quality over time.
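The A/B comparison and per-dimension regression detection described above can be sketched as a simple diff between two score sets from the same golden set. The function name `diff_runs`, the `tolerance` parameter, and the scores are hypothetical and chosen only to illustrate the idea.

```python
def diff_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, dict[str, object]]:
    """Compare two runs metric by metric and flag drops larger than `tolerance`."""
    report = {}
    for metric in sorted(baseline.keys() & candidate.keys()):
        delta = round(candidate[metric] - baseline[metric], 4)
        report[metric] = {
            "baseline": baseline[metric],
            "candidate": candidate[metric],
            "delta": delta,
            # Each dimension is judged on its own, so one regression
            # cannot be hidden by an improvement elsewhere.
            "regressed": delta < -tolerance,
        }
    return report


# Illustrative scores for two retrieval configurations (for example, ontology v2 vs v3).
v2 = {"completeness": 0.70, "accuracy": 0.91, "relevance": 0.80}
v3 = {"completeness": 0.82, "accuracy": 0.91, "relevance": 0.74}

report = diff_runs(v2, v3)
for metric, row in report.items():
    status = "REGRESSION" if row["regressed"] else "ok"
    print(f"{metric:<13} {row['baseline']:.2f} -> {row['candidate']:.2f} ({row['delta']:+.2f}) {status}")
```

In this made-up example completeness improves while relevance regresses, which is the kind of trade-off an aggregate score would hide. The `regressed` flags could feed whichever notification channel is configured for that metric.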
After ingesting SAP data, Evals reruns the financial services golden set. Credit questions that previously returned incomplete context now score full completeness.
Risk Committee requests evidence that the agent is answering with complete and accurate data. Evals provides historical reports per dimension, per golden set, and per production version.
Evals measures completeness of entity context delivered to agents — before and after every ontology change.
Explore Memory Graph →
Evals measures retrieval quality per source, per document, and per golden question.
Explore Knowledge →
Evals scores feed Observability dashboards: quality metrics alongside latency and freshness.
Explore Observability →
Schedule a 30-minute demo. We show Strattum running with data similar to yours, in the architecture your enterprise can receive.