# Knowledge Policy × OTEL Observability Integration Plan
**Status:** Draft · **Owner:** observability + knowledge-policy working group · **Related:** docs/architecture/adr/0001-observability.md, docs/plans/knowledge-layer-persistence-plan.md
## Motivation

The knowledge-policy subsystem (decay, promotion, visibility suppression, on-access mutations) is fully instrumented internally — every evaluation produces a `ScoringResolution` and every flush drains an accumulator with per-entity access counts — but none of that state is exported. Operators have no way to answer questions like:
- How much of the working set is currently suppressed, and why (threshold vs. score-floor vs. on-access)?
- Are access-flush batches healthy, or is the buffer filling faster than the timer can drain it?
- Is a schema-change reconcile actually touching the expected number of entities?
- What is the decay-score distribution for each entity kind, and does it look bimodal (healthy) or flat (misconfigured half-life)?
The OTEL layer is well-established: `pkg/observability.Provider` plugs in a `TracerProvider` and a `MeterProvider` backed by a Prometheus registry, and already owns closed-enum subsystem catalogs (cypher, storage, bolt, auth, etc.). The BSP self-metrics pattern shows how to attach observability to a pipeline without inverting ownership. The two surfaces have no cross-references today; there are no architectural blockers, just absent wiring.
## Subsystem Identity

- Register a new closed-enum subsystem `knowledge_policy` in `pkg/observability/metrics.go` by appending the string to the `allowedSubsystems` slice.
- New file `pkg/observability/catalog_knowledge_policy.go` following the shape of `catalog_cypher.go`: a `KnowledgePolicyMetrics` struct holding pre-bound instruments.
- Constructor `NewKnowledgePolicyMetrics(reg *Registry, tenantLabelsEnabled bool) *KnowledgePolicyMetrics`.
- Pre-bind hot-path fires (scored counter, decay-score histogram) so call sites do a straight `m.ObserveScore(...)` with zero per-call label allocation.
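A minimal sketch of the pre-binding idea, using stdlib stand-ins rather than the real OTel instruments (the `counter` type and map layout here are illustrative; the real catalog would pre-bind Int64Counter attribute sets in the constructor). The point it demonstrates: all label resolution happens once at construction, so the hot path is a lookup plus `Add`.

```go
package main

import "fmt"

// counter stands in for a pre-bound OTel Int64Counter whose label set is
// already attached (hypothetical; the real type lives in pkg/observability).
type counter struct{ n int64 }

func (c *counter) Add(delta int64) { c.n += delta }

// KnowledgePolicyMetrics pre-binds one counter per (entity_kind, result)
// combination at construction time, so the hot path does a plain lookup
// plus Add — no per-call label allocation.
type KnowledgePolicyMetrics struct {
	scored map[[2]string]*counter
}

func NewKnowledgePolicyMetrics() *KnowledgePolicyMetrics {
	m := &KnowledgePolicyMetrics{scored: map[[2]string]*counter{}}
	for _, kind := range []string{"node", "edge", "property"} {
		for _, result := range []string{"visible", "suppressed", "no_decay"} {
			m.scored[[2]string{kind, result}] = &counter{}
		}
	}
	return m
}

// ObserveScore is the hot-path fire: one map lookup on a closed enum.
func (m *KnowledgePolicyMetrics) ObserveScore(kind, result string) {
	if c, ok := m.scored[[2]string{kind, result}]; ok {
		c.Add(1)
	}
}

func main() {
	m := NewKnowledgePolicyMetrics()
	m.ObserveScore("node", "suppressed")
	m.ObserveScore("node", "suppressed")
	m.ObserveScore("edge", "visible")
	fmt.Println(m.scored[[2]string{"node", "suppressed"}].n) // 2
}
```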
## Instrument Catalog

| Name | OTel type | Labels | Fires when | Interpretation |
|---|---|---|---|---|
| `nornicdb_knowledge_policy_scored_total` | Int64Counter | `entity_kind{node,edge,property}`, `result{visible,suppressed,no_decay}` | End of `Scorer.score()` before return, plus early-return paths | Total scoring evaluations; suppressed / visible ratio = working-set attrition |
| `nornicdb_knowledge_policy_decay_score` | Float64Histogram (buckets 0.0, 0.1, …, 1.0), sampled 1/32 | `entity_kind` | Same site as `scored_total` | Post-decay score distribution; bimodal = healthy, flat = misconfigured half-life |
| `nornicdb_knowledge_policy_suppressions_total` | Int64Counter | `entity_kind`, `reason{below_threshold,score_floor,on_access,explicit_flag,rule_cap}` | Scorer suppression path, read-path filter flag path, on-access suppression | Why suppressions happen — critical for tuning |
| `nornicdb_knowledge_policy_access_flush_batch_rows` | Float64Histogram (RowCountBuckets) | — | `AccessFlusher.flush()` after `DrainAll()` | Batch pressure; p99 ≈ `maxBufferSize` indicates backpressure |
| `nornicdb_knowledge_policy_access_flush_duration_seconds` | Float64Histogram (LatencyBucketsSeconds) | — | Wraps flush body; recorded on return | Flush cost; correlates with storage write p99 |
| `nornicdb_knowledge_policy_access_flush_buffer_fullness` | GaugeFunc | — | Passive scrape reads `len(accumulator.buffer) / maxBufferSize` | Tripwire for flush-interval tuning |
| `nornicdb_knowledge_policy_on_access_mutations_total` | Int64Counter | `result{applied,skipped_no_policy,error}` | `applyOnAccessMutations` + `evaluatePropertySuppression` | On-access policy workload |
| `nornicdb_knowledge_policy_deindex_enqueued_total` | Int64Counter | `entity_kind{node,edge}` | `EnqueueDeindexIfSuppressed` caller after `becameSuppressed=true` | Downstream search-index cost generator |
| `nornicdb_knowledge_policy_read_filter_dropped_total` | Int64Counter | `entity_kind{node,edge}` | `filterNodeByDecay` / `filterEdgeByDecay` return-true path | Read-path suppression rate |
| `nornicdb_knowledge_policy_reconcile_total` | Int64Counter | `trigger{schema_change,startup,manual}` | `ReconcileDecaySuppressionWithChanges` callers | Schema-driven churn |
All labels are closed enums and bounded. No user-defined label or property-key axis; see "Cardinality discipline" below.
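The two non-obvious cells above — the decay-score bucket boundaries and the 1/32 sampling — can be sketched with stdlib code. The `sampler` shown here is an illustrative implementation of "sampled 1/32" (a cheap atomic counter masked against 31); the real call site could equally use a per-shard counter or a fast PRNG.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// decayScoreBuckets materializes the 0.0, 0.1, …, 1.0 boundaries from the
// instrument catalog (eleven upper bounds).
func decayScoreBuckets() []float64 {
	b := make([]float64, 11)
	for i := range b {
		b[i] = float64(i) / 10
	}
	return b
}

// sampler admits exactly 1 in 32 observations using a cheap atomic counter.
type sampler struct{ seq atomic.Uint64 }

func (s *sampler) admit() bool { return s.seq.Add(1)&31 == 0 }

func main() {
	fmt.Println(decayScoreBuckets())

	var s sampler
	admitted := 0
	for i := 0; i < 32_000; i++ {
		if s.admit() {
			admitted++
		}
	}
	fmt.Println(admitted) // 1000: every 32nd observation is recorded
}
```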
## Where the Scorer Gets Its Meter

Split approach along the control-flow boundary:

- **Constructor injection** for `AccessFlusher` and `Scorer` — a functional option `WithMetrics(m *observability.KnowledgePolicyMetrics)` on both. Wire through `pkg/nornicdb/db.go` the same way `cypherMetrics` flows into `exec.SetCypherMetrics` in `main.go`.
- **Package-level `atomic.Pointer`** for the Badger read-path filter — mirror `bsp_self_metrics.go`'s `bspMetricsRefs` idiom. New file `pkg/observability/knowledgepolicy_metrics_ref.go` holds `kpMetricsRefs atomic.Pointer[KnowledgePolicyMetrics]` with `SetKnowledgePolicyMetrics`/`GetKnowledgePolicyMetrics` accessors. `pkg/storage/badger_decay_filter.go` calls the getter on each filter invocation and no-ops on nil.
Trade-off: the atomic-pointer route is harder to unit-test (tests must set and restore the global), and a forgotten `SetKnowledgePolicyMetrics()` produces silent no-ops. Mitigated by a `provider_test.go` assertion that the pointer is non-nil after `Provider.Start()` when `Memory.DecayEnabled` is true.
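A stdlib sketch of the atomic-pointer idiom, with a stand-in `KnowledgePolicyMetrics` (the real struct holds OTel instruments). It shows the property the trade-off paragraph depends on: reads before wiring are lock-free nil loads that no-op rather than panic.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// KnowledgePolicyMetrics is a stand-in for the real catalog struct.
type KnowledgePolicyMetrics struct{ readFilterDropped atomic.Int64 }

// kpMetricsRef mirrors the bspMetricsRefs idiom: a process-wide pointer set
// once at startup and read lock-free on every filter invocation.
var kpMetricsRef atomic.Pointer[KnowledgePolicyMetrics]

func SetKnowledgePolicyMetrics(m *KnowledgePolicyMetrics) { kpMetricsRef.Store(m) }
func GetKnowledgePolicyMetrics() *KnowledgePolicyMetrics  { return kpMetricsRef.Load() }

// recordReadFilterDrop is what the Badger read-path filter would call:
// it must no-op when wiring never happened (nil pointer).
func recordReadFilterDrop() {
	if m := GetKnowledgePolicyMetrics(); m != nil {
		m.readFilterDropped.Add(1)
	}
}

func main() {
	recordReadFilterDrop() // before wiring: silent no-op, no panic

	m := &KnowledgePolicyMetrics{}
	SetKnowledgePolicyMetrics(m)
	recordReadFilterDrop()
	fmt.Println(m.readFilterDropped.Load()) // 1
}
```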
## Tracing

Spans only at coarse-grained boundaries — scoring is too hot.

- `nornicdb.knowledge_policy.flush` — wraps `AccessFlusher.Flush()`. Attributes: `batch.row_count`, `batch.suppressed_count`, `batch.deindex_enqueued_count`, `batch.property_suppression_changes`. Downstream storage spans (`nornicdb.storage.Put`) nest beneath, producing one coherent flush trace.
- `nornicdb.knowledge_policy.reconcile` — wraps `ReconcileDecaySuppressionWithChanges`. Attributes: `trigger`, `changes_count`, `tokens_invalidated`.
- No per-scoring-call span; metric-only. An optional `Observability.DebugSpans` flag enables per-evaluation spans for investigation; off by default.
## Cardinality Discipline

Tenant-label scoping respects `cfg.Observability.Metrics.TenantLabelsEnabled` exactly like the existing catalogs — pass `tenantLabelsEnabled` into `NewKnowledgePolicyMetrics` and omit the `database` label name when false.

Explicit exclusions:

- Graph-schema labels (e.g., `Product`, `User`) — unbounded via user DDL. NOT a metric label; route to exemplar trace attributes instead.
- Property keys — same reasoning.
If per-label audit is ever needed, use structured logs or trace attributes — not Prometheus labels.
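The discipline above is checkable by arithmetic: because every label axis is a closed enum, the worst-case series count per instrument is just the product of the axis sizes. A sketch of that check (the helper name and shape are illustrative, not the real `AssertCardinalityCeiling` signature):

```go
package main

import "fmt"

// seriesCeiling computes the worst-case series count for an instrument as
// the product of its label-axis sizes — only meaningful because every axis
// in this catalog is a closed, bounded enum.
func seriesCeiling(axes ...[]string) int {
	n := 1
	for _, axis := range axes {
		n *= len(axis)
	}
	return n
}

func main() {
	entityKind := []string{"node", "edge", "property"}
	reason := []string{"below_threshold", "score_floor", "on_access", "explicit_flag", "rule_cap"}

	// suppressions_total: 3 entity_kind x 5 reason = 15 series — the cap
	// a catalog test would assert. An open axis (graph labels, property
	// keys) would make this product unbounded, which is why both are
	// excluded from metric labels.
	fmt.Println(seriesCeiling(entityKind, reason)) // 15
}
```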
## Wiring Sequence

Edits in this order:

1. `pkg/observability/metrics.go` — append `"knowledge_policy"` to `allowedSubsystems`.
2. `pkg/observability/catalog_knowledge_policy.go` — new file: constructor + struct + methods.
3. `pkg/observability/knowledgepolicy_metrics_ref.go` — new file: `atomic.Pointer` accessors.
4. `cmd/nornicdb/main.go` — construct `kpMetrics` alongside `NewCypherMetrics`/`NewStorageMetrics`; call `SetKnowledgePolicyMetrics`; hand to `db.AttachKnowledgePolicyMetrics(kpMetrics)`.
5. `pkg/nornicdb/db.go` — attach the handle to `accessFlusher` and the `BadgerEngine` scorer factory inside the `config.Memory.DecayEnabled` block.
6. `pkg/knowledgepolicy/scorer.go` — record counter + histogram just before return in `score()`; early-return paths get `result="no_decay"`.
7. `pkg/knowledgepolicy/access_flusher.go` — span, duration, batch-row histogram, on-access counter, deindex counter fires.
8. `pkg/knowledgepolicy/on_access_runtime.go` — increment `on_access_mutations_total{result=…}`.
9. `pkg/storage/badger_decay_filter.go` — increment `read_filter_dropped_total{entity_kind}` via the global pointer on the suppress-true path.
10. `pkg/nornicdb/db.go` reconcile path — counter + span.
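The `WithMetrics` functional option that the sequence wires through can be sketched as follows. Only the option/constructor shape is the plan's; the `AccessFlusher` body here is a hypothetical stand-in, and a nil metrics field deliberately means "metrics disabled".

```go
package main

import "fmt"

// KnowledgePolicyMetrics stands in for the real catalog type.
type KnowledgePolicyMetrics struct{ flushes int }

// Option is the functional-option shape proposed for AccessFlusher and
// Scorer; WithMetrics is the only option this sketch defines.
type Option func(*AccessFlusher)

func WithMetrics(m *KnowledgePolicyMetrics) Option {
	return func(f *AccessFlusher) { f.metrics = m }
}

// AccessFlusher is a minimal stand-in; nil metrics means disabled.
type AccessFlusher struct {
	metrics *KnowledgePolicyMetrics
}

func NewAccessFlusher(opts ...Option) *AccessFlusher {
	f := &AccessFlusher{}
	for _, opt := range opts {
		opt(f)
	}
	return f
}

func (f *AccessFlusher) Flush() {
	// ... drain accumulator, write batch ...
	if f.metrics != nil {
		f.metrics.flushes++
	}
}

func main() {
	m := &KnowledgePolicyMetrics{}
	f := NewAccessFlusher(WithMetrics(m))
	f.Flush()
	fmt.Println(m.flushes) // 1

	// Without the option, Flush stays a metrics no-op.
	NewAccessFlusher().Flush()
}
```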
## Testing Plan

- `pkg/knowledgepolicy/metrics_test.go` (new) — table-driven test: known `CompiledBinding` + access-meta inputs fed through `Scorer.score`; assert counter deltas via Prometheus testutil and histogram buckets via `reg.Gather()`. Mirror `pkg/cypher/executor_spans_test.go`'s `tracetest.NewInMemoryExporter` for span assertions.
- `pkg/knowledgepolicy/access_flusher_metrics_test.go` (new) — fake `AccessMetaStore`; assert `access_flush_batch_rows` records exactly one sample per `Flush()` call, including the zero-batch short-circuit.
- `pkg/observability/catalog_knowledge_policy_test.go` (new) — constructor smoke test against a fresh `*Registry`; `AssertCardinalityCeiling` for suppressions (3 entity_kind × 5 reason = 15 series cap).
- Extend the existing knowledge-policy e2e in `pkg/nornicdb/` — seed a node, tick past the half-life, force a flush, scrape via `promtest.CollectAndCount`, and assert the new series are present.
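The one-sample-per-`Flush()` invariant from the second bullet can be shown with stdlib stand-ins (everything here — the `histogram` and `flusher` types — is hypothetical scaffolding; one reading of "including the zero-batch short-circuit" is that an empty buffer still records a sample of 0, which is what this sketch implements):

```go
package main

import "fmt"

// histogram is a stand-in that just remembers its samples.
type histogram struct{ samples []float64 }

func (h *histogram) Observe(v float64) { h.samples = append(h.samples, v) }

// flusher drains a buffer and records exactly one batch-rows sample per
// Flush call — a zero for the empty-buffer case — so the sample count
// always equals the flush count.
type flusher struct {
	buffer    []string
	batchRows *histogram
}

func (f *flusher) Flush() {
	rows := len(f.buffer)
	f.buffer = f.buffer[:0]
	f.batchRows.Observe(float64(rows))
}

func main() {
	f := &flusher{batchRows: &histogram{}}
	f.buffer = []string{"n1", "n2", "n3"}
	f.Flush() // records 3
	f.Flush() // empty buffer still records a sample (0)
	fmt.Println(len(f.batchRows.samples)) // 2
}
```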
## Documentation

- `docs/observability/knowledge-policy-metrics.md` (new) — instrument reference: name, type, labels, fires-when, interpretation, recommended alert thresholds.
- `docs/architecture/adr/0001-observability.md` — append `knowledge_policy` to the subsystem list; cross-link the new metrics doc.
- `docs/plans/knowledge-layer-persistence-plan.md` — new "Observability" section linking the metrics doc and calling out runbook thresholds (e.g., "`buffer_fullness` > 0.9 sustained → raise `AccessFlushBufferSize` or lower `DecayInterval`").
## Rollout Considerations

- **Feature flag.** Gate subsystem registration on `cfg.Observability.Metrics.KnowledgePolicyEnabled`, defaulting to `true` when `Memory.DecayEnabled` is true and `false` otherwise. Runtime toggling is NOT supported — metric registration is one-shot.
- **Default dashboard.** Five series: `scored_total` rate, `suppressions_total` rate by reason, `decay_score` p50/p95, `access_flush_duration_seconds` p99, `deindex_enqueued_total` rate.
- **Debug dashboard.** `buffer_fullness` gauge, `reconcile_total`, per-reason suppression breakdown.
- **Experimental.** `read_filter_dropped_total` is flagged experimental for the first two releases — it is the metric most likely to need a label-axis revision once real-world cardinality is observed.
## Risks / Open Questions

- **Scoring hot-path cost.** `Scorer.score()` runs per node returned from any Cypher match when decay is enabled. A pre-bound `counter.Add(1)` is ~15 ns; a `histogram.Observe` is ~50 ns. At 1M-row scans that's 65 ms of metric overhead. Mitigation: sample the histogram 1/32 and document it.
- **Tenant-label propagation from `AccessFlusher`.** The flusher is shared across namespaces; per-tenant sub-aggregation doesn't exist. Verdict: flush-level metrics are NOT tenant-scoped; only scoring metrics are. Documented asymmetry.
- **Init-order gap.** `be.SetDecayEnabled(true)` runs before `main.go` constructs `kpMetrics`, so early reads may miss metrics. Recommendation: move metrics construction before `db.Open` so the ref is set first.
- **`result="no_decay"` dominance.** The early-return path for entities without any binding may dominate `scored_total` and obscure the suppressed/visible ratio. Consider a separate `unscored_total` counter if the noise is bad.
- **`reason` enum closure.** On-access suppression has sub-reasons (cap hit, floor hit, rule-based). Pre-register all five anticipated reasons now (`below_threshold`, `score_floor`, `on_access`, `explicit_flag`, `rule_cap`) to avoid a breaking enum extension.
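The hot-path arithmetic above, worked out: 1/32 sampling cuts the 1M-row overhead from 65 ms to roughly 16.6 ms, because the histogram term shrinks by 32× while the counter term stays fixed. (Per-call costs are the plan's estimates, not measurements.)

```go
package main

import "fmt"

// overheadMs estimates metric overhead for a scan of `rows` evaluations:
// every evaluation pays the counter cost (~15 ns), and one in every
// `sampleEvery` evaluations pays the histogram cost (~50 ns).
func overheadMs(rows, sampleEvery int) float64 {
	const counterNs, histNs = 15.0, 50.0
	totalNs := float64(rows)*counterNs + float64(rows/sampleEvery)*histNs
	return totalNs / 1e6 // ns -> ms
}

func main() {
	fmt.Printf("unsampled: %.2f ms\n", overheadMs(1_000_000, 1))  // 65.00 ms
	fmt.Printf("1/32:      %.2f ms\n", overheadMs(1_000_000, 32)) // 16.56 ms
}
```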
## Critical Files

- `pkg/observability/metrics.go`
- `pkg/observability/catalog_knowledge_policy.go` (new)
- `pkg/observability/knowledgepolicy_metrics_ref.go` (new)
- `pkg/knowledgepolicy/scorer.go`
- `pkg/knowledgepolicy/access_flusher.go`
- `pkg/knowledgepolicy/on_access_runtime.go`
- `pkg/storage/badger_decay_filter.go`
- `pkg/nornicdb/db.go`
- `cmd/nornicdb/main.go`