# ADR-0001: Best-in-Class Observability for NornicDB
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-04-29 |
| Sign-off date | 2026-04-30 |
| Authors | Allen Sanabria |
| Supersedes | (none — establishes the ADR convention for this repo) |
| Related | docs/operations/monitoring.md, docs/architecture/replication.md, pkg/audit/audit.go |
| Target branch | otel |
**About this ADR.** This document is the first formal Architecture Decision Record in NornicDB. ADRs in this repo live under `docs/architecture/adr/NNNN-<slug>.md`, are numbered monotonically, and use the Context → Decision → Consequences form (MADR-lite). The ADR is the contract; implementation lives behind it and may evolve, but the named guarantees here may not be silently broken.
## 1. Context
NornicDB is a Neo4j-compatible, Cypher/Bolt graph database with a hybrid vector + BM25 search stack, built-in embedding generation, MVCC lifecycle management, multi-database tenancy, and three replication modes (ha_standby, raft, multi_region). It is delivered as a single Go binary, shipped as a container, and is increasingly run on Kubernetes by customers.
A complete audit of the current observability surface was performed against the otel branch:
| Surface | What exists today | Gap |
|---|---|---|
| HTTP /health | Returns {"status":"healthy"} unconditionally (pkg/server/server_public.go:44). | Liveness only. No readiness signal that distinguishes "process up" from "search index warm". |
| HTTP /status | Auth-gated; rich JSON with uptime, request counts, node/edge counts, embedding queue, search-warming. | Hand-built JSON, not a machine contract; no schema. |
| HTTP /metrics | Auth-gated; hand-rolled Prometheus text with ~10 gauges/counters (pkg/server/server_public.go:198-268). | No histograms, no labels (no database, method, op, result), no exemplars; PermRead auth breaks Prometheus Operator scraping. |
| HTTP middleware | metricsMiddleware increments in-flight and total request counters (pkg/server/server_middleware.go:344). | No latency histogram, no per-route/per-status labels. |
| Logging | log.Printf/fmt.Println (~165 call sites in pkg/server, pkg/cypher, pkg/storage). | LoggingConfig{Level,Format,Output} exists in pkg/config/config.go:585 but Format=json is unhonored — there is no slog, zerolog, or zap anywhere. |
| Bolt protocol | pkg/bolt/server.go has 16 log.Printf sites; no metrics; no spans. | Bolt is fully opaque to operators today. |
| Tracing | go.opentelemetry.io/* packages are present in go.sum only as // indirect. | No SDK initialization, no exporter, no spans, no propagation across replication. |
| Replication | Internal counters in pkg/replication/replicator.go (term, commit index) but not exposed. | No leader/lag/quorum signal at /metrics. |
| Audit | pkg/audit/audit.go writes immutable JSON for GDPR/HIPAA/SOC2. | Already correct; this ADR explicitly does not fold compliance audit into operational telemetry. |
| K8s integration | Probes documented, no ServiceMonitor/PodMonitor ships with the chart. | Customers must hand-roll, and the PermRead gate currently makes that fail closed. |
### 1.1 Forces
The decision must reconcile competing forces:
- Customer-facing. NornicDB is a product. Metric names are a public API that downstream Grafana dashboards, Alertmanager rules, Datadog monitors, and SRE runbooks will hard-code. Renaming a metric is a breaking change. Treat the metric catalog like SQL — versioned, deprecated with overlap, never removed without a major release.
- Self-hosted on K8s. The dominant deployment is a kube-prometheus-stack `ServiceMonitor`. Anything that breaks the no-auth scrape contract on port `9090` breaks every prospect's first day.
- Cloud-neutral. Customers who do not run Prometheus (Datadog, New Relic, Honeycomb, Grafana Cloud) need an OTLP push path with the same data.
- Defense in depth. Metric scraping must not become a tenant naming oracle (label cardinality leak) or an unauthenticated profile path.
- AGENTS.md golden rules. No >10% memory regression, 90% coverage, files under 2,500 lines, Neo4j wire compatibility preserved.
- Two protocols. HTTP/REST and Bolt have disjoint code paths and must both be instrumented to the same standard.
- Replication-aware. Tracing and metrics must survive the leader→follower wire, otherwise distributed-system bugs become unobservable.
### 1.2 Out of scope
- The compliance audit log (`pkg/audit`) is not changed by this ADR. It remains a distinct, append-only, signed stream.
- Continuous profiling (Pyroscope/Parca) is deferred to a follow-up ADR; pprof endpoints are addressed only as a gated debug surface.
## 2. Decision
NornicDB will adopt a three-pillar observability stack — metrics, traces, structured logs — wired through a new package pkg/observability and exposed on a dedicated network port suitable for a Kubernetes ServiceMonitor.
### 2.1 Architecture overview
```
┌──────────────────────── pkg/observability ────────────────────────┐
│                                                                   │
│ HTTP edge ───► otelhttp ──► slog (JSON) ──► prom registry ──► /metrics on :9090
│ Bolt edge ───► bolt span ──► slog (JSON) ──► prom registry
│ Cypher ──────► exec span ──► slog (JSON) ──► prom histograms (planner/exec/rows)
│ Storage ─────► badger span ─► slog (JSON) ──► prom histograms (read/write/compact)
│ Replication ─► raft span ──► slog (JSON) ──► prom gauges (lag, term, role)
│                                                                   │
│   ┌───────────────────────────────────────────────────┐           │
│   │ OTel SDK (single tracer + meter provider)         │           │
│   │ ├── OTLP/gRPC exporter (traces + metrics push)    │           │
│   │ ├── Prom exporter (`/metrics` text)               │           │
│   │ └── stdout exporter (DEV only)                    │           │
│   └───────────────────────────────────────────────────┘           │
└───────────────────────────────────────────────────────────────────┘
```
Service ports:

```
:7474  HTTP API (auth required, customer data plane)
:7687  Bolt (auth required, customer data plane)
:9090  /metrics, /livez, /readyz (cluster-internal, no auth)
:9091  /debug/pprof (opt-in, NORNICDB_PPROF_ENABLED=true)
```
Three signals, one instrumentation surface (pkg/observability), two egress paths (Prometheus pull + OTLP push) so customers pick whichever fits their stack without us writing the code twice.
### 2.2 Metrics — client_golang is canonical, OTel is an additional egress
We will instrument metrics with prometheus/client_golang as the canonical API, not the OTel meter API. Reasoning:
- `/metrics` is a published customer contract. `client_golang` produces Prometheus exposition format with native counter naming (`_total`), histogram bucket layout, and `# HELP`/`# TYPE` semantics that downstream dashboards depend on. The OTel→Prom bridge has documented quirks (unit suffix translation, dotted-attribute renaming) that would silently rename metrics for customers who are already scraping us today.
- Where customers want OTLP push (Datadog/New Relic/Honeycomb), an OTel metric reader will be wired in parallel against the same registry via the `otelprom` bridge. Internal call sites instrument once; both pipelines read the same numbers.
Naming and labeling rules (binding, customer-facing):

- Namespace prefix is always `nornicdb_`.
- Subsystem names map to internal packages: `http_`, `bolt_`, `cypher_`, `storage_`, `mvcc_`, `embed_`, `search_`, `replication_`, `cache_`, `apoc_`, `plugin_`, `process_`, `auth_`, `knowledge_policy_`. The `knowledge_policy` catalog (MET-ext-01) owns decay / promotion / suppression / on-access-mutation / access-flush instrumentation; see docs/observability/knowledge-policy-metrics.md for the per-instrument reference and docs/plans/knowledge-policy-observability-plan.md for the implementation plan.
- Counters end in `_total`. Histograms end in the unit they measure (`_seconds`, `_bytes`, `_rows`).
- Allowed labels:
  - `database` (tenant database name; bounded by `dbManager.ListDatabases()`)
  - `method`, `path_template`, `status_class` (`2xx|4xx|5xx`) for HTTP
  - `op` for Bolt message type (`RUN`, `PULL`, `BEGIN`, `COMMIT`, `ROLLBACK`)
  - `result` (`success|error|conflict|timeout`)
  - `cache` (`plan|node|edge|query|embedding`)
  - `mode` for search (`vector|bm25|hybrid`)
  - `provider`, `model` for embeddings (only from configured allow-list)
  - `role` (`leader|follower|standby`) for replication
  - `peer` (cluster node ID, bounded by config) for replication links
- Forbidden labels (cardinality bombs / PII): full HTTP path, raw Cypher query, user ID, email, IP address, node ID/UUID, embedding text, span/trace IDs.
- Histograms use NornicDB's standard latency buckets: `[0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]` seconds. Storage/byte histograms use the standard `prometheus.ExponentialBuckets(1024, 4, 10)`.
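For concreteness, the two bucket families can be spelled out in plain Go. The `exponentialBuckets` helper below mirrors the documented behavior of client_golang's `prometheus.ExponentialBuckets` (count buckets, first upper bound at start, each subsequent bound multiplied by factor) without pulling in the library; it is a sketch, not the shipped code:

```go
package main

import "fmt"

// latencyBuckets is NornicDB's standard latency bucket layout from §2.2,
// in seconds.
var latencyBuckets = []float64{
	0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05,
	0.1, 0.25, 0.5, 1, 2.5, 5, 10,
}

// exponentialBuckets mirrors prometheus.ExponentialBuckets(start, factor, count):
// `count` upper bounds, the first at `start`, each following one multiplied
// by `factor`. ExponentialBuckets(1024, 4, 10) therefore tops out at
// 1024 * 4^9 = 256 MiB.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	byteBuckets := exponentialBuckets(1024, 4, 10)
	fmt.Println(len(latencyBuckets), len(byteBuckets), byteBuckets[9] == 268435456)
	// → 14 10 true
}
```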
Two-path egress. prometheus/client_golang is canonical for the :9090/metrics text exposition; an OTel meter reader runs in parallel against the same numbers via the otelprom bridge for OTLP push customers. The bridge is configured with WithoutUnits() to suppress its auto-suffix translation (_milliseconds_total, _bytes), and bridge-emitted metrics live under the nornicdb_otel_* namespace to prevent silent collision with hand-instrumented nornicdb_* series. Resource attributes do not become labels through the bridge — they appear only as the target_info metric, which the bundled ServiceMonitor drops via metricRelabelings.
Tenant label kill switch. A metrics.tenant_labels_enabled config flag (default true on K8s detect via KUBERNETES_SERVICE_HOST env + ServiceAccount token; false otherwise) controls whether the database label appears on all tenant-scoped subsystems — nornicdb_http_*, nornicdb_storage_*, nornicdb_cypher_*, nornicdb_search_*, and nornicdb_mvcc_* series. Operators can override with metrics.tenant_labels_enabled=false even on K8s. The resolved value is logged at startup.
#### 2.2.1 Reconciled amendments
The following research-surfaced amendments were folded into the §2.2 prose above:
- A1 — OTel→Prom bridge configuration: the `otelprom` exporter is configured with `WithoutUnits()` to suppress the auto-suffix translation (`_milliseconds_total`, `_bytes`); bridge-emitted metrics live under the `nornicdb_otel_*` namespace to prevent silent collision with hand-instrumented `nornicdb_*` series; the `ServiceMonitor` includes `metricRelabelings` to drop the per-pod `target_info` series.
### 2.3 The NornicDB-centric metric catalog (v1)
This catalog is the customer contract. Modeled on pg_stat_* (Postgres) and db.* (Neo4j JMX beans) but renamed to NornicDB semantics. Each line is a metric NornicDB v1 guarantees to expose; additions are non-breaking, removals require a major version.
#### HTTP edge

```
nornicdb_http_requests_total{method,path_template,status_class,database} counter
nornicdb_http_request_duration_seconds{method,path_template,database} histogram
nornicdb_http_in_flight_requests gauge
nornicdb_http_request_body_bytes{method,path_template} histogram
nornicdb_http_response_body_bytes{method,path_template} histogram
```
#### Bolt protocol

```
nornicdb_bolt_connections_active gauge
nornicdb_bolt_connections_total{result} counter
nornicdb_bolt_session_duration_seconds histogram
nornicdb_bolt_messages_total{op,result} counter
nornicdb_bolt_message_duration_seconds{op} histogram
nornicdb_bolt_packstream_decode_errors_total{reason} counter
```
#### Cypher (the "pg_stat_statements" of NornicDB)

```
nornicdb_cypher_queries_total{database,op_type,result} counter
nornicdb_cypher_query_duration_seconds{database,op_type} histogram
nornicdb_cypher_planner_duration_seconds{database} histogram
nornicdb_cypher_planner_cache_hits_total{database} counter
nornicdb_cypher_planner_cache_misses_total{database} counter
nornicdb_cypher_planner_cache_size gauge
nornicdb_cypher_rows_returned{database,op_type} histogram
nornicdb_cypher_active_transactions{database} gauge
nornicdb_cypher_transaction_conflicts_total{database,kind} counter
nornicdb_cypher_slow_queries_total counter
nornicdb_cypher_slow_query_threshold_seconds gauge
```
op_type is one of read|write|schema|admin|fabric — derived in the planner, not taken from the query text. This bounds cardinality.
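The bounded derivation can be sketched as a closed-set switch over the planner's root operator. The operator names below are illustrative stand-ins (not NornicDB's real planner operators), and routing unknowns to `read` is an assumption of this sketch; the point is that the label set stays a five-value enum no matter what query text arrives:

```go
package main

import "fmt"

// opType maps a planner root operator to the closed op_type set from §2.3
// (read|write|schema|admin|fabric). The operator names are hypothetical;
// `fabric` would cover cross-database routing operators. The conservative
// default keeps the label set bounded even for operators added later.
func opType(planRoot string) string {
	switch planRoot {
	case "NodeByLabelScan", "Expand", "Projection":
		return "read"
	case "Create", "Merge", "Delete", "SetProperty":
		return "write"
	case "CreateIndex", "DropIndex", "CreateConstraint":
		return "schema"
	case "ShowDatabases", "CreateDatabase":
		return "admin"
	case "RemoteExecute": // illustrative cross-database operator
		return "fabric"
	default:
		return "read" // never the raw query text: cardinality stays bounded
	}
}

func main() {
	fmt.Println(opType("Merge"), opType("CreateIndex"), opType("SomethingNew"))
	// → write schema read
}
```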
#### Storage / Badger

```
nornicdb_storage_nodes_total{database} gauge
nornicdb_storage_edges_total{database} gauge
nornicdb_storage_bytes{database,kind=nodes|edges|index|wal|search} gauge
nornicdb_storage_op_duration_seconds{database,op=get|put|delete|scan} histogram
nornicdb_storage_compactions_total{level,result} counter
nornicdb_storage_compaction_duration_seconds{level} histogram
nornicdb_storage_wal_lag_bytes{database} gauge
nornicdb_storage_index_rebuild_total{database,index,result} counter
```
#### MVCC lifecycle
The existing MVCCLifecycleEngine already exposes a pressure band; this becomes a first-class metric (mirroring mvcc.bytes_pinned_by_oldest_reader in pkg/server/server_helpers.go:367):
```
nornicdb_mvcc_pressure_band{database,band=normal|warn|high|critical} gauge
nornicdb_mvcc_pinned_bytes{database} gauge
nornicdb_mvcc_oldest_reader_age_seconds{database} gauge
nornicdb_mvcc_active_readers{database} gauge
```
#### Embeddings

```
nornicdb_embed_queue_depth gauge
nornicdb_embed_processed_total{provider,model,result} counter
nornicdb_embed_duration_seconds{provider,model} histogram
nornicdb_embed_cache_hits_total counter
nornicdb_embed_cache_misses_total counter
nornicdb_embed_worker_running gauge
```
#### Search (vector/BM25/hybrid)

```
nornicdb_search_requests_total{database,mode,result} counter
nornicdb_search_duration_seconds{database,mode,stage=embed|index|fuse} histogram
nornicdb_search_candidates{database,mode} histogram
nornicdb_search_index_size_bytes{database,kind=hnsw|bm25} gauge
```
#### Replication

```
nornicdb_replication_role{role} gauge  (1 for current role, 0 otherwise; per-replica
                                        disambiguation via Prometheus scrape-target `instance` label)
nornicdb_replication_term gauge
nornicdb_replication_commit_index gauge
nornicdb_replication_apply_index gauge
nornicdb_replication_lag_bytes{peer} gauge
nornicdb_replication_lag_entries{peer} gauge
nornicdb_replication_apply_duration_seconds histogram
nornicdb_replication_rtt_seconds{peer} histogram
nornicdb_replication_leader_changes_total counter
nornicdb_replication_last_contact_seconds{peer} gauge
```
#### Cache and runtime

```
nornicdb_cache_hits_total{cache} counter
nornicdb_cache_misses_total{cache} counter
nornicdb_cache_size_bytes{cache} gauge
nornicdb_cache_evictions_total{cache,reason} counter
nornicdb_process_uptime_seconds gauge
nornicdb_build_info{version,commit,go_version,backend} gauge  (always 1)
```
Cache hit ratio is exposed as a Prometheus recording rule (nornicdb_cache_hit_ratio{cache}) shipped in Phase 10, computed from cache_hits_total and cache_misses_total. It is not a raw metric.
#### Auth

```
nornicdb_auth_attempts_total{result,protocol} counter
```

result is bounded (success|failure|error); protocol is bounded (http|bolt).
The Go runtime collector (collectors.NewGoCollector) and process collector (collectors.NewProcessCollector) are registered alongside the above and provide go_*/process_* for free.
#### 2.3.0 Reconciled amendments (Phase 0)
Three research-surfaced gap closures were folded into the §2.3 catalog above:
- A2 — GAP-1 `nornicdb_replication_last_contact_seconds{peer}`: added to the Replication subsystem (10th metric family). Standard SRE alert for quiescence vs. stall.
- A3 — GAP-6 `nornicdb_auth_attempts_total{result,protocol}`: added as a new Auth subsystem. Failed-auth rate is a standard security alert.
- A4 — GAP-2 `nornicdb_cache_hit_ratio`: ships as a recording-rule template (Phase 10) over the existing `cache_hits_total`/`cache_misses_total` counters, not as a raw metric — keeps the catalog primitives bounded.
#### 2.3.1 Phase 5 amendment — Legacy translation layer & tenant flag (2026-05-04)
This subsection records the customer-API contract additions delivered by Phase 5 (Legacy Translation Layer & Tenant Flag, plans 05-01 through 05-05). Future changes to any of the four contract surfaces below require a follow-up ADR amendment + reviewer sign-off; the subsection IS the contract.
Sunset header value (D-03): The legacy :7474/metrics endpoint emits Sunset: Fri, 31 Dec 2027 23:59:59 GMT (RFC 7231 IMF-fixdate format) on every response. This date is the customer-API contract for migration off the legacy 12-metric surface; changing it requires a follow-up ADR amendment + reviewer sign-off. The literal lives in pkg/observability/legacy_translation.go as const LegacySunset and is locked by pkg/observability/legacy_translation_test.go::TestRenderLegacy_HeadersConsts.
Deprecation header value (D-03a / R-01): The endpoint emits Deprecation: true per the original draft of the Deprecation HTTP header. RFC 9745 (March 2025) supersedes that draft and now mandates a Structured-Field Date format (Deprecation: @<unix-timestamp>); M1 deliberately ships true per the locked REQ MET-20 / ROADMAP SC#2 / D-03a contract. An RFC-9745 cutover is captured as a deferred item in .planning/phases/05-legacy-translation-layer-tenant-flag/deferred-items.md; the non-conformance with RFC 9745 §2.1 is intentional and customer-API-stable for M1.
Autodetect contract (D-02): The metrics.tenant_labels_enabled flag defaults via multi-signal AND logic: KUBERNETES_SERVICE_HOST env var non-empty AND /var/run/secrets/kubernetes.io/serviceaccount/token exists AND has non-zero size. Operator can override via explicit YAML metrics.tenant_labels_enabled: true|false (R-02 sentinel *bool field MetricsConfig.TenantLabelsExplicit preserves operator intent through to startup-resolution time so explicit_false beats autodetect_true on K8s). Resolution happens once at startup, between observability.New and the first bag constructor (Phase 4 D-02c init-order chokepoint). The resolved value + source reason is logged once via the injected slog logger (LOG-09 compliant — never slog.Default):
```
level=INFO msg="resolved tenant labels enabled" enabled=<bool> reason=<enum> service_host_present=<bool> token_file_present=<bool>
```
The reason enum is closed-set with exactly six values: explicit_yaml, k8s_detected, not_k8s_service_host_absent, not_k8s_token_file_absent, not_k8s_token_file_empty, not_k8s_token_stat_error. Operators grep this finite vocabulary for forensic resolution outcomes. The token file is checked via os.Stat only — JWT bytes never enter process memory (T-05-11 mitigation).
Translation layer (D-01..D-01f): Customer scrapers on :7474/metrics continue to receive 12 legacy metric names (nornicdb_uptime_seconds, nornicdb_requests_total, nornicdb_errors_total, nornicdb_active_requests, nornicdb_nodes_total, nornicdb_edges_total, nornicdb_embeddings_processed, nornicdb_embeddings_failed, nornicdb_embedding_worker_running, nornicdb_slow_queries_total, nornicdb_slow_query_threshold_ms, nornicdb_info) fed from the unified Phase 4 registry via the pkg/observability.RenderLegacy(reg, now) []byte pure function. The 12-row mapping table is locked by pkg/observability/legacy_snapshot.golden (1660 bytes, 36 emit lines); CI byte-equality verifies that future Phase-4 catalog changes do not silently break the legacy contract. Regeneration is gated behind REGEN=1 env var (go test -run TestRenderLegacy_RegenerateGolden); CI never sets REGEN. The :7474/metrics auth gate (s.withAuth(s.handleMetrics, auth.PermRead)) is unchanged — dropping auth would itself require its own deprecation cycle per CLAUDE.md "Public API contract" (D-04).
### 2.4 Tracing — OpenTelemetry SDK with OTLP
- Tracer provider: `go.opentelemetry.io/otel/sdk/trace` with a `BatchSpanProcessor` configured `MaxQueueSize=8192`, `MaxExportBatchSize=1024`, `BatchTimeout=2s` (production-correct for bursty DB workloads). The BSP exposes self-instrumentation: `nornicdb_otel_bsp_queue_depth` (gauge) and `nornicdb_otel_bsp_dropped_spans_total` (counter).
- Exporter: OTLP/gRPC (`go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc`) with HTTP fallback. Endpoint, headers, and TLS controlled by the standard `OTEL_EXPORTER_OTLP_*` environment variables — do not invent `NORNICDB_OTLP_*` aliases; defer to OTel conventions so customers can use any collector they already run.
- OTLP endpoint URL must be HTTPS in production mode; plaintext `http://` is rejected unless the standard `OTEL_EXPORTER_OTLP_INSECURE=true` is set explicitly (per-signal variants `OTEL_EXPORTER_OTLP_TRACES_INSECURE`/`OTEL_EXPORTER_OTLP_METRICS_INSECURE` are also honored). YAML symmetry: the `tracing.endpoint` field complements the standard `OTEL_EXPORTER_OTLP_*` env vars.
- Sampler (v1 default): `TraceIDRatioBased(NORNICDB_TRACE_SAMPLE_RATIO)` standalone, default `0.01` (1%). The default does not honor parent context — storage-layer trace-volume control is preserved against upstream 100% samplers. Two opt-in modes are available: `NORNICDB_TRACE_PARENT_MODE=capped` (`parent_capped` — honors upstream `sampled=true` up to `NORNICDB_TRACE_PARENT_MAX_QPS`, default 100/s, then falls back to ratio-based) and `NORNICDB_TRACE_PARENT_MODE=strict` (`parent_strict` — full upstream honor; emits a `WARN` startup log about unbounded volume risk). Tail sampling remains the OTel collector's job, not ours.
- Resource attributes: `service.name=nornicdb`, `service.version=<buildinfo>`, `service.instance.id` (resolution chain: `cfg.NodeID` → `POD_NAME` env → `os.Hostname()` → `"standalone"`; resolved value logged at startup), `nornicdb.cluster.mode`, `nornicdb.replication.role`.
- Propagators: W3C `traceparent`/`tracestate` + W3C `baggage` (defaults).
Spans NornicDB will produce by default:
| Span name | Where created | Notable attributes |
|---|---|---|
| nornicdb.http.\<route\> | otelhttp middleware on mux | http.method, http.route, http.status_code, nornicdb.database |
| nornicdb.bolt.session | pkg/bolt/server.go per-conn | nornicdb.bolt.user, nornicdb.bolt.client_version |
| nornicdb.bolt.message | per Bolt message dispatch | nornicdb.bolt.op, nornicdb.bolt.fields |
| nornicdb.cypher.execute | StorageExecutor.Execute | nornicdb.cypher.op_type, nornicdb.cypher.plan_hash, nornicdb.cypher.rows |
| nornicdb.cypher.plan | planner entry | nornicdb.cypher.plan_cache_hit |
| nornicdb.cypher.exec.\<op\> | per logical operator (NodeByLabel…) | nornicdb.exec.estimated_rows, nornicdb.exec.actual_rows |
| nornicdb.storage.\<op\> | Badger adapter | nornicdb.storage.kind, nornicdb.storage.bytes |
| nornicdb.embed.batch | embed queue worker | nornicdb.embed.provider, nornicdb.embed.model, nornicdb.embed.batch_size |
| nornicdb.search.\<mode\> | search service | nornicdb.search.candidates, nornicdb.search.recall |
| nornicdb.replication.append | leader replication path | nornicdb.replication.peer, nornicdb.replication.entries, nornicdb.replication.bytes |
| nornicdb.replication.apply | follower apply loop | nornicdb.replication.term, nornicdb.replication.index |
Bolt + replication context propagation. The existing replication codec (pkg/replication/codec.go) gains a codec_version field FIRST (a separate PR ahead of any tracing work) so that follower frame parsing is safe in both rolling-upgrade directions; the optional traceparent field is a SECOND, separate PR that depends on the codec-version PR having merged. A new Bolt extra nornicdb.traceparent is accepted on BEGIN/RUN so that client-generated spans connect to server spans. Both are backwards-compatible: missing context simply starts a new trace, never an error. Per-driver Bolt conformance (KD-08) is documented in Phase 8 (DOC-07); v1 minimum is neo4j-go-driver/v5.
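The degrade-to-new-trace rule can be sketched as an optional-field decode. The `frame` type below is an illustrative stand-in for the real codec in pkg/replication/codec.go, and the regexp checks only the W3C traceparent shape (it skips the spec's all-zero trace-id check); the invariant being demonstrated is that old peers and malformed values both fall back to a fresh root trace, never an error:

```go
package main

import (
	"fmt"
	"regexp"
)

// w3cTraceparent matches the W3C Trace Context header shape:
// version "00", a 32-hex trace-id, a 16-hex parent-id, and 2-hex flags.
var w3cTraceparent = regexp.MustCompile(`^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$`)

// frame is an illustrative stand-in for a decoded replication frame.
type frame struct {
	CodecVersion int    // 0 when the peer predates the codec-version PR
	Traceparent  string // "" when the peer predates the tracing PR
}

// traceContextOf returns the propagated traceparent, or "" meaning
// "start a new root trace". Pre-version peers (CodecVersion 0) and
// malformed values both degrade silently — rolling upgrades stay safe
// in both directions.
func traceContextOf(f frame) string {
	if f.CodecVersion < 1 || !w3cTraceparent.MatchString(f.Traceparent) {
		return ""
	}
	return f.Traceparent
}

func main() {
	old := frame{} // rolling upgrade: follower receives a pre-version frame
	newer := frame{CodecVersion: 1, Traceparent: "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
	fmt.Printf("%q %q\n", traceContextOf(old), traceContextOf(newer))
}
```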
Exemplars. Histograms attach the active span's trace ID as a Prometheus exemplar via client_golang's ExemplarObserver API; emission is unconditional (an IsValid() guard is applied before observe). End-to-end correlation requires customer-side configuration: Prometheus --enable-feature=exemplar-storage, OpenMetrics negotiation (Accept: application/openmetrics-text), and a Tempo (or equivalent) datasource. The CI integration test against kube-prometheus-stack + Tempo (Phase 10) is the proof that the full chain works; the ADR's "one-click trace correlation" claim is gated on that test, not on the emission itself.
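The guard-then-observe shape can be sketched against a minimal local interface that mirrors client_golang's `ExemplarObserver`. The interface, helper, and fake below are illustrative (not the shipped wiring in exemplar.go); in real code the `valid` flag would come from the active span context's `IsValid()`:

```go
package main

import "fmt"

// exemplarObserver mirrors the shape of client_golang's ExemplarObserver:
// histograms that support exemplars also accept an exemplar label set.
type exemplarObserver interface {
	Observe(v float64)
	ObserveWithExemplar(v float64, labels map[string]string)
}

// observeWithTrace attaches the active trace ID as an exemplar only when
// the span context is valid; otherwise it falls back to a plain observation,
// so unsampled requests still land in the histogram.
func observeWithTrace(o exemplarObserver, seconds float64, traceID string, valid bool) {
	if valid {
		o.ObserveWithExemplar(seconds, map[string]string{"trace_id": traceID})
		return
	}
	o.Observe(seconds)
}

// fakeHistogram records which path fired, standing in for a real histogram.
type fakeHistogram struct{ exemplar map[string]string }

func (f *fakeHistogram) Observe(v float64)                                  {}
func (f *fakeHistogram) ObserveWithExemplar(v float64, l map[string]string) { f.exemplar = l }

func main() {
	h := &fakeHistogram{}
	observeWithTrace(h, 0.042, "4bf92f3577b34da6a3ce929d0e0e4736", true)
	fmt.Println(h.exemplar["trace_id"])
	// → 4bf92f3577b34da6a3ce929d0e0e4736
}
```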
Sample-rate mismatch detector. When peer replicas report different sample ratios (e.g., during a rolling config rollout), nornicdb_otel_sample_rate_mismatch_total increments so operators can detect the mismatch without reading SDK source. AsyncEngine flush goroutine spans use WithLinks (carrying trace.SpanContext from the originating request) rather than parent-child, so embedding-queue handoffs do not appear as orphaned roots.
#### 2.4.1 Reconciled amendments
Four research-surfaced amendments to the tracing pillar:
- A5 — `service.instance.id` resolution: fallback chain `cfg.NodeID` → `POD_NAME` env → `os.Hostname()` → `"standalone"`. Resolved value is logged at startup so operators can verify which leg fired.
- A6 — BSP tuning + self-instrumentation: `BatchSpanProcessor` is configured with `MaxQueueSize=8192`, `MaxExportBatchSize=1024`, `BatchTimeout=2s` (production-correct for bursty DB workloads). Two self-instrumentation metrics expose pipeline health: `nornicdb_otel_bsp_queue_depth` (gauge) and `nornicdb_otel_bsp_dropped_spans_total` (counter).
- A7 — Replication codec versioning prerequisite (TRC-21): the optional `traceparent` field on `AppendEntries` is preceded by a `codec_version` field. These are separate, ordered PRs; the codec-version PR ships first to make rolling-upgrade frame parsing safe in both directions.
- A8 — `parent_capped` sampler (TRC-06): a third opt-in sampler mode (`NORNICDB_TRACE_PARENT_MODE=capped`) that honors upstream `sampled=true` up to `NORNICDB_TRACE_PARENT_MAX_QPS` (default 100/s) and falls back to ratio-based otherwise. The v1 default sampler (`TraceIDRatioBased(0.01)` standalone) is unchanged.
### 2.5 Logs — log/slog with mandatory fields
- Migrate the codebase from `log.Printf`/`fmt.Println` to `log/slog`. The global logger is constructed in `cmd/nornicdb/main.go` from `LoggingConfig` and stored in `pkg/observability.Logger()`.
- Public interface: stdlib `log/slog` (`slog.Logger`, `slog.Handler`). All call sites code against the slog interface so the underlying handler is swappable.
- Handler implementation: `slog.NewJSONHandler` when `Logging.Format=json` (the default for containers); `slog.NewTextHandler` otherwise. Phase 2 benchmarks the stdlib handler against NornicDB's hot paths; if it doesn't meet the per-record allocation budget, a custom slog-compatible `Handler` is fair game (we're not married to the stdlib implementation, only to the interface contract). The decision is recorded in the Phase 2 SUMMARY.md with benchmark numbers.
- Mandatory fields on every record:
  - `time` (RFC3339Nano, automatic)
  - `level` (DEBUG|INFO|WARN|ERROR)
  - `msg`
  - `service="nornicdb"`
  - `version=buildinfo.Version`
  - `node_id` (cluster node ID; `"standalone"` if unclustered)
  - `trace_id`, `span_id` when a span is active in `ctx`
- Conventional groups (used via `slog.Group`):
  - `http`: `method`, `path`, `status`, `duration_ms`, `request_id`
  - `cypher`: `op_type`, `plan_hash`, `rows`, `duration_ms`, `database`
  - `storage`: `op`, `kind`, `bytes`, `duration_ms`
  - `replication`: `role`, `term`, `peer`, `lag_bytes`
- PII redaction. The handler runs through a small `Redactor` middleware that strips known sensitive keys (`password`, `token`, `authorization`, `secret`, `api_key`) regardless of nesting. Same allow-list as `pkg/audit` to keep the rules in one place.
- The slow-query log emits `WARN` with the truncated 500-char query, the `plan_hash`, and `cypher.duration_ms`. The full query text is never attached as a metric label or span attribute — it goes to the log only.
- `pkg/audit/audit.go` is not migrated to `slog`. Audit must remain a distinct, signed, append-only stream with its own retention policy.
### 2.6 Kubernetes service monitor — dedicated unauthenticated port
A new listener on port 9090 (configurable via `NORNICDB_TELEMETRY_PORT`; the value is the port number with no colon prefix, and the bind address is `:` so the listener accepts on all interfaces while the NetworkPolicy below restricts ingress) serves:
| Path | Auth | Purpose |
|---|---|---|
| /metrics | None | Prometheus pull endpoint. Returns the client_golang registry. |
| /livez | None | Process liveness — returns 200 once the OS process is past main.go startup. |
| /readyz | None | Readiness — returns 200 throughout the lifecycle: during warm-up the body is a progress JSON object ({"phase": "warming", "storage_open": true, "search_warming": "...%"}), and after warm-up completes the body is a steady-state {"phase": "ready"}. The contract is "200-with-body, parse the body for phase". The startupProbe (high failureThreshold, default 60 ≈ 10 min) enforces the time bound; readinessProbe polls /readyz and the kubelet treats 200 as routable. Operators who prefer 503-during-warm-up semantics can configure that via metrics.readyz_503_until_ready=true (default false). |
| /version | None | Plain-text buildinfo.DisplayVersion() (already public in the /health parent). |
The data-plane :7474 keeps the legacy /metrics, /health, /status endpoints intact in M1 and emits deprecation headers (Deprecation: true, Sunset: <date>) to give existing scrapers time to migrate. M1 ships only the deprecation markers — no removal logic. The actual removal (data-plane /metrics going away, /health becoming a thin alias for /livez) is scheduled for v1.1.0 in a separate follow-up PR coordinated with issue #117; that work is explicitly out of scope for ADR-0001.
The opt-in :9091 pprof listener binds 127.0.0.1 by default (not :9091 interpreted as 0.0.0.0:9091); binding all interfaces would make pprof a credential-exfiltration surface. Operators who need remote pprof access must explicitly configure a non-loopback listen address and a corresponding NetworkPolicy.
Why a separate port and not just "remove auth from /metrics":
- Metric labels include `database` (tenant name). Exposing `nornicdb_storage_nodes_total{database="acme-prod"}` to anyone who can reach `:7474` from the public internet is a tenant-naming oracle. The defense is network-scoped, not bearer-scoped — exactly how Postgres' `postgres_exporter` and CockroachDB's `_status/vars` do it.
- Prometheus Operator's `ServiceMonitor` does not natively send bearer tokens that rotate with NornicDB's auth; making it work requires either long-lived static tokens (a security regression) or a sidecar (operational regression). A dedicated port closes both problems.
A reference Helm ServiceMonitor will ship at docs/operations/kubernetes-servicemonitor.md as a Phase 9 deliverable (K8S-01); the inlined snippet below is the contract that doc will encode and ship in the bundled Helm chart:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nornicdb
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nornicdb
  endpoints:
    - port: telemetry   # 9090, named in the Service
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      honorLabels: false
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
          targetLabel: cluster
```
A NetworkPolicy template in the same doc is default-on in the bundled Helm chart, restricting :9090 ingress to the monitoring namespace; operators opt out via --set networkPolicy.enabled=false. A PodMonitor template ships alongside ServiceMonitor, toggled via --set podMonitor.enabled=true. The chart's startupProbe (high failureThreshold, default 60 ≈ 10 min) is distinct from readinessProbe; /readyz's 200-with-progress-JSON contract (defined in the table above) lets the kubelet treat the pod as routable as soon as core surfaces are up while still surfacing warm-up state to operators. Versioned alert rule files (nornicdb-alerts-v1.yaml) ship with an upgrade-diff doc; the cache hit-ratio recording rule is part of the same alerts bundle.
#### 2.6.1 Reconciled amendments
One research-surfaced amendment to the K8s pillar:
- A9 — pprof bind address (OBS-06): `:9091` binds `127.0.0.1` by default. Binding all interfaces would make pprof a credential-exfiltration surface; the Helm chart template honors this default.
### 2.7 Configuration surface
New Observability block in pkg/config/config.go, all overridable via environment so containers stay declarative:
```yaml
observability:
  metrics:
    enabled: true            # NORNICDB_METRICS_ENABLED
    port: 9090               # NORNICDB_TELEMETRY_PORT (number, no colon)
    # The legacy `:7474/metrics` data-plane endpoint stays enabled in M1 — no
    # config flag. It carries Deprecation/Sunset headers and is removed
    # post-M1 in a follow-up PR (target v1.1.0, see §3.3 risk row + #117).
  tracing:
    enabled: false           # NORNICDB_TRACING_ENABLED
    sample_ratio: 0.01       # NORNICDB_TRACE_SAMPLE_RATIO
    # OTel-standard env vars are honored: OTEL_EXPORTER_OTLP_ENDPOINT, etc.
  logging:
    level: "INFO"            # NORNICDB_LOG_LEVEL
    format: "json"           # NORNICDB_LOG_FORMAT (default flips to json)
    output: "stdout"         # NORNICDB_LOG_OUTPUT
    redact_extra: []         # NORNICDB_LOG_REDACT_EXTRA (csv)
  pprof:
    enabled: false           # NORNICDB_PPROF_ENABLED
    listen: "127.0.0.1:9091" # loopback by default; override only with explicit NetworkPolicy
```
2.8 Implementation map¶
A new package, `pkg/observability`, is the single import point for call sites. This keeps the Prometheus and OTel SDK surface out of business logic, makes testing straightforward (interface seam), and lets us swap exporters without a tree-wide diff.
```text
pkg/observability/
  observability.go    // Init(ctx, cfg) -> Provider; Shutdown(ctx)
  metrics.go          // package-level registry + helpers (must call Init first)
  tracing.go          // tracer, propagators, span helpers
  logging.go          // slog handler, redactor, ctx-aware logger
  middleware_http.go  // wraps mux: otelhttp + per-route histogram
  middleware_bolt.go  // hooks into pkg/bolt server lifecycle
  middleware_grpc.go  // wraps grpc.Server with otelgrpc Unary + Stream interceptors;
                      // produces standard gRPC server/client metrics
                      // (rpc.server.duration, rpc.server.requests, etc.) and spans
                      // for replication peer-to-peer traffic and any future gRPC surfaces
  exemplar.go         // histogram exemplar wiring (trace_id from ctx)
  testing.go          // in-memory exporters for unit tests
```
Wiring sequence in `cmd/nornicdb/main.go` `runServe`:
- Load config; build the `LoggingConfig`-backed `slog` logger first so subsequent steps log structured.
- `observability.Init(ctx, cfg.Observability, buildinfo)` — registers collectors, starts the OTel SDK, starts the `:9090` listener.
- Pass the resulting `Provider` into `server.New`, `bolt.New`, and the replication transport so each subsystem can grab its `Meter`/`Tracer`.
- On shutdown, drain in this order: HTTP server → Bolt server → background workers → `observability.Shutdown(ctx)`. The OTel batch span processor needs ~5 s to flush; the existing graceful-shutdown context allows this.
Per AGENTS.md golden rules:
- Every new metric/span/log call site comes with a benchmark that proves ≤2% throughput regression on `make bench-cypher` and ≤5% on `make bench-bolt`. Hot paths (Cypher execute, storage get/put) use `metric.WithAttributes` only when the result is non-trivial; in the fast path we use pre-bound counters.
- The `pkg/observability` package targets ≥90% test coverage with the in-memory OTel exporter + `prometheus/testutil`.
- No file in `pkg/observability` exceeds 800 lines.
- Bolt wire format and HTTP responses are not changed except for the optional `traceparent` extra; Neo4j drivers that don't send it continue to work unchanged.
2.8.1 Reconciled amendments¶
Three research-surfaced amendments to the implementation map:
- A10a — Shutdown context independence: observability shutdown uses a separate `context.WithTimeout(context.Background(), 30s)`, not the cancelled parent context (using the parent would cause an immediate return without flush). The mandatory shutdown order is HTTP → Bolt → background workers → observability flush, with the `:9090` listener closing last so the kubelet can scrape during graceful drain.
- A10b — Test isolation pattern: every test in `pkg/observability` constructs an isolated `*prometheus.Registry`, `InMemoryExporter`, and `*slog.Logger` — never `prometheus.DefaultRegisterer`, never `slog.SetDefault`. Each `*Vec` has a `TestMetricCardinality_<name>` test asserting `CollectAndCount(reg, name) <= ceiling` under 1k tenant UUIDs and adversarial label values. The three named bucket constants (`LatencyBucketsSeconds`, `SizeBucketsBytes`, `RowCountBuckets`) plus the long-tail `EmbeddingLatencyBucketsSeconds` are the single source of truth; registration helpers enforce bucket type at definition site.
- A10c — Lifecycle supervisor primitive: the canonical process supervisor for all current and future NornicDB binaries (`cmd/nornicdb/`, planned `cmd/metrics-doc-gen/`, future CLI utilities) is a `Run(ctx, components...) error` driver in `pkg/lifecycle/` built on `golang.org/x/sync/errgroup` with `signal.NotifyContext(SIGINT, SIGTERM)`. Reverse-order drain runs on a fresh `context.WithTimeout(context.Background(), 30s)` per A10a. The dependency `golang.org/x/sync/errgroup` is named here per the CLAUDE.md "direct deps require ADR amendment" rule; it was already present in `go.sum` as an indirect transitive dependency — Phase 1 promotes it to a direct require.
2.9 Rollout plan¶
| Phase | Scope | Gate |
|---|---|---|
| 1 | Land pkg/observability, slog migration in HTTP edge + Cypher executor, :9090/metrics parity. | Existing Prometheus scrapes keep working against legacy :7474. |
| 2 | Add tracing (OTel SDK, otelhttp, executor + storage spans, OTLP exporter). | make bench-cypher regression ≤2%; OTEL_EXPORTER_OTLP_ENDPOINT opt-in. |
| 3 | Bolt instrumentation + replication propagation + exemplar wiring. | New Bolt benchmarks added; raft soak test stays green for 24h. |
| 4 | Ship Helm ServiceMonitor, NetworkPolicy, Grafana dashboard JSON, alert rules. | Tested against kube-prometheus-stack in CI. |
| 5 | Deprecate data-plane :7474/metrics. Major version bump. | At least one minor release of overlap with Sunset header. |
Each phase is its own PR with its own benchmark proof, per the AGENTS.md "prove value before merging" rule.
3. Consequences¶
3.1 Positive¶
- Customer-grade telemetry. Operators get the same depth of insight that `pg_stat_statements` and Neo4j JMX provide, but in modern formats (Prom + OTLP) and with NornicDB-specific signals (MVCC pressure, embedding queue depth, hybrid-search stage timings).
- One-click trace correlation. Exemplars on histograms make latency outliers actionable in seconds.
- Zero-config K8s integration. `helm install` produces a working `ServiceMonitor` and dashboard against any `kube-prometheus-stack` cluster.
- Cloud-neutral. OTLP push covers the non-Prometheus customers without duplicating instrumentation.
- Defense in depth. Tenant names never leak through unauthenticated surfaces; the metrics port is network-scoped.
- Foundations for SLOs. Every metric has a `result` label, enabling customer-side SLO recording rules (error budget = `sum(...{result="error"}) / sum(...)`).
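The error-budget expression in the last bullet can be materialized as a customer-side recording rule. A minimal sketch, assuming the `nornicdb_cypher_queries_total{result=...}` metric shape from this ADR; the group and record names are illustrative:

```yaml
groups:
  - name: nornicdb-slo                               # illustrative group name
    rules:
      - record: nornicdb:cypher_error_ratio:rate5m   # illustrative record name
        expr: |
          sum(rate(nornicdb_cypher_queries_total{result="error"}[5m]))
            /
          sum(rate(nornicdb_cypher_queries_total[5m]))
```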
3.2 Negative / costs¶
- Real migration work. ~165 `log.Printf` sites become `slog` calls. Each change is mechanical but reviewer-heavy.
- One new port to firewall. Operators upgrading from older NornicDB must open `:9090` cluster-internally (documented in the upgrade notes).
- Memory floor rises. OTel SDK + Prometheus registry add ~10–20 MB to the resident set. Acceptable per the AGENTS.md ≤10% rule for typical 1 GB deployments; we will document the cost for low-memory mode and provide `observability.metrics.enabled=false` for embedded use.
- Public API surface. Metric and span names become a versioned contract. Renames now require a deprecation window.
- Cardinality discipline forever. The label allow-list is the wall between us and a Prometheus cardinality blow-up. Code review must enforce it; a golangci-lint custom analyzer is a follow-up.
3.3 Risks and mitigations¶
| Risk | Mitigation |
|---|---|
| Hot-path overhead from span creation in Cypher executor. | Sample at the trace level, not the span level; pre-bind counters; benchmark gate ≤2%. |
| OTLP endpoint misconfiguration crashes the server on startup. | Init returns a noop provider on exporter error and logs WARN; never fatal. |
| Customer scrapers depending on the legacy :7474/metrics break on removal. | M1 only ships deprecation headers (Deprecation: true, Sunset: <date>) + metric-name parity on :9090. Removal is scheduled for v1.1.0 in a separate follow-up PR alongside #117, giving customers at least one minor-release overlap window to migrate. |
| Trace context across Bolt accidentally breaks driver compatibility. | Use a new extra (nornicdb.traceparent); ignore if absent. Conformance test added against neo4j-go-driver. |
| Label cardinality regression sneaks in via a future PR. | Build a custom golangci-lint analyzer that flags free-form strings reaching prometheus.With* outside an allow-list. |
3.4 Accepted challenges (KD-04..KD-09)¶
The following challenges to ADR-0001's original draft were raised during review, accepted, and folded back into the body of the ADR. This table is the audit trail GOV-01 reviewers verify against: every row points to the pillar where the resolution now lives.
| KD | Challenge | Resolution | Where it lives now |
|---|---|---|---|
| KD-04 | Adopt a router dep (chi/gorilla) for path_template extraction in HTTP middleware? | Reject — Go 1.22+ stdlib http.ServeMux exposes the matched template via r.Pattern; no new dep added. | §2.2 (allowed labels — path_template derived from r.Pattern) |
| KD-05 | Default sampler should be ParentBased to honor upstream sampling decisions | Reject — storage layer must control its own trace volume; default is TraceIDRatioBased(0.01) standalone. ParentBased-style behavior is opt-in only. | §2.4 (Sampler — default + opt-in parent_capped / parent_strict modes) |
| KD-06 | NetworkPolicy alone is sufficient defense for tenant-name leak through :9090 labels | Reject — non-K8s deployments lack NetworkPolicy. Add metrics.tenant_labels_enabled flag with K8s auto-detect (default true on K8s, false otherwise). | §2.2 (label kill-switch) and §2.6 (K8s detect signals + override) |
| KD-07 | Cut over :7474/metrics to the new schema in one release | Reject — published metric names are a customer contract. Phase 1 ships a translation layer that emits legacy names from the new registry; one-cycle deprecation overlap with Sunset header. | §2.6 (legacy data-plane endpoint + Deprecation/Sunset headers) |
| KD-08 | Test the nornicdb.traceparent Bolt extra against the full driver matrix in Phase 1 | Defer — full matrix decision is a Phase 3 entry input; minimum required for v1 is neo4j-go-driver/v5 round-trip. Per-driver conformance doc ships in Phase 8. | §2.4 (Bolt + replication context propagation — per-driver conformance doc DOC-07) |
| KD-09 | Claim "one-click trace correlation" as an unqualified ADR guarantee | Reject — exemplars are emitted unconditionally, but end-to-end correlation requires customer-side Prom --enable-feature=exemplar-storage + Tempo datasource. CI integration test against kube-prometheus-stack + Tempo is the proof. | §2.4 (Exemplars wording) and §3.3 (risk row reframed) |
3.5 What this enables next¶
- Continuous profiling (Pyroscope/Parca) — the second observability ADR builds on the same `pkg/observability` provider.
- Query insights UI — the `cypher.plan_hash` label lets us surface a "top slow plans" panel in the bundled UI without re-running EXPLAIN.
- SLO templates — recording-rule and burn-rate alert templates ship in `docs/operations/`.
- Tenant-aware billing/usage — `nornicdb_cypher_queries_total{database,…}` is the same primitive a future managed-cloud control plane would meter on.
4. References¶
- `pkg/server/server_public.go` (existing handlers under change)
- `pkg/server/server_middleware.go` (`metricsMiddleware` to be replaced)
- `pkg/config/config.go:585` (`LoggingConfig` to be extended)
- `pkg/audit/audit.go` (compliance audit; explicitly not changed)
- `pkg/replication/codec.go` (gains an optional `traceparent` field)
- `pkg/bolt/server.go` (gains lifecycle hooks for `pkg/observability/middleware_bolt.go`)
- `docs/operations/monitoring.md` (will be updated to reference the `:9090` model after Phase 1)
- OpenTelemetry semantic conventions for HTTP and database client spans (https://opentelemetry.io/docs/specs/semconv/database/)
- Prometheus instrumentation best practices (https://prometheus.io/docs/practices/instrumentation/)
4.1 Phase exit references¶
M1 ships as a single PR. All 13 phases of Milestone 1 land on the `otel` branch and merge together via PR #126; there is no per-phase PR. This means the GOV-03 audit trail uses commit ranges on `otel` rather than PR links. Each row below is filled in by the verifier of the corresponding phase when that phase's commits land on `otel` and verification passes — not at PR merge.
The Commit range column is the inclusive SHA range that delivered the phase (e.g., <first-sha>..<last-sha>); the Verified column is the date the verifier signed off; the Verifier column is the reviewer handle who verified (per the §5 reviewer roster). Phase 0's row is filled in below (§5 sign-off → §4.1 row 0) to establish the format.
| Phase | Title | Commit range | Verified | Verifier |
|---|---|---|---|---|
| Phase 0 | ADR Governance & Sign-off | e97ed2b..8e384cf | 2026-04-30 | orneryd |
| Phase 1 | Observability Foundation Skeleton | fa272fc..3ad44d9 | 2026-04-30 | asanabria |
| Phase 2 | Structured Logging Migration | 4201a24..021a983 | 2026-05-02 | asanabria |
| Phase 3 | Metrics Infrastructure & Discipline | 1ba8f01..a62d30a | 2026-05-02 | asanabria |
| Phase 4 | Subsystem Metric Catalog | 523c23d..9a0e574 | 2026-05-03 | asanabria |
| Phase 5 | Legacy Translation Layer & Tenant Flag | 65f8771..973bbd7 | 2026-05-04 | asanabria |
| Phase 6 | Tracing SDK & Core Spans | | | |
| Phase 7 | Replication Codec Versioning | | | |
| Phase 8 | Bolt, Replication & Async Tracing + PII Defense | | | |
| Phase 9 | Kubernetes Helm Chart Integration | | | |
| Phase 10 | Dashboards, Alerts & CI Integration Test | | | |
| Phase 11 | Metrics Reference Doc Generator | | | |
| Phase 12 | Performance Gates & Hardening | | | |
5. Decision sign-off¶
Sign-off coordination: PR #126 is the M1 PR — it carries all 13 phases of work and stays open until M1 is complete. The four-role sign-off below is for ADR-0001 itself (the design contract), not for the M1 release as a whole. Reviewer name, approval date, and PR #126's URL are filled in here as approvals land. Once all four cells are populated, the metadata Status field at the top of this file is flipped from **Proposed** to **Accepted**, and §4.1 row "Phase 0" is filled in with the commit range on otel that delivered the ADR work — making §4.1's first row the format template that subsequent phases append to. The §5 sign-off recorded here is durable text in the ADR; it survives any GitHub-side review state changes (e.g., dismiss_stale_reviews_on_push after subsequent phase commits land).
| Role | Reviewer | Date | PR |
|---|---|---|---|
| Architecture | orneryd | 2026-04-30 | https://github.com/orneryd/NornicDB/pull/126 |
| SRE / Ops | linuxdynasty | 2026-04-30 | https://github.com/orneryd/NornicDB/pull/126 |
| Security | linuxdynasty | 2026-04-30 | https://github.com/orneryd/NornicDB/pull/126 |
| Public API owner | orneryd | 2026-04-30 | https://github.com/orneryd/NornicDB/pull/126 |