Monitoring

Monitor NornicDB health, performance, and security.

NornicDB's observability surface has two operator-facing layers:

  • Prometheus-compatible metrics exposed from the telemetry listener
  • OpenTelemetry traces exported through the configured OTLP pipeline

The Prometheus registry is the canonical metrics contract. OTEL metric export reads the same underlying registry via the bridge defined in the observability ADR, so the stable names to alert and build dashboards on are the nornicdb_* series.

Endpoints

Endpoint                        Auth Required   Description
:7474/health                    No              Legacy basic health check
:7474/status                    Yes             Detailed status
:7474/metrics                   Yes             Legacy authenticated metrics endpoint
:9090/metrics                   No              Preferred telemetry listener metrics endpoint
:9090/livez                     No              Process liveness
:9090/readyz                    No              Readiness and warm-up phase
:9090/version                   No              Plain-text build version
127.0.0.1:9091/debug/pprof/*    No              Opt-in profiling listener

The :9090 telemetry listener is the preferred scrape target. The :7474/metrics data-plane endpoint remains available for migration compatibility, but the observability ADR treats the telemetry listener as the stable monitoring surface.

OTEL And Prometheus Model

  • Prometheus names are always nornicdb_*.
  • Subsystems are bounded and closed, including http, bolt, cypher, storage, mvcc, embed, search, replication, cache, auth, and knowledge_policy.
  • When OTLP push is enabled, the OTEL pipeline exports the same measurements rather than a second, separately-maintained metric tree.
  • Tenant-scoped labels can be disabled with observability.metrics.tenant_labels_enabled when database labels would be too sensitive or too high-cardinality for a deployment.
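For example, a deployment that needs to drop per-database labels can turn them off in the observability block (a minimal sketch using the same keys shown under Telemetry Configuration below):

observability:
  metrics:
    enabled: true
    # Per-database labels off: avoids sensitive or high-cardinality label values
    tenant_labels_enabled: false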

For the architecture-level contract, see ADR 0001. For the generated catalog, see metrics-reference.md.

Health Check

Basic Health

curl http://localhost:7474/health

Response:

{
  "status": "healthy"
}

Telemetry Liveness And Readiness

curl http://localhost:9090/livez
curl http://localhost:9090/readyz

During warm-up, readyz returns progress JSON rather than just a bare 200 OK with no body.
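One simple way to watch that progress from a shell during startup (a sketch; adjust the host and polling interval as needed):

# Poll readiness every 2 seconds until warm-up completes
watch -n 2 'curl -s http://localhost:9090/readyz'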

Kubernetes Probes

livenessProbe:
  httpGet:
    path: /livez
    port: 9090
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /readyz
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 5

Status Endpoint

Detailed Status

curl http://localhost:7474/status \
  -H "Authorization: Bearer $TOKEN"

Response:

{
  "status": "healthy",
  "server": {
    "version": "0.1.4",
    "uptime": "24h15m30s",
    "started_at": "2024-12-01T00:00:00Z"
  },
  "database": {
    "nodes": 150000,
    "edges": 450000,
    "data_size": "2.5GB"
  },
  "embeddings": {
    "enabled": true,
    "provider": "ollama",
    "model": "mxbai-embed-large",
    "pending": 0
  }
}

Prometheus Metrics

Enable Metrics

The preferred scrape endpoint is the telemetry listener on port 9090.

curl http://localhost:9090/metrics

If you are still scraping the legacy data-plane endpoint during migration:

curl http://localhost:7474/metrics \
  -H "Authorization: Bearer $TOKEN"

Telemetry Configuration

observability:
  metrics:
    enabled: true
    port: 9090
    tenant_labels_enabled: true
  tracing:
    enabled: true
    endpoint: http://otel-collector:4318
    protocol: http/protobuf
  pprof:
    enabled: false
    listen: 127.0.0.1:9091

Environment overrides commonly used in operations:

  • NORNICDB_METRICS_ENABLED
  • NORNICDB_TELEMETRY_PORT
  • NORNICDB_PPROF_ENABLED
  • NORNICDB_PPROF_LISTEN
  • standard OTEL exporter variables such as OTEL_EXPORTER_OTLP_ENDPOINT
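A minimal sketch combining these overrides; the values are illustrative, not defaults:

export NORNICDB_METRICS_ENABLED=true
export NORNICDB_TELEMETRY_PORT=9090
export NORNICDB_PPROF_ENABLED=false
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318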

Example Metrics

# HTTP edge
nornicdb_http_requests_total{method="GET",path_template="/health",status_class="2xx"} 1234
nornicdb_http_request_duration_seconds_bucket{method="POST",path_template="/db/{database}/tx/commit",le="0.05"} 42

# Storage
nornicdb_nodes_total 150000
nornicdb_storage_edges_total 450000
nornicdb_storage_bytes 2684354560

# Cypher
nornicdb_queries_total{op_type="read"} 5678
nornicdb_cypher_active_transactions 3

# Embeddings
nornicdb_embed_processed_total{provider="local",result="success",mode="metal"} 10000
nornicdb_queue_depth 0

# Knowledge policy
nornicdb_knowledge_policy_scored_total{entity_kind="node",result="visible"} 4200
nornicdb_knowledge_policy_suppressions_total{entity_kind="node",reason="below_threshold"} 17

Knowledge Policy Metrics

The knowledge-policy subsystem adds a dedicated nornicdb_knowledge_policy_* family. These metrics are the primary OTEL/Prometheus signals for decay, suppression, on-access mutation work, and deindex churn.

Metric                                                      Type       Use
nornicdb_knowledge_policy_scored_total                      Counter    Total visible, suppressed, and no-decay scoring evaluations
nornicdb_knowledge_policy_decay_score                       Histogram  Score distribution, sampled 1 in 32
nornicdb_knowledge_policy_suppressions_total                Counter    Why suppressions happen
nornicdb_knowledge_policy_access_flush_batch_rows           Histogram  Batch pressure in the access flusher
nornicdb_knowledge_policy_access_flush_duration_seconds     Histogram  Cost of a flush cycle
nornicdb_knowledge_policy_access_flush_buffer_fullness      Gauge      Backpressure tripwire for the flush buffer
nornicdb_knowledge_policy_on_access_mutations_total         Counter    Promotion-policy mutation workload
nornicdb_knowledge_policy_deindex_enqueued_total            Counter    Secondary-index cleanup workload after suppression
nornicdb_knowledge_policy_read_filter_dropped_total         Counter    Read-path drops caused by visibility filtering
nornicdb_knowledge_policy_reconcile_total                   Counter    Schema or startup-driven reconcile activity

Practical interpretation:

  • rising scored_total{result="suppressed"} means the working set is shrinking
  • a flat decay_score histogram usually means the half-life or score anchor is mis-tuned
  • sustained access_flush_buffer_fullness > 0.9 means the flusher is falling behind
  • rising deindex_enqueued_total should correlate with intentionally tighter suppression behavior
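Example PromQL for the suppression share and the backpressure check (sketches; label values follow the metric table above):

# Fraction of scoring evaluations that end up suppressed over the last 5 minutes
sum(rate(nornicdb_knowledge_policy_scored_total{result="suppressed"}[5m]))
/
sum(rate(nornicdb_knowledge_policy_scored_total[5m]))

# Flush backpressure: worst buffer fullness across instances
max(nornicdb_knowledge_policy_access_flush_buffer_fullness)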

For the full per-instrument explanation, enum values, dashboard slices, and runbook, see ../observability/knowledge-policy-metrics.md.

Knowledge Policy Spans

The knowledge-policy subsystem also emits OTEL tracing spans around its expensive lifecycle work:

  • nornicdb.knowledge_policy.flush
  • nornicdb.knowledge_policy.reconcile

Those spans are the right place to investigate long-running flush cycles, schema-change churn, or suppression-recheck work that is not obvious from counters alone.

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'nornicdb'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: '/metrics'

In Kubernetes, prefer a ServiceMonitor or PodMonitor bound to the telemetry port rather than scraping the data plane.
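A minimal ServiceMonitor sketch, assuming the NornicDB Service carries the label app: nornicdb and exposes the 9090 telemetry port under the name telemetry (both are assumptions; adjust to your deployment):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nornicdb
spec:
  selector:
    matchLabels:
      app: nornicdb        # assumed Service label
  endpoints:
    - port: telemetry      # assumed name of the 9090 port on the Service
      path: /metrics
      interval: 30s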

Grafana Dashboard

Example Dashboard JSON

{
  "title": "NornicDB",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(nornicdb_http_requests_total[5m])"
        }
      ]
    },
    {
      "title": "Knowledge Policy Suppressions",
      "type": "graph",
      "targets": [
        {
          "expr": "sum by (reason) (rate(nornicdb_knowledge_policy_suppressions_total[5m]))"
        }
      ]
    },
    {
      "title": "Knowledge Policy Flush Duration p99",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum by (le) (rate(nornicdb_knowledge_policy_access_flush_duration_seconds_bucket[5m])))"
        }
      ]
    }
  ]
}

Alerting

Prometheus Alerts

# alerts.yml
groups:
  - name: nornicdb
    rules:
      - alert: NornicDBDown
        expr: up{job="nornicdb"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "NornicDB is down"

      - alert: HighErrorRate
        expr: rate(nornicdb_http_requests_total{status_class="5xx"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"

      - alert: KnowledgePolicyFlushBackpressure
        expr: nornicdb_knowledge_policy_access_flush_buffer_fullness > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Knowledge-policy access flusher is near saturation"

      - alert: KnowledgePolicySuppressionsSpike
        expr: rate(nornicdb_knowledge_policy_suppressions_total[5m]) > 5
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "Knowledge-policy suppressions are rising"

Logging

Log Levels

export NORNICDB_LOG_LEVEL=info  # debug, info, warn, error

Log Format

{
  "time": "2024-12-01T10:30:00.123Z",
  "level": "info",
  "msg": "Query executed",
  "query_type": "cypher",
  "duration_ms": 23,
  "rows": 100
}

Log Aggregation

logging:
  driver: "fluentd"
  options:
    fluentd-address: "localhost:24224"
    tag: "nornicdb"

Performance Monitoring

pprof For Goroutines And Locking

The pprof listener is separate from the telemetry listener and is disabled by default.

Enable it only when you are actively diagnosing runtime behavior:

export NORNICDB_PPROF_ENABLED=true
export NORNICDB_PPROF_LISTEN=127.0.0.1:9091

When pprof is enabled, NornicDB also enables:

  • mutex profiling with runtime.SetMutexProfileFraction(1)
  • block profiling with runtime.SetBlockProfileRate(1000000)

That makes these endpoints useful immediately:

  • /debug/pprof/goroutine
  • /debug/pprof/mutex
  • /debug/pprof/block
  • /debug/pprof/heap
  • /debug/pprof/profile

Typical commands:

# CPU profile
go tool pprof http://127.0.0.1:9091/debug/pprof/profile?seconds=30

# Goroutine dump
go tool pprof http://127.0.0.1:9091/debug/pprof/goroutine

# Mutex contention
go tool pprof http://127.0.0.1:9091/debug/pprof/mutex

# Blocking profile
go tool pprof http://127.0.0.1:9091/debug/pprof/block

Use pprof together with the OTEL/Prometheus metrics:

  • if nornicdb_knowledge_policy_access_flush_duration_seconds rises, check /debug/pprof/mutex and /debug/pprof/block
  • if nornicdb_knowledge_policy_access_flush_buffer_fullness stays high, inspect goroutine growth and blocked stacks
  • if the system is spending time in repeated reconcile or suppression work, correlate reconcile_total and suppression counters with goroutine and mutex profiles
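A quick workflow sketch for the flush-backpressure case, using only the endpoints documented above:

# 1. Spot-check the backpressure gauge on the telemetry listener
curl -s http://localhost:9090/metrics | grep nornicdb_knowledge_policy_access_flush_buffer_fullness

# 2. If it stays near 1.0, capture contention profiles from the pprof listener
go tool pprof -top http://127.0.0.1:9091/debug/pprof/mutex
go tool pprof -top http://127.0.0.1:9091/debug/pprof/block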

Query Performance

nornicdb serve --log-queries

Search Timing Diagnostics

Enable detailed search timing logs when tuning search latency:

export NORNICDB_SEARCH_LOG_TIMINGS=true
export NORNICDB_SEARCH_DIAG_TIMINGS=true

You will see two complementary log lines per search:

  • ⏱️ Search timing: stage-level search-service timings (vector_ms, bm25_ms, fusion_ms, candidate counts, fallback)
  • 🔎 Search timing db=...: request-path timings (embed_total, search_total, embed_calls, chunk info)
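To isolate just those two lines while tuning, a simple filter on the server output works (a sketch; assumes logs go to stdout/stderr):

nornicdb serve 2>&1 | grep 'Search timing'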

Reference timings (Apple M3 Max, 64GB RAM, Feb 2026):

  • Embedding-query path (best collected):
      • Sequential varied queries: p50 11.28ms, p95 25.84ms
      • Concurrent (8 workers): p50 76.36ms, p95 87.41ms
      • Typical diagnostic pattern: embed_total dominates request time.
  • Fulltext-only path (best collected):
      • Sequential varied queries: p50 0.57ms, p95 2.77ms
      • Diagnostic pattern: embed_calls=0, embed_total=0s, handler internal timing in tens of microseconds.

Slow Query Log

{
  "level": "warn",
  "msg": "Slow query",
  "query": "MATCH (n)-[r*1..5]->(m) RETURN n, r, m",
  "duration_ms": 1500,
  "threshold_ms": 1000
}
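To follow only these entries from a structured log stream (a sketch; the log path is illustrative):

tail -f /var/log/nornicdb/nornicdb.log | jq 'select(.msg == "Slow query")'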

Security Monitoring

Failed Login Alerts

Failed logins are logged and can trigger alerts:

{
  "level": "warn",
  "msg": "Login failed",
  "username": "admin",
  "ip": "192.168.1.100",
  "reason": "invalid_password"
}

Audit Log Monitoring

tail -f /var/log/nornicdb/audit.log | jq 'select(.type == "LOGIN_FAILED")'

Health Check Script

#!/bin/bash

HEALTH=$(curl -s http://localhost:7474/health)
STATUS=$(echo "$HEALTH" | jq -r '.status')

if [ "$STATUS" != "healthy" ]; then
  echo "NornicDB unhealthy: $HEALTH"
  exit 1
fi

echo "NornicDB healthy"
exit 0

See Also