MVCC Lifecycle and Compaction Control Plan

Status: Partially Implemented and Verified
Owner: Storage/Engine
Date: 2026-03-25

Verification Legend

  • Implemented and covered by current tests
  • [~] Implemented in part, or implemented but not fully covered/finished
  • Not implemented yet

Current Verification Summary

  • Core lifecycle manager exists and is wired through storage wrappers, DB admin methods, Heimdall metrics, and server admin endpoints
  • Operator admin UI exists under Security and drives lifecycle inspection and control end to end
  • Core manager package tests are currently passing
  • Server lifecycle admin route tests are currently passing
  • Real-engine MVCC churn and tombstone compaction tests exist in the storage package
  • [~] Advanced operator semantics from the original plan are only partially implemented
  • Performance non-regression gate from the original plan has not been completed

1. Purpose

This document defines a single cohesive architecture to control MVCC history growth, prevent compaction starvation, and preserve snapshot correctness under sustained read/write load.

2. Goals

  1. Bound storage growth while preserving current temporal semantics.
  2. Make retention pressure actionable, not just observable.
  3. Avoid global compaction stalls caused by long-running readers.
  4. Provide predictable operator behavior under normal and emergency pressure.
  5. Ensure fairness across tenants/workloads.

3. Non-Goals

  1. No breaking changes to existing snapshot-read semantics.
  2. No removal of retained-floor behavior.
  3. No mandatory tiered-history rollout in this phase.

4. Architecture

Verification:

  • Introduce one subsystem: MVCCLifecycleManager
  • Centralize reader tracking, watermark computation, prune planning, fenced apply, metrics, and pressure policy in the manager/runtime package
  • Keep existing APIs (PruneMVCCVersions, RebuildMVCCHeads) as delegating wrappers

  • Introduce one subsystem: MVCCLifecycleManager.
  • Centralize in the manager:
      • reader tracking
      • watermark computation
      • prune planning
      • apply with version fences
      • metrics and pressure policy
  • Keep existing APIs (PruneMVCCVersions, RebuildMVCCHeads) as delegating wrappers.

5. Core Data and Safety Model

Verification:

  • safe_floor is computed via minimum retained bounds and monotonic floor advancement helpers
  • Floor can only advance, never regress
  • Pruning is constrained by the retained floor, and explicit chain-hard-cap fallback behavior is implemented
  • Snapshot reads below retained floor still return not found

  • safe_floor per logical key is computed as:

    ```
    safe_floor = min(
      oldest_reader_version,
      ttl_bound_version,
      max_versions_bound_version
    )
    new_floor = monotonic_max(previous_floor, safe_floor)
    ```

  1. Floor can only advance, never regress.
  2. Pruning and chain-cap actions are only allowed above safe_floor.
  3. If snapshot version is below floor, return not found (current contract).
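The floor computation above can be sketched in Go as follows; `safeFloor` and `monotonicMax` are illustrative names for this document, not the engine's actual identifiers:

```go
package main

import "fmt"

// safeFloor mirrors the formula above: the lowest version that must stay
// visible is the minimum of the oldest reader's snapshot and the two
// retention bounds.
func safeFloor(oldestReader, ttlBound, maxVersionsBound uint64) uint64 {
	floor := oldestReader
	if ttlBound < floor {
		floor = ttlBound
	}
	if maxVersionsBound < floor {
		floor = maxVersionsBound
	}
	return floor
}

// monotonicMax enforces invariant 1: the floor can only advance.
func monotonicMax(previousFloor, candidate uint64) uint64 {
	if candidate > previousFloor {
		return candidate
	}
	return previousFloor
}

func main() {
	// The TTL bound (45) is the minimum, so the floor advances from 40 to 45.
	fmt.Println(monotonicMax(40, safeFloor(50, 45, 60))) // prints 45
	// A stale candidate below the current floor never regresses it.
	fmt.Println(monotonicMax(45, 30)) // prints 45
}
```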

6. Reader Watermark Model

Verification:

  • Global active-reader boolean has been replaced by an active reader registry in lifecycle manager code
  • Reader records include reader ID, snapshot version, start time, and namespace
  • Oldest reader version and oldest reader age are computed and exposed
  • Watermark remains runtime state rather than persisted state
  • Correctness still relies on persisted floor/head invariants, not persisted watermark state

  • Replace global active-reader boolean with active reader registry.
  • Track per reader:
      • reader ID
      • snapshot version
      • start timestamp
      • tenant/namespace
  • Compute:
      • oldest_reader_version
      • oldest_reader_ts
      • oldest_reader_age_seconds
  • Watermark is runtime state, not persisted.
  • Correctness relies on persisted floor/head invariants, not persisted watermark.
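A minimal registry along the lines described above could look like this in Go; the type and method names are hypothetical, not the manager's real API:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// readerRecord holds the per-reader fields listed above.
type readerRecord struct {
	ID        string
	Snapshot  uint64
	StartedAt time.Time
	Namespace string
}

// readerRegistry replaces a single active-reader boolean with a map of
// live readers so the oldest snapshot version can always be computed.
type readerRegistry struct {
	mu      sync.Mutex
	readers map[string]readerRecord
}

func newReaderRegistry() *readerRegistry {
	return &readerRegistry{readers: make(map[string]readerRecord)}
}

func (r *readerRegistry) register(rec readerRecord) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.readers[rec.ID] = rec
}

func (r *readerRegistry) unregister(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.readers, id)
}

// oldestReader returns the minimum snapshot version among live readers;
// ok is false when no reader is active (nothing pins history).
func (r *readerRegistry) oldestReader() (version uint64, ok bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, rec := range r.readers {
		if !ok || rec.Snapshot < version {
			version, ok = rec.Snapshot, true
		}
	}
	return version, ok
}

func main() {
	reg := newReaderRegistry()
	reg.register(readerRecord{ID: "r1", Snapshot: 120, StartedAt: time.Now(), Namespace: "tenant-a"})
	reg.register(readerRecord{ID: "r2", Snapshot: 95, StartedAt: time.Now(), Namespace: "tenant-b"})
	v, _ := reg.oldestReader()
	fmt.Println(v) // prints 95
	reg.unregister("r2")
	v, _ = reg.oldestReader()
	fmt.Println(v) // prints 120
}
```

Because the registry is pure runtime state, losing it on restart is safe: the watermark simply recomputes from whatever readers re-register.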

7. Planner and Apply Execution

Verification:

  • Planner reads persisted MVCC heads and version keyspace iteration to build immutable prune plans
  • [~] Optional narrowing indexes are not present in the current implementation
  • Apply phase validates head-version fences before mutating storage
  • Fence mismatch handling skips the key, increments mismatch accounting, and backs off retries
  • [~] Iterator-boundary fence checks and explicit snapshot-consistent iterator guarantees are not separately implemented beyond current head/version scan behavior

  • Planner source of truth:
      • persisted MVCC head metadata
      • MVCC version keyspace iterator
      • optional narrowing indexes
  • Planner creates immutable run plan with per-key fences.
  • Apply phase checks version fence before mutation.
  • If fence mismatch:
      • skip key
      • increment stale-plan metric
      • requeue for replan with backoff
  • Planner iteration guarantee: use snapshot-consistent iterator when available; otherwise enforce iterator-boundary fence checks.
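The fenced-apply step can be sketched as below; `prunePlanEntry`, `store`, and the map-backed heads are stand-ins for the persisted MVCC head metadata, not the engine's real types:

```go
package main

import "fmt"

// prunePlanEntry is one key in an immutable run plan: the fence is the
// head version observed at plan time.
type prunePlanEntry struct {
	Key       string
	FenceHead uint64
}

// store stands in for persisted MVCC head metadata.
type store struct {
	heads          map[string]uint64
	staleSkipCount int
}

// applyEntry performs the fenced apply: if the head moved after planning,
// the plan is stale for this key, so we skip it, count the mismatch, and
// leave it to be requeued with backoff rather than prune unsafely.
func (s *store) applyEntry(e prunePlanEntry) bool {
	if s.heads[e.Key] != e.FenceHead {
		s.staleSkipCount++
		return false
	}
	// ...prune versions below the retained floor for e.Key here...
	return true
}

func main() {
	s := &store{heads: map[string]uint64{"a": 7, "b": 9}}
	plan := []prunePlanEntry{{"a", 7}, {"b", 8}} // "b" was rewritten after planning
	for _, e := range plan {
		fmt.Println(e.Key, s.applyEntry(e))
	}
	fmt.Println("stale skips:", s.staleSkipCount)
}
```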

7.1 Fence Retry and Invalid-Plan Rules

Verification:

  • Initial requeue delay and exponential backoff with jitter up to a capped maximum are implemented
  • Per-key retry limit per run is implemented
  • Cross-run retry budget within a rolling ten-minute window is implemented
  • Hot-contention cooldown is implemented after repeated fence mismatch
  • [~] Infinite-loop prevention exists through backoff and cooldown, but the exact "second mismatch same-cycle" rule is not implemented as written

  • Requeue policy:
      • initial requeue delay: 100ms
      • exponential backoff with jitter up to 5s cap
  • Retry limits:
      • per-key retries per run: 3
      • per-key retries across runs: configurable, default 20 within 10 minutes
  • Long-term invalidation:
      • if retry budget exceeded, mark key as hot-contention for cooldown window (default 60s)
      • continue servicing other keys (fairness preserved)
  • Infinite-loop prevention:
      • no immediate same-cycle replan for same key after second mismatch
      • all retries must pass through backoff queue

8. Work Prioritization and Fairness

Verification:

  • Priority scoring exists and includes debt, tombstone depth, and age signals
  • Cost model uses iterator seeks, value-log reads, and bytes rewritten/deleted proxies
  • Scheduler orders work by score-over-cost
  • Anti-starvation behavior exists through skip-count boosting and reserved slice handling
  • Namespace budgets and reserved work slices provide basic multi-tenant fairness controls

  • Priority score:

    ```
    priority = f(debt_bytes, tombstone_depth, key_hotness, key_age)
    ```

  • Cost model uses concrete proxies:
      • iterator seeks
      • value-log reads
      • bytes rewritten/deleted
  • Scheduler maximizes debt-reduction-per-cost.
  • Anti-starvation:
      • priority aging
      • max skip count per key
      • reserved slice for oldest unserved high-debt keys
  • Multi-tenant isolation:
      • optional per-namespace lifecycle budget caps
      • minimum guaranteed maintenance slice per namespace
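A score-over-cost ordering with skip-count aging could be sketched like this; the weighting constants are illustrative placeholders, not tuned values from the implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// workItem carries the debt signals and cost proxies from the model above.
type workItem struct {
	Key            string
	DebtBytes      float64
	TombstoneDepth float64
	AgeSeconds     float64
	EstCost        float64 // combined seeks + value-log reads + bytes proxy
	Skips          int     // times skipped so far, for anti-starvation aging
}

// score is debt-reduction-per-cost, boosted by skip count so repeatedly
// deferred keys eventually win a slot.
func score(w workItem) float64 {
	priority := w.DebtBytes + 4096*w.TombstoneDepth + 10*w.AgeSeconds
	aging := 1 + 0.5*float64(w.Skips)
	return priority * aging / w.EstCost
}

// schedule orders work so the highest debt-reduction-per-cost item runs first.
func schedule(items []workItem) []workItem {
	sort.SliceStable(items, func(i, j int) bool { return score(items[i]) > score(items[j]) })
	return items
}

func main() {
	items := schedule([]workItem{
		{Key: "big-but-costly", DebtBytes: 1 << 20, EstCost: 10000},
		{Key: "cheap-win", DebtBytes: 1 << 16, EstCost: 100},
		{Key: "starved", DebtBytes: 1 << 12, EstCost: 100, Skips: 40},
	})
	for _, it := range items {
		fmt.Println(it.Key)
	}
}
```

Note that the "starved" key outranks the larger "cheap-win" key purely through its skip-count boost, which is the anti-starvation behavior the plan calls for.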

9. Pressure Policy and Backpressure

Verification:

  • Pressure bands normal, high, and critical exist
  • Hysteresis windows for band transitions are implemented and tested
  • Long-snapshot admission tightening is implemented and client warning headers/fields are emitted
  • Pinned-bytes metric is implemented and feeds pressure evaluation
  • Pressure controller enforcement can reject long snapshots under pressure
  • [~] Graceful cancel and hard-kill behavior for already-running snapshots is implemented for active transaction readers, but not yet for every broader shared session/query scope

  • Bands:
      • normal
      • high
      • critical
  • Hysteresis for band transitions to avoid flapping.
  • Reactions:
      • high: rate-limit new long snapshots, emit client warnings
      • critical: reject new long snapshots, keep short snapshots
  • Pinned-bytes threshold policy is mandatory.
  • Metric with enforcement:
      • mvcc_bytes_pinned_by_oldest_reader
  • Snapshot lifetime policy:
      • configurable max snapshot lifetime
      • graceful cancel first
      • hard kill only under sustained critical pressure
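Hysteresis can be sketched as a small state machine; the controller below is illustrative (a real one would also layer on the debounce windows from 9.1 before committing a transition):

```go
package main

import "fmt"

type band int

const (
	normal band = iota
	high
	critical
)

func (b band) String() string { return [...]string{"normal", "high", "critical"}[b] }

// bandController applies enter/exit thresholds with hysteresis: each exit
// threshold sits below its enter threshold, so small oscillations around a
// boundary do not flap the band.
type bandController struct {
	current       band
	highEnter     float64
	highExit      float64
	criticalEnter float64
	criticalExit  float64
}

// observe feeds one pressure sample (e.g. pinned bytes) and returns the band.
func (c *bandController) observe(pressure float64) band {
	switch {
	case pressure >= c.criticalEnter:
		c.current = critical
	case c.current == critical && pressure < c.criticalExit:
		c.current = high
	case c.current == normal && pressure >= c.highEnter:
		c.current = high
	case c.current == high && pressure < c.highExit:
		c.current = normal
	}
	return c.current
}

func main() {
	c := &bandController{highEnter: 100, highExit: 80, criticalEnter: 200, criticalExit: 160}
	for _, p := range []float64{90, 120, 90, 70, 250, 170, 150} {
		fmt.Println(p, c.observe(p)) // 90 stays normal; 90 after entering high stays high
	}
}
```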

9.1 Baseline Threshold Guidance

Verification:

  • Baseline thresholds exist in default lifecycle config
  • Enter/exit debounce windows exist in default lifecycle config

Default baseline (operator-tunable):

  1. high_enter: max(5 GiB, 0.15 * data_dir_free_space)
  2. high_exit: 0.8 * high_enter
  3. critical_enter: max(20 GiB, 0.35 * data_dir_free_space)
  4. critical_exit: 0.8 * critical_enter

Default debounce windows:

  1. enter window: 30s sustained breach
  2. exit window: 120s sustained below exit threshold
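The baseline arithmetic above works out as follows; `baselineThresholds` is a sketch of the derivation, not the config loader's actual code:

```go
package main

import "fmt"

const gib = float64(1 << 30)

// baselineThresholds derives the default band thresholds of 9.1 from the
// data directory's free space, in bytes.
func baselineThresholds(freeSpace float64) (highEnter, highExit, criticalEnter, criticalExit float64) {
	highEnter = max(5*gib, 0.15*freeSpace)
	criticalEnter = max(20*gib, 0.35*freeSpace)
	return highEnter, 0.8 * highEnter, criticalEnter, 0.8 * criticalEnter
}

func main() {
	// With 100 GiB free, the fractional terms dominate the fixed minimums:
	he, hx, ce, cx := baselineThresholds(100 * gib)
	fmt.Printf("high: enter %.1f GiB, exit %.1f GiB\n", he/gib, hx/gib)         // 15.0 / 12.0
	fmt.Printf("critical: enter %.1f GiB, exit %.1f GiB\n", ce/gib, cx/gib)     // 35.0 / 28.0
}
```

On small volumes the fixed 5 GiB and 20 GiB minimums take over, so tiny disks do not end up with unusably low thresholds.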

10. Snapshot Kill Semantics

Verification:

  • [~] Transaction-scoped forced expiration is implemented, but broader query/session-scoped expiration is not
  • [~] Safe cancel points now exist at transaction operation and commit boundaries, but not across every shared snapshot consumer
  • Deterministic forced-expiration error family is implemented for graceful cancel and hard expiration
  • Structured audit events for forced expiration are implemented

  • Kills are scoped per query/session/client only.
  • Cancel points occur at safe boundaries (no torn row/page semantics).
  • Return a deterministic transient/resource-pressure error code.
  • Emit a structured audit event for each forced expiration.

10.1 Session and Dependent Transaction Effects

Verification:

  • [~] Shared-snapshot graceful-cancel semantics are partially implemented through transaction-scoped reader cancellation
  • [~] Snapshot-scope hard-expiration behavior is partially implemented through transaction-scoped reader expiration
  • Multiplexed-query forced-failure behavior is not implemented

  • If multiple queries share one snapshot/session:
      • graceful cancel applies to that query first
      • hard expiration applies to that snapshot/session scope only
  • Dependent transactions on other snapshots are unaffected.
  • If a client multiplexes many queries on one long-lived snapshot, all queries on that snapshot fail consistently after hard expiration, with the same error-code family.

11. Extreme Churn Guardrails

Verification:

  • max_chain_hard_cap fallback is enforced in prune planning
  • Hard-cap enforcement remains bounded by safe_floor invariants
  • Emergency mode can activate from debt-growth slope under critical pressure
  • Emergency mode increases compaction budget, tightens snapshot lifetime, and adds separate emergency prioritization logic
  • Emergency adjustments honor configured resource ceilings

  • Add max_chain_hard_cap fallback.
  • Hard cap is bounded by safe_floor invariants.
  • Emergency mode triggers on debt-growth slope.
  • Emergency mode behavior:
      • increase compaction budget
      • tighten long-snapshot admission
      • prioritize highest debt-yield keys
  • Emergency mode must honor global resource ceilings.
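The interaction between the hard cap and the safe floor can be shown on a single version chain; `capChain` is illustrative only:

```go
package main

import "fmt"

// capChain applies the max_chain_hard_cap fallback to one key's version
// chain (sorted ascending). It trims the oldest versions beyond the cap,
// but never a version at or above the retained safe floor, so the cap
// remains bounded by the safe_floor invariant.
func capChain(versions []uint64, hardCap int, safeFloor uint64) []uint64 {
	excess := len(versions) - hardCap
	cut := 0
	for cut < excess && versions[cut] < safeFloor {
		cut++
	}
	return versions[cut:]
}

func main() {
	chain := []uint64{1, 2, 3, 4, 5, 6}
	// The floor permits the trim, so the cap of 3 holds.
	fmt.Println(capChain(chain, 3, 4)) // prints [4 5 6]
	// The floor binds first: the chain stays over the cap rather than break safety.
	fmt.Println(capChain(chain, 3, 3)) // prints [3 4 5 6]
}
```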

12. Resource Ceilings

Verification:

  • [~] Runtime and IO ceilings are enforced inside each cycle, and CPU share is throttled on a best-effort basis, but full hard CPU enforcement is still not complete
  • Limit fields exist for max CPU share, IO budget, and runtime per cycle
  • Emergency mode budget adjustments are capped by those limits

  • Lifecycle manager must enforce hard max resource share.
  • Define limits:
      • max CPU share
      • max IO budget per interval
      • max runtime per cycle
  • Emergency mode cannot exceed these limits.

13. Metrics

Verification:

  • Required pressure metrics are exposed in lifecycle status
  • [~] Several lifecycle metrics are defined, exposed, and actively populated from live prune plans, but not all planned metrics are fully populated yet
  • Global aggregate metrics are exposed
  • [~] Per-namespace metrics exist for aggregated debt and prunable/pruned bytes, but not every planned metric is tracked per namespace

  • Required pressure metrics:
      • mvcc_bytes_pinned_by_oldest_reader
      • mvcc_compaction_debt_bytes
      • mvcc_compaction_debt_keys
  • Required lifecycle metrics:
      • mvcc_active_snapshot_readers
      • mvcc_oldest_reader_age_seconds
      • mvcc_prunable_bytes_total
      • mvcc_pruned_bytes_total
      • mvcc_tombstone_chain_max_depth
      • mvcc_floor_lag_versions
      • mvcc_prune_run_duration_seconds
      • mvcc_prune_run_keys_scanned_total
      • mvcc_prune_stale_plan_skips_total
  • Expose all metrics per namespace and global aggregate.

13.1 Metrics Cadence and Overhead Controls

Verification:

  • Debt sampling fraction and periodic full-scan controls are implemented in the planner
  • Separate 10s/60s rollup intervals are implemented in lifecycle status output
  • Per-key debug export with capped cardinality is implemented through admin debt inspection

  • Per-key debt sampling:
      • default sample fraction: 5%
      • full scan every N cycles (default 20)
  • Aggregation interval:
      • 10s rollup for hot counters
      • 60s rollup for debt histograms
  • Export model:
      • per-namespace aggregates always on
      • per-key detail behind debug flag and capped cardinality
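The sampling cadence could be implemented with a stable hash so a key's membership in the 5% sample is consistent across cycles; this is one plausible sketch, not the planner's actual code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldMeasureKey sketches the cadence above: every fullScanEvery cycles
// the planner measures all keys; in between it measures a stable hash
// sample (default 5%) so per-key debt accounting stays cheap.
func shouldMeasureKey(key string, cycle, fullScanEvery int, sampleFraction float64) bool {
	if cycle%fullScanEvery == 0 {
		return true // periodic full scan
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	return float64(h.Sum32()%1000) < sampleFraction*1000
}

func main() {
	sampled := 0
	for i := 0; i < 1000; i++ {
		if shouldMeasureKey(fmt.Sprintf("key-%d", i), 7, 20, 0.05) {
			sampled++
		}
	}
	fmt.Printf("sampled %d of 1000 keys on a non-full-scan cycle\n", sampled)
}
```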

14. API and Operator Surface

Verification:

  • Lifecycle status endpoint exists and returns pressure, debt, reader, and last-run data
  • Manual operations exist for prune-now and pause/resume
  • Runtime lifecycle schedule control exists through POST /admin/databases/{db}/mvcc/schedule
  • Inspect top N debt keys is implemented through GET /admin/databases/{db}/mvcc/debt?limit=N
  • Client warning headers/fields for pressure-induced degradations are implemented
  • Admin UI exists for database selection, runtime controls, debt inspection, reader inspection, and confirmation-gated lifecycle actions

  • Add lifecycle status endpoint reporting:
      • pressure band
      • oldest reader
      • pinned bytes
      • debt bytes/keys
      • last run summary
      • keys skipped due to fence mismatch
  • Add manual operations:
      • trigger prune now
      • pause/resume lifecycle manager
      • inspect top N debt keys
  • Add client warning headers/fields for pressure-induced degradations.

15. Replication and Coordination Scope

Verification:

  • Current lifecycle behavior is node-local
  • Reader watermark state is node-local runtime state
  • [~] Replication-safe compaction behavior is the working assumption, but there is no dedicated verification of coordinated replicated lifecycle behavior in this phase
  • Control-plane coordinated pressure hints are not implemented

  • Default model: lifecycle decisions are local to each node.
  • Watermark is node-local because active readers are node-local runtime state.
  • In replicated deployments:
      • compaction actions must remain log/order-safe for local store invariants
      • no cross-node watermark coordination required in this phase
  • Optional future mode: control-plane coordinated pressure hints only (not shared watermark correctness).

16. Implementation Sequence

Verification:

  • Manager scaffolding and config exist
  • Reader registry and watermark computation exist
  • Planner/apply with version fences exist
  • Prioritization, fairness, and cost model exist in initial form
  • Pressure bands and admission enforcement exist
  • Metrics and status endpoint exist
  • Emergency mode and ceiling-aware budget adjustment exist in initial form
  • Legacy maintenance methods delegate into lifecycle support
  • Feature-flagged rollout path was not implemented; current behavior is config-driven instead

  • Create manager scaffolding and config.
  • Add reader registry and watermark computation.
  • Implement planner/apply with version fences.
  • Add prioritization, fairness, and cost model.
  • Add pressure bands and enforcement actions.
  • Add metrics and status endpoint.
  • Add emergency mode and hard ceilings.
  • Wire legacy maintenance methods to manager.
  • Enable by feature flag, then default-on.

17. Testing Plan

Verification:

  • Unit tests exist for floor monotonicity, fence correctness, scheduler behavior, hysteresis transitions, and policy triggers
  • [~] Integration coverage exists for active-reader tombstone compaction, high-churn prune bounding, and operator debt inspection, but not all planned contention scenarios are covered
  • [~] Reliability coverage exists in part, but restart-with-history and watermark reset scenarios are not fully covered as described here
  • Performance tests for lifecycle debt-reduction rate and write-latency impact are not complete

  • Unit tests:
      • floor monotonicity
      • fence correctness
      • scoring and fairness behavior
      • hysteresis transitions
      • policy action triggers
  • Integration tests:
      • one long reader + high churn
      • many staggered medium readers with slow watermark movement
      • mixed-tenant contention with quotas
      • stale-plan race under concurrent writes
  • Reliability tests:
      • restart with active lifecycle history
      • watermark reset behavior
      • emergency mode enter/exit under sustained pressure
  • Performance tests:
      • debt reduction rate
      • bounded latency impact on writes
      • no runaway growth under configured policy

18. Acceptance Criteria

Verification:

  • Storage growth is bounded under the tested churn scenarios currently covered by storage tests
  • No known snapshot correctness regression was introduced in the covered lifecycle changes
  • Compaction can continue making progress with active-reader-aware behavior in tested cases
  • [~] Operators can inspect pressure and debt, but some planned explanatory metrics are still unpopulated
  • [~] One-tenant starvation mitigation exists through namespace budgets, but broad acceptance-level proof is incomplete
  • [~] Emergency mode exists, but full stabilization proof against ceilings has not been completed

  • Storage growth is bounded by retention policy under sustained churn.
  • No snapshot correctness regressions against current behavior.
  • Compaction continues making progress with active readers.
  • Operators can explain pressure via pinned-bytes and debt metrics.
  • One tenant cannot starve all others under lifecycle load.
  • Emergency mode stabilizes debt without exceeding resource ceilings.

19. Future Tiered-History Compatibility

Verification:

  • [~] The current architecture is compatible in principle, but no tiered-history implementation work has been done yet

This design intentionally prepares tiered temporal storage by making floor advancement, debt accounting, and compaction decisions explicit and policy-driven.

20. Performance Non-Regression Gate (Mandatory)

Verification:

  • No before/after lifecycle benchmark suite with confidence intervals has been completed
  • No validated p50/p95/p99 regression report has been produced for reads or writes
  • No published storage-growth-slope benchmark report exists yet
  • This release gate remains open

All lifecycle changes must preserve serving performance.

Release gate:

  1. No statistically significant regression in p50/p95/p99 read latency on representative workloads.
  2. No statistically significant regression in p50/p95/p99 write latency on representative workloads.
  3. Throughput regression budget:
      • reads: <= 3%
      • writes: <= 5%
  4. Lifecycle CPU/IO overhead must remain within configured ceilings in normal mode.
  5. Under pressure/emergency mode, latency may degrade, but the system must remain within SLO error budget and recover to baseline after pressure subsides.
  6. Any regression above budgets requires:
      • explicit waiver
      • documented tradeoff
      • rollback plan
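The budget check itself is simple arithmetic; the sketch below shows only the gate comparison, not the statistical-significance protocol described next:

```go
package main

import "fmt"

// regression returns the fractional drop from baseline to candidate
// throughput (positive means slower).
func regression(baseline, candidate float64) float64 {
	return (baseline - candidate) / baseline
}

// withinBudget applies the gate: reads tolerate up to 3%, writes up to 5%.
func withinBudget(kind string, baseline, candidate float64) bool {
	budget := 0.03 // reads
	if kind == "writes" {
		budget = 0.05
	}
	return regression(baseline, candidate) <= budget
}

func main() {
	fmt.Println(withinBudget("reads", 100000, 98000))  // 2% drop: prints true
	fmt.Println(withinBudget("reads", 100000, 96000))  // 4% drop: prints false
	fmt.Println(withinBudget("writes", 100000, 96000)) // 4% drop: prints true
}
```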

Validation protocol:

  1. Run before/after benchmarks with same dataset, config, and hardware.
  2. Include cache-warm and cache-busted runs.
  3. Report p50/p95/p99, throughput, allocs/op, and storage growth slope.
  4. Publish results with confidence intervals.

Adaptation path (toward the tiered history of Section 19):

  1. debt_bytes splits by tier (hot, warm, cold).
  2. safe_floor advancement drives tier demotion eligibility.
  3. planner can optimize for cross-tier debt reduction while preserving current snapshot guarantees.