MVCC Lifecycle and Compaction Control Plan
Status: Partially Implemented and Verified
Owner: Storage/Engine
Date: 2026-03-25
Verification Legend
- Implemented and covered by current tests
- [~] Implemented in part, or implemented but not fully covered/finished
- Not implemented yet
Current Verification Summary
- Core lifecycle manager exists and is wired through storage wrappers, DB admin methods, Heimdall metrics, and server admin endpoints
- Operator admin UI exists under Security and drives lifecycle inspection and control end to end
- Core manager package tests are currently passing
- Server lifecycle admin route tests are currently passing
- Real-engine MVCC churn and tombstone compaction tests exist in the storage package
- [~] Advanced operator semantics from the original plan are only partially implemented
- Performance non-regression gate from the original plan has not been completed
1. Purpose
This document defines a single cohesive architecture to control MVCC history growth, prevent compaction starvation, and preserve snapshot correctness under sustained read/write load.
2. Goals
- Bound storage growth while preserving current temporal semantics.
- Make retention pressure actionable, not just observable.
- Avoid global compaction stalls caused by long-running readers.
- Provide predictable operator behavior under normal and emergency pressure.
- Ensure fairness across tenants/workloads.
3. Non-Goals
- No breaking changes to existing snapshot-read semantics.
- No removal of retained-floor behavior.
- No mandatory tiered-history rollout in this phase.
4. Architecture
Verification:
- Introduce one subsystem: MVCCLifecycleManager
- Centralize reader tracking, watermark computation, prune planning, fenced apply, metrics, and pressure policy in the manager/runtime package
- Keep existing APIs (PruneMVCCVersions, RebuildMVCCHeads) as delegating wrappers

- Introduce one subsystem: MVCCLifecycleManager.
- Centralize in manager:
  - reader tracking
  - watermark computation
  - prune planning
  - apply with version fences
  - metrics and pressure policy
- Keep existing APIs (PruneMVCCVersions, RebuildMVCCHeads) as delegating wrappers.
5. Core Data and Safety Model
Verification:
- safe_floor is computed via minimum retained bounds and monotonic floor advancement helpers
- Floor can only advance, never regress
- Pruning is constrained by the retained floor, and explicit chain-hard-cap fallback behavior is implemented
- Snapshot reads below retained floor still return not found

- safe_floor per logical key is computed as:
    safe_floor = min(
        oldest_reader_version,
        ttl_bound_version,
        max_versions_bound_version
    )
    new_floor = monotonic_max(previous_floor, safe_floor)
- Floor can only advance, never regress.
- Pruning and chain-cap actions are only allowed above safe_floor.
- If snapshot version is below floor, return not found (current contract).
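The floor rules above can be illustrated with a minimal Python sketch. It is not the engine's implementation: the function names, the in-memory version map, and the choice to treat a missing reader as imposing no bound are all assumptions made for illustration.

```python
from typing import Dict, Optional


def compute_safe_floor(oldest_reader_version: Optional[int],
                       ttl_bound_version: int,
                       max_versions_bound_version: int) -> int:
    """Minimum of the retention bounds.

    Assumption: None means no active reader, so it imposes no bound.
    """
    bounds = [ttl_bound_version, max_versions_bound_version]
    if oldest_reader_version is not None:
        bounds.append(oldest_reader_version)
    return min(bounds)


def advance_floor(previous_floor: int, safe_floor: int) -> int:
    """monotonic_max: the floor can advance but never regress."""
    return max(previous_floor, safe_floor)


def snapshot_read(versions: Dict[int, str], snapshot_version: int,
                  floor: int) -> Optional[str]:
    """Return the newest value visible at snapshot_version, or None (not found)."""
    if snapshot_version < floor:
        return None  # below the retained floor: not found, per the current contract
    visible = [v for v in versions if v <= snapshot_version]
    return versions[max(visible)] if visible else None
```

For example, with a previous floor of 50, a newly computed safe_floor of 40 leaves the floor at 50, while a safe_floor of 70 advances it to 70.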
6. Reader Watermark Model
Verification:
- Global active-reader boolean has been replaced by an active reader registry in lifecycle manager code
- Reader records include reader ID, snapshot version, start time, and namespace
- Oldest reader version and oldest reader age are computed and exposed
- Watermark remains runtime state rather than persisted state
- Correctness still relies on persisted floor/head invariants, not persisted watermark state

- Replace global active-reader boolean with active reader registry.
- Track per reader:
  - reader ID
  - snapshot version
  - start timestamp
  - tenant/namespace
- Compute:
  - oldest_reader_version
  - oldest_reader_ts
  - oldest_reader_age_seconds
- Watermark is runtime state, not persisted.
- Correctness relies on persisted floor/head invariants, not persisted watermark.
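A reader registry along these lines could look like the following sketch. The class and method names are hypothetical, the registry is single-threaded for brevity, and oldest_reader_ts is assumed to mean the earliest start timestamp among active readers.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ReaderRecord:
    reader_id: str
    snapshot_version: int
    start_ts: float
    namespace: str


class ReaderRegistry:
    """In-memory only: the watermark is runtime state and is never persisted."""

    def __init__(self) -> None:
        self._readers: Dict[str, ReaderRecord] = {}

    def register(self, reader_id: str, snapshot_version: int, namespace: str,
                 now: Optional[float] = None) -> ReaderRecord:
        start = time.time() if now is None else now
        record = ReaderRecord(reader_id, snapshot_version, start, namespace)
        self._readers[reader_id] = record
        return record

    def release(self, reader_id: str) -> None:
        self._readers.pop(reader_id, None)

    def oldest_reader_version(self) -> Optional[int]:
        return min((r.snapshot_version for r in self._readers.values()), default=None)

    def oldest_reader_ts(self) -> Optional[float]:
        return min((r.start_ts for r in self._readers.values()), default=None)

    def oldest_reader_age_seconds(self, now: Optional[float] = None) -> float:
        ts = self.oldest_reader_ts()
        if ts is None:
            return 0.0  # no active readers
        current = time.time() if now is None else now
        return max(0.0, current - ts)
```

Because the registry is rebuilt empty after a restart, nothing here can be relied on for correctness, which matches the rule that correctness rests on the persisted floor/head invariants.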
7. Planner and Apply Execution
Verification:
- Planner reads persisted MVCC heads and version keyspace iteration to build immutable prune plans
- [~] Optional narrowing indexes are not present in the current implementation
- Apply phase validates head-version fences before mutating storage
- Fence mismatch handling skips the key, increments mismatch accounting, and backs off retries
- [~] Iterator-boundary fence checks and explicit snapshot-consistent iterator guarantees are not separately implemented beyond current head/version scan behavior

- Planner source of truth:
  - persisted MVCC head metadata
  - MVCC version keyspace iterator
  - optional narrowing indexes
- Planner creates immutable run plan with per-key fences.
- Apply phase checks version fence before mutation.
- If fence mismatch:
  - skip key
  - increment stale-plan metric
  - requeue for replan with backoff
- Planner iteration guarantee:
  - use snapshot-consistent iterator when available; otherwise enforce iterator-boundary fence checks.
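The fenced apply step can be sketched as below. This is an illustrative model only: PlanEntry, apply_plan, and the dictionaries standing in for persisted heads and version chains are hypothetical, and how the planner chooses which versions to delete is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class PlanEntry:
    """One immutable plan row (hypothetical shape)."""
    key: str
    fence_head_version: int           # head version observed at plan time
    delete_versions: Tuple[int, ...]  # versions the planner selected for removal


def apply_plan(plan: Sequence[PlanEntry],
               heads: Dict[str, int],
               versions: Dict[str, List[int]]) -> Tuple[int, List[str]]:
    """Apply a plan and return (versions_removed, keys_skipped_as_stale)."""
    removed = 0
    stale: List[str] = []
    for entry in plan:
        # Fence check: the head must still match what the planner saw.
        if heads.get(entry.key) != entry.fence_head_version:
            stale.append(entry.key)  # skip; caller counts it and requeues with backoff
            continue
        doomed = set(entry.delete_versions)
        current = versions.get(entry.key, [])
        kept = [v for v in current if v not in doomed]
        removed += len(current) - len(kept)
        versions[entry.key] = kept
    return removed, stale
```

The returned list of stale keys is what a caller would feed into the stale-plan metric and the replan queue.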
7.1 Fence Retry and Invalid-Plan Rules
Verification:
- Initial requeue delay and exponential backoff with jitter up to a capped maximum are implemented
- Per-key retry limit per run is implemented
- Cross-run retry budget within a rolling ten-minute window is implemented
- Hot-contention cooldown is implemented after repeated fence mismatch
- [~] Infinite-loop prevention exists through backoff and cooldown, but the exact "second mismatch same-cycle" rule is not implemented as written

- Requeue policy:
  - initial requeue delay: 100ms
  - exponential backoff with jitter up to 5s cap
- Retry limits:
  - per-key retries per run: 3
  - per-key retries across runs: configurable default 20 within 10 minutes
- Long-term invalidation:
  - if retry budget exceeded, mark key as hot-contention for cooldown window (default 60s)
  - continue servicing other keys (fairness preserved)
- Infinite-loop prevention:
  - no immediate same-cycle replan for same key after second mismatch
  - all retries must pass through backoff queue
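As a rough sketch of these retry rules, the following Python uses the defaults listed above. The jitter scheme (a uniform draw between half the base delay and the base delay) and the action names are assumptions; the plan only specifies "exponential backoff with jitter up to 5s cap".

```python
import random
from typing import Sequence

INITIAL_DELAY_S = 0.1        # 100ms initial requeue delay
MAX_DELAY_S = 5.0            # backoff cap
MAX_RETRIES_PER_RUN = 3
MAX_RETRIES_PER_WINDOW = 20  # cross-run budget
RETRY_WINDOW_S = 600.0       # 10 minutes
COOLDOWN_S = 60.0            # hot-contention cooldown


def backoff_delay(attempt: int, rng: random.Random) -> float:
    """Delay before the given 1-based retry attempt, never above the cap."""
    base = min(MAX_DELAY_S, INITIAL_DELAY_S * 2 ** (attempt - 1))
    return rng.uniform(base / 2, base)  # assumed jitter scheme


def next_action(retries_this_run: int, retry_times: Sequence[float], now: float) -> str:
    """Decide what to do with a key after a fence mismatch."""
    recent = [t for t in retry_times if now - t < RETRY_WINDOW_S]
    if len(recent) >= MAX_RETRIES_PER_WINDOW:
        return "cooldown"  # mark hot-contention for COOLDOWN_S, keep serving other keys
    if retries_this_run >= MAX_RETRIES_PER_RUN:
        return "defer"     # per-run limit reached; wait for a later run
    return "requeue"       # goes through the backoff queue, never an immediate replan
```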
8. Work Prioritization and Fairness
Verification:
- Priority scoring exists and includes debt, tombstone depth, and age signals
- Cost model uses iterator seeks, value-log reads, and bytes rewritten/deleted proxies
- Scheduler orders work by score-over-cost
- Anti-starvation behavior exists through skip-count boosting and reserved slice handling
- Namespace budgets and reserved work slices provide basic multi-tenant fairness controls

- Priority score: combines debt, tombstone depth, and age signals.
- Cost model uses concrete proxies:
  - iterator seeks
  - value-log reads
  - bytes rewritten/deleted
- Scheduler maximizes debt-reduction-per-cost.
- Anti-starvation:
  - priority aging
  - max skip count per key
  - reserved slice for oldest unserved high-debt keys
- Multi-tenant isolation:
  - optional per-namespace lifecycle budget caps
  - minimum guaranteed maintenance slice per namespace
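One way to picture score-over-cost ordering with a max skip count is the sketch below. The weights in score and cost, the max_skips default, and the Candidate fields are invented for illustration and are not taken from the plan or the implementation; namespace budgets are omitted.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    key: str
    debt_bytes: int
    tombstone_depth: int
    age_seconds: float
    seeks: int
    vlog_reads: int
    bytes_rewritten: int
    skip_count: int = 0


def score(c: Candidate) -> float:
    # Hypothetical weights over the debt, tombstone-depth, and age signals.
    return c.debt_bytes + 1000 * c.tombstone_depth + 10 * c.age_seconds


def cost(c: Candidate) -> float:
    # Hypothetical weights over the cost proxies; the +1 avoids division by zero.
    return 1 + c.seeks + 2 * c.vlog_reads + c.bytes_rewritten / 4096


def priority(c: Candidate, max_skips: int = 5) -> float:
    if c.skip_count >= max_skips:
        return math.inf  # anti-starvation: a key skipped too often goes first
    return score(c) / cost(c)


def schedule(candidates: List[Candidate], slots: int, max_skips: int = 5) -> List[Candidate]:
    """Pick up to `slots` keys by descending priority and update skip counts."""
    ranked = sorted(candidates, key=lambda c: priority(c, max_skips), reverse=True)
    chosen, skipped = ranked[:slots], ranked[slots:]
    for c in chosen:
        c.skip_count = 0
    for c in skipped:
        c.skip_count += 1
    return chosen
```

Including age in the score gives a simple form of priority aging, while the skip-count rule bounds how long any single key can be passed over.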
9. Pressure Policy and Backpressure
Verification:
- Pressure bands normal, high, and critical exist
- Hysteresis windows for band transitions are implemented and tested
- Long-snapshot admission tightening is implemented and client warning headers/fields are emitted
- Pinned-bytes metric is implemented and feeds pressure evaluation
- Pressure controller enforcement can reject long snapshots under pressure
- [~] Graceful cancel and hard-kill behavior for already-running snapshots is implemented for active transaction readers, but not yet for every broader shared session/query scope

- Bands:
  - normal
  - high
  - critical
- Hysteresis for band transitions to avoid flapping.
- Reactions:
  - high: rate-limit new long snapshots, emit client warnings
  - critical: reject new long snapshots, keep short snapshots
- Pinned-bytes threshold policy is mandatory.
- Metric with enforcement:
  - mvcc_bytes_pinned_by_oldest_reader
- Snapshot lifetime policy:
  - configurable max snapshot lifetime
  - graceful cancel first
  - hard kill only under sustained critical pressure
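The band reactions for new snapshots can be sketched as a small admission function. The notion of an expected lifetime, the long-snapshot threshold, the token count used for rate limiting, and the warning strings are all assumptions introduced for this example.

```python
from enum import Enum
from typing import Optional, Tuple


class Band(Enum):
    NORMAL = "normal"
    HIGH = "high"
    CRITICAL = "critical"


def admit_snapshot(band: Band, expected_lifetime_s: float,
                   long_threshold_s: float,
                   tokens_available: int) -> Tuple[bool, Optional[str]]:
    """Return (admitted, client_warning) for a new snapshot request."""
    is_long = expected_lifetime_s > long_threshold_s
    if band is Band.NORMAL or not is_long:
        return True, None  # short snapshots are kept in every band
    if band is Band.CRITICAL:
        return False, "mvcc pressure critical: long snapshots rejected"
    # HIGH: rate-limit long snapshots and warn the client either way.
    if tokens_available > 0:
        return True, "mvcc pressure high: long snapshots are rate-limited"
    return False, "mvcc pressure high: long snapshot rate limit reached"
```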
9.1 Baseline Threshold Guidance
Verification:
- Baseline thresholds exist in default lifecycle config
- Enter/exit debounce windows exist in default lifecycle config

Default baseline (operator-tunable):
- high_enter: max(5 GiB, 0.15 * data_dir_free_space)
- high_exit: 0.8 * high_enter
- critical_enter: max(20 GiB, 0.35 * data_dir_free_space)
- critical_exit: 0.8 * critical_enter
Default debounce windows:
- enter window: 30s sustained breach
- exit window: 120s sustained below exit threshold
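The baseline thresholds and debounce windows can be worked through in a short sketch. The function and class names are hypothetical, and the controller models a single band boundary (for example normal/high); a second instance would be needed for the critical boundary.

```python
GIB = 1024 ** 3


def default_thresholds(data_dir_free_space: int) -> dict:
    """Baseline thresholds in bytes, following the defaults listed above."""
    high_enter = max(5 * GIB, 0.15 * data_dir_free_space)
    critical_enter = max(20 * GIB, 0.35 * data_dir_free_space)
    return {
        "high_enter": high_enter,
        "high_exit": 0.8 * high_enter,
        "critical_enter": critical_enter,
        "critical_exit": 0.8 * critical_enter,
    }


class DebouncedBand:
    """Enter after a sustained breach; exit after a sustained stay below exit."""

    def __init__(self, enter: float, exit_: float,
                 enter_window_s: float = 30.0, exit_window_s: float = 120.0) -> None:
        self.enter, self.exit_ = enter, exit_
        self.enter_window_s, self.exit_window_s = enter_window_s, exit_window_s
        self.active = False
        self._since = None  # time at which the pending transition condition began

    def observe(self, pinned_bytes: float, now: float) -> bool:
        if self.active:
            wants_flip, window = pinned_bytes < self.exit_, self.exit_window_s
        else:
            wants_flip, window = pinned_bytes >= self.enter, self.enter_window_s
        if not wants_flip:
            self._since = None  # condition interrupted: restart the window
            return self.active
        if self._since is None:
            self._since = now
        if now - self._since >= window:
            self.active = not self.active
            self._since = None
        return self.active
```

With 100 GiB free, high_enter works out to 15 GiB and high_exit to 12 GiB; with 10 GiB free, the 5 GiB and 20 GiB minimums apply instead.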
10. Snapshot Kill Semantics
Verification:
- [~] Transaction-scoped forced expiration is implemented, but broader query/session-scoped expiration is not
- [~] Safe cancel points now exist at transaction operation and commit boundaries, but not across every shared snapshot consumer
- Deterministic forced-expiration error family is implemented for graceful cancel and hard expiration
- Structured audit events for forced expiration are implemented

- Scope kills per query/session/client only.
- Cancel points occur at safe boundaries (no torn row/page semantics).
- Return deterministic transient/resource-pressure error code.
- Emit structured audit event for each forced expiration.
10.1 Session and Dependent Transaction Effects
Verification:
- [~] Shared-snapshot graceful-cancel semantics are partially implemented through transaction-scoped reader cancellation
- [~] Snapshot-scope hard-expiration behavior is partially implemented through transaction-scoped reader expiration
- Multiplexed-query forced-failure behavior is not implemented

- If multiple queries share one snapshot/session:
  - graceful cancel applies to that query first
  - hard expiration applies to that snapshot/session scope only
- Dependent transactions on other snapshots are unaffected.
- If a client multiplexes many queries on one long-lived snapshot, all queries on that snapshot fail consistently after hard expiration with same error code family.
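The intended scoping can be modelled with a toy sketch. The error code strings and the Snapshot class are hypothetical placeholders, not the engine's actual error family or types.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical members of one forced-expiration error family.
ERR_GRACEFUL_CANCEL = "MVCC_SNAPSHOT_CANCELLED"
ERR_HARD_EXPIRED = "MVCC_SNAPSHOT_EXPIRED"
RUNNING = "running"


@dataclass
class Snapshot:
    snapshot_id: str
    query_status: Dict[str, str] = field(default_factory=dict)

    def start_query(self, query_id: str) -> None:
        self.query_status[query_id] = RUNNING

    def graceful_cancel(self, query_id: str) -> None:
        """Cancel a single query; other queries on the snapshot keep running."""
        if self.query_status.get(query_id) == RUNNING:
            self.query_status[query_id] = ERR_GRACEFUL_CANCEL

    def hard_expire(self) -> None:
        """Fail every still-running query on this snapshot with the same code."""
        for query_id, status in self.query_status.items():
            if status == RUNNING:
                self.query_status[query_id] = ERR_HARD_EXPIRED
```

Because each Snapshot holds its own status map, expiring one snapshot cannot touch queries or transactions that run on another.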
11. Extreme Churn Guardrails
Verification:
- max_chain_hard_cap fallback is enforced in prune planning
- Hard-cap enforcement remains bounded by safe_floor invariants
- Emergency mode can activate from debt-growth slope under critical pressure
- Emergency mode increases compaction budget, tightens snapshot lifetime, and adds separate emergency prioritization logic
- Emergency adjustments honor configured resource ceilings

- Add max_chain_hard_cap fallback.
- Hard cap is bounded by safe_floor invariants.
- Emergency mode trigger on debt-growth slope.
- Emergency mode behavior:
  - increase compaction budget
  - tighten long-snapshot admission
  - prioritize highest debt-yield keys
- Emergency mode must honor global resource ceilings.
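A debt-growth-slope trigger might be evaluated roughly as follows. The endpoint-based slope estimate, the sample format, and the threshold parameter are assumptions; the plan does not specify how the slope is measured.

```python
from typing import Sequence, Tuple


def debt_growth_slope(samples: Sequence[Tuple[float, int]]) -> float:
    """Bytes of debt gained per second between the first and last sample.

    Each sample is (timestamp_seconds, debt_bytes). An endpoint estimate is
    used here for simplicity; a fitted slope would also be reasonable.
    """
    if len(samples) < 2:
        return 0.0
    (t0, d0), (t1, d1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (d1 - d0) / (t1 - t0)


def should_enter_emergency(band: str, samples: Sequence[Tuple[float, int]],
                           slope_threshold_bytes_per_s: float) -> bool:
    """Trigger only under critical pressure and a steep enough debt slope."""
    return band == "critical" and debt_growth_slope(samples) > slope_threshold_bytes_per_s
```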
12. Resource Ceilings
Verification:
- [~] Runtime and IO ceilings are enforced inside each cycle, and CPU share is throttled on a best-effort basis, but full hard CPU enforcement is still not complete
- Limit fields exist for max CPU share, IO budget, and runtime per cycle
- Emergency mode budget adjustments are capped by those limits

- Lifecycle manager must enforce hard max resource share.
- Define limits:
  - max CPU share
  - max IO budget per interval
  - max runtime per cycle
- Emergency mode cannot exceed these limits.
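The rule that emergency mode cannot exceed the ceilings amounts to clamping any boosted budget. In this sketch the Budget and Ceilings types, their field names, and the multiplier are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ceilings:
    max_cpu_share: float
    max_io_bytes_per_interval: int
    max_runtime_per_cycle_s: float


@dataclass(frozen=True)
class Budget:
    cpu_share: float
    io_bytes_per_interval: int
    runtime_per_cycle_s: float


def emergency_budget(normal: Budget, multiplier: float, ceilings: Ceilings) -> Budget:
    """Scale the normal budget up, then clamp every field to its ceiling."""
    return Budget(
        cpu_share=min(normal.cpu_share * multiplier, ceilings.max_cpu_share),
        io_bytes_per_interval=min(int(normal.io_bytes_per_interval * multiplier),
                                  ceilings.max_io_bytes_per_interval),
        runtime_per_cycle_s=min(normal.runtime_per_cycle_s * multiplier,
                                ceilings.max_runtime_per_cycle_s),
    )
```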
13. Metrics
Verification:
- Required pressure metrics are exposed in lifecycle status
- [~] Several lifecycle metrics are defined, exposed, and actively populated from live prune plans, but not all planned metrics are fully populated yet
- Global aggregate metrics are exposed
- [~] Per-namespace metrics exist for aggregated debt and prunable/pruned bytes, but not every planned metric is tracked per namespace

- Required pressure metrics:
  - mvcc_bytes_pinned_by_oldest_reader
  - mvcc_compaction_debt_bytes
  - mvcc_compaction_debt_keys
- Required lifecycle metrics:
  - mvcc_active_snapshot_readers
  - mvcc_oldest_reader_age_seconds
  - mvcc_prunable_bytes_total
  - mvcc_pruned_bytes_total
  - mvcc_tombstone_chain_max_depth
  - mvcc_floor_lag_versions
  - mvcc_prune_run_duration_seconds
  - mvcc_prune_run_keys_scanned_total
  - mvcc_prune_stale_plan_skips_total
- Expose all metrics per namespace and global aggregate.
13.1 Metrics Cadence and Overhead Controls
Verification:
- Debt sampling fraction and periodic full-scan controls are implemented in the planner
- Separate 10s/60s rollup intervals are implemented in lifecycle status output
- Per-key debug export with capped cardinality is implemented through admin debt inspection

- Per-key debt sampling:
  - default sample fraction: 5%
  - full scan every N cycles (default 20)
- Aggregation interval:
  - 10s rollup for hot counters
  - 60s rollup for debt histograms
- Export model:
  - per-namespace aggregates always on
  - per-key detail behind debug flag and capped cardinality
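The sampling controls could be realized with a deterministic hash-based sampler, sketched here. Using CRC32 over the cycle number and key is an assumption chosen so that the sample rotates between cycles; the planner's real sampling method is not described in this document.

```python
import zlib
from typing import Iterable, List


def is_full_scan_cycle(cycle: int, every_n: int = 20) -> bool:
    """Every N-th cycle measures all keys instead of a sample."""
    return cycle % every_n == 0


def is_sampled(key: str, cycle: int, fraction: float = 0.05) -> bool:
    """Deterministic per-cycle sampling (assumed scheme, not the engine's)."""
    bucket = zlib.crc32(f"{cycle}:{key}".encode("utf-8")) % 10_000
    return bucket < round(fraction * 10_000)


def keys_to_measure(keys: Iterable[str], cycle: int,
                    fraction: float = 0.05, every_n: int = 20) -> List[str]:
    keys = list(keys)
    if is_full_scan_cycle(cycle, every_n):
        return keys
    return [k for k in keys if is_sampled(k, cycle, fraction)]
```

Mixing the cycle number into the hash means a key missed in one cycle can still be picked up in a later one, with the periodic full scan as a backstop.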
14. API and Operator Surface
Verification:
- Lifecycle status endpoint exists and returns pressure, debt, reader, and last-run data
- Manual operations exist for prune-now and pause/resume
- Runtime lifecycle schedule control exists through POST /admin/databases/{db}/mvcc/schedule
- Inspect top N debt keys is implemented through GET /admin/databases/{db}/mvcc/debt?limit=N
- Client warning headers/fields for pressure-induced degradations are implemented
- Admin UI exists for database selection, runtime controls, debt inspection, reader inspection, and confirmation-gated lifecycle actions

- Add lifecycle status endpoint:
  - pressure band
  - oldest reader
  - pinned bytes
  - debt bytes/keys
  - last run summary
  - skipped due to fence mismatch
- Add manual operations:
  - trigger prune now
  - pause/resume lifecycle manager
  - inspect top N debt keys
- Add client warning headers/fields for pressure-induced degradations.
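For operators scripting against the two routes named in the verification list, the URLs can be assembled as in this sketch. Only the paths come from this document; the base URL is a placeholder, and request bodies, authentication, and response shapes are not covered here.

```python
from urllib.parse import quote, urlencode


def debt_url(base_url: str, db: str, limit: int) -> str:
    """URL for GET /admin/databases/{db}/mvcc/debt?limit=N."""
    path = f"/admin/databases/{quote(db, safe='')}/mvcc/debt"
    return f"{base_url.rstrip('/')}{path}?{urlencode({'limit': limit})}"


def schedule_url(base_url: str, db: str) -> str:
    """URL for POST /admin/databases/{db}/mvcc/schedule."""
    return f"{base_url.rstrip('/')}/admin/databases/{quote(db, safe='')}/mvcc/schedule"
```

Percent-encoding the database name keeps names that contain spaces or slashes from changing the route.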
15. Replication and Coordination Scope
Verification:
- Current lifecycle behavior is node-local
- Reader watermark state is node-local runtime state
- [~] Replication-safe compaction behavior is the working assumption, but there is no dedicated verification for coordinated replicated lifecycle behavior in this phase
- Control-plane coordinated pressure hints are not implemented

- Default model: lifecycle decisions are local to each node.
- Watermark is node-local because active readers are node-local runtime state.
- In replicated deployments:
  - compaction actions must remain log/order-safe for local store invariants
  - no cross-node watermark coordination required in this phase
- Optional future mode:
  - control-plane coordinated pressure hints only (not shared watermark correctness).
16. Implementation Sequence
Verification:
- Manager scaffolding and config exist
- Reader registry and watermark computation exist
- Planner/apply with version fences exist
- Prioritization, fairness, and cost model exist in initial form
- Pressure bands and admission enforcement exist
- Metrics and status endpoint exist
- Emergency mode and ceiling-aware budget adjustment exist in initial form
- Legacy maintenance methods delegate into lifecycle support
- Feature-flagged rollout path was not implemented; current behavior is config-driven instead

- Create manager scaffolding and config.
- Add reader registry and watermark computation.
- Implement planner/apply with version fences.
- Add prioritization, fairness, and cost model.
- Add pressure bands and enforcement actions.
- Add metrics and status endpoint.
- Add emergency mode and hard ceilings.
- Wire legacy maintenance methods to manager.
- Enable by feature flag, then default-on.
17. Testing Plan
Verification:
- Unit tests exist for floor monotonicity, fence correctness, scheduler behavior, hysteresis transitions, and policy triggers
- [~] Integration coverage exists for active-reader tombstone compaction, high-churn prune bounding, and operator debt inspection, but not all planned contention scenarios are covered
- [~] Reliability coverage exists in part, but restart-with-history and watermark reset scenarios are not fully covered as described here
- Performance tests for lifecycle debt-reduction rate and write-latency impact are not complete

- Unit tests:
  - floor monotonicity
  - fence correctness
  - scoring and fairness behavior
  - hysteresis transitions
  - policy action triggers
- Integration tests:
  - one long reader + high churn
  - many staggered medium readers with slow watermark movement
  - mixed-tenant contention with quotas
  - stale-plan race under concurrent writes
- Reliability tests:
  - restart with active lifecycle history
  - watermark reset behavior
  - emergency mode enter/exit under sustained pressure
- Performance tests:
  - debt reduction rate
  - bounded latency impact on writes
  - no runaway growth under configured policy
18. Acceptance Criteria
Verification:
- Storage growth is bounded under the tested churn scenarios currently covered by storage tests
- No known snapshot correctness regression was introduced in the covered lifecycle changes
- Compaction can continue making progress with active-reader-aware behavior in tested cases
- [~] Operators can inspect pressure and debt, but some planned explanatory metrics are still unpopulated
- [~] One-tenant starvation mitigation exists through namespace budgets, but broad acceptance-level proof is incomplete
- [~] Emergency mode exists, but full stabilization proof against ceilings has not been completed

- Storage growth is bounded by retention policy under sustained churn.
- No snapshot correctness regressions against current behavior.
- Compaction continues making progress with active readers.
- Operators can explain pressure via pinned-bytes and debt metrics.
- One tenant cannot starve all others under lifecycle load.
- Emergency mode stabilizes debt without exceeding resource ceilings.
19. Future Tiered-History Compatibility
Verification:
- [~] The current architecture is compatible in principle, but no tiered-history implementation work has been done yet
This design intentionally prepares tiered temporal storage by making floor advancement, debt accounting, and compaction decisions explicit and policy-driven.
20. Performance Non-Regression Gate (Mandatory)
Verification:
- No before/after lifecycle benchmark suite with confidence intervals has been completed
- No validated p50/p95/p99 regression report has been produced for reads or writes
- No published storage-growth-slope benchmark report exists yet
- This release gate remains open
All lifecycle changes must preserve serving performance.
Release gate:
- No statistically significant regression in p50/p95/p99 read latency on representative workloads.
- No statistically significant regression in p50/p95/p99 write latency on representative workloads.
- Throughput regression budget:
  - reads: <= 3%
  - writes: <= 5%
- Lifecycle CPU/IO overhead must remain within configured ceilings in normal mode.
- Under pressure/emergency mode, latency may degrade, but system must remain within SLO error budget and recover to baseline after pressure subsides.
- Any regression above budgets requires:
  - explicit waiver
  - documented tradeoff
  - rollback plan
Validation protocol:
- Run before/after benchmarks with same dataset, config, and hardware.
- Include cache-warm and cache-busted runs.
- Report p50/p95/p99, throughput, allocs/op, and storage growth slope.
- Publish results with confidence intervals.
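The throughput part of the release gate reduces to simple arithmetic, sketched below with the budgets stated above (reads <= 3%, writes <= 5%). The function names are hypothetical, and the latency checks, which depend on statistical significance, are out of scope for this example.

```python
READ_BUDGET = 0.03   # reads: <= 3% throughput regression
WRITE_BUDGET = 0.05  # writes: <= 5% throughput regression


def throughput_regression(before_ops_per_s: float, after_ops_per_s: float) -> float:
    """Fractional drop in throughput; negative values mean an improvement."""
    if before_ops_per_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return (before_ops_per_s - after_ops_per_s) / before_ops_per_s


def throughput_gate(reads_before: float, reads_after: float,
                    writes_before: float, writes_after: float) -> dict:
    """Return pass/fail per operation class against the stated budgets."""
    return {
        "reads": throughput_regression(reads_before, reads_after) <= READ_BUDGET,
        "writes": throughput_regression(writes_before, writes_after) <= WRITE_BUDGET,
    }
```

For instance, reads dropping from 100,000 to 98,000 ops/s is a 2% regression and passes, while writes dropping from 50,000 to 47,000 ops/s is a 6% regression and would require a waiver, documented tradeoff, and rollback plan.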
Adaptation path:
- debt_bytes splits by tier (hot, warm, cold).
- safe_floor advancement drives tier demotion eligibility.
- planner can optimize for cross-tier debt reduction while preserving current snapshot guarantees.