#Persistence Ops Runbook

Version: 0.51.0 | Updated: 2026-07-16 | Applies to: ranvier-runtime | Category: Guides

#1. Adapter Selection Guide

Choose the adapter that matches your durability, latency, and infrastructure requirements.

| Criterion | `InMemoryPersistenceStore` | `PostgresPersistenceStore` | `RedisPersistenceStore` |
| --- | --- | --- | --- |
| Durability | None (process-scoped) | Committed database rows; one checkpoint append is not a single multi-statement transaction | Depends on an operator-configured Redis RDB/AOF policy; not implied by the adapter |
| Checkpoint latency | Deployment-dependent; benchmark locally | Deployment- and pool-dependent | Deployment- and network-dependent |
| Resume after process crash | No | Yes, from a committed non-terminal checkpoint | Conditional on Redis persistence and failover policy; not verified by the default fixture |
| Multi-process access | No (node-local) | Shared database | Shared Redis service |
| Best for | unit tests, local dev | production workflows | high-throughput, ephemeral checkpoints |
| Feature flag | none | `persistence-postgres` | `persistence-redis` |

#Decision tree

graph TD
    Q1{"Need to survive\nprocess restart?"} -->|No| MEM["InMemoryPersistenceStore\n(tests only)"]
    Q1 -->|Yes| Q2{"Need strong durability?\n(audit trail, compensation log)"}
    Q2 -->|Yes| PG1["PostgresPersistenceStore"]
    Q2 -->|No| Q3{"Need sub-millisecond\ncheckpoint latency?"}
    Q3 -->|Yes| RED["RedisPersistenceStore"]
    Q3 -->|No| PG2["PostgresPersistenceStore\n(default-safe choice)"]
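
The same tree can be written as a small function, which is handy for asserting the choice in a startup check. This is a minimal sketch: `Requirements`, `Adapter`, and `choose_adapter` are hypothetical names for illustration and are not part of ranvier-runtime.

```rust
/// Adapter choices from the selection table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Adapter {
    InMemory,
    Postgres,
    Redis,
}

/// Hypothetical requirement flags, one per question in the decision tree.
#[derive(Debug, Clone, Copy)]
pub struct Requirements {
    pub survive_restart: bool,
    pub strong_durability: bool,
    pub sub_millisecond_latency: bool,
}

/// Walks the decision tree from the top question to a leaf.
pub fn choose_adapter(req: Requirements) -> Adapter {
    if !req.survive_restart {
        return Adapter::InMemory; // tests only
    }
    if req.strong_durability {
        return Adapter::Postgres; // audit trail, compensation log
    }
    if req.sub_millisecond_latency {
        Adapter::Redis
    } else {
        Adapter::Postgres // default-safe choice
    }
}
```

Note that the Redis leaf only survives a restart when the Redis persistence policy is configured accordingly, as the table states.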

#PostgreSQL setup

# Cargo.toml
[features]
persistence-postgres = ["ranvier-runtime/persistence-postgres"]

use std::env;

// Connect once at startup and share the pool with the store.
let pool = sqlx::PgPool::connect(&env::var("DATABASE_URL")?).await?;
let store = PostgresPersistenceStore::new(pool.clone());
store.ensure_schema().await?;   // idempotent -- call once at startup

#Redis setup

[features]
persistence-redis = ["ranvier-runtime/persistence-redis"]

use std::env;

let store = RedisPersistenceStore::connect(&env::var("REDIS_URL")?).await?;

#2. Crash Recovery Scenarios

#Scenario A: Process crash mid-workflow

What happens:

The Axon pipeline is executing step N
The process crashes (OOM, signal, deployment restart)
External side effects from steps 0..N-1 have already been committed

Recovery procedure:

// On restart, load the interrupted trace
let trace_id = "order-trace-abc123";
let persisted = store.load(trace_id).await?;

if let Some(trace) = persisted {
    if trace.completion.is_none() {
        // Start from the last persisted event; confirm it is safe to replay before resuming
        let last_step = trace.events.last().map(|e| e.step).unwrap_or(0);
        let cursor = store.resume(trace_id, last_step).await?;
        
        // Re-run the pipeline from the cursor position
        tracing::info!(
            trace_id,
            resume_from = cursor.next_step,
            "resuming interrupted workflow"
        );
        // ... re-execute with the cursor
    }
}

PersistenceAutoComplete(true) records normal runtime terminal paths. It cannot execute after process termination and is not crash prevention. Resume only committed events, verify the stored schematic version, and choose a domain-safe cursor; do not assume that every last event is safe to replay.
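
One way to make the "verify the version, then choose a domain-safe cursor" rule explicit is a pure decision function that runs before any call to resume. The sketch below is an assumption-heavy illustration: `StoredEvent`, its `replay_safe` flag, and `ResumeDecision` are hypothetical and do not mirror the real ranvier-runtime types.

```rust
/// Hypothetical, simplified view of a persisted event.
#[derive(Debug, Clone)]
pub struct StoredEvent {
    pub step: u64,
    /// Assumption: the application knows whether re-running this step is safe.
    pub replay_safe: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ResumeDecision {
    /// Resume using this last committed step as the resume point.
    ResumeAt(u64),
    /// No committed events; run the workflow from its first step.
    StartFromBeginning,
    /// Stored and running schematic versions differ; do not resume automatically.
    VersionMismatch,
    /// The last committed step is not safe to replay; escalate to an operator.
    NeedsManualReview(u64),
}

pub fn decide_resume(
    stored_version: &str,
    running_version: &str,
    events: &[StoredEvent],
) -> ResumeDecision {
    if stored_version != running_version {
        return ResumeDecision::VersionMismatch;
    }
    // Pick the highest committed step rather than trusting slice order.
    match events.iter().max_by_key(|e| e.step) {
        None => ResumeDecision::StartFromBeginning,
        Some(last) if last.replay_safe => ResumeDecision::ResumeAt(last.step),
        Some(last) => ResumeDecision::NeedsManualReview(last.step),
    }
}
```

Keeping the decision separate from the store call makes it straightforward to unit test the unsafe cases without a database.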

#Scenario B: Database failure during checkpoint

What happens: store.append() returns an Err because the database is unreachable.

Recommended pattern:

// Clone so the envelope stays available for logging in the Err arm.
match store.append(envelope.clone()).await {
    Ok(()) => { /* continue */ }
    Err(e) => {
        tracing::error!(
            trace_id = %envelope.trace_id,
            step = envelope.step,
            error = %e,
            "checkpoint failed -- applying circuit breaker"
        );
        // Option A: Halt the pipeline (safest -- no partial progress)
        return Outcome::Fault(anyhow::anyhow!("persistence unavailable: {e}"));
        // Option B: Continue without checkpoint (only for idempotent workflows)
        // tracing::warn!("continuing without checkpoint");
    }
}

Detection: The adapter does not currently export a built-in checkpoint failure counter. Emit an application metric such as checkpoint_failures_total at the call site and alert on its rate.
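
A call-site counter can be as small as an atomic integer. The following is a minimal sketch using only the standard library; `CheckpointMetrics` and `counted` are hypothetical helpers, and exporting the value to a metrics backend is left to whatever pipeline the service already uses.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-local counter backing a `checkpoint_failures_total` metric.
pub struct CheckpointMetrics {
    checkpoint_failures_total: AtomicU64,
}

impl CheckpointMetrics {
    pub const fn new() -> Self {
        Self {
            checkpoint_failures_total: AtomicU64::new(0),
        }
    }

    pub fn record_failure(&self) {
        // Relaxed is enough: the counter does not order other memory accesses.
        self.checkpoint_failures_total.fetch_add(1, Ordering::Relaxed);
    }

    pub fn failures(&self) -> u64 {
        self.checkpoint_failures_total.load(Ordering::Relaxed)
    }
}

/// Passes a checkpoint result through unchanged, counting it if it failed.
pub fn counted<T, E>(metrics: &CheckpointMetrics, result: Result<T, E>) -> Result<T, E> {
    if result.is_err() {
        metrics.record_failure();
    }
    result
}
```

In the pattern above, the awaited append result would be wrapped with `counted` before the `match`, so every failure is counted regardless of which option the handler takes.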

#Scenario C: Network partition or ambiguous append

Symptoms: append() times out, and it is unclear whether the write has been committed.

The PostgreSQL state upsert and event insert are separate statements. Event rows are uniquely keyed by (trace_id, step), but the event insert does not silently accept a duplicate. A transport error can therefore leave the caller uncertain whether the event committed. Do not retry blindly, and do not treat a duplicate-key error on retry as proof that your write succeeded; reconcile against the stored event first.

if let Err(error) = store.append(envelope.clone()).await {
    // Reconcile the deterministic step key before deciding whether to retry.
    let committed = store
        .load(&envelope.trace_id)
        .await?
        .is_some_and(|trace| {
            trace.events.iter().any(|event| {
                event.step == envelope.step
                    && event.outcome_kind == envelope.outcome_kind
                    && event.payload_hash == envelope.payload_hash
            })
        });
    if !committed {
        return Outcome::Fault(anyhow::anyhow!(
            "persistence append is unresolved: {error}"
        ));
    }
}
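
The reconciliation has three outcomes rather than two, because a row can exist for the step with different content. The sketch below separates them as a pure function; `EventKey` and `Reconciliation` are hypothetical, and the `String` field types are assumptions that may not match the real envelope.

```rust
/// Hypothetical projection of the fields compared during reconciliation.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EventKey {
    pub step: u64,
    pub outcome_kind: String,
    pub payload_hash: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Reconciliation {
    /// The step exists with identical content: the ambiguous append landed.
    Committed,
    /// No row for the step: the append did not land.
    NotCommitted,
    /// The step exists with different content: never overwrite, escalate.
    Conflict,
}

pub fn reconcile(stored: &[EventKey], attempted: &EventKey) -> Reconciliation {
    match stored.iter().find(|e| e.step == attempted.step) {
        None => Reconciliation::NotCommitted,
        Some(existing) if existing == attempted => Reconciliation::Committed,
        Some(_) => Reconciliation::Conflict,
    }
}
```

Treating `Conflict` as its own case keeps a stale or divergent writer from being mistaken for a successful retry.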

#Scenario D: Verified outage and process recovery

From the ranvier/ checkout, run the opt-in Podman harness:

.\scripts\dependency_failure_smoke_podman.ps1

The M419-RQ8 fixture stops and restarts PostgreSQL 16 and Redis 7. It proves that one existing PostgreSQL store/pool and one existing Redis store/connection manager recover within a bounded retry window. A separate process loads a committed PostgreSQL checkpoint and compensation-idempotency marker, derives the next cursor, and confirms that a completed trace rejects ordinary resume.
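
A bounded retry window can be illustrated with a deadline plus exponential backoff. This is a generic, synchronous sketch and not the fixture's actual code: `retry_within` is a hypothetical helper, and async services would use their runtime's sleep instead of `std::thread::sleep`.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Retries `op` until it succeeds or the next wait would cross the deadline.
pub fn retry_within<T, E>(
    window: Duration,
    initial_delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let deadline = Instant::now() + window;
    let mut delay = initial_delay;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(error) => {
                // Give up with the last error once the window is exhausted.
                if Instant::now() + delay >= deadline {
                    return Err(error);
                }
                sleep(delay);
                delay = delay.saturating_mul(2); // exponential backoff
            }
        }
    }
}
```

The window bounds the total time spent retrying, so an outage longer than the window surfaces as an error instead of blocking the pipeline indefinitely.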

The fixture does not prove Redis durability. It also does not make the external compensation side effect and the local idempotency marker atomic; handlers remain at-least-once around that crash window and must be idempotent.
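
Because a crash can land between the side effect and the marker write, the handler's target has to tolerate a repeated call. The sketch below models that with an in-memory stand-in; `RefundLedger` and `compensation_key` are hypothetical, and the key format is an assumption for illustration only.

```rust
use std::collections::HashMap;

/// Stand-in for an external system that deduplicates by idempotency key.
#[derive(Default)]
pub struct RefundLedger {
    applied: HashMap<String, u64>,
    pub total_refunded_cents: u64,
}

impl RefundLedger {
    /// Applies the refund at most once per key.
    /// Returns true only when this call performed the side effect.
    pub fn refund(&mut self, idempotency_key: &str, amount_cents: u64) -> bool {
        if self.applied.contains_key(idempotency_key) {
            return false; // repeated delivery: no second side effect
        }
        self.applied.insert(idempotency_key.to_owned(), amount_cents);
        self.total_refunded_cents += amount_cents;
        true
    }
}

/// Hypothetical key scheme: one compensation per (trace, step).
pub fn compensation_key(trace_id: &str, step: u64) -> String {
    format!("{trace_id}:compensate:{step}")
}
```

With a deterministic key, a compensation that is re-run after a crash repeats the call but not the effect.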

#3. Checkpoint Retention Policies

#PostgreSQL -- TTL-based purge

The adapter does not expose general trace TTL or a completed_at column. Delete child rows before state rows and derive age from the latest persisted event. Review this example against your table prefix, backup policy, and query plan before scheduling it:

BEGIN;
CREATE TEMP TABLE expired_ranvier_traces ON COMMIT DROP AS
SELECT state.trace_id
FROM ranvier_persistence_state AS state
JOIN (
    SELECT trace_id, MAX(timestamp_ms) AS latest_event_ms
    FROM ranvier_persistence_events
    GROUP BY trace_id
) AS event_age USING (trace_id)
WHERE state.completion IS NOT NULL
  AND event_age.latest_event_ms <
      EXTRACT(EPOCH FROM (NOW() - INTERVAL '30 days')) * 1000;

DELETE FROM ranvier_persistence_interventions
WHERE trace_id IN (SELECT trace_id FROM expired_ranvier_traces);
DELETE FROM ranvier_persistence_events
WHERE trace_id IN (SELECT trace_id FROM expired_ranvier_traces);
DELETE FROM ranvier_persistence_state
WHERE trace_id IN (SELECT trace_id FROM expired_ranvier_traces);
COMMIT;

Compensation-idempotency records are purged separately, using the built-in purge method:

// Assumes chrono's types are in scope: use chrono::{Duration, Utc};
let cutoff_ms = (Utc::now() - Duration::days(30)).timestamp_millis();
let purged = idempotency_store.purge_older_than_ms(cutoff_ms).await?;
tracing::info!(purged_rows = purged, "idempotency cleanup complete");

#Redis checkpoint retention

RedisPersistenceStore does not currently expose a checkpoint TTL builder. with_prefix_and_ttl belongs to the compensation-idempotency store, not the trace store. If checkpoint expiry is mandatory, apply and test a separate Redis lifecycle policy or use PostgreSQL until a typed trace-retention API is available. Never infer trace durability or retention from the idempotency-key TTL.

#Recommended retention windows

| Environment | Retention | Rationale |
| --- | --- | --- |
| dev / staging | 7 days | Debugging window |
| prod (standard) | 30 days | Audit + replay window |
| prod (regulated) | Policy-defined | Legal/compliance review; no universal framework default |
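
The windows above translate into a millisecond cutoff of the kind a purge job would pass along. This is a minimal sketch with hypothetical names (`Environment`, `retention_days`, `cutoff_ms`); the regulated case deliberately returns `None` so that no default is applied silently.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Environment {
    DevOrStaging,
    ProdStandard,
    ProdRegulated,
}

const MS_PER_DAY: i64 = 24 * 60 * 60 * 1000;

/// Retention in days, or None when policy must supply the value.
pub fn retention_days(env: Environment) -> Option<i64> {
    match env {
        Environment::DevOrStaging => Some(7),
        Environment::ProdStandard => Some(30),
        Environment::ProdRegulated => None, // policy-defined, no framework default
    }
}

/// Rows older than the returned epoch-millisecond timestamp are eligible for purge.
pub fn cutoff_ms(now_ms: i64, env: Environment) -> Option<i64> {
    retention_days(env).map(|days| now_ms - days * MS_PER_DAY)
}
```

A scheduler would skip the purge, or load an explicit policy value, when `cutoff_ms` returns `None`.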

#4. Scaling Considerations

#High-concurrency checkpointing (PostgreSQL)

Pool sizing:

use std::time::Duration;

let pool = sqlx::postgres::PgPoolOptions::new()
    .max_connections(20)     // size for expected concurrency; stay below the server's max_connections
    .min_connections(5)      // keep warm connections ready
    .acquire_timeout(Duration::from_secs(3)) // fail fast instead of waiting on a saturated pool
    .connect(&database_url).await?;
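
When several service instances share one database, the per-instance pool limit has to be derived from the server's limit. The arithmetic below is a rough sketch under stated assumptions: `per_instance_pool_size` is a hypothetical helper, and the reserved count (for migrations, monitoring, and admin sessions) is a value you choose.

```rust
/// Splits the server's connection budget evenly across service instances.
/// Returns None when the inputs leave no usable connections.
pub fn per_instance_pool_size(
    server_max_connections: u32,
    reserved_connections: u32,
    instance_count: u32,
) -> Option<u32> {
    if instance_count == 0 {
        return None;
    }
    // checked_sub avoids underflow when the reservation exceeds the budget.
    let available = server_max_connections.checked_sub(reserved_connections)?;
    let share = available / instance_count;
    if share == 0 {
        None
    } else {
        Some(share)
    }
}
```

For example, a server limit of 100 with 20 reserved connections and 4 instances yields 20 connections per instance.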

Partitioning (very high throughput):

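For services with >10,000 checkpoints/second, partition the ranvier_persistence_events table by trace_id hash. PostgreSQL cannot convert an existing regular table into a partitioned one in place, so create the partitioned parent before data is written (or migrate rows into it), and confirm how store.ensure_schema() behaves when the table already exists: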

-- Partition events table into 16 buckets
CREATE TABLE ranvier_persistence_events (
    trace_id TEXT NOT NULL,
    ...
) PARTITION BY HASH (trace_id);

CREATE TABLE ranvier_persistence_events_0
    PARTITION OF ranvier_persistence_events
    FOR VALUES WITH (MODULUS 16, REMAINDER 0);
-- ... repeat for 1..15
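
Rather than writing the remaining statements by hand, they can be generated. This is a small sketch; `partition_ddl` is a hypothetical helper that only formats strings, and the output should be reviewed before it is run against a database.

```rust
/// Builds one CREATE TABLE ... PARTITION OF statement per hash remainder.
pub fn partition_ddl(parent: &str, modulus: u32) -> Vec<String> {
    (0..modulus)
        .map(|remainder| {
            format!(
                "CREATE TABLE {parent}_{remainder}\n    PARTITION OF {parent}\n    FOR VALUES WITH (MODULUS {modulus}, REMAINDER {remainder});"
            )
        })
        .collect()
}
```

Calling `partition_ddl("ranvier_persistence_events", 16)` yields the statement for remainder 0 shown above plus the fifteen that follow it.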

#Redis topology

The built-in adapter uses redis::aio::ConnectionManager with a single Redis client URL. M419-RQ8 verifies reconnect to a restarted single-node fixture; it does not certify Redis Cluster, Sentinel failover, replica consistency, or cross-region recovery. Validate those topologies with a dedicated adapter and failure plan before production use.

#5. Backup and Restore

#PostgreSQL

Backup:

# Full dump (for small databases)
pg_dump -U postgres -d mydb -t 'ranvier_persistence_*' \
  --format=custom -f ranvier_persistence_$(date +%Y%m%d).dump

# Continuous WAL archiving (for production)
# Configure postgresql.conf:
# wal_level = replica
# archive_mode = on
# archive_command = 'test ! -f /mnt/backup/wal/%f && cp %p /mnt/backup/wal/%f'   # never overwrite an archived segment

Restore:

# Restore from dump
pg_restore -U postgres -d mydb ranvier_persistence_20260226.dump

# Verify row counts after restore
psql -U postgres -d mydb -c \
  "SELECT COUNT(*) FROM ranvier_persistence_state WHERE completion IS NOT NULL;"

#Redis

Backup (RDB snapshot):

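# Trigger a background save. BGSAVE returns immediately, so confirm the
# snapshot has finished (LASTSAVE advances, or INFO persistence reports
# rdb_bgsave_in_progress:0) before copying the file.
redis-cli BGSAVE
redis-cli LASTSAVE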

# Copy the dump file
cp /var/lib/redis/dump.rdb /mnt/backup/redis_$(date +%Y%m%d).rdb

Restore:

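# Stop Redis, replace dump.rdb, restart.
# If AOF is enabled (appendonly yes), Redis loads the AOF on startup instead
# of dump.rdb, so account for that before restoring.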
systemctl stop redis
cp /mnt/backup/redis_20260226.rdb /var/lib/redis/dump.rdb
systemctl start redis

Key health check after restore:

# Verify persistence keys exist
redis-cli --scan --pattern "ranvier:persistence:*" | wc -l

#6. See Also

`docs/manual/04_PERSISTENCE.md` -- concept guide and adapter selection
`ranvier/runtime/src/persistence.rs` -- API reference
`ranvier/examples/persistence-production-demo/` -- demo scenarios