Persistence Ops Runbook

Version: 0.12 (M156 — Production Gate)
Updated: 2026-02-26
Related: docs/manual/04_PERSISTENCE.md, ranvier/runtime/src/persistence.rs


1. Adapter Selection Guide

Choose the adapter that matches your durability, latency, and infrastructure requirements.

| Criterion | InMemoryPersistenceStore | PostgresPersistenceStore | RedisPersistenceStore |
| --- | --- | --- | --- |
| Durability | None (process-scoped) | Full (ACID) | Configurable (RDB/AOF) |
| Checkpoint latency | ~1 µs | ~1–5 ms | ~0.1–1 ms |
| Resume after crash | ❌ (data lost) | ✅ | ✅ (if persistence enabled) |
| Multi-process scale | ❌ (node-local) | ✅ | ✅ |
| Best for | unit tests, local dev | production workflows | high-throughput, ephemeral checkpoints |
| Feature flag | none | persistence-postgres | persistence-redis |

Decision tree

Need to survive process restart?
├── No  → InMemoryPersistenceStore (tests only)
└── Yes → Need strong durability (audit trail, compensation log)?
          ├── Yes → PostgresPersistenceStore
          └── No  → Need sub-millisecond checkpoint latency?
                    ├── Yes → RedisPersistenceStore
                    └── No  → PostgresPersistenceStore (default-safe choice)
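If adapter choice is driven by deployment configuration, the decision tree above can be encoded directly. This is an illustrative sketch — the enum and function names here are not part of the ranvier API:

```rust
/// Illustrative adapter kinds mirroring the three stores above.
#[derive(Debug, PartialEq)]
enum AdapterKind {
    InMemory,
    Postgres,
    Redis,
}

/// Walk the decision tree: restart survival first, then durability, then latency.
fn choose_adapter(survive_restart: bool, strong_durability: bool, sub_ms_latency: bool) -> AdapterKind {
    if !survive_restart {
        AdapterKind::InMemory // tests and local dev only
    } else if strong_durability {
        AdapterKind::Postgres // audit trail, compensation log
    } else if sub_ms_latency {
        AdapterKind::Redis // high-throughput, ephemeral checkpoints
    } else {
        AdapterKind::Postgres // default-safe choice
    }
}

fn main() {
    // A production workflow with an audit requirement lands on Postgres.
    println!("{:?}", choose_adapter(true, true, false)); // prints "Postgres"
}
```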

PostgreSQL setup

# Cargo.toml
[features]
persistence-postgres = ["ranvier-runtime/persistence-postgres"]

// main.rs: construct the store once at startup
use std::env;

let pool = sqlx::PgPool::connect(&env::var("DATABASE_URL")?).await?;
let store = PostgresPersistenceStore::new(pool.clone());
store.ensure_schema().await?;   // idempotent — call once at startup

Redis setup

# Cargo.toml
[features]
persistence-redis = ["ranvier-runtime/persistence-redis"]

// main.rs: construct the store once at startup
use std::env;

let store = RedisPersistenceStore::connect(&env::var("REDIS_URL")?).await?;

2. Crash Recovery Scenarios

Scenario A: Process crash mid-workflow

What happens:

  1. Axon pipeline is executing step N
  2. Process crashes (OOM, signal, deployment restart)
  3. External side effects from steps 0..N-1 are already committed

Recovery procedure:

// On restart, load the interrupted trace
let trace_id = "order-trace-abc123";
let persisted = store.load(trace_id).await?;

if let Some(trace) = persisted {
    if trace.completion.is_none() {
        // Find the last successfully completed step
        let last_step = trace.events.last().map(|e| e.step).unwrap_or(0);
        let cursor = store.resume(trace_id, last_step).await?;
        
        // Re-run the pipeline from the cursor position
        tracing::info!(
            trace_id,
            resume_from = cursor.next_step,
            "resuming interrupted workflow"
        );
        // ... re-execute with the cursor
    }
}

Prevention: Set PersistenceAutoComplete(true) to ensure complete() is called even on unexpected termination paths.
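The guarantee behind PersistenceAutoComplete can be approximated with a drop guard if you need the same behavior around custom code paths. This is a self-contained sketch, not the ranvier implementation — the closure stands in for the real completion call:

```rust
/// Runs a completion callback when dropped, unless explicitly disarmed.
/// Mirrors the idea behind PersistenceAutoComplete: completion fires even
/// on early returns and on panics that unwind.
struct CompletionGuard<F: FnMut()> {
    armed: bool,
    complete: F,
}

impl<F: FnMut()> CompletionGuard<F> {
    fn new(complete: F) -> Self {
        Self { armed: true, complete }
    }

    /// Call after an explicit, successful completion to suppress the callback.
    fn disarm(&mut self) {
        self.armed = false;
    }
}

impl<F: FnMut()> Drop for CompletionGuard<F> {
    fn drop(&mut self) {
        if self.armed {
            (self.complete)();
        }
    }
}

fn main() {
    let mut completed = false;
    {
        let _guard = CompletionGuard::new(|| completed = true);
        // an early return or panic inside this scope would still drop the guard
    }
    assert!(completed); // the callback ran when the guard went out of scope
}
```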

Scenario B: Database failure during checkpoint

What happens: store.append() returns Err because the database is unreachable.

Recommended pattern:

match store.append(envelope).await {
    Ok(()) => { /* continue */ }
    Err(e) => {
        tracing::error!(
            trace_id = %envelope.trace_id,
            step = envelope.step,
            error = %e,
            "checkpoint failed — applying circuit breaker"
        );
        // Option A (safest; no partial progress): halt the pipeline
        return Outcome::Fault(anyhow::anyhow!("persistence unavailable: {e}"));
        // Option B (only for idempotent workflows): delete the `return`
        // above and continue without the checkpoint instead:
        // tracing::warn!("continuing without checkpoint");
    }
}

Detection: Monitor checkpoint_failures_total counter (emit via tracing::info! with metric=checkpoint_failure label).
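A process-local counter for that metric can be as simple as an atomic. The counter name follows the convention above; wiring it into your metrics exporter or tracing layer is left out of this sketch:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-local failure counter; a metrics exporter would scrape or log this.
static CHECKPOINT_FAILURES_TOTAL: AtomicU64 = AtomicU64::new(0);

/// Increment the counter and return the new total.
fn record_checkpoint_failure() -> u64 {
    // fetch_add returns the previous value, so add 1 for the new total
    CHECKPOINT_FAILURES_TOTAL.fetch_add(1, Ordering::Relaxed) + 1
}

fn main() {
    let total = record_checkpoint_failure();
    println!("checkpoint_failures_total = {total}"); // prints "checkpoint_failures_total = 1"
}
```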

Scenario C: Network partition (partial write)

Symptoms: append() times out; unclear if the write was committed.

Safe strategy: Use idempotent writes — PostgreSQL adapter uses ON CONFLICT DO NOTHING, so retrying is safe.

// Retry with exponential backoff (100 ms, 200 ms, then give up)
for attempt in 0..3 {
    match store.append(envelope.clone()).await {
        Ok(()) => break,
        Err(e) if attempt < 2 => {
            tracing::warn!(error = %e, attempt, "append failed; retrying");
            tokio::time::sleep(Duration::from_millis(100 * 2u64.pow(attempt))).await;
        }
        Err(e) => return Outcome::Fault(e.into()),
    }
}
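Why the retry is safe can be shown with a toy store that deduplicates on (trace_id, step), the way ON CONFLICT DO NOTHING does. This is a sketch for intuition, not the real adapter:

```rust
use std::collections::HashSet;

/// Toy event store keyed on (trace_id, step), mimicking the PostgreSQL
/// adapter's ON CONFLICT DO NOTHING semantics.
struct ToyStore {
    seen: HashSet<(String, u64)>,
}

impl ToyStore {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns true if the event was newly written, false if it was a
    /// duplicate (e.g. a retry after an ambiguous timeout).
    fn append(&mut self, trace_id: &str, step: u64) -> bool {
        self.seen.insert((trace_id.to_string(), step))
    }
}

fn main() {
    let mut store = ToyStore::new();
    assert!(store.append("order-trace-abc123", 3));  // first write lands
    assert!(!store.append("order-trace-abc123", 3)); // retry is a no-op
    println!("retry was deduplicated"); // prints "retry was deduplicated"
}
```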

3. Checkpoint Retention Policies

PostgreSQL — TTL-based purge

Implement a periodic purge job for completed traces older than your retention window:

-- Purge events for traces completed more than 30 days ago
DELETE FROM ranvier_persistence_events
WHERE trace_id IN (
    SELECT trace_id FROM ranvier_persistence_state
    WHERE completion IS NOT NULL
      AND completed_at < NOW() - INTERVAL '30 days'
);

DELETE FROM ranvier_persistence_state
WHERE completion IS NOT NULL
  AND completed_at < NOW() - INTERVAL '30 days';

Or use the built-in purge method for compensation idempotency:

let cutoff_ms = (Utc::now() - Duration::days(30)).timestamp_millis();
let purged = idempotency_store.purge_older_than_ms(cutoff_ms).await?;
tracing::info!(purged_rows = purged, "idempotency cleanup complete");

Redis — TTL on checkpoint keys

Set TTL at store creation for automatic key expiry:

// Checkpoints expire after 7 days
let store = RedisPersistenceStore::with_prefix_and_ttl(
    manager,
    "ranvier:persistence:prod",
    7 * 24 * 60 * 60,  // seconds
);

Recommended retention windows

| Environment | Retention | Rationale |
| --- | --- | --- |
| dev / staging | 7 days | Debugging window |
| prod (standard) | 30 days | Audit + replay window |
| prod (regulated) | 90–365 days | Compliance requirement |

4. Scaling Considerations

High-concurrency checkpointing (PostgreSQL)

Pool sizing:

let pool = sqlx::postgres::PgPoolOptions::new()
    .max_connections(20)     // match your DB max_connections / expected concurrency
    .min_connections(5)      // keep warm connections ready
    .acquire_timeout(Duration::from_secs(3))
    .connect(&database_url).await?;

Partitioning (very high throughput):

For services with >10,000 checkpoints/second, partition the ranvier_persistence_events table by trace_id hash:

-- Partition events table into 16 buckets
CREATE TABLE ranvier_persistence_events (
    trace_id TEXT NOT NULL,
    ...
) PARTITION BY HASH (trace_id);

CREATE TABLE ranvier_persistence_events_0
    PARTITION OF ranvier_persistence_events
    FOR VALUES WITH (MODULUS 16, REMAINDER 0);
-- ... repeat for 1..15
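Rather than hand-writing the remaining fifteen statements, the DDL can be generated. The bucket count and table name follow the example above:

```rust
/// Emit a CREATE TABLE statement for each hash-partition bucket.
fn partition_ddl(buckets: u32) -> Vec<String> {
    (0..buckets)
        .map(|i| {
            format!(
                "CREATE TABLE ranvier_persistence_events_{i}\n    \
                 PARTITION OF ranvier_persistence_events\n    \
                 FOR VALUES WITH (MODULUS {buckets}, REMAINDER {i});"
            )
        })
        .collect()
}

fn main() {
    // Print the full set of 16 partition definitions for the migration file.
    for stmt in partition_ddl(16) {
        println!("{stmt}\n");
    }
}
```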

Redis — sharding and replication

Use Redis Cluster for horizontal scaling:

// Use a cluster-aware client URL
let store = RedisPersistenceStore::connect(
    "redis+cluster://redis-node1:6379,redis-node2:6379,redis-node3:6379"
).await?;

Note: Redis Cluster does not support multi-key operations across slots. Ensure all keys for a single trace map to the same slot by including a hash tag in the key, e.g. a {trace_id} segment: Redis Cluster hashes only the text inside the braces, so every key of a trace lands in the same slot.
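A key builder that applies the hash tag might look like the following. The key layout is illustrative; check the adapter's actual key scheme before relying on it:

```rust
/// Build a per-step checkpoint key whose {trace_id} hash tag pins every
/// key of one trace to the same Redis Cluster slot.
fn checkpoint_key(prefix: &str, trace_id: &str, step: u64) -> String {
    // Redis Cluster hashes only the text between '{' and '}', i.e. the trace id
    format!("{prefix}:{{{trace_id}}}:{step}")
}

fn main() {
    let key = checkpoint_key("ranvier:persistence:prod", "order-trace-abc123", 4);
    println!("{key}"); // prints "ranvier:persistence:prod:{order-trace-abc123}:4"
}
```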


5. Backup and Restore

PostgreSQL

Backup:

# Full dump (for small databases)
pg_dump -U postgres -d mydb -t 'ranvier_persistence_*' \
  --format=custom -f ranvier_persistence_$(date +%Y%m%d).dump

# Continuous WAL archiving (for production)
# Configure postgresql.conf:
# wal_level = replica
# archive_mode = on
# archive_command = 'cp %p /mnt/backup/wal/%f'

Restore:

# Restore from dump
pg_restore -U postgres -d mydb ranvier_persistence_20260226.dump

# Verify row counts after restore
psql -U postgres -d mydb -c \
  "SELECT COUNT(*) FROM ranvier_persistence_state WHERE completion IS NOT NULL;"

Redis

Backup (RDB snapshot):

# Trigger immediate save
redis-cli BGSAVE

# Copy the dump file
cp /var/lib/redis/dump.rdb /mnt/backup/redis_$(date +%Y%m%d).rdb

Restore:

# Stop Redis, replace dump.rdb, restart
systemctl stop redis
cp /mnt/backup/redis_20260226.rdb /var/lib/redis/dump.rdb
systemctl start redis

Key health check after restore:

# Verify persistence keys exist (SCAN avoids blocking Redis the way KEYS can)
redis-cli --scan --pattern "ranvier:persistence:*" | wc -l

6. See Also