Persistence Ops Runbook
Version: 0.12 (M156 — Production Gate)
Updated: 2026-02-26
Related: docs/manual/04_PERSISTENCE.md, ranvier/runtime/src/persistence.rs
1. Adapter Selection Guide
Choose the adapter that matches your durability, latency, and infrastructure requirements.
| Criterion | InMemoryPersistenceStore | PostgresPersistenceStore | RedisPersistenceStore |
|---|---|---|---|
| Durability | None (process-scoped) | Full (ACID) | Configurable (RDB/AOF) |
| Checkpoint latency | ~1 µs | ~1–5 ms | ~0.1–1 ms |
| Resume after crash | ❌ (data lost) | ✅ | ✅ (if persistence enabled) |
| Multi-process scale | ❌ (node-local) | ✅ | ✅ |
| Best for | unit tests, local dev | production workflows | high-throughput, ephemeral checkpoints |
| Feature flag | none | persistence-postgres | persistence-redis |
Decision tree

```
Need to survive process restart?
├── No → InMemoryPersistenceStore (tests only)
└── Yes → Need strong durability (audit trail, compensation log)?
    ├── Yes → PostgresPersistenceStore
    └── No → Need sub-millisecond checkpoint latency?
        ├── Yes → RedisPersistenceStore
        └── No → PostgresPersistenceStore (default-safe choice)
```

PostgreSQL setup
```toml
# Cargo.toml
[features]
persistence-postgres = ["ranvier-runtime/persistence-postgres"]
```

```rust
let pool = sqlx::PgPool::connect(&env::var("DATABASE_URL")?).await?;
let store = PostgresPersistenceStore::new(pool.clone());
store.ensure_schema().await?; // idempotent — call once at startup
```

Redis setup

```toml
# Cargo.toml
[features]
persistence-redis = ["ranvier-runtime/persistence-redis"]
```

```rust
let store = RedisPersistenceStore::connect(&env::var("REDIS_URL")?).await?;
```

2. Crash Recovery Scenarios
Scenario A: Process crash mid-workflow
What happens:
- Axon pipeline is executing step N
- Process crashes (OOM, signal, deployment restart)
- External side effects from steps 0..N-1 are already committed
Recovery procedure:

```rust
// On restart, load the interrupted trace
let trace_id = "order-trace-abc123";
let persisted = store.load(trace_id).await?;
if let Some(trace) = persisted {
    if trace.completion.is_none() {
        // Find the last successfully completed step
        let last_step = trace.events.last().map(|e| e.step).unwrap_or(0);
        let cursor = store.resume(trace_id, last_step).await?;
        // Re-run the pipeline from the cursor position
        tracing::info!(
            trace_id,
            resume_from = cursor.next_step,
            "resuming interrupted workflow"
        );
        // ... re-execute with the cursor
    }
}
```

Prevention: Set PersistenceAutoComplete(true) to ensure complete() is called even on unexpected termination paths.
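The last-step lookup in the recovery procedure is worth unit-testing on its own. A minimal sketch, using a hypothetical StepEvent stand-in for the real persisted envelope type (see ranvier/runtime/src/persistence.rs for the actual shape):

```rust
// Hypothetical stand-in for the persisted event envelope.
struct StepEvent {
    step: u32,
}

// Mirrors the recovery snippet: the last event wins when the log is
// append-ordered, and an empty log means the workflow never checkpointed,
// so resumption starts from step 0.
fn last_completed_step(events: &[StepEvent]) -> u32 {
    events.last().map(|e| e.step).unwrap_or(0)
}

fn main() {
    let events = vec![
        StepEvent { step: 0 },
        StepEvent { step: 1 },
        StepEvent { step: 2 },
    ];
    assert_eq!(last_completed_step(&events), 2);
    assert_eq!(last_completed_step(&[]), 0);
    println!("resume from step {}", last_completed_step(&events));
}
```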
Scenario B: Database failure during checkpoint
What happens: store.append() returns Err because the database is unreachable.
Recommended pattern:

```rust
match store.append(envelope).await {
    Ok(()) => { /* continue */ }
    Err(e) => {
        tracing::error!(
            trace_id = %envelope.trace_id,
            step = envelope.step,
            error = %e,
            "checkpoint failed — applying circuit breaker"
        );
        // Option A: Halt the pipeline (safest — no partial progress)
        return Outcome::Fault(anyhow::anyhow!("persistence unavailable: {e}"));
        // Option B: Continue without checkpoint (only for idempotent workflows)
        // tracing::warn!("continuing without checkpoint");
    }
}
```

Detection: Monitor the checkpoint_failures_total counter (emit via tracing::info! with a metric=checkpoint_failure label).
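The counter itself is not part of the store API. A minimal process-local sketch (function and variable names hypothetical) that a tracing subscriber or metrics exporter could aggregate into checkpoint_failures_total:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Bump the failure counter and return the running total. Call this from the
// Err arm of the append() match shown above.
fn record_checkpoint_failure(counter: &AtomicU64, trace_id: &str) -> u64 {
    let total = counter.fetch_add(1, Ordering::Relaxed) + 1;
    // In the real pipeline this would be the tracing::info! event carrying
    // the metric=checkpoint_failure label.
    eprintln!("metric=checkpoint_failure trace_id={trace_id} total={total}");
    total
}

fn main() {
    let failures = AtomicU64::new(0);
    assert_eq!(record_checkpoint_failure(&failures, "order-trace-abc123"), 1);
    assert_eq!(record_checkpoint_failure(&failures, "order-trace-abc123"), 2);
}
```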
Scenario C: Network partition (partial write)
Symptoms: append() times out; unclear if the write was committed.
Safe strategy: Use idempotent writes — PostgreSQL adapter uses ON CONFLICT DO NOTHING, so retrying is safe.
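The retry loop below sleeps for base × 2^attempt milliseconds after each failed attempt except the last, which faults immediately. A quick check of that arithmetic (function name hypothetical):

```rust
// delay(attempt) = base_ms * 2^attempt, matching the sleep in the retry loop.
fn backoff_delay_ms(base_ms: u64, attempt: u32) -> u64 {
    base_ms * 2u64.pow(attempt)
}

fn main() {
    // Failed attempts 0 and 1 sleep before the next try; attempt 2 faults.
    let schedule: Vec<u64> = (0..2).map(|a| backoff_delay_ms(100, a)).collect();
    assert_eq!(schedule, vec![100, 200]);
    println!("{schedule:?}");
}
```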
```rust
// Retry with exponential backoff
for attempt in 0..3 {
    match store.append(envelope.clone()).await {
        Ok(()) => break,
        Err(e) if attempt < 2 => {
            tracing::warn!(attempt, error = %e, "checkpoint append failed; retrying");
            tokio::time::sleep(Duration::from_millis(100 * 2u64.pow(attempt))).await;
        }
        Err(e) => return Outcome::Fault(e.into()),
    }
}
```

3. Checkpoint Retention Policies
PostgreSQL — TTL-based purge
Implement a periodic purge job for completed traces older than your retention window:
```sql
-- Purge events for traces completed more than 30 days ago
DELETE FROM ranvier_persistence_events
WHERE trace_id IN (
    SELECT trace_id FROM ranvier_persistence_state
    WHERE completion IS NOT NULL
      AND completed_at < NOW() - INTERVAL '30 days'
);

DELETE FROM ranvier_persistence_state
WHERE completion IS NOT NULL
  AND completed_at < NOW() - INTERVAL '30 days';
```

Or use the built-in purge method for compensation idempotency:

```rust
let cutoff_ms = (Utc::now() - Duration::days(30)).timestamp_millis();
let purged = idempotency_store.purge_older_than_ms(cutoff_ms).await?;
tracing::info!(purged_rows = purged, "idempotency cleanup complete");
```

Redis — TTL on checkpoint keys
Set TTL at store creation for automatic key expiry:
```rust
// Checkpoints expire after 7 days
let store = RedisPersistenceStore::with_prefix_and_ttl(
    manager,
    "ranvier:persistence:prod",
    7 * 24 * 60 * 60, // seconds
);
```

Recommended retention windows
| Environment | Retention | Rationale |
|---|---|---|
| dev / staging | 7 days | Debugging window |
| prod (standard) | 30 days | Audit + replay window |
| prod (regulated) | 90–365 days | Compliance requirement |
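Whichever window applies, the purge cutoff is simply "now" minus the window. If chrono is not already a dependency, the cutoff_ms computation can be done with std::time alone (function name hypothetical):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Epoch-millisecond cutoff for a retention window in days; rows or keys
// older than this are eligible for purging.
fn purge_cutoff_ms(now_ms: u64, retention_days: u64) -> u64 {
    now_ms - retention_days * 24 * 60 * 60 * 1000
}

fn main() {
    let now_ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before Unix epoch")
        .as_millis() as u64;
    // 30-day window, matching the prod (standard) row above
    let cutoff = purge_cutoff_ms(now_ms, 30);
    assert_eq!(now_ms - cutoff, 2_592_000_000); // 30 days in milliseconds
    println!("purge anything older than {cutoff}");
}
```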
4. Scaling Considerations
High-concurrency checkpointing (PostgreSQL)
Pool sizing:
```rust
let pool = sqlx::postgres::PgPoolOptions::new()
    .max_connections(20) // match your DB max_connections / expected concurrency
    .min_connections(5)  // keep warm connections ready
    .acquire_timeout(Duration::from_secs(3))
    .connect(&database_url).await?;
```

Partitioning (very high throughput):
For services with >10,000 checkpoints/second, partition the ranvier_persistence_events table by trace_id hash:
```sql
-- Partition events table into 16 buckets
CREATE TABLE ranvier_persistence_events (
    trace_id TEXT NOT NULL,
    ...
) PARTITION BY HASH (trace_id);

CREATE TABLE ranvier_persistence_events_0
    PARTITION OF ranvier_persistence_events
    FOR VALUES WITH (MODULUS 16, REMAINDER 0);
-- ... repeat for REMAINDER 1..15
```

Redis — sharding and replication
Use Redis Cluster for horizontal scaling:
```rust
// Use a cluster-aware client URL
let store = RedisPersistenceStore::connect(
    "redis+cluster://redis-node1:6379,redis-node2:6379,redis-node3:6379"
).await?;
```

Note: Redis Cluster does not support multi-key operations across slots. Ensure all keys for a single trace map to the same slot by including a hash tag in the key, i.e. a {trace_id} segment: the cluster computes the slot from the braced portion only.
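A minimal sketch of the hash-tag idea. The key layout here is hypothetical (the real prefix scheme is whatever with_prefix_and_ttl configures), but the braces behave the same way in any layout:

```rust
// Hypothetical checkpoint key layout. The braces are significant: Redis
// Cluster computes the key slot from the {…} substring only, so all keys
// carrying the same {trace_id} tag land in the same slot and multi-key
// operations on them remain legal.
fn checkpoint_key(prefix: &str, trace_id: &str, step: u32) -> String {
    format!("{prefix}:{{{trace_id}}}:step:{step}")
}

fn main() {
    let key = checkpoint_key("ranvier:persistence:prod", "order-trace-abc123", 3);
    assert_eq!(key, "ranvier:persistence:prod:{order-trace-abc123}:step:3");
    println!("{key}");
}
```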
5. Backup and Restore
PostgreSQL
Backup:
```bash
# Full dump (for small databases)
pg_dump -U postgres -d mydb -t 'ranvier_persistence_*' \
  --format=custom -f ranvier_persistence_$(date +%Y%m%d).dump

# Continuous WAL archiving (for production)
# Configure postgresql.conf:
#   wal_level = replica
#   archive_mode = on
#   archive_command = 'cp %p /mnt/backup/wal/%f'
```

Restore:
```bash
# Restore from dump
pg_restore -U postgres -d mydb ranvier_persistence_20260226.dump

# Verify row counts after restore
psql -U postgres -d mydb -c \
  "SELECT COUNT(*) FROM ranvier_persistence_state WHERE completion IS NOT NULL;"
```

Redis
Backup (RDB snapshot):
```bash
# Trigger immediate save (BGSAVE is asynchronous — poll LASTSAVE and wait
# for it to advance before copying the dump file)
redis-cli BGSAVE

# Copy the dump file
cp /var/lib/redis/dump.rdb /mnt/backup/redis_$(date +%Y%m%d).rdb
```

Restore:
```bash
# Stop Redis, replace dump.rdb, restart
systemctl stop redis
cp /mnt/backup/redis_20260226.rdb /var/lib/redis/dump.rdb
systemctl start redis
```

Key health check after restore:
```bash
# Verify persistence keys exist (SCAN avoids blocking the server, unlike KEYS)
redis-cli --scan --pattern "ranvier:persistence:*" | wc -l
```

6. See Also
- docs/manual/04_PERSISTENCE.md — concept guide + adapter selection
- ranvier/runtime/src/persistence.rs — API reference
- ranvier/examples/persistence-production-demo/ — demo scenarios