Phase 3 — links, monitors, supervision
Goal: fault tolerance the Erlang way — processes fail loudly, failures propagate along links, and supervisors turn a crash into a restart. "Let it crash." Graduates: the fault-recovery scenario to live data.
Why this matters
The point of isolation is that a failure stays contained and visible. Links and monitors are how one process learns another died; trap_exit is how a supervisor turns that signal into a restart instead of dying itself.
What we built (TDD throughout)
- Exit reasons: `ExitReason::{Normal, Killed, Crashed, NoProc}` with `is_abnormal()`. The reason is captured at teardown: the `ProcessGuard` checks `std::thread::panicking()` in its `Drop`, so a panicking body is recorded as `Crashed`, with no `catch_unwind` and no per-call cost.
- `link`/`unlink`: bidirectional. When a linked process exits abnormally, the signal propagates to its peers.
- `spawn_link(parent, body)`: spawn already linked, atomically (no window where the child can die before the link exists).
- `monitor(watcher, target) -> MonitorRef`: one-directional and non-fatal; the watcher just receives a `Received::Down { reference, pid, reason }`.
- `set_trap_exit(pid, true)`: converts incoming exit signals into `Received::Exit { from, reason }` mailbox messages instead of killing the receiver. This is what lets a supervisor survive its children.
- Exit cascades: `exit(pid, reason)` propagates along links with a staged reason, so a crash can tear down a linked subtree exactly like the BEAM.
- Fault-recovery engine (`rusm-bench`): crash-and-restart loop reporting real restarts/sec (~285k/sec).
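The link and `trap_exit` semantics in the list above can be modelled in a few lines. What follows is a toy, single-threaded sketch and not the rusm-otp implementation: `Table`, `Proc`, the `u32` pid type, and the tuple mailbox are hypothetical names introduced for illustration, and treating every reason other than `Normal` as abnormal is an assumption borrowed from Erlang. Only `ExitReason`, `is_abnormal()`, and the shape of `Received::Exit { from, reason }` come from the text.

```rust
use std::collections::{HashMap, HashSet};

type Pid = u32; // hypothetical: the real pid type is not shown here

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ExitReason { Normal, Killed, Crashed, NoProc }

impl ExitReason {
    // Assumption: anything other than Normal counts as abnormal.
    fn is_abnormal(self) -> bool { self != ExitReason::Normal }
}

#[derive(Default)]
struct Proc {
    links: HashSet<Pid>,
    trap_exit: bool,
    // Stands in for Received::Exit { from, reason } messages.
    mailbox: Vec<(Pid, ExitReason)>,
}

#[derive(Default)]
struct Table { procs: HashMap<Pid, Proc> }

impl Table {
    fn spawn(&mut self, pid: Pid) { self.procs.insert(pid, Proc::default()); }

    // Links are bidirectional: each side records the other.
    fn link(&mut self, a: Pid, b: Pid) {
        if let Some(p) = self.procs.get_mut(&a) { p.links.insert(b); }
        if let Some(p) = self.procs.get_mut(&b) { p.links.insert(a); }
    }

    fn set_trap_exit(&mut self, pid: Pid, on: bool) {
        if let Some(p) = self.procs.get_mut(&pid) { p.trap_exit = on; }
    }

    /// Terminate `pid` with `reason` and send the signal along its links.
    fn exit(&mut self, pid: Pid, reason: ExitReason) {
        let Some(dead) = self.procs.remove(&pid) else { return };
        for peer in dead.links {
            let Some(p) = self.procs.get_mut(&peer) else { continue };
            p.links.remove(&pid);
            if p.trap_exit {
                // Trapping peers get a message and keep running.
                p.mailbox.push((pid, reason));
            } else if reason.is_abnormal() {
                // Non-trapping peers die with the same reason (the cascade).
                self.exit(peer, reason);
            }
        }
    }
}

fn main() {
    let mut t = Table::default();
    for pid in 1..=4 { t.spawn(pid); }
    t.set_trap_exit(1, true); // 1 plays the supervisor
    t.link(1, 2);             // supervisor <-> worker
    t.link(2, 3);             // worker <-> helper; 4 stays unlinked

    t.exit(3, ExitReason::Crashed);

    assert!(!t.procs.contains_key(&3)); // crashed
    assert!(!t.procs.contains_key(&2)); // torn down by the cascade
    assert!(t.procs.contains_key(&4));  // isolated, untouched
    assert_eq!(t.procs[&1].mailbox, vec![(2, ExitReason::Crashed)]);
    println!("supervisor mailbox: {:?}", t.procs[&1].mailbox);
}
```

In this model the supervisor never sees the helper's crash directly; it sees the exit of the worker it was linked to, which is the point at which a restart decision would be made.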
How a developer uses it
```rust
runtime.set_trap_exit(supervisor, true);           // survive child exits
let child = runtime.spawn_link(supervisor, body);  // linked at birth
// ... supervisor's body:
if let Received::Exit { from, reason } = ctx.recv().await {
    if reason.is_abnormal() { /* restart `from` */ }
}
```

Design notes: why it's cheap
- No `catch_unwind`. Crash detection rides on `thread::panicking()` in the `Drop` guard already present from Phase 1, so failure capture costs nothing on the happy path.
- Signals reuse the mailbox. `Down`/`Exit` are variants of the same `Received` stream from Phase 2: one ordered queue, no separate signal plumbing.
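Both notes can be illustrated with the standard library alone. The sketch below assumes OS threads and an `std::sync::mpsc` channel in place of the runtime's processes and mailboxes. The `ProcessGuard` here is a hypothetical stand-in for the one described above, `spawn_guarded` and `Received::Message` are names invented for this example, and `ExitReason` is cut down to two variants. The only behaviour relied on is that `std::thread::panicking()` returns true inside `Drop` while the thread is unwinding.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ExitReason { Normal, Crashed }

// One stream carries both ordinary messages and exit signals.
#[derive(Debug, PartialEq, Eq)]
enum Received {
    Message(String), // hypothetical payload variant
    Exit { from: u32, reason: ExitReason },
}

/// Hypothetical stand-in for the ProcessGuard described above.
struct ProcessGuard {
    pid: u32,
    supervisor: Sender<Received>,
}

impl Drop for ProcessGuard {
    fn drop(&mut self) {
        // True only while this thread is unwinding from a panic.
        let reason = if thread::panicking() {
            ExitReason::Crashed
        } else {
            ExitReason::Normal
        };
        // Ignore send errors: the supervisor may already be gone.
        let _ = self.supervisor.send(Received::Exit { from: self.pid, reason });
    }
}

/// Run `body` on a thread with a guard that reports how it ended.
fn spawn_guarded(
    pid: u32,
    supervisor: Sender<Received>,
    body: fn(&Sender<Received>),
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let guard = ProcessGuard { pid, supervisor };
        body(&guard.supervisor);
        // `guard` drops here on normal return, or during unwinding on panic.
    })
}

fn main() {
    let (tx, rx) = channel();

    // A well-behaved process: sends a message, then returns.
    let ok = spawn_guarded(1, tx.clone(), |mailbox| {
        mailbox.send(Received::Message("done".into())).unwrap();
    });
    let _ = ok.join();

    // A crashing process: the panic is recorded by the guard, not caught.
    let bad = spawn_guarded(2, tx.clone(), |_| {
        panic!("boom");
    });
    let _ = bad.join();
    drop(tx); // close the channel so the iterator below terminates

    let got: Vec<Received> = rx.iter().collect();
    assert_eq!(
        got,
        vec![
            Received::Message("done".into()),
            Received::Exit { from: 1, reason: ExitReason::Normal },
            Received::Exit { from: 2, reason: ExitReason::Crashed },
        ]
    );
}
```

The body is never wrapped in `catch_unwind`; the only added work is a single `Drop` that runs once per process exit, and the receiver needs no second channel to learn about failures.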
Concepts introduced
- Links, monitors, supervision, cascades — see links & supervision.
Play with it
```sh
cargo run -p rusm-bench -- run fault-recovery 5   # ~285k restarts/sec
```

Verification
`cargo test -p rusm-otp` is green (link cascade, monitor `Down`, `trap_exit`, crash → `Crashed`, kill → `Killed`, `NoProc`); the fault-recovery scenario is live in the dashboard.
Next
Phase 4: process management — a named registry, timers, and graceful shutdown.