Phase 9 — distributed clusters + live attach
Goal: make RUSM processes message each other across nodes — distributed Erlang, for WebAssembly. Several independent nodes connect into a secure cluster; a process reaches a service by name without knowing which node it lives on; a node can spawn work on a peer and see what a peer is running. All over an encrypted, authenticated transport.
Why this matters
A single machine has limits; horizontal scale needs cheap, secure node-to-node messaging. RUSM already had the actor model on one node (Phases 1–8). Phase 9 adds the wire between nodes — and keeps the programming model identical: the same send, the same registry, now spanning machines.
What we built (TDD throughout)
The new Wasm-free rusm-cluster crate, layered over rusm-otp (no Wasmtime — the hard boundary holds; distribution is an actor-core concern, not a Wasm one).
- QUIC + TLS transport. A ClusterNode wraps a Runtime with a QUIC endpoint (quinn, rustls + ring). A cluster shares one self-signed Identity, a pre-shared cluster certificate: a node only completes a handshake with a peer presenting the same cert, and the client pins it as its sole trust root. A wrong certificate is rejected at the handshake. (Per-node certs under a cluster CA are a later refinement; the transport seam doesn't change.)
- Per-peer streams. The handshake's bidirectional stream stays open as a control channel (node-name exchange, then registry gossip and a control-plane RPC); every message rides its own uni-stream, so cross-node messages never head-of-line-block each other, which is the reason to reach for QUIC over TCP.
- Cross-node messaging. node.send("london", "greeter", bytes) routes to the process registered as greeter on london; a RemoteNode handle from connect does the same without naming the node.
- Global registry. register_global(name, pid) registers locally and gossips ownership to every peer; a freshly-connected peer is bootstrapped with the names we already own, and late registrations broadcast. send_global(name, bytes) resolves the owning node and routes there, giving location transparency. whereis_global returns the owner. When a peer's control channel closes, its connection and the global names it owned are pruned.
- Remote spawn. A node registers the work it can build with register_spawnable(name, factory); a peer calls spawn_remote(node, factory, args) and gets back the pid spawned there. The cluster can't ship a closure across the wire, so a node spawns only what it has been taught, which is explicit and friendly to capability control.
- Live attach. remote_pids(node) lists the processes alive on a peer, the cluster primitive behind attaching to a running node. Both remote spawn and live attach ride one request/reply control-plane RPC (correlated by id, awaited on a oneshot), handled off the gossip loop so a slow op never stalls a peer's registry sync.
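The global-registry bookkeeping described above (record ownership, bootstrap a new peer with the names already owned, resolve by name, prune on disconnect) can be modelled in a few lines of plain Rust. This is a minimal single-process sketch, not rusm-cluster's implementation: GlobalRegistry, record, owned_by, whereis, and prune_node are hypothetical names, String and u64 stand in for node names and pids, and the first-owner-wins rule for duplicate names is an assumption the text does not specify.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for illustration; not rusm-cluster's real types.
type NodeName = String;
type Pid = u64;

/// One node's view of who owns each global name.
#[derive(Default)]
struct GlobalRegistry {
    /// name -> (owning node, pid on that node)
    owners: HashMap<String, (NodeName, Pid)>,
}

impl GlobalRegistry {
    /// Record ownership, whether registered locally or learned via gossip.
    /// Assumption: the first owner wins; a duplicate name is refused.
    fn record(&mut self, name: &str, node: &str, pid: Pid) -> bool {
        if self.owners.contains_key(name) {
            return false;
        }
        self.owners.insert(name.to_string(), (node.to_string(), pid));
        true
    }

    /// Names owned by `node`, sorted: what that node would send to
    /// bootstrap a freshly connected peer.
    fn owned_by(&self, node: &str) -> Vec<(String, Pid)> {
        let mut names: Vec<(String, Pid)> = self
            .owners
            .iter()
            .filter(|(_, (owner, _))| owner.as_str() == node)
            .map(|(name, (_, pid))| (name.clone(), *pid))
            .collect();
        names.sort();
        names
    }

    /// Resolve a name to its owning node and pid.
    fn whereis(&self, name: &str) -> Option<&(NodeName, Pid)> {
        self.owners.get(name)
    }

    /// Drop every name owned by a peer whose control channel closed.
    /// Returns how many names were removed.
    fn prune_node(&mut self, node: &str) -> usize {
        let before = self.owners.len();
        self.owners.retain(|_, (owner, _)| owner.as_str() != node);
        before - self.owners.len()
    }
}

fn main() {
    // The view held by node "paris" in a hypothetical two-node cluster.
    let mut paris = GlobalRegistry::default();
    paris.record("metrics", "paris", 7); // registered locally
    paris.record("greeter", "london", 42); // learned via gossip from london

    // What paris would send to a peer that has just connected.
    let bootstrap = paris.owned_by("paris");
    assert_eq!(bootstrap, vec![("metrics".to_string(), 7)]);

    // Resolving by name tells the sender which node to route to.
    let (node, pid) = paris.whereis("greeter").expect("greeter is registered");
    println!("greeter -> pid {pid} on {node}");

    // london's control channel closes: its names disappear, ours stay.
    let pruned = paris.prune_node("london");
    println!("pruned {pruned} name(s)");
    assert!(paris.whereis("greeter").is_none());
    assert!(paris.whereis("metrics").is_some());
}
```

In a real cluster each node would hold its own copy of such a map and keep it in sync over the control channel; the sketch covers only the bookkeeping on one node.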
Performance
The cluster_fanout benchmark measures the transport on loopback, in two honest phases (latency separately from throughput — under saturation, latency is queue-dominated, so one number would mislead):
- unloaded cross-node round-trip latency: ~39µs p50 / ~112µs p99;
- saturation throughput: ~280k round-trips/sec ≈ ~550k cross-node messages/sec.
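For readers unfamiliar with how p50/p99 figures like these are derived, the sketch below computes nearest-rank percentiles from a set of round-trip samples and converts round-trips to messages. It is not the cluster_fanout benchmark's code: the percentile and messages_per_sec helpers are hypothetical, the sample values are illustrative rather than measured, and counting a round-trip as two messages (one request, one reply) is an assumption consistent with the rounded figures above.

```rust
use std::time::Duration;

/// Nearest-rank percentile over an already sorted slice.
/// `p` must be in (0, 100]; returns None for empty input or invalid `p`.
fn percentile(sorted: &[Duration], p: f64) -> Option<Duration> {
    if sorted.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    // Nearest rank: the smallest sample with at least p% of samples at or below it.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Assumption: each round-trip is one request plus one reply,
/// so two cross-node messages.
fn messages_per_sec(round_trips_per_sec: f64) -> f64 {
    round_trips_per_sec * 2.0
}

fn main() {
    // Illustrative samples in microseconds, NOT measurements from cluster_fanout.
    let mut samples: Vec<Duration> = [41u64, 37, 39, 112, 38, 40, 36, 39, 45, 39]
        .iter()
        .map(|&us| Duration::from_micros(us))
        .collect();
    samples.sort();

    let p50 = percentile(&samples, 50.0).unwrap();
    let p99 = percentile(&samples, 99.0).unwrap();
    println!("p50 = {:?}, p99 = {:?}", p50, p99);

    // Under the two-messages-per-round-trip assumption.
    println!("{} messages/sec", messages_per_sec(280_000.0));
}
```

A single slow sample barely moves p50 but dominates p99, which is why the two are reported together, and why unloaded latency is measured apart from saturation throughput.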
Concepts introduced
- Distributed nodes — connecting, location transparency, the wire shape.
- Live attach — hooking into a running node.
- Distributed model — the full reference.
Play with it
# The smallest two-node cluster (cross-node send, global registry, live attach):
cargo run -p rusm-bench --example cluster
# Benchmark the cross-node transport (release, for real numbers):
cargo run --release -p rusm-bench --example cluster_fanout -- 5 4
Verification
cargo test -p rusm-cluster green — cross-node delivery (by handle and by node name), a single link carrying messages both ways, the global registry (gossip on connect and on late registration, local fast-path), remote spawn (the factory ran on the remote, its pid alive there) with an unknown-factory error, live-attach listing of a peer's live pids, and wrong-certificate rejection — plus frame parse/round-trip. cargo fmt + clippy clean. The Wasm-free invariant holds: rusm-cluster depends on rusm-otp, never Wasmtime.
The dashboard's distributed-fanout scenario was graduated to this real engine too (a hub + worker nodes, a sender pool keeping one round-trip in flight so latency stays representative): live, it does ~71k round-trips/sec at ~105µs p50 on the Balanced profile. With that, every one of the nineteen dashboard scenarios now runs on real data — none remain synthetic. (Runner::start_synthetic keeps a runtime-free deterministic preview mode for UI development.) The six serving scenarios (HTTP/WS/SSE and their *-ts twins) are co-resident live demos driving a real in-process WASM server; the fair, credible serving headline numbers are still measured out-of-process by rusm-loadtest against a live rusm serve port.
Next
Phase 10: scale & hardening — an on-demand instance tier above the pooled ceiling, opt-in bounded mailboxes, and supervisor restart-intensity.