
Phalanx

2026

Distributed Systems Engineer

A high-availability distributed consensus engine based on the Raft protocol, built from scratch in Go. Features a 5-node global mesh with double-fault tolerance, specialized production tuning for cross-continental latency, and a single-threaded event loop for lock-free concurrency.

KEY METRIC

5-node global mesh, double-fault tolerance (Quorum=3), <2ms p99 write latency

Stack

Go, Raft Protocol, gRPC, SWIM Gossip, BadgerDB, Fly.io, Docker, Next.js, TypeScript, Chaos Engineering

Overview

Phalanx is a production-grade implementation of the Raft consensus algorithm, engineered to provide a reliable 'source of truth' across a globally distributed cluster. The engine ensures that all nodes agree on a linearized log of operations, even in the face of network partitions or hardware failures.

The system uses a custom gossip-based peer discovery mechanism over Fly.io's private IPv6 network and ships with a "terminal-noir" technical manual for operational transparency. It implements the Pre-Vote extension (§9.6 of the Raft dissertation), lease-based linearizable reads, and a deterministic, tick-based state machine driven by a single-threaded event loop.
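To make the tick-based, single-threaded design concrete, here is a minimal sketch of the pattern. It is not taken from the Phalanx codebase: the `Event`, `Node`, `Step`, and `Run` names are hypothetical, and the state machine is reduced to a follower that becomes a candidate after a fixed number of ticks without a heartbeat. The point is that time enters only as tick events, so the same event sequence always produces the same state, and only one goroutine ever touches the node's fields.

```go
package main

import "fmt"

// Event is anything the loop can process. Hypothetical type for illustration:
// Kind is either "tick" (timer fired) or "heartbeat" (leader message arrived).
type Event struct {
	Kind string
}

// Node holds Raft-like state. Only the event loop goroutine reads or writes
// these fields, so no mutex is required.
type Node struct {
	elapsed         int    // ticks since the last heartbeat
	electionTimeout int    // ticks of silence before starting an election
	state           string // "follower" or "candidate"
}

// Step applies a single event. It never reads the wall clock, which keeps
// the state machine deterministic and easy to test.
func (n *Node) Step(ev Event) {
	switch ev.Kind {
	case "tick":
		n.elapsed++
		if n.state == "follower" && n.elapsed >= n.electionTimeout {
			n.state = "candidate"
			n.elapsed = 0
		}
	case "heartbeat":
		n.state = "follower"
		n.elapsed = 0
	}
}

// Run drains events on the calling goroutine until the channel is closed.
func (n *Node) Run(events <-chan Event) {
	for ev := range events {
		n.Step(ev)
	}
}

func main() {
	n := &Node{electionTimeout: 3, state: "follower"}
	events := make(chan Event, 8)
	for _, k := range []string{"tick", "tick", "heartbeat", "tick", "tick", "tick"} {
		events <- Event{Kind: k}
	}
	close(events)
	n.Run(events)
	fmt.Println(n.state) // candidate: three ticks passed after the last heartbeat
}
```

In a real node, a ticker goroutine and the gRPC handlers would push onto the same channel, leaving `Run` as the only writer of consensus state.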

Architecture

[Architecture diagram] Client (CLI) connects over gRPC :9000 to a 5-node cluster (N=5, Q=3, tolerates 2 region failures).
Nodes: Node 0 (JNB, Johannesburg), Node 1 (LHR, London), Node 2 (ORD, Chicago, LEADER), Node 3 (SIN, Singapore), Node 4 (FRA, Frankfurt).
Each node stacks Raft, a KV FSM, and BadgerDB. All nodes are joined by a SWIM gossip mesh spanning 5 regions.

phalanx global mesh: 5-node consensus cluster across 5 regions

Technical Challenges

Taming Cross-Continental Latency

Raft timeouts tuned for a single datacenter (on the order of 100ms) cause 'election flapping' when nodes are spread across continents: intercontinental round trips can approach or exceed the timeout, so followers mistake a slow leader for a dead one. I recalibrated the state machine with a 200ms tick and a randomized 4-8s election window, absorbing global RTT spikes without compromising cluster stability.

Dynamic Peer Discovery in Mesh Networks

Static IP configurations are fragile in cloud environments. I engineered a startup sequence that resolves peers via Fly.io's internal AAAA DNS records, allowing nodes to dynamically join the cluster and exchange gossip seeds regardless of their geographic region.

Proving Double-Fault Tolerance

Theoretical reliability doesn't guarantee production safety. I built a 'Chaos Mesh' runner that randomly terminates up to 2 nodes simultaneously while the cluster is under heavy write load. With two nodes down, the remaining three still form a quorum (Q=3). Phalanx maintained that quorum throughout, demonstrating that it can survive the loss of two entire regions without data corruption.

View on GitHub ↗ · Live Demo ↗