The Docker default ate two weeks of my time

I ship a chat product: multi-tenant, encrypted, distributed across 5 partner-edge domains, serving real customers in production. The spec for a single live event is 1000 concurrent subscribers in one room, all getting every message in real time.

For 2 weeks straight the benchmark on that room said the same number: 998 of 1000 subscribers received nothing. The sender thought it was sending, every metric I had wired in showed green, and the server logs were quiet; nobody was disconnecting, they just were not getting any messages. I was about to rip Postgres LISTEN/NOTIFY out of the entire system and rewrite the fanout layer on Redis Streams, and the migration plan was 3 pages deep when I finally found what was actually broken.

Things that looked like the problem

The first real bug I found was the in-process broadcast bus between the Postgres listener task and the per-subscriber consumers: it had a queue depth of 8, and messages destined for subscribers were being dropped because the bus could not keep up with the inbound NOTIFY rate. The counter I had wired in showed 21,933 drops in a single benchmark run, and bumping the queue depth to 256 took the drop counter to zero on the next run, but the delivery rate at 1000 subscribers stayed at 0.002.
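
The drop accounting can be sketched with a bounded queue from the standard library. This is only an illustration under assumptions: the real bus is async, and the function name drops_for_burst and the burst size are hypothetical. It shows why a depth of 8 loses messages during a burst that a depth of 256 absorbs.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Pushes `burst` messages into a bounded queue of depth `depth` without
/// draining it, and returns how many had to be dropped.
/// Hypothetical stand-in for the listener-to-consumer bus: it models only
/// the drop accounting, not the async delivery.
pub fn drops_for_burst(depth: usize, burst: usize) -> usize {
    // Keep the receiver alive so a failed send means "full", not "disconnected".
    let (tx, _rx) = sync_channel::<u64>(depth);
    let mut dropped = 0;
    for id in 0..burst as u64 {
        match tx.try_send(id) {
            Ok(()) => {}
            // Queue is full: the consumer has not caught up, so this message is lost.
            Err(TrySendError::Full(_)) => dropped += 1,
            // Receiver is gone: nothing left to deliver to.
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    dropped
}
```

With an undrained queue, a burst of 100 messages loses 92 at depth 8 and none at depth 256. A real consumer drains concurrently, so actual drop counts depend on the consumer's speed rather than on depth alone.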

The next bug was a stale Postgres trigger left over from an earlier migration, firing on every message write and serializing badly with the new NOTIFY payload writes. Dropping the trigger restored an invariant that had been quietly broken since I shipped the new write path: an L2 cache that had literally never hit once in production suddenly started firing on every message. The delivery rate at 1000 subscribers stayed at 0.002.

I then built a Phoenix-style fastlane cache: pre-encode the message JSON exactly once at insert time, store it as a shared Bytes reference, broadcast that reference to all subscribers, and skip the N–1 redundant serialization calls per message. The measurement came back at 191× reuse per insert, which is exactly the kind of structural win that should have moved a 1000-subscriber bench measurably. The delivery rate at 1000 subscribers stayed at 0.002.
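
A minimal sketch of the encode-once idea, assuming a std-only stand-in: Arc<[u8]> plays the role of the shared Bytes reference, and the Message type, the encode function, and the call counter are hypothetical names introduced for illustration.

```rust
use std::sync::Arc;

/// Hypothetical message type; the field names are illustrative only.
pub struct Message {
    pub id: u64,
    pub body: String,
}

/// Stand-in encoder that counts its own invocations.
/// A real service would use a JSON library and escape `body` properly.
pub fn encode(msg: &Message, calls: &mut usize) -> Vec<u8> {
    *calls += 1;
    format!("{{\"id\":{},\"body\":\"{}\"}}", msg.id, msg.body).into_bytes()
}

/// Encodes the message once and hands every subscriber a cheap,
/// reference-counted clone of the same buffer.
pub fn fan_out(msg: &Message, subscribers: usize, calls: &mut usize) -> Vec<Arc<[u8]>> {
    let encoded: Arc<[u8]> = Arc::from(encode(msg, calls));
    (0..subscribers).map(|_| Arc::clone(&encoded)).collect()
}
```

Each clone bumps a reference count and copies no bytes, so the serialization cost stays at one call per insert however many subscribers are in the room.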

Things that looked like the problem and weren’t

I had a hypothesis about a race between two code paths trying to emit the same message id within a short window, and I wrote a diagnostic counter to catch it before committing to a refactor. After 4 benches the counter sat at exactly zero across all of them, which meant the race I had been planning a 1-2 week refactor around did not exist at the scale I was actually running. The diagnostic counter spared me that refactor on something that turned out not to be the problem, which is a reminder that empirical disproof beats architectural intuition every time.

The bench harness itself turned out to be a partial bottleneck, since it was running as a single tokio process with a 1001-way barrier sync that saturated CPU on a 4-core ARM box before the server-side fanout had a chance to be the binding constraint. I rewrote it as a multi-process orchestrator with a worker pool, which meant some of the bench numbers I had been collecting up to that point were partially harness-limited (a useful discovery on its own), but the delivery rate at 1000 subscribers still stayed at 0.002.

At this point I was starting to think it just doesn’t scale

I started reading what other people had built at scale: Phoenix Channels at Discord, Centrifuge with its 64 sharded hubs, Synapse’s single Notifier and why Python’s GIL serializes it, Mattermost’s filter-on-writer anti-pattern, SignalR’s ChannelFullMode.Wait and the catastrophic cascading stall that comes with it. The reading was excellent and I felt like I had a working mental map of the landscape, but the benchmark had been at 0.002 for two weeks and I was 3 pages deep into an architecture document proposing a migration off Postgres LISTEN/NOTIFY to either Redis Streams or NATS JetStream.

The thing that was actually broken

Before I committed to the migration plan I opened the server logs in a separate terminal and actually watched a benchmark run, because for 2 weeks I had been staring exclusively at /metrics and had not been watching stderr at all. In stderr, every accept-loop iteration was emitting the same line:

WARN accept error; sleeping 1s;
Os { code: 24, kind: Uncategorized, message: "Too many open files" }

That is EMFILE, errno 24, the per-process file descriptor limit; new connections were being rejected by the operating system before the accept loop could even hand them off to my application code. I ran one command to confirm:

$ docker exec oxpulse-chat-staging sh -c 'ulimit -n'
1024

The host’s ulimit -n was 1048576, but the container’s was 1024, which is the Docker default nofile limit that I had silently inherited along with the rest of the compose defaults. Each TCP socket counts as one file descriptor, and so does each Postgres LISTEN handle, each metrics-scrape connection, and each open log file; at 1000 chat subscribers I was holding roughly 3-4 file descriptors per session (HTTP socket, database pool slot, LISTEN handle, sometimes a per-connection log buffer), plus a couple of hundred baseline for the process itself, which put me well past the 1024 cap. The accept loop had been failing with errno 24 the entire time, logged to stderr but invisible to every metric I was watching: the operating system was rejecting connections at the socket layer, before any of my application code ran, before any metric had a chance to increment.
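
The descriptor arithmetic above can be written down as a rough budget. This is a sketch, not a measurement: the per-session and baseline figures are the estimates from the paragraph, and the function names are hypothetical.

```rust
/// Rough descriptor budget for one process. All inputs are estimates.
pub fn estimated_fds(sessions: u64, fds_per_session: u64, baseline: u64) -> u64 {
    sessions * fds_per_session + baseline
}

/// True when the estimate fits under the soft `nofile` limit.
pub fn fits(limit: u64, sessions: u64, fds_per_session: u64, baseline: u64) -> bool {
    estimated_fds(sessions, fds_per_session, baseline) <= limit
}

/// Largest session count that fits under `limit` for the given estimates.
/// A zero per-session cost is treated as one to avoid dividing by zero.
pub fn max_sessions(limit: u64, fds_per_session: u64, baseline: u64) -> u64 {
    limit.saturating_sub(baseline) / fds_per_session.max(1)
}
```

Assuming 3 descriptors per session and a baseline of 200, 1000 sessions need about 3200 descriptors, and a 1024 limit runs out at around 274 sessions.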

The fix was four lines of YAML in docker-compose.yml:

ulimits:
  nofile:
    soft: 65535
    hard: 65535

I committed the change, pushed it to the deploy pipeline, redeployed staging, and re-ran the same 1000-subscriber benchmark that had been pinned at 0.002 for two weeks. The delivery rate came back at 0.468, and for the first time every one of the 1000 subscribers received at least one message. The 0.468 is a benchmark artifact rather than a steady-state number: subscribers in the harness join over a 10-second ramp while the sender starts posting immediately, so the later joiners miss the early messages. Real product clients that stay connected for hours measure effective delivery above 95%, which is the actual scale target. Two weeks of work and roughly a dozen real fixes at the application layer had moved the bench by approximately zero, and one ulimits setting at the OS layer moved it by 234×.

What was wrong with my mental model

I had instrumented the application layer thoroughly: every disconnect reason partitioned by cause, every cache hit and miss counted, every broadcast-bus drop attributed to channel and reason. The application metrics were good, but the OS metrics were absent entirely, and that asymmetry is what kept me at the wrong layer for two weeks.

Two specific signals would have made this visible the first time the benchmark ran. The first is a counter on the accept loop, something like accept_error_total{reason} partitioned by errno, which the application is fully capable of incrementing on its own accept-loop failures and which I simply had not wired in. The second is the Linux Prometheus process-collector convention process_open_fds and process_max_fds, which the prometheus crate I am using already ships via ProcessCollector; enabling that collector would have taken one register call and would have produced a process_open_fds = 1023 reading the first time I ran a 1000-subscriber bench, at which point the entire investigation would have collapsed into an afternoon of work.
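
The first signal can be sketched without any metrics library. In the sketch below, the AcceptErrors struct is a hypothetical in-memory stand-in for a labelled counter such as accept_error_total{reason}, the label values are made up for illustration, and the errno numbers are the Linux ones (EMFILE is 24, ENFILE is 23).

```rust
use std::collections::HashMap;
use std::io;

/// Maps an accept() failure to a low-cardinality label value.
/// Errno numbers are the Linux ones; other platforms may differ.
pub fn accept_error_reason(err: &io::Error) -> &'static str {
    match err.raw_os_error() {
        Some(24) => "emfile", // per-process descriptor limit reached
        Some(23) => "enfile", // system-wide descriptor limit reached
        Some(_) => "other_os",
        None => "non_os",
    }
}

/// Minimal in-memory stand-in for a labelled counter.
#[derive(Default)]
pub struct AcceptErrors {
    by_reason: HashMap<&'static str, u64>,
}

impl AcceptErrors {
    /// Call this from the accept loop's error branch, before sleeping or retrying.
    pub fn record(&mut self, err: &io::Error) {
        *self.by_reason.entry(accept_error_reason(err)).or_insert(0) += 1;
    }

    /// Current count for one reason label; zero if never seen.
    pub fn get(&self, reason: &str) -> u64 {
        self.by_reason.get(reason).copied().unwrap_or(0)
    }
}
```

In a real service the same classification would feed whatever counter type the metrics endpoint already exports. The point is only that the accept loop's error branch increments something.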

Application-layer metrics are a 90% solution for application-layer bugs, but operating-system constraints (FD exhaustion, memory pressure, cgroup CPU throttling, TCP backlog overflow) need their own signals piped to the same metrics endpoint, or you spend 2 weeks reading distributed-systems papers when the answer was man docker run.

The takeaway

If a high-fan-out real-time backend looks broken while every application counter is green, the first thing to run is docker exec <service> sh -c 'ulimit -n', not strace, not a flamegraph, and not a 5-page redesign document; it is one shell command, and it would have collapsed this entire investigation into a single afternoon. Production SSE products that actually run at scale (Mattermost, Centrifuge, anything in the Discord ballpark) assume 100K+ file descriptors as their baseline, while the Docker default is 1024, and that gap is wide enough to silently break everything between roughly 200 and 5000 concurrent connections, which is exactly the range a small real-time service grows through immediately after launch.
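
The same check can also run inside the process at startup. The sketch below assumes Linux, where /proc/self/limits lists the soft and hard "Max open files" values; the function names are hypothetical, and the code only reads the limit, it does not raise it.

```rust
use std::fs;

/// Parses one limit column; "unlimited" is mapped to u64::MAX.
fn parse_limit(token: &str) -> Option<u64> {
    if token == "unlimited" {
        Some(u64::MAX)
    } else {
        token.parse().ok()
    }
}

/// Extracts the (soft, hard) "Max open files" pair from the text of
/// /proc/<pid>/limits. Returns None if the row is missing or malformed.
pub fn parse_nofile(limits: &str) -> Option<(u64, u64)> {
    const ROW: &str = "Max open files";
    let line = limits.lines().find(|l| l.starts_with(ROW))?;
    let mut fields = line[ROW.len()..].split_whitespace();
    let soft = parse_limit(fields.next()?)?;
    let hard = parse_limit(fields.next()?)?;
    Some((soft, hard))
}

/// Reads the current process's limits; None off Linux or on any read error.
pub fn current_nofile() -> Option<(u64, u64)> {
    parse_nofile(&fs::read_to_string("/proc/self/limits").ok()?)
}
```

Logging the soft limit at startup, or refusing to start when it sits below the expected connection count, would put the number in front of whoever reads the logs first.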

The 11 other fixes I shipped during those 2 weeks were not wasted: the broadcast-bus cap and the stale trigger were correctness bugs that would have surfaced eventually, the Phoenix-style fastlane cache is a structural win that pays off at every subsequent scale tier, the empirical disproof of the cursor race saved me a 1-2 week refactor, and the reviewer-caught bugs in my own graceful-shutdown PR prevented a quiet regression. None of them moved the 1000-subscriber benchmark needle, but most of them will be the binding constraint at 5000 subscribers instead.

For two weeks I had been debugging an OS-layer constraint with application-layer tools, which is a category error that application-layer instrumentation alone will not reveal, no matter how dense it gets. Until the OS signals sit on the same metrics endpoint, the shell command is the escape hatch.