#2747 encouraged me to audit our codebase for similar issues, as now I
am particularly suspicious that our flaky tests are due to a racy
deadlock.
I asked Codex to audit our code, and one of its suggestions was this:
> **High-Risk Patterns**
>
> All `send_*` methods await on a bounded
`mpsc::Sender<OutgoingMessage>`. If the writer blocks, the channel fills
and the processor task blocks on send, stops draining incoming requests,
and stdin reader eventually blocks on its send. This creates a
backpressure deadlock cycle across the three tasks.
>
> **Recommendations**
> * Server outgoing path: break the backpressure cycle
> * Option A (minimal risk): Change `OutgoingMessageSender` to use an
unbounded channel to decouple producer from stdout. Add rate logging so
floods are visible.
> * Option B (bounded + drop policy): Change `send_*` to try_send and
drop messages (or coalesce) when the queue is full, logging a warning.
This prevents processor stalls at the cost of losing messages under
extreme backpressure.
> * Option C (two-stage buffer): Keep bounded channel, but have a
dedicated “egress” task that drains an unbounded internal queue, writing
to stdout with retries and a shutdown timeout. This centralizes
backpressure policy.
So this PR is Option A.
Indeed, we previously used a bounded channel with a capacity of `128`,
but as we discovered recently with #2776, there are certainly cases
where we can get flooded with events.
That said, `test_shell_command_approval_triggers_elicitation` just
failed one one build when I put up this PR, so clearly we are not out of
the woods yet...
**Update:** I think I found the true source of the deadlock! See
https://github.com/openai/codex/pull/2876