There are exactly 4 types of flaky tests in Windows x86 right now:
1. `review_input_isolated_from_parent_history` => Times out waiting for
closing events
2. `review_does_not_emit_agent_message_on_structured_output` => Times
out waiting for closing events
3. `auto_compact_runs_after_token_limit_hit` => Times out waiting for
closing events
4. `auto_compact_runs_after_token_limit_hit` => Also has a problem where
auto compact should add a third request, but receives 4 requests.
1, 2, and 3 seem to be solved with increasing threads on windows runner
from 2 -> 4.
Don't know yet why # 4 is happening, but probably also because of
WireMock issues on windows causing races.