Adding the ability to resume conversations.
we have one verb `resume`.
Behavior:
`tui`:
`codex resume`: opens session picker
`codex resume --last`: continue last message
`codex resume <session id>`: continue conversation with `session id`
`exec`:
`codex resume --last`: continue last conversation
`codex resume <session id>`: continue conversation with `session id`
Implementation:
- I added a function to find the path in `~/.codex/sessions/` with a
`UUID`. This is helpful in resuming with session id.
- Added the above mentioned flags
- Added lots of testing
There are exactly 4 types of flaky tests in Windows x86 right now:
1. `review_input_isolated_from_parent_history` => Times out waiting for
closing events
2. `review_does_not_emit_agent_message_on_structured_output` => Times
out waiting for closing events
3. `auto_compact_runs_after_token_limit_hit` => Times out waiting for
closing events
4. `auto_compact_runs_after_token_limit_hit` => Also has a problem where
auto compact should add a third request, but receives 4 requests.
1, 2, and 3 seem to be solved with increasing threads on windows runner
from 2 -> 4.
Don't know yet why # 4 is happening, but probably also because of
WireMock issues on windows causing races.
## 📝 Review Mode -- Core
This PR introduces the Core implementation for Review mode:
- New op `Op::Review { prompt: String }:` spawns a child review task
with isolated context, a review‑specific system prompt, and a
`Config.review_model`.
- `EnteredReviewMode`: emitted when the child review session starts.
Every event from this point onwards reflects the review session.
- `ExitedReviewMode(Option<ReviewOutputEvent>)`: emitted when the review
finishes or is interrupted, with optional structured findings:
```json
{
"findings": [
{
"title": "<≤ 80 chars, imperative>",
"body": "<valid Markdown explaining *why* this is a problem; cite files/lines/functions>",
"confidence_score": <float 0.0-1.0>,
"priority": <int 0-3>,
"code_location": {
"absolute_file_path": "<file path>",
"line_range": {"start": <int>, "end": <int>}
}
}
],
"overall_correctness": "patch is correct" | "patch is incorrect",
"overall_explanation": "<1-3 sentence explanation justifying the overall_correctness verdict>",
"overall_confidence_score": <float 0.0-1.0>
}
```
## Questions
### Why separate out its own message history?
We want the review thread to match the training of our review models as
much as possible -- that means using a custom prompt, removing user
instructions, and starting a clean chat history.
We also want to make sure the review thread doesn't leak into the parent
thread.
### Why do this as a mode, vs. sub-agents?
1. We want review to be a synchronous task, so it's fine for now to do a
bespoke implementation.
2. We're still unclear about the final structure for sub-agents. We'd
prefer to land this quickly and then refactor into sub-agents without
rushing that implementation.