Commit Graph

9 Commits

Author SHA1 Message Date
pakrym-oai
06853d94f0 Use wait_for_event helpers in tests (#4753)
## Summary
- replace manual event polling loops in several core test suites with
the shared wait_for_event helpers
- keep prior assertions intact by using closure captures for stateful
expectations, including plan updates, patch lifecycles, and review flow
checks
- rely on wait_for_event_with_timeout where longer waits are required,
simplifying timeout handling
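
For context, here is a minimal sketch of the pattern these helpers enable, assuming a tokio mpsc event channel; the signatures and the `EventMsg` variants below are illustrative, not the exact codex-rs API:

```rust
use std::time::Duration;

use tokio::sync::mpsc::UnboundedReceiver;
use tokio::time::timeout;

#[derive(Debug)]
enum EventMsg {
    PlanUpdate(String),
    TaskComplete,
}

/// Drain events until `predicate` matches one, or panic once `dur` elapses.
async fn wait_for_event_with_timeout(
    rx: &mut UnboundedReceiver<EventMsg>,
    mut predicate: impl FnMut(&EventMsg) -> bool,
    dur: Duration,
) -> EventMsg {
    timeout(dur, async {
        loop {
            let ev = rx.recv().await.expect("event channel closed");
            if predicate(&ev) {
                return ev;
            }
        }
    })
    .await
    .expect("timed out waiting for event")
}

/// Convenience wrapper with a default timeout.
async fn wait_for_event(
    rx: &mut UnboundedReceiver<EventMsg>,
    predicate: impl FnMut(&EventMsg) -> bool,
) -> EventMsg {
    wait_for_event_with_timeout(rx, predicate, Duration::from_secs(10)).await
}

#[tokio::test]
async fn plan_update_is_emitted() {
    let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel();
    tx.send(EventMsg::PlanUpdate("step 1".into())).unwrap();
    tx.send(EventMsg::TaskComplete).unwrap();

    // Closure captures keep stateful assertions close to the wait call.
    let expected = "step 1".to_string();
    let ev = wait_for_event(&mut rx, |e| {
        matches!(e, EventMsg::PlanUpdate(text) if *text == expected)
    })
    .await;
    assert!(matches!(ev, EventMsg::PlanUpdate(_)));
}
```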

## Testing
- just fmt


------
https://chatgpt.com/codex/tasks/task_i_68e1d58582d483208febadc5f90dd95e
2025-10-04 22:04:05 -07:00
Jeremy Rose
4a5f05c136 make tests pass cleanly in sandbox (#4067)
This changes the reqwest client used in tests to be sandbox-friendly,
and skips a number of other tests that don't work inside the sandbox or
without network access.
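
As a rough illustration only (the builder options actually used in this PR are not shown here), a sandbox-friendly test client might skip system proxy detection and keep timeouts short:

```rust
use std::time::Duration;

// Hedged sketch: these builder options are assumptions about what
// "sandbox-friendly" means, not the exact change made in this PR.
fn sandbox_friendly_client() -> reqwest::Client {
    reqwest::Client::builder()
        .no_proxy() // don't consult system proxy settings inside the sandbox
        .connect_timeout(Duration::from_secs(5))
        .timeout(Duration::from_secs(10))
        .build()
        .expect("failed to build test client")
}
```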
2025-09-25 13:11:14 -07:00
pakrym-oai
14a115d488 Add non_sandbox_test helper (#3880)
Makes tests shorter
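
A hypothetical sketch of what such a helper could look like; the environment variable and the return-early convention are assumptions, not the real implementation:

```rust
/// Returns true when the suite appears to be running inside the sandbox,
/// letting a test bail out with a single call instead of repeating the
/// detection logic in every test body.
pub fn non_sandbox_test(name: &str) -> bool {
    // Hypothetical env var; the real detection mechanism may differ.
    if std::env::var_os("CODEX_SANDBOX").is_some() {
        eprintln!("skipping {name}: requires running outside the sandbox");
        return true;
    }
    false
}

// Usage in a test body:
// if non_sandbox_test("fetches_release_notes") { return; }
```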
2025-09-22 14:50:41 +00:00
dedrisian-oai
72733e34c4 Add dev message upon review out (#3758)
Proposal: We want to record a dev message like so:

```
{
  "type": "message",
  "role": "user",
  "content": [
    {
      "type": "input_text",
      "text": "<user_action>
  <context>User initiated a review task. Here's the full review output from reviewer model. User may select one or more comments to resolve.</context>
  <action>review</action>
  <results>
  {findings_str}
  </results>
</user_action>"
    }
  ]
}
```

This message is recorded without being shown in the chat transcript.

This is a rough idea, but it fixes the issue where the user finishes a
review thread and then asks the parent "fix the rest of the review
issues", expecting the parent to already know about the review output.

### Question: Why not a tool call?

Because the agent didn't make the call; a human did. Also, we haven't
implemented sub-agents yet, and we'll need to think about how we
represent these human-led tool calls for the agent.
2025-09-16 18:43:32 -07:00
dedrisian-oai
7fe4021f95 Review mode core updates (#3701)
1. Adds the environment prompt (including cwd) to review thread
2. Prepends the review prompt as a user message (a temporary fix so the
instructions are not replaced on the backend)
3. Sets reasoning to low
4. Sets default review model to `gpt-5-codex`
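
As a rough sketch only (the field and type names are assumptions, not the actual `Config` definition), the defaults described in items 3 and 4 might look like:

```rust
// Hypothetical defaults mirroring items 3 and 4 above; the actual codex-rs
// Config fields and types may be named differently.
pub enum ReasoningEffort {
    Low,
    Medium,
    High,
}

pub struct ReviewConfig {
    pub review_model: String,
    pub reasoning_effort: ReasoningEffort,
}

impl Default for ReviewConfig {
    fn default() -> Self {
        Self {
            review_model: "gpt-5-codex".to_string(),
            reasoning_effort: ReasoningEffort::Low,
        }
    }
}
```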
2025-09-16 13:36:51 -07:00
dedrisian-oai
2aa84b8891 Fix EventMsg Optional (#3604) 2025-09-15 00:34:33 +00:00
Ahmed Ibrahim
a30e5e40ee enable-resume (#3537)
Adds the ability to resume conversations.
We have one verb, `resume`.

Behavior:

`tui`:
`codex resume`: opens session picker
`codex resume --last`: continue the last conversation
`codex resume <session id>`: continue conversation with `session id`

`exec`:
`codex resume --last`: continue last conversation
`codex resume <session id>`: continue conversation with `session id`

Implementation:
- Added a function that finds the session file path in `~/.codex/sessions/`
for a given `UUID`; this is what resuming by session id relies on (see the
sketch after this list).
- Added the flags mentioned above.
- Added lots of tests.
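
A minimal sketch of that UUID lookup, assuming session rollout files live under `~/.codex/sessions/` with the UUID embedded in the file name (both assumptions):

```rust
use std::path::{Path, PathBuf};

/// Recursively search `sessions_dir` for the first file whose name contains
/// the given session UUID.
fn find_session_path(sessions_dir: &Path, session_id: &str) -> Option<PathBuf> {
    for entry in std::fs::read_dir(sessions_dir).ok()?.flatten() {
        let path = entry.path();
        if path.is_dir() {
            if let Some(found) = find_session_path(&path, session_id) {
                return Some(found);
            }
        } else if path
            .file_name()
            .and_then(|name| name.to_str())
            .is_some_and(|name| name.contains(session_id))
        {
            return Some(path);
        }
    }
    None
}

// Usage: find_session_path(&sessions_dir, session_id) resolves the rollout
// file backing `codex resume <session id>`.
```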
2025-09-14 19:33:19 -04:00
dedrisian-oai
b2f6fc3b9a Fix flaky windows test (#3564)
There are exactly 4 kinds of flaky test failures on Windows x86 right now:

1. `review_input_isolated_from_parent_history` => Times out waiting for
closing events
2. `review_does_not_emit_agent_message_on_structured_output` => Times
out waiting for closing events
3. `auto_compact_runs_after_token_limit_hit` => Times out waiting for
closing events
4. `auto_compact_runs_after_token_limit_hit` => Also has a problem where
auto compact should add a third request, but receives 4 requests.

Failures 1, 2, and 3 seem to be solved by increasing the thread count on
the Windows runner from 2 to 4.

We don't yet know why #4 is happening, but it is probably also caused by
WireMock issues on Windows creating races.
2025-09-14 23:20:25 +00:00
dedrisian-oai
90a0fd342f Review Mode (Core) (#3401)
## 📝 Review Mode -- Core

This PR introduces the Core implementation for Review mode:

- New op `Op::Review { prompt: String }`: spawns a child review task
with isolated context, a review‑specific system prompt, and a
`Config.review_model`.
- `EnteredReviewMode`: emitted when the child review session starts.
Every event from this point onwards reflects the review session.
- `ExitedReviewMode(Option<ReviewOutputEvent>)`: emitted when the review
finishes or is interrupted, with optional structured findings:

```json
{
  "findings": [
    {
      "title": "<≤ 80 chars, imperative>",
      "body": "<valid Markdown explaining *why* this is a problem; cite files/lines/functions>",
      "confidence_score": <float 0.0-1.0>,
      "priority": <int 0-3>,
      "code_location": {
        "absolute_file_path": "<file path>",
        "line_range": {"start": <int>, "end": <int>}
      }
    }
  ],
  "overall_correctness": "patch is correct" | "patch is incorrect",
  "overall_explanation": "<1-3 sentence explanation justifying the overall_correctness verdict>",
  "overall_confidence_score": <float 0.0-1.0>
}
```
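
To make the shape concrete, here is a hedged sketch of Rust types that would serialize to the JSON above; the field names follow the schema shown, but the real `ReviewOutputEvent` definition may differ:

```rust
use serde::{Deserialize, Serialize};

// Field and type names are inferred from the JSON schema above; they are not
// guaranteed to match the actual codex-rs definitions.
#[derive(Debug, Serialize, Deserialize)]
pub struct ReviewOutputEvent {
    pub findings: Vec<ReviewFinding>,
    pub overall_correctness: String, // "patch is correct" | "patch is incorrect"
    pub overall_explanation: String,
    pub overall_confidence_score: f32,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct ReviewFinding {
    pub title: String,
    pub body: String,
    pub confidence_score: f32,
    pub priority: i32,
    pub code_location: CodeLocation,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct CodeLocation {
    pub absolute_file_path: std::path::PathBuf,
    pub line_range: LineRange,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct LineRange {
    pub start: u32,
    pub end: u32,
}
```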

## Questions

### Why separate out its own message history?

We want the review thread to match the training of our review models as
much as possible -- that means using a custom prompt, removing user
instructions, and starting a clean chat history.

We also want to make sure the review thread doesn't leak into the parent
thread.

### Why do this as a mode, vs. sub-agents?

1. We want review to be a synchronous task, so it's fine for now to do a
bespoke implementation.
2. We're still unclear about the final structure for sub-agents. We'd
prefer to land this quickly and then refactor into sub-agents without
rushing that implementation.
2025-09-12 23:25:10 +00:00