There are exactly four flaky test failures on Windows x86 right now:
1. `review_input_isolated_from_parent_history` => times out waiting for closing events.
2. `review_does_not_emit_agent_message_on_structured_output` => times out waiting for closing events.
3. `auto_compact_runs_after_token_limit_hit` => times out waiting for closing events.
4. `auto_compact_runs_after_token_limit_hit` => also fails because auto compact should add a third request, but four requests are received.

Failures 1, 2, and 3 appear to be fixed by increasing the thread count on the Windows runner from 2 to 4. I don't yet know why failure 4 happens, but it is probably also a race caused by WireMock issues on Windows.
We need to construct the history differently when a compact happens: consider only the history after the compact, and convert the compact itself into a response item.
This needs to change to use `build_compact_history` once #3446 is merged.
No (intended) functional change.
This refactors the transcript view to hold a list of HistoryCells instead of a list of Lines. This simplifies the code and makes much of the logic more robust, as well as laying the groundwork for future changes, e.g. live-updating history cells in the transcript.
Similar to #2879 in goal. Fixes #2755.
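For context, a minimal sketch of the shape of the change; `Line`, `HistoryCell`, and `TranscriptView` here are illustrative stand-ins, not the actual codex-rs types:

```rust
// Illustrative stand-ins; the real codex-rs types differ.
struct Line(String);

trait HistoryCell {
    // Each cell renders itself into display lines on demand, so a cell can
    // later be updated in place and re-rendered (live-updating history).
    fn display_lines(&self) -> Vec<Line>;
}

struct TranscriptView {
    // Before: a flat `Vec<Line>` built at insertion time.
    // After: keep the structured cells and render lazily.
    cells: Vec<Box<dyn HistoryCell>>,
}

impl TranscriptView {
    fn rendered_lines(&self) -> Vec<Line> {
        self.cells
            .iter()
            .flat_map(|cell| cell.display_lines())
            .collect()
    }
}
```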
## 📝 Review Mode -- Core
This PR introduces the Core implementation for Review mode:
- New op `Op::Review { prompt: String }`: spawns a child review task with an isolated context, a review‑specific system prompt, and the model configured via `Config.review_model`.
- `EnteredReviewMode`: emitted when the child review session starts.
Every event from this point onwards reflects the review session.
- `ExitedReviewMode(Option<ReviewOutputEvent>)`: emitted when the review
finishes or is interrupted, with optional structured findings:
```json
{
"findings": [
{
"title": "<≤ 80 chars, imperative>",
"body": "<valid Markdown explaining *why* this is a problem; cite files/lines/functions>",
"confidence_score": <float 0.0-1.0>,
"priority": <int 0-3>,
"code_location": {
"absolute_file_path": "<file path>",
"line_range": {"start": <int>, "end": <int>}
}
}
],
"overall_correctness": "patch is correct" | "patch is incorrect",
"overall_explanation": "<1-3 sentence explanation justifying the overall_correctness verdict>",
"overall_confidence_score": <float 0.0-1.0>
}
```
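For reference, a hedged sketch of Rust types mirroring that JSON shape; the field names follow the schema above, but the concrete types are assumptions rather than the actual `ReviewOutputEvent` definition:

```rust
use serde::{Deserialize, Serialize};

// Sketch only: mirrors the JSON schema above, not the exact codex-rs types.
#[derive(Debug, Serialize, Deserialize)]
struct LineRange {
    start: u32,
    end: u32,
}

#[derive(Debug, Serialize, Deserialize)]
struct CodeLocation {
    absolute_file_path: String,
    line_range: LineRange,
}

#[derive(Debug, Serialize, Deserialize)]
struct ReviewFinding {
    title: String,         // <= 80 chars, imperative
    body: String,          // Markdown explaining *why* this is a problem
    confidence_score: f32, // 0.0 - 1.0
    priority: u8,          // 0 - 3
    code_location: CodeLocation,
}

#[derive(Debug, Serialize, Deserialize)]
struct ReviewOutputEvent {
    findings: Vec<ReviewFinding>,
    overall_correctness: String, // "patch is correct" | "patch is incorrect"
    overall_explanation: String,
    overall_confidence_score: f32,
}
```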
## Questions
### Why separate out its own message history?
We want the review thread to match the training of our review models as much as possible -- that means using a custom prompt, removing user instructions, and starting from a clean chat history.
We also want to make sure the review thread doesn't leak into the parent
thread.
### Why do this as a mode, vs. sub-agents?
1. We want review to be a synchronous task, so it's fine for now to do a
bespoke implementation.
2. We're still unclear about the final structure for sub-agents. We'd
prefer to land this quickly and then refactor into sub-agents without
rushing that implementation.
This adds some more capabilities to the default sandbox which I feel are safe. Most are in the
[renderer.sb](https://source.chromium.org/chromium/chromium/src/+/main:sandbox/policy/mac/renderer.sb)
sandbox for Chrome renderers, which I feel is fair game for codex commands.
Specific changes:
1. Allow processes in the sandbox to send signals to any other process
in the same sandbox (e.g. child processes or daemonized processes),
instead of just themselves.
2. Allow user-preference-read.
3. Allow process-info* to anything in the same sandbox. This is a bit
wider than Chromium allows, but it seems OK to me to allow anything in
the sandbox to get details about other processes in the same sandbox.
Bazel uses these to e.g. wait for another process to exit.
4. Allow all CPU feature detection; this seems harmless to me. It's wider than what Chromium allows, but Chromium is concerned about fingerprinting and tightly controls which CPU features it actually cares about, and we have neither that restriction nor that advantage.
5. Allow new sysctl-reads:
```
(sysctl-name "vm.loadavg")
(sysctl-name-prefix "kern.proc.pgrp.")
(sysctl-name-prefix "kern.proc.pid.")
(sysctl-name-prefix "net.routetable.")
```
Bazel needs these for waiting on child processes and for communicating with its local build server, I believe. I wonder if we should just allow all (sysctl-read), as reading arbitrary info about the system seems fine to me.
6. Allow iokit-open on RootDomainUserClient. This has to do with power management, I believe, and Chromium allows renderers to do this, so it seems okay. Bazel needs it to boot successfully, possibly for sleep/wake callbacks.
7. Allow Mach lookup of `com.apple.system.opendirectoryd.libinfo`, which has to do with user data and which Chrome allows.
8. Allow Mach lookup of `com.apple.PowerManagement.control`. Chromium allows its GPU process to do this, but not its renderers. Bazel needs this to boot, probably also related to the sleep/wake handling.
The Azure Responses API doesn't work well with `store: false` and response items:
- If `store = false` and an item ID is sent, an error is thrown saying the ID is not found.
- If `store = false` and the ID is not sent, an error is thrown saying the ID is required.

This adds detection for Azure URLs and a workaround that preserves reasoning item IDs and sends `store: true`.
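A minimal sketch of the workaround, assuming a hypothetical `is_azure_responses_endpoint` helper; the real detection in codex-rs may key off different parts of the provider configuration:

```rust
// Hypothetical helper: detect Azure-hosted Responses API endpoints by URL.
fn is_azure_responses_endpoint(base_url: &str) -> bool {
    base_url.to_ascii_lowercase().contains("azure")
}

// For Azure, keep the reasoning item IDs and force `store: true`, since
// `store: false` fails whether or not the IDs are sent.
fn store_flag_for(base_url: &str, default_store: bool) -> bool {
    if is_azure_responses_endpoint(base_url) {
        true
    } else {
        default_store
    }
}
```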
Sometimes the model forgets to actually invoke `apply_patch` and instead puts the patch in the script body. Trying to execute this as bash sometimes creates files named `,` or `{` or does other unexpected things, so catch this situation and return an error to the model.
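A minimal sketch of the guard, assuming the `*** Begin Patch` envelope marker; the function names are illustrative and the real check may be broader:

```rust
// Illustrative guard: refuse to run a "script" that is really a patch body.
fn looks_like_bare_patch(script: &str) -> bool {
    script.trim_start().starts_with("*** Begin Patch")
}

fn run_shell_script(script: &str) -> Result<(), String> {
    if looks_like_bare_patch(script) {
        // Surface an error to the model instead of executing bash, which
        // would otherwise create stray files like `,` or `{`.
        return Err(
            "patch body was passed as a shell script; invoke apply_patch instead".to_string(),
        );
    }
    // ... execute via the normal shell path ...
    Ok(())
}
```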
## Compact feature
1. Stops the model when the context window becomes too large.
2. Adds a user turn asking the model to summarize.
3. Builds a bridge that contains all the previous user messages plus the summary, rendered from a template.
4. Starts sampling again from a clean conversation with only that bridge (see the sketch after this list).
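A hedged sketch of steps 3 and 4, with `HistoryItem` and `render_bridge` as illustrative names rather than the actual codex-rs API:

```rust
// Illustrative types/names; the real compaction code is structured differently.
#[derive(Clone)]
enum HistoryItem {
    User(String),
    Assistant(String),
}

// Step 3: the bridge keeps every prior user message plus the model's summary.
fn render_bridge(user_messages: &[String], summary: &str) -> String {
    format!(
        "Previous user messages:\n{}\n\nSummary of the conversation so far:\n{}",
        user_messages.join("\n"),
        summary
    )
}

// Step 4: restart sampling from a clean conversation seeded only with the bridge.
fn compact(history: &mut Vec<HistoryItem>, summary: &str) {
    let user_messages: Vec<String> = history
        .iter()
        .filter_map(|item| match item {
            HistoryItem::User(text) => Some(text.clone()),
            HistoryItem::Assistant(_) => None,
        })
        .collect();
    let bridge = render_bridge(&user_messages, summary);
    history.clear();
    history.push(HistoryItem::User(bridge));
}
```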
It turns out that we want slightly different behavior for the
`SetDefaultModel` RPC because some models do not work with reasoning
(like GPT-4.1), so we should be able to explicitly clear this value.
Verified in `codex-rs/mcp-server/tests/suite/set_default_model.rs`.
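One way to express the "leave unchanged" vs. "explicitly clear" distinction is a nested `Option`; this is an illustrative sketch, not necessarily how the RPC actually encodes it, and the field and type names are assumptions:

```rust
// Illustrative sketch; field and type names are assumptions.
struct SetDefaultModelParams {
    /// None => keep the model currently configured in config.toml.
    model: Option<String>,
    /// None            => keep the configured reasoning effort.
    /// Some(None)      => explicitly clear it (e.g. for GPT-4.1, which
    ///                    does not support reasoning).
    /// Some(Some(eff)) => set it to `eff`.
    reasoning_effort: Option<Option<String>>,
}
```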
## Summary
Standardizes the shell description across sandbox_types, since we cover this in the prompt and have moved the necessary details (like network_access and writable workspace roots) to EnvironmentContext messages.
## Test Plan
- [x] updated unit tests
This adds `SetDefaultModel`, which takes `model` and `reasoning_effort` as optional fields. If set, each field overwrites what is in the user's `config.toml`.
This reuses logic that was added to support the `/model` command in the
TUI: https://github.com/openai/codex/pull/2799.
`ClientRequest::NewConversation` picks up the reasoning level from the user's defaults in `config.toml`, so it should be reported in `NewConversationResponse`.
Adds further information on how to get started with `codex mcp`:
- Tool details and parameter references
- Quickstart with an example using the MCP Inspector.
## Summary
Handle timeouts the same way, regardless of approval mode. There's more to do here, but this is simple and should be zero-regret.
## Testing
- [x] existing tests pass
- [x] test locally and verify rollout
Created this PR by:
- adding `redundant_clone` to `[workspace.lints.clippy]` in
`codex-rs/Cargo.toml`
- running `cargo clippy --tests --fix`
- running `just fmt`
Though I had to clean up one instance of the following that resulted:
```rust
let codex = codex;
```
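For illustration, this is the kind of pattern the lint catches (a made-up example, not taken from the codebase):

```rust
// `clippy::redundant_clone` flags clones of values that are never used again.
fn greeting() -> String {
    let name = String::from("codex");
    // Before the fix, clippy warns on `let message = name.clone();` because
    // `name` is not used afterwards. `cargo clippy --fix` drops the clone:
    let message = name;
    message
}
```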
Apparently `-F` is the correct thing to use. From the code sample on
https://docs.github.com/en/rest/git/refs?apiVersion=2022-11-28#update-a-reference
```shell
gh api \
--method PATCH \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
/repos/OWNER/REPO/git/refs/REF \
-f 'sha=aa218f56b14c9653891f9e74264a383fa43fefbd' -F "force=true"
```
Also, I ran the following locally and verified it worked:
```shell
export GITHUB_REPOSITORY=openai/codex
export GITHUB_SHA=305252b2fb2d57bb40a9e4bad269db9a761f7099
gh api \
repos/${GITHUB_REPOSITORY}/git/refs/heads/latest-alpha-cli \
-X PATCH \
-f sha="${GITHUB_SHA}" \
-F force=true
```
`$GITHUB_REPOSITORY` and `$GITHUB_SHA` should already be available as
environment variables for the `run` step without having to be redeclared
in the `env` section.
This PR does the following:
* Adds the ability to paste or type an API key.
* Removes the `preferred_auth_method` config option. The last login
method is always persisted in auth.json, so this isn't needed.
* If the `OPENAI_API_KEY` env variable is defined, its value is used to prepopulate the new UI. The env variable is otherwise ignored by the CLI.
* Adds a new MCP server entry point, `login_api_key`, so we can implement the same API key behavior for the VS Code extension.
<img width="473" height="140" alt="Screenshot 2025-09-04 at 3 51 04 PM"
src="https://github.com/user-attachments/assets/c11bbd5b-8a4d-4d71-90fd-34130460f9d9"
/>
<img width="726" height="254" alt="Screenshot 2025-09-04 at 3 51 32 PM"
src="https://github.com/user-attachments/assets/6cc76b34-309a-4387-acbc-15ee5c756db9"
/>
- Ensure replacements are applied in index order for determinism.
- Add tests for an addition chunk followed by a removal, and for the worktree-aware helper.

This fixes a panic I observed.
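A minimal sketch of the index-ordering idea, with `Replacement` as an illustrative type rather than the actual chunk representation:

```rust
// Illustrative type: a text replacement over a byte range of the original.
struct Replacement {
    start: usize, // inclusive byte offset into the original text
    end: usize,   // exclusive end offset
    text: String,
}

// Sort by start index so the result is deterministic regardless of the order
// the replacements were produced in, then apply back-to-front so earlier
// offsets are not invalidated by length changes.
fn apply_replacements(original: &str, mut edits: Vec<Replacement>) -> String {
    edits.sort_by_key(|edit| edit.start);
    let mut out = original.to_string();
    for edit in edits.iter().rev() {
        out.replace_range(edit.start..edit.end, &edit.text);
    }
    out
}
```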
Co-authored-by: Codex <199175422+chatgpt-codex-connector[bot]@users.noreply.github.com>