This PR does the following:
- divides user msgs into 3 categories: plain, user instructions, and
environment context
- Centralizes adding user instructions and environment context to a
degree
- Improve the integration testing
Building on top of #3123
Specifically this
[comment](https://github.com/openai/codex/pull/3123#discussion_r2319885089).
We need to send the user message while ignoring the User Instructions
and Environment Context we attach.
### Overview
This PR introduces the following changes:
1. Adds a unified mechanism to convert ResponseItem into EventMsg.
2. Ensures that when a session is initialized with initial history, a
vector of EventMsg is sent along with the session configuration. This
allows clients to re-render the UI accordingly.
3. Added integration testing
### Caveats
This implementation does not send every EventMsg that was previously
dispatched to clients. The excluded events fall into two categories:
• “Arguably” rolled-out events
Examples include tool calls and apply-patch calls. While these events
are conceptually rolled out, we currently only roll out ResponseItems.
These events are already being handled elsewhere and transformed into
EventMsg before being sent.
• Non-rolled-out events
Certain events such as TurnDiff, Error, and TokenCount are not rolled
out at all.
### Future Directions
At present, resuming a session involves maintaining two states:
• UI State
Clients can replay most of the important UI from the provided EventMsg
history.
• Model State
The model receives the complete session history to reconstruct its
internal state.
This design provides a solid foundation. If, in the future, more precise
UI reconstruction is needed, we have two potential paths:
1. Introduce a third data structure that allows us to derive both
ResponseItems and EventMsgs.
2. Clearly divide responsibilities: the core system ensures the
integrity of the model state, while clients are responsible for
reconstructing the UI.
In this test, the ChatGPT token path is used, and the auth layer tries
to refresh the token if it thinks the token is “old.” Your helper writes
a fixed last_refresh timestamp that has now aged past the 28‑day
threshold, so the code attempts a real refresh against auth.openai.com,
never reaches the mock, and you end up with
received_requests().await.unwrap() being empty.
The gpt-oss models require reasoning with subsequent Chat Completions
requests because otherwise the model forgets why the tools were called.
This change fixes that and also adds some additional missing
documentation around how to handle context windows in Ollama and how to
show the CoT if you desire to.
We have two ways of loading conversation with a previous history. Fork
conversation and the experimental resume that we had before. In this PR,
I am unifying their code path. The path is getting the history items and
recording them in a brand new conversation. This PR also constraint the
rollout recorder responsibilities to be only recording to the disk and
loading from the disk.
The PR also fixes a current bug when we have two forking in a row:
History 1:
<Environment Context>
UserMessage_1
UserMessage_2
UserMessage_3
**Fork with n = 1 (only remove one element)**
History 2:
<Environment Context>
UserMessage_1
UserMessage_2
<Environment Context>
**Fork with n = 1 (only remove one element)**
History 2:
<Environment Context>
UserMessage_1
UserMessage_2
**<Environment Context>**
This shouldn't happen but because we were appending the `<Environment
Context>` after each spawning and it's considered as _user message_.
Now, we don't add this message if restoring and old conversation.
- added `uninlined_format_args` to `[workspace.lints.clippy]` in the
`Cargo.toml` for the workspace
- ran `cargo clippy --tests --fix`
- ran `just fmt`
this dramatically improves time to run `cargo test -p codex-core` (~25x
speedup).
before:
```
cargo test -p codex-core 35.96s user 68.63s system 19% cpu 8:49.80 total
```
after:
```
cargo test -p codex-core 5.51s user 8.16s system 63% cpu 21.407 total
```
both tests measured "hot", i.e. on a 2nd run with no filesystem changes,
to exclude compile times.
approach inspired by [Delete Cargo Integration
Tests](https://matklad.github.io/2021/02/27/delete-cargo-integration-tests.html),
we move all test cases in tests/ into a single suite in order to have a
single binary, as there is significant overhead for each test binary
executed, and because test execution is only parallelized with a single
binary.
We want to send an aggregated output of stderr and stdout so we don't
have to aggregate it stderr+stdout as we lose order sometimes.
---------
Co-authored-by: Gabriel Peal <gpeal@users.noreply.github.com>
## Summary
GPT-5 introduced the concept of [custom
tools](https://platform.openai.com/docs/guides/function-calling#custom-tools),
which allow the model to send a raw string result back, simplifying
json-escape issues. We are migrating gpt-5 to use this by default.
However, gpt-oss models do not support custom tools, only normal
functions. So we keep both tool definitions, and provide whichever one
the model family supports.
## Testing
- [x] Tested locally with various models
- [x] Unit tests pass
This PR adds a central `AuthManager` struct that manages the auth
information used across conversations and the MCP server. Prior to this,
each conversation and the MCP server got their own private snapshots of
the auth information, and changes to one (such as a logout or token
refresh) were not seen by others.
This is especially problematic when multiple instances of the CLI are
run. For example, consider the case where you start CLI 1 and log in to
ChatGPT account X and then start CLI 2 and log out and then log in to
ChatGPT account Y. The conversation in CLI 1 is still using account X,
but if you create a new conversation, it will suddenly (and
unexpectedly) switch to account Y.
With the `AuthManager`, auth information is read from disk at the time
the `ConversationManager` is constructed, and it is cached in memory.
All new conversations use this same auth information, as do any token
refreshes.
The `AuthManager` is also used by the MCP server's GetAuthStatus
command, which now returns the auth method currently used by the MCP
server.
This PR also includes an enhancement to the GetAuthStatus command. It
now accepts two new (optional) input parameters: `include_token` and
`refresh_token`. Callers can use this to request the in-use auth token
and can optionally request to refresh the token.
The PR also adds tests for the login and auth APIs that I recently added
to the MCP server.
## Summary
We've experienced a bit of drift in system prompting for `apply_patch`:
- As pointed out in #2030 , our prettier formatting started altering
prompt.md in a few ways
- We introduced a separate markdown file for apply_patch instructions in
#993, but currently duplicate them in the prompt.md file
- We added a first-class apply_patch tool in #2303, which has yet
another definition
This PR starts to consolidate our logic in a few ways:
- We now only use
`apply_patch_tool_instructions.md](https://github.com/openai/codex/compare/dh--apply-patch-tool-definition?expand=1#diff-d4fffee5f85cb1975d3f66143a379e6c329de40c83ed5bf03ffd3829df985bea)
for system instructions
- We no longer include apply_patch system instructions if the tool is
specified
I'm leaving the definition in openai_tools.rs as duplicated text for now
because we're going to be iterated on the first-class tool soon.
## Testing
- [x] Added integration tests to verify prompt stability
- [x] Tested locally with several different models (gpt-5, gpt-oss,
o4-mini)
## Summary
Before we land #2243, let's start printing environment_context in our
preferred format. This struct will evolve over time with new
information, xml gives us a balance of human readable without too much
parsing, llm readable, and extensible.
Also moves us over to an Option-based struct, so we can easily provide
diffs to the model.
## Testing
- [x] Updated tests to reflect new format
## What? Why? How?
- When running on Windows, codex often tries to invoke bash commands,
which commonly fail (unless WSL is installed)
- Fix: Detect if powershell is available and, if so, route commands to
it
- Also add a shell_name property to environmental context for codex to
default to powershell commands when running in that environment
## Testing
- Tested within WSL and powershell (e.g. get top 5 largest files within
a folder and validated that commands generated were powershell commands)
- Tested within Zsh
- Updated unit tests
---------
Co-authored-by: Eddy Escardo <eddy@openai.com>
Codex created this PR from the following prompt:
> upgrade this entire repo to Rust 1.89. Note that this requires
updating codex-rs/rust-toolchain.toml as well as the workflows in
.github/. Make sure that things are "clippy clean" as this change will
likely uncover new Clippy errors. `just fmt` and `cargo clippy --tests`
are sufficient to check for correctness
Note this modifies a lot of lines because it folds nested `if`
statements using `&&`.
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with [ReviewStack](https://reviewstack.dev/openai/codex/pull/2465).
* #2467
* __->__ #2465
Motivation: we have users who uses their API key although they want to
use ChatGPT account. We want to give them the chance to always login
with their account.
This PR displays login options when the user is not signed in with
ChatGPT. Even if you have set an OpenAI API key as an environment
variable, you will still be prompted to log in with ChatGPT.
We’ve also added a new flag, `always_use_api_key_signing` false by
default, which ensures you are never asked to log in with ChatGPT and
always defaults to using your API key.
https://github.com/user-attachments/assets/b61ebfa9-3c5e-4ab7-bf94-395c23a0e0af
After ChatGPT sign in:
https://github.com/user-attachments/assets/d58b366b-c46a-428f-a22f-2ac230f991c0
## Summary
We've been seeing a number of issues and reports with our synthetic
`apply_patch` tool, e.g. #802. Let's make this a real tool - in my
anecdotal testing, it's critical for GPT-OSS models, but I'd like to
make it the standard across GPT-5 and codex models as well.
## Testing
- [x] Tested locally
- [x] Integration test
This PR:
* Added the clippy.toml to configure allowable expect / unwrap usage in
tests
* Removed as many expect/allow lines as possible from tests
* moved a bunch of allows to expects where possible
Note: in integration tests, non `#[test]` helper functions are not
covered by this so we had to leave a few lingering `expect(expect_used`
checks around
The high-order bit on this PR is that it makes it so `sandbox.rs` tests
both Mac and Linux, as we introduce a general
`spawn_command_under_sandbox()` function with platform-specific
implementations for testing.
An important, and interesting, discovery in porting the test to Linux is
that (for reasons cited in the code comments), `/dev/shm` has to be
added to `writable_roots` on Linux in order for `multiprocessing.Lock`
to work there. Granting write access to `/dev/shm` comes with some
degree of risk, so we do not make this the default for Codex CLI.
Piggybacking on top of #2317, this moves the
`python_multiprocessing_lock_works` test yet again, moving
`codex-rs/core/tests/sandbox.rs` to `codex-rs/exec/tests/sandbox.rs`
because in `codex-rs/exec/tests` we can use `cargo_bin()` like so:
```
let codex_linux_sandbox_exe = assert_cmd::cargo::cargo_bin("codex-exec");
```
which is necessary so we can use `codex_linux_sandbox_exe` and therefore
`spawn_command_under_linux_sandbox` in an integration test.
This also moves `spawn_command_under_linux_sandbox()` out of `exec.rs`
and into `landlock.rs`, which makes things more consistent with
`seatbelt.rs` in `codex-core`.
For reference, https://github.com/openai/codex/pull/1808 is the PR that
made the change to Seatbelt to get this test to pass on Mac.
Previous to this PR, `codex-rs/core/tests/sandbox.rs` contained
integration tests that were specific to Seatbelt. This PR moves those
tests to `codex-rs/core/src/seatbelt.rs` and designates
`codex-rs/core/tests/sandbox.rs` to be used as the home for
cross-platform (well, Mac and Linux...) sandbox tests.
To start, this migrates
`python_multiprocessing_lock_works_under_seatbelt()` from #1823 to the
new `sandbox.rs` because this is the type of thing that should work on
both Mac _and_ Linux, though I still need to do some work to clean up
the test so it works on both platforms.
## Summary
Currently, we use request-time logic to determine the user_instructions
and environment_context messages. This means that neither of these
values can change over time as conversations go on. We want to add in
additional details here, so we're migrating these to save these messages
to the rollout file instead. This is simpler for the client, and allows
us to append additional environment_context messages to each turn if we
want
## Testing
- [x] Integration test coverage
- [x] Tested locally with a few turns, confirmed model could reference
environment context and cached token metrics were reasonably high
This PR does two things because after I got deep into the first one I
started pulling on the thread to the second:
- Makes `ConversationManager` the place where all in-memory
conversations are created and stored. Previously, `MessageProcessor` in
the `codex-mcp-server` crate was doing this via its `session_map`, but
this is something that should be done in `codex-core`.
- It unwinds the `ctrl_c: tokio::sync::Notify` that was threaded
throughout our code. I think this made sense at one time, but now that
we handle Ctrl-C within the TUI and have a proper `Op::Interrupt` event,
I don't think this was quite right, so I removed it. For `codex exec`
and `codex proto`, we now use `tokio::signal::ctrl_c()` directly, but we
no longer make `Notify` a field of `Codex` or `CodexConversation`.
Changes of note:
- Adds the files `conversation_manager.rs` and `codex_conversation.rs`
to `codex-core`.
- `Codex` and `CodexSpawnOk` are no longer exported from `codex-core`:
other crates must use `CodexConversation` instead (which is created via
`ConversationManager`).
- `core/src/codex_wrapper.rs` has been deleted in favor of
`ConversationManager`.
- `ConversationManager::new_conversation()` returns `NewConversation`,
which is in line with the `new_conversation` tool we want to add to the
MCP server. Note `NewConversation` includes `SessionConfiguredEvent`, so
we eliminate checks in cases like `codex-rs/core/tests/client.rs` to
verify `SessionConfiguredEvent` is the first event because that is now
internal to `ConversationManager`.
- Quite a bit of code was deleted from
`codex-rs/mcp-server/src/message_processor.rs` since it no longer has to
manage multiple conversations itself: it goes through
`ConversationManager` instead.
- `core/tests/live_agent.rs` has been deleted because I had to update a
bunch of tests and all the tests in here were ignored, and I don't think
anyone ever ran them, so this was just technical debt, at this point.
- Removed `notify_on_sigint()` from `util.rs` (and in a follow-up, I
hope to refactor the blandly-named `util.rs` into more descriptive
files).
- In general, I started replacing local variables named `codex` as
`conversation`, where appropriate, though admittedly I didn't do it
through all the integration tests because that would have added a lot of
noise to this PR.
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with [ReviewStack](https://reviewstack.dev/openai/codex/pull/2240).
* #2264
* #2263
* __->__ #2240
Wait for newlines, then render markdown on a line by line basis. Word wrap it for the current terminal size and then spit it out line by line into the UI. Also adds tests and fixes some UI regressions.
## Summary
Our current approach to prompt caching is fragile! The current approach
works, but we are planning to update to a more resilient system (storing
them in the rollout file). Let's start adding some integration tests to
ensure stability while we migrate it.
## Testing
- [x] These are the tests 😎
## Summary
Allow tui conversations to resume after the client fails out of retries.
I tested this with exec / mocked api failures as well, and it appears to
be fine. But happy to add an exec integration test as well!
## Testing
- [x] Added integration test
- [x] Tested locally
`CodexAuth::new()` was the first method listed in `CodexAuth`, but it is
only meant to be used by tests. Rename it to
`create_dummy_chatgpt_auth_for_testing()` and move it to the end of the
implementation.
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with [ReviewStack](https://reviewstack.dev/openai/codex/pull/1962).
* #1971
* #1970
* #1966
* #1965
* __->__ #1962
- `/status` renders
```
signed in with chatgpt
login: example@example.com
plan: plus
```
- Setup for using this info in a few more places.
---------
Co-authored-by: Michael Bolin <mbolin@openai.com>
Replaces the `include_default_writable_roots` option on
`sandbox_workspace_write` (that defaulted to `true`, which was slightly
weird/annoying) with `exclude_tmpdir_env_var`, which defaults to
`false`.
Though perhaps more importantly `/tmp` is now enabled by default as part
of `sandbox_mode = "workspace-write"`, though `exclude_slash_tmp =
false` can be used to disable this.
## Summary
- Prioritize provider-specific API keys over default Codex auth when
building requests
- Add test to ensure provider env var auth overrides default auth
## Testing
- `just fmt`
- `just fix` *(fails: `let` expressions in this position are unstable)*
- `cargo test --all-features` *(fails: `let` expressions in this
position are unstable)*
------
https://chatgpt.com/codex/tasks/task_i_68926a104f7483208f2c8fd36763e0e3
## Summary
Includes a new user message in the api payload which provides useful
environment context for the model, so it knows about things like the
current working directory and the sandbox.
## Testing
Updated unit tests
## Summary
A split-up PR of #1763 , stacked on top of a tools refactor #1858 to
make the change clearer. From the previous summary:
> Let's try something new: tell the model about the sandbox, and let it
decide when it will need to break the sandbox. Some local testing
suggests that it works pretty well with zero iteration on the prompt!
## Testing
- [x] Added unit tests
- [x] Tested locally and it appears to work smoothly!
## Summary
Our recent change in #1737 can sometimes lead to the model confusing
AGENTS.md context as part of the message. But a little prompting and
formatting can help fix this!
## Testing
- Ran locally with a few different prompts to verify the model
behaves well.
- Updated unit tests
## Summary
Users frequently complain about re-approving commands that have failed
for non-sandbox reasons. We can't diagnose with complete accuracy which
errors happened because of a sandbox failure, but we can start to
eliminate some common simple cases.
This PR captures the most common case I've seen, which is a `command not
found` error.
## Testing
- [x] Added unit tests
- [x] Ran a few cases locally
To make `--full-auto` safer, this PR updates the Seatbelt policy so that
a `SandboxPolicy` with a `writable_root` that contains a `.git/`
_directory_ will make `.git/` _read-only_ (though as a follow-up, we
should also consider the case where `.git` is a _file_ with a `gitdir:
/path/to/actual/repo/.git` entry that should also be protected).
The two major changes in this PR:
- Updating `SandboxPolicy::get_writable_roots_with_cwd()` to return a
`Vec<WritableRoot>` instead of a `Vec<PathBuf>` where a `WritableRoot`
can specify a list of read-only subpaths.
- Updating `create_seatbelt_command_args()` to honor the read-only
subpaths in `WritableRoot`.
The logic to update the policy is a fairly straightforward update to
`create_seatbelt_command_args()`, but perhaps the more interesting part
of this PR is the introduction of an integration test in
`tests/sandbox.rs`. Leveraging the new API in #1785, we test
`SandboxPolicy` under various conditions, including ones where `$TMPDIR`
is not readable, which is critical for verifying the new behavior.
To ensure that Codex can run its own tests, e.g.:
```
just codex debug seatbelt --full-auto -- cargo test if_git_repo_is_writable_root_then_dot_git_folder_is_read_only
```
I had to introduce the use of `CODEX_SANDBOX=sandbox`, which is
comparable to how `CODEX_SANDBOX_NETWORK_DISABLED=1` was already being
used.
Adding a comparable change for Landlock will be done in a subsequent PR.
## Summary
- stream command stdout as `ExecCommandStdout` events
- forward streamed stdout to clients and ignore in human output
processor
- adjust call sites for new streaming API
- Add operation to summarize the context so far.
- The operation runs a compact task that summarizes the context.
- The operation clear the previous context to free the context window
- The operation didn't use `run_task` to avoid corrupting the session
- Add /compact in the tui
https://github.com/user-attachments/assets/e06c24e5-dcfb-4806-934a-564d425a919c