valknar/llmx - llmx - dev.pivoine.art

Author	SHA1	Message	Date
Michael Bolin	74d2741729	chore: require uninlined_format_args from clippy (#2845 ) - added `uninlined_format_args` to `[workspace.lints.clippy]` in the `Cargo.toml` for the workspace - ran `cargo clippy --tests --fix` - ran `just fmt`	2025-08-28 11:25:23 -07:00
Dylan	7f7d1e30f3	[exec] Clean up apply-patch tests (#2648 ) ## Summary These tests were getting a bit unwieldy, and they're starting to become load-bearing. Let's clean them up, and get them working solidly so we can easily expand this harness with new tests. ## Test Plan - [x] Tests continue to pass	2025-08-25 15:08:01 -07:00
Jeremy Rose	32bbbbad61	test: faster test execution in codex-core (#2633 ) this dramatically improves time to run `cargo test -p codex-core` (~25x speedup). before: ``` cargo test -p codex-core 35.96s user 68.63s system 19% cpu 8:49.80 total ``` after: ``` cargo test -p codex-core 5.51s user 8.16s system 63% cpu 21.407 total ``` both tests measured "hot", i.e. on a 2nd run with no filesystem changes, to exclude compile times. approach inspired by [Delete Cargo Integration Tests](https://matklad.github.io/2021/02/27/delete-cargo-integration-tests.html), we move all test cases in tests/ into a single suite in order to have a single binary, as there is significant overhead for each test binary executed, and because test execution is only parallelized with a single binary.	2025-08-24 11:10:53 -07:00
Dylan	236c4f76a6	[apply_patch] freeform apply_patch tool (#2576 ) ## Summary GPT-5 introduced the concept of [custom tools](https://platform.openai.com/docs/guides/function-calling#custom-tools), which allow the model to send a raw string result back, simplifying json-escape issues. We are migrating gpt-5 to use this by default. However, gpt-oss models do not support custom tools, only normal functions. So we keep both tool definitions, and provide whichever one the model family supports. ## Testing - [x] Tested locally with various models - [x] Unit tests pass	2025-08-22 13:42:34 -07:00
Dylan	6df8e35314	[tools] Add apply_patch tool (#2303 ) ## Summary We've been seeing a number of issues and reports with our synthetic `apply_patch` tool, e.g. #802. Let's make this a real tool - in my anecdotal testing, it's critical for GPT-OSS models, but I'd like to make it the standard across GPT-5 and codex models as well. ## Testing - [x] Tested locally - [x] Integration test	2025-08-15 11:55:53 -04:00
Parker Thompson	a075424437	Added `allow-expect-in-tests` / `allow-unwrap-in-tests` (#2328 ) This PR: * Added the clippy.toml to configure allowable expect / unwrap usage in tests * Removed as many expect/allow lines as possible from tests * moved a bunch of allows to expects where possible Note: in integration tests, non `#[test]` helper functions are not covered by this so we had to leave a few lingering `expect(expect_used` checks around	2025-08-14 17:59:01 -07:00
Parker Thompson	c26d42ab69	Fix AF_UNIX, sockpair, recvfrom in linux sandbox (#2309 ) When using codex-tui on a linux system I was unable to run `cargo clippy` inside of codex due to: ``` [pid 3548377] socketpair(AF_UNIX, SOCK_SEQPACKET\|SOCK_CLOEXEC, 0, <unfinished ...> [pid 3548370] close(8 <unfinished ...> [pid 3548377] <... socketpair resumed>0x7ffb97f4ed60) = -1 EPERM (Operation not permitted) ``` And ``` 3611300 <... recvfrom resumed>0x708b8b5cffe0, 8, 0, NULL, NULL) = -1 EPERM (Operation not permitted) ``` This PR: * Fixes a bug that disallowed AF_UNIX to allow it on `socket()` * Adds recvfrom() to the syscall allow list, this should be fine since we disable opening new sockets. But we should validate there is not a open socket inheritance issue. * Allow socketpair to be called for AF_UNIX * Adds tests for AF_UNIX components * All of which allows running `cargo clippy` within the sandbox on linux, and possibly other tooling using a fork server model + AF_UNIX comms.	2025-08-14 17:12:41 -07:00
Michael Bolin	2ecca79663	fix: run python_multiprocessing_lock_works integration test on Mac and Linux (#2318 ) The high-order bit on this PR is that it makes it so `sandbox.rs` tests both Mac and Linux, as we introduce a general `spawn_command_under_sandbox()` function with platform-specific implementations for testing. An important, and interesting, discovery in porting the test to Linux is that (for reasons cited in the code comments), `/dev/shm` has to be added to `writable_roots` on Linux in order for `multiprocessing.Lock` to work there. Granting write access to `/dev/shm` comes with some degree of risk, so we do not make this the default for Codex CLI. Piggybacking on top of #2317, this moves the `python_multiprocessing_lock_works` test yet again, moving `codex-rs/core/tests/sandbox.rs` to `codex-rs/exec/tests/sandbox.rs` because in `codex-rs/exec/tests` we can use `cargo_bin()` like so: ``` let codex_linux_sandbox_exe = assert_cmd::cargo::cargo_bin("codex-exec"); ``` which is necessary so we can use `codex_linux_sandbox_exe` and therefore `spawn_command_under_linux_sandbox` in an integration test. This also moves `spawn_command_under_linux_sandbox()` out of `exec.rs` and into `landlock.rs`, which makes things more consistent with `seatbelt.rs` in `codex-core`. For reference, https://github.com/openai/codex/pull/1808 is the PR that made the change to Seatbelt to get this test to pass on Mac.	2025-08-14 15:47:48 -07:00
Michael Bolin	221ebfcccc	fix: run apply_patch calls through the sandbox (#1705 ) Building on the work of https://github.com/openai/codex/pull/1702, this changes how a shell call to `apply_patch` is handled. Previously, a shell call to `apply_patch` was always handled in-process, never leveraging a sandbox. To determine whether the `apply_patch` operation could be auto-approved, the `is_write_patch_constrained_to_writable_paths()` function would check if all the paths listed in the paths were writable. If so, the agent would apply the changes listed in the patch. Unfortunately, this approach afforded a loophole: symlinks! * For a soft link, we could fix this issue by tracing the link and checking whether the target is in the set of writable paths, however... * ...For a hard link, things are not as simple. We can run `stat FILE` to see if the number of links is greater than 1, but then we would have to do something potentially expensive like `find . -inum <inode_number>` to find the other paths for `FILE`. Further, even if this worked, this approach runs the risk of a [TOCTOU](https://en.wikipedia.org/wiki/Time-of-check_to_time-of-use) race condition, so it is not robust. The solution, implemented in this PR, is to take the virtual execution of the `apply_patch` CLI into an _actual_ execution using `codex --codex-run-as-apply-patch PATCH`, which we can run under the sandbox the user specified, just like any other `shell` call. This, of course, assumes that the sandbox prevents writing through symlinks as a mechanism to write to folders that are not in the writable set configured by the sandbox. I verified this by testing the following on both Mac and Linux: ```shell #!/usr/bin/env bash set -euo pipefail # Can running a command in SANDBOX_DIR write a file in EXPLOIT_DIR? # Codex is run in SANDBOX_DIR, so writes should be constrianed to this directory. SANDBOX_DIR=$(mktemp -d -p "$HOME" sandboxtesttemp.XXXXXX) # EXPLOIT_DIR is outside of SANDBOX_DIR, so let's see if we can write to it. EXPLOIT_DIR=$(mktemp -d -p "$HOME" sandboxtesttemp.XXXXXX) echo "SANDBOX_DIR: $SANDBOX_DIR" echo "EXPLOIT_DIR: $EXPLOIT_DIR" cleanup() { # Only remove if it looks sane and still exists [[ -n "${SANDBOX_DIR:-}" && -d "$SANDBOX_DIR" ]] && rm -rf -- "$SANDBOX_DIR" [[ -n "${EXPLOIT_DIR:-}" && -d "$EXPLOIT_DIR" ]] && rm -rf -- "$EXPLOIT_DIR" } trap cleanup EXIT echo "I am the original content" > "${EXPLOIT_DIR}/original.txt" # Drop the -s to test hard links. ln -s "${EXPLOIT_DIR}/original.txt" "${SANDBOX_DIR}/link-to-original.txt" cat "${SANDBOX_DIR}/link-to-original.txt" if [[ "$(uname)" == "Linux" ]]; then SANDBOX_SUBCOMMAND=landlock else SANDBOX_SUBCOMMAND=seatbelt fi # Attempt the exploit cd "${SANDBOX_DIR}" codex debug "${SANDBOX_SUBCOMMAND}" bash -lc "echo pwned > ./link-to-original.txt" \|\| true cat "${EXPLOIT_DIR}/original.txt" ``` Admittedly, this change merits a proper integration test, but I think I will have to do that in a follow-up PR.	2025-07-30 16:45:08 -07:00
Michael Bolin	d76f96ce79	fix: support special --codex-run-as-apply-patch arg (#1702 ) This introduces some special behavior to the CLIs that are using the `codex-arg0` crate where if `arg1` is `--codex-run-as-apply-patch`, then it will run as if `apply_patch arg2` were invoked. This is important because it means we can do things like: ``` SANDBOX_TYPE=landlock # or seatbelt for macOS codex debug "${SANDBOX_TYPE}" -- codex --codex-run-as-apply-patch PATCH ``` which gives us a way to run `apply_patch` while ensuring it adheres to the sandbox the user specified. While it would be nice to use the `arg0` trick like we are currently doing for `codex-linux-sandbox`, there is no way to specify the `arg0` for the underlying command when running under `/usr/bin/sandbox-exec`, so it will not work for us in this case. Admittedly, we could have also supported this via a custom environment variable (e.g., `CODEX_ARG0`), but since environment variables are inherited by child processes, that seemed like a potentially leakier abstraction. This change, as well as our existing reliance on checking `arg0`, place additional requirements on those who include `codex-core`. Its `README.md` has been updated to reflect this. While we could have just added an `apply-patch` subcommand to the `codex` multitool CLI, that would not be sufficient for the standalone `codex-exec` CLI, which is something that we distribute as part of our GitHub releases for those who know they will not be using the TUI and therefore prefer to use a slightly smaller executable: https://github.com/openai/codex/releases/tag/rust-v0.10.0 To that end, this PR adds an integration test to ensure that the `--codex-run-as-apply-patch` option works with the standalone `codex-exec` CLI. --- [//]: # (BEGIN SAPLING FOOTER) Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/openai/codex/pull/1702). * #1705 * #1703 * __->__ #1702 * #1698 * #1697	2025-07-28 09:26:44 -07:00

10 Commits